Skip to content

shim: avoid re-downloading runner/shim binaries on forced re-install#3954

Open
peterschmidt85 wants to merge 2 commits into
masterfrom
fix/shim-redownload-loop
Open

shim: avoid re-downloading runner/shim binaries on forced re-install#3954
peterschmidt85 wants to merge 2 commits into
masterfrom
fix/shim-redownload-loop

Conversation

@peterschmidt85

@peterschmidt85 peterschmidt85 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Bounds the wasted egress from the runner/shim re-install loop (#3953): the shim re-downloaded the full binaries on every server-issued install (force=true), and a transient HTTP error during bootstrap could brick the instance. Two changes, both in the download path.

downloadFile — cache + revalidate instead of always re-downloading

runner/internal/shim/components/utils.go. Installing url into path now:

  1. force=false and path exists → return (unchanged).
  2. Resolve a per-URL cache file next to the binary: <path>.<sha256(url)[:16]>.cache (+ a .etag sidecar). Keying by URL means different versions get separate caches instead of overwriting one slot.
  3. ensureCached GETs url with If-None-Match: <stored etag>:
    • 304 Not Modified → cache is current, no body transferred;
    • 200 → stream into the cache (temp file + atomic rename) and store the response ETag.
  4. Install the cached bytes into path: copy → chmod → atomic rename. Skipped if path already matches the cache (same size). If chmod/rename fails, the next call retries from the cache — no re-download.
  5. pruneCaches keeps the newest maxCachedVersions (5) and deletes older ones, so disk stays bounded.

Net: a repeated/forced install of an unchanged version is a 304 with no transfer; two servers alternating between versions download each version once, then 304.

Bootstrap curl — fail instead of saving an error page

src/dstack/_internal/core/backends/base/compute.py. The shim-bootstrap curl gains -f and chains mv, so a transient 403/5xx is no longer written to dstack-shim and then executed (/usr/local/bin/dstack-shim: 2: Syntax error: newline unexpected); it fails cleanly.

Tests

runner/internal/shim/components/utils_test.go (pass under -race):

  • unchanged artifact under force → body served once;
  • changed ETag → re-downloaded and updated;
  • force=false + existing file → no request;
  • two alternating versions → each downloaded exactly once (guards the per-URL-cache regression).

Validated end-to-end (AWS)

Two servers on one DB + backend, one real eu-west-1 instance, custom-versioned builds on the staging bucket, measured via S3 request metrics:

  • Both shims carry this fix (0.20.30 ↔ 0.20.31): 22 conditional GETs over 25 min of continuous flip-flop → 1 real transfer (~17 MB warmup), the rest 304. Steady-state egress ≈ 0 (unfixed: ~430 MB and growing).
  • Mixed (unfixed 0.20.29 shim ↔ fixed 0.20.31): ~454 MB in 11 min, climbing — see scope below.

Scope — what this does NOT fix (tracked in #3953)

This PR bounds the egress, not the loop:

  • With two servers expecting different versions the version never converges, so the server keeps issuing installs and the shim still restarts every ~30s — now at zero bytes, but the churn remains.
  • Because the fix lives in the shim binary, a mixed fleet whose in-control shim predates this PR still re-downloads — the older binary has no cache (the ~454 MB case above).

Fully stopping the loop needs a server-side guard (no-downgrade on comparable versions + an attempt cap/backoff). That's deferred: dstack's version space (real semver vs CI build-numbers like 27305 vs dev/None) has no reliable ordering yet, so the server can't safely decide which version wins. See #3953.

🤖 Generated with Claude Code

The shim re-downloaded the full runner/shim binary on every server-issued
install (which is always force=true), with no content check. When an
instance's installed version never converges to the version the server
expects, the server re-issues the install every check, so the same binaries
were re-downloaded indefinitely -- unbounded egress from the downloads bucket.

Cache the downloaded bytes next to the target and revalidate them with the
object's ETag via a conditional request: an unchanged artifact returns 304 and
no body is transferred, even under force. Install from the cache so a
finalization (chmod/rename) failure is retried without re-downloading, and
propagate those errors instead of masking them. A genuinely changed artifact
(different ETag) is still downloaded exactly once; without force an existing
file is still left untouched.

This bounds the loop to at most one download per distinct artifact regardless
of why versions fail to converge.

Refs #3953

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The shim download cache used a single slot per binary, so when two server
versions reconcile the same instance (the confirmed real-world trigger), each
version-swap overwrote the slot and re-downloaded the full binary every cycle.

- downloadFile now keys the cache + ETag by URL, so alternating versions each
  persist and are revalidated independently -> each version is downloaded once,
  then 304s -> egress ~0 even while the binary keeps swapping. Bounded by
  maxCachedVersions (oldest pruned).
- bootstrap curl: add -f and chain `mv`, so a transient 403/5xx is never saved as
  the shim binary (which otherwise runs as a shell script -> "Syntax error:
  newline unexpected" crash loop) -- fail cleanly instead.
- add a flip-flop test (each of two alternating versions downloaded exactly once).

Refs #3953

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@peterschmidt85 peterschmidt85 force-pushed the fix/shim-redownload-loop branch from 5bfbf12 to 59f28e8 Compare June 10, 2026 20:59
@peterschmidt85 peterschmidt85 marked this pull request as ready for review June 10, 2026 22:40
@peterschmidt85 peterschmidt85 requested a review from un-def June 10, 2026 22:40

// Install the cached bytes; skip if path already matches the cache.
if !downloaded {
installed, err := sameSize(path, cachePath)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same-size adjacent Go builds are unlikely but not impossible.

Suggested fix: after a successful rename in installFile, write the cache's urlKey to a <path>.installed marker and replace sameSize with a marker comparison.


// downloadFile ensures path holds the artifact at url.
//
// Bytes are cached next to path (one cache per URL, validated by ETag), so a repeated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite like putting these helper-files in /usr/local/bin. How about using a shimHomeDir subdir?

case "/0.20.23/runner":
body, etag = "RUNNER-0.20.23", `"e23"`
case "/0.20.18/runner":
body, etag = "RUNNER-0.20.18-x", `"e18"` // intentionally different length

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See tests only work with different lengths!

# 403/5xx is never saved as the shim binary; chain `mv` so a failed download
# is never installed (otherwise it would run as a script -> "Syntax error").
f'sudo curl -fsS --compressed --connect-timeout 60 --max-time 240 --retry 1 --output "$dlpath" "{url}"'
f' && sudo mv "$dlpath" {dstack_shim_binary_path}',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if && is needed since commands are chained with &&. Adding -f should be sufficient

@r4victor r4victor left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice overall

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants