Summary
A server-driven runner/shim re-install loop with no backoff, no attempt cap, and no content check. If an instance's reported version never equals the server's expected version, the server re-issues an install on every instance check (~7–14s) and the shim re-downloads the full ~37 MB binaries every time, indefinitely → large, sustained egress from the public dstack-runner-downloads bucket (paid by the bucket owner). Jobs keep running, so it's silent.
Status: reproduced live on AWS (below). PR #3954 bounds the egress; the loop itself is not fixed yet — see Why the loop isn't fully fixed yet.
Steps to reproduce
- Run two dstack servers on different versions (e.g. 0.20.18 and 0.20.23) against the same database / backend.
- Launch any instance (a plain fleet node is enough; no job is required).
- → the instance re-downloads dstack-runner + dstack-shim from dstack-runner-downloads continuously: the full binaries every ~30s, forever.
Two servers is just the simplest trigger. The general condition is any persistent mismatch between the instance's reported version and the server's expected version — e.g. a flapping DSTACK_*_VERSION_URL, or a download that succeeds but then fails chmod/rename so the new binary never lands.
Reproduced live
Two servers (0.20.18 + 0.20.23) on one Postgres + AWS backend, one eu-west-1 instance. The servers interleave, each forcing its version; the instance ping-pongs and re-downloads each cycle. Server log (~30s apart):
Instance <name>: installing runner 0.20.23 -> 0.20.18 from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.20.18/binaries/dstack-runner-linux-amd64
Instance <name>: installing runner 0.20.18 -> 0.20.23 from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.20.23/binaries/dstack-runner-linux-amd64
(The original incident was expensive because it was cross-region — eu-west-1 bucket, readers in us-east-2/us-west-2; the loop itself is region-independent.)
What's happening
Per instance, every ~7–14s (instances/__init__.py — 7s for transitional states, 14s for IDLE/BUSY) the server:
- reads installed = dstack-runner --version / dstack-shim --version (via the shim's GET /api/components);
- computes expected = get_dstack_runner_version() / get_dstack_shim_version() (compute.py:801);
- if the two strings are not equal (check.py:495 runner / :542 shim) → POST /api/components/install (client.py:438) → shim Install(url, force=true) (handlers.go:217, force hardcoded) → downloadFile re-downloads the full binary (runner/internal/shim/components/utils.go: no backoff/cap; idempotent on file existence only, which force=true bypasses).
Two servers pinned to different versions each force their version, so the instance never converges and re-downloads every cycle.
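The ping-pong can be illustrated with a small simulation. This is a sketch of the behaviour described above, not the server's actual code: `Instance`, `check` and `simulate` are hypothetical names, and the round-robin interleaving of servers is an assumption made to keep the model simple.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Stands in for the shim: reports a version and downloads on install."""
    installed: str
    downloads: int = 0

    def install(self, version: str) -> None:
        # Models force=true: every install is a full download.
        self.downloads += 1
        self.installed = version


def check(instance: Instance, expected: str) -> bool:
    """One server-side instance check; True if an install was issued."""
    if instance.installed != expected:  # plain string inequality
        instance.install(expected)
        return True
    return False


def simulate(server_versions: List[str], cycles: int, installed: str) -> Instance:
    instance = Instance(installed=installed)
    for i in range(cycles):
        # Assumption: servers take turns handling consecutive checks.
        check(instance, server_versions[i % len(server_versions)])
    return instance
```

With one server the instance converges after a single install; with two servers on different versions the download count grows with the number of checks and never levels off.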
Impact
- Unbounded full re-downloads → large sustained egress from the public bucket; high request volume; the instance never converges; silent (the install trigger is only logged at DEBUG).
- Footgun: the shim-bootstrap curl has --retry 1 but no --fail (compute.py ~L929), so the body of a transient non-200 response (e.g. a 403) is written to dstack-shim and then executed → /usr/local/bin/dstack-shim: 2: Syntax error: newline unexpected → crash loop.
Fix — two layers
Layer 1 — bound the egress (shim side, PR #3954, done). The shim caches each artifact per source URL and revalidates with a conditional GET (If-None-Match → 304), so a forced re-install of an unchanged version transfers zero bytes. Plus --fail on the bootstrap curl.
Validated end-to-end on AWS (two servers on one DB, one instance, custom-versioned builds, S3 request metrics):
- Both shims carry the fix: 22 conditional GETs over 25 min of continuous flip-flop → 1 real transfer (~17 MB at warmup), the rest 304. Steady-state egress ≈ 0 (unfixed: ~430 MB and growing).
- Mixed (one unfixed shim + one fixed): ~454 MB in 11 min and climbing — not mitigated (see below).
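The Layer 1 idea (cache per source URL, revalidate with If-None-Match, reject non-200 bodies) can be sketched as follows. This is an illustration in Python, not the shim's Go implementation in PR #3954; `ArtifactCache` and the injected `fetch` callable are hypothetical, and the transport is left abstract so the caching logic stands on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: fetch(url, etag) -> (status, etag, body).
# A real one would send the etag as an If-None-Match header; body is b"" on 304.
Fetch = Callable[[str, Optional[str]], Tuple[int, Optional[str], bytes]]


@dataclass
class CachedArtifact:
    etag: Optional[str]
    body: bytes


class ArtifactCache:
    """Per-URL cache that revalidates instead of blindly re-downloading."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._entries: Dict[str, CachedArtifact] = {}
        self.bytes_transferred = 0

    def get(self, url: str) -> bytes:
        cached = self._entries.get(url)
        # Offer the stored ETag only when a cached copy exists.
        status, etag, body = self._fetch(url, cached.etag if cached else None)
        if status == 304 and cached is not None:
            return cached.body  # unchanged upstream: no payload transferred
        if status != 200:
            # Same intent as curl --fail: never treat an error page as a binary.
            raise RuntimeError(f"download failed: HTTP {status} for {url}")
        self.bytes_transferred += len(body)
        self._entries[url] = CachedArtifact(etag=etag, body=body)
        return body
```

Under this model a forced re-install of an unchanged URL costs one conditional request and no payload, which matches the "1 real transfer, the rest 304" measurement above. It does nothing to stop the install requests themselves.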
Layer 2 — stop the loop (server side, not done yet). Layer 1 stops the bytes, not the loop: with two servers disagreeing the version never converges, so the server keeps issuing installs and the shim restarts every ~30s forever (zero egress, but churn/instability).
Why the loop isn't fully fixed yet
Stopping the loop means the server must decide which version wins when two servers disagree — and that decision isn't currently well-defined:
- A natural rule is no-downgrade (install only if expected is strictly newer than installed) → converges to the highest version and halts. It's correct for the incident (two real releases) and for production (all semver), and since server downgrade is unsupported (forward-only migrations) it breaks no supported flow.
- But dstack's version value space has no reliable total order. get_*_version() can return real semver (0.20.23), a CI build number (str(run_number + 150), e.g. 27305, used with DSTACK_USE_LATEST_FROM_BRANCH), or None/dev. Under PEP 440, Version("27305") > Version("0.20.23") is true: a build-number build outranks every 0.x.y release, which would block upgrading a branch-build instance to a real release. So "newer" is ambiguous across schemes, and we can't safely pick a winner yet.
- It also doesn't cover the finalization-failure trigger (download OK but chmod/rename fails → installed never advances → "expected is newer" stays true forever); that needs an attempt cap + backoff regardless.
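One way to keep a no-downgrade rule from mis-ranking build numbers is to apply it only when both sides are plain X.Y.Z releases and to report every other pair as undecidable. The sketch below assumes that shape; `parse_release` and `should_install` are hypothetical helpers, and what the caller does with an undecidable pair is deliberately left open, since that is the unsettled part.

```python
import re
from typing import Optional, Tuple

_RELEASE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_release(version: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) for plain X.Y.Z strings, else None."""
    if version is None:
        return None
    match = _RELEASE.fullmatch(version)
    return tuple(int(part) for part in match.groups()) if match else None


def should_install(installed: Optional[str], expected: Optional[str]) -> Optional[bool]:
    """True/False when both versions are X.Y.Z; None when not comparable."""
    a, b = parse_release(installed), parse_release(expected)
    if a is None or b is None:
        return None  # build number, dev, or missing: the caller must decide
    return b > a  # install only if expected is strictly newer
```

For the incident pair, `should_install("0.20.23", "0.20.18")` is `False`, so the older server stops forcing a downgrade, while `should_install("27305", "0.20.23")` is `None` rather than a silent "build number wins".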
So the complete server-side fix = no-downgrade restricted to comparable (semver) pairs + an attempt cap/backoff + a "multiple server versions" guard/warning. We're deferring it until we settle a reliable version ordering (or a single source of truth for the expected version) rather than ship a rule that mis-ranks build numbers.
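The attempt cap/backoff part could look roughly like this. It is a sketch only: `InstallThrottle`, its field names and its default values are all hypothetical, and the state is assumed to be kept per instance and component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstallThrottle:
    """Tracks install attempts for one instance component and target version."""
    max_attempts: int = 5       # hypothetical cap
    base_delay: float = 30.0    # hypothetical first retry delay, seconds
    max_delay: float = 3600.0   # hypothetical ceiling, seconds
    attempts: int = 0
    next_allowed_at: float = 0.0
    target: Optional[str] = None

    def allow(self, target: str, now: float) -> bool:
        """Return True if an install of `target` may be issued at time `now`."""
        if target != self.target:
            # New target version: start counting from scratch.
            self.target, self.attempts, self.next_allowed_at = target, 0, 0.0
        if self.attempts >= self.max_attempts or now < self.next_allowed_at:
            return False
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
        delay = min(self.base_delay * 2 ** self.attempts, self.max_delay)
        self.attempts += 1
        self.next_allowed_at = now + delay
        return True
```

Because the counter resets whenever the target changes, this bounds the finalization-failure trigger (same target, installed never advances) but not the two-server flip-flop, where the target alternates on every check. That case still needs the ordering rule or the "multiple server versions" guard.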
Until then: in a mixed-version fleet where the in-control shim predates PR #3954, the binary still re-downloads (the cache fix lives in the shim binary), and the loop/churn persists even where egress is 304-bounded.