Stuck runner/shim install loops re-download the full binaries every check (no backoff, no cap, no content check) #3953

@peterschmidt85

Summary

The server drives a runner/shim re-install loop that has no backoff, no attempt cap, and no content check. If an instance's reported version never equals the server's expected version, the server re-issues an install on every instance check (~7–14s) and the shim re-downloads the full ~37 MB binaries every time, indefinitely → large, sustained egress from the public dstack-runner-downloads bucket (paid by the bucket owner). Jobs keep running, so the problem is silent.

Status: reproduced live on AWS (below). PR #3954 bounds the egress; the loop itself is not fixed yet — see Why the loop isn't fully fixed yet.

Steps to reproduce

  1. Run two dstack servers on different versions (e.g. 0.20.18 and 0.20.23) against the same database / backend.
  2. Launch any instance (a plain fleet node is enough — no job required).
  3. → the instance re-downloads dstack-runner + dstack-shim from dstack-runner-downloads continuously — the full binaries every ~30s, forever.

Two servers is just the simplest trigger. The general condition is any persistent mismatch between the instance's reported version and the server's expected version — e.g. a flapping DSTACK_*_VERSION_URL, or a download that succeeds but then fails chmod/rename so the new binary never lands.

Reproduced live

Two servers (0.20.18 + 0.20.23) on one Postgres + AWS backend, one eu-west-1 instance. The servers interleave, each forcing its version; the instance ping-pongs and re-downloads each cycle. Server log (~30s apart):

Instance <name>: installing runner 0.20.23 -> 0.20.18 from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.20.18/binaries/dstack-runner-linux-amd64
Instance <name>: installing runner 0.20.18 -> 0.20.23 from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.20.23/binaries/dstack-runner-linux-amd64

(The original incident was expensive because it was cross-region: eu-west-1 bucket, readers in us-east-2/us-west-2. The loop itself is region-independent.)

What's happening

Per instance, every ~7–14s (instances/__init__.py — 7s for transitional states, 14s for IDLE/BUSY) the server:

  1. reads installed = dstack-runner --version / dstack-shim --version (via the shim's GET /api/components);
  2. computes expected = get_dstack_runner_version() / get_dstack_shim_version() (compute.py:801);
  3. if they differ by string equality (check.py:495 runner / :542 shim) → POST /api/components/install (client.py:438) → shim Install(url, force=true) (handlers.go:217, force hardcoded) → downloadFile re-downloads the full binary (runner/internal/shim/components/utils.go — no backoff/cap; idempotent on file existence only, which force=true bypasses).

Two servers pinned to different versions each force their version, so the instance never converges and re-downloads every cycle.
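
The ping-pong can be reproduced in a few lines. This is a toy model only, not dstack code: `Instance`, `check`, and `simulate` are hypothetical names, and each forced install is counted as one full download, mirroring the `force=true` behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    installed: str
    downloads: int = 0

    def install(self, version: str) -> None:
        # Models a forced install: every call counts as a full re-download.
        self.downloads += 1
        self.installed = version


def check(instance: Instance, expected: str) -> bool:
    """One server-side instance check; returns True if an install was issued."""
    if instance.installed != expected:  # plain string equality, as in the issue
        instance.install(expected)
        return True
    return False


def simulate(server_versions: list[str], checks: int, start: str) -> Instance:
    instance = Instance(installed=start)
    for i in range(checks):
        # Servers take turns checking the same instance.
        check(instance, server_versions[i % len(server_versions)])
    return instance


single = simulate(["0.20.23"], checks=10, start="0.20.18")
mixed = simulate(["0.20.18", "0.20.23"], checks=10, start="0.20.18")
print(single.downloads, mixed.downloads)  # 1 9
```

With one server the instance converges after a single install. With two alternating servers every check after the first triggers another download, and the count grows linearly with the number of checks.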

Impact

  • Unbounded full re-downloads → large sustained egress from the public bucket; high request volume; the instance never converges; silent (the install trigger is only logged at DEBUG).
  • Footgun: the shim-bootstrap curl has --retry 1 but no --fail (compute.py ~L929), so a transient non-200 (e.g. a 403) is written to dstack-shim and then executed → /usr/local/bin/dstack-shim: 2: Syntax error: newline unexpected → crash loop.

Fix — two layers

Layer 1: bound the egress (shim side, PR #3954, done). The shim caches each artifact per source URL and revalidates with a conditional GET (If-None-Match → 304), so a forced re-install of an unchanged version transfers zero bytes. Plus --fail on the bootstrap curl.
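
A minimal sketch of the per-URL cache with conditional revalidation, written in Python for illustration (the real fix lives in the Go shim, and this is not its code). `ArtifactCache`, the `Fetch` callable, and `fake_bucket` are all hypothetical; the fake transport stands in for S3 so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# (status, etag, body): a stand-in for an HTTP response.
Response = tuple[int, str, bytes]
# fetch(url, if_none_match) -> Response; hypothetical transport.
Fetch = Callable[[str, Optional[str]], Response]


@dataclass
class ArtifactCache:
    fetch: Fetch
    entries: dict[str, tuple[str, bytes]] = field(default_factory=dict)
    bytes_transferred: int = 0

    def get(self, url: str) -> bytes:
        cached = self.entries.get(url)
        etag = cached[0] if cached else None
        status, new_etag, body = self.fetch(url, etag)
        if status == 304 and cached:
            return cached[1]  # unchanged: reuse cached bytes, nothing transferred
        if status != 200:
            # Never store an error body as if it were the binary.
            raise RuntimeError(f"unexpected status {status} for {url}")
        self.bytes_transferred += len(body)
        self.entries[url] = (new_etag, body)
        return body


def fake_bucket(url: str, if_none_match: Optional[str]) -> Response:
    # Pretend object store: one fixed object with a fixed ETag.
    etag = '"v1"'
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, b"x" * 1024


cache = ArtifactCache(fetch=fake_bucket)
for _ in range(22):
    cache.get("https://example.invalid/dstack-runner-linux-amd64")
print(cache.bytes_transferred)  # 1024
```

Only the first of the 22 requests transfers the body; the rest are answered with 304 and served from the cache. Rejecting non-200 responses serves the same purpose as --fail on the bootstrap curl.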

Validated end-to-end on AWS (two servers on one DB, one instance, custom-versioned builds, S3 request metrics):

  • Both shims carry the fix: 22 conditional GETs over 25 min of continuous flip-flop → 1 real transfer (~17 MB at warmup), the rest 304. Steady-state egress ≈ 0 (unfixed: ~430 MB and growing).
  • Mixed (one unfixed shim + one fixed): ~454 MB in 11 min and climbing; not mitigated (see below).

Layer 2: stop the loop (server side, not done yet). Layer 1 stops the bytes, not the loop. With two servers disagreeing, the version never converges, so the server keeps issuing installs and the shim restarts every ~30s forever (zero egress, but continued churn and instability).

Why the loop isn't fully fixed yet

Stopping the loop means the server must decide which version wins when two servers disagree — and that decision isn't currently well-defined:

  • A natural rule is no-downgrade (install only if expected is strictly newer than installed) → converges to the highest version and halts. It's correct for the incident (two real releases) and for production (all semver), and since server downgrade is unsupported (forward-only migrations) it breaks no supported flow.
  • But dstack's version value space has no reliable total order. get_*_version() can return real semver (0.20.23), a CI build number (str(run_number + 150), e.g. 27305, used with DSTACK_USE_LATEST_FROM_BRANCH), or None/dev. Under PEP 440, Version("27305") > Version("0.20.23") is true — a build-number build outranks every 0.x.y release, which would block upgrading a branch-build instance to a real release. So "newer" is ambiguous across schemes, and we can't safely pick a winner yet.
  • It also doesn't cover the finalization-failure trigger (download OK but chmod/rename fails → installed never advances → "expected is newer" stays true forever); that needs an attempt cap + backoff regardless.
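
The ordering problem can be shown without any third-party library. The sketch below is a simplified model of PEP 440 release-segment comparison (dotted integers, zero-padded to equal length); it ignores pre/post/dev segments. `release_tuple` and `is_semver` are hypothetical helpers, not dstack functions.

```python
import re
from typing import Optional

SEMVER = re.compile(r"\d+\.\d+\.\d+")


def release_tuple(version: str, width: int = 3) -> tuple[int, ...]:
    # "27305" -> (27305, 0, 0); "0.20.23" -> (0, 20, 23)
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def is_semver(version: Optional[str]) -> bool:
    # True only for plain MAJOR.MINOR.PATCH strings.
    return version is not None and SEMVER.fullmatch(version) is not None


# A CI build number outranks every 0.x.y release under this ordering.
print(release_tuple("27305") > release_tuple("0.20.23"))  # True
print(is_semver("27305"), is_semver("0.20.23"), is_semver(None))  # False True False
```

A filter like `is_semver` is one way to restrict a no-downgrade rule to pairs that are actually comparable; whether that is the right boundary is exactly the open question here.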

So the complete server-side fix = no-downgrade restricted to comparable (semver) pairs + an attempt cap/backoff + a "multiple server versions" guard/warning. We're deferring it until we settle a reliable version ordering (or a single source of truth for the expected version) rather than ship a rule that mis-ranks build numbers.
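
For discussion, here is a rough sketch of how the first two pieces could fit together (the multi-version guard is left out). It is not an implementation plan: `should_install`, `InstallState`, `MAX_ATTEMPTS`, and `BASE_DELAY` are hypothetical, the constants are arbitrary, and letting non-semver pairs fall back to install-on-mismatch is an assumption rather than a settled decision.

```python
import re
from dataclasses import dataclass
from typing import Optional

SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")
MAX_ATTEMPTS = 5    # hypothetical cap
BASE_DELAY = 30.0   # hypothetical base delay, seconds


def parse_semver(v: Optional[str]) -> Optional[tuple[int, ...]]:
    m = SEMVER.fullmatch(v) if v else None
    return tuple(int(g) for g in m.groups()) if m else None


@dataclass
class InstallState:
    attempts: int = 0
    next_attempt_at: float = 0.0


def should_install(installed: Optional[str], expected: Optional[str],
                   state: InstallState, now: float) -> bool:
    if installed == expected:
        # Converged: reset the attempt budget.
        state.attempts, state.next_attempt_at = 0, 0.0
        return False
    a, b = parse_semver(installed), parse_semver(expected)
    if a is not None and b is not None and b <= a:
        return False  # no-downgrade, applied only to comparable (semver) pairs
    if state.attempts >= MAX_ATTEMPTS or now < state.next_attempt_at:
        return False  # capped, or still backing off
    state.attempts += 1
    state.next_attempt_at = now + BASE_DELAY * 2 ** (state.attempts - 1)
    return True


# Finalization-failure case: installed never advances, checks every 7s for an hour.
state = InstallState()
issued = sum(should_install("0.20.18", "0.20.23", state, float(t))
             for t in range(0, 3600, 7))
print(issued)  # 5 (capped at MAX_ATTEMPTS)
```

In the two-server case the older server's request is refused as a downgrade, so the instance settles on the higher version. In the stuck-finalization case the cap and exponential backoff bound the number of installs even though "expected is newer" stays true.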

Until then: in a mixed-version fleet where the in-control shim predates PR #3954, the binary still re-downloads (the cache fix lives in the shim binary), and the loop/churn persists even where egress is 304-bounded.
