Stuck runner/shim install loops re-download the full binaries every check (no backoff, no cap, no content check) #3953

## Summary

The server drives a runner/shim **re-install loop** that has no backoff, no attempt cap, and no content check. If an instance's *reported* version never equals the server's *expected* version, the server re-issues an install on every instance check (~7–14s) and the shim **re-downloads the full ~37 MB binaries every time, indefinitely**. The result is large, sustained egress from the public `dstack-runner-downloads` bucket (paid by the bucket owner). Jobs keep running, so the loop is silent.

**Status:** reproduced live on AWS (below). PR #3954 bounds the **egress**; the **loop itself** is not fixed yet — see *Why the loop isn't fully fixed yet*.

## Steps to reproduce

1. Run **two dstack servers on different versions** (e.g. `0.20.18` and `0.20.23`) against the **same database / backend**.
2. Launch any instance (a plain fleet node is enough — no job required).
3. Observe: the instance re-downloads `dstack-runner` + `dstack-shim` from `dstack-runner-downloads` **continuously**, pulling the full binaries every ~30s with no end.

Two servers is just the simplest trigger. The general condition is **any** persistent mismatch between the instance's reported version and the server's expected version — e.g. a flapping `DSTACK_*_VERSION_URL`, or a download that succeeds but then fails `chmod`/`rename` so the new binary never lands.

### Reproduced live

Two servers (`0.20.18` + `0.20.23`) on one Postgres + AWS backend, one `eu-west-1` instance. The servers interleave, each forcing *its* version; the instance ping-pongs and re-downloads each cycle. Server log (~30s apart):

```
Instance <name>: installing runner 0.20.23 -> 0.20.18 from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.20.18/binaries/dstack-runner-linux-amd64
Instance <name>: installing runner 0.20.18 -> 0.20.23 from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.20.23/binaries/dstack-runner-linux-amd64
```

(The original incident was expensive because it was **cross-region** — `eu-west-1` bucket, readers in `us-east-2`/`us-west-2`; the loop itself is region-independent.)

## What's happening

Per instance, every ~7–14s (`instances/__init__.py` — 7s for transitional states, 14s for `IDLE`/`BUSY`) the server:

1. reads **installed** = `dstack-runner --version` / `dstack-shim --version` (via the shim's `GET /api/components`);
2. computes **expected** = `get_dstack_runner_version()` / `get_dstack_shim_version()` (`compute.py:801`);
3. if the two strings are **not equal** (plain string comparison; `check.py:495` runner / `:542` shim) → `POST /api/components/install` (`client.py:438`) → shim `Install(url, force=true)` (`handlers.go:217`, `force` hardcoded) → `downloadFile` re-downloads the **full** binary (`runner/internal/shim/components/utils.go`: no backoff/cap; idempotent on *file existence* only, which `force=true` bypasses).

Two servers pinned to different versions each force *their* version, so the instance never converges and re-downloads every cycle.
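A minimal simulation of this check, as a sketch rather than the actual dstack code: `Instance`, `check`, and `simulate` are hypothetical names, and strict alternation between the servers is an assumption (the live servers interleave less regularly). It only illustrates why plain string inequality plus a forced install never converges when two servers expect different versions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    installed: str
    downloads: int = 0

    def install(self, version: str) -> None:
        # Models Install(url, force=true): every call is a full re-download.
        self.downloads += 1
        self.installed = version


def check(instance: Instance, expected: str) -> bool:
    """One server-side check: string inequality, no backoff, no attempt cap."""
    if instance.installed != expected:
        instance.install(expected)
        return True
    return False


def simulate(server_versions: List[str], checks: int) -> Instance:
    """Run `checks` instance checks, with the servers taking turns (assumed)."""
    instance = Instance(installed=server_versions[0])
    for i in range(checks):
        check(instance, server_versions[i % len(server_versions)])
    return instance
```

With a single server the instance converges immediately (zero installs); with two servers on different versions, every check after the first triggers another full download.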

## Impact

- Unbounded full re-downloads → large sustained egress from the public bucket; high request volume; the instance never converges; silent (the install trigger is only logged at DEBUG).
- Footgun: the shim-bootstrap `curl` has `--retry 1` but **no `--fail`** (`compute.py` ~L929), so the body of a transient non-200 response (e.g. a 403) is saved as `dstack-shim` and then executed → `/usr/local/bin/dstack-shim: 2: Syntax error: newline unexpected` → crash loop.

## Fix — two layers

**Layer 1 — bound the egress (shim side, PR #3954, done).** The shim caches each artifact **per source URL** and revalidates with a conditional `GET` (`If-None-Match` → `304`), so a forced re-install of an unchanged version transfers **zero bytes**. Plus `--fail` on the bootstrap `curl`.

Validated end-to-end on AWS (two servers on one DB, one instance, custom-versioned builds, S3 request metrics):
- **Both shims carry the fix:** 22 conditional GETs over 25 min of continuous flip-flop → **1** real transfer (~17 MB at warmup), the rest `304`. Steady-state egress ≈ **0** (unfixed: ~430 MB and growing).
- **Mixed (one unfixed shim + one fixed):** ~**454 MB in 11 min and climbing** — *not* mitigated (see below).
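The Layer 1 caching behavior can be illustrated with a small sketch. It is written in Python for illustration only (the shim itself is Go, and this is not the code from PR #3954). `ArtifactCache` and the injected `fetch` callable are hypothetical, and the sketch keeps bodies in memory where a real shim would keep files on disk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# (status, etag, body) as returned by a hypothetical HTTP fetch function.
Response = Tuple[int, Optional[str], bytes]
Fetch = Callable[[str, Dict[str, str]], Response]


@dataclass
class CachedArtifact:
    etag: str
    body: bytes


class ArtifactCache:
    """Per-URL cache that revalidates with If-None-Match before re-downloading."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._by_url: Dict[str, CachedArtifact] = {}
        self.bytes_transferred = 0

    def get(self, url: str) -> bytes:
        cached = self._by_url.get(url)
        # Send the stored ETag so an unchanged artifact can be answered with 304.
        headers = {"If-None-Match": cached.etag} if cached else {}
        status, etag, body = self._fetch(url, headers)
        if status == 304 and cached is not None:
            return cached.body  # unchanged: no payload was transferred
        if status != 200:
            raise RuntimeError(f"unexpected status {status} for {url}")
        self.bytes_transferred += len(body)
        if etag is not None:
            self._by_url[url] = CachedArtifact(etag, body)
        return body
```

Because the cache is keyed by source URL, flip-flopping between two versions costs one full transfer per version, after which every forced re-install is answered from the cache.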

**Layer 2 — stop the loop (server side, not done yet).** Layer 1 stops the *bytes*, not the *loop*: with two servers disagreeing the version never converges, so the server keeps issuing installs and the shim **restarts every ~30s** forever (zero egress, but churn/instability).

## Why the loop isn't fully fixed yet

Stopping the loop means the server must decide **which version wins** when two servers disagree — and that decision isn't currently well-defined:

- A natural rule is **no-downgrade** (install only if `expected` is strictly newer than `installed`) → converges to the highest version and halts. It's correct for the incident (two real releases) and for production (all semver), and since **server downgrade is unsupported** (forward-only migrations) it breaks no supported flow.
- **But dstack's version value space has no reliable total order.** `get_*_version()` can return real semver (`0.20.23`), a **CI build number** (`str(run_number + 150)`, e.g. `27305`, used with `DSTACK_USE_LATEST_FROM_BRANCH`), or `None`/dev. Under PEP 440, `Version("27305") > Version("0.20.23")` is **true** — a build-number build outranks every `0.x.y` release, which would **block upgrading a branch-build instance to a real release**. So "newer" is ambiguous across schemes, and we can't safely pick a winner yet.
- It also doesn't cover the **finalization-failure** trigger (download OK but `chmod`/rename fails → `installed` never advances → "expected is newer" stays true forever); that needs an **attempt cap + backoff** regardless.
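The ordering problem can be reproduced with the standard library alone. The sketch below compares release segments as integer tuples, which agrees with PEP 440 for the plain numeric examples used here. `release_tuple`, `is_semver`, and `should_upgrade` are hypothetical helpers, and treating exactly three dotted integers as "semver" is a simplifying assumption.

```python
import re
from typing import Optional, Tuple

# Assumption: a "real release" is exactly three dotted integers, e.g. 0.20.23.
_SEMVER = re.compile(r"\d+\.\d+\.\d+")


def release_tuple(version: str) -> Tuple[int, ...]:
    """Split a purely numeric dotted version into integers for comparison."""
    return tuple(int(part) for part in version.split("."))


def is_semver(version: Optional[str]) -> bool:
    return version is not None and _SEMVER.fullmatch(version) is not None


def should_upgrade(installed: Optional[str], expected: Optional[str]) -> Optional[bool]:
    """No-downgrade rule restricted to comparable pairs.

    Returns True or False when both versions are semver, and None when the
    pair is not comparable (build number, dev, or missing version).
    """
    if not (is_semver(installed) and is_semver(expected)):
        return None
    return release_tuple(expected) > release_tuple(installed)
```

Here `release_tuple("27305") > release_tuple("0.20.23")` is true, which reproduces the mis-ranking, while `should_upgrade("27305", "0.20.23")` returns `None` and leaves that case to whatever policy is eventually chosen.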

So the complete server-side fix = **no-downgrade restricted to comparable (semver) pairs + an attempt cap/backoff + a "multiple server versions" guard/warning.** We're **deferring it** until we settle a reliable version ordering (or a single source of truth for the expected version) rather than ship a rule that mis-ranks build numbers.
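One possible shape for the attempt cap and backoff, as a sketch under assumptions: `InstallThrottle` is a hypothetical helper, not existing dstack code, and the limits (5 attempts, a 30s base delay doubling up to one hour) are placeholder values.

```python
from typing import Dict, Tuple


class InstallThrottle:
    """Caps install attempts per (instance, expected version) with exponential backoff."""

    def __init__(self, max_attempts: int = 5, base_delay: float = 30.0,
                 max_delay: float = 3600.0) -> None:
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        # key -> (attempts made so far, earliest time the next attempt is allowed)
        self._state: Dict[Tuple[str, str], Tuple[int, float]] = {}

    def should_attempt(self, instance_id: str, expected: str, now: float) -> bool:
        key = (instance_id, expected)
        attempts, next_allowed = self._state.get(key, (0, 0.0))
        if attempts >= self.max_attempts or now < next_allowed:
            return False
        # Double the wait after each attempt, up to max_delay.
        delay = min(self.base_delay * 2 ** attempts, self.max_delay)
        self._state[key] = (attempts + 1, now + delay)
        return True

    def reset(self, instance_id: str, expected: str) -> None:
        """Call once the instance reports the expected version."""
        self._state.pop((instance_id, expected), None)
```

Because the key includes the expected version, two disagreeing servers each consume a separate budget, so the total number of installs stays bounded even when the versions never converge. The same cap also covers the finalization-failure trigger, where `installed` never advances.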

**Until then:** in a mixed-version fleet where the in-control shim predates PR #3954, the binary still re-downloads (the cache fix lives in the shim binary), and the loop/churn persists even where egress is `304`-bounded.

