fix(worker): stop snapshot-baked stale JOB_DATA running the wrong agent (ucho MNG-1622/1702)#1470
Conversation
…nt (ucho MNG-1622/1702) Prod: MNG-1622 and MNG-1702 were moved into Splitting; the label triggered but no backlog story cards were produced. Root cause (verified via prod logs + live `docker image inspect`): the splitting workers RE-RAN the planning agent on a stale payload. Mechanism — an interaction between the MNG-1660 JOB_DATA offload and snapshot reuse: - An earlier `planning` run of the same issue passed its payload INLINE (`-e JOB_DATA=<json>`). Its container was committed to a snapshot image, and `docker commit` bakes the container's `Config.Env` — including `JOB_DATA=<planning json>` (and every secret: DATABASE_URL, REDIS_URL, GitHub /Linear/OpenAI creds, Claude OAuth token) — into the image. - The `splitting` run's payload was large, so the router OFFLOADED it to Redis and set only `JOB_DATA_REDIS_KEY`, not `JOB_DATA` (worker-env.ts is if/else). - The splitting worker spawned FROM that snapshot image. `docker run -e JOB_DATA_REDIS_KEY=...` does not clear the baked `JOB_DATA`, so the container had both. - `resolveRawJobData()` read inline `JOB_DATA` first → the STALE planning payload (triggerResult.agentType 'planning') → ran planning → no cards. This regressed from MNG-1660: before it, every run set inline JOB_DATA which always overrode the baked value. Two-part fix: 1. Correctness (self-healing, covers snapshots already baked in prod): worker-entry.ts resolveRawJobData now treats JOB_DATA_REDIS_KEY as AUTHORITATIVE when present — the router sets exactly one channel per spawn, so an inline JOB_DATA co-present with a key can only be a stale snapshot artifact and is ignored. 2. Hygiene + security: worker-snapshots.ts commitWorkerSnapshot now inspects the container and commits a SCRUBBED Config.Env (docker-modem _query/_body idiom), stripping job env (JOB_DATA*/JOB_ID/JOB_TYPE) and secrets (DATABASE_URL, REDIS_URL, CREDENTIAL_MASTER_KEY, CLAUDE_CODE_OAUTH_TOKEN, and the dynamic project-credential names from CASCADE_CREDENTIAL_KEYS) while preserving PATH / PLAYWRIGHT_BROWSERS_PATH / NODE_*. Closes the credential-in-image leak; on inspect failure, falls back to a bare commit + Sentry (snapshot_env_scrub_ inspect_failed) rather than silently baking secrets or losing the snapshot. Blast radius: any second+ agent on a work item where an earlier run baked inline JOB_DATA and the later run offloads (>96 KiB) — planning→splitting, planning→implementation, implementation→review, etc. Tests: +2 worker-entry (Redis key wins over stale inline; inline-only path unchanged); +6 worker-snapshots (scrubSnapshotEnv deny-list/credential-keys/=-in -value/no-= edge cases; scrubbed-commit _query/_body shape; inspect-failure fallback + Sentry). typecheck, lint, unit-api (1814) and unit-core (6233) green. Ops after deploy: prune existing cascade-snapshot-* images (they carry baked secrets + stale JOB_DATA); rotate any credentials baked into images shared off-host. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
REQUEST_CHANGES. The worker-side reorder (part 1) correctly fixes the reported forward incident, but the snapshot env-scrub (part 2 — the headline 🔒 security + at-source hygiene fix) writes the Docker commit body in the wrong shape, so it almost certainly strips nothing. A mocked unit test locks in that wrong shape, which is why CI is green despite the bug. There is also an asymmetric edge in the worker reorder.
Code Issues
Blocking
-
src/router/worker-snapshots.ts:128— commit body shape doesn't match the Docker ImageCommit API.
container.commit({ _query, _body: { Config: { Env: scrubbed } } })sends the HTTP body{"Config":{"Env":[…]}}. But Docker'sPOST /commit(ImageCommit) request body is itself aContainerConfig—Envis a top-level field, not nested underConfig. TheConfignesting is the shape of the inspect response (info.Config.Env, which line 114 reads correctly), not the commit request.docker-modem sends
_bodyverbatim (modem.js:208 → JSON.stringify(opts._body || opts)), so Docker receives{"Config":{…}}, treatsConfigas an unknown field (ignored), and sees a nilEnv. The existing bare commit already sends a body that decodes to an all-nil config and still preserves the container env, so nilEnvmeans "keep the container's env" → the scrubbed list is never applied. The committed snapshot keeps every secret + the staleJOB_DATA, while thestrippedKeys: Ninfo log (lines 130–133) reports a false positive. (If a given daemon instead treats the body as a full replace, the image gets an empty/broken config — also bad.)Fix: send a flat
ContainerConfig:_body: { Env: scrubbed }. This needs verification against a real Docker daemon — the unit test mockscontainer.commit, so it only checks the argument the code passes, not what Docker accepts. The PR's own "Manual verify" grep (docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|…') is exactly the right check and should be run before merge, not just after deploy.
Should Fix
-
src/worker-entry.ts:478— Redis-key-first is asymmetric; the reverse stale-snapshot case now hard-fails.
Making the key authoritative fixes the forward case (baked inlineJOB_DATA+ fresh key). But the symmetric case regresses: a snapshot committed from a prior offloaded run bakes a staleJOB_DATA_REDIS_KEY=cascade:jobdata:<priorJobId>. If the next run for the same work item is inline (payload < 96 KiB),docker run -e JOB_DATA=…does not clear the baked key, soprocess.env.JOB_DATA_REDIS_KEYis the stale key.readOffloadedJobDatathen reads a key the prior run already deleted (job-data-offload.ts:108) → throws →process.exit(1). With the old inline-first order this run succeeded. That is a brand-new failure mode in exactly the "self-heal already-baked snapshots without pruning" scenario this PR targets, and snapshot reuse is per(projectId, workItemId)so it is reachable (container-manager.ts:253).The offload key embeds the jobId and
JOB_IDis set fresh on every spawn (worker-env.ts:113), so you can disambiguate both directions: treat the Redis key as authoritative only whenkey === buildJobDataRedisKey(process.env.JOB_ID); otherwise it is a stale baked artifact — ignore it and fall through to inline. That fixes forward and reverse.
Notes
tests/unit/router/worker-snapshots.test.ts:127assertscommitArg._body.Config.Env— it validates the argument the code passes to the mock, not what a real daemon accepts, which is why CI passes despite the shape bug. Once_bodyis corrected to{ Env: … }, update this assertion and ideally add a real-daemon integration check.- Deny-list completeness itself looks good: it covers the infra secrets +
CLAUDE_CODE_OAUTH_TOKEN+ the optional vars fromappendOptionalEnvVars, and dynamically pulls per-project credential names fromCASCADE_CREDENTIAL_KEYS(set in worker-env.ts:164). PreservingSENTRY_*/CASCADE_DASHBOARD_URL/PATH/PLAYWRIGHT_BROWSERS_PATHis correct — but none of it takes effect until the commit body shape is fixed.
🕵️ claude-code · claude-opus-4-8 · run details
| const scrubbed = scrubSnapshotEnv(env, extractCredentialKeys(env)); | ||
| await container.commit({ | ||
| _query: { container: containerId, repo, tag: 'latest' }, | ||
| _body: { Config: { Env: scrubbed } }, |
There was a problem hiding this comment.
BLOCKING — wrong Docker commit body shape. Docker's POST /commit (ImageCommit) request body is a ContainerConfig, so Env must be top-level: _body: { Env: scrubbed }. { Config: { Env } } is the inspect response shape (correctly read on line 114), not the commit request shape. docker-modem sends _body verbatim (modem.js:208), so Docker receives {"Config":{"Env":[…]}}, treats Config as an unknown field, and sees a nil Env → it keeps the container's original (unscrubbed) env. Net: every secret + stale JOB_DATA stay baked and strippedKeys logs a false positive. Verify against a real daemon: docker image inspect <snap> --format '{{json .Config.Env}}'.
There was a problem hiding this comment.
Fixed in 097b179a. The scrubbed-env commit now sends a flat ContainerConfig — _body: { Env: scrubbed } (top-level Env) — instead of the { Config: { Env } } inspect-response shape.
Verified against docker-modem's source: modem.js:162 uses _query for the querystring and modem.js:208 (JSON.stringify(opts._body || opts)) sends _body verbatim as the ImageCommit request body, so the flat shape now maps to ContainerConfig.Env as the API expects. The old nested Config was an unknown top-level field Docker ignored, leaving Env nil and baking the raw env while strippedKeys logged a false positive — exactly as you described.
Docker isn't available in the agent sandbox for a live docker image inspect, so the PR's manual-verify grep stays the pre-merge gate.
| captureException(err, { tags: { source: 'worker_env' } }); | ||
| await flush(); | ||
| process.exit(1); | ||
| if (key) { |
There was a problem hiding this comment.
SHOULD_FIX — this is asymmetric. If a snapshot was committed from a prior offloaded run it bakes a stale JOB_DATA_REDIS_KEY=cascade:jobdata:<priorJobId>. When the next run for the same work item is inline, docker run -e JOB_DATA=… doesn't clear the baked key, so this reads the stale key; readOffloadedJobData finds it already deleted (the prior run del'd it) → throws → process.exit(1). That run succeeded before this change — a new failure mode in exactly the self-heal-without-pruning scenario the PR targets. Since the key embeds the jobId and JOB_ID is set fresh every spawn (worker-env.ts:113), only trust the key when key === buildJobDataRedisKey(process.env.JOB_ID); otherwise treat it as a stale baked artifact and fall through to inline. Handles both directions.
There was a problem hiding this comment.
Fixed in 097b179a. The Redis key is now authoritative only when it names THIS job: key === buildJobDataRedisKey(process.env.JOB_ID). Since JOB_ID is set fresh on every spawn and the offload key embeds that jobId (worker-env.ts:113/137), this disambiguates both directions:
- Forward (fresh key + stale baked inline
JOB_DATA): key matches → read Redis, ignore the stale inline value. - Reverse (fresh inline + stale baked
JOB_DATA_REDIS_KEYfrom a prior offloaded run): key does not match → treat it as a stale baked artifact, fall through to inline. No more reading a key the prior run already deleted → noprocess.exit(1)regression.
Added a reverse-case regression test in tests/unit/worker-entry.test.ts and updated the missing/expired offload test to use a key that matches JOB_ID (as production always does).
| repo: 'cascade-snapshot-proj-snap-card-snap', | ||
| tag: 'latest', | ||
| }); | ||
| const bakedEnv = commitArg._body.Config.Env; |
There was a problem hiding this comment.
This asserts the argument the code passes to the mocked container.commit (_body.Config.Env), not what a real Docker daemon accepts — which is why CI stays green even though the commit body shape is wrong for the ImageCommit API (see the comment on worker-snapshots.ts). Once _body is corrected to { Env: … }, update this assertion, and ideally add a real-daemon integration check.
There was a problem hiding this comment.
Updated in 097b179a: the assertion now reads commitArg._body.Env (top-level) to match the corrected ImageCommit body shape. I also added a reverse-case regression to worker-entry.test.ts.
On the real-daemon integration check: the repo has no existing real-Docker test harness (all Docker interaction in tests is mocked via dockerode) and no daemon is available in the agent sandbox, so adding one here would be unrunnable/flaky. I verified the wire shape via docker-modem source instead (_body serialized verbatim at modem.js:208); the PR's manual-verify grep remains the pre-merge gate.
…stale offload key Address review feedback on PR #1470. 1. worker-snapshots.ts (BLOCKING): the scrubbed-env commit wrote the Docker ImageCommit body as `{ Config: { Env } }`. Docker's POST /commit request body IS a ContainerConfig — Env is top-level. docker-modem serializes `_body` verbatim (modem.js:208), so the nested `Config` was an unknown field Docker ignored, leaving Env nil and baking the container's raw (unscrubbed) env while the `strippedKeys` log reported a false positive. Send `_body: { Env: scrubbed }`. 2. worker-entry.ts (SHOULD_FIX): making JOB_DATA_REDIS_KEY unconditionally authoritative fixed the forward case but regressed the reverse one — a snapshot baked from a prior OFFLOADED run carries a stale JOB_DATA_REDIS_KEY that a later INLINE run does not clear, so reading it hit a key the prior run already deleted → process.exit(1). Trust the key only when it names THIS job (`key === buildJobDataRedisKey(JOB_ID)`); otherwise treat it as a stale baked artifact and fall through to inline. Fixes forward AND reverse. 3. Tests: assert the corrected `_body.Env` shape; add a reverse-case regression (stale baked key mismatching JOB_ID → inline wins, no crash); make the missing/expired offload test use a key that matches JOB_ID (as production always does). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
REQUEST_CHANGES. Part 1 (worker-entry resolveRawJobData) is correct and well-tested. Part 2 (snapshot env-scrub) commits a partial ContainerConfig to Docker, which replaces the image's entire runtime config — wiping Cmd/WorkingDir/User. Every reused snapshot after this change fails to launch, which regresses the exact snapshot-reuse path this PR targets.
Code Issues
Blocking
src/router/worker-snapshots.ts (commitScrubbed) — the scrubbed commit passes _body: { Env: scrubbed }, a partial config. Docker's POST /commit treats a non-nil request-body Config as the complete runtime config of the new image; it does not merge it over the container's config:
- moby
daemon/commit.go:c.Configonly falls back tocontainer.Configwhen it isnil. A non-nil{Env}is used as-is. image.NewChildImagesetsConfig: child.Configdirectly — no merge with the parent image's config.
So the snapshot image ends up with only Env and loses everything else the worker relies on:
| Lost field | Baked by | Consequence on reuse |
|---|---|---|
Cmd ["node","--import","./dist/instrument.js","dist/worker-entry.js"] |
Dockerfile.worker:145 | createContainer/start → "No command specified" |
WorkingDir /app |
Dockerfile.worker:16 | resets to / → relative dist/worker-entry.js not found |
User node |
Dockerfile.worker:123 | resets to root → Claude Code refuses --dangerously-skip-permissions (Dockerfile.worker:107) |
| Labels / ExposedPorts / Volumes | Dockerfile.worker | dropped |
The reuse path sets no explicit Cmd/WorkingDir/User — worker-container-launcher.ts:74 createContainer passes only Image/Env/HostConfig/Labels, and resolveSpawnSettings swaps workerImage = snapshot.imageName (worker-spawn-settings.ts:141). So a reused snapshot depends entirely on the image's baked config. Result: the first snapshot-enabled run commits a broken image; the 2nd+ run for that work item can't start at all — strictly worse than the "wrong agent" bug being fixed, and it lands on the same ucho/snapshot path.
Fix — spread the full inspected Config, replacing only Env (the inspect extraction must widen too — today it only keeps info?.Config?.Env):
const info = (await container.inspect()) as { Config?: Record<string, unknown> & { Env?: string[] } };
const fullConfig = info?.Config ?? {};
const env = fullConfig.Env;
// …after computing `scrubbed`…
await container.commit({
_query: { container: containerId, repo, tag: 'latest' },
_body: { ...fullConfig, Env: scrubbed },
});I could not run Docker in the review env — please confirm on a live daemon: after a scrubbed commit, docker image inspect <snap> --format '{{json .Config.Cmd}}' should print the CMD, not null. The new unit test asserts only _body.Env and mocks container.commit, so it can't catch this; once fixed, add assertions that Cmd/WorkingDir/User survive.
Note (non-blocking)
The inspect-failure fallback container.commit({ repo, tag }) (bare) is the config-preserving form (nil body → daemon keeps container.Config). So today the failure path yields a working (unscrubbed) snapshot while the success path yields a broken one. Fixing the main path removes the asymmetry.
What's correct
worker-entry.ts resolveRawJobData: gating trust on JOB_ID === buildJobDataRedisKey(jobId) cleanly disambiguates both stale-snapshot directions. The one-channel-per-spawn invariant holds (worker-env.ts:133-142), JOB_ID is validated present in main() before this runs, and the forward/reverse/inline-only cases are all covered by tests. 👍
🕵️ claude-code · claude-opus-4-8 · run details
| let env: string[] | undefined; | ||
| try { | ||
| const info = (await container.inspect()) as { Config?: { Env?: string[] } } | undefined; | ||
| env = info?.Config?.Env; |
There was a problem hiding this comment.
This only captures Config.Env, so the full container config isn't available to pass back on commit (see the blocking comment below). Widen this to keep the whole Config (Cmd/WorkingDir/User/…) so it can be spread into the commit body.
There was a problem hiding this comment.
Fixed in 630e39be. Widened the inspect extraction to keep the whole Config (fullConfig) rather than only Config.Env:
const info = (await container.inspect()) as
| { Config?: Record<string, unknown> & { Env?: string[] } }
| undefined;
fullConfig = info?.Config;fullConfig now carries Cmd/WorkingDir/User/Labels/ExposedPorts so they can be spread back into the commit body (see the reply on the commit call below).
| // nested under Config (that is the inspect-response shape). See fn doc. | ||
| await container.commit({ | ||
| _query: { container: containerId, repo, tag: 'latest' }, | ||
| _body: { Env: scrubbed }, |
There was a problem hiding this comment.
Blocking: _body: { Env: scrubbed } is a partial ContainerConfig. Docker's POST /commit uses a non-nil body config as the image's complete runtime config (moby daemon/commit.go only falls back to container.Config when the body is nil; image.NewChildImage assigns Config: child.Config with no parent merge). So the snapshot loses Cmd (Dockerfile.worker:145), WorkingDir=/app (:16) and User=node (:123). Since worker-container-launcher.ts sets none of those at createContainer, a reused snapshot (worker-spawn-settings.ts:141) can't start — Cmd gone → "No command specified"; even otherwise, User→root breaks Claude Code and WorkingDir→/ breaks the relative dist/worker-entry.js. Spread the full inspected config instead: _body: { ...fullConfig, Env: scrubbed }. Please verify on a live daemon that .Config.Cmd survives the commit.
There was a problem hiding this comment.
Fixed in 630e39be. The scrubbed commit now spreads the full inspected config and replaces only Env:
await container.commit({
_query: { container: containerId, repo, tag: 'latest' },
_body: { ...fullConfig, Env: scrubbed },
});This preserves Cmd (the worker entrypoint), WorkingDir=/app, User=node, Labels, and ExposedPorts baked by Dockerfile.worker, so a reused snapshot (which sets no explicit Cmd/WorkingDir/User at createContainer) still launches. The inspect-failure bare-commit fallback was already the config-preserving form (nil body → daemon keeps container.Config), so this also removes the success/failure asymmetry you flagged in the non-blocking note.
Added unit assertions that Cmd/WorkingDir/User/Labels/ExposedPorts survive the scrubbed commit body (the mock inspect now returns those fields alongside Env).
On the live-daemon check: Docker isn't available in the agent sandbox and the repo has no real-Docker test harness (all Docker interaction is mocked via dockerode), so I verified the shape against moby's source — daemon/commit.go falls back to container.Config only when the body config is nil, and image.NewChildImage assigns Config: child.Config with no parent merge, confirming a non-nil {Env}-only body would wipe the rest. The PR's manual-verify grep plus docker image inspect <snap> --format '{{json .Config.Cmd}}' (should print the CMD, not null) remains the pre-merge gate.
The scrubbed-env commit passed a partial `_body: { Env: scrubbed }`. Docker's
POST /commit uses a non-nil body config as the new image's COMPLETE runtime
config (moby daemon/commit.go falls back to container.Config only when the body
is nil; image.NewChildImage assigns Config: child.Config with no parent merge),
so this wiped Cmd/WorkingDir/User/Labels/ExposedPorts baked by Dockerfile.worker.
Every reused snapshot then failed to launch ("No command specified", User→root,
WorkingDir→/) — regressing the exact snapshot-reuse path this PR targets.
Widen the inspect extraction to keep the whole Config and spread it into the
commit body, replacing only Env: `_body: { ...fullConfig, Env: scrubbed }`. The
inspect-failure bare-commit fallback was already config-preserving; this removes
the success/failure asymmetry. Unit test now asserts Cmd/WorkingDir/User/Labels/
ExposedPorts survive the scrubbed commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
REQUEST_CHANGES — scoped to the snapshot env-scrub only. The worker-side fix (resolveRawJobData) is correct, well-reasoned, and fully resolves the wrong-agent regression on its own. The scrub half relies on an unverified assumption about Docker's commit semantics, and the tests can't catch a failure because they mock container.commit() and assert on its arguments, not on the committed image.
Code Issues
Blocking (verify against a real daemon before merge)
src/router/worker-snapshots.ts — _body: { Env: scrubbed } scrub is unverified end-to-end.
Docker's POST /commit applies the request-body config on top of the container's existing config; it does not simply swap in what you pass. For the Env list specifically, moby re-appends the container's env vars whose keys are absent from the provided list — there is no supported way to remove an env var via docker commit (that's why --change only ever adds/overrides ENV, never deletes). If that holds here, every deny-listed key (JOB_DATA, DATABASE_URL, CLAUDE_CODE_OAUTH_TOKEN, the project credentials) is re-added at commit time, the committed image is byte-for-byte identical to today's bare commit, and the "🔒 security" scrub is a silent no-op.
The opposite semantics are no safer: if the body config did replace wholesale, the snapshot would lose its baked CMD/ENTRYPOINT/WORKINGDIR — worker-container-launcher.ts starts the container with Image + Env and no explicit Cmd (relies on the image's baked command), so a reused snapshot would fail to boot.
Either way the change needs a real-daemon check. The unit tests can't distinguish these outcomes: they assert on commitArg._body.Env (the mocked call args), and the in-test comment "a nested Config would be silently dropped by Docker, baking the raw env" is itself an unverifiable claim about Docker that a mock cannot exercise.
Ask: run your own "Ops after deploy → Manual verify" grep pre-merge —
docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON' must print nothing — and ideally lock it down with an integration test that commits a throwaway container and inspects the resulting image env. If the keys survive, scrub-by-omission won't work; blanking values (--change 'ENV KEY=') or not injecting persistable secrets into snapshot-enabled containers would be the fallback.
Note: the correctness fix does not depend on the scrub — resolveRawJobData already ignores the stale baked JOB_DATA, so the wrong-agent bug is fixed regardless of how commit resolves this.
What's solid (no changes needed)
resolveRawJobData— theJOB_ID↔key match guard cleanly disambiguates both the forward case (offloaded run + stale baked inlineJOB_DATA) and the reverse case (inline run + stale bakedJOB_DATA_REDIS_KEY). Cross-checked againstworker-env.ts:JOB_ID=${job.id}is always set, exactly one ofJOB_DATA/JOB_DATA_REDIS_KEYis set per spawn, the offload key is built from the same job id, and job ids are${type}-${Date.now()}-${random}(collision-free), so a stale baked key never matches a freshJOB_ID. Tests cover both directions plus inline-only and the Redis-miss exit path.scrubSnapshotEnvas a pure function is correct and well-tested (static deny-set + dynamicCASCADE_CREDENTIAL_KEYS, first-=split, no-=lines).- Inspect-failure → bare-commit + Sentry (
snapshot_env_scrub_inspect_failed) is a reasonable non-silent degrade.
🕵️ claude-code · claude-opus-4-8 · run details
| const scrubbed = scrubSnapshotEnv(env, extractCredentialKeys(env)); | ||
| // ImageCommit request body IS a ContainerConfig → Env is top-level, NOT | ||
| // nested under Config (that is the inspect-response shape). See fn doc. | ||
| await container.commit({ |
There was a problem hiding this comment.
This scrub is unverified end-to-end, and the tests can't catch a failure — they assert on the mocked container.commit() args (commitArg._body.Env), not on the resulting image's Config.Env.
Docker's POST /commit applies the body config on top of the container's config; moby re-appends the container's env vars whose keys aren't in the provided list (you can't remove an env var via docker commit — --change ENV only adds/overrides). If that's the case here, JOB_DATA + every secret get re-added at commit time → the image is identical to a bare commit and the security scrub is a silent no-op.
If instead the body replaced wholesale, the snapshot would lose its baked CMD/ENTRYPOINT/WORKINGDIR (worker-container-launcher.ts starts with Image+Env and no explicit Cmd) → reused snapshots fail to boot.
Please verify pre-merge against a real daemon: docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL' must print nothing. The worker-side resolveRawJobData fix already covers the correctness bug, so this path is security-only.
|
🛠️ Addressing the snapshot env-scrub feedback now 🕵️ run details |
|
Real-daemon verification (the reviewer's blocking ask) — done. @nhopeatall is correct: the current env-scrub is a silent no-op. Here's the proof and the working fix. Ran both approaches against a live Docker daemon (postgres:16-alpine container with ❌ Scrub-by-omission (current PR — pass a filtered moby re-appends every container env var whose key is absent from the body list — there is no "remove env" via commit. The committed image is byte-identical to a bare commit. The scrub (and the 🔒 security claim) does nothing. Unit tests can't catch this because they assert on the mocked ✅ Blank the values via container.commit({ repo, tag: 'latest',
changes: ['ENV JOB_DATA=', 'ENV JOB_DATA_REDIS_KEY=', 'ENV MY_SECRET='] })
→ image .Config.Env: ["JOB_DATA=", "JOB_DATA_REDIS_KEY=", "MY_SECRET=", "KEEP_ME=yes"]
→ Cmd preserved: ["sleep","120"]Values are emptied (secret gone; empty Recommended change to Test gap to close: the mock-args assertion should be replaced/augmented — assert the Note (agreeing with the review): the |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
APPROVE — this correctly fixes a verified prod bug (stale snapshot-baked JOB_DATA running the wrong agent) plus the associated secret-baking, and the Docker mechanics behind the fix hold up under scrutiny.
What I verified
resolveRawJobDatadisambiguation is sound.worker-env.tssets exactly one channel per spawn (if/else) withJOB_ID=job.idandJOB_DATA_REDIS_KEY=buildJobDataRedisKey(job.id)— both from the samejob.id, so a fresh offload key always matchesJOB_ID. BullMQ job IDs are unique per spawn (queue.tstimestamp+random), so a stale baked key embeds a prior job's ID and can never collide with the freshJOB_ID. Every channel combination (fresh key + stale inline; fresh inline + stale key; overridden key) resolves correctly. Onlyworker-entry.tsreadsprocess.env.JOB_DATA, and the subprocess allowlist (envFilter.ts) already blocks it from agents.- The
_query/_bodydocker-modem idiom is real in the pinneddocker-modem(modem.js:162opts._query,:208opts._body). - The "non-nil commit body = complete config replacement" reasoning is correct. The original bare
{repo,tag}body decodes to a nilcontainer.Config(no matching fields) so the daemon kept the container config;{...fullConfig, Env: scrubbed}decodes to a populated config used verbatim, so spreading the full inspected Config and replacing onlyEnvis what preservesCmd/WorkingDir/User/Labels. Scrubbing is safe because the launcher re-injects fullEnvon every spawn (worker-container-launcher.ts:77); only base-image vars need to survive, and they do. - Deny-list coverage is complete — static infra keys +
CLAUDE_CODE_OAUTH_TOKEN+ every dynamic project credential viaCASCADE_CREDENTIAL_KEYS(coversCODEX_AUTH_JSON, GitHub/Linear tokens, etc.), while preservingPATH/PLAYWRIGHT_BROWSERS_PATH/NODE_*/SENTRY_*.
Tests are meaningful and pass locally (39 + 19); CI 7/7 green.
Minor (optional, non-blocking)
One negligible doc/code nuance noted inline — no action required.
🕵️ claude-code · claude-opus-4-8 · run details
| } | ||
|
|
||
| // inspect unavailable / empty env → degrade to the prior bare-commit behavior. | ||
| await container.commit({ repo, tag: 'latest' }); |
There was a problem hiding this comment.
Optional nit (non-blocking): the function docstring says "If inspect fails or returns no env, falls back to a bare commit … and captures Sentry under snapshot_env_scrub_inspect_failed so the regression to baking secrets is loud rather than silent." But Sentry is only captured on the inspect throw path — this fallback (reached when inspect() resolves yet Config/Env is missing) bare-commits silently. In practice this is unreachable for a real worker container (Config.Env always has at least PATH + injected vars), and an empty env has no secrets to leak, so the behavior is fine. Only the docstring slightly overstates the Sentry coverage. Consider tightening the wording, or capturing Sentry here too if you want the "loud on unexpected shape" guarantee to literally hold.
…changes`)
The prior scrub committed a filtered `Env` array in the commit body — a proven
NO-OP. Verified against a live Docker daemon: moby re-appends every container
env var whose key is absent from the body list, so the committed image is
byte-identical to a bare commit — `JOB_DATA` and every credential survive.
docker commit cannot REMOVE an env var.
Switch to the supported mechanism: `container.commit({ changes: ['ENV KEY='] })`
blanks the value of each deny-listed / credential key actually present. Empty
`JOB_DATA` is falsy (worker ignores it); an emptied secret carries no value to
leak; `Cmd`/`Entrypoint`/`WorkingDir`/all other env are preserved (bare commit
keeps the full container config, `changes` overlays only the named keys).
End-to-end verified on a live daemon (exact code path): JOB_DATA, DATABASE_URL,
GITHUB_TOKEN_IMPLEMENTER, LINEAR_API_KEY all committed as `KEY=` (blank);
PATH/SENTRY_DSN/Cmd preserved.
- New `buildSnapshotEnvScrubChanges(env)` → `ENV KEY=` list for present deny keys.
- `commitScrubbed` uses `{repo, tag, changes}`; drops the `_query`/`_body` shape.
- Tests assert the `changes` array (present secret/job keys blanked, safe keys
untouched) instead of the unverifiable `_body.Env` mock args.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
|
Applied the corrected scrub in The Now using Empty Tests updated to assert the |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
APPROVE. Both halves of the fix are correct, backward-compatible, and well-tested (I ran the two touched suites locally: 61/61 green). The final state also resolves every BLOCKING concern from the earlier review thread — the commit body no-op, the lost Cmd/WorkingDir/User, and the reverse-direction asymmetry are all handled.
What I verified
resolveRawJobData(worker-entry.ts). Thekey === buildJobDataRedisKey(jobId)gate is the right disambiguator.worker-env.tssets exactly one channel per spawn (if/else at :133-137) and a freshJOB_ID(:113);addJobjob IDs are unique (${type}-${Date.now()}-${rand}, queue.ts:150), so a baked key from a prior run can never collide. Forward case (fresh key + stale baked inline) → reads Redis; reverse case (fresh inline + stale baked key) → key mismatch → falls through to inline. The old normal paths (inline-only, offload-only) are unchanged. Correct on all four combinations.commitScrubbed(worker-snapshots.ts). The switch tochanges: ['ENV KEY=']is the correct mechanism. Confirmed via docker-modem:container.commit({repo,tag,changes})sendschangesas a repeated query param (modem.js:162 → buildQuerystring array handling) while the body is the full opts object treated asContainerConfig—repo/tag/changes/containerare not ContainerConfig fields (no case-insensitive collisions), so it behaves as a bare/config-preserving commit with theENV KEY=overlay.Cmd/WorkingDir/Usersurvive; secret values are blanked to empty (falsy → the worker ignores them, and empty carries nothing to leak).- Scrub coverage. Static deny-list covers every infra secret pushed in
worker-env.ts; project credentials are covered dynamically viaCASCADE_CREDENTIAL_KEYS, whichworker-env.ts:164writes atomically alongside the individual credential vars — so the dynamic list always matches what's actually baked.
Non-blocking notes
1. PR description / ops runbook is stale vs. the final implementation. The description's "Fix part 2" and "Tests" still describe the superseded _body:{Env:scrubbed} / _query/_body approach; the merged code uses changes:['ENV KEY=']. Most importantly, the "Manual verify" step is now misleading:
docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON' → prints nothing
ENV KEY= blanks the value but keeps the key, so Config.Env still contains JOB_DATA=, GITHUB_TOKEN_IMPLEMENTER=, DATABASE_URL= (empty). That grep will print the blanked keys, not "nothing" — an operator following the runbook would wrongly conclude the scrub failed. The value-focused check is what actually proves the fix, e.g.:
... | grep -Eio '(JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON)=.+' → prints nothing
Worth correcting wherever this runbook lives so post-deploy verification isn't a false alarm.
2. Docstring overstates Sentry coverage (commitScrubbed, already raised as a nit in the prior thread and still present) — see inline.
3. Deny-list maintenance coupling (defense-in-depth) — see inline.
None of these block merge.
🕵️ claude-code · claude-opus-4-8 · run details
| * latter is a proven no-op). `Cmd`/`Entrypoint`/`WorkingDir`/all other env are | ||
| * preserved, so a reused snapshot still boots. | ||
| * | ||
| * If inspect fails or the env is empty, falls back to a bare commit (an |
There was a problem hiding this comment.
Nit (persists from the earlier thread): the docstring says "If inspect fails or the env is empty, falls back to a bare commit ... and captures Sentry under snapshot_env_scrub_inspect_failed". Sentry is only captured in the inspect() throw path (catch block). The resolves-but-empty-env / changes.length === 0 path bare-commits silently (no capture). Harmless in practice (a real worker container always has a non-empty Config.Env, and an empty env has no secrets to leak), but the wording overstates the coverage — either tighten it to "on inspect failure" or capture Sentry on the unexpected-shape path too.
| * Static deny-set covers job + infra-secret keys; per-project credential names | ||
| * are dynamic and enumerated at runtime from `CASCADE_CREDENTIAL_KEYS`. | ||
| */ | ||
| const SNAPSHOT_ENV_DENYLIST: ReadonlySet<string> = new Set([ |
There was a problem hiding this comment.
Defense-in-depth (non-blocking): this is a static deny-list. Project credentials are safe because they flow through the dynamic CASCADE_CREDENTIAL_KEYS path, but a new infra secret added to worker-env.ts (e.g. another optional var in appendOptionalEnvVars) that isn't added here would be baked into every snapshot with no test catching it. The current list is complete, but consider a small guard test that asserts every non-credential secret key pushed by worker-env.ts is present in SNAPSHOT_ENV_DENYLIST, so the coupling can't silently drift.
Symptom
Prod (project
ucho): MNG-1622 & MNG-1702 were moved into Splitting; the label triggered but no backlog story cards were produced.Root cause (verified: prod logs + live
docker image inspect)The splitting workers silently re-ran the
planningagent on a stale payload. A regression from the MNG-1660 offload fix × snapshot reuse:planningrun passed its payload inline (-e JOB_DATA=<json>). Its container was committed to a snapshot — anddocker commitbakes the container'sConfig.Env(incl.JOB_DATA=<planning json>and every secret) into the image.splittingpayload was large → the router offloaded to Redis, setting onlyJOB_DATA_REDIS_KEY(worker-env.ts is if/else — noJOB_DATA).docker run -e JOB_DATA_REDIS_KEY=…doesn't clear the bakedJOB_DATA, so the container had both.resolveRawJobData()read inlineJOB_DATAfirst → the stale planning payload (triggerResult.agentType: 'planning') → ran planning → no cards.Smoking gun:
docker image inspect cascade-snapshot-ucho-mng-1702:latestshowed bakedJOB_DATAwithtriggerResult.agentType: planningalongside the splittingJOB_DATA_REDIS_KEY— plus 10 plaintext secrets (DATABASE_URL,GITHUB_TOKEN_IMPLEMENTER/REVIEWER,LINEAR_API_KEY,CLAUDE_CODE_OAUTH_TOKEN,CODEX_AUTH_JSON, …).Before MNG-1660, every run set inline
JOB_DATA, which always overrode the baked value — masking this.Fix (two parts)
worker-entry.tsresolveRawJobData:JOB_DATA_REDIS_KEYis now authoritative when present. The router sets exactly one channel per spawn, so an inlineJOB_DATAco-present with a key can only be a stale snapshot artifact → ignored. Fixes snapshots already baked in prod without pruning.worker-snapshots.tscommitWorkerSnapshot: inspect the container and commit a scrubbedConfig.Env(docker-modem_query/_bodyidiom), stripping job env + secrets (static deny-list + dynamic credential names fromCASCADE_CREDENTIAL_KEYS) while preservingPATH/PLAYWRIGHT_BROWSERS_PATH/NODE_*. On inspect failure → bare commit + Sentry (snapshot_env_scrub_inspect_failed), never silently baking secrets or losing the snapshot.Blast radius
Any second+ agent on a work item where an earlier run baked inline
JOB_DATAand the later run offloads (>96 KiB):planning→splitting,planning→implementation,implementation→review, etc. Every non-GitHub adapter embeds the pre-resolvedtriggerResult, so all are exposed.Tests
+2worker-entry: Redis key wins over stale inline; inline-only path unchanged.+6worker-snapshots:scrubSnapshotEnv(deny-list / credential-keys /=-in-value / no-=); scrubbed-commit_query/_bodyshape; inspect-failure fallback + Sentry.typecheck,lint, unit-api (1814) + unit-core (6233) green; full pre-push suite green.Ops after deploy
cascade-snapshot-*images — they carry baked secrets + staleJOB_DATA:docker images 'cascade-snapshot-*' -q | xargs -r docker rmi -f.docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON'→ prints nothing.Live unblock (already done, no deploy)
Pruned the MNG-1702 snapshot and force-triggered
splittingfor both items via CLI → they re-ran as real splitting on the base image.🤖 Generated with Claude Code