Release: snapshot stale-JOB_DATA + env-scrub fix (ucho MNG-1622/1702)#1471
Merged
Conversation
Bumps [js-toml](https://github.com/sunnyadn/js-toml) from 1.0.3 to 1.1.3. - [Release notes](https://github.com/sunnyadn/js-toml/releases) - [Changelog](https://github.com/sunnyadn/js-toml/blob/main/CHANGELOG.md) - [Commits](sunnyadn/js-toml@v1.0.3...v1.1.3) --- updated-dependencies: - dependency-name: js-toml dependency-version: 1.1.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…nt (ucho MNG-1622/1702) Prod: MNG-1622 and MNG-1702 were moved into Splitting; the label triggered but no backlog story cards were produced. Root cause (verified via prod logs + live `docker image inspect`): the splitting workers RE-RAN the planning agent on a stale payload. Mechanism — an interaction between the MNG-1660 JOB_DATA offload and snapshot reuse: - An earlier `planning` run of the same issue passed its payload INLINE (`-e JOB_DATA=<json>`). Its container was committed to a snapshot image, and `docker commit` bakes the container's `Config.Env` — including `JOB_DATA=<planning json>` (and every secret: DATABASE_URL, REDIS_URL, GitHub /Linear/OpenAI creds, Claude OAuth token) — into the image. - The `splitting` run's payload was large, so the router OFFLOADED it to Redis and set only `JOB_DATA_REDIS_KEY`, not `JOB_DATA` (worker-env.ts is if/else). - The splitting worker spawned FROM that snapshot image. `docker run -e JOB_DATA_REDIS_KEY=...` does not clear the baked `JOB_DATA`, so the container had both. - `resolveRawJobData()` read inline `JOB_DATA` first → the STALE planning payload (triggerResult.agentType 'planning') → ran planning → no cards. This regressed from MNG-1660: before it, every run set inline JOB_DATA which always overrode the baked value. Two-part fix: 1. Correctness (self-healing, covers snapshots already baked in prod): worker-entry.ts resolveRawJobData now treats JOB_DATA_REDIS_KEY as AUTHORITATIVE when present — the router sets exactly one channel per spawn, so an inline JOB_DATA co-present with a key can only be a stale snapshot artifact and is ignored. 2. Hygiene + security: worker-snapshots.ts commitWorkerSnapshot now inspects the container and commits a SCRUBBED Config.Env (docker-modem _query/_body idiom), stripping job env (JOB_DATA*/JOB_ID/JOB_TYPE) and secrets (DATABASE_URL, REDIS_URL, CREDENTIAL_MASTER_KEY, CLAUDE_CODE_OAUTH_TOKEN, and the dynamic project-credential names from CASCADE_CREDENTIAL_KEYS) while preserving PATH / PLAYWRIGHT_BROWSERS_PATH / NODE_*. Closes the credential-in-image leak; on inspect failure, falls back to a bare commit + Sentry (snapshot_env_scrub_ inspect_failed) rather than silently baking secrets or losing the snapshot. Blast radius: any second+ agent on a work item where an earlier run baked inline JOB_DATA and the later run offloads (>96 KiB) — planning→splitting, planning→implementation, implementation→review, etc. Tests: +2 worker-entry (Redis key wins over stale inline; inline-only path unchanged); +6 worker-snapshots (scrubSnapshotEnv deny-list/credential-keys/=-in -value/no-= edge cases; scrubbed-commit _query/_body shape; inspect-failure fallback + Sentry). typecheck, lint, unit-api (1814) and unit-core (6233) green. Ops after deploy: prune existing cascade-snapshot-* images (they carry baked secrets + stale JOB_DATA); rotate any credentials baked into images shared off-host. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
…stale offload key Address review feedback on PR #1470. 1. worker-snapshots.ts (BLOCKING): the scrubbed-env commit wrote the Docker ImageCommit body as `{ Config: { Env } }`. Docker's POST /commit request body IS a ContainerConfig — Env is top-level. docker-modem serializes `_body` verbatim (modem.js:208), so the nested `Config` was an unknown field Docker ignored, leaving Env nil and baking the container's raw (unscrubbed) env while the `strippedKeys` log reported a false positive. Send `_body: { Env: scrubbed }`. 2. worker-entry.ts (SHOULD_FIX): making JOB_DATA_REDIS_KEY unconditionally authoritative fixed the forward case but regressed the reverse one — a snapshot baked from a prior OFFLOADED run carries a stale JOB_DATA_REDIS_KEY that a later INLINE run does not clear, so reading it hit a key the prior run already deleted → process.exit(1). Trust the key only when it names THIS job (`key === buildJobDataRedisKey(JOB_ID)`); otherwise treat it as a stale baked artifact and fall through to inline. Fixes forward AND reverse. 3. Tests: assert the corrected `_body.Env` shape; add a reverse-case regression (stale baked key mismatching JOB_ID → inline wins, no crash); make the missing/expired offload test use a key that matches JOB_ID (as production always does). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The scrubbed-env commit passed a partial `_body: { Env: scrubbed }`. Docker's
POST /commit uses a non-nil body config as the new image's COMPLETE runtime
config (moby daemon/commit.go falls back to container.Config only when the body
is nil; image.NewChildImage assigns Config: child.Config with no parent merge),
so this wiped Cmd/WorkingDir/User/Labels/ExposedPorts baked by Dockerfile.worker.
Every reused snapshot then failed to launch ("No command specified", User→root,
WorkingDir→/) — regressing the exact snapshot-reuse path this PR targets.
Widen the inspect extraction to keep the whole Config and spread it into the
commit body, replacing only Env: `_body: { ...fullConfig, Env: scrubbed }`. The
inspect-failure bare-commit fallback was already config-preserving; this removes
the success/failure asymmetry. Unit test now asserts Cmd/WorkingDir/User/Labels/
ExposedPorts survive the scrubbed commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…changes`)
The prior scrub committed a filtered `Env` array in the commit body — a proven
NO-OP. Verified against a live Docker daemon: moby re-appends every container
env var whose key is absent from the body list, so the committed image is
byte-identical to a bare commit — `JOB_DATA` and every credential survive.
docker commit cannot REMOVE an env var.
Switch to the supported mechanism: `container.commit({ changes: ['ENV KEY='] })`
blanks the value of each deny-listed / credential key actually present. Empty
`JOB_DATA` is falsy (worker ignores it); an emptied secret carries no value to
leak; `Cmd`/`Entrypoint`/`WorkingDir`/all other env are preserved (bare commit
keeps the full container config, `changes` overlays only the named keys).
End-to-end verified on a live daemon (exact code path): JOB_DATA, DATABASE_URL,
GITHUB_TOKEN_IMPLEMENTER, LINEAR_API_KEY all committed as `KEY=` (blank);
PATH/SENTRY_DSN/Cmd preserved.
- New `buildSnapshotEnvScrubChanges(env)` → `ENV KEY=` list for present deny keys.
- `commitScrubbed` uses `{repo, tag, changes}`; drops the `_query`/`_body` shape.
- Tests assert the `changes` array (present secret/job keys blanked, safe keys
untouched) instead of the unverifiable `_body.Env` mock args.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
…-jobdata-and-env-scrub fix(worker): stop snapshot-baked stale JOB_DATA running the wrong agent (ucho MNG-1622/1702)
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Promotes #1470 from
devtomain(prod).Fixes the snapshot-reuse × JOB_DATA-offload bug where splitting workers re-ran the prior
planningagent on a stale baked payload (ucho MNG-1622/1702 produced no story cards):resolveRawJobDatatreatsJOB_DATA_REDIS_KEYas authoritative when present (ignores stale baked inlineJOB_DATA).changes: ['ENV KEY='](empirically verified on a live daemon — the priorEnv-body scrub was a proven no-op). Closes the credential-in-image leak.Dev CI + Build-and-Deploy(Dev) green.
Ops after this deploys: prune existing
cascade-snapshot-*images on prod (they still bake secrets + staleJOB_DATAfrom before this fix), and rotate any creds if snapshots ever left the host.🤖 Generated with Claude Code