Skip to content

Release: snapshot stale-JOB_DATA + env-scrub fix (ucho MNG-1622/1702)#1471

Merged
zbigniewsobiecki merged 6 commits into
mainfrom
dev
Jul 1, 2026
Merged

Release: snapshot stale-JOB_DATA + env-scrub fix (ucho MNG-1622/1702)#1471
zbigniewsobiecki merged 6 commits into
mainfrom
dev

Conversation

@zbigniewsobiecki

Copy link
Copy Markdown
Member

Promotes #1470 from dev to main (prod).

Fixes the snapshot-reuse × JOB_DATA-offload bug where splitting workers re-ran the prior planning agent on a stale baked payload (ucho MNG-1622/1702 produced no story cards):

  • Correctness: resolveRawJobData treats JOB_DATA_REDIS_KEY as authoritative when present (ignores stale baked inline JOB_DATA).
  • Security/hygiene: snapshot commit now blanks job + secret env via changes: ['ENV KEY='] (empirically verified on a live daemon — the prior Env-body scrub was a proven no-op). Closes the credential-in-image leak.

Dev CI + Build-and-Deploy(Dev) green.

Ops after this deploys: prune existing cascade-snapshot-* images on prod (they still bake secrets + stale JOB_DATA from before this fix), and rotate any creds if snapshots ever left the host.

🤖 Generated with Claude Code

dependabot Bot and others added 6 commits July 1, 2026 11:31
Bumps [js-toml](https://github.com/sunnyadn/js-toml) from 1.0.3 to 1.1.3.
- [Release notes](https://github.com/sunnyadn/js-toml/releases)
- [Changelog](https://github.com/sunnyadn/js-toml/blob/main/CHANGELOG.md)
- [Commits](sunnyadn/js-toml@v1.0.3...v1.1.3)

---
updated-dependencies:
- dependency-name: js-toml
  dependency-version: 1.1.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…nt (ucho MNG-1622/1702)

Prod: MNG-1622 and MNG-1702 were moved into Splitting; the label triggered but
no backlog story cards were produced. Root cause (verified via prod logs +
live `docker image inspect`): the splitting workers RE-RAN the planning agent
on a stale payload.

Mechanism — an interaction between the MNG-1660 JOB_DATA offload and snapshot
reuse:
- An earlier `planning` run of the same issue passed its payload INLINE
  (`-e JOB_DATA=<json>`). Its container was committed to a snapshot image, and
  `docker commit` bakes the container's `Config.Env` — including
  `JOB_DATA=<planning json>` (and every secret: DATABASE_URL, REDIS_URL, GitHub
  /Linear/OpenAI creds, Claude OAuth token) — into the image.
- The `splitting` run's payload was large, so the router OFFLOADED it to Redis
  and set only `JOB_DATA_REDIS_KEY`, not `JOB_DATA` (worker-env.ts is if/else).
- The splitting worker spawned FROM that snapshot image. `docker run
  -e JOB_DATA_REDIS_KEY=...` does not clear the baked `JOB_DATA`, so the
  container had both.
- `resolveRawJobData()` read inline `JOB_DATA` first → the STALE planning
  payload (triggerResult.agentType 'planning') → ran planning → no cards.
This regressed from MNG-1660: before it, every run set inline JOB_DATA which
always overrode the baked value.

Two-part fix:
1. Correctness (self-healing, covers snapshots already baked in prod):
   worker-entry.ts resolveRawJobData now treats JOB_DATA_REDIS_KEY as
   AUTHORITATIVE when present — the router sets exactly one channel per spawn,
   so an inline JOB_DATA co-present with a key can only be a stale snapshot
   artifact and is ignored.
2. Hygiene + security: worker-snapshots.ts commitWorkerSnapshot now inspects the
   container and commits a SCRUBBED Config.Env (docker-modem _query/_body idiom),
   stripping job env (JOB_DATA*/JOB_ID/JOB_TYPE) and secrets (DATABASE_URL,
   REDIS_URL, CREDENTIAL_MASTER_KEY, CLAUDE_CODE_OAUTH_TOKEN, and the dynamic
   project-credential names from CASCADE_CREDENTIAL_KEYS) while preserving PATH /
   PLAYWRIGHT_BROWSERS_PATH / NODE_*. Closes the credential-in-image leak; on
   inspect failure, falls back to a bare commit + Sentry (snapshot_env_scrub_
   inspect_failed) rather than silently baking secrets or losing the snapshot.

Blast radius: any second+ agent on a work item where an earlier run baked inline
JOB_DATA and the later run offloads (>96 KiB) — planning→splitting,
planning→implementation, implementation→review, etc.

Tests: +2 worker-entry (Redis key wins over stale inline; inline-only path
unchanged); +6 worker-snapshots (scrubSnapshotEnv deny-list/credential-keys/=-in
-value/no-= edge cases; scrubbed-commit _query/_body shape; inspect-failure
fallback + Sentry). typecheck, lint, unit-api (1814) and unit-core (6233) green.

Ops after deploy: prune existing cascade-snapshot-* images (they carry baked
secrets + stale JOB_DATA); rotate any credentials baked into images shared
off-host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
…stale offload key

Address review feedback on PR #1470.

1. worker-snapshots.ts (BLOCKING): the scrubbed-env commit wrote the Docker
   ImageCommit body as `{ Config: { Env } }`. Docker's POST /commit request body
   IS a ContainerConfig — Env is top-level. docker-modem serializes `_body`
   verbatim (modem.js:208), so the nested `Config` was an unknown field Docker
   ignored, leaving Env nil and baking the container's raw (unscrubbed) env while
   the `strippedKeys` log reported a false positive. Send `_body: { Env: scrubbed }`.

2. worker-entry.ts (SHOULD_FIX): making JOB_DATA_REDIS_KEY unconditionally
   authoritative fixed the forward case but regressed the reverse one — a snapshot
   baked from a prior OFFLOADED run carries a stale JOB_DATA_REDIS_KEY that a later
   INLINE run does not clear, so reading it hit a key the prior run already deleted
   → process.exit(1). Trust the key only when it names THIS job
   (`key === buildJobDataRedisKey(JOB_ID)`); otherwise treat it as a stale baked
   artifact and fall through to inline. Fixes forward AND reverse.

3. Tests: assert the corrected `_body.Env` shape; add a reverse-case regression
   (stale baked key mismatching JOB_ID → inline wins, no crash); make the
   missing/expired offload test use a key that matches JOB_ID (as production always
   does).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The scrubbed-env commit passed a partial `_body: { Env: scrubbed }`. Docker's
POST /commit uses a non-nil body config as the new image's COMPLETE runtime
config (moby daemon/commit.go falls back to container.Config only when the body
is nil; image.NewChildImage assigns Config: child.Config with no parent merge),
so this wiped Cmd/WorkingDir/User/Labels/ExposedPorts baked by Dockerfile.worker.
Every reused snapshot then failed to launch ("No command specified", User→root,
WorkingDir→/) — regressing the exact snapshot-reuse path this PR targets.

Widen the inspect extraction to keep the whole Config and spread it into the
commit body, replacing only Env: `_body: { ...fullConfig, Env: scrubbed }`. The
inspect-failure bare-commit fallback was already config-preserving; this removes
the success/failure asymmetry. Unit test now asserts Cmd/WorkingDir/User/Labels/
ExposedPorts survive the scrubbed commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…changes`)

The prior scrub committed a filtered `Env` array in the commit body — a proven
NO-OP. Verified against a live Docker daemon: moby re-appends every container
env var whose key is absent from the body list, so the committed image is
byte-identical to a bare commit — `JOB_DATA` and every credential survive.
docker commit cannot REMOVE an env var.

Switch to the supported mechanism: `container.commit({ changes: ['ENV KEY='] })`
blanks the value of each deny-listed / credential key actually present. Empty
`JOB_DATA` is falsy (worker ignores it); an emptied secret carries no value to
leak; `Cmd`/`Entrypoint`/`WorkingDir`/all other env are preserved (bare commit
keeps the full container config, `changes` overlays only the named keys).

End-to-end verified on a live daemon (exact code path): JOB_DATA, DATABASE_URL,
GITHUB_TOKEN_IMPLEMENTER, LINEAR_API_KEY all committed as `KEY=` (blank);
PATH/SENTRY_DSN/Cmd preserved.

- New `buildSnapshotEnvScrubChanges(env)` → `ENV KEY=` list for present deny keys.
- `commitScrubbed` uses `{repo, tag, changes}`; drops the `_query`/`_body` shape.
- Tests assert the `changes` array (present secret/job keys blanked, safe keys
  untouched) instead of the unverifiable `_body.Env` mock args.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
…-jobdata-and-env-scrub

fix(worker): stop snapshot-baked stale JOB_DATA running the wrong agent (ucho MNG-1622/1702)
@zbigniewsobiecki zbigniewsobiecki merged commit 005a3be into main Jul 1, 2026
14 checks passed
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.75000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/worker-entry.ts 66.66% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant