feat(gastown): batch gastown improvements — landing MR loop fix, env var propagation, container tooling, triage dispatch, dead code cleanup, UI polish#2374

Open
jrf0110 wants to merge 9 commits into main from gastown-staging

@jrf0110 jrf0110 commented Apr 13, 2026

Summary

Batch merge of 9 PRs from gastown-staging into main, covering dispatch reliability, container env propagation, container image tooling, dead code removal, and UI improvements.

Bug Fixes

  • Break create_landing_mr infinite loop (#2371) — Fixed the reconciler retrying landing MR creation every 5s without a circuit breaker (one town accumulated 5,335 actions over 41 hours). Added deduplication, max attempts with exponential cooldown (capped at 30min), PR URL validation guard, and a race condition fix. New FailConvoy action type for explicit convoy failure with reason tracking.
  • Propagate custom env_vars to running containers on settings save (#2366) — Custom environment variables set in town settings were not reaching already-running container processes until restart + re-dispatch. Three gaps fixed: syncTownConfigToProcessEnv() now writes custom vars to process.env (with cleanup tracking), syncConfigToContainer() persists custom vars to DO storage, and updateAgentModel() hot-swap overlays fresh custom vars over stale startup snapshots.
  • Prevent triage batch bead dispatch loop with wrong system prompt (#2321) — Two-layer fix: (A) mark triage batch bead as in_progress before startAgentInContainer() to prevent Rule 2 re-dispatch with generic polecat prompt, and (B) defense-in-depth detection in dispatchAgent to inject the correct triage system prompt even if Rule 2 fires.

Features

  • Add Java JDK to container images (#2066) — Install default-jdk (OpenJDK) in both prod and dev Dockerfiles to support Java project builds and runtime.
  • Add cmake and pkg-config to container images (#2060) — Install cmake and pkg-config in both prod and dev Dockerfiles for native dependency builds.
  • Add town ID copy badge and Debug settings section (#2296) — Copyable town ID badge in the Overview page sticky top bar, plus a new "Debug" section in Town Settings with a "Copy debug info" button that serializes a sanitized JSON snapshot (tokens/emails omitted, env var keys only).
  • Add dismiss actions for failed beads on Merge Queue page (#2295) — Individual dismiss button per failed bead row, bulk "Dismiss all failed" button with concurrent close + loading/toast feedback, and a view button fallback fix for orphaned MR beads.

Cleanup

  • Remove dead code from patrol/scheduling/review-queue (#2339) — Removed unused GUPP_WARN_MS export, updated stale comments referencing removed functions (witnessPatrol, schedulePendingWork, processReviewQueue, recoverStuckReviews) to describe the current reconciler/alarm-based architecture.
  • Remove dead popReviewQueue and update stale comments (#2318) — Deleted dead popReviewQueue() and its Town.do.ts wrapper, removed dead completeReview() wrapper, updated stale comments across review-queue, reconciler, beads, and container-dispatch modules.

Verification

  • All 9 constituent PRs were individually reviewed and merged to gastown-staging
  • Typecheck and tests passed on each PR at merge time
  • Constituent PRs linked above contain detailed per-change verification

Visual Changes

  • Copyable town ID badge in Overview page top bar (#2296)
  • Debug settings section with "Copy debug info" button in Town Settings (#2296)
  • Dismiss (X) button on individual failed bead rows and bulk "Dismiss all failed" button on Merge Queue page (#2295)

Reviewer Notes

  • The 9 PRs were staged on gastown-staging to batch test interactions before landing on main
  • The create_landing_mr loop fix (#2371) stores retry metadata (landing_mr_attempts, last_landing_mr_attempt_at) in the convoy bead's existing metadata JSON column, so no schema migration is needed
  • The env var propagation fix (#2366) uses module-level Set tracking in control-server.ts (this resets on container restart, which is fine because process.env also resets) and DO durable storage for cross-restart persistence
  • Custom env var keys that collide with infra keys (GIT_TOKEN, KILOCODE_TOKEN, etc.) are silently skipped to maintain infra precedence
  • The debug JSON output intentionally excludes PII: emails and tokens appear only as boolean _set fields
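
For reviewers who want a concrete picture of the sanitized debug payload, here is a minimal sketch of the idea. The `TownSettingsLike` and `DebugSnapshot` shapes, their field names, and both function names are hypothetical and not taken from the PR; only the behavior (boolean `_set` flags, env var keys without values, credentials stripped from `git_url`) follows the notes above.

```typescript
// Hypothetical settings shape; the real town config has different fields.
interface TownSettingsLike {
  town_id: string;
  owner_email?: string;
  git_token?: string;
  git_url?: string;
  env_vars: Record<string, string>;
}

interface DebugSnapshot {
  town_id: string;
  owner_email_set: boolean;
  git_token_set: boolean;
  git_url: string | null;
  env_var_keys: string[];
}

// Drop any "user:password@" section that follows the URL scheme.
function stripGitUrlCredentials(gitUrl: string): string {
  return gitUrl.replace(/^([a-z][a-z0-9+.-]*:\/\/)[^/@]*@/i, '$1');
}

// Build a snapshot that is safe to paste into a bug report:
// secrets become presence flags and env vars are reduced to their keys.
function buildDebugSnapshot(settings: TownSettingsLike): DebugSnapshot {
  return {
    town_id: settings.town_id,
    owner_email_set: Boolean(settings.owner_email),
    git_token_set: Boolean(settings.git_token),
    git_url: settings.git_url ? stripGitUrlCredentials(settings.git_url) : null,
    env_var_keys: Object.keys(settings.env_vars).sort(),
  };
}
```

Building the snapshot from an explicit allowlist of fields, rather than deleting known-sensitive fields from a copy, means a newly added secret is omitted by default.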

jrf0110 and others added 9 commits April 10, 2026 15:32
#2295)

* feat(merges): add dismiss actions for failed beads on Merge Queue page

- Add individual Dismiss (X) button to each failed bead row in AttentionItemRow
- Add bulk 'Dismiss all failed (N)' button to NeedsAttention header area
- Fix view button fallback: open MR bead when sourceBead is null (orphaned beads)
- Both individual and bulk dismiss call updateBead with status: 'closed' on the MR bead
- Dismiss all shows loading spinner and toast on completion/error
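
The bulk dismiss described above can be sketched roughly as follows. This is an illustration only: the `UpdateBead` signature and the `dismissAllFailed` name are assumptions, and the real component wires the result into a spinner and toast rather than returning it.

```typescript
// Hypothetical mutation signature; the real updateBead call may take other fields.
type UpdateBead = (input: { beadId: string; status: 'closed' }) => Promise<void>;

interface DismissResult {
  closed: string[];
  failed: string[];
}

// Close all failed MR beads concurrently. One failing request does not
// abort the others; each outcome is recorded per bead.
async function dismissAllFailed(beadIds: string[], updateBead: UpdateBead): Promise<DismissResult> {
  const outcomes = await Promise.all(
    beadIds.map(async beadId => {
      try {
        await updateBead({ beadId, status: 'closed' });
        return { beadId, ok: true };
      } catch {
        return { beadId, ok: false };
      }
    })
  );
  return {
    closed: outcomes.filter(o => o.ok).map(o => o.beadId),
    failed: outcomes.filter(o => !o.ok).map(o => o.beadId),
  };
}
```

A caller could show a success toast when `failed` is empty and an error toast listing the remaining beads otherwise.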

* fix(merges): use rigId directly in openDrawer to fix TS typecheck

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
* feat(gastown): add town ID copy badge and Debug settings section

* fix(gastown): sanitize debug payload — strip git_url credentials and replace git_author_name with presence flag

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…2318)

* chore(gastown): remove dead popReviewQueue and update stale comments

Remove popReviewQueue() from review-queue.ts and its wrapper from Town.do.ts —
confirmed no callers in the tRPC router, reconciler, or anywhere else.

Also remove the Town.do.ts completeReview() wrapper (also had no external callers)
and update stale comments across review-queue.ts, Town.do.ts, reconciler.ts,
beads.ts, and container-dispatch.ts that referenced old patrol/scheduling
functions (feedStrandedConvoys, rehookOrphanedBeads, schedulePendingWork,
recoverStuckReviews, witnessPatrol, processReviewQueue) to reflect the current
reconciler-based architecture.

Closes #1403

* ci: retrigger Kilo Code Review (previous run failed due to transient clone error)

* test(gastown): update integration tests to remove removed popReviewQueue/completeReview APIs

popReviewQueue() and completeReview() were removed from TownDO as dead
code. Update integration tests to use listBeads({ type: 'merge_request' })
instead of popReviewQueue() to observe MR bead state, and remove the
regression guard test for completeReview which is no longer testable
via the TownDO public API.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…em prompt (#2321)

* fix(gastown): prevent triage batch bead dispatch loop with wrong system prompt

Option A: Mark triage batch bead as in_progress immediately after hookBead()
in maybeDispatchTriageAgent(), before awaiting startAgentInContainer(). This
prevents reconciler Rule 2 (idle agent + open hooked bead → dispatch_agent)
from re-dispatching the triage bead with the generic polecat prompt on the
next tick when container start fails. Rule 3 (stale in_progress, 5-min
timeout) resets it to open for a clean retry via maybeDispatchTriageAgent.

Option B (defense-in-depth): In applyActionCtx.dispatchAgent, detect triage
batch beads (gt:triage label + created_by='patrol') and inject the triage
system prompt, ensuring the polecat gets the correct tools even if Rule 2
somehow fires against an open triage batch bead.
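
As a rough sketch of the Option B check, assuming a simplified bead shape: the `BeadLike` interface, the function names, and the prompt strings below are hypothetical. Only the detection rule (gt:triage label plus created_by = 'patrol') comes from the description above.

```typescript
// Simplified bead shape for illustration; real beads carry many more fields.
interface BeadLike {
  labels: string[];
  created_by: string;
}

const TRIAGE_LABEL = 'gt:triage';

// A triage batch bead is one that patrol created and tagged for triage.
function isTriageBatchBead(bead: BeadLike): boolean {
  return bead.labels.includes(TRIAGE_LABEL) && bead.created_by === 'patrol';
}

// Pick the system prompt at dispatch time, so a triage bead gets the
// triage prompt no matter which reconciler rule triggered the dispatch.
function selectSystemPrompt(
  bead: BeadLike,
  prompts: { triage: string; polecat: string }
): string {
  return isTriageBatchBead(bead) ? prompts.triage : prompts.polecat;
}
```

Requiring both conditions keeps a user-created bead that merely carries the gt:triage label on the generic polecat prompt.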

Fixes #1958

* fix(gastown): set rig_id on triage batch bead so reconciler Rule 1 can re-dispatch after timeout

Without rig_id, when Rule 3 resets an abandoned in_progress triage batch
bead to 'open', Rule 1 skips it (requires rig_id IS NOT NULL). This left
the bead permanently 'open', blocking maybeDispatchTriageAgent from
creating a replacement. Setting rig_id ensures Rule 1 can re-dispatch
the existing bead (with triage system prompt injected via Option B).

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
Add remaining build-essentials packages (cmake, pkg-config) to both
prod and dev Dockerfiles. build-essential and libssl-dev were already
present.
Install default-jdk (OpenJDK) in both prod and dev Dockerfiles to
support Java project builds and runtime.
…ings save (#2366)

* fix(gastown): propagate custom env_vars to running containers on settings save

Three gaps fixed:

1. syncTownConfigToProcessEnv() now applies custom env_vars from town
   config to process.env, with tracking of previously-applied keys so
   removed vars are deleted from process.env.

2. syncConfigToContainer() now persists custom env_vars to
   TownContainerDO storage (via container.setEnvVar/deleteEnvVar) so
   they survive container restarts. Previously-persisted custom keys
   are tracked in DO storage and cleaned up on removal.

3. updateAgentModel() hot-swap now overlays fresh custom env_vars from
   getCurrentTownConfig() over the stale startupEnv snapshot. Infra
   keys in LIVE_ENV_KEYS always take precedence.
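
The key-tracking idea in gap 1 might look something like the sketch below. It is not the code in control-server.ts: `syncCustomEnvVars` and `EnvLike` are hypothetical names, and the environment is passed in as a plain object instead of touching process.env so the example stays self-contained.

```typescript
type EnvLike = Record<string, string | undefined>;

// Keys written by the previous sync. Module-level on purpose: it resets
// together with the process, just like the environment it describes.
let lastAppliedKeys = new Set<string>();

// Apply custom env vars to `env`, remove keys that were applied earlier
// but are no longer configured, and never touch reserved infra keys.
function syncCustomEnvVars(
  env: EnvLike,
  customEnvVars: Record<string, string>,
  reservedKeys: ReadonlySet<string>
): void {
  const applied = new Set<string>();
  for (const [key, value] of Object.entries(customEnvVars)) {
    if (reservedKeys.has(key)) continue; // infra keys keep precedence
    env[key] = value;
    applied.add(key);
  }
  for (const key of lastAppliedKeys) {
    if (!applied.has(key) && !reservedKeys.has(key)) delete env[key];
  }
  lastAppliedKeys = applied;
}
```

Tracking only the keys this function wrote means the cleanup pass cannot delete variables that came from the container's own startup environment.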

* fix(gastown): guard custom env_vars against reserved key override

- control-server: export getLastAppliedEnvVarKeys() for process-manager
- process-manager: delete stale custom keys from hotSwapEnv on hot-swap
- Town.do: skip RESERVED_ENV_KEYS when setting custom env_vars on container

Addresses 3 review warnings about custom env_vars overriding infra keys.

* fix: skip reserved env keys in prevCustomKeys cleanup loop

prevCustomKeys may contain reserved keys persisted by the previous
implementation (before the RESERVED_ENV_KEYS filter was added). Without
this guard the cleanup loop would delete managed infra values like
KILOCODE_TOKEN that were just written by envMapping.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…1403) (#2339)

chore(gastown): remove dead GUPP_WARN_MS export and update stale patrol/queue comments

- Remove unused GUPP_WARN_MS constant export from patrol.ts (never referenced outside the file)
- Update completion-reporter.ts JSDoc: replace stale witnessPatrol/schedulePendingWork references with reconciler-based description
- Update control-server.ts comments: replace stale processReviewQueue/recoverStuckReviews references with current TownDO terminology

Part of issue #1403 dead code cleanup.

Co-authored-by: John Fawcett <john@kilcoode.ai>
* fix(gastown): break create_landing_mr infinite loop (#2260)

Add circuit breaker for landing MR creation to prevent runaway retry
loops when convoys have no PR URLs. A town accumulated 5,335 failed
actions over 41 hours before this fix.

- Fix 1: Deduplicate MR bead creation — skip if an open/in_progress
  landing MR already exists for the convoy
- Fix 2: Max 5 landing MR attempts with exponential cooldown (30s base,
  30min cap), fail the convoy when exhausted
- Fix 3: PR URL validation guard — skip landing MR creation when no
  tracked beads have a pr_url
- Fix 4: Move convoy fail check before update_convoy_progress to
  prevent the race where progress updates are emitted for convoys
  about to be failed/closed

Store landing_mr_attempts and last_landing_mr_attempt_at in the
convoy bead's metadata JSON field (no schema migration needed).
Add FailConvoy action type for explicit convoy failure.
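
To make Fix 2 concrete, here is a minimal sketch of a capped exponential cooldown. The constants mirror the numbers in this commit message (5 attempts, 30s base, 30min cap), but the doubling factor, the function and type names, and the assumption that last_landing_mr_attempt_at is stored as epoch milliseconds are all guesses made for illustration.

```typescript
const MAX_LANDING_MR_ATTEMPTS = 5;
const BASE_COOLDOWN_MS = 30_000; // 30s
const MAX_COOLDOWN_MS = 30 * 60_000; // 30min

// Cooldown to observe after `attempts` failed attempts.
// Assumes the delay doubles per attempt; the real growth factor may differ.
function landingMrCooldownMs(attempts: number): number {
  if (attempts <= 0) return 0;
  return Math.min(BASE_COOLDOWN_MS * 2 ** (attempts - 1), MAX_COOLDOWN_MS);
}

interface ConvoyRetryMeta {
  landing_mr_attempts?: number;
  last_landing_mr_attempt_at?: number; // assumed epoch milliseconds
}

type LandingMrDecision = 'create' | 'wait' | 'fail';

// Decide what the reconciler should do on this tick for one convoy.
function decideLandingMr(meta: ConvoyRetryMeta, nowMs: number): LandingMrDecision {
  const attempts = meta.landing_mr_attempts ?? 0;
  if (attempts >= MAX_LANDING_MR_ATTEMPTS) return 'fail';
  const last = meta.last_landing_mr_attempt_at;
  if (last !== undefined && nowMs - last < landingMrCooldownMs(attempts)) return 'wait';
  return 'create';
}
```

Because both counters live in the bead's metadata JSON, a decision like this can be recomputed from stored state on every tick without any in-memory timers.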

* fix(gastown): move max-attempts check after landing MR status lookup

The max landing MR attempts guard was firing before checking whether the
final landing MR was still active or already merged, making the last
allowed attempt impossible to succeed. Now we check landing MR status
first and only fail the convoy when no landing MR is active or merged.
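
The ordering described in this follow-up commit can be sketched as a small decision function. The 'open' and 'in_progress' statuses appear in the diff; the other status values and the action names (apart from create_landing_mr) are assumptions for illustration.

```typescript
// 'merged', 'failed' and 'closed' are assumed status names.
type LandingMrStatus = 'open' | 'in_progress' | 'merged' | 'failed' | 'closed';

// Hypothetical action labels; only create_landing_mr is named in the PR.
type ConvoyNextStep = 'wait_for_landing' | 'complete_convoy' | 'fail_convoy' | 'create_landing_mr';

function nextConvoyStep(
  landingMrs: { status: LandingMrStatus }[],
  attempts: number,
  maxAttempts: number
): ConvoyNextStep {
  // 1. An active landing MR may still succeed, so wait for it.
  if (landingMrs.some(mr => mr.status === 'open' || mr.status === 'in_progress')) {
    return 'wait_for_landing';
  }
  // 2. A merged landing MR means the convoy has landed.
  if (landingMrs.some(mr => mr.status === 'merged')) return 'complete_convoy';
  // 3. Only now apply the circuit breaker.
  if (attempts >= maxAttempts) return 'fail_convoy';
  return 'create_landing_mr';
}
```

With the attempt counter already at the limit, an in-flight final attempt still yields `wait_for_landing`, which is the case the earlier ordering got wrong.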

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
// If a landing MR is active (open or in_progress), wait for it
const hasActiveLanding = landingMrs.some(
mr => mr.status === 'open' || mr.status === 'in_progress'
if ((convoyBeadsWithPr[0]?.cnt ?? 0) === 0) {

CRITICAL: Final landing is blocked for direct-merge convoys

This guard requires at least one tracked bead MR to have a pr_url, but review-then-land intermediate steps in merge_strategy = 'direct' are merged straight into the convoy feature branch and never persist a PR URL. Once those beads finish, the convoy stays open forever because we keep skipping create_landing_mr.
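
One way to read this concern is that the PR URL guard should depend on the merge strategy. The sketch below is an assumption about how that could look, not the fix adopted in this PR: the function name and the 'pr' strategy value are hypothetical, and only merge_strategy = 'direct' comes from the comment.

```typescript
// 'direct' is named in the review comment; 'pr' is a placeholder for any
// strategy that does record PR URLs on tracked beads.
type MergeStrategy = 'direct' | 'pr';

// Returns true when landing MR creation should be skipped on this tick.
function shouldSkipLandingMr(beadsWithPrUrl: number, strategy: MergeStrategy): boolean {
  // Direct-merge steps never persist a pr_url, so a count of zero says
  // nothing about whether the convoy is ready to land.
  if (strategy === 'direct') return false;
  return beadsWithPrUrl === 0;
}
```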

}
// Apply current custom env vars, skipping reserved keys.
for (const [key, value] of Object.entries(customEnvVars)) {
if (RESERVED_ENV_KEYS.has(key)) continue;

WARNING: Custom env vars can overwrite runtime routing vars

/sync-config now copies arbitrary env_vars into process.env, but RESERVED_ENV_KEYS does not include runtime-owned keys like GASTOWN_TOWN_ID and GASTOWN_RIG_ID. A town-level env var with one of those names will clobber the values that the pending-nudge routes and plugin client read from process.env, which breaks callbacks for every running agent in the container.

const freshCustomKeySet = new Set<string>();
if (freshEnvVars !== null && typeof freshEnvVars === 'object' && !Array.isArray(freshEnvVars)) {
for (const [key, value] of Object.entries(freshEnvVars as Record<string, unknown>)) {
if (LIVE_ENV_KEYS.has(key)) continue;

WARNING: Hot-swap can still override reserved infra env vars

This overlay only skips LIVE_ENV_KEYS, not the full reserved set that buildAgentEnv and /sync-config protect. A custom env var like GASTOWN_API_URL or GASTOWN_SESSION_TOKEN will be copied into hotSwapEnv here during model updates and can restart the SDK server with broken callback credentials or URLs.
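
A possible shape for the stricter overlay, offered as a sketch and not as the project's code: the contents of both key sets below are illustrative (the real LIVE_ENV_KEYS and RESERVED_ENV_KEYS are defined elsewhere in the repo), and `overlayCustomEnv` is a hypothetical name.

```typescript
// Illustrative contents only; the real sets are defined in the container code.
const LIVE_ENV_KEYS_EXAMPLE = new Set(['GIT_TOKEN', 'KILOCODE_TOKEN']);
const RESERVED_ENV_KEYS_EXAMPLE = new Set(['GASTOWN_API_URL', 'GASTOWN_SESSION_TOKEN']);

// Skip the union of both sets, so hot-swap protects the same keys as the
// other code paths.
const PROTECTED_KEYS = new Set([...LIVE_ENV_KEYS_EXAMPLE, ...RESERVED_ENV_KEYS_EXAMPLE]);

// Overlay fresh custom env vars on top of the startup snapshot,
// ignoring protected keys and non-string values.
function overlayCustomEnv(
  startupEnv: Record<string, string>,
  freshEnvVars: unknown,
  protectedKeys: ReadonlySet<string>
): Record<string, string> {
  const merged: Record<string, string> = { ...startupEnv };
  if (freshEnvVars !== null && typeof freshEnvVars === 'object' && !Array.isArray(freshEnvVars)) {
    for (const [key, value] of Object.entries(freshEnvVars as Record<string, unknown>)) {
      if (protectedKeys.has(key)) continue;
      if (typeof value === 'string') merged[key] = value;
    }
  }
  return merged;
}
```

Deriving the protected set once and sharing it between /sync-config, buildAgentEnv, and the hot-swap path would keep the three filters from drifting apart.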


kilo-code-bot bot commented Apr 13, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

Severity | Count
CRITICAL | 1
WARNING | 2
SUGGESTION | 0

Issue Details

CRITICAL

File | Line | Issue
services/gastown/src/dos/town/reconciler.ts | 1878 | Direct-merge review-then-land convoys can never create the final landing MR because the new guard requires a tracked bead pr_url.

WARNING

File | Line | Issue
services/gastown/container/src/control-server.ts | 150 | /sync-config now allows custom env vars to overwrite runtime-owned keys like GASTOWN_TOWN_ID and GASTOWN_RIG_ID.
services/gastown/container/src/process-manager.ts | 1277 | Model hot-swap still overlays reserved infra env vars from custom env_vars, which can restart agents with broken callback credentials or URLs.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File | Line | Issue
N/A | N/A | No additional observations.

Files Reviewed (18 files)
  • apps/web/src/app/(app)/gastown/[townId]/TownOverviewPageClient.tsx - 0 issues
  • apps/web/src/app/(app)/gastown/[townId]/merges/NeedsAttention.tsx - 0 issues
  • apps/web/src/app/(app)/gastown/[townId]/settings/TownSettingsPageClient.tsx - 0 issues
  • services/gastown/container/Dockerfile - 0 issues
  • services/gastown/container/Dockerfile.dev - 0 issues
  • services/gastown/container/src/completion-reporter.ts - 0 issues
  • services/gastown/container/src/control-server.ts - 1 issue
  • services/gastown/container/src/process-manager.ts - 1 issue
  • services/gastown/src/dos/Town.do.ts - 0 issues
  • services/gastown/src/dos/town/actions.ts - 0 issues
  • services/gastown/src/dos/town/beads.ts - 0 issues
  • services/gastown/src/dos/town/container-dispatch.ts - 0 issues
  • services/gastown/src/dos/town/patrol.ts - 0 issues
  • services/gastown/src/dos/town/reconciler.ts - 1 issue
  • services/gastown/src/dos/town/review-queue.ts - 0 issues
  • services/gastown/test/integration/review-failure.test.ts - 0 issues
  • services/gastown/test/integration/rig-alarm.test.ts - 0 issues
  • services/gastown/test/integration/rig-do.test.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 3,866,793 tokens
