agent: tenancy heartbeat + synthetic first DM (B.1.4)#99
Conversation
Adds two best-effort POSTs the in-VM Phantom fires to phantomd's
metadata gateway after the boot's key milestones:
1. After router.register(slackChannel) in src/index.ts, POST
/v1/tenant_status/agent_ready with the configured transport.
phantom-control's WaitTenantReady RPC unblocks; the wizard
advances out of the waiting_for_agent_ready phase.
2. After SlackHttpChannel.connect() succeeds, send the synthetic
introduction DM to the installer (the user moment the canary
is gated on), then POST /v1/tenant_status/first_dm_sent with
the Slack message_ts. phantomd persists tenants.first_dm_at
so phantom-control can flip applications.status=active.
Wire shape matches phantomd's tenant_status_handler.go exactly:
agent_ready accepts only { transport: "..." }; first_dm_sent accepts
only { slack_message_ts: "..." }. Source-IP authn is enforced by the
gateway, so we do not sign or pass any tenant identifier.
Best-effort posture per architect plan section 8.3 / 8.4: a network
or non-2xx failure logs and continues. METADATA_BASE_URL gates both
heartbeats so self-host installs (where there is no listener) skip
silently.
The first DM is gated by an instance-level firstDmSent flag, so a
transient disconnect+reconnect on the same channel object does not
re-introduce the agent. A process restart resets the flag; the
canary's worse-case is a duplicate intro DM after a hard restart,
preferable to silent UX failure when the user never sees a first
message.
The introduction copy + send logic lives in src/channels/slack-introduction.ts
(separate module so slack-http-receiver.ts stays close to its 300-line
budget; mirrors slack-egress.ts factoring).
Tests added (~397 LoC):
- src/tenancy/__tests__/heartbeat.test.ts (8 tests): URL shape,
body shape, trailing-slash handling, non-2xx swallow, fetch
rejection swallow, empty-ts skip.
- src/channels/__tests__/slack-introduction.test.ts (9 tests):
text composition, sendDm wiring, heartbeat ack, METADATA_BASE_URL
gate, sendDm-returns-null path, sendDm-throws path, token redaction.
- src/channels/__tests__/slack-http-receiver.test.ts (4 tests):
end-to-end DM on connect, reconnect dedup, failure does not break
connect, retry on fresh connect after first send returned no ts.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74e269c873
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (process.env.METADATA_BASE_URL) { | ||
| await reportAgentReady({ | ||
| metadataBaseUrl: process.env.METADATA_BASE_URL, |
There was a problem hiding this comment.
Remove env-only gate from agent_ready heartbeat
This guard skips reportAgentReady unless METADATA_BASE_URL is explicitly set, but the HTTP Slack path can run without that env var because createSlackChannel falls back to DEFAULT_METADATA_BASE_URL (src/channels/slack-channel-factory.ts). In that valid configuration, the tenant boots and fetches metadata successfully but never sends agent_ready, so the control-plane WaitTenantReady step can time out even though the agent is up.
Useful? React with 👍 / 👎.
| if (process.env.METADATA_BASE_URL) { | ||
| await reportFirstDmSent({ | ||
| metadataBaseUrl: process.env.METADATA_BASE_URL, | ||
| slackMessageTs: messageTs, |
There was a problem hiding this comment.
Remove env-only gate from first_dm_sent heartbeat
The same env-var gate here prevents reportFirstDmSent when METADATA_BASE_URL is unset, even though HTTP mode may still be using the default metadata endpoint. In that case the introduction DM can be sent successfully but the first_dm_sent attestation is never posted, which can leave onboarding state machines waiting/failing despite a delivered DM.
Useful? React with 👍 / 👎.
…rnal refs (B.1.4 fix-and-review #2) Section A: env-only gate too strict on the agent_ready and first_dm_sent heartbeats. SLACK_TRANSPORT === "http" is the actual signal that the agent is in an HTTP-transport deployment with a host metadata gateway listening; METADATA_BASE_URL alone misses the case where the channel factory falls back to the link-local default. Fixed both call sites in src/index.ts and src/channels/slack-introduction.ts; both now also fall back to DEFAULT_METADATA_BASE_URL so an unset env is not a gating failure. Section B: heartbeat.test.ts refactored away from the mock(...) as unknown as FetchImpl cast pattern. The wrapper carries internal reset state that has shifted across Bun minor versions and broke a single sibling test on Bun 1.3.13 (passes locally on 1.3.5). Replaced makeFetchOk / makeFetchStatus / makeFetchThrows with plain async functions typed directly as FetchImpl. No mock() involvement, no cast, no per-runtime state. Should pass on every Bun release. Section D: dashboard URL is now parameterized. composeIntroductionText accepts an optional dashboardUrl, dropping the "manage me" line entirely when unset. sendIntroductionDm reads PHANTOM_DASHBOARD_URL via a resolveDashboardUrl helper that validates with new URL(...) and drops a malformed value with a warning rather than injecting unparseable text into the user-facing DM. Caller in phantom-rootfs sets the env var so cloud deployments get the cloud URL; self-hosters with no env var see a shorter, valid DM. Section D (continued): scrubbed internal milestone references from every comment new in the diff. "Phase B.1.4" prefixes dropped, "the canary user" rephrased to "the user", "phantomd" replaced with "the host metadata gateway" or "the host gateway", "architect plan section 12 (option A)" replaced with an inline description of the contract. The diff against main now matches zero internal-data leak patterns under the orchestrator's grep. Tests: - New slack-introduction tests assert the SLACK_TRANSPORT-based gate: unset / socket = no heartbeat, http = heartbeat with default URL when METADATA_BASE_URL is unset, http = heartbeat with set URL when both are set. Plus three new dashboard-URL tests covering valid URL, unset, and malformed. - New heartbeat tests exercise the same FetchImpl signature without mock(); existing assertions intact. Gates: bun lint clean, bun typecheck clean, 1957 / 1957 tests pass.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a94d37a32e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const slackTransport = process.env.SLACK_TRANSPORT; | ||
| if (slackTransport === "http") { |
There was a problem hiding this comment.
Normalize SLACK_TRANSPORT before agent_ready gate
readSlackTransportFromEnv() accepts values like " http " (it trims before deciding transport), so this process can boot SlackHttpChannel even when the raw env var contains whitespace. The new readiness heartbeat gate checks process.env.SLACK_TRANSPORT verbatim, which skips reportAgentReady in that case and can leave the control-plane WaitTenantReady flow stuck even though HTTP transport was selected. Use the normalized transport value (or trim here) before gating.
Useful? React with 👍 / 👎.
| const slackTransport = process.env.SLACK_TRANSPORT; | ||
| if (slackTransport === "http") { |
There was a problem hiding this comment.
Normalize SLACK_TRANSPORT before first_dm_sent gate
This gate has the same raw-env mismatch as the startup heartbeat: transport selection is normalized elsewhere, but sendIntroductionDm checks process.env.SLACK_TRANSPORT without trimming. With SLACK_TRANSPORT=" http ", the introduction DM still sends, but reportFirstDmSent is skipped, so the host never records the first-DM milestone and activation can remain pending indefinitely despite successful user-visible delivery.
Useful? React with 👍 / 👎.
…This.fetch (B.1.4 fix-and-review #3) Round-3 fixes for the heartbeat + first DM landing. F-B14-E (P2): index.ts read process.env.SLACK_TRANSPORT verbatim before the agent_ready heartbeat. readSlackTransportFromEnv() trims its input, so a whitespace-padded SLACK_TRANSPORT=" http " booted the HTTP receiver while the gate skipped reportAgentReady, leaving the operator's readiness RPC waiting indefinitely. The factory and the gate now share the same normalized transport via the helper. F-B14-F (P2): slack-introduction.ts had the same raw-env mismatch on the first_dm_sent gate. Same helper applied here so a successful Slack DM is always paired with the audit signal in HTTP-transport boots. CI test: heartbeat.test.ts "POSTs to ..." tests for both reportAgentReady and reportFirstDmSent rewritten to use a globalThis.fetch override with scalar locals (no closure-captured array). The previous pattern passed on Bun 1.3.5 locally but failed on Bun 1.3.13 in CI; the rewrite is Bun-version-agnostic and the try/finally restore is mandatory so a mid-test failure does not leak the override into sibling tests. Tests: 1959 pass, 10 skip, 0 fail. Lint clean, typecheck clean.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f9c54156f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (slackTransport === "http") { | ||
| await reportAgentReady({ | ||
| metadataBaseUrl: process.env.METADATA_BASE_URL ?? DEFAULT_METADATA_BASE_URL, | ||
| transport: slackTransport, | ||
| }); |
There was a problem hiding this comment.
Send agent_ready only after Slack channel connects
reportAgentReady is emitted immediately after router.register(slackChannel), but the actual Slack startup (auth.test + receiver start) happens later in router.connectAll(), and connect failures are only logged by ChannelRouter.connectAll rather than aborting startup. In environments with a revoked bot token or bind/start failure, this sends a false agent_ready signal and can advance control-plane state even though Slack is not usable. Gate this heartbeat on a successful Slack connect() result (or move it after confirmed connection) to avoid false readiness.
Useful? React with 👍 / 👎.
…igation (B.1.4 fix-and-review #4)
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1409a958e2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| metadataBaseUrl: process.env.METADATA_BASE_URL ?? DEFAULT_METADATA_BASE_URL, | ||
| slackMessageTs: messageTs, | ||
| }); | ||
| } | ||
| return { sent: true, messageTs }; |
There was a problem hiding this comment.
Surface first_dm_sent heartbeat failures to caller
sendIntroductionDm returns sent: true unconditionally after calling reportFirstDmSent, but reportFirstDmSent treats non-2xx and fetch errors as warn-and-continue. In an HTTP deployment, a transient metadata-gateway failure (e.g., 500/timeout) will therefore still mark the DM flow as successful; SlackHttpChannel.connect() then sets firstDmSent = true and suppresses future retries, so the control-plane first_dm_sent signal can be permanently lost until a process restart.
Useful? React with 👍 / 👎.
Summary
Phase B.1.4 builder commit: ships the in-VM Phantom side of the canary chain. Two best-effort POSTs the agent fires to phantomd's metadata gateway (
169.254.169.254) at boot milestones, plus a synthetic first DM the canary user is gated on.router.register(slackChannel)insrc/index.ts: POST/v1/tenant_status/agent_readywith{ transport }. Unblocks phantomd'sWaitTenantReadyRPC; phantom-control advances out of thewaiting_for_agent_readyphase.SlackHttpChannel.connect()succeeds: send the introduction DM to the installer (the canary user moment), then POST/v1/tenant_status/first_dm_sentwith the Slackmessage_ts. phantomd persiststenants.first_dm_atso phantom-control can flipapplications.status=active.process.env.METADATA_BASE_URLso self-host installs (no listener) skip silently.firstDmSentflag so a transient disconnect+reconnect on the same channel object does not re-introduce.Architect plan reference
This commit implements sections 8.3 (
agent.readyheartbeat), 8.4 (synthetic first DM), 8.5 (decision:first_dm_sentis the canonical "live" signal), 12 (first-message UX option A copy), and 13 commit B.1.4 (LoC budget, acceptance test, dependencies) of2026-04-26-canary-end-to-end-plan.md.Wire shape verified against
phantomd/internal/metadata/gateway.goandphantomd/internal/metadata/tenant_status_handler.go(both Phase B.1.2 merged on main):agent_readyaccepts only{ transport: string };first_dm_sentaccepts only{ slack_message_ts: string }. Source-IP authn at the gateway means no tenant identifier on the wire.What changed
src/tenancy/heartbeat.tsreportAgentReadyandreportFirstDmSent. Best-effort POSTs with injectablefetchImplfor testability.src/tenancy/__tests__/heartbeat.test.tssrc/channels/slack-introduction.tssendIntroductionDmandcomposeIntroductionText. Extracted fromslack-http-receiver.tsso the receiver class stays close to its 300-line budget; mirrors theslack-egress.tsfactoring.src/channels/__tests__/slack-introduction.test.tssrc/channels/slack-http-receiver.tsfirstDmSentflag, callssendIntroductionDmafterreceiver.startsucceeds.src/channels/__tests__/slack-http-receiver.test.tspostToChannel chunkstest for the new connect()-side-effect baseline.src/index.tsreportAgentReady, calls it afterrouter.register(slackChannel)whenMETADATA_BASE_URLis set.Total: ~241 production LoC + ~397 test LoC across the diff (~638 total).
Test plan
bun install: cleanbun run lint(Biome): cleanbun run typecheck(tsc --noEmit): cleanbun test: 1951 pass / 0 fail / 10 skip / 4249 expects across 145 files. +21 new tests across 3 files.phantomd/internal/metadata/tenant_status_handler.golines 67-77Plan items confirmed
FetchImpltyping, defensivetrimTrailingSlash, and an empty-ts guard. The slack-introduction.ts module (94 lines) is a CLAUDE.md-mandated split out of slack-http-receiver.ts which would otherwise hit 320 lines (above the 300-line cap).ready_at,agent_version,channel_id,sent_at); I dropped them per the "match phantomd; do NOT modify phantomd" instruction.Plan items diverged
agent_version: process.env.PHANTOM_VERSION ?? "unknown"in the body. phantomd's handler does not parse this field; including it would be dead bytes on the wire. Dropped.reportFirstDmSentsketch hadchannel_idandsent_atfields. phantomd's handler does not parse these. Dropped.Bonus features
composeIntroductionTextis exported alongsidesendIntroductionDmso the test pins the user-facing copy. Future copy changes propagate to one test, not many.redactTokensis reused insidesendIntroductionDm's catch block so a leaked bot token in a thrown error message is still scrubbed before logging.trimTrailingSlashdefensive helper in heartbeat.ts:METADATA_BASE_URL=http://host/produces a doubled-slash that phantomd's path-strict mux 404s on. Pinned by a test.Self-review tracks
applications.status=activeonfirst_dm_sent(vs today'sagent_ready) remains a B.1.5 follow-up per the round-1 review M1Verification gates results
Recommendation
@codex review
Squash-merge after operator approval. After merge, the phantom-rootfs PR (
ghostwright/phantom-rootfs#new) bakes this code into the OCI image; the operator then rebuilds + redeploys per the rootfs PR's recipe and runs the canary chain end-to-end.