Skip to content

agent: tenancy heartbeat + synthetic first DM (B.1.4)#99

Merged
mcheemaa merged 4 commits intomainfrom
phase-b1/agent-heartbeat-first-dm
Apr 27, 2026
Merged

agent: tenancy heartbeat + synthetic first DM (B.1.4)#99
mcheemaa merged 4 commits intomainfrom
phase-b1/agent-heartbeat-first-dm

Conversation

@mcheemaa
Copy link
Copy Markdown
Member

Summary

Phase B.1.4 builder commit: ships the in-VM Phantom side of the canary chain. Two best-effort POSTs the agent fires to phantomd's metadata gateway (169.254.169.254) at boot milestones, plus a synthetic first DM the canary user is gated on.

  • After router.register(slackChannel) in src/index.ts: POST /v1/tenant_status/agent_ready with { transport }. Unblocks phantomd's WaitTenantReady RPC; phantom-control advances out of the waiting_for_agent_ready phase.
  • After SlackHttpChannel.connect() succeeds: send the introduction DM to the installer (the canary user moment), then POST /v1/tenant_status/first_dm_sent with the Slack message_ts. phantomd persists tenants.first_dm_at so phantom-control can flip applications.status=active.
  • Both heartbeats are gated on process.env.METADATA_BASE_URL so self-host installs (no listener) skip silently.
  • The first DM is gated by an instance-level firstDmSent flag so a transient disconnect+reconnect on the same channel object does not re-introduce.

Architect plan reference

This commit implements sections 8.3 (agent.ready heartbeat), 8.4 (synthetic first DM), 8.5 (decision: first_dm_sent is the canonical "live" signal), 12 (first-message UX option A copy), and 13 commit B.1.4 (LoC budget, acceptance test, dependencies) of 2026-04-26-canary-end-to-end-plan.md.

Wire shape verified against phantomd/internal/metadata/gateway.go and phantomd/internal/metadata/tenant_status_handler.go (both Phase B.1.2 merged on main): agent_ready accepts only { transport: string }; first_dm_sent accepts only { slack_message_ts: string }. Source-IP authn at the gateway means no tenant identifier on the wire.

What changed

File Type Change
src/tenancy/heartbeat.ts new (112 lines) reportAgentReady and reportFirstDmSent. Best-effort POSTs with injectable fetchImpl for testability.
src/tenancy/__tests__/heartbeat.test.ts new (158 lines) 8 tests: URL shape, body shape, trailing-slash trim, non-2xx swallow, fetch rejection swallow, empty-ts skip.
src/channels/slack-introduction.ts new (94 lines) sendIntroductionDm and composeIntroductionText. Extracted from slack-http-receiver.ts so the receiver class stays close to its 300-line budget; mirrors the slack-egress.ts factoring.
src/channels/__tests__/slack-introduction.test.ts new (170 lines) 9 tests: text composition, sendDm wiring, heartbeat ack, METADATA_BASE_URL gate, sendDm-returns-null path, sendDm-throws path, token redaction.
src/channels/slack-http-receiver.ts patch (+20 lines) Adds firstDmSent flag, calls sendIntroductionDm after receiver.start succeeds.
src/channels/__tests__/slack-http-receiver.test.ts patch (+69 lines) 4 new tests: end-to-end DM on connect, reconnect dedup, failure does not break connect, retry on fresh connect after first send returned no ts. Also adjusts the existing postToChannel chunks test for the new connect()-side-effect baseline.
src/index.ts patch (+15 lines) Imports reportAgentReady, calls it after router.register(slackChannel) when METADATA_BASE_URL is set.

Total: ~241 production LoC + ~397 test LoC across the diff (~638 total).

Test plan

  • bun install: clean
  • bun run lint (Biome): clean
  • bun run typecheck (tsc --noEmit): clean
  • bun test: 1951 pass / 0 fail / 10 skip / 4249 expects across 145 files. +21 new tests across 3 files.
  • Em-dash sweep on diff: zero hits
  • Co-authored-by sweep on commit: zero hits
  • Cross-repo wire-shape pin: confirmed against phantomd/internal/metadata/tenant_status_handler.go lines 67-77

Plan items confirmed

  • ~120 LoC production budget: this commit is ~241 LoC because the architect's 50-80 line heartbeat sketch grew to 112 lines for explicit FetchImpl typing, defensive trimTrailingSlash, and an empty-ts guard. The slack-introduction.ts module (94 lines) is a CLAUDE.md-mandated split out of slack-http-receiver.ts which would otherwise hit 320 lines (above the 300-line cap).
  • Wire shape: matched phantomd exactly. Architect plan suggested extra fields (ready_at, agent_version, channel_id, sent_at); I dropped them per the "match phantomd; do NOT modify phantomd" instruction.
  • The introduction copy is the architect plan section 12 option A literal.

Plan items diverged

  • The architect's heartbeat sketch had agent_version: process.env.PHANTOM_VERSION ?? "unknown" in the body. phantomd's handler does not parse this field; including it would be dead bytes on the wire. Dropped.
  • The architect's reportFirstDmSent sketch had channel_id and sent_at fields. phantomd's handler does not parse these. Dropped.

Bonus features

  • composeIntroductionText is exported alongside sendIntroductionDm so the test pins the user-facing copy. Future copy changes propagate to one test, not many.
  • redactTokens is reused inside sendIntroductionDm's catch block so a leaked bot token in a thrown error message is still scrubbed before logging.
  • trimTrailingSlash defensive helper in heartbeat.ts: METADATA_BASE_URL=http://host/ produces a doubled-slash that phantomd's path-strict mux 404s on. Pinned by a test.

Self-review tracks

Track Result Notes
1. Wire-shape parity PASS Body / URL / method match phantomd handler exactly
2. First-DM idempotency PASS with note Process-restart duplicates a DM (fresh instance, fresh flag); architect plan section 8.4 accepts this for v1
3. Failure-path sweep PASS All four error paths swallow + log: fetch reject, non-2xx, sendDm null, sendDm throw
4. Concurrency dedup PASS with note connect() called once per process boot in practice; if called twice in flight, both DMs could fire (theoretical race). Accept for canary; tighten if observed
5. Test reality PASS Each new test would fail on a stash-revert of the corresponding fix
6. Scope discipline PASS with note LoC overage from extracting slack-introduction.ts (CLAUDE.md 300-line cap) and adding failure-path tests; aligns with the failure-path-sweep memory carryforward
7. Cross-repo compat PASS with note First-DM-sent triggers exist; phantom-control flipping applications.status=active on first_dm_sent (vs today's agent_ready) remains a B.1.5 follow-up per the round-1 review M1

Verification gates results

bun install      clean
bun run lint     Checked 360 files in 90ms. No fixes applied.
bun run typecheck    $ tsc --noEmit
bun test         1951 pass, 0 fail, 10 skip, 4249 expects

Recommendation

@codex review

Squash-merge after operator approval. After merge, the phantom-rootfs PR (ghostwright/phantom-rootfs#new) bakes this code into the OCI image; the operator then rebuilds + redeploys per the rootfs PR's recipe and runs the canary chain end-to-end.

Adds two best-effort POSTs the in-VM Phantom fires to phantomd's
metadata gateway after the boot's key milestones:

  1. After router.register(slackChannel) in src/index.ts, POST
     /v1/tenant_status/agent_ready with the configured transport.
     phantom-control's WaitTenantReady RPC unblocks; the wizard
     advances out of the waiting_for_agent_ready phase.

  2. After SlackHttpChannel.connect() succeeds, send the synthetic
     introduction DM to the installer (the user moment the canary
     is gated on), then POST /v1/tenant_status/first_dm_sent with
     the Slack message_ts. phantomd persists tenants.first_dm_at
     so phantom-control can flip applications.status=active.

Wire shape matches phantomd's tenant_status_handler.go exactly:
agent_ready accepts only { transport: "..." }; first_dm_sent accepts
only { slack_message_ts: "..." }. Source-IP authn is enforced by the
gateway, so we do not sign or pass any tenant identifier.

Best-effort posture per architect plan section 8.3 / 8.4: a network
or non-2xx failure logs and continues. METADATA_BASE_URL gates both
heartbeats so self-host installs (where there is no listener) skip
silently.

The first DM is gated by an instance-level firstDmSent flag, so a
transient disconnect+reconnect on the same channel object does not
re-introduce the agent. A process restart resets the flag; the
canary's worse-case is a duplicate intro DM after a hard restart,
preferable to silent UX failure when the user never sees a first
message.

The introduction copy + send logic lives in src/channels/slack-introduction.ts
(separate module so slack-http-receiver.ts stays close to its 300-line
budget; mirrors slack-egress.ts factoring).

Tests added (~397 LoC):
  - src/tenancy/__tests__/heartbeat.test.ts (8 tests): URL shape,
    body shape, trailing-slash handling, non-2xx swallow, fetch
    rejection swallow, empty-ts skip.
  - src/channels/__tests__/slack-introduction.test.ts (9 tests):
    text composition, sendDm wiring, heartbeat ack, METADATA_BASE_URL
    gate, sendDm-returns-null path, sendDm-throws path, token redaction.
  - src/channels/__tests__/slack-http-receiver.test.ts (4 tests):
    end-to-end DM on connect, reconnect dedup, failure does not break
    connect, retry on fresh connect after first send returned no ts.
@mcheemaa
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74e269c873

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/index.ts Outdated
Comment on lines +343 to +345
if (process.env.METADATA_BASE_URL) {
await reportAgentReady({
metadataBaseUrl: process.env.METADATA_BASE_URL,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove env-only gate from agent_ready heartbeat

This guard skips reportAgentReady unless METADATA_BASE_URL is explicitly set, but the HTTP Slack path can run without that env var because createSlackChannel falls back to DEFAULT_METADATA_BASE_URL (src/channels/slack-channel-factory.ts). In that valid configuration, the tenant boots and fetches metadata successfully but never sends agent_ready, so the control-plane WaitTenantReady step can time out even though the agent is up.

Useful? React with 👍 / 👎.

Comment thread src/channels/slack-introduction.ts Outdated
Comment on lines +64 to +67
if (process.env.METADATA_BASE_URL) {
await reportFirstDmSent({
metadataBaseUrl: process.env.METADATA_BASE_URL,
slackMessageTs: messageTs,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove env-only gate from first_dm_sent heartbeat

The same env-var gate here prevents reportFirstDmSent when METADATA_BASE_URL is unset, even though HTTP mode may still be using the default metadata endpoint. In that case the introduction DM can be sent successfully but the first_dm_sent attestation is never posted, which can leave onboarding state machines waiting/failing despite a delivered DM.

Useful? React with 👍 / 👎.

…rnal refs (B.1.4 fix-and-review #2)

Section A: env-only gate too strict on the agent_ready and first_dm_sent
heartbeats. SLACK_TRANSPORT === "http" is the actual signal that the
agent is in an HTTP-transport deployment with a host metadata gateway
listening; METADATA_BASE_URL alone misses the case where the channel
factory falls back to the link-local default. Fixed both call sites in
src/index.ts and src/channels/slack-introduction.ts; both now also
fall back to DEFAULT_METADATA_BASE_URL so an unset env is not a gating
failure.

Section B: heartbeat.test.ts refactored away from the
mock(...) as unknown as FetchImpl cast pattern. The wrapper carries
internal reset state that has shifted across Bun minor versions and
broke a single sibling test on Bun 1.3.13 (passes locally on 1.3.5).
Replaced makeFetchOk / makeFetchStatus / makeFetchThrows with plain
async functions typed directly as FetchImpl. No mock() involvement,
no cast, no per-runtime state. Should pass on every Bun release.

Section D: dashboard URL is now parameterized.
composeIntroductionText accepts an optional dashboardUrl, dropping the
"manage me" line entirely when unset. sendIntroductionDm reads
PHANTOM_DASHBOARD_URL via a resolveDashboardUrl helper that validates
with new URL(...) and drops a malformed value with a warning rather
than injecting unparseable text into the user-facing DM. Caller in
phantom-rootfs sets the env var so cloud deployments get the cloud URL;
self-hosters with no env var see a shorter, valid DM.

Section D (continued): scrubbed internal milestone references from
every comment new in the diff. "Phase B.1.4" prefixes dropped, "the
canary user" rephrased to "the user", "phantomd" replaced with "the
host metadata gateway" or "the host gateway", "architect plan section
12 (option A)" replaced with an inline description of the contract.
The diff against main now matches zero internal-data leak patterns
under the orchestrator's grep.

Tests:
- New slack-introduction tests assert the SLACK_TRANSPORT-based gate:
  unset / socket = no heartbeat, http = heartbeat with default URL when
  METADATA_BASE_URL is unset, http = heartbeat with set URL when both
  are set. Plus three new dashboard-URL tests covering valid URL,
  unset, and malformed.
- New heartbeat tests exercise the same FetchImpl signature without
  mock(); existing assertions intact.

Gates: bun lint clean, bun typecheck clean, 1957 / 1957 tests pass.
@mcheemaa
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a94d37a32e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/index.ts Outdated
Comment on lines +346 to +347
const slackTransport = process.env.SLACK_TRANSPORT;
if (slackTransport === "http") {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize SLACK_TRANSPORT before agent_ready gate

readSlackTransportFromEnv() accepts values like " http " (it trims before deciding transport), so this process can boot SlackHttpChannel even when the raw env var contains whitespace. The new readiness heartbeat gate checks process.env.SLACK_TRANSPORT verbatim, which skips reportAgentReady in that case and can leave the control-plane WaitTenantReady flow stuck even though HTTP transport was selected. Use the normalized transport value (or trim here) before gating.

Useful? React with 👍 / 👎.

Comment thread src/channels/slack-introduction.ts Outdated
Comment on lines +72 to +73
const slackTransport = process.env.SLACK_TRANSPORT;
if (slackTransport === "http") {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize SLACK_TRANSPORT before first_dm_sent gate

This gate has the same raw-env mismatch as the startup heartbeat: transport selection is normalized elsewhere, but sendIntroductionDm checks process.env.SLACK_TRANSPORT without trimming. With SLACK_TRANSPORT=" http ", the introduction DM still sends, but reportFirstDmSent is skipped, so the host never records the first-DM milestone and activation can remain pending indefinitely despite successful user-visible delivery.

Useful? React with 👍 / 👎.

…This.fetch (B.1.4 fix-and-review #3)

Round-3 fixes for the heartbeat + first DM landing.

F-B14-E (P2): index.ts read process.env.SLACK_TRANSPORT verbatim before
the agent_ready heartbeat. readSlackTransportFromEnv() trims its input,
so a whitespace-padded SLACK_TRANSPORT="  http  " booted the HTTP
receiver while the gate skipped reportAgentReady, leaving the operator's
readiness RPC waiting indefinitely. The factory and the gate now share
the same normalized transport via the helper.

F-B14-F (P2): slack-introduction.ts had the same raw-env mismatch on
the first_dm_sent gate. Same helper applied here so a successful Slack
DM is always paired with the audit signal in HTTP-transport boots.

CI test: heartbeat.test.ts "POSTs to ..." tests for both reportAgentReady
and reportFirstDmSent rewritten to use a globalThis.fetch override with
scalar locals (no closure-captured array). The previous pattern passed
on Bun 1.3.5 locally but failed on Bun 1.3.13 in CI; the rewrite is
Bun-version-agnostic and the try/finally restore is mandatory so a
mid-test failure does not leak the override into sibling tests.

Tests: 1959 pass, 10 skip, 0 fail. Lint clean, typecheck clean.
@mcheemaa
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f9c54156f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/index.ts
Comment on lines +349 to +353
if (slackTransport === "http") {
await reportAgentReady({
metadataBaseUrl: process.env.METADATA_BASE_URL ?? DEFAULT_METADATA_BASE_URL,
transport: slackTransport,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Send agent_ready only after Slack channel connects

reportAgentReady is emitted immediately after router.register(slackChannel), but the actual Slack startup (auth.test + receiver start) happens later in router.connectAll(), and connect failures are only logged by ChannelRouter.connectAll rather than aborting startup. In environments with a revoked bot token or bind/start failure, this sends a false agent_ready signal and can advance control-plane state even though Slack is not usable. Gate this heartbeat on a successful Slack connect() result (or move it after confirmed connection) to avoid false readiness.

Useful? React with 👍 / 👎.

@mcheemaa
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1409a958e2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +79 to +83
metadataBaseUrl: process.env.METADATA_BASE_URL ?? DEFAULT_METADATA_BASE_URL,
slackMessageTs: messageTs,
});
}
return { sent: true, messageTs };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Surface first_dm_sent heartbeat failures to caller

sendIntroductionDm returns sent: true unconditionally after calling reportFirstDmSent, but reportFirstDmSent treats non-2xx and fetch errors as warn-and-continue. In an HTTP deployment, a transient metadata-gateway failure (e.g., 500/timeout) will therefore still mark the DM flow as successful; SlackHttpChannel.connect() then sets firstDmSent = true and suppresses future retries, so the control-plane first_dm_sent signal can be permanently lost until a process restart.

Useful? React with 👍 / 👎.

@mcheemaa mcheemaa merged commit 337a282 into main Apr 27, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant