
feat(telemetry): funnel + lifecycle events for onboarding drop-off #3305

Merged
louisgv merged 3 commits into OpenRouterTeam:main from AhmedTMM:feat/funnel-telemetry
Apr 15, 2026

@AhmedTMM
Collaborator

Summary

Adds low-volume, high-signal product events on top of the existing errors/warnings telemetry so we can answer "where do users bail before reaching a running agent" at the fleet level, plus track spawn lifetime and login patterns.

Respects existing `SPAWN_TELEMETRY=0` opt-out — no new flags.

Funnel events (in `orchestrate.ts`, both fast and sequential paths)

| Event | Fires when |
| --- | --- |
| `funnel_started` | Pipeline begins |
| `funnel_cloud_authed` | `cloud.authenticate()` ok |
| `funnel_credentials_ready` | OR key + preProvision resolved |
| `funnel_vm_ready` | VM booted and SSH-reachable |
| `funnel_install_completed` | Agent install succeeded (tarball or live) |
| `funnel_configure_completed` | `agent.configure()` ran |
| `funnel_prelaunch_completed` | Gateway / dashboard / preLaunch hooks done |
| `funnel_handoff` | About to launch TUI (final step) |

Every event carries `elapsed_ms` since `funnel_started`, plus `agent` and `cloud` via telemetry context. Per-step counts in PostHog reveal the exact drop-off funnel without PII.
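
For readers who want to picture the timing logic, here is a minimal sketch of a funnel tracker that stamps each step with `elapsed_ms` relative to `funnel_started`. The names `createFunnel`, `Emit`, and the injectable clock are hypothetical and exist only for illustration; the actual code in `orchestrate.ts` is not shown in this description and may be organized differently.

```typescript
// Hypothetical sketch: createFunnel and Emit are illustrative names,
// not identifiers taken from orchestrate.ts.
type FunnelStep =
  | "funnel_started"
  | "funnel_cloud_authed"
  | "funnel_credentials_ready"
  | "funnel_vm_ready"
  | "funnel_install_completed"
  | "funnel_configure_completed"
  | "funnel_prelaunch_completed"
  | "funnel_handoff";

type Emit = (name: string, properties: Record<string, unknown>) => void;

function createFunnel(
  emit: Emit,
  now: () => number = Date.now,
): (step: FunnelStep) => void {
  let startedAt: number | null = null;
  return (step: FunnelStep): void => {
    const t = now();
    // The first step (or an explicit funnel_started) resets the reference time.
    if (step === "funnel_started" || startedAt === null) {
      startedAt = t;
    }
    emit(step, { elapsed_ms: t - startedAt });
  };
}
```

Injecting the clock is a design choice made here so the timing can be checked deterministically in a test; a module-scoped start timestamp would behave the same way at runtime.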

Lifecycle events (new `shared/lifecycle-telemetry.ts`)

`spawn_connected` — fired from `list.ts` when the user reconnects via the interactive picker. Properties: `spawn_id`, `agent`, `cloud`, `connect_count`, `date`. Increments `connection.metadata.connect_count` and writes `last_connected_at` so subsequent events (and the eventual `spawn_deleted`) have the running total.

`spawn_deleted` — fired from `delete.ts` (both interactive `confirmAndDelete` and headless `cmdDelete` loop) after a successful cloud destroy. Properties: `spawn_id`, `agent`, `cloud`, `lifetime_hours`, `connect_count`, `date`. `lifetime_hours` is computed from `SpawnRecord.timestamp` to now and clamped at 0 for corrupt clocks.

Answers: how long does a typical spawn live, how many times do users reconnect to it, which agents/clouds get the most re-use.
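
The `lifetime_hours` computation described above can be sketched as follows. This assumes `SpawnRecord.timestamp` is a string that `Date.parse` can read, which the description does not state; the function name `lifetimeHours` and the zero fallback for unparseable input are likewise assumptions.

```typescript
// Hypothetical helper. Assumes the record timestamp is parseable by Date.parse.
const MS_PER_HOUR = 3_600_000;

function lifetimeHours(createdAt: string, nowMs: number = Date.now()): number {
  const createdMs = Date.parse(createdAt);
  // Assumption: an unparseable timestamp is reported as zero lifetime.
  if (Number.isNaN(createdMs)) return 0;
  // Clamp at 0 so a creation time in the future (clock skew) never goes negative.
  return Math.max(0, (nowMs - createdMs) / MS_PER_HOUR);
}
```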

Privacy + scrubbing

New `captureEvent(name, properties)` helper in `telemetry.ts`:

  • Gates on `SPAWN_TELEMETRY=0` (no new flag)
  • Runs every string property through the existing scrubber (API keys, GitHub tokens, bearer, emails, IPs, base64 blobs, home paths)
  • Non-string values (numbers, booleans, `spawn_id` UUIDs) pass through untouched

Nothing in the funnel events is user-typed — they're all known-at-compile-time agent/cloud names plus timing integers.
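
As a rough illustration of the gate-then-scrub behaviour listed above, the sketch below builds an event payload, returns nothing when telemetry is disabled, and scrubs only string values. `buildEvent`, `scrub`, and the two regular expressions are hypothetical stand-ins; the real scrubber in `telemetry.ts` covers more patterns and is not reproduced here.

```typescript
type Properties = Record<string, string | number | boolean>;

// Illustrative patterns only; not the patterns used by telemetry.ts.
const PATTERNS: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[email]"],
  [/\b\d{1,3}(\.\d{1,3}){3}\b/g, "[ip]"],
];

function scrub(value: string): string {
  return PATTERNS.reduce((acc, [re, repl]) => acc.replace(re, repl), value);
}

function buildEvent(
  name: string,
  properties: Properties,
  enabled: boolean,
): { event: string; properties: Properties } | null {
  // Opt-out gate: nothing is produced when telemetry is disabled.
  if (!enabled) return null;
  const clean: Properties = {};
  for (const [key, value] of Object.entries(properties)) {
    // Strings go through the scrubber; numbers and booleans pass through as-is.
    clean[key] = typeof value === "string" ? scrub(value) : value;
  }
  return { event: name, properties: clean };
}
```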

Persistence model for `connect_count`

Stored inside `SpawnRecord.connection.metadata` as a stringified integer (the existing metadata schema is `Record<string, string>`). `saveMetadata` merges — no risk of clobbering other keys like `tunnel_remote_port`.
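
A minimal sketch of the stringified-counter approach, assuming a plain `Record<string, string>` and a merge by object spread. `recordConnection` and `nextConnectCount` are hypothetical names, and treating a malformed value as 0 is an assumption about how "tolerance for malformed metadata" is handled; this is not the `saveMetadata` implementation.

```typescript
type Metadata = Record<string, string>;

function nextConnectCount(metadata: Metadata): number {
  const parsed = Number.parseInt(metadata["connect_count"] ?? "", 10);
  // Assumption: missing, non-numeric, or negative values restart the count at 0.
  const previous = Number.isFinite(parsed) && parsed >= 0 ? parsed : 0;
  return previous + 1;
}

function recordConnection(metadata: Metadata, nowIso: string): Metadata {
  // Spread first so unrelated keys (e.g. tunnel_remote_port) are preserved.
  return {
    ...metadata,
    connect_count: String(nextConnectCount(metadata)),
    last_connected_at: nowIso,
  };
}
```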

Tests

  • `lifecycle-telemetry.test.ts` (15 new tests) — locks in the connect-count math, lifetime computation, no-op for missing records, event payload shape, and tolerance for malformed metadata.
  • `telemetry.test.ts` (+2 tests for `captureEvent`, +1 assertion in disabled-telemetry) — verifies the new helper emits batched events with the right shape, respects opt-out, and scrubs string values but passes non-strings through.
  • Full suite: 2129/2129 pass, biome 187 files 0 errors.

Not doing in this PR

  • Failure events (e.g. `funnel_provision_failed`) — existing `captureError` already handles errors with stack traces. Funnel drop-off is inferable from the absence of the next step (e.g. `funnel_credentials_ready` count − `funnel_vm_ready` count = VM provisioning drop-off).
  • Retry tracking — each retryOrQuit loop already fires `captureError` for the underlying failure. A separate retry-counter event would add noise for marginal signal.
  • Post-handoff tracking — once the TUI takes over, we're out of the CLI. In-agent session tracking is out of scope; that's the agent's responsibility.
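
The "absence of the next step" inference can be sketched as a subtraction over consecutive step counts. The function `stepLosses` is hypothetical and the counts in the usage note are made up for illustration; in practice the counts would come from PostHog.

```typescript
interface StepLoss {
  from: string;
  to: string;
  lost: number;
}

function stepLosses(
  order: readonly string[],
  counts: Readonly<Record<string, number>>,
): StepLoss[] {
  const losses: StepLoss[] = [];
  let previous: string | undefined;
  for (const step of order) {
    if (previous !== undefined) {
      // Drop-off between two steps = events at the earlier step minus events at the later one.
      losses.push({
        from: previous,
        to: step,
        lost: (counts[previous] ?? 0) - (counts[step] ?? 0),
      });
    }
    previous = step;
  }
  return losses;
}
```

With illustrative counts of 100 `funnel_credentials_ready` events and 80 `funnel_vm_ready` events, the sketch reports 20 runs lost during VM provisioning.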

Version

Bumps 1.0.10 → 1.0.11. Patch bump — auto-propagates under #3296's new policy, so the telemetry will start flowing to users on their next spawn run without any manual update.

Adds low-volume, high-signal product events on top of the existing
errors/warnings telemetry (shared/telemetry.ts). Answers "where do users
bail before reaching a running agent" at the fleet level.

Funnel events (in orchestrate.ts, both fast and sequential paths):

  funnel_started              pipeline begins
  funnel_cloud_authed         cloud.authenticate() ok
  funnel_credentials_ready    OR key + preProvision resolved
  funnel_vm_ready             VM booted and SSH-reachable
  funnel_install_completed    agent install succeeded (tarball or live)
  funnel_configure_completed  agent.configure() ran
  funnel_prelaunch_completed  gateway / dashboard / preLaunch hooks done
  funnel_handoff              about to launch TUI (final step)

Every event carries elapsed_ms since funnel_started, plus agent and cloud
via telemetry context. Per-step counts reveal the drop-off funnel in
PostHog without touching any PII.

Lifecycle events (new shared/lifecycle-telemetry.ts):

  spawn_connected  { spawn_id, agent, cloud, connect_count, date }
    fired from list.ts when the user reconnects via the interactive picker.
    Increments connection.metadata.connect_count and writes last_connected_at
    so subsequent events and the eventual spawn_deleted have the total.

  spawn_deleted    { spawn_id, agent, cloud, lifetime_hours, connect_count, date }
    fired from delete.ts (both interactive confirmAndDelete and headless
    cmdDelete loop) after a successful cloud destroy. lifetime_hours is
    computed from SpawnRecord.timestamp to now. Clamped at 0 for corrupt
    clocks. connect_count is read from metadata.

New captureEvent(name, properties) helper in telemetry.ts:
- Respects SPAWN_TELEMETRY=0 opt-out (no new flag)
- Runs every string property through the existing scrubber (API keys,
  GitHub tokens, bearer, emails, IPs, base64 blobs, home paths)
- Non-string values pass through untouched

Tests: 20 new (15 lifecycle-telemetry + 2 captureEvent + 3 assertion
additions to disabled-telemetry). Full suite: 2129/2129 pass.

Bumps 1.0.10 -> 1.0.11. Patch bump — auto-propagates under OpenRouterTeam#3296 policy.
louisgv previously approved these changes Apr 15, 2026
Member

@louisgv louisgv left a comment


Security Review

Verdict: APPROVED
Commit: f14e502

Summary

This PR adds funnel telemetry and lifecycle event tracking for onboarding analytics. All changes respect the existing SPAWN_TELEMETRY=0 opt-out mechanism.

Security Analysis

Telemetry Infrastructure

  • PII scrubbing: All string values in captureEvent() are passed through the same scrubber as errors/warnings (lines 196-205 in telemetry.ts)
  • Sensitive pattern redaction: API keys, GitHub tokens, emails, IPs, file paths are all redacted before upload (lines 14-58 in telemetry.ts)
  • Opt-out respected: All new events use captureEvent() which checks _enabled flag (controlled by SPAWN_TELEMETRY=0)
  • No command args: Only aggregated metrics (elapsed_ms, connect_count, lifetime_hours) are sent - no user input, file paths, or command arguments

Lifecycle Tracking

  • Safe metadata storage: connect_count and last_connected_at stored in existing SpawnRecord.connection.metadata as strings
  • No credential leakage: Only spawn_id (random UUID), agent/cloud names, and numeric metrics are sent
  • Proper event timing: trackSpawnDeleted() called AFTER successful deletion, not before (prevents false positives)

Funnel Tracking

  • Context isolation: Agent/cloud set via setTelemetryContext() and attached to all events automatically
  • No session tracking: Only pipeline step completion events - no keystroke tracking or prompt content
  • Safe timing calculation: Uses module-scoped _funnelStart timestamp - no external state manipulation

Tests

  • bash -n: N/A (no shell scripts modified)
  • bun test: PASS (2068 tests, 0 failures)
  • biome lint: PASS (187 files checked, no issues)
  • Test coverage: New test file lifecycle-telemetry.test.ts with comprehensive coverage of both tracking functions

Findings

None. Code is secure.


-- security/pr-reviewer

@louisgv louisgv added the security-approved Security review approved label Apr 15, 2026
louisgv and others added 2 commits April 15, 2026 08:39
mock.module contaminates the global module registry when running under
--coverage, causing telemetry.test.ts and history-cov.test.ts to receive
mocked implementations instead of the real modules. Switch to spyOn with
mockRestore in afterEach so the real modules are preserved across files.

Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@la14-1
Member

la14-1 commented Apr 15, 2026

Pushed a fix for the 5 failing Mock Tests: mock.module in lifecycle-telemetry.test.ts was contaminating the global module registry when running under --coverage, causing telemetry.test.ts (2 failures) and history-cov.test.ts (3 failures) to receive mocked implementations instead of the real modules.

Replaced mock.module with spyOn + mockRestore in afterEach, which scopes the mocks to each test without polluting other files. Full suite passes: 2129/2129, biome clean.
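
To illustrate why a spy that is restored after each test does not leak into other files, here is a dependency-free stand-in for the replace-then-restore pattern. It is not Bun's `spyOn` implementation, and `spyOnMethod` and the `telemetry` object in the usage note are hypothetical; the sketch only shows that the original function is put back on the same object, whereas a registry-level module mock replaces what every importer receives.

```typescript
// Hand-rolled stand-in for the spy + restore pattern; not bun:test's implementation.
function spyOnMethod<T extends object, K extends keyof T>(
  target: T,
  key: K,
  replacement: T[K],
): () => void {
  const original = target[key];
  target[key] = replacement;
  // Calling the returned function plays the role of mockRestore().
  return () => {
    target[key] = original;
  };
}
```

For example, with `const telemetry = { captureEvent: (name: string) => "sent:" + name }`, calling `const restore = spyOnMethod(telemetry, "captureEvent", () => "mocked")` swaps the method for the duration of a test, and `restore()` in a teardown hook returns the real function for whatever runs next.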

-- refactor/pr-maintainer

Member

@louisgv louisgv left a comment


Security Review

Verdict: APPROVED
Commit: 42f4426

Summary

This PR adds funnel telemetry and lifecycle event tracking for onboarding analytics. All changes respect the existing SPAWN_TELEMETRY=0 opt-out mechanism.

Security Analysis

Telemetry Infrastructure

  • PII scrubbing: All string values in captureEvent() are passed through the same scrubber as errors/warnings (lines 196-205 in telemetry.ts)
  • Sensitive pattern redaction: API keys, GitHub tokens, emails, IPs, file paths are all redacted before upload (lines 14-58 in telemetry.ts)
  • Opt-out respected: All new events use captureEvent() which checks _enabled flag (controlled by SPAWN_TELEMETRY=0)
  • No command args: Only aggregated metrics (elapsed_ms, connect_count, lifetime_hours) are sent - no user input, file paths, or command arguments

Lifecycle Tracking

  • Safe metadata storage: connect_count and last_connected_at stored in existing SpawnRecord.connection.metadata as strings
  • No credential leakage: Only spawn_id (random UUID), agent/cloud names, and numeric metrics are sent
  • Proper event timing: trackSpawnDeleted() called AFTER successful deletion (delete.ts:264, 454), not before (prevents false positives)
  • Correct reconnect placement: trackSpawnConnected() called BEFORE SSH handoff (list.ts:714) - correct as SSH session never returns

Funnel Tracking

  • Context isolation: Agent/cloud set via setTelemetryContext() and attached to all events automatically
  • No session tracking: Only pipeline step completion events - no keystroke tracking or prompt content
  • Safe timing calculation: Uses module-scoped _funnelStart timestamp - no external state manipulation

Test Coverage Fix

  • Proper mock isolation: Replaced mock.module with spyOn + mockRestore in lifecycle-telemetry.test.ts to prevent cross-file contamination
  • Full suite passes: 2068/2068 tests pass after the fix

Tests

  • bash -n: N/A (no shell scripts modified)
  • bun test: PASS (2068 tests, 0 failures)
  • biome lint: PASS (187 files checked, no issues)

Findings

None. Code is secure.


-- security/pr-reviewer

@louisgv louisgv merged commit 1e64d34 into OpenRouterTeam:main Apr 15, 2026
5 checks passed
AhmedTMM added a commit to AhmedTMM/spawn that referenced this pull request Apr 15, 2026
Two bugs from the OpenRouterTeam#3305 rollout:

1. Test pollution: orchestrate.test.ts imports runOrchestration directly
   and never calls initTelemetry, but _enabled defaulted to true in the
   module so captureEvent happily fired real events at PostHog tagged
   agent=testagent. The onboarding funnel filled up with CI fixture data.

2. Funnel started too late: funnel_* events fired inside runOrchestration,
   which is only called AFTER the interactive picker completes. Users who
   bail at the agent/cloud/setup-options/name prompts were invisible —
   yet that's exactly where real drop-off happens.

Fix 1 — telemetry.ts:
  - Default _enabled = false. Nothing fires until initTelemetry is
    explicitly called. Production (index.ts) calls it; tests that need
    telemetry (telemetry.test.ts) call it with BUN_ENV/NODE_ENV cleared.
  - Belt-and-suspenders: initTelemetry now short-circuits when
    BUN_ENV === "test" || NODE_ENV === "test", so even if future code
    calls it from a test context, events stay local.

Fix 2 — picker instrumentation:
  New events fired before runOrchestration in every entry path:

    spawn_launched         { mode: interactive | agent_interactive | direct | headless }
    menu_shown / menu_selected / menu_cancelled   (only when user has prior spawns)
    agent_picker_shown
    agent_selected         { agent }     — also sets telemetry context
    cloud_picker_shown
    cloud_selected         { cloud }     — also sets telemetry context
    preflight_passed
    setup_options_shown
    setup_options_selected { step_count }
    name_prompt_shown
    name_entered
    picker_completed

  Wired into:
    commands/interactive.ts  cmdInteractive + cmdAgentInteractive
    commands/run.ts          cmdRun (direct `spawn <agent> <cloud>`)
                             cmdRunHeadless (only spawn_launched)

  runOrchestration's existing funnel_* events continue to fire unchanged.
  The final funnel in PostHog:
    spawn_launched → agent_selected → cloud_selected → preflight_passed
    → setup_options_selected → name_entered → picker_completed
    → funnel_started → funnel_cloud_authed → funnel_credentials_ready
    → funnel_vm_ready → funnel_install_completed → funnel_configure_completed
    → funnel_prelaunch_completed → funnel_handoff

Tests:
- telemetry.test.ts: 2 new env-guard tests (BUN_ENV, NODE_ENV), plus
  updated beforeEach to clear both env vars so existing tests still
  exercise initTelemetry.
- Full suite: 2131/2131 pass, biome 0 errors.

Bumps 1.0.12 -> 1.0.13 (patch — auto-propagates under OpenRouterTeam#3296 policy).
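
The default-off flag and test-environment guard from Fix 1 can be sketched as below. Passing the environment in as an argument is an assumption made so the sketch runs anywhere; the function bodies, the `isEnabled` helper, and the return value are illustrative and not taken from `telemetry.ts`.

```typescript
type Env = Record<string, string | undefined>;

// Off by default: nothing fires until initTelemetry is called explicitly.
let enabled = false;

function initTelemetry(env: Env): boolean {
  if (env["SPAWN_TELEMETRY"] === "0") {
    // User opt-out.
    enabled = false;
  } else if (env["BUN_ENV"] === "test" || env["NODE_ENV"] === "test") {
    // Test guard: stay disabled even if a test calls initTelemetry.
    enabled = false;
  } else {
    enabled = true;
  }
  return enabled;
}

function isEnabled(): boolean {
  return enabled;
}
```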
louisgv pushed a commit that referenced this pull request Apr 15, 2026
AhmedTMM added a commit to AhmedTMM/spawn that referenced this pull request Apr 16, 2026
