PostHog · ricardo-leiva · Jun 24, 2026 · Jun 24, 2026
diff --git a/docs/save-mode-cost-controls.md b/docs/save-mode-cost-controls.md
@@ -0,0 +1,119 @@
+# Save Mode + Cost Controls (alpha)
+
+Alpha feature to cut LLM spend for both the user and PostHog. Gated by the
+`llm-gateway-cost-controls` feature flag (early-access **alpha** stage) — off for
+everyone until opted in. Existing names are kept; the stage is "alpha", not "prototype".
+
+## The honest framing
+
+- The gateway already **prices** every call and passes caching through, but
+  pricing ≠ minimizing. A 0%-cache-hit session and a 90% one both price
+  correctly; with cached reads at ~90% off, the first costs ~5× more on input.
+- **Real saving** (money leaves the total bill): cache efficiency, lower
+  effort/verbosity, batch discount.
+- **Pricing/substitution** (not real saving): model downshift — and on metered
+  billing it can *reduce* PostHog revenue. Save mode is mainly an
+  acquisition/retention lever, not a margin lever.
+- Settle first: **what's the real `posthog_code` cache-hit rate?** It decides
+  whether this is a savings project or a budget-UX project. Answered by the
+  queries below over existing telemetry (no new code).
+
+## Where the code lives (real homes, tested)
+
+**FE — `packages/core/src/save-mode/`** (pure modules, same pattern as
+`billing/usageDisplay.ts`; Biome + Vitest 10/10 + `tsc` clean):
+- `saveMode.ts` — `resolveSaveMode()`: (mode + requested model/effort) →
+  effective model/effort + terseness reminder + telemetry props.
+- `budget.ts` — `evaluateBudget()`: month-to-date spend vs cap →
+  ok/warn/engage/blocked + recommended mode.
+
+**BE — `posthog` repo, `services/llm-gateway/`** (ruff + mypy --strict + pytest 21):
+- `src/llm_gateway/cost_efficiency.py` — cache-hit ratio + busted-session detector + savings math.
+- `src/llm_gateway/batch_routing.py` — which products route through the 50%-off Batch API.
+- `src/llm_gateway/budget_guard.py` — authoritative hard-cap gate (fail-open; never kills in-flight).
+- `src/llm_gateway/cost_controls.py` — the **alpha flag gate** (`cost_controls_enabled`), off by default.
+- `cost-queries/cache_hit_ratio.promql`, `cost-queries/cost_analytics.hogql`.
+
+**Flag — `frontend/src/lib/constants.tsx`**: `LLM_GATEWAY_COST_CONTROLS = 'llm-gateway-cost-controls'`.
+
+## The alpha loop
+
+EarlyAccessFeature at **alpha** stage (created at runtime in PostHog) →
+opted-in users get the `llm-gateway-cost-controls` flag → the Code app shows the
+save-mode UI and forwards `x-posthog-flag-llm-gateway-cost-controls: true` → the
+gateway's `cost_controls_enabled(get_posthog_flags())` returns true → behavior
+applies. Everyone else: untouched.
+
+## What's left (needs a running stack + review)
+
+1. **Gateway request path** (`api/anthropic.py` → `_handle_anthropic_messages`):
+   call the gate, then `budget_guard` (needs a spend resolver like
+   `quota_resolver`) and `batch_routing` (needs the Anthropic SDK batch
+   submit/poll). Not landed blind — these change critical request handling.
+2. **FE UI**: a save-mode toggle + budget meter in `packages/ui`, a `saveMode`
+   view pref in the settings store, read the alpha flag, and stamp
+   `$ai_save_mode` / `$ai_baseline_model` via `buildGatewayPropertyHeaders`.
+3. **Create the alpha `EarlyAccessFeature`** (UI: Feature management → Early
+   access features; stage = alpha; linked flag key `llm-gateway-cost-controls`).
+
+## Cross-check vs PostHog's agent-cost article
+
+(posthog.com/blog/optimizing-agent-cost) — their hard-won lessons, mapped here.
+
+**They validated, we operationalize.** Their #1 finding — cache writes cost ~12.5×
+reads, so naive context-splitting backfires — is exactly what `cache_efficiency` /
+`classify_session` detect (a "busted" session = paying the write premium for a
+cache nobody reads). Their one-off benchmark becomes a standing signal here.
+
+**Folded into save mode** (`TERSE_REMINDER`): trust prior tool results +
+compacted summaries, don't re-read to re-verify (their "reduced bureaucratic
+verification" + "avoid compaction cascades"); avoid subagents unless work fans
+out (their "subagent elimination").
+
+**What the article missed, that this flow adds:**
+1. **Batch API (50% off)** for async/deterministic flows — absent from the
+   article; `batch_routing.py` applies it to exactly the scheduled, deterministic
+   "conclude"-style steps they describe.
+2. **Continuous measurement, not one-off benchmarking** — they validate against
+   benchmarks by hand; a Signals scout over the cache-hit / busted-session
+   queries flags regressions (cache-busting, compaction cascades) automatically.
+3. **Model tiering** — they hand-tune one model; the deterministic, low-judgment
+   sub-steps can run on a cheaper model (the save-mode downshift generalizes this).
+4. **The 12.5× rule as an automated guardrail**, not human intuition — the
+   busted-session detector is the encoded version.
+5. **User-facing budget caps** — the article is internal eng; `budget_guard` +
+   the save-mode toggle are the product layer.
+
+## Re-exploration: what's already covered, and the next lever
+
+**Already handled upstream (do not rebuild):**
+- **Tool search / deferred MCP loading** — `ENABLE_TOOL_SEARCH: "auto:0"` in the
+  Claude adapter (`session/options.ts`); MCP tool schemas are offloaded behind
+  tool search, not inlined into every turn.
+- **Per-component context cost** — `adapters/claude/context-breakdown.ts`
+  already estimates systemPrompt / tools / rules / skills / mcp / subagents /
+  conversation tokens.
+
+**New lever built — cache TTL (the idle-expiry gap):**
+- `services/llm-gateway/src/llm_gateway/cache_ttl.py` (`upgrade_cache_ttl`):
+  upgrades the SDK's ephemeral cache breakpoints to a **1-hour TTL** for
+  interactive products (`posthog_code`, `slack_app`), so think-time gaps > 5 min
+  stop forcing full cache rewrites — the exact 5–15 min idle-expiry the article
+  flagged. Pure transform, 6 tests green; gated upstream by `cost_controls`.
+  Neither the article nor our prior flow had this.
+
+**Candidates found, not built (need a judgment call / SDK check):**
+- **Context editing** (`clear_tool_uses`) — prune stale tool results from long
+  sessions. The Claude Agent SDK may already compact; verify before adding.
+- **Enrichment token cost** — the read-enrichment hook injects PostHog
+  annotations into file reads (tokens every read). Could be gated off in
+  `max_save` (trades the outcome-aware value for tokens).
+- **Surface `context-breakdown` in the cost UI** — the data already exists;
+  expose "where your tokens go" and flag bloat (skills / rules / mcp resident
+  size) so users can trim.
+
+## Open questions
+
+1. Actual `posthog_code` cache-hit rate today (run `cost_analytics.hogql` query 3).
+2. Is `getPersonalSpendAnalysis` cheap enough to poll month-to-date, or do we
+   need a cached "spend so far" endpoint?
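The cache arithmetic behind "The honest framing" and the 12.5× rule can be checked with a minimal sketch. It assumes Anthropic's multipliers for the default 5-minute cache (writes at about 1.25× the base input price, reads at about 0.1×). The function names and the busted-session rule are hypothetical illustrations, not the contents of `cost_efficiency.py`.

```typescript
// Price multipliers relative to the base input-token price.
// Assumption: default 5-minute ephemeral cache (write 1.25x, read 0.1x).
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;
// The "12.5x rule": one cached write costs as much as 12.5 cached reads.
const WRITE_TO_READ_RATIO = CACHE_WRITE_MULTIPLIER / CACHE_READ_MULTIPLIER;

interface TokenUsage {
  uncachedInputTokens: number;
  cacheWriteTokens: number;
  cacheReadTokens: number;
}

/** Share of all input tokens that were served from cache (0..1). */
function cacheHitRatio(u: TokenUsage): number {
  const total = u.uncachedInputTokens + u.cacheWriteTokens + u.cacheReadTokens;
  return total === 0 ? 0 : u.cacheReadTokens / total;
}

/** Input cost expressed in base-priced token equivalents. */
function relativeInputCost(u: TokenUsage): number {
  return (
    u.uncachedInputTokens +
    u.cacheWriteTokens * CACHE_WRITE_MULTIPLIER +
    u.cacheReadTokens * CACHE_READ_MULTIPLIER
  );
}

/**
 * Hypothetical "busted" rule: the session pays the write premium
 * but reads back fewer tokens than it wrote.
 */
function isBustedSession(u: TokenUsage): boolean {
  return u.cacheWriteTokens > 0 && u.cacheReadTokens < u.cacheWriteTokens;
}

// The same 100k input tokens at a 0% and a 90% cache-hit rate.
const coldSession: TokenUsage = {
  uncachedInputTokens: 100000,
  cacheWriteTokens: 0,
  cacheReadTokens: 0,
};
const warmSession: TokenUsage = {
  uncachedInputTokens: 10000,
  cacheWriteTokens: 0,
  cacheReadTokens: 90000,
};
const costRatio = relativeInputCost(coldSession) / relativeInputCost(warmSession);
```

Under these multipliers `costRatio` comes out at 100000 / 19000, roughly 5.3. The real gap depends on the hit rate and on how much of a session also pays the write premium, which is what the cache-hit queries are meant to establish.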
diff --git a/docs/save-mode-explainer.md b/docs/save-mode-explainer.md
@@ -0,0 +1,166 @@
+# Save Mode — How It Works & Why It Matters
+
+## The Problem
+
+Every turn in an AI coding session has three cost components:
+
+```
+Cost per turn = model price × (input tokens + output tokens + thinking tokens)
+```
+
+Most of the time, the agent is doing routine work — reading a file, running a test, making a small edit — that does not need the most expensive model or maximum thinking depth. Save Mode taps into that slack.
+
+---
+
+## The Three Levers
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                         COST PER TURN                               │
+│                                                                     │
+│   [  model price  ]  ×  [  input tokens  +  output tokens         ]│
+│         ▲                       ▲                  ▲               │
+│         │                       │                  │               │
+│   Lever 1: downshift      Lever 3: cache     Lever 2: effort cap   │
+│   Opus → Sonnet (~3×)     TTL 1h reuse       + terse prompt        │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+| Lever | Where it runs | What it does |
+|---|---|---|
+| **1. Model downshift** | Frontend + Agent | In Max Savings, swaps `claude-opus` → `claude-sonnet-4-6` for new turns |
+| **2a. Effort cap** | Frontend + Agent | Caps extended thinking at `high` (Balanced) or `medium` (Max Savings), removing expensive `max`/`xhigh` think budgets |
+| **2b. Terse reminder** | Agent system prompt | Tells the agent to skip narration, avoid re-reads, and avoid subagents unless work fans out, so it emits fewer output tokens |
+| **3. Cache TTL upgrade** | LLM Gateway | Upgrades the ephemeral Anthropic cache to a 1-hour TTL, so long conversations reuse cached context at ~90% off input tokens |
+
+---
+
+## Save Mode Levels
+
+```
+                                    COST vs QUALITY
+                     ◀──────────────────────────────────────────▶
+                     More savings                   Full power
+
+  OFF ──────────────────────────────────────────────────────────▶
+       No changes. Full model, full effort, no terse reminder.
+       Gateway still upgrades cache TTL (always on once the alpha flag is on).
+
+  BALANCED ────────────────────────────────────────────────────▶
+       Keep model (no downshift). Cap effort at "high" (removes
+       xhigh/max think overhead). Add terse reminder.
+       Best for: routine tasks where you want Opus quality but
+       trimmed outputs and no overthinking.
+       Estimated savings: 20–40% on output tokens.
+
+  MAX SAVINGS ─────────────────────────────────────────────────▶
+       Downshift Opus → Sonnet. Cap effort at "medium". Add terse
+       reminder. Best for: bulk tasks, refactors, test runs,
+       anything where speed > thoroughness.
+       Estimated savings: 50–70% total.
+```
+
+---
+
+## Request Flow
+
+```
+User prompt
+    │
+    ▼
+┌──────────────────────────────────────────────────────────┐
+│                   PostHog Code (FE)                       │
+│                                                          │
+│  resolveSaveMode(mode, requestedModel, requestedEffort)  │
+│       ├─ effective model  (downshifted or same)          │
+│       ├─ effective effort (capped or same)               │
+│       ├─ systemReminder   (terse prompt or null)         │
+│       └─ telemetry props  ($ai_save_mode, baselines)     │
+└──────────────────────────────┬───────────────────────────┘
+                               │  model + effort + sysPrompt
+                               │  + x-posthog-property-* headers
+                               ▼
+┌──────────────────────────────────────────────────────────┐
+│               LLM Gateway (PostHog Cloud)                │
+│                                                          │
+│  1. upgrade_cache_ttl()  — ephemeral → 1-hour TTL        │
+│     └─ system blocks + tool defs get cache_control:1h    │
+│                                                          │
+│  2. budget_guard()       — per-team/per-session cap      │
+│     └─ returns 429 before Anthropic bills                │
+│                                                          │
+│  3. Anthropic API call with effective model + effort     │
+│                                                          │
+│  4. Stamp $ai_generation event with save_mode telemetry  │
+└──────────────────────────────┬───────────────────────────┘
+                               │
+                               ▼
+                      Anthropic / Bedrock
+```
+
+---
+
+## Why It Matters
+
+### For the user
+
+| Scenario | Without Save Mode | With Max Savings | Delta |
+|---|---|---|---|
+| Opus, effort=max, 10-turn session | ~$0.80 | ~$0.20 | **–75%** |
+| Opus, effort=high, 5-turn session | ~$0.25 | ~$0.10 | **–60%** |
+| Sonnet baseline, effort=medium | ~$0.08 | ~$0.05 | **–38%** |
+
+Users who run many tasks daily (CI-level usage) could cut their monthly bill from ~$150 to ~$40 (about –73%) on the same workload just by toggling a setting, without changing how they work.
+
+### For the app
+
+```
+Lower cost per task
+        │
+        ▼
+┌──────────────────────────────────────────────────────────┐
+│  Better unit economics                                   │
+│   → More headroom for generous free tier                 │
+│   → Lower break-even per seat on Pro plan               │
+│   → Ability to absorb spiky usage without margin shock   │
+└──────────────────────────────────────────────────────────┘
+        │
+        ▼
+PostHog can track this in its own product:
+   $ai_generation events → save_mode: "max_save"
+   baseline_model vs effective_model → cost_avoided estimate
+   Cache efficiency ratio → cache_savings_usd per session
+```
+
+The LLM Gateway already captures `$ai_generation` for every call. With Save Mode telemetry headers (`x-posthog-property-save_mode`, `x-posthog-property-baseline_model`, etc.), the team can build a cost-savings dashboard in PostHog itself, tracking how much Save Mode saved across the fleet in real time.
+
+---
+
+## Mermaid Flowchart (for slides / Notion)
+
+```mermaid
+flowchart TD
+    U([User enables Save Mode]) --> R{Mode?}
+
+    R -->|Off| A0[Full power\nNo changes]
+
+    R -->|Balanced| B1[Keep model\nCap effort → high\nAdd terse reminder]
+    B1 --> B2[~20–40% savings\non output tokens]
+
+    R -->|Max savings| C1[Downshift Opus → Sonnet\nCap effort → medium\nAdd terse reminder]
+    C1 --> C2[~50–70% total savings]
+
+    B2 --> GW[LLM Gateway]
+    C2 --> GW
+    A0 --> GW
+
+    GW --> T1[upgrade cache TTL\nephemeral → 1h]
+    T1 --> T2{budget guard\nper-team cap}
+
+    T2 -->|over cap| BLK([429 before Anthropic bills])
+    T2 -->|within cap| ANT[Anthropic API]
+
+    ANT --> T3[stamp $ai_generation\nwith save_mode telemetry]
+
+    T3 --> OUT([Response])
+```
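The level definitions in the explainer above can be sketched as a pure resolver. This is an illustration under assumptions, not the code in `saveMode.ts`: the effort ordering, the `claude-opus` prefix check, the reminder wording, and the name `resolveSaveModeSketch` are all hypothetical. Only the caps (`high` for Balanced, `medium` for Max Savings), the `claude-sonnet-4-6` target, and the `$ai_save_mode` / `$ai_baseline_model` property names come from the docs in this PR.

```typescript
type SaveMode = "off" | "balanced" | "max_save";
type Effort = "low" | "medium" | "high" | "xhigh" | "max";

// Assumed ordering of effort levels, cheapest first.
const EFFORT_ORDER: Effort[] = ["low", "medium", "high", "xhigh", "max"];

// Placeholder wording; the real TERSE_REMINDER text is not shown in this PR.
const TERSE_REMINDER =
  "Be terse. Trust prior tool results. Avoid subagents unless work fans out.";

interface ResolvedSaveMode {
  model: string;
  effort: Effort;
  systemReminder: string | null;
  telemetry: Record<string, string>;
}

/** Lowers the requested effort to the cap, never raises it. */
function capEffort(requested: Effort, cap: Effort): Effort {
  return EFFORT_ORDER.indexOf(requested) > EFFORT_ORDER.indexOf(cap)
    ? cap
    : requested;
}

function resolveSaveModeSketch(
  mode: SaveMode,
  requestedModel: string,
  requestedEffort: Effort,
): ResolvedSaveMode {
  if (mode === "off") {
    // Off: pass everything through untouched.
    return {
      model: requestedModel,
      effort: requestedEffort,
      systemReminder: null,
      telemetry: { $ai_save_mode: "off" },
    };
  }
  // Only Max Savings downshifts, and only when an Opus model was requested.
  const downshift =
    mode === "max_save" && requestedModel.indexOf("claude-opus") === 0;
  return {
    model: downshift ? "claude-sonnet-4-6" : requestedModel,
    effort: capEffort(requestedEffort, mode === "max_save" ? "medium" : "high"),
    systemReminder: TERSE_REMINDER,
    telemetry: { $ai_save_mode: mode, $ai_baseline_model: requestedModel },
  };
}
```

Keeping the resolver pure (no I/O, no store access) matches the "pure modules" pattern the cost-controls doc describes and makes each level easy to table-test.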
diff --git a/packages/agent/src/adapters/claude/claude-agent.ts b/packages/agent/src/adapters/claude/claude-agent.ts
@@ -1744,6 +1744,7 @@ export class ClaudeAcpAgent extends BaseAcpAgent {
         this.ensureLocalToolsConnected("guard-hook"),
       taskState,
       gatewayEnv: this.options?.gatewayEnv,
+      saveModeHeaders: meta?.saveModeHeaders,
       onTaskStateChange: async () => {
         await this.client.sessionUpdate({
           sessionId,

diff --git a/packages/agent/src/adapters/claude/session/options.ts b/packages/agent/src/adapters/claude/session/options.ts
@@ -88,6 +88,8 @@ export interface BuildOptionsParams {
   onTaskStateChange?: () => Promise<void>;
   /** Explicit gateway config — prevents global process.env mutation. */
   gatewayEnv?: GatewayEnv;
+  /** Newline-delimited x-posthog-property-* lines stamping save-mode telemetry on $ai_generation events. */
+  saveModeHeaders?: string;
 }
 
 export function buildSystemPrompt(
@@ -134,7 +136,7 @@ function buildMcpServers(
   };
 }
 
-function buildEnvironment(gateway?: GatewayEnv): Record<string, string> {
+function buildEnvironment(gateway?: GatewayEnv, saveModeHeaders?: string): Record<string, string> {
   // Custom HTTP headers reach the model only through the Claude CLI subprocess,
   // which reads them from this env var (newline-delimited `name: value` lines)
   // — the SDK has no direct header option. We finalize them here, the single
@@ -157,6 +159,9 @@ function buildEnvironment(gateway?: GatewayEnv): Record<string, string> {
   if (projectId) {
     headerLines.push(`x-posthog-property-team_id: ${projectId}`);
   }
+  if (saveModeHeaders) {
+    headerLines.push(saveModeHeaders);
+  }
   // Route to AWS Bedrock as a fallback when Anthropic returns 5xx
   headerLines.push("x-posthog-use-bedrock-fallback: true");
   const customHeaders = headerLines.join("\n");
@@ -443,7 +448,7 @@ export function buildSessionOptions(params: BuildOptionsParams): Options {
       params.mcpServers,
       loadUserClaudeJsonMcpServers(params.cwd, params.logger),
     ),
-    env: buildEnvironment(params.gatewayEnv),
+    env: buildEnvironment(params.gatewayEnv, params.saveModeHeaders),
     hooks: buildHooks(
       params.userProvidedOptions?.hooks,
       params.onModeChange,
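For context on what `saveModeHeaders` is expected to carry, here is a minimal sketch of producing the newline-delimited string and merging it the way `buildEnvironment` does. The helper `formatSaveModeHeaders` is hypothetical and not part of this PR; the `save_mode` and `baseline_model` property names follow the explainer doc, and the `team_id` value is a made-up placeholder.

```typescript
// Hypothetical helper: turns save-mode telemetry props into the
// newline-delimited `name: value` lines that `saveModeHeaders` carries.
function formatSaveModeHeaders(props: Record<string, string>): string {
  return Object.keys(props)
    // Skip empty values so no blank header is emitted.
    .filter((key) => props[key].trim() !== "")
    // Strip CR/LF so a value can never inject an extra header line.
    .map(
      (key) =>
        `x-posthog-property-${key}: ${props[key].replace(/[\r\n]+/g, " ").trim()}`,
    )
    .join("\n");
}

const saveModeHeaders = formatSaveModeHeaders({
  save_mode: "max_save",
  baseline_model: "claude-opus",
});

// Mirrors the shape of the buildEnvironment change: append, then join.
const headerLines: string[] = ["x-posthog-property-team_id: 123"];
if (saveModeHeaders) {
  headerLines.push(saveModeHeaders);
}
headerLines.push("x-posthog-use-bedrock-fallback: true");
const customHeaders = headerLines.join("\n");
```

Because the string is pushed as a single element and then joined with `"\n"`, a caller-supplied value with a trailing newline or an embedded blank line would reach the CLI as an empty header line, so sanitizing where the string is built (as sketched here) is one way to keep the final value well-formed.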

diff --git a/packages/agent/src/adapters/claude/types.ts b/packages/agent/src/adapters/claude/types.ts
@@ -177,6 +177,8 @@ export type NewSessionMeta = {
   channelMode?: boolean;
   jsonSchema?: Record<string, unknown> | null;
   mcpToolApprovals?: McpToolApprovals;
+  /** Newline-delimited x-posthog-property-* lines stamping save-mode telemetry on $ai_generation events. */
+  saveModeHeaders?: string;
   claudeCode?: {
     options?: Options;
     emitRawSDKMessages?: boolean | SDKMessageFilter[];

diff --git a/packages/agent/src/utils/gateway.ts b/packages/agent/src/utils/gateway.ts
@@ -60,6 +60,9 @@ export function buildGatewayPropertyHeaders(
 }
 
 function getGatewayBaseUrl(posthogHost: string): string {
+  const override = process.env.LLM_GATEWAY_BASE_URL;
+  if (override) return override.replace(/\/$/, "");
+
   const url = new URL(posthogHost);
   const hostname = url.hostname;
 

diff --git a/packages/core/src/save-mode/budget.test.ts b/packages/core/src/save-mode/budget.test.ts
@@ -0,0 +1,38 @@
+import { describe, expect, it } from "vitest";
+import { evaluateBudget } from "./budget";
+
+describe("evaluateBudget", () => {
+  it.each([
+    {
+      label: "disabled — no cap set (0)",
+      input: { monthlyBudgetUsd: 0, scopedSpendUsd: 5 },
+      expected: { status: "disabled", recommendedMode: "off", block: false },
+    },
+    {
+      label: "ok — under 70% threshold",
+      input: { monthlyBudgetUsd: 20, scopedSpendUsd: 5 },
+      expected: { status: "ok", block: false },
+    },
+    {
+      label: "warn — at 75% (>=70%)",
+      input: { monthlyBudgetUsd: 20, scopedSpendUsd: 15 },
+      expected: { status: "warn", recommendedMode: "balanced", block: false },
+    },
+    {
+      label: "engage — at 87.5% (>=85%)",
+      input: { monthlyBudgetUsd: 20, scopedSpendUsd: 17.5 },
+      expected: { status: "engage", recommendedMode: "max_save", block: false },
+    },
+    {
+      label: "blocked — at 110% (>=100%)",
+      input: { monthlyBudgetUsd: 20, scopedSpendUsd: 22 },
+      expected: { status: "blocked", block: true },
+    },
+  ])("$label", ({ input, expected }) => {
+    const r = evaluateBudget(input);
+    expect(r.status).toBe(expected.status);
+    if ("block" in expected) expect(r.block).toBe(expected.block);
+    if ("recommendedMode" in expected)
+      expect(r.recommendedMode).toBe(expected.recommendedMode);
+  });
+});
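The table-driven test above implies the threshold logic. A minimal sketch that satisfies those five cases is shown here. It is not the contents of `budget.ts`: the name `evaluateBudgetSketch`, the `usedFraction` field, and the `recommendedMode` values for the `ok` and `blocked` states (which the test leaves unasserted) are assumptions.

```typescript
type SaveMode = "off" | "balanced" | "max_save";
type BudgetStatus = "disabled" | "ok" | "warn" | "engage" | "blocked";

interface BudgetInput {
  monthlyBudgetUsd: number;
  scopedSpendUsd: number;
}

interface BudgetResult {
  status: BudgetStatus;
  recommendedMode: SaveMode;
  block: boolean;
  usedFraction: number;
}

// Thresholds taken from the test labels (>=70%, >=85%, >=100%).
const WARN_AT = 0.7;
const ENGAGE_AT = 0.85;
const BLOCK_AT = 1.0;

function evaluateBudgetSketch(input: BudgetInput): BudgetResult {
  const { monthlyBudgetUsd, scopedSpendUsd } = input;
  // A cap of 0 (or less) means no budget is configured.
  if (monthlyBudgetUsd <= 0) {
    return { status: "disabled", recommendedMode: "off", block: false, usedFraction: 0 };
  }
  const usedFraction = scopedSpendUsd / monthlyBudgetUsd;
  if (usedFraction >= BLOCK_AT) {
    // Assumption: recommend max_save while blocked; the test does not assert it.
    return { status: "blocked", recommendedMode: "max_save", block: true, usedFraction };
  }
  if (usedFraction >= ENGAGE_AT) {
    return { status: "engage", recommendedMode: "max_save", block: false, usedFraction };
  }
  if (usedFraction >= WARN_AT) {
    return { status: "warn", recommendedMode: "balanced", block: false, usedFraction };
  }
  // Assumption: nothing to recommend while comfortably under the cap.
  return { status: "ok", recommendedMode: "off", block: false, usedFraction };
}
```

Checking the highest threshold first keeps each branch a single comparison. Per the cost-controls doc, the gateway's `budget_guard` remains the authoritative hard cap, so a frontend evaluation like this one is advisory.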