119 changes: 119 additions & 0 deletions docs/save-mode-cost-controls.md
@@ -0,0 +1,119 @@
# Save Mode + Cost Controls (alpha)

An alpha feature that cuts LLM spend for both the user and PostHog. It is gated by the
`llm-gateway-cost-controls` feature flag (early-access **alpha** stage) and stays off for
everyone until they opt in. Existing names are kept; the stage is called "alpha", not "prototype".

## The honest framing

- The gateway already **prices** every call and passes caching through, but pricing
≠ minimizing. A 0%-cache-hit session and a 90% one both price correctly; the
uncached one costs several times more on input, since cached reads bill at ~10% of the base input price.
- **Real saving** (money leaves the total bill): cache efficiency, lower
effort/verbosity, batch discount.
- **Pricing/substitution** (not real saving): model downshift — and on metered
billing it can *reduce* PostHog revenue. Save mode is mainly an
acquisition/retention lever, not a margin lever.
- Settle first: **what's the real `posthog_code` cache-hit rate?** It decides
whether this is a savings project or a budget-UX project. Answered by the
queries below over existing telemetry (no new code).
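
To make the cache-hit point concrete, the sketch below estimates the relative input cost at a given hit rate. The multipliers (cached reads at 0.1× and 5-minute cache writes at 1.25× the base input price) are assumptions taken from Anthropic's published prompt-caching pricing, and `relativeInputCost` is a hypothetical helper, not part of this PR. It also assumes every token that misses the cache is written to it.

```typescript
// Assumed price multipliers relative to the base input-token price.
const CACHE_READ_MULTIPLIER = 0.1; // cached reads
const CACHE_WRITE_MULTIPLIER = 1.25; // 5-minute cache writes

/**
 * Relative input cost per token for a cache-hit rate between 0 and 1.
 * Simplification: every token that misses the cache is written to it.
 */
function relativeInputCost(cacheHitRate: number): number {
  if (cacheHitRate < 0 || cacheHitRate > 1) {
    throw new RangeError("cacheHitRate must be between 0 and 1");
  }
  const missRate = 1 - cacheHitRate;
  return (
    cacheHitRate * CACHE_READ_MULTIPLIER + missRate * CACHE_WRITE_MULTIPLIER
  );
}

const cold = relativeInputCost(0); // every token pays the write price
const warm = relativeInputCost(0.9); // 90% of tokens are cached reads
console.log((cold / warm).toFixed(1)); // "5.8"
```

Under these assumptions both sessions are priced correctly, yet the cold one pays almost six times as much for the same input. Output tokens are unaffected by caching, so the gap on the total bill is smaller.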

## Where the code lives (real homes, tested)

**FE — `packages/core/src/save-mode/`** (pure modules, same pattern as
`billing/usageDisplay.ts`; Biome + Vitest 10/10 + `tsc` clean):
- `saveMode.ts` — `resolveSaveMode()`: (mode + requested model/effort) →
effective model/effort + terseness reminder + telemetry props.
- `budget.ts` — `evaluateBudget()`: month-to-date spend vs cap →
ok/warn/engage/blocked + recommended mode.
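
As a rough illustration of what `evaluateBudget()` computes, here is a minimal sketch. The 70% / 85% / 100% thresholds and the `disabled`, `warn` and `engage` recommendations come from the labels and expectations in `budget.test.ts` in this PR. The recommended mode for `ok` and `blocked` is not pinned down by those tests, so the values used here are assumptions, as is the function name `evaluateBudgetSketch`.

```typescript
type SaveMode = "off" | "balanced" | "max_save";
type BudgetStatus = "disabled" | "ok" | "warn" | "engage" | "blocked";

interface BudgetInput {
  monthlyBudgetUsd: number; // 0 means "no cap set"
  scopedSpendUsd: number; // month-to-date spend
}

interface BudgetResult {
  status: BudgetStatus;
  recommendedMode: SaveMode;
  block: boolean;
}

// Thresholds as fractions of the monthly cap (from the test labels).
const WARN_AT = 0.7;
const ENGAGE_AT = 0.85;
const BLOCK_AT = 1.0;

function evaluateBudgetSketch(input: BudgetInput): BudgetResult {
  const { monthlyBudgetUsd, scopedSpendUsd } = input;
  if (monthlyBudgetUsd <= 0) {
    return { status: "disabled", recommendedMode: "off", block: false };
  }
  const used = scopedSpendUsd / monthlyBudgetUsd;
  if (used >= BLOCK_AT) {
    // Assumption: the tests only check `block`, not the mode, for this state.
    return { status: "blocked", recommendedMode: "max_save", block: true };
  }
  if (used >= ENGAGE_AT) {
    return { status: "engage", recommendedMode: "max_save", block: false };
  }
  if (used >= WARN_AT) {
    return { status: "warn", recommendedMode: "balanced", block: false };
  }
  // Assumption: under the warn threshold nothing is recommended.
  return { status: "ok", recommendedMode: "off", block: false };
}
```

Keeping the function pure (spend and cap in, status out) is what lets the table-driven Vitest cases cover every threshold without mocking billing.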

**BE — `posthog` repo, `services/llm-gateway/`** (ruff + mypy --strict + pytest 21):
- `src/llm_gateway/cost_efficiency.py` — cache-hit ratio + busted-session detector + savings math.
- `src/llm_gateway/batch_routing.py` — which products route through the 50%-off Batch API.
- `src/llm_gateway/budget_guard.py` — authoritative hard-cap gate (fail-open; never kills in-flight).
- `src/llm_gateway/cost_controls.py` — the **alpha flag gate** (`cost_controls_enabled`), off by default.
- `cost-queries/cache_hit_ratio.promql`, `cost-queries/cost_analytics.hogql`.

**Flag — `frontend/src/lib/constants.tsx`**: `LLM_GATEWAY_COST_CONTROLS = 'llm-gateway-cost-controls'`.

## The alpha loop

EarlyAccessFeature at **alpha** stage (created at runtime in PostHog) →
opted-in users get the `llm-gateway-cost-controls` flag → the Code app shows the
save-mode UI and forwards `x-posthog-flag-llm-gateway-cost-controls: true` → the
gateway's `cost_controls_enabled(get_posthog_flags())` returns true → behavior
applies. Everyone else: untouched.
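
The header hand-off in this loop can be sketched as follows. The real gate is the Python `cost_controls_enabled` in the gateway; the two TypeScript helpers below (`buildFlagHeaders`, `costControlsEnabledSketch`) are hypothetical mirrors written only to show the shape of the contract, and the case-insensitive `"true"` comparison is an assumption.

```typescript
const COST_CONTROLS_FLAG = "llm-gateway-cost-controls";
const COST_CONTROLS_HEADER = `x-posthog-flag-${COST_CONTROLS_FLAG}`;

/** Client side: forward the flag only for opted-in users. */
function buildFlagHeaders(
  enabledFlags: ReadonlySet<string>,
): Record<string, string> {
  return enabledFlags.has(COST_CONTROLS_FLAG)
    ? { [COST_CONTROLS_HEADER]: "true" }
    : {};
}

/** Gateway side (mirror of the Python gate): a missing header means "off". */
function costControlsEnabledSketch(
  headers: Record<string, string | undefined>,
): boolean {
  return headers[COST_CONTROLS_HEADER]?.trim().toLowerCase() === "true";
}
```

Because the absence of the header is treated as "off", users outside the early-access cohort never reach any of the cost-control code paths.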

## What's left (needs a running stack + review)

1. **Gateway request path** (`api/anthropic.py` → `_handle_anthropic_messages`):
call the gate, then `budget_guard` (needs a spend resolver like
`quota_resolver`) and `batch_routing` (needs the Anthropic SDK batch
submit/poll). Not landed blind — these change critical request handling.
2. **FE UI**: a save-mode toggle + budget meter in `packages/ui`, a `saveMode`
view pref in the settings store, read the alpha flag, and stamp
`$ai_save_mode` / `$ai_baseline_model` via `buildGatewayPropertyHeaders`.
3. **Create the alpha `EarlyAccessFeature`** (UI: Feature management → Early
access features; stage = alpha; linked flag key `llm-gateway-cost-controls`).

## Cross-check vs PostHog's agent-cost article

(posthog.com/blog/optimizing-agent-cost) — their hard-won lessons, mapped here.

**They validated, we operationalize.** Their #1 finding — cache writes cost ~12.5×
reads, so naive context-splitting backfires — is exactly what `cache_efficiency` /
`classify_session` detect (a "busted" session = paying the write premium for a
cache nobody reads). Their one-off benchmark becomes a standing signal here.
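
One way to encode the "busted session" idea is sketched below, as a hedged TypeScript illustration rather than the actual Python in `cost_efficiency.py`. The break-even rule is an assumption: a session counts as busted when the savings from cached reads (0.9× the base price per token) do not cover the premium paid for cache writes (0.25× per token). The field names mirror Anthropic's usage counters in camelCase; `classifySessionSketch` and `cacheHitRatio` are hypothetical names.

```typescript
interface SessionUsage {
  inputTokens: number; // uncached input tokens
  cacheCreationInputTokens: number; // tokens written to the cache
  cacheReadInputTokens: number; // tokens served from the cache
}

// Assumed multipliers relative to the base input price.
const WRITE_PREMIUM = 0.25; // a write costs 1.25x, i.e. 0.25x extra
const READ_DISCOUNT = 0.9; // a read costs 0.1x, i.e. 0.9x saved

function cacheHitRatio(u: SessionUsage): number {
  const total =
    u.inputTokens + u.cacheCreationInputTokens + u.cacheReadInputTokens;
  return total === 0 ? 0 : u.cacheReadInputTokens / total;
}

function classifySessionSketch(
  u: SessionUsage,
): "uncached" | "busted" | "healthy" {
  if (u.cacheCreationInputTokens === 0 && u.cacheReadInputTokens === 0) {
    return "uncached";
  }
  const premiumPaid = u.cacheCreationInputTokens * WRITE_PREMIUM;
  const savingsEarned = u.cacheReadInputTokens * READ_DISCOUNT;
  // Busted: the session paid for cache writes that reads never repaid.
  return savingsEarned < premiumPaid ? "busted" : "healthy";
}
```

Run over per-session token sums from existing telemetry, a classifier like this turns the article's one-off observation into a value that can be charted and alerted on.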

**Folded into save mode** (`TERSE_REMINDER`): trust prior tool results +
compacted summaries, don't re-read to re-verify (their "reduced bureaucratic
verification" + "avoid compaction cascades"); avoid subagents unless work fans
out (their "subagent elimination").

**What the article missed, that this flow adds:**
1. **Batch API (50% off)** for async/deterministic flows — absent from the
article; `batch_routing.py` applies it to exactly the scheduled, deterministic
"conclude"-style steps they describe.
2. **Continuous measurement, not one-off benchmarking** — they validate against
benchmarks by hand; a Signals scout over the cache-hit / busted-session
queries flags regressions (cache-busting, compaction cascades) automatically.
3. **Model tiering** — they hand-tune one model; the deterministic, low-judgment
sub-steps can run on a cheaper model (the save-mode downshift generalizes this).
4. **The 12.5× rule as an automated guardrail**, not human intuition — the
busted-session detector is the encoded version.
5. **User-facing budget caps** — the article is internal eng; `budget_guard` +
the save-mode toggle are the product layer.

## Re-exploration: what's already covered, and the next lever

**Already handled upstream (do not rebuild):**
- **Tool search / deferred MCP loading** — `ENABLE_TOOL_SEARCH: "auto:0"` in the
Claude adapter (`session/options.ts`); MCP tool schemas are offloaded behind
tool search, not inlined into every turn.
- **Per-component context cost** — `adapters/claude/context-breakdown.ts`
already estimates systemPrompt / tools / rules / skills / mcp / subagents /
conversation tokens.

**New lever built — cache TTL (the idle-expiry gap):**
- `services/llm-gateway/src/llm_gateway/cache_ttl.py` (`upgrade_cache_ttl`):
upgrades the SDK's ephemeral cache breakpoints to a **1-hour TTL** for
interactive products (`posthog_code`, `slack_app`), so think-time gaps > 5 min
stop forcing full cache rewrites — the exact 5–15 min idle-expiry the article
flagged. Pure transform, 6 tests green; gated upstream by `cost_controls`.
Neither the article nor our prior flow had this.
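
The transform described above can be sketched in TypeScript as follows. The actual implementation is the Python `upgrade_cache_ttl`; this version is only an illustration. It assumes the request follows Anthropic's Messages API shape, where cacheable blocks carry `cache_control: { type: "ephemeral" }` and an optional `ttl` of `"5m"` or `"1h"`, and it only touches `system` blocks and `tools`. All type and function names here are hypothetical.

```typescript
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };

interface CacheableBlock {
  cache_control?: CacheControl;
  [key: string]: unknown;
}

interface MessagesRequestSketch {
  system?: string | CacheableBlock[];
  tools?: CacheableBlock[];
  [key: string]: unknown;
}

// Products named in this document as interactive.
const INTERACTIVE_PRODUCTS = new Set(["posthog_code", "slack_app"]);

function upgradeBlock(block: CacheableBlock): CacheableBlock {
  const cc = block.cache_control;
  if (!cc || cc.type !== "ephemeral") return block; // no breakpoint here
  return { ...block, cache_control: { ...cc, ttl: "1h" } };
}

/** Pure transform: returns a new request and never mutates the input. */
function upgradeCacheTtlSketch(
  request: MessagesRequestSketch,
  product: string,
): MessagesRequestSketch {
  if (!INTERACTIVE_PRODUCTS.has(product)) return request;
  const upgraded: MessagesRequestSketch = { ...request };
  if (Array.isArray(request.system)) {
    upgraded.system = request.system.map(upgradeBlock);
  }
  if (request.tools) {
    upgraded.tools = request.tools.map(upgradeBlock);
  }
  return upgraded;
}
```

Anthropic prices 1-hour cache writes higher than 5-minute writes, so the upgrade pays off only when the longer window actually produces extra reads. That trade-off is one reason to limit it to interactive products with real think-time gaps.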

**Candidates found, not built (need a judgment call / SDK check):**
- **Context editing** (`clear_tool_uses`) — prune stale tool results from long
sessions. The Claude Agent SDK may already compact; verify before adding.
- **Enrichment token cost** — the read-enrichment hook injects PostHog
annotations into file reads (tokens every read). Could be gated off in
`max_save` (trades the outcome-aware value for tokens).
- **Surface `context-breakdown` in the cost UI** — the data already exists;
expose "where your tokens go" and flag bloat (skills / rules / mcp resident
size) so users can trim.

## Open questions

1. Actual `posthog_code` cache-hit rate today (run `cost_analytics.hogql` query 3).
2. Is `getPersonalSpendAnalysis` cheap enough to poll month-to-date, or do we
need a cached "spend so far" endpoint?
166 changes: 166 additions & 0 deletions docs/save-mode-explainer.md
@@ -0,0 +1,166 @@
# Save Mode — How It Works & Why It Matters

## The Problem

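Every turn in an AI coding session is billed on three token counts, each scaled by the model's per-token price (simplified here to a single rate, although input and output tokens are priced differently):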

```
Cost per turn = model price × (input tokens + output tokens + thinking tokens)
```

Most of the time, the agent is doing routine work — reading a file, running a test, making a small edit — that does not need the most expensive model or maximum thinking depth. Save Mode taps into that slack.

---

## The Three Levers

```
┌─────────────────────────────────────────────────────────────────────┐
│ COST PER TURN │
│ │
│ [ model price ] × [ input tokens + output tokens ]│
│ ▲ ▲ ▲ │
│ │ │ │ │
│ Lever 1: downshift Lever 3: cache Lever 2: effort cap │
│ Opus → Sonnet (~3×) TTL 1h reuse + terse prompt │
└─────────────────────────────────────────────────────────────────────┘
```

| Lever | Where it runs | What it does |
|---|---|---|
| **Model downshift** | Frontend + Agent | Swaps `claude-opus` → `claude-sonnet-4-6` for new turns |
| **Effort cap** | Frontend + Agent | Caps extended thinking at `high` (Balanced) or `medium` (Max Savings), removing the expensive `max`/`xhigh` think budgets |
| **Terse reminder** | Agent system prompt | Tells the agent to skip narration, avoid re-reads, skip subagents — fewer output tokens |
| **Cache TTL upgrade** | LLM Gateway | Upgrades ephemeral Anthropic cache to 1-hour TTL — long conversations reuse cached context for ~90% off input tokens |

---

## Save Mode Levels

```
COST vs QUALITY
◀──────────────────────────────────────────▶
More savings Full power

OFF ──────────────────────────────────────────────────────────▶
No changes. Full model, full effort, no terse reminder.
Gateway still upgrades cache TTL whenever cost controls are enabled.

BALANCED ────────────────────────────────────────────────────▶
Keep model (no downshift). Cap effort at "high" (removes
xhigh/max think overhead). Add terse reminder.
Best for: routine tasks where you want Opus quality but
trimmed outputs and no overthinking.
Estimated savings: 20–40% on output tokens.

MAX SAVINGS ─────────────────────────────────────────────────▶
Downshift Opus → Sonnet. Cap effort at "medium". Add terse
reminder. Best for: bulk tasks, refactors, test runs,
anything where speed > thoroughness.
Estimated savings: 50–70% total.
```
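
The three levels can be expressed as a small pure function. The sketch below follows the rules in the diagram above (Balanced keeps the model and caps effort at `high`; Max Savings downshifts Opus to Sonnet and caps effort at `medium`), but it is not the shipped `resolveSaveMode()`. The effort ordering (including a `low` level), the `claude-opus` prefix check and the reminder text are assumptions.

```typescript
type SaveMode = "off" | "balanced" | "max_save";
type Effort = "low" | "medium" | "high" | "xhigh" | "max";

// Assumed ordering from cheapest to most expensive thinking budget.
const EFFORT_ORDER: readonly Effort[] = ["low", "medium", "high", "xhigh", "max"];
const DOWNSHIFT_TARGET = "claude-sonnet-4-6";
// Placeholder text; the real TERSE_REMINDER wording is not shown in this doc.
const TERSE_REMINDER_PLACEHOLDER =
  "Be terse. Trust prior tool results. Avoid subagents unless work fans out.";

interface ResolvedSaveMode {
  model: string;
  effort: Effort;
  systemReminder: string | null;
  telemetry: Record<string, string>;
}

/** Returns the lower of the requested effort and the cap. */
function capEffort(requested: Effort, cap: Effort): Effort {
  return EFFORT_ORDER.indexOf(requested) > EFFORT_ORDER.indexOf(cap)
    ? cap
    : requested;
}

function resolveSaveModeSketch(
  mode: SaveMode,
  requestedModel: string,
  requestedEffort: Effort,
): ResolvedSaveMode {
  const telemetry = {
    $ai_save_mode: mode,
    $ai_baseline_model: requestedModel,
  };
  if (mode === "off") {
    return { model: requestedModel, effort: requestedEffort, systemReminder: null, telemetry };
  }
  const downshift = mode === "max_save" && requestedModel.startsWith("claude-opus");
  return {
    model: downshift ? DOWNSHIFT_TARGET : requestedModel,
    effort: capEffort(requestedEffort, mode === "balanced" ? "high" : "medium"),
    systemReminder: TERSE_REMINDER_PLACEHOLDER,
    telemetry,
  };
}
```

Capping instead of overwriting matters: a user who already requested a lower effort than the cap keeps their own setting.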

---

## Request Flow

```
User prompt
┌──────────────────────────────────────────────────────────┐
│ PostHog Code (FE) │
│ │
│ resolveSaveMode(mode, requestedModel, requestedEffort) │
│ ├─ effective model (downshifted or same) │
│ ├─ effective effort (capped or same) │
│ ├─ systemReminder (terse prompt or null) │
│ └─ telemetry props ($ai_save_mode, baselines) │
└──────────────────────────────┬───────────────────────────┘
│ model + effort + sysPrompt
│ + x-posthog-property-* headers
┌──────────────────────────────────────────────────────────┐
│ LLM Gateway (PostHog Cloud) │
│ │
│ 1. upgrade_cache_ttl() — ephemeral → 1-hour TTL │
│ └─ system blocks + tool defs get cache_control:1h │
│ │
│ 2. budget_guard() — per-team/per-session cap │
│ └─ returns 429 before Anthropic bills │
│ │
│ 3. Anthropic API call with effective model + effort │
│ │
│ 4. Stamp $ai_generation event with save_mode telemetry │
└──────────────────────────────┬───────────────────────────┘
Anthropic / Bedrock
```
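
Step 2 of the flow above, the budget guard, is described elsewhere in this PR as fail-open and as never killing in-flight requests. A minimal sketch of that behavior follows, written in TypeScript for illustration only (the real `budget_guard.py` is Python). The `SpendResolver` type, the decision shape and both function names are hypothetical.

```typescript
interface GuardDecision {
  allow: boolean;
  status: 200 | 429;
  reason: string;
}

/** Pure decision: `spendUsd === null` means the spend lookup failed. */
function decideBudget(spendUsd: number | null, capUsd: number): GuardDecision {
  if (capUsd <= 0) {
    return { allow: true, status: 200, reason: "no cap configured" };
  }
  if (spendUsd === null) {
    // Fail open: a broken spend lookup must not take the product down.
    return { allow: true, status: 200, reason: "spend unknown, failing open" };
  }
  if (spendUsd >= capUsd) {
    return { allow: false, status: 429, reason: "budget cap reached" };
  }
  return { allow: true, status: 200, reason: "under cap" };
}

type SpendResolver = (teamId: number) => Promise<number>;

/**
 * Runs once, before the upstream call is made, so a request that is
 * already streaming is never interrupted by the guard.
 */
async function budgetGuardSketch(
  teamId: number,
  capUsd: number,
  resolveSpend: SpendResolver,
): Promise<GuardDecision> {
  let spend: number | null;
  try {
    spend = await resolveSpend(teamId);
  } catch {
    spend = null;
  }
  return decideBudget(spend, capUsd);
}
```

Splitting the pure decision from the async lookup keeps the policy trivially testable and leaves the spend resolver, which the doc lists as still missing, as the only moving part.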

---

## Why It Matters

### For the user

| Scenario | Without Save Mode | With Max Savings | Delta |
|---|---|---|---|
| Opus, effort=max, 10-turn session | ~$0.80 | ~$0.20 | **–75%** |
| Opus, effort=high, 5-turn session | ~$0.25 | ~$0.10 | **–60%** |
| Sonnet baseline, effort=medium | ~$0.08 | ~$0.05 | **–38%** |

Users who run many tasks daily (CI-level usage) can cut their monthly bill from ~$150 to ~$40 on the same workload, without changing how they work — just toggling a setting.

### For the app

```
Lower cost per task
┌──────────────────────────────────────────────────────────┐
│ Better unit economics │
│ → More headroom for generous free tier │
│ → Lower break-even per seat on Pro plan │
│ → Ability to absorb spiky usage without margin shock │
└──────────────────────────────────────────────────────────┘
PostHog can track this in its own product:
$ai_generation events → save_mode: "max_save"
baseline_model vs effective_model → cost_avoided estimate
Cache efficiency ratio → cache_savings_usd per session
```

The LLM Gateway already captures `$ai_generation` for every call. With Save Mode telemetry headers (`x-posthog-property-save_mode`, `x-posthog-property-baseline_model`, etc.) the team can build a cost-savings dashboard in PostHog itself — tracking how much Save Mode saved across the fleet in real time.
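
The agent-side change in this PR passes these headers as one newline-delimited string (`saveModeHeaders`). A hedged sketch of building that string is below. The `save_mode` and `baseline_model` property names come from the paragraph above; `effective_model` is an assumed third property, and `buildSaveModeHeaderLines` is a hypothetical helper rather than the existing `buildGatewayPropertyHeaders`.

```typescript
interface SaveModeTelemetry {
  saveMode: string;
  baselineModel: string;
  effectiveModel: string;
}

/** Strip CR/LF so a value can never inject an extra header line. */
function sanitizeHeaderValue(value: string): string {
  return value.replace(/[\r\n]+/g, " ").trim();
}

/** Builds newline-delimited `x-posthog-property-*: value` lines. */
function buildSaveModeHeaderLines(t: SaveModeTelemetry): string {
  const props: Record<string, string> = {
    save_mode: t.saveMode,
    baseline_model: t.baselineModel,
    effective_model: t.effectiveModel, // assumed property name
  };
  return Object.entries(props)
    .map(([key, value]) => `x-posthog-property-${key}: ${sanitizeHeaderValue(value)}`)
    .join("\n");
}

console.log(
  buildSaveModeHeaderLines({
    saveMode: "max_save",
    baselineModel: "claude-opus",
    effectiveModel: "claude-sonnet-4-6",
  }),
);
```

Since the resulting string is appended verbatim to the other header lines, sanitizing newlines at this point is what keeps a malformed model name from smuggling in an unrelated header.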

---

## Mermaid Flowchart (for slides / Notion)

```mermaid
flowchart TD
U([User enables Save Mode]) --> R{Mode?}

R -->|Off| A0[Full power\nNo changes]

R -->|Balanced| B1[Keep model\nCap effort → high\nAdd terse reminder]
B1 --> B2[~20–40% savings\non output tokens]

R -->|Max savings| C1[Downshift Opus → Sonnet\nCap effort → medium\nAdd terse reminder]
C1 --> C2[~50–70% total savings]

B2 --> GW[LLM Gateway]
C2 --> GW
A0 --> GW

GW --> T1[upgrade cache TTL\nephemeral → 1h]
GW --> T2[budget guard\nper-team cap]
GW --> T3[stamp $ai_generation\nwith save_mode telemetry]

T1 --> ANT[Anthropic API]
T2 --> ANT
T3 --> ANT

ANT --> OUT([Response])
```
1 change: 1 addition & 0 deletions packages/agent/src/adapters/claude/claude-agent.ts
@@ -1744,6 +1744,7 @@ export class ClaudeAcpAgent extends BaseAcpAgent {
this.ensureLocalToolsConnected("guard-hook"),
taskState,
gatewayEnv: this.options?.gatewayEnv,
saveModeHeaders: meta?.saveModeHeaders,
onTaskStateChange: async () => {
await this.client.sessionUpdate({
sessionId,
9 changes: 7 additions & 2 deletions packages/agent/src/adapters/claude/session/options.ts
@@ -88,6 +88,8 @@ export interface BuildOptionsParams {
onTaskStateChange?: () => Promise<void>;
/** Explicit gateway config — prevents global process.env mutation. */
gatewayEnv?: GatewayEnv;
/** Newline-delimited x-posthog-property-* lines stamping save-mode telemetry on $ai_generation events. */
saveModeHeaders?: string;
}

export function buildSystemPrompt(
@@ -134,7 +136,7 @@ function buildMcpServers(
};
}

-function buildEnvironment(gateway?: GatewayEnv): Record<string, string> {
+function buildEnvironment(gateway?: GatewayEnv, saveModeHeaders?: string): Record<string, string> {
// Custom HTTP headers reach the model only through the Claude CLI subprocess,
// which reads them from this env var (newline-delimited `name: value` lines)
// — the SDK has no direct header option. We finalize them here, the single
@@ -157,6 +159,9 @@ function buildEnvironment(gateway?: GatewayEnv): Record<string, string> {
if (projectId) {
headerLines.push(`x-posthog-property-team_id: ${projectId}`);
}
if (saveModeHeaders) {
headerLines.push(saveModeHeaders);
}
// Route to AWS Bedrock as a fallback when Anthropic returns 5xx
headerLines.push("x-posthog-use-bedrock-fallback: true");
const customHeaders = headerLines.join("\n");
@@ -443,7 +448,7 @@ export function buildSessionOptions(params: BuildOptionsParams): Options {
params.mcpServers,
loadUserClaudeJsonMcpServers(params.cwd, params.logger),
),
-env: buildEnvironment(params.gatewayEnv),
+env: buildEnvironment(params.gatewayEnv, params.saveModeHeaders),
hooks: buildHooks(
params.userProvidedOptions?.hooks,
params.onModeChange,
2 changes: 2 additions & 0 deletions packages/agent/src/adapters/claude/types.ts
@@ -177,6 +177,8 @@ export type NewSessionMeta = {
channelMode?: boolean;
jsonSchema?: Record<string, unknown> | null;
mcpToolApprovals?: McpToolApprovals;
/** Newline-delimited x-posthog-property-* lines stamping save-mode telemetry on $ai_generation events. */
saveModeHeaders?: string;
claudeCode?: {
options?: Options;
emitRawSDKMessages?: boolean | SDKMessageFilter[];
3 changes: 3 additions & 0 deletions packages/agent/src/utils/gateway.ts
@@ -60,6 +60,9 @@ export function buildGatewayPropertyHeaders(
}

function getGatewayBaseUrl(posthogHost: string): string {
const override = process.env.LLM_GATEWAY_BASE_URL;
if (override) return override.replace(/\/$/, "");

const url = new URL(posthogHost);
const hostname = url.hostname;

38 changes: 38 additions & 0 deletions packages/core/src/save-mode/budget.test.ts
@@ -0,0 +1,38 @@
import { describe, expect, it } from "vitest";
import { evaluateBudget } from "./budget";

describe("evaluateBudget", () => {
it.each([
{
label: "disabled — no cap set (0)",
input: { monthlyBudgetUsd: 0, scopedSpendUsd: 5 },
expected: { status: "disabled", recommendedMode: "off", block: false },
},
{
label: "ok — under 70% threshold",
input: { monthlyBudgetUsd: 20, scopedSpendUsd: 5 },
expected: { status: "ok", block: false },
},
{
label: "warn — at 75% (>=70%)",
input: { monthlyBudgetUsd: 20, scopedSpendUsd: 15 },
expected: { status: "warn", recommendedMode: "balanced", block: false },
},
{
label: "engage — at 87.5% (>=85%)",
input: { monthlyBudgetUsd: 20, scopedSpendUsd: 17.5 },
expected: { status: "engage", recommendedMode: "max_save", block: false },
},
{
label: "blocked — at 110% (>=100%)",
input: { monthlyBudgetUsd: 20, scopedSpendUsd: 22 },
expected: { status: "blocked", block: true },
},
])("$label", ({ input, expected }) => {
const r = evaluateBudget(input);
expect(r.status).toBe(expected.status);
if ("block" in expected) expect(r.block).toBe(expected.block);
if ("recommendedMode" in expected)
expect(r.recommendedMode).toBe(expected.recommendedMode);
});
});