Add authentication#538
Conversation
936d665 to
cc1a4af
Compare
There was a problem hiding this comment.
Hey @taylordowns2000, this is genuinely strong work and a clever evolution of the concept. Using the existing api_key field as the credential is the right call, it gets us per-client keys with no Lightning code change. Encryption fails closed correctly, the single-flight + stale-while-revalidate cache is solid and well tested, and swapping the loopback exemption for a per-process internal token is a real security improvement. Two blockers before we enable this, then a couple of things I'd like you to confirm.
Blockers
-
.tool-versionsstill pinsbun 1.1.13, butauth.tsimports{ SQL } from "bun", which doesn't exist before Bun 1.2.0 (I checked,typeof SQLisundefinedon the current toolchain). WithINSTANCE_AUTH=true,new SQL(url)throws, gets caught, and the gate fails closed, so every external caller gets a 401. #536 bumped this tobun 1.3.14for exactly this reason and it needs to come along here. -
The pinned cross-language hash test from #536 got dropped. We now have three SHA-256 sites with nothing enforcing they agree, and any drift silently 401s every provisioned client. Please restore it. (inline)
Needs confirmation
-
The gate is all-or-nothing per Apollo process, so turning
INSTANCE_AUTHon for our shared production Apollo would 401 every instance we haven't provisioned yet. Totally fine for dedicated Apollo instances. But concerning for shared ones. I just want to confirm that's the plan. Also worth being precise that "no Lightning-side change" is true for code, but each instance's configured key value still has to be swapped to the minted credential. -
WS transport is fully disabled while auth is on, since the upgrade has no body to carry the key. Should be fine since Lightning uses POST and /stream, but can you confirm nothing still speaks WS?
-
No changeset entry on this one and we've been keeping those (#536 had one), worth adding before release.
| try { | ||
| sql = new SQL(url); | ||
| const rows = (await sql` | ||
| SELECT to_regclass('public.lightning_clients') AS t | ||
| `) as Array<{ t: string | null }>; |
There was a problem hiding this comment.
This is the path breaks on breaks on any environment respecting the pinned Bun version in .tools-version: SQL is undefined on <1.2.0, so new SQL(url) throws here. It's swallowed by the catch below, so there's no crash - just a silent fail-closed gate and a log line. The whole real-DB path (this + loadClients) has no CI coverage because every test goes through the __setLoaderForTest seam. Worth one integration test that exercises initAuth/loadClients against a real PG so this can't rot again.
| const apiKey = | ||
| typeof ctx.body?.api_key === "string" ? ctx.body.api_key.trim() : ""; | ||
| if (!apiKey) return unauthorized(ctx); | ||
|
|
||
| const lookup = lookupOverride ?? dbLookup; | ||
| const client = await lookup(hashToken(apiKey)); | ||
| if (!client) return unauthorized(ctx); |
There was a problem hiding this comment.
Semantics changed from #536: that PR recognized tokens and swapped keys but passed unknown callers through untouched, so several instances could share one Apollo and the un-provisioned ones kept working as today. Here, any unknown or missing api_key is a hard 401 when enabled - it's all-or-nothing per Apollo process. That's correct for a dedicated Apollo instance, but flipping INSTANCE_AUTH on our shared production Apollo would 401 every not-yet-provisioned instance. Worth really noting and paying attention to. Can you confirm this is what we really want ?
There was a problem hiding this comment.
Yes, absolutely. That's the idea of the initiative.
There was a problem hiding this comment.
Too big and too restrictive for now. The approach I would take and that I proposed in #536 was to unblock the current with the minimum and we discuss this and build it in another phase.
| def hash_token(token: str) -> str: | ||
| return hashlib.sha256(token.encode()).hexdigest() |
There was a problem hiding this comment.
The docstring says this "must match the hash Apollo computes in auth.ts" - but #538 dropped the pinned-hash test that enforced it (#536 had KNOWN_TOKEN/KNOWN_HASH asserted on both the TS and Python sides). With three independent SHA-256 sites now (this, auth.ts:62, provision_client.ts:34) and nothing pinning them, a drift would silently 401 every provisioned client - the exact "nightmare to debug" failure mode. Please restore the cross-language pinned test.
| } | ||
|
|
||
| const apiKey = randomBytes(32).toString("base64url"); | ||
| const authTokenHash = createHash("sha256").update(apiKey).digest("hex"); |
There was a problem hiding this comment.
Re-implements the hash inline instead of importing hashToken from auth.ts. Since this is the canonical provisioning path, a drift here vs. the runtime hasher silently locks clients out. Import it instead:
import { hashToken } from "../../platform/src/middleware/auth";
...
const authTokenHash = hashToken(apiKey);| // MULTI-PROCESS: when unset, each process mints its OWN token. apollo() self-calls | ||
| // hit 127.0.0.1:{port} and normally land on the same process, but if processes | ||
| // share a port (SO_REUSEPORT / clustering) a self-call can hit a sibling and 401. | ||
| // Set APOLLO_INTERNAL_TOKEN to the SAME value across processes in that case. | ||
| export const INTERNAL_HEADER = "x-apollo-internal"; | ||
| const INTERNAL_TOKEN = | ||
| process.env.APOLLO_INTERNAL_TOKEN ?? randomBytes(32).toString("hex"); | ||
| process.env.APOLLO_INTERNAL_TOKEN = INTERNAL_TOKEN; |
There was a problem hiding this comment.
Multi-process caveat is documented here - just flagging for the deploy: if prod runs Apollo with SO_REUSEPORT/clustering and APOLLO_INTERNAL_TOKEN unset, apollo() self-calls can hit a sibling process with a different token and 401. Confirm our topology is single-process, or set a shared APOLLO_INTERNAL_TOKEN.
|
|
||
| // In-memory client cache (keyed by token hash), refreshed on a TTL so DB changes | ||
| // are picked up within CACHE_TTL_MS. | ||
| const CACHE_TTL_MS = 60_000; |
There was a problem hiding this comment.
Confirming this is an accepted trade-off: a revoked/rotated credential keeps working for up to ~60s (until the cache TTL refreshes). Fine for rotation; just want it acknowledged for the security posture, since "revoke" isn't immediate without a restart.
There was a problem hiding this comment.
Yes, intentional. I don't think there's a benefit in hard cutoff.
| expect(body.type).toBe("UNAUTHORIZED"); | ||
| }); | ||
|
|
||
| it("rejects an unknown api_key with 401", async () => { |
There was a problem hiding this comment.
This breaks production
There was a problem hiding this comment.
That's correct! Did you catch up with @stuartc about this? Migration notes are in the PR
Adds a
lightning_clientstable so that Apollo can:It is designed as a MVP, and built to be opt-in.
To migrate
INSTANCE_AUTH=trueto your.envAPOLLO_ENC_KEYto your.env(or secret store) withopenssl rand -base64 32lightning_clientstable withset -a; . ./.env; set +a; psql "$POSTGRES_URL" -f services/_instance_auth/schema.sqlbun services/_instance_auth/provision_client.ts <some-identifier-123> <some-sk-ant-token-blah-456>AI_ASSISTANT_API_KEY(i.e., the secret!) to your lightning.env(or secret store)After this PR (what's next)
Related to @elias-ba 's concurrent proposal to enhance error handling on Lightning, my next goal is to let Lightning reliably tell "we rejected your api_key" from "your api_key is fine, but the Anthropic key we use for you failed / is over its limit." HTTP status collides (both are
401), so the discriminator is thetypefield, not the status code.Proposed contract (
typeis the discriminator)typeUNAUTHORIZEDAUTH_ERRORAuthenticationErrorRATE_LIMITretry_after)RateLimitErrorBILLING(new)billing_errortype + "credit balance too low"KEY_UNAVAILABLE(new)APOLLO_ENC_KEYmisconfig)auth.ts, not the client's faultRule:
UNAUTHORIZEDis reserved for credential rejection. Anything touching the Anthropic/LiteLLM key returnsAUTH_ERROR/RATE_LIMIT/BILLING, neverUNAUTHORIZED.AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):