Add authentication by taylordowns2000 · Pull Request #538 · OpenFn/apollo

taylordowns2000 · 2026-06-18T08:53:19Z

Adds a lightning_clients table so that Apollo can:

authenticate lightning instances who attempt to use its services
handle the issuance of ai service keys (e.g., anthropic API keys) and associate those keys with specific clients

It is designed as a MVP, and built to be opt-in.

To migrate

add INSTANCE_AUTH=true to your .env
add APOLLO_ENC_KEY to your .env (or secret store) with openssl rand -base64 32
add the lightning_clients table with set -a; . ./.env; set +a; psql "$POSTGRES_URL" -f services/_instance_auth/schema.sql
provision a token for any lightning instance you want to connect with bun services/_instance_auth/provision_client.ts <some-identifier-123> <some-sk-ant-token-blah-456>
Add this hashed/encrypted record to the Apollo DB
Add this un-hashed AI_ASSISTANT_API_KEY (i.e., the secret!) to your lightning .env (or secret store)
Repeat for as many instances as you'd like to connect.
Optionally, merge this PR so that Lightning users see better error messages if their tokens are rejected by Apollo and/or downstream AI providers: optional: nicer apollo errors for lightning users lightning#4879

After this PR (what's next)

Related to @elias-ba 's concurrent proposal to enhance error handling on Lightning, my next goal is to let Lightning reliably tell "we rejected your api_key" from "your api_key is fine, but the Anthropic key we use for you failed / is over its limit." HTTP status collides (both are 401), so the discriminator is the type field, not the status code.

Proposed contract (`type` is the discriminator)

`type`	HTTP	Meaning to client	Source
`UNAUTHORIZED`	401	Lightning api_key not recognised	The gate, only the gate
`AUTH_ERROR`	401	api_key fine; Anthropic key invalid/revoked	Anthropic/LiteLLM `AuthenticationError`
`RATE_LIMIT`	429	Anthropic key hit a rate/usage limit, retry later (`retry_after`)	Anthropic/LiteLLM `RateLimitError`
`BILLING` (new)	402	Anthropic key out of credit / over spend cap	`billing_error` type + "credit balance too low"
`KEY_UNAVAILABLE` (new)	503	Server-side: stored key can't be decrypted (`APOLLO_ENC_KEY` misconfig)	`auth.ts`, not the client's fault

Rule: UNAUTHORIZED is reserved for credential rejection. Anything touching the Anthropic/LiteLLM key returns AUTH_ERROR / RATE_LIMIT / BILLING, never UNAUTHORIZED.

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

elias-ba

Hey @taylordowns2000, this is genuinely strong work and a clever evolution of the concept. Using the existing api_key field as the credential is the right call, it gets us per-client keys with no Lightning code change. Encryption fails closed correctly, the single-flight + stale-while-revalidate cache is solid and well tested, and swapping the loopback exemption for a per-process internal token is a real security improvement. Two blockers before we enable this, then a couple of things I'd like you to confirm.

Blockers

.tool-versions still pins bun 1.1.13, but auth.ts imports { SQL } from "bun", which doesn't exist before Bun 1.2.0 (I checked, typeof SQL is undefined on the current toolchain). With INSTANCE_AUTH=true, new SQL(url) throws, gets caught, and the gate fails closed, so every external caller gets a 401. #536 bumped this to bun 1.3.14 for exactly this reason and it needs to come along here.
The pinned cross-language hash test from #536 got dropped. We now have three SHA-256 sites with nothing enforcing they agree, and any drift silently 401s every provisioned client. Please restore it. (inline)

Needs confirmation

The gate is all-or-nothing per Apollo process, so turning INSTANCE_AUTH on for our shared production Apollo would 401 every instance we haven't provisioned yet. Totally fine for dedicated Apollo instances. But concerning for shared ones. I just want to confirm that's the plan. Also worth being precise that "no Lightning-side change" is true for code, but each instance's configured key value still has to be swapped to the minted credential.
WS transport is fully disabled while auth is on, since the upgrade has no body to carry the key. Should be fine since Lightning uses POST and /stream, but can you confirm nothing still speaks WS?
No changeset entry on this one and we've been keeping those (#536 had one), worth adding before release.

elias-ba · 2026-06-22T08:29:40Z

+  try {
+    sql = new SQL(url);
+    const rows = (await sql`
+      SELECT to_regclass('public.lightning_clients') AS t
+    `) as Array<{ t: string | null }>;


This is the path breaks on breaks on any environment respecting the pinned Bun version in .tools-version: SQL is undefined on <1.2.0, so new SQL(url) throws here. It's swallowed by the catch below, so there's no crash - just a silent fail-closed gate and a log line. The whole real-DB path (this + loadClients) has no CI coverage because every test goes through the __setLoaderForTest seam. Worth one integration test that exercises initAuth/loadClients against a real PG so this can't rot again.

elias-ba · 2026-06-22T08:32:19Z

+  const apiKey =
+    typeof ctx.body?.api_key === "string" ? ctx.body.api_key.trim() : "";
+  if (!apiKey) return unauthorized(ctx);
+
+  const lookup = lookupOverride ?? dbLookup;
+  const client = await lookup(hashToken(apiKey));
+  if (!client) return unauthorized(ctx);


Semantics changed from #536: that PR recognized tokens and swapped keys but passed unknown callers through untouched, so several instances could share one Apollo and the un-provisioned ones kept working as today. Here, any unknown or missing api_key is a hard 401 when enabled - it's all-or-nothing per Apollo process. That's correct for a dedicated Apollo instance, but flipping INSTANCE_AUTH on our shared production Apollo would 401 every not-yet-provisioned instance. Worth really noting and paying attention to. Can you confirm this is what we really want ?

Yes, absolutely. That's the idea of the initiative.

Too big and too restrictive for now. The approach I would take and that I proposed in #536 was to unblock the current with the minimum and we discuss this and build it in another phase.

elias-ba · 2026-06-22T08:35:30Z

+def hash_token(token: str) -> str:
+    return hashlib.sha256(token.encode()).hexdigest()


The docstring says this "must match the hash Apollo computes in auth.ts" - but #538 dropped the pinned-hash test that enforced it (#536 had KNOWN_TOKEN/KNOWN_HASH asserted on both the TS and Python sides). With three independent SHA-256 sites now (this, auth.ts:62, provision_client.ts:34) and nothing pinning them, a drift would silently 401 every provisioned client - the exact "nightmare to debug" failure mode. Please restore the cross-language pinned test.

elias-ba · 2026-06-22T08:37:41Z

+}
+
+const apiKey = randomBytes(32).toString("base64url");
+const authTokenHash = createHash("sha256").update(apiKey).digest("hex");


Re-implements the hash inline instead of importing hashToken from auth.ts. Since this is the canonical provisioning path, a drift here vs. the runtime hasher silently locks clients out. Import it instead:

import { hashToken } from "../../platform/src/middleware/auth"; ... const authTokenHash = hashToken(apiKey);

elias-ba · 2026-06-22T08:38:43Z

+// MULTI-PROCESS: when unset, each process mints its OWN token. apollo() self-calls
+// hit 127.0.0.1:{port} and normally land on the same process, but if processes
+// share a port (SO_REUSEPORT / clustering) a self-call can hit a sibling and 401.
+// Set APOLLO_INTERNAL_TOKEN to the SAME value across processes in that case.
+export const INTERNAL_HEADER = "x-apollo-internal";
+const INTERNAL_TOKEN =
+  process.env.APOLLO_INTERNAL_TOKEN ?? randomBytes(32).toString("hex");
+process.env.APOLLO_INTERNAL_TOKEN = INTERNAL_TOKEN;


Multi-process caveat is documented here - just flagging for the deploy: if prod runs Apollo with SO_REUSEPORT/clustering and APOLLO_INTERNAL_TOKEN unset, apollo() self-calls can hit a sibling process with a different token and 401. Confirm our topology is single-process, or set a shared APOLLO_INTERNAL_TOKEN.

elias-ba · 2026-06-22T08:39:23Z

+
+// In-memory client cache (keyed by token hash), refreshed on a TTL so DB changes
+// are picked up within CACHE_TTL_MS.
+const CACHE_TTL_MS = 60_000;


Confirming this is an accepted trade-off: a revoked/rotated credential keeps working for up to ~60s (until the cache TTL refreshes). Fine for rotation; just want it acknowledged for the security posture, since "revoke" isn't immediate without a restart.

Yes, intentional. I don't think there's a benefit in hard cutoff.

josephjclark · 2026-06-22T15:55:18Z

+      expect(body.type).toBe("UNAUTHORIZED");
+    });
+
+    it("rejects an unknown api_key with 401", async () => {


This breaks production

That's correct! Did you catch up with @stuartc about this? Migration notes are in the PR

taylordowns2000 requested review from elias-ba and hanna-paasivirta June 18, 2026 09:03

taylordowns2000 added 6 commits June 19, 2026 09:41

concept

ff2d091

instance auth

9c08d9a

fix

de51e08

clean

75447e3

handle db stampede, encrypt at rest

67d1bc8

review

cc1a4af

taylordowns2000 force-pushed the api-token branch from 936d665 to cc1a4af Compare June 19, 2026 07:42

taylordowns2000 added 4 commits June 19, 2026 14:02

updates

4aa9b79

env

f04367c

slimmer readme

f91e1b2

trim

1254aa5

taylordowns2000 marked this pull request as ready for review June 19, 2026 14:57

elias-ba requested changes Jun 22, 2026

View reviewed changes

theroinaochieng added this to Core Jun 22, 2026

github-project-automation Bot moved this to New Issues in Core Jun 22, 2026

github-project-automation Bot moved this from New Issues to In review in Core Jun 22, 2026

theroinaochieng moved this from In review to In progress in Core Jun 22, 2026

josephjclark reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add authentication#538

Add authentication#538
taylordowns2000 wants to merge 10 commits into
mainfrom
api-token

taylordowns2000 commented Jun 18, 2026 •

edited

Loading

Uh oh!

elias-ba left a comment •

edited

Loading

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

taylordowns2000 Jun 22, 2026

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

elias-ba Jun 22, 2026

Uh oh!

taylordowns2000 Jun 22, 2026

Uh oh!

josephjclark Jun 22, 2026

Uh oh!

taylordowns2000 Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		def hash_token(token: str) -> str:
		return hashlib.sha256(token.encode()).hexdigest()

Conversation

taylordowns2000 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

To migrate

After this PR (what's next)

Proposed contract (type is the discriminator)

AI Usage

Uh oh!

elias-ba left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

taylordowns2000 commented Jun 18, 2026 •

edited

Loading

Proposed contract (`type` is the discriminator)

elias-ba left a comment •

edited

Loading