feat: Postgres data plane with Neon per-branch credential resolution by andybons · Pull Request #30 · majorcontext/gatekeeper

andybons · 2026-06-11T21:36:42Z

Summary

Adds a Postgres data plane to gatekeeper: a second listener speaking the Postgres wire protocol that lets a sandboxed client connect to arbitrary Neon databases (any project/branch) without any database secret in the sandbox.

The client connects with the real Neon hostname (minus password) and presents its gatekeeper run token as the Postgres password. Gatekeeper:

Terminates TLS with a CA-minted cert, reading the target endpoint from TLS SNI
Validates the run token → per-run context (constant-time, same trust model as the HTTP plane)
Mints the real per-branch password from the Neon API on the fly (TTL cache, invalidate+retry on rotation)
Completes SCRAM-SHA-256 upstream and relays the connection (blind pgproto3 message pump)

Threat-model win: today an exfiltrated Neon connection URL works from anywhere; after this, the only stealable thing in the sandbox is a run-scoped token that's useless outside gatekeeper.
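As a rough illustration of the run-token check in the flow above, the sketch below compares the presented Postgres password against known run tokens with `subtle.ConstantTimeCompare`. The `tokenStore`, `tokenEntry`, and `runContext` types are hypothetical stand-ins and not gatekeeper's actual types. Hashing before comparing is an assumption made here so that both inputs have equal length.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// runContext is a hypothetical stand-in for gatekeeper's per-run context.
type runContext struct {
	RunID string
}

// tokenEntry pairs the SHA-256 digest of a run token with its run context.
type tokenEntry struct {
	digest [sha256.Size]byte
	rc     runContext
}

// tokenStore is a hypothetical registry of active run tokens.
type tokenStore struct {
	entries []tokenEntry
}

// add registers a run token by storing only its digest.
func (s *tokenStore) add(token string, rc runContext) {
	s.entries = append(s.entries, tokenEntry{digest: sha256.Sum256([]byte(token)), rc: rc})
}

// lookup hashes the presented password and compares it against every stored
// digest with subtle.ConstantTimeCompare. Hashing first gives both inputs the
// same length, which matters because ConstantTimeCompare returns immediately
// when the lengths differ.
func (s *tokenStore) lookup(presented string) (runContext, bool) {
	got := sha256.Sum256([]byte(presented))
	var found runContext
	matched := false
	for _, e := range s.entries {
		if subtle.ConstantTimeCompare(got[:], e.digest[:]) == 1 {
			found, matched = e.rc, true
		}
	}
	return found, matched
}

func main() {
	s := &tokenStore{}
	s.add("run-token-abc", runContext{RunID: "run-1"})

	_, ok := s.lookup("run-token-abc")
	fmt.Println("valid run token accepted:", ok)

	_, ok = s.lookup("some-other-password")
	fmt.Println("other password accepted:", ok)
}
```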

Key components

credentialsource/neon.go — Neon endpoint-ID parsing + NeonResolver (per-branch password minting, TTL cache). Supports both account-scoped and project-scoped API keys (optional project: field).
proxy/postgres.go — PostgresServer listener (TLS termination, run-token auth, SCRAM upstream connector, bidirectional relay), PostgresCredentialResolver interface, StaticPostgresResolver.
config.go / gatekeeper.go — config schema + standalone-server wiring (listener lifecycle, resolver construction, CA validation).
Docs: README "Postgres data plane" section, examples/gatekeeper-postgres.yaml, AGENTS.md.
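To make the endpoint-ID parsing in credentialsource/neon.go more concrete, here is a minimal sketch. It assumes Neon hostnames of the form `ep-<name>[-pooler].<region>.<cloud>.neon.tech`; the function name `parseEndpointID` and its exact validation rules are illustrative and may differ from the PR's implementation.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseEndpointID extracts the Neon endpoint ID from a hostname such as the
// one carried in TLS SNI. Hypothetical helper: the accepted hostname shape is
// an assumption, not taken from the PR.
func parseEndpointID(host string) (string, error) {
	// Tolerate a host:port form; plain hostnames make SplitHostPort fail,
	// in which case the input is used unchanged.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	if !strings.HasSuffix(host, ".neon.tech") {
		return "", fmt.Errorf("not a Neon hostname: %q", host)
	}
	// The endpoint ID is the first DNS label, minus the optional pooler suffix.
	label, _, _ := strings.Cut(host, ".")
	label = strings.TrimSuffix(label, "-pooler")
	if !strings.HasPrefix(label, "ep-") || len(label) == len("ep-") {
		return "", fmt.Errorf("no endpoint ID in hostname: %q", host)
	}
	return label, nil
}

func main() {
	for _, h := range []string{
		"ep-cool-darkness-123456.us-east-2.aws.neon.tech",
		"ep-cool-darkness-123456-pooler.us-east-2.aws.neon.tech:5432",
		"example.com",
	} {
		id, err := parseEndpointID(h)
		fmt.Printf("%s -> id=%q err=%v\n", h, id, err)
	}
}
```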

Notable details

Routing is by SNI only; the embedder (e.g. moat) arranges DNS so *.neon.tech resolves to gatekeeper. v1 is a blind message relay (no SQL inspection).
Client auth uses cleartext-password inside gatekeeper's TLS (the token never crosses bare).
No credential value (Neon API key, branch password, run token) ever appears in logs or client-facing errors.
TCP keep-alive on both relay legs; the handshake deadline is cleared before the relay so long-running queries and idle sessions aren't bounded by it.

Test Plan

go test -race ./... — full suite green (unit + integration: SCRAM round-trips against a fake Postgres server, end-to-end via a real pgconn client, policy/retry/logging, config parsing, server wiring)
go vet ./... clean, gofmt clean, go build ./... clean
End-to-end against a real Neon database (Mastra project, PostgreSQL 17.10): connected with only the run token as password — select version(), current_user, table counts all returned; audit log shows status=200, credential_injected=true, zero secrets logged
Long-running query: pg_sleep(75) ran 76s through the relay and returned cleanly (verifies the handshake deadline is cleared and there's no statement/idle timeout)
Project-scoped API key: verified the resolver works with a least-privilege project-scoped key via the optional project: field

🤖 Generated with Claude Code

Neon can reassign a compute endpoint to a different branch (e.g. on branch reset). InvalidatePassword previously dropped only the cached password, so the retry re-fetched credentials using the stale branch ID and could resolve the wrong branch's password indefinitely. Drop the cached endpoint info too, making the invalidate-and-retry path fully self-healing at the cost of one extra project enumeration per rotation event.


claude · 2026-06-11T21:36:57Z

Claude finished @andybons's task in 5m 6s

Code Review

Gather PR context
Read changed files
Analyze for security issues, race conditions, bugs, and error handling gaps
Post review findings

Summary

This is a well-constructed Postgres data plane. The threat model is sound, and the security-sensitive code (constant-time token comparison, no credentials in logs/errors, mandatory TLS before any credential exchange, upstream certificate verification) is correct. Test coverage is solid: there are unit tests for every protocol error path, a SCRAM round-trip test against a genuine server-side SCRAM verifier, and an end-to-end pgx test through the full stack.

Five findings below, in priority order:

Findings

[MEDIUM] Startup parameters forwarded verbatim: replication not stripped (proxy/postgres.go:547)

The client's raw sm.Parameters map is forwarded verbatim in the StartupMessage sent upstream. A client can include replication = database (or replication = true) to request a WAL-streaming replication session. Neon supports logical replication on some plans; if the resolved role has replication privileges this gives the client read access to all data via the replication slot — a privilege the proxy was not designed to grant. The options key (-c guc=val) is lower-risk but should also be stripped. See inline comment.

[MEDIUM] NeonResolver endpoint-cache TOCTOU under concurrent invalidation (credentialsource/neon.go:87)

Between releasing the mutex after reading haveInfo (line 87) and reacquiring it to write r.endpoints (line 94), a concurrent InvalidatePassword call can clear both caches. A goroutine that read haveInfo = true before the invalidation skips findEndpoint and calls fetchPassword with stale branch info, then caches the old branch's password. After a Neon branch reassignment, this means one in-flight connection will re-poison the cache and require a second invalidation + retry cycle. Impact is bounded (one extra failed auth), but connectWithRetry is only designed for one retry, so the second failure surfaces to the client. See inline comment for a minimal fix.

[LOW] connectWithRetry uses a fresh context, detached from the client-side deadline (proxy/postgres.go:680)

A fresh 30 s budget is started unconditionally, regardless of how much time has already elapsed in handleConn. If the client-facing handshake consumed most of its own 30 s budget, the upstream dial can outlive the client-side deadline. The client then receives no error (the sendPGError flush silently fails on the dead socket), and the proxy can keep dialing upstream for up to 30 s on behalf of a client that is already gone. This is not a correctness bug, but it wastes server resources. Threading a context from handleConn would fix both the timeout budget and the resource leak. See inline comment.

[NOTE] pgproto3 concurrent Send/Receive relies on undocumented internal field disjointness (proxy/postgres.go:736)

The comment is honest and the -race end-to-end test already covers this path. The only additional ask: document (in a comment or CI note) which pinned jackc/pgx/v5 version was verified, so a future go get -u doesn't silently move past the safe snapshot. See inline comment for longer-term alternatives.

[COSMETIC] docs/plans/ committed to the repository (docs/plans/2026-06-11-postgres-data-plane-plan.md)

2100+ lines of ephemeral design notes that future readers can't easily distinguish from current documentation. Consider removing from the repo; the PR description already captures the key design decisions well. See inline comment.

What's correct and well done:

Constant-time token comparison (subtle.ConstantTimeCompare)
Plaintext connection refused before any credential is requested
Upstream TLS certificate fully verified (no InsecureSkipVerify)
Server signature verification in SCRAM (conv.Step on AuthenticationSASLFinal)
No credential values in log output or client-facing error messages
Auth failure SQLSTATE mapping is precise (28P01/28000 → errUpstreamAuthFailed, everything else passes through)
errUpstreamAuthFailed as a sentinel for the retry-once invalidation loop
Handshake deadline cleared before entering the relay — no timeout on long-running queries
TCP keep-alives on both legs of the relay
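The SQLSTATE mapping and retry-once behavior listed above can be sketched as follows. `errUpstreamAuthFailed` is the sentinel named in the review; `classifyUpstreamError` and `connectWithRetryOnce` are hypothetical helpers written for this example and do not reproduce the PR's `connectWithRetry` signature.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpstreamAuthFailed signals that the upstream rejected the credential,
// which is the only case where invalidating the cache and retrying can help.
var errUpstreamAuthFailed = errors.New("upstream authentication failed")

// classifyUpstreamError maps the SQLSTATE of an upstream ErrorResponse to an
// error. 28P01 (invalid_password) and 28000
// (invalid_authorization_specification) become the sentinel; everything else
// is returned as an ordinary error.
func classifyUpstreamError(sqlstate string) error {
	switch sqlstate {
	case "28P01", "28000":
		return errUpstreamAuthFailed
	default:
		return fmt.Errorf("upstream error (SQLSTATE %s)", sqlstate)
	}
}

// connectWithRetryOnce calls connect, and if it fails with the auth sentinel,
// invalidates the cached credential and tries exactly one more time.
func connectWithRetryOnce(connect func() error, invalidate func()) error {
	err := connect()
	if !errors.Is(err, errUpstreamAuthFailed) {
		return err // success, or an error that a fresh password cannot fix
	}
	invalidate()
	return connect()
}

func main() {
	attempts := 0
	err := connectWithRetryOnce(
		func() error {
			attempts++
			if attempts == 1 {
				return classifyUpstreamError("28P01") // stale cached password
			}
			return nil
		},
		func() { fmt.Println("cache invalidated") },
	)
	fmt.Println("attempts:", attempts, "err:", err)
}
```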

claude · 2026-06-11T21:40:46Z

+		return
+	}
+
+	s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, sm.Parameters)


[MEDIUM] Startup parameters forwarded verbatim — replication flag not stripped

sm.Parameters is passed as-is to serveAuthenticated and ultimately sent to the upstream in the StartupMessage. A client can include replication = database (or replication = true) in the startup message to request a replication session. If the Neon role has replication privileges (e.g. a branch owner on certain Neon plans), this would give the client WAL streaming access — a significantly elevated privilege the data plane isn't designed to grant.

The options parameter (-c guc=value list) is lower-risk but also worth filtering.

Suggested fix — strip sensitive parameters before forwarding:

Suggested change

-	s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, sm.Parameters)
+	forwardParams := make(map[string]string, len(sm.Parameters))
+	for k, v := range sm.Parameters {
+		switch k {
+		case "replication", "options":
+			// Never forward replication or server-side option overrides upstream.
+		default:
+			forwardParams[k] = v
+		}
+	}
+	s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, forwardParams)

claude · 2026-06-11T21:40:59Z

+		return cached.password, nil
+	}
+	info, haveInfo := r.endpoints[endpointID]
+	r.mu.Unlock()


[MEDIUM] TOCTOU race on endpoint cache under concurrent invalidation

Between releasing the lock here and reacquiring it to write r.endpoints (line 98), a concurrent InvalidatePassword call can delete the endpoint entry. If haveInfo was true (meaning we loaded stale info from before the invalidation), the goroutine skips findEndpoint and calls fetchPassword with the old branch. After a branch reassignment (InvalidatePassword called because the cached password was rejected), another in-flight goroutine that already loaded haveInfo = true will re-cache the old-branch password, requiring a second invalidation + retry cycle before the new branch's password is used.

The impact is bounded — at most one extra failed upstream auth attempt and one extra API call — but the retry loop in connectWithRetry is designed for exactly one retry, so this could cause a second unexpected auth failure on a branch-move.

One mitigation: re-check haveInfo under the lock after findEndpoint returns:

Suggested change

-	r.mu.Unlock()
+	info, haveInfo := r.endpoints[endpointID]
+	r.mu.Unlock()
+	if !haveInfo {
+		info, err = r.findEndpoint(ctx, endpointID)
+		if err != nil {
+			return "", err
+		}
+		r.mu.Lock()
+		if r.endpoints == nil {
+			r.endpoints = make(map[string]neonEndpointInfo)
+		}
+		// Re-check after re-acquiring: another goroutine may have invalidated
+		// the cache while findEndpoint was in-flight. Prefer the fresher entry.
+		if _, stillValid := r.endpoints[endpointID]; !stillValid {
+			r.endpoints[endpointID] = info
+		}
+		r.mu.Unlock()
+	}
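A different mitigation, not part of the suggestion above, is a generation counter: every invalidation bumps a counter, and a goroutine may only write a password into the cache if the counter still has the value it observed when it read the endpoint info. The sketch below illustrates that idea with hypothetical names (`genCache`, `snapshot`, `storePassword`); it is an assumption-laden example and does not reflect how NeonResolver is actually structured.

```go
package main

import (
	"fmt"
	"sync"
)

// genCache is a hypothetical cache guarded by a generation counter.
type genCache struct {
	mu        sync.Mutex
	gen       uint64
	branches  map[string]string // endpoint ID -> branch ID
	passwords map[string]string // endpoint ID -> password
}

func newGenCache() *genCache {
	return &genCache{branches: map[string]string{}, passwords: map[string]string{}}
}

// snapshot returns the cached branch (if any) together with the generation
// under which it was read.
func (c *genCache) snapshot(endpointID string) (branch string, ok bool, gen uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	branch, ok = c.branches[endpointID]
	return branch, ok, c.gen
}

// setBranch records the branch an endpoint is attached to.
func (c *genCache) setBranch(endpointID, branch string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.branches[endpointID] = branch
}

// storePassword caches pw only if no invalidation ran since gen was observed.
// It reports whether the write was accepted.
func (c *genCache) storePassword(endpointID, pw string, gen uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.gen != gen {
		return false // stale: derived from pre-invalidation endpoint info
	}
	c.passwords[endpointID] = pw
	return true
}

// invalidate drops both entries and bumps the generation.
func (c *genCache) invalidate(endpointID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.branches, endpointID)
	delete(c.passwords, endpointID)
	c.gen++
}

func main() {
	c := newGenCache()
	c.setBranch("ep-a", "br-old")

	// Goroutine A reads stale info, then an invalidation happens elsewhere.
	_, _, gen := c.snapshot("ep-a")
	c.invalidate("ep-a")

	// A's late write is rejected instead of re-poisoning the cache.
	fmt.Println("stale write accepted:", c.storePassword("ep-a", "pw-old", gen))
}
```

A single global counter is the simplest form; a per-endpoint counter would avoid rejecting writes for unrelated endpoints, at the cost of one more map.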

claude · 2026-06-11T21:41:10Z

+// after invalidating the (presumably stale) cached password. It returns the
+// authenticated upstream connection and the grant names for audit logging.
+func (s *PostgresServer) connectWithRetry(resolver PostgresCredentialResolver, host, user, database string, startupParams map[string]string) (*upstreamConn, []string, error) {
+	ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)


[LOW] Fresh context detaches from client-side handshake deadline

context.WithTimeout(context.Background(), postgresHandshakeTimeout) starts a fresh 30 s budget regardless of how much time has already elapsed in handleConn. The client-facing tlsConn still has the deadline set at line 495: if the handshake phase consumed 25 s (slow client), the client's connection will time out 5 s into the upstream dial, while this context permits up to 30 s more.

When the client's deadline fires mid-dial, the sendPGError in the error path silently fails to flush (the client socket is dead), so the client never sees an error response. The connection just drops. That's acceptable from a correctness standpoint, but it does mean connectWithRetry continues running for up to 30 s after the client is gone.

The cleaner fix is to use the caller's context when one is available:

Suggested change

-	ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)
+	ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)

Or thread the handshake context through from handleConn so the upstream timeout inherits the remaining budget:

// In handleConn: thread the context through
ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)
defer cancel()
// ... pass ctx down to serveAuthenticated and connectWithRetry

claude · 2026-06-11T21:41:20Z

+	// touch different memory. pgproto3 does NOT document Send/Receive as
+	// concurrency-safe; this was verified against the pinned v5.10.0 (via
+	// jackc/pgx/v5 v5.10.0). If pgproto3 is upgraded, re-verify this
+	// field-disjointness or split into per-direction objects over the same conn.


[NOTE] Undocumented pgproto3 concurrency contract is version-pinned

The comment is honest about this, and the -race tests do cover the relay path, which is the right mitigation. One additional safeguard worth considering: add a go.sum-level pin check or a CI assertion that jackc/pgx/v5 has not been bumped past the verified version. Without that, an innocent go get -u by a future contributor will silently move past the verified snapshot and may introduce a data race that passes -race in isolation but fires under load.

Longer-term, consider replacing the shared-object relay with two raw io.Copy goroutines over the underlying TCP connections — the buffered-bytes concern can be handled by draining the internal buffers first:

// After handshake: flush any bytes pgproto3 already buffered
// (e.g. the first DataRow the frontend's chunkReader may have consumed)
// then hand off to raw copies which are safe under concurrent use.

Not blocking, but worth a follow-up issue.

claude · 2026-06-11T21:41:29Z

@@ -0,0 +1,2110 @@
+# Postgres Data Plane Implementation Plan


[COSMETIC] Planning documents committed to the repository

docs/plans/2026-06-11-postgres-data-plane-plan.md (2110 lines) and docs/plans/2026-06-11-postgres-neon-design.md (167 lines) are ephemeral design artifacts. Committing them to main proliferates dead content — future readers can't tell if they reflect current behavior or a superseded plan — and adds >2000 lines to the diff that reviewers must wade through.

Consider either:

Removing them from this PR and preserving the content in the PR description (which is already quite good) or a GitHub wiki.

Adding docs/plans/ to .gitignore so the pattern is captured but the directory doesn't pollute the repo.

andybons added 27 commits June 11, 2026 14:04

docs: add postgres data plane with neon resolver design

1c458b5

chore: add gitignore for build artifact and worktrees

3e0a9b5

docs: add postgres data plane implementation plan

aeea15d

build: add pgproto3 and scram dependencies

ca409b1

feat(credentialsource): parse neon endpoint IDs from hostnames

69636aa

feat(credentialsource): add neon password resolver with TTL cache

ad1f525

feat(proxy): add postgres credential resolver interface and registration

2905862

fix(proxy): never fall back to global postgres resolvers for run contexts

32465a3

fix(proxy): strip port in postgres resolver lookup and document pattern contract

6caca82

test(proxy): add SCRAM-verifying fake postgres server for tests

bc50791

test(proxy): harden fake postgres server (deadlines, sync handling, sqlstate fidelity)

2e7ee64

feat(proxy): add postgres upstream connector with SCRAM auth

a2180e3

test(proxy): cover non-auth upstream error and missing-SCRAM-mechanism paths

13f69cc

feat(proxy): postgres listener with TLS termination and token auth

5f1af22

fix(proxy): log postgres accept-loop failures and bound handshake writes

4955d18

feat(proxy): postgres relay with policy check, retry, and audit logging

95b5b81

docs(proxy): document postgres relay concurrency invariant and clarify naming

5a156dd

feat(config): add postgres listener and credential config

e090b00

feat: wire postgres data plane into standalone server

68d9606

fix: tear down HTTP listener when postgres listener fails to start

9f59a91

docs: document postgres data plane and neon resolver

8eb3d25

docs(proxy): clarify postgres audit-log field semantics and policy scope

da70178

style: gofmt alignment after comment edit

d441eea

feat(credentialsource): support project-scoped neon API keys via optional project ID

8c3fe0d

docs: document optional project field for project-scoped neon keys

da17b6d

feat(proxy): enable TCP keep-alive on relayed postgres connections

30de5bf

andybons wants to merge 27 commits into main from feature/postgres-data-plane
