feat: Postgres data plane with Neon per-branch credential resolution#30
feat: Postgres data plane with Neon per-branch credential resolution#30andybons wants to merge 27 commits into
Conversation
Neon can reassign a compute endpoint to a different branch (e.g. on branch reset). InvalidatePassword previously dropped only the cached password, so the retry re-fetched credentials using the stale branch ID and could resolve the wrong branch's password indefinitely. Drop the cached endpoint info too, making the invalidate-and-retry path fully self-healing at the cost of one extra project enumeration per rotation event.
…qlstate fidelity)
|
Claude finished @andybons's task in 5m 6s —— View job Code Review
SummaryThis is a well-constructed Postgres data-plane. The threat model is sound, the security-sensitive code (constant-time token comparison, no credentials in logs/errors, mandatory TLS before any credential exchange, upstream certificate verification) is all correct. Test coverage is solid — there are unit tests for every protocol error path, a SCRAM round-trip test against a genuine server-side SCRAM verifier, and an end-to-end pgx test through the full stack. Four findings below, in priority order: Findings[MEDIUM] Startup parameters forwarded verbatim — The client's raw [MEDIUM] NeonResolver endpoint-cache TOCTOU under concurrent invalidation ( Between releasing the mutex after reading [LOW] A fresh 30 s budget is started unconditionally, regardless of elapsed time in [NOTE] pgproto3 concurrent Send/Receive relies on undocumented internal field disjointness ( The comment is honest and the [COSMETIC] 2100+ lines of ephemeral design notes that future readers can't easily distinguish from current documentation. Consider removing from the repo; the PR description already captures the key design decisions well. See inline comment. What's correct and well done:
|
| return | ||
| } | ||
|
|
||
| s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, sm.Parameters) |
There was a problem hiding this comment.
[MEDIUM] Startup parameters forwarded verbatim — replication flag not stripped
sm.Parameters is passed as-is to serveAuthenticated and ultimately sent to the upstream in the StartupMessage. A client can include replication = database (or replication = true) in the startup message to request a replication session. If the Neon role has replication privileges (e.g. a branch owner on certain Neon plans), this would give the client WAL streaming access — a significantly elevated privilege the data plane isn't designed to grant.
The options parameter (-c guc=value list) is lower-risk but also worth filtering.
Suggested fix — strip sensitive parameters before forwarding:
| s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, sm.Parameters) | |
| forwardParams := make(map[string]string, len(sm.Parameters)) | |
| for k, v := range sm.Parameters { | |
| switch k { | |
| case "replication", "options": | |
| // Never forward replication or server-side option overrides upstream. | |
| default: | |
| forwardParams[k] = v | |
| } | |
| } | |
| s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, forwardParams) |
| return cached.password, nil | ||
| } | ||
| info, haveInfo := r.endpoints[endpointID] | ||
| r.mu.Unlock() |
There was a problem hiding this comment.
[MEDIUM] TOCTOU race on endpoint cache under concurrent invalidation
Between releasing the lock here and reacquiring it to write r.endpoints (line 98), a concurrent InvalidatePassword call can delete the endpoint entry. If haveInfo was true (meaning we loaded stale info from before the invalidation), the goroutine skips findEndpoint and calls fetchPassword with the old branch. After a branch reassignment (InvalidatePassword called because the cached password was rejected), another in-flight goroutine that already loaded haveInfo = true will re-cache the old-branch password, requiring a second invalidation + retry cycle before the new branch's password is used.
The impact is bounded — at most one extra failed upstream auth attempt and one extra API call — but the retry loop in connectWithRetry is designed for exactly one retry, so this could cause a second unexpected auth failure on a branch-move.
One mitigation: re-check haveInfo under the lock after findEndpoint returns:
| r.mu.Unlock() | |
| info, haveInfo := r.endpoints[endpointID] | |
| r.mu.Unlock() | |
| if !haveInfo { | |
| info, err = r.findEndpoint(ctx, endpointID) | |
| if err != nil { | |
| return "", err | |
| } | |
| r.mu.Lock() | |
| if r.endpoints == nil { | |
| r.endpoints = make(map[string]neonEndpointInfo) | |
| } | |
| // Re-check after re-acquiring: another goroutine may have invalidated | |
| // the cache while findEndpoint was in-flight. Prefer the fresher entry. | |
| if _, stillValid := r.endpoints[endpointID]; !stillValid { | |
| r.endpoints[endpointID] = info | |
| } | |
| r.mu.Unlock() | |
| } |
| // after invalidating the (presumably stale) cached password. It returns the | ||
| // authenticated upstream connection and the grant names for audit logging. | ||
| func (s *PostgresServer) connectWithRetry(resolver PostgresCredentialResolver, host, user, database string, startupParams map[string]string) (*upstreamConn, []string, error) { | ||
| ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout) |
There was a problem hiding this comment.
[LOW] Fresh context detaches from client-side handshake deadline
context.WithTimeout(context.Background(), postgresHandshakeTimeout) starts a fresh 30 s budget regardless of how much time has already elapsed in handleConn. The client-facing tlsConn still has the deadline set at line 495 — if the handshake phase consumed 25 s (slow client), the client's connection will timeout 5 s into the upstream dial, while this context permits up to 30 s more.
When the client's deadline fires mid-dial, the sendPGError in the error path silently fails to flush (the client socket is dead), so the client never sees an error response. The connection just drops. That's acceptable from a correctness standpoint, but it does mean connectWithRetry continues running for up to 30 s after the client is gone.
The cleaner fix is to use the caller's context when one is available:
| ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout) | |
| ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout) |
Or thread the handshake context through from handleConn so the upstream timeout inherits the remaining budget:
// In handleConn — thread the context through
ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)
defer cancel()
// ... pass ctx down to serveAuthenticated and connectWithRetry| // touch different memory. pgproto3 does NOT document Send/Receive as | ||
| // concurrency-safe; this was verified against the pinned v5.10.0 (via | ||
| // jackc/pgx/v5 v5.10.0). If pgproto3 is upgraded, re-verify this | ||
| // field-disjointness or split into per-direction objects over the same conn. |
There was a problem hiding this comment.
[NOTE] Undocumented pgproto3 concurrency contract is version-pinned
The comment is honest about this, and the -race tests do cover the relay path, which is the right mitigation. One additional safeguard worth considering: add a go.sum-level pin check or a CI assertion that jackc/pgx/v5 has not been bumped past the verified version. Without that, an innocent go get -u by a future contributor will silently move past the verified snapshot and may introduce a data race that passes -race in isolation but fires under load.
Longer-term, consider replacing the shared-object relay with two raw io.Copy goroutines over the underlying TCP connections — the buffered-bytes concern can be handled by draining the internal buffers first:
// After handshake: flush any bytes pgproto3 already buffered
// (e.g. the first DataRow the frontend's chunkReader may have consumed)
// then hand off to raw copies which are safe under concurrent use.Not blocking, but worth a follow-up issue.
| @@ -0,0 +1,2110 @@ | |||
| # Postgres Data Plane Implementation Plan | |||
There was a problem hiding this comment.
[COSMETIC] Planning documents committed to the repository
docs/plans/2026-06-11-postgres-data-plane-plan.md (2110 lines) and docs/plans/2026-06-11-postgres-neon-design.md (167 lines) are ephemeral design artifacts. Committing them to main proliferates dead content — future readers can't tell if they reflect current behavior or a superseded plan — and adds >2000 lines to the diff that reviewers must wade through.
Consider either:
- Removing them from this PR and preserving the content in the PR description (which is already quite good) or a GitHub wiki.
- Adding
docs/plans/to.gitignoreso the pattern is captured but the directory doesn't pollute the repo.
Summary
Adds a Postgres data plane to gatekeeper: a second listener speaking the Postgres wire protocol that lets a sandboxed client connect to arbitrary Neon databases (any project/branch) without any database secret in the sandbox.
The client connects with the real Neon hostname (minus password) and presents its gatekeeper run token as the Postgres password. Gatekeeper:
Threat-model win: today an exfiltrated Neon connection URL works from anywhere; after this, the only stealable thing in the sandbox is a run-scoped token that's useless outside gatekeeper.
Key components
credentialsource/neon.go— Neon endpoint-ID parsing +NeonResolver(per-branch password minting, TTL cache). Supports both account-scoped and project-scoped API keys (optionalproject:field).proxy/postgres.go—PostgresServerlistener (TLS termination, run-token auth, SCRAM upstream connector, bidirectional relay),PostgresCredentialResolverinterface,StaticPostgresResolver.config.go/gatekeeper.go— config schema + standalone-server wiring (listener lifecycle, resolver construction, CA validation).examples/gatekeeper-postgres.yaml, AGENTS.md.Notable details
*.neon.techresolves to gatekeeper. v1 is a blind message relay (no SQL inspection).Test Plan
go test -race ./...— full suite green (unit + integration: SCRAM round-trips against a fake Postgres server, end-to-end via a real pgconn client, policy/retry/logging, config parsing, server wiring)go vet ./...clean,gofmtclean,go build ./...cleanselect version(),current_user, table counts all returned; audit log showsstatus=200,credential_injected=true, zero secrets loggedpg_sleep(75)ran 76s through the relay and returned cleanly (verifies the handshake deadline is cleared and there's no statement/idle timeout)project:field🤖 Generated with Claude Code