Skip to content

feat: Postgres data plane with Neon per-branch credential resolution#30

Open
andybons wants to merge 27 commits into
mainfrom
feature/postgres-data-plane
Open

feat: Postgres data plane with Neon per-branch credential resolution#30
andybons wants to merge 27 commits into
mainfrom
feature/postgres-data-plane

Conversation

@andybons

Copy link
Copy Markdown
Contributor

Summary

Adds a Postgres data plane to gatekeeper: a second listener speaking the Postgres wire protocol that lets a sandboxed client connect to arbitrary Neon databases (any project/branch) without any database secret in the sandbox.

The client connects with the real Neon hostname (minus password) and presents its gatekeeper run token as the Postgres password. Gatekeeper:

  1. Terminates TLS with a CA-minted cert, reading the target endpoint from TLS SNI
  2. Validates the run token → per-run context (constant-time, same trust model as the HTTP plane)
  3. Mints the real per-branch password from the Neon API on the fly (TTL cache, invalidate+retry on rotation)
  4. Completes SCRAM-SHA-256 upstream and relays the connection (blind pgproto3 message pump)

Threat-model win: today an exfiltrated Neon connection URL works from anywhere; after this, the only stealable thing in the sandbox is a run-scoped token that's useless outside gatekeeper.

Key components

  • credentialsource/neon.go — Neon endpoint-ID parsing + NeonResolver (per-branch password minting, TTL cache). Supports both account-scoped and project-scoped API keys (optional project: field).
  • proxy/postgres.goPostgresServer listener (TLS termination, run-token auth, SCRAM upstream connector, bidirectional relay), PostgresCredentialResolver interface, StaticPostgresResolver.
  • config.go / gatekeeper.go — config schema + standalone-server wiring (listener lifecycle, resolver construction, CA validation).
  • Docs: README "Postgres data plane" section, examples/gatekeeper-postgres.yaml, AGENTS.md.

Notable details

  • Routing is by SNI only; the embedder (e.g. moat) arranges DNS so *.neon.tech resolves to gatekeeper. v1 is a blind message relay (no SQL inspection).
  • Client auth uses cleartext-password inside gatekeeper's TLS (the token never crosses bare).
  • No credential value (Neon API key, branch password, run token) ever appears in logs or client-facing errors.
  • TCP keep-alive on both relay legs; the handshake deadline is cleared before the relay so long-running queries and idle sessions aren't bounded by it.

Test Plan

  • go test -race ./... — full suite green (unit + integration: SCRAM round-trips against a fake Postgres server, end-to-end via a real pgconn client, policy/retry/logging, config parsing, server wiring)
  • go vet ./... clean, gofmt clean, go build ./... clean
  • End-to-end against a real Neon database (Mastra project, PostgreSQL 17.10): connected with only the run token as password — select version(), current_user, table counts all returned; audit log shows status=200, credential_injected=true, zero secrets logged
  • Long-running query: pg_sleep(75) ran 76s through the relay and returned cleanly (verifies the handshake deadline is cleared and there's no statement/idle timeout)
  • Project-scoped API key: verified the resolver works with a least-privilege project-scoped key via the optional project: field

🤖 Generated with Claude Code

andybons added 27 commits June 11, 2026 14:04
Neon can reassign a compute endpoint to a different branch (e.g. on
branch reset). InvalidatePassword previously dropped only the cached
password, so the retry re-fetched credentials using the stale branch ID
and could resolve the wrong branch's password indefinitely. Drop the
cached endpoint info too, making the invalidate-and-retry path fully
self-healing at the cost of one extra project enumeration per rotation
event.
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Claude finished @andybons's task in 5m 6s —— View job


Code Review

  • Gather PR context
  • Read changed files
  • Analyze for security issues, race conditions, bugs, and error handling gaps
  • Post review findings

Summary

This is a well-constructed Postgres data-plane. The threat model is sound, the security-sensitive code (constant-time token comparison, no credentials in logs/errors, mandatory TLS before any credential exchange, upstream certificate verification) is all correct. Test coverage is solid — there are unit tests for every protocol error path, a SCRAM round-trip test against a genuine server-side SCRAM verifier, and an end-to-end pgx test through the full stack.

Four findings below, in priority order:


Findings

[MEDIUM] Startup parameters forwarded verbatim — replication not stripped (proxy/postgres.go:547) Fix this →

The client's raw sm.Parameters map is forwarded verbatim in the StartupMessage sent upstream. A client can include replication = database (or replication = true) to request a WAL-streaming replication session. Neon supports logical replication on some plans; if the resolved role has replication privileges this gives the client read access to all data via the replication slot — a privilege the proxy was not designed to grant. The options key (-c guc=val) is lower-risk but should also be stripped. See inline comment.

[MEDIUM] NeonResolver endpoint-cache TOCTOU under concurrent invalidation (credentialsource/neon.go:87) Fix this →

Between releasing the mutex after reading haveInfo (line 87) and reacquiring it to write r.endpoints (line 94), a concurrent InvalidatePassword call can clear both caches. A goroutine that read haveInfo = true before the invalidation skips findEndpoint and calls fetchPassword with stale branch info, then caches the old branch's password. After a Neon branch reassignment, this means one in-flight connection will re-poison the cache and require a second invalidation + retry cycle. Impact is bounded (one extra failed auth), but connectWithRetry is only designed for one retry, so the second failure surfaces to the client. See inline comment for a minimal fix.

[LOW] connectWithRetry uses a fresh context, detached from the client-side deadline (proxy/postgres.go:680) Fix this →

A fresh 30 s budget is started unconditionally, regardless of elapsed time in handleConn. If the client-facing handshake consumed most of its 30 s budget, the upstream dial can outlive the client-side deadline. The client receives no error (the sendPGError flush silently fails on the dead socket) and the proxy runs the dial for up to 30 s against a possibly-available upstream before giving up. Not a correctness bug, but wastes server resources. Threading a context from handleConn would fix both the timeout budget and the resource leak. See inline comment.

[NOTE] pgproto3 concurrent Send/Receive relies on undocumented internal field disjointness (proxy/postgres.go:736)

The comment is honest and the -race end-to-end test already covers this path. The only additional ask: document (in a comment or CI note) which pinned jackc/pgx/v5 version was verified, so a future go get -u doesn't silently move past the safe snapshot. See inline comment for longer-term alternatives.

[COSMETIC] docs/plans/ committed to the repository (docs/plans/2026-06-11-postgres-data-plane-plan.md)

2100+ lines of ephemeral design notes that future readers can't easily distinguish from current documentation. Consider removing from the repo; the PR description already captures the key design decisions well. See inline comment.


What's correct and well done:

  • Constant-time token comparison (subtle.ConstantTimeCompare)
  • Plaintext connection refused before any credential is requested
  • Upstream TLS certificate fully verified (no InsecureSkipVerify)
  • Server signature verification in SCRAM (conv.Step on AuthenticationSASLFinal)
  • No credential values in log output or client-facing error messages
  • Auth failure SQLSTATE mapping is precise (28P01/28000errUpstreamAuthFailed, everything else passes through)
  • errUpstreamAuthFailed as a sentinel for the retry-once invalidation loop
  • Handshake deadline cleared before entering the relay — no timeout on long-running queries
  • TCP keep-alives on both legs of the relay

Comment thread proxy/postgres.go
return
}

s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, sm.Parameters)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Startup parameters forwarded verbatim — replication flag not stripped

sm.Parameters is passed as-is to serveAuthenticated and ultimately sent to the upstream in the StartupMessage. A client can include replication = database (or replication = true) in the startup message to request a replication session. If the Neon role has replication privileges (e.g. a branch owner on certain Neon plans), this would give the client WAL streaming access — a significantly elevated privilege the data plane isn't designed to grant.

The options parameter (-c guc=value list) is lower-risk but also worth filtering.

Suggested fix — strip sensitive parameters before forwarding:

Suggested change
s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, sm.Parameters)
forwardParams := make(map[string]string, len(sm.Parameters))
for k, v := range sm.Parameters {
switch k {
case "replication", "options":
// Never forward replication or server-side option overrides upstream.
default:
forwardParams[k] = v
}
}
s.serveAuthenticated(tlsConn, backend, rc, sniHost, user, database, forwardParams)

Comment thread credentialsource/neon.go
return cached.password, nil
}
info, haveInfo := r.endpoints[endpointID]
r.mu.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] TOCTOU race on endpoint cache under concurrent invalidation

Between releasing the lock here and reacquiring it to write r.endpoints (line 98), a concurrent InvalidatePassword call can delete the endpoint entry. If haveInfo was true (meaning we loaded stale info from before the invalidation), the goroutine skips findEndpoint and calls fetchPassword with the old branch. After a branch reassignment (InvalidatePassword called because the cached password was rejected), another in-flight goroutine that already loaded haveInfo = true will re-cache the old-branch password, requiring a second invalidation + retry cycle before the new branch's password is used.

The impact is bounded — at most one extra failed upstream auth attempt and one extra API call — but the retry loop in connectWithRetry is designed for exactly one retry, so this could cause a second unexpected auth failure on a branch-move.

One mitigation: re-check haveInfo under the lock after findEndpoint returns:

Suggested change
r.mu.Unlock()
info, haveInfo := r.endpoints[endpointID]
r.mu.Unlock()
if !haveInfo {
info, err = r.findEndpoint(ctx, endpointID)
if err != nil {
return "", err
}
r.mu.Lock()
if r.endpoints == nil {
r.endpoints = make(map[string]neonEndpointInfo)
}
// Re-check after re-acquiring: another goroutine may have invalidated
// the cache while findEndpoint was in-flight. Prefer the fresher entry.
if _, stillValid := r.endpoints[endpointID]; !stillValid {
r.endpoints[endpointID] = info
}
r.mu.Unlock()
}

Comment thread proxy/postgres.go
// after invalidating the (presumably stale) cached password. It returns the
// authenticated upstream connection and the grant names for audit logging.
func (s *PostgresServer) connectWithRetry(resolver PostgresCredentialResolver, host, user, database string, startupParams map[string]string) (*upstreamConn, []string, error) {
ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Fresh context detaches from client-side handshake deadline

context.WithTimeout(context.Background(), postgresHandshakeTimeout) starts a fresh 30 s budget regardless of how much time has already elapsed in handleConn. The client-facing tlsConn still has the deadline set at line 495 — if the handshake phase consumed 25 s (slow client), the client's connection will timeout 5 s into the upstream dial, while this context permits up to 30 s more.

When the client's deadline fires mid-dial, the sendPGError in the error path silently fails to flush (the client socket is dead), so the client never sees an error response. The connection just drops. That's acceptable from a correctness standpoint, but it does mean connectWithRetry continues running for up to 30 s after the client is gone.

The cleaner fix is to use the caller's context when one is available:

Suggested change
ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)
ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)

Or thread the handshake context through from handleConn so the upstream timeout inherits the remaining budget:

// In handleConn — thread the context through
ctx, cancel := context.WithTimeout(context.Background(), postgresHandshakeTimeout)
defer cancel()
// ... pass ctx down to serveAuthenticated and connectWithRetry

Comment thread proxy/postgres.go
// touch different memory. pgproto3 does NOT document Send/Receive as
// concurrency-safe; this was verified against the pinned v5.10.0 (via
// jackc/pgx/v5 v5.10.0). If pgproto3 is upgraded, re-verify this
// field-disjointness or split into per-direction objects over the same conn.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NOTE] Undocumented pgproto3 concurrency contract is version-pinned

The comment is honest about this, and the -race tests do cover the relay path, which is the right mitigation. One additional safeguard worth considering: add a go.sum-level pin check or a CI assertion that jackc/pgx/v5 has not been bumped past the verified version. Without that, an innocent go get -u by a future contributor will silently move past the verified snapshot and may introduce a data race that passes -race in isolation but fires under load.

Longer-term, consider replacing the shared-object relay with two raw io.Copy goroutines over the underlying TCP connections — the buffered-bytes concern can be handled by draining the internal buffers first:

// After handshake: flush any bytes pgproto3 already buffered
// (e.g. the first DataRow the frontend's chunkReader may have consumed)
// then hand off to raw copies which are safe under concurrent use.

Not blocking, but worth a follow-up issue.

@@ -0,0 +1,2110 @@
# Postgres Data Plane Implementation Plan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[COSMETIC] Planning documents committed to the repository

docs/plans/2026-06-11-postgres-data-plane-plan.md (2110 lines) and docs/plans/2026-06-11-postgres-neon-design.md (167 lines) are ephemeral design artifacts. Committing them to main proliferates dead content — future readers can't tell if they reflect current behavior or a superseded plan — and adds >2000 lines to the diff that reviewers must wade through.

Consider either:

  • Removing them from this PR and preserving the content in the PR description (which is already quite good) or a GitHub wiki.
  • Adding docs/plans/ to .gitignore so the pattern is captured but the directory doesn't pollute the repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant