fix(redis): honour caller deadline in k8s Provision — stop pro-tier provision hang#52
Merged
…rovision hang

Provisioning a Redis cache for an authenticated Pro/Team team hung (>60s, never returned) while the anonymous provision returned in ~6s.

Root cause: K8sBackend.Provision derived its provisioning context from context.Background() with a hardcoded 5-minute timeout, completely discarding the incoming gRPC ctx. The Pro/Team tier uses a real 10Gi PVC (pvcMi=10240); when the block-storage attach stalls, the pod never reaches Ready and waitPodReady polled for the full redisK8sReadyTO (3m). Because the wait ignored the caller's ctx, even after the api gRPC call's 45s deadline fired, the provisioner kept blocking the handler (and left a half-built namespace, since the cancellation never reached the rollback path). The anonymous path never hit this: pvcMi=0 → emptyDir → pod Ready in seconds.

Fix:
- provisionContext() derives provCtx from the incoming ctx (so the api's deadline/cancellation propagates) while still imposing a hard server-side ceiling (redisK8sProvisionCeiling, 5m) for callers with no deadline. The effective wait is now min(caller deadline, redisK8sReadyTO, ceiling).
- waitPodReady checks ctx.Err() at the top of each poll and wraps the cancellation in a clear, mappable error.
- rollback now deletes the half-built namespace via a fresh 30s background ctx so cleanup still runs even when the incoming ctx is already cancelled.
- mapError classifies context.DeadlineExceeded/Canceled via errors.Is (not a fragile message substring) → retryable gRPC status (DeadlineExceeded / Unavailable) so the api soft-deletes + 503s instead of returning a hard 500.

A timed-out/cancelled provision now returns a clean gRPC error promptly and rolls back, never hangs.

Tests (internal/backend/redis/k8s_test.go, internal/server/maperror_test.go):
- TestProvision_ProTier_HonoursCallerDeadline (core regression guard: fails if Provision blocks past the caller deadline)
- TestProvision_AnonTier_HonoursCallerDeadline
- TestProvision_CallerCancel_FastFails
- TestProvisionContext_HonoursCallerDeadline / _CeilingBoundsNoDeadlineCaller
- TestMapError_ContextErrors_AreRetryable

Local gate green: go build ./... + go vet ./... + go test ./... -short -count=1 all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request on Jun 8, 2026
…ovision (#53)

The redis k8s backend was fixed for the pro-provision-hang bug (#52) but the postgres, mongo, AND queue backends still derived their provisioning context from context.Background(), discarding the caller's gRPC deadline. When the api caller's deadline fired (or it cancelled the RPC), the provisioner kept blocking up to 5m on a wedged PVC/CSI attach and the api handler hung. This is the same class of bug that drove the e2e-prod flakiness, for db/nosql/queue instead of cache.

Each backend now derives provCtx from the incoming ctx with a 5m ceiling backstop (min of the two), mirroring redis/k8s.go provisionContext. The api grants a generous provision deadline (provisionTimeout: 4m anon / 5m pro), so legitimate 30-90s pod startup is unaffected; only pathological hangs + early cancellations now fast-fail. waitPodReady already honours ctx.Err() each poll; the shared server.mapError already maps context.DeadlineExceeded/Canceled to retryable gRPC codes, so no server change is needed.

Also bounded the rollback namespace-delete to a fresh 30s background ctx (redis parity) so cleanup runs even when the incoming ctx is cancelled, without a wedged apiserver pinning the goroutine.

Tests: TestProvision_HonoursCallerDeadline per backend: a 300ms caller ctx fast-fails (<30s) wrapping context.DeadlineExceeded, instead of blocking for the pod-ready ceiling. make gate green.

Co-authored-by: Manas Srivastava <[email protected]>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Symptom
Provisioning a Redis cache for an authenticated Pro/Team team hangs (>60s, never returns), while the anonymous cache provision returns in ~6s.
- teamID="", tier="anonymous" → FAST (~6s)
- teamID=<uuid>, tier="pro" → HANGS

Root cause (file:line)
internal/backend/redis/k8s.go: K8sBackend.Provision derived its provisioning context from context.Background() with a hardcoded 5-minute timeout, completely discarding the incoming gRPC ctx.

Why only the pro/team path hits it:
- Pro/Team uses a real 10Gi PVC (pvcMi=10240, sizingForTier). When the DOKS block-storage attach / PVC bind stalls (CSI stuck, slow attach, quota), the pod never reaches Ready and waitPodReady polls for the full redisK8sReadyTO (3 min).
- Anonymous uses pvcMi=0 → emptyDir, which skips the 5–10s volume attach, so the pod is Ready in seconds; the slow-attach path is never exercised.

Because the wait used a background-derived context, even after the api's gRPC ProvisionCache deadline (cacheProvisionTimeout = 45s, already shipped api-side) fired and gRPC cancelled the server ctx, the provisioner ignored the cancellation and kept blocking the handler for up to 5 minutes. It also left a half-built namespace behind, since the cancellation never reached the rollback path. From the api's side: an unbounded hang plus a resource leak.

Fix
- provisionContext() derives provCtx from the incoming ctx (so the api's deadline/cancellation propagates) while still imposing a hard server-side ceiling, redisK8sProvisionCeiling (5m), for callers with no deadline. Effective wait = min(caller deadline, redisK8sReadyTO, ceiling).
- waitPodReady checks ctx.Err() at the top of each poll and wraps cancellation in a clear, mappable error.
- rollback now deletes the half-built namespace via a fresh 30s background ctx so cleanup still runs even when the incoming ctx is already cancelled (no namespace leak on timeout).
- mapError classifies context.DeadlineExceeded / context.Canceled via errors.Is (not a fragile substring) → retryable gRPC status (DeadlineExceeded / Unavailable), so the api soft-deletes + 503s rather than returning a hard 500.

A timed-out/cancelled provision now returns a clean gRPC error promptly and rolls back, never hangs. This is complementary to the api-side 45s cacheProvisionTimeout.

Tests
internal/backend/redis/k8s_test.go:
- TestProvision_ProTier_HonoursCallerDeadline: core regression guard. Uses a fake clientset whose pod never goes Ready (simulates stalled PVC attach); asserts Provision returns in <30s wrapping context.DeadlineExceeded, not after redisK8sReadyTO (3m). Reverting to a background context fails this test.
- TestProvision_AnonTier_HonoursCallerDeadline
- TestProvision_CallerCancel_FastFails
- TestProvisionContext_HonoursCallerDeadline / TestProvisionContext_CeilingBoundsNoDeadlineCaller

internal/server/maperror_test.go:
- TestMapError_ContextErrors_AreRetryable: wrapped + bare context errors map to retryable statuses.

Local gate
make gate equivalent green:
- go build ./... ✅
- go vet ./... ✅
- go test ./... -short -count=1 ✅ (all packages OK)

New Provision tests fast-fail in ~0.2–0.3s, confirming the caller deadline is honoured.
Scope note
The same context.Background() provision anti-pattern exists in the postgres and mongo k8s backends. This PR fixes only the redis path (the reported symptom). The postgres/mongo paths are bounded api-side by their own longer provisionTimeout(tier) and were not reported as hanging; they are left for a focused follow-up to keep this PR tight.

🤖 Generated with Claude Code