fix(redis): honour caller deadline in k8s Provision — stop pro-tier provision hang by mastermanas805 · Pull Request #52 · InstaNode-dev/provisioner

mastermanas805 · 2026-06-07T14:30:27Z

Symptom

Provisioning a Redis cache for an authenticated Pro/Team team hangs (>60s, never returns), while the anonymous cache provision returns in ~6s.

anonymous → teamID="", tier="anonymous" → FAST (~6s)
authed pro → teamID=<uuid>, tier="pro" → HANGS

Root cause (file:line)

In internal/backend/redis/k8s.go, K8sBackend.Provision derived its provisioning context from context.Background() with a hardcoded 5-minute timeout, discarding the incoming gRPC ctx entirely:

// OLD
provCtx, provCancel := context.WithTimeout(context.Background(), 5*time.Minute)

Why only the pro/team path hits it:

Pro/Team Redis uses a real 10Gi PVC (pvcMi=10240, sizingForTier). When the DOKS block-storage attach / PVC bind stalls (CSI stuck, slow attach, quota), the pod never reaches Ready and waitPodReady polls for the full redisK8sReadyTO (3 min).
Anonymous uses pvcMi=0 → emptyDir, which skips the 5–10s volume attach, so the pod is Ready in seconds — the slow-attach path is never exercised.

Because the wait used a background-derived context, the provisioner ignored cancellation. Even after the api's gRPC ProvisionCache deadline (cacheProvisionTimeout = 45s, already shipped api-side) fired and gRPC cancelled the server ctx, the handler stayed blocked for up to 5 minutes. It also left a half-built namespace behind, since the cancellation never reached the rollback path. From the api's side: an unbounded hang plus a resource leak.

Fix

provisionContext() derives provCtx from the incoming ctx (so the api's deadline/cancellation propagates) while still imposing a hard server-side ceiling redisK8sProvisionCeiling (5m) for callers with no deadline. Effective wait = min(caller deadline, redisK8sReadyTO, ceiling).
waitPodReady checks ctx.Err() at the top of each poll and wraps cancellation in a clear, mappable error.
rollback now deletes the half-built namespace via a fresh 30s background ctx so cleanup still runs even when the incoming ctx is already cancelled (no namespace leak on timeout).
mapError classifies context.DeadlineExceeded / context.Canceled via errors.Is (not a fragile substring) → retryable gRPC status (DeadlineExceeded / Unavailable), so the api soft-deletes + 503s rather than returning a hard 500.

A timed-out/cancelled provision now returns a clean gRPC error promptly and rolls back instead of hanging, complementing the api-side 45s cacheProvisionTimeout.

Tests

internal/backend/redis/k8s_test.go:

TestProvision_ProTier_HonoursCallerDeadline — core regression guard: fake clientset whose pod never goes Ready (simulates stalled PVC attach); asserts Provision returns in <30s wrapping context.DeadlineExceeded, not after redisK8sReadyTO (3m). Reverting to a background context fails this test.
TestProvision_AnonTier_HonoursCallerDeadline
TestProvision_CallerCancel_FastFails
TestProvisionContext_HonoursCallerDeadline / TestProvisionContext_CeilingBoundsNoDeadlineCaller

internal/server/maperror_test.go:

TestMapError_ContextErrors_AreRetryable — wrapped + bare context errors map to retryable statuses.

Local gate

The make gate equivalent is green:

go build ./... ✅
go vet ./... ✅
go test ./... -short -count=1 ✅ (all packages OK)

New Provision tests fast-fail in ~0.2–0.3s, confirming the caller deadline is honoured.

Scope note

The same context.Background() provision anti-pattern exists in the postgres and mongo k8s backends. This PR fixes only the redis path (the reported symptom). The postgres/mongo paths are bounded api-side by their own longer provisionTimeout(tier) and were not reported as hanging; left for a focused follow-up to keep this PR tight.

🤖 Generated with Claude Code

…rovision hang

Provisioning a Redis cache for an authenticated Pro/Team team hung (>60s, never returned) while the anonymous provision returned in ~6s.

Root cause: K8sBackend.Provision derived its provisioning context from context.Background() with a hardcoded 5-minute timeout, completely discarding the incoming gRPC ctx. The Pro/Team tier uses a real 10Gi PVC (pvcMi=10240); when the block-storage attach stalls, the pod never reaches Ready and waitPodReady polled for the full redisK8sReadyTO (3m). Because the wait ignored the caller's ctx, even after the api gRPC call's 45s deadline fired, the provisioner kept blocking the handler (and left a half-built namespace, since the cancellation never reached the rollback path). The anonymous path never hit this: pvcMi=0 → emptyDir → pod Ready in seconds.

Fix:
- provisionContext() derives provCtx from the incoming ctx (so the api's deadline/cancellation propagates) while still imposing a hard server-side ceiling (redisK8sProvisionCeiling, 5m) for callers with no deadline. The effective wait is now min(caller deadline, redisK8sReadyTO, ceiling).
- waitPodReady checks ctx.Err() at the top of each poll and wraps the cancellation in a clear, mappable error.
- rollback now deletes the half-built namespace via a fresh 30s background ctx so cleanup still runs even when the incoming ctx is already cancelled.
- mapError classifies context.DeadlineExceeded/Canceled via errors.Is (not a fragile message substring) → retryable gRPC status (DeadlineExceeded / Unavailable) so the api soft-deletes + 503s instead of returning a hard 500.

A timed-out/cancelled provision now returns a clean gRPC error promptly and rolls back, never hangs.

Tests (internal/backend/redis/k8s_test.go, internal/server/maperror_test.go):
- TestProvision_ProTier_HonoursCallerDeadline (core regression guard: fails if Provision blocks past the caller deadline)
- TestProvision_AnonTier_HonoursCallerDeadline
- TestProvision_CallerCancel_FastFails
- TestProvisionContext_HonoursCallerDeadline / _CeilingBoundsNoDeadlineCaller
- TestMapError_ContextErrors_AreRetryable

Local gate green: go build ./... + go vet ./... + go test ./... -short -count=1 all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ovision (#53)

The redis k8s backend was fixed for the pro-provision-hang bug (#52) but the postgres, mongo, AND queue backends still derived their provisioning context from context.Background(), discarding the caller's gRPC deadline. When the api caller's deadline fired (or it cancelled the RPC), the provisioner kept blocking up to 5m on a wedged PVC/CSI attach and the api handler hung. This is the same class that drove the e2e-prod flakiness, for db/nosql/queue instead of cache.

Each backend now derives provCtx from the incoming ctx with a 5m ceiling backstop (min of the two), mirroring redis/k8s.go provisionContext. The api grants a generous provision deadline (provisionTimeout: 4m anon / 5m pro), so legitimate 30-90s pod startup is unaffected; only pathological hangs + early cancellations now fast-fail. waitPodReady already honours ctx.Err() each poll; the shared server.mapError already maps context.DeadlineExceeded/Canceled to retryable gRPC codes, so no server change is needed.

Also bounded the rollback namespace-delete to a fresh 30s background ctx (redis parity) so cleanup runs even when the incoming ctx is cancelled without a wedged apiserver pinning the goroutine.

Tests: TestProvision_HonoursCallerDeadline per backend. A 300ms caller ctx fast-fails (<30s) wrapping context.DeadlineExceeded, instead of blocking for the pod-ready ceiling. make gate green.

Co-authored-by: Manas Srivastava <[email protected]>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mastermanas805 merged commit d30d3bb into master Jun 7, 2026
12 checks passed

mastermanas805 deleted the fix/redis-pro-provision-hang branch June 7, 2026 14:35

mastermanas805 mentioned this pull request Jun 8, 2026

fix(provision): honour caller deadline in postgres/mongo/queue k8s Provision #53

Merged
