fix(provisioner): bound ProvisionCache gRPC with a 45s deadline#278
Merged
Conversation
Authed /cache/new (PRO Redis provision) was observed hanging >60s while anonymous /cache/new returned in ~6s. The provisioner-side root cause is fixed separately; this is the defensive api-side guard. provClient ALREADY applies a per-call deadline to every provision RPC via provisionTimeout(tier) — the issue is the value: 4m (anon/free/hobby) and 5m (pro/team/growth). Those budgets exist for Postgres/Mongo, which spin up a per-tenant pod (PVC bind + image pull + DB init, 30-90s on a cold node). A Redis namespace carve does none of that (~1-6s for every tier), so a hung provisioner could hang the whole /cache/new request for up to 5 minutes. Introduce cacheProvisionTimeout = 45s (named, no inline magic number; a package var so the timeout->503 unit test can shrink it) and use it in ProvisionCache only. 45s is ~7x the slowest observed healthy carve, so a legitimately slow-but-OK provision still succeeds while a genuine hang fails fast. On timeout the existing handler path runs unchanged: soft-delete the pending resource + 503 provision_failed — no orphan, no hang. db/vector/nosql/queue intentionally keep provisionTimeout(tier): they are pod-backed and genuinely need the cold-pod budget; tightening them to 45s would regress legitimate slow provisions. Noted as a deliberate non-change. Tests (internal/provisioner, no DB needed): - TestProvisionCache_HangBecomesDeadlineError: blocking mock provisioner -> ProvisionCache returns gRPC DeadlineExceeded in ~the bounded window, not an indefinite hang (the error the handler turns into a 503). - TestCacheProvisionTimeout_Value: pins 45s and asserts it stays tighter than provisionTimeout(pro) so a future edit can't silently reintroduce the multi-minute hang. ProvisionCache coverage 100%. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Authenticated
/cache/new(a logged-in PRO team provisioning Redis) was observed hanging indefinitely (>60s, HTTP 000 at a 60s client timeout), while anonymous/cache/newreturned in ~6s. The provisioner-side root cause is being fixed separately. This PR is the defensive api-side guard: a hung provisioner should become a clean503 provision_failed, never an indefinite hang.Did provClient already have a timeout?
Yes.
provisioner.Client.ProvisionCache(andProvisionPostgres/NoSQL/Queue) already wrap the gRPC call with a per-callcontext.WithTimeout(..., provisionTimeout(tier)). The problem was the value, not a missing deadline:provisionTimeout(tier)= 4m for anon/free/hobby, 5m for pro/team/growth./cache/newrequest for up to 5 minutes.So I adjusted the existing deadline rather than stacking a second layer.
Change
cacheProvisionTimeout = 45 * time.Second(no inline magic number — repo rule). It's a packagevarpurely so the timeout→503 unit test can shrink it instead of blocking the suite for 45s; production never mutates it.ProvisionCachenow usescacheProvisionTimeoutinstead ofprovisionTimeout(tier).On timeout the existing handler path runs unchanged (
cache.go): the gRPCDeadlineExceededsurfaces as a provision error → soft-delete the pending resource →respondProvisionFailed/ 503. No orphaned pending resource, no hang.Handlers touched
ProvisionCacheonly (the reported/cache/newsurface).provisionTimeout(tier): they are pod-backed and genuinely need the cold-pod budget. Tightening them to 45s would regress legitimate slow Postgres/Mongo cold-pod provisions. If those surfaces report the same hang, the fix is a per-resource-type deadline table — noted as a follow-up, not done here to keep this surgical.Tests (package
internal/provisioner, no DB needed)TestProvisionCache_HangBecomesDeadlineError: a blocking mock provisioner →ProvisionCachereturns a gRPCDeadlineExceedederror in ~the bounded window (caller ctx is 10s, internal deadline shrunk to 50ms, so the failure is provably our deadline), not an indefinite hang. This is exactly the errorcache.goconverts into a 503 after soft-deleting.TestCacheProvisionTimeout_Value: pins 45s and asserts it stays tighter thanprovisionTimeout("pro"), so a future edit can't silently reintroduce the multi-minute hang.ProvisionCachecoverage: 100%.Gate
go build ./...✅go vet ./...✅go test ./internal/provisioner/ -short -count=1✅ (incl. both new tests; hang test completes in 0.05s)go test ./internal/handlers/ -run Cache✅make gaterun locally: every package green except a pre-existinginternal/modelsfailure unrelated to this change — the local testhelpers schema mirror is missing migration 068'sdeployments.last_activity_atcolumn (pq: column "last_activity_at" ... does not exist). Confirmed it reproduces on a cleanorigin/master(stash + re-run), so it's environmental local-mirror drift, not introduced here. Authoritative full gate runs in CI.🤖 Generated with Claude Code