fix(provisioner): bound ProvisionCache gRPC with a 45s deadline by mastermanas805 · Pull Request #278 · InstaNode-dev/api

mastermanas805 · 2026-06-07T14:34:12Z

Problem

Authenticated /cache/new (a logged-in PRO team provisioning Redis) was observed hanging indefinitely (>60s, HTTP 000 at a 60s client timeout), while anonymous /cache/new returned in ~6s. The provisioner-side root cause is being fixed separately. This PR is the defensive api-side guard: a hung provisioner should become a clean 503 provision_failed, never an indefinite hang.

Did provClient already have a timeout?

Yes. provisioner.Client.ProvisionCache (and ProvisionPostgres/NoSQL/Queue) already wrap the gRPC call with a per-call context.WithTimeout(..., provisionTimeout(tier)). The problem was the value, not a missing deadline:

provisionTimeout(tier) = 4m for anon/free/hobby, 5m for pro/team/growth.
Those budgets exist because Postgres/Mongo now spin up a per-tenant pod (PVC bind + image pull + DB init, 30–90s on a cold node).
A Redis namespace carve does none of that (~1–6s for every tier), so granting it the 5m pod budget let a hung provisioner hang the whole /cache/new request for up to 5 minutes.

So I adjusted the existing deadline rather than stacking a second layer.

Change

New named constant cacheProvisionTimeout = 45 * time.Second (no inline magic number — repo rule). It's a package var purely so the timeout→503 unit test can shrink it instead of blocking the suite for 45s; production never mutates it.
ProvisionCache now uses cacheProvisionTimeout instead of provisionTimeout(tier).
45s is ~7× the slowest observed healthy carve: a legitimately slow-but-OK provision still succeeds, a genuine hang fails fast.

On timeout the existing handler path runs unchanged (cache.go): the gRPC DeadlineExceeded surfaces as a provision error → soft-delete the pending resource → respondProvisionFailed / 503. No orphaned pending resource, no hang.

Handlers touched

ProvisionCache only (the reported /cache/new surface).
db / vector / nosql / queue deliberately left on provisionTimeout(tier): they are pod-backed and genuinely need the cold-pod budget. Tightening them to 45s would regress legitimate slow Postgres/Mongo cold-pod provisions. If those surfaces report the same hang, the fix is a per-resource-type deadline table — noted as a follow-up, not done here to keep this surgical.

Tests (package `internal/provisioner`, no DB needed)

TestProvisionCache_HangBecomesDeadlineError: a blocking mock provisioner → ProvisionCache returns a gRPC DeadlineExceeded error in ~the bounded window (caller ctx is 10s, internal deadline shrunk to 50ms, so the failure is provably our deadline), not an indefinite hang. This is exactly the error cache.go converts into a 503 after soft-deleting.
TestCacheProvisionTimeout_Value: pins 45s and asserts it stays tighter than provisionTimeout("pro"), so a future edit can't silently reintroduce the multi-minute hang.
ProvisionCache coverage: 100%.

Gate

go build ./... ✅ go vet ./... ✅
go test ./internal/provisioner/ -short -count=1 ✅ (incl. both new tests; hang test completes in 0.05s)
go test ./internal/handlers/ -run Cache ✅
Full make gate run locally: every package green except a pre-existing internal/models failure unrelated to this change — the local testhelpers schema mirror is missing migration 068's deployments.last_activity_at column (pq: column "last_activity_at" ... does not exist). Confirmed it reproduces on a clean origin/master (stash + re-run), so it's environmental local-mirror drift, not introduced here. Authoritative full gate runs in CI.

🤖 Generated with Claude Code

Authed /cache/new (PRO Redis provision) was observed hanging >60s while anonymous /cache/new returned in ~6s. The provisioner-side root cause is fixed separately; this is the defensive api-side guard. provClient ALREADY applies a per-call deadline to every provision RPC via provisionTimeout(tier) — the issue is the value: 4m (anon/free/hobby) and 5m (pro/team/growth). Those budgets exist for Postgres/Mongo, which spin up a per-tenant pod (PVC bind + image pull + DB init, 30-90s on a cold node). A Redis namespace carve does none of that (~1-6s for every tier), so a hung provisioner could hang the whole /cache/new request for up to 5 minutes. Introduce cacheProvisionTimeout = 45s (named, no inline magic number; a package var so the timeout->503 unit test can shrink it) and use it in ProvisionCache only. 45s is ~7x the slowest observed healthy carve, so a legitimately slow-but-OK provision still succeeds while a genuine hang fails fast. On timeout the existing handler path runs unchanged: soft-delete the pending resource + 503 provision_failed — no orphan, no hang. db/vector/nosql/queue intentionally keep provisionTimeout(tier): they are pod-backed and genuinely need the cold-pod budget; tightening them to 45s would regress legitimate slow provisions. Noted as a deliberate non-change. Tests (internal/provisioner, no DB needed): - TestProvisionCache_HangBecomesDeadlineError: blocking mock provisioner -> ProvisionCache returns gRPC DeadlineExceeded in ~the bounded window, not an indefinite hang (the error the handler turns into a 503). - TestCacheProvisionTimeout_Value: pins 45s and asserts it stays tighter than provisionTimeout(pro) so a future edit can't silently reintroduce the multi-minute hang. ProvisionCache coverage 100%. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mastermanas805 merged commit 4dff0ad into master Jun 7, 2026
18 checks passed

mastermanas805 deleted the fix/provision-grpc-deadline branch June 7, 2026 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(provisioner): bound ProvisionCache gRPC with a 45s deadline#278

fix(provisioner): bound ProvisionCache gRPC with a 45s deadline#278
mastermanas805 merged 1 commit into
masterfrom
fix/provision-grpc-deadline

mastermanas805 commented Jun 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mastermanas805 commented Jun 7, 2026

Problem

Did provClient already have a timeout?

Change

Handlers touched

Tests (package internal/provisioner, no DB needed)

Gate

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tests (package `internal/provisioner`, no DB needed)