fix(provisioner): bound ProvisionCache gRPC with a 45s deadline #278

Merged: mastermanas805 merged 1 commit into master from fix/provision-grpc-deadline on Jun 7, 2026

@mastermanas805

Problem

Authenticated /cache/new (a logged-in PRO team provisioning Redis) was observed hanging indefinitely (>60s; the client gave up at its 60s timeout and reported HTTP 000, i.e. no response received), while anonymous /cache/new returned in ~6s. The provisioner-side root cause is being fixed separately. This PR is the defensive api-side guard: a hung provisioner should become a clean 503 provision_failed, never an indefinite hang.

Did provClient already have a timeout?

Yes. provisioner.Client.ProvisionCache (and ProvisionPostgres/NoSQL/Queue) already wrap the gRPC call with a per-call context.WithTimeout(..., provisionTimeout(tier)). The problem was the value, not a missing deadline:

  • provisionTimeout(tier) = 4m for anon/free/hobby, 5m for pro/team/growth.
  • Those budgets exist because Postgres/Mongo now spin up a per-tenant pod (PVC bind + image pull + DB init, 30–90s on a cold node).
  • A Redis namespace carve does none of that (~1–6s for every tier), so granting it the 5m pod budget let a hung provisioner hang the whole /cache/new request for up to 5 minutes.

So I adjusted the existing deadline rather than stacking a second layer.

Change

  • New named constant cacheProvisionTimeout = 45 * time.Second (no inline magic number — repo rule). It's a package var purely so the timeout→503 unit test can shrink it instead of blocking the suite for 45s; production never mutates it.
  • ProvisionCache now uses cacheProvisionTimeout instead of provisionTimeout(tier).
  • 45s is ~7× the slowest observed healthy carve: a legitimately slow-but-OK provision still succeeds, a genuine hang fails fast.

On timeout the existing handler path runs unchanged (cache.go): the gRPC DeadlineExceeded surfaces as a provision error → soft-delete the pending resource → respondProvisionFailed / 503. No orphaned pending resource, no hang.

Handlers touched

  • ProvisionCache only (the reported /cache/new surface).
  • db / vector / nosql / queue deliberately left on provisionTimeout(tier): they are pod-backed and genuinely need the cold-pod budget. Tightening them to 45s would regress legitimate slow Postgres/Mongo cold-pod provisions. If those surfaces report the same hang, the fix is a per-resource-type deadline table — noted as a follow-up, not done here to keep this surgical.

Tests (package internal/provisioner, no DB needed)

  • TestProvisionCache_HangBecomesDeadlineError: a blocking mock provisioner → ProvisionCache returns a gRPC DeadlineExceeded error in ~the bounded window (caller ctx is 10s, internal deadline shrunk to 50ms, so the failure is provably our deadline), not an indefinite hang. This is exactly the error cache.go converts into a 503 after soft-deleting.
  • TestCacheProvisionTimeout_Value: pins 45s and asserts it stays tighter than provisionTimeout("pro"), so a future edit can't silently reintroduce the multi-minute hang.
  • ProvisionCache coverage: 100%.

Gate

  • go build ./...
  • go vet ./...
  • go test ./internal/provisioner/ -short -count=1 ✅ (incl. both new tests; hang test completes in 0.05s)
  • go test ./internal/handlers/ -run Cache
  • Full make gate run locally: every package green except a pre-existing internal/models failure unrelated to this change — the local testhelpers schema mirror is missing migration 068's deployments.last_activity_at column (pq: column "last_activity_at" ... does not exist). Confirmed it reproduces on a clean origin/master (stash + re-run), so it's environmental local-mirror drift, not introduced here. Authoritative full gate runs in CI.

🤖 Generated with Claude Code

Authed /cache/new (PRO Redis provision) was observed hanging >60s while
anonymous /cache/new returned in ~6s. The provisioner-side root cause is
fixed separately; this is the defensive api-side guard.

provClient ALREADY applies a per-call deadline to every provision RPC via
provisionTimeout(tier) — the issue is the value: 4m (anon/free/hobby) and
5m (pro/team/growth). Those budgets exist for Postgres/Mongo, which spin up
a per-tenant pod (PVC bind + image pull + DB init, 30-90s on a cold node).
A Redis namespace carve does none of that (~1-6s for every tier), so a hung
provisioner could hang the whole /cache/new request for up to 5 minutes.

Introduce cacheProvisionTimeout = 45s (named, no inline magic number; a
package var so the timeout->503 unit test can shrink it) and use it in
ProvisionCache only. 45s is ~7x the slowest observed healthy carve, so a
legitimately slow-but-OK provision still succeeds while a genuine hang
fails fast. On timeout the existing handler path runs unchanged:
soft-delete the pending resource + 503 provision_failed — no orphan, no hang.

db/vector/nosql/queue intentionally keep provisionTimeout(tier): they are
pod-backed and genuinely need the cold-pod budget; tightening them to 45s
would regress legitimate slow provisions. Noted as a deliberate non-change.

Tests (internal/provisioner, no DB needed):
- TestProvisionCache_HangBecomesDeadlineError: blocking mock provisioner ->
  ProvisionCache returns gRPC DeadlineExceeded in ~the bounded window, not
  an indefinite hang (the error the handler turns into a 503).
- TestCacheProvisionTimeout_Value: pins 45s and asserts it stays tighter
  than provisionTimeout(pro) so a future edit can't silently reintroduce
  the multi-minute hang.
ProvisionCache coverage 100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 4dff0ad into master Jun 7, 2026
18 checks passed
@mastermanas805 mastermanas805 deleted the fix/provision-grpc-deadline branch June 7, 2026 14:50