Add warm-worker pool observability, stuck-worker reaper, and runbooks#344

Open
bill-ph wants to merge 3 commits into main from claude/goofy-meitner

bill-ph commented Mar 23, 2026

Summary

  • 9 Prometheus metrics for the shared warm-worker lifecycle:

    • duckgres_warm_workers — idle (unassigned) workers gauge
    • duckgres_reserved_workers — reserved workers gauge
    • duckgres_activating_workers — activating workers gauge
    • duckgres_hot_workers — hot (tenant-bound) workers gauge
    • duckgres_draining_workers — draining workers gauge
    • duckgres_activation_duration_seconds — reservation-to-hot latency histogram
    • duckgres_activation_failures_total{reason} — activation failure counter
    • duckgres_worker_retirements_total{reason} — retirement counter with reason labels (normal, activation_failure, crash, shutdown, idle_timeout, stuck_activating)
    • duckgres_hot_worker_sessions_total — sessions served per hot worker at retirement
  • Stuck-worker reaper: auto-retires workers stuck in reserved/activating state >2 minutes, with automatic pool replenishment

  • idleReaper always runs: no longer exits early when idleTimeout=0, enabling stuck-worker detection even without idle reaping

  • reservedAt / peakSessions tracking on ManagedWorker for latency and session histograms

  • 3 operational runbooks: drain hot workers, recover stuck activating workers, replenish capacity
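The stuck-worker rule above can be illustrated with a minimal Go sketch. It is not the PR's code: the `worker` struct, the `workerState` constants, and `reapStuck` are hypothetical stand-ins for `ManagedWorker` and the pool internals, and only the 2-minute threshold and the reserved/activating condition are taken from the summary.

```go
package main

import (
	"fmt"
	"time"
)

// workerState is a hypothetical enum mirroring the lifecycle states named in the summary.
type workerState int

const (
	stateIdle workerState = iota
	stateReserved
	stateActivating
	stateHot
	stateDraining
)

// stuckThreshold matches the ">2 minutes" limit stated in the summary.
const stuckThreshold = 2 * time.Minute

// worker is an assumed stand-in for ManagedWorker; only the fields needed here are modeled.
type worker struct {
	id         int
	state      workerState
	reservedAt time.Time
}

// reapStuck returns the workers to keep and the workers to retire.
// A worker is retired only if it is reserved or activating and was
// reserved more than stuckThreshold before now.
func reapStuck(workers []worker, now time.Time) (kept, retired []worker) {
	for _, w := range workers {
		inFlight := w.state == stateReserved || w.state == stateActivating
		if inFlight && now.Sub(w.reservedAt) > stuckThreshold {
			retired = append(retired, w)
			continue
		}
		kept = append(kept, w)
	}
	return kept, retired
}

func main() {
	now := time.Now()
	pool := []worker{
		{id: 1, state: stateActivating, reservedAt: now.Add(-5 * time.Minute)}, // stuck
		{id: 2, state: stateReserved, reservedAt: now.Add(-10 * time.Second)},  // recently reserved
		{id: 3, state: stateHot, reservedAt: now.Add(-time.Hour)},              // hot, outside this rule
	}
	kept, retired := reapStuck(pool, now)
	fmt.Printf("kept=%d retired=%d\n", len(kept), len(retired))
	// A real pool would spawn one replacement per retired worker at this point.
}
```

Passing `now` in as a parameter, rather than calling `time.Now()` inside the function, is a choice made for this sketch so the threshold logic can be tested deterministically.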

Test plan

  • TestObserveWarmPoolLifecycleGauges — lifecycle gauge counting
  • TestObserveWarmPoolLifecycleGauges_SkipsDeadWorkers — dead worker exclusion
  • TestMarkWorkerRetiredLocked_RecordsRetirementMetric — retirement counter with reason
  • TestMarkWorkerRetiredLocked_RecordsHotWorkerSessions — hot worker session histogram
  • TestReservedAtTracking — reservedAt set during ReserveSharedWorker
  • TestPeakSessionsTracking — peakSessions high-water mark
  • TestReapStuckActivatingWorkers — stuck worker reaped + replacement spawned
  • TestReapStuckActivatingWorkers_RecentlyReservedNotReaped — recently reserved protected
  • Full controlplane test suite passes (66 seconds, all green)
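As a rough illustration of what the two lifecycle-gauge tests exercise, the sketch below counts workers per state and skips dead ones. The `poolWorker` type, the `lifecycle` constants, and `countLifecycle` are assumptions made for this example; the real code sets Prometheus gauges rather than returning a map.

```go
package main

import "fmt"

// lifecycle is a hypothetical label for the five states the gauges report.
type lifecycle string

const (
	lcIdle       lifecycle = "idle"
	lcReserved   lifecycle = "reserved"
	lcActivating lifecycle = "activating"
	lcHot        lifecycle = "hot"
	lcDraining   lifecycle = "draining"
)

// poolWorker is an assumed, simplified view of a pooled worker.
type poolWorker struct {
	state lifecycle
	dead  bool // true once the worker process has exited
}

// countLifecycle tallies live workers per state. Dead workers are
// excluded so the counts reflect usable capacity only.
func countLifecycle(workers []poolWorker) map[lifecycle]int {
	counts := map[lifecycle]int{
		lcIdle: 0, lcReserved: 0, lcActivating: 0, lcHot: 0, lcDraining: 0,
	}
	for _, w := range workers {
		if w.dead {
			continue
		}
		counts[w.state]++
	}
	return counts
}

func main() {
	workers := []poolWorker{
		{state: lcIdle},
		{state: lcIdle},
		{state: lcHot},
		{state: lcHot, dead: true}, // excluded from the tally
		{state: lcDraining},
	}
	counts := countLifecycle(workers)
	fmt.Println(counts[lcIdle], counts[lcHot], counts[lcDraining])
}
```

Pre-seeding every state with zero means a gauge for an empty state would be reset to 0 instead of keeping a stale value, which is the behavior one would usually want when publishing the counts.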

🤖 Generated with Claude Code

bill-ph changed the title from "Major architecture refactor: multi-tenant control plane and K8s support" to "Add warm-worker pool observability, stuck-worker reaper, and runbooks" on Mar 23, 2026
… runbooks

Add Prometheus metrics for the shared warm-worker lifecycle (idle, reserved,
activating, hot, draining gauges), activation latency histogram, activation
failure counter, retirement counter with reason labels, and hot-worker session
histogram. Instrument k8s_pool.go and org_reserved_pool.go to emit metrics on
state transitions. Add automatic stuck-worker reaper that retires workers stuck
in reserved/activating state >2 minutes and replenishes the pool. Extend
idleReaper to always run for stuck-worker detection. Track reservedAt and
peakSessions on ManagedWorker. Include 3 operational runbooks (drain hot
workers, recover stuck activating workers, replenish capacity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bill-ph force-pushed the claude/goofy-meitner branch from efaf485 to 16f6b13 on March 23, 2026 21:46
bill-ph and others added 2 commits March 23, 2026 18:20
peakSessions is now tracked in FlightWorkerPool's AcquireWorker too.
reservedAt only applies to the k8s warm-pool reservation flow so it
gets a targeted nolint:unused.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The multitenant-seed-kind recipe races with the control plane's config
store migration. The deployment becomes "available" before the CP has
finished creating the duckgres_orgs table via GORM auto-migrate. Add a
retry loop (up to 30s) so the seed waits for the schema to exist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
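The wait-for-schema retry described in this commit follows a common poll-until-deadline pattern. The recipe itself is not shown here, so the following is a generic Go sketch of that pattern. `waitFor`, `errNotReady`, and the polling interval are assumptions; only the 30-second bound comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	seedTimeout  = 30 * time.Second      // upper bound taken from the commit message
	pollInterval = 10 * time.Millisecond // assumed; kept short so the demo runs quickly
)

// errNotReady is a hypothetical error meaning the schema does not exist yet.
var errNotReady = errors.New("schema not ready")

// waitFor calls check repeatedly until it succeeds or until another
// sleep of interval would run past the timeout.
func waitFor(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("gave up after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Stand-in for a real probe such as querying for the duckgres_orgs table.
	tableExists := func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	}
	err := waitFor(seedTimeout, pollInterval, tableExists)
	fmt.Printf("err=%v attempts=%d\n", err, attempts)
}
```

Wrapping the last probe error with `%w` keeps the underlying cause available to `errors.Is` when the wait times out.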