Add warm-worker pool observability, stuck-worker reaper, and runbooks#344
Open
… runbooks

Add Prometheus metrics for the shared warm-worker lifecycle (idle, reserved, activating, hot, draining gauges), activation latency histogram, activation failure counter, retirement counter with reason labels, and hot-worker session histogram. Instrument k8s_pool.go and org_reserved_pool.go to emit metrics on state transitions. Add automatic stuck-worker reaper that retires workers stuck in reserved/activating state >2 minutes and replenishes the pool. Extend idleReaper to always run for stuck-worker detection. Track reservedAt and peakSessions on ManagedWorker. Include 3 operational runbooks (drain hot workers, recover stuck activating workers, replenish capacity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
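For reviewers who want a picture of what the lifecycle gauges count, here is a minimal sketch of the tallying step, using only the standard library. The `WorkerState` enum, the `Worker` struct, its `Dead` flag, and `CountLifecycle` are hypothetical stand-ins rather than the types in k8s_pool.go; skipping dead workers is assumed from the test names in the test plan. The real code presumably sets Prometheus gauges from counts like these.

```go
package main

// WorkerState is a hypothetical enum mirroring the lifecycle states
// named in the commit message.
type WorkerState int

const (
	StateIdle WorkerState = iota
	StateReserved
	StateActivating
	StateHot
	StateDraining
)

// Worker is an illustrative stand-in for ManagedWorker.
type Worker struct {
	State WorkerState
	Dead  bool // assumed flag: the process has already exited
}

// LifecycleCounts holds one value per lifecycle gauge.
type LifecycleCounts struct {
	Idle, Reserved, Activating, Hot, Draining int
}

// CountLifecycle tallies live workers by state. Dead workers are
// skipped so they do not inflate any gauge.
func CountLifecycle(workers []Worker) LifecycleCounts {
	var c LifecycleCounts
	for _, w := range workers {
		if w.Dead {
			continue
		}
		switch w.State {
		case StateIdle:
			c.Idle++
		case StateReserved:
			c.Reserved++
		case StateActivating:
			c.Activating++
		case StateHot:
			c.Hot++
		case StateDraining:
			c.Draining++
		}
	}
	return c
}
```

Recomputing all five counts from a snapshot of the pool, instead of incrementing and decrementing on each transition, is one way to keep the gauges from drifting if a transition is missed.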
efaf485 to 16f6b13
peakSessions is now tracked in FlightWorkerPool's AcquireWorker too. reservedAt only applies to the k8s warm-pool reservation flow so it gets a targeted nolint:unused.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
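A high-water mark such as peakSessions can be kept with a small amount of bookkeeping on acquire. The sketch below is only an illustration: `SessionTracker` and its methods are hypothetical names, not the fields or functions on ManagedWorker or FlightWorkerPool.

```go
package main

import "sync"

// SessionTracker is a hypothetical helper that tracks the number of
// active sessions on a worker and the highest value ever observed.
type SessionTracker struct {
	mu     sync.Mutex
	active int
	peak   int
}

// Acquire records a new session and raises the peak if needed.
func (t *SessionTracker) Acquire() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active++
	if t.active > t.peak {
		t.peak = t.active
	}
}

// Release records the end of a session. The peak is never lowered.
func (t *SessionTracker) Release() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active > 0 {
		t.active--
	}
}

// Peak returns the highest number of concurrent sessions seen so far.
func (t *SessionTracker) Peak() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.peak
}
```

Because the peak only moves upward, it can be read once at retirement and observed into a histogram without any extra state.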
The multitenant-seed-kind recipe races with the control plane's config store migration. The deployment becomes "available" before the CP has finished creating the duckgres_orgs table via GORM auto-migrate. Add a retry loop (up to 30s) so the seed waits for the schema to exist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- 9 Prometheus metrics for the shared warm-worker lifecycle:
  - `duckgres_warm_workers`: idle (unassigned) workers gauge
  - `duckgres_reserved_workers`: reserved workers gauge
  - `duckgres_activating_workers`: activating workers gauge
  - `duckgres_hot_workers`: hot (tenant-bound) workers gauge
  - `duckgres_draining_workers`: draining workers gauge
  - `duckgres_activation_duration_seconds`: reservation-to-hot latency histogram
  - `duckgres_activation_failures_total{reason}`: activation failure counter
  - `duckgres_worker_retirements_total{reason}`: retirement counter with reason labels (normal, activation_failure, crash, shutdown, idle_timeout, stuck_activating)
  - `duckgres_hot_worker_sessions_total`: sessions served per hot worker at retirement
- Stuck-worker reaper: auto-retires workers stuck in reserved/activating state >2 minutes, with automatic pool replenishment
- `idleReaper` always runs: no longer exits early when `idleTimeout=0`, enabling stuck-worker detection even without idle reaping
- `reservedAt` / `peakSessions` tracking on `ManagedWorker` for latency and session histograms
- 3 operational runbooks: drain hot workers, recover stuck activating workers, replenish capacity
Test plan
- `TestObserveWarmPoolLifecycleGauges`: lifecycle gauge counting
- `TestObserveWarmPoolLifecycleGauges_SkipsDeadWorkers`: dead worker exclusion
- `TestMarkWorkerRetiredLocked_RecordsRetirementMetric`: retirement counter with reason
- `TestMarkWorkerRetiredLocked_RecordsHotWorkerSessions`: hot worker session histogram
- `TestReservedAtTracking`: reservedAt set during ReserveSharedWorker
- `TestPeakSessionsTracking`: peakSessions high-water mark
- `TestReapStuckActivatingWorkers`: stuck worker reaped + replacement spawned
- `TestReapStuckActivatingWorkers_RecentlyReservedNotReaped`: recently reserved protected
- `controlplane` test suite passes (66 seconds, all green)

🤖 Generated with Claude Code