@kacpersaw

Summary

Implement per-provider circuit breakers that detect upstream rate limiting (429/503/529 status codes) and temporarily stop sending requests when providers are overloaded.

This completes the overload protection work by adding the aibridge-specific component, which could not be implemented as generic HTTP middleware in coderd because it must interpret upstream provider responses.

Key Features

  • Per-provider circuit breakers - Separate circuit breakers for Anthropic and OpenAI since upstream rate limits are provider-specific
  • Configurable parameters - Failure threshold, time window, cooldown period, half-open max requests
  • Half-open state - Allows gradual recovery testing after cooldown
  • Prometheus metrics - State gauge, trips counter, rejects counter for monitoring
  • Thread-safe - Proper mutex protection for concurrent access
  • Disabled by default - For backward compatibility, must be explicitly enabled
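
As a rough illustration of the per-provider and thread-safety points above, the sketch below keeps one breaker per provider name behind a mutex. The `Registry` and `breaker` types and their methods are hypothetical names chosen for this example, not the types in circuit_breaker.go, and the breaker here only counts failures so that the isolation between providers is visible.

```go
package main

import (
	"fmt"
	"sync"
)

// breaker is a hypothetical stand-in for a real circuit breaker.
// It only counts failures, guarded by its own mutex.
type breaker struct {
	mu       sync.Mutex
	failures int
}

func (b *breaker) RecordFailure() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures++
}

func (b *breaker) Failures() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures
}

// Registry hands out one breaker per provider name (hypothetical type).
type Registry struct {
	mu       sync.Mutex
	breakers map[string]*breaker
}

func NewRegistry() *Registry {
	return &Registry{breakers: make(map[string]*breaker)}
}

// For returns the breaker for a provider, creating it on first use.
func (r *Registry) For(provider string) *breaker {
	r.mu.Lock()
	defer r.mu.Unlock()
	b, ok := r.breakers[provider]
	if !ok {
		b = &breaker{}
		r.breakers[provider] = b
	}
	return b
}

func main() {
	reg := NewRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			reg.For("anthropic").RecordFailure() // concurrent failures on one provider
		}()
	}
	wg.Wait()
	// Failures recorded for one provider do not affect the other.
	fmt.Println(reg.For("anthropic").Failures(), reg.For("openai").Failures())
}
```

Keying the map by provider means a burst of rate-limit responses from one upstream cannot trip the breaker for the other.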

Circuit Breaker States

┌─────────┐  failures >= threshold  ┌─────────┐
│ CLOSED  │ ──────────────────────► │  OPEN   │
│(normal) │                         │(reject) │
└─────────┘                         └────┬────┘
     ▲                                   │
     │                          cooldown expires
     │                                   │
     │  successes >= threshold     ┌─────▼─────┐
     └──────────────────────────── │ HALF-OPEN │
                                   │ (testing) │
     ┌─────────────────────────────┤           │
     │         any failure         └───────────┘
     │
     ▼
┌─────────┐
│  OPEN   │
│(reject) │
└─────────┘

Status Codes That Trigger Circuit Breaker

Code  Description
429   Too Many Requests
503   Service Unavailable
529   Anthropic Overloaded

Other error codes (400, 401, 500, 502, etc.) do not trigger the circuit breaker because they indicate problems other than upstream overload, which pausing requests would not fix.

Default Configuration

Parameter            Default  Description
Enabled              false    Must be explicitly enabled
FailureThreshold     5        Failures needed to trip circuit
Window               10s      Time window for counting failures
Cooldown             30s      Time to wait before testing recovery
HalfOpenMaxRequests  3        Requests allowed in half-open state

New Prometheus Metrics

  • aibridge_circuit_breaker_state{provider} - Current state (0=closed, 1=open, 2=half-open)
  • aibridge_circuit_breaker_trips_total{provider} - Total times circuit opened
  • aibridge_circuit_breaker_rejects_total{provider} - Requests rejected due to open circuit
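
To make the state gauge encoding concrete, the sketch below maps each state to the numeric value listed above and renders one sample line in the Prometheus text exposition format. It uses only the standard library; the `State` type and `stateSample` function are hypothetical, and a real implementation would more likely register a gauge through a Prometheus client library than format lines by hand.

```go
package main

import "fmt"

// State values match the gauge encoding above:
// 0=closed, 1=open, 2=half-open.
type State int

const (
	Closed   State = 0
	Open     State = 1
	HalfOpen State = 2
)

// stateSample renders one sample line for the state gauge.
// %q quoting is adequate for simple provider names such as "anthropic".
func stateSample(provider string, s State) string {
	return fmt.Sprintf("aibridge_circuit_breaker_state{provider=%q} %d", provider, int(s))
}

func main() {
	fmt.Println(stateSample("anthropic", Open))
	fmt.Println(stateSample("openai", Closed))
}
```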

Files Changed

  • circuit_breaker.go - Core circuit breaker implementation
  • circuit_breaker_test.go - Comprehensive test suite (13 tests)
  • bridge.go - Integration into RequestBridge
  • interception.go - Apply circuit breaker to intercepted requests
  • metrics.go - Add Prometheus metrics

Testing

All tests, vet checks, and builds pass:

go test -count=1 -short ./...
go vet ./...
go build ./...

Related

Circuit breaker states:
- Closed: normal operation, tracking failures within sliding window
- Open: all requests rejected with 503, waiting for cooldown
- Half-Open: limited requests allowed to test if upstream recovered

Relates to: coder/internal#1153
@kacpersaw kacpersaw marked this pull request as draft December 11, 2025 13:52