## Problem
The RPS autoscaler uses a hardcoded 60-second window to evaluate requests per second:
```python
# src/dstack/_internal/server/services/services/autoscalers.py
rps = stats[60].requests / 60
new_desired_count = math.ceil(rps / self.target)
```
For interactive LLM serving (chatbots, coding assistants, etc.), a single streaming request can last 30+ seconds. During that time, no request has completed, so `stats[60].requests` is 0 and RPS = 0. This causes the autoscaler to compute `desired_count = 0` and scale down, even though the instance is actively serving a request.
This is related to #3824 — even with the cooldown fix, the fundamental issue remains that the RPS evaluation window is too short for interactive workloads.
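A minimal reproduction of the arithmetic, using only the code above and the `target: 0.5` value from the configuration below:

```python
import math

target = 0.5   # desired RPS per replica, from the scaling configuration
requests = 0   # a long streaming response is in flight, none completed yet
rps = requests / 60             # hardcoded 60-second window
print(math.ceil(rps / target))  # 0 -> the replica is scaled down mid-stream
```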
## Current behavior
With `target: 0.5` (one request every 2 seconds per replica):

| Scenario | RPS (60s window) | `desired_count` | Result |
|----------|------------------|-----------------|--------|
| 1 request completed in 60s | 0.017 | `ceil(0.017/0.5) = 1` | Stays up ✅ |
| 0 requests completed in 60s (streaming) | 0 | `ceil(0/0.5) = 0` | Scales down ❌ |
| 0 requests completed in 60s (pause) | 0 | `ceil(0/0.5) = 0` | Scales down ✅ |
The autoscaler cannot distinguish between "actively streaming" and "truly idle" because it only sees completed requests in a 60-second window.
## Proposed solution
Add a `window` parameter to the `scaling` configuration:
```yaml
type: service
name: my-llm-service
replicas: 0..1
scaling:
  metric: rps
  target: 0.5
  window: 300  # evaluate RPS over 5 minutes instead of 60 seconds
```
With `window: 300` and `target: 0.5`:

| Scenario | RPS (300s window) | `desired_count` | Result |
|----------|-------------------|-----------------|--------|
| 1 request completed in 5 min | 0.003 | `ceil(0.003/0.5) = 1` | Stays up ✅ |
| 0 requests in 5 min (genuine idle) | 0 | `ceil(0/0.5) = 0` | Scales down ✅ |
The gateway already collects statistics in 30s, 60s, and 300s windows (`stats[30]`, `stats[60]`, `stats[300]`), so this would be a small change to the autoscaler.
## Implementation notes
The `PerWindowStats` in `src/dstack/_internal/proxy/gateway/services/stats.py` already has buckets for 30s, 60s, and 300s. The `window` parameter could select from these existing buckets, or interpolate for custom values.
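For the bucket-selection option, snapping a requested window to the nearest collected bucket would suffice. A minimal sketch, assuming snap-to-nearest is an acceptable policy (the helper name is illustrative, not existing code):

```python
# Buckets the gateway already collects (see PerWindowStats above).
AVAILABLE_WINDOWS = (30, 60, 300)

def resolve_window(requested: int) -> int:
    """Return the collected bucket closest to the requested window size."""
    return min(AVAILABLE_WINDOWS, key=lambda w: abs(w - requested))

# resolve_window(300) -> 300; resolve_window(120) -> 60
```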
```python
# Proposed change to RPSAutoscaler
class RPSAutoscaler(BaseServiceScaler):
    def __init__(self, ..., window: int = 60):
        self.window = window  # seconds

    def get_desired_count(self, current_desired_count, stats, last_count_change):
        if not stats:
            return current_desired_count
        window_stats = stats[self.window]
        rps = window_stats.requests / self.window
        ...
```
And in `ScalingSpec`:

```python
class ScalingSpec(CoreModel):
    metric: Literal["rps"]
    target: float
    scale_up_delay: Duration = Duration.parse("5m")
    scale_down_delay: Duration = Duration.parse("10m")
    window: Duration = Duration.parse("60s")  # NEW
```
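If the implementation restricts `window` to the buckets the gateway already collects, the field could be rejected at config-parse time. A sketch assuming `CoreModel` is a pydantic v1 model and that `Duration` values convert to whole seconds (the validator name and error message are illustrative):

```python
from pydantic import validator

class ScalingSpec(CoreModel):
    ...  # fields as above

    @validator("window")
    def _check_window(cls, v: Duration) -> Duration:
        # Assumption: int(Duration) yields seconds.
        if int(v) not in (30, 60, 300):
            raise ValueError("window must be one of 30s, 60s, or 300s")
        return v
```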
## Workaround
Currently, the only workaround is to set `scale_down_delay` to a very long value (e.g., `3600` for 1 hour) to prevent premature scale-down. This is imprecise and wastes GPU time if the service is genuinely idle.
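For reference, the workaround in configuration form (values as in the example above):

```yaml
scaling:
  metric: rps
  target: 0.5
  scale_down_delay: 3600  # keep replicas up for an hour after traffic stops
```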
Would you like to help us implement this feature by sending a PR?
Yes