Steps to reproduce
- Deploy a service with replicas: 0..1 and scaling: { metric: rps, target: 0.5, scale_down_delay: 900 }
- Send a request to trigger scale-up (0→1)
- Use the service actively for 14 minutes (RPS > 0, requests completing regularly)
- Pause for 1 minute (RPS drops to 0)
- Observe: service scales down at T=15min, even though it was actively used for 14 of those 15 minutes
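For reference, the traffic pattern above can be driven with a small script along these lines (the endpoint, request path, and 30-second cadence are illustrative assumptions, not part of the report):

# Hypothetical load driver for the repro; replace SERVICE_URL with the deployed service endpoint.
import time
import urllib.request

SERVICE_URL = "https://<gateway-domain>/"  # assumption: the service's public URL

def hit(url: str) -> None:
    try:
        urllib.request.urlopen(url, timeout=10).read()
    except Exception as exc:
        print(f"request failed: {exc}")  # keep driving traffic even if a single request fails

start = time.monotonic()
while time.monotonic() - start < 14 * 60:  # ~14 minutes of steady, low-rate traffic (RPS > 0)
    hit(SERVICE_URL)
    time.sleep(30)                         # ~0.03 RPS, matching the timeline below

time.sleep(60)  # 1-minute pause: RPS drops to 0, and the replica is scaled down at ~T=15min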
Expected behaviour
scale_down_delay should protect the instance for the specified duration after the autoscaler first decides to scale down — and reset if traffic returns during that window. Instead, the cooldown is calculated from last_scaled_at (the timestamp of the last applied scale event), not from when the autoscaler's intent changed.
The bug
In src/dstack/_internal/server/services/services/autoscalers.py, the RPSAutoscaler uses last_scaled_at to enforce cooldowns:
if (now - last_scaled_at).total_seconds() < self.scale_down_delay:
    # too early to scale down, wait for the delay
    return current_desired_count
last_scaled_at is only updated when a scale event is actually applied (the cooldown expires and the scale happens). It is not updated when the autoscaler's desired count changes (intent changes) during the cooldown period.
This means:
- Active traffic during the cooldown does not reset last_scaled_at
- For replicas: 0..1, last_scaled_at is set once at initial scale-up (0→1) and never updated, because desired_count == current_count (1 == 1) produces no scale event
- Scale-down can happen immediately after the cooldown expires, regardless of activity during the cooldown
Timeline demonstrating the bug
T=0: Scale up 0→1, last_scaled_at = T0
T=1m: RPS=0.02, desired=1, current=1 → no scale event → last_scaled_at stays T0
T=5m: RPS=0.03, desired=1, current=1 → no scale event → last_scaled_at stays T0
T=10m: RPS=0.01, desired=1, current=1 → no scale event → last_scaled_at stays T0
T=14m: User pauses, RPS drops to 0
T=15m: (now - T0) >= scale_down_delay (900s), RPS=0 → scale-down happens
→ User was active for 14 of 15 minutes, but got scaled down after a 1-minute pause
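The same sequence can be reproduced with a few lines of standalone Python that mimic the current last_scaled_at-based check (a simplified stand-in with illustrative names, not the actual autoscaler code):

# Simplified stand-in for the current cooldown logic (seconds-based timestamps for brevity).
SCALE_DOWN_DELAY = 900  # seconds

def autoscale(now, rps, current_count, last_scaled_at):
    desired = 1 if rps > 0 else 0             # toy policy for replicas: 0..1
    if desired == current_count:
        return current_count, last_scaled_at  # no scale event -> last_scaled_at is not refreshed
    if desired < current_count and now - last_scaled_at < SCALE_DOWN_DELAY:
        return current_count, last_scaled_at  # cooldown measured from the last applied event
    return desired, now                       # scale event applied -> last_scaled_at updated

count, last_scaled_at = 1, 0                  # T=0: scaled up 0 -> 1
for t, rps in [(60, 0.02), (300, 0.03), (600, 0.01), (900, 0.0)]:
    count, last_scaled_at = autoscale(t, rps, count, last_scaled_at)
    print(f"T={t // 60}m rps={rps} count={count}")
# Prints count=1 at T=1m, 5m, 10m, then count=0 at T=15m: scaled down after a 1-minute pause.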
Proposed fix
Replace last_scaled_at with last_count_change — track when the autoscaler's desired count changes from the current count, not when a scale event is applied:
- When desired_count > current_desired_count: update last_count_change (scale-up intent)
- When desired_count < current_desired_count: update last_count_change (scale-down intent)
- Use last_count_change instead of last_scaled_at for cooldown calculations
This ensures:
- The cooldown starts from when the autoscaler first decided to scale down, not from the last applied scale event
- If traffic returns during the cooldown (desired_count goes back up), the cooldown resets because the autoscaler's intent changed
- This works symmetrically for both scale_up_delay and scale_down_delay
# Proposed logic
if new_desired_count > current_desired_count:
    if current_desired_count == 0:
        return new_desired_count  # immediate scale-up from zero
    if last_count_change is not None and (now - last_count_change).total_seconds() < self.scale_up_delay:
        return current_desired_count
    return new_desired_count
elif new_desired_count < current_desired_count:
    if last_count_change is not None and (now - last_count_change).total_seconds() < self.scale_down_delay:
        return current_desired_count
    return new_desired_count
return new_desired_count
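For completeness, here is one way the last_count_change bookkeeping around that logic could look, together with a replay of the timeline above (names, the seconds-based timestamps, the 300s scale_up_delay default, and the reset-to-None convention are illustrative, not a drop-in patch for autoscalers.py):

# Sketch of the proposed intent-based cooldown (illustrative state handling, not dstack's API).
from dataclasses import dataclass
from typing import Optional

@dataclass
class State:
    current_desired_count: int
    last_count_change: Optional[float] = None  # when the desired count first diverged from current

def scale(state: State, new_desired_count: int, now: float,
          scale_up_delay: float = 300.0, scale_down_delay: float = 900.0) -> int:
    if new_desired_count == state.current_desired_count:
        state.last_count_change = None          # intent matches reality again -> cooldown resets
        return state.current_desired_count
    if state.last_count_change is None:
        state.last_count_change = now           # intent changed -> start the clock
    if state.current_desired_count == 0 and new_desired_count > 0:
        pass                                    # immediate scale-up from zero
    else:
        delay = scale_up_delay if new_desired_count > state.current_desired_count else scale_down_delay
        if now - state.last_count_change < delay:
            return state.current_desired_count  # within the delay window -> hold
    state.current_desired_count = new_desired_count
    state.last_count_change = None              # scale applied -> clear the pending intent
    return new_desired_count

s = State(current_desired_count=1)
# Replay: pause at T=14m, brief traffic at T=14.5m (hypothetical), pause again at T=15m.
for t, desired in [(60, 1), (300, 1), (600, 1), (840, 0), (870, 1), (900, 0)]:
    print(f"T={t}s desired={desired} -> count={scale(s, desired, t)}")
# The replica stays at 1 throughout: the scale-down clock restarts whenever traffic returns.

With the same timeline but no traffic after T=14m, this sketch would still hold the replica until roughly T=29m (900 seconds after the intent first changed at T=14m), which matches the expected behaviour described above.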
Impact
This bug makes RPS-based autoscaling unreliable for interactive workloads (LLM serving, chat bots, etc.) where traffic is bursty with pauses between active periods. The cooldown provides a false sense of protection — it doesn't actually guarantee N seconds of inactivity before scaling down.
dstack version
0.19.x (current master)