
[Feature]: Configurable RPS evaluation window for service autoscaling #3825

@megheaiulian

Description


Problem

The RPS autoscaler uses a hardcoded 60-second window to evaluate requests per second:

# src/dstack/_internal/server/services/services/autoscalers.py
rps = stats[60].requests / 60
new_desired_count = math.ceil(rps / self.target)

For interactive LLM serving (chat bots, coding assistants, etc.), a single streaming request can last 30+ seconds. During that time, no request has completed, so stats[60].requests is 0 and RPS = 0. This causes the autoscaler to compute desired_count = 0 and scale down, even though the instance is actively serving a request.
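For illustration, this is the arithmetic the autoscaler ends up doing while a long stream is in flight (a minimal standalone sketch, not the actual autoscaler code):

import math

# No request has *completed* inside the 60-second window while the
# response is still streaming, so the counter reads 0.
completed_requests = 0
target = 0.5  # requests per second per replica

rps = completed_requests / 60      # 0.0
print(math.ceil(rps / target))     # 0 -> scale down, despite active traffic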

This is related to #3824 — even with the cooldown fix, the fundamental issue remains that the RPS evaluation window is too short for interactive workloads.

Current behavior

With target: 0.5 (1 request every 2 seconds per replica):

| Scenario | RPS (60s window) | desired_count | Result |
|---|---|---|---|
| 1 request completed in 60s | 0.017 | ceil(0.017/0.5) = 1 | Stays up ✅ |
| 0 requests completed in 60s (streaming) | 0 | ceil(0/0.5) = 0 | Scales down ❌ |
| 0 requests completed in 60s (pause) | 0 | ceil(0/0.5) = 0 | Scales down ✅ |

The autoscaler cannot distinguish between "actively streaming" and "truly idle" because it only sees completed requests in a 60-second window.

Proposed solution

Add a window parameter to the scaling configuration:

type: service
name: my-llm-service
replicas: 0..1
scaling:
  metric: rps
  target: 0.5
  window: 300  # evaluate RPS over 5 minutes instead of 60 seconds

With window: 300 and target: 0.5:

| Scenario | RPS (300s window) | desired_count | Result |
|---|---|---|---|
| 1 request completed in 5 min | 0.003 | ceil(0.003/0.5) = 1 | Stays up ✅ |
| 0 requests in 5 min (genuine idle) | 0 | ceil(0/0.5) = 0 | Scales down ✅ |

The gateway already collects statistics in 30s, 60s, and 300s windows (stats[30], stats[60], stats[300]), so this would be a small change to the autoscaler.

Implementation notes

The PerWindowStats in src/dstack/_internal/proxy/gateway/services/stats.py already has buckets for 30s, 60s, and 300s. The window parameter could select from these existing buckets, or interpolate for custom values.
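If custom values are supported, one option is to snap the configured window to the nearest existing bucket rather than interpolating. A minimal sketch, assuming only the 30s/60s/300s buckets exist (`_snap_window` is a hypothetical helper, not an existing function):

# Hypothetical helper: map an arbitrary configured window onto one of the
# buckets the gateway already collects.
AVAILABLE_WINDOWS = (30, 60, 300)

def _snap_window(window: int) -> int:
    # Choose the bucket closest to the requested window length.
    return min(AVAILABLE_WINDOWS, key=lambda w: abs(w - window))

assert _snap_window(300) == 300
assert _snap_window(120) == 60  # no 120s bucket, fall back to the nearest one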

# Proposed change to RPSAutoscaler
class RPSAutoscaler(BaseServiceScaler):
    def __init__(self, ..., window: int = 60):
        self.window = window  # evaluation window in seconds

    def get_desired_count(self, current_desired_count, stats, last_count_change):
        if not stats:
            return current_desired_count
        # Read the bucket matching the configured window instead of hardcoding 60s
        window_stats = stats[self.window]
        rps = window_stats.requests / self.window
        ...
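To illustrate the effect of the window parameter, here is a rough standalone sketch of the same scenario as in the tables above (assuming stats maps the window length in seconds to an object with a requests counter, as the gateway buckets suggest):

import math

class WindowStats:
    def __init__(self, requests: int):
        self.requests = requests

# One request completed four minutes ago, nothing since (stream in flight or idle).
stats = {30: WindowStats(0), 60: WindowStats(0), 300: WindowStats(1)}
target = 0.5

for window in (60, 300):
    rps = stats[window].requests / window
    print(window, math.ceil(rps / target))
# 60  -> 0 (scales down even though a stream may still be in flight)
# 300 -> 1 (keeps the replica up)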

And in ScalingSpec:

class ScalingSpec(CoreModel):
    metric: Literal["rps"]
    target: float
    scale_up_delay: Duration = Duration.parse("5m")
    scale_down_delay: Duration = Duration.parse("10m")
    window: Duration = Duration.parse("60s")  # NEW
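If arbitrary values are not interpolated, the spec could reject windows the gateway does not collect. A standalone sketch of such a check (the real validation would live on ScalingSpec; shown separately here to avoid assuming a particular pydantic version):

SUPPORTED_WINDOWS = (30, 60, 300)

def validate_window(window: int) -> int:
    # Reject window lengths that have no corresponding gateway stats bucket.
    if window not in SUPPORTED_WINDOWS:
        raise ValueError(f"window must be one of {SUPPORTED_WINDOWS} seconds, got {window}")
    return window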

Workaround

Currently, the only workaround is to set scale_down_delay to a very long value (e.g., 3600 seconds for 1 hour) to prevent premature scale-down. This is imprecise and wastes GPU time if the service is genuinely idle.
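For reference, the workaround looks roughly like this (a config sketch reusing the scale_down_delay field shown in ScalingSpec above):

type: service
name: my-llm-service
replicas: 0..1
scaling:
  metric: rps
  target: 0.5
  scale_down_delay: 3600  # hold the replica for an hour before scaling to zero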

Would you like to help us implement this feature by sending a PR?

Yes
