## Problem
The RPS autoscaler uses a hardcoded 60-second window to evaluate requests per second:
```python
# src/dstack/_internal/server/services/services/autoscalers.py
rps = stats[60].requests / 60
new_desired_count = math.ceil(rps / self.target)
```
For interactive LLM serving (chatbots, coding assistants, etc.), a single streaming request can last 30+ seconds. During that time, no request has completed, so `stats[60].requests` is 0 and RPS = 0. This causes the autoscaler to compute `desired_count = 0` and scale down, even though the instance is actively serving a request.
This is related to #3824 — even with the cooldown fix, the fundamental issue remains that the RPS evaluation window is too short for interactive workloads.
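A minimal reproduction of the arithmetic, using only the code above and the `target: 0.5` value from the configuration below:

```python
import math

target = 0.5   # desired RPS per replica, from the scaling configuration
requests = 0   # a long streaming response is in flight, none completed yet
rps = requests / 60             # hardcoded 60-second window
print(math.ceil(rps / target))  # 0 -> the replica is scaled down mid-stream
```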
## Current behavior
With `target: 0.5` (one request every 2 seconds per replica):

| Scenario | RPS (60s window) | `desired_count` | Result |
|----------|------------------|-----------------|--------|
| 1 request completed in 60s | 0.017 | `ceil(0.017/0.5) = 1` | Stays up ✅ |
| 0 requests completed in 60s (streaming) | 0 | `ceil(0/0.5) = 0` | Scales down ❌ |
| 0 requests completed in 60s (pause) | 0 | `ceil(0/0.5) = 0` | Scales down ✅ |
The autoscaler cannot distinguish between "actively streaming" and "truly idle" because it only sees completed requests in a 60-second window.
## Proposed solution
Add a `window` parameter to the `scaling` configuration:
```yaml
type: service
name: my-llm-service
replicas: 0..1
scaling:
  metric: rps
  target: 0.5
  window: 300  # evaluate RPS over 5 minutes instead of 60 seconds
```
With `window: 300` and `target: 0.5`:

| Scenario | RPS (300s window) | `desired_count` | Result |
|----------|-------------------|-----------------|--------|
| 1 request completed in 5 min | 0.003 | `ceil(0.003/0.5) = 1` | Stays up ✅ |
| 0 requests in 5 min (genuine idle) | 0 | `ceil(0/0.5) = 0` | Scales down ✅ |
The gateway already collects statistics in 30s, 60s, and 300s windows (`stats[30]`, `stats[60]`, `stats[300]`), so this would be a small change to the autoscaler.
## Implementation notes
The `PerWindowStats` in `src/dstack/_internal/proxy/gateway/services/stats.py` already has buckets for 30s, 60s, and 300s. The `window` parameter could select from these existing buckets, or interpolate for custom values.
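For the bucket-selection option, snapping a requested window to the nearest collected bucket would suffice. A minimal sketch, assuming snap-to-nearest is an acceptable policy (the helper name is illustrative, not existing code):

```python
# Buckets the gateway already collects (see PerWindowStats above).
AVAILABLE_WINDOWS = (30, 60, 300)

def resolve_window(requested: int) -> int:
    """Return the collected bucket closest to the requested window size."""
    return min(AVAILABLE_WINDOWS, key=lambda w: abs(w - requested))

# resolve_window(300) -> 300; resolve_window(120) -> 60
```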
```python
# Proposed change to RPSAutoscaler
class RPSAutoscaler(BaseServiceScaler):
    def __init__(self, ..., window: int = 60):
        self.window = window  # seconds

    def get_desired_count(self, current_desired_count, stats, last_count_change):
        if not stats:
            return current_desired_count
        window_stats = stats[self.window]
        rps = window_stats.requests / self.window
        ...
```
And in `ScalingSpec`:

```python
class ScalingSpec(CoreModel):
    metric: Literal["rps"]
    target: float
    scale_up_delay: Duration = Duration.parse("5m")
    scale_down_delay: Duration = Duration.parse("10m")
    window: Duration = Duration.parse("60s")  # NEW
```
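If the implementation restricts `window` to the buckets the gateway already collects, the field could be rejected at config-parse time. A sketch assuming `CoreModel` is a pydantic v1 model and that `Duration` values convert to whole seconds (the validator name and error message are illustrative):

```python
from pydantic import validator

class ScalingSpec(CoreModel):
    ...  # fields as above

    @validator("window")
    def _check_window(cls, v: Duration) -> Duration:
        # Assumption: int(Duration) yields seconds.
        if int(v) not in (30, 60, 300):
            raise ValueError("window must be one of 30s, 60s, or 300s")
        return v
```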
## Workaround
Currently, the only workaround is to set `scale_down_delay` to a very long value (e.g., `3600` for 1 hour) to prevent premature scale-down. This is imprecise and wastes GPU time if the service is genuinely idle.
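For reference, the workaround in configuration form (values as in the example above):

```yaml
scaling:
  metric: rps
  target: 0.5
  scale_down_delay: 3600  # keep replicas up for an hour after traffic stops
```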
Would you like to help us implement this feature by sending a PR?
Yes