[vLLM] Missing --max-model-len guard for offload-disabled agentic runs causes KV cache OOM on H100 #1587

**Describe the bug**
When running Kimi-K2.5 INT4 agentic benchmarks with vLLM on 8×H100 (TP=8) with CPU KV cache offloading disabled, the engine fails immediately at startup because the available KV cache memory cannot hold a single request at the model's full context length. The server never starts and all TP workers are terminated.

https://github.com/SemiAnalysisAI/InferenceX/blob/c9798a7708826dd4ead79acad4500000f72d5576/benchmarks/single_node/agentic/kimik2.5_int4_h100.sh#L4

The run is triggered from the master config file here:
https://github.com/SemiAnalysisAI/InferenceX/blob/c9798a7708826dd4ead79acad4500000f72d5576/.github/configs/nvidia-master.yaml#L9195

**To Reproduce**
1. Launch a vLLM agentic benchmark for Kimi-K2.5 INT4 on 8×H100 with TP=8
2. Set offloading to `none` (no CPU KV cache offloading)
3. Do not set `--max-model-len` in the `vllm serve` command
4. Observe engine crash at startup

**Expected behavior**
The vLLM server should start successfully and serve agentic requests. Either the launch script should cap `--max-model-len` at a value that fits in available VRAM when offloading is disabled, or the error should be surfaced earlier with a clear recommendation.
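
A minimal sketch of what such a guard could look like in the launch script. The names `OFFLOADING`, `MAX_MODEL_LEN_NO_OFFLOAD`, and `max_model_len_args` are hypothetical and not taken from `kimik2.5_int4_h100.sh`; the default of 27392 is simply the estimate vLLM printed for this setup and would likely need tuning for other models or GPUs.

```shell
#!/usr/bin/env bash
# Sketch only: OFFLOADING, MAX_MODEL_LEN_NO_OFFLOAD and max_model_len_args are
# hypothetical names, not the ones used in the real launch script.

# Default cap: the "estimated maximum model length" reported by vLLM for this
# configuration. Override via the environment for other setups.
MAX_MODEL_LEN_NO_OFFLOAD="${MAX_MODEL_LEN_NO_OFFLOAD:-27392}"

# Print the extra `vllm serve` arguments for a given offloading mode.
# Prints nothing when offloading is enabled, so the model default applies.
max_model_len_args() {
  local offloading="$1"
  if [ "$offloading" = "none" ]; then
    printf '%s %s\n' "--max-model-len" "$MAX_MODEL_LEN_NO_OFFLOAD"
  fi
}

# Example wiring (commented out so the sketch runs without a GPU):
# vllm serve "$MODEL" --tensor-parallel-size 8 $(max_model_len_args "$OFFLOADING")
```

Keeping the cap in one overridable variable means the guard only applies to the offload-disabled path and leaves other configurations unchanged.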

**Screenshots**
N/A

**Additional context**
Root cause: without `--max-model-len`, vLLM defaults to the model's full context length (262,144 tokens), which requires 8.58 GiB of KV cache to serve a single request. On 8×H100 with TP=8, only ~0.9 GiB of VRAM remains for KV cache after the INT4 weights are loaded, and vLLM estimates the largest model length that fits at 27,392 tokens.
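
As a rough sanity check on those numbers, the sketch below scales the full context length by the ratio of available to required KV cache memory. It assumes KV cache size grows linearly with sequence length and uses the rounded figures from the error message; `estimate_max_tokens` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate, assuming KV cache memory scales linearly
# with sequence length. Arguments: needed GiB, available GiB, full context.
estimate_max_tokens() {
  awk -v need="$1" -v avail="$2" -v ctx="$3" \
    'BEGIN { printf "%d\n", ctx * avail / need }'
}

# Figures copied from the vLLM error message (both are rounded in the log).
estimate_max_tokens 8.58 0.9 262144
```

This lands in the same range as vLLM's own estimate of 27392. The small difference is plausibly down to rounding in the logged figures and block-size granularity, but that is an assumption rather than something confirmed from the vLLM source.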

**Error**
```
ValueError: To serve at least one request with the model's max seq len (262144),
(8.58 GiB KV cache is needed, which is larger than the available KV cache memory
(0.9 GiB). Based on the available memory, the estimated maximum model length is 27392.
Try increasing gpu_memory_utilization or decreasing max_model_len when initializing
the engine.
```
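
One possible way to surface the recommendation earlier is to pull vLLM's suggested length out of the server log and print it next to the failure. This is only a sketch: `suggested_max_model_len` is a hypothetical helper, and it relies on the exact wording of the message above, which may change between vLLM versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a vLLM server log on stdin and print the
# "estimated maximum model length" from the first matching line, if any.
suggested_max_model_len() {
  sed -n 's/.*estimated maximum model length is \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example usage (the log path is a placeholder):
# cap="$(suggested_max_model_len < server.log)"
# [ -n "$cap" ] && echo "Hint: retry with --max-model-len $cap" >&2
```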
