[vLLM] Missing --max-model-len guard for offload-disabled agentic runs causes KV cache OOM on H100 #1587

@Aavache

Describe the bug
When running Kimi-K2.5 INT4 agentic benchmarks with vLLM on 8×H100 (TP=8) without CPU offloading, the engine crashes immediately at startup due to insufficient KV cache memory. The server never starts and all TP workers are terminated.

It is triggered by the following entry in the master file:

- { tp: 8, offloading: none, conc-list: [1, 2, 4, 8, 12, 16, 20] }

To Reproduce

  1. Launch a vLLM agentic benchmark for Kimi-K2.5 INT4 on 8×H100 with TP=8
  2. Set offloading to none (no CPU KV cache offloading)
  3. Do not set --max-model-len in the vllm serve command
  4. Observe engine crash at startup

Expected behavior
The vLLM server should start successfully and serve agentic requests. Either the launch script should cap --max-model-len to a value that fits in available VRAM when offloading is disabled, or the error should be surfaced earlier with a clear recommendation.
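One possible shape for such a guard, as a minimal sketch rather than the actual launch script: `build_serve_args`, its parameters, and the `NO_OFFLOAD_MAX_MODEL_LEN` constant are hypothetical names, and the cap of 27392 is simply the estimate vLLM prints in the error reported in this issue. Only `--tensor-parallel-size` and `--max-model-len` are real `vllm serve` flags.

```python
from typing import List, Optional

# Assumption: cap taken from the "estimated maximum model length" that vLLM
# reports in this issue's error message; tune it per model and hardware.
NO_OFFLOAD_MAX_MODEL_LEN = 27392


def build_serve_args(
    model: str,
    tp: int,
    offloading: str,
    max_model_len: Optional[int] = None,
) -> List[str]:
    """Build a `vllm serve` command line (hypothetical helper).

    If offloading is "none" and the caller did not pick a context length,
    fall back to a cap that is expected to fit in the remaining VRAM.
    """
    args = ["vllm", "serve", model, "--tensor-parallel-size", str(tp)]
    if max_model_len is None and offloading == "none":
        max_model_len = NO_OFFLOAD_MAX_MODEL_LEN
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    return args


if __name__ == "__main__":
    # "<model>" is a placeholder, not a real model identifier.
    print(" ".join(build_serve_args("<model>", tp=8, offloading="none")))
```

An explicit `max_model_len` always wins, so configurations that already set the flag keep their value; the guard only fills the gap for offload-disabled runs.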

Screenshots
N/A

Additional context
Root cause: without --max-model-len, vLLM defaults to the model's full context length (262,144 tokens), which needs 8.58 GiB of KV cache to serve a single request. On 8×H100 with TP=8, only ~0.9 GiB of VRAM remains for KV cache after the INT4 weights are loaded, so vLLM estimates a maximum model length of 27,392 tokens.
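The numbers above can be cross-checked with a short back-of-the-envelope calculation. This is only a sketch: it assumes KV cache size scales linearly with context length and uses the rounded figures from the error message, so it lands near, not exactly on, vLLM's own estimate.

```python
GIB = 1024 ** 3

full_context_tokens = 262_144   # model's default max seq len
kv_needed_gib = 8.58            # KV cache needed for one full-length request
kv_available_gib = 0.9          # KV cache memory left after loading weights

# Assumption: KV cache grows linearly with the number of tokens.
bytes_per_token = kv_needed_gib * GIB / full_context_tokens
tokens_that_fit = int(kv_available_gib * GIB / bytes_per_token)

print(f"KV cache per token: {bytes_per_token / 1024:.1f} KiB")
print(f"Tokens that fit in {kv_available_gib} GiB: {tokens_that_fit}")
```

The result is roughly 27.5k tokens, in line with the 27392 that vLLM reports; the small gap is plausibly due to the rounded GiB figures and vLLM's internal rounding.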

Error

ValueError: To serve at least one request with the model's max seq len (262144),
(8.58 GiB KV cache is needed, which is larger than the available KV cache memory
(0.9 GiB). Based on the available memory, the estimated maximum model length is 27392.
Try increasing gpu_memory_utilization or decreasing max_model_len when initializing
the engine.
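If the launch script captures the engine's stderr, the estimate in this message could be turned into a concrete recommendation. The sketch below is an assumption-laden illustration: `suggest_max_model_len` is a hypothetical helper, and it relies on the exact wording "estimated maximum model length is N", which may change between vLLM versions.

```python
import re
from typing import Optional

# Matches the hint in the error text quoted in this issue.
_ESTIMATE_RE = re.compile(r"estimated maximum model length is (\d+)")


def suggest_max_model_len(error_text: str) -> Optional[int]:
    """Return vLLM's estimated max model length from an error, if present."""
    # Collapse whitespace so the pattern still matches if the log is wrapped.
    match = _ESTIMATE_RE.search(" ".join(error_text.split()))
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    message = (
        "Based on the available memory, the estimated maximum model "
        "length is 27392."
    )
    hint = suggest_max_model_len(message)
    if hint is not None:
        print(f"Retry with --max-model-len {hint} or lower")
```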
