Describe the bug
When running Kimi-K2.5 INT4 agentic benchmarks with vLLM on 8×H100 (TP=8) without CPU offloading, the engine crashes immediately at startup due to insufficient KV cache memory. The server never starts and all TP workers are terminated.
It is triggered by this entry in the master file:
- { tp: 8, offloading: none, conc-list: [1, 2, 4, 8, 12, 16, 20] }
To Reproduce
- Launch a vLLM agentic benchmark for Kimi-K2.5 INT4 on 8×H100 with TP=8
- Set offloading to none (no CPU KV cache offloading)
- Do not set --max-model-len in the vllm serve command
- Observe engine crash at startup
Expected behavior
The vLLM server should start successfully and serve agentic requests. Either the launch script should cap --max-model-len to a value that fits in available VRAM when offloading is disabled, or the error should be surfaced earlier with a clear recommendation.
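The first option could look roughly like the sketch below: a launcher helper that appends --max-model-len only when offloading is disabled. The function name, the model placeholder, and the cap value are hypothetical. --max-model-len and --tensor-parallel-size are existing vllm serve options, but the real launch script may be structured quite differently.

```python
def build_serve_args(model, tp, offloading, max_model_len_cap=None):
    """Hypothetical helper: build a `vllm serve` argument list.

    The context-length cap is applied only when KV cache offloading is off,
    since that is the configuration that runs out of KV cache memory here.
    """
    args = ["vllm", "serve", model, "--tensor-parallel-size", str(tp)]
    if offloading == "none" and max_model_len_cap is not None:
        args += ["--max-model-len", str(max_model_len_cap)]
    return args


# Illustrative cap only: 27392 is the estimate vLLM prints for this setup.
# A real cap would probably need some headroom below that value.
print(" ".join(build_serve_args("<model>", 8, "none", 27392)))
```

Keeping the cap conditional on the offloading setting leaves the other configurations unchanged.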
Screenshots
N/A
Additional context
Root cause: without --max-model-len, vLLM defaults to the model's full context length (262,144 tokens), and serving a single request at that length requires 8.58 GiB of KV cache. On 8×H100 with TP=8, only ~0.9 GiB of VRAM remains for KV cache after the INT4 weights are loaded.
Error
ValueError: To serve at least one request with the model's max seq len (262144),
(8.58 GiB KV cache is needed, which is larger than the available KV cache memory
(0.9 GiB). Based on the available memory, the estimated maximum model length is 27392.
Try increasing gpu_memory_utilization or decreasing max_model_len when initializing
the engine.
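The figures in this error can be sanity-checked with a rough proportional estimate. The sketch below assumes that KV cache demand scales linearly with sequence length and that the result is rounded down to a multiple of a 16-token block. The helper name and the block size are assumptions for illustration, not vLLM's actual code. Because the log rounds its GiB values, the result only approximates the 27392 that vLLM reports.

```python
def estimate_max_model_len(max_seq_len, needed_gib, available_gib, block_size=16):
    """Hypothetical helper: scale the context length by available/needed KV memory.

    Assumes linear scaling of KV cache size with sequence length and rounds
    down to a whole number of blocks (block_size is an assumed value).
    """
    tokens = int(max_seq_len * available_gib / needed_gib)
    return tokens // block_size * block_size


# Values copied from the error message above (rounded as printed in the log).
estimate = estimate_max_model_len(262144, 8.58, 0.9)
print(estimate)
```

With the rounded inputs from the log, the estimate comes out near 27,500 tokens. That agrees with vLLM's own figure to within the rounding error of the printed 0.9 GiB.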
Referenced code (commit c9798a7):
- InferenceX/benchmarks/single_node/agentic/kimik2.5_int4_h100.sh, line 4
- InferenceX/.github/configs/nvidia-master.yaml, line 9195 (the master-file entry that triggers the failure)