fix(loadgen-parity): disable skip-special-tokens to have same TTFT calculation as loadgen #380
Code Review
This pull request introduces support for min_tokens and skip_special_tokens parameters for OpenAI text-completions servers. It updates the configuration schema, templates, adapters, and request types, and adds corresponding validation rules and unit tests. The reviewer suggested improving the validation error message in schema.py to dynamically reference only the parameters that were actually set, ensuring consistency and clarity.
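One way to build the validation message from only the controls that were actually set, as the review suggests, is sketched below. This is a minimal illustration, not the code in schema.py: the class name, the `FORWARDING_API_TYPES` set, and the API type strings are hypothetical, and only the two control names and their defaults come from the PR description.

```python
from dataclasses import dataclass

# Hypothetical API type names; the real schema may use different identifiers.
FORWARDING_API_TYPES = {"openai_completions"}

# Defaults as stated in the PR description.
DEFAULTS = {"min_new_tokens": 1, "skip_special_tokens": True}


@dataclass
class GenerationControls:
    """Hypothetical stand-in for the config model that carries both controls."""

    api_type: str
    min_new_tokens: int = 1
    skip_special_tokens: bool = True

    def non_default_controls(self) -> list[str]:
        """Return the names of controls whose value differs from its default."""
        return [
            name
            for name, default in DEFAULTS.items()
            if getattr(self, name) != default
        ]

    def validate(self) -> None:
        """Reject non-default controls on API types that cannot forward them."""
        changed = self.non_default_controls()
        if changed and self.api_type not in FORWARDING_API_TYPES:
            # Name only the controls that were actually changed.
            names = ", ".join(changed)
            raise ValueError(
                f"{names} cannot be forwarded by api_type={self.api_type!r}; "
                "leave the default or use a text-completions endpoint"
            )
```

With this shape, a config that changes only skip_special_tokens produces an error that mentions skip_special_tokens and nothing else, while a config left at defaults passes on any API type.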
What does this PR do?
Adds Endpoints-facing min_new_tokens and skip_special_tokens controls to OpenAI text completions. The min_new_tokens control defaults to 1 and maps to the TRT-LLM /v1/completions wire field min_tokens; skip_special_tokens defaults to true. Both are serialized explicitly. Non-default controls fail validation on API types that cannot forward them. The GPT-OSS vLLM example sets them to 1 and false.

NVIDIA LoadGen sets these values in its generation config and forwards both fields. Without skip_special_tokens=false, exactly one token is hidden at the first emission: token 200005, <|channel|>. This does not hide 20 tokens: with this B300 config's stream_interval=20, TensorRT-LLM emits after token 1 and then after token 20, so suppressing token 1 delays observable TTFT by 19 decode intervals. A normal complete response loses six delimiters in total: <|channel|> <|message|> <|end|> <|start|> <|channel|> <|message|>; those later removals affect returned text and accuracy parsing, not initial TTFT.

Type of change
Related issues
Refs #8, #26, #132, #344.
B300x8 Server ablation
Latency is p99 [min, mean, max] in milliseconds. Each Endpoints run used fresh eight-server B300x8 endpoints, 47,400 requests, QPS 79, seeds 42, disabled warmup, and performance_timeout_s: null. LoadGen includes its default 8 cores x 10-query warmup, so max TTFT is not yet a like-for-like comparison; a fresh Endpoints warmup-80 remeasurement is running.

[Table: rows for min_tokens=1, skip=false; min_new_tokens=1 only; skip_special_tokens=false only. Cell values not recoverable.]

min_new_tokens=1 matched baseline. Both controls cut p99/mean TTFT by 81.13%/85.72% at 0.11% lower QPS; the both-controls and skip-only runs were indistinguishable, so skip_special_tokens=false is the TTFT fix. The skip-only result repeated at 145.177 ms mean / 265.518 ms p99 TTFT, and a 16-worker control left baseline TTFT unchanged.

The min-only run needed the configured 300 s worker-initialization wait because one internal service became ready at 33.62 s versus the current hard-coded 30 s. This changed startup only; no requests were issued before readiness.
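The 19-interval TTFT delay cited in the PR description can be checked with a small simulation. This is a sketch under stated assumptions, not TensorRT-LLM's actual streaming logic: it assumes chunks are emitted after token 1, then at every multiple of stream_interval, plus a final flush, and that the client records TTFT at the first chunk containing at least one visible token. Both function names are hypothetical.

```python
from typing import Optional


def emission_points(total_tokens: int, stream_interval: int) -> list[int]:
    """Token counts at which a chunk is emitted (assumed schedule).

    Assumption: emit after token 1, at each multiple of stream_interval,
    and once more at the end of the response.
    """
    if total_tokens < 1:
        return []
    points = {1, total_tokens}
    points.update(range(stream_interval, total_tokens + 1, stream_interval))
    return sorted(points)


def first_visible_emission(
    hidden_positions: set[int], total_tokens: int, stream_interval: int
) -> Optional[int]:
    """Token count at the first chunk that carries any non-hidden token."""
    previous = 0
    for point in emission_points(total_tokens, stream_interval):
        chunk = range(previous + 1, point + 1)
        if any(pos not in hidden_positions for pos in chunk):
            return point
        previous = point
    return None


# Token 1 is the special token hidden when skip_special_tokens is true.
with_skip = first_visible_emission({1}, total_tokens=100, stream_interval=20)
without_skip = first_visible_emission(set(), total_tokens=100, stream_interval=20)
delay_in_decode_intervals = with_skip - without_skip  # 20 - 1 = 19
```

Under these assumptions the first chunk is empty when token 1 is hidden, so the client first sees text at token 20, which matches the 19 decode intervals stated above.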
GPT-OSS accuracy check
With both controls, official-evaluator proxy reruns passed: Offline 83.213 and Server 82.933 versus the 82.299 threshold, with 4,395/4,395 samples evaluated in each. The prior no-control runs scored 81.878/81.607 because stripped Harmony markers changed answer extraction. LoadGen's native-token results scored 83.412/83.672; Endpoints results remain labeled text-reencoded proxies until native backend-token capture is implemented.
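To illustrate why stripped Harmony markers can change answer extraction, the sketch below runs a hypothetical final-channel extractor over the same response with and without its delimiters. The extractor, the regular expressions, and the sample response are assumptions made for illustration; this is not the official evaluator, and only the six delimiter names come from the PR description.

```python
import re
from typing import Optional

# Hypothetical extractor: take the text of the "final" channel message.
FINAL_RE = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|\Z)", re.DOTALL
)
# Matches the <|name|> delimiters that skip_special_tokens=true would drop.
MARKER_RE = re.compile(r"<\|[a-z]+\|>")


def extract_final(text: str) -> Optional[str]:
    """Return the final-channel message, or None if no final channel is found."""
    match = FINAL_RE.search(text)
    return match.group(1).strip() if match else None


def strip_markers(text: str) -> str:
    """Simulate a response returned with special tokens removed."""
    return MARKER_RE.sub("", text)


# Illustrative response containing the six delimiters listed in the PR.
with_markers = (
    "<|channel|>analysis<|message|>2+2 is 4.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Answer: 4"
)
without_markers = strip_markers(with_markers)

kept = extract_final(with_markers)      # "Answer: 4"
lost = extract_final(without_markers)   # None: the channel boundary is gone
```

Once the delimiters are gone, the analysis and final text run together, so any parser that relies on the channel boundary has to fall back to heuristics.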
Testing
1287 passed, 5 skipped; slow/performance/explicit tests excluded
pre-commit run --all-files
Checklist