fix(loadgen-parity): disable skip-special-tokens to have same TTFT calculation as loadgen #380

Merged: viraatc merged 5 commits into main from codex/completions-generation-parity on Jun 30, 2026
@viraatc viraatc commented Jun 30, 2026

What does this PR do?

Adds Endpoints-facing min_new_tokens and skip_special_tokens controls to OpenAI text completions. min_new_tokens defaults to 1 and maps to the TRT-LLM /v1/completions wire field min_tokens; skip_special_tokens defaults to true. Both are serialized explicitly. Non-default controls fail validation on API types that cannot forward them. The GPT-OSS vLLM example uses 1 and false.
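The mapping described above can be sketched as follows. This is a minimal illustration, not the PR's adapter code: the helper name `build_completion_payload`, the `api_type` labels, and the error message are all hypothetical. It shows the rename from `min_new_tokens` to the wire field `min_tokens`, explicit serialization of both fields, and rejection of non-default values on an API type that cannot forward them.

```python
from typing import Any

# Defaults as stated in the PR description.
DEFAULT_MIN_NEW_TOKENS = 1
DEFAULT_SKIP_SPECIAL_TOKENS = True

# Hypothetical label for the API type that can forward both controls.
FORWARDING_API_TYPES = {"completions"}


def build_completion_payload(
    prompt: str,
    api_type: str = "completions",
    min_new_tokens: int = DEFAULT_MIN_NEW_TOKENS,
    skip_special_tokens: bool = DEFAULT_SKIP_SPECIAL_TOKENS,
) -> dict[str, Any]:
    """Build a request body (illustrative only, names are assumptions)."""
    non_default = (
        min_new_tokens != DEFAULT_MIN_NEW_TOKENS
        or skip_special_tokens != DEFAULT_SKIP_SPECIAL_TOKENS
    )
    if api_type not in FORWARDING_API_TYPES:
        # Non-default controls must not be silently dropped.
        if non_default:
            raise ValueError(
                f"api_type={api_type!r} cannot forward generation controls"
            )
        return {"prompt": prompt}
    # Both fields are always written, even when they hold default values.
    return {
        "prompt": prompt,
        "min_tokens": min_new_tokens,  # wire name differs from config name
        "skip_special_tokens": skip_special_tokens,
    }
```

For the GPT-OSS example settings, `build_completion_payload("hi", skip_special_tokens=False)` yields a body containing `min_tokens` set to 1 and `skip_special_tokens` set to false.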

NVIDIA LoadGen sets these values in its generation config and forwards both fields. Without skip_special_tokens=false, exactly one token is hidden at the first emission: token 200005, <|channel|>. This does not hide 20 tokens: with this B300 config's stream_interval=20, TensorRT-LLM emits after token 1 and then token 20, so suppressing token 1 delays observable TTFT by 19 decode intervals. A normal complete response loses six delimiters in total: <|channel|> <|message|> <|end|> <|start|> <|channel|> <|message|>; those later removals affect returned text and accuracy parsing, not initial TTFT.
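The 19-interval figure follows from the emission schedule stated above. The sketch below models that schedule under explicit assumptions: emissions happen after token 1, at every multiple of `stream_interval`, and at the final token, and an emission is observable only if its chunk contains at least one token that is not hidden. The schedule is taken from this description rather than from TensorRT-LLM source, and the function names are hypothetical.

```python
from typing import Optional


def emission_points(stream_interval: int, total_tokens: int) -> list[int]:
    """1-based token indices at which the modeled server emits a chunk."""
    points = {1, total_tokens}
    points.update(range(stream_interval, total_tokens + 1, stream_interval))
    return sorted(points)


def first_visible_emission(
    stream_interval: int, total_tokens: int, hidden: set[int]
) -> Optional[int]:
    """Token index of the first chunk containing any non-hidden token."""
    previous = 0
    for point in emission_points(stream_interval, total_tokens):
        chunk = range(previous + 1, point + 1)
        if any(token not in hidden for token in chunk):
            return point
        previous = point
    return None


# Token 1 is the special token that skip_special_tokens=true hides.
with_skip = first_visible_emission(20, 40, hidden={1})
without_skip = first_visible_emission(20, 40, hidden=set())
delay_in_decode_intervals = with_skip - without_skip
```

Under these assumptions the first observable chunk moves from token 1 to token 20, so `delay_in_decode_intervals` is 19, matching the description.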

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Refs #8, #26, #132, #344.

B300x8 Server ablation

Latency is p99 [min, mean, max] in milliseconds. Each Endpoint run used fresh eight-server B300x8 endpoints, 47,400 requests, QPS 79, seeds 42, disabled warmup, and performance_timeout_s: null. LoadGen includes its default 8 cores x 10-query warmup, so max TTFT is not yet a like-for-like comparison; a fresh Endpoints warmup-80 remeasurement is running.

| Request controls | QPS | tokens/s | TTFT (ms) | TPOT (ms) | Mean OSL |
| --- | --- | --- | --- | --- | --- |
| LoadGen reference (wire min_tokens=1, skip=false) | 76.7182 | 100280.320 | 340.079 [37.099, 165.633, 850.840] | 50.019 [14.536, 45.004, 54.790] | 1307.126 |
| Neither (Endpoint baseline) | 75.8316 | 98818.393 | 1369.653 [169.662, 993.379, 2229.535] | 51.974 [14.956, 44.697, 52.667] | 1303.101 |
| min_new_tokens=1 only | 75.8750 | 98884.666 | 1372.408 [179.558, 995.611, 2229.344] | 53.024 [13.688, 44.809, 54.205] | 1303.229 |
| skip_special_tokens=false only | 75.8549 | 99283.502 | 259.395 [39.938, 141.682, 1480.822] | 52.325 [13.579, 44.606, 53.126] | 1308.833 |
| Both | 75.7462 | 99166.551 | 258.465 [40.085, 141.888, 1531.742] | 54.423 [12.874, 44.604, 55.182] | 1309.168 |

min_new_tokens=1 matched baseline. Both controls cut p99/mean TTFT by 81.13%/85.72% at 0.11% lower QPS; both and skip-only were indistinguishable, so skip_special_tokens=false is the TTFT fix. The skip-only result repeated at 145.177 ms mean / 265.518 ms p99 TTFT, and a 16-worker control left baseline TTFT unchanged.
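The percentages above can be rechecked from the table rows for "Neither (Endpoint baseline)" and "Both". The helper name `percent_reduction` is introduced here for illustration only; the input values are copied from the table.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction from baseline to value, in percent."""
    return (baseline - value) / baseline * 100.0


# Baseline row vs. "Both" row from the ablation table.
p99_ttft_cut = percent_reduction(1369.653, 258.465)
mean_ttft_cut = percent_reduction(993.379, 141.888)
qps_drop = percent_reduction(75.8316, 75.7462)

print(f"p99 TTFT cut:  {p99_ttft_cut:.2f}%")
print(f"mean TTFT cut: {mean_ttft_cut:.2f}%")
print(f"QPS drop:      {qps_drop:.2f}%")
```

Rounded to two decimals, these reproduce the 81.13%, 85.72%, and 0.11% figures quoted above.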

The min-only run needed the configured 300 s worker-initialization wait because one internal service became ready at 33.62 s versus the current hard-coded 30 s. This changed startup only; no requests were issued before readiness.

GPT-OSS accuracy check

With both controls, official-evaluator proxy reruns passed: Offline 83.213 and Server 82.933 versus the 82.299 threshold, with 4,395/4,395 samples evaluated in each. The prior no-control runs scored 81.878/81.607 because stripped Harmony markers changed answer extraction. LoadGen's native-token results scored 83.412/83.672; Endpoints results remain labeled text-reencoded proxies until native backend-token capture is implemented.

Testing

  • Tests added/updated
  • Supported unit/integration suite passes locally (1287 passed, 5 skipped; slow/performance/explicit tests excluded)
  • Manual testing completed
  • pre-commit run --all-files

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (GPT-OSS example and generated templates)

@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj June 30, 2026 07:19

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for min_tokens and skip_special_tokens parameters for OpenAI text-completions servers. It updates the configuration schema, templates, adapters, and request types, and adds corresponding validation rules and unit tests. The reviewer suggested improving the validation error message in schema.py to dynamically reference only the parameters that were actually set, ensuring consistency and clarity.

Comment thread src/inference_endpoint/config/schema.py Outdated
@viraatc viraatc changed the title from "[codex] forward completion generation controls" to "fix(loadgen-parity): disable skip-special-tokens to have same TTFT calculation as loadgen" Jun 30, 2026
@viraatc viraatc marked this pull request as ready for review June 30, 2026 10:23
@viraatc viraatc requested a review from a team June 30, 2026 10:23

@arekay-nv arekay-nv left a comment

Thanks!

Comment thread src/inference_endpoint/config/schema.py Outdated
@viraatc viraatc merged commit b2b508c into main Jun 30, 2026
8 checks passed
@viraatc viraatc deleted the codex/completions-generation-parity branch June 30, 2026 23:04
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 30, 2026