fix(loadgen-parity): disable skip-special-tokens to have same TTFT calculation as loadgen by viraatc · Pull Request #380 · mlcommons/endpoints

viraatc · 2026-06-30T07:19:26Z

What does this PR do?

Adds Endpoints-facing min_new_tokens and skip_special_tokens controls to OpenAI text completions. min_new_tokens defaults to 1 and maps to the TRT-LLM /v1/completions wire field min_tokens; skip_special_tokens defaults to true. Both are serialized explicitly. Non-default controls fail validation on API types that cannot forward them. The GPT-OSS vLLM example uses 1 and false.
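A minimal sketch of how the two controls described above could be validated and serialized. Only `min_new_tokens`, `skip_special_tokens`, their defaults, and the wire field `min_tokens` come from this PR; the class name, `validate_for`, `to_wire`, and the API-type string are hypothetical and are not taken from the actual code in `schema.py`.

```python
from dataclasses import dataclass

# Hypothetical set of API types able to forward the controls; the real
# list is defined by the project and may differ.
FORWARDING_API_TYPES = {"openai_completions"}


@dataclass(frozen=True)
class CompletionControls:
    """Illustrative container; not the project's real class."""

    min_new_tokens: int = 1            # default stated in the PR
    skip_special_tokens: bool = True   # default stated in the PR

    def is_default(self) -> bool:
        return self == CompletionControls()

    def validate_for(self, api_type: str) -> None:
        """Reject non-default controls on API types that cannot forward them."""
        if api_type in FORWARDING_API_TYPES or self.is_default():
            return
        defaults = CompletionControls()
        # Name only the fields that were actually changed from their defaults.
        changed = [
            name
            for name in ("min_new_tokens", "skip_special_tokens")
            if getattr(self, name) != getattr(defaults, name)
        ]
        raise ValueError(
            f"{', '.join(changed)} cannot be forwarded by API type {api_type!r}"
        )

    def to_wire(self) -> dict:
        """Serialize both fields explicitly, renaming min_new_tokens."""
        return {
            "min_tokens": self.min_new_tokens,
            "skip_special_tokens": self.skip_special_tokens,
        }
```

With the GPT-OSS example values (1 and false), `to_wire()` in this sketch yields `{"min_tokens": 1, "skip_special_tokens": False}`.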

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Refs #8, #26, #132, #344.

B300x8 Server ablation

Latency is p99 [min, mean, max] in milliseconds. Each Endpoint run used fresh eight-server B300x8 endpoints, 47,400 requests, QPS 79, seeds 42, disabled warmup, and performance_timeout_s: null. LoadGen includes its default 8 cores x 10-query warmup, so max TTFT is not yet a like-for-like comparison; a fresh Endpoints warmup-80 remeasurement is running.

| Request controls | QPS | Tokens/s | TTFT p99 [min, mean, max] (ms) | TPOT p99 [min, mean, max] (ms) | Mean OSL |
|---|---|---|---|---|---|
| LoadGen reference (wire `min_tokens=1`, `skip=false`) | 76.7182 | 100280.320 | 340.079 [37.099, 165.633, 850.840] | 50.019 [14.536, 45.004, 54.790] | 1307.126 |
| Neither (Endpoint baseline) | 75.8316 | 98818.393 | 1369.653 [169.662, 993.379, 2229.535] | 51.974 [14.956, 44.697, 52.667] | 1303.101 |
| `min_new_tokens=1` only | 75.8750 | 98884.666 | 1372.408 [179.558, 995.611, 2229.344] | 53.024 [13.688, 44.809, 54.205] | 1303.229 |
| `skip_special_tokens=false` only | 75.8549 | 99283.502 | 259.395 [39.938, 141.682, 1480.822] | 52.325 [13.579, 44.606, 53.126] | 1308.833 |
| Both | 75.7462 | 99166.551 | 258.465 [40.085, 141.888, 1531.742] | 54.423 [12.874, 44.604, 55.182] | 1309.168 |

min_new_tokens=1 alone matched the baseline. Enabling both controls cut p99/mean TTFT by 81.13%/85.72% at 0.11% lower QPS. The both-controls and skip-only runs were indistinguishable, so skip_special_tokens=false is the TTFT fix. The skip-only result repeated at 145.177 ms mean / 265.518 ms p99 TTFT, and a 16-worker control run left baseline TTFT unchanged.
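The quoted reductions can be rechecked from the table with a few lines of arithmetic. The sketch below copies the "Neither" and "Both" rows; the helper name `pct_reduction` and the dictionary keys are illustrative only.

```python
def pct_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline * 100.0


# Values copied from the "Neither" and "Both" rows of the ablation table.
baseline = {"ttft_p99": 1369.653, "ttft_mean": 993.379, "qps": 75.8316}
both = {"ttft_p99": 258.465, "ttft_mean": 141.888, "qps": 75.7462}

reductions = {
    key: round(pct_reduction(baseline[key], both[key]), 2) for key in baseline
}
print(reductions)
```

Under these inputs the results round to the 81.13% (p99 TTFT), 85.72% (mean TTFT), and 0.11% (QPS) figures quoted above.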

The min-only run needed the configured 300 s worker-initialization wait because one internal service became ready at 33.62 s, past the current hard-coded 30 s limit. This changed startup only; no requests were issued before readiness.

GPT-OSS accuracy check

With both controls, official-evaluator proxy reruns passed: Offline 83.213 and Server 82.933 versus the 82.299 threshold, with 4,395/4,395 samples evaluated in each. The prior no-control runs scored 81.878/81.607 because stripped Harmony markers changed answer extraction. LoadGen's native-token results scored 83.412/83.672; Endpoints results remain labeled text-reencoded proxies until native backend-token capture is implemented.
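As a quick cross-check of the pass/fail statements above, the following sketch computes each score's margin over the 82.299 threshold. It assumes the "81.878/81.607" pair is ordered Offline/Server like the other pairs; the `margins` helper and dictionary layout are illustrative.

```python
THRESHOLD = 82.299  # accuracy threshold quoted in the PR

# Scores copied from the PR text; Offline/Server ordering of the
# no-control pair is an assumption.
scores = {
    "with controls": {"Offline": 83.213, "Server": 82.933},
    "no controls": {"Offline": 81.878, "Server": 81.607},
}


def margins(run: dict, threshold: float = THRESHOLD) -> dict:
    """Score minus threshold per scenario; negative means below threshold."""
    return {scenario: round(score - threshold, 3) for scenario, score in run.items()}


for label, run in scores.items():
    passed = all(margin >= 0 for margin in margins(run).values())
    print(label, margins(run), "pass" if passed else "fail")
```

In this sketch both with-controls scores clear the threshold and both no-control scores fall below it, which is consistent with the description above.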

Testing

Tests added/updated
Supported unit/integration suite passes locally (1287 passed, 5 skipped; slow/performance/explicit tests excluded)
Manual testing completed
pre-commit run --all-files

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (GPT-OSS example and generated templates)

github-actions · 2026-06-30T07:19:34Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces support for min_tokens and skip_special_tokens parameters for OpenAI text-completions servers. It updates the configuration schema, templates, adapters, and request types, and adds corresponding validation rules and unit tests. The reviewer suggested improving the validation error message in schema.py to dynamically reference only the parameters that were actually set, ensuring consistency and clarity.

arekay-nv

Thanks!

feat: forward completion generation controls

b7426fd

github-actions Bot requested review from arekay-nv and nvzhihanj June 30, 2026 07:19

gemini-code-assist Bot reviewed Jun 30, 2026

Comment thread src/inference_endpoint/config/schema.py Outdated

fix: reject controls for agentic datasets

b4ad4e7

viraatc changed the title ~~[codex] forward completion generation controls~~ fix(loadgen-parity): disable skip-special-tokens to have same TTFT calculation as loadgen Jun 30, 2026

fix: clarify completion control validation

d16d1b8

viraatc marked this pull request as ready for review June 30, 2026 10:23

viraatc requested a review from a team June 30, 2026 10:23

arekay-nv approved these changes Jun 30, 2026

Comment thread src/inference_endpoint/config/schema.py Outdated

viraatc added 2 commits June 30, 2026 14:12

fix: align completion control names and defaults

ae46b76

fix: default minimum generation length to one

5aec284

viraatc merged commit b2b508c into main Jun 30, 2026
8 checks passed

viraatc deleted the codex/completions-generation-parity branch June 30, 2026 23:04

github-actions Bot locked and limited conversation to collaborators Jun 30, 2026

viraatc merged 5 commits into main from codex/completions-generation-parity
