
[None][feat] Reduce sampler overhead with min_tokens #13480

Open
galagam wants to merge 2 commits into NVIDIA:main from nv-auto-deploy:dev-gagam-sampler-min-tokens-fix

Conversation

@galagam
Collaborator

@galagam galagam commented Apr 26, 2026

Summary by CodeRabbit

Release Notes

  • Refactor
    • Optimized token sampling performance in text generation. The min-length penalty application now uses more efficient batch processing for faster generation speeds.

Description

  • Move the allocation outside of the loop.
  • Set the dtype to match the logits, avoiding an implicit cast in each iteration.
  • Replace per-request D2D copies (in the loop) with 2 per-batch H2D copies (for the indices) plus 1 per-batch GPU kernel (index_put_).
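The pattern described by these bullets can be sketched as follows. The function name, argument names, and shapes below are assumptions for illustration, not the actual `sampler.py` signature: indices for all still-penalized requests are accumulated on the host, moved to the device in two H2D copies, and applied with a single `index_put_` kernel whose values tensor already matches the logits dtype.

```python
import torch

NEG_INF = float("-inf")

def apply_min_length_penalty_batched(logits: torch.Tensor,
                                     end_token_ids: list[list[int]],
                                     num_generated: list[int],
                                     min_lengths: list[int]) -> torch.Tensor:
    """Mask end-of-sequence tokens for requests that have not yet reached
    their minimum length. Hypothetical sketch: one batched index_put_
    replaces per-request device-to-device copies."""
    row_idx: list[int] = []
    col_idx: list[int] = []
    for row, (eos_ids, produced, min_len) in enumerate(
            zip(end_token_ids, num_generated, min_lengths)):
        if produced < min_len:  # penalty still applies to this request
            for eos in eos_ids:
                row_idx.append(row)
                col_idx.append(eos)
    if not row_idx:  # early return: no request needs the penalty
        return logits
    device = logits.device
    rows = torch.tensor(row_idx, device=device)  # H2D copy 1
    cols = torch.tensor(col_idx, device=device)  # H2D copy 2
    # Values tensor dtype matches logits, so no implicit cast per iteration;
    # index_put_ launches a single kernel for the whole batch.
    penalty = torch.full((len(row_idx),), NEG_INF,
                         dtype=logits.dtype, device=device)
    logits.index_put_((rows, cols), penalty)
    return logits
```

On a batch of two requests where only the first is below its minimum length, only that request's EOS logit is masked.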

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

galagam added 2 commits April 26, 2026 12:36
- move allocation outside of loop
- set dtype to match logits and avoid implicit cast in each iteration

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
…-batch copy

per-request D2D copies (in loop) => 2 per-batch H2D copies (for the indices) + 1 per-batch GPU kernel (index_put_)

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
@galagam galagam self-assigned this Apr 26, 2026
@galagam galagam marked this pull request as ready for review April 27, 2026 13:59
@galagam galagam requested a review from a team as a code owner April 27, 2026 13:59
@galagam galagam requested a review from achartier April 27, 2026 13:59
@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38e8ff7a-c64c-4395-abe8-3efd8d0dfd71

📥 Commits

Reviewing files that changed from the base of the PR and between dd907c0 and 745dd2f.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py

📝 Walkthrough

Walkthrough

The _apply_min_length_penalty function in the sampler is optimized to use batched indexed tensor operations instead of per-position in-place logits assignment. The refactored approach accumulates affected indices and performs a single index_put_ operation, with early return when no penalty is needed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Min-length Penalty Optimization: tensorrt_llm/_torch/pyexecutor/sampler.py | Refactored _apply_min_length_penalty to replace nested Python loops with batched indexed updates using index_put_, improving performance through vectorized tensor operations instead of individual element assignments. |
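The equivalence between the two approaches the walkthrough describes can be sanity-checked with a small standalone snippet. The helper names (`penalty_loop`, `penalty_batched`) and arguments are hypothetical, not taken from the PR; they only contrast per-element assignment against one batched `index_put_`.

```python
import torch

def penalty_loop(logits, eos_ids, active_rows):
    # Baseline: per-element in-place assignment (one copy per element).
    out = logits.clone()
    for row in active_rows:
        for eos in eos_ids:
            out[row, eos] = float("-inf")
    return out

def penalty_batched(logits, eos_ids, active_rows):
    # Batched: build the index tensors once, apply one index_put_ kernel.
    out = logits.clone()
    rows = torch.tensor(active_rows).repeat_interleave(len(eos_ids))
    cols = torch.tensor(eos_ids).repeat(len(active_rows))
    vals = torch.full((rows.numel(),), float("-inf"), dtype=out.dtype)
    out.index_put_((rows, cols), vals)
    return out

logits = torch.randn(4, 8)
a = penalty_loop(logits, eos_ids=[2, 7], active_rows=[0, 3])
b = penalty_batched(logits, eos_ids=[2, 7], active_rows=[0, 3])
assert torch.equal(a, b)  # both approaches produce identical logits
```

Both paths mask the same (row, eos) positions; the batched variant simply does it in one kernel launch.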

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The PR description provides specific technical details about the optimization (moving allocations, dtype matching, replacing D2D copies with H2D copies and a GPU kernel), but the Test Coverage section is empty. | Add test coverage details explaining which existing tests validate the changes and whether new tests were added to ensure the optimization maintains correctness. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: reducing sampler overhead related to min_tokens through optimization of the min-length penalty application logic. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@galagam
Collaborator Author

galagam commented Apr 27, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45748 [ run ] triggered by Bot. Commit: 745dd2f Link to invocation


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants