
Conversation

@stnie (Collaborator) commented Oct 30, 2025

Summary by CodeRabbit

  • New Features

    • Added finish-reason tracking to identify why sequences stopped during generation (maximum length reached, end token produced, or stop word matched)
    • Enhanced state management to propagate and report finish reasons for each sequence
  • Tests

    • Expanded test coverage for finish-reason propagation and stop-word detection

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
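
For reference, a few hypothetical invocations composed from the flags documented above (the stage names, GPU types, and pipeline id are placeholders):

/bot run
/bot run --disable-fail-fast --skip-test
/bot run --stage-list "A10-PyTorch-1" --test-backend "pytorch"
/bot run --reuse-test 23211 --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"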

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.
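
Hypothetical usage of these subcommands, following the syntax above (the comment text is a placeholder):

/bot kill
/bot skip --comment "Docs-only change; CI not required"
/bot reuse-pipeline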

@Funatiq self-requested a review October 30, 2025 12:25
@stnie force-pushed the develop/sampler/move_stop_criteria_to_sample_async branch from 42829ec to 948e537 on October 31, 2025 17:07
@stnie (Collaborator, Author) commented Oct 31, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23211 [ run ] triggered by Bot. Commit: 948e537

@tensorrt-cicd (Collaborator)

PR_Github #23211 [ run ] completed with state SUCCESS. Commit: 948e537
/LLM/main/L0_MergeRequest_PR pipeline #17496 completed with status: 'FAILURE'

@stnie force-pushed the develop/sampler/move_stop_criteria_to_sample_async branch from 948e537 to f8cf485 on November 3, 2025 20:12
@stnie (Collaborator, Author) commented Nov 3, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23419 [ run ] triggered by Bot. Commit: f8cf485

@tensorrt-cicd (Collaborator)

PR_Github #23419 [ run ] completed with state SUCCESS. Commit: f8cf485
/LLM/main/L0_MergeRequest_PR pipeline #17635 completed with status: 'FAILURE'

@stnie force-pushed the develop/sampler/move_stop_criteria_to_sample_async branch from ff3690a to f665f52 on November 4, 2025 11:32
@stnie (Collaborator, Author) commented Nov 4, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23505 [ run ] triggered by Bot. Commit: f665f52

@tensorrt-cicd (Collaborator)

PR_Github #23505 [ run ] completed with state FAILURE. Commit: f665f52
/LLM/main/L0_MergeRequest_PR pipeline #17692 completed with status: 'FAILURE'

@stnie force-pushed the develop/sampler/move_stop_criteria_to_sample_async branch from f665f52 to 38056f7 on November 4, 2025 15:08
@stnie (Collaborator, Author) commented Nov 4, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23518 [ run ] triggered by Bot. Commit: 38056f7

@tensorrt-cicd (Collaborator)

PR_Github #23518 [ run ] completed with state SUCCESS. Commit: 38056f7
/LLM/main/L0_MergeRequest_PR pipeline #17700 completed with status: 'FAILURE'

@stnie changed the title to [https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) on Nov 4, 2025
netanel-haber and others added 5 commits November 5, 2025 09:14
Signed-off-by: Netanel Haber <[email protected]>
…t stop words processing to use the correct offsets.

- context requests may have new_tokens padded with 0. step_seq >= 0 has no effect in this case. Use draft token length instead to get the correct positions
- words is correctly indexed with the first L entries instead of the last L entries.

Signed-off-by: Stefan Niebler <[email protected]>
…ndle padded tokens and simplified _write_finish_reasons.

- simplified _write_finish_reasons in Torch Sampler
- fixed bug, where padded sequences in _are_stop_words were not correctly processed
- merged both test_torch_sampler.py files into one file
- updated MTPSampler to work with changes to TorchSampler

Signed-off-by: Stefan Niebler <[email protected]>
…uests

- reverted introduction of _prepare_new_requests
- removed unused buffer _finish_reasons_nonzero_static_buffer
- updated test_draft_token_tree_verification.py to correctly call _process_draft_tokens_tree

Signed-off-by: Stefan Niebler <[email protected]>
@stnie force-pushed the develop/sampler/move_stop_criteria_to_sample_async branch from 38056f7 to 5cae7fd on November 5, 2025 08:14
@stnie (Collaborator, Author) commented Nov 5, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23619 [ run ] triggered by Bot. Commit: 5cae7fd

@tensorrt-cicd (Collaborator)

PR_Github #23619 [ run ] completed with state SUCCESS. Commit: 5cae7fd
/LLM/main/L0_MergeRequest_PR pipeline #17772 completed with status: 'SUCCESS'

@stnie marked this pull request as ready for review November 5, 2025 14:29
@stnie requested review from a team as code owners November 5, 2025 14:29
@stnie requested review from kris1025 and shaharmor98 November 5, 2025 14:29
@coderabbitai bot (Contributor) commented Nov 5, 2025

📝 Walkthrough

Walkthrough

Introduced a per-batch finish-reason tracking mechanism for generation in TorchSampler. Added SampleStateTorch with host-side finish_reasons support, refactored sampler to propagate finish reasons across drafting and decoding steps, and added helper methods for stop-criteria logic. Updated MTPSampler and tests accordingly.

Changes

Cohort / File(s) Summary
Core sampler refactoring
tensorrt_llm/_torch/pyexecutor/sampler.py
Introduced finish-reason mechanism with new SampleStateTorch state type; added host-side finish_reasons tensor support; reworked TorchSampler.sample_async and update_requests to propagate finish reasons; added helper methods for finish-reason computation (_write_finish_reasons, _are_end_id, _are_max_length, _are_stop_words); refactored stop-criteria logic with finish_if_reason helper; replaced per-instance self.BEAM with module-level BEAM constant; made _meet_max_token_stop_criteria static with max_seq_len parameter.
Speculative sampler alignment
tensorrt_llm/_torch/speculative/mtp.py
Updated MTPSampler.Store dataclass to include new_tokens and finish_reasons fields; added max_seq_len initialization in __init__; replaced self.BEAM with imported BEAM constant; updated stop-criteria calls to forward the max_seq_len parameter.
Sampler test scaffold
tests/unittest/_torch/sampler/test_torch_sampler.py
Added test constants and RequestCase builder class for finish-reason validation; introduced test_write_finish_reasons to validate finish-reason propagation; added test_are_stop_words_isnt_called_when_no_stop_words to verify stop-word handling logic.
Speculative test integration
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
Updated run_test to initialize and propagate finish_reasons tensor through _process_draft_tokens_tree; added FinishReason import.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant TorchSampler
    participant FinishReasonLogic
    participant Store

    Client->>TorchSampler: sample_async(requests)
    activate TorchSampler
    
    TorchSampler->>TorchSampler: process_draft_tokens()
    TorchSampler->>FinishReasonLogic: _are_end_id(requests, tokens)
    TorchSampler->>FinishReasonLogic: _are_max_length(requests)
    TorchSampler->>FinishReasonLogic: _are_stop_words(requests, tokens)
    
    TorchSampler->>FinishReasonLogic: _write_finish_reasons(requests, finish_reasons, seq_slots, new_tokens)
    activate FinishReasonLogic
    FinishReasonLogic->>Store: Update finish_reasons tensor
    deactivate FinishReasonLogic
    
    TorchSampler->>Store: Create SampleStateTorch with host.finish_reasons
    activate Store
    Store-->>TorchSampler: SampleStateTorch instance
    deactivate Store
    
    TorchSampler-->>Client: SampleStateTorch(host.finish_reasons, ...)
    deactivate TorchSampler
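To make the diagram concrete, here is a minimal, self-contained sketch of a sampler state that carries host-side finish reasons. All names (HostBuffers, SampleState, sample_async_sketch) and shapes are illustrative assumptions, not the actual TorchSampler API:

import torch
from dataclasses import dataclass

@dataclass
class HostBuffers:
    new_tokens: torch.Tensor      # sampled tokens copied off the device
    finish_reasons: torch.Tensor  # one reason code per sequence slot

@dataclass
class SampleState:
    host: HostBuffers

def sample_async_sketch(logits: torch.Tensor, finish_reasons: torch.Tensor) -> SampleState:
    # Greedy argmax stands in for the real multi-step sampling logic.
    new_tokens = logits.argmax(dim=-1)
    # Copy device results into host-side buffers; the real code overlaps this
    # transfer with compute and synchronizes later, in update_requests.
    host = HostBuffers(new_tokens=new_tokens.cpu(), finish_reasons=finish_reasons.cpu())
    return SampleState(host=host)

Downstream, update_requests would synchronize and then read state.host.finish_reasons to decide which requests to mark finished, matching the flow in the diagram.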

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • sampler.py: Dense logic with multiple new helper methods, refactored state handling, and interconnected finish-reason propagation across drafting/decoding paths—requires careful verification of finish-reason computation logic and state transitions.
  • mtp.py: Integration changes to align with new sampler interface; relatively straightforward but depends on understanding sampler.py refactoring.
  • Tests: New test scaffold for finish-reason validation; ensure RequestCase builder correctly mirrors production behavior and test assertions validate all finish-reason paths.

Pre-merge checks

❌ Failed checks (1 warning)
  • Description check — ⚠️ Warning
    Explanation: The PR description is entirely empty, containing only the repository template with no information about changes, objectives, or test coverage.
    Resolution: Provide a clear description explaining what changes were made and why, include the Test Coverage section listing relevant tests, and fill out the PR Checklist items appropriately.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed
    Explanation: The title clearly references a specific bug fix (nvbugs/5508536) and describes the main change, reintroducing/moving stop_criteria to sample_async, which aligns with the substantial refactoring documented in the raw summary.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/unittest/_torch/sampler/test_torch_sampler.py (1)

514-515: Make Optional type hints explicit.

PEP 484 discourages implicit Optional. The parameters end_id and stop_words_list should use explicit Optional[...] or ... | None syntax.

Apply this diff:

-        end_id: int = None,
-        stop_words_list: list[list[int]] = None,
+        end_id: int | None = None,
+        stop_words_list: list[list[int]] | None = None,

As per coding guidelines

tensorrt_llm/_torch/speculative/mtp.py (1)

227-230: Document why MTPSampler doesn't track finish_reasons.

The finish_reasons: None workaround with a custom __post_init__ is clear, but consider adding a docstring explaining why MTPSampler handles finish reasons differently from TorchSampler (e.g., does it delegate to a different mechanism, or compute them elsewhere?).
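
One possible shape for such a docstring, using the field names from the walkthrough; the rationale stated in it is an assumption to be confirmed against the code, not taken from it:

from dataclasses import dataclass
import torch

@dataclass
class Store:
    """Per-batch sampling buffers for MTPSampler.

    finish_reasons is deliberately None: unlike TorchSampler, MTPSampler
    does not precompute a per-step finish-reason tensor and instead applies
    stop criteria via _handle_stop_criteria during update_requests.
    """
    new_tokens: torch.Tensor
    finish_reasons: None = None

    def __post_init__(self):
        # Skip the parent's finish_reasons shape validation.
        assert self.finish_reasons is None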

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1633-1667: Consider strict=True for zip() calls.

The _are_stop_words implementation handles variable-length stop words correctly. However, the zip() calls at lines 1649 and 1656 should use strict=True to catch length mismatches early.

Apply these diffs:

-        for step, (start, length) in enumerate(zip([0] + ends, lens)):
+        for step, (start, length) in enumerate(zip([0] + ends, lens, strict=True)):

-                for word, L in zip(words_device, lens_device):
+                for word, L in zip(words_device, lens_device, strict=True):

Note: The static analysis suggestion to use [0, *ends] instead of [0] + ends is stylistic and optional.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9e5315 and 5cae7fd.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (4 hunks)
  • tests/unittest/_torch/sampler/test_torch_sampler.py (3 hunks)
  • tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
  • tests/unittest/_torch/sampler/test_torch_sampler.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/mtp.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
  • tests/unittest/_torch/sampler/test_torch_sampler.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/mtp.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
  • tests/unittest/_torch/sampler/test_torch_sampler.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/mtp.py
🧠 Learnings (8)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.

Applied to files:

  • tests/unittest/_torch/sampler/test_torch_sampler.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-08-28T10:25:22.370Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:887-891
Timestamp: 2025-08-28T10:25:22.370Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the draft_probs and target_probs tensors have shapes [1, steps] not [steps, vocab_size] as might be expected, making the .squeeze(0) operations appropriate for removing the batch dimension of size 1.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
Repo: NVIDIA/TensorRT-LLM PR: 6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/mtp.py
🧬 Code graph analysis (4)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (2)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (503-598)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
  • finish_reasons_list (578-580)
  • _process_draft_tokens_tree (780-878)
tests/unittest/_torch/sampler/test_torch_sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • convert_wordlist (634-668)
  • LlmRequest (422-631)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (503-598)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
  • Args (605-610)
  • _write_finish_reasons (1519-1567)
  • _are_stop_words (1633-1667)
tensorrt_llm/_torch/pyexecutor/sampler.py (4)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • Store (221-230)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (503-598)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
  • LlmRequest (422-631)
  • finish_by (602-605)
  • append (101-127)
  • append (192-209)
  • get_draft_token_length (776-787)
tensorrt_llm/_torch/pyexecutor/sampling_utils.py (1)
  • sample_rejected (236-248)
tensorrt_llm/_torch/speculative/mtp.py (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)
  • add_token (295-303)
  • int_tensor (306-307)
  • Args (605-610)
  • Store (596-602)
  • _handle_stop_criteria (702-719)
🪛 Ruff (0.14.3)
tests/unittest/_torch/sampler/test_torch_sampler.py

514-514: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


515-515: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


664-664: Unused function argument: args

(ARG001)


664-664: Unused function argument: kwargs

(ARG001)

tensorrt_llm/_torch/pyexecutor/sampler.py

1649-1649: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


1649-1649: Consider [0, *ends] instead of concatenation

Replace with [0, *ends]

(RUF005)


1656-1656: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (17)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1)

13-13: LGTM! Test infrastructure properly updated.

The test correctly initializes finish_reasons state with NOT_FINISHED baseline and propagates it through the call chain, aligning with the new finish-reason tracking introduced in the sampler.

Also applies to: 48-58

tests/unittest/_torch/sampler/test_torch_sampler.py (2)

495-500: LGTM! Test constants clearly defined.

The module-level test constants align with the production constants introduced in the sampler module.


585-682: LGTM! Comprehensive finish-reason test coverage.

The tests exercise various finish-reason scenarios including stop words, end_id, max length, and combinations thereof. The optimization test confirms _are_stop_words is not called when no stop words are configured.

tensorrt_llm/_torch/speculative/mtp.py (3)

16-18: LGTM! Using shared constants.

Importing BEAM and MAX_BEAM_WIDTH from the sampler module maintains consistency across the codebase.


232-248: LGTM! Store initialization updated.

The initialization properly captures max_seq_len and updates store tensor shapes to use the shared MAX_BEAM_WIDTH constant.


269-269: LGTM! Properly uses shared constants and passes max_seq_len.

The update correctly uses the module-level BEAM constant and propagates max_seq_len to _handle_stop_criteria for proper stop-criteria checks.

Also applies to: 274-286

tensorrt_llm/_torch/pyexecutor/sampler.py (11)

21-21: LGTM! New imports properly utilized.

cached_property is used for _pad_steps_mask optimization, and numpy is used in stop-word length calculations.

Also applies to: 25-25


568-586: LGTM! State classes properly structured.

The module-level constants and new state classes cleanly encapsulate the finish-reason tracking infrastructure. The finish_reasons_list() helper appropriately transposes the tensor layout.


595-643: LGTM! Store properly extended with finish_reasons tracking.

The finish_reasons tensor shape validation and pre-allocated reason tensors are well-designed. The assertion that MAX_BEAM_WIDTH == 1 clearly documents the current limitation.


675-719: LGTM! Stop-criteria methods properly refactored.

Making _meet_max_token_stop_criteria static and _handle_stop_criteria a classmethod with explicit max_seq_len parameter improves reusability and testability.


747-878: LGTM! Draft token processing correctly propagates finish reasons.

The finish_if_reason helper centralizes finish-reason checks, and both greedy and tree verification paths properly check after each accepted token with appropriate early exits.


946-1092: LGTM! Rejection sampling and update flow properly adapted.

The comment at lines 946-947 clearly documents why finish_if_reason cannot be used in rejection sampling. The update_requests flow correctly synchronizes and extracts finish_reasons from the host state.


1096-1145: LGTM! sample_async properly computes and propagates finish_reasons.

The method correctly calls _write_finish_reasons to populate the finish_reasons tensor and includes it in the returned SampleStateTorch for downstream processing.


1493-1517: LGTM! Helper methods for stop-word filtering.

The _longest_stop_word_len, _requests_with_stop_words, and _request_indices_with_stop_words methods provide clean filtering and measurement for stop-word processing.


1519-1567: LGTM! Finish-reason writing follows correct precedence.

The _write_finish_reasons method correctly implements the finish-reason precedence (stop words checked first, then max length, then end_id), with later checks overwriting earlier ones. The pre-fill with NOT_FINISHED properly handles sequence slot reuse.
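
A minimal sketch of this overwrite-based precedence, with illustrative reason codes, names, and buffer shapes rather than the actual implementation:

import torch

NOT_FINISHED, STOP_WORDS, LENGTH, END_ID = 0, 1, 2, 3  # illustrative codes

def write_finish_reasons_sketch(finish_reasons: torch.Tensor,
                                seq_slots: torch.Tensor,
                                are_stop_words: torch.Tensor,
                                are_max_length: torch.Tensor,
                                are_end_id: torch.Tensor) -> None:
    # One reason per active request; the NOT_FINISHED pre-fill handles
    # sequence-slot reuse, mirroring the behavior described above.
    reasons = torch.full_like(seq_slots, NOT_FINISHED)
    reasons[are_stop_words] = STOP_WORDS  # checked first ...
    reasons[are_max_length] = LENGTH      # ... later checks overwrite earlier ones
    reasons[are_end_id] = END_ID
    finish_reasons[seq_slots] = reasons   # scatter into the persistent buffer

For example, a request that emits a stop word and its end token in the same step reports END_ID, because that write lands last.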


1569-1597: LGTM! End-ID and max-length predicates correctly implemented.

The _are_end_id and _are_max_length methods properly construct per-request, per-step boolean tensors for finish-reason determination.


1599-1631: LGTM! Token padding infrastructure for stop-word lookback.

The _pad_steps_mask cached property and _padded_old_tokens method efficiently construct padded token sequences for stop-word matching across token boundaries. The assertions validate tensor shapes.

@mikeiovine (Collaborator) left a comment

Looks good assuming all spec dec tests pass

@dcampora merged commit 326a201 into NVIDIA:main Nov 7, 2025
7 of 9 checks passed
eopXD added a commit to eopXD/TensorRT-LLM that referenced this pull request Nov 27, 2025
