[https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) #8794
Conversation
42829ec to 948e537
/bot run --disable-fail-fast

PR_Github #23211 [ run ] triggered by Bot. Commit:

PR_Github #23211 [ run ] completed with state
948e537 to f8cf485
/bot run --disable-fail-fast

PR_Github #23419 [ run ] triggered by Bot. Commit:

PR_Github #23419 [ run ] completed with state
ff3690a to f665f52
/bot run

PR_Github #23505 [ run ] triggered by Bot. Commit:

PR_Github #23505 [ run ] completed with state
f665f52 to 38056f7
/bot run

PR_Github #23518 [ run ] triggered by Bot. Commit:

PR_Github #23518 [ run ] completed with state
Signed-off-by: Netanel Haber <[email protected]>
…t stop words processing to use the correct offsets.
- context requests may have new_tokens padded with 0; step_seq >= 0 has no effect in this case. Use the draft token length instead to get the correct positions.
- words is correctly indexed with the first L entries instead of the last L entries.

Signed-off-by: Stefan Niebler <[email protected]>
…ndle padded tokens and simplified _write_finish_reasons.
- simplified _write_finish_reasons in Torch Sampler
- fixed a bug where padded sequences in _are_stop_words were not correctly processed
- merged both test_torch_sampler.py files into one file
- updated MTPSampler to work with changes to TorchSampler

Signed-off-by: Stefan Niebler <[email protected]>
…uests
- reverted introduction of _prepare_new_requests
- removed unused buffer _finish_reasons_nonzero_static_buffer
- updated test_draft_token_tree_verification.py to correctly call _process_draft_tokens_tree

Signed-off-by: Stefan Niebler <[email protected]>
38056f7 to 5cae7fd
/bot run

PR_Github #23619 [ run ] triggered by Bot. Commit:

PR_Github #23619 [ run ] completed with state
📝 Walkthrough

Introduced a per-batch finish-reason tracking mechanism for generation in TorchSampler. Added SampleStateTorch with host-side finish_reasons support, refactored the sampler to propagate finish reasons across drafting and decoding steps, and added helper methods for stop-criteria logic. Updated MTPSampler and tests accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant TorchSampler
    participant FinishReasonLogic
    participant Store
    Client->>TorchSampler: sample_async(requests)
    activate TorchSampler
    TorchSampler->>TorchSampler: process_draft_tokens()
    TorchSampler->>FinishReasonLogic: _are_end_id(requests, tokens)
    TorchSampler->>FinishReasonLogic: _are_max_length(requests)
    TorchSampler->>FinishReasonLogic: _are_stop_words(requests, tokens)
    TorchSampler->>FinishReasonLogic: _write_finish_reasons(requests, finish_reasons, seq_slots, new_tokens)
    activate FinishReasonLogic
    FinishReasonLogic->>Store: Update finish_reasons tensor
    deactivate FinishReasonLogic
    TorchSampler->>Store: Create SampleStateTorch with host.finish_reasons
    activate Store
    Store-->>TorchSampler: SampleStateTorch instance
    deactivate Store
    TorchSampler-->>Client: SampleStateTorch(host.finish_reasons, ...)
    deactivate TorchSampler
```
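The diagram compresses the core idea; below is a condensed, hypothetical Python sketch of the same flow. Names mirror the walkthrough (`sample_async`, `SampleStateTorch`, `host.finish_reasons`), but the signatures, shapes, and reason codes are assumptions rather than the actual TensorRT-LLM API, and only the end_id criterion is shown.

```python
from dataclasses import dataclass

import torch

NOT_FINISHED, END_ID = 0, 1  # assumed integer reason codes


@dataclass
class SampleStateTensorsHost:
    new_tokens: torch.Tensor      # assumed layout: [steps, seq_slots, beams]
    finish_reasons: torch.Tensor  # same layout, host-side


@dataclass
class SampleStateTorch:
    host: SampleStateTensorsHost


def sample_async(new_tokens: torch.Tensor, end_id: int) -> SampleStateTorch:
    # Derive a per-(step, slot, beam) finish reason. The real sampler also
    # folds in stop-word and max-length checks via _write_finish_reasons.
    reasons = torch.full_like(new_tokens, NOT_FINISHED, dtype=torch.int32)
    reasons[new_tokens == end_id] = END_ID
    return SampleStateTorch(host=SampleStateTensorsHost(
        new_tokens=new_tokens.cpu(), finish_reasons=reasons.cpu()))
```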
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/unittest/_torch/sampler/test_torch_sampler.py (1)
514-515: Make Optional type hints explicit. PEP 484 discourages implicit Optional. The parameters `end_id` and `stop_words_list` should use explicit `Optional[...]` or `... | None` syntax. Apply this diff:

```diff
-    end_id: int = None,
-    stop_words_list: list[list[int]] = None,
+    end_id: int | None = None,
+    stop_words_list: list[list[int]] | None = None,
```

As per coding guidelines.
tensorrt_llm/_torch/speculative/mtp.py (1)
227-230: Document why MTPSampler doesn't track finish_reasons. The `finish_reasons: None` workaround with a custom `__post_init__` is clear, but consider adding a docstring explaining why MTPSampler handles finish reasons differently from TorchSampler (e.g., does it delegate to a different mechanism, or compute them elsewhere?).
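For reference, a minimal sketch of the pattern the comment describes, assuming a dataclass-style state; the names are illustrative, not the actual TensorRT-LLM definitions:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MTPHostState:
    new_tokens: torch.Tensor
    # Pinned to None: this sampler does not track finish reasons in its state.
    finish_reasons: Optional[torch.Tensor] = None

    def __post_init__(self):
        # Fail fast if a caller tries to route finish_reasons through here;
        # the suggested docstring would explain where they are computed instead.
        assert self.finish_reasons is None
```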
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1633-1667: Consider `strict=True` for zip() calls. The `_are_stop_words` implementation handles variable-length stop words correctly. However, the `zip()` calls at lines 1649 and 1656 should use `strict=True` to catch length mismatches early. Apply these diffs:

```diff
-    for step, (start, length) in enumerate(zip([0] + ends, lens)):
+    for step, (start, length) in enumerate(zip([0] + ends, lens, strict=True)):
```

(or: `enumerate(zip([0, *ends], lens, strict=True))`)

```diff
-    for word, L in zip(words_device, lens_device):
+    for word, L in zip(words_device, lens_device, strict=True):
```

Note: The static analysis suggestion to use `[0, *ends]` instead of `[0] + ends` is stylistic and optional.
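For illustration, a standalone snippet showing the failure mode `strict=True` (Python 3.10+) guards against:

```python
words = [[1, 2], [3]]
lens = [2, 1, 5]  # one stale extra entry

try:
    for word, length in zip(words, lens, strict=True):
        print(word, length)
except ValueError as exc:
    # Without strict=True, the extra length would be silently ignored.
    print(f"caught length mismatch early: {exc}")
```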
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
tensorrt_llm/_torch/speculative/mtp.py (4 hunks)
tests/unittest/_torch/sampler/test_torch_sampler.py (3 hunks)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
tests/unittest/_torch/sampler/test_torch_sampler.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/speculative/mtp.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
tests/unittest/_torch/sampler/test_torch_sampler.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/speculative/mtp.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py
tests/unittest/_torch/sampler/test_torch_sampler.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/speculative/mtp.py
🧠 Learnings (8)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.
Applied to files:
tests/unittest/_torch/sampler/test_torch_sampler.py
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/sampler/test_torch_sampler.py
📚 Learning: 2025-08-28T10:25:22.370Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:887-891
Timestamp: 2025-08-28T10:25:22.370Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the draft_probs and target_probs tensors have shapes [1, steps] not [steps, vocab_size] as might be expected, making the .squeeze(0) operations appropriate for removing the batch dimension of size 1.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
Repo: NVIDIA/TensorRT-LLM PR: 6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/speculative/mtp.py
🧬 Code graph analysis (4)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (2)
cpp/include/tensorrt_llm/executor/types.h (1)
FinishReason (503-598)

tensorrt_llm/_torch/pyexecutor/sampler.py (2)

finish_reasons_list (578-580)
_process_draft_tokens_tree (780-878)
tests/unittest/_torch/sampler/test_torch_sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
convert_wordlist (634-668)
LlmRequest (422-631)

cpp/include/tensorrt_llm/executor/types.h (1)

FinishReason (503-598)

tensorrt_llm/_torch/pyexecutor/sampler.py (3)

Args (605-610)
_write_finish_reasons (1519-1567)
_are_stop_words (1633-1667)
tensorrt_llm/_torch/pyexecutor/sampler.py (4)
tensorrt_llm/_torch/speculative/mtp.py (1)
Store (221-230)

cpp/include/tensorrt_llm/executor/types.h (1)

FinishReason (503-598)

tensorrt_llm/_torch/pyexecutor/llm_request.py (5)

LlmRequest (422-631)
finish_by (602-605)
append (101-127)
append (192-209)
get_draft_token_length (776-787)

tensorrt_llm/_torch/pyexecutor/sampling_utils.py (1)

sample_rejected (236-248)
tensorrt_llm/_torch/speculative/mtp.py (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)
add_token (295-303)
int_tensor (306-307)
Args (605-610)
Store (596-602)
_handle_stop_criteria (702-719)
🪛 Ruff (0.14.3)
tests/unittest/_torch/sampler/test_torch_sampler.py
514-514: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
515-515: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
664-664: Unused function argument: args
(ARG001)
664-664: Unused function argument: kwargs
(ARG001)
tensorrt_llm/_torch/pyexecutor/sampler.py
1649-1649: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
1649-1649: Consider [0, *ends] instead of concatenation
Replace with [0, *ends]
(RUF005)
1656-1656: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (17)
tests/unittest/_torch/speculative/test_draft_token_tree_verification.py (1)
13-13: LGTM! Test infrastructure properly updated. The test correctly initializes finish_reasons state with a `NOT_FINISHED` baseline and propagates it through the call chain, aligning with the new finish-reason tracking introduced in the sampler.

Also applies to: 48-58
tests/unittest/_torch/sampler/test_torch_sampler.py (2)
495-500: LGTM! Test constants clearly defined. The module-level test constants align with the production constants introduced in the sampler module.
585-682: LGTM! Comprehensive finish-reason test coverage. The tests exercise various finish-reason scenarios including stop words, end_id, max length, and combinations thereof. The optimization test confirms `_are_stop_words` is not called when no stop words are configured.

tensorrt_llm/_torch/speculative/mtp.py (3)
16-18: LGTM! Using shared constants. Importing `BEAM` and `MAX_BEAM_WIDTH` from the sampler module maintains consistency across the codebase.
232-248: LGTM! Store initialization updated. The initialization properly captures `max_seq_len` and updates store tensor shapes to use the shared `MAX_BEAM_WIDTH` constant.
269-269: LGTM! Properly uses shared constants and passes max_seq_len. The update correctly uses the module-level `BEAM` constant and propagates `max_seq_len` to `_handle_stop_criteria` for proper stop-criteria checks.

Also applies to: 274-286
tensorrt_llm/_torch/pyexecutor/sampler.py (11)
21-21: LGTM! New imports properly utilized. `cached_property` is used for the `_pad_steps_mask` optimization, and `numpy` is used in stop-word length calculations.

Also applies to: 25-25
568-586: LGTM! State classes properly structured. The module-level constants and new state classes cleanly encapsulate the finish-reason tracking infrastructure. The `finish_reasons_list()` helper appropriately transposes the tensor layout.
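As a rough illustration of such a transposing helper (assumed [steps, seq_slots] layout; not the actual implementation):

```python
import torch


def finish_reasons_list(finish_reasons: torch.Tensor) -> list:
    # Flip from per-step rows to one list of step-wise reasons per slot.
    return finish_reasons.t().tolist()


reasons = torch.tensor([[0, 0], [1, 0]])  # [steps=2, seq_slots=2]
print(finish_reasons_list(reasons))       # [[0, 1], [0, 0]]
```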
595-643: LGTM! Store properly extended with finish_reasons tracking. The finish_reasons tensor shape validation and pre-allocated reason tensors are well designed. The assertion that `MAX_BEAM_WIDTH == 1` clearly documents the current limitation.
675-719: LGTM! Stop-criteria methods properly refactored. Making `_meet_max_token_stop_criteria` static and `_handle_stop_criteria` a classmethod with an explicit `max_seq_len` parameter improves reusability and testability.
747-878: LGTM! Draft token processing correctly propagates finish reasons. The `finish_if_reason` helper centralizes finish-reason checks, and both greedy and tree verification paths properly check after each accepted token with appropriate early exits.
946-1092: LGTM! Rejection sampling and update flow properly adapted. The comment at lines 946-947 clearly documents why `finish_if_reason` cannot be used in rejection sampling. The update_requests flow correctly synchronizes and extracts finish_reasons from the host state.
1096-1145: LGTM! sample_async properly computes and propagates finish_reasons. The method correctly calls `_write_finish_reasons` to populate the finish_reasons tensor and includes it in the returned `SampleStateTorch` for downstream processing.
1493-1517: LGTM! Helper methods for stop-word filtering. The `_longest_stop_word_len`, `_requests_with_stop_words`, and `_request_indices_with_stop_words` methods provide clean filtering and measurement for stop-word processing.
1519-1567: LGTM! Finish-reason writing follows correct precedence. The `_write_finish_reasons` method correctly implements the finish-reason precedence (stop words checked first, then max length, then end_id), with later checks overwriting earlier ones. The pre-fill with `NOT_FINISHED` properly handles sequence slot reuse.
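A self-contained sketch of that precedence, under assumed reason codes and boolean mask inputs (not the actual `_write_finish_reasons` signature):

```python
import torch

# Assumed integer codes; the real values come from the executor's FinishReason.
NOT_FINISHED, END_ID, STOP_WORDS, LENGTH = 0, 1, 2, 3


def write_finish_reasons(are_stop: torch.Tensor, are_max_len: torch.Tensor,
                         are_end: torch.Tensor) -> torch.Tensor:
    # Pre-fill with NOT_FINISHED so reused sequence slots start clean.
    reasons = torch.full(are_stop.shape, NOT_FINISHED, dtype=torch.int32)
    reasons[are_stop] = STOP_WORDS   # checked first
    reasons[are_max_len] = LENGTH    # overwrites stop words where both hold
    reasons[are_end] = END_ID        # checked last, so it wins ties
    return reasons


masks = [torch.tensor([True, False]), torch.tensor([True, True]),
         torch.tensor([False, True])]
print(write_finish_reasons(*masks))  # tensor([3, 1], dtype=torch.int32)
```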
1569-1597: LGTM! End-ID and max-length predicates correctly implemented. The `_are_end_id` and `_are_max_length` methods properly construct per-request, per-step boolean tensors for finish-reason determination.
1599-1631: LGTM! Token padding infrastructure for stop-word lookback. The `_pad_steps_mask` cached property and `_padded_old_tokens` method efficiently construct padded token sequences for stop-word matching across token boundaries. The assertions validate tensor shapes.
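Illustratively, matching a stop word that straddles the boundary between earlier output and this step's tokens needs a lookback of `len(stop_word) - 1` old tokens; the helper below is hypothetical, not the actual `_padded_old_tokens`:

```python
import torch


def padded_window(old_tokens, new_tokens, max_stop_len):
    # Keep just enough history for a stop word to straddle the boundary.
    lookback = old_tokens[-(max_stop_len - 1):] if max_stop_len > 1 else []
    return torch.tensor(lookback + new_tokens)


window = padded_window(old_tokens=[5, 7], new_tokens=[9, 2], max_stop_len=3)
stop = torch.tensor([7, 9])  # spans the old/new boundary
hits = (window.unfold(0, len(stop), 1) == stop).all(dim=1)
print(bool(hits.any()))  # True: the straddling stop word is detected
```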
mikeiovine
left a comment
Looks good assuming all spec dec tests pass
…oduce: Move stop_criteria to sample_async (NVIDIA#7041) (NVIDIA#8794)" This reverts commit 326a201.
Summary by CodeRabbit
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail-fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
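For example, an illustrative invocation combining the flags above: `/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"` launches the pipelines without fail-fast and limits test stages to the listed GPU types.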
kill
`kill`

Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`

Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.