[https://nvbugs/5556998][fix] init_hf_modules in worker_main for models with trust_remote=true #8931
Conversation
Signed-off-by: Lanyu Liao <[email protected]>
📝 Walkthrough

Adds an idempotent helper function to initialize cached HuggingFace modules for models with `trust_remote_code` enabled.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant MainProc as Main Process
    participant SubProc as Spawned Subprocess
    participant WorkerMain as worker_main()
    participant HFInit as HF Init Helper
    Note over MainProc,HFInit: Module Import Phase
    MainProc->>HFInit: invoke init helper (top-level)
    HFInit-->>MainProc: HF modules cached
    Note over SubProc,HFInit: Subprocess Creation
    SubProc->>HFInit: invoke init helper (on import)
    HFInit-->>SubProc: HF modules cached in subprocess
    Note over WorkerMain,HFInit: Worker Execution (if trust_remote_code=True)
    WorkerMain->>HFInit: invoke init helper (before processing)
    HFInit-->>WorkerMain: HF modules ready
    WorkerMain->>WorkerMain: proceed with work
```
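A hypothetical sketch of such a helper (the name `_init_hf_modules` comes from the review comments below; the body is an assumption built on transformers' `init_hf_modules`, not the PR's verbatim code):

```python
# Hypothetical sketch of the idempotent helper described in the walkthrough.
# The stdlib logger stands in for tensorrt_llm's logger used by the PR.
import logging

logger = logging.getLogger(__name__)


def _init_hf_modules() -> None:
    """Register the HuggingFace dynamic-modules cache in this process.

    transformers' init_hf_modules() creates the cache directory and adds it
    to sys.path; it returns early if already done, so calling this both at
    import time and again inside worker_main() is safe.
    """
    try:
        from transformers.dynamic_module_utils import init_hf_modules
        init_hf_modules()
    except ImportError:
        logger.warning("ImportError initializing HF modules", exc_info=True)
    except Exception:
        logger.exception("Exception initializing HF modules")
```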
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches: ✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)

tensorrt_llm/executor/worker.py (1)

40-57: Log failures with stack traces

Right now we lose the stack when `_init_hf_modules()` blows up, which makes remote-module path issues hard to diagnose. Please log with `exc_info` or `logger.exception` so we keep the context.

```diff
@@
-    except ImportError as e:
-        logger.warning(f"ImportError initializing HF modules: {e}")
-    except Exception as e:
-        logger.error(f"Exception initializing HF modules: {e}")
+    except ImportError:
+        logger.warning("ImportError initializing HF modules", exc_info=True)
+    except Exception:
+        logger.exception("Exception initializing HF modules")
```
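For illustration, a minimal stand-alone demo of the suggested pattern using Python's standard library logger (the PR itself uses tensorrt_llm's logger; its `exc_info` support is assumed from the suggestion above):

```python
# Stand-alone demo of the exc_info / logger.exception pattern suggested
# in the review; uses the stdlib logger, not tensorrt_llm's.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("worker")

try:
    import transformers_modules  # noqa: F401 -- absent until HF modules are initialized
except ImportError:
    # exc_info=True attaches the full traceback to the log record, so the
    # failing import path stays diagnosable instead of just the message text.
    log.warning("ImportError initializing HF modules", exc_info=True)
```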
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/executor/worker.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/executor/worker.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/executor/worker.py
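As a hypothetical illustration of the duck-typing guideline above (minimal try body, main logic in else), with invented names:

```python
# Invented example of the try/except duck-typing guideline: probe the
# attribute in a minimal try body, keep the real work in the else branch.
def vocab_size(tokenizer):
    try:
        get_vocab = tokenizer.get_vocab  # probe only; may raise AttributeError
    except AttributeError:
        return None
    else:
        return len(get_vocab())
```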
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/executor/worker.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/executor/worker.py
🧬 Code graph analysis (1)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/logger.py (3)
debug (144-145), warning (132-133), error (126-127)
🪛 Ruff (0.14.3)
tensorrt_llm/executor/worker.py
55-55: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/executor/worker.py (1)
269-271: Good call re-initializing in worker processes

Re-running `_init_hf_modules()` when `trust_remote_code` models are loaded keeps worker subprocesses in sync. Looks solid.
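For context, a hypothetical sketch of that call site (the flag and object names are assumptions based on the review comments, not the PR's verbatim code):

```python
# Hypothetical re-initialization inside the spawned worker process; the
# llm_args object and flag name are assumptions, not the PR's actual code.
if getattr(llm_args, "trust_remote_code", False):
    # A spawned subprocess starts with a fresh sys.path, so the HF modules
    # cache must be registered again before remote model code is imported.
    _init_hf_modules()
```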
Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast

PR_Github #23591 [ run ] triggered by Bot. Commit:

PR_Github #23591 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #23636 [ run ] triggered by Bot. Commit:

PR_Github #23636 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #23690 [ run ] triggered by Bot. Commit:

PR_Github #23690 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #23789 [ run ] triggered by Bot. Commit:

PR_Github #23789 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #23919 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #23967 [ run ] triggered by Bot. Commit:

/bot run --abort

PR_Github #23968 Bot args parsing error: usage: /bot [-h]

/bot run --disable-fail-fast

/bot run --disable-fail-fast

/bot kill

PR_Github #23975 [ kill ] triggered by Bot. Commit:

PR_Github #23967 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #23975 [ kill ] completed with state

/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #23988 [ run ] triggered by Bot. Commit:

PR_Github #23988 [ run ] completed with state
Superjomn left a comment
LGTM
…ls with trust_remote=true (NVIDIA#8931) Signed-off-by: Lanyu Liao <[email protected]> Co-authored-by: Lanyu Liao <[email protected]>
…ls with trust_remote=true (NVIDIA#8931) Signed-off-by: Lanyu Liao <[email protected]> Co-authored-by: Lanyu Liao <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Similar to: vllm-project/vllm#871
After adding this initialization, locally running `trtllm-llmapi-launch trtllm-eval --trust_remote_code` with Kimi-K2 (as described in https://nvbugs/5556998) no longer raises:

`ModuleNotFoundError: No module named 'transformers_modules'`

Interestingly, IT coverage seems to have already included `trtllm-llmapi-launch trtllm-eval --trust_remote_code` usage, yet no error occurred:

TensorRT-LLM/tests/integration/defs/test_e2e.py, line 3407 in eeb56c2
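As a quick sanity check of the mechanism the fix relies on, a minimal sketch (assuming transformers' public `init_hf_modules` helper):

```python
# init_hf_modules() registers the HF dynamic-modules cache on sys.path and
# returns early on repeated calls, so re-running it in every worker is safe.
import sys

from transformers.dynamic_module_utils import init_hf_modules

before = list(sys.path)
init_hf_modules()
init_hf_modules()  # no-op: the cache path is already on sys.path
added = [p for p in sys.path if p not in before]
print("paths added:", added)  # expected: one HF modules cache entry
```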
Summary by CodeRabbit