
Conversation

@lancelly (Collaborator) commented Nov 5, 2025

Similar to: vllm-project/vllm#871

If we load a model with trust_remote_code=True, HF will download the model code (or load local model code) and cache the dynamic model modules in its cache directory (usually ~/.cache/huggingface/modules/). When a server starts, it serializes some configs and sends them to the worker subprocesses. The model config contains dynamic modules imported from the HF cache directory, which the init_hf_modules function adds to sys.path in the parent process only. So workers have to call init_hf_modules once they start, otherwise deserialization fails with a ModuleNotFoundError.
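The failure mode and the fix can be sketched with the transformers helper itself; the worker entrypoint below is purely illustrative, not the actual TensorRT-LLM code:

    # Illustrative sketch: a spawned worker deserializing a config whose class
    # lives in the dynamically generated `transformers_modules` package.
    import pickle

    from transformers.dynamic_module_utils import init_hf_modules

    def worker_entrypoint(serialized_config: bytes):  # hypothetical entrypoint
        # init_hf_modules() is idempotent: it creates the HF modules cache dir
        # (usually under ~/.cache/huggingface/modules/) and appends it to
        # sys.path, so imports of `transformers_modules.*` can resolve.
        init_hf_modules()
        # Without the call above, unpickling a remote-code model config raises:
        # ModuleNotFoundError: No module named 'transformers_modules'
        return pickle.loads(serialized_config)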

After adding this initialization, locally running trtllm-llmapi-launch trtllm-eval --trust_remote_code with Kimi-K2 (as described in https://nvbugs/5556998) no longer raises:
ModuleNotFoundError: No module named 'transformers_modules'

Interestingly, IT coverage already seems to include this trtllm-llmapi-launch trtllm-eval --trust_remote_code usage, yet no error occurred there:

if "Kimi" in model_path:

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced initialization handling for models requiring remote code execution to ensure proper module setup during process startup and worker spawning.

@lancelly lancelly requested a review from a team as a code owner November 5, 2025 03:04
@lancelly lancelly requested a review from hchings November 5, 2025 03:04
@coderabbitai (bot, Contributor) commented Nov 5, 2025

📝 Walkthrough

Adds an idempotent helper function to initialize cached HuggingFace modules for models with trust_remote_code=True. The function is invoked at module import time to initialize HF modules in the main process and spawned subprocesses, and again inside worker_main when appropriate to ensure initialization in worker contexts.

Changes

Cohort / File(s) | Change Summary
HuggingFace Module Initialization — tensorrt_llm/executor/worker.py | Added idempotent helper function to initialize cached HuggingFace modules with error handling and logging. Invoked at module import time and conditionally within worker_main when trust_remote_code=True.
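A hedged reconstruction of that helper, pieced together from the snippets quoted later in this review (the merged code may differ in detail):

    # Sketch of the additions to tensorrt_llm/executor/worker.py as described
    # in this review; the except branches mirror the diff quoted below.
    from tensorrt_llm.logger import logger

    def _init_hf_modules() -> None:
        """Idempotently put the HF dynamic-modules cache dir on sys.path."""
        try:
            from transformers.dynamic_module_utils import init_hf_modules
            init_hf_modules()  # no-op if the cache dir is already on sys.path
            logger.debug("Initialized HF modules for trust_remote_code models")
        except ImportError as e:
            logger.warning(f"ImportError initializing HF modules: {e}")
        except Exception as e:
            logger.error(f"Exception initializing HF modules: {e}")

    # Run at import time so both the main process and any spawned subprocess
    # that imports this module get the cache directory on sys.path.
    _init_hf_modules()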

Sequence Diagram

sequenceDiagram
    participant MainProc as Main Process
    participant SubProc as Spawned Subprocess
    participant WorkerMain as worker_main()
    participant HFInit as HF Init Helper

    Note over MainProc,HFInit: Module Import Phase
    MainProc->>HFInit: invoke init helper (top-level)
    HFInit-->>MainProc: HF modules cached

    Note over SubProc,HFInit: Subprocess Creation
    SubProc->>HFInit: invoke init helper (on import)
    HFInit-->>SubProc: HF modules cached in subprocess

    Note over WorkerMain,HFInit: Worker Execution (if trust_remote_code=True)
    WorkerMain->>HFInit: invoke init helper (before processing)
    HFInit-->>WorkerMain: HF modules ready
    WorkerMain->>WorkerMain: proceed with work

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Single focused file with clear scope
  • Initialization logic affecting multiple execution contexts (main, subprocess, worker) requires careful timing review
  • Need to verify idempotency guarantees and error handling completeness
  • Consider subprocess inheritance behavior and logging visibility across process boundaries

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The title follows the required format with NVBugs ticket, type designation, and clearly describes the fix for HuggingFace module initialization in worker processes for models with trust_remote_code=True.
  • Description check — ✅ Passed: The PR description explains the issue, references similar work (vllm PR #871), describes the solution, and provides verification that the fix resolves the ModuleNotFoundError. However, it lacks explicit sections for Test Coverage and PR Checklist completion as specified in the template.

@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/executor/worker.py (1)

40-57: Log failures with stack traces

Right now we lose the stack when _init_hf_modules() blows up, which makes remote-module path issues hard to diagnose. Please log with exc_info or logger.exception so we keep the context.

@@
-    except ImportError as e:
-        logger.warning(f"ImportError initializing HF modules: {e}")
-    except Exception as e:
-        logger.error(f"Exception initializing HF modules: {e}")
+    except ImportError:
+        logger.warning("ImportError initializing HF modules", exc_info=True)
+    except Exception:
+        logger.exception("Exception initializing HF modules")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eeb56c2 and b86753d.

📒 Files selected for processing (1)
  • tensorrt_llm/executor/worker.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/executor/worker.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/executor/worker.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/executor/worker.py
🧠 Learnings (1)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/executor/worker.py
🧬 Code graph analysis (1)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/logger.py (3)
  • debug (144-145)
  • warning (132-133)
  • error (126-127)
🪛 Ruff (0.14.3)
tensorrt_llm/executor/worker.py

55-55: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/executor/worker.py (1)

269-271: Good call re-initializing in worker processes

Re-running _init_hf_modules() when trust_remote_code models are loaded keeps worker subprocesses in sync. Looks solid.
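A hedged sketch of that worker-side call; worker_main's real signature and the trust_remote_code lookup are simplified assumptions:

    def worker_main(*args, llm_args=None, **kwargs):  # signature simplified
        # Spawned workers do not inherit the parent's sys.path mutation, so
        # re-run the idempotent init before deserializing anything that may
        # import from `transformers_modules`.
        if llm_args is not None and getattr(llm_args, "trust_remote_code", False):
            _init_hf_modules()
        ...  # continue with normal engine setup and the request loop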

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly (Collaborator, Author) commented Nov 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23591 [ run ] triggered by Bot. Commit: dbbcfb8

@tensorrt-cicd (Collaborator)

PR_Github #23591 [ run ] completed with state SUCCESS. Commit: dbbcfb8
/LLM/main/L0_MergeRequest_PR pipeline #17752 completed with status: 'FAILURE'

@lancelly (Collaborator, Author) commented Nov 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23636 [ run ] triggered by Bot. Commit: dbbcfb8

@tensorrt-cicd (Collaborator)

PR_Github #23636 [ run ] completed with state FAILURE. Commit: dbbcfb8
/LLM/main/L0_MergeRequest_PR pipeline #17783 completed with status: 'FAILURE'

@lancelly (Collaborator, Author) commented Nov 6, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23690 [ run ] triggered by Bot. Commit: cc728dc

@tensorrt-cicd (Collaborator)

PR_Github #23690 [ run ] completed with state SUCCESS. Commit: cc728dc
/LLM/main/L0_MergeRequest_PR pipeline #17824 completed with status: 'FAILURE'

@lancelly (Collaborator, Author) commented Nov 7, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23789 [ run ] triggered by Bot. Commit: cc728dc

@tensorrt-cicd (Collaborator)

PR_Github #23789 [ run ] completed with state SUCCESS. Commit: cc728dc
/LLM/main/L0_MergeRequest_PR pipeline #17908 completed with status: 'FAILURE'

@lancelly (Collaborator, Author) commented Nov 9, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23919 [ run ] triggered by Bot. Commit: cc728dc

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23967 [ run ] triggered by Bot. Commit: cc728dc

@lancelly (Collaborator, Author)

/bot run --abort

@tensorrt-cicd (Collaborator)

PR_Github #23968 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --abort

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

1 similar comment
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@lancelly (Collaborator, Author)

/bot kill

@tensorrt-cicd (Collaborator)

PR_Github #23975 [ kill ] triggered by Bot. Commit: cc728dc

@tensorrt-cicd (Collaborator)

PR_Github #23967 [ run ] completed with state ABORTED. Commit: cc728dc
LLM/main/L0_MergeRequest_PR #18049 (Blue Ocean) completed with status: ABORTED

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23975 [ kill ] completed with state SUCCESS. Commit: cc728dc
Successfully killed previous jobs for commit cc728dc

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

1 similar comment
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23988 [ run ] triggered by Bot. Commit: cc728dc

@tensorrt-cicd (Collaborator)

PR_Github #23988 [ run ] completed with state SUCCESS. Commit: cc728dc
/LLM/main/L0_MergeRequest_PR pipeline #18065 completed with status: 'SUCCESS'

@Superjomn (Collaborator) left a comment


LGTM

@Superjomn Superjomn merged commit 1fd1145 into NVIDIA:main Nov 11, 2025
5 checks passed
suyoggupta pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Nov 12, 2025
lancelly added a commit to lancelly/TensorRT-LLM that referenced this pull request Dec 10, 2025
lancelly added a commit to lancelly/TensorRT-LLM that referenced this pull request Dec 10, 2025