[None][fix] Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set #12985

Merged
karljang merged 5 commits into NVIDIA:main from YPxHolic:fix/numa-affinity-cuda-visible-devices
May 29, 2026

Conversation

@YPxHolic
Contributor

@YPxHolic YPxHolic commented Apr 13, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced GPU device identification to properly handle CUDA_VISIBLE_DEVICES environment variable remapping, ensuring correct logical-to-physical device mapping for CPU affinity assignment in multi-GPU environments.

Description

Problem

get_numa_aware_cpu_affinity() in tensorrt_llm/llmapi/utils.py passes the logical CUDA device index directly to pynvml.nvmlDeviceGetHandleByIndex(). However, NVML always enumerates all physical GPUs on the system regardless of CUDA_VISIBLE_DEVICES.

When a user restricts GPU visibility (e.g. CUDA_VISIBLE_DEVICES=3,4), logical CUDA device 0 actually corresponds to physical GPU 3. The old code would query physical GPU 0's NUMA topology instead, causing worker threads to be bound to incorrect CPU cores on the wrong NUMA node. This leads to degraded performance in multi-GPU deployments that use GPU subsetting (e.g. disaggregated prefill/decode with TP>1).
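To make the mismatch concrete, here is a minimal, hardware-free sketch. The topology table `PHYSICAL_GPU_NUMA_NODE` and both helper functions are hypothetical stand-ins for what NVML would report on a real host; none of them are part of TensorRT-LLM or pynvml, and the NUMA node numbers are made up.

```python
# Hypothetical host: physical GPU index -> NUMA node (made-up values;
# on a real machine this information would come from NVML).
PHYSICAL_GPU_NUMA_NODE = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}


def numa_node_old(device_id: int) -> int:
    """Old behaviour: treat the logical CUDA index as the physical index."""
    return PHYSICAL_GPU_NUMA_NODE[device_id]


def numa_node_expected(device_id: int, cuda_visible_devices: str) -> int:
    """Expected behaviour: translate logical -> physical before the lookup."""
    physical_ids = [int(tok) for tok in cuda_visible_devices.split(",")]
    return PHYSICAL_GPU_NUMA_NODE[physical_ids[device_id]]


# Columns: logical index, node found by the old lookup, node that should be used.
for logical in (0, 1):
    print(logical, numa_node_old(logical), numa_node_expected(logical, "3,4"))
```

With these made-up values the loop prints `0 0 1` and `1 0 2`: both logical devices would be pinned to NUMA node 0 by the old lookup, although the GPUs they actually use sit on nodes 1 and 2.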

Solution

Added a mapping from logical CUDA device index to physical NVML device index by parsing CUDA_VISIBLE_DEVICES before the NVML call:

  • When CUDA_VISIBLE_DEVICES is set, parse it into a list of physical GPU IDs and use physical_ids[device_id] as the NVML index.
  • When CUDA_VISIBLE_DEVICES is not set, behavior is unchanged (nvml_device_id = device_id).
  • Added a warning log with graceful fallback (nvml_device_id = device_id) when device_id is out of range for the CUDA_VISIBLE_DEVICES list.
  • Updated the docstring to clarify that device_id is the logical CUDA index.

Example

CUDA_VISIBLE_DEVICES=3,4
Logical device 0 → Physical GPU 3 → correct NUMA node queried
Logical device 1 → Physical GPU 4 → correct NUMA node queried
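The mapping in the example above can be sketched as a small standalone function. This is an illustration only, not the code in tensorrt_llm/llmapi/utils.py: the function name `resolve_nvml_index` and its `cuda_visible` parameter are hypothetical, introduced so the logic can be exercised without touching the process environment. The structure follows the fragments visible in the formatting diff later in this thread (strip and split the variable, bounds-check `device_id`, accept only numeric tokens, otherwise warn and fall back).

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_nvml_index(device_id: int, cuda_visible: Optional[str] = None) -> int:
    """Map a logical CUDA device index to a physical NVML index (sketch).

    `cuda_visible` is a hypothetical override for testing; when it is None the
    value is read from the CUDA_VISIBLE_DEVICES environment variable.
    """
    if cuda_visible is None:
        cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")

    # Unset or blank: keep the previous behaviour (logical == physical).
    if cuda_visible is None or not cuda_visible.strip():
        return device_id

    visible_tokens = [x.strip() for x in cuda_visible.split(",") if x.strip()]

    if 0 <= device_id < len(visible_tokens):
        token = visible_tokens[device_id]
        if token.isdigit():
            return int(token)
        # UUID-style or other non-numeric tokens are not translated here.
        logger.warning(
            "CUDA_VISIBLE_DEVICES token %r is non-numeric; "
            "falling back to device_id (%d) as NVML index.", token, device_id)
        return device_id

    logger.warning(
        "device_id (%d) is out of range for CUDA_VISIBLE_DEVICES=%r; "
        "falling back to device_id as NVML index.", device_id, cuda_visible)
    return device_id


# Reproduces the example: CUDA_VISIBLE_DEVICES=3,4
print(resolve_nvml_index(0, "3,4"))  # logical 0 -> physical 3
print(resolve_nvml_index(1, "3,4"))  # logical 1 -> physical 4
```

The returned value is what would then be handed to pynvml.nvmlDeviceGetHandleByIndex(). Falling back to `device_id` keeps the pre-fix behaviour in the cases the sketch cannot translate, rather than raising.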

@YPxHolic YPxHolic requested a review from a team as a code owner April 13, 2026 07:07
@YPxHolic YPxHolic requested a review from syuoni April 13, 2026 07:07
@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough


Updated get_numa_aware_cpu_affinity() function to properly handle CUDA_VISIBLE_DEVICES environment variable remapping by interpreting the input device_id as a logical CUDA device index and mapping it to the corresponding physical GPU index for NVML operations.

Changes

Cohort / File(s) | Summary
Device ID mapping logic (tensorrt_llm/llmapi/utils.py) | Added logic to parse the CUDA_VISIBLE_DEVICES environment variable, map the logical device_id to a physical GPU index, include fallback handling with a warning for out-of-range indices, and updated the docstring to clarify that the device_id parameter is the logical CUDA index.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically describes the main fix: resolving NVML device index mismatch when CUDA_VISIBLE_DEVICES is set, directly matching the core change in the PR.
Description check | ✅ Passed | The PR description includes Problem, Solution, and Example sections explaining the issue, fix, and impact clearly. However, the PR checklist items are not explicitly addressed, and Test Coverage section is missing.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/utils.py`:
- Around line 583-597: The CUDA_VISIBLE_DEVICES parsing assumes all tokens are
integers and allows negative device_id indexing; change the logic in the block
that computes nvml_device_id so CUDA_VISIBLE_DEVICES is parsed into a list of
trimmed tokens (convert tokens to int only when they are numeric, otherwise keep
the raw token string) into physical_ids, then guard selection with an explicit
non-negative bounds check (if 0 <= device_id < len(physical_ids)) before
indexing; if in-range set nvml_device_id to physical_ids[device_id] (which may
be an int or a UUID/MIG string), otherwise log the warning and fall back to
nvml_device_id = device_id. Use the existing symbols physical_ids, device_id,
nvml_device_id and logger when implementing this.
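The two concerns in this comment can be illustrated with a short sketch. The helper names `parse_visible_devices` and `select_physical_id` are hypothetical, and the UUID below is a made-up placeholder. The sketch assumes an upper-bound-only check of the kind the comment describes, and shows why that is not enough in Python.

```python
from typing import List, Optional, Union

DeviceToken = Union[int, str]


def parse_visible_devices(raw: str) -> List[DeviceToken]:
    """Split CUDA_VISIBLE_DEVICES; numeric tokens become int, others stay str."""
    tokens = [tok.strip() for tok in raw.split(",") if tok.strip()]
    return [int(tok) if tok.isdigit() else tok for tok in tokens]


def select_physical_id(device_id: int,
                       physical_ids: List[DeviceToken]) -> Optional[DeviceToken]:
    """Return the entry for device_id, or None when it is out of range."""
    # Explicit lower bound: negative indices must not wrap around.
    if 0 <= device_id < len(physical_ids):
        return physical_ids[device_id]
    return None


# Placeholder UUID, made up for illustration.
ids = parse_visible_devices("3, GPU-00000000-0000-0000-0000-000000000000")

# Pitfall 1: -1 passes an upper-bound-only check, and plain list indexing
# then silently selects the last entry instead of failing.
assert -1 < len(ids)
assert ids[-1] == "GPU-00000000-0000-0000-0000-000000000000"
assert select_physical_id(-1, ids) is None

# Pitfall 2: calling int() on every token would raise on UUID-style entries.
try:
    int("GPU-00000000-0000-0000-0000-000000000000")
except ValueError:
    print("int() rejects non-numeric tokens")

assert select_physical_id(0, ids) == 3
```

Judging by the formatting diff later in this thread, the merged code appears to take the more conservative route for non-numeric tokens: it logs a warning and falls back to `device_id` instead of carrying the raw string forward.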

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f810959a-468b-42d2-96f7-9f7a39eb0fbf

📥 Commits

Reviewing files that changed from the base of the PR and between 9a3dc61 and 64b7fb2.

📒 Files selected for processing (1)
  • tensorrt_llm/llmapi/utils.py

Comment thread tensorrt_llm/llmapi/utils.py
…y when CUDA_VISIBLE_DEVICES is set

NVML always enumerates all physical GPUs regardless of
CUDA_VISIBLE_DEVICES. When a user restricts GPU visibility
(e.g. CUDA_VISIBLE_DEVICES=3,4), the logical CUDA device index 0
actually corresponds to physical GPU 3. Previously,
get_numa_aware_cpu_affinity() passed the logical index directly to
nvmlDeviceGetHandleByIndex(), causing it to query the wrong GPU's
NUMA topology and bind worker threads to incorrect CPU cores.

This commit adds a mapping from logical CUDA device index to physical
NVML device index by parsing CUDA_VISIBLE_DEVICES before the NVML
call, ensuring correct NUMA-aware CPU affinity in multi-GPU
deployments that use GPU subsetting.

Signed-off-by: pernyyang <[email protected]>
@YPxHolic YPxHolic force-pushed the fix/numa-affinity-cuda-visible-devices branch from 64b7fb2 to 0406292 Compare April 13, 2026 07:22
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 13, 2026
Collaborator

@Superjomn Superjomn left a comment


LGTM

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48320 [ run ] triggered by Bot. Commit: 87e96f9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48320 [ run ] completed with state ABORTED. Commit: 87e96f9

Link to invocation

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48604 [ run ] triggered by Bot. Commit: 87e96f9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48604 [ run ] completed with state FAILURE. Commit: 87e96f9
/LLM/main/L0_MergeRequest_PR pipeline #38387 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@YPxHolic
Contributor Author

YPxHolic commented May 21, 2026

Hi @karljang ,
I don't have access to the CI logs. Could you please share the failure
details so I can investigate and fix? Thanks!

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49736 [ run ] triggered by Bot. Commit: 87e96f9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49736 [ run ] completed with state SUCCESS. Commit: 87e96f9
/LLM/main/L0_MergeRequest_PR pipeline #39340 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@YPxHolic
Contributor Author

Hi @karljang ,
Still running into an issue here. Could you take a look?

@karljang
Collaborator

Again, the failure doesn't seem to be related.

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #50174 [ run ] triggered by Bot. Commit: 87e96f9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50174 [ run ] completed with state SUCCESS. Commit: 87e96f9
/LLM/main/L0_MergeRequest_PR pipeline #39718 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@karljang
Collaborator

@YPxHolic ,
Thanks for your contributions!
Most CI failures look unrelated, but there is a formatting issue. Could you please check this log:

diff --git a/tensorrt_llm/llmapi/utils.py b/tensorrt_llm/llmapi/utils.py
index 9c724ff..82adc4f 100644
--- a/tensorrt_llm/llmapi/utils.py
+++ b/tensorrt_llm/llmapi/utils.py
@@ -573,7 +573,9 @@ def get_numa_aware_cpu_affinity(device_id):
         # actually corresponds to physical GPU 3.
         cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
         if cuda_visible is not None and cuda_visible.strip():
-            visible_tokens = [x.strip() for x in cuda_visible.split(",") if x.strip()]
+            visible_tokens = [
+                x.strip() for x in cuda_visible.split(",") if x.strip()
+            ]
             if 0 <= device_id < len(visible_tokens):
                 token = visible_tokens[device_id]
                 if token.isdigit():
@@ -581,7 +583,8 @@ def get_numa_aware_cpu_affinity(device_id):
                 else:
                     logger.warning(
                         f"CUDA_VISIBLE_DEVICES token '{token}' is non-numeric; "
-                        f"falling back to device_id ({device_id}) as NVML index.")
+                        f"falling back to device_id ({device_id}) as NVML index."
+                    )
                     nvml_device_id = device_id
             else:
                 logger.warning(


Error: pre-commit checks failed
Please refer to our coding style guidelines at: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style to fix this issue

@YPxHolic YPxHolic force-pushed the fix/numa-affinity-cuda-visible-devices branch from fa6e8e7 to ede5317 Compare May 27, 2026 11:46
@YPxHolic YPxHolic force-pushed the fix/numa-affinity-cuda-visible-devices branch from c90d446 to 7939099 Compare May 27, 2026 11:55
@YPxHolic
Contributor Author

Hi @karljang ,
Thanks for the review! I've addressed the comments. Could you please take another look?

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #50893 [ run ] triggered by Bot. Commit: 7939099 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50893 [ run ] completed with state SUCCESS. Commit: 7939099
/LLM/main/L0_MergeRequest_PR pipeline #40359 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #50974 [ run ] triggered by Bot. Commit: 7939099 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50974 [ run ] completed with state SUCCESS. Commit: 7939099
/LLM/main/L0_MergeRequest_PR pipeline #40429 completed with status: 'SUCCESS'

CI Report

Link to invocation

@karljang karljang merged commit ebbbec4 into NVIDIA:main May 29, 2026
7 checks passed

Labels

Community want to contribute PRs initiated from Community


5 participants