Skip to content

[None][fix] Populate s_host_node_name for SLURM-based perf test runs#13367

Merged
hyukn merged 1 commit into
NVIDIA:mainfrom
hyukn:fix/hostname-fallback-opensearch
Apr 24, 2026
Merged

[None][fix] Populate s_host_node_name for SLURM-based perf test runs#13367
hyukn merged 1 commit into
NVIDIA:mainfrom
hyukn:fix/hostname-fallback-opensearch

Conversation

@hyukn
Copy link
Copy Markdown
Collaborator

@hyukn hyukn commented Apr 23, 2026

@coderabbitai summary

Description

s_host_node_name in OpenSearch is empty for all B200/GB200 SLURM-based perf test records (52,436 of 52,446 records) because HOST_NODE_NAME is set via Kubernetes downward API but not forwarded to SLURM containers.

This change falls back to socket.gethostname() when HOST_NODE_NAME env var is unset, so every run records the compute node identity. This is critical for host-overhead-dominant module_perf tests where different DGX B200 nodes with different CPUs cause a bimodal ~1.25x latency distribution (9-20% CoV). Without hostname data, infrastructure noise is indistinguishable from real code regressions.

Test Coverage

  • Verify socket.gethostname() returns a meaningful hostname on a SLURM compute node
  • Verify k8s runs still use the HOST_NODE_NAME env var (no behavior change)
  • Run a module_perf test locally and confirm s_host_node_name is populated in the uploaded OpenSearch record

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

HOST_NODE_NAME is set via Kubernetes downward API for k8s pods but is
not forwarded to SLURM containers, leaving s_host_node_name empty for
all B200/GB200 perf test records in OpenSearch. Fall back to
socket.gethostname() so that every run records the compute node
identity regardless of the execution environment.

This is especially important for host-overhead-dominant module_perf
tests where different DGX B200 nodes with different CPUs cause a
bimodal ~1.25x latency distribution (9-20% CoV). Without hostname
data, infrastructure noise cannot be distinguished from real code
regressions.

Signed-off-by: Yukun He <[email protected]>
@hyukn hyukn requested a review from a team as a code owner April 23, 2026 06:33
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

Adds socket import and modifies get_job_info() to default s_host_node_name to the system hostname via socket.gethostname() when the HOST_NODE_NAME environment variable is unset or empty, replacing the previous empty string behavior.

Changes

Cohort / File(s) Summary
Hostname Fallback
tests/integration/defs/perf/perf_regression_utils.py
Added socket import and updated get_job_info() to assign s_host_node_name to system hostname when HOST_NODE_NAME is unavailable, instead of defaulting to empty string.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding hostname population for SLURM-based perf test runs, following the repository's template format with [None][fix] prefix.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively explains the problem, solution, motivation, and includes a detailed test plan following the template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/perf/perf_regression_utils.py (1)

1-1: ⚠️ Potential issue | 🟠 Major

Update copyright year to 2026.

The copyright header should reflect the year of latest modification. As per coding guidelines, update the year range to include 2026.

📅 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/perf_regression_utils.py` at line 1, Update the
copyright header year range in the SPDX/author header (the line starting with "#
SPDX-FileCopyrightText") to include 2026 by changing the end year from 2024 to
2026 so the header reflects the latest modification year.
🧹 Nitpick comments (1)
tests/integration/defs/perf/perf_regression_utils.py (1)

39-42: Consider updating the docstring to document hostname fallback behavior.

The docstring currently states "Get job info from environment variables," but the function now falls back to socket.gethostname() when HOST_NODE_NAME is unset. Documenting this behavior would improve clarity, especially given the PR's motivation around supporting multiple execution environments (Kubernetes vs. SLURM).

📝 Suggested docstring enhancement
 def get_job_info():
-    """Get job info from environment variables."""
+    """Get job info from environment variables.
+    
+    Returns job configuration including host node name (from HOST_NODE_NAME
+    env var or system hostname as fallback), build info, branch, commit, and
+    PR metadata when applicable.
+    """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/perf_regression_utils.py` around lines 39 - 42,
Update the get_job_info() docstring to describe that it reads job-related
environment variables (including HOST_NODE_NAME) and that when HOST_NODE_NAME is
not set the function falls back to using socket.gethostname() as the host node
name; mention any defaults used (e.g., empty string fallback) so callers
understand the hostname resolution behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/perf_regression_utils.py`:
- Line 42: The HOST_NODE_NAME env handling currently uses host_node_name =
os.getenv("HOST_NODE_NAME", "") or socket.gethostname() which treats
whitespace-only values as valid; update the logic around host_node_name to trim
whitespace (e.g., call .strip() on the env value) and fall back to
socket.gethostname() when the stripped value is empty so that whitespace-only
environment variables don't become recorded hostnames.

---

Outside diff comments:
In `@tests/integration/defs/perf/perf_regression_utils.py`:
- Line 1: Update the copyright header year range in the SPDX/author header (the
line starting with "# SPDX-FileCopyrightText") to include 2026 by changing the
end year from 2024 to 2026 so the header reflects the latest modification year.

---

Nitpick comments:
In `@tests/integration/defs/perf/perf_regression_utils.py`:
- Around line 39-42: Update the get_job_info() docstring to describe that it
reads job-related environment variables (including HOST_NODE_NAME) and that when
HOST_NODE_NAME is not set the function falls back to using socket.gethostname()
as the host node name; mention any defaults used (e.g., empty string fallback)
so callers understand the hostname resolution behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bb532243-ce61-4d8d-bcc4-1b6299b48681

📥 Commits

Reviewing files that changed from the base of the PR and between 7b0f60e and e6ef747.

📒 Files selected for processing (1)
  • tests/integration/defs/perf/perf_regression_utils.py

Comment thread tests/integration/defs/perf/perf_regression_utils.py
@hyukn hyukn requested review from chenfeiz0326 and ruodil April 23, 2026 07:04
@hyukn
Copy link
Copy Markdown
Collaborator Author

hyukn commented Apr 24, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45277 [ run ] triggered by Bot. Commit: e6ef747 Link to invocation

@hyukn hyukn requested a review from ZhanruiSunCh April 24, 2026 02:55
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45277 [ run ] completed with state SUCCESS. Commit: e6ef747
/LLM/main/L0_MergeRequest_PR pipeline #35533 completed with status: 'SUCCESS'

CI Report

Link to invocation

@hyukn hyukn merged commit 5277e62 into NVIDIA:main Apr 24, 2026
8 of 10 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants