[None][fix] Populate s_host_node_name for SLURM-based perf test runs#13367
Conversation
HOST_NODE_NAME is set via Kubernetes downward API for k8s pods but is not forwarded to SLURM containers, leaving s_host_node_name empty for all B200/GB200 perf test records in OpenSearch. Fall back to socket.gethostname() so that every run records the compute node identity regardless of the execution environment. This is especially important for host-overhead-dominant module_perf tests where different DGX B200 nodes with different CPUs cause a bimodal ~1.25x latency distribution (9-20% CoV). Without hostname data, infrastructure noise cannot be distinguished from real code regressions. Signed-off-by: Yukun He <[email protected]>
📝 WalkthroughWalkthroughAdds socket import and modifies Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/perf/perf_regression_utils.py (1)
1-1:⚠️ Potential issue | 🟠 MajorUpdate copyright year to 2026.
The copyright header should reflect the year of latest modification. As per coding guidelines, update the year range to include 2026.
📅 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/perf_regression_utils.py` at line 1, Update the copyright header year range in the SPDX/author header (the line starting with "# SPDX-FileCopyrightText") to include 2026 by changing the end year from 2024 to 2026 so the header reflects the latest modification year.
🧹 Nitpick comments (1)
tests/integration/defs/perf/perf_regression_utils.py (1)
39-42: Consider updating the docstring to document hostname fallback behavior.The docstring currently states "Get job info from environment variables," but the function now falls back to
socket.gethostname()whenHOST_NODE_NAMEis unset. Documenting this behavior would improve clarity, especially given the PR's motivation around supporting multiple execution environments (Kubernetes vs. SLURM).📝 Suggested docstring enhancement
def get_job_info(): - """Get job info from environment variables.""" + """Get job info from environment variables. + + Returns job configuration including host node name (from HOST_NODE_NAME + env var or system hostname as fallback), build info, branch, commit, and + PR metadata when applicable. + """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/perf_regression_utils.py` around lines 39 - 42, Update the get_job_info() docstring to describe that it reads job-related environment variables (including HOST_NODE_NAME) and that when HOST_NODE_NAME is not set the function falls back to using socket.gethostname() as the host node name; mention any defaults used (e.g., empty string fallback) so callers understand the hostname resolution behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/defs/perf/perf_regression_utils.py`:
- Line 42: The HOST_NODE_NAME env handling currently uses host_node_name =
os.getenv("HOST_NODE_NAME", "") or socket.gethostname() which treats
whitespace-only values as valid; update the logic around host_node_name to trim
whitespace (e.g., call .strip() on the env value) and fall back to
socket.gethostname() when the stripped value is empty so that whitespace-only
environment variables don't become recorded hostnames.
---
Outside diff comments:
In `@tests/integration/defs/perf/perf_regression_utils.py`:
- Line 1: Update the copyright header year range in the SPDX/author header (the
line starting with "# SPDX-FileCopyrightText") to include 2026 by changing the
end year from 2024 to 2026 so the header reflects the latest modification year.
---
Nitpick comments:
In `@tests/integration/defs/perf/perf_regression_utils.py`:
- Around line 39-42: Update the get_job_info() docstring to describe that it
reads job-related environment variables (including HOST_NODE_NAME) and that when
HOST_NODE_NAME is not set the function falls back to using socket.gethostname()
as the host node name; mention any defaults used (e.g., empty string fallback)
so callers understand the hostname resolution behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: bb532243-ce61-4d8d-bcc4-1b6299b48681
📒 Files selected for processing (1)
tests/integration/defs/perf/perf_regression_utils.py
|
/bot run --disable-fail-fast |
|
PR_Github #45277 [ run ] triggered by Bot. Commit: |
|
PR_Github #45277 [ run ] completed with state |
…VIDIA#13367) Signed-off-by: Yukun He <[email protected]>
@coderabbitai summary
Description
s_host_node_namein OpenSearch is empty for all B200/GB200 SLURM-based perf test records (52,436 of 52,446 records) becauseHOST_NODE_NAMEis set via Kubernetes downward API but not forwarded to SLURM containers.This change falls back to
socket.gethostname()whenHOST_NODE_NAMEenv var is unset, so every run records the compute node identity. This is critical for host-overhead-dominantmodule_perftests where different DGX B200 nodes with different CPUs cause a bimodal ~1.25x latency distribution (9-20% CoV). Without hostname data, infrastructure noise is indistinguishable from real code regressions.Test Coverage
socket.gethostname()returns a meaningful hostname on a SLURM compute nodeHOST_NODE_NAMEenv var (no behavior change)s_host_node_nameis populated in the uploaded OpenSearch recordPR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.