feat(validators): warm up inference-perf benchmark and make it tunable by yuanchen8911 · Pull Request #1096 · NVIDIA/aicr

yuanchen8911 · 2026-05-29T01:51:36Z

Summary

Warm up the inference-perf performance benchmark before measuring, and expose its tuning knobs via env so operators can retune without rebuilding the validator image.

Motivation / Context

The inference-perf check benchmarked vLLM from a cold start, so vLLM's one-time CUDA-graph capture / JIT compile cost landed inside the measured window. On an 8× RTX PRO 6000 node (Qwen3-0.6B) this produced a misleading p99 TTFT of ~42 s and ~772 tok/s — failing the recipe performance thresholds despite healthy hardware. Warmup is the standard way to measure steady-state serving performance (MLPerf Inference; AIPerf/genai-perf ship --warmup-request-count for exactly this).

Fixes: N/A
Related: N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)

Component(s) Affected

Validator (pkg/validator) — validators/performance
Docs/examples (docs/)

Implementation Notes

Add --warmup-request-count (one full concurrency wave, excluded from AIPerf stats) so the compile/capture cost is primed before measurement.
Widen the steady-state window: measured request count max(200, concurrency×2) → max(1000, concurrency×8).
Expose knobs via AICR_INFERENCE_PERF_* env on the inference-perf catalog entry (concurrency-per-GPU, warmup-per-concurrency, min requests, requests-per-concurrency, input/output token means), read with intFromEnv (WARN + fall back to default on unset/non-integer/non-positive, mirroring checkTimeoutFromEnv).
Methodology knobs live with the validator/catalog; per-accelerator pass/fail thresholds stay in the recipe overlays ("recipes define what, validators determine how").

Testing

Validated end-to-end on a live EKS cluster (8× RTX PRO 6000, g7e.48xlarge, Qwen3-0.6B):

Metric	Before (cold)	After (warmup)	Threshold
Throughput	772.74 tok/s	36,143.79 tok/s	`≥ 5000`	✅
TTFT p99	41,943 ms	46.59 ms	`≤ 200 ms`	✅

go test -race ./validators/performance/... ./pkg/validator/catalog/...   # ok
golangci-lint run -c .golangci.yaml ./validators/performance/...          # 0 issues
go build ./...                                                            # ok
yamllint recipes/validators/catalog.yaml                                  # clean

Full make qualify was not run locally (e2e needs a Kind/Tilt cluster and scan needs grype); the Go test/lint/build/vet gates above pass and the functional change is proven on real hardware. CI covers e2e + scan.

Risk Assessment

Low — Isolated to the inference-perf benchmark builder; defaults preserve the new behavior; easy to revert.

Rollout notes: No migration. Benchmark defaults change (warmup added, larger measured-request window); recipe thresholds unchanged.

Checklist

Tests pass locally (-race)
Linter passes (golangci-lint, yamllint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs (docs/contributor/validator.md)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-05-29T01:52:27Z

🌿 Preview your docs: https://nvidia-preview-feat-inference-perf-warmup-and-tuning.docs.buildwithfern.com/aicr

coderabbitai · 2026-05-29T01:56:43Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR adds env-driven tuning for the AIPerf inference validator: new constants and AICR_INFERENCE_PERF_* names; intFromEnv to parse validated positive-integer overrides; concurrency now uses per-GPU env overrides; buildAIPerfJob computes env-overridable request and warmup counts and emits --warmup-request-count; token-mean overrides are accepted; tests and docs/catalog comments updated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

area/validator

Suggested reviewers

mchmarny
lockwobr

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining motivation, implementation details, testing, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title 'feat(validators): warm up inference-perf benchmark and make it tunable' accurately summarizes the two main changes: adding warmup to the inference-perf benchmark and exposing tuning knobs via environment variables.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

validators/performance/inference_perf_constraint.go (1)
1270-1291: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Quote the URL in the shell script for consistency and defensive coding.

The model name is quoted (aiperf profile "%s") but the URL is not (--url %s). While the URL is currently constructed from safe internal service names, quoting prevents issues if URL construction changes or contains unexpected characters.
Suggested fix
 	script := fmt.Sprintf(`set -e
 aiperf profile "%s" \
-  --url %s \
+  --url "%s" \
   --endpoint-type chat \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/inference_perf_constraint.go` around lines 1270 -
1291, The shell script built in the fmt.Sprintf assigned to variable "script"
does not quote the URL flag (--url %s); update the format string to pass the URL
quoted (e.g., --url "%s") so the generated command uses --url "<endpoint>" just
like the model name is quoted in aiperf profile "%s"; ensure the corresponding
argument ordering (endpoint) remains unchanged in the call to fmt.Sprintf in
validators/performance/inference_perf_constraint.go.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1234-1253: The buildAIPerfJob doc comment is now separated from
its declaration by the newly inserted intFromEnv function; move the existing doc
comment block that describes buildAIPerfJob so it immediately precedes the func
buildAIPerfJob declaration (or alternatively move intFromEnv below
buildAIPerfJob) so the Go doc comment is attached to buildAIPerfJob; ensure the
comment stays unchanged and sits directly above func buildAIPerfJob.

In `@validators/performance/inference_perf_test.go`:
- Around line 597-609: The test TestBuildAIPerfJob_Warmup should be refactored
into a table-driven style: create a slice of test cases (each with a name and
concurrency value), iterate over them and run each as t.Run(tc.name, func(t
*testing.T) { ... }), calling buildAIPerfJob and asserting the script contains
the expected "--warmup-request-count" needle; keep using the existing
helpers/variables like buildAIPerfJob and aiperfWarmupPerConcurrency and
preserve the current assertion message but move it inside each subtest.

---

Outside diff comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1270-1291: The shell script built in the fmt.Sprintf assigned to
variable "script" does not quote the URL flag (--url %s); update the format
string to pass the URL quoted (e.g., --url "%s") so the generated command uses
--url "<endpoint>" just like the model name is quoted in aiperf profile "%s";
ensure the corresponding argument ordering (endpoint) remains unchanged in the
call to fmt.Sprintf in validators/performance/inference_perf_constraint.go.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6d762139-98c0-4a04-acf8-c03396296cf6

📥 Commits

Reviewing files that changed from the base of the PR and between 7295c04 and e1534e6.

📒 Files selected for processing (4)

docs/contributor/validator.md
recipes/validators/catalog.yaml
validators/performance/inference_perf_constraint.go
validators/performance/inference_perf_test.go

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

validators/performance/inference_perf_constraint.go (1)

1256-1287: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Log the derived benchmark knobs here.

runAIPerfJob still reports the fixed aiperfMinRequests, so env overrides and concurrency scaling are invisible in logs. Please emit the computed request/warmup counts and token means from this block so the run artifacts show the actual methodology used.

Suggested change

  requestCount := intFromEnv(envMinRequests, aiperfMinRequests)
  if scaled := concurrency * intFromEnv(envRequestsPerConcurrency, aiperfRequestsPerConcurrency); scaled > requestCount {
  	requestCount = scaled
  }
  // Warmup primes vLLM (CUDA graph capture / JIT) before measurement so the
  // one-time compile cost does not inflate p99 TTFT. Excluded from stats.
  warmupCount := concurrency * intFromEnv(envWarmupPerConcurrency, aiperfWarmupPerConcurrency)
+ inputTokensMean := intFromEnv(envInputTokensMean, aiperfInputTokensMean)
+ outputTokensMean := intFromEnv(envOutputTokensMean, aiperfOutputTokensMean)
+ slog.Info("Configured AIPerf benchmark",
+ 	"concurrency", concurrency,
+ 	"requestCount", requestCount,
+ 	"warmupCount", warmupCount,
+ 	"inputTokensMean", inputTokensMean,
+ 	"outputTokensMean", outputTokensMean)
  // Resolve once so Image and ImagePullPolicy can't drift if env vars
  // were mutated between two calls (matters in tests; cheap in prod).
  aiperfImage := resolveAiperfImage()
...
-		intFromEnv(envInputTokensMean, aiperfInputTokensMean), intFromEnv(envOutputTokensMean, aiperfOutputTokensMean),
+		inputTokensMean, outputTokensMean,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/inference_perf_constraint.go` around lines 1256 -
1287, The code builds the aiperf command from derived knobs (requestCount,
warmupCount, input/output token means, concurrency, aiperfImage) but only logs
the constant aiperfMinRequests; update runAIPerfJob to emit a debug/info log
right after resolving aiperfImage (or before creating script) that prints the
computed values requestCount, warmupCount, concurrency,
intFromEnv(envInputTokensMean,...), intFromEnv(envOutputTokensMean,...), and the
resolved aiperfImage so the run artifacts reflect the actual knobs used; place
the log near the existing script construction (referencing variables
requestCount, warmupCount, concurrency, aiperfImage) and use the project's
standard logger (e.g., processLogger or the logger used elsewhere in
runAIPerfJob) to ensure it appears in job outputs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1241-1253: The intFromEnv helper currently swallows malformed
AICR_INFERENCE_PERF_* env values and returns a default; change it to surface
parse errors by returning an error so the caller can abort with
ErrCodeInvalidRequest. Update intFromEnv(name string, def int) to return (int,
error), parse the env and on empty return (def, nil), but on strconv.Atoi error
or v <= 0 return (0, fmt.Errorf(...)) (or a typed sentinel error); then update
the callers (e.g., where intFromEnv is used) to check the error and return
ErrCodeInvalidRequest so the process aborts rather than silently using defaults.
Ensure references to ErrCodeInvalidRequest are used for the abort path and
propagate the new error signature through the call chain.

---

Outside diff comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1256-1287: The code builds the aiperf command from derived knobs
(requestCount, warmupCount, input/output token means, concurrency, aiperfImage)
but only logs the constant aiperfMinRequests; update runAIPerfJob to emit a
debug/info log right after resolving aiperfImage (or before creating script)
that prints the computed values requestCount, warmupCount, concurrency,
intFromEnv(envInputTokensMean,...), intFromEnv(envOutputTokensMean,...), and the
resolved aiperfImage so the run artifacts reflect the actual knobs used; place
the log near the existing script construction (referencing variables
requestCount, warmupCount, concurrency, aiperfImage) and use the project's
standard logger (e.g., processLogger or the logger used elsewhere in
runAIPerfJob) to ensure it appears in job outputs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 272a43bb-e3e3-4b3d-bdfb-33f3d09b2181

📥 Commits

Reviewing files that changed from the base of the PR and between ddba9db and a1934c1.

📒 Files selected for processing (4)

docs/contributor/validator.md
recipes/validators/catalog.yaml
validators/performance/inference_perf_constraint.go
validators/performance/inference_perf_test.go

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/validator.md`:
- Line 152: Update the subsection text to expand the acronym TTFT on its first
mention by replacing the first occurrence of "TTFT" with "time-to-first-token
(TTFT)"; ensure the rest of the paragraph continues to use "TTFT" thereafter and
do not change identifiers like "inference-perf", the "env" knobs, or the error
symbol "ErrCodeInvalidRequest".

In `@validators/performance/inference_perf_test.go`:
- Around line 651-653: Update the top-of-test comment for TestIntFromEnv to
accurately describe the behavior: state that a valid positive integer is
accepted, an unset value falls back to the default, and non-integer, zero, or
negative values are rejected (fail-closed) with an error; mention the
TestIntFromEnv test and the env-reader function it exercises so readers
understand it's enforcing error-on-invalid rather than silently defaulting.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 91580f90-fc00-452f-a592-cd85b8d30365

📥 Commits

Reviewing files that changed from the base of the PR and between a28243f and 18f83c4.

📒 Files selected for processing (5)

docs/contributor/validator.md
recipes/validators/catalog.yaml
validators/performance/inference_perf.go
validators/performance/inference_perf_constraint.go
validators/performance/inference_perf_test.go

The inference-perf check benchmarked vLLM from a cold start, so its one-time CUDA-graph capture / JIT compile cost landed inside the measured window — inflating p99 TTFT (~42s observed on RTX PRO 6000) and depressing throughput. Add an AIPerf warmup phase (one full concurrency wave, excluded from stats) and widen the steady-state window: measured request count is now max(1000, concurrency*8) (was max(200, concurrency*2)). On an 8x RTX PRO 6000 node (Qwen3-0.6B) this moved throughput 772 -> 36,144 tok/s and p99 TTFT 41,943ms -> 46.6ms, well within the recipe thresholds. Expose the benchmark knobs (concurrency-per-GPU, warmup, request count, token means) via AICR_INFERENCE_PERF_* env on the inference-perf catalog entry, so operators can retune without rebuilding the validator image. Defaults preserve the new behavior; a malformed or non-positive value fails the check up front with ErrCodeInvalidRequest — validated before any workload is deployed — rather than silently benchmarking under defaults and reporting a pass/fail the operator never configured. Per-accelerator pass/fail thresholds stay in the recipe overlays — these are validation methodology knobs that live with the validator/catalog. Signed-off-by: Yuan Chen <[email protected]>

yuanchen8911 requested review from a team as code owners May 29, 2026 01:51

yuanchen8911 added the enhancement label May 29, 2026

github-actions Bot added area/recipes area/docs size/M labels May 29, 2026

yuanchen8911 marked this pull request as draft May 29, 2026 01:54

yuanchen8911 changed the title ~~feat(validators): warm up inference-perf benchmark and make it tunable~~ WIP: feat(validators): warm up inference-perf benchmark and make it tunable May 29, 2026

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread validators/performance/inference_perf_constraint.go Outdated

Comment thread validators/performance/inference_perf_test.go

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch 3 times, most recently from a1934c1 to a28243f Compare May 29, 2026 02:10

github-actions Bot added size/L and removed size/M labels May 29, 2026

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread validators/performance/inference_perf_constraint.go Outdated

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from a28243f to 18f83c4 Compare May 29, 2026 02:27

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread docs/contributor/validator.md Outdated

Comment thread validators/performance/inference_perf_test.go Outdated

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch 2 times, most recently from f5a902e to f5716a9 Compare May 29, 2026 03:27

mchmarny assigned yuanchen8911 May 29, 2026

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch 3 times, most recently from e85ca1a to a67c286 Compare May 29, 2026 23:32

yuanchen8911 changed the title ~~WIP: feat(validators): warm up inference-perf benchmark and make it tunable~~ feat(validators): warm up inference-perf benchmark and make it tunable May 29, 2026

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from a67c286 to be2bda1 Compare May 30, 2026 00:00

yuanchen8911 marked this pull request as ready for review May 30, 2026 00:00

yuanchen8911 changed the title ~~feat(validators): warm up inference-perf benchmark and make it tunable~~ WIP: feat(validators): warm up inference-perf benchmark and make it tunable May 30, 2026

yuanchen8911 marked this pull request as draft May 30, 2026 00:07

yuanchen8911 mentioned this pull request May 30, 2026

Improve inference performance validation: tunable, per-accelerator params wired into recipes #1116

Closed

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from be2bda1 to 4edeff2 Compare May 30, 2026 00:46

yuanchen8911 changed the title ~~WIP: feat(validators): warm up inference-perf benchmark and make it tunable~~ feat(validators): warm up inference-perf benchmark and make it tunable May 30, 2026

yuanchen8911 marked this pull request as ready for review May 30, 2026 00:47

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from 4edeff2 to 6ac0fb7 Compare May 30, 2026 01:22

mchmarny approved these changes May 30, 2026

View reviewed changes

yuanchen8911 mentioned this pull request May 30, 2026

fix(recipes): stop setting gpu-operator driverInstallDir to / on kind and oke #1119

Merged

12 tasks

yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from 6ac0fb7 to b71e8a6 Compare May 30, 2026 14:47

yuanchen8911 merged commit 686d329 into NVIDIA:main May 30, 2026
118 checks passed

njhensley mentioned this pull request Jun 3, 2026

validator: tag override doesn't force PullAlways on the main container (stale cached image runs) #1177

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(validators): warm up inference-perf benchmark and make it tunable#1096

feat(validators): warm up inference-perf benchmark and make it tunable#1096
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-warmup-and-tuning

yuanchen8911 commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 •

edited

Loading

Reviews paused

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented May 29, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 29, 2026 •

edited

Loading