Skip to content

feat(validators): warm up inference-perf benchmark and make it tunable#1096

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-warmup-and-tuning
May 30, 2026
Merged

feat(validators): warm up inference-perf benchmark and make it tunable#1096
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-warmup-and-tuning

Conversation

@yuanchen8911

Copy link
Copy Markdown
Contributor

Summary

Warm up the inference-perf performance benchmark before measuring, and expose its tuning knobs via env so operators can retune without rebuilding the validator image.

Motivation / Context

The inference-perf check benchmarked vLLM from a cold start, so vLLM's one-time CUDA-graph capture / JIT compile cost landed inside the measured window. On an 8× RTX PRO 6000 node (Qwen3-0.6B) this produced a misleading p99 TTFT of ~42 s and ~772 tok/s — failing the recipe performance thresholds despite healthy hardware. Warmup is the standard way to measure steady-state serving performance (MLPerf Inference; AIPerf/genai-perf ship --warmup-request-count for exactly this).

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Validator (pkg/validator) — validators/performance
  • Docs/examples (docs/)

Implementation Notes

  • Add --warmup-request-count (one full concurrency wave, excluded from AIPerf stats) so the compile/capture cost is primed before measurement.
  • Widen the steady-state window: measured request count max(200, concurrency×2)max(1000, concurrency×8).
  • Expose knobs via AICR_INFERENCE_PERF_* env on the inference-perf catalog entry (concurrency-per-GPU, warmup-per-concurrency, min requests, requests-per-concurrency, input/output token means), read with intFromEnv (WARN + fall back to default on unset/non-integer/non-positive, mirroring checkTimeoutFromEnv).
  • Methodology knobs live with the validator/catalog; per-accelerator pass/fail thresholds stay in the recipe overlays ("recipes define what, validators determine how").

Testing

Validated end-to-end on a live EKS cluster (8× RTX PRO 6000, g7e.48xlarge, Qwen3-0.6B):

Metric Before (cold) After (warmup) Threshold
Throughput 772.74 tok/s 36,143.79 tok/s ≥ 5000
TTFT p99 41,943 ms 46.59 ms ≤ 200 ms
go test -race ./validators/performance/... ./pkg/validator/catalog/...   # ok
golangci-lint run -c .golangci.yaml ./validators/performance/...          # 0 issues
go build ./...                                                            # ok
yamllint recipes/validators/catalog.yaml                                  # clean

Full make qualify was not run locally (e2e needs a Kind/Tilt cluster and scan needs grype); the Go test/lint/build/vet gates above pass and the functional change is proven on real hardware. CI covers e2e + scan.

Risk Assessment

  • Low — Isolated to the inference-perf benchmark builder; defaults preserve the new behavior; easy to revert.

Rollout notes: No migration. Benchmark defaults change (warmup added, larger measured-request window); recipe thresholds unchanged.

Checklist

  • Tests pass locally (-race)
  • Linter passes (golangci-lint, yamllint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs (docs/contributor/validator.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 marked this pull request as draft May 29, 2026 01:54
@yuanchen8911 yuanchen8911 changed the title feat(validators): warm up inference-perf benchmark and make it tunable WIP: feat(validators): warm up inference-perf benchmark and make it tunable May 29, 2026
@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds env-driven tuning for the AIPerf inference validator: new constants and AICR_INFERENCE_PERF_* names; intFromEnv to parse validated positive-integer overrides; concurrency now uses per-GPU env overrides; buildAIPerfJob computes env-overridable request and warmup counts and emits --warmup-request-count; token-mean overrides are accepted; tests and docs/catalog comments updated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

area/validator

Suggested reviewers

  • mchmarny
  • lockwobr
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining motivation, implementation details, testing, and risk assessment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title 'feat(validators): warm up inference-perf benchmark and make it tunable' accurately summarizes the two main changes: adding warmup to the inference-perf benchmark and exposing tuning knobs via environment variables.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
validators/performance/inference_perf_constraint.go (1)

1270-1291: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Quote the URL in the shell script for consistency and defensive coding.

The model name is quoted (aiperf profile "%s") but the URL is not (--url %s). While the URL is currently constructed from safe internal service names, quoting prevents issues if URL construction changes or contains unexpected characters.

Suggested fix
 	script := fmt.Sprintf(`set -e
 aiperf profile "%s" \
-  --url %s \
+  --url "%s" \
   --endpoint-type chat \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/inference_perf_constraint.go` around lines 1270 -
1291, The shell script built in the fmt.Sprintf assigned to variable "script"
does not quote the URL flag (--url %s); update the format string to pass the URL
quoted (e.g., --url "%s") so the generated command uses --url "<endpoint>" just
like the model name is quoted in aiperf profile "%s"; ensure the corresponding
argument ordering (endpoint) remains unchanged in the call to fmt.Sprintf in
validators/performance/inference_perf_constraint.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1234-1253: The buildAIPerfJob doc comment is now separated from
its declaration by the newly inserted intFromEnv function; move the existing doc
comment block that describes buildAIPerfJob so it immediately precedes the func
buildAIPerfJob declaration (or alternatively move intFromEnv below
buildAIPerfJob) so the Go doc comment is attached to buildAIPerfJob; ensure the
comment stays unchanged and sits directly above func buildAIPerfJob.

In `@validators/performance/inference_perf_test.go`:
- Around line 597-609: The test TestBuildAIPerfJob_Warmup should be refactored
into a table-driven style: create a slice of test cases (each with a name and
concurrency value), iterate over them and run each as t.Run(tc.name, func(t
*testing.T) { ... }), calling buildAIPerfJob and asserting the script contains
the expected "--warmup-request-count" needle; keep using the existing
helpers/variables like buildAIPerfJob and aiperfWarmupPerConcurrency and
preserve the current assertion message but move it inside each subtest.

---

Outside diff comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1270-1291: The shell script built in the fmt.Sprintf assigned to
variable "script" does not quote the URL flag (--url %s); update the format
string to pass the URL quoted (e.g., --url "%s") so the generated command uses
--url "<endpoint>" just like the model name is quoted in aiperf profile "%s";
ensure the corresponding argument ordering (endpoint) remains unchanged in the
call to fmt.Sprintf in validators/performance/inference_perf_constraint.go.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6d762139-98c0-4a04-acf8-c03396296cf6

📥 Commits

Reviewing files that changed from the base of the PR and between 7295c04 and e1534e6.

📒 Files selected for processing (4)
  • docs/contributor/validator.md
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go

Comment thread validators/performance/inference_perf_constraint.go Outdated
Comment thread validators/performance/inference_perf_test.go
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch 3 times, most recently from a1934c1 to a28243f Compare May 29, 2026 02:10
@github-actions github-actions Bot added size/L and removed size/M labels May 29, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
validators/performance/inference_perf_constraint.go (1)

1256-1287: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Log the derived benchmark knobs here.

runAIPerfJob still reports the fixed aiperfMinRequests, so env overrides and concurrency scaling are invisible in logs. Please emit the computed request/warmup counts and token means from this block so the run artifacts show the actual methodology used.

Suggested change
  requestCount := intFromEnv(envMinRequests, aiperfMinRequests)
  if scaled := concurrency * intFromEnv(envRequestsPerConcurrency, aiperfRequestsPerConcurrency); scaled > requestCount {
  	requestCount = scaled
  }
  // Warmup primes vLLM (CUDA graph capture / JIT) before measurement so the
  // one-time compile cost does not inflate p99 TTFT. Excluded from stats.
  warmupCount := concurrency * intFromEnv(envWarmupPerConcurrency, aiperfWarmupPerConcurrency)
+ inputTokensMean := intFromEnv(envInputTokensMean, aiperfInputTokensMean)
+ outputTokensMean := intFromEnv(envOutputTokensMean, aiperfOutputTokensMean)
+ slog.Info("Configured AIPerf benchmark",
+ 	"concurrency", concurrency,
+ 	"requestCount", requestCount,
+ 	"warmupCount", warmupCount,
+ 	"inputTokensMean", inputTokensMean,
+ 	"outputTokensMean", outputTokensMean)
  // Resolve once so Image and ImagePullPolicy can't drift if env vars
  // were mutated between two calls (matters in tests; cheap in prod).
  aiperfImage := resolveAiperfImage()
...
-		intFromEnv(envInputTokensMean, aiperfInputTokensMean), intFromEnv(envOutputTokensMean, aiperfOutputTokensMean),
+		inputTokensMean, outputTokensMean,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/inference_perf_constraint.go` around lines 1256 -
1287, The code builds the aiperf command from derived knobs (requestCount,
warmupCount, input/output token means, concurrency, aiperfImage) but only logs
the constant aiperfMinRequests; update runAIPerfJob to emit a debug/info log
right after resolving aiperfImage (or before creating script) that prints the
computed values requestCount, warmupCount, concurrency,
intFromEnv(envInputTokensMean,...), intFromEnv(envOutputTokensMean,...), and the
resolved aiperfImage so the run artifacts reflect the actual knobs used; place
the log near the existing script construction (referencing variables
requestCount, warmupCount, concurrency, aiperfImage) and use the project's
standard logger (e.g., processLogger or the logger used elsewhere in
runAIPerfJob) to ensure it appears in job outputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1241-1253: The intFromEnv helper currently swallows malformed
AICR_INFERENCE_PERF_* env values and returns a default; change it to surface
parse errors by returning an error so the caller can abort with
ErrCodeInvalidRequest. Update intFromEnv(name string, def int) to return (int,
error), parse the env and on empty return (def, nil), but on strconv.Atoi error
or v <= 0 return (0, fmt.Errorf(...)) (or a typed sentinel error); then update
the callers (e.g., where intFromEnv is used) to check the error and return
ErrCodeInvalidRequest so the process aborts rather than silently using defaults.
Ensure references to ErrCodeInvalidRequest are used for the abort path and
propagate the new error signature through the call chain.

---

Outside diff comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1256-1287: The code builds the aiperf command from derived knobs
(requestCount, warmupCount, input/output token means, concurrency, aiperfImage)
but only logs the constant aiperfMinRequests; update runAIPerfJob to emit a
debug/info log right after resolving aiperfImage (or before creating script)
that prints the computed values requestCount, warmupCount, concurrency,
intFromEnv(envInputTokensMean,...), intFromEnv(envOutputTokensMean,...), and the
resolved aiperfImage so the run artifacts reflect the actual knobs used; place
the log near the existing script construction (referencing variables
requestCount, warmupCount, concurrency, aiperfImage) and use the project's
standard logger (e.g., processLogger or the logger used elsewhere in
runAIPerfJob) to ensure it appears in job outputs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 272a43bb-e3e3-4b3d-bdfb-33f3d09b2181

📥 Commits

Reviewing files that changed from the base of the PR and between ddba9db and a1934c1.

📒 Files selected for processing (4)
  • docs/contributor/validator.md
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go

Comment thread validators/performance/inference_perf_constraint.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from a28243f to 18f83c4 Compare May 29, 2026 02:27

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/validator.md`:
- Line 152: Update the subsection text to expand the acronym TTFT on its first
mention by replacing the first occurrence of "TTFT" with "time-to-first-token
(TTFT)"; ensure the rest of the paragraph continues to use "TTFT" thereafter and
do not change identifiers like "inference-perf", the "env" knobs, or the error
symbol "ErrCodeInvalidRequest".

In `@validators/performance/inference_perf_test.go`:
- Around line 651-653: Update the top-of-test comment for TestIntFromEnv to
accurately describe the behavior: state that a valid positive integer is
accepted, an unset value falls back to the default, and non-integer, zero, or
negative values are rejected (fail-closed) with an error; mention the
TestIntFromEnv test and the env-reader function it exercises so readers
understand it's enforcing error-on-invalid rather than silently defaulting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 91580f90-fc00-452f-a592-cd85b8d30365

📥 Commits

Reviewing files that changed from the base of the PR and between a28243f and 18f83c4.

📒 Files selected for processing (5)
  • docs/contributor/validator.md
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf.go
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go

Comment thread docs/contributor/validator.md Outdated
Comment thread validators/performance/inference_perf_test.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch 2 times, most recently from f5a902e to f5716a9 Compare May 29, 2026 03:27
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch 3 times, most recently from e85ca1a to a67c286 Compare May 29, 2026 23:32
@yuanchen8911 yuanchen8911 changed the title WIP: feat(validators): warm up inference-perf benchmark and make it tunable feat(validators): warm up inference-perf benchmark and make it tunable May 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from a67c286 to be2bda1 Compare May 30, 2026 00:00
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 30, 2026 00:00
@yuanchen8911 yuanchen8911 changed the title feat(validators): warm up inference-perf benchmark and make it tunable WIP: feat(validators): warm up inference-perf benchmark and make it tunable May 30, 2026
@yuanchen8911 yuanchen8911 marked this pull request as draft May 30, 2026 00:07
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from be2bda1 to 4edeff2 Compare May 30, 2026 00:46
@yuanchen8911 yuanchen8911 changed the title WIP: feat(validators): warm up inference-perf benchmark and make it tunable feat(validators): warm up inference-perf benchmark and make it tunable May 30, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 30, 2026 00:47
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from 4edeff2 to 6ac0fb7 Compare May 30, 2026 01:22
The inference-perf check benchmarked vLLM from a cold start, so its one-time CUDA-graph capture / JIT compile cost landed inside the measured window — inflating p99 TTFT (~42s observed on RTX PRO 6000) and depressing throughput.

Add an AIPerf warmup phase (one full concurrency wave, excluded from stats) and widen the steady-state window: measured request count is now max(1000, concurrency*8) (was max(200, concurrency*2)). On an 8x RTX PRO 6000 node (Qwen3-0.6B) this moved throughput 772 -> 36,144 tok/s and p99 TTFT 41,943ms -> 46.6ms, well within the recipe thresholds.

Expose the benchmark knobs (concurrency-per-GPU, warmup, request count, token means) via AICR_INFERENCE_PERF_* env on the inference-perf catalog entry, so operators can retune without rebuilding the validator image. Defaults preserve the new behavior; a malformed or non-positive value fails the check up front with ErrCodeInvalidRequest — validated before any workload is deployed — rather than silently benchmarking under defaults and reporting a pass/fail the operator never configured. Per-accelerator pass/fail thresholds stay in the recipe overlays — these are validation methodology knobs that live with the validator/catalog.

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-warmup-and-tuning branch from 6ac0fb7 to b71e8a6 Compare May 30, 2026 14:47
@yuanchen8911 yuanchen8911 merged commit 686d329 into NVIDIA:main May 30, 2026
118 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants