feat(validators): warm up inference-perf benchmark and make it tunable#1096
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds env-driven tuning for the AIPerf inference validator: new constants and AICR_INFERENCE_PERF_* names; intFromEnv to parse validated positive-integer overrides; concurrency now uses per-GPU env overrides; buildAIPerfJob computes env-overridable request and warmup counts and emits --warmup-request-count; token-mean overrides are accepted; tests and docs/catalog comments updated. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
validators/performance/inference_perf_constraint.go (1)
1270-1291: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueQuote the URL in the shell script for consistency and defensive coding.
The model name is quoted (
aiperf profile "%s") but the URL is not (--url %s). While the URL is currently constructed from safe internal service names, quoting prevents issues if URL construction changes or contains unexpected characters.Suggested fix
script := fmt.Sprintf(`set -e aiperf profile "%s" \ - --url %s \ + --url "%s" \ --endpoint-type chat \🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/inference_perf_constraint.go` around lines 1270 - 1291, The shell script built in the fmt.Sprintf assigned to variable "script" does not quote the URL flag (--url %s); update the format string to pass the URL quoted (e.g., --url "%s") so the generated command uses --url "<endpoint>" just like the model name is quoted in aiperf profile "%s"; ensure the corresponding argument ordering (endpoint) remains unchanged in the call to fmt.Sprintf in validators/performance/inference_perf_constraint.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1234-1253: The buildAIPerfJob doc comment is now separated from
its declaration by the newly inserted intFromEnv function; move the existing doc
comment block that describes buildAIPerfJob so it immediately precedes the func
buildAIPerfJob declaration (or alternatively move intFromEnv below
buildAIPerfJob) so the Go doc comment is attached to buildAIPerfJob; ensure the
comment stays unchanged and sits directly above func buildAIPerfJob.
In `@validators/performance/inference_perf_test.go`:
- Around line 597-609: The test TestBuildAIPerfJob_Warmup should be refactored
into a table-driven style: create a slice of test cases (each with a name and
concurrency value), iterate over them and run each as t.Run(tc.name, func(t
*testing.T) { ... }), calling buildAIPerfJob and asserting the script contains
the expected "--warmup-request-count" needle; keep using the existing
helpers/variables like buildAIPerfJob and aiperfWarmupPerConcurrency and
preserve the current assertion message but move it inside each subtest.
---
Outside diff comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1270-1291: The shell script built in the fmt.Sprintf assigned to
variable "script" does not quote the URL flag (--url %s); update the format
string to pass the URL quoted (e.g., --url "%s") so the generated command uses
--url "<endpoint>" just like the model name is quoted in aiperf profile "%s";
ensure the corresponding argument ordering (endpoint) remains unchanged in the
call to fmt.Sprintf in validators/performance/inference_perf_constraint.go.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6d762139-98c0-4a04-acf8-c03396296cf6
📒 Files selected for processing (4)
docs/contributor/validator.mdrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
a1934c1 to
a28243f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
validators/performance/inference_perf_constraint.go (1)
1256-1287:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winLog the derived benchmark knobs here.
runAIPerfJobstill reports the fixedaiperfMinRequests, so env overrides and concurrency scaling are invisible in logs. Please emit the computed request/warmup counts and token means from this block so the run artifacts show the actual methodology used.Suggested change
requestCount := intFromEnv(envMinRequests, aiperfMinRequests) if scaled := concurrency * intFromEnv(envRequestsPerConcurrency, aiperfRequestsPerConcurrency); scaled > requestCount { requestCount = scaled } // Warmup primes vLLM (CUDA graph capture / JIT) before measurement so the // one-time compile cost does not inflate p99 TTFT. Excluded from stats. warmupCount := concurrency * intFromEnv(envWarmupPerConcurrency, aiperfWarmupPerConcurrency) + inputTokensMean := intFromEnv(envInputTokensMean, aiperfInputTokensMean) + outputTokensMean := intFromEnv(envOutputTokensMean, aiperfOutputTokensMean) + slog.Info("Configured AIPerf benchmark", + "concurrency", concurrency, + "requestCount", requestCount, + "warmupCount", warmupCount, + "inputTokensMean", inputTokensMean, + "outputTokensMean", outputTokensMean) // Resolve once so Image and ImagePullPolicy can't drift if env vars // were mutated between two calls (matters in tests; cheap in prod). aiperfImage := resolveAiperfImage() ... - intFromEnv(envInputTokensMean, aiperfInputTokensMean), intFromEnv(envOutputTokensMean, aiperfOutputTokensMean), + inputTokensMean, outputTokensMean,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/inference_perf_constraint.go` around lines 1256 - 1287, The code builds the aiperf command from derived knobs (requestCount, warmupCount, input/output token means, concurrency, aiperfImage) but only logs the constant aiperfMinRequests; update runAIPerfJob to emit a debug/info log right after resolving aiperfImage (or before creating script) that prints the computed values requestCount, warmupCount, concurrency, intFromEnv(envInputTokensMean,...), intFromEnv(envOutputTokensMean,...), and the resolved aiperfImage so the run artifacts reflect the actual knobs used; place the log near the existing script construction (referencing variables requestCount, warmupCount, concurrency, aiperfImage) and use the project's standard logger (e.g., processLogger or the logger used elsewhere in runAIPerfJob) to ensure it appears in job outputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1241-1253: The intFromEnv helper currently swallows malformed
AICR_INFERENCE_PERF_* env values and returns a default; change it to surface
parse errors by returning an error so the caller can abort with
ErrCodeInvalidRequest. Update intFromEnv(name string, def int) to return (int,
error), parse the env and on empty return (def, nil), but on strconv.Atoi error
or v <= 0 return (0, fmt.Errorf(...)) (or a typed sentinel error); then update
the callers (e.g., where intFromEnv is used) to check the error and return
ErrCodeInvalidRequest so the process aborts rather than silently using defaults.
Ensure references to ErrCodeInvalidRequest are used for the abort path and
propagate the new error signature through the call chain.
---
Outside diff comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1256-1287: The code builds the aiperf command from derived knobs
(requestCount, warmupCount, input/output token means, concurrency, aiperfImage)
but only logs the constant aiperfMinRequests; update runAIPerfJob to emit a
debug/info log right after resolving aiperfImage (or before creating script)
that prints the computed values requestCount, warmupCount, concurrency,
intFromEnv(envInputTokensMean,...), intFromEnv(envOutputTokensMean,...), and the
resolved aiperfImage so the run artifacts reflect the actual knobs used; place
the log near the existing script construction (referencing variables
requestCount, warmupCount, concurrency, aiperfImage) and use the project's
standard logger (e.g., processLogger or the logger used elsewhere in
runAIPerfJob) to ensure it appears in job outputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 272a43bb-e3e3-4b3d-bdfb-33f3d09b2181
📒 Files selected for processing (4)
docs/contributor/validator.mdrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
a28243f to
18f83c4
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/validator.md`:
- Line 152: Update the subsection text to expand the acronym TTFT on its first
mention by replacing the first occurrence of "TTFT" with "time-to-first-token
(TTFT)"; ensure the rest of the paragraph continues to use "TTFT" thereafter and
do not change identifiers like "inference-perf", the "env" knobs, or the error
symbol "ErrCodeInvalidRequest".
In `@validators/performance/inference_perf_test.go`:
- Around line 651-653: Update the top-of-test comment for TestIntFromEnv to
accurately describe the behavior: state that a valid positive integer is
accepted, an unset value falls back to the default, and non-integer, zero, or
negative values are rejected (fail-closed) with an error; mention the
TestIntFromEnv test and the env-reader function it exercises so readers
understand it's enforcing error-on-invalid rather than silently defaulting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 91580f90-fc00-452f-a592-cd85b8d30365
📒 Files selected for processing (5)
docs/contributor/validator.mdrecipes/validators/catalog.yamlvalidators/performance/inference_perf.govalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.go
f5a902e to
f5716a9
Compare
e85ca1a to
a67c286
Compare
a67c286 to
be2bda1
Compare
be2bda1 to
4edeff2
Compare
4edeff2 to
6ac0fb7
Compare
The inference-perf check benchmarked vLLM from a cold start, so its one-time CUDA-graph capture / JIT compile cost landed inside the measured window — inflating p99 TTFT (~42s observed on RTX PRO 6000) and depressing throughput. Add an AIPerf warmup phase (one full concurrency wave, excluded from stats) and widen the steady-state window: measured request count is now max(1000, concurrency*8) (was max(200, concurrency*2)). On an 8x RTX PRO 6000 node (Qwen3-0.6B) this moved throughput 772 -> 36,144 tok/s and p99 TTFT 41,943ms -> 46.6ms, well within the recipe thresholds. Expose the benchmark knobs (concurrency-per-GPU, warmup, request count, token means) via AICR_INFERENCE_PERF_* env on the inference-perf catalog entry, so operators can retune without rebuilding the validator image. Defaults preserve the new behavior; a malformed or non-positive value fails the check up front with ErrCodeInvalidRequest — validated before any workload is deployed — rather than silently benchmarking under defaults and reporting a pass/fail the operator never configured. Per-accelerator pass/fail thresholds stay in the recipe overlays — these are validation methodology knobs that live with the validator/catalog. Signed-off-by: Yuan Chen <[email protected]>
6ac0fb7 to
b71e8a6
Compare
Summary
Warm up the
inference-perfperformance benchmark before measuring, and expose its tuning knobs via env so operators can retune without rebuilding the validator image.Motivation / Context
The
inference-perfcheck benchmarked vLLM from a cold start, so vLLM's one-time CUDA-graph capture / JIT compile cost landed inside the measured window. On an 8× RTX PRO 6000 node (Qwen3-0.6B) this produced a misleading p99 TTFT of ~42 s and ~772 tok/s — failing the recipe performance thresholds despite healthy hardware. Warmup is the standard way to measure steady-state serving performance (MLPerf Inference; AIPerf/genai-perf ship--warmup-request-countfor exactly this).Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
pkg/validator) —validators/performancedocs/)Implementation Notes
--warmup-request-count(one full concurrency wave, excluded from AIPerf stats) so the compile/capture cost is primed before measurement.max(200, concurrency×2)→max(1000, concurrency×8).AICR_INFERENCE_PERF_*env on theinference-perfcatalog entry (concurrency-per-GPU, warmup-per-concurrency, min requests, requests-per-concurrency, input/output token means), read withintFromEnv(WARN + fall back to default on unset/non-integer/non-positive, mirroringcheckTimeoutFromEnv).Testing
Validated end-to-end on a live EKS cluster (8× RTX PRO 6000,
g7e.48xlarge,Qwen3-0.6B):≥ 5000≤ 200 msFull
make qualifywas not run locally (e2e needs a Kind/Tilt cluster andscanneeds grype); the Go test/lint/build/vet gates above pass and the functional change is proven on real hardware. CI covers e2e + scan.Risk Assessment
Rollout notes: No migration. Benchmark defaults change (warmup added, larger measured-request window); recipe thresholds unchanged.
Checklist
-race)golangci-lint,yamllint)docs/contributor/validator.md)git commit -S)