feat(validator): add inference performance validation by yuanchen8911 · Pull Request #641 · NVIDIA/aicr

yuanchen8911 · 2026-04-22T17:43:48Z

Summary

Adds an inference-perf performance validator that benchmarks Dynamo vLLM inference endpoints with AIPerf — measuring output-token throughput and TTFT p99 — plus the pre-built aiperf-bench benchmark runner image, build/publish coverage across release + main + local-dev paths, and a user-facing walkthrough at docs/user/validation.md covering both training and inference performance validation end-to-end.

Motivation / Context

AICR already validates training performance via NCCL all-reduce bandwidth (aicr validate --phase performance with --intent training --platform kubeflow). There's no equivalent for inference workloads — users deploying Dynamo-based inference stacks have no automated go/no-go gate for throughput and latency, and no way to catch broken GPU drivers, misconfigured DRA, or CUDA errors that silently degrade inference serving.

Fixes: #448
Related: N/A

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update
Build/CI/tooling

Component(s) Affected

Validator (pkg/validator)
Docs/examples (docs/)
Other: benchmark runner image (validators/performance/aiperf-bench.Dockerfile), release workflow (.github/workflows/on-tag.yaml), main-push workflow (.github/workflows/on-push.yaml), local-dev Makefile, e2e action (.github/actions/e2e/action.yml)

Implementation Notes

The check

Deploys a DynamoGraphDeployment with a small vLLM-served model (Qwen/Qwen3-0.6B by default) plus an AIPerf benchmark Job against the frontend. Measures two metrics from the AIPerf output and evaluates each against a recipe constraint (10% tolerance):

Metric	Constraint name in recipe	Default (placeholder)
Output-token throughput	`inference-throughput`	`>= 5000 tok/s`
Time-to-first-token p99	`inference-ttft-p99`	`<= 200 ms`

Pipeline lifecycle (mirrors the NCCL performance validator):

Deploy ResourceClaimTemplate (DRA GPU allocation) + DynamoGraphDeployment + optional KAI Queue in a per-run namespace aicr-inference-perf-<8-hex-suffix>
Watch-based wait for DynamoGraphDeployment state=successful (no polling)
Probe frontend /health endpoint
Create AIPerf benchmark Job from a pre-built image (ghcr.io/nvidia/aicr-validators/aiperf-bench); parse JSON output via sentinel-delimited log markers
Compare throughput + TTFT p99 against recipe constraints
Synchronous cleanup (namespace → cascade; AIPerf Job → watch-for-deletion)

Scheduling — DRA-aware single-node pin

Three behaviors cooperate so the check produces a stable per-node baseline and plays well on shared clusters:

Per-run isolation. Both the benchmark namespace and the AIPerf Job name are suffixed with a short hash of AICR_RUN_ID (deriveRunID()) — two concurrent aicr validate invocations never share state, never delete each other's Job, never wait on the wrong pod. Suffix is 8 hex chars so <namespace>-<dgd-name> stays under Grove's 63-char PodClique label limit.
Single-node hostname pin. All Dynamo Frontend + worker pods pin to one specific GPU node via kubernetes.io/hostname. Without this, the scheduler could spread workers across multiple nodes and the throughput number would become a pool-level average rather than a per-node baseline.
DRA-aware node selection. countUsedGPUsByNode() enumerates existing ResourceClaims with driver == gpu.nvidia.com and accumulates per-node in-use counts (keyed by allocation pool = node name). pickCandidateWithMostFreeGPUs() picks the candidate with the most free GPUs (allocatable − in-use), sizes the workload to that count, and fails fast with an actionable message if every candidate is saturated. The Status.Allocatable["nvidia.com/gpu"] view does not shrink when DRA devices are allocated to other workloads — without this, a node that "looks empty" in the device-plugin view can leave the benchmark pending for the full timeout. Fails soft (falls back to allocatable-only sizing) if the cluster's resource.k8s.io/v1 API is unreachable.

--node-selector and --toleration from aicr validate continue to work: --node-selector narrows the candidate pool before DRA sizing, --toleration overrides the tolerate-the-node's-taints default. Both affect only the inner benchmark pods, not the validator orchestrator Job.

Correctness guards (three explicit skip paths)

Guard	Fires when	Skip reason
A	Recipe lists the check but no matching constraints	`no inference-throughput or inference-ttft-p99 constraint in recipe`
B	Check selected but `dynamo-platform` absent from `componentRefs`	`skipped - dynamo-platform not in recipe components`
C	`dynamo-platform` declared but `DynamoGraphDeployment` CRD not registered on cluster	`skipped - DynamoGraphDeployment CRD not installed on cluster ...`

Guard C uses IsNotFound only for "not installed" — Forbidden / timeout / auth failures surface as errors rather than masking as benign skips (mirrors isTrainerInstalled).

Pre-built AIPerf runner image

New validators/performance/aiperf-bench.Dockerfile bakes aiperf at build time (pins AIPERF_VERSION). Runtime pod needs only a single ghcr.io pull, no PyPI dependency (air-gap friendly, removes ~30s install warmup per run).

Build/publish coverage — extended to all paths, not only tagged releases:

.github/workflows/on-tag.yaml (releases): multi-arch manifest, vuln scan, attestation — symmetric with the three existing Go-validator images
.github/workflows/on-push.yaml (main-push): aiperf-bench added to the build matrix + VALIDATOR_PHASES env; per-arch :sha-<commit>-<arch> and :edge-<arch> tags; multi-arch manifest combines them into :sha-<commit> and :edge
Makefile: image-validators target builds/pushes aiperf-bench alongside the three Go validators; validate-local kind loads all four into the Kind cluster (also fixes a pre-existing typo — validate-local depended on a non-existent image-validator target)
.github/actions/e2e/action.yml (local registry): builds aiperf-bench from its own Dockerfile, pushes to the local Kind registry, and includes it in the tags-list verify step

Image-resolution consistency

Exported catalog.ResolveImage so the inner AIPerf image reference (held as a Go constant, not a catalog entry) gets the same :latest→version pinning and AICR_VALIDATOR_IMAGE_REGISTRY override that catalog.Load applies to top-level catalog entries. The outer validator's Deployer forwards AICR_CLI_VERSION + the registry env var to the Job pod.

Timeouts

Raised CheckExecutionTimeout from 10m → 40m to accommodate the sequential inference pipeline (workload ready 10m + health 5m + AIPerf 15m)
New InferenceNamespaceTerminationWait = 5m so back-to-back runs wait for a prior run's namespace deletion (Dynamo finalizers take 2–3 min) instead of racing
Catalog entry timeout bumped to 45m (must exceed the parent ctx, else the Job's activeDeadlineSeconds kills the pod before internal deadlines expire)

Docs

New docs/user/validation.md — task-oriented walkthrough covering both training and inference performance validation, all three phases, skip semantics, dry-run, CI/CD integration (with accurate exit-code mapping), and troubleshooting (including the DRA-saturated-node fail-fast path and fallback mode when the DRA API is unreachable)
Cross-linked from docs/user/index.md, docs/user/cli-reference.md (top of aicr validate section), and the sidebar (site/.vitepress/config.ts)
Training performance section carries a placeholder note for per-platform expected-bandwidth numbers (to be filled once reference benchmarks are published)

Testing

Local

make qualify    # tests (-race), lint, chainsaw e2e, license headers, sidebar sync

All green. Coverage on validators/performance/... stays positive net vs main baseline.

New unit tests cover the pure functions: parseAIPerfOutput (sentinel / missing / malformed), deriveRunID (hash determinism + uniqueness of random fallback), countUsedGPUsByNode (fake clientset; accumulate across claims, driver filter, unallocated claim, empty list), pickCandidateWithMostFreeGPUs (selection + tie-break + negative-free clamp), applyInferenceWorkerScheduling (worker gets DRA claim, frontend co-locates without claim), buildAIPerfJob (pre-built image, no pip install, sentinel framing, per-run jobName), buildTolerations (filtering, YAML-special chars), inferServicePort, hasDynamoPlatform, isDynamoDeploymentReady, nodesMatchingSelector, nodeGPUCount, resolveAiperfImage, catalog.ResolveImage.

On cluster (EKS H100, `aicr-cuj2`)

Verbatim shell output — snapshot → recipe → validate. This run exercises the DRA-aware picker on a cluster where dynamo-workload/vllm-gpu-claim already holds 1 GPU on ip-10-0-151-148; the validator correctly auto-picks the other GPU node (ip-10-0-186-114) without any hostname override:

$ /tmp/aicr/bin/aicr-ae2729ed snapshot --output /tmp/aicr/cuj2/snapshot.yaml

$ /tmp/aicr/bin/aicr-ae2729ed recipe \
    --service eks --accelerator h100 --os ubuntu \
    --intent inference --platform dynamo \
    --output /tmp/aicr/cuj2/recipe.yaml
[cli] recipe generation completed: output=/tmp/aicr/cuj2/recipe.yaml components=16 overlays=7

$ AICR_VALIDATOR_IMAGE_REGISTRY=ghcr.io/yuanchen8911 \
    /tmp/aicr/bin/aicr-ae2729ed validate \
      --recipe /tmp/aicr/cuj2/recipe.yaml \
      --snapshot /tmp/aicr/cuj2/snapshot.yaml \
      --node-selector nodeGroup=gpu-worker \
      --toleration dedicated=worker-workload:NoSchedule \
      --toleration dedicated=worker-workload:NoExecute \
      --phase performance
[cli] readiness pre-flight: constraints=4 ... all passed
[cli] running validation phase: phase=performance catalog=2 selected=1
[cli] running validator: name=inference-perf phase=performance
  Found GPU nodes count=2
  Filtered GPU nodes to match --node-selector selector=map[nodeGroup:gpu-worker] matched=2
  --node-selector narrowed candidate pool; workers pinned to single node
    via kubernetes.io/hostname node=ip-10-0-186-114.ec2.internal freeGPUs=8
  Deploying benchmark workload gpus=8 namespace=aicr-inference-perf-6b2bd21e
  Applied ResourceClaimTemplate name=aicr-inference-gpu-claim
  Applied DynamoGraphDeployment name=aicr-inference-perf gpuWorkers=8
  Waiting for DynamoGraphDeployment to become ready...
  DynamoGraphDeployment is ready                                          (~2 min)
  Using inference endpoint http://aicr-inference-perf-frontend...svc:8000 concurrency=128
  Inference endpoint is healthy
  Running AIPerf benchmark model=Qwen/Qwen3-0.6B concurrency=128 requests=200 job=aicr-aiperf-6b2bd21e
  pod aicr-aiperf-6b2bd21e-wgvr2: Pending → Running → Succeeded           (~12s)
  Inference benchmark results throughput_tok/s=39399.24 ttft_p99_ms=138.27
  Cleaning up inference benchmark workload...
  Deleted DynamoGraphDeployment
  Deleted namespace aicr-inference-perf-6b2bd21e
[cli] validator completed: name=inference-perf status=passed
[cli] phase completed: phase=performance status=passed duration=2m11.224776542s

CTRF stdout tail:
  "Inference throughput: 39399.24 tokens/sec"
  "Inference TTFT p99: 138.27 ms"
  "Throughput constraint: >= 5000 → PASS"
  "TTFT p99 constraint: <= 200 → PASS"

Result: inference-perf passed — throughput 39,399 tok/s, TTFT p99 138.27 ms. Both constraints met with margin.

The AICR_VALIDATOR_IMAGE_REGISTRY=ghcr.io/yuanchen8911 override is only used here because this PR is pre-merge testing from the fork. After merge, the aiperf-bench image is published by on-push.yaml to ghcr.io/nvidia/aicr-validators/aiperf-bench:edge (and :sha-<commit>) on every main push — no env-var override needed for main/edge consumers.

Risk Assessment

Medium — New validator binary path + new release image + bumped CheckExecutionTimeout affect all performance-phase runs. Skip semantics (Guards A/B/C) keep the check inert in any environment where Dynamo isn't in scope. Core validator runtime changes are small and covered by unit tests + an on-cluster EKS run.

Rollout notes: No migration required. The inference-perf check only activates for recipes generated with --intent inference --platform dynamo. Existing training/other-phase validators continue to work unchanged. The AICR_CLI_VERSION and AICR_VALIDATOR_IMAGE_REGISTRY env vars forwarded by the Deployer are additive (no existing behavior changes when unset). DRA-aware sizing degrades to allocatable-only sizing on clusters where resource.k8s.io/v1 is unavailable — no functional regression on non-DRA clusters.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

yuanchen8911 · 2026-04-22T18:15:01Z

CI failures (Tier 1/Tier 2 KWOK jobs) are a pre-existing regression from #603 — unrelated to this PR. Filed #643 to fix the bundler --set regex so Helm-style array indexing (tolerations[2].key=...) is accepted again. Once that lands, the KWOK tests here should go green after a rebase.

Add inference-perf performance validator that benchmarks Dynamo vLLM inference endpoints using AIPerf and evaluates two metrics against recipe constraints: * inference-throughput (output tokens/sec, default >= 5000) * inference-ttft-p99 (time-to-first-token p99 in ms, default <= 200) Pipeline: deploy ResourceClaimTemplate + DynamoGraphDeployment into a per-run namespace -> wait for DynamoGraphDeployment state=successful -> probe frontend /health -> run AIPerf benchmark Job -> parse sentinel-delimited JSON -> compare against recipe constraints -> synchronous cleanup. Scheduling is DRA-aware: the validator enumerates existing gpu.nvidia.com ResourceClaim allocations per node and picks the candidate with the most free GPUs, sizes the workload to that count, and fails fast with an actionable message when every candidate is saturated. All worker pods pin to a single kubernetes.io/hostname for a stable per-node baseline. Per-run namespace and inner Job names are suffixed with an 8-hex hash of AICR_RUN_ID so concurrent validate invocations cannot collide. Three skip guards keep the check inert in environments where it cannot succeed: missing recipe constraints, dynamo-platform absent from componentRefs, and DynamoGraphDeployment CRD not installed on the cluster. Guard C uses IsNotFound only for "not installed" so forbidden / timeout / auth failures surface as errors rather than benign skips. A pre-built aiperf-bench image (Python, non-root user, aiperf pinned at build time) ships from the release workflow and the main-push workflow, and is built alongside the three Go validators in the Makefile and e2e action so inference-perf works on main/edge and local-dev paths, not only on tagged releases. Validated on EKS H100 (aicr-cuj2): 39,399 tokens/sec, TTFT p99 138 ms, DRA-aware picker correctly auto-selects the free node on a cluster where another workload already holds 1 GPU via DRA on the other candidate.

…eGVR inference_perf_constraint.go and nccl_all_reduce_bw_constraint.go each declared their own package-level resourceClaimTemplateGVR. After both landed on this branch (inference-perf via NVIDIA#641 merge), the package no longer compiles — duplicate top-level name. Move the GVR to a new dra_gvr.go shared between both callers, since the definition is identical (resource.k8s.io/v1 resourceclaimtemplates) and any future validator that inspects DRA RCTs will want the same value. Drive-by: fix gofmt alignment of the "inference-perf" entry in main.go check map.

…ld-start readiness probe Two related fixes: 1. Wire the GKE H100 COS Dynamo overlay's performance phase to the inference-perf validator. Without this, aicr validate --phase performance against the GKE recipe is a no-op (validators=0, status=skipped) — the validator is in the catalog, the EKS overlay subscribes (NVIDIA#641), but the GKE sibling was never extended. 2. Replace the inference-perf validator's GET /health readiness probe with a real POST /v1/chat/completions probe. Dynamo's frontend returns 200 from /health as soon as the HTTP server is up — well before backend workers register or the model finishes loading. Hitting that window with AIPerf produced an "all requests completed, zero tokens" failure that looked like a benchmark regression. The chat-completion probe only accepts the endpoint once the response carries a non-empty completion, which is the only signal both necessary and sufficient to know AIPerf will produce real numbers. Affects both EKS and GKE Dynamo overlays (single shared codepath). Validation: ran end-to-end aicr validate --phase performance against GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s (>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder thresholds carried over from the EKS overlay sit comfortably inside both observed values, so no per-platform tuning is required at this time. Fixes NVIDIA#937

github-actions Bot added area/recipes area/ci area/validator area/docs size/XL labels Apr 22, 2026

yuanchen8911 added enhancement labels Apr 22, 2026

yuanchen8911 changed the title ~~WIP: feat(validator): inference performance validation with AIPerf~~ WIP: feat(validator): add inference performance validation Apr 22, 2026

yuanchen8911 force-pushed the feat/inference-perf-validator branch from 7a37f97 to 6a78393 Compare April 22, 2026 18:03

This was referenced Apr 22, 2026

[Feature]: Add Dynamo inference performance validation #448

Closed

fix(kwok): remove broken --set array-index lines from validate-scheduling.sh #643

Merged

yuanchen8911 force-pushed the feat/inference-perf-validator branch from 6a78393 to f5e21d4 Compare April 22, 2026 18:28

yuanchen8911 changed the title ~~WIP: feat(validator): add inference performance validation~~ feat(validator): inference performance validation with AIPerf Apr 22, 2026

yuanchen8911 force-pushed the feat/inference-perf-validator branch 5 times, most recently from a1a7ff4 to c34b52c Compare April 22, 2026 22:22

yuanchen8911 mentioned this pull request Apr 22, 2026

fix(ci): remove dangling steps.capacity.outcome reference in uat-gcp #645

Closed

10 tasks

yuanchen8911 force-pushed the feat/inference-perf-validator branch 5 times, most recently from 29053a5 to ae2729e Compare April 23, 2026 00:29

yuanchen8911 marked this pull request as ready for review April 23, 2026 00:34

yuanchen8911 requested review from a team as code owners April 23, 2026 00:34

This comment was marked as resolved.

Sign in to view

yuanchen8911 requested a review from njhensley April 23, 2026 02:18

yuanchen8911 force-pushed the feat/inference-perf-validator branch 4 times, most recently from fafd279 to accd9ca Compare April 23, 2026 02:47

yuanchen8911 changed the title ~~feat(validator): inference performance validation with AIPerf~~ feat(validator): add inference performance validation Apr 23, 2026

yuanchen8911 force-pushed the feat/inference-perf-validator branch 2 times, most recently from 6e8693b to aaa6218 Compare April 23, 2026 04:32

yuanchen8911 mentioned this pull request Apr 23, 2026

fix(recipes): disable Dynamo ssh-keygen on Kind #649

Closed

25 tasks

yuanchen8911 force-pushed the feat/inference-perf-validator branch 2 times, most recently from f052703 to 84e7443 Compare April 23, 2026 05:26

yuanchen8911 force-pushed the feat/inference-perf-validator branch from 84e7443 to 01a2025 Compare April 23, 2026 06:33

github-actions Bot added the area/cli label Apr 23, 2026

mchmarny assigned yuanchen8911 Apr 23, 2026

mchmarny added this to the v0.12 milestone Apr 23, 2026

Merge branch 'main' into feat/inference-perf-validator

27c324b

mchmarny approved these changes Apr 23, 2026

View reviewed changes

mchmarny enabled auto-merge (squash) April 23, 2026 12:29

Merge branch 'main' into feat/inference-perf-validator

3dcba24

mchmarny merged commit 3a86364 into NVIDIA:main Apr 23, 2026
65 of 66 checks passed

This was referenced May 15, 2026

Add inference performance validation to GKE inference-dynamo overlay #937

Closed

fix(performance): add GKE inference performance validation + cold-start readiness probe #952

Merged

coderabbitai Bot mentioned this pull request Jun 1, 2026

feat(validators): enhance inference performance validation #1133

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(validator): add inference performance validation #641

feat(validator): add inference performance validation #641
mchmarny merged 3 commits into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-validator

yuanchen8911 commented Apr 22, 2026 •

edited

Loading

Uh oh!

yuanchen8911 commented Apr 22, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

The check

Scheduling — DRA-aware single-node pin

Correctness guards (three explicit skip paths)

Pre-built AIPerf runner image

Image-resolution consistency

Timeouts

Docs

Testing

Local

On cluster (EKS H100, aicr-cuj2)

Risk Assessment

Checklist

Uh oh!

yuanchen8911 commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented Apr 22, 2026 •

edited

Loading

On cluster (EKS H100, `aicr-cuj2`)

yuanchen8911 commented Apr 22, 2026 •

edited

Loading