feat(validator): add inference performance validation #641
Merged
mchmarny merged 3 commits on Apr 23, 2026
Add inference-perf performance validator that benchmarks Dynamo vLLM inference endpoints using AIPerf and evaluates two metrics against recipe constraints:

* inference-throughput (output tokens/sec, default >= 5000)
* inference-ttft-p99 (time-to-first-token p99 in ms, default <= 200)

Pipeline: deploy ResourceClaimTemplate + DynamoGraphDeployment into a per-run namespace -> wait for DynamoGraphDeployment state=successful -> probe frontend /health -> run AIPerf benchmark Job -> parse sentinel-delimited JSON -> compare against recipe constraints -> synchronous cleanup.

Scheduling is DRA-aware: the validator enumerates existing gpu.nvidia.com ResourceClaim allocations per node and picks the candidate with the most free GPUs, sizes the workload to that count, and fails fast with an actionable message when every candidate is saturated. All worker pods pin to a single kubernetes.io/hostname for a stable per-node baseline.

Per-run namespace and inner Job names are suffixed with an 8-hex hash of AICR_RUN_ID so concurrent validate invocations cannot collide.

Three skip guards keep the check inert in environments where it cannot succeed: missing recipe constraints, dynamo-platform absent from componentRefs, and DynamoGraphDeployment CRD not installed on the cluster. Guard C uses IsNotFound only for "not installed" so forbidden / timeout / auth failures surface as errors rather than benign skips.

A pre-built aiperf-bench image (Python, non-root user, aiperf pinned at build time) ships from the release workflow and the main-push workflow, and is built alongside the three Go validators in the Makefile and e2e action so inference-perf works on main/edge and local-dev paths, not only on tagged releases.

Validated on EKS H100 (aicr-cuj2): 39,399 tokens/sec, TTFT p99 138 ms, DRA-aware picker correctly auto-selects the free node on a cluster where another workload already holds 1 GPU via DRA on the other candidate.
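The DRA-aware selection described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the validator's code: the `claim` struct is a hypothetical stand-in for an allocated ResourceClaim, the helper names `countUsedGPUs` and `pickMostFree` are invented here, and the tie-break on node name is an assumption (the PR only says a tie-break exists).

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// claim is a hypothetical, simplified view of a ResourceClaim: the driver
// that served it, the node (allocation pool) it landed on, and how many
// devices it holds. Node is empty while the claim is unallocated.
type claim struct {
	Driver  string
	Node    string
	Devices int
}

const gpuDriver = "gpu.nvidia.com"

var errSaturated = errors.New("no candidate node has a free GPU")

// countUsedGPUs sums allocated GPU devices per node, skipping claims from
// other drivers and claims that have not been allocated to a node.
func countUsedGPUs(claims []claim) map[string]int {
	used := make(map[string]int)
	for _, c := range claims {
		if c.Driver != gpuDriver || c.Node == "" {
			continue
		}
		used[c.Node] += c.Devices
	}
	return used
}

// pickMostFree returns the node with the most free GPUs (allocatable minus
// in-use, clamped at zero) together with that free count. Nodes are visited
// in name order, so ties resolve to the lexically first name (an assumption).
func pickMostFree(allocatable, used map[string]int) (string, int, error) {
	names := make([]string, 0, len(allocatable))
	for n := range allocatable {
		names = append(names, n)
	}
	sort.Strings(names)

	best, bestFree := "", 0
	for _, n := range names {
		free := allocatable[n] - used[n]
		if free < 0 {
			free = 0
		}
		if free > bestFree {
			best, bestFree = n, free
		}
	}
	if bestFree == 0 {
		return "", 0, errSaturated // fail fast instead of leaving pods pending
	}
	return best, bestFree, nil
}

func main() {
	allocatable := map[string]int{"node-a": 8, "node-b": 8}
	claims := []claim{{Driver: gpuDriver, Node: "node-a", Devices: 1}}
	node, free, err := pickMostFree(allocatable, countUsedGPUs(claims))
	fmt.Println(node, free, err)
}
```

With one GPU already claimed on `node-a`, the sketch selects `node-b`, which mirrors the behaviour reported for the EKS run above.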
mchmarny approved these changes on Apr 23, 2026
njhensley added a commit to njhensley/aicr that referenced this pull request on Apr 23, 2026:

…eGVR

inference_perf_constraint.go and nccl_all_reduce_bw_constraint.go each declared their own package-level resourceClaimTemplateGVR. After both landed on this branch (inference-perf via NVIDIA#641 merge), the package no longer compiles because of the duplicate top-level name.

Move the GVR to a new dra_gvr.go shared between both callers, since the definition is identical (resource.k8s.io/v1 resourceclaimtemplates) and any future validator that inspects DRA RCTs will want the same value.

Drive-by: fix gofmt alignment of the "inference-perf" entry in main.go check map.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request on May 18, 2026:

…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the inference-perf validator. Without this, aicr validate --phase performance against the GKE recipe is a no-op (validators=0, status=skipped): the validator is in the catalog and the EKS overlay subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe with a real POST /v1/chat/completions probe. Dynamo's frontend returns 200 from /health as soon as the HTTP server is up, well before backend workers register or the model finishes loading. Hitting that window with AIPerf produced an "all requests completed, zero tokens" failure that looked like a benchmark regression. The chat-completion probe only accepts the endpoint once the response carries a non-empty completion, which is the only signal both necessary and sufficient to know AIPerf will produce real numbers. Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s (>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). Both observed values sit comfortably inside the placeholder thresholds carried over from the EKS overlay, so no per-platform tuning is required at this time.

Fixes NVIDIA#937
Summary
Adds an `inference-perf` performance validator that benchmarks Dynamo vLLM inference endpoints with AIPerf (measuring output-token throughput and TTFT p99), plus the pre-built `aiperf-bench` benchmark runner image, build/publish coverage across release + main + local-dev paths, and a user-facing walkthrough at `docs/user/validation.md` covering both training and inference performance validation end-to-end.

Motivation / Context

AICR already validates training performance via NCCL all-reduce bandwidth (`aicr validate --phase performance` with `--intent training --platform kubeflow`). There's no equivalent for inference workloads: users deploying Dynamo-based inference stacks have no automated go/no-go gate for throughput and latency, and no way to catch broken GPU drivers, misconfigured DRA, or CUDA errors that silently degrade inference serving.

Fixes: #448
Related: N/A
Type of Change
Component(s) Affected
* `pkg/validator`
* `docs/`
* `validators/performance/aiperf-bench.Dockerfile`, release workflow (`.github/workflows/on-tag.yaml`), main-push workflow (`.github/workflows/on-push.yaml`), local-dev Makefile, e2e action (`.github/actions/e2e/action.yml`)

Implementation Notes
The check

Deploys a `DynamoGraphDeployment` with a small vLLM-served model (Qwen/Qwen3-0.6B by default) plus an AIPerf benchmark `Job` against the frontend. Measures two metrics from the AIPerf output and evaluates each against a recipe constraint (10% tolerance):

* `inference-throughput`: >= 5000 tok/s
* `inference-ttft-p99`: <= 200 ms

Pipeline lifecycle (mirrors the NCCL performance validator):

1. Deploy `ResourceClaimTemplate` (DRA GPU allocation) + `DynamoGraphDeployment` + optional `KAI Queue` in a per-run namespace `aicr-inference-perf-<8-hex-suffix>`
2. Wait for `DynamoGraphDeployment state=successful` (no polling)
3. Probe the frontend `/health` endpoint
4. Run the AIPerf benchmark Job (`ghcr.io/nvidia/aicr-validators/aiperf-bench`); parse JSON output via sentinel-delimited log markers

Scheduling: DRA-aware single-node pin
Three behaviors cooperate so the check produces a stable per-node baseline and plays well on shared clusters:

* The per-run namespace and inner Job names are suffixed with a hash of `AICR_RUN_ID` (`deriveRunID()`), so two concurrent `aicr validate` invocations never share state, never delete each other's Job, and never wait on the wrong pod. The suffix is 8 hex chars so `<namespace>-<dgd-name>` stays under Grove's 63-char PodClique label limit.
* All worker pods pin to a single `kubernetes.io/hostname`. Without this, the scheduler could spread workers across multiple nodes and the throughput number would become a pool-level average rather than a per-node baseline.
* `countUsedGPUsByNode()` enumerates existing `ResourceClaim`s with `driver == gpu.nvidia.com` and accumulates per-node in-use counts (keyed by allocation `pool` = node name). `pickCandidateWithMostFreeGPUs()` picks the candidate with the most free GPUs (allocatable − in-use), sizes the workload to that count, and fails fast with an actionable message if every candidate is saturated. The `Status.Allocatable["nvidia.com/gpu"]` view does not shrink when DRA devices are allocated to other workloads, so without the DRA-aware count a node that "looks empty" in the device-plugin view can leave the benchmark pending for the full timeout. Fails soft (falls back to allocatable-only sizing) if the cluster's `resource.k8s.io/v1` API is unreachable.

`--node-selector` and `--toleration` from `aicr validate` continue to work: `--node-selector` narrows the candidate pool before DRA sizing, `--toleration` overrides the tolerate-the-node's-taints default. Both affect only the inner benchmark pods, not the validator orchestrator Job.

Correctness guards (three explicit skip paths)
* Guard A: no inference-throughput or inference-ttft-p99 constraint in recipe
* Guard B: `dynamo-platform` absent from `componentRefs`; skip message: `skipped - dynamo-platform not in recipe components`
* Guard C: `dynamo-platform` declared but `DynamoGraphDeployment` CRD not registered on cluster; skip message: `skipped - DynamoGraphDeployment CRD not installed on cluster ...`

Guard C uses `IsNotFound` only for "not installed"; Forbidden / timeout / auth failures surface as errors rather than masking as benign skips (mirrors `isTrainerInstalled`).

Pre-built AIPerf runner image
New `validators/performance/aiperf-bench.Dockerfile` bakes `aiperf` at build time (pins `AIPERF_VERSION`). Runtime pod needs only a single ghcr.io pull, no PyPI dependency (air-gap friendly, removes ~30s install warmup per run).

Build/publish coverage, extended to all paths, not only tagged releases:

* `.github/workflows/on-tag.yaml` (releases): multi-arch manifest, vuln scan, attestation; symmetric with the three existing Go-validator images
* `.github/workflows/on-push.yaml` (main-push): `aiperf-bench` added to the build matrix + `VALIDATOR_PHASES` env; per-arch `:sha-<commit>-<arch>` and `:edge-<arch>` tags; multi-arch manifest combines them into `:sha-<commit>` and `:edge`
* `Makefile`: `image-validators` target builds/pushes `aiperf-bench` alongside the three Go validators; `validate-local` kind loads all four into the Kind cluster (also fixes a pre-existing typo: `validate-local` depended on a non-existent `image-validator` target)
* `.github/actions/e2e/action.yml` (local registry): builds `aiperf-bench` from its own Dockerfile, pushes to the local Kind registry, and includes it in the tags-list verify step

Image-resolution consistency
Exported `catalog.ResolveImage` so the inner AIPerf image reference (held as a Go constant, not a catalog entry) gets the same `:latest` → version pinning and `AICR_VALIDATOR_IMAGE_REGISTRY` override that `catalog.Load` applies to top-level catalog entries. The outer validator's Deployer forwards `AICR_CLI_VERSION` + the registry env var to the Job pod.

Timeouts
* `CheckExecutionTimeout` from 10m → 40m to accommodate the sequential inference pipeline (workload ready 10m + health 5m + AIPerf 15m)
* `InferenceNamespaceTerminationWait = 5m` so back-to-back runs wait for a prior run's namespace deletion (Dynamo finalizers take 2–3 min) instead of racing
* `45m` (must exceed the parent ctx, else the Job's `activeDeadlineSeconds` kills the pod before internal deadlines expire)

Docs
* `docs/user/validation.md`: task-oriented walkthrough covering both training and inference performance validation, all three phases, skip semantics, dry-run, CI/CD integration (with accurate exit-code mapping), and troubleshooting (including the DRA-saturated-node fail-fast path and fallback mode when the DRA API is unreachable)
* `docs/user/index.md`, `docs/user/cli-reference.md` (top of `aicr validate` section), and the sidebar (`site/.vitepress/config.ts`)

Testing
Local

    make qualify  # tests (-race), lint, chainsaw e2e, license headers, sidebar sync

All green. Coverage on `validators/performance/...` stays positive net vs main baseline.

New unit tests cover the pure functions: `parseAIPerfOutput` (sentinel / missing / malformed), `deriveRunID` (hash determinism + uniqueness of random fallback), `countUsedGPUsByNode` (fake clientset; accumulate across claims, driver filter, unallocated claim, empty list), `pickCandidateWithMostFreeGPUs` (selection + tie-break + negative-free clamp), `applyInferenceWorkerScheduling` (worker gets DRA claim, frontend co-locates without claim), `buildAIPerfJob` (pre-built image, no pip install, sentinel framing, per-run jobName), `buildTolerations` (filtering, YAML-special chars), `inferServicePort`, `hasDynamoPlatform`, `isDynamoDeploymentReady`, `nodesMatchingSelector`, `nodeGPUCount`, `resolveAiperfImage`, `catalog.ResolveImage`.

On cluster (EKS H100, `aicr-cuj2`)

Verbatim shell output (snapshot → recipe → validate). This run exercises the DRA-aware picker on a cluster where `dynamo-workload/vllm-gpu-claim` already holds 1 GPU on `ip-10-0-151-148`; the validator correctly auto-picks the other GPU node (`ip-10-0-186-114`) without any hostname override:

Result: `inference-perf` passed with throughput 39,399 tok/s and TTFT p99 138.27 ms. Both constraints met with margin.

Risk Assessment
Changes to `CheckExecutionTimeout` affect all performance-phase runs. Skip semantics (Guards A/B/C) keep the check inert in any environment where Dynamo isn't in scope. Core validator runtime changes are small and covered by unit tests + an on-cluster EKS run.

Rollout notes: No migration required. The `inference-perf` check only activates for recipes generated with `--intent inference --platform dynamo`. Existing training/other-phase validators continue to work unchanged. The `AICR_CLI_VERSION` and `AICR_VALIDATOR_IMAGE_REGISTRY` env vars forwarded by the Deployer are additive (no existing behavior changes when unset). DRA-aware sizing degrades to allocatable-only sizing on clusters where `resource.k8s.io/v1` is unavailable, so there is no functional regression on non-DRA clusters.

Checklist

* `make test` with `-race`
* `make lint`
* `git commit -S`
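For reference, the per-run suffix described under Scheduling (an 8-hex hash of the run ID, with a random fallback) could be derived along these lines. This is a sketch only: SHA-256 is an assumed hash choice, `runSuffix` and `perRunNamespace` are hypothetical names rather than the PR's `deriveRunID()`, and the length check simply applies the 63-character limit the description cites for `<namespace>-<dgd-name>`.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const (
	namespacePrefix = "aicr-inference-perf-" // prefix quoted in the PR description
	maxLabelLen     = 63                     // label limit cited for Grove's PodClique
)

// runSuffix returns an 8-hex-character suffix. A non-empty runID hashes
// deterministically (SHA-256 is an assumption); an empty runID falls back
// to 4 random bytes so unrelated runs still get distinct names.
func runSuffix(runID string) (string, error) {
	if runID != "" {
		sum := sha256.Sum256([]byte(runID))
		return hex.EncodeToString(sum[:4]), nil
	}
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("random run suffix: %w", err)
	}
	return hex.EncodeToString(buf), nil
}

// perRunNamespace builds the per-run namespace name and rejects it when
// "<namespace>-<dgdName>" would exceed the label limit.
func perRunNamespace(runID, dgdName string) (string, error) {
	suffix, err := runSuffix(runID)
	if err != nil {
		return "", err
	}
	ns := namespacePrefix + suffix
	if n := len(ns) + 1 + len(dgdName); n > maxLabelLen {
		return "", fmt.Errorf("%s-%s is %d chars, over the %d-char label limit",
			ns, dgdName, n, maxLabelLen)
	}
	return ns, nil
}

func main() {
	ns, err := perRunNamespace("run-42", "inference-perf")
	fmt.Println(ns, err)
}
```

Keeping the suffix at a fixed 8 characters makes the namespace length predictable (28 characters with this prefix), which is what leaves room for the deployment name under the limit.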