Feature Summary
Add an inference-throughput performance validator that benchmarks Dynamo vLLM inference endpoints using AIPerf, complementing the existing nccl-all-reduce-bw training performance validator.
Problem/Use Case
AICR currently supports training performance validation via NCCL bandwidth tests (aicr validate --phase performance), but has no equivalent for inference workloads. Users deploying Dynamo-based inference stacks cannot validate that their inference endpoints meet throughput and latency requirements as part of the AICR validation pipeline.
Without inference performance validation:
- Broken GPU drivers, misconfigured DRA, or CUDA errors in inference deployments go undetected
- No automated go/no-go gate for inference stack readiness
- No baseline performance numbers for comparison across deployments
Proposed Solution
Add an inference-throughput check to the performance phase that:
- Discovers or deploys an inference workload:
  - If a Dynamo frontend service is already running, benchmarks against it (scoped to DynamoGraphDeployment namespaces to avoid benchmarking the wrong service on shared clusters)
  - If no endpoint exists, auto-deploys a DynamoGraphDeployment with Qwen/Qwen3-0.6B (1 worker per GPU, single node), benchmarks, then cleans up
- Runs AIPerf as a K8s Job with dynamic concurrency (16 × worker_count), measuring:
  - Output token throughput (tokens/sec)
  - Time to first token p99 (ms)
- Evaluates constraints from the recipe overlay (with 10% tolerance):
validation:
  performance:
    checks:
      - inference-throughput
    constraints:
      - name: inference-throughput
        value: ">= 5000"
      - name: inference-ttft-p99
        value: "<= 200"
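As a rough illustration of how a constraint string such as ">= 5000" could be checked with the 10% tolerance, here is a minimal Go sketch. The function name `evaluate` and the way the tolerance is applied (relaxing the threshold in the direction that favors a pass) are assumptions for illustration, not the validator's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tolerance is the 10% slack mentioned in the proposal. How AICR applies it
// is an assumption here: the threshold is relaxed in the passing direction.
const tolerance = 0.10

// evaluate checks a measured value against a constraint expression such as
// ">= 5000" or "<= 200". Only these two operators are handled in this sketch.
func evaluate(expr string, measured float64) (bool, error) {
	fields := strings.Fields(expr)
	if len(fields) != 2 {
		return false, fmt.Errorf("malformed constraint %q", expr)
	}
	limit, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", expr, err)
	}
	switch fields[0] {
	case ">=":
		// A lower bound passes down to 90% of the stated limit.
		return measured >= limit*(1-tolerance), nil
	case "<=":
		// An upper bound passes up to 110% of the stated limit.
		return measured <= limit*(1+tolerance), nil
	default:
		return false, fmt.Errorf("unsupported operator %q", fields[0])
	}
}

func main() {
	// Measurements from the 1 GPU auto-deploy run reported in this issue.
	ok, _ := evaluate(">= 5000", 5667)
	fmt.Println("inference-throughput:", ok)
	ok, _ = evaluate("<= 200", 84)
	fmt.Println("inference-ttft-p99:", ok)
}
```

Under this reading, a throughput of 4,600 tok/s would still pass the ">= 5000" constraint, while 4,400 tok/s would fail. If the real tolerance is meant to be stricter or one-sided, only the two comparison lines change.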
Tested on EKS (H100, Qwen/Qwen3-0.6B)
| Scenario | Workers | Throughput (tok/s) | TTFT p99 (ms) | Result |
| --- | --- | --- | --- | --- |
| 1 GPU, auto-deploy | 1 | 5,667 | 84 | PASS |
| 1 GPU, existing workload | 1 | 6,039 | 58 | PASS |
| 8 GPUs, auto-deploy (single node) | 8 | 37,961 | 146 | PASS |
| 16 GPUs, auto-deploy (2 nodes) | 16 | 74,927 | 120 | PASS |
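To compare the auto-deploy rows more directly, the following Go sketch computes the AIPerf concurrency implied by the 16 × worker_count rule, along with speedup and scaling efficiency relative to the 1 GPU auto-deploy run. The type and function names (`run`, `concurrency`, `speedup`, `efficiency`) are hypothetical helpers and are not part of the validator.

```go
package main

import "fmt"

// run holds one auto-deploy row from the results table.
type run struct {
	name       string
	workers    int
	throughput float64 // output tokens/sec
}

// concurrency mirrors the "16 × worker_count" rule from the proposal.
func concurrency(workers int) int { return 16 * workers }

// speedup returns throughput relative to the baseline run.
func speedup(base, r run) float64 { return r.throughput / base.throughput }

// efficiency is the measured speedup divided by the ideal (linear) speedup.
func efficiency(base, r run) float64 {
	ideal := float64(r.workers) / float64(base.workers)
	return speedup(base, r) / ideal
}

func main() {
	base := run{"1 GPU", 1, 5667}
	runs := []run{base, {"8 GPUs", 8, 37961}, {"16 GPUs", 16, 74927}}
	for _, r := range runs {
		fmt.Printf("%-8s concurrency=%-4d speedup=%.2fx efficiency=%.0f%%\n",
			r.name, concurrency(r.workers), speedup(base, r), 100*efficiency(base, r))
	}
}
```

The existing-workload row is left out because it measures a different deployment path, so it is not a like-for-like baseline for the auto-deploy runs.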
Success Criteria
- aicr validate --phase performance runs the inference-throughput check for inference+dynamo recipes
- Auto-deploy path creates workload, benchmarks, cleans up (idempotent, handles partial failures)
- Existing workload path discovers and benchmarks scoped to the correct service
- Default constraints (>= 5000 tok/s, <= 200ms TTFT p99) catch broken deployments
- Near-linear GPU scaling observed (8.1x with 8 GPUs reported; the EKS auto-deploy rows above give about 6.7x, 37,961 vs 5,667 tok/s)
Alternatives Considered
- inference-perf (separate load generator) — similar to AIPerf but less integrated with Dynamo ecosystem
- Manual benchmarking — not automated, not part of validation pipeline
- Larger model (Llama-3.1-8B) — considered for more representative benchmarks, but Qwen3-0.6B is preferred for smoke testing (fast model load, small image, matches Dynamo deploy template defaults)
Implementation
Branch: feat/inference-perf-validator (yuan fork)
Files: 8 changed, +1043 lines
- validators/performance/inference_throughput.go: CheckFunc
- validators/performance/inference_throughput_constraint.go: Core pipeline
- validators/performance/testdata/inference/{dynamo-deployment,queue}.yaml: Templates
- recipes/validators/catalog.yaml: Catalog entry
- recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml: Overlay constraints
- pkg/defaults/timeouts.go: Timeout constants
- validators/performance/main.go: Registration