Skip to content

feat(validator): add inference performance validation #641

Merged
mchmarny merged 3 commits into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-validator
Apr 23, 2026
Merged

feat(validator): add inference performance validation #641
mchmarny merged 3 commits into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-validator

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an inference-perf performance validator that benchmarks Dynamo vLLM inference endpoints with AIPerf — measuring output-token throughput and TTFT p99 — plus the pre-built aiperf-bench benchmark runner image, build/publish coverage across release + main + local-dev paths, and a user-facing walkthrough at docs/user/validation.md covering both training and inference performance validation end-to-end.

Motivation / Context

AICR already validates training performance via NCCL all-reduce bandwidth (aicr validate --phase performance with --intent training --platform kubeflow). There's no equivalent for inference workloads — users deploying Dynamo-based inference stacks have no automated go/no-go gate for throughput and latency, and no way to catch broken GPU drivers, misconfigured DRA, or CUDA errors that silently degrade inference serving.

Fixes: #448
Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Build/CI/tooling

Component(s) Affected

  • Validator (pkg/validator)
  • Docs/examples (docs/)
  • Other: benchmark runner image (validators/performance/aiperf-bench.Dockerfile), release workflow (.github/workflows/on-tag.yaml), main-push workflow (.github/workflows/on-push.yaml), local-dev Makefile, e2e action (.github/actions/e2e/action.yml)

Implementation Notes

The check

Deploys a DynamoGraphDeployment with a small vLLM-served model (Qwen/Qwen3-0.6B by default) plus an AIPerf benchmark Job against the frontend. Measures two metrics from the AIPerf output and evaluates each against a recipe constraint (10% tolerance):

Metric Constraint name in recipe Default (placeholder)
Output-token throughput inference-throughput >= 5000 tok/s
Time-to-first-token p99 inference-ttft-p99 <= 200 ms

Pipeline lifecycle (mirrors the NCCL performance validator):

  1. Deploy ResourceClaimTemplate (DRA GPU allocation) + DynamoGraphDeployment + optional KAI Queue in a per-run namespace aicr-inference-perf-<8-hex-suffix>
  2. Watch-based wait for DynamoGraphDeployment state=successful (no polling)
  3. Probe frontend /health endpoint
  4. Create AIPerf benchmark Job from a pre-built image (ghcr.io/nvidia/aicr-validators/aiperf-bench); parse JSON output via sentinel-delimited log markers
  5. Compare throughput + TTFT p99 against recipe constraints
  6. Synchronous cleanup (namespace → cascade; AIPerf Job → watch-for-deletion)

Scheduling — DRA-aware single-node pin

Three behaviors cooperate so the check produces a stable per-node baseline and plays well on shared clusters:

  • Per-run isolation. Both the benchmark namespace and the AIPerf Job name are suffixed with a short hash of AICR_RUN_ID (deriveRunID()) — two concurrent aicr validate invocations never share state, never delete each other's Job, never wait on the wrong pod. Suffix is 8 hex chars so <namespace>-<dgd-name> stays under Grove's 63-char PodClique label limit.
  • Single-node hostname pin. All Dynamo Frontend + worker pods pin to one specific GPU node via kubernetes.io/hostname. Without this, the scheduler could spread workers across multiple nodes and the throughput number would become a pool-level average rather than a per-node baseline.
  • DRA-aware node selection. countUsedGPUsByNode() enumerates existing ResourceClaims with driver == gpu.nvidia.com and accumulates per-node in-use counts (keyed by allocation pool = node name). pickCandidateWithMostFreeGPUs() picks the candidate with the most free GPUs (allocatable − in-use), sizes the workload to that count, and fails fast with an actionable message if every candidate is saturated. The Status.Allocatable["nvidia.com/gpu"] view does not shrink when DRA devices are allocated to other workloads — without this, a node that "looks empty" in the device-plugin view can leave the benchmark pending for the full timeout. Fails soft (falls back to allocatable-only sizing) if the cluster's resource.k8s.io/v1 API is unreachable.

--node-selector and --toleration from aicr validate continue to work: --node-selector narrows the candidate pool before DRA sizing, --toleration overrides the tolerate-the-node's-taints default. Both affect only the inner benchmark pods, not the validator orchestrator Job.

Correctness guards (three explicit skip paths)

Guard Fires when Skip reason
A Recipe lists the check but no matching constraints no inference-throughput or inference-ttft-p99 constraint in recipe
B Check selected but dynamo-platform absent from componentRefs skipped - dynamo-platform not in recipe components
C dynamo-platform declared but DynamoGraphDeployment CRD not registered on cluster skipped - DynamoGraphDeployment CRD not installed on cluster ...

Guard C uses IsNotFound only for "not installed" — Forbidden / timeout / auth failures surface as errors rather than masking as benign skips (mirrors isTrainerInstalled).

Pre-built AIPerf runner image

New validators/performance/aiperf-bench.Dockerfile bakes aiperf at build time (pins AIPERF_VERSION). Runtime pod needs only a single ghcr.io pull, no PyPI dependency (air-gap friendly, removes ~30s install warmup per run).

Build/publish coverage — extended to all paths, not only tagged releases:

  • .github/workflows/on-tag.yaml (releases): multi-arch manifest, vuln scan, attestation — symmetric with the three existing Go-validator images
  • .github/workflows/on-push.yaml (main-push): aiperf-bench added to the build matrix + VALIDATOR_PHASES env; per-arch :sha-<commit>-<arch> and :edge-<arch> tags; multi-arch manifest combines them into :sha-<commit> and :edge
  • Makefile: image-validators target builds/pushes aiperf-bench alongside the three Go validators; validate-local kind loads all four into the Kind cluster (also fixes a pre-existing typo — validate-local depended on a non-existent image-validator target)
  • .github/actions/e2e/action.yml (local registry): builds aiperf-bench from its own Dockerfile, pushes to the local Kind registry, and includes it in the tags-list verify step

Image-resolution consistency

Exported catalog.ResolveImage so the inner AIPerf image reference (held as a Go constant, not a catalog entry) gets the same :latest→version pinning and AICR_VALIDATOR_IMAGE_REGISTRY override that catalog.Load applies to top-level catalog entries. The outer validator's Deployer forwards AICR_CLI_VERSION + the registry env var to the Job pod.

Timeouts

  • Raised CheckExecutionTimeout from 10m → 40m to accommodate the sequential inference pipeline (workload ready 10m + health 5m + AIPerf 15m)
  • New InferenceNamespaceTerminationWait = 5m so back-to-back runs wait for a prior run's namespace deletion (Dynamo finalizers take 2–3 min) instead of racing
  • Catalog entry timeout bumped to 45m (must exceed the parent ctx, else the Job's activeDeadlineSeconds kills the pod before internal deadlines expire)

Docs

  • New docs/user/validation.md — task-oriented walkthrough covering both training and inference performance validation, all three phases, skip semantics, dry-run, CI/CD integration (with accurate exit-code mapping), and troubleshooting (including the DRA-saturated-node fail-fast path and fallback mode when the DRA API is unreachable)
  • Cross-linked from docs/user/index.md, docs/user/cli-reference.md (top of aicr validate section), and the sidebar (site/.vitepress/config.ts)
  • Training performance section carries a placeholder note for per-platform expected-bandwidth numbers (to be filled once reference benchmarks are published)

Testing

Local

make qualify    # tests (-race), lint, chainsaw e2e, license headers, sidebar sync

All green. Coverage on validators/performance/... stays positive net vs main baseline.

New unit tests cover the pure functions: parseAIPerfOutput (sentinel / missing / malformed), deriveRunID (hash determinism + uniqueness of random fallback), countUsedGPUsByNode (fake clientset; accumulate across claims, driver filter, unallocated claim, empty list), pickCandidateWithMostFreeGPUs (selection + tie-break + negative-free clamp), applyInferenceWorkerScheduling (worker gets DRA claim, frontend co-locates without claim), buildAIPerfJob (pre-built image, no pip install, sentinel framing, per-run jobName), buildTolerations (filtering, YAML-special chars), inferServicePort, hasDynamoPlatform, isDynamoDeploymentReady, nodesMatchingSelector, nodeGPUCount, resolveAiperfImage, catalog.ResolveImage.

On cluster (EKS H100, aicr-cuj2)

Verbatim shell output — snapshot → recipe → validate. This run exercises the DRA-aware picker on a cluster where dynamo-workload/vllm-gpu-claim already holds 1 GPU on ip-10-0-151-148; the validator correctly auto-picks the other GPU node (ip-10-0-186-114) without any hostname override:

$ /tmp/aicr/bin/aicr-ae2729ed snapshot --output /tmp/aicr/cuj2/snapshot.yaml

$ /tmp/aicr/bin/aicr-ae2729ed recipe \
    --service eks --accelerator h100 --os ubuntu \
    --intent inference --platform dynamo \
    --output /tmp/aicr/cuj2/recipe.yaml
[cli] recipe generation completed: output=/tmp/aicr/cuj2/recipe.yaml components=16 overlays=7

$ AICR_VALIDATOR_IMAGE_REGISTRY=ghcr.io/yuanchen8911 \
    /tmp/aicr/bin/aicr-ae2729ed validate \
      --recipe /tmp/aicr/cuj2/recipe.yaml \
      --snapshot /tmp/aicr/cuj2/snapshot.yaml \
      --node-selector nodeGroup=gpu-worker \
      --toleration dedicated=worker-workload:NoSchedule \
      --toleration dedicated=worker-workload:NoExecute \
      --phase performance
[cli] readiness pre-flight: constraints=4 ... all passed
[cli] running validation phase: phase=performance catalog=2 selected=1
[cli] running validator: name=inference-perf phase=performance
  Found GPU nodes count=2
  Filtered GPU nodes to match --node-selector selector=map[nodeGroup:gpu-worker] matched=2
  --node-selector narrowed candidate pool; workers pinned to single node
    via kubernetes.io/hostname node=ip-10-0-186-114.ec2.internal freeGPUs=8
  Deploying benchmark workload gpus=8 namespace=aicr-inference-perf-6b2bd21e
  Applied ResourceClaimTemplate name=aicr-inference-gpu-claim
  Applied DynamoGraphDeployment name=aicr-inference-perf gpuWorkers=8
  Waiting for DynamoGraphDeployment to become ready...
  DynamoGraphDeployment is ready                                          (~2 min)
  Using inference endpoint http://aicr-inference-perf-frontend...svc:8000 concurrency=128
  Inference endpoint is healthy
  Running AIPerf benchmark model=Qwen/Qwen3-0.6B concurrency=128 requests=200 job=aicr-aiperf-6b2bd21e
  pod aicr-aiperf-6b2bd21e-wgvr2: Pending → Running → Succeeded           (~12s)
  Inference benchmark results throughput_tok/s=39399.24 ttft_p99_ms=138.27
  Cleaning up inference benchmark workload...
  Deleted DynamoGraphDeployment
  Deleted namespace aicr-inference-perf-6b2bd21e
[cli] validator completed: name=inference-perf status=passed
[cli] phase completed: phase=performance status=passed duration=2m11.224776542s

CTRF stdout tail:
  "Inference throughput: 39399.24 tokens/sec"
  "Inference TTFT p99: 138.27 ms"
  "Throughput constraint: >= 5000 → PASS"
  "TTFT p99 constraint: <= 200 → PASS"

Result: inference-perf passed — throughput 39,399 tok/s, TTFT p99 138.27 ms. Both constraints met with margin.

The AICR_VALIDATOR_IMAGE_REGISTRY=ghcr.io/yuanchen8911 override is only used here because this PR is pre-merge testing from the fork. After merge, the aiperf-bench image is published by on-push.yaml to ghcr.io/nvidia/aicr-validators/aiperf-bench:edge (and :sha-<commit>) on every main push — no env-var override needed for main/edge consumers.

Risk Assessment

  • Medium — New validator binary path + new release image + bumped CheckExecutionTimeout affect all performance-phase runs. Skip semantics (Guards A/B/C) keep the check inert in any environment where Dynamo isn't in scope. Core validator runtime changes are small and covered by unit tests + an on-cluster EKS run.

Rollout notes: No migration required. The inference-perf check only activates for recipes generated with --intent inference --platform dynamo. Existing training/other-phase validators continue to work unchanged. The AICR_CLI_VERSION and AICR_VALIDATOR_IMAGE_REGISTRY env vars forwarded by the Deployer are additive (no existing behavior changes when unset). DRA-aware sizing degrades to allocatable-only sizing on clusters where resource.k8s.io/v1 is unavailable — no functional regression on non-DRA clusters.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 changed the title WIP: feat(validator): inference performance validation with AIPerf WIP: feat(validator): add inference performance validation Apr 22, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch from 7a37f97 to 6a78393 Compare April 22, 2026 18:03
@yuanchen8911

yuanchen8911 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

CI failures (Tier 1/Tier 2 KWOK jobs) are a pre-existing regression from #603 — unrelated to this PR. Filed #643 to fix the bundler --set regex so Helm-style array indexing (tolerations[2].key=...) is accepted again. Once that lands, the KWOK tests here should go green after a rebase.

@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch from 6a78393 to f5e21d4 Compare April 22, 2026 18:28
@yuanchen8911 yuanchen8911 changed the title WIP: feat(validator): add inference performance validation feat(validator): inference performance validation with AIPerf Apr 22, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch 5 times, most recently from a1a7ff4 to c34b52c Compare April 22, 2026 22:22
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch 5 times, most recently from 29053a5 to ae2729e Compare April 23, 2026 00:29
@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 23, 2026 00:34
@yuanchen8911 yuanchen8911 requested review from a team as code owners April 23, 2026 00:34
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@yuanchen8911 yuanchen8911 requested a review from njhensley April 23, 2026 02:18
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch 4 times, most recently from fafd279 to accd9ca Compare April 23, 2026 02:47
@yuanchen8911 yuanchen8911 changed the title feat(validator): inference performance validation with AIPerf feat(validator): add inference performance validation Apr 23, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch 2 times, most recently from 6e8693b to aaa6218 Compare April 23, 2026 04:32
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch 2 times, most recently from f052703 to 84e7443 Compare April 23, 2026 05:26
Add inference-perf performance validator that benchmarks Dynamo vLLM
inference endpoints using AIPerf and evaluates two metrics against
recipe constraints:

  * inference-throughput (output tokens/sec, default >= 5000)
  * inference-ttft-p99   (time-to-first-token p99 in ms, default <= 200)

Pipeline: deploy ResourceClaimTemplate + DynamoGraphDeployment into a
per-run namespace -> wait for DynamoGraphDeployment state=successful
-> probe frontend /health -> run AIPerf benchmark Job -> parse
sentinel-delimited JSON -> compare against recipe constraints ->
synchronous cleanup.

Scheduling is DRA-aware: the validator enumerates existing gpu.nvidia.com
ResourceClaim allocations per node and picks the candidate with the most
free GPUs, sizes the workload to that count, and fails fast with an
actionable message when every candidate is saturated. All worker pods
pin to a single kubernetes.io/hostname for a stable per-node baseline.
Per-run namespace and inner Job names are suffixed with an 8-hex hash
of AICR_RUN_ID so concurrent validate invocations cannot collide.

Three skip guards keep the check inert in environments where it cannot
succeed: missing recipe constraints, dynamo-platform absent from
componentRefs, and DynamoGraphDeployment CRD not installed on the
cluster. Guard C uses IsNotFound only for "not installed" so forbidden
/ timeout / auth failures surface as errors rather than benign skips.

A pre-built aiperf-bench image (Python, non-root user, aiperf pinned
at build time) ships from the release workflow and the main-push
workflow, and is built alongside the three Go validators in the
Makefile and e2e action so inference-perf works on main/edge and
local-dev paths, not only on tagged releases.

Validated on EKS H100 (aicr-cuj2): 39,399 tokens/sec, TTFT p99 138 ms,
DRA-aware picker correctly auto-selects the free node on a cluster
where another workload already holds 1 GPU via DRA on the other
candidate.
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-validator branch from 84e7443 to 01a2025 Compare April 23, 2026 06:33
@mchmarny mchmarny added this to the v0.12 milestone Apr 23, 2026
@mchmarny mchmarny enabled auto-merge (squash) April 23, 2026 12:29
@mchmarny mchmarny merged commit 3a86364 into NVIDIA:main Apr 23, 2026
65 of 66 checks passed
njhensley added a commit to njhensley/aicr that referenced this pull request Apr 23, 2026
…eGVR

inference_perf_constraint.go and nccl_all_reduce_bw_constraint.go each
declared their own package-level resourceClaimTemplateGVR. After both
landed on this branch (inference-perf via NVIDIA#641 merge), the package no
longer compiles — duplicate top-level name.

Move the GVR to a new dra_gvr.go shared between both callers, since the
definition is identical (resource.k8s.io/v1 resourceclaimtemplates) and
any future validator that inspects DRA RCTs will want the same value.

Drive-by: fix gofmt alignment of the "inference-perf" entry in main.go
check map.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 18, 2026
…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the
   inference-perf validator. Without this, aicr validate --phase
   performance against the GKE recipe is a no-op (validators=0,
   status=skipped) — the validator is in the catalog, the EKS overlay
   subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe
   with a real POST /v1/chat/completions probe. Dynamo's frontend
   returns 200 from /health as soon as the HTTP server is up — well
   before backend workers register or the model finishes loading.
   Hitting that window with AIPerf produced an "all requests completed,
   zero tokens" failure that looked like a benchmark regression. The
   chat-completion probe only accepts the endpoint once the response
   carries a non-empty completion, which is the only signal both
   necessary and sufficient to know AIPerf will produce real numbers.
   Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against
GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s
(>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder
thresholds carried over from the EKS overlay sit comfortably inside
both observed values, so no per-platform tuning is required at this
time.

Fixes NVIDIA#937
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 18, 2026
…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the
   inference-perf validator. Without this, aicr validate --phase
   performance against the GKE recipe is a no-op (validators=0,
   status=skipped) — the validator is in the catalog, the EKS overlay
   subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe
   with a real POST /v1/chat/completions probe. Dynamo's frontend
   returns 200 from /health as soon as the HTTP server is up — well
   before backend workers register or the model finishes loading.
   Hitting that window with AIPerf produced an "all requests completed,
   zero tokens" failure that looked like a benchmark regression. The
   chat-completion probe only accepts the endpoint once the response
   carries a non-empty completion, which is the only signal both
   necessary and sufficient to know AIPerf will produce real numbers.
   Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against
GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s
(>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder
thresholds carried over from the EKS overlay sit comfortably inside
both observed values, so no per-platform tuning is required at this
time.

Fixes NVIDIA#937
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 18, 2026
…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the
   inference-perf validator. Without this, aicr validate --phase
   performance against the GKE recipe is a no-op (validators=0,
   status=skipped) — the validator is in the catalog, the EKS overlay
   subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe
   with a real POST /v1/chat/completions probe. Dynamo's frontend
   returns 200 from /health as soon as the HTTP server is up — well
   before backend workers register or the model finishes loading.
   Hitting that window with AIPerf produced an "all requests completed,
   zero tokens" failure that looked like a benchmark regression. The
   chat-completion probe only accepts the endpoint once the response
   carries a non-empty completion, which is the only signal both
   necessary and sufficient to know AIPerf will produce real numbers.
   Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against
GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s
(>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder
thresholds carried over from the EKS overlay sit comfortably inside
both observed values, so no per-platform tuning is required at this
time.

Fixes NVIDIA#937
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 18, 2026
…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the
   inference-perf validator. Without this, aicr validate --phase
   performance against the GKE recipe is a no-op (validators=0,
   status=skipped) — the validator is in the catalog, the EKS overlay
   subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe
   with a real POST /v1/chat/completions probe. Dynamo's frontend
   returns 200 from /health as soon as the HTTP server is up — well
   before backend workers register or the model finishes loading.
   Hitting that window with AIPerf produced an "all requests completed,
   zero tokens" failure that looked like a benchmark regression. The
   chat-completion probe only accepts the endpoint once the response
   carries a non-empty completion, which is the only signal both
   necessary and sufficient to know AIPerf will produce real numbers.
   Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against
GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s
(>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder
thresholds carried over from the EKS overlay sit comfortably inside
both observed values, so no per-platform tuning is required at this
time.

Fixes NVIDIA#937
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 19, 2026
…ld-start readiness probe

Two related fixes:

1. Wire the GKE H100 COS Dynamo overlay's performance phase to the
   inference-perf validator. Without this, aicr validate --phase
   performance against the GKE recipe is a no-op (validators=0,
   status=skipped) — the validator is in the catalog, the EKS overlay
   subscribes (NVIDIA#641), but the GKE sibling was never extended.

2. Replace the inference-perf validator's GET /health readiness probe
   with a real POST /v1/chat/completions probe. Dynamo's frontend
   returns 200 from /health as soon as the HTTP server is up — well
   before backend workers register or the model finishes loading.
   Hitting that window with AIPerf produced an "all requests completed,
   zero tokens" failure that looked like a benchmark regression. The
   chat-completion probe only accepts the endpoint once the response
   carries a non-empty completion, which is the only signal both
   necessary and sufficient to know AIPerf will produce real numbers.
   Affects both EKS and GKE Dynamo overlays (single shared codepath).

Validation: ran end-to-end aicr validate --phase performance against
GKE H100 COS Dynamo with the fix in place. Throughput 33,982 tok/s
(>= 5000 PASS); TTFT p99 119.79 ms (<= 200 PASS). The placeholder
thresholds carried over from the EKS overlay sit comfortably inside
both observed values, so no per-platform tuning is required at this
time.

Fixes NVIDIA#937
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add Dynamo inference performance validation

2 participants