feat(recipe): add AKS H100 Dynamo perf check by yuanchen8911 · Pull Request #1232 · NVIDIA/aicr

yuanchen8911 · 2026-06-08T21:27:38Z

Summary

Adds a Dynamo inference performance phase to the H100 AKS Ubuntu inference overlay, mirroring the current H100 EKS/GKE inference-perf smoke gate, and documents a known limitation of the existing performance gates: the inference throughput and NCCL bus-bandwidth floors are fixed absolute full-node values calibrated on 8-GPU H100 nodes, not normalized for GPU count / fabric class.

⚠️ Known limitation — the current performance gates assume a full 8-GPU H100 node. The inference-throughput >= 50000 floor (and the analogous NCCL nccl-all-reduce-bw busbw floors) was calibrated on a full 8-GPU H100 node (GB200 on a 4-GPU node) and is not normalized for GPU count or fabric class. Because the recipes are SKU-agnostic (service + accelerator only), the same recipe lands on smaller / different-fabric H100 SKUs (p5.4xlarge, a3-highgpu-1g/2g/4g, AKS NC80adis/NC40ads) where a healthy run can false-fail. This PR documents the 8-GPU happy path and the workaround (TTFT-only on smaller SKUs); the behavioral fixes are tracked in #1254 (inference, per-GPU normalization) and #1256 (NCCL, fabric/transport-class floors).

Motivation / Context

The AKS Dynamo inference leaf was missing the recommended performance phase. This adds the same inference throughput and TTFT goals used by the current H100 EKS/GKE Dynamo overlays.

While adding the AKS gate it became clear the shared >= 50000 throughput floor (and the analogous NCCL busbw floors) is a full-node value: on a smaller H100 SKU — which AKS commonly is (NC80adis = 2 GPUs, NC40ads = 1 GPU), and which exists on every provider (p5.4xlarge, a3-highgpu-1g/2g/4g) — it can false-fail a healthy run. Rather than ship that silently, this PR adds explicit NOTE comments to the affected overlays and a node-shape-assumption note to the validation docs, cross-referencing the normalization follow-ups.

Fixes: #1006
Related: #1254 (inference throughput per-GPU normalization), #1256 (NCCL fabric/transport-class floors)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: ____________

Implementation Notes

Adds performance.checks: [inference-perf] to h100-aks-ubuntu-inference-dynamo with the same model, concurrency, throughput, and TTFT goals as the current H100 EKS/GKE Dynamo overlays: Qwen/Qwen3-8B, 256 concurrency per GPU, >= 50000 throughput, <= 2000 TTFT p99.
Adds a resolved-recipe regression test so the AKS Dynamo leaf follows the inference pattern and does not pick up training NCCL checks.
Documents the 8-GPU ND H100 supported path for AKS: docs/integrator/aks-gpu-setup.md now uses Standard_ND96isr_H100_v5 (8-GPU ND H100 v5) for the example GPU nodepool instead of the 2-GPU Standard_NC80adis_H100_v5, with a note that the performance gates are full-node floors and that smaller NCads SKUs should gate on inference-ttft-p99 only until inference-perf throughput floor is a fixed absolute full-node value applied to SKU-agnostic recipes → false-fails smaller node shapes #1254. The AKS overlay NOTE is framed the same way (assumes 8-GPU ND H100; NCads SKUs drop the throughput constraint / TTFT-only until inference-perf throughput floor is a fixed absolute full-node value applied to SKU-agnostic recipes → false-fails smaller node shapes #1254).
Full-node-assumption notes (no behavior change):
- docs/user/validation.md — adds a "Node-shape assumption" note to both the training (NCCL) and inference performance sections, explaining that the floors are fixed absolute full-node values and which SKUs can false-fail, linking inference-perf throughput floor is a fixed absolute full-node value applied to SKU-agnostic recipes → false-fails smaller node shapes #1254 / nccl-all-reduce-bw training gate is a fixed absolute fabric-specific busbw value applied to SKU-agnostic recipes → false-fails EKS/H100 small SKUs #1256. (Uses the file's bold-label-paragraph convention, matching **Warm-up:** / **Determinism:**.)
- Inference Dynamo overlays (h100-aks, h100-eks, h100-gke, gb200-eks, rtx-pro-6000-eks) — adds a NOTE on the throughput floor. Corrects the EKS/GKE comments that implied the evaluator scales across node sizes (it only down-scales for partial occupancy, not across node shapes).
- Active H100 training overlays (h100-eks-training, h100-gke-cos-training) — adds a NOTE that the busbw floor is fabric-class dependent, not GPU-count scaled.
inference-ttft-p99 is called out as a per-request latency at fixed concurrency-per-GPU that does not need GPU-count normalization.

Scope note: the AKS recipe criteria remain generic (service: aks, accelerator: h100), so a user on NC40ads/NC80adis can still generate/validate this recipe and hit the throughput false-fail. This PR documents the 8-GPU ND H100 happy path and the NCads workaround (TTFT-only); the behavioral fix (per-GPU normalization) is tracked in #1254 and intentionally out of scope here.

Testing

go test ./pkg/recipe -count=1
env AICR_VALIDATION_FLOOR_STRICT=1 go test ./pkg/recipe -run 'TestOverlayValidationPhaseFloor/h100-aks-ubuntu-inference-dynamo$|TestAKSDynamoInferencePerformanceGoalsFollowPattern' -count=1
yamllint <changed overlays>          # clean
golangci-lint run -c .golangci.yaml ./pkg/recipe/...
unset GITLAB_TOKEN && make qualify

The added overlay/doc changes are comments and prose only (no semantic recipe change), so resolution is unchanged — TestAKSDynamoInferencePerformanceGoalsFollowPattern and TestOverlayValidationPhaseFloor still pass, and yamllint is clean on all changed overlays. make qualify completes the full gate; the vulnerability scan reports the existing k8s.io/kubernetes advisories and does not fail the gate.

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: The AKS goals are provisional smoke-floor values mirrored from the current H100 EKS/GKE Dynamo overlays; recalibrate once Azure AKS H100 testbed data is available. The full-node-assumption notes are documentation only — the actual normalization fixes are tracked separately in #1254 (inference) and #1256 (NCCL).

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

coderabbitai · 2026-06-08T21:30:29Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a performance validation block to recipes/overlays/h100-aks-ubuntu-inference-dynamo.yaml that declares an inference-perf check with four hard constraints (model, per-GPU concurrency, throughput, TTFT p99). Adds a Go unit test (pkg/recipe/performance_goals_aks_test.go) that loads the metadata store, resolves the overlay, and asserts the presence and exact expected performance check and constraint key/value pairs. Also adds clarifying NOTE comments across multiple overlays and updates the AKS integrator doc to recommend an 8-GPU H100 SKU.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

inference-perf throughput floor is a fixed absolute full-node value applied to SKU-agnostic recipes → false-fails smaller node shapes #1254: The PR introduces an absolute inference-throughput floor (>=50000) and node-shape notes matching the concerns in this issue.

Possibly related PRs

NVIDIA/aicr#1196: Related tuning and validation of the inference-perf gate including TTFT p99 constraints.

Suggested labels

enhancement, area/tests

Suggested reviewers

mchmarny
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description comprehensively documents the changes, motivations, and known limitations. It clearly explains what is being added (Dynamo inference performance phase to H100 AKS overlay), why (addressing issue `#1006` and documenting full-node assumption of performance gates), and what the implementation includes (performance goals, test, documentation updates).
Title check	✅ Passed	The title 'feat(recipe): add AKS H100 Dynamo perf check' accurately describes the primary change: adding performance validation to the AKS H100 Dynamo inference overlay, though it omits that the change also includes documentation updates across multiple overlays.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/performance_goals_aks_test.go`:
- Around line 24-25: Replace the use of context.Background() with a bounded
context when calling loadMetadataStore and other metadata/build operations:
create ctx, cancel := context.WithTimeout(context.Background(),
defaults.FileReadTimeout) and defer cancel() before calling
loadMetadataStore(ctx) and BuildRecipeResult(ctx, ...); update imports to
include context and defaults as needed so DataProvider operations have a
deadline/cancel path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0d407169-2032-42eb-b9c1-d5198765b3e6

📥 Commits

Reviewing files that changed from the base of the PR and between 7c5ccc4 and c19167f.

📒 Files selected for processing (10)

docs/integrator/aks-gpu-setup.md
docs/user/validation.md
pkg/recipe/performance_goals_aks_test.go
recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/h100-aks-ubuntu-inference-dynamo.yaml
recipes/overlays/h100-eks-training.yaml
recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/h100-gke-cos-inference-dynamo.yaml
recipes/overlays/h100-gke-cos-training.yaml
recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml

xdu31 · 2026-06-09T19:18:44Z

Cross-Review

Reviewers: Claude Code + Code Review Agent + Integration Analysis Agent

Confirmed Issues

None.

Open Questions

No questions — the NOTE comments are accurate, the AKS test correctly mirrors the AKS peer pattern, floor test coverage is intact, and the documentation accurately frames the known limitation.

Cross-review by Claude Code

njhensley · 2026-06-09T19:50:06Z

Review

What it does: Adds a performance: [inference-perf] block to h100-aks-ubuntu-inference-dynamo (mirroring the EKS/GKE Dynamo gates), adds a resolved-recipe regression test, and documents — via prose in docs/ and NOTE comments across 6 overlays — that the throughput/NCCL floors are fixed full-node values that can false-fail on smaller H100 SKUs.

Assessment: solid, low-risk. 👍

Strengths

The only semantic change is the one new AKS perf block; everything else is comments + docs. The test asserts both the exact checks and the full constraint set (count + per-item), which correctly locks in "AKS follows the inference pattern, doesn't inherit training NCCL checks."
The NOTE comments correctly distinguish the evaluator's partial-occupancy down-scaling from cross-node-shape normalization (which it does not do) — an accurate description of the gap and a useful correction to the prior EKS/GKE comments.
Docs follow the file's bold-label-paragraph convention as claimed.

Minor nits (non-blocking)

The >= 50000 floor is reused verbatim for AKS with no Azure baseline. Acknowledged as provisional in the PR description — fine, but the value is now load-bearing on a third provider with zero calibration data behind it.
The new test uses context.WithTimeout(..., defaults.FileReadTimeout), while the sibling OKE test in feat(recipe): add OKE GB200 perf check #1233 uses bare context.Background(). Trivial inconsistency between the two PRs; worth aligning on one.

yuanchen8911 · 2026-06-09T21:34:24Z

@njhensley thanks for the careful read. Addressing the two nits:

On >= 50000: keeping it provisional in this PR by design so AKS mirrors the existing H100 EKS/GKE smoke gate. The PR description, AKS overlay NOTE, and AKS setup doc call out that this is an 8-GPU full-node assumption, that NCads SKUs should use TTFT-only, and that inference-perf throughput floor is a fixed absolute full-node value applied to SKU-agnostic recipes → false-fails smaller node shapes #1254 is the behavioral follow-up for per-GPU normalization.
On the context style: leaving the AKS test with the bounded defaults.FileReadTimeout context because it directly addresses the existing inline review thread and stays within the repo timeout guidance. If feat(recipe): add OKE GB200 perf check #1233 lands first or second, the sibling test can be aligned to whichever pattern remains in tree so the merged tests are consistent.

… assumption Adds the inference-perf performance phase to h100-aks-ubuntu-inference-dynamo mirroring the H100 EKS/GKE Dynamo siblings, plus a resolved-recipe regression test. Also documents that the current inference throughput and NCCL bus-bandwidth gates are fixed absolute full-node values calibrated on 8-GPU H100 nodes and are not normalized for GPU count / fabric class, so smaller H100 SKUs can false-fail a healthy run. Adds NOTE comments to the affected inference and training overlays and a node-shape-assumption note to docs/user/validation.md, cross-referencing the normalization follow-ups (NVIDIA#1254 inference, NVIDIA#1256 NCCL).

github-actions Bot added area/recipes size/M labels Jun 8, 2026

yuanchen8911 mentioned this pull request Jun 8, 2026

feat(validation): add performance-phase constraints to AKS overlays once Azure testbed lands #1006

Closed

yuanchen8911 changed the title ~~WIP: feat(recipe): add AKS Dynamo perf goals~~ WIP: feat(recipe): add AKS Dynamo perf check Jun 8, 2026

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 5ff6fd3 to 7c5ccc4 Compare June 9, 2026 15:32

yuanchen8911 mentioned this pull request Jun 9, 2026

inference-perf throughput floor is a fixed absolute full-node value applied to SKU-agnostic recipes → false-fails smaller node shapes #1254

Open

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 7c5ccc4 to a213914 Compare June 9, 2026 16:43

github-actions Bot added the area/docs label Jun 9, 2026

yuanchen8911 added the documentation label Jun 9, 2026

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch 2 times, most recently from c19167f to 7f368ee Compare June 9, 2026 16:56

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread pkg/recipe/performance_goals_aks_test.go Outdated

yuanchen8911 changed the title ~~WIP: feat(recipe): add AKS Dynamo perf check~~ feat(recipe): add AKS Dynamo perf check Jun 9, 2026

yuanchen8911 marked this pull request as ready for review June 9, 2026 17:07

yuanchen8911 requested review from a team as code owners June 9, 2026 17:07

yuanchen8911 requested a review from mchmarny June 9, 2026 17:11

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 7f368ee to 09a9bd2 Compare June 9, 2026 17:14

yuanchen8911 changed the title ~~feat(recipe): add AKS Dynamo perf check~~ feat(recipe): add AKS H100 Dynamo perf check Jun 9, 2026

yuanchen8911 requested a review from xdu31 June 9, 2026 17:57

njhensley mentioned this pull request Jun 9, 2026

feat(recipe): add OKE GB200 perf check #1233

Merged

25 tasks

mchmarny assigned yuanchen8911 Jun 9, 2026

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 09a9bd2 to 4625c22 Compare June 9, 2026 21:30

mchmarny removed the documentation label Jun 9, 2026

mchmarny approved these changes Jun 9, 2026

View reviewed changes

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 063eaa5 to d0e0b4b Compare June 9, 2026 21:59

yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from d0e0b4b to f9f4505 Compare June 9, 2026 22:13

yuanchen8911 merged commit 3e2f823 into NVIDIA:main Jun 9, 2026
129 checks passed

coderabbitai Bot mentioned this pull request Jun 11, 2026

recipes/dynamo: bump platform to 1.2 #1308

Merged

25 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(recipe): add AKS H100 Dynamo perf check#1232

feat(recipe): add AKS H100 Dynamo perf check#1232
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/issue-1006-aks-performance-goals

yuanchen8911 commented Jun 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Reviews paused

Walkthrough

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

xdu31 commented Jun 9, 2026

Uh oh!

njhensley commented Jun 9, 2026

Uh oh!

yuanchen8911 commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

yuanchen8911 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xdu31 commented Jun 9, 2026

Cross-Review

Confirmed Issues

Open Questions

Uh oh!

njhensley commented Jun 9, 2026

Review

Strengths

Minor nits (non-blocking)

Uh oh!

yuanchen8911 commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yuanchen8911 commented Jun 8, 2026 •

edited

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading