Skip to content

feat(recipe): add AKS H100 Dynamo perf check#1232

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/issue-1006-aks-performance-goals
Jun 9, 2026
Merged

feat(recipe): add AKS H100 Dynamo perf check#1232
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/issue-1006-aks-performance-goals

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a Dynamo inference performance phase to the H100 AKS Ubuntu inference overlay, mirroring the current H100 EKS/GKE inference-perf smoke gate, and documents a known limitation of the existing performance gates: the inference throughput and NCCL bus-bandwidth floors are fixed absolute full-node values calibrated on 8-GPU H100 nodes, not normalized for GPU count / fabric class.

⚠️ Known limitation — the current performance gates assume a full 8-GPU H100 node. The inference-throughput >= 50000 floor (and the analogous NCCL nccl-all-reduce-bw busbw floors) was calibrated on a full 8-GPU H100 node (GB200 on a 4-GPU node) and is not normalized for GPU count or fabric class. Because the recipes are SKU-agnostic (service + accelerator only), the same recipe lands on smaller / different-fabric H100 SKUs (p5.4xlarge, a3-highgpu-1g/2g/4g, AKS NC80adis/NC40ads) where a healthy run can false-fail. This PR documents the 8-GPU happy path and the workaround (TTFT-only on smaller SKUs); the behavioral fixes are tracked in #1254 (inference, per-GPU normalization) and #1256 (NCCL, fabric/transport-class floors).

Motivation / Context

The AKS Dynamo inference leaf was missing the recommended performance phase. This adds the same inference throughput and TTFT goals used by the current H100 EKS/GKE Dynamo overlays.

While adding the AKS gate it became clear the shared >= 50000 throughput floor (and the analogous NCCL busbw floors) is a full-node value: on a smaller H100 SKU — which AKS commonly is (NC80adis = 2 GPUs, NC40ads = 1 GPU), and which exists on every provider (p5.4xlarge, a3-highgpu-1g/2g/4g) — it can false-fail a healthy run. Rather than ship that silently, this PR adds explicit NOTE comments to the affected overlays and a node-shape-assumption note to the validation docs, cross-referencing the normalization follow-ups.

Fixes: #1006
Related: #1254 (inference throughput per-GPU normalization), #1256 (NCCL fabric/transport-class floors)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Scope note: the AKS recipe criteria remain generic (service: aks, accelerator: h100), so a user on NC40ads/NC80adis can still generate/validate this recipe and hit the throughput false-fail. This PR documents the 8-GPU ND H100 happy path and the NCads workaround (TTFT-only); the behavioral fix (per-GPU normalization) is tracked in #1254 and intentionally out of scope here.

Testing

go test ./pkg/recipe -count=1
env AICR_VALIDATION_FLOOR_STRICT=1 go test ./pkg/recipe -run 'TestOverlayValidationPhaseFloor/h100-aks-ubuntu-inference-dynamo$|TestAKSDynamoInferencePerformanceGoalsFollowPattern' -count=1
yamllint <changed overlays>          # clean
golangci-lint run -c .golangci.yaml ./pkg/recipe/...
unset GITLAB_TOKEN && make qualify

The added overlay/doc changes are comments and prose only (no semantic recipe change), so resolution is unchanged — TestAKSDynamoInferencePerformanceGoalsFollowPattern and TestOverlayValidationPhaseFloor still pass, and yamllint is clean on all changed overlays. make qualify completes the full gate; the vulnerability scan reports the existing k8s.io/kubernetes advisories and does not fail the gate.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: The AKS goals are provisional smoke-floor values mirrored from the current H100 EKS/GKE Dynamo overlays; recalibrate once Azure AKS H100 testbed data is available. The full-node-assumption notes are documentation only — the actual normalization fixes are tracked separately in #1254 (inference) and #1256 (NCCL).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a performance validation block to recipes/overlays/h100-aks-ubuntu-inference-dynamo.yaml that declares an inference-perf check with four hard constraints (model, per-GPU concurrency, throughput, TTFT p99). Adds a Go unit test (pkg/recipe/performance_goals_aks_test.go) that loads the metadata store, resolves the overlay, and asserts the presence and exact expected performance check and constraint key/value pairs. Also adds clarifying NOTE comments across multiple overlays and updates the AKS integrator doc to recommend an 8-GPU H100 SKU.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#1196: Related tuning and validation of the inference-perf gate including TTFT p99 constraints.

Suggested labels

enhancement, area/tests

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description comprehensively documents the changes, motivations, and known limitations. It clearly explains what is being added (Dynamo inference performance phase to H100 AKS overlay), why (addressing issue #1006 and documenting full-node assumption of performance gates), and what the implementation includes (performance goals, test, documentation updates).
Title check ✅ Passed The title 'feat(recipe): add AKS H100 Dynamo perf check' accurately describes the primary change: adding performance validation to the AKS H100 Dynamo inference overlay, though it omits that the change also includes documentation updates across multiple overlays.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 changed the title WIP: feat(recipe): add AKS Dynamo perf goals WIP: feat(recipe): add AKS Dynamo perf check Jun 8, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 5ff6fd3 to 7c5ccc4 Compare June 9, 2026 15:32
@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 7c5ccc4 to a213914 Compare June 9, 2026 16:43
@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch 2 times, most recently from c19167f to 7f368ee Compare June 9, 2026 16:56

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/performance_goals_aks_test.go`:
- Around line 24-25: Replace the use of context.Background() with a bounded
context when calling loadMetadataStore and other metadata/build operations:
create ctx, cancel := context.WithTimeout(context.Background(),
defaults.FileReadTimeout) and defer cancel() before calling
loadMetadataStore(ctx) and BuildRecipeResult(ctx, ...); update imports to
include context and defaults as needed so DataProvider operations have a
deadline/cancel path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0d407169-2032-42eb-b9c1-d5198765b3e6

📥 Commits

Reviewing files that changed from the base of the PR and between 7c5ccc4 and c19167f.

📒 Files selected for processing (10)
  • docs/integrator/aks-gpu-setup.md
  • docs/user/validation.md
  • pkg/recipe/performance_goals_aks_test.go
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-aks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-eks-training.yaml
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-training.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml

Comment thread pkg/recipe/performance_goals_aks_test.go Outdated
@yuanchen8911 yuanchen8911 changed the title WIP: feat(recipe): add AKS Dynamo perf check feat(recipe): add AKS Dynamo perf check Jun 9, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 9, 2026 17:07
@yuanchen8911 yuanchen8911 requested review from a team as code owners June 9, 2026 17:07
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 9, 2026 17:11
@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 7f368ee to 09a9bd2 Compare June 9, 2026 17:14
@yuanchen8911 yuanchen8911 changed the title feat(recipe): add AKS Dynamo perf check feat(recipe): add AKS H100 Dynamo perf check Jun 9, 2026
@yuanchen8911 yuanchen8911 requested a review from xdu31 June 9, 2026 17:57
@xdu31

xdu31 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Cross-Review

Reviewers: Claude Code + Code Review Agent + Integration Analysis Agent

Confirmed Issues

None.

Open Questions

No questions — the NOTE comments are accurate, the AKS test correctly mirrors the AKS peer pattern, floor test coverage is intact, and the documentation accurately frames the known limitation.


Cross-review by Claude Code

@njhensley

Copy link
Copy Markdown
Member

Review

What it does: Adds a performance: [inference-perf] block to h100-aks-ubuntu-inference-dynamo (mirroring the EKS/GKE Dynamo gates), adds a resolved-recipe regression test, and documents — via prose in docs/ and NOTE comments across 6 overlays — that the throughput/NCCL floors are fixed full-node values that can false-fail on smaller H100 SKUs.

Assessment: solid, low-risk. 👍

Strengths

  • The only semantic change is the one new AKS perf block; everything else is comments + docs. The test asserts both the exact checks and the full constraint set (count + per-item), which correctly locks in "AKS follows the inference pattern, doesn't inherit training NCCL checks."
  • The NOTE comments correctly distinguish the evaluator's partial-occupancy down-scaling from cross-node-shape normalization (which it does not do) — an accurate description of the gap and a useful correction to the prior EKS/GKE comments.
  • Docs follow the file's bold-label-paragraph convention as claimed.

Minor nits (non-blocking)

  • The >= 50000 floor is reused verbatim for AKS with no Azure baseline. Acknowledged as provisional in the PR description — fine, but the value is now load-bearing on a third provider with zero calibration data behind it.
  • The new test uses context.WithTimeout(..., defaults.FileReadTimeout), while the sibling OKE test in feat(recipe): add OKE GB200 perf check #1233 uses bare context.Background(). Trivial inconsistency between the two PRs; worth aligning on one.

@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 09a9bd2 to 4625c22 Compare June 9, 2026 21:30
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

@njhensley thanks for the careful read. Addressing the two nits:

@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from 063eaa5 to d0e0b4b Compare June 9, 2026 21:59
… assumption

Adds the inference-perf performance phase to h100-aks-ubuntu-inference-dynamo
mirroring the H100 EKS/GKE Dynamo siblings, plus a resolved-recipe regression
test.

Also documents that the current inference throughput and NCCL bus-bandwidth
gates are fixed absolute full-node values calibrated on 8-GPU H100 nodes and
are not normalized for GPU count / fabric class, so smaller H100 SKUs can
false-fail a healthy run. Adds NOTE comments to the affected inference and
training overlays and a node-shape-assumption note to docs/user/validation.md,
cross-referencing the normalization follow-ups (NVIDIA#1254 inference, NVIDIA#1256 NCCL).
@yuanchen8911 yuanchen8911 force-pushed the fix/issue-1006-aks-performance-goals branch from d0e0b4b to f9f4505 Compare June 9, 2026 22:13
@yuanchen8911 yuanchen8911 merged commit 3e2f823 into NVIDIA:main Jun 9, 2026
129 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Jun 11, 2026
25 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(validation): add performance-phase constraints to AKS overlays once Azure testbed lands

4 participants