feat(recipes): add concrete GB300 EKS service-bound overlays #1319

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/gb300-eks-overlays
Jun 11, 2026
Conversation

@yuanchen8911

Contributor

Summary

Resolve the GB300 recipe gap (#1318): declare GB300 (Blackwell Ultra) as an accelerator and add the EKS overlay set, modeled on the existing gb200-eks-* overlays so aicr recipe --accelerator gb300 --service eks --intent <z> resolves end-to-end.

Motivation / Context

GB300 was undeclared in pkg/recipe/criteria.go and had zero overlays. GB300 NVL72 shares GB200's topology (72 Blackwell GPUs + 36 Grace, 5th-gen NVLink/MNNVL, ARM64 Grace host) and shipped in the same gpu-operator v25.10 release as GB200 — so the EKS recipe is structurally the same as GB200's, with GB300's gains (more HBM3e, higher FP4/attention) being capacity/perf rather than config. mk8s GB300 clusters are all on AWS/EKS running gpu-operator 25.10.x, confirming the EKS focus and the v25.10 floor.

Fixes: #1318
Related: #1042 (parent epic), #1254, #1256

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (enum surface in api/aicr/v1/server.yaml)
  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/)

Implementation Notes

Go: declare CriteriaAcceleratorGB300 ("gb300"), wire ParseAccelerator + GetCriteriaAcceleratorTypes, update the accelerator enum doc comments (criteria.go, recipe/doc.go, fingerprint/{types,doc}.go), and update TestGetCriteriaAcceleratorTypes.
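
The enum wiring described above might look roughly like the following. This is a minimal sketch, not the contents of pkg/recipe/criteria.go: the type name CriteriaAcceleratorType, the reduced accelerator set, the function bodies, and the case-insensitive matching are all assumptions made for illustration. Only the identifiers CriteriaAcceleratorGB300, ParseAccelerator, and GetCriteriaAcceleratorTypes come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// CriteriaAcceleratorType is a hypothetical stand-in for the real type in pkg/recipe.
type CriteriaAcceleratorType string

// Only a subset of accelerators is listed here; the real enum has more entries.
const (
	CriteriaAcceleratorB200  CriteriaAcceleratorType = "b200"
	CriteriaAcceleratorGB200 CriteriaAcceleratorType = "gb200"
	CriteriaAcceleratorGB300 CriteriaAcceleratorType = "gb300"
)

// GetCriteriaAcceleratorTypes returns the accepted accelerator strings.
func GetCriteriaAcceleratorTypes() []string {
	return []string{
		string(CriteriaAcceleratorB200),
		string(CriteriaAcceleratorGB200),
		string(CriteriaAcceleratorGB300),
	}
}

// ParseAccelerator maps user input to a known accelerator.
// Ignoring case and surrounding whitespace is an assumption of this sketch.
func ParseAccelerator(s string) (CriteriaAcceleratorType, error) {
	norm := strings.ToLower(strings.TrimSpace(s))
	for _, t := range GetCriteriaAcceleratorTypes() {
		if norm == t {
			return CriteriaAcceleratorType(t), nil
		}
	}
	return "", fmt.Errorf("invalid accelerator %q", s)
}

func main() {
	for _, in := range []string{"gb300", "GB300", "not-a-gpu"} {
		acc, err := ParseAccelerator(in)
		fmt.Printf("input=%q accelerator=%q err=%v\n", in, acc, err)
	}
}
```

Deriving the parser from the same list that GetCriteriaAcceleratorTypes returns keeps the two in sync, which is the property a list-membership test such as TestGetCriteriaAcceleratorTypes would rely on.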

Overlays (mirror gb200-eks-*):

  • gb300-any — deployment floor >= v25.10.0.
  • gb300-eks-training / -inference — EFA kernel-module-params preManifest, cdi + gdrcopy, nfd topologyUpdater. Training keeps NVLS+NET perf checks: NVLS >= 500 carries over unchanged (GB300 uses the same 5th-gen NVLink, 1.8 TB/s), NET >= 40 is a loose floor GB300's CX8 NICs should exceed.
  • gb300-eks-ubuntu-{training,inference}, -training-kubeflow, -inference-dynamo. Dynamo inherits the GB200 inference-perf gate as a loose floor (GB300's extra HBM3e / FP4 means it won't false-fail).
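
To make the "loose floor" language concrete, here is a minimal sketch of how a gate such as NVLS >= 500 or NET >= 40 could be evaluated against a measured value. The helper name satisfies and the "<op> <number>" grammar are assumptions modeled on the constraint strings quoted in this PR; the real validator may parse and compare differently. The measurements in main are made up, not results from a GB300 cluster.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// satisfies reports whether measured meets a constraint such as ">= 500" or "<= 2000".
// The "<op> <number>" grammar is an assumption of this sketch.
func satisfies(constraint string, measured float64) (bool, error) {
	fields := strings.Fields(constraint)
	if len(fields) != 2 {
		return false, fmt.Errorf("malformed constraint %q", constraint)
	}
	limit, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", constraint, err)
	}
	switch fields[0] {
	case ">=":
		return measured >= limit, nil
	case "<=":
		return measured <= limit, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", fields[0])
	}
}

func main() {
	// Hypothetical measurements for illustration only.
	ok, err := satisfies(">= 500", 620)
	fmt.Println("NVLS floor met:", ok, err)
	ok, err = satisfies(">= 40", 35)
	fmt.Println("NET floor met:", ok, err)
}
```

A floor is "loose" in this sense when hardware that is working correctly is expected to clear it with margin, so the gate catches gross misconfiguration rather than small regressions.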

Nodewright: reuses the gb200 tuning profile (accelerator: gb200), since GB300 shares the ARM64 Grace host and Blackwell NVL72 platform and has no gb300-specific package. The nodewright maintainer confirmed this is expected to work, with one caveat documented in-file: GB300 may hit the DOCA issue that the AWS GPU AMI already works around, so revisit if EFA/RDMA bring-up regresses.

Enum/doc audit: server.yaml enum blocks, bug_report.yml dropdowns, cli-reference / api-reference / recipe.md accelerator lists.

K8s floors: >= 1.32.4 (training/inference), >= 1.34 (DRA dynamo leaf) — same as GB200.
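
Version floors such as >= 1.32.4, >= 1.34, and >= v25.10.0 need a numeric comparison per component, because a plain string comparison would rank "25.9" above "25.10". The following is a minimal sketch of that comparison; parseVersion and meetsFloor are hypothetical helpers, not functions from this repository, and the sketch deliberately does not handle suffixes such as build metadata.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v25.10.0" or "1.34" into [major, minor, patch].
// Missing components default to 0. Suffixes like "-eks-1" are not supported here.
func parseVersion(s string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) > 3 {
		return out, fmt.Errorf("too many components in %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q", p, s)
		}
		out[i] = n
	}
	return out, nil
}

// meetsFloor reports whether actual >= floor, comparing component by component.
func meetsFloor(actual, floor string) (bool, error) {
	a, err := parseVersion(actual)
	if err != nil {
		return false, err
	}
	f, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := range a {
		if a[i] != f[i] {
			return a[i] > f[i], nil
		}
	}
	return true, nil
}

func main() {
	for _, c := range [][2]string{
		{"v25.9.2", "v25.10.0"}, // below the deployment floor
		{"1.34.1", "1.34"},      // meets the DRA dynamo floor
		{"1.32.3", "1.32.4"},    // one patch below the training/inference floor
	} {
		ok, err := meetsFloor(c[0], c[1])
		fmt.Printf("%s >= %s: %v (err=%v)\n", c[0], c[1], ok, err)
	}
}
```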

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestGetCriteriaAcceleratorTypes; TestOverlayValidationPhaseFloor and TestAllOverlayCriteriaUseValidEnums auto-discover the 6 gb300 leaves)
go test ./pkg/cli/...                    # PASS (completion enum)
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./pkg/cli/... ./pkg/fingerprint/...   # 0 issues
yamllint recipes/overlays/gb300-*.yaml   # clean
# End-to-end resolution:
aicr recipe --service eks --accelerator gb300 --os ubuntu --intent inference --platform dynamo
#   -> components=18 overlays=8; accelerator gb300; K8s >= 1.34; floor >= v25.10.0;
#      nodewright reuses gb200 profile; EFA kernel-module-params present
aicr recipe --service eks --accelerator gb300 --intent training   # -> components=14 overlays=6

The Go lint gate is satisfied (golangci-lint reports 0 issues on changed packages), and a broad go test ./pkg/{recipe,fingerprint,validator,bundler}/... was also run. The full make qualify e2e/scan is left to CI.

Risk Assessment

  • Low — additive accelerator + overlays mirroring the merged GB200 EKS set; one small Go enum addition with test coverage. Easy to revert.

Rollout notes: Additive. Other GB300 services (OKE/AKS/GKE) are follow-ups under #1042; nodewright DOCA caveat tracked in-file.

Checklist

  • Tests pass locally (go test with -race via make test path)
  • Linter passes (golangci-lint 0 issues; yamllint clean)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (TestGetCriteriaAcceleratorTypes; floor/enum auto-discovery)
  • I updated docs if user-facing behavior changed (enum surfaces + accelerator lists)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested review from a team as code owners June 11, 2026 19:02
@yuanchen8911 yuanchen8911 marked this pull request as draft June 11, 2026 19:02
@yuanchen8911 yuanchen8911 changed the title feat(recipes): add concrete GB300 EKS service-bound overlays WIP: feat(recipes): add concrete GB300 EKS service-bound overlays Jun 11, 2026
@yuanchen8911 yuanchen8911 added the area/cli, area/docs, and theme/recipes labels Jun 11, 2026
@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

Recipe evidence check

Affected leaf overlays: 6

| Recipe | Pointer | Verify | Digest match |
|---|---|---|---|
| gb300-eks-inference | ⚠️ missing | | |
| gb300-eks-training | ⚠️ missing | | |
| gb300-eks-ubuntu-inference-dynamo | ⚠️ missing | | |
| gb300-eks-ubuntu-inference | ⚠️ missing | | |
| gb300-eks-ubuntu-training-kubeflow | ⚠️ missing | | |
| gb300-eks-ubuntu-training | ⚠️ missing | | |

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR declares GB300 in the criteria/type system, extends OpenAPI enums and Criteria schema to accept gb300, adds GB300 GPU-SKU fingerprint mapping and tests, updates NCCL/preflight validators and tests to include GB300, synchronizes docs/CLI/comments, adjusts the bug-report template formatting, and adds multiple GB300 recipe overlays (wildcard, EKS training/inference, Ubuntu/Kubeflow variants, and a Dynamo DRA-enabled overlay).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/aicr#1233: Related changes in validators/performance and NCCL handling that touch similar validation logic and test coverage.

Suggested reviewers

  • mchmarny
  • ayuskauskas
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat(recipes): add concrete GB300 EKS service-bound overlays' directly and specifically summarizes the main change: adding EKS overlays for GB300 accelerator support. |
| Description check | ✅ Passed | The description comprehensively documents the GB300 feature addition including motivation, implementation details, testing, and risk assessment, all relevant to the changeset. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #1318's success criteria: declares CriteriaAcceleratorGB300, adds concrete EKS service-bound overlays (gb300-eks-training, gb300-eks-inference, and ubuntu variants), enables aicr recipe resolution for GB300 on EKS, and provides test coverage via overlay validation auto-discovery. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: GB300 constant/parsing additions, API enum updates, overlay definitions, documentation updates, and test coverage directly support the linked issue objectives; no unrelated changes detected. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/recipe/doc.go (1)

73-82: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add CriteriaAcceleratorGB300 to the accelerator bullet list to keep docs internally consistent.

The package comment now lists gb300 in summary/query sections, but the detailed accelerator bullets still skip it.

As per coding guidelines, “Ensure code comments are accurate and helpful.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/doc.go` around lines 73 - 82, The accelerator bullet list is
missing CriteriaAcceleratorGB300; update the package comment in
pkg/recipe/doc.go to add a bullet for CriteriaAcceleratorGB300 (e.g.,
"CriteriaAcceleratorGB300: NVIDIA GB300") alongside the other accelerator
entries so the detailed bullets match the summary/query sections that reference
gb300.

Source: Coding guidelines

pkg/cli/recipe.go (1)

109-109: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update command description to include table output format.

The help text says JSON/YAML only, but this command accepts table too (formatFlag + parseRecipeOutputFormat).

As per coding guidelines, “CLI commands should support multiple output formats: JSON, YAML, and table formats.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/recipe.go` at line 109, Update the command help/description in
pkg/cli/recipe.go to mention the "table" output format in addition to JSON and
YAML; locate the text associated with the command where formatFlag and
parseRecipeOutputFormat are used (the command's description/help string) and
change "Output can be in JSON or YAML format." to include table (e.g., "Output
can be in JSON, YAML, or table format.") so it matches supported formats.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml`:
- Around line 80-81: The TTFT gate for this GB300 overlay is incorrectly set to
"inference-ttft-p99 value: \"<= 2000\""; change it to the standard
non-GB200/B200 floor by updating the inference-ttft-p99 value to "<= 1000" in
the gb300-eks-ubuntu-inference-dynamo overlay (follow the same rule used in
other *-eks-ubuntu-inference-dynamo and *-gke-cos-inference-dynamo overlays: use
"<= 1000" for all accelerators except b200/gb200 which remain "<= 2000").

---

Outside diff comments:
In `@pkg/cli/recipe.go`:
- Line 109: Update the command help/description in pkg/cli/recipe.go to mention
the "table" output format in addition to JSON and YAML; locate the text
associated with the command where formatFlag and parseRecipeOutputFormat are
used (the command's description/help string) and change "Output can be in JSON
or YAML format." to include table (e.g., "Output can be in JSON, YAML, or table
format.") so it matches supported formats.

In `@pkg/recipe/doc.go`:
- Around line 73-82: The accelerator bullet list is missing
CriteriaAcceleratorGB300; update the package comment in pkg/recipe/doc.go to add
a bullet for CriteriaAcceleratorGB300 (e.g., "CriteriaAcceleratorGB300: NVIDIA
GB300") alongside the other accelerator entries so the detailed bullets match
the summary/query sections that reference gb300.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 23abe7f8-e0bb-4820-9e4e-11847976cbf8

📥 Commits

Reviewing files that changed from the base of the PR and between 8d17940 and 1b6b22e.

📒 Files selected for processing (18)
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml

Comment thread recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)

80-81: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the GB300 TTFT floor used by other non-GB200/B200 Dynamo inference overlays.

This issue was already flagged in a previous review. For *-eks-ubuntu-inference-dynamo overlays, the WIP gate convention is inference-ttft-p99 <= 1000 for accelerators other than b200/gb200. GB300 (Blackwell Ultra) should follow the standard floor, not the GB200 exception.

🔧 Proposed fix
         - name: inference-ttft-p99
-          value: "<= 2000"
+          value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 -
81, Update the WIP gate for the GB300 overlay so it uses the standard
non-GB200/B200 floor: locate the inference-ttft-p99 key in the
gb300-eks-ubuntu-inference-dynamo overlay (the entry currently set to "<= 2000")
and change its value to "<= 1000" to match other non-GB200/B200 Dynamo inference
overlays.

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/cli-reference.md`:
- Line 429: Update all remaining documentation references that list supported
accelerator types so they include the new "gb300" entry: search for occurrences
of the CLI flags `--accelerator` and `--gpu` and extend any accelerator
lists/examples that end at `gb200` to also include `gb300`; likewise, update the
wildcard-overlays section that references `gb200-any.yaml` to mention and
document `gb300-any.yaml` (add the overlay name `gb300-any` and any matching
criteria/value examples) so the CLI reference, examples, and recipe
wildcard-overlays remain consistent.

---

Duplicate comments:
In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml`:
- Around line 80-81: Update the WIP gate for the GB300 overlay so it uses the
standard non-GB200/B200 floor: locate the inference-ttft-p99 key in the
gb300-eks-ubuntu-inference-dynamo overlay (the entry currently set to "<= 2000")
and change its value to "<= 1000" to match other non-GB200/B200 Dynamo inference
overlays.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2e7e38e6-41f3-4104-a666-efec8b0381c9

📥 Commits

Reviewing files that changed from the base of the PR and between 1b6b22e and 8147723.

📒 Files selected for processing (21)
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml

Comment thread docs/user/cli-reference.md
@yuanchen8911 yuanchen8911 force-pushed the feat/gb300-eks-overlays branch from 8147723 to c403e42 Compare June 11, 2026 19:53

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)

80-81: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the non-GB200/B200 Dynamo TTFT floor for GB300.

inference-ttft-p99 is currently set to <= 2000; for *-eks-ubuntu-inference-dynamo overlays, the established WIP convention is <= 1000 for accelerators other than b200/gb200. Keeping <= 2000 here weakens the intended floor.

Based on learnings: in recipes/overlays/*-{eks-ubuntu-inference-dynamo,gke-cos-inference-dynamo}.yaml, use inference-ttft-p99 <= 1000, except b200/gb200 which use <= 2000.

Suggested patch
-        - name: inference-ttft-p99
-          value: "<= 2000"
+        - name: inference-ttft-p99
+          value: "<= 1000"
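
The convention cited in this comment can be written as a small lookup. This is a sketch of the rule as the review states it, not code from the repository; ttftP99Gate is a hypothetical helper. Note that the PR description takes the opposite position for GB300, inheriting the GB200 gate as a loose floor, so which branch gb300 belongs in is exactly what this thread is about.

```go
package main

import "fmt"

// ttftP99Gate returns the inference-ttft-p99 constraint for an accelerator under the
// convention described in the review: "<= 2000" for b200/gb200, "<= 1000" otherwise.
// Hypothetical helper for illustration; the overlays set this value directly in YAML.
func ttftP99Gate(accelerator string) string {
	switch accelerator {
	case "b200", "gb200":
		return "<= 2000"
	default:
		return "<= 1000"
	}
}

func main() {
	for _, acc := range []string{"b200", "gb200", "gb300"} {
		fmt.Printf("%-6s inference-ttft-p99 %s\n", acc, ttftP99Gate(acc))
	}
}
```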
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 -
81, The TTFT p99 floor is incorrect for the GB300 overlay: locate the YAML key
inference-ttft-p99 in the gb300-eks-ubuntu-inference-dynamo overlay and change
its value from "<= 2000" to "<= 1000" so this overlay follows the convention
that only b200/gb200 use "<= 2000" while other accelerators use "<= 1000".

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/recipe.md`:
- Line 153: Update the contributor guide's accelerator documentation to include
the new GB300 overlay: add `gb300` to the static `accelerator` list entry
(alongside `gb200`, etc.) and add a corresponding wildcard overlay entry named
`gb300-any.yaml` in the wildcard-overlay section so the listed overlays match
the static accelerator set; update any nearby examples or references to the
wildcard overlays to mention `gb300-any.yaml` and ensure casing/formatting
matches the existing `gb200-any.yaml` entry.

In `@pkg/fingerprint/gpu_sku_test.go`:
- Line 34: The test entry titled "GB300 wins over B200 substring" is
inconsistent because the model string "NVIDIA GB300 NVL72" lacks the B200 token;
update the test row (the tuple {"GB300 wins over B200 substring", "NVIDIA GB300
NVL72", "gb300"}) so the model string includes the B200 substring (e.g. "NVIDIA
GB300 NVL72 B200") to actually exercise the GB300 vs B200 precedence path in the
GPU SKU tests.

In `@pkg/recipe/criteria_test.go`:
- Line 638: In TestParseCriteriaAcceleratorType add an explicit table-driven
test row that exercises parsing the literal "gb300" so its behavior is asserted
independently; update the parser test cases table (the slice used within
TestParseCriteriaAcceleratorType) to include an entry with input "gb300" and the
corresponding expected parsed value (and no error) so the parser’s handling of
"gb300" is directly validated alongside the existing list-membership check.

---

Duplicate comments:
In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml`:
- Around line 80-81: The TTFT p99 floor is incorrect for the GB300 overlay:
locate the YAML key inference-ttft-p99 in the gb300-eks-ubuntu-inference-dynamo
overlay and change its value from "<= 2000" to "<= 1000" so this overlay follows
the convention that only b200/gb200 use "<= 2000" while other accelerators use
"<= 1000".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7b4bf911-7474-44a4-b1c9-37d7bafc48dc

📥 Commits

Reviewing files that changed from the base of the PR and between 8147723 and c403e42.

📒 Files selected for processing (26)
  • .claude/skills/analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go

Comment thread docs/contributor/recipe.md
Comment thread pkg/fingerprint/gpu_sku_test.go Outdated
Comment thread pkg/recipe/criteria_test.go
@yuanchen8911 yuanchen8911 force-pushed the feat/gb300-eks-overlays branch from c403e42 to 941f01a Compare June 11, 2026 20:05

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
pkg/recipe/criteria_test.go (1)

638-638: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add explicit parser test cases for gb300 in TestParseCriteriaAcceleratorType.

Line 638 validates that "gb300" is in the list returned by GetCriteriaAcceleratorTypes() and can be parsed via the loop, but TestParseCriteriaAcceleratorType (lines 64-98) still lacks direct table-driven test rows for "gb300" and "GB300" inputs. Add explicit test cases to independently guard the parser behavior.

🧪 Suggested test cases
 		{"gb200", "gb200", CriteriaAcceleratorGB200, false},
+		{"gb300", "gb300", CriteriaAcceleratorGB300, false},
+		{"GB300 uppercase", "GB300", CriteriaAcceleratorGB300, false},
 		{"b200", "b200", CriteriaAcceleratorB200, false},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/criteria_test.go` at line 638, Add explicit table-driven test rows
for the "gb300" accelerator to TestParseCriteriaAcceleratorType: update the
test's cases slice to include inputs "gb300" and "GB300" with the expected
canonical value (matching what GetCriteriaAcceleratorTypes() lists) and expected
no-error outcome; run the same assertion logic already used in
TestParseCriteriaAcceleratorType so the parser is independently guarded for both
lowercase and uppercase forms.
docs/user/cli-reference.md (1)

429-429: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Finish the gb300 docs sync.

These rows are correct, but the surrounding docs still have stale gb200-only references: docs/user/cli-reference.md has later accelerator examples that stop at gb200, and docs/contributor/recipe.md still names only gb200-any.yaml in the wildcard-overlay section. Please update those remaining references in the same pass so the docs stay internally consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/user/cli-reference.md` at line 429, Update remaining stale "gb200"
references to include the new "gb300" entry: search the CLI accelerator example
block that lists accelerators (the row showing `--accelerator | --gpu | string |
... gb200`) and expand later accelerator examples to also mention `gb300`, and
update the wildcard-overlay file reference named `gb200-any.yaml` to
`gb300-any.yaml` in the wildcard-overlay section; ensure any plain-text mentions
or example filenames that currently stop at `gb200` are changed to include or
reference `gb300` so docs are internally consistent with the new `gb300`
accelerator.
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)

80-81: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the standard non-GB200/B200 TTFT floor (<= 1000).

The inference-ttft-p99 constraint is set to <= 2000, but per the established convention for *-eks-ubuntu-inference-dynamo overlays, accelerators other than b200/gb200 should use <= 1000. GB300 (Blackwell Ultra) is not GB200, so it should follow the standard floor.

🔧 Proposed fix
         - name: inference-ttft-p99
-          value: "<= 2000"
+          value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 -
81, The TTFT p99 constraint inference-ttft-p99 is set incorrectly to "<= 2000";
update the overlay's inference-ttft-p99 entry to use the standard non-GB200/B200
floor value "<= 1000" so GB300 follows the same rule—locate the
inference-ttft-p99 key in the gb300-eks-ubuntu-inference-dynamo overlay and
change its value to "<= 1000".

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/gb300-eks-inference.yaml`:
- Around line 36-61: The gpu-operator componentRefs block (gpu-operator, cdi,
gdrcopy, driver.kernelModuleConfig referencing kernel-module-params.yaml) is
duplicated between the two leaf overlays; extract that block into a single
shared overlay fragment (e.g., gpu-operator-common overlay) that contains the
full gpu-operator component declaration and kernel-module-params
preManifestFiles, then replace the duplicated blocks in gb300-eks-inference.yaml
and gb300-eks-training.yaml with an include/reference to that shared fragment
(or add a small build-time sync script that injects the shared fragment into
both overlays), and ensure the shared fragment preserves the same keys
(gpu-operator, cdi, gdrcopy, driver.kernelModuleConfig) so both overlays behave
identically.

---

Duplicate comments:
In `@docs/user/cli-reference.md`:
- Line 429: Update remaining stale "gb200" references to include the new "gb300"
entry: search the CLI accelerator example block that lists accelerators (the row
showing `--accelerator | --gpu | string | ... gb200`) and expand later
accelerator examples to also mention `gb300`, and update the wildcard-overlay
file reference named `gb200-any.yaml` to `gb300-any.yaml` in the
wildcard-overlay section; ensure any plain-text mentions or example filenames
that currently stop at `gb200` are changed to include or reference `gb300` so
docs are internally consistent with the new `gb300` accelerator.

In `@pkg/recipe/criteria_test.go`:
- Line 638: Add explicit table-driven test rows for the "gb300" accelerator to
TestParseCriteriaAcceleratorType: update the test's cases slice to include
inputs "gb300" and "GB300" with the expected canonical value (matching what
GetCriteriaAcceleratorTypes() lists) and expected no-error outcome; run the same
assertion logic already used in TestParseCriteriaAcceleratorType so the parser
is independently guarded for both lowercase and uppercase forms.

In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml`:
- Around line 80-81: The TTFT p99 constraint inference-ttft-p99 is set
incorrectly to "<= 2000"; update the overlay's inference-ttft-p99 entry to use
the standard non-GB200/B200 floor value "<= 1000" so GB300 follows the same
rule—locate the inference-ttft-p99 key in the gb300-eks-ubuntu-inference-dynamo
overlay and change its value to "<= 1000".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4a435d9f-1a9d-47bb-9532-d98887e89cf0

📥 Commits

Reviewing files that changed from the base of the PR and between c403e42 and 941f01a.

📒 Files selected for processing (26)
  • .claude/skills/analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go

Comment thread recipes/overlays/gb300-eks-inference.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/gb300-eks-overlays branch from 941f01a to 7f8dc29 Compare June 11, 2026 20:50
@yuanchen8911 yuanchen8911 changed the title WIP: feat(recipes): add concrete GB300 EKS service-bound overlays feat(recipes): add concrete GB300 EKS service-bound overlays Jun 11, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 11, 2026 20:50
@yuanchen8911 yuanchen8911 force-pushed the feat/gb300-eks-overlays branch from 7f8dc29 to 2810159 Compare June 11, 2026 20:57

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (3)
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)

80-81: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the standard TTFT floor for non-GB200/B200 Dynamo overlays.

The inference-ttft-p99 constraint is set to <= 2000, but per the established WIP convention for *-eks-ubuntu-inference-dynamo overlays, accelerators other than b200/gb200 should use <= 1000. GB300 (Blackwell Ultra) is a distinct accelerator from GB200 and should follow the standard floor.

📝 Suggested fix
         - name: inference-ttft-p99
-          value: "<= 2000"
+          value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80-81,
the inference-ttft-p99 constraint in the gb300-eks-ubuntu-inference-dynamo
overlay is set to "<= 2000" but should use the standard non-GB200/B200 TTFT
floor. Update the inference-ttft-p99 entry in this overlay from "<= 2000" to
"<= 1000" so GB300 follows the same floor as other non-GB200/B200 Dynamo overlays.

Source: Learnings
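
To make the effect of the two thresholds concrete, here is a minimal sketch of how a constraint string such as "<= 1000" might be evaluated against a measured value. It is an illustration only, not the validator in this repository: the function name `checkConstraint` and the "OP VALUE" parsing are assumptions, and only the `<=` and `>=` operators that appear in this PR are handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// checkConstraint is a hypothetical helper. It evaluates a measured value
// against an expression of the form "OP VALUE", for example "<= 1000" or
// ">= 500". Any other operator is rejected.
func checkConstraint(expr string, measured float64) (bool, error) {
	fields := strings.Fields(expr)
	if len(fields) != 2 {
		return false, fmt.Errorf("want \"OP VALUE\", got %q", expr)
	}
	limit, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", expr, err)
	}
	switch fields[0] {
	case "<=":
		return measured <= limit, nil
	case ">=":
		return measured >= limit, nil
	}
	return false, fmt.Errorf("unsupported operator %q", fields[0])
}

func main() {
	// The same measurement passes the looser gate and fails the tighter one.
	for _, expr := range []string{"<= 2000", "<= 1000"} {
		ok, err := checkConstraint(expr, 1500)
		fmt.Println(expr, ok, err)
	}
}
```

Under this reading, a measurement of 1500 satisfies "<= 2000" but not "<= 1000", which is the practical difference between the two values discussed above.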

pkg/fingerprint/gpu_sku_test.go (1)

34-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Precedence test case does not include the competing B200 token.

The case name says GB300 should win over B200, but the input string lacks B200, so that precedence path isn’t exercised.

Suggested patch
-		{"GB300 wins over B200 substring", "NVIDIA GB300 NVL72", "gb300"},
+		{"GB300 wins over B200 substring", "NVIDIA GB300 B200 NVL72", "gb300"},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/fingerprint/gpu_sku_test.go` at line 34, update the test case labeled
"GB300 wins over B200 substring" so the input string actually contains the
competing token `B200`. It currently uses "NVIDIA GB300 NVL72"; change the
second field to include `B200` (for example "NVIDIA GB300 B200 NVL72" or "NVIDIA
B200 GB300 NVL72") so the precedence between `gb300` and `b200` is exercised
when running the test. Adjust the expected winner token `"gb300"` as needed if
you change token order.

Source: Coding guidelines
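
The precedence this comment asks the test to exercise can be sketched as below. This is a minimal illustration and not the code in pkg/fingerprint/gpu_sku.go: the function name `matchSKU`, the token table, and its ordering are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// skuTokens is a hypothetical ordered token table in which earlier entries
// win. GB200 must precede B200 because "GB200" contains "B200" as a
// substring. "GB300" does not contain "B200", so GB300-over-B200 precedence
// only shows up when the input mentions both tokens.
var skuTokens = []struct{ token, accelerator string }{
	{"GB300", "gb300"},
	{"GB200", "gb200"},
	{"B200", "b200"},
}

// matchSKU returns the accelerator for the first table token found in name,
// or "" when no token matches.
func matchSKU(name string) string {
	upper := strings.ToUpper(name)
	for _, t := range skuTokens {
		if strings.Contains(upper, t.token) {
			return t.accelerator
		}
	}
	return ""
}

func main() {
	fmt.Println(matchSKU("NVIDIA GB300 NVL72"))      // gb300, but B200 never competes
	fmt.Println(matchSKU("NVIDIA GB300 B200 NVL72")) // gb300, chosen over b200
	fmt.Println(matchSKU("NVIDIA GB200 NVL72"))      // gb200, not b200
}
```

With the first input, any table ordering returns gb300, so the test passes without checking precedence. Only the second input would fail if the B200 entry were listed before GB300.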

docs/user/cli-reference.md (1)

431-431: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Finish the remaining GB300 docs sync.
docs/user/cli-reference.md still has later accelerator lists/examples that stop at gb200, and docs/contributor/recipe.md still names only gb200-any.yaml in the wildcard-overlays section. Already flagged in the previous GB300 docs-sync review.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/user/cli-reference.md` at line 431, the docs still only list GB200 in
accelerator examples. Update all occurrences to include GB300: add "gb300" to
the `--accelerator` / `--gpu` table entry (so the list becomes h100, h200,
gb200, gb300, b200, a100, l40, rtx-pro-6000), update any example command
lines/snippets that stop at gb200 to include gb300, and in the wildcard-overlays
section add `gb300-any.yaml` alongside `gb200-any.yaml` (search for the literal
"gb200" and "gb200-any.yaml" and add corresponding GB300 entries to keep
examples, overlays, and references in sync).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 28a0e810-5c74-4718-8eb8-143b8fc38657

📥 Commits

Reviewing files that changed from the base of the PR and between 941f01a and 7f8dc29.

📒 Files selected for processing (27)
  • .claude/skills/analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go

Resolves the GB300 recipe gap (NVIDIA#1318): GB300 (Blackwell Ultra) was an
undeclared accelerator with zero overlays. Declare it and add the EKS
overlay set, modeled on the existing GB200 EKS overlays.

Go:
- Declare CriteriaAcceleratorGB300 ("gb300") in pkg/recipe/criteria.go,
  wire ParseAccelerator and GetCriteriaAcceleratorTypes, update the
  accelerator enum doc comments (criteria/doc/fingerprint).
- Update TestGetCriteriaAcceleratorTypes expectation.
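
A rough sketch of what the Go wiring described above could look like. Only the identifiers CriteriaAcceleratorGB300, ParseAccelerator, and GetCriteriaAcceleratorTypes come from this commit message. The type name `CriteriaAcceleratorType`, the function signatures, and the reduced three-value list are assumptions for illustration and may not match pkg/recipe/criteria.go.

```go
package main

import (
	"fmt"
	"strings"
)

// CriteriaAcceleratorType is an assumed type name for the accelerator enum.
type CriteriaAcceleratorType string

const (
	CriteriaAcceleratorGB200 CriteriaAcceleratorType = "gb200"
	CriteriaAcceleratorGB300 CriteriaAcceleratorType = "gb300" // newly declared
	CriteriaAcceleratorB200  CriteriaAcceleratorType = "b200"
)

// knownAccelerators is a shortened list; the real enum has more values.
var knownAccelerators = []CriteriaAcceleratorType{
	CriteriaAcceleratorGB200,
	CriteriaAcceleratorGB300,
	CriteriaAcceleratorB200,
}

// GetCriteriaAcceleratorTypes returns the accepted values as strings
// (assumed signature).
func GetCriteriaAcceleratorTypes() []string {
	out := make([]string, 0, len(knownAccelerators))
	for _, a := range knownAccelerators {
		out = append(out, string(a))
	}
	return out
}

// ParseAccelerator maps case-insensitive user input to a known accelerator
// (assumed signature and error text).
func ParseAccelerator(s string) (CriteriaAcceleratorType, error) {
	want := CriteriaAcceleratorType(strings.ToLower(strings.TrimSpace(s)))
	for _, a := range knownAccelerators {
		if a == want {
			return a, nil
		}
	}
	return "", fmt.Errorf("unknown accelerator %q (valid: %s)",
		s, strings.Join(GetCriteriaAcceleratorTypes(), ", "))
}

func main() {
	a, err := ParseAccelerator("GB300")
	fmt.Println(a, err)
}
```

Driving both functions from one list keeps the parser and the advertised enum values from drifting apart, which is the kind of mismatch the updated TestGetCriteriaAcceleratorTypes expectation guards against.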

Overlays (mirror gb200-eks-*):
- gb300-any: deployment-phase floor (>= v25.10.0; GB300 shipped in the
  same gpu-operator release as GB200).
- gb300-eks-training / -inference: EFA kernel-module-params preManifest,
  cdi + gdrcopy, nfd topologyUpdater, NVLS+NET perf checks (NVLS carries
  over unchanged — same 5th-gen NVLink as GB200; NET is a loose floor
  GB300's CX8 should exceed).
- gb300-eks-ubuntu-{training,inference}, -training-kubeflow,
  -inference-dynamo. Dynamo inherits the GB200 inference-perf gate as a
  loose floor (GB300 has more HBM3e / FP4, so it will not false-fail).

Nodewright reuses the gb200 tuning profile (shared ARM64 Grace host and
Blackwell NVL72 platform; no gb300-specific package). Per the nodewright
maintainer this is expected to work, with one caveat noted in-file: GB300
may hit the DOCA issue the AWS GPU AMI already works around.

Enum/doc audit: api/aicr/v1/server.yaml enum blocks, bug_report.yml
dropdowns, cli-reference / api-reference / recipe.md accelerator lists.

K8s floors and gpu-operator floor match GB200 (1.32.4; 1.34 for the DRA
dynamo leaf; Deployment.gpu-operator.version >= v25.10.0).
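
For readers unfamiliar with version floors, the following sketch shows one way a floor such as ">= v25.10.0" can be checked. It is not the recipe engine's implementation: `parseVersion` and `meetsFloor` are hypothetical helpers, and the sketch assumes plain MAJOR.MINOR.PATCH versions with an optional leading "v" and no pre-release suffix.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion converts "v25.10.3" or "1.32.4" into [major, minor, patch].
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(s), "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("bad component %q in %q: %w", p, s, err)
		}
		v[i] = n
	}
	return v, nil
}

// meetsFloor reports whether version >= floor, comparing components
// numerically from major to patch.
func meetsFloor(version, floor string) (bool, error) {
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	want, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := range got {
		if got[i] != want[i] {
			return got[i] > want[i], nil
		}
	}
	return true, nil
}

func main() {
	ok, err := meetsFloor("v25.10.1", "v25.10.0")
	fmt.Println(ok, err)
	ok, err = meetsFloor("v25.3.4", "v25.10.0")
	fmt.Println(ok, err)
}
```

The comparison is numeric per component, so v25.3.4 falls below the v25.10.0 floor even though "3" sorts after "10" as a string.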

Refs: NVIDIA#1318
@yuanchen8911 yuanchen8911 force-pushed the feat/gb300-eks-overlays branch from 2810159 to 731229c Compare June 11, 2026 21:24
@yuanchen8911 yuanchen8911 merged commit 4b817ce into NVIDIA:main Jun 11, 2026
210 of 212 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 12, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 22, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 23, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 23, 2026

Labels

area/api area/ci area/cli area/docs area/recipes size/XL theme/recipes Recipe expansion, overlays, mixins, and component registry

Development

Successfully merging this pull request may close these issues.

Add and validate GB300 recipe overlays

2 participants