feat(recipes): add concrete GB300 EKS service-bound overlays#1319
Recipe evidence check
Affected leaf overlays: 6
How to refresh evidence
Run on a cluster matching the recipe:
aicr snapshot -o snapshot.yaml
aicr validate \
-r recipes/overlays/<slug>.yaml \
-s snapshot.yaml \
--emit-attestation ./out \
--push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml
This gate is warning-only and never blocks merge. See ADR-007 for the trust model.
📝 Walkthrough
This PR declares GB300 in the criteria/type system, extends OpenAPI enums and Criteria schema to accept gb300, adds GB300 GPU-SKU fingerprint mapping and tests, updates NCCL/preflight validators and tests to include GB300, synchronizes docs/CLI/comments, adjusts the bug-report template formatting, and adds multiple GB300 recipe overlays (wildcard, EKS training/inference, Ubuntu/Kubeflow variants, and a Dynamo DRA-enabled overlay).
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: 4 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/recipe/doc.go (1)
73-82: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add `CriteriaAcceleratorGB300` to the accelerator bullet list to keep docs internally consistent.
The package comment now lists `gb300` in summary/query sections, but the detailed accelerator bullets still skip it.
As per coding guidelines, "Ensure code comments are accurate and helpful."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/recipe/doc.go` around lines 73 - 82, The accelerator bullet list is missing CriteriaAcceleratorGB300; update the package comment in pkg/recipe/doc.go to add a bullet for CriteriaAcceleratorGB300 (e.g., "CriteriaAcceleratorGB300: NVIDIA GB300") alongside the other accelerator entries so the detailed bullets match the summary/query sections that reference gb300.
Source: Coding guidelines
pkg/cli/recipe.go (1)
109-109: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update command description to include `table` output format.
The help text says JSON/YAML only, but this command accepts `table` too (`formatFlag` + `parseRecipeOutputFormat`).
As per coding guidelines, "CLI commands should support multiple output formats: JSON, YAML, and table formats."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/recipe.go` at line 109, Update the command help/description in pkg/cli/recipe.go to mention the "table" output format in addition to JSON and YAML; locate the text associated with the command where formatFlag and parseRecipeOutputFormat are used (the command's description/help string) and change "Output can be in JSON or YAML format." to include table (e.g., "Output can be in JSON, YAML, or table format.") so it matches supported formats.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml`:
- Around line 80-81: The TTFT gate for this GB300 overlay is incorrectly set to
"inference-ttft-p99 value: \"<= 2000\""; change it to the standard
non-GB200/B200 floor by updating the inference-ttft-p99 value to "<= 1000" in
the gb300-eks-ubuntu-inference-dynamo overlay (follow the same rule used in
other *-eks-ubuntu-inference-dynamo and *-gke-cos-inference-dynamo overlays: use
"<= 1000" for all accelerators except b200/gb200 which remain "<= 2000").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 23abe7f8-e0bb-4820-9e4e-11847976cbf8
📒 Files selected for processing (18)
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/fingerprint/doc.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
1b6b22e to 8147723
Actionable comments posted: 1
♻️ Duplicate comments (1)
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)
80-81: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use the GB300 TTFT floor used by other non-GB200/B200 Dynamo inference overlays.
This issue was already flagged in a previous review. For `*-eks-ubuntu-inference-dynamo` overlays, the WIP gate convention is `inference-ttft-p99 <= 1000` for accelerators other than `b200`/`gb200`. GB300 (Blackwell Ultra) should follow the standard floor, not the GB200 exception.
🔧 Proposed fix
   - name: inference-ttft-p99
-    value: "<= 2000"
+    value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 - 81, Update the WIP gate for the GB300 overlay so it uses the standard non-GB200/B200 floor: locate the inference-ttft-p99 key in the gb300-eks-ubuntu-inference-dynamo overlay (the entry currently set to "<= 2000") and change its value to "<= 1000" to match other non-GB200/B200 Dynamo inference overlays.
Source: Learnings
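The TTFT floor convention the reviewer describes can be written down as a small lookup. This is a minimal sketch, assuming the rule exactly as stated in the comment (`<= 2000` for `b200`/`gb200`, `<= 1000` otherwise); `ttftP99Floor` is a hypothetical helper that does not exist in the repository, and the PR author's position is that GB300 deliberately inherits the GB200 gate as a loose floor.

```go
package main

import (
	"fmt"
	"strings"
)

// ttftP99Floor returns the inference-ttft-p99 constraint string that the
// review comment describes for Dynamo inference overlays: "<= 2000" for
// b200 and gb200, "<= 1000" for every other accelerator.
// Hypothetical helper for illustration only; not part of the repository.
func ttftP99Floor(accelerator string) string {
	switch strings.ToLower(accelerator) {
	case "b200", "gb200":
		return "<= 2000"
	default:
		return "<= 1000"
	}
}

func main() {
	for _, acc := range []string{"h100", "gb200", "gb300"} {
		fmt.Printf("%s: inference-ttft-p99 %s\n", acc, ttftP99Floor(acc))
	}
}
```

Under this reading, `gb300` falls into the default branch, which is the change the reviewer asks for.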
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/user/cli-reference.md`:
- Line 429: Update all remaining documentation references that list supported
accelerator types so they include the new "gb300" entry: search for occurrences
of the CLI flags `--accelerator` and `--gpu` and extend any accelerator
lists/examples that end at `gb200` to also include `gb300`; likewise, update the
wildcard-overlays section that references `gb200-any.yaml` to mention and
document `gb300-any.yaml` (add the overlay name `gb300-any` and any matching
criteria/value examples) so the CLI reference, examples, and recipe
wildcard-overlays remain consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 2e7e38e6-41f3-4104-a666-efec8b0381c9
📒 Files selected for processing (21)
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
8147723 to c403e42
Actionable comments posted: 3
♻️ Duplicate comments (1)
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)
80-81: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use the non-GB200/B200 Dynamo TTFT floor for GB300.
`inference-ttft-p99` is currently set to `<= 2000`; for `*-eks-ubuntu-inference-dynamo` overlays, the established WIP convention is `<= 1000` for accelerators other than `b200`/`gb200`. Keeping `<= 2000` here weakens the intended floor.
Based on learnings: in `recipes/overlays/*-{eks-ubuntu-inference-dynamo,gke-cos-inference-dynamo}.yaml`, use `inference-ttft-p99 <= 1000`, except `b200`/`gb200` which use `<= 2000`.
Suggested patch
-  - name: inference-ttft-p99
-    value: "<= 2000"
+  - name: inference-ttft-p99
+    value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 - 81, The TTFT p99 floor is incorrect for the GB300 overlay: locate the YAML key inference-ttft-p99 in the gb300-eks-ubuntu-inference-dynamo overlay and change its value from "<= 2000" to "<= 1000" so this overlay follows the convention that only b200/gb200 use "<= 2000" while other accelerators use "<= 1000".
Source: Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/recipe.md`:
- Line 153: Update the contributor guide's accelerator documentation to include
the new GB300 overlay: add `gb300` to the static `accelerator` list entry
(alongside `gb200`, etc.) and add a corresponding wildcard overlay entry named
`gb300-any.yaml` in the wildcard-overlay section so the listed overlays match
the static accelerator set; update any nearby examples or references to the
wildcard overlays to mention `gb300-any.yaml` and ensure casing/formatting
matches the existing `gb200-any.yaml` entry.
In `@pkg/fingerprint/gpu_sku_test.go`:
- Line 34: The test entry titled "GB300 wins over B200 substring" is
inconsistent because the model string "NVIDIA GB300 NVL72" lacks the B200 token;
update the test row (the tuple {"GB300 wins over B200 substring", "NVIDIA GB300
NVL72", "gb300"}) so the model string includes the B200 substring (e.g. "NVIDIA
GB300 NVL72 B200") to actually exercise the GB300 vs B200 precedence path in the
GPU SKU tests.
In `@pkg/recipe/criteria_test.go`:
- Line 638: In TestParseCriteriaAcceleratorType add an explicit table-driven
test row that exercises parsing the literal "gb300" so its behavior is asserted
independently; update the parser test cases table (the slice used within
TestParseCriteriaAcceleratorType) to include an entry with input "gb300" and the
corresponding expected parsed value (and no error) so the parser’s handling of
"gb300" is directly validated alongside the existing list-membership check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7b4bf911-7474-44a4-b1c9-37d7bafc48dc
📒 Files selected for processing (26)
.claude/skills/analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
c403e42 to 941f01a
Actionable comments posted: 1
♻️ Duplicate comments (3)
pkg/recipe/criteria_test.go (1)
638-638: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add explicit parser test cases for `gb300` in `TestParseCriteriaAcceleratorType`.
Line 638 validates that `"gb300"` is in the list returned by `GetCriteriaAcceleratorTypes()` and can be parsed via the loop, but `TestParseCriteriaAcceleratorType` (lines 64-98) still lacks direct table-driven test rows for `"gb300"` and `"GB300"` inputs. Add explicit test cases to independently guard the parser behavior.
🧪 Suggested test cases
  {"gb200", "gb200", CriteriaAcceleratorGB200, false},
+ {"gb300", "gb300", CriteriaAcceleratorGB300, false},
+ {"GB300 uppercase", "GB300", CriteriaAcceleratorGB300, false},
  {"b200", "b200", CriteriaAcceleratorB200, false},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/recipe/criteria_test.go` at line 638, Add explicit table-driven test rows for the "gb300" accelerator to TestParseCriteriaAcceleratorType: update the test's cases slice to include inputs "gb300" and "GB300" with the expected canonical value (matching what GetCriteriaAcceleratorTypes() lists) and expected no-error outcome; run the same assertion logic already used in TestParseCriteriaAcceleratorType so the parser is independently guarded for both lowercase and uppercase forms.

docs/user/cli-reference.md (1)
429-429: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Finish the gb300 docs sync.
These rows are correct, but the surrounding docs still have stale gb200-only references: `docs/user/cli-reference.md` has later accelerator examples that stop at `gb200`, and `docs/contributor/recipe.md` still names only `gb200-any.yaml` in the wildcard-overlay section. Please update those remaining references in the same pass so the docs stay internally consistent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/user/cli-reference.md` at line 429, Update remaining stale "gb200" references to include the new "gb300" entry: search the CLI accelerator example block that lists accelerators (the row showing `--accelerator | --gpu | string | ... gb200`) and expand later accelerator examples to also mention `gb300`, and update the wildcard-overlay file reference named `gb200-any.yaml` to `gb300-any.yaml` in the wildcard-overlay section; ensure any plain-text mentions or example filenames that currently stop at `gb200` are changed to include or reference `gb300` so docs are internally consistent with the new `gb300` accelerator.

recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)
80-81: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Use the standard non-GB200/B200 TTFT floor (<= 1000).
The `inference-ttft-p99` constraint is set to `<= 2000`, but per the established convention for `*-eks-ubuntu-inference-dynamo` overlays, accelerators other than b200/gb200 should use `<= 1000`. GB300 (Blackwell Ultra) is not GB200, so it should follow the standard floor.
🔧 Proposed fix
   - name: inference-ttft-p99
-    value: "<= 2000"
+    value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 - 81, The TTFT p99 constraint inference-ttft-p99 is set incorrectly to "<= 2000"; update the overlay's inference-ttft-p99 entry to use the standard non-GB200/B200 floor value "<= 1000" so GB300 follows the same rule; locate the inference-ttft-p99 key in the gb300-eks-ubuntu-inference-dynamo overlay and change its value to "<= 1000".
Source: Learnings
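The explicit `gb300`/`GB300` parser rows requested in the `pkg/recipe/criteria_test.go` comment can be illustrated with a table-driven check. This is a sketch under stated assumptions: `acceleratorType`, its constants, and `parseAccelerator` are hypothetical stand-ins defined here, and the repository's real parser may normalize input differently.

```go
package main

import (
	"fmt"
	"strings"
)

// acceleratorType and the constants below are hypothetical stand-ins for
// the repository's criteria types; they exist only for this illustration.
type acceleratorType string

const (
	acceleratorGB200 acceleratorType = "gb200"
	acceleratorGB300 acceleratorType = "gb300"
	acceleratorB200  acceleratorType = "b200"
)

// parseAccelerator is an assumed case-insensitive parser.
func parseAccelerator(s string) (acceleratorType, error) {
	switch a := acceleratorType(strings.ToLower(strings.TrimSpace(s))); a {
	case acceleratorGB200, acceleratorGB300, acceleratorB200:
		return a, nil
	}
	return "", fmt.Errorf("unknown accelerator %q", s)
}

func main() {
	// Table-driven rows: each input is asserted on its own, so a
	// regression in one accelerator cannot hide behind a loop over a list.
	cases := []struct {
		name    string
		input   string
		want    acceleratorType
		wantErr bool
	}{
		{"gb200", "gb200", acceleratorGB200, false},
		{"gb300", "gb300", acceleratorGB300, false},
		{"GB300 uppercase", "GB300", acceleratorGB300, false},
		{"b200", "b200", acceleratorB200, false},
		{"unknown", "gb999", "", true},
	}
	for _, c := range cases {
		got, err := parseAccelerator(c.input)
		ok := got == c.want && (err != nil) == c.wantErr
		fmt.Printf("%-16s ok=%v\n", c.name, ok)
	}
}
```

In a real `_test.go` file the loop body would call `t.Run(c.name, ...)` and fail through `t.Errorf` rather than printing.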
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/gb300-eks-inference.yaml`:
- Around line 36-61: The gpu-operator componentRefs block (gpu-operator, cdi,
gdrcopy, driver.kernelModuleConfig referencing kernel-module-params.yaml) is
duplicated between the two leaf overlays; extract that block into a single
shared overlay fragment (e.g., gpu-operator-common overlay) that contains the
full gpu-operator component declaration and kernel-module-params
preManifestFiles, then replace the duplicated blocks in gb300-eks-inference.yaml
and gb300-eks-training.yaml with an include/reference to that shared fragment
(or add a small build-time sync script that injects the shared fragment into
both overlays), and ensure the shared fragment preserves the same keys
(gpu-operator, cdi, gdrcopy, driver.kernelModuleConfig) so both overlays behave
identically.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 4a435d9f-1a9d-47bb-9532-d98887e89cf0
📒 Files selected for processing (26)
.claude/skills/analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
941f01a to 7f8dc29
7f8dc29 to 2810159
♻️ Duplicate comments (3)
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (1)
80-81: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Use the standard TTFT floor for non-GB200/B200 Dynamo overlays.
The `inference-ttft-p99` constraint is set to `<= 2000`, but per the established WIP convention for `*-eks-ubuntu-inference-dynamo` overlays, accelerators other than `b200`/`gb200` should use `<= 1000`. GB300 (Blackwell Ultra) is a distinct accelerator from GB200 and should follow the standard floor.
📝 Suggested fix
   - name: inference-ttft-p99
-    value: "<= 2000"
+    value: "<= 1000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml` around lines 80 - 81, The inference-ttft-p99 constraint in the gb300-eks-ubuntu-inference-dynamo overlay is set to "<= 2000" but should use the standard non-GB200/B200 TTFT floor; update the inference-ttft-p99 entry in this overlay from "<= 2000" to "<= 1000" so GB300 follows the same floor as other non-GB200/B200 Dynamo overlays.
Source: Learnings
pkg/fingerprint/gpu_sku_test.go (1)
34-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Precedence test case does not include the competing `B200` token.
The case name says GB300 should win over B200, but the input string lacks `B200`, so that precedence path isn't exercised.
Suggested patch
- {"GB300 wins over B200 substring", "NVIDIA GB300 NVL72", "gb300"},
+ {"GB300 wins over B200 substring", "NVIDIA GB300 B200 NVL72", "gb300"},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/fingerprint/gpu_sku_test.go` at line 34, Update the test case labeled "GB300 wins over B200 substring" so the input string actually contains the competing token `B200`; currently it uses "NVIDIA GB300 NVL72" so change the second field to include `B200` (for example "NVIDIA GB300 B200 NVL72" or "NVIDIA B200 GB300 NVL72") so the precedence between `gb300` and `b200` is exercised when running the test; adjust the expected winner token `"gb300"` as needed if you change token order.
Source: Coding guidelines
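Why the original test input does not exercise precedence is easier to see with a concrete matcher. The following is a minimal sketch of an ordered-substring SKU matcher; `skuRule`, `skuRules`, and `matchSKU` are hypothetical names, and the actual logic in `pkg/fingerprint/gpu_sku.go` is not shown in this thread and may work differently.

```go
package main

import (
	"fmt"
	"strings"
)

// skuRule pairs a substring token with the accelerator it maps to.
// Hypothetical type for illustration only.
type skuRule struct {
	token       string
	accelerator string
}

// Rules are checked in order, so more specific tokens come first.
// "GB200" contains "B200", so gb200 must precede b200. "GB300" does not
// contain "B200", so a plain GB300 model string never hits that conflict.
var skuRules = []skuRule{
	{"GB300", "gb300"},
	{"GB200", "gb200"},
	{"B200", "b200"},
}

// matchSKU returns the accelerator for the first rule whose token appears
// in the model string, or "" when nothing matches.
func matchSKU(model string) string {
	upper := strings.ToUpper(model)
	for _, r := range skuRules {
		if strings.Contains(upper, r.token) {
			return r.accelerator
		}
	}
	return ""
}

func main() {
	for _, model := range []string{
		"NVIDIA GB300 NVL72",      // no B200 token present
		"NVIDIA GB300 B200 NVL72", // both tokens present: order decides
		"NVIDIA GB200 NVL72",      // contains "B200" as a substring
	} {
		fmt.Printf("%q -> %q\n", model, matchSKU(model))
	}
}
```

With this ordering, only an input that carries both tokens depends on rule order, which is the situation the reviewer's suggested patch constructs.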
docs/user/cli-reference.md (1)
431-431: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Finish the remaining GB300 docs sync.
`docs/user/cli-reference.md` still has later accelerator lists/examples that stop at `gb200`, and `docs/contributor/recipe.md` still names only `gb200-any.yaml` in the wildcard-overlays section. Already flagged in the previous GB300 docs-sync review.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/user/cli-reference.md` at line 431, The docs still only list GB200 in accelerator examples; update all occurrences to include GB300: add "gb300" to the `--accelerator` / `--gpu` table entry (so the list becomes h100, h200, gb200, gb300, b200, a100, l40, rtx-pro-6000), update any example command lines/snippets that stop at gb200 to include gb300, and in the wildcard-overlays section replace or add `gb300-any.yaml` alongside `gb200-any.yaml` (search for the literal "gb200" and "gb200-any.yaml" and add corresponding GB300 entries to keep examples, overlays, and references in sync).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 28a0e810-5c74-4718-8eb8-143b8fc38657
📒 Files selected for processing (27)
.claude/skills/analyzing-snapshots/SKILL.md
.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/contributor/recipe.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/cli/recipe.go
pkg/client/v1/types.go
pkg/fingerprint/doc.go
pkg/fingerprint/gpu_sku.go
pkg/fingerprint/gpu_sku_test.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/recipe/metadata_test.go
recipes/overlays/gb300-any.yaml
recipes/overlays/gb300-eks-inference.yaml
recipes/overlays/gb300-eks-training.yaml
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
recipes/overlays/gb300-eks-ubuntu-inference.yaml
recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/gb300-eks-ubuntu-training.yaml
validators/performance/nccl_all_reduce_bw_constraint.go
validators/performance/nccl_preflight_nvreg.go
validators/performance/nccl_preflight_nvreg_test.go
validators/performance/nccl_test.go
Resolves the GB300 recipe gap (NVIDIA#1318): GB300 (Blackwell Ultra) was an undeclared accelerator with zero overlays. Declare it and add the EKS overlay set, modeled on the existing GB200 EKS overlays.

Go:
- Declare CriteriaAcceleratorGB300 ("gb300") in pkg/recipe/criteria.go, wire ParseAccelerator and GetCriteriaAcceleratorTypes, update the accelerator enum doc comments (criteria/doc/fingerprint).
- Update TestGetCriteriaAcceleratorTypes expectation.

Overlays (mirror gb200-eks-*):
- gb300-any: deployment-phase floor (>= v25.10.0; GB300 shipped in the same gpu-operator release as GB200).
- gb300-eks-training / -inference: EFA kernel-module-params preManifest, cdi + gdrcopy, nfd topologyUpdater, NVLS+NET perf checks (NVLS carries over unchanged, since GB300 uses the same 5th-gen NVLink as GB200; NET is a loose floor GB300's CX8 should exceed).
- gb300-eks-ubuntu-{training,inference}, -training-kubeflow, -inference-dynamo. Dynamo inherits the GB200 inference-perf gate as a loose floor (GB300 has more HBM3e / FP4, so it will not false-fail).

Nodewright reuses the gb200 tuning profile (shared ARM64 Grace host and Blackwell NVL72 platform; no gb300-specific package). Per the nodewright maintainer this is expected to work, with one caveat noted in-file: GB300 may hit the DOCA issue the AWS GPU AMI already works around.

Enum/doc audit: api/aicr/v1/server.yaml enum blocks, bug_report.yml dropdowns, cli-reference / api-reference / recipe.md accelerator lists.

K8s floors and gpu-operator floor match GB200 (1.32.4; 1.34 for the DRA dynamo leaf; Deployment.gpu-operator.version >= v25.10.0).

Refs: NVIDIA#1318
2810159 to 731229c
Summary
Resolve the GB300 recipe gap (#1318): declare GB300 (Blackwell Ultra) as an accelerator and add the EKS overlay set, modeled on the existing `gb200-eks-*` overlays so `aicr recipe --accelerator gb300 --service eks --intent <z>` resolves end-to-end.
Motivation / Context
GB300 was undeclared in `pkg/recipe/criteria.go` and had zero overlays. GB300 NVL72 shares GB200's topology (72 Blackwell GPUs + 36 Grace, 5th-gen NVLink/MNNVL, ARM64 Grace host) and shipped in the same gpu-operator v25.10 release as GB200, so the EKS recipe is structurally the same as GB200's, with GB300's gains (more HBM3e, higher FP4/attention) being capacity/perf rather than config. mk8s GB300 clusters are all on AWS/EKS running gpu-operator 25.10.x, confirming the EKS focus and the `v25.10` floor.
Fixes: #1318
Related: #1042 (parent epic), #1254, #1256
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `api/aicr/v1/server.yaml`
- `pkg/recipe`
- `docs/`
Implementation Notes
Go: declare `CriteriaAcceleratorGB300` (`"gb300"`), wire `ParseAccelerator` + `GetCriteriaAcceleratorTypes`, update the accelerator enum doc comments (`criteria.go`, `recipe/doc.go`, `fingerprint/{types,doc}.go`), and update `TestGetCriteriaAcceleratorTypes`.
Overlays (mirror `gb200-eks-*`):
- `gb300-any`: deployment floor `>= v25.10.0`.
- `gb300-eks-training` / `-inference`: EFA `kernel-module-params` preManifest, `cdi` + `gdrcopy`, nfd `topologyUpdater`. Training keeps NVLS+NET perf checks: NVLS `>= 500` carries over unchanged (GB300 uses the same 5th-gen NVLink, 1.8 TB/s), NET `>= 40` is a loose floor GB300's CX8 NICs should exceed.
- `gb300-eks-ubuntu-{training,inference}`, `-training-kubeflow`, `-inference-dynamo`. Dynamo inherits the GB200 `inference-perf` gate as a loose floor (GB300's extra HBM3e / FP4 means it won't false-fail).
Nodewright: reuses the gb200 tuning profile (`accelerator: gb200`), given the shared ARM64 Grace host + Blackwell NVL72 platform and no gb300-specific package. Confirmed with the nodewright maintainer as expected-to-work, with one caveat documented in-file: GB300 may hit the DOCA issue the AWS GPU AMI already works around; revisit if EFA/RDMA bring-up regresses.
Enum/doc audit: `server.yaml` enum blocks, `bug_report.yml` dropdowns, `cli-reference` / `api-reference` / `recipe.md` accelerator lists.
K8s floors: `>= 1.32.4` (training/inference), `>= 1.34` (DRA dynamo leaf), same as GB200.
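The version floors above (`>= v25.10.0`, `>= 1.32.4`, `>= 1.34`) only behave as intended when components are compared numerically. A minimal sketch, assuming plain dotted numeric versions with an optional `v` prefix and no pre-release suffixes; `parseVersion` and `meetsFloor` are hypothetical helpers, and the constraint evaluator that aicr actually uses is not shown in this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "v25.10.0" or "1.32.4" into numeric components.
// Hypothetical helper; suffixes such as "-rc1" are not handled.
func parseVersion(v string) ([]int, error) {
	trimmed := strings.TrimPrefix(strings.TrimSpace(v), "v")
	fields := strings.Split(trimmed, ".")
	parts := make([]int, 0, len(fields))
	for _, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", v, err)
		}
		parts = append(parts, n)
	}
	return parts, nil
}

// meetsFloor reports whether version >= floor, comparing component by
// component and treating missing trailing components as 0.
func meetsFloor(version, floor string) (bool, error) {
	have, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	want, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(have) || i < len(want); i++ {
		var h, w int
		if i < len(have) {
			h = have[i]
		}
		if i < len(want) {
			w = want[i]
		}
		if h != w {
			return h > w, nil
		}
	}
	return true, nil
}

func main() {
	checks := []struct{ version, floor string }{
		{"v25.10.1", "v25.10.0"},
		{"v25.3.2", "v25.10.0"}, // a string comparison would get this wrong
		{"1.32.4", "1.32.4"},
		{"1.33.1", "1.34"},
	}
	for _, c := range checks {
		ok, err := meetsFloor(c.version, c.floor)
		fmt.Printf("%s >= %s: %v (err: %v)\n", c.version, c.floor, ok, err)
	}
}
```

The `v25.3.2` case is the reason for numeric comparison: lexically `"3"` sorts after `"10"`, so a naive string comparison would wrongly accept it against the `v25.10.0` floor.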
Testing
Go lint gate satisfied (golangci-lint 0 issues on changed packages); broad `go test ./pkg/{recipe,fingerprint,validator,bundler}/...` run. Full `make qualify` e2e/scan left to CI.
Risk Assessment
Rollout notes: Additive. Other GB300 services (OKE/AKS/GKE) are follow-ups under #1042; nodewright DOCA caveat tracked in-file.
Checklist
- `go test` with `-race` via `make test` path
- `golangci-lint` 0 issues; yamllint clean
- `TestGetCriteriaAcceleratorTypes`; floor/enum auto-discovery
- `git commit -S`