revert: GB300 EKS overlays (#1319) — unresolved issues #1328

Merged: mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:revert/gb300-eks-overlays on Jun 12, 2026

Conversation

@yuanchen8911

Contributor

Draft / for discussion. Reverts #1319 (feat(recipes): add concrete GB300 EKS service-bound overlays).

Summary

Revert the merged GB300 EKS recipe work (#1319) until the remaining GB300 issues are resolved. Hands-on validation on a live GB300 cluster surfaced a bundle of unresolved problems, so the GB300 recipe is not ready to ship as-is.

Why revert

GB300 validation could not be completed and uncovered several open issues (most are runtime/validator gaps the recipe depends on, not just the overlay YAML):

Full GB300 sign-off also needs a fresh, clean GB300 cluster (the one available is platform-managed with a pre-installed stack, so a clean deploy-and-validate wasn't possible).

Scope

Clean revert of the single squash commit 4b817ce5. Backs out: the gb300 accelerator enum (pkg/recipe, pkg/fingerprint), the GB300 EKS overlays (recipes/overlays/gb300-*), NCCL validator wiring (validators/performance), and the OpenAPI/CLI/docs enum entries. Reopens #1318.

27 files changed, +28 / −560.

Testing

go build ./pkg/recipe/... ./pkg/fingerprint/... ./validators/performance/... ./pkg/cli/...   # ok
go test  ./pkg/recipe/... ./pkg/fingerprint/...                                              # ok

No dangling gb300 code references after revert. (make qualify to be run before marking ready.)

@github-actions

Contributor

Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

@yuanchen8911 yuanchen8911 requested a review from mchmarny June 12, 2026 01:11
@coderabbitai

coderabbitai Bot commented Jun 12, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f4a4f4e6-487b-41a0-88ef-ce583750d261

📥 Commits

Reviewing files that changed from the base of the PR and between 3fadaf6 and 6f41df4.

📒 Files selected for processing (27)
  • .claude/skills/analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go
💤 Files with no reviewable changes (11)
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/gpu_sku.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • pkg/recipe/metadata_test.go

Walkthrough

This PR removes the GB300 accelerator type from the AICR codebase. The change eliminates the CriteriaAcceleratorGB300 constant from the recipe package, removes GB300 from the OpenAPI specification and all user/contributor documentation, deletes 7 GB300-specific recipe overlay manifests, removes GB300 from the GPU fingerprint SKU registry, and updates the NCCL performance validators to drop GB300 support. The gb300 SKU is now mapped to the gb200 class in GPU detection logic, and all hardcoded GB300 references in parsing, validation, and test expectations are removed.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/aicr#1319: That PR added GB300 accelerator support (constants, parsing, overlays, fingerprint detection); this PR removes the same code elements, so the two PRs are direct inverses in scope.
  • NVIDIA/aicr#1233: Both PRs modify validators/performance/nccl_all_reduce_bw_constraint.go's supportedNCCLCombinations logic—this PR removes GB300, while #1233 adds OKE NVLS support for GB200.

Suggested labels

area/api, area/cli, area/docs, theme/recipes, size/XL

Suggested reviewers

  • mchmarny
  • xdu31
  • ayuskauskas
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: a revert of GB300 EKS overlays due to unresolved issues, with specific reference to the related PR #1319. |
| Description check | ✅ Passed | The description is clearly related to the changeset, providing context for the revert, listing specific unresolved issues, and detailing the scope of changes being reverted. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 12, 2026 02:13
@yuanchen8911 yuanchen8911 requested review from a team as code owners June 12, 2026 02:13
@mchmarny mchmarny merged commit 32ff487 into NVIDIA:main Jun 12, 2026
121 of 122 checks passed
2 participants