revert: GB300 EKS overlays (#1319) — unresolved issues #1328
…VIDIA#1319)" This reverts commit 4b817ce.
Recipe evidence check: No leaf overlays affected by this PR. This gate is warning-only and never blocks merge.
Walkthrough: This PR removes the GB300 accelerator type from the AICR codebase. The change eliminates the … Estimated code review effort: 4 (Complex), ~60 minutes.
Summary
Revert the merged GB300 EKS recipe work (#1319) until the remaining GB300 issues are resolved. Hands-on validation on a live GB300 cluster surfaced a bundle of unresolved problems, so the GB300 recipe is not ready to ship as-is.
Why revert
GB300 validation could not be completed and uncovered several open issues (most are runtime/validator gaps the recipe depends on, not just the overlay YAML):
vllm-runtime:1.0.2 (CUDA 12) crash-loops on flashinfer (no kernel image available). See: Add and validate GB300 recipe overlays #1318
Full GB300 sign-off also needs a fresh, clean GB300 cluster (the one available is platform-managed with a pre-installed stack, so a clean deploy-and-validate wasn't possible).
Scope
Clean revert of the single squash commit 4b817ce5. Backs out: the gb300 accelerator enum (pkg/recipe, pkg/fingerprint), the GB300 EKS overlays (recipes/overlays/gb300-*), NCCL validator wiring (validators/performance), and the OpenAPI/CLI/docs enum entries. Reopens #1318.
27 files changed, +28 / −560.
Testing
No dangling gb300 code references after revert. (make qualify to be run before marking ready.)
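The dangling-reference check could be scripted roughly as follows. This is a minimal sketch, not the project's actual tooling: the helper name find_dangling_refs, the set of skipped directories, and the case-insensitive match on "gb300" are all assumptions made for illustration.

```python
import os
import re
from pathlib import Path

# Hypothetical helper, not part of the repository: walk a source tree and
# report every line that still mentions the reverted accelerator name.
SKIP_DIRS = {".git", "vendor", "node_modules"}  # assumed; adjust to the repo layout


def find_dangling_refs(root, needle="gb300"):
    """Return (relative_path, line_number, stripped_line) for each match under root."""
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Calling find_dangling_refs(".") from the repository root returns an empty list when no reference remains. Any hits still need a manual look, since some mentions (for example in release notes) may be intentional.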