Skip to content

refactor(recipes): retire gb200-any-training.yaml wildcard (#1052)#1068

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/1052-retire-gb200-any-training-wildcard
May 28, 2026
Merged

refactor(recipes): retire gb200-any-training.yaml wildcard (#1052)#1068
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/1052-retire-gb200-any-training-wildcard

Conversation

@yuanchen8911

Copy link
Copy Markdown
Contributor

Summary

Retire the gb200-any-training.yaml criteria-wildcard overlay. NCCL bandwidth thresholds (the only content the wildcard carried) belong per-leaf, not on a cross-service wildcard — each service has a different network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and the same number is rarely correct for two fabrics.

The fabric-independent deployment-phase floor (gpu-operator version pin + 4 standard health checks) remains on gb200-any.yaml — that pattern (established by #1001) is correct because the version requirement applies across every service.

Motivation / Context

Per #1052: the wildcard pattern was useful before any service-specific GB200 leaves existed but is misleading now. EKS GB200 training leaves carry per-leaf NCCL thresholds; the wildcard was load-bearing only for the OKE GB200 leaves (which had no perf block of their own). After the OCI testbed work in #1007 lands, gb200-oke-training.yaml will ship its own measurement-anchored constraint and the warn-only data gap closes.

The companion b200-any-training.yaml was a stub — no concrete B200 service-bound leaves existed yet — and is retired by #1004 (PR #1053) as a natural consequence of adding the first concrete B200 leaf.

Fixes: #1052
Related: #1004 (PR #1053 — retires the B200 wildcard), #1007 (OCI testbed; supplies the OKE GB200 NCCL measurement that closes the warn-only gap)

Type of Change

  • Refactoring (no functional changes for in-tree recipes today — gb200-oke-* training leaves lose their inherited NCCL constraint, which is currently a warn-only floor recommendation in non-strict mode)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — floor-test header refresh only; no code change
  • Docs/examples (docs/) — three pages updated to reflect the retirement

Implementation Notes

Deletion. recipes/overlays/gb200-any-training.yaml removed. It carried only validation.performance.checks/constraints for nccl-all-reduce-bw >= 720.

OKE GB200 training data gap. gb200-oke-training, gb200-oke-ubuntu-training, and gb200-oke-ubuntu-training-kubeflow no longer inherit a performance block. Took option (b) from the issue: defer the real OCI testbed measurement to #1007, leave the gap as a warn-only floor-test result in the meantime. TestOverlayValidationPhaseFloor treats missing performance as t.Log(WARN ...) in default (non-strict) mode, so a knownGaps entry would be stale-flagged by the hygiene guard — instead, the knownGaps header comment in pkg/recipe/validation_phase_floor_test.go documents the pending gap and notes that a strict-mode CI toggle would need three entries referencing #1007.

gb200-any.yaml comment refresh. Dropped the "Companion to gb200-any-training.yaml" intro; explains why the intent-scoped sibling was retired and points at #1052.

Doc updates (all three pages the issue called out):

  • docs/integrator/recipe-development.md — switch the criteria-wildcard example from gb200-any-training.yaml to gb200-any.yaml (the accelerator-scoped deployment floor). Add the caveat that per-fabric values don't belong on wildcards.
  • docs/contributor/data.md — refresh the wildcard explanation section, the resolver-tracing prose example, and the Mermaid flowchart to use gb200-any throughout. Update the naming convention to mention -any (deployment-floor pattern) alongside -any-<intent>.
  • docs/design/005-overlay-refactoring.md — remove b200-any-training and gb200-any-training from the overlay tree; brief historical note explains the retirement (cross-referencing feat(recipes): add concrete GKE B200 service-bound overlays #1004 and refactor(recipes): retire *-any-training.yaml wildcards (NCCL thresholds per-leaf) #1052).

Out of scope.

Testing

# Floor test
go test -count=1 -run TestOverlayValidationPhaseFloor ./pkg/recipe/
# ok  	github.com/NVIDIA/aicr/pkg/recipe	0.738s

# Acceptance: hydrate the OKE GB200 training recipe
go run ./cmd/aicr recipe --service oke --accelerator gb200 --intent training --format yaml
# resolves: 11 components, 6 overlays
# (base, monitoring-hpa, gb200-any, oke, oke-training, gb200-oke-training)
# deployment phase carries Deployment.gpu-operator.version >= v25.10.0 via gb200-any.yaml
# performance phase absent (warn-only floor gap — see Implementation Notes)

# Full gate
unset GITLAB_TOKEN && make qualify
# 22 chainsaw tests pass; no vulnerabilities; license check clean.

Risk Assessment

  • Low — Single overlay deletion plus doc refreshes. Behavior delta is the loss of an inherited NCCL constraint on three OKE GB200 training leaves, which today is a warn-only floor recommendation (no CI failure, no production validator change). EKS GB200 leaves are unaffected (carry their own perf blocks). Inference recipes are unaffected (the wildcard was intent: training). Trivial to revert (one file restore + comment rollbacks).

Rollout notes: No flag, no opt-in. The next aicr recipe --service oke --accelerator gb200 --intent training hydration produces a recipe without the cross-service NCCL constraint; everything else is identical. When #1007 closes with real OCI numbers, gb200-oke-training.yaml ships its own performance block and the warn drains.

Checklist

  • Tests pass locally (make test with -race) — via make qualify
  • Linter passes (make lint) — via make qualify
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (deletion + doc refresh; floor-test header comment refreshed)
  • I updated docs if user-facing behavior changed — three doc pages refreshed
  • Changes follow existing patterns in the codebase — mirrors the per-leaf perf model the EKS GB200 leaves already use
  • Commits are cryptographically signed (git commit -S)

@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR deletes the intent-scoped gb200-any-training overlay, updates recipes/overlays/gb200-any.yaml comments to document the retirement and the accelerator-scoped deployment-floor, expands TestOverlayValidationPhaseFloor comments about knownGaps and pending OCI performance gaps, and updates design/integrator/contributor docs and worked examples/diagrams to reference gb200-any and deployment-phase checks instead of the retired training wildcard.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • NVIDIA/aicr#1001: The main PR adjusts docs and comments around the accelerator-scoped gb200-any deployment-phase floor and updates the TestOverlayValidationPhaseFloor/knownGaps commentary in pkg/recipe/validation_phase_floor_test.go, which aligns with the retrieved PR’s core changes to deliver and enforce that same deployment-phase floor via validation_phase_floor_test.go and the gb200-any overlay.

Suggested labels

area/tests

Suggested reviewers

  • xdu31
  • lockwobr
  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description comprehensively explains the retirement of gb200-any-training.yaml, motivations, implementation details, testing, and risk assessment, all directly related to the changeset.
Linked Issues check ✅ Passed The PR fully addresses all primary coding objectives from #1052: deletes gb200-any-training.yaml, implements option (b) by documenting the warn-only gap in the test header comment rather than knownGaps, and updates all three required documentation files (docs/design/005-overlay-refactoring.md, docs/integrator/recipe-development.md, docs/contributor/data.md).
Out of Scope Changes check ✅ Passed All changes are in-scope: overlay deletion, test comment refresh, and doc updates directly support #1052's objective. Explicitly scoped-out work (b200-any-training removal via #1004, OCI testbed measurements via #1007) is not included.
Title check ✅ Passed The pull request title accurately describes the main change: retiring the gb200-any-training.yaml wildcard overlay file and updating related documentation to reflect this refactoring.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the fix/1052-retire-gb200-any-training-wildcard branch from 2986a18 to 5572563 Compare May 27, 2026 21:19
@yuanchen8911 yuanchen8911 force-pushed the fix/1052-retire-gb200-any-training-wildcard branch 2 times, most recently from 5009c31 to 76de5af Compare May 27, 2026 21:25

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/data.md`:
- Line 490: Replace the awkward phrase "wants kept" in the wildcard override
note with clearer wording such as "wants to keep" (or "to be kept") so the
sentence reads: "a service-specific leaf must restate every `check` it wants to
keep AND every constraint..." — edit the sentence containing "wants kept" in the
docs content to use the improved phrasing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9ea4cec5-8326-40be-84f6-e432fee63511

📥 Commits

Reviewing files that changed from the base of the PR and between 5009c31 and 76de5af.

📒 Files selected for processing (6)
  • docs/contributor/data.md
  • docs/design/005-overlay-refactoring.md
  • docs/integrator/recipe-development.md
  • pkg/recipe/validation_phase_floor_test.go
  • recipes/overlays/gb200-any-training.yaml
  • recipes/overlays/gb200-any.yaml
💤 Files with no reviewable changes (1)
  • recipes/overlays/gb200-any-training.yaml

Comment thread docs/contributor/data.md Outdated
The criteria-wildcard overlay `gb200-any-training.yaml` carried a single
`nccl-all-reduce-bw >= 720` constraint applied to every GB200 + training
query regardless of service. The companion `b200-any-training.yaml` had
the same shape. The pattern is misleading: each service has a different
network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and a single
cross-service threshold is rarely correct for two fabrics. NCCL
bandwidth thresholds belong on the concrete service-bound leaf, anchored
to a measurement on that specific fabric.

The fabric-independent deployment-phase floor (4 standard health checks
+ `Deployment.gpu-operator.version` pin) remains on `gb200-any.yaml` —
that pattern (established by NVIDIA#1001) is correct because the same
gpu-operator version requirement applies across every service for the
accelerator.

Changes:

- Delete `recipes/overlays/gb200-any-training.yaml`.
- Update `recipes/overlays/gb200-any.yaml` comment block: drop the
  "Companion to gb200-any-training.yaml" intro, explain why the
  intent-scoped sibling was retired.
- Doc updates with the same rationale:
    - `docs/integrator/recipe-development.md` — switch the
      criteria-wildcard example from `gb200-any-training.yaml` to
      `gb200-any.yaml` (deployment-phase floor); add the
      "per-fabric values don't belong here" caveat.
    - `docs/contributor/data.md` — refresh the wildcard explanation
      section, the resolver-tracing example, and the Mermaid flowchart
      to use `gb200-any` throughout; rename `-any-` naming convention
      to allow `-any` (deployment-floor pattern).
    - `docs/design/005-overlay-refactoring.md` — drop the
      `b200-any-training` / `gb200-any-training` lines from the overlay
      tree and leave a brief historical note explaining the retirement
      across PRs NVIDIA#1004 (b200 wildcard) and NVIDIA#1052 (gb200 wildcard).
- Update the `knownGaps` header comment in
  `pkg/recipe/validation_phase_floor_test.go` to document the OKE GB200
  training performance data gap (warn-only in non-strict mode today;
  closes when NVIDIA#1007 lands an OCI testbed measurement). The map stays
  empty: the floor test treats missing performance as a `t.Log(WARN)`
  in default mode, so a knownGaps entry would be stale-flagged.

The `b200-any-training.yaml` deletion is in flight via NVIDIA#1004 (PR NVIDIA#1053);
the two issues are independent.

Acceptance (per NVIDIA#1052):

1. `recipes/overlays/gb200-any-training.yaml` removed: yes.
2. `gb200-oke-training` perf coverage: option (b) — deferred to NVIDIA#1007
   for real OCI measurements; covered as a warn-only floor gap today.
3. `go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor`: passes.
4. `aicr recipe --service oke --accelerator gb200 --intent training
   --format yaml`: resolves to 11 components, 6 overlays
   (base, monitoring-hpa, gb200-any, oke, oke-training, gb200-oke-training);
   carries the gpu-operator v25.10.0 floor and the standard 4 deployment
   checks via `gb200-any.yaml`.
5. Doc references updated.
6. `make qualify` clean.

Fixes NVIDIA#1052
Related NVIDIA#1004, NVIDIA#1007
@yuanchen8911 yuanchen8911 force-pushed the fix/1052-retire-gb200-any-training-wildcard branch from 76de5af to 5de37c3 Compare May 28, 2026 01:34
@yuanchen8911 yuanchen8911 changed the title WIP: refactor(recipes): retire gb200-any-training.yaml wildcard (#1052) refactor(recipes): retire gb200-any-training.yaml wildcard (#1052) May 28, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 01:50
@yuanchen8911 yuanchen8911 requested review from a team as code owners May 28, 2026 01:50
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 28, 2026 01:57
@mchmarny mchmarny merged commit a521c59 into NVIDIA:main May 28, 2026
199 of 201 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor(recipes): retire *-any-training.yaml wildcards (NCCL thresholds per-leaf)

2 participants