feat(recipes): add concrete GKE B200 service-bound overlays #1053

Merged: yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/b200-gke-overlays-1004 (May 28, 2026)

@yuanchen8911 (Contributor) commented May 27, 2026

Summary

Adds the first concrete service-bound overlays for the b200 accelerator on GKE COS — b200-any.yaml (deployment-phase floor wildcard) plus four leaves (training, inference, training-kubeflow, inference-dynamo). Retires the placeholder b200-any-training.yaml wildcard; H100 / RTX Pro 6000 already follow the no-*-any-training.yaml shape, so B200 reaches parity.

Motivation / Context

Before this PR, aicr recipe --service gke --accelerator b200 --intent <any> resolved only the wildcard NCCL threshold from b200-any-training.yaml (added by #436), with no GKE COS GPU operator config and no platform variant.

Anchored on the production cluster nvcf-dgxc-k8s-gcp-azne1-prd7 (GCP, asia-northeast1, NVCF prod) — confirmed against its cluster-spec.yaml, ArgoCD app set, and gpu-operator values file in dgxcloud/mk8s/manifests:clusters/nvcf-prod/nvcf-dgxc-k8s-gcp-azne1-prd7/.

Fixes: #1004
Related: #1001 (per-accelerator deployment-floor wildcard pattern), #1052 (*-any-training.yaml retirement), #969 (validation-phase coverage audit), #436 (B200 enum + stub wildcard)

Type of Change

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

(YAML-only — no Go source changes.)

Implementation Notes

Layout mirrors h100-gke-cos-*.yaml (the closest reference pattern; GKE is COS-only by AICR convention — no -ubuntu- variant).

| New overlay | Inherits from | Validation contract |
|---|---|---|
| b200-any | base | Accelerator-wide deployment-phase floor (4 standard checks + Deployment.gpu-operator.version >= v25.10.0). Mirrors gb200-any.yaml / h100-any.yaml / rtx-pro-6000-any.yaml from #1001. |
| b200-gke-cos-training | gke-cos-training | Self-declares deployment + performance + conformance; gpu-operator floor >= v25.10.0; K8s >= 1.32; nccl-all-reduce-bw >= 100 (placeholder, see Networking note). |
| b200-gke-cos-inference | gke-cos-inference | Inherits deployment from b200-any.yaml, conformance from gke-cos.yaml. No performance phase (single-card inference, no NCCL fabric to gate). |
| b200-gke-cos-training-kubeflow | b200-gke-cos-training | Adds kubeflow-trainer component for TrainJob distributed training. |
| b200-gke-cos-inference-dynamo | b200-gke-cos-inference | Adds DRA + Dynamo + Grove; K8s >= 1.34 (DRA GA); self-declares deployment + performance + conformance. |

B200-vs-GB200 deltas honored:

  • Host CPU is x86 (datacenter Blackwell on standard x86 host). Uses the real tuning-gke.yaml for nodewright-customizations (same as h100-gke-cos-*), not the GB200 no-op tuning.yaml.
  • No NVreg_GrdmaPciTopoCheckOverride=1 override — that flag exists for GB200's Grace PCI topology with EFA. GKE A4 (B200) uses RDMA over Ethernet provided by GCP's native multi-NIC fabric.
  • Single-fabric NCCL (no MNNVL / NVL72 IMEX domain — that is a GB200 feature). Single nccl-all-reduce-bw constraint, not split net + nvls.
  • gpu-operator version floor >= v25.10.0 (Blackwell support stabilized in 25.10, matches GB200 / RTX Pro 6000).
  • gpu-operator overrides: cdi: enabled + gdrcopy: enabled on both training and inference leaves — mirrors the production reference cluster's gpu-operator values (580.95.05 driver, gdrcopy.enabled: true, cdi.enabled: true). GB200/EKS sets the same pair.
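To illustrate what a version floor like `>= v25.10.0` implies, here is a small Go sketch of a numeric, component-by-component comparison. It is an assumption about how such a floor could be checked, not the validator AICR ships; `parseVersion` and `atLeast` are hypothetical names, and `v25.3.4` is a made-up version used only to show why string comparison would give the wrong answer.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v25.10.1" (or "25.10.1") into [25 10 1].
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(s), "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("bad component %q in %q: %w", p, s, err)
		}
		v[i] = n
	}
	return v, nil
}

// atLeast reports whether installed >= floor, comparing each component as an
// integer so that 25.10 sorts after 25.3.
func atLeast(installed, floor string) (bool, error) {
	a, err := parseVersion(installed)
	if err != nil {
		return false, err
	}
	b, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i], nil
		}
	}
	return true, nil // all components equal
}

func main() {
	// v25.10.1 is the operator version reported for the reference cluster.
	ok, err := atLeast("v25.10.1", "v25.10.0")
	fmt.Println(ok, err)

	// Hypothetical older release: a plain string comparison would rank it higher.
	ok, err = atLeast("v25.3.4", "v25.10.0")
	fmt.Println(ok, err)
}
```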

Networking model, no separate installer (Codex P1 / cluster-verified): GKE A4 (B200) on nvcf-dgxc-k8s-gcp-azne1-prd7 deploys no NCCL plugin installer DaemonSet: no gke-nccl-tcpxo and no GPUDirect-RDMA component. Multi-node NCCL is provided by GPU Operator (v25.10.1) with gdrcopy: enabled: true, combined with GKE A4's native multi-NIC infrastructure managed by GCP. The gke-nccl-tcpxo componentRef is intentionally not added to the B200 training leaf because its DaemonSets pin cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb and target the TCPX transport on a3-megagpu-8g (H100). They would not run on A4 nodes, and referencing the component would misrepresent the deployment model.

NCCL threshold — placeholder pending measurement: nccl-all-reduce-bw >= 100 is a conservative floor reflecting that A4 reaches high-bandwidth NCCL via gdrcopy + GKE native multi-NIC rather than via a TCPX installer — so the H100/GKE TCPXO baseline (>= 250) is not the right anchor. Tighten once an empirical number from the reference cluster is captured.

Wildcard cleanup (b200-any-training.yaml): retired in the same PR per #1052. The original >= 350 GB/s cross-cloud placeholder threshold was fabric-blind (one number for EFA, TCPX, RoCE, native multi-NIC — none correct for all). H100 already follows this shape (no h100-any-training.yaml), so B200 reaches parity. Future per-cloud B200 leaves carry their own fabric-tuned thresholds.

Inference-perf thresholds on the Dynamo leaf mirror H100's loose smoke-test floor (inference-throughput >= 5000, inference-ttft-p99 <= 200); B200 is expected to exceed both with margin.
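The threshold expressions in these notes (`>= 100`, `>= 5000`, `<= 200`) all have the shape "operator, then number". As a rough illustration of how a measured value could be gated against them, here is a Go sketch; `satisfies` is a hypothetical helper rather than AICR's validator, and every measured value below is invented for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// satisfies checks a measured value against an expression such as
// ">= 100" or "<= 200". Only these two operators are handled here.
func satisfies(expr string, measured float64) (bool, error) {
	fields := strings.Fields(expr)
	if len(fields) != 2 {
		return false, fmt.Errorf("want \"<op> <number>\", got %q", expr)
	}
	limit, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", expr, err)
	}
	switch fields[0] {
	case ">=":
		return measured >= limit, nil
	case "<=":
		return measured <= limit, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", fields[0])
	}
}

func main() {
	// Expressions come from the PR description; measured values are made up.
	checks := []struct {
		name     string
		expr     string
		measured float64
	}{
		{"nccl-all-reduce-bw", ">= 100", 180},
		{"inference-throughput", ">= 5000", 4200},
		{"inference-ttft-p99", "<= 200", 150},
	}
	for _, c := range checks {
		ok, err := satisfies(c.expr, c.measured)
		if err != nil {
			fmt.Printf("%s: error: %v\n", c.name, err)
			continue
		}
		fmt.Printf("%s %s with measured %.0f: pass=%t\n", c.name, c.expr, c.measured, ok)
	}
}
```

Because a floor is just a lower bound, tightening the placeholder later only means changing the number in the expression once a measurement from the reference cluster exists.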

Testing

# Overlay phase-floor gate
go test -v ./pkg/recipe/... -run TestOverlayValidationPhaseFloor

# Full gate
make qualify

Results:

  • TestOverlayValidationPhaseFloor passes for all 4 new leaves. No new knownGaps entries.
  • make qualify: all 22 chainsaw tests pass, coverage 77.0% (threshold 75%), golangci-lint 0 issues, no vulnerabilities, license headers OK.
  • Smoke-tests resolve cleanly:
    • aicr recipe --service gke --accelerator b200 --os cos --intent training --format yaml → 17 components, gpu-operator with cdi.enabled: true + gdrcopy.enabled: true, deployment (>= v25.10.0), performance (nccl-all-reduce-bw >= 100), conformance (10 checks).
    • aicr recipe --service gke --accelerator b200 --os cos --intent inference --format yaml → gpu-operator overrides include gdrcopy: enabled; deployment inherited from b200-any.yaml, conformance inherited from gke-cos.yaml.
    • aicr recipe --service gke --accelerator b200 --os cos --intent inference --platform dynamo --format yaml → DRA + Dynamo + Grove resolved; full validation contract; K8s >= 1.34.
    • aicr recipe --service gke --accelerator b200 --os cos --intent training --platform kubeflow --format yaml → kubeflow-trainer injected.

Coverage: YAML-only change — per CLAUDE.md the per-package coverage gate does not apply. Project-wide make test-coverage floor (75%) passes under make qualify at 77.0%.

Risk Assessment

  • Low — Isolated change (one wildcard retired, five new overlays added). Easy to revert. No existing recipe behavior changes; the new overlays only resolve when a user explicitly queries --service gke --accelerator b200.

Rollout notes: Net deletion of b200-any-training.yaml is safe — that wildcard had no concrete leaves depending on it (no b200-<svc>-*.yaml existed before this PR), and removing it brings B200 to parity with H100 / RTX Pro 6000 (neither has an *-any-training.yaml). The cross-cloud threshold it contributed (>= 350) was fabric-blind; the per-leaf >= 100 is grounded in the reference cluster's actual deployment model and will be tightened once empirical numbers land.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — existing TestOverlayValidationPhaseFloor auto-enumerates and validates the new overlays
  • I updated docs if user-facing behavior changed — no user-facing CLI/API surface changed; overlays are data, discoverable via aicr criteria list
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR renames b200-any-training → b200-any and broadens its selector to all intents for accelerator b200, replacing the NCCL perf gate with a deployment-phase floor (including Deployment.gpu-operator.version >= v25.10.0). It also adds four GKE/COS B200 overlays: b200-gke-cos-inference, b200-gke-cos-inference-dynamo (DRA/K8s >= 1.34), b200-gke-cos-training, and b200-gke-cos-training-kubeflow, each wiring componentRefs, K8s/GPU-operator version constraints, and deployment/performance/conformance validations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1001 — Introduced the per-accelerator wildcard deployment-floor pattern and gpu-operator version constraint that this PR mirrors for B200.

Suggested labels

area/docs, area/tests

Suggested reviewers

  • xdu31
  • lockwobr
  • mchmarny
🚥 Pre-merge checks: ✅ 4 passed

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | Description thoroughly covers the changes: new overlays added, old wildcard retired, B200-specific deltas explained, testing confirmed, and motivation anchored to issue #1004. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #1004 are met: new b200-any.yaml with deployment-phase floor [#1004], four concrete GKE-COS leaves with proper K8s version constraints [#1004], B200-vs-GB200 deltas honored [#1004], validation checks pass [#1004], and aicr recipe smoke-tests resolve correctly [#1004]. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope of issue #1004: new YAML overlays for GKE B200 (training, inference, training-kubeflow, inference-dynamo), b200-any.yaml wildcard, and retirement of b200-any-training.yaml. No unrelated modifications detected. |
| Title check | ✅ Passed | The title clearly and concisely describes the primary change: adding concrete GKE B200 service-bound overlays to the recipes directory. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-inference-dynamo.yaml`:
- Around line 46-53: The PR added explicit chart version pins for the Helm
release named "dynamo-platform" (version "1.0.2") in the overlay recipe, so
regenerate the bill-of-materials docs and commit the output: run make bom-docs
locally, verify the regenerated docs/user/container-images.md reflects the new
chart pins, and add/commit that updated docs/user/container-images.md to this PR
alongside the recipe change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 924479a7-6c13-4888-83ff-e6a316b2c7e0

📥 Commits

Reviewing files that changed from the base of the PR and between 506507b and ea71456.

📒 Files selected for processing (5)
  • recipes/overlays/b200-any.yaml
  • recipes/overlays/b200-gke-cos-inference-dynamo.yaml
  • recipes/overlays/b200-gke-cos-inference.yaml
  • recipes/overlays/b200-gke-cos-training-kubeflow.yaml
  • recipes/overlays/b200-gke-cos-training.yaml

Comment thread recipes/overlays/b200-gke-cos-inference-dynamo.yaml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 40-49: The YAML comments claim gke-nccl-tcpxo is intentionally
omitted while the PR objectives require adding GPUDirect-TCPXO for B200;
reconcile by either (A) actually adding the gke-nccl-tcpxo component to this
overlay (insert a component entry for "gke-nccl-tcpxo" alongside "gpu-operator"
and enable any B200-specific selectors/taints/labels required for B200 nodes),
or (B) update the header comment/PR objectives to state that gke-nccl-tcpxo is
intentionally excluded for this overlay; adjust the text referring to
GPUDirect-TCPXO and the omitted component so code and objectives match.
- Around line 102-103: Update the NCCL performance floor to match the training
acceptance target by changing the value for the key "nccl-all-reduce-bw" from
">= 100" to ">= 250" in the overlay where that key is defined; ensure the new
threshold is applied wherever "nccl-all-reduce-bw" is set so the validation gate
enforces the intended >= 250 target.
ℹ️ Review info

Run ID: 485f34fa-a5ab-4964-b2ef-36076aa0e6d8

📥 Commits

Reviewing files that changed from the base of the PR and between ea71456 and bead546.

Comment thread recipes/overlays/b200-gke-cos-training.yaml
Comment thread recipes/overlays/b200-gke-cos-training.yaml
@yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from bead546 to f6ea784 (May 27, 2026 19:55)

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 64-65: In recipes/overlays/b200-gke-cos-training.yaml the
manifestFiles entry incorrectly references
components/nodewright-customizations/manifests/tuning-gke.yaml (which does not
exist at repo root); change the manifestFiles value to
recipes/components/nodewright-customizations/manifests/tuning-gke.yaml so it
points to the actual tuning-gke.yaml file and then run yamllint against
recipes/overlays/b200-gke-cos-training.yaml to ensure YAML formatting/lint rules
pass.
ℹ️ Review info

Run ID: 066d3d72-226d-4a97-aec5-d6de836b81d2

📥 Commits

Reviewing files that changed from the base of the PR and between bead546 and f6ea784.

Comment thread recipes/overlays/b200-gke-cos-training.yaml
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
The criteria-wildcard overlay `gb200-any-training.yaml` carried a single
`nccl-all-reduce-bw >= 720` constraint applied to every GB200 + training
query regardless of service. The companion `b200-any-training.yaml` had
the same shape. The pattern is misleading: each service has a different
network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and a single
cross-service threshold is rarely correct for two fabrics. NCCL
bandwidth thresholds belong on the concrete service-bound leaf, anchored
to a measurement on that specific fabric.

The fabric-independent deployment-phase floor (4 standard health checks
+ `Deployment.gpu-operator.version` pin) remains on `gb200-any.yaml` —
that pattern (established by NVIDIA#1001) is correct because the same
gpu-operator version requirement applies across every service for the
accelerator.

Changes:

- Delete `recipes/overlays/gb200-any-training.yaml`.
- Update `recipes/overlays/gb200-any.yaml` comment block: drop the
  "Companion to gb200-any-training.yaml" intro, explain why the
  intent-scoped sibling was retired.
- Doc updates with the same rationale:
    - `docs/integrator/recipe-development.md` — switch the
      criteria-wildcard example from `gb200-any-training.yaml` to
      `gb200-any.yaml` (deployment-phase floor); add the
      "per-fabric values don't belong here" caveat.
    - `docs/contributor/data.md` — refresh the wildcard explanation
      section, the resolver-tracing example, and the Mermaid flowchart
      to use `gb200-any` throughout; rename `-any-` naming convention
      to allow `-any` (deployment-floor pattern).
    - `docs/design/005-overlay-refactoring.md` — drop the
      `b200-any-training` / `gb200-any-training` lines from the overlay
      tree and leave a brief historical note explaining the retirement
      across PRs NVIDIA#1004 (b200 wildcard) and NVIDIA#1052 (gb200 wildcard).
- Update the `knownGaps` header comment in
  `pkg/recipe/validation_phase_floor_test.go` to document the OKE GB200
  training performance data gap (warn-only in non-strict mode today;
  closes when NVIDIA#1007 lands an OCI testbed measurement). The map stays
  empty: the floor test treats missing performance as a `t.Log(WARN)`
  in default mode, so a knownGaps entry would be stale-flagged.

The `b200-any-training.yaml` deletion is in flight via NVIDIA#1004 (PR NVIDIA#1053);
the two issues are independent.

Acceptance (per NVIDIA#1052):

1. `recipes/overlays/gb200-any-training.yaml` removed: yes.
2. `gb200-oke-training` perf coverage: option (b) — deferred to NVIDIA#1007
   for real OCI measurements; covered as a warn-only floor gap today.
3. `go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor`: passes.
4. `aicr recipe --service oke --accelerator gb200 --intent training
   --format yaml`: resolves to 11 components, 6 overlays
   (base, monitoring-hpa, gb200-any, oke, oke-training, gb200-oke-training);
   carries the gpu-operator v25.10.0 floor and the standard 4 deployment
   checks via `gb200-any.yaml`.
5. Doc references updated.
6. `make qualify` clean.

Fixes NVIDIA#1052
Related NVIDIA#1004, NVIDIA#1007
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
@yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from f6ea784 to 285283e (May 28, 2026 13:57)

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
recipes/overlays/b200-gke-cos-training.yaml (1)

64-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Path issue persists: manifestFiles references incorrect location.

The path components/nodewright-customizations/manifests/tuning-gke.yaml should be recipes/components/nodewright-customizations/manifests/tuning-gke.yaml to match the actual file location in the repository structure.

📁 Proposed fix
     - name: nodewright-customizations
       type: Helm
       manifestFiles:
-        - components/nodewright-customizations/manifests/tuning-gke.yaml
+        - recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
       overrides:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/b200-gke-cos-training.yaml` around lines 64 - 65, Update the
manifestFiles entry under manifestFiles in the overlay to point to the actual
repository location by replacing the current reference
"components/nodewright-customizations/manifests/tuning-gke.yaml" with
"recipes/components/nodewright-customizations/manifests/tuning-gke.yaml" so the
tuning-gke.yaml manifest is correctly resolved.

ℹ️ Review info

Run ID: c3489fa9-c6bd-4aff-9595-92fbc46b9b87

📥 Commits

Reviewing files that changed from the base of the PR and between f6ea784 and 285283e.

@yuanchen8911 yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from 285283e to 16df4fb May 28, 2026 15:21

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-training-kubeflow.yaml`:
- Around line 32-35: Remove the redundant constraints block from the
b200-gke-cos-training-kubeflow overlay: delete the K8s.server.version constraint
(the entry with name "K8s.server.version" and value ">= 1.32") since that same
constraint is already defined in the parent overlay b200-gke-cos-training;
leaving the child overlay without its own constraints block will inherit the
parent's version floor and avoid duplicate maintenance.
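
The inheritance argument can be checked with a small model. This is a sketch under stated assumptions, not the recipe engine's implementation: the `base` key, the dictionary layout, and the "child entry overrides parent entry with the same name" merge rule are all hypothetical stand-ins for however overlays are actually resolved.

```python
def effective_constraints(name, overlays):
    """Merge constraints along the parent chain, root first.

    Assumed rule: a child entry replaces a parent entry with the same name.
    """
    chain = []
    current = name
    while current is not None:
        spec = overlays[current]
        chain.append(spec)
        current = spec.get("base")  # hypothetical parent pointer

    merged = {}
    for spec in reversed(chain):  # apply the root first, the leaf last
        for constraint in spec.get("constraints", []):
            merged[constraint["name"]] = constraint["value"]
    return merged


floor = {"name": "K8s.server.version", "value": ">= 1.32"}
overlays = {
    "b200-gke-cos-training": {"base": None, "constraints": [dict(floor)]},
    "b200-gke-cos-training-kubeflow": {
        "base": "b200-gke-cos-training",
        "constraints": [dict(floor)],  # the duplicate flagged by the review
    },
}

before = effective_constraints("b200-gke-cos-training-kubeflow", overlays)

# Drop the child's own block, as the review suggests, and resolve again.
del overlays["b200-gke-cos-training-kubeflow"]["constraints"]
after = effective_constraints("b200-gke-cos-training-kubeflow", overlays)
```

Under this merge rule the effective constraint set is identical before and after the deletion, which is the premise of the suggestion. If the real engine does not inherit constraints this way, the duplicate may be intentional.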
📥 Commits

Reviewing files that changed from the base of the PR and between 285283e and 16df4fb.

@yuanchen8911 yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch 2 times, most recently from 9ee137c to ee6f5c4 May 28, 2026 16:55
@yuanchen8911 yuanchen8911 changed the title from "WIP: feat(recipes): add concrete GKE B200 service-bound overlays" to "feat(recipes): add concrete GKE B200 service-bound overlays" May 28, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 17:08
@yuanchen8911 yuanchen8911 requested review from a team as code owners May 28, 2026 17:08
@yuanchen8911 yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from ee6f5c4 to 97c6e2b May 28, 2026 19:17
@yuanchen8911 yuanchen8911 marked this pull request as draft May 28, 2026 19:51
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 21:19
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 28, 2026 21:20
Adds the first concrete service-bound overlays for the b200 accelerator
on GKE COS: b200-any.yaml (deployment-phase floor wildcard, mirrors
gb200-any.yaml from NVIDIA#1001) plus four leaves (training, inference,
training-kubeflow, inference-dynamo). Layout mirrors h100-gke-cos-*.yaml
(GKE is COS-only by AICR convention — no -ubuntu- variant).

Retires the placeholder b200-any-training.yaml wildcard per the principle
established in NVIDIA#1052; H100 / RTX Pro 6000 already follow this shape, so
B200 reaches parity.

Anchored on the production reference cluster
nvcf-dgxc-k8s-gcp-azne1-prd7, which deploys no separate NCCL plugin
installer (no gke-nccl-tcpxo, no GPUDirect-RDMA DaemonSet) — high-
bandwidth multi-node NCCL is provided by GPU Operator with `gdrcopy`
enabled, combined with GKE A4's native multi-NIC infrastructure. Both
training and inference leaves set cdi.enabled + gdrcopy.enabled on the
gpu-operator overrides to mirror that deployment model.

B200-vs-GB200 deltas honored:
  - x86 host (vs GB200 Grace ARM) → real tuning-gke.yaml, not no-op
  - No NVreg_GrdmaPciTopoCheckOverride flag
  - Single-fabric NCCL (no MNNVL / NVL72)
  - gpu-operator floor >= v25.10.0 (Blackwell baseline)

The gke-nccl-tcpxo component is intentionally NOT added: its DaemonSets
pin to nvidia-h100-mega-80gb and target the TCPX transport on
a3-megagpu-8g (H100), so they would not run on A4 nodes and would
misrepresent the deployment model.

nccl-all-reduce-bw threshold is a placeholder (>= 100); the H100 GKE
TCPXO baseline (>= 250) is not the right anchor for A4's gdrcopy +
native-multi-NIC model. Tighten once an empirical measurement from the
reference cluster is captured.

Fixes: NVIDIA#1004
Related: NVIDIA#1001, NVIDIA#1052, NVIDIA#969, NVIDIA#436
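
The floors named in the commit message (`>= v25.10.0` for gpu-operator, `>= 100` as the placeholder nccl-all-reduce-bw threshold, `>= 250` as the H100 TCPXO baseline) are all "at least" constraints. A minimal sketch of how such a floor could be evaluated follows; `meets_floor` and `parse_version` are hypothetical helpers, not the AICR validator, and the observed values in the usage lines are made up for illustration rather than measurements from any cluster.

```python
import re


def parse_version(text):
    """Turn 'v25.10.0' or '1.32' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_floor(constraint, observed):
    """Evaluate a '>= X' constraint against an observed value."""
    match = re.fullmatch(r">=\s*(\S+)", constraint.strip())
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    floor = match.group(1)
    if floor.startswith("v") or "." in floor:
        # Compare versions numerically, so v25.3.1 sorts below v25.10.0.
        return parse_version(str(observed)) >= parse_version(floor)
    return float(observed) >= float(floor)


# Illustrative inputs only; none of these are real measurements.
operator_ok = meets_floor(">= v25.10.0", "v25.10.1")
operator_old = meets_floor(">= v25.10.0", "v25.3.1")
passes_placeholder = meets_floor(">= 100", 180)
passes_h100_baseline = meets_floor(">= 250", 180)
```

Comparing versions as integer tuples avoids the string-comparison trap where `v25.3.1` would sort above `v25.10.0`. The last two lines show why the choice of anchor matters: the same hypothetical bandwidth figure passes the placeholder floor but would fail the H100 baseline.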
@yuanchen8911 yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from 97c6e2b to 1edb775 May 28, 2026 21:21
@yuanchen8911 yuanchen8911 merged commit bbf8176 into NVIDIA:main May 28, 2026
121 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 25, 2026
…dcard guide

recipe-development.md's criteria-wildcard section credited only the
gb200-any-training.yaml retirement (NVIDIA#1052), but b200-any-training.yaml
was likewise retired in NVIDIA#1053 (both -any-training wildcards are gone on
main; b200-any.yaml / gb200-any.yaml are the live deployment-floor
overlays). Name both so recipe authors don't reintroduce the retired
B200 cross-service NCCL-threshold pattern.

Doc-only.