feat(recipes): add concrete GKE B200 service-bound overlays by yuanchen8911 · Pull Request #1053 · NVIDIA/aicr

yuanchen8911 · 2026-05-27T00:34:01Z

Summary

Adds the first concrete service-bound overlays for the b200 accelerator on GKE COS — b200-any.yaml (deployment-phase floor wildcard) plus four leaves (training, inference, training-kubeflow, inference-dynamo). Retires the placeholder b200-any-training.yaml wildcard; H100 / RTX Pro 6000 already follow the no-*-any-training.yaml shape, so B200 reaches parity.

Motivation / Context

Before this PR, aicr recipe --service gke --accelerator b200 --intent <any> resolved only the wildcard NCCL threshold from b200-any-training.yaml (added by #436), with no GKE COS GPU operator config and no platform variant.

Anchored on the production cluster nvcf-dgxc-k8s-gcp-azne1-prd7 (GCP, asia-northeast1, NVCF prod) — confirmed against its cluster-spec.yaml, ArgoCD app set, and gpu-operator values file in dgxcloud/mk8s/manifests:clusters/nvcf-prod/nvcf-dgxc-k8s-gcp-azne1-prd7/.

Fixes: #1004
Related: #1001 (per-accelerator deployment-floor wildcard pattern), #1052 (*-any-training.yaml retirement), #969 (validation-phase coverage audit), #436 (B200 enum + stub wildcard)

Type of Change

New feature (non-breaking change that adds functionality)
Refactoring (no functional changes): b200-any-training.yaml retirement per #1052 (refactor(recipes): retire *-any-training.yaml wildcards, NCCL thresholds per-leaf)

Component(s) Affected

Recipe engine / data (pkg/recipe)

(YAML-only — no Go source changes.)

Implementation Notes

Layout mirrors h100-gke-cos-*.yaml (the closest reference pattern; GKE is COS-only by AICR convention — no -ubuntu- variant).

| New overlay | Inherits from | Validation contract |
| --- | --- | --- |
| `b200-any` | `base` | Accelerator-wide deployment-phase floor (4 standard checks + `Deployment.gpu-operator.version >= v25.10.0`). Mirrors `gb200-any.yaml` / `h100-any.yaml` / `rtx-pro-6000-any.yaml` from #1001. |
| `b200-gke-cos-training` | `gke-cos-training` | Self-declares deployment + performance + conformance; gpu-operator floor `>= v25.10.0`; K8s `>= 1.32`; `nccl-all-reduce-bw >= 100` (placeholder, see Networking note). |
| `b200-gke-cos-inference` | `gke-cos-inference` | Inherits deployment from `b200-any.yaml`, conformance from `gke-cos.yaml`. No performance phase (single-card inference, no NCCL fabric to gate). |
| `b200-gke-cos-training-kubeflow` | `b200-gke-cos-training` | Adds `kubeflow-trainer` component for `TrainJob` distributed training. |
| `b200-gke-cos-inference-dynamo` | `b200-gke-cos-inference` | Adds DRA + Dynamo + Grove; K8s `>= 1.34` (DRA GA); self-declares deployment + performance + conformance. |

B200-vs-GB200 deltas honored:

Host CPU is x86 (datacenter Blackwell on standard x86 host). Uses the real tuning-gke.yaml for nodewright-customizations (same as h100-gke-cos-*), not the GB200 no-op tuning.yaml.
No NVreg_GrdmaPciTopoCheckOverride=1 override — that flag exists for GB200's Grace PCI topology with EFA. GKE A4 (B200) uses RDMA over Ethernet provided by GCP's native multi-NIC fabric.
Single-fabric NCCL (no MNNVL / NVL72 IMEX domain — that is a GB200 feature). Single nccl-all-reduce-bw constraint, not split net + nvls.
gpu-operator version floor >= v25.10.0 (Blackwell support stabilized in 25.10, matches GB200 / RTX Pro 6000).
gpu-operator overrides: cdi: enabled + gdrcopy: enabled on both training and inference leaves — mirrors the production reference cluster's gpu-operator values (580.95.05 driver, gdrcopy.enabled: true, cdi.enabled: true). GB200/EKS sets the same pair.
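
A version floor such as `>= v25.10.0` has to be compared numerically, part by part; a plain string comparison would rank `v25.3.4` above `v25.10.0`. The sketch below shows one way to do that comparison. It is an illustration only: `parseVersion` and `meetsFloor` are hypothetical helpers, not functions from the AICR codebase, and the sketch assumes purely numeric dotted versions (suffixes such as `-gke.1000` are rejected rather than handled).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "v25.10.1" or "1.32" into its numeric parts.
// A leading "v" is optional; any non-numeric part is an error.
func parseVersion(v string) ([]int, error) {
	fields := strings.Split(strings.TrimPrefix(strings.TrimSpace(v), "v"), ".")
	parts := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", v, err)
		}
		parts[i] = n
	}
	return parts, nil
}

// meetsFloor reports whether actual >= floor. Parts are compared left to
// right as integers; missing trailing parts count as zero ("1.32" == "1.32.0").
func meetsFloor(actual, floor string) (bool, error) {
	a, err := parseVersion(actual)
	if err != nil {
		return false, err
	}
	f, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(a) || i < len(f); i++ {
		var x, y int
		if i < len(a) {
			x = a[i]
		}
		if i < len(f) {
			y = f[i]
		}
		if x != y {
			return x > y, nil
		}
	}
	return true, nil
}

func main() {
	// v25.10.1 is the GPU Operator version named in the PR description.
	ok, err := meetsFloor("v25.10.1", "v25.10.0")
	fmt.Println(ok, err)
}
```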

Networking model, no separate installer (Codex P1 / cluster-verified): GKE A4 (B200) on nvcf-dgxc-k8s-gcp-azne1-prd7 deploys no NCCL plugin installer DaemonSet: no gke-nccl-tcpxo and no GPUDirect-RDMA component. Multi-node NCCL is provided by GPU Operator (v25.10.1) with gdrcopy.enabled: true, combined with GKE A4's native multi-NIC infrastructure managed by GCP. The gke-nccl-tcpxo componentRef is intentionally not added to the B200 training leaf: its DaemonSets pin cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb and target the TCPX transport on a3-megagpu-8g (H100), so they would not run on A4 nodes and would misrepresent the deployment model.

NCCL threshold, placeholder pending measurement: nccl-all-reduce-bw >= 100 is a conservative floor. A4 reaches high-bandwidth NCCL via gdrcopy + GKE native multi-NIC rather than via a TCPX installer, so the H100/GKE TCPXO baseline (>= 250) is not the right anchor. Tighten once an empirical number from the reference cluster is captured.

Wildcard cleanup (b200-any-training.yaml): retired in the same PR per #1052. The original >= 350 GB/s cross-cloud placeholder threshold was fabric-blind (one number for EFA, TCPX, RoCE, native multi-NIC — none correct for all). H100 already follows this shape (no h100-any-training.yaml), so B200 reaches parity. Future per-cloud B200 leaves carry their own fabric-tuned thresholds.

Inference-perf thresholds on the Dynamo leaf mirror H100's loose smoke-test floor (inference-throughput >= 5000, inference-ttft-p99 <= 200); B200 is expected to exceed both with margin.
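
The performance gates above all take the form of a metric compared against a limit (`nccl-all-reduce-bw >= 100`, `inference-throughput >= 5000`, `inference-ttft-p99 <= 200`). The following sketch shows how such an expression could be evaluated against a measured value. The `satisfies` helper is hypothetical and not taken from the AICR validator, and the measured value in `main` is an invented number for illustration, not a benchmark result from any cluster.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// satisfies evaluates a measured value against an expression of the form
// "<op> <number>", where <op> is ">=" or "<=". Anything else is an error.
func satisfies(measured float64, expr string) (bool, error) {
	fields := strings.Fields(expr)
	if len(fields) != 2 {
		return false, fmt.Errorf("expected \"<op> <number>\", got %q", expr)
	}
	limit, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return false, fmt.Errorf("invalid limit in %q: %w", expr, err)
	}
	switch fields[0] {
	case ">=":
		return measured >= limit, nil
	case "<=":
		return measured <= limit, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", fields[0])
	}
}

func main() {
	// Invented measurement, used only to show how the anchor changes the verdict.
	const measured = 180.0
	for _, expr := range []string{">= 100", ">= 250"} {
		ok, err := satisfies(measured, expr)
		fmt.Printf("nccl-all-reduce-bw %s with measured=%v: ok=%v err=%v\n", expr, measured, ok, err)
	}
}
```

The same invented measurement passes the `>= 100` placeholder and fails a `>= 250` anchor, which is why the choice of baseline for each fabric matters more than the comparison logic itself.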

Testing

# Overlay phase-floor gate
go test -v ./pkg/recipe/... -run TestOverlayValidationPhaseFloor

# Full gate
make qualify

Results:

TestOverlayValidationPhaseFloor passes for all 4 new leaves. No new knownGaps entries.
make qualify: all 22 chainsaw tests pass, coverage 77.0% (threshold 75%), golangci-lint 0 issues, no vulnerabilities, license headers OK.
Smoke-tests resolve cleanly:
- aicr recipe --service gke --accelerator b200 --os cos --intent training --format yaml → 17 components, gpu-operator with cdi.enabled: true + gdrcopy.enabled: true, deployment (>= v25.10.0), performance (nccl-all-reduce-bw >= 100), conformance (10 checks).
- aicr recipe --service gke --accelerator b200 --os cos --intent inference --format yaml → gpu-operator overrides include gdrcopy: enabled; deployment inherited from b200-any.yaml, conformance inherited from gke-cos.yaml.
- aicr recipe --service gke --accelerator b200 --os cos --intent inference --platform dynamo --format yaml → DRA + Dynamo + Grove resolved; full validation contract; K8s >= 1.34.
- aicr recipe --service gke --accelerator b200 --os cos --intent training --platform kubeflow --format yaml → kubeflow-trainer injected.

Coverage: YAML-only change — per CLAUDE.md the per-package coverage gate does not apply. Project-wide make test-coverage floor (75%) passes under make qualify at 77.0%.

Risk Assessment

Low — Isolated change (one wildcard retired, five new overlays added). Easy to revert. No existing recipe behavior changes; the new overlays only resolve when a user explicitly queries --service gke --accelerator b200.

Rollout notes: Net deletion of b200-any-training.yaml is safe — that wildcard had no concrete leaves depending on it (no b200-<svc>-*.yaml existed before this PR), and removing it brings B200 to parity with H100 / RTX Pro 6000 (neither has an *-any-training.yaml). The cross-cloud threshold it contributed (>= 350) was fabric-blind; the per-leaf >= 100 is grounded in the reference cluster's actual deployment model and will be tightened once empirical numbers land.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — existing TestOverlayValidationPhaseFloor auto-enumerates and validates the new overlays
I updated docs if user-facing behavior changed — no user-facing CLI/API surface changed; overlays are data, discoverable via aicr criteria list
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-27T00:38:25Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR renames b200-any-training → b200-any and broadens its selector to all intents for accelerator b200, replacing the NCCL perf gate with a deployment-phase floor (including Deployment.gpu-operator.version >= v25.10.0). It also adds four GKE/COS B200 overlays: b200-gke-cos-inference, b200-gke-cos-inference-dynamo (DRA/K8s >= 1.34), b200-gke-cos-training, and b200-gke-cos-training-kubeflow, each wiring componentRefs, K8s/GPU-operator version constraints, and deployment/performance/conformance validations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/aicr#1001 — Introduced the per-accelerator wildcard deployment-floor pattern and gpu-operator version constraint that this PR mirrors for B200.

Suggested labels

area/docs, area/tests

Suggested reviewers

xdu31
lockwobr
mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Description thoroughly covers the changes: new overlays added, old wildcard retired, B200-specific deltas explained, testing confirmed, and motivation anchored to issue `#1004`. |
| Linked Issues check | ✅ Passed | All coding requirements from issue `#1004` are met: new `b200-any.yaml` with deployment-phase floor [`#1004`], four concrete GKE-COS leaves with proper K8s version constraints [`#1004`], B200-vs-GB200 deltas honored [`#1004`], validation checks pass [`#1004`], and `aicr recipe` smoke-tests resolve correctly [`#1004`]. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope of issue `#1004`: new YAML overlays for GKE B200 (training, inference, training-kubeflow, inference-dynamo), `b200-any.yaml` wildcard, and retirement of `b200-any-training.yaml`. No unrelated modifications detected. |
| Title check | ✅ Passed | The title clearly and concisely describes the primary change: adding concrete GKE B200 service-bound overlays to the recipes directory. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-inference-dynamo.yaml`:
- Around line 46-53: The PR added explicit chart version pins for the Helm
release named "dynamo-platform" (version "1.0.2") in the overlay recipe, so
regenerate the bill-of-materials docs and commit the output: run make bom-docs
locally, verify the regenerated docs/user/container-images.md reflects the new
chart pins, and add/commit that updated docs/user/container-images.md to this PR
alongside the recipe change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 924479a7-6c13-4888-83ff-e6a316b2c7e0

📥 Commits

Reviewing files that changed from the base of the PR and between 506507b and ea71456.

📒 Files selected for processing (5)

recipes/overlays/b200-any.yaml
recipes/overlays/b200-gke-cos-inference-dynamo.yaml
recipes/overlays/b200-gke-cos-inference.yaml
recipes/overlays/b200-gke-cos-training-kubeflow.yaml
recipes/overlays/b200-gke-cos-training.yaml

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 40-49: The YAML comments claim gke-nccl-tcpxo is intentionally
omitted while the PR objectives require adding GPUDirect-TCPXO for B200;
reconcile by either (A) actually adding the gke-nccl-tcpxo component to this
overlay (insert a component entry for "gke-nccl-tcpxo" alongside "gpu-operator"
and enable any B200-specific selectors/taints/labels required for B200 nodes),
or (B) update the header comment/PR objectives to state that gke-nccl-tcpxo is
intentionally excluded for this overlay; adjust the text referring to
GPUDirect-TCPXO and the omitted component so code and objectives match.
- Around line 102-103: Update the NCCL performance floor to match the training
acceptance target by changing the value for the key "nccl-all-reduce-bw" from
">= 100" to ">= 250" in the overlay where that key is defined; ensure the new
threshold is applied wherever "nccl-all-reduce-bw" is set so the validation gate
enforces the intended >= 250 target.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 485f34fa-a5ab-4964-b2ef-36076aa0e6d8

📥 Commits

Reviewing files that changed from the base of the PR and between ea71456 and bead546.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 64-65: In recipes/overlays/b200-gke-cos-training.yaml the
manifestFiles entry incorrectly references
components/nodewright-customizations/manifests/tuning-gke.yaml (which does not
exist at repo root); change the manifestFiles value to
recipes/components/nodewright-customizations/manifests/tuning-gke.yaml so it
points to the actual tuning-gke.yaml file and then run yamllint against
recipes/overlays/b200-gke-cos-training.yaml to ensure YAML formatting/lint rules
pass.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 066d3d72-226d-4a97-aec5-d6de836b81d2

📥 Commits

Reviewing files that changed from the base of the PR and between bead546 and f6ea784.

The criteria-wildcard overlay `gb200-any-training.yaml` carried a single `nccl-all-reduce-bw >= 720` constraint applied to every GB200 + training query regardless of service. The companion `b200-any-training.yaml` had the same shape. The pattern is misleading: each service has a different network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and a single cross-service threshold is rarely correct for two fabrics. NCCL bandwidth thresholds belong on the concrete service-bound leaf, anchored to a measurement on that specific fabric.

The fabric-independent deployment-phase floor (4 standard health checks + `Deployment.gpu-operator.version` pin) remains on `gb200-any.yaml`. That pattern (established by NVIDIA#1001) is correct because the same gpu-operator version requirement applies across every service for the accelerator.

Changes:
- Delete `recipes/overlays/gb200-any-training.yaml`.
- Update `recipes/overlays/gb200-any.yaml` comment block: drop the "Companion to gb200-any-training.yaml" intro, explain why the intent-scoped sibling was retired.
- Doc updates with the same rationale:
  - `docs/integrator/recipe-development.md`: switch the criteria-wildcard example from `gb200-any-training.yaml` to `gb200-any.yaml` (deployment-phase floor); add the "per-fabric values don't belong here" caveat.
  - `docs/contributor/data.md`: refresh the wildcard explanation section, the resolver-tracing example, and the Mermaid flowchart to use `gb200-any` throughout; rename `-any-` naming convention to allow `-any` (deployment-floor pattern).
  - `docs/design/005-overlay-refactoring.md`: drop the `b200-any-training` / `gb200-any-training` lines from the overlay tree and leave a brief historical note explaining the retirement across PRs NVIDIA#1004 (b200 wildcard) and NVIDIA#1052 (gb200 wildcard).
- Update the `knownGaps` header comment in `pkg/recipe/validation_phase_floor_test.go` to document the OKE GB200 training performance data gap (warn-only in non-strict mode today; closes when NVIDIA#1007 lands an OCI testbed measurement). The map stays empty: the floor test treats missing performance as a `t.Log(WARN)` in default mode, so a knownGaps entry would be stale-flagged.

The `b200-any-training.yaml` deletion is in flight via NVIDIA#1004 (PR NVIDIA#1053); the two issues are independent.

Acceptance (per NVIDIA#1052):
1. `recipes/overlays/gb200-any-training.yaml` removed: yes.
2. `gb200-oke-training` perf coverage: option (b), deferred to NVIDIA#1007 for real OCI measurements; covered as a warn-only floor gap today.
3. `go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor`: passes.
4. `aicr recipe --service oke --accelerator gb200 --intent training --format yaml`: resolves to 11 components, 6 overlays (base, monitoring-hpa, gb200-any, oke, oke-training, gb200-oke-training); carries the gpu-operator v25.10.0 floor and the standard 4 deployment checks via `gb200-any.yaml`.
5. Doc references updated.
6. `make qualify` clean.

Fixes NVIDIA#1052
Related NVIDIA#1004, NVIDIA#1007

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

recipes/overlays/b200-gke-cos-training.yaml (1)
64-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Path issue persists: manifestFiles references incorrect location.

The path components/nodewright-customizations/manifests/tuning-gke.yaml should be recipes/components/nodewright-customizations/manifests/tuning-gke.yaml to match the actual file location in the repository structure.
📁 Proposed fix
     - name: nodewright-customizations
       type: Helm
       manifestFiles:
-        - components/nodewright-customizations/manifests/tuning-gke.yaml
+        - recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
       overrides:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/b200-gke-cos-training.yaml` around lines 64 - 65, Update the
manifestFiles entry under manifestFiles in the overlay to point to the actual
repository location by replacing the current reference
"components/nodewright-customizations/manifests/tuning-gke.yaml" with
"recipes/components/nodewright-customizations/manifests/tuning-gke.yaml" so the
tuning-gke.yaml manifest is correctly resolved.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c3489fa9-c6bd-4aff-9595-92fbc46b9b87

📥 Commits

Reviewing files that changed from the base of the PR and between f6ea784 and 285283e.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/b200-gke-cos-training-kubeflow.yaml`:
- Around line 32-35: Remove the redundant constraints block from the
b200-gke-cos-training-kubeflow overlay: delete the K8s.server.version constraint
(the entry with name "K8s.server.version" and value ">= 1.32") since that same
constraint is already defined in the parent overlay b200-gke-cos-training;
leaving the child overlay without its own constraints block will inherit the
parent's version floor and avoid duplicate maintenance.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8e5fae7a-b0f5-486b-a647-291083441571

📥 Commits

Reviewing files that changed from the base of the PR and between 285283e and 16df4fb.

github-actions · 2026-05-28T16:56:54Z

🌿 Preview your docs: https://nvidia-preview-feat-b200-gke-overlays-1004.docs.buildwithfern.com/aicr

Adds the first concrete service-bound overlays for the b200 accelerator on GKE COS: b200-any.yaml (deployment-phase floor wildcard, mirrors gb200-any.yaml from NVIDIA#1001) plus four leaves (training, inference, training-kubeflow, inference-dynamo). Layout mirrors h100-gke-cos-*.yaml (GKE is COS-only by AICR convention, so there is no -ubuntu- variant).

Retires the placeholder b200-any-training.yaml wildcard per the principle established in NVIDIA#1052; H100 / RTX Pro 6000 already follow this shape, so B200 reaches parity.

Anchored on the production reference cluster nvcf-dgxc-k8s-gcp-azne1-prd7, which deploys no separate NCCL plugin installer (no gke-nccl-tcpxo, no GPUDirect-RDMA DaemonSet). High-bandwidth multi-node NCCL is provided by GPU Operator with `gdrcopy` enabled, combined with GKE A4's native multi-NIC infrastructure. Both training and inference leaves set cdi.enabled + gdrcopy.enabled on the gpu-operator overrides to mirror that deployment model.

B200-vs-GB200 deltas honored:
- x86 host (vs GB200 Grace ARM) → real tuning-gke.yaml, not no-op
- No NVreg_GrdmaPciTopoCheckOverride flag
- Single-fabric NCCL (no MNNVL / NVL72)
- gpu-operator floor >= v25.10.0 (Blackwell baseline)

The gke-nccl-tcpxo component is intentionally NOT added: its DaemonSets pin to nvidia-h100-mega-80gb and target the TCPX transport on a3-megagpu-8g (H100), so they would not run on A4 nodes and would misrepresent the deployment model.

nccl-all-reduce-bw threshold is a placeholder (>= 100); the H100 GKE TCPXO baseline (>= 250) is not the right anchor for A4's gdrcopy + native-multi-NIC model. Tighten once an empirical measurement from the reference cluster is captured.

Fixes: NVIDIA#1004
Related: NVIDIA#1001, NVIDIA#1052, NVIDIA#969, NVIDIA#436

…dcard guide recipe-development.md's criteria-wildcard section credited only the gb200-any-training.yaml retirement (NVIDIA#1052), but b200-any-training.yaml was likewise retired in NVIDIA#1053 (both -any-training wildcards are gone on main; b200-any.yaml / gb200-any.yaml are the live deployment-floor overlays). Name both so recipe authors don't reintroduce the retired B200 cross-service NCCL-threshold pattern. Doc-only.

…guide (#1475)

yuanchen8911 added enhancement area/recipes labels May 27, 2026

github-actions Bot added the size/L label May 27, 2026

coderabbitai Bot reviewed May 27, 2026

Comment thread recipes/overlays/b200-gke-cos-inference-dynamo.yaml

yuanchen8911 mentioned this pull request May 27, 2026

feat(components): add GKE A4 GPUDirect-RDMA networking component for B200 #1054

Closed

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from ea71456 to bead546 Compare May 27, 2026 01:08

mchmarny assigned yuanchen8911 May 27, 2026

coderabbitai Bot reviewed May 27, 2026

Comment thread recipes/overlays/b200-gke-cos-training.yaml

Comment thread recipes/overlays/b200-gke-cos-training.yaml

mchmarny unassigned yuanchen8911 May 27, 2026

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from bead546 to f6ea784 Compare May 27, 2026 19:55

coderabbitai Bot reviewed May 27, 2026

Comment thread recipes/overlays/b200-gke-cos-training.yaml

yuanchen8911 mentioned this pull request May 27, 2026

refactor(recipes): retire gb200-any-training.yaml wildcard (#1052) #1068

Merged

11 tasks

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from f6ea784 to 285283e Compare May 28, 2026 13:57

coderabbitai Bot reviewed May 28, 2026

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from 285283e to 16df4fb Compare May 28, 2026 15:21

coderabbitai Bot reviewed May 28, 2026

Comment thread recipes/overlays/b200-gke-cos-training-kubeflow.yaml

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch 2 times, most recently from 9ee137c to ee6f5c4 Compare May 28, 2026 16:55

github-actions Bot added the area/docs label May 28, 2026

yuanchen8911 changed the title ~~WIP: feat(recipes): add concrete GKE B200 service-bound overlays~~ feat(recipes): add concrete GKE B200 service-bound overlays May 28, 2026

yuanchen8911 marked this pull request as ready for review May 28, 2026 17:08

yuanchen8911 requested review from a team as code owners May 28, 2026 17:08

yuanchen8911 mentioned this pull request May 28, 2026

bug(kwok): verify_pods samples pod state across 5 kubectl calls — TOCTOU race produces impossible math + misleading diagnostic #1090

Closed

5 tasks

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from ee6f5c4 to 97c6e2b Compare May 28, 2026 19:17

yuanchen8911 marked this pull request as draft May 28, 2026 19:51

mchmarny assigned yuanchen8911 May 28, 2026

yuanchen8911 marked this pull request as ready for review May 28, 2026 21:19

yuanchen8911 requested a review from mchmarny May 28, 2026 21:20

yuanchen8911 force-pushed the feat/b200-gke-overlays-1004 branch from 97c6e2b to 1edb775 Compare May 28, 2026 21:21

mchmarny approved these changes May 28, 2026

yuanchen8911 merged commit bbf8176 into NVIDIA:main May 28, 2026
121 checks passed

This was referenced May 29, 2026

feat(recipe): per-field union merge for validation phase checks #1103

Merged

feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS #1101

Merged

yuanchen8911 mentioned this pull request Jun 25, 2026

docs(recipes): note b200-any-training retirement (#1053) in wildcard guide #1475

Merged

6 tasks

yuanchen8911 added a commit that referenced this pull request Jun 25, 2026

docs(recipes): note b200-any-training retirement (#1053) in wildcard …

6fc3444

…guide (#1475)
