feat(recipes): add concrete GKE B200 service-bound overlays #1053
📝 Walkthrough

This PR renames b200-any-training → b200-any and broadens its selector to all intents for accelerator b200, replacing the NCCL perf gate with a deployment-phase floor (including Deployment.gpu-operator.version >= v25.10.0). It also adds four GKE/COS B200 overlays: b200-gke-cos-inference, b200-gke-cos-inference-dynamo (DRA/K8s >= 1.34), b200-gke-cos-training, and b200-gke-cos-training-kubeflow, each wiring componentRefs, K8s/GPU-operator version constraints, and deployment/performance/conformance validations.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/b200-gke-cos-inference-dynamo.yaml`:
- Around line 46-53: The PR added explicit chart version pins for the Helm
release named "dynamo-platform" (version "1.0.2") in the overlay recipe, so
regenerate the bill-of-materials docs and commit the output: run make bom-docs
locally, verify the regenerated docs/user/container-images.md reflects the new
chart pins, and add/commit that updated docs/user/container-images.md to this PR
alongside the recipe change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 924479a7-6c13-4888-83ff-e6a316b2c7e0
📒 Files selected for processing (5)
- recipes/overlays/b200-any.yaml
- recipes/overlays/b200-gke-cos-inference-dynamo.yaml
- recipes/overlays/b200-gke-cos-inference.yaml
- recipes/overlays/b200-gke-cos-training-kubeflow.yaml
- recipes/overlays/b200-gke-cos-training.yaml
ea71456 to bead546
Actionable comments posted: 2
Inline comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 40-49: The YAML comments claim gke-nccl-tcpxo is intentionally
omitted while the PR objectives require adding GPUDirect-TCPXO for B200;
reconcile by either (A) actually adding the gke-nccl-tcpxo component to this
overlay (insert a component entry for "gke-nccl-tcpxo" alongside "gpu-operator"
and enable any B200-specific selectors/taints/labels required for B200 nodes),
or (B) update the header comment/PR objectives to state that gke-nccl-tcpxo is
intentionally excluded for this overlay; adjust the text referring to
GPUDirect-TCPXO and the omitted component so code and objectives match.
- Around line 102-103: Update the NCCL performance floor to match the training
acceptance target by changing the value for the key "nccl-all-reduce-bw" from
">= 100" to ">= 250" in the overlay where that key is defined; ensure the new
threshold is applied wherever "nccl-all-reduce-bw" is set so the validation gate
enforces the intended >= 250 target.
ℹ️ Review info
Run ID: 485f34fa-a5ab-4964-b2ef-36076aa0e6d8
bead546 to f6ea784
Actionable comments posted: 1
Inline comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 64-65: In recipes/overlays/b200-gke-cos-training.yaml the
manifestFiles entry incorrectly references
components/nodewright-customizations/manifests/tuning-gke.yaml (which does not
exist at repo root); change the manifestFiles value to
recipes/components/nodewright-customizations/manifests/tuning-gke.yaml so it
points to the actual tuning-gke.yaml file and then run yamllint against
recipes/overlays/b200-gke-cos-training.yaml to ensure YAML formatting/lint rules
pass.
ℹ️ Review info
Run ID: 066d3d72-226d-4a97-aec5-d6de836b81d2
The criteria-wildcard overlay `gb200-any-training.yaml` carried a single `nccl-all-reduce-bw >= 720` constraint applied to every GB200 + training query regardless of service. The companion `b200-any-training.yaml` had the same shape.

The pattern is misleading: each service has a different network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and a single cross-service threshold is rarely correct for two fabrics. NCCL bandwidth thresholds belong on the concrete service-bound leaf, anchored to a measurement on that specific fabric.

The fabric-independent deployment-phase floor (4 standard health checks + `Deployment.gpu-operator.version` pin) remains on `gb200-any.yaml`. That pattern (established by NVIDIA#1001) is correct because the same gpu-operator version requirement applies across every service for the accelerator.

Changes:
- Delete `recipes/overlays/gb200-any-training.yaml`.
- Update `recipes/overlays/gb200-any.yaml` comment block: drop the "Companion to gb200-any-training.yaml" intro, explain why the intent-scoped sibling was retired.
- Doc updates with the same rationale:
  - `docs/integrator/recipe-development.md`: switch the criteria-wildcard example from `gb200-any-training.yaml` to `gb200-any.yaml` (deployment-phase floor); add the "per-fabric values don't belong here" caveat.
  - `docs/contributor/data.md`: refresh the wildcard explanation section, the resolver-tracing example, and the Mermaid flowchart to use `gb200-any` throughout; rename `-any-` naming convention to allow `-any` (deployment-floor pattern).
  - `docs/design/005-overlay-refactoring.md`: drop the `b200-any-training` / `gb200-any-training` lines from the overlay tree and leave a brief historical note explaining the retirement across PRs NVIDIA#1004 (b200 wildcard) and NVIDIA#1052 (gb200 wildcard).
- Update the `knownGaps` header comment in `pkg/recipe/validation_phase_floor_test.go` to document the OKE GB200 training performance data gap (warn-only in non-strict mode today; closes when NVIDIA#1007 lands an OCI testbed measurement). The map stays empty: the floor test treats missing performance as a `t.Log(WARN)` in default mode, so a knownGaps entry would be stale-flagged.

The `b200-any-training.yaml` deletion is in flight via NVIDIA#1004 (PR NVIDIA#1053); the two issues are independent.

Acceptance (per NVIDIA#1052):
1. `recipes/overlays/gb200-any-training.yaml` removed: yes.
2. `gb200-oke-training` perf coverage: option (b), deferred to NVIDIA#1007 for real OCI measurements; covered as a warn-only floor gap today.
3. `go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor`: passes.
4. `aicr recipe --service oke --accelerator gb200 --intent training --format yaml`: resolves to 11 components, 6 overlays (base, monitoring-hpa, gb200-any, oke, oke-training, gb200-oke-training); carries the gpu-operator v25.10.0 floor and the standard 4 deployment checks via `gb200-any.yaml`.
5. Doc references updated.
6. `make qualify` clean.

Fixes NVIDIA#1052
Related NVIDIA#1004, NVIDIA#1007
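The per-leaf principle in this commit message can be sketched as a small lookup. This is only an illustration under stated assumptions, not AICR's resolver or its Go floor test: `LEAF_FLOORS`, `parse_constraint`, and `check_leaf` are hypothetical names, and the "fail in strict mode" branch is an assumed counterpart to the warn-only default described above. The `>= 100` value is the B200 GKE placeholder quoted elsewhere in this PR; `gb200-oke-training` is left without a threshold to mirror the documented data gap.

```python
import operator
import re

# Comparators accepted in a constraint string such as ">= 100".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

# Hypothetical per-leaf NCCL floors: one entry per concrete service-bound leaf.
# A leaf with no measurement yet (gb200-oke-training) simply has no entry.
LEAF_FLOORS = {
    "b200-gke-cos-training": ">= 100",
}


def parse_constraint(expr):
    """Split a constraint string such as '>= 100' into (comparator, bound)."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([0-9]+(?:\.[0-9]+)?)\s*", expr)
    if match is None:
        raise ValueError(f"unsupported constraint: {expr!r}")
    return _OPS[match.group(1)], float(match.group(2))


def check_leaf(leaf_floors, leaf, measured, strict=False):
    """Return 'pass', 'fail', or 'warn' for a measured bandwidth on one leaf."""
    expr = leaf_floors.get(leaf)
    if expr is None:
        # No fabric-specific floor recorded: warn by default, fail when strict
        # (the strict behaviour is an assumption for this sketch).
        return "fail" if strict else "warn"
    compare, bound = parse_constraint(expr)
    return "pass" if compare(measured, bound) else "fail"
```

For example, `check_leaf(LEAF_FLOORS, "gb200-oke-training", 0.0)` reports `"warn"` because that leaf carries no threshold, while a made-up measurement of `180.0` on `b200-gke-cos-training` passes the placeholder floor. No cross-service number is consulted in either case.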
f6ea784 to 285283e
Actionable comments posted: 0
♻️ Duplicate comments (1)
recipes/overlays/b200-gke-cos-training.yaml (1)
64-65: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Path issue persists: manifestFiles references incorrect location.

The path `components/nodewright-customizations/manifests/tuning-gke.yaml` should be `recipes/components/nodewright-customizations/manifests/tuning-gke.yaml` to match the actual file location in the repository structure.

📁 Proposed fix

```diff
   - name: nodewright-customizations
     type: Helm
     manifestFiles:
-      - components/nodewright-customizations/manifests/tuning-gke.yaml
+      - recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
     overrides:
```
Duplicate comments:
In `@recipes/overlays/b200-gke-cos-training.yaml`:
- Around line 64-65: Update the manifestFiles entry under manifestFiles in the
overlay to point to the actual repository location by replacing the current
reference "components/nodewright-customizations/manifests/tuning-gke.yaml" with
"recipes/components/nodewright-customizations/manifests/tuning-gke.yaml" so the
tuning-gke.yaml manifest is correctly resolved.
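A check like the one this comment asks for can be sketched in a few lines. This is a hypothetical helper, not part of AICR: `missing_manifests` and `demo` are illustrative names, and whether `manifestFiles` entries are resolved relative to the repository root (as the review comment assumes) or relative to `recipes/` is not confirmed in this thread, so the base directory is an explicit argument.

```python
import tempfile
from pathlib import Path

# The two candidate paths discussed in the review comment.
BAD = "components/nodewright-customizations/manifests/tuning-gke.yaml"
GOOD = "recipes/components/nodewright-customizations/manifests/tuning-gke.yaml"


def missing_manifests(base_dir, manifest_files):
    """Return the manifestFiles entries that do not exist under base_dir."""
    base = Path(base_dir)
    return [entry for entry in manifest_files if not (base / entry).is_file()]


def demo():
    """Build a throwaway tree containing only GOOD and check both paths."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / GOOD
        target.parent.mkdir(parents=True)
        target.write_text("# placeholder manifest\n")
        return missing_manifests(tmp, [BAD, GOOD])
```

In the throwaway tree built by `demo()`, only the `recipes/`-prefixed file exists, so the shorter path is the one reported as missing. Pointing `base_dir` at `recipes/` instead would flip the result, which is why the resolution base needs to be confirmed before applying the suggested change.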
ℹ️ Review info
Run ID: c3489fa9-c6bd-4aff-9595-92fbc46b9b87
285283e to 16df4fb
Actionable comments posted: 1
Inline comments:
In `@recipes/overlays/b200-gke-cos-training-kubeflow.yaml`:
- Around line 32-35: Remove the redundant constraints block from the
b200-gke-cos-training-kubeflow overlay: delete the K8s.server.version constraint
(the entry with name "K8s.server.version" and value ">= 1.32") since that same
constraint is already defined in the parent overlay b200-gke-cos-training;
leaving the child overlay without its own constraints block will inherit the
parent's version floor and avoid duplicate maintenance.
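Why the duplicate block is redundant can be seen with a toy merge. The sketch below assumes constraints are merged by `name` from parent to child with the later overlay winning; that merge rule, the dictionary layout, and the `resolve_constraints` helper are assumptions made for illustration, not AICR's actual resolver.

```python
def resolve_constraints(chain):
    """Merge constraints from parent to child; later overlays win by name."""
    resolved = {}
    for overlay in chain:
        for constraint in overlay.get("constraints", []):
            resolved[constraint["name"]] = constraint["value"]
    return resolved


# Parent overlay declares the Kubernetes version floor once.
parent = {
    "name": "b200-gke-cos-training",
    "constraints": [{"name": "K8s.server.version", "value": ">= 1.32"}],
}

# Child as flagged by the review: repeats the same floor.
child_with_duplicate = {
    "name": "b200-gke-cos-training-kubeflow",
    "constraints": [{"name": "K8s.server.version", "value": ">= 1.32"}],
}

# Child after the suggested cleanup: no constraints block at all.
child_trimmed = {"name": "b200-gke-cos-training-kubeflow"}

with_duplicate = resolve_constraints([parent, child_with_duplicate])
trimmed = resolve_constraints([parent, child_trimmed])
```

Under that assumed rule both chains resolve to the same single `K8s.server.version` floor, so the child's block only adds a second place to update when the parent's floor changes.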
ℹ️ Review info
Run ID: 8e5fae7a-b0f5-486b-a647-291083441571
9ee137c to ee6f5c4
ee6f5c4 to 97c6e2b
Adds the first concrete service-bound overlays for the b200 accelerator on GKE COS: b200-any.yaml (deployment-phase floor wildcard, mirrors gb200-any.yaml from NVIDIA#1001) plus four leaves (training, inference, training-kubeflow, inference-dynamo). Layout mirrors h100-gke-cos-*.yaml (GKE is COS-only by AICR convention, so there is no -ubuntu- variant).

Retires the placeholder b200-any-training.yaml wildcard per the principle established in NVIDIA#1052; H100 / RTX Pro 6000 already follow this shape, so B200 reaches parity.

Anchored on the production reference cluster nvcf-dgxc-k8s-gcp-azne1-prd7, which deploys no separate NCCL plugin installer (no gke-nccl-tcpxo, no GPUDirect-RDMA DaemonSet). High-bandwidth multi-node NCCL is provided by GPU Operator with `gdrcopy` enabled, combined with GKE A4's native multi-NIC infrastructure. Both training and inference leaves set cdi.enabled + gdrcopy.enabled on the gpu-operator overrides to mirror that deployment model.

B200-vs-GB200 deltas honored:
- x86 host (vs GB200 Grace ARM) → real tuning-gke.yaml, not no-op
- No NVreg_GrdmaPciTopoCheckOverride flag
- Single-fabric NCCL (no MNNVL / NVL72)
- gpu-operator floor >= v25.10.0 (Blackwell baseline)

The gke-nccl-tcpxo component is intentionally NOT added: its DaemonSets pin to nvidia-h100-mega-80gb and target the TCPX transport on a3-megagpu-8g (H100), so they would not run on A4 nodes and would misrepresent the deployment model.

nccl-all-reduce-bw threshold is a placeholder (>= 100); the H100 GKE TCPXO baseline (>= 250) is not the right anchor for A4's gdrcopy + native-multi-NIC model. Tighten once an empirical measurement from the reference cluster is captured.

Fixes: NVIDIA#1004
Related: NVIDIA#1001, NVIDIA#1052, NVIDIA#969, NVIDIA#436
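The version floors named in this commit (`>= v25.10.0` for gpu-operator, `>= 1.32` / `>= 1.34` for Kubernetes in the leaves) need numeric rather than string comparison, since `v25.9.x` sorts after `v25.10.0` lexicographically. The following is a minimal sketch of such a check; `parse_version` and `meets_floor` are hypothetical helpers, not AICR's constraint evaluator, and they only handle plain dotted versions (suffixes such as a GKE build tag are out of scope here).

```python
def parse_version(text, width=3):
    """Turn 'v25.10.1' or '1.34' into a zero-padded tuple of ints."""
    parts = [int(part) for part in text.strip().lstrip("v").split(".")]
    # Pad so that '1.34' and '1.34.0' compare as equal.
    return tuple(parts + [0] * (width - len(parts)))


def meets_floor(actual, floor_expr):
    """Return True when `actual` satisfies a '>= <version>' floor expression."""
    comparator, _, floor = floor_expr.strip().partition(" ")
    if comparator != ">=":
        raise ValueError(f"only '>=' floors are handled in this sketch: {floor_expr!r}")
    return parse_version(actual) >= parse_version(floor)
```

With these helpers, the GPU Operator version reported for the reference cluster in the PR description (`v25.10.1`) satisfies the `>= v25.10.0` floor, while a `v25.9.3` install would not, even though a naive string comparison would say otherwise.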
97c6e2b to 1edb775
…dcard guide recipe-development.md's criteria-wildcard section credited only the gb200-any-training.yaml retirement (NVIDIA#1052), but b200-any-training.yaml was likewise retired in NVIDIA#1053 (both -any-training wildcards are gone on main; b200-any.yaml / gb200-any.yaml are the live deployment-floor overlays). Name both so recipe authors don't reintroduce the retired B200 cross-service NCCL-threshold pattern. Doc-only.
Summary

Adds the first concrete service-bound overlays for the `b200` accelerator on GKE COS: `b200-any.yaml` (deployment-phase floor wildcard) plus four leaves (`training`, `inference`, `training-kubeflow`, `inference-dynamo`). Retires the placeholder `b200-any-training.yaml` wildcard; H100 / RTX Pro 6000 already follow the no-`*-any-training.yaml` shape, so B200 reaches parity.

Motivation / Context

Before this PR, `aicr recipe --service gke --accelerator b200 --intent <any>` resolved only the wildcard NCCL threshold from `b200-any-training.yaml` (added by #436), with no GKE COS GPU operator config and no platform variant.

Anchored on the production cluster `nvcf-dgxc-k8s-gcp-azne1-prd7` (GCP, `asia-northeast1`, NVCF prod), confirmed against its `cluster-spec.yaml`, ArgoCD app set, and gpu-operator values file in `dgxcloud/mk8s/manifests:clusters/nvcf-prod/nvcf-dgxc-k8s-gcp-azne1-prd7/`.

Fixes: #1004
Related: #1001 (per-accelerator deployment-floor wildcard pattern), #1052 (`*-any-training.yaml` retirement), #969 (validation-phase coverage audit), #436 (B200 enum + stub wildcard)

Type of Change

- … `b200-any-training.yaml` retirement per #1052 (refactor(recipes): retire `*-any-training.yaml` wildcards, NCCL thresholds per-leaf)

Component(s) Affected

- … (`pkg/recipe`)

(YAML-only; no Go source changes.)

Implementation Notes

Layout mirrors `h100-gke-cos-*.yaml` (the closest reference pattern; GKE is COS-only by AICR convention, so there is no `-ubuntu-` variant).

| Overlay | Base | Notes |
| --- | --- | --- |
| `b200-any` | `base` | … `Deployment.gpu-operator.version >= v25.10.0`). Mirrors `gb200-any.yaml` / `h100-any.yaml` / `rtx-pro-6000-any.yaml` from #1001. |
| `b200-gke-cos-training` | `gke-cos-training` | … `>= v25.10.0`; K8s `>= 1.32`; `nccl-all-reduce-bw >= 100` (placeholder, see Networking note). |
| `b200-gke-cos-inference` | `gke-cos-inference` | … `b200-any.yaml`, conformance from `gke-cos.yaml`. No performance phase (single-card inference, no NCCL fabric to gate). |
| `b200-gke-cos-training-kubeflow` | `b200-gke-cos-training` | … `kubeflow-trainer` component for `TrainJob` distributed training. |
| `b200-gke-cos-inference-dynamo` | `b200-gke-cos-inference` | … `>= 1.34` (DRA GA); self-declares deployment + performance + conformance. |

B200-vs-GB200 deltas honored:
- x86 host (vs GB200 Grace ARM): real `tuning-gke.yaml` for `nodewright-customizations` (same as `h100-gke-cos-*`), not the GB200 no-op `tuning.yaml`.
- No `NVreg_GrdmaPciTopoCheckOverride=1` override: that flag exists for GB200's Grace PCI topology with EFA. GKE A4 (B200) uses RDMA over Ethernet provided by GCP's native multi-NIC fabric.
- Single-fabric NCCL (no MNNVL / NVL72): one `nccl-all-reduce-bw` constraint, not split `net` + `nvls`.
- gpu-operator floor `>= v25.10.0` (Blackwell support stabilized in 25.10, matches GB200 / RTX Pro 6000).
- `cdi: enabled` + `gdrcopy: enabled` on both training and inference leaves: mirrors the production reference cluster's gpu-operator values (`580.95.05` driver, `gdrcopy.enabled: true`, `cdi.enabled: true`). GB200/EKS sets the same pair.

Networking model, no separate installer (Codex P1 / cluster-verified): GKE A4 (B200) on `nvcf-dgxc-k8s-gcp-azne1-prd7` deploys no NCCL plugin installer DaemonSet: no `gke-nccl-tcpxo`, no GPUDirect-RDMA component. Multi-node NCCL is provided by GPU Operator (`v25.10.1`) with `gdrcopy: enabled: true` combined with GKE A4's native multi-NIC infrastructure managed by GCP. The `gke-nccl-tcpxo` componentRef is intentionally not added to the B200 training leaf because its DaemonSets pin `cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb` and target the TCPX transport on `a3-megagpu-8g` (H100); they would not run on A4 nodes and would misrepresent the deployment model.

NCCL threshold, placeholder pending measurement: `nccl-all-reduce-bw >= 100` is a conservative floor reflecting that A4 reaches high-bandwidth NCCL via gdrcopy + GKE native multi-NIC rather than via a TCPX installer, so the H100/GKE TCPXO baseline (`>= 250`) is not the right anchor. Tighten once an empirical number from the reference cluster is captured.

Wildcard cleanup (`b200-any-training.yaml`): retired in the same PR per #1052. The original `>= 350 GB/s` cross-cloud placeholder threshold was fabric-blind (one number for EFA, TCPX, RoCE, native multi-NIC; none correct for all). H100 already follows this shape (no `h100-any-training.yaml`), so B200 reaches parity. Future per-cloud B200 leaves carry their own fabric-tuned thresholds.

Inference-perf thresholds on the Dynamo leaf mirror H100's loose smoke-test floor (`inference-throughput >= 5000`, `inference-ttft-p99 <= 200`); B200 is expected to exceed both with margin.

Testing

Results:
- `TestOverlayValidationPhaseFloor` passes for all 4 new leaves. No new `knownGaps` entries.
- `make qualify`: all 22 chainsaw tests pass, coverage 77.0% (threshold 75%), `golangci-lint` 0 issues, no vulnerabilities, license headers OK.
- `aicr recipe --service gke --accelerator b200 --os cos --intent training --format yaml` → 17 components, gpu-operator with `cdi.enabled: true` + `gdrcopy.enabled: true`, deployment (`>= v25.10.0`), performance (`nccl-all-reduce-bw >= 100`), conformance (10 checks).
- `aicr recipe --service gke --accelerator b200 --os cos --intent inference --format yaml` → gpu-operator overrides include `gdrcopy: enabled`; deployment inherited from `b200-any.yaml`, conformance inherited from `gke-cos.yaml`.
- `aicr recipe --service gke --accelerator b200 --os cos --intent inference --platform dynamo --format yaml` → DRA + Dynamo + Grove resolved; full validation contract; K8s `>= 1.34`.
- `aicr recipe --service gke --accelerator b200 --os cos --intent training --platform kubeflow --format yaml` → kubeflow-trainer injected.

Coverage: YAML-only change. Per CLAUDE.md the per-package coverage gate does not apply. Project-wide `make test-coverage` floor (75%) passes under `make qualify` at 77.0%.

Risk Assessment

- … `--service gke --accelerator b200`.

Rollout notes: Net deletion of `b200-any-training.yaml` is safe. That wildcard had no concrete leaves depending on it (no `b200-<svc>-*.yaml` existed before this PR), and removing it brings B200 to parity with H100 / RTX Pro 6000 (neither has an `*-any-training.yaml`). The cross-cloud threshold it contributed (`>= 350`) was fabric-blind; the per-leaf `>= 100` is grounded in the reference cluster's actual deployment model and will be tightened once empirical numbers land.

Checklist

- … (`make test` with `-race`)
- … (`make lint`)
- … `TestOverlayValidationPhaseFloor` auto-enumerates and validates the new overlays
- … `aicr criteria list`
- … (`git commit -S`)