feat(recipes): add A100 GKE COS training Kubeflow overlay chain #1306

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/a100-gke-overlays
Jun 11, 2026

@yuanchen8911 yuanchen8911 commented Jun 11, 2026

Contributor

Summary

Add A100 GKE overlays (issue #1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment-phase floor. Modeled on the H100 GKE COS training overlays.

Motivation / Context

a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service gke ... cannot resolve. Companion to the A100 OKE (#1294), AKS (#1295), and EKS (#1305) PRs; this slice covers GKE.

GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training → a100-gke-cos-training → a100-gke-cos-training-kubeflow.
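
The chain above can be sketched as a walk over `base` pointers from the leaf up to the service root. This is a minimal illustration, not the aicr resolver: the `OVERLAY_BASES` table and `resolve_chain` helper are hypothetical names, and the `gke-cos` root as parent of `gke-cos-training` is an assumption drawn from the description. The cross-cutting `a100-any` floor is not modeled here.

```python
# Hypothetical in-memory view of the overlays: name -> parent named by `base`.
# The real overlays are YAML files under recipes/overlays/.
OVERLAY_BASES = {
    "gke-cos": None,  # assumed service root (where `os: cos` is set)
    "gke-cos-training": "gke-cos",
    "a100-gke-cos-training": "gke-cos-training",
    "a100-gke-cos-training-kubeflow": "a100-gke-cos-training",
}


def resolve_chain(leaf, bases=OVERLAY_BASES):
    """Return the overlay chain root-first by following `base` pointers."""
    chain = []
    seen = set()
    current = leaf
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in base chain at {current!r}")
        if current not in bases:
            raise KeyError(f"unknown overlay {current!r}")
        seen.add(current)
        chain.append(current)
        current = bases[current]
    chain.reverse()  # leaf-first -> root-first
    return chain
```

Resolving root-first means a child overlay is applied after its parent, which is the usual order for letting a leaf override inherited values.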

Fixes: N/A (incremental — part of #1002)
Related: #1002, #1294, #1295, #1305, #969, #1256

A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) · #1305 (EKS) · #1306 (GKE) ← this PR

Coordination: a100-any.yaml (the cross-service A100 deployment floor) is byte-identical across #1294, #1295, #1305, and this PR. Only one needs to introduce it — whichever lands first, the others drop the duplicate on rebase.

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)

Implementation Notes

New overlays (reuse existing gke-cos/gke-cos-training parents and values-gke-cos-training.yaml — no new component values files):

  • a100-any — deployment-phase floor: 4 standard checks + Deployment.gpu-operator.version >= v24.6.0 (H100/H200-generation baseline; A100 operator-supported since v22.9).
  • a100-gke-cos-training (base: gke-cos-training): K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than H100's 1.32. Overrides: gpu-operator cdi, nfd topologyUpdater. Conformance mirrors the H100 GKE COS training set.
  • a100-gke-cos-training-kubeflow — Kubeflow Trainer for distributed TrainJob, declared inline to match the GKE COS pattern (h100-gke-cos-training-kubeflow).
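
The two floors above (`Deployment.gpu-operator.version >= v24.6.0` and K8s `>= 1.30`) are simple version lower bounds. A rough sketch of how such a floor could be evaluated follows; `parse_version` and `satisfies_floor` are hypothetical helpers, not aicr functions, and the sketch assumes plain dotted versions with an optional leading `v` (suffixed strings such as GKE build tags would need extra handling).

```python
import re


def parse_version(text):
    """Parse 'v24.6.0' or '1.30' into a tuple of ints, e.g. (24, 6, 0)."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies_floor(actual, constraint):
    """Check `actual` against a '>= X' constraint (only '>=' is handled)."""
    operator, _, floor = constraint.strip().partition(" ")
    if operator != ">=":
        raise ValueError(f"only '>=' floors are sketched here: {constraint!r}")
    actual_parts = parse_version(actual)
    floor_parts = parse_version(floor)
    # Pad the shorter tuple with zeros so '1.30' compares like '1.30.0'.
    width = max(len(actual_parts), len(floor_parts))
    actual_parts += (0,) * (width - len(actual_parts))
    floor_parts += (0,) * (width - len(floor_parts))
    return actual_parts >= floor_parts
```

Comparing integer tuples rather than strings avoids the classic pitfall where `"1.9"` sorts above `"1.30"` lexically.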

Key decisions documented in-file:

  • Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator: h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target — but per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100.

  • gke-nccl-tcpxo omitted. GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family.

  • TCPXO doc scoped. docs/integrator/gke-tcpxo-networking.md previously applied to all *-gke-cos-training* recipes; it now scopes the prerequisites to the H100 (a3-megagpu-8g) recipes and calls out the A100 (a2) exception, so users selecting a100-gke-cos-training are not directed to configure TCPXO the bundle never installs.

  • Performance gating omitted. The H100 GKE nccl-all-reduce-bw >= 250 floor is calibrated on 8-GPU H100 NVLink nodes. It is neither fabric-class aware (see #1256: the nccl-all-reduce-bw training gate is a fixed, absolute, fabric-specific busbw value applied to SKU-agnostic recipes, which false-fails small EKS/H100 SKUs) nor valid for A100. An A100-on-GKE NCCL baseline is deferred to a follow-up.
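
The TCPXO decision above comes down to the node machine type. As a hedged sketch (the `TCPXO_MACHINE_TYPES` set and `tcpxo_applies` helper are hypothetical and encode only what this PR states, namely that GPUDirect-TCPXO targets `a3-megagpu-8g` and not the A100 `a2` family):

```python
# Machine types for which this PR says the gke-nccl-tcpxo component applies.
# Only a3-megagpu-8g is named in the description; anything else is treated
# as "does not apply" in this sketch.
TCPXO_MACHINE_TYPES = {"a3-megagpu-8g"}


def tcpxo_applies(machine_type):
    """Return True when the TCPXO prerequisites are relevant for the node."""
    return machine_type in TCPXO_MACHINE_TYPES
```

For example, `tcpxo_applies("a2-ultragpu-8g")` is false, matching the A100 recipes that omit `gke-nccl-tcpxo`.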

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestOverlayValidationPhaseFloor auto-discovery)
yamllint recipes/overlays/a100-*.yaml    # clean
# End-to-end resolution of the new leaf:
aicr recipe --service gke --accelerator a100 --os cos --intent training --platform kubeflow
#   -> components=12 overlays=7; K8s '>= 1.30'; Deployment.gpu-operator.version '>= v24.6.0';
#      kubeflow-trainer present; no nodewright-customizations, no gke-nccl-tcpxo;
#      gpu-operator inherits values-gke-cos-training.yaml

Full make qualify not required: this touches only YAML overlay files (zero .go changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by go test ./pkg/recipe/... (passes) and yamllint (clean). The only doc change is scoping docs/integrator/gke-tcpxo-networking.md (prose, no new anchors/links).

Risk Assessment

  • Low — Additive overlays only; no existing recipe or Go path changes. Easy to revert.

Rollout notes: Additive. Other A100 GKE leaves (inference, dynamo) and remaining work tracked under #1002.

Checklist

  • Tests pass locally (go test ./pkg/recipe/...)
  • Linter passes (yamllint clean; no .go files changed)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (covered by existing auto-discovery TestOverlayValidationPhaseFloor)
  • I updated docs if user-facing behavior changed (N/A — no leaf enumeration in docs)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 added area/recipes theme/recipes Recipe expansion, overlays, mixins, and component registry labels Jun 11, 2026
github-actions Bot commented Jun 11, 2026

Contributor

Recipe evidence check

Affected leaf overlays: 2

| Recipe | Pointer | Verify | Digest match |
| --- | --- | --- | --- |
| a100-gke-cos-training-kubeflow | ⚠️ missing | | |
| a100-gke-cos-training | ⚠️ missing | | |

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

coderabbitai Bot commented Jun 11, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds three new RecipeMetadata overlays for A100 GPU configurations: a cross-cutting validation floor (a100-any) that sets deployment checks and a GPU Operator version floor (>= v24.6.0); a GKE+COS+training overlay (a100-gke-cos-training) with Kubernetes >= 1.30 and GPU Operator/NFD Helm overrides; and a Kubeflow extension overlay (a100-gke-cos-training-kubeflow) that wires a helm-based kubeflow-trainer component. Also updates docs to note an A100 exception for TCPXO prerequisites.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding A100 GKE COS training Kubeflow overlay chain, which directly aligns with the three new overlay files introduced. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description comprehensively describes the changeset, explaining the purpose of adding A100 GKE overlays, the structure of three new YAML files, key design decisions, testing approach, and risk assessment. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml`:
- Around line 34-35: Remove the redundant constraint named K8s.server.version
from this overlay (the value ">= 1.30" duplicates the parent overlay
a100-gke-cos-training); delete the K8s.server.version entry from the
a100-gke-cos-training-kubeflow overlay so the parent's constraint is inherited,
or if you intentionally need to pin it here, add an inline comment next to the
K8s.server.version entry explaining why this overlay must override the parent to
prevent accidental future drift.

In `@recipes/overlays/a100-gke-cos-training.yaml`:
- Around line 34-35: Remove the redundant K8s.server.version constraint from the
a100-gke-cos-training overlay: delete the duplicate key-value pair
`K8s.server.version: ">= 1.30"` in recipes/overlays/a100-gke-cos-training.yaml
(the same constraint is already provided by gke-cos-training.yaml), leaving only
one declaration to avoid unnecessary duplication while preserving behavior.
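
The redundancy flagged above (a child overlay restating a constraint its parent already declares) could be detected mechanically. A minimal sketch, assuming constraints are lists of entries with `name` and `value` keys (the field names are inferred from the review text, and `redundant_constraints` is a hypothetical helper, not part of aicr):

```python
def redundant_constraints(parent, child):
    """Return names of child constraints that repeat the parent's exact value.

    Both arguments are lists of {"name": ..., "value": ...} dicts. A child
    entry with the same name but a different value is an override, not a
    duplicate, so it is not reported.
    """
    parent_values = {entry["name"]: entry["value"] for entry in parent}
    return [
        entry["name"]
        for entry in child
        if parent_values.get(entry["name"]) == entry["value"]
    ]
```

A check like this only says the entry is droppable; keeping it with an inline comment, as the review suggests, remains a valid choice when the pin is intentional.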
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4d75c63a-bf42-4529-a72e-d8acc1ddd699

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14530 and 769d36f.

📒 Files selected for processing (3)
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-gke-cos-training-kubeflow.yaml
  • recipes/overlays/a100-gke-cos-training.yaml

Comment thread recipes/overlays/a100-gke-cos-training-kubeflow.yaml
Comment thread recipes/overlays/a100-gke-cos-training.yaml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/gke-tcpxo-networking.md`:
- Line 3: Expand the acronym NCCL at first mention in the sentence containing
"NCCL falls back to TCP" by replacing the first occurrence with its full form
"NVIDIA Collective Communications Library (NCCL)" so subsequent uses can keep
the short form; update the phrase in the line that reads "Without it, NCCL falls
back to TCP (~4 GB/s vs ~340 GB/s with TCPXO)." to include the expanded form.

In `@recipes/overlays/a100-any.yaml`:
- Around line 34-54: The overlay is missing the required mixins field and thus
doesn't follow the overlay schema; update the spec to include a mixins entry
alongside base, criteria and constraints (e.g., add a top-level spec.mixins
array with the appropriate mixin names or an empty list if none are needed) so
the file defines spec.base, spec.mixins, spec.criteria and spec.constraints;
ensure criteria.service and criteria.accelerator remain unchanged and
constraints (like Deployment.gpu-operator.version) stay under
spec.validation.deployment.checks/constraints.

In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml`:
- Around line 41-44: The kubeflow-trainer componentRef is missing an explicit
version; update the componentRefs entry for kubeflow-trainer (the block
containing name: kubeflow-trainer, type: Helm, valuesFile:
components/kubeflow-trainer/values.yaml) to include a version field (e.g.,
version: "<pin-version>")—fetch the correct version from the kubeflow-trainer
chart/registry or components metadata and add that version string so the
componentRef includes name, type, version, and valuesFile as required by overlay
guidelines.

In `@recipes/overlays/a100-gke-cos-training.yaml`:
- Around line 20-36: The overlay's spec is missing the required mixins field; in
the spec block alongside base: gke-cos-training, criteria and constraints add an
explicit mixins entry (e.g., mixins: [] if none) so the recipe includes base,
mixins, criteria, and constraints per repository convention; update the spec
section (look for the spec: block and the existing base/criteria/constraints
keys) to insert mixins.
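
The convention cited in these comments (an overlay spec declares `base`, `mixins`, `criteria`, and `constraints`) can be expressed as a small presence check. This is a sketch under assumptions: the overlay is taken to be an already-parsed mapping with a top-level `spec` key, the criteria values in the usage note are illustrative, and `missing_spec_fields` is a hypothetical helper rather than the repository's validator.

```python
REQUIRED_SPEC_FIELDS = ("base", "mixins", "criteria", "constraints")


def missing_spec_fields(overlay):
    """List required spec fields that are absent from a parsed overlay.

    Presence is what matters: an explicit empty list such as `mixins: []`
    counts as declared.
    """
    spec = overlay.get("spec") or {}
    return [field for field in REQUIRED_SPEC_FIELDS if field not in spec]
```

For an overlay whose spec has `base: gke-cos-training`, a `criteria` block, and a `constraints` list but no `mixins` key, the helper returns `["mixins"]`; adding `mixins: []` clears it.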
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9dc3e55f-83f3-4df4-9f8f-5d9f8cf69a52

📥 Commits

Reviewing files that changed from the base of the PR and between 769d36f and 52e7c8b.

📒 Files selected for processing (4)
  • docs/integrator/gke-tcpxo-networking.md
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-gke-cos-training-kubeflow.yaml
  • recipes/overlays/a100-gke-cos-training.yaml

Comment thread docs/integrator/gke-tcpxo-networking.md Outdated
Comment thread recipes/overlays/a100-any.yaml
Comment thread recipes/overlays/a100-gke-cos-training-kubeflow.yaml
Comment thread recipes/overlays/a100-gke-cos-training.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch from 52e7c8b to b144bb2 Compare June 11, 2026 12:52
@yuanchen8911 yuanchen8911 changed the title WIP: feat(recipes): add A100 GKE COS training Kubeflow overlay chain feat(recipes): add A100 GKE COS training Kubeflow overlay chain Jun 11, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 11, 2026 12:53
@yuanchen8911 yuanchen8911 requested review from a team as code owners June 11, 2026 12:53

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml`:
- Around line 20-35: The overlay spec is missing the required mixins field;
update the recipe overlay to include a mixins entry (either an explicit empty
list or the applicable mixin names) alongside the existing base:
a100-gke-cos-training, criteria and constraints so it conforms to the overlay
schema — add mixins: [] if none apply or list the relevant mixin identifiers.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 983aa10d-eb4c-4c43-8a00-e75f7785ce12

📥 Commits

Reviewing files that changed from the base of the PR and between 52e7c8b and b144bb2.

📒 Files selected for processing (4)
  • docs/integrator/gke-tcpxo-networking.md
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-gke-cos-training-kubeflow.yaml
  • recipes/overlays/a100-gke-cos-training.yaml

Comment thread recipes/overlays/a100-gke-cos-training-kubeflow.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch 3 times, most recently from 93733e6 to 44f0792 Compare June 11, 2026 14:28

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
docs/integrator/gke-tcpxo-networking.md (1)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

NCCL acronym still not expanded.

This is the same issue flagged in the previous review. Line 3 uses "NCCL" without expanding it on first mention. The acronym should be expanded to "NVIDIA Collective Communications Library (NCCL)" at first use.

As per coding guidelines: "Define acronyms on first use in documentation."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/integrator/gke-tcpxo-networking.md` at line 3, The documentation uses
the acronym "NCCL" without expansion; update the first mention in the text (the
line containing "NCCL falls back to TCP") to read "NVIDIA Collective
Communications Library (NCCL)" so the acronym is defined on first use and
subsequent references can keep "NCCL".

Source: Coding guidelines

recipes/overlays/a100-gke-cos-training.yaml (1)

20-36: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add explicit mixins field to satisfy overlay schema requirement.

The overlay spec includes base, criteria, and constraints, but the mixins field is missing. Per coding guidelines, "Recipe overlays must specify base, mixins, criteria, and constraints." Add mixins: [] if intentionally empty, or list applicable mixins.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-gke-cos-training.yaml` around lines 20 - 36, The
overlay spec for base: gke-cos-training is missing the required mixins field;
update the spec block to include a mixins entry (either an empty list mixins: []
if no mixins apply or enumerate applicable mixins) so the overlay satisfies the
schema that requires base, mixins, criteria, and constraints—you can locate this
inside the same spec that contains base: gke-cos-training and the constraints
entry with name: K8s.server.version.

Source: Coding guidelines

recipes/overlays/a100-gke-cos-training-kubeflow.yaml (2)

37-49: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pin kubeflow-trainer with an explicit version in componentRefs.

The kubeflow-trainer componentRef lacks a version field. Since this component is declared inline (line 38 comment: "Declared inline, not via the platform-kubeflow mixin"), it is a new component introduction and must include an explicit version pin alongside name, type, valuesFile, and manifestFiles to satisfy overlay schema requirements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml` around lines 37 - 49,
The kubeflow-trainer entry in componentRefs is missing an explicit version pin;
add a top-level version key to the kubeflow-trainer componentRef (alongside
name, type, valuesFile, manifestFiles and dependencyRefs) and set it to the
proper explicit semantic version or component registry tag (e.g., a pinned
vX.Y.Z or exact release string) so the overlay schema is satisfied; update the
componentRefs block that contains kubeflow-trainer to include this version
field.

Source: Coding guidelines


20-35: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add explicit mixins field to satisfy overlay schema requirement.

The overlay spec includes base, criteria, and constraints, but the mixins field is missing. Per coding guidelines, "Recipe overlays must specify base, mixins, criteria, and constraints." Add mixins: [] if intentionally empty, or list applicable mixins.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-gke-cos-training-kubeflow.yaml` around lines 20 - 35,
The overlay spec for the Kubeflow A100 GKE recipe is missing the required mixins
field; update the spec (near the existing base: a100-gke-cos-training, criteria,
and constraints entries) to include a mixins key—either add mixins: [] if there
are no mixins or list the applicable mixin names—to satisfy the overlay schema
requirement that recipes specify base, mixins, criteria, and constraints.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/gke-tcpxo-networking.md`:
- Line 5: The inline backticks in the phrase "A100 `a2-highgpu`/`a2-ultragpu`
machine family" create a broken formatting mid-phrase; update the sentence in
the "A100 (a2) exception" paragraph so the machine family is formatted
coherently—either remove the backticks (A100 a2-highgpu/a2-ultragpu machine
family) or group them as "A100 machine family (`a2-highgpu`/`a2-ultragpu`)"—and
ensure the rest of the sentence remains unchanged (the surrounding text "A100
(a2) exception:" and "The prerequisites below do not apply..." should be
preserved).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fd711da6-5f3b-4042-9fd7-8d2a3d2249df

📥 Commits

Reviewing files that changed from the base of the PR and between b144bb2 and 93733e6.

📒 Files selected for processing (4)
  • docs/integrator/gke-tcpxo-networking.md
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-gke-cos-training-kubeflow.yaml
  • recipes/overlays/a100-gke-cos-training.yaml

- For `*-gke-cos-training*` recipes, GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).
+ For the **H100 GKE COS training** recipes (`h100-gke-cos-training*`, on `a3-megagpu-8g` nodes), GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).
+
+ > **A100 (a2) exception:** the `a100-gke-cos-training*` recipes intentionally omit the `gke-nccl-tcpxo` component — GPUDirect TCPXO targets H100 `a3-megagpu-8g` nodes, not the A100 `a2-highgpu`/`a2-ultragpu` machine family. The prerequisites below do **not** apply to A100 GKE recipes, and the generated A100 bundle does not install the TCPXO DaemonSets.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider reformatting the machine family reference.

The phrase "A100 `a2-highgpu`/`a2-ultragpu` machine family" has backticks that start and end mid-phrase, creating a formatting break. Consider one of these alternatives:

  • Remove backticks: "A100 a2-highgpu/a2-ultragpu machine family"
  • Restructure: "A100 machine family (`a2-highgpu`/`a2-ultragpu`)"

@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch 3 times, most recently from d219c1e to dbb70df Compare June 11, 2026 17:20
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch from dbb70df to ed15c45 Compare June 11, 2026 17:25
@github-actions github-actions Bot added size/L and removed size/M labels Jun 11, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch from ed15c45 to f88691a Compare June 11, 2026 19:41
@github-actions github-actions Bot added size/M and removed size/L labels Jun 11, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch from f88691a to b7d1139 Compare June 11, 2026 20:01
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch from b7d1139 to 184f873 Compare June 11, 2026 20:21
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200 and has no
separate A100 target. Per the nodewright maintainer, the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to bare metal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stay a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
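
The overlay chain described in the commit message above (gke-cos-training -> a100-gke-cos-training -> a100-gke-cos-training-kubeflow) can be pictured as a walk over parent links from the leaf up to the root. The sketch below is illustrative only. The `PARENTS` table and `resolve_chain` function are hypothetical, the real resolution logic lives in the recipe engine and may differ, and the chain is cut off at gke-cos-training to match what the commit message states.

```python
from typing import Optional

# Hypothetical parent table mirroring the chain named in the commit message.
# gke-cos-training is treated as the root here purely for illustration.
PARENTS: dict[str, Optional[str]] = {
    "a100-gke-cos-training-kubeflow": "a100-gke-cos-training",
    "a100-gke-cos-training": "gke-cos-training",
    "gke-cos-training": None,
}


def resolve_chain(leaf: str, parents: dict[str, Optional[str]]) -> list[str]:
    """Return overlay names ordered root-first, ending at `leaf`."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = leaf
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at overlay {current!r}")
        if current not in parents:
            raise KeyError(f"unknown overlay {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    chain.reverse()  # apply the root first so the leaf overrides last
    return chain


if __name__ == "__main__":
    print(" -> ".join(resolve_chain("a100-gke-cos-training-kubeflow", PARENTS)))
```

Ordering the result root-first reflects the usual overlay convention that more specific layers are applied last. Whether aicr merges in exactly this order is an assumption.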
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-gke-overlays branch from 184f873 to 0a9c9ab Compare June 11, 2026 20:41
@mchmarny mchmarny merged commit fd64dd7 into NVIDIA:main Jun 11, 2026
119 checks passed
Labels

area/docs area/recipes size/M theme/recipes Recipe expansion, overlays, mixins, and component registry
