Skip to content

feat(recipes): add A100 AKS training Kubeflow overlay chain#1295

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/a100-aks-overlays
Jun 11, 2026
Merged

feat(recipes): add A100 AKS training Kubeflow overlay chain#1295
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/a100-aks-overlays

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Add A100 AKS overlays (issue #1002): the AKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor. Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

Motivation / Context

a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service aks ... cannot resolve. Companion to the A100 OKE PR (#1294); this slice covers AKS.

Fixes: N/A (incremental — part of #1002)
Related: #1002, #1294, #1305, #1306, #969, #1256

A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) ← this PR · #1305 (EKS) · #1306 (GKE)

Coordination: a100-any.yaml (the cross-service A100 deployment floor) is shared with #1294. Only one of the two PRs needs to introduce it — whichever lands first, the other will drop the duplicate on rebase.

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

New overlays (reuse existing aks/aks-training parents and values-aks-training.yaml — no new component values files):

  • a100-any — deployment-phase floor: 4 standard checks + Deployment.gpu-operator.version >= v24.6.0 (H100/H200-generation baseline; A100 operator-supported since v22.9).
  • a100-aks-trainingbase: aks-training; K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the AKS training baseline rather than H100's 1.32.4. gpu-operator deps + gdrcopy, nfd topologyUpdater. Nodewright tuning omitted (packages don't target service: aks), matching h100-aks-training. Conformance mirrors the H100 AKS training set.
  • a100-aks-ubuntu-training+ os-ubuntu mixin.
  • a100-aks-ubuntu-training-kubeflow+ platform-kubeflow (Kubeflow Trainer for distributed TrainJob).

AKS pre-installs the NVIDIA container toolkit; the recipe inherits the AKS gpu-operator values and the InfiniBand network-operator RDMA stack unchanged from aks.yaml.

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestOverlayValidationPhaseFloor auto-discovery)
yamllint recipes/overlays/a100-*.yaml    # clean
# End-to-end resolution of the new leaf:
aicr recipe --service aks --accelerator a100 --os ubuntu --intent training --platform kubeflow
#   -> components=13 overlays=8; K8s '>= 1.30'; Deployment.gpu-operator.version '>= v24.6.0';
#      kubeflow-trainer + network-operator present;
#      gpu-operator inherits values-aks-training.yaml

Full make qualify not required: this touches only YAML overlay files (zero .go changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by go test ./pkg/recipe/... (passes) and yamllint (clean). No docs/ page enumerates individual overlay leaves, so no doc update is needed.

Risk Assessment

  • Low — Additive overlays only; no existing recipe or Go path changes. Easy to revert.

Rollout notes: Additive. Other A100 AKS leaves (plain training, inference, dynamo) and remaining services tracked under #1002.

Checklist

  • Tests pass locally (go test ./pkg/recipe/...)
  • Linter passes (yamllint clean; no .go files changed)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (covered by existing auto-discovery TestOverlayValidationPhaseFloor)
  • I updated docs if user-facing behavior changed (N/A — no leaf enumeration in docs)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 added area/recipes theme/recipes Recipe expansion, overlays, mixins, and component registry labels Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds four new RecipeMetadata overlays: a100-any (baseline matching any service with accelerator a100, deployment validation checks, gpu-operator.version >= v24.6.0), a100-aks-training (inherits aks-training; targets AKS+A100+training; K8s.server.version >= 1.30; enables gpu-operator gdrcopy and nfd topologyUpdater; adds 10 intent conformance checks), a100-aks-ubuntu-training (inherits a100-aks-training; applies os-ubuntu mixin and Ubuntu criteria; componentRefs empty), and a100-aks-ubuntu-training-kubeflow (inherits a100-aks-ubuntu-training; adds platform=kubeflow and platform-kubeflow mixin).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested labels

size/M

Suggested reviewers

  • xdu31
  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding A100 AKS training Kubeflow overlay recipes, which is the primary objective of this PR.
Description check ✅ Passed The description provides comprehensive context about the A100 AKS overlays being added, including their purpose, implementation details, testing, and relationship to related PRs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml`:
- Around line 32-35: The mixins list in the overlay
a100-aks-ubuntu-training-kubeflow currently repeats os-ubuntu; remove the
redundant entry so only platform-kubeflow remains in the mixins block. Edit the
mixins for the a100-aks-ubuntu-training-kubeflow recipe (spec.mixins / mixins)
and delete "os-ubuntu", leaving the inherited os-ubuntu from
a100-aks-ubuntu-training and keeping "platform-kubeflow" only.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 02e03bf0-2daf-4ef7-811d-4e6e7215b0fd

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0da78 and 2c2d1f2.

📒 Files selected for processing (4)
  • recipes/overlays/a100-aks-training.yaml
  • recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-aks-ubuntu-training.yaml
  • recipes/overlays/a100-any.yaml

Comment on lines +32 to +35
# Ubuntu OS constraints + Kubeflow trainer via mixins
mixins:
- os-ubuntu
- platform-kubeflow

@coderabbitai coderabbitai Bot Jun 10, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Analyze mixin declaration patterns across overlay inheritance chains

# Find all overlays that declare os-ubuntu mixin
echo "=== Overlays declaring os-ubuntu mixin ==="
rg -l "os-ubuntu" recipes/overlays/*.yaml | while read file; do
    base=$(yq '.spec.base' "$file")
    mixins=$(yq '.spec.mixins[]' "$file" 2>/dev/null | tr '\n' ', ' | sed 's/,$//')
    echo "$file -> base: $base, mixins: [$mixins]"
done

# Check if base recipes that declare os-ubuntu have children that redeclare it
echo -e "\n=== Checking for mixin redeclaration in child overlays ==="
for base_file in recipes/overlays/*ubuntu*.yaml; do
    base_name=$(basename "$base_file" .yaml)
    has_os_ubuntu=$(yq '.spec.mixins[] | select(. == "os-ubuntu")' "$base_file" 2>/dev/null)
    if [ -n "$has_os_ubuntu" ]; then
        echo "Base: $base_name declares os-ubuntu"
        # Find children that inherit from this base
        rg -l "base: $base_name" recipes/overlays/*.yaml | while read child; do
            child_mixins=$(yq '.spec.mixins[] | select(. == "os-ubuntu")' "$child" 2>/dev/null)
            if [ -n "$child_mixins" ]; then
                echo "  ↳ Child: $(basename "$child") REDECLARES os-ubuntu"
            else
                echo "  ↳ Child: $(basename "$child") does NOT redeclare os-ubuntu"
            fi
        done
    fi
done

Repository: NVIDIA/aicr

Length of output: 2466


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import glob, os, sys, yaml, json

# Load all overlay YAMLs under recipes/overlays
paths = sorted(glob.glob("recipes/overlays/**/*.yaml", recursive=True))

def load_yaml(p):
    with open(p, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

# 1) Find all overlays that explicitly mention os-ubuntu in spec.mixins
os_ubuntu_overlays = []
for p in paths:
    try:
        doc = load_yaml(p)
    except Exception:
        continue
    spec = doc.get("spec") if isinstance(doc, dict) else None
    mixins = spec.get("mixins") if isinstance(spec, dict) else None
    if isinstance(mixins, list) and "os-ubuntu" in mixins:
        base = spec.get("base") if isinstance(spec, dict) else None
        os_ubuntu_overlays.append((p, base, mixins))
print("=== Overlays where spec.mixins contains os-ubuntu ===")
for p, base, mixins in os_ubuntu_overlays:
    print(f"{p} -> base={base}, mixins={mixins}")

# 2) Trace the base chain for the specific overlay and report mixins along the chain
target = "recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml"

def build_base_map():
    m = {}
    for p in paths:
        try:
            doc = load_yaml(p)
        except Exception:
            continue
        spec = doc.get("spec") if isinstance(doc, dict) else None
        base = spec.get("base") if isinstance(spec, dict) else None
        name = doc.get("metadata", {}).get("name") if isinstance(doc.get("metadata"), dict) else None
        # map by metadata.name when available, else by filename
        if name:
            m[name] = p
        m[p] = p
        # also map base identifiers to paths later if they correspond to metadata.name
    return m

target_doc = load_yaml(target)
target_spec = target_doc.get("spec", {})
target_base = target_spec.get("base")
target_mixins = target_spec.get("mixins")

print("\n=== Target overlay info ===")
print(f"file={target}")
print(f"spec.base={target_base}")
print(f"spec.mixins={target_mixins}")

# Resolve base chain: base may be "some-overlay.yaml" or metadata.name or relative ref.
# We'll attempt common patterns:
# - If base endswith .yaml, resolve to that file path.
# - Else, treat base as a metadata.name and look up overlays with matching metadata.name.
name_to_path = {}
for p in paths:
    try:
        doc = load_yaml(p)
    except Exception:
        continue
    md = doc.get("metadata", {})
    if isinstance(md, dict) and "name" in md:
        name_to_path[md["name"]] = p

def resolve_base(base):
    if base is None:
        return None
    if isinstance(base, str) and base.endswith(".yaml"):
        # base might be relative to recipes/overlays
        candidate = os.path.normpath(os.path.join("recipes/overlays", base))
        if os.path.exists(candidate):
            return candidate
        if os.path.exists(base):
            return base
        # try absolute relative
        if os.path.exists(os.path.join("recipes/overlays", os.path.basename(base))):
            return os.path.join("recipes/overlays", os.path.basename(base))
        return None
    if isinstance(base, str):
        return name_to_path.get(base)
    return None

chain = []
seen = set()
cur = target
while cur and cur not in seen:
    seen.add(cur)
    doc = load_yaml(cur)
    spec = doc.get("spec", {}) if isinstance(doc, dict) else {}
    base = spec.get("base")
    mixins = spec.get("mixins")
    chain.append((cur, base, mixins))
    cur = resolve_base(base)

print("\n=== Inheritance chain (child -> ancestors via spec.base) ===")
for i, (f, b, mix) in enumerate(chain):
    print(f"{i}: {f}")
    print(f"   base: {b}")
    print(f"   mixins: {mix}")

# 3) For each overlay in the chain that contains os-ubuntu, show whether it is also declared in its immediate parent
# (i.e., whether redeclaration occurs at each step).
print("\n=== os-ubuntu redeclaration along chain ===")
for i in range(1, len(chain)):
    child = chain[i-0][0]  # current? careful
PY

# Print a small, readable excerpt of the target overlay around the relevant lines
echo
echo "=== Excerpt: target overlay ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml | nl -ba | sed -n '20,60p'

echo
echo "=== Excerpt: parent overlay candidate(s) based on spec.base ==="
# Extract spec.base from target quickly and then print that file if resolvable
base_line=$(python3 - <<'PY'
import yaml
p="recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml"
with open(p,'r',encoding='utf-8') as f:
    doc=yaml.safe_load(f)
print(doc.get('spec',{}).get('base'))
PY
)
echo "spec.base from target: $base_line"
if [[ "$base_line" == *.yaml ]]; then
  # try common resolution
  if [[ -f "recipes/overlays/$base_line" ]]; then
    sed -n '1,120p' "recipes/overlays/$base_line" | nl -ba | sed -n '20,60p'
  elif [[ -f "$base_line" ]]; then
    sed -n '1,120p' "$base_line" | nl -ba | sed -n '20,60p'
  fi
fi

Repository: NVIDIA/aicr

Length of output: 3007


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Show relevant YAML snippets (avoid `nl`)
echo "=== Target overlay (mixins + base) ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml | cat -n | sed -n '20,60p'
echo
echo "=== Parent overlay (mixins + base) ==="
sed -n '1,120p' recipes/overlays/a100-aks-ubuntu-training.yaml | cat -n | sed -n '20,60p'

# 2) Locate recipe overlay/mixin resolution logic in Go
echo
echo "=== Search for mixins resolution/merge logic ==="
rg -n --hidden --glob '!**/vendor/**' "mixins" pkg/recipe pkg/bundler pkg/recipe* || true

# 3) Narrow to likely overlay resolver files
echo
echo "=== Candidate files in pkg/recipe for overlay processing ==="
rg -n --hidden --glob '!**/vendor/**' "overlay" pkg/recipe || true

# 4) Search for YAML spec struct fields named Mixins
echo
echo "=== Search for struct fields named Mixins ==="
rg -n --hidden --glob '!**/vendor/**' "Mixins" pkg/recipe pkg || true

Repository: NVIDIA/aicr

Length of output: 50367


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/recipe/metadata.go (mixins merge/accumulation) ==="
sed -n '680,760p' pkg/recipe/metadata.go | nl -ba | sed -n '1,120p'

echo
echo "=== pkg/recipe/metadata_store.go (mergeMixins) ==="
sed -n '520,700p' pkg/recipe/metadata_store.go | nl -ba | sed -n '1,220p'

Repository: NVIDIA/aicr

Length of output: 175


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/recipe/metadata.go: mixins accumulation ==="
# show without nl
sed -n '680,760p' pkg/recipe/metadata.go

echo
echo "=== pkg/recipe/metadata_store.go: mergeMixins ==="
sed -n '520,700p' pkg/recipe/metadata_store.go

Repository: NVIDIA/aicr

Length of output: 10009


Remove redundant os-ubuntu from a100-aks-ubuntu-training-kubeflow mixins

a100-aks-ubuntu-training already declares spec.mixins: [os-ubuntu]; inheritance-chain mixins are accumulated into descendants with de-duplication, so a100-aks-ubuntu-training-kubeflow only needs to add platform-kubeflow (dropping the repeated os-ubuntu is safe/reduces noise). [pkg/recipe/metadata.go] [recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml, recipes/overlays/a100-aks-ubuntu-training.yaml]

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml` around lines 32 -
35, The mixins list in the overlay a100-aks-ubuntu-training-kubeflow currently
repeats os-ubuntu; remove the redundant entry so only platform-kubeflow remains
in the mixins block. Edit the mixins for the a100-aks-ubuntu-training-kubeflow
recipe (spec.mixins / mixins) and delete "os-ubuntu", leaving the inherited
os-ubuntu from a100-aks-ubuntu-training and keeping "platform-kubeflow" only.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this for consistency with the established convention. Every *-ubuntu-training-kubeflow leaf in the repo (gb200-eks, gb200-oke, h100-aks, h100-eks) declares both os-ubuntu and platform-kubeflow in the leaf's mixins block. Removing it from only this leaf would diverge from that pattern, so leaving it as-is intentionally.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuanchen8911, understood — that's a sensible consistency rationale. Explicitly re-declaring both os-ubuntu and platform-kubeflow in every *-ubuntu-training-kubeflow leaf makes the mixin intent self-documenting and keeps the pattern uniform across all siblings. Leaving it as-is makes sense.


✏️ Learnings added
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1295
File: recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml:32-35
Timestamp: 2026-06-10T20:30:35.631Z
Learning: In NVIDIA/aicr recipe overlays, every `*-ubuntu-training-kubeflow` leaf overlay (e.g., gb200-eks, gb200-oke, h100-aks, h100-eks, a100-aks) explicitly declares both `os-ubuntu` and `platform-kubeflow` in its `spec.mixins` block, even though `os-ubuntu` is already inherited from the parent `*-ubuntu-training` overlay. This is an intentional convention for self-documentation and consistency across sibling overlays. Do not flag this as a redundant mixin redeclaration in `recipes/overlays/*-ubuntu-training-kubeflow.yaml` files.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1046
File: recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml:46-53
Timestamp: 2026-05-28T15:05:30.166Z
Learning: When reviewing an overlay YAML under recipes/overlays, don’t require regenerating BOM/user docs (e.g., running `make bom-docs` and updating `docs/user/container-images.md`) if the overlay reuses chart+version pins that are already present in an existing overlay/registry entry. Concretely, if the overlay’s chart/version pin set (such as the chart name and semver/version pin) exactly matches what’s already reflected in `docs/user/container-images.md`, `make bom-docs` should be a no-op—so BOM regeneration/docs updates are not required. Only flag BOM/doc regeneration when the overlay introduces new (previously unseen) chart/version pins that would change the documented container image/version set.

@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 2c2d1f2 to c3daa9f Compare June 10, 2026 20:27

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-aks-training.yaml`:
- Around line 40-57: The componentRefs entries for gpu-operator and nfd are
missing required keys; update each componentRef (the entries with name:
gpu-operator and name: nfd) to include a version and a valuesFile key per
overlay guidelines; ensure you add the appropriate semantic version string for
the component (e.g., matching other overlays) to the version field and the path
or filename that inherits/overrides values (e.g., the aks-training values file)
to valuesFile so both entries contain name, type, version, and valuesFile.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bd0945c0-3115-4f6a-9c43-a31a57227567

📥 Commits

Reviewing files that changed from the base of the PR and between 2c2d1f2 and c3daa9f.

📒 Files selected for processing (4)
  • recipes/overlays/a100-aks-training.yaml
  • recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-aks-ubuntu-training.yaml
  • recipes/overlays/a100-any.yaml

Comment thread recipes/overlays/a100-aks-training.yaml
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Recipe evidence check

Affected leaf overlays: 3

Recipe Pointer Verify Digest match
a100-aks-training ⚠️ missing
a100-aks-ubuntu-training-kubeflow ⚠️ missing
a100-aks-ubuntu-training ⚠️ missing

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 286bda5 to d915f05 Compare June 11, 2026 01:34
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
recipes/overlays/a100-aks-training.yaml (1)

40-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required version and valuesFile keys to each componentRef.

Both gpu-operator and nfd componentRefs are missing the required version and valuesFile fields specified in the overlay guidelines.

📋 Required fields per coding guidelines

As per coding guidelines: "Reference components in recipe overlays with componentRefs containing name, type, version, and valuesFile"

Add the appropriate version (matching other overlays) and valuesFile path (e.g., values-aks-training.yaml as mentioned in the PR description) to both entries.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-aks-training.yaml` around lines 40 - 57, The
componentRefs entries for gpu-operator and nfd are missing required keys; update
each componentRef (gpu-operator and nfd) to include a version field (use the
same version pattern used in other overlays) and a valuesFile field pointing to
the overlay values file (e.g., values-aks-training.yaml) so both entries include
name, type, version, and valuesFile as required by the recipe overlay
guidelines.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-aks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant constraints entry for
K8s.server.version in the a100-aks-ubuntu-training overlay: since
K8s.server.version is already set to ">= 1.30" in the a100-aks-training parent
and the overlay doesn’t change the value, delete the K8s.server.version
constraint block from a100-aks-ubuntu-training.yaml (unless your intent is to
override or clear the parent value, in which case replace it with a different
value or an explicit clear directive).

---

Duplicate comments:
In `@recipes/overlays/a100-aks-training.yaml`:
- Around line 40-57: The componentRefs entries for gpu-operator and nfd are
missing required keys; update each componentRef (gpu-operator and nfd) to
include a version field (use the same version pattern used in other overlays)
and a valuesFile field pointing to the overlay values file (e.g.,
values-aks-training.yaml) so both entries include name, type, version, and
valuesFile as required by the recipe overlay guidelines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 28cd1b3d-310d-4325-968d-c2dbfb95cfff

📥 Commits

Reviewing files that changed from the base of the PR and between 286bda5 and d915f05.

📒 Files selected for processing (4)
  • recipes/overlays/a100-aks-training.yaml
  • recipes/overlays/a100-aks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-aks-ubuntu-training.yaml
  • recipes/overlays/a100-any.yaml

Comment thread recipes/overlays/a100-aks-ubuntu-training.yaml
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from d915f05 to 3db32e0 Compare June 11, 2026 12:52
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 changed the title WIP: feat(recipes): add A100 AKS training Kubeflow overlay chain feat(recipes): add A100 AKS training Kubeflow overlay chain Jun 11, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 11, 2026 12:52
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 11, 2026 12:52
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 3db32e0 to 879e33b Compare June 11, 2026 14:01
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 879e33b to 22376a4 Compare June 11, 2026 14:26
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 22376a4 to 9c63ca2 Compare June 11, 2026 15:06
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 9c63ca2 to 40266a4 Compare June 11, 2026 16:42
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 40266a4 to 08fe0cc Compare June 11, 2026 17:20
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@mchmarny mchmarny enabled auto-merge (squash) June 11, 2026 19:37
@github-actions github-actions Bot added size/M and removed size/L labels Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-aks-overlays branch from 861076d to 24de015 Compare June 11, 2026 19:42
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@mchmarny mchmarny merged commit 463d6a1 into NVIDIA:main Jun 11, 2026
118 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/recipes size/M theme/recipes Recipe expansion, overlays, mixins, and component registry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants