Skip to content

feat(recipes): add A100 OKE training Kubeflow overlay chain#1294

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/a100-oke-overlays
Jun 11, 2026
Merged

feat(recipes): add A100 OKE training Kubeflow overlay chain#1294
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/a100-oke-overlays

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Add the first concrete A100 overlays (issue #1002): the OKE Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor.

Motivation / Context

a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service oke ... cannot resolve. This is the dedicated tracker spun out of the #969 validation-coverage audit. This PR seeds the A100 OKE chain starting with the Kubeflow training leaf.

Training is modeled on the H100 training overlays — the GB200 OKE training recipe centers on an NVLS performance path that A100 does not have — with the OKE-specific plumbing taken from gb200-oke-training.

Fixes: N/A (incremental — first slice of #1002)
Related: #1002, #1295, #1305, #1306, #969, #1256

A100 overlay series — tracked in #1002: #1294 (OKE) ← this PR · #1295 (AKS) · #1305 (EKS) · #1306 (GKE)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

New overlays (reuse existing oke/oke-training parents and values-oke*.yaml — no new component values files):

  • a100-any — deployment-phase floor: 4 standard checks + Deployment.gpu-operator.version >= v24.6.0. A100 floored at the H100/H200 generation baseline, not the Blackwell v25.10.0, since A100 has been operator-supported since v22.9.
  • a100-oke-trainingbase: oke-training; K8s >= 1.30. A100 has no IMEX/NVLink ComputeDomain requirement, so it keeps the OKE training baseline rather than GB200's 1.34. gpu-operator deps + gdrcopy, nfd topologyUpdater. Nodewright tuning omitted (packages don't target service: oke), matching gb200-oke-training. Conformance set mirrors gb200-oke-training (training-appropriate; no inference-only checks).
  • a100-oke-ubuntu-training+ os-ubuntu mixin.
  • a100-oke-ubuntu-training-kubeflow+ platform-kubeflow (Kubeflow Trainer for distributed TrainJob).

Key decisions documented in-file:

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestOverlayValidationPhaseFloor auto-discovery)
yamllint recipes/overlays/a100-*.yaml    # clean
# End-to-end resolution of the new leaf:
aicr recipe --service oke --accelerator a100 --os ubuntu --intent training --platform kubeflow
#   -> components=12 overlays=8; K8s '>= 1.30'; Deployment.gpu-operator.version '>= v24.6.0';
#      kubeflow-trainer present; conformance includes gang-scheduling/pod-autoscaling/cluster-autoscaling;
#      gpu-operator inherits values-oke-training.yaml (driver.enabled:false via oke.yaml)

Full make qualify not required: this touches only YAML overlay files (zero .go changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by go test ./pkg/recipe/... (passes) and yamllint (clean). No docs/ page enumerates individual overlay leaves, so no doc update is needed.

Risk Assessment

  • Low — Additive overlays only; no existing recipe or Go path changes. Easy to revert.

Rollout notes: Additive. Subsequent A100 OKE leaves (plain training, inference, dynamo) and other services (EKS/GKE/AKS) tracked as follow-ups under #1002.

Checklist

  • Tests pass locally (go test ./pkg/recipe/...)
  • Linter passes (yamllint clean; no .go files changed)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (covered by existing auto-discovery TestOverlayValidationPhaseFloor)
  • I updated docs if user-facing behavior changed (N/A — no leaf enumeration in docs)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 10, 2026 19:33
@yuanchen8911 yuanchen8911 added the theme/recipes Recipe expansion, overlays, mixins, and component registry label Jun 10, 2026
@yuanchen8911 yuanchen8911 marked this pull request as draft June 10, 2026 19:34
@yuanchen8911 yuanchen8911 changed the title feat(recipes): add A100 OKE Dynamo inference overlay chain WIP: feat(recipes): add A100 OKE Dynamo inference overlay chain Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds four RecipeMetadata overlays: a100-any (base overlay applying to any service with accelerator a100, adding deployment checks and a GPU Operator version constraint), a100-oke-training (inherits oke-training, narrows to OKE+A100+training, adds K8s >= 1.30, component Helm overrides, and intent-layer conformance checks), a100-oke-ubuntu-training (adds os=ubuntu mixin and K8s >= 1.30), and a100-oke-ubuntu-training-kubeflow (adds platform=kubeflow mixin and keeps K8s >= 1.30), with layered inheritance between them.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested labels

size/M

Suggested reviewers

  • xdu31
  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description clearly explains the changes: adding A100 overlays for OKE training flows with specific details about each overlay, implementation decisions, and testing confirmation.
Title check ✅ Passed The pull request title accurately and specifically describes the main change: adding four A100 OKE training overlay YAML files with a Kubeflow variant, which is the primary purpose of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 4e14c6c to 7af2056 Compare June 10, 2026 19:42
@yuanchen8911 yuanchen8911 changed the title WIP: feat(recipes): add A100 OKE Dynamo inference overlay chain WIP: feat(recipes): add A100 OKE training Kubeflow overlay chain Jun 10, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-oke-training.yaml`:
- Around line 38-57: The overlay's componentRefs for gpu-operator and nfd are
missing required fields; update the gpu-operator and nfd entries (under
componentRefs) to include explicit version and valuesFile keys (alongside
existing name, type, dependencyRefs/overrides) so they conform to the overlay
contract and ensure reproducible pinning; set valuesFile to the appropriate
inherited values filename and set version to the pinned chart version you want
to use.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f689edcb-b302-4669-8e59-e5cf6caa2778

📥 Commits

Reviewing files that changed from the base of the PR and between 4e14c6c and 7af2056.

📒 Files selected for processing (4)
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-oke-training.yaml
  • recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-oke-ubuntu-training.yaml

Comment on lines +38 to +57
componentRefs:
# A100-specific GPU Operator overrides (inherits valuesFile from oke-training).
# Nodewright tuning is omitted, mirroring the GB200 OKE training overlay:
# the nodewright-customizations packages do not target service: oke.
- name: gpu-operator
type: Helm
dependencyRefs:
- nfd
- cert-manager
- kube-prometheus-stack
overrides:
gdrcopy:
enabled: true

- name: nfd
type: Helm
overrides:
topologyUpdater:
enable: true

@coderabbitai coderabbitai Bot Jun 10, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit version and valuesFile for each componentRef.

gpu-operator and nfd refs omit required fields in this overlay. Please define explicit version and valuesFile (even if matching inherited values) to satisfy overlay contract and keep pins reproducible.

As per coding guidelines: recipes/overlays/*.yaml entries in componentRefs must contain name, type, version, and valuesFile.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-oke-training.yaml` around lines 38 - 57, The overlay's
componentRefs for gpu-operator and nfd are missing required fields; update the
gpu-operator and nfd entries (under componentRefs) to include explicit version
and valuesFile keys (alongside existing name, type, dependencyRefs/overrides) so
they conform to the overlay contract and ensure reproducible pinning; set
valuesFile to the appropriate inherited values filename and set version to the
pinned chart version you want to use.

Source: Coding guidelines

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors the established overlay-inheritance convention. The merged gb200-oke-training.yaml (and the other intent-layer training overlays) intentionally omit version and valuesFile on the gpu-operator and nfd componentRefs because both are inherited: gpu-operator's valuesFile comes from oke-training (components/gpu-operator/values-oke-training.yaml) and its chart version (v26.3.2) from base/oke. Re-pinning them in this leaf would duplicate inherited values and create two places to update on a chart bump. Resolving as consistent with the existing pattern.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuanchen8911, understood — that's a clear and well-reasoned explanation. Leaf intent-layer overlays that inherit version and valuesFile from their parent (oke-training / base) intentionally omit those fields to avoid duplication and a two-place update burden on chart bumps. This is consistent with the established gb200-oke-training.yaml pattern. Withdrawing the suggestion.


✏️ Learnings added
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1294
File: recipes/overlays/a100-oke-training.yaml:38-57
Timestamp: 2026-06-10T20:30:11.535Z
Learning: In `recipes/overlays/*.yaml`, intent-layer (leaf) training overlays such as `a100-oke-training.yaml` and `gb200-oke-training.yaml` intentionally omit `version` and `valuesFile` from `componentRefs` entries (e.g., `gpu-operator`, `nfd`) when those fields are already inherited from a parent overlay (e.g., `oke-training` provides `valuesFile: components/gpu-operator/values-oke-training.yaml` and `base`/`oke` pins the chart version). Re-specifying inherited fields in a leaf overlay would create duplication and require two-place updates on chart bumps. The coding guideline requiring `name`, `type`, `version`, and `valuesFile` in `componentRefs` applies only when an overlay introduces a genuinely new component pin not already present in the inheritance chain.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: yuanchen8911
Repo: NVIDIA/aicr PR: 1046
File: recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml:46-53
Timestamp: 2026-05-28T15:05:30.166Z
Learning: When reviewing an overlay YAML under recipes/overlays, don’t require regenerating BOM/user docs (e.g., running `make bom-docs` and updating `docs/user/container-images.md`) if the overlay reuses chart+version pins that are already present in an existing overlay/registry entry. Concretely, if the overlay’s chart/version pin set (such as the chart name and semver/version pin) exactly matches what’s already reflected in `docs/user/container-images.md`, `make bom-docs` should be a no-op—so BOM regeneration/docs updates are not required. Only flag BOM/doc regeneration when the overlay introduces new (previously unseen) chart/version pins that would change the documented container image/version set.

@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 7af2056 to 760fb3d Compare June 10, 2026 20:26
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 10, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
recipes/overlays/a100-oke-training.yaml (1)

38-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing required version and valuesFile fields in componentRefs (raised in previous review).

The gpu-operator and nfd component refs still lack explicit version and valuesFile fields. Per coding guidelines, componentRefs must contain name, type, version, and valuesFile to ensure reproducible version pins.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-oke-training.yaml` around lines 38 - 57, The
componentRefs for gpu-operator and nfd are missing required version and
valuesFile fields; update the gpu-operator and nfd entries (the objects with
name: gpu-operator and name: nfd) to include explicit version: "<pin the
intended chart/operator version>" and valuesFile: "<path/to/values.yaml or
overlay-specific values file>" so they conform to the componentRefs schema and
ensure reproducible pins (add the exact version string and the correct
valuesFile path used by other overlays).

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml`:
- Around line 32-35: The mixins list in this overlay redundantly re-declares
os-ubuntu which is already inherited from the parent overlay
a100-oke-ubuntu-training; remove the os-ubuntu entry and leave only
platform-kubeflow in the mixins array so this overlay only adds the Kubeflow
mixin and relies on the parent for the Ubuntu mixin.

---

Duplicate comments:
In `@recipes/overlays/a100-oke-training.yaml`:
- Around line 38-57: The componentRefs for gpu-operator and nfd are missing
required version and valuesFile fields; update the gpu-operator and nfd entries
(the objects with name: gpu-operator and name: nfd) to include explicit version:
"<pin the intended chart/operator version>" and valuesFile:
"<path/to/values.yaml or overlay-specific values file>" so they conform to the
componentRefs schema and ensure reproducible pins (add the exact version string
and the correct valuesFile path used by other overlays).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 84a6a931-a59d-49c2-ae2e-7dcc4cf52587

📥 Commits

Reviewing files that changed from the base of the PR and between 7af2056 and 760fb3d.

📒 Files selected for processing (4)
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-oke-training.yaml
  • recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-oke-ubuntu-training.yaml

Comment thread recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 760fb3d to a2c08ca Compare June 10, 2026 22:09
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 10, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Recipe evidence check

Affected leaf overlays: 3

Recipe Pointer Verify Digest match
a100-oke-training ⚠️ missing
a100-oke-ubuntu-training-kubeflow ⚠️ missing
a100-oke-ubuntu-training ⚠️ missing

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml`:
- Around line 37-40: The overlay contains a duplicated constraints block setting
"K8s.server.version" to ">= 1.30"; remove that entire constraints entry (the
name: K8s.server.version / value: ">= 1.30" block) from this overlay so the
version requirement is inherited from the parent, leaving only the platform:
kubeflow criterion and associated mixin in this overlay; after removal, run the
overlay validation/build to confirm no missing constraints.

In `@recipes/overlays/a100-oke-ubuntu-training.yaml`:
- Around line 35-38: Remove the duplicated K8s.server.version constraint from
the leaf overlay by deleting the constraint entry with name "K8s.server.version"
and value ">= 1.30" in a100-oke-ubuntu-training.yaml; the parent overlay
a100-oke-training.yaml already defines this constraint and it will be inherited
via per-field union merge, so retain the parent definition and leave any other
A100/OKE-specific constraints that are not covered by the mixin or parent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1d63b2d4-b569-4701-8d24-9ca7cf62b548

📥 Commits

Reviewing files that changed from the base of the PR and between 760fb3d and a2c08ca.

📒 Files selected for processing (4)
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-oke-training.yaml
  • recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-oke-ubuntu-training.yaml

Comment thread recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml
Comment thread recipes/overlays/a100-oke-ubuntu-training.yaml
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from a2c08ca to 405c98f Compare June 11, 2026 01:34
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
recipes/overlays/a100-oke-ubuntu-training.yaml (1)

35-38: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider removing inherited constraint to reduce maintenance burden.

The K8s.server.version >= 1.30 constraint is already defined in the parent overlay a100-oke-training.yaml (line 35-36) and will be inherited via the recipe merge process. Repeating it here creates a multi-file update burden if the version floor changes. The comment "A100 + OKE specific constraints (not covered by mixin)" would be accurate in the parent but is misleading here since the parent already provides this constraint.

♻️ Simplify by removing inherited constraint
   mixins:
     - os-ubuntu
 
-  # A100 + OKE specific constraints (not covered by mixin)
-  constraints:
-    - name: K8s.server.version
-      value: ">= 1.30"
-
   componentRefs: []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-oke-ubuntu-training.yaml` around lines 35 - 38, Remove
the duplicated K8s.server.version constraint from the
a100-oke-ubuntu-training.yaml overlay: the constraint block containing name:
K8s.server.version and value: ">= 1.30" is already declared in the parent
overlay a100-oke-training.yaml and will be inherited via the recipe merge, so
delete that constraint entry here (and, if present, update the surrounding
comment so it no longer implies this file defines the version floor).
recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml (1)

37-40: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider removing inherited constraint to reduce maintenance burden.

The K8s.server.version >= 1.30 constraint is inherited from a100-oke-training.yaml (through the parent chain) and doesn't need to be redeclared here. Repeating it creates a three-file update burden (a100-oke-training, a100-oke-ubuntu-training, and this overlay) if the version floor changes. Since this overlay only adds the platform: kubeflow criterion and corresponding mixin, the constraints block can be removed entirely.

♻️ Simplify by removing inherited constraint
   mixins:
     - os-ubuntu
     - platform-kubeflow
 
-  # A100 + OKE specific constraints (not covered by mixin)
-  constraints:
-    - name: K8s.server.version
-      value: ">= 1.30"
-
   componentRefs: []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml` around lines 37 -
40, Remove the redundant constraints block that redeclares the inherited
K8s.server.version floor: delete the block starting at "constraints:" that
contains "name: K8s.server.version" and "value: \">= 1.30\"" from this overlay
(a100-oke-ubuntu-training-kubeflow.yaml); the version requirement is already
inherited from a100-oke-training.yaml via the parent chain and this overlay only
adds the platform: kubeflow criterion and its mixin, so dropping the constraints
block avoids triple-file updates when the version floor changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml`:
- Around line 37-40: Remove the redundant constraints block that redeclares the
inherited K8s.server.version floor: delete the block starting at "constraints:"
that contains "name: K8s.server.version" and "value: \">= 1.30\"" from this
overlay (a100-oke-ubuntu-training-kubeflow.yaml); the version requirement is
already inherited from a100-oke-training.yaml via the parent chain and this
overlay only adds the platform: kubeflow criterion and its mixin, so dropping
the constraints block avoids triple-file updates when the version floor changes.

In `@recipes/overlays/a100-oke-ubuntu-training.yaml`:
- Around line 35-38: Remove the duplicated K8s.server.version constraint from
the a100-oke-ubuntu-training.yaml overlay: the constraint block containing name:
K8s.server.version and value: ">= 1.30" is already declared in the parent
overlay a100-oke-training.yaml and will be inherited via the recipe merge, so
delete that constraint entry here (and, if present, update the surrounding
comment so it no longer implies this file defines the version floor).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 468f2dc5-9254-4ef1-8aa3-ae108f61a098

📥 Commits

Reviewing files that changed from the base of the PR and between a2c08ca and 405c98f.

📒 Files selected for processing (4)
  • recipes/overlays/a100-any.yaml
  • recipes/overlays/a100-oke-training.yaml
  • recipes/overlays/a100-oke-ubuntu-training-kubeflow.yaml
  • recipes/overlays/a100-oke-ubuntu-training.yaml

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 405c98f to 8419626 Compare June 11, 2026 12:52
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 4daa3dc to a916b0c Compare June 11, 2026 14:26
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from a916b0c to 292623e Compare June 11, 2026 15:06
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 292623e to 01bb31f Compare June 11, 2026 16:42
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
Add the first concrete A100 overlays (issue NVIDIA#1002): the OKE Ubuntu
Kubeflow training leaf plus its ancestor chain and the cross-cutting
deployment floor.

Training is modeled on the H100 training overlays (the GB200 OKE training
recipe centers on an NVLS performance path that A100 does not have), with
the OKE-specific plumbing taken from the GB200 OKE training overlay.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0, the H100/H200-generation baseline rather than
  the Blackwell floor)
- a100-oke-training: A100 + OKE training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the OKE training baseline).
  gpu-operator deps + gdrcopy, nfd topologyUpdater; nodewright tuning
  omitted (packages do not target service: oke), matching GB200 OKE.
- a100-oke-ubuntu-training: + os-ubuntu mixin
- a100-oke-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Driver management is inherited unchanged from oke.yaml
(driver.enabled: false): OKE managed GPU node pools provision the Oracle
Linux GPU image with drivers pre-installed on bare-metal A100 shapes
(BM.GPU4.8, BM.GPU.A100-v2.8), same as the GB200/B200 OKE siblings.

Performance gating is intentionally omitted: the H100 nccl-all-reduce-bw
floor (>= 300) is calibrated on an 8-GPU H100 NVLink node and is neither
fabric-class aware nor valid for A100, so an A100-on-OCI NCCL baseline is
deferred to a follow-up.

Refs: NVIDIA#1002
@yuanchen8911 yuanchen8911 force-pushed the feat/a100-oke-overlays branch from 01bb31f to 6a36f9b Compare June 11, 2026 17:20
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294) and AKS (NVIDIA#1295) PRs — only one needs to land; the
  others drop it on rebase.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning uses the nvidia-tuned generic profile
(tuning-generic.yaml, accelerator=generic), mirroring rtx-pro-6000-eks.
nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200;
unlike H200 (same Hopper HGX platform, reuses the h100 profile), A100 is
Ampere with no matching profile. The generic profile is the supported
baseline for such targets: governor=performance, iommu=pt, BBR, without
the HGX-specific IB/hugepage/CPU-isolation settings the h100/gb200
profiles bake in.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles
only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is
not a fallback on immutable GKE COS (reboot/bootloader changes). The
nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
@mchmarny mchmarny merged commit 6eb85ac into NVIDIA:main Jun 11, 2026
118 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 AKS overlays (issue NVIDIA#1002): the AKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100 AKS training overlays (AKS has no GB200 reference).

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE PR (NVIDIA#1294) — only one needs to land; the other drops it on
  rebase.
- a100-aks-training: A100 + AKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the AKS training baseline
  rather than the H100 1.32.4 floor). gpu-operator deps + gdrcopy, nfd
  topologyUpdater; nodewright tuning omitted (packages do not target
  service: aks). Conformance mirrors the H100 AKS training set.
- a100-aks-ubuntu-training: + os-ubuntu mixin
- a100-aks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

AKS pre-installs the NVIDIA container toolkit and the recipe inherits the
AKS gpu-operator values (values-aks-training.yaml) and the InfiniBand
network-operator stack unchanged from aks.yaml.

Performance gating is intentionally omitted: the H100 AKS
nccl-all-reduce-bw floor (>= 100) is calibrated for H100 ND-series nodes
and is neither fabric-class aware nor valid for A100, so an A100-on-Azure
NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training
leaf plus its ancestor chain and the cross-cutting deployment floor.

Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the EKS training baseline
  rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy,
  nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow
  Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100),
mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for
eks-h100 / eks-gb200, with no separate A100 target; per the nodewright
maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only
to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning
profile for A100 here. The recipe criteria stays a100; only the tuning
profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS
nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink
nodes with EFA and is neither fabric-class aware nor valid for A100, so
an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 11, 2026
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf
plus its parent and the cross-cutting deployment floor.

Modeled on the H100 GKE COS training overlays. GKE COS has no separate
Ubuntu intermediate (os: cos is set at the gke-cos service root), so the
chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator
  version pin >= v24.6.0). Cross-service A100 floor, shared with the
  A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink
  ComputeDomain requirement, so it keeps the GKE COS training baseline
  rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed
  TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml,
accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke
package ships baked-in profiles only for gke-h100 / gke-b200, with no
separate A100 target; per the nodewright maintainer the A100-vs-H100
deltas in the DGX Base OS tunings pertain only to baremetal and do not
apply in EKS/GKE, so h100 is the correct tuning profile for A100 here.
The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes,
not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the
GKE TCPXO networking doc to the H100 recipes and call out the A100
exception so users selecting a100-gke-cos-training are not directed to
configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor
(>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither
fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline
is deferred to a follow-up.

Refs: NVIDIA#1002
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/recipes size/L theme/recipes Recipe expansion, overlays, mixins, and component registry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants