feat(recipes): add A100 EKS training Kubeflow overlay chain by yuanchen8911 · Pull Request #1305 · NVIDIA/aicr

yuanchen8911 · 2026-06-11T01:19:55Z

Summary

Add A100 EKS overlays (issue #1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor. Modeled on the H100/H200 EKS training overlays.

Motivation / Context

a100 is a declared accelerator in pkg/recipe/criteria.go but has zero overlays in recipes/overlays/, so aicr recipe --accelerator a100 --service eks ... cannot resolve. Companion to the A100 OKE (#1294) and AKS (#1295) PRs; this slice covers EKS.

Fixes: N/A (incremental — part of #1002)
Related: #1002, #1294, #1295, #1306, #969, #1256

A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) · #1305 (EKS) ← this PR · #1306 (GKE)

Coordination: a100-any.yaml (the cross-service A100 deployment floor) is byte-identical across #1294, #1295, and this PR. Only one PR needs to introduce it: whichever lands first keeps the file, and the others drop the duplicate on rebase.

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe)

Implementation Notes

New overlays (reuse existing eks/eks-training parents and values-eks-training.yaml — no new component values files):

- a100-any: deployment-phase floor with 4 standard checks plus Deployment.gpu-operator.version >= v24.6.0 (the H100/H200-generation baseline; the operator has supported A100 since v22.9).
- a100-eks-training: base is eks-training; K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than the H100/H200 floor of 1.32.4. Configures gpu-operator cdi + gdrcopy and nfd topologyUpdater. Conformance mirrors the H100/H200 EKS training set.
- a100-eks-ubuntu-training: adds the os-ubuntu mixin.
- a100-eks-ubuntu-training-kubeflow: adds platform-kubeflow (Kubeflow Trainer for distributed TrainJob).
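
To make the inheritance in the list above concrete, here is a minimal sketch of how a leaf could pick up constraints from its ancestors. It is not the aicr resolver: the `Overlay` class, the `resolve` function, and the merge rule (root first, descendants override) are assumptions made for illustration, and only the overlay names, the mixin names, and the `K8s.server.version` value come from this PR. The chain is started at eks-training for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Overlay:
    """Hypothetical, simplified stand-in for one overlay YAML file."""
    name: str
    base: Optional[str] = None
    mixins: List[str] = field(default_factory=list)
    constraints: Dict[str, str] = field(default_factory=dict)


def resolve(leaf: str, overlays: Dict[str, Overlay]) -> Tuple[List[str], Dict[str, str]]:
    """Follow base links from the leaf up to the root, then merge root-first."""
    chain = []
    current = leaf
    while current is not None:
        chain.append(overlays[current])
        current = overlays[current].base
    chain.reverse()  # root first, so a descendant's value wins over an ancestor's

    merged: Dict[str, str] = {}
    for overlay in chain:
        merged.update(overlay.constraints)
    return [o.name for o in chain], merged


# Illustrative data only: the constraint is declared once, on the A100 EKS parent.
OVERLAYS = {
    o.name: o
    for o in [
        Overlay("eks-training"),
        Overlay("a100-eks-training", base="eks-training",
                constraints={"K8s.server.version": ">= 1.30"}),
        Overlay("a100-eks-ubuntu-training", base="a100-eks-training",
                mixins=["os-ubuntu"]),
        Overlay("a100-eks-ubuntu-training-kubeflow",
                base="a100-eks-ubuntu-training",
                mixins=["os-ubuntu", "platform-kubeflow"]),
    ]
}

names, constraints = resolve("a100-eks-ubuntu-training-kubeflow", OVERLAYS)
print(" -> ".join(names))
print(constraints)
```

In this model the Kubeflow leaf declares no constraints of its own yet still resolves with the K8s floor set on a100-eks-training.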

Key decisions documented in-file:

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator: h100), mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200, with no separate A100 target. Per the nodewright maintainer, however, the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to bare metal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stay a100; only the tuning profile selector is h100.
Performance gating is intentionally omitted. The H100/H200 EKS sibling pins nccl-all-reduce-bw >= 300, calibrated on 8-GPU Hopper NVLink nodes with EFA. That floor is neither fabric-class aware (see #1256: the nccl-all-reduce-bw training gate is a fixed, absolute, fabric-specific busbw value applied to SKU-agnostic recipes, which false-fails small EKS/H100 SKUs) nor valid for A100. An A100-on-EKS NCCL baseline is deferred to a follow-up.

Testing

go test ./pkg/recipe/...                 # PASS (incl. TestOverlayValidationPhaseFloor auto-discovery)
yamllint recipes/overlays/a100-*.yaml    # clean
# End-to-end resolution of the new leaf:
aicr recipe --service eks --accelerator a100 --os ubuntu --intent training --platform kubeflow
#   -> components=14 overlays=8; K8s '>= 1.30'; Deployment.gpu-operator.version '>= v24.6.0';
#      kubeflow-trainer present; nodewright-customizations renders tuning-generic (accelerator=generic);
#      gpu-operator inherits values-eks-training.yaml

Full make qualify not required: this touches only YAML overlay files (zero .go changes), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised by go test ./pkg/recipe/... (passes) and yamllint (clean). No docs/ page enumerates individual overlay leaves, so no doc update is needed.

Risk Assessment

Low — Additive overlays only; no existing recipe or Go path changes. Easy to revert.

Rollout notes: Additive. Other A100 EKS leaves (plain training, inference, dynamo) and remaining services tracked under #1002.

Checklist

Tests pass locally (go test ./pkg/recipe/...)
Linter passes (yamllint clean; no .go files changed)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (covered by existing auto-discovery TestOverlayValidationPhaseFloor)
I updated docs if user-facing behavior changed (N/A — no leaf enumeration in docs)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-06-11T01:23:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


📝 Walkthrough

This PR introduces four new recipe overlay manifests for A100 accelerator workloads. A wildcard a100-any overlay establishes baseline deployment validation checks and a gpu-operator version constraint (>= v24.6.0) for all A100 recipes. An a100-eks-training overlay inherits from the eks-training base, targets EKS workloads with A100 accelerator and training intent, enforces Kubernetes >= 1.30, configures gpu-operator Helm settings (enables cdi and gdrcopy), and specifies intent-layer conformance validation checks. Two platform specializations follow: a100-eks-ubuntu-training applies Ubuntu OS constraints via the os-ubuntu mixin, and a100-eks-ubuntu-training-kubeflow adds Kubeflow platform support via os-ubuntu and platform-kubeflow mixins. All variants enforce Kubernetes >= 1.30.
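
The version floors mentioned above (`>= v24.6.0` for gpu-operator, `>= 1.30` for Kubernetes) can be read as "operator, space, version" strings. The sketch below shows one way such a floor could be evaluated. It is an assumption for illustration, not the aicr validator: `parse_version` and `satisfies` are hypothetical helpers, and the space-separated constraint format is inferred from the values quoted in this PR.

```python
import operator
import re

# Comparison operators this sketch understands.
_OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def parse_version(text: str) -> tuple:
    """Turn 'v24.6.0' or '1.30' into a tuple of ints such as (24, 6, 0)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"not a version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies(actual: str, constraint: str) -> bool:
    """Evaluate a constraint like '>= 1.30' against an observed version."""
    op_text, _, floor_text = constraint.strip().partition(" ")
    compare = _OPS[op_text]
    actual_v = parse_version(actual)
    floor_v = parse_version(floor_text)
    # Pad with zeros so (1, 30) is compared as (1, 30, 0).
    width = max(len(actual_v), len(floor_v))
    actual_v += (0,) * (width - len(actual_v))
    floor_v += (0,) * (width - len(floor_v))
    return compare(actual_v, floor_v)


print(satisfies("v1.31.2", ">= 1.30"))
print(satisfies("1.9", ">= 1.30"))
print(satisfies("v24.3.0", ">= v24.6.0"))
```

Comparing integer tuples rather than strings matters here: as text, "1.9" sorts after "1.30", while numerically 1.9 is below the 1.30 floor.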

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

feat(recipes): add concrete A100 service-bound overlays #1002: Adds multiple A100 recipe overlays (a100-any, a100-eks-training, a100-eks-ubuntu-training, a100-eks-ubuntu-training-kubeflow) which correspond to the missing A100 overlays described in the issue.

Suggested labels

size/XL

Suggested reviewers

mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the primary change: adding A100 EKS training Kubeflow overlay chain. It directly aligns with the changeset which introduces four new recipe overlay YAML files for A100 EKS deployment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description clearly relates to the changeset: it explains the motivation for adding four new A100 EKS overlay YAML files, documents implementation decisions (nodewright tuning, performance gating omission), references related issues, and reports testing results.


github-actions · 2026-06-11T01:24:11Z

Recipe evidence check

Affected leaf overlays: 3

Recipe	Pointer	Verify	Digest match
`a100-eks-training`	⚠️ missing	—	—
`a100-eks-ubuntu-training-kubeflow`	⚠️ missing	—	—
`a100-eks-ubuntu-training`	⚠️ missing	—	—

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/a100-eks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant K8s.server.version constraint from the
a100-eks-ubuntu-training overlay: delete the duplicated constraints entry that
sets "K8s.server.version >= 1.30" in the a100-eks-ubuntu-training specialization
and rely on the version floor inherited from the ancestor
a100-eks-training.yaml; only keep an explicit constraint here if this
specialization needs a stricter (higher) minimum than the ancestor.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8c71c1c5-ab8f-479b-a9cc-e7093dd44fff

📥 Commits

Reviewing files that changed from the base of the PR and between 51ef12d and 6b4a4bd.

📒 Files selected for processing (4)

recipes/overlays/a100-any.yaml
recipes/overlays/a100-eks-training.yaml
recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/a100-eks-ubuntu-training.yaml

coderabbitai

♻️ Duplicate comments (2)

recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml (1)
38-40: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Redundant constraint inherited from ancestor overlay.

The K8s.server.version >= 1.30 constraint is already defined in the ancestor a100-eks-training overlay (lines 34-35) and inherited through the overlay chain (a100-eks-training → a100-eks-ubuntu-training → this file). Repeating it creates a maintenance burden: if the version floor is raised, multiple files must be updated. Remove this redundant constraint block and rely on inheritance unless this specialization requires a stricter (higher) minimum version.
♻️ Proposed cleanup
  mixins:
    - os-ubuntu
    - platform-kubeflow

- # A100 + EKS specific constraints (not covered by mixin)
- constraints:
-   - name: K8s.server.version
-     value: ">= 1.30"
-
  componentRefs: []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml` around lines 38 -
40, Remove the redundant K8s.server.version constraint block from the
overlays/a100-eks-ubuntu-training-kubeflow overlay: locate the constraints:
section and delete the entry with name: K8s.server.version and value: ">= 1.30"
(it’s inherited from the a100-eks-training overlay); if this overlay truly needs
a stricter minimum, replace the value instead of duplicating the same
constraint.
recipes/overlays/a100-eks-ubuntu-training.yaml (1)
36-38: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Redundant constraint inherited from base overlay.

The K8s.server.version >= 1.30 constraint is already defined in the base a100-eks-training overlay (lines 34-35). Through overlay inheritance, every descendant picks it up automatically. Repeating it here creates a maintenance burden: if the version floor is raised, multiple files must be updated. Remove this redundant constraint block and rely on inheritance unless this specialization requires a stricter (higher) minimum version.
♻️ Proposed cleanup
  mixins:
    - os-ubuntu

- # A100 + EKS specific constraints (not covered by mixin)
- constraints:
-   - name: K8s.server.version
-     value: ">= 1.30"
-
  componentRefs: []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/a100-eks-ubuntu-training.yaml` around lines 36 - 38, Remove
the redundant constraints block that defines "K8s.server.version >= 1.30" from
the a100-eks-ubuntu-training overlay (the "- name: K8s.server.version / value:
\">= 1.30\"" entry) so the overlay inherits the setting from the base
a100-eks-training overlay; only reintroduce a constraints entry here if you
intend to enforce a stricter minimum version, and after removal run the overlay
validation to confirm no other duplicates remain.
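
The duplicate-constraint finding in the comments above can be checked mechanically. The following is a minimal sketch under stated assumptions: `redundant_constraints` is a hypothetical helper, not part of aicr or CodeRabbit, and the chain is hand-written from the overlay names and the `K8s.server.version` value quoted in the review rather than loaded from the YAML files.

```python
def redundant_constraints(chain):
    """Find constraints that repeat a value already inherited from an ancestor.

    chain: list of (overlay_name, constraints_dict) pairs, root first.
    Returns (overlay, constraint, value, inherited_from) tuples.
    """
    effective = {}  # constraint name -> (value, overlay that set it)
    findings = []
    for name, constraints in chain:
        for key, value in constraints.items():
            if key in effective and effective[key][0] == value:
                # Same value as the ancestor: the entry adds nothing.
                findings.append((name, key, value, effective[key][1]))
            else:
                # New constraint, or a changed floor: this overlay now owns it.
                effective[key] = (value, name)
    return findings


# Hand-written model of the chain as it stood when the review ran.
CHAIN = [
    ("a100-eks-training", {"K8s.server.version": ">= 1.30"}),
    ("a100-eks-ubuntu-training", {"K8s.server.version": ">= 1.30"}),
    ("a100-eks-ubuntu-training-kubeflow", {"K8s.server.version": ">= 1.30"}),
]

for overlay, key, value, source in redundant_constraints(CHAIN):
    print(f"{overlay}: {key} {value} already set by {source}")
```

A descendant that sets a different value is treated as an intentional override rather than a duplicate, which matches the review's exception for a stricter minimum. The sketch does not check whether a changed value is actually stricter.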

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 289c0246-c2c5-4d06-a957-ba6cf54c2d6f

📥 Commits

Reviewing files that changed from the base of the PR and between 6b4a4bd and 331ae98.

📒 Files selected for processing (4)

recipes/overlays/a100-any.yaml
recipes/overlays/a100-eks-training.yaml
recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml
recipes/overlays/a100-eks-ubuntu-training.yaml

Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002

Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs.
- a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater.
- a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern).

Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100.

gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs.

Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002

Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment floor. Modeled on the H100/H200 EKS training overlays.

New overlays:
- a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs.
- a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy, nfd topologyUpdater.
- a100-eks-ubuntu-training: + os-ubuntu mixin
- a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob)

Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100), mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100.

Performance gating is intentionally omitted: the H100/H200 EKS nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink nodes with EFA and is neither fabric-class aware nor valid for A100, so an A100-on-EKS NCCL baseline is deferred to a follow-up.

Refs: NVIDIA#1002


github-actions Bot added area/recipes size/L labels Jun 11, 2026

yuanchen8911 added the theme/recipes Recipe expansion, overlays, mixins, and component registry label Jun 11, 2026

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 51ef12d to 6b4a4bd Compare June 11, 2026 01:22

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread recipes/overlays/a100-eks-ubuntu-training.yaml

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 6b4a4bd to 331ae98 Compare June 11, 2026 01:34

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

mchmarny assigned yuanchen8911 Jun 11, 2026

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 331ae98 to 6bc9cf0 Compare June 11, 2026 12:52

yuanchen8911 changed the title ~~WIP: feat(recipes): add A100 EKS training Kubeflow overlay chain~~ feat(recipes): add A100 EKS training Kubeflow overlay chain Jun 11, 2026

yuanchen8911 marked this pull request as ready for review June 11, 2026 12:53

yuanchen8911 requested a review from a team as a code owner June 11, 2026 12:53

yuanchen8911 requested review from ayuskauskas and mchmarny June 11, 2026 12:53

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 6bc9cf0 to 8e2f129 Compare June 11, 2026 14:01

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 8e2f129 to 981a3b9 Compare June 11, 2026 14:26

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 981a3b9 to c6724cd Compare June 11, 2026 15:06

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from c6724cd to 6b02232 Compare June 11, 2026 16:42

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 6b02232 to d802c4c Compare June 11, 2026 17:20

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from d802c4c to eaf58ae Compare June 11, 2026 17:24

mchmarny approved these changes Jun 11, 2026

View reviewed changes

github-actions Bot added size/M and removed size/L labels Jun 11, 2026

yuanchen8911 force-pushed the feat/a100-eks-overlays branch 2 times, most recently from 5ecfad3 to 7587eb9 Compare June 11, 2026 20:01

mchmarny enabled auto-merge (squash) June 11, 2026 20:19

yuanchen8911 force-pushed the feat/a100-eks-overlays branch from 239b81a to f428cb3 Compare June 11, 2026 20:21

mchmarny merged commit d8d3070 into NVIDIA:main Jun 11, 2026
117 checks passed
