feat(recipes): add A100 EKS training Kubeflow overlay chain#1305
Conversation
51ef12d to
6b4a4bd
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR introduces four new recipe overlay manifests for A100 accelerator workloads. A wildcard Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Recipe evidence checkAffected leaf overlays: 3
How to refresh evidenceRun on a cluster matching the recipe's aicr snapshot -o snapshot.yaml
aicr validate \
-r recipes/overlays/<slug>.yaml \
-s snapshot.yaml \
--emit-attestation ./out \
--push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yamlThis gate is warning-only and never blocks merge. See ADR-007 for the trust model. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/a100-eks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant K8s.server.version constraint from the
a100-eks-ubuntu-training overlay: delete the duplicated constraints entry that
sets "K8s.server.version >= 1.30" in the a100-eks-ubuntu-training specialization
and rely on the version floor inherited from the ancestor
a100-eks-training.yaml; only keep an explicit constraint here if this
specialization needs a stricter (higher) minimum than the ancestor.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8c71c1c5-ab8f-479b-a9cc-e7093dd44fff
📒 Files selected for processing (4)
recipes/overlays/a100-any.yamlrecipes/overlays/a100-eks-training.yamlrecipes/overlays/a100-eks-ubuntu-training-kubeflow.yamlrecipes/overlays/a100-eks-ubuntu-training.yaml
6b4a4bd to
331ae98
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml (1)
38-40: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winRedundant constraint inherited from ancestor overlay.
The
K8s.server.version >= 1.30constraint is already defined in the ancestora100-eks-trainingoverlay (line 34-35) and inherited through the overlay chain (a100-eks-training→a100-eks-ubuntu-training→ this file). Repeating it creates a maintenance burden—if the version floor is raised, multiple files must be updated. Remove this redundant constraint block and rely on inheritance unless this specialization requires a stricter (higher) minimum version.♻️ Proposed cleanup
mixins: - os-ubuntu - platform-kubeflow - # A100 + EKS specific constraints (not covered by mixin) - constraints: - - name: K8s.server.version - value: ">= 1.30" - componentRefs: []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml` around lines 38 - 40, Remove the redundant K8s.server.version constraint block from the overlays/a100-eks-ubuntu-training-kubeflow overlay: locate the constraints: section and delete the entry with name: K8s.server.version and value: ">= 1.30" (it’s inherited from the a100-eks-training overlay); if this overlay truly needs a stricter minimum, replace the value instead of duplicating the same constraint.recipes/overlays/a100-eks-ubuntu-training.yaml (1)
36-38: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winRedundant constraint inherited from base overlay.
The
K8s.server.version >= 1.30constraint is already defined in the basea100-eks-trainingoverlay (line 34-35). Due to overlay inheritance, this constraint is automatically inherited by all descendants. Repeating it here creates a maintenance burden—if the version floor is raised, multiple files must be updated. Remove this redundant constraint block and rely on inheritance unless this specialization requires a stricter (higher) minimum version.♻️ Proposed cleanup
mixins: - os-ubuntu - # A100 + EKS specific constraints (not covered by mixin) - constraints: - - name: K8s.server.version - value: ">= 1.30" - componentRefs: []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/a100-eks-ubuntu-training.yaml` around lines 36 - 38, Remove the redundant constraints block that defines "K8s.server.version >= 1.30" from the a100-eks-ubuntu-training overlay (the "- name: K8s.server.version / value: \">= 1.30\"" entry) so the overlay inherits the setting from the base a100-eks-training overlay; only reintroduce a constraints entry here if you intend to enforce a stricter minimum version, and after removal run the overlay validation to confirm no other duplicates remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@recipes/overlays/a100-eks-ubuntu-training-kubeflow.yaml`:
- Around line 38-40: Remove the redundant K8s.server.version constraint block
from the overlays/a100-eks-ubuntu-training-kubeflow overlay: locate the
constraints: section and delete the entry with name: K8s.server.version and
value: ">= 1.30" (it’s inherited from the a100-eks-training overlay); if this
overlay truly needs a stricter minimum, replace the value instead of duplicating
the same constraint.
In `@recipes/overlays/a100-eks-ubuntu-training.yaml`:
- Around line 36-38: Remove the redundant constraints block that defines
"K8s.server.version >= 1.30" from the a100-eks-ubuntu-training overlay (the "-
name: K8s.server.version / value: \">= 1.30\"" entry) so the overlay inherits
the setting from the base a100-eks-training overlay; only reintroduce a
constraints entry here if you intend to enforce a stricter minimum version, and
after removal run the overlay validation to confirm no other duplicates remain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 289c0246-c2c5-4d06-a957-ba6cf54c2d6f
📒 Files selected for processing (4)
recipes/overlays/a100-any.yamlrecipes/overlays/a100-eks-training.yamlrecipes/overlays/a100-eks-ubuntu-training-kubeflow.yamlrecipes/overlays/a100-eks-ubuntu-training.yaml
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
331ae98 to
6bc9cf0
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
6bc9cf0 to
8e2f129
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
8e2f129 to
981a3b9
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
981a3b9 to
c6724cd
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
c6724cd to
6b02232
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
6b02232 to
d802c4c
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning is omitted: nvidia-tuning-gke ships baked-in profiles only for gke-h100 / gke-b200, and the EKS nvidia-tuned generic profile is not a fallback on immutable GKE COS (reboot/bootloader changes). The nodewright-operator is still inherited from gke-cos. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
d802c4c to
eaf58ae
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
5ecfad3 to
7587eb9
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
Add A100 EKS overlays (issue NVIDIA#1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment floor. Modeled on the H100/H200 EKS training overlays. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and GKE (NVIDIA#1306) PRs. - a100-eks-training: A100 + EKS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than the H100/H200 1.32.4 floor). gpu-operator cdi + gdrcopy, nfd topologyUpdater. - a100-eks-ubuntu-training: + os-ubuntu mixin - a100-eks-ubuntu-training-kubeflow: + platform-kubeflow (Kubeflow Trainer for distributed TrainJob) Nodewright tuning reuses the h100 profile (tuning.yaml, accelerator=h100), mirroring h200-eks-training. nvidia-setup ships baked-in profiles only for eks-h100 / eks-gb200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100. Performance gating is intentionally omitted: the H100/H200 EKS nccl-all-reduce-bw floor (>= 300) is calibrated on 8-GPU Hopper NVLink nodes with EFA and is neither fabric-class aware nor valid for A100, so an A100-on-EKS NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
239b81a to
f428cb3
Compare
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
Add A100 GKE overlays (issue NVIDIA#1002): the GKE COS Kubeflow training leaf plus its parent and the cross-cutting deployment floor. Modeled on the H100 GKE COS training overlays. GKE COS has no separate Ubuntu intermediate (os: cos is set at the gke-cos service root), so the chain is gke-cos-training -> a100-gke-cos-training -> ...-kubeflow. New overlays: - a100-any: deployment-phase floor (4 standard checks + gpu-operator version pin >= v24.6.0). Cross-service A100 floor, shared with the A100 OKE (NVIDIA#1294), AKS (NVIDIA#1295), and EKS (NVIDIA#1305) PRs. - a100-gke-cos-training: A100 + GKE COS training (K8s >= 1.30; no NVLink ComputeDomain requirement, so it keeps the GKE COS training baseline rather than the H100 1.32 floor). gpu-operator cdi, nfd topologyUpdater. - a100-gke-cos-training-kubeflow: Kubeflow Trainer for distributed TrainJob (declared inline, matching the GKE COS pattern). Nodewright tuning reuses the h100 profile (tuning-gke.yaml, accelerator=h100), mirroring h100-gke-cos-training. The nvidia-tuning-gke package ships baked-in profiles only for gke-h100 / gke-b200, with no separate A100 target; per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria stays a100; only the tuning profile selector is h100. gke-nccl-tcpxo is omitted: GPUDirect-TCPXO targets H100 a3-mega nodes, not the A100 a2 (a2-highgpu / a2-ultragpu) machine family. Scope the GKE TCPXO networking doc to the H100 recipes and call out the A100 exception so users selecting a100-gke-cos-training are not directed to configure TCPXO prerequisites the bundle never installs. Performance gating is omitted: the H100 GKE nccl-all-reduce-bw floor (>= 250) is calibrated on 8-GPU H100 NVLink nodes and is neither fabric-class aware nor valid for A100, so an A100-on-GKE NCCL baseline is deferred to a follow-up. Refs: NVIDIA#1002
Summary
Add A100 EKS overlays (issue #1002): the EKS Ubuntu Kubeflow training leaf plus its ancestor chain and the cross-cutting deployment-phase floor. Modeled on the H100/H200 EKS training overlays.
Motivation / Context
a100is a declared accelerator inpkg/recipe/criteria.gobut has zero overlays inrecipes/overlays/, soaicr recipe --accelerator a100 --service eks ...cannot resolve. Companion to the A100 OKE (#1294) and AKS (#1295) PRs; this slice covers EKS.Fixes: N/A (incremental — part of #1002)
Related: #1002, #1294, #1295, #1306, #969, #1256
A100 overlay series — tracked in #1002: #1294 (OKE) · #1295 (AKS) · #1305 (EKS) ← this PR · #1306 (GKE)
Type of Change
Component(s) Affected
pkg/recipe)Implementation Notes
New overlays (reuse existing
eks/eks-trainingparents andvalues-eks-training.yaml— no new component values files):a100-any— deployment-phase floor: 4 standard checks +Deployment.gpu-operator.version >= v24.6.0(H100/H200-generation baseline; A100 operator-supported since v22.9).a100-eks-training—base: eks-training;K8s >= 1.30. A100 has no NVLink ComputeDomain requirement, so it keeps the EKS training baseline rather than H100/H200's1.32.4. gpu-operatorcdi+gdrcopy, nfdtopologyUpdater. Conformance mirrors the H100/H200 EKS training set.a100-eks-ubuntu-training—+ os-ubuntumixin.a100-eks-ubuntu-training-kubeflow—+ platform-kubeflow(Kubeflow Trainer for distributedTrainJob).Key decisions documented in-file:
Nodewright tuning reuses the h100 profile (
tuning.yaml,accelerator: h100), mirroringh200-eks-training.nvidia-setupships baked-in profiles only foreks-h100/eks-gb200, with no separate A100 target — but per the nodewright maintainer the A100-vs-H100 deltas in the DGX Base OS tunings pertain only to baremetal and do not apply in EKS/GKE, so h100 is the correct tuning profile for A100 here. The recipe criteria staysa100; only the tuning profile selector ish100.Performance gating intentionally omitted. The H100/H200 EKS sibling pins
nccl-all-reduce-bw >= 300, calibrated on 8-GPU Hopper NVLink nodes with EFA — neither fabric-class aware (nccl-all-reduce-bw training gate is a fixed absolute fabric-specific busbw value applied to SKU-agnostic recipes → false-fails EKS/H100 small SKUs #1256) nor valid for A100. An A100-on-EKS NCCL baseline is deferred to a follow-up.Testing
Full
make qualifynot required: this touches only YAML overlay files (zero.gochanges), so the Go lint/test/e2e gates cannot regress from it. The embedded overlays are exercised bygo test ./pkg/recipe/...(passes) and yamllint (clean). Nodocs/page enumerates individual overlay leaves, so no doc update is needed.Risk Assessment
Rollout notes: Additive. Other A100 EKS leaves (plain training, inference, dynamo) and remaining services tracked under #1002.
Checklist
go test ./pkg/recipe/...).gofiles changed)TestOverlayValidationPhaseFloor)git commit -S)