Feat/gke cos training overlays skyhook #340

Closed
ayuskauskas wants to merge 2 commits into feat/gke-cos-training-overlays from feat/gke-cos-training-overlays-skyhook

Conversation

@ayuskauskas

Contributor

Summary

Adds a new skyhook-customization specific to GKE COS. It works for both H100 and GB200.

Motivation / Context

Fixes:
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

yuanchen8911 and others added 2 commits March 10, 2026 14:48
Add GKE Container-Optimized OS (COS) overlay recipes for H100 training
workloads. Key differences from EKS:

- GPU driver disabled (GKE preinstalls on COS nodes)
- COS-specific host paths (/home/kubernetes/bin/nvidia)
- Toolkit with RUNTIME_CONFIG_SOURCE=file for COS
- gdrcopy disabled (not supported on COS host-managed driver)
- Prometheus storage uses standard-rwo (GKE PD CSI)
- DRA driver with COS nvidiaDriverRoot
- Skyhook uses no-op (COS has immutable rootfs)
- No AWS components (EFA, EBS CSI)

Overlay chain: base → gke-cos → gke-cos-training → h100-gke-cos-training
Optional: h100-gke-cos-training-kubeflow (adds Kubeflow Trainer)

New files:
- recipes/overlays/gke-cos-training.yaml
- recipes/overlays/h100-gke-cos-training.yaml
- recipes/overlays/h100-gke-cos-training-kubeflow.yaml
- recipes/components/gpu-operator/values-gke-cos-training.yaml

Updated files:
- recipes/overlays/gke-cos.yaml (storage, DRA, constraints, validation)
- recipes/components/gpu-operator/values-gke-cos.yaml (toolkit, gdrcopy)
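
For reviewers, the gpu-operator differences listed in the commit message above might translate into Helm values roughly like the fragment below. This is a hypothetical sketch, not the contents of values-gke-cos.yaml, which this page does not show. The key names follow the upstream GPU Operator chart; applying the COS host path to both `hostPaths.driverInstallDir` and `toolkit.installDir` is an assumption.

```yaml
# Hypothetical sketch of gpu-operator values for GKE COS (not the actual file).
driver:
  enabled: false              # GKE preinstalls the GPU driver on COS nodes

hostPaths:
  # Assumption: the COS driver location from the commit message is set here.
  driverInstallDir: /home/kubernetes/bin/nvidia

toolkit:
  enabled: true
  # Assumption: the toolkit is installed under the same COS host path.
  installDir: /home/kubernetes/bin/nvidia
  env:
    - name: RUNTIME_CONFIG_SOURCE
      value: file             # read the container runtime config from file on COS

gdrcopy:
  enabled: false              # not supported with the COS host-managed driver
```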