Add GB200 overlays for OKE #497
|
Welcome to AICR, @OguzPastirmaci! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
ReviewFindings
Open questions / notes
|
yuanchen8911
left a comment
thanks for the PR. left some comments.
|
@OguzPastirmaci would you be interested in working with me to add Nodewright optimizations here like I have done with AWS? There are two levels we could do:
I would just need access to a node for an hour or two to make sure the packages work correctly, or I could work with you on a call or asynchronously. Looking forward to enabling Nodewright for OKE! |
Thanks @yuanchen8911, updated based on your feedback. |
@ayuskauskas happy to help. Could you elaborate on what you mean by "bringing a vanilla OKE worker up to spec"? Here's the repo with the Packer templates that we use to build our HPC/GPU images: https://github.com/oracle-quickstart/oci-hpc-images |
I mean starting an OKE cluster, selecting an off-the-shelf worker image, and installing/configuring everything needed to make it support AI workloads. The repository you linked seems to do exactly that, so it would make a great starting point. There are a few options to take:
I have a few questions:
|
|
Thanks for the updates. The structure follows existing patterns, validation placement is now consistent with #493, and the Ubuntu version aligns with other services. Please rebase. |
7785913 to 5f34df5
yuanchen8911
left a comment
KWOK Tests Failing: DRA Driver Scheduling Regression
All OKE KWOK tests (both Tier 1 and Tier 2) are failing because nvidia-dra-driver-gpu-controller can't be scheduled on any KWOK node.
Root cause: This PR adds a new OKE-specific values file (recipes/components/nvidia-dra-driver-gpu/values-oke.yaml) that hardcodes a hard node affinity requiring node-role.kubernetes.io/node:

    controller:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "node-role.kubernetes.io/node"
                    operator: "Exists"

The nvidia-dra-driver-gpu component already exists in the codebase and works fine; this new OKE-specific values file introduces the regression. KWOK simulated GPU worker nodes don't have the `node-role.kubernetes.io/node` label, resulting in `0/7 nodes are available: 7 node(s) didn't match Pod's node affinity/selector` for every OKE recipe.
Suggested fixes (pick one):
- Add the `node-role.kubernetes.io/node` label to the KWOK simulated node templates (in `kwok/` config files)
- Use the system `nodeScheduling` mechanism from the registry instead of hardcoding affinity in values, so the KWOK test harness can set it appropriately
- Use `preferredDuringScheduling` (soft) affinity instead of `requiredDuringScheduling` (hard)
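
For illustration, the soft-affinity option could look roughly like the fragment below. This is a minimal sketch, not the change that was merged: it assumes the chart passes `controller.affinity` through to the pod spec the same way the hard-affinity snippet above does, and the `weight` of 100 is an arbitrary illustrative value.

```yaml
# Hypothetical values-oke.yaml fragment (assumption: controller.affinity is
# rendered verbatim into the controller pod spec, as in the snippet above).
controller:
  affinity:
    nodeAffinity:
      # Soft preference: the scheduler favors labeled nodes but can still
      # place the pod on nodes without the label (e.g. KWOK simulated nodes).
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100  # illustrative value; valid range is 1-100
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/node"
                operator: "Exists"
```

With a preference rather than a requirement, nodes lacking `node-role.kubernetes.io/node` remain schedulable, so the `didn't match Pod's node affinity/selector` failure should not occur. The trade-off is that the controller is no longer guaranteed to land on a labeled node in a real OKE cluster.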
|
Updated and confirmed with a local KWOK run. |
I think what we have wouldn't work easily at this stage, as our builds take more than an hour. We pre-install the GPU & OFED drivers and other things like the Lustre client in our images.
|
|
Please rebase and squash. |
7ab2e70 to c5756a9
Add OKE service family support with GB200 accelerator overlays for training and inference workloads. Includes GPU Operator and DRA driver configurations for OKE's host-installed driver model.
c5756a9 to aadd07c
Summary
Add GB200 recipes for OKE using the `BM.GPU.GB200-v3.4` shape.
Motivation / Context
Fixes:
Related:
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`
Implementation Notes
Testing
# Commands run (prefer `make qualify` for non-trivial changes)
make qualify
Risk Assessment
Rollout notes:
Checklist
- `make test` with `-race`
- `make lint`
- `git commit -S` (GPG signing info)