Add GB200 overlays for OKE #497

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from OguzPastirmaci:oke-gb200
Apr 7, 2026

@OguzPastirmaci

Contributor

Summary

Add GB200 recipes for OKE using BM.GPU.GB200-v3.4 shape.

Motivation / Context

Fixes:
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@OguzPastirmaci OguzPastirmaci requested a review from a team as a code owner April 6, 2026 22:20
@copy-pr-bot

copy-pr-bot Bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Apr 6, 2026

Contributor

Welcome to AICR, @OguzPastirmaci! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@yuanchen8911

Contributor

Review

Findings

  1. Medium: gb200-oke training validation is placed in the Ubuntu leaf instead of the intent parent, which is inconsistent with the validation lift-up that just landed in refactor(recipes): lift validation blocks from ubuntu leaves to intent overlays #493. The conformance block in gb200-oke-ubuntu-training.yaml should move to gb200-oke-training.yaml so validation remains OS-independent and follows the current overlay pattern.

Open questions / notes

  • Confirm that OKE should constrain Ubuntu to 22.04 rather than 24.04; this would be the only Ubuntu service family using 22.04.
  • cdi.enabled: true appears in both values-oke.yaml and values-oke-training.yaml. That is redundant for training overlays, though harmless.
  • The absence of Skyhook customization is consistent with the AKS-style pattern.
  • nvidiaDriverRoot: / and the worker-node affinity look coherent for host-installed OKE drivers.

@yuanchen8911 yuanchen8911 left a comment

Contributor

thanks for the PR. left some comments.

@ayuskauskas

Contributor

@OguzPastirmaci, would you be interested in working with me to add Nodewright optimizations here, as I have done for AWS? There are two levels we could target:

  1. Bringing a vanilla Ubuntu OKE worker up to spec, as with AWS, where we set the kernel, install EFA, and set up a RAID for container storage (nvidia-setup).
  2. Adding optimizations, either basic NVIDIA tunings or an OKE-specific one (nvidia-tuned).

I would just need access to a node for an hour or two to make sure the packages work correctly, or I could work with you via a call or asynchronously. Looking forward to enabling Nodewright for OKE!

@OguzPastirmaci

Contributor Author

Thanks @yuanchen8911, updated based on your feedback.

@OguzPastirmaci

Contributor Author

@ayuskauskas happy to help. Could you elaborate on what you mean by "bringing a vanilla OKE worker up to spec"?

Here's the repo with the Packer templates that we use to build our HPC/GPU images: https://github.com/oracle-quickstart/oci-hpc-images

@ayuskauskas

Contributor

"bringing a vanilla OKE worker up to spec"

I mean starting an OKE cluster, selecting an off-the-shelf worker image, and then installing and configuring everything needed to support AI workloads. The repository you linked seems to do exactly that, so it would make a great starting point. There are a few options:

  1. Transcribe the Ansible playbooks into shell scripts to use in a package like nvidia-setup.
  2. Create a new package (oke-setup, for example) in nodewright-packages that clones your repo and runs the Ansible playbooks directly whenever the package runs.
  3. Your repo could publish a nodewright-package that does the same as the above.

I have a few questions:

  1. I see you set up Lustre in that playbook. We chose not to do that in our initial recipes, as it depends on how the infrastructure is set up. In OKE, is Lustre the only option, so that making that choice for users isn't restrictive?
  2. Can we turn parts of the Ansible playbook off? Things like the Mellanox drivers, the GPU driver, and DCGM are better managed by Network Operator and GPU Operator in an AICR-style cluster.

yuanchen8911 previously approved these changes Apr 7, 2026
@yuanchen8911

Contributor

Thanks for the updates. The structure follows existing patterns, validation placement is now consistent with #493, and the Ubuntu version aligns with other services. Please rebase.

@yuanchen8911 yuanchen8911 left a comment

Contributor

KWOK Tests Failing: DRA Driver Scheduling Regression

All OKE KWOK tests (both Tier 1 and Tier 2) are failing because nvidia-dra-driver-gpu-controller can't be scheduled on any KWOK node.

Root cause: This PR adds a new OKE-specific values file (recipes/components/nvidia-dra-driver-gpu/values-oke.yaml) that hardcodes a hard node affinity requiring node-role.kubernetes.io/node:

controller:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "node-role.kubernetes.io/node"
            operator: "Exists"

The nvidia-dra-driver-gpu component already exists in the codebase and works fine; the regression comes from this new OKE-specific values file. KWOK simulated GPU worker nodes don't have the node-role.kubernetes.io/node label, so every OKE recipe fails with "0/7 nodes are available: 7 node(s) didn't match Pod's node affinity/selector".

Suggested fixes (pick one):

  1. Add the node-role.kubernetes.io/node label to the KWOK simulated node templates (in kwok/ config files)
  2. Use the system nodeScheduling mechanism from the registry instead of hardcoding affinity in values, so the KWOK test harness can set it appropriately
  3. Use preferredDuringSchedulingIgnoredDuringExecution (soft) affinity instead of requiredDuringSchedulingIgnoredDuringExecution (hard)
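
For illustration, option 3 could look roughly like the fragment below. This is a minimal sketch, not the change that was merged: it assumes the chart passes controller.affinity through to the pod spec unchanged (as the values file quoted above suggests), and the weight of 100 is an arbitrary illustrative value.

```yaml
# Hypothetical soft-affinity variant of values-oke.yaml (sketch only).
controller:
  affinity:
    nodeAffinity:
      # Soft preference: the scheduler favors matching nodes but can still
      # place the pod on nodes without the label.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100  # assumed value; valid range is 1-100
        preference:
          matchExpressions:
          - key: "node-role.kubernetes.io/node"
            operator: "Exists"
```

With a preferred term, nodes carrying the label are scored higher, but unlabeled nodes such as the KWOK simulated workers remain eligible, so the controller pod would no longer be left unschedulable.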

@yuanchen8911

Contributor

/lgtm
@OguzPastirmaci, please fix the affinity issue.

@OguzPastirmaci

Contributor Author

Updated and confirmed with a local KWOK run.

@OguzPastirmaci

OguzPastirmaci commented Apr 7, 2026

Contributor Author

I think what we have wouldn't work easily at this stage, as our builds take more than an hour. We pre-install the GPU and OFED drivers and other things like the Lustre client in our images.

  1. On Lustre: we install the Lustre client so that users who prefer our managed Lustre service don't need to install it themselves.
  2. On turning parts of the playbook off: we support GPU Operator and Network Operator, but we don't support running the GPU and network drivers as containers today.

@yuanchen8911

Contributor

Please rebase and squash.

Comment thread recipes/components/gpu-operator/values-oke.yaml
Add OKE service family support with GB200 accelerator overlays for
training and inference workloads. Includes GPU Operator and DRA driver
configurations for OKE's host-installed driver model.

@yuanchen8911 yuanchen8911 left a comment

Contributor

/lgtm

@yuanchen8911 yuanchen8911 merged commit fa73b5b into NVIDIA:main Apr 7, 2026
38 checks passed
@OguzPastirmaci OguzPastirmaci deleted the oke-gb200 branch April 7, 2026 21:52

3 participants