Add OpenShift Container Platform (OCP) as a service type#1380
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds OpenShift Container Platform ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/010-ocp-helm.md`:
- Line 71: Add language identifiers to all unlabeled fenced code blocks to
comply with markdownlint MD040. In docs/design/010-ocp-helm.md at lines 71-71,
add appropriate language tags (such as yaml, go, or text) after the opening
triple backticks for each fenced code block. Similarly, in
docs/integrator/openshift.md at lines 25-25, add language identifiers to all
unlabeled fenced code blocks. Each opening fence should specify the code
language (e.g., ```yaml instead of just ```).
- Line 5: The ADR status in the header line currently states "Proposed —
2026-06-07 (design-only; not implemented)" but this PR is shipping the actual
implementation. Update this status line to change "Proposed" to "Implemented"
(or similar status indicating the feature is complete), remove the
"(design-only; not implemented)" qualifier, and update the date to reflect when
the implementation was completed. This ensures the architecture documentation
accurately reflects the current state of the OCP support feature.
In `@pkg/recipe/doc.go`:
- Line 71: The doc block in pkg/recipe/doc.go has been updated with
CriteriaServiceOCP documentation at one location, but other service enumeration
lists within the same doc block (such as the initial service list and other
enumeration descriptions) still lack the corresponding ocp entry and
description. Update all service enumeration lists and descriptions throughout
the doc block to consistently include CriteriaServiceOCP with its "Red Hat
OpenShift Container Platform" description, ensuring the package documentation
remains internally consistent across all service listings.
In `@recipes/components/gpu-operator-ocp-olm/readiness.yaml`:
- Around line 34-50: The assertions in the readiness checks lack explicit
resource names in their metadata, causing them to match any
ClusterServiceVersion or Deployment in the gpu-operator namespace with the
matching status conditions. This can cause false positives if unrelated
resources satisfy the conditions first. Add metadata.name constraints to both
assertions: add a name field to the ClusterServiceVersion resource metadata
(around line 37) and add a name field to the Deployment resource metadata in the
operator-deployment-available test (around line 47) to ensure the assertions
target only the specific GPU operator resources, not any resource that happens
to match the status conditions.
In `@recipes/components/gpu-operator-ocp/values.yaml`:
- Around line 57-58: The dcgmExporter.config.data section in values.yaml
contains duplicate DCGM metric IDs that will cause initialization failure:
DCGM_FI_PROF_PCIE_TX_BYTES appears at both line 57 and line 96, and
DCGM_FI_PROF_PCIE_RX_BYTES appears at both line 58 and line 97. Remove the
duplicate definitions of these two metric IDs and keep only one canonical
definition per metric. Review the descriptions at both locations to determine
which version is correct (based on whether they are NVML-sourced or DCP metrics)
and retain only that single definition while removing the redundant entries.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b6b26084-e126-4918-8535-29d8ac6987cb
📒 Files selected for processing (40)
.claude/skills/creating-slide-decks/skeleton.html.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamlcmd/gate/chainsaw-config.yamldocs/contributor/recipe.mddocs/design/010-ocp-helm.mddocs/integrator/openshift.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/manifest/render.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gorecipes/checks/gpu-operator-ocp-olm/readiness.yamlrecipes/checks/network-operator-ocp-olm/readiness.yamlrecipes/checks/nfd-ocp-olm/readiness.yamlrecipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yamlrecipes/components/gpu-operator-ocp-olm/manifests/subscription.yamlrecipes/components/gpu-operator-ocp-olm/readiness.yamlrecipes/components/gpu-operator-ocp-olm/values.yamlrecipes/components/gpu-operator-ocp/manifests/clusterpolicy.yamlrecipes/components/gpu-operator-ocp/manifests/dcgm-exporter-configmap.yamlrecipes/components/gpu-operator-ocp/values.yamlrecipes/components/network-operator-ocp-olm/manifests/operatorgroup.yamlrecipes/components/network-operator-ocp-olm/manifests/subscription.yamlrecipes/components/network-operator-ocp-olm/readiness.yamlrecipes/components/network-operator-ocp-olm/values.yamlrecipes/components/network-operator-ocp/manifests/nicclusterpolicy.yamlrecipes/components/network-operator-ocp/values.yamlrecipes/components/nfd-ocp-olm/manifests/operatorgroup.yamlrecipes/components/nfd-ocp-olm/manifests/subscription.yamlrecipes/components/nfd-ocp-olm/readiness.yamlrecipes/components/nfd-ocp-olm/values.yamlrecipes/components/nfd-ocp/manifests/nodefeaturediscovery.yamlrecipes/components/nfd-ocp/values.yamlrecipes/overlays/ocp-inference.yamlrecipes/overlays/ocp-training.yamlrecipes/overlays/ocp.yamlrecipes/registry.yamlvalidators/performance/nccl_all_reduce_bw_constraint.go
|
@kaponco this PR now has merge conflicts with |
mchmarny
left a comment
There was a problem hiding this comment.
Thanks for the thorough ADR and the careful values-mirroring work — modeling OLM install + CR as two in-tree KindLocalHelm phases is a clean fit.
Architecturally this lands well inside scope: no custom deployer, OLM resources flow through the existing KindLocalHelm + readiness-hooks pipeline, and the ADR explicitly rejected the out-of-scope direct/kubectl apply deployer. That's the right call.
Two things block merge: (1) the entire new rendering path is untested — ADR-010 promises a tests/chainsaw/cli/bundle-ocp/ structure+content test that isn't in this PR, so 6 components, nested-value rendering, the new quote func, and hook ordering all ship unverified; and (2) ocp-inference inherits training-tuned base values (MIG Manager + GDRCopy enabled). The rest are mediums (component-catalog entries, duplicated readiness.yaml) and a couple of low nits, all inline.
I verified the bundle-time rendering model in pkg/manifest/render.go (it injects .Release.Service/.Release.Namespace and nests values under the component name), so the mixed $v + .Release + helm.sh/hook usage is correct, not a bug.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/013-ocp-helm.md`:
- Line 1: The document title has an ADR identifier mismatch that should be
corrected to match the file name. Update the heading in the top-level markdown
title so the ADR number aligns with docs/design/013-ocp-helm.md, and keep the
rest of the title text unchanged.
In `@docs/user/cli-reference.md`:
- Line 842: The markdown blockquote at this spot has an extra blank line inside
the quoted section, triggering MD028. Update the blockquote in the CLI reference
so the quote content stays contiguous without an empty quoted line, and verify
the surrounding markdown renders cleanly with the same quote structure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 11920244-96c4-47f4-9ab6-ecaa71a830dd
📒 Files selected for processing (20)
.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/contributor/recipe.mddocs/design/013-ocp-helm.mddocs/integrator/openshift.mddocs/user/api-reference.mddocs/user/cli-reference.mddocs/user/component-catalog.mdpkg/manifest/render.gopkg/manifest/render_test.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gorecipes/checks/gpu-operator-ocp-olm/health-check.yamlrecipes/checks/gpu-operator-ocp/health-check.yamlrecipes/checks/network-operator-ocp-olm/health-check.yamlrecipes/checks/network-operator-ocp/health-check.yamlrecipes/checks/nfd-ocp-olm/health-check.yamlrecipes/checks/nfd-ocp/health-check.yamlrecipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yaml
💤 Files with no reviewable changes (12)
- recipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yaml
- recipes/checks/nfd-ocp-olm/health-check.yaml
- recipes/checks/gpu-operator-ocp-olm/health-check.yaml
- pkg/manifest/render.go
- recipes/checks/network-operator-ocp/health-check.yaml
- recipes/checks/gpu-operator-ocp/health-check.yaml
- recipes/checks/network-operator-ocp-olm/health-check.yaml
- pkg/recipe/doc.go
- recipes/checks/nfd-ocp/health-check.yaml
- pkg/recipe/criteria_test.go
- pkg/manifest/render_test.go
- pkg/recipe/criteria.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 2738-2743: The `findNumberedDir` helper is hiding the real failure
and contains an unreachable suffix check. Update the `os.ReadDir(outDir)` call
to handle and return its error instead of discarding it, and simplify the
directory-name match in the loop by removing the impossible `"-"+name+"/"`
branch so only the valid `strings.HasSuffix(e.Name(), "-"+name)` check remains.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: e866f5c0-4432-4dee-941e-60372a19f392
📒 Files selected for processing (1)
pkg/bundler/bundler_test.go
Recipe evidence check
No leaf overlays affected by this PR. This gate is warning-only and never blocks merge. |
|
Functionally the PR is in good shape now. Fix the linting errors (make sure |
|
@kaponco the KWOK failures are related to the new OCP overlays ( Recommended direction: either mark/exclude OCP overlays as not KWOK-testable, or add a separate OCP/OLM-aware validation lane. Installing fake CRDs in KWOK would only make sense if the intended coverage is render/apply smoke, not real operator readiness. |
|
@kaponco I noticed all of your commits on this PR are not cryptographically signed again. Please rview the CONTRIBUTING.md#nvidia-org-members-and-automation and amend these commits. It may also be a good idea to squash these 15 commits into logical groups. |
|
@mchmarny yes 2 of the 15 I missed the -s flag. Anyway I ll squash. Since all of them are about OCP introduction, I can squash everything to one commit. if that ok with you |
@kaponco this is -S, as in upper case (cryptographically signed). The doc I referenced above outlines the exact git commands. |
|
@mchmarny , it says in CONTRIBUTING.md:159: ▎ External contributors sign off with -s; NVIDIA organization members use cryptographic signing (-S). See Developer Certificate of Origin for details. However - I amended the commit with the -S |
…charts (NVIDIA#566) Add OCP as a first-class service target using manifest-only Helm charts that install operators via OLM subscriptions and configure them through custom resources, with readiness gates between phases. - Add OCP base, training, and inference recipe overlays - Add OLM installer and CR components for GPU, Network, and NFD operators - Register OCP components with health checks in the component registry - Add `ocp` to service criteria enum across API spec, CLI, and docs - Add manifest render helper for the `quote` template function - Add OCP bundle integration test via chainsaw - Exclude OCP recipes from KWOK testing - Add design doc and integrator guide for OCP deployment Signed-off-by: Shai Kapon <[email protected]>
Summary
Add OpenShift Container Platform (OCP) as a first-class service type, deploying operators via OLM using the
existing Helm bundler pipeline.
Motivation / Context
AICR currently supports EKS, GKE and more cluster providers. OCP customers need GPU-accelerated recipes but
OCP deploys operators through OLM (Operator Lifecycle Manager) rather than direct Helm installs. This PR
models each OCP operator as two in-tree Helm chart phases — an OLM installer (Subscription + OperatorGroup)
and a CR configuration phase — with Chainsaw readiness gates ensuring operators are available before CRs are
applied.
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
Testing
# Commands run (prefer `make qualify` for non-trivial changes) make qualifyEverything is green — ready to commit.
Risk Assessment
Rollout notes:
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info