Skip to content

feat(recipe): add L40S accelerator support#1510

Merged
mchmarny merged 20 commits into
mainfrom
feat/oke-l40s-support
Jun 29, 2026
Merged

feat(recipe): add L40S accelerator support#1510
mchmarny merged 20 commits into
mainfrom
feat/oke-l40s-support

Conversation

@atif1996

@atif1996 atif1996 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Add L40S as a supported accelerator for OKE and introduce `ol` as the Oracle Linux OS enum. Restore cert-manager to its default AICR-managed installation for OKE. Add a fail-closed guard that returns `ErrCodeInvalidRequest` when a service+accelerator combination requires an explicit `--os` but none was supplied.

Motivation / Context

Three gaps addressed:

  1. L40S on OKE had no recipe — the fingerprint engine already mapped PCI `26b9 → "l40s"` but there were no OKE overlays for it.
  2. OKE service fingerprint was wrong — nodes with a raw OCID providerID (or old snapshots carrying `"oci"` or `"oci://..."`) did not map to `service: oke`. Fixed in `parseProvider` (collector) and `normalizeProviderID` (fingerprint layer) for resilience across all three forms.
  3. Degenerate recipes on missing `--os` — services whose leaf overlays all require an explicit OS (GKE requires `--os cos`, OKE L40S requires `--os ol`) silently returned a base-only recipe with exit 0. The fix adds a fail-closed guard.

Fixes: #1003
Fixes: #1517
Fixes: #1534

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (`pkg/recipe`)
  • Collectors / snapshotter (`pkg/collector`, `pkg/snapshotter`)
  • CLI (`cmd/aicr`, `pkg/cli`)
  • Docs/examples (`docs/`, `examples/`)
  • Other: fingerprint (`pkg/fingerprint`), overlays (`recipes/`), UAT tests

Implementation Notes

  • `ol` OS enum: Oracle Linux reports `ID=ol` in `/etc/os-release`; mirrors `cos` (short OS name). Added to `oskind`, `criteria.go`, `normalizeOSID()`, OpenAPI spec, CLI flags, and all doc enum lists.

  • OKE overlay structure: OS-agnostic parents (`oke`, `oke-training`, `oke-inference`) are preserved for existing overlays (GB200, A100) and external `base:` references. Oracle Linux-specific parents (`oke-ol`, `oke-ol-training`, `oke-ol-inference`) are added for L40S OKE overlays, mirroring the GKE `gke-cos` pattern. Only `l40s-oke-{training,inference}` carry `os: ol`; GB200/A100 OKE overlays remain OS-agnostic.

  • `l40s-any.yaml`: Cross-cutting deployment-phase floor (4 checks + gpu-operator `>= v24.6.0`), mirroring `a100-any.yaml`.

  • cert-manager: Restored to AICR-managed default for OKE. Oracle's cert-manager addon is optional and not deployed by default; AICR installs it automatically as it does on other services.

  • `driver.enabled: false` / `toolkit.enabled: false` in `values-oke.yaml` — Oracle Linux Gen2 GPU OKE image pre-installs both at the OS level.

  • Fail-closed OS guard (`requireOSIfNeeded`): called after `FindMatchingOverlays` in both `BuildRecipeResult` and `BuildRecipeResultWithEvaluator`. Returns `ErrCodeInvalidRequest` with available OS values when a service+accelerator combination has only OS-gated overlays and `--os` was omitted. Services with OS-agnostic overlays (EKS, OKE GB200/A100) are unaffected.

  • Fingerprint resilience: `normalizeProviderID` now handles all three OKE providerID forms: bare OCID (`ocid1.instance…`), old collector string (`"oci"`), and full scheme URL (`"oci://ocid1…"`).

Testing

unset GITLAB_TOKEN && make qualify  # all tests, lint, e2e, scan — clean pass

Verified locally:

  • `aicr recipe --service oke --accelerator l40s --intent training` → error: `service 'oke' requires --os; available values: ol`
  • `aicr recipe --service oke --accelerator l40s --intent training --os ol` → full recipe with `oke-ol`, `oke-ol-training`, `l40s-oke-training`
  • `aicr recipe --service oke --accelerator gb200 --intent training` → succeeds via OS-agnostic `oke → oke-training → gb200-oke-training` chain
  • `aicr recipe --service gke --accelerator h100 --intent training` → error: `service 'gke' requires --os; available values: cos`

Risk Assessment

  • Low — Additive for new L40S OKE path. The OS guard is a new fail-closed behavior for queries that previously returned silent degenerate output; callers already passing `--os` or using snapshot-driven criteria are unaffected.

Rollout notes: N/A

Checklist

  • Tests pass locally (`make test` with `-race`)
  • Linter passes (`make lint`)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (`git commit -S`)

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds L40S as a supported GPU accelerator and Oracle Linux (ol) as a supported OS family across the AICR stack. It updates GPU SKU parsing and OS normalization, adds new criteria constants and parsers, expands OpenAPI enums and documentation, introduces new L40S/OKE recipe overlays, updates existing OKE overlays to target ol, and adds OCI UAT cluster, runner, and Chainsaw validation fixtures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

area/validator, enhancement

Suggested reviewers

  • mchmarny
  • njhensley
  • xdu31
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning #1003 is only partially met: the PR adds OKE L40S overlays, but it also adds deferred training overlays and a separate l40s enum split. Remove the training overlay and avoid splitting the l40/L40S criteria unless you open a separate issue for that follow-up.
Out of Scope Changes check ⚠️ Warning The PR includes broad Oracle Linux, fingerprinting, collector, docs, and UAT changes that are not required by #1003's overlay scope. Split the non-overlay work into separate PRs or explicitly expand the linked issue scope before merging.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding L40S accelerator support.
Description check ✅ Passed The description matches the code changes, covering L40S/OL support, OKE overlays, cert-manager, fingerprinting, and the OS guard.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/oke-l40s-support

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.claude/skills/aicr-analyzing-snapshots/SKILL.md (1)

152-158: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Resolve the merge conflict before merge.

The conflict markers make this Markdown invalid, and the branch version also drops ocp, which is still part of the supported service enum in api/aicr/v1/server.yaml. Keep this table in sync with the API contract.

Proposed cleanup
-<<<<<<< HEAD
-| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` |
-| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` |
-=======
-| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm` |
-| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |
->>>>>>> 6c7ce4c2 (feat(recipe): add L40S accelerator support (OKE overlays + fingerprint fix))
+| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` |
+| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/aicr-analyzing-snapshots/SKILL.md around lines 152 - 158,
Remove the merge conflict markers from the Markdown table in the snapshot
analysis skill document and reconcile the branch/base content into valid table
rows. While fixing the table, make sure the supported service list stays aligned
with the service enum in api/aicr/v1/server.yaml by preserving the ocp entry
alongside the other supported services.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/recipe.md`:
- Around line 152-158: Resolve the merge conflict in the criteria table by
removing the conflict markers and merging both valid updates into a single
Markdown row set. Keep the `service` values aligned with the API contract by
preserving `ocp`, and keep the `accelerator` values aligned with the latest
support by including `l40s`; verify the `CriteriaServiceType` and
`CriteriaAcceleratorType` entries remain consistent with the documented schema.

In `@docs/user/api-reference.md`:
- Around line 121-127: Resolve the merge conflict in the API reference table by
removing the conflict markers and choosing the entries that match the API
schema. In the table for the documented accelerator/service fields, keep the
`service` list aligned with `api/aicr/v1/server.yaml` by retaining `ocp`, and
keep the `accelerator` list consistent with the supported values by including
`l40s` from the feature branch. Use the `service` and `accelerator` rows in the
affected table as the anchors when updating the docs.

In `@docs/user/cli-reference.md`:
- Around line 389-395: Resolve the merge conflict in the CLI reference table by
removing the conflict markers and merging both intended updates. Keep
`--service` aligned with `api/aicr/v1/server.yaml` by preserving `ocp`, and
merge the `--accelerator` values so the list includes the new `l40s` entry
alongside the existing GPU types. Verify the final table formatting in
`docs/user/cli-reference.md` remains valid after the conflict is removed.

In `@pkg/recipe/doc.go`:
- Line 29: The godoc in doc.go is inconsistent because the detailed accelerator
list under the Criteria types section still omits CriteriaAcceleratorL40S even
though the summary comment already mentions l40s. Update the comment block
around the CriteriaAcceleratorType / CriteriaAcceleratorL40 bullet list to
include CriteriaAcceleratorL40S alongside CriteriaAcceleratorL40, keeping the
documented accelerator set aligned with the API changes.

In `@tests/uat/oci/run`:
- Around line 125-130: The helm-diff installation step is unpinned, so UAT runs
can change when upstream releases change. Update the install logic in the run
script’s helm-diff plugin block to request a specific helm-diff version instead
of the default latest release, and apply the same version pinning pattern in the
matching helm-diff install blocks in tests/uat/aws/run and tests/uat/gcp/run so
all UAT environments behave consistently.

In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml`:
- Around line 36-38: Update the CUJ1 training assertion fixture to verify the
OKE cert-manager addon is explicitly disabled, not just present in
componentRefs. In assert-recipe.yaml, extend the existing cert-manager check so
the rendered recipe/manifest is validated against the disabled state for
cert-manager, using the cert-manager and componentRefs assertions together to
catch any regression that re-enables the managed addon.

In `@tests/uat/oci/tests/l40s-training-config.yaml`:
- Around line 77-81: The validate agent block only sets tolerations, so the
validation Jobs can still schedule onto the system pool; update
spec.validate.agent to also mirror the GPU node selector used by
spec.snapshot.agent and keep the existing namespace/tolerations in place. Use
the agent stanza under spec.validate as the target and align it with the
nodeSelector settings from spec.snapshot.agent so validator jobs are constrained
to the L40S GPU pool.

---

Outside diff comments:
In @.claude/skills/aicr-analyzing-snapshots/SKILL.md:
- Around line 152-158: Remove the merge conflict markers from the Markdown table
in the snapshot analysis skill document and reconcile the branch/base content
into valid table rows. While fixing the table, make sure the supported service
list stays aligned with the service enum in api/aicr/v1/server.yaml by
preserving the ocp entry alongside the other supported services.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 78c5e234-83dd-4d27-947f-8a1c6c1d0054

📥 Commits

Reviewing files that changed from the base of the PR and between 9be382b and 0769c3a.

📒 Files selected for processing (31)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/from_measurements_test.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/yaml_test.go
  • recipes/overlays/l40s-any.yaml
  • recipes/overlays/l40s-oke-inference.yaml
  • recipes/overlays/l40s-oke-training.yaml
  • recipes/overlays/l40s-oke-ubuntu-inference.yaml
  • recipes/overlays/l40s-oke-ubuntu-training-kubeflow.yaml
  • recipes/overlays/l40s-oke-ubuntu-training.yaml
  • recipes/overlays/oke.yaml
  • tests/uat/oci/cluster-config.yaml
  • tests/uat/oci/run
  • tests/uat/oci/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-deployment.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yaml
  • tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml
  • tests/uat/oci/tests/l40s-training-config.yaml

Comment thread docs/contributor/recipe.md Outdated
Comment thread docs/user/api-reference.md Outdated
Comment thread docs/user/cli-reference.md Outdated
Comment thread pkg/recipe/doc.go
Comment thread tests/uat/oci/run Outdated
Comment thread tests/uat/oci/tests/cuj1-training/assert-recipe.yaml Outdated
Comment thread tests/uat/oci/tests/l40s-training-config.yaml Outdated
@atif1996 atif1996 force-pushed the feat/oke-l40s-support branch 2 times, most recently from 8c4e842 to ee608da Compare June 28, 2026 01:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/recipe/metadata_test.go (1)

2193-2210: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the new L40S/OKE leaves to this coverage table.

This PR adds l40s-oke-training and l40s-oke-inference, but TestNFDTopologyUpdater_OverlayCoverage still doesn't exercise them. A topology-updater regression on the new OKE/L40S inheritance path would currently pass unnoticed.

Suggested additions
+		{"l40s-oke-training", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentTraining, ""}, true},
+		{"l40s-oke-inference", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentInference, ""}, true},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/metadata_test.go` around lines 2193 - 2210,
TestNFDTopologyUpdater_OverlayCoverage is missing the new OKE/L40S cases, so add
coverage entries for the new l40s-oke-training and l40s-oke-inference leaves in
the existing criteria table. Update the table in metadata_test.go alongside the
other real-cluster GPU leaves so the overlay inheritance path for
CriteriaServiceOKE with CriteriaAcceleratorL40S and the training/inference
intents is exercised and regressions in the new leaf coverage are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml`:
- Around line 21-29: Update the CUJ1 Chainsaw UAT to exercise the new Oracle
Linux OKE path instead of the old Ubuntu selector. In the test spec’s
descriptive text and any recipe/bundle generation steps that still reference
`ubuntu`, change them to `ol` and ensure the `generate recipe`/`generate bundle`
flow explicitly passes `--os ol` so it matches the new OKE overlays.
Double-check the CUJ1 scenario steps and any related assertions in
`chainsaw-test.yaml` so the test validates the OKE/L40S/training path with
Oracle Linux throughout.

---

Outside diff comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 2193-2210: TestNFDTopologyUpdater_OverlayCoverage is missing the
new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and
l40s-oke-inference leaves in the existing criteria table. Update the table in
metadata_test.go alongside the other real-cluster GPU leaves so the overlay
inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the
training/inference intents is exercised and regressions in the new leaf coverage
are caught.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 470dc7bb-f5ae-44d8-a673-3e9ab22d8098

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4e842 and ee608da.

📒 Files selected for processing (38)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/cli/snapshot.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/from_measurements.go
  • pkg/fingerprint/from_measurements_test.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • pkg/recipe/oskind/oskind.go
  • pkg/recipe/oskind/oskind_test.go
  • pkg/recipe/yaml_test.go
  • recipes/overlays/a100-oke-training.yaml
  • recipes/overlays/gb200-oke-inference.yaml
  • recipes/overlays/gb200-oke-training.yaml
  • recipes/overlays/l40s-any.yaml
  • recipes/overlays/l40s-oke-inference.yaml
  • recipes/overlays/l40s-oke-training.yaml
  • recipes/overlays/oke-inference.yaml
  • recipes/overlays/oke-training.yaml
  • recipes/overlays/oke.yaml
  • tests/uat/oci/cluster-config.yaml
  • tests/uat/oci/run
  • tests/uat/oci/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-deployment.yaml
  • tests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yaml
  • tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml
  • tests/uat/oci/tests/l40s-training-config.yaml

Comment thread tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/fingerprint/from_measurements.go`:
- Around line 323-333: Add a regression test for the raw-OCID compatibility path
in TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a34e66a6-5302-453d-865f-b0296bd697b9

📥 Commits

Reviewing files that changed from the base of the PR and between ee608da and ff8442b.

📒 Files selected for processing (3)
  • pkg/collector/k8s/node.go
  • pkg/collector/k8s/node_test.go
  • pkg/fingerprint/from_measurements.go

Comment thread pkg/fingerprint/from_measurements.go

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The L40S + ol OS enum core is well done — criteria/oskind/fingerprint (correct L40S-before-L40 SKU ordering), OpenAPI, CLI, and thorough table-driven tests all line up, and the OKE-requires-ol model mirrors the GKE-cos precedent cleanly.

The blocker is that the diff carries three changes unrelated to L40S that look like collateral from a stale/conflicted branch (mergeable_state is dirty):

  1. Deletes the Rekor supply-chain monitor (.github/workflows/rekor-monitor.yaml, added in #1480) plus its docs — silently removes a security control.
  2. Reverts #1495 by re-adding the single-shot verifyClusterPolicyReady check that was removed for flaking the deployment gate.
  3. New overlays use the pre-#1499 apiVersion aicr.nvidia.com/v1alpha1 instead of the aicr.run/v1alpha2 every other overlay uses.

Please rebase cleanly onto main so these three drop out, then keep only the L40S/ol changes. Two smaller items: the risk assessment labels the OKE os: ol requirement "non-breaking/additive," but it's a behavior change (OKE queries without --os now return no match — worth restating as such); and the unrelated UAT evidence-repo rename (aws/gcp) should ideally land separately. One doc nit inline. Happy to re-review once it's rebased.

Comment thread .github/workflows/rekor-monitor.yaml
Comment thread validators/deployment/expected_resources.go Outdated
Comment thread recipes/overlays/l40s-any.yaml Outdated
Comment thread pkg/client/v1/types.go Outdated
@github-actions

Copy link
Copy Markdown
Contributor

@atif1996 this PR now has merge conflicts with main. Please rebase to resolve them.

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-review: requesting changes

This PR rolls back three recently-merged changes and is incompatible with the current main API version. The rollbacks are reversed in the branch's own commits (its parent already contains #1495, #1480, and #1479), so a plain rebase will not restore them.

Blocking

  1. Re-adds the single-shot verifyClusterPolicyReady deployment check that #1495 removed. It samples ClusterPolicy status.state once with no retry, while the chainsaw check validate-cluster-policy-ready already polls the same state for ~5m. On a freshly-installed gpu-operator the operator-validator init containers take ~2.5m to reconcile, so the Go check records a false failure before chainsaw polls ready — re-introducing the deployment-gate flake #1495 fixed. Drop the Go re-add; keep the polling assertion as the sole source.

  2. All five newly added AICR documents use apiVersion: aicr.nvidia.com/v1alpha1, which current main rejects (expected "aicr.run/v1alpha2"). This fails TestAllMetadataFilesHaveRequiredFields and CLI validation.

  3. Deletes .github/workflows/rekor-monitor.yaml and its runbook in docs/contributor/maintaining.md (added in #1480) with no replacement, dropping hourly Rekor transparency-log consistency + release-identity monitoring. This appears out of scope for an L40S feature PR — please restore it unless the removal is intentional and tracked elsewhere.

  4. The UAT evidence push target reverts the #1478/#1479 stable-repo + :run-${RUN_ID} tagging scheme, creating a new untagged GHCR repository per run. Two effects: (a) the configured reference now lacks an explicit tag, so the summary no longer identifies the bundle — main's #1498 digest-based summary must be preserved through the rebase to fix that; and (b) per-run repositories proliferate regardless, which the digest summary does not address.

Non-blocking (both genuine medium-severity defects)

  1. parseProvider drops the oci://oke mapping in favor of raw-OCID detection only; an oci://… providerID now resolves to service: any. Keep the change additive.

  2. The OCI UAT runner references .github/workflows/uat-oci.yaml, which the PR never adds, and nothing else invokes tests/uat/oci, so the advertised OKE L40S UAT runs in no CI pipeline.

Doc/enum drift not attached inline (outside modified hunks): docs/contributor/index.md:182 — the pkg/recipe/oskind "single source of truth" enumeration omits ol.

Comment thread validators/deployment/expected_resources.go Outdated
Comment thread recipes/overlays/l40s-any.yaml Outdated
Comment thread tests/uat/aws/run Outdated
Comment thread pkg/collector/k8s/node.go
Comment thread tests/uat/oci/run Outdated
Comment thread pkg/client/v1/types.go
Comment thread pkg/fingerprint/types.go
Comment thread pkg/fingerprint/doc.go
Comment thread .claude/skills/aicr-analyzing-snapshots/SKILL.md
@atif1996

Copy link
Copy Markdown
Contributor Author

All four findings from this review are addressed in d0953bc:

  • rekor-monitor.yaml — restored from origin/main; lost when the squash predated ci(supply-chain): add Rekor identity + consistency monitor #1480
  • verifyClusterPolicyReady — call site, function body, and orphaned constants removed; test file restored to main's state
  • apiVersion — all three new overlays updated to aicr.run/v1alpha2
  • ol in doc comments — added to pkg/client/v1/types.go and pkg/fingerprint/doc.go

Tests pass (go test ./validators/...).

atif1996 added 5 commits June 29, 2026 12:20
- Add l40s-oke-training and l40s-oke-inference overlays for OKE
- Add `ol` OS enum (Oracle Linux, maps ID=ol from /etc/os-release)
- Require explicit `os: ol` on all non-ubuntu OKE overlays; mirrors
  GKE/COS model — no `os: any` fallback (query must specify OS)
- Disable cert-manager in the OKE base overlay via enabled: false
- Add fingerprint normalization for Oracle Linux (normalizeOSID)
- Update OpenAPI spec, CLI flags, and all doc enum lists with `ol`
- Add OCI OKE L40S training UAT test suite

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
OKE nodes set spec.providerID to a bare OCID ("ocid1.instance.oc1...")
with no scheme prefix, so the existing "oci://" split never matched and
the raw OCID leaked into the service fingerprint.

Fix in two layers:
- pkg/collector/k8s: parseProvider checks strings.HasPrefix("ocid1.")
  before the "://" split; removes the unreachable "oci" case.
- pkg/fingerprint: normalizeProviderID applied at read time so snapshots
  produced by older agent images are also corrected.

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
- chainsaw-test.yaml: switch --os ubuntu → --os ol (OKE now requires
  explicit OS; ubuntu matched nothing after this PR)
- assert-recipe.yaml: assert cert-manager overrides.enabled=false
  (key OKE behavior was present but unverified)
- l40s-training-config.yaml: add GPU nodeSelector to validate agent
  (mirrors snapshot agent; prevents landing on system-cpu pool)
- from_measurements_test.go: add raw OCID case to ServiceDetection
  test (normalizeProviderID backward-compat path had no coverage)

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
- Restore .github/workflows/rekor-monitor.yaml lost during squash (was
  added in #1480, absent from stale branch base)
- Remove verifyClusterPolicyReady call and function reverted by squash
  (#1495 had removed it; stale branch base brought it back)
- Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in
  l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499
  domain migration)
- Add `ol` to OS enum description in pkg/client/v1/types.go and
  pkg/fingerprint/doc.go

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
- Restore `case "oci": return "oke"` in parseProvider alongside the new
  ocid1. prefix check — both formats must map to oke
- Fix inject_push_target in aws/run, gcp/run, oci/run to use the
  :run-${RUN_ID} tag scheme (#1478/#1479), not the repo-suffix approach
- Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in
  l40s-training-config.yaml and assert-recipe.yaml
- Add `ol` to OSDimension comment in pkg/fingerprint/types.go
- Add `ol` and `talos` to Supported values comment in pkg/client/v1/types.go

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
Services like GKE and OKE whose overlays all carry an explicit OS
criterion (os: cos, os: ol) silently returned a degenerate base-only
recipe with exit 0 when --os was omitted. The resolver logged a
slog.Warn but callers had no way to detect the problem.

Add requireOSIfNeeded, called in both BuildRecipeResult and
BuildRecipeResultWithEvaluator immediately after FindMatchingOverlays.
The guard returns ErrCodeInvalidRequest (with the available OS values)
when a specific service was requested, no OS was supplied, no
service-specific overlay matched, and at least one OS-gated overlay
exists for that service. Services with OS-agnostic overlays (EKS) and
unknown services pass through unchanged.

Add TestBuildRecipeResult_OSRequired and
TestBuildRecipeResultWithEvaluator_OSRequired covering GKE without os
(now errors), GKE with os: cos (succeeds), EKS without os (succeeds),
and unknown service without os (no error).

Fix TestRecipeEndpointPOST/valid_YAML_body: the GKE test body was
missing os: cos and would now return 400 instead of 200.

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
mchmarny
mchmarny previously approved these changes Jun 29, 2026

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — all four outstanding review concerns are addressed at a8c2342:

  • OKE no-OS query / partial recipe: oke.yaml is restored as an OS-agnostic parent (service: oke, no os: field), with the new oke-ol* Oracle Linux variants living alongside it rather than replacing it. A no---os OKE query matches the full overlay chain again.
  • Base-name breakage: oke, oke-training, and oke-inference base names are back, so external --data overlays referencing them continue to resolve.
  • helm-diff supply chain: both tests/uat/aws/run and tests/uat/gcp/run now install from the signed release tarball; --verify=false is gone.
  • Stale comment: the verifyClusterPolicyReady reference in validators/deployment/expected_resources.go was removed.

CI is green and the refactor (keep OS-agnostic parents + add oke-ol* variants) is the right shape.

Two non-blocking follow-ups before merge:

  • The four threads are still formally open in the UI (#2/#4 auto-flipped to outdated; the two oke-ol.yaml threads target a still-present line) — please mark them resolved.
  • Branch is behind main; rebase/update before merging.

@mchmarny mchmarny requested a review from yuanchen8911 June 29, 2026 19:12
@mchmarny mchmarny enabled auto-merge (squash) June 29, 2026 19:13
@mchmarny mchmarny disabled auto-merge June 29, 2026 19:13
requireOSIfNeeded returns ErrCodeInvalidRequest when a query omits --os
but all matching overlays for that service+accelerator combination are
OS-gated. Services with OS-agnostic overlays (EKS, OKE GB200/A100)
remain unaffected; only L40S OKE and GKE (which are purely OS-specific)
now surface a clear error.

availableOSForCriteria uses c.Matches (full criteria check including
intent, accelerator) to build the list of valid OS values shown in the
error message, avoiding false positives for unrelated intent/accelerator
combinations.

Fixes: #1534

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
l40s-oke-training and l40s-oke-inference were missing from
TestNFDTopologyUpdater_OverlayCoverage; a topology-updater regression on
the new OKE/L40S inheritance path would have passed unnoticed.

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
@atif1996

Copy link
Copy Markdown
Contributor Author

Round-3 response

Thanks for the detailed review, @yuanchen8911. All blocking and open items are addressed:

Blocking

1. No---os OKE query returns a degenerate recipe (not "no match")
Added a requireOSIfNeeded fail-closed guard in BuildRecipeResult and BuildRecipeResultWithEvaluator. It checks whether all matching overlays for the requested service+accelerator are OS-gated and returns ErrCodeInvalidRequest with the available OS values when --os is omitted. Fixes #1534.

Behavior:

  • aicr recipe --service oke --accelerator l40s --intent trainingservice 'oke' requires --os; available values: ol
  • aicr recipe --service oke --accelerator gb200 --intent training → succeeds (OS-agnostic overlay exists)
  • aicr recipe --service gke --accelerator h100 --intent trainingservice 'gke' requires --os; available values: cos

2. cert-manager disabled for all OKE
Restored AICR-managed cert-manager on OKE. Oracle's cert-manager addon is optional and not deployed by default on new OKE clusters; AICR installs it automatically as it does for other services.

3. helm-diff install fails under Helm 4
Switched from the helm plugin install <git-url> --version form (broken on Helm 4) to installing from the signed release tarball, which is the upstream-documented Helm 4 method. Version is read from .settings.yaml.

Still open

UAT evidence repo renameinject_push_target in both aws/run and gcp/run already uses the canonical ghcr.io/nvidia/aicr-evidence/<recipe>:run-${RUN_ID} stable-repo + per-run tag scheme from #1478/#1479. No change from main on this path.

Rekor runbookdocs/contributor/maintaining.md already has the ## Release Supply-Chain Monitoring section with the full runbook (lines 52–). It was restored in the rebase.

Minor / drift

  • docs/contributor/index.md:182 pkg/recipe/oskind enumeration — ol is already present: ubuntu, rhel, cos, amazonlinux, ol, talos.
  • TestNFDTopologyUpdater_OverlayCoverage — added l40s-oke-training and l40s-oke-inference entries (CodeRabbit catch).
  • oci:// regression test — added in TestFromMeasurements_ServiceDetection.
  • Fingerprint test describing L40S as unsupported — fixed.

mchmarny
mchmarny previously approved these changes Jun 29, 2026
@mchmarny mchmarny enabled auto-merge (squash) June 29, 2026 19:46

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: requesting changes (round 4)

The earlier missing-OS L40S case, cert-manager, provider normalization, restored OKE parent identities, and the documentation issues are fixed. But tracing actual resolution output (not just structure) surfaces new regressions from the oke-ol* parallel chain + the OS guard.

Blocking

  1. --os ol weakens the GB200 K8s floor — the generic oke-ol-training (>= 1.30) overrides gb200-oke-training (>= 1.34), dropping the GB200 DRA/NVLS prerequisite (inline on recipes/overlays/oke-ol-training.yaml).
  2. The signed-tarball helm-diff install may fail on a clean Helm 4 runner without an imported keyring (inline on tests/uat/aws/run) — please confirm on a fresh runner.
  3. The OS guard now rejects valid OS-agnostic OKE queries because the generic oke-ol* overlays contribute ol for every accelerator (inline on pkg/recipe/metadata_store.go).

Should fix

  1. The guard's success shortcut ignores platform/intent, so a platform query can still pass with a partial recipe (inline on pkg/recipe/metadata_store.go).
  2. The guard violates the public OS-optional contract and leaks CLI --os wording into API/SDK errors (inline on pkg/recipe/metadata_store.go).

Root cause for 1/3/4: oke-ol* duplicate their parents' content as a parallel generic chain. Making them thin descendants that add only OL-specifics (no re-declared K8s.server.version, and not generic-accelerator) would resolve 1 and 3 together.

Comment thread recipes/overlays/oke-ol-training.yaml
Comment thread tests/uat/aws/run
# Install from the signed release tarball (helm-diff's recommended Helm 4
# method). The git-URL form requires --verify=false under Helm 4; the
# tarball form works without it and carries the upstream release signature.
helm plugin install "https://github.com/databus23/helm-diff/releases/download/${HELM_DIFF_VERSION}/helm-diff-linux-amd64.tgz"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to the signed release tarball is the right direction, but on a clean Helm 4 runner this may fail provenance verification without the publisher's keyring — reproduced as Error: plugin verification failed: .../.gnupg/pubring.gpg: no such file or directory. helm-diff's Helm 4 instructions import the trusted release key, export an OpenPGP keyring, and pass --keyring. Please either import+--keyring the key, or verify a pinned checksum, and confirm the install succeeds on a fresh AWS/GCP runner (the identical GCP block has the same path).

Comment thread pkg/recipe/metadata_store.go
if overlayAccel != accel {
continue
}
return nil // service+accelerator-specific overlay matched — no error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The success shortcut matches on service + accelerator only, ignoring intent and platform. A platform query (e.g. GB200 OKE Kubeflow) without --os can therefore pass this check and exit 0 with a recipe that never applies the platform overlay — a partial recipe rather than a fail-closed error. Include the requested platform/intent in the success criterion.

Comment thread pkg/recipe/metadata_store.go Outdated
}
if available := s.availableOSForCriteria(criteria); len(available) > 0 {
return aicrerrors.New(aicrerrors.ErrCodeInvalidRequest,
fmt.Sprintf("service '%s' requires --os; available values: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is returned by the shared MetadataStore used by both the CLI and the HTTP server / SDK, so: (a) it violates the public OS-optional contract — RecipeRequest.OS godoc (types.go:224) says empty just won't select OS-pinned leaves, and the OpenAPI os param documents omission as any, yet some OKE/GKE requests now return HTTP 400; and (b) it emits CLI-specific --os flag wording to API/SDK callers who have no such flag. Use neutral wording (e.g. "os is required for service '%s'; valid values: …") and reconcile the contract docs, or scope the requirement so it doesn't 400 documented-optional requests.

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The blocking GB200 K8s-floor regression is fixed — verified by building this head and resolving --service oke --accelerator gb200 --intent training --os ol, which now yields K8s.server.version >= 1.34 (L40S correctly stays >= 1.30, and the no-OS GB200 query fail-closes).

Three non-blocking follow-ups remain (fine as separate issues, not gating this PR):

  • Confirm the helm-diff signed-tarball install succeeds on a clean Helm 4 runner (may need an imported keyring) — CI will surface it if not.
  • The OS-required guard's success check keys on service+accelerator only, so a platform/intent query can still pass with a partial recipe.
  • The guard returns CLI-flavored --os wording via the shared MetadataStore used by the HTTP server/SDK, and returns HTTP 400 for requests the OpenAPI/godoc still document as OS-optional — reconcile wording/contract.

@yuanchen8911

Copy link
Copy Markdown
Contributor

Approved. Three non-blocking follow-ups from the review, suitable as separate issues rather than gating this PR:

  1. Confirm the helm-diff signed-tarball install succeeds on a clean Helm 4 runner — the tarball install may require an imported keyring for provenance verification. CI will surface it if it fails; a pinned sha256 checksum is a simpler alternative to maintaining a keyring. (The version pin from .settings.yaml testing_tools.helm_diff is correct and stays.)

  2. The OS-required guard's success check keys on service + accelerator only, ignoring platform/intent — so a platform query can pass the guard and still resolve to a partial recipe that omits the platform overlay.

  3. The guard returns CLI-flavored --os wording via the shared MetadataStore used by the HTTP server and SDK, and returns HTTP 400 for requests the OpenAPI spec and RecipeRequest.OS godoc still document as OS-optional. Reconcile the error wording and the contract.

@atif1996

Copy link
Copy Markdown
Contributor Author

Round 4 response

Overlay structure (primary fixes — pushed in latest commit):

The root cause was that the new oke-ol / oke-ol-training / oke-ol-inference overlays were accelerator-generic duplicates of the OS-agnostic oke* parents, both sets running as siblings. With --os ol, oke-ol-training co-matched every accelerator as a second maximal leaf alongside the accelerator-specific leaf, causing RecipeMetadataSpec.Merge's last-writer-wins dedup to clobber the GB200 >= 1.34 floor with oke-ol-training's >= 1.30.

Fix: deleted the OS-agnostic oke / oke-training / oke-inference parents. The oke-ol* overlays are now the only OKE parent tier. All OKE leaves (gb200-oke-training, a100-oke-training, gb200-oke-inference, l40s-oke-*) carry os: ol and base on oke-ol*. Ubuntu leaves are unchanged (os: ubuntu, base on the accelerator-specific OL leaves). This means oke + any-accelerator without --os now returns a clear error listing the valid OS values — consistent with the intent that all OKE deployments are OL or Ubuntu.

The guard (requireOSIfNeeded) needed no logic change — the structural fix eliminates both the floor clobber and the over-rejection for OS-agnostic accelerators. The --os CLI-specific wording in the error message was neutralized to work for both CLI and HTTP callers, and the RecipeRequest.OS godoc + OpenAPI os param description now document that some service+accelerator combinations require an explicit OS.

Verification:

  • aicr recipe --service oke --accelerator gb200 --intent training --os olK8s.server.version: >= 1.34
  • aicr recipe --service oke --accelerator gb200 --intent training (no OS) → INVALID_REQUEST: specify an OS (valid: ol, ubuntu)
  • aicr recipe --service oke --accelerator l40s --intent training (no OS) → INVALID_REQUEST: specify an OS (valid: ol)
  • aicr recipe --service eks --accelerator h100 --intent training (no OS) → resolves (no regression) ✓
  • All tests pass: go test -race ./...; make qualify green

helm-diff provenance: installing from the signed tarball is the Helm 4 recommended path (the git-URL form requires --verify=false). Checksum pinning will be tracked in a follow-up.

OCI UAT workflow (tests/uat/oci/run): the runner documents the expected invocation pattern; the .github/workflows/uat-oci.yaml integration is out of scope for this PR and will follow once the cluster is provisioned.

Replied on all inline threads.

@mchmarny mchmarny merged commit c7f4373 into main Jun 29, 2026
11 of 12 checks passed
@mchmarny mchmarny deleted the feat/oke-l40s-support branch June 29, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

4 participants