feat(recipe): add L40S accelerator support#1510
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds L40S as a supported GPU accelerator and Oracle Linux ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.claude/skills/aicr-analyzing-snapshots/SKILL.md (1)
152-158: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winResolve the merge conflict before merge.
The conflict markers make this Markdown invalid, and the branch version also drops
ocp, which is still part of the supported service enum inapi/aicr/v1/server.yaml. Keep this table in sync with the API contract.Proposed cleanup
-<<<<<<< HEAD -| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` | -| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` | -======= -| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `kind`, `lke`, `bcm` | -| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` | ->>>>>>> 6c7ce4c2 (feat(recipe): add L40S accelerator support (OKE overlays + fingerprint fix)) +| `service` | `CriteriaServiceType` | `any` or empty | `eks`, `gke`, `aks`, `oke`, `ocp`, `kind`, `lke`, `bcm` | +| `accelerator` | `CriteriaAcceleratorType` | `any` or empty | `h100`, `h200`, `gb200`, `b200`, `a100`, `l40`, `l40s`, `rtx-pro-6000` |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/aicr-analyzing-snapshots/SKILL.md around lines 152 - 158, Remove the merge conflict markers from the Markdown table in the snapshot analysis skill document and reconcile the branch/base content into valid table rows. While fixing the table, make sure the supported service list stays aligned with the service enum in api/aicr/v1/server.yaml by preserving the ocp entry alongside the other supported services.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/recipe.md`:
- Around line 152-158: Resolve the merge conflict in the criteria table by
removing the conflict markers and merging both valid updates into a single
Markdown row set. Keep the `service` values aligned with the API contract by
preserving `ocp`, and keep the `accelerator` values aligned with the latest
support by including `l40s`; verify the `CriteriaServiceType` and
`CriteriaAcceleratorType` entries remain consistent with the documented schema.
In `@docs/user/api-reference.md`:
- Around line 121-127: Resolve the merge conflict in the API reference table by
removing the conflict markers and choosing the entries that match the API
schema. In the table for the documented accelerator/service fields, keep the
`service` list aligned with `api/aicr/v1/server.yaml` by retaining `ocp`, and
keep the `accelerator` list consistent with the supported values by including
`l40s` from the feature branch. Use the `service` and `accelerator` rows in the
affected table as the anchors when updating the docs.
In `@docs/user/cli-reference.md`:
- Around line 389-395: Resolve the merge conflict in the CLI reference table by
removing the conflict markers and merging both intended updates. Keep
`--service` aligned with `api/aicr/v1/server.yaml` by preserving `ocp`, and
merge the `--accelerator` values so the list includes the new `l40s` entry
alongside the existing GPU types. Verify the final table formatting in
`docs/user/cli-reference.md` remains valid after the conflict is removed.
In `@pkg/recipe/doc.go`:
- Line 29: The godoc in doc.go is inconsistent because the detailed accelerator
list under the Criteria types section still omits CriteriaAcceleratorL40S even
though the summary comment already mentions l40s. Update the comment block
around the CriteriaAcceleratorType / CriteriaAcceleratorL40 bullet list to
include CriteriaAcceleratorL40S alongside CriteriaAcceleratorL40, keeping the
documented accelerator set aligned with the API changes.
In `@tests/uat/oci/run`:
- Around line 125-130: The helm-diff installation step is unpinned, so UAT runs
can change when upstream releases change. Update the install logic in the run
script’s helm-diff plugin block to request a specific helm-diff version instead
of the default latest release, and apply the same version pinning pattern in the
matching helm-diff install blocks in tests/uat/aws/run and tests/uat/gcp/run so
all UAT environments behave consistently.
In `@tests/uat/oci/tests/cuj1-training/assert-recipe.yaml`:
- Around line 36-38: Update the CUJ1 training assertion fixture to verify the
OKE cert-manager addon is explicitly disabled, not just present in
componentRefs. In assert-recipe.yaml, extend the existing cert-manager check so
the rendered recipe/manifest is validated against the disabled state for
cert-manager, using the cert-manager and componentRefs assertions together to
catch any regression that re-enables the managed addon.
In `@tests/uat/oci/tests/l40s-training-config.yaml`:
- Around line 77-81: The validate agent block only sets tolerations, so the
validation Jobs can still schedule onto the system pool; update
spec.validate.agent to also mirror the GPU node selector used by
spec.snapshot.agent and keep the existing namespace/tolerations in place. Use
the agent stanza under spec.validate as the target and align it with the
nodeSelector settings from spec.snapshot.agent so validator jobs are constrained
to the L40S GPU pool.
---
Outside diff comments:
In @.claude/skills/aicr-analyzing-snapshots/SKILL.md:
- Around line 152-158: Remove the merge conflict markers from the Markdown table
in the snapshot analysis skill document and reconcile the branch/base content
into valid table rows. While fixing the table, make sure the supported service
list stays aligned with the service enum in api/aicr/v1/server.yaml by
preserving the ocp entry alongside the other supported services.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 78c5e234-83dd-4d27-947f-8a1c6c1d0054
📒 Files selected for processing (31)
.claude/skills/aicr-analyzing-snapshots/SKILL.md.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/contributor/recipe.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/cli/recipe.gopkg/client/v1/types.gopkg/fingerprint/doc.gopkg/fingerprint/from_measurements_test.gopkg/fingerprint/gpu_sku.gopkg/fingerprint/gpu_sku_test.gopkg/fingerprint/types.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/recipe/yaml_test.gorecipes/overlays/l40s-any.yamlrecipes/overlays/l40s-oke-inference.yamlrecipes/overlays/l40s-oke-training.yamlrecipes/overlays/l40s-oke-ubuntu-inference.yamlrecipes/overlays/l40s-oke-ubuntu-training-kubeflow.yamlrecipes/overlays/l40s-oke-ubuntu-training.yamlrecipes/overlays/oke.yamltests/uat/oci/cluster-config.yamltests/uat/oci/runtests/uat/oci/tests/cuj1-training/assert-recipe.yamltests/uat/oci/tests/cuj1-training/assert-validate-deployment.yamltests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yamltests/uat/oci/tests/cuj1-training/chainsaw-test.yamltests/uat/oci/tests/l40s-training-config.yaml
8c4e842 to
ee608da
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/recipe/metadata_test.go (1)
2193-2210: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd the new L40S/OKE leaves to this coverage table.
This PR adds
l40s-oke-trainingandl40s-oke-inference, butTestNFDTopologyUpdater_OverlayCoveragestill doesn't exercise them. A topology-updater regression on the new OKE/L40S inheritance path would currently pass unnoticed.Suggested additions
+ {"l40s-oke-training", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentTraining, ""}, true}, + {"l40s-oke-inference", criteria{CriteriaServiceOKE, CriteriaAcceleratorL40S, CriteriaOSOracleLinux, CriteriaIntentInference, ""}, true},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/recipe/metadata_test.go` around lines 2193 - 2210, TestNFDTopologyUpdater_OverlayCoverage is missing the new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and l40s-oke-inference leaves in the existing criteria table. Update the table in metadata_test.go alongside the other real-cluster GPU leaves so the overlay inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the training/inference intents is exercised and regressions in the new leaf coverage are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/uat/oci/tests/cuj1-training/chainsaw-test.yaml`:
- Around line 21-29: Update the CUJ1 Chainsaw UAT to exercise the new Oracle
Linux OKE path instead of the old Ubuntu selector. In the test spec’s
descriptive text and any recipe/bundle generation steps that still reference
`ubuntu`, change them to `ol` and ensure the `generate recipe`/`generate bundle`
flow explicitly passes `--os ol` so it matches the new OKE overlays.
Double-check the CUJ1 scenario steps and any related assertions in
`chainsaw-test.yaml` so the test validates the OKE/L40S/training path with
Oracle Linux throughout.
---
Outside diff comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 2193-2210: TestNFDTopologyUpdater_OverlayCoverage is missing the
new OKE/L40S cases, so add coverage entries for the new l40s-oke-training and
l40s-oke-inference leaves in the existing criteria table. Update the table in
metadata_test.go alongside the other real-cluster GPU leaves so the overlay
inheritance path for CriteriaServiceOKE with CriteriaAcceleratorL40S and the
training/inference intents is exercised and regressions in the new leaf coverage
are caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 470dc7bb-f5ae-44d8-a673-3e9ab22d8098
📒 Files selected for processing (38)
.claude/skills/aicr-analyzing-snapshots/SKILL.md.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/contributor/recipe.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/cli/recipe.gopkg/cli/snapshot.gopkg/client/v1/types.gopkg/fingerprint/doc.gopkg/fingerprint/from_measurements.gopkg/fingerprint/from_measurements_test.gopkg/fingerprint/gpu_sku.gopkg/fingerprint/gpu_sku_test.gopkg/fingerprint/types.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/recipe/metadata_test.gopkg/recipe/oskind/oskind.gopkg/recipe/oskind/oskind_test.gopkg/recipe/yaml_test.gorecipes/overlays/a100-oke-training.yamlrecipes/overlays/gb200-oke-inference.yamlrecipes/overlays/gb200-oke-training.yamlrecipes/overlays/l40s-any.yamlrecipes/overlays/l40s-oke-inference.yamlrecipes/overlays/l40s-oke-training.yamlrecipes/overlays/oke-inference.yamlrecipes/overlays/oke-training.yamlrecipes/overlays/oke.yamltests/uat/oci/cluster-config.yamltests/uat/oci/runtests/uat/oci/tests/cuj1-training/assert-recipe.yamltests/uat/oci/tests/cuj1-training/assert-validate-deployment.yamltests/uat/oci/tests/cuj1-training/assert-validate-multiphase.yamltests/uat/oci/tests/cuj1-training/chainsaw-test.yamltests/uat/oci/tests/l40s-training-config.yaml
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/fingerprint/from_measurements.go`:
- Around line 323-333: Add a regression test for the raw-OCID compatibility path
in TestFromMeasurements_ServiceDetection so normalizeProviderID is covered for
older measurements that still carry a bare ocid1.* providerID. Extend the
existing table-driven cases with a raw OCI OCID input and assert it resolves to
the oke service type, alongside the already-normalized provider name cases. Use
the normalizeProviderID helper and the TestFromMeasurements_ServiceDetection
table as the main anchors when updating the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a34e66a6-5302-453d-865f-b0296bd697b9
📒 Files selected for processing (3)
pkg/collector/k8s/node.gopkg/collector/k8s/node_test.gopkg/fingerprint/from_measurements.go
mchmarny
left a comment
There was a problem hiding this comment.
The L40S + ol OS enum core is well done — criteria/oskind/fingerprint (correct L40S-before-L40 SKU ordering), OpenAPI, CLI, and thorough table-driven tests all line up, and the OKE-requires-ol model mirrors the GKE-cos precedent cleanly.
The blocker is that the diff carries three changes unrelated to L40S that look like collateral from a stale/conflicted branch (mergeable_state is dirty):
- Deletes the Rekor supply-chain monitor (
.github/workflows/rekor-monitor.yaml, added in #1480) plus its docs — silently removes a security control. - Reverts #1495 by re-adding the single-shot
verifyClusterPolicyReadycheck that was removed for flaking the deployment gate. - New overlays use the pre-#1499 apiVersion
aicr.nvidia.com/v1alpha1instead of theaicr.run/v1alpha2every other overlay uses.
Please rebase cleanly onto main so these three drop out, then keep only the L40S/ol changes. Two smaller items: the risk assessment labels the OKE os: ol requirement "non-breaking/additive," but it's a behavior change (OKE queries without --os now return no match — worth restating as such); and the unrelated UAT evidence-repo rename (aws/gcp) should ideally land separately. One doc nit inline. Happy to re-review once it's rebased.
|
@atif1996 this PR now has merge conflicts with |
yuanchen8911
left a comment
There was a problem hiding this comment.
Cross-review: requesting changes
This PR rolls back three recently-merged changes and is incompatible with the current main API version. The rollbacks are reversed in the branch's own commits (its parent already contains #1495, #1480, and #1479), so a plain rebase will not restore them.
Blocking
-
Re-adds the single-shot
verifyClusterPolicyReadydeployment check that #1495 removed. It samplesClusterPolicy status.stateonce with no retry, while the chainsaw checkvalidate-cluster-policy-readyalready polls the same state for ~5m. On a freshly-installed gpu-operator the operator-validator init containers take ~2.5m to reconcile, so the Go check records a false failure before chainsaw polls ready — re-introducing the deployment-gate flake #1495 fixed. Drop the Go re-add; keep the polling assertion as the sole source. -
All five newly added AICR documents use
apiVersion: aicr.nvidia.com/v1alpha1, which currentmainrejects (expected "aicr.run/v1alpha2"). This failsTestAllMetadataFilesHaveRequiredFieldsand CLI validation. -
Deletes
.github/workflows/rekor-monitor.yamland its runbook indocs/contributor/maintaining.md(added in #1480) with no replacement, dropping hourly Rekor transparency-log consistency + release-identity monitoring. This appears out of scope for an L40S feature PR — please restore it unless the removal is intentional and tracked elsewhere. -
The UAT evidence push target reverts the #1478/#1479 stable-repo +
:run-${RUN_ID}tagging scheme, creating a new untagged GHCR repository per run. Two effects: (a) the configured reference now lacks an explicit tag, so the summary no longer identifies the bundle — main's #1498 digest-based summary must be preserved through the rebase to fix that; and (b) per-run repositories proliferate regardless, which the digest summary does not address.
Non-blocking (both genuine medium-severity defects)
-
parseProviderdrops theoci://→okemapping in favor of raw-OCID detection only; anoci://…providerID now resolves toservice: any. Keep the change additive. -
The OCI UAT runner references
.github/workflows/uat-oci.yaml, which the PR never adds, and nothing else invokestests/uat/oci, so the advertised OKE L40S UAT runs in no CI pipeline.
Doc/enum drift not attached inline (outside modified hunks): docs/contributor/index.md:182 — the pkg/recipe/oskind "single source of truth" enumeration omits ol.
|
All four findings from this review are addressed in d0953bc:
Tests pass ( |
- Add l40s-oke-training and l40s-oke-inference overlays for OKE - Add `ol` OS enum (Oracle Linux, maps ID=ol from /etc/os-release) - Require explicit `os: ol` on all non-ubuntu OKE overlays; mirrors GKE/COS model — no `os: any` fallback (query must specify OS) - Disable cert-manager in the OKE base overlay via enabled: false - Add fingerprint normalization for Oracle Linux (normalizeOSID) - Update OpenAPI spec, CLI flags, and all doc enum lists with `ol` - Add OCI OKE L40S training UAT test suite Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
OKE nodes set spec.providerID to a bare OCID ("ocid1.instance.oc1...")
with no scheme prefix, so the existing "oci://" split never matched and
the raw OCID leaked into the service fingerprint.
Fix in two layers:
- pkg/collector/k8s: parseProvider checks strings.HasPrefix("ocid1.")
before the "://" split; removes the unreachable "oci" case.
- pkg/fingerprint: normalizeProviderID applied at read time so snapshots
produced by older agent images are also corrected.
Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
- chainsaw-test.yaml: switch --os ubuntu → --os ol (OKE now requires explicit OS; ubuntu matched nothing after this PR) - assert-recipe.yaml: assert cert-manager overrides.enabled=false (key OKE behavior was present but unverified) - l40s-training-config.yaml: add GPU nodeSelector to validate agent (mirrors snapshot agent; prevents landing on system-cpu pool) - from_measurements_test.go: add raw OCID case to ServiceDetection test (normalizeProviderID backward-compat path had no coverage) Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
- Restore .github/workflows/rekor-monitor.yaml lost during squash (was added in #1480, absent from stale branch base) - Remove verifyClusterPolicyReady call and function reverted by squash (#1495 had removed it; stale branch base brought it back) - Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499 domain migration) - Add `ol` to OS enum description in pkg/client/v1/types.go and pkg/fingerprint/doc.go Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
- Restore `case "oci": return "oke"` in parseProvider alongside the new
ocid1. prefix check — both formats must map to oke
- Fix inject_push_target in aws/run, gcp/run, oci/run to use the
:run-${RUN_ID} tag scheme (#1478/#1479), not the repo-suffix approach
- Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in
l40s-training-config.yaml and assert-recipe.yaml
- Add `ol` to OSDimension comment in pkg/fingerprint/types.go
- Add `ol` and `talos` to Supported values comment in pkg/client/v1/types.go
Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
Services like GKE and OKE whose overlays all carry an explicit OS criterion (os: cos, os: ol) silently returned a degenerate base-only recipe with exit 0 when --os was omitted. The resolver logged a slog.Warn but callers had no way to detect the problem. Add requireOSIfNeeded, called in both BuildRecipeResult and BuildRecipeResultWithEvaluator immediately after FindMatchingOverlays. The guard returns ErrCodeInvalidRequest (with the available OS values) when a specific service was requested, no OS was supplied, no service-specific overlay matched, and at least one OS-gated overlay exists for that service. Services with OS-agnostic overlays (EKS) and unknown services pass through unchanged. Add TestBuildRecipeResult_OSRequired and TestBuildRecipeResultWithEvaluator_OSRequired covering GKE without os (now errors), GKE with os: cos (succeeds), EKS without os (succeeds), and unknown service without os (no error). Fix TestRecipeEndpointPOST/valid_YAML_body: the GKE test body was missing os: cos and would now return 400 instead of 200. Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
mchmarny
left a comment
There was a problem hiding this comment.
Approving — all four outstanding review concerns are addressed at a8c2342:
- OKE no-OS query / partial recipe:
oke.yamlis restored as an OS-agnostic parent (service: oke, noos:field), with the newoke-ol*Oracle Linux variants living alongside it rather than replacing it. A no---osOKE query matches the full overlay chain again. - Base-name breakage:
oke,oke-training, andoke-inferencebase names are back, so external--dataoverlays referencing them continue to resolve. helm-diffsupply chain: bothtests/uat/aws/runandtests/uat/gcp/runnow install from the signed release tarball;--verify=falseis gone.- Stale comment: the
verifyClusterPolicyReadyreference invalidators/deployment/expected_resources.gowas removed.
CI is green and the refactor (keep OS-agnostic parents + add oke-ol* variants) is the right shape.
Two non-blocking follow-ups before merge:
requireOSIfNeeded returns ErrCodeInvalidRequest when a query omits --os but all matching overlays for that service+accelerator combination are OS-gated. Services with OS-agnostic overlays (EKS, OKE GB200/A100) remain unaffected; only L40S OKE and GKE (which are purely OS-specific) now surface a clear error. availableOSForCriteria uses c.Matches (full criteria check including intent, accelerator) to build the list of valid OS values shown in the error message, avoiding false positives for unrelated intent/accelerator combinations. Fixes: #1534 Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
l40s-oke-training and l40s-oke-inference were missing from TestNFDTopologyUpdater_OverlayCoverage; a topology-updater regression on the new OKE/L40S inheritance path would have passed unnoticed. Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
Round-3 responseThanks for the detailed review, @yuanchen8911. All blocking and open items are addressed: Blocking1. No- Behavior:
2. cert-manager disabled for all OKE 3. helm-diff install fails under Helm 4 Still openUAT evidence repo rename — Rekor runbook — Minor / drift
|
yuanchen8911
left a comment
There was a problem hiding this comment.
Re-review: requesting changes (round 4)
The earlier missing-OS L40S case, cert-manager, provider normalization, restored OKE parent identities, and the documentation issues are fixed. But tracing actual resolution output (not just structure) surfaces new regressions from the oke-ol* parallel chain + the OS guard.
Blocking
--os olweakens the GB200 K8s floor — the genericoke-ol-training(>= 1.30) overridesgb200-oke-training(>= 1.34), dropping the GB200 DRA/NVLS prerequisite (inline onrecipes/overlays/oke-ol-training.yaml).- The signed-tarball helm-diff install may fail on a clean Helm 4 runner without an imported keyring (inline on
tests/uat/aws/run) — please confirm on a fresh runner. - The OS guard now rejects valid OS-agnostic OKE queries because the generic
oke-ol*overlays contributeolfor every accelerator (inline onpkg/recipe/metadata_store.go).
Should fix
- The guard's success shortcut ignores
platform/intent, so a platform query can still pass with a partial recipe (inline onpkg/recipe/metadata_store.go). - The guard violates the public OS-optional contract and leaks CLI
--oswording into API/SDK errors (inline onpkg/recipe/metadata_store.go).
Root cause for 1/3/4: oke-ol* duplicate their parents' content as a parallel generic chain. Making them thin descendants that add only OL-specifics (no re-declared K8s.server.version, and not generic-accelerator) would resolve 1 and 3 together.
| # Install from the signed release tarball (helm-diff's recommended Helm 4 | ||
| # method). The git-URL form requires --verify=false under Helm 4; the | ||
| # tarball form works without it and carries the upstream release signature. | ||
| helm plugin install "https://github.com/databus23/helm-diff/releases/download/${HELM_DIFF_VERSION}/helm-diff-linux-amd64.tgz" |
There was a problem hiding this comment.
Switching to the signed release tarball is the right direction, but on a clean Helm 4 runner this may fail provenance verification without the publisher's keyring — reproduced as Error: plugin verification failed: .../.gnupg/pubring.gpg: no such file or directory. helm-diff's Helm 4 instructions import the trusted release key, export an OpenPGP keyring, and pass --keyring. Please either import+--keyring the key, or verify a pinned checksum, and confirm the install succeeds on a fresh AWS/GCP runner (the identical GCP block has the same path).
| if overlayAccel != accel { | ||
| continue | ||
| } | ||
| return nil // service+accelerator-specific overlay matched — no error |
There was a problem hiding this comment.
The success shortcut matches on service + accelerator only, ignoring intent and platform. A platform query (e.g. GB200 OKE Kubeflow) without --os can therefore pass this check and exit 0 with a recipe that never applies the platform overlay — a partial recipe rather than a fail-closed error. Include the requested platform/intent in the success criterion.
| } | ||
| if available := s.availableOSForCriteria(criteria); len(available) > 0 { | ||
| return aicrerrors.New(aicrerrors.ErrCodeInvalidRequest, | ||
| fmt.Sprintf("service '%s' requires --os; available values: %s", |
There was a problem hiding this comment.
This error is returned by the shared MetadataStore used by both the CLI and the HTTP server / SDK, so: (a) it violates the public OS-optional contract — RecipeRequest.OS godoc (types.go:224) says empty just won't select OS-pinned leaves, and the OpenAPI os param documents omission as any, yet some OKE/GKE requests now return HTTP 400; and (b) it emits CLI-specific --os flag wording to API/SDK callers who have no such flag. Use neutral wording (e.g. "os is required for service '%s'; valid values: …") and reconcile the contract docs, or scope the requirement so it doesn't 400 documented-optional requests.
Signed-off-by: Atif Mahmood <[email protected]>
yuanchen8911
left a comment
There was a problem hiding this comment.
Approving. The blocking GB200 K8s-floor regression is fixed — verified by building this head and resolving --service oke --accelerator gb200 --intent training --os ol, which now yields K8s.server.version >= 1.34 (L40S correctly stays >= 1.30, and the no-OS GB200 query fail-closes).
Three non-blocking follow-ups remain (fine as separate issues, not gating this PR):
- Confirm the helm-diff signed-tarball install succeeds on a clean Helm 4 runner (may need an imported keyring) — CI will surface it if not.
- The OS-required guard's success check keys on service+accelerator only, so a platform/intent query can still pass with a partial recipe.
- The guard returns CLI-flavored
--oswording via the shared MetadataStore used by the HTTP server/SDK, and returns HTTP 400 for requests the OpenAPI/godoc still document as OS-optional — reconcile wording/contract.
|
Approved. Three non-blocking follow-ups from the review, suitable as separate issues rather than gating this PR:
|
Round 4 responseOverlay structure (primary fixes — pushed in latest commit): The root cause was that the new Fix: deleted the OS-agnostic The guard ( Verification:
helm-diff provenance: installing from the signed tarball is the Helm 4 recommended path (the git-URL form requires OCI UAT workflow ( Replied on all inline threads. |
Summary
Add L40S as a supported accelerator for OKE and introduce `ol` as the Oracle Linux OS enum. Restore cert-manager to its default AICR-managed installation for OKE. Add a fail-closed guard that returns `ErrCodeInvalidRequest` when a service+accelerator combination requires an explicit `--os` but none was supplied.
Motivation / Context
Three gaps addressed:
Fixes: #1003
Fixes: #1517
Fixes: #1534
Type of Change
Component(s) Affected
Implementation Notes
`ol` OS enum: Oracle Linux reports `ID=ol` in `/etc/os-release`; mirrors `cos` (short OS name). Added to `oskind`, `criteria.go`, `normalizeOSID()`, OpenAPI spec, CLI flags, and all doc enum lists.
OKE overlay structure: OS-agnostic parents (`oke`, `oke-training`, `oke-inference`) are preserved for existing overlays (GB200, A100) and external `base:` references. Oracle Linux-specific parents (`oke-ol`, `oke-ol-training`, `oke-ol-inference`) are added for L40S OKE overlays, mirroring the GKE `gke-cos` pattern. Only `l40s-oke-{training,inference}` carry `os: ol`; GB200/A100 OKE overlays remain OS-agnostic.
`l40s-any.yaml`: Cross-cutting deployment-phase floor (4 checks + gpu-operator `>= v24.6.0`), mirroring `a100-any.yaml`.
cert-manager: Restored to AICR-managed default for OKE. Oracle's cert-manager addon is optional and not deployed by default; AICR installs it automatically as it does on other services.
`driver.enabled: false` / `toolkit.enabled: false` in `values-oke.yaml` — Oracle Linux Gen2 GPU OKE image pre-installs both at the OS level.
Fail-closed OS guard (`requireOSIfNeeded`): called after `FindMatchingOverlays` in both `BuildRecipeResult` and `BuildRecipeResultWithEvaluator`. Returns `ErrCodeInvalidRequest` with available OS values when a service+accelerator combination has only OS-gated overlays and `--os` was omitted. Services with OS-agnostic overlays (EKS, OKE GB200/A100) are unaffected.
Fingerprint resilience: `normalizeProviderID` now handles all three OKE providerID forms: bare OCID (`ocid1.instance…`), old collector string (`"oci"`), and full scheme URL (`"oci://ocid1…"`).
Testing
Verified locally:
Risk Assessment
Rollout notes: N/A
Checklist