Add Slinky slurm-operator as platform-slurm#866
Conversation
|
Welcome to AICR, @faganihajizada! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
This comment was marked as resolved.
This comment was marked as resolved.
…slurm Adds the Slinky Slurm operator (chart v1.1.0 from oci://ghcr.io/slinkyproject/charts) as a first-class AICR platform under `--platform slurm`. Ships two components: `slinky-slurm-operator-crds` and `slinky-slurm-operator`, composed by a new `platform-slurm` mixin. Operator depends on cert-manager and the CRDs. Mirrors the operator-only pattern of `--platform dynamo` and `--platform kubeflow`: AICR ships only the platform; cluster-instance CRs (Controller, LoginSet, NodeSet, ...) remain user-authored. Leaf overlays (CSP x accelerator x OS) follow in a separate PR. Notable: chart v1.1.0 silently ignores operator.nodeSelector and webhook.nodeSelector. The registry comment links SlinkyProject/slurm-operator#187 with the verbatim follow-up action once that upstream PR merges and a new chart is published. The `slurm` cluster chart is intentionally deferred to a follow-up PR with its own short alias (e.g. `slurm-cluster`) so `slurm` continues to route to the operator. Verified: make qualify (test-coverage + lint + e2e + scan + license-check) PASS; make check-health COMPONENT=slinky-slurm-operator against a live Kind cluster + chart v1.1.0 PASS in 5.42s.
e5c11c9 to
af1d8b9
Compare
|
@faganihajizada welcome to AICR project, I'm super excited about adding Skinky. CodeRabbit can get little annoying on draft PRs but it's good to get early signals. Let us know if you want to walk through the approach or just go straight to review. Re |
Thank you @mchmarny I checked coderabbit reviews and made separate PR. This is ready for review.
yes 💯 |
The OpenAPI server spec at api/aicr/v1/server.yaml has the platform enum duplicated at 3 sites (operation parameters and Criteria schema). The TestOpenAPIEnumsMatchGoTypes test enforces these stay in sync with recipe.GetCriteriaPlatformTypes(). Adding 'slurm' to the Go enum without updating the spec broke that test. Add 'slurm' to all 3 enum sites alphabetically (between 'nim' and 'any').
3b04bba to
4e9c496
Compare
|
@faganihajizada this PR now has merge conflicts with |
mchmarny
left a comment
There was a problem hiding this comment.
Clean additive PR — registry/mixin shape mirrors the existing dynamo / kubeflow precedent, the upstream nodeSelector limitation is documented at the registry entry rather than buried in a separate note, and the empty-base scoping is intentional and well-explained.
One medium item on the docs side: the platform enum was added to pkg/recipe/criteria.go, server.yaml, and component-catalog.md, but docs/user/cli-reference.md and docs/user/api-reference.md (plus their site/docs/ mirrors) still hardcode the old dynamo, kubeflow, nim list. Four-line fix; should land in this PR per the doc-update rule for user-visible behavior. Two minor nits on values comments and the health-check vacuous-pass dependency — both optional.
Branch is also marked needs-rebase against main. None of this blocks once the doc gap is closed and the rebase is done.
# Conflicts: # docs/user/component-catalog.md
- docs/user/cli-reference.md, docs/user/api-reference.md: add `slurm` to the platform enum lists. The Go enum, OpenAPI spec, and component catalog were updated when slurm was added; these two reference pages drifted because they hardcode the enum. - recipes/registry.yaml (slinky-slurm-operator): document above defaultVersion that bumping the chart pin requires re-verifying the defaults baked into components/slinky-slurm-operator/values.yaml, since the values file mirrors chart defaults explicitly. Without this comment, an upstream default flip (e.g. certManager.enabled flipping from true to false) would pass through unnoticed on the next bump. - recipes/checks/slinky-slurm-operator/health-check.yaml: explain why validate-all-pods-healthy is not a vacuous pass on an empty namespace. Chainsaw `error` assertions pass when no matching resource exists, but the two preceding deployment-availability steps require both deployments to have a ready replica, guaranteeing pods exist before this step runs. Soundness was already there; the comment makes it visible when reading the step in isolation.
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
staleness check, agents-sync check, chart-pin verification, 20/20
chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
staleness check, agents-sync check, chart-pin verification, 20/20
chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
staleness check, agents-sync check, chart-pin verification, 20/20
chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
Two coupled changes:
1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
to reflect current registry + chart state. No code changes, no
registry pins moved; this picks up drift accumulated since main's
last BOM commit:
- slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
added (registry entries landed in NVIDIA#866 without a corresponding
BOM regen)
- kubeflow-trainer's bundled pytorch image bumped upstream
- busybox bumped upstream in some chart's rendered templates
2. Add the BOM-regen rule to contributor and agent docs, surfaced by
the same NVIDIA#872 review that flagged the drift:
- .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
"Common Tasks" section: after any change to recipes/registry.yaml,
a component values file, or a chart pin, run `make bom-docs` and
commit the regenerated container-images.md in the same PR
- docs/contributor/data.md — new "Regenerating the BOM" subsection
under Component Configuration covering when to run it, how it can
surface upstream chart drift, and how to handle unrelated drift
(split into a catch-up PR vs land it together)
- Makefile — corrected the bom-check target's help text from
"CI gate, opt-in locally" to "opt-in; not wired into qualify/
lint/merge gate" to match the actual enforcement story
Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.
Honest enforcement story: an earlier revision of these docs (and an
earlier revision of this commit message) claimed `make qualify` and
CI catch stale BOMs. That was incorrect — verified directly against
the Makefile:
- `qualify` depends on test-coverage / lint / e2e / scan / license-
check, none of which run `bom-check`
- `lint` runs `bom-pinning-check` (chart-pin verification per ADR-006),
not `bom-check` (BOM doc freshness)
- The merge gate has no PR-time BOM-staleness check
The corrected docs say so explicitly. Wiring `bom-check` into the gate
is a desirable follow-up but intentionally out of scope here — it
would change CI behavior for every PR and likely block unrelated PRs
that happen to expose accumulated upstream drift, which deserves its
own discussion.
This also explains how NVIDIA#866 merged with a stale BOM: nothing in the
existing gate would have caught it.
Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, agents-
sync check, chart-pin verification, 20/20 chainsaw, vulnerability
scan, license headers)
- No code changes, no chart pins moved, no behavior change
…g note Two follow-ups to PR NVIDIA#866 (Slinky slurm-operator), bundled into one post-merge cleanup PR. Platform-enum drift backfill — surfaces NVIDIA#866 did not catch ============================================================ PR NVIDIA#866 added `--platform slurm` to the OpenAPI spec, the Go enum (GetCriteriaPlatformTypes), and docs/user/{cli,api}-reference.md + component-catalog.md. Per the project's enum-audit rule (CLAUDE.md "Documentation updates"), every surface that enumerates platform values should be updated. The following surfaces still showed `kubeflow` only (pre-existing drift predating slurm — `dynamo` and `nim` were already missing): - docs/README.md:11 — glossary Criteria row - docs/contributor/data.md:130 — platform row in criteria table - docs/contributor/validations.md:99 — platform bullet - site/docs/getting-started/index.md:19 — glossary mirror - pkg/recipe/criteria.go:246 — Platform field godoc - pkg/recipe/doc.go:32 — Criteria struct shape comment (was non-alphabetical "kubeflow, dynamo, nim, slurm, any") - pkg/recipe/doc.go:93-98 — CriteriaPlatform* constant listing (was non-alphabetical order) - pkg/api/doc.go:72 — REST API doc was missing slurm entirely (only one of these with factual content drift; rest are ordering) - .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR snapshot-analysis skill criteria table All now list `dynamo, kubeflow, nim, slurm` in alphabetical order matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum). Per-file delimiter convention preserved: slash-separated in glossary-style parens, comma-separated in tables/godoc. New guard test ============== pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in sync. Catches the exact class of drift this commit fixes if a future platform value is added to one but not the other. Catalog node-selector limitation note ===================================== Appends a `**Known limitation:**` clause to the slinky-slurm-operator row in docs/user/component-catalog.md, documenting that chart v1.1.0 silently ignores `operator.nodeSelector` and `webhook.nodeSelector` and linking SlinkyProject/slurm-operator#187 for the upstream fix. The same limitation is already in-line in recipes/registry.yaml; the catalog note surfaces it to readers of the rendered user docs. Originally planned but no longer needed ======================================= While auditing, I had also planned to add Mark's two "optional nits" from NVIDIA#866 — a chart-default verification pointer at defaultVersion in recipes/registry.yaml, and a vacuous-pass guard comment in recipes/checks/slinky-slurm-operator/health-check.yaml. Both turned out to already be in the merged version (Mark applied them himself before merge — see recipes/registry.yaml:550-555 and recipes/checks/slinky-slurm-operator/health-check.yaml:64-69). The .lychee.toml header change is the project's license-header check auto-applying the standard Apache 2.0 copyright block. Addresses ========= Mark (@mchmarny) review feedback on NVIDIA#866: - Doc-audit gap across user docs (cli-reference + api-reference landed at merge time; this PR catches the remaining surfaces) CodeRabbit review feedback on NVIDIA#866: - NodeSelector limitation note on slinky-slurm-operator catalog row Internal panel review (PE + QA + DA) on draft NVIDIA#884: - Extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering), .claude/skills/analyzing-snapshots/SKILL.md - Added guard test against future enum/getter drift - Squashed to single commit; rebased onto upstream/main HEAD Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
…g note Follow-up to PR NVIDIA#866 (Slinky slurm-operator). Backfills platform-enum drift on surfaces the original PR did not catch, plus surfaces the chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing component-catalog row. PR NVIDIA#896 (yuanchen8911) already fixed docs/README.md and docs/contributor/validations.md as part of a broader docs audit. PR NVIDIA#893 removed the site/docs/ vitepress mirror entirely. Remaining surfaces this PR addresses: - docs/contributor/data.md:130 — platform row in criteria table - docs/user/component-catalog.md:36 — slinky-slurm-operator row, appended **Known limitation:** clause documenting chart v1.1.0 silent-ignore of operator.nodeSelector/webhook.nodeSelector (the same limitation is already inline in recipes/registry.yaml; this surfaces it on the rendered user catalog) - pkg/api/doc.go:72 — REST API godoc was missing slurm entirely - pkg/recipe/doc.go:32 — struct shape comment reordered to alphabetical to match GetCriteriaPlatformTypes() - pkg/recipe/doc.go:93-98 — CriteriaPlatform* constants reordered alphabetical - pkg/recipe/criteria.go:246 — Platform-field godoc updated - .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR snapshot-analysis skill criteria table All now list 'dynamo, kubeflow, nim, slurm' alphabetically matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum). New guard test ============== pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in sync. Mechanically catches the exact class of drift this commit fixes if a future platform value is added to one but not the other. Addressed reviews ================= @mchmarny (NVIDIA#866) — Doc-audit gap. cli-reference and api-reference were fixed at merge time and NVIDIA#896 picked up README/validations; this PR catches the remaining surfaces (contributor/data, the three Go godoc files, and the internal skill table). @coderabbitai (NVIDIA#866) — NodeSelector limitation note on the slinky-slurm-operator catalog row. Internal PE + QA + DA panel review on draft NVIDIA#884 — extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering), .claude/skills/analyzing-snapshots/SKILL.md, plus the guard test in pkg/recipe/doc_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Add `runai` as a platform value in AICR recipes, peer to `dynamo`, `kubeflow`, `nim`, and `slurm`. Mirrors the precedent set by NVIDIA#866 (slurm). Sites updated: - pkg/recipe/criteria.go: CriteriaPlatformRunai const, parser case, sorted slice in GetCriteriaPlatformTypes - pkg/recipe/criteria_test.go: table-driven cases for runai (lower + uppercase) and updated TestGetCriteriaPlatformTypes expectations - pkg/recipe/doc.go: Platform field docstring + bullet under "Platform types for workload frameworks" - api/aicr/v1/server.yaml: three platform enum sites - docs/user/{api,cli}-reference.md: platform value enumerations - docs/contributor/{api-server,validations}.md: platform value enumerations - docs/README.md: glossary Criteria row Run:ai recipe matrix and any operator/CRD plumbing are deferred to follow-up PRs.
Per review feedback on NVIDIA#955: pkg/api/doc.go:72 still listed the platform enum as (dynamo, kubeflow, nim, any) — missing both `slurm` (pre-existing drift from NVIDIA#866) and the new `runai` added in this PR. Updated to match the canonical list now used across server.yaml, pkg/recipe/criteria.go, and the user/contributor docs: platform: Platform/framework (dynamo, kubeflow, nim, runai, slurm, any) No behavioral change; godoc-only. Reported-by: @yuanchen8911 Signed-off-by: Rob Esker <[email protected]>
Follow-up to PR #866 (Slinky slurm-operator). Backfills platform-enum drift on surfaces the original PR did not catch, plus surfaces the chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing component-catalog row. PR #896 (yuanchen8911) already fixed docs/README.md and docs/contributor/validations.md as part of a broader docs audit. PR surfaces this PR addresses: - docs/contributor/data.md:130 — platform row in criteria table - docs/user/component-catalog.md:36 — slinky-slurm-operator row, appended **Known limitation:** clause documenting chart v1.1.0 silent-ignore of operator.nodeSelector/webhook.nodeSelector (the same limitation is already inline in recipes/registry.yaml; this surfaces it on the rendered user catalog) - pkg/api/doc.go:72 — REST API godoc was missing slurm entirely - pkg/recipe/doc.go:32 — struct shape comment reordered to alphabetical to match GetCriteriaPlatformTypes() - pkg/recipe/doc.go:93-98 — CriteriaPlatform* constants reordered alphabetical - pkg/recipe/criteria.go:246 — Platform-field godoc updated - .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR snapshot-analysis skill criteria table All now list 'dynamo, kubeflow, nim, slurm' alphabetically matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum). New guard test ============== pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in sync. Mechanically catches the exact class of drift this commit fixes if a future platform value is added to one but not the other. Addressed reviews ================= @mchmarny (#866) — Doc-audit gap. cli-reference and api-reference were fixed at merge time and #896 picked up README/validations; this PR catches the remaining surfaces (contributor/data, the three Go godoc files, and the internal skill table). @coderabbitai (#866) — NodeSelector limitation note on the slinky-slurm-operator catalog row. Internal PE + QA + DA panel review on draft #884 — extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering), .claude/skills/analyzing-snapshots/SKILL.md, plus the guard test in pkg/recipe/doc_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
…#884) Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]> Co-authored-by: Mark Chmarny <[email protected]>
Summary
Adds the Slinky slurm-operator (chart v1.1.0) as a first-class AICR platform under
--platform slurm. Empty-base PR — ships the operator + CRDs plumbing only; leaf overlays follow in a separate PR.Motivation / Context
Slurm is the dominant HPC workload manager. Users running mixed AI + HPC workloads on Kubernetes need a validated path to install the Slinky operator alongside the existing GPU stack (
gpu-operator,nvsentinel,nfd,cert-manager, etc.) without bringing it in by hand. This PR adds the platform plumbing so a follow-up can add per-CSP / per-accelerator leaf overlays without re-litigating the operator/CRD shape, namespace, version pin, or value-override aliases.Mirrors the operator-only pattern of
--platform dynamoand--platform kubeflow: AICR ships only the platform; cluster-instance CRs (Controller,LoginSet,NodeSet,Accounting,RestApi,Token) remain user-authored. The Slinkyslurmcluster Helm chart is intentionally deferred to a follow-up PR with its own short alias (e.g.slurm-cluster) so theslurmalias continues to route to the operator.Fixes: N/A
Related: SlinkyProject/slurm-operator#187 (upstream
nodeSelectorexposure — see Implementation Notes)Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/registry.yaml,recipes/components/,recipes/checks/,recipes/mixins/Implementation Notes
Components added
slinky-slurm-operator-crds—oci://ghcr.io/slinkyproject/charts/slurm-operator-crds:1.1.0. Installs the 6 CRDs underslinky.slurm.net(Accounting,Controller,LoginSet,NodeSet,RestApi,Token) into theslinkynamespace.slinky-slurm-operator—oci://ghcr.io/slinkyproject/charts/slurm-operator:1.1.0. Deploys the operator and admission webhook intoslinky. Depends oncert-manager(webhook serving certificate, issued via the chart's rendered Issuer + Certificate resources) andslinky-slurm-operator-crds.Glue
recipes/mixins/platform-slurm.yamlcomposes the two components into theslurmplatform, mirroringplatform-kubeflow.yaml.recipes/checks/slinky-slurm-operator/health-check.yamlasserts both deployments and pod-phase health (Chainsaw test, mirrorsdynamo-platform/health-check.yaml).pkg/recipe/criteria.goaddsCriteriaPlatformSlurmand updates the parser +GetCriteriaPlatformTypes(sorted:dynamo,kubeflow,nim,slurm).docs/user/component-catalog.mdadds rows for both components and a new bullet under "How Components Are Selected".Key decisions (forward-binding)
--platform slurmenum valueslurmcluster chart yet)dynamo/kubeflowprecedent — workload instances are user-authoreddynamo-crds/kgateway-crdsprecedent — independent CRD lifecycleslinky(notslurm-operatororslurm-system)platform-slurm) instead of inline overlay refsslurmshort alias points to the operatordynamoaliasingdynamo-platform. Future cluster chart will get a distinct alias (e.g.slurm-cluster)1.1.0helm pull(digestsha256:f923fe6c…84571e7)Known limitation:
nodeSelectordropped silentlyChart v1.1.0 templates only render
operator.{affinity,tolerations}andwebhook.{affinity,tolerations}. Settingoperator.nodeSelector/webhook.nodeSelectoris silently dropped:So AICR's
--system-node-selectorflag is a no-op for this component until a chart release that includes SlinkyProject/slurm-operator#187 lands. The registry comment includes the verbatim follow-up action (addnodeSelectorPaths: [operator.nodeSelector, webhook.nodeSelector]+ bumpdefaultVersion) so the next maintainer has zero archeology. Workaround in the meantime: pin nodes viavaluesFileoverrides onoperator.affinity.nodeAffinity/webhook.affinity.nodeAffinity.Empty-base scope (intentional)
The
platform-slurmmixin is currently orphan — no overlay references it. This is intentional: the leaf overlay matrix (CSP × accelerator × OS) and the corresponding deployment-order guard test entries are deferred to a separate PR. The schema test does not enforce mixin reachability, so this passes CI as-is.Testing
test-coverage(race detector)lint(lint-go, lint-yaml, license, check-agents-sync, check-docs-sidebar)e2e(chainsaw, 18 CLI tests)scan(grype, fail-on=high)license-check(go-licenses)Live smoke test against a Kind cluster + chart v1.1.0:
kind create cluster --name aicr-slurm-smoke helm install cert-manager oci://quay.io/jetstack/charts/cert-manager \ --namespace cert-manager --create-namespace --set crds.enabled=true helm install slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds \ --version 1.1.0 helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \ --version 1.1.0 --namespace slinky --create-namespace --set crds.enabled=false make check-health COMPONENT=slinky-slurm-operator # PASS — 1/1 in 5.42sCLI sanity check:
$ aicr recipe --platform slurm --format yaml | head -15 kind: RecipeResult apiVersion: aicr.nvidia.com/v1alpha1 metadata: version: dev appliedOverlays: - base - monitoring-hpa criteria: service: any accelerator: any intent: any os: any platform: slurm constraints: - name: K8s.server.version(Resolves the new platform value and applies base + monitoring-hpa as expected for an empty-base addition — no overlay materializes the slurm components yet, so the recipe is identical to a no-platform query apart from the criteria echo.)
Risk Assessment
Rationale: Mirrors the same registry-addition pattern used for
kueue,kgateway,dynamo-platform,kubeflow-trainer, and the ~18 other components already inrecipes/registry.yaml. No changes to recipe engine semantics, bundler behavior, CLI flag shapes, or output formats. Pure additive surface — existing recipes, criteria queries, and CLI invocations are unaffected. The new platform enum value is opt-in via--platform slurmand has no effect on any existing matcher path.Rollout notes: N/A — additive change, no migration steps. Reversible by
git revertif the upstream chart pin needs to change before a leaf overlay PR lands.Follow-ups (separate PRs, not blocking this one)
h100-eks-ubuntu-training-slurm,h100-kind-training-slurm, plus the broader CSP × accelerator × OS matrix. Wiresplatform-slurminto actual recipes and letsmake check-health-allexercise the new health check in CI.slurmcluster chart as a separate component (slinky-slurmor similar) under its own short alias.metadata.nameto label-selector assertions onapp.kubernetes.io/nameso they toleratenameOverride. Cross-cutting cleanup; same fragility exists fordynamo-platform,kubeflow-trainer, etc. today.Checklist
make testwith-race)make lint)git commit -S)