Skip to content

Add Slinky slurm-operator as platform-slurm#866

Merged
mchmarny merged 6 commits into
NVIDIA:mainfrom
faganihajizada:feat/slinky-slurm-operator
May 13, 2026
Merged

Add Slinky slurm-operator as platform-slurm#866
mchmarny merged 6 commits into
NVIDIA:mainfrom
faganihajizada:feat/slinky-slurm-operator

Conversation

@faganihajizada

Copy link
Copy Markdown
Member

Summary

Adds the Slinky slurm-operator (chart v1.1.0) as a first-class AICR platform under --platform slurm. Empty-base PR — ships the operator + CRDs plumbing only; leaf overlays follow in a separate PR.

Motivation / Context

Slurm is the dominant HPC workload manager. Users running mixed AI + HPC workloads on Kubernetes need a validated path to install the Slinky operator alongside the existing GPU stack (gpu-operator, nvsentinel, nfd, cert-manager, etc.) without bringing it in by hand. This PR adds the platform plumbing so a follow-up can add per-CSP / per-accelerator leaf overlays without re-litigating the operator/CRD shape, namespace, version pin, or value-override aliases.

Mirrors the operator-only pattern of --platform dynamo and --platform kubeflow: AICR ships only the platform; cluster-instance CRs (Controller, LoginSet, NodeSet, Accounting, RestApi, Token) remain user-authored. The Slinky slurm cluster Helm chart is intentionally deferred to a follow-up PR with its own short alias (e.g. slurm-cluster) so the slurm alias continues to route to the operator.

Fixes: N/A
Related: SlinkyProject/slurm-operator#187 (upstream nodeSelector exposure — see Implementation Notes)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/registry.yaml, recipes/components/, recipes/checks/, recipes/mixins/

Implementation Notes

Components added

  • slinky-slurm-operator-crdsoci://ghcr.io/slinkyproject/charts/slurm-operator-crds:1.1.0. Installs the 6 CRDs under slinky.slurm.net (Accounting, Controller, LoginSet, NodeSet, RestApi, Token) into the slinky namespace.

  • slinky-slurm-operatoroci://ghcr.io/slinkyproject/charts/slurm-operator:1.1.0. Deploys the operator and admission webhook into slinky. Depends on cert-manager (webhook serving certificate, issued via the chart's rendered Issuer + Certificate resources) and slinky-slurm-operator-crds.

Glue

  • recipes/mixins/platform-slurm.yaml composes the two components into the slurm platform, mirroring platform-kubeflow.yaml.
  • recipes/checks/slinky-slurm-operator/health-check.yaml asserts both deployments and pod-phase health (Chainsaw test, mirrors dynamo-platform/health-check.yaml).
  • pkg/recipe/criteria.go adds CriteriaPlatformSlurm and updates the parser + GetCriteriaPlatformTypes (sorted: dynamo, kubeflow, nim, slurm).
  • docs/user/component-catalog.md adds rows for both components and a new bullet under "How Components Are Selected".

Key decisions (forward-binding)

Decision Rationale
Reserve --platform slurm enum value Public CLI surface — locks in the canonical name early
Operator-only first slice (no slurm cluster chart yet) Matches dynamo / kubeflow precedent — workload instances are user-authored
Two components: operator + CRDs (split) Matches dynamo-crds / kgateway-crds precedent — independent CRD lifecycle
Namespace slinky (not slurm-operator or slurm-system) Matches upstream Slinky quick-start — users can follow SchedMD docs verbatim
Mixin shape (platform-slurm) instead of inline overlay refs Chart has no per-cloud value divergence (no etcd/NATS/PVCs); kubeflow-style is simpler
slurm short alias points to the operator Mirrors dynamo aliasing dynamo-platform. Future cluster chart will get a distinct alias (e.g. slurm-cluster)
Pin chart 1.1.0 Latest stable as of PR; verified via helm pull (digest sha256:f923fe6c…84571e7)

Known limitation: nodeSelector dropped silently

Chart v1.1.0 templates only render operator.{affinity,tolerations} and webhook.{affinity,tolerations}. Setting operator.nodeSelector / webhook.nodeSelector is silently dropped:

helm template slinky-slurm-op slurm-operator-1.1.0.tgz \
  --set operator.nodeSelector.foo=bar \
  --set webhook.nodeSelector.baz=qux | grep nodeSelector
# (no output — value silently discarded)

So AICR's --system-node-selector flag is a no-op for this component until a chart release that includes SlinkyProject/slurm-operator#187 lands. The registry comment includes the verbatim follow-up action (add nodeSelectorPaths: [operator.nodeSelector, webhook.nodeSelector] + bump defaultVersion) so the next maintainer has zero archeology. Workaround in the meantime: pin nodes via valuesFile overrides on operator.affinity.nodeAffinity / webhook.affinity.nodeAffinity.

Empty-base scope (intentional)

The platform-slurm mixin is currently orphan — no overlay references it. This is intentional: the leaf overlay matrix (CSP × accelerator × OS) and the corresponding deployment-order guard test entries are deferred to a separate PR. The schema test does not enforce mixin reachability, so this passes CI as-is.

Testing

# Verified end-to-end:
make qualify   # test-coverage + lint + e2e + scan + license-check
Stage Result
test-coverage (race detector) PASS — 75.2% coverage (threshold 70%)
lint (lint-go, lint-yaml, license, check-agents-sync, check-docs-sidebar) PASS — 0 issues
e2e (chainsaw, 18 CLI tests) PASS — 18/18
scan (grype, fail-on=high) PASS — 0 high+ vulnerabilities
license-check (go-licenses) PASS — approved licenses only

Live smoke test against a Kind cluster + chart v1.1.0:

kind create cluster --name aicr-slurm-smoke
helm install cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds \
  --version 1.1.0
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --version 1.1.0 --namespace slinky --create-namespace --set crds.enabled=false
make check-health COMPONENT=slinky-slurm-operator
# PASS — 1/1 in 5.42s

CLI sanity check:

$ aicr recipe --platform slurm --format yaml | head -15
kind: RecipeResult
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  version: dev
  appliedOverlays:
    - base
    - monitoring-hpa
criteria:
  service: any
  accelerator: any
  intent: any
  os: any
  platform: slurm
constraints:
  - name: K8s.server.version

(Resolves the new platform value and applies base + monitoring-hpa as expected for an empty-base addition — no overlay materializes the slurm components yet, so the recipe is identical to a no-platform query apart from the criteria echo.)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rationale: Mirrors the same registry-addition pattern used for kueue, kgateway, dynamo-platform, kubeflow-trainer, and the ~18 other components already in recipes/registry.yaml. No changes to recipe engine semantics, bundler behavior, CLI flag shapes, or output formats. Pure additive surface — existing recipes, criteria queries, and CLI invocations are unaffected. The new platform enum value is opt-in via --platform slurm and has no effect on any existing matcher path.

Rollout notes: N/A — additive change, no migration steps. Reversible by git revert if the upstream chart pin needs to change before a leaf overlay PR lands.

Follow-ups (separate PRs, not blocking this one)

  • Leaf overlays for slurm: h100-eks-ubuntu-training-slurm, h100-kind-training-slurm, plus the broader CSP × accelerator × OS matrix. Wires platform-slurm into actual recipes and lets make check-health-all exercise the new health check in CI.
  • slurm cluster chart as a separate component (slinky-slurm or similar) under its own short alias.
  • Chainsaw health-check hardening — switch all 20 health checks from hardcoded metadata.name to label-selector assertions on app.kubernetes.io/name so they tolerate nameOverride. Cross-cutting cleanup; same fragility exists for dynamo-platform, kubeflow-trainer, etc. today.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @faganihajizada! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@coderabbitai

This comment was marked as resolved.

…slurm

Adds the Slinky Slurm operator (chart v1.1.0 from
oci://ghcr.io/slinkyproject/charts) as a first-class AICR platform under
`--platform slurm`. Ships two components: `slinky-slurm-operator-crds`
and `slinky-slurm-operator`, composed by a new `platform-slurm` mixin.
Operator depends on cert-manager and the CRDs.

Mirrors the operator-only pattern of `--platform dynamo` and
`--platform kubeflow`: AICR ships only the platform; cluster-instance
CRs (Controller, LoginSet, NodeSet, ...) remain user-authored. Leaf
overlays (CSP x accelerator x OS) follow in a separate PR.

Notable: chart v1.1.0 silently ignores operator.nodeSelector and
webhook.nodeSelector. The registry comment links
SlinkyProject/slurm-operator#187 with the
verbatim follow-up action once that upstream PR merges and a new chart
is published. The `slurm` cluster chart is intentionally deferred to a
follow-up PR with its own short alias (e.g. `slurm-cluster`) so `slurm`
continues to route to the operator.

Verified: make qualify (test-coverage + lint + e2e + scan +
license-check) PASS; make check-health COMPONENT=slinky-slurm-operator
against a live Kind cluster + chart v1.1.0 PASS in 5.42s.
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-operator branch from e5c11c9 to af1d8b9 Compare May 13, 2026 09:14
coderabbitai[bot]

This comment was marked as resolved.

@mchmarny

Copy link
Copy Markdown
Member

@faganihajizada welcome to AICR project, I'm super excited about adding Skinky.

CodeRabbit can get little annoying on draft PRs but it's good to get early signals. Let us know if you want to walk through the approach or just go straight to review.

Re --system-node-selector flag being no-op, can you please an create issue in AICR to track 187 in Slinky. We will want to jump on this when it lands.

@faganihajizada faganihajizada marked this pull request as ready for review May 13, 2026 12:27
@faganihajizada faganihajizada requested review from a team as code owners May 13, 2026 12:27
@faganihajizada

faganihajizada commented May 13, 2026

Copy link
Copy Markdown
Member Author

@faganihajizada welcome to AICR project, I'm super excited about adding Skinky.

CodeRabbit can get little annoying on draft PRs but it's good to get early signals. Let us know if you want to walk through the approach or just go straight to review.

Thank you @mchmarny

I checked coderabbit reviews and made separate PR. This is ready for review.

Re --system-node-selector flag being no-op, can you please an create issue in AICR to track 187 in Slinky. We will want to jump on this when it lands.

yes 💯

The OpenAPI server spec at api/aicr/v1/server.yaml has the platform enum
duplicated at 3 sites (operation parameters and Criteria schema). The
TestOpenAPIEnumsMatchGoTypes test enforces these stay in sync with
recipe.GetCriteriaPlatformTypes(). Adding 'slurm' to the Go enum without
updating the spec broke that test.

Add 'slurm' to all 3 enum sites alphabetically (between 'nim' and 'any').
@faganihajizada faganihajizada force-pushed the feat/slinky-slurm-operator branch from 3b04bba to 4e9c496 Compare May 13, 2026 13:15
@github-actions

Copy link
Copy Markdown
Contributor

@faganihajizada this PR now has merge conflicts with main. Please rebase to resolve them.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean additive PR — registry/mixin shape mirrors the existing dynamo / kubeflow precedent, the upstream nodeSelector limitation is documented at the registry entry rather than buried in a separate note, and the empty-base scoping is intentional and well-explained.

One medium item on the docs side: the platform enum was added to pkg/recipe/criteria.go, server.yaml, and component-catalog.md, but docs/user/cli-reference.md and docs/user/api-reference.md (plus their site/docs/ mirrors) still hardcode the old dynamo, kubeflow, nim list. Four-line fix; should land in this PR per the doc-update rule for user-visible behavior. Two minor nits on values comments and the health-check vacuous-pass dependency — both optional.

Branch is also marked needs-rebase against main. None of this blocks once the doc gap is closed and the rebase is done.

Comment thread api/aicr/v1/server.yaml
Comment thread recipes/components/slinky-slurm-operator/values.yaml
Comment thread recipes/checks/slinky-slurm-operator/health-check.yaml
# Conflicts:
#	docs/user/component-catalog.md
- docs/user/cli-reference.md, docs/user/api-reference.md: add `slurm`
  to the platform enum lists. The Go enum, OpenAPI spec, and component
  catalog were updated when slurm was added; these two reference pages
  drifted because they hardcode the enum.

- recipes/registry.yaml (slinky-slurm-operator): document above
  defaultVersion that bumping the chart pin requires re-verifying the
  defaults baked into components/slinky-slurm-operator/values.yaml,
  since the values file mirrors chart defaults explicitly. Without
  this comment, an upstream default flip (e.g. certManager.enabled
  flipping from true to false) would pass through unnoticed on the
  next bump.

- recipes/checks/slinky-slurm-operator/health-check.yaml: explain why
  validate-all-pods-healthy is not a vacuous pass on an empty
  namespace. Chainsaw `error` assertions pass when no matching
  resource exists, but the two preceding deployment-availability steps
  require both deployments to have a ready replica, guaranteeing pods
  exist before this step runs. Soundness was already there; the
  comment makes it visible when reading the step in isolation.
@mchmarny mchmarny enabled auto-merge (squash) May 13, 2026 18:22
@mchmarny mchmarny merged commit 6dfbb5d into NVIDIA:main May 13, 2026
91 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
  staleness check, agents-sync check, chart-pin verification, 20/20
  chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
  staleness check, agents-sync check, chart-pin verification, 20/20
  chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Open question for follow-up (not blocking this PR): how did NVIDIA#866 merge
with a stale BOM? `make qualify` enforces BOM-up-to-dateness locally,
so either the CI check is path-conditional and skipped on that PR, or
there's another loophole. Worth a separate audit.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, BOM
  staleness check, agents-sync check, chart-pin verification, 20/20
  chainsaw, vulnerability scan, license headers)
- No code changes, no chart pins moved, no behavior change
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Two coupled changes:

1. Pure BOM regen of docs/user/container-images.md via `make bom-docs`
   to reflect current registry + chart state. No code changes, no
   registry pins moved; this picks up drift accumulated since main's
   last BOM commit:
   - slinky-slurm-operator and slinky-slurm-operator-crds BOM rows
     added (registry entries landed in NVIDIA#866 without a corresponding
     BOM regen)
   - kubeflow-trainer's bundled pytorch image bumped upstream
   - busybox bumped upstream in some chart's rendered templates

2. Add the BOM-regen rule to contributor and agent docs, surfaced by
   the same NVIDIA#872 review that flagged the drift:
   - .claude/CLAUDE.md (auto-synced to AGENTS.md) — short rule in the
     "Common Tasks" section: after any change to recipes/registry.yaml,
     a component values file, or a chart pin, run `make bom-docs` and
     commit the regenerated container-images.md in the same PR
   - docs/contributor/data.md — new "Regenerating the BOM" subsection
     under Component Configuration covering when to run it, how it can
     surface upstream chart drift, and how to handle unrelated drift
     (split into a catch-up PR vs land it together)
   - Makefile — corrected the bom-check target's help text from
     "CI gate, opt-in locally" to "opt-in; not wired into qualify/
     lint/merge gate" to match the actual enforcement story

Surfaced as an unrelated diff in NVIDIA#872 (aws-efa v0.5.26 bump) when
`make bom-docs` was rerun against the rebased branch. Splitting this
catch-up out so NVIDIA#872's diff is scoped to aws-efa content.

Honest enforcement story: an earlier revision of these docs (and an
earlier revision of this commit message) claimed `make qualify` and
CI catch stale BOMs. That was incorrect — verified directly against
the Makefile:
- `qualify` depends on test-coverage / lint / e2e / scan / license-
  check, none of which run `bom-check`
- `lint` runs `bom-pinning-check` (chart-pin verification per ADR-006),
  not `bom-check` (BOM doc freshness)
- The merge gate has no PR-time BOM-staleness check
The corrected docs say so explicitly. Wiring `bom-check` into the gate
is a desirable follow-up but intentionally out of scope here — it
would change CI behavior for every PR and likely block unrelated PRs
that happen to expose accumulated upstream drift, which deserves its
own discussion.

This also explains how NVIDIA#866 merged with a stale BOM: nothing in the
existing gate would have caught it.

Test plan:
- `make qualify` passes (Go tests, golangci-lint + yamllint, agents-
  sync check, chart-pin verification, 20/20 chainsaw, vulnerability
  scan, license headers)
- No code changes, no chart pins moved, no behavior change
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request May 15, 2026
…g note

Two follow-ups to PR NVIDIA#866 (Slinky slurm-operator), bundled into one
post-merge cleanup PR.

Platform-enum drift backfill — surfaces NVIDIA#866 did not catch
============================================================

PR NVIDIA#866 added `--platform slurm` to the OpenAPI spec, the Go enum
(GetCriteriaPlatformTypes), and docs/user/{cli,api}-reference.md +
component-catalog.md. Per the project's enum-audit rule (CLAUDE.md
"Documentation updates"), every surface that enumerates platform
values should be updated. The following surfaces still showed
`kubeflow` only (pre-existing drift predating slurm — `dynamo` and
`nim` were already missing):

- docs/README.md:11 — glossary Criteria row
- docs/contributor/data.md:130 — platform row in criteria table
- docs/contributor/validations.md:99 — platform bullet
- site/docs/getting-started/index.md:19 — glossary mirror
- pkg/recipe/criteria.go:246 — Platform field godoc
- pkg/recipe/doc.go:32 — Criteria struct shape comment (was
  non-alphabetical "kubeflow, dynamo, nim, slurm, any")
- pkg/recipe/doc.go:93-98 — CriteriaPlatform* constant listing
  (was non-alphabetical order)
- pkg/api/doc.go:72 — REST API doc was missing slurm entirely
  (only one of these with factual content drift; rest are ordering)
- .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR
  snapshot-analysis skill criteria table

All now list `dynamo, kubeflow, nim, slurm` in alphabetical order
matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go
enum). Per-file delimiter convention preserved: slash-separated in
glossary-style parens, comma-separated in tables/godoc.

New guard test
==============

pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter
asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes()
stay in sync. Catches the exact class of drift this commit fixes if
a future platform value is added to one but not the other.

Catalog node-selector limitation note
=====================================

Appends a `**Known limitation:**` clause to the slinky-slurm-operator
row in docs/user/component-catalog.md, documenting that chart v1.1.0
silently ignores `operator.nodeSelector` and `webhook.nodeSelector`
and linking SlinkyProject/slurm-operator#187 for the upstream fix.
The same limitation is already in-line in recipes/registry.yaml; the
catalog note surfaces it to readers of the rendered user docs.

Originally planned but no longer needed
=======================================

While auditing, I had also planned to add Mark's two "optional nits"
from NVIDIA#866 — a chart-default verification pointer at defaultVersion in
recipes/registry.yaml, and a vacuous-pass guard comment in
recipes/checks/slinky-slurm-operator/health-check.yaml. Both turned
out to already be in the merged version (Mark applied them himself
before merge — see recipes/registry.yaml:550-555 and
recipes/checks/slinky-slurm-operator/health-check.yaml:64-69).

The .lychee.toml header change is the project's license-header check
auto-applying the standard Apache 2.0 copyright block.

Addresses
=========

Mark (@mchmarny) review feedback on NVIDIA#866:
  - Doc-audit gap across user docs (cli-reference + api-reference
    landed at merge time; this PR catches the remaining surfaces)

CodeRabbit review feedback on NVIDIA#866:
  - NodeSelector limitation note on slinky-slurm-operator catalog row

Internal panel review (PE + QA + DA) on draft NVIDIA#884:
  - Extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go
    (ordering), .claude/skills/analyzing-snapshots/SKILL.md
  - Added guard test against future enum/getter drift
  - Squashed to single commit; rebased onto upstream/main HEAD

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request May 15, 2026
…g note

Follow-up to PR NVIDIA#866 (Slinky slurm-operator). Backfills platform-enum
drift on surfaces the original PR did not catch, plus surfaces the
chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing
component-catalog row.

PR NVIDIA#896 (yuanchen8911) already fixed docs/README.md and
docs/contributor/validations.md as part of a broader docs audit. PR
NVIDIA#893 removed the site/docs/ vitepress mirror entirely. Remaining
surfaces this PR addresses:

- docs/contributor/data.md:130 — platform row in criteria table
- docs/user/component-catalog.md:36 — slinky-slurm-operator row,
  appended **Known limitation:** clause documenting chart v1.1.0
  silent-ignore of operator.nodeSelector/webhook.nodeSelector
  (the same limitation is already inline in recipes/registry.yaml;
  this surfaces it on the rendered user catalog)
- pkg/api/doc.go:72 — REST API godoc was missing slurm entirely
- pkg/recipe/doc.go:32 — struct shape comment reordered to alphabetical
  to match GetCriteriaPlatformTypes()
- pkg/recipe/doc.go:93-98 — CriteriaPlatform* constants reordered
  alphabetical
- pkg/recipe/criteria.go:246 — Platform-field godoc updated
- .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR
  snapshot-analysis skill criteria table

All now list 'dynamo, kubeflow, nim, slurm' alphabetically matching
pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum).

New guard test
==============

pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter
asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes()
stay in sync. Mechanically catches the exact class of drift this
commit fixes if a future platform value is added to one but not the
other.

Addressed reviews
=================

@mchmarny (NVIDIA#866) — Doc-audit gap. cli-reference and api-reference were
fixed at merge time and NVIDIA#896 picked up README/validations; this PR
catches the remaining surfaces (contributor/data, the three Go godoc
files, and the internal skill table).

@coderabbitai (NVIDIA#866) — NodeSelector limitation note on the
slinky-slurm-operator catalog row.

Internal PE + QA + DA panel review on draft NVIDIA#884 — extended audit
to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering),
.claude/skills/analyzing-snapshots/SKILL.md, plus the guard test
in pkg/recipe/doc_test.go.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
resker added a commit to resker/aicr that referenced this pull request May 19, 2026
Add `runai` as a platform value in AICR recipes, peer to
`dynamo`, `kubeflow`, `nim`, and `slurm`.

Mirrors the precedent set by NVIDIA#866 (slurm). Sites updated:

- pkg/recipe/criteria.go: CriteriaPlatformRunai const, parser case,
  sorted slice in GetCriteriaPlatformTypes
- pkg/recipe/criteria_test.go: table-driven cases for runai (lower +
  uppercase) and updated TestGetCriteriaPlatformTypes expectations
- pkg/recipe/doc.go: Platform field docstring + bullet under "Platform
  types for workload frameworks"
- api/aicr/v1/server.yaml: three platform enum sites
- docs/user/{api,cli}-reference.md: platform value enumerations
- docs/contributor/{api-server,validations}.md: platform value
  enumerations
- docs/README.md: glossary Criteria row

Run:ai recipe matrix and any operator/CRD plumbing are deferred to
follow-up PRs.
resker added a commit to resker/aicr that referenced this pull request May 19, 2026
Per review feedback on NVIDIA#955: pkg/api/doc.go:72 still listed the
platform enum as (dynamo, kubeflow, nim, any) — missing both `slurm`
(pre-existing drift from NVIDIA#866) and the new `runai` added in this PR.

Updated to match the canonical list now used across server.yaml,
pkg/recipe/criteria.go, and the user/contributor docs:

    platform: Platform/framework (dynamo, kubeflow, nim, runai, slurm, any)

No behavioral change; godoc-only.

Reported-by: @yuanchen8911
Signed-off-by: Rob Esker <[email protected]>
copy-pr-bot Bot pushed a commit that referenced this pull request May 27, 2026
Follow-up to PR #866 (Slinky slurm-operator). Backfills platform-enum
drift on surfaces the original PR did not catch, plus surfaces the
chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing
component-catalog row.

PR #896 (yuanchen8911) already fixed docs/README.md and
docs/contributor/validations.md as part of a broader docs audit. PR
surfaces this PR addresses:

- docs/contributor/data.md:130 — platform row in criteria table
- docs/user/component-catalog.md:36 — slinky-slurm-operator row,
  appended **Known limitation:** clause documenting chart v1.1.0
  silent-ignore of operator.nodeSelector/webhook.nodeSelector
  (the same limitation is already inline in recipes/registry.yaml;
  this surfaces it on the rendered user catalog)
- pkg/api/doc.go:72 — REST API godoc was missing slurm entirely
- pkg/recipe/doc.go:32 — struct shape comment reordered to alphabetical
  to match GetCriteriaPlatformTypes()
- pkg/recipe/doc.go:93-98 — CriteriaPlatform* constants reordered
  alphabetical
- pkg/recipe/criteria.go:246 — Platform-field godoc updated
- .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR
  snapshot-analysis skill criteria table

All now list 'dynamo, kubeflow, nim, slurm' alphabetically matching
pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum).

New guard test
==============

pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter
asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes()
stay in sync. Mechanically catches the exact class of drift this
commit fixes if a future platform value is added to one but not the
other.

Addressed reviews
=================

@mchmarny (#866) — Doc-audit gap. cli-reference and api-reference were
fixed at merge time and #896 picked up README/validations; this PR
catches the remaining surfaces (contributor/data, the three Go godoc
files, and the internal skill table).

@coderabbitai (#866) — NodeSelector limitation note on the
slinky-slurm-operator catalog row.

Internal PE + QA + DA panel review on draft #884 — extended audit
to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering),
.claude/skills/analyzing-snapshots/SKILL.md, plus the guard test
in pkg/recipe/doc_test.go.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
mchmarny added a commit that referenced this pull request May 28, 2026
…#884)

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Co-authored-by: Mark Chmarny <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants