Skip to content

feat(recipe): add bcm service type with overlays#1060

Merged
mchmarny merged 3 commits into
mainfrom
feat/bcm-service-type
May 27, 2026
Merged

feat(recipe): add bcm service type with overlays#1060
mchmarny merged 3 commits into
mainfrom
feat/bcm-service-type

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

Adds NVIDIA Base Command Manager (BCM) as a first-class service value with a bcm overlay tree (training + inference, plus an H100 + Ubuntu leaf) so operators can generate validated Helmfile bundles for BCM-managed clusters.

Motivation / Context

BCM 11.x (cm-kubernetes-setup) provisions only the base Kubernetes platform — control plane, CNI, DNS, MetalLB. The NVIDIA runtime stack (GPU Operator, NFD, cert-manager, Prometheus, nvsentinel, DRA driver, etc.) is delivered by AICR via base.yaml and applied with the helmfile deployer. Argo CD support is a planned follow-up.

Fixes: #1059
Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server) — OpenAPI enum
  • Recipe engine / data (pkg/recipe)
  • Validator (pkg/validator) — exhaustive switch in validators/performance
  • Docs/examples (docs/, examples/)

Implementation Notes

  • CriteriaServiceBCM = "bcm" added with parser case, alphabetical insertion into GetCriteriaServiceTypes, and matching test coverage (TestParseCriteriaServiceType, TestGetCriteriaServiceTypes, TestAllCriteriaServiceTypes_UnionWithRegistry).
  • recipes/overlays/bcm.yaml inherits base, sets K8s.server.version >= 1.34, persists kube-prometheus-stack on the local-path StorageClass (BCM default), and adds node-role.kubernetes.io/master tolerations on nvidia-dra-driver-gpu so DRA workloads schedule on BCM masters in addition to upstream control-plane.
  • Intent (bcm-training, bcm-inference) and leaf (h100-bcm-training, h100-bcm-ubuntu-training) overlays follow the EKS pattern; os-ubuntu mixin is reused.
  • validators/performance/nccl_all_reduce_bw_constraint.go — added CriteriaServiceBCM to the existing case that returns nil, nil (BCM has no cloud-specific worker scheduling); required by the exhaustive lint rule.
  • Docs updated per the CLAUDE.md enum-audit checklist: docs/README.md glossary, docs/user/{cli,api}-reference.md, docs/contributor/{api-server,api-server-extending,cli,data,validations}.md, OpenAPI enum blocks (3 sites), bug-report issue template, package godoc, and the aicr recipe CLI help.
  • docs/user/container-images.md regenerated via make bom-docs; picked up upstream busybox:1.37 digest drift in the network-operator chart (unrelated to BCM but caught by the regen — committed in the same PR per CLAUDE.md).

Testing

unset GITLAB_TOKEN && make qualify

Result: passes — total coverage 77.1% (threshold 75%), all chainsaw E2E tests pass (22/22), vulnerability scan clean, license headers OK.

Per-package coverage on changed packages (vs origin/main baseline):

  • pkg/recipe: 91.5% (no change)
  • pkg/cli: 65.2% (no change)
  • validators/performance: 28.0% (no change — added single case label only)

Risk Assessment

  • Low — Additive: new enum value, new overlay files, single switch-case extension. No changes to existing recipes or service types.

Rollout notes: None. Backwards compatible.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Adds NVIDIA Base Command Manager (BCM) as a first-class service value
alongside eks/gke/aks/oke/kind/lke. BCM 11.x provisions only the base
Kubernetes platform, CNI, DNS, and MetalLB; the NVIDIA runtime stack is
delivered by AICR via base.yaml and applied with the helmfile deployer.

- Adds CriteriaServiceBCM enum + parse/getter wiring and tests
- Adds bcm service overlay with K8s 1.34+ constraint, master toleration,
  and local-path StorageClass for Prometheus persistence
- Adds bcm-training, bcm-inference, h100-bcm-training, and
  h100-bcm-ubuntu-training overlays following the EKS pattern
- Updates OpenAPI enum blocks, CLI/API/contributor docs, glossary,
  issue templates, and the nccl exhaustive-switch validator

Closes: #1059
@github-actions

Copy link
Copy Markdown
Contributor

@mchmarny mchmarny marked this pull request as draft May 27, 2026 15:19
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9aef1ced-513f-41ed-93ed-7b9b44195f7a

📥 Commits

Reviewing files that changed from the base of the PR and between 59ab2ff and e324ef9.

📒 Files selected for processing (5)
  • pkg/api/doc.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/types.go
  • pkg/recipe/doc.go
  • pkg/server/doc.go

📝 Walkthrough

Walkthrough

This PR adds NVIDIA Base Command Manager (bcm) as a first-class service type. It introduces the CriteriaServiceBCM constant, updates parsing and enumeration, extends OpenAPI enums, updates CLI/help and user/contributor docs and issue templates, adds a bcm recipe overlay tree (base, training, inference, h100 variants), and tweaks the validator and a container-image digest reference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/aicr#999: Both PRs modify the criteria-value parsing and service type enumeration plumbing in pkg/recipe/criteria.go and AllCriteriaServiceTypes.

Suggested labels

enhancement, area/validator, area/tests, size/M

Suggested reviewers

  • lalitadithya
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning One out-of-scope change detected: docs/user/container-images.md busybox digest update is unrelated to BCM feature but was regenerated via make bom-docs. All other changes directly support the bcm service type addition. The busybox digest change is acknowledged in the PR description as caught during BOM regeneration but is technically out-of-scope for the BCM feature. Clarify whether this should be committed separately or document it as an approved concurrent change.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title directly and concisely summarizes the main change: adding bcm as a service type with associated recipe overlays.
Description check ✅ Passed The pull request description provides comprehensive context including summary, motivation, implementation details, testing results, and risk assessment, all directly related to the changeset.
Linked Issues check ✅ Passed All primary objectives from issue #1059 are met: CriteriaServiceBCM enum added with parsing/getter wiring, bcm.yaml overlay with topology constraints, intent and leaf overlays provided, OpenAPI enums updated, docs completed, and test coverage included.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/bcm-service-type

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/doc.go`:
- Around line 70-71: The package docs introduce a new service value 'bcm'
(CriteriaServiceBCM) but the query-parameter accepted-values list later in the
same doc still omits it; update the query-parameter list of accepted service
values (the docs referencing CriteriaServiceAny and the service enum/list) to
include 'bcm' alongside the existing entries so the two sections match and the
docs present consistent accepted values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f0701d3b-b14f-4579-a470-da38a9688fd8

📥 Commits

Reviewing files that changed from the base of the PR and between 9e1841f and 59ab2ff.

📒 Files selected for processing (22)
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/README.md
  • docs/contributor/api-server-extending.md
  • docs/contributor/api-server.md
  • docs/contributor/cli.md
  • docs/contributor/data.md
  • docs/contributor/validations.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • docs/user/container-images.md
  • pkg/cli/recipe.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_registry_parse_test.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/overlays/bcm-inference.yaml
  • recipes/overlays/bcm-training.yaml
  • recipes/overlays/bcm.yaml
  • recipes/overlays/h100-bcm-training.yaml
  • recipes/overlays/h100-bcm-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go

Comment thread pkg/recipe/doc.go
lalitadithya
lalitadithya previously approved these changes May 27, 2026
@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.9%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.9%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/api 0.00% (ø)
github.com/NVIDIA/aicr/pkg/cli 65.22% (ø)
github.com/NVIDIA/aicr/pkg/fingerprint 98.05% (ø)
github.com/NVIDIA/aicr/pkg/recipe 91.48% (+0.00%) 👍
github.com/NVIDIA/aicr/pkg/server 93.36% (ø)
github.com/NVIDIA/aicr/validators/performance 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/api/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/cli/recipe.go 84.13% (ø) 126 106 20
github.com/NVIDIA/aicr/pkg/fingerprint/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/fingerprint/types.go 100.00% (ø) 4 4 0
github.com/NVIDIA/aicr/pkg/recipe/criteria.go 91.64% (+0.03%) 335 (+1) 307 (+1) 28 👍
github.com/NVIDIA/aicr/pkg/recipe/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/server/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/performance/nccl_all_reduce_bw_constraint.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny marked this pull request as ready for review May 27, 2026 15:32
CodeRabbit flagged pkg/recipe/doc.go line 178 (query-parameter list) as
inconsistent with line 70 (criteria types list). Audited the rest of the
package godoc surface and updated four other locations that still
enumerated only the pre-BCM service list.
@mchmarny mchmarny marked this pull request as draft May 27, 2026 15:41
@mchmarny mchmarny marked this pull request as ready for review May 27, 2026 17:24
@mchmarny mchmarny enabled auto-merge (squash) May 27, 2026 17:25
@mchmarny mchmarny disabled auto-merge May 27, 2026 17:26
@mchmarny mchmarny merged commit 735c233 into main May 27, 2026
119 checks passed
@mchmarny mchmarny deleted the feat/bcm-service-type branch May 27, 2026 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add bcm as a service type with recipe overlays

2 participants