feat(recipe): add bcm service type with overlays#1060
Conversation
Adds NVIDIA Base Command Manager (BCM) as a first-class service value alongside eks/gke/aks/oke/kind/lke. BCM 11.x provisions only the base Kubernetes platform, CNI, DNS, and MetalLB; the NVIDIA runtime stack is delivered by AICR via base.yaml and applied with the helmfile deployer. - Adds CriteriaServiceBCM enum + parse/getter wiring and tests - Adds bcm service overlay with K8s 1.34+ constraint, master toleration, and local-path StorageClass for Prometheus persistence - Adds bcm-training, bcm-inference, h100-bcm-training, and h100-bcm-ubuntu-training overlays following the EKS pattern - Updates OpenAPI enum blocks, CLI/API/contributor docs, glossary, issue templates, and the nccl exhaustive-switch validator Closes: #1059
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR adds NVIDIA Base Command Manager (bcm) as a first-class service type. It introduces the CriteriaServiceBCM constant, updates parsing and enumeration, extends OpenAPI enums, updates CLI/help and user/contributor docs and issue templates, adds a bcm recipe overlay tree (base, training, inference, h100 variants), and tweaks the validator and a container-image digest reference. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/recipe/doc.go`:
- Around line 70-71: The package docs introduce a new service value 'bcm'
(CriteriaServiceBCM) but the query-parameter accepted-values list later in the
same doc still omits it; update the query-parameter list of accepted service
values (the docs referencing CriteriaServiceAny and the service enum/list) to
include 'bcm' alongside the existing entries so the two sections match and the
docs present consistent accepted values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: f0701d3b-b14f-4579-a470-da38a9688fd8
📒 Files selected for processing (22)
.github/ISSUE_TEMPLATE/bug_report.ymlapi/aicr/v1/server.yamldocs/README.mddocs/contributor/api-server-extending.mddocs/contributor/api-server.mddocs/contributor/cli.mddocs/contributor/data.mddocs/contributor/validations.mddocs/user/api-reference.mddocs/user/cli-reference.mddocs/user/container-images.mdpkg/cli/recipe.gopkg/recipe/criteria.gopkg/recipe/criteria_registry_parse_test.gopkg/recipe/criteria_test.gopkg/recipe/doc.gorecipes/overlays/bcm-inference.yamlrecipes/overlays/bcm-training.yamlrecipes/overlays/bcm.yamlrecipes/overlays/h100-bcm-training.yamlrecipes/overlays/h100-bcm-ubuntu-training.yamlvalidators/performance/nccl_all_reduce_bw_constraint.go
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
CodeRabbit flagged pkg/recipe/doc.go line 178 (query-parameter list) as inconsistent with line 70 (criteria types list). Audited the rest of the package godoc surface and updated four other locations that still enumerated only the pre-BCM service list.
Summary
Adds NVIDIA Base Command Manager (BCM) as a first-class
servicevalue with abcmoverlay tree (training + inference, plus an H100 + Ubuntu leaf) so operators can generate validated Helmfile bundles for BCM-managed clusters.Motivation / Context
BCM 11.x (
cm-kubernetes-setup) provisions only the base Kubernetes platform — control plane, CNI, DNS, MetalLB. The NVIDIA runtime stack (GPU Operator, NFD, cert-manager, Prometheus, nvsentinel, DRA driver, etc.) is delivered by AICR viabase.yamland applied with thehelmfiledeployer. Argo CD support is a planned follow-up.Fixes: #1059
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server) — OpenAPI enumpkg/recipe)pkg/validator) — exhaustive switch invalidators/performancedocs/,examples/)Implementation Notes
CriteriaServiceBCM = "bcm"added with parser case, alphabetical insertion intoGetCriteriaServiceTypes, and matching test coverage (TestParseCriteriaServiceType,TestGetCriteriaServiceTypes,TestAllCriteriaServiceTypes_UnionWithRegistry).recipes/overlays/bcm.yamlinheritsbase, setsK8s.server.version >= 1.34, persistskube-prometheus-stackon thelocal-pathStorageClass (BCM default), and addsnode-role.kubernetes.io/mastertolerations onnvidia-dra-driver-gpuso DRA workloads schedule on BCM masters in addition to upstreamcontrol-plane.bcm-training,bcm-inference) and leaf (h100-bcm-training,h100-bcm-ubuntu-training) overlays follow the EKS pattern;os-ubuntumixin is reused.validators/performance/nccl_all_reduce_bw_constraint.go— addedCriteriaServiceBCMto the existingcasethat returnsnil, nil(BCM has no cloud-specific worker scheduling); required by theexhaustivelint rule.CLAUDE.mdenum-audit checklist:docs/README.mdglossary,docs/user/{cli,api}-reference.md,docs/contributor/{api-server,api-server-extending,cli,data,validations}.md, OpenAPI enum blocks (3 sites), bug-report issue template, package godoc, and theaicr recipeCLI help.docs/user/container-images.mdregenerated viamake bom-docs; picked up upstreambusybox:1.37digest drift in the network-operator chart (unrelated to BCM but caught by the regen — committed in the same PR perCLAUDE.md).Testing
Result: passes — total coverage 77.1% (threshold 75%), all chainsaw E2E tests pass (22/22), vulnerability scan clean, license headers OK.
Per-package coverage on changed packages (vs
origin/mainbaseline):pkg/recipe: 91.5% (no change)pkg/cli: 65.2% (no change)validators/performance: 28.0% (no change — added single case label only)Risk Assessment
Rollout notes: None. Backwards compatible.
Checklist
make testwith-race)make lint)git commit -S)