Skip to content

feat(health): add tools/health generator and recipe-health matrix#1304

Merged
mchmarny merged 4 commits into
NVIDIA:mainfrom
njhensley:feat/recipe-health-matrix
Jun 11, 2026
Merged

feat(health): add tools/health generator and recipe-health matrix#1304
mchmarny merged 4 commits into
NVIDIA:mainfrom
njhensley:feat/recipe-health-matrix

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Add a hermetic, offline recipe-health generator (tools/health) plus the publication machinery (make targets, public doc, nav/ownership) that surfaces per-recipe structural health across the whole criteria matrix — mirroring the BOM generate-and-PR precedent, but with no live Helm render. The matrix body is intentionally deferred (see Implementation Notes); the tooling itself is complete and functional.

Motivation / Context

ADR-009 (Recipe Health Tracking, V1) calls for a hermetic generator, BOM-style make targets that splice into a marked region, and a public matrix doc. This PR delivers the tools/health generator + targets + doc; the validation logic it consumes (pkg/health.Compute) already landed via #1226/#1225.

Fixes: N/A — implements the #1229 tooling; public-matrix publication is deferred (see Implementation Notes), so I left #1229 open for the maintainer to close after the matrix publishes.
Related: #1229, #1224 (epic), #1226/#1227 (deps), #1230 (blocked: weekly refresh)

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Build/CI/tooling

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: tools/health generator + Makefile targets + tools/check-docs-mdx

Implementation Notes

  • Generator (tools/health) parallels tools/bom minus the network render: it calls pkg/health.Compute() against the embedded catalog (Provider: nil) and renders a Markdown matrix — per-recipe rolled-up status, a declared-coverage column (R:n D:n P:n C:n), and a literal pending Evidence column. Flags -deterministic/-no-title make the output byte-stable and embeddable.
  • Targets clone bom-docs/bom-check: make recipe-health-docs splices the body between <!-- BEGIN/END AICR-HEALTH -->; make recipe-health-check is advisory (not wired into make qualify/lint/merge gate). Named recipe-health-*, not the ADR's original health-*, to avoid the check-health/check-health-all/component-health chainsaw family (homophone footgun); ADR-009 §5 updated to match.
  • Matrix publication is intentionally deferred behind an AICR-HEALTH-DEFERRED sentinel. pkg/health's chart_pinned signal currently over-reports fail for manifest-only Helm components (declared type: Helm but shipping local manifests with no external chart to pin), which would make 17/32 recipes show fail and render a misleading public matrix. The generator/targets are fully functional; recipe-health-check skips while the sentinel is present. Publishing is a follow-up gated on fixing classifyChartPinned (a pkg/health change, CODEOWNERS-pinned to the pkg/health: core Compute() loop, resolves signal, rollup, determinism #1225/pkg/health: chart_pinned + declared_coverage signals #1226 owners).
  • pkg/health: constraints_wellformed signal (parse-only, hermetic) #1227 forward-compat: the renderer reads only the rolled-up Status (+ coverage), never per-dimension keys, and pkg/health.rollup() is generic over the dimension map — so constraints_wellformed folds in with zero code change, just a doc regen.
  • docs/user/recipe-health.md excluded from the MDX check (carries splice-marker HTML comments, exactly like container-images.md).

Testing

make lint                                   # golangci-lint 0 issues, MDX, license, agents-sync, bom-pinning
go test -race ./tools/health/...            # pass (CGO race)
go test -coverprofile ./tools/health/...    # 72.1%
make recipe-health-docs                     # splices the 32-row matrix cleanly (verified, then reverted to placeholder)
make recipe-health-check                    # skips honestly while deferral sentinel present (exit 0)

Note: I ran make lint + race/coverage on tools/health + the target/splice behavior, not the full make qualify (e2e/scan) — CI covers those. New package coverage 72.1% (above the 70% floor); no pre-existing package's coverage changed (no edits outside tools/health + docs/build files).

Risk Assessment

  • Low — Isolated, additive tooling; no CLI/server/runtime path touched. The advisory check is unwired from any gate, and the matrix body is a placeholder, so nothing here can turn CI red or affect deployments. Trivially revertible.

Rollout notes: No migration. The public matrix stays a placeholder until a follow-up fixes the chart_pinned false-positive and runs make recipe-health-docs.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Add a hermetic, offline recipe-health generator paralleling tools/bom
(minus the live Helm render). tools/health calls pkg/health.Compute()
and renders a Markdown matrix — per-recipe rolled-up status, a declared-
coverage column (R:n D:n P:n C:n), and a literal `pending` Evidence
column — with -deterministic/-no-title flags for committable output.

Publication mirrors the BOM precedent:
- make recipe-health-docs splices the matrix into the marked region of
  docs/user/recipe-health.md; make recipe-health-check is an advisory
  staleness check (not wired into make qualify). Named recipe-health-*
  to avoid the existing check-health/component-health chainsaw family.
- docs/user/recipe-health.md: hand-written header (layer-1 chart_pinned
  and uniform pending-Evidence notes) plus the spliced region.
- Registered in docs/index.yml, cross-linked from component-catalog.md,
  owned via CODEOWNERS, and excluded from the MDX check (splice
markers).

Tests cover render content, byte-stable determinism, the marker-presence
guard, and a compute-budget gate over the full catalog (ADR-009 §2).

The matrix body is intentionally deferred behind an AICR-HEALTH-DEFERRED
sentinel: pkg/health's chart_pinned signal currently over-reports fail
for manifest-only Helm components (no external chart to pin), so the
published matrix would mislead. The tooling is fully functional;
recipe-health-check skips while the sentinel is present.

Implements NVIDIA#1229. Design: ADR-009.
@njhensley njhensley requested review from a team as code owners June 11, 2026 00:30
@njhensley njhensley added theme/ci-dx CI pipelines, developer experience, and build tooling theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification labels Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24

📥 Commits

Reviewing files that changed from the base of the PR and between 2a5b5a8 and dc9c081.

📒 Files selected for processing (3)
  • docs/design/009-recipe-health-tracking.md
  • tools/health/main.go
  • tools/health/main_test.go

📝 Walkthrough

Walkthrough

This pull request introduces a recipe-health documentation tool and integrates it into the project. It adds a new Go CLI (tools/health) that computes structural health for resolvable recipes using health.Compute and renders the results into a Markdown matrix. The implementation includes comprehensive unit and integration tests. The Makefile is extended with make recipe-health-docs to regenerate the documentation and make recipe-health-check to verify it remains current. A new user documentation page (docs/user/recipe-health.md) describes the health matrix columns and intentionally defers publication of the generated matrix body. Supporting changes include updates to the ADR, documentation navigation, linting exclusions, and project ownership/ignore rules.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#1291: Provides core pkg/health compute and rollup logic that tools/health calls to produce the matrix.
  • NVIDIA/aicr#1293: Changes around chart_pinned and optional coverage align with this PR's renderer handling of nil/optional coverage.
  • NVIDIA/aicr#1301: Modifies health.Compute dimensions (e.g., constraints_wellformed) which affect the generator's rendered output.

Suggested labels

area/recipes, size/L, theme/recipes

Suggested reviewers

  • mchmarny
  • yuanchen8911
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a tools/health generator and recipe-health matrix documentation.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the motivation, implementation details, testing, and risk assessment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health/main_test.go`:
- Around line 169-181: TestCoverageCell currently contains two inline
assertions; refactor it into a table-driven test to match the style of
TestDimCell and make adding cases easier. Replace the body of TestCoverageCell
with a slice of test cases (each with name, input *health.DeclaredCoverage,
expected string) and iterate using t.Run for each case, calling coverageCell and
comparing results; keep references to the existing coverageCell function and
health.DeclaredCoverage/health.PhaseCoverage types so the same inputs ("nil" and
the sample cov with Readiness/Deployment/Conformance checks) are used as test
cases.

In `@tools/health/main.go`:
- Around line 88-90: The current code double-wraps errors from renderMatrix by
wrapping mdErr again with errors.Wrap(errors.ErrCodeInternal, "render
markdown"); instead, detect if mdErr != nil and return mdErr directly to
preserve the original structured pkg/errors context—update the error handling in
the block that checks mdErr (the variable returned from renderMatrix) to return
mdErr rather than wrapping it, leaving other error paths unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: db6fee11-770a-432e-a132-1de1fd5d3cbd

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14530 and 2a5b5a8.

📒 Files selected for processing (11)
  • .github/CODEOWNERS
  • .gitignore
  • Makefile
  • docs/design/009-recipe-health-tracking.md
  • docs/index.yml
  • docs/user/component-catalog.md
  • docs/user/recipe-health.md
  • tools/check-docs-mdx
  • tools/health/main.go
  • tools/health/main_test.go
  • tools/health/markdown.go

Comment thread tools/health/main_test.go
Comment thread tools/health/main.go
mchmarny
mchmarny previously approved these changes Jun 11, 2026

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hermetic, offline tools/health generator + BOM-style splice targets + public matrix doc, faithfully mirroring the tools/bom precedent minus the network render. Determinism and error handling are solid and well-covered: run checks both render and Close errors with the right priority, the stickyWriter collects the first write failure, and byte-stability is tested directly. Deferring the matrix body behind a documented AICR-HEALTH-DEFERRED sentinel — pending the chart_pinned false-positive fix — is the right call to avoid publishing a misleading surface, and the recipe-health-check skip handles it honestly. The check-docs-mdx array-exclude conversion is correct. One non-blocking nit inline on duplicated coverage formatting. Approving.

Comment thread tools/health/markdown.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/009-recipe-health-tracking.md`:
- Around line 208-214: Split the long sentence explaining the compute budget
into multiple short declarative sentences: state that constraints_wellformed is
treated as parse-only, then explain that it uses the constraint-aware path with
a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers
run over the merged constraints, and finally call out that the dominant cost is
the per-combo resolve across the ~50 leaf combos rather than constraint
evaluation (mention the --no-cluster replay phrasing is unachievable offline
only as a brief parenthetical). Use the symbols constraints_wellformed,
satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these
concise sentences to keep references clear.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24

📥 Commits

Reviewing files that changed from the base of the PR and between 2a5b5a8 and dc9c081.

📒 Files selected for processing (3)
  • docs/design/009-recipe-health-tracking.md
  • tools/health/main.go
  • tools/health/main_test.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/009-recipe-health-tracking.md`:
- Around line 208-214: Split the long sentence explaining the compute budget
into multiple short declarative sentences: state that constraints_wellformed is
treated as parse-only, then explain that it uses the constraint-aware path with
a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers
run over the merged constraints, and finally call out that the dominant cost is
the per-combo resolve across the ~50 leaf combos rather than constraint
evaluation (mention the --no-cluster replay phrasing is unachievable offline
only as a brief parenthetical). Use the symbols constraints_wellformed,
satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these
concise sentences to keep references clear.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24

📥 Commits

Reviewing files that changed from the base of the PR and between 2a5b5a8 and dc9c081.

📒 Files selected for processing (3)
  • docs/design/009-recipe-health-tracking.md
  • tools/health/main.go
  • tools/health/main_test.go
🛑 Comments failed to post (1)
docs/design/009-recipe-health-tracking.md (1)

208-214: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Break the run-on sentence into shorter declarative statements.

Lines 208-214 form a single 70+ word sentence that packs the approach, rationale, evaluator behavior, parser location, and cost model into one complex unit. This violates the "Use short, declarative sentences in documentation" guideline and makes the explanation harder to follow, even in technical ADR documentation.

✨ Suggested refactoring for clarity
-**Compute budget.** `constraints_wellformed` is specified as parse-only rather
-than a `--no-cluster` constraint replay: each leaf is resolved through the
-constraint-aware path with a satisfied-stub evaluator (no snapshot) and the
-snapshot-free parsers run over the merged constraints (the original
-`--no-cluster` replay phrasing is unachievable offline, since the evaluator
-extracts a snapshot value before parsing). The dominant cost is therefore the
+**Compute budget.** `constraints_wellformed` is specified as parse-only rather
+than a `--no-cluster` constraint replay. Each leaf is resolved through the
+constraint-aware path with a satisfied-stub evaluator (no snapshot), and the
+snapshot-free parsers run over the merged constraints. The original
+`--no-cluster` replay phrasing is unachievable offline because the evaluator
+extracts a snapshot value before parsing. The dominant cost is therefore the
 per-combo resolve across the ~50 leaf combos, not constraint evaluation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/009-recipe-health-tracking.md` around lines 208 - 214, Split the
long sentence explaining the compute budget into multiple short declarative
sentences: state that constraints_wellformed is treated as parse-only, then
explain that it uses the constraint-aware path with a satisfied-stub evaluator
(no snapshot), next note that snapshot-free parsers run over the merged
constraints, and finally call out that the dominant cost is the per-combo
resolve across the ~50 leaf combos rather than constraint evaluation (mention
the --no-cluster replay phrasing is unachievable offline only as a brief
parenthetical). Use the symbols constraints_wellformed, satisfied-stub
evaluator, snapshot-free parsers, and --no-cluster in these concise sentences to
keep references clear.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci area/docs size/XL theme/ci-dx CI pipelines, developer experience, and build tooling theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants