feat(health): add tools/health generator and recipe-health matrix by njhensley · Pull Request #1304 · NVIDIA/aicr

njhensley · 2026-06-11T00:30:42Z

Summary

Add a hermetic, offline recipe-health generator (tools/health) plus the publication machinery (make targets, public doc, nav/ownership) that surfaces per-recipe structural health across the whole criteria matrix — mirroring the BOM generate-and-PR precedent, but with no live Helm render. The matrix body is intentionally deferred (see Implementation Notes); the tooling itself is complete and functional.

Motivation / Context

ADR-009 (Recipe Health Tracking, V1) calls for a hermetic generator, BOM-style make targets that splice into a marked region, and a public matrix doc. This PR delivers the tools/health generator + targets + doc; the validation logic it consumes (pkg/health.Compute) already landed via #1226/#1225.

Fixes: N/A — implements the #1229 tooling; public-matrix publication is deferred (see Implementation Notes), so I left #1229 open for the maintainer to close after the matrix publishes.
Related: #1229, #1224 (epic), #1226/#1227 (deps), #1230 (blocked: weekly refresh)

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update
Build/CI/tooling

Component(s) Affected

Docs/examples (docs/, examples/)
Other: tools/health generator + Makefile targets + tools/check-docs-mdx

Implementation Notes

Generator (tools/health) parallels tools/bom minus the network render: it calls pkg/health.Compute() against the embedded catalog (Provider: nil) and renders a Markdown matrix — per-recipe rolled-up status, a declared-coverage column (R:n D:n P:n C:n), and a literal pending Evidence column. Flags -deterministic/-no-title make the output byte-stable and embeddable.
Targets clone bom-docs/bom-check: make recipe-health-docs splices the body between ; make recipe-health-check is advisory (not wired into make qualify/lint/merge gate). Named recipe-health-*, not the ADR's original health-*, to avoid the check-health/check-health-all/component-health chainsaw family (homophone footgun); ADR-009 §5 updated to match.
Matrix publication is intentionally deferred behind an AICR-HEALTH-DEFERRED sentinel. pkg/health's chart_pinned signal currently over-reports fail for manifest-only Helm components (declared type: Helm but shipping local manifests with no external chart to pin), which would make 17/32 recipes show fail and render a misleading public matrix. The generator/targets are fully functional; recipe-health-check skips while the sentinel is present. Publishing is a follow-up gated on fixing classifyChartPinned (a pkg/health change, CODEOWNERS-pinned to the pkg/health: core Compute() loop, resolves signal, rollup, determinism #1225/pkg/health: chart_pinned + declared_coverage signals #1226 owners).
pkg/health: constraints_wellformed signal (parse-only, hermetic) #1227 forward-compat: the renderer reads only the rolled-up Status (+ coverage), never per-dimension keys, and pkg/health.rollup() is generic over the dimension map — so constraints_wellformed folds in with zero code change, just a doc regen.
docs/user/recipe-health.md excluded from the MDX check (carries splice-marker HTML comments, exactly like container-images.md).

Testing

make lint                                   # golangci-lint 0 issues, MDX, license, agents-sync, bom-pinning
go test -race ./tools/health/...            # pass (CGO race)
go test -coverprofile ./tools/health/...    # 72.1%
make recipe-health-docs                     # splices the 32-row matrix cleanly (verified, then reverted to placeholder)
make recipe-health-check                    # skips honestly while deferral sentinel present (exit 0)

Note: I ran make lint + race/coverage on tools/health + the target/splice behavior, not the full make qualify (e2e/scan) — CI covers those. New package coverage 72.1% (above the 70% floor); no pre-existing package's coverage changed (no edits outside tools/health + docs/build files).

Risk Assessment

Low — Isolated, additive tooling; no CLI/server/runtime path touched. The advisory check is unwired from any gate, and the matrix body is a placeholder, so nothing here can turn CI red or affect deployments. Trivially revertible.

Rollout notes: No migration. The public matrix stays a placeholder until a follow-up fixes the chart_pinned false-positive and runs make recipe-health-docs.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

Add a hermetic, offline recipe-health generator paralleling tools/bom (minus the live Helm render). tools/health calls pkg/health.Compute() and renders a Markdown matrix — per-recipe rolled-up status, a declared- coverage column (R:n D:n P:n C:n), and a literal `pending` Evidence column — with -deterministic/-no-title flags for committable output. Publication mirrors the BOM precedent: - make recipe-health-docs splices the matrix into the marked region of docs/user/recipe-health.md; make recipe-health-check is an advisory staleness check (not wired into make qualify). Named recipe-health-* to avoid the existing check-health/component-health chainsaw family. - docs/user/recipe-health.md: hand-written header (layer-1 chart_pinned and uniform pending-Evidence notes) plus the spliced region. - Registered in docs/index.yml, cross-linked from component-catalog.md, owned via CODEOWNERS, and excluded from the MDX check (splice markers). Tests cover render content, byte-stable determinism, the marker-presence guard, and a compute-budget gate over the full catalog (ADR-009 §2). The matrix body is intentionally deferred behind an AICR-HEALTH-DEFERRED sentinel: pkg/health's chart_pinned signal currently over-reports fail for manifest-only Helm components (no external chart to pin), so the published matrix would mislead. The tooling is fully functional; recipe-health-check skips while the sentinel is present. Implements NVIDIA#1229. Design: ADR-009.

coderabbitai · 2026-06-11T00:37:16Z

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24

📥 Commits

Reviewing files that changed from the base of the PR and between 2a5b5a8 and dc9c081.

📒 Files selected for processing (3)

docs/design/009-recipe-health-tracking.md
tools/health/main.go
tools/health/main_test.go

📝 Walkthrough

Walkthrough

This pull request introduces a recipe-health documentation tool and integrates it into the project. It adds a new Go CLI (tools/health) that computes structural health for resolvable recipes using health.Compute and renders the results into a Markdown matrix. The implementation includes comprehensive unit and integration tests. The Makefile is extended with make recipe-health-docs to regenerate the documentation and make recipe-health-check to verify it remains current. A new user documentation page (docs/user/recipe-health.md) describes the health matrix columns and intentionally defers publication of the generated matrix body. Supporting changes include updates to the ADR, documentation navigation, linting exclusions, and project ownership/ignore rules.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Epic: Recipe health tracking (ADR-009 V1) — enumeration, structural soundness, public matrix #1224: The PR implements the recipe health generator, Makefile targets, docs, CODEOWNERS and tests described by ADR-009.
tools/health generator + make targets + public recipe-health matrix #1229: The changes add the tools/health CLI, recipe-health doc targets, and ownership entries referenced in that issue.
aicr recipe list: add structural-health + coverage columns #1228: Both this PR and the issue interact with pkg/health.Compute; this PR adds a consumer (tools/health) of that compute output.

Possibly related PRs

NVIDIA/aicr#1291: Provides core pkg/health compute and rollup logic that tools/health calls to produce the matrix.
NVIDIA/aicr#1293: Changes around chart_pinned and optional coverage align with this PR's renderer handling of nil/optional coverage.
NVIDIA/aicr#1301: Modifies health.Compute dimensions (e.g., constraints_wellformed) which affect the generator's rendered output.

Suggested labels

area/recipes, size/L, theme/recipes

Suggested reviewers

mchmarny
yuanchen8911

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding a tools/health generator and recipe-health matrix documentation.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the motivation, implementation details, testing, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health/main_test.go`:
- Around line 169-181: TestCoverageCell currently contains two inline
assertions; refactor it into a table-driven test to match the style of
TestDimCell and make adding cases easier. Replace the body of TestCoverageCell
with a slice of test cases (each with name, input *health.DeclaredCoverage,
expected string) and iterate using t.Run for each case, calling coverageCell and
comparing results; keep references to the existing coverageCell function and
health.DeclaredCoverage/health.PhaseCoverage types so the same inputs ("nil" and
the sample cov with Readiness/Deployment/Conformance checks) are used as test
cases.

In `@tools/health/main.go`:
- Around line 88-90: The current code double-wraps errors from renderMatrix by
wrapping mdErr again with errors.Wrap(errors.ErrCodeInternal, "render
markdown"); instead, detect if mdErr != nil and return mdErr directly to
preserve the original structured pkg/errors context—update the error handling in
the block that checks mdErr (the variable returned from renderMatrix) to return
mdErr rather than wrapping it, leaving other error paths unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: db6fee11-770a-432e-a132-1de1fd5d3cbd

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14530 and 2a5b5a8.

📒 Files selected for processing (11)

.github/CODEOWNERS
.gitignore
Makefile
docs/design/009-recipe-health-tracking.md
docs/index.yml
docs/user/component-catalog.md
docs/user/recipe-health.md
tools/check-docs-mdx
tools/health/main.go
tools/health/main_test.go
tools/health/markdown.go

mchmarny

Hermetic, offline tools/health generator + BOM-style splice targets + public matrix doc, faithfully mirroring the tools/bom precedent minus the network render. Determinism and error handling are solid and well-covered: run checks both render and Close errors with the right priority, the stickyWriter collects the first write failure, and byte-stability is tested directly. Deferring the matrix body behind a documented AICR-HEALTH-DEFERRED sentinel — pending the chart_pinned false-positive fix — is the right call to avoid publishing a misleading surface, and the recipe-health-check skip handles it honestly. The check-docs-mdx array-exclude conversion is correct. One non-blocking nit inline on duplicated coverage formatting. Approving.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/009-recipe-health-tracking.md`:
- Around line 208-214: Split the long sentence explaining the compute budget
into multiple short declarative sentences: state that constraints_wellformed is
treated as parse-only, then explain that it uses the constraint-aware path with
a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers
run over the merged constraints, and finally call out that the dominant cost is
the per-combo resolve across the ~50 leaf combos rather than constraint
evaluation (mention the --no-cluster replay phrasing is unachievable offline
only as a brief parenthetical). Use the symbols constraints_wellformed,
satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these
concise sentences to keep references clear.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24

📥 Commits

Reviewing files that changed from the base of the PR and between 2a5b5a8 and dc9c081.

📒 Files selected for processing (3)

docs/design/009-recipe-health-tracking.md
tools/health/main.go
tools/health/main_test.go

coderabbitai

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/009-recipe-health-tracking.md`:
- Around line 208-214: Split the long sentence explaining the compute budget
into multiple short declarative sentences: state that constraints_wellformed is
treated as parse-only, then explain that it uses the constraint-aware path with
a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers
run over the merged constraints, and finally call out that the dominant cost is
the per-combo resolve across the ~50 leaf combos rather than constraint
evaluation (mention the --no-cluster replay phrasing is unachievable offline
only as a brief parenthetical). Use the symbols constraints_wellformed,
satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these
concise sentences to keep references clear.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24

📥 Commits

Reviewing files that changed from the base of the PR and between 2a5b5a8 and dc9c081.

📒 Files selected for processing (3)

docs/design/009-recipe-health-tracking.md
tools/health/main.go
tools/health/main_test.go

🛑 Comments failed to post (1)

docs/design/009-recipe-health-tracking.md (1)

208-214: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Break the run-on sentence into shorter declarative statements.

Lines 208-214 form a single 70+ word sentence that packs the approach, rationale, evaluator behavior, parser location, and cost model into one complex unit. This violates the "Use short, declarative sentences in documentation" guideline and makes the explanation harder to follow, even in technical ADR documentation.

✨ Suggested refactoring for clarity

-**Compute budget.** `constraints_wellformed` is specified as parse-only rather
-than a `--no-cluster` constraint replay: each leaf is resolved through the
-constraint-aware path with a satisfied-stub evaluator (no snapshot) and the
-snapshot-free parsers run over the merged constraints (the original
-`--no-cluster` replay phrasing is unachievable offline, since the evaluator
-extracts a snapshot value before parsing). The dominant cost is therefore the
+**Compute budget.** `constraints_wellformed` is specified as parse-only rather
+than a `--no-cluster` constraint replay. Each leaf is resolved through the
+constraint-aware path with a satisfied-stub evaluator (no snapshot), and the
+snapshot-free parsers run over the merged constraints. The original
+`--no-cluster` replay phrasing is unachievable offline because the evaluator
+extracts a snapshot value before parsing. The dominant cost is therefore the
 per-combo resolve across the ~50 leaf combos, not constraint evaluation.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/009-recipe-health-tracking.md` around lines 208 - 214, Split the
long sentence explaining the compute budget into multiple short declarative
sentences: state that constraints_wellformed is treated as parse-only, then
explain that it uses the constraint-aware path with a satisfied-stub evaluator
(no snapshot), next note that snapshot-free parsers run over the merged
constraints, and finally call out that the dominant cost is the per-combo
resolve across the ~50 leaf combos rather than constraint evaluation (mention
the --no-cluster replay phrasing is unachievable offline only as a brief
parenthetical). Use the symbols constraints_wellformed, satisfied-stub
evaluator, snapshot-free parsers, and --no-cluster in these concise sentences to
keep references clear.

Source: Coding guidelines

njhensley requested review from a team as code owners June 11, 2026 00:30

njhensley added theme/ci-dx CI pipelines, developer experience, and build tooling theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification labels Jun 11, 2026

github-actions Bot added area/ci area/docs size/XL labels Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread tools/health/main_test.go

Comment thread tools/health/main.go

mchmarny assigned njhensley Jun 11, 2026

mchmarny previously approved these changes Jun 11, 2026

View reviewed changes

Comment thread tools/health/markdown.go

njhensley mentioned this pull request Jun 11, 2026

tools/health generator + make targets + public recipe-health matrix #1229

Closed

fix: address CR feedback for error wrapping and test tables

d609c55

njhensley dismissed mchmarny’s stale review via d609c55 June 11, 2026 14:38

Merge branch 'main' into feat/recipe-health-matrix

dc9c081

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Merge branch 'main' into feat/recipe-health-matrix

579fee5

njhensley mentioned this pull request Jun 11, 2026

Consolidate declared-coverage 'R:n D:n P:n C:n' formatter into a shared pkg/health helper #1312

Closed

mchmarny approved these changes Jun 11, 2026

View reviewed changes

mchmarny merged commit e1b0160 into NVIDIA:main Jun 11, 2026
32 checks passed

njhensley mentioned this pull request Jun 11, 2026

refactor(health): share declared-coverage Compact() across matrix and CLI #1313

Merged

12 tasks

njhensley deleted the feat/recipe-health-matrix branch June 23, 2026 16:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(health): add tools/health generator and recipe-health matrix#1304

feat(health): add tools/health generator and recipe-health matrix#1304
mchmarny merged 4 commits into
NVIDIA:mainfrom
njhensley:feat/recipe-health-matrix

njhensley commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

mchmarny left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

njhensley commented Jun 11, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading