feat(health): add tools/health generator and recipe-health matrix#1304
Conversation
Add a hermetic, offline recipe-health generator paralleling tools/bom (minus the live Helm render). tools/health calls pkg/health.Compute() and renders a Markdown matrix — per-recipe rolled-up status, a declared- coverage column (R:n D:n P:n C:n), and a literal `pending` Evidence column — with -deterministic/-no-title flags for committable output. Publication mirrors the BOM precedent: - make recipe-health-docs splices the matrix into the marked region of docs/user/recipe-health.md; make recipe-health-check is an advisory staleness check (not wired into make qualify). Named recipe-health-* to avoid the existing check-health/component-health chainsaw family. - docs/user/recipe-health.md: hand-written header (layer-1 chart_pinned and uniform pending-Evidence notes) plus the spliced region. - Registered in docs/index.yml, cross-linked from component-catalog.md, owned via CODEOWNERS, and excluded from the MDX check (splice markers). Tests cover render content, byte-stable determinism, the marker-presence guard, and a compute-budget gate over the full catalog (ADR-009 §2). The matrix body is intentionally deferred behind an AICR-HEALTH-DEFERRED sentinel: pkg/health's chart_pinned signal currently over-reports fail for manifest-only Helm components (no external chart to pin), so the published matrix would mislead. The tooling is fully functional; recipe-health-check skips while the sentinel is present. Implements NVIDIA#1229. Design: ADR-009.
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis pull request introduces a recipe-health documentation tool and integrates it into the project. It adds a new Go CLI (tools/health) that computes structural health for resolvable recipes using health.Compute and renders the results into a Markdown matrix. The implementation includes comprehensive unit and integration tests. The Makefile is extended with make recipe-health-docs to regenerate the documentation and make recipe-health-check to verify it remains current. A new user documentation page (docs/user/recipe-health.md) describes the health matrix columns and intentionally defers publication of the generated matrix body. Supporting changes include updates to the ADR, documentation navigation, linting exclusions, and project ownership/ignore rules. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/health/main_test.go`:
- Around line 169-181: TestCoverageCell currently contains two inline
assertions; refactor it into a table-driven test to match the style of
TestDimCell and make adding cases easier. Replace the body of TestCoverageCell
with a slice of test cases (each with name, input *health.DeclaredCoverage,
expected string) and iterate using t.Run for each case, calling coverageCell and
comparing results; keep references to the existing coverageCell function and
health.DeclaredCoverage/health.PhaseCoverage types so the same inputs ("nil" and
the sample cov with Readiness/Deployment/Conformance checks) are used as test
cases.
In `@tools/health/main.go`:
- Around line 88-90: The current code double-wraps errors from renderMatrix by
wrapping mdErr again with errors.Wrap(errors.ErrCodeInternal, "render
markdown"); instead, detect if mdErr != nil and return mdErr directly to
preserve the original structured pkg/errors context—update the error handling in
the block that checks mdErr (the variable returned from renderMatrix) to return
mdErr rather than wrapping it, leaving other error paths unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: db6fee11-770a-432e-a132-1de1fd5d3cbd
📒 Files selected for processing (11)
.github/CODEOWNERS.gitignoreMakefiledocs/design/009-recipe-health-tracking.mddocs/index.ymldocs/user/component-catalog.mddocs/user/recipe-health.mdtools/check-docs-mdxtools/health/main.gotools/health/main_test.gotools/health/markdown.go
mchmarny
left a comment
There was a problem hiding this comment.
Hermetic, offline tools/health generator + BOM-style splice targets + public matrix doc, faithfully mirroring the tools/bom precedent minus the network render. Determinism and error handling are solid and well-covered: run checks both render and Close errors with the right priority, the stickyWriter collects the first write failure, and byte-stability is tested directly. Deferring the matrix body behind a documented AICR-HEALTH-DEFERRED sentinel — pending the chart_pinned false-positive fix — is the right call to avoid publishing a misleading surface, and the recipe-health-check skip handles it honestly. The check-docs-mdx array-exclude conversion is correct. One non-blocking nit inline on duplicated coverage formatting. Approving.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/009-recipe-health-tracking.md`:
- Around line 208-214: Split the long sentence explaining the compute budget
into multiple short declarative sentences: state that constraints_wellformed is
treated as parse-only, then explain that it uses the constraint-aware path with
a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers
run over the merged constraints, and finally call out that the dominant cost is
the per-combo resolve across the ~50 leaf combos rather than constraint
evaluation (mention the --no-cluster replay phrasing is unachievable offline
only as a brief parenthetical). Use the symbols constraints_wellformed,
satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these
concise sentences to keep references clear.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24
📒 Files selected for processing (3)
docs/design/009-recipe-health-tracking.mdtools/health/main.gotools/health/main_test.go
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/009-recipe-health-tracking.md`:
- Around line 208-214: Split the long sentence explaining the compute budget
into multiple short declarative sentences: state that constraints_wellformed is
treated as parse-only, then explain that it uses the constraint-aware path with
a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers
run over the merged constraints, and finally call out that the dominant cost is
the per-combo resolve across the ~50 leaf combos rather than constraint
evaluation (mention the --no-cluster replay phrasing is unachievable offline
only as a brief parenthetical). Use the symbols constraints_wellformed,
satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these
concise sentences to keep references clear.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c4d2b08d-beb7-4c59-bdfc-cd0d9db4ad24
📒 Files selected for processing (3)
docs/design/009-recipe-health-tracking.mdtools/health/main.gotools/health/main_test.go
🛑 Comments failed to post (1)
docs/design/009-recipe-health-tracking.md (1)
208-214: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Break the run-on sentence into shorter declarative statements.
Lines 208-214 form a single 70+ word sentence that packs the approach, rationale, evaluator behavior, parser location, and cost model into one complex unit. This violates the "Use short, declarative sentences in documentation" guideline and makes the explanation harder to follow, even in technical ADR documentation.
✨ Suggested refactoring for clarity
-**Compute budget.** `constraints_wellformed` is specified as parse-only rather -than a `--no-cluster` constraint replay: each leaf is resolved through the -constraint-aware path with a satisfied-stub evaluator (no snapshot) and the -snapshot-free parsers run over the merged constraints (the original -`--no-cluster` replay phrasing is unachievable offline, since the evaluator -extracts a snapshot value before parsing). The dominant cost is therefore the +**Compute budget.** `constraints_wellformed` is specified as parse-only rather +than a `--no-cluster` constraint replay. Each leaf is resolved through the +constraint-aware path with a satisfied-stub evaluator (no snapshot), and the +snapshot-free parsers run over the merged constraints. The original +`--no-cluster` replay phrasing is unachievable offline because the evaluator +extracts a snapshot value before parsing. The dominant cost is therefore the per-combo resolve across the ~50 leaf combos, not constraint evaluation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/009-recipe-health-tracking.md` around lines 208 - 214, Split the long sentence explaining the compute budget into multiple short declarative sentences: state that constraints_wellformed is treated as parse-only, then explain that it uses the constraint-aware path with a satisfied-stub evaluator (no snapshot), next note that snapshot-free parsers run over the merged constraints, and finally call out that the dominant cost is the per-combo resolve across the ~50 leaf combos rather than constraint evaluation (mention the --no-cluster replay phrasing is unachievable offline only as a brief parenthetical). Use the symbols constraints_wellformed, satisfied-stub evaluator, snapshot-free parsers, and --no-cluster in these concise sentences to keep references clear.Source: Coding guidelines
Summary
Add a hermetic, offline recipe-health generator (
tools/health) plus the publication machinery (make targets, public doc, nav/ownership) that surfaces per-recipe structural health across the whole criteria matrix — mirroring the BOM generate-and-PR precedent, but with no live Helm render. The matrix body is intentionally deferred (see Implementation Notes); the tooling itself is complete and functional.Motivation / Context
ADR-009 (Recipe Health Tracking, V1) calls for a hermetic generator, BOM-style make targets that splice into a marked region, and a public matrix doc. This PR delivers the
tools/healthgenerator + targets + doc; the validation logic it consumes (pkg/health.Compute) already landed via #1226/#1225.Fixes: N/A — implements the #1229 tooling; public-matrix publication is deferred (see Implementation Notes), so I left #1229 open for the maintainer to close after the matrix publishes.
Related: #1229, #1224 (epic), #1226/#1227 (deps), #1230 (blocked: weekly refresh)
Type of Change
Component(s) Affected
docs/,examples/)tools/healthgenerator +Makefiletargets +tools/check-docs-mdxImplementation Notes
tools/health) parallelstools/bomminus the network render: it callspkg/health.Compute()against the embedded catalog (Provider: nil) and renders a Markdown matrix — per-recipe rolled-up status, a declared-coverage column (R:n D:n P:n C:n), and a literalpendingEvidence column. Flags-deterministic/-no-titlemake the output byte-stable and embeddable.bom-docs/bom-check:make recipe-health-docssplices the body between<!-- BEGIN/END AICR-HEALTH -->;make recipe-health-checkis advisory (not wired intomake qualify/lint/merge gate). Namedrecipe-health-*, not the ADR's originalhealth-*, to avoid thecheck-health/check-health-all/component-healthchainsaw family (homophone footgun); ADR-009 §5 updated to match.AICR-HEALTH-DEFERREDsentinel.pkg/health'schart_pinnedsignal currently over-reportsfailfor manifest-only Helm components (declaredtype: Helmbut shipping local manifests with no external chart to pin), which would make 17/32 recipes showfailand render a misleading public matrix. The generator/targets are fully functional;recipe-health-checkskips while the sentinel is present. Publishing is a follow-up gated on fixingclassifyChartPinned(apkg/healthchange, CODEOWNERS-pinned to the pkg/health: core Compute() loop, resolves signal, rollup, determinism #1225/pkg/health: chart_pinned + declared_coverage signals #1226 owners).Status(+ coverage), never per-dimension keys, andpkg/health.rollup()is generic over the dimension map — soconstraints_wellformedfolds in with zero code change, just a doc regen.docs/user/recipe-health.mdexcluded from the MDX check (carries splice-marker HTML comments, exactly likecontainer-images.md).Testing
Note: I ran
make lint+ race/coverage ontools/health+ the target/splice behavior, not the fullmake qualify(e2e/scan) — CI covers those. New package coverage 72.1% (above the 70% floor); no pre-existing package's coverage changed (no edits outsidetools/health+ docs/build files).Risk Assessment
Rollout notes: No migration. The public matrix stays a placeholder until a follow-up fixes the
chart_pinnedfalse-positive and runsmake recipe-health-docs.Checklist
make testwith-race)make lint)git commit -S)