Skip to content

feat(health): pkg/health core Compute loop, resolves signal, rollup#1291

Merged
mchmarny merged 1 commit into
mainfrom
feat/pkg-health-core
Jun 10, 2026
Merged

feat(health): pkg/health core Compute loop, resolves signal, rollup#1291
mchmarny merged 1 commit into
mainfrom
feat/pkg-health-core

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

Adds the standalone pkg/health package per ADR-009 V1: the core enumeration→resolution loop (Compute), the graded resolves signal, a generic per-recipe status rollup with unknown-held semantics, and byte-deterministic output.

Motivation / Context

First building block of the Recipe health tracking epic (ADR-009 V1). pkg/health is standalone (not in pkg/recipe) so the future v1.1 evidence axis stays acyclic — pkg/evidence/attestation already imports pkg/recipe, so hosting health there would later close a recipe → evidence → recipe cycle. This PR ships only the package core and the resolves dimension; the read-only signals and the constraint signal land on top of it.

Fixes: #1225
Related: #1224 (epic), #1226, #1227

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — new sibling package pkg/health
  • Build/CI/tooling — CODEOWNERS + labeler

Implementation Notes

  • Compute enumerates leaf overlays via MetadataStore.ListCatalog(filter) (filtered to IsLeaf) and resolves each through a single shared recipe.NewBuilder(...) so the owner-stamp invariant holds across combos. Combos are sorted by criteria string (tie-broken by leaf name) for determinism.
  • resolves grading: resolves → pass; non-transient error → fail (message in Detail); ErrCodeTimeout/ErrCodeInternalunknown, held out of the fail/warn rollup (re-run rather than penalize).
  • Rollup iterates the Dimensions map generically (precedence fail > warn > unknown > pass), so pkg/health: chart_pinned + declared_coverage signals #1226/pkg/health: constraints_wellformed signal (parse-only, hermetic) #1227 add dimensions without touching rollup code. A held unknown surfaces as the recipe status when nothing fails/warns — never silently as pass.
  • The resolves-only core leaves DeclaredCoverage at its zero value; the descriptor types are defined now so later signals slot in without a schema change.
  • No time.Now() on the computed path. The one non-deterministic input (unknown-from-real-timeout) is not exercised offline; the determinism test asserts byte-equality on the non-error embedded-catalog path via serializer.MarshalYAMLDeterministic.

Testing

golangci-lint run -c .golangci.yaml ./pkg/health/... ./pkg/defaults/...   # 0 issues
make test                                                                 # pass, total 76.5%
  • Unit: table-driven classifyResolve, isTransient, rollup (incl. unknown-held precedence).
  • Integration: embedded catalog → every leaf carries a resolves dimension with a status consistent with the rollup; byte-determinism across two Compute runs; non-matching filter → empty report; failing provider → surfaced load error.
  • pkg/health coverage: 95.5% (new package; floor is 75%).

Risk Assessment

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (N/A — no user-facing surface in this PR)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@mchmarny mchmarny requested review from a team as code owners June 10, 2026 16:46
@mchmarny mchmarny added the theme/recipes Recipe expansion, overlays, mixins, and component registry label Jun 10, 2026
@mchmarny mchmarny self-assigned this Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds the new pkg/health package implementing ADR-009 health reporting. It introduces the Compute() function that enumerates recipe catalog leaf overlays, builds each entry, classifies resolver errors into a resolves signal (pass/fail/unknown), applies dimension rollup with explicit precedence (fail > warn > unknown > pass), and returns a deterministically sorted report. The package includes data types for options, coverage tracking, per-combo health, and the overall report, along with comprehensive tests validating error classification, rollup semantics, embedded catalog correctness, deterministic serialization, and error handling. Configuration changes add ownership and labeler support.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • 1228 — Adds a pkg/health package and Compute API that health generators intend to call.
  • 1229 — Provides the exported Compute(ctx, Options) API used by tooling to populate health data.
  • 1224 — Parent epic implementing ADR-009; this PR delivers core health package infrastructure.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding the core Compute loop, resolves signal, and rollup logic to the new pkg/health package.
Description check ✅ Passed The description is detailed and directly related to the changeset, covering the Summary, Motivation, Implementation Notes, Testing, and Risk Assessment of the pkg/health package addition.
Linked Issues check ✅ Passed All coding requirements from issue #1225 are met: pkg/health package created with Compute loop, resolves signal, rollup with unknown-held semantics, deterministic output, CODEOWNERS/labeler updates, and comprehensive tests achieving 95.5% coverage.
Out of Scope Changes check ✅ Passed All changes are directly in scope: the new pkg/health package with core implementation, supporting configuration updates (CODEOWNERS, labeler, timeouts), and comprehensive tests addressing the issue #1225 objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/pkg-health-core

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.5%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.5%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/defaults 100.00% (ø)
github.com/NVIDIA/aicr/pkg/health 95.83% (+95.83%) 🌟

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/defaults/timeouts.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/health/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/health/health.go 95.83% (+95.83%) 48 (+48) 46 (+46) 2 (+2) 🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-documented foundation for ADR-009 — the generic rollup, the descriptor-vs-graded split, byte-deterministic output, and the standalone-package rationale (avoiding the future recipe → evidence → recipe import cycle) are all well-judged, and coverage is strong.

I have one correctness concern about the unknown classification being too wide (latent today — every current leaf resolves pass), one context-handling gap, and a test gap worth closing alongside the first. Details inline.

Comment thread pkg/health/health.go Outdated
Comment thread pkg/health/health.go
Comment thread pkg/health/health_test.go
Comment thread pkg/health/health.go
Comment thread pkg/health/doc.go
Comment thread .github/CODEOWNERS Outdated
Introduce the standalone pkg/health package per ADR-009 V1: the ADR-009
§4 types, a Compute() that enumerates leaf recipes via ListCatalog and
resolves each through a single shared recipe.Builder, the graded
resolves signal, a generic status rollup with unknown-held semantics,
and byte-deterministic output.

The package is standalone (not pkg/recipe) to keep the future v1.1
evidence import acyclic. Only the resolves dimension is graded here;
chart_pinned/declared_coverage (#1226) and constraints_wellformed
(#1227) layer onto the same generic rollup without changing it.

The held-unknown bucket holds only genuinely re-runnable errors (context
cancellation/deadline and ErrCodeTimeout). ErrCodeInternal is graded
fail: on the resolve path it is reserved for deterministic structural
defects (e.g. a missing registry healthCheck.assertFile), which re-runs
never clear — narrowing ADR-009 §2's literal wording to its stated
"re-run rather than penalize" intent. The catalog loop fails loud on
context cancellation so a truncated run is not mistaken for a healthy one.

Wire pkg/health/** into CODEOWNERS and the area/recipes labeler glob.

Closes #1225
@mchmarny mchmarny force-pushed the feat/pkg-health-core branch from 8c77b00 to fe7da04 Compare June 10, 2026 17:12

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — all review comments addressed in the force-push.

  • isTransient narrowed (the High-latent finding): ErrCodeInternal now grades fail; only ErrCodeTimeout + context.DeadlineExceeded/Canceled are held as unknown. Unit cases flipped accordingly.
  • Loop now fails loud on ctx.Err() instead of emitting a truncated report that looks healthy.
  • New TestComputeGradesDeterministicDefectAsFail drives a real broken leaf (missing healthCheck.assertFileErrCodeInternal) through Compute and asserts fail end-to-end — exactly the trigger path the bucket fix protects, guarding against regression.
  • CODEOWNERS wording fixed.

The two remaining nits (always-allocated Detail map, fail-path determinism fixture) are intentionally deferred and fine for V1. Nice, well-tested foundation for ADR-009.

@mchmarny mchmarny enabled auto-merge (squash) June 10, 2026 17:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/health/health.go`:
- Around line 44-45: Update the comment for the StatusUnknown constant to
reflect the actual transient logic in isTransient: remove mention of
ErrCodeInternal and only reference ErrCodeTimeout (and context
deadline/canceled) as transient; ensure the comment aligns with the rationale
already described near the isTransient function and the handling that
deliberately treats ErrCodeInternal as a non-transient failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 41e54918-682a-49ad-b8fe-c61a6e24ddd9

📥 Commits

Reviewing files that changed from the base of the PR and between 8c77b00 and fe7da04.

📒 Files selected for processing (6)
  • .github/CODEOWNERS
  • .github/labeler.yml
  • pkg/defaults/timeouts.go
  • pkg/health/doc.go
  • pkg/health/health.go
  • pkg/health/health_test.go

Comment thread pkg/health/health.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci size/XL theme/recipes Recipe expansion, overlays, mixins, and component registry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pkg/health: core Compute() loop, resolves signal, rollup, determinism

2 participants