Skip to content

docs: add ADR for shared component checks (deprecated)#631

Closed
yuanchen8911 wants to merge 17 commits into
NVIDIA:mainfrom
yuanchen8911:docs/adr-006-deployment-validation
Closed

docs: add ADR for shared component checks (deprecated)#631
yuanchen8911 wants to merge 17 commits into
NVIDIA:mainfrom
yuanchen8911:docs/adr-006-deployment-validation

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Add ADR-006 to document the proposed registry-driven component check-definition
model and the intended convergence on a shared validator entrypoint.

Motivation / Context

Issue #622 became large enough to warrant a durable architecture record, and
the discussion with #610 made the broader direction clearer. This ADR now
captures:

  • a shared check-definition model in recipes/registry.yaml
  • the initial structured contract (readiness, customChecks, crds)
  • the intended convergence on aicr validate as the shared validator
    entrypoint, including a wait-capable mode for deploy / cleanup and
    post-install validation flows
  • the rationale for keeping the structured contract small rather than turning
    it into a general-purpose DSL
  • the tradeoff versus reusing Chainsaw directly at validator runtime, while
    still leaving room for a Chainsaw- or asset-based override path when the
    structured schema is not sufficient

This PR is design-only. It does not include the #622 implementation; that
remains follow-up work.

Fixes: N/A
Related: #622, #610

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

  • Adds docs/design/006-deployment-validation.md in the existing ADR format.
  • Reframes the ADR from a deployment-validator-specific contract to a shared
    component check-definition model that different deployment-related phases can
    consume.
  • Defines the small structured contract in recipes/registry.yaml:
    readiness, customChecks, and crds.
  • Records the initial validator subset, including exact workload identities,
    CRD establishment, merge semantics, wait-capable validator convergence, and
    the fixed custom-check key set.
  • Makes explicit that this is not a general-purpose DSL or arbitrary assertion
    language; it is a typed schema over common Kubernetes check primitives.
  • Explains why crds stays first-class rather than being folded into generic
    customChecks.
  • Clarifies the relationship with #610: share the same underlying component
    check intent and validator entrypoint where possible, without forcing
    deploy-time hook/job details into the validator contract.
  • Records why the ADR still prefers the structured schema plus pure-Go
    validator path over making Chainsaw the primary validator runtime, while
    leaving room for a Chainsaw- or asset-based override path.
  • Includes a link from docs/contributor/validator.md so the ADR is
    discoverable from the validator architecture docs.

Testing

unset GITLAB_TOKEN && make qualify
unset GITLAB_TOKEN && make lint

Local result:

  • make qualify: passed earlier on the branch before the later ADR-only
    revisions
  • make lint: passed after the latest ADR-only revisions
  • skipped rerunning full make qualify for the latest commits because this is
    a doc-only change and the local overlay explicitly allows scoped verification
    for doc-only work

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 21, 2026 16:57
@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds ADR-006 at docs/design/006-deployment-validation.md defining a registry-driven deployment-phase validation contract for aicr validate --phase deployment, introducing readiness, customChecks, and crds metadata, overlay/merge semantics, runtime evaluation rules for workload and CRD readiness, fixed customChecks keys, and a two-phase migration/enforcement plan. Also updates docs/contributor/validator.md to reference the new ADR. Documentation-only changes; no code, API, or runtime behavior modified.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title mentions 'shared component checks' but the PR primarily adds ADR-006 for deployment validation with registry-driven component readiness; the word 'deprecated' is unexplained and not reflected in the ADR content. Clarify the title to accurately reflect the main change (e.g., 'docs: add ADR-006 for deployment validation' or 'docs: add ADR for registry-driven deployment validation').
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the ADR's purpose, motivation, design decisions, and how it aligns with issues #622 and #610.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@lockwobr

Copy link
Copy Markdown
Contributor

I made this comment on the last PR, including it here.
#630 (comment)

@lockwobr

lockwobr commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

This design has a lot of overlap with #610 (#610 has 2 halves, only referring to deployment half). feels like we might want to share/coordinate, if does not add a lot of complexity to share the config. Would hate to have to maintain this in 2 places.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 261-274: Update the paragraph that lists implemented items to
clearly state this is target or planned functionality rather than current
behavior: change wording around ComponentConfig, ComponentRef, hydration of
readiness, overlay/external-data merge rules, deployment-validator features, and
registry mappings to future-tense or "design intent" phrasing (e.g., "will
include" or "planned follow-up PRs") and explicitly note that
pkg/recipe/metadata.go does not yet expose ComponentRef fields for
readiness/customChecks/crds; reference ComponentConfig and ComponentRef by name
so readers can map design to code until the implementation lands.
- Around line 244-245: In the negative-consequence bullet under "Fixed
special-case surface" in docs/design/006-deployment-validation.md, replace the
awkward phrase "requires coordinated Go and registry changes" with a clearer
alternative such as "requires coordinating Go and registry changes" (or
"requires Go and registry changes to be coordinated") so the sentence reads
naturally; edit the bullet text accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 003ec6e9-4e4f-4c55-8b84-e44b3e21cfa9

📥 Commits

Reviewing files that changed from the base of the PR and between d103f91 and 38ae50e.

📒 Files selected for processing (1)
  • docs/design/006-deployment-validation.md

Comment thread docs/design/006-deployment-validation.md Outdated
Comment thread docs/design/006-deployment-validation.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
docs/design/006-deployment-validation.md (1)

247-248: ⚠️ Potential issue | 🟡 Minor

Fix nonstandard phrasing in the negative consequence bullet.

The phrase "requires coordinated Go and registry changes" is nonstandard. Use "requires coordinating Go and registry changes" or "requires Go and registry changes to be coordinated" instead.

📝 Proposed fix
-- **Fixed special-case surface**: new bespoke behavior requires coordinated Go
-  and registry changes rather than arbitrary registry-defined assertions.
+- **Fixed special-case surface**: new bespoke behavior requires coordinating Go
+  and registry changes rather than arbitrary registry-defined assertions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 247 - 248, Replace the
nonstandard phrasing "requires coordinated Go and registry changes" in the
negative consequence bullet with a clearer alternative such as "requires
coordinating Go and registry changes" or "requires Go and registry changes to be
coordinated" so the sentence reads naturally and follows standard English usage;
locate the exact bullet line containing "**Fixed special-case surface**: new
bespoke behavior requires coordinated Go and registry changes rather than
arbitrary registry-defined assertions." and update the clause accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 247-248: Replace the nonstandard phrasing "requires coordinated Go
and registry changes" in the negative consequence bullet with a clearer
alternative such as "requires coordinating Go and registry changes" or "requires
Go and registry changes to be coordinated" so the sentence reads naturally and
follows standard English usage; locate the exact bullet line containing "**Fixed
special-case surface**: new bespoke behavior requires coordinated Go and
registry changes rather than arbitrary registry-defined assertions." and update
the clause accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 9762f7e1-4ff4-4f5d-a14a-6e73e9e65ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 38ae50e and 9a21799.

📒 Files selected for processing (1)
  • docs/design/006-deployment-validation.md

@yuanchen8911

yuanchen8911 commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

This design has a lot of overlap with #610 (#610 has 2 halves, only referring to deployment half). feels like we might want to share/coordinate, if does not add a lot of complexity to share the config. Would hate to have to maintain this in 2 places.

Yes, there is real overlap. My take is

  • 610 deployment half is about deploy-time flow control:
    • post-install hook Jobs / sync waves
    • deployer-specific rendering
    • images, RBAC, timeouts, shell logic
  • 622 (ADR-006) is about post-install validation in aicr validate --phase deployment:
    • deployer-neutral
    • intended to stay usable as a standalone validator

So the overlap is mainly in the readiness intent for each component, not in the full runtime/config model. I agree we should coordinate so we do not end up maintaining two unrelated readiness stories per component.

@yuanchen8911

Copy link
Copy Markdown
Contributor Author

#630 (comment)

Good point, agree "Chainsaw can’t compile cleanly with our Go toolchain” objection is weaker now and library-backed reuse is more viable than before, but direct reuse of our current checks still implies either binary support or rewriting those assets into a narrower form, IMO.

@yuanchen8911 yuanchen8911 force-pushed the docs/adr-006-deployment-validation branch from 12aae95 to 2f53e6d Compare April 21, 2026 18:01

| Option | Summary | Benefits | Costs |
| --- | --- | --- | --- |
| **Registry fields + pure Go runtime** | Add three component-level fields (`readiness`, `customChecks`, `crds`) and have the deployment validator run the declared checks across all enabled components selected into the resolved recipe | Typed contract, deployer-neutral, easy to unit test, aligns with current `aicr validate` architecture, keeps the validator Kubernetes-native and extensible | Narrower than full Chainsaw expressiveness; some semantic overlap with repo-side health checks |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cost you are hand waving away is the maintenance of the pure Go solution as well as defining a new DSL to define the checks in. And lets be realistic about the benefits, chainsaw is also a kubernetes-native path. So the only real benefit to pure go is to not ship the chainsaw binary in the validator container.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aicr validate is still recipe-scoped. --recipe defines which components are expected and what checks apply to them. That is the motivation for adding readiness, customChecks, and crds: they make the per-component deployment checks explicit. If gpu-operator or dynamo are not in the recipe, the validator should not have an opinion about them. So "standalone" here means an independent runtime model, not opinions about arbitrary clusters.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying don't have those, and in fact I agree we should define such things if we need a readiness check beyond the standard Ready. What I am questioning is our own Go vs defining these checks in chainsaw.


The deployment validator uses a narrow, stable v1 contract:

- `Deployment`: `availableReplicas >= desiredReplicas`, with nil replicas treated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these already checked by k8s itself? I.e a deployment is marked as Ready when this is true? Same for the rest?


`customChecks` are intentionally narrow:
- fixed registration map in Go
- fixed v1 key set:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear what these would check

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
docs/design/006-deployment-validation.md (2)

284-285: ⚠️ Potential issue | 🟡 Minor

Tighten awkward phrasing in the negative consequence bullet.

“requires coordinated Go and registry changes” reads awkwardly; use “requires coordinating Go and registry changes” (or equivalent).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 284 - 285, Update the
negative consequence bullet's phrasing: replace the fragment "requires
coordinated Go and registry changes" with a clearer formulation such as
"requires coordinating Go and registry changes" (or an equivalent that uses the
gerund/active form) so the sentence reads smoothly; locate the bullet containing
the exact phrase "**Fixed special-case surface**: new bespoke behavior requires
coordinated Go and registry changes rather than arbitrary registry-defined
assertions." and change only that clause to the improved wording.

301-314: ⚠️ Potential issue | 🟠 Major

Clarify this as design intent, not landed implementation.

This section is written as already-shipped behavior, but this PR is docs-only and current pkg/recipe/metadata.go still does not expose ComponentRef fields for readiness, customChecks, and crds. Reword to explicit target-state language.

✏️ Suggested wording adjustment
-The implementation associated with this ADR includes:
+The target implementation associated with this ADR will include (via follow-up code changes):

-- readiness metadata on `ComponentConfig` and `ComponentRef`
-- hydration of readiness metadata into the resolved recipe
-- overlay/external-data merge rules for `readiness`, `customChecks`, and `crds`
-- deployment-validator support for:
+- readiness metadata on `ComponentConfig` and `ComponentRef`
+- hydration of readiness metadata into the resolved recipe
+- overlay/external-data merge rules for `readiness`, `customChecks`, and `crds`
+- deployment-validator support for:
   - selector-based discovery
   - exact workload identities
   - union + dedup semantics
   - fixed custom checks
   - CRD `Established=True` readiness
   - fail-closed zero-match behavior
-- registry mappings for the current component inventory
+- registry mappings for the current component inventory
+
+At the time of writing this ADR, these fields are design intent and are not yet fully represented in `pkg/recipe/metadata.go`.

As per coding guidelines, documentation “must not invent features, guarantees, timelines, or roadmap commitments” and must “clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 301 - 314, The doc
text presents planned features as already implemented; change the phrasing in
the listed bullets to target-state language (e.g., "will include" or "planned to
include") and add a clarifying sentence that this ADR describes design intent
rather than current shipped behavior, and remove or qualify any items that claim
existing exports from pkg/recipe/metadata.go (notably readiness, customChecks,
and crds on ComponentRef/ComponentConfig); ensure you reference ComponentConfig
and ComponentRef in the clarification and call out that pkg/recipe/metadata.go
does not yet expose those fields so readers understand they are future work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 284-285: Update the negative consequence bullet's phrasing:
replace the fragment "requires coordinated Go and registry changes" with a
clearer formulation such as "requires coordinating Go and registry changes" (or
an equivalent that uses the gerund/active form) so the sentence reads smoothly;
locate the bullet containing the exact phrase "**Fixed special-case surface**:
new bespoke behavior requires coordinated Go and registry changes rather than
arbitrary registry-defined assertions." and change only that clause to the
improved wording.
- Around line 301-314: The doc text presents planned features as already
implemented; change the phrasing in the listed bullets to target-state language
(e.g., "will include" or "planned to include") and add a clarifying sentence
that this ADR describes design intent rather than current shipped behavior, and
remove or qualify any items that claim existing exports from
pkg/recipe/metadata.go (notably readiness, customChecks, and crds on
ComponentRef/ComponentConfig); ensure you reference ComponentConfig and
ComponentRef in the clarification and call out that pkg/recipe/metadata.go does
not yet expose those fields so readers understand they are future work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1a2c4543-5680-4852-98ab-1b788a9ca9df

📥 Commits

Reviewing files that changed from the base of the PR and between 9a21799 and 12aae95.

📒 Files selected for processing (1)
  • docs/design/006-deployment-validation.md

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/design/006-deployment-validation.md (1)

284-285: ⚠️ Potential issue | 🟡 Minor

Use clearer phrasing for the fixed-surface consequence.

Line 284 reads awkwardly (“requires coordinated Go and registry changes”). Prefer “requires coordinating Go and registry changes.”

As per coding guidelines, documentation should ensure clarity and accuracy in wording.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 284 - 285, The
sentence under the "**Fixed special-case surface**" section uses awkward
phrasing "requires coordinated Go and registry changes"; update that phrase to
"requires coordinating Go and registry changes" so the line reads: "**Fixed
special-case surface**: new bespoke behavior requires coordinating Go and
registry changes." Locate the paragraph beginning with the "**Fixed special-case
surface**" heading and replace only the phrase to preserve the surrounding
context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 8-10: Reword the wording that asserts implemented behavior to
clearly mark it as design intent/follow-up: change present-tense phrases like
“lands”, “includes” to future/intent phrasing (e.g., “will be implemented”,
“planned for”) in the document sections that reference the primary
implementation and the `ComponentRef` contract; specifically update the
paragraphs that mention `ComponentRef` fields and the lines currently claiming
they exist in `pkg/recipe/metadata.go` so they state these fields are part of
the intended design and not yet present in the current implementation.

---

Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 284-285: The sentence under the "**Fixed special-case surface**"
section uses awkward phrasing "requires coordinated Go and registry changes";
update that phrase to "requires coordinating Go and registry changes" so the
line reads: "**Fixed special-case surface**: new bespoke behavior requires
coordinating Go and registry changes." Locate the paragraph beginning with the
"**Fixed special-case surface**" heading and replace only the phrase to preserve
the surrounding context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 2bae840e-035e-4865-b7e5-cdc08c9dae1a

📥 Commits

Reviewing files that changed from the base of the PR and between 12aae95 and 2f53e6d.

📒 Files selected for processing (2)
  • docs/contributor/validator.md
  • docs/design/006-deployment-validation.md

Comment thread docs/design/006-deployment-validation.md Outdated
@yuanchen8911

yuanchen8911 commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

@lockwobr Updated the ADR:

  • The ADR now explicitly links to Bridge Jobs for deploy/undeploy lifecycle flow control #610 and acknowledges the real overlap around per-component readiness intent.
  • It draws the boundary more clearly:
  • The intended coordination model is:
    • align semantics and component mappings where that is simple
    • avoid two unrelated readiness stories
    • but do not force a single shared config model if that would pull deploy-time hook/job details into the validator contract

@lockwobr

Copy link
Copy Markdown
Contributor

Circling back with a synthesis that ties this ADR to #610. I think the two designs aren't competing — they're solving adjacent problems that share one primitive: "is this component ready?"

The shared primitive: aicr validate with a blocking wait mode.

If aicr validate --component <name> --phase deployment --wait --timeout 30m blocks until readiness checks pass (or the timeout fires), and we ship an aicr container image with the CLI inside, then both designs consume the same implementation:

# #610-style bridge Job under this synthesis
containers:
  - name: wait
    image: ghcr.io/nvidia/aicr:v1.2.3
    command: [aicr, validate, --component, gpu-operator, --phase, deployment, --wait, --timeout, 30m]

Key property: the two designs stay in sync by default, with an escape hatch when needed. Since deploy-time readiness (bridge Job) and validate-time readiness (aicr validate) both dispatch through the same CLI codepath, there's no way for them to diverge accidentally — a change to how "ready" is computed automatically applies to both sites. No documentation discipline required, no drift between parallel implementations, no "did we update both?" risk. And when a component genuinely needs logic the shared path can't express, the readiness.yaml override (from #610's convention) provides a clean exit without abandoning the shared model for everyone else.

--wait --timeout is a general validator capability, not just a deployment-phase primitive. Plenty of checks benefit from "wait until this converges" semantics: post-snapshot agents reporting in, taint/label operations applying across nodes, health checks reaching a stable state, cluster-wide constraints that take time to propagate. Others are genuinely one-shot (static version checks, static capacity thresholds) and would just resolve immediately. The validator can expose --wait uniformly — checks that are wait-capable honor it, one-shot checks return immediately without penalty. Designing this as a general capability (not a bridge-Job-specific hack) means future validation work inherits it for free.

Precedence model for the two config shapes:

  • Structured readiness fields in ComponentRef (this ADR's contribution) cover the common cases declaratively — the default for most components. Both deploy-time and validate-time consume them identically.
  • readiness.yaml shipped in recipes/components/<name>/ (Bridge Jobs for deploy/undeploy lifecycle flow control #610's convention) acts as an override when a component needs logic that can't be expressed structurally.

File presence wins. Components stay on the structured path unless they explicitly need the escape hatch.

Addresses @ayuskauskas's Chainsaw point indirectly: how aicr validate internally evaluates readiness (pure Go, shells out to Chainsaw, something else entirely) becomes an implementation detail behind a stable CLI interface. Whatever the validator chooses under the hood, consumers see the same aicr validate entrypoint. That lets us pick the simplest thing now and revisit the executor later without breaking #610's bridge Jobs or any external tooling.

Tradeoffs worth naming early:

  1. aicr validate --wait --timeout is meaningful new surface (polling, backoff, SIGTERM handling for clean hook termination). Not hard, but not free.
  2. The aicr container image becomes another cluster-side artifact — signing, SBOM, CVE scanning. Precedent already exists (snapshot-capture), so no new threshold crossed.
  3. Air-gap: if the aicr image isn't pullable, bridge Jobs fail. The readiness.yaml override path lets users ship a raw kubectl wait Job as a fallback — supported escape hatch.
  4. ADR-006's "Fixed special-case surface" framing should explicitly leave room for readiness.yaml to override the structured fields, so we don't accidentally close off the override path.

Coordination point: the boundary between #610 (deploy-time flow control) and this ADR (post-install validation) stays as you've drawn it. The synthesis is just that both delegate to the same underlying readiness implementation rather than re-implementing the check logic in two places.

Happy to fold any of this into #610 or open a third issue for aicr validate --wait specifically if that scope feels like it deserves its own tracking.

@yuanchen8911

Copy link
Copy Markdown
Contributor Author

@lockwobr I updated the ADR based on your synthesis. It now makes the shared validator entrypoint explicit, adds the --wait --timeout convergence idea, and leaves room for a readiness.yaml escape hatch without overcommitting the exact precedence rules yet. PTAL.

@yuanchen8911

Copy link
Copy Markdown
Contributor Author

@ayuskauskas thanks, this is helpful. A few reactions point by point:

  • On the DSL concern: agreed. I do not want to introduce a general-purpose DSL plus arbitrary parser/executor here. The intent is a small typed schema over a few common Kubernetes check primitives, not a new assertion language.
  • On distroless: agreed that shipping Chainsaw does not inherently mean the image cannot stay distroless. My concern is less about distroless specifically and more about the broader runtime, packaging, signing, SBOM/CVE, upgrade, and testing cost of making Chainsaw part of the validator path.
  • On reducing deployment checks to selector + standard Ready condition: I think that is too narrow for cases we already know we need, such as exact named workloads, tightened DaemonSet semantics, declarative CRD establishment, and a small set of component-specific CR/status checks.
  • On customChecks as inline Chainsaw assert or file path: I think that is a reasonable escape hatch when the structured schema is not enough.
  • Where I still differ is on dropping the third type. I think CRD establishment should stay first-class rather than being folded into generic customChecks, because it is a common declarative signal, not just a bespoke edge case.

So my current preference is to keep the structured contract small (readiness, crds, customChecks), avoid turning it into a general DSL, and leave room for a Chainsaw- or asset-based override path when the typed schema is not sufficient.

@yuanchen8911 yuanchen8911 changed the title docs: add ADR for deployment validation docs: add ADR for shared component checks Apr 21, 2026
@yuanchen8911 yuanchen8911 requested a review from iamkhaledh April 21, 2026 20:10
@lockwobr

lockwobr commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

how much do use chainsaw now, by not using it we are also then having to go patch all the places it used today. I think we are using it a "good bit" we should consider it before just removing it from the table. @xdu31 would removing chainsaw be a lot of work? Additionally, moving to using it as a lib, how much work is that?

@github-actions github-actions Bot added size/XL and removed size/L labels Apr 21, 2026
@yuanchen8911 yuanchen8911 marked this pull request as draft April 22, 2026 01:13
@yuanchen8911 yuanchen8911 changed the title docs: add ADR for shared component checks docs: add ADR for shared component checks (deprecated) Apr 22, 2026
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

closed it. It looks like Chainsaw is the right direction.

The path to enhance aicr validate --phase deployment using Chainsaw is straightforward. We already have per-component Chainsaw health checks. The gap is wiring them into the deployment validation:  triggering the existing Chainsaw assets from the validator and enhancing/extending individual Chainsaw component checks as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants