docs: add ADR for shared component checks (deprecated) by yuanchen8911 · Pull Request #631 · NVIDIA/aicr

yuanchen8911 · 2026-04-21T16:57:07Z

Summary

Add ADR-006 to document the proposed registry-driven component check-definition
model and the intended convergence on a shared validator entrypoint.

Motivation / Context

Issue #622 became large enough to warrant a durable architecture record, and
the discussion with #610 made the broader direction clearer. This ADR now
captures:

a shared check-definition model in recipes/registry.yaml
the initial structured contract (readiness, customChecks, crds)
the intended convergence on aicr validate as the shared validator
entrypoint, including a wait-capable mode for deploy / cleanup and
post-install validation flows
the rationale for keeping the structured contract small rather than turning
it into a general-purpose DSL
the tradeoff versus reusing Chainsaw directly at validator runtime, while
still leaving room for a Chainsaw- or asset-based override path when the
structured schema is not sufficient

This PR is design-only. It does not include the #622 implementation; that
remains follow-up work.

Fixes: N/A
Related: #622, #610

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: ____________

Implementation Notes

Adds docs/design/006-deployment-validation.md in the existing ADR format.
Reframes the ADR from a deployment-validator-specific contract to a shared
component check-definition model that different deployment-related phases can
consume.
Defines the small structured contract in recipes/registry.yaml:
readiness, customChecks, and crds.
Records the initial validator subset, including exact workload identities,
CRD establishment, merge semantics, wait-capable validator convergence, and
the fixed custom-check key set.
Makes explicit that this is not a general-purpose DSL or arbitrary assertion
language; it is a typed schema over common Kubernetes check primitives.
Explains why crds stays first-class rather than being folded into generic
customChecks.
Clarifies the relationship with #610: share the same underlying component
check intent and validator entrypoint where possible, without forcing
deploy-time hook/job details into the validator contract.
Records why the ADR still prefers the structured schema plus pure-Go
validator path over making Chainsaw the primary validator runtime, while
leaving room for a Chainsaw- or asset-based override path.
Includes a link from docs/contributor/validator.md so the ADR is
discoverable from the validator architecture docs.

Testing

unset GITLAB_TOKEN && make qualify
unset GITLAB_TOKEN && make lint

Local result:

make qualify: passed earlier on the branch before the later ADR-only
revisions
make lint: passed after the latest ADR-only revisions
skipped rerunning full make qualify for the latest commits because this is
a doc-only change and the local overlay explicitly allows scoped verification
for doc-only work

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

github-actions · 2026-04-21T16:58:08Z

🌿 Preview your docs: https://nvidia-preview-docs-adr-006-deployment-validation.docs.buildwithfern.com/aicr

coderabbitai · 2026-04-21T16:59:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds ADR-006 at docs/design/006-deployment-validation.md defining a registry-driven deployment-phase validation contract for aicr validate --phase deployment, introducing readiness, customChecks, and crds metadata, overlay/merge semantics, runtime evaluation rules for workload and CRD readiness, fixed customChecks keys, and a two-phase migration/enforcement plan. Also updates docs/contributor/validator.md to reference the new ADR. Documentation-only changes; no code, API, or runtime behavior modified.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title mentions 'shared component checks' but the PR primarily adds ADR-006 for deployment validation with registry-driven component readiness; the word 'deprecated' is unexplained and not reflected in the ADR content.	Clarify the title to accurately reflect the main change (e.g., 'docs: add ADR-006 for deployment validation' or 'docs: add ADR for registry-driven deployment validation').

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the ADR's purpose, motivation, design decisions, and how it aligns with issues `#622` and `#610`.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

lockwobr · 2026-04-21T17:03:47Z

I made this comment on the last PR, including it here.
#630 (comment)

lockwobr · 2026-04-21T17:14:48Z

This design has a lot of overlap with #610 (#610 has 2 halves, only referring to deployment half). feels like we might want to share/coordinate, if does not add a lot of complexity to share the config. Would hate to have to maintain this in 2 places.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 261-274: Update the paragraph that lists implemented items to
clearly state this is target or planned functionality rather than current
behavior: change wording around ComponentConfig, ComponentRef, hydration of
readiness, overlay/external-data merge rules, deployment-validator features, and
registry mappings to future-tense or "design intent" phrasing (e.g., "will
include" or "planned follow-up PRs") and explicitly note that
pkg/recipe/metadata.go does not yet expose ComponentRef fields for
readiness/customChecks/crds; reference ComponentConfig and ComponentRef by name
so readers can map design to code until the implementation lands.
- Around line 244-245: In the negative-consequence bullet under "Fixed
special-case surface" in docs/design/006-deployment-validation.md, replace the
awkward phrase "requires coordinated Go and registry changes" with a clearer
alternative such as "requires coordinating Go and registry changes" (or
"requires Go and registry changes to be coordinated") so the sentence reads
naturally; edit the bullet text accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 003ec6e9-4e4f-4c55-8b84-e44b3e21cfa9

📥 Commits

Reviewing files that changed from the base of the PR and between d103f91 and 38ae50e.

📒 Files selected for processing (1)

docs/design/006-deployment-validation.md

coderabbitai

♻️ Duplicate comments (1)

docs/design/006-deployment-validation.md (1)

247-248: ⚠️ Potential issue | 🟡 Minor

Fix nonstandard phrasing in the negative consequence bullet.

The phrase "requires coordinated Go and registry changes" is nonstandard. Use "requires coordinating Go and registry changes" or "requires Go and registry changes to be coordinated" instead.

📝 Proposed fix

-- **Fixed special-case surface**: new bespoke behavior requires coordinated Go
-  and registry changes rather than arbitrary registry-defined assertions.
+- **Fixed special-case surface**: new bespoke behavior requires coordinating Go
+  and registry changes rather than arbitrary registry-defined assertions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 247 - 248, Replace the
nonstandard phrasing "requires coordinated Go and registry changes" in the
negative consequence bullet with a clearer alternative such as "requires
coordinating Go and registry changes" or "requires Go and registry changes to be
coordinated" so the sentence reads naturally and follows standard English usage;
locate the exact bullet line containing "**Fixed special-case surface**: new
bespoke behavior requires coordinated Go and registry changes rather than
arbitrary registry-defined assertions." and update the clause accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 247-248: Replace the nonstandard phrasing "requires coordinated Go
and registry changes" in the negative consequence bullet with a clearer
alternative such as "requires coordinating Go and registry changes" or "requires
Go and registry changes to be coordinated" so the sentence reads naturally and
follows standard English usage; locate the exact bullet line containing "**Fixed
special-case surface**: new bespoke behavior requires coordinated Go and
registry changes rather than arbitrary registry-defined assertions." and update
the clause accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 9762f7e1-4ff4-4f5d-a14a-6e73e9e65ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 38ae50e and 9a21799.

📒 Files selected for processing (1)

docs/design/006-deployment-validation.md

yuanchen8911 · 2026-04-21T17:48:06Z

This design has a lot of overlap with #610 (#610 has 2 halves, only referring to deployment half). feels like we might want to share/coordinate, if does not add a lot of complexity to share the config. Would hate to have to maintain this in 2 places.

Yes, there is real overlap. My take is

610 deployment half is about deploy-time flow control:
- post-install hook Jobs / sync waves
- deployer-specific rendering
- images, RBAC, timeouts, shell logic
622 (ADR-006) is about post-install validation in aicr validate --phase deployment:
- deployer-neutral
- intended to stay usable as a standalone validator

So the overlap is mainly in the readiness intent for each component, not in the full runtime/config model. I agree we should coordinate so we do not end up maintaining two unrelated readiness stories per component.

yuanchen8911 · 2026-04-21T18:01:01Z

#630 (comment)

Good point, agree "Chainsaw can’t compile cleanly with our Go toolchain” objection is weaker now and library-backed reuse is more viable than before, but direct reuse of our current checks still implies either binary support or rewriting those assets into a narrower form, IMO.

ayuskauskas · 2026-04-21T17:52:32Z

+
+| Option | Summary | Benefits | Costs |
+| --- | --- | --- | --- |
+| **Registry fields + pure Go runtime** | Add three component-level fields (`readiness`, `customChecks`, `crds`) and have the deployment validator run the declared checks across all enabled components selected into the resolved recipe | Typed contract, deployer-neutral, easy to unit test, aligns with current `aicr validate` architecture, keeps the validator Kubernetes-native and extensible | Narrower than full Chainsaw expressiveness; some semantic overlap with repo-side health checks |


A cost you are hand waving away is the maintenance of the pure Go solution as well as defining a new DSL to define the checks in. And lets be realistic about the benefits, chainsaw is also a kubernetes-native path. So the only real benefit to pure go is to not ship the chainsaw binary in the validator container.

aicr validate is still recipe-scoped. --recipe defines which components are expected and what checks apply to them. That is the motivation for adding readiness, customChecks, and crds: they make the per-component deployment checks explicit. If gpu-operator or dynamo are not in the recipe, the validator should not have an opinion about them. So "standalone" here means an independent runtime model, not opinions about arbitrary clusters.

I am not saying don't have those, and in fact I agree we should define such things if we need a readiness check beyond the standard Ready. What I am questioning is our own Go vs defining these checks in chainsaw.

ayuskauskas · 2026-04-21T18:01:36Z

+
+The deployment validator uses a narrow, stable v1 contract:
+
+- `Deployment`: `availableReplicas >= desiredReplicas`, with nil replicas treated


Aren't these already checked by k8s itself? I.e a deployment is marked as Ready when this is true? Same for the rest?

ayuskauskas · 2026-04-21T18:02:03Z

+
+`customChecks` are intentionally narrow:
+- fixed registration map in Go
+- fixed v1 key set:


It isn't clear what these would check

coderabbitai

♻️ Duplicate comments (2)

docs/design/006-deployment-validation.md (2)

284-285: ⚠️ Potential issue | 🟡 Minor

Tighten awkward phrasing in the negative consequence bullet.

“requires coordinated Go and registry changes” reads awkwardly; use “requires coordinating Go and registry changes” (or equivalent).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 284 - 285, Update the
negative consequence bullet's phrasing: replace the fragment "requires
coordinated Go and registry changes" with a clearer formulation such as
"requires coordinating Go and registry changes" (or an equivalent that uses the
gerund/active form) so the sentence reads smoothly; locate the bullet containing
the exact phrase "**Fixed special-case surface**: new bespoke behavior requires
coordinated Go and registry changes rather than arbitrary registry-defined
assertions." and change only that clause to the improved wording.

301-314: ⚠️ Potential issue | 🟠 Major

Clarify this as design intent, not landed implementation.

This section is written as already-shipped behavior, but this PR is docs-only and current pkg/recipe/metadata.go still does not expose ComponentRef fields for readiness, customChecks, and crds. Reword to explicit target-state language.

✏️ Suggested wording adjustment

-The implementation associated with this ADR includes:
+The target implementation associated with this ADR will include (via follow-up code changes):

-- readiness metadata on `ComponentConfig` and `ComponentRef`
-- hydration of readiness metadata into the resolved recipe
-- overlay/external-data merge rules for `readiness`, `customChecks`, and `crds`
-- deployment-validator support for:
+- readiness metadata on `ComponentConfig` and `ComponentRef`
+- hydration of readiness metadata into the resolved recipe
+- overlay/external-data merge rules for `readiness`, `customChecks`, and `crds`
+- deployment-validator support for:
   - selector-based discovery
   - exact workload identities
   - union + dedup semantics
   - fixed custom checks
   - CRD `Established=True` readiness
   - fail-closed zero-match behavior
-- registry mappings for the current component inventory
+- registry mappings for the current component inventory
+
+At the time of writing this ADR, these fields are design intent and are not yet fully represented in `pkg/recipe/metadata.go`.

As per coding guidelines, documentation “must not invent features, guarantees, timelines, or roadmap commitments” and must “clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 301 - 314, The doc
text presents planned features as already implemented; change the phrasing in
the listed bullets to target-state language (e.g., "will include" or "planned to
include") and add a clarifying sentence that this ADR describes design intent
rather than current shipped behavior, and remove or qualify any items that claim
existing exports from pkg/recipe/metadata.go (notably readiness, customChecks,
and crds on ComponentRef/ComponentConfig); ensure you reference ComponentConfig
and ComponentRef in the clarification and call out that pkg/recipe/metadata.go
does not yet expose those fields so readers understand they are future work.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 284-285: Update the negative consequence bullet's phrasing:
replace the fragment "requires coordinated Go and registry changes" with a
clearer formulation such as "requires coordinating Go and registry changes" (or
an equivalent that uses the gerund/active form) so the sentence reads smoothly;
locate the bullet containing the exact phrase "**Fixed special-case surface**:
new bespoke behavior requires coordinated Go and registry changes rather than
arbitrary registry-defined assertions." and change only that clause to the
improved wording.
- Around line 301-314: The doc text presents planned features as already
implemented; change the phrasing in the listed bullets to target-state language
(e.g., "will include" or "planned to include") and add a clarifying sentence
that this ADR describes design intent rather than current shipped behavior, and
remove or qualify any items that claim existing exports from
pkg/recipe/metadata.go (notably readiness, customChecks, and crds on
ComponentRef/ComponentConfig); ensure you reference ComponentConfig and
ComponentRef in the clarification and call out that pkg/recipe/metadata.go does
not yet expose those fields so readers understand they are future work.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1a2c4543-5680-4852-98ab-1b788a9ca9df

📥 Commits

Reviewing files that changed from the base of the PR and between 9a21799 and 12aae95.

📒 Files selected for processing (1)

docs/design/006-deployment-validation.md

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

docs/design/006-deployment-validation.md (1)
284-285: ⚠️ Potential issue | 🟡 Minor

Use clearer phrasing for the fixed-surface consequence.

Line 284 reads awkwardly (“requires coordinated Go and registry changes”). Prefer “requires coordinating Go and registry changes.”

As per coding guidelines, documentation should ensure clarity and accuracy in wording.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/006-deployment-validation.md` around lines 284 - 285, The
sentence under the "**Fixed special-case surface**" section uses awkward
phrasing "requires coordinated Go and registry changes"; update that phrase to
"requires coordinating Go and registry changes" so the line reads: "**Fixed
special-case surface**: new bespoke behavior requires coordinating Go and
registry changes." Locate the paragraph beginning with the "**Fixed special-case
surface**" heading and replace only the phrase to preserve the surrounding
context.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 8-10: Reword the wording that asserts implemented behavior to
clearly mark it as design intent/follow-up: change present-tense phrases like
“lands”, “includes” to future/intent phrasing (e.g., “will be implemented”,
“planned for”) in the document sections that reference the primary
implementation and the `ComponentRef` contract; specifically update the
paragraphs that mention `ComponentRef` fields and the lines currently claiming
they exist in `pkg/recipe/metadata.go` so they state these fields are part of
the intended design and not yet present in the current implementation.

---

Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 284-285: The sentence under the "**Fixed special-case surface**"
section uses awkward phrasing "requires coordinated Go and registry changes";
update that phrase to "requires coordinating Go and registry changes" so the
line reads: "**Fixed special-case surface**: new bespoke behavior requires
coordinating Go and registry changes." Locate the paragraph beginning with the
"**Fixed special-case surface**" heading and replace only the phrase to preserve
the surrounding context.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 2bae840e-035e-4865-b7e5-cdc08c9dae1a

📥 Commits

Reviewing files that changed from the base of the PR and between 12aae95 and 2f53e6d.

📒 Files selected for processing (2)

docs/contributor/validator.md
docs/design/006-deployment-validation.md

yuanchen8911 · 2026-04-21T18:36:03Z

@lockwobr Updated the ADR:

The ADR now explicitly links to Bridge Jobs for deploy/undeploy lifecycle flow control #610 and acknowledges the real overlap around per-component readiness intent.
It draws the boundary more clearly:
- Bridge Jobs for deploy/undeploy lifecycle flow control #610 is deploy-time flow control / undeploy cleanup
- ADR-006 is post-install validation in aicr validate --phase deployment
The intended coordination model is:
- align semantics and component mappings where that is simple
- avoid two unrelated readiness stories
- but do not force a single shared config model if that would pull deploy-time hook/job details into the validator contract

lockwobr · 2026-04-21T19:41:18Z

Circling back with a synthesis that ties this ADR to #610. I think the two designs aren't competing — they're solving adjacent problems that share one primitive: "is this component ready?"

The shared primitive: aicr validate with a blocking wait mode.

If aicr validate --component <name> --phase deployment --wait --timeout 30m blocks until readiness checks pass (or the timeout fires), and we ship an aicr container image with the CLI inside, then both designs consume the same implementation:

ADR-006 / Add registry-driven deployment readiness for all components #622 — operators run aicr validate --phase deployment directly (local CLI, day-N re-check, CI gates)
Bridge Jobs for deploy/undeploy lifecycle flow control #610 bridge Jobs — the deploy-time hook Job uses the aicr image and calls the same aicr validate ... --wait internally

# #610-style bridge Job under this synthesis
containers:
  - name: wait
    image: ghcr.io/nvidia/aicr:v1.2.3
    command: [aicr, validate, --component, gpu-operator, --phase, deployment, --wait, --timeout, 30m]

Key property: the two designs stay in sync by default, with an escape hatch when needed. Since deploy-time readiness (bridge Job) and validate-time readiness (aicr validate) both dispatch through the same CLI codepath, there's no way for them to diverge accidentally — a change to how "ready" is computed automatically applies to both sites. No documentation discipline required, no drift between parallel implementations, no "did we update both?" risk. And when a component genuinely needs logic the shared path can't express, the readiness.yaml override (from #610's convention) provides a clean exit without abandoning the shared model for everyone else.

--wait --timeout is a general validator capability, not just a deployment-phase primitive. Plenty of checks benefit from "wait until this converges" semantics: post-snapshot agents reporting in, taint/label operations applying across nodes, health checks reaching a stable state, cluster-wide constraints that take time to propagate. Others are genuinely one-shot (static version checks, static capacity thresholds) and would just resolve immediately. The validator can expose --wait uniformly — checks that are wait-capable honor it, one-shot checks return immediately without penalty. Designing this as a general capability (not a bridge-Job-specific hack) means future validation work inherits it for free.

Precedence model for the two config shapes:

Structured readiness fields in ComponentRef (this ADR's contribution) cover the common cases declaratively — the default for most components. Both deploy-time and validate-time consume them identically.
readiness.yaml shipped in recipes/components/<name>/ (Bridge Jobs for deploy/undeploy lifecycle flow control #610's convention) acts as an override when a component needs logic that can't be expressed structurally.

File presence wins. Components stay on the structured path unless they explicitly need the escape hatch.

Addresses @ayuskauskas's Chainsaw point indirectly: how aicr validate internally evaluates readiness (pure Go, shells out to Chainsaw, something else entirely) becomes an implementation detail behind a stable CLI interface. Whatever the validator chooses under the hood, consumers see the same aicr validate entrypoint. That lets us pick the simplest thing now and revisit the executor later without breaking #610's bridge Jobs or any external tooling.

Tradeoffs worth naming early:

aicr validate --wait --timeout is meaningful new surface (polling, backoff, SIGTERM handling for clean hook termination). Not hard, but not free.
The aicr container image becomes another cluster-side artifact — signing, SBOM, CVE scanning. Precedent already exists (snapshot-capture), so no new threshold crossed.
Air-gap: if the aicr image isn't pullable, bridge Jobs fail. The readiness.yaml override path lets users ship a raw kubectl wait Job as a fallback — supported escape hatch.
ADR-006's "Fixed special-case surface" framing should explicitly leave room for readiness.yaml to override the structured fields, so we don't accidentally close off the override path.

Coordination point: the boundary between #610 (deploy-time flow control) and this ADR (post-install validation) stays as you've drawn it. The synthesis is just that both delegate to the same underlying readiness implementation rather than re-implementing the check logic in two places.

Happy to fold any of this into #610 or open a third issue for aicr validate --wait specifically if that scope feels like it deserves its own tracking.

yuanchen8911 · 2026-04-21T19:51:47Z

@lockwobr I updated the ADR based on your synthesis. It now makes the shared validator entrypoint explicit, adds the --wait --timeout convergence idea, and leaves room for a readiness.yaml escape hatch without overcommitting the exact precedence rules yet. PTAL.

yuanchen8911 · 2026-04-21T20:02:25Z

@ayuskauskas thanks, this is helpful. A few reactions point by point:

On the DSL concern: agreed. I do not want to introduce a general-purpose DSL plus arbitrary parser/executor here. The intent is a small typed schema over a few common Kubernetes check primitives, not a new assertion language.
On distroless: agreed that shipping Chainsaw does not inherently mean the image cannot stay distroless. My concern is less about distroless specifically and more about the broader runtime, packaging, signing, SBOM/CVE, upgrade, and testing cost of making Chainsaw part of the validator path.
On reducing deployment checks to selector + standard Ready condition: I think that is too narrow for cases we already know we need, such as exact named workloads, tightened DaemonSet semantics, declarative CRD establishment, and a small set of component-specific CR/status checks.
On customChecks as inline Chainsaw assert or file path: I think that is a reasonable escape hatch when the structured schema is not enough.
Where I still differ is on dropping the third type. I think CRD establishment should stay first-class rather than being folded into generic customChecks, because it is a common declarative signal, not just a bespoke edge case.

So my current preference is to keep the structured contract small (readiness, crds, customChecks), avoid turning it into a general DSL, and leave room for a Chainsaw- or asset-based override path when the typed schema is not sufficient.

lockwobr · 2026-04-21T20:19:50Z

how much do use chainsaw now, by not using it we are also then having to go patch all the places it used today. I think we are using it a "good bit" we should consider it before just removing it from the table. @xdu31 would removing chainsaw be a lot of work? Additionally, moving to using it as a lib, how much work is that?

yuanchen8911 · 2026-04-23T18:51:36Z

closed it. It looks like Chainsaw is the right direction.

The path to enhance aicr validate --phase deployment using Chainsaw is straightforward. We already have per-component Chainsaw health checks. The gap is wiring them into the deployment validation: triggering the existing Chainsaw assets from the validator and enhancing/extending individual Chainsaw component checks as needed.

yuanchen8911 requested a review from a team as a code owner April 21, 2026 16:57

github-actions Bot added the area/docs label Apr 21, 2026

yuanchen8911 mentioned this pull request Apr 21, 2026

docs: add ADR for deployment validation #630

Closed

25 tasks

github-actions Bot added the size/L label Apr 21, 2026

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread docs/design/006-deployment-validation.md Outdated

Comment thread docs/design/006-deployment-validation.md Outdated

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

yuanchen8911 requested review from ayuskauskas, lockwobr, mchmarny and xdu31 April 21, 2026 17:42

yuanchen8911 added 4 commits April 21, 2026 11:01

docs: add deployment validation ADR

7dc5cb6

docs: refine deployment validation ADR

269e834

docs: polish deployment validation ADR

0388e92

docs: clarify deployment validation ADR

2f53e6d

yuanchen8911 force-pushed the docs/adr-006-deployment-validation branch from 12aae95 to 2f53e6d Compare April 21, 2026 18:01

ayuskauskas reviewed Apr 21, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread docs/design/006-deployment-validation.md Outdated

yuanchen8911 mentioned this pull request Apr 21, 2026

Add registry-driven deployment readiness for all components #622

Closed

yuanchen8911 added 3 commits April 21, 2026 11:15

docs: tighten deployment validation ADR

4e780e5

docs: clarify deployment validation ADR state

5141f06

docs: polish deployment validation ADR status

b540ae9

yuanchen8911 requested a review from ayuskauskas April 21, 2026 18:36

docs: broaden deployment validation ADR

1c9d8b0

docs: clarify shared check ADR

0f7c87c

yuanchen8911 changed the title ~~docs: add ADR for deployment validation~~ docs: add ADR for shared component checks Apr 21, 2026

yuanchen8911 requested a review from iamkhaledh April 21, 2026 20:10

yuanchen8911 added 4 commits April 21, 2026 14:23

docs: tighten shared check ADR

0eebf38

docs: refine shared check ADR

4d15dd3

docs: refine shared check ADR

62e2a29

docs: polish shared check ADR

bdbbbcf

github-actions Bot added size/XL and removed size/L labels Apr 21, 2026

docs: clarify shared check ADR

be43993

mchmarny assigned yuanchen8911 Apr 21, 2026

docs: sync ADR pages into site build

51010ec

lockwobr mentioned this pull request Apr 21, 2026

Bridge Jobs for deploy/undeploy lifecycle flow control #610

Closed

docs: fix site dead links for ADR pages

301b0d0

yuanchen8911 mentioned this pull request Apr 22, 2026

docs: fix Pages dead links for design docs #636

Merged

13 tasks

yuanchen8911 marked this pull request as draft April 22, 2026 01:13

yuanchen8911 changed the title ~~docs: add ADR for shared component checks~~ docs: add ADR for shared component checks (deprecated) Apr 22, 2026

lockwobr mentioned this pull request Apr 22, 2026

Expose standard .status.conditions[type=Ready] for GitOps/deploy-tool integration NVIDIA/nodewright#201

Closed

Merge branch 'main' into docs/adr-006-deployment-validation

e4ded79

yuanchen8911 closed this Apr 23, 2026

yuanchen8911 mentioned this pull request Apr 23, 2026

[EPIC] Reuse Chainsaw health checks in aicr validate --phase deployment #660

Closed

5 tasks

lockwobr mentioned this pull request May 15, 2026

feat: component readiness via chainsaw post-install hook (replaces #610) #904

Closed

This was referenced Jun 8, 2026

Hydrate healthCheck.assertFile + suppression sentinel (#660 PR 1) #1219

Closed

feat(recipe): hydrate healthCheck.assertFile + suppression sentinel #1231

Merged


		The deployment validator uses a narrow, stable v1 contract:

		- `Deployment`: `availableReplicas >= desiredReplicas`, with nil replicas treated

Uh oh!

Conversation

yuanchen8911 commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

lockwobr commented Apr 21, 2026

Uh oh!

lockwobr commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yuanchen8911 commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuanchen8911 commented Apr 21, 2026

Uh oh!

ayuskauskas Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

yuanchen8911 Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ayuskauskas Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ayuskauskas Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ayuskauskas Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yuanchen8911 commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lockwobr commented Apr 21, 2026

Uh oh!

yuanchen8911 commented Apr 21, 2026

Uh oh!

yuanchen8911 commented Apr 21, 2026

Uh oh!

lockwobr commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuanchen8911 commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

yuanchen8911 commented Apr 21, 2026 •

edited

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

lockwobr commented Apr 21, 2026 •

edited

Loading

yuanchen8911 commented Apr 21, 2026 •

edited

Loading

yuanchen8911 commented Apr 21, 2026 •

edited

Loading

lockwobr commented Apr 21, 2026 •

edited

Loading