docs: add ADR for shared component checks (deprecated)#631
docs: add ADR for shared component checks (deprecated)#631yuanchen8911 wants to merge 17 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds ADR-006 at Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
I made this comment on the last PR, including it here. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 261-274: Update the paragraph that lists implemented items to
clearly state this is target or planned functionality rather than current
behavior: change wording around ComponentConfig, ComponentRef, hydration of
readiness, overlay/external-data merge rules, deployment-validator features, and
registry mappings to future-tense or "design intent" phrasing (e.g., "will
include" or "planned follow-up PRs") and explicitly note that
pkg/recipe/metadata.go does not yet expose ComponentRef fields for
readiness/customChecks/crds; reference ComponentConfig and ComponentRef by name
so readers can map design to code until the implementation lands.
- Around line 244-245: In the negative-consequence bullet under "Fixed
special-case surface" in docs/design/006-deployment-validation.md, replace the
awkward phrase "requires coordinated Go and registry changes" with a clearer
alternative such as "requires coordinating Go and registry changes" (or
"requires Go and registry changes to be coordinated") so the sentence reads
naturally; edit the bullet text accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 003ec6e9-4e4f-4c55-8b84-e44b3e21cfa9
📒 Files selected for processing (1)
docs/design/006-deployment-validation.md
There was a problem hiding this comment.
♻️ Duplicate comments (1)
docs/design/006-deployment-validation.md (1)
247-248:⚠️ Potential issue | 🟡 MinorFix nonstandard phrasing in the negative consequence bullet.
The phrase "requires coordinated Go and registry changes" is nonstandard. Use "requires coordinating Go and registry changes" or "requires Go and registry changes to be coordinated" instead.
📝 Proposed fix
-- **Fixed special-case surface**: new bespoke behavior requires coordinated Go - and registry changes rather than arbitrary registry-defined assertions. +- **Fixed special-case surface**: new bespoke behavior requires coordinating Go + and registry changes rather than arbitrary registry-defined assertions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/006-deployment-validation.md` around lines 247 - 248, Replace the nonstandard phrasing "requires coordinated Go and registry changes" in the negative consequence bullet with a clearer alternative such as "requires coordinating Go and registry changes" or "requires Go and registry changes to be coordinated" so the sentence reads naturally and follows standard English usage; locate the exact bullet line containing "**Fixed special-case surface**: new bespoke behavior requires coordinated Go and registry changes rather than arbitrary registry-defined assertions." and update the clause accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 247-248: Replace the nonstandard phrasing "requires coordinated Go
and registry changes" in the negative consequence bullet with a clearer
alternative such as "requires coordinating Go and registry changes" or "requires
Go and registry changes to be coordinated" so the sentence reads naturally and
follows standard English usage; locate the exact bullet line containing "**Fixed
special-case surface**: new bespoke behavior requires coordinated Go and
registry changes rather than arbitrary registry-defined assertions." and update
the clause accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 9762f7e1-4ff4-4f5d-a14a-6e73e9e65ea4
📒 Files selected for processing (1)
docs/design/006-deployment-validation.md
Yes, there is real overlap. My take is
So the overlap is mainly in the readiness intent for each component, not in the full runtime/config model. I agree we should coordinate so we do not end up maintaining two unrelated readiness stories per component. |
Good point, agree "Chainsaw can’t compile cleanly with our Go toolchain” objection is weaker now and library-backed reuse is more viable than before, but direct reuse of our current checks still implies either binary support or rewriting those assets into a narrower form, IMO. |
12aae95 to
2f53e6d
Compare
|
|
||
| | Option | Summary | Benefits | Costs | | ||
| | --- | --- | --- | --- | | ||
| | **Registry fields + pure Go runtime** | Add three component-level fields (`readiness`, `customChecks`, `crds`) and have the deployment validator run the declared checks across all enabled components selected into the resolved recipe | Typed contract, deployer-neutral, easy to unit test, aligns with current `aicr validate` architecture, keeps the validator Kubernetes-native and extensible | Narrower than full Chainsaw expressiveness; some semantic overlap with repo-side health checks | |
There was a problem hiding this comment.
A cost you are hand waving away is the maintenance of the pure Go solution as well as defining a new DSL to define the checks in. And lets be realistic about the benefits, chainsaw is also a kubernetes-native path. So the only real benefit to pure go is to not ship the chainsaw binary in the validator container.
There was a problem hiding this comment.
aicr validate is still recipe-scoped. --recipe defines which components are expected and what checks apply to them. That is the motivation for adding readiness, customChecks, and crds: they make the per-component deployment checks explicit. If gpu-operator or dynamo are not in the recipe, the validator should not have an opinion about them. So "standalone" here means an independent runtime model, not opinions about arbitrary clusters.
There was a problem hiding this comment.
I am not saying don't have those, and in fact I agree we should define such things if we need a readiness check beyond the standard Ready. What I am questioning is our own Go vs defining these checks in chainsaw.
|
|
||
| The deployment validator uses a narrow, stable v1 contract: | ||
|
|
||
| - `Deployment`: `availableReplicas >= desiredReplicas`, with nil replicas treated |
There was a problem hiding this comment.
Aren't these already checked by k8s itself? I.e a deployment is marked as Ready when this is true? Same for the rest?
|
|
||
| `customChecks` are intentionally narrow: | ||
| - fixed registration map in Go | ||
| - fixed v1 key set: |
There was a problem hiding this comment.
It isn't clear what these would check
There was a problem hiding this comment.
♻️ Duplicate comments (2)
docs/design/006-deployment-validation.md (2)
284-285:⚠️ Potential issue | 🟡 MinorTighten awkward phrasing in the negative consequence bullet.
“requires coordinated Go and registry changes” reads awkwardly; use “requires coordinating Go and registry changes” (or equivalent).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/006-deployment-validation.md` around lines 284 - 285, Update the negative consequence bullet's phrasing: replace the fragment "requires coordinated Go and registry changes" with a clearer formulation such as "requires coordinating Go and registry changes" (or an equivalent that uses the gerund/active form) so the sentence reads smoothly; locate the bullet containing the exact phrase "**Fixed special-case surface**: new bespoke behavior requires coordinated Go and registry changes rather than arbitrary registry-defined assertions." and change only that clause to the improved wording.
301-314:⚠️ Potential issue | 🟠 MajorClarify this as design intent, not landed implementation.
This section is written as already-shipped behavior, but this PR is docs-only and current
pkg/recipe/metadata.gostill does not exposeComponentReffields forreadiness,customChecks, andcrds. Reword to explicit target-state language.✏️ Suggested wording adjustment
-The implementation associated with this ADR includes: +The target implementation associated with this ADR will include (via follow-up code changes): -- readiness metadata on `ComponentConfig` and `ComponentRef` -- hydration of readiness metadata into the resolved recipe -- overlay/external-data merge rules for `readiness`, `customChecks`, and `crds` -- deployment-validator support for: +- readiness metadata on `ComponentConfig` and `ComponentRef` +- hydration of readiness metadata into the resolved recipe +- overlay/external-data merge rules for `readiness`, `customChecks`, and `crds` +- deployment-validator support for: - selector-based discovery - exact workload identities - union + dedup semantics - fixed custom checks - CRD `Established=True` readiness - fail-closed zero-match behavior -- registry mappings for the current component inventory +- registry mappings for the current component inventory + +At the time of writing this ADR, these fields are design intent and are not yet fully represented in `pkg/recipe/metadata.go`.As per coding guidelines, documentation “must not invent features, guarantees, timelines, or roadmap commitments” and must “clearly distinguish current behavior from future intent.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/006-deployment-validation.md` around lines 301 - 314, The doc text presents planned features as already implemented; change the phrasing in the listed bullets to target-state language (e.g., "will include" or "planned to include") and add a clarifying sentence that this ADR describes design intent rather than current shipped behavior, and remove or qualify any items that claim existing exports from pkg/recipe/metadata.go (notably readiness, customChecks, and crds on ComponentRef/ComponentConfig); ensure you reference ComponentConfig and ComponentRef in the clarification and call out that pkg/recipe/metadata.go does not yet expose those fields so readers understand they are future work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 284-285: Update the negative consequence bullet's phrasing:
replace the fragment "requires coordinated Go and registry changes" with a
clearer formulation such as "requires coordinating Go and registry changes" (or
an equivalent that uses the gerund/active form) so the sentence reads smoothly;
locate the bullet containing the exact phrase "**Fixed special-case surface**:
new bespoke behavior requires coordinated Go and registry changes rather than
arbitrary registry-defined assertions." and change only that clause to the
improved wording.
- Around line 301-314: The doc text presents planned features as already
implemented; change the phrasing in the listed bullets to target-state language
(e.g., "will include" or "planned to include") and add a clarifying sentence
that this ADR describes design intent rather than current shipped behavior, and
remove or qualify any items that claim existing exports from
pkg/recipe/metadata.go (notably readiness, customChecks, and crds on
ComponentRef/ComponentConfig); ensure you reference ComponentConfig and
ComponentRef in the clarification and call out that pkg/recipe/metadata.go does
not yet expose those fields so readers understand they are future work.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 1a2c4543-5680-4852-98ab-1b788a9ca9df
📒 Files selected for processing (1)
docs/design/006-deployment-validation.md
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/design/006-deployment-validation.md (1)
284-285:⚠️ Potential issue | 🟡 MinorUse clearer phrasing for the fixed-surface consequence.
Line 284 reads awkwardly (“requires coordinated Go and registry changes”). Prefer “requires coordinating Go and registry changes.”
As per coding guidelines, documentation should ensure clarity and accuracy in wording.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/006-deployment-validation.md` around lines 284 - 285, The sentence under the "**Fixed special-case surface**" section uses awkward phrasing "requires coordinated Go and registry changes"; update that phrase to "requires coordinating Go and registry changes" so the line reads: "**Fixed special-case surface**: new bespoke behavior requires coordinating Go and registry changes." Locate the paragraph beginning with the "**Fixed special-case surface**" heading and replace only the phrase to preserve the surrounding context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 8-10: Reword the wording that asserts implemented behavior to
clearly mark it as design intent/follow-up: change present-tense phrases like
“lands”, “includes” to future/intent phrasing (e.g., “will be implemented”,
“planned for”) in the document sections that reference the primary
implementation and the `ComponentRef` contract; specifically update the
paragraphs that mention `ComponentRef` fields and the lines currently claiming
they exist in `pkg/recipe/metadata.go` so they state these fields are part of
the intended design and not yet present in the current implementation.
---
Duplicate comments:
In `@docs/design/006-deployment-validation.md`:
- Around line 284-285: The sentence under the "**Fixed special-case surface**"
section uses awkward phrasing "requires coordinated Go and registry changes";
update that phrase to "requires coordinating Go and registry changes" so the
line reads: "**Fixed special-case surface**: new bespoke behavior requires
coordinating Go and registry changes." Locate the paragraph beginning with the
"**Fixed special-case surface**" heading and replace only the phrase to preserve
the surrounding context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 2bae840e-035e-4865-b7e5-cdc08c9dae1a
📒 Files selected for processing (2)
docs/contributor/validator.mddocs/design/006-deployment-validation.md
|
@lockwobr Updated the ADR:
|
|
Circling back with a synthesis that ties this ADR to #610. I think the two designs aren't competing — they're solving adjacent problems that share one primitive: "is this component ready?" The shared primitive: If
# #610-style bridge Job under this synthesis
containers:
- name: wait
image: ghcr.io/nvidia/aicr:v1.2.3
command: [aicr, validate, --component, gpu-operator, --phase, deployment, --wait, --timeout, 30m]Key property: the two designs stay in sync by default, with an escape hatch when needed. Since deploy-time readiness (bridge Job) and validate-time readiness (
Precedence model for the two config shapes:
File presence wins. Components stay on the structured path unless they explicitly need the escape hatch. Addresses @ayuskauskas's Chainsaw point indirectly: how Tradeoffs worth naming early:
Coordination point: the boundary between #610 (deploy-time flow control) and this ADR (post-install validation) stays as you've drawn it. The synthesis is just that both delegate to the same underlying readiness implementation rather than re-implementing the check logic in two places. Happy to fold any of this into #610 or open a third issue for |
|
@lockwobr I updated the ADR based on your synthesis. It now makes the shared validator entrypoint explicit, adds the |
|
@ayuskauskas thanks, this is helpful. A few reactions point by point:
So my current preference is to keep the structured contract small ( |
|
how much do use chainsaw now, by not using it we are also then having to go patch all the places it used today. I think we are using it a "good bit" we should consider it before just removing it from the table. @xdu31 would removing chainsaw be a lot of work? Additionally, moving to using it as a lib, how much work is that? |
|
closed it. It looks like Chainsaw is the right direction. The path to enhance aicr validate --phase deployment using Chainsaw is straightforward. We already have per-component Chainsaw health checks. The gap is wiring them into the deployment validation: triggering the existing Chainsaw assets from the validator and enhancing/extending individual Chainsaw component checks as needed. |
Summary
Add ADR-006 to document the proposed registry-driven component check-definition
model and the intended convergence on a shared validator entrypoint.
Motivation / Context
Issue
#622became large enough to warrant a durable architecture record, andthe discussion with
#610made the broader direction clearer. This ADR nowcaptures:
recipes/registry.yamlreadiness,customChecks,crds)aicr validateas the shared validatorentrypoint, including a wait-capable mode for deploy / cleanup and
post-install validation flows
it into a general-purpose DSL
still leaving room for a Chainsaw- or asset-based override path when the
structured schema is not sufficient
This PR is design-only. It does not include the
#622implementation; thatremains follow-up work.
Fixes: N/A
Related: #622, #610
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
docs/design/006-deployment-validation.mdin the existing ADR format.component check-definition model that different deployment-related phases can
consume.
recipes/registry.yaml:readiness,customChecks, andcrds.CRD establishment, merge semantics, wait-capable validator convergence, and
the fixed custom-check key set.
language; it is a typed schema over common Kubernetes check primitives.
crdsstays first-class rather than being folded into genericcustomChecks.#610: share the same underlying componentcheck intent and validator entrypoint where possible, without forcing
deploy-time hook/job details into the validator contract.
validator path over making Chainsaw the primary validator runtime, while
leaving room for a Chainsaw- or asset-based override path.
docs/contributor/validator.mdso the ADR isdiscoverable from the validator architecture docs.
Testing
Local result:
make qualify: passed earlier on the branch before the later ADR-onlyrevisions
make lint: passed after the latest ADR-only revisionsmake qualifyfor the latest commits because this isa doc-only change and the local overlay explicitly allows scoped verification
for doc-only work
Risk Assessment
Rollout notes: N/A
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info