feat(aicr): top-level Go library facade for in-process consumers#1072
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a top-level Go facade package Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@aicr_test.go`:
- Around line 255-258: The test creates a real client via aicr.NewClient with
aicr.WithRecipeSource(tmp) and never closes it; add cleanup to close the client
to avoid leaking per-client cache state—after successful NewClient return,
register a cleanup (e.g., using t.Cleanup or defer) that calls client.Close() so
the client's resources are released; locate the client variable created by the
call to aicr.NewClient and ensure client.Close() is invoked during test
teardown.
- Around line 54-56: Replace the hardcoded absolute path in the test when
calling aicr.NewClient with a path built from the test's temporary directory so
it is guaranteed missing; specifically, construct the recipe path using
t.TempDir() (e.g., filepath.Join(t.TempDir(), "provider-nvidia-aicr-test")) and
pass that into aicr.WithRecipeSource(aicr.FilesystemSource(...)) so the test is
deterministic across environments.
In `@aicr.go`:
- Around line 146-148: The NewClient function currently iterates the opts slice
and calls each opt(c) unguarded, which can panic if an Option entry is nil;
update the loop in NewClient to check for nil before invocation (e.g., skip or
return an error when opt == nil) so Option entries are validated, ensuring you
reference the opts slice and the Option function type and call sites such as
opt(c) safely.
- Around line 297-298: Guard against a nil context before calling
context.WithTimeout in ResolveRecipe, BundleComponents, CollectSnapshot, and
ValidateState: add an explicit check if ctx == nil and return
ErrCodeInvalidRequest (or the function's error return) before invoking
context.WithTimeout(ctx, ...) so the subsequent ctx, cancel :=
context.WithTimeout(...) / defer cancel() cannot panic; update the code paths
where context.WithTimeout is used (the local ctx variable and cancel handling)
to assume ctx is non-nil after the guard.
In `@docs/integrator/go-library.md`:
- Around line 175-178: The docs state that ResolveRecipe applies
defaults.RecipeBuildTimeout as the cap, but the actual facade implementation
uses defaults.RecipeOperationTimeout; update the text to reflect the real
behavior by replacing the reference to defaults.RecipeBuildTimeout with
defaults.RecipeOperationTimeout (or otherwise state that ResolveRecipe uses the
RecipeOperationTimeout cap), and ensure the sentence references the
ResolveRecipe method and the defaults.RecipeOperationTimeout symbol so
integrators are not misled.
In `@docs/integrator/public-api.md`:
- Line 23: The package matrix entry for pkg/validator incorrectly says
"four-phase validation"; update the wording in the public API docs to match the
PR's actual public facade by changing the description for `pkg/validator` to
"three-phase validation (Deployment, Performance, Conformance)" or equivalent
phrasing that lists the three phases, ensuring the `pkg/validator` row text and
any related mention of phase count reflect three phases consistently.
In `@pkg/recipe/metadata_store_test.go`:
- Around line 1677-1680: The test currently only asserts err != nil for the
first canceled load; change it to assert the error carries the structured
timeout code (errors.ErrCodeTimeout) so regressions to plain context.Canceled
fail. Update the test assertions around the first load call (and similarly at
the other spot noted around 1692-1694) to use the pkg/errors helpers—e.g.,
verify the returned error has ErrCodeTimeout (using the package's
structured-error inspection function or errors.PropagateOrWrap semantics) rather
than merely non-nil, so the timeout code is preserved.
In `@pkg/recipe/metadata_store.go`:
- Around line 116-127: Update the LoadMetadataStoreFor godoc to reflect the new
auto-eviction of transient context cancellation errors: remove or correct any
sentence stating that transient cancellations remain cached and require
EvictCachedStore, and instead document that entries whose errs satisfy
isTransientLoadError (e.g., context.Canceled/context.DeadlineExceeded) are
automatically removed by the cache via storeCache.CompareAndDelete on detection,
so callers do not need to call EvictCachedStore to retry after transient
cancel/deadline errors; reference the LoadMetadataStoreFor behavior and the
transient-error handling implemented around storeCache.CompareAndDelete.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 13f06bc1-cc5b-496c-831b-4a9cb85ae60f
📒 Files selected for processing (13)
aicr.goaicr_internal_test.goaicr_test.godocs/design/architecture.mddocs/integrator/go-library.mddocs/integrator/public-api.mdoptions.gopkg/defaults/timeouts.gopkg/recipe/builder.gopkg/recipe/components.gopkg/recipe/metadata_store.gopkg/recipe/metadata_store_test.gotypes.go
e6df948 to
2efc0d9
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
docs/integrator/public-api.md (1)
23-23:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCorrect phase-count wording for
pkg/validatorin the package matrix.Line 23 says "four-phase validation," but this PR's public facade documents three phases (
Deployment,Performance,Conformance). Please reconcile the wording to keep docs consistent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/integrator/public-api.md` at line 23, Update the docs entry for `pkg/validator` in the package matrix to reflect the correct phase count and phrasing: replace "four-phase validation" with "three-phase validation (Deployment, Performance, Conformance)" so the README matches the public facade and lists the actual phases.docs/integrator/go-library.md (1)
175-178:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAlign timeout documentation with the current facade implementation.
Line 177 states
defaults.RecipeBuildTimeoutas the cap, but the facade currently appliesdefaults.RecipeOperationTimeoutinResolveRecipe. Please update this section to avoid misleading integrators.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/integrator/go-library.md` around lines 175 - 178, Update the docs to reflect the actual timeout used by the facade: change the reference from defaults.RecipeBuildTimeout to defaults.RecipeOperationTimeout in the paragraph describing ResolveRecipe (and other context-aware facade methods) so the documentation matches the current implementation where ResolveRecipe applies defaults.RecipeOperationTimeout as the cap on caller-provided contexts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@aicr_test.go`:
- Around line 370-392: The test currently only checks that BundleComponents
returns a non-nil error after client.Close; update the assertion to verify the
structured error code equals ErrCodeInvalidRequest to prevent contract
regressions. Locate the failing branch in the test where client.Close() has been
called and BundleComponents(context.Background(), &aicr.RecipeResult{Components:
nil}) is invoked, then replace the loose nil check with an assertion that
inspects the returned error (e.g., unwrap or type-assert to your SDK error type)
and confirms its code equals aicr.ErrCodeInvalidRequest while still failing the
test if it does not.
- Around line 126-133: The test creates two real clients via NewClient (clientA
and clientB) but never closes them, leaking per-client cache state; after each
successful NewClient call (for clientA and clientB) call and/or defer the
client's Close method (e.g., defer clientA.Close() immediately after clientA is
created and check err, and likewise defer clientB.Close()) so both clients are
properly released at the end of the test and state is not leaked between tests.
In `@aicr.go`:
- Around line 607-617: BundleComponents is honoring ctx but loadManifestFiles
does not, so long loops over ref.ManifestFiles can continue after cancellation;
change loadManifestFiles to accept a context.Context (e.g.,
loadManifestFiles(ctx context.Context, r.internal, dp, facade.Name) or similar),
update its callers in BundleComponents (both the generic and "kustomize" cases)
to pass the ctx, and inside loadManifestFiles check ctx.Done() (or return
ctx.Err()) between/ before file reads (and when returning) so the function
aborts promptly on cancellation and propagates the context error up.
In `@options.go`:
- Around line 55-60: Update the applyValidateOptions doc comment to remove the
incorrect claim that the translation occurs "inside the locked region" and
instead state that ValidateState calls applyValidateOptions after releasing
Client.mu (i.e., the translation happens once per call outside the Client.mu
lock), so future additions to validator.With* only require edits here and not on
the façade; reference applyValidateOptions, ValidateState, and Client.mu in the
revised sentence to make the synchronization scope explicit.
---
Duplicate comments:
In `@docs/integrator/go-library.md`:
- Around line 175-178: Update the docs to reflect the actual timeout used by the
facade: change the reference from defaults.RecipeBuildTimeout to
defaults.RecipeOperationTimeout in the paragraph describing ResolveRecipe (and
other context-aware facade methods) so the documentation matches the current
implementation where ResolveRecipe applies defaults.RecipeOperationTimeout as
the cap on caller-provided contexts.
In `@docs/integrator/public-api.md`:
- Line 23: Update the docs entry for `pkg/validator` in the package matrix to
reflect the correct phase count and phrasing: replace "four-phase validation"
with "three-phase validation (Deployment, Performance, Conformance)" so the
README matches the public facade and lists the actual phases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 9c3a8bf7-af88-4ad3-833f-97279ebb9e4c
📒 Files selected for processing (12)
aicr.goaicr_internal_test.goaicr_test.godocs/integrator/go-library.mddocs/integrator/public-api.mdoptions.gopkg/defaults/timeouts.gopkg/recipe/builder.gopkg/recipe/components.gopkg/recipe/metadata_store.gopkg/recipe/metadata_store_test.gotypes.go
|
Thanks for the review. All 8 fixes landed in
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
options.go (1)
55-59:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCorrect lock-scope wording in
applyValidateOptionsgodoc.The comment says translation happens “inside the locked region,” but
Client.ValidateStatecallsapplyValidateOptionsafterClient.muis unlocked. Please update this sentence to match the real synchronization scope.As per coding guidelines, “Ensure code comments and godoc for API changes are accurate and helpful.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@options.go` around lines 55 - 59, The godoc for applyValidateOptions incorrectly states the translation happens “inside the locked region”; update the sentence to reflect that applyValidateOptions is invoked after Client.mu is released by Client.ValidateState (i.e., the translation occurs outside the locked region / after unlocking). Edit the comment in applyValidateOptions to say the translation happens once per call and is performed outside the Client.mu lock (so future fields added to validator.With* require one edit here and none on the facade surface).aicr.go (1)
833-858:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPropagate
ctxintoloadManifestFilesand abort on cancellation.
BundleComponentsis cancellation-aware, but this nested I/O loop is not. A canceled request can still read multiple manifest files before returning.Suggested fix
- manifests, err := loadManifestFiles(r.internal, dp, facade.Name) + manifests, err := loadManifestFiles(ctx, r.internal, dp, facade.Name) @@ - manifests, err := loadManifestFiles(r.internal, dp, facade.Name) + manifests, err := loadManifestFiles(ctx, r.internal, dp, facade.Name) @@ -func loadManifestFiles(internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) { +func loadManifestFiles(ctx context.Context, internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) { @@ for _, path := range ref.ManifestFiles { + if err := ctx.Err(); err != nil { + return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled while loading manifests", err) + } content, err := recipe.GetManifestContentWithProvider(dp, path)As per coding guidelines, “Check context in loops — Always check
ctx.Done()in long-running operations.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@aicr.go` around lines 833 - 858, Update loadManifestFiles to accept a context (change signature to loadManifestFiles(ctx context.Context, internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error)) and make the manifest-reading loop cancellation-aware: before each iteration (and optionally after each read) check ctx.Done() and if canceled return ctx.Err() (wrapped into the package error type like other failures); also, if a context-aware manifest reader exists (e.g., recipe.GetManifestContentWithProviderWithContext or similar), call that instead of recipe.GetManifestContentWithProvider so the underlying I/O is also cancellable. Ensure references to loadManifestFiles callers (e.g., BundleComponents) are updated to pass the ctx.aicr_test.go (1)
131-138:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClose both clients created in this test.
Line 131 and Line 135 allocate real clients but never close them. That can leak per-client cache state and make adjacent tests less isolated.
Suggested patch
clientA, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirA))) if err != nil { t.Fatalf("NewClient A: %v", err) } + t.Cleanup(func() { _ = clientA.Close() }) clientB, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirB))) if err != nil { t.Fatalf("NewClient B: %v", err) } + t.Cleanup(func() { _ = clientB.Close() })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@aicr_test.go` around lines 131 - 138, The test creates two clients via aicr.NewClient (clientA and clientB) but never closes them; call the clients' Close method (e.g., defer clientA.Close() immediately after clientA is created and err checked, and defer clientB.Close() after clientB is created) or explicitly call clientA.Close() and clientB.Close() at the end of the test to release per-client cache state and avoid leaking resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@aicr_test.go`:
- Around line 198-201: The goroutine currently discards the error from
client.Close(); instead capture and assert the Close() result: in the goroutine
call err := client.Close() and send err on a dedicated channel (e.g.,
closeErrCh) before calling wg.Done(); after wg.Wait() read from closeErrCh and
fail the test if err != nil (t.Fatalf or require.NoError). Apply the same
pattern for the other identical goroutine usage so both client.Close() calls are
checked.
---
Duplicate comments:
In `@aicr_test.go`:
- Around line 131-138: The test creates two clients via aicr.NewClient (clientA
and clientB) but never closes them; call the clients' Close method (e.g., defer
clientA.Close() immediately after clientA is created and err checked, and defer
clientB.Close() after clientB is created) or explicitly call clientA.Close() and
clientB.Close() at the end of the test to release per-client cache state and
avoid leaking resources.
In `@aicr.go`:
- Around line 833-858: Update loadManifestFiles to accept a context (change
signature to loadManifestFiles(ctx context.Context, internal
*recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte,
error)) and make the manifest-reading loop cancellation-aware: before each
iteration (and optionally after each read) check ctx.Done() and if canceled
return ctx.Err() (wrapped into the package error type like other failures);
also, if a context-aware manifest reader exists (e.g.,
recipe.GetManifestContentWithProviderWithContext or similar), call that instead
of recipe.GetManifestContentWithProvider so the underlying I/O is also
cancellable. Ensure references to loadManifestFiles callers (e.g.,
BundleComponents) are updated to pass the ctx.
In `@options.go`:
- Around line 55-59: The godoc for applyValidateOptions incorrectly states the
translation happens “inside the locked region”; update the sentence to reflect
that applyValidateOptions is invoked after Client.mu is released by
Client.ValidateState (i.e., the translation occurs outside the locked region /
after unlocking). Edit the comment in applyValidateOptions to say the
translation happens once per call and is performed outside the Client.mu lock
(so future fields added to validator.With* require one edit here and none on the
facade surface).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7189d37d-d813-4114-a306-a23ac77982a3
📒 Files selected for processing (12)
aicr.goaicr_internal_test.goaicr_test.godocs/integrator/go-library.mddocs/integrator/public-api.mdoptions.gopkg/defaults/timeouts.gopkg/recipe/builder.gopkg/recipe/components.gopkg/recipe/metadata_store.gopkg/recipe/metadata_store_test.gotypes.go
2efc0d9 to
41042a1
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (5)
aicr.go (1)
620-633:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHonor
ctxcancellation while loading supplemental manifest files.
loadManifestFiles(lines 832-858) loops overref.ManifestFilesand performs manifest reads without checking context cancellation — the helper currently has nocontext.Contextparameter. A canceled request continues reading files until the loop finishes, even thoughBundleComponentschecksctxonly at the component level.Proposed fix
- manifests, err := loadManifestFiles(r.internal, dp, facade.Name) + manifests, err := loadManifestFiles(ctx, r.internal, dp, facade.Name) @@ -func loadManifestFiles(internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) { +func loadManifestFiles(ctx context.Context, internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) { @@ for _, path := range ref.ManifestFiles { + if err := ctx.Err(); err != nil { + return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled while loading manifests", err) + } content, err := recipe.GetManifestContentWithProvider(dp, path)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@aicr.go` around lines 620 - 633, The helper loadManifestFiles should honor cancellation: add a context.Context parameter to loadManifestFiles and in its loop over ref.ManifestFiles check ctx.Done (or call ctx.Err()) before each read to abort early; update all callers (e.g., BundleComponents where loadManifestFiles(r.internal, dp, facade.Name) is invoked in the "manifests" and "kustomize" branches) to pass the incoming ctx (or an appropriate derived context) so that canceled requests stop reading files immediately; ensure the function returns promptly with ctx.Err() when canceled.aicr_test.go (3)
131-138:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClose both clients in this test to avoid leaking per-client cache state.
Lines 131 and 135 create real clients but never release them. Add cleanup for both to maintain test isolation.
Suggested patch
clientA, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirA))) if err != nil { t.Fatalf("NewClient A: %v", err) } + t.Cleanup(func() { _ = clientA.Close() }) clientB, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirB))) if err != nil { t.Fatalf("NewClient B: %v", err) } + t.Cleanup(func() { _ = clientB.Close() })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@aicr_test.go` around lines 131 - 138, The test creates two real clients (clientA and clientB) via aicr.NewClient but never closes them, leaking per-client cache/state; ensure both clients are closed—either immediately defer clientA.Close() after successful creation of clientA and defer clientB.Close() after successful creation of clientB, or explicitly call clientA.Close() and clientB.Close() at test teardown so any per-client resources are released; update the test to call the Close method on both clientA and clientB (references: aicr.NewClient, clientA, clientB).
198-203: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueAssert
Close()succeeds in the concurrency lifecycle test.Line 200 discards the
Close()error, which could hide regressions in shutdown/drain behavior. WhileClose()currently always returns nil, capturing the error would catch future changes.Suggested patch
+ var closeErr error go func() { defer wg.Done() - _ = client.Close() + closeErr = client.Close() }() wg.Wait() + if closeErr != nil { + t.Errorf("trial %d: Close: %v", trial, closeErr) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@aicr_test.go` around lines 198 - 203, Capture the error returned by client.Close() in the goroutine and assert it succeeds instead of discarding it; replace `_ = client.Close()` with saving the result (e.g., `err := client.Close()`) and fail the test on non-nil (using t.Fatalf/require.NoError) so the concurrency lifecycle test (the goroutine that calls client.Close() and the wg synchronization) will surface any future shutdown/drain regressions.
392-398:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert structured error code for the closed-client branch.
This block currently only asserts "error is non-nil" but should also verify
ErrCodeInvalidRequestso contract regressions are caught.Suggested patch
_, err = client.BundleComponents(context.Background(), &aicr.RecipeResult{Components: nil}, ) if err == nil { t.Fatal("expected error after Close, got nil") } + var se *aicrerrors.StructuredError + if !errors.As(err, &se) || se.Code != aicrerrors.ErrCodeInvalidRequest { + t.Errorf("expected ErrCodeInvalidRequest after Close; got %v", err) + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@aicr_test.go` around lines 392 - 398, The test currently only checks for a non-nil error from client.BundleComponents; change it to assert the returned error encodes the specific contract error ErrCodeInvalidRequest. After calling client.BundleComponents with &aicr.RecipeResult{Components:nil}, replace the single non-nil check with logic that extracts the AICR error (e.g., using errors.As to cast to the package error type or the test project's helper used elsewhere) and assert its code equals aicr.ErrCodeInvalidRequest so the closed-client branch is explicitly validated.options.go (1)
55-60:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix inaccurate lock-scope statement in
applyValidateOptionsdocs.Line 58 states the translation happens "inside the locked region," but
ValidateStateinvokesapplyValidateOptionsafter releasingClient.mu(see aicr.go line 797). This misleads future synchronization work.Suggested doc fix
// applyValidateOptions builds the []validator.Option slice from a // validateConfig populated by the WithValidation* factories. Called -// from Client.ValidateState; the translation happens once per call, -// inside the locked region, so a future field added to validator.With* +// from Client.ValidateState; the translation happens once per call, +// before validator construction, so a future field added to validator.With* // is one edit here and zero edits on the facade surface.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@options.go` around lines 55 - 60, The doc comment on applyValidateOptions incorrectly claims the translation occurs "inside the locked region"; update the comment to state that applyValidateOptions is called after Client.mu is released (from ValidateState) so the translation happens outside the lock, and adjust wording to avoid implying any synchronization guarantees—reference applyValidateOptions and ValidateState (and Client.mu) in the text so future readers know where the call occurs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/go-library.md`:
- Line 145: Replace the speculative sentence "Additional sources (HTTP, Git) may
be added in future minor releases." with a factual, present-tense statement
reflecting current capabilities (e.g., "Currently supports local file and OCI
sources." or "No additional sources are supported at this time.") so the
documentation does not imply an uncommitted roadmap; update the sentence in
docs/integrator/go-library.md where that exact phrase appears.
- Around line 168-176: The documentation wording is inaccurate: the facade
applies a per-operation timeout unconditionally (it always calls
context.WithTimeout), so the effective deadline is the minimum of the caller's
deadline and the facade cap. Update the text to state that each context-aware
facade method (e.g., ResolveRecipe, BundleComponents, CollectSnapshot,
ValidateState) always wraps the caller context with the per-operation cap from
pkg/defaults (or AgentConfig.Timeout for CollectSnapshot), and that the
effective deadline used is the smaller of the caller's deadline and the facade's
cap.
---
Duplicate comments:
In `@aicr_test.go`:
- Around line 131-138: The test creates two real clients (clientA and clientB)
via aicr.NewClient but never closes them, leaking per-client cache/state; ensure
both clients are closed—either immediately defer clientA.Close() after
successful creation of clientA and defer clientB.Close() after successful
creation of clientB, or explicitly call clientA.Close() and clientB.Close() at
test teardown so any per-client resources are released; update the test to call
the Close method on both clientA and clientB (references: aicr.NewClient,
clientA, clientB).
- Around line 198-203: Capture the error returned by client.Close() in the
goroutine and assert it succeeds instead of discarding it; replace `_ =
client.Close()` with saving the result (e.g., `err := client.Close()`) and fail
the test on non-nil (using t.Fatalf/require.NoError) so the concurrency
lifecycle test (the goroutine that calls client.Close() and the wg
synchronization) will surface any future shutdown/drain regressions.
- Around line 392-398: The test currently only checks for a non-nil error from
client.BundleComponents; change it to assert the returned error encodes the
specific contract error ErrCodeInvalidRequest. After calling
client.BundleComponents with &aicr.RecipeResult{Components:nil}, replace the
single non-nil check with logic that extracts the AICR error (e.g., using
errors.As to cast to the package error type or the test project's helper used
elsewhere) and assert its code equals aicr.ErrCodeInvalidRequest so the
closed-client branch is explicitly validated.
In `@aicr.go`:
- Around line 620-633: The helper loadManifestFiles should honor cancellation:
add a context.Context parameter to loadManifestFiles and in its loop over
ref.ManifestFiles check ctx.Done (or call ctx.Err()) before each read to abort
early; update all callers (e.g., BundleComponents where
loadManifestFiles(r.internal, dp, facade.Name) is invoked in the "manifests" and
"kustomize" branches) to pass the incoming ctx (or an appropriate derived
context) so that canceled requests stop reading files immediately; ensure the
function returns promptly with ctx.Err() when canceled.
In `@options.go`:
- Around line 55-60: The doc comment on applyValidateOptions incorrectly claims
the translation occurs "inside the locked region"; update the comment to state
that applyValidateOptions is called after Client.mu is released (from
ValidateState) so the translation happens outside the lock, and adjust wording
to avoid implying any synchronization guarantees—reference applyValidateOptions
and ValidateState (and Client.mu) in the text so future readers know where the
call occurs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 40127a6d-18ad-433c-870b-b5ff11ba4e6f
📒 Files selected for processing (12)
aicr.goaicr_internal_test.goaicr_test.godocs/integrator/go-library.mddocs/integrator/public-api.mdoptions.gopkg/defaults/timeouts.gopkg/recipe/builder.gopkg/recipe/components.gopkg/recipe/metadata_store.gopkg/recipe/metadata_store_test.gotypes.go
41042a1 to
938d5c7
Compare
|
Second round of fixes in
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@options.go`:
- Around line 119-121: The WithValidation* option factories
(WithValidationImagePullSecrets, WithValidationTolerations,
WithValidationNodeSelector) must defensively clone caller-provided slices/maps
before storing them on validateConfig to avoid data races and preserve
nil-vs-empty semantics: for each slice argument, if it is nil leave nil,
otherwise allocate a new slice with make and copy(...) the contents and assign
that copy to c.imagePullSecrets / c.tolerations; for each map argument, if it is
nil leave nil, otherwise allocate a new map and copy all key/value pairs (for
k,v := range src { dst[k]=v }) and assign that copy to c.nodeSelector; this
ensures downstream code (e.g., serializeNodeSelector in pkg/api/validator/v1)
does not observe concurrent mutations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ee9593d7-0b29-4c62-910a-ef81efda06f8
📒 Files selected for processing (12)
aicr.goaicr_internal_test.goaicr_test.godocs/integrator/go-library.mddocs/integrator/public-api.mdoptions.gopkg/defaults/timeouts.gopkg/recipe/builder.gopkg/recipe/components.gopkg/recipe/metadata_store.gopkg/recipe/metadata_store_test.gotypes.go
938d5c7 to
e9c2a70
Compare
|
Third round of fixes in
|
e9c2a70 to
e7aface
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Foundational correctness work (inflight drain, transient-error eviction, owner-token guard, bounded timeouts) is excellent and the race-detector tests pin the actual bugs.
Five inline comments along three axes against #1071's contract: (1) the facade creates a third DataProvider → Builder → Bundler → Validator wiring path without retiring CLI's recipe.SetDataProvider or migrating the REST server — both must follow before this PR's architectural payoff lands; (2) aicr.Snapshot/AgentConfig/PhaseResult/Phase are transparent aliases of packages the matrix marks Internal or "Public (evolving)" — semver-stable facade can't honestly identity-alias unstable types; (3) _ForTesting helpers leaked onto pkg/recipe's public surface, and the cross-Client guard protects only facade callers, not direct pkg/recipe.Builder users (which today includes pkg/api/server.go).
Not blocking the spirit of the change — (1) and (3) are fine as tracked follow-ups, (2) needs to be decided before merge so the documented contract is honest from day one.
|
Filed follow-up tracking for the deferred work this PR introduces:
#1078 is the only one I'd consider blocking on for this PR — the others are clean follow-ups. The contradiction between |
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
e7aface to
1deb73f
Compare
|
Thanks for filing #1076 / #1077 / #1078 / #1079 / #1080 — much appreciated. For the #1078 contradiction blocker, taking the lighter path in Concretely:
Does this resolve #1078's "blocking on this PR" concern, with the full alias-wrap staying tracked in #1078 for the next milestone? Happy to do the full wrap in this PR instead if you'd prefer that — wanted to check before paying the ~300 LoC cost on speculation. |
yes
no, lets land this and use the epic issues above to drive the full scope to completion |
Summary
Introduce
github.com/NVIDIA/aicras the project's stable, semver-trackedGo API for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without importing
internal packages directly.
The facade is a sibling of the existing CLI binary and REST server (not a
layer on top); each
Clientowns its own DataProvider, eliminating theprocess-global state that previously caused two consumers in one process
to silently corrupt each other's recipe caches.
Motivation / Context
Fixes #1071
A new class of consumer — long-running Kubernetes controllers and
similar in-process integrations — needs to call AICR's logic without
spawning a subprocess or running a REST sidecar. Today they have to
stitch together internal
pkg/recipeandpkg/bundlercalls, have nosemver-tracked contract to pin against, and hit a latent cross-source
corruption bug when two recipe sources are active in one process. This
PR closes that.
Type of Change
Components Affected
github.com/NVIDIA/aicrpkg/recipe— per-DataProvider caching +WithDataProvideroption +Evicthelperspkg/bundler—WithDataProvideroption; all internal reads (values, manifests, registry) routed through*WithProvidervariantspkg/defaults— newSnapshotOperationTimeout/ValidationOperationTimeoutvalidators/deployment— switched off deprecatedGetManifestContentImplementation Notes
See the commit message for the full breakdown. Highlights:
store and component-registry caches now key on
DataProviderpointeridentity;
Client.Close()explicitly evicts via the existing publicEvicthelpers.Closedrains in-flightResolveRecipe/BundleComponents/CollectSnapshot/ValidateStatecalls beforeevicting. Without this drain, a resolve that released the read lock
before
LoadMetadataStoreForcould repopulatestoreCache[dp]AFTERClose's eviction — silently leaking cache entries past Client lifetime.
Each entry point registers under the read lock via a new
inflight sync.WaitGroup; Close write-locks, releases, thenWait()s.TestClient_CloseDrainsInflightResolvepins this race deterministicallywith a
blockingDataProviderparkingWalkDiron a channel.LoadMetadataStoreForno longer cachescontext.Canceled/context.DeadlineExceedederrors permanently. Thesync.Oncesemantics would otherwise poison the entry for the Client'slifetime; now the entry is
CompareAndDeleted on transient error so afollow-up call with a healthy ctx loads from scratch.
builder.BuildFromCriteria's wrap switched toPropagateOrWrapso theinner
ErrCodeTimeoutsurvives — callers (and the eviction itself)rely on the structured timeout code.
ResolveRecipestamps the producingClientonits returned
RecipeResult;BundleComponents/ValidateStaterejectcross-client misuse with
ErrCodeInvalidRequest. Without this, acontroller managing N ProviderConfigs could silently mix
resolved-against-A refs with values-loaded-via-B reads.
passing
context.Background()can't hang reconciliation on stuckKubernetes I/O.
implementation to consume the facade is intentionally deferred to a
follow-up issue to keep this PR scoped and reviewable.
Testing
make qualifypasses locally end-to-end (unit +-race+ chainsawe2e + yamllint + license headers + sidebar check + vulnerability scan).
directly:
TestClient_ConcurrentResolveScopesToOwnSource— two clients, twooverlay sources, concurrent resolves under
-race, each returns itsown marker component and never the other's.
TestClient_NoCacheGrowthAcrossManyCloseCycles— 50 NewClient →ResolveRecipe → Close cycles, asserts metadata-store and registry
cache counts return to baseline.
TestClient_CloseDrainsInflightResolve—blockingDataProviderparks
WalkDiron a channel;Closemust wait for the in-flightresolve before evicting; assertion uses per-DataProvider
CachedStoreContainsForTesting/CachedRegistryContainsForTesting(robust under parallel test execution).
pkg/recipetestTestLoadMetadataStoreFor_TransientErrorIsNotCached— pre-canceledctx on first load surfaces an error; second call with healthy ctx
succeeds, proving the transient-error eviction.
Nodesrejection, OS-enables-OS-pinned-overlays, cross-client
BundleComponents/ValidateStaterejection.Risk Assessment
OpenAPI schema, and existing tests are untouched. The chainsaw e2e
suite is the regression gate and passes.
github.com/NVIDIA/aicr.ValidateOptionis already wrapped as afacade-owned functional-option type so
pkg/validator.Optionchangesdon't propagate. The remaining re-exports (
Snapshot,AgentConfig,PhaseResult,Phase, and the phase constants) are still transparentaliases of
pkg/snapshotter/pkg/validator; promoting them tofacade-owned wrappers is worth scoping separately.
Checklist
docs/integrator/go-library.md, public-APImatrix, architecture deep-dive)
make qualifyis green-S)Co-Authored-Bylines