Skip to content

feat(aicr): top-level Go library facade for in-process consumers#1072

Merged
mchmarny merged 3 commits into
NVIDIA:mainfrom
hkii:feat/aicr-client-facade
May 28, 2026
Merged

feat(aicr): top-level Go library facade for in-process consumers#1072
mchmarny merged 3 commits into
NVIDIA:mainfrom
hkii:feat/aicr-client-facade

Conversation

@hkii

@hkii hkii commented May 28, 2026

Copy link
Copy Markdown
Member

Summary

Introduce github.com/NVIDIA/aicr as the project's stable, semver-tracked
Go API for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without importing
internal packages directly.

The facade is a sibling of the existing CLI binary and REST server (not a
layer on top); each Client owns its own DataProvider, eliminating the
process-global state that previously caused two consumers in one process
to silently corrupt each other's recipe caches.

Motivation / Context

Fixes #1071

A new class of consumer — long-running Kubernetes controllers and
similar in-process integrations — needs to call AICR's logic without
spawning a subprocess or running a REST sidecar. Today they have to
stitch together internal pkg/recipe and pkg/bundler calls, have no
semver-tracked contract to pin against, and hit a latent cross-source
corruption bug when two recipe sources are active in one process. This
PR closes that.

Type of Change

  • New feature (non-breaking change which adds functionality)

Components Affected

  • New top-level package github.com/NVIDIA/aicr
  • pkg/recipe — per-DataProvider caching + WithDataProvider option + Evict helpers
  • pkg/bundlerWithDataProvider option; all internal reads (values, manifests, registry) routed through *WithProvider variants
  • pkg/defaults — new SnapshotOperationTimeout / ValidationOperationTimeout
  • validators/deployment — switched off deprecated GetManifestContent
  • Docs: integrator guide, public-API matrix

Implementation Notes

See the commit message for the full breakdown. Highlights:

  • Per-Client isolation is the load-bearing correctness change. Metadata-
    store and component-registry caches now key on DataProvider pointer
    identity; Client.Close() explicitly evicts via the existing public
    Evict helpers.
  • Shutdown safety: Close drains in-flight ResolveRecipe /
    BundleComponents / CollectSnapshot / ValidateState calls before
    evicting. Without this drain, a resolve that released the read lock
    before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER
    Close's eviction — silently leaking cache entries past Client lifetime.
    Each entry point registers under the read lock via a new
    inflight sync.WaitGroup; Close write-locks, releases, then Wait()s.
    TestClient_CloseDrainsInflightResolve pins this race deterministically
    with a blockingDataProvider parking WalkDir on a channel.
  • Transient-error resilience: LoadMetadataStoreFor no longer caches
    context.Canceled / context.DeadlineExceeded errors permanently. The
    sync.Once semantics would otherwise poison the entry for the Client's
    lifetime; now the entry is CompareAndDeleted on transient error so a
    follow-up call with a healthy ctx loads from scratch.
    builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the
    inner ErrCodeTimeout survives — callers (and the eviction itself)
    rely on the structured timeout code.
  • Owner-token guard: ResolveRecipe stamps the producing Client on
    its returned RecipeResult; BundleComponents / ValidateState reject
    cross-client misuse with ErrCodeInvalidRequest. Without this, a
    controller managing N ProviderConfigs could silently mix
    resolved-against-A refs with values-loaded-via-B reads.
  • Bounded timeouts at every facade entry point so a controller
    passing context.Background() can't hang reconciliation on stuck
    Kubernetes I/O.
  • CLI and REST surfaces are unchanged. Migrating their internal
    implementation to consume the facade is intentionally deferred to a
    follow-up issue to keep this PR scoped and reviewable.

Testing

  • make qualify passes locally end-to-end (unit + -race + chainsaw
    e2e + yamllint + license headers + sidebar check + vulnerability scan).
  • New facade-package tests cover the two acceptance criteria from [Feature]: Stable embeddable Go library surface for in-process recipe resolution #1071
    directly:
    • TestClient_ConcurrentResolveScopesToOwnSource — two clients, two
      overlay sources, concurrent resolves under -race, each returns its
      own marker component and never the other's.
    • TestClient_NoCacheGrowthAcrossManyCloseCycles — 50 NewClient →
      ResolveRecipe → Close cycles, asserts metadata-store and registry
      cache counts return to baseline.
    • TestClient_CloseDrainsInflightResolveblockingDataProvider
      parks WalkDir on a channel; Close must wait for the in-flight
      resolve before evicting; assertion uses per-DataProvider
      CachedStoreContainsForTesting / CachedRegistryContainsForTesting
      (robust under parallel test execution).
  • New pkg/recipe test
    TestLoadMetadataStoreFor_TransientErrorIsNotCached — pre-canceled
    ctx on first load surfaces an error; second call with healthy ctx
    succeeds, proving the transient-error eviction.
  • New facade tests for the field-level guards: negative-Nodes
    rejection, OS-enables-OS-pinned-overlays, cross-client
    BundleComponents / ValidateState rejection.

Risk Assessment

  • Blast radius on existing surfaces: zero. CLI flags, REST endpoints,
    OpenAPI schema, and existing tests are untouched. The chainsaw e2e
    suite is the regression gate and passes.
  • Public-surface lock-in: this PR establishes the v0.x contract for
    github.com/NVIDIA/aicr. ValidateOption is already wrapped as a
    facade-owned functional-option type so pkg/validator.Option changes
    don't propagate. The remaining re-exports (Snapshot, AgentConfig,
    PhaseResult, Phase, and the phase constants) are still transparent
    aliases of pkg/snapshotter / pkg/validator; promoting them to
    facade-owned wrappers is worth scoping separately.

Checklist

  • Tests added and passing
  • Documentation updated (docs/integrator/go-library.md, public-API
    matrix, architecture deep-dive)
  • make qualify is green
  • Commit signed (-S)
  • No Co-Authored-By lines
  • PR title under 70 characters

@hkii hkii requested a review from a team as a code owner May 28, 2026 04:29
@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @hkii! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a top-level Go facade package github.com/NVIDIA/aicr providing a concurrency-safe Client with NewClient, Close, ResolveRecipe, BundleComponents, CollectSnapshot, and ValidateState. Each Client builds its own recipe.DataProvider and recipe.Builder to ensure per-client isolation and evicts per-provider caches on Close. The change introduces public request/response types and options, operation timeout ceilings, preservation of structured cancellation/timeouts, eviction of transient metadata-store load failures, test-only cache inspection helpers, comprehensive unit/integration and internal concurrency tests, and integrator documentation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • NVIDIA/aicr#1015: Adds provider-bound metadata-store/registry caching and eviction primitives that the new aicr.Client relies on and explicitly evicts.

Suggested labels

enhancement, area/tests, area/bundler, area/api

Suggested reviewers

  • mchmarny
  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: introducing a top-level Go library facade named 'aicr' for in-process consumers. It is specific and directly relates to the primary objective.
Description check ✅ Passed The description is comprehensive and directly related to the changeset. It explains the motivation, implementation approach, testing strategy, and risk assessment, all relevant to the PR's core objective of adding a stable Go API facade.
Linked Issues check ✅ Passed The PR fully addresses all coding-related requirements from issue #1071: delivers a stable, semver-tracked github.com/NVIDIA/aicr package with Client.ResolveRecipe/BundleComponents/CollectSnapshot/ValidateState/Close; ensures per-Client DataProvider isolation; implements cache eviction; maintains CLI/REST compatibility; includes comprehensive tests for concurrent isolation and cache evaporation; and passes make qualify.
Out of Scope Changes check ✅ Passed All changes are in scope: the new top-level aicr package and its facade (Client, types, options), per-DataProvider caching in pkg/recipe, WithDataProvider support in pkg/bundler, new timeout constants, transient-error eviction, test helpers, documentation, and supporting tests. No unrelated changes are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@aicr_test.go`:
- Around line 255-258: The test creates a real client via aicr.NewClient with
aicr.WithRecipeSource(tmp) and never closes it; add cleanup to close the client
to avoid leaking per-client cache state—after successful NewClient return,
register a cleanup (e.g., using t.Cleanup or defer) that calls client.Close() so
the client's resources are released; locate the client variable created by the
call to aicr.NewClient and ensure client.Close() is invoked during test
teardown.
- Around line 54-56: Replace the hardcoded absolute path in the test when
calling aicr.NewClient with a path built from the test's temporary directory so
it is guaranteed missing; specifically, construct the recipe path using
t.TempDir() (e.g., filepath.Join(t.TempDir(), "provider-nvidia-aicr-test")) and
pass that into aicr.WithRecipeSource(aicr.FilesystemSource(...)) so the test is
deterministic across environments.

In `@aicr.go`:
- Around line 146-148: The NewClient function currently iterates the opts slice
and calls each opt(c) unguarded, which can panic if an Option entry is nil;
update the loop in NewClient to check for nil before invocation (e.g., skip or
return an error when opt == nil) so Option entries are validated, ensuring you
reference the opts slice and the Option function type and call sites such as
opt(c) safely.
- Around line 297-298: Guard against a nil context before calling
context.WithTimeout in ResolveRecipe, BundleComponents, CollectSnapshot, and
ValidateState: add an explicit check if ctx == nil and return
ErrCodeInvalidRequest (or the function's error return) before invoking
context.WithTimeout(ctx, ...) so the subsequent ctx, cancel :=
context.WithTimeout(...) / defer cancel() cannot panic; update the code paths
where context.WithTimeout is used (the local ctx variable and cancel handling)
to assume ctx is non-nil after the guard.

In `@docs/integrator/go-library.md`:
- Around line 175-178: The docs state that ResolveRecipe applies
defaults.RecipeBuildTimeout as the cap, but the actual facade implementation
uses defaults.RecipeOperationTimeout; update the text to reflect the real
behavior by replacing the reference to defaults.RecipeBuildTimeout with
defaults.RecipeOperationTimeout (or otherwise state that ResolveRecipe uses the
RecipeOperationTimeout cap), and ensure the sentence references the
ResolveRecipe method and the defaults.RecipeOperationTimeout symbol so
integrators are not misled.

In `@docs/integrator/public-api.md`:
- Line 23: The package matrix entry for pkg/validator incorrectly says
"four-phase validation"; update the wording in the public API docs to match the
PR's actual public facade by changing the description for `pkg/validator` to
"three-phase validation (Deployment, Performance, Conformance)" or equivalent
phrasing that lists the three phases, ensuring the `pkg/validator` row text and
any related mention of phase count reflect three phases consistently.

In `@pkg/recipe/metadata_store_test.go`:
- Around line 1677-1680: The test currently only asserts err != nil for the
first canceled load; change it to assert the error carries the structured
timeout code (errors.ErrCodeTimeout) so regressions to plain context.Canceled
fail. Update the test assertions around the first load call (and similarly at
the other spot noted around 1692-1694) to use the pkg/errors helpers—e.g.,
verify the returned error has ErrCodeTimeout (using the package's
structured-error inspection function or errors.PropagateOrWrap semantics) rather
than merely non-nil, so the timeout code is preserved.

In `@pkg/recipe/metadata_store.go`:
- Around line 116-127: Update the LoadMetadataStoreFor godoc to reflect the new
auto-eviction of transient context cancellation errors: remove or correct any
sentence stating that transient cancellations remain cached and require
EvictCachedStore, and instead document that entries whose errs satisfy
isTransientLoadError (e.g., context.Canceled/context.DeadlineExceeded) are
automatically removed by the cache via storeCache.CompareAndDelete on detection,
so callers do not need to call EvictCachedStore to retry after transient
cancel/deadline errors; reference the LoadMetadataStoreFor behavior and the
transient-error handling implemented around storeCache.CompareAndDelete.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 13f06bc1-cc5b-496c-831b-4a9cb85ae60f

📥 Commits

Reviewing files that changed from the base of the PR and between e565ce0 and f41d4cb.

📒 Files selected for processing (13)
  • aicr.go
  • aicr_internal_test.go
  • aicr_test.go
  • docs/design/architecture.md
  • docs/integrator/go-library.md
  • docs/integrator/public-api.md
  • options.go
  • pkg/defaults/timeouts.go
  • pkg/recipe/builder.go
  • pkg/recipe/components.go
  • pkg/recipe/metadata_store.go
  • pkg/recipe/metadata_store_test.go
  • types.go

Comment thread aicr_test.go
Comment thread aicr_test.go
Comment thread aicr.go
Comment thread aicr.go
Comment thread docs/integrator/go-library.md
Comment thread docs/integrator/public-api.md Outdated
Comment thread pkg/recipe/metadata_store_test.go
Comment thread pkg/recipe/metadata_store.go
@hkii hkii force-pushed the feat/aicr-client-facade branch 3 times, most recently from e6df948 to 2efc0d9 Compare May 28, 2026 04:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
docs/integrator/public-api.md (1)

23-23: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Correct phase-count wording for pkg/validator in the package matrix.

Line 23 says "four-phase validation," but this PR's public facade documents three phases (Deployment, Performance, Conformance). Please reconcile the wording to keep docs consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/integrator/public-api.md` at line 23, Update the docs entry for
`pkg/validator` in the package matrix to reflect the correct phase count and
phrasing: replace "four-phase validation" with "three-phase validation
(Deployment, Performance, Conformance)" so the README matches the public facade
and lists the actual phases.
docs/integrator/go-library.md (1)

175-178: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align timeout documentation with the current facade implementation.

Line 177 states defaults.RecipeBuildTimeout as the cap, but the facade currently applies defaults.RecipeOperationTimeout in ResolveRecipe. Please update this section to avoid misleading integrators.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/integrator/go-library.md` around lines 175 - 178, Update the docs to
reflect the actual timeout used by the facade: change the reference from
defaults.RecipeBuildTimeout to defaults.RecipeOperationTimeout in the paragraph
describing ResolveRecipe (and other context-aware facade methods) so the
documentation matches the current implementation where ResolveRecipe applies
defaults.RecipeOperationTimeout as the cap on caller-provided contexts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@aicr_test.go`:
- Around line 370-392: The test currently only checks that BundleComponents
returns a non-nil error after client.Close; update the assertion to verify the
structured error code equals ErrCodeInvalidRequest to prevent contract
regressions. Locate the failing branch in the test where client.Close() has been
called and BundleComponents(context.Background(), &aicr.RecipeResult{Components:
nil}) is invoked, then replace the loose nil check with an assertion that
inspects the returned error (e.g., unwrap or type-assert to your SDK error type)
and confirms its code equals aicr.ErrCodeInvalidRequest while still failing the
test if it does not.
- Around line 126-133: The test creates two real clients via NewClient (clientA
and clientB) but never closes them, leaking per-client cache state; after each
successful NewClient call (for clientA and clientB) call and/or defer the
client's Close method (e.g., defer clientA.Close() immediately after clientA is
created and check err, and likewise defer clientB.Close()) so both clients are
properly released at the end of the test and state is not leaked between tests.

In `@aicr.go`:
- Around line 607-617: BundleComponents is honoring ctx but loadManifestFiles
does not, so long loops over ref.ManifestFiles can continue after cancellation;
change loadManifestFiles to accept a context.Context (e.g.,
loadManifestFiles(ctx context.Context, r.internal, dp, facade.Name) or similar),
update its callers in BundleComponents (both the generic and "kustomize" cases)
to pass the ctx, and inside loadManifestFiles check ctx.Done() (or return
ctx.Err()) between/ before file reads (and when returning) so the function
aborts promptly on cancellation and propagates the context error up.

In `@options.go`:
- Around line 55-60: Update the applyValidateOptions doc comment to remove the
incorrect claim that the translation occurs "inside the locked region" and
instead state that ValidateState calls applyValidateOptions after releasing
Client.mu (i.e., the translation happens once per call outside the Client.mu
lock), so future additions to validator.With* only require edits here and not on
the façade; reference applyValidateOptions, ValidateState, and Client.mu in the
revised sentence to make the synchronization scope explicit.

---

Duplicate comments:
In `@docs/integrator/go-library.md`:
- Around line 175-178: Update the docs to reflect the actual timeout used by the
facade: change the reference from defaults.RecipeBuildTimeout to
defaults.RecipeOperationTimeout in the paragraph describing ResolveRecipe (and
other context-aware facade methods) so the documentation matches the current
implementation where ResolveRecipe applies defaults.RecipeOperationTimeout as
the cap on caller-provided contexts.

In `@docs/integrator/public-api.md`:
- Line 23: Update the docs entry for `pkg/validator` in the package matrix to
reflect the correct phase count and phrasing: replace "four-phase validation"
with "three-phase validation (Deployment, Performance, Conformance)" so the
README matches the public facade and lists the actual phases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9c3a8bf7-af88-4ad3-833f-97279ebb9e4c

📥 Commits

Reviewing files that changed from the base of the PR and between f41d4cb and e6df948.

📒 Files selected for processing (12)
  • aicr.go
  • aicr_internal_test.go
  • aicr_test.go
  • docs/integrator/go-library.md
  • docs/integrator/public-api.md
  • options.go
  • pkg/defaults/timeouts.go
  • pkg/recipe/builder.go
  • pkg/recipe/components.go
  • pkg/recipe/metadata_store.go
  • pkg/recipe/metadata_store_test.go
  • types.go

Comment thread aicr_test.go
Comment thread aicr_test.go
Comment thread aicr.go Outdated
Comment thread options.go
@hkii

hkii commented May 28, 2026

Copy link
Copy Markdown
Member Author

Thanks for the review. All 8 fixes landed in 2efc0d9f (force-pushed). Per-finding:

  1. aicr_test.go:54-56: hardcoded /nonexistent/path/... replaced with filepath.Join(t.TempDir(), ...). Doesn't depend on what files happen to exist on the runner anymore.
  2. aicr_test.go:258: added t.Cleanup to close the client. Good catch on the isolation leak.
  3. aicr.go:148: NewClient's loop now skips nil Option entries instead of panicking. One branch per option, no crash if a caller assembles opts conditionally.
  4. aicr.go:298: nil-ctx guard added to all four entry points (ResolveRecipe, BundleComponents, CollectSnapshot, ValidateState). Returns ErrCodeInvalidRequest before context.WithTimeout gets a chance to panic.
  5. docs/integrator/go-library.md:178: doc was pointing at RecipeBuildTimeout, but the facade uses RecipeOperationTimeout. Rewrote that section to list the four per-operation caps and document the nil-ctx rejection.
  6. docs/integrator/public-api.md:23: "four-phase" -> "three-phase (Deployment, Performance, Conformance)".
  7. pkg/recipe/metadata_store_test.go:1680: the transient-error test now asserts the structured ErrCodeTimeout code on the first failure, not just err != nil. A regression to bare context.Canceled would have slipped past the old check.
  8. pkg/recipe/metadata_store.go:127: godoc still said transient cancellations stay cached forever, which stopped being true after the CompareAndDelete change. Rewritten to match actual behavior, and noted that callers no longer need the manual EvictCachedStore workaround.

make qualify is green locally; CI will retrigger on the force-push.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
options.go (1)

55-59: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Correct lock-scope wording in applyValidateOptions godoc.

The comment says translation happens “inside the locked region,” but Client.ValidateState calls applyValidateOptions after Client.mu is unlocked. Please update this sentence to match the real synchronization scope.

As per coding guidelines, “Ensure code comments and godoc for API changes are accurate and helpful.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@options.go` around lines 55 - 59, The godoc for applyValidateOptions
incorrectly states the translation happens “inside the locked region”; update
the sentence to reflect that applyValidateOptions is invoked after Client.mu is
released by Client.ValidateState (i.e., the translation occurs outside the
locked region / after unlocking). Edit the comment in applyValidateOptions to
say the translation happens once per call and is performed outside the Client.mu
lock (so future fields added to validator.With* require one edit here and none
on the facade surface).
aicr.go (1)

833-858: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate ctx into loadManifestFiles and abort on cancellation.

BundleComponents is cancellation-aware, but this nested I/O loop is not. A canceled request can still read multiple manifest files before returning.

Suggested fix
-           manifests, err := loadManifestFiles(r.internal, dp, facade.Name)
+           manifests, err := loadManifestFiles(ctx, r.internal, dp, facade.Name)
@@
-           manifests, err := loadManifestFiles(r.internal, dp, facade.Name)
+           manifests, err := loadManifestFiles(ctx, r.internal, dp, facade.Name)
@@
-func loadManifestFiles(internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) {
+func loadManifestFiles(ctx context.Context, internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) {
@@
 	for _, path := range ref.ManifestFiles {
+		if err := ctx.Err(); err != nil {
+			return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled while loading manifests", err)
+		}
 		content, err := recipe.GetManifestContentWithProvider(dp, path)

As per coding guidelines, “Check context in loops — Always check ctx.Done() in long-running operations.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aicr.go` around lines 833 - 858, Update loadManifestFiles to accept a context
(change signature to loadManifestFiles(ctx context.Context, internal
*recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte,
error)) and make the manifest-reading loop cancellation-aware: before each
iteration (and optionally after each read) check ctx.Done() and if canceled
return ctx.Err() (wrapped into the package error type like other failures);
also, if a context-aware manifest reader exists (e.g.,
recipe.GetManifestContentWithProviderWithContext or similar), call that instead
of recipe.GetManifestContentWithProvider so the underlying I/O is also
cancellable. Ensure references to loadManifestFiles callers (e.g.,
BundleComponents) are updated to pass the ctx.
aicr_test.go (1)

131-138: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Close both clients created in this test.

Line 131 and Line 135 allocate real clients but never close them. That can leak per-client cache state and make adjacent tests less isolated.

Suggested patch
 	clientA, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirA)))
 	if err != nil {
 		t.Fatalf("NewClient A: %v", err)
 	}
+	t.Cleanup(func() { _ = clientA.Close() })
 	clientB, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirB)))
 	if err != nil {
 		t.Fatalf("NewClient B: %v", err)
 	}
+	t.Cleanup(func() { _ = clientB.Close() })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aicr_test.go` around lines 131 - 138, The test creates two clients via
aicr.NewClient (clientA and clientB) but never closes them; call the clients'
Close method (e.g., defer clientA.Close() immediately after clientA is created
and err checked, and defer clientB.Close() after clientB is created) or
explicitly call clientA.Close() and clientB.Close() at the end of the test to
release per-client cache state and avoid leaking resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@aicr_test.go`:
- Around line 198-201: The goroutine currently discards the error from
client.Close(); instead capture and assert the Close() result: in the goroutine
call err := client.Close() and send err on a dedicated channel (e.g.,
closeErrCh) before calling wg.Done(); after wg.Wait() read from closeErrCh and
fail the test if err != nil (t.Fatalf or require.NoError). Apply the same
pattern for the other identical goroutine usage so both client.Close() calls are
checked.

---

Duplicate comments:
In `@aicr_test.go`:
- Around line 131-138: The test creates two clients via aicr.NewClient (clientA
and clientB) but never closes them; call the clients' Close method (e.g., defer
clientA.Close() immediately after clientA is created and err checked, and defer
clientB.Close() after clientB is created) or explicitly call clientA.Close() and
clientB.Close() at the end of the test to release per-client cache state and
avoid leaking resources.

In `@aicr.go`:
- Around line 833-858: Update loadManifestFiles to accept a context (change
signature to loadManifestFiles(ctx context.Context, internal
*recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte,
error)) and make the manifest-reading loop cancellation-aware: before each
iteration (and optionally after each read) check ctx.Done() and if canceled
return ctx.Err() (wrapped into the package error type like other failures);
also, if a context-aware manifest reader exists (e.g.,
recipe.GetManifestContentWithProviderWithContext or similar), call that instead
of recipe.GetManifestContentWithProvider so the underlying I/O is also
cancellable. Ensure references to loadManifestFiles callers (e.g.,
BundleComponents) are updated to pass the ctx.

In `@options.go`:
- Around line 55-59: The godoc for applyValidateOptions incorrectly states the
translation happens “inside the locked region”; update the sentence to reflect
that applyValidateOptions is invoked after Client.mu is released by
Client.ValidateState (i.e., the translation occurs outside the locked region /
after unlocking). Edit the comment in applyValidateOptions to say the
translation happens once per call and is performed outside the Client.mu lock
(so future fields added to validator.With* require one edit here and none on the
facade surface).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7189d37d-d813-4114-a306-a23ac77982a3

📥 Commits

Reviewing files that changed from the base of the PR and between e6df948 and 2efc0d9.

📒 Files selected for processing (12)
  • aicr.go
  • aicr_internal_test.go
  • aicr_test.go
  • docs/integrator/go-library.md
  • docs/integrator/public-api.md
  • options.go
  • pkg/defaults/timeouts.go
  • pkg/recipe/builder.go
  • pkg/recipe/components.go
  • pkg/recipe/metadata_store.go
  • pkg/recipe/metadata_store_test.go
  • types.go

Comment thread aicr_test.go
@hkii hkii force-pushed the feat/aicr-client-facade branch from 2efc0d9 to 41042a1 Compare May 28, 2026 05:11

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (5)
aicr.go (1)

620-633: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor ctx cancellation while loading supplemental manifest files.

loadManifestFiles (lines 832-858) loops over ref.ManifestFiles and performs manifest reads without checking context cancellation — the helper currently has no context.Context parameter. A canceled request continues reading files until the loop finishes, even though BundleComponents checks ctx only at the component level.

Proposed fix
-			manifests, err := loadManifestFiles(r.internal, dp, facade.Name)
+			manifests, err := loadManifestFiles(ctx, r.internal, dp, facade.Name)
@@
-func loadManifestFiles(internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) {
+func loadManifestFiles(ctx context.Context, internal *recipe.RecipeResult, dp recipe.DataProvider, componentName string) ([]byte, error) {
@@
 	for _, path := range ref.ManifestFiles {
+		if err := ctx.Err(); err != nil {
+			return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled while loading manifests", err)
+		}
 		content, err := recipe.GetManifestContentWithProvider(dp, path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aicr.go` around lines 620 - 633, The helper loadManifestFiles should honor
cancellation: add a context.Context parameter to loadManifestFiles and in its
loop over ref.ManifestFiles check ctx.Done (or call ctx.Err()) before each read
to abort early; update all callers (e.g., BundleComponents where
loadManifestFiles(r.internal, dp, facade.Name) is invoked in the "manifests" and
"kustomize" branches) to pass the incoming ctx (or an appropriate derived
context) so that canceled requests stop reading files immediately; ensure the
function returns promptly with ctx.Err() when canceled.
aicr_test.go (3)

131-138: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Close both clients in this test to avoid leaking per-client cache state.

Lines 131 and 135 create real clients but never release them. Add cleanup for both to maintain test isolation.

Suggested patch
 	clientA, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirA)))
 	if err != nil {
 		t.Fatalf("NewClient A: %v", err)
 	}
+	t.Cleanup(func() { _ = clientA.Close() })
 	clientB, err := aicr.NewClient(aicr.WithRecipeSource(aicr.FilesystemSource(dirB)))
 	if err != nil {
 		t.Fatalf("NewClient B: %v", err)
 	}
+	t.Cleanup(func() { _ = clientB.Close() })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aicr_test.go` around lines 131 - 138, The test creates two real clients
(clientA and clientB) via aicr.NewClient but never closes them, leaking
per-client cache/state; ensure both clients are closed—either immediately defer
clientA.Close() after successful creation of clientA and defer clientB.Close()
after successful creation of clientB, or explicitly call clientA.Close() and
clientB.Close() at test teardown so any per-client resources are released;
update the test to call the Close method on both clientA and clientB
(references: aicr.NewClient, clientA, clientB).

198-203: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Assert Close() succeeds in the concurrency lifecycle test.

Line 200 discards the Close() error, which could hide regressions in shutdown/drain behavior. While Close() currently always returns nil, capturing the error would catch future changes.

Suggested patch
+		var closeErr error
 		go func() {
 			defer wg.Done()
-			_ = client.Close()
+			closeErr = client.Close()
 		}()
 		wg.Wait()
+		if closeErr != nil {
+			t.Errorf("trial %d: Close: %v", trial, closeErr)
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aicr_test.go` around lines 198 - 203, Capture the error returned by
client.Close() in the goroutine and assert it succeeds instead of discarding it;
replace `_ = client.Close()` with saving the result (e.g., `err :=
client.Close()`) and fail the test on non-nil (using t.Fatalf/require.NoError)
so the concurrency lifecycle test (the goroutine that calls client.Close() and
the wg synchronization) will surface any future shutdown/drain regressions.

392-398: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert structured error code for the closed-client branch.

This block currently only asserts "error is non-nil" but should also verify ErrCodeInvalidRequest so contract regressions are caught.

Suggested patch
 		_, err = client.BundleComponents(context.Background(),
 			&aicr.RecipeResult{Components: nil},
 		)
 		if err == nil {
 			t.Fatal("expected error after Close, got nil")
 		}
+		var se *aicrerrors.StructuredError
+		if !errors.As(err, &se) || se.Code != aicrerrors.ErrCodeInvalidRequest {
+			t.Errorf("expected ErrCodeInvalidRequest after Close; got %v", err)
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aicr_test.go` around lines 392 - 398, The test currently only checks for a
non-nil error from client.BundleComponents; change it to assert the returned
error encodes the specific contract error ErrCodeInvalidRequest. After calling
client.BundleComponents with &aicr.RecipeResult{Components:nil}, replace the
single non-nil check with logic that extracts the AICR error (e.g., using
errors.As to cast to the package error type or the test project's helper used
elsewhere) and assert its code equals aicr.ErrCodeInvalidRequest so the
closed-client branch is explicitly validated.
options.go (1)

55-60: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix inaccurate lock-scope statement in applyValidateOptions docs.

Line 58 states the translation happens "inside the locked region," but ValidateState invokes applyValidateOptions after releasing Client.mu (see aicr.go line 797). This misleads future synchronization work.

Suggested doc fix
 // applyValidateOptions builds the []validator.Option slice from a
 // validateConfig populated by the WithValidation* factories. Called
-// from Client.ValidateState; the translation happens once per call,
-// inside the locked region, so a future field added to validator.With*
+// from Client.ValidateState; the translation happens once per call,
+// before validator construction, so a future field added to validator.With*
 // is one edit here and zero edits on the facade surface.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@options.go` around lines 55 - 60, The doc comment on applyValidateOptions
incorrectly claims the translation occurs "inside the locked region"; update the
comment to state that applyValidateOptions is called after Client.mu is released
(from ValidateState) so the translation happens outside the lock, and adjust
wording to avoid implying any synchronization guarantees—reference
applyValidateOptions and ValidateState (and Client.mu) in the text so future
readers know where the call occurs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/go-library.md`:
- Line 145: Replace the speculative sentence "Additional sources (HTTP, Git) may
be added in future minor releases." with a factual, present-tense statement
reflecting current capabilities (e.g., "Currently supports local file and OCI
sources." or "No additional sources are supported at this time.") so the
documentation does not imply an uncommitted roadmap; update the sentence in
docs/integrator/go-library.md where that exact phrase appears.
- Around line 168-176: The documentation wording is inaccurate: the facade
applies a per-operation timeout unconditionally (it always calls
context.WithTimeout), so the effective deadline is the minimum of the caller's
deadline and the facade cap. Update the text to state that each context-aware
facade method (e.g., ResolveRecipe, BundleComponents, CollectSnapshot,
ValidateState) always wraps the caller context with the per-operation cap from
pkg/defaults (or AgentConfig.Timeout for CollectSnapshot), and that the
effective deadline used is the smaller of the caller's deadline and the facade's
cap.

---

Duplicate comments:
In `@aicr_test.go`:
- Around line 131-138: The test creates two real clients (clientA and clientB)
via aicr.NewClient but never closes them, leaking per-client cache/state; ensure
both clients are closed—either immediately defer clientA.Close() after
successful creation of clientA and defer clientB.Close() after successful
creation of clientB, or explicitly call clientA.Close() and clientB.Close() at
test teardown so any per-client resources are released; update the test to call
the Close method on both clientA and clientB (references: aicr.NewClient,
clientA, clientB).
- Around line 198-203: Capture the error returned by client.Close() in the
goroutine and assert it succeeds instead of discarding it; replace `_ =
client.Close()` with saving the result (e.g., `err := client.Close()`) and fail
the test on non-nil (using t.Fatalf/require.NoError) so the concurrency
lifecycle test (the goroutine that calls client.Close() and the wg
synchronization) will surface any future shutdown/drain regressions.
- Around line 392-398: The test currently only checks for a non-nil error from
client.BundleComponents; change it to assert the returned error encodes the
specific contract error ErrCodeInvalidRequest. After calling
client.BundleComponents with &aicr.RecipeResult{Components:nil}, replace the
single non-nil check with logic that extracts the AICR error (e.g., using
errors.As to cast to the package error type or the test project's helper used
elsewhere) and assert its code equals aicr.ErrCodeInvalidRequest so the
closed-client branch is explicitly validated.

In `@aicr.go`:
- Around line 620-633: The helper loadManifestFiles should honor cancellation:
add a context.Context parameter to loadManifestFiles and in its loop over
ref.ManifestFiles check ctx.Done (or call ctx.Err()) before each read to abort
early; update all callers (e.g., BundleComponents where
loadManifestFiles(r.internal, dp, facade.Name) is invoked in the "manifests" and
"kustomize" branches) to pass the incoming ctx (or an appropriate derived
context) so that canceled requests stop reading files immediately; ensure the
function returns promptly with ctx.Err() when canceled.

In `@options.go`:
- Around line 55-60: The doc comment on applyValidateOptions incorrectly claims
the translation occurs "inside the locked region"; update the comment to state
that applyValidateOptions is called after Client.mu is released (from
ValidateState) so the translation happens outside the lock, and adjust wording
to avoid implying any synchronization guarantees—reference applyValidateOptions
and ValidateState (and Client.mu) in the text so future readers know where the
call occurs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 40127a6d-18ad-433c-870b-b5ff11ba4e6f

📥 Commits

Reviewing files that changed from the base of the PR and between 2efc0d9 and 41042a1.

📒 Files selected for processing (12)
  • aicr.go
  • aicr_internal_test.go
  • aicr_test.go
  • docs/integrator/go-library.md
  • docs/integrator/public-api.md
  • options.go
  • pkg/defaults/timeouts.go
  • pkg/recipe/builder.go
  • pkg/recipe/components.go
  • pkg/recipe/metadata_store.go
  • pkg/recipe/metadata_store_test.go
  • types.go

Comment thread docs/integrator/go-library.md Outdated
Comment thread docs/integrator/go-library.md Outdated
@hkii hkii force-pushed the feat/aicr-client-facade branch from 41042a1 to 938d5c7 Compare May 28, 2026 05:21
@hkii

hkii commented May 28, 2026

Copy link
Copy Markdown
Member Author

Second round of fixes in 938d5c79 (force-pushed). All 5 new findings addressed:

  1. aicr_test.go:131,135 (around :138): both clients in the isolation test now have t.Cleanup(func() { _ = client.Close() }). Same cache-leak pattern as the previous round, missed it on the first pass.
  2. aicr_test.go:394-399: closed-client BundleComponents assertion now also pins ErrCodeInvalidRequest, not just err != nil.
  3. aicr.go loadManifestFiles: takes ctx context.Context and checks ctx.Err() before each manifest read. Both call sites updated. Worst case after cancellation is one in-flight provider.ReadFile (the DataProvider interface itself isn't ctx-aware, so a single read in progress runs to completion). Returns wrapped ErrCodeTimeout on mid-load cancellation.
  4. options.go:55-60: applyValidateOptions godoc was wrong about the lock scope. ValidateState calls it after releasing Client.mu, not inside the locked region. Rewritten to match.
  5. aicr_test.go:200-203: the concurrency lifecycle test now captures Close()'s error via a buffered channel and asserts it after wg.Wait. The old _ = client.Close() swallowed regressions in drain/eviction.

make qualify green locally on the Go path.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@options.go`:
- Around line 119-121: The WithValidation* option factories
(WithValidationImagePullSecrets, WithValidationTolerations,
WithValidationNodeSelector) must defensively clone caller-provided slices/maps
before storing them on validateConfig to avoid data races and preserve
nil-vs-empty semantics: for each slice argument, if it is nil leave nil,
otherwise allocate a new slice with make and copy(...) the contents and assign
that copy to c.imagePullSecrets / c.tolerations; for each map argument, if it is
nil leave nil, otherwise allocate a new map and copy all key/value pairs (for
k,v := range src { dst[k]=v }) and assign that copy to c.nodeSelector; this
ensures downstream code (e.g., serializeNodeSelector in pkg/api/validator/v1)
does not observe concurrent mutations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ee9593d7-0b29-4c62-910a-ef81efda06f8

📥 Commits

Reviewing files that changed from the base of the PR and between 41042a1 and 938d5c7.

📒 Files selected for processing (12)
  • aicr.go
  • aicr_internal_test.go
  • aicr_test.go
  • docs/integrator/go-library.md
  • docs/integrator/public-api.md
  • options.go
  • pkg/defaults/timeouts.go
  • pkg/recipe/builder.go
  • pkg/recipe/components.go
  • pkg/recipe/metadata_store.go
  • pkg/recipe/metadata_store_test.go
  • types.go

Comment thread options.go
@hkii hkii force-pushed the feat/aicr-client-facade branch from 938d5c7 to e9c2a70 Compare May 28, 2026 05:35
@hkii

hkii commented May 28, 2026

Copy link
Copy Markdown
Member Author

Third round of fixes in e9c2a70c (force-pushed):

  1. docs/integrator/go-library.md:145: dropped the "Additional sources (HTTP, Git) may be added..." line. Replaced with a one-liner naming FilesystemSource as the only production source today. No roadmap promises in integrator docs.
  2. docs/integrator/go-library.md:168-176: Context handling section reworded. Was implying the per-operation caps fire only when the caller's context has no deadline; the facade actually calls context.WithTimeout unconditionally and the effective deadline is min(caller, cap). Now reads that way explicitly.
  3. options.go:119-145: WithValidationImagePullSecrets, WithValidationTolerations, WithValidationNodeSelector now defensive-copy their inputs via three small helpers (cloneStringSlice, cloneTolerations, cloneStringMap). Each preserves the nil-in/nil-out distinction so applyValidateOptions's "skip the validator.With* call when unset" path still works. Closes the data race where a caller mutating the original map/slice after handoff could trip serializeNodeSelector's map iteration.

make qualify green on the Go path.

@hkii hkii force-pushed the feat/aicr-client-facade branch from e9c2a70 to e7aface Compare May 28, 2026 06:05

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Foundational correctness work (inflight drain, transient-error eviction, owner-token guard, bounded timeouts) is excellent and the race-detector tests pin the actual bugs.

Five inline comments along three axes against #1071's contract: (1) the facade creates a third DataProvider → Builder → Bundler → Validator wiring path without retiring CLI's recipe.SetDataProvider or migrating the REST server — both must follow before this PR's architectural payoff lands; (2) aicr.Snapshot/AgentConfig/PhaseResult/Phase are transparent aliases of packages the matrix marks Internal or "Public (evolving)" — semver-stable facade can't honestly identity-alias unstable types; (3) _ForTesting helpers leaked onto pkg/recipe's public surface, and the cross-Client guard protects only facade callers, not direct pkg/recipe.Builder users (which today includes pkg/api/server.go).

Not blocking the spirit of the change — (1) and (3) are fine as tracked follow-ups, (2) needs to be decided before merge so the documented contract is honest from day one.

Comment thread aicr.go
Comment thread types.go
Comment thread docs/integrator/public-api.md Outdated
Comment thread pkg/recipe/components.go
Comment thread aicr.go
@mchmarny

Copy link
Copy Markdown
Member

Filed follow-up tracking for the deferred work this PR introduces:

#1078 is the only one I'd consider blocking on for this PR — the others are clean follow-ups. The contradiction between public-api.md's "Internal" tier and the facade's identity-aliased types is the one item that locks in a dishonest contract from day one.

Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
@hkii hkii force-pushed the feat/aicr-client-facade branch from e7aface to 1deb73f Compare May 28, 2026 15:07
@hkii

hkii commented May 28, 2026

Copy link
Copy Markdown
Member Author

Thanks for filing #1076 / #1077 / #1078 / #1079 / #1080 — much appreciated.

For the #1078 contradiction blocker, taking the lighter path in 1deb73f1: docs now match reality without doing the full wrap, leaving #1078 open as the tracked code-side fix.

Concretely:

  1. docs/integrator/public-api.md: moved pkg/snapshotter from InternalPublic (evolving). The previous matrix was claiming "may change without notice" for a package whose types are identity-aliased into a "Public (stable)" facade — that's the dishonesty the thread flagged.
  2. Added a new "Facade type aliases" section enumerating the five inherited symbols (Snapshot, AgentConfig, PhaseResult, Phase, the three phase constants) and naming their source tier explicitly. Tells consumers these specific symbols inherit pkg/snapshotter / pkg/validator's evolving tier rather than the facade's stable tier, with the link to feat(aicr): wrap facade alias types as facade-owned structs #1078 for the full wrap.
  3. types.go: each affected type's godoc now carries a "Stability:" note cross-referencing the new section and feat(aicr): wrap facade alias types as facade-owned structs #1078.

Does this resolve #1078's "blocking on this PR" concern, with the full alias-wrap staying tracked in #1078 for the next milestone? Happy to do the full wrap in this PR instead if you'd prefer that — wanted to check before paying the ~300 LoC cost on speculation.

@mchmarny mchmarny enabled auto-merge (squash) May 28, 2026 15:50
@mchmarny

mchmarny commented May 28, 2026

Copy link
Copy Markdown
Member

Does this resolve #1078 "blocking on this PR" concern?

yes

Happy to do the full wrap in this PR instead if you'd prefer that

no, lets land this and use the epic issues above to drive the full scope to completion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Stable embeddable Go library surface for in-process recipe resolution

2 participants