feat(bundler): route registry/manifest reads through recipe-bound provider #1016

Merged
mchmarny merged 1 commit into main from feat/bundler-provider-migration
May 23, 2026
@mchmarny mchmarny commented May 23, 2026

Member

Summary

Migrates the remaining 10 calls in pkg/bundler, pkg/bundler/deployer/{helmfile,argocdhelm}, and pkg/component to read registry and manifest content through recipeResult.DataProvider() instead of the process-global recipe.GetDataProvider() / GetComponentRegistry() / GetManifestContent().

Motivation / Context

PR #1015 made pkg/recipe fully provider-aware (per-Builder caches, RecipeResult.DataProvider() accessor, provider-bound *For helpers) but explicitly deferred the bundler-side migration as a follow-up. That left ~10 call sites in pkg/bundler, pkg/bundler/deployer/{helmfile,argocdhelm}, and pkg/component that silently fell back to the global — even when consuming a recipe built with WithDataProvider(dpA). This PR closes that path.

Fixes: N/A
Related: #1015

Type of Change

  • Refactoring (no functional changes)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

  • pkg/bundler/bundler.go: 4 methods that already receive *recipe.RecipeResult (warnMissingStorageClassForPVCs, runComponentValidations, collectComponentManifestsByPhase, injectGKECriticalPriorityQuotas) now use recipeResult.DataProvider().
  • 5 internal helpers gained a trailing provider recipe.DataProvider parameter (getValueOverridesForComponent, getSetEnabledOverride, applyNodeSchedulingOverrides, copyDataFiles, buildDynamicValuesMap), derived once from recipeResult.DataProvider() at the nearest caller with the result in scope.
  • Deployer subpackages: helmfile.componentFlagsByName, argocdhelm.resolveOverrideKey, argocdhelm.buildDynamicSetFlags gain the same trailing parameter. Deployer.Generate interface kept unchanged — Generator structs already carry RecipeResult, so methods derive provider via g.RecipeResult.DataProvider().
  • pkg/component/generic.go: enrichConfigFromRegistry plumbed identically. Only caller (the deprecated/unused MakeBundle) updated.
  • pkg/recipe: new EffectiveDataProvider(dp) helper consolidates the bound-first / global-fallback pattern that the bundler's copyDataFiles and collectComponentManifestsByPhase needed for raw WalkDir / type-assertion calls.
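The plumbing described in these notes can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the `DataProvider` interface shape, `mapProvider`, `readComponentValues`, and `makeBundle` are hypothetical stand-ins. Only the idea of deriving the provider once from `recipeResult.DataProvider()` and handing it to helpers as a trailing parameter comes from the PR. Nil handling is left out to keep the sketch short.

```go
package main

import "fmt"

// DataProvider is a simplified, hypothetical stand-in for recipe.DataProvider.
type DataProvider interface {
	ReadFile(path string) ([]byte, error)
}

// mapProvider is an illustrative in-memory provider.
type mapProvider map[string]string

func (m mapProvider) ReadFile(path string) ([]byte, error) {
	v, ok := m[path]
	if !ok {
		return nil, fmt.Errorf("not found: %s", path)
	}
	return []byte(v), nil
}

// RecipeResult mimics the accessor named in the PR description.
type RecipeResult struct{ provider DataProvider }

func (r *RecipeResult) DataProvider() DataProvider { return r.provider }

// readComponentValues is a hypothetical helper. The provider arrives as a
// trailing parameter instead of being looked up from a process-global.
func readComponentValues(component string, provider DataProvider) (string, error) {
	b, err := provider.ReadFile(component + "/values.yaml")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// makeBundle derives the provider once, at the nearest caller that has the
// recipe result in scope, and threads it through every helper call.
func makeBundle(r *RecipeResult, components []string) (map[string]string, error) {
	provider := r.DataProvider()
	out := make(map[string]string, len(components))
	for _, c := range components {
		v, err := readComponentValues(c, provider)
		if err != nil {
			return nil, err
		}
		out[c] = v
	}
	return out, nil
}

func main() {
	r := &RecipeResult{provider: mapProvider{
		"gpu-operator/values.yaml": "driver: {version: marker}",
	}}
	out, err := makeBundle(r, []string{"gpu-operator"})
	fmt.Println(out["gpu-operator"], err)
}
```

Passing the provider explicitly keeps helper signatures honest about what they read, which is what allows two recipes built against different providers to be bundled in the same process.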

CLI back-compat preserved

The CLI does not call WithDataProvider today, so recipeResult.DataProvider() returns nil and every migrated call site falls back to the global via the *For variants (and EffectiveDataProvider for the two raw cases). Bundle output is byte-identical for that path. This is verified by the new end-to-end test, which only triggers the bound-provider path when WithDataProvider is used.
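The bound-first / global-fallback behavior can be illustrated with a short sketch. The signature of `EffectiveDataProvider`, the `DataProvider` interface, and `globalProvider` below are assumptions made for illustration; the real helper lives in pkg/recipe and may differ.

```go
package main

import "fmt"

// DataProvider is a hypothetical, minimal interface for this sketch.
type DataProvider interface {
	Name() string
}

type namedProvider string

func (n namedProvider) Name() string { return string(n) }

// globalProvider stands in for whatever recipe.GetDataProvider() returns.
var globalProvider DataProvider = namedProvider("global")

// EffectiveDataProvider returns the bound provider when one is set and
// falls back to the process-global provider otherwise (assumed behavior).
func EffectiveDataProvider(dp DataProvider) DataProvider {
	if dp != nil {
		return dp
	}
	return globalProvider
}

func main() {
	// CLI path: nothing bound, so the global provider is used.
	fmt.Println(EffectiveDataProvider(nil).Name())
	// Library path: a provider bound via WithDataProvider wins.
	fmt.Println(EffectiveDataProvider(namedProvider("dpA")).Name())
}
```

Because the nil case resolves to the same global the CLI used before, the CLI path sees no behavioral change.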

Out of scope (further Stage 2 PRs)

  • pkg/cli/root.go SetDataProvider migration (touches CLI bootstrap)
  • pkg/validator/catalog/catalog.go GetDataProvider read
  • pkg/mirror/discover.go (2 sites)
  • validators/deployment/expected_resources.go (1 site)

Testing

make qualify

New tests:

  • TestApplyNodeSchedulingOverrides_BoundProvider — unit-level coverage of the bound-provider branch in a representative helper
  • TestBundler_Make_BoundProviderEndToEnd — integration test: build a recipe via WithDataProvider(layeredOverTempdir) with a marker driver version, call bundler.Make, walk emitted bundle, assert marker present in gpu-operator/values.yaml
  • TestEffectiveDataProvider — both branches of the new helper

Existing tests pass unchanged.

Coverage:

  • pkg/bundler: 77.3% -> 77.7% (+0.4%)
  • pkg/component: 71.0% -> 70.9% (within 0.5% noise floor)
  • pkg/recipe: 91.5% flat
  • Project-wide: 77.0% > 75% floor

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Public API unchanged. All migrated helpers are unexported except component.MakeBundle (which is itself // Deprecated: with no callers anywhere in the repo). CLI behavior is unchanged; bundle output is byte-identical for the CLI path. gopls IDE deprecation warnings on intentional legacy-global callers are suppressed with //nolint:staticcheck — same accepted trade-off as #1015. Project lint stays clean.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (no user-facing changes)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Stage 2 follow-up to #1015 (the per-Builder DataProvider isolation work).
Migrates the bundler / deployer / component code paths to honor
recipeResult.DataProvider() instead of the package-global
recipe.GetDataProvider() / GetComponentRegistry() / GetManifestContent().

## Risk

- Public API: none. All migrated helpers are unexported except
  component.MakeBundle (which is itself // Deprecated: with no
  callers anywhere in the repo).
- CLI behavior: unchanged. Bundle output byte-identical for CLI path.
- gopls IDE deprecation warnings: same accepted trade-off as #1015 —
  callers that intentionally use the legacy global path are tagged with
  //nolint:staticcheck. Project lint stays clean.
@mchmarny mchmarny requested a review from a team as a code owner May 23, 2026 19:25
@mchmarny mchmarny self-assigned this May 23, 2026
@mchmarny mchmarny enabled auto-merge (squash) May 23, 2026 19:27
@mchmarny mchmarny added this to the v1 milestone May 23, 2026
@github-actions

Contributor

Coverage Report ✅

| Metric | Value |
|---|---|
| Coverage | 76.9% |
| Threshold | 75% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.9%25-green)

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/aicr/pkg/bundler | 67.84% (+2.34%) | 👍 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/argocdhelm | 80.95% (ø) | |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile | 86.88% (ø) | |
| github.com/NVIDIA/aicr/pkg/component | 70.93% (ø) | |
| github.com/NVIDIA/aicr/pkg/recipe | 91.47% (+0.01%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/aicr/pkg/bundler/bundler.go | 66.09% (+2.96%) | 519 (+1) | 343 (+16) | 176 (-15) | 👍 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/argocdhelm/argocdhelm.go | 80.95% (ø) | 357 | 289 | 68 | |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile/helmfile.go | 81.88% (ø) | 160 | 131 | 29 | |
| github.com/NVIDIA/aicr/pkg/component/doc.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/pkg/component/generic.go | 39.39% (ø) | 99 | 39 | 60 | |
| github.com/NVIDIA/aicr/pkg/recipe/provider.go | 91.67% (+0.11%) | 228 (+3) | 209 (+3) | 19 | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@dims dims left a comment

Collaborator

LGTM

@mchmarny mchmarny merged commit 6098765 into main May 23, 2026
33 of 34 checks passed
@mchmarny mchmarny deleted the feat/bundler-provider-migration branch May 23, 2026 20:25
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers (notably the Crossplane provider-nvidia
controller) that need to call AICR's recipe-resolution, bundling,
snapshot, and validation logic in-process — without importing internal
packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches;
TestClient_ConcurrentResolveScopesToOwnSource exercises this directly
under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout /
  SnapshotOperationTimeout / ValidationOperationTimeout in pkg/defaults).
  Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.
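The insulation pattern for `ValidateOption` can be sketched as a facade-owned option type that is translated into the internal option type only at call time. Every option name below (`WithFailFast`, `WithPhases`, `validatorFailFast`, `validatorPhases`) is hypothetical; the source only states that the facade captures into a `validateConfig` and translates to pkg/validator options.

```go
package main

import "fmt"

// Stand-in for pkg/validator. Names and fields are hypothetical.
type validatorSettings struct {
	failFast bool
	phases   []string
}
type validatorOption func(*validatorSettings)

func validatorFailFast(v bool) validatorOption {
	return func(s *validatorSettings) { s.failFast = v }
}
func validatorPhases(p []string) validatorOption {
	return func(s *validatorSettings) { s.phases = p }
}

// Facade side: callers only ever see ValidateOption.
type validateConfig struct {
	failFast bool
	phases   []string
}
type ValidateOption func(*validateConfig)

func WithFailFast() ValidateOption {
	return func(c *validateConfig) { c.failFast = true }
}
func WithPhases(p ...string) ValidateOption {
	return func(c *validateConfig) { c.phases = append(c.phases, p...) }
}

// translate captures facade options into validateConfig and maps them to
// internal options. A signature change in the internal package is absorbed
// here and does not reach facade callers.
func translate(opts ...ValidateOption) []validatorOption {
	cfg := validateConfig{}
	for _, o := range opts {
		if o != nil {
			o(&cfg)
		}
	}
	out := []validatorOption{validatorFailFast(cfg.failFast)}
	if len(cfg.phases) > 0 {
		out = append(out, validatorPhases(cfg.phases))
	}
	return out
}

// apply plays the role of the internal validator consuming its options.
func apply(opts []validatorOption) validatorSettings {
	var s validatorSettings
	for _, o := range opts {
		o(&s)
	}
	return s
}

func main() {
	s := apply(translate(WithFailFast(), WithPhases("deployment")))
	fmt.Println(s.failFast, s.phases)
}
```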

Docs: integrator guide (docs/integrator/go-library.md), public-API
matrix (docs/integrator/public-api.md), architecture deep-dive
(docs/design/architecture.md). The CLI and REST surfaces are unchanged;
existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches;
TestClient_ConcurrentResolveScopesToOwnSource exercises this directly
under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout /
  SnapshotOperationTimeout / ValidationOperationTimeout in pkg/defaults).
  Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches;
TestClient_ConcurrentResolveScopesToOwnSource exercises this directly
under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.
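
The drain protocol described above can be sketched in isolation. This is a minimal illustration, not the facade's actual code: the Client struct, ErrClosed, the string-keyed cache, and the demo helper are all hypothetical stand-ins. It assumes only what the commit message states: Add(1) happens under the read lock, and Close takes the write lock, sets the closed flag, then waits before evicting.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrClosed is a hypothetical sentinel; the facade returns a structured error.
var ErrClosed = errors.New("client closed")

// Client is a stand-in reduced to its shutdown state.
type Client struct {
	mu       sync.RWMutex   // guards closed
	closed   bool
	inflight sync.WaitGroup // counts registered operations

	cacheMu sync.Mutex
	cache   map[string]string // stands in for the per-Client cache entries
}

func NewClient() *Client { return &Client{cache: map[string]string{}} }

// begin registers an operation. Add(1) runs under the read lock, so it is
// ordered before Close's write lock and therefore before Close's Wait.
func (c *Client) begin() error {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.closed {
		return ErrClosed
	}
	c.inflight.Add(1)
	return nil
}

// Resolve runs load and caches its result, as an entry point would.
func (c *Client) Resolve(key string, load func() string) error {
	if err := c.begin(); err != nil {
		return err
	}
	defer c.inflight.Done()
	v := load()
	c.cacheMu.Lock()
	c.cache[key] = v
	c.cacheMu.Unlock()
	return nil
}

// Close flips the flag under the write lock, drains, then evicts.
func (c *Client) Close() error {
	c.mu.Lock()
	c.closed = true
	c.mu.Unlock()
	c.inflight.Wait()
	c.cacheMu.Lock()
	c.cache = map[string]string{}
	c.cacheMu.Unlock()
	return nil
}

func (c *Client) CacheLen() int {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	return len(c.cache)
}

// demo parks one Resolve inside its loader, closes concurrently, and reports
// the cache size after Close plus the error a late caller receives.
func demo() (int, error) {
	c := NewClient()
	started, release := make(chan struct{}), make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		_ = c.Resolve("recipe", func() string {
			close(started)
			<-release
			return "store"
		})
	}()
	<-started // the operation is registered and parked in its loader
	closeDone := make(chan struct{})
	go func() {
		defer close(closeDone)
		_ = c.Close()
	}()
	close(release)
	<-closeDone
	<-done
	return c.CacheLen(), c.Resolve("late", func() string { return "store" })
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```

Because the parked Resolve writes its cache entry before calling Done, and Close evicts only after Wait returns, the entry cannot outlive Close in this sketch.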

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.
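
A rough sketch of the eviction idea follows, assuming a sync.Map of once-guarded entries. The Cache and entry types, the string key, and the demo helper are hypothetical; only sync.Once, sync.Map.CompareAndDelete (Go 1.20+), and the two context errors come from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// entry memoizes one load attempt.
type entry struct {
	once  sync.Once
	value string
	err   error
}

// Cache maps a key (a DataProvider in the real code) to its *entry.
type Cache struct{ entries sync.Map }

// isTransient reports whether err stems from the caller's context.
func isTransient(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}

// Load runs load at most once per entry. A transient failure removes the
// entry so the next caller starts from a fresh sync.Once.
func (c *Cache) Load(ctx context.Context, key string, load func(context.Context) (string, error)) (string, error) {
	v, _ := c.entries.LoadOrStore(key, &entry{})
	e := v.(*entry)
	e.once.Do(func() { e.value, e.err = load(ctx) })
	if isTransient(e.err) {
		// Delete only this exact entry; a concurrent caller may already
		// have replaced it with a healthy one.
		c.entries.CompareAndDelete(key, e)
	}
	return e.value, e.err
}

// demo loads once with a canceled context and twice with a healthy one.
func demo() (second string, calls int, firstErr error) {
	cache := &Cache{}
	load := func(ctx context.Context) (string, error) {
		calls++
		if err := ctx.Err(); err != nil {
			return "", err
		}
		return "store", nil
	}
	canceled, cancel := context.WithCancel(context.Background())
	cancel()
	_, firstErr = cache.Load(canceled, "dp", load)
	second, _ = cache.Load(context.Background(), "dp", load)
	_, _ = cache.Load(context.Background(), "dp", load) // served from the cache
	return second, calls, firstErr
}

func main() {
	second, calls, firstErr := demo()
	fmt.Println(second, calls, firstErr)
}
```

The loader runs twice in demo: once for the canceled attempt and once for the healthy retry. The third call reuses the cached entry.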

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout /
  SnapshotOperationTimeout / ValidationOperationTimeout in
  pkg/defaults). Callers passing tighter deadlines keep theirs; callers
  passing context.Background() get a bounded operation rather than an
  unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.
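
Three of the guards above (nil-context rejection, default timeouts, nil-Option skipping) can be illustrated with a small sketch. The names ErrInvalidRequest, defaultOpTimeout, withOperationTimeout, WithName, and this Client struct are assumptions made for the example and are not the facade's API. The sketch relies on documented behavior of context.WithTimeout: it panics on a nil parent, and it never extends a parent deadline that is already earlier.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrInvalidRequest is a hypothetical stand-in for ErrCodeInvalidRequest.
var ErrInvalidRequest = errors.New("invalid request")

// defaultOpTimeout is an assumed value, not the one in pkg/defaults.
const defaultOpTimeout = 5 * time.Minute

// withOperationTimeout rejects a nil context before calling
// context.WithTimeout, which would panic on a nil parent.
func withOperationTimeout(ctx context.Context, d time.Duration) (context.Context, context.CancelFunc, error) {
	if ctx == nil {
		return nil, nil, ErrInvalidRequest
	}
	ctx, cancel := context.WithTimeout(ctx, d)
	return ctx, cancel, nil
}

// Client and Option are minimal stand-ins for the facade types.
type Client struct{ name string }

type Option func(*Client)

func WithName(n string) Option { return func(c *Client) { c.name = n } }

// NewClient skips nil options instead of calling them.
func NewClient(opts ...Option) *Client {
	c := &Client{}
	for _, opt := range opts {
		if opt == nil {
			continue
		}
		opt(c)
	}
	return c
}

// demo exercises each guard once and reports what it observed.
func demo() (rejected, kept, bounded bool, name string) {
	var nilCtx context.Context
	_, _, err := withOperationTimeout(nilCtx, defaultOpTimeout)
	rejected = errors.Is(err, ErrInvalidRequest)

	// A caller deadline tighter than the default is preserved.
	tight, cancelTight := context.WithTimeout(context.Background(), time.Second)
	defer cancelTight()
	want, _ := tight.Deadline()
	ctx, cancel, _ := withOperationTimeout(tight, defaultOpTimeout)
	defer cancel()
	got, _ := ctx.Deadline()
	kept = got.Equal(want)

	// context.Background() ends up with a deadline.
	bg, cancelBg, _ := withOperationTimeout(context.Background(), defaultOpTimeout)
	defer cancelBg()
	_, bounded = bg.Deadline()

	name = NewClient(nil, WithName("a")).name
	return
}

func main() {
	fmt.Println(demo())
}
```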

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.
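
The difference between a global count and a per-provider membership check can be sketched as follows. Everything here is hypothetical (the DataProvider interface shape, memProvider, and the lower-case helpers); it only mirrors the idea behind the *CountForTesting and *ContainsForTesting helpers named above.

```go
package main

import (
	"fmt"
	"sync"
)

// DataProvider is a hypothetical minimal interface used as the cache key.
type DataProvider interface{ Name() string }

type memProvider struct{ name string }

func (p *memProvider) Name() string { return p.name }

var (
	mu         sync.Mutex
	storeCache = map[DataProvider]string{} // package-level cache keyed by provider
)

func load(dp DataProvider) {
	mu.Lock()
	defer mu.Unlock()
	storeCache[dp] = "store for " + dp.Name()
}

func evict(dp DataProvider) {
	mu.Lock()
	defer mu.Unlock()
	delete(storeCache, dp)
}

// cachedStoreCount is the global view; other tests' providers affect it.
func cachedStoreCount() int {
	mu.Lock()
	defer mu.Unlock()
	return len(storeCache)
}

// cachedStoreContains is the per-provider view; it only looks at dp.
func cachedStoreContains(dp DataProvider) bool {
	mu.Lock()
	defer mu.Unlock()
	_, ok := storeCache[dp]
	return ok
}

// demo runs 50 load/evict cycles and reports cache growth against the
// baseline, whether each provider was cached while live, and whether any
// provider was still cached after eviction.
func demo() (growth int, during, after bool) {
	baseline := cachedStoreCount()
	during = true
	for i := 0; i < 50; i++ {
		dp := &memProvider{name: fmt.Sprintf("source-%d", i)}
		load(dp)
		during = during && cachedStoreContains(dp)
		evict(dp)
		after = after || cachedStoreContains(dp)
	}
	return cachedStoreCount() - baseline, during, after
}

func main() {
	fmt.Println(demo())
}
```

In a parallel test run, an assertion on cachedStoreContains(dp) stays valid even when other tests add or remove their own providers, whereas an assertion on the global count does not.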

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.
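
The insulation pattern for ValidateOption can be sketched like this. The validator-side types, option names, and the "deployment" phase string are invented for illustration; only the shape (a facade-owned option that captures into validateConfig and is translated at call time) follows the description above.

```go
package main

import "fmt"

// Stand-in for the internal pkg/validator option type (hypothetical shape).
type validatorSettings struct {
	phases   []string
	failFast bool
}

type validatorOption func(*validatorSettings)

func validatorWithPhases(p ...string) validatorOption {
	return func(s *validatorSettings) { s.phases = append(s.phases, p...) }
}

func validatorWithFailFast(v bool) validatorOption {
	return func(s *validatorSettings) { s.failFast = v }
}

// Facade side: options capture into a private config and never mention
// the validator's own option type in their signatures.
type validateConfig struct {
	phases   []string
	failFast bool
}

type ValidateOption func(*validateConfig)

func WithPhases(p ...string) ValidateOption {
	return func(c *validateConfig) { c.phases = append(c.phases, p...) }
}

func WithFailFast() ValidateOption {
	return func(c *validateConfig) { c.failFast = true }
}

// toValidatorOptions is the single translation point. If the validator's
// option signature changes, only this function needs to change.
func (c validateConfig) toValidatorOptions() []validatorOption {
	opts := []validatorOption{validatorWithFailFast(c.failFast)}
	if len(c.phases) > 0 {
		opts = append(opts, validatorWithPhases(c.phases...))
	}
	return opts
}

// resolve applies facade options, translates them, and returns the
// settings the validator would end up with.
func resolve(opts ...ValidateOption) validatorSettings {
	var cfg validateConfig
	for _, o := range opts {
		if o != nil {
			o(&cfg)
		}
	}
	var s validatorSettings
	for _, vo := range cfg.toValidatorOptions() {
		vo(&s)
	}
	return s
}

func main() {
	s := resolve(WithPhases("deployment"), WithFailFast())
	fmt.Println(s.phases, s.failFast)
}
```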

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071