Skip to content

feat(recipe): per-Builder DataProvider isolation; deprecate process globals#1015

Merged
mchmarny merged 2 commits into
mainfrom
feat/recipe-provider-isolation
May 23, 2026
Merged

feat(recipe): per-Builder DataProvider isolation; deprecate process globals#1015
mchmarny merged 2 commits into
mainfrom
feat/recipe-provider-isolation

Conversation

@mchmarny

@mchmarny mchmarny commented May 23, 2026

Copy link
Copy Markdown
Member

Summary

Adds per-Builder DataProvider isolation to pkg/recipe so it is safe for multi-tenant in-process consumers, plus provider-bound GetValuesForComponent / GetManifestContent variants. Marks process-global SetDataProvider / GetDataProvider as deprecated.

Motivation / Context

pkg/recipe was single-tenant by design (concurrency was not required for CLI but plays role in external GO pkg usage and API server). The metadata store and component registry caches were keyed by a global generation counter incremented on SetDataProvider. Two Builders against different providers in the same process could share state silently. This makes pkg/recipe usable as a library for operators, controllers, in-process testing harnesses, and BOM tooling.

Fixes: #985, #986, #987
Related: #983 (library-readiness epic), #966 (BOM tooling consumer)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli) — nolint comments on existing SetDataProvider call sites only
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*) — parity test + errors.AsType modernization
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator) — pkg/validator/catalog nolint + CutSuffix refactor
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/) — docs/contributor/data.md, pkg/recipe/doc.go

Implementation Notes

  • Replaced generation-counter caches with sync.Map[DataProvider]*entry keyed by provider identity. Per-entry sync.Once preserves populate-once semantics under concurrent load.
  • Builder.dp flows to MetadataStore.provider to RecipeResult.provider, so RecipeResult.GetValuesForComponent reads from the same source that built the result.
  • Added EvictCachedStore / EvictCachedRegistry for long-running consumers that retire providers.
  • SA1019 decision: nolint at the call sites (//nolint:staticcheck // tracked by #983 Stage 2). gopls IDE warnings on the deprecated calls are inherent to Go's Deprecated: lifecycle and cannot be silenced separately from golangci-lint — accepted as the canonical mark-first-migrate-later pattern.
  • Internal pkg/recipe calls were rewritten to consult the bound provider; only fallback branches (if dp == nil { dp = GetDataProvider() }) still reach the package-global.

Testing

make qualify

New tests (all -race clean):

  • TestBundlerValueParity_WithRecipeResult (pins bundler ↔ adapter equivalence)
  • TestNewBuilder_WithDataProvider / TestNewBuilder_DataProviderNilFallback
  • TestLoadMetadataStore_PerProviderIsolation
  • TestLoadMetadataStore_ConcurrentSameProviderIsCached
  • TestEvictCachedStore_Refetches / _NilIsNoOp
  • TestLoadMetadataStoreFor_NilProviderFallsBack
  • TestGetComponentRegistry_PerProviderIsolation
  • TestEvictCachedRegistry_Refetches / _NilIsNoOp
  • TestGetComponentRegistryFor_NilProviderFallsBack
  • TestBuilder_WithDataProvider_UsesBoundProvider
  • TestBuilder_TwoProviders_ConcurrentBuild (16 concurrent goroutines)
  • TestBuilder_DataProvider_NilSafe
  • TestRecipeResult_GetValuesForComponent_HonorsBoundProvider
  • TestGetManifestContentWithProvider / _NilFallback
  • TestRecipeResult_GetValuesForComponent_BoundProviderSurvivesGlobalSwap

Coverage: pkg/recipe 90.5% to 91.5% (+1.0%). All 8 new exported symbols have non-zero coverage. Project-wide test-coverage gate (75% floor): 76.9%. No touched-package regression > 0.5%.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

  • Cache memory: monotonically grows with unique providers if callers never evict. In-tree CLI / API server create exactly one provider per process, so the back-compat path is bounded to one entry — same as before.
  • Behavior change: none for CLI / API server (they don't use WithDataProvider). Bundler emits byte-identical bundles (pinned by TestBundlerValueParity_WithRecipeResult).
  • Deprecation: godoc + lint markers only — no symbol removal, no runtime change. Stage 2 migration of in-tree call sites is tracked under [Epic]: Library-readiness hardening for in-process Go consumers #983.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…lobals

Closes #985, #986, #987 (part of the #983 library-readiness epic).

Adds per-Builder DataProvider isolation so pkg/recipe is safe for
multi-tenant in-process consumers, plus the provider-bound
GetValuesForComponent / GetManifestContent variants #985 deferred until
isolation landed.

## Motivation

pkg/recipe was single-tenant by construction — the metadata store and
component registry caches were keyed by a global generation counter
incremented on SetDataProvider. Two Builders against different providers
in the same process could share state silently. This makes pkg/recipe
usable as a library for operators, controllers, in-process testing
harnesses, and BOM tooling (unblocks #966).

## Type of Change

- Enhancement (new public API: WithDataProvider, *For variants, eviction helpers)
- Internal refactor (cache structure)
- Documentation

## Components Affected

- pkg/recipe (cache structure, Builder option, public helpers, deprecation markers)
- pkg/bundler (parity test + errors.AsType modernization, no behavior change)
- pkg/cli (nolint comments on existing SetDataProvider call sites)
- pkg/validator/catalog (nolint + CutSuffix refactor)
- docs/contributor/data.md and pkg/recipe/doc.go

## Implementation Notes

- Replaced generation-counter caches with sync.Map[DataProvider]*entry
  keyed by provider identity. Per-entry sync.Once preserves populate-once
  semantics under concurrent load.
- Builder.dp flows to MetadataStore.provider to RecipeResult.provider, so
  RecipeResult.GetValuesForComponent reads from the same source that
  built the result.
- Added EvictCachedStore / EvictCachedRegistry for long-running consumers
  that retire providers.
- SA1019 decision: nolint at the call sites (//nolint:staticcheck //
  tracked by #983 Stage 2). gopls IDE warnings on the deprecated calls
  are inherent to Go's Deprecated: lifecycle and cannot be silenced
  separately from golangci-lint — accepted as the canonical
  mark-first-migrate-later pattern.
- Internal pkg/recipe calls were rewritten to consult the bound provider;
  only fallback branches (if dp == nil { dp = GetDataProvider() }) still
  reach the package-global.

## Testing

New tests (all -race clean):
- TestBundlerValueParity_WithRecipeResult (pins bundler ↔ adapter equivalence)
- TestNewBuilder_WithDataProvider / TestNewBuilder_DataProviderNilFallback
- TestLoadMetadataStore_PerProviderIsolation
- TestLoadMetadataStore_ConcurrentSameProviderIsCached
- TestEvictCachedStore_Refetches / _NilIsNoOp
- TestLoadMetadataStoreFor_NilProviderFallsBack
- TestGetComponentRegistry_PerProviderIsolation
- TestEvictCachedRegistry_Refetches / _NilIsNoOp
- TestGetComponentRegistryFor_NilProviderFallsBack
- TestBuilder_WithDataProvider_UsesBoundProvider
- TestBuilder_TwoProviders_ConcurrentBuild (16 concurrent goroutines)
- TestBuilder_DataProvider_NilSafe
- TestRecipeResult_GetValuesForComponent_HonorsBoundProvider
- TestGetManifestContentWithProvider / _NilFallback
- TestRecipeResult_GetValuesForComponent_BoundProviderSurvivesGlobalSwap

Coverage: pkg/recipe 90.5% → 91.5% (+1.0%). All 8 new exported symbols
have non-zero coverage. Project-wide test-coverage gate (75% floor):
76.9%. No touched-package regression > 0.5%.

## Risk Assessment

- Cache memory: monotonically grows with unique providers if callers
  never evict. In-tree CLI / API server create exactly one provider per
  process, so back-compat path is bounded to one entry — same as before.
- Behavior change: none for CLI / API server (they don't use
  WithDataProvider). Bundler emits byte-identical bundles (pinned by
  TestBundlerValueParity_WithRecipeResult).
- Deprecation: godoc + lint markers only — no symbol removal, no runtime
  change. Stage 2 migration of in-tree call sites is tracked under #983.

## Checklist

- [x] Tests added for new behavior
- [x] make qualify clean
- [x] Commit cryptographically signed (-S)
- [x] Docs updated (docs/contributor/data.md, pkg/recipe/doc.go)
- [x] No Co-Authored-By / Claude attribution
- [x] PR title under 70 chars
@github-actions

Copy link
Copy Markdown
Contributor

@mchmarny mchmarny self-assigned this May 23, 2026
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.7%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.7%25-green)

No Go source files changed in this PR.

- GetManifestContentWithProvider now returns structured pkg/errors
  with ErrCodeNotFound / ErrCodeInternal (was: raw fs errors).
  stderrors.Is(err, fs.ErrNotExist) chain preserved via pkg/errors.Wrap
  Unwrap support, so the bundler caller in pkg/bundler/bundler.go that
  branches on fs.ErrNotExist continues to behave identically.
- docs/contributor/data.md: fix embedded provider prefix in 2 code
  examples to match runtime convention ("" not "data"); all production
  callers use empty prefix.
- Add TestGetManifestContentWithProvider_NotFound covering both the
  structured error code and the preserved fs.ErrNotExist chain.

PR feedback: #1015 (comment)
              #1015 (comment)
@mchmarny mchmarny merged commit 4e6cd9c into main May 23, 2026
30 of 31 checks passed
@mchmarny mchmarny deleted the feat/recipe-provider-isolation branch May 23, 2026 17:45
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers (notably the Crossplane provider-nvidia
controller) that need to call AICR's recipe-resolution, bundling,
snapshot, and validation logic in-process — without importing internal
packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md), public-API
matrix (docs/integrator/public-api.md), architecture deep-dive
(docs/design/architecture.md). The CLI and REST surfaces are unchanged;
existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers (notably the Crossplane provider-nvidia
controller) that need to call AICR's recipe-resolution, bundling,
snapshot, and validation logic in-process — without importing internal
packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers (notably the Crossplane provider-nvidia
controller) that need to call AICR's recipe-resolution, bundling,
snapshot, and validation logic in-process — without importing internal
packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
hkii added a commit to hkii/aicr that referenced this pull request May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API
for external consumers that need to call AICR's recipe-resolution,
bundling, snapshot, and validation logic in-process — without
importing internal packages directly.

Surface:
- Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult
- Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle
- Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot
- Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult
- Client.Close() -- releases per-Client caches via io.Closer

Each Client owns its own DataProvider via the existing
recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's
per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against
different recipe sources in the same process resolve concurrently
without contaminating each other's caches; TestClient_
ConcurrentResolveScopesToOwnSource exercises this directly under -race.

Shutdown correctness:
- Close drains in-flight operations before evicting per-Client cache
  entries. Without the drain, a ResolveRecipe that released the read
  lock before LoadMetadataStoreFor could repopulate storeCache[dp]
  AFTER Close's Evict call — silently leaking entries past the
  Client's lifetime. Client now carries an inflight sync.WaitGroup;
  each entry point Add(1)s under the read lock so Close's write-lock-
  then-Wait protocol observes the increment. New callers arriving
  after the closed-flag is set reject without registering.
  TestClient_CloseDrainsInflightResolve pins this race deterministically
  using a blockingDataProvider that parks WalkDir on a channel.

Transient-error resilience:
- A first-caller ctx cancellation no longer poisons sync.Once for the
  Client's lifetime. LoadMetadataStoreFor now detects
  context.Canceled / context.DeadlineExceeded inside entry.err and
  CompareAndDeletes the cache entry so a follow-up call with a
  healthy ctx loads from scratch. builder.BuildFromCriteria's wrap
  switched to PropagateOrWrap so the structured ErrCodeTimeout from
  buildMetadataStore survives — callers (and the new transient-error
  eviction itself) rely on the inner timeout code, which the old
  WrapWithContext(ErrCodeInternal, ...) call was overriding.
  TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in.

Field-level guards baked into the facade:
- ResolveRecipe stamps the owning Client on the returned RecipeResult;
  BundleComponents / ValidateState reject cross-client misuse with
  ErrCodeInvalidRequest. Without this guard, a result from Client A
  passed to Client B silently mixed A's component refs with B's
  DataProvider reads -- wrong Helm values or supplemental manifests
  with no error.
- RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0).
- RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays
  (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin)
  are reachable from the facade. Asymmetric matching otherwise excludes
  them when query OS defaults to "any".
- All four facade entry points apply context.WithTimeout against
  per-operation defaults (RecipeOperationTimeout / SnapshotOperation-
  Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing
  tighter deadlines keep theirs; callers passing context.Background()
  get a bounded operation rather than an unbounded controller hang.
- Each entry point also rejects a nil context.Context with
  ErrCodeInvalidRequest before context.WithTimeout — avoids the
  runtime panic that context.WithTimeout produces on a nil parent.
- NewClient skips nil Option entries instead of panicking on opt(c) —
  defensive against callers that build []Option dynamically.

Cache-lifecycle test coverage:
- TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient ->
  ResolveRecipe -> Close 50 times and asserts both the metadata-store
  and component-registry caches return to their baseline size. Proves
  Close() actually evicts, not just nils a pointer.
- CachedStoreCountForTesting / CachedRegistryCountForTesting and the
  per-provider CachedStoreContainsForTesting /
  CachedRegistryContainsForTesting helpers expose what the package
  caches without reflecting into unexported state; the per-provider
  helpers are robust under parallel test execution where the global
  counts would race against other tests' DataProviders.

Type insulation:
- ValidateOption is a facade-owned functional-option type that captures
  into an internal validateConfig and translates to pkg/validator
  options at call time. Changes to pkg/validator.Option's signature
  don't propagate to the facade contract.
- Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants
  remain transparent re-exports of pkg/snapshotter / pkg/validator;
  wrapping those (which would require mirroring their transitively-
  referenced types) is tracked as a follow-up.

Docs: integrator guide (docs/integrator/go-library.md) and public-API
matrix (docs/integrator/public-api.md). The CLI and REST surfaces are
unchanged; existing tests pass without modification.

Test coverage above the project-wide threshold; make qualify passes
locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI
e2e excluding helmfile which requires sudo for local install).

Fixes NVIDIA#1071
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(recipe): expose effective-values and manifest helpers on RecipeResult

1 participant