feat(recipe): per-Builder DataProvider isolation; deprecate process globals#1015
Merged
Conversation
…lobals Closes #985, #986, #987 (part of the #983 library-readiness epic). Adds per-Builder DataProvider isolation so pkg/recipe is safe for multi-tenant in-process consumers, plus the provider-bound GetValuesForComponent / GetManifestContent variants #985 deferred until isolation landed. ## Motivation pkg/recipe was single-tenant by construction — the metadata store and component registry caches were keyed by a global generation counter incremented on SetDataProvider. Two Builders against different providers in the same process could share state silently. This makes pkg/recipe usable as a library for operators, controllers, in-process testing harnesses, and BOM tooling (unblocks #966). ## Type of Change - Enhancement (new public API: WithDataProvider, *For variants, eviction helpers) - Internal refactor (cache structure) - Documentation ## Components Affected - pkg/recipe (cache structure, Builder option, public helpers, deprecation markers) - pkg/bundler (parity test + errors.AsType modernization, no behavior change) - pkg/cli (nolint comments on existing SetDataProvider call sites) - pkg/validator/catalog (nolint + CutSuffix refactor) - docs/contributor/data.md and pkg/recipe/doc.go ## Implementation Notes - Replaced generation-counter caches with sync.Map[DataProvider]*entry keyed by provider identity. Per-entry sync.Once preserves populate-once semantics under concurrent load. - Builder.dp flows to MetadataStore.provider to RecipeResult.provider, so RecipeResult.GetValuesForComponent reads from the same source that built the result. - Added EvictCachedStore / EvictCachedRegistry for long-running consumers that retire providers. - SA1019 decision: nolint at the call sites (//nolint:staticcheck // tracked by #983 Stage 2). gopls IDE warnings on the deprecated calls are inherent to Go's Deprecated: lifecycle and cannot be silenced separately from golangci-lint — accepted as the canonical mark-first-migrate-later pattern. - Internal pkg/recipe calls were rewritten to consult the bound provider; only fallback branches (if dp == nil { dp = GetDataProvider() }) still reach the package-global. ## Testing New tests (all -race clean): - TestBundlerValueParity_WithRecipeResult (pins bundler ↔ adapter equivalence) - TestNewBuilder_WithDataProvider / TestNewBuilder_DataProviderNilFallback - TestLoadMetadataStore_PerProviderIsolation - TestLoadMetadataStore_ConcurrentSameProviderIsCached - TestEvictCachedStore_Refetches / _NilIsNoOp - TestLoadMetadataStoreFor_NilProviderFallsBack - TestGetComponentRegistry_PerProviderIsolation - TestEvictCachedRegistry_Refetches / _NilIsNoOp - TestGetComponentRegistryFor_NilProviderFallsBack - TestBuilder_WithDataProvider_UsesBoundProvider - TestBuilder_TwoProviders_ConcurrentBuild (16 concurrent goroutines) - TestBuilder_DataProvider_NilSafe - TestRecipeResult_GetValuesForComponent_HonorsBoundProvider - TestGetManifestContentWithProvider / _NilFallback - TestRecipeResult_GetValuesForComponent_BoundProviderSurvivesGlobalSwap Coverage: pkg/recipe 90.5% → 91.5% (+1.0%). All 8 new exported symbols have non-zero coverage. Project-wide test-coverage gate (75% floor): 76.9%. No touched-package regression > 0.5%. ## Risk Assessment - Cache memory: monotonically grows with unique providers if callers never evict. In-tree CLI / API server create exactly one provider per process, so back-compat path is bounded to one entry — same as before. - Behavior change: none for CLI / API server (they don't use WithDataProvider). Bundler emits byte-identical bundles (pinned by TestBundlerValueParity_WithRecipeResult). - Deprecation: godoc + lint markers only — no symbol removal, no runtime change. Stage 2 migration of in-tree call sites is tracked under #983. ## Checklist - [x] Tests added for new behavior - [x] make qualify clean - [x] Commit cryptographically signed (-S) - [x] Docs updated (docs/contributor/data.md, pkg/recipe/doc.go) - [x] No Co-Authored-By / Claude attribution - [x] PR title under 70 chars
Contributor
This comment was marked as resolved.
This comment was marked as resolved.
Contributor
Coverage Report ✅
Coverage BadgeNo Go source files changed in this PR. |
- GetManifestContentWithProvider now returns structured pkg/errors
with ErrCodeNotFound / ErrCodeInternal (was: raw fs errors).
stderrors.Is(err, fs.ErrNotExist) chain preserved via pkg/errors.Wrap
Unwrap support, so the bundler caller in pkg/bundler/bundler.go that
branches on fs.ErrNotExist continues to behave identically.
- docs/contributor/data.md: fix embedded provider prefix in 2 code
examples to match runtime convention ("" not "data"); all production
callers use empty prefix.
- Add TestGetManifestContentWithProvider_NotFound covering both the
structured error code and the preserved fs.ErrNotExist chain.
PR feedback: #1015 (comment)
#1015 (comment)
This was referenced May 23, 2026
7 tasks
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers (notably the Crossplane provider-nvidia controller) that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md), public-API matrix (docs/integrator/public-api.md), architecture deep-dive (docs/design/architecture.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers (notably the Crossplane provider-nvidia controller) that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers (notably the Crossplane provider-nvidia controller) that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
23 tasks
This was referenced May 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds per-Builder DataProvider isolation to
pkg/recipeso it is safe for multi-tenant in-process consumers, plus provider-boundGetValuesForComponent/GetManifestContentvariants. Marks process-globalSetDataProvider/GetDataProvideras deprecated.Motivation / Context
pkg/recipewas single-tenant by design (concurrency was not required for CLI but plays role in external GO pkg usage and API server). The metadata store and component registry caches were keyed by a global generation counter incremented onSetDataProvider. TwoBuilders against different providers in the same process could share state silently. This makespkg/recipeusable as a library for operators, controllers, in-process testing harnesses, and BOM tooling.Fixes: #985, #986, #987
Related: #983 (library-readiness epic), #966 (BOM tooling consumer)
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli) — nolint comments on existingSetDataProvidercall sites onlycmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*) — parity test +errors.AsTypemodernizationpkg/collector,pkg/snapshotter)pkg/validator) —pkg/validator/catalognolint +CutSuffixrefactorpkg/errors,pkg/k8s)docs/,examples/) —docs/contributor/data.md,pkg/recipe/doc.goImplementation Notes
sync.Map[DataProvider]*entrykeyed by provider identity. Per-entrysync.Oncepreserves populate-once semantics under concurrent load.Builder.dpflows toMetadataStore.providertoRecipeResult.provider, soRecipeResult.GetValuesForComponentreads from the same source that built the result.EvictCachedStore/EvictCachedRegistryfor long-running consumers that retire providers.nolintat the call sites (//nolint:staticcheck // tracked by #983 Stage 2). gopls IDE warnings on the deprecated calls are inherent to Go'sDeprecated:lifecycle and cannot be silenced separately fromgolangci-lint— accepted as the canonical mark-first-migrate-later pattern.pkg/recipecalls were rewritten to consult the bound provider; only fallback branches (if dp == nil { dp = GetDataProvider() }) still reach the package-global.Testing
New tests (all
-raceclean):TestBundlerValueParity_WithRecipeResult(pins bundler ↔ adapter equivalence)TestNewBuilder_WithDataProvider/TestNewBuilder_DataProviderNilFallbackTestLoadMetadataStore_PerProviderIsolationTestLoadMetadataStore_ConcurrentSameProviderIsCachedTestEvictCachedStore_Refetches/_NilIsNoOpTestLoadMetadataStoreFor_NilProviderFallsBackTestGetComponentRegistry_PerProviderIsolationTestEvictCachedRegistry_Refetches/_NilIsNoOpTestGetComponentRegistryFor_NilProviderFallsBackTestBuilder_WithDataProvider_UsesBoundProviderTestBuilder_TwoProviders_ConcurrentBuild(16 concurrent goroutines)TestBuilder_DataProvider_NilSafeTestRecipeResult_GetValuesForComponent_HonorsBoundProviderTestGetManifestContentWithProvider/_NilFallbackTestRecipeResult_GetValuesForComponent_BoundProviderSurvivesGlobalSwapCoverage:
pkg/recipe90.5% to 91.5% (+1.0%). All 8 new exported symbols have non-zero coverage. Project-wide test-coverage gate (75% floor): 76.9%. No touched-package regression > 0.5%.Risk Assessment
Rollout notes:
WithDataProvider). Bundler emits byte-identical bundles (pinned byTestBundlerValueParity_WithRecipeResult).Checklist
make testwith-race)make lint)git commit -S)