feat(bundler): route registry/manifest reads through recipe-bound provider#1016
Merged
Conversation
…vider Stage 2 follow-up to #1015 (the per-Builder DataProvider isolation work). Migrates the bundler / deployer / component code paths to honor recipeResult.DataProvider() instead of the package-global recipe.GetDataProvider() / GetComponentRegistry() / GetManifestContent(). ## Motivation PR #1015 made pkg/recipe fully provider-aware (per-Builder caches, RecipeResult.DataProvider() accessor, provider-bound *For helpers) but explicitly deferred the bundler-side migration as a follow-up. That left ~10 call sites in pkg/bundler, pkg/bundler/deployer/{helmfile, argocdhelm}, and pkg/component that silently fell back to the global — even when consuming a recipe built with WithDataProvider(dpA). This PR closes that path. ## Changes - pkg/bundler/bundler.go: 4 methods that already receive *recipe.RecipeResult (warnMissingStorageClassForPVCs, runComponentValidations, collectComponentManifestsByPhase, injectGKECriticalPriorityQuotas) now use recipeResult.DataProvider(). - 5 internal helpers gained a trailing provider recipe.DataProvider parameter (getValueOverridesForComponent, getSetEnabledOverride, applyNodeSchedulingOverrides, copyDataFiles, buildDynamicValuesMap), derived once from recipeResult.DataProvider() at the nearest caller with the result in scope. - Deployer subpackages: helmfile.componentFlagsByName, argocdhelm.resolveOverrideKey, argocdhelm.buildDynamicSetFlags gain the same trailing parameter. Deployer.Generate interface kept unchanged — Generator structs already carry RecipeResult, so methods derive provider via g.RecipeResult.DataProvider(). - pkg/component/generic.go: enrichConfigFromRegistry plumbed identically. Only caller (the deprecated/unused MakeBundle) updated. - pkg/recipe: new EffectiveDataProvider(dp) helper consolidates the bound-first / global-fallback pattern that the bundler's copyDataFiles and collectComponentManifestsByPhase needed for raw WalkDir / type-assertion calls. ## CLI back-compat preserved CLI today does not call WithDataProvider — recipeResult.DataProvider() returns nil — and every migrated call site falls back to the global via *For variants (and EffectiveDataProvider for the two raw cases). Bundle output is byte-identical for that path. Verified by the new end-to-end test which only triggers the bound-provider path when WithDataProvider is used. ## Out of scope (further Stage 2 PRs) - pkg/cli/root.go SetDataProvider migration (touches CLI bootstrap) - pkg/validator/catalog/catalog.go GetDataProvider read - pkg/mirror/discover.go (2 sites) - validators/deployment/expected_resources.go (1 site) ## Testing New tests: - TestApplyNodeSchedulingOverrides_BoundProvider (unit-level coverage of the bound-provider branch in a representative helper) - TestBundler_Make_BoundProviderEndToEnd (integration test: build a recipe via WithDataProvider(layeredOverTempdir) with a marker driver version, call bundler.Make, walk emitted bundle, assert marker present in gpu-operator/values.yaml) - TestEffectiveDataProvider (both branches of the new helper) Existing tests pass unchanged. Coverage: - pkg/bundler: 77.3% -> 77.7% (+0.4%) - pkg/component: 71.0% -> 70.9% (within 0.5% noise floor) - pkg/recipe: 91.5% flat - Project-wide: 77.0% > 75% floor ## Risk - Public API: none. All migrated helpers are unexported except component.MakeBundle (which is itself // Deprecated: with no callers anywhere in the repo). - CLI behavior: unchanged. Bundle output byte-identical for CLI path. - gopls IDE deprecation warnings: same accepted trade-off as #1015 — callers that intentionally use the legacy global path are tagged with //nolint:staticcheck. Project lint stays clean.
This comment was marked as resolved.
This comment was marked as resolved.
Contributor
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers (notably the Crossplane provider-nvidia controller) that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md), public-API matrix (docs/integrator/public-api.md), architecture deep-dive (docs/design/architecture.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers (notably the Crossplane provider-nvidia controller) that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers (notably the Crossplane provider-nvidia controller) that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
7 tasks
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
12 tasks
hkii
added a commit
to hkii/aicr
that referenced
this pull request
May 28, 2026
Introduce github.com/NVIDIA/aicr as the project's semver-tracked Go API for external consumers that need to call AICR's recipe-resolution, bundling, snapshot, and validation logic in-process — without importing internal packages directly. Surface: - Client.ResolveRecipe(ctx, RecipeRequest) -> RecipeResult - Client.BundleComponents(ctx, RecipeResult) -> []ComponentBundle - Client.CollectSnapshot(ctx, *AgentConfig) -> *Snapshot - Client.ValidateState(ctx, *RecipeResult, *Snapshot, opts...) -> []*PhaseResult - Client.Close() -- releases per-Client caches via io.Closer Each Client owns its own DataProvider via the existing recipe.WithDataProvider plumbing (PR NVIDIA#1015) and the bundler's per-Client registry routing (PR NVIDIA#1016). Two Clients constructed against different recipe sources in the same process resolve concurrently without contaminating each other's caches; TestClient_ ConcurrentResolveScopesToOwnSource exercises this directly under -race. Shutdown correctness: - Close drains in-flight operations before evicting per-Client cache entries. Without the drain, a ResolveRecipe that released the read lock before LoadMetadataStoreFor could repopulate storeCache[dp] AFTER Close's Evict call — silently leaking entries past the Client's lifetime. Client now carries an inflight sync.WaitGroup; each entry point Add(1)s under the read lock so Close's write-lock- then-Wait protocol observes the increment. New callers arriving after the closed-flag is set reject without registering. TestClient_CloseDrainsInflightResolve pins this race deterministically using a blockingDataProvider that parks WalkDir on a channel. Transient-error resilience: - A first-caller ctx cancellation no longer poisons sync.Once for the Client's lifetime. LoadMetadataStoreFor now detects context.Canceled / context.DeadlineExceeded inside entry.err and CompareAndDeletes the cache entry so a follow-up call with a healthy ctx loads from scratch. builder.BuildFromCriteria's wrap switched to PropagateOrWrap so the structured ErrCodeTimeout from buildMetadataStore survives — callers (and the new transient-error eviction itself) rely on the inner timeout code, which the old WrapWithContext(ErrCodeInternal, ...) call was overriding. TestLoadMetadataStoreFor_TransientErrorIsNotCached locks this in. Field-level guards baked into the facade: - ResolveRecipe stamps the owning Client on the returned RecipeResult; BundleComponents / ValidateState reject cross-client misuse with ErrCodeInvalidRequest. Without this guard, a result from Client A passed to Client B silently mixed A's component refs with B's DataProvider reads -- wrong Helm values or supplemental manifests with no error. - RecipeRequest.Nodes < 0 rejected up-front (was silently treated as 0). - RecipeRequest.OS mapped to CriteriaOS so OS-pinned leaf overlays (e.g. h100-eks-ubuntu-training-kubeflow -> kubeflow-trainer mixin) are reachable from the facade. Asymmetric matching otherwise excludes them when query OS defaults to "any". - All four facade entry points apply context.WithTimeout against per-operation defaults (RecipeOperationTimeout / SnapshotOperation- Timeout / ValidationOperationTimeout in pkg/defaults). Callers passing tighter deadlines keep theirs; callers passing context.Background() get a bounded operation rather than an unbounded controller hang. - Each entry point also rejects a nil context.Context with ErrCodeInvalidRequest before context.WithTimeout — avoids the runtime panic that context.WithTimeout produces on a nil parent. - NewClient skips nil Option entries instead of panicking on opt(c) — defensive against callers that build []Option dynamically. Cache-lifecycle test coverage: - TestClient_NoCacheGrowthAcrossManyCloseCycles iterates NewClient -> ResolveRecipe -> Close 50 times and asserts both the metadata-store and component-registry caches return to their baseline size. Proves Close() actually evicts, not just nils a pointer. - CachedStoreCountForTesting / CachedRegistryCountForTesting and the per-provider CachedStoreContainsForTesting / CachedRegistryContainsForTesting helpers expose what the package caches without reflecting into unexported state; the per-provider helpers are robust under parallel test execution where the global counts would race against other tests' DataProviders. Type insulation: - ValidateOption is a facade-owned functional-option type that captures into an internal validateConfig and translates to pkg/validator options at call time. Changes to pkg/validator.Option's signature don't propagate to the facade contract. - Snapshot, AgentConfig, PhaseResult, Phase, and the phase constants remain transparent re-exports of pkg/snapshotter / pkg/validator; wrapping those (which would require mirroring their transitively- referenced types) is tracked as a follow-up. Docs: integrator guide (docs/integrator/go-library.md) and public-API matrix (docs/integrator/public-api.md). The CLI and REST surfaces are unchanged; existing tests pass without modification. Test coverage above the project-wide threshold; make qualify passes locally end-to-end on the Go path (lint, race, coverage, chainsaw CLI e2e excluding helmfile which requires sudo for local install). Fixes NVIDIA#1071
This was referenced May 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Migrates the remaining 10 calls in
pkg/bundler,pkg/bundler/deployer/{helmfile,argocdhelm}, andpkg/componentto read registry and manifest content throughrecipeResult.DataProvider()instead of the process-globalrecipe.GetDataProvider()/GetComponentRegistry()/GetManifestContent().Motivation / Context
PR #1015 made
pkg/recipefully provider-aware (per-Builder caches,RecipeResult.DataProvider()accessor, provider-bound*Forhelpers) but explicitly deferred the bundler-side migration as a follow-up. That left ~10 call sites inpkg/bundler,pkg/bundler/deployer/{helmfile,argocdhelm}, andpkg/componentthat silently fell back to the global — even when consuming a recipe built withWithDataProvider(dpA). This PR closes that path.Fixes: N/A
Related: #1015
Type of Change
Component(s) Affected
pkg/recipe)pkg/bundler,pkg/component/*)Implementation Notes
pkg/bundler/bundler.go: 4 methods that already receive*recipe.RecipeResult(warnMissingStorageClassForPVCs,runComponentValidations,collectComponentManifestsByPhase,injectGKECriticalPriorityQuotas) now userecipeResult.DataProvider().provider recipe.DataProviderparameter (getValueOverridesForComponent,getSetEnabledOverride,applyNodeSchedulingOverrides,copyDataFiles,buildDynamicValuesMap), derived once fromrecipeResult.DataProvider()at the nearest caller with the result in scope.helmfile.componentFlagsByName,argocdhelm.resolveOverrideKey,argocdhelm.buildDynamicSetFlagsgain the same trailing parameter.Deployer.Generateinterface kept unchanged — Generator structs already carryRecipeResult, so methods derive provider viag.RecipeResult.DataProvider().pkg/component/generic.go:enrichConfigFromRegistryplumbed identically. Only caller (the deprecated/unusedMakeBundle) updated.pkg/recipe: newEffectiveDataProvider(dp)helper consolidates the bound-first / global-fallback pattern that the bundler'scopyDataFilesandcollectComponentManifestsByPhaseneeded for rawWalkDir/ type-assertion calls.CLI back-compat preserved
CLI today does not call
WithDataProvider—recipeResult.DataProvider()returns nil — and every migrated call site falls back to the global via*Forvariants (andEffectiveDataProviderfor the two raw cases). Bundle output is byte-identical for that path. Verified by the new end-to-end test which only triggers the bound-provider path whenWithDataProvideris used.Out of scope (further Stage 2 PRs)
pkg/cli/root.goSetDataProvidermigration (touches CLI bootstrap)pkg/validator/catalog/catalog.goGetDataProviderreadpkg/mirror/discover.go(2 sites)validators/deployment/expected_resources.go(1 site)Testing
New tests:
TestApplyNodeSchedulingOverrides_BoundProvider— unit-level coverage of the bound-provider branch in a representative helperTestBundler_Make_BoundProviderEndToEnd— integration test: build a recipe viaWithDataProvider(layeredOverTempdir)with a marker driver version, callbundler.Make, walk emitted bundle, assert marker present ingpu-operator/values.yamlTestEffectiveDataProvider— both branches of the new helperExisting tests pass unchanged.
Coverage:
pkg/bundler: 77.3% -> 77.7% (+0.4%)pkg/component: 71.0% -> 70.9% (within 0.5% noise floor)pkg/recipe: 91.5% flatRisk Assessment
Rollout notes: Public API unchanged. All migrated helpers are unexported except
component.MakeBundle(which is itself// Deprecated:with no callers anywhere in the repo). CLI behavior is unchanged; bundle output is byte-identical for the CLI path. gopls IDE deprecation warnings on intentional legacy-global callers are suppressed with//nolint:staticcheck— same accepted trade-off as #1015. Project lint stays clean.Checklist
make testwith-race)make lint)git commit -S)