Skip to content

feat(validator): parameterize image override env vars#1028

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
haarchri:feat/validator-parameterize-image-overrides
May 26, 2026
Merged

feat(validator): parameterize image override env vars#1028
mchmarny merged 1 commit into
NVIDIA:mainfrom
haarchri:feat/validator-parameterize-image-overrides

Conversation

@haarchri

Copy link
Copy Markdown
Contributor

Summary

Parameterize AICR_VALIDATOR_IMAGE_REGISTRY and AICR_VALIDATOR_IMAGE_TAG in the Deployer and Validator types, replacing hidden os.Getenv() reads with explicit constructor/option parameters threaded from the CLI boundary.

Motivation / Context

The Deployer.DeployJob() method read image override env vars from the calling process environment without documenting or exposing control to callers. This created hidden dependencies, made behavior hard to test, and coupled internal APIs to the process environment. The BuildJobPlan() public API was already parameterized, this change brings Deployer into alignment.

Fixes: #954
Related: #906

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

  • os.Getenv() calls moved to pkg/cli/validate.go (the CLI boundary), passed through validationConfigWithImageRegistryOverride/WithImageTagOverride options → Validator fields → NewDeployer constructor → BuildJobPlan.
  • NewDeployer signature gains two string parameters (imageRegistryOverride, imageTagOverride). This is a breaking change for direct callers of NewDeployer, but the only in-tree caller (pkg/validator/validator.go) is updated in this PR.
  • os.Getenv calls in pkg/validator/catalog/catalog.go and validators/performance/inference_perf_constraint.go are intentionally untouched, those execute inside validator container pods where reading from the pod's own environment is correct.

Testing

make qualify
  • Existing TestDeployJobImageTagOverrideForwarding updated to use constructor parameter instead of t.Setenv
  • New TestDeployJobImageRegistryOverrideForwarding added covering set/empty cases
  • All NewDeployer call sites in tests updated with the two new parameters

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:
no external API change, behavior is identical when invoked via the CLI. External consumers of NewDeployer (if any exist out-of-tree) will need to add two string arguments.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@haarchri haarchri requested a review from a team as a code owner May 25, 2026 09:19
@copy-pr-bot

copy-pr-bot Bot commented May 25, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean refactor — moves os.Getenv for AICR_VALIDATOR_IMAGE_REGISTRY / AICR_VALIDATOR_IMAGE_TAG from pkg/validator/job/deployer.go to the CLI boundary at pkg/cli/validate.go, threading the values through validationConfigWithImage*Override options → Validator fields → NewDeployer. Pattern is consistent with the existing functional-option / field-on-Validator shape (mirrors WithNodeSelector, WithImagePullSecrets, etc.), and matches the CLAUDE.md guidance that internal packages shouldn't read from the process environment directly.

Tests look good: TestDeployJobImageTagOverrideForwarding is now hermetic (no t.Setenv), the new TestDeployJobImageRegistryOverrideForwarding covers both set and empty cases, and every in-tree NewDeployer call site is updated.

One non-blocking observation for a future PR: NewDeployer is up to 12 positional parameters now — at some point an options struct would be friendlier than continuing to extend the signature. Out of scope here.

LGTM.

@mchmarny mchmarny enabled auto-merge (squash) May 26, 2026 12:23
@mchmarny mchmarny disabled auto-merge May 26, 2026 12:23
@mchmarny mchmarny merged commit e30f18a into NVIDIA:main May 26, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BuildJobPlan reads from process environment without documenting or parameterizing

2 participants