Skip to content

refactor(validator): extract job planning to public v1 API#906

Merged
mchmarny merged 8 commits into
NVIDIA:mainfrom
xdu31:refactor/validator-job-planning-api
May 18, 2026
Merged

refactor(validator): extract job planning to public v1 API#906
mchmarny merged 8 commits into
NVIDIA:mainfrom
xdu31:refactor/validator-job-planning-api

Conversation

@xdu31

@xdu31 xdu31 commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Extracts validator job planning logic into a public API (pkg/api/validator/v1/) to enable external controllers to build and render validator Jobs without depending on internal deployer implementation.

Motivation/Context

Resolves #734

External controllers (e.g., DGXC platform operator) need to create validator Jobs programmatically. Previously, job building logic was encapsulated in the internal job.Deployer type, making it inaccessible to external consumers.

This refactoring exposes the job planning API as stable, documented public functions that external controllers can import and use.

Type of Change

  • Refactoring (code restructuring without behavior change)
  • Bug fix (addressing code review findings)

Components Affected

  • pkg/api/validator/v1/ - New public job planning API
  • pkg/validator/job/deployer.go - Simplified to use v1 API
  • pkg/validator/catalog/ - Removed duplicate ImagePullPolicy
  • pkg/validator/validator.go - Fixed duplicate godoc

Implementation Notes

New Public API:

  • JobPlan struct - Decomposes Job into reusable components (image, args, env, volumes, resources, etc.)
  • BuildJobPlan() - Builds a JobPlan from ValidatorEntry and runtime parameters
  • RenderPlan() - Renders a complete Kubernetes Job from JobPlan
  • RenderPlanToApplyConfig() - Renders ApplyConfiguration for server-side apply
  • ImagePullPolicy() - Determines pull policy based on image reference
  • GenerateRunID() - Creates unique run identifiers

Key Design Decisions:

  • Added Namespace field to JobPlan for self-contained plan data
  • Split public API (job_plan.go) from internal helpers (job_plan_internal.go)
  • Deployer now uses v1 API instead of internal implementation
  • Consolidated all job planning tests into v1/job_plan_test.go

Code Verification:
Detailed field-by-field comparison confirms the new v1 API produces equivalent Job specs to the original deployer implementation (with one minor enhancement for future flexibility - see Resources Handling below).

Review Findings Addressed

Complete Field Preservation in ApplyConfigurations

Fixed buildEnvVarApply() and buildVolumesApply() to preserve all fields when converting to ApplyConfigurations:

  • EnvVar sources: Now preserves Optional, APIVersion, Divisor for all 4 ValueFrom types (FieldRef, SecretKeyRef, ConfigMapKeyRef, ResourceFieldRef)
  • Volume sources: Now preserves Items, DefaultMode, Optional, ReadOnly for all 7 volume types (ConfigMap, Secret, EmptyDir, HostPath, PVC, Projected, DownwardAPI)
  • Split buildVolumesApply() into helper functions to satisfy funlen linter while maintaining complete field coverage

Entropy Failure Validation

Changed GenerateRunID() and generateJobName() to panic on entropy failures instead of silent fallback:

  • Validates both error return and bytes read from rand.Read()
  • Prevents predictable ID generation that could cause collisions
  • Added DNS-1123 validation to generateJobName() with length checks

Resources Handling (Behavior Enhancement)

buildResources() introduces a minor enhancement over the original deployer:

Original deployer: Always hardcoded to 1 CPU / 1Gi (ignored catalog entry.Resources)
Refactored code: Honors entry.Resources if set, else defaults to 1 CPU / 1Gi

Current Impact:

  • No observable difference today - no catalog entry currently sets Resources
  • All validators get 1 CPU / 1Gi (identical behavior to old deployer)
  • Divergence is latent and only surfaces if/when catalog entries add custom requirements

Validation: If parsing fails (invalid CPU/memory values), logs warning and falls back to defaults, preventing invalid quantities from reaching the API server.

This enhancement enables future validators to declare custom resource requirements in the catalog without code changes.

Code Quality Improvements

  • Replaced O(n²) bubble sort with sort.Strings() in serializeNodeSelector()
  • Fixed duplicate godoc on EnsureDataConfigMaps()
  • Added comprehensive DNS-1123 label validation for generated job names

Deferred Issues

Testing

  • All existing tests pass (21/21)
  • New comprehensive tests in pkg/api/validator/v1/job_plan_test.go
  • Test coverage: v1 package 67.4%, overall 75.3%
  • Verified output equivalence with old deployer implementation
  • make qualify passes (tests, lint, e2e, vulnerability scan)

Risk Assessment

Low Risk:

  • Pure refactoring - no breaking behavior changes
  • All existing tests pass
  • New API produces identical Job specs (verified)
  • Deployer continues to work using v1 API
  • Review findings addressed with backward-compatible fixes

Checklist

  • Code follows project style and patterns
  • Tests added/updated and passing
  • Linting passes (0 issues)
  • Documentation updated (README.md added for v1 API)
  • No breaking changes to existing internal APIs
  • Commit message follows conventional commits format
  • Review findings addressed or tracked in issues

@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch from 633367c to f31e1cc Compare May 15, 2026 03:29
coderabbitai[bot]

This comment was marked as resolved.

@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch from f31e1cc to 24e5bd7 Compare May 15, 2026 03:43
coderabbitai[bot]

This comment was marked as resolved.

@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch from 24e5bd7 to 282d74c Compare May 15, 2026 16:51
coderabbitai[bot]

This comment was marked as resolved.

@xdu31 xdu31 closed this May 15, 2026
@xdu31 xdu31 deleted the refactor/validator-job-planning-api branch May 15, 2026 17:12
@xdu31 xdu31 restored the refactor/validator-job-planning-api branch May 15, 2026 17:28
@xdu31 xdu31 reopened this May 15, 2026
xdu31 added a commit to xdu31/aicr that referenced this pull request May 15, 2026
- Change ImagePullSecret (singular, string) to ImagePullSecrets (plural, []string) in JobPlan struct
  to support multiple image pull secrets instead of just the first one
- Add tolerations parameter logic in BuildJobPlan with fallback to tolerate all taints
- Update README.md examples to use correct function signatures (RenderPlan takes no namespace param,
  RenderPlanToApplyConfig takes plan and jobName only)
- Add ConfigMap ResourceVersion handling: fetch existing ConfigMap before Update to prevent
  version conflicts
- Preallocate slices for image pull secrets, env vars, volume mounts, volumes, and tolerations
  for better performance

Fixes findings from CodeRabbit AI review on PR NVIDIA#906
@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch from 282d74c to 76d15b1 Compare May 15, 2026 17:29
coderabbitai[bot]

This comment was marked as resolved.

@github-actions

Copy link
Copy Markdown
Contributor

@xdu31 this PR now has merge conflicts with main. Please rebase to resolve them.

@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch 3 times, most recently from b9e63b9 to 03b53fb Compare May 15, 2026 18:58
@xdu31

xdu31 commented May 15, 2026

Copy link
Copy Markdown
Contributor Author

Response to CodeRabbit Findings

✅ Tolerations (job_plan.go:215) - No change needed

The current implementation is correct by design. The orchestrator Job must always use tolerate-all ({Operator: Exists}) regardless of the incoming tolerations parameter.

Rationale:

  • The orchestrator Job is the validator container that runs as a Kubernetes Job
  • It needs to schedule on any CPU node, so it must tolerate all taints
  • The tolerations parameter is only for inner workloads that validators spawn (e.g., GPU benchmarks)
  • These inner tolerations are serialized into the AICR_TOLERATIONS env var and applied by validators to their child Jobs

Evidence:
Test TestDeployJobOrchestratorToleratesTolerateAll explicitly validates this behavior with the comment:

"The orchestrator Job must always have tolerate-all so it can schedule on any CPU node, regardless of what tolerations are passed for inner workloads."

The test verifies that even when passing narrow GPU tolerations, the orchestrator Job still uses tolerate-all.


✅ ConfigMap ResourceVersion (validator.go:480-498) - Already implemented correctly

The ResourceVersion is being preserved. The code at line 482 fetches the existing ConfigMap:

existing, getErr := clientset.CoreV1().ConfigMaps(namespace).Get(ctx, cm.name, metav1.GetOptions{})

Then at line 498, we update using that existing object:

_, updateErr := clientset.CoreV1().ConfigMaps(namespace).Update(ctx, existing, metav1.UpdateOptions{})

Since existing comes from the Get call, it already contains existing.ObjectMeta.ResourceVersion, which is automatically preserved in the Update call. This is the standard Kubernetes pattern for optimistic concurrency control.


✅ README Issues - Fixed

All function signature and parameter name issues in the README have been corrected:

  • Removed namespace parameter from RenderPlan() and RenderPlanToApplyConfig() signatures
  • Changed imagePullSecret (singular string) to imagePullSecrets (plural slice) throughout
  • Updated all examples to match actual function signatures

Changes are staged and ready to commit.

@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch from f7ab279 to 8fa0b5f Compare May 18, 2026 16:24
@xdu31 xdu31 force-pushed the refactor/validator-job-planning-api branch from 8fa0b5f to 723e33c Compare May 18, 2026 16:42
@xdu31

xdu31 commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

✅ Rebased on upstream main

Successfully rebased on latest upstream/main (932f59e) and resolved merge conflict.

Conflict Resolution:

  • File: pkg/validator/validator.go imports
  • Issue: Upstream added imports for GenerateRunID which we moved to v1 package
  • Fix: Removed unused imports (crypto/rand, encoding/hex), kept stderrors

Verification:

✅ make qualify PASSED
   - Tests: 21/21 passed
   - Coverage: 75.4% (above threshold)
   - Linting: passed
   - E2E: passed
   - Security: no vulnerabilities

Current commits:

723e33c5 docs(validator): fix README function signatures and imagePullSecrets references
d0e9e66f fix(validator): address CodeRabbit review findings for job planning API
72d49021 refactor(validator): extract job planning to public v1 API

Ready for review.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid refactor in shape — the deployer now delegates to a v1 package and the integration looks correct for AICR's own call site. CI is green. The concerns are about the public API contract, not the deployer integration, but they matter because this is being shipped as a stable surface for external controllers.

Blocking:

  • BuildJobPlan accepts tolerations and nodeSelector but neither lands on the validator Pod — they only flow through env vars to inner workloads. The README advertises them as Pod-level. Either honor them or rename + document. (inline on job_plan.go)
  • RenderPlan vs RenderPlanToApplyConfig are not equivalent for customized JobPlans — the apply path silently drops Items, DefaultMode, Optional, PVC.ReadOnly, DownwardAPI.Items, etc. The README claims both paths are supported equally. (inline on job_plan_internal.go)
  • GenerateRunID and generateJobName silently fall back on entropy failure, producing predictable / format-divergent IDs that are used as ConfigMap and Job names. (inline)

Non-blocking but worth fixing in this PR:

  • "Byte-identical" claim in the PR description is wrong — buildResources now respects entry.Resources (catalog has none today, but the divergence is real).
  • serializeNodeSelector regressed from slices.Sort to manual bubble sort.
  • os.Getenv reads inside the planning API leak the caller's process env into the public contract — at minimum, document; ideally, parameterize.
  • Duplicate godoc on EnsureDataConfigMaps.

Also note the PR is marked needs-rebase / mergeable_state: blocked — rebase onto main before merging.

Comment thread pkg/api/validator/v1/job_plan.go
Comment thread pkg/api/validator/v1/job_plan_internal.go
Comment thread pkg/api/validator/v1/job_plan.go
Comment thread pkg/api/validator/v1/job_plan_internal.go Outdated
Comment thread pkg/api/validator/v1/job_plan.go
Comment thread pkg/api/validator/v1/job_plan_internal.go
Comment thread pkg/api/validator/v1/job_plan_internal.go
Comment thread pkg/validator/validator.go
Comment thread pkg/api/validator/v1/README.md
@xdu31 xdu31 requested a review from mchmarny May 18, 2026 20:52
@mchmarny mchmarny enabled auto-merge (squash) May 18, 2026 23:35
@mchmarny mchmarny merged commit cdee98f into NVIDIA:main May 18, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add public RenderJob() function returning Job specs

2 participants