refactor(validator): extract job planning to public v1 API#906
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
633367c to
f31e1cc
Compare
f31e1cc to
24e5bd7
Compare
24e5bd7 to
282d74c
Compare
- Change ImagePullSecret (singular, string) to ImagePullSecrets (plural, []string) in JobPlan struct to support multiple image pull secrets instead of just the first one - Add tolerations parameter logic in BuildJobPlan with fallback to tolerate all taints - Update README.md examples to use correct function signatures (RenderPlan takes no namespace param, RenderPlanToApplyConfig takes plan and jobName only) - Add ConfigMap ResourceVersion handling: fetch existing ConfigMap before Update to prevent version conflicts - Preallocate slices for image pull secrets, env vars, volume mounts, volumes, and tolerations for better performance Fixes findings from CodeRabbit AI review on PR NVIDIA#906
282d74c to
76d15b1
Compare
|
@xdu31 this PR now has merge conflicts with |
b9e63b9 to
03b53fb
Compare
Response to CodeRabbit Findings✅ Tolerations (job_plan.go:215) - No change neededThe current implementation is correct by design. The orchestrator Job must always use tolerate-all ( Rationale:
Evidence:
The test verifies that even when passing narrow GPU tolerations, the orchestrator Job still uses tolerate-all. ✅ ConfigMap ResourceVersion (validator.go:480-498) - Already implemented correctlyThe ResourceVersion is being preserved. The code at line 482 fetches the existing ConfigMap: existing, getErr := clientset.CoreV1().ConfigMaps(namespace).Get(ctx, cm.name, metav1.GetOptions{})Then at line 498, we update using that _, updateErr := clientset.CoreV1().ConfigMaps(namespace).Update(ctx, existing, metav1.UpdateOptions{})Since ✅ README Issues - FixedAll function signature and parameter name issues in the README have been corrected:
Changes are staged and ready to commit. |
f7ab279 to
8fa0b5f
Compare
8fa0b5f to
723e33c
Compare
✅ Rebased on upstream mainSuccessfully rebased on latest upstream/main (932f59e) and resolved merge conflict. Conflict Resolution:
Verification: Current commits: Ready for review. |
mchmarny
left a comment
There was a problem hiding this comment.
Solid refactor in shape — the deployer now delegates to a v1 package and the integration looks correct for AICR's own call site. CI is green. The concerns are about the public API contract, not the deployer integration, but they matter because this is being shipped as a stable surface for external controllers.
Blocking:
BuildJobPlanacceptstolerationsandnodeSelectorbut neither lands on the validator Pod — they only flow through env vars to inner workloads. The README advertises them as Pod-level. Either honor them or rename + document. (inline onjob_plan.go)RenderPlanvsRenderPlanToApplyConfigare not equivalent for customized JobPlans — the apply path silently dropsItems,DefaultMode,Optional,PVC.ReadOnly,DownwardAPI.Items, etc. The README claims both paths are supported equally. (inline onjob_plan_internal.go)GenerateRunIDandgenerateJobNamesilently fall back on entropy failure, producing predictable / format-divergent IDs that are used as ConfigMap and Job names. (inline)
Non-blocking but worth fixing in this PR:
- "Byte-identical" claim in the PR description is wrong —
buildResourcesnow respectsentry.Resources(catalog has none today, but the divergence is real). serializeNodeSelectorregressed fromslices.Sortto manual bubble sort.os.Getenvreads inside the planning API leak the caller's process env into the public contract — at minimum, document; ideally, parameterize.- Duplicate godoc on
EnsureDataConfigMaps.
Also note the PR is marked needs-rebase / mergeable_state: blocked — rebase onto main before merging.
Summary
Extracts validator job planning logic into a public API (
pkg/api/validator/v1/) to enable external controllers to build and render validator Jobs without depending on internal deployer implementation.Motivation/Context
Resolves #734
External controllers (e.g., DGXC platform operator) need to create validator Jobs programmatically. Previously, job building logic was encapsulated in the internal
job.Deployertype, making it inaccessible to external consumers.This refactoring exposes the job planning API as stable, documented public functions that external controllers can import and use.
Type of Change
Components Affected
pkg/api/validator/v1/- New public job planning APIpkg/validator/job/deployer.go- Simplified to use v1 APIpkg/validator/catalog/- Removed duplicate ImagePullPolicypkg/validator/validator.go- Fixed duplicate godocImplementation Notes
New Public API:
JobPlanstruct - Decomposes Job into reusable components (image, args, env, volumes, resources, etc.)BuildJobPlan()- Builds a JobPlan from ValidatorEntry and runtime parametersRenderPlan()- Renders a complete Kubernetes Job from JobPlanRenderPlanToApplyConfig()- Renders ApplyConfiguration for server-side applyImagePullPolicy()- Determines pull policy based on image referenceGenerateRunID()- Creates unique run identifiersKey Design Decisions:
Namespacefield toJobPlanfor self-contained plan datajob_plan.go) from internal helpers (job_plan_internal.go)v1/job_plan_test.goCode Verification:
Detailed field-by-field comparison confirms the new v1 API produces equivalent Job specs to the original deployer implementation (with one minor enhancement for future flexibility - see Resources Handling below).
Review Findings Addressed
Complete Field Preservation in ApplyConfigurations
Fixed
buildEnvVarApply()andbuildVolumesApply()to preserve all fields when converting to ApplyConfigurations:buildVolumesApply()into helper functions to satisfy funlen linter while maintaining complete field coverageEntropy Failure Validation
Changed
GenerateRunID()andgenerateJobName()to panic on entropy failures instead of silent fallback:rand.Read()generateJobName()with length checksResources Handling (Behavior Enhancement)
buildResources()introduces a minor enhancement over the original deployer:Original deployer: Always hardcoded to
1 CPU / 1Gi(ignored catalogentry.Resources)Refactored code: Honors
entry.Resourcesif set, else defaults to1 CPU / 1GiCurrent Impact:
Resources1 CPU / 1Gi(identical behavior to old deployer)Validation: If parsing fails (invalid CPU/memory values), logs warning and falls back to defaults, preventing invalid quantities from reaching the API server.
This enhancement enables future validators to declare custom resource requirements in the catalog without code changes.
Code Quality Improvements
sort.Strings()inserializeNodeSelector()EnsureDataConfigMaps()Deferred Issues
Testing
pkg/api/validator/v1/job_plan_test.gomake qualifypasses (tests, lint, e2e, vulnerability scan)Risk Assessment
Low Risk:
Checklist