feat(api): BREAKING CHANGE: Remove numProcPerNode from Torch API#3239
Conversation
|
/hold for review |
There was a problem hiding this comment.
Pull request overview
This PR implements a BREAKING CHANGE that removes the numProcPerNode field from the TorchMLPolicySource API and changes TrainJob.Spec.Trainer.NumProcPerNode from IntOrString to *int32. This simplifies the API by relying on container resources and PyTorch's native auto-detection for determining the number of processes per node.
Changes:
- Removed
numProcPerNodefromTorchMLPolicySourcein TrainingRuntime/ClusterTrainingRuntime - Changed
TrainJob.Spec.Trainer.NumProcPerNodefromIntOrStringto*int32 - For Torch runtime: defaults to "auto" internally and can be overridden with an int value
- For MPI runtime: behavior unchanged, continues to use int values for number of slots per node
Reviewed changes
Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Removed NumProcPerNode field from TorchMLPolicySource struct |
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Changed NumProcPerNode type from IntOrString to *int32 in Trainer struct |
| pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go | Updated generated deepcopy code to remove IntOrString handling for TorchMLPolicySource.NumProcPerNode |
| pkg/apis/trainer/v1alpha1/zz_generated.openapi.go | Updated OpenAPI schema to reflect API changes |
| pkg/runtime/framework/plugins/torch/torch.go | Updated to default numProcPerNode to "auto" internally, convert int32 override to IntOrString for internal processing |
| pkg/runtime/framework/plugins/torch/torchtune.go | Updated to handle nil NumProcPerNode by defaulting to "auto" |
| pkg/runtime/framework/plugins/mpi/mpi.go | Simplified to directly use int32 values instead of IntOrString conversion |
| pkg/util/testing/wrapper.go | Updated test wrapper methods: TorchPolicy() simplified, NumProcPerNode() changed to accept int32 |
| test/integration/webhooks/trainjob_test.go | Removed validation tests for string values, updated tests to use int32 values |
| test/integration/webhooks/trainingruntime_webhook_test.go | Removed defaulting test for torch.numProcPerNode |
| pkg/runtime/framework/plugins/torch/torch_test.go | Removed validation tests for string values, updated all test cases to use simplified TorchPolicy() and int32 NumProcPerNode |
| pkg/runtime/framework/plugins/mpi/mpi_test.go | Removed validation test for string values, updated to use int32 NumProcPerNode |
| manifests/base/runtimes/torch_distributed.yaml | Changed torch.numProcPerNode: auto to torch: {} |
| manifests/base/runtimes/torchtune/* | Changed torch.numProcPerNode: auto to torch: {} in all torchtune runtime manifests |
| manifests/base/crds/*.yaml | Updated CRD schemas to reflect API changes |
| charts/kubeflow-trainer/templates/runtimes/*.yaml | Updated Helm chart runtime templates to use torch: {} |
| charts/kubeflow-trainer/crds/*.yaml | Updated Helm chart CRDs to match base manifests |
| api/python_api/kubeflow_trainer_api/models/* | Updated Python SDK models to reflect API changes |
| api/openapi-spec/swagger.json | Updated OpenAPI specification |
| docs/proposals/2170-kubeflow-trainer-v2/README.md | Updated proposal documentation with API changes and implementation history |
| pkg/client/applyconfiguration/trainer/v1alpha1/* | Updated apply configurations to reflect API changes |
| test/integration/controller/trainjob_controller_test.go | Updated controller tests to use simplified TorchPolicy() |
| pkg/runtime/core/trainingruntime_test.go | Updated core runtime tests to use simplified TorchPolicy() and int32 NumProcPerNode |
| pkg/runtime/framework/plugins/plainml/plainml_test.go | Updated plainml tests to use simplified TorchPolicy() |
| charts/kubeflow-trainer/tests/runtimes/torch_distributed_test.yaml | Updated Helm chart test to expect torch: {} instead of torch.numProcPerNode: auto |
1672171 to
350a6d5
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
350a6d5 to
8422ee5
Compare
|
@tenzen-y @astefanutti I've rebased this PR, let me know if this API change looks good to you. |
Signed-off-by: Andrey Velichkevich <[email protected]>
astefanutti
left a comment
There was a problem hiding this comment.
@andreyvelich thanks, it looks good.
Do you want to rebase a last time on top of the coveralls fix?
Signed-off-by: Andrey Velichkevich <[email protected]>
|
|
||
| numNodesRefPath := specPath.Child("trainer").Child("numNodes") | ||
| numNodes := *newObj.Spec.Trainer.NumNodes | ||
| numNodes := ptr.Deref(newObj.Spec.Trainer.NumNodes, 1) |
There was a problem hiding this comment.
Found this bug in TorchTune validation.
astefanutti
left a comment
There was a problem hiding this comment.
Thanks @andreyvelich!
/lgtm
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold cancel |
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships two breaking API changes that touch AICR: 1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow). 2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime. Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
This BREAKING CHANGE will remove numProcPerNode API from the Torch MLPolicy. As we discussed during the latest Trainer call, we would like to remove this API and rely on container resources to set this value: https://youtu.be/e9_g28XdpHg?t=351
We would like to keep this API in
trainJob.spec.trainer.numProcPerNodefor now, since for MPI use-cases users might want to set number of slots.cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @vsoch @akshaychitneni