Skip to content

feat(api): BREAKING CHANGE: Remove numProcPerNode from Torch API#3239

Merged
google-oss-prow[bot] merged 3 commits into
kubeflow:masterfrom
andreyvelich:remove-num-proc
Mar 1, 2026
Merged

feat(api): BREAKING CHANGE: Remove numProcPerNode from Torch API#3239
google-oss-prow[bot] merged 3 commits into
kubeflow:masterfrom
andreyvelich:remove-num-proc

Conversation

@andreyvelich

Copy link
Copy Markdown
Member

This BREAKING CHANGE will remove numProcPerNode API from the Torch MLPolicy. As we discussed during the latest Trainer call, we would like to remove this API and rely on container resources to set this value: https://youtu.be/e9_g28XdpHg?t=351

We would like to keep this API in trainJob.spec.trainer.numProcPerNode for now, since for MPI use-cases users might want to set number of slots.

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @vsoch @akshaychitneni

Copilot AI review requested due to automatic review settings February 23, 2026 14:59
@andreyvelich

Copy link
Copy Markdown
Member Author

/hold for review

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a BREAKING CHANGE that removes the numProcPerNode field from the TorchMLPolicySource API and changes TrainJob.Spec.Trainer.NumProcPerNode from IntOrString to *int32. This simplifies the API by relying on container resources and PyTorch's native auto-detection for determining the number of processes per node.

Changes:

  • Removed numProcPerNode from TorchMLPolicySource in TrainingRuntime/ClusterTrainingRuntime
  • Changed TrainJob.Spec.Trainer.NumProcPerNode from IntOrString to *int32
  • For Torch runtime: defaults to "auto" internally and can be overridden with an int value
  • For MPI runtime: behavior unchanged, continues to use int values for number of slots per node

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/apis/trainer/v1alpha1/trainingruntime_types.go Removed NumProcPerNode field from TorchMLPolicySource struct
pkg/apis/trainer/v1alpha1/trainjob_types.go Changed NumProcPerNode type from IntOrString to *int32 in Trainer struct
pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go Updated generated deepcopy code to remove IntOrString handling for TorchMLPolicySource.NumProcPerNode
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go Updated OpenAPI schema to reflect API changes
pkg/runtime/framework/plugins/torch/torch.go Updated to default numProcPerNode to "auto" internally, convert int32 override to IntOrString for internal processing
pkg/runtime/framework/plugins/torch/torchtune.go Updated to handle nil NumProcPerNode by defaulting to "auto"
pkg/runtime/framework/plugins/mpi/mpi.go Simplified to directly use int32 values instead of IntOrString conversion
pkg/util/testing/wrapper.go Updated test wrapper methods: TorchPolicy() simplified, NumProcPerNode() changed to accept int32
test/integration/webhooks/trainjob_test.go Removed validation tests for string values, updated tests to use int32 values
test/integration/webhooks/trainingruntime_webhook_test.go Removed defaulting test for torch.numProcPerNode
pkg/runtime/framework/plugins/torch/torch_test.go Removed validation tests for string values, updated all test cases to use simplified TorchPolicy() and int32 NumProcPerNode
pkg/runtime/framework/plugins/mpi/mpi_test.go Removed validation test for string values, updated to use int32 NumProcPerNode
manifests/base/runtimes/torch_distributed.yaml Changed torch.numProcPerNode: auto to torch: {}
manifests/base/runtimes/torchtune/* Changed torch.numProcPerNode: auto to torch: {} in all torchtune runtime manifests
manifests/base/crds/*.yaml Updated CRD schemas to reflect API changes
charts/kubeflow-trainer/templates/runtimes/*.yaml Updated Helm chart runtime templates to use torch: {}
charts/kubeflow-trainer/crds/*.yaml Updated Helm chart CRDs to match base manifests
api/python_api/kubeflow_trainer_api/models/* Updated Python SDK models to reflect API changes
api/openapi-spec/swagger.json Updated OpenAPI specification
docs/proposals/2170-kubeflow-trainer-v2/README.md Updated proposal documentation with API changes and implementation history
pkg/client/applyconfiguration/trainer/v1alpha1/* Updated apply configurations to reflect API changes
test/integration/controller/trainjob_controller_test.go Updated controller tests to use simplified TorchPolicy()
pkg/runtime/core/trainingruntime_test.go Updated core runtime tests to use simplified TorchPolicy() and int32 NumProcPerNode
pkg/runtime/framework/plugins/plainml/plainml_test.go Updated plainml tests to use simplified TorchPolicy()
charts/kubeflow-trainer/tests/runtimes/torch_distributed_test.yaml Updated Helm chart test to expect torch: {} instead of torch.numProcPerNode: auto

@andreyvelich

Copy link
Copy Markdown
Member Author

@tenzen-y @astefanutti I've rebased this PR, let me know if this API change looks good to you.

@astefanutti astefanutti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich thanks, it looks good.

Do you want to rebase a last time on top of the coveralls fix?

Signed-off-by: Andrey Velichkevich <[email protected]>

numNodesRefPath := specPath.Child("trainer").Child("numNodes")
numNodes := *newObj.Spec.Trainer.NumNodes
numNodes := ptr.Deref(newObj.Spec.Trainer.NumNodes, 1)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found this bug in TorchTune validation.

@astefanutti astefanutti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andreyvelich!

/lgtm

@andreyvelich

Copy link
Copy Markdown
Member Author

/approve
/hold for @tenzen-y final review

@tenzen-y tenzen-y left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
/lgtm
/approve

@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [andreyvelich,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y

tenzen-y commented Mar 1, 2026

Copy link
Copy Markdown
Member

/hold cancel

@google-oss-prow google-oss-prow Bot merged commit e6b8376 into kubeflow:master Mar 1, 2026
32 checks passed
@google-oss-prow google-oss-prow Bot added this to the v2.2 milestone Mar 1, 2026
@andreyvelich andreyvelich deleted the remove-num-proc branch March 1, 2026 12:27
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, demo
migration to RuntimePatches, and ClusterTrainingRuntime alignment
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

  1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309).
     The CRD still admits the old field for compat but the v2.2
     controller no longer applies it. The pytorch-mnist demo TrainJob
     in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to
     the `runtimePatches` shape with `manager: aicr.nvidia.com/demo`
     and explicit per-cluster scheduling (the EKS demo carries the
     AICR-standard `dedicated=worker-workload` tolerations + NoExecute
     effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule`
     and `nvidia.com/gpu=present:NoSchedule` to match the rest of the
     GKE flow).

  2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239).
     Upstream removed the field from the Torch policy because it now
     infers parallelism from the container's `nvidia.com/gpu` limit.
     `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI
     test fixtures stay as-is. AICR's `torch-distributed`
     ClusterTrainingRuntime is updated from
     `mlPolicy.torch: { numProcPerNode: auto }` to
     `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
demo TrainJob admitted, pod scheduled with the migrated runtimePatches,
training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade
baseline). 2-replica Deployment with `schedulerName: kai-scheduler` +
DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also
scheduled cleanly with `priorityClassName: train` (each replica got
its own H100 via DRA).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants