Promote Kueue as the only workload scheduling solution for A3U and adopt the same in NCCL tests #3534

Merged: annuay-google merged 2 commits into GoogleCloudPlatform:develop from annuay-google:annuay/replace-tas-plugin-with-kueue on Jan 16, 2025

Conversation

@annuay-google (Contributor) commented Jan 14, 2025

What?

  • Promote Kueue as the only workload scheduling solution for A3U
  • Add default Kueue configuration
  • Deprecate TAS plugin (from A3U blueprints) - full module deprecation to follow soon
  • Update the JobSet-based NCCL test to use Kueue TAS in place of the TAS plugin
  • Deprecate 2-node NCCL test (this is redundant, as we have an n-node NCCL test with n >= 1)

Why?

Kueue is the officially supported workload scheduler for A3 Ultra.

Testing

Ran the JobSet-based NCCL test and verified the bandwidth figures.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@annuay-google changed the base branch from main to develop on January 14, 2025, 22:14
@annuay-google added the release-improvements and release-key-new-features labels, then removed the release-improvements label, on Jan 14, 2025
@ankitkinra (Contributor) left a comment

Do we need to delete the 2-node NCCL test?

Comment thread examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml Outdated
@annuay-google (Contributor, Author) commented

> Do we need to delete the 2-node NCCL test?

Deleting this as well, since we have an n-node test now.

@annuay-google (Contributor, Author) commented

> > Do we need to delete the 2-node NCCL test?
>
> Deleting this as well, since we have an n-node test now.

Deleted.

@annuay-google added the do-not-merge label ("Block merging of this PR") on Jan 14, 2025
Comment thread modules/compute/gke-node-pool/outputs.tf Outdated
@ankitkinra (Contributor) left a comment

Approved, but I see Sam has an open comment; please take a look at it.

@annuay-google removed the do-not-merge label ("Block merging of this PR") on Jan 15, 2025

@annuay-google (Contributor, Author) commented

Reference

Addressed

@mwysokin (Contributor) left a comment

I think we should make an informed decision about how to handle TAS and non-TAS Workloads. I added a comment in the relevant place.

Comment thread examples/gke-a3-ultragpu/kueue-configuration.yaml.tftpl
@annuay-google changed the title from "Replace TAS plugin with Kueue" to "Promote Kueue as the only workload scheduling solution for A3U and adopt the same in NCCL tests" on Jan 16, 2025
@annuay-google force-pushed the annuay/replace-tas-plugin-with-kueue branch from 4c19b60 to a3bfc70 on January 16, 2025, 16:08
@annuay-google (Contributor, Author) commented

> I think we should make an informed decision about how to handle TAS and non-TAS Workloads. I added a comment in the relevant place.

Addressed

@annuay-google (Contributor, Author) commented

Re-tested the changes after adding Michal's suggestions; all good.

ighosh98 and others added 2 commits January 16, 2025 19:13
@annuay-google force-pushed the annuay/replace-tas-plugin-with-kueue branch from a3bfc70 to 6325681 on January 16, 2025, 19:14
@annuay-google merged commit 959dde1 into GoogleCloudPlatform:develop on Jan 16, 2025
@abbas1902 mentioned this pull request on Feb 6, 2025
Labels

release-key-new-features Added to release notes under the "Key New Features" heading.

5 participants