
Add multi-network support in TPU v6e #4723

Merged
agrawalkhushi18 merged 1 commit into GoogleCloudPlatform:develop from agrawalkhushi18:v6e-vpc on Oct 7, 2025

Conversation

@agrawalkhushi18
Contributor

This PR introduces multi-network support by creating two dedicated VPCs. The primary network (net-0) carries general GKE control plane and pod traffic, while a secondary, high-performance network (net-1) using gVNIC is attached directly to the TPU nodes. This isolates the heavy inter-node communication required for ML training from other traffic, which prevents network congestion and improves workload performance and stability.
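
To make the intended layout concrete, here is a minimal Python sketch of the two-network arrangement described above, with a check that each TPU node has an interface on both networks. It is an illustration only and not code from this PR: the `Interface` and `Node` classes, the `multi_network_problems` helper, and the node names are hypothetical. Only the network names (net-0, net-1) and the use of gVNIC on the secondary network come from the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interface:
    name: str           # e.g. "nic0" or "nic1"
    network: str        # VPC the interface is attached to
    nic_type: str = ""  # e.g. "GVNIC"; empty when unspecified


@dataclass
class Node:
    name: str
    interfaces: list[Interface] = field(default_factory=list)


def multi_network_problems(nodes, primary="net-0", secondary="net-1"):
    """Return a human-readable problem for each node that breaks the layout."""
    problems = []
    for node in nodes:
        by_network = {iface.network: iface for iface in node.interfaces}
        if primary not in by_network:
            problems.append(f"{node.name}: no interface on {primary}")
        data_nic = by_network.get(secondary)
        if data_nic is None:
            problems.append(f"{node.name}: no interface on {secondary}")
        elif data_nic.nic_type.upper() != "GVNIC":
            problems.append(f"{node.name}: {data_nic.name} on {secondary} is not gVNIC")
    return problems


# Hypothetical nodes: the second one is missing its secondary interface.
nodes = [
    Node("tpu-node-a", [Interface("nic0", "net-0"), Interface("nic1", "net-1", "GVNIC")]),
    Node("tpu-node-b", [Interface("nic0", "net-0")]),
]
problems = multi_network_problems(nodes)
print(problems)
```

In practice the interface data would come from the deployed node pool rather than from hand-written records, and this sketch does not show how to fetch it.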

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@agrawalkhushi18 agrawalkhushi18 requested review from a team and samskillman as code owners October 6, 2025 05:18
@agrawalkhushi18 agrawalkhushi18 added the release-chore label (To not include into release notes) Oct 6, 2025
shubpal07 previously approved these changes Oct 6, 2025
Contributor

@shubpal07 shubpal07 left a comment

LGTM.

  1. Moving forward, we may want to make the TPU cluster private. We can create a follow-up task for that.
  2. Are there any checks that validate the multi-NIC setup?

Collaborator

@bytetwin bytetwin left a comment

Do we have a test run or test results showing that this change works, @agrawalkhushi18?

Comment thread community/examples/gke-tpu-v6/gke-tpu-v6.yaml Outdated
Comment thread community/examples/gke-tpu-v6/gke-tpu-v6.yaml Outdated
@agrawalkhushi18
Contributor Author

Do we have a test run or test results showing that this change works, @agrawalkhushi18?

I deployed the v6e cluster and confirmed that it worked without any network errors. For further validation, I am now running a Network Intelligence Center connectivity test.

Comment thread community/examples/gke-tpu-v6/gke-tpu-v6.yaml Outdated
@agrawalkhushi18
Contributor Author

A connectivity test was run to ensure the multi-network setup for the TPU v6e cluster is fully functional. The test specifically validates end-to-end reachability on the secondary data plane (nic1). This confirms the high-performance path is correctly configured with proper routing and firewall rules. [test_link]
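
As a rough illustration of what end-to-end reachability on the secondary data plane would cover, the Python sketch below lists every ordered source/destination pair of nic1 addresses that such a check could exercise. This is an assumption-laden sketch, not the test that was run: the `secondary_plane_pairs` helper, the node names, and the IP addresses are hypothetical, and the comment above does not say which endpoints the actual connectivity test used.

```python
from itertools import permutations


def secondary_plane_pairs(nic1_addresses):
    """Return (src_node, src_ip, dst_node, dst_ip) for every ordered node pair."""
    return [
        (src, nic1_addresses[src], dst, nic1_addresses[dst])
        for src, dst in permutations(sorted(nic1_addresses), 2)
    ]


# Hypothetical nic1 addresses on the secondary network (net-1).
nic1_addresses = {
    "tpu-node-a": "10.1.0.2",
    "tpu-node-b": "10.1.0.3",
    "tpu-node-c": "10.1.0.4",
}
plan = secondary_plane_pairs(nic1_addresses)
for src, src_ip, dst, dst_ip in plan:
    print(f"{src} ({src_ip}) -> {dst} ({dst_ip})")
```

Ordered pairs are used so that each direction is checked separately, since routing and firewall rules are not necessarily symmetric.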

@bytetwin bytetwin added the release-breaking-changes label (Prevents "smooth" re-deploy across versions) and removed the release-chore label (To not include into release notes) Oct 7, 2025
@bytetwin
Collaborator

bytetwin commented Oct 7, 2025

/gcbrun

@bytetwin
Collaborator

bytetwin commented Oct 7, 2025

This is a breaking change, not a chore: the network name is being updated, which could result in the VPC being recreated and the cluster being updated.

@agrawalkhushi18 agrawalkhushi18 merged commit f6ceca8 into GoogleCloudPlatform:develop Oct 7, 2025
14 of 65 checks passed

Labels

release-breaking-changes: Prevents "smooth" re-deploy across versions

3 participants