Add multi-network support in TPU v6e #4723
bytetwin
left a comment
Do we have a test run or test results showing that this change works, @agrawalkhushi18?
I deployed the v6e cluster and confirmed that it came up without any network errors. For further validation, I am now running a Network Intelligence connectivity test as final confirmation.
b8d6624 to 0a0eae4
A connectivity test was run to verify that the multi-network setup for the TPU v6e cluster is fully functional. The test validates end-to-end reachability on the secondary data plane (nic1), which confirms that the high-performance path is configured with the correct routing and firewall rules. [test_link]
/gcbrun
This is a breaking change, not a chore: the network name is being updated, which could potentially result in recreation of the VPC and an update of the cluster.
f6ceca8 into GoogleCloudPlatform:develop
This PR introduces multi-network support by creating two dedicated VPCs to improve performance and stability. The primary network (net-0) handles general GKE control plane and pod traffic, while a secondary, high-performance network (net-1) using gVNIC is attached directly to the TPU nodes. This isolates the intensive inter-node communication required for ML training from general cluster traffic, which prevents network congestion for training workloads.
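As a rough illustration of the layout described above, the following Python sketch models the two networks and derives which interfaces a node would get. It is not the blueprint or Terraform code from this PR. The `Network` class, the `interfaces_for` helper, and the `"system"` node kind are hypothetical names used only for this example. The network names (net-0, net-1) and the use of nic1 for the secondary data plane come from this thread; mapping the primary network to nic0 follows the usual Compute Engine interface naming and is assumed here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Network:
    """Hypothetical model of one VPC from the PR description."""
    name: str       # VPC name as written in the PR description
    purpose: str    # traffic the network is meant to carry
    tpu_only: bool  # True if only TPU nodes attach to this network


# Layout taken from the PR description: a primary and a secondary VPC.
NETWORKS: List[Network] = [
    Network("net-0", "GKE control plane and pod traffic", tpu_only=False),
    Network("net-1", "inter-node ML training traffic (gVNIC)", tpu_only=True),
]


def interfaces_for(node_kind: str) -> Dict[str, str]:
    """Return a mapping of NIC name to VPC name for a node kind.

    Assumption: interfaces are numbered in network order (nic0, nic1),
    and only nodes of kind "tpu" attach to TPU-only networks.
    """
    attached = [n for n in NETWORKS if node_kind == "tpu" or not n.tpu_only]
    return {f"nic{i}": net.name for i, net in enumerate(attached)}


if __name__ == "__main__":
    print(interfaces_for("tpu"))
    print(interfaces_for("system"))
```

In this sketch a TPU node ends up with two interfaces, one per VPC, while any other node kind stays on the primary network only. That separation is the point of the change: training traffic on nic1 does not share a path with control plane and pod traffic.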
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.