Support GCP A4 clusters #3142

jvstme · 2025-10-02T00:26:58Z

This commit implements provisioning GCP A4 clusters with high-performance RoCE networking.

> dstack fleet
 FLEET  INSTANCE  BACKEND         RESOURCES                                          PRICE    STATUS  CREATED
 gpu    0         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)  $51.552  idle    21 mins ago
        1         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)  $51.552  idle    17 mins ago

To enable high-performance networking, users need to create the [appropriate networks](https://cloud.google.com/ai-hypercomputer/docs/create/create-vm#setup-network) and configure them in the backend settings.

projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0  # regular, 1 subnet
    extra_vpcs:
    - my-vpc-1  # regular, 1 subnet
    roce_vpcs:
    - my-vpc-mrdma  # RoCE profile, 8 subnets

Then apply a fleet configuration.

type: fleet
nodes: 2
placement: cluster
availability_zones: [us-west2-c]
backends: [gcp]
resources:
  gpu: 8:b200

Each instance in the cluster will then have 10 network interfaces:

- 1 regular interface in the main VPC (`default` or the one configured in `vpc_name`).
- 1 regular interface in a VPC configured in `extra_vpcs`.
- 8 RDMA interfaces in the VPC configured in `roce_vpcs`.
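
The interface count above can be sketched as a small function. This only illustrates the arithmetic and is not dstack's actual implementation: `Interface`, `plan_interfaces`, and `RDMA_INTERFACES_PER_ROCE_VPC` are hypothetical names, and the sketch assumes the backend settings shown earlier, with one entry in `extra_vpcs` and one in `roce_vpcs`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption taken from the description: the RoCE VPC contributes 8 RDMA interfaces.
RDMA_INTERFACES_PER_ROCE_VPC = 8


@dataclass(frozen=True)
class Interface:
    vpc: str
    kind: str  # "regular" or "rdma"


def plan_interfaces(
    vpc_name: Optional[str],
    extra_vpcs: List[str],
    roce_vpcs: List[str],
) -> List[Interface]:
    """Return the interfaces one instance would get (hypothetical helper)."""
    # Main VPC: `default` unless `vpc_name` is configured.
    interfaces = [Interface(vpc=vpc_name or "default", kind="regular")]
    # One regular interface for each VPC listed in `extra_vpcs`.
    interfaces += [Interface(vpc=v, kind="regular") for v in extra_vpcs]
    # Eight RDMA interfaces for each VPC listed in `roce_vpcs`.
    for v in roce_vpcs:
        interfaces += [Interface(vpc=v, kind="rdma")] * RDMA_INTERFACES_PER_ROCE_VPC
    return interfaces


interfaces = plan_interfaces("my-vpc-0", ["my-vpc-1"], ["my-vpc-mrdma"])
print(len(interfaces))  # 10
```

With the example settings this yields 2 regular and 8 RDMA interfaces, matching the total of 10.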

Additionally, this commit optimizes the fetching and caching of subnets, so that they are fetched from the API only once, and not separately for each item in extra_vpcs. For some instance types, this reduces the number of API requests from 9 to 1, which cuts about 16 seconds from each offer provisioning attempt.
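
The fetch-once idea can be illustrated with a toy model. This is a minimal sketch, not the code in this commit: `FakeSubnetAPI` stands in for the real cloud client, both helper functions are hypothetical, and the 9 VPC names are chosen only to mirror the "9 to 1" figure, since the description does not say how the 9 requests break down.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class FakeSubnetAPI:
    """Stand-in for a cloud API client that counts list requests (hypothetical)."""

    def __init__(self, subnets: List[Tuple[str, str]]) -> None:
        self._subnets = subnets  # (vpc_name, subnet_name) pairs
        self.requests = 0

    def list_subnets(self) -> List[Tuple[str, str]]:
        self.requests += 1
        return list(self._subnets)


def subnets_per_vpc_naive(api: FakeSubnetAPI, vpcs: List[str]) -> Dict[str, List[str]]:
    # One list request per VPC, filtered on the client each time.
    return {vpc: [s for v, s in api.list_subnets() if v == vpc] for vpc in vpcs}


def subnets_per_vpc_cached(api: FakeSubnetAPI, vpcs: List[str]) -> Dict[str, List[str]]:
    # A single list request; the result is grouped by VPC and reused for every lookup.
    by_vpc: Dict[str, List[str]] = defaultdict(list)
    for v, s in api.list_subnets():
        by_vpc[v].append(s)
    return {vpc: by_vpc.get(vpc, []) for vpc in vpcs}


# Illustrative data: 9 VPCs with one subnet each.
vpcs = ["main"] + [f"extra-{i}" for i in range(8)]
data = [(vpc, f"{vpc}-subnet") for vpc in vpcs]

naive_api, cached_api = FakeSubnetAPI(data), FakeSubnetAPI(data)
naive_result = subnets_per_vpc_naive(naive_api, vpcs)
cached_result = subnets_per_vpc_cached(cached_api, vpcs)
print(naive_api.requests, cached_api.requests)  # 9 1
```

Both functions return the same mapping; only the number of requests differs. If the 16 seconds saved were spread evenly over the 8 requests avoided, that would be roughly 2 seconds per request, though the description does not give this breakdown.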

#3088

jvstme requested a review from un-def October 2, 2025 00:43

un-def approved these changes Oct 2, 2025

jvstme merged commit f7ef485 into master Oct 2, 2025
38 of 42 checks passed

jvstme deleted the issue_3088_gcp_a4_clusters branch October 2, 2025 08:27

3 participants
