@jvstme (Collaborator) commented on Oct 2, 2025

#3088

This commit implements provisioning of GCP A4
clusters with high-performance RoCE networking.

```shell
> dstack fleet
 FLEET  INSTANCE  BACKEND         RESOURCES                                          PRICE    STATUS  CREATED
 gpu    0         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)  $51.552  idle    21 mins ago
        1         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)  $51.552  idle    17 mins ago
```

To enable high-performance networking, users need
to create the
[appropriate networks](https://cloud.google.com/ai-hypercomputer/docs/create/create-vm#setup-network)
and configure them in the backend settings.

```yaml
projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0  # regular, 1 subnet
    extra_vpcs:
    - my-vpc-1  # regular, 1 subnet
    roce_vpcs:
    - my-vpc-mrdma  # RoCE profile, 8 subnets
```

Then apply a fleet configuration.

```yaml
type: fleet
nodes: 2
placement: cluster
availability_zones: [us-west2-c]
backends: [gcp]
resources:
  gpu: 8:b200
```

Each instance in the cluster will then have 10
network interfaces:
- 1 regular interface in the main VPC (`default`
  or the one configured in `vpc_name`).
- 1 regular interface in a VPC configured in
  `extra_vpcs`.
- 8 RDMA interfaces in the VPC configured in
  `roce_vpcs`.
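
The interface count above can be reproduced with a short sketch. This is
not the dstack implementation: `GcpBackendConfig` and `plan_interfaces`
are hypothetical names, and the sketch assumes one regular interface per
entry in `extra_vpcs` and 8 RDMA interfaces per entry in `roce_vpcs`.

```python
from dataclasses import dataclass, field


@dataclass
class GcpBackendConfig:
    # Hypothetical container mirroring the backend settings shown earlier.
    vpc_name: str = "default"
    extra_vpcs: list[str] = field(default_factory=list)
    roce_vpcs: list[str] = field(default_factory=list)


def plan_interfaces(config, rdma_per_roce_vpc=8):
    """Return one (vpc, kind) pair per network interface of an instance."""
    # The main VPC always contributes one regular interface.
    interfaces = [(config.vpc_name, "regular")]
    # Assumption: each extra VPC contributes one regular interface.
    interfaces += [(vpc, "regular") for vpc in config.extra_vpcs]
    # Assumption: each RoCE VPC contributes one RDMA interface per subnet.
    for vpc in config.roce_vpcs:
        interfaces += [(vpc, "rdma")] * rdma_per_roce_vpc
    return interfaces


config = GcpBackendConfig(
    vpc_name="my-vpc-0",
    extra_vpcs=["my-vpc-1"],
    roce_vpcs=["my-vpc-mrdma"],
)
interfaces = plan_interfaces(config)
print(len(interfaces))  # 10
```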

Additionally, this commit optimizes the fetching
and caching of subnets, so that they are fetched
from the API only once, and not separately for
each item in `extra_vpcs`. For some instance
types, this reduces the number of API requests
from 9 to 1, which cuts about 16 seconds from each
offer provisioning attempt.
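
The fetch-once idea can be illustrated with a minimal sketch. It does
not reproduce the code in this commit: `FakeSubnetApi` and `SubnetCache`
are hypothetical stand-ins, and the 9 VPCs are chosen only to mirror the
"9 to 1" figure above.

```python
from collections import defaultdict


class FakeSubnetApi:
    """Stand-in for a cloud API client that counts list calls."""

    def __init__(self, subnets):
        self._subnets = subnets  # list of (vpc, subnet) pairs
        self.calls = 0

    def list_subnets(self, region):
        self.calls += 1
        return list(self._subnets)


class SubnetCache:
    """Fetches all subnets of a region once, then serves lookups per VPC."""

    def __init__(self, api):
        self._api = api
        self._by_region = {}

    def get(self, region, vpc):
        if region not in self._by_region:
            # Single API request; results are grouped by VPC locally.
            grouped = defaultdict(list)
            for vpc_name, subnet in self._api.list_subnets(region):
                grouped[vpc_name].append(subnet)
            self._by_region[region] = dict(grouped)
        return self._by_region[region].get(vpc, [])


vpcs = [f"vpc-{i}" for i in range(9)]
api = FakeSubnetApi([(vpc, f"{vpc}-subnet") for vpc in vpcs])
cache = SubnetCache(api)
subnets = {vpc: cache.get("us-west2", vpc) for vpc in vpcs}
print(api.calls)  # 1
```

A naive version that calls `list_subnets` inside the loop would issue
one request per VPC, which is the pattern the commit message describes
replacing.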

@jvstme requested a review from un-def on Oct 2, 2025 00:43
@jvstme merged commit f7ef485 into master on Oct 2, 2025
38 of 42 checks passed
@jvstme deleted the issue_3088_gcp_a4_clusters branch on Oct 2, 2025 08:27