Comparing changes

Final edits

Other fixes and improvements:

* Handle errors in `_create_jump_pod_service_if_not_exists`
* Check both Service and Pod to decide if the jump pod must be (re)created
* Respect `Node.status.nodeinfo.architecture`
* Add `namespace` option to the backend config

Part-of: #3126
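The "check both Service and Pod" rule could be sketched roughly as follows. This is only an illustration, not the code from the commit: the function name `jump_pod_needs_recreate` and its parameters are hypothetical, and the only external fact used is the standard set of Kubernetes Pod phases (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`).

```python
from typing import Optional

# Pod phases in which an existing jump pod is assumed to be usable (or about to be).
# This choice is an assumption made for the sketch.
HEALTHY_PHASES = ("Pending", "Running")


def jump_pod_needs_recreate(service_exists: bool, pod_phase: Optional[str]) -> bool:
    """Return True if the jump pod and its service should be (re)created.

    pod_phase is None when the Pod object does not exist at all.
    """
    if not service_exists:
        # Without the Service the jump pod is not reachable, even if the Pod runs.
        return True
    # The Service alone is not enough: the Pod must exist and be healthy.
    return pod_phase not in HEALTHY_PHASES
```

Checking only the Service would miss the case where the Service survives while the Pod is gone or has failed, which is presumably what the fix addresses.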

This implementation allows provisioning both individual A4 instances and clusters, but clusters do not yet support high-speed networking, since it requires a [different network setup](https://cloud.google.com/ai-hypercomputer/docs/create/create-vm#setup-network).

Only show the `USER` column in `dstack project list` if `--verbose` is passed. In my setup, where 9 projects are configured, this speeds up `dstack project list` from 20 seconds to 2 seconds.

Part-of: #3126

#3137)

* networking -> proxy_jump
* ssh_host -> hostname
* ssh_port -> port

In addition, the `dstack-` prefix has been added to jump pod and service names for consistency with job pods and services.

Closes: #3136

Co-authored-by: peterschmidt85 <[email protected]>
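For anyone migrating stored values by hand, the three renames can be applied mechanically. The sketch below assumes the renamed fields are top-level keys of a dict-like record; the commit message does not say which structure they belong to, so `apply_renames` and the sample record are purely illustrative.

```python
# Old name -> new name, exactly as listed in the commit message.
RENAMED_KEYS = {
    "networking": "proxy_jump",
    "ssh_host": "hostname",
    "ssh_port": "port",
}


def apply_renames(record: dict) -> dict:
    """Return a copy of record with the old keys replaced by the new ones."""
    return {RENAMED_KEYS.get(key, key): value for key, value in record.items()}


# Hypothetical record using the old field names.
old = {"ssh_host": "10.0.0.5", "ssh_port": 22, "networking": None}
new = apply_renames(old)
```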

This commit implements provisioning GCP A4 clusters with high-performance RoCE networking.

```shell
> dstack fleet
 FLEET  INSTANCE  BACKEND         RESOURCES                                                  PRICE    STATUS  CREATED
 gpu    0         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)          $51.552  idle    21 mins ago
        1         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)          $51.552  idle    17 mins ago
```

To enable high-performance networking, users need to create the [appropriate networks](https://cloud.google.com/ai-hypercomputer/docs/create/create-vm#setup-network) and configure them in the backend settings.

```yaml
projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0   # regular, 1 subnet
    extra_vpcs:
    - my-vpc-1   # regular, 1 subnet
    roce_vpcs:
    - my-vpc-mrdma   # RoCE profile, 8 subnets
```

Then apply a fleet configuration.

```yaml
type: fleet
nodes: 2
placement: cluster
availability_zones: [us-west2-c]
backends: [gcp]
resources:
  gpu: 8:b200
```

Each instance in the cluster will then have 10 network interfaces:

- 1 regular interface in the main VPC (`default` or the one configured in `vpc_name`).
- 1 regular interface in a VPC configured in `extra_vpcs`.
- 8 RDMA interfaces in the VPC configured in `roce_vpcs`.

Additionally, this commit optimizes the fetching and caching of subnets, so that they are fetched from the API only once, and not separately for each item in `extra_vpcs`. For some instance types, this reduces the number of API requests from 9 to 1, which cuts about 16 seconds from each offer provisioning attempt.
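The subnet optimization amounts to listing all subnets once and grouping them in memory, instead of issuing one request per VPC. A minimal sketch of that idea follows; `subnets_by_vpc`, `fake_list_subnets`, and the subnet record layout are hypothetical stand-ins and do not reflect the actual dstack or GCP client code.

```python
from typing import Callable, Dict, List


def subnets_by_vpc(
    list_subnets: Callable[[str], List[dict]],
    region: str,
    vpcs: List[str],
) -> Dict[str, List[str]]:
    """Fetch every subnet in the region with ONE call, then group by VPC."""
    all_subnets = list_subnets(region)  # the single API request
    grouped: Dict[str, List[str]] = {vpc: [] for vpc in vpcs}
    for subnet in all_subnets:
        if subnet["vpc"] in grouped:
            grouped[subnet["vpc"]].append(subnet["name"])
    return grouped


# Fake API used only to count requests; the real call would hit the cloud API.
calls: List[str] = []


def fake_list_subnets(region: str) -> List[dict]:
    calls.append(region)
    subnets = [
        {"vpc": "my-vpc-0", "name": "subnet-0"},
        {"vpc": "my-vpc-1", "name": "subnet-1"},
    ]
    # The RoCE VPC from the example config has 8 subnets.
    subnets += [{"vpc": "my-vpc-mrdma", "name": f"mrdma-{i}"} for i in range(8)]
    return subnets


result = subnets_by_vpc(
    fake_list_subnets, "us-west2", ["my-vpc-0", "my-vpc-1", "my-vpc-mrdma"]
)
```

With this shape, the number of requests stays at one regardless of how many VPCs are configured.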

* Discover and set instance's internal_ip (PodIP)
* Fix region mismatch
* Add `privileged: true` support
* [runner] Set RLIMIT_MEMLOCK to unlimited. Fixes issues with InfiniBand/RDMA

Part-of: #3126
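For reference, raising `RLIMIT_MEMLOCK` to unlimited looks roughly like this when done from Python on Linux. This is not the runner's code, which is not shown here; the helper names are made up for the example, and the call only succeeds when the process has sufficient privileges (for example, in a privileged container).

```python
import resource


def memlock_is_unlimited() -> bool:
    """Check whether both the soft and hard locked-memory limits are unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


def try_set_memlock_unlimited() -> bool:
    """Try to lift the locked-memory limit; return False if not permitted."""
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
    except (ValueError, OSError):
        # Raised when an unprivileged process tries to raise the hard limit.
        return False
    return True
```

RDMA libraries pin (lock) memory buffers for direct device access, so a low locked-memory limit is a common cause of InfiniBand/RDMA failures.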

* [Docs] Kubernetes guide
* [Docs] Kubernetes guide

  Rework `Backends` and `Fleets` pages to reflect the changes related to Kubernetes
* [Docs] Improve Kubernetes documentation

  Updated `README`, `Overview`, `Installation`
* [Docs] Improve Kubernetes documentation

  Minor updates, incl. the description of `Default image`, and `privileged` for NCCL tests
* [Docs] Improve Kubernetes documentation

  Updated `FAQ`

Commits on Sep 25, 2025

Commits on Sep 26, 2025

Commits on Sep 29, 2025

Commits on Sep 30, 2025

Commits on Oct 2, 2025
