Description
Steps to reproduce
- Create a fleet that allows any number of instances:
type: fleet
name: cloud-fleet
placement: cluster
nodes: 0..
- Run two clusters on different backends specifying that fleet:
dstack apply -f nccl-tests.yaml -b runpod --fleet cloud-fleet
dstack apply -f nccl-tests.yaml -b gcp --fleet cloud-fleet
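(nccl-tests.yaml itself is not included in this report; a multi-node task along these lines would exercise the same path — the contents below are an assumption for illustration, not the actual config:)
# Assumed contents of nccl-tests.yaml (illustrative only)
type: task
name: nccl-tests
# Two nodes per run, so each apply provisions a 2-instance cluster
nodes: 2
commands:
  - ./run-nccl-tests.sh  # hypothetical script
resources:
  gpu: 80GB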
You'll get two clusters provisioned in the same fleet:
(dstack) ➜ my_dstack_public git:(master) ✗ dstack fleet
 FLEET        INSTANCE  BACKEND           RESOURCES                      PRICE    STATUS          CREATED
 cloud-fleet  0         runpod (US-MO-1)  cpu=128 mem=2008GB disk=100GB  $16.704  provisioni…     31 sec ago
                                          A100:80GB…
              1         runpod (US-MO-1)  cpu=128 mem=2008GB disk=100GB  $16.704  busy            31 sec ago
                                          A100:80GB…
              2         gcp (us-west2)    cpu=224 mem=3968GB disk=100GB  $51.552  idle (warning)  10 mins ago
                                          B200:180G… (spot)
              3         gcp (us-west2)    cpu=224 mem=3968GB disk=100GB  $51.552  idle (warning)  9 mins ago
                                          B200:180G… (spot)
This breaks the placement: cluster fleet contract, which guarantees that all instances in the fleet are part of one interconnected cluster. A related problem: if the fleet does not have placement: cluster, the contract is not formally broken, but the fleet selection logic for multi-node tasks will still incorrectly assume that all nodes are interconnected and can accommodate the task when prioritizing fleets (see the sketch below).
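To make the second problem concrete, here is a minimal sketch of the flawed assumption — hypothetical code, not dstack's actual implementation. A fleet-wide count of idle instances says a 2-node task fits, while a per-cluster count shows it does not:

from collections import Counter
from dataclasses import dataclass

@dataclass
class Instance:
    backend: str
    region: str
    status: str

def fits_buggy(instances: list[Instance], nodes: int) -> bool:
    # Assumes all idle instances are interconnected -- wrong for a mixed fleet.
    return sum(i.status == "idle" for i in instances) >= nodes

def fits_fixed(instances: list[Instance], nodes: int) -> bool:
    # Interconnect only holds within one (backend, region) cluster,
    # so the capacity check must be done per cluster.
    per_cluster = Counter((i.backend, i.region) for i in instances if i.status == "idle")
    return any(n >= nodes for n in per_cluster.values())

fleet = [Instance("gcp", "us-west2", "idle"), Instance("runpod", "US-MO-1", "idle")]
print(fits_buggy(fleet, 2))  # True -- mixed fleet wrongly accepted for a 2-node task
print(fits_fixed(fleet, 2))  # False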
Actual behaviour
No response
Expected behaviour
It seems we need to forbid provisioning different clusters in one fleet. This can be challenging to implement in the current fleet selection logic without sacrificing concurrency: process_submitted_jobs does not take the fleet lock, and without it, concurrent runs can add different clusters to the same fleet.
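A self-contained sketch of that race (the structure of process_submitted_jobs is heavily simplified and all names here are hypothetical): without a per-fleet lock, two concurrent runs both observe an empty fleet, both pass a same-cluster check, and both attach their own cluster.

import asyncio

fleet: list[tuple[str, str]] = []  # (backend, region) of each attached instance
fleet_lock = asyncio.Lock()

async def add_cluster(cluster: tuple[str, str], use_lock: bool) -> None:
    async def check_and_attach() -> None:
        existing = set(fleet)
        await asyncio.sleep(0)  # yield point: the other run interleaves here
        if existing and cluster not in existing:
            raise RuntimeError(f"fleet already holds a different cluster: {existing}")
        fleet.append(cluster)

    if use_lock:
        async with fleet_lock:  # serializes the check-then-attach section
            await check_and_attach()
    else:
        await check_and_attach()

async def main() -> None:
    # Without the lock, both runs see an empty fleet and both attach:
    await asyncio.gather(add_cluster(("runpod", "US-MO-1"), use_lock=False),
                         add_cluster(("gcp", "us-west2"), use_lock=False))
    print(fleet)  # [('runpod', 'US-MO-1'), ('gcp', 'us-west2')] -- mixed fleet

asyncio.run(main())

With use_lock=True the second run sees the first cluster and fails the check instead of mixing clusters, which is why any forbid-different-clusters check has to run under the fleet lock (or an equivalent serialization) to be effective.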
dstack version
master
Server logs
Additional information
No response