[Bug]: Different clusters can be provisioned in one fleet #3250

@r4victor

Description

Steps to reproduce

  1. Create a fleet that allows any instances:
type: fleet
name: cloud-fleet
placement: cluster
nodes: 0..
  2. Run two clusters on different backends specifying that fleet:
dstack apply -f nccl-tests.yaml -b runpod --fleet cloud-fleet
dstack apply -f nccl-tests.yaml -b gcp --fleet cloud-fleet

You'll get two clusters provisioned in the same fleet:

(dstack) ➜  my_dstack_public git:(master) ✗ dstack fleet
 FLEET        INSTANCE  BACKEND     RESOURCES   PRICE    STATUS       CREATED    
 cloud-fleet  0         runpod      cpu=128     $16.704  provisioni…  31 sec ago 
                        (US-MO-1)   mem=2008GB                                   
                                    disk=100GB                                   
                                    A100:80GB…                                   
              1         runpod      cpu=128     $16.704  busy         31 sec ago 
                        (US-MO-1)   mem=2008GB                                   
                                    disk=100GB                                   
                                    A100:80GB…                                   
              2         gcp         cpu=224     $51.552  idle         10 mins    
                        (us-west2)  mem=3968GB           (warning)    ago        
                                    disk=100GB                                   
                                    B200:180G…                                   
                                    (spot)                                       
              3         gcp         cpu=224     $51.552  idle         9 mins ago 
                        (us-west2)  mem=3968GB           (warning)               
                                    disk=100GB                                   
                                    B200:180G…                                   
                                    (spot)                                       

This breaks the contract of a fleet with placement: cluster, which guarantees that all instances in the fleet belong to one interconnected cluster. A related problem arises when the fleet does not have placement: cluster: the fleet contract is not broken, but when prioritizing fleets, the fleet selection logic for multi-node tasks will still incorrectly assume that all nodes are interconnected and can accommodate the task.
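To make the broken contract concrete, here is a minimal sketch of a check that flags a fleet holding more than one cluster. The `Instance` record and the rule that instances are interconnected only when backend and region both match are assumptions made for illustration; they are not dstack's actual model or its definition of a cluster.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Instance:
    # Hypothetical, simplified instance record; not dstack's actual model.
    num: int
    backend: str
    region: str


def cluster_key(instance: Instance) -> Tuple[str, str]:
    # Assumption: two instances are interconnected only if both the
    # backend and the region match.
    return (instance.backend, instance.region)


def distinct_clusters(instances: Iterable[Instance]) -> Set[Tuple[str, str]]:
    # Collect the set of clusters represented in the fleet.
    return {cluster_key(i) for i in instances}


def is_single_cluster(instances: Iterable[Instance]) -> bool:
    # An empty fleet trivially satisfies the contract.
    return len(distinct_clusters(instances)) <= 1


# The fleet from the `dstack fleet` output above.
fleet = [
    Instance(0, "runpod", "US-MO-1"),
    Instance(1, "runpod", "US-MO-1"),
    Instance(2, "gcp", "us-west2"),
    Instance(3, "gcp", "us-west2"),
]

print(is_single_cluster(fleet))      # the whole fleet mixes two clusters
print(is_single_cluster(fleet[:2]))  # the runpod instances alone form one
```

A check like this could also serve the fleet selection logic, which would then count only the nodes of one cluster when deciding whether a fleet can host a multi-node task.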

Actual behaviour

No response

Expected behaviour

It seems we need to forbid provisioning different clusters in one fleet. This can be challenging to implement in the current fleet selection logic without sacrificing concurrency: process_submitted_jobs does not take the fleet lock, and without it, concurrent runs can add different clusters to one fleet.
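The trade-off can be illustrated with a small sketch in which the "is this the same cluster?" check and the addition of instances happen under one per-fleet lock. Everything here is hypothetical: the `Fleet` class, `try_add`, and the `(backend, region)` cluster key are invented for illustration, and `threading` stands in for whatever locking the server actually uses.

```python
import threading
from typing import Dict, List, Optional, Tuple

# Assumed notion of "same cluster": matching backend and region.
ClusterKey = Tuple[str, str]


class Fleet:
    """Hypothetical in-memory fleet; not dstack's actual model."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.lock = threading.Lock()
        self.cluster: Optional[ClusterKey] = None
        self.instances: List[ClusterKey] = []

    def try_add(self, key: ClusterKey, count: int) -> bool:
        # The check and the append run under one lock, so two concurrent
        # runs cannot both observe an empty fleet and add different clusters.
        with self.lock:
            if self.cluster is not None and self.cluster != key:
                return False
            self.cluster = key
            self.instances.extend([key] * count)
            return True


def provision(fleet: Fleet, key: ClusterKey, count: int,
              results: Dict[ClusterKey, bool]) -> None:
    # Record whether this run was allowed to add its cluster to the fleet.
    results[key] = fleet.try_add(key, count)


fleet = Fleet("cloud-fleet")
results: Dict[ClusterKey, bool] = {}
threads = [
    threading.Thread(target=provision, args=(fleet, ("runpod", "US-MO-1"), 2, results)),
    threading.Thread(target=provision, args=(fleet, ("gcp", "us-west2"), 2, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one of the two runs succeeds; which one depends on scheduling.
print(results)
```

The cost is the one named above: every run targeting the same fleet is serialized on that lock for the duration of the check-and-add step, which is the concurrency the current logic avoids giving up.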

dstack version

master

Server logs

Additional information

No response

Labels

bug (Something isn't working)
