Description
Steps to reproduce
- Create a project without cloud backends.
  The same can be reproduced with cloud fleets, see below.
- Get an on-prem fleet with one instance and another fleet with two instances.
  > dstack fleet
  FLEET      INSTANCE  BACKEND       RESOURCES                  PRICE  STATUS  CREATED
  on-prem-2  0         ssh (remote)  2xCPU, 1GB, 35.2GB (disk)  $0.0   idle    1 hour ago
             1         ssh (remote)  2xCPU, 1GB, 35.2GB (disk)  $0.0   idle    1 hour ago
  on-prem-1  0         ssh (remote)  2xCPU, 1GB, 35.1GB (disk)  $0.0   idle    7 mins ago
- Try running a task with two nodes or a service with two replicas.
  type: service
  replicas: 2
  port: 12345
  commands:
    - sleep infinity
  resources:
    memory: 0.5GB..
    disk: 10GB..
Actual behaviour
dstack may assign the run to the fleet with one instance. The second job will then fail because the fleet does not have enough instances.
> dstack apply
#  BACKEND  REGION  INSTANCE  RESOURCES                  SPOT  PRICE
1  ssh      remote  instance  2xCPU, 1GB, 35.2GB (disk)  no    $0     idle
2  ssh      remote  instance  2xCPU, 1GB, 35.2GB (disk)  no    $0     idle
3  ssh      remote  instance  2xCPU, 1GB, 35.1GB (disk)  no    $0     idle
Submit a new run? [y/n]: y
NAME             BACKEND       INSTANCE  RESOURCES                  RESERVATION  PRICE  STATUS      SUBMITTED  ERROR
happy-pug-1                                                                             failed      22:24      JOB_FAILED
replica=0 job=0  ssh (remote)  instance  2xCPU, 1GB, 35.1GB (disk)               $0.0   terminated  22:24      TERMINATED_BY_SERVER
replica=1 job=0                                                                         failed      22:24      FAILED_TO_START_DUE_TO_NO_CAPACITY
Sometimes dstack will choose the correct fleet, so you may need to re-create one of the fleets a few times until you can reproduce the issue.
Expected behaviour
dstack chooses the fleet with two instances and both jobs are provisioned successfully.
If there are no fleets with enough capacity, dstack shows no offers and the run fails before submitting the jobs.
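To make the expected behaviour concrete, here is a minimal sketch of the selection rule described above. It is not dstack's actual scheduling code: the `Fleet` class, the `pick_fleet` function, and the "tightest fit" tie-break are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fleet:
    """Hypothetical, simplified fleet: a name and its number of idle instances."""
    name: str
    idle_instances: int


def pick_fleet(fleets: list[Fleet], jobs_needed: int) -> Optional[Fleet]:
    """Return a fleet that can host every job of the run, or None if none can."""
    candidates = [f for f in fleets if f.idle_instances >= jobs_needed]
    if not candidates:
        # Corresponds to "no offers": the run should fail before any job is submitted.
        return None
    # Assumption: prefer the tightest fit so larger fleets stay free for larger runs.
    return min(candidates, key=lambda f: f.idle_instances)


# The two on-prem fleets from the steps to reproduce.
fleets = [Fleet("on-prem-1", 1), Fleet("on-prem-2", 2)]
chosen = pick_fleet(fleets, jobs_needed=2)
print(chosen.name if chosen else "no offers")
```

With the fleets from this report, a run that needs two instances can only land on `on-prem-2`, and a run that needs three gets no fleet at all.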
dstack version
0.18.36
Server logs
Additional information
The same can be reproduced with cloud fleets using --reuse.
> dstack fleet
FLEET    INSTANCE  BACKEND           RESOURCES                          PRICE    STATUS  CREATED
cloud-1  0         aws (eu-north-1)  2xCPU, 8GB, 100.0GB (disk), SPOT   $0.029   idle    2 mins ago
cloud-2  0         aws (eu-north-1)  4xCPU, 16GB, 100.0GB (disk), SPOT  $0.0603  idle    1 min ago
         1         aws (eu-north-1)  4xCPU, 16GB, 100.0GB (disk), SPOT  $0.0603  idle    1 min ago
> dstack apply --reuse
#  BACKEND  REGION      INSTANCE   RESOURCES                    SPOT  PRICE
1  aws      eu-north-1  m5.large   2xCPU, 8GB, 100.0GB (disk)   yes   $0.029   idle
2  aws      eu-north-1  m5.xlarge  4xCPU, 16GB, 100.0GB (disk)  yes   $0.0603  idle
3  aws      eu-north-1  m5.xlarge  4xCPU, 16GB, 100.0GB (disk)  yes   $0.0603  idle
Submit a new run? [y/n]: y
NAME             BACKEND           INSTANCE  RESOURCES                         RESERVATION  PRICE   STATUS      SUBMITTED  ERROR
fuzzy-fish-1                                                                                        failed      22:58      JOB_FAILED
replica=0 job=0  aws (eu-north-1)  m5.large  2xCPU, 8GB, 100.0GB (disk), SPOT               $0.029  terminated  22:58      TERMINATED_BY_SERVER
replica=1 job=0                                                                                     failed      22:58      FAILED_TO_START_DUE_TO_NO_CAPACITY