Description
Steps to reproduce
Using a simple SSH fleet with a single host, run a simple task with a dummy command and one "files" directive that adds a small file. When attempting to execute this task, the container is created on the host, but the dstack-runner process that initializes the task inside the container hangs and consumes all RAM until the process is killed. Simply removing the "files" directive makes the problem disappear.
test.txt
commands to run test.txt
ava-fleet.dstack.yml
ava-task.dstack.yml
Actual behaviour
The task fails because the container is killed for using all the RAM; the dstack server then cannot connect to it to gather metrics, and the task is considered failed.
If we connect to the container on the host in the brief window between the task starting and being terminated, we can see this process taking all the RAM:
root 961 262 75.3 57403404 49390160 ? Sl 16:51 4:51 /usr/local/bin/dstack-runner --log-level 5 start --http-port 10999 --ssh-port 10022 --temp-dir /tmp/runner --home-dir /root
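To put the numbers in that line into perspective, here is a minimal Python sketch that splits it into fields and converts the memory columns to GiB. It assumes the line comes from `ps aux` (column order USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND) and that VSZ and RSS are reported in KiB, as procps does on Linux; the helper name `parse_ps_line` is made up for this illustration.

```python
# Assumption: the line was produced by `ps aux`, whose columns are
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND,
# with VSZ and RSS expressed in KiB.
PS_LINE = (
    "root 961 262 75.3 57403404 49390160 ? Sl 16:51 4:51 "
    "/usr/local/bin/dstack-runner --log-level 5 start --http-port 10999 "
    "--ssh-port 10022 --temp-dir /tmp/runner --home-dir /root"
)

KIB_PER_GIB = 1024 * 1024


def parse_ps_line(line):
    """Split one `ps aux` row into named fields (hypothetical helper)."""
    # Split on whitespace at most 10 times so the command, which
    # itself contains spaces, stays in one piece as the 11th field.
    fields = line.split(None, 10)
    return {
        "user": fields[0],
        "pid": int(fields[1]),
        "cpu_percent": float(fields[2]),
        "mem_percent": float(fields[3]),
        "vsz_gib": int(fields[4]) / KIB_PER_GIB,
        "rss_gib": int(fields[5]) / KIB_PER_GIB,
        "command": fields[10],
    }


info = parse_ps_line(PS_LINE)
print(
    f"pid {info['pid']}: RSS {info['rss_gib']:.1f} GiB, "
    f"VSZ {info['vsz_gib']:.1f} GiB, {info['mem_percent']}% of host memory"
)
```

Under those assumptions, the dstack-runner process holds roughly 47 GiB of resident memory at the moment the snapshot was taken.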
Expected behaviour
The expected behavior is for the file test.txt to be available in the container and for the task to execute correctly.
dstack version
0.19.35
Server logs
[11:30:13] INFO dstack._internal.server.background.tasks.process_submitted_jobs:768 The job my-test-job-2-0-0 switched instance ava-eam-on-prem-fleet-0 status to BUSY
INFO dstack._internal.server.background.tasks.process_submitted_jobs:777 job(d22cde)my-test-job-2-0-0: now is provisioning on 'ava-eam-on-prem-fleet-0'
[11:30:23] INFO dstack._internal.server.background.tasks.process_runs:367 run(da65e7)my-test-job-2: run status has changed SUBMITTED -> PROVISIONING
INFO dstack._internal.server.background.tasks.process_runs:376 run(da65e7)my-test-job-2: run took 11.61 seconds from submission to provisioning.
[11:30:25] INFO dstack._internal.server.background.tasks.process_running_jobs:616 job(d22cde)my-test-job-2-0-0: now is PULLING
[11:30:52] INFO dstack._internal.server.background.tasks.process_runs:367 run(da65e7)my-test-job-2: run status has changed PROVISIONING -> RUNNING
[11:32:42] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:32:47] WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:02:35.988324
[11:32:52] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:02] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:02:50.936520
[11:33:12] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:16] WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:04.724999
[11:33:22] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:32] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:20.520831
[11:33:42] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:47] WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:35.629000
[11:33:52] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:02] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:50.948077
[11:34:12] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:17] WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:04:05.589356
[11:34:22] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:30] WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:04:18.687885
[11:34:32] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:42] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:45] WARNING dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:04:33.981250
[11:34:52] WARNING dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:59] WARNING dstack._internal.server.background.tasks.process_running_jobs:380 job(d22cde)my-test-job-2-0-0: failed because instance is unreachable, age=0:04:47.073074
[11:35:01] INFO dstack._internal.server.services.jobs:346 job(d22cde)my-test-job-2-0-0: instance 'ava-eam-on-prem-fleet-0' has been released, new status is IDLE
INFO dstack._internal.server.services.services:270 job(d22cde)my-test-job-2-0-0: service replica unregistered from receiving requests, gateway=False
INFO dstack._internal.server.services.jobs:400 job(d22cde)my-test-job-2-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY
INFO dstack._internal.server.background.tasks.process_runs:367 run(da65e7)my-test-job-2: run status has changed RUNNING -> TERMINATING
[11:35:08] INFO dstack._internal.server.services.runs:1195 run(da65e7)my-test-job-2: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
Additional information
No response