[Bug]: Using "files" directive with an SSH Fleet will have dstack-runner consume all ram and hang #3260

@edanvoye

Steps to reproduce

Using a simple SSH fleet with one host, run a simple task with a dummy command and one "files" directive that adds a small file. When attempting to execute this task, the container is created on the host, but the dstack-runner process that initializes the task inside the container hangs and consumes all RAM until the process is halted. Simply removing the "files" directive makes the problem disappear.

test.txt
commands to run test.txt
ava-fleet.dstack.yml
ava-task.dstack.yml

Actual behaviour

The task fails because the container is killed for using all the RAM, so the dstack server cannot connect to it to gather metrics, and the task is considered failed.
If we connect to the container on the host in the brief window between the task starting and being terminated, we can see this process taking all the RAM:

root 961 262 75.3 57403404 49390160 ? Sl 16:51 4:51 /usr/local/bin/dstack-runner --log-level 5 start --http-port 10999 --ssh-port 10022 --temp-dir /tmp/runner --home-dir /root
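
To put the numbers in that process listing into readable units, here is a minimal sketch. It assumes the line comes from `ps aux` on Linux, where the columns are USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND and VSZ/RSS are reported in KiB. `parse_ps_aux_line` is a hypothetical helper written for this illustration, not part of dstack.

```python
# Assumption: the quoted line is `ps aux` output (VSZ and RSS in KiB).
PS_LINE = (
    "root 961 262 75.3 57403404 49390160 ? Sl 16:51 4:51 "
    "/usr/local/bin/dstack-runner --log-level 5 start --http-port 10999 "
    "--ssh-port 10022 --temp-dir /tmp/runner --home-dir /root"
)


def parse_ps_aux_line(line: str) -> dict:
    """Split one `ps aux` row into named fields (hypothetical helper)."""
    # Split on whitespace at most 10 times so the command keeps its spaces.
    fields = line.split(None, 10)
    user, pid, cpu, mem, vsz_kib, rss_kib = fields[:6]
    kib_per_gib = 1024 ** 2
    return {
        "user": user,
        "pid": int(pid),
        "cpu_percent": float(cpu),
        "mem_percent": float(mem),
        "vsz_gib": int(vsz_kib) / kib_per_gib,
        "rss_gib": int(rss_kib) / kib_per_gib,
        "command": fields[10],
    }


info = parse_ps_aux_line(PS_LINE)
print(
    f"pid {info['pid']}: {info['rss_gib']:.1f} GiB resident, "
    f"{info['vsz_gib']:.1f} GiB virtual, {info['mem_percent']}% of RAM"
)
```

Under that assumption, the runner process holds roughly 47 GiB of resident memory (75.3% of the host's RAM) while using 262% CPU.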

Expected behaviour

The expected behavior is for the file test.txt to be available in the container and for the task to execute correctly.

dstack version

0.19.35

Server logs

[11:30:13] INFO     dstack._internal.server.background.tasks.process_submitted_jobs:768 The job my-test-job-2-0-0 switched instance ava-eam-on-prem-fleet-0 status to BUSY
           INFO     dstack._internal.server.background.tasks.process_submitted_jobs:777 job(d22cde)my-test-job-2-0-0: now is provisioning on 'ava-eam-on-prem-fleet-0'
[11:30:23] INFO     dstack._internal.server.background.tasks.process_runs:367 run(da65e7)my-test-job-2: run status has changed SUBMITTED -> PROVISIONING
           INFO     dstack._internal.server.background.tasks.process_runs:376 run(da65e7)my-test-job-2: run took 11.61 seconds from submission to provisioning.
[11:30:25] INFO     dstack._internal.server.background.tasks.process_running_jobs:616 job(d22cde)my-test-job-2-0-0: now is PULLING
[11:30:52] INFO     dstack._internal.server.background.tasks.process_runs:367 run(da65e7)my-test-job-2: run status has changed PROVISIONING -> RUNNING
[11:32:42] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:32:47] WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:02:35.988324
[11:32:52] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:02] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:02:50.936520
[11:33:12] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:16] WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:04.724999
[11:33:22] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:32] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:20.520831
[11:33:42] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:33:47] WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:35.629000
[11:33:52] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:02] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:03:50.948077
[11:34:12] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:17] WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:04:05.589356
[11:34:22] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:30] WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:04:18.687885
[11:34:32] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:42] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:45] WARNING  dstack._internal.server.background.tasks.process_running_jobs:390 job(d22cde)my-test-job-2-0-0: is unreachable, waiting for the instance to become reachable again, age=0:04:33.981250
[11:34:52] WARNING  dstack._internal.server.background.tasks.process_metrics:139 Failed to connect to job my-test-job-2-0-0 to collect metrics
[11:34:59] WARNING  dstack._internal.server.background.tasks.process_running_jobs:380 job(d22cde)my-test-job-2-0-0: failed because instance is unreachable, age=0:04:47.073074
[11:35:01] INFO     dstack._internal.server.services.jobs:346 job(d22cde)my-test-job-2-0-0: instance 'ava-eam-on-prem-fleet-0' has been released, new status is IDLE
           INFO     dstack._internal.server.services.services:270 job(d22cde)my-test-job-2-0-0: service replica unregistered from receiving requests, gateway=False
           INFO     dstack._internal.server.services.jobs:400 job(d22cde)my-test-job-2-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY
           INFO     dstack._internal.server.background.tasks.process_runs:367 run(da65e7)my-test-job-2: run status has changed RUNNING -> TERMINATING
[11:35:08] INFO     dstack._internal.server.services.runs:1195 run(da65e7)my-test-job-2: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED

Additional information

No response

Labels

bug, major