Slowness with large input #991

@gaow

Description

This topic has been discussed before, but perhaps not in the same context. I've got a couple of workflow steps like this:

[step]
input: '/path/to/a/single/file.gz', for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
[another_step]
input: glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/SuSiE_CS_*/*.rds'))
R: expand = "${ }"

I run it in 2 separate sequential SoS commands:

sos run step
sos run another_step

As you can see, the first step takes a single file, file.gz, pairs it with different chroms, and then creates many small rds files as dynamic output. The actual number of output files at the end of the pipeline is

>>> len(glob.glob('chr*/*.rds'))
43601
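
For what it's worth, I don't think the glob itself is the slow part. Here is a minimal sketch (plain Python, run from the output directory) to time it; I would expect it to finish well under a second even for ~43K files, though that is an assumption rather than a careful measurement:

import glob
import time

start = time.time()
files = glob.glob('chr*/*.rds')
# listing ~43K files should only take a fraction of a second on its own
print(f'found {len(files)} files in {time.time() - start:.2f}s')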

Now when I run the 2nd step, it gets stuck in a single SoS process preparing for the run. It has been 10 minutes (I started writing this post 5 minutes ago) and it is still working on it ... not yet analyzing the data.

~43K files does not sound like a big deal, right? But this is indeed the first time I have used the dynamic output of a previous step as the input of the next step, in separate commands. I am wondering what is going on in this context, and whether we can do something about it.
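
For scale, here is a rough back-of-the-envelope sketch; the per-substep cost is a made-up assumption for illustration, not a number measured from SoS, but with group_by = 1 even a few milliseconds of per-file bookkeeping adds up:

n_files = 43601              # number of rds inputs from the previous step
per_file_overhead = 0.015    # assumed seconds per substep (signature check, stat, bookkeeping); hypothetical
print(f'estimated preparation time: {n_files * per_file_overhead / 60:.1f} minutes')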
