Slowness with large input #991

@gaow

Description

This topic has been discussed before, but perhaps not in the same context. I've got a couple of workflow steps like this:

[step]
input: '/path/to/a/single/file.gz', for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
[another_step]
input: glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/SuSiE_CS_*/*.rds'))
R: expand = "${ }"

I run it in 2 separate sequential SoS commands:

sos run step
sos run another_step

As you can see, the first step takes a single file, file.gz, pairs it with different chroms, and then creates many small rds files as dynamic output. The actual number of output files at the end of the pipeline is

>>> len(glob.glob('chr*/*.rds'))
43601
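
For what it's worth, I don't think the glob itself is the slow part. Here is a minimal sketch (plain Python, run from the output directory) to time it; I would expect it to finish well under a second even for ~43K files, though that is an assumption rather than a careful measurement:

import glob
import time

start = time.time()
files = glob.glob('chr*/*.rds')
# listing ~43K files should only take a fraction of a second on its own
print(f'found {len(files)} files in {time.time() - start:.2f}s')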

Now when I run the 2nd step, it gets stuck in a single SoS process preparing for the run. It has been 10 minutes (I started writing this post 5 minutes ago) and it is still working on it ... not yet analyzing the data.

~43K files does not sound like a big deal, right? But this is indeed the first time I have used the dynamic output of a previous step as the input of the next step, in separate commands. I am wondering what is going on in this context, and whether we can do something about it.
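
For scale, here is a rough back-of-the-envelope sketch; the per-substep cost is a made-up assumption for illustration, not a number measured from SoS, but with group_by = 1 even a few milliseconds of per-file bookkeeping adds up:

n_files = 43601              # number of rds inputs from the previous step
per_file_overhead = 0.015    # assumed seconds per substep (signature check, stat, bookkeeping); hypothetical
print(f'estimated preparation time: {n_files * per_file_overhead / 60:.1f} minutes')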
