
SoS performance dealing with a large number of files #874

Description

@gaow

This bothers me when I do some very simple simulations:

[1]
output: [f"performance_test/{x+1}.out" for x in range(500)]
run:
  touch performance_test/{1..500}.out

[2]
r = [x for x in range(500)]
input: group_by = 1, paired_with = 'r'
output: [f"performance_test/{x+1}.rds" for x in range(500)], group_by = 1
task: concurrent = True
R: expand = '${ }'
  x = rnorm(${_r[0]})
  saveRDS(x, ${_output:r})

If you run this script, you'll see it halt for a second or two at the end of every batch of completed jobs. I can understand that things like signature checks are going on, but as a result a simple simulation that takes < 10 sec as a for loop can take > 700 sec with SoS -- the overhead takes far longer than the actual computation. I remember it used to be 10 sec vs > 100 sec before last summer. I guess that as the signature check has become more strict and careful about race conditions, the whole process has gotten a lot slower. Is there still room for optimization?
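For reference, the for-loop baseline mentioned above amounts to something like the following plain R script (a minimal sketch, not part of the original report; it performs the same 500 draws and saveRDS calls without any workflow engine):

# Baseline for comparison: the same 500 simulations as a plain R loop,
# with none of SoS's signature checking or task management.
dir.create("performance_test", showWarnings = FALSE)
for (r in 0:499) {
  x <- rnorm(r)   # same draw as the SoS step, where _r[0] takes the value r
  saveRDS(x, sprintf("performance_test/%d.rds", r + 1))
}

If signature checking is indeed the bottleneck, one way to test that hypothesis (assuming the -s signature-mode option of the installed SoS version) would be to time the workflow with signatures disabled, e.g. sos run script.sos -s ignore, and compare against the default run.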
