Improve startup time of distributed RDataFrame application #8232
Description
Explain what you would like to see improved
Currently, a setup step runs in the client before the distributed computations actually start. During this setup, a list of entry ranges over the original dataset is computed. The logic is as follows:
1. For each file of the dataset, open it and compute the list of all clusters in the file.
2. Divide the full list of clusters from all files into groups of clusters (`Ranges`), depending on the `npartitions` parameter of the dataframe.
Each Range will then be assigned its own task on the distributed resources. Step 1 above can be particularly expensive to run, since it relies on `TFile::Open`: if the files of the dataset are stored remotely, the overhead adds up quickly. The call happens specifically in:
`root/bindings/experimental/distrdf/python/DistRDF/Node.py`, lines 363 to 368 in db3d424:

```python
for filename in filelist:
    f = ROOT.TFile.Open(filename)
    t = f.Get(treename)
    entries = t.GetEntriesFast()
    it = t.GetClusterIterator(0)
```
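To make steps 1 and 2 concrete, here is a minimal sketch of the current approach. The helper names (`get_clusters`, `split_into_ranges`) are hypothetical and the real DistRDF code differs in details, but the cluster-iteration pattern matches the snippet above:

```python
import ROOT

def get_clusters(treename, filelist):
    """Collect (start, end, filename) cluster boundaries from every file.

    This is the expensive step: every file is opened with TFile::Open,
    which can be slow for remote storage.
    """
    clusters = []
    for filename in filelist:
        f = ROOT.TFile.Open(filename)
        t = f.Get(treename)
        entries = t.GetEntriesFast()
        it = t.GetClusterIterator(0)
        start = it()
        while start < entries:
            end = it()  # start of the next cluster == end of the current one
            clusters.append((start, end, filename))
            start = end
    return clusters

def split_into_ranges(clusters, npartitions):
    """Divide the flat cluster list into npartitions roughly equal groups
    (Ranges), each of which becomes one distributed task."""
    per_range, remainder = divmod(len(clusters), npartitions)
    ranges, i = [], 0
    for p in range(npartitions):
        n = per_range + (1 if p < remainder else 0)
        ranges.append(clusters[i:i + n])
        i += n
    return ranges
```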
Optional: share how it could be improved
Ideally we could avoid calling `TFile::Open` in the client. @Axel-Naumann proposed on Mattermost to estimate the number of clusters in each file from its size and, from that, compute the number of tasks to run on the distributed resources:
If you have these files:
- 50 MB
- 100 MB
- 300 MB
- 3 GB

then I'd translate that to cluster estimates:
- 2
- 3
- 10
- 100

and split this into n tasks accordingly.
Each task on the distributed workers would then be responsible for opening only the file(s) in which its estimated clusters should be stored. This needs to be explored.
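As a starting point for that exploration, here is a rough sketch of the size-based estimate. The ~30 MB-per-cluster constant, the helper names, and the use of local file sizes are all assumptions for illustration (remote files would need a stat call to the storage system instead), not part of an agreed design:

```python
import math
import os

# Assumed average compressed cluster size; purely an illustrative guess.
ESTIMATED_CLUSTER_SIZE = 30 * 1024 * 1024  # ~30 MB

def estimate_clusters(filenames):
    """Estimate the cluster count of each file from its size,
    without ever calling TFile::Open in the client."""
    return [max(1, round(os.path.getsize(f) / ESTIMATED_CLUSTER_SIZE))
            for f in filenames]

def build_tasks(filenames, npartitions):
    """Split the estimated clusters into roughly npartitions tasks.

    Each task only records which file and which estimated cluster index
    it covers, so the files are opened on the workers, not in the client.
    """
    estimates = estimate_clusters(filenames)
    total = sum(estimates)
    per_task = max(1, math.ceil(total / npartitions))
    tasks, current = [], []
    for filename, nclusters in zip(filenames, estimates):
        for i in range(nclusters):
            current.append((filename, i))
            if len(current) == per_task:
                tasks.append(current)
                current = []
    if current:
        tasks.append(current)
    return tasks
```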
Additional context
Thanks to @stwunsch for bringing this up. This issue will keep track of further discussion and updates on the matter.