Improve startup time of distributed RDataFrame application #8232

@vepadulano

Description

Explain what you would like to see improved

Currently, a setup step runs in the client before the distributed computations actually start. During this step, a list of entry ranges over the original dataset is computed. The logic is as follows:

  1. For each file of the dataset, open it and compute a list of all the clusters in the file.
  2. Divide the full list of clusters from all files into groups of clusters (Ranges), according to the npartitions parameter of the dataframe.

Each Range will be assigned its own task on the distributed resources. Step 1 above can be particularly expensive, since it relies on TFile::Open. If the files of the dataset are stored remotely, the overhead adds up quickly. The call happens specifically in:

for filename in filelist:
    f = ROOT.TFile.Open(filename)
    t = f.Get(treename)
    entries = t.GetEntriesFast()
    it = t.GetClusterIterator(0)
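Step 2 above can be sketched as follows. This is a hypothetical illustration, not the actual RDataFrame implementation: make_ranges and the (start, end) cluster representation are names invented here for clarity.

```python
# Hypothetical sketch of step 2: group a flat list of cluster
# boundaries into `npartitions` entry ranges, one per task.

def make_ranges(clusters, npartitions):
    """clusters: list of (start_entry, end_entry) cluster boundaries.
    Returns one (start, end) entry range per partition, distributing
    the clusters as evenly as possible."""
    base, rest = divmod(len(clusters), npartitions)
    ranges = []
    pos = 0
    for i in range(npartitions):
        # The first `rest` partitions get one extra cluster
        count = base + (1 if i < rest else 0)
        if count == 0:
            continue
        group = clusters[pos:pos + count]
        # A range spans from its first cluster's start to its last cluster's end
        ranges.append((group[0][0], group[-1][1]))
        pos += count
    return ranges

clusters = [(0, 100), (100, 250), (250, 400), (400, 512)]
print(make_ranges(clusters, 2))  # [(0, 250), (250, 512)]
```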

Optional: share how it could be improved

Ideally, we would avoid calling TFile::Open in the client altogether. @Axel-Naumann proposed on Mattermost to estimate the number of clusters in each file from its size, and from that compute the number of tasks to run on the distributed resources:

If you have these files:

    50 MB   → ~2 clusters
    100 MB  → ~3 clusters
    300 MB  → ~10 clusters
    3 GB    → ~100 clusters

and split this into n tasks accordingly.
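One way such an estimate could look is sketched below. The average cluster size is a tunable assumption chosen so that the example sizes above reproduce the quoted estimates; it is not a ROOT constant.

```python
# Hypothetical sketch: estimate the number of clusters in a file from
# its size, given an assumed average cluster size (~30 MB here, a guess).

AVG_CLUSTER_BYTES = 30 * 1024**2  # assumption, not a ROOT constant

def estimate_clusters(size_bytes):
    # Every file contains at least one cluster
    return max(1, round(size_bytes / AVG_CLUSTER_BYTES))

sizes_mb = [50, 100, 300, 3000]
print([estimate_clusters(s * 1024**2) for s in sizes_mb])  # [2, 3, 10, 100]
```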

Each task on the distributed workers would then be responsible for opening only the file(s) in which its estimated clusters should be stored. This approach needs to be explored.
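The worker-side lookup could be sketched as follows, again as a hypothetical illustration (files_for_range and the per-file estimate tuples are names invented here): given the per-file cluster estimates, a task maps its global cluster range back to the files it actually needs to open.

```python
# Hypothetical sketch: map a task's global cluster range [start, end)
# back to the files that overlap it, using per-file cluster estimates.

def files_for_range(files_with_estimates, start, end):
    """files_with_estimates: list of (filename, estimated_nclusters).
    Returns the filenames whose estimated clusters overlap [start, end)."""
    selected = []
    offset = 0
    for fname, nclusters in files_with_estimates:
        file_start, file_end = offset, offset + nclusters
        # Half-open interval overlap test
        if file_start < end and file_end > start:
            selected.append(fname)
        offset = file_end
    return selected

files = [("a.root", 2), ("b.root", 3), ("c.root", 10)]
print(files_for_range(files, 0, 5))   # ['a.root', 'b.root']
print(files_for_range(files, 5, 15))  # ['c.root']
```

Since the estimates are only approximate, a task would still have to fall back to the real cluster boundaries once it opens its files; the estimate only decides which files each task touches.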

Additional context

Thanks to @stwunsch for bringing this up. This issue will keep track of further discussion and updates on the matter.
