Improve startup time of distributed RDataFrame application #8232
Description
Explain what you would like to see improved
Currently, a setup step runs in the client before the distributed computations actually start. During this setup, a list of entry ranges over the original dataset is computed. The logic is as follows:
1. For each file of the dataset, open it and compute the list of all clusters in the file.
2. Divide the full list of clusters from all files into groups of clusters (`Ranges`), depending on the `npartitions` parameter of the dataframe.
Each Range will then be assigned its own task on the distributed resources. Step 1 above can be particularly expensive to run, since it relies on `TFile::Open`: if the files of the dataset are stored remotely, the overhead adds up quickly. The call happens specifically in:
`root/bindings/experimental/distrdf/python/DistRDF/Node.py`, lines 363 to 368 in db3d424:

```python
for filename in filelist:
    f = ROOT.TFile.Open(filename)
    t = f.Get(treename)
    entries = t.GetEntriesFast()
    it = t.GetClusterIterator(0)
```
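To make steps 1 and 2 concrete, here is a minimal sketch of the current approach. The helper names (`get_clusters`, `split_into_ranges`) are hypothetical and the real DistRDF code differs in details, but the cluster-iteration pattern matches the snippet above:

```python
import ROOT

def get_clusters(treename, filelist):
    """Collect (start, end, filename) cluster boundaries from every file.

    This is the expensive step: every file is opened with TFile::Open,
    which can be slow for remote storage.
    """
    clusters = []
    for filename in filelist:
        f = ROOT.TFile.Open(filename)
        t = f.Get(treename)
        entries = t.GetEntriesFast()
        it = t.GetClusterIterator(0)
        start = it()
        while start < entries:
            end = it()  # start of the next cluster == end of the current one
            clusters.append((start, end, filename))
            start = end
    return clusters

def split_into_ranges(clusters, npartitions):
    """Divide the flat cluster list into npartitions roughly equal groups
    (Ranges), each of which becomes one distributed task."""
    per_range, remainder = divmod(len(clusters), npartitions)
    ranges, i = [], 0
    for p in range(npartitions):
        n = per_range + (1 if p < remainder else 0)
        ranges.append(clusters[i:i + n])
        i += n
    return ranges
```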
Optional: share how it could be improved
Ideally we could avoid calling `TFile::Open` in the client. @Axel-Naumann proposed on Mattermost to estimate the number of clusters in each file from its size and, from that, compute the number of tasks to run on the distributed resources:
If you have these files:
- 50 MB
- 100 MB
- 300 MB
- 3 GB

then I'd translate that to cluster estimates:
- 2
- 3
- 10
- 100

and split this into n tasks accordingly.
Each task on the distributed workers would then be responsible for opening only the file(s) in which its estimated clusters should be stored. This needs to be explored.
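As a starting point for that exploration, here is a rough sketch of the size-based estimate. The ~30 MB-per-cluster constant, the helper names, and the use of local file sizes are all assumptions for illustration (remote files would need a stat call to the storage system instead), not part of an agreed design:

```python
import math
import os

# Assumed average compressed cluster size; purely an illustrative guess.
ESTIMATED_CLUSTER_SIZE = 30 * 1024 * 1024  # ~30 MB

def estimate_clusters(filenames):
    """Estimate the cluster count of each file from its size,
    without ever calling TFile::Open in the client."""
    return [max(1, round(os.path.getsize(f) / ESTIMATED_CLUSTER_SIZE))
            for f in filenames]

def build_tasks(filenames, npartitions):
    """Split the estimated clusters into roughly npartitions tasks.

    Each task only records which file and which estimated cluster index
    it covers, so the files are opened on the workers, not in the client.
    """
    estimates = estimate_clusters(filenames)
    total = sum(estimates)
    per_task = max(1, math.ceil(total / npartitions))
    tasks, current = [], []
    for filename, nclusters in zip(filenames, estimates):
        for i in range(nclusters):
            current.append((filename, i))
            if len(current) == per_task:
                tasks.append(current)
                current = []
    if current:
        tasks.append(current)
    return tasks
```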
Additional context
Thanks to @stwunsch for bringing this up. This issue will keep track of further discussion and updates on the matter.