Avoid Python globbing when we actually need TChain globbing #8490
Description
Describe the bug
The logic to retrieve the input files of a distributed RDataFrame is faulty at
root/bindings/experimental/distrdf/python/DistRDF/Node.py
Lines 744 to 748 in 76fa200
    if isinstance(secondarg, str):
        # Expand globbing excluding remote files
        remote_prefixes = ("root:", "http:", "https:")
        if not secondarg.startswith(remote_prefixes):
            return glob.glob(secondarg)
This is a problem because, in general, the globbing done by glob.glob is not the same as that done by TChain.
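To make the Python side of this concrete, here is a minimal sketch of how the quoted branch behaves for a few inputs. The function name `expand_like_current_logic` is hypothetical, and the pass-through return for remote paths is an assumption, since the excerpt above does not show that line. The sketch only exercises `glob.glob`; it makes no claim about what TChain would do with the same strings.

```python
import glob
import os
import tempfile


def expand_like_current_logic(secondarg):
    """Mimic the quoted branch: glob local strings in Python.

    The remote pass-through below is an assumption; the issue excerpt
    does not show what happens for remote paths.
    """
    remote_prefixes = ("root:", "http:", "https:")
    if not secondarg.startswith(remote_prefixes):
        return glob.glob(secondarg)
    return [secondarg]


with tempfile.TemporaryDirectory() as tmpdir:
    # Create two empty placeholder files (not real ROOT files).
    for name in ("a_1.root", "a_2.root"):
        open(os.path.join(tmpdir, name), "w").close()

    # glob.glob returns matches in arbitrary order, so sort before comparing.
    matched = sorted(
        os.path.basename(p)
        for p in expand_like_current_logic(os.path.join(tmpdir, "a_*.root"))
    )

    # A plain path with no wildcard that does not exist yields an empty
    # list, so the user's input is silently dropped at this stage.
    missing = expand_like_current_logic(os.path.join(tmpdir, "missing.root"))

# Remote paths are never expanded by this logic.
remote = expand_like_current_logic("root://host//path/file_*.root")
```

The point of the sketch is that any string handed to this branch is interpreted by Python's rules first, so whatever TChain would have made of the original string is never consulted.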
Expected behavior
The list of input files is actually used to retrieve the clusters of all the trees used to create the distributed RDataFrame in
root/bindings/experimental/distrdf/python/DistRDF/Node.py
Lines 363 to 368 in 76fa200
    for filename in filelist:
        f = ROOT.TFile.Open(filename)
        t = f.Get(treename)
        entries = t.GetEntriesFast()
        it = t.GetClusterIterator(0)
So in general this extra globbing step is not needed. One way to solve this is to create the TChain directly from whatever string the user passes (i.e. including any globbing characters) and retrieve the clusters from the TChain, rather than traversing the list of files produced by the Python globbing.
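The proposal can be sketched roughly as follows. This is not the DistRDF implementation; the helper names `build_chain` and `files_from_chain` are hypothetical. It assumes that `TChain.Add` accepts the user's string as given and that each entry of `TChain.GetListOfFiles()` is a `TChainElement` whose title holds the resolved file name. ROOT is imported lazily so that the file-listing helper can be exercised without a ROOT installation.

```python
def build_chain(treename, pattern):
    """Create a TChain and let TChain itself expand the pattern."""
    import ROOT  # imported lazily; requires a ROOT installation

    chain = ROOT.TChain(treename)
    # The user's string is passed through untouched, wildcards included.
    chain.Add(pattern)
    return chain


def files_from_chain(chain):
    """Return the file names that the chain resolved.

    Each element of GetListOfFiles() is expected to be a TChainElement,
    whose title is the file name.
    """
    return [element.GetTitle() for element in chain.GetListOfFiles()]
```

With ROOT available, `files_from_chain(build_chain("mytree", "data_*.root"))` would give the list of files as TChain sees it (the tree and file names here are placeholders), and the cluster boundaries could then be read from those files or from the chain itself, without any `glob.glob` call in between.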