Avoid Python globbing when we actually need TChain globbing #8490

@vepadulano

Describe the bug

The logic to retrieve the input files of a distributed RDataFrame is faulty in the following snippet:

if isinstance(secondarg, str):
    # Expand globbing excluding remote files
    remote_prefixes = ("root:", "http:", "https:")
    if not secondarg.startswith(remote_prefixes):
        return glob.glob(secondarg)

This is a problem because the globbing done by glob.glob is, in general, not the same as that done by TChain.
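To make the glob.glob side of this concrete, here is a small sketch that needs no ROOT installation. It only shows what Python does with a local pattern: the result order is not guaranteed, a pattern that matches nothing silently yields an empty list, and `[...]` character classes are honoured. Whether TChain treats each of these cases the same way is not asserted here and would need checking against the ROOT version in use. The helper name `glob_behaviour_demo` and the file names are made up for illustration.

```python
import glob
import os
import tempfile


def glob_behaviour_demo():
    """Show how glob.glob expands a few local patterns (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Create empty placeholder files; they are not real ROOT files.
        for name in ("data_1.root", "data_2.root", "data_10.root"):
            open(os.path.join(tmp, name), "w").close()

        # glob.glob gives no ordering guarantee, so sort to get a stable result.
        matched = sorted(
            os.path.basename(p)
            for p in glob.glob(os.path.join(tmp, "data_*.root"))
        )

        # A pattern that matches nothing returns an empty list, with no error.
        missing = glob.glob(os.path.join(tmp, "nothing_*.root"))

        # Character classes are part of Python's glob syntax.
        bracket = sorted(
            os.path.basename(p)
            for p in glob.glob(os.path.join(tmp, "data_[12].root"))
        )
    return matched, missing, bracket
```

Any place where TChain expands the same string differently (ordering, unmatched patterns, supported wildcard syntax) would make the file list built in Python disagree with the files the TChain really reads.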

Expected behavior

The list of input files is actually used to retrieve the clusters of all the trees used to create the distributed RDataFrame, in this loop:

for filename in filelist:
    f = ROOT.TFile.Open(filename)
    t = f.Get(treename)
    entries = t.GetEntriesFast()
    it = t.GetClusterIterator(0)

So in general this extra globbing step is not needed. One way to solve this is to create the TChain directly with whatever string the user passes (i.e. including any globbing characters) and retrieve the clusters from the TChain, rather than traversing the list of files produced by the globbing.
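A rough sketch of that idea, not a tested patch: the function names `cluster_ranges` and `clusters_from_pattern` are hypothetical, and the sketch assumes that the cluster iterator returned by a TChain walks cluster boundaries across all files of the chain, which I have not verified against a specific ROOT version. It uses `GetEntries()` on the chain so that the total entry count is known before iterating.

```python
def cluster_ranges(tree, nentries):
    """Collect (start, end) entry ranges, one per cluster.

    `tree` is anything exposing GetClusterIterator(first_entry), e.g. a
    TTree or (by assumption) a TChain.
    """
    ranges = []
    it = tree.GetClusterIterator(0)
    start = it.Next()  # first entry of the current cluster
    while start < nentries:
        end = it.GetNextEntry()  # first entry of the following cluster
        ranges.append((start, end))
        start = it.Next()
    return ranges


def clusters_from_pattern(treename, pattern):
    """Let TChain expand `pattern` itself, then read clusters from the chain."""
    import ROOT  # imported lazily; requires a ROOT installation with PyROOT

    chain = ROOT.TChain(treename)
    chain.Add(pattern)  # globbing is done by TChain, not by glob.glob
    nentries = chain.GetEntries()  # forces the chain to count all entries
    return cluster_ranges(chain, nentries)
```

With this shape the user's string is passed to TChain untouched, so local and remote paths go through the same code path and there is no second, Python-side interpretation of the wildcards.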
