Support chains with subtrees with different names in distributed RDF #8750
Closed
Description
Describe the bug
The current logic to construct a TChain in a distributed task to pass to the RDF constructor is at
root/bindings/experimental/distrdf/python/DistRDF/Backends/Base.py
Lines 166 to 168 in b494a9b
chain = ROOT.TChain(treename)
for f in current_range.filelist:
    chain.Add(str(f))
But this is too simple: it does not account for the common use case of a TChain with no name whose subtrees have different names:
>>> import ROOT
>>> RDF = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
>>> c = ROOT.TChain()
>>> c.Add("10entries.root/entries")
1
>>> c.Add("other10entries.root/otherentries")
1
>>> df = RDF(c)
>>> df.Count().GetValue()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/Proxy.py", line 127, in GetValue
    headnode.backend.execute(generator)
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/Backends/Base.py", line 135, in execute
    ranges = headnode.build_ranges()
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/HeadNode.py", line 307, in build_ranges
    clustersinfiles = Ranges.get_clusters(self.treename, self.inputfiles)
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/Ranges.py", line 220, in get_clusters
    entries = t.GetEntriesFast()
AttributeError: 'TObject' object has no attribute 'GetEntriesFast'
The error occurs because the current subtree is not found: the input chain has no name, so there is no tree name to look up in each file.
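One possible way to avoid relying on the chain's own name is to read the tree name of each element from the chain itself. The sketch below is only an illustration, not the DistRDF implementation: the helper name `get_tree_and_file_names` is hypothetical, and it assumes that each element returned by `TChain.GetListOfFiles()` is a `TChainElement` whose `GetName()` gives the tree name and whose `GetTitle()` gives the file name.

```python
def get_tree_and_file_names(chain):
    """Collect the per-file tree names of a TChain-like object.

    Hypothetical helper: assumes every element of chain.GetListOfFiles()
    exposes GetName() (tree name) and GetTitle() (file name), as
    TChainElement does in ROOT.
    """
    treenames = []
    filenames = []
    for element in chain.GetListOfFiles():
        treenames.append(str(element.GetName()))
        filenames.append(str(element.GetTitle()))
    return treenames, filenames
```

With the chain `c` from the session above, this would be expected to pair `entries` with `10entries.root` and `otherentries` with `other10entries.root`, so that each file can be opened with its own tree name rather than the (empty) chain name.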
Expected behavior
Distributed RDataFrame should support this use case.
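A minimal sketch of how the chain could be rebuilt inside a distributed task when each file carries its own tree name. This is an assumption about one possible fix, not the change that was actually made: `build_chain` and its `chain_factory` parameter are hypothetical names, and the sketch relies on the `file_name?#tree_name` form that `TChain::Add` documents for specifying the tree name per file.

```python
def build_chain(treenames, filenames, chain_factory=None):
    """Build a nameless chain in which each file is added with its own tree name.

    Hypothetical helper. `chain_factory` defaults to ROOT.TChain; it is a
    parameter only so the logic can be exercised without a ROOT installation.
    """
    if len(treenames) != len(filenames):
        raise ValueError("treenames and filenames must have the same length")

    if chain_factory is None:
        import ROOT  # imported lazily, only needed for the default factory
        chain_factory = ROOT.TChain

    chain = chain_factory()  # no name: tree names are given per file
    for treename, filename in zip(treenames, filenames):
        # "file?#tree" tells TChain::Add which tree to read from this file
        chain.Add(f"{filename}?#{treename}")
    return chain
```

Compared with `ROOT.TChain(treename)` followed by `chain.Add(str(f))`, this keeps one tree name per file, at the cost of having to ship the list of tree names to the task alongside the list of files.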
Setup
ROOT master built on Fedora 32