Support chains with subtrees with different names in distributed RDF #8750

@vepadulano

Describe the bug

The current logic that constructs the TChain passed to the RDF constructor inside a distributed task is:

chain = ROOT.TChain(treename)
for f in current_range.filelist:
    chain.Add(str(f))

This is too simple: it uses the same tree name for every file, so it doesn't account for the common use case of a TChain with no name and subtrees with different names:

>>> import ROOT
>>> RDF = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
>>> c = ROOT.TChain()
>>> c.Add("10entries.root/entries")
1
>>> c.Add("other10entries.root/otherentries")
1
>>> df = RDF(c)
>>> df.Count().GetValue()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/Proxy.py", line 127, in GetValue
    headnode.backend.execute(generator)
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/Backends/Base.py", line 135, in execute
    ranges = headnode.build_ranges()
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/HeadNode.py", line 307, in build_ranges
    clustersinfiles = Ranges.get_clusters(self.treename, self.inputfiles)
  File "/home/vpadulan/Programs/rootproject/rootbuild/devdebugtest/lib/DistRDF/Ranges.py", line 220, in get_clusters
    entries = t.GetEntriesFast()
AttributeError: 'TObject' object has no attribute 'GetEntriesFast'

The error occurs because the current subtree is not found: the input chain has no name, so the lookup by tree name does not return a TTree.
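
To make the mismatch concrete, here is a minimal sketch of reading the per-file tree names back from a chain instead of relying on the chain's own name. It assumes the elements returned by `chain.GetListOfFiles()` follow the TChainElement convention, where `GetName()` gives the tree name and `GetTitle()` gives the file name. The helper name `subtree_names` is hypothetical and not part of DistRDF.

```python
def subtree_names(chain):
    """Return (filename, treename) pairs, one per file in the chain.

    Assumption: each element of chain.GetListOfFiles() behaves like a
    TChainElement, i.e. GetName() is the tree name and GetTitle() is
    the file name.
    """
    return [(str(elem.GetTitle()), str(elem.GetName()))
            for elem in chain.GetListOfFiles()]
```

Called on the chain from the reproducer, each pair would carry its own tree name rather than the chain's empty name.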

Expected behavior

Distributed RDataFrame should support this use case.
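
One possible direction, sketched under assumptions rather than taken from the DistRDF code: build the chain in the task without a name and add each file together with its own tree name, using the same "file.root/treename" form as the reproducer. The function `build_chain` and its parameters are hypothetical. `chain_factory` stands in for `ROOT.TChain` so that the sketch does not itself depend on ROOT.

```python
def build_chain(chain_factory, filenames, treenames):
    """Build a nameless chain where every file brings its own tree name.

    chain_factory: callable returning an object with an Add(str) method
                   (assumed to be ROOT.TChain in real use).
    filenames, treenames: sequences of equal length, paired by position.
    """
    if len(filenames) != len(treenames):
        raise ValueError("need exactly one tree name per file")

    chain = chain_factory()
    for fname, tname in zip(filenames, treenames):
        # Same "file.root/treename" form as in the reproducer.
        chain.Add(f"{fname}/{tname}")
    return chain
```

With ROOT available, this would be called as build_chain(ROOT.TChain, ["10entries.root", "other10entries.root"], ["entries", "otherentries"]). Requiring one tree name per file keeps the pairing explicit and fails early if the two lists drift apart.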

Setup

ROOT master built on Fedora 32
