Support TChain with different subtrees in distributed RDF #8858

Merged
vepadulano merged 2 commits into root-project:master from vepadulano:distrdf-tchain-subtrees
Aug 20, 2021
@vepadulano
Member

Fixes #8750

To support this use case we also need to send the distributed workers the names of the subtrees in the main chain. At this point we might want to rework data structures such as `ChainCluster` and `FileAndIndex`. I would also like to make the function `get_clusters` return less redundant information: currently each cluster also reports the file name, the tree name and the number of entries, all of which are identical for clusters belonging to the same file. These improvements are left for the next PR.
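
The redundancy mentioned above can be pictured with a minimal sketch in plain Python. The `Cluster` record, its field names and the `group_by_file` helper are hypothetical and are not the actual `ChainCluster`, `FileAndIndex` or `get_clusters` definitions; the sketch only shows one possible way to store the per-file information once.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Cluster(NamedTuple):
    """Hypothetical flat record: the file-level fields repeat on every cluster."""
    start: int
    end: int
    filename: str
    treename: str
    treenentries: int


def group_by_file(
    clusters: List[Cluster],
) -> Dict[Tuple[str, str, int], List[Tuple[int, int]]]:
    # Store file name, tree name and entry count once per file,
    # keeping only the (start, end) boundaries for each cluster.
    grouped = defaultdict(list)
    for c in clusters:
        grouped[(c.filename, c.treename, c.treenentries)].append((c.start, c.end))
    return dict(grouped)


flat = [
    Cluster(0, 100, "file_1.root", "tree_a", 200),
    Cluster(100, 200, "file_1.root", "tree_a", 200),
    Cluster(0, 50, "file_2.root", "tree_b", 50),
]
grouped = group_by_file(flat)
```

In this layout the three flat records collapse into two per-file entries, each carrying only its cluster boundaries.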

@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on ROOT-debian10-i386/cxx14.
Running on pcepsft10.dyndns.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-08-19T08:52:33.319Z] stderr: error: could not read '.git/rebase-apply/head-name': No such file or directory

For a distributed RDF taking a TTree-based dataset, the `TreeHeadNode` class stores the dataset internally as either a `TTree` or a
`TChain`. Previously the dataset was assumed to have a single name, so for a `TChain` every tree in the chain was assumed to share
that name. This is not true in general: one can build a chain (with or without a main name) and then add multiple trees, each with
its own name. This commit supports this use case by introducing a `maintreename` attribute, which corresponds either to the name of
the `TTree` or to the main name passed to the `TChain` constructor, and a `subtreenames` attribute that holds a vector of names: if
the dataset is a `TTree`, it holds just the name of the tree in the file (including the subdirectory structure); if it is a
`TChain`, it holds the names of all the subtrees.

The new information must reach the distributed workers, so it is also available in the `TreeRange` class and in the other parts of
the logic that build ranges.
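
A minimal sketch of the naming scheme described above, written in plain Python so that it runs without ROOT. The `DatasetNames` container and the `names_for_tree` and `names_for_chain` helpers are hypothetical stand-ins, not the actual `TreeHeadNode` code; only the attribute names `maintreename` and `subtreenames` come from the commit message, and the list of `(treename, filename)` pairs is an assumed stand-in for the contents of a chain.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DatasetNames:
    """Hypothetical container mirroring the two attributes named in the commit."""
    maintreename: str        # name of the TTree, or main name given to the TChain
    subtreenames: List[str]  # one tree name per file in the dataset


def names_for_tree(name: str, path_in_file: str) -> DatasetNames:
    # Single TTree: the only subtree name is the tree's path inside its file,
    # including any subdirectory.
    return DatasetNames(maintreename=name, subtreenames=[path_in_file])


def names_for_chain(mainname: str, elements: List[Tuple[str, str]]) -> DatasetNames:
    # `elements` stands in for the chain's (treename, filename) pairs;
    # the main name may be empty.
    return DatasetNames(
        maintreename=mainname,
        subtreenames=[treename for treename, _ in elements],
    )


single = names_for_tree("events", "subdir/events")
chain = names_for_chain("", [("tree_a", "file_1.root"), ("tree_b", "file_2.root")])
```

With a real `TChain`, the per-file tree names can be read from the `TChainElement` objects returned by `GetListOfFiles()`, where the element name is the tree name and the element title is the file name.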

@vepadulano vepadulano force-pushed the distrdf-tchain-subtrees branch from 27f1400 to 499d2d0 August 20, 2021 10:50

@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on ROOT-debian10-i386/cxx14.
Running on pcepsft10.dyndns.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-08-20T11:21:59.066Z] stderr: error: could not read '.git/rebase-apply/head-name': No such file or directory

@vepadulano vepadulano merged commit 294bc76 into root-project:master Aug 20, 2021
@vepadulano vepadulano deleted the distrdf-tchain-subtrees branch October 30, 2021 20:56
Development

Successfully merging this pull request may close these issues.

Support chains with subtrees with different names in distributed RDF

3 participants