Support TChain with different subtrees in distributed RDF #8858
Merged
vepadulano merged 2 commits into root-project:master on Aug 20, 2021
Build failed on ROOT-debian10-i386/cxx14.
etejedor approved these changes on Aug 19, 2021
For a distributed RDF taking a TTree-based dataset, the `TreeHeadNode` class stores the dataset internally as either a `TTree` or a `TChain`. Previously the dataset was assumed to have a unique name, so for a `TChain` the whole chain was assumed to share one name across all its trees. This is not true in general: one can build a chain (with or without a main name) and then add multiple trees, each with its own name. This commit supports this use case by introducing a `maintreename` attribute, which corresponds either to the name of the `TTree` or to the main name passed to the `TChain` constructor, and a `subtreenames` attribute that holds a vector of names: if the dataset is a `TTree`, it holds just the name of the tree in the file (including the subdirectory structure); if it is a `TChain`, it holds the names of all the subtrees. The new information needs to be passed to the distributed workers, so it has also been made available in the `TreeRange` class and in the other parts of the logic that build ranges.
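The naming scheme described in this commit message can be sketched in plain Python, so that it runs without ROOT. Only the attribute names `maintreename` and `subtreenames` come from the commit message; the `DatasetNames` and `RangeSketch` classes and the `ranges_for_workers` helper are hypothetical illustrations, not the actual `TreeHeadNode` or `TreeRange` code. The sketch also assumes one subtree name per file, which may not match how the real code pairs them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetNames:
    """Hypothetical stand-in for the naming info kept by the head node."""
    maintreename: str        # TTree name, or main name given to the TChain (may be empty)
    subtreenames: List[str]  # a single name for a TTree; one per subtree for a TChain
    filenames: List[str]     # assumption: filenames[i] contains subtreenames[i]


@dataclass
class RangeSketch:
    """Hypothetical stand-in for what a worker needs to rebuild its part of the chain."""
    maintreename: str
    subtreenames: List[str]
    filenames: List[str]


def ranges_for_workers(names: DatasetNames, npartitions: int) -> List[RangeSketch]:
    """Split the files into contiguous groups, keeping each file paired
    with its own subtree name instead of one name for the whole chain."""
    if len(names.subtreenames) != len(names.filenames):
        raise ValueError("expected one subtree name per file")
    pairs = list(zip(names.subtreenames, names.filenames))
    npartitions = max(1, min(npartitions, len(pairs)))
    base, extra = divmod(len(pairs), npartitions)
    ranges, start = [], 0
    for i in range(npartitions):
        # The first `extra` partitions take one more file than the others.
        stop = start + base + (1 if i < extra else 0)
        chunk = pairs[start:stop]
        ranges.append(RangeSketch(names.maintreename,
                                  [tree for tree, _ in chunk],
                                  [fname for _, fname in chunk]))
        start = stop
    return ranges


# A chain without a main name whose subtrees all have different names.
chain_names = DatasetNames(
    maintreename="",
    subtreenames=["tree_a", "subdir/tree_b", "tree_c"],
    filenames=["f1.root", "f2.root", "f3.root"],
)
worker_ranges = ranges_for_workers(chain_names, npartitions=2)
```

Each `RangeSketch` carries the subtree names alongside the file names, so a worker could rebuild its part of the chain without assuming a single tree name.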
Force-pushed from 27f1400 to 499d2d0
Build failed on ROOT-debian10-i386/cxx14.
Fixes #8750
To support this use case we also need to send the distributed workers the names of the subtrees in the main chain. At this point we might want to rework data structures such as `ChainCluster` and `FileAndIndex`. I would also like to make the function `get_clusters` return less redundant information: currently each cluster also reports the name of the file, the name of the tree and the number of entries, which are all the same for clusters belonging to the same file. These improvements are left for the next PR.
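To make the redundancy concrete, here is one possible shape for a less repetitive cluster listing, as a rough sketch only. The `FlatCluster` and `FileClusters` types and the `group_by_file` helper are hypothetical: the field layout is guessed from the description above and is not the actual return value of `get_clusters`, nor a design decided in this PR.

```python
from typing import Dict, List, NamedTuple, Tuple


class FlatCluster(NamedTuple):
    """Hypothetical per-cluster record that repeats per-file information."""
    start: int
    end: int
    filename: str
    treename: str
    nentries: int


class FileClusters(NamedTuple):
    """Hypothetical per-file record: shared info stored once, boundaries listed."""
    treename: str
    nentries: int
    boundaries: List[Tuple[int, int]]


def group_by_file(clusters: List[FlatCluster]) -> Dict[str, FileClusters]:
    """Store tree name and entry count once per file, keeping only the
    (start, end) boundaries for each cluster."""
    grouped: Dict[str, FileClusters] = {}
    for c in clusters:
        entry = grouped.setdefault(
            c.filename, FileClusters(c.treename, c.nentries, []))
        entry.boundaries.append((c.start, c.end))
    return grouped


flat = [
    FlatCluster(0, 100, "f1.root", "tree_a", 200),
    FlatCluster(100, 200, "f1.root", "tree_a", 200),
    FlatCluster(0, 50, "f2.root", "tree_b", 50),
]
per_file = group_by_file(flat)
```

Whether grouping by file is the right key is an open question here: it assumes a file contributes a single tree to the chain.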