[ntuple] Add support for "friend ntuples" #6979
Build failed on ROOT-debian10-i386/cxx14. Errors:
Build failed on ROOT-fedora30/cxx14. Errors:
Build failed on ROOT-fedora31/noimt. Errors:
Build failed on mac11.0/cxx17. Errors:
Build failed on ROOT-ubuntu16/nortcxxmod. Errors:
Build failed on windows10/cxx14. Errors:
Build failed on mac1014/python3. Errors:
6f1e260 to 2212937
Build failed on mac11.0/cxx17. Failing tests:
[are we set on the (super-ambiguous, imho) "friend" terminology? some possible alternatives: horizontal stack, column-wise concat, field-wise concat, horizontal concat...]
I'm not sure if it lines up with the SQL term exactly but we could use "join" (column-join, field-join, etc.)
Is there a provision/design for the case where the entries in the "friend" are not aligned (either because the friend is missing entries, or has more entries and/or they are in a different order)?
@pcanal This should be ready for review. I think the way to later allow for unaligned friends is to combine friends with another virtual page source that gives access to the underlying page source through an entry list (or another mechanism for shuffling and skimming the original entries). I like @mxxo's suggestion of renaming friends to "joins" or joined ntuples. @eguiraud @Axel-Naumann what do you think?
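To make the entry-list idea concrete, here is a minimal sketch of a source that exposes the entries of an underlying source in the order (and subset) given by an entry list. The names `Column`, `EntryListSource`, `GetNEntries`, and `GetValue` are hypothetical and only stand in for a real page source; nothing here is taken from the PR.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a page source: one value per entry.
using Column = std::vector<double>;

// Presents the entries of an underlying column in the order (and subset)
// given by an entry list, so that it can line up with another ntuple.
class EntryListSource {
   Column fUnderlying;
   std::vector<std::size_t> fEntryList; // virtual entry -> physical entry

public:
   EntryListSource(Column underlying, std::vector<std::size_t> entryList)
      : fUnderlying(std::move(underlying)), fEntryList(std::move(entryList)) {}

   // Number of entries visible through the entry list (may be a skim).
   std::size_t GetNEntries() const { return fEntryList.size(); }

   // at() throws std::out_of_range for a bad virtual or physical index.
   double GetValue(std::size_t virtualEntry) const {
      return fUnderlying.at(fEntryList.at(virtualEntry));
   }
};
```

An entry list of `{3, 1}` over four physical entries would both skim (two entries remain) and shuffle (entry 3 comes first), which is the kind of realignment an unaligned friend would need before being combined horizontally.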
About the name of the feature, my two cents: I think "join" has the huge advantage that it is what users (both database-savvy and not) would instinctively search for in a lot of cases. "Joined ntuples" is a better wording if what you can do is basically a horizontal "paste". If you can actually build relations between two ntuples based on entry values, then it's basically an analog of the SQL "join" so even just "join" is not misleading.
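The difference between a horizontal "paste" and a value-based join can be sketched as follows. This is only an illustration of the two semantics being discussed, using plain vectors; `Entry`, `PasteByPosition`, and `JoinByKey` are made-up names, and the join assumes keys on the right-hand side are unique.

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical entry: a key (e.g. an event number) plus one payload value.
struct Entry {
   int key;
   double value;
};

using ValuePair = std::pair<double, double>;

// Horizontal "paste": entry i of left is combined with entry i of right.
// Only meaningful if both sides have the same entries in the same order.
std::vector<ValuePair> PasteByPosition(const std::vector<Entry> &left,
                                       const std::vector<Entry> &right) {
   if (left.size() != right.size())
      throw std::invalid_argument("entry counts differ");
   std::vector<ValuePair> out;
   out.reserve(left.size());
   for (std::size_t i = 0; i < left.size(); ++i)
      out.emplace_back(left[i].value, right[i].value);
   return out;
}

// Join on entry values: entries are matched by key, regardless of position.
// Left entries without a partner are dropped (inner-join behaviour).
std::vector<ValuePair> JoinByKey(const std::vector<Entry> &left,
                                 const std::vector<Entry> &right) {
   std::unordered_map<int, double> lookup;
   for (const auto &e : right)
      lookup.emplace(e.key, e.value); // keeps the first occurrence of a key
   std::vector<ValuePair> out;
   for (const auto &e : left) {
      auto it = lookup.find(e.key);
      if (it != lookup.end())
         out.emplace_back(e.value, it->second);
   }
   return out;
}
```

With identical ordering on both sides the two functions give the same result; they diverge as soon as one side is reordered or skimmed.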
FindNextClusterId and FindPrevClusterId to traverse clusters by entry number.
*/
// clang-format on
class RClusterDescriptorRange {
Do you mean the suffix 'Range'? If so, would you have some sort of range start and end that is different in each instance of RClusterDescriptorRange? (i.e. see std::span or even RNTupleClusterRange :) )
Maybe RClusterDescriptorIteratable and then, instead of GetClusterRange, something like make_iteratable or GetClusterIteratable? (Another alternative I can think of is to use `Collection` instead of `Iteratable`, but that might be over-promising.)
As discussed, I added a corresponding TODO because there are other *Range classes similar to RClusterDescriptorRange that should be renamed consistently.
Adds a new, virtual page source, RPageSourceFriends, that takes a list of other page sources in order to combine them horizontally. The friends page source constructs a virtual descriptor and maps field, column, and cluster IDs from virtual to physical ones and back.

Note that this PR introduces a change to the cluster semantics: clusters no longer need to cover all the columns for an event range; they can cover only a part of the columns (a shard). Columns that are linked (e.g. offset column and value column, columns belonging to the same field subtree) should still be part of only a single cluster.
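The virtual-to-physical ID mapping described above could look roughly like the following sketch. It is not the RPageSourceFriends implementation; `PhysicalId`, `IdBiMap`, and their methods are hypothetical names chosen for illustration, and the same pattern would apply to field, column, or cluster IDs.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical: an ID as seen inside one of the underlying page sources.
struct PhysicalId {
   std::size_t sourceIdx; // index of the underlying page source
   std::uint64_t id;      // ID local to that source

   bool operator<(const PhysicalId &other) const {
      return std::tie(sourceIdx, id) < std::tie(other.sourceIdx, other.id);
   }
};

// Bidirectional map between dense virtual IDs and (source, local ID) pairs.
class IdBiMap {
   std::vector<PhysicalId> fToPhysical;           // index == virtual ID
   std::map<PhysicalId, std::uint64_t> fToVirtual;

public:
   // Registers a physical ID and returns its virtual ID; registering the
   // same physical ID twice returns the previously assigned virtual ID.
   std::uint64_t Insert(std::size_t sourceIdx, std::uint64_t physicalId) {
      const PhysicalId key{sourceIdx, physicalId};
      auto it = fToVirtual.find(key);
      if (it != fToVirtual.end())
         return it->second;
      const std::uint64_t virtualId = fToPhysical.size();
      fToPhysical.push_back(key);
      fToVirtual.emplace(key, virtualId);
      return virtualId;
   }

   // Both lookups throw std::out_of_range for unknown IDs.
   PhysicalId GetPhysical(std::uint64_t virtualId) const {
      return fToPhysical.at(virtualId);
   }
   std::uint64_t GetVirtual(std::size_t sourceIdx, std::uint64_t physicalId) const {
      return fToVirtual.at(PhysicalId{sourceIdx, physicalId});
   }
};
```

Two sources that both use local column ID 0 end up with distinct virtual IDs, which is what allows a combined descriptor to expose them side by side without collisions.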
Remaining todos:
RPageSourceFriends::Clone()