[WIP][DF] Refactor DistRDF range creation #8391
vepadulano wants to merge 6 commits into root-project:master
Conversation
This pull request fixes 1 alert when merging e06e89b into 3d850fe - view on LGTM.com
e06e89b to 2c8f04a (Compare)
This pull request fixes 1 alert when merging 2c8f04a into 3f94894 - view on LGTM.com
In distributed RDataFrame, the head node of the RDF computation graph also stores information about the arguments the user passed to the RDF constructor, i.e. about the source of data. The current implementation has a single HeadNode class responsible for distinguishing between the two supported data sources: an empty RDF with sequential entries and a TTree/TChain-based RDF. This commit instead introduces multiple head node types, one per data source. In particular, there are now an EntriesHeadNode and a TreeHeadNode; in the future other data sources might be supported as well, for example through an RNTupleHeadNode. To decide which head node type is correct for a particular distributed dataframe, a Factory class is created with a get_headnode static method that parses the user-provided arguments to the RDataFrame constructor and returns the correct head node instance. It follows that each data source might have different information to send to the distributed resources. This is indeed the case, and it will be addressed in the next commit with the introduction of a different Range object per head node type.
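The dispatch described above could look roughly like the following minimal sketch. This is not the actual DistRDF implementation: the class bodies and the argument-parsing details are illustrative assumptions; only the names EntriesHeadNode, TreeHeadNode and get_headnode come from the commit message.

```python
# Hedged sketch of a factory choosing a head node type from the
# RDataFrame constructor arguments. Argument parsing is simplified.

class EntriesHeadNode:
    """Head node for an empty RDF with sequential entries."""
    def __init__(self, nentries):
        self.nentries = nentries

class TreeHeadNode:
    """Head node for a TTree/TChain-based RDF."""
    def __init__(self, *args):
        self.args = args  # e.g. (treename, filenames) or a TTree object

class HeadNodeFactory:
    @staticmethod
    def get_headnode(*args):
        """Parse user arguments and return the matching head node."""
        if len(args) == 1 and isinstance(args[0], int):
            # RDataFrame(nentries): empty data source
            return EntriesHeadNode(args[0])
        # Anything else is assumed here to describe a TTree/TChain
        return TreeHeadNode(*args)

print(type(HeadNodeFactory.get_headnode(100)).__name__)            # EntriesHeadNode
print(type(HeadNodeFactory.get_headnode("t", "f.root")).__name__)  # TreeHeadNode
```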
Each head node type has its own amount of information to send to the distributed workers. For example, if the RDataFrame is built from empty sequential entries, it is enough to send Range objects made of a pair of integers (start, end). When a TTree is involved, it has at least a name and a file, but might also carry information about friend trees. These differences are now reflected in different types of Range objects, i.e. EntriesRange and TreeRange for now. Furthermore, there is no need to host the logic that creates Ranges in a class, so it now lives in its own Python module as free functions. This makes it more modular and would also allow caching the Range objects for subsequent usage.
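A minimal sketch of the lightweight case: the EntriesRange namedtuple from the diff below, built by a free function rather than a class method. The chunk-balancing logic and the function name are illustrative assumptions.

```python
import collections

# EntriesRange mirrors the namedtuple in the PR diff; the splitting
# function is a hedged sketch of a "free function" in a Ranges module.
EntriesRange = collections.namedtuple("EntriesRange", ["start", "end"])

def get_balanced_ranges(nentries, npartitions):
    """Split [0, nentries) into npartitions contiguous EntriesRange objects."""
    chunk, rest = divmod(nentries, npartitions)
    ranges = []
    start = 0
    for i in range(npartitions):
        # Spread the remainder over the first `rest` partitions
        end = start + chunk + (1 if i < rest else 0)
        ranges.append(EntriesRange(start, end))
        start = end
    return ranges

print(get_balanced_ranges(10, 3))
# [EntriesRange(start=0, end=4), EntriesRange(start=4, end=7), EntriesRange(start=7, end=10)]
```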
The recently introduced utility functions in the `ROOT::Internal::TreeUtils` namespace allow us to avoid custom logic for retrieving information about the input TTree of the distributed RDataFrame. In particular:
1. TreeHeadNode.get_treename now relies on TreeUtils.GetTreeFullPaths (for both TTree and TChain). It returns a string, or raises an error if the tree name could not be found.
2. TreeHeadNode.get_inputfiles now relies on TreeUtils.GetFileNamesFromTree (for both TTree and TChain). It returns a list of strings, or raises an error if the input files could not be found.
3. TreeHeadNode.get_friendinfo now relies on TreeUtils.GetFriendInfo (for both TTree and TChain). If the user provided a TTree object as input, it returns three Python tuples (of Python strings, or tuples thereof), one per data member of the ROOT.Internal.TreeUtils.RFriendInfo object returned by GetFriendInfo. If the user did not provide a TTree instance, it returns a tuple of three None objects.
This means that the FriendInfo class of DistRDF can be removed. The gathered information about friend trees is then used to rebuild the TChain on the distributed workers. This should account for more complex friend tree structures like the one in root-project#7584.
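The contract of point 3 can be illustrated in pure Python. The FakeFriendInfo namedtuple below is a stand-in for ROOT.Internal.TreeUtils.RFriendInfo (its three field names come from the diff in this PR); the function body is a sketch of the flattening, not the actual DistRDF code.

```python
import collections

# Stand-in for ROOT.Internal.TreeUtils.RFriendInfo, for illustration only.
FakeFriendInfo = collections.namedtuple(
    "FakeFriendInfo", ["fFriendNames", "fFriendFileNames", "fFriendChainSubNames"])

def get_friendinfo(tree, treefriendinfo):
    """Flatten the three RFriendInfo data members into hashable tuples,
    or return (None, None, None) when there is no input tree."""
    if tree is None:
        return None, None, None
    friendnamesalias = tuple(tuple(p) for p in treefriendinfo.fFriendNames)
    friendfilenames = tuple(tuple(f) for f in treefriendinfo.fFriendFileNames)
    friendchainsubnames = tuple(
        tuple(s) for s in treefriendinfo.fFriendChainSubNames)
    return friendnamesalias, friendfilenames, friendchainsubnames

info = FakeFriendInfo(
    fFriendNames=[("friendtree", "alias")],
    fFriendFileNames=[["f1.root", "f2.root"]],
    fFriendChainSubNames=[["sub1", "sub2"]])
print(get_friendinfo("sometree", info))
print(get_friendinfo(None, None))  # (None, None, None)
```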
This commit shows a possible improvement to the startup time of a distributed RDF execution in Python sessions where the same dataset is queried multiple times through different RDF instances. This is done by caching the return values of the functions in the Ranges module. It still does not improve the initial startup time, which is limited by `TFile::Open` and can be quite high if the files of the dataset are far away from the client machine.
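The caching idea can be sketched with functools.lru_cache, which is what the diff below applies. The "expensive" file inspection is simulated with a counter here; the function name and body are illustrative assumptions.

```python
from functools import lru_cache

# Sketch: memoize an expensive range-building function so repeated
# queries on the same dataset in the same session skip the recomputation.
calls = {"count": 0}

@lru_cache(maxsize=None)
def get_clusters(treename, filenames):
    calls["count"] += 1  # stands in for TFile::Open + cluster iteration
    return tuple((f, treename) for f in filenames)

files = ("a.root", "b.root")   # tuples: lru_cache needs hashable arguments
get_clusters("events", files)
get_clusters("events", files)  # second call is served from the cache
print(calls["count"])          # 1
```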
2c8f04a to 275cd68 (Compare)
This pull request fixes 1 alert when merging 275cd68 into 3f94894 - view on LGTM.com
ROOT.RDataFrame(*args)

firstarg = args[0]
if isinstance(firstarg, int):
The lack of function overloading makes us do this kind of thing 😞 I guess when RNTuple comes the RDataFrame constructor will accept an RNTuple (so there won't be any ambiguity here).

There is no RDF constructor that takes an RNTuple; we have a factory function for each data source, e.g. MakeNTupleDataFrame(tupleName, fileName). How will construction of a distRDF+RNTuple look?

We can probably provide a similar factory function ROOT.RDF.Distributed.Spark.MakeNTupleDataFrame and from that dispatch to a head node type like RNTupleHeadNode. This would be in the spirit of needing minimal changes in user code that already uses MakeNTupleDataFrame.
    for chainsubnames in treefriendinfo.fFriendChainSubNames)
return friendnamesalias, friendfilenames, friendchainsubnames
else:
    return None, None, None
Do we need to return three None in a tuple, or would returning just None be enough?

This is to make it consistent with the other return type and to have a unique way to fill the attributes of the TreeRange object (done further down in TreeHeadNode.build_ranges, which in turn calls Ranges.get_clustered_ranges).

Mmm, but you either get the three or none, right? Perhaps it's because somewhere else this needs to be treated as a single object which is either a tuple or None, instead of three different objects?

In the end these three tuples need to be emplaced in the namedtuple TreeRange, which is defined as

TreeRange = collections.namedtuple(
    "TreeRange", ["start", "end", "treename", "treefilenames", "friendnamesalias", "friendfilenames", "friendchainsubnames", "defaultbranches"])

So in particular the attributes "friendnamesalias", "friendfilenames", "friendchainsubnames" are already three distinct objects; we never use the return value of this function as a single tuple.

Food for thought: wouldn't it be better to group those as "friend info" in the TreeRange, instead of making it completely flat? 😄 Same for start and end, probably.

I will incorporate this suggestion in another commit.
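The "food for thought" above could be sketched like this: nested namedtuples instead of a flat TreeRange. The exact grouping and field names are assumptions for illustration, not the design that was merged.

```python
import collections

# Group friend-related fields (and start/end) instead of keeping
# TreeRange flat, as suggested in the review thread above.
FriendInfo = collections.namedtuple(
    "FriendInfo", ["friendnamesalias", "friendfilenames", "friendchainsubnames"])
EntryRange = collections.namedtuple("EntryRange", ["start", "end"])
TreeRange = collections.namedtuple(
    "TreeRange",
    ["entryrange", "treename", "treefilenames", "friendinfo", "defaultbranches"])

r = TreeRange(
    entryrange=EntryRange(0, 100),
    treename="events",
    treefilenames=("f.root",),
    friendinfo=FriendInfo((), (), ()),  # empty tuples instead of (None, None, None)
    defaultbranches=())
print(r.entryrange.start, r.friendinfo.friendfilenames)  # 0 ()
```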
return clusters

@lru_cache(maxsize=None)
So this caches the result of the function given the arguments, right?
Should we set a boundary on the number of distinct calls that are cached, just in case?
The possible problem with this is if the content of the input files changes during the execution (which I guess is unlikely).

> So this caches the result of the function given the arguments, right?

Correct, and the arguments all need to be hashable (that's why where we previously had lists there are now tuples).

> Should we set a boundary on the number of distinct calls that are cached, just in case?

It can be done, sure. I just don't know what the magic number is ahah.

> The possible problem with this is if the content of the input files changes during the execution (which I guess is unlikely).

This is actually something I wanted to explore. In theory we could work around it if the TFile fUUID member does what I think it does, i.e. associates a unique id to the same filename each time it gets changed.

I would just leave the default for the maxsize.

I agree with Enric, it's weird that we are saying "we are ok with using an unbounded amount of memory".
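The two points raised in this thread can be demonstrated in isolation: a finite maxsize bounds memory via least-recently-used eviction, and unhashable arguments (lists) are rejected. The function below is a toy stand-in, not DistRDF code.

```python
from functools import lru_cache

# maxsize=2 bounds the cache: a third distinct call evicts the LRU entry.
@lru_cache(maxsize=2)
def build_ranges(filenames):
    return len(filenames)

build_ranges(("a.root",))
build_ranges(("a.root", "b.root"))
build_ranges(("c.root",))                  # evicts the ("a.root",) entry
print(build_ranges.cache_info().currsize)  # 2

try:
    build_ranges(["a.root"])               # lists are unhashable
except TypeError:
    print("lists cannot be cache keys")
```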
etejedor left a comment:
Looks good in general, I added a few comments!
eguiraud left a comment:
I left some comments; there are a couple of things that look possibly wrong, the rest is just minor stuff.
EntriesRange = collections.namedtuple("EntriesRange", ["start", "end"])
TreeRange = collections.namedtuple(
    "TreeRange", ["start", "end", "treename", "treefilenames", "friendnames", "friendfilenames"])
this should probably be its own type, for documentation/lookup purposes: get_clustered_ranges says it returns a list[namedtuple], but it would be better if it said that it returns a list[TreeRange]

Ok, point taken. Probably deserves a PR of its own just for this.
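One way to realize the suggestion above is typing.NamedTuple, which gives the tuple a proper named type that signatures and docs can reference. The field set and the toy function body are assumptions; only the names TreeRange and get_clustered_ranges come from the discussion.

```python
from typing import List, NamedTuple, Tuple

# A proper class instead of an anonymous collections.namedtuple, so that
# annotations can say List[TreeRange] rather than list[namedtuple].
class TreeRange(NamedTuple):
    start: int
    end: int
    treename: str
    treefilenames: Tuple[str, ...]

def get_clustered_ranges(treename: str,
                         filenames: Tuple[str, ...]) -> List[TreeRange]:
    """Toy version: one TreeRange per file, to show the annotated return type."""
    return [TreeRange(0, 0, treename, (f,)) for f in filenames]

ranges = get_clustered_ranges("events", ("f1.root", "f2.root"))
print(type(ranges[0]).__name__, len(ranges))  # TreeRange 2
```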
defaultbranches, # type: list[str]
) # type: collections.namedtuple
for clusters in _n_even_chunks(clustersinfiles, npartitions)
]
not for this PR, but this is way too terse, I have no idea what's going on here 😅

Ok, we'll see what we can do about that in another PR.
After reading the comments and some discussion I decided it's better to split this PR. I will address the comments in the respective PRs.
fixes #7584
This PR shows a possible refactor of the logic that finally creates the ranges to send to the distributed resources. It works in the following steps:
- use the `ROOT::Internal::TreeUtils` functions
- cache the Ranges for reuse in the same Python session. This still doesn't improve the initial startup time discussed in Improve startup time of distributed RDataFrame application #8232