Skip to content

[WIP][DF] Refactor DistRDF range creation#8391

Closed
vepadulano wants to merge 6 commits intoroot-project:masterfrom
vepadulano:distrdf-refactor
Closed

[WIP][DF] Refactor DistRDF range creation#8391
vepadulano wants to merge 6 commits intoroot-project:masterfrom
vepadulano:distrdf-refactor

Conversation

@vepadulano
Copy link
Copy Markdown
Member

fixes #7584

This PR shows a possible refactor of the logic that finally creates the ranges to send to the distributed resources. It works in the following steps:

  1. Split the big HeadNode class in differente head node types according to the original data source (e.g. EntriesHeadNode, TreeHeadNode, in the future also RNTupleHeadNode). Use a factory to get the correct head node type according to user provided arguments to the RDataFrame constructor
  2. Create a different Range type per each head node type. This makes the passing of information more modular, allowing sending only a couple of integers in the case of empty RDF , or adding info about friend trees in the case of a tree based RDF
  3. Better support friends with the recently introduced ROOT::Internal::TreeUtils functions
  4. NEW: Cache the created Ranges for reuse in the same python session. This still doesn't improve the initial startup time discussed in Improve startup time of distributed RDataFrame application #8232

@vepadulano vepadulano self-assigned this Jun 10, 2021
@vepadulano vepadulano requested a review from etejedor as a code owner June 10, 2021 07:21
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 10, 2021

This pull request fixes 1 alert when merging e06e89b into 3d850fe - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 10, 2021

This pull request fixes 1 alert when merging 2c8f04a into 3f94894 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

In distributed RDataFrame, the head node of the RDF computation graph also stores information about the user provided arguments to the RDF constructor, i.e. information about the source of data.
The current implementation has only one HeadNode class responsible for distinguishing between the two supported data sources, i.e. empty RDF with sequential entries and TTree/TChain based RDF.
This commit introduces instead multiple head node types, one per data source. In particular, there is now a EntriesHeadNode and a TreeHeadNode but in the future also other data sources might be supported for example a RNtupleHeadNode.
To decide which is the correct type of head node for a particular distributed dataframe, a Factory class is created with the get_headnode static method that parses the user provided arguments to the RDataFrame constructor and returns the correct head node instance.

It would follow that each data source might have different information to send to the distributed resources. This is the case, and it will be addressed in the next commit with the introduction of a different Range object per each head node type
Each head node type has its own amount of information to send to the distributed workers. For example, if the RDataFrame is just building on some empty entries, it is enough to send Range objects made of a pair of integers (start, end). When a TTree is involved, it has at least a name and a file but might also have information about friend trees.
These differences are now reflected in different types of Range objects, i.e. EntriesRange and TreeRange for now. Furthermore, there is no need to host the logic to create Ranges in a class. So it is now hosted on its own Python module with free functions. This makes it more modular and would possibly also allow for caching the Range objects for subsequent usages.
The recently introduced utility functions in `ROOT::Internal::TreeUtils` namespace allow us to avoid having our custom logic to retrieve information about the input TTree of the distributed RDataFrame.
In particular,
1. TreeHeadNode.get_treename function now relies on TreeUtils.GetTreeFullPaths (for both TTree and TChain). It returns a string or raises an error if the tree name could not be found.
2. TreeHeadNode.get_inputfiles function now relies on TreeUtils.GetFileNamesFromTree (for both TTree and TChain). It returns a list of strings or raises an error if the input files could not be found.
3. TreeHeadNode.get_friendinfo function now relies on TreeUtils.GetFriendInfo (for both TTree and TChain). If the user provided a TTree object as input, it returns three Python tuples (of Python strings or tuples thereof), one per each data member of the ROOT.Internal.TreeUtils.RFriendInfo object returned by GetFriendInfo. If the user did not provide a TTree instance, it returns a tuple of three None objects. This means that the FriendInfo class of DistRDF can be removed.

The gathered information about friend trees is then used to rebuild the TChain on the distributed workers. This should take into account more complex friend trees structures like in root-project#7584
This commit shows a possible improvement to the startup time of a distributed RDF execution in python sessions where the same dataset is queried multiple times in different RDF instances. This is done through caching the return values of the functions in the Ranges module.
This still does not improve the initial startup time, which is limited by `TFile::Open` and can be quite high if the files of the dataset are far away from the client machine.
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 10, 2021

This pull request fixes 1 alert when merging 275cd68 into 3f94894 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

ROOT.RDataFrame(*args)

firstarg = args[0]
if isinstance(firstarg, int):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lack of function overloading makes us do this kind of things 😞 I guess when RNtuple comes the RDataFrame constructor will accept an RNtuple (so there won't be any ambiguity here)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no RDF constructor that takes an RNTuple, we have a factory function for each data-source, e.g. MakeNTupleDataFrame(tupleName, fileName). How will construction of a distRDF+RNTuple look like?

Copy link
Copy Markdown
Member Author

@vepadulano vepadulano Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably provide a similar factory function ROOT.RDF.Distributed.Spark.MakeNTupleDataFrame and from that we can dispatch to a headnode type like RNTupleHeadNode. This would be in the spirit of needing minimal changes in user code that already uses MakeNTupleDataFrame

for chainsubnames in treefriendinfo.fFriendChainSubNames)
return friendnamesalias, friendfilenames, friendchainsubnames
else:
return None, None, None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return three None in a tuple or just returning None should be enough?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make it consistent with the other return type and to have a unique way in which to fill the attributes of the TreeRange object (done further down in TreeHeadNode.build_ranges that in turn calls Ranges.get_clustered_ranges)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm but you either get the three or none, right? Perhaps it's because somewhere else this needs to be treated as a single object which is either a tuple or None, instead of three different objects?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end these three tuples need to be emplaced in the namedtuple TreeRange which is defined as

TreeRange = collections.namedtuple(
    "TreeRange", ["start", "end", "treename", "treefilenames", "friendnamesalias", "friendfilenames", "friendchainsubnames", "defaultbranches"])

So in particular the attributes "friendnamesalias", "friendfilenames", "friendchainsubnames" are already three distinct objects, so we never use the return value of this function as a single tuple

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Food for thought: wouldn't it be better to actually group those as "friend info" in the TreeRange, instead of making it completely flat? 😄 Same for start and end, probably.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will incorporate this suggestion in another commit

return clusters


@lru_cache(maxsize=None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this caches the result of the function given the arguments, right?
Should we set a boundary in the number of distinct calls that are cached, just in case?

The possible problem with this is if the content of the input files changes during the execution (which I guess is unlikely).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this caches the result of the function given the arguments, right?

Correct, and the arguments all need to be hashable (that's why where we previously had lists now there are tuples)

Should we set a boundary in the number of distinct calls that are cached, just in case?

It can be done sure. I just don't know what's the magic number ahah

The possible problem with this is if the content of the input files changes during the execution (which I guess is unlikely).

This is actually something I wanted to explore. In theory we could work around this if the TFile fUUID member does what I think it does, i.e. associates a unique id to the same filename each time it gets changed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just leave the default for the maxsize.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Enric, it's weird that we are saying "we are ok with using an unbounded amount of memory"

Copy link
Copy Markdown
Contributor

@etejedor etejedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, I added a few comments!

Copy link
Copy Markdown
Contributor

@eguiraud eguiraud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, there are a couple of things that look possibly wrong, the rest is just minor stuff

ROOT.RDataFrame(*args)

firstarg = args[0]
if isinstance(firstarg, int):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no RDF constructor that takes an RNTuple, we have a factory function for each data-source, e.g. MakeNTupleDataFrame(tupleName, fileName). How will construction of a distRDF+RNTuple look like?


EntriesRange = collections.namedtuple("EntriesRange", ["start", "end"])
TreeRange = collections.namedtuple(
"TreeRange", ["start", "end", "treename", "treefilenames", "friendnames", "friendfilenames"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be its own type, for documentation/lookup purposes: get_clustered_ranges says it returns a list[namedtuple], but it would be better if it said that it returns a list[TreeRange]

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, point taken. Probably deserves a PR of its own just for this

defaultbranches, # type: list[str]
) # type: collections.namedtuple
for clusters in _n_even_chunks(clustersinfiles, npartitions)
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for this PR, but this is way too terse, I have no idea what's going on here 😅

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we'll see what we can do about that in another PR

return clusters


@lru_cache(maxsize=None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Enric, it's weird that we are saying "we are ok with using an unbounded amount of memory"

@vepadulano
Copy link
Copy Markdown
Member Author

After reading the comments and some discussion I decided it's better to split this PR. I will address the comments in the respective PR.

@vepadulano
Copy link
Copy Markdown
Member Author

Items [1-3] of this PR have been taken care of, respectively in #8485, #8534, #8605 . Item 4 will not be addressed at this moment, we can reevaluate it at a later point.

@vepadulano vepadulano closed this Aug 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DF] Distributed RDataFrame doesn't handle friend trees correctly

4 participants