[WIP][DF] Refactor DistRDF range creation #8391
vepadulano wants to merge 6 commits into root-project:master
Conversation
This pull request fixes 1 alert when merging e06e89b into 3d850fe - view on LGTM.com
e06e89b to 2c8f04a (Compare)
This pull request fixes 1 alert when merging 2c8f04a into 3f94894 - view on LGTM.com
In distributed RDataFrame, the head node of the RDF computation graph also stores information about the arguments the user passed to the RDF constructor, i.e. about the source of data. The current implementation has a single HeadNode class responsible for distinguishing between the two supported data sources: an empty RDF with sequential entries and a TTree/TChain-based RDF. This commit instead introduces multiple head node types, one per data source. In particular, there are now an EntriesHeadNode and a TreeHeadNode; in the future other data sources might be supported as well, for example through an RNTupleHeadNode. To decide which head node type is correct for a particular distributed dataframe, a Factory class is created with a get_headnode static method that parses the user-provided arguments to the RDataFrame constructor and returns the correct head node instance. It follows that each data source might have different information to send to the distributed resources. This is indeed the case, and it will be addressed in the next commit with the introduction of a different Range object per head node type.
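The dispatch described above could look roughly like the following minimal sketch. This is not the actual DistRDF implementation: the class bodies and the argument-parsing details are illustrative assumptions; only the names EntriesHeadNode, TreeHeadNode and get_headnode come from the commit message.

```python
# Hedged sketch of a factory choosing a head node type from the
# RDataFrame constructor arguments. Argument parsing is simplified.

class EntriesHeadNode:
    """Head node for an empty RDF with sequential entries."""
    def __init__(self, nentries):
        self.nentries = nentries

class TreeHeadNode:
    """Head node for a TTree/TChain-based RDF."""
    def __init__(self, *args):
        self.args = args  # e.g. (treename, filenames) or a TTree object

class HeadNodeFactory:
    @staticmethod
    def get_headnode(*args):
        """Parse user arguments and return the matching head node."""
        if len(args) == 1 and isinstance(args[0], int):
            # RDataFrame(nentries): empty data source
            return EntriesHeadNode(args[0])
        # Anything else is assumed here to describe a TTree/TChain
        return TreeHeadNode(*args)

print(type(HeadNodeFactory.get_headnode(100)).__name__)            # EntriesHeadNode
print(type(HeadNodeFactory.get_headnode("t", "f.root")).__name__)  # TreeHeadNode
```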
Each head node type has its own amount of information to send to the distributed workers. For example, if the RDataFrame is built from empty sequential entries, it is enough to send Range objects made of a pair of integers (start, end). When a TTree is involved, it has at least a name and a file, but might also carry information about friend trees. These differences are now reflected in different types of Range objects, i.e. EntriesRange and TreeRange for now. Furthermore, there is no need to host the logic that creates Ranges in a class, so it now lives in its own Python module as free functions. This makes it more modular and would also allow caching the Range objects for subsequent usage.
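A minimal sketch of the lightweight case: the EntriesRange namedtuple from the diff below, built by a free function rather than a class method. The chunk-balancing logic and the function name are illustrative assumptions.

```python
import collections

# EntriesRange mirrors the namedtuple in the PR diff; the splitting
# function is a hedged sketch of a "free function" in a Ranges module.
EntriesRange = collections.namedtuple("EntriesRange", ["start", "end"])

def get_balanced_ranges(nentries, npartitions):
    """Split [0, nentries) into npartitions contiguous EntriesRange objects."""
    chunk, rest = divmod(nentries, npartitions)
    ranges = []
    start = 0
    for i in range(npartitions):
        # Spread the remainder over the first `rest` partitions
        end = start + chunk + (1 if i < rest else 0)
        ranges.append(EntriesRange(start, end))
        start = end
    return ranges

print(get_balanced_ranges(10, 3))
# [EntriesRange(start=0, end=4), EntriesRange(start=4, end=7), EntriesRange(start=7, end=10)]
```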
The recently introduced utility functions in the `ROOT::Internal::TreeUtils` namespace allow us to avoid custom logic for retrieving information about the input TTree of the distributed RDataFrame. In particular:
1. TreeHeadNode.get_treename now relies on TreeUtils.GetTreeFullPaths (for both TTree and TChain). It returns a string, or raises an error if the tree name could not be found.
2. TreeHeadNode.get_inputfiles now relies on TreeUtils.GetFileNamesFromTree (for both TTree and TChain). It returns a list of strings, or raises an error if the input files could not be found.
3. TreeHeadNode.get_friendinfo now relies on TreeUtils.GetFriendInfo (for both TTree and TChain). If the user provided a TTree object as input, it returns three Python tuples (of Python strings, or tuples thereof), one per data member of the ROOT.Internal.TreeUtils.RFriendInfo object returned by GetFriendInfo. If the user did not provide a TTree instance, it returns a tuple of three None objects.
This means that the FriendInfo class of DistRDF can be removed. The gathered information about friend trees is then used to rebuild the TChain on the distributed workers. This should account for more complex friend tree structures like the one in root-project#7584.
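The contract of point 3 can be illustrated in pure Python. The FakeFriendInfo namedtuple below is a stand-in for ROOT.Internal.TreeUtils.RFriendInfo (its three field names come from the diff in this PR); the function body is a sketch of the flattening, not the actual DistRDF code.

```python
import collections

# Stand-in for ROOT.Internal.TreeUtils.RFriendInfo, for illustration only.
FakeFriendInfo = collections.namedtuple(
    "FakeFriendInfo", ["fFriendNames", "fFriendFileNames", "fFriendChainSubNames"])

def get_friendinfo(tree, treefriendinfo):
    """Flatten the three RFriendInfo data members into hashable tuples,
    or return (None, None, None) when there is no input tree."""
    if tree is None:
        return None, None, None
    friendnamesalias = tuple(tuple(p) for p in treefriendinfo.fFriendNames)
    friendfilenames = tuple(tuple(f) for f in treefriendinfo.fFriendFileNames)
    friendchainsubnames = tuple(
        tuple(s) for s in treefriendinfo.fFriendChainSubNames)
    return friendnamesalias, friendfilenames, friendchainsubnames

info = FakeFriendInfo(
    fFriendNames=[("friendtree", "alias")],
    fFriendFileNames=[["f1.root", "f2.root"]],
    fFriendChainSubNames=[["sub1", "sub2"]])
print(get_friendinfo("sometree", info))
print(get_friendinfo(None, None))  # (None, None, None)
```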
This commit shows a possible improvement to the startup time of a distributed RDF execution in Python sessions where the same dataset is queried multiple times through different RDF instances. This is done by caching the return values of the functions in the Ranges module. It still does not improve the initial startup time, which is limited by `TFile::Open` and can be quite high if the files of the dataset are far away from the client machine.
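The caching idea can be sketched with functools.lru_cache, which is what the diff below applies. The "expensive" file inspection is simulated with a counter here; the function name and body are illustrative assumptions.

```python
from functools import lru_cache

# Sketch: memoize an expensive range-building function so repeated
# queries on the same dataset in the same session skip the recomputation.
calls = {"count": 0}

@lru_cache(maxsize=None)
def get_clusters(treename, filenames):
    calls["count"] += 1  # stands in for TFile::Open + cluster iteration
    return tuple((f, treename) for f in filenames)

files = ("a.root", "b.root")   # tuples: lru_cache needs hashable arguments
get_clusters("events", files)
get_clusters("events", files)  # second call is served from the cache
print(calls["count"])          # 1
```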
2c8f04a to 275cd68 (Compare)
This pull request fixes 1 alert when merging 275cd68 into 3f94894 - view on LGTM.com
ROOT.RDataFrame(*args)

firstarg = args[0]
if isinstance(firstarg, int):
The lack of function overloading makes us do this kind of thing 😞 I guess when RNTuple comes the RDataFrame constructor will accept an RNTuple (so there won't be any ambiguity here).

There is no RDF constructor that takes an RNTuple; we have a factory function for each data source, e.g. MakeNTupleDataFrame(tupleName, fileName). How will construction of a distRDF+RNTuple look?

We can probably provide a similar factory function ROOT.RDF.Distributed.Spark.MakeNTupleDataFrame and from that dispatch to a head node type like RNTupleHeadNode. This would be in the spirit of needing minimal changes in user code that already uses MakeNTupleDataFrame.
    for chainsubnames in treefriendinfo.fFriendChainSubNames)
return friendnamesalias, friendfilenames, friendchainsubnames
else:
    return None, None, None
Do we need to return three None in a tuple, or would returning just None be enough?

This is to make it consistent with the other return type and to have a unique way to fill the attributes of the TreeRange object (done further down in TreeHeadNode.build_ranges, which in turn calls Ranges.get_clustered_ranges).

Mmm, but you either get the three or none, right? Perhaps it's because somewhere else this needs to be treated as a single object which is either a tuple or None, instead of three different objects?

In the end these three tuples need to be emplaced in the namedtuple TreeRange, which is defined as

TreeRange = collections.namedtuple(
    "TreeRange", ["start", "end", "treename", "treefilenames", "friendnamesalias", "friendfilenames", "friendchainsubnames", "defaultbranches"])

So in particular the attributes "friendnamesalias", "friendfilenames", "friendchainsubnames" are already three distinct objects; we never use the return value of this function as a single tuple.

Food for thought: wouldn't it be better to group those as "friend info" in the TreeRange, instead of making it completely flat? 😄 Same for start and end, probably.

I will incorporate this suggestion in another commit.
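The "food for thought" above could be sketched like this: nested namedtuples instead of a flat TreeRange. The exact grouping and field names are assumptions for illustration, not the design that was merged.

```python
import collections

# Group friend-related fields (and start/end) instead of keeping
# TreeRange flat, as suggested in the review thread above.
FriendInfo = collections.namedtuple(
    "FriendInfo", ["friendnamesalias", "friendfilenames", "friendchainsubnames"])
EntryRange = collections.namedtuple("EntryRange", ["start", "end"])
TreeRange = collections.namedtuple(
    "TreeRange",
    ["entryrange", "treename", "treefilenames", "friendinfo", "defaultbranches"])

r = TreeRange(
    entryrange=EntryRange(0, 100),
    treename="events",
    treefilenames=("f.root",),
    friendinfo=FriendInfo((), (), ()),  # empty tuples instead of (None, None, None)
    defaultbranches=())
print(r.entryrange.start, r.friendinfo.friendfilenames)  # 0 ()
```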
return clusters

@lru_cache(maxsize=None)
So this caches the result of the function given the arguments, right?
Should we set a boundary on the number of distinct calls that are cached, just in case?
The possible problem with this is if the content of the input files changes during the execution (which I guess is unlikely).

> So this caches the result of the function given the arguments, right?

Correct, and the arguments all need to be hashable (that's why where we previously had lists there are now tuples).

> Should we set a boundary on the number of distinct calls that are cached, just in case?

It can be done, sure. I just don't know what the magic number is ahah.

> The possible problem with this is if the content of the input files changes during the execution (which I guess is unlikely).

This is actually something I wanted to explore. In theory we could work around it if the TFile fUUID member does what I think it does, i.e. associates a unique id to the same filename each time it gets changed.

I would just leave the default for the maxsize.

I agree with Enric, it's weird that we are saying "we are ok with using an unbounded amount of memory".
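The two points raised in this thread can be demonstrated in isolation: a finite maxsize bounds memory via least-recently-used eviction, and unhashable arguments (lists) are rejected. The function below is a toy stand-in, not DistRDF code.

```python
from functools import lru_cache

# maxsize=2 bounds the cache: a third distinct call evicts the LRU entry.
@lru_cache(maxsize=2)
def build_ranges(filenames):
    return len(filenames)

build_ranges(("a.root",))
build_ranges(("a.root", "b.root"))
build_ranges(("c.root",))                  # evicts the ("a.root",) entry
print(build_ranges.cache_info().currsize)  # 2

try:
    build_ranges(["a.root"])               # lists are unhashable
except TypeError:
    print("lists cannot be cache keys")
```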
etejedor left a comment:
Looks good in general, I added a few comments!
eguiraud left a comment:
I left some comments; there are a couple of things that look possibly wrong, the rest is just minor stuff.
EntriesRange = collections.namedtuple("EntriesRange", ["start", "end"])
TreeRange = collections.namedtuple(
    "TreeRange", ["start", "end", "treename", "treefilenames", "friendnames", "friendfilenames"])
this should probably be its own type, for documentation/lookup purposes: get_clustered_ranges says it returns a list[namedtuple], but it would be better if it said that it returns a list[TreeRange]

Ok, point taken. Probably deserves a PR of its own just for this.
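One way to realize the suggestion above is typing.NamedTuple, which gives the tuple a proper named type that signatures and docs can reference. The field set and the toy function body are assumptions; only the names TreeRange and get_clustered_ranges come from the discussion.

```python
from typing import List, NamedTuple, Tuple

# A proper class instead of an anonymous collections.namedtuple, so that
# annotations can say List[TreeRange] rather than list[namedtuple].
class TreeRange(NamedTuple):
    start: int
    end: int
    treename: str
    treefilenames: Tuple[str, ...]

def get_clustered_ranges(treename: str,
                         filenames: Tuple[str, ...]) -> List[TreeRange]:
    """Toy version: one TreeRange per file, to show the annotated return type."""
    return [TreeRange(0, 0, treename, (f,)) for f in filenames]

ranges = get_clustered_ranges("events", ("f1.root", "f2.root"))
print(type(ranges[0]).__name__, len(ranges))  # TreeRange 2
```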
defaultbranches, # type: list[str]
) # type: collections.namedtuple
for clusters in _n_even_chunks(clustersinfiles, npartitions)
]
not for this PR, but this is way too terse, I have no idea what's going on here 😅

Ok, we'll see what we can do about that in another PR.
After reading the comments and some discussion I decided it's better to split this PR. I will address the comments in the respective PRs.
fixes #7584
This PR shows a possible refactor of the logic that finally creates the ranges to send to the distributed resources. It works in the following steps:
- use the `ROOT::Internal::TreeUtils` functions
- cache the Ranges for reuse in the same Python session. This still doesn't improve the initial startup time discussed in Improve startup time of distributed RDataFrame application #8232