[DF] Split `HeadNode` class in different head nodes according to data source by vepadulano · Pull Request #8485 · root-project/root

vepadulano · 2021-06-18T18:37:44Z

This PR addresses step 1 of #8391

Split the big HeadNode class in differente head node types according to the original data source (e.g. EntriesHeadNode, TreeHeadNode, in the future also RNTupleHeadNode). Use a factory to get the correct head node type according to user provided arguments to the RDataFrame constructor

It also addresses the comments from the original PR regarding the files touched in this PR, except for the following:

The comments about changing the names of the getters in TreeHeadNode and doing the parsing of constructor arguments only once in __init__. will be addressed both in another PR. Probably the properties can be removed or changed in favor of having the attributes already set during __init__ thanks to the mentioned single-pass parsing.
Checking that glob.glob behaves similarly to TChain's globbing is left for another time

phsft-bot · 2021-06-18T18:37:54Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

lgtm-com · 2021-06-18T19:08:52Z

This pull request fixes 1 alert when merging d309b4e into 645d3d0 - view on LGTM.com

fixed alerts:

1 for Unused local variable

eguiraud

Thank you for the rework, this looks much nicer! Some last comments below.

bindings/experimental/distrdf/python/DistRDF/HeadNode.py

eguiraud · 2021-06-22T11:55:53Z

bindings/experimental/distrdf/python/DistRDF/HeadNode.py

+            # Restrict 'npartitions' if it's greater than 'nentries'
+            msg = ("Number of partitions {0} is greater than number of entries {1} "
+                   "in the dataframe. Using {1} partition(s)".format(self.npartitions, self.nentries))
+            warnings.warn(msg, UserWarning, stacklevel=2)


here, because of the self.npartitions = 2 above, we warn in case of nentries == 1 but I think that is a valid (toy/test) usecase

(you mentioned here that distRDF is also not ok with nentries == 0: that case should probably also just work (just run locally and produce a bunch of empty histograms, maybe with a warning) but i'm also ok with leaving that behavior as it is. on the other hand, warning on 1 entry but being ok with 3 is weird)

This depends on whether we can always guarantee that backends will be able to process the case of npartitions == 1. I think it can be done in general, not sure that is for this PR

bindings/experimental/distrdf/python/DistRDF/HeadNode.py

bindings/experimental/distrdf/python/DistRDF/Node.py

phsft-bot · 2021-06-22T14:29:05Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

lgtm-com · 2021-06-22T14:59:21Z

This pull request fixes 1 alert when merging 4d212ec into e1bdaf6 - view on LGTM.com

fixed alerts:

1 for Unused local variable

bindings/experimental/distrdf/test/test_buildranges.py

phsft-bot · 2021-06-23T11:14:34Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

phsft-bot · 2021-06-23T11:37:27Z

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

lgtm-com · 2021-06-23T11:44:55Z

This pull request fixes 1 alert when merging 6eab070 into d026b49 - view on LGTM.com

fixed alerts:

1 for Unused local variable

phsft-bot · 2021-06-23T12:19:27Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

phsft-bot · 2021-06-23T12:28:04Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

lgtm-com · 2021-06-23T12:55:49Z

This pull request fixes 1 alert when merging d84218f into d026b49 - view on LGTM.com

fixed alerts:

1 for Unused local variable

phsft-bot · 2021-06-23T13:21:52Z

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

eguiraud · 2021-06-23T17:19:27Z

bindings/experimental/distrdf/python/DistRDF/Backends/Spark/Backend.py

+        # 2. An educated guess according to the backend, using the backend's
+        #    `optimize_npartitions` function
+        # 3. Set `npartitions` to 2
+        npartitions = kwargs.pop("npartitions", self.optimize_npartitions(Base.BaseBackend.MIN_NPARTITIONS))


no need to pass MIN_NPARTITIONS as an argument, it's a data member

Right, initially it was meant to be a generic function "optimize the partitions if possible, else give me back the argument I pass you". It's probably changed logic now

So what should I do here? Change the signature of optimize_npartitions to (self) rather than (self, npartitions) ? Or maybe leave the input parameter but make it optional as (self, npartitions=None)?

(this function might still change / be removed in the future whenever we decide on some coherent structure to "optimize" the number of partitions of the distributed dataframe)

i don't see why this shouldn't be def optimize_npartitions(self) (might even change it to a property called npartitions). passing a data member as the only argument of a method looks "wrong".

phsft-bot · 2021-06-24T07:08:46Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

phsft-bot · 2021-06-24T07:31:22Z

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

lgtm-com · 2021-06-24T07:48:39Z

This pull request introduces 1 alert and fixes 1 when merging ad75fbf into b7d0cb1 - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method

fixed alerts:

1 for Unused local variable

eguiraud

🎉 🎉

vepadulano · 2021-06-24T08:31:10Z

LGTM gracefully telling me I shouldn't override methods with different signatures, will fix it right away 😄

phsft-bot · 2021-06-24T08:37:54Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

phsft-bot · 2021-06-24T09:01:07Z

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

lgtm-com · 2021-06-24T09:09:23Z

This pull request fixes 1 alert when merging de65bd0 into b3b9aa2 - view on LGTM.com

fixed alerts:

1 for Unused local variable

In distributed RDataFrame, the head node of the RDF computation graph also stores information about the user provided arguments to the RDF constructor, i.e. information about the source of data. The current implementation has only one HeadNode class responsible for distinguishing between the two supported data sources, i.e. empty RDF with sequential entries and TTree/TChain based RDF. This commit introduces instead multiple head node types, one per data source. In particular, there is now a `EmptySourceHeadNode` and a `TreeHeadNode`. To decide which is the correct type of head node for a particular distributed dataframe, this commit introduces a factory function `get_headnode` that parses the user provided arguments to the RDataFrame constructor and returns the correct head node instance. It would follow that each data source might have different information to send to the distributed resources. This is the case, and it will be addressed in the future with the introduction of a different Range object per each head node type These change brought a few more changes along: * In the `get_headnode` factory function there is an early check for the passed arguments to the RDF constructor. Improve the check by catching the `TypeError` that comes from the C++ overload failing to instantiate an RDF with wrong constructor arguments, by raising a new `TypeError` with a more user friendly message. * The logic of the distributed execution present in the Base backend and in the Spark backend need to be changed in order to accomodate the different head node types. * The presence of different head node types collides with the current implementation of the `RangesBuilder` class that expected a single type of headnode to retrieve attributes from. Workaround this by using `try except` clauses in the properties of `RangesBuilder`. It is an ugly workaround that will be superseded very soon when also the creation of the ranges will be separated according to the data source.

The number of partitions in which the distributed RDataFrame should be split is needed in the logic for creating the ranges from the head node, and might also be modified after some checks. Since it should stay a data member of the head node classes, it should also be defined at __init__ time. This now happens in the call of `make_dataframe` of any distributed backend, that will also pass the `npartitions` argument to the `get_headnode` factory function. The `npartitions` parameter is now a required, positional argument in all the signatures that expect it.

…te head nodes

This method does not really need an extra positional parameter, just to return it if it could not "optimize" it. Switch to returning the default `MIN_NPARTITIONS` data member if nothing better could be done.

phsft-bot · 2021-06-24T10:33:16Z

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

lgtm-com · 2021-06-24T11:02:40Z

This pull request fixes 1 alert when merging 6110909 into b3b9aa2 - view on LGTM.com

fixed alerts:

1 for Unused local variable

phsft-bot · 2021-06-24T11:18:02Z

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

vepadulano · 2021-06-24T14:10:30Z

Alright, tests are green, I'm merging! Thanks for all help and input @eguiraud @etejedor !

[DF][NFC] Remove unused import in Node.py

a723468

vepadulano requested a review from etejedor as a code owner June 18, 2021 18:37

vepadulano changed the title ~~[DF]~~ [DF] Split HeadNode class in different head nodes according to data source Jun 18, 2021

vepadulano requested a review from eguiraud June 18, 2021 18:38

vepadulano self-assigned this Jun 18, 2021

vepadulano added the in:RDataFrame label Jun 18, 2021

eguiraud reviewed Jun 22, 2021

View reviewed changes

eguiraud mentioned this pull request Jun 22, 2021

[WIP][DF] Refactor DistRDF range creation #8391

Closed

etejedor reviewed Jun 23, 2021

View reviewed changes

bindings/experimental/distrdf/test/test_buildranges.py Outdated Show resolved Hide resolved

eguiraud reviewed Jun 23, 2021

View reviewed changes

eguiraud approved these changes Jun 24, 2021

View reviewed changes

vepadulano added 5 commits June 24, 2021 12:20

[DF][NFC] Add TODOs in TreeHeadNode and RangesBuilder

f3003ed

[DF][NFC] Change tests according to new head node types

6f5d12a

[DF][NFC] Adapt tests to new signature, use factory functions to crea…

a9f8bac

…te head nodes

[DF] Change signature of optimize_npartitions in backends

6110909

This method does not really need an extra positional parameter, just to return it if it could not "optimize" it. Switch to returning the default `MIN_NPARTITIONS` data member if nothing better could be done.

vepadulano force-pushed the distrdf-headnodes branch from de65bd0 to 6110909 Compare June 24, 2021 10:33

vepadulano merged commit 5ed68e4 into root-project:master Jun 24, 2021

vepadulano deleted the distrdf-headnodes branch October 30, 2021 20:35

Conversation

vepadulano commented Jun 18, 2021

Uh oh!

phsft-bot commented Jun 18, 2021

Uh oh!

lgtm-com bot commented Jun 18, 2021

Uh oh!

eguiraud left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

eguiraud Jun 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eguiraud Jun 22, 2021

Choose a reason for hiding this comment

Uh oh!

vepadulano Jun 22, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

phsft-bot commented Jun 22, 2021

Uh oh!

lgtm-com bot commented Jun 22, 2021

Uh oh!

Uh oh!

phsft-bot commented Jun 23, 2021

Uh oh!

phsft-bot commented Jun 23, 2021

Failing tests:

Uh oh!

lgtm-com bot commented Jun 23, 2021

Uh oh!

phsft-bot commented Jun 23, 2021

Uh oh!

phsft-bot commented Jun 23, 2021

Uh oh!

lgtm-com bot commented Jun 23, 2021

Uh oh!

phsft-bot commented Jun 23, 2021

Failing tests:

Uh oh!

eguiraud Jun 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vepadulano Jun 23, 2021

Choose a reason for hiding this comment

Uh oh!

vepadulano Jun 23, 2021

Choose a reason for hiding this comment

Uh oh!

vepadulano Jun 23, 2021

Choose a reason for hiding this comment

Uh oh!

eguiraud Jun 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phsft-bot commented Jun 24, 2021

Uh oh!

phsft-bot commented Jun 24, 2021

Failing tests:

Uh oh!

lgtm-com bot commented Jun 24, 2021

Uh oh!

eguiraud left a comment

Choose a reason for hiding this comment

Uh oh!

vepadulano commented Jun 24, 2021

Uh oh!

phsft-bot commented Jun 24, 2021

Uh oh!

phsft-bot commented Jun 24, 2021

Failing tests:

Uh oh!

lgtm-com bot commented Jun 24, 2021

eguiraud Jun 22, 2021 •

edited

Loading

eguiraud Jun 23, 2021 •

edited

Loading

eguiraud Jun 24, 2021 •

edited

Loading