Skip to content

[DF] Split HeadNode class in different head nodes according to data source#8485

Merged
vepadulano merged 7 commits intoroot-project:masterfrom
vepadulano:distrdf-headnodes
Jun 24, 2021
Merged

[DF] Split HeadNode class in different head nodes according to data source#8485
vepadulano merged 7 commits intoroot-project:masterfrom
vepadulano:distrdf-headnodes

Conversation

@vepadulano
Copy link
Copy Markdown
Member

This PR addresses step 1 of #8391

Split the big HeadNode class in differente head node types according to the original data source (e.g. EntriesHeadNode, TreeHeadNode, in the future also RNTupleHeadNode). Use a factory to get the correct head node type according to user provided arguments to the RDataFrame constructor

It also addresses the comments from the original PR regarding the files touched in this PR, except for the following:

  • The comments about changing the names of the getters in TreeHeadNode and doing the parsing of constructor arguments only once in __init__. will be addressed both in another PR. Probably the properties can be removed or changed in favor of having the attributes already set during __init__ thanks to the mentioned single-pass parsing.
  • Checking that glob.glob behaves similarly to TChain's globbing is left for another time

@vepadulano vepadulano requested a review from etejedor as a code owner June 18, 2021 18:37
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@vepadulano vepadulano changed the title [DF] [DF] Split HeadNode class in different head nodes according to data source Jun 18, 2021
@vepadulano vepadulano requested a review from eguiraud June 18, 2021 18:38
@vepadulano vepadulano self-assigned this Jun 18, 2021
@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 18, 2021

This pull request fixes 1 alert when merging d309b4e into 645d3d0 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

Copy link
Copy Markdown
Contributor

@eguiraud eguiraud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the rework, this looks much nicer! Some last comments below.

# Restrict 'npartitions' if it's greater than 'nentries'
msg = ("Number of partitions {0} is greater than number of entries {1} "
"in the dataframe. Using {1} partition(s)".format(self.npartitions, self.nentries))
warnings.warn(msg, UserWarning, stacklevel=2)
Copy link
Copy Markdown
Contributor

@eguiraud eguiraud Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, because of the self.npartitions = 2 above, we warn in case of nentries == 1 but I think that is a valid (toy/test) usecase

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you mentioned here that distRDF is also not ok with nentries == 0: that case should probably also just work (just run locally and produce a bunch of empty histograms, maybe with a warning) but i'm also ok with leaving that behavior as it is. on the other hand, warning on 1 entry but being ok with 3 is weird)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This depends on whether we can always guarantee that backends will be able to process the case of npartitions == 1. I think it can be done in general, not sure that is for this PR

@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 22, 2021

This pull request fixes 1 alert when merging 4d212ec into e1bdaf6 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 23, 2021

This pull request fixes 1 alert when merging 6eab070 into d026b49 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

1 similar comment
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 23, 2021

This pull request fixes 1 alert when merging d84218f into d026b49 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

# 2. An educated guess according to the backend, using the backend's
# `optimize_npartitions` function
# 3. Set `npartitions` to 2
npartitions = kwargs.pop("npartitions", self.optimize_npartitions(Base.BaseBackend.MIN_NPARTITIONS))
Copy link
Copy Markdown
Contributor

@eguiraud eguiraud Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to pass MIN_NPARTITIONS as an argument, it's a data member

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, initially it was meant to be a generic function "optimize the partitions if possible, else give me back the argument I pass you". It's probably changed logic now

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what should I do here? Change the signature of optimize_npartitions to (self) rather than (self, npartitions) ? Or maybe leave the input parameter but make it optional as (self, npartitions=None)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this function might still change / be removed in the future whenever we decide on some coherent structure to "optimize" the number of partitions of the distributed dataframe)

Copy link
Copy Markdown
Contributor

@eguiraud eguiraud Jun 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see why this shouldn't be def optimize_npartitions(self) (might even change it to a property called npartitions). passing a data member as the only argument of a method looks "wrong".

@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Copy Markdown

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 24, 2021

This pull request introduces 1 alert and fixes 1 when merging ad75fbf into b7d0cb1 - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

fixed alerts:

  • 1 for Unused local variable

Copy link
Copy Markdown
Contributor

@eguiraud eguiraud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 🎉

@vepadulano
Copy link
Copy Markdown
Member Author

LGTM gracefully telling me I shouldn't override methods with different signatures, will fix it right away 😄

@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Copy Markdown

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 24, 2021

This pull request fixes 1 alert when merging de65bd0 into b3b9aa2 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

In distributed RDataFrame, the head node of the RDF computation graph also stores information about the user provided arguments to the RDF constructor, i.e. information about the source of data.
The current implementation has only one HeadNode class responsible for distinguishing between the two supported data sources, i.e. empty RDF with sequential entries and TTree/TChain based RDF.
This commit introduces instead multiple head node types, one per data source. In particular, there is now a `EmptySourceHeadNode` and a `TreeHeadNode`.
To decide which is the correct type of head node for a particular distributed dataframe, this commit introduces a factory function `get_headnode` that parses the user provided arguments to the RDataFrame constructor and returns the correct head node instance.

It would follow that each data source might have different information to send to the distributed resources. This is the case, and it will be addressed in the future with the introduction of a different Range object per each head node type

These change brought a few more changes along:
* In the `get_headnode` factory function there is an early check for the passed arguments to the RDF constructor. Improve the check by catching the `TypeError` that comes from the C++ overload failing to instantiate an RDF with wrong constructor arguments, by raising a new `TypeError` with a more user friendly message.
* The logic of the distributed execution present in the Base backend and in the Spark backend need to be changed in order to accomodate the different head node types.
* The presence of different head node types collides with the current implementation of the `RangesBuilder` class that expected a single type of headnode to retrieve attributes from. Workaround this by using `try except` clauses in the properties of `RangesBuilder`. It is an ugly workaround that will be superseded very soon when also the creation of the ranges will be separated according to the data source.
The number of partitions in which the distributed RDataFrame should be split is needed in the logic for creating the ranges from the head node, and might also be modified after some checks.
Since it should stay a data member of the head node classes, it should also be defined at __init__ time.
This now happens in the call of `make_dataframe` of any distributed backend, that will also pass the `npartitions` argument to the `get_headnode` factory function.
The `npartitions` parameter is now a required, positional argument in all the signatures that expect it.
This method does not really need an extra positional parameter, just to return it if it could not "optimize" it.
Switch to returning the default `MIN_NPARTITIONS` data member if nothing better could be done.
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 24, 2021

This pull request fixes 1 alert when merging 6110909 into b3b9aa2 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable

@phsft-bot
Copy link
Copy Markdown

@vepadulano
Copy link
Copy Markdown
Member Author

Alright, tests are green, I'm merging! Thanks for all help and input @eguiraud @etejedor !

@vepadulano vepadulano merged commit 5ed68e4 into root-project:master Jun 24, 2021
@vepadulano vepadulano deleted the distrdf-headnodes branch October 30, 2021 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants