Skip to content

Directly store needed attributes in TreeHeadNode#8816

Merged
vepadulano merged 2 commits intoroot-project:masterfrom
vepadulano:distrdf-tree-attribute
Aug 10, 2021
Merged

Directly store needed attributes in TreeHeadNode#8816
vepadulano merged 2 commits intoroot-project:masterfrom
vepadulano:distrdf-tree-attribute

Conversation

@vepadulano
Copy link
Copy Markdown
Member

This PR addresses the TODO comments to remove the superfluous public getters / properties logic in TreeHeadNode. Currently, just parses the user arguments in the __init__ method and stores the needed attributes. My idea is that in the end we'll only need to store npartitions and tree (which are always needed) plus defaultbranches and friendinfo as "optional" class attributes (meaning they are initialized to None and changed if some conditions apply).

For simplicity, in this PR I still leave treename and inputfiles attributes which are needed in the Ranges.get_clusters function.

The next PRs will address:

  • The assumption we only have one unique treename in the RDataFrame
  • Change in Ranges.get_clusters to accept a TTree instance as argument, pairing it with new C++ functions in ROOT::Internal::TreeUtils to retrieve a vector of clusters (plus some other metadata) depending if it's a TTree or TChain

If the logic for this PR is approved I'll add more docs and commit messages

@vepadulano vepadulano requested review from eguiraud and etejedor August 9, 2021 09:07
@vepadulano vepadulano self-assigned this Aug 9, 2021
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

Copy link
Copy Markdown
Contributor

@eguiraud eguiraud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

EDIT: but the lgtm-com bot alert looks correct

# RDataFrame(tree, defaultBranches = {})
self.tree = args[0]
# Retrieve information about friend trees when user passes a TTree
# or TChain object.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment belongs to the previous commit

hn = create_dummy_headnode(tree)

self.assertEqual(hn.get_num_entries(), 4)
self.assertEqual(hn.tree.GetEntries(), 4)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you testing here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the above tests in class NumEntriesTest were previously testing that the method get_num_entries of TreeHeadNode was returning the correct amount. Now, they test that the attribute tree of the class is the correct TTree/TChain that we expect by retrieving the number of entries.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may agree that it's not a super useful test, but I didn't want to remove it as a first step

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Aug 9, 2021

This pull request introduces 1 alert when merging 27f893d into 68c7ac9 - view on LGTM.com

new alerts:

  • 1 for Unused import

@vepadulano vepadulano linked an issue Aug 9, 2021 that may be closed by this pull request
* Remove the superfluous getters plus properties logic for attributes in TreeHeadNode.
* Instead of storing the user arguments and parse them later, parse them directly in the __init__ method and store a `tree` class attribute with a TTree or TChain that will be processed.

In general, the `npartitions` and `tree` attributes are always needed to setup the distributed execution. The `defaultbranches` and `friendinfo` attributes are initialized to `None` and changed if some conditions apply:
* `defaultbranches` needs to be explicitly passed by the user to the RDataFrame constructor
* Information about friend trees is retrieved only if the user passed a `TTree &` to the RDataFrame constructor

This commit retrieves and stores the `treename` and `inputfiles` attributes to keep compatibility with the current signature of `DistRDF.Ranges.get_clusters`.
In a future commit, this signature may be changed to accept directly a TTree or TChain and thus these two attributes could be removed.
@vepadulano vepadulano force-pushed the distrdf-tree-attribute branch from 27f893d to 8c64c8a Compare August 9, 2021 13:10
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@vepadulano
Copy link
Copy Markdown
Member Author

  • Removed unused import
  • Updated class docstrings
  • Added commit message
  • Rebased

@phsft-bot
Copy link
Copy Markdown

Build failed on ROOT-ubuntu16/nortcxxmod.
Running on sft-ubuntu-1604-4.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-08-09T13:11:19.003Z] CMake Error at CMakeLists.txt:7 (cmake_minimum_required):
  • [2021-08-09T13:11:19.003Z] CMake Error at /mnt/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1117 (message):

@phsft-bot
Copy link
Copy Markdown

Build failed on ROOT-debian10-i386/cxx14.
Running on pcepsft10.dyndns.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-08-09T15:00:00.639Z] stderr: error: could not read '.git/rebase-apply/head-name': No such file or directory
  • [2021-08-09T15:00:03.623Z] stderr: error: could not read '.git/rebase-apply/head-name': No such file or directory
  • [2021-08-09T15:00:03.881Z] CMake Error at /home/sftnight/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1084 (message):

The head node of a computation graph where the RDataFrame data source is
a TTree. This head node is responsible for the following RDataFrame constructor::
a TTree or a TChain. This head node is responsible for the following
RDataFrame constructors::
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double colon

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double colon is how Sphinx parses a code block (doxygen would expect a ~~~{.py}) . It's still here and there in DistRDF, when we'll move to fully utilise doxygen will be changed

@vepadulano vepadulano merged commit 21c4835 into root-project:master Aug 10, 2021
@vepadulano vepadulano deleted the distrdf-tree-attribute branch October 30, 2021 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid Python globbing when we actually need TChain globbing

4 participants