RooFit MultiProcess and TestStatistics #8294

Closed
egpbos wants to merge 518 commits into root-project:master from roofit-dev:RooFit_MultiProcess_PR


@egpbos egpbos commented May 31, 2021

This PR adds to RooFit:

  1. Parallelism to gradient calculation in Minuit2 minimization in the form of an extensible interface in the RooFit::MultiProcess package.
  2. A refactored test statistics framework with cleaner separation of computation and physics/statistics concepts than in the existing RooAbsTestStatistic-derived classes. Currently, RooFit::TestStatistics is part of roofitcore. Note: TestStatistics/likelihood_builders still has to be finished; this will be done in the coming few weeks.
  3. RooFitZMQ, a wrapper of ZeroMQ functionality used in RooFit::MultiProcess for communication between processes. Adapted from code contributed by @roelaaij.
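
To illustrate the idea behind item 1, here is a minimal sketch of distributing the partial derivatives of a numerical gradient over worker processes. It is not the RooFit::MultiProcess API: it uses plain POSIX `fork` and pipes instead of ZeroMQ, a simple central finite difference instead of the Minuit2 algorithm, and all names (`objective`, `partial_derivative`, `multiprocess_gradient`) are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical objective function: f(x) = sum_i (i + 1) * x_i^2
double objective(const std::vector<double> &x)
{
   double sum = 0.0;
   for (std::size_t i = 0; i < x.size(); ++i) sum += (i + 1) * x[i] * x[i];
   return sum;
}

// Central finite difference for component i (x is copied on purpose).
double partial_derivative(std::vector<double> x, std::size_t i, double h = 1e-6)
{
   x[i] += h;
   double up = objective(x);
   x[i] -= 2 * h;
   double down = objective(x);
   return (up - down) / (2 * h);
}

// Reference: all partial derivatives computed in one process.
std::vector<double> serial_gradient(const std::vector<double> &x)
{
   std::vector<double> grad(x.size());
   for (std::size_t i = 0; i < x.size(); ++i) grad[i] = partial_derivative(x, i);
   return grad;
}

// Worker w computes components w, w + n_workers, ... and sends each result
// back to the parent through its own pipe.
std::vector<double> multiprocess_gradient(const std::vector<double> &x, std::size_t n_workers)
{
   if (n_workers == 0) throw std::invalid_argument("need at least one worker");
   std::vector<int> read_fds;
   std::vector<pid_t> pids;
   for (std::size_t w = 0; w < n_workers; ++w) {
      int fds[2];
      if (pipe(fds) != 0) throw std::runtime_error("pipe failed");
      pid_t pid = fork();
      if (pid < 0) throw std::runtime_error("fork failed");
      if (pid == 0) { // worker process
         close(fds[0]);
         for (std::size_t i = w; i < x.size(); i += n_workers) {
            double value = partial_derivative(x, i);
            if (write(fds[1], &value, sizeof value) != static_cast<ssize_t>(sizeof value)) _exit(1);
         }
         _exit(0);
      }
      close(fds[1]); // parent keeps only the read end
      read_fds.push_back(fds[0]);
      pids.push_back(pid);
   }
   // Collect results in the same task order the workers used.
   std::vector<double> grad(x.size());
   for (std::size_t w = 0; w < n_workers; ++w) {
      for (std::size_t i = w; i < x.size(); i += n_workers) {
         if (read(read_fds[w], &grad[i], sizeof(double)) != static_cast<ssize_t>(sizeof(double)))
            throw std::runtime_error("short read from worker");
      }
      close(read_fds[w]);
      int status = 0;
      waitpid(pids[w], &status, 0);
   }
   return grad;
}
```

Because every worker runs the same floating-point operations as the serial loop, `multiprocess_gradient(x, 3)` should return exactly the same values as `serial_gradient(x)`; only the work distribution changes.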

RooFitZMQ may still need some attention, because in its current form it includes a big part of the libzmq source tree (needed for ppoll, see below), which I'm sure causes licensing issues (it's LGPLv3). I'm open to suggestions on how to handle this.

To make the above additions possible, some modifications to both RooFit and non-RooFit code were made as well:

  1. In Minuit2:
    1. We added a subclass of the AnalyticalGradientCalculator called the ExternalInternalGradientCalculator. Whereas the AGC assumes that the gradient that is passed to it (from outside of Minuit2) is in normal parameter space, the EIGC allows its (External) user to use Minuit2 "Internal" parameter space, i.e. the parameter space that may be bounded into some range using transformation functions. This allowed us to exactly (floating point bit-wise) replicate the Minuit2 gradient calculation outside of Minuit2 itself, allowing us to parallelize this gradient calculation process exactly without having to worry about breaking Minuit2. The replication, NumericalDerivatorMinuit2, was based on earlier work by @lmoneta who already had separated out the bulk of the gradient calculation code from Minuit2.
    2. To make this all work, we also had to upgrade the precision of the transformation functions from double to long double; otherwise round-off errors would persist and ruin any chance of exact bit-wise equality.
  2. In mathcore: Some additions to IFunction were made to allow Minuit2 to probe functions for their ability to generate gradients and second derivatives. Similar additions were made to function adapter classes in Minuit2.
  3. In RooFit:
    1. Most RooMinimizerFcn functionality was moved into an abstract base class RooAbsMinimizerFcn, which in turn forms the base class of the new RooMinimizerFcn, but also of the added RooGradMinimizerFcn (serial, but gradient external to Minuit2) and MinuitFcnGrad (with parallel MultiProcess back-end) classes.
    2. The RooRealMPFE based classes can make use of an added parameter CPUAffinity. On Unix systems (but not macOS), this makes the MPFE based parallelization a lot faster by pinning processes to physical CPU cores.
    3. To accommodate the new minimization frameworks, RooMinimizer was changed quite a bit as well. It is still backwards compatible, but the new functionality can be accessed through a new create template factory function. This template function allows users to pass in their own calculation back-ends, e.g. for calculating on GPUs or in autograd enabled frameworks.
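
The internal/external parameter spaces mentioned under Minuit2 can be sketched as follows. This is not the Minuit2 source; it is a minimal illustration that assumes the sine transformation documented for doubly-bounded Minuit parameters, and the function names are hypothetical. The template parameter makes it possible to compare `double` with `long double` evaluation.

```cpp
#include <cassert>
#include <cmath>

// Internal (unbounded) -> external (bounded) parameter value:
//   ext = lower + (upper - lower) / 2 * (sin(int) + 1)
template <typename T>
T int2ext(T internal, T lower, T upper)
{
   assert(upper > lower);
   return lower + (upper - lower) / 2 * (std::sin(internal) + 1);
}

// External -> internal: the inverse of int2ext on the principal branch.
template <typename T>
T ext2int(T external, T lower, T upper)
{
   assert(upper > lower);
   return std::asin(2 * (external - lower) / (upper - lower) - 1);
}

// Derivative d(ext)/d(int) of the transformation.
template <typename T>
T dint2ext(T internal, T lower, T upper)
{
   return (upper - lower) / 2 * std::cos(internal);
}

// Chain rule: a gradient component in external space is converted to
// internal space by multiplying with d(ext)/d(int).
template <typename T>
T internal_gradient(T external_gradient, T internal, T lower, T upper)
{
   return external_gradient * dint2ext(internal, lower, upper);
}
```

Evaluating `int2ext<double>` and `int2ext<long double>` at the same point may differ in the last bits, which is the kind of round-off difference that matters when bit-wise equality is the goal.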

The commit history also contains the proof of concept version, the benchmark results of which were presented at ACAT19 and CHEP19 (and preliminary results at the 2018 ROOT Users workshop in Sarajevo). That version was redesigned starting from 2019 to better integrate with the rest of the code and at the same time untangle the test statistics classes to conceptually bring them closer to the math, instead of the more implementation-detail oriented existing design (RooAbsTestStatistic et al.).

The new packages include the following tests, which should probably still be added to the testing infrastructure somehow:

  1. MultiProcess:
    1. test_RooFitMultiProcess_Messenger
    2. test_RooFitMultiProcess_ProcessManager
    3. test_RooFitMultiProcess_Job
  2. TestStatistics:
    1. testLikelihoodGradientJob
    2. testLikelihoodSerial
    3. testRooRealL
  3. RooFitZMQ:
    1. test_RooFitZMQ
    2. test_RooFitZMQ_polling
    3. test_RooFitZMQ_HWM
    4. test_RooFitZMQ_load_balancing
  4. RooFitCore:
    1. testRooGradMinimizer
    2. testBidirMMapPipe
    3. testMPFEnll

From my side (and that of the NL eScience Center), the project has ended and time has run out to make any further major contributions to it, except, of course, for finishing this PR, helping to get it working, and possibly handing over further development :)

Here are some notes for possible future work:

  • RooFitZMQ includes an extension of ZeroMQ itself: a ppoll function. This function should ideally be contributed to ZeroMQ, but I have had no time for this. The motivation behind ppoll is given in this blog post.
  • At the last moment, I decided to reimplement part of the Queue functionality. The task distribution and parameter updating functionalities are now done directly using appropriate ZeroMQ sockets instead of indirectly through the Queue. The old-style Queue functionality, however, has not been cleaned up yet. Doing so will clean up the "plumbing" of the MultiProcess functions quite a bit.
  • Benchmarking and optimization still have to be done for this version as well. The scaling results of the proof of concept (see references above) should be reproducible with this reimplementation, but this possibly still needs some tuning.
  • After the most recent merge of master, the RooGradMinimizer tests no longer pass, because the results are no longer exactly equal at the floating-point level. We have not looked into why, but one possible source is the reworked Kahan summation class. This was applied in RooMinimizerFcn, but not yet in our external-gradient classes.
  • The proof-of-concept version classes are also still present in the source tree (roofitcore/MultiProcess), but have only been partially maintained since we started with the final version. Probably the best thing to do is to remove them, but maybe people disagree and want to keep them for comparison while benchmarking and reproducing the results of the proof-of-concept benchmarks. Note: BidirMMapPipe is in there as well, since it was moved there. This class is used in the RooRealMPFE event-based parallelization method that was already present before I started. RooGaussMinimizerFcn and RooTaskSpec were also part of our proof-of-concept exploration work.
  • Similarly, there is some left-over code from benchmarks that is probably now deprecated. In particular, RooTimer and RooJSONListFile, but also strewn around the code there are still some chrono includes or other timing remnants.
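
As background for the Kahan summation note above: compensated summation changes the low-order bits of a sum, so applying it in one code path and not in another is enough to break exact floating-point equality. The sketch below is a generic textbook version, not ROOT's Kahan summation class; the function names are hypothetical.

```cpp
#include <vector>

// Plain left-to-right summation.
double naive_sum(const std::vector<double> &values)
{
   double sum = 0.0;
   for (double v : values) sum += v;
   return sum;
}

// Kahan (compensated) summation: carry tracks the rounding error of each
// addition and feeds it back into the next term.
double kahan_sum(const std::vector<double> &values)
{
   double sum = 0.0;
   double carry = 0.0;
   for (double v : values) {
      double y = v - carry;   // apply the correction from the previous step
      double t = sum + y;     // low-order bits of y may be lost here
      carry = (t - sum) - y;  // recover what was lost
      sum = t;
   }
   return sum;
}
```

For input such as one large term followed by many tiny ones, the two functions generally return different results, so a likelihood summed with one cannot be expected to match a likelihood summed with the other bit for bit. Note that aggressive floating-point compiler options (for example fast-math modes) can optimize the compensation away.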

This work was done over the past 5 years at the initiative of Wouter Verkerke @wverkerke under a Netherlands eScience Center grant, with direct code contributions from @vincecr0ft and @ipelupessy on the RooFit side and @roelaaij on ZeroMQ, lots of support from @cburgard, Lydia Brenner and @jiskattema, and invaluable design input from @hageboeck and @lmoneta in the final stage of moving from the proof-of-concept version to the version before you.

egpbos and others added 30 commits March 7, 2018 14:47
preexisting BidirMMapPipe instances

this allows more complex communication topologies
checks whether process pid is equal to the earlier determined parent id,
this ensures the waitpid can actually be done (needs child)
The test currently gets stuck for some reason.
@phsft-bot

Build failed on mac11.0/cxx17.
Running on macphsft20.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-04T18:21:08.430Z] CMake Error at /Users/sftnight/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1044 (message):

@phsft-bot

Build failed on mac1014/python3.
Running on macphsft17.dyndns.cern.ch:/build/jenkins/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-04T18:41:50.308Z] CMake Error at /build/jenkins/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1044 (message):

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2021-06-04T19:16:06.725Z] CMake Error at C:/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1044 (message):

egpbos added 2 commits June 23, 2021 11:54
As highlighted by @lmoneta in PR review comments (root-project#8369 (review)), the second derivative and step size information is not necessary at this point. The floating point exact minuit-external (roofit-internal) derivative duplication is currently done using ExternalInternalGradientCalculator and further made possible by the long double parameter transformation functions. This is all we need for now.
@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-23T09:55:55.578Z] CMake Error at /data/sftnight/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1043 (message):

@phsft-bot

Build failed on mac11.0/cxx17.
Running on macphsft20.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-23T09:56:46.838Z] CMake Error at /Users/sftnight/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1043 (message):

@phsft-bot

Build failed on ROOT-ubuntu16/nortcxxmod.
Running on sft-ubuntu-1604-1.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-23T09:57:58.694Z] CMake Error at /mnt/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1043 (message):

@phsft-bot

Build failed on mac1014/python3.
Running on macitois21.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-23T09:58:04.056Z] CMake Error at /Volumes/HD2/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1043 (message):

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2021-06-23T10:01:44.267Z] CMake Error at C:/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1043 (message):

@phsft-bot

Build failed on ROOT-debian10-i386/cxx14.
Running on pcepsft10.dyndns.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2021-06-23T10:03:59.230Z] CMake Error at /home/sftnight/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1043 (message):

void LikelihoodJob::update_bool(std::size_t ix, bool value) {
   if (get_manager()->process_manager().is_master()) {
      auto msg = RooFit::MultiProcess::M2Q::update_bool;
      get_manager()->messenger().send_from_queue_to_master(msg, ix, value);
Contributor Author

Wrong! Should be master_to_queue.

//}

void LikelihoodJob::send_back_task_result_from_worker(std::size_t /*task*/) {
   get_manager()->messenger().send_from_worker_to_master(result, carry);
Contributor Author

This all doesn't work, e.g. this doesn't send task ID at the moment, see LikelihoodGradientJob for a better implementation.

}

void LikelihoodGradientJob::receive_task_result_on_queue(std::size_t task, std::size_t worker_id)
{
Contributor Author

Remove, old plumbing.

@guitargeek

Closing because all these developments are merged, with #9349 as the last PR.

@guitargeek guitargeek closed this Dec 3, 2021

6 participants