Ensure registration of atexit function is done only once by vepadulano · Pull Request #10753 · root-project/root

vepadulano · 2022-06-14T13:48:10Z

Before this commit, the following triggered two calls to
TROOT::EndOfProcessCleanups:

python -c "import ROOT; ROOT.TH1F"

This happened because the TApplication constructor registered a
function that calls EndOfProcessCleanups with the C atexit and
PyROOT registered another similar function with the Python atexit
module in __init__.py. The Python atexit happened before the C
atexit.

This commit introduces a static function in TApplication that
registers the call to EndOfProcessCleanups with the C atexit and
this is guaranteed to be run only once through the use of
std::call_once. This new function can be called both from PyROOT
and the TApplication constructor, to make sure it will be
registered by either of them.

Furthermore, calling this from PyROOT means that now
EndOfProcessCleanups is always called at the very end of the
program, seemingly a more sensible choice.

This PR fixes #10743

Before this commit, the following triggered two calls to TROOT::EndOfProcessCleanups: python -c "import ROOT; ROOT.TH1F" This happened because the TApplication constructor registered a function that calls EndOfProcessCleanups with the C `atexit` and PyROOT registered another similar function with the Python atexit module in `__init__.py`. The Python atexit happened before the C atexit. This commit introduces a static function in TApplication that registers the call to EndOfProcessCleanups with the C `atexit` and this is guaranteed to be run only once through the use of std::call_once. This new function can be called both from PyROOT and the TApplication constructor, to make sure it will be registered by either of them. Furthermore, calling this from PyROOT means that now EndOfProcessCleanups is always called at the very end of the program, seemingly a more sensible choice.

phsft-bot · 2022-06-14T13:48:19Z

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/python3, mac1015/cxx17, mac11/cxx14, windows10/cxx14
How to customize builds

phsft-bot · 2022-06-14T13:48:54Z

Build failed on ROOT-ubuntu16/nortcxxmod.
Running on sft-ubuntu-1604-1.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

[2022-06-14T13:48:53.450Z] CMake Error at /mnt/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1083 (message):

vepadulano · 2022-06-14T13:49:50Z

With the following diff

diff --git a/core/base/src/TROOT.cxx b/core/base/src/TROOT.cxx
index 3f385422ce..58a6fe052c 100644
--- a/core/base/src/TROOT.cxx
+++ b/core/base/src/TROOT.cxx
@@ -1201,6 +1201,7 @@ void TROOT::CloseFiles()
 
 void TROOT::EndOfProcessCleanups()
 {
+   std::cout << "Calling TROOT::EndOfProcessCleanups\n";
    // This will not delete the objects 'held' by the TFiles so that
    // they can still be 'reacheable' when ResetGlobals is run.
    CloseFiles();

And a build of current master, I get

$: python -c "import ROOT; ROOT.TH1F"
Calling TROOT::EndOfProcessCleanups
Calling TROOT::EndOfProcessCleanups

Whereas with a build of this PR I get

$: python -c "import ROOT; ROOT.TH1F"
Calling TROOT::EndOfProcessCleanups

lgtm-com · 2022-06-14T14:12:01Z

This pull request introduces 1 alert when merging 697a0b0 into 867fe54 - view on LGTM.com

new alerts:

1 for Module imports itself

eguiraud · 2022-06-14T14:37:01Z

Thanks Vincenzo! I think the most important change in this PR is that PyROOT does not register the EndOfProcessCleanups call with Python's atexit mechanism anymore, but it uses the classic handlers mechanism that TApplication already uses. And that means that now PyROOT objects will de deleted before EndOfProcessCleanups is called, which is an important change in behavior but I think it is the intended/correct behavior.

I am not sure we need the combo std::call_once + std::once_flag data member, a static counter seems simpler and it does the job (unless we expect that TApplication and PyROOT could try to register the handler concurrently, but I can't imagine how that would happen).

With this said I'm the least qualified person to decide on whether we want to go with this or not, I'll leave it to the other reviewers.

phsft-bot · 2022-06-14T14:38:58Z

Build failed on ROOT-performance-centos8-multicore/cxx17.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

projectroot.runtutorials.tutorial_legacy_rootenv

pcanal · 2022-06-14T15:21:01Z

It seems the order of execution is now 'wrong' in the PyROOT cases, i.e. do we still implement the behavior described in the comment:

            # Hard teardown: run part of the gROOT shutdown sequence.
            # Running it here ensures that it is done before any ROOT libraries
            # are off-loaded, with unspecified order of static object destruction.

See also 7a592f5 for some description of a challenging use case.

vepadulano · 2022-06-14T17:52:17Z

See also 7a592f5 for some description of a challenging use case.

It is unclear to me what is the weight of the behaviour described in the comment and in the commit message w.r.t. the logic of what we need to do at atexit time. From my understanding, the current behaviour (before this PR) is:

Run TROOT::EndOfProcessCleanups
Python gc destroys variables, thus triggering any destructors of the corresponding C++ objects
Run again TROOT::EndOfProcessCleanups

This logic seems faulty to me. I ran the reproducer of the original issue linked in the commit you linked and it runs smoothly after the changes of this PR. Copying here the reproducer for reference:

import ROOT

def create():
  f = ROOT.TFile("file.root", "RECREATE")
  t = ROOT.TTree("tree", "")
  return f, t
f, t = create()

Admittedly, I have only tried with Python 3.10, I will see if I can also try with other Python 3 minor versions. But irrespective of this, I don't think TROOT::EndOfProcessCleanups should be called twice. I believe that if there is indeed a behaviour that requires some Python atexit handler, that should be written in another function, separate from EndOfProcessCleanups, and called accordingly.

pcanal · 2022-06-14T19:06:03Z

A priori a 2nd running of EndOfProcessCleanups should be 'harmless' (i.e. what ever list of things to delete would be empty on the 2nd run).

The issue here (well there #10743) does not seems to be the 2nd running of EndOfProcessCleanups but rather that there is a bad interaction between the 'early' running of the cleanups and that something in RDataFrame is not properly informed that something is delete early.

The early cleanups is necessary because of the following issue.

The python code (directly or indirectly) can load libraries (Ls)
The python code (directly or indirectly) can open TFile (Fs) and can elect to never delete them
The python code (directly or indirectly) can attach objects based on the Libraries on to the TFile.

Then the order of execution at tear down will be

(1) tear down python
(2) unload the loaded libraries (Ls)
(3) execute the atexit registered by TROOT,
(4) this includes EndOfProcessCleanups
(5) this includes flushing and closing the TFiles.

During (5) the object attached the TFile (usually just histogram and TTree but can also be object from the user libraries.

However if the libraries for the user objects are already unloaded, their code (including a Streamer function or even their dictionary) are gone and this result in crash at best or worse in data loss.

In order to prevent that we introduced (in 7a592f5 and friends), running the EndOfProcessCleanups of cleanups as soon as possible (i.e. as soon as we can detect the process is starting to tear down). (i.e. make it part of (1)).

More likely that not this means than rather than removing early this call to EndOfProcessCleanups, the solution is to inform RDataFrame of those early deletion (depending whether the things that are deleted that are annoying RDataFrame, they may or may not be having the trouble described above, so another solution might be to insure that the things that bother RDataFrame are not included in the list of things deleted by EndOfProcessCleanups)

vepadulano · 2022-06-14T19:39:10Z

Ok I understand a bit better the original motivation for the two calls to EndOfProcessCleanups (although I still think is not very clean). Can we at least make EndOfProcessCleanups truly a no-op if it's not the first time it's been called? This should let Python call it before point (5) above, then when it's going to get called by the TApplication it won't bother RDF. Does this sound reasonable?

pcanal · 2022-06-14T19:52:58Z

Can we at least make EndOfProcessCleanups truly a no-op if it's not the first time it's been called?

Well, maybe, maybe not. If it is currently not a true no-op then it might be either

(a bug) some list is not properly cleared
(a feature) some items are added to the to-be-cleaned list between the 2 calls.

This should let Python call it before point (5) above, then when it's going to get called by the TApplication it won't bother RDF.

It must be called (by python) during point (1) (which maybe what you meant) and RDF should not be doing anything after (5) so it should already not bother RDF.

But maybe I am missing something. I.e. is the second call to EndOfProcessCleanups a red-herring that 'just' looks suspicious and the problem is "first call affecting RDF" or is it really the 2nd call that is creating a problem.

eguiraud · 2022-06-14T22:06:06Z

Just to clarify, the second call is totally fine and the first call is creating a problem for RDF because of #10742 .

Even then, it seemed weird that calling EndOfProcessCleanups twice, and once before the destructor of object at local scope is even called, was "everything working as intended". That's why I opened the issue.

If Python's or PyROOT's teardown sequence unloads libraries that might be needed by EndOfProcessCleanups then I agree with Vincenzo that maybe the best we can do is run it once before Python's teardown (maybe with a comment that explains the issue, if the scenario in the commit message of 7a592f5 does not really need the early EndOfProcessCleanups call as Vincenzo mentioned).

pcanal · 2022-06-14T22:23:43Z

if the scenario in the commit message of 7a592f5 does not really need the early EndOfProcessCleanups

Unless something in the handling of the python handles changed since the commit, then it is needed.

Even then, it seemed weird that calling EndOfProcessCleanups twice, and once before the destructor of object at local scope is even called, was "everything working as intended".

The "once before ...." is more or so because there is no 'better' place to call it.

the first call is creating a problem for RDF because of #10742 .

Indeed (this is scenario similar to the (broken) ones I tried to describe). Getting the order of destruction right is very challenging (way more corner cases than you would expect or want) and unless there is a strong reason (and #10742 is not a good reason, that bug leads to a broken state which leads to broken behavior :) ), I would leave this code as-is.

vepadulano · 2022-06-15T07:32:28Z

Unless something in the handling of the python handles changed since the commit, then it is needed.

Before and after this PR, the reproducer of 7a592f5 works, so it means that currently we have no coverage for the edge cases that it was covering.

pcanal · 2022-06-15T14:39:11Z

the reproducer of 7a592f5 works,

Did you run it under valgrind? (it could 'just' appear to work)

pcanal · 2022-06-15T15:37:03Z

... but even if it did come clean it would not necessarily prove that it is not needed. Order of tear down is notoriously fickled and may depends on many variable (version of python, compilation options, general order of operation). I.e. unless we unearth the actual commits and/or changes that make this step obsolete, I would not touch it.

vepadulano · 2022-08-30T10:22:34Z

Closing this PR, the only thing needed is probably more test coverage to better understand the need for the current approach. I opened a bug issue for that

vepadulano added the in:Core Libraries label Jun 14, 2022

vepadulano requested a review from eguiraud June 14, 2022 13:48

vepadulano requested review from Axel-Naumann, etejedor and pcanal as code owners June 14, 2022 13:48

lmoneta assigned vepadulano Jun 20, 2022

eguiraud removed their request for review August 18, 2022 17:40

vepadulano mentioned this pull request Aug 30, 2022

Investigate need for TROOT::EndOfProcessCleanups in PyROOT #11279

Closed

vepadulano closed this Aug 30, 2022

vepadulano mentioned this pull request Oct 14, 2024

[Do not merge][PyROOT] Don't call ClearProxiedObjects() in hard shutdown mode #16671

Closed

Conversation

vepadulano commented Jun 14, 2022

Uh oh!

phsft-bot commented Jun 14, 2022

Uh oh!

phsft-bot commented Jun 14, 2022

Errors:

Uh oh!

vepadulano commented Jun 14, 2022

Uh oh!

lgtm-com bot commented Jun 14, 2022

Uh oh!

eguiraud commented Jun 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phsft-bot commented Jun 14, 2022

Failing tests:

Uh oh!

pcanal commented Jun 14, 2022

Uh oh!

vepadulano commented Jun 14, 2022

Uh oh!

pcanal commented Jun 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vepadulano commented Jun 14, 2022

Uh oh!

pcanal commented Jun 14, 2022

Uh oh!

eguiraud commented Jun 14, 2022

Uh oh!

pcanal commented Jun 14, 2022

Uh oh!

vepadulano commented Jun 15, 2022

Uh oh!

pcanal commented Jun 15, 2022

Uh oh!

pcanal commented Jun 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vepadulano commented Aug 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eguiraud commented Jun 14, 2022 •

edited

Loading

pcanal commented Jun 14, 2022 •

edited

Loading

pcanal commented Jun 15, 2022 •

edited

Loading