Skip to content

Ensure registration of atexit function is done only once#10753

Closed
vepadulano wants to merge 1 commit intoroot-project:masterfrom
vepadulano:unique-atexit
Closed

Ensure registration of atexit function is done only once#10753
vepadulano wants to merge 1 commit intoroot-project:masterfrom
vepadulano:unique-atexit

Conversation

@vepadulano
Copy link
Copy Markdown
Member

Before this commit, the following triggered two calls to
TROOT::EndOfProcessCleanups:

python -c "import ROOT; ROOT.TH1F"

This happened because the TApplication constructor registered a
function that calls EndOfProcessCleanups with the C atexit and
PyROOT registered another similar function with the Python atexit
module in __init__.py. The Python atexit happened before the C
atexit.

This commit introduces a static function in TApplication that
registers the call to EndOfProcessCleanups with the C atexit and
this is guaranteed to be run only once through the use of
std::call_once. This new function can be called both from PyROOT
and the TApplication constructor, to make sure it will be
registered by either of them.

Furthermore, calling this from PyROOT means that now
EndOfProcessCleanups is always called at the very end of the
program, seemingly a more sensible choice.

This PR fixes #10743

Before this commit, the following triggered two calls to
TROOT::EndOfProcessCleanups:

python -c "import ROOT; ROOT.TH1F"

This happened because the TApplication constructor registered a
function that calls EndOfProcessCleanups with the C `atexit` and
PyROOT registered another similar function with the Python atexit
module in `__init__.py`. The Python atexit happened before the C
atexit.

This commit introduces a static function in TApplication that
registers the call to EndOfProcessCleanups with the C `atexit` and
this is guaranteed to be run only once through the use of
std::call_once. This new function can be called both from PyROOT
and the TApplication constructor, to make sure it will be
registered by either of them.

Furthermore, calling this from PyROOT means that now
EndOfProcessCleanups is always called at the very end of the
program, seemingly a more sensible choice.
@phsft-bot
Copy link
Copy Markdown

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/python3, mac1015/cxx17, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Copy Markdown

Build failed on ROOT-ubuntu16/nortcxxmod.
Running on sft-ubuntu-1604-1.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2022-06-14T13:48:53.450Z] CMake Error at /mnt/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1083 (message):

@vepadulano
Copy link
Copy Markdown
Member Author

With the following diff

diff --git a/core/base/src/TROOT.cxx b/core/base/src/TROOT.cxx
index 3f385422ce..58a6fe052c 100644
--- a/core/base/src/TROOT.cxx
+++ b/core/base/src/TROOT.cxx
@@ -1201,6 +1201,7 @@ void TROOT::CloseFiles()
 
 void TROOT::EndOfProcessCleanups()
 {
+   std::cout << "Calling TROOT::EndOfProcessCleanups\n";
    // This will not delete the objects 'held' by the TFiles so that
    // they can still be 'reacheable' when ResetGlobals is run.
    CloseFiles();

And a build of current master, I get

$: python -c "import ROOT; ROOT.TH1F"
Calling TROOT::EndOfProcessCleanups
Calling TROOT::EndOfProcessCleanups

Whereas with a build of this PR I get

$: python -c "import ROOT; ROOT.TH1F"
Calling TROOT::EndOfProcessCleanups

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Jun 14, 2022

This pull request introduces 1 alert when merging 697a0b0 into 867fe54 - view on LGTM.com

new alerts:

  • 1 for Module imports itself

@eguiraud
Copy link
Copy Markdown
Contributor

eguiraud commented Jun 14, 2022

Thanks Vincenzo! I think the most important change in this PR is that PyROOT does not register the EndOfProcessCleanups call with Python's atexit mechanism anymore, but it uses the classic handlers mechanism that TApplication already uses. And that means that now PyROOT objects will de deleted before EndOfProcessCleanups is called, which is an important change in behavior but I think it is the intended/correct behavior.

I am not sure we need the combo std::call_once + std::once_flag data member, a static counter seems simpler and it does the job (unless we expect that TApplication and PyROOT could try to register the handler concurrently, but I can't imagine how that would happen).

With this said I'm the least qualified person to decide on whether we want to go with this or not, I'll leave it to the other reviewers.

@phsft-bot
Copy link
Copy Markdown

Build failed on ROOT-performance-centos8-multicore/cxx17.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

@pcanal
Copy link
Copy Markdown
Member

pcanal commented Jun 14, 2022

It seems the order of execution is now 'wrong' in the PyROOT cases, i.e. do we still implement the behavior described in the comment:

            # Hard teardown: run part of the gROOT shutdown sequence.
            # Running it here ensures that it is done before any ROOT libraries
            # are off-loaded, with unspecified order of static object destruction.

See also 7a592f5 for some description of a challenging use case.

@vepadulano
Copy link
Copy Markdown
Member Author

See also 7a592f5 for some description of a challenging use case.

It is unclear to me what is the weight of the behaviour described in the comment and in the commit message w.r.t. the logic of what we need to do at atexit time. From my understanding, the current behaviour (before this PR) is:

  1. Run TROOT::EndOfProcessCleanups
  2. Python gc destroys variables, thus triggering any destructors of the corresponding C++ objects
  3. Run again TROOT::EndOfProcessCleanups

This logic seems faulty to me. I ran the reproducer of the original issue linked in the commit you linked and it runs smoothly after the changes of this PR. Copying here the reproducer for reference:

import ROOT

def create():
  f = ROOT.TFile("file.root", "RECREATE")
  t = ROOT.TTree("tree", "")
  return f, t
f, t = create()

Admittedly, I have only tried with Python 3.10, I will see if I can also try with other Python 3 minor versions. But irrespective of this, I don't think TROOT::EndOfProcessCleanups should be called twice. I believe that if there is indeed a behaviour that requires some Python atexit handler, that should be written in another function, separate from EndOfProcessCleanups, and called accordingly.

@pcanal
Copy link
Copy Markdown
Member

pcanal commented Jun 14, 2022

A priori a 2nd running of EndOfProcessCleanups should be 'harmless' (i.e. what ever list of things to delete would be empty on the 2nd run).

The issue here (well there #10743) does not seems to be the 2nd running of EndOfProcessCleanups but rather that there is a bad interaction between the 'early' running of the cleanups and that something in RDataFrame is not properly informed that something is delete early.

The early cleanups is necessary because of the following issue.

  • The python code (directly or indirectly) can load libraries (Ls)
  • The python code (directly or indirectly) can open TFile (Fs) and can elect to never delete them
  • The python code (directly or indirectly) can attach objects based on the Libraries on to the TFile.

Then the order of execution at tear down will be

(1) tear down python
(2) unload the loaded libraries (Ls)
(3) execute the atexit registered by TROOT,
(4) this includes EndOfProcessCleanups
(5) this includes flushing and closing the TFiles.

During (5) the object attached the TFile (usually just histogram and TTree but can also be object from the user libraries.

However if the libraries for the user objects are already unloaded, their code (including a Streamer function or even their dictionary) are gone and this result in crash at best or worse in data loss.

In order to prevent that we introduced (in 7a592f5 and friends), running the EndOfProcessCleanups of cleanups as soon as possible (i.e. as soon as we can detect the process is starting to tear down). (i.e. make it part of (1)).

More likely that not this means than rather than removing early this call to EndOfProcessCleanups, the solution is to inform RDataFrame of those early deletion (depending whether the things that are deleted that are annoying RDataFrame, they may or may not be having the trouble described above, so another solution might be to insure that the things that bother RDataFrame are not included in the list of things deleted by EndOfProcessCleanups)

@vepadulano
Copy link
Copy Markdown
Member Author

Ok I understand a bit better the original motivation for the two calls to EndOfProcessCleanups (although I still think is not very clean). Can we at least make EndOfProcessCleanups truly a no-op if it's not the first time it's been called? This should let Python call it before point (5) above, then when it's going to get called by the TApplication it won't bother RDF. Does this sound reasonable?

@pcanal
Copy link
Copy Markdown
Member

pcanal commented Jun 14, 2022

Can we at least make EndOfProcessCleanups truly a no-op if it's not the first time it's been called?

Well, maybe, maybe not. If it is currently not a true no-op then it might be either

  • (a bug) some list is not properly cleared
  • (a feature) some items are added to the to-be-cleaned list between the 2 calls.

This should let Python call it before point (5) above, then when it's going to get called by the TApplication it won't bother RDF.

It must be called (by python) during point (1) (which maybe what you meant) and RDF should not be doing anything after (5) so it should already not bother RDF.

But maybe I am missing something. I.e. is the second call to EndOfProcessCleanups a red-herring that 'just' looks suspicious and the problem is "first call affecting RDF" or is it really the 2nd call that is creating a problem.

@eguiraud
Copy link
Copy Markdown
Contributor

Just to clarify, the second call is totally fine and the first call is creating a problem for RDF because of #10742 .

Even then, it seemed weird that calling EndOfProcessCleanups twice, and once before the destructor of object at local scope is even called, was "everything working as intended". That's why I opened the issue.

If Python's or PyROOT's teardown sequence unloads libraries that might be needed by EndOfProcessCleanups then I agree with Vincenzo that maybe the best we can do is run it once before Python's teardown (maybe with a comment that explains the issue, if the scenario in the commit message of 7a592f5 does not really need the early EndOfProcessCleanups call as Vincenzo mentioned).

@pcanal
Copy link
Copy Markdown
Member

pcanal commented Jun 14, 2022

if the scenario in the commit message of 7a592f5 does not really need the early EndOfProcessCleanups

Unless something in the handling of the python handles changed since the commit, then it is needed.

Even then, it seemed weird that calling EndOfProcessCleanups twice, and once before the destructor of object at local scope is even called, was "everything working as intended".

The "once before ...." is more or so because there is no 'better' place to call it.

the first call is creating a problem for RDF because of #10742 .

Indeed (this is scenario similar to the (broken) ones I tried to describe). Getting the order of destruction right is very challenging (way more corner cases than you would expect or want) and unless there is a strong reason (and #10742 is not a good reason, that bug leads to a broken state which leads to broken behavior :) ), I would leave this code as-is.

@vepadulano
Copy link
Copy Markdown
Member Author

Unless something in the handling of the python handles changed since the commit, then it is needed.

Before and after this PR, the reproducer of 7a592f5 works, so it means that currently we have no coverage for the edge cases that it was covering.

@pcanal
Copy link
Copy Markdown
Member

pcanal commented Jun 15, 2022

the reproducer of 7a592f5 works,

Did you run it under valgrind? (it could 'just' appear to work)

@pcanal
Copy link
Copy Markdown
Member

pcanal commented Jun 15, 2022

... but even if it did come clean it would not necessarily prove that it is not needed. Order of tear down is notoriously fickled and may depends on many variable (version of python, compilation options, general order of operation). I.e. unless we unearth the actual commits and/or changes that make this step obsolete, I would not touch it.

@vepadulano
Copy link
Copy Markdown
Member Author

Closing this PR, the only thing needed is probably more test coverage to better understand the need for the current approach. I opened a bug issue for that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PyROOT calls TROOT::EndOfProcessCleanups() twice, the first time too early

4 participants