Error in TInterpreter::Calc with no output stack in seemingly random distributed rdf test execution #11515
Description
The problem
The distributed RDataFrame test of the RunGraphs functionality sometimes fails, for example in this CI run. The error visible from the user side comes from Dask, which serializes a Python RuntimeError and prints the following to stdout:
2022-10-07 17:00:25,901 - distributed.worker - WARNING - Compute Failed
Key: dask_mapper-d6baeae1-74a8-4a41-84e9-6ee2e8db064f
Function: execute_task
args: ((<function DaskBackend.ProcessAndMerge.<locals>.dask_mapper at 0x7f0cdae1a320>, (<function apply at 0x7f0cfa133910>, <class 'DistRDF.Ranges.TreeRangePerc'>, (), (<class 'dict'>, [['id', 1], ['treenames', ['myTree']], ['filenames', ['2clusters.root']], ['first_file_idx', 0], ['last_file_idx', 1], ['first_tree_start_perc', 0.5], ['last_tree_end_perc', 1], ['friendinfo', None]]))))
kwargs: {}
Exception: "RuntimeError('C++ exception thrown:\\n\\truntime_error: \\nAn error occurred during just-in-time compilation in RLoopManager::Run. The lines above might indicate the cause of the crash\\nAll RDF objects that have not run their event loop yet should be considered in an invalid state.\\n')"
The Python RuntimeError is in turn just a wrapper around the C++ std::runtime_error thrown by this function in the RDF machinery.
That error suggests a failure in TInterpreter::Calc, which should normally be accompanied by some output on stderr showing the compilation error. Unfortunately, stderr is completely empty: TInterpreter shows no error at all.
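As background on how such an error reaches the user, the propagation path can be modeled with the Python standard library alone. This is a sketch, not DistRDF code: the `mapper` function and its message string are made up, and `ProcessPoolExecutor` stands in for Dask's `distributed.Client`, which re-raises worker exceptions at the client in an analogous way.

```python
from concurrent.futures import ProcessPoolExecutor

def mapper(rng):
    # Stand-in for DistRDF's dask_mapper: the C++ std::runtime_error from
    # RLoopManager::Run surfaces inside the task as a Python RuntimeError.
    raise RuntimeError("C++ exception thrown: ... RLoopManager::Run")

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(mapper, None)
        try:
            # The worker's exception is serialized and re-raised here,
            # just as Dask re-raises the task's RuntimeError at the client.
            future.result()
        except RuntimeError as err:
            print("client sees:", err)
```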
By adding print statements at a few key points, we can see some more details of this issue (the patch is attached below).
Failure in cling::IncrementalExecutor::jitInitOrWrapper: PID=31643,funcname=_Z16__cling_Un1Qu340Pv
Failure at /home/vpadulan/programs/rootproject/rootsrc/interpreter/cling/lib/Interpreter/Interpreter.cpp:1438 with ExecutionResult=3
Failure in Interpreter::EvaluateInternal: PID=31643,evalres=1
The process was evaluating: ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f0cc988e7d0), new const char*[1]{"b1"}, 1, 1, reinterpret_cast<shared_ptr<TH1D>*>(0x7f0cd043c1f0), reinterpret_cast<std::weak_ptr<ROOT::Internal::RDF::RJittedAction>*>(0x7f0cc94b3830), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x7f0cc29a41b0));
Failure in gInterpreter->Calc: PID=31643,errorCode=1
Adding these two lines in cling::IncrementalExecutor::jitInitOrWrapper pauses execution when the offending code path is hit, so that we can attach with gdb -p PID_OF_FAILING_DISTRDF_TASK (and later resume by setting a to a non-zero value from the debugger):
volatile int a = 0;
while (a == 0) {}
It shows that there are many threads in flight (21!), even though the distributed task, from the RDF point of view, runs sequentially:
(gdb) info threads
Id Target Id Frame
1 Thread 0x7f1370756740 (LWP 34502) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f131020075c) at futex-internal.c:57
2 Thread 0x7f135fbff640 (LWP 34504) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f13622926e0 <thread_status+96>) at futex-internal.c:57
3 Thread 0x7f135f3fe640 (LWP 34505) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292760 <thread_status+224>) at futex-internal.c:57
4 Thread 0x7f135ebfd640 (LWP 34506) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f13622927e0 <thread_status+352>) at futex-internal.c:57
5 Thread 0x7f135e3fc640 (LWP 34507) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292860 <thread_status+480>) at futex-internal.c:57
6 Thread 0x7f1357bfb640 (LWP 34508) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f13622928e0 <thread_status+608>) at futex-internal.c:57
7 Thread 0x7f13573fa640 (LWP 34509) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292960 <thread_status+736>) at futex-internal.c:57
8 Thread 0x7f1356bf9640 (LWP 34510) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f13622929e0 <thread_status+864>) at futex-internal.c:57
9 Thread 0x7f13503f8640 (LWP 34511) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292a60 <thread_status+992>) at futex-internal.c:57
10 Thread 0x7f134dbf7640 (LWP 34512) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292ae0 <thread_status+1120>) at futex-internal.c:57
11 Thread 0x7f134b3f6640 (LWP 34513) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292b60 <thread_status+1248>) at futex-internal.c:57
12 Thread 0x7f134abf5640 (LWP 34514) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292be0 <thread_status+1376>) at futex-internal.c:57
13 Thread 0x7f13483f4640 (LWP 34515) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292c60 <thread_status+1504>) at futex-internal.c:57
14 Thread 0x7f1343bf3640 (LWP 34516) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292ce0 <thread_status+1632>) at futex-internal.c:57
15 Thread 0x7f13433f2640 (LWP 34517) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292d60 <thread_status+1760>) at futex-internal.c:57
16 Thread 0x7f133ebf1640 (LWP 34518) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7f1362292de0 <thread_status+1888>) at futex-internal.c:57
17 Thread 0x7f132aa21640 (LWP 34584) "python" __GI___libc_read (nbytes=4, buf=0x7f133827b950, fd=33) at ../sysdeps/unix/sysv/linux/read.c:26
18 Thread 0x7f132a120640 (LWP 34585) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f132a11f330, op=137,
expected=0, futex_word=0x7f1370d85acc <_PyRuntime+428>) at futex-internal.c:57
19 Thread 0x7f132991f640 (LWP 34586) "python" __GI___libc_read (nbytes=4, buf=0x7f13382bf140, fd=31) at ../sysdeps/unix/sysv/linux/read.c:26
20 Thread 0x7f132911e640 (LWP 34587) "python" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393,
expected=0, futex_word=0x562911e622b0) at futex-internal.c:57
* 21 Thread 0x7f131ad44640 (LWP 34739) "python" 0x00007f1332cbafdc in cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const () from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
The two most interesting ones are threads 1 and 21:
Thread 1:
#6 std::__condvar::wait (__m=..., this=<optimized out>)
at /usr/src/debug/gcc-12.2.1-2.fc36.x86_64/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/std_mutex.h:155
#7 std::condition_variable::wait (this=<optimized out>, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:41
#8 0x00007f133726f0aa in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex> > (this=0x7f1310200730, __lock=...)
at /usr/include/c++/12/condition_variable:312
#9 0x00007f133726df05 in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex>, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}>(std::unique_lock<std::mutex>&, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}) (this=0x7f1310200730, __lock=..., __p=...) at /usr/include/c++/12/condition_variable:321
#10 0x00007f1337269c3e in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x7f13102006f8)
at /home/vpadulan/programs/rootproject/rootsrc/core/thread/src/TReentrantRWLock.cxx:190
#11 0x00007f1337277d74 in ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x7f13102006f0)
at /home/vpadulan/programs/rootproject/rootsrc/core/thread/src/TRWMutexImp.cxx:42
#12 0x00007f1337276797 in ROOT::TVirtualRWMutex::Lock (this=0x7f13102006f0)
at /home/vpadulan/programs/rootproject/rootsrc/core/base/inc/TVirtualRWMutex.h:79
#13 0x00007f1332b7a3c8 in TLockGuard::TLockGuard (this=0x7fffa6ac5868, mutex=0x7f13102006f0)
at /home/vpadulan/programs/rootproject/rootsrc/core/base/inc/TVirtualMutex.h:80
#14 0x00007f1332a80aa6 in TCling::AutoParse (this=0x56290befde40, cls=0x7fffa6ac5d40 "_facade")
at /home/vpadulan/programs/rootproject/rootsrc/core/metacling/src/TCling.cxx:6452
#15 0x00007f1332a6667d in TCling__AutoParseCallback (className=0x7fffa6ac5d40 "_facade")
Thread 21:
#0 0x00007f1332cbafdc in cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const ()
from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#1 0x00007f1332c44ab1 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) ()
from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#2 0x00007f1332c451a3 in cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) ()
from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#3 0x00007f1332c45b1a in cling::Interpreter::evaluate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::Value&) () from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#4 0x00007f1332a74e9d in TCling::Calc (this=0x56290befde40,
line=0x7f12fa0f8290 "ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f130be1a880), new const char*[1]{\"b1\"}, 1, 1, "..., error=0x7f131ad40bdc)
at /home/vpadulan/programs/rootproject/rootsrc/core/metacling/src/TCling.cxx:3556
#5 0x00007f132bf6da60 in operator() (__closure=0x7f131ad40bc0,
codeSlice="ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f130be1a880), new const char*[1]{\"b1\"}, 1, 1, "...)
at /home/vpadulan/programs/rootproject/rootsrc/tree/dataframe/src/RDFUtils.cxx:340
#6 0x00007f132bf6def9 in ROOT::Internal::RDF::InterpreterCalc (
code="ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f130be1a880), new const char*[1]{\"b1\"}, 1, 1, "..., context="RLoopManager::Run")
At first glance, this hints at possible contention between the different things cling is doing in the two threads. For the moment I cannot come up with a simpler reproducer; see the next section for more details.
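To make the suspected contention concrete, here is a toy model using only Python's threading module. This is purely illustrative, not ROOT code: the single lock stands in for the interpreter's read-write lock (TReentrantRWLock), and the two functions mirror the roles of threads 21 and 1 in the gdb session above.

```python
import threading
import time

# Toy model (not ROOT code): one thread executes jitted code while holding
# the interpreter lock; a second thread, wanting to AutoParse, must wait.
interp_lock = threading.Lock()
events = []

def calc():  # plays the role of thread 21 (TCling::Calc -> executeWrapper)
    with interp_lock:
        events.append("Calc: running jitted wrapper")
        time.sleep(0.2)  # jitted execution in flight, lock held
        events.append("Calc: done")

def autoparse():  # plays the role of thread 1 (TCling::AutoParse -> WriteLock)
    time.sleep(0.05)  # arrives while Calc still holds the lock
    events.append("AutoParse: waiting for write lock")
    with interp_lock:
        events.append("AutoParse: acquired lock")

t_calc = threading.Thread(target=calc)
t_autoparse = threading.Thread(target=autoparse)
t_calc.start()
t_autoparse.start()
t_calc.join()
t_autoparse.join()
print(events)
```

In this model the waiter simply queues behind the lock. If the real code paths did not serialize correctly, the wrapper execution and the autoparse could interleave instead, which is the kind of contention the two backtraces hint at.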
How to reproduce
Here is the patch adding the print statements that show the PID (as .txt so that it can be attached to this issue):
0001-Print-statements-for-DistRDF-Cling-failure.txt
The following is a Python script with the test:
import os

import pyspark
from dask.distributed import Client, LocalCluster

import ROOT

RunGraphs = ROOT.RDF.Experimental.Distributed.RunGraphs
SparkDF = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
DaskDF = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

# verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kDebug+10)


def connection():
    daskconn = Client(LocalCluster(n_workers=2, threads_per_worker=1, processes=True, memory_limit="2GiB"))

    conf = {"spark.master": "local[2]", "spark.driver.memory": "4g", "spark.app.name": "roottest-distrdf-common"}
    sparkconf = pyspark.SparkConf().setAll(conf.items())
    sparkconn = pyspark.SparkContext(conf=sparkconf)

    return daskconn, sparkconn


def test_rungraphs_sparkanddask_3histos(daskconn, sparkconn):
    """
    Submit three different RDF graphs concurrently to Spark and Dask
    """
    # Create a test file for processing
    treename = "myTree"
    filename = "2clusters.root"
    nentries = 10000
    opts = ROOT.RDF.RSnapshotOptions()
    opts.fAutoFlush = 5000
    ROOT.RDataFrame(nentries).Define("b1", "42")\
                             .Define("b2", "42")\
                             .Define("b3", "42")\
                             .Snapshot(treename, filename, ["b1", "b2", "b3"], opts)

    histoproxies_spark = [
        SparkDF(treename, filename, sparkcontext=sparkconn, npartitions=2)
        .Histo1D((col, col, 1, 40, 45), col)
        for col in ["b1", "b2", "b3"]
    ]

    histoproxies_dask = [
        DaskDF(treename, filename, daskclient=daskconn, npartitions=2)
        .Histo1D((col, col, 1, 40, 45), col)
        for col in ["b1", "b2", "b3"]
    ]

    histoproxies = histoproxies_spark + histoproxies_dask

    # Before triggering the computation graphs, values are None
    for proxy in histoproxies:
        assert proxy.proxied_node.value is None

    RunGraphs(histoproxies)

    # After RunGraphs, all histograms are correctly assigned to the
    # node objects
    for proxy in histoproxies:
        histo = proxy.proxied_node.value
        assert isinstance(histo, ROOT.TH1D)
        assert histo.GetEntries() == nentries
        assert histo.GetMean() == 42

    os.remove(filename)


def main():
    test_rungraphs_sparkanddask_3histos(*connection())


if __name__ == "__main__":
    raise SystemExit(main())

And here is a bash script that runs the previous Python script in a loop until it fails:
counter=0
while test $? -eq 0
do
counter=$(($counter+1))
echo -e "\nRunning test $counter\n"
CLING_DEBUG=1 python test.py
done
# Remove leftover file from the test when it fails
rm 2clusters.root
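For reference, the same retry-until-failure harness as a Python sketch. The `run_until_failure` helper is hypothetical: `test.py` is the reproducer script above, and `CLING_DEBUG` is forwarded to the child process just as in the bash version.

```python
import os
import subprocess
import sys

def run_until_failure(script="test.py"):
    """Rerun the reproducer until it exits non-zero, mirroring the bash loop."""
    env = dict(os.environ, CLING_DEBUG="1")
    counter = 0
    while True:
        counter += 1
        print(f"\nRunning test {counter}\n")
        # The loop ends on the first non-zero exit status, like `while test $? -eq 0`
        if subprocess.run([sys.executable, script], env=env).returncode != 0:
            return counter

if __name__ == "__main__":
    print(f"failed on run {run_until_failure()}")
    # Remove the leftover file from the test when it fails
    if os.path.exists("2clusters.root"):
        os.remove("2clusters.root")
```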
Setup
ROOT master@b13756d
Python 3.10.7
Java 19
pyspark 3.3.0
dask/distributed 2022.7.1
ROOT built with:
cmake -GNinja -Dccache=ON -Dminimal=ON -Dtesting=ON -Droottest=ON -Dpyroot=ON -Ddataframe=ON -Dxrootd=ON -Dtest_distrdf_pyspark=ON -Dtest_distrdf_dask=ON -DCMAKE_BUILD_TYPE=Debug -Dimt=ON
Additional context
I am also tentatively assigning this to @Axel-Naumann @vgvassilev @jalopezg-r00t, who may have an easier time understanding the output from gdb and can help in creating a simpler reproducer.