Further improvements to reduce lock contention #6919
bendavid wants to merge 7 commits into root-project:master
Conversation
Can one of the admins verify this patch?

@phsft-bot build

Starting build on

Build failed on mac1015/cxx17. Errors:
static thread_local LocalCounts unique_local_counts;
return &unique_local_counts;
(We indent with 3 spaces.)
local_t GetLocal(){
   TTHREAD_TLS_DECL(LocalCounts, gLocal);
   return &gLocal;
static thread_local LocalCounts unique_local_counts;
@pcanal can we use thread_local throughout all supported platforms these days?
Yes, I think so. The main hold-over was MacOS (and Windows) for objects (i.e. this case). A priori, if the code compiles and links, it should be fine.
else {
   R__LOCKGUARD(gInterpreterMutex);
   R__READ_LOCKGUARD(ROOT::gCoreMutex);
There is no (new) risk there. By design the read and write locks work well together (a thread that holds a read lock and then requests a write lock first releases the read lock and waits until all other threads have released the read lock, either by finishing their critical section or by asking for the write lock).
I am more worried about race conditions. Currently the StreamerInfo is published via RegisterStreamerInfo, which is called before the end of the critical section. However, since the check is done through the read critical section and there is a test on IsCompiled, which is updated at the last line of Build, I am guessing it is likely okay.
io/io/src/TBufferFile.cxx
sinfo = (TStreamerInfo*)cl->GetStreamerInfos()->At(version);
if (sinfo == nullptr) {
   if (version == cl->GetClassVersion() || version == 1) {
      const_cast<TClass*>(cl)->BuildRealData(pointer);
Indent by 3 spaces, please.
{
   R__LOCKGUARD(gGlobalMutex);
   if (!ROOT::gCoreMutex) {
      // To avoid dead locks, caused by shared library opening and/or static initialization
Why is this comment no longer applicable?
I had thought this was related to some of the platform-specific handling in TTHREAD_TLS_DECL, and that it's not an issue if static thread_local can be used across the board.
See commit bd1894b.
tls_get_addr_tail is not a ROOT lock/routine but a system one (and, not 100% sure, but it might be involved in the compiler's implementation of static initialization).
Do you know which definition of TTHREAD_TLS was being used for that stack trace? (If it was already using thread_local, then I agree there's no reason the situation should be different now.)
I am pretty sure it was the default TTHREAD_TLS (i.e. ROOT's workaround).
2af6594 to 6d8ae1e
Given the discussion about deadlocks using `static thread_local`, I pushed an alternate solution that avoids exclusive locks in ReadLock(), based on shared mutexes (only used for C++14 or higher in this case). The indentation should also be fixed now.
size_t &GetLocalReadersCount(local_t &local) { return fReadersCount[local]; }
};

struct RecurseCountsShared {
When adding a new counting option, we should extend the test core/thread/test/testRWLock.cxx, starting with adding a new global variable to the section:
auto gMutex = new TMutex(kTRUE);
auto gRWMutex = new TRWMutexImp<TMutex>();
auto gRWMutexSpin = new TRWMutexImp<ROOT::TSpinMutex>();
auto gRWMutexStd = new TRWMutexImp<std::mutex>();
auto gReentrantRWMutex = new ROOT::TReentrantRWLock<TMutex>();
auto gReentrantRWMutexSM = new ROOT::TReentrantRWLock<ROOT::TSpinMutex>();
auto gReentrantRWMutexStd = new ROOT::TReentrantRWLock<std::mutex>();
auto gSpinMutex = new ROOT::TSpinMutex();
// Intentionally ignore the Fatal error due to the shared thread-local storage.
// In this test we need to be 'careful' to not use all those mutex at the same time.
int trigger1 = gErrorIgnoreLevel = kFatal + 1;
auto gReentrantRWMutexTL = new ROOT::TReentrantRWLock<TMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gReentrantRWMutexSMTL = new ROOT::TReentrantRWLock<ROOT::TSpinMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gRWMutexTL = new TRWMutexImp<TMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gRWMutexTLSpin = new TRWMutexImp<ROOT::TSpinMutex, ROOT::Internal::UniqueLockRecurseCount>();
int trigger2 = gErrorIgnoreLevel = 0;
(in the right section, like the one between trigger1 and trigger2 in this case) and then duplicating (copy/paste) the usage, so that the new global is used in the same set of tests as the others.
Besides testing the functionality, this also gives a first-level benchmark of the raw speed of the options.
Here's an updated stack trace with full debug info in case it's useful in the meantime, taken again during one of the remaining periods of high lock contention, when many files are being closed and opened.
} // End of namespace ROOT
#if __cplusplus >= 201402L
template class TRWMutexImp<std::shared_timed_mutex, ROOT::Internal::RecurseCountsShared>;
What's the advantage of using the shared_timed_mutex in this context?
The regular shared_mutex is only available from C++17, so shared_timed_mutex serves as a fallback for the C++14 case. (I'm not sure whether the additional timed functionality actually makes it slower when not used.)
Can you publish the numerical results of the test benchmark (for C++14 and C++17 builds)?
There were actually some deadlocks when running the tests, due to an uninitialized variable (fixed now). The tests now look as below. The new implementations are somewhat slower for write locks, faster for read locks, and significantly faster for the VeryLargeconcurrentReadAndWrite tests which I added. The slower write locking could probably be mitigated by using a dedicated
It looks like (surprisingly :() the currently used implementation (RecurseCounts) is not yet in the test. Could you add it?
RecurseCounts is the default second template argument for TReentrantRWLock and TRWMutexImp, so it's definitely already there. The most relevant comparisons here are Std vs Std14 vs Std17. (Even though this was all run in a C++17 build, I set it up to just test all the variants which are available.)
Thanks. I had forgotten about that :)
I don't seem to be seeing that: what am I missing? Also, do we understand why the Std17 results seem to be higher/slower than Std14?
Ah, sorry, I was looking at the numbers for ReadUnLock. The differences between Std14 and Std17 are at least partly noise, I think. I also have an implementation working now based on
Now in principle what tbb is doing underneath should not be so different from
If you would be willing to have tbb as a dependency for core/thread, then I can add this (and maybe even remove the shared_mutex versions).
I agree that it is a possible solution. @Axel-Naumann, opinion?
I might be confused (i.e. I am probably looking at the wrong numbers). This last set of numbers says that the total improvement in the 256-thread use case is around 7% of the 3-minute run-time. Is that correct?
Dependency: I don't have a strong opinion or technical argument. I'm a bit afraid of circular dependencies between core and imt.
@pcanal, yes, the improvement from using TBB with TReentrantRWLock for gCoreMutex is only 7% in this case, but actually the use of RW locks instead of always write locks in
For a smaller test with only 9.7M events in 19 input files (instead of 513M events in 1000 input files), using
So the use of RW locks in
Though this test is probably too short to scale effectively to higher thread counts, I add the scaling comparison here, which is interesting anyway (obviously the baseline is much better behaved at smaller thread counts).
For comparison, running on the full 513M events and 1000 files with the final configuration, i.e. baseline + read-write locks in TBufferFile::ReadClassBuffer + TBBUnique for gCoreMutex + the additional changes to the task splitting to target 24 tasks per thread independent of the number of files, the thread scaling is much more reasonable. (Keep in mind also that even if the thread scaling were perfect from the application standpoint, there are significant deviations from perfect scaling expected beyond 128 threads due to hyperthreading, plus additional effects from per-core dynamic boosting, etc.)
Those are indeed fantastic numbers. All that is left to do is to clean up the history a tiny bit (squash the 3 "fix and add" commits into the commits they fix or add to) and, more importantly, to copy/paste the numbers above into the commit log(s). Great work!! Thanks!!!
@bendavid Is the TBBUnique implementation uploaded to this PR? (i.e. I am not seeing it, but I could be confused)
Not yet; I'm in the process of shuffling around the dependencies so that ROOT can still build without TBB.
Thanks.
Since I haven't had a chance to cleanly address the TBB dependency issue, and since the other change is much more important performance-wise, I've split that off into a separate PR, which should be possible to merge quickly: #7105
@phsft-bot build!

Starting build on

Build failed on ROOT-fedora31/noimt.

Build failed on mac11.0/cxx17. Failing tests:
@pcanal I finally had a chance to look at this in more detail and did not find an elegant solution for making TBB an optional dependency for this purpose. (Putting the relevant template specialization in core/imt, as I had in mind, would create a circular dependency with core/thread.) Open to suggestions on how to deal with this (or on whether a hard TBB dependency in core/thread would be acceptable, and/or an optional one straightforward to implement).
I think we can get away with something simpler, where when IMT is on (or, more precisely, when TBB has been detected by cmake), we turn on a flag (
Shouldn't we merge this (in addition to #7260), for example for use on Windows and/or when TBB is not available? Related: we should put (a summary of) the performance numbers somewhere in the git repository, maybe either in the release notes or in one of the commits. Thanks!
@bendavid I am confused: in none of the PRs listed do I see
So the question really is back to: for the case where we can't use TBB, is there a point in introducing RecurseCountsShared?
Ah, well, I was kind of assuming that this is mostly relevant/important together with implicit multithreading, in which case TBB needs to be available anyway. Presumably the TBB issue on Windows is eventually solvable (and besides, does anyone really use ROOT on Windows with a large number of cores?). If you still think it's useful, I can try to make the further optimisation described above and make a separate PR for it.
I am thinking more of the potential future where we have to replace TBB (for whatever reason, e.g. not being available on a new platform, or no longer being developed). In addition, I would be interested to see the optimized performance numbers vs TBB :) So please, yes :)
OK, looking in more detail, I THINK that the original deadlock issue with UniqueLockRecurseCount can actually be avoided by declaring the LocalCounts as a thread_local data member of UniqueLockRecurseCount rather than as a local variable inside GetLocal(), so this might be an even better solution.
What do you mean (syntax-wise) by a thread-local data member?
e.g.

struct UniqueLockRecurseCount {
   struct LocalCounts {
      size_t fReadersCount = 0;
      bool fIsWriter = false;
   };
   static thread_local LocalCounts fCounts;
};

thread_local UniqueLockRecurseCount::LocalCounts UniqueLockRecurseCount::fCounts = UniqueLockRecurseCount::LocalCounts();
Yes, it might work as long as we guarantee that the
Further reduction in lock contention when using RDataFrame with a large number of threads and/or files, achieved by improving the global read-write lock and migrating one hot spot in TBufferFile to use it.