
Further improvements to reduce lock contention#6919

Closed
bendavid wants to merge 7 commits into root-project:master from bendavid:lockimprovements

Conversation

@bendavid
Contributor

@bendavid bendavid commented Dec 6, 2020

Further reduction in lock contention when using RDataFrame with a large number of threads and/or files, by improving the global read write lock and migrating one hot spot in TBufferFile to use it.

@bendavid bendavid requested a review from pcanal as a code owner December 6, 2020 20:28
@phsft-bot

Can one of the admins verify this patch?

@eguiraud
Contributor

eguiraud commented Dec 7, 2020

@phsft-bot build

@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-fedora30/cxx14, ROOT-fedora31/noimt, ROOT-ubuntu16/nortcxxmod, mac1015/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on mac1015/cxx17.
Running on macitois19.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2020-12-07T16:11:33.483Z] CMake Error at cmake/modules/SearchInstalledSoftware.cmake:1616 (message):
  • [2020-12-07T16:11:33.483Z] CMake Error at /Volumes/HDD2/build/workspace/root-pullrequests-build/rootspi/jenkins/root-build.cmake:1077 (message):

Comment on lines +41 to +42
static thread_local LocalCounts unique_local_counts;
return &unique_local_counts;
Member

Suggested change
static thread_local LocalCounts unique_local_counts;
return &unique_local_counts;
   static thread_local LocalCounts unique_local_counts;
   return &unique_local_counts;

(We indent with 3 spaces.)

local_t GetLocal(){
TTHREAD_TLS_DECL(LocalCounts, gLocal);
return &gLocal;
static thread_local LocalCounts unique_local_counts;
Member

@pcanal can we use thread_local throughout all supported platforms these days?

Member

Yes, I think so. The main hold-over was macOS (and Windows) for objects (i.e. this case). A priori, if the code compiles and links, it should be fine.


else {
R__LOCKGUARD(gInterpreterMutex);
R__READ_LOCKGUARD(ROOT::gCoreMutex);
Member

How likely are we to deadlock here, @pcanal?

Member

There is no (new) risk there. By design, the read and write locks work well together: a thread that holds a read lock and then requests a write lock first releases the read lock and waits until all other threads have released their read locks, either by finishing their critical sections or by asking for the write lock themselves.

I am more worried about race conditions. Currently the StreamerInfo is published via RegisterStreamerInfo, which is called before the end of the critical section. However, since the check is done inside the read critical section and there is a test on IsCompiled, which is updated in the last line of Build, I am guessing it is likely okay.

sinfo = (TStreamerInfo*)cl->GetStreamerInfos()->At(version);
if (sinfo == nullptr) {
if ( version == cl->GetClassVersion() || version == 1 ) {
const_cast<TClass*>(cl)->BuildRealData(pointer);
Member

Indent by 3 spaces, please.

{
R__LOCKGUARD(gGlobalMutex);
if (!ROOT::gCoreMutex) {
// To avoid dead locks, caused by shared library opening and/or static initialization
Member

Why is this comment no longer applicable?

Contributor Author

I had thought this was related to some of the platform-specific stuff in TTHREAD_TLS_DECL and that it's not an issue if static thread_local can be used across the board.

Member

See commit bd1894b.
tls_get_addr_tail is not a ROOT lock/routine but a system one (and, not 100% sure, but it might be involved in the compiler's implementation of static initialization).

Contributor Author

Do you know which definition of TTHREAD_TLS was being used for that stack trace? (If it was already using thread_local, then I agree there's no reason the situation should be different now.)

Member

I am pretty sure it was the default TTHREAD_TLS (i.e. ROOT's workaround).

@bendavid
Contributor Author

bendavid commented Dec 8, 2020

Given the discussion about deadlocks using `static thread_local`, I pushed an alternate solution to avoid exclusive locks in ReadLock() based on shared mutexes, which in this case is only used for C++14 or higher.

The indentation should also be fixed now.

size_t &GetLocalReadersCount(local_t &local) { return fReadersCount[local]; }
};

struct RecurseCountsShared {
Member

@pcanal pcanal Dec 8, 2020

When adding a new count option, we should extend the test core/thread/test/testRWLock.cxx, starting with adding a new global variable to the section:

auto gMutex = new TMutex(kTRUE);
auto gRWMutex = new TRWMutexImp<TMutex>();
auto gRWMutexSpin = new TRWMutexImp<ROOT::TSpinMutex>();
auto gRWMutexStd = new TRWMutexImp<std::mutex>();
auto gReentrantRWMutex = new ROOT::TReentrantRWLock<TMutex>();
auto gReentrantRWMutexSM = new ROOT::TReentrantRWLock<ROOT::TSpinMutex>();
auto gReentrantRWMutexStd = new ROOT::TReentrantRWLock<std::mutex>();
auto gSpinMutex = new ROOT::TSpinMutex();

// Intentionally ignore the Fatal error due to the shared thread-local storage.
// In this test we need to be 'careful' to not use all those mutexes at the same time.
int trigger1 = gErrorIgnoreLevel = kFatal + 1;
auto gReentrantRWMutexTL = new ROOT::TReentrantRWLock<TMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gReentrantRWMutexSMTL = new ROOT::TReentrantRWLock<ROOT::TSpinMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gRWMutexTL = new TRWMutexImp<TMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gRWMutexTLSpin = new TRWMutexImp<ROOT::TSpinMutex, ROOT::Internal::UniqueLockRecurseCount>();
int trigger2 = gErrorIgnoreLevel = 0;

(in the right section, like the one between trigger1 and trigger2 in this case) and then duplicating (copy/paste) the usage so that the new global is used in the same set of tests as the others.

Besides testing the functionality, it also gives a first-level benchmark of the raw speed of the options.

Contributor Author

Added

@bendavid
Contributor Author

bendavid commented Dec 8, 2020

Here's an updated stack trace with full debug info in case it's useful in the meantime, taken again during one of the remaining periods of high lock contention when many files are being closed and opened.
pstackdebug.txt


} // End of namespace ROOT
#if __cplusplus >= 201402L
template class TRWMutexImp<std::shared_timed_mutex, ROOT::Internal::RecurseCountsShared>;
Member

What's the advantage of using the shared_timed_mutex in this context?

Contributor Author

The regular shared_mutex is only available from C++17, so the shared_timed_mutex is a fallback for the C++14 case. (I'm not sure whether the additional timed functionality actually makes it slower if not used.)

Member

Can you publish the numerical results of the test benchmark? (For C++14 and C++17 builds.)

@bendavid
Contributor Author

bendavid commented Dec 9, 2020

There were actually some deadlocks when running the tests, due to an uninitialized variable (fixed now). The tests now look as below. The new implementations are somewhat slower for write locks, faster for read locks, and significantly faster for the VeryLargeconcurrentReadsAndWrites tests which I added. The slower write locking could probably be mitigated by using a dedicated shared_mutex inside RecurseCountsShared only to protect the map of read counts, but maybe this is sufficient for the moment.

Singularity> ctest -R gtest-core-thread-test-corethreadtestUnit -V                                           
UpdateCTestConfiguration  from :/scratchnvme/bendavid/rootdev2/build/DartConfiguration.tcl
Parse Config file:/scratchnvme/bendavid/rootdev2/build/DartConfiguration.tcl
 Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/usr/bin/cmake
UpdateCTestConfiguration  from :/scratchnvme/bendavid/rootdev2/build/DartConfiguration.tcl
Parse Config file:/scratchnvme/bendavid/rootdev2/build/DartConfiguration.tcl
Test project /scratchnvme/bendavid/rootdev2/build
Constructing a list of tests
Ignore test: roottest-cling-parsing-semicolon
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 76
    Start 76: gtest-core-thread-test-corethreadtestUnit

76: Test command: /usr/bin/cmake "-DCMD=/scratchnvme/bendavid/rootdev2/build/core/thread/test/corethreadtestUnit" "-DCWD=/scratchnvme/bendavid/rootdev2/build/core/thread/test" "-DSYS=/scratchnvme/bendavid/rootdev2/build" "-P" "/scratchnvme/bendavid/rootdev2/root/cmake/modules/RootTestDriver.cmake"
76: Environment variables: 
76:  ROOT_HIST=0
76: Test timeout computed to be: 1500
76: Running main() from /scratchnvme/bendavid/rootdev2/build/googletest-prefix/src/googletest/googletest/src/gtest_main.cc
76: [==========] Running 104 tests from 3 test suites.
76: [----------] Global test environment set-up.
76: [----------] 6 tests from InterpreterLock
76: [ RUN      ] InterpreterLock.ConcurrentCalc
76: [       OK ] InterpreterLock.ConcurrentCalc (112 ms)
76: [ RUN      ] InterpreterLock.ReadLocks
76: [       OK ] InterpreterLock.ReadLocks (95 ms)
76: [ RUN      ] InterpreterLock.BalancedUserReadLock
76: [       OK ] InterpreterLock.BalancedUserReadLock (210 ms)
76: [ RUN      ] InterpreterLock.BalancedUserWriteLock
76: [       OK ] InterpreterLock.BalancedUserWriteLock (198 ms)
76: [ RUN      ] InterpreterLock.UnBalancedUserReadLock
76: [       OK ] InterpreterLock.UnBalancedUserReadLock (146 ms)
76: [ RUN      ] InterpreterLock.UnBalancedUserWriteLock
76: [       OK ] InterpreterLock.UnBalancedUserWriteLock (96 ms)
76: [----------] 6 tests from InterpreterLock (857 ms total)
76: 
76: [----------] 88 tests from RWLock
76: [ RUN      ] RWLock.MutexLockVirtual
76: [       OK ] RWLock.MutexLockVirtual (59 ms)
76: [ RUN      ] RWLock.MutexUnLockVirtual
76: [       OK ] RWLock.MutexUnLockVirtual (62 ms)
76: [ RUN      ] RWLock.WriteLockVirtual
76: [       OK ] RWLock.WriteLockVirtual (3029 ms)
76: [ RUN      ] RWLock.WriteUnLockVirtual
76: [       OK ] RWLock.WriteUnLockVirtual (577 ms)
76: [ RUN      ] RWLock.WriteSpinLockVirtual
76: [       OK ] RWLock.WriteSpinLockVirtual (2758 ms)
76: [ RUN      ] RWLock.WriteSpinUnLockVirtual
76: [       OK ] RWLock.WriteSpinUnLockVirtual (275 ms)
76: [ RUN      ] RWLock.WriteLock
76: [       OK ] RWLock.WriteLock (2993 ms)
76: [ RUN      ] RWLock.WriteUnLock
76: [       OK ] RWLock.WriteUnLock (553 ms)
76: [ RUN      ] RWLock.WriteSpinLock
76: [       OK ] RWLock.WriteSpinLock (2826 ms)
76: [ RUN      ] RWLock.WriteSpinUnLock
76: [       OK ] RWLock.WriteSpinUnLock (276 ms)
76: [ RUN      ] RWLock.WriteStdDirectLock
76: [       OK ] RWLock.WriteStdDirectLock (2953 ms)
76: [ RUN      ] RWLock.WriteStdDirectUnLock
76: [       OK ] RWLock.WriteStdDirectUnLock (399 ms)
76: [ RUN      ] RWLock.WriteStdD14irectLock
76: [       OK ] RWLock.WriteStdD14irectLock (3082 ms)
76: [ RUN      ] RWLock.WriteStd14DirectUnLock
76: [       OK ] RWLock.WriteStd14DirectUnLock (523 ms)
76: [ RUN      ] RWLock.WriteStd17DirectLock
76: [       OK ] RWLock.WriteStd17DirectLock (2974 ms)
76: [ RUN      ] RWLock.WriteStd17DirectUnLock
76: [       OK ] RWLock.WriteStd17DirectUnLock (551 ms)
76: [ RUN      ] RWLock.WriteSpinDirectLock
76: [       OK ] RWLock.WriteSpinDirectLock (2727 ms)
76: [ RUN      ] RWLock.WriteSpinDirectUnLock
76: [       OK ] RWLock.WriteSpinDirectUnLock (222 ms)
76: [ RUN      ] RWLock.WriteDirectLock
76: [       OK ] RWLock.WriteDirectLock (3017 ms)
76: [ RUN      ] RWLock.WriteDirectUnLock
76: [       OK ] RWLock.WriteDirectUnLock (435 ms)
76: [ RUN      ] RWLock.ReadLockStdDirect
76: [       OK ] RWLock.ReadLockStdDirect (2225 ms)
76: [ RUN      ] RWLock.ReadUnLockStdDirect
76: [       OK ] RWLock.ReadUnLockStdDirect (123 ms)
76: [ RUN      ] RWLock.ReadLockStd14Direct
76: [       OK ] RWLock.ReadLockStd14Direct (2645 ms)
76: [ RUN      ] RWLock.ReadUnLockStd14Direct
76: [       OK ] RWLock.ReadUnLockStd14Direct (100 ms)
76: [ RUN      ] RWLock.ReadLockStd17Direct
76: [       OK ] RWLock.ReadLockStd17Direct (2762 ms)
76: [ RUN      ] RWLock.ReadUnLockStd17Direct
76: [       OK ] RWLock.ReadUnLockStd17Direct (92 ms)
76: [ RUN      ] RWLock.ReadLockSpinDirect
76: [       OK ] RWLock.ReadLockSpinDirect (1926 ms)
76: [ RUN      ] RWLock.ReadUnLockSpinDirect
76: [       OK ] RWLock.ReadUnLockSpinDirect (97 ms)
76: [ RUN      ] RWLock.ReadLockDirect
76: [       OK ] RWLock.ReadLockDirect (2205 ms)
76: [ RUN      ] RWLock.ReadUnLockDirect
76: [       OK ] RWLock.ReadUnLockDirect (94 ms)
76: [ RUN      ] RWLock.WriteSpinTLDirectLock
76: [       OK ] RWLock.WriteSpinTLDirectLock (957 ms)
76: [ RUN      ] RWLock.WriteSpinTLsDirectUnLock
76: [       OK ] RWLock.WriteSpinTLsDirectUnLock (240 ms)
76: [ RUN      ] RWLock.WriteTLDirectLock
76: [       OK ] RWLock.WriteTLDirectLock (985 ms)
76: [ RUN      ] RWLock.WriteTLDirectUnLock
76: [       OK ] RWLock.WriteTLDirectUnLock (467 ms)
76: [ RUN      ] RWLock.ReadLockSpinTLDirect
76: [       OK ] RWLock.ReadLockSpinTLDirect (286 ms)
76: [ RUN      ] RWLock.ReadUnLockSpinTLDirect
76: [       OK ] RWLock.ReadUnLockSpinTLDirect (95 ms)
76: [ RUN      ] RWLock.ReadLockTLDirect
76: [       OK ] RWLock.ReadLockTLDirect (324 ms)
76: [ RUN      ] RWLock.ReadUnLockTLDirect
76: [       OK ] RWLock.ReadUnLockTLDirect (91 ms)
76: [ RUN      ] RWLock.SpinMutexLockUnlock
76: [       OK ] RWLock.SpinMutexLockUnlock (148 ms)
76: [ RUN      ] RWLock.MutexGuard
76: [       OK ] RWLock.MutexGuard (177 ms)
76: [ RUN      ] RWLock.WriteGuard
76: [       OK ] RWLock.WriteGuard (4820 ms)
76: [ RUN      ] RWLock.WriteSpinGuard
76: [       OK ] RWLock.WriteSpinGuard (4297 ms)
76: [ RUN      ] RWLock.ReentrantStd
76: [       OK ] RWLock.ReentrantStd (0 ms)
76: [ RUN      ] RWLock.ReentrantStd14
76: [       OK ] RWLock.ReentrantStd14 (0 ms)
76: [ RUN      ] RWLock.ReentrantStd17
76: [       OK ] RWLock.ReentrantStd17 (0 ms)
76: [ RUN      ] RWLock.ReentrantSpin
76: [       OK ] RWLock.ReentrantSpin (0 ms)
76: [ RUN      ] RWLock.Reentrant
76: [       OK ] RWLock.Reentrant (0 ms)
76: [ RUN      ] RWLock.ReentrantTLSpin
76: [       OK ] RWLock.ReentrantTLSpin (0 ms)
76: [ RUN      ] RWLock.ReentrantTL
76: [       OK ] RWLock.ReentrantTL (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStd
76: [       OK ] RWLock.ResetRestoreStd (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStd14
76: [       OK ] RWLock.ResetRestoreStd14 (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStd17
76: [       OK ] RWLock.ResetRestoreStd17 (0 ms)
76: [ RUN      ] RWLock.ResetRestoreSpin
76: [       OK ] RWLock.ResetRestoreSpin (0 ms)
76: [ RUN      ] RWLock.ResetRestore
76: [       OK ] RWLock.ResetRestore (0 ms)
76: [ RUN      ] RWLock.ResetRestoreTLSpin
76: [       OK ] RWLock.ResetRestoreTLSpin (0 ms)
76: [ RUN      ] RWLock.ResetRestoreTL
76: [       OK ] RWLock.ResetRestoreTL (0 ms)
76: [ RUN      ] RWLock.concurrentResetRestore
76: [       OK ] RWLock.concurrentResetRestore (88 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreSpin
76: [       OK ] RWLock.concurrentResetRestoreSpin (65 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStd
76: [       OK ] RWLock.concurrentResetRestoreStd (79 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStd14
76: [       OK ] RWLock.concurrentResetRestoreStd14 (67 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStd17
76: [       OK ] RWLock.concurrentResetRestoreStd17 (64 ms)
76: [ RUN      ] RWLock.LargeconcurrentResetRestore
76: [       OK ] RWLock.LargeconcurrentResetRestore (5264 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreTL
76: [       OK ] RWLock.concurrentResetRestoreTL (61 ms)
76: [ RUN      ] RWLock.LargeconcurrentResetRestoreTL
76: [       OK ] RWLock.LargeconcurrentResetRestoreTL (4643 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWrites
76: [       OK ] RWLock.concurrentReadsAndWrites (3095 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesSpin
76: [       OK ] RWLock.concurrentReadsAndWritesSpin (3098 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStd
76: [       OK ] RWLock.concurrentReadsAndWritesStd (3103 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStd14
76: [       OK ] RWLock.concurrentReadsAndWritesStd14 (3105 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStd17
76: [       OK ] RWLock.concurrentReadsAndWritesStd17 (3098 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWrites
76: [       OK ] RWLock.LargeconcurrentReadsAndWrites (3104 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStd
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStd (3101 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStd14
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStd14 (3106 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStd17
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStd17 (3107 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesSpin
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesSpin (310 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWrites
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWrites (16427 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStd
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStd (17458 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStd14
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStd14 (9737 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStd17
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStd17 (9956 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesSpin
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesSpin (4470 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReads
76: [       OK ] RWLock.VeryLargeconcurrentReads (1074 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStd
76: [       OK ] RWLock.VeryLargeconcurrentReadsStd (1076 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStd14
76: [       OK ] RWLock.VeryLargeconcurrentReadsStd14 (1068 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStd17
76: [       OK ] RWLock.VeryLargeconcurrentReadsStd17 (1073 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsSpin
76: [       OK ] RWLock.VeryLargeconcurrentReadsSpin (407 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesTL
76: [       OK ] RWLock.concurrentReadsAndWritesTL (3087 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesTL
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesTL (3096 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesTL
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesTL (5982 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsTL
76: [       OK ] RWLock.VeryLargeconcurrentReadsTL (1068 ms)
76: [----------] 88 tests from RWLock (169985 ms total)
76: 
76: [----------] 10 tests from TThreadedObject
76: [ RUN      ] TThreadedObject.CreateAndDestroy
76: [       OK ] TThreadedObject.CreateAndDestroy (1 ms)
76: [ RUN      ] TThreadedObject.Get
76: [       OK ] TThreadedObject.Get (8 ms)
76: [ RUN      ] TThreadedObject.GetAtSlot
76: [       OK ] TThreadedObject.GetAtSlot (0 ms)
76: [ RUN      ] TThreadedObject.GetAtSlotUnchecked
76: [       OK ] TThreadedObject.GetAtSlotUnchecked (0 ms)
76: [ RUN      ] TThreadedObject.GetAtSlotRaw
76: [       OK ] TThreadedObject.GetAtSlotRaw (0 ms)
76: [ RUN      ] TThreadedObject.SetAtSlot
76: [       OK ] TThreadedObject.SetAtSlot (0 ms)
76: [ RUN      ] TThreadedObject.Merge
76: [       OK ] TThreadedObject.Merge (128 ms)
76: [ RUN      ] TThreadedObject.SnapshotMerge
76: [       OK ] TThreadedObject.SnapshotMerge (0 ms)
76: [ RUN      ] TThreadedObject.GrowSlots
76: [       OK ] TThreadedObject.GrowSlots (1 ms)
76: [ RUN      ] TThreadedObject.GetNSlots
76: [       OK ] TThreadedObject.GetNSlots (1 ms)
76: [----------] 10 tests from TThreadedObject (139 ms total)
76: 
76: [----------] Global test environment tear-down
76: [==========] 104 tests from 3 test suites ran. (170981 ms total)
76: [  PASSED  ] 104 tests.
1/1 Test #76: gtest-core-thread-test-corethreadtestUnit ...   Passed  171.21 sec

The following tests passed:
        gtest-core-thread-test-corethreadtestUnit

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 172.05 sec

@pcanal
Member

pcanal commented Dec 9, 2020

It looks like (surprisingly :() the currently used implementation (RecurseCounts) is not yet in the test. Could you add it?

@bendavid
Contributor Author

bendavid commented Dec 9, 2020

RecurseCounts is the default second template argument for TReentrantRWLock and TRWMutexImp, so it's definitely already there. The most relevant comparisons here are Std vs Std14 vs Std17. (Even though this was all run in a C++17 build, I set it up to just test all the variants which are available.)

@pcanal
Member

pcanal commented Dec 9, 2020

RecurseCounts is the default second template argument for TReentrantRWLock

Thanks. I had forgotten about that :)

@pcanal
Member

pcanal commented Dec 9, 2020

faster for read locks,

I don't seem to be seeing that:

76: [       OK ] RWLock.ReadLockStdDirect (2225 ms)
76: [       OK ] RWLock.ReadLockStd14Direct (2645 ms)
76: [       OK ] RWLock.ReadLockStd17Direct (2762 ms)

What am I missing?

Also, do we understand why the Std17 results seem to be higher/slower than Std14?

@bendavid
Contributor Author

bendavid commented Dec 9, 2020

Ah, sorry, I was looking at the numbers for ReadUnLock. The differences between Std14 and Std17 are at least partly noise, I think.

I also have an implementation working now based on tbb::enumerable_thread_specific (https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-reference/thread-local-storage/enumerablethreadspecific-template-class.html)

Now, in principle what tbb is doing underneath should not be so different from TTHREAD_TLS_DECL, at least for the explicit pthread-based version, so we can think about whether this is really safe from deadlocks, but the numbers look pretty good. In the test output below, "TBB" is using tbb::enumerable_thread_specific<LocalCounts> and "TBBUnique" is using tbb::enumerable_thread_specific<LocalCounts, tbb::cache_aligned_allocator<LocalCounts>, tbb::ets_key_per_instance>, which is limited to a small number of instances per process (but not strictly to 1).

If you would be willing to have tbb as a dependency for core/thread, then I can add this (and maybe even remove the shared_mutex versions).

76: Test command: /usr/bin/cmake "-DCMD=/scratchnvme/bendavid/rootdev2/build/core/thread/test/corethreadtestUnit" "-DCWD=/scratchnvme/bendavid/rootdev2/build/core/thread/test" "-DSYS=/scratchnvme/bendavid/rootdev2/build" "-P" "/scratchnvme/bendavid/rootdev2/root/cmake/modules/RootTestDriver.cmake"
76: Environment variables: 
76:  ROOT_HIST=0
76: Test timeout computed to be: 1500
76: Running main() from /scratchnvme/bendavid/rootdev2/build/googletest-prefix/src/googletest/googletest/src/gtest_main.cc
76: [==========] Running 126 tests from 3 test suites.
76: [----------] Global test environment set-up.
76: [----------] 6 tests from InterpreterLock
76: [ RUN      ] InterpreterLock.ConcurrentCalc
76: [       OK ] InterpreterLock.ConcurrentCalc (87 ms)
76: [ RUN      ] InterpreterLock.ReadLocks
76: [       OK ] InterpreterLock.ReadLocks (78 ms)
76: [ RUN      ] InterpreterLock.BalancedUserReadLock
76: [       OK ] InterpreterLock.BalancedUserReadLock (169 ms)
76: [ RUN      ] InterpreterLock.BalancedUserWriteLock
76: [       OK ] InterpreterLock.BalancedUserWriteLock (178 ms)
76: [ RUN      ] InterpreterLock.UnBalancedUserReadLock
76: [       OK ] InterpreterLock.UnBalancedUserReadLock (124 ms)
76: [ RUN      ] InterpreterLock.UnBalancedUserWriteLock
76: [       OK ] InterpreterLock.UnBalancedUserWriteLock (89 ms)
76: [----------] 6 tests from InterpreterLock (725 ms total)
76: 
76: [----------] 110 tests from RWLock
76: [ RUN      ] RWLock.MutexLockVirtual
76: [       OK ] RWLock.MutexLockVirtual (62 ms)
76: [ RUN      ] RWLock.MutexUnLockVirtual
76: [       OK ] RWLock.MutexUnLockVirtual (65 ms)
76: [ RUN      ] RWLock.WriteLockVirtual
76: [       OK ] RWLock.WriteLockVirtual (3110 ms)
76: [ RUN      ] RWLock.WriteUnLockVirtual
76: [       OK ] RWLock.WriteUnLockVirtual (470 ms)
76: [ RUN      ] RWLock.WriteSpinLockVirtual
76: [       OK ] RWLock.WriteSpinLockVirtual (2736 ms)
76: [ RUN      ] RWLock.WriteSpinUnLockVirtual
76: [       OK ] RWLock.WriteSpinUnLockVirtual (269 ms)
76: [ RUN      ] RWLock.WriteLock
76: [       OK ] RWLock.WriteLock (3116 ms)
76: [ RUN      ] RWLock.WriteUnLock
76: [       OK ] RWLock.WriteUnLock (476 ms)
76: [ RUN      ] RWLock.WriteSpinLock
76: [       OK ] RWLock.WriteSpinLock (2780 ms)
76: [ RUN      ] RWLock.WriteSpinUnLock
76: [       OK ] RWLock.WriteSpinUnLock (268 ms)
76: [ RUN      ] RWLock.WriteStdDirectLock
76: [       OK ] RWLock.WriteStdDirectLock (3164 ms)
76: [ RUN      ] RWLock.WriteStdDirectUnLock
76: [       OK ] RWLock.WriteStdDirectUnLock (447 ms)
76: [ RUN      ] RWLock.WriteStdD14irectLock
76: [       OK ] RWLock.WriteStdD14irectLock (3150 ms)
76: [ RUN      ] RWLock.WriteStd14DirectUnLock
76: [       OK ] RWLock.WriteStd14DirectUnLock (469 ms)
76: [ RUN      ] RWLock.WriteStd17DirectLock
76: [       OK ] RWLock.WriteStd17DirectLock (3122 ms)
76: [ RUN      ] RWLock.WriteStd17DirectUnLock
76: [       OK ] RWLock.WriteStd17DirectUnLock (529 ms)
76: [ RUN      ] RWLock.WriteStdTBBDirectLock
76: [       OK ] RWLock.WriteStdTBBDirectLock (2121 ms)
76: [ RUN      ] RWLock.WriteStdTBBDirectUnLock
76: [       OK ] RWLock.WriteStdTBBDirectUnLock (386 ms)
76: [ RUN      ] RWLock.WriteStdTBBUniqueDirectLock
76: [       OK ] RWLock.WriteStdTBBUniqueDirectLock (1233 ms)
76: [ RUN      ] RWLock.WriteStdTBBUniqueDirectUnLock
76: [       OK ] RWLock.WriteStdTBBUniqueDirectUnLock (470 ms)
76: [ RUN      ] RWLock.WriteSpinDirectLock
76: [       OK ] RWLock.WriteSpinDirectLock (2788 ms)
76: [ RUN      ] RWLock.WriteSpinDirectUnLock
76: [       OK ] RWLock.WriteSpinDirectUnLock (226 ms)
76: [ RUN      ] RWLock.WriteDirectLock
76: [       OK ] RWLock.WriteDirectLock (3012 ms)
76: [ RUN      ] RWLock.WriteDirectUnLock
76: [       OK ] RWLock.WriteDirectUnLock (425 ms)
76: [ RUN      ] RWLock.ReadLockStdDirect
76: [       OK ] RWLock.ReadLockStdDirect (2460 ms)
76: [ RUN      ] RWLock.ReadUnLockStdDirect
76: [       OK ] RWLock.ReadUnLockStdDirect (92 ms)
76: [ RUN      ] RWLock.ReadLockStd14Direct
76: [       OK ] RWLock.ReadLockStd14Direct (2667 ms)
76: [ RUN      ] RWLock.ReadUnLockStd14Direct
76: [       OK ] RWLock.ReadUnLockStd14Direct (94 ms)
76: [ RUN      ] RWLock.ReadLockStd17Direct
76: [       OK ] RWLock.ReadLockStd17Direct (2775 ms)
76: [ RUN      ] RWLock.ReadUnLockStd17Direct
76: [       OK ] RWLock.ReadUnLockStd17Direct (100 ms)
76: [ RUN      ] RWLock.ReadLockStdTBBDirect
76: [       OK ] RWLock.ReadLockStdTBBDirect (1157 ms)
76: [ RUN      ] RWLock.ReadUnLockStdTBBDirect
76: [       OK ] RWLock.ReadUnLockStdTBBDirect (92 ms)
76: [ RUN      ] RWLock.ReadLockStdTBBUniqueDirect
76: [       OK ] RWLock.ReadLockStdTBBUniqueDirect (398 ms)
76: [ RUN      ] RWLock.ReadUnLockStdTBBUniqueDirect
76: [       OK ] RWLock.ReadUnLockStdTBBUniqueDirect (92 ms)
76: [ RUN      ] RWLock.ReadLockSpinDirect
76: [       OK ] RWLock.ReadLockSpinDirect (2224 ms)
76: [ RUN      ] RWLock.ReadUnLockSpinDirect
76: [       OK ] RWLock.ReadUnLockSpinDirect (134 ms)
76: [ RUN      ] RWLock.ReadLockDirect
76: [       OK ] RWLock.ReadLockDirect (2350 ms)
76: [ RUN      ] RWLock.ReadUnLockDirect
76: [       OK ] RWLock.ReadUnLockDirect (91 ms)
76: [ RUN      ] RWLock.WriteSpinTLDirectLock
76: [       OK ] RWLock.WriteSpinTLDirectLock (931 ms)
76: [ RUN      ] RWLock.WriteSpinTLsDirectUnLock
76: [       OK ] RWLock.WriteSpinTLsDirectUnLock (243 ms)
76: [ RUN      ] RWLock.WriteTLDirectLock
76: [       OK ] RWLock.WriteTLDirectLock (1082 ms)
76: [ RUN      ] RWLock.WriteTLDirectUnLock
76: [       OK ] RWLock.WriteTLDirectUnLock (461 ms)
76: [ RUN      ] RWLock.ReadLockSpinTLDirect
76: [       OK ] RWLock.ReadLockSpinTLDirect (320 ms)
76: [ RUN      ] RWLock.ReadUnLockSpinTLDirect
76: [       OK ] RWLock.ReadUnLockSpinTLDirect (92 ms)
76: [ RUN      ] RWLock.ReadLockTLDirect
76: [       OK ] RWLock.ReadLockTLDirect (301 ms)
76: [ RUN      ] RWLock.ReadUnLockTLDirect
76: [       OK ] RWLock.ReadUnLockTLDirect (85 ms)
76: [ RUN      ] RWLock.SpinMutexLockUnlock
76: [       OK ] RWLock.SpinMutexLockUnlock (148 ms)
76: [ RUN      ] RWLock.MutexGuard
76: [       OK ] RWLock.MutexGuard (178 ms)
76: [ RUN      ] RWLock.WriteGuard
76: [       OK ] RWLock.WriteGuard (4995 ms)
76: [ RUN      ] RWLock.WriteSpinGuard
76: [       OK ] RWLock.WriteSpinGuard (4519 ms)
76: [ RUN      ] RWLock.ReentrantStd
76: [       OK ] RWLock.ReentrantStd (0 ms)
76: [ RUN      ] RWLock.ReentrantStd14
76: [       OK ] RWLock.ReentrantStd14 (1 ms)
76: [ RUN      ] RWLock.ReentrantStd17
76: [       OK ] RWLock.ReentrantStd17 (0 ms)
76: [ RUN      ] RWLock.ReentrantStdTBB
76: [       OK ] RWLock.ReentrantStdTBB (0 ms)
76: [ RUN      ] RWLock.ReentrantStdTBBUnique
76: [       OK ] RWLock.ReentrantStdTBBUnique (0 ms)
76: [ RUN      ] RWLock.ReentrantSpin
76: [       OK ] RWLock.ReentrantSpin (0 ms)
76: [ RUN      ] RWLock.Reentrant
76: [       OK ] RWLock.Reentrant (0 ms)
76: [ RUN      ] RWLock.ReentrantTLSpin
76: [       OK ] RWLock.ReentrantTLSpin (0 ms)
76: [ RUN      ] RWLock.ReentrantTL
76: [       OK ] RWLock.ReentrantTL (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStd
76: [       OK ] RWLock.ResetRestoreStd (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStd14
76: [       OK ] RWLock.ResetRestoreStd14 (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStd17
76: [       OK ] RWLock.ResetRestoreStd17 (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStdTBB
76: [       OK ] RWLock.ResetRestoreStdTBB (0 ms)
76: [ RUN      ] RWLock.ResetRestoreStdTBBUnique
76: [       OK ] RWLock.ResetRestoreStdTBBUnique (0 ms)
76: [ RUN      ] RWLock.ResetRestoreSpin
76: [       OK ] RWLock.ResetRestoreSpin (0 ms)
76: [ RUN      ] RWLock.ResetRestore
76: [       OK ] RWLock.ResetRestore (0 ms)
76: [ RUN      ] RWLock.ResetRestoreTLSpin
76: [       OK ] RWLock.ResetRestoreTLSpin (0 ms)
76: [ RUN      ] RWLock.ResetRestoreTL
76: [       OK ] RWLock.ResetRestoreTL (0 ms)
76: [ RUN      ] RWLock.concurrentResetRestore
76: [       OK ] RWLock.concurrentResetRestore (61 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreSpin
76: [       OK ] RWLock.concurrentResetRestoreSpin (40 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStd
76: [       OK ] RWLock.concurrentResetRestoreStd (53 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStd14
76: [       OK ] RWLock.concurrentResetRestoreStd14 (62 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStd17
76: [       OK ] RWLock.concurrentResetRestoreStd17 (64 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStdTBB
76: [       OK ] RWLock.concurrentResetRestoreStdTBB (45 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreStdTBBUnique
76: [       OK ] RWLock.concurrentResetRestoreStdTBBUnique (33 ms)
76: [ RUN      ] RWLock.LargeconcurrentResetRestore
76: [       OK ] RWLock.LargeconcurrentResetRestore (5534 ms)
76: [ RUN      ] RWLock.concurrentResetRestoreTL
76: [       OK ] RWLock.concurrentResetRestoreTL (59 ms)
76: [ RUN      ] RWLock.LargeconcurrentResetRestoreTL
76: [       OK ] RWLock.LargeconcurrentResetRestoreTL (4693 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWrites
76: [       OK ] RWLock.concurrentReadsAndWrites (3099 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesSpin
76: [       OK ] RWLock.concurrentReadsAndWritesSpin (3097 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStd
76: [       OK ] RWLock.concurrentReadsAndWritesStd (3129 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStd14
76: [       OK ] RWLock.concurrentReadsAndWritesStd14 (3210 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStd17
76: [       OK ] RWLock.concurrentReadsAndWritesStd17 (3105 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStdTBB
76: [       OK ] RWLock.concurrentReadsAndWritesStdTBB (3168 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesStdTBBUnique
76: [       OK ] RWLock.concurrentReadsAndWritesStdTBBUnique (3112 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWrites
76: [       OK ] RWLock.LargeconcurrentReadsAndWrites (3128 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStd
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStd (3111 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStd14
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStd14 (3121 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStd17
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStd17 (3128 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStdTBB
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStdTBB (3100 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesStdTBBUnique
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesStdTBBUnique (3117 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesSpin
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesSpin (315 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWrites
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWrites (22562 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStd
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStd (13138 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStd14
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStd14 (12093 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStd17
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStd17 (9137 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStdTBB
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStdTBB (9902 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesStdTBBUnique
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesStdTBBUnique (6940 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesSpin
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesSpin (4261 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReads
76: [       OK ] RWLock.VeryLargeconcurrentReads (1071 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStd
76: [       OK ] RWLock.VeryLargeconcurrentReadsStd (1077 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStd14
76: [       OK ] RWLock.VeryLargeconcurrentReadsStd14 (1072 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStd17
76: [       OK ] RWLock.VeryLargeconcurrentReadsStd17 (1072 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStdTBB
76: [       OK ] RWLock.VeryLargeconcurrentReadsStdTBB (1071 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsStdTBBUnique
76: [       OK ] RWLock.VeryLargeconcurrentReadsStdTBBUnique (1070 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsSpin
76: [       OK ] RWLock.VeryLargeconcurrentReadsSpin (392 ms)
76: [ RUN      ] RWLock.concurrentReadsAndWritesTL
76: [       OK ] RWLock.concurrentReadsAndWritesTL (3193 ms)
76: [ RUN      ] RWLock.LargeconcurrentReadsAndWritesTL
76: [       OK ] RWLock.LargeconcurrentReadsAndWritesTL (3102 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsAndWritesTL
76: [       OK ] RWLock.VeryLargeconcurrentReadsAndWritesTL (6246 ms)
76: [ RUN      ] RWLock.VeryLargeconcurrentReadsTL
76: [       OK ] RWLock.VeryLargeconcurrentReadsTL (1072 ms)
76: [----------] 110 tests from RWLock (213032 ms total)
76: 
76: [----------] 10 tests from TThreadedObject
76: [ RUN      ] TThreadedObject.CreateAndDestroy
76: [       OK ] TThreadedObject.CreateAndDestroy (1 ms)
76: [ RUN      ] TThreadedObject.Get
76: [       OK ] TThreadedObject.Get (10 ms)
76: [ RUN      ] TThreadedObject.GetAtSlot
76: [       OK ] TThreadedObject.GetAtSlot (0 ms)
76: [ RUN      ] TThreadedObject.GetAtSlotUnchecked
76: [       OK ] TThreadedObject.GetAtSlotUnchecked (0 ms)
76: [ RUN      ] TThreadedObject.GetAtSlotRaw
76: [       OK ] TThreadedObject.GetAtSlotRaw (0 ms)
76: [ RUN      ] TThreadedObject.SetAtSlot
76: [       OK ] TThreadedObject.SetAtSlot (0 ms)
76: [ RUN      ] TThreadedObject.Merge
76: [       OK ] TThreadedObject.Merge (159 ms)
76: [ RUN      ] TThreadedObject.SnapshotMerge
76: [       OK ] TThreadedObject.SnapshotMerge (1 ms)
76: [ RUN      ] TThreadedObject.GrowSlots
76: [       OK ] TThreadedObject.GrowSlots (0 ms)
76: [ RUN      ] TThreadedObject.GetNSlots
76: [       OK ] TThreadedObject.GetNSlots (1 ms)
76: [----------] 10 tests from TThreadedObject (172 ms total)
76: 
76: [----------] Global test environment tear-down
76: [==========] 126 tests from 3 test suites ran. (213929 ms total)
76: [  PASSED  ] 126 tests.
1/1 Test #76: gtest-core-thread-test-corethreadtestUnit ...   Passed  214.25 sec

The following tests passed:
        gtest-core-thread-test-corethreadtestUnit

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 215.09 sec

@pcanal
Member

pcanal commented Dec 17, 2020

In terms of sorting out the dependencies, one thing I could imagine is making core/imt an optional extra dependency of core/thread, and hiding the tbb stuff there (but open to other suggestions)

I agree that it is a possible solution. @Axel-Naumann opinion?

@pcanal
Member

pcanal commented Dec 17, 2020

Benchmarks for my RDF use-case with 256 threads look like:

I might be confused (i.e. I am probably looking at the wrong numbers). This last set of numbers says that the total improvement on the 256-thread use case is around 7% of the 3-minute run time. Is that correct?

@Axel-Naumann
Member

Dependency: I don't have a strong opinion or technical argument. I'm a bit afraid of circular dependencies between core and imt.

@bendavid
Contributor Author

@pcanal, yes, the improvement from using TBB with TReentrantRWLock for gCoreMutex is only 7% in this case, but the use of RW locks instead of always-write locks in TBufferFile::ReadClassBuffer is a much bigger improvement (and the comparisons I was showing for the RWLock variants were always on top of this).

For a smaller test with only 9.7M events in 19 input files (instead of 513M events in 1000 input files), using 256 threads and ROOT.TTreeProcessorMT.SetMaxTasksPerFilePerWorker(1), the numbers are:

Baseline:
        Percent of CPU this job got: 397%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 24:56.08

+ Use read-write locks in TBufferFile::ReadClassBuffer
        Percent of CPU this job got: 3971%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.81

+ TBBUnique for gCoreMutex
        Percent of CPU this job got: 4765%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.00

So the use of RW locks in TBufferFile::ReadClassBuffer is a factor-of-80 improvement, and the TBB-based RWLock adds another 10%.

@bendavid
Contributor Author

Though this test is probably too short to scale effectively to higher thread counts, I add the scaling comparison here, which is interesting in any case (obviously the baseline is much better behaved at smaller thread counts).

9.7M events in 19 input files
ROOT.TTreeProcessorMT.SetMaxTasksPerFilePerWorker(1)

Baseline

256 threads
        Percent of CPU this job got: 397%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 24:56.08
128 threads
        Percent of CPU this job got: 816%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 7:46.83
64 threads
        Percent of CPU this job got: 770%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:51.09
32 threads
        Percent of CPU this job got: 1209%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:35.40
16 threads
        Percent of CPU this job got: 1142%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.39
8 threads
        Percent of CPU this job got: 743%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.61
4 threads
        Percent of CPU this job got: 386%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:27.77
2 threads
        Percent of CPU this job got: 198%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:41.98
1 thread (no implicit MT)
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:10.06

+ Use read-write locks in TBufferFile::ReadClassBuffer

256 threads
        Percent of CPU this job got: 3971%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.81
192 threads
        Percent of CPU this job got: 4578%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.87
128 threads
        Percent of CPU this job got: 4908%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.90
64 threads
        Percent of CPU this job got: 3897%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.40
32 threads
        Percent of CPU this job got: 2403%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.90
16 threads
        Percent of CPU this job got: 1384%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.35
8 threads
        Percent of CPU this job got: 739%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.25
4 threads
        Percent of CPU this job got: 385%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:27.97
2 threads
        Percent of CPU this job got: 197%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:35.59
1 thread (no implicit MT)
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:11.11

+ TBBUnique for gCoreMutex

256 threads
        Percent of CPU this job got: 4765%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.00
192 threads
        Percent of CPU this job got: 4644%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.62
128 threads
        Percent of CPU this job got: 4551%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.46
64 threads
        Percent of CPU this job got: 3782%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.05
32 threads
        Percent of CPU this job got: 2383%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.43
16 threads
        Percent of CPU this job got: 1374%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.43
8 threads
        Percent of CPU this job got: 742%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.26
4 threads
        Percent of CPU this job got: 385%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:26.85
2 threads
        Percent of CPU this job got: 197%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32.81
1 thread (no implicit MT)
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:10.98

@bendavid
Contributor Author

For comparison, running on the full 513M events and 1000 files with the final configuration (baseline + read-write locks in TBufferFile::ReadClassBuffer + TBBUnique for gCoreMutex + the additional changes to the task splitting that target 24 tasks per thread independent of the number of files), the thread scaling is much more reasonable. (Keep in mind also that even if the thread scaling were perfect from the application standpoint, significant deviations from perfect scaling are expected beyond 128 threads due to hyperthreading, plus additional effects from per-core dynamic boosting, etc.)

256 threads
        Percent of CPU this job got: 19725%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:49.24
192 threads
        Percent of CPU this job got: 16846%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:56.93
128 threads
        Percent of CPU this job got: 12064%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:24.70
64 threads
        Percent of CPU this job got: 6130%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:52.66
32 threads
        Percent of CPU this job got: 3082%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 11:11.29
16 threads
        Percent of CPU this job got: 1580%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 20:08.16

@pcanal
Member

pcanal commented Jan 7, 2021

Those are indeed fantastic numbers.

All that is left to do is to clean up the history a tiny bit (squash the 3 "fix and add" commits into the commits they fix or add to) and, more importantly, to copy/paste the numbers above into the commit log(s).

Great Work!! Thanks!!!

@pcanal
Member

pcanal commented Jan 15, 2021

@bendavid Is the TBBUnique implementation uploaded to this PR? (i.e. I am not seeing it, but I could be confused).

@bendavid
Contributor Author

Not yet, I'm in the process of shuffling around the dependencies so that ROOT can still build without TBB.

@pcanal
Member

pcanal commented Jan 15, 2021

Thanks.

@bendavid
Contributor Author

Since I haven't had a chance to cleanly address the TBB dependency issue, and since the other change is much more important performance-wise, I've split that off into a separate PR here which should be possible to merge quickly: #7105

@oshadura
Collaborator

@phsft-bot build!

@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-fedora30/cxx14, ROOT-fedora31/noimt, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on ROOT-fedora31/noimt.
See console output.

@bendavid
Contributor Author

@pcanal I finally had a chance to look at this in more detail and did not find an elegant solution for making TBB an optional dependency for this purpose. (Putting the relevant template specialization in core/imt, as I had in mind, would create a circular dependency with core/thread.) I'm open to suggestions on how to deal with this (or on whether a hard TBB dependency in core/thread would be acceptable and/or an optional one straightforward to implement).

@pcanal
Member

pcanal commented Feb 17, 2021

I think we can get away with something simpler: when IMT is on (or, more precisely, when TBB has been detected by CMake), we turn on a flag (a #define in RConfigure.h) that changes the way core/thread is built. And yes, in this optional mode, libThread would link against libTBB.so (or however it is spelt).
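A sketch of what that could look like; R__HAS_TBB and CoreMutexHint are made-up names for illustration, not the actual flag or types that CMake would write into RConfigure.h:

```cpp
#include <shared_mutex>

// Hypothetical compile-time switch: if CMake detected TBB, it would define
// a flag (here assumed to be called R__HAS_TBB) in RConfigure.h, and
// core/thread would pick the lock implementation accordingly. Note that
// the TBB mutex has a different locking interface (scoped_lock-based),
// so a small adapter would be needed in practice.
#ifdef R__HAS_TBB
   #include <tbb/queuing_rw_mutex.h>
   using CoreMutexHint = tbb::queuing_rw_mutex;
#else
   using CoreMutexHint = std::shared_mutex;
#endif
```

This keeps TBB out of core/thread's public headers in the default build while still letting the IMT build benefit from the TBB-based lock.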

@pcanal
Member

pcanal commented Mar 16, 2021

Shouldn't we merge this (in addition to #7260), for example for use on Windows and/or when TBB is not available?

Related: we should put (a summary of) the performance numbers somewhere in the git repository, maybe in the release notes or in one of the commits.

Thanks!

@bendavid
Contributor Author

This PR can actually be closed because it's fully covered by the combination of

#7105 (merged)
#7106 (merged)
#7260 (the other one we're discussing)

The performance improvements were added to the PR descriptions but indeed not in the commit messages themselves.

@bendavid bendavid closed this Mar 16, 2021
@pcanal
Member

pcanal commented Mar 16, 2021

@bendavid I am confused: in none of the PRs listed do I see struct RecurseCountsShared. Indeed, here I see:

So the general penalty from the shared_mutex appears to compensate for avoiding the exclusive lock in ReadLock(). This could be further optimized by using the shared_mutex only in RecurseCountsShared, but given the results, the tbb-based solution looks rather promising.

So the question really comes back to: for the case where we can't use TBB, is there a point in introducing RecurseCountsShared?

@bendavid
Contributor Author

Ah, well, I was kind of assuming that this is mostly relevant/important together with implicit multithreading, in which case TBB needs to be available anyway. Presumably the TBB issue on Windows is eventually solvable (and besides, does anyone really use ROOT on Windows with a large number of cores?)

If you still think it's useful I can try to make the further optimisation described above and make a separate PR for this.

@pcanal
Member

pcanal commented Mar 16, 2021

I'm thinking more of the potential future where we have to replace TBB (for whatever reason, e.g. not being available on a new platform or no longer being developed).

In addition I would be interested to see the optimized performance number vs TBB :)

So please yes :)

@bendavid
Contributor Author

Ok, looking in more detail, I THINK that the original deadlock issue with UniqueLockRecurseCount can actually be avoided by declaring the LocalCounts as a thread_local data member of UniqueLockRecurseCount rather than a local variable inside of GetLocal(), so this might be an even better solution.

@pcanal
Member

pcanal commented Mar 16, 2021

as a thread_local data member of UniqueLockRecurseCount

What do you mean (syntax-wise) by a thread-local data member?

@bendavid
Contributor Author

e.g.

struct UniqueLockRecurseCount {
  struct LocalCounts {
    size_t fReadersCount = 0;
    bool fIsWriter = false;
  };
  
  static thread_local LocalCounts fCounts;
};

thread_local UniqueLockRecurseCount::LocalCounts UniqueLockRecurseCount::fCounts = UniqueLockRecurseCount::LocalCounts();

@pcanal
Member

pcanal commented Mar 16, 2021

Yes, it might work as long as we guarantee that the thread-local static fCounts is not used during the loading of libThread itself (if this happens we have no guarantee of the order of initialization and thus it might not be initialized when used). I think it is safe to rely on this since its usage (somewhat) requires TThread::Init to be called first.
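The two variants under discussion can be sketched side by side (illustrative code, not the ROOT sources). Because LocalCounts has only constant initializers, the thread_local static data member is constant-initialized, so accessing it involves no lazy-initialization guard, which is presumably what would sidestep the original deadlock scenario:

```cpp
#include <cstddef>

struct LocalCounts {
   std::size_t fReadersCount = 0;
   bool fIsWriter = false;
};

// Variant A: function-local thread_local, initialized lazily on first call
// (the pattern implicated in the original deadlock concern).
LocalCounts *GetLocal()
{
   static thread_local LocalCounts counts;
   return &counts;
}

// Variant B: thread_local static data member. With only constant
// initializers, it is constant-initialized, so no runtime guard is
// involved when it is accessed.
struct UniqueLockRecurseCount {
   static thread_local LocalCounts fCounts;
};
thread_local LocalCounts UniqueLockRecurseCount::fCounts;
```

In both variants each thread gets its own independent counters; the difference is only in when and how the per-thread object is initialized.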
