Further improvements to reduce lock contention #6919
bendavid wants to merge 7 commits into root-project:master
Conversation
Can one of the admins verify this patch?

@phsft-bot build

Starting build on

Build failed on mac1015/cxx17. Errors:
static thread_local LocalCounts unique_local_counts;
return &unique_local_counts;
(We indent with 3 spaces.)
local_t GetLocal(){
   TTHREAD_TLS_DECL(LocalCounts, gLocal);
   return &gLocal;
static thread_local LocalCounts unique_local_counts;
@pcanal can we use thread_local throughout all supported platforms these days?
Yes, I think so. The main hold-over was MacOS (and Windows) for objects (i.e. this case). A priori, if the code compiles and links, it should be fine.
else {
   R__LOCKGUARD(gInterpreterMutex);
   R__READ_LOCKGUARD(ROOT::gCoreMutex);
There is no (new) risk there. By design the read and write locks work well together (a thread that holds a read lock and then requests a write lock first releases the read lock and waits until all other threads have released the read lock, either by finishing their critical section or by asking for the write lock).
I am more worried about race conditions. Currently the StreamerInfo is published via RegisterStreamerInfo, which is called before the end of the critical section. However, since the check is done through the read critical section and there is a test on IsCompiled, which is updated at the last line of Build, I am guessing it is likely okay.
io/io/src/TBufferFile.cxx
sinfo = (TStreamerInfo*)cl->GetStreamerInfos()->At(version);
if (sinfo == nullptr) {
   if (version == cl->GetClassVersion() || version == 1) {
      const_cast<TClass*>(cl)->BuildRealData(pointer);
Indent by 3 spaces, please.
{
   R__LOCKGUARD(gGlobalMutex);
   if (!ROOT::gCoreMutex) {
      // To avoid dead locks, caused by shared library opening and/or static initialization
Why is this comment no longer applicable?
I had thought this was related to some of the platform-specific handling in TTHREAD_TLS_DECL, and that it's not an issue if static thread_local can be used across the board.
See commit bd1894b.
tls_get_addr_tail is not a ROOT lock/routine but a system one (and, not 100% sure, but it might be involved in the compiler's implementation of static initialization).
Do you know which definition of TTHREAD_TLS was being used for that stack trace? (If it was already using thread_local, then I agree there's no reason the situation should be different now.)
I am pretty sure it was the default TTHREAD_TLS (i.e. ROOT's workaround).
2af6594 to 6d8ae1e
Given the discussion about deadlocks using `static thread_local`, I pushed an alternate solution that avoids exclusive locks in ReadLock(), based on shared mutexes (only used for C++14 or higher in this case). The indentation should also be fixed now.
size_t &GetLocalReadersCount(local_t &local) { return fReadersCount[local]; }
};

struct RecurseCountsShared {
When adding a new counting option, we should extend the test core/thread/test/testRWLock.cxx, starting with adding a new global variable to the section:
auto gMutex = new TMutex(kTRUE);
auto gRWMutex = new TRWMutexImp<TMutex>();
auto gRWMutexSpin = new TRWMutexImp<ROOT::TSpinMutex>();
auto gRWMutexStd = new TRWMutexImp<std::mutex>();
auto gReentrantRWMutex = new ROOT::TReentrantRWLock<TMutex>();
auto gReentrantRWMutexSM = new ROOT::TReentrantRWLock<ROOT::TSpinMutex>();
auto gReentrantRWMutexStd = new ROOT::TReentrantRWLock<std::mutex>();
auto gSpinMutex = new ROOT::TSpinMutex();
// Intentionally ignore the Fatal error due to the shared thread-local storage.
// In this test we need to be 'careful' to not use all those mutex at the same time.
int trigger1 = gErrorIgnoreLevel = kFatal + 1;
auto gReentrantRWMutexTL = new ROOT::TReentrantRWLock<TMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gReentrantRWMutexSMTL = new ROOT::TReentrantRWLock<ROOT::TSpinMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gRWMutexTL = new TRWMutexImp<TMutex, ROOT::Internal::UniqueLockRecurseCount>();
auto gRWMutexTLSpin = new TRWMutexImp<ROOT::TSpinMutex, ROOT::Internal::UniqueLockRecurseCount>();
int trigger2 = gErrorIgnoreLevel = 0;
(in the right section, like the one between trigger1 and trigger2 in this case) and then duplicating (copy/paste) the usage, so that the new global is used in the same set of tests as the others.
Besides testing the functionality, this also gives a first-level benchmark of the raw speed of the options.
Here's an updated stack trace with full debug info in case it's useful in the meantime, taken again during one of the remaining periods of high lock contention, when many files are being closed and opened.
} // End of namespace ROOT
#if __cplusplus >= 201402L
template class TRWMutexImp<std::shared_timed_mutex, ROOT::Internal::RecurseCountsShared>;
What's the advantage of using the shared_timed_mutex in this context?
The regular shared_mutex is only available from C++17, so shared_timed_mutex serves as a fallback for the C++14 case. (I'm not sure whether the additional timed functionality actually makes it slower when not used.)
Can you publish the numerical results of the test benchmark (for C++14 and C++17 builds)?
There were actually some deadlocks when running the tests, due to an uninitialized variable (fixed now). The tests now look as below. The new implementations are somewhat slower for write locks, faster for read locks, and significantly faster for the VeryLargeconcurrentReadAndWrite tests which I added. The slower write locking could probably be mitigated by using a dedicated
It looks like (surprisingly :() the currently used implementation (RecurseCounts) is not yet in the test. Could you add it?
RecurseCounts is the default second template argument for TReentrantRWLock and TRWMutexImp, so it's definitely already there. The most relevant comparisons here are Std vs Std14 vs Std17. (Even though this was all run in a C++17 build, I set it up to just test all the variants which are available.)
Thanks. I had forgotten about that :)
I don't seem to be seeing that: what am I missing? Also, do we understand why the Std17 results seem to be higher/slower than Std14?
Ah, sorry, I was looking at the numbers for ReadUnLock. The differences between Std14 and Std17 are at least partly noise, I think. I also have an implementation working now based on
Now in principle what tbb is doing underneath should not be so different from
If you would be willing to have tbb as a dependency for core/thread, then I can add this (and maybe even remove the shared_mutex versions).
I agree that it is a possible solution. @Axel-Naumann, opinion?
I might be confused (i.e. I am probably looking at the wrong numbers). This last set of numbers says that the total improvement in the 256-thread use case is around 7% of the 3-minute run-time. Is that correct?
Dependency: I don't have a strong opinion or technical argument. I'm a bit afraid of circular dependencies between core and imt.
@pcanal, yes, the improvement from using TBB with TReentrantRWLock for gCoreMutex is only 7% in this case, but actually the use of RW locks instead of always write locks in
For a smaller test with only 9.7M events in 19 input files (instead of 513M events in 1000 input files), using
So the use of RW locks in
Though this test is probably too short to scale effectively to higher thread counts, I add the scaling comparison here, which is interesting anyway (obviously the baseline is much better behaved at smaller thread counts).
For comparison, running on the full 513M events and 1000 files with the final configuration, i.e. baseline + read-write locks in TBufferFile::ReadClassBuffer + TBBUnique for gCoreMutex + the additional changes to the task splitting to target 24 tasks per thread independent of the number of files, the thread scaling is much more reasonable. (Keep in mind also that even if the thread scaling were perfect from the application standpoint, there are significant deviations from perfect scaling expected beyond 128 threads due to hyperthreading, plus additional effects from per-core dynamic boosting, etc.)
Those are indeed fantastic numbers. All that is left to do is to clean up the history a tiny bit (squash the 3 "fix and add" commits into the commits they fix or add to) and, more importantly, to copy/paste the numbers above into the commit log(s). Great work!! Thanks!!!
@bendavid Is the TBBUnique implementation uploaded to this PR? (i.e. I am not seeing it, but I could be confused)
Not yet; I'm in the process of shuffling around the dependencies so that ROOT can still build without TBB.
Thanks.
Since I haven't had a chance to cleanly address the TBB dependency issue, and since the other change is much more important performance-wise, I've split that off into a separate PR, which should be possible to merge quickly: #7105
@phsft-bot build!

Starting build on

Build failed on ROOT-fedora31/noimt.

Build failed on mac11.0/cxx17. Failing tests:
@pcanal I finally had a chance to look at this in more detail and did not find an elegant solution for making TBB an optional dependency for this purpose. (Putting the relevant template specialization in core/imt, as I had in mind, would create a circular dependency with core/thread.) Open to suggestions on how to deal with this (or on whether a hard TBB dependency in core/thread would be acceptable, and/or an optional one straightforward to implement).
I think we can get away with something simpler, where when IMT is on (or, more precisely, when TBB has been detected by cmake), we turn on a flag (
Shouldn't we merge this (in addition to #7260), for example for use on Windows and/or when TBB is not available? Related: we should put (a summary of) the performance numbers somewhere in the git repository, maybe either in the release notes or in one of the commits. Thanks!
@bendavid I am confused: in none of the PRs listed do I see
So the question really is back to: for the case where we can't use TBB, is there a point in introducing RecurseCountsShared?
Ah, well, I was kind of assuming that this is mostly relevant/important together with implicit multithreading, in which case TBB needs to be available anyway. Presumably the TBB issue on Windows is eventually solvable (and besides, does anyone really use ROOT on Windows with a large number of cores?). If you still think it's useful, I can try to make the further optimisation described above and make a separate PR for it.
I am thinking more of the potential future where we have to replace TBB (for whatever reason, e.g. not being available on a new platform, or no longer being developed). In addition, I would be interested to see the optimized performance numbers vs TBB :) So please, yes :)
OK, looking in more detail, I THINK that the original deadlock issue with UniqueLockRecurseCount can actually be avoided by declaring the LocalCounts as a thread_local data member of UniqueLockRecurseCount rather than as a local variable inside GetLocal(), so this might be an even better solution.
What do you mean (syntax-wise) by a thread-local data member?
e.g.

struct UniqueLockRecurseCount {
   struct LocalCounts {
      size_t fReadersCount = 0;
      bool fIsWriter = false;
   };
   static thread_local LocalCounts fCounts;
};

thread_local UniqueLockRecurseCount::LocalCounts UniqueLockRecurseCount::fCounts = UniqueLockRecurseCount::LocalCounts();
Yes, it might work as long as we guarantee that the
Further reduction in lock contention when using RDataFrame with a large number of threads and/or files, achieved by improving the global read-write lock and migrating one hot spot in TBufferFile to use it.