common/mempool.cc: Improve performance of sharding #55696
rzarzynski merged 2 commits into ceph:main
Conversation
size_t mempool::get_num_shards(void) {
Can this function be called from more than one thread concurrently? I guess not.
Yes it can.
The very first time it's called, when num_shards is 0 (hasn't been initialized yet), the code is not multi-thread safe. However, this first call will be for one of the static initializations of a Pool structure that has a dynamic array of size get_num_shards(), which happens while the code is still single threaded.
I originally tried static initialization to set up num_shards, but this just got into a mess with static initialization ordering problems, so I resorted to initialize on first use.
I'll add a mutex like this to make the first call thread safe too. As this is performance critical code we only want to take the mutex when initialization is required, so it will look like this:
if (num_shards == 0) {
  std::lock_guard<std::mutex> lck(thread_shard_mtx);
  if (num_shards == 0) {
    // One time initialization
    ...
  }
}
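A fleshed-out sketch of this double-checked initialization might look like the following. The names num_shards and thread_shard_mtx follow the comment above; the sizing policy (one shard per hardware thread, minimum 1) is illustrative only, not what the patch does:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical sketch of the double-checked initialization proposed above.
static std::mutex thread_shard_mtx;
static std::size_t num_shards = 0;

std::size_t get_num_shards() {
  if (num_shards == 0) {                    // fast path: no lock once initialized
    std::lock_guard<std::mutex> lck(thread_shard_mtx);
    if (num_shards == 0) {                  // re-check under the lock
      // One time initialization.  (A production version would make num_shards
      // std::atomic, or rely on the first call happening before any threads
      // are spawned, to avoid a formal data race.)
      std::size_t n = std::thread::hardware_concurrency();
      num_shards = n ? n : 1;
    }
  }
  return num_shards;
}
```

Note the fast path only reads num_shards, so after the first call the mutex is never taken again.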
thread_shard_index = sched_getcpu() & ((1 << num_shard_bits) - 1);
}
#endif
return thread_shard_index;
I see more complexity here even for crimson: an extra indirection layer, the TLS lookup. IIRC its cost can vary a lot depending on the TLS model, usage of dlopen() and all the hairiness.
My understanding of sched_getcpu() is that it's a user-space call loading the CPU id from memory, where the kernel copies it when a core change happens. I guess it's blazing fast, perhaps even faster than TLS, especially when TLS's internal complexity needs to see the daylight.
Ah, I went too fast to the code changes :-).
Calling sched_getcpu() on every call to mempool gives more stable performance when the number of threads exceeds the number of cores, but the overheads of calling sched_getcpu() mean that performance is worse than the existing pthread_self() hash for smaller numbers of threads.
The overhead is pretty interesting. My guess was it's actually cheaper than TLS (__tls_get_addr()?). Do you have a profile by any chance?
I've updated the benchmarks in the PR comments. A local-exec TLS lookup is the cheapest at ~ 5 cycles, a global-dynamic TLS lookup (which will happen for code in a dynamic library so is more representative) is ~15 cycles, calling sched_getcpu() is ~32 cycles.
The call to pthread_self() and hash is ~8 cycles; this is effectively a local-exec TLS lookup plus a bit of overhead for a function call and a logical AND.
Inspecting the assembly code shows the sched_getcpu() code path is over twice as long as a global-dynamic TLS lookup involving a call to __tls_get_addr(), which confirms what the benchmarks are showing. Like you, I'm surprised sched_getcpu() isn't more efficient - but there's quite a bit of complexity working out whether it can use the vDSO variable or has to resort to making a syscall, and this happens on every call.
I think we need to look at the mempool statistics when running on a system with a heavy I/O workload. The stats I've managed to obtain from running on a fairly pathetic setup show a 3x improvement for the adaptive caching versus calling sched_getcpu(), which might reduce to 2x if we made a slick inline version of pick_a_shard_int that just called sched_getcpu(). I suspect the benefits will be much greater on a system running a heavy I/O workload; if they are not, then maybe the simplicity of always calling sched_getcpu() will outweigh a marginal performance improvement from trying to cache the result.
Latest commit 5f72112 implements the following algorithms:
THREAD_OWNER - this has one TLS access per mempool stat update, only calls sched_getcpu() when a thread has been context switched to another core, and almost eliminates the cache lines storing the stats being accessed by different CPU cores (there is one cache line ping-pong each time a thread switches cores).
SCHED_GETCPU - this calls sched_getcpu() on every mempool stat update; it has no TLS access. Even with the optimized version of sched_getcpu() reading a variable from the vDSO, this is slower than the THREAD_OWNER implementation.
SCHED_GETCPU USING RDPID - for systems that support rdpid, use this instead of sched_getcpu(). rdpid is a 1 cycle instruction on Icelake servers. Benchmarks with ceph_test_c2c showed this was fractionally slower than THREAD_OWNER when # threads < # cores and fractionally faster than THREAD_OWNER when # threads > # cores.
I suspect that outside of benchmarks (where instructions/data are L1 cached) SCHED_GETCPU USING RDPID will give the best performance, followed by THREAD_OWNER, followed by SCHED_GETCPU.
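The cached-shard idea behind these variants can be sketched roughly as follows. This is an illustrative sketch, not the code from commit 5f72112: the shard count, interval, and function name are assumptions, and it requires Linux's sched_getcpu():

```cpp
#include <sched.h>   // sched_getcpu(), Linux-specific (glibc needs _GNU_SOURCE,
                     // which g++/clang++ define by default)
#include <cstddef>

static constexpr std::size_t num_shard_bits = 5;   // assumed: 32 shards

// Cache the sched_getcpu() result in a thread_local and only re-query it
// every recheck_interval updates, so the common path is one TLS access.
std::size_t pick_a_shard_cached() {
  static thread_local std::size_t cached_shard = 0;
  static thread_local int recheck_count = 0;
  static constexpr int recheck_interval = 10000;

  if (recheck_count-- <= 0) {              // slow path: refresh the cache
    recheck_count = recheck_interval;
    int cpu = sched_getcpu();              // returns -1 if unsupported
    if (cpu >= 0)
      cached_shard = static_cast<std::size_t>(cpu) & ((1u << num_shard_bits) - 1);
  }
  return cached_shard;                     // fast path: one TLS access
}
```

If a thread migrates cores, the stale cached value only causes at most recheck_interval updates to land on the "wrong" shard, which is why the periodic recheck addresses the pinning worry raised below.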
src/common/mempool.cc
Outdated
#else
if (thread_shard_index == max_shards) {
  // Thread has not been assigned to a shard yet
  thread_shard_index = sched_getcpu() & ((1 << num_shard_bits) - 1);
This assumes seastar's reactors never change cores. I think they are pinned, but I'm worried about the universal quantifier. I don't know when exactly this pinning happens and whether it's guaranteed this code doesn't execute before then.
In short: the complexity makes the thinking harder (and more painful for me ;-).
Let me respond to my own comment: the periodic recheck should address this worry.
Still, the impression is we're fighting the complexity with more complexity.
Are these numbers coming from a microbenchmark? Was
These performance benchmarks were from a micro benchmark, timing millions of calls to a function implementing the code being tested. I didn't even bother turning on optimization for the compiler; however the code, and indeed the optimized sequence of uops, is going to be L1 cached, so this will make very little difference to the benchmarks. The TLS model was local-exec; as you point out, if the code is in a dynamic library the TLS access will be marginally slower. However these are all second order effects. In practice this code isn't going to remain in the L1 cache, so the costs of both a TLS access and a call to sched_getcpu() will be higher than shown by this benchmark, but a TLS access is still going to be something like a 1/3rd of the cost of calling sched_getcpu(), which will be something like 1/3rd to 1/8th of the cost of performing an atomic access to a dirty cache line in another core.
@bill-scales An excellent measurement of the cycle cost of different operations! Only seeing the numbers, I have the audacity to propose a significant rework of our mempool logic.
With atomics, that will give us 5 or 15 cycles for TLS retrieval + 2 x 19 cycles for atomic increment, a total of ~48 cycles per update. Now for the things I am not certain about: with regular increments it would be 5 or 15 cycles for TLS retrieval + 2 x 5 cycles, a total of ~20 cycles per update. In this technique I expect to pay a single penalty per core change. @bill-scales Do you think such a rework can fly? Do you think it will be worthwhile?
I did consider something like this. If the thread count is relatively small then doing accounting per thread will be more efficient than per core, and a technique like the above will work well. However if there are thousands of threads then accounting per thread will start to consume a lot of memory, and trying to calculate summary counts across this number of shards will become expensive.

I don't have enough experience of Ceph to know how many threads is typical. I see between 30 and 70 threads in OSD processes (so this technique would work well there); I don't know how many threads (specifically those accessing mempool) is typical in clients. I'm sure that the mempool stats/dump interfaces aren't called very often, so having these iterate across 1000 shards probably wouldn't be a problem, but there are other interfaces such as allocated_items and allocated_bytes that will be called more often, so the number of shards is a concern. Maybe others can comment on what typical thread counts might be and whether a shard per thread model is viable?

I did think about having a per thread sharding scheme and switching to a per core scheme if the number of threads got too high, however I thought that switching models without pausing threads is too complex.
If we did 1-6 then I think we would want to get rid of the atomics as well. The challenge here is the memory consistency model of having one thread updating a value and another thread reading it. Not a problem for x86, and solvable for other architectures with a bit of care.
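One portable way to get "a bit of care" without full atomics cost is relaxed atomic operations: the owning thread updates its counter with memory_order_relaxed (which compiles to plain loads/stores plus an increment, no lock prefix) and the summarizing thread reads with relaxed loads. A hedged sketch; all names here are illustrative, not mempool's API:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Per-thread counter on its own 64-byte cache line, updated only by its
// owner, so no read-modify-write atomicity is needed - just a race-free
// store the reader can observe.
struct alignas(64) per_thread_counter {
  std::atomic<std::uint64_t> bytes{0};
};

void add_bytes(per_thread_counter &c, std::uint64_t n) {
  // Owner-only update: relaxed load + relaxed store, no atomic RMW.
  c.bytes.store(c.bytes.load(std::memory_order_relaxed) + n,
                std::memory_order_relaxed);
}

std::uint64_t sum_bytes(const std::vector<per_thread_counter> &counters) {
  std::uint64_t total = 0;
  for (const auto &c : counters)
    total += c.bytes.load(std::memory_order_relaxed);  // may be slightly stale
  return total;
}

// Small demonstration: two owners update their own counters concurrently.
std::uint64_t demo_sum() {
  std::vector<per_thread_counter> counters(2);
  std::thread t([&] { add_bytes(counters[0], 10); });
  add_bytes(counters[1], 5);
  t.join();
  return sum_bytes(counters);
}
```

The reader may see a momentarily stale total, which is fine for statistics; what relaxed atomics buy over plain ints is defined behavior on weakly ordered architectures such as aarch64.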
This would be far too slow; my experience with tkill and signal handlers suggests it's not uncommon for it to take 1ms for a thread to receive a signal from another thread (this was in an environment where the source and target threads had affinity to different cores and were the only processes running on those cores). In an environment like Crimson you can dispatch a task on each I/O reactor to summarize per thread data, but we need something more generic.
I was focused on OSD and I missed the fact that clients may have significantly more threads.
Thread local storage
I think "marginally slower" brings an implicit assumption: that we're on the optimistic path. I'm afraid there is a lot of complexity coming from the flexibility of thread management:
void *
__tls_get_addr (GET_ADDR_ARGS)
{
dtv_t *dtv = THREAD_DTV ();
if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation)))
return update_get_addr (GET_ADDR_PARAM);
void *p = dtv[GET_ADDR_MODULE].pointer.val;
if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);
return (char *) p + GET_ADDR_OFFSET;
}
Costly
src/common/mempool.cc
Outdated
static thread_local int thread_shard_recheck_count;
static thread_local int thread_shard_recheck_interval;
Packing related thread_local variables together reduces the number of TLS references.
struct thread_shard_t {
size_t index = max_shards;
int recheck_count = 0;
int recheck_interval = 10000;
};
thread_local thread_shard_t thread_shard;
@bill-scales Before operating on the shard, the thread checks the owner:
This is a mutation of your recheck algorithm that has the following cost:
I claim it should be more cycle efficient.
But in the first place: what's the real benefit of the
Yes, and I would propose to revamp counters to keep all counters together in a packed manner.
Only if
I like Adam's idea, I agree that it is likely to perform better and it will be a simpler implementation. I'll rework the commit to implement this and to also improve how shard data is grouped to use less cache lines. I'll do some more performance measurements to compare implementations.
Currently a TLS access is faster than calling sched_getcpu() on x86_64 architectures, and a TLS access is likely to be much faster than sched_getcpu() on arm architectures (which can't implement the vDSO optimization).
…he line thrashing

Improve the performance of mempool sharding on systems with sched_getcpu(). The idea is to assign threads to shards based on which CPU core they are executing on, using sched_getcpu() to select the shard. All the threads executing on the same CPU core share the same shard. If each CPU core has its own shard there should be no cache line contention.

Fixes: https://tracker.ceph.com/issues/64100
Signed-off-by: Bill Scales <[email protected]>
This test case calls exit() to terminate a test mid flight to test recovery from crashes at different points in the code. However it does not stop threads before calling exit(), and consequently these threads can continue to access mempool structures that have gone out of scope and are freed by the exiting thread. The introduction of a unique_ptr into mempool causes these threads to assert when they access the freed memory. The simple fix is to call _exit() instead of exit() in the test case so that global destructors are not run.

Signed-off-by: Bill Scales <[email protected]>
Extra commit to fix the failing test unittest_bluefs_fs. This test had a use after free bug that was previously unnoticed, but with the change to mempool to use a unique_ptr it causes an assert. The test case is testing recovery after crashing bluefs at different points in making updates, so fixing the test case to stop all the threads before exiting is liable to reduce the coverage of the test. Instead I've changed its use of exit to _exit to avoid calling global destructors.

jenkins test api

jenkins test make check arm64

jenkins test rook e2e

jenkins test dashboard cephadm

jenkins test api

jenkins test api

@rzarzynski @ljflores this PR was approved
static size_t num_shard_bits;
static size_t num_shards;

std::unique_ptr<shard_t[]> shards = std::make_unique<shard_t[]>(get_num_shards());
@bill-scales seeing some crashes in 'make check arm64' when indexing into this array: https://tracker.ceph.com/issues/71027 assigned to you
A change to mempool ceph#55696 has exposed a use after free bug in crushtool during process exit, where dtors are being called to free up mempool data structures at the same time that the ceph context service thread is trying to update them. This commit modifies crushtool's initialization to prevent this (unneeded) thread from being created. See the issue for more details about why the thread was not terminating.
Fixes: https://tracker.ceph.com/issues/71027
Signed-off-by: Bill Scales <[email protected]>

A change to mempool ceph#55696 has exposed a use after free bug in crushtool during process exit, where dtors are being called to free up mempool data structures at the same time that the ceph context service thread is trying to update them. This commit modifies crushtool's initialization to prevent this (unneeded) thread from being created. See the issue for more details about why the thread was not terminating.
Fixes: https://tracker.ceph.com/issues/71027
Signed-off-by: Bill Scales <[email protected]>
(cherry picked from commit 2f3ffff)

A change to mempool ceph#55696 has exposed a use after free bug in crushtool during process exit, where dtors are being called to free up mempool data structures at the same time that the ceph context service thread is trying to update them. This commit modifies crushtool's initialization to prevent this (unneeded) thread from being created. See the issue for more details about why the thread was not terminating.
Fixes: https://tracker.ceph.com/issues/71027
Signed-off-by: Connor Fawcett <[email protected]>
(cherry picked from commit 2f3ffff)
Checklist
Explanation of changes
The old code used a hash of pthread_self() to select a shard for each thread, to reduce how many threads accessed the same statistics. Cache line thrashing still occurred when multiple threads were assigned the same shard. On some platforms/architectures the hash could be very ineffective and caused vast amounts of thrashing. A unit test to detect ineffective distributions was disabled because it regularly caused failures.
The change uses sched_getcpu() to select a shard, which almost eliminates cache line thrashing because per shard stats are now only accessed by one core, except in the very rare scenario of a thread context switch occurring between the call to sched_getcpu() and the update of the stats. There are some slightly better performing implementations (see performance section below) but we chose this solution for simplicity and robustness.
On platforms (e.g. aarch64) that don’t support sched_getcpu() a thread local variable assigns each thread a shard using a round robin distribution. This is a much better distribution than the pthread_self hash. The unit test is re-enabled.
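A round-robin fallback of this kind can be sketched with a global atomic counter and a thread_local, roughly as below. The shard count and function name are illustrative assumptions, not the patch's actual code:

```cpp
#include <atomic>
#include <cstddef>

static constexpr std::size_t kNumShards = 32;        // assumed shard count
static std::atomic<std::size_t> next_shard{0};

// Each new thread takes the next shard from the global counter, giving a
// round-robin distribution across shards regardless of how thread ids hash.
std::size_t pick_a_shard_round_robin() {
  // Initialized once per thread; subsequent calls are a cheap TLS read.
  static thread_local std::size_t my_shard =
      next_shard.fetch_add(1, std::memory_order_relaxed) % kNumShards;
  return my_shard;
}
```

Unlike a pthread_self() hash, this guarantees the first kNumShards threads all land on distinct shards, which is why the distribution unit test can be re-enabled.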
num_shards and num_shard_bits were constants but have been changed to variables; the first call to get_num_shards() selects appropriate values based on how many CPU cores the system has. This will reduce cache line thrashing on systems with higher core counts at the cost of slightly more memory usage (these systems are very likely to have more memory, and it's only a few KB of extra memory per daemon/client).
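Sizing the shard count from the core count might look like the sketch below: round the number of hardware threads up to a power of two so a shard can be selected with a cheap mask. The cap of 7 bits (128 shards) and the function names are assumptions for illustration, not the values the patch uses:

```cpp
#include <cstddef>
#include <thread>

// Round the core count up to a power of two, capped at 2^7 = 128 shards.
std::size_t compute_num_shard_bits(std::size_t cores) {
  std::size_t bits = 0;
  while ((std::size_t{1} << bits) < cores && bits < 7)
    ++bits;
  return bits;
}

std::size_t compute_num_shards() {
  std::size_t cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;   // hardware_concurrency() may return 0 on failure
  return std::size_t{1} << compute_num_shard_bits(cores);
}
```

Keeping num_shards a power of two means shard selection stays a single AND with ((1 << num_shard_bits) - 1), as in the diff above.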
For class pool_t: statistics counters for each shard are stored in separate cache lines (no change), but because the number of shards has increased and become a variable, a new global shards structure is added with a two-dimensional array[num_shards][num_pools] of stats to reduce the amount of padding to cache line boundaries; this is now per shard rather than per pool per shard. The shard array is removed from the class, and a pool index is added to make it easy to index into the global array.
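The layout idea can be illustrated as follows; the pool count, field names, and helper functions are hypothetical, the point being that padding to a cache line is paid once per shard row rather than once per (pool, shard) pair:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

static constexpr std::size_t kNumPools = 4;   // assumed pool count

// One cache-line-aligned row per shard, carrying every pool's counters.
struct alignas(64) shard_t {
  std::uint64_t bytes[kNumPools] = {};
  std::uint64_t items[kNumPools] = {};
};

// Global [num_shards] array, sized at runtime now that num_shards is variable.
std::unique_ptr<shard_t[]> make_shards(std::size_t num_shards) {
  return std::make_unique<shard_t[]>(num_shards);
}

// Summarizing a pool's stats walks one column of the 2-D layout.
std::uint64_t total_items(const shard_t *shards, std::size_t num_shards,
                          std::size_t pool) {
  std::uint64_t total = 0;
  for (std::size_t s = 0; s < num_shards; ++s)
    total += shards[s].items[pool];
  return total;
}

// Small demonstration of updating two shards and summarizing one pool.
std::uint64_t demo_total() {
  auto shards = make_shards(4);
  shards[0].items[1] = 3;
  shards[2].items[1] = 2;
  return total_items(shards.get(), 4, 1);
}
```

Each updating core touches only its own shard_t row, so rows never bounce between cores; only the infrequent summarizing walk reads them all.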
For struct type_t (which stores pool statistics for each type in "debug" builds, but is actually used for several types in production builds): Crimson !WITH_ALIEN had replaced a single set of statistics with a set per shard to reduce cache line contention. This change has been adopted for all builds. Because num_shards is now a variable, the array of shards becomes a dynamic allocation.
Crimson changes
For !WITH_ALIEN the call to sched_getcpu() used to be cached in a thread local because the ioreactor threads were bound with an affinity of 1 core. The ioreactors are no longer guaranteed to have this affinity so we revert to calling sched_getcpu() every time.
Mempool used the ceph_atomic template, which for !WITH_ALIEN builds used normal types instead of atomic types. Because ioreactors are no longer guaranteed to have affinity with one core this is not safe, so the code has been changed back to use std::atomic.
With these changes there is no longer any Crimson specific code in mempool.
Performance benchmarks
Let's start with some benchmarks to show the various costs of fragments of code. The objective is to show that the overheads of performing atomic accesses to a cache line that is dirty in another CPU core are far worse than the overheads of calling sched_getcpu():
These tests were run on a dual socket Intel Broadwell x86_64 system with no attempt to optimize the code being generated by the compiler, timings will vary on different systems and different architectures but it gives an idea of relative costs. The results are averages of many iterations and represent the cost when code+data is in the CPU L1 cache, in practice costs are likely to be higher because of CPU cache misses.
Calling sched_getcpu() to find the current CPU: 32 cycles
Calling pthread_self() and hashing the result: 8 cycles
Calling a function to return a thread_local int (local-exec): 5 cycles
Calling a function to return a thread_local int (global-dynamic): 15 cycles
Calling a function to return rdpid (assembly instruction for newer x86 systems): 5 cycles
Update: The thread_local int test was performed using a micro benchmark in a single executable which uses TLS local-exec which is the fastest way to access a thread local variable. It was pointed out that this is perhaps not representative and that we should really be comparing to TLS global-dynamic which is the worst case which happens when the thread local variable is in a dynamic library. I've added this extra benchmark which indeed shows some extra overheads.
Update: Investigated using rdpid as a faster alternative to sched_getcpu(). It appears that the rdpid instruction executes in 1 cycle on Intel Icelake CPUs, in this benchmark with the additional overheads of a function call it is being measured as 5 cycles. This is significantly faster than calling sched_getcpu() even though this is only reading a variable from the vDSO area on x86 systems.
Atomic increment of a memory variable: 19 cycles
Non-atomic increment of a memory variable: 5 cycles
Atomic increment with 8 threads (different cores, two sockets) accessing the same cache line in series: 142 cycles intra-socket, 400 cycles inter-socket
This was a test with multiple threads performing atomic increments on the same cache line, with serialization to ensure each thread did an increment in turn, hence forcing the cache line to move between CPU cores each time but with no contention. The cost of moving the cache line is higher when it needs to be moved between sockets.
Atomic increment with 2 threads (different cores of same socket) accessing the same cache line in parallel: 75 cycles
Atomic increment with 4 threads (different cores of same socket) accessing the same cache line in parallel: 166 cycles
Atomic increment with 6 threads (different cores of same socket) accessing the same cache line in parallel: 253 cycles
Atomic increment with 8 threads (different cores of same socket) accessing the same cache line in parallel: 330 cycles
Atomic increment with 8 threads (different cores, two sockets) accessing the same cache line in parallel: 350 - 435 cycles
This was a test with multiple threads performing atomic increments within the same cache line to cause the cache line to ping-pong between cores. The test ensured the threads all ran at the same time and generated contention where multiple threads are trying to lock the same cache line. The results suggest that threads were sometimes able to perform more than one increment before a cache line moved, but that when there is a lot of contention not only is there a cost of moving the cache line but also a cost of waiting for the cache line to become available.
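The shape of this experiment, contrasting a shared counter (worst-case ping-pong) with cache-line-padded per-thread counters (no sharing), can be sketched as below. This is an illustrative harness, not the actual benchmark; thread/iteration counts are arbitrary and real measurements would add rdtsc or std::chrono timing around the loops:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct alignas(64) padded_counter {           // each counter owns a cache line
  std::atomic<std::uint64_t> v{0};
};

// Every thread hammers one shared atomic: the line ping-pongs between cores.
std::uint64_t run_shared(int threads, int iters) {
  std::atomic<std::uint64_t> shared{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < threads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < iters; ++i)
        shared.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto &th : ts) th.join();
  return shared.load();
}

// Each thread increments its own padded counter: no cache line sharing.
std::uint64_t run_sharded(int threads, int iters) {
  std::vector<padded_counter> counters(threads);
  std::vector<std::thread> ts;
  for (int t = 0; t < threads; ++t)
    ts.emplace_back([&, t] {
      for (int i = 0; i < iters; ++i)
        counters[t].v.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto &th : ts) th.join();
  std::uint64_t total = 0;
  for (auto &c : counters) total += c.v.load();
  return total;
}
```

Both variants produce the same total, so any measured difference is purely the cost of cache line movement, which is what the cycle counts above quantify.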
I also did some testing on a Mac M1 (only aarch64 system I’ve got access to), which shows that the costs are pretty similar. Mac does not have sched_getcpu() or methods for setting or testing thread affinity, so we need to assume that the kernel scheduler made sensible choices about where threads were run:
Calling pthread_self() and hashing the result: 4.6ns
Calling a function to return a thread_local int: 5.7ns
Atomic increment of a memory variable: 14.8ns
Non-atomic increment of a memory variable: 4.3ns
Atomic increment with 8 threads (same socket) accessing the same cache line in series: 276ns
Atomic increment with 2 threads (same socket) accessing the same cache line in parallel: 35ns
Atomic increment with 4 threads (same socket) accessing the same cache line in parallel: 140ns
Atomic increment with 6 threads (same socket) accessing the same cache line in parallel: 258ns
Atomic increment with 8 threads (same socket) accessing the same cache line in parallel: 276ns
Conclusions
Comparison of different algorithms
There is an existing test case test_c2c which does a good job of measuring the performance of the mempool sharding code. It can be run with different numbers of threads and with/without sharding. I tried a variety of different algorithms and ran tests to compare the performance with different numbers of threads. The test system had 32 cores (dual socket, 8 cores + hyperthreading per socket). Tests were run with up to 128 threads to show behavior both when there are fewer threads than cores and when there are more threads than cores.
The X axis is the number of threads, the Y axis is the number of operations completed in 30 seconds (higher is better):
This test case is perhaps a bit unrepresentative of real world performance because the threads are running without yielding and once all cores are busy there is little benefit in the kernel scheduler migrating threads.
Discounted implementations
An implementation that caches the result of sched_getcpu() in a thread local variable gives marginally better performance. An idea by Adam Kupczyk (see comments below) can cheaply detect when a thread is migrated to a different core and update the cached value. This implementation was discounted because it was considered too complex for the minor benefit.
An implementation that uses rdpid (a relatively new x86_64 instruction) instead of sched_getcpu() also gives marginally better performance. This implementation was discounted because it's hardware specific and needs code to detect whether the system supports it. Hopefully the library implementation of sched_getcpu() will be updated to use this in the future, giving sched_getcpu() the same performance.
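The required support detection is a CPUID check: RDPID is advertised in CPUID.(EAX=07H,ECX=0):ECX bit 22. A hedged sketch of just the detection half (executing rdpid itself, via inline asm or _rdpid_u32(), is omitted; the function name is illustrative):

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   // GCC/Clang x86 intrinsic header

// Query CPUID leaf 7, subleaf 0; RDPID support is ECX bit 22.
bool cpu_supports_rdpid() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return false;                    // leaf 7 not available on this CPU
  return (ecx >> 22) & 1;
}
#else
// rdpid is x86-only; other architectures would use sched_getcpu() or TLS.
bool cpu_supports_rdpid() { return false; }
#endif
```

A run-time check like this (rather than a compile-time one) is what makes the rdpid path awkward for a portable code base, which is part of why it was discounted.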
While micro benchmarks show these other implementations to have better performance, it's very unlikely that this would give a measurable improvement to Ceph I/O performance because mempool is only a tiny fraction of the I/O path.
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e