mempool: avoid true sharing for the counters, with crimson#53130
Conversation
I'm curious why mempool misbehaves so much in the multi-core environment. The `shard_t` structs are already sharded, cache-line-aligned, and they should follow any core change:

```cpp
T* allocate(size_t n, void *p = nullptr) {
  // ...
  shard_t *shard = pool->pick_a_shard();
  shard->bytes += total;
  shard->items += n;
  // ...
}
```
```cpp
void deallocate(T* p, size_t n) {
  // ...
  shard_t *shard = pool->pick_a_shard();
  // ...
}
```

```cpp
// Align shard to a cacheline.
//
// It would be possible to retrieve the value at runtime (for instance
// with getconf LEVEL1_DCACHE_LINESIZE or grep -m1 cache_alignment
// /proc/cpuinfo). It is easier to hard code the largest cache
// linesize for all known processors (128 bytes). If the actual cache
// linesize is smaller on a given processor, it will just waste a few
// bytes.
//
struct shard_t {
  ceph::atomic<size_t> bytes = {0};
  ceph::atomic<size_t> items = {0};
  char __padding[128 - sizeof(ceph::atomic<size_t>)*2];
} __attribute__ ((aligned (128)));
static_assert(sizeof(shard_t) == 128, "shard_t should be cacheline-sized");
```

Maybe the shard selection runs into tons of collisions?
```cpp
// we shard pool stats across many shard_t's to reduce the amount
// of cacheline ping pong.
enum {
  num_shard_bits = 5
};
enum {
  num_shards = 1 << num_shard_bits
};

static size_t pick_a_shard_int() {
  // Dirt cheap, see:
  // https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
  size_t me = (size_t)pthread_self();
  size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
  return i;
}

shard_t* pick_a_shard() {
  size_t i = pick_a_shard_int();
  return &shard[i];
}
```
src/common/mempool.cc (outdated):

```cpp
#include "include/mempool.h"
#include "include/demangle.h"

#ifndef WITH_SEASTAR
```
What about the alienized BlueStore? It has plenty of internal structs stored in memory, and the stats often turn out to be very, very useful / crucial when it comes to field support.
If we comment out mempool accounting, then how do we account for the memory consumption of a set of containers in crimson?
I think the discussion was that disabling mempool within crimson is acceptable, but we are still finding out in this PR whether there is a quick and better solution.
The fact that commenting out

```cpp
mempool::pool_t& mempool::get_pool(mempool::pool_index_t ix)
{
  // We rely on this array being initialized before any invocation of
  // this function, even if it is called by ctors in other compilation
  // units that are being initialized before this compilation unit.
  static mempool::pool_t table[num_pools];
  return table[ix];
}
```

Another hypothesis: this …
Seems not really, because the hotspot is at …
Yeah, the:

```cpp
extern "C"
int __cxa_guard_acquire (__guard *g)
{
  // ...
  if (__gnu_cxx::__is_single_threaded())
    {
      // No need to use atomics, and no need to wait for other threads.
      int *gi = (int *) (void *) g;
      if (*gi == 0)
        {
          *gi = _GLIBCXX_GUARD_PENDING_BIT;
          return 1;
        }
      else
        throw_recursive_init_exception();
    }
  else
    {
      int *gi = (int *) (void *) g;
      const int guard_bit = _GLIBCXX_GUARD_BIT;
      const int pending_bit = _GLIBCXX_GUARD_PENDING_BIT;
      const int waiting_bit = _GLIBCXX_GUARD_WAITING_BIT;
      while (1)
        {
          int expected(0);
          if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,
                                          __ATOMIC_ACQ_REL,
                                          __ATOMIC_ACQUIRE))
            {
              // This thread should do the initialization.
              return 1;
            }
          if (expected == guard_bit)
            {
              // Already initialized.
              return 0;
            }
```
Yeah, the overhead of an atomic may vary a lot – from unpleasant (AFAIK it always flushes the CPU's memory queues, as cache coherency works at the level of, well, caches) to terribly bad (ping-pongs). Unfortunately, there are multiple caveats. First, the number of shards is constant, limited at compile-time, and the limit isn't large (32 slots IIRC, which nicely correlates with the chart). You might want to increase it. Second, the shard selection function uses `pthread_self()`. Finally, the usage of …
Let's not exclude that. Regards,
Looks like 19 out of 20 reactors are picking shard-31, collisions! I'll go forward to specialize … Or do you think the case should be … Let me know whether it is reasonable to you.
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
Huh, this is a smoking gun! I tested on the classical OSD & crimson with a dead simple patch:

```diff
diff --git a/src/common/mempool.cc b/src/common/mempool.cc
index 79354f70821..f17245cf795 100644
--- a/src/common/mempool.cc
+++ b/src/common/mempool.cc
@@ -15,6 +15,14 @@
 #include "include/mempool.h"
 #include "include/demangle.h"
+#include <iostream>
+
 // Thread local variables should save index, not &shard[index],
 // because shard[] is defined in the class
 static thread_local size_t thread_shard_index = mempool::num_shards;
@@ -95,7 +103,11 @@ size_t mempool::pool_t::allocated_items() const
 void mempool::pool_t::adjust_count(ssize_t items, ssize_t bytes)
 {
-  thread_shard_index = (thread_shard_index == num_shards) ? pick_a_shard_int() : thread_shard_index;
+  if (thread_shard_index == num_shards) {
+    thread_shard_index = pick_a_shard_int();
+    std::cerr << "assigning thread id " << (void*)pthread_self() << " shard id "
+              << thread_shard_index << std::endl;
+  }
```

On my system the distribution looks far, far better.

ceph-osd (defaults): (chart)

crimson-osd (only 8 reactors + BlueStore): (chart)

It looks like the assumptions behind …
I'm not sure / I doubt alien threads should be calling …
@cyx1231st: I think the problem isn't crimson-specific. I started poking with … Would you mind giving a try to main...rzarzynski:ceph:wip-mempool-cpulocalstorage?
Thanks! I'll let you know the results.
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
Looks like sched_getcpu() is working as expected! However, I see there are still other caveats. I'll incorporate the proposed commit and push another version for further discussion.
force-pushed: 53831ff to 30a24fc (compare)
@rzarzynski After excluding the updates to … The core number with Crimson can be determined, so I added asserts to make sure collisions won't happen in the case of crimson; see 30a24fc. What do you think?
Hi @cyx1231st! Huh! So there is one unsharded counter left: … There are also other things we can do to further improve mempool based on the cpu local storage.
```cpp
struct shard_t {
  ceph::atomic<size_t> bytes = {0};
  ceph::atomic<size_t> items = {0};
  char __padding[128 - sizeof(ceph::atomic<size_t>)*2];
} __attribute__ ((aligned (128)));
```

… to plain `size_t`.
I'm not sure why
I'm actually not sure whether dropping the 128-byte padding is reasonable, because doing so will make multiple seastar reactors update the same cacheline. This seems to me to bring back the "false sharing" issue again; quote from wiki:
My first attempt before the RFC #52404 was to change them to plain size_t, but the performance didn't change, so I finally decided to drop them entirely. That seems to match the description above: the issue is actually due to multiple cores trying to update the same cacheline, not the atomic type itself. Does it make sense?
There is more of a twist to that. We don't need to protect the shards-of-different-pools assigned to the same CPU, but we do need to protect against mixing shards-of-the-same-pool between different CPUs. However, the change – as it rearranges data structures – would be far bigger than the small-but-wrong commit :-(. At the moment we have: … We could have: … I think the naming (cpu local storage) is misleading. What the runtime offers is far from thread local storage. We aren't getting an easy-to-use area of memory but just a selector…
Yes, it was reasonable when the cost was dominated by ping-pongs. When the collisions are clear and the … BTW: just pushed the sharding for …
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
I think the direction for the fixes is clear by making …
Agree. It would be interesting to see how … And I just realized that the term "false sharing" is incorrect in the title.
force-pushed: b1803d7 to e4c54ec (compare)
@ljflores: the changes to classical are minimal, so I think the PR can be glued to any testing branch without imposing extra risk. |
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
It looks like this counter was initially thought to be a debug one but -- because of the `force_register` flag -- it is widely used now. Found by Yingxin Cheng <[email protected]> Signed-off-by: Radosław Zarzyński <[email protected]>
Signed-off-by: Yingxin Cheng <[email protected]>
Signed-off-by: Yingxin Cheng <[email protected]>
Signed-off-by: Yingxin Cheng <[email protected]>
force-pushed: e4c54ec to 931894f (compare)
rebased
jenkins test make check
@ljflores do the above test results look good to you?
Seems to me they are known and unrelated. @Matan-B, do the crimson test results look good to you, too?
make check failure:
jenkins test make check
jenkins test windows
Test results LGTM.
Thanks for the suggestions,
jenkins test windows
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
Hey @cyx1231st, the rados failures are all unrelated, tracked by:
LGTM!
I'll merge if no other concerns.

Follow up #52404 (as REF)
By globally disabling mempool counters, the peak performance (OPT, this PR) is ~1.8x better than REF and ~6.7x better than Async for the same set of tests:
(performance chart)