mempool: avoid true sharing for the counters, with crimson#53130
Conversation
I'm curious why mempool misbehaves so much in the multi-core environment. The `shard_t` structs are already sharded, cache-line-aligned, and they should follow any core change:

```cpp
T* allocate(size_t n, void *p = nullptr) {
  // ...
  shard_t *shard = pool->pick_a_shard();
  shard->bytes += total;
  shard->items += n;
  // ...
}
```
```cpp
void deallocate(T* p, size_t n) {
  // ...
  shard_t *shard = pool->pick_a_shard();
  // ...
}
```

```cpp
// Align shard to a cacheline.
//
// It would be possible to retrieve the value at runtime (for instance
// with getconf LEVEL1_DCACHE_LINESIZE or grep -m1 cache_alignment
// /proc/cpuinfo). It is easier to hard code the largest cache
// linesize for all known processors (128 bytes). If the actual cache
// linesize is smaller on a given processor, it will just waste a few
// bytes.
//
struct shard_t {
  ceph::atomic<size_t> bytes = {0};
  ceph::atomic<size_t> items = {0};
  char __padding[128 - sizeof(ceph::atomic<size_t>)*2];
} __attribute__ ((aligned (128)));
static_assert(sizeof(shard_t) == 128, "shard_t should be cacheline-sized");
```

Maybe the shard selection runs into tons of collisions?
```cpp
// we shard pool stats across many shard_t's to reduce the amount
// of cacheline ping pong.
enum {
  num_shard_bits = 5
};
enum {
  num_shards = 1 << num_shard_bits
};

static size_t pick_a_shard_int() {
  // Dirt cheap, see:
  // https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
  size_t me = (size_t)pthread_self();
  size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
  return i;
}

shard_t* pick_a_shard() {
  size_t i = pick_a_shard_int();
  return &shard[i];
}
```
src/common/mempool.cc (outdated):

```cpp
#include "include/mempool.h"
#include "include/demangle.h"

#ifndef WITH_SEASTAR
```
What about the alienized BlueStore? It has plenty of internal structs stored in memory, and the stats often turn out to be very, very useful / crucial when it comes to field support.
If we comment out mempool accounting, then how do we account for the memory consumption of a set of containers in crimson?
I think the discussion was that disabling mempool within crimson is acceptable, but we are still finding out in this PR whether there is a quick and better solution.
The fact that commenting out

```cpp
mempool::pool_t& mempool::get_pool(mempool::pool_index_t ix)
{
  // We rely on this array being initialized before any invocation of
  // this function, even if it is called by ctors in other compilation
  // units that are being initialized before this compilation unit.
  static mempool::pool_t table[num_pools];
  return table[ix];
}
```

Another hypothesis: this …
Seems not really, because the hotspot is at …
Yeah, the:

```cpp
extern "C"
int __cxa_guard_acquire (__guard *g)
{
  // ...
  if (__gnu_cxx::__is_single_threaded())
    {
      // No need to use atomics, and no need to wait for other threads.
      int *gi = (int *) (void *) g;
      if (*gi == 0)
        {
          *gi = _GLIBCXX_GUARD_PENDING_BIT;
          return 1;
        }
      else
        throw_recursive_init_exception();
    }
  else
    {
      int *gi = (int *) (void *) g;
      const int guard_bit = _GLIBCXX_GUARD_BIT;
      const int pending_bit = _GLIBCXX_GUARD_PENDING_BIT;
      const int waiting_bit = _GLIBCXX_GUARD_WAITING_BIT;
      while (1)
        {
          int expected(0);
          if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,
                                          __ATOMIC_ACQ_REL,
                                          __ATOMIC_ACQUIRE))
            {
              // This thread should do the initialization.
              return 1;
            }
          if (expected == guard_bit)
            {
              // Already initialized.
              return 0;
            }
```
Yeah, the overhead of an atomic may vary a lot – from unpleasant (AFAIK it always flushes the CPU's memory queues, as cache coherency works at the level of, well, caches) to terribly bad (ping-pongs). Unfortunately, there are multiple caveats. First, the number of shards is constant, limited at compile-time, and the limit isn't large (32 slots IIRC, which nicely correlates with the chart). You might want to increase it. Second, the shard selection function uses `pthread_self()`. Finally, the usage of …
Let's not exclude that. Regards,
Looks like 19 out of 20 reactors are picking shard-31, collisions! I'll go forward to specialize … Or do you think the case should be … Let me know whether it is reasonable to you.
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
Huh, this is a smoking gun! I tested on the classical OSD & crimson with a dead simple patch:

```diff
diff --git a/src/common/mempool.cc b/src/common/mempool.cc
index 79354f70821..f17245cf795 100644
--- a/src/common/mempool.cc
+++ b/src/common/mempool.cc
@@ -15,6 +15,14 @@
 #include "include/mempool.h"
 #include "include/demangle.h"
+#include <iostream>
+
 // Thread local variables should save index, not &shard[index],
 // because shard[] is defined in the class
 static thread_local size_t thread_shard_index = mempool::num_shards;
@@ -95,7 +103,11 @@ size_t mempool::pool_t::allocated_items() const
 void mempool::pool_t::adjust_count(ssize_t items, ssize_t bytes)
 {
-  thread_shard_index = (thread_shard_index == num_shards) ? pick_a_shard_int() : thread_shard_index;
+  if (thread_shard_index == num_shards) {
+    thread_shard_index = pick_a_shard_int();
+    std::cerr << "assigning thread id " << (void*)pthread_self() << " shard id "
+              << thread_shard_index << std::endl;
+  }
```

On my system the distribution looks far, far better.

ceph-osd (defaults): (chart)

crimson-osd (only 8 reactors + BlueStore): (chart)

It looks like the assumptions behind …
I'm not sure / I doubt alien threads should be calling …
@cyx1231st: I think the problem isn't crimson-specific. I started poking with … Would you mind giving a try to main...rzarzynski:ceph:wip-mempool-cpulocalstorage?
Thanks! I'll let you know the results.
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
Looks like sched_getcpu() is working as expected! However, I see there are still other caveats. I'll incorporate the proposed commit and push another version for further discussion.
force-pushed: 53831ff to 30a24fc (compare)
@rzarzynski After excluding the updates to … The core number with Crimson can be determined, so I added asserts to make sure collisions won't happen in the case of crimson; see 30a24fc. What do you think?
Hi @cyx1231st! Huh! So there is one unsharded counter left: … There are also other things we can do to further improve mempool based on the cpu local storage.
```cpp
struct shard_t {
  ceph::atomic<size_t> bytes = {0};
  ceph::atomic<size_t> items = {0};
  char __padding[128 - sizeof(ceph::atomic<size_t>)*2];
} __attribute__ ((aligned (128)));
```

… to plain `size_t`.
I'm not sure why
I'm actually not sure whether dropping the 128-byte padding is reasonable, because doing so will make multiple seastar reactors update the same cacheline. This seems to me to bring back the "false sharing" issue again; quote from wiki:
My first attempt before the RFC #52404 was to change them to plain size_t, but the performance didn't change, so I finally decided to drop them entirely. That seems to match the description above: the issue is actually due to multiple cores trying to update the same cacheline, not the atomic type itself. Does it make sense?
There is more of a twist to that. We don't need to protect the shards-of-different-pools assigned to the same CPU, but we do need to protect against mixing shards-of-the-same-pool between different CPUs. However, the change – as it rearranges data structures – would be far bigger than the small-but-wrong commit :-(. At the moment we have: … We could have: … I think the naming (cpu local storage) is misleading. What the runtime offers is far from thread local storage. We aren't getting an easy-to-use area of memory but just a selector…
Yes, it was reasonable when the cost was dominated by ping-pongs. When the collisions are clear and the … BTW: just pushed the sharding for …
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
I think the direction for the fixes is clear by making …
Agree. It would be interesting to see how … And I just realized that the term "false sharing" is incorrect in the title.
force-pushed: b1803d7 to e4c54ec (compare)
@ljflores: the changes to classical are minimal, so I think the PR can be glued to any testing branch without imposing extra risk. |
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
It looks like this counter was initially thought to be a debug one but -- because of the `force_register` flag -- it is widely used now. Found by Yingxin Cheng <[email protected]> Signed-off-by: Radosław Zarzyński <[email protected]>
Signed-off-by: Yingxin Cheng <[email protected]>
Signed-off-by: Yingxin Cheng <[email protected]>
Signed-off-by: Yingxin Cheng <[email protected]>
force-pushed: e4c54ec to 931894f (compare)
rebased
jenkins test make check
@ljflores do the above test results look good to you?
Seems to me they are known and unrelated. @Matan-B, do the crimson test results look good to you, too?
make check failure:
jenkins test make check
jenkins test windows
Test results LGTM.
Thanks for the suggestions,
jenkins test windows
ceph#53130 (comment) Signed-off-by: Radosław Zarzyński <[email protected]>
Hey @cyx1231st, the rados failures are all unrelated, tracked by:
LGTM!
I'll merge if no other concerns.

Follow up #52404 (as REF)
By globally disabling mempool counters, the peak performance (OPT, this PR) is ~1.8x better than REF and ~6.7x better than Async for the same set of tests:
(performance chart)