
mempool: avoid true sharing for the counters, with crimson#53130

Merged
cyx1231st merged 5 commits into ceph:main from cyx1231st:wip-crimson-drop-mempool
Dec 11, 2023

Conversation

@cyx1231st
Member

Follow-up to #52404 (referred to as REF).

By globally disabling mempool counters, the peak performance (OPT, this PR) is ~1.8x better than REF and ~6.7x better than Async for the same set of tests:

(figure: msgr performance comparison chart omitted)


Contributor

@rzarzynski rzarzynski left a comment


I'm curious why mempool misbehaves so much in the multi-core environment.

The shard_t structs are already sharded, cache-line-aligned and they should follow any core change:

  T* allocate(size_t n, void *p = nullptr) {
    // ...
    shard_t *shard = pool->pick_a_shard();
    shard->bytes += total;
    shard->items += n;
    // ...
  }

  void deallocate(T* p, size_t n) {
    // ...
    shard_t *shard = pool->pick_a_shard();
    shard->bytes -= total;  // decrements mirroring allocate()
    shard->items -= n;
    // ...
  }

// Align shard to a cacheline.
//
// It would be possible to retrieve the value at runtime (for instance
// with getconf LEVEL1_DCACHE_LINESIZE or grep -m1 cache_alignment
// /proc/cpuinfo). It is easier to hard code the largest cache
// linesize for all known processors (128 bytes). If the actual cache
// linesize is smaller on a given processor, it will just waste a few
// bytes.
//
struct shard_t {
  ceph::atomic<size_t> bytes = {0};
  ceph::atomic<size_t> items = {0};
  char __padding[128 - sizeof(ceph::atomic<size_t>)*2];
} __attribute__ ((aligned (128)));

static_assert(sizeof(shard_t) == 128, "shard_t should be cacheline-sized");

Maybe the shard selection runs into tons of collisions?

// we shard pool stats across many shard_t's to reduce the amount
// of cacheline ping pong.
enum {
  num_shard_bits = 5
};
enum {
  num_shards = 1 << num_shard_bits
};
  static size_t pick_a_shard_int() {
    // Dirt cheap, see:
    //   https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
    size_t me = (size_t)pthread_self();
    size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
    return i;
  }

  shard_t* pick_a_shard() {
    size_t i = pick_a_shard_int();
    return &shard[i];
  }
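As a sanity check on that hypothesis, the selection math can be reproduced standalone (a sketch: `pick_shard_from` is a hypothetical helper, not ceph code; `kPageShift` and `kNumShardBits` mirror the assumed `CEPH_PAGE_SHIFT` = 12 on x86-64 and `num_shard_bits` = 5). Any two `pthread_self()` values differing by a multiple of 2^(12+5) = 128 KiB are guaranteed to land on the same shard, which matters because glibc places thread descriptors at large, aligned strides:

```cpp
#include <cassert>
#include <cstddef>

// Assumed constants mirroring the snippet above: 4 KiB pages, 32 shards.
constexpr std::size_t kPageShift = 12;    // CEPH_PAGE_SHIFT on x86-64
constexpr std::size_t kNumShardBits = 5;  // num_shard_bits

// Hypothetical standalone mirror of pick_a_shard_int().
std::size_t pick_shard_from(std::size_t pthread_id) {
  return (pthread_id >> kPageShift) & ((std::size_t{1} << kNumShardBits) - 1);
}
```

Feeding it reactor pthread values reported later in this thread (most of which differ by exact multiples of 128 KiB) reproduces the observed pile-up on shard 31, while the one value not spaced at a 128 KiB multiple lands on shard 5.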

#include "include/mempool.h"
#include "include/demangle.h"

#ifndef WITH_SEASTAR
Contributor


What about the alienized BlueStore? It has plenty of internal structs stored in memory, and the stats often turn out to be very, very useful, even crucial, when it comes to field support.

Contributor


If we comment out mempool accounting, then how do we account for the memory consumption of a set of containers in crimson?

Member Author


I think the conclusion of the discussion was that disabling mempool within crimson is acceptable, but in this PR we are still exploring whether there is a quick and better solution.

@rzarzynski
Contributor

rzarzynski commented Aug 24, 2023

The fact that commenting out mempool::get_pool(mempool::pool_index_t(mempool)).adjust_count() helped still bothers me a lot.

mempool::pool_t& mempool::get_pool(mempool::pool_index_t ix)
{
  // We rely on this array being initialized before any invocation of
  // this function, even if it is called by ctors in other compilation
  // units that are being initialized before this compilation unit.
  static mempool::pool_t table[num_pools];
  return table[ix];
}

Another hypothesis: this static is guarded by a lock (even a "fast", atomic-based one) to handle a race during initialization. I'll take a look at the objdump to confirm.
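For reference, the guard in question comes from the function-local static. A minimal sketch (hypothetical names, not the ceph code) contrasts it with a namespace-scope table that is initialized before main() and therefore needs no __cxa_guard_acquire on the lookup path:

```cpp
#include <cstddef>

namespace sketch {

struct pool_t {
  std::size_t dummy;
  pool_t();  // user-provided, non-constexpr ctor forces dynamic init
};
pool_t::pool_t() : dummy(0) {}

constexpr std::size_t kNumPools = 4;  // assumed pool count

// Variant 1: function-local static. The compiler emits a guard check
// (the __cxa_guard_acquire path) around first use.
pool_t& get_pool_guarded(std::size_t ix) {
  static pool_t table[kNumPools];
  return table[ix];
}

// Variant 2: namespace-scope table, initialized during startup; the
// lookup compiles to a plain indexed access with no guard.
pool_t g_table[kNumPools];
pool_t& get_pool_unguarded(std::size_t ix) { return g_table[ix]; }

}  // namespace sketch
```

Both variants return stable references; the question raised here is only what the guarded variant costs on the hot path.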

@cyx1231st
Member Author

The fact that commenting out mempool::get_pool(mempool::pool_index_t(mempool)).adjust_count() helped still bothers me a lot.

The direct reason comes from analysis showing that this is the top hotspot at high CPU counts, that adding more CPUs makes things worse, and that commenting it out helps a lot:

(profiling screenshot omitted)

 static size_t pick_a_shard_int() {
    // Dirt cheap, see:
    //   https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
    size_t me = (size_t)pthread_self();
    size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
    return i;
  }

Now I see it picks the shard based on a thread-local value, which looks reasonable... Probably the buffer allocations are so frequent that they cannot tolerate any collisions?

I'll take a look at how shards are really picked at runtime.

@cyx1231st
Member Author

Another hypothesis: this static is guarded by a lock (even a "fast", atomic-based one) to handle race-during-initialization. Will take a look on objdump to confirm.

Seems not, really, because the hotspot is at std::__atomic_base<unsigned long>::operator+= and mempool.cc:100

@rzarzynski
Contributor

rzarzynski commented Aug 24, 2023

Yeah, the static lock in get_pool() is present but pretty well optimized:

000000003d9e8f7a <mempool::get_pool(mempool::pool_index_t)>:
    3d9e8f7a:   f3 0f 1e fa             endbr64 
    3d9e8f7e:   41 57                   push   %r15
    3d9e8f80:   41 56                   push   %r14
    3d9e8f82:   41 55                   push   %r13
    3d9e8f84:   41 54                   push   %r12
    3d9e8f86:   55                      push   %rbp
    3d9e8f87:   53                      push   %rbx
    3d9e8f88:   48 83 ec 08             sub    $0x8,%rsp
    3d9e8f8c:   41 89 fd                mov    %edi,%r13d
    3d9e8f8f:   0f b6 05 6a 0e cf 4b    movzbl 0x4bcf0e6a(%rip),%eax        # 896d9e00 <guard variable for mempool::get_pool(mempool::pool_index_t)::table>
    3d9e8f96:   84 c0                   test   %al,%al
    3d9e8f98:   74 53                   je     3d9e8fed <mempool::get_pool(mempool::pool_index_t)+0x73>
# the hot path here
    3d9e8f9a:   49 63 c5                movslq %r13d,%rax
# ...
    3d9e8fed:   48 8d 3d 0c 0e cf 4b    lea    0x4bcf0e0c(%rip),%rdi        # 896d9e00 <guard variable for mempool::get_pool(mempool::pool_index_t)::table>
    3d9e8ff4:   e8 17 91 5a fb          callq  38f92110 <__cxa_guard_acquire@plt>
    3d9e8ff9:   85 c0                   test   %eax,%eax
    3d9e8ffb:   74 9d                   je     3d9e8f9a <mempool::get_pool(mempool::pool_index_t)+0x20>
# slow, lockish init when __cxa_guard_acquire() returns 1

  extern "C"
  int __cxa_guard_acquire (__guard *g)
  {
    // ...
    if (__gnu_cxx::__is_single_threaded())
      {
        // No need to use atomics, and no need to wait for other threads.
        int *gi = (int *) (void *) g;
        if (*gi == 0)
          {
            *gi = _GLIBCXX_GUARD_PENDING_BIT;
            return 1;
          }
        else
          throw_recursive_init_exception();
      }
    else
      {
        int *gi = (int *) (void *) g;
        const int guard_bit = _GLIBCXX_GUARD_BIT;
        const int pending_bit = _GLIBCXX_GUARD_PENDING_BIT;
        const int waiting_bit = _GLIBCXX_GUARD_WAITING_BIT;

        while (1)
          {
            int expected(0);
            if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,
                                            __ATOMIC_ACQ_REL,
                                            __ATOMIC_ACQUIRE))
              {
                // This thread should do the initialization.
                return 1;
              }

            if (expected == guard_bit)
              {
                // Already initialized.
                return 0;
              }

@rzarzynski
Contributor

rzarzynski commented Aug 24, 2023

The direct reason is from analysis that this is the top hotspot at high CPU count, adding more CPUs will make things worse, and commenting it out helps a lot

Yeah, the overhead of an atomic may vary a lot – from unpleasant (AFAIK it always flushes CPU's memory queues as the cache coherency works at the level of, well, caches) to terribly bad (ping-pongs). The mempool seems to be made in a way that should locate it closer to the positive end of this spectrum but only under the assumption the sharding works fine.

Unfortunately, there are multiple caveats. First, the number of shards is constant, limited at compile time, and the limit isn't large (32 slots IIRC, which nicely correlates with the chart). You might want to increase it.

Second, the shard selection function uses pthread_self() and some bit shifts to derive the selector. In other words, it draws assumptions on which of the returned bits are meaningful. This seems fragile.

Finally, the usage of pthread_self (and thread-local storage in general) is just an approximation. The real intention is CPU-local storage (BTW: it would be cool if the kernel, on a context switch, put the ID of the core a thread is executing on in some cheap-to-access place; how costly is sched_getcpu() really?). I guess it's usually good enough, as it seems reasonable to expect a scheduler to minimize migrations of tasks between cores. However, there are scenarios where this trick stops holding: try to imagine many threads with asymmetric workloads executed by few CPU cores. I think CPU contention can easily exaggerate the penalties.
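On Linux, sched_getcpu() is serviced through the vDSO (typically no syscall), so a CPU-indexed selector is cheap to sketch (the shard count below is an assumption for illustration, not the ceph value):

```cpp
#include <sched.h>   // sched_getcpu() -- glibc/Linux
#include <cstddef>

constexpr std::size_t kNumShards = 128;  // assumed power-of-two shard count

// CPU-local selection: index shards by the core we are currently on.
// The result may be stale by the time it is used (the thread can
// migrate), but a stale index only costs a cache miss, not correctness.
std::size_t cpu_shard_index() {
  int cpu = sched_getcpu();
  if (cpu < 0) cpu = 0;  // fallback if the call is unsupported
  return static_cast<std::size_t>(cpu) & (kNumShards - 1);
}
```

The staleness tolerance is the key design point: unlike a lock, a mispredicted CPU index never corrupts the counters, it merely ping-pongs one cache line.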

Probably the buffer allocations are so frequent that cannot tolerate any collisions?

Let's not exclude that.

Regards,
Radek

@cyx1231st
Member Author

@rzarzynski

!!! pool=0x55b4ca5b5000 reactor=1 pthread=139964979344960 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=0 pthread=139964981729984 shard=5/32
!!! pool=0x55b4ca5b5000 reactor=2 pthread=139964973053504 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=3 pthread=139964966762048 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=4 pthread=139964960470592 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=6 pthread=139964947887680 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=9 pthread=139964929013312 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=5 pthread=139964954179136 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=7 pthread=139964941596224 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=8 pthread=139964935304768 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=11 pthread=139964916430400 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=10 pthread=139964922721856 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=13 pthread=139964903847488 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=16 pthread=139964884973120 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=12 pthread=139964910138944 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=19 pthread=139964866098752 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=17 pthread=139964878681664 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=18 pthread=139964872390208 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=14 pthread=139964897556032 shard=31/32
!!! pool=0x55b4ca5b5000 reactor=15 pthread=139964891264576 shard=31/32

It looks like 19 out of 20 reactors are picking shard 31: collisions!

I'll go ahead and specialize pick_a_shard_int() to use seastar::this_shard_id() as the shard index in the WITH_SEASTAR case, and increase num_shard_bits to 6, so collisions should become impossible.

Or do you think the case should be #if defined(WITH_SEASTAR) && !defined(WITH_ALIEN) ?

Let me know whether this sounds reasonable to you.
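A rough sketch of that proposal (hedged: the #if structure and constants are assumptions for discussion; seastar::this_shard_id() is the real Seastar call, and the fallback keeps the existing heuristic):

```cpp
#include <pthread.h>
#include <cstddef>

constexpr std::size_t kNumShardBits = 6;  // proposed bump from 5
constexpr std::size_t kNumShards = std::size_t{1} << kNumShardBits;
constexpr std::size_t kPageShift = 12;    // CEPH_PAGE_SHIFT on x86-64

std::size_t pick_a_shard_int() {
#if defined(WITH_SEASTAR) && !defined(WITH_ALIEN)
  // One reactor per slot: as long as the reactor count stays at or
  // below kNumShards (assertable at startup), no two reactors can
  // ever collide on a shard.
  return seastar::this_shard_id() & (kNumShards - 1);
#else
  // Classic path: keep the pthread_self()-based heuristic.
  std::size_t me = (std::size_t)pthread_self();
  return (me >> kPageShift) & (kNumShards - 1);
#endif
}
```

The mask keeps both branches in range, so widening num_shard_bits later would not change the call sites.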

rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Aug 25, 2023
@rzarzynski
Contributor

Huh, this is a smoking gun!

I tested on the classical OSD & crimson with a dead simple patch:

diff --git a/src/common/mempool.cc b/src/common/mempool.cc
index 79354f70821..f17245cf795 100644
--- a/src/common/mempool.cc
+++ b/src/common/mempool.cc
@@ -15,6 +15,14 @@
 #include "include/mempool.h"
 #include "include/demangle.h"
 
+#include <iostream>
+
 // Thread local variables should save index, not &shard[index],
 // because shard[] is defined in the class
 static thread_local size_t thread_shard_index = mempool::num_shards;
@@ -95,7 +103,11 @@ size_t mempool::pool_t::allocated_items() const
 
 void mempool::pool_t::adjust_count(ssize_t items, ssize_t bytes)
 {
-  thread_shard_index = (thread_shard_index == num_shards) ? pick_a_shard_int() : thread_shard_index;
+  if (thread_shard_index == num_shards) {
+    thread_shard_index = pick_a_shard_int();
+    std::cerr << "assigning thread id " << (void*)pthread_self() << " shard id "
+              << thread_shard_index << std::endl;
+  }

On my system the distribution looks far, far better.

ceph-osd

Defaults.

assigning thread id 0x7f1861d666c0 shard id 6
assigning thread id 0x7f1860524700 shard id 4
assigning thread id 0x7f185f522700 shard id 2
assigning thread id 0x7f1851173700 shard id 19
assigning thread id 0x7f1850972700 shard id 18
assigning thread id 0x7f184a165700 shard id 5
assigning thread id 0x7f184c169700 shard id 9
assigning thread id 0x7f1851161700 shard id 1
assigning thread id 0x7f184a153700 shard id 19
assigning thread id 0x7f184d159700 shard id 25
assigning thread id 0x7f185019f700 shard id 31
assigning thread id 0x7f185e520700 shard id 0
assigning thread id 0x7f183f13d700 shard id 29
assigning thread id 0x7f1843145700 shard id 5
assigning thread id 0x7f1842944700 shard id 4
assigning thread id 0x7f1841942700 shard id 2
assigning thread id 0x7f185fd23700 shard id 3
assigning thread id 0x7f183e93c700 shard id 28
assigning thread id 0x7f183c137700 shard id 23
assigning thread id 0x7f183d93a700 shard id 26
assigning thread id 0x7f183e13b700 shard id 27
assigning thread id 0x7f183f93e700 shard id 30
assigning thread id 0x7f183b936700 shard id 22
assigning thread id 0x7f183c938700 shard id 24
assigning thread id 0x7f1841141700 shard id 1
assigning thread id 0x7f1842143700 shard id 3
assigning thread id 0x7f184013f700 shard id 31
assigning thread id 0x7f183d139700 shard id 25
assigning thread id 0x7f1840940700 shard id 0


assigning thread id 0x7f185599e700 shard id 30
assigning thread id 0x7f185519d700 shard id 29
assigning thread id 0x7f184d95a700 shard id 26
assigning thread id 0x7f184c958700 shard id 24
assigning thread id 0x7f1860d25700 shard id 5
assigning thread id 0x7f184e19b700 shard id 27
assigning thread id 0x7f1851161700 shard id 1
assigning thread id 0x7f183b135700 shard id 21
assigning thread id 0x7f184e99c700 shard id 28
assigning thread id 0x7f185d4cd700 shard id 13

crimson-osd

Only 8 reactors + BlueStore.

assigning thread id 0x7f91dfeb5a40 shard id 21
assigning thread id 0x7f91dfeb5a40 shard id 21
assigning thread id 0x7f91ce294700 shard id 20
assigning thread id 0x7f91c6a8b700 shard id 11
assigning thread id 0x7f91c6274700 shard id 20
assigning thread id 0x7f91bf26d700 shard id 13
assigning thread id 0x7f91c1271700 shard id 17
assigning thread id 0x7f91c5f72700 shard id 18
assigning thread id 0x7f91bef64700 shard id 4
assigning thread id 0x7f91ceaa3700 shard id 3
assigning thread id 0x7f91cb1f9700 shard id 25
assigning thread id 0x7f91c1f6a700 shard id 10
assigning thread id 0x7f91c276b700 shard id 11
assigning thread id 0x7f91cc238700 shard id 24
assigning thread id 0x7f91c568e700 shard id 14
assigning thread id 0x7f91cda7d700 shard id 29
assigning thread id 0x7f91cca4f700 shard id 15
assigning thread id 0x7f91cd266700 shard id 6

It looks like the assumptions behind pthread_self() & the meaningful bits are truly fragile :-)).

Or do you think the case should be #if defined(WITH_SEASTAR) && !defined(WITH_ALIEN) ?

I'm not sure; I doubt alien threads should be calling seastar::this_shard_id().

@rzarzynski
Contributor

rzarzynski commented Aug 25, 2023

@cyx1231st: I think the problem isn't crimson-specific. I started poking at sched_getcpu() and it should be really, really cheap, while allowing us to minimize cache misses even when the shard selection function works as it should.

Would you mind giving a try to main...rzarzynski:ceph:wip-mempool-cpulocalstorage?

@cyx1231st
Member Author

Would you mind giving a try to main...rzarzynski:ceph:wip-mempool-cpulocalstorage?

Thanks! I'll let you know the results.

cyx1231st pushed a commit to cyx1231st/ceph that referenced this pull request Aug 28, 2023
@cyx1231st
Member Author

It looks like sched_getcpu() is working as expected!

!!! pool=0x562274a94000 reactor=2 getcpu=2 shard=2/128
!!! pool=0x562274a94000 reactor=3 getcpu=3 shard=3/128
!!! pool=0x562274a94000 reactor=4 getcpu=4 shard=4/128
!!! pool=0x562274a94000 reactor=5 getcpu=5 shard=5/128
!!! pool=0x562274a94000 reactor=8 getcpu=8 shard=8/128
!!! pool=0x562274a94000 reactor=7 getcpu=7 shard=7/128
!!! pool=0x562274a94000 reactor=6 getcpu=6 shard=6/128
!!! pool=0x562274a94000 reactor=9 getcpu=9 shard=9/128
!!! pool=0x562274a94000 reactor=10 getcpu=10 shard=10/128
!!! pool=0x562274a94000 reactor=12 getcpu=12 shard=12/128
!!! pool=0x562274a94000 reactor=14 getcpu=14 shard=14/128
!!! pool=0x562274a94000 reactor=11 getcpu=11 shard=11/128
!!! pool=0x562274a94000 reactor=13 getcpu=13 shard=13/128
!!! pool=0x562274a94000 reactor=15 getcpu=15 shard=15/128
!!! pool=0x562274a94000 reactor=16 getcpu=16 shard=16/128
!!! pool=0x562274a94000 reactor=17 getcpu=17 shard=17/128
!!! pool=0x562274a94000 reactor=18 getcpu=18 shard=18/128
!!! pool=0x562274a94000 reactor=19 getcpu=19 shard=19/128

However, I see there are still other caveats. I'll incorporate the proposed commit and push another version for further discussion.

@cyx1231st cyx1231st force-pushed the wip-crimson-drop-mempool branch from 53831ff to 30a24fc Compare August 29, 2023 07:28
@cyx1231st
Member Author

@rzarzynski
It turns out that the commit main...rzarzynski:ceph:wip-mempool-cpulocalstorage reaches the performance curve of Crimson REF (see #53130 (comment)).

After excluding the updates to type_t::items (132476a), the performance curve reaches Crimson OPT.

The core count with Crimson can be determined, so I added asserts to make sure collisions won't happen in the crimson case; see 30a24fc

What do you think?

@rzarzynski
Contributor

Hi @cyx1231st!

Huh! So there is one unsharded counter left: type. Yeah, that's the beauty of atomics – all or nothing. We should shard it or revisit force_register & the pool declaration macro:

62fa72c0e168 (Sage Weil           2016-10-18 12:03:28 -0400 348)   void init(bool force_register) {
049797641b94 (Sage Weil           2016-10-14 17:50:33 -0400 349)     pool = &get_pool(pool_ix);
b15a377a7b07 (Sage Weil           2016-10-18 12:06:27 -0400 350)     if (debug_mode || force_register) {
049797641b94 (Sage Weil           2016-10-14 17:50:33 -0400 351)       type = pool->get_type(typeid(T), sizeof(T));
049797641b94 (Sage Weil           2016-10-14 17:50:33 -0400 352)     }
049797641b94 (Sage Weil           2016-10-14 17:50:33 -0400 353)   }

There are also other things we can do to further improve mempool based on the CPU-local storage.

  1. The padding to max-known-cache-line-size (128 bytes) is a pure waste. We actually want to have the counter structs dense to minimize cache misses. I added a commit for that to main...rzarzynski:ceph:wip-mempool-cpulocalstorage.

  2. How about, just for the sake of a test, switching the counters in:

struct shard_t {
  ceph::atomic<size_t> bytes = {0};
  ceph::atomic<size_t> items = {0};
  char __padding[128 - sizeof(ceph::atomic<size_t>)*2];
} __attribute__ ((aligned (128)));

to plain size_t? This would let us roughly estimate how much the non-contended atomicity (queue flushes etc.) contributes to the penalty. If it brings significant overhead, we could tune the strategy once there can be more than one CPU per shard (or prohibit that entirely with an assert like you did, which I like very much :-)).
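The correctness boundary of that experiment is worth spelling out (a hypothetical micro-sketch, not the ceph structs): an atomic counter stays exact under concurrent writers, while a plain size_t is exact only when a single thread owns the shard, which is precisely what a one-CPU-per-shard assert would guarantee:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic shard: exact under any number of concurrent writers, because
// fetch_add is a read-modify-write that can never lose an increment.
std::size_t count_with_atomics(int threads, int iters) {
  std::atomic<std::size_t> items{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < threads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < iters; ++i)
        items.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& t : ts) t.join();
  return items.load();
}

// Plain shard: exact only with a single owning writer (no atomic RMW,
// no memory-queue flushes on the increment).
std::size_t count_single_writer_plain(int iters) {
  std::size_t items = 0;
  for (int i = 0; i < iters; ++i) ++items;
  return items;
}
```

Timing the two variants (under perf or a simple loop) would isolate the cost of uncontended atomicity from the cost of cache-line ping-pong.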

@cyx1231st
Member Author

cyx1231st commented Aug 29, 2023

Looking at the history, I'm not sure why force_register is needed. Can we drop it so that the counter is only enabled in debug mode, generally?

The padding to max-known-cache-line-size (128 bytes) is a pure waste. We actually want to have the counter structs dense to minimize cache misses.

I'm actually not sure whether dropping the 128-byte padding is reasonable, because doing so will make multiple seastar reactors update the same cacheline. This seems to me to bring back the "false sharing" issue again; quoting the wiki:

If two processors operate on independent data in the same memory address region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, forcing memory stalls in addition to wasting system bandwidth.

How about, just for a sake of a test, switching the counters from ceph::atomic<size_t> to size_t

My first attempt before the RFC #52404 was to change them to plain size_t, but the performance did not change, so I finally decided to drop them entirely.

That seems to match the description above: the issue is actually due to multiple cores trying to update the same cacheline, not the atomic type itself.

Does it make sense?
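The layout trade-off under discussion can be pinned down at compile time (a sketch with hypothetical names, assuming a 64-bit target; 128 bytes is the worst-case cache line size used by the ceph struct). The padded variant gives each shard a private line; the dense variant packs eight shards per 128-byte line and therefore exposes same-line writes from different reactors:

```cpp
#include <cstddef>

// Padded variant: each shard owns an entire worst-case cache line.
struct alignas(128) padded_shard {
  std::size_t bytes = 0;
  std::size_t items = 0;
};

// Dense variant: counters only -- eight of these fit in one 128-byte
// line, so writers on different cores would invalidate each other.
struct dense_shard {
  std::size_t bytes = 0;
  std::size_t items = 0;
};

static_assert(sizeof(padded_shard) == 128, "one shard per worst-case line");
static_assert(sizeof(dense_shard) == 16, "eight shards per 128-byte line");
```

Whether the dense writes constitute true or false sharing then depends purely on whether the co-located shards belong to the same hot pool.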

@rzarzynski
Contributor

rzarzynski commented Aug 29, 2023

I'm actually not sure whether dropping the 128-byte padding is reasonable, because doing so will make multiple seastar reactors update the same cacheline. This seems to me to bring back the "false sharing" issue again,

There is more twist to that. We don't need to protect the shards-of-different-pools assigned to the same CPU, but we do need to protect against mixing shards-of-the-same-pool between different CPUs.

However, the change – as it rearranges data structures – would be far bigger than the small-but-wrong commit :-(.

At the moment we have:

pool x: padded shard cpu 1 | padded shard cpu 2 | ...
pool y: padded shard cpu 1 | padded shard cpu 2 | ...

We could have:

padded shards of cpu 1: pool x | pool y | ...
padded shards of cpu 2: pool x | pool y | ...

I think the naming (CPU-local storage) is misleading. What the runtime offers is far from thread-local storage: we aren't getting an easy-to-use area of memory, just a selector...
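The rearrangement sketched above can be written down concretely (all names and counts are hypothetical, assuming a 64-bit target):

```cpp
#include <cstddef>

constexpr std::size_t kNumPools = 4;  // assumed
constexpr std::size_t kNumCpus = 8;   // assumed

struct counters { std::size_t bytes = 0, items = 0; };

// Current layout, pool-major: every (pool, cpu) shard is padded to a
// full line, so the padding cost is paid kNumPools * kNumCpus times.
struct pool_shards {
  struct alignas(128) padded { counters c; };
  padded per_cpu[kNumCpus];
};
pool_shards pools[kNumPools];

// Proposed layout, CPU-major: all pools' counters for one CPU sit
// densely in one aligned block; no other CPU ever writes into it, so
// only the block boundary needs cache-line alignment.
struct alignas(128) cpu_block {
  counters per_pool[kNumPools];
};
cpu_block cpus[kNumCpus];

static_assert(sizeof(pools) == kNumPools * kNumCpus * 128,
              "pool-major pays one line per (pool, cpu) pair");
static_assert(sizeof(cpus) == kNumCpus * sizeof(cpu_block),
              "cpu-major pays one padded block per CPU");
```

With these assumed counts, the CPU-major arrangement also keeps one CPU's entire counter working set in a single line, which is the density argument made earlier.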

@rzarzynski
Contributor

My first attempt before the RFC #52404 was to change them to the plain size_t, but the performance was not changed, so I finally decided to drop them entirely.

Yes, it was reasonable when the cost was dominated by ping-pongs. When the collisions are cleared and the type counter is disabled or sharded, the smaller contributors should become visible.

BTW: just pushed the sharding for type + the revert to main...rzarzynski:ceph:wip-mempool-cpulocalstorage.

rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Sep 1, 2023
@cyx1231st
Member Author

cyx1231st commented Sep 8, 2023

I think the direction for the fixes is clear: make type_t sharded and the shard selection appropriate. I'll update this PR later and verify when the hardware is available again.

Yes, it was reasonable when the cost was dominated by ping-pongs. When the collisions are clear and the type is disabled or sharded, the smaller contributors should become visible.

Agreed. It would be interesting to see how atomics impact the performance after the fixes. However, I suppose we still cannot generally drop the atomic wrappers, because only crimson uses cooperative scheduling; the classic OSD will have issues without atomics under preemptive scheduling. Is my understanding correct?

And I just realized that the term "false sharing" is incorrect in the title.

@cyx1231st cyx1231st changed the title crimson: avoid false sharing with mempool crimson: avoid true sharing within mempool Sep 8, 2023
@cyx1231st cyx1231st changed the title crimson: avoid true sharing within mempool mempool: avoid true sharing for the counters Sep 8, 2023
@cyx1231st cyx1231st force-pushed the wip-crimson-drop-mempool branch from b1803d7 to e4c54ec Compare November 21, 2023 07:58
Contributor

@rzarzynski rzarzynski left a comment


LGTM!

@rzarzynski
Contributor

@ljflores: the changes to the classical OSD are minimal, so I think the PR can be glued to any testing branch without imposing extra risk.

@cyx1231st cyx1231st removed the TESTED label Nov 23, 2023
rzarzynski and others added 5 commits November 23, 2023 09:53
It looks like this counter was initially thought to be a debug-only one,
but -- because of the `force_register` flag -- it is widely used now.

Found by Yingxin Cheng <[email protected]>

Signed-off-by: Radosław Zarzyński <[email protected]>
@cyx1231st cyx1231st force-pushed the wip-crimson-drop-mempool branch from e4c54ec to 931894f Compare November 23, 2023 01:53
@cyx1231st
Member Author

rebased

@cyx1231st cyx1231st changed the title mempool: avoid true sharing for the counters mempool: avoid true sharing for the counters, with crimson Nov 23, 2023
@cyx1231st
Member Author

jenkins test make check

@cyx1231st
Member Author

make check failure:

The following tests FAILED:
	 19 - run-tox-python-common (Failed)

@cyx1231st
Member Author

jenkins test make check

@cyx1231st
Member Author

jenkins test windows

@Matan-B
Contributor

Matan-B commented Nov 27, 2023

Crimson-rados tests: https://pulpito.ceph.com/yingxin-2023-11-27_02:15:02-crimson-rados-wip-yingxin-crimson-improve-mempool5-distro-default-smithi/

  • entries.begin()->version > info.last_update : 7467434, 7467440
  • seastar::shared_mutex::unlock() after ~SnapTrimEvent(): 7467449, 7467459

Seems to me they are known and unrelated, @Matan-B do the crimson test results look good to you, too?

Test results LGTM.
Whenever the failures are known, the best practice would be to attach the corresponding tracker and to attach the failed job to the tracker itself.

@cyx1231st
Member Author

cyx1231st commented Nov 27, 2023

Whenever the failures are known, the best practice would be to attach the corresponding tracker and to attach the failed job to the tracker itself.

Thanks for the suggestions, will do. Done.

@cyx1231st
Member Author

jenkins test windows

cyx1231st pushed a commit to cyx1231st/ceph that referenced this pull request Dec 4, 2023
@cyx1231st
Member Author

I'll merge if there are no other concerns.

@cyx1231st cyx1231st merged commit 3301618 into ceph:main Dec 11, 2023
@cyx1231st cyx1231st deleted the wip-crimson-drop-mempool branch December 11, 2023 01:39
6 participants