common/mempool.cc: Improve performance of sharding #55696
rzarzynski merged 2 commits into ceph:main
Conversation
size_t mempool::get_num_shards(void) {
Can this function be called from more than one thread concurrently? I guess not.
Yes it can.
The very first time it's called, when num_shards is 0 (hasn't been initialized yet), the code is not multi-thread safe. However, this first call will be for one of the static initializations of a Pool structure that has a dynamic array of size get_num_shards(), which happens while the code is still single threaded.
I originally tried static initialization to set up num_shards, but this just got into a mess with static initialization ordering problems, so I resorted to initialize on first use.
I'll add a mutex like this to make the first call thread safe too. As this is performance critical code we only want to take the mutex when initialization is required, so it will look like this:
if (num_shards == 0) {
  std::lock_guard<std::mutex> lck(thread_shard_mtx);
  if (num_shards == 0) {
    // One time initialization
    ...
  }
}
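A fleshed-out sketch of this double-checked initialization might look like the following. The names num_shards and thread_shard_mtx follow the comment above; the sizing policy (one shard per hardware thread, minimum 1) is illustrative only, not what the patch does:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical sketch of the double-checked initialization proposed above.
static std::mutex thread_shard_mtx;
static std::size_t num_shards = 0;

std::size_t get_num_shards() {
  if (num_shards == 0) {                    // fast path: no lock once initialized
    std::lock_guard<std::mutex> lck(thread_shard_mtx);
    if (num_shards == 0) {                  // re-check under the lock
      // One time initialization.  (A production version would make num_shards
      // std::atomic, or rely on the first call happening before any threads
      // are spawned, to avoid a formal data race.)
      std::size_t n = std::thread::hardware_concurrency();
      num_shards = n ? n : 1;
    }
  }
  return num_shards;
}
```

Note the fast path only reads num_shards, so after the first call the mutex is never taken again.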
thread_shard_index = sched_getcpu() & ((1 << num_shard_bits) - 1);
}
#endif
return thread_shard_index;
I see more complexity here even for crimson: an extra indirection layer, the TLS lookup. IIRC its cost can vary a lot depending on the TLS model, usage of dlopen() and all the hairiness.
My understanding of sched_getcpu() is that it's a user-space call loading the CPU id from memory, where the kernel copies it when a core change happens. I guess it's blazing fast, perhaps even faster than TLS, especially when TLS's internal complexity needs to see the daylight.
Ah, I went too fast to the code changes :-).
Calling sched_getcpu() on every call to mempool gives more stable performance when the number of threads exceeds the number of cores, but the overheads of calling sched_getcpu() mean that performance is worse than the existing pthread_self() hash for smaller numbers of threads.
The overhead is pretty interesting. My guess was it's actually cheaper than TLS (__tls_get_addr()?). Do you have a profile by any chance?
I've updated the benchmarks in the PR comments. A local-exec TLS lookup is the cheapest at ~ 5 cycles, a global-dynamic TLS lookup (which will happen for code in a dynamic library so is more representative) is ~15 cycles, calling sched_getcpu() is ~32 cycles.
The call to pthread_self() and hash is ~8 cycles; this is effectively a local-exec TLS lookup plus a bit of overhead for a function call and a logical AND.
Inspecting the assembly code shows the sched_getcpu() code path is over twice as long as a global-dynamic TLS lookup involving a call to __tls_get_addr(), which confirms what the benchmarks are showing. Like you, I'm surprised sched_getcpu() isn't more efficient - but there's quite a bit of complexity working out whether it can use the vDSO variable or has to resort to making a syscall, and this happens on every call.
I think we need to look at the mempool statistics when running on a system with a heavy I/O workload. The stats I've managed to obtain from running on a fairly pathetic setup show a 3x improvement for the adaptive caching versus calling sched_getcpu(), which might reduce to 2x if we made a slick inline version of pick_a_shard_int that just called sched_getcpu(). I suspect the benefits will be much greater on a system running a heavy I/O workload; if they are not, then maybe the simplicity of always calling sched_getcpu() will outweigh a marginal performance improvement from trying to cache the result.
Latest commit 5f72112 implements the following algorithms:
THREAD_OWNER - this has one TLS access per mempool stat update, only calls sched_getcpu() when a thread has been context switched to another core, and almost eliminates the cache lines storing the stats being accessed by different CPU cores (there is one cache line ping-pong each time a thread switches cores).
SCHED_GETCPU - this calls sched_getcpu() on every mempool stat update; it has no TLS access. Even with the optimized version of sched_getcpu() reading a variable from the vDSO, this is slower than the THREAD_OWNER implementation.
SCHED_GETCPU USING RDPID - for systems that support rdpid, use this instead of sched_getcpu(). rdpid is a 1 cycle instruction on Icelake servers. Benchmarks with ceph_test_c2c showed this was fractionally slower than THREAD_OWNER when # threads < # cores and fractionally faster than THREAD_OWNER when # threads > # cores.
I suspect that outside of benchmarks (where instructions/data are L1 cached) SCHED_GETCPU USING RDPID will give the best performance, followed by THREAD_OWNER, followed by SCHED_GETCPU.
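The cached-shard idea behind these variants can be sketched roughly as follows. This is an illustrative sketch, not the code from commit 5f72112: the shard count, interval, and function name are assumptions, and it requires Linux's sched_getcpu():

```cpp
#include <sched.h>   // sched_getcpu(), Linux-specific (glibc needs _GNU_SOURCE,
                     // which g++/clang++ define by default)
#include <cstddef>

static constexpr std::size_t num_shard_bits = 5;   // assumed: 32 shards

// Cache the sched_getcpu() result in a thread_local and only re-query it
// every recheck_interval updates, so the common path is one TLS access.
std::size_t pick_a_shard_cached() {
  static thread_local std::size_t cached_shard = 0;
  static thread_local int recheck_count = 0;
  static constexpr int recheck_interval = 10000;

  if (recheck_count-- <= 0) {              // slow path: refresh the cache
    recheck_count = recheck_interval;
    int cpu = sched_getcpu();              // returns -1 if unsupported
    if (cpu >= 0)
      cached_shard = static_cast<std::size_t>(cpu) & ((1u << num_shard_bits) - 1);
  }
  return cached_shard;                     // fast path: one TLS access
}
```

If a thread migrates cores, the stale cached value only causes at most recheck_interval updates to land on the "wrong" shard, which is why the periodic recheck addresses the pinning worry raised below.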
src/common/mempool.cc
Outdated
#else
if (thread_shard_index == max_shards) {
  // Thread has not been assigned to a shard yet
  thread_shard_index = sched_getcpu() & ((1 << num_shard_bits) - 1);
This assumes seastar's reactors never change cores. I think they are pinned, but I'm worried about the universal quantifier. I don't know when exactly this pinning happens and whether it's guaranteed this code doesn't execute before then.
In short: the complexity makes the thinking harder (and more painful for me ;-).
Let me respond to my own comment: the periodic recheck should address this worry.
Still, the impression is we're fighting the complexity with more complexity.
Are these numbers coming from a microbenchmark? Was
These performance benchmarks were from a micro benchmark, timing millions of calls to a function implementing the code being tested. I didn't even bother turning on optimization for the compiler; however the code, and indeed the optimized sequence of uops, is going to be L1 cached, so this will make very little difference to the benchmarks. The TLS model was local-exec; as you point out, if the code is in a dynamic library the TLS access will be marginally slower. However these are all second order effects. In practice this code isn't going to remain in the L1 cache, so the costs of both a TLS access and a call to sched_getcpu() will be higher than shown by this benchmark, but a TLS access is still going to be something like a 1/3rd of the cost of calling sched_getcpu(), which will be something like 1/3rd to 1/8th of the cost of performing an atomic access to a dirty cache line in another core.
@bill-scales An excellent measurement of the cycle cost of different operations! Only seeing the numbers, I have the audacity to propose a significant rework of our mempool logic.
With atomics, that will give us 5 or 15 cycles for TLS retrieval + 2 x 19 cycles for atomic increment, a total of ~48 cycles per update. Now for the things I am not certain about: with regular increments it would be 5 or 15 cycles for TLS retrieval + 2 x 5 cycles, a total of ~20 cycles per update. In this technique I expect to pay a single penalty per core change. @bill-scales Do you think such a rework can fly? Do you think it will be worthwhile?
I did consider something like this. If the thread count is relatively small then doing accounting per thread will be more efficient than per core, and a technique like the above will work well. However if there are thousands of threads then accounting per thread will start to consume a lot of memory, and trying to calculate summary counts across this number of shards will become expensive.

I don't have enough experience of Ceph to know how many threads is typical. I see between 30 and 70 threads in OSD processes (so this technique would work well there); I don't know how many threads (specifically those accessing mempool) is typical in clients. I'm sure that the mempool stats/dump interfaces aren't called very often, so having these iterate across 1000 shards probably wouldn't be a problem, but there are other interfaces such as allocated_items and allocated_bytes that will be called more often, so the number of shards is a concern. Maybe others can comment on what typical thread counts might be and whether a shard per thread model is viable?

I did think about having a per thread sharding scheme and switching to a per core scheme if the number of threads got too high, however I thought that switching models without pausing threads is too complex.
If we did 1-6 then I think we would want to get rid of the atomics as well. The challenge here is the memory consistency model of having one thread updating a value and another thread reading it. Not a problem for x86, and solvable for other architectures with a bit of care.
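One portable way to get "a bit of care" without full atomics cost is relaxed atomic operations: the owning thread updates its counter with memory_order_relaxed (which compiles to plain loads/stores plus an increment, no lock prefix) and the summarizing thread reads with relaxed loads. A hedged sketch; all names here are illustrative, not mempool's API:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Per-thread counter on its own 64-byte cache line, updated only by its
// owner, so no read-modify-write atomicity is needed - just a race-free
// store the reader can observe.
struct alignas(64) per_thread_counter {
  std::atomic<std::uint64_t> bytes{0};
};

void add_bytes(per_thread_counter &c, std::uint64_t n) {
  // Owner-only update: relaxed load + relaxed store, no atomic RMW.
  c.bytes.store(c.bytes.load(std::memory_order_relaxed) + n,
                std::memory_order_relaxed);
}

std::uint64_t sum_bytes(const std::vector<per_thread_counter> &counters) {
  std::uint64_t total = 0;
  for (const auto &c : counters)
    total += c.bytes.load(std::memory_order_relaxed);  // may be slightly stale
  return total;
}

// Small demonstration: two owners update their own counters concurrently.
std::uint64_t demo_sum() {
  std::vector<per_thread_counter> counters(2);
  std::thread t([&] { add_bytes(counters[0], 10); });
  add_bytes(counters[1], 5);
  t.join();
  return sum_bytes(counters);
}
```

The reader may see a momentarily stale total, which is fine for statistics; what relaxed atomics buy over plain ints is defined behavior on weakly ordered architectures such as aarch64.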
This would be far too slow; my experience with tkill and signal handlers suggests it's not uncommon for it to take 1ms for a thread to receive a signal from another thread (this was in an environment where the source and target threads had affinity to different cores and were the only processes running on those cores). In an environment like Crimson you can dispatch a task on each I/O reactor to summarize per thread data, but we need something more generic.
I was focused on OSD and I missed the fact that clients may have significantly more threads.
Thread local storage
I think "marginally slower" brings an implicit assumption: that we're on the optimistic path. I'm afraid there is a lot of complexity coming from the flexibility of thread management:
void *
__tls_get_addr (GET_ADDR_ARGS)
{
dtv_t *dtv = THREAD_DTV ();
if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation)))
return update_get_addr (GET_ADDR_PARAM);
void *p = dtv[GET_ADDR_MODULE].pointer.val;
if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);
return (char *) p + GET_ADDR_OFFSET;
}
Costly
src/common/mempool.cc
Outdated
static thread_local int thread_shard_recheck_count;
static thread_local int thread_shard_recheck_interval;
Packing related thread_local variables together reduces the number of TLS references.
struct thread_shard_t {
size_t index = max_shards;
int recheck_count = 0;
int recheck_interval = 10000;
};
thread_local thread_shard_t thread_shard;
@bill-scales Before operating on the shard, the thread checks the owner:
This is a mutation of your recheck algorithm that has the following cost:
I claim it should be more cycle efficient.
But in the first place: what's the real benefit of the
Yes, and I would propose to revamp counters to keep all counters together in a packed manner.
Only if
I like Adam's idea, I agree that it is likely to perform better and it will be a simpler implementation. I'll rework the commit to implement this and to also improve how shard data is grouped to use less cache lines. I'll do some more performance measurements to compare implementations.
Currently a TLS access is faster than calling sched_getcpu() on x86_64 architectures, and a TLS access is likely to be much faster than sched_getcpu() on arm architectures (which can't implement the vDSO optimization).
…he line thrashing

Improve the performance of mempool sharding on systems with sched_getcpu(). The idea is to assign threads to shards based on which CPU core they are executing on, using sched_getcpu() to select the shard. All the threads executing on the same CPU core share the same shard. If each CPU core has its own shard there should be no cache line contention.

Fixes: https://tracker.ceph.com/issues/64100
Signed-off-by: Bill Scales <[email protected]>
This test case calls exit() to terminate a test mid flight to test recovery from crashes at different points in the code. However it does not stop threads before calling exit(), and consequently these threads can continue to access mempool structures that have gone out of scope and are freed by the exiting thread. The introduction of a unique_ptr into mempool causes these threads to assert when they access the freed memory. The simple fix is to call _exit() instead of exit() in the test case so that global destructors are not run.

Signed-off-by: Bill Scales <[email protected]>
Extra commit to fix the failing test unittest_bluefs_fs. This test had a use after free bug that was previously unnoticed, but with the change to mempool to use a unique_ptr it causes an assert. The test case is testing recovery after crashing bluefs at different points in making updates, so fixing the test case to stop all the threads before exiting is liable to reduce the coverage of the test. Instead I've changed its use of exit to _exit to avoid calling global destructors.

jenkins test api

jenkins test make check arm64

jenkins test rook e2e

jenkins test dashboard cephadm

jenkins test api

jenkins test api

@rzarzynski @ljflores this PR was approved
static size_t num_shard_bits;
static size_t num_shards;

std::unique_ptr<shard_t[]> shards = std::make_unique<shard_t[]>(get_num_shards());
@bill-scales seeing some crashes in 'make check arm64' when indexing into this array: https://tracker.ceph.com/issues/71027 assigned to you
A change to mempool ceph#55696 has exposed a use after free bug in crushtool during process exit, where dtors are being called to free up mempool data structures at the same time that the ceph context service thread is trying to update them. This commit modifies crushtool's initialization to prevent this (unneeded) thread from being created. See the issue for more details about why the thread was not terminating.
Fixes: https://tracker.ceph.com/issues/71027
Signed-off-by: Bill Scales <[email protected]>

A change to mempool ceph#55696 has exposed a use after free bug in crushtool during process exit, where dtors are being called to free up mempool data structures at the same time that the ceph context service thread is trying to update them. This commit modifies crushtool's initialization to prevent this (unneeded) thread from being created. See the issue for more details about why the thread was not terminating.
Fixes: https://tracker.ceph.com/issues/71027
Signed-off-by: Bill Scales <[email protected]>
(cherry picked from commit 2f3ffff)

A change to mempool ceph#55696 has exposed a use after free bug in crushtool during process exit, where dtors are being called to free up mempool data structures at the same time that the ceph context service thread is trying to update them. This commit modifies crushtool's initialization to prevent this (unneeded) thread from being created. See the issue for more details about why the thread was not terminating.
Fixes: https://tracker.ceph.com/issues/71027
Signed-off-by: Connor Fawcett <[email protected]>
(cherry picked from commit 2f3ffff)
Checklist
Explanation of changes
The old code used a hash of pthread_self() to select a shard for each thread, to reduce how many threads accessed the same statistics. Cache line thrashing still occurred when multiple threads were assigned the same shard. On some platforms/architectures the hash could be very ineffective and caused vast amounts of thrashing. A unit test to detect ineffective distributions was disabled because it regularly caused failures.
The change uses sched_getcpu() to select a shard, which almost eliminates cache line thrashing because per shard stats are now only accessed by one core, except in the very rare scenario of a thread context switch occurring between the call to sched_getcpu() and the update of the stats. There are some slightly better performing implementations (see performance section below) but we chose this solution for simplicity and robustness.
On platforms (e.g. aarch64) that don’t support sched_getcpu() a thread local variable assigns each thread a shard using a round robin distribution. This is a much better distribution than the pthread_self hash. The unit test is re-enabled.
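A round-robin fallback of this kind can be sketched with a global atomic counter and a thread_local, roughly as below. The shard count and function name are illustrative assumptions, not the patch's actual code:

```cpp
#include <atomic>
#include <cstddef>

static constexpr std::size_t kNumShards = 32;        // assumed shard count
static std::atomic<std::size_t> next_shard{0};

// Each new thread takes the next shard from the global counter, giving a
// round-robin distribution across shards regardless of how thread ids hash.
std::size_t pick_a_shard_round_robin() {
  // Initialized once per thread; subsequent calls are a cheap TLS read.
  static thread_local std::size_t my_shard =
      next_shard.fetch_add(1, std::memory_order_relaxed) % kNumShards;
  return my_shard;
}
```

Unlike a pthread_self() hash, this guarantees the first kNumShards threads all land on distinct shards, which is why the distribution unit test can be re-enabled.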
num_shards and num_shard_bits were constants but have been changed to variables; the first call to get_num_shards() selects appropriate values based on how many CPU cores the system has. This will reduce cache line thrashing on systems with higher core counts at the cost of slightly more memory usage (these systems are very likely to have more memory, and it's only a few KB of extra memory per daemon/client).
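Sizing the shard count from the core count might look like the sketch below: round the number of hardware threads up to a power of two so a shard can be selected with a cheap mask. The cap of 7 bits (128 shards) and the function names are assumptions for illustration, not the values the patch uses:

```cpp
#include <cstddef>
#include <thread>

// Round the core count up to a power of two, capped at 2^7 = 128 shards.
std::size_t compute_num_shard_bits(std::size_t cores) {
  std::size_t bits = 0;
  while ((std::size_t{1} << bits) < cores && bits < 7)
    ++bits;
  return bits;
}

std::size_t compute_num_shards() {
  std::size_t cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;   // hardware_concurrency() may return 0 on failure
  return std::size_t{1} << compute_num_shard_bits(cores);
}
```

Keeping num_shards a power of two means shard selection stays a single AND with ((1 << num_shard_bits) - 1), as in the diff above.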
For class pool_t: statistics counters for each shard are stored in separate cache lines (no change), but because the number of shards has increased and become a variable, a new global shards structure is added with a two-dimensional array[num_shards][num_pools] of stats to reduce the amount of padding to cache line boundaries; this is now per shard rather than per pool per shard. The shard array is removed from the class, and a pool index is added to make it easy to index into the global array.
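The layout idea can be illustrated as follows; the pool count, field names, and helper functions are hypothetical, the point being that padding to a cache line is paid once per shard row rather than once per (pool, shard) pair:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

static constexpr std::size_t kNumPools = 4;   // assumed pool count

// One cache-line-aligned row per shard, carrying every pool's counters.
struct alignas(64) shard_t {
  std::uint64_t bytes[kNumPools] = {};
  std::uint64_t items[kNumPools] = {};
};

// Global [num_shards] array, sized at runtime now that num_shards is variable.
std::unique_ptr<shard_t[]> make_shards(std::size_t num_shards) {
  return std::make_unique<shard_t[]>(num_shards);
}

// Summarizing a pool's stats walks one column of the 2-D layout.
std::uint64_t total_items(const shard_t *shards, std::size_t num_shards,
                          std::size_t pool) {
  std::uint64_t total = 0;
  for (std::size_t s = 0; s < num_shards; ++s)
    total += shards[s].items[pool];
  return total;
}

// Small demonstration of updating two shards and summarizing one pool.
std::uint64_t demo_total() {
  auto shards = make_shards(4);
  shards[0].items[1] = 3;
  shards[2].items[1] = 2;
  return total_items(shards.get(), 4, 1);
}
```

Each updating core touches only its own shard_t row, so rows never bounce between cores; only the infrequent summarizing walk reads them all.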
For struct type_t (which stores pool statistics for each type in "debug" builds, but is actually used for several types in production builds): Crimson !WITH_ALIEN had replaced a single set of statistics with a set per shard to reduce cache line contention. This change has been adopted for all builds. Because num_shards is now a variable, the array of shards becomes a dynamic allocation.
Crimson changes
For !WITH_ALIEN the call to sched_getcpu() used to be cached in a thread local because the ioreactor threads were bound with an affinity of 1 core. The ioreactors are no longer guaranteed to have this affinity so we revert to calling sched_getcpu() every time.
Mempool used the ceph_atomic template, which for !WITH_ALIEN builds used normal types instead of atomic types. Because ioreactors are no longer guaranteed to have affinity with one core this is not safe, so the code has been changed back to use std::atomic.
With these changes there is no longer any Crimson specific code in mempool.
Performance benchmarks
Let's start with some benchmarks to show the various costs of fragments of code. The objective is to show that the overheads of performing atomic accesses to a cache line that is dirty in another CPU core are far worse than the overheads of calling sched_getcpu():
These tests were run on a dual socket Intel Broadwell x86_64 system with no attempt to optimize the code being generated by the compiler, timings will vary on different systems and different architectures but it gives an idea of relative costs. The results are averages of many iterations and represent the cost when code+data is in the CPU L1 cache, in practice costs are likely to be higher because of CPU cache misses.
Calling sched_getcpu() to find the current CPU: 32 cycles
Calling pthread_self() and hashing the result: 8 cycles
Calling a function to return a thread_local int (local-exec): 5 cycles
Calling a function to return a thread_local int (global-dynamic): 15 cycles
Calling a function to return rdpid (assembly instruction for newer x86 systems): 5 cycles
Update: The thread_local int test was performed using a micro benchmark in a single executable which uses TLS local-exec which is the fastest way to access a thread local variable. It was pointed out that this is perhaps not representative and that we should really be comparing to TLS global-dynamic which is the worst case which happens when the thread local variable is in a dynamic library. I've added this extra benchmark which indeed shows some extra overheads.
Update: Investigated using rdpid as a faster alternative to sched_getcpu(). It appears that the rdpid instruction executes in 1 cycle on Intel Icelake CPUs, in this benchmark with the additional overheads of a function call it is being measured as 5 cycles. This is significantly faster than calling sched_getcpu() even though this is only reading a variable from the vDSO area on x86 systems.
Atomic increment of a memory variable: 19 cycles
Non-atomic increment of a memory variable: 5 cycles
Atomic increment with 8 threads (different cores, two sockets) accessing the same cache line in series: 142 cycles intra-socket, 400 cycles inter-socket
This was a test with multiple threads performing atomic increments on the same cache line, with serialization to ensure each thread did an increment in turn, hence forcing the cache line to move between CPU cores each time but with no contention. The cost of moving the cache line is higher when it needs to be moved between sockets.
Atomic increment with 2 threads (different cores of same socket) accessing the same cache line in parallel: 75 cycles
Atomic increment with 4 threads (different cores of same socket) accessing the same cache line in parallel: 166 cycles
Atomic increment with 6 threads (different cores of same socket) accessing the same cache line in parallel: 253 cycles
Atomic increment with 8 threads (different cores of same socket) accessing the same cache line in parallel: 330 cycles
Atomic increment with 8 threads (different cores, two sockets) accessing the same cache line in parallel: 350 - 435 cycles
This was a test with multiple threads performing atomic increments within the same cache line to cause the cache line to ping-pong between cores. The test ensured the threads all ran at the same time and generated contention where multiple threads are trying to lock the same cache line. The results suggest that threads were sometimes able to perform more than one increment before a cache line moved, but that when there is a lot of contention not only is there a cost of moving the cache line but also a cost of waiting for the cache line to become available.
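The shape of this experiment, contrasting a shared counter (worst-case ping-pong) with cache-line-padded per-thread counters (no sharing), can be sketched as below. This is an illustrative harness, not the actual benchmark; thread/iteration counts are arbitrary and real measurements would add rdtsc or std::chrono timing around the loops:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct alignas(64) padded_counter {           // each counter owns a cache line
  std::atomic<std::uint64_t> v{0};
};

// Every thread hammers one shared atomic: the line ping-pongs between cores.
std::uint64_t run_shared(int threads, int iters) {
  std::atomic<std::uint64_t> shared{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < threads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < iters; ++i)
        shared.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto &th : ts) th.join();
  return shared.load();
}

// Each thread increments its own padded counter: no cache line sharing.
std::uint64_t run_sharded(int threads, int iters) {
  std::vector<padded_counter> counters(threads);
  std::vector<std::thread> ts;
  for (int t = 0; t < threads; ++t)
    ts.emplace_back([&, t] {
      for (int i = 0; i < iters; ++i)
        counters[t].v.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto &th : ts) th.join();
  std::uint64_t total = 0;
  for (auto &c : counters) total += c.v.load();
  return total;
}
```

Both variants produce the same total, so any measured difference is purely the cost of cache line movement, which is what the cycle counts above quantify.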
I also did some testing on a Mac M1 (only aarch64 system I’ve got access to), which shows that the costs are pretty similar. Mac does not have sched_getcpu() or methods for setting or testing thread affinity, so we need to assume that the kernel scheduler made sensible choices about where threads were run:
Calling pthread_self() and hashing the result: 4.6ns
Calling a function to return a thread_local int: 5.7ns
Atomic increment of a memory variable: 14.8ns
Non-atomic increment of a memory variable: 4.3ns
Atomic increment with 8 threads (same socket) accessing the same cache line in series: 276ns
Atomic increment with 2 threads (same socket) accessing the same cache line in parallel: 35ns
Atomic increment with 4 threads (same socket) accessing the same cache line in parallel: 140ns
Atomic increment with 6 threads (same socket) accessing the same cache line in parallel: 258ns
Atomic increment with 8 threads (same socket) accessing the same cache line in parallel: 276ns
Conclusions
Comparison of different algorithms
There is an existing test case test_c2c which does a good job of measuring the performance of the mempool sharding code. It can be run with different numbers of threads and with/without sharding. I tried a variety of different algorithms and ran tests to compare the performance with different numbers of threads. The test system had 32 cores (dual socket, 8 cores + hyperthreading per socket). Tests were run with up to 128 threads to show behavior both when there are fewer threads than cores and when there are more threads than cores.
The X axis is the number of threads, the Y axis is the number of operations completed in 30 seconds (higher is better):
This test case is perhaps a bit unrepresentative of real world performance because the threads are running without yielding and once all cores are busy there is little benefit in the kernel scheduler migrating threads.
Discounted implementations
An implementation that caches the result of sched_getcpu() in a thread local variable gives marginally better performance. An idea by Adam Kupczyk (see comments below) can cheaply detect when a thread is migrated to a different core and update the cached value. This implementation was discounted because it was considered too complex for the minor benefit.
An implementation that uses rdpid (a relatively new x86_64 instruction) instead of sched_getcpu() also gives marginally better performance. This implementation was discounted because it's hardware specific and needs code to detect whether the system supports it. Hopefully the library implementation of sched_getcpu() will be updated to use this in the future, giving sched_getcpu() the same performance.
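The required support detection is a CPUID check: RDPID is advertised in CPUID.(EAX=07H,ECX=0):ECX bit 22. A hedged sketch of just the detection half (executing rdpid itself, via inline asm or _rdpid_u32(), is omitted; the function name is illustrative):

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   // GCC/Clang x86 intrinsic header

// Query CPUID leaf 7, subleaf 0; RDPID support is ECX bit 22.
bool cpu_supports_rdpid() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return false;                    // leaf 7 not available on this CPU
  return (ecx >> 22) & 1;
}
#else
// rdpid is x86-only; other architectures would use sched_getcpu() or TLS.
bool cpu_supports_rdpid() { return false; }
#endif
```

A run-time check like this (rather than a compile-time one) is what makes the rdpid path awkward for a portable code base, which is part of why it was discounted.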
While micro benchmarks show these other implementations to have better performance, it's very unlikely that this would give a measurable improvement to Ceph I/O performance because mempool is only a tiny fraction of the I/O path.
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e