mempool: Modify pick_a_shard_int() function #55249

Closed
rosinL wants to merge 1 commit into ceph:main from rosinL:wip-fix-64100

Conversation

@rosinL
Member

@rosinL rosinL commented Jan 19, 2024

According to the make check log, on Ubuntu Jammy aarch64 with pagesize=4k, the mempool.check_shard_select test creates 3200 threads and maps them to mempools, but they map to only 4 shards, which is a very high collision probability. I looked at the return values of pthread_self(), for example: `0xfff99887f100`, `0xfff99806f100`, `0xfff99704f100`, `0xfff99683f100`.
They all have the same lower 16 bits, so we should shift right by 16 bits.

Fixes: https://tracker.ceph.com/issues/64100
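To make the collision concrete, here is a small standalone demo (not Ceph code; the helper `pick_shard` and `num_shard_bits = 6`, i.e. 64 shards, are assumptions for illustration) applying the shard computation to the four observed ids:

```cpp
#include <cstdint>

// Demo only: mirrors the (me >> shift) & mask computation discussed in
// this PR, with an assumed shard-bit count of 6 (64 shards).
constexpr uint64_t pick_shard(uint64_t me, unsigned shift, unsigned bits) {
  return (me >> shift) & ((uint64_t(1) << bits) - 1);
}

// All four ids end in 0xf100, so with the current shift of 12 the low
// four shard bits are stuck at 0b1111: at most 64/16 = 4 shards are used.
static_assert(pick_shard(0xfff99887f100ull, 12, 6) == 63, "stuck at ..1111");
static_assert(pick_shard(0xfff99806f100ull, 12, 6) == 47, "stuck at ..1111");
static_assert(pick_shard(0xfff99704f100ull, 12, 6) == 15, "stuck at ..1111");
static_assert(pick_shard(0xfff99683f100ull, 12, 6) == 63, "stuck at ..1111");

// Shifting by 16 instead spreads the same ids across distinct shards.
static_assert(pick_shard(0xfff99887f100ull, 16, 6) == 7, "distinct");
static_assert(pick_shard(0xfff99806f100ull, 16, 6) == 6, "distinct");
static_assert(pick_shard(0xfff99704f100ull, 16, 6) == 4, "distinct");
static_assert(pick_shard(0xfff99683f100ull, 16, 6) == 3, "distinct");
```

Note that all the shift-12 indices are congruent to 15 mod 16, which is why only a quarter-of-a-quarter of the shards are reachable for these ids.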


According to the make check log, on Ubuntu Jammy aarch64 with pagesize=4k,
the mempool.check_shard_select test creates 3200 threads and maps them to
mempools, but they map to only 4 shards, which is a very high collision
probability. The return values of `pthread_self()` look like this:

```
0xfff99887f100
0xfff99806f100
0xfff99704f100
0xfff99683f100
```

They all have the same lower 16 bits, so we should shift right by 16 bits.

Fixes: https://tracker.ceph.com/issues/64100

Signed-off-by: luo rixin <[email protected]>
@rosinL
Member Author

rosinL commented Jan 19, 2024

@markhpc here is a performance-related issue, please take a look. With Ubuntu Jammy aarch64 pagesize=4k, the right shift should be 16 instead of CEPH_PAGE_SHIFT (12), but I don't know the reason.

@rosinL
Member Author

rosinL commented Jan 22, 2024

jenkins test make check arm64

@rosinL
Member Author

rosinL commented Jan 23, 2024

@rzarzynski @ljflores @athanatos please help review.

@ljflores
Member

@rosinL I added a performance label so Mark can take a look!

@rosinL
Member Author

rosinL commented Jan 26, 2024

Ping @rzarzynski @athanatos @markhpc

@rosinL
Member Author

rosinL commented Jan 30, 2024

Ping @rzarzynski @athanatos @markhpc

@markhpc
Member

markhpc commented Feb 1, 2024

Heya, I was out traveling last week and am just catching up. I confess that on a casual glance I don't understand the reasoning for setting this to a static value of 16, i.e. hardcoding it doesn't seem like the right way forward?

@bill-scales
Contributor

The original code is a bit suspect: what gets returned by pthread_self() is architecture-specific. Normally it will be a pointer to a data structure, but POSIX theoretically also allows a structure to be returned. Casting the result to an integer and then shifting by CEPH_PAGE_SHIFT makes an assumption about the pointers being returned. The shift by page size looks like a reasonable assumption for x86_64 (a quick test shows that different threads return pointers to different 4K pages), but on aarch64 these structures appear to have a different alignment.

The change to shifting by 16 might make things better for aarch64 platforms but worse for other platforms.

```cpp
// https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
size_t me = (size_t)pthread_self();
// current:
size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
// proposed:
size_t i = (me >> 16) & ((1 << num_shard_bits) - 1);
```
Contributor

pthread_self()-as-shard-selector is a really bad idea. I would prefer to switch to the approach we already use in crimson. Please take a look at these PRs and the discussion there:

Member Author

It is different from crimson. Crimson is shard-per-core, with just one thread running on the core until the end, so crimson can use sched_getcpu(). For the classic OSD there are many threads (tp_osd_tp, msgr_worker, kvthread) running on the CPUs in a shared way, so using sched_getcpu() may cause more cache ping-pong between CPU cores.

Contributor

If _GNU_SOURCE is defined:

The crimson patch calls sched_getcpu() every time mempool::adjust_count is called, which means a given CPU core will choose the same shard (regardless of which thread happens to be running on that core at the time). In the case of crimson, the thread affinity is bound to the core, so there is no chance it will switch cores between the call to sched_getcpu() and the increment.

For classic there is a (tiny) risk of a context switch and core hop between calling sched_getcpu() and the increment. The use of atomics means this exceptional case is still correct, just with non-optimal performance. So the patch would work fine for classic.

If _GNU_SOURCE is not defined:

The crimson patch calls sched_getcpu() once to assign an ID to the thread and then uses this every time mempool::adjust_count is called. As the thread affinity is bound to a core, this works well for crimson and avoids shards being accessed by different cores.

For classic, threads aren't bound to cores so this won't work very well, although it might do a better job than the current arithmetic on pthread_self(). A better fix for classic in this scenario would be to use a global variable, incremented modulo the number of shards, to select a shard the first time a thread calls mempool::adjust_count, and then stick to this shard on subsequent calls. If threads switch cores this won't prevent CPU cache lines bouncing between cores (but then the current pthread_self() implementation doesn't prevent that either), but it will at least guarantee a perfect distribution of threads across shards, which is better than the current implementation or the proposed change.

Contributor

> For classic there is a (tiny) risk of a context switch and core hop between calling sched_getcpu() and the increment. The use of atomics means this exceptional case is still correct, just with non-optimal performance. So the patch would work fine for classic.

Yup, that's the exact justification for the atomics there. I think we could squeeze them out if we could guarantee there is no way to hop between cores (everything pinned?). I don't know whether this is a real problem.

@rosinL
Member Author

rosinL commented Feb 2, 2024

> Heya, I was out traveling last week and am just catching up. I confess that on a casual glance I don't understand the reasoning for setting this to a static value of 16, i.e. hardcoding it doesn't seem like the right way forward?

Actually, 16 comes from my observation of the pthread_self() return values, not from theory. The current value CEPH_PAGE_SHIFT (12 for 4k pagesize, 16 for 64k pagesize) is based on the assumption that the pthread_self() value comes from a page-aligned structure (a glibc implementation detail). In this case the assumption is not true, leading to a bad shard result: 3200 threads map to only 4 shards. Maybe we should apply some transformation to pthread_self(), like a hash or CRC, to get a result with fewer collisions.
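As a sketch of that transformation idea (hypothetical, not a committed fix; the names `mix_to_shard` and `pick_a_shard_hashed` are made up here), a cheap multiplicative mix that takes the top bits instead of the alignment-sensitive low bits might look like:

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdint>

// Sketch: mix all bits of the thread id with a 64-bit Fibonacci multiplier
// and keep the *top* num_shard_bits bits, so the shard index no longer
// depends on the alignment of whatever pthread_self() points at.
inline std::size_t mix_to_shard(std::uint64_t me, unsigned num_shard_bits) {
  return std::size_t((me * 0x9e3779b97f4a7c15ull) >> (64 - num_shard_bits));
}

inline std::size_t pick_a_shard_hashed(unsigned num_shard_bits) {
  return mix_to_shard(std::uint64_t(pthread_self()), num_shard_bits);
}
```

The multiplier is the classic golden-ratio constant; the point of taking the top bits is that low-bit patterns such as the constant 0xf100 suffix seen above get diffused across the whole word before the shard is selected.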

@rosinL
Member Author

rosinL commented Feb 2, 2024

> The original code is a bit suspect: what gets returned by pthread_self() is architecture-specific. Normally it will be a pointer to a data structure, but POSIX theoretically also allows a structure to be returned. Casting the result to an integer and then shifting by CEPH_PAGE_SHIFT makes an assumption about the pointers being returned. The shift by page size looks like a reasonable assumption for x86_64, but on aarch64 these structures appear to have a different alignment.
>
> The change to shifting by 16 might make things better for aarch64 platforms but worse for other platforms.

Yes, the alignment is architecture-specific. Maybe we need a more general way, for example using a hash of the pthread_self() result.

@rzarzynski
Contributor

> Actually, 16 comes from my observation of the pthread_self() return values, not from theory.

Yeah, that's the fragility: it's quite environment-specific. I recall the problem was observed on one system, but on mine the distribution was pretty reasonable :-(.

> Yes, the alignment is architecture-specific. Maybe we need a more general way, for example using a hash of the pthread_self() result.

Before drilling further, I would enlist the intended cases for the optimization. I believe _GNU_SOURCE is the absolutely dominating one; the alternative path, in my understanding, targets FreeBSD (there are folks who run Ceph on it) and maybe (?) librados-on-Windows (???).

What worries me is that a home-brewed shard selector would be costlier, more complex, and bring more ping-pong in unpinned scenarios than sched_getcpu().

@rosinL
Member Author

rosinL commented Feb 5, 2024

In my opinion, there are 2 considerations for mempool sharding:

  1. map a given thread to the same shard every time
  2. avoid high-frequency access to some shards

And there are 2 alternative solutions:

  1. In pick_a_shard_int, still use pthread_self() as a seed, but apply a transformation (e.g. CRC) to get a shard result with fewer collisions (easy, but maybe a less balanced distribution between shards)
  2. As @bill-scales mentions, replace pick_a_shard_int with allocate_a_shard_int and use round robin to allocate a shard id to each thread on its first call (a bit more complex, but a perfect distribution between shards)

What's your opinion? Are there other options? @rzarzynski @bill-scales

@rzarzynski
Contributor

> 1. map a given thread to the same shard every time

This covers all the ways a thread is glued to a mempool shard. It includes approaches based on pthread_self(), gettid(), or a custom, constant token (e.g. the mentioned round robin). There is a potential penalty when threads are unpinned and frequently migrate across cores. In addition, there is the tiny time window between selecting the shard and the actual update of the counters. IMHO all these techniques are used to approximate CPU-local storage.

In the sched_getcpu() approach, a CPU is assigned to a shard. There is no need to worry about migrations (which makes it even more useful for the classic OSD than for crimson), apart from the tiny time window between selecting the shard and the actual update of the counters. If the number of shards matches or exceeds the number of CPUs, you can have a perfect shared-nothing layout (still apart from the tiny window, of course).
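A sched_getcpu()-based selector along these lines might look like the following sketch (assumed shape, mirroring the crimson approach rather than any committed code; sched_getcpu() is a glibc extension):

```cpp
#include <sched.h>   // sched_getcpu(), available with _GNU_SOURCE on glibc
#include <cstddef>

// Sketch: the shard follows whichever CPU the thread is on right now, so
// thread migration needs no bookkeeping; the shard's atomic counters
// absorb the rare core hop between this call and the counter update.
inline std::size_t pick_a_shard_by_cpu(std::size_t num_shards) {
  int cpu = sched_getcpu();  // current CPU number, or -1 on failure
  if (cpu < 0) cpu = 0;      // degrade gracefully to shard 0
  return std::size_t(cpu) % num_shards;
}
```

With num_shards >= number of CPUs, each online CPU gets its own shard, which is the shared-nothing case described above.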

@rosinL
Member Author

rosinL commented Feb 6, 2024

I will do some tests to compare the CPU-affinity shard (sched_getcpu()) and the thread-affinity shard.

@github-actions

github-actions bot commented Apr 6, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 6, 2024
@github-actions

github-actions bot commented May 6, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this May 6, 2024
@markhpc
Member

markhpc commented May 9, 2024

I think we still want this

@markhpc markhpc reopened this May 9, 2024
@github-actions github-actions bot removed the stale label May 9, 2024
@bazaah bazaah mentioned this pull request May 20, 2024
bazaah added a commit to bazaah/aur-ceph that referenced this pull request May 25, 2024
This is basically just a performance issue that is triggering on our
build server.

Should be reverted once the upstream decides how to fix it.

Quoting myself:

> looks like just a perf regression the upstream hasn't decided how
> to fix yet: ceph/ceph#56448, ceph/ceph#55249, ceph/ceph#55696

References: #23
@github-actions

github-actions bot commented Jul 8, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days.

@github-actions github-actions bot added the stale label Jul 8, 2024
@github-actions

github-actions bot commented Aug 7, 2024

This pull request has been automatically closed because there has been no activity for 90 days.

@github-actions github-actions bot closed this Aug 7, 2024
5 participants