mempool: Modify pick_a_shard_int() function #55249
Conversation
According to the make check log, on Ubuntu Jammy aarch64 with pagesize=4k, the mempool.check_shard_select test creates 3200 threads and maps them to mempools, but they map to only 4 shards, which is a very high collision probability. I looked at the return values of `pthread_self()`, for example: `0xfff99887f100 0xfff99806f100 0xfff99704f100 0xfff99683f100`. They share the same lower 16 bits, so we should right shift by 16 bits. Fixes: https://tracker.ceph.com/issues/64100 Signed-off-by: luo rixin <[email protected]>
@markhpc this is a performance-related issue, please take a look. With Ubuntu Jammy aarch64 pagesize=4k, the right shift should be 16 instead of CEPH_PAGE_SHIFT (12), but I don't know the reason.
jenkins test make check arm64
|
@rzarzynski @ljflores @athanatos please help to review.
@rosinL I added a performance label so Mark can take a look!
Heya, I was out on travel last week and am just catching up. I confess that on a casual glance I don't understand the reasoning behind setting this to a static value of 16, i.e. hardcoding it doesn't seem like the right way forward?
The original code is a bit suspect - what gets returned by pthread_self() is architecture specific; normally it will be a pointer to a data structure, but POSIX theoretically also allows a structure to be returned. Casting the result to an integer and then shifting by CEPH_PAGE_SHIFT makes an assumption about the pointers being returned - the shift by page size looks like a reasonable assumption for x86_64 (a quick test shows that different threads return pointers to different 4K pages), but for aarch64 it appears these structures have a different alignment. The change of shifting by 16 might make things better for aarch64 platforms but worse for other platforms.
```diff
 // https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
 size_t me = (size_t)pthread_self();
-size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
+size_t i = (me >> 16) & ((1 << num_shard_bits) - 1);
```
pthread_self()-as-shard-selector is a really bad idea. I would prefer to switch to the approach we already use in crimson – please take a look at these PRs and the discussion there:
It is different from crimson. Crimson is shard-per-core: just one thread runs on the core until the end, so crimson can use sched_getcpu(). For the classic OSD there are many threads (tp_osd_tp, msgr_worker, kv thread) running on the CPUs in a shared way, so using sched_getcpu() may cause more cache ping-pongs between CPU cores.
If _GNU_SOURCE is defined:
The crimson patch calls sched_getcpu() every time mempool::adjust_count is called, which means a given CPU core will choose the same shard (regardless of which thread happens to be running on that core at the time). In the case of crimson the thread affinity is bound to the core, so there is no chance it will switch cores between the call to sched_getcpu() and the increment.
For classic there is a (tiny) risk of a context switch and core hop between calling sched_getcpu() and the increment. The use of atomics means this exceptional case is still correct, just with non-optimal performance. So the patch would work fine for classic.
If _GNU_SOURCE is not defined:
The crimson patch calls sched_getcpu() once to assign an ID to the thread and then uses this every time mempool::adjust_count is called. As the thread affinity is bound to a core, this works well for crimson and avoids shards being accessed by different cores.
For classic, threads aren't bound to cores so this won't work very well, although it might do a better job than the current arithmetic on pthread_self(). A better fix for classic in this scenario would be to use a global variable that is incremented modulo the number of shards to select a shard the first time a thread calls mempool::adjust_count, and then stick to this shard on subsequent calls. If threads switch cores this won't prevent CPU cache lines bouncing between cores (but then the current pthread_self() implementation doesn't prevent that either), but it will at least guarantee a perfect distribution of threads across shards, which is better than either the current implementation or the proposed change.
> For classic there is a (tiny) risk of a context switch and core hop between calling sched_getcpu() and the increment. The use of atomics means this exceptional case is still correct, just with non-optimal performance. So the patch would work fine for classic.
Yup, that's the exact justification for the atomics there. I think we could drop them if we could guarantee there is no way to hop between cores (everything pinned?). I don't know whether this is a real problem.
Actually, 16 is from my observation of the `pthread_self()` return values.
Yes, the alignment is architecture specific. Maybe we need a more general way, for example using a hash of the `pthread_self()` value.
Yeah, that's the fragility – it's quite environment-specific. I recall the problem was observed on one system but on mine the distribution was pretty reasonable :-(.
Before drilling further I would enlist the intended use cases for the optimization. What worries me is that a home-brewed shard selector would be costlier, more complex, and bring more ping-pongs in unpinned scenarios than the `sched_getcpu()`-based one.
In my opinion, there are 2 considerations for mempool shard selection:
And there are 2 alternative solutions:
What's your opinion? Maybe there are other options? @rzarzynski @bill-scales
This is about all the ways where a thread is glued to a CPU core.
I will do some tests to compare cpu-affinity shard selection.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
I think we still want this
This is basically just a performance issue that is triggering on our build server. Should be reverted once upstream decides how to fix it. Quoting myself:
> looks like just a perf regression the upstream hasn't decided how to fix yet: ceph/ceph#56448, ceph/ceph#55249, ceph/ceph#55696

References: #23
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!