mempool: Modify pick_a_shard_int() function #55249

Closed
rosinL wants to merge 1 commit into ceph:main from rosinL:wip-fix-64100

Conversation

@rosinL
Member

@rosinL rosinL commented Jan 19, 2024

According to the make check log, on Ubuntu Jammy aarch64 with pagesize=4k, the mempool.check_shard_select test creates 3200 threads and maps them to mempools, but they map to only 4 shards, which is a very high collision probability. I looked at the return values of pthread_self(), for example: `0xfff99887f100`, `0xfff99806f100`, `0xfff99704f100`, `0xfff99683f100`.
They all have the same lower 16 bits, so we should shift right by 16 bits.

Fixes: https://tracker.ceph.com/issues/64100
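To make the collision concrete, here is a small standalone demo (not Ceph code; the helper `pick_shard` and `num_shard_bits = 6`, i.e. 64 shards, are assumptions for illustration) applying the shard computation to the four observed ids:

```cpp
#include <cstdint>

// Demo only: mirrors the (me >> shift) & mask computation discussed in
// this PR, with an assumed shard-bit count of 6 (64 shards).
constexpr uint64_t pick_shard(uint64_t me, unsigned shift, unsigned bits) {
  return (me >> shift) & ((uint64_t(1) << bits) - 1);
}

// All four ids end in 0xf100, so with the current shift of 12 the low
// four shard bits are stuck at 0b1111: at most 64/16 = 4 shards are used.
static_assert(pick_shard(0xfff99887f100ull, 12, 6) == 63, "stuck at ..1111");
static_assert(pick_shard(0xfff99806f100ull, 12, 6) == 47, "stuck at ..1111");
static_assert(pick_shard(0xfff99704f100ull, 12, 6) == 15, "stuck at ..1111");
static_assert(pick_shard(0xfff99683f100ull, 12, 6) == 63, "stuck at ..1111");

// Shifting by 16 instead spreads the same ids across distinct shards.
static_assert(pick_shard(0xfff99887f100ull, 16, 6) == 7, "distinct");
static_assert(pick_shard(0xfff99806f100ull, 16, 6) == 6, "distinct");
static_assert(pick_shard(0xfff99704f100ull, 16, 6) == 4, "distinct");
static_assert(pick_shard(0xfff99683f100ull, 16, 6) == 3, "distinct");
```

Note that all the shift-12 indices are congruent to 15 mod 16, which is why only a quarter-of-a-quarter of the shards are reachable for these ids.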


According to the make check log, on Ubuntu Jammy aarch64 with pagesize=4k,
the mempool.check_shard_select test creates 3200 threads and maps them to
mempools, but they map to only 4 shards, which is a very high collision
probability. The return values of `pthread_self()` look like this:

```
0xfff99887f100
0xfff99806f100
0xfff99704f100
0xfff99683f100
```

They all have the same lower 16 bits, so we should shift right by 16 bits.

Fixes: https://tracker.ceph.com/issues/64100

Signed-off-by: luo rixin <[email protected]>
@rosinL
Member Author

rosinL commented Jan 19, 2024

@markhpc here is a performance-related issue, please take a look. With Ubuntu Jammy aarch64 pagesize=4k, the right shift should be 16 instead of CEPH_PAGE_SHIFT (12), but I don't know the reason.

@rosinL
Member Author

rosinL commented Jan 22, 2024

jenkins test make check arm64

@rosinL
Member Author

rosinL commented Jan 23, 2024

@rzarzynski @ljflores @athanatos please help review.

@ljflores
Member

@rosinL I added a performance label so Mark can take a look!

@rosinL
Member Author

rosinL commented Jan 26, 2024

Ping @rzarzynski @athanatos @markhpc

@rosinL
Member Author

rosinL commented Jan 30, 2024

Ping @rzarzynski @athanatos @markhpc

@markhpc
Member

markhpc commented Feb 1, 2024

Heya, I was out traveling last week and am just catching up. I confess that on a casual glance I don't understand the reasoning for setting this to a static value of 16, i.e. hardcoding it doesn't seem like the right way forward?

@bill-scales
Contributor

The original code is a bit suspect: what gets returned by pthread_self() is architecture-specific. Normally it will be a pointer to a data structure, but POSIX theoretically also allows a structure to be returned. Casting the result to an integer and then shifting by CEPH_PAGE_SHIFT makes an assumption about the pointers being returned. The shift by page size looks like a reasonable assumption for x86_64 (a quick test shows that different threads return pointers to different 4K pages), but on aarch64 these structures appear to have a different alignment.

The change to shifting by 16 might make things better for aarch64 platforms but worse for other platforms.

```cpp
// https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
size_t me = (size_t)pthread_self();
// current:
size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
// proposed:
size_t i = (me >> 16) & ((1 << num_shard_bits) - 1);
```
Contributor

pthread_self()-as-shard-selector is a really bad idea. I would prefer to switch to the approach we already use in crimson. Please take a look at these PRs and the discussion there:

Member Author

It is different from crimson. Crimson is shard-per-core, with just one thread running on the core until the end, so crimson can use sched_getcpu(). For the classic OSD there are many threads (tp_osd_tp, msgr_worker, kvthread) running on the CPUs in a shared way, so using sched_getcpu() may cause more cache ping-pong between CPU cores.

Contributor

If _GNU_SOURCE is defined:

The crimson patch calls sched_getcpu() every time mempool::adjust_count is called, which means a given CPU core will choose the same shard (regardless of which thread happens to be running on that core at the time). In the case of crimson, the thread affinity is bound to the core, so there is no chance it will switch cores between the call to sched_getcpu() and the increment.

For classic there is a (tiny) risk of a context switch and core hop between calling sched_getcpu() and the increment. The use of atomics means this exceptional case is still correct, just with non-optimal performance. So the patch would work fine for classic.

If _GNU_SOURCE is not defined:

The crimson patch calls sched_getcpu() once to assign an ID to the thread and then uses this every time mempool::adjust_count is called. As the thread affinity is bound to a core, this works well for crimson and avoids shards being accessed by different cores.

For classic, threads aren't bound to cores so this won't work very well, although it might do a better job than the current arithmetic on pthread_self(). A better fix for classic in this scenario would be to use a global variable, incremented modulo the number of shards, to select a shard the first time a thread calls mempool::adjust_count, and then stick to this shard on subsequent calls. If threads switch cores this won't prevent CPU cache lines bouncing between cores (but then the current pthread_self() implementation doesn't prevent that either), but it will at least guarantee a perfect distribution of threads across shards, which is better than the current implementation or the proposed change.

Contributor

> For classic there is a (tiny) risk of a context switch and core hop between calling sched_getcpu() and the increment. The use of atomics means this exceptional case is still correct, just with non-optimal performance. So the patch would work fine for classic.

Yup, that's the exact justification for the atomics there. I think we could squeeze them out if we could guarantee there is no way to hop between cores (everything pinned?). I don't know whether this is a real problem.

@rosinL
Member Author

rosinL commented Feb 2, 2024

> Heya, I was out traveling last week and am just catching up. I confess that on a casual glance I don't understand the reasoning for setting this to a static value of 16, i.e. hardcoding it doesn't seem like the right way forward?

Actually, 16 comes from my observation of the pthread_self() return values, not from theory. The current value CEPH_PAGE_SHIFT (12 for 4k pagesize, 16 for 64k pagesize) is based on the assumption that the pthread_self() value comes from a page-aligned structure (a glibc implementation detail). In this case the assumption is not true, leading to a bad shard result: 3200 threads map to only 4 shards. Maybe we should apply some transformation to pthread_self(), like a hash or CRC, to get a result with fewer collisions.
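As a sketch of that transformation idea (hypothetical, not a committed fix; the names `mix_to_shard` and `pick_a_shard_hashed` are made up here), a cheap multiplicative mix that takes the top bits instead of the alignment-sensitive low bits might look like:

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdint>

// Sketch: mix all bits of the thread id with a 64-bit Fibonacci multiplier
// and keep the *top* num_shard_bits bits, so the shard index no longer
// depends on the alignment of whatever pthread_self() points at.
inline std::size_t mix_to_shard(std::uint64_t me, unsigned num_shard_bits) {
  return std::size_t((me * 0x9e3779b97f4a7c15ull) >> (64 - num_shard_bits));
}

inline std::size_t pick_a_shard_hashed(unsigned num_shard_bits) {
  return mix_to_shard(std::uint64_t(pthread_self()), num_shard_bits);
}
```

The multiplier is the classic golden-ratio constant; the point of taking the top bits is that low-bit patterns such as the constant 0xf100 suffix seen above get diffused across the whole word before the shard is selected.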

@rosinL
Member Author

rosinL commented Feb 2, 2024

> The original code is a bit suspect: what gets returned by pthread_self() is architecture-specific. Normally it will be a pointer to a data structure, but POSIX theoretically also allows a structure to be returned. Casting the result to an integer and then shifting by CEPH_PAGE_SHIFT makes an assumption about the pointers being returned. The shift by page size looks like a reasonable assumption for x86_64, but on aarch64 these structures appear to have a different alignment.
>
> The change to shifting by 16 might make things better for aarch64 platforms but worse for other platforms.

Yes, the alignment is architecture-specific. Maybe we need a more general way, for example using a hash of the pthread_self() result.

@rzarzynski
Contributor

> Actually, 16 comes from my observation of the pthread_self() return values, not from theory.

Yeah, that's the fragility: it's quite environment-specific. I recall the problem was observed on one system, but on mine the distribution was pretty reasonable :-(.

> Yes, the alignment is architecture-specific. Maybe we need a more general way, for example using a hash of the pthread_self() result.

Before drilling further, I would enlist the intended cases for the optimization. I believe _GNU_SOURCE is the absolutely dominating one; the alternative path, in my understanding, targets FreeBSD (there are folks who run Ceph on it) and maybe (?) librados-on-Windows (???).

What worries me is that a home-brewed shard selector would be costlier, more complex, and bring more ping-pong in unpinned scenarios than sched_getcpu().

@rosinL
Member Author

rosinL commented Feb 5, 2024

In my opinion, there are 2 considerations for mempool sharding:

  1. map a given thread to the same shard every time
  2. avoid high-frequency access to some shards

And there are 2 alternative solutions:

  1. In pick_a_shard_int, still use pthread_self() as a seed, but apply a transformation (e.g. CRC) to get a shard result with fewer collisions (easy, but maybe a less balanced distribution between shards)
  2. As @bill-scales mentions, replace pick_a_shard_int with allocate_a_shard_int and use round robin to allocate a shard id to each thread on its first call (a bit more complex, but a perfect distribution between shards)

What's your opinion? Are there other options? @rzarzynski @bill-scales

@rzarzynski
Contributor

> 1. map a given thread to the same shard every time

This covers all the ways a thread is glued to a mempool shard. It includes approaches based on pthread_self(), gettid(), or a custom, constant token (e.g. the mentioned round robin). There is a potential penalty when threads are unpinned and frequently migrate across cores. In addition, there is the tiny time window between selecting the shard and the actual update of the counters. IMHO all these techniques are used to approximate CPU-local storage.

In the sched_getcpu() approach, a CPU is assigned to a shard. There is no need to worry about migrations (which makes it even more useful for the classic OSD than for crimson), apart from the tiny time window between selecting the shard and the actual update of the counters. If the number of shards matches or exceeds the number of CPUs, you can have a perfect shared-nothing layout (still apart from the tiny window, of course).
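A sched_getcpu()-based selector along these lines might look like the following sketch (assumed shape, mirroring the crimson approach rather than any committed code; sched_getcpu() is a glibc extension):

```cpp
#include <sched.h>   // sched_getcpu(), available with _GNU_SOURCE on glibc
#include <cstddef>

// Sketch: the shard follows whichever CPU the thread is on right now, so
// thread migration needs no bookkeeping; the shard's atomic counters
// absorb the rare core hop between this call and the counter update.
inline std::size_t pick_a_shard_by_cpu(std::size_t num_shards) {
  int cpu = sched_getcpu();  // current CPU number, or -1 on failure
  if (cpu < 0) cpu = 0;      // degrade gracefully to shard 0
  return std::size_t(cpu) % num_shards;
}
```

With num_shards >= number of CPUs, each online CPU gets its own shard, which is the shared-nothing case described above.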

@rosinL
Member Author

rosinL commented Feb 6, 2024

I will do some tests to compare the CPU-affinity shard (sched_getcpu()) and the thread-affinity shard.

@github-actions

github-actions bot commented Apr 6, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Apr 6, 2024
@github-actions

github-actions bot commented May 6, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this May 6, 2024
@markhpc
Member

markhpc commented May 9, 2024

I think we still want this

@markhpc markhpc reopened this May 9, 2024
@github-actions github-actions bot removed the stale label May 9, 2024
@bazaah bazaah mentioned this pull request May 20, 2024
bazaah added a commit to bazaah/aur-ceph that referenced this pull request May 25, 2024
This is basically just a performance issue that is triggering on our
build server.

Should be reverted once the upstream decides how to fix it.

Quoting myself:

> looks like just a perf regression the upstream hasn't decided how
> to fix yet: ceph/ceph#56448, ceph/ceph#55249, ceph/ceph#55696

References: #23
@github-actions

github-actions bot commented Jul 8, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days.

@github-actions github-actions bot added the stale label Jul 8, 2024
@github-actions

github-actions bot commented Aug 7, 2024

This pull request has been automatically closed because there has been no activity for 90 days.

@github-actions github-actions bot closed this Aug 7, 2024
5 participants