
[RFC] crimson: understand performance impacts by skipping mempool in raw #52404

Closed
cyx1231st wants to merge 1 commit intoceph:mainfrom
cyx1231st:rfc-seastar-msgr-optimizations-mempool

Conversation

@cyx1231st
Member

[Figure: msgr-iops-cpus (IOPS vs. CPU cores for crimson, async, and crimson-poc msgr)]

The above results use perf_crimson_msgr (client mode) to stress perf_crimson_msgr (server mode) or perf_async_msgr, giving an apples-to-apples performance comparison by number of CPU cores used.

Initially, both crimson and async msgr show similar performance curves, but further analysis reveals that they have different bottlenecks:

  • Async msgr is mostly blocked by its locks at 16 CPUs;
  • Crimson msgr is mostly blocked by the mempool counters in the raw buffer at 16 CPUs.

Async msgr should share the same buffer and mempool infrastructure, yet its immediate bottleneck is not there, which implies that its scaling issue may be a combination of factors.

For crimson msgr, commenting out the mempool counters (as this PR does) relieves the CPU scaling problem greatly; see the crimson-poc curve.

In short:

  • Both crimson and async msgr implement similar features, and their single-core performance looks similar when there is no racing;
  • When going to multiple cores, a single point of shared resource in the hot path can make crimson scale badly, with the end-to-end result approaching that of classic.


@athanatos
Contributor

Interesting. I'm in favor of globally disabling mempool counters for crimson for now. IIRC, their usage is informational -- it's a way of dumping memory usage of different structures. I expect the issue is the usage of atomics? If so, we'll probably want per-reactor counters that can be dynamically combined when queried.
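The per-reactor idea could look roughly like the sketch below. This is a hedged illustration with hypothetical names (`PerShardCounter`, `ShardSlot`), not Ceph's mempool API: each shard increments its own cache-line-padded slot with a plain non-atomic store, and a query combines the slots on demand.

```cpp
// Sketch (assumed names, not Ceph code): per-shard counters combined
// dynamically when queried, so the hot path never touches shared state.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxShards = 64;
constexpr std::size_t kCacheLine = 64;

// One slot per shard, padded to a full cache line so that shards
// never write to the same line.
struct alignas(kCacheLine) ShardSlot {
  std::uint64_t bytes_allocated = 0;
};

class PerShardCounter {
  std::array<ShardSlot, kMaxShards> slots_{};
public:
  // Called from the owning shard only: plain, non-atomic increment.
  void add(std::size_t shard, std::uint64_t n) {
    slots_[shard].bytes_allocated += n;
  }
  // Called when dumping memory stats: combine all shards.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (const auto& s : slots_) sum += s.bytes_allocated;
    return sum;
  }
};
```

Since the counters are informational only, a slightly stale sum at query time should be acceptable, which is what makes the non-atomic per-shard design viable.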

@cyx1231st
Member Author

I expect the issue is the usage of atomics? If so, we'll probably want per-reactor counters that can be dynamically combined when queried.

Yes, these counters are shared across cores. IIUC this causes CPUs to race on the same cache line, so simply changing the atomics to non-atomics would not help in this case.
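The cache-line racing can be illustrated with a generic sketch (plain `std::thread`, not Ceph or Seastar code): all threads hammering one shared atomic bounce the same cache line between cores, whereas per-thread slots padded to separate cache lines avoid the coherence traffic entirely and are only summed at the end.

```cpp
// Generic sketch of shared-counter contention vs. per-thread slots.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr std::uint64_t kIters = 100000;

// Contended: every increment fights over one cache line.
std::uint64_t run_shared() {
  std::atomic<std::uint64_t> counter{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&] {
      for (std::uint64_t i = 0; i < kIters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& t : ts) t.join();
  return counter.load();
}

// Uncontended: each thread owns a padded slot; combine at the end.
struct alignas(64) Slot { std::uint64_t v = 0; };
std::uint64_t run_per_thread() {
  std::vector<Slot> slots(kThreads);
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&slots, t] {
      for (std::uint64_t i = 0; i < kIters; ++i)
        slots[t].v += 1;  // plain store; no cross-slot coherence traffic
    });
  for (auto& t : ts) t.join();
  std::uint64_t sum = 0;
  for (const auto& s : slots) sum += s.v;
  return sum;
}
```

Note that making the shared counter non-atomic would not remove the cache-line ping-pong (and would also lose updates); only separating the writers' cache lines does.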

The perf-crimson-msgr server-side workload is straightforward and purely run-to-completion -- it receives and sends each msg on the same core:

auto rep = crimson::make_message<MOSDOp>(0, 0, hobj, spgid, 0, 0, 0);
bufferlist data(server.msg_data);
rep->write(0, server.msg_len, data);  // fill the reply payload
rep->set_tid(m->get_tid());
++server.msg_count;
std::ignore = c->send(std::move(rep));  // reply on the same core

This is why the mempool counters are the only issue in the tests.

When it comes to OSD, things are more complicated because we will submit messages as well as their buffers across cores, and I think we currently construct and destruct buffers from different shards, which requires their ref-counters to be atomic and shared across cores in the hot path.

So the problem becomes compounded with OSD, and I'm not sure whether enforcing construction/destruction in the same shard would be faster (i.e. atomic ref-counter vs submitting buffer destructions back to the original shard). Either way, it does not look like a simple change to me.
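The second alternative above can be sketched without Seastar, using a per-shard task queue as a stand-in for a reactor's cross-shard message queue. All names here (`ShardQueue`, `Buffer`, `put_from_foreign_shard`) are hypothetical, not the crimson or Seastar API: the ref-counter stays non-atomic because it is only ever touched on its owner shard, and a foreign shard posts the release back instead of decrementing directly.

```cpp
// Hedged sketch: deferring buffer destruction to the owning shard so
// the ref-counter can stay non-atomic. Not Seastar code.
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>

// Stand-in for a reactor's cross-shard queue (hypothetical).
struct ShardQueue {
  std::mutex m;
  std::deque<std::function<void()>> tasks;
  void post(std::function<void()> f) {
    std::lock_guard<std::mutex> g(m);
    tasks.push_back(std::move(f));
  }
  void drain() {  // run by the owning shard in its own event loop
    std::deque<std::function<void()>> local;
    { std::lock_guard<std::mutex> g(m); local.swap(tasks); }
    for (auto& f : local) f();
  }
};

// Buffer whose ref-counter is plain (non-atomic) because only the
// owner shard ever reads or writes it.
struct Buffer {
  static inline int destroyed = 0;  // observability for this sketch only
  std::uint64_t refs = 1;           // non-atomic: owner-shard access only
  ShardQueue* owner;
  explicit Buffer(ShardQueue* q) : owner(q) {}
  ~Buffer() { ++destroyed; }
};

inline void put_from_foreign_shard(Buffer* b) {
  // A foreign shard must not touch b->refs directly; it posts the
  // release back to the owner shard instead.
  b->owner->post([b] {
    if (--b->refs == 0) delete b;
  });
}
```

The trade-off the comment describes is visible here: the non-atomic decrement is cheap, but every foreign release pays for a cross-shard submission, so it is not obvious which side wins without measuring.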

@cyx1231st
Member Author

Close in favor of #53130

@cyx1231st cyx1231st closed this Aug 24, 2023
@cyx1231st cyx1231st deleted the rfc-seastar-msgr-optimizations-mempool branch August 24, 2023 06:44
