
[RFC] crimson: understand performance impacts by skipping mempool in raw #52404

Closed
cyx1231st wants to merge 1 commit intoceph:mainfrom
cyx1231st:rfc-seastar-msgr-optimizations-mempool

Conversation

@cyx1231st
Member

[Figure: msgr-iops-cpus (IOPS vs. CPU cores for crimson, async, and crimson-poc msgr)]

The above results use perf_crimson_msgr (client mode) to stress perf_crimson_msgr (server mode) or perf_async_msgr, giving an apples-to-apples performance comparison by number of CPU cores used.

Initially, both crimson and async msgr show similar performance curves, but further analysis reveals that they have different bottlenecks:

  • Async msgr is mostly blocked by its locks at 16 CPUs;
  • Crimson msgr is mostly blocked by the mempool counters in the raw buffer at 16 CPUs.

Async msgr should share the same buffer and mempool infrastructure, yet its immediate bottleneck is not there, which implies that its scaling issue may be a combination of factors.

For crimson msgr, commenting out the mempool counters (as this PR does) relieves the CPU scaling problem greatly; see the crimson-poc curve.

In short:

  • Both crimson and async msgr implement similar features, and their single-core performance looks similar when there is no racing;
  • When going to multiple cores, a single point of shared resource in the hot path can make crimson scale badly, with the end-to-end result approaching that of classic.


@athanatos
Contributor

Interesting. I'm in favor of globally disabling mempool counters for crimson for now. IIRC, their usage is informational -- it's a way of dumping memory usage of different structures. I expect the issue is the usage of atomics? If so, we'll probably want per-reactor counters that can be dynamically combined when queried.
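The per-reactor idea could look roughly like the sketch below. This is a hedged illustration with hypothetical names (`PerShardCounter`, `ShardSlot`), not Ceph's mempool API: each shard increments its own cache-line-padded slot with a plain non-atomic store, and a query combines the slots on demand.

```cpp
// Sketch (assumed names, not Ceph code): per-shard counters combined
// dynamically when queried, so the hot path never touches shared state.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxShards = 64;
constexpr std::size_t kCacheLine = 64;

// One slot per shard, padded to a full cache line so that shards
// never write to the same line.
struct alignas(kCacheLine) ShardSlot {
  std::uint64_t bytes_allocated = 0;
};

class PerShardCounter {
  std::array<ShardSlot, kMaxShards> slots_{};
public:
  // Called from the owning shard only: plain, non-atomic increment.
  void add(std::size_t shard, std::uint64_t n) {
    slots_[shard].bytes_allocated += n;
  }
  // Called when dumping memory stats: combine all shards.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (const auto& s : slots_) sum += s.bytes_allocated;
    return sum;
  }
};
```

Since the counters are informational only, a slightly stale sum at query time should be acceptable, which is what makes the non-atomic per-shard design viable.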

@cyx1231st
Member Author

I expect the issue is the usage of atomics? If so, we'll probably want per-reactor counters that can be dynamically combined when queried.

Yes, these counters are shared across cores. IIUC this causes CPUs to race on the same cache line, so simply changing the atomics to non-atomics would not help in this case.
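The cache-line racing can be illustrated with a generic sketch (plain `std::thread`, not Ceph or Seastar code): all threads hammering one shared atomic bounce the same cache line between cores, whereas per-thread slots padded to separate cache lines avoid the coherence traffic entirely and are only summed at the end.

```cpp
// Generic sketch of shared-counter contention vs. per-thread slots.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr std::uint64_t kIters = 100000;

// Contended: every increment fights over one cache line.
std::uint64_t run_shared() {
  std::atomic<std::uint64_t> counter{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&] {
      for (std::uint64_t i = 0; i < kIters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& t : ts) t.join();
  return counter.load();
}

// Uncontended: each thread owns a padded slot; combine at the end.
struct alignas(64) Slot { std::uint64_t v = 0; };
std::uint64_t run_per_thread() {
  std::vector<Slot> slots(kThreads);
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&slots, t] {
      for (std::uint64_t i = 0; i < kIters; ++i)
        slots[t].v += 1;  // plain store; no cross-slot coherence traffic
    });
  for (auto& t : ts) t.join();
  std::uint64_t sum = 0;
  for (const auto& s : slots) sum += s.v;
  return sum;
}
```

Note that making the shared counter non-atomic would not remove the cache-line ping-pong (and would also lose updates); only separating the writers' cache lines does.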

The perf-crimson-msgr server-side workload is straightforward and purely run-to-completion -- it receives and sends each msg on the same core:

auto rep = crimson::make_message<MOSDOp>(0, 0, hobj, spgid, 0, 0, 0);
bufferlist data(server.msg_data);
rep->write(0, server.msg_len, data);  // fill the reply payload
rep->set_tid(m->get_tid());
++server.msg_count;
std::ignore = c->send(std::move(rep));  // reply on the same core

This is why the mempool counters are the only issue in the tests.

When it comes to OSD, things are more complicated because we will submit messages as well as their buffers across cores, and I think we currently construct and destruct buffers from different shards, which requires their ref-counters to be atomic and shared across cores in the hot path.

So the problem becomes compounded with OSD, and I'm not sure whether enforcing construction/destruction in the same shard would be faster (i.e. atomic ref-counter vs submitting buffer destructions back to the original shard). Either way, it does not look like a simple change to me.
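The second alternative above can be sketched without Seastar, using a per-shard task queue as a stand-in for a reactor's cross-shard message queue. All names here (`ShardQueue`, `Buffer`, `put_from_foreign_shard`) are hypothetical, not the crimson or Seastar API: the ref-counter stays non-atomic because it is only ever touched on its owner shard, and a foreign shard posts the release back instead of decrementing directly.

```cpp
// Hedged sketch: deferring buffer destruction to the owning shard so
// the ref-counter can stay non-atomic. Not Seastar code.
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>

// Stand-in for a reactor's cross-shard queue (hypothetical).
struct ShardQueue {
  std::mutex m;
  std::deque<std::function<void()>> tasks;
  void post(std::function<void()> f) {
    std::lock_guard<std::mutex> g(m);
    tasks.push_back(std::move(f));
  }
  void drain() {  // run by the owning shard in its own event loop
    std::deque<std::function<void()>> local;
    { std::lock_guard<std::mutex> g(m); local.swap(tasks); }
    for (auto& f : local) f();
  }
};

// Buffer whose ref-counter is plain (non-atomic) because only the
// owner shard ever reads or writes it.
struct Buffer {
  static inline int destroyed = 0;  // observability for this sketch only
  std::uint64_t refs = 1;           // non-atomic: owner-shard access only
  ShardQueue* owner;
  explicit Buffer(ShardQueue* q) : owner(q) {}
  ~Buffer() { ++destroyed; }
};

inline void put_from_foreign_shard(Buffer* b) {
  // A foreign shard must not touch b->refs directly; it posts the
  // release back to the owner shard instead.
  b->owner->post([b] {
    if (--b->refs == 0) delete b;
  });
}
```

The trade-off the comment describes is visible here: the non-atomic decrement is cheap, but every foreign release pays for a cross-shard submission, so it is not obvious which side wins without measuring.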

@cyx1231st
Member Author

Close in favor of #53130

@cyx1231st cyx1231st closed this Aug 24, 2023
@cyx1231st cyx1231st deleted the rfc-seastar-msgr-optimizations-mempool branch August 24, 2023 06:44
