
blk/KernelDevice: Introduce a cap on the number of pending discards#61455

Merged
aclamk merged 1 commit into ceph:main from jbaergen-do:limit-discard-qlen-upstream on Feb 24, 2025

Conversation

@jbaergen-do (Contributor) commented Jan 20, 2025

Some disks have a discard performance that is too low to keep up with write workloads. Using async discard in this case will cause the OSD to run out of capacity due to the number of outstanding discards preventing allocations from being freed. While sync discard could be used in this case to cause backpressure, this might have unacceptable performance implications.

For the most part, as long as enough discards are getting through to a device, then it will stay trimmed enough to maintain acceptable performance. Thus, we can introduce a cap on the pending discard count, ensuring that the queue of allocations to be freed doesn't get too long while also issuing sufficient discards to disk. The default value of 1000000 has ample room for discard spikes (e.g. from snaptrim); it could result in multiple minutes of discards being queued up, but at least it's not unbounded (though if a user really wants unbounded behaviour, they can choose it by setting the new configuration option to 0).

Fixes: https://tracker.ceph.com/issues/69604

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@jbaergen-do jbaergen-do requested a review from a team as a code owner January 20, 2025 18:40
@ifed01 (Contributor) commented Jan 21, 2025

jenkins test make check

@aclamk (Contributor) commented Jan 23, 2025

jenkins test make check arm64


```cpp
std::lock_guard l(discard_lock);

if (max_pending > 0 && discard_queued.num_intervals() >= max_pending)
```
A reviewer (Contributor) commented on this hunk:
I think the max_pending value should be expressed in bytes, not items. A byte-based value gives the operator a hint for how to set it: one could allow, say, 1% of disk capacity to sit in the discard queue and plan OSD free-space control around that. Otherwise it's just pulling a value out of a hat.

@jbaergen-do (Contributor, Author) replied:

Yeah, I considered that, but the drive in question is limited far more by discard IOPS than it is by bytes.

@aclamk (Contributor) commented Jan 23, 2025

@jbaergen-do Good idea for a workaround solution.

@ifed01 The perfect solution would be for Allocators to immediately take ownership of released extents, so they can allocate them back, while freed chunks that have gone unused for a long time are sent for discard.
Do you think we can implement such a feature in next-gen allocators?

@jbaergen-do (Contributor, Author) commented:

FWIW, another solution here is to perform a periodic sweep across free+dirty extents in the allocator, since if a block gets overwritten soon after being freed then the discard was a waste. However, that looked like a lot of effort to implement, and this simple solution has held up so far.

@aclamk (Contributor) commented Jan 23, 2025

jenkins test make check

@aclamk (Contributor) commented Jan 23, 2025

> FWIW, another solution here is to perform a periodic sweep across free+dirty extents in the allocator, since if a block gets overwritten soon after being freed then the discard was a waste. However, that looked like a lot of effort to implement, and this simple solution has held up so far.

I get it. I admire the simplicity of just skipping the discard if the drive cannot keep up.

@aclamk (Contributor) commented Jan 23, 2025

jenkins test make check arm64

@aclamk (Contributor) commented Jan 24, 2025

@jbaergen-do I have no idea how this PR managed it, but it consistently fails make check arm64 on:
  • projectroot.src.test.objectstore.unittest_bluefs
  • projectroot.src.test.objectstore.unittest_deferred
  • projectroot.src.test.objectstore.unittest_bdev
I originally suspected some previously merged toxin, but other PRs do not exhibit it.

@jbaergen-do (Contributor, Author) commented:

@aclamk Well, that's bizarre! The crashes are all in device open, which is well before anything my changes touch 🤔

I'm going to pull latest into my PR and we'll start there; I've done some digging and can't think of any reason why my changes (or any other recent changes, for that matter) could have caused this.

Some disks have a discard performance that is too low to keep up with
write workloads. Using async discard in this case will cause the OSD to
run out of capacity due to the number of outstanding discards preventing
allocations from being freed. While sync discard could be used in this
case to cause backpressure, this might have unacceptable performance
implications.

For the most part, as long as enough discards are getting through to a
device, then it will stay trimmed enough to maintain acceptable
performance. Thus, we can introduce a cap on the pending discard count,
ensuring that the queue of allocations to be freed doesn't get too long
while also issuing sufficient discards to disk. The default value of
1000000 has ample room for discard spikes (e.g. from snaptrim); it could
result in multiple minutes of discards being queued up, but at least
it's not unbounded (though if a user really wants unbounded behaviour,
they can choose it by setting the new configuration option to 0).

Fixes: https://tracker.ceph.com/issues/69604
Signed-off-by: Joshua Baergen <[email protected]>
@jbaergen-do jbaergen-do force-pushed the limit-discard-qlen-upstream branch from fbbbe04 to 1dee883 Compare January 24, 2025 17:14
@jbaergen-do (Contributor, Author) commented:

That seemed to do it 🤷‍♂️

@prazumovsky commented:
Hello! Is this PR still active and in progress? We recently faced high capacity usage on devices with slow discard performance and are interested in this solution.

@jbaergen-do (Contributor, Author) replied:

Hi Peter! It's just waiting for approval at this point - @aclamk would you mind taking another look now that the tests are passing?

@aclamk (Contributor) commented Feb 11, 2025

@jbaergen-do @prazumovsky
PR has not been forgotten.
Review + teuthology testing await.

@aclamk (Contributor) commented Feb 24, 2025

Passed https://tracker.ceph.com/issues/69920.

@aclamk aclamk merged commit eef1641 into ceph:main Feb 24, 2025
12 checks passed
zmc pushed a commit to zmc/ceph that referenced this pull request Feb 26, 2025

blk/KernelDevice: Introduce a cap on the number of pending discards
@jbaergen-do jbaergen-do deleted the limit-discard-qlen-upstream branch February 28, 2025 20:54