
blk/KernelDevice: Introduce a cap on the number of pending discards#61455

Merged
aclamk merged 1 commit into ceph:main from jbaergen-do:limit-discard-qlen-upstream on Feb 24, 2025

Conversation

@jbaergen-do (Contributor) commented Jan 20, 2025

Some disks have a discard performance that is too low to keep up with write workloads. Using async discard in this case will cause the OSD to run out of capacity due to the number of outstanding discards preventing allocations from being freed. While sync discard could be used in this case to cause backpressure, this might have unacceptable performance implications.

For the most part, as long as enough discards are getting through to a device, then it will stay trimmed enough to maintain acceptable performance. Thus, we can introduce a cap on the pending discard count, ensuring that the queue of allocations to be freed doesn't get too long while also issuing sufficient discards to disk. The default value of 1000000 has ample room for discard spikes (e.g. from snaptrim); it could result in multiple minutes of discards being queued up, but at least it's not unbounded (though if a user really wants unbounded behaviour, they can choose it by setting the new configuration option to 0).

Fixes: https://tracker.ceph.com/issues/69604

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@jbaergen-do jbaergen-do requested a review from a team as a code owner January 20, 2025 18:40
@ifed01 (Contributor) commented Jan 21, 2025

jenkins test make check

@aclamk (Contributor) commented Jan 23, 2025

jenkins test make check arm64


```cpp
std::lock_guard l(discard_lock);

if (max_pending > 0 && discard_queued.num_intervals() >= max_pending)
```
A reviewer (Contributor) commented on this hunk:
I think the max_pending value should be expressed in bytes, not items. A byte-based value gives the operator a hint for how to set it: one could allow, say, 1% of disk capacity to sit in the discard queue and plan OSD free-space control around that. Otherwise it's just pulling a value out of a hat.

@jbaergen-do (Contributor, Author) replied:

Yeah, I considered that, but the drive in question is limited far more by discard IOPS than it is by bytes.

@aclamk (Contributor) commented Jan 23, 2025

@jbaergen-do Good idea for a workaround solution.

@ifed01 The perfect solution would be for Allocators to immediately take ownership of released extents, so they can allocate them back, while freed chunks that have gone unused for a long time are sent for discard.
Do you think we can implement such a feature in next-gen allocators?

@jbaergen-do (Contributor, Author) commented:

FWIW, another solution here is to perform a periodic sweep across free+dirty extents in the allocator, since if a block gets overwritten soon after being freed then the discard was a waste. However, that looked like a lot of effort to implement, and this simple solution has held up so far.

@aclamk (Contributor) commented Jan 23, 2025

jenkins test make check

@aclamk (Contributor) commented Jan 23, 2025

> FWIW, another solution here is to perform a periodic sweep across free+dirty extents in the allocator, since if a block gets overwritten soon after being freed then the discard was a waste. However, that looked like a lot of effort to implement, and this simple solution has held up so far.

I get it. I admire the simplicity of just skipping the discard if the drive cannot keep up.

@aclamk (Contributor) commented Jan 23, 2025

jenkins test make check arm64

@aclamk (Contributor) commented Jan 24, 2025

@jbaergen-do I have no idea how this PR managed it, but it consistently fails make check arm64 on:
  • projectroot.src.test.objectstore.unittest_bluefs
  • projectroot.src.test.objectstore.unittest_deferred
  • projectroot.src.test.objectstore.unittest_bdev
I originally suspected some previously merged toxin, but other PRs do not exhibit it.

@jbaergen-do (Contributor, Author) commented:

@aclamk Well, that's bizarre! The crashes are all in device open, which is well before anything my changes touch 🤔

I'm going to pull latest into my PR and we'll start there; I've done some digging and can't think of any reason why my changes (or any other recent changes, for that matter) could have caused this.

Some disks have a discard performance that is too low to keep up with
write workloads. Using async discard in this case will cause the OSD to
run out of capacity due to the number of outstanding discards preventing
allocations from being freed. While sync discard could be used in this
case to cause backpressure, this might have unacceptable performance
implications.

For the most part, as long as enough discards are getting through to a
device, then it will stay trimmed enough to maintain acceptable
performance. Thus, we can introduce a cap on the pending discard count,
ensuring that the queue of allocations to be freed doesn't get too long
while also issuing sufficient discards to disk. The default value of
1000000 has ample room for discard spikes (e.g. from snaptrim); it could
result in multiple minutes of discards being queued up, but at least
it's not unbounded (though if a user really wants unbounded behaviour,
they can choose it by setting the new configuration option to 0).

Fixes: https://tracker.ceph.com/issues/69604
Signed-off-by: Joshua Baergen <[email protected]>
@jbaergen-do jbaergen-do force-pushed the limit-discard-qlen-upstream branch from fbbbe04 to 1dee883 Compare January 24, 2025 17:14
@jbaergen-do (Contributor, Author) commented:

That seemed to do it 🤷‍♂️

@prazumovsky commented:
Hello! Is this PR still active and in progress? We recently faced high capacity usage on devices with slow discard performance and are interested in this solution.

@jbaergen-do (Contributor, Author) replied:

Hi Peter! It's just waiting for approval at this point - @aclamk would you mind taking another look now that the tests are passing?

@aclamk (Contributor) commented Feb 11, 2025

@jbaergen-do @prazumovsky
PR has not been forgotten.
Review + teuthology testing await.

@aclamk (Contributor) commented Feb 24, 2025

Passed https://tracker.ceph.com/issues/69920.

@aclamk aclamk merged commit eef1641 into ceph:main Feb 24, 2025
12 checks passed
zmc pushed a commit to zmc/ceph that referenced this pull request Feb 26, 2025

blk/KernelDevice: Introduce a cap on the number of pending discards
@jbaergen-do jbaergen-do deleted the limit-discard-qlen-upstream branch February 28, 2025 20:54