blk/KernelDevice: Introduce a cap on the number of pending discards#61455
Conversation
jenkins test make check

jenkins test make check arm64
std::lock_guard l(discard_lock);
if (max_pending > 0 && discard_queued.num_intervals() >= max_pending)
I think the max_pending value should not be in items, but in bytes.
In bytes, it gives the operator some hint about how to set the value - one can imagine that 1% of disk capacity is allowed in the discard queue, and OSD free-space planning can be based on that.
Otherwise it's just taking a value out of a hat.
Yeah, I considered that, but the drive in question is limited far more by discard IOPS than it is by bytes
@jbaergen-do Good idea for a workaround solution. @ifed01 The perfect solution would be if Allocators could immediately get ownership of released extents, so they can allocate them back, while freed, long-unused chunks would be sent for discard.
FWIW, another solution here is to perform a periodic sweep across free+dirty extents in the allocator, since if a block gets overwritten soon after being freed then the discard was a waste. However, that looked like a lot of effort to implement, and this simple solution has held up so far.
jenkins test make check
I get it. I admire the simplicity of just skipping discard if the drive cannot keep up.
jenkins test make check arm64
@jbaergen-do I have no idea how this PR managed this, but it consistently fails in make check arm64 on
@aclamk Well, that's bizarre! The crashes are all in device open, which is well before anything my changes touch 🤔 I'm going to pull latest into my PR and we'll start there; I've done some digging around and can't think of any reason why my changes (or any other recent changes, for that matter) could have caused this.
Some disks have a discard performance that is too low to keep up with write workloads. Using async discard in this case will cause the OSD to run out of capacity due to the number of outstanding discards preventing allocations from being freed. While sync discard could be used in this case to cause backpressure, this might have unacceptable performance implications.

For the most part, as long as enough discards are getting through to a device, then it will stay trimmed enough to maintain acceptable performance. Thus, we can introduce a cap on the pending discard count, ensuring that the queue of allocations to be freed doesn't get too long while also issuing sufficient discards to disk. The default value of 1000000 has ample room for discard spikes (e.g. from snaptrim); it could result in multiple minutes of discards being queued up, but at least it's not unbounded (though if a user really wants unbounded behaviour, they can choose it by setting the new configuration option to 0).

Fixes: https://tracker.ceph.com/issues/69604
Signed-off-by: Joshua Baergen <[email protected]>
Force-pushed fbbbe04 to 1dee883
That seemed to do it 🤷‍♂️
Hello! Is this PR still active and in progress? We recently faced high capacity usage on devices with slow discard performance and are interested in this solution.
Hi Peter! It's just waiting for approval at this point - @aclamk would you mind taking another look now that the tests are passing?
@jbaergen-do @prazumovsky |
…tream blk/KernelDevice: Introduce a cap on the number of pending discards