Consider the simple case of submitting a readv operation via io_uring_enter(), with no polling or other special flags. If the readv op is unable to immediately obtain a blk-mq tag in blk_mq_submit_bio(), it doesn’t wait. Control returns to io_queue_sqe(), which calls io_queue_async() to retry the operation. io_uring_enter() then completes normally. An iou-wrk-XXXX thread resubmits the I/O and waits for a tag as necessary. Eventually the readv completes normally and the CQE res field indicates success.
Now consider the more complex case where blk_mq_submit_bio() splits the bio because the transfer size exceeds the limits for the target device. If all of the bios making up the original request fail to immediately get a tag, then the same retry procedure is used. An iou-wrk-XXXX thread resubmits the I/O and waits for tags as necessary, and eventually the readv completes normally.
But if some of the split bios get a tag immediately while others do not, the retry doesn’t happen. For the bios that get a tag, those chunks of the original I/O are completed. But the overall readv returns -EAGAIN in the CQE res field because of the bios that couldn’t immediately get a tag.
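To make the three cases concrete, here is a toy model of the behavior I'm seeing. It is only a sketch of my observations, not kernel code: the enum, the function name classify_split(), and the got_tag array are all made up for illustration, and I'm assuming that a request whose bios all get tags immediately simply completes without any retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; none of these exist in the kernel. */
enum split_outcome {
    OUTCOME_COMPLETED_INLINE,  /* every split bio got a tag immediately (assumed) */
    OUTCOME_RETRIED_BY_WORKER, /* no bio got a tag: iou-wrk-XXXX retries and waits */
    OUTCOME_EAGAIN_TO_USER     /* mixed result: CQE res ends up as -EAGAIN */
};

/*
 * got_tag[i] is true if the i-th split bio obtained a blk-mq tag
 * without waiting. Returns the outcome observed for the whole readv.
 */
enum split_outcome classify_split(const bool *got_tag, size_t nr_bios)
{
    size_t tagged = 0;

    assert(got_tag != NULL && nr_bios > 0);

    for (size_t i = 0; i < nr_bios; i++) {
        if (got_tag[i])
            tagged++;
    }

    if (tagged == nr_bios)
        return OUTCOME_COMPLETED_INLINE;
    if (tagged == 0)
        return OUTCOME_RETRIED_BY_WORKER;
    return OUTCOME_EAGAIN_TO_USER;
}
```

Only the last branch is the one I'm asking about: a single bio that misses its tag is enough to turn the whole readv into -EAGAIN, even though the other chunks were transferred.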
Is this expected behavior? It seems desirable for io_uring code to retry the readv in this case like in the other cases, rather than requiring user space to do the retry. But I’m just learning about io_uring and how it interacts with the blk-mq layer, and I don’t see a straightforward way to fix this.
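For reference, the user-space retry I would otherwise need looks roughly like the sketch below. To keep it self-contained I've hidden the actual submission behind a callback: submit_readv_fn, readv_with_retry(), and the fake_device stand-in are hypothetical names of my own, not part of io_uring or liburing. A real callback would queue the readv SQE, wait for its CQE, and return the res field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: submits one readv and returns the CQE res value. */
typedef int (*submit_readv_fn)(void *ctx);

/*
 * Resubmit the whole readv while the CQE res is -EAGAIN, up to
 * max_attempts submissions. Returns the last res value seen.
 */
int readv_with_retry(submit_readv_fn submit, void *ctx, int max_attempts)
{
    int res = -EAGAIN;

    assert(submit != NULL && max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        res = submit(ctx);
        if (res != -EAGAIN)
            break; /* success or a different error: stop retrying */
    }
    return res;
}

/* Stand-in used only to exercise the loop; it does no real I/O. */
struct fake_device {
    int eagain_left; /* number of submissions that will still fail */
    int bytes;       /* res to report once a submission succeeds */
};

int fake_submit(void *ctx)
{
    struct fake_device *dev = ctx;

    if (dev->eagain_left > 0) {
        dev->eagain_left--;
        return -EAGAIN;
    }
    return dev->bytes;
}
```

With a fake_device that fails twice, readv_with_retry(fake_submit, &dev, 5) returns the byte count on the third submission. Note that this resubmits the entire readv, including the chunks that already completed, which is part of why I'd prefer the retry to happen inside io_uring.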
Thoughts?