client: Fix a deadlock when osd is full #60411
Conversation
I faced similar stalls while writing an async io test case after filling up the cluster (fbff10a), but with the fuse client. From what I found after discussions with greg and reading
Are you saying this was seen in an fs suite test run? If yes, could you point me to the relevant run link if you have it? That brings up the question -- are our tests enough to exercise this case, or did we miss it in our runs thinking things could go bad when the cluster is full?
I've lost track of the runs; I need to find them in the PR, and they are quite old so I'm not sure if the links are live. But yes - when the cluster is full and the ratios are near 1 (or just left at the default), everything goes downhill and the stall is inevitable. I guess this happens with sync io too; I can't recall exactly. I can give this a run.
We do exercise this case, but with comfortable ratios which always yield the expected results.
jenkins test api
I don't think that is exactly this bug, since the deadlock here happens after fixing the corruption as part of 3ebe974.
Let's see if we can have a test case for this. @kotreshhr did mention that it could be possible to test this via the client blocklist codepath.
I didn't find an easy way to add the test in teuthology, but I could manually reproduce this and test it. This fix breaks the lock dependency, so more work is required here. Traceback with a debug build: Lock dependency broken:
jenkins test make check arm64
@batrick I think you missed the update #60411 (comment). This fix is breaking the lock dependency. I need to see what's the best way to fix this. Any ideas?
Yes, in the back of my mind I was wondering about that. I think
The
I think doing the cancel of the
How about queuing the context callback in a finisher thread? Client has
I think the assumption that op_cancel always succeeds in the following code is required so as not to cause consistency issues. I tried a solution of unlocking and relocking the rwlock and slock (session lock) before calling the callback, but that resulted in two failures.
So releasing and reacquiring the rwlock and slock for the sake of calling the registered callback leads to issues in multiple places.
I think that would require a non-trivial refactor in many places in the objecter. Were you able to test queuing via a finisher thread?
It worked the first time. I will have to do a bit more testing to gain confidence. I will update once I do that.
@vshankar I tested the approach of queuing the context callback using the finisher thread. It works! I didn't see any issues. Here is the manual test that was done.
While testing fio with nfs ganesha, I hit an osd crash. I have reported it here: https://tracker.ceph.com/issues/68688?next_issue_id=68683
* refs/pull/60411/head:
	qa: Add async io test to nfs ganesha using fio
	client: Fix a deadlock when osd is full

Reviewed-by: Venky Shankar <[email protected]>
Reviewed-by: Patrick Donnelly <[email protected]>
Reviewed-by: Dhairya Parmar <[email protected]>
(running this through tests again)
duh! all 10 jobs failed with
@kotreshhr o_O
@kotreshhr it could be an offending PR in https://tracker.ceph.com/issues/68968
It does. The tracker update (https://tracker.ceph.com/issues/68968#note-1) was just for my tracking so that I don't accidentally merge the change. The test branch still has the change. See: https://github.com/vshankar/ceph/commits/wip-vshankar-testing-20241118.055430-debug/
@vshankar The failures are in the read path, as seen below. Could this be because of #60507? This fix should not affect reads since it is in the write path.
It's possible, yes. I'll have to double check though.
* refs/pull/60411/head:
	qa: Add async io test to nfs ganesha using fio
	client: Fix a deadlock when osd is full

Reviewed-by: Dhairya Parmar <[email protected]>
Reviewed-by: Venky Shankar <[email protected]>
Reviewed-by: Patrick Donnelly <[email protected]>
Problem:

When the osd is full, the client receives the notification and cancels the ongoing writes. If the ongoing writes are async, this can cause a deadlock, as the registered async callback also takes the 'client_lock', which handle_osd_map takes at the beginning. op_cancel_writes calls the callback registered for the async write synchronously while holding the 'client_lock', causing the deadlock.

Earlier approach:

It was first attempted to solve this issue by calling 'op_cancel_writes' without holding 'client_lock'. But this broke the lock dependency between the objecter's 'rwlock' and the async write's callback taking 'client_lock': the 'client_lock' should always be taken before 'rwlock'. So that approach was dropped in favour of the current one.

Solution:

Use C_OnFinisher for the objecter async write callback, i.e., wrap the async write's callback using the Finisher. This queues the callback to the Finisher's context queue, which the finisher thread picks up and executes, thus avoiding the deadlock.

Testing:

The fix is tested in the vstart cluster with the following reproducer.

1. Mount the cephfs volume using nfs-ganesha at /mnt
2. Run fio on /mnt in one terminal
3. In another terminal, blocklist the nfs client session
4. The fio would hang

It reproduces in the vstart cluster most of the time; I think that's because it's slow. The same test written for teuthology does not reproduce the issue. The test expects one or more writes to be ongoing in rados when the client is blocklisted for the deadlock to be hit.

Stripped down version of the traceback:
----------
#0  0x00007f4d77274960 in __lll_lock_wait ()
#1  0x00007f4d7727aff2 in pthread_mutex_lock@@GLIBC_2.2.5 ()
#2  0x00007f4d7491b0a1 in __gthread_mutex_lock (__mutex=0x7f4d200f99b0)
#3  std::mutex::lock (this=<optimized out>)
#4  std::scoped_lock<std::mutex>::scoped_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
#5  Client::C_Lock_Client_Finisher::finish (this=0x7f4ca0103550, r=-28)
#6  0x00007f4d74888dfd in Context::complete (this=0x7f4ca0103550, r=<optimized out>)
#7  0x00007f4d7498850c in std::__do_visit<...>(...) (__visitor=...)
#8  std::visit<Objecter::Op::complete(...) (__visitor=...)
#9  Objecter::Op::complete(...) (e=..., e=..., r=-28, ec=..., f=...)
#10 Objecter::Op::complete (e=..., r=-28, ec=..., this=0x7f4ca022c7f0)
#11 Objecter::op_cancel (this=0x7f4d200fab20, s=<optimized out>, tid=<optimized out>, r=-28)
#12 0x00007f4d7498ea12 in Objecter::op_cancel_writes (this=0x7f4d200fab20, r=-28, pool=103)
#13 0x00007f4d748e1c8e in Client::_handle_full_flag (this=0x7f4d200f9830, pool=103)
#14 0x00007f4d748ed20c in Client::handle_osd_map (m=..., this=0x7f4d200f9830)
#15 Client::ms_dispatch2 (this=0x7f4d200f9830, m=...)
#16 0x00007f4d75b8add2 in Messenger::ms_deliver_dispatch (m=..., this=0x7f4d200ed3e0)
#17 DispatchQueue::entry (this=0x7f4d200ed6f0)
#18 0x00007f4d75c27fa1 in DispatchQueue::DispatchThread::entry (this=<optimized out>)
#19 0x00007f4d77277c02 in start_thread ()
#20 0x00007f4d772fcc40 in clone3 ()
--------

Fixes: https://tracker.ceph.com/issues/68641
Signed-off-by: Kotresh HR <[email protected]>
The fio test is carved out as a separate PR, #61143, to run against main and validate whether the issue mentioned in #60411 (comment) exists in main as well.
jenkins test make check
@kotreshhr I spoke too soon -- similar failure without this change. See: https://pulpito.ceph.com/vshankar-2024-12-19_10:00:45-fs-wip-vshankar-testing-20241219.063429-debug-testing-default-smithi/8044375/
ok, then there is a bug in the main branch?
yes. I'll create a redmine tracker for the failure. This fix can be merged then.
@vshankar The qa test PR #61143 is merged, and you mentioned the issue exists (#60411 (comment)) without this PR. How do we proceed here?
I'm running this through a sub-suite and will merge. Should be done tomorrow.
This PR is under test in https://tracker.ceph.com/issues/69375.
@kotreshhr I could have merged it without running it through QA again, but I included it just so that the change is tested once more. I don't expect anything to fail though (apart from the failure we already have).
When the osd is full, the client receives the notification and cancels the ongoing writes. If the ongoing writes are async, this can cause a deadlock, as the registered async callback also takes the 'client_lock', which handle_osd_map takes at the beginning.
Stripped down version of the traceback:
This patch fixes the same.
Fixes: https://tracker.ceph.com/issues/68641