
client: Fix a deadlock when osd is full#60411

Merged
vshankar merged 1 commit into ceph:main from kotreshhr:aio-nfsganesha-osd-full-hang
Jan 2, 2025

Conversation

@kotreshhr
Contributor

@kotreshhr kotreshhr commented Oct 21, 2024

When the osd is full, the client receives the notification and cancels the ongoing writes. If the ongoing writes are async, this can cause a deadlock: the registered async callback also takes the 'client_lock', which handle_osd_map takes at the beginning.
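The cycle can be illustrated with a minimal sketch using hypothetical simplified types (the real code uses ceph::mutex, Context, and C_Lock_Client_Finisher); a plain bool stands in for the lock so the sketch stays runnable instead of hanging:

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of the reported cycle: handle_osd_map() runs with client_lock held
// and synchronously cancels in-flight writes; each async write's completion
// callback takes client_lock again. With a non-recursive mutex the second
// acquisition would block forever.
struct FakeClient {
  bool client_lock_held = false;  // stands in for ceph::mutex client_lock
  std::vector<std::function<void(int)>> write_callbacks;
  std::string outcome;

  // Mirrors C_Lock_Client_Finisher::finish(): first thing it does is take
  // client_lock.
  void on_write_complete(int r) {
    if (client_lock_held) {
      outcome = "deadlock";  // real code would block here forever
      return;
    }
    client_lock_held = true;
    outcome = "completed r=" + std::to_string(r);
    client_lock_held = false;
  }

  // Mirrors handle_osd_map() -> _handle_full_flag() -> op_cancel_writes():
  // callbacks are invoked synchronously while client_lock is still held.
  void handle_osd_map_full() {
    client_lock_held = true;
    for (auto& cb : write_callbacks)
      cb(-28);  // ENOSPC, matching r=-28 in the traceback
    client_lock_held = false;
  }
};
```

Running the cancel path with a registered callback reproduces the self-acquisition that the traceback below shows as a real hang.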

Stripped down version of Traceback:

0  0x00007f4d77274960 in __lll_lock_wait ()
1  0x00007f4d7727aff2 in pthread_mutex_lock@@GLIBC_2.2.5 ()
2  0x00007f4d7491b0a1 in __gthread_mutex_lock (__mutex=0x7f4d200f99b0)
3  std::mutex::lock (this=<optimized out>)
4  std::scoped_lock<std::mutex>::scoped_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
5  Client::C_Lock_Client_Finisher::finish (this=0x7f4ca0103550, r=-28)
6  0x00007f4d74888dfd in Context::complete (this=0x7f4ca0103550, r=<optimized out>)
7  0x00007f4d7498850c in std::__do_visit<...>(...) (__visitor=...) 
8  std::visit<Objecter::Op::complete(...) (__visitor=...) 
9  Objecter::Op::complete(...) (e=..., e=..., r=-28, ec=..., f=...)
10 Objecter::Op::complete (e=..., r=-28, ec=..., this=0x7f4ca022c7f0)
11 Objecter::op_cancel (this=0x7f4d200fab20, s=<optimized out>, tid=<optimized out>, r=-28)
12 0x00007f4d7498ea12 in Objecter::op_cancel_writes (this=0x7f4d200fab20, r=-28, pool=103)
13 0x00007f4d748e1c8e in Client::_handle_full_flag (this=0x7f4d200f9830, pool=103)
14 0x00007f4d748ed20c in Client::handle_osd_map (m=..., this=0x7f4d200f9830)
15 Client::ms_dispatch2 (this=0x7f4d200f9830, m=...) 
16 0x00007f4d75b8add2 in Messenger::ms_deliver_dispatch (m=..., this=0x7f4d200ed3e0) 
17 DispatchQueue::entry (this=0x7f4d200ed6f0)
18 0x00007f4d75c27fa1 in DispatchQueue::DispatchThread::entry (this=<optimized out>) 
19 0x00007f4d77277c02 in start_thread ()
20 0x00007f4d772fcc40 in clone3 ()

This patch fixes the deadlock.

Fixes: https://tracker.ceph.com/issues/68641


@kotreshhr kotreshhr requested review from a team and vshankar October 21, 2024 11:02
@github-actions github-actions bot added the cephfs Ceph File System label Oct 21, 2024
@dparmar18
Contributor

I faced similar stalls while writing an async I/O test case after filling up the cluster (fbff10a), but with the fuse client. From what I found after discussions with Greg and reading qa/suites/fs/functional/tasks/mds-full.yaml, things go haywire if the near-full (and all the relevant) ratios are left at the default (in other words, close to 1) or higher; when I used the config values from the YAML (close to 0.6-0.7), everything worked fine and the stall was not seen. Do we know what these ratios were set to in the cluster used by QE?

@vshankar
Contributor

> I faced similar stalls while writing an async io test case after filling up the cluster (fbff10a) but with fuse client. From what I found after discussions with greg and reading qa/suites/fs/functional/tasks/mds-full.yaml is that the things go haywire if the near full (and all the relevant) ratio(s) are set default (in other words closer to 1) or more and so when I used the configs vals used in the YAML (close to 0.6-0.7), everything worked fine and the stall was not seen. Do we know what these ratios were set in the cluster used by QE?

Are you saying this was seen in an fs suite test run? If yes, could you point me to the relevant run link if you have it?

That brings up the question -- are our tests enough to exercise this case or did we miss it in our runs thinking things could go bad when cluster is full?

@dparmar18
Contributor

> I faced similar stalls while writing an async io test case after filling up the cluster (fbff10a) but with fuse client. From what I found after discussions with greg and reading qa/suites/fs/functional/tasks/mds-full.yaml is that the things go haywire if the near full (and all the relevant) ratio(s) are set default (in other words closer to 1) or more and so when I used the configs vals used in the YAML (close to 0.6-0.7), everything worked fine and the stall was not seen. Do we know what these ratios were set in the cluster used by QE?

> Are you saying this was seen in fs suite test run? If yes, could you point me to the relevant run link if you have it?

> That brings up the question -- are our tests enough to exercise this case or did we miss it in our runs thinking things could go bad when cluster is full?

I've lost track of the runs; I need to find them in the PR, and they are quite old, so I'm not sure the links are still live. But yes, when the cluster is full and the ratios are near 1 (or left at the default), everything goes downhill and the stall is inevitable. I guess this happens with sync I/O too; I can't recall exactly. I can give this a run.

@dparmar18
Contributor

> That brings up the question -- are our tests enough to exercise this case or did we miss it in our runs thinking things could go bad when cluster is full?

We do exercise this case, but with comfortable ratios that always yield the expected results.

@kotreshhr
Contributor Author

jenkins test api

@kotreshhr
Contributor Author

> That brings up the question -- are our tests enough to exercise this case or did we miss it in our runs thinking things could go bad when cluster is full?

> We do exercise this case but with comfort ratios which always yield expected results.

I don't think that is exactly this bug, since the deadlock here happens after fixing the corruption as part of 3ebe974

@vshankar
Contributor

> That brings up the question -- are our tests enough to exercise this case or did we miss it in our runs thinking things could go bad when cluster is full?

> We do exercise this case but with comfort ratios which always yield expected results.

> I don't think that is exactly this bug. Since the deadlock here is happening after fixing the corruption as part of 3ebe974

Let's see if we can have a test case for this. @kotreshhr did mention that it could be possible to test this via the client blocklist codepath.

@kotreshhr
Contributor Author

> That brings up the question -- are our tests enough to exercise this case or did we miss it in our runs thinking things could go bad when cluster is full?

> We do exercise this case but with comfort ratios which always yield expected results.

> I don't think that is exactly this bug. Since the deadlock here is happening after fixing the corruption as part of 3ebe974

> Let's see if we can have a test case for this. @kotreshhr did mention that it could be possible to test this via the client blocklist codepath.

I didn't find an easy way to add the test in teuthology, but I could reproduce this manually and test it. However, this fix breaks the established lock ordering, so more work is required here.

Traceback with debug build:

#0  0x00007f78a0908664 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f78a08afc4e in raise () from /lib64/libc.so.6
#2  0x00007f78a0897902 in abort () from /lib64/libc.so.6
#3  0x00007f789f40e50e in ceph::__ceph_abort (file=file@entry=0x7f789f9645e0 "/home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/lockdep.cc", line=line@entry=337, func=func@entry=0x7f789f964678 "int lockdep_will_lock(const char*, int, bool, bool)", msg="abort() called") at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/assert.cc:198
#4  0x00007f789f49577c in lockdep_will_lock (name=0x32dd1b30 "Client::client_lock", id=<optimized out>, force_backtrace=<optimized out>, recursive=recursive@entry=false) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/lockdep.cc:337
#5  0x00007f789f498957 in ceph::mutex_debug_detail::mutex_debugging_base::_will_lock (this=this@entry=0x32dce6b0, recursive=recursive@entry=false) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/mutex_debug.cc:49
#6  0x00007f78668523cc in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock (this=this@entry=0x32dce6b0, no_lockdep=no_lockdep@entry=false) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/mutex_debug.h:179
#7  0x00007f786685a8bf in std::scoped_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >::scoped_lock (this=0x7f78495fee78, __m=...) at /usr/include/c++/14/mutex:787
#8  Client::C_Lock_Client_Finisher::finish (this=0x7f774401f220, r=-108) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/client/Client.cc:11426
#9  0x00007f786684b691 in Context::complete (this=0x7f774401f220, r=<optimized out>) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/include/Context.h:99
#10 0x00007f786698744d in Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}::operator()<Context*>(Context*&&) const (__closure=0x7f78495feed0, arg=@0x7f774406d9c0: 0x7f774401f220) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.h:2050
#11 std::__invoke_impl<void, Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, Context*>(std::__invoke_other, Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, Context*&&) (__f=...) at /usr/include/c++/14/bits/invoke.h:61
#12 std::__invoke<Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, Context*>(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, Context*&&) (__fn=...) at /usr/include/c++/14/bits/invoke.h:96
#13 std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<void> (*)(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&)>, std::integer_sequence<unsigned long, 2ul> >::__visit_invoke(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&) (__visitor=..., __vars#0=...) at /usr/include/c++/14/variant:1060
#14 std::__do_visit<std::__detail::__variant::__deduce_visit_result<void>, Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*> >(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&) (__visitor=...) at /usr/include/c++/14/variant:1821
#15 0x00007f786698746f in std::visit<Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*> >(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&) (__visitor=...) at /usr/include/c++/14/variant:1882
#16 0x00007f78669874bf in Objecter::Op::complete (f=..., ec=..., r=r@entry=-108, e=...) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.h:2047
#17 0x00007f78669590ba in Objecter::Op::complete (this=0x7f774406d5d0, ec=..., r=-108, e=...) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.h:2062
#18 Objecter::op_cancel (this=this@entry=0x32dc9f50, s=s@entry=0x7f7770013270, tid=<optimized out>, r=r@entry=-108) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.cc:2520
#19 0x00007f7866959aed in Objecter::op_cancel_writes (this=0x32dc9f50, r=-108, pool=-1) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.cc:2619
#20 0x00007f78668c6ac9 in Client::handle_osd_map (this=this@entry=0x32dce4f0, m=...) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/client/Client.cc:2905
#21 0x00007f78668c782e in Client::ms_dispatch2 (this=0x32dce4f0, m=...) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/client/Client.cc:2980
#22 0x00007f789f53d7fd in Messenger::ms_deliver_dispatch (this=0x32dc5910, m=...) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/msg/Messenger.h:740
#23 0x00007f789f53a344 in DispatchQueue::entry (this=0x32dc5cc8) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/msg/DispatchQueue.cc:201
#24 0x00007f789f5ebc13 in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/msg/DispatchQueue.h:101
#25 0x00007f789f3ed669 in Thread::entry_wrapper (this=0x32dc5e58) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/Thread.cc:87
#26 0x00007f789f3ed681 in Thread::_entry_func (arg=<optimized out>) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/common/Thread.cc:74
#27 0x00007f78a09066d7 in start_thread () from /lib64/libc.so.6
#28 0x00007f78a098a60c in clone3 () from /lib64/libc.so.6

Lockdep report of the broken lock ordering:

2024-10-22T21:36:00.239+0530 7f78a04c9240 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-22T21:36:00.245+0530 7f78a04c9240 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-22T21:36:00.275+0530 7f78a04c9240 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-22T21:36:00.280+0530 7f78a04c9240 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-22T21:36:00.280+0530 7f78a04c9240 -1 asok(0x32a4bfe0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/build/asok/client.vstart.112582.asok': (17) File exists
2024-10-22T21:53:39.076+0530 7f78496006c0  0 
------------------------------------
existing dependency Client::client_lock (41) -> Objecter::rwlock (24) at:

2024-10-22T21:53:39.079+0530 7f78496006c0  0 new dependency Objecter::rwlock (24) -> Client::client_lock (41) creates a cycle at
 ceph version Development (no_version) squid (dev)
 1: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x17) [0x7f789f498957]
 2: (ceph::mutex_debug_detail::mutex_debug_impl<false>::lock(bool)+0x38) [0x7f78668523cc]
 3: /lib64/libcephfs.so.2(+0x5a8bf) [0x7f786685a8bf]
 4: (Context::complete(int)+0x9) [0x7f786684b691]
 5: /lib64/libcephfs.so.2(+0x18744d) [0x7f786698744d]
 6: /lib64/libcephfs.so.2(+0x18746f) [0x7f786698746f]
 7: /lib64/libcephfs.so.2(+0x1874bf) [0x7f78669874bf]
 8: /lib64/libcephfs.so.2(+0x1590ba) [0x7f78669590ba]
 9: /lib64/libcephfs.so.2(+0x159aed) [0x7f7866959aed]
 10: /lib64/libcephfs.so.2(+0xc6ac9) [0x7f78668c6ac9]
 11: /lib64/libcephfs.so.2(+0xc782e) [0x7f78668c782e]
 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x5d) [0x7f789f53d7fd]
 13: (DispatchQueue::entry()+0x418) [0x7f789f53a344]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f789f5ebc13]
 15: (Thread::entry_wrapper()+0x53) [0x7f789f3ed669]
 16: (Thread::_entry_func(void*)+0x9) [0x7f789f3ed681]
 17: /lib64/libc.so.6(+0x976d7) [0x7f78a09066d7]
 18: /lib64/libc.so.6(+0x11b60c) [0x7f78a098a60c]

2024-10-22T21:53:39.080+0530 7f78496006c0  0 btw, i am holding these locks:
2024-10-22T21:53:39.080+0530 7f78496006c0  0   Objecter::rwlock (24)
2024-10-22T21:53:39.080+0530 7f78496006c0  0 


@batrick
Member

batrick commented Oct 22, 2024

jenkins test make check arm64

@batrick batrick removed their assignment Oct 22, 2024
@kotreshhr
Contributor Author

kotreshhr commented Oct 22, 2024

@batrick I think you missed the update #60411 (comment). This fix breaks the lock ordering, and I need to find the best way to fix it. Any ideas?

@batrick
Member

batrick commented Oct 22, 2024

> @batrick I think you missed the update #60411 (comment) This fix is breaking the lock dependency. I need to see what's the best way to fix this. Any ideas ?

Yes, in the back of my mind I was wondering about that. I think the ordering Client::client_lock (41) -> Objecter::rwlock (24) is reasonable, but the reverse should not be. Looking at Objecter::op_cancel_writes, the rwlock could be structured differently so that it can be dropped before actually cancelling ops. There are a few problems I see:

  • int Objecter::op_cancel(OSDSession *s, ceph_tid_t tid, int r) is inherently unsafe because the OSDSession* may be invalidated by a session close. We're not doing any kind of reference counting on it.
  • There is another int Objecter::op_cancel(ceph_tid_t tid, int r) which looks up the session with appropriate session locks (but not the rwlock when traversing sessions??) but it's probably expensive.

I think cancelling the collected std::vector<ceph_tid_t> after dropping the rwlock is the way to go, but the code needs a bit of refactoring work. The inconsistent use of locks is really concerning.
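One way to realize that refactor, sketched with hypothetical simplified types (MiniObjecter is not the real Objecter): collect the write tids while holding rwlock, drop the lock, then cancel tid-by-tid via a lookup that tolerates the op having completed in the unlocked window. The ceph_assert(cancel_result == 0) in the current code would then have to become tolerant of -ENOENT:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <vector>

struct MiniObjecter {
  std::shared_mutex rwlock;
  std::map<uint64_t, bool> ops;       // tid -> is_write
  std::vector<uint64_t> cancelled;    // tids whose callbacks fired

  // Per-tid cancel that re-checks existence under the lock, like the tid-only
  // Objecter::op_cancel(tid, r) overload would; it must tolerate the op
  // having completed after rwlock was dropped.
  int op_cancel(uint64_t tid, int /*r*/) {
    std::unique_lock wl(rwlock);
    auto it = ops.find(tid);
    if (it == ops.end())
      return -2;                      // -ENOENT: already finished, not an error
    ops.erase(it);
    cancelled.push_back(tid);         // callback would fire here, outside the caller's locks
    return 0;
  }

  void op_cancel_writes(int r) {
    std::vector<uint64_t> to_cancel;
    {
      std::shared_lock rl(rwlock);    // hold rwlock only for the scan
      for (auto& [tid, is_write] : ops)
        if (is_write)
          to_cancel.push_back(tid);
    }                                 // rwlock dropped before cancellation
    for (uint64_t tid : to_cancel)
      op_cancel(tid, r);              // may legitimately return -ENOENT now
  }
};
```

Cancelling this way means completion callbacks never run under a lock the caller holds, at the cost of per-tid lookups and a weaker "cancels always succeed" invariant.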

@vshankar
Contributor

vshankar commented Oct 23, 2024

> @batrick I think you missed the update #60411 (comment) This fix is breaking the lock dependency. I need to see what's the best way to fix this. Any ideas ?

> Yes, in the back of my mind I was wondering about that. I think Client::client_lock (41) -> Objecter::rwlock (24) is reasonable but the reverse should not be. Looking at:

> Objecter::op_cancel_writes

> The rwlock could be structured differently so the rwlock can be dropped before actually cancelling ops. There are a few problems I see:

>   • int Objecter::op_cancel(OSDSession *s, ceph_tid_t tid, int r) is inherently unsafe because the OSDSession* may be invalidated by a session close. We're not doing any kind of reference counting on it.
>   • There is another int Objecter::op_cancel(ceph_tid_t tid, int r) which looks up the session with appropriate session locks (but not the rwlock when traversing sessions??) but it's probably expensive.

> I think doing the cancel of the std::vector<ceph_tid_t> after dropping the rwlock is the way to go but the code needs a bit of refactoring work. The inconsistent use of locks is really concerning.

How about queuing the context callback in a finisher thread? Client has objecter_finisher, which is passed to the Filer class; objecter_finisher is also used by WritebackHandler.
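A minimal sketch of that idea, with a hypothetical MiniFinisher standing in for ceph's Finisher: the cancelling thread only enqueues the completion, and a dedicated thread runs it later with no Objecter locks held, so its acquisition of Client::client_lock cannot close the lockdep cycle:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Single-threaded finisher: queued callbacks run on a dedicated worker
// thread, never inline in the caller's locking context.
class MiniFinisher {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> q;
  bool stopping = false;
  std::thread worker;  // must be declared last so the queue exists first

public:
  MiniFinisher() : worker([this] {
    std::unique_lock l(m);
    while (true) {
      cv.wait(l, [this] { return stopping || !q.empty(); });
      if (q.empty() && stopping) return;  // drained and asked to stop
      auto fn = std::move(q.front());
      q.pop_front();
      l.unlock();
      fn();      // runs with no finisher (or caller) locks held
      l.lock();
    }
  }) {}

  ~MiniFinisher() {
    { std::lock_guard l(m); stopping = true; }
    cv.notify_one();
    worker.join();  // drains remaining callbacks before joining
  }

  void queue(std::function<void()> fn) {
    { std::lock_guard l(m); q.push_back(std::move(fn)); }
    cv.notify_one();
  }
};
```

In this shape, a dispatcher thread can queue a completion while it still holds client_lock; the callback's own lock acquisition happens later on the worker thread, after the dispatcher has released the lock.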

@kotreshhr
Contributor Author

> @batrick I think you missed the update #60411 (comment) This fix is breaking the lock dependency. I need to see what's the best way to fix this. Any ideas ?

> Yes, in the back of my mind I was wondering about that. I think Client::client_lock (41) -> Objecter::rwlock (24) is reasonable but the reverse should not be. Looking at:
> Objecter::op_cancel_writes
> The rwlock could be structured differently so the rwlock can be dropped before actually cancelling ops. There are a few problems I see:

>   • int Objecter::op_cancel(OSDSession *s, ceph_tid_t tid, int r) is inherently unsafe because the OSDSession* may be invalidated by a session close. We're not doing any kind of reference counting on it.
>   • There is another int Objecter::op_cancel(ceph_tid_t tid, int r) which looks up the session with appropriate session locks (but not the rwlock when traversing sessions??) but it's probably expensive.

> I think doing the cancel of the std::vector<ceph_tid_t> after dropping the rwlock is the way to go but the code needs a bit of refactoring work. The inconsistent use of locks is really concerning.

> How about queuing the context callback in a finisher thread? Client has objecter_finisher which is passed to Filer class. objecter_finisher also is used by WritebackHandler.

I think the assumption in the following code, that op_cancel always succeeds, is required to avoid consistency issues.

epoch_t Objecter::op_cancel_writes(int r, int64_t pool)
{ 
  unique_lock wl(rwlock);
  
  std::vector<ceph_tid_t> to_cancel;
  bool found = false;
  
  for (auto siter = osd_sessions.begin();
       siter != osd_sessions.end(); ++siter) {
    OSDSession *s = siter->second; 
    shared_lock sl(s->lock);
    for (auto op_i = s->ops.begin();
         op_i != s->ops.end(); ++op_i) {
      if (op_i->second->target.flags & CEPH_OSD_FLAG_WRITE
          && (pool == -1 || op_i->second->target.target_oloc.pool == pool)) {
        to_cancel.push_back(op_i->first);
      }
    }
    sl.unlock();
    for (auto titer = to_cancel.begin(); titer != to_cancel.end(); ++titer) {
      int cancel_result = op_cancel(s, *titer, r);
      // We hold rwlock across search and cancellation, so cancels
      // should always succeed
      ceph_assert(cancel_result == 0);
    }
....
....

I tried a solution that unlocks and relocks the rwlock and slock (session lock) before calling the callback, but that resulted in two failures:

  1. libaio fio: only one thread received the EIO; the others still ran to completion, but the mount wasn't accessible. No crashes though.
  2. I saw a crash from handle_osd_op_reply failing to call the registered callback. See the crash below.

So releasing and reacquiring the rwlock and slock just for the sake of calling the registered callback leads to issues in multiple places.

#0  0x00007fd9f9e5a8c9 in Client::C_Lock_Client_Finisher::finish (this=0x7fd8140d8bc0, r=-108)
   at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/client/Client.cc:11427
#1  0x00007fd9f9e4b691 in Context::complete (this=0x7fd8140d8bc0, r=<optimized out>)
   at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/include/Context.h:99
#2  0x00007fd9f9f8768d in Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}::operator()<Context*>(Context*&&) const (__closure=0x7fd9ef3fe800, 
   arg=@0x7fd9ef3fe9f0: 0x7fd8140d8bc0)
   at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.h:2050
#3  std::__invoke_impl<void, Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, Context*>(std::__invoke_other, Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, Context*&&) (__f=...) at /usr/include/c++/14/bits/invoke.h:61
#4  std::__invoke<Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, Context*>(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, Context*&&) (__fn=...) at /usr/include/c++/14/bits/invoke.h:96
#5  std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<void> (*)(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&)>, std::integer_sequence<unsigned long, 2ul> >::__visit_invoke(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&) (__visitor=..., __vars#0=...)
   at /usr/include/c++/14/variant:1060
#6  std::__do_visit<std::__detail::__variant::__deduce_visit_result<void>, Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*> >(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&) (__visitor=...) at /usr/include/c++/14/variant:1821
#7  0x00007fd9f9f876af in std::visit<Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*> >(Objecter::Op::complete(std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&, boost::system::error_code, int, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>)::{lambda(auto:1&&)#1}&&, std::variant<boost::asio::any_completion_handler<void (boost::system::error_code)>, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code)> >, Context*>&&) (__visitor=...) at /usr/include/c++/14/variant:1882
#8  0x00007fd9f9f876ff in Objecter::Op::complete (f=..., ec=..., r=r@entry=-108, e=...)
   at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/Objecter.h:2047
#9  0x00007fd9f9f5e3ff in Objecter::handle_osd_op_reply (this=this@entry=0x347b28f0, 
   m=m@entry=0x7fd990009af0)

@vshankar
Copy link
Contributor

@batrick I think you missed the update #60411 (comment). This fix is breaking the lock dependency. I need to see what's the best way to fix this. Any ideas?

Yes, in the back of my mind I was wondering about that. I think Client::client_lock (41) -> Objecter::rwlock (24) is reasonable but the reverse should not be. Looking at:
Objecter::op_cancel_writes
The locking could be structured differently so the rwlock can be dropped before actually cancelling ops. There are a few problems I see:

  • int Objecter::op_cancel(OSDSession *s, ceph_tid_t tid, int r) is inherently unsafe because the OSDSession* may be invalidated by a session close. We're not doing any kind of reference counting on it.
  • There is another int Objecter::op_cancel(ceph_tid_t tid, int r) which looks up the session with appropriate session locks (but not the rwlock when traversing sessions??) but it's probably expensive.

I think doing the cancel of the std::vector<ceph_tid_t> after dropping the rwlock is the way to go but the code needs a bit of refactoring work. The inconsistent use of locks is really concerning.

How about queuing the context callback in a finisher thread? Client has objecter_finisher, which is passed to the Filer class. objecter_finisher is also used by WritebackHandler.

I think the assumption that op_cancel always succeeds in the following code is required to avoid consistency issues.

epoch_t Objecter::op_cancel_writes(int r, int64_t pool)
{ 
  unique_lock wl(rwlock);
  
  std::vector<ceph_tid_t> to_cancel;
  bool found = false;
  
  for (auto siter = osd_sessions.begin();
       siter != osd_sessions.end(); ++siter) {
    OSDSession *s = siter->second; 
    shared_lock sl(s->lock);
    for (auto op_i = s->ops.begin();
         op_i != s->ops.end(); ++op_i) {
      if (op_i->second->target.flags & CEPH_OSD_FLAG_WRITE
          && (pool == -1 || op_i->second->target.target_oloc.pool == pool)) {
        to_cancel.push_back(op_i->first);
      }
    }
    sl.unlock();
    for (auto titer = to_cancel.begin(); titer != to_cancel.end(); ++titer) {
      int cancel_result = op_cancel(s, *titer, r);
      // We hold rwlock across search and cancellation, so cancels
      // should always succeed
      ceph_assert(cancel_result == 0);
    }
....
....

I tried a solution that unlocks and re-locks the rwlock and slock (session lock) before calling the callback, but that resulted in two failures.

I think that would require a non-trivial refactor in many places in objecter. Were you able to test out with queuing via a finisher thread?

@kotreshhr
Copy link
Contributor Author

kotreshhr commented Oct 23, 2024


I think that would require a non-trivial refactor in many places in objecter. Were you able to test out with queuing via a finisher thread?

It worked the first time. I will have to do a few more rounds of testing to gain confidence. I will update once that's done.

@kotreshhr
Copy link
Contributor Author

@vshankar I tested the approach of queuing the context callback using the finisher thread. It works! I didn't see any issues.

Here is the manual test done.
The fio is run on TERMINAL-1 as below. The client is blocklisted from TERMINAL-2 as below after fio started rw.
we can observe all the pending async io has received EIO and fio failed gracefully with no hang.
On TERMINAL-2, we can also see that the nfs ganesha is not killed and still running fine. Also observed, no thread is hanging in libcephfs in ganesha process.

TERMINAL-1

kotresh:build$ sudo mount -t nfs -o port=44349 192.168.1.114:/cephfs /home/kotresh/ganesh_mount1
kotresh:build$ sudo fio --ioengine=libaio -directory=/home/kotresh/ganesh_mount1 --filename=fio.randrw.test2 --name=job --bs=16k --direct=1  --group_reporting --iodepth=128  --randrepeat=0 --norandommap=1 --thread=2 --ramp_time=20s --offset_increment=5% --size=2G --time_based --runtime=60 --ramp_time=1s --percentage_random=0 --rw=randrw --rwmixread=50
job: (g=0): rw=randrw, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=128
fio-3.36
Starting 1 thread
job: Laying out IO file (1 file / 2048MiB)
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11862016, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11878400, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11862016, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11878400, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11894784, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11911168, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11927552, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11943936, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11960320, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11976704, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=11993088, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12009472, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12025856, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12042240, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12058624, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11894784, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11943936, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12025856, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11911168, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12075008, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12058624, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11960320, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12091392, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12042240, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12500992, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11927552, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12124160, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12140544, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12156928, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11993088, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12173312, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12189696, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12189696, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12173312, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12206080, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12222464, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12009472, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12288000, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12238848, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12255232, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12271616, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12353536, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12288000, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12369920, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12304384, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12320768, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12337152, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12353536, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12369920, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12386304, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12402688, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12419072, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12419072, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12500992, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12435456, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12107776, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12124160, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12451840, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12468224, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12484608, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12550144, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12107776, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=11976704, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12091392, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12599296, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12075008, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: read offset=12517376, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12140544, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12156928, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12206080, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12222464, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12238848, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12255232, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12271616, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12304384, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12337152, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12402688, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12386304, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12320768, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12451840, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12435456, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12468224, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12484608, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12533760, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12517376, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12566528, buflen=16384
fio: io_u error on file /home/kotresh/ganesh_mount1/fio.randrw.test2: Remote I/O error: write offset=12582912, buflen=16384
fio: pid=118763, err=121/file:io_u.c:1896, func=io_u error, error=Remote I/O error

job: (groupid=0, jobs=1): err=121 (file:io_u.c:1896, func=io_u error, error=Remote I/O error): pid=118763: Thu Oct 24 00:01:40 2024
  read: IOPS=55, BW=868KiB/s (889kB/s)(10.7MiB/12571msec)
    slat (nsec): min=1877, max=49754, avg=6970.58, stdev=5422.17
    clat (msec): min=178, max=1731, avg=899.77, stdev=234.73
     lat (msec): min=178, max=1731, avg=899.78, stdev=234.73
    clat percentiles (msec):
     |  1.00th=[  241],  5.00th=[  493], 10.00th=[  651], 20.00th=[  726],
     | 30.00th=[  768], 40.00th=[  835], 50.00th=[  894], 60.00th=[  969],
     | 70.00th=[ 1062], 80.00th=[ 1116], 90.00th=[ 1200], 95.00th=[ 1234],
     | 99.00th=[ 1301], 99.50th=[ 1334], 99.90th=[ 1737], 99.95th=[ 1737],
     | 99.99th=[ 1737]
   bw (  KiB/s): min=  608, max= 1696, per=100.00%, avg=870.52, stdev=225.83, samples=25
   iops        : min=   38, max=  106, avg=54.40, stdev=14.12, samples=25
  write: IOPS=53, BW=921KiB/s (944kB/s)(11.3MiB/12571msec); 0 zone resets
    slat (nsec): min=2127, max=38750, avg=7842.07, stdev=5627.30
    clat (msec): min=637, max=2702, avg=1445.63, stdev=428.49
     lat (msec): min=637, max=2702, avg=1445.64, stdev=428.49
    clat percentiles (msec):
     |  1.00th=[  818],  5.00th=[  927], 10.00th=[  969], 20.00th=[ 1083],
     | 30.00th=[ 1116], 40.00th=[ 1234], 50.00th=[ 1334], 60.00th=[ 1469],
     | 70.00th=[ 1636], 80.00th=[ 1905], 90.00th=[ 2072], 95.00th=[ 2232],
     | 99.00th=[ 2500], 99.50th=[ 2702], 99.90th=[ 2702], 99.95th=[ 2702],
     | 99.99th=[ 2702]
   bw (  KiB/s): min= 1920, max= 2052, per=100.00%, avg=1987.55, stdev=46.80, samples=11
   iops        : min=  120, max=  128, avg=124.18, stdev= 2.89, samples=11
  lat (msec)   : 250=0.51%, 500=2.27%, 750=11.57%, 1000=23.28%, 2000=56.44%
  lat (msec)   : >=2000=8.86%
  cpu          : usr=0.06%, sys=0.10%, ctx=1316, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=697,669,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=868KiB/s (889kB/s), 868KiB/s-868KiB/s (889kB/s-889kB/s), io=10.7MiB (11.2MB), run=12571-12571msec
  WRITE: bw=921KiB/s (944kB/s), 921KiB/s-921KiB/s (944kB/s-944kB/s), io=11.3MiB (11.9MB), run=12571-12571msec
kotresh:build$ 

TERMINAL 2
kotresh:build$ bin/ceph tell mds.a session ls
2024-10-24T00:01:00.960+0530 7fed8d0006c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-24T00:01:00.968+0530 7fed8d0006c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-24T00:01:00.979+0530 7feda2012b80 10 client.0 _ll_register_callbacks cb 0x5618c918cdd0 invalidate_ino_cb 0 invalidate_dentry_cb 0 switch_interrupt_cb 0 remount_cb 0
2024-10-24T00:01:00.980+0530 7fed634006c0 10 client.0 ms_handle_connect on v2:192.168.1.114:40349/0
2024-10-24T00:01:00.980+0530 7feda2012b80 10 client.4211 fetch_fsmap learned FSMap version 4
2024-10-24T00:01:00.980+0530 7feda2012b80 10 client.4211 fetch_fsmap finished waiting for FSMap version 4
2024-10-24T00:01:00.980+0530 7feda2012b80 10 client.4211 resolve_mds: resolved name 'a' to daemon mds.a
2024-10-24T00:01:00.980+0530 7feda2012b80 20 client.4211 populate_metadata read hostname 'li-4f30544c-234f-11b2-a85c-fca7e10b62e2.ibm.com'
2024-10-24T00:01:00.980+0530 7feda2012b80  4 client.4211 mds_command: new command op to 4131 tid=0 multi_id=0 [{"prefix": "get_command_descriptions"}]
2024-10-24T00:01:00.981+0530 7fed634006c0 10 client.4211 ms_handle_connect on v2:192.168.1.114:6810/4294866277
2024-10-24T00:01:00.981+0530 7fed634006c0 10 client.4211 handle_command_reply: tid=0
2024-10-24T00:01:00.982+0530 7feda2012b80  1 client.4211 shutdown
2024-10-24T00:01:00.982+0530 7fed634006c0  0 client.4211 ms_handle_reset on v2:192.168.1.114:6810/4294866277
2024-10-24T00:01:00.982+0530 7feda2012b80 20 client.4211 trim_cache size 0 max 16384
2024-10-24T00:01:00.998+0530 7feda2012b80 10 client.0 _ll_register_callbacks cb 0x7fed70055ea0 invalidate_ino_cb 0 invalidate_dentry_cb 0 switch_interrupt_cb 0 remount_cb 0
2024-10-24T00:01:00.999+0530 7fed634006c0 10 client.0 ms_handle_connect on v2:192.168.1.114:40349/0
2024-10-24T00:01:00.999+0530 7feda2012b80 10 client.4213 fetch_fsmap learned FSMap version 4
2024-10-24T00:01:00.999+0530 7feda2012b80 10 client.4213 fetch_fsmap finished waiting for FSMap version 4
2024-10-24T00:01:00.999+0530 7feda2012b80 10 client.4213 resolve_mds: resolved name 'a' to daemon mds.a
2024-10-24T00:01:00.999+0530 7feda2012b80 20 client.4213 populate_metadata read hostname 'li-4f30544c-234f-11b2-a85c-fca7e10b62e2.ibm.com'
2024-10-24T00:01:00.999+0530 7feda2012b80  4 client.4213 mds_command: new command op to 4131 tid=0 multi_id=0 [{"prefix": "session ls"}]
2024-10-24T00:01:01.000+0530 7fed634006c0 10 client.4213 ms_handle_connect on v2:192.168.1.114:6810/4294866277
2024-10-24T00:01:01.001+0530 7fed634006c0 10 client.4213 handle_command_reply: tid=0
2024-10-24T00:01:01.001+0530 7feda2012b80  1 client.4213 shutdown
2024-10-24T00:01:01.001+0530 7fed634006c0  0 client.4213 ms_handle_reset on v2:192.168.1.114:6810/4294866277
2024-10-24T00:01:01.001+0530 7feda2012b80 20 client.4213 trim_cache size 0 max 16384
[
    {
        "id": 4181,
        "entity": {
            "name": {
                "type": "client",
                "num": 4181
            },
            "addr": {
                "type": "any",
                "addr": "192.168.1.114:0",
                "nonce": 498969874
            }
        },
        "state": "open",
        "num_leases": 0,
        "num_caps": 1,
        "request_load_avg": 0,
        "uptime": 106.312054258,
        "requests_in_flight": 0,
        "num_completed_requests": 0,
        "num_completed_flushes": 0,
        "reconnecting": false,
        "recall_caps": {
            "value": 0,
            "halflife": 60
        },
        "release_caps": {
            "value": 0,
            "halflife": 60
        },
        "recall_caps_throttle": {
            "value": 0,
            "halflife": 1.5
        },
        "recall_caps_throttle2o": {
            "value": 0,
            "halflife": 0.5
        },
        "session_cache_liveness": {
            "value": 0.78223092939475258,
            "halflife": 300
        },
        "cap_acquisition": {
            "value": 0,
            "halflife": 30
        },
        "last_trim_completed_requests_tid": 2,
        "last_trim_completed_flushes_tid": 0,
        "delegated_inos": [],
        "inst": "client.4181 192.168.1.114:0/498969874",
        "completed_requests": [],
        "prealloc_inos": [],
        "client_metadata": {
            "client_features": {
                "feature_bits": "0x00000000003fffff"
            },
            "metric_spec": {
                "metric_flags": {
                    "feature_bits": "0x000000000000ffff"
                }
            },
            "ceph_sha1": "no_version",
            "ceph_version": "ceph version Development (no_version) squid (dev)",
            "entity_id": "nfs.vstart.a.0558fddd",
            "hostname": "li-4f30544c-234f-11b2-a85c-fca7e10b62e2.ibm.com",
            "pid": "117937",
            "root": "/",
            "timeout": "300",
            "uuid": "ganesha-a-0001"
        }
    }
]
kotresh:build$ ceph osd blocklist ls
2024-10-24T00:01:07.997+0530 7f3d168006c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-24T00:01:08.015+0530 7f3d168006c0 -1 WARNING: all dangerous and experimental features are enabled.
listed 0 entries
kotresh:build$ ceph osd blocklist add 192.168.1.114:0/498969874
2024-10-24T00:01:39.493+0530 7f5ee6c006c0 -1 WARNING: all dangerous and experimental features are enabled.
2024-10-24T00:01:39.500+0530 7f5ee6c006c0 -1 WARNING: all dangerous and experimental features are enabled.
blocklisting 192.168.1.114:0/498969874 until 2024-10-24T01:01:39.619887+0530 (3600 sec)

kotresh:build$ ps -ef | grep ganesha.nfsd
root      117937    3824  3 Oct23 ?        00:00:05 ganesha.nfsd -L /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/build/out/ganesha-a.log -f /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/build/dev/ganesha.a/ganesha-a.conf -p /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/build/out/ganesha-a.pid -N NIV_EVENT
kotresh   118852   37197  0 00:01 pts/4    00:00:00 grep --color=auto ganesha.nfsd

$ sudo gdb -ex 'set follow-fork-mode child' -ex 'set logging file gdb.now.txt' -ex 'set logging enabled yes' -p 117937

kotresh:build$ grep Client gdb.now.txt 
#4  0x00007fefe048fe67 in operator() (__closure=0x17c1d328) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/client/Client.cc:7149
#5  0x00007fefe048ff15 in std::__invoke_impl<void, Client::start_tick_thread()::<lambda()> > (__f=...) at /usr/include/c++/14/bits/invoke.h:60
#6  std::__invoke<Client::start_tick_thread()::<lambda()> > (__fn=...) at /usr/include/c++/14/bits/invoke.h:96
#7  std::thread::_Invoker<std::tuple<Client::start_tick_thread()::<lambda()> > >::_M_invoke<0> (this=<optimized out>) at /usr/include/c++/14/bits/std_thread.h:301
#8  std::thread::_Invoker<std::tuple<Client::start_tick_thread()::<lambda()> > >::operator() (this=<optimized out>) at /usr/include/c++/14/bits/std_thread.h:308
#9  std::thread::_State_impl<std::thread::_Invoker<std::tuple<Client::start_tick_thread()::<lambda()> > > >::_M_run(void) (this=<optimized out>) at /usr/include/c++/14/bits/std_thread.h:253
kotresh:build$ grep Object gdb.now.txt 
#4  0x00007fefe052b1f3 in ObjectCacher::flusher_entry (this=0x17b8e4c0) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/ObjectCacher.cc:2005
#5  0x00007fefe053cf43 in ObjectCacher::FlusherThread::entry (this=<optimized out>) at /home/kotresh/sandbox/upstream/kotresh-ceph2/ceph/src/osdc/ObjectCacher.h:439

@kotreshhr
Copy link
Contributor Author

While testing fio with nfs-ganesha, I hit an OSD crash. I have reported it here: https://tracker.ceph.com/issues/68688?next_issue_id=68683
I think it is not related to this particular fix.

@vshankar vshankar requested a review from batrick October 28, 2024 02:32
vshankar added a commit to vshankar/ceph that referenced this pull request Nov 20, 2024
* refs/pull/60411/head:
	qa: Add async io test to nfs ganesha using fio
	client: Fix a deadlock when osd is full

Reviewed-by: Venky Shankar <[email protected]>
Reviewed-by: Patrick Donnelly <[email protected]>
Reviewed-by: Dhairya Parmar <[email protected]>
@vshankar vshankar self-requested a review November 26, 2024 13:45
@vshankar
Copy link
Contributor

(running this through tests again)

https://pulpito.ceph.com/vshankar-2024-11-28_10:25:29-fs:nfs-wip-vshankar-testing-20241118.055430-debug-testing-default-smithi/

duh! all 10 jobs failed with

Test failure: test_async_io_fio (tasks.cephfs.test_nfs.TestNFS)

@kotreshhr o_O

@kotreshhr could be an offending PR in https://tracker.ceph.com/issues/68968

@vshankar This run doesn't contain this fix (PR #60411)?

It does. The tracker update (https://tracker.ceph.com/issues/68968#note-1) was just for my tracking so that I don't accidentally merge the change. The test branch still has the change. See: https://github.com/vshankar/ceph/commits/wip-vshankar-testing-20241118.055430-debug/

@vshankar The failures are in reads, as seen below. Could this be because of #60507? This fix should not affect reads since it is in the write path.

2024-11-28T10:51:11.894 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=49152, buflen=16384
2024-11-28T10:51:12.000 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=114688, buflen=16384
2024-11-28T10:51:12.000 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=16384, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=196608, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=98304, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=32768, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=163840, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=0, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=131072, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=147456, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=360448, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=442368, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=376832, buflen=16384
2024-11-28T10:51:12.001 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=311296, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=65536, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=262144, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=294912, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=475136, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=327680, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=458752, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=524288, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=425984, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=81920, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=409600, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=507904, buflen=16384
2024-11-28T10:51:12.002 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=491520, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=180224, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=245760, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=229376, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=212992, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=278528, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=393216, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=344064, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=573440, buflen=16384
2024-11-28T10:51:12.003 INFO:teuthology.orchestra.run.smithi176.stderr:fio: io_u error on file /mnt/fio/fio.randrw.test: Input/output error: read offset=655360, buflen=16384
:

@vshankar
Copy link
Contributor

It's possible, yes. I'll have to double check though.

vshankar added a commit to vshankar/ceph that referenced this pull request Nov 29, 2024
* refs/pull/60411/head:
	qa: Add async io test to nfs ganesha using fio
	client: Fix a deadlock when osd is full

Reviewed-by: Dhairya Parmar <[email protected]>
Reviewed-by: Venky Shankar <[email protected]>
Reviewed-by: Patrick Donnelly <[email protected]>
@vshankar
Copy link
Contributor

Still seeing failures without #60507. See: https://pulpito.ceph.com/vshankar-2024-11-29_15:07:04-fs-wip-vshankar-testing-20241129.105255-debug-testing-default-smithi/8014609/

QA tracker: https://tracker.ceph.com/issues/68968

Unfortunately, I didn't get time to debug it.

Problem:
 When the osd is full, the client receives the notification
and cancels the ongoing writes. If the ongoing writes are
async, this can deadlock, because the registered async
callback also takes 'client_lock', which handle_osd_map
takes at the beginning.

 op_cancel_writes invokes the callback registered for the
async write synchronously while holding 'client_lock',
causing the deadlock.

Earlier approach:
  An attempt was made to solve this by calling 'op_cancel_writes'
without holding 'client_lock'. But this broke the lock ordering
between the objecter's 'rwlock' and the async write callback
taking 'client_lock': 'client_lock' must always be taken before
'rwlock'. So that approach was dropped in favour of the current
one.

Solution:
 Use C_OnFinisher for the objecter async write callback, i.e.,
wrap the async write's callback with the Finisher. This queues
the callback onto the Finisher's context queue, which the
finisher thread picks up and executes, thus avoiding the
deadlock.

Testing:
 The fix is tested in the vstart cluster with the following reproducer.
 1. Mount the cephfs volume using nfs-ganesha at /mnt
 2. Run fio on /mnt in one terminal
 3. In the other terminal, blocklist the nfs client session
 4. The fio would hang

 The issue reproduces in the vstart cluster most of the time; I
think that's because it's slow. The same test written for teuthology
does not reproduce the issue. The test requires one or more writes
to be ongoing in rados when the client is blocklisted for the
deadlock to be hit.

Stripped down version of Traceback:
----------
0  0x00007f4d77274960 in __lll_lock_wait ()
1  0x00007f4d7727aff2 in pthread_mutex_lock@@GLIBC_2.2.5 ()
2  0x00007f4d7491b0a1 in __gthread_mutex_lock (__mutex=0x7f4d200f99b0)
3  std::mutex::lock (this=<optimized out>)
4  std::scoped_lock<std::mutex>::scoped_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
5  Client::C_Lock_Client_Finisher::finish (this=0x7f4ca0103550, r=-28)
6  0x00007f4d74888dfd in Context::complete (this=0x7f4ca0103550, r=<optimized out>)
7  0x00007f4d7498850c in std::__do_visit<...>(...) (__visitor=...)
8  std::visit<Objecter::Op::complete(...) (__visitor=...)
9  Objecter::Op::complete(...) (e=..., e=..., r=-28, ec=..., f=...)
10 Objecter::Op::complete (e=..., r=-28, ec=..., this=0x7f4ca022c7f0)
11 Objecter::op_cancel (this=0x7f4d200fab20, s=<optimized out>, tid=<optimized out>, r=-28)
12 0x00007f4d7498ea12 in Objecter::op_cancel_writes (this=0x7f4d200fab20, r=-28, pool=103)
13 0x00007f4d748e1c8e in Client::_handle_full_flag (this=0x7f4d200f9830, pool=103)
14 0x00007f4d748ed20c in Client::handle_osd_map (m=..., this=0x7f4d200f9830)
15 Client::ms_dispatch2 (this=0x7f4d200f9830, m=...)
16 0x00007f4d75b8add2 in Messenger::ms_deliver_dispatch (m=..., this=0x7f4d200ed3e0)
17 DispatchQueue::entry (this=0x7f4d200ed6f0)
18 0x00007f4d75c27fa1 in DispatchQueue::DispatchThread::entry (this=<optimized out>)
19 0x00007f4d77277c02 in start_thread ()
20 0x00007f4d772fcc40 in clone3 ()
--------

Fixes: https://tracker.ceph.com/issues/68641
Signed-off-by: Kotresh HR <[email protected]>
@kotreshhr kotreshhr force-pushed the aio-nfsganesha-osd-full-hang branch from 9b1e196 to 60c5801 Compare December 19, 2024 06:16
@kotreshhr
Copy link
Contributor Author

The fio test is carved out as a separate PR, #61143, to run against main and validate whether the issue mentioned in #60411 (comment) exists in main as well.

@kotreshhr
Copy link
Contributor Author

jenkins test make check

@vshankar
Copy link
Contributor

vshankar commented Dec 20, 2024


@vshankar
Copy link
Contributor

@kotreshhr I spoke too soon -- similar failure without this change. See: https://pulpito.ceph.com/vshankar-2024-12-19_10:00:45-fs-wip-vshankar-testing-20241219.063429-debug-testing-default-smithi/8044375/

ok, then there is a bug in the main branch?

yes. I'll create a redmine tracker for the failure.

This fix can be merged then.

@kotreshhr
Copy link
Contributor Author

@vshankar The qa test PR #61143 is merged. And you mentioned the issue exists (#60411 (comment)) without this PR. How do we proceed here?

@vshankar
Copy link
Contributor


I'm running this through a sub-suite and will merge. Should be done tomorrow.


@vshankar
Copy link
Contributor

@kotreshhr I could have merged it without running it through QA again, but I included it just so that the change gets tested once more.

I don't expect anything to fail though (apart from the failure we already have).

Copy link
Contributor

@vshankar vshankar left a comment

