Skip to content

Conversation

@hebasto
Copy link
Member

@hebasto hebasto commented Jun 20, 2020

On master (e3fa3c7) if the push_lock() throws the "potential deadlock detected" exception (via the potential_deadlock_detected() call), the LockData instance internal state differs from one when the push_lock() was called. This non-well behaviour makes (at least) testing brittle.

This PR preserves the LockData instance initial state if push_lock() throws an exception, and improves the sync_tests unit test.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 20, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Contributor

@vasild vasild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if the offending entry is still in the lock stack by the time potential_deadlock_detected() is called because it prints the stack and we want to see the offender in the debug log.

Also, I think that it would be best to undo all mods before abort()/throw. We also inserted into lockdata.lockorders and in lockdata.invlockorders. Maybe something like:

diff --git i/src/sync.cpp w/src/sync.cpp
index 3098e402c..b21154d6e 100644
--- i/src/sync.cpp
+++ w/src/sync.cpp
@@ -124,17 +124,12 @@ static void potential_deadlock_detected(const LockPair& mismatch, const LockStac
         }
         if (i.first == mismatch.second) {
             LogPrintf(" (2)"); /* Continued */
         }
         LogPrintf(" %s\n", i.second.ToString());
     }
-    if (g_debug_lockorder_abort) {
-        tfm::format(std::cerr, "Assertion failed: detected inconsistent lock order at %s:%i, details in debug log.\n", __FILE__, __LINE__);
-        abort();
-    }
-    throw std::logic_error("potential deadlock detected");
 }
 
 static void double_lock_detected(const void* mutex, LockStack& lock_stack)
 {
     LogPrintf("DOUBLE LOCK DETECTED\n");
     LogPrintf("Lock order:\n");
@@ -186,14 +181,28 @@ static void push_lock(MutexType* c, const CLockLocation& locklocation)
         if (lockdata.lockorders.count(p1))
             continue;
         lockdata.lockorders.emplace(p1, lock_stack);
 
         const LockPair p2 = std::make_pair(c, i.first);
         lockdata.invlockorders.insert(p2);
-        if (lockdata.lockorders.count(p2))
+        if (lockdata.lockorders.count(p2)) {
             potential_deadlock_detected(p1, lockdata.lockorders[p2], lockdata.lockorders[p1]);
+
+            lockdata.invlockorders.erase(p2);
+            lockdata.lockorders.erase(p1);
+            lock_stack.pop_back();
+
+            if (g_debug_lockorder_abort) {
+                tfm::format(std::cerr,
+                    "Assertion failed: detected inconsistent lock order at %s:%i, "
+                    "details in debug log.\n",
+                    __FILE__, __LINE__);
+                abort();
+            }
+            throw std::logic_error("potential deadlock detected");
+        }
     }
 }
 
 static void pop_lock()
 {
     LockData& lockdata = GetLockData();

src/sync.cpp Outdated
tfm::format(std::cerr, "Assertion failed: detected inconsistent lock order at %s:%i, details in debug log.\n", __FILE__, __LINE__);
abort();
}
throw std::logic_error("potential deadlock detected");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to unconditionally remove the entry from the stack here, after printing the stack and before abort()/throw. This cleanup is warranted from all code paths that call EnterCritical().

Suggested change
throw std::logic_error("potential deadlock detected");
// Undo the insertion from push_lock(). We will not lock the mutex.
s2.pop_back();
throw std::logic_error("potential deadlock detected");

(this should be before the if, 4 lines above, but github does not allow me to put the suggestion there)

@hebasto
Copy link
Member Author

hebasto commented Jun 22, 2020

@vasild

It would be good if the offending entry is still in the lock stack by the time potential_deadlock_detected() is called because it prints the stack and we want to see the offender in the debug log.

It prints from LockOrders instances, not from the lock stack of the current thread, no?

@vasild
Copy link
Contributor

vasild commented Jun 22, 2020

You are right! I did not realize that the LockStack objects are copied.

Anyway - I think, for EnterCritical() to be a good citizen, if it throws, it should leave its internal structures in the state they were before it was called.

@hebasto
Copy link
Member Author

hebasto commented Jun 22, 2020

Updated c810cd4 -> 236e760 (pr19340.01 -> pr19340.02, diff):

operator[] has a side effect that it would insert a new element in m_lock_stacks if the thread that called it does not own any locks.

@vasild

Anyway - I think, for EnterCritical() to be a good citizen, if it throws, it should leave its internal structures in the state they were before it was called.

I'd leave it as is since it seems unrelated to this PR goal.

@hebasto
Copy link
Member Author

hebasto commented Jun 22, 2020

Reworked after the discussion with @vasild on IRC.

Going to update the OP soon. The OP has been updated.

@hebasto hebasto changed the title Fix lock stack after "potential deadlock detected" exception Preserve the LockData initial state if "potential deadlock detected" exception thrown Jun 22, 2020
Copy link
Contributor

@vasild vasild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK f88222a

@JeremyRubin
Copy link
Contributor

I'm not sure I fully get what's going on here, but the change looks fine to me. Can you explain how making the copy of the lockstack actually helps? It looks like a race condition to even copy it?

@vasild
Copy link
Contributor

vasild commented Jul 10, 2020

@JeremyRubin thanks for asking! The key here is the added lock_stack.pop_back(), not the copying. The copying is needed because without it the prints from potential_deadlock_detected() would not be complete (because we already removed one element). There is no race because that code is operating under lockdata.dd_mutex.

Let me elaborate with an example:

  • Thread t1 acquires lockA
  • Thread t2 acquires lockB
  • Thread t1 tries to acquire lockB
    • EnterCritical() adds lockB to the lock stack of t1 and succeeds
    • The lock stacks are: t1: lockA, lockB, t2: lockB
    • Now t1 hangs inside the OS mutex lock() function, waiting for lockB to be released.
  • Thread t2 tries to acquire lockA
    • EnterCritical() adds lockA to the lock stack of t2 and it is really upset because some threads first (try to) acquire lockA and then lockB and some other threads try in reverse order. It is going to either abort() or throw, lets assume it is going to throw.
    • The lock stacks are: t1: lockA, lockB, t2: lockB, lockA
    • EnterCritical() prints both lock stacks and throws.
    • The caller (in t2) catches the exception and continues executing (this is only happening from tests).

Notice that at the end t2's lock stack is lockB, lockA but it actually only owns lockB -- this is the problem being fixed by this PR. This only matters in tests which catch the exception and continue executing. The other code ends up with abort() and the whole program terminates immediately.

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK f88222a

just some questions. Feel free to ignore.


const LockPair p1 = std::make_pair(i.first, c);
const LockPair p2 = std::make_pair(c, i.first);
if (lockdata.lockorders.count(p2)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is now being done before if (lockdata.lockorders.count(p1))continue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hebasto
Copy link
Member Author

hebasto commented Aug 2, 2020

Updated f88222a -> cab80f4 (pr19340.03 -> pr19340.04, diff):

Any reason this is now being done before if (lockdata.lockorders.count(p1))continue?

Could remove the confusing static's here?


Rebased cab80f4 -> 63e9e40 (pr19340.04 -> pr19340.05) to fix Travis "lint" job errors.

Copy link
Contributor

@vasild vasild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 63e9e40

@maflcko
Copy link
Member

maflcko commented Aug 4, 2020

re-ACK 63e9e40

@maflcko maflcko merged commit 0f16212 into bitcoin:master Aug 4, 2020
@hebasto hebasto deleted the 200620-stack branch August 4, 2020 15:11
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Aug 4, 2020
…ial deadlock detected" exception thrown

63e9e40 test: Add LockStackEmpty() (Hennadii Stepanov)
42b2a95 test: Repeat deadlock tests (Hennadii Stepanov)
1f96be2 Preserve initial state if push_lock() throws exception (Hennadii Stepanov)

Pull request description:

  On master (e3fa3c7) if the `push_lock()` throws the "potential deadlock detected" exception (via the `potential_deadlock_detected()` call), the `LockData` instance internal state differs from one when the `push_lock()` was called. This non-well behaviour makes (at least) testing brittle.

  This PR preserves the `LockData` instance initial state if `push_lock()` throws an exception, and improves the `sync_tests` unit test.

ACKs for top commit:
  MarcoFalke:
    re-ACK 63e9e40
  vasild:
    ACK 63e9e40

Tree-SHA512: 7679182154ce5f079b44b790faf76eb5f553328dea70a326ff6b600db70e2f9ae015a33a104ca070cb660318280cb79b6b42e37ea5166f26f9e627ba721fcdec
deadalnix pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 7, 2021
…exception thrown

Summary:
> Preserve initial state if push_lock() throws exception

> test: Repeat deadlock tests

> test: Add LockStackEmpty()

This is a backport of [[bitcoin/bitcoin#19340 | core#19340]]

Test Plan: With TSAN: `ninja check check-functional`

Reviewers: #bitcoin_abc, majcosta

Reviewed By: #bitcoin_abc, majcosta

Differential Revision: https://reviews.bitcoinabc.org/D10059
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Oct 12, 2021
pravblockc pushed a commit to pravblockc/dash that referenced this pull request Nov 18, 2021
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants