Fix the spinlock from #4377 and probably also fix #4448#4449
Fix the spinlock from #4377 and probably also fix #4448#4449djspiewak merged 2 commits intotypelevel:series/3.6.xfrom
Conversation
|
Btw, in #4377 I've asked why the spinlock is necessary (I dislike spinlocks ;-). @djspiewak said (more or less), that due to |
What did spinlocks ever do to you?! :P More seriously, the async bounded and unbounded queue implementations are a pretty fun one to look at from this standpoint. They're lock free MPMC structures, and what's interesting is they require a spinlock in a very very specific circumstance. It's somewhat similar to the |
djspiewak
left a comment
There was a problem hiding this comment.
Oh this is a really nice catch. I'm not sure why I didn't think concurrent interruption was possible but obviously that's an issue.
If they require a spinlock, they're not lock free :P |
Mostly lock-free! :D But in fairness, I think that's pretty much everything. There are no truly lock-free concurrent systems, only systems which marginalize locks to extremely rare cases involving composite atomics. |
|
I mean... a MPMC queue can definitely be lock free. The one using the spinlock is possibly faster, but there is no theoretical or even practical obstacle to making it lock free (i.e., using a spinlock in rare cases will probably make it faster than using an algorithm that never uses a lock... but that algorithm still exists, and could be used...). |
Co-authored-by: Mareks Rampāns <[email protected]>
So, #4377 added a spinlock for interrupting a sleeping/polling WorkerThread. That spinlock is incorrect, i.e., a worker thread doesn't necessarily waits (spins) until the spinlock is released by WSTP#notifyParked. The incorrect scenario is when the worker thread observes
parkedbeing "parked", tries to CAS it to "unparked", and the CAS fails. The current implementation in this case simply goes on, as if the lock is free. While in fact it may not be free; it's likely that the CAS failed exactly because the spinlock was acquired.This PR fixes the spinlock, so in this case the worker thread re-reads
parked, and correctly waits for the lock to be free.Now, I think this is the reason for the bug #4448. I think what happens there is that the worker thread goes on while the spinlock is locked by WSTP#notifyParked; the worker thread goes to sleep (
parkedbecomes "parked"); WSTP#notifyParked "releases" the lock it thinks it still holds, thusparkedchanges state directly from "parked" -> "unparked", without a corresponding doneSleeping (in normal (correct) operation, a doneSleeping call is always made whenparkedmoves out from "parked"). Thus, adoneSleepingcall is, in effect "missing". Thestate0xfffeffff1 observed in #4448 is exactly the missing doneSleeping call (i.e.,-DeltaSearching).That's why I believe this should fix #4448. I don't know how to write a test for that (the only way I could observe the issue locally is by putting
Thread.yields in various places in the WSTP). However, I believe the occasionalMutexSpecfailures were exactly because of this problem.Footnotes
0xfffeffff is the (65534, 65535) = (ActiveThreadCount, SearchingThreadCount) reported in Poisoned pod, extremely slow, weird thread counts #4448. ↩