Skip to content

Fix the spinlock from #4377 and probably also fix #4448#4449

Merged
djspiewak merged 2 commits intotypelevel:series/3.6.xfrom
durban:fixSpinlock
Jul 22, 2025
Merged

Fix the spinlock from #4377 and probably also fix #4448#4449
djspiewak merged 2 commits intotypelevel:series/3.6.xfrom
durban:fixSpinlock

Conversation

@durban
Copy link
Copy Markdown
Contributor

@durban durban commented Jul 21, 2025

So, #4377 added a spinlock for interrupting a sleeping/polling WorkerThread. That spinlock is incorrect, i.e., a worker thread doesn't necessarily waits (spins) until the spinlock is released by WSTP#notifyParked. The incorrect scenario is when the worker thread observes parked being "parked", tries to CAS it to "unparked", and the CAS fails. The current implementation in this case simply goes on, as if the lock is free. While in fact it may not be free; it's likely that the CAS failed exactly because the spinlock was acquired.

This PR fixes the spinlock, so in this case the worker thread re-reads parked, and correctly waits for the lock to be free.

Now, I think this is the reason for the bug #4448. I think what happens there is that the worker thread goes on while the spinlock is locked by WSTP#notifyParked; the worker thread goes to sleep (parked becomes "parked"); WSTP#notifyParked "releases" the lock it thinks it still holds, thus parked changes state directly from "parked" -> "unparked", without a corresponding doneSleeping (in normal (correct) operation, a doneSleeping call is always made when parked moves out from "parked"). Thus, a doneSleeping call is, in effect "missing". The state 0xfffeffff1 observed in #4448 is exactly the missing doneSleeping call (i.e., -DeltaSearching).

That's why I believe this should fix #4448. I don't know how to write a test for that (the only way I could observe the issue locally is by putting Thread.yields in various places in the WSTP). However, I believe the occasional MutexSpec failures were exactly because of this problem.

Footnotes

  1. 0xfffeffff is the (65534, 65535) = (ActiveThreadCount, SearchingThreadCount) reported in Poisoned pod, extremely slow, weird thread counts #4448.

@durban
Copy link
Copy Markdown
Contributor Author

durban commented Jul 21, 2025

Btw, in #4377 I've asked why the spinlock is necessary (I dislike spinlocks ;-). @djspiewak said (more or less), that due to doneSleeping. I think this issue proves, that @djspiewak was right, since the spinlock being incorrect caused a bug with doneSleeping. (Although, I still think that it would be possible to avoid the spinlock by always carefully CASing parked, let's fix this first, then possibly later we might figure out how exactly to remove the lock :-)

@djspiewak
Copy link
Copy Markdown
Member

I've asked why the spinlock is necessary (I dislike spinlocks ;-)

What did spinlocks ever do to you?! :P

More seriously, the async bounded and unbounded queue implementations are a pretty fun one to look at from this standpoint. They're lock free MPMC structures, and what's interesting is they require a spinlock in a very very specific circumstance. It's somewhat similar to the parked case, where you have two bits of atomic state and sometimes you can detect them in a mutually inconsistent state, but you know that when that happens, you're just seeing the state a few cycles before the other CPU publishes its results, so you just have to chill out for a second and wait for the memory bus to do its thing. The only alternative would be to have a hard thread lock, which effectively does something very similar but indirectly in the kernel, so nothing is really gained.

djspiewak
djspiewak previously approved these changes Jul 21, 2025
Copy link
Copy Markdown
Member

@djspiewak djspiewak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is a really nice catch. I'm not sure why I didn't think concurrent interruption was possible but obviously that's an issue.

@durban
Copy link
Copy Markdown
Contributor Author

durban commented Jul 21, 2025

They're lock free MPMC structures, and what's interesting is they require a spinlock

If they require a spinlock, they're not lock free :P

@durban durban marked this pull request as ready for review July 21, 2025 20:52
@djspiewak
Copy link
Copy Markdown
Member

If they require a spinlock, they're not lock free :P

Mostly lock-free! :D But in fairness, I think that's pretty much everything. There are no truly lock-free concurrent systems, only systems which marginalize locks to extremely rare cases involving composite atomics.

@durban
Copy link
Copy Markdown
Contributor Author

durban commented Jul 21, 2025

I mean... a MPMC queue can definitely be lock free. The one using the spinlock is possibly faster, but there is no theoretical or even practical obstacle to making it lock free (i.e., using a spinlock in rare cases will probably make it faster than using an algorithm that never uses a lock... but that algorithm still exists, and could be used...).

@djspiewak djspiewak merged commit 624af32 into typelevel:series/3.6.x Jul 22, 2025
34 of 36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants