Dynamically bypass selector polling if no I/O events are present (#4377)

djspiewak merged 15 commits into typelevel:series/3.6.x
Released as …
I'd like to see the real-world numbers on this one, but the microbenchmarks seem to give their blessing insofar as they go. (Before/after benchmark tables omitted.) Note that the global EC is 4% slower in this run, but …
```scala
val polling = st eq ParkedSignal.ParkedPolling
val simple = st eq ParkedSignal.ParkedSimple

if ((polling || simple) && signal.compareAndSet(st, ParkedSignal.Interrupting)) {
```
The semantics here are different from the old implementation: getAndSet involves a loop, but there is no longer a loop here.
In the (admittedly unlikely) scenario that a worker thread transitions from ParkedPolling to ParkedSimple or vice versa while this condition is being evaluated, we would prematurely give up on attempting to wake up this thread. We might consider looping if the compareAndSet fails.
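For illustration, a looping variant along these lines might look as follows. This is a hypothetical sketch against a plain `AtomicReference`: the `ParkedSignal` values mirror the diff, but `tryInterrupt` and everything around it are invented for the example, not the actual pool code.

```scala
import java.util.concurrent.atomic.AtomicReference

sealed trait ParkedSignal
object ParkedSignal {
  case object Unparked extends ParkedSignal
  case object ParkedPolling extends ParkedSignal
  case object ParkedSimple extends ParkedSignal
  case object Interrupting extends ParkedSignal
}

// Keep retrying while the target is in *some* parked state; give up only once
// we observe it awake. This closes the ParkedPolling <-> ParkedSimple window
// at the cost of an extra branch on the (rare) CAS failure.
def tryInterrupt(signal: AtomicReference[ParkedSignal]): Boolean = {
  var st = signal.get()
  while ((st eq ParkedSignal.ParkedPolling) || (st eq ParkedSignal.ParkedSimple)) {
    if (signal.compareAndSet(st, ParkedSignal.Interrupting))
      return true
    st = signal.get() // lost a race; re-read and retry only if still parked
  }
  false
}
```

The loop terminates as soon as the thread is observed in any non-parked state, so it cannot spin indefinitely against a worker that has genuinely woken up.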
I think that might not be worth it. Going from ParkedPolling to ParkedSimple would require that, between the get() above and the CAS, the other thread wake itself up (or be awakened by another notify), run through its whole state machine, pick up a fiber, execute that fiber through the whole machinery, run out of work, go through the whole "check all the things for work" step (which includes doing the thing that we're trying to notify it to do here!), and then finally go back and park again, all in the space of just these few lines.
Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one. The work item we're trying to notify on would actually get picked up (before the target thread resuspends), and we'll ultimately wake the same number of threads.
Compare this to having an actual CAS loop, which would require more complexity here and, I think, an extra conditional jump, despite the fact that it's effectively never going to be hit and wouldn't matter even if it did. So I think I'd rather just eat the CAS failure.
Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one.
Sure ... I am just worried about relying on the existence of another thread. For example, if there is exactly one worker thread in the runtime, could we end up in a deadlock?
I have thought through it a few times and I have not come up with a scenario yet that could deadlock, but I don't feel confident.
If there's only one worker then that worker is either awake already (in which case we can't have entered this conditional) or it's parked and we're notifying it (in which case all the normal rules apply).
or it's parked and we're notifying it (in which case all the normal rules apply).
Right, so if all the normal rules are applying ...
Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one.
What happens if we fail the CAS due to the race condition, we don't loop, and then we don't go to the next thread because there is no next thread? We've given up on our one-and-only thread.
Your words precisely capture my feelings about this:
Now, I think you're probably right that nothing is lost here since the thread will both check the external queue and attempt to steal before going back to sleep, but it's hard for me to 100% convince myself that this is safe in all circumstances. This is very similar to the multi-consumer queue cases, which are genuinely insanely subtle and we've been bitten quite a bit with lost notifications. Speaking more generally, we tend to always bias notifications in favor of over-notifying rather than under-notifying, since the worst case in the former is a bit less performance, while the worst case in the latter is a deadlock.
Co-authored-by: Arman Bilge <[email protected]>
So my interpretation of the results from #4328 is that this is indeed an improvement, and we likely have some more work to do on …
tests/jvm/src/test/scala/cats/effect/IOPlatformSpecification.scala
I'm probably missing something, but do we really need the ParkedSignal.Interrupting state? What does the spinlock actually protect? (Spurious wakeups already must be handled.)

(Btw, I've restarted the "windows-latest, 2.13.15, temurin@17, ciJVM" job, because it timed out. But I'm wondering if it was a real failure.)
armanbilge left a comment:
I'm probably missing something, but do we really need the ParkedSignal.Interrupting state? What does the spinlock actually protect? (Spurious wakeups already must be handled.)
@durban My understanding is this protects us against the case where the interruptor reads that the worker thread is in the "polling" or "simple" state, but by the time it invokes the appropriate interrupt action, the worker thread is parked in a different state (and thus the interrupt is ineffective).
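That check-then-act window can be shown with a toy model. Everything here is invented for the illustration; the point is only that the wakeup mechanism depends on *how* the worker parked, so acting on a stale read dispatches the wrong mechanism:

```scala
import java.util.concurrent.atomic.AtomicReference

// Toy model of the two parking modes. In the real pool, Polling would be
// woken via the poller interrupt and Simple via LockSupport.unpark.
sealed trait Mode
case object Polling extends Mode
case object Simple extends Mode

val state = new AtomicReference[Mode](Polling)

val observed = state.get() // interruptor reads "parked polling"
state.set(Simple)          // meanwhile the worker reparks the simple way

// The interruptor would now invoke the poller interrupt on a thread that is
// actually in a plain park, which does nothing to wake it.
val staleDispatch = (observed eq Polling) && (state.get() eq Simple)
```

Pinning the state to `Interrupting` with a CAS before dispatching closes this window: the worker cannot transition modes while the interrupt is in flight.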
```scala
// just create a bit of chaos with timers and async completion
val sleeps = 0.until(10).map(i => IO.sleep((i * 10).millis)).toList

val latch = IO.deferred[Unit].flatMap(d => d.complete(()).start *> d.get)
val latches = 0.until(10).map(_ => latch).toList

val test = (sleeps ::: latches).parSequence.parReplicateA_(100)
```
I think we need a bit more chaos: specifically, exercising the external queue somehow.
Also, I wonder if we should have a version of this test where the runtime has exactly one worker; that's how we've caught various bugs before.
I'm not averse to running a version of the test with only one worker. I'll try tickling the external queue first.
Okay, so the way to wake someone up is (without this PR) essentially:

```scala
if (signal.getAndSet(false)) system.interrupt(...)
```

With the 2 different parking states it could be:

```scala
val old = signal.getAndSet(Unparked)
// HERE
if (old eq ParkedPolling) system.interrupt(...)
else if (old eq ParkedSimple) LockSupport.unpark(...)
```

And yes, it could happen that at the point marked with `// HERE` …

(Again, I might be missing something, I just started thinking about this...)
```scala
} else {
  // Park the thread until further notice.
  val start = System.nanoTime()
  metrics.incrementPolledCount()
```
Shouldn't this metric be incremented only if needPoll == true?
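That guard, as a minimal sketch (the `Metrics` class and `recordPark` helper here are stand-ins invented for the example, not the pool's real metrics type):

```scala
// Stand-in metrics holder for the sketch.
final class Metrics {
  private var polled = 0
  def incrementPolledCount(): Unit = polled += 1
  def polledCount: Int = polled
}

// Only record a "polled" event when the park actually went through the
// poller, i.e. when needPoll was true.
def recordPark(metrics: Metrics, needPoll: Boolean): Unit =
  if (needPoll) metrics.incrementPolledCount()
```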
@durban I think we need the Interrupting state for two reasons.

First, when the …

The second reason here is the situation @armanbilge outlined: the pool decides to awaken a thread which is parked polling, that thread wakes on its own, does its thing, goes all the way through its state machine, and then parks again without polling before the pool has the ability to wake it. In this situation, a notification ends up being "lost".

Now, I think you're probably right that nothing is lost here, since the thread will both check the external queue and attempt to steal before going back to sleep, but it's hard for me to 100% convince myself that this is safe in all circumstances. This is very similar to the multi-consumer queue cases, which are genuinely insanely subtle, and we've been bitten quite a bit with lost notifications. Speaking more generally, we tend to always bias notifications in favor of over-notifying rather than under-notifying, since the worst case in the former is a bit less performance, while the worst case in the latter is a deadlock.

I'm not saying we shouldn't try to reason through this, and if we can find a way to remove the …
I tried …

CPU Usage

(graph omitted) Although the graph shows the mean value as 20%, the startup inflates the number. The more realistic number is ~13% when the system is idling.

Events per worker

(graph omitted) The sum of all events (blocked, parked, polled) across all workers.

Ah, the poll number should be even lower, if I get it right: #4377 (comment).
@djspiewak Yeah, I haven't looked at …

Something else: https://github.com/typelevel/cats-effect/actions/runs/14459016996/job/40548044386?pr=4377#step:22:4843

This is the second time I've seen this branch hang in …
This does feel like something real, but it's definitely kinda nuts. It also surprises me that …
Force-pushed from 850d0b9 to 5149437.


With this change, every time we park a thread we check to see if the underlying polling system needsPoll, which is a hint that is intended to be implemented by checking whether any events have been registered with the underlying polling system. If it returns false, we don't even bother hitting the poller and instead park the worker thread the old-fashioned way. This has some slightly complex concurrent coordination implications, which makes this whole thing just a bit more complex and introduces a tiny spin wait, but otherwise it is generally okay.

This should bring thread suspension performance for applications which are not using a PollingSystem back very close to what it was in 3.5.x. The needsPoll call does have some overhead, and the more complex suspension state does have a cost, but it shouldn't be that severe. I'm traveling so I haven't had a chance to run benchmarks yet, but that's next on my todo list.

Fixes #4328
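As a rough model of the mechanism described above (hypothetical names throughout; `PollerLike` and `parkWorker` are invented for this sketch, and the actual worker loop in the work-stealing pool is far more involved):

```scala
import java.util.concurrent.locks.LockSupport

// Stand-in for the polling system: needsPoll hints whether any I/O events
// are currently registered, i.e. whether a real poll could do anything.
trait PollerLike {
  def needsPoll(): Boolean
  def poll(nanos: Long): Unit
}

// On park, consult needsPoll: block inside the poller (so I/O completion can
// wake us) only if there is something to poll for; otherwise park the
// old-fashioned way and skip the poller overhead entirely.
def parkWorker(system: PollerLike, timeoutNanos: Long): String =
  if (system.needsPoll()) {
    system.poll(timeoutNanos) // blocks in the selector/poller
    "polled"
  } else {
    LockSupport.parkNanos(timeoutNanos) // plain JVM park, no poller involved
    "parked"
  }
```

The concurrency subtlety the PR description alludes to comes from the waker side: it must know which of the two branches the worker took in order to choose the matching wakeup mechanism, which is what the ParkedPolling/ParkedSimple/Interrupting states coordinate.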