Dynamically bypass selector polling if no I/O events are present (#4377)

djspiewak merged 15 commits into typelevel:series/3.6.x
Released as …
I'd like to see the real-world numbers on this one, but the microbenchmarks seem to give their blessing insofar as they go. (Before/after benchmark tables omitted.) Note that the global EC is 4% slower in this run, but …
```scala
val polling = st eq ParkedSignal.ParkedPolling
val simple = st eq ParkedSignal.ParkedSimple

if ((polling || simple) && signal.compareAndSet(st, ParkedSignal.Interrupting)) {
```
The semantics here are different from the old implementation: getAndSet involves a loop, but there is no longer a loop here.
In the (admittedly unlikely) scenario that a worker thread transitions from ParkedPolling to ParkedSimple or vice versa while this condition is being evaluated, we would prematurely give up on attempting to wake up this thread. We might consider looping if the compareAndSet fails.
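For illustration, a looping variant along these lines might look as follows. This is a hypothetical sketch against a plain `AtomicReference`: the `ParkedSignal` values mirror the diff, but `tryInterrupt` and everything around it are invented for the example, not the actual pool code.

```scala
import java.util.concurrent.atomic.AtomicReference

sealed trait ParkedSignal
object ParkedSignal {
  case object Unparked extends ParkedSignal
  case object ParkedPolling extends ParkedSignal
  case object ParkedSimple extends ParkedSignal
  case object Interrupting extends ParkedSignal
}

// Keep retrying while the target is in *some* parked state; give up only once
// we observe it awake. This closes the ParkedPolling <-> ParkedSimple window
// at the cost of an extra branch on the (rare) CAS failure.
def tryInterrupt(signal: AtomicReference[ParkedSignal]): Boolean = {
  var st = signal.get()
  while ((st eq ParkedSignal.ParkedPolling) || (st eq ParkedSignal.ParkedSimple)) {
    if (signal.compareAndSet(st, ParkedSignal.Interrupting))
      return true
    st = signal.get() // lost a race; re-read and retry only if still parked
  }
  false
}
```

The loop terminates as soon as the thread is observed in any non-parked state, so it cannot spin indefinitely against a worker that has genuinely woken up.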
I think that might not be worth it. Going from ParkedPolling to ParkedSimple would require that, between the get() above and the CAS, the other thread wake itself up (or be awakened by another notify), run through its whole state machine, pick up a fiber, execute that fiber through the whole machinery, run out of work, go through the whole "check all the things for work" step (which includes doing the thing that we're trying to notify it to do here!), and then finally go back and park again, all in the space of just these few lines.
Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one. The work item we're trying to notify on would actually get picked up (before the target thread resuspends), and we'll ultimately wake the same number of threads.
Compare this to having an actual CAS loop, which would require more complexity here and, I think, an extra conditional jump, despite the fact that it's effectively never going to be hit and wouldn't matter even if it did. So I think I'd rather just eat the CAS failure.
Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one.
Sure ... I am just worried about relying on the existence of another thread. For example, if there is exactly one worker thread in the runtime, could we end up in a deadlock?
I have thought through it a few times and I have not come up with a scenario yet that could deadlock, but I don't feel confident.
If there's only one worker then that worker is either awake already (in which case we can't have entered this conditional) or it's parked and we're notifying it (in which case all the normal rules apply).
or it's parked and we're notifying it (in which case all the normal rules apply).
Right, so if all the normal rules are applying ...
Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one.
What happens if we fail the CAS due to the race condition, we don't loop, and then we don't go to the next thread because there is no next thread? We've given up on our one-and-only thread.
Your words precisely capture my feelings about this:
Now, I think you're probably right that nothing is lost here since the thread will both check the external queue and attempt to steal before going back to sleep, but it's hard for me to 100% convince myself that this is safe in all circumstances. This is very similar to the multi-consumer queue cases, which are genuinely insanely subtle and we've been bitten quite a bit with lost notifications. Speaking more generally, we tend to always bias notifications in favor of over-notifying rather than under-notifying, since the worst case in the former is a bit less performance, while the worst case in the latter is a deadlock.
Co-authored-by: Arman Bilge <[email protected]>
So my interpretation of the results from #4328 is that this is indeed an improvement, and we likely have some more work to do on …
tests/jvm/src/test/scala/cats/effect/IOPlatformSpecification.scala
I'm probably missing something, but do we really need the ParkedSignal.Interrupting state? What does the spinlock actually protect? (Spurious wakeups already must be handled.)

(Btw, I've restarted the "windows-latest, 2.13.15, temurin@17, ciJVM" job, because it timed out. But I'm wondering if it was a real failure.)
armanbilge left a comment:
I'm probably missing something, but do we really need the ParkedSignal.Interrupting state? What does the spinlock actually protect? (Spurious wakeups already must be handled.)
@durban My understanding is this protects us against the case where the interruptor reads that the worker thread is in the "polling" or "simple" state, but by the time it invokes the appropriate interrupt action, the worker thread is parked in a different state (and thus the interrupt is ineffective).
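That check-then-act window can be shown with a toy model. Everything here is invented for the illustration; the point is only that the wakeup mechanism depends on *how* the worker parked, so acting on a stale read dispatches the wrong mechanism:

```scala
import java.util.concurrent.atomic.AtomicReference

// Toy model of the two parking modes. In the real pool, Polling would be
// woken via the poller interrupt and Simple via LockSupport.unpark.
sealed trait Mode
case object Polling extends Mode
case object Simple extends Mode

val state = new AtomicReference[Mode](Polling)

val observed = state.get() // interruptor reads "parked polling"
state.set(Simple)          // meanwhile the worker reparks the simple way

// The interruptor would now invoke the poller interrupt on a thread that is
// actually in a plain park, which does nothing to wake it.
val staleDispatch = (observed eq Polling) && (state.get() eq Simple)
```

Pinning the state to `Interrupting` with a CAS before dispatching closes this window: the worker cannot transition modes while the interrupt is in flight.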
```scala
// just create a bit of chaos with timers and async completion
val sleeps = 0.until(10).map(i => IO.sleep((i * 10).millis)).toList

val latch = IO.deferred[Unit].flatMap(d => d.complete(()).start *> d.get)
val latches = 0.until(10).map(_ => latch).toList

val test = (sleeps ::: latches).parSequence.parReplicateA_(100)
```
I think we need a bit more chaos: specifically, exercising the external queue somehow.
Also, I wonder if we should have a version of this test where the runtime has exactly one worker; that's how we've caught various bugs before.
I'm not averse to running a version of the test with only one worker. I'll try tickling the external queue first.
Okay, so the way to wake someone up is (without this PR) essentially:

```scala
if (signal.getAndSet(false)) system.interrupt(...)
```

With the 2 different parking states it could be:

```scala
val old = signal.getAndSet(Unparked)
// HERE
if (old eq ParkedPolling) system.interrupt(...)
else if (old eq ParkedSimple) LockSupport.unpark(...)
```

And yes, it could happen that at the point marked with `// HERE` …

(Again, I might be missing something, I just started thinking about this...)
```scala
} else {
  // Park the thread until further notice.
  val start = System.nanoTime()
  metrics.incrementPolledCount()
```
Shouldn't this metric be incremented only if needPoll == true?
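That guard, as a minimal sketch (the `Metrics` class and `recordPark` helper here are stand-ins invented for the example, not the pool's real metrics type):

```scala
// Stand-in metrics holder for the sketch.
final class Metrics {
  private var polled = 0
  def incrementPolledCount(): Unit = polled += 1
  def polledCount: Int = polled
}

// Only record a "polled" event when the park actually went through the
// poller, i.e. when needPoll was true.
def recordPark(metrics: Metrics, needPoll: Boolean): Unit =
  if (needPoll) metrics.incrementPolledCount()
```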
@durban I think we need the Interrupting state for two reasons.

First, when the …

The second reason here is the situation @armanbilge outlined: the pool decides to awaken a thread which is parked polling, that thread wakes on its own, does its thing, goes all the way through its state machine, and then parks again without polling before the pool has the ability to wake it. In this situation, a notification ends up being "lost".

Now, I think you're probably right that nothing is lost here, since the thread will both check the external queue and attempt to steal before going back to sleep, but it's hard for me to 100% convince myself that this is safe in all circumstances. This is very similar to the multi-consumer queue cases, which are genuinely insanely subtle, and we've been bitten quite a bit with lost notifications. Speaking more generally, we tend to always bias notifications in favor of over-notifying rather than under-notifying, since the worst case in the former is a bit less performance, while the worst case in the latter is a deadlock.

I'm not saying we shouldn't try to reason through this, and if we can find a way to remove the …
I tried …

CPU Usage

(graph omitted) Although the graph shows the mean value as 20%, the startup inflates the number. The more realistic number is ~13% when the system is idling.

Events per worker

(graph omitted) The sum of all events (blocked, parked, polled) across all workers.

Ah, the poll number should be even lower, if I get it right: #4377 (comment).
@djspiewak Yeah, I haven't looked at …

Something else: https://github.com/typelevel/cats-effect/actions/runs/14459016996/job/40548044386?pr=4377#step:22:4843

This is the second time I've seen this branch hang in …
This does feel like something real, but it's definitely kinda nuts. It also surprises me that …
Force-pushed from 850d0b9 to 5149437.


With this change, every time we park a thread we check to see if the underlying polling system needsPoll, which is a hint that is intended to be implemented by checking whether any events have been registered with the underlying polling system. If it returns false, we don't even bother hitting the poller and instead park the worker thread the old-fashioned way. This has some slightly complex concurrent coordination implications, which makes this whole thing just a bit more complex and introduces a tiny spin wait, but otherwise it is generally okay.

This should bring thread suspension performance for applications which are not using a PollingSystem back very close to what it was in 3.5.x. The needsPoll call does have some overhead, and the more complex suspension state does have a cost, but it shouldn't be that severe. I'm traveling so I haven't had a chance to run benchmarks yet, but that's next on my todo list.

Fixes #4328
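As a rough model of the mechanism described above (hypothetical names throughout; `PollerLike` and `parkWorker` are invented for this sketch, and the actual worker loop in the work-stealing pool is far more involved):

```scala
import java.util.concurrent.locks.LockSupport

// Stand-in for the polling system: needsPoll hints whether any I/O events
// are currently registered, i.e. whether a real poll could do anything.
trait PollerLike {
  def needsPoll(): Boolean
  def poll(nanos: Long): Unit
}

// On park, consult needsPoll: block inside the poller (so I/O completion can
// wake us) only if there is something to poll for; otherwise park the
// old-fashioned way and skip the poller overhead entirely.
def parkWorker(system: PollerLike, timeoutNanos: Long): String =
  if (system.needsPoll()) {
    system.poll(timeoutNanos) // blocks in the selector/poller
    "polled"
  } else {
    LockSupport.parkNanos(timeoutNanos) // plain JVM park, no poller involved
    "parked"
  }
```

The concurrency subtlety the PR description alludes to comes from the waker side: it must know which of the two branches the worker took in order to choose the matching wakeup mechanism, which is what the ParkedPolling/ParkedSimple/Interrupting states coordinate.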