
High(er) performance async Queue #2885

Merged
djspiewak merged 46 commits into typelevel:series/3.x from djspiewak:feature/high-perf-queue
Jun 18, 2022

Conversation

@djspiewak djspiewak (Member) commented Mar 19, 2022

First off, one thing I discovered as part of this is that the existing Queue is really a lot faster than you probably think. But we can do better.

This relies on the runtime typecase trick on the typeclass to see if the GenConcurrent is secretly actually an Async. When it is, and when the bound is > 1, we transparently swap to this more efficient implementation. Benchmarks to follow, but it's somewhere between 2x and 10x faster, depending on the scenario and the number of physical threads. Implementation is based on jctools for the bounded part, and all credit is due to @viktorklang for the unbounded part. Bugs are probably my fault.
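The dispatch being described can be pictured with a stripped-down sketch. The traits and the string results below are hypothetical stand-ins for cats-effect's real `GenConcurrent`/`Async` and queue constructors; only the shape of the match is meant to be faithful:

```scala
// Hypothetical, simplified stand-ins for the real typeclasses.
trait GenConcurrent[F[_]]
trait Async[F[_]] extends GenConcurrent[F]

// Runtime typecase: if the GenConcurrent in scope is secretly an Async
// (and the bound is > 1), transparently pick the specialized path.
def bounded[F[_]](capacity: Int)(implicit F: GenConcurrent[F]): String =
  F match {
    case _: Async[F] if capacity > 1 => "async-optimized implementation"
    case _                           => "portable fallback implementation"
  }
```

The real code does the same match but returns the specialized `Queue`; callers never see the downcast.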

Lots of room to improve here! A non-exhaustive list:

  • We can support unbounded without too much trouble, just by building on a pair of UnboundedUnsafes
  • circular is also pretty easy with what we have
  • synchronous is going to require a brand new structure and a lot of thought. That's a very tricky case
  • Specialized versions for ScalaJS would technically be optimal
  • The jctools bounded queue does have support for a takeAll-like operation, meaning we can override tryTakeN and it will be much better than the naive version, which should in turn make Fs2 Channel vastly better
  • Right now, it starts failing if you enqueue more than Long.MaxValue elements. This is easily fixable; I was just lazy

An even more substantial optimization will be striping consumers. While this is an mpmc queue, the far-and-away most common scenario is mpsc, which would allow us to optimize it a lot. We can take advantage of that by striping sc queues so long as we don't care as much about fairness in that case, and since we know our consuming threads are bounded by the physical threads, we also know our stripes will be strictly bounded.
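For illustration only, a toy version of the striping idea. Everything here (stripe selection by thread id, round-robin producers, stealing from neighbor stripes) is an assumption of the sketch, not the PR's design:

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicLong

// Toy striped queue: several independent queues, each intended to see
// roughly one consumer, trading strict FIFO fairness for less contention.
final class StripedSketch[A](stripes: Int) {
  private val queues = Array.fill(stripes)(new ConcurrentLinkedQueue[A])
  private val ticket = new AtomicLong(0)

  // producers spread elements round-robin across stripes
  def put(a: A): Unit = {
    val i = (ticket.getAndIncrement() % stripes).toInt
    queues(i).add(a)
    ()
  }

  // each consumer thread starts at a "home" stripe, stealing from the
  // others only when its own stripe is empty
  def take(): Option[A] = {
    val home = (Thread.currentThread().getId % stripes).toInt
    var k = 0
    while (k < stripes) {
      val a = queues((home + k) % stripes).poll()
      if (a != null) return Some(a)
      k += 1
    }
    None
  }
}
```

Note how FIFO ordering only holds per stripe, which is exactly the fairness trade-off mentioned above.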

Anyway, lots of room for future excitement.

Closes #2771

@djspiewak djspiewak (Member Author) commented:

Auspicious…

@durban durban (Contributor) left a comment

A general comment: none of these queues seem lock-free (which is not necessarily a problem; it's just an observation).

```scala
// manually complete our own callback
// note that we could have a race condition here where we're already completed
// async will deduplicate these calls for us
// additionally, the continuation (below) is held until the registration completes
```
@durban (Contributor) commented on this code:

This ("the continuation is held") seems important here. Is this guaranteed by any Async?

@djspiewak (Member Author) replied:

It is! And I agree it would be worth spelling it out more. We use this trick a lot though.

```diff
-def bounded[F[_], A](capacity: Int)(implicit F: GenConcurrent[F, _]): F[Queue[F, A]] = {
+def bounded[F[_], A](capacity: Int)(implicit F: GenConcurrent[F, _]): F[Queue[F, A]] =
+  F match {
+    case f0: Async[F] =>
```
@durban (Contributor) commented on this code:

This is not strictly related to this PR, but what's stopping someone implementing Cont from doing the same, and recovering an Async[G] from the MonadCancel[G, Throwable]?

@djspiewak (Member Author) replied:

Doing so and then setting up weird forking scenarios would result in runtime errors. So, caveat emptor.

```scala
def offer(a: A): F[Unit] = F defer {
  try {
    // attempt to put into the buffer; if the buffer is full, it will raise an exception
    buffer.put(a)
```
@durban (Contributor) commented on this code:

Since offer and take both first try to use buffer I think it can happen that a waiter is bypassed. For example, initially the queue is empty, and there is one taker waiting. Then, offer and take are racing, and take immediately removes the item inserted by offer, thus bypassing the waiter, which was earlier than take. (Although this whole thing might be unobservable. It seems a "fairness" issue, and not necessarily a "semantics" issue.)

@djspiewak (Member Author) replied:

You're correct! There are actually a couple cases like this. Strict fairness is not guaranteed in highly contended races.

@djspiewak djspiewak (Member Author) commented:

They're definitely all lock-free. They aren't contention-free, though.

@durban durban (Contributor) commented Mar 21, 2022

Regarding lock freedom:

Let's look at UnsafeUnbounded: in take an interesting case is when taken is null, but last.get() ne null. In this case what it does is: take() // Waiting for prevLast.set(cell), so recurse. It waits on another thread (put), by spinning. This is... well, it's a spinlock. Acquired by put with last.getAndSet(cell), and released either with first.set(cell) or with prevLast.set(cell). And take is waiting (by spinning) on this lock with the recursive call.

There is another similar case (spinwaiting on another thread) in take after the long comment.
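The pattern being described here (getAndSet to swing the tail, then a separate write to publish the link, with take spin-waiting in the gap) can be modeled with a minimal Vyukov-style MPSC linked queue. This is a hypothetical simplification, not the actual UnsafeUnbounded:

```scala
import java.util.concurrent.atomic.AtomicReference

// The AtomicReference a Node extends is its `next` link.
final class Node[A](val value: A) extends AtomicReference[Node[A]]

final class MpscSketch[A] {
  private val stub = new Node[A](null.asInstanceOf[A]) // sentinel; its value is never read
  private val tail = new AtomicReference[Node[A]](stub) // producers swing this first
  private var head: Node[A] = stub                      // owned by the single consumer

  def put(a: A): Unit = {
    val node = new Node(a)
    val prev = tail.getAndSet(node) // "acquire": node is now logically enqueued...
    prev.set(node)                  // "release": ...but only reachable once linked
  }

  // None when empty; spins only in the tiny window between a producer's
  // getAndSet and its prev.set -- the "spinlock" under discussion
  @annotation.tailrec
  def take(): Option[A] = {
    val next = head.get()
    if (next ne null) { head = next; Some(next.value) }
    else if (tail.get() eq head) None // truly empty
    else take()                       // put in flight: spin until the link appears
  }
}
```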

In UnsafeBounded there is something similar going on (if I understand correctly): put acquires a spinlock with tail.compareAndSet(currentTail, currentTail + 1), writes the data into buffer while holding the lock, then releases it with sequenceBuffer.incrementAndGet(project(currentTail)). take waits on this lock by spinning until seq == currentHead + 1.
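The protocol outlined above is essentially the classic sequence-buffer design (in the style of Vyukov's bounded MPMC queue, which jctools also builds on). A minimal sketch with hypothetical names; where the real structure spins in the contended windows, this sketch just gives up and returns false/None:

```scala
import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray}

// Toy sequence-buffer queue: every slot carries a sequence number that
// tells producers and consumers whether the slot is ready for them.
final class BoundedSketch[A](capacity: Int) {
  private val buffer = new Array[Any](capacity)
  private val seq    = new AtomicLongArray(capacity)
  locally { var i = 0; while (i < capacity) { seq.set(i, i.toLong); i += 1 } }
  private val head = new AtomicLong(0) // next slot to take
  private val tail = new AtomicLong(0) // next slot to put

  def put(a: A): Boolean = {
    val t = tail.get()
    val i = (t % capacity).toInt
    if (seq.get(i) != t) false                     // slot not yet released: full
    else if (!tail.compareAndSet(t, t + 1)) put(a) // lost the race: retry
    else {
      buffer(i) = a
      seq.set(i, t + 1) // "release": a take waiting on seq == t + 1 may proceed
      true
    }
  }

  def take(): Option[A] = {
    val h = head.get()
    val i = (h % capacity).toInt
    if (seq.get(i) != h + 1) None                  // empty, or a put is mid-flight
    else if (!head.compareAndSet(h, h + 1)) take() // lost the race: retry
    else {
      val a = buffer(i).asInstanceOf[A]
      buffer(i) = null
      seq.set(i, h + capacity) // release the slot to the producer one lap ahead
      Some(a)
    }
  }
}
```

The window between winning the CAS and publishing via `seq.set` is exactly the interval other threads must wait out, which is the "spinlock" characterization above.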

@viktorklang viktorklang (Contributor) commented Mar 22, 2022 via email

@djspiewak djspiewak (Member Author) commented:

Depends on #3000

@djspiewak djspiewak (Member Author) commented:

```
[info] Benchmark                                                  (size)   Mode  Cnt      Score     Error    Units
[info] QueueBenchmark.boundedAsyncEnqueueDequeueContended         100000  thrpt   10  29510.955 ± 206.937  ops/min
[info] QueueBenchmark.boundedAsyncEnqueueDequeueMany              100000  thrpt   10   6105.196 ±  27.218  ops/min
[info] QueueBenchmark.boundedAsyncEnqueueDequeueOne               100000  thrpt   10   6261.579 ±   5.845  ops/min
[info] QueueBenchmark.boundedConcurrentEnqueueDequeueContended    100000  thrpt   10  10060.914 ±  44.569  ops/min
[info] QueueBenchmark.boundedConcurrentEnqueueDequeueMany         100000  thrpt   10   4157.941 ±   7.719  ops/min
[info] QueueBenchmark.boundedConcurrentEnqueueDequeueOne          100000  thrpt   10   4278.601 ±  16.290  ops/min
[info] QueueBenchmark.unboundedAsyncEnqueueDequeueContended       100000  thrpt   10  30820.903 ±  93.492  ops/min
[info] QueueBenchmark.unboundedAsyncEnqueueDequeueMany            100000  thrpt   10  16245.842 ±  47.942  ops/min
[info] QueueBenchmark.unboundedAsyncEnqueueDequeueOne             100000  thrpt   10  16594.351 ± 126.705  ops/min
[info] QueueBenchmark.unboundedConcurrentEnqueueDequeueContended  100000  thrpt   10  10073.977 ±  76.811  ops/min
[info] QueueBenchmark.unboundedConcurrentEnqueueDequeueMany       100000  thrpt   10   4155.995 ±   5.937  ops/min
[info] QueueBenchmark.unboundedConcurrentEnqueueDequeueOne        100000  thrpt   10   4274.642 ±   6.468  ops/min
```

Plenty more room to improve, but I'll take a 2-4x speedup as a starting point.

@djspiewak djspiewak merged commit d04b2e8 into typelevel:series/3.x Jun 18, 2022