Change to lockless unbounded mpsc by mratsim · Pull Request #21 · mratsim/weave

mratsim · 2019-11-23T18:16:55Z

Implemented lockless unbounded MPSC channels.

Original issue

The initial implementation was based on snmalloc paper (itself inspired by Pony).
The idea was to use it for both:

the future memory pool that needs a way to transfer onership of memory block back to the thread that allocated them in the first place
steal request

However they have the following issues:

they hold on the last item of a queue (breaking for steal requests)
they require memory management of the dummy node (snmalloc deletes it and its memory doesn't seem to be reclaimed)
they never touch the "back" pointer of the queue when dequeuing, meaning if an item was last, dequeuing will still points to it. Pony has an emptiness check via tagged pointer and snmalloc does something unknown.

Overhead / Performance

On fib(40) with i9-9980XE (36 cores OC @ 4.1 GHz)

We gain 10-20 ms (5-10%) and also avoids a thread blocking a steal request channel

Investigation details

Full details from #19 (comment)

It seems like there is an issue with snmalloc channels (adapted from Pony's) or I adapted them improperly.

snmalloc pseudo code from paper

Code from: https://github.com/microsoft/snmalloc/blob/7faefbbb0ed69554d0e19bfe901ec5b28e046a82/src/ds/mpscq.h#L29-L83

    void init(T* stub)
    {
      stub->next.store(nullptr, std::memory_order_relaxed);
      front = stub;
      back.store(stub, std::memory_order_relaxed);
      invariant();
    }

    T* destroy()
    {
      T* fnt = front;
      back.store(nullptr, std::memory_order_relaxed);
      front = nullptr;
      return fnt;
    }

    inline bool is_empty()
    {
      T* bk = back.load(std::memory_order_relaxed);

      return bk == front;
    }

    void enqueue(T* first, T* last)
    {
      // Pushes a list of messages to the queue. Each message from first to
      // last should be linked together through their next pointers.
      invariant();
      last->next.store(nullptr, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_release);
      T* prev = back.exchange(last, std::memory_order_relaxed);
      prev->next.store(first, std::memory_order_relaxed);
    }

    std::pair<T*, bool> dequeue()
    {
      // Returns the front message, or null if not possible to return a message.
      invariant();
      T* first = front;
      T* next = first->next.load(std::memory_order_relaxed);

      if (next != nullptr)
      {
        front = next;
        AAL::prefetch(&(next->next));
        assert(front);
        std::atomic_thread_fence(std::memory_order_acquire);
        invariant();
        return std::pair(first, true);
      }

      return std::pair(nullptr, false);
    }
  };
} // namespace snmalloc

Pony: https://qconlondon.com/london-2016/system/files/presentation-slides/sylvanclebsch.pdf

Code from https://github.com/ponylang/ponyc/blob/7145c2a84b68ae5b307f0756eee67c222aa02fda/src/libponyrt/actor/messageq.c

Note that Pony enqueue at the head and dequeue at the tail

static bool messageq_push(messageq_t* q, pony_msg_t* first, pony_msg_t* last)
{
  atomic_store_explicit(&last->next, NULL, memory_order_relaxed);

  // Without that fence, the store to last->next above could be reordered after
  // the exchange on the head and after the store to prev->next done by the
  // next push, which would result in the pop incorrectly seeing the queue as
  // empty.
  // Also synchronise with the pop on prev->next.
  atomic_thread_fence(memory_order_release);

  pony_msg_t* prev = atomic_exchange_explicit(&q->head, last,
    memory_order_relaxed);

  bool was_empty = ((uintptr_t)prev & 1) != 0;
  prev = (pony_msg_t*)((uintptr_t)prev & ~(uintptr_t)1);

#ifdef USE_VALGRIND
  // Double fence with Valgrind since we need to have prev in scope for the
  // synchronisation annotation.
  ANNOTATE_HAPPENS_BEFORE(&prev->next);
  atomic_thread_fence(memory_order_release);
#endif
  atomic_store_explicit(&prev->next, first, memory_order_relaxed);

  return was_empty;
}

void ponyint_messageq_init(messageq_t* q)
{
  pony_msg_t* stub = POOL_ALLOC(pony_msg_t);
  stub->index = POOL_INDEX(sizeof(pony_msg_t));
  atomic_store_explicit(&stub->next, NULL, memory_order_relaxed);

  atomic_store_explicit(&q->head, (pony_msg_t*)((uintptr_t)stub | 1),
    memory_order_relaxed);
  q->tail = stub;

#ifndef PONY_NDEBUG
  messageq_size_debug(q);
#endif
}

pony_msg_t* ponyint_thread_messageq_pop(messageq_t* q
#ifdef USE_DYNAMIC_TRACE
  , uintptr_t thr
#endif
  )
{
  pony_msg_t* tail = q->tail;
  pony_msg_t* next = atomic_load_explicit(&tail->next, memory_order_relaxed);

  if(next != NULL)
  {
    DTRACE2(THREAD_MSG_POP, (uint32_t) next->id, (uintptr_t) thr);
    q->tail = next;
    atomic_thread_fence(memory_order_acquire);
#ifdef USE_VALGRIND
    ANNOTATE_HAPPENS_AFTER(&tail->next);
    ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(tail);
#endif
    ponyint_pool_free(tail->index, tail);
  }

  return next;
}

bool ponyint_messageq_markempty(messageq_t* q)
{
  pony_msg_t* tail = q->tail;
  pony_msg_t* head = atomic_load_explicit(&q->head, memory_order_relaxed);

  if(((uintptr_t)head & 1) != 0)
    return true;

  if(head != tail)
    return false;

  head = (pony_msg_t*)((uintptr_t)head | 1);

#ifdef USE_VALGRIND
  ANNOTATE_HAPPENS_BEFORE(&q->head);
#endif
  return atomic_compare_exchange_strong_explicit(&q->head, &tail, head,
    memory_order_release, memory_order_relaxed);
}

However, I probably have missed something here but it seems to me like:

The back of the queue (head in Pony) is never modified in the enqueue/pop routines.
This is problematic as it would still points to the item that was just dequeued
In snmalloc the queue front is overwritten by next, meaning the initial dummy node would be overwritten on first dequeue. But its memory is never freed/recycled.
As they both erase the front in the dequeue/pop, but the dequeuing only advances when there is more than one item in the queue, the consumer will get blocked on the last message.
Snmalloc keeps an atomic count of enqueued objects next to the queue.
In our case, both for the memory pool and for steal requests (for the adaptative steal strategy) we also need to keep an approximate count of enqueued items, we might as well integrate it in the queue.

…o the drawing board - nim-lang/Nim#12695

…stem alloc

- they hold on the last item of a queue (breaking for steal requests) - they require memory management of the dummy node (snmalloc deletes it and its memory doesn't seem to be reclaimed) - they never touch the "back" pointer of the queue when dequeuing, meaning if an item was last, dequeuing will still points to it. Pony has an emptiness check via tagged pointer and snmalloc does ???

mratsim added 4 commits November 22, 2019 00:05

Stashing: Sadly, we might never know the perf of this channel, back t…

fd9aca6

…o the drawing board - nim-lang/Nim#12695

Fix queue, workaround nim-lang/Nim#12695

e305edc

Fix fence issue: no strange memory reuse anymore with both Nim and sy…

76db49e

…stem alloc

dumblob mentioned this pull request Nov 24, 2019

Proper support for distributed computing, parallelism and concurrency vlang/v#1868

Closed

mratsim merged commit b796d1a into master Nov 24, 2019

mratsim added a commit that referenced this pull request Nov 24, 2019

Cleanup repo / update docs after #21

e41c927

This was referenced Nov 24, 2019

SparsetSet: signed/unsigned/isEmpty/contains issue #20

Closed

The MPSC channel has a deadlock #6

Closed

mratsim deleted the test-lockless-unbounded-mpsc branch November 30, 2019 13:56

dumblob mentioned this pull request Nov 23, 2021

Performance benchmarks Taymindis/lfqueue#4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change to lockless unbounded mpsc#21

Change to lockless unbounded mpsc#21
mratsim merged 4 commits into
masterfrom
test-lockless-unbounded-mpsc

mratsim commented Nov 23, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mratsim commented Nov 23, 2019

Original issue

Overhead / Performance

Investigation details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant