Skip to content

Please consider using non-boost lockfree queue for threadpool and bls_worker batch aggregation #3797

@sidhujag

Description

@sidhujag

I have been dealing with std::atomic issues on s390x on Syscoin for example(https://travis-ci.org/github/syscoin/syscoin/jobs/741950976) as you can see boost lockfree queue's alignment issues is causing to not find atomic 16 byte operators in and look out for libatomic implementation, therefor I have to explictely link libatomic. Also static analysis tools (on x86_64 which hits closer to home) break as you can see here https://cirrus-ci.com/task/6216796170616832?command=ci#L5798:

/usr/include/boost/lockfree/detail/freelist.hpp:93:23: runtime error: constructor call on misaligned address 0x606000365ea0 for type 'boost::lockfree::queue<std::function<void (int)> *>::node', which requires 64 byte alignment
0x606000365ea0: note: pointer points here
 b4 00 00 77  40 5e 36 00 60 60 be be  be be be be be be be be  be be be be be be be be  be be be be
              ^ 
    #0 0x55a99d1f793b in boost::lockfree::queue<std::function<void (int)>*>::node* boost::lockfree::detail::freelist_stack<boost::lockfree::queue<std::function<void (int)>*>::node, std::allocator<boost::lockfree::queue<std::function<void (int)>*>::node> >::construct<true, false, boost::lockfree::queue<std::function<void (int)>*>::node*>(boost::lockfree::queue<std::function<void (int)>*>::node* const&) /usr/include/boost/lockfree/detail/freelist.hpp:93:13

This later error triggered me from simply living with it to, having to do something about it.

You shouldn't take these errors lightly as they will "likely" cause production failures when running at scale. boost::lockfree looks unstable for this reason to me, and even using boost::atomic doesn't help. Maybe its the way the threadpool and the aggregation is implemented (probably). From the example link below, I think an easier threadpool implementation also exists which minimizes the CAS instead of locking like ctpl.h does and only using 2 atomic variables. Check out the examples for using concurrentqueue with a threadpool for MPMC ( I assume the BLS aggregation is multi producer multi consumer?)

So anyways I have replaced thread_pool's lockfree queue to moodycamel's excellent lockfree queue (https://github.com/cameron314/concurrentqueue/blob/master/samples.md), the tests are passing. This queue is much faster and is industry vetted (https://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++.htm#benchmarks). You can see its 2 orders of magnitude faster with 8 or more threads than std::mutex + std::queue or boost::lockfree::queue.

It also has a bulk queuing operation for optimization, so I can see even if batch sizes are 16 as you set, you can queue all batches up in one call instead of looping and pushing one by one. I did manage to update to the new queue easily, only a few lines of code change but changing the bulk queuing I thought I would leave that here if you wish to update it (the implementor of the code I assume will be able to do it in a handful of minutes before testing).

Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions