Multithreaded CGAL geometry evaluation#1980
Multithreaded CGAL geometry evaluation#1980devilman3d wants to merge 6 commits intoopenscad:masterfrom
Conversation
|
Thanks a lot for taking a shot at this! It will take a while to properly review this, but I have a few initial questions/comments:
|
…k multithreaded. Also fixed all the tests broken by my last commit.
|
I fixed all the tests I broke in that first commit and also solved the Gmpq threading problem. I'm apparently using newer GCC and CGAL than the build machines. Working on a backward-compatible solution now... |
|
Is a spin lock the best approach? |
I chose them because they have significant performance compared to the posix mutex (which I originally used until I found out about std::atomic_flag). They are only held for short durations to perform "atomic" operations on shared data. It turns out to be pretty efficient. In a single-core environment the spin lock is only a couple of extra CPU instructions. In a multi-core environment, those couple of extra instructions may be repeated a few times as the lock spins, but you have the benefit of multiple cores. I tried to utilize other std::atomic operations, but every synchronization point needed a lock to guard access to data. Also, note the use of a semaphore to suspend the parent thread at the OS level while child threads do work. |
I originally didn't. Figured out that was part of the problem with the tests failing, so I did some research. I found out about CGAL's Handle_for class from which pretty much everything in CGAL derives. Luckily that class is completely header-based and it can be replaced via C/C++ header trickery. I re-implemented the class using shared_ptr's because their reference counting is already thread-safe. There was additional detail work to get Handle_for fully thread-safe, but it is now.
Hmm. Not much to it:
Perhaps an explanation is in order. It's magic!!! Well, automagic. As the tree is walked from leaf-to-trunk, a thread is spawned to render each node's geometry. When any thread completes, another is spawned for the next unprocessed node. The maximum number of threads is equal to the number of cores detected in the machine. A node with children will not be started until all of its children have completed. This leads to an interesting phenomenon in which the quickly processed children add up to increasingly complex parents ending at the single-threaded root node and its dreaded implicit union. (As an aside, Issue #350 takes the speedup gained here to a whole new level.)
Oops, my bad. I am now quite familiar with the tests... |
|
If someone [ @t-paul ;) ] could do a Windows build I'll be happy to test. |
src/openscad.cc
Outdated
| } | ||
| #endif // OPENSCAD_QTGUI | ||
|
|
||
| #include <execinfo.h> |
There was a problem hiding this comment.
What's that include used for? This breaks Windows (MXE based) compilation. Can we make it conditional maybe?
There was a problem hiding this comment.
Yeah, that wasn't supposed to get committed as I can actually use a debugger now. I'll have an update shortly.
|
Windows builds in the usual place as |
|
The non-multi console was 2015.03-2 BTW. |
I was mostly hoping for a good, handcrafted, example which we could use for sanity-testing this. That would also help comparing results. |
|
@t-paul was it built with debug or something? |
|
Same build procedure as always on our build server, I think it runs the |
|
When I built a debug version myself with MSYS2 it was too slow to use for
anything but testing. The Windows snapshots I download aren't slower than
releases.
…On 6 April 2017 at 09:27, Torsten Paul ***@***.***> wrote:
Same build procedure as always on our build server, I think it runs the
--snapshot build which would enable debug, but I have to look into the
script to be sure.
I did update the MXE environment not long ago, so it should have things
like Qt updated.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1980 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAijhTyvHj-JKU1SgkBxrKbkU9Zh2zeqks5rtKH8gaJpZM4Mu_BB>
.
|
|
The progress update is now more frequent and responsive to cancel. :)
It appears not. Future feature... |
|
Well it screams on my 4 core 8 thread system... |
|
A tip for testing: add to tests/CMakeLists.txt ..just before the call to This will turn on the threaded traversal for all tests |
kintel
left a comment
There was a problem hiding this comment.
I'm chipping away at this. looking at the small things first :)
| QMAKE_CXXFLAGS *= -fno-strict-aliasing | ||
| QMAKE_CXXFLAGS_WARN_ON += -Wno-unused-local-typedefs # ignored before 4.8 | ||
| # debug: turn off optimization in debug builds | ||
| debug: QMAKE_LFLAGS += -rdynamic -O0 |
There was a problem hiding this comment.
Did you actually need to add -O0 ? This should already be handled by qmake.
| #define SPIN_LOCK(flag) do { } while(flag.test_and_set()) | ||
|
|
||
| // increments lockedErrorsCount and calls CGAL::set_error_behavior if this is the first [unlocked] call | ||
| void lockErrors(CGAL::Failure_behaviour behavior) |
There was a problem hiding this comment.
While we're already doing the work to clean up the cgal error handling, could we take it one step further and encapsulate this into a RAII-style lock, and use the appropriate C++11 constructs for that + boost's spinlock? e.g.
class ErrorLocker {
public:
ErrorLocker() {
std::lock_guard<boost::detail::spinlock> guard(ErrorLocker::lock);
oldBehavior = CGAL::set_error_behaviour(CGAL::THROW_EXCEPTION);
}
~ErrorLocker() {
std::lock_guard<boost::detail::spinlock> guard(ErrorLocker::lock);
CGAL::set_error_behaviour(oldBehavior);
}
private:
CGAL::Failure_behaviour oldBehavior;
static boost::detail::spinlock lock;
};
There was a problem hiding this comment.
I was about to suggest SPINLOCK should call yield(), however boost's spinlock does it already,
void lock()
{
for( unsigned k = 0; !try_lock(); ++k )
{
boost::detail::yield( k );
}
}
this is brilliant.... i have a very small question ... over the years we have done so much of this modifying of CGAL , at what point has OpenSCAD basically forked CGAL, and is there much long term risk involved where OpenSCAD-CGAL becomes incompatible with CGAL-CGAL? Thanks |
|
There is always the risk that, if CGAL breaks backwards compatibility, e.g. due to a major version change, we'll get work to do. We also have the possibility of moving away from CGAL, so whatever we do, I strive to keep our CGAL usage behind some internal APIs. There is still a bunch of work to do in that domain though.. |
|
@MichaelAtOz Do you have a good torture test for this? |
|
old/example024 with n=4 gets ~9000 threads, but they process quickly, then get into the union, you see only 1 thread after that, cancel sometimes works, but I close the window & it exits cleanly, then rerun. |
|
BTW, on the 8 thread box I set CPU affinity to drop a few cores from the OpenSCAD process to create thread contention, to simulate other conflicting workloads, and allow any potential concurrency issues to show their faces. |
|
I think observation b. may have been me misinterpreting the monitoring. A terminated thread hangs around in the display for a short period. (still showing the CPU consumption at the time of termination) |
|
Note the CGAL error ( and another - to be posted tomorrow) are repeatable. |
|
I have been looking into the CGAL errors... There are a number of variables involved. There are a couple of things not quite correct with that implementation of CGAL::Handle_for, but those are not the source of the real problem. The real problem seems to stem from the fact CGAL's usage of Gmpq arithmetic operators is not "atomic". There are race conditions in CGAL where one thread can do math which modifies the underlying numerator/denominator of a Gmpq object while another thread is using those unsynchronized values. There are also problems with lost/early-released Handle_for's and other references in CGAL. Gah! Those headers are complex! And, the differences between versions only complicate the complexity. I'm working on solutions to identify the memory issues and really make CGAL's usage of Gmpq threadsafe. All in all, I'm stoked to see it is [mostly] working for ya'll! Please keep testing! |
|
@devilman3d The Gmpq issue you mention is probably the same I mentioned close to the top. Gmpq in CGAL are non-thread-safe reference-counted objects. I'll try to think about a torture test which causes multiple Gmpq objects to be referenced from different threads. |
|
The progress indicator does not update with 2D render. Could you add a phase 3 console message '(Salmon) Spawning complete'. @devilman3d I have yet to fully Grock your thread tree. Can I presume it is n-ary where each n is the children (or 1 grouped children subtree) appropriate to the .scad operator, eg a for(<10 loops>) is a node=union with n=10 branches? I'm trying to follow observed thread behaviour. |
|
Strike that. Let me put my brain back on. |
Isn't union commutative, such that any subsets can be unioned and then unioned to produce the same result?
|
|
@MichaelAtOz Yes, but my hunch is that the time performing the final union scales with the total number of triangles, so we might not win that much with such an approach. |
|
The above mentioned discussion: #1234 |
|
This is what I was wanting to post on the forum
I'm still working thru it, but I'm now leaning toward these lower CPU% threads are micro-workloads, for want of a term, they finish before consuming much CPU. I'm still looking at the performance info, but close to 100% of threads record zero CPU (I suspect below some measurable threshold), only a handful record more (eg ~300/~50,000 2D, ~1,700/~50,000 3D). [small sample size, need more varied CSG workload, still working on it] ATM 2D multi-threading is worse than single-threaded, some times up to 200% [elapsed]. Probably due to the above. Whether a more chunky thread/workload allocation would be better needs consideration, where thread overheads would be less. This probably applies to 3D too. |
|
I just tested this on my system with 2 cores and 4 thread. and using the example modulerecursion.scad with level=15 and I got following: for threading enabled:
and for threading disabled: |
|
Rather than spawning hundreds of thousands of threads perhaps it would be more efficient to spawn as many worker threads as the your CPU supports and have them chew through the work to be done. |
That's what it does. 2 CGAL threads on my Duo, 8 on my quad hyperthreaded box.
I hadn't seen that example before, very pretty. A discussion on approaches to 2D is needed. ↓ 2/3D-afied version, use fewer levels for 3D (try 7 first) |
|
Has anyone applied brain power to whether, with multi threads, a non-normalised tree has any advantages? |
|
@MichaelAtOz Normalization is only performed for previews. Normalization is a prerequisite for OpenCSG to be able to correctly preview a scene. Intuitively it doesn't sound like normalization is a good path for optimization as it tends to produce more nodes, but it could be worth to think about down the road. |
|
The progress reporting may not be threadsafe (See #1985), and since it does mess with the console output, it's likely that somethings happened. |
Rather than spawning and exiting threads immidiate when visiting nodes, spawn threads at the start, let the nodevisitor (or however you iterate the tree) put workitems in some collection, and let the threads take work from that collection, then - when the visitor is done and the workqueue is empty - first then exit the threads. Why pay the cost of forking and exiting hundreds of thousands of times? |
|
Closing this one as there is open review for a year and lots of conflicts by now. See #2399 for current work-in-progress. |
* make cacheLock an instance variable of GeometryCache * acquire lock in smartCacheInsert()

I have implemented a solution for #391. My solution consists of a new class, ThreadedNodeVisitor, deriving from NodeVisitor. In turn, GeometryEvaluator now derives from ThreadedNodeVisitor. ThreadedNodeVisitor is fully documented.
ThreadedNodeVisitor takes a constructor parameter named "threaded" which determines whether the visitor is actually multi-threaded. The flag is false except from CGALWorker and main() so multi-threading is only used for CGAL evaluation.
The one other "big" change I made outside ThreadedNodeVisitor and GeometryEvaluator is the implementation of CGALUtils::lockErrors/::unlockErrors. This wraps CGAL::set_error_behavior so multiple threads don't stomp on each other. I updated all the respective calls for consistency.
This feature can be enabled via the "threaded traversal" experimental feature.
Please review the code and let me know if there any issues. Otherwise, how do I go about claiming that bounty??? 8-D