Skip to content

Multithreaded CGAL geometry evaluation#1980

Closed
devilman3d wants to merge 6 commits intoopenscad:masterfrom
devilman3d:master
Closed

Multithreaded CGAL geometry evaluation#1980
devilman3d wants to merge 6 commits intoopenscad:masterfrom
devilman3d:master

Conversation

@devilman3d
Copy link

I have implemented a solution for #391. My solution consists of a new class, ThreadedNodeVisitor, deriving from NodeVisitor. In turn, GeometryEvaluator now derives from ThreadedNodeVisitor. ThreadedNodeVisitor is fully documented.

ThreadedNodeVisitor takes a constructor parameter named "threaded" which determines whether the visitor is actually multi-threaded. The flag is false except from CGALWorker and main() so multi-threading is only used for CGAL evaluation.

The one other "big" change I made outside ThreadedNodeVisitor and GeometryEvaluator is the implementation of CGALUtils::lockErrors/::unlockErrors. This wraps CGAL::set_error_behavior so multiple threads don't stomp on each other. I updated all the respective calls for consistency.

This feature can be enabled via the "threaded traversal" experimental feature.

Please review the code and let me know if there any issues. Otherwise, how do I go about claiming that bounty??? 8-D

@kintel
Copy link
Member

kintel commented Mar 31, 2017

Thanks a lot for taking a shot at this! It will take a while to properly review this, but I have a few initial questions/comments:

  • Question: How did you get around the issue with CGAL::Gmpq not being thread-safe?
  • It would be cool with a few test-cases (they don't have to be automatic) which demonstrates how this works.
  • ..and please make sure the tests succeed - your current build fails on all platform (I didn't look at the details)

…k multithreaded. Also fixed all the tests broken by my last commit.
@devilman3d
Copy link
Author

I fixed all the tests I broke in that first commit and also solved the Gmpq threading problem. I'm apparently using newer GCC and CGAL than the build machines. Working on a backward-compatible solution now...

@MichaelAtOz
Copy link
Member

Is a spin lock the best approach?

@devilman3d
Copy link
Author

devilman3d commented Apr 5, 2017

Is a spin lock the best approach?

I chose them because they have significant performance compared to the posix mutex (which I originally used until I found out about std::atomic_flag). They are only held for short durations to perform "atomic" operations on shared data. It turns out to be pretty efficient. In a single-core environment the spin lock is only a couple of extra CPU instructions. In a multi-core environment, those couple of extra instructions may be repeated a few times as the lock spins, but you have the benefit of multiple cores.

I tried to utilize other std::atomic operations, but every synchronization point needed a lock to guard access to data.

Also, note the use of a semaphore to suspend the parent thread at the OS level while child threads do work.

@devilman3d
Copy link
Author

Question: How did you get around the issue with CGAL::Gmpq not being thread-safe?

I originally didn't. Figured out that was part of the problem with the tests failing, so I did some research. I found out about CGAL's Handle_for class from which pretty much everything in CGAL derives. Luckily that class is completely header-based and it can be replaced via C/C++ header trickery. I re-implemented the class using shared_ptr's because their reference counting is already thread-safe. There was additional detail work to get Handle_for fully thread-safe, but it is now.

It would be cool with a few test-cases (they don't have to be automatic) which demonstrates how this works.

Hmm. Not much to it:

  1. Turn on the experimental feature or pass --enable=thread-traversal on the command line
  2. Load your favorite model with lots of transforms and CSG ops
  3. Press F6 or watch the command prompt
  4. Sit back and reap the rewards of multiple cores B-)

Perhaps an explanation is in order. It's magic!!! Well, automagic. As the tree is walked from leaf-to-trunk, a thread is spawned to render each node's geometry. When any thread completes, another is spawned for the next unprocessed node. The maximum number of threads is equal to the number of cores detected in the machine. A node with children will not be started until all of its children have completed. This leads to an interesting phenomenon in which the quickly processed children add up to increasingly complex parents ending at the single-threaded root node and its dreaded implicit union. (As an aside, Issue #350 takes the speedup gained here to a whole new level.)

..and please make sure the tests succeed - your current build fails on all platform (I didn't look at the details)

Oops, my bad. I am now quite familiar with the tests...

@MichaelAtOz
Copy link
Member

If someone [ @t-paul ;) ] could do a Windows build I'll be happy to test.

src/openscad.cc Outdated
}
#endif // OPENSCAD_QTGUI

#include <execinfo.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's that include used for? This breaks Windows (MXE based) compilation. Can we make it conditional maybe?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that wasn't supposed to get committed as I can actually use a debugger now. I'll have an update shortly.

@t-paul
Copy link
Member

t-paul commented Apr 5, 2017

Windows builds in the usual place as OpenSCAD-2017.04.05-x86-{32,64}_multithread*

@MichaelAtOz
Copy link
Member

It runs.
First attempt was underwhelming, then I turned the feature on ;)
A 2 Core system, I need to get my 4-core (8 thread system up - soon)
multithread 2 cores

Process CPU/Memory
multi thread cpu mem
The flat tail is the final union. Earlier I saw it up to 99% briefly.
I observed the Threads live, but didn't take a picture.
Looks good :)
I did note GUI unresponsiveness prior to render, it could be the monitoring tools I had. I'll reboot & do more tests.

Does render() go multithreaded?

@MichaelAtOz
Copy link
Member

The non-multi console was 2015.03-2 BTW.

@kintel
Copy link
Member

kintel commented Apr 6, 2017

Load your favorite model with lots of transforms and CSG ops

I was mostly hoping for a good, handcrafted, example which we could use for sanity-testing this. That would also help comparing results.

@MichaelAtOz
Copy link
Member

@t-paul was it built with debug or something?
I feels pretty laggy in general, preview & repainting with complex geometry, overall renders are faster tho.

@t-paul
Copy link
Member

t-paul commented Apr 6, 2017

Same build procedure as always on our build server, I think it runs the --snapshot build which would enable debug, but I have to look into the script to be sure.
I did update the MXE environment not long ago, so it should have things like Qt updated.

@nophead
Copy link
Member

nophead commented Apr 6, 2017 via email

@MichaelAtOz
Copy link
Member

The progress update is now more frequent and responsive to cancel. :)

Does render() go multithreaded?

It appears not. Future feature...

@MichaelAtOz
Copy link
Member

Well it screams on my 4 core 8 thread system...

@MichaelAtOz
Copy link
Member

Same scad model as above (5m 23s on 2 core 2.53GHz) takes 2m 20s on 4 core hyperthread i7-3770 (3.5/3.9 turbo GHz)

cpu mem annotated

@kintel
Copy link
Member

kintel commented Apr 8, 2017

A tip for testing: add to tests/CMakeLists.txt
set(EXPERIMENTAL_OPTION ${EXPERIMENTAL_OPTION} "--enable=thread-traversal")

..just before the call to add_test() at https://github.com/devilman3d/openscad/blob/ae3d7bae8b43d19468c1727d54abfb2a3213289c/tests/CMakeLists.txt#L1057

This will turn on the threaded traversal for all tests

Copy link
Member

@kintel kintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm chipping away at this. looking at the small things first :)

QMAKE_CXXFLAGS *= -fno-strict-aliasing
QMAKE_CXXFLAGS_WARN_ON += -Wno-unused-local-typedefs # ignored before 4.8
# debug: turn off optimization in debug builds
debug: QMAKE_LFLAGS += -rdynamic -O0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you actually need to add -O0 ? This should already be handled by qmake.

#define SPIN_LOCK(flag) do { } while(flag.test_and_set())

// increments lockedErrorsCount and calls CGAL::set_error_behavior if this is the first [unlocked] call
void lockErrors(CGAL::Failure_behaviour behavior)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're already doing the work to clean up the cgal error handling, could we take it one step further and encapsulate this into a RAII-style lock, and use the appropriate C++11 constructs for that + boost's spinlock? e.g.

	class ErrorLocker {
	public:
	  ErrorLocker() {
	    std::lock_guard<boost::detail::spinlock> guard(ErrorLocker::lock);
	    oldBehavior = CGAL::set_error_behaviour(CGAL::THROW_EXCEPTION);
	  }
	  ~ErrorLocker() {
	    std::lock_guard<boost::detail::spinlock> guard(ErrorLocker::lock);
	    CGAL::set_error_behaviour(oldBehavior);
	  }
	private:
	  CGAL::Failure_behaviour oldBehavior;
	  static boost::detail::spinlock lock;
	};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to suggest SPINLOCK should call yield(), however boost's spinlock does it already,

void lock()
    {
        for( unsigned k = 0; !try_lock(); ++k )
        {
            boost::detail::yield( k );
        }
    }

@donbright
Copy link
Member

Question: How did you get around the issue with CGAL::Gmpq not being thread-safe?

I originally didn't. Figured out that was part of the problem with the tests failing, so I did some research. I found out about CGAL's Handle_for class from which pretty much everything in CGAL derives. Luckily that class is completely header-based and it can be replaced via C/C++ header trickery. I re-implemented the class using shared_ptr's because their reference counting is already thread-safe. There was additional detail work to get Handle_for fully thread-safe, but it is now.

this is brilliant.... i have a very small question ... over the years we have done so much of this modifying of CGAL , at what point has OpenSCAD basically forked CGAL, and is there much long term risk involved where OpenSCAD-CGAL becomes incompatible with CGAL-CGAL? Thanks

@kintel
Copy link
Member

kintel commented Apr 12, 2017

There is always the risk that, if CGAL breaks backwards compatibility, e.g. due to a major version change, we'll get work to do. We also have the possibility of moving away from CGAL, so whatever we do, I strive to keep our CGAL usage behind some internal APIs. There is still a bunch of work to do in that domain though..

@kintel
Copy link
Member

kintel commented Apr 14, 2017

@MichaelAtOz Do you have a good torture test for this?

@MichaelAtOz
Copy link
Member

old/example024 with n=4 gets ~9000 threads, but they process quickly, then get into the union, you see only 1 thread after that, cancel sometimes works, but I close the window & it exits cleanly, then rerun.
I find real world complex objects make for some heavy work per thread. If you promise not to criticize the ugly code, I can send you the thing I used in the above graph posts, it had 800+ threads and eventually finished.
I set CPU affinity to 1 CPU sometimes (on my 2 core box), that lets you see the threads longer.
I haven't done much on my 8 thread box since the last graph above, setting affinity to 2 CPUs (4 threads) will probably be needed to observe.
I just found a real torture test, ~50,000 threads, finishes in <3m, (it's 2D), it was never intended to Render, but I just tried it. I'll email it.

@MichaelAtOz
Copy link
Member

BTW, on the 8 thread box I set CPU affinity to drop a few cores from the OpenSCAD process to create thread contention, to simulate other conflicting workloads, and allow any potential concurrency issues to show their faces.

@MichaelAtOz
Copy link
Member

MichaelAtOz commented Apr 16, 2017

I think observation b. may have been me misinterpreting the monitoring. A terminated thread hangs around in the display for a short period. (still showing the CPU consumption at the time of termination)

@MichaelAtOz
Copy link
Member

Note the CGAL error ( and another - to be posted tomorrow) are repeatable.
I then disabled the multi-thread feature, and rendered the same, without error. (4+hrs v's 45m)
Once. I will repeat tomorrow, but at this stage, it looks bodgy...

@MichaelAtOz
Copy link
Member

@devilman3d
Copy link
Author

devilman3d commented Apr 17, 2017

I have been looking into the CGAL errors... There are a number of variables involved. There are a couple of things not quite correct with that implementation of CGAL::Handle_for, but those are not the source of the real problem.

The real problem seems to stem from the fact CGAL's usage of Gmpq arithmetic operators is not "atomic". There are race conditions in CGAL where one thread can do math which modifies the underlying numerator/denominator of a Gmpq object while another thread is using those unsynchronized values.

There are also problems with lost/early-released Handle_for's and other references in CGAL. Gah! Those headers are complex! And, the differences between versions only complicate the complexity.

I'm working on solutions to identify the memory issues and really make CGAL's usage of Gmpq threadsafe.

All in all, I'm stoked to see it is [mostly] working for ya'll! Please keep testing!

@kintel
Copy link
Member

kintel commented Apr 18, 2017

@devilman3d The Gmpq issue you mention is probably the same I mentioned close to the top. Gmpq in CGAL are non-thread-safe reference-counted objects. I'll try to think about a torture test which causes multiple Gmpq objects to be referenced from different threads.
I think I read somewhere that CGAL 4.10 fixes some of this.

@MichaelAtOz
Copy link
Member

The progress indicator does not update with 2D render.
2d no progress indicater

Could you add a phase 3 console message '(Salmon) Spawning complete'.

@devilman3d I have yet to fully Grock your thread tree. Can I presume it is n-ary where each n is the children (or 1 grouped children subtree) appropriate to the .scad operator, eg a for(<10 loops>) is a node=union with n=10 branches? I'm trying to follow observed thread behaviour.

@MichaelAtOz
Copy link
Member

Strike that. Let me put my brain back on.
I presume it basically reflects the CSG tree.

@MichaelAtOz
Copy link
Member

the quickly processed children add up to increasingly complex parents ending at the single-threaded root node and its dreaded implicit union.

Isn't union commutative, such that any subsets can be unioned and then unioned to produce the same result?
Thus the 'dreaded' implicit union, need not be, as it could be turned into a binary tree* and threaded?

  • or more n-ary, I will discuss some findings later.

@kintel
Copy link
Member

kintel commented Apr 20, 2017

@MichaelAtOz Yes, but my hunch is that the time performing the final union scales with the total number of triangles, so we might not win that much with such an approach.
The exception would be unions which reduce the triangle count - if we get those run early, there may be a nice win. There was a discussion on this a few years back - I'll see if I can find it.

@kintel
Copy link
Member

kintel commented Apr 20, 2017

The above mentioned discussion: #1234

@MichaelAtOz
Copy link
Member

This is what I was wanting to post on the forum

Other times ~6-7% max, not sure what's happening then, possibly both halves getting cache misses. (that was 3D)
...
However threads (8) were getting ~5% each for most of the peak shown here, ~50% total CPU.
...
I'm about to do performance comparisons.

I'm still working thru it, but I'm now leaning toward these lower CPU% threads are micro-workloads, for want of a term, they finish before consuming much CPU. I'm still looking at the performance info, but close to 100% of threads record zero CPU (I suspect below some measurable threshold), only a handful record more (eg ~300/~50,000 2D, ~1,700/~50,000 3D). [small sample size, need more varied CSG workload, still working on it]

ATM 2D multi-threading is worse than single-threaded, some times up to 200% [elapsed]. Probably due to the above. Whether a more chunky thread/workload allocation would be better needs consideration, where thread overheads would be less. This probably applies to 3D too.

@MichaelAtOz
Copy link
Member

so we might not win that much with such an approach.

On a fewer thread machine yes. But the graphs show 7/8th of the CPU idle for the dreaded union.
cpu tail

My gut says suck it and see...

@MichaelAtOz
Copy link
Member

@amarjeetkapoor1
Copy link
Member

I just tested this on my system with 2 cores and 4 thread. and using the example modulerecursion.scad with level=15

and I got following:

for threading enabled:

Total rendering time: 0 hours, 3 minutes, 42 seconds

and for threading disabled:
Total rendering time: 0 hours, 1 minutes, 22 seconds

@nophead
Copy link
Member

nophead commented Apr 23, 2017

Rather than spawning hundreds of thousands of threads perhaps it would be more efficient to spawn as many worker threads as the your CPU supports and have them chew through the work to be done.

@MichaelAtOz
Copy link
Member

more efficient to spawn as many worker threads as the your CPU supports

That's what it does. 2 CGAL threads on my Duo, 8 on my quad hyperthreaded box.
I suggested that ThreadOffset option, where you can use fewer threads, or over commit - as from the graphs above (and many other observation I have made) you can see that there is underutilised CPU, presumably when threads are starting/exiting.

using the example modulerecursion.scad

I hadn't seen that example before, very pretty.
However it is 2D, so those results are expected.
Change it to a cube()↓ and multi threads are better.

A discussion on approaches to 2D is needed.
Perhaps don't use multi threads for 2D in the first instance.
Multi threads on 3D can also be optimised, as a phase 2, then consider how 2D may be optimised as well.
[I have a gut feel, having watched many random renders, that an approach were each thread does more instances before exiting, will save much thread overhead. I don't know what info may be available to cluster workload, some measure of complexity would be good]

↓ 2/3D-afied version, use fewer levels for 3D (try 7 first)

// Recursive calls of modules can generate complex geometry, especially
// fractal style objects.
// The example uses a recursive module to generate a random tree as
// described in http://natureofcode.com/book/chapter-8-fractals/

levels = 10; // number of levels for the recursion
D=3;         // 2D or 3D
len = 100; // length of the first segment
thickness = 5; // thickness of the first segment

// the identity matrix
identity = [ [ 1, 0, 0, 0 ], [ 0, 1, 0, 0 ], [ 0, 0, 1, 0 ], [ 0, 0, 0, 1 ] ];

// random generator, to generate always the same output for the example,
// this uses a seed for rands() and stores the array of random values in
// the random variable. To generate different output, remove the seed or
// replace the function rnd() to just call rands(s, e, 1)[0].
rcnt = 1000;

// generate a randomised start key.
key=ceil(rands(999,9999,1)[0]); 

// Use a fixed key if it looks good & for repeatability.
//  comment out the next line for randomised key. 
key=8991; 

random = rands(0, 1, rcnt,key);
function rnd(s, e, r) = random[r % rcnt] * (e - s) + s;

// generate 4x4 translation matrix
function mt(x, y) = [ [ 1, 0, 0, x ], [ 0, 1, 0, y ], [ 0, 0, 1, 0 ], [ 0, 0, 0, 1 ] ];

// generate 4x4 rotation matrix around Z axis
function mr(a) = [ [ cos(a), -sin(a), 0, 0 ], [ sin(a), cos(a), 0, 0 ], [ 0, 0, 1, 0 ], [ 0, 0, 0, 1 ] ];

module tree(length, thickness, count, m = identity, r = 1) {
    color([0, 1 - (0.8 / levels * count), 0])
        multmatrix(m)
          if (D==2)
            square([thickness, length]);
          else
            cube([thickness,length,1]);

    if (count > 0) {
        tree(rnd(0.6, 0.8, r) * length, 0.8 * thickness, count - 1, m * mt(0, length) * mr(rnd(20, 35, r + 1)), 8 * r);
        tree(rnd(0.6, 0.8, r + 1) * length, 0.8 * thickness, count - 1, m * mt(0, length) * mr(-rnd(20, 35, r + 3)), 8 * r + 4);
    }
}

tree(len, thickness, levels);

translate([10,1,0])
if (D==3)
    linear_extrude(height=1,convexity=20)
        text(str("3D key=",key));
else
        text(str("2D key=",key));

echo(version=version());
// Written in 2015 by Torsten Paul <[email protected]>
//
// To the extent possible under law, the author(s) have dedicated all
// copyright and related and neighboring rights to this software to the
// public domain worldwide. This software is distributed without any
// warranty.
//
// You should have received a copy of the CC0 Public Domain
// Dedication along with this software.
// If not, see <http://creativecommons.org/publicdomain/zero/1.0/>.

@MichaelAtOz
Copy link
Member

Has anyone applied brain power to whether, with multi threads, a non-normalised tree has any advantages?

@kintel
Copy link
Member

kintel commented Apr 24, 2017

@MichaelAtOz Normalization is only performed for previews. Normalization is a prerequisite for OpenCSG to be able to correctly preview a scene. Intuitively it doesn't sound like normalization is a good path for optimization as it tends to produce more nodes, but it could be worth to think about down the road.

@MichaelAtOz
Copy link
Member

There is, I think, some problem with the console logging.
I had the code above rendering in the background, when I clicked back a very long time later, this is what I saw:
threads console
It first, glancing at the console, I thought it must still be processing, then glanced at CPU usage, nothing.
Then saw the bottom right.
It looks like it completed, so I can only assume the console output got stomped on. Possibly a thread conflict?

@kintel
Copy link
Member

kintel commented Apr 24, 2017

The progress reporting may not be threadsafe (See #1985), and since it does mess with the console output, it's likely that somethings happened.
FYI: What would likely happen in this case is that all output will be dumped to stdout instead. Windows tends to just swallow that, but on other platforms or when starting OpenSCAD from the cmd-line, you should see the console output there.

@MichaelAtOz MichaelAtOz mentioned this pull request Apr 24, 2017
3 tasks
@solenskiner
Copy link

more efficient to spawn as many worker threads as the your CPU supports

That's what it does. 2 CGAL threads on my Duo, 8 on my quad hyperthreaded box.
I suggested that ThreadOffset option, where you can use fewer threads, or over commit - as from the graphs above (and many other observation I have made) you can see that there is underutilised CPU, presumably when threads are starting/exiting.

Rather than spawning and exiting threads immidiate when visiting nodes, spawn threads at the start, let the nodevisitor (or however you iterate the tree) put workitems in some collection, and let the threads take work from that collection, then - when the visitor is done and the workqueue is empty - first then exit the threads. Why pay the cost of forking and exiting hundreds of thousands of times?

@t-paul
Copy link
Member

t-paul commented May 23, 2018

Closing this one as there is open review for a year and lots of conflicts by now.

See #2399 for current work-in-progress.

@t-paul t-paul closed this May 23, 2018
MichaelAtOz referenced this pull request May 29, 2018
* make cacheLock an instance variable of GeometryCache
* acquire lock in smartCacheInsert()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants