Skip to content

Another attempt at multi-threaded geometry evaluation#2399

Open
justbuchanan wants to merge 46 commits intoopenscad:masterfrom
justbuchanan:multithread
Open

Another attempt at multi-threaded geometry evaluation#2399
justbuchanan wants to merge 46 commits intoopenscad:masterfrom
justbuchanan:multithread

Conversation

@justbuchanan
Copy link
Member

@justbuchanan justbuchanan commented May 22, 2018

This builds on @devilman3d's work in #1980 and re-uses most of the general changes to openscad. I rewrote the ThreadedNodeVisitor class to be a bit simpler and more efficient (and possibly resolved a couple bugs). I also merged the latest from master as the previous version was about a year out of sync.

Note that I haven't addressed the gmpq thread-safety issues pointed out in the other PR. Does anyone have a good example scad model that triggers this? I haven't run into it in the many examples I've tried, but I also haven't done anything to specifically fix the issue.

how it works

The new implementation of ThreadedNodeVisitor starts a fixed number of worker threads at the start of traversal and schedules work amongst them as it becomes available. A condition variable is used to efficiently sleep threads while work is unavailable.

As before, the geometry tree is traversed top-to-bottom recursively, running prefix traversal for each of the nodes on the way down. This happens serially, ensuring that each node's prefix traversal happens before any of it's children's.

The postfix traversals are usually (much) more expensive to run than the prefix traversals and these are what we run in parallel. The important constraint on running these in parallel is that a given node's postfix traversal can not be run until the postfix traversals for each of the child nodes is run.

As the tree is recursively traversed top-to-bottom, a tree of pending postfix traversals is built. In order to keep track of which nodes are ready to run, each node keeps a counter of how many of its child nodes have yet to complete their postfix traversals. Each time a postfix traversal is completed, the pending child count of it's parent node is decremented. When the counter gets to zero, that node is pushed onto the work queue and is executed as soon as a thread is available.

testing

I've run some tests of the new implementation on several scad models and can see a fairly significant speedup, especially for complex models. Simple models often run a little faster with single-threaded evaluation because there's not much to parallelize and the multi-threading code adds some overhead. Below are some timing comparisons run on my computer of several example models. Note the last one is a large scad model that I downloaded from https://github.com/CarlosGS/Cyclone-PCB-Factory.

These numbers are from my computer which has a fairly fast cpu (Intel Core i7-6700K - 8 cores). Please try it out on your machines and with different models and report how well it works for you. I added a script at scripts/thread-comparison-report.py you can use to run several examples.

Example Serial Parallel
examples/Old/example001.scad 1.4s 1.6s
examples/Old/example002.scad 0.2s 0.2s
examples/Old/example003.scad 0.2s 0.1s
examples/Old/example004.scad 0.4s 0.4s
examples/Old/example005.scad 2.6s 3.0s
examples/Old/example006.scad 104.6s 74.5s
examples/Old/example007.scad 0.3s 0.4s
examples/Old/example020.scad 5.7s 6.4s
examples/Old/example021.scad 5.4s 4.6s
examples/Old/example022.scad 5.3s 4.1s
examples/Old/example023.scad 4.7s 3.8s
examples/Old/example024.scad 49.9s 59.2s
examples/Advanced/GEB.scad 6.1s 6.5s
examples/Advanced/children.scad 13.6s 10.7s
github/cyclone-pcb-factory/Source_files/Cyclone.scad 436.2s 189.0s

TODO

  • Fix cancellation
  • Make progress-reporting threadsafe.
  • Re-evaluate usage of spin lock
    • Use std::mutex?
    • RAII locking?
    • if spinlock, add yield()
    • Use boost::detail::spinlock for CGAL error lock
  • Fix difference() regression (see comment in Hang when compiling with large number of objects & Some Multi-thread stuff #2400)
    • Add example to regression tests
  • Remove new warning about "no 2d or 3d child objects"
  • check if sync with master in 23dc40a dropped the mixed 2d/3d warnings
  • Ensure merge with 2a8d499 is correct
  • Investigate making F5 (CSG compilation) more multi-threaded. Separate pull request?

devilman3d and others added 17 commits March 30, 2017 15:41
…k multithreaded. Also fixed all the tests broken by my last commit.
* a bit simpler and less code
* start a pool of threads once and use them to process operations rather
than starting a new thread for each operation.
* during recursive exploration of the tree, build a tree of "pending
work items" corresponding to postfix operations for each of the nodes in
the tree.
* use condition variable to e efficiently sleep threads while waiting
for new nodes to process
@MichaelAtOz
Copy link
Member

I'll dig into my testing from last time.
I got thread issues on some, they are large random things I was working on as an 'art' thing.
Based on a generated (displayed) initial key, so I could reuse any key that had issues.
However thread clashes always depend on implication specific issues so it will be interesting if they repeat the problem. This showed as CGAL errors, similar to those seen with bad STLs.

@justbuchanan re performance, try 2D, last time 2D was worse in parallel. Also up the numbers very slightly (start small) on example024.

It would be handy (for the interim) to instrument it, as last time I had no idea what the internals were trying to do. Dumping the tree maybe?

In testing, I also found that setting CPU affinity of the process to exclude some logical CPUs allowed more thread contention, without contention testing is less likely to find issues.

Also a feature request for later, if it's not too difficult, assuming you are using the same mechanism as before to count logical CPUs to determine thread numbers, allowing a thread offset (+/-), to over/under provision threads. It would also be handy for testing, and in real use to ensure you can leave other CPU for those mundane emails, music etc...

Well that was a disjointed brain dump...

@t-paul Could you do a windows 64 bit exe?

@MichaelAtOz
Copy link
Member

@justbuchanan Looks good so far. Now if we can just parallelise that final union...

First problem, a. for reference, Crash when clicking the cancel render button.

@MichaelAtOz
Copy link
Member

Another tip for testing, particularly on lower CPU count machines, set OpenSCAD process priority below normal. Lets other stuff run.

@MichaelAtOz
Copy link
Member

For one of my more complex models,
8m06s 1 core,
5m25s 2 core.
66% of original time.

@justbuchanan
Copy link
Member Author

Thanks for trying it out and thanks for the feedback so far! Also good to see that you got a nice speedup on a complex model :)

I'll do some more testing with some 2d models and post the results. I'll also try using more or less threads and see if I can find any bugs. I'll look into the crash when cancelling an in-progress render.

This is using the same method as before to determine the number of threads to run - it should correspond to the number of virtual cores you have. I'll take a look at adding an option to over/under provision.

Also +1 for speeding up the final union - openscad definitely spends a lot of time there.

@MichaelAtOz
Copy link
Member

Re a. if multiple threads are still running they all crash, ie multiple error dialogues.

Re the determination of # threads, if it is easy can you move that to closer to OpenSCAD initialisation, I just tried deselecting 4 cores (of 8), after starting it, before F6, it only used 4 threads. I had to deselect them after F6 starts. This won't be necessary if you do the over provisioning thing tho.

On 8 cores, I ran the previous mutli-tasking version and it's timing was equal, to the second, with this version, over a 6+minute render.

@MichaelAtOz
Copy link
Member

I can't recall where this came from, maybe Nophead, but it is good at generating workload, don't go over n=5 for initial runs.

module manyballs(n)
{
    $fn=25;
    delta = 45;
    for(i=[0:1:n-1]) {
       x = i*delta;
       for(j=[0:1:n-1]) {
          y = j*delta;
          for(k=[0:1:n-1]) {
             z = k*delta;
             translate([x,y,z])sphere(25);
          }
       }
    }
}

manyballs(n=2); 

@justbuchanan
Copy link
Member Author

I just pushed some changes that should fix the cancellation issues - let me know if that doesn't work for you. Note that because cancellation is only detected during progress updates, which happen after each node's computation is finished, there's often a huge lag between clicking the button and having it actually stop :/

I don't think there's a way for core selection to 'stick' between renders because the threads are started separately each time. It would be difficult to move thread initialization closer to openscad init because they need to accept some parameters at startup to know what to process. It might be possible, but it would definitely make the code more complex.

I'll work on adding a flag for setting the number of jobs/threads (similar to make's -j flag).

@MichaelAtOz
Copy link
Member

difficult to move thread initialization closer to openscad init

Sorry I explained that badly.
I mean the call to std::thread::hardware_concurrency();

@MichaelAtOz
Copy link
Member

@MichaelAtOz I think an openscad/multithreading bug is most likely, but either could be possible. Is the crash consistent? Is there a way to get a stack trace, assuming it's actually crashing?

Consistent? Well it crashes a lot, I don't recall one of these specific loads finishing. I've had simpler stuff finish. The progress bar sometimes goes up to 800-900 quickly (seems that may be after a F5), sometimes just incrementing by 1-ish slowly (but using 8x100% CPU). Sometimes crashed quickly, one go to 972/1000, one to 999 with 2x100% - strike the above, just had one finish:

Parsing design (AST generation)...
Compiling design (CSG Tree generation)...
ECHO: version = [2019, 6, 10]
Rendering Polygon Mesh using CGAL...
Started 8 worker threads
Geometries in cache: 24505
Geometry cache size in bytes: 17838912
CGAL Polyhedrons in cache: 1
CGAL cache size in bytes: 72573464
Total rendering time: 0 hours, 19 minutes, 41 seconds
Top level object is a 3D object:
Simple: yes
Vertices: 53305
Halfedges: 162704
Edges: 81352
Halffacets: 52550
Facets: 26275
Volumes: 2
Rendering finished.

But that is rare. That one was a slow incremental progress.
I flushed the cache, F6 it went straight to 592/1000.

I'm, historically familiar with Unix, but only new to details in Linux, so have to dig for info.
I worked out to run the Appimage from a terminal to get an error message.

That second F6 above (592), incremented slowly to 600+, then:

CGAL error: assertion violation!
Expression : cet->get_index() == ce->twin()->get_index()
File       : /usr/include/CGAL/Nef_3/SNC_external_structure.h
Line       : 1173

Other runs

CGAL error: assertion violation!
Expression : cet->get_index() == ce->twin()->get_index()
File       : /usr/include/CGAL/Nef_3/SNC_external_structure.h
Line       : 1173
Explanation: 
Refer to the bug-reporting instructions at http://www.cgal.org/bug_report.html
Aborted

Then

CGAL error: assertion violation!
Expression : itl != it->second.end()
File       : /usr/include/CGAL/Nef_3/SNC_external_structure.h
Line       : 1106

To me that points to cache corruption??
The code is a 3D-ified example:

// Recursive calls of modules can generate complex geometry, especially
// fractal style objects.
// The example uses a recursive module to generate a random tree as
// described in http://natureofcode.com/book/chapter-8-fractals/

// number of levels for the recursion
levels = 12; // [1:1:14]
// length of the first segment
len = 150; // [10:10:200]
// thickness of the first segment
thickness = 10; //[1:1:20]

// the identity matrix
identity = [ [ 1, 0, 0, 0 ], 
             [ 0, 1, 0, 0 ], 
             [ 0, 0, 1, 0 ], 
             [ 0, 0, 0, 1 ] ];

// random generator, to generate always the same output for the example,
// this uses a seed for rands() and stores the array of random values in
// the random variable. To generate different output, remove the seed or
// replace the function rnd() to just call rands(s, e, 1)[0].
rcnt = 1000;
random = rands(0, 1, rcnt, 18);
function rnd(s, e, r) = random[r % rcnt] * (e - s) + s;

// generate 4x4 translation matrix
function mt(x, y) = [ [ 1, 0, 0, x ], 
                      [ 0, 1, 0, y ], 
                      [ 0, 0, 1, 0 ], 
                      [ 0, 0, 0, 1 ] ];

// generate 4x4 rotation matrix around Z axis
function mr(a) = [ [ cos(a), -sin(a), 0, 0 ], 
                   [ sin(a), cos(a), 0, 0 ], 
                   [ 0, 0, 1, 0 ], [ 0, 0, 0, 1 ] ];

module tree(length, thickness, count, m = identity, r = 1) {
    color([0, 1 - (0.8 / levels * count), 0])
        multmatrix(m)
            cube([thickness, length, thickness]);

    if (count > 0) {
        tree(rnd(0.6, 0.8, r) * length, 0.8 * thickness, 
             count - 1, 
             m * mt(0, length) * mr(rnd(20, 35, r + 1)), 
             8 * r);
        tree(rnd(0.6, 0.8, r + 1) * length, 
             0.8 * thickness, count - 1, 
             m * mt(0, length) * mr(-rnd(20, 35, r + 3)), 
             8 * r + 4);
    }
}

tree(len, thickness, levels);

echo(version=version());
// Written in 2015 by Torsten Paul <[email protected]>
//
// To the extent possible under law, the author(s) have dedicated all
// copyright and related and neighboring rights to this software to the
// public domain worldwide. This software is distributed without any
// warranty.
//
// You should have received a copy of the CC0 Public Domain
// Dedication along with this software.
// If not, see <http://creativecommons.org/publicdomain/zero/1.0/>.

It's just lots of unions of cubes.
Note the 18 seed, sometimes I deleted that to get different randomness, all the above was 18.

Nightly (single thread) finished:

Parsing design (AST generation)...
Compiling design (CSG Tree generation)...
ECHO: version = [2019, 7, 15]
Rendering Polygon Mesh using CGAL...
Geometries in cache: 24505
Geometry cache size in bytes: 17838912
CGAL Polyhedrons in cache: 1
CGAL cache size in bytes: 72589816
Total rendering time: 0 hours, 42 minutes, 41 seconds
Top level object is a 3D object:
Simple: yes
Vertices: 53319
Halfedges: 162740
Edges: 81370
Halffacets: 52558
Facets: 26279
Volumes: 2
Rendering finished.

Note the different vertex counts etc. I would expect both thread methods to be the same. (?)
Unfortunately I didn't export the multi-thread success, I'll hopefully have another...(got one at 999 on 1 cpu ATM)

@MichaelAtOz
Copy link
Member

That one finished, but with counts the same as nightly.

Parsing design (AST generation)...
Compiling design (CSG Tree generation)...
ECHO: version = [2019, 6, 10]
Rendering Polygon Mesh using CGAL...
Started 8 worker threads
Geometries in cache: 24505
Geometry cache size in bytes: 17838912
CGAL Polyhedrons in cache: 1
CGAL cache size in bytes: 72589832
Total rendering time: 0 hours, 18 minutes, 30 seconds
Top level object is a 3D object:
Simple: yes
Vertices: 53319
Halfedges: 162740
Edges: 81370
Halffacets: 52558
Facets: 26279
Volumes: 2
Rendering finished.

I exported anyway, I'll have a look later.

@MichaelAtOz
Copy link
Member

MichaelAtOz commented Jul 18, 2019

Another finished with different counts. Exported. I'll compare geometry in a couple of days...

ECHO: version = [2019, 6, 10]
Rendering Polygon Mesh using CGAL...
Started 8 worker threads
Geometries in cache: 24505
Geometry cache size in bytes: 17838912
CGAL Polyhedrons in cache: 1
CGAL cache size in bytes: 72573464
Total rendering time: 0 hours, 18 minutes, 12 seconds
Top level object is a 3D object:
Simple: yes
Vertices: 53305
Halfedges: 162704
Edges: 81352
Halffacets: 52550
Facets: 26275
Volumes: 2
Rendering finished.

STL export finished: /home/mebd/openscad/module_recursion-tree-multi2.stl
OFF export finished: /home/mebd/openscad/module_recursion-tree-multi2.off

@t-paul
Copy link
Member

t-paul commented Jul 18, 2019

Note that AppImages by design are built on older platform (Ubuntu 16.04 at this point for the one on Circle-CI) to increase chances they run on different systems. Right now only OpenCSG is specifically updated due to known bugs, but other libs, especially CGAL are older versions from the build platform.

@Efroymson
Copy link

Are there OS X builds available as well? I’d like to help test, that is the best platform for me.

@MichaelAtOz
Copy link
Member

Almost instant crash (F5 b4 F6 - just keeping track ATM, not necessarily saying they are related)

CGAL error: assertion violation!
Expression : e_below != SHalfedge_handle()
File       : /usr/include/CGAL/Nef_3/SNC_FM_decorator.h
Line       : 423

Interesting, multiple errors, presumably threads. Console garbled, not thread safe? F5 b4 F6.

mebd@mebd:~$ /home/mebd/openscad/software/OpenSCAD-2019.06.10.ai2804-_PR2399x86_64.AppImage
CGAL error: CGAL error: assertion violation!assertion
Expression : cet->get_index() == ce->twin()->get_index()
 violation!
Expression : cet->get_index() == ce->twin()->get_index()
File       : /usr/include/CGAL/Nef_3/SNC_external_structure.h
Line       : 1173
Explanation: 
Refer to the bug-reporting instructions at http://www.cgal.org/bug_report.html
File       : /usr/include/CGAL/Nef_3/SNC_external_structure.h
Line       : Aborted
mebd@mebd:~$ 

Note also that the same workload finished on Windows

Parsing design (AST generation)...
Saved backup file: C:/Users/MeB/Documents/OpenSCAD/backups/module_recursion-tree-backup-TWdtUmCM.scad
Compiling design (CSG Tree generation)...
ECHO: version = [2019, 6, 10]
Rendering Polygon Mesh using CGAL...
Saved design 'C:/Users/MeB/Documents/3D-REPRAP/Things/My Things/tests/module_recursion-tree.scad'.
Started 6 worker threads
STL export finished: C:/Users/MeB/Documents/3D-REPRAP/Things/My Things/tests/module_recursion-tree-multi-win.stl
OFF export finished: C:/Users/MeB/Documents/3D-REPRAP/Things/My Things/tests/module_recursion-tree-multi-win.off

Again console garbled/missing. So no counts, I'll compare the exports.

I'll move on to other workload generators with more than just unions. Fewer objects (v's tree branches) but more complexity.

I'm wondering how to instrument this to see what the treads are doing? Do any --debug parameters come to mind?

Also for future performance comparisons, it would be handy to:
a. Log elapsed time, when entering the final global union,
b. When a flush cache is done, so you can tell from the console history which was cached or not.

@MichaelAtOz
Copy link
Member

Re the Windows one above, I had a light bulb moment, just do F6 again from cache, got the counts.

STL export finished: C:/Users/MeB/Documents/3D-REPRAP/Things/My Things/tests/module_recursion-tree-multi-win.stl
OFF export finished: C:/Users/MeB/Documents/3D-REPRAP/Things/My Things/tests/module_recursion-tree-multi-win.off
Parsing design (AST generation)...
Compiling design (CSG Tree generation)...
ECHO: version = [2019, 6, 10]
Rendering Polygon Mesh using CGAL...
Geometries in cache: 24505
Geometry cache size in bytes: 17838912
CGAL Polyhedrons in cache: 8190
CGAL cache size in bytes: 1687685352
Total rendering time: 0 hours, 0 minutes, 2 seconds
Top level object is a 3D object:
Simple: yes
Vertices: 53305
Halfedges: 162704
Edges: 81352
Halffacets: 52550
Facets: 26275
Volumes: 2
Rendering finished.

Which matches one of the multi-debians above, but not nightly-debian...

@MichaelAtOz
Copy link
Member

This issue is getting quite long, I should have posted the above in #2405. I'll continue there.

@t-paul
Copy link
Member

t-paul commented Oct 23, 2019

Updated from master. The AppImage is now using CGAL 4.11, that might fix some of the crash issues seen on Linux.

@Lenbok
Copy link
Contributor

Lenbok commented Oct 24, 2019

@t-paul Do you have a link to the linux AppImage that just got built? I can see the windows builds have a nice link in the summary section to the artifact, but can't see that for the linux build.

(edit: it was an earlier windows build that I was looking at that had the links in the summary, not sure how to get the current build artifacts)

@t-paul
Copy link
Member

t-paul commented Oct 24, 2019

@Efroymson
Copy link

Efroymson commented Oct 24, 2019 via email

@t-paul
Copy link
Member

t-paul commented Oct 24, 2019

No, the MacOS plan is restricted to 500 minutes per month, so we can't build everything as with Linux and Windows. Right now the MacOS builds are limited to build once on 5 days a week and for master only.

@Lenbok
Copy link
Contributor

Lenbok commented Oct 26, 2019

@t-paul How long do the build artifacts hang around? I just tried downloading again from another computer and get 404 on
https://app.circleci.com/jobs/github/openscad/openscad/3835/artifacts

@t-paul
Copy link
Member

t-paul commented Oct 26, 2019

I don't know, the only info I can find is that they don't recommend it as long term storage. I can still download this file and also others that are > 10 month old.

@MichaelAtOz
Copy link
Member

try this

@MichaelAtOz
Copy link
Member

(edit: it was an earlier windows build that I was looking at that had the links in the summary, not sure how to get the current build artifacts)

See the ci/circleci: openscad-* below, click the Details link, then manually add #artifacts to the end of the URL.

@thehans
Copy link
Member

thehans commented Nov 10, 2019

I think the solution to making boolean operations more parallelizable (particularly the top-level union in many cases) is to break down the WorkItems into individual binary operations, only handling a pair of geometries at a time(per thread).

Inside cgalutils-applyops.cc, they are already broken down into binary operations on pairs of geometries:



etc. But the threaded implementation is lumping them as single thread to handle all children.

union and intersection are commutative so operation on each pair of geometries can be performed in any order and in their own thread.

For union I recently made a change with PR #3121 to perform union on the simplest geometries first, keeping them sorted in a std::priority_queue. This replaces what was previously calling Nef_nary_union_3 (which also internally only really joins two geometries at a time). Also, intermediate results go back into the sorted queue, so the threaded code would need some way to manage this queue in a thread-safe way.

intersection could also be handled in a similar manner as union now does.

difference of multiple objects could be rewritten as a single difference and a union: A-B-C-D-E == A-(B+C+D+E), where the unions could be done in parallel, then just one difference to finish it off.

I only just started looking into the threaded code, so I'm not sure yet the best way to re-write WorkItem etc to function with this sort of scheme, but I think the overall idea has potential to give the really nice speedups that one would expect from scaling # of threads.

@justbuchanan What do you think, does that all make sense to you? Are you interested in taking a stab at this idea, or would you rather I give it a try?

@Efroymson
Copy link

Has there been any behind the scenes update on this? I am working on some monster files that could surely benefit from faster renders. If there is any way I can help, let me know.

@t-paul
Copy link
Member

t-paul commented May 22, 2020

Not on this PR, there's a different approach in #3193 but that seems to have stalled also.

@Efroymson
Copy link

Thanks, I asked over there as well. I am looking at jumping to Blender just to get faster renders, but it would be a lot of work to port my generating code .. if I could even figure out how.

@brezina1
Copy link

Where can I download the installer for win64? (Can't download it from artifacts on circleci anymore)

@github-actions
Copy link

github-actions bot commented Oct 9, 2025

OpenSCAD bot: this PR is stale because it has been open for 60 days with no activity. It will be closed if no further activity occurs within 60 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.