Parallelize render using multiple processes#3193

Closed
slaviber wants to merge 23 commits into openscad:master from slaviber:multithread

Conversation

@slaviber

Hello,
I tried to implement multiprocess support for OpenSCAD according to #391.
(With all the previous failed attempts, maybe we should start working in parallel to succeed...)

This is my approach:

First of all, it is obvious to me that if there is room for any huge optimization or parallelism, it is in the CGAL library itself, by computing the operations in small independent chunks. But unfortunately, CGAL is not even thread-safe.
There is not much room for optimization in OpenSCAD itself. That is because a real CAD model has a closely linked structure: each geometric element depends on one, or rarely a few, previous ones. So there isn't much that can be computed in parallel at the 'operation' level.
Of course, there are exceptions, and I'm making use of them to achieve a nice (solid) decrease in rendering time in some corner cases, at almost negligible performance cost the rest of the time.

I first ran some performance tests and found that many constructions parsed by the OpenSCAD language are reduced to Unions before other operations are applied to them - for example all for loops, all the geometry at the source file's root, and so on (and maybe even more in the future).
I also found that some optimizations can be done before even making calls to CGAL (e.g. combining the PolySets in Unions into multi-volume objects, thus reducing the Boolean OR overhead).
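The PolySet-merging idea can be sketched as follows. This is an illustrative Python sketch, not the PR's C++ code, and the names `PolySet` and `merge_polysets` are hypothetical:

```python
# Illustrative sketch (not OpenSCAD's actual API): a union of PolySets that
# are known not to intersect can be built by concatenating their vertex and
# face lists into one multi-volume polyset, skipping the CGAL boolean OR.

from dataclasses import dataclass, field

@dataclass
class PolySet:
    vertices: list = field(default_factory=list)  # (x, y, z) tuples
    faces: list = field(default_factory=list)     # tuples of vertex indices

def merge_polysets(a: PolySet, b: PolySet) -> PolySet:
    """'Union' of two non-intersecting polysets by plain concatenation."""
    merged = PolySet(vertices=list(a.vertices), faces=list(a.faces))
    offset = len(a.vertices)                      # re-index b's faces
    merged.vertices.extend(b.vertices)
    merged.faces.extend(tuple(i + offset for i in f) for f in b.faces)
    return merged

tri1 = PolySet(vertices=[(0, 0, 0), (1, 0, 0), (1, 1, 0)], faces=[(0, 1, 2)])
tri2 = PolySet(vertices=[(5, 0, 0), (6, 0, 0), (6, 1, 0)], faces=[(0, 1, 2)])
both = merge_polysets(tri1, tri2)
assert both.faces == [(0, 1, 2), (3, 4, 5)]  # second volume re-indexed
```

The merge is a pure memory copy, which is why it reduces the Boolean OR overhead for disjoint operands.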

The actual implementation:

So my implementation resides entirely in the cgalutils-applyops sources, mainly because that's the only place where the 3D objects can be parallelized to some extent.

First I take all order-independent operations of a given branch (Unions and Intersections).
Then I merge the objects wherever possible, which compensates for the added overhead of the subsequent operations.
Then, if there are enough 3D objects left, I spawn child processes, which compute the operations in parallel wherever possible; their results are then combined, more children are spawned, and so on until the number of objects left is insufficient for spawning more children.
Finally, I combine the remaining 3D objects serially and return the root.
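The steps above can be sketched in Python (the PR itself is C++; here `combine`, a plain set union, stands in for one CGAL boolean operation on two solids):

```python
# A sketch of the reduction described above, using Python's multiprocessing in
# place of the PR's C++ child processes. Illustrative only.

from multiprocessing import Pool

def combine(pair):
    a, b = pair
    return a | b  # stand-in for a CGAL union/intersection of two solids

def parallel_reduce(objs, min_parallel=4):
    objs = list(objs)
    # Rounds of pairwise combination, one child process per pair...
    while len(objs) >= min_parallel:
        pairs = [(objs[i], objs[i + 1]) for i in range(0, len(objs) - 1, 2)]
        leftover = [objs[-1]] if len(objs) % 2 else []
        with Pool(processes=len(pairs)) as pool:
            objs = pool.map(combine, pairs) + leftover
    # ...until too few objects remain; the rest are combined serially.
    result = objs[0]
    for o in objs[1:]:
        result = combine((result, o))
    return result

if __name__ == "__main__":
    parts = [{i} for i in range(30)]  # thirty "solids"
    assert parallel_reduce(parts) == set(range(30))
```

Because the operations are order-independent, each round can run its pairs in parallel without changing the result.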

That's one of those corner cases:

box.scad
That box has one patterning operation (a union of 30 holes) and one Intersection of 6 spheres.

  • First the Intersection gets evaluated: 3 worker threads in parallel, then serial combination.
  • Then some non-optimizable Differences are evaluated
  • Then comes the pattern - depending on the complexity of the patterning object, there may be one big merged union or, as in this case, again: 15 worker threads, then 7 worker threads, then 4, and so on until fewer than 4 objects are left, which are then combined serially.
  • ...
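The worker counts above (15, then 7, then 4, ...) fall out of pairing the objects off each round. A tiny illustrative helper, not the PR's actual scheduling code:

```python
# Lists how many workers each round gets, starting from n objects: each round
# pairs the objects off (one worker per pair), carrying any odd leftover into
# the next round, until fewer than `min_parallel` objects remain.

def worker_schedule(n, min_parallel=4):
    rounds = []
    while n >= min_parallel:
        workers = n // 2            # one worker per pair of objects
        rounds.append(workers)
        n = workers + (n % 2)       # pair results plus any odd one out
    return rounds

# 30 holes: 15 workers, then 7 (one object carried over), then 4, then 2;
# the last two results are combined serially.
assert worker_schedule(30) == [15, 7, 4, 2]
```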

When I ran the test on my PC, the results were 15'47'' for the master branch and 8'15'' for my PR. That's almost a 2x increase in rendering speed (at the expense of all that RAM for all those parallel workers).

The actual results vary: from roughly equal rendering time for non-branched, low-polygon-count objects, to slightly slower for moderately branched, low-polygon-count objects, to a massive reduction in time for branched, high-polygon-count models.

@kintel
Member

kintel commented Jan 19, 2020

Thanks for the PR!
We don't yet use clang-format on this codebase, so this would be easier to review if you didn't pull in the reformatting commit.

@slaviber
Author

Thanks for the response!
I'm sorry, I didn't know about that. It's never too late to make another PR with just my changes copy-pasted over, without any (re)formatting.
By the way, almost all automated builds are failing because Boost.Process (used for spawning the child processes) requires Boost 1.64 or later. That release is almost 3 years old now, but it seems OpenSCAD is compiled against an older version.

@t-paul
Member

t-paul commented Jan 20, 2020

We can't easily use Boost 1.64. We can update the version listed in the README, which currently says 1.35, but the baseline is probably 1.58 (as of Ubuntu 16.04 LTS).

@slaviber
Author

I understand. If Ubuntu 16.04 LTS has to be supported, then I'd need to use either QProcess (which seems hard) or another third-party multiplatform library that supports multiprocessing.

@t-paul
Member

t-paul commented Jan 20, 2020

Ubuntu 16.04 LTS is mostly required because it's currently still the base for AppImage and Snap. It's possible to update dependencies for those (and we already need to do that), but this really needs to be justified, as the effort spent on upgrading the build environments is not exactly small.

A separate process seems to have some advantages over the threading solution in #2399. Do you think the benefits are bigger than the drawbacks (like the need to serialize the data structures to the other processes)?

@slaviber
Author

slaviber commented Jan 21, 2020

Too much of the code is not thread-safe; I think separate processes are the only possible solution for now.
The serialization overhead is negligible compared to the CGAL operations. Even if a multi-threaded solution were possible, the serialization is almost a pure RAM-to-RAM copy anyway, which is fast.
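For reference, the parent/child handoff through shared memory can be sketched with Python's stdlib equivalent of the POSIX shm API (the PR links against shm_unlink directly). Illustrative only; the explicit unlink at the end is the important part, since without it the segment outlives the program:

```python
# Round trip through a named shared-memory segment: the writer copies the
# serialized bytes in, a reader (a child process would attach by name) copies
# them out, and the segment is closed and unlinked so it does not persist
# after exit. `payload` stands in for a serialized geometry blob.

from multiprocessing import shared_memory

payload = b"serialized mesh bytes"
shm = shared_memory.SharedMemory(create=True, size=len(payload))
try:
    shm.buf[:len(payload)] = payload          # writer side: RAM-to-RAM copy
    received = bytes(shm.buf[:len(payload)])  # reader side
finally:
    shm.close()
    shm.unlink()  # destroy the segment; otherwise it leaks past program exit

assert received == payload
```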

Below are some perf log images of my PR - stdlib, gmplib and CGAL are the biggest factors:

@kintel
Member

kintel commented Jan 21, 2020

@slaviber Multiple processes would indeed make me sleep better at night :)

A few hints to make PRs like this easier to manage:

  • If you perform any refactoring as part of your PR, consider opening separate PRs for green refactoring that is generally useful or harmless, as those can be reviewed and merged quickly, making your final PR less daunting
  • If possible, mark your feature Experimental (see e.g.
    static const Feature ExperimentalInputDriverDBus;
    ), to make it possible to turn on/off. This could allow us to merge before it's 100% ready and better allow others to test it.

@slaviber
Author

I understand.
The experimental Feature looks pretty neat.

@mhier

mhier commented Jan 25, 2020

I can't compile this on Ubuntu 19.10 (gcc 9.2.1), I get the following linker error:

[100%] Linking CXX executable openscad
/usr/bin/ld: CMakeFiles/OpenSCAD.dir/src/cgalutils-applyops.cc.o: undefined reference to symbol 'shm_unlink@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/librt.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/OpenSCAD.dir/build.make:3193: openscad] Error 1
make[1]: *** [CMakeFiles/Makefile2:73: CMakeFiles/OpenSCAD.dir/all] Error 2
make: *** [Makefile:95: all] Error 2

If I add rt to the COMMON_LIBRARIES, I get a lot of undefined references to boost::archive symbols. Adding "serialization" to the list of required Boost modules helped.

Note: I think you need to search for librt properly and only add it if it exists, since there are platforms which do not have it.

@slaviber
Author

@mhier Yes, the build scripts aren't fixed properly.

So I made some huge changes to my code.
Fixed bugs like:

  • shared memory not being properly destroyed, remaining in RAM after program exit
  • the qmake (mostly) and cmake builds not actually compiling and linking
  • failing tests

Added / Reverted functionality:

  • removed meaningless (comment) blocks
  • optimized the memory usage and allocation count
  • switched to QProcess (finally), so Boost.Process is not required anymore
  • split the implementation into a separate .cc file that contains only the multiprocessing part
  • reverted the cgalutils-applyops file to its original form, containing only the multiprocessing calls and checks
  • multiprocessing can now be enabled/disabled as a Feature without changing any functionality in master

And I found out that the lazy-union feature makes rendering slower. Without multiprocessing it's just slower; with multiprocessing there are never enough objects in the Union to spawn a child.

@kintel Now I can't decide which will be better - to push my changes to this branch or to create another cleaner PR.

@kintel
Member

kintel commented Jan 26, 2020

@slaviber Keep in mind that if you use QProcess, this would be a GUI-only feature, as the cmd-line version of OpenSCAD is built without Qt.

In terms of managing this PR, you can always pull out refactoring as a separate PR, and force-push a rebased version to this PR. If that's too much git magic, opening new PRs from scratch works too.

@slaviber slaviber force-pushed the multithread branch 4 times, most recently from 4f11009 to d26208f Compare January 30, 2020 19:21
@slaviber
Author

OK, I'm sticking with QProcess, mainly because of the lack of an alternative. And it's only 5 lines of code to swap out anyway.
I'm done with the changes; the PR is now ready for review/testing.

@slaviber slaviber requested a review from kintel January 31, 2020 12:40
@MichaelAtOz
Member

Is there a problem with CI? There is no executable in the details #artifacts

@t-paul
Member

t-paul commented Feb 5, 2020

Works for me now - https://circleci.com/gh/openscad/openscad/4399#artifacts - maybe there was some issue or delay yesterday? It may also depend on old/new UI in CircleCI as the new one seems to require a login :-(.

@MichaelAtOz
Member

The exe is there now... shrug.

On Windows 7, F6 gives me:
[image: #3193 multi-assert]

@MichaelAtOz
Member

MichaelAtOz commented Feb 6, 2020

I tried examples/basics/logo with no error, but examples/basics/text_on_cubes got the above assertion.
On most examples that I tried it is almost immediate (example024/023/022), but on example021 the progress bar got to ~60%; example020 worked*.

* I exported to STL; the STL had some inverted faces, but 2019.05 had the same, so it's not a multiprocessing problem.

@slaviber
Author

slaviber commented Feb 6, 2020

Thanks for the feedback @MichaelAtOz.
This assertion indicates that my implementation is not behaving as I expect in some cases.
I'll check what's happening and if it is platform dependent.

@slaviber
Author

slaviber commented Feb 6, 2020

It was a silly mistake that had nothing to do with the multiprocessing, but with the merging of non-intersecting PolySets in a Union.
I checked text_on_cube, example021, example022, example023 and example024 - they all now render correctly.

@Efroymson

Has there been any behind-the-scenes update on this? I am working on some monster files that could surely benefit from faster renders. If there is any way I can help, let me know.

@slaviber
Author

Hello @Efroymson,
You can compile my branch and test whether it speeds up your work.
Aside from that, I have no idea why my PR lost the devs' attention.

@t-paul
Member

t-paul commented May 25, 2020

If it helps, I could try retriggering the CI builds (binaries are deleted after 30 days). That would produce ready-to-try apps for Windows and Linux (no macOS at this point).

@Efroymson

@t-paul That is probably a good idea, but it doesn't help me much, as my new machine is OS X only, at the moment.

@slaviber I would like to try building your branch, but the last time I tried to build OpenSCAD for OS X I ran into a slew of problems and was unable to solve all of them.

@Efroymson

Just FYI, I can't seem to build this version on my system because the Boost build fails. It can't find unistd.h, which is strange.

@slaviber
Copy link
Author

You are trying the OS X build procedure from the README.md file, right?
Which step fails? Can you show the error log?

@MichaelAtOz
Member

@amb-iota

After reading all of this, it looks like there is a version to compile where you get some parallelism, and in my case I think it would do a good job, as I've got lots of separate objects that could be built quickly and separately; otherwise I have to wait more than 30 minutes. Which version do I have to take to get the multitasking feature working on my system?

This is still in development and has had limited testing, and as I noted it can flood the system with jobs ATM.
It looks like the compiled executables have been deleted; development builds are not retained for long.
I'm not a GitHub developer, so someone else will need to point you to the correct source if you wish to compile it. @t-paul?

@t-paul
Member

t-paul commented Apr 6, 2021

Updated against master, so we have fresh artifacts for 30 days...

@MichaelAtOz
Member

@amb-iota if you want to run the current state of this, you can download the relevant executable from the links below: under 'all checks have passed', scroll down and click 'details' of the AppImage, or for Windows the openscad-mxe-xxbit build.
You will need to register with CircleCI (you can authenticate with GitHub), and because it is dumb and forgets what you wanted to do the first time, come back here to click 'details' again.
Then select the Artifacts tab and download the relevant file.

it looks like there is a version to compile

If you want to compile, what environment are you using?

@amb-iota

amb-iota commented Apr 8, 2021

@amb-iota if you want to run the current state of this, you can download the relevant executable from the links below: under 'all checks have passed', scroll down and click 'details' of the AppImage, or for Windows the openscad-mxe-xxbit build.
You will need to register with CircleCI (you can authenticate with GitHub), and because it is dumb and forgets what you wanted to do the first time, come back here to click 'details' again.
Then select the Artifacts tab and download the relevant file.

it looks like there is a version to compile

If you want to compile, what environment are you using?

I am using Ubuntu 20.04 with 64bit.

@mholiv

mholiv commented Apr 8, 2021

Thinking it might be worth hosting these binaries as an unofficial build somewhere. 🤔

Enough people seem to care.

@MichaelAtOz
Member

@amb-iota ...
If you want to compile, what environment are you using?

I am using Ubuntu 20.04 with 64bit.

I'm not an expert, but as I understand it, I believe you want the code from here.
Instructions are here. I'm not sure if they are up to date re the changes from qmake to cmake - @t-paul?

Also @t-paul, re mholiv's comment above: is it worth dropping the artifacts to the ftp site?

@t-paul
Member

t-paul commented Apr 11, 2021

I stashed the files at http://files.openscad.org/pr/PR3193/ but we don't have the storage space to do that all the time nor does it scale if I have to do this manually. Maybe at some point we can get that automated based on a PR tag.

@lf94

lf94 commented May 16, 2021

@t-paul when will this be merged? 😃

@MichaelAtOz
Member

@lf94 This has had limited testing and can have bad effects on system stability with complex models.
It likely won't get merged without more fine-tuning, and it appears nobody is actively working on it.
We can always use more developers and/or testers.

@lf94

lf94 commented May 17, 2021

can have bad effects on system stability with complex models

All I've seen that's bad so far is that it will eat all CPUs, but it was suggested to use n - 1 CPUs by default.

What else is there, so we can start a list of things to do / look out for?
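The "n - 1 CPUs by default" idea in sketch form (a hypothetical Python illustration, not code from the branch): size the pool of children so one core stays free for the GUI and the rest of the system.

```python
# Cap the number of simultaneous child processes at cpu_count() - 1, leaving
# one core free. `heavy_op` stands in for one CGAL boolean operation.

import os
from concurrent.futures import ProcessPoolExecutor

def heavy_op(x):
    return x * x  # stand-in for one expensive geometry operation

if __name__ == "__main__":
    max_workers = max(1, (os.cpu_count() or 2) - 1)  # leave one core free
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(heavy_op, range(8)))
    assert results == [0, 1, 4, 9, 16, 25, 36, 49]
```

A fixed pool also bounds memory, since at most `max_workers` children hold geometry at once, instead of one process per pair.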

@MichaelAtOz
Member

All I've seen as bad so far

It crippled my system, a Debian Skull Canyon with high-end NVMe SSDs. I'm surprised it didn't go fatal, but it was worrying for a few minutes. Unix is meant to be resilient, but something really clobbered it.

@slaviber I did a test with n=5000 on my 8 core 32GB Debian system.
A few times it brought it to its knees: even the mouse stopped moving, and more importantly my VLC media stopped playing! The first time, in one F6, it was frozen for maybe a minute and came back; later in the same F6, it froze for maybe 15s a few times.

n=5000 is not a big test by my stress-test standards; the multi-threaded version was civil even with n=100000, and I had tests running for hours. This one spat out huge numbers of processes in very quick succession, and compared to the overhead of a thread, the overhead of a full OpenSCAD process is large, particularly in memory. [This may be related to the use of AppImage perhaps? I'm no expert.]

Some people seem to see it as usable for less challenging designs.

Not being a Linux dev, I imagine setting up a worker controller to dole out and manage jobs is not trivial (unless there is some library?). Unfortunately, with a crippled system it is difficult for me to see what was actually happening; it needs a Linux guy who can get into the weeds.

What else is there, so we can start a list of things to do

  • Start a list
  • Find a developer
  • Test/debug what caused the 'freeze' with a complex design
  • Design an approach to avoid it (presumably a CPU count limit ATM, but it could be memory or something else)
  • Preference and command-line options for CPU count vs process spawning limits
  • Develop a worker management mechanism or process-spawn limiting code
  • Develop test cases
  • Test
  • Stress test
  • Etc

Start with the second point. Sorry to be snarky, but I thought I spelled it out in my first reply.

@lf94

lf94 commented May 17, 2021

I am a developer, so that's why I've asked. 😆

Thank you very much @MichaelAtOz for taking the time to describe what needs to be done. It doesn't seem like much more work is needed - just some investigation, adding some limit options, and job batching.

Edit: Do you know where the complex design is which is causing the issue?

@MichaelAtOz
Member

Edit: Do you know where the complex design is which is causing the issue?

Lucky I came here to see the edit.

The latest here is this V6; I have V8 on the Debian system (the car engine terms are purely coincidental). I'll need to compare them to see if the new one is relevant - that was months ago, in a pandemic...

You should also read #3465 on what the reality is in relation to 3D output, particularly the merits of identical output.
I must admit I lost interest with unexplained different STLs for the same .scad.
If a different tessellator is the cause, is it inherently part of multiprocessing or just incidental?

These theological discussions should probably occur in #3465.

@lf94

lf94 commented May 19, 2021

@MichaelAtOz I've compiled and run the example which is freezing. I want to be explicit to be sure we are seeing the same problem: When rendering with F6, it becomes stuck at 999/1000, correct?

I also read the thread about the topology changing. I have no real opinion on this right now. My immediate reaction is: I can't imagine this is a big deal.

@MichaelAtOz
Member

I want to be explicit to be sure we are seeing the same problem: When rendering with F6, it becomes stuck at 999/1000, correct?

Not my 'freeze', no.
Excuse the mansplaining: did you enable multi-processing in Preferences/Features? [But see the 'Hmm' below.]

What you see at 999/1000 is the final global union - unavoidable* and not readily parallelizable, though somewhere around here there is/was(?) discussion of strategies to optimise it, sorting by facet count etc. I can't recall if that was in this or another activity. Perhaps @Someone-Else can recall.

*you may want to see what preferences/features/lazy-union does to that.

My 'freeze',

it brought it to its knees, even the mouse stopped moving, and more importantly my VLC media stopped playing!
The first time, in one F6, it was frozen for maybe a minute, and came back, later in same F6, maybe 15s a few times.

Frozen: it just started to get bad, progressively worse responsiveness, then quickly locking up, looking dead, then 'shit, I don't really want to hold down the power button'. In the one F6: freeze, coming back to a slow, jittery GUI, locking up again for 15s etc., repeat, then coming slowly out of it, at which point I was able to run the system monitor and lower the priority (see the image above with 26GB of AppImage OpenSCAD processes). I expect that if I had been able to see it earlier there may have been a lot more spawning processes choking the kernel. In the middle I never had enough control to cancel the render.

I was about to re-run that a couple of days ago, with the system monitor already running and the initial GUI OpenSCAD at lower priority, so the spawned processes would inherit it, but some IRL things hit. I'll try to do that today.

Hmm, I just tried a smaller test, but I'm not seeing any extra processes - something funny is going on.
Even at n=8000, I only see one --spawnchild process running at 12% (100% of one core). Looking into it.

I also read the thread about the topology changing. I have no real opinion on this right now. My immediate reaction is: I can't imagine this is a big deal.

I was also about to add my 2c to that discussion, will do so.

@MichaelAtOz
Member

Well, of course as soon as you post, it changes behaviour.

I got the freeze again, but I had left it pre-freeze to get caffeine.
Locked solid, then partial recovery etc., and yes, 999/1000.
Other funny things are happening, including 32GB RAM and 32GB swap consumed - oodles of 2.x GB processes.
It is not CPU, as the spawnchilds were LOW, so it has to be the kernel and/or swap* or a memory copy in a critical section.
*While I was typing that, it went into another crisis, but memory was not 32+32, it was 14+21.

It's easy to tell when it's about to happen if you have media playing: when it hits, it goes like a bad CD, then freezes.

It's still at 999/1000; the process count is slowly reducing.
I wonder if this bit is the final union optimisation?
If it is spawning a process for every pair of n=8000 (yes, that was unwise), that could explain the final hit.

I'll need to do it again and be present when it first starts.

Can anyone recommend a better process monitor than the default one?

@MichaelAtOz
Member

I'm continuing testing related stuff in #3465.

@amb-iota

Hi Michael, your freezing reminds me of a small thing I programmed in the past, when I was trying to compute a game.
What I found out is that parallel working is only worthwhile as long as you do not split a problem into more pieces than there are processors available, minus one for the main thread; otherwise you get stuck, because the machine can no longer free its own processes, and that causes a problem which multiplies itself and no longer depends only on the job itself.

I don't know if your parallel processing limits itself to that kind of value, because if it doesn't, you will always run out of the normal range - and not only of processors, because your memory will also get filled completely over time.

Maybe that's what is causing your freezing, because it sounds a bit like it.

@Efroymson

I have been using the Mac build, but it doesn't do multiprocessing: I get one core of my CPU maxed out and nothing more. There is no multiprocessing option in the preferences panel, unless one of the options in this screenshot is secretly the one to use.

[Screenshot: Preferences panel ("Screen Shot 2021-10-05 at 13 52 53")]

@MichaelAtOz
Member

@Efroymson I don't believe this branch had a Mac build.
The automatic builds for branches only have Windows 64 & 32 bit & Linux Appimage builds.
This branch is not in master hence is not in the snapshot developer nightly builds.
It has deficiencies and needs some love from a Developer.

@Efroymson

It was a while ago, but I recall downloading a Mac build from the build machine. Unfortunately I can't recall where I got the pointer for where to look, nor who sent it, but it was definitely a different build of OpenSCAD. I must confess that while I have a 30,000-foot understanding of build automation, the exact way GitHub handles it, either generally or specifically for this project, is a mystery to me.

@Efroymson I don't believe this branch had a Mac build.
The automatic builds for branches only have Windows 64 & 32 bit & Linux Appimage builds.
This branch is not in master hence is not in the snapshot developer nightly builds.
It has deficiencies and needs some love from a Developer.

bool foundFirst = false;

try {
// DIFFERENCE is a strictly ordered operation and cannot be multithreaded ?! :(

@dgatwood dgatwood Jul 25, 2025


Unless I'm missing something, most of the computation for difference() need not be strictly ordered. It is almost embarrassingly parallel, with the exception of the first item.

difference() {
  a;
  b;
  c;
  d;
  e;
  f;
}

is equivalent to

difference() {
  a;
  union() {
    b;
    c;
    d;
    e;
    f;
  };
};

In fact, in an ideal world, the work operations would treat it as though it looked like this:

difference() {
  a;
  union() {
    union() {
      union() {a; b;}
      union() {c; d;}
    }
    f;
  }
}

So you do the unions in a massively parallel way, and when both child nodes are completed, you do the parent union operation, until you get up to the final node, and then you complete the difference operation. And as long as you design the work unit system correctly, you can throw the leaf nodes (a, union{bc}, union{de}, and f) onto separate worker threads, and everything should "just work". When each work unit completes, it checks to see if the parent node's children are all complete, and if so, you dispatch that node to a worker thread. And so on.

Am I missing something? (I haven't dug into this code much yet; I was just looking for a way to shave several hours off a render that takes a fraction of a second at GPU preview quality, found this pull request, started looking at it, saw this comment, and it struck me as confusing.)
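The work-unit scheme described in the comment above (dependency counters on tree nodes; a node is dispatched once all its children finish) might look like this in illustrative Python, with set operations standing in for the geometry ops. None of these names come from OpenSCAD:

```python
# Each node counts its unfinished children. Leaves are dispatched immediately;
# a finishing worker decrements its parent's counter and dispatches the parent
# once the counter hits zero, all the way up to the root.

import threading
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, op=None, value=None, children=()):
        self.op, self.value, self.children = op, value, list(children)
        self.parent, self.pending = None, len(self.children)
        self.lock = threading.Lock()
        for c in self.children:
            c.parent = self

def evaluate(node):
    if node.op == "union":
        node.value = set().union(*(c.value for c in node.children))
    elif node.op == "difference":  # first child minus the union of the rest
        rest = set().union(*(c.value for c in node.children[1:]))
        node.value = node.children[0].value - rest

def walk(node):
    yield node
    for c in node.children:
        yield from walk(c)

def run_tree(root):
    done = threading.Event()
    with ThreadPoolExecutor() as pool:
        def work(node):
            evaluate(node)
            parent = node.parent
            if parent is None:
                done.set()
                return
            with parent.lock:
                parent.pending -= 1
                ready = parent.pending == 0
            if ready:                 # last child finished: parent is runnable
                pool.submit(work, parent)
        for leaf in (n for n in walk(root) if not n.children):
            pool.submit(work, leaf)
        done.wait()
    return root.value

# difference() { a; union() { b; c; } }  ->  a - (b | c)
a, b, c = Node(value={1, 2, 3}), Node(value={2}), Node(value={3})
root = Node(op="difference", children=[a, Node(op="union", children=[b, c])])
assert run_tree(root) == {1}
```

Because no node is shared between concurrently running workers, this structure also satisfies the "don't share objects across threads" constraint mentioned below.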

Author


Hey, thank you for your interest in my PR.

It has been quite a while (5 years, actually), and I've lost track of what has been done in this project since then.

If I remember correctly, I ran numerous tests and found that the Diff operation, as it was coded back then, produced wildly different output depending on the order of combining the solids. The CGAL library was buggy in this regard.

It may or may not have been the case, but given the dispute over which output was considered valid, I focused on more productive activities and got my bachelor's in Engineering instead.

If your observation is true and the Diff operation can be subdivided into a Union and a Difference, and this does not cause a further operation bottleneck and, most of all, does not alter the final render, then great.

Maybe you can bring this code back from the dead and finally fix the parallelism problem in OpenSCAD?
I gave up and patched another open source program for my final thesis.

I have to mention that the biggest bottleneck was the CGAL library itself not being thread-safe, which required very computationally heavy serialization, and lots of RAM usage. Things may have changed for the better.


From what I see, CGAL is currently thread-safe as long as you aren't sharing objects across threads, which is why I suggested work units that combine results from previous stages; that way you aren't ever sharing data across threads, and you'll also get deterministically consistent results on every run, without needing to strictly order anything beyond ensuring that [work item a] doesn't start until [work item a's nested dependencies] have finished.

It's interesting that you saw different results depending on order. I could see there being subtle differences, e.g. different triangulation of a face or something, but I'd expect the results to be congruent to the originals, just assembled differently. If that's not the case, that's... interesting. 😁

@jordanbrown0
Contributor

OpenSCAD has almost completely moved from CGAL to Manifold, which is parallel and far faster.

@dgatwood

OpenSCAD has almost completely moved from CGAL to Manifold, which is parallel and far faster.

Holy crap. That reduced my render time from... well, giving up after half a day to being done in just three seconds. Why isn't this on by default? 😂

@t-paul
Member

t-paul commented Jul 25, 2025

Closing; it's long stalled and superseded by the addition of the Manifold back-end.

@t-paul t-paul closed this Jul 25, 2025
@jordanbrown0
Contributor

As for the "why not by default", see #5833, which currently says to see #5485 and #5866.
