Parallelize render using multiple processes#3193
Parallelize render using multiple processes#3193slaviber wants to merge 23 commits intoopenscad:masterfrom
Conversation
|
Thanks for the PR! |
|
Thanks for the response! |
|
We can't easily use boost 1.64. We can update the one listed in the readme which has currently 1.35, but the base line is probably 1.58 (as of Ubuntu 16.04LTS). |
|
I understand. If Ubuntu 16.04LTS has to be supported, then i'd need to use either QProcess (which seems hard) or another 3rd-party multiplatform library that supports multiprocessing. |
|
Ubuntu 16.04LTS is mostly required because it's currently still the base for AppImage and Snap. It's possible to update dependencies for those (and we already need to do that), but this really needs to be justified as the effort spend on upgrading the build environments is not exactly small. A separate process seems to have some advantages over the threading solution in #2399. Do you think the benefits are bigger than the drawbacks (like the need to serialize the data structures to the other processes). |
|
@slaviber Multiple processes would indeed make me sleep better at night :) A few hints to make PRs like this easier to manage:
|
|
I understand. |
|
I can't compile this on Ubuntu 19.10 (gcc 9.2.1), I get the following linker error: If I add rt to the COMMON_LIBRARIES, I get a lot of undefined references for boost::archive symbols. Adding "serialization" to the list of required modules for boost helped. Note: I think you need to search for librt properly and only add it if it exists, since there are platforms with do not have it. |
|
@mhier Yes, the build scripts aren't fixed properly. So I made some huge changes to my code.
Added / Reverted functionality:
And I found out that the lazy-union feature makes rendering slower. Without multiprocessing it's just slower, with multiprocessing there are never enough objects in the Union to spawn a child. @kintel Now I can't decide which will be better - to push my changes to this branch or to create another cleaner PR. |
|
@slaviber Keep in mind that if you use QProcess, this would be a GUI-only feature, as the cmd-line version of OpenSCAD is built without Qt. In terms of managing this PR, you can always pull out refactoring as a separate PR, and force-push a rebased version to this PR. If that's too much git magic, opening new PRs from scratch works too. |
4f11009 to
d26208f
Compare
|
OK, I'm staying with QProcess mainly because of the lack of an alternative. And it's 5 lines of code to swap anyway. |
|
Is there a problem with ci? There is no executable in the |
|
Works for me now - https://circleci.com/gh/openscad/openscad/4399#artifacts - maybe there was some issue or delay yesterday? It may also depend on old/new UI in CircleCI as the new one seems to require a login :-(. |
|
I tried examples/basics/logo, no error, but examples/basics/text_on_cubes, got the above assertion. * I exported to STL, STL had some inverted faces, but 2019.05 had the same, so not multithreads problem. |
|
Thanks for the feedback @MichaelAtOz. |
|
It was a silly mistake that has nothing to do with the multiprocessing, but with the merging of non-intersecting PolySets in a Union. |
|
Has there been any behind the scenes update on this? I am working on some monster files that could surely benefit from faster renders. If there is any way I can help, let me know |
|
Hello @Efroymson, |
|
If it helps, I could try retriggering the CI-builds (binaries are deleted after 30 days). That would produce ready to try apps for Windows and Linux (no MacOS at this point). |
|
Just FYI, I can't seem to build this version on my system because the boost build fails. It can't find unistd.h, which is strange. |
|
You are trying the OS X building procedure from the README.md file, right? |
This is still in development and has had limited testing, and as I noted can flood the system with jobs ATM. |
|
Updated against |
|
@amb-iota if you want to run the current state of this, you can download the relevant executable from the links below, under 'all checks have passed' scroll down, click 'details' of the appimage, or for Windows the openscad-mxe-xxbit.
If you want to compile, what environment are you using? |
I am using Ubuntu 20.04 with 64bit. |
|
Thinking it might be worth hosting these binaries as an unofficial build somewhere. 🤔 Enough people seem to care. |
I'm not an expert but as I understand it, I believe you want the code from here. Also @t-paul re mholiv's comment above, is it worth dropping the artifacts to the ftp site? |
|
I stashed the files at http://files.openscad.org/pr/PR3193/ but we don't have the storage space to do that all the time nor does it scale if I have to do this manually. Maybe at some point we can get that automated based on a PR tag. |
|
@t-paul when will this be merged? 😃 |
|
@if94 This has had limited testing and can have bad effects on system stability with complex models. |
All I've seen as bad so far is it will eat all CPUs, but it was suggested to use n - 1 CPUs by default. What else is there, so we can start a list of things to do / look out for? |
It crippled my system, a Debian Skull-canyon with high-end NVMe SSDs, I'm surprised it didn't go fatal, but it was a worrying for a few minutes. Unix is meant to be resilient but something really clobbered it.
Some people seem to see it as usable for less challenging designs. Not being a Linux dev, I imagine setting up a worker controller to doll-out & manage jobs is not trivial (unless there is some library?). Unfortunately with a crippled system it is difficult for me to see what was actually happening, it needs a Linux guy who get into weeds.
Start with the second point. Sorry to be snarky but I thought I spelled it out with my first reply. |
|
I am a developer, so that's why I've asked. 😆 Thank you very much @MichaelAtOz for taking the time to describe what needs to be done. It doesn't seem like much more works is needed - just some investigation, adding some limit options, and job batching. Edit: Do you know where the complex design is which is causing the issue? |
Lucky I came here to see the edit. The latest here is this V6, I have V8 on the Debian system, (car engine term is purely coincidental), I'll need to compare them to se if the new one is relevant, that was months ago in a pandemic... You should also read #3465 on what reality is in relation to 3D output, particularly the merits of identical output. These theological discussion should probably occur in #3465. |
|
@MichaelAtOz I've compiled and run the example which is freezing. I want to be explicit to be sure we are seeing the same problem: When rendering with F6, it becomes stuck at 999/1000, correct? I also read the thread about the topology changing. I have no real opinion on this right now. My immediate reaction is: I can't imagine this is a big deal. |
Not my 'freeze' no. What you see at 999/1000 is the final global union, unavoidable* and not readily parallelable, tho here-somewhere is/was(?) discussion of strategies to optimise it, sorting by facet count etc, I can't recall if that was in this, or another activity. Perhaps @Someone-Else can recall. *you may want to see what preferences/features/lazy-union does to that. My 'freeze',
Frozen, it just started to get bad, progressively worse responsiveness then quickly locking up, looking dead, then 'shit I don't really want to hold down the power button'. In the one F6, freeze, coming back to a slow jittery GUI, locking up again for 15s etc repeat, then coming slowly out of it, where I was able to run system monitor and lower the priority (see the image above with 26GB of Appimage OpenSCAD processes) . I expect if I was able to see it earlier there may have been a lot more spawning processes choking the kernal. In the middle I never had enough control to cancel the render. I was about to re-run that a couple of days ago, with System monitor already running, the initial GUI OpenSCAD at lower priority, so the spawned process will inherit it. But some IRL things hit. I'll try do that today. Hmm, I just tried a smaller test, but I'm not seeing and extra processes, something funny going on.
I was also about to add my 2c to that discussion, will do so. |
|
Well, of course as soon as you post, it changes behaviour. I got the freeze again, but I had left it pre-freeze to get caffeine. Easy to tell when it's about to happen if you have media playing, when it hits, it goes like a bad CD, then freezes. It's still 999/1000, process count is slowly reducing. I'll need to do it again an be present when it first starts. Can anyone recommend a better process monitor than the default one? |
|
I'm continuing testing related stuff in #3465. |
|
Hi Michael, your freezing remembers me on a tiny thing I did program in the past. I was trying to calculate a game. Don't know if your parallel processing is limiting itself to that kind of value, because if not you allways will run out of normal range not only with processors because your Memory will than get filled compleetly over time. maybe that's causing your freezing because it sounds a bit like it. |
|
@Efroymson I don't believe this branch had a Mac build. |
|
It was a while ago, but I recall downloading a Mac build from the build machine. Unfortunately I can't recall where I got the pointer for where to look, nor who sent it, but it was definitely a different build of OpenScad. I must confess that while I have a 30,000 foot understanding of build automation, the exact way GitHub handles it, either generally or specifically to this project, is a mystery to me.
|
| bool foundFirst = false; | ||
|
|
||
| try { | ||
| // DIFFERENCE is a strictly ordered operation and cannot be multithreaded ?! :( |
There was a problem hiding this comment.
Unless I'm missing something, most of the computation for difference() need not be strictly ordered. It is almost embarrassingly parallel, with the exception of the first item.
difference() {
a;
b;
c;
d;
e;
f;
}
is equivalent to
difference() {
a;
union() {
b;
c;
d;
e;
f;
};
};
In fact, in an ideal world, the work operations would treat it as though it looked like this:
difference() {
a;
union() {
union() {
union() {a; b;}
union() {c; d;}
}
f;
}
}
So you do the unions in a massively parallel way, and when both child nodes are completed, you do the parent union operation, until you get up to the final node, and then you complete the difference operation. And as long as you design the work unit system correctly, you can throw the leaf nodes (a, union{bc}, union{de}, and f) onto separate worker threads, and everything should "just work". When each work unit completes, it checks to see if the parent node's children are all complete, and if so, you dispatch that node to a worker thread. And so on.
Am I missing something? (I haven't dug into this code much yet; I was just looking for a way to shave several hours off a render that takes a fraction of a second for the GPU-quality rendering, and found this pull request, and started looking at it, and saw this comment, and it struck me as confusing.)
There was a problem hiding this comment.
Hey, thank you for your interest in my PR.
It has been quite a long ago, 5 years actually, and I lost track of what has been done to this project since then.
If I remember correctly, I made numerous tests and found out that the Diff operation, as it was coded back then, resulted in wildly different output based on the order of combining the solids. The GCAL library was buggy in this regard.
It may or may not have been the case, but since the dispute on what output was considered valid, I focused on more productive activities and got my bachelor in Engineering instead.
If your observation is true and the Diff operation can be subdivided to a Union and a Difference, and this does not cause a further operation bottleneck, and, most of all, does not alter the final render, then great.
Maybe you can bring this code back from the dead and finally fix the paralellism problem in OpenSCAD?
I gave up and patched another open source sw for my final thesis.
I have to mention that the biggest bottleneck was the CGAL library itself not being thread-safe, which required very computationally-heavy serialization. And lots of RAM usage. Things may have changed for better.
There was a problem hiding this comment.
From what I see, CGAL is currently thread-safe as long as you aren't sharing objects across threads, hence why I suggested work units that combine results from previous stages; that way, you aren't ever sharing data across threads, presumably, and you'll also get deterministically consistent results every run without the need for strictly ordering anything beyond ensuring that [work item a] doesn't start until after [work item a's nested dependencies] are finished.
It's interesting that you saw different results depending on order. I could see there being subtle differences, e.g. different triangulation of a face or something, but I'd expect the results to be congruent to the originals, just assembled differently. If that's not the case, that's... interesting. 😁
|
OpenSCAD has almost completely moved from CGAL to Manifold, which is parallel and far faster. |
Holy crap. That reduced my render time from... well, giving up after half a day to being done in just three seconds. Why isn't this on by default? 😂 |
|
Closing, it's long stalled and superseded by adding of the Manifold back-end. |



Hello,
I tried to implement multiprocess support for OpenSCAD according to #391.
(With all the previous failed attempts, maybe we should start working in parallel to succeed...)
This is my approach:
First of all, it is obvious for me that if there can be any huge optimization/parallelism, that is in the CGAL library itself - by computing the operations in small independent chunks. But unfortunately, GCAL is not even thread-safe.
There is not much room for optimization of OpenSCAD. Of course, that is because one real CAD model has a closely linked structure - every next geometric element depends on one or rarely a few previous. So there isn't much to be computed in parallel on the 'operation' level.
Of course, there are exceptions. And I'm making use of them to achieve a nice (solid) decrease in the rendering time in some corner cases. At almost negligible performance cost for the rest of the time.
I first did some performance tests and I found that many constructions parsed by the OpenSCAD language are reduced to Unions before other operations are being applied to them. For example all for loops, all the geometry in the sourcefile's root and so on. (And maybe even more in the future).
And I also found that some optimizations can be done before even making calls to CGAL (ex. combining the PolySets in the Unions into multivolume objects, thus reducing the Boolean OR overhead.).
The actual implementation:
So my implementation resides entirely in the cgalutils-applyobjects sources mainly because that's the only place where the 3D objects can be parallelized to some extent.
First I'm taking all order-independent operations of a given branch (Unions and Intersections).
Then I'm merging the objects wherever possible, which compensates for the added overhead of the consequent operations.
Then, if there are enough 3D objects left, i'm spawning child processess, which compute the operations in parallel wherever possible, then combining child processess and so on until the number of objects left is insufficient for spawning more childs.
I'm combining the left 3D objects serially and returning the root.
That's one of those corner cases:
box.scad
That box has one patterning operation - union of 30 holes and one Intersection of 6 spheres.
When I did the test on my PC the results were - 15'47'' for the master branch and 8'15'' for my PR. That's almost double increase in rendering speed (at the expense of all that RAM for all those workers in parallel).
The actual results vary from roughly equal rendering time for non-branched low-polygon count objects, to slightly slower for moderately-branched low polygon count objects to massive reduce in time for high polygon count branched models.