Another attempt at multi-threaded geometry evaluation #2399
justbuchanan wants to merge 46 commits into openscad:master
…k multithreaded. Also fixed all the tests broken by my last commit.
* a bit simpler and less code
* start a pool of threads once and use them to process operations rather than starting a new thread for each operation.
* during recursive exploration of the tree, build a tree of "pending work items" corresponding to postfix operations for each of the nodes in the tree.
* use a condition variable to efficiently sleep threads while waiting for new nodes to process
|
I'll dig into my testing from last time. @justbuchanan re performance, try 2D; last time 2D was worse in parallel. Also up the numbers very slightly (start small) on example024. It would be handy (for the interim) to instrument it, as last time I had no idea what the internals were trying to do. Dumping the tree maybe? In testing, I also found that setting the CPU affinity of the process to exclude some logical CPUs caused more thread contention; without contention, testing is less likely to find issues. Also a feature request for later, if it's not too difficult: assuming you are using the same mechanism as before to count logical CPUs to determine the number of threads, allow a thread offset (+/-) to over/under provision threads. It would be handy for testing, and in real use to ensure you can leave some CPU for those mundane emails, music etc... Well that was a disjointed brain dump... @t-paul Could you do a Windows 64-bit exe? |
|
@justbuchanan Looks good so far. Now if we can just parallelise that final union... First problem, a. for reference, Crash when clicking the cancel render button. |
|
Another tip for testing, particularly on lower CPU count machines, set OpenSCAD process priority below normal. Lets other stuff run. |
|
For one of my more complex models, |
|
Thanks for trying it out and thanks for the feedback so far! Also good to see that you got a nice speedup on a complex model :) I'll do some more testing with some 2d models and post the results. I'll also try using more or less threads and see if I can find any bugs. I'll look into the crash when cancelling an in-progress render. This is using the same method as before to determine the number of threads to run - it should correspond to the number of virtual cores you have. I'll take a look at adding an option to over/under provision. Also +1 for speeding up the final union - openscad definitely spends a lot of time there. |
|
Re a. if multiple threads are still running they all crash, i.e. multiple error dialogues. Re the determination of # threads, if it is easy can you move that closer to OpenSCAD initialisation? I just tried deselecting 4 cores (of 8) after starting it, before F6, and it only used 4 threads. I had to deselect them after F6 starts. This won't be necessary if you do the over-provisioning thing tho. On 8 cores, I ran the previous multi-tasking version and its timing was equal, to the second, with this version, over a 6+ minute render. |
|
I can't recall where this came from, maybe Nophead, but it is good at generating workload, don't go over n=5 for initial runs. |
|
I just pushed some changes that should fix the cancellation issues - let me know if that doesn't work for you. Note that because cancellation is only detected during progress updates, which happen after each node's computation is finished, there's often a huge lag between clicking the button and having it actually stop :/ I don't think there's a way for core selection to 'stick' between renders because the threads are started separately each time. It would be difficult to move thread initialization closer to openscad init because they need to accept some parameters at startup to know what to process. It might be possible, but it would definitely make the code more complex. I'll work on adding a flag for setting the number of jobs/threads (similar to make's -j flag). |
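A minimal sketch of how the requested thread offset and a make-style jobs setting could resolve to a worker count. It assumes the logical CPU count comes from `std::thread::hardware_concurrency()`, which this thread does not confirm; `resolveThreadCount` and its parameters are hypothetical names, not part of the PR.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper (not from the PR): turn the detected logical CPU count
// plus a user-supplied offset into a usable number of worker threads.
// offset > 0 over-provisions, offset < 0 leaves some CPUs free.
unsigned resolveThreadCount(int offset,
                            unsigned detected = std::thread::hardware_concurrency()) {
    // hardware_concurrency() is allowed to return 0 when the count is unknown.
    long base = detected == 0 ? 1L : static_cast<long>(detected);
    long wanted = base + offset;
    // Never go below one worker, however negative the offset is.
    return static_cast<unsigned>(std::max(1L, wanted));
}
```

With 8 detected logical CPUs, an offset of -2 would give 6 workers and an offset of +4 would give 12; the clamp keeps at least one worker so a large negative offset cannot stall rendering.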
Sorry I explained that badly. |
Consistent? Well it crashes a lot, I don't recall one of these specific loads finishing. I've had simpler stuff finish. The progress bar sometimes goes up to 800-900 quickly (seems that may be after an F5), sometimes just incrementing by 1-ish slowly (but using 8x100% CPU). Sometimes crashed quickly, one got to 972/1000, one to 999 with 2x100% - strike the above, just had one finish: But that is rare. That one was a slow incremental progress. I'm historically familiar with Unix, but new to the details of Linux, so have to dig for info. That second F6 above (592), incremented slowly to 600+, then: Other runs Then To me that points to cache corruption?? It's just lots of unions of cubes. Nightly (single thread) finished: Note the different vertex counts etc. I would expect both thread methods to be the same. (?) |
|
That one finished, but with counts the same as nightly. I exported anyway, I'll have a look later. |
|
Another finished with different counts. Exported. I'll compare geometry in a couple of days... |
|
Note that AppImages by design are built on older platform (Ubuntu 16.04 at this point for the one on Circle-CI) to increase chances they run on different systems. Right now only OpenCSG is specifically updated due to known bugs, but other libs, especially CGAL are older versions from the build platform. |
|
Are there OS X builds available as well? I’d like to help test, that is the best platform for me. |
|
Almost instant crash (F5 b4 F6 - just keeping track ATM, not necessarily saying they are related) Interesting, multiple errors, presumably threads. Console garbled, not thread safe? F5 b4 F6. Note also that the same workload finished on Windows Again console garbled/missing. So no counts, I'll compare the exports. I'll move on to other workload generators with more than just unions. Fewer objects (vs tree branches) but more complexity. I'm wondering how to instrument this to see what the threads are doing? Do any --debug parameters come to mind? Also for future performance comparisons, it would be handy to: |
|
Re the Windows one above, I had a light bulb moment, just do F6 again from cache, got the counts. Which matches one of the multi-debians above, but not nightly-debian... |
|
This issue is getting quite long, I should have posted the above in #2405. I'll continue there. |
# Conflicts: # src/feature.h
|
Updated from master. The AppImage is now using CGAL 4.11, that might fix some of the crash issues seen on Linux. |
|
@t-paul Do you have a link to the linux AppImage that just got built? I can see the windows builds have a nice link in the summary section to the artifact, but can't see that for the linux build. (edit: it was an earlier windows build that I was looking at that had the links in the summary, not sure how to get the current build artifacts) |
|
Is there an OSX build as well? I tried looking for one but https://app.circleci.com/jobs/github/openscad/openscad and https://app.circleci.com/jobs/github/openscad/ don't seem to be populated with links to the other builds, as I had hoped they would be. The other nearby numbers weren't helpful either, e.g. CircleCI
Thanks,
Robert
On Wednesday, October 23, 2019, 06:07:31 PM MDT, Torsten Paul wrote:
https://app.circleci.com/jobs/github/openscad/openscad/3835/artifacts
|
No, the MacOS plan is restricted to 500 minutes per month, so we can't build everything as with Linux and Windows. Right now the MacOS builds are limited to build once on 5 days a week and for master only. |
|
@t-paul How long do the build artifacts hang around? I just tried downloading again from another computer and get 404 on |
|
I don't know, the only info I can find is that they don't recommend it as long term storage. I can still download this file and also others that are > 10 month old. |
|
try this |
See the |
|
I think the solution to making boolean operations more parallelizable (particularly the top-level union in many cases) is to break down the Inside openscad/src/cgalutils-applyops.cc Line 117 in 0d4fa3e openscad/src/cgalutils-applyops.cc Line 120 in 0d4fa3e etc. But the threaded implementation is lumping them as single thread to handle all children.
For
I only just started looking into the threaded code, so I'm not sure yet the best way to re-write @justbuchanan What do you think, does that all make sense to you? Are you interested in taking a stab at this idea, or would you rather I give it a try? |
|
Has there been any behind the scenes update on this? I am working on some monster files that could surely benefit from faster renders. If there is any way I can help, let me know. |
|
Not on this PR, there's a different approach in #3193 but that seems to have stalled also. |
|
Thanks, I asked over there as well. I am looking at jumping to Blender just to get faster renders, but it would be a lot of work to port my generating code .. if I could even figure out how. |
|
Where can I download the installer for win64? (Can't download it from artifacts on circleci anymore) |
|
OpenSCAD bot: this PR is stale because it has been open for 60 days with no activity. It will be closed if no further activity occurs within 60 days. |
This builds on @devilman3d's work in #1980 and re-uses most of the general changes to openscad. I rewrote the ThreadedNodeVisitor class to be a bit simpler and more efficient (and possibly resolved a couple of bugs). I also merged the latest from master as the previous version was about a year out of sync.
Note that I haven't addressed the gmpq thread-safety issues pointed out in the other PR. Does anyone have a good example scad model that triggers this? I haven't run into it in the many examples I've tried, but I also haven't done anything to specifically fix the issue.
how it works
The new implementation of ThreadedNodeVisitor starts a fixed number of worker threads at the start of traversal and schedules work amongst them as it becomes available. A condition variable is used to efficiently sleep threads while work is unavailable.
As before, the geometry tree is traversed top-to-bottom recursively, running prefix traversal for each of the nodes on the way down. This happens serially, ensuring that each node's prefix traversal happens before any of its children's.
The postfix traversals are usually (much) more expensive to run than the prefix traversals and these are what we run in parallel. The important constraint on running these in parallel is that a given node's postfix traversal cannot be run until the postfix traversals of all its child nodes have completed.
As the tree is recursively traversed top-to-bottom, a tree of pending postfix traversals is built. In order to keep track of which nodes are ready to run, each node keeps a counter of how many of its child nodes have yet to complete their postfix traversals. Each time a postfix traversal is completed, the pending child count of its parent node is decremented. When the counter gets to zero, that node is pushed onto the work queue and is executed as soon as a thread is available.
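The scheduling scheme described above can be sketched roughly as follows. This is an illustrative stand-alone example, not the PR's actual ThreadedNodeVisitor: `WorkItem`, `PostfixScheduler`, and their members are hypothetical names, and the prefix traversal is reduced to a serial pass that only sets counters and queues the leaves.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pending postfix traversal of one geometry node.
struct WorkItem {
    std::function<void()> postfix;  // the expensive step that runs in parallel
    WorkItem *parent = nullptr;
    int pendingChildren = 0;        // children whose postfix has not finished yet
    std::vector<std::unique_ptr<WorkItem>> children;
};

class PostfixScheduler {
public:
    // Runs every postfix callback in the tree, always children before parents.
    void run(WorkItem &root, unsigned numThreads) {
        remaining_ = prepare(root);  // serial top-down pass
        if (numThreads == 0) numThreads = 1;
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < numThreads; ++i)
            workers.emplace_back([this] { workerLoop(); });
        for (auto &w : workers) w.join();
    }

private:
    // Sets parent links and counters, queues leaves, returns the node count.
    std::size_t prepare(WorkItem &item) {
        std::size_t count = 1;
        item.pendingChildren = static_cast<int>(item.children.size());
        for (auto &child : item.children) {
            child->parent = &item;
            count += prepare(*child);
        }
        if (item.children.empty()) ready_.push(&item);
        return count;
    }

    void workerLoop() {
        for (;;) {
            WorkItem *item = nullptr;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Sleep until there is work or everything has finished.
                cv_.wait(lock, [this] { return !ready_.empty() || remaining_ == 0; });
                if (ready_.empty()) return;  // all nodes processed
                item = ready_.front();
                ready_.pop();
            }
            if (item->postfix) item->postfix();  // run outside the lock
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --remaining_;
                WorkItem *parent = item->parent;
                if (parent) {
                    assert(parent->pendingChildren > 0);
                    // The last child to finish makes the parent runnable.
                    if (--parent->pendingChildren == 0) ready_.push(parent);
                }
            }
            cv_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<WorkItem *> ready_;
    std::size_t remaining_ = 0;
};
```

In this sketch the counters and the queue are protected by a single mutex, which keeps the logic easy to follow; whether the PR uses one lock or something finer-grained is not stated here.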
testing
I've run some tests of the new implementation on several scad models and can see a fairly significant speedup, especially for complex models. Simple models often run a little faster with single-threaded evaluation because there's not much to parallelize and the multi-threading code adds some overhead. Below are some timing comparisons run on my computer of several example models. Note the last one is a large scad model that I downloaded from https://github.com/CarlosGS/Cyclone-PCB-Factory.
These numbers are from my computer which has a fairly fast cpu (Intel Core i7-6700K - 8 logical cores). Please try it out on your machines and with different models and report how well it works for you. I added a script at scripts/thread-comparison-report.py you can use to run several examples.
TODO
* Use std::mutex? boost::detail::spinlock for CGAL error lock
* difference() regression (see comment in "Hang when compiling with large number of objects & Some Multi-thread stuff" #2400)