Performance problems / best usage of the API #1640
Replies: 7 comments
-
This is great feedback, thank you for writing it down! Before we consider a change, on Mozilla's side we'll try to identify what pieces of the Firefox implementation are missing to run this test, and see how the numbers dance there. For the instance count, I don't know what should be done yet.
-
Thank you @Popov72 for the detailed feedback! +CC @austinEng explicitly if you have more ideas. The bindings between JavaScript and Blink are known to be somewhat slow and there's an effort underway to make them faster. Note that it would benefit both WebGPU and WebGL, so it's unclear how it would change the perf comparison you did. At the same time, we haven't fine-tuned Chromium's WebGPU Blink bindings yet, and there are still micro-optimizations to be found (Austin found 3-4% recently and I'm looking at whether more is available). Note that when measuring the GPU process, we have found WebGPU to be much faster than WebGL. But I understand that's not what you are bottlenecked on at the moment. That said, what I understood from the lengthy discussions we had is that the cost of WebGPU calls is just one aspect of the problem, the other being that it looks very difficult to take a WebGL-style engine and turn it into an efficient WebGPU-style engine. Retrofitting ahead-of-time pipeline and bind group creation is difficult when the engine is designed so that users can change anything (like glStencilRef) at any time. As you know, I don't have ideas for a great solution for this, except maybe 1) bind group replacement and 2) some kind of stateful pipeline builder. But I don't know if there's appetite in the group to discuss these at this time.
Bundling a single draw + its parameters is a bit small compared to what was initially envisioned for
We definitely want WebGPU to be faster than WebGL even without this. However, since WebGL does the equivalent of 4 WebGPU calls in a single VAO change, it's hard to beat on that specific aspect. Are you able to pack all the data in a single vertex buffer instead of 4 in Babylon?
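For reference, packing the separate streams into one interleaved buffer is mostly a matter of copying with a shared stride. A minimal sketch, assuming a position/normal/uv layout; the helper name and the 8-float stride are illustrative, not Babylon.js API:

```javascript
// Interleave separate position (vec3), normal (vec3) and uv (vec2) arrays
// into one buffer with a 32-byte stride, so a single setVertexBuffer call
// can replace three. Names and layout are illustrative.
function interleaveVertexData(positions, normals, uvs) {
  const vertexCount = positions.length / 3;
  const floatsPerVertex = 8; // 3 position + 3 normal + 2 uv
  const out = new Float32Array(vertexCount * floatsPerVertex);
  for (let i = 0; i < vertexCount; i++) {
    const base = i * floatsPerVertex;
    out[base + 0] = positions[i * 3 + 0];
    out[base + 1] = positions[i * 3 + 1];
    out[base + 2] = positions[i * 3 + 2];
    out[base + 3] = normals[i * 3 + 0];
    out[base + 4] = normals[i * 3 + 1];
    out[base + 5] = normals[i * 3 + 2];
    out[base + 6] = uvs[i * 2 + 0];
    out[base + 7] = uvs[i * 2 + 1];
  }
  return out;
}

// At draw time, one binding replaces three:
//   pass.setVertexBuffer(0, interleavedBuffer);
// with a vertex buffer layout of arrayStride 32 and three attributes at
// shaderLocation 0/1/2, offsets 0/12/24.
```

The trade-off is that the buffer must be re-interleaved whenever an attribute is added or removed, which is exactly the dynamism problem discussed elsewhere in this thread.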
I'd like us to try and micro-optimize the code for these in Blink. It should become much faster with the fast API calls and some other changes. I'm not sure what would be the right primitive to add to WebGPU otherwise.
No objections on our side. For other implementations I know
This seems like a great use case for
-
I'd love not to get too hung up on bundles unless there's a way of patching them (which is even more complicated). The way I am expecting to port my existing engine, I'm going to be uploading a giant UBO per frame that has all the frame data I care about (e.g. if an object is frustum-culled that frame, then it won't bother appearing in the UBO), and use dynamic uniform buffer offsets. So if I wanted to record and cache a bundle (which is iffy in my use case even in the best of circumstances), then I'd need some way of updating the dynamic UBO offsets per frame. There are other use cases that make bundles more challenging: an object can appear in multiple views, and each view can have multiple passes. For example, in split-screen multiplayer where each view has shadow maps, Z-prepass, etc., we can't just have a single bundle per object; we need a bundle per pass, per view. We would need to treat the bundles as a series of cached state changes, and caching is always challenging. Bundles should not be required to beat WebGL in terms of performance.
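The per-frame offset bookkeeping described above can be sketched as follows, assuming WebGPU's default 256-byte `minUniformBufferOffsetAlignment`; all names here are illustrative, not a real engine API:

```javascript
// Every visible object gets a 256-byte-aligned slice of one big per-frame
// uniform buffer; the bind group is created once, and only the dynamic
// offset changes per draw. Names are illustrative.
const UBO_ALIGNMENT = 256; // WebGPU's default minUniformBufferOffsetAlignment

function alignUp(size, alignment) {
  return Math.ceil(size / alignment) * alignment;
}

// Assign offsets for this frame's visible objects (culled ones get none).
function assignFrameOffsets(visibleObjectCount, bytesPerObject) {
  const stride = alignUp(bytesPerObject, UBO_ALIGNMENT);
  const offsets = [];
  for (let i = 0; i < visibleObjectCount; i++) offsets.push(i * stride);
  return { stride, offsets, totalSize: visibleObjectCount * stride };
}

// Per draw, the cached bind group is reused with a per-frame offset:
//   pass.setBindGroup(1, objectBindGroup, [offsets[i]]);
// A pre-recorded bundle bakes these offsets in at record time, which is
// exactly what makes bundles awkward for this scheme.
```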
-
I thought about it originally but figured it's not a solution. It doesn't allow them to parameterize a bundle. If a bundle does an indirect draw from buffer A at offset X, then there is no easy way to populate this data before running a bundle. Any population would have to be done outside the renderpass, so the same bundle can't be run with different parameters.
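For context, the parameters of an indirect draw live in a GPU buffer laid out as four consecutive 32-bit integers (per the WebGPU spec for `drawIndirect`), and writing that buffer must happen outside the pass. A small sketch, with illustrative names:

```javascript
// Argument block for drawIndirect, per the WebGPU spec:
// [vertexCount, instanceCount, firstVertex, firstInstance].
function makeDrawIndirectArgs(vertexCount, instanceCount, firstVertex = 0, firstInstance = 0) {
  return new Uint32Array([vertexCount, instanceCount, firstVertex, firstInstance]);
}

// The buffer write has to happen before the render pass begins, e.g.:
//   device.queue.writeBuffer(indirectBuffer, 0, makeDrawIndirectArgs(36, visibleCount));
//   const pass = encoder.beginRenderPass(passDesc);
//   pass.executeBundles([bundle]); // the bundle holds drawIndirect(indirectBuffer, 0)
//   pass.end();
// so a given bundle cannot run twice in the same pass with different values.
```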
I would like to support this strongly. Let's even up on bare calls first.
-
Note that what we call the "fast path" is what we think is compliant with the new WebGPU-style architecture: the pipeline/bind groups are cached and reused for all subsequent frames. However, as outlined in the samples above, this strategy is not enough to get the same or better perf in WebGPU than in WebGL, which is why we went a step further and created a bundle to encapsulate the pipeline/bind groups/vertex buffers/draw calls. Now it is this bundle that is cached and reused for subsequent frames. What other things do you have in mind when you say "turn it into an efficient WebGPU-style engine"? At least for pipelines/bind groups, I don't see how we can do better than to put them in a bundle and reuse the bundle afterwards.
That's what we do in the "very fast path", where we record all commands for the whole scene into a single bundle. In the future, we plan to allow the user to create lists of "static" objects, each list being backed by a bundle under the hood.
Not at the moment: the user can add/remove attributes at any time, and automatically merging buffers is not implemented (though the user can provide interleaved buffers, I think). However, using a bundle makes this a non-problem (even if we were able to merge the buffers ourselves, we would still have two calls (because of the index buffer) compared to a single call with a VAO).
Exactly, we basically need a different bundle each time we have a new drawing context for a mesh. That's where we are heading in Babylon for the "fast path". We know it won't handle everything automatically; we will need some help from the user in some cases (meaning the user will need to call something like
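One possible shape for the bundle cache management discussed in this thread: derive a key from the state a bundle baked in and re-record when that key changes. This is a sketch, not Babylon.js code; every name is illustrative:

```javascript
// A cached bundle stays valid only while everything it captured is unchanged.
// One way to detect invalidation: key the cache on the captured state.
function bundleKey(state) {
  return [
    state.pipelineId,
    state.bindGroupIds.join(","),
    state.vertexBufferIds.join(","),
    state.indexBufferId,
    state.instanceCount, // changing this alone forces a re-record today
  ].join("|");
}

class BundleCache {
  constructor() {
    this.entries = new Map(); // meshId -> { key, bundle }
  }
  // recordFn re-records the bundle on a miss
  // (device.createRenderBundleEncoder(...) ... encoder.finish() in practice).
  get(meshId, state, recordFn) {
    const key = bundleKey(state);
    const entry = this.entries.get(meshId);
    if (entry && entry.key === key) return entry.bundle;
    const bundle = recordFn();
    this.entries.set(meshId, { key, bundle });
    return bundle;
  }
}
```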
-
I've done some profiling on Firefox Nightly:
This was running Popov72/BabylonDev#1 - the code was just updated to latest, no optimizations added. A more idiomatic approach to WebGPU (i.e. sharing the buffers used to store data for multiple cubes, re-ordering the state changes, using bundles, etc.) would probably paint a different picture. Firefox Nightly already appears to run better on Linux and Windows, at least, and the content process time is improved compared to WebGL. We've also identified a first set of optimizations (around IPC transfers in particular) that would help this case. A lot of our time is currently spent copying data around and growing vectors.
-
The numbers seem quite close between Firefox and Chromium for WebGPU: in @Popov72's Chromium WebGPU trace the content RAF takes
-
I know the performance problem I'm about to describe is more related to the implementation than to the spec itself, but it could still be interesting to discuss here, because I may not be using the API as expected, and/or the API could be improved to speed up the implementation.
Notes:
The sample simply draws 3000 cubes. It binds 3 UBOs to be more in line with a real scenario. The WebGL version does the same. It uses VAOs, as they are available in WebGL2.
WebGPU sample source code
WebGL sample source code
On my computer (i7 6700K):
We are interested here in the CPU (Scripting) timing, not the GPU timing (not shown): our users are generally CPU- and not GPU-bound in the Web context, so it's important for us to keep the CPU time as low as possible.
As you can see above, the sample is globally 17% slower in WebGPU than in WebGL regarding the scripting time.
On one hand we could try to lower the number of `setBindGroup` calls to save some perf on WebGPU compared to WebGL, but on the other hand, in a real scenario we would have more than a single call to `setVertexBuffer` (at least 3, for the position / normal / uv buffers), which would widen the gap between WebGPU and WebGL because in WebGL all those buffers are held in a single VAO.

At first, in Babylon.js we tried to add a layer to our existing code to avoid too many structural changes, and implemented some global caching mechanisms to retrieve the right pipeline / bind groups corresponding to the current state of the engine before each draw call. Even though we tried hard to make those caches as fast as possible, in the end we are slower than WebGL because those caches simply don't exist there (so we call this path the "slow path").
However, as we can see in the numbers above, even when using the API at full speed (meaning no cache lookups to retrieve the pipeline/bind groups, just straight calls), we are still slower than WebGL (on the CPU side), whereas we would have expected to be at least as fast (and honestly we thought/hoped it would be faster).
That was a bit of a disappointment to us, and even if the difference could be narrowed in future browser updates, at best we will have comparable perf between WebGL and WebGPU where we would have hoped for better perf, at least for "marketing" sake: why switch to WebGPU if perf is the same or even worse? There are new features in WebGPU that can warrant a switch, but for a lot of people perf will be the selling point. So we tried to find a way to improve things. That is where bundles come in.
We are aware they are not meant to be used this way, but we use them simply as a container of API calls, to lower the number of calls done per frame. So, for each draw of a mesh we have a cached bundle that we reuse (the bundle contains the calls: 1 x `setRenderPipeline`, 1 x `setIndexBuffer`, n x `setVertexBuffer`, m x `setBindGroup`, 1 x `draw`/`drawIndexed`), and at the end of the frame we issue a single `executeBundles` call with the list of collected bundles: this way we can actually be faster than WebGL (we call this path the "fast path"). To be honest, it's a work in progress, as we now have a problem of bundle cache management: we must detect when to invalidate a bundle (and recreate it), but also try to limit the number of times we invalidate it so that the cache is worthwhile (recreating the bundle each frame would be moot).

So, back to the subject, from the spec point of view:
(`bindVertexArray` is 2.5x faster than the combined `setIndexBuffer` + `setVertexBuffer` calls): should we have something on par in WebGPU, or is the bundle actually what we need?

Also, when it comes to bundles, it would really be a must if they could support everything a render pass supports, that is, also support `setViewport`, `setScissorRect`, `setStencilReference` and `setBlendColor`. We currently have a "very fast path" (for mostly static scenes) where we record all GPU commands in a single bundle and execute this bundle for all subsequent frames. But because those 4 calls can't be recorded in a bundle, we need to create multiple bundles and issue these calls in between, directly on the current render pass.

In the "one bundle per draw" scenario explained above, one thing that could help avoid invalidating the bundle too often would be the ability to overload the `instanceCount` parameter of the `draw`/`drawIndexed` call recorded in the bundle when the bundle is executed. Indeed, in Babylon.js we have a use case where we are using instancing but the number of instances displayed varies depending on the camera position. So that's something that can change often, and each time it changes we need to recreate the bundle. I don't know if that is a sensible request; I guess at a minimum we would need something like a `DrawBundle` that would encapsulate one and only one `draw`/`drawIndexed` call, and a new API like `executeDrawBundles(DrawBundle[], integer[])` with `integer[]` being the overloaded `instanceCount` values... Random thoughts here, probably off the mark.

Again, I know this post is based on tests in a single browser so it should be taken with a grain of salt, but I'm not sure the outcome would be really different in other browsers or in a few months from now, our main problem on the CPU side being the cost of API calls.
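Putting the calls listed above together, recording one bundle per mesh and replaying the collected bundles with a single `executeBundles` could look roughly like this. Descriptor details and all `mesh`/`formats` field names are illustrative, not Babylon.js code:

```javascript
// Record one render bundle per mesh containing exactly the calls listed
// above: 1 setPipeline, m setBindGroup, n setVertexBuffer, 1 setIndexBuffer,
// 1 drawIndexed. Mesh/format fields are illustrative.
function recordMeshBundle(device, formats, mesh) {
  const encoder = device.createRenderBundleEncoder({
    colorFormats: formats.color,              // e.g. ["bgra8unorm"]
    depthStencilFormat: formats.depthStencil, // e.g. "depth24plus"
  });
  encoder.setPipeline(mesh.pipeline);
  mesh.bindGroups.forEach((bg, i) => encoder.setBindGroup(i, bg));
  mesh.vertexBuffers.forEach((vb, i) => encoder.setVertexBuffer(i, vb));
  encoder.setIndexBuffer(mesh.indexBuffer, "uint16");
  encoder.drawIndexed(mesh.indexCount);
  return encoder.finish();
}

// At the end of the frame, a single call replays everything:
//   pass.executeBundles(frameBundles);
```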