Performance problems / best usage of the API #1640
Replies: 7 comments
-
This is great feedback, thank you for writing it down! Before we consider a change, on Mozilla's side we'll try to identify what pieces of the Firefox implementation are missing to run this test, and see how the numbers dance there. For the instance count, I don't know what should be done yet.
-
Thank you @Popov72 for the detailed feedback! +CC @austinEng explicitly if you have more ideas. The bindings between JavaScript and Blink are known to be somewhat slow and there's an effort underway to make them faster. Note that it would benefit both WebGPU and WebGL, so it's unclear how it would change the perf comparison you did. At the same time, we haven't fine-tuned Chromium's WebGPU Blink bindings yet, and there are still micro-optimizations to be found (Austin found 3-4% recently and I'm looking at whether more is available). Note that when measuring the GPU process, we have found WebGPU to be much faster than WebGL. But I understand that's not what you are bottlenecked on at the moment. That said, what I understood from the lengthy discussions we had is that the cost of WebGPU calls is just one aspect of the problem, the other being that it looks very difficult to take a WebGL-style engine and turn it into an efficient WebGPU-style engine. Retrofitting ahead-of-time pipeline and bind group creation is difficult when the engine is designed so that users can change anything (like glStencilRef) at any time. As you know, I don't have ideas for a great solution for this, except maybe 1) bind group replacement and 2) some kind of stateful pipeline builder. But I don't know if there's appetite in the group to discuss these at this time.
Bundling a single draw + its parameters is a bit small compared to what was initially envisioned for
We definitely want WebGPU to be faster than WebGL even without this. However, since WebGL does the equivalent of 4 WebGPU calls in a single VAO change, it's hard to beat on that specific aspect. Are you able to pack all the data in a single vertex buffer instead of 4 in Babylon?
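For reference, packing the separate streams into one interleaved buffer is mostly a matter of copying with a shared stride. A minimal sketch, assuming a position/normal/uv layout; the helper name and the 8-float stride are illustrative, not Babylon.js API:

```javascript
// Interleave separate position (vec3), normal (vec3) and uv (vec2) arrays
// into one buffer with a 32-byte stride, so a single setVertexBuffer call
// can replace three. Names and layout are illustrative.
function interleaveVertexData(positions, normals, uvs) {
  const vertexCount = positions.length / 3;
  const floatsPerVertex = 8; // 3 position + 3 normal + 2 uv
  const out = new Float32Array(vertexCount * floatsPerVertex);
  for (let i = 0; i < vertexCount; i++) {
    const base = i * floatsPerVertex;
    out[base + 0] = positions[i * 3 + 0];
    out[base + 1] = positions[i * 3 + 1];
    out[base + 2] = positions[i * 3 + 2];
    out[base + 3] = normals[i * 3 + 0];
    out[base + 4] = normals[i * 3 + 1];
    out[base + 5] = normals[i * 3 + 2];
    out[base + 6] = uvs[i * 2 + 0];
    out[base + 7] = uvs[i * 2 + 1];
  }
  return out;
}

// At draw time, one binding replaces three:
//   pass.setVertexBuffer(0, interleavedBuffer);
// with a vertex buffer layout of arrayStride 32 and three attributes at
// shaderLocation 0/1/2, offsets 0/12/24.
```

The trade-off is that the buffer must be re-interleaved whenever an attribute is added or removed, which is exactly the dynamism problem discussed elsewhere in this thread.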
I'd like us to try and micro-optimize the code for these in Blink. It should become much faster with the fast API calls and some other changes. I'm not sure what would be the right primitive to add to WebGPU otherwise.
No objections on our side. For other implementations I know
This seems like a great use case for
-
I'd love not to get too hung up on bundles unless there's a way of patching them (which is even more complicated). The way I am expecting to port my existing engine, I'm going to be uploading a giant UBO per frame that has all the frame data I care about (e.g. if an object is frustum-culled that frame, then it won't bother appearing in the UBO), and use dynamic uniform buffer offsets. So if I wanted to record and cache a bundle (which is iffy in my use case even in the best of circumstances), then I'd need some way of updating the dynamic UBO offsets per frame. There are other use cases that make bundles more challenging: an object can appear in multiple views, and each view can have multiple passes. For example, in split-screen multiplayer where each view has shadow maps, Z-prepass, etc., we can't just have a single bundle per object; we need a bundle per pass, per view. We would need to treat the bundles as a series of cached state changes, and caching is always challenging. Bundles should not be required to beat WebGL in terms of performance.
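The per-frame offset bookkeeping described above can be sketched as follows, assuming WebGPU's default 256-byte `minUniformBufferOffsetAlignment`; all names here are illustrative, not a real engine API:

```javascript
// Every visible object gets a 256-byte-aligned slice of one big per-frame
// uniform buffer; the bind group is created once, and only the dynamic
// offset changes per draw. Names are illustrative.
const UBO_ALIGNMENT = 256; // WebGPU's default minUniformBufferOffsetAlignment

function alignUp(size, alignment) {
  return Math.ceil(size / alignment) * alignment;
}

// Assign offsets for this frame's visible objects (culled ones get none).
function assignFrameOffsets(visibleObjectCount, bytesPerObject) {
  const stride = alignUp(bytesPerObject, UBO_ALIGNMENT);
  const offsets = [];
  for (let i = 0; i < visibleObjectCount; i++) offsets.push(i * stride);
  return { stride, offsets, totalSize: visibleObjectCount * stride };
}

// Per draw, the cached bind group is reused with a per-frame offset:
//   pass.setBindGroup(1, objectBindGroup, [offsets[i]]);
// A pre-recorded bundle bakes these offsets in at record time, which is
// exactly what makes bundles awkward for this scheme.
```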
-
I thought about it originally but figured it's not a solution. It doesn't allow them to parameterize a bundle. If a bundle does an indirect draw from buffer A at offset X, then there is no easy way to populate this data before running a bundle. Any population would have to be done outside the renderpass, so the same bundle can't be run with different parameters.
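For context, the parameters of an indirect draw live in a GPU buffer laid out as four consecutive 32-bit integers (per the WebGPU spec for `drawIndirect`), and writing that buffer must happen outside the pass. A small sketch, with illustrative names:

```javascript
// Argument block for drawIndirect, per the WebGPU spec:
// [vertexCount, instanceCount, firstVertex, firstInstance].
function makeDrawIndirectArgs(vertexCount, instanceCount, firstVertex = 0, firstInstance = 0) {
  return new Uint32Array([vertexCount, instanceCount, firstVertex, firstInstance]);
}

// The buffer write has to happen before the render pass begins, e.g.:
//   device.queue.writeBuffer(indirectBuffer, 0, makeDrawIndirectArgs(36, visibleCount));
//   const pass = encoder.beginRenderPass(passDesc);
//   pass.executeBundles([bundle]); // the bundle holds drawIndirect(indirectBuffer, 0)
//   pass.end();
// so a given bundle cannot run twice in the same pass with different values.
```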
I would like to support this strongly. Let's even up on bare calls first.
-
Note that what we call the "fast path" is what we think is compliant with the new WebGPU-style architecture: the pipeline/bind groups are cached and reused for all subsequent frames. However, as outlined in the samples above, this strategy is not enough to get the same or better perf in WebGPU than in WebGL, which is why we went a step further and created a bundle to encapsulate the pipeline/bind groups/vertex buffers/draw calls. Now it is this bundle that is cached and reused for subsequent frames. What other things do you have in mind when you say "turn it into an efficient WebGPU-style engine"? At least for pipelines/bind groups, I don't see how we can do better than to put them in a bundle and reuse the bundle afterwards.
That's what we do in the "very fast path", where we record all commands for the whole scene into a single bundle. In the future, we plan to allow the user to create lists of "static" objects, each list being backed by a bundle under the hood.
Not at the moment: the user can add/remove attributes at any time, and automatically merging buffers is not implemented (though the user can provide interleaved buffers, I think). However, using a bundle makes this a non-problem (even if we were able to merge the buffers ourselves, we would still have two calls (because of the index buffer) compared to a single call with a VAO).
Exactly, we basically need a different bundle each time we have a new drawing context for a mesh. That's where we are heading in Babylon for the "fast path". We know it won't handle everything automatically; we will need some help from the user in some cases (meaning the user will need to call something like
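One possible shape for the bundle cache management discussed in this thread: derive a key from the state a bundle baked in and re-record when that key changes. This is a sketch, not Babylon.js code; every name is illustrative:

```javascript
// A cached bundle stays valid only while everything it captured is unchanged.
// One way to detect invalidation: key the cache on the captured state.
function bundleKey(state) {
  return [
    state.pipelineId,
    state.bindGroupIds.join(","),
    state.vertexBufferIds.join(","),
    state.indexBufferId,
    state.instanceCount, // changing this alone forces a re-record today
  ].join("|");
}

class BundleCache {
  constructor() {
    this.entries = new Map(); // meshId -> { key, bundle }
  }
  // recordFn re-records the bundle on a miss
  // (device.createRenderBundleEncoder(...) ... encoder.finish() in practice).
  get(meshId, state, recordFn) {
    const key = bundleKey(state);
    const entry = this.entries.get(meshId);
    if (entry && entry.key === key) return entry.bundle;
    const bundle = recordFn();
    this.entries.set(meshId, { key, bundle });
    return bundle;
  }
}
```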
-
I've done some profiling on Firefox Nightly:
This was running Popov72/BabylonDev#1 - the code was just updated to latest, no optimizations added. A more idiomatic approach to WebGPU (i.e. sharing the buffers used to store data for multiple cubes, re-ordering the state changes, using bundles, etc.) would probably paint a different picture. Firefox Nightly already appears to run better on Linux and Windows, at least, and the content process time is improved compared to WebGL. We've also identified a first set of optimizations (around IPC transfers in particular) that would help this case. A lot of our time is currently spent copying data around and growing vectors.
-
The numbers seem quite close between Firefox and Chromium for WebGPU: in @Popov72's Chromium WebGPU trace the content RAF takes
-
I know the performance problem I'm about to describe is more related to the implementation than to the spec itself, but it could still be interesting to discuss here, because I may not be using the API as expected, and/or the API could be improved to speed up the implementation.
Notes:
The sample simply draws 3000 cubes. It binds 3 UBOs to be more in line with a real scenario. The WebGL version does the same. It uses VAOs, as they are available in WebGL2.
WebGPU sample source code
WebGL sample source code
On my computer (i7 6700K):
We are interested here in the CPU (Scripting) timing, not the GPU timing (not shown): our users are generally CPU- and not GPU-bound in the Web context, so it's important for us to keep the CPU time as low as possible.
As you can see above, the sample is globally 17% slower in WebGPU than in WebGL regarding the scripting time.
On one hand we could try to lower the number of `setBindGroup` calls to save some perf on WebGPU compared to WebGL, but on the other hand, in a real scenario we would have more than a single call to `setVertexBuffer` (at least 3, for the position / normal / uv buffers), which would widen the gap between WebGPU and WebGL because in WebGL all those buffers are held in a single VAO.

At first, in Babylon.js we tried to add a layer to our existing code to avoid too many structural changes, and implemented some global caching mechanisms to retrieve the right pipeline / bind groups corresponding to the current state of the engine before each draw call. Even though we tried hard to make those caches as fast as possible, in the end we are slower than WebGL because those caches simply don't exist there (so we call this path the "slow path").
However, as we can see in the numbers above, even when using the API at full speed (meaning no cache lookups to retrieve the pipeline/bind groups, just straight calls), we are still slower than WebGL (on the CPU side), whereas we would have expected to be at least as fast (and honestly we thought/hoped it would be faster).
That was a bit of a disappointment to us, and even if the difference could be narrowed in future browser updates, at best we will have comparable perf between WebGL and WebGPU where we would have hoped for better perf, at least for "marketing" sake: why switch to WebGPU if perf is the same or even worse? There are new features in WebGPU that can warrant a switch, but for a lot of people perf will be the selling point. So we tried to find a way to improve things. That is where bundles come in.
We are aware they are not meant to be used this way, but we use them simply as a container of API calls, to lower the number of calls done per frame. So, for each draw of a mesh we have a cached bundle that we reuse (the bundle contains the calls: 1 x `setRenderPipeline`, 1 x `setIndexBuffer`, n x `setVertexBuffer`, m x `setBindGroup`, 1 x `draw`/`drawIndexed`), and at the end of the frame we issue a single `executeBundles` call with the list of collected bundles: this way we can actually be faster than WebGL (we call this path the "fast path"). To be honest, it's a work in progress, as we now have a problem of bundle cache management: we must detect when to invalidate a bundle (and recreate it), but also try to limit the number of times we invalidate it so that the cache is worthwhile (recreating the bundle each frame would be moot).

So, back to the subject, from the spec point of view:
(`bindVertexArray` is 2.5x faster than the combined `setIndexBuffer` + `setVertexBuffer` calls): should we have something on par in WebGPU, or is the bundle actually what we need?

Also, when it comes to bundles, it would really be a must if they could support everything a render pass supports, that is, also support `setViewport`, `setScissorRect`, `setStencilReference` and `setBlendColor`. We currently have a "very fast path" (for mostly static scenes) where we record all GPU commands in a single bundle and execute this bundle for all subsequent frames. But because those 4 calls can't be recorded in a bundle, we need to create multiple bundles and issue these calls in between, directly on the current render pass.

In the "one bundle per draw" scenario explained above, one thing that could help avoid invalidating the bundle too often would be the ability to overload the `instanceCount` parameter of the `draw`/`drawIndexed` call recorded in the bundle when the bundle is executed. Indeed, in Babylon.js we have a use case where we are using instancing but the number of instances displayed varies depending on the camera position. So that's something that can change often, and each time it changes we need to recreate the bundle. I don't know if that is a sensible request; I guess at a minimum we would need something like a `DrawBundle` that would encapsulate one and only one `draw`/`drawIndexed` call, and a new API like `executeDrawBundles(DrawBundle[], integer[])` with `integer[]` being the overloaded `instanceCount` values... Random thoughts here, probably off the mark.

Again, I know this post is based on tests in a single browser so it should be taken with a grain of salt, but I'm not sure the outcome would be really different in other browsers or in a few months from now, our main problem on the CPU side being the cost of API calls.
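Putting the calls listed above together, recording one bundle per mesh and replaying the collected bundles with a single `executeBundles` could look roughly like this. Descriptor details and all `mesh`/`formats` field names are illustrative, not Babylon.js code:

```javascript
// Record one render bundle per mesh containing exactly the calls listed
// above: 1 setPipeline, m setBindGroup, n setVertexBuffer, 1 setIndexBuffer,
// 1 drawIndexed. Mesh/format fields are illustrative.
function recordMeshBundle(device, formats, mesh) {
  const encoder = device.createRenderBundleEncoder({
    colorFormats: formats.color,              // e.g. ["bgra8unorm"]
    depthStencilFormat: formats.depthStencil, // e.g. "depth24plus"
  });
  encoder.setPipeline(mesh.pipeline);
  mesh.bindGroups.forEach((bg, i) => encoder.setBindGroup(i, bg));
  mesh.vertexBuffers.forEach((vb, i) => encoder.setVertexBuffer(i, vb));
  encoder.setIndexBuffer(mesh.indexBuffer, "uint16");
  encoder.drawIndexed(mesh.indexCount);
  return encoder.finish();
}

// At the end of the frame, a single call replays everything:
//   pass.executeBundles(frameBundles);
```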