
Performance problems / best usage of the API #1596

@Popov72

I know the performance problem I'm about to describe is more related to the implementation than to the spec itself, but it could still be interesting to discuss it here, because I may not be using the API as expected / the API could be improved toward speeding up the implementation.

Notes:

  • I have only tested with Chrome Canary, as the sample does not work in Firefox (as of 2021/04/02).
  • I'm working on the Babylon.js team to add support for WebGPU to the engine.

The sample simply draws 3000 cubes. It binds 3 UBOs to be more in line with a real scenario. The WebGL version does the same, and uses VAOs since they are available in WebGL2.

WebGPU sample source code

WebGL sample source code

On my computer (i7 6700K):

[Profiler screenshots: Chrome performance panel captures for the WebGPU and WebGL versions of the sample]

We are interested here in the CPU (Scripting) timing, not the GPU timing (not shown): our users are generally CPU-bound rather than GPU-bound in the Web context, so it's important for us to keep the CPU time as low as possible.

As you can see above, the sample is overall about 17% slower in WebGPU than in WebGL in terms of scripting time.

On the one hand, we could try to lower the number of setBindGroup calls to save some performance in WebGPU compared to WebGL; but on the other hand, in a real scenario we would have more than a single call to setVertexBuffer (at least 3, for the position / normal / uv buffers), which would widen the gap between WebGPU and WebGL because in WebGL all those buffers are held in a single VAO.
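To make the comparison concrete, here is a rough sketch of the per-draw call sequences being compared (the mesh fields and names are illustrative, not our actual engine code):

```ts
// WebGL2 hot loop (per draw): the VAO folds the index buffer and all vertex
// buffer/attribute bindings into a single call.
//   gl.bindVertexArray(mesh.vao);
//   gl.drawElements(gl.TRIANGLES, mesh.indexCount, gl.UNSIGNED_SHORT, 0);

// WebGPU hot loop (per draw): each vertex buffer is a separate call.
function drawMesh(pass: GPURenderPassEncoder, mesh: {
  pipeline: GPURenderPipeline;
  indexBuffer: GPUBuffer;
  vertexBuffers: GPUBuffer[];  // position / normal / uv, ...
  bindGroups: GPUBindGroup[];  // e.g. the 3 UBO bind groups
  indexCount: number;
}): void {
  pass.setPipeline(mesh.pipeline);
  pass.setIndexBuffer(mesh.indexBuffer, "uint16");
  mesh.vertexBuffers.forEach((vb, slot) => pass.setVertexBuffer(slot, vb));
  mesh.bindGroups.forEach((bg, index) => pass.setBindGroup(index, bg));
  pass.drawIndexed(mesh.indexCount);
}
```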

At first, in Babylon.js we tried to add a layer on top of our existing code to avoid too many structural changes, and implemented some global caching mechanisms to retrieve the right pipeline / bind groups corresponding to the current state of the engine before each draw call. Even though we tried hard to make those caches as fast as possible, in the end we are slower than WebGL, because those caches simply don't exist there (so we call this path the "slow path").
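For illustration, here is a heavily simplified sketch of the kind of lookup the slow path does before each draw (the real cache keys cover much more state, and the names here are hypothetical):

```ts
// Slow path (sketch): look up the pipeline matching the current engine state.
// Even a fast Map lookup is per-draw overhead that WebGL does not have, since
// there the driver tracks this state for us.
const pipelineCache = new Map<string, GPURenderPipeline>();

function getPipeline(device: GPUDevice,
                     state: { shaderId: string; blendMode: number; depthWrite: boolean },
                     descriptor: GPURenderPipelineDescriptor): GPURenderPipeline {
  const key = `${state.shaderId}|${state.blendMode}|${state.depthWrite ? 1 : 0}`;
  let pipeline = pipelineCache.get(key);
  if (!pipeline) {
    pipeline = device.createRenderPipeline(descriptor); // cache miss: create and store
    pipelineCache.set(key, pipeline);
  }
  return pipeline;
}
```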

However, as we can see in the numbers above, even when using the API at full speed (meaning no cache lookups to retrieve the pipeline/bind groups, just straight calls) we are still slower than WebGL (on the CPU side), whereas we expected to be at least as fast (and honestly we thought/hoped it would be faster).

That was a bit of a disappointment to us, and even if the difference could be narrowed in future browser updates, at best we will have comparable performance between WebGL and WebGPU, where we had hoped for better performance, at least for "marketing" sake: why switch to WebGPU if performance is the same or even worse? There are new features in WebGPU that can warrant a switch, but for a lot of people performance will be the selling point. So we tried to find a way to improve things. That is where bundles come in.

We are aware they are not meant to be used this way, but we use them simply as containers of API calls, to lower the number of calls made per frame. So, for each draw of a mesh we have a cached bundle that we reuse (the bundle contains the calls: 1 x setPipeline, 1 x setIndexBuffer, n x setVertexBuffer, m x setBindGroup, 1 x draw/drawIndexed), and at the end of the frame we issue a single executeBundles call with the list of collected bundles: this way we can actually be faster than WebGL (we call this path the "fast path"). To be honest, it's a work in progress, as we now have a problem of bundle cache management: we must detect when to invalidate a bundle (and recreate it), but also try to limit the number of times we invalidate it so that the cache is worthwhile (recreating the bundle every frame would defeat the purpose).
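A minimal sketch of that "one bundle per draw" cache, assuming the per-mesh GPU resources already exist (the CachedMesh shape and function names are illustrative, not Babylon.js API):

```ts
interface CachedMesh {
  pipeline: GPURenderPipeline;
  indexBuffer: GPUBuffer;
  vertexBuffers: GPUBuffer[];   // position / normal / uv, ...
  bindGroups: GPUBindGroup[];   // e.g. the 3 UBO bind groups
  indexCount: number;
  instanceCount: number;
  bundle?: GPURenderBundle;     // the cached recording
}

function getOrCreateBundle(device: GPUDevice, mesh: CachedMesh,
                           colorFormat: GPUTextureFormat): GPURenderBundle {
  if (mesh.bundle) {
    return mesh.bundle; // cache hit: replay the recorded calls as-is
  }
  const encoder = device.createRenderBundleEncoder({
    colorFormats: [colorFormat],
    depthStencilFormat: "depth24plus",
  });
  encoder.setPipeline(mesh.pipeline);                                          // 1 x setPipeline
  encoder.setIndexBuffer(mesh.indexBuffer, "uint16");                          // 1 x setIndexBuffer
  mesh.vertexBuffers.forEach((vb, slot) => encoder.setVertexBuffer(slot, vb)); // n x setVertexBuffer
  mesh.bindGroups.forEach((bg, index) => encoder.setBindGroup(index, bg));     // m x setBindGroup
  encoder.drawIndexed(mesh.indexCount, mesh.instanceCount);                    // 1 x drawIndexed
  mesh.bundle = encoder.finish();
  return mesh.bundle;
}

// Per frame: collect one bundle per mesh and issue a single executeBundles call.
function renderMeshes(device: GPUDevice, pass: GPURenderPassEncoder,
                      meshes: CachedMesh[], colorFormat: GPUTextureFormat): void {
  pass.executeBundles(meshes.map(m => getOrCreateBundle(device, m, colorFormat)));
}
```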

So, back to the subject, from the spec point of view:

  • are there any problems with using bundles like this?
  • is it expected that they be used like this to achieve high performance (and notably higher than WebGL)?
  • VAOs are one key reason WebGL performs better than WebGPU here (bindVertexArray is 2.5x faster than the combined setIndexBuffer + setVertexBuffer calls): should we have something on par in WebGPU, or are bundles actually what we need?

Also, when it comes to bundles, it would really help if they supported everything a render pass supports, that is, also setViewport, setScissorRect, setStencilReference and setBlendColor. We currently have a "very fast path" (for mostly static scenes) where we record all GPU commands in a single bundle and execute this bundle for all subsequent frames. But because those 4 calls can't be recorded in a bundle, we need to create multiple bundles and issue these calls in between, directly on the current render pass.
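A rough sketch of what that looks like today (the split point and names are illustrative):

```ts
// Because setViewport / setScissorRect / setStencilReference / setBlendColor cannot
// be recorded in a bundle, the pre-recorded frame has to be cut into several bundles,
// with the pass-level calls issued in between them.
function replayStaticScene(pass: GPURenderPassEncoder,
                           bundlesBeforeViewportChange: GPURenderBundle[],
                           bundlesAfterViewportChange: GPURenderBundle[],
                           viewport: { x: number; y: number; w: number; h: number }): void {
  pass.executeBundles(bundlesBeforeViewportChange);
  // This call is the reason the recording must be split in two:
  pass.setViewport(viewport.x, viewport.y, viewport.w, viewport.h, 0, 1);
  pass.executeBundles(bundlesAfterViewportChange);
}
```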

In the "one bundle per draw" scenario explained above, one thing that could help avoiding to invalidate the bundle too often would be able to overload the instanceCount parameter of the draw/drawIndexed call recorded in the bundle when the bundle is executed. Indeed, in Babylon.js we have a use case where we are using instancing but the number of instances displayed varies depending on the camera position. So that's something that can change often and each time it changes we need to recreate the bundle. I don't know if that is a sensible request, I guess at minima we would need something like a DrawBundle that would encapsulate one and a single one draw/drawIndexed call and a new API like executeDrawBundles(DrawBundle[], integer[]) with integer[] being the overloaded instanceCount values... Random thoughts here, probably off the mark.

Again, I know this post is based on tests in a single browser, so it should be taken with a grain of salt, but I'm not sure the outcome would be very different in other browsers or a few months from now; our main problem on the CPU side is the cost of API calls.
