[Impeller] Storing MSAA stencil texture has disproportionate impact on performance on Vulkan backend. #137302

@jonahwilliams

Description

Animated Advanced Blend performance on a Pixel 7 Pro is 50x worse than on an iPhone 11, with GPU times in excess of 50 ms (against an 8 ms budget). Below are two Mali Streamline traces recorded on a Pixel 7 Pro, one with MSAA enabled and one with MSAA disabled. The top-left corner of each shows the counter totals per time period, which for these graphs was 100 ms.

Some notable differences:

  • Texture External Read Beats: "The number of read beats received by the texture unit that required an external memory access due to an L2 cache miss." ~11.5x higher with MSAA enabled.
  • Texture L2 Read Beats: "The number of read beats received by the texture unit." (Possibly the number of reads that did not miss?) 4x higher with MSAA enabled.

The other external bus read/write metrics all seem to indicate that we are reading and writing far too much texture data. So why is this happening?

MSAA Enabled

[Image: Mali Streamline trace, MSAA enabled]

MSAA Disabled

[Image: Mali Streamline trace, MSAA disabled]

Is this due to stencil storage?

Right now we multisample the stencil buffer and store it across backdrop filters (to avoid repopulating it). Since this test app has no clips, we can experiment by disabling the stencil storage. This was done by making the stencil attachment transient and changing its load and store actions to Clear and DontCare, respectively.

Diff

diff --git a/impeller/entity/entity_pass.cc b/impeller/entity/entity_pass.cc
index 4d6897b5b3..980c3fe583 100644
--- a/impeller/entity/entity_pass.cc
+++ b/impeller/entity/entity_pass.cc
@@ -258,8 +258,7 @@ void EntityPass::AddSubpassInline(std::unique_ptr<EntityPass> pass) {
 
 static RenderTarget::AttachmentConfig GetDefaultStencilConfig(bool readable) {
   return RenderTarget::AttachmentConfig{
-      .storage_mode = readable ? StorageMode::kDevicePrivate
-                               : StorageMode::kDeviceTransient,
+      .storage_mode = StorageMode::kDeviceTransient,
       .load_action = LoadAction::kDontCare,
       .store_action = StoreAction::kDontCare,
   };
diff --git a/impeller/entity/inline_pass_context.cc b/impeller/entity/inline_pass_context.cc
index 4253e75987..f83247a1ad 100644
--- a/impeller/entity/inline_pass_context.cc
+++ b/impeller/entity/inline_pass_context.cc
@@ -142,13 +142,12 @@ InlinePassContext::RenderPassResult InlinePassContext::GetRenderPass(
 
   // Only clear the stencil if this is the very first pass of the
   // layer.
-  stencil->load_action =
-      pass_count_ > 0 ? LoadAction::kLoad : LoadAction::kClear;
+  stencil->load_action = LoadAction::kClear;
   // If we're on the last pass of the layer, there's no need to store the
   // stencil because nothing needs to read it.
   stencil->store_action = pass_count_ == total_pass_reads_
                               ? StoreAction::kDontCare
-                              : StoreAction::kStore;
+                              : StoreAction::kDontCare;
   pass_target_.target_.SetStencilAttachment(stencil.value());
 
   pass_target_.target_.SetColorAttachment(color0, 0);

Results

Performance is immediately much better than the MSAA-enabled baseline: roughly 80 FPS, up from roughly 4 FPS.

The results are much closer to the MSAA-disabled case, but why? The stencil buffer is only 8 bits per sample, so a 4x multisampled stencil is 32 bits per pixel, roughly the same size as the offscreen texture we use for each backdrop. Naively I'd expect something like a 2x difference from removing the stencil storage. Given how large the overhead actually is, storing the stencil may be causing some other form of deoptimization, or the implementation may be silently backing the S8 stencil with a full depth/stencil format. That would seem contrary to the spirit of Vulkan, but it is just a guess.

[Image: Mali Streamline trace, MSAA enabled with stencil storage disabled]

How to avoid storing the stencil:

I outlined an approximate solution in the Stencil-then-Cover (StC) discussion, under "Depth Buffer for Clipping" (Stencil Then Cover for Impeller). The gist is that rather than storing the stencil, we replay all stencil-affecting commands when recreating the root pass (a trivial no-op if there were none). A further benefit of this approach is that it also opens a route to depth/stencil usage for StC!

Metadata

Labels

  • P1 (High-priority issues at the top of the work list)
  • c: performance (Relates to speed or footprint issues; see "perf:" labels)
  • e: impeller (Impeller rendering backend issues and features requests)
  • team-engine (Owned by Engine team)
  • triaged-engine (Triaged by Engine team)

Type

No type

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests