Description
This is one approach to resolve #120399, #112648, and #106519 in an efficient way that avoids costly host<->device syncs.
We may or may not want to write a full design doc around what we actually end up doing here (if we end up writing a bigger doc around this approach, feel free to copy any or all of this). But here's the content from the doc I started writing around this topic many months ago:
Definitions
- Host/CPU - The computer that stages GPU command buffers.
- Device/GPU - The computer that executes command buffers.
- Command buffer - A list of commands (primarily, but not exclusively, for raster pipeline executions) constructed on the host and executed on the device.
- Recording time - The point at which backend-agnostic Impeller command buffers are being recorded.
- Encoding time - The point at which the Impeller Renderer converts backend-agnostic command buffers into native backend command buffers.
- Execution time - The point at which the native backend command buffers are submitted to a driver and execute on the device.
- Timeline - A sequence of ordered synchronization events for a given resource.
Device vs Host parallelism
There are two categories of parallelism that this design is concerned with maximizing:
- Execution time (GPU) parallelism: The Impeller Entities framework (primarily `EntityPass` and `FilterContents`) drip-feeds the GPU one command buffer at a time, and all of these command buffers execute serially, even in cases where the GPU has free ALUs that it could be using to execute a `RenderPass` from a different command buffer which happens to not share any render targets.
- Recording/encoding time (CPU) parallelism: Non-collapsed sibling `EntityPass`es don't have any overlap in the writable resources of the command buffers they construct, so sibling `EntityPass`es could safely encode their command buffers on separate threads and then send them to the GPU in one batch submit. Command recording/encoding isn't trivial! `EntityPass` performs all kinds of draw call culling tricks and pass simplification to minimize the memory footprint and repeated work on the GPU.
Backend resource timelines
The gist of the problem is that access to device-backed resources (textures/buffers) needs to be ordered (except for parallel reads). One possible way to allow Renderer users (like the Entities framework) to produce these "timeline" events for each resource would be to introduce an explicit Semaphore primitive in the Renderer API that Impeller commands can wait on and signal. This way, it's up to Renderer users to hook up these signals to achieve the intended ordering at recording time.
However, another possible approach is to just infer the correct per-resource synchronization timelines at encoding time without having to burden Renderer API users with the need to manage synchronization primitives.
Retain parallelism of device reads
Write operations are hard barriers for ordering, but multiple reads can be grouped together and happen in parallel in-between writes. The resource timeline needs additional state to toggle between a "mutable" mode and an "aliasing" mode. More concretely, reads only need to wait for the previous write to have finished (which is the same as waiting for all of the previous writes to have finished). But writes have the additional constraint of also needing to wait until all of the previously encountered reads have finished.
The below sections describe a minimal example solution for Vulkan 1.1 that retains maximum GPU parallelizability of reads.
Tracked synchronization primitives
First, every resource needs an ordered event timeline, so the backend explicitly tracks this state for every device allocated resource:
- Metal: An `MTLEvent` + a `read_start` index, defaulting to -1.
- Vulkan 1.1: A vector of `VkSemaphore`s + a `read_start` index, defaulting to -1.
- Vulkan 1.2: A timeline `VkSemaphore` + a `read_start` index, defaulting to -1.
- GLES2: No extra state necessary. GLES2 doesn't have command buffers or device synchronization primitives. Encoding time is execution time, and command execution is implicitly synchronous.
Note that all accesses of the resource timeline state should be thread-safe, and the order in which the user adds commands that read/write to textures at recording time should determine how the timeline unfolds (see also the "Thread safety" section below).
Example rules for Vulkan 1.1
Using Vulkan 1.1 as an example, the resource timeline can be tracked with the following rules:
- When writing to a resource (i.e. uploading from the host, binding as writable, using as an attachment, using as a blit destination, etc.), perform the following actions in order:
  1. If the resource timeline has at least one semaphore:
     - If `read_start == -1`: Append one semaphore wait command that waits on the last semaphore in the resource timeline.
     - If `read_start > -1`: For each semaphore with index >= `read_start` in the resource timeline semaphore list, append a semaphore wait command. Then set the `read_start` index to -1.
  2. Create a new semaphore and append it to the end of the resource timeline.
  3. Append the command which writes to the resource in question.
  4. Append a semaphore signal command to signal the semaphore that was appended to the resource timeline in step 2.
- When reading from a resource (i.e. binding as read-only, transferring to the host, etc.), perform the following actions in order:
  1. If the resource timeline has at least one semaphore:
     - If `read_start == -1`: Append one semaphore wait command that waits on the last semaphore in the resource timeline (which was appended for the last write operation).
     - If `read_start > 0`: Append one semaphore wait command that waits on the semaphore at index `read_start - 1` in the resource timeline (which is the index of the last semaphore appended for a write operation).
     - (If `read_start == 0`, the timeline contains only reads so far, so there is no prior write to wait on.)
  2. Create a new semaphore and append it to the end of the resource timeline.
  3. If `read_start == -1`: Set `read_start` to the index of the new semaphore created in step 2.
  4. Append the command which reads from the resource in question.
  5. Append a semaphore signal command to signal the semaphore that was appended to the resource timeline in step 2.
Thread safety/nondeterministic timeline ordering
We can get away with making all interactions with the timelines thread-safe as a catch-all. If we did so, dependency logic errors at command recording time would just cause nondeterministic usage order -- which wouldn't be a validation/crash problem, but might not produce the intended results. Take this scenario, for example:
It's clear that RenderPassA should be evaluated before RenderPassC and that RenderPassB should be evaluated before RenderPassD, but it's not clear whether RenderPassC should be evaluated before or after RenderPassB. If the user happens to care about this order, they need to make sure that the commands which bind or attach TextureA are appended in the correct order. But maybe the user doesn't care, or maybe the user happens to know that all of these RenderPasses are commutative, and so chooses to run the two command encoding tasks in parallel jobs.
