Multiple Queues skeleton proposal. #95
design/MultipleQueues.md (outdated):

> # Multiple Queues
>
> Queues are created at Device creation time, and are exposed via `sequence<Queue> Device.queues`.
> There are multiple types of queues, denoted by the `.type` bitfield on the Queue object.
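For illustration, here is one hedged guess at what such a `.type` bitfield could look like. The flag values, interface, and helper names below are all invented; as noted in the discussion, no such field exists in the WebGPUQueue IDL yet.

```typescript
// Hypothetical sketch of the proposed Queue.type bitfield (not in any shipped IDL).
const QueueTypeFlags = {
  GRAPHICS: 0x1,
  COMPUTE: 0x2,
  // Under this proposal, a queue with neither flag set would still support copies.
} as const;

interface QueueSketch {
  type: number; // bitwise OR of QueueTypeFlags values
}

function supportsGraphics(q: QueueSketch): boolean {
  return (q.type & QueueTypeFlags.GRAPHICS) !== 0;
}

function supportsCompute(q: QueueSketch): boolean {
  return (q.type & QueueTypeFlags.COMPUTE) !== 0;
}

// "Neither" queue: no flags set, copy commands only.
function isCopyOnly(q: QueueSketch): boolean {
  return q.type === 0;
}
```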
I am not seeing the .type bit field on the WebGPUQueue definition. Is that something planned for later?
design/MultipleQueues.md (outdated):

> Queues are created at Device creation time, and are exposed via `sequence<Queue> Device.queues`.
> There are multiple types of queues, denoted by the `.type` bitfield on the Queue object.
> Queues can support `graphics` operations, `compute` operations, both, or neither.
> "Neither" Queues can only execute copy commands.
I would prefer that copy operations have their own Boolean, or that we use an enum whose description clearly defines which operations are supported.
D3D12 has recently added video decode and processing queues, neither of which support copy operations. If we ever add them to WebGPU, the API will need to be clear that they do not support copy.
wow, a graphics queue with no copy operations, fancy!
In D3D12, graphics queues can have copy operations. Video related queues cannot.
Cool! Please let us know when there's documentation about this. I can't really integrate this knowledge until the interfaces have public documentation though.
Unfortunately, the documentation for video queues is a ways away.
To be clear, I am not proposing that we add video queues to WebGPU for MVP, or even V1.0. But I would like for them to be easily added one day in the future. Until then, we should avoid having verbiage that says "absence of all flags implies foo behavior" because we may add new flags in the future that do not apply to foo behavior.
If we feel that adding a flag for copy may make developers nervous about needing to write code that handles graphics queues with no copy operations, then we should consider enums that clearly state what operations are supported for the enum value.
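A sketch of the enum alternative being suggested here, where each value's supported operations are spelled out explicitly rather than implied by the absence of flags. All names are hypothetical, including the future video kind.

```typescript
// Hypothetical enum: each value states its full set of supported operations,
// so adding future queue kinds (e.g. video) cannot silently imply copy support.
enum QueueKind {
  GraphicsComputeCopy, // "universal" queue
  ComputeCopy,
  CopyOnly,
  VideoDecode, // illustrative future kind: no copy support implied
}

function supportsCopy(kind: QueueKind): boolean {
  switch (kind) {
    case QueueKind.GraphicsComputeCopy:
    case QueueKind.ComputeCopy:
    case QueueKind.CopyOnly:
      return true;
    default:
      // New kinds must opt in explicitly; nothing falls through to "copy works".
      return false;
  }
}
```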
design/MultipleQueues.md (outdated):

> ## Further work
>
> - Transfering resources between queues, particularly for VK_SHARING_MODE_EXCLUSIVE, as opposed to _CONCURRENT.
> - App-requested number of same-type queues, instead of implementation-dictated-only.
I think the browser should tell the developer which kinds of queues they can create, and allow them to create multiple queues of different types.
I do not think the browser should try to guess the number of optimal queues since that is dependent on the content.
Suppose you have a fictional system with 6 compute cores. If the developer submits six dispatch calls with no dependencies between them, the driver is free to run each operation on a different core, utilizing all six of them. Similarly, the driver can also utilize the same 6 cores if the developer submits a single dispatch call with maxThreads=6. In these two instances, queuing things on compute queues 2-6 will do the developer no good.
I don't think queues map to GPU cores, any GPU core can execute any queue's work... at least in Vulkan.
@devshgraphicsprogramming right. Can you come up with a different example on when exposing the available queue types/counts would make the user code run better (which is the point @RafaelCintron is making)?
I think the nicely illustrated example in this post https://mynameismjp.wordpress.com/2018/06/17/breaking-down-barriers-part-3-multiple-command-processors/, as well as the preceding post, illustrates the need for multiple queues from the same family.
The question here is not whether or not to have multiple queues (PR is all about multiple queues). The question is whether we should expose what is available, or just let the user tell us what they need and want to work with.
I address that better in my other reply #95 (comment)
I don't think you want to put yourself in the position of the scheduler and dependency tracker, you're pretty much making and doing what an enhanced OpenGL implementation would do.
Also, the example provided by @RafaelCintron is a bit moot:

> Suppose you have a fictional system with 6 compute cores. If the developer submits six dispatch calls with no dependencies between them, the driver is free to run each operation on a different core, utilizing all six of them. Similarly, the driver can also utilize the same 6 cores if the developer submits a single dispatch call with maxThreads=6. In these two instances, queuing things on compute queues 2-6 will do the developer no good.

A single queue will fill up all of your Compute Units / Shader Processors with work anyway, so that maxThreads parameter is moot.

In the absence of dependencies, as in @RafaelCintron's example, a single queue and command processor will keep your entire GPU with all of its CU/SP/SMs busy with no idle time (bar a context switch between compute and graphics on old NVIDIAs).
You only need as many queues as the maximal slice in your unweighted Work-DAG of execution dependencies. I forgot my high-school discrete math... but it basically looks like maximum-cut, which is NP-complete.
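The "maximal slice" intuition above can be sketched cheaply. (For what it's worth, finding a maximum antichain in a DAG is actually solvable in polynomial time via Dilworth's theorem and matching, rather than being NP-complete.) Below is a rough lower bound using longest-path levels: group nodes by depth and take the widest level. All names are invented for illustration.

```typescript
// Rough sketch: group DAG nodes by longest-path depth and take the widest level.
// This lower-bounds the maximum antichain (the true "maximal slice").
type Dag = Map<string, string[]>; // node -> its dependency (parent) nodes

function levelWidth(dag: Dag): number {
  const depth = new Map<string, number>();
  const compute = (n: string): number => {
    if (depth.has(n)) return depth.get(n)!;
    const parents = dag.get(n) ?? [];
    const d =
      parents.length === 0
        ? 0
        : 1 + Math.max(...Array.from(parents, compute));
    depth.set(n, d);
    return d;
  };
  for (const n of Array.from(dag.keys())) compute(n);

  // Count nodes per level; the widest level bounds how many queues can help.
  const counts = new Map<number, number>();
  for (const d of Array.from(depth.values())) {
    counts.set(d, (counts.get(d) ?? 0) + 1);
  }
  return Math.max(...Array.from(counts.values()));
}
```

A diamond-shaped dependency graph (one fan-out of two independent passes) has width 2, so a second queue could help; a pure chain has width 1, so extra queues buy nothing.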
design/MultipleQueues.md (outdated):

> ## Further work
>
> - Transfering resources between queues, particularly for VK_SHARING_MODE_EXCLUSIVE, as opposed to _CONCURRENT.
D3D12's D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS flag is relevant here.
Simultaneous access guarantees that a resource is in a layout that can be read from any queue type. Hence, it allows you to mix SRV and UAV states across different queues running at the same time. If you know you're not going to be mixing these for the same resource, you can leave the flag off.
Regardless of the state of the flag, undefined behavior will result if you write to a region of a resource that is being read by an operation on a different queue or the same queue.
+1, we need to figure out how we are going to avoid write-write and read-write hazards.
In Vulkan, only one queue family can be the owner of a resource, AFAIK.
Not with VK_SHARING_MODE_CONCURRENT
"Ranges of buffers and image subresources of image objects created using VK_SHARING_MODE_CONCURRENT must only be accessed by queues from the queue families specified through the queueFamilyIndexCount and pQueueFamilyIndices members of the corresponding create info structures."
You also need to specify up-front which families you will share with when creating the objects.
Also note the performance impact:

> "VK_SHARING_MODE_CONCURRENT may result in lower performance access to the buffer or image than VK_SHARING_MODE_EXCLUSIVE"
https://www.reddit.com/r/vulkan/comments/7m6h0x/just_using_vk_sharing_mode_concurrent/
SachaWillems - "You'll notice differences on lower spec devices, esp. on mobile. So if you want to support mobile too, just using concurrent all the time is not a good idea, as it also forces you to always specify (up front) what queues will access the image."
`copy` not guaranteed. QueueFamily should be user-constructable. Note that hints might help reduce overhead.
> ## QueueFamily
>
> There are multiple families of queues.
> Availible families of queues are surfaced via `sequence<QueueFamily> Adapter.queueFamilies`.
type: Availible
typo: type 🤣
> a fence is complete, followed by submitting command buffers that depend on that fence.
>
> Synchronizing access to resources across queues should be done by telling a queue to wait until a fence is complete, followed by submitting command buffers that depend on that fence.
> Resource data may have multiple concurrent readers, or exclusively one writer, at a time.
how does this work w.r.t different types of read access? Say, one queue uses resource A as a vertex buffer, another queue uses it as a uniform buffer. Both only read, but neither knows about what exact state this resource is expected to be in.
Big issue here: fences and events in Vulkan cannot be signalled on one queue and waited upon in another.
You need a VkSemaphore for that.
> If different commands would violate this exclusion, the implmentation injects synchronization.
> If a command is submitted that will never be able to synchronize for exclusion without subsequent user commands, that Submit is refused.
>
> [1]: For MVP, we require reads/write exclusion at whole-resource granularity, instead of allowing subranges.
Or we can just say for MVP that we only expose a single queue?
Thanks for the write-up. The queue family request mechanism looks great! I do have major concerns with the synchronization aspect, though.

Basically, if the implementation is able to do something useful with implicit synchronization like you described, then it can do the same thing with a Metal-style model with a single queue and potential automatic parallelization under the "as if" rule. Metal-style is easier for applications, and thus a better alternative to multiple queues with implicit synchronization.

To be clear, I'm not suggesting we expose a single queue and assume implementations will parallelize workloads. Both this and your proposal seem too hard to implement usefully, given we don't have access to the hardware but have to go through somewhat inflexible APIs in D3D12 and Vulkan.
> The most capable family is always first in the `Adapter.queueFamilies` list.
> (It will either be a graphics+compute, or at least compute)
> Exposed adapters must have at least a compute family.
> Users can also create `QueueFamily`s.
What does this mean?
Users shouldn't create QueueFamilies; they are an intrinsic part of the device and can only be queried (Vulkan).
In Vulkan there will always be one queue family that can do everything basic like graphics, transfer and compute.
> In Vulkan there will always be one queue family that can do everything basic like graphics, transfer and compute.

@jdashg keeps saying that this is only true if there is a queue family that supports graphics at all.
> ## Synchronization
>
> Queues can have fences inserted into them, and any queue can wait on that fence to be complete.
Some links:
Metal
- MTLEvent
- MTLSharedEvent
- MTLCommandBuffer.encodeWaitForEvent(_:value:)
- MTLCommandBuffer.encodeSignalEvent(_:value:)
D3D12
- ID3D12Fence
- ID3D12CommandQueue::Signal()
- ID3D12CommandQueue::Wait()
- ID3D12GraphicsCommandList::ResourceBarrier()
Vulkan
- Fences
- Events
- Semaphores
A queue cannot wait on a fence signalled on another queue... that will crash badly in Vulkan.
For cross-queue sync you need a semaphore.
> Synchronizing access to resources across queues should be done by telling a queue to wait until a fence is complete, followed by submitting command buffers that depend on that fence.
> Resource data may have multiple concurrent readers, or exclusively one writer, at a time.
> Resources may have multiple concurrent readers and writers, so long as all subranges satisfy many-read/single-write exclusion. [1]
> If different commands would violate this exclusion, the implmentation injects synchronization.
implmentation
Again, make the distinction between:
- Pipeline Barrier
- Fence
- Event
- Semaphore
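For concreteness, the many-read/single-write subrange rule from the quoted design text can be modeled as a pairwise interval-overlap check. This is a toy validator under assumed semantics, not the proposed implementation.

```typescript
// Sketch: validate that concurrent accesses to one resource satisfy
// many-read/single-write exclusion per subrange.
interface Access {
  offset: number;
  size: number;
  write: boolean;
}

// Half-open ranges [offset, offset + size) overlap?
function rangesOverlap(a: Access, b: Access): boolean {
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// True iff every overlapping pair of accesses is read/read.
function satisfiesExclusion(accesses: Access[]): boolean {
  for (let i = 0; i < accesses.length; i++) {
    for (let j = i + 1; j < accesses.length; j++) {
      const a = accesses[i];
      const b = accesses[j];
      if (rangesOverlap(a, b) && (a.write || b.write)) return false;
    }
  }
  return true;
}
```

Under the MVP note in the quoted text, every access would effectively span the whole resource, so any write alongside any other access fails the check.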
> ~~~
>
> For a `submit(sequence<WebGPUCommandBuffer> buffers)`, implementations inject any required synchronization.
> On submit, each CommandBuffer in turn traverses its resources and identifies any outstanding synchronization requirements.
👍
wait.. this means that you will require and expect submitted command buffers to execute in-order of submission!?
What this is saying is that we (WebGPU implementation) will insert command buffers doing the memory barriers / resource transitions between some of those in a sequence. That doesn't serialize the execution (by GPU) of the command buffers more than the user would do in Vulkan.
E.g. if one command buffer is writing into a UAV and another one is using the same resource as a vertex buffer, WebGPU implementation will insert the transition. The user would do the same in Vulkan/D3D12.
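One way to picture what's being described — the implementation splicing transitions between command buffers at submit time — is a toy model like the following. All types and usage names are invented for illustration; a real implementation would track far richer state.

```typescript
// Toy model of implicit barrier injection at submit time.
type Usage = "vertex" | "uniform" | "storage-write" | "copy-src" | "copy-dst";

// resource id -> how this command buffer uses it
interface CmdBuf {
  uses: Map<string, Usage>;
}

interface Barrier {
  resource: string;
  from: Usage;
  to: Usage;
}

// Walk the buffers in submission order; whenever a resource's usage changes
// relative to its last known usage, record a transition to splice in.
function injectBarriers(
  lastKnown: Map<string, Usage>,
  submission: CmdBuf[]
): Barrier[] {
  const barriers: Barrier[] = [];
  for (const cb of submission) {
    for (const [res, usage] of Array.from(cb.uses)) {
      const prev = lastKnown.get(res);
      if (prev !== undefined && prev !== usage) {
        barriers.push({ resource: res, from: prev, to: usage });
      }
      lastKnown.set(res, usage); // carry state forward across buffers/submits
    }
  }
  return barriers;
}
```

For the UAV-then-vertex-buffer example above, this model emits exactly one storage-write → vertex transition, and nothing for subsequent buffers that keep the same usage.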
Ok, so you will go on a quest to find dependency chains between the uses of resources?

I hate to tell you, but that is exactly what OpenGL has to do... and that makes creating a bug-free and performant OpenGL implementation (driver) very, very hard. You'd have to do amazing work, on par with the engineers at Nvidia and AMD, to:
- insert the transitions correctly, avoiding undefined behaviour, artefacts, and other problems
- not insert too many, which causes subpar performance and serialization
- do it fast

I think a dependency graph of the resources used in the command buffers you will be processing for submission will be an enormous DAG and quite a challenge to analyze.

The Vulkan approach of having the user explicitly provide the transitions ahead of time and having them baked into the command buffer, rather than trying to insert them at runtime, is most probably the biggest reason (jointly with offline shader compilation) why Vulkan applications are seeing the often-quoted 60% overall-CPU-utilization reductions on mobile devices, even when multithreading and worker-thread command buffer generation is accounted for.
Yes, that's pretty much where the group is heading at the moment.

> I think a dependency graph of the resources used in the command buffers you will be processing for submission will be an enormous DAG and quite a challenge to analyze.

I'm not as pessimistic. There aren't that many resources that change their usage through the frame: say, a hundred render targets plus a bunch of buffers we write as UAV. This isn't an enormous DAG to analyze.

> The Vulkan approach of having the user explicitly provide the transitions ahead of time and having them baked into the command buffer, while not trying to insert them at runtime

The good thing for WebGPU (contrary to OpenGL) is that we have command buffers. So the barriers inside command buffers will also need to be computed only once (either at recording, or at the first submission, depending on the implementation). It's just the barriers between command buffers that we'll need to compute and insert on every submission at runtime.
The whole point of the "render pass" system added in the latest D3D12 is to help track and validate these transitions, Metal already does this tracking implicitly, and it's now a common tactic for engines to track them themselves as well (see https://www.gdcvault.com/play/1024656/Advanced-Graphics-Tech-Moving-to and http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-D3D12-And-Vulkan-Done-Right.pdf).
well yes, so since engines do it themselves.... maybe you should let them and make your lives easier?
There is great deal of historical context to this discussion. Basically, since the browser implementations have to track the lifetime and usage of resources for validation anyway, it's not a big step to derive the barriers from this info.
Lifetime resource tracking is easiest: instead of deleting (where the usual JS GC would do it), put the resource on a list with an associated API fence inserted after its last API use (or wherever you'd want the delete)... I do that already.

Barriers are easy (in OpenGL you basically slam a glMemoryBarrier before the first read from a modified resource).

Begin/end dependencies are a bit harder, and I would really like to see an implementation that is able to insert and validate them before command submission to the virtual queue (the WebGPU queue, not the VK queue), so we can benefit from pre-validated and pre-compiled command buffers like we do in D3D12 and VK.
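The deferred-deletion scheme described above — put the resource on a list with a fence inserted after its last use, and destroy it once the fence completes — is straightforward to sketch. The `Fence` interface here is hypothetical, standing in for whatever completion query the API provides.

```typescript
// Sketch: deferred resource destruction guarded by a completion fence.
interface Fence {
  completed(): boolean; // hypothetical: has the GPU passed this point?
}

interface PendingDelete {
  fence: Fence;
  destroy: () => void;
}

class DeferredDeleter {
  private pending: PendingDelete[] = [];

  // Instead of destroying immediately, queue the destroy behind a fence
  // inserted after the resource's last GPU use.
  schedule(fence: Fence, destroy: () => void): void {
    this.pending.push({ fence, destroy });
  }

  // Called periodically (e.g. once per frame): destroy whatever the GPU
  // is provably done with; returns how many resources were freed.
  collect(): number {
    const ready = this.pending.filter((p) => p.fence.completed());
    this.pending = this.pending.filter((p) => !p.fence.completed());
    for (const p of ready) p.destroy();
    return ready.length;
  }
}
```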
> In Vulkan, synchronization injection takes the form of synthesizing VkSubmitInfos, synchronizing via VkSemaphores and VkPipelineStageFlags, as well as submitting synthesized CommandBuffers containing memory barriers and queue family transfers.
> (Queue family transfers may require submitting synthesized CommandBuffers to other queues, as well)
>
> Implementation may warn users about synchronization overhead.
Implementations
> ## QueueFamily
>
> There are multiple families of queues.
What is the benefit of exposing families? Why not make each queue independent, having its own capabilities?
Because multiple queues from the same family do not need to transfer ownership between each other
They also allow for less idling when complex execution dependencies are present
https://mynameismjp.wordpress.com/2018/06/17/breaking-down-barriers-part-3-multiple-command-processors/
I don't think we've discussed the technique for ownership transfer. Depending on how that works, the implementation could be the one determining which barriers are issued, so it would know internally if the two queues happen to be in the same family and omit barriers accordingly.
Yes, but this transfer will be costly, so the user needs the queue family information exposed, so that they can avoid using resources across queues from different families.
> };
>
> interface WebGPUQueueFamily {
>     readonly attribute boolean graphics;
8 different possibilities? The DX12 style is easier to work with.
> void submit(sequence<WebGPUCommandBuffer> buffers);
> WebGPUFence insertFence();
> void wait(WebGPUFence);
Queues can wait but they can't signal?
I think this commit is missing a rebase.
> Resources may have multiple concurrent readers and writers, so long as all subranges satisfy many-read/single-write exclusion. [1]
> If different commands would violate this exclusion, the implmentation injects synchronization.
> If a command is submitted that will never be able to synchronize for exclusion without subsequent user commands, that Submit is refused.
I think I know what you're trying to get at here. But, in case I am misunderstanding, please provide an example that would get rejected by the API.
> AccessBits access;
> };
> list<LastAccess> last_accesses;
> ~~~
If this information is kept inside of each resource, how are we going to handle race conditions where one resource is used by multiple queues from multiple web workers? Should we make resources become read-only once they're transferred between workers? Even if we do, @kvark's point about there being different types of reading (shader resource vs UAV, etc.) still holds. D3D12's D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS is relevant here.
> Queues are created at Device creation time, and are exposed via `sequence<Queue> Device.queues`.
> Each element in `sequence<QueueFamily> WebGPUDeviceDescriptor.queueRequests` results in a corresponding element in `Device.queues`.
> If an user-provided QueueFamily does not match a family in `Adapter.queueFamilies`, a more capable family is used.
> If no available family can satisfy an user-provided family, device creation fails.
Will it be an error if the developer passes an empty sequence for queueRequest?
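The fallback rule in the quoted text ("a more capable family is used") could plausibly be implemented as a capability-superset search over the adapter's families, which are ordered most-capable-first. Everything below is an illustrative guess, not the proposal's actual algorithm; note that this sketch simply returns an empty list for an empty request, leaving open the question asked above.

```typescript
// Sketch: match each requested family to the first available family whose
// capabilities are a superset of the request; fail device creation otherwise.
interface Family {
  graphics: boolean;
  compute: boolean;
}

function satisfies(available: Family, requested: Family): boolean {
  return (
    (!requested.graphics || available.graphics) &&
    (!requested.compute || available.compute)
  );
}

// Returns the matched families in request order, or null if the device
// creation would fail because some request cannot be satisfied.
function matchRequests(
  available: Family[],
  requests: Family[]
): Family[] | null {
  const result: Family[] = [];
  for (const req of requests) {
    // `available` is assumed sorted most-capable-first, so the first hit
    // is "a more capable family" when there is no exact match.
    const match = available.find((f) => satisfies(f, req));
    if (!match) return null;
    result.push(match);
  }
  return result;
}
```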
> //WebGPULimits limits; Don't expose higher limits for now.
>
> // TODO are other things configurable like queues?
> sequence<WebGLQueueFamily> queueRequests;
nit: WebGPUQueueFamily
> readonly attribute DOMString name;
> readonly attribute WebGPUExtensions extensions;
> //readonly attribute WebGPULimits limits; Don't expose higher limits for now.
> readonly attribute sequence<WebGLQueueFamily> queueFamilies;
ditto WebGPUQueueFamily
Following yesterday's discussion, I think it makes sense for the Web to not have explicit transfers between queues, but automatic parallelization and hints instead. For Dawn, I don't think we can do any useful parallelization without at least these hints:

Even then, the implementation looks like it will be insanely complex and will need to juggle a ton of timelines and events/semaphores. It will super easily fall off the parallel path if the application forgets something, but at most we'll print a warning and keep going.

We decided to talk about multi-queue because it influenced the design of command buffers and buffer mapping. #100 resolves the command buffer discussion and makes command buffers created from queues. For buffer mapping we still need to discuss some more, but I'm hopeful this pull request will give enough information to know what the design constraints are.

At this point I think we should resolve here on 1) how you request and get queues, and 2) what single queue is exposed in the MVP. We can also record that, for "webbiness" reasons, multi-queue barriers / ownership transfers should be implicit + hints. Discussion of exactly what the hints are and how multiple queues interact together could be left for later.

Just to be clear, this thread had a lot of value (at least for me) in defining how you request queues and agreeing on implicit synchronization.
Again, in VK explicit ownership transfer is necessary only between different queue FAMILIES, not different queues in the same family.
Closing stale PR; we will revisit this as a proposal when we get to looking back at the
There's still a bunch of work to do here, but this should give us something to talk about, and point us in the right direction.