Multiple Queues skeleton proposal. #95
design/MultipleQueues.md (outdated):

> # Multiple Queues
>
> Queues are created at Device creation time, and are exposed via `sequence<Queue> Device.queues`.
> There are multiple types of queues, denoted by the `.type` bitfield on the Queue object.
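For illustration, here is one hedged guess at what such a `.type` bitfield could look like. The flag values, interface, and helper names below are all invented; as noted in the discussion, no such field exists in the WebGPUQueue IDL yet.

```typescript
// Hypothetical sketch of the proposed Queue.type bitfield (not in any shipped IDL).
const QueueTypeFlags = {
  GRAPHICS: 0x1,
  COMPUTE: 0x2,
  // Under this proposal, a queue with neither flag set would still support copies.
} as const;

interface QueueSketch {
  type: number; // bitwise OR of QueueTypeFlags values
}

function supportsGraphics(q: QueueSketch): boolean {
  return (q.type & QueueTypeFlags.GRAPHICS) !== 0;
}

function supportsCompute(q: QueueSketch): boolean {
  return (q.type & QueueTypeFlags.COMPUTE) !== 0;
}

// "Neither" queue: no flags set, copy commands only.
function isCopyOnly(q: QueueSketch): boolean {
  return q.type === 0;
}
```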
I am not seeing the .type bit field on the WebGPUQueue definition. Is that something planned for later?
design/MultipleQueues.md (outdated):

> Queues are created at Device creation time, and are exposed via `sequence<Queue> Device.queues`.
> There are multiple types of queues, denoted by the `.type` bitfield on the Queue object.
> Queues can support `graphics` operations, `compute` operations, both, or neither.
> "Neither" Queues can only execute copy commands.
I would prefer that copy operations have their own Boolean, or that we use an enum whose description clearly defines which operations are supported.
D3D12 has recently added video decode and processing queues, neither of which support copy operations. If we ever add them to WebGPU, the API will need to be clear that they do not support copy.
wow, a graphics queue with no copy operations, fancy!
In D3D12, graphics queues can have copy operations. Video related queues cannot.
Cool! Please let us know when there's documentation about this. I can't really integrate this knowledge until the interfaces have public documentation though.
Unfortunately, the documentation for video queues is a ways away.
To be clear, I am not proposing that we add video queues to WebGPU for MVP, or even V1.0. But I would like for them to be easily added one day in the future. Until then, we should avoid having verbiage that says "absence of all flags implies foo behavior" because we may add new flags in the future that do not apply to foo behavior.
If we feel that adding a flag for copy may make developers nervous about needing to write code that handles graphics queues with no copy operations, then we should consider enums that clearly state what operations are supported for the enum value.
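A sketch of the enum alternative being suggested here, where each value's supported operations are spelled out explicitly rather than implied by the absence of flags. All names are hypothetical, including the future video kind.

```typescript
// Hypothetical enum: each value states its full set of supported operations,
// so adding future queue kinds (e.g. video) cannot silently imply copy support.
enum QueueKind {
  GraphicsComputeCopy, // "universal" queue
  ComputeCopy,
  CopyOnly,
  VideoDecode, // illustrative future kind: no copy support implied
}

function supportsCopy(kind: QueueKind): boolean {
  switch (kind) {
    case QueueKind.GraphicsComputeCopy:
    case QueueKind.ComputeCopy:
    case QueueKind.CopyOnly:
      return true;
    default:
      // New kinds must opt in explicitly; nothing falls through to "copy works".
      return false;
  }
}
```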
design/MultipleQueues.md (outdated):

> ## Further work
>
> - Transfering resources between queues, particularly for VK_SHARING_MODE_EXCLUSIVE, as opposed to _CONCURRENT.
> - App-requested number of same-type queues, instead of implementation-dictated-only.
I think the browser should tell the developer which kinds of queues they can create, and allow them to create multiple queues of different types.
I do not think the browser should try to guess the number of optimal queues since that is dependent on the content.
Suppose you have a fictional system with 6 compute cores. If the developer submits six dispatch calls with no dependencies between them, the driver is free to run each operation on a different core, utilizing all six of them. Similarly, the driver can also utilize the same 6 cores if the developer submits a single dispatch call with maxThreads=6. In these two instances, queuing things on compute queues 2-6 will do the developer no good.
I don't think queues map to GPU cores, any GPU core can execute any queue's work... at least in Vulkan.
@devshgraphicsprogramming right. Can you come up with a different example on when exposing the available queue types/counts would make the user code run better (which is the point @RafaelCintron is making)?
I think the nicely illustrated example in this post https://mynameismjp.wordpress.com/2018/06/17/breaking-down-barriers-part-3-multiple-command-processors/, as well as the preceding post, illustrates the need for multiple queues from the same family.
The question here is not whether or not to have multiple queues (PR is all about multiple queues). The question is whether we should expose what is available, or just let the user tell us what they need and want to work with.
I address that better in my other reply #95 (comment)
I don't think you want to put yourself in the position of the scheduler and dependency tracker, you're pretty much making and doing what an enhanced OpenGL implementation would do.
Also, the example provided by @RafaelCintron is a bit moot:

> Suppose you have a fictional system with 6 compute cores. If the developer submits six dispatch calls with no dependencies between them, the driver is free to run each operation on a different core, utilizing all six of them. Similarly, the driver can also utilize the same 6 cores if the developer submits a single dispatch call with maxThreads=6. In these two instances, queuing things on compute queues 2-6 will do the developer no good.

A single queue will fill up all of your Compute Units / Shader Processors with work anyway, so that maxThreads parameter is moot.

In the absence of dependencies, as in @RafaelCintron's example, a single queue and command processor will keep your entire GPU with all of its CU/SP/SMs busy with no idle time (bar a context switch between compute and graphics on old NVIDIAs).
You only need as many queues as the maximal slice in your unweighted Work-DAG of execution dependencies. I forgot my high-school discrete math... but it basically looks like maximum-cut, which is NP-complete.
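The "maximal slice" intuition above can be sketched cheaply. (For what it's worth, finding a maximum antichain in a DAG is actually solvable in polynomial time via Dilworth's theorem and matching, rather than being NP-complete.) Below is a rough lower bound using longest-path levels: group nodes by depth and take the widest level. All names are invented for illustration.

```typescript
// Rough sketch: group DAG nodes by longest-path depth and take the widest level.
// This lower-bounds the maximum antichain (the true "maximal slice").
type Dag = Map<string, string[]>; // node -> its dependency (parent) nodes

function levelWidth(dag: Dag): number {
  const depth = new Map<string, number>();
  const compute = (n: string): number => {
    if (depth.has(n)) return depth.get(n)!;
    const parents = dag.get(n) ?? [];
    const d =
      parents.length === 0
        ? 0
        : 1 + Math.max(...Array.from(parents, compute));
    depth.set(n, d);
    return d;
  };
  for (const n of Array.from(dag.keys())) compute(n);

  // Count nodes per level; the widest level bounds how many queues can help.
  const counts = new Map<number, number>();
  for (const d of Array.from(depth.values())) {
    counts.set(d, (counts.get(d) ?? 0) + 1);
  }
  return Math.max(...Array.from(counts.values()));
}
```

A diamond-shaped dependency graph (one fan-out of two independent passes) has width 2, so a second queue could help; a pure chain has width 1, so extra queues buy nothing.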
design/MultipleQueues.md (outdated):

> ## Further work
>
> - Transfering resources between queues, particularly for VK_SHARING_MODE_EXCLUSIVE, as opposed to _CONCURRENT.
D3D12's D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS flag is relevant here.
Simultaneous access guarantees that a resource is in a layout that can be read from any queue type. Hence, it allows you to mix SRV and UAV states across different queues running at the same time. If you know you're not going to be mixing these for the same resource, you can leave the flag off.
Regardless of the state of the flag, undefined behavior will result if you write to a region of a resource that is being read by an operation on a different queue or the same queue.
+1, we need to figure out how we are going to avoid write-write and read-write hazards.
In Vulkan, only one queue family can be the owner of a resource, AFAIK.
Not with VK_SHARING_MODE_CONCURRENT
"Ranges of buffers and image subresources of image objects created using VK_SHARING_MODE_CONCURRENT must only be accessed by queues from the queue families specified through the queueFamilyIndexCount and pQueueFamilyIndices members of the corresponding create info structures."
You also need to specify up-front which families you will share with when creating the objects.
Also note the performance impact:

> "VK_SHARING_MODE_CONCURRENT may result in lower performance access to the buffer or image than VK_SHARING_MODE_EXCLUSIVE"
https://www.reddit.com/r/vulkan/comments/7m6h0x/just_using_vk_sharing_mode_concurrent/
SachaWillems - "You'll notice differences on lower spec devices, esp. on mobile. So if you want to support mobile too, just using concurrent all the time is not a good idea, as it also forces you to always specify (up front) what queues will access the image."
`copy` not guaranteed. QueueFamily should be user-constructable. Note that hints might help reduce overhead.
> ## QueueFamily
>
> There are multiple families of queues.
> Availible families of queues are surfaced via `sequence<QueueFamily> Adapter.queueFamilies`.
type: Availible
typo: type 🤣
> a fence is complete, followed by submitting command buffers that depend on that fence.
>
> Synchronizing access to resources across queues should be done by telling a queue to wait until a fence is complete, followed by submitting command buffers that depend on that fence.
> Resource data may have multiple concurrent readers, or exclusively one writer, at a time.
how does this work w.r.t different types of read access? Say, one queue uses resource A as a vertex buffer, another queue uses it as a uniform buffer. Both only read, but neither knows about what exact state this resource is expected to be in.
Big issue here: fences and events in Vulkan cannot be signalled on one queue and waited upon in another.
You need a VkSemaphore for that.
> If different commands would violate this exclusion, the implmentation injects synchronization.
> If a command is submitted that will never be able to synchronize for exclusion without subsequent user commands, that Submit is refused.
>
> [1]: For MVP, we require reads/write exclusion at whole-resource granularity, instead of allowing subranges.
Or we can just say for MVP that we only expose a single queue?
Thanks for the write-up. The queue family request mechanism looks great! I do have major concerns with the synchronization aspect, though.

Basically, if the implementation is able to do something useful with implicit synchronization like you described, then it can do the same thing with a Metal-style model with a single queue and potential automatic parallelization under the "as if" rule. Metal-style is easier for applications, and thus a better alternative to multiple queues with implicit synchronization.

To be clear, I'm not suggesting we expose a single queue and assume implementations will parallelize workloads. Both this and your proposal seem too hard to implement usefully, given we don't have access to the hardware but have to go through somewhat inflexible APIs in D3D12 and Vulkan.
> The most capable family is always first in the `Adapter.queueFamilies` list.
> (It will either be a graphics+compute, or at least compute)
> Exposed adapters must have at least a compute family.
> Users can also create `QueueFamily`s.
What does this mean?
Users shouldn't create QueueFamilies; they are an intrinsic part of the device and can only be queried (Vulkan).
In Vulkan there will always be one queue family that can do everything basic like graphics, transfer and compute.
> In Vulkan there will always be one queue family that can do everything basic like graphics, transfer and compute.

@jdashg keeps saying that this is only true if there is a queue family that supports graphics at all.
> ## Synchronization
>
> Queues can have fences inserted into them, and any queue can wait on that fence to be complete.
Some links:
Metal
- MTLEvent
- MTLSharedEvent
- MTLCommandBuffer.encodeWaitForEvent(_:value:)
- MTLCommandBuffer.encodeSignalEvent(_:value:)
D3D12
- ID3D12Fence
- ID3D12CommandQueue::Signal()
- ID3D12CommandQueue::Wait()
- ID3D12GraphicsCommandList::ResourceBarrier()
Vulkan
- Fences
- Events
- Semaphores
A queue cannot wait on a fence signalled on another queue... that will crash badly in Vulkan.
For cross-queue sync you need a semaphore.
> Synchronizing access to resources across queues should be done by telling a queue to wait until a fence is complete, followed by submitting command buffers that depend on that fence.
> Resource data may have multiple concurrent readers, or exclusively one writer, at a time.
> Resources may have multiple concurrent readers and writers, so long as all subranges satisfy many-read/single-write exclusion. [1]
> If different commands would violate this exclusion, the implmentation injects synchronization.
implmentation
Again, make the distinction between:
- Pipeline Barrier
- Fence
- Event
- Semaphore
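For concreteness, the many-read/single-write subrange rule from the quoted design text can be modeled as a pairwise interval-overlap check. This is a toy validator under assumed semantics, not the proposed implementation.

```typescript
// Sketch: validate that concurrent accesses to one resource satisfy
// many-read/single-write exclusion per subrange.
interface Access {
  offset: number;
  size: number;
  write: boolean;
}

// Half-open ranges [offset, offset + size) overlap?
function rangesOverlap(a: Access, b: Access): boolean {
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// True iff every overlapping pair of accesses is read/read.
function satisfiesExclusion(accesses: Access[]): boolean {
  for (let i = 0; i < accesses.length; i++) {
    for (let j = i + 1; j < accesses.length; j++) {
      const a = accesses[i];
      const b = accesses[j];
      if (rangesOverlap(a, b) && (a.write || b.write)) return false;
    }
  }
  return true;
}
```

Under the MVP note in the quoted text, every access would effectively span the whole resource, so any write alongside any other access fails the check.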
> ~~~
>
> For a `submit(sequence<WebGPUCommandBuffer> buffers)`, implementations inject any required synchronization.
> On submit, each CommandBuffer in turn traverses its resources and identifies any outstanding synchronization requirements.
👍
wait.. this means that you will require and expect submitted command buffers to execute in-order of submission!?
What this is saying is that we (WebGPU implementation) will insert command buffers doing the memory barriers / resource transitions between some of those in a sequence. That doesn't serialize the execution (by GPU) of the command buffers more than the user would do in Vulkan.
E.g. if one command buffer is writing into a UAV and another one is using the same resource as a vertex buffer, WebGPU implementation will insert the transition. The user would do the same in Vulkan/D3D12.
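One way to picture what's being described — the implementation splicing transitions between command buffers at submit time — is a toy model like the following. All types and usage names are invented for illustration; a real implementation would track far richer state.

```typescript
// Toy model of implicit barrier injection at submit time.
type Usage = "vertex" | "uniform" | "storage-write" | "copy-src" | "copy-dst";

// resource id -> how this command buffer uses it
interface CmdBuf {
  uses: Map<string, Usage>;
}

interface Barrier {
  resource: string;
  from: Usage;
  to: Usage;
}

// Walk the buffers in submission order; whenever a resource's usage changes
// relative to its last known usage, record a transition to splice in.
function injectBarriers(
  lastKnown: Map<string, Usage>,
  submission: CmdBuf[]
): Barrier[] {
  const barriers: Barrier[] = [];
  for (const cb of submission) {
    for (const [res, usage] of Array.from(cb.uses)) {
      const prev = lastKnown.get(res);
      if (prev !== undefined && prev !== usage) {
        barriers.push({ resource: res, from: prev, to: usage });
      }
      lastKnown.set(res, usage); // carry state forward across buffers/submits
    }
  }
  return barriers;
}
```

For the UAV-then-vertex-buffer example above, this model emits exactly one storage-write → vertex transition, and nothing for subsequent buffers that keep the same usage.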
Ok, so you will go on a quest to find dependency chains between the uses of resources?

I hate to tell you, but that is exactly what OpenGL has to do... and that makes creating a bug-free and performant OpenGL implementation (driver) very, very hard. You'd have to do amazing work, on par with the engineers at Nvidia and AMD, to:
- insert the transitions correctly, avoiding undefined behaviour, artefacts, and other problems
- not insert too many, which causes subpar performance and serialization
- do it fast

I think a dependency graph of the resources used in the command buffers you will be processing for submission will be an enormous DAG and quite a challenge to analyze.

The Vulkan approach of having the user explicitly provide the transitions ahead of time and having them baked into the command buffer, rather than trying to insert them at runtime, is most probably the biggest reason (jointly with offline shader compilation) why Vulkan applications are seeing the often-quoted 60% overall-CPU-utilization reductions on mobile devices, even when multithreading and worker-thread command buffer generation is accounted for.
Yes, that's pretty much where the group is heading at the moment.

> I think a dependency graph of the resources used in the command buffers you will be processing for submission will be an enormous DAG and quite a challenge to analyze.

I'm not as pessimistic. There aren't that many resources that change their usage through the frame: say, a hundred render targets plus a bunch of buffers we write as UAV. This isn't an enormous DAG to analyze.

> The Vulkan approach of having the user explicitly provide the transitions ahead of time and having them baked into the command buffer, while not trying to insert them at runtime

The good thing for WebGPU (contrary to OpenGL) is that we have command buffers. So the barriers inside command buffers will also need to be computed only once (either at recording, or at the first submission, depending on the implementation). It's just the barriers between command buffers that we'll need to compute and insert on every submission at runtime.
The whole point of the "render pass" system added in the latest D3D12 is to help track and validate these transitions, Metal already does this tracking implicitly, and it's now a common tactic for engines to track them themselves as well (see https://www.gdcvault.com/play/1024656/Advanced-Graphics-Tech-Moving-to and http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-D3D12-And-Vulkan-Done-Right.pdf).
well yes, so since engines do it themselves.... maybe you should let them and make your lives easier?
There is great deal of historical context to this discussion. Basically, since the browser implementations have to track the lifetime and usage of resources for validation anyway, it's not a big step to derive the barriers from this info.
Lifetime resource tracking is easiest: instead of deleting (where the usual JS GC would do it), put the resource on a list with an associated API fence inserted after its last API use (or wherever you'd want the delete)... I do that already.

Barriers are easy (in OpenGL you basically slam a glMemoryBarrier before the first read from a modified resource).

Begin/end dependencies are a bit harder, and I would really like to see an implementation that is able to insert and validate them before command submission to the virtual queue (the WebGPU queue, not the VK queue), so we can benefit from pre-validated and pre-compiled command buffers like we do in D3D12 and VK.
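The deferred-deletion scheme described above — put the resource on a list with a fence inserted after its last use, and destroy it once the fence completes — is straightforward to sketch. The `Fence` interface here is hypothetical, standing in for whatever completion query the API provides.

```typescript
// Sketch: deferred resource destruction guarded by a completion fence.
interface Fence {
  completed(): boolean; // hypothetical: has the GPU passed this point?
}

interface PendingDelete {
  fence: Fence;
  destroy: () => void;
}

class DeferredDeleter {
  private pending: PendingDelete[] = [];

  // Instead of destroying immediately, queue the destroy behind a fence
  // inserted after the resource's last GPU use.
  schedule(fence: Fence, destroy: () => void): void {
    this.pending.push({ fence, destroy });
  }

  // Called periodically (e.g. once per frame): destroy whatever the GPU
  // is provably done with; returns how many resources were freed.
  collect(): number {
    const ready = this.pending.filter((p) => p.fence.completed());
    this.pending = this.pending.filter((p) => !p.fence.completed());
    for (const p of ready) p.destroy();
    return ready.length;
  }
}
```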
> In Vulkan, synchronization injection takes the form of synthesizing VkSubmitInfos, synchronizing via VkSemaphores and VkPipelineStageFlags, as well as submitting synthesized CommandBuffers containing memory barriers and queue family transfers.
> (Queue family transfers may require submitting synthesized CommandBuffers to other queues, as well)
>
> Implementation may warn users about synchronization overhead.
Implementations
> ## QueueFamily
>
> There are multiple families of queues.
What is the benefit of exposing families? Why not make each queue independent, having its own capabilities?
Because multiple queues from the same family do not need to transfer ownership between each other
They also allow for less idling when complex execution dependencies are present
https://mynameismjp.wordpress.com/2018/06/17/breaking-down-barriers-part-3-multiple-command-processors/
I don't think we've discussed the technique for ownership transfer. Depending on how that works, the implementation could be the one determining which barriers are issued, so it would know internally if the two queues happen to be in the same family and omit barriers accordingly.
Yes, but this transfer will be costly, so the user needs the queue family information exposed, so that they can avoid using resources across queues from different families.
> };
>
> interface WebGPUQueueFamily {
>     readonly attribute boolean graphics;
8 different possibilities? The DX12 style is easier to work with.
> void submit(sequence<WebGPUCommandBuffer> buffers);
> WebGPUFence insertFence();
> void wait(WebGPUFence);
Queues can wait but they can't signal?
I think this commit is missing a rebase.
> Resources may have multiple concurrent readers and writers, so long as all subranges satisfy many-read/single-write exclusion. [1]
> If different commands would violate this exclusion, the implmentation injects synchronization.
> If a command is submitted that will never be able to synchronize for exclusion without subsequent user commands, that Submit is refused.
I think I know what you're trying to get at here. But, in case I am misunderstanding, please provide an example that would get rejected by the API.
> AccessBits access;
> };
> list<LastAccess> last_accesses;
> ~~~
If this information is kept inside of each resource, how are we going to handle race conditions where one resource is used by multiple queues from multiple web workers? Should we make resources become read-only once they're transferred between workers? Even if we do, @kvark's point about there being different types of reading (shader resource vs UAV, etc.) still holds. D3D12's D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS is relevant here.
> Queues are created at Device creation time, and are exposed via `sequence<Queue> Device.queues`.
> Each element in `sequence<QueueFamily> WebGPUDeviceDescriptor.queueRequests` results in a corresponding element in `Device.queues`.
> If an user-provided QueueFamily does not match a family in `Adapter.queueFamilies`, a more capable family is used.
> If no available family can satisfy an user-provided family, device creation fails.
Will it be an error if the developer passes an empty sequence for queueRequest?
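The fallback rule in the quoted text ("a more capable family is used") could plausibly be implemented as a capability-superset search over the adapter's families, which are ordered most-capable-first. Everything below is an illustrative guess, not the proposal's actual algorithm; note that this sketch simply returns an empty list for an empty request, leaving open the question asked above.

```typescript
// Sketch: match each requested family to the first available family whose
// capabilities are a superset of the request; fail device creation otherwise.
interface Family {
  graphics: boolean;
  compute: boolean;
}

function satisfies(available: Family, requested: Family): boolean {
  return (
    (!requested.graphics || available.graphics) &&
    (!requested.compute || available.compute)
  );
}

// Returns the matched families in request order, or null if the device
// creation would fail because some request cannot be satisfied.
function matchRequests(
  available: Family[],
  requests: Family[]
): Family[] | null {
  const result: Family[] = [];
  for (const req of requests) {
    // `available` is assumed sorted most-capable-first, so the first hit
    // is "a more capable family" when there is no exact match.
    const match = available.find((f) => satisfies(f, req));
    if (!match) return null;
    result.push(match);
  }
  return result;
}
```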
> //WebGPULimits limits; Don't expose higher limits for now.
>
> // TODO are other things configurable like queues?
> sequence<WebGLQueueFamily> queueRequests;
nit: WebGPUQueueFamily
> readonly attribute DOMString name;
> readonly attribute WebGPUExtensions extensions;
> //readonly attribute WebGPULimits limits; Don't expose higher limits for now.
> readonly attribute sequence<WebGLQueueFamily> queueFamilies;
ditto WebGPUQueueFamily
Following yesterday's discussion, I think it makes sense for the Web to not have explicit transfers between queues, but automatic parallelization and hints instead. For Dawn, I don't think we can do any useful parallelization without at least these hints:

Even then, the implementation looks like it will be insanely complex and will need to juggle a ton of timelines and events/semaphores. It will super easily fall off the parallel path if the application forgets something, but at most we'll print a warning and keep going.

We decided to talk about multi-queue because it influenced the design of command buffers and buffer mapping. #100 resolves the command buffer discussion and makes command buffers created from queues. For buffer mapping we still need to discuss some more, but I'm hopeful this pull request will give enough information to know what the design constraints are.

At this point I think we should resolve here on 1) how you request and get queues, and 2) what single queue is exposed in the MVP. We can also record that, for "webbiness" reasons, multi-queue barriers / ownership transfers should be implicit + hints. Discussion of exactly what the hints are and how multiple queues interact together could be left for later.

Just to be clear, this thread had a lot of value (at least for me) in defining how you request queues and agreeing on implicit synchronization.
Again, in VK explicit ownership transfer is necessary only between different queue FAMILIES, not different queues in the same family.
Closing stale PR; we will revisit this as a proposal when we get to looking back at the
There's still a bunch of work to do here, but this should give us something to talk about, and point us in the right direction.