Add subgroup barrier to subgroups proposal

The current subgroup proposal is lacking a subgroup barrier, though it was present in the naga prototype implementation. In this issue I will try to motivate why we need it, suggest semantics including some test cases, and finally some notes about how it could be implemented.

I believe there is a strong analogy to acquire/release semantics in the C++ memory model. In that analogy, subgroup barriers map to acquire/release, lockstep behavior is equivalent to TSO, and promoting the subgroup barrier to a workgroup barrier is equivalent to sequential consistency. It is not necessary to fully appreciate this analogy, but I think it's helpful in understanding where arguments [questioning the need](https://github.com/jackaudio/jack2/issues/388#issuecomment-433747133) for the finer grained distinctions may come from.

## A motivating use case

I didn't have a golden example of a use case for subgroup barriers until recently, but now I do. I'm not alone, there is a (currently still unanswered) [question](https://developer.apple.com/forums/thread/743710) on Apple developer forums asking this in the context of Metal, with a reference to Apple-written code in which the question was unclear.

The algorithm is warp-local multi-split, which is at the core of state of the art least-significant-digit radix sorting algorithms such as [Onesweep](https://arxiv.org/abs/2206.01784). This has a [readily available implementation in CUDA](https://github.com/NVIDIA/cub/blob/0fc3c3701632a4be906765b73be20a9ad0da603d/cub/agent/agent_radix_sort_onesweep.cuh). I did a [prototype port](https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Sorting.20revisited) to vanilla WebGPU, using workgroup shared arrays instead of subgroups, but found that the excessive number of workgroup barriers needed tanked performance (I don't have hard numbers yet but it is likely that our investigation into sorting will produce some). On infrastructure with subgroups (after which the algorithm is named), the WLMS algorithm performs key ranking entirely using subgroups, with a single reasonably cheap prefix sum to stitch together the results after n chunks, where n ("elements per thread") is a tunable parameter that gains work efficiency possibly at the expense of exploitable parallelism.

A key operation in WMLS updates an array in workgroup shared memory so that the locations addressed are disjoint at subgroup granularity, but different threads within a subgroup may read and write the same value. In CUDA, the Vulkan memory model, and my interpretation of Metal's adaptation of the C++ memory model, such a thing is a data race unless there is an intervening barrier. And indeed, in the CUB source code, there is a [barrier](https://github.com/NVIDIA/cub/blob/main/cub/block/block_radix_rank.cuh#L786C1-L787) that macro-expands to `__syncwarp()`.

A reduced test case for this pattern is as follows:

```wgsl
const WG = 256;
var<workgroup> a: array<u32, WG>;

@compute @workgroup_size(WG)
fn main(
    @builtin(local_invocation_index) index: u32,
    @builtin(subgroup_size) sg_size: u32,
) {
    a[index ^ (sg_size - 1)] = 1;
    subgroupBarrier();
    let x = a[index];
    if x != 1 {
        fail();
    }
}
```

A few quick observations. If the mask were `(WG - 1)` instead of `(sg_size - 1)`, then this would be expected to fail, though correctness could be restored by upgrading the barrier to workgroup scope.

## Reconvergence and uniformity

As stated in [this informative thread](https://forums.developer.nvidia.com/t/why-syncwarp-is-necessary-in-undivergent-warp-reduction/209893) questioning the need for `__syncwarp` in CUDA, the Volta execution model is allowed to move threads forward in any order, even in the absence of conditional control flow. While this is not the execution model we expect in graphics implementations, I don't think we should preclude that; it's possible we may want to run WGSL programs in the future on infrastructure that shares this property with CUDA.

The spec for `simdgroup_barrier()` in MSL requires subgroup uniformity, though analogous barriers in other APIs do not. For WLMS, the barrier would be run in workgroup uniform control flow, so I personally would be happy with that. The rules for reconvergence are messy (as amply documented in the subgroup proposal), and I expect will probably continue to be refined.

## Proposed semantics

I propose that `subgroupBarrier()` is *required* to be in uniform control flow, just as the other barriers. This could be weakened later, though I think would require uniformity analysis to run at subgroup granularity as well.

Its semantics are the same as `workgroupBarrier()` except that it says "invocations in the same subgroup" rather than "invocations in the same workgroup" (referencing [14.5.1 Barriers](https://www.w3.org/TR/WGSL/#barrier) in the current spec). The memory semantics are the same as workgroup barriers - `AcquireRelease` and `WorkgroupMemory` memory semantics.

## Implementation guide

In general, implementations should do the least expensive thing required to pass the above litmus test. In some cases, that may be nothing. Even so, the goal is to capture programmer intent in case the distinction is important. I doubt anyone would be comfortable adding the above litmus test *sans* barrier to the CTS.

### Metal

Translation to MSL is straightforwardly `simdgroup_barrier(mem_flags::mem_threadgroup)`. As mentioned above, subgroup uniformity is required.

### Vulkan

Translation to Vulkan is `controlBarrier(gl_ScopeSubgroup, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsAcquireRelease)` in GLSL syntax, or OpControlBarrier(3, 2, 0x108) in SPIR-V. This follows straightforwardly from the memory semantics as written in the spec, with the addition that memory scope (not a concept defined by WGSL) is set to workgroup, matching the scope implied by the memory semantics.

### HLSL

Translation to HLSL is `// TODO: add WaveBarrier() if a future shader model incorporates it`. Just as on x86 we can rely on total store ordering and can emit vanilla load and store operations, here we rely on the folk knowledge that GPUs execute in lockstep at the wave level, and thus no synchronization is necessary. If an implementation fails the above litmus test, is it a bug in the implementation, or an insufficiently strong spec? Is an electron a wave or a particle? In either case, the solution is to emit `GroupMemoryBarrierWithGroupSync()` when we are not confident the implementation admits the fast implementation, possibly using an allowlist or denylist as with other bug workarounds.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add subgroup barrier to subgroups proposal #4437

A motivating use case

Reconvergence and uniformity

Proposed semantics

Implementation guide

Metal

Vulkan

HLSL

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add subgroup barrier to subgroups proposal #4437

Description

A motivating use case

Reconvergence and uniformity

Proposed semantics

Implementation guide

Metal

Vulkan

HLSL

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions