Skip to content

Add subgroup barrier to subgroups proposal #4437

@raphlinus

Description

@raphlinus

The current subgroup proposal is lacking a subgroup barrier, though it was present in the naga prototype implementation. In this issue I will try to motivate why we need it, suggest semantics including some test cases, and finally some notes about how it could be implemented.

I believe there is a strong analogy to acquire/release semantics in the C++ memory model. In that analogy, subgroup barriers map to acquire/release, lockstep behavior is equivalent to TSO, and promoting the subgroup barrier to a workgroup barrier is equivalent to sequential consistency. It is not necessary to fully appreciate this analogy, but I think it's helpful in understanding where arguments questioning the need for the finer grained distinctions may come from.

A motivating use case

I didn't have a golden example of a use case for subgroup barriers until recently, but now I do. I'm not alone, there is a (currently still unanswered) question on Apple developer forums asking this in the context of Metal, with a reference to Apple-written code in which the question was unclear.

The algorithm is warp-local multi-split, which is at the core of state of the art least-significant-digit radix sorting algorithms such as Onesweep. This has a readily available implementation in CUDA. I did a prototype port to vanilla WebGPU, using workgroup shared arrays instead of subgroups, but found that the excessive number of workgroup barriers needed tanked performance (I don't have hard numbers yet but it is likely that our investigation into sorting will produce some). On infrastructure with subgroups (after which the algorithm is named), the WLMS algorithm performs key ranking entirely using subgroups, with a single reasonably cheap prefix sum to stitch together the results after n chunks, where n ("elements per thread") is a tunable parameter that gains work efficiency possibly at the expense of exploitable parallelism.

A key operation in WMLS updates an array in workgroup shared memory so that the locations addressed are disjoint at subgroup granularity, but different threads within a subgroup may read and write the same value. In CUDA, the Vulkan memory model, and my interpretation of Metal's adaptation of the C++ memory model, such a thing is a data race unless there is an intervening barrier. And indeed, in the CUB source code, there is a barrier that macro-expands to __syncwarp().

A reduced test case for this pattern is as follows:

const WG = 256;
var<workgroup> a: array<u32, WG>;

@compute @workgroup_size(WG)
fn main(
    @builtin(local_invocation_index) index: u32,
    @builtin(subgroup_size) sg_size: u32,
) {
    a[index ^ (sg_size - 1)] = 1;
    subgroupBarrier();
    let x = a[index];
    if x != 1 {
        fail();
    }
}

A few quick observations. If the mask were (WG - 1) instead of (sg_size - 1), then this would be expected to fail, though correctness could be restored by upgrading the barrier to workgroup scope.

Reconvergence and uniformity

As stated in this informative thread questioning the need for __syncwarp in CUDA, the Volta execution model is allowed to move threads forward in any order, even in the absence of conditional control flow. While this is not the execution model we expect in graphics implementations, I don't think we should preclude that; it's possible we may want to run WGSL programs in the future on infrastructure that shares this property with CUDA.

The spec for simdgroup_barrier() in MSL requires subgroup uniformity, though analogous barriers in other APIs do not. For WLMS, the barrier would be run in workgroup uniform control flow, so I personally would be happy with that. The rules for reconvergence are messy (as amply documented in the subgroup proposal), and I expect will probably continue to be refined.

Proposed semantics

I propose that subgroupBarrier() is required to be in uniform control flow, just as the other barriers. This could be weakened later, though I think would require uniformity analysis to run at subgroup granularity as well.

Its semantics are the same as workgroupBarrier() except that it says "invocations in the same subgroup" rather than "invocations in the same workgroup" (referencing 14.5.1 Barriers in the current spec). The memory semantics are the same as workgroup barriers - AcquireRelease and WorkgroupMemory memory semantics.

Implementation guide

In general, implementations should do the least expensive thing required to pass the above litmus test. In some cases, that may be nothing. Even so, the goal is to capture programmer intent in case the distinction is important. I doubt anyone would be comfortable adding the above litmus test sans barrier to the CTS.

Metal

Translation to MSL is straightforwardly simdgroup_barrier(mem_flags::mem_threadgroup). As mentioned above, subgroup uniformity is required.

Vulkan

Translation to Vulkan is controlBarrier(gl_ScopeSubgroup, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsAcquireRelease) in GLSL syntax, or OpControlBarrier(3, 2, 0x108) in SPIR-V. This follows straightforwardly from the memory semantics as written in the spec, with the addition that memory scope (not a concept defined by WGSL) is set to workgroup, matching the scope implied by the memory semantics.

HLSL

Translation to HLSL is // TODO: add WaveBarrier() if a future shader model incorporates it. Just as on x86 we can rely on total store ordering and can emit vanilla load and store operations, here we rely on the folk knowledge that GPUs execute in lockstep at the wave level, and thus no synchronization is necessary. If an implementation fails the above litmus test, is it a bug in the implementation, or an insufficiently strong spec? Is an electron a wave or a particle? In either case, the solution is to emit GroupMemoryBarrierWithGroupSync() when we are not confident the implementation admits the fast implementation, possibly using an allowlist or denylist as with other bug workarounds.

Metadata

Metadata

Assignees

No one assigned

    Labels

    wgslWebGPU Shading Language Issues

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions