Uniformity analysis: find a way to load from a storage buffer uniformly. #2321

@Kangz

This is a more detailed issue for one of the points raised in #2229 and for this comment on the uniformity analysis.

In more advanced uses of compute shaders it is often useful to load a value from a storage buffer and then have the whole workgroup do something based on that value, including things that require uniform control flow. In the proposal in #1571 this is not possible, and I think we should find a way to make this work (without blocking #1571 from landing). This pattern is useful for various reasons, for example when doing job queues / VM instructions in compute shaders (like Nanite or piet-gpu do). Here's a small example:

struct Job {
    type: u32;
    payload: u32;
};

struct JobQueue {
    jobCount: u32;
    currentJob: atomic<u32>;
    jobs : array<Job>;
};
[[group(0), binding(0)]] var<storage, read_write> jobs : JobQueue;

[[stage(compute)]] fn doJobs() {
    loop {
        // Note: this atomic add is a bit wrong since it would need to happen only once
        // per workgroup
        let jobIndex = atomicAdd(&jobs.currentJob, 1u);
        if (jobIndex >= jobs.jobCount) { break; }

        switch (jobs.jobs[jobIndex].type) {
          case 66u:
            doOrder66(jobs.jobs[jobIndex].payload);
            break;

          // more stuff
        }
    }
}

fn doOrder66(payload : u32) {
    // Stuff that requires uniform control flow. For example with a workgroup barrier.
    workgroupBarrier();
}

With the uniformity analysis, the loads from jobs are always non-uniform. This is because another workgroup in the same dispatch could be modifying the variable while it is being read. More precisely if the GPU has multiple subgroups for this workgroup, the variable could be modified between subgroup 0 and subgroup 1 reading it!

However, this means a job queue like the one above is not possible. workgroupBarrier needs to be called in uniform control flow, so doOrder66 does too, and so the value we switch on, jobs.jobs[jobIndex].type, must be uniform. The uniformity analysis says it is not.

My initial, half-thought-through proposal was to add a uniformLoad(p: ptr<workgroup/storage, T, access>) -> T that must be called in uniform control flow with a uniform pointer value, and that returns a uniform value: the result of dereferencing the pointer. However, it wouldn't be enough for the example above because jobIndex is not in storage. The code would have to hop through a workgroup variable:

var<workgroup> sharedJobIndex : u32;

loop {
    if (firstInvocation) {
        sharedJobIndex = atomicAdd(&jobs.currentJob, 1u);
    }
    // uniformLoad is the hypothetical builtin proposed above; it takes a pointer.
    let jobIndex = uniformLoad(&sharedJobIndex);
    if (jobIndex >= uniformLoad(&jobs.jobCount)) { break; }

    switch (uniformLoad(&jobs.jobs[jobIndex].type)) {
        // ...
    }
}

Maybe a better primitive would be a workgroupBroadcastFirst(value: T) -> T that is called in uniform control flow and broadcasts the value from the first invocation (local index (0, 0, 0)) to all the others. This is similar to subgroupBroadcastFirst, an operation available with subgroup/waveops/threadgroup.
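To make the idea concrete, here is a rough sketch of how the job loop might look with such a primitive. It is not valid WGSL today: workgroupBroadcastFirst is the hypothetical builtin described above, and the workgroup size of 64 and the localIndex parameter name are assumptions made for illustration. It reuses the Job, JobQueue, jobs and doOrder66 declarations from the first example.

```wgsl
// Sketch only: workgroupBroadcastFirst is hypothetical, not an existing WGSL builtin.
[[stage(compute), workgroup_size(64)]]
fn doJobs([[builtin(local_invocation_index)]] localIndex : u32) {
    loop {
        // Only the first invocation claims a job, so the counter advances
        // once per workgroup rather than once per invocation.
        var claimed : u32 = 0u;
        if (localIndex == 0u) {
            claimed = atomicAdd(&jobs.currentJob, 1u);
        }

        // Called in uniform control flow. Every invocation would receive the
        // value held by invocation 0, so the analysis could treat it as uniform.
        let jobIndex = workgroupBroadcastFirst(claimed);
        if (jobIndex >= workgroupBroadcastFirst(jobs.jobCount)) { break; }

        // Each invocation performs its own (non-uniform) storage load, but only
        // invocation 0's result is kept.
        let jobType = workgroupBroadcastFirst(jobs.jobs[jobIndex].type);
        let payload = workgroupBroadcastFirst(jobs.jobs[jobIndex].payload);

        switch (jobType) {
            case 66u: { doOrder66(payload); }
            default: {}
        }
    }
}
```

Compared with the uniformLoad version, this would need no explicit workgroup variable in user code, and it also covers values that never lived in memory, such as the result of the atomicAdd.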
