Skip to content

Use radix sort for sort phase and sprite sorting #4291

@superdump

Description

@superdump

Background

We currently use sort_by_key to sort PhaseItems in the sort phase, and sort_unstable_by to sort extracted sprites in queue_sprites.

  • sort_by_key - This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).
  • sort_unstable_by - This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.

For large numbers of items to sort, such as in bevymark or many_sprites where n > 100_000, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.

crates

I investigated sorting 10M random f32s in the range 0.1f32..1000.0f32 (the default perspective camera near and far planes) and observed the following with the available radix sort crates:

1053.388ms      stable
391.499ms       unstable
60.359ms        rdst_single
16.067ms        rdst
79.550ms        voracious_sort
79.945ms        voracious_stable_sort
13.895ms        voracious_mt_sort
65.653ms        radsort
  • multi-threading
    • rdst and voracious_mt_sort results above are multi-threaded using rayon, all the rest of the results are single-threaded
    • rdst has a hard dependency on rayon, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2
    • voracious_radix_sort has a non-default multi-threading feature
    • radsort is single-threaded only
  • According to crates.io, the crates have the following footprints:
    • radsort - 17.1kB
    • rdst - 34.3kB
    • voracious_radix_sort - 178kB
  • Flexibility
    • radsort was the easiest to integrate using its sort_by_key function
    • rdst and voracious_radix_sort both require implementation of some traits on the type being sorted, and require that the type implement Copy. This cannot be done for batched sprites currently because BatchedPhaseItem contains a Range<f32> for which it is not possible to implement Copy

Proof of Concept

I made a branch here that uses radsort: https://github.com/superdump/bevy/tree/render-radix-sort .

The sorting of ExtractedSprites in queue_sprites is significantly improved by using radsort::sort_by_key like this:

            radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
                (
                    extracted_sprite.transform.translation.z,
                    match extracted_sprite.image_handle_id {
                        HandleId::Id(uuid, id) => {
                            ((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
                        }
                        HandleId::AssetPathId(id) => {
                            ((id.source_path_id().value() as u128) << 64)
                                | (id.label_id().value() as u128)
                        }
                    },
                )
            });

With that, on an M1 Max, the median execution time of queue_sprites in bevymark -- 20000 8 over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In many_sprites it decreased from 11.34ms to 9.47ms.

The median execution time of sort_phase_system for the relevant phase (Transparent2d for sprites, Opaque3d for 3D meshes) in many_cubes -- sphere it decreased from 0.728ms to 0.094ms. In bevymark -- 20000 8 it increased from 0.184ms to 1.95ms, which makes no sense. And in many_sprites it increased from 0.106ms to 1.19ms.

This is quite confusing. I haven't been able to figure out why the sort performance gets worse. radsort claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the sort_key implementations are inlined.

Next steps

  • Figure out why radsort is much slower in many cases
  • Try voracious_radix_sort or rdst to see if they perform consistently better, though rdst would likely not be approved unless its multi-threading were made optional.
    • Make multi-threading optional in rdst if it turns out to be a preferable solution due to the smaller crate footprint

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-RenderingDrawing game state to the screenC-PerformanceA change motivated by improving speed, memory usage or compile times

    Type

    No type

    Projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions