Skip to content

Parallel Transform Propagation#17840

Merged
mockersf merged 39 commits intobevyengine:mainfrom
aevyrie:parallel-transform
Feb 23, 2025
Merged

Parallel Transform Propagation#17840
mockersf merged 39 commits intobevyengine:mainfrom
aevyrie:parallel-transform

Conversation

@aevyrie
Copy link
Member

@aevyrie aevyrie commented Feb 13, 2025

Objective

  • Make transform propagation faster.

Solution

  • Work sharing worker threads
  • Parallel tree traversal excluding leaves
  • Second cache friendly wide pass over all leaves
  • 3-10x faster than main

Testing

  • Tracy
  • Caldera hotel is showing 3-7x faster on my M4 Max. Timing for bevy's existing transform system shifts wildly run to run, so I don't know that I would advertise a particular number. But this implementation is faster in a... statistically significant way.
    image

@aevyrie aevyrie requested a review from Victoronz February 13, 2025 06:36
@alice-i-cecile alice-i-cecile added this to the 0.16 milestone Feb 13, 2025
@alice-i-cecile alice-i-cecile added C-Performance A change motivated by improving speed, memory usage or compile times A-Transform Translations, rotations and scales M-Release-Note Work that should be called out in the blog due to impact S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Feb 13, 2025
@aevyrie aevyrie marked this pull request as draft February 13, 2025 06:49
@aevyrie
Copy link
Member Author

aevyrie commented Feb 13, 2025

Getting this up now that it seems to function, marking as draft because it still needs cleanup.

@alice-i-cecile
Copy link
Member

FYI, I'm merging #17815. The methods there are probably worth benching here.

@alice-i-cecile
Copy link
Member

I'd love a before / after tracy histogram BTW. I've seen enough tests on Discord to be convinced that this is markedly (3x or so) faster, but it would be lovely to show in the release notes.

@NthTensor
Copy link
Contributor

Alright, I am finally going to be able to give this a look. Right out of the gate I find the increase in run variance concerning, but I'm in favor of merging and leaving that as follow up if we can't easily identify the cause.

@aevyrie
Copy link
Member Author

aevyrie commented Feb 22, 2025

Alright, I am finally going to be able to give this a look. Right out of the gate I find the increase in run variance concerning, but I'm in favor of merging and leaving that as follow up if we can't easily identify the cause.

If you are referring to my comment, that was about bevy's existing system. It swings wildly +- 1ms. Variance (timing distribution) within a run is consistent for both. Variance between runs in this PR seems better than main. My guess is this is caused by E-cores vs P-cores. The old implementation usually ended up spending most time in a single core, whereas the new one is spread across all available cores in the thread pool. I think this explains why inter-run variance has improved, the work is spread across a mix of P and E cores.

@alice-i-cecile alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Feb 22, 2025
@alice-i-cecile alice-i-cecile added this pull request to the merge queue Feb 22, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 22, 2025
@alice-i-cecile alice-i-cecile added this pull request to the merge queue Feb 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 23, 2025
@aevyrie
Copy link
Member Author

aevyrie commented Feb 23, 2025

The CI failure was not caused by this PR, seems to be some sort of MESA failure according to discord.

@mockersf mockersf enabled auto-merge February 23, 2025 20:28
@mockersf mockersf added this pull request to the merge queue Feb 23, 2025
Merged via the queue into bevyengine:main with commit dba1f7a Feb 23, 2025
33 checks passed
@alice-i-cecile
Copy link
Member

Thank you to everyone involved with the authoring or reviewing of this PR! This work is relatively important and needs release notes! Head over to bevyengine/bevy-website#1995 if you'd like to help out.

github-merge-queue bot pushed a commit that referenced this pull request Dec 30, 2025
# Objective

- Follow up from previous transform optimization (#18589), make the
`mark_dirty_trees` system more intelligent - don't run this expensive
static scene optimization for dynamic scenes.
- Using a threshold was mentioned as a follow up in that PR, and we also
want this threshold to be user-configurable.
- This was not implemented previously because the optimizations were
still large improvements even in dynamic scenes thanks to the improved
parallelism #17840

## Solution

- Don't run static scene optimization (dirty tree tracking) for very
dynamic scenes - defined here as scenes where more than 30% of objects
have their `Transform` updated.
- This is configurable with a percentage threshold, or it can be
unconditionally enabled or disabled when setting to `0.0` or `1.0` to
avoid the cost of computing the threshold.
- For dynamic scenes, this makes transform prop much faster, twice as
fast in the stress tests shown here.

## Testing

transform_hierarchy stress tests, all of these cases spawn about a
quarter million entities:

- humanoids_active - dynamic scene that should be faster than `main`:
<img width="609" height="395" alt="image"
src="https://github.com/user-attachments/assets/bf3d6b93-aa09-4440-b8ac-18af7e46a00f"
/>

- humanoids_inactive - static scene that should be unchanged from
`main`:
<img width="631" height="377" alt="image"
src="https://github.com/user-attachments/assets/a0306109-600b-4cdd-a217-5cc15e269bca"
/>

- humanoids_mixed - half dynamic scene that should be faster than `main`
<img width="604" height="372" alt="image"
src="https://github.com/user-attachments/assets/2751ece2-d4b9-4daa-af24-fe379eaf75b2"
/>

- large_tree - dynamic scene (50% of entities are moved) we expect to
see improvements
<img width="665" height="371" alt="image"
src="https://github.com/user-attachments/assets/c6b08abe-eb1d-44fb-be36-457f9d5ba78e"
/>
mockersf pushed a commit that referenced this pull request Dec 30, 2025
# Objective

- Follow up from previous transform optimization (#18589), make the
`mark_dirty_trees` system more intelligent - don't run this expensive
static scene optimization for dynamic scenes.
- Using a threshold was mentioned as a follow up in that PR, and we also
want this threshold to be user-configurable.
- This was not implemented previously because the optimizations were
still large improvements even in dynamic scenes thanks to the improved
parallelism #17840

## Solution

- Don't run static scene optimization (dirty tree tracking) for very
dynamic scenes - defined here as scenes where more than 30% of objects
have their `Transform` updated.
- This is configurable with a percentage threshold, or it can be
unconditionally enabled or disabled when setting to `0.0` or `1.0` to
avoid the cost of computing the threshold.
- For dynamic scenes, this makes transform prop much faster, twice as
fast in the stress tests shown here.

## Testing

transform_hierarchy stress tests, all of these cases spawn about a
quarter million entities:

- humanoids_active - dynamic scene that should be faster than `main`:
<img width="609" height="395" alt="image"
src="https://github.com/user-attachments/assets/bf3d6b93-aa09-4440-b8ac-18af7e46a00f"
/>

- humanoids_inactive - static scene that should be unchanged from
`main`:
<img width="631" height="377" alt="image"
src="https://github.com/user-attachments/assets/a0306109-600b-4cdd-a217-5cc15e269bca"
/>

- humanoids_mixed - half dynamic scene that should be faster than `main`
<img width="604" height="372" alt="image"
src="https://github.com/user-attachments/assets/2751ece2-d4b9-4daa-af24-fe379eaf75b2"
/>

- large_tree - dynamic scene (50% of entities are moved) we expect to
see improvements
<img width="665" height="371" alt="image"
src="https://github.com/user-attachments/assets/c6b08abe-eb1d-44fb-be36-457f9d5ba78e"
/>
mockersf pushed a commit that referenced this pull request Dec 30, 2025
# Objective

- Follow up from previous transform optimization (#18589), make the
`mark_dirty_trees` system more intelligent - don't run this expensive
static scene optimization for dynamic scenes.
- Using a threshold was mentioned as a follow up in that PR, and we also
want this threshold to be user-configurable.
- This was not implemented previously because the optimizations were
still large improvements even in dynamic scenes thanks to the improved
parallelism #17840

## Solution

- Don't run static scene optimization (dirty tree tracking) for very
dynamic scenes - defined here as scenes where more than 30% of objects
have their `Transform` updated.
- This is configurable with a percentage threshold, or it can be
unconditionally enabled or disabled when setting to `0.0` or `1.0` to
avoid the cost of computing the threshold.
- For dynamic scenes, this makes transform prop much faster, twice as
fast in the stress tests shown here.

## Testing

transform_hierarchy stress tests, all of these cases spawn about a
quarter million entities:

- humanoids_active - dynamic scene that should be faster than `main`:
<img width="609" height="395" alt="image"
src="https://github.com/user-attachments/assets/bf3d6b93-aa09-4440-b8ac-18af7e46a00f"
/>

- humanoids_inactive - static scene that should be unchanged from
`main`:
<img width="631" height="377" alt="image"
src="https://github.com/user-attachments/assets/a0306109-600b-4cdd-a217-5cc15e269bca"
/>

- humanoids_mixed - half dynamic scene that should be faster than `main`
<img width="604" height="372" alt="image"
src="https://github.com/user-attachments/assets/2751ece2-d4b9-4daa-af24-fe379eaf75b2"
/>

- large_tree - dynamic scene (50% of entities are moved) we expect to
see improvements
<img width="665" height="371" alt="image"
src="https://github.com/user-attachments/assets/c6b08abe-eb1d-44fb-be36-457f9d5ba78e"
/>
mockersf pushed a commit that referenced this pull request Dec 30, 2025
# Objective

- Follow up from previous transform optimization (#18589), make the
`mark_dirty_trees` system more intelligent - don't run this expensive
static scene optimization for dynamic scenes.
- Using a threshold was mentioned as a follow up in that PR, and we also
want this threshold to be user-configurable.
- This was not implemented previously because the optimizations were
still large improvements even in dynamic scenes thanks to the improved
parallelism #17840

## Solution

- Don't run static scene optimization (dirty tree tracking) for very
dynamic scenes - defined here as scenes where more than 30% of objects
have their `Transform` updated.
- This is configurable with a percentage threshold, or it can be
unconditionally enabled or disabled when setting to `0.0` or `1.0` to
avoid the cost of computing the threshold.
- For dynamic scenes, this makes transform prop much faster, twice as
fast in the stress tests shown here.

## Testing

transform_hierarchy stress tests, all of these cases spawn about a
quarter million entities:

- humanoids_active - dynamic scene that should be faster than `main`:
<img width="609" height="395" alt="image"
src="https://github.com/user-attachments/assets/bf3d6b93-aa09-4440-b8ac-18af7e46a00f"
/>

- humanoids_inactive - static scene that should be unchanged from
`main`:
<img width="631" height="377" alt="image"
src="https://github.com/user-attachments/assets/a0306109-600b-4cdd-a217-5cc15e269bca"
/>

- humanoids_mixed - half dynamic scene that should be faster than `main`
<img width="604" height="372" alt="image"
src="https://github.com/user-attachments/assets/2751ece2-d4b9-4daa-af24-fe379eaf75b2"
/>

- large_tree - dynamic scene (50% of entities are moved) we expect to
see improvements
<img width="665" height="371" alt="image"
src="https://github.com/user-attachments/assets/c6b08abe-eb1d-44fb-be36-457f9d5ba78e"
/>
mockersf pushed a commit that referenced this pull request Dec 30, 2025
# Objective

- Follow up from previous transform optimization (#18589), make the
`mark_dirty_trees` system more intelligent - don't run this expensive
static scene optimization for dynamic scenes.
- Using a threshold was mentioned as a follow up in that PR, and we also
want this threshold to be user-configurable.
- This was not implemented previously because the optimizations were
still large improvements even in dynamic scenes thanks to the improved
parallelism #17840

## Solution

- Don't run static scene optimization (dirty tree tracking) for very
dynamic scenes - defined here as scenes where more than 30% of objects
have their `Transform` updated.
- This is configurable with a percentage threshold, or it can be
unconditionally enabled or disabled when setting to `0.0` or `1.0` to
avoid the cost of computing the threshold.
- For dynamic scenes, this makes transform prop much faster, twice as
fast in the stress tests shown here.

## Testing

transform_hierarchy stress tests, all of these cases spawn about a
quarter million entities:

- humanoids_active - dynamic scene that should be faster than `main`:
<img width="609" height="395" alt="image"
src="https://github.com/user-attachments/assets/bf3d6b93-aa09-4440-b8ac-18af7e46a00f"
/>

- humanoids_inactive - static scene that should be unchanged from
`main`:
<img width="631" height="377" alt="image"
src="https://github.com/user-attachments/assets/a0306109-600b-4cdd-a217-5cc15e269bca"
/>

- humanoids_mixed - half dynamic scene that should be faster than `main`
<img width="604" height="372" alt="image"
src="https://github.com/user-attachments/assets/2751ece2-d4b9-4daa-af24-fe379eaf75b2"
/>

- large_tree - dynamic scene (50% of entities are moved) we expect to
see improvements
<img width="665" height="371" alt="image"
src="https://github.com/user-attachments/assets/c6b08abe-eb1d-44fb-be36-457f9d5ba78e"
/>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Transform Translations, rotations and scales C-Performance A change motivated by improving speed, memory usage or compile times M-Release-Note Work that should be called out in the blog due to impact S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it

Projects

None yet

Development

Successfully merging this pull request may close these issues.