Refactor moving entity to new table#23151
Conversation
chescock
left a comment
There was a problem hiding this comment.
Yeah, this approach seems much nicer! I think it's right, but I want to check a little more carefully before approving, and I'm out of time for today :).
| new_table_id: TableId, | ||
| row: TableRow, | ||
| ) -> TableMoveResult<'_> { | ||
| assert!(old_table_id != new_table_id); |
There was a problem hiding this comment.
(I suspect we'll eventually want to make this a safety condition and only debug_assert! it, but since this PR already improves performance I don't think we need to micro-optimize too much :).)
There was a problem hiding this comment.
It wasn't super rigorous, but I did run benchmarks before and after adding the assert!s and didn't see any difference.
| let mut src_next = src_iter.next(); | ||
| let mut dst_next = dst_iter.next(); | ||
|
|
||
| while let Some((src_component_id, src_column)) = src_next.take() { |
There was a problem hiding this comment.
I'm having a little trouble following the control flow here. I think this is supposed to affect each source column exactly once, and either copy it or drop/forget it. Is that right? Could we rework it to use a for loop over the source columns so that it's more clear that it touches each one once?
I think that would look something like this:
let mut dst_iter = dst_table.columns.iter_mut();
let mut dst_next = dst_iter.next();
for (src_component_id, src_column) in src_table.columns.iter_mut() {
// Skip past any destination columns that don't exist in the source table.
// The caller is responsible for initializing those columns.
while dst_next
.as_ref()
.is_some_and(|(dst_component_id, _dst_column)| *dst_component_id < src_component_id)
{
dst_next = dst_iter.next();
}
// Then move the value in the source column if it exists in the destination table,
// or remove it if it does not.
if let Some((dst_component_id, dst_column)) = dst_next.as_mut()
&& *dst_component_id == src_component_id
{
unsafe {
dst_column.initialize_from_unchecked(src_column, last_index, row, dst_row);
}
dst_next = dst_iter.next();
} else {
unsafe {
src_column.swap_remove_unchecked::<DROP>(last_index, row);
}
}
}If that's equivalent, then I think it will be easier to follow.
Co-Authored-By: Chris Russell <[email protected]> Co-Authored-By: eugineerd <[email protected]>
ElliottjPierce
left a comment
There was a problem hiding this comment.
I don't see anything wrong with this; the tests seem to agree, and the perf improvement is very nice! Some thoughts on things to look at next:
I don't like panics in unsafe functions. I'd rather have more safety requirements than more asserts, but that's not a big deal and can be saved for later.
I think this moving algorithm is vastly superior to what we had before. It may be doing unnecessary work though if it moves a component value that is going to be overwritten anyway. So there might be room to make this even faster.
I'm pretty indifferent to using table ids instead of pointers, but I don't see what the change buys us. And using ids may loose us some optimization opportunities in the future, but IDK.
There are a few small functions here and there that you might want to inline, but it's probably happening anyway, so up to you.
Anyway, I think this is a great change. Looking forward to seeing where this goes next.
| /// If `DROP` is `false`, removed components will be forgotten, | ||
| /// allowing ownership to be relinquished to the caller. | ||
| /// | ||
| /// # Panics |
There was a problem hiding this comment.
Actually, it might be unsound for this to panic...
It looks like this is called after moving the entity between archetypes and stuff. If it unwinds, it leaves the entity in the wrong archetype, which could potentially cause UB. So this might need to be a safety requirement.
There was a problem hiding this comment.
Shouldn't panicking always be safer than continuing with violated safety requirements?
There was a problem hiding this comment.
They are both the same here IIUC. If the safety requirement is not fulfilled, it's UB. If it panics, it unwinds with the world in an invalid state. The entity thinks it's in the new archetype, even though it isn't initialize fully yet. If something catches the unwind, like bevy's error handler, then the world will continue as is, so a query could return references to uninitialized memory which is UB. It would also probably try to drop uninitialized memory which is UB. So either way, if the caller is wrong, it's UB, so it should be unsafe to reflect that.
(All of that is from memory from where bevy catches unwinds. But I think it's right.) You could also make the argument that documenting the panic makes it implicitly unsafe when called in these contexts, but I don't think that's a patter that bevy addopts.
There was a problem hiding this comment.
If it panics, it unwinds with the world in an invalid state.
Yeah, see #20368 for another example of it being unsound to unwind during component insertion.
Hmm, we also run arbitrary user code in the drop impls within this. Does that mean it's already unsound to remove a component with a panicking drop? ... Well, it's definitely buggy, since it will swap components between entities while doing swap_remove, but I haven't actually been able to trigger UB since it doesn't update the EntityLocation until afterwards. I might try to put together a clean example and make an issue.
There was a problem hiding this comment.
This is one of those things that's really hard to get right. Especially with drop logic, allocator panics, hook panics (which could leave some hooks not run), etc. All of these leave the world in an invalid state, which could be totally fine or cause UB depending on how the world is used next.
From my research in building my own ecs, I find the simplest thing to do is to declare a variable that panics when it's dropped and forget it afterward.
let panic_bomb = PanicOnDrop("Unacceptable panic in structural world change");
// Structural change here (ex: spawn, insert, etc.)
core::mem::forget(panic_bomb);If something does panic inside, it will cause a non-unwinding panic when it hits the panic_bomb, aborting the process. IMO, this is the best balance of minimal safety requirements that doesn't require catching unwinds in component drops, hooks, and such.
There are other things you can do, too. For example, you can make sure the hook and drop function pointers include a catch_unwind or something themselves, which rust can make 0 cost if it can verify that a panic is impossible. But this gets messy IMO.
There was a problem hiding this comment.
This is one of those things that's really hard to get right.
Agreed!
I find the simplest thing to do is to declare a variable that panics when it's dropped and forget it afterward.
Yeah, this is a good pattern to ensure safety!
But we also want to reduce the number of ways that users can cause Bevy to abort the process, of course :).
I think it should be possible to at least make sure we aren't running arbitrary user code while the panic_bomb is armed. If we move the drop calls to happen last, after all the EntityLocation bookkeeping is finished, then any panics there will only cause other values to leak. Conversely, if we initialize required components after reserving capacity in their tables, but before doing anything else, then any panics there will only cause other required component values to leak.
(I also wonder whether we could move the drop calls out far enough that they can happen where we have static types. That could eliminate the indirect calls and let the optimizer inline them.)
There was a problem hiding this comment.
Very interesting. I definitely think this is worth splitting out into its own issue to discuss. There are lots of paths forward here. I think the simplest thing for this PR is to just make it unsafe and call it good for now.
But we should try to solidify a "Bevy way" for these "unwinding in unsafe code" questions and patch all the existing soundness holes, ideally before any users start running into aborts.
chescock
left a comment
There was a problem hiding this comment.
I checked over all the safety requirements, and I'm pretty convinced this is sound, except for the panic that @ElliottjPierce noted. And it's simpler and faster, so it seems like a great change!
We should update the safety comments on move_row and get rid of the asserts there before merging, since unwinding is UB anyway.
It would also be good to fix the safety requirements on BlobArray, but those were wrong before this PR so I wouldn't block on that.
| /// If `DROP` is `false`, removed components will be forgotten, | ||
| /// allowing ownership to be relinquished to the caller. | ||
| /// | ||
| /// # Panics |
There was a problem hiding this comment.
If it panics, it unwinds with the world in an invalid state.
Yeah, see #20368 for another example of it being unsound to unwind during component insertion.
Hmm, we also run arbitrary user code in the drop impls within this. Does that mean it's already unsound to remove a component with a panicking drop? ... Well, it's definitely buggy, since it will swap components between entities while doing swap_remove, but I haven't actually been able to trigger UB since it doesn't update the EntityLocation until afterwards. I might try to put together a clean example and make an issue.
Co-Authored-By: Chris Russell <[email protected]> Co-Authored-By: Eagster <[email protected]>
## Objective When moving an entity from one table to another, we have 3 methods available: - `move_to_superset_unchecked`, for moving during insertion. - `move_to_and_drop_missing_unchecked`, for moving during normal removal. - `move_to_and_forget_missing_unchecked`, for moving during `take` removal. However, when looking at flecs (to steal ideas, naturally), I noticed it just has one function for moving during insertion and removal: [`flecs_table_move`](https://github.com/SanderMertens/flecs/blob/689adf8f6c1d069aa51fa059b8a2cbb4fb761f1d/src/storage/table.c#L1961). It also handles things a bit differently internally (details on that below). ## Solution There are essentially 3 somewhat independent changes here: - Combining the 3 methods into 1 called `move_row`. - Moving the `move` method to `Tables` and taking `TableId`s (rather than being `Table::move_to(&mut self, other: &mut Table)`). - Changing the internals of the `move` method to be more like flecs. ### Combining the methods The 3 methods were already very similar; `move_to_superset` just didn't check if components were missing. They could be combined without any of the other changes, but it's possible that insertion performance would suffer. ### `TableId` instead of `&mut Table` I believe the idea behind using references instead of IDs was to be able to cache the pointers and avoid fetching the tables more than once. However, the only callers of the `move_to` methods, `BundleInserter` and `BundleRemover`, only used the pointers for the `move_to` methods and didn't need them for anything else, so there isn't much point in caching them. And being able to just use IDs makes the code nicer, since we don't have to dance around the borrow checker. ### Internals `flecs_table_move` requires that tables' columns are sorted (by component ID, low-to-high). It iterates the source and destination tables at the same time (i.e. their component IDs and corresponding columns), but pauses one iterator when the other "falls behind" (has a lower component ID). If the source table falls behind, the source table's current component ID was removed, and if the destination table falls behind, the destination table's current component ID was added. Bevy's `move_to` methods don't rely on tables' columns being in any certain order; they just iterate the source table and fetch each matching column from the destination table individually. As it turns out, Bevy's tables *are* sorted in practice, we just never guaranteed or relied on it. A table's list of component IDs are sorted to use as the key in a `HashMap`, and the same list is used to build the table. So we can just make it official by `assert`ing that a table's components are sorted when the table is finalized (`TableBuilder::build`). Evidently, iterating the destination table is faster than fetching from it repeatedly: <img width="644" height="848" alt="Screenshot_20260224_135012" src="https://github.com/user-attachments/assets/118cb78a-3267-411e-aa8e-3d1d7ea4a274" /> ### Misc There's a `PERF` comment that has survived since 2021 (bevyengine#2673) and I'm not really sure what it means: ``` // PERF: store "non bundle" components in edge, then just move those to avoid // redundant copies ``` I didn't want to remove it without saying anything, but I think it's lived long enough. --------- Co-authored-by: Chris Russell <[email protected]> Co-authored-by: eugineerd <[email protected]> Co-authored-by: Eagster <[email protected]>

Objective
When moving an entity from one table to another, we have 3 methods available:
move_to_superset_unchecked, for moving during insertion.move_to_and_drop_missing_unchecked, for moving during normal removal.move_to_and_forget_missing_unchecked, for moving duringtakeremoval.However, when looking at flecs (to steal ideas, naturally), I noticed it just has one function for moving during insertion and removal:
flecs_table_move. It also handles things a bit differently internally (details on that below).Solution
There are essentially 3 somewhat independent changes here:
move_row.movemethod toTablesand takingTableIds (rather than beingTable::move_to(&mut self, other: &mut Table)).movemethod to be more like flecs.Combining the methods
The 3 methods were already very similar;
move_to_supersetjust didn't check if components were missing. They could be combined without any of the other changes, but it's possible that insertion performance would suffer.TableIdinstead of&mut TableI believe the idea behind using references instead of IDs was to be able to cache the pointers and avoid fetching the tables more than once. However, the only callers of the
move_tomethods,BundleInserterandBundleRemover, only used the pointers for themove_tomethods and didn't need them for anything else, so there isn't much point in caching them. And being able to just use IDs makes the code nicer, since we don't have to dance around the borrow checker.Internals
flecs_table_moverequires that tables' columns are sorted (by component ID, low-to-high). It iterates the source and destination tables at the same time (i.e. their component IDs and corresponding columns), but pauses one iterator when the other "falls behind" (has a lower component ID). If the source table falls behind, the source table's current component ID was removed, and if the destination table falls behind, the destination table's current component ID was added.Bevy's
move_tomethods don't rely on tables' columns being in any certain order; they just iterate the source table and fetch each matching column from the destination table individually.As it turns out, Bevy's tables are sorted in practice, we just never guaranteed or relied on it. A table's list of component IDs are sorted to use as the key in a
HashMap, and the same list is used to build the table. So we can just make it official byasserting that a table's components are sorted when the table is finalized (TableBuilder::build).Evidently, iterating the destination table is faster than fetching from it repeatedly:
Misc
There's a
PERFcomment that has survived since 2021 (#2673) and I'm not really sure what it means:I didn't want to remove it without saying anything, but I think it's lived long enough.