Implement opt-in change indexes for dense components. by pcwalton · Pull Request #23519 · bevyengine/bevy

pcwalton · 2026-03-25T21:57:19Z

This summary (like the rest of the PR) is a work in progress.

Overview

Currently, for queries that use Added and/or Changed query filters, the Bevy ECS must examine every component of every entity that matches the archetypes in question. Because core systems like rendering, transforms, and visibility calculation rely heavily on Added/Changed query filters, this adds up to a significant bottleneck when scaling to millions of entities. With the significant effort in 0.19 to scale to mega-worlds (1 million entities or more), the performance of Changed has become the largest blocker to achieving high scalability. The goal is to be competitive with Unity DOTS and its megacity demo, which has approximately 4.5 million mesh instances and modifies about 5,000 transforms per frame; without some method of accelerating Added and Changed, as for example in this PR, I don't believe this is feasible for Bevy to achieve.

To solve this issue, this commit adds change indexes, which are an opt-in acceleration method for dense components. Change indexes introduce a table of summaries of each page of rows within a table. The number of consecutive rows that constitute a page is known as the page size, and, through measurement, I found 256 to be a reasonable conservative value. Each summary consists of the most recent change tick for all the indexed components within that archetype. When iterating through a query (either sequentially or in parallel), if an indexed component C cannot match unless Added<C> or Changed<C> is true, then the query engine uses the summary to skip entire pages' worth of entities.
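
As a rough illustration of the page-skipping idea, here is a minimal single-threaded sketch. It is not the PR's implementation: `IndexedColumn`, `note_changed` as written here, and `changed_since` are hypothetical names. It models a single indexed column rather than a summary shared by all indexed components, and it compares ticks with plain `>` and `max`, ignoring wraparound.

```rust
/// Rows per page; the description above reports 256 as a conservative value.
const PAGE_SIZE: usize = 256;

/// Hypothetical model of one indexed component column in a table.
struct IndexedColumn {
    /// Per-row "last changed" ticks, as ordinary change detection keeps.
    row_ticks: Vec<u32>,
    /// One summary per page: the newest change tick of any row in that page.
    page_summaries: Vec<u32>,
}

impl IndexedColumn {
    fn new(rows: usize) -> Self {
        Self {
            row_ticks: vec![0; rows],
            page_summaries: vec![0; (rows + PAGE_SIZE - 1) / PAGE_SIZE],
        }
    }

    /// Records a mutation of `row` at `tick` in both the row and its page.
    fn note_changed(&mut self, row: usize, tick: u32) {
        self.row_ticks[row] = tick;
        let summary = &mut self.page_summaries[row / PAGE_SIZE];
        *summary = (*summary).max(tick);
    }

    /// Returns the rows changed after `last_run`, plus the number of rows
    /// whose individual ticks actually had to be inspected.
    fn changed_since(&self, last_run: u32) -> (Vec<usize>, usize) {
        let mut changed = Vec::new();
        let mut inspected = 0;
        for (page, &summary) in self.page_summaries.iter().enumerate() {
            if summary <= last_run {
                continue; // nothing in this page is newer: skip the whole page
            }
            let start = page * PAGE_SIZE;
            let end = (start + PAGE_SIZE).min(self.row_ticks.len());
            for row in start..end {
                inspected += 1;
                if self.row_ticks[row] > last_run {
                    changed.push(row);
                }
            }
        }
        (changed, inspected)
    }
}
```

In this model, a table of 1,000 rows with two changed rows in different pages inspects only those two pages' rows rather than all 1,000.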

Adding the #[component(change = "indexed")] attribute to a component enables indexing for that component. Because indexing adds overhead to Mut<T> among other operations, indexing is opt-in instead of opt-out. It's possible to determine statically, at compile time, whether a component is indexed, and the plan to ensure that Mut<T> doesn't regress relies on this.
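
To show what "determined statically" can look like, here is a small stand-alone sketch using an associated constant. The `ChangeMode` enum and `CHANGE_MODE` constant names follow the diff excerpts quoted later in this thread; everything else (the simplified `Component` trait, the example component types, and `tick_writes_on_mutation`) is hypothetical and exists only for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChangeMode {
    Default,
    Indexed,
}

/// Simplified stand-in for a component trait; not Bevy's real `Component`.
trait Component {
    const CHANGE_MODE: ChangeMode = ChangeMode::Default;
}

/// Hypothetical component that keeps ordinary change detection.
struct Health(u32);
impl Component for Health {}

/// Hypothetical component that opts in to indexing.
struct Placement([f32; 3]);
impl Component for Placement {
    const CHANGE_MODE: ChangeMode = ChangeMode::Indexed;
}

/// Number of tick writes a mutation performs in this model. Because
/// `T::CHANGE_MODE` is a constant, the match can be resolved per type at
/// compile time, so non-indexed components need not pay for the extra write.
fn tick_writes_on_mutation<T: Component>() -> u32 {
    match T::CHANGE_MODE {
        ChangeMode::Default => 1, // the component's own change tick
        ChangeMode::Indexed => 2, // the component's tick plus the page summary
    }
}
```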

Alternate approaches

There are several alternate approaches that I experimented with. My experience with each one was as follows:

Per-column change indexes

My initial attempt stored change indexes on each column rather than on each archetype. This provided more specificity: the query acceleration could take into account only the change ticks for the components in the query filter rather than all indexed components on the archetype. The downside was that it severely impacted the performance of extract_meshes_for_gpu_building, which has the following query:

fn extract_meshes_for_gpu_building(
    ...,
    changed_meshes_query: Extract<
        Query<
            GpuMeshExtractionQuery,
            Or<(
                Changed<ViewVisibility>,
                Changed<GlobalTransform>,
                Changed<PreviousGlobalTransform>,
                Changed<Lightmap>,
                Changed<Aabb>,
                Changed<Mesh3d>,
                Changed<MeshTag>,
                (
                    Changed<NoFrustumCulling>,
                    Changed<NotShadowReceiver>,
                    Changed<TransmittedShadowReceiver>,
                    Changed<NotShadowCaster>,
                    Changed<NoAutomaticBatching>,
                    Changed<NoCpuCulling>,
                ),
                Changed<VisibilityRange>,
                Changed<SkinnedMesh>,
            )>,
        >,
    ...
)

This is 15 different components that had to be checked, and it is responsible for one of the bottlenecks. In fact, being able to consolidate all of these components into a single check is one of the major motivations for change indexes to begin with.

Per-archetype change indexes

I also experimented with change indexes stored on the archetype instead of on the table. The advantage of storing the index on the archetype would be that sparse sets and tables are handled identically. Unfortunately, this ballooned complexity quite a bit and led to a lot of incorrect behavior. The biggest sticking point that I could see was that, in order to produce a Mut<T> with a pointer to the change index, a pointer to the change index needs to be stored in the Fetch. But that's incompatible with how query iteration for dense components works: for dense components, queries iterate over tables, not over components.

Benchmarks

`many_cubes`

My primary interest is in scaling to worlds with millions of entities. A pure benchmark of scalability in this area is many_cubes --instance-count 4000000 --no-cpu-culling. (Four million cubes is the maximum before the transform-and-cull shader runs into wgpu workgroup limits, and CPU culling must be disabled in order to meaningfully scale to that level.) The results are as follows:

many_cubes --instance-count 4000000 --no-cpu-culling, main:
19.34 median ms/frame, 52 FPS

many_cubes --instance-count 4000000 --no-cpu-culling, this PR:
14.49 median ms/frame, 69 FPS

The extract_mesh_materials system, the bottleneck during the extraction phase, goes from a median of 4.58 ms/frame to 0.0238 ms/frame, a roughly 192x speedup.

(Please note that batch_and_prepare_binned_render_phase, write_work_item_buffers, and write_indirect_parameters_buffers are all addressed by #23481 and followups to it, so the overall speedups from change indexes won't be limited by Amdahl's Law the way they are now.)

`bevy_city`

In bevy_city, 12,442 entities out of 46,717 change every frame. This is not a workload that change indexes significantly improve, because the time spent actually doing the work that must happen on change dwarfs the time spent checking the filter for static meshes. Nevertheless, it's useful to show that change indexes don't regress bevy_city. Note that bevy_city is GPU bound, so the total frame times don't really indicate anything related to this PR.

bevy_city with no CPU culling on meshes, main:
Median frame time 26.9 ms (37 FPS)

bevy_city with no CPU culling on meshes, this PR:
Median frame time 27.8 ms (36 FPS)

extract_meshes_for_gpu_building comparison between this PR (yellow) and main (red). Median time is 2.03 ms in both cases.

Addition and removal

Benchmark	`main`	This PR
`add_remove/table`	1.0494 ms	1.0840 ms
`add_remove/sparse_set`	820.38 µs	762.14 µs
`add_remove_big/table`	1.9357 ms	2.0195 ms
`add_remove_big/sparse_set`	826.59 µs	828.18 µs
`add_remove_very_big/table`	61.148 ms	60.296 ms

Change detection

Test	`main`	This PR
`all_added_detection/5000_entities_ecs::change_detection::Table`	5µs 587ns	6µs 519ns
`all_added_detection/5000_entities_ecs::change_detection::Sparse`	6µs 727ns	6µs 523ns
`all_added_detection/50000_entities_ecs::change_detection::Table`	57µs 28ns	64µs 783ns
`all_added_detection/50000_entities_ecs::change_detection::Sparse`	67µs 399ns	67µs 32ns
`all_changed_detection/5000_entities_ecs::change_detection::Table`	6µs 684ns	7µs 796ns
`all_changed_detection/5000_entities_ecs::change_detection::Sparse`	6µs 923ns	11µs 786ns
`all_changed_detection/50000_entities_ecs::change_detection::Table`	66µs 57ns	121µs 793ns
`all_changed_detection/50000_entities_ecs::change_detection::Sparse`	68µs 780ns	115µs 138ns
`few_changed_detection/5000_entities_ecs::change_detection::Table`	2µs 19ns	5µs 766ns
`few_changed_detection/5000_entities_ecs::change_detection::Sparse`	4µs 307ns	8µs 27ns
`few_changed_detection/50000_entities_ecs::change_detection::Table`	40µs 489ns	52µs 935ns
`few_changed_detection/50000_entities_ecs::change_detection::Sparse`	83µs 41ns	82µs 157ns
`none_changed_detection/5000_entities_ecs::change_detection::Table`	1µs 346ns	3µs 886ns
`none_changed_detection/5000_entities_ecs::change_detection::Sparse`	3µs 922ns	3µs 984ns
`none_changed_detection/50000_entities_ecs::change_detection::Table`	14µs 238ns	38µs 329ns
`none_changed_detection/50000_entities_ecs::change_detection::Sparse`	39µs 562ns	39µs 621ns
`multiple_archetypes_none_changed_detection/5_archetypes_10_entities_ecs::change_detection::Table`	66ns	62ns
`multiple_archetypes_none_changed_detection/5_archetypes_10_entities_ecs::change_detection::Sparse`	80ns	81ns
`multiple_archetypes_none_changed_detection/5_archetypes_100_entities_ecs::change_detection::Table`	242ns	383ns
`multiple_archetypes_none_changed_detection/5_archetypes_100_entities_ecs::change_detection::Sparse`	492ns	488ns
`multiple_archetypes_none_changed_detection/5_archetypes_1000_entities_ecs::change_detection::Table`	1µs 537ns	3µs 964ns
`multiple_archetypes_none_changed_detection/5_archetypes_1000_entities_ecs::change_detection::Sparse`	4µs 432ns	4µs 541ns
`multiple_archetypes_none_changed_detection/5_archetypes_10000_entities_ecs::change_detection::Table`	15µs 416ns	38µs 575ns
`multiple_archetypes_none_changed_detection/5_archetypes_10000_entities_ecs::change_detection::Sparse`	45µs 476ns	47µs 493ns
`multiple_archetypes_none_changed_detection/20_archetypes_10_entities_ecs::change_detection::Table`	220ns	216ns
`multiple_archetypes_none_changed_detection/20_archetypes_10_entities_ecs::change_detection::Sparse`	265ns	267ns
`multiple_archetypes_none_changed_detection/20_archetypes_100_entities_ecs::change_detection::Table`	962ns	1µs 684ns
`multiple_archetypes_none_changed_detection/20_archetypes_100_entities_ecs::change_detection::Sparse`	1µs 945ns	1µs 997ns
`multiple_archetypes_none_changed_detection/20_archetypes_1000_entities_ecs::change_detection::Table`	6µs 537ns	16µs 38ns
`multiple_archetypes_none_changed_detection/20_archetypes_1000_entities_ecs::change_detection::Sparse`	18µs 632ns	19µs 271ns
`multiple_archetypes_none_changed_detection/20_archetypes_10000_entities_ecs::change_detection::Table`	68µs 266ns	159µs 500ns
`multiple_archetypes_none_changed_detection/20_archetypes_10000_entities_ecs::change_detection::Spars...`	264µs 850ns	271µs 500ns
`multiple_archetypes_none_changed_detection/100_archetypes_10_entities_ecs::change_detection::Table`	1µs 209ns	1µs 132ns
`multiple_archetypes_none_changed_detection/100_archetypes_10_entities_ecs::change_detection::Sparse`	1µs 396ns	1µs 430ns
`multiple_archetypes_none_changed_detection/100_archetypes_100_entities_ecs::change_detection::Table`	5µs 927ns	9µs 263ns
`multiple_archetypes_none_changed_detection/100_archetypes_100_entities_ecs::change_detection::Sparse`	12µs 204ns	12µs 420ns
`multiple_archetypes_none_changed_detection/100_archetypes_1000_entities_ecs::change_detection::Table`	52µs 475ns	89µs 500ns
`multiple_archetypes_none_changed_detection/100_archetypes_1000_entities_ecs::change_detection::Spars...`	152µs 187ns	152µs 637ns
`multiple_archetypes_none_changed_detection/100_archetypes_10000_entities_ecs::change_detection::Tabl...`	380µs 300ns	823µs 50ns
`multiple_archetypes_none_changed_detection/100_archetypes_10000_entities_ecs::change_detection::Spar...`	1ms 326µs 950ns	1ms 367µs 850ns

Future work

These benchmark numbers shouldn't be considered the upper limit of what is possible with change indexes. The remaining systems in many_cubes could probably see large improvements with additional work. For instance:

Systems such as visibility::calculate_bounds and mark_meshes_as_changed_if_their_materials_changed aren't currently eligible to use change indexes because they use AssetChanged, which must perform a full table scan. However, by introducing a resource that stores a bidirectional index between Mesh and Material assets and the entities that use them, the AssetChanged query filter could be dropped, and these systems could be migrated to only use Added/Changed, making them eligible for change indexes.
Some systems such as reset_view_visibility could be migrated to use change indexes and be eliminated from the profile.

Ultimately, the goal is for the CPU time to approach zero for meshes that don't change from frame to frame, and to have efficient handling for meshes that do.

ElliottjPierce

I want to come back to this and do a full review later, but here's some quick thoughts:

This needs a lot more docs to explain what the structure of this even is. I'll do more review when there's more here. Trying to put this together, I think what's going on here is: In addition to tracking changes for each component value, track changes for blocks/"pages" of entities in each table. There are PagesSize entities in each block and they all share the same world tick. For things that are changed often, this makes mutations slower. But for very rarely changed things, this means we can skip large sections of entities if their shared change tick is old. Am I getting that right?
We are going to need more docs and examples to motivate this for users. I'd love to see some benchmark results.
How does this perform for entities that rarely have component values changed but are frequently moved between tables? How much does this hurt spawning performance, inserts, and such? Probably well worth the cost, but still...
This makes Mut 8 bytes larger IIUC. This is probably the most concerning thing for me. This is still probably worth it, but this is going to hurt in some places if I had to guess.
This will probably improve performance for the average user. But, it will also probably make it worse for others, depending on how often they are changing things. I think it would be cool (but probably not worth trying yet) if users could customize the page size more. Maybe per component and the table just takes the largest, IDK. The more rarely a component is changed, the bigger its page size should be. Maybe even have a tool that can watch the app run and suggest ideal page sizes. Could be interesting.
I'd like to point out that this improves the theoretical "normal" case but it also makes the theoretical worst case worse. If exactly one entity in each page is changed, even from a different component, it will make performance worse. For example, in a game with 10 rarely changed components using this new indexing scheme, while each one of those 10 is rarely mutated, it's probably pretty common for one of them to be mutated on an entity with all 10. On the whole, this technically makes querying less efficient the more components an entity has, which is not ideal. But it's probably not a huge issue in practice. This could be fixed by moving this indexing scheme to the columns, but that may have other drawbacks. Thoughts on this?
We need better names here than Default and Indexed. Maybe Individual, and PessimisticallyPaged, and later we could add None? IDK, but "indexed" isn't very informative IMO, but this is a small thing.

alice-i-cecile · 2026-03-26T21:57:38Z

    /// If this is `true`, then [`QueryFilter::filter_fetch`] must always return true.
    const IS_ARCHETYPAL: bool;

+    const USES_INDEX: bool;


Probably clearer as USES_CHANGE_INDEX.

alice-i-cecile · 2026-03-26T22:29:13Z

 }

+#[derive(Debug, Copy, Clone, Default, Eq, PartialEq)]
+pub enum ChangeMode {


ChangeDetectionStrategy perhaps?

alice-i-cecile · 2026-03-26T22:29:39Z

+#[derive(Debug, Copy, Clone, Default, Eq, PartialEq)]
+pub enum ChangeMode {
+    #[default]
+    Default,


I would prefer a more descriptive name here, but I'm drawing a blank.

alice-i-cecile · 2026-03-26T22:31:07Z

 ## This will often provide more detailed error messages.
 track_location = []

+big_pages = []


Remember to add docs for this, and duplicate to Bevy's Cargo.toml

alice-i-cecile · 2026-03-26T22:34:05Z

    /// A constant indicating the storage type used for this component.
    const STORAGE_TYPE: StorageType;

+    const CHANGE_MODE: ChangeMode;


I've gone back-and-forth on whether these need to be runtime configurable. It would be nice to be able to tune these to your specific game's workload. Ultimately, I just don't think it's viable for performance, and feel that associated constants here and for StorageType are the way.

alice-i-cecile · 2026-03-26T22:34:59Z

+use crate::{change_detection::Tick, storage::TableRow};
+
+#[cfg(feature = "big_pages")]
+pub(crate) const PAGE_SIZE: u32 = 4096;


Remember to comment briefly about how these values were determined.

alice-i-cecile

Preliminary review (strongly positive):

extremely impressive performance numbers overall
strategy seems fundamentally good: very reasonable from both an ECS and data-structures-and-algorithms perspective
we should try and limit the overhead for existing change detection strategies: this will dominate in some workloads. I've opened #23529 to track / plan this. Poke me to implement it once 0.19 is cut.
unfortunately I'm not convinced that we can solve #4882 in a runtime-configurable way without serious overhead. Which would defeat the point: the value of runtime-config is improved performance.
we're going to need to be careful to ensure that this works on runtime / dynamic component types: I haven't dug in to evaluate how this was done yet
I am absolutely going to be a stickler about docs here; please bother me if you want me to help / write them

ElliottjPierce · 2026-03-27T03:54:13Z

How does this compare to #11120? Is there a world where we could do some version of both? A "column last changed" and change indexes like this.

How will this interact with #22500? That adds query look-aheads to similarly skip chunks of entities IIRC. Probably possible to do both, but ideally we could implement one in terms of the other.

chescock

This is exciting! I will try to do a more thorough review later.

Adding the #[component(change = "indexed")] attribute to a component enables indexing for that component. Because indexing adds overhead to Mut<T> among other operations, indexing is opt-in instead of opt-out. It's possible to determine statically, at compile time, whether a component is indexed, and the plan to ensure that Mut<T> doesn't regress relies on this.

How will a library author decide which components support indexing? Won't the right answer vary between applications?

My worry is that we'll have a slippery slope where users ask for components to be indexed one-by-one because they have a use case for that component. Each one will seem reasonable because the cost of indexing one component is low but the value for their use cases is high. But we'll eventually wind up with everything indexed, in which case we might as well just index everything now and evaluate the full cost.

(Or maybe there's a middle ground where pub types use opt-out and non-pub types use opt-in? If it's non-pub then nobody can add a Changed filter anyway, so there's no need for indexes unless the library uses them. But it's probably a bad idea to let changing the visibility of a component affect performance like that!)

chescock · 2026-03-30T15:03:21Z

+    pub(crate) fn note_changed(&self, row: TableRow, tick: Tick) {
+        let page = row.index_u32() / PAGE_SIZE;
+        debug_assert!((page as usize) < self.page_table.len());
+        self.page_table[page as usize].store(tick.get(), Ordering::Relaxed);


If these are stored on the table as a whole then I think doing ordinary stores might have a race condition.

If two systems run concurrently, then one will have an earlier tick than the other. If they both write to different components on entities in the same table, then the one with an earlier tick could overwrite the tick from the later one.

For that to matter, you'd need a third system that considered one tick to be changed and the other to be not changed. I think we can do that with a schedule like:

fn A(q: Query<&X, Changed<X>>) {}
fn B(q: Query<&mut X>) {}
fn C(q: Query<&mut Y>) {}

app.add_systems(((A, B).chain(), C));
world.spawn((X, Y));

Frame 1:
C starts, tick 1
A starts, tick 2
A reads component X
A ends
B starts, tick 3
B writes component X, stores 3 in page summary
B ends
C writes component Y, stores 1 in page summary
C ends

Then, on Frame 2, A sees tick 1 in the page summary and that it last ran on tick 2, so it assumes nothing could have been changed, and misses the change to X.

I don't see an easy way to avoid that without atomic RMW ops, which I assume are too expensive to be viable here.
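
The interleaving above can be replayed deterministically on one thread. The sketch below is only an illustration of the lost update, not code from the PR: `replay` and `a_visits_page` are hypothetical helpers, the ticks are the ones from the schedule above, and `fetch_max` stands in for "some atomic RMW op" while ignoring tick wraparound.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Replays the two summary writes from the schedule above in the order they
/// happen (B with tick 3, then C with tick 1) and returns the final summary.
fn replay(use_fetch_max: bool) -> u32 {
    let summary = AtomicU32::new(0);
    for tick in [3u32, 1] {
        if use_fetch_max {
            // Read-modify-write: never moves the summary backwards.
            summary.fetch_max(tick, Ordering::Relaxed);
        } else {
            // Plain store: the later write wins, even with an older tick.
            summary.store(tick, Ordering::Relaxed);
        }
    }
    summary.load(Ordering::Relaxed)
}

/// System A's page-level check on frame 2; A last ran at tick 2.
fn a_visits_page(summary: u32) -> bool {
    summary > 2
}
```

With plain stores the summary ends at 1, so A skips the page and misses B's write to X. With `fetch_max` the summary stays at 3 and A visits the page.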

I'm confused. The only time we call World::increment_change_tick outside of tests is in World::clear_trackers, and the only time we call that outside of tests is in App::update() -> SubApps::update() -> SubApp::update(). App::update() runs a whole frame. So, in your example, I don't see how C can store a different change tick than B would?

(Also I think if Mut worked the way you're suggesting change detection would already be broken to begin with, as it doesn't perform any sort of max checking IIRC. Max checking doesn't work because of wraparound.)

I'm confused. The only time we call World::increment_change_tick outside of tests is in World::clear_trackers,

It gets called in <FunctionSystem as System>::run_unsafe, which is called on every system run:

bevy/crates/bevy_ecs/src/system/function_system.rs

Line 670 in b4d4adc

let change_tick = world.increment_change_tick();

Having a separate tick for each system lets us detect both changes that happened earlier in the current frame, and changes that happened after the current system ran in the last frame.

... ah, I guess that's actually UnsafeWorldCell::increment_change_tick, which might be why your search missed it.

(Also I think if Mut worked the way you're suggesting change detection would already be broken to begin with, as it doesn't perform any sort of max checking IIRC. Max checking doesn't work because of wraparound.)

I don't think I understand what you mean here. Wraparound is handled by the checks in Tick::is_newer_than and regular calls to World::check_change_ticks, which calls Tick::check_tick to clamp old ticks before they wrap.

bevy/crates/bevy_ecs/src/change_detection/tick.rs

Lines 52 to 55 in b4d4adc

pub fn is_newer_than(self, last_run: Tick, this_run: Tick) -> bool {
    // This works even with wraparound because the world tick (`this_run`) is always "newer" than
    // `last_run` and `self.tick`, and we scan periodically to clamp `ComponentTicks` values
    // so they never get older than `u32::MAX` (the difference would overflow).

But I don't see how that's related, so maybe you meant something else by wraparound and I'm misunderstanding?
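
For readers unfamiliar with the wraparound handling mentioned here, the idea can be sketched on bare `u32` values by comparing ages relative to the current tick instead of comparing ticks directly. This is a simplified model, not the body of `Tick::is_newer_than`: the free function and the `MAX_AGE` cap are assumptions chosen for illustration.

```rust
/// Hypothetical age cap; stands in for whatever maximum change age the
/// periodic clamping guarantees.
const MAX_AGE: u32 = u32::MAX / 2;

/// Returns true if `tick` is more recent than `last_run`, judged by how long
/// ago each happened relative to `this_run`. `wrapping_sub` keeps the ages
/// correct even when the counter has wrapped past `u32::MAX`.
fn is_newer_than(tick: u32, last_run: u32, this_run: u32) -> bool {
    let ticks_since_change = this_run.wrapping_sub(tick).min(MAX_AGE);
    let ticks_since_system = this_run.wrapping_sub(last_run).min(MAX_AGE);
    ticks_since_system > ticks_since_change
}
```

For example, with `last_run = u32::MAX - 10`, a change at tick 2 (after the counter wrapped) and `this_run = 5`, the ages are 3 and 16, so the change counts as newer even though `2 < u32::MAX - 10` numerically.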

Ah, yes. You're right. Closing as this pretty much kills the whole feature with no way to rescue it that I can think of.

The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls.

PR bevyengine#23481 eliminated the CPU loop over every mesh *instance* in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every *mesh*. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike `MeshInputUniform`s, which are scattered throughout memory and allocated using a CPU-side free list, `MeshUniform`s are indexed by *instance ID*. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out `MeshUniform`s in memory such that all the instances of a specific mesh end up adjacent to one another. This operation is currently performed on the CPU in the `MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set` method and has overhead proportional to the number of separate meshes (not mesh *instances*) in each batch set.

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the *uniform allocation* step has been added. This shader essentially performs a [prefix sum] in order to allocate the `MeshUniform`s corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR bevyengine#23036 added a prefix sum for light clustering. However, in order to scale better to tens of thousands of meshes in a single batch, the uniform allocation pass added in this PR uses the three-step *scan and fan* process rather than the two-step process that PR bevyengine#23036 uses. The scan and fan algorithm works as follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a *fan buffer*.
2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.
3. *Fan*: Add the running total leading into each chunk to every element of the chunk itself.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).

This patch had to rework the `RenderMultidrawableBatchSet` structure added in PR bevyengine#23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The `proptest`-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph target, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (1) the full ECS table scans required for change detection and (2) the overhead of reuploading the various GPU buffers. Change indexes (PR bevyengine#23519) address issue (1), and more use of `SparseBufferVec` (PR bevyengine#23242) will address issue (2).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum
[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
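
The three scan-and-fan steps described in this commit message can be modeled sequentially on the CPU. The sketch below is an assumption-laden illustration, not the shader: `hillis_steele_inclusive` and `scan_and_fan` are hypothetical names, the scan is inclusive (an allocator would more likely want exclusive offsets), `chunk_size` must be nonzero, and the inner loops that a GPU would run in parallel are run one element at a time.

```rust
/// Inclusive Hillis-Steele scan over one chunk. Each pass adds the element
/// `stride` positions back; a snapshot of the previous pass plays the role
/// of the double buffer a parallel implementation would use.
fn hillis_steele_inclusive(chunk: &mut [u32]) {
    let n = chunk.len();
    let mut stride = 1;
    while stride < n {
        let prev = chunk.to_vec();
        for i in stride..n {
            chunk[i] = prev[i] + prev[i - stride];
        }
        stride *= 2;
    }
}

/// Three-step scan and fan over `counts`, in chunks of `chunk_size` elements.
fn scan_and_fan(counts: &[u32], chunk_size: usize) -> Vec<u32> {
    let mut sums = counts.to_vec();

    // Step 1, local allocation: scan each chunk, record its total in the fan buffer.
    let mut fan = Vec::new();
    for chunk in sums.chunks_mut(chunk_size) {
        hillis_steele_inclusive(chunk);
        fan.push(*chunk.last().unwrap());
    }

    // With a single chunk, steps 2 and 3 are unnecessary.
    if fan.len() > 1 {
        // Step 2, global allocation: scan the per-chunk totals.
        hillis_steele_inclusive(&mut fan);

        // Step 3, fan: add the running total leading into each chunk.
        for (i, chunk) in sums.chunks_mut(chunk_size).enumerate().skip(1) {
            let carry = fan[i - 1];
            for value in chunk {
                *value += carry;
            }
        }
    }
    sums
}
```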


pcwalton · 2026-04-09T01:16:30Z

This is blocked on adding where T: Component to Mut<T> to avoid performance regressions.

pcwalton · 2026-04-09T01:21:17Z

I haven't looked in detail at #11120, but if it's table-level change detection, this PR essentially subsumes that. As the page size approaches infinity, this PR collapses to table-level change detection.

I think that this can basically run on top of #22500 because it just adds an additional "skip N rows" function. Note that this PR is row-based, not column-based, so I don't think we can really implement #22500 on top of this, nor vice versa. I tried per-column change indexes originally, but it failed to achieve the speedups that I wanted in rendering, because rendering checks a lot of columns (15 or so).


pcwalton · 2026-04-10T00:48:40Z

I believe this feature is dead as written for the reason pointed out by @chescock.

We can't fix the feature by doing component-level summaries, as those are way too expensive for rendering, which has to check 15 components or so.

pcwalton · 2026-04-10T01:46:23Z

I'll keep this open for discussion but I would like to try a completely different approach. My current thought is:

1. Get rid of the page table. Instead, just have a table-level "last changed" tick.
2. Don't modify Mut at all. Instead, pessimistically assume that touching a table during an iteration that might mutate an indexed component mutates it.
3. Use an atomic operation to update the per-table last changed tick during "table rollover" (moving to a new table during an iteration, or Query::get), if the iteration might mutate an indexed component.

The nice thing about (3) is that table rollover is a relatively-expensive operation anyways, so adding one new atomic shouldn't be too bad. Note that there is zero extra overhead unless the query in question is mutating an indexed component.

This should simplify things quite a bit, as well as be faster, and will probably be just as effective in practice.

(Thanks to Sander and Diddykonga for productive discussions here!)

Diddykonga · 2026-04-10T03:28:38Z

> Use an atomic operation to update the per-table last changed tick during "table rollover" (moving to a new table during an iteration, or Query::get).

Doesn't this have the same issue?
It will just happen at a different point and a different rate. Mut deref -> Query table 'rollover'

pcwalton · 2026-04-10T06:25:02Z

@Diddykonga No, because this operation will use an atomic compare-and-swap to avoid ever going backwards in time.
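A minimal sketch of such a never-goes-backwards update, under stated assumptions: ticks are plain `u32` values that only increase (no wraparound handling), and `TableLastChanged` is a hypothetical name, not a type from the PR.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical per-table "last changed" tick that several threads may update.
pub struct TableLastChanged(AtomicU32);

impl TableLastChanged {
    pub fn new(tick: u32) -> Self {
        Self(AtomicU32::new(tick))
    }

    pub fn get(&self) -> u32 {
        self.0.load(Ordering::Relaxed)
    }

    /// Advance the stored tick to `this_run`, but never move it backwards.
    pub fn mark_changed(&self, this_run: u32) {
        let mut current = self.0.load(Ordering::Relaxed);
        // Compare-and-swap loop: retry only while our tick is still newer than the stored one.
        while current < this_run {
            match self.0.compare_exchange_weak(
                current,
                this_run,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => break,
                // Another thread won the race (or the weak CAS failed spuriously); re-check.
                Err(observed) => current = observed,
            }
        }
    }
}
```

Under the no-wraparound assumption this is equivalent to `AtomicU32::fetch_max`; the explicit loop is shown because a real tick comparison would presumably need to account for wraparound.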

pcwalton · 2026-04-11T21:01:19Z

I prototyped this and found the overhead on checked .get_mut() to be <10%.

Affinator · 2026-04-28T11:35:03Z

I know that this would balloon the complexity, but I did not find it discussed here: How about being able to select the granularity of the indexing? I.e. archetype, table, pages, column, entity?

(Ignoring, for now, the question of how to implement this when more than one approach is needed for a table.)

This would allow really fine-tuning the performance for each use case. Additionally it would not be necessary to implement everything at once. The different approaches could be added one after the other.

pcwalton · 2026-04-29T21:12:26Z

I'm a little concerned about the code bloat that that would add. I suspect the Rust compiler is inlining Mut::deref_mut and adding anything to that is going to be multiplied many times over.

amtep · 2026-05-16T17:30:47Z

> Don't modify Mut at all. Instead, pessimistically assume that touching a table during an iteration that might mutate an indexed component mutates it.

It does make me a little sad that cleverly avoiding mutable derefs if nothing has changed will no longer help :)

akriegman · 2026-05-31T09:15:22Z

> 3. Use an atomic operation to update the per-table last changed tick during "table rollover" (moving to a new table during an iteration, or `Query::get`), if the iteration might mutate an indexed component.

Couldn't we track the last changed tick per column per table? Then we'd get more granularity for not much more work, and iiuc we wouldn't need atomics?

Implement opt-in change indexes for dense components.

19b5ecc

alice-i-cecile added the C-Feature (A new feature, making something new possible), A-ECS (Entities, components, systems, and events), C-Performance (A change motivated by improving speed, memory usage or compile times), and X-Needs-SME (This type of work requires an SME to approve it.) labels Mar 25, 2026

alice-i-cecile self-assigned this Mar 25, 2026

github-project-automation Bot added this to ECS Mar 25, 2026

github-project-automation Bot moved this to Needs SME Triage in ECS Mar 25, 2026

alice-i-cecile added the M-Release-Note (Work that should be called out in the blog due to impact) and S-Needs-Review (Needs reviewer attention (from anyone!) to move forward) labels Mar 25, 2026

alice-i-cecile requested review from ElliottjPierce, chescock and james-j-obrien March 25, 2026 21:57

ecoskey self-requested a review March 25, 2026 22:06

ElliottjPierce reviewed Mar 26, 2026

pcwalton added 2 commits March 25, 2026 19:02

Implement opt-in change indexes for dense components.

aa9bb66

rel-with-deb-info

2022be0

This was referenced Mar 26, 2026

Fast and Flexible Change Detection #23152 (Open)

Add a T: Component bound to Mut and Ref smart pointers #23529 (Open)

alice-i-cecile reviewed Mar 26, 2026

chescock reviewed Mar 30, 2026

pcwalton added 2 commits April 8, 2026 18:55

Merge remote-tracking branch 'origin/main' into hierarchical-change-ticks

9b9e1ca

Merge remote-tracking branch 'pcwalton/hierarchical-change-ticks' into hierarchical-change-ticks

764417d

pcwalton closed this Apr 10, 2026

github-project-automation Bot moved this from Needs SME Triage to Done in ECS Apr 10, 2026

pcwalton reopened this Apr 10, 2026

github-project-automation Bot moved this from Done to Needs SME Triage in ECS Apr 10, 2026

pcwalton added 3 commits April 10, 2026 18:44

Remove the page table

38a5fb5

WIP, broken

aafdbbb

WIP

1945f2b

pcwalton added 5 commits April 29, 2026 14:35

Merge remote-tracking branch 'origin/main' into hierarchical-change-ticks

f40ca02

Merge remote-tracking branch 'origin/main' into hierarchical-change-ticks

3163d70

Get change ticks working

366e30f

Merge remote-tracking branch 'origin/main' into hierarchical-change-ticks

761ada1

Fix detection for change indices

a5b0b05

cart force-pushed the main branch from af894e5 to 017ffc5 May 4, 2026 23:35

cart closed this May 5, 2026

github-project-automation Bot moved this from Needs SME Triage to Done in ECS May 5, 2026

cart reopened this May 5, 2026

github-project-automation Bot moved this from Done to Needs SME Triage in ECS May 5, 2026

	pub fn is_newer_than(self, last_run: Tick, this_run: Tick) -> bool {
	// This works even with wraparound because the world tick (`this_run`) is always "newer" than
	// `last_run` and `self.tick`, and we scan periodically to clamp `ComponentTicks` values
	// so they never get older than `u32::MAX` (the difference would overflow).
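The wraparound reasoning in the comment above can be illustrated with a standalone sketch. This is not the Bevy source: it uses bare `u32` ticks in a free function, and `MAX_CHANGE_AGE` here is an illustrative stand-in for the clamp the comment mentions, not necessarily the value Bevy uses.

```rust
/// Illustrative clamp: ages are capped here so that very old ticks compare as "old".
const MAX_CHANGE_AGE: u32 = u32::MAX / 2;

/// Returns true if `tick` is newer than `last_run`, measured relative to `this_run`.
/// Ages are computed with wrapping subtraction, so the result stays correct
/// when the tick counter has wrapped past `u32::MAX`.
fn is_newer_than(tick: u32, last_run: u32, this_run: u32) -> bool {
    let ticks_since_change = this_run.wrapping_sub(tick).min(MAX_CHANGE_AGE);
    let ticks_since_system = this_run.wrapping_sub(last_run).min(MAX_CHANGE_AGE);
    // A change is "new" if it happened more recently than the system last ran.
    ticks_since_system > ticks_since_change
}
```

Comparing ages relative to `this_run`, rather than comparing the raw tick values, is what lets the check survive the counter wrapping around.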
