Unpack bins belonging to multidrawable batch sets on the GPU instead of on the CPU.#23481

Merged
alice-i-cecile merged 8 commits into
bevyengine:mainfrom
pcwalton:bin-slabs-2
Mar 30, 2026

Conversation

@pcwalton

Contributor

The Bevy renderer maintains a single flat array containing information needed to render each mesh instance, known as the mesh input uniform array. Entities can be added to or removed from this array in O(1) time. The downside of that, however, is that the mesh input uniform array can't be directly used for rendering. In order to render using multi-draw indirect (MDI), which is the most efficient way to render meshes that wgpu supports, the mesh instances must be grouped into batch sets, within which all mesh instances share the same rendering state. We currently solve this problem by producing a list of preprocess work items per batch set every frame on the CPU. A preprocess work item consists of two indices: one into the list of mesh input uniforms, and one into the list of indirect draw commands needed for multi-draw indirect.
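
To make the cost of the per-frame rebuild concrete, here is a minimal Rust sketch of that CPU-side loop. The type and field names (`PreprocessWorkItem`, `Bin`, `build_work_items`) are illustrative assumptions and are not taken from the patch; only the "two 32-bit indices per mesh instance" shape comes from the description above.

```rust
/// Hypothetical mirror of a preprocess work item: two 32-bit indices.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PreprocessWorkItem {
    /// Index into the mesh input uniform array.
    input_index: u32,
    /// Index into the list of indirect draw commands.
    indirect_parameters_index: u32,
}

/// Hypothetical bin: all instances that share one indirect draw command.
struct Bin {
    indirect_parameters_index: u32,
    input_indices: Vec<u32>,
}

/// Rebuilds the work item list for one batch set from scratch.
/// The cost is proportional to the total number of mesh instances,
/// which is what makes doing this every frame expensive.
fn build_work_items(bins: &[Bin]) -> Vec<PreprocessWorkItem> {
    let mut items = Vec::new();
    for bin in bins {
        for &input_index in &bin.input_indices {
            items.push(PreprocessWorkItem {
                input_index,
                indirect_parameters_index: bin.indirect_parameters_index,
            });
        }
    }
    items
}
```

Every mesh instance contributes one item, so the loop touches each instance once per frame even when nothing in the scene has changed.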

The problem with this approach is that building the list of preprocess work items anew every frame becomes a performance bottleneck when scaling to millions of mesh instances. Even though the list consists of only two 32-bit indices per mesh instance, this is enough to shoot the system (batch_and_prepare_binned_render_phase) to the top of the profile when scaling above 3 million mesh instances or so. Other data-driven engines, such as Unity DOTS, have no problem with higher mesh instance counts than that, so this currently represents a performance problem for Bevy.

This PR partially solves the problem by introducing a bin unpacking step that runs on the GPU. On the CPU, the indices into the mesh input uniform list are cached from frame to frame and only updated when entities are added to or removed from bins. The GPU bin unpacking step takes this incrementally updated list of entities and produces the list of preprocess work items. We execute one dispatch of the bin unpacking shader per batch set; this ensures that the number of GPU commands issued per batch set remains constant.
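
The unpacking step can be pictured with a CPU-side simulation in Rust. This is a sketch under assumptions, not the shader from this PR: the layout of the cached list (`CachedEntry` with a `bin_index`) and the per-bin lookup table `bin_to_indirect` are hypothetical, chosen only to show that each output element can be computed independently, which is what makes the work suitable for a compute dispatch.

```rust
/// Hypothetical cached entry that persists from frame to frame.
#[derive(Clone, Copy)]
struct CachedEntry {
    /// Index into the mesh input uniform array.
    input_index: u32,
    /// Which bin of the batch set this instance belongs to.
    bin_index: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct WorkItem {
    input_index: u32,
    indirect_parameters_index: u32,
}

/// Body of one simulated shader invocation: reads one cached entry and
/// produces one work item. No invocation depends on another's result.
fn unpack_one(entries: &[CachedEntry], bin_to_indirect: &[u32], invocation: usize) -> WorkItem {
    let entry = entries[invocation];
    WorkItem {
        input_index: entry.input_index,
        indirect_parameters_index: bin_to_indirect[entry.bin_index as usize],
    }
}

/// Simulates a single dispatch covering a whole batch set.
fn unpack_batch_set(entries: &[CachedEntry], bin_to_indirect: &[u32]) -> Vec<WorkItem> {
    (0..entries.len())
        .map(|invocation| unpack_one(entries, bin_to_indirect, invocation))
        .collect()
}
```

In this picture the CPU only edits `entries` when an entity enters or leaves a bin; the per-instance expansion happens in the dispatch.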

A major design goal of this work was to make the data structure as simple as possible while ensuring that adding or removing an entity is O(1). To maintain compatibility with Metal, and to reduce the number of invocations of the sparse buffer update in the future, we avoid bindless buffers. While the most obvious approach might seem to be to use the offset allocator crate to allocate hash sets to hold entities inside a buffer, that approach would actually end up being more complex than the solution implemented in this PR, as well as causing performance problems when blocks in the allocator must move. Instead, this patch introduces a data structure, the RenderMultidrawableBatchSet, which carefully maintains indirection in order to keep the number of operations that must be performed when an entity is added or removed O(1).
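
For readers unfamiliar with the general technique, the following Rust sketch shows one common way to get O(1) insertion and removal in a densely packed array by keeping a level of indirection. It is an illustration of the idea only; `DenseEntitySet` and its fields are hypothetical and much simpler than `RenderMultidrawableBatchSet`.

```rust
use std::collections::HashMap;

/// Hypothetical dense set: `dense` stays tightly packed (suitable for
/// uploading to a GPU buffer), and `slots` records where each entity lives.
struct DenseEntitySet {
    dense: Vec<u64>,
    slots: HashMap<u64, usize>,
}

impl DenseEntitySet {
    fn new() -> Self {
        Self { dense: Vec::new(), slots: HashMap::new() }
    }

    /// O(1) expected: append to the end and remember the slot.
    fn insert(&mut self, entity: u64) -> bool {
        if self.slots.contains_key(&entity) {
            return false;
        }
        self.slots.insert(entity, self.dense.len());
        self.dense.push(entity);
        true
    }

    /// O(1) expected: move the last element into the hole and fix up
    /// the single slot entry that changed.
    fn remove(&mut self, entity: u64) -> bool {
        let Some(slot) = self.slots.remove(&entity) else {
            return false;
        };
        self.dense.swap_remove(slot);
        if let Some(&moved) = self.dense.get(slot) {
            self.slots.insert(moved, slot);
        }
        true
    }
}
```

Only one element moves per removal, so a sparse upload scheme would have at most a constant number of array slots to rewrite per changed entity.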

Because the RenderMultidrawableBatchSet is a somewhat complex data structure with several invariants, I used the proptest crate to perform randomized testing. The resulting test simulates adding and removing entities and checks that the invariants hold after each randomized workload. I observed no errors after many runs of the randomized test, so I have good confidence that the data structure is correct.

On many_cubes --instance-count 1000000 --no-cpu-culling, this PR takes batch_and_prepare_binned_render_phase from 1.52 ms median time to 0.0543 ms, a 28.0x speedup.

There are two improvements I want to make after this PR. I held off on making them in order to reduce the size of this patch, as well as to make bisecting any regressions easier. The improvements are:

  1. Currently, the bin buffers are stored using a RawBufferVec. However, they don't change from frame to frame for static meshes, and updating only has constant overhead for each changed mesh instance. A SparseBufferVec would be more efficient.

  2. Although extracting bins has been moved from CPU to GPU, the bins themselves are still traversed on CPU in order to allocate space for the mesh uniforms, which are keyed off the built-in instance ID in the vertex shader. This could be a bottleneck when there are large numbers of meshes. This process, which is essentially just a prefix sum, could be moved from the CPU to the GPU with one or two more compute shader dispatches per batch set.
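
The prefix sum mentioned in item 2 can be sketched as follows. This is a sequential Rust illustration under assumptions: `allocate_mesh_uniform_ranges` and its inputs are hypothetical names, and the code only shows why assigning each mesh a contiguous range of instance slots is an exclusive prefix sum over per-mesh instance counts.

```rust
/// Given the instance count of each mesh in a batch set, returns the
/// first slot assigned to each mesh plus the total number of slots.
/// Each mesh's instances occupy `first[i] .. first[i] + counts[i]`.
fn allocate_mesh_uniform_ranges(instance_counts: &[u32]) -> (Vec<u32>, u32) {
    let mut first_instance = Vec::with_capacity(instance_counts.len());
    let mut running_total = 0u32;
    for &count in instance_counts {
        // Exclusive scan: record the total *before* adding this mesh.
        first_instance.push(running_total);
        running_total += count;
    }
    (first_instance, running_total)
}
```

The loop runs once per mesh rather than once per mesh instance, which matches the remaining CPU cost described above.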

Screenshot 2026-03-23 084433

@pcwalton pcwalton requested review from atlv24 and tychedelia March 23, 2026 15:56
@pcwalton pcwalton added the A-Rendering Drawing game state to the screen label Mar 23, 2026
@github-project-automation github-project-automation Bot moved this to Needs SME Triage in Rendering Mar 23, 2026
@pcwalton pcwalton added S-Needs-Review Needs reviewer attention (from anyone!) to move forward C-Performance A change motivated by improving speed, memory usage or compile times labels Mar 23, 2026

@atlv24 atlv24 left a comment

Contributor

This is really good work. The comments are a bit noisy at times, but overall solid engineering. I like the proptests.

It feels like this is implementing a special ECS for the GPU. I wonder if we can one day have this be more ECS native, if bevy_ecs can catch up.

I think a lot of the indexed/non-indexed code path duplication (including pre-existing duplication) can probably be cleaned up, but that should be a follow-up. I just feel like there's possibility for a better abstraction there.

Let's land it!

Comment on lines +24 to +27
// Padding.
pad_a: u32,
// Padding.
pad_b: array<vec4<u32>, 15>,

Contributor

why not just

Suggested change
- // Padding.
- pad_a: u32,
- // Padding.
- pad_b: array<vec4<u32>, 15>,
+ pad_b: array<u32, 61>,

?

@pcwalton pcwalton Mar 27, 2026

Contributor Author

I originally tried that, but it caused a UBO alignment error. It seems that wgpu wants UBO types to have 16-byte alignment.
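
As a side note on the numbers in this thread, the two padding layouts cover the same number of bytes; they differ only in array element stride. The Rust sketch below checks that arithmetic. The helper names are hypothetical, and the stride rule it encodes (array elements in a uniform buffer must have a stride that is a multiple of 16 bytes) is stated here as the likely cause of the error, not something confirmed in the thread.

```rust
const U32_SIZE: usize = 4;
const VEC4_U32_SIZE: usize = 16;

/// Bytes covered by `pad_a: u32` plus `pad_b: array<vec4<u32>, 15>`.
fn split_padding_bytes() -> usize {
    U32_SIZE + 15 * VEC4_U32_SIZE
}

/// Bytes covered by the suggested `pad_b: array<u32, 61>`.
fn flat_padding_bytes() -> usize {
    61 * U32_SIZE
}

/// Assumed uniform-buffer rule: array element stride must be a
/// multiple of 16 bytes.
fn is_valid_uniform_array_stride(stride: usize) -> bool {
    stride % 16 == 0
}
```

Both layouts pad by 244 bytes, but an array of `u32` has a 4-byte stride, while an array of `vec4<u32>` has a 16-byte stride.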

@IceSentry IceSentry self-requested a review March 27, 2026 05:49

@IceSentry IceSentry left a comment

Contributor

LGTM!

There are a few potential panics that could make me uncomfortable, but I think the tests are enough to prove that they won't be hit.

Comment thread crates/bevy_render/src/batching/gpu_preprocessing.rs Outdated
@IceSentry IceSentry added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Mar 30, 2026
Co-authored-by: IceSentry <IceSentry@users.noreply.github.com>
@github-actions

Contributor

Your PR caused a change in the graphical output of an example or rendering test. This might be intentional, but it could also mean that something broke!
You can review it at https://pixel-eagle.com/project/B04F67C0-C054-4A6F-92EC-F599FEC2FD1D?filter=PR-23481

If it's expected, please add the M-Deliberate-Rendering-Change label.

If this change seems unrelated to your PR, you can consider updating your PR to target the latest main branch, either by rebasing or merging main into it.

@alice-i-cecile

Member

Latest commit seems to have made the broken shadows with render layers worse: https://pixel-eagle.com/project/b04f67c0-c054-4a6f-92ec-f599fec2fd1d/run/38573/compare/38550?screenshot=testbed_3d/screenshot-RenderLayers.png

It might be better to simply merge this as is and do a proper fix for this in follow-up. Let me know how you'd like to proceed.

@alice-i-cecile

Member

The render layers + light problem is pre-existing, and tracked in #23264. I'm going to merge this, and we can attempt a fix separately.

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Mar 30, 2026
Merged via the queue into bevyengine:main with commit 894d8d7 Mar 30, 2026
42 checks passed
@github-project-automation github-project-automation Bot moved this from Needs SME Triage to Done in Rendering Mar 30, 2026
splo pushed a commit to splo/bevy that referenced this pull request Mar 31, 2026
…of on the CPU. (bevyengine#23481)

Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Co-authored-by: IceSentry <IceSentry@users.noreply.github.com>
pcwalton added a commit to pcwalton/bevy that referenced this pull request Apr 2, 2026
The goal of GPU-driven rendering is to cache the entire scene graph on
the GPU in a form that's efficient for rendering and, for objects that
didn't change since the previous frame, to have zero CPU-side overhead.
If the scene didn't change, the only CPU overhead should be proportional
to the number of multi-draw indirect calls. PR bevyengine#23481 eliminated the CPU
loop over every mesh *instance* in rendering, which brought us closer to
this ideal, but it didn't fully get us there, because there's still a
CPU loop over every *mesh*. Although there are usually many fewer meshes
than mesh instances in large scenes, this still represents a potential
bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data
structures that the GPU transform-and-cull stage stores the
post-transform data in. Unlike `MeshInputUniform`s, which are scattered
throughout memory and allocated using a CPU-side free list,
`MeshUniform`s are indexed by *instance ID*. Because of the way
multi-draw indirect assigns instance IDs, all instances of a specific
mesh must be adjacent to one another. This necessitates a global
allocation pass that lays out `MeshUniform`s in memory such that all the
instances of a specific mesh end up adjacent to one another. This
operation is currently performed on the CPU in the
`MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set`
method and has overhead proportional to the number of separate meshes
(not mesh *instances*) in each batch set.

This PR addresses the problem by moving the sequential loop in that
method to the GPU. A new GPU phase known as the *uniform allocation*
step has been added. This shader essentially performs a [prefix sum] in
order to allocate the `MeshUniform`s corresponding to the batches within
a batch set. This isn't the first prefix sum operation that we have in
Bevy: PR bevyengine#23036 added a prefix sum for light clustering. However, in
order to scale better to tens of thousands of meshes in a single batch,
the uniform allocation pass added in this PR uses the three-step *scan
and fan* process rather than the two-step process that PR bevyengine#23036 uses.
The scan and fan algorithm works as follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size
   equal to the workgroup size (256, in this case), producing a prefix
   sum for each 256-element block. Write the final sum of each chunk to
   a *fan buffer*.

2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer
   and write the results. Now each chunk can determine the running total
   leading into that chunk.

3. *Fan*: Add the running total leading into each chunk to every element
   of the chunk itself.

Note that, if the number of meshes is lower than the workgroup size, we
only need step (1) above and can skip steps (2) and (3). Because batch
sets rarely contain over 256 meshes, this means that in real-world
scenes we typically only need to run step (1).
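
The three steps above can be simulated on the CPU as follows. This Rust sketch is an illustration under assumptions, not the shader code: `chunk_size` stands in for the workgroup size, each call to `hillis_steele_scan` models one workgroup, and the global step scans the whole fan buffer in one call, which ignores the workgroup size limit a real dispatch would have. The result is an inclusive prefix sum; subtracting each input element gives the exclusive offsets an allocator would use.

```rust
/// Inclusive Hillis-Steele scan over one chunk. The snapshot taken at
/// each step models the barrier between passes on the GPU.
fn hillis_steele_scan(chunk: &mut [u32]) {
    let n = chunk.len();
    let mut stride = 1;
    while stride < n {
        let previous = chunk.to_vec();
        for i in stride..n {
            chunk[i] = previous[i] + previous[i - stride];
        }
        stride *= 2;
    }
}

/// Scan and fan: local scans, a scan of the per-chunk totals, then a
/// fan step that adds each chunk's incoming running total.
fn scan_and_fan(values: &[u32], chunk_size: usize) -> Vec<u32> {
    assert!(chunk_size > 0);
    let mut out = values.to_vec();

    // Step 1, local allocation: scan each chunk, record its final sum.
    let mut fan: Vec<u32> = Vec::new();
    for chunk in out.chunks_mut(chunk_size) {
        hillis_steele_scan(chunk);
        fan.push(chunk[chunk.len() - 1]);
    }

    // With at most one chunk, steps 2 and 3 are unnecessary.
    if fan.len() <= 1 {
        return out;
    }

    // Step 2, global allocation: scan the fan buffer.
    hillis_steele_scan(&mut fan);

    // Step 3, fan: add the running total leading into each chunk.
    for (k, chunk) in out.chunks_mut(chunk_size).enumerate().skip(1) {
        let base = fan[k - 1];
        for value in chunk {
            *value += base;
        }
    }
    out
}
```

With a chunk size of 256 and fewer than 256 inputs, the function returns right after step 1, mirroring the fast path described above.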

This patch had to rework the `RenderMultidrawableBatchSet` structure
added in PR bevyengine#23481 in order to perform additional bookkeeping necessary
to keep the time complexity of adding a mesh instance O(1). The
`proptest`-based test suite has been updated and extended significantly
to deal with this additional complexity.

For static meshes without skins or morph targets, this PR eliminates the
last remaining per-mesh overhead in the render schedules, with the
exceptions of (1) the full ECS table scans required for change detection
and (2) the overhead of reuploading the various GPU buffers. Change
indexes (PR bevyengine#23519) address issue (1), and more use of `SparseBufferVec`
(PR bevyengine#23242) will address issue (2).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum

[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
pcwalton added a commit to pcwalton/bevy that referenced this pull request Apr 2, 2026
pcwalton added a commit to pcwalton/bevy that referenced this pull request Apr 3, 2026
pcwalton added a commit to pcwalton/bevy that referenced this pull request Apr 4, 2026
The goal of GPU-driven rendering is to cache the entire scene graph on
the GPU in a form that's efficient for rendering and, for objects that
didn't change since the previous frame, to have zero CPU-side overhead.
If the scene didn't change, the only CPU overhead should be proportional
to the number of multi-draw indirect calls. PR bevyengine#23481 eliminated the CPU
loop over every mesh *instance* in rendering, which brought us closer to
this ideal, but it didn't fully get us there, because there's still a
CPU loop over every *mesh*. Although there are usually many fewer meshes
than mesh instances in large scenes, this still represents a potential
bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data
structures that the GPU transform-and-cull stage stores the
post-transform data in. Unlike `MeshInputUniform`s, which are scattered
throughout memory and allocated using a CPU-side free list,
`MeshUniform`s are indexed by *instance ID*. Because of the way
multi-draw indirect assigns instance IDs, all instances of a specific
mesh must be adjacent to one another. This necessitates a global
allocation pass that lays out `MeshUniform`s in memory such that all the
instances of a specific mesh end up adjacent to one another. This
operation is currently performed on the CPU in the
`MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set`
method and has overhead proportional to the number of separate meshes
(not mesh *instances*) in each batch set.
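
As a rough illustration of the kind of sequential allocation described above, the following sketch assigns each mesh in a batch set a contiguous run of output slots. It is not the code in `prepare_multidrawable_binned_batch_set`; the function name `allocate_sequential` and the use of plain per-mesh instance counts as input are assumptions made for the example.

```rust
/// Hypothetical helper, not part of Bevy: given the number of instances of
/// each mesh in a batch set, returns the first output slot of each mesh and
/// the total number of slots needed.
fn allocate_sequential(instance_counts: &[u32]) -> (Vec<u32>, u32) {
    let mut first_slot = Vec::with_capacity(instance_counts.len());
    let mut next = 0u32;
    // One iteration per mesh (not per mesh instance): this is the loop whose
    // cost grows with the number of separate meshes.
    for &count in instance_counts {
        // All instances of this mesh occupy `next..next + count`, so they
        // end up adjacent to one another.
        first_slot.push(next);
        next += count;
    }
    (first_slot, next)
}
```

For instance, instance counts of `[3, 1, 4, 2]` yield first slots `[0, 3, 4, 8]` and a total of 10. Each mesh's first slot is the sum of all the counts before it, which is why the operation can be expressed as a prefix sum.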

This PR addresses the problem by moving the sequential loop in that
method to the GPU. A new GPU phase known as the *uniform allocation*
step has been added. This shader essentially performs a [prefix sum] in
order to allocate the `MeshUniform`s corresponding to the batches within
a batch set. This isn't the first prefix sum operation that we have in
Bevy: PR bevyengine#23036 added a prefix sum for light clustering. However, in
order to scale better to tens of thousands of meshes in a single batch
set (i.e. multi-draw command), the uniform allocation pass added in this
PR uses the three-step *scan and fan* process rather than the two-step
process that PR bevyengine#23036 uses.  The scan and fan algorithm works as
follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size
   equal to the workgroup size (256, in this case), producing a prefix
   sum for each 256-element block. Write the final sum of each chunk to
   a *fan buffer*.

2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer
   and write the results. Now each chunk can determine the running total
   leading into that chunk.

3. *Fan*: For each chunk, add the running total leading into that chunk
   to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we
only need step (1) above and can skip steps (2) and (3). Because batch
sets rarely contain over 256 meshes, this means that in real-world
scenes we typically only need to run step (1).
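
The three steps above can be modeled on the CPU as follows. This is only a sketch under stated assumptions, not the shader from this PR: the names `hillis_steele_scan` and `scan_and_fan` are hypothetical, each "workgroup" is emulated with an ordinary loop, the scan is inclusive, and the fan buffer is scanned in one pass regardless of its length.

```rust
/// Inclusive Hillis-Steele scan over one slice, emulated sequentially.
/// Each pass reads only the previous pass's values, as parallel invocations
/// separated by a barrier would.
fn hillis_steele_scan(values: &mut [u32]) {
    let mut stride = 1;
    while stride < values.len() {
        let previous = values.to_vec();
        for i in stride..values.len() {
            values[i] = previous[i] + previous[i - stride];
        }
        stride *= 2;
    }
}

/// Hypothetical CPU model of scan and fan. `chunk_size` plays the role of
/// the workgroup size (256 in the text above).
fn scan_and_fan(counts: &[u32], chunk_size: usize) -> Vec<u32> {
    assert!(chunk_size > 0);
    let mut sums = counts.to_vec();

    // Step 1, local allocation: scan each chunk and record its final sum
    // in the fan buffer.
    let mut fan: Vec<u32> = Vec::new();
    for chunk in sums.chunks_mut(chunk_size) {
        hillis_steele_scan(chunk);
        fan.push(*chunk.last().unwrap());
    }

    // With at most one chunk, steps 2 and 3 can be skipped.
    if fan.len() <= 1 {
        return sums;
    }

    // Step 2, global allocation: scan the per-chunk totals.
    hillis_steele_scan(&mut fan);

    // Step 3, fan: add the running total leading into each chunk to every
    // element of that chunk. The first chunk has nothing leading into it.
    for (chunk_index, chunk) in sums.chunks_mut(chunk_size).enumerate().skip(1) {
        let carry = fan[chunk_index - 1];
        for value in chunk.iter_mut() {
            *value += carry;
        }
    }
    sums
}
```

With a chunk size of 2, the input `[3, 1, 4, 2, 5]` scans locally to `[3, 4]`, `[4, 6]`, `[5]`, the fan buffer `[4, 6, 5]` scans to `[4, 10, 15]`, and the fan step produces `[3, 4, 8, 10, 15]`, matching a sequential running sum.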

This patch had to rework the `RenderMultidrawableBatchSet` structure
added in PR bevyengine#23481 in order to perform additional bookkeeping necessary
to keep the time complexity of adding a mesh instance O(1). The
`proptest`-based test suite has been updated and extended significantly
to deal with this additional complexity.

For static meshes without skins or morph targets, this PR eliminates the
last remaining per-mesh overhead in the render schedules, with the
exceptions of (a) the full ECS table scans required for change detection
and (b) the overhead of reuploading the various GPU buffers. Change
indexes (PR bevyengine#23519) address issue (a), and more use of `SparseBufferVec`
(PR bevyengine#23242) will address issue (b).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum

[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
pcwalton added a commit to pcwalton/bevy that referenced this pull request Apr 8, 2026
number of cameras.

The intention has long been to render shadow maps for point and spot
lights only once, regardless of the number of views. This is reflected
in the fact that `RetainedViewEntity::auxiliary_entity` is
`Entity::PLACEHOLDER` for them. Unfortunately, this is currently
inconsistently implemented, and a separate `ExtractedView` is presently
spawned and rendered to for every point and spot light shadow map. The
behavior of these views is inconsistent because they violate the
invariant that there must only be one render-world view per
`RetainedViewEntity`.

This patch changes Bevy's behavior to spawn only one `ExtractedView` for
point and spot lights. This required some significant rearchitecting of
the render schedule because the render schedule is currently driven off
cameras. Driving the rendering off cameras is incorrect for point and
spot light shadow maps, which aren't associated with any camera.

This PR fixes the regression that PR bevyengine#23481 introduced in the
`render_layers` test in `testbed_3d`: the test now renders the way it did
before that PR. Note, however, that this rendering may not be what was
intended: the shadows don't match the visible objects. That's because the shadows come
from point lights, which aren't associated with cameras, and therefore
shadows are rendered using the default set of `RenderLayers`. A future
patch may want to add flags to cameras that specify that they should
have their own point light and spot light shadow maps that inherit the
render layer (and HLOD) behavior of their associated cameras. As this
patch is fairly large, though, and because my immediate goal is to fix
the regression in bevyengine#23481, I think those flags are best implemented in a
follow-up.
Labels

A-Rendering: Drawing game state to the screen
C-Performance: A change motivated by improving speed, memory usage or compile times
S-Ready-For-Final-Review: This PR has been approved by the community. It's ready for a maintainer to consider merging it

Projects

Status: Done

4 participants