Unpack bins belonging to multidrawable batch sets on the GPU instead of on the CPU. #23481
The Bevy renderer maintains a single flat array containing information needed to render each mesh instance, known as the *mesh input uniform* array. Entities can be added or removed from this array in O(1) time. The downside of that, however, is that the mesh input uniform array can't be directly used for rendering. In order to render using multi-draw indirect (MDI), which is the most efficient way to render meshes that `wgpu` supports, the mesh instances must be grouped into *batch sets*, within which all mesh instances share the same rendering state. We currently solve this problem by producing a list of *preprocess work items* per batch set every frame on the CPU. A preprocess work item consists of two indices: one into the list of mesh input uniforms, and one into the list of indirect draw commands needed for multi-draw indirect.

The problem with this approach is that building the list of preprocess work items anew every frame becomes a performance bottleneck when scaling to millions of mesh instances. Even though the list consists of only two 32-bit indices per mesh instance, this is enough to shoot the system (`batch_and_prepare_binned_render_phase`) to the top of the profile when scaling above 3 million mesh instances or so. Other data-driven engines, such as Unity DOTS, have no problem with higher mesh instance counts than that, so this currently represents a performance problem for Bevy.

This PR partially solves the problem by introducing a *bin unpacking* step that runs on the GPU. On the CPU, the indices into the mesh input uniform list are cached from frame to frame and only updated when entities are added or removed from bins. The GPU bin unpacking step takes this incrementally-updated list of entities and produces the list of preprocess work items. We execute one dispatch of the bin unpacking shader per batch set; this ensures that the number of GPU commands issued per batch set remains a constant.
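To make the data flow concrete, here is a minimal CPU-side sketch in Rust of what a bin unpacking step computes. The type and field names (`PreprocessWorkItem`, `input_index`, `indirect_parameters_index`, `Bin`, `unpack_bins`) are illustrative assumptions rather than Bevy's actual definitions, and the real step runs as a compute shader, not as a loop on the CPU.

```rust
/// Hypothetical work item: two 32-bit indices per mesh instance.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PreprocessWorkItem {
    /// Index into the mesh input uniform array.
    input_index: u32,
    /// Index into the list of indirect draw commands.
    indirect_parameters_index: u32,
}

/// Hypothetical cached bin: the mesh input uniform indices of its instances,
/// plus the indirect draw command that renders them.
struct Bin {
    indirect_parameters_index: u32,
    input_indices: Vec<u32>,
}

/// Expands the cached bins of one batch set into a flat work item list.
fn unpack_bins(bins: &[Bin]) -> Vec<PreprocessWorkItem> {
    bins.iter()
        .flat_map(|bin| {
            bin.input_indices.iter().map(move |&input_index| PreprocessWorkItem {
                input_index,
                indirect_parameters_index: bin.indirect_parameters_index,
            })
        })
        .collect()
}
```

The point of the sketch is that the per-bin index lists only change when entities enter or leave a bin, so they can be cached, while the flat expansion is the part that is cheap to redo in parallel every frame.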
A major design goal of this work was to make the data structure as simple as possible while ensuring that adding or removing an entity is O(1). To maintain compatibility with Metal, and to reduce the number of invocations of the sparse buffer update in the future, we avoid bindless buffers. While the most obvious approach might seem to be to use the offset allocator crate to allocate hash sets to hold entities inside a buffer, that approach would actually end up being more complex than the solution implemented in this PR, as well as causing performance problems when blocks in the allocator must move. Instead, this patch introduces a data structure, the `RenderMultidrawableBatchSet`, which carefully maintains indirection in order to keep the number of operations that must be performed when an entity is added or removed O(1).

Because the `RenderMultidrawableBatchSet` is a somewhat complex data structure with several invariants, I used the `proptest` crate to perform randomized testing. The resulting test simulates adding and removing entities and checks that the invariants hold after each randomized workload. I observed no errors after many runs of the randomized test, so I have good confidence that the data structure is correct.

On `many_cubes --instance-count 1000000 --no-cpu-culling`, this PR takes `batch_and_prepare_binned_render_phase` from 1.52 ms median time to 0.0543 ms, a 28.0x speedup.

There are two improvements I want to make after this PR. I held off on making them in order to reduce the size of this patch, as well as to make bisecting any regressions easier. The improvements are:

1. Currently, the bin buffers are stored using a `RawBufferVec`. However, they don't change from frame to frame for static meshes, and updating only has constant overhead for each changed mesh instance. A `SparseBufferVec` would be more efficient.
2. Although extracting bins has been moved from CPU to GPU, the bins themselves are still traversed on CPU in order to allocate space for the mesh uniforms, which are keyed off the built-in instance ID in the vertex shader. This could be a bottleneck when there are large numbers of meshes. This process, which is essentially just a prefix sum, could be moved from the CPU to the GPU with one or two more compute shader dispatches per batch set.
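The O(1) add/remove requirement described above can be illustrated with a much simpler structure than `RenderMultidrawableBatchSet`: a dense array plus a reverse map, where removal swaps the last element into the vacated slot. This is only a sketch of the general indirection idea. The names (`DenseBin`, `slot_of`) are hypothetical, entities are stood in for by plain `u32` IDs, and the data structure in the PR tracks considerably more state.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct DenseBin {
    /// Densely packed entity IDs; this is the part a GPU buffer would mirror.
    entities: Vec<u32>,
    /// Reverse map from entity ID to its slot in `entities`.
    slot_of: HashMap<u32, usize>,
}

impl DenseBin {
    /// Adds an entity in amortized O(1) time; returns false if already present.
    fn insert(&mut self, entity: u32) -> bool {
        if self.slot_of.contains_key(&entity) {
            return false;
        }
        self.slot_of.insert(entity, self.entities.len());
        self.entities.push(entity);
        true
    }

    /// Removes an entity in expected O(1) time by swapping the last element
    /// into its slot; returns false if the entity was not present.
    fn remove(&mut self, entity: u32) -> bool {
        let Some(slot) = self.slot_of.remove(&entity) else {
            return false;
        };
        self.entities.swap_remove(slot);
        // If another entity was moved into `slot`, fix its reverse-map entry.
        if let Some(&moved) = self.entities.get(slot) {
            self.slot_of.insert(moved, slot);
        }
        true
    }
}
```

Each operation touches a constant number of slots, so only a constant number of buffer elements would need reuploading per changed entity. The invariant a randomized test would check here is that `slot_of[e]` always points at the slot holding `e`.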
atlv24 left a comment
This is really good work. The comments are a bit noisy at times, but overall solid engineering. I like the proptests.
It feels like this is implementing a special ECS for the GPU. I wonder if we can one day have this be more ECS native, if bevy_ecs can catch up.
I think a lot of the indexed/non-indexed code path duplication (including pre-existing duplication) can probably be cleaned up, but that should be a follow-up. I just feel like there's possibility for a better abstraction there.
Let's land it!
    // Padding.
    pad_a: u32,
    // Padding.
    pad_b: array<vec4<u32>, 15>,

why not just

    -// Padding.
    -pad_a: u32,
    -// Padding.
    -pad_b: array<vec4<u32>, 15>,
    +pad_b: array<u32, 61>,

?
I originally tried that, but it caused a UBO alignment error. It seems that wgpu wants UBO types to have 16-byte alignment.
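For context on the suggestion, both spellings of the padding cover the same number of 32-bit words: 1 + 15 × 4 = 61. The sketch below checks that arithmetic with a hypothetical CPU-side mirror of just the two padding fields; the rest of the struct is not shown in this thread and is omitted. WGSL's layout rules for the uniform address space appear to require array element strides that are a multiple of 16 bytes, which `array<vec4<u32>, 15>` (stride 16) satisfies and `array<u32, 61>` (stride 4) does not. That would be consistent with the alignment error described above, but treat it as an assumption rather than a diagnosis.

```rust
use std::mem::size_of;

/// Hypothetical CPU-side mirror of only the two padding fields under discussion.
#[repr(C)]
#[derive(Clone, Copy)]
struct Padding {
    pad_a: u32,
    /// Stand-in for `array<vec4<u32>, 15>`: 15 elements of 16 bytes each.
    pad_b: [[u32; 4]; 15],
}

/// Number of 32-bit words covered by the two padding fields.
fn padding_words() -> usize {
    size_of::<Padding>() / size_of::<u32>()
}
```

The GPU-side offset of `pad_b` also depends on the fields that precede it, which this sketch ignores.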
IceSentry left a comment
LGTM!
There's a few potential panics that could make me uncomfortable but I think the tests are enough to prove that they won't be hit.
Co-authored-by: IceSentry <IceSentry@users.noreply.github.com>
Your PR caused a change in the graphical output of an example or rendering test. This might be intentional, but it could also mean that something broke! If it's expected, please add the M-Deliberate-Rendering-Change label. If this change seems unrelated to your PR, you can consider updating your PR to target the latest main branch, either by rebasing or merging main into it.
The latest commit seems to have broken shadows with render layers even further: https://pixel-eagle.com/project/b04f67c0-c054-4a6f-92ec-f599fec2fd1d/run/38573/compare/38550?screenshot=testbed_3d/screenshot-RenderLayers.png

It might be better to simply merge this as is and do a proper fix for this in a follow-up. Let me know how you'd like to proceed.
The render layers + light problem is pre-existing and tracked in #23264. I'm going to merge this, and we can attempt a fix separately.
…of on the CPU. (bevyengine#23481)

<img width="2756" height="1800" alt="Screenshot 2026-03-23 084433" src="https://github.com/user-attachments/assets/20366341-7bf8-4dd8-881f-4740c53e9a6b" />

Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Co-authored-by: IceSentry <IceSentry@users.noreply.github.com>
The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls.

PR bevyengine#23481 eliminated the CPU loop over every mesh *instance* in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every *mesh*. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike `MeshInputUniform`s, which are scattered throughout memory and allocated using a CPU-side free list, `MeshUniform`s are indexed by *instance ID*. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out `MeshUniform`s in memory such that all the instances of a specific mesh end up adjacent to one another. This operation is currently performed on the CPU in the `MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set` method and has overhead proportional to the number of separate meshes (not mesh *instances*) in each batch set.

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the *uniform allocation* step has been added. This shader essentially performs a [prefix sum] in order to allocate the `MeshUniform`s corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR bevyengine#23036 added a prefix sum for light clustering.
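As a point of reference, a sequential allocation of this kind amounts to an exclusive prefix sum over per-mesh instance counts. The following Rust sketch illustrates that idea only; `allocate_uniform_ranges` and its inputs are hypothetical and do not reproduce the actual `prepare_multidrawable_binned_batch_set` code.

```rust
/// Given the instance count of each mesh in a batch set, returns the index of
/// the first uniform slot for each mesh, plus the total number of slots used.
/// All instances of mesh `i` occupy the contiguous range starting at
/// `first_slots[i]`, which is the adjacency that instance IDs require.
fn allocate_uniform_ranges(instance_counts: &[u32]) -> (Vec<u32>, u32) {
    let mut first_slots = Vec::with_capacity(instance_counts.len());
    let mut next_free = 0u32;
    for &count in instance_counts {
        first_slots.push(next_free);
        next_free += count;
    }
    (first_slots, next_free)
}
```

Each iteration depends on the running total from the previous one, which is why the loop is inherently sequential on the CPU and why a parallel scan is needed to move it to the GPU.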
However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step *scan and fan* process rather than the two-step process that PR bevyengine#23036 uses. The scan and fan algorithm works as follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a *fan buffer*.
2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.
3. *Fan*: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).

This patch had to rework the `RenderMultidrawableBatchSet` structure added in PR bevyengine#23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The `proptest`-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR bevyengine#23519) address issue (a), and more use of `SparseBufferVec` (PR bevyengine#23242) will address issue (b).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum
[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
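The three scan-and-fan steps can be simulated sequentially on the CPU to check the logic. This Rust sketch is not the shader from the PR: the function names are hypothetical, it assumes inclusive sums (exclusive offsets would be shifted by one element), and it scans the whole fan buffer in one go rather than within a single workgroup.

```rust
/// Chunk size; matches the workgroup size of 256 mentioned above.
const CHUNK: usize = 256;

/// Inclusive Hillis-Steele scan, simulated sequentially. Each pass adds the
/// element `stride` positions back, doubling `stride` until it covers the slice.
fn hillis_steele_scan(data: &mut [u32]) {
    let mut stride = 1;
    while stride < data.len() {
        // Read from a snapshot so every "lane" sees the previous pass's values.
        let prev = data.to_vec();
        for i in stride..data.len() {
            data[i] = prev[i] + prev[i - stride];
        }
        stride *= 2;
    }
}

/// Inclusive prefix sum computed with the three scan-and-fan steps.
fn scan_and_fan(counts: &[u32]) -> Vec<u32> {
    let mut sums = counts.to_vec();

    // Step 1, local allocation: scan each chunk and record each chunk's total.
    let mut fan: Vec<u32> = Vec::new();
    for chunk in sums.chunks_mut(CHUNK) {
        hillis_steele_scan(chunk);
        fan.push(*chunk.last().unwrap());
    }

    // With at most one chunk, steps 2 and 3 can be skipped.
    if fan.len() <= 1 {
        return sums;
    }

    // Step 2, global allocation: scan the per-chunk totals.
    hillis_steele_scan(&mut fan);

    // Step 3, fan: add the running total leading into each chunk.
    for (chunk_index, chunk) in sums.chunks_mut(CHUNK).enumerate().skip(1) {
        let leading = fan[chunk_index - 1];
        for value in chunk {
            *value += leading;
        }
    }
    sums
}
```

The early return mirrors the observation that batch sets with fewer meshes than the workgroup size only need the local allocation step.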
…number of cameras.

The intention has long been to render shadow maps for point and spot lights only once, regardless of the number of views. This is reflected in the fact that `RetainedViewEntity::auxiliary_entity` is `Entity::PLACEHOLDER` for them. Unfortunately, this is currently inconsistently implemented, and a separate `ExtractedView` is presently spawned and rendered to for every point and spot light shadow map. The behavior of these views is inconsistent because they violate the invariant that there must only be one render-world view per `RetainedViewEntity`.

This patch changes Bevy's behavior to spawn only one `ExtractedView` for point and spot lights. This required some significant rearchitecting of the render schedule because the render schedule is currently driven off cameras. Driving the rendering off cameras is incorrect for point and spot light shadow maps, which aren't associated with any camera.

This PR fixes the regression on the `render_layers` test in `testbed_3d` in PR bevyengine#23481, in that it renders the way it rendered before that PR. Note, however, that the rendering isn't what may have been intended: the shadows don't match the visible objects. That's because the shadows come from point lights, which aren't associated with cameras, and therefore shadows are rendered using the default set of `RenderLayers`. A future patch may want to add flags to cameras that specify that they should have their own point light and spot light shadow maps that inherit the render layer (and HLOD) behavior of their associated cameras. As this patch is fairly large, though, and because my immediate goal is to fix the regression in bevyengine#23481, I think those flags are best implemented in a follow-up.
The Bevy renderer maintains a single flat array containing information needed to render each mesh instance, known as the *mesh input uniform* array. Entities can be added or removed from this array in O(1) time. The downside of that, however, is that the mesh input uniform array can't be directly used for rendering. In order to render using multi-draw indirect (MDI), which is the most efficient way to render meshes that `wgpu` supports, the mesh instances must be grouped into *batch sets*, within which all mesh instances share the same rendering state. We currently solve this problem by producing a list of *preprocess work items* per batch set every frame on the CPU. A preprocess work item consists of two indices: one into the list of mesh input uniforms, and one into the list of indirect draw commands needed for multi-draw indirect.

The problem with this approach is that building the list of preprocess work items anew every frame becomes a performance bottleneck when scaling to millions of mesh instances. Even though the list consists of only two 32-bit indices per mesh instance, this is enough to shoot the system (`batch_and_prepare_binned_render_phase`) to the top of the profile when scaling above 3 million mesh instances or so. Other data-driven engines, such as Unity DOTS, have no problem with higher mesh instance counts than that, so this currently represents a performance problem for Bevy.

This PR partially solves the problem by introducing a *bin unpacking* step that runs on the GPU. On the CPU, the indices into the mesh input uniform list are cached from frame to frame and only updated when entities are added or removed from bins. The GPU bin unpacking step takes this incrementally-updated list of entities and produces the list of preprocess work items. We execute one dispatch of the bin unpacking shader per batch set; this ensures that the number of GPU commands issued per batch set remains a constant.
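To make the split between the cached CPU-side list and the unpacking step concrete, here is a minimal CPU-side model in plain Rust. It is only a sketch under stated assumptions: `CachedBin`, `unpack_batch_set`, and all field names are hypothetical, it assumes each bin maps to exactly one indirect draw command, and it uses a generic swap-remove list rather than the data structure this PR actually introduces. In the PR the unpacking loop runs in a compute shader, not on the CPU.

```rust
use std::collections::HashMap;

/// One preprocess work item: the two 32-bit indices described above.
/// (Field names are illustrative, not taken from the PR.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PreprocessWorkItem {
    /// Index into the mesh input uniform array.
    pub input_index: u32,
    /// Index into the list of indirect draw commands.
    pub indirect_parameters_index: u32,
}

/// Hypothetical cached bin: a dense list of mesh input uniform indices that
/// persists across frames and changes only when an instance enters or leaves.
pub struct CachedBin {
    /// Assumption: one indirect draw command per bin.
    pub indirect_parameters_index: u32,
    input_indices: Vec<u32>,
    /// Maps a mesh input uniform index to its slot in `input_indices`.
    slots: HashMap<u32, usize>,
}

impl CachedBin {
    pub fn new(indirect_parameters_index: u32) -> Self {
        Self {
            indirect_parameters_index,
            input_indices: Vec::new(),
            slots: HashMap::new(),
        }
    }

    /// Adds an instance in amortized O(1) time; duplicates are ignored.
    pub fn insert(&mut self, input_index: u32) {
        if self.slots.contains_key(&input_index) {
            return;
        }
        self.slots.insert(input_index, self.input_indices.len());
        self.input_indices.push(input_index);
    }

    /// Removes an instance in expected O(1) time by swapping the last
    /// element into the vacated slot, which keeps the list dense.
    pub fn remove(&mut self, input_index: u32) {
        let Some(slot) = self.slots.remove(&input_index) else {
            return;
        };
        self.input_indices.swap_remove(slot);
        // If an element was moved into `slot`, record its new position.
        if let Some(&moved) = self.input_indices.get(slot) {
            self.slots.insert(moved, slot);
        }
    }
}

/// CPU reference for what the unpacking step computes for one batch set:
/// every cached instance index is paired with its bin's draw command index.
pub fn unpack_batch_set(bins: &[CachedBin]) -> Vec<PreprocessWorkItem> {
    let total: usize = bins.iter().map(|bin| bin.input_indices.len()).sum();
    let mut work_items = Vec::with_capacity(total);
    for bin in bins {
        for &input_index in &bin.input_indices {
            work_items.push(PreprocessWorkItem {
                input_index,
                indirect_parameters_index: bin.indirect_parameters_index,
            });
        }
    }
    work_items
}
```

In this model, per-frame CPU work scales with the number of `insert` and `remove` calls rather than with the total instance count; the full traversal in `unpack_batch_set` is the part that moves to the GPU.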
A major design goal of this work was to make the data structure as simple as possible while ensuring that adding or removing an entity is O(1). To maintain compatibility with Metal, and to reduce the number of invocations of the sparse buffer update in the future, we avoid bindless buffers. While the most obvious approach might seem to be to use the offset allocator crate to allocate hash sets to hold entities inside a buffer, that approach would actually end up being more complex than the solution implemented in this PR, as well as causing performance problems when blocks in the allocator must move. Instead, this patch introduces a data structure, the `RenderMultidrawableBatchSet`, which carefully maintains indirection in order to keep the number of operations that must be performed when an entity is added or removed O(1).

Because the `RenderMultidrawableBatchSet` is a somewhat complex data structure with several invariants, I used the `proptest` crate to perform randomized testing. The resulting test simulates adding and removing entities and checks that the invariants hold after each randomized workload. I observed no errors after many runs of the randomized test, so I have good confidence that the data structure is correct.

On `many_cubes --instance-count 1000000 --no-cpu-culling`, this PR takes `batch_and_prepare_binned_render_phase` from 1.52 ms median time to 0.0543 ms, a 28.0x speedup.

There are two improvements I want to make after this PR. I held off on making them in order to reduce the size of this patch, as well as to make bisecting any regressions easier. The improvements are:
1. Currently, the bin buffers are stored using a `RawBufferVec`. However, they don't change from frame to frame for static meshes, and updating only has constant overhead for each changed mesh instance. A `SparseBufferVec` would be more efficient.

2. Although extracting bins has been moved from CPU to GPU, the bins themselves are still traversed on CPU in order to allocate space for the mesh uniforms, which are keyed off the built-in instance ID in the vertex shader. This could be a bottleneck when there are large numbers of meshes. This process, which is essentially just a prefix sum, could be moved from the CPU to the GPU with one or two more compute shader dispatches per batch set.
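For readers unfamiliar with the term, the prefix sum mentioned in the second improvement can be sketched as follows. This is a sequential CPU illustration only: `allocate_instance_ranges` is a hypothetical name, and it assumes that each bin needs one contiguous run of output slots whose length equals its instance count. A GPU version would compute the same offsets with a parallel scan.

```rust
/// Exclusive prefix sum over per-bin instance counts.
///
/// Returns the first output slot of each bin together with the total number
/// of slots needed. Illustrative only; not code from the PR.
pub fn allocate_instance_ranges(bin_sizes: &[u32]) -> (Vec<u32>, u32) {
    let mut first_instance = Vec::with_capacity(bin_sizes.len());
    let mut running_total = 0u32;
    for &size in bin_sizes {
        // Each bin starts where the previous bins ended.
        first_instance.push(running_total);
        running_total += size;
    }
    (first_instance, running_total)
}
```

For bins holding 3, 0, 2, and 4 instances, the starting offsets are 0, 3, 3, and 5, and 9 slots are needed in total. Because each offset depends on the sum of all earlier counts, this loop is a traversal over every bin, which is why it remains on the CPU critical path until it is moved to the GPU.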