Implement GPU clustering for lights, light probes, and decals. #23036
Currently, Bevy clusters lights on the CPU. This is generally not considered a best practice any longer, and it can be a bottleneck in workloads like `many_lights`. Moreover, it prevents GPU systems like [Hanabi] from creating clusterable objects such as lights and decals without a round trip to the CPU.

This PR introduces GPU light clustering when supported by the hardware. The algorithm is the same as the existing CPU light clustering, but parallelized over all clusters, and the resulting on-GPU format for clusters is unchanged.

GPU light clustering uses the hardware rasterizer for compute purposes as a way to automatically distribute workloads within 2D axis-aligned bounding boxes without actually rendering any pixels, a first for Bevy. The algorithm is as follows, with each step corresponding to a raster or compute command:

1. *Z slicing*: We have a 3D cluster froxel grid of size WxHxD and seek to rasterize D axis-aligned quads, each of size WxH, representing the range of each clusterable object. In this compute phase, we generate D indirect instances for each clusterable object for the subsequent indirect draws.

2. *Count rasterization*: We use instanced indirect drawing to rasterize each quad generated in step 1 to a viewport of size WxH, with color writes disabled. Each rasterized fragment represents a cluster-object pair. In the fragment shader, we check to see if the object intersects the cluster, and, if it does, we atomically bump a counter corresponding to the number of objects of the given type intersecting the cluster in question. We don't record the ID of the object in this phase; we simply count the number of objects.

3. *Local allocation*: Now that we know the number of objects of each type in each cluster, we can proceed to allocate space in the clustered object buffer for each clustered object list. To do this, we need to perform a [*prefix sum*] operation so that each list is tightly packed with the others. For example, if adjacent clusters have 2, 5, and 3 objects, they'll be allocated at offsets 0, 2, and 7 respectively. This *local* step uses a [Hillis-Steele scan] in shared memory to compute the prefix sum of each chunk of 256 clusters. We can't go beyond 256 clusters in this local step because 256 is the maximum workgroup size under `wgpu`'s default limits.

4. *Global allocation*: To deal with the fact that we can't calculate prefix sums beyond 256 clusters in step 3, we employ this second step that does a sequential loop over every 256-cluster chunk, propagating the prefix sum. At the end of this step, every list of clustered objects is allocated.

5. *Populate rasterization*: Finally, we issue an instanced indirect draw command using the same parameters as step 2. We test each cluster-object pair for intersection, and, if the test passes, we record the ID of each clustered object into the correct space in the list, using a scratch pad buffer of atomics to store the position of the next object in each list.

The buffer of clustered objects has a fixed size and can overflow. We detect this condition via asynchronous CPU readback and automatically grow the buffer for subsequent frames. In this case, we also log a message so that the developer can choose a larger initial buffer size and avoid any incorrect frames. Additionally, like bevyengine#22874, the automatic clustering heuristics are dynamically adjusted from frame to frame, by recording statistics on the GPU and using CPU readback to download them back to the CPU for processing.

As part of this PR, I refactored clustered visibility so that clustered objects go through the same `ViewVisibility` system as other objects, instead of using `VisibleClusterableObjects`. This was a nice simplification.

On the `many_lights` benchmark, with about 8,000 lights visible out of 100,000, this process takes approximately 0.12 ms on my NVIDIA GeForce RTX 4070 Laptop GPU.

[Hanabi]: https://github.com/djeedai/bevy_hanabi
[*prefix sum*]: https://en.wikipedia.org/wiki/Prefix_sum
[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
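The count, allocate, and populate steps described above follow a familiar count-then-fill pattern. The following is a minimal sequential CPU sketch of that pattern, not the PR's shader code: `ObjectRange`, `exclusive_prefix_sum`, and `cluster_objects` are hypothetical names, and the "intersection test" is reduced to a range of flat cluster indices purely for illustration.

```rust
/// Hypothetical stand-in for a clusterable object: the half-open range of
/// flat cluster indices it touches. The real test is geometric.
pub struct ObjectRange {
    pub id: u32,
    pub first_cluster: usize,
    pub end_cluster: usize,
}

/// Exclusive prefix sum: `offsets[i]` is the sum of `counts[..i]`.
pub fn exclusive_prefix_sum(counts: &[u32]) -> Vec<u32> {
    let mut offsets = Vec::with_capacity(counts.len());
    let mut running = 0u32;
    for &count in counts {
        offsets.push(running);
        running += count;
    }
    offsets
}

/// Runs the count, allocate, and populate passes one after another.
/// Returns `(offsets, counts, packed_ids)`.
pub fn cluster_objects(
    cluster_count: usize,
    objects: &[ObjectRange],
) -> (Vec<u32>, Vec<u32>, Vec<u32>) {
    // Count pass: one increment per intersecting cluster-object pair.
    // No IDs are recorded yet.
    let mut counts = vec![0u32; cluster_count];
    for object in objects {
        for cluster in object.first_cluster..object.end_cluster.min(cluster_count) {
            counts[cluster] += 1;
        }
    }

    // Allocation pass: tightly pack the per-cluster lists.
    let offsets = exclusive_prefix_sum(&counts);
    let total: u32 = counts.iter().sum();

    // Populate pass: same traversal as the count pass. The per-cluster
    // cursor plays the role of the scratch pad buffer of atomics.
    let mut cursors = vec![0u32; cluster_count];
    let mut packed_ids = vec![0u32; total as usize];
    for object in objects {
        for cluster in object.first_cluster..object.end_cluster.min(cluster_count) {
            let slot = offsets[cluster] + cursors[cluster];
            packed_ids[slot as usize] = object.id;
            cursors[cluster] += 1;
        }
    }

    (offsets, counts, packed_ids)
}
```

Running the traversal twice (once to count, once to fill) is what lets the lists be packed without any per-cluster maximum: the exact size of every list is known before a single ID is written.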
It looks like your PR has been selected for a highlight in the next release blog post, but you didn't provide a release note. Please review the instructions for writing release notes, then expand or revise the content in the release notes directory to showcase your changes.

Your PR caused a change in the graphical output of an example or rendering test. This might be intentional, but it could also mean that something broke! If it's expected, please add the M-Deliberate-Rendering-Change label. If this change seems unrelated to your PR, you can consider updating your PR to target the latest main branch, either by rebasing or merging main into it.

I think we should pick this as one of our "look we made performance better" PRs for this cycle. This is a nice crunchy bit of optimization work, with impressive numbers and graphs and a nice write-up already.

For reference, this is how Doom Eternal did light clustering on the GPU: https://advances.realtimerendering.com/s2020/RenderingDoomEternal.pdf

What Doom Eternal does looks essentially like the way you implement rasterization in compute, with the coarse/fine raster stages. It looks like I use roughly the same approach, but with hardware rasterization. I'm sure what Doom Eternal does is more efficient because it avoids edge overshading and vertex shader overhead. But I like the simplicity of my approach, the performance is good, and this is a pretty monster patch as it is. We can always switch to software rasterization down the road if we need to.

The regressions picked up by PixelEagle should be fixed by #23083
    /// Settings relating to GPU clustering.
    #[derive(Clone, Copy, Debug)]
    pub struct GlobalClusterGpuSettings {
        /// The initial size of the list of Z slices.

Is this the amount of Z slices, or?

It's the capacity of the Z slice list, so it has to be less than or equal to the total number of Z slices across all clusterable objects. I'll go ahead and rename it to `initial_z_slice_list_capacity` to be consistent with the meaning of "capacity" for `Vec`, etc.
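To make the "capacity" semantics concrete, here is a rough sketch of how a fixed-capacity GPU list might be grown after a CPU readback reports overflow, as the PR description outlines for the clustered object buffer. Everything here is an assumption for illustration: `GrowableGpuList`, its methods, and the power-of-two growth policy are hypothetical and are not taken from the PR.

```rust
/// Hypothetical CPU-side bookkeeping for a fixed-capacity GPU list.
pub struct GrowableGpuList {
    /// Number of elements the GPU buffer can currently hold.
    pub capacity: u32,
}

impl GrowableGpuList {
    /// Starts with a caller-chosen initial capacity (at least one element).
    pub fn with_initial_capacity(initial_capacity: u32) -> Self {
        Self { capacity: initial_capacity.max(1) }
    }

    /// Called when an asynchronous readback reports how many elements the
    /// GPU needed last frame. Returns `true` if the buffer must be
    /// reallocated before a subsequent frame.
    pub fn handle_readback(&mut self, required: u32) -> bool {
        if required <= self.capacity {
            return false;
        }
        // Assumed growth policy: round up to the next power of two.
        self.capacity = required.checked_next_power_of_two().unwrap_or(u32::MAX);
        // Tell the developer so a larger initial capacity can be chosen.
        eprintln!(
            "list overflowed: needed {required} elements, growing capacity to {}",
            self.capacity
        );
        true
    }
}
```

Because the readback is asynchronous, the frame that overflowed has already been rendered by the time the CPU reacts, which is why a large enough initial capacity is the only way to avoid incorrect frames entirely.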
Yeah, the issue with bitsets is that they have a fixed size. Looking at Wicked Engine, both of their approaches have a maximum number of lights per tile, which I don't love. We could fix that by detecting overflow and signaling back to the CPU, of course. In any case, I think any changes here are for a follow-up, as this patch intentionally didn't change the representation of the resulting list just to keep the scope limited.
atlv24 left a comment

I have a bunch of nits I tracked in a local text file, which I will address after this merges, as they are not blocking and do not affect correctness.
I tested orthographic on a couple of examples to ensure the logic was portable, and it all looks fine.
This took a really long time to review and I have some maintainability concerns given how much code it adds, but I am okay with taking on the burden of figuring out better abstractions/improvements later.
Let's get this in!
    // perspective the point at max z but min xy may be less xy in screenspace,
    // and similar. As such, projecting the min and max xy at both the closer
    // and further z and taking the min and max of those projected points
    // addresses this.

I was a bit worried about ortho, but I tested it and read through the logic carefully, and it all seems fine. I'm not sure if there's a cheaper way.

That code was copied line by line, including the comment, from `bevy_light/src/cluster/assign.rs`.
    // Do nothing to the color buffer. We only care about using
    // the rasterizer for fragment scheduling; we're not going
    // to actually paint any pixels.
    load: LoadOp::Clear(Color::BLACK.to_linear().into()),

If we don't do anything to it, why clear it?

`Clear` is actually faster than `Load` on tiled mobile GPUs because on those GPUs you generally have to transfer the data in, render to it, then transfer it back out, and `Clear` lets you skip the first part. Actually, `DontCare` would be the fastest, but in wgpu constructing a `DontCare` is gated behind unsafe, and I didn't want any unsafe code here.

Marking it as Ready For Final Review, pending the example run finishing and looking good, and @mockersf signing off that we don't have so many active regressions on main that we should hold off on merging this.
# Objective

Solari was flickering when DLSS was enabled after #23036.

## Solution

My intuition was that this was a pre-existing system ordering issue, exposed by happenstance due to tweaks in the exact topo-sort being used. That's consistent with both flickering and spooky action at a distance.

After some digging, I found that `solari_lighting`'s system ordering was not consistent with `deferred_lighting`, and was free-floating inside of `Core3dSystems::MainPass`. Ordering this before the `main_opaque_3d_pass` fixed the bug! I think the root ambiguity has something to do with the order in which GPU commands are queued, but that's a bit beyond my expertise.

This bug likely exists without DLSS, and before the linked PR, but is hard to surface due to the topo sort of systems we were typically falling into.

## Testing

`cargo run --example solari --features="bevy_solari https free_camera dlss"`
The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls. PR bevyengine#23481 eliminated the CPU loop over every mesh *instance* in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every *mesh*. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike `MeshInputUniform`s, which are scattered throughout memory and allocated using a CPU-side free list, `MeshUniform`s are indexed by *instance ID*. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out `MeshUniform`s in memory accordingly. This operation is currently performed on the CPU in the `MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set` method and has overhead proportional to the number of separate meshes (not mesh *instances*) in each batch set.

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the *uniform allocation* step has been added. This shader essentially performs a [prefix sum] in order to allocate the `MeshUniform`s corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR bevyengine#23036 added a prefix sum for light clustering. However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step *scan and fan* process rather than the two-step process that PR bevyengine#23036 uses. The scan and fan algorithm works as follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a *fan buffer*.

2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.

3. *Fan*: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).

This patch had to rework the `RenderMultidrawableBatchSet` structure added in PR bevyengine#23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The `proptest`-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR bevyengine#23519) address issue (a), and more use of `SparseBufferVec` (PR bevyengine#23242) will address issue (b).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum
[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
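The three scan-and-fan steps can be illustrated with a sequential CPU simulation. This is only a sketch under stated assumptions: `CHUNK_SIZE`, `hillis_steele_inclusive`, and `scan_and_fan` are hypothetical names, each "workgroup" is simulated by an ordinary loop with double buffering, and the fan buffer is scanned in one piece regardless of its length.

```rust
/// Chunk size, matching the workgroup size of 256 quoted above.
pub const CHUNK_SIZE: usize = 256;

/// Inclusive Hillis-Steele scan, simulated sequentially. Each pass reads
/// from `current` and writes to `next`, mimicking the double buffering a
/// parallel implementation needs between barriers.
pub fn hillis_steele_inclusive(values: &[u32]) -> Vec<u32> {
    let mut current = values.to_vec();
    let mut stride = 1;
    while stride < current.len() {
        let mut next = current.clone();
        for i in stride..current.len() {
            next[i] = current[i] + current[i - stride];
        }
        current = next;
        stride *= 2;
    }
    current
}

/// Exclusive prefix sum of `counts` via scan and fan.
pub fn scan_and_fan(counts: &[u32]) -> Vec<u32> {
    // Step 1 (local allocation): scan each chunk independently and record
    // each chunk's total in the fan buffer.
    let mut offsets = Vec::with_capacity(counts.len());
    let mut fan = Vec::new();
    for chunk in counts.chunks(CHUNK_SIZE) {
        let inclusive = hillis_steele_inclusive(chunk);
        // Convert inclusive sums to exclusive offsets within the chunk.
        offsets.extend(inclusive.iter().zip(chunk).map(|(sum, count)| sum - count));
        fan.push(inclusive.last().copied().unwrap_or(0));
    }

    // Step 2 (global allocation): scan the fan buffer so that each chunk
    // can look up the running total leading into it.
    let fan_inclusive = hillis_steele_inclusive(&fan);

    // Step 3 (fan): add that running total to every element of the chunk.
    for (chunk_index, chunk) in offsets.chunks_mut(CHUNK_SIZE).enumerate() {
        let base = if chunk_index == 0 { 0 } else { fan_inclusive[chunk_index - 1] };
        for offset in chunk {
            *offset += base;
        }
    }
    offsets
}
```

With 256 or fewer inputs there is a single chunk whose base is zero, so steps 2 and 3 change nothing, which mirrors the observation above that small batch sets only need step (1).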
The GPU clustering path (added in bevyengine#23036) never handled `ClusterConfig::None` the way the CPU path does (`assign.rs` clears and skips the view). A `None` view has zero cluster dimensions, so it flowed through unguarded and produced a 0x0 "clustering dummy texture", which led to a `Device::create_texture` validation error and a crash. Nothing in-tree constructs `ClusterConfig::None`, so no test or example exercised it and CI stayed green; a ray-traced camera that sets `None` to skip clustered-forward work is the first real consumer.

Honor `None`, mirroring the CPU path:

- `prepare_cluster_dummy_textures` skips zero-dimension views (no 0x0 texture); `cluster_on_gpu` already skips views lacking a dummy texture.
- `prepare_clusters_for_gpu_clustering` emits an empty `ViewClusterBindings` for `None` views. The mesh view bind group's cluster entries are non-optional, so the view still needs them (0 clustered objects, like the CPU path), but it skips the GPU clustering buffers and passes. The empty binding is built once and reused across frames (the render world retains the component), so it adds no per-frame allocation.
- Add `ViewClusterBindings::is_empty()` for the reuse check.
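The shape of that guard can be sketched as a small decision function. This is a simplified illustration, not the code from the fix: `ClusterDimensions`, `ViewClusterWork`, and `plan_view` are hypothetical names, and the assumption that the dummy texture is sized from the X and Y cluster dimensions is mine.

```rust
/// Hypothetical cluster grid dimensions for one view.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ClusterDimensions {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

/// What the render world should prepare for a view.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ViewClusterWork {
    /// Zero-dimension view: bind empty cluster data, create no dummy
    /// texture, and run no GPU clustering passes.
    EmptyBindingsOnly,
    /// Normal view: create the dummy texture and run the passes.
    Full { dummy_texture_size: (u32, u32) },
}

/// Decides how much clustering work a view needs. A view with any zero
/// dimension must never reach texture creation, since a 0x0 texture
/// fails validation.
pub fn plan_view(dimensions: ClusterDimensions) -> ViewClusterWork {
    if dimensions.x == 0 || dimensions.y == 0 || dimensions.z == 0 {
        ViewClusterWork::EmptyBindingsOnly
    } else {
        ViewClusterWork::Full {
            dummy_texture_size: (dimensions.x, dimensions.y),
        }
    }
}
```

Returning an explicit "empty bindings" variant rather than skipping the view outright reflects the point made above: the bind group entries are non-optional, so even a view with no clusters needs something to bind.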
On the `many_lights` benchmark, with about 8,000 lights visible out of 100,000, this process takes approximately 0.099 ms on my NVIDIA GeForce RTX 4070 Laptop GPU. The AMD Ryzen 9 8945HS CPU, however, takes 2.12 ms to do the same task. The GPU version is therefore a 21x speedup.

- `main`, `assign_objects_to_clusters` time: 2.12 ms
- GPU clustering, GPU time: 0.099 ms

- `main`: 5.71 ms median frame time, 175 FPS
- GPU clustering: 4.88 ms median frame time, 205 FPS
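As a quick sanity check of the figures quoted above, the speedup and FPS values follow directly from the reported times. The helper names below are hypothetical and exist only for this check.

```rust
/// Ratio of baseline time to optimized time.
pub fn speedup(baseline_ms: f64, optimized_ms: f64) -> f64 {
    baseline_ms / optimized_ms
}

/// Frames per second implied by a frame time in milliseconds.
pub fn fps_from_frame_time(frame_time_ms: f64) -> f64 {
    1000.0 / frame_time_ms
}
```

Dividing 2.12 ms by 0.099 ms gives roughly 21.4, consistent with the quoted 21x, and frame times of 5.71 ms and 4.88 ms correspond to about 175 and 205 FPS respectively.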

Alice's PM Note from @kfc35
Fixes #22957 and also fixes #22904.