Cortex-M backend: Add AoT scratch-buffer planning.#19636
Conversation
Add scratch tensors to the operator signatures, which are then assigned exir.memory.alloc. These allocs are automatically memory planned by ExecuTorch. Introduce `required_cmsis_buffer_size`which computes the buffer size from node properties + the Cortex-M configuration. The function uses functions registered by target in backends/cortex_m/passes/scratch_buffer_sizes.py This is used to set the size of the allocs in ConvertToCortexMPass Finally, modify the kernels to use the new scratch tensor instead of allocating temporary memory. Add a new macro CORTEX_M_ENABLE_ASSERT to do a safety check that the aot computed buffer size is equal to the buffer size computed at runtime. Use this when testing. Signed-off-by: Erik Lundell <erik.lundell@arm.com> Change-Id: Ia7ec8eda87833888a0639b480e531fd17818298a
Follow the plan from previous buffer planning work. Signed-off-by: Erik Lundell <erik.lundell@arm.com> Change-Id: I4bf3ca1cc421421b61903cba24856d0fd635d64a
We can now reduce the memory size to 0 when building the cortex_m test runner. Signed-off-by: Erik Lundell <erik.lundell@arm.com> Change-Id: Ieb1292c2db4651cd1f0756aa9d43ecedd5e262e5
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19636
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 2 New FailuresAs of commit 264fc9e with merge base 824cbff ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This is a polished version of #16580 |
| #ifdef CORTEX_M_ENABLE_ASSERTS | ||
| const int32_t runtime_buffer_bytes = | ||
| arm_fully_connected_s8_get_buffer_size(&out_dims); | ||
| if (ctx.size != static_cast<size_t>(runtime_buffer_bytes)) { |
There was a problem hiding this comment.
WDYT about doing ctx.size < runtime_buffer_bytes here? Essentially, if we have more memory than what we actually need, do we still want to error out?
There was a problem hiding this comment.
To me, this is a correctness assert. We don't just want to avoid failure, we want to make sure to ensure correctness.
| ctx.buf = scratch.mutable_data_ptr<int8_t>(); | ||
| } | ||
|
|
||
| #ifdef CORTEX_M_ENABLE_ASSERTS |
There was a problem hiding this comment.
As much as I want to limit the runtime checks, I think it'd be good to have this check always on and non-optional. Without this, we could wind up writing past end of buffers.
Also, naming nit: technically this is not an assert, as it should not crash the program if it fails. Maybe ENABLE_RUNTIME_CHECKS?
There was a problem hiding this comment.
The idea is that we should be confident that we are doing the correct allocation after testing. Users can turn this on to verify for example that they have not mixed up cmsis_nn versions, but then skip it in production. That's also why I want it to be a crash. If there is a mismatch here, I want to enforce a fix. Also, when we have this flag available, we can use it in more places.
| int( | ||
| cmsis_nn.convolve_wrapper_buffer_size( | ||
| backend, | ||
| cmsis_nn.DataType.A8W8, |
There was a problem hiding this comment.
Probably be good to make data type a parameter for all of these.
There was a problem hiding this comment.
Can we keep that for a later patch, when the backend starts supporting more than int8?
| ) | ||
|
|
||
| with node.graph.inserting_before(node): | ||
| scratch = node.graph.call_function( |
There was a problem hiding this comment.
exir.passes.make_alloc_node is the canonical helper for emitting memory.alloc nodes. It's used by to_out_variant and to_scratch_op_pass use internally, and it sets meta["val"] and meta["tensor_meta"] on the alloc. The memory planner keys off meta["val"] / meta["tensor_meta"] to build the TensorSpec that drives lifetime analysis and arena placement.
There was a problem hiding this comment.
Thanks, I'll use that
| ) | ||
| return exir_ops.edge.cortex_m.quantized_conv2d.default, new_args | ||
|
|
||
| def _set_scratch_buffer_size(self, node: torch.fx.Node) -> None: |
There was a problem hiding this comment.
IIUC, the current flow creates each scratch alloc with a placeholder shape, wires it into the cortex_m op's args, and then walks back through node.args[-(i+1)] in _set_scratch_buffer_size to mutate the alloc's shape once the size is known. This works, but the two-phase flow has a few drawbacks:
- The alloc -> size pairing is positional and implicit, split across two files. Adding a non-scratch trailing arg to any op signature silently mis-pairs.
- The UNINITIALIZED_ALLOC_ARGS sentinel + later mutation requires a reader to follow two hops to understand the alloc's actual size.
- _set_scratch_buffer_size exists only to undo the placeholder.
All the values required_cmsis_nn_buffer_sizes needs are already available locally in get*_replacement. If the size functions took inputs directly (e.g. an explicit ConvBufferSizeInputs dataclass), you could compute sizes before constructing the cortex_m op and emit allocs at their final size on one pass.
There was a problem hiding this comment.
I see your point, but I would like to keep the current design with the following arguments:
- node already perfectly captures the context needed to compute scratch buffer sizes. Introducing a new dataclass interface creates more boilerplate code with the risk of mismatching args.
- Importantly, node.meta["val"].shape is needed for conv, which requires either executing the node the calculate it (as is done now), or some duplication of logic to compute the shape from its args.
- cortex_m node creation is only done in one place in the pass (L513) , and the initialization happens directly after, so the logic is not too far separated.
- There is a check that the trailing args are allocs, so there is no silent mis-pairing. If alloc sizes were given in the wrong order, the runtime check would catch it.
I have pushed a commit to try to clarify the pattern though.
|
|
||
| return [ | ||
| int( | ||
| cmsis_nn.convolve_wrapper_buffer_size( |
Mainly clarify the uninitialized/intialize alloc pattern. Signed-off-by: Erik Lundell <erik.lundell@arm.com> Change-Id: I062a5048094129be6ed8e9f7eafc096f34132b2f
Signed-off-by: Erik Lundell <erik.lundell@arm.com> Change-Id: I8da2906a5f4cc69d15d033d8e5d1113d8b4afc4e
This is done for conv, depthwise conv, transpose conv, and bmm.
Add scratch tensors to the operator signatures, which are then
assigned exir.memory.alloc. These allocs are automatically memory
planned by ExecuTorch.
Introduce
required_cmsis_buffer_sizewhich computes the buffersize from node properties + the Cortex-M configuration.
The function uses functions registered by target in
backends/cortex_m/passes/scratch_buffer_sizes.py
This is used to set the size of the allocs in ConvertToCortexMPass
Finally, modify the kernels to use the new scratch tensor instead
of allocating temporary memory. Add a new macro CORTEX_M_ENABLE_ASSERT
to do a safety check that the aot computed buffer size is equal to the
buffer size computed at runtime. Use this when testing.
cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell