
Cortex-M backend: Add AoT scratch-buffer planning. #19636

Open
Erik-Lundell wants to merge 5 commits into pytorch:main from Erik-Lundell:cmsis_nn2

@Erik-Lundell
Collaborator

@Erik-Lundell Erik-Lundell commented May 18, 2026

Scratch-buffer planning is added for conv, depthwise conv, transpose conv, and bmm.

Add scratch tensors to the operator signatures, which are then
assigned exir.memory.alloc. These allocs are automatically memory
planned by ExecuTorch.

Introduce required_cmsis_buffer_size, which computes the buffer
size from node properties and the Cortex-M configuration.
The function uses size functions registered by target in
backends/cortex_m/passes/scratch_buffer_sizes.py.
ConvertToCortexMPass uses the result to set the size of the allocs.

Finally, modify the kernels to use the new scratch tensor instead
of allocating temporary memory. Add a new macro, CORTEX_M_ENABLE_ASSERT,
that checks that the AoT-computed buffer size equals the buffer size
computed at runtime. Enable it when testing.
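
As a rough illustration of the "functions registered by target" idea, here is a minimal pure-Python sketch of a size-function registry. Everything in it is an assumption made for illustration: the names `register_size_fn` and `required_buffer_sizes`, the `"toy_conv2d"` key, and the size formula are hypothetical and are not taken from scratch_buffer_sizes.py or from CMSIS-NN.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a target name to a function that returns
# the scratch-buffer sizes (in bytes) an operator needs for that target.
SizeFn = Callable[[dict], List[int]]
_SIZE_FNS: Dict[str, SizeFn] = {}


def register_size_fn(target: str) -> Callable[[SizeFn], SizeFn]:
    """Decorator that registers a size function for one target."""

    def decorator(fn: SizeFn) -> SizeFn:
        if target in _SIZE_FNS:
            raise ValueError(f"size function already registered for {target!r}")
        _SIZE_FNS[target] = fn
        return fn

    return decorator


def required_buffer_sizes(target: str, node_props: dict) -> List[int]:
    """Look up the size function for `target` and apply it to the node properties."""
    if target not in _SIZE_FNS:
        raise NotImplementedError(f"no scratch-buffer size function for {target!r}")
    return _SIZE_FNS[target](node_props)


@register_size_fn("toy_conv2d")
def _toy_conv2d_sizes(props: dict) -> List[int]:
    # Placeholder formula for illustration only; not the CMSIS-NN formula.
    return [2 * props["in_channels"] * props["kernel_h"] * props["kernel_w"]]


sizes = required_buffer_sizes(
    "toy_conv2d", {"in_channels": 3, "kernel_h": 3, "kernel_w": 3}
)
print(sizes)
```

Keeping the lookup in one function means a missing registration fails loudly at export time rather than producing an alloc of the wrong size.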

cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

mansnils and others added 3 commits May 18, 2026 08:09
Add scratch tensors to the operator signatures, which are then
assigned exir.memory.alloc. These allocs are automatically memory
planned by ExecuTorch.

Introduce `required_cmsis_buffer_size`, which computes the buffer
size from node properties + the Cortex-M configuration.
The function uses functions registered by target in
backends/cortex_m/passes/scratch_buffer_sizes.py
This is used to set the size of the allocs in ConvertToCortexMPass

Finally, modify the kernels to use the new scratch tensor instead
of allocating temporary memory. Add a new macro CORTEX_M_ENABLE_ASSERT
to do a safety check that the aot computed buffer size is equal to the
buffer size computed at runtime. Use this when testing.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: Ia7ec8eda87833888a0639b480e531fd17818298a
Follow the plan from previous buffer planning
work.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: I4bf3ca1cc421421b61903cba24856d0fd635d64a
We can now reduce the memory size to 0 when building the
cortex_m test runner.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: Ieb1292c2db4651cd1f0756aa9d43ecedd5e262e5
@pytorch-bot

pytorch-bot Bot commented May 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19636

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures

As of commit 264fc9e with merge base 824cbff (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the "CLA Signed" label May 18, 2026
@github-actions github-actions Bot added the "ciflow/trunk" and "module: arm" labels May 18, 2026
@Erik-Lundell Erik-Lundell added the "enhancement", "module: microcontrollers", "ciflow/trunk", and "partner: arm" labels and removed the "ciflow/trunk" and "module: arm" labels May 18, 2026
@Erik-Lundell
Collaborator Author

This is a polished version of #16580

@Erik-Lundell Erik-Lundell added the "release notes: ops & kernels" label May 18, 2026
#ifdef CORTEX_M_ENABLE_ASSERTS
  const int32_t runtime_buffer_bytes =
      arm_fully_connected_s8_get_buffer_size(&out_dims);
  if (ctx.size != static_cast<size_t>(runtime_buffer_bytes)) {
Contributor

WDYT about doing ctx.size < runtime_buffer_bytes here? Essentially, if we have more memory than what we actually need, do we still want to error out?

Collaborator Author

To me, this is a correctness assert. We don't just want to avoid failure; we want to ensure correctness.
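
The two positions in this thread can be contrasted with a small sketch, written in Python for brevity even though the kernel check itself is C++. The function names are hypothetical: `check_exact` mirrors the `!=` comparison in the diff, and `check_sufficient` mirrors the suggested `ctx.size < runtime_buffer_bytes` error condition.

```python
def check_exact(planned_bytes: int, runtime_bytes: int) -> bool:
    """Correctness check: the AoT-planned size must match the runtime size exactly."""
    return planned_bytes == runtime_bytes


def check_sufficient(planned_bytes: int, runtime_bytes: int) -> bool:
    """Safety check: the planned buffer only has to be large enough."""
    return planned_bytes >= runtime_bytes


# An over-allocated buffer is safe to use, but it shows that the AoT
# computation and the runtime computation disagree.
over_allocated = (128, 96)
print(check_exact(*over_allocated), check_sufficient(*over_allocated))

# An under-allocated buffer fails both checks.
under_allocated = (64, 96)
print(check_exact(*under_allocated), check_sufficient(*under_allocated))
```

The exact comparison treats any disagreement as a bug in the size computation, while the weaker comparison only guards against writing past the end of the buffer.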

ctx.buf = scratch.mutable_data_ptr<int8_t>();
}

#ifdef CORTEX_M_ENABLE_ASSERTS
Contributor

As much as I want to limit the runtime checks, I think it'd be good to have this check always on and non-optional. Without it, we could wind up writing past the end of buffers.

Also, naming nit: technically this is not an assert, as it should not crash the program if it fails. Maybe ENABLE_RUNTIME_CHECKS?

Collaborator Author

The idea is that, after testing, we should be confident that the allocation is correct. Users can turn this on to verify, for example, that they have not mixed up cmsis_nn versions, and then skip it in production. That's also why I want it to be a crash: if there is a mismatch here, I want to enforce a fix. Also, once this flag is available, we can use it in more places.

int(
cmsis_nn.convolve_wrapper_buffer_size(
backend,
cmsis_nn.DataType.A8W8,
Contributor

It would probably be good to make the data type a parameter for all of these.

Collaborator Author

Can we keep that for a later patch, when the backend starts supporting more than int8?

)

with node.graph.inserting_before(node):
    scratch = node.graph.call_function(
Contributor

exir.passes.make_alloc_node is the canonical helper for emitting memory.alloc nodes. It's used internally by to_out_variant and to_scratch_op_pass, and it sets meta["val"] and meta["tensor_meta"] on the alloc. The memory planner keys off meta["val"] / meta["tensor_meta"] to build the TensorSpec that drives lifetime analysis and arena placement.
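
To show why the planner needs a size and a lifetime for every alloc, here is a toy first-fit arena planner. It is a sketch under stated assumptions, not ExecuTorch's planning algorithm: `Spec`, `plan`, and the step-based lifetimes are hypothetical stand-ins for what a TensorSpec would carry.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for the size and lifetime a planner needs."""
    name: str
    size: int       # bytes
    first_use: int  # index of the first step that touches the buffer
    last_use: int   # index of the last step that touches the buffer


def plan(specs: List[Spec]) -> Tuple[Dict[str, int], int]:
    """Greedy first-fit placement; returns (offsets by name, arena size)."""
    placed: List[Tuple[Spec, int]] = []
    offsets: Dict[str, int] = {}
    for spec in sorted(specs, key=lambda s: s.size, reverse=True):
        # Byte ranges already taken by buffers whose lifetimes overlap this one.
        busy = sorted(
            (off, off + other.size)
            for other, off in placed
            if not (other.last_use < spec.first_use or spec.last_use < other.first_use)
        )
        offset = 0
        for start, end in busy:
            if offset + spec.size <= start:
                break  # the gap before this range is large enough
            offset = max(offset, end)
        placed.append((spec, offset))
        offsets[spec.name] = offset
    arena = max((off + s.size for s, off in placed), default=0)
    return offsets, arena


offsets, arena = plan([
    Spec("activation", 100, 0, 4),
    Spec("conv1_scratch", 64, 1, 1),
    Spec("conv2_scratch", 32, 3, 3),
])
print(offsets, arena)
```

In this toy run the two scratch buffers are live at different steps, so they end up at the same offset; an alloc without size or lifetime information could not be shared this way.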

Collaborator Author

Thanks, I'll use that

)
return exir_ops.edge.cortex_m.quantized_conv2d.default, new_args

def _set_scratch_buffer_size(self, node: torch.fx.Node) -> None:
Contributor

IIUC, the current flow creates each scratch alloc with a placeholder shape, wires it into the cortex_m op's args, and then walks back through node.args[-(i+1)] in _set_scratch_buffer_size to mutate the alloc's shape once the size is known. This works, but the two-phase flow has a few drawbacks:

  • The alloc -> size pairing is positional and implicit, split across two files. Adding a non-scratch trailing arg to any op signature silently mis-pairs.
  • The UNINITIALIZED_ALLOC_ARGS sentinel + later mutation requires a reader to follow two hops to understand the alloc's actual size.
  • _set_scratch_buffer_size exists only to undo the placeholder.

All the values required_cmsis_nn_buffer_sizes needs are already available locally in get*_replacement. If the size functions took their inputs directly (e.g. an explicit ConvBufferSizeInputs dataclass), you could compute the sizes before constructing the cortex_m op and emit the allocs at their final size in one pass.
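
A minimal sketch of the single-pass shape suggested here might look as follows. Only the name `ConvBufferSizeInputs` comes from the comment above; its fields, the helper functions, the dict-based stand-ins for graph nodes, and the size formula are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConvBufferSizeInputs:
    """Hypothetical fields; the real set depends on what the size functions need."""
    in_channels: int
    kernel_h: int
    kernel_w: int
    out_h: int
    out_w: int


def conv_scratch_sizes(inputs: ConvBufferSizeInputs) -> List[int]:
    # Placeholder formula for illustration only; not the CMSIS-NN formula.
    return [2 * inputs.in_channels * inputs.kernel_h * inputs.kernel_w]


def build_conv_op(inputs: ConvBufferSizeInputs) -> dict:
    """Compute the sizes first, then emit the allocs at their final size."""
    allocs = [{"op": "alloc", "num_bytes": n} for n in conv_scratch_sizes(inputs)]
    return {"op": "quantized_conv2d", "scratch": allocs}


op = build_conv_op(ConvBufferSizeInputs(8, 3, 3, 16, 16))
print(op["scratch"])
```

The trade-off is that the caller has to populate the output shape fields itself, which is the duplication concern raised in the reply below.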

Collaborator Author

I see your point, but I would like to keep the current design, for the following reasons:

  • node already captures all the context needed to compute scratch buffer sizes. Introducing a new dataclass interface creates more boilerplate code, with the risk of mismatching args.
  • Importantly, node.meta["val"].shape is needed for conv, which requires either executing the node to calculate it (as is done now) or duplicating some logic to compute the shape from its args.
  • cortex_m node creation is only done in one place in the pass (L513), and the initialization happens directly after, so the logic is not too far separated.
  • There is a check that the trailing args are allocs, so there is no silent mis-pairing. If alloc sizes were given in the wrong order, the runtime check would catch it.

I have pushed a commit to try to clarify the pattern though.
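
For readers following the thread, the two-phase pattern defended above can be sketched with a toy node type. This is an assumption-laden illustration: `ToyNode`, `make_op`, and `set_scratch_sizes` are hypothetical and do not reproduce torch.fx.Node or the code in ConvertToCortexMPass.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    """Stand-in for a graph node; not torch.fx.Node."""
    kind: str
    args: list = field(default_factory=list)
    num_bytes: Optional[int] = None  # None marks an uninitialized alloc


def make_op(kind: str, inputs: list, num_scratch: int) -> ToyNode:
    # Phase 1: append placeholder allocs as the trailing args of the op.
    allocs = [ToyNode("alloc") for _ in range(num_scratch)]
    return ToyNode(kind, [*inputs, *allocs])


def set_scratch_sizes(op: ToyNode, sizes: List[int]) -> None:
    # Phase 2: fill in the sizes, refusing anything that is not a pending alloc.
    if len(sizes) > len(op.args):
        raise ValueError("more sizes than args")
    trailing = op.args[len(op.args) - len(sizes):]
    for alloc, size in zip(trailing, sizes):
        if not (isinstance(alloc, ToyNode) and alloc.kind == "alloc"):
            raise TypeError(f"expected a trailing alloc, got {alloc!r}")
        if alloc.num_bytes is not None:
            raise ValueError("alloc is already initialized")
        alloc.num_bytes = size


op = make_op("quantized_conv2d", ["x", "weight"], num_scratch=1)
set_scratch_sizes(op, [144])
print(op.args[-1].num_bytes)
```

The type check in phase 2 is what turns a mis-paired trailing argument into an immediate error instead of a silently wrong alloc size.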


return [
int(
cmsis_nn.convolve_wrapper_buffer_size(
Contributor

Nice!

Mainly clarify the uninitialized/initialized
alloc pattern.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: I062a5048094129be6ed8e9f7eafc096f34132b2f
@github-actions github-actions Bot added the "module: arm" label May 19, 2026
Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: I8da2906a5f4cc69d15d033d8e5d1113d8b4afc4e

Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. enhancement Not as big of a feature, but technically not a bug. Should be easy to fix module: arm Issues related to arm backend module: microcontrollers For embedded MCUs like Cortex-M, or RTOS like Zephyr, does not track NPU backend like Arm Ethos. partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm release notes: ops & kernels Changes to the opset and any new / changed kernel implementations

4 participants