PoC Cortex_m backend: Add support for CMSIS-NN scratch buffers by mansnils · Pull Request #16580 · pytorch/executorch

mansnils · 2026-01-14T12:43:06Z

Use exir.memory.alloc for CMSIS-NN scratch buffers, which is ideal since it has a TensorSpec and gets memory planned but creates no additional operator overhead.
Use CMSIS-NN pybind wrapper to get correct buffer size.
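
To illustrate why memory-planned scratch buffers need not grow the arena, here is a toy lifetime-based planner. It is only a sketch: it is not ExecuTorch's memory planner and does not call exir.memory.alloc. The `Buf` class, the greedy strategy, and every size and lifetime below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buf:
    name: str
    size: int   # bytes
    first: int  # index of the first op that touches the buffer
    last: int   # index of the last op that touches the buffer


def live_together(a, b):
    """True if the two lifetimes overlap, so the buffers cannot share memory."""
    return a.first <= b.last and b.first <= a.last


def plan(bufs):
    """Greedy first-fit: largest buffers first, lowest non-clashing offset."""
    placed = []  # list of (offset, Buf)
    for buf in sorted(bufs, key=lambda b: -b.size):
        offset = 0
        for other_off, other in sorted(placed, key=lambda p: p[0]):
            clash = offset < other_off + other.size and other_off < offset + buf.size
            if live_together(buf, other) and clash:
                offset = other_off + other.size  # bump past the live neighbour
        placed.append((offset, buf))
    arena = max(off + b.size for off, b in placed)
    return {b.name: off for off, b in placed}, arena


# Hypothetical graph: conv0, conv1, add, conv2 (op indices 0..3).
tensors = [
    Buf("x", 1024, 0, 0),
    Buf("a", 1024, 0, 2),
    Buf("b", 1024, 1, 2),
    Buf("c", 1024, 2, 3),
    Buf("y", 512, 3, 3),
]
# One scratch buffer per conv, live only while that conv runs.
scratch = [Buf("s0", 256, 0, 0), Buf("s1", 256, 1, 1), Buf("s3", 256, 3, 3)]

_, arena_plain = plan(tensors)
offsets, arena_scratch = plan(tensors + scratch)
print(arena_plain, arena_scratch)
```

In this made-up graph the peak is set by the add, so each short-lived scratch buffer lands in memory that is idle while its conv runs and the arena size does not change.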

Improves: #16041

Comparing memory consumption with and without the patch for the conv2d_x3 unit test case. Without patch:

I [executorch:arm_executor_runner.cpp:842 log_mem_status()] model_pte_program_size: 10208 bytes.
I [executorch:arm_executor_runner.cpp:843 log_mem_status()] model_pte_loaded_size: 10208 bytes.
I [executorch:arm_executor_runner.cpp:848 log_mem_status()] input_file_allocator_used: 10976 / 62914560 free: 62903584 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:860 log_mem_status()] method_allocator_used: 5433 / 62914560 free: 62909127 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:867 log_mem_status()] method_allocator_planned: 2560 bytes
I [executorch:arm_executor_runner.cpp:871 log_mem_status()] method_allocator_loaded: 2857 bytes
I [executorch:arm_executor_runner.cpp:875 log_mem_status()] method_allocator_input: 16 bytes
I [executorch:arm_executor_runner.cpp:876 log_mem_status()] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:879 log_mem_status()] temp_allocator: 2097152

With patch:
I [executorch:arm_executor_runner.cpp:846 log_mem_status()] model_pte_program_size: 10336 bytes.
I [executorch:arm_executor_runner.cpp:847 log_mem_status()] model_pte_loaded_size: 10336 bytes.
I [executorch:arm_executor_runner.cpp:852 log_mem_status()] input_file_allocator_used: 11104 / 62914560 free: 62903456 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:864 log_mem_status()] method_allocator_used: 6414 / 62914560 free: 62908146 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:871 log_mem_status()] method_allocator_planned: 2560 bytes
I [executorch:arm_executor_runner.cpp:875 log_mem_status()] method_allocator_loaded: 3838 bytes
I [executorch:arm_executor_runner.cpp:879 log_mem_status()] method_allocator_input: 16 bytes
I [executorch:arm_executor_runner.cpp:880 log_mem_status()] method_allocator_executor: 0 bytes

Summary:

With the patch, the big temp_allocator (2097152 bytes) is no longer used for scratch, except by Linear/FC, but this is a PoC/draft PR anyway.
method_allocator_planned is 2560 bytes in both cases => This means there is 100% reuse of the scratch buffers.
Model size and method_allocator_loaded both increase => This can be explained by the increased number of planned objects that describe scratch reuse (more metadata). This metadata should stay the same size if scratch buffer sizes increase, so I think this is acceptable.
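
The deltas behind this summary can be read straight off the two logs. A small sketch that computes them from the values logged above (the parsing helper is illustrative, not part of the PR):

```python
import re

# Values copied from the two log_mem_status() dumps above.
WITHOUT_PATCH = """\
model_pte_program_size: 10208 bytes.
input_file_allocator_used: 10976
method_allocator_used: 5433
method_allocator_planned: 2560 bytes
method_allocator_loaded: 2857 bytes
"""

WITH_PATCH = """\
model_pte_program_size: 10336 bytes.
input_file_allocator_used: 11104
method_allocator_used: 6414
method_allocator_planned: 2560 bytes
method_allocator_loaded: 3838 bytes
"""


def parse(log):
    """Map each 'name: value' pair in the log text to an int."""
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", log)}


before, after = parse(WITHOUT_PATCH), parse(WITH_PATCH)
delta = {name: after[name] - before[name] for name in before}

for name, diff in delta.items():
    print(f"{name}: {diff:+d} bytes")
```

The planned size is unchanged, the program grows by 128 bytes, and the loaded and used allocator figures both grow by 981 bytes, against the 2097152-byte temp allocator that the scratch buffers previously came from.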

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @Sebastian-Larsson @robell

Use exir.memory.alloc for CMSIS-NN scratch buffers, which is ideal since it has a TensorSpec and gets memory planned but creates no additional operator overhead. Use CMSIS-NN pybind wrapper to get correct buffer size. Change-Id: Ia7ec8eda87833888a0639b480e531fd17818298a

pytorch-bot · 2026-01-14T12:43:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16580

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Awaiting Approval, 2 New Failures, 1 Unrelated Failure

As of commit c197701 with merge base 8e8d97e:

AWAITING APPROVAL - The following workflow needs approval before CI can run:

periodic (gh)

NEW FAILURES - The following jobs have failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for backends/cortex_m/passes/convert_to_cortex_m_pass.py:
pull / unittest-arm-backend-with-no-deps (test_run_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t c982d77b54332529a967f59bdf7011371dff5ef95487c26a93f37f58c5a468ea /exec failed with exit code 1

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / run-emulator (gh) (trunk failure)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-01-14T12:44:01Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

rascani · 2026-01-14T18:57:28Z

Thanks for putting this together @mansnils. I think the general approach makes sense to me, but I have a couple additional questions:

1. The buffer sizes required will likely need to vary based on the specific cortex-m configuration (ie, dsp and/or mve or neither). I don't think we've currently plumbed that sort of knowledge into the AOT stack, but it is appealing (albeit making the model more specialized). In the interim, were you thinking of always allocating based on the worst case size or only focus on MVE optimizations for now?

2. Another potential risk is drift between the CMSIS-NN version the model was compiled against vs the CMSIS-NN version the runtime uses. Model artifacts tend to have long lifetimes. Since we are also doing a runtime check that the compiled scratch buffer size can still fit, I think we're generally okay here as well. The only downside is that working models could start failing if the runtime is upgraded to a new version that wants more memory than the model was compiled with.

rascani · 2026-01-14T18:25:27Z

 import torch
 import torch.fx
+
+from cmsisnn_sizes import convolve_wrapper_s8_buffer_size_mve


Where does cmsisnn_sizes come from? I don't see it in the upstream repo, but from the summary it sounds like this is a pybinding around the underlying get_buffer_size functions from here.

If so, how portable are those get_buffer_size functions? We'll obviously be running those functions on x86 and 64-bit archs now. I would imagine the sizeof() usage would be the riskiest, but could be okay if we always stick to fixed width integers.

Maybe it would be good to have some compile time checks that verify certain shapes result in certain buffer sizes?

mansnils · 2026-01-15

cmsisnn_sizes is just a local change for now and is not yet upstreamed. So if/when we commit to this approach, we need official pybinding support in CMSIS-NN.

The actual underlying get_buffer_size function in this case is this one: https://github.com/ARM-software/CMSIS-NN/blob/fc08374121353a41076389de7007f49a487bbdb6/Include/arm_nnfunctions.h#L216
So it is actually intended to be compiled on the host, and we have an equivalent for DSP. We only have host-friendly variants for the get_buffer_size functions, which is fine since that is what we need here.
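
The golden-value check suggested in this thread could look roughly like the sketch below. The sizing function here is a stand-in written for this example, not the CMSIS-NN formula, and its signature is an assumption; in a real test it would be replaced by the pybinding call (for example convolve_wrapper_s8_buffer_size_mve, whose signature is not shown in this thread). The golden numbers are likewise only valid for the stand-in.

```python
def stand_in_conv_scratch_bytes(input_ch, kernel_h, kernel_w):
    """Hypothetical sizing function. NOT the CMSIS-NN implementation."""
    col_length = input_ch * kernel_h * kernel_w
    col_length = (col_length + 7) // 8 * 8  # round up to a multiple of 8
    return 4 * col_length


# (input_ch, kernel_h, kernel_w) -> expected scratch size in bytes.
# These values would be recorded once on a known-good build and then
# checked on every host platform the AOT flow runs on.
GOLDEN = {
    (3, 3, 3): 128,
    (16, 3, 3): 576,
    (8, 1, 1): 32,
}


def check_golden(size_fn, golden):
    """Return a list of (shape, expected, actual) for every mismatch."""
    mismatches = []
    for shape, expected in golden.items():
        actual = size_fn(*shape)
        if actual != expected:
            mismatches.append((shape, expected, actual))
    return mismatches


mismatches = check_golden(stand_in_conv_scratch_bytes, GOLDEN)
print("mismatches:", mismatches)
```

A table like this would catch host-dependent behaviour, such as a sizeof() that differs between a 64-bit host and the 32-bit target, because the same shapes must give the same sizes everywhere.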

mansnils · 2026-01-15T12:52:11Z

> Thanks for putting this together @mansnils. I think the general approach makes sense to me, but I have a couple additional questions:
>
> 1. The buffer sizes required will likely need to vary based on the specific cortex-m configuration (ie, dsp and/or mve or neither). I don't think we've currently plumbed that sort of knowledge into the AOT stack, but it is appealing (albeit making the model more specialized). In the interim, were you thinking of always allocating based on the worst case size or only focus on MVE optimizations for now?
>
> 2. Another potential risk is drift between the CMSIS-NN version the model was compiled against vs the CMSIS-NN version the runtime uses. Model artifacts tend to have long lifetimes. Since we are also doing a runtime check that the compiled scratch buffer size can still fit, I think we're generally okay here as well. The only downside is that working models could start failing if the runtime is upgraded to a new version that wants more memory than the model was compiled with.

1. A very good question. The interim plan is to focus on MVE for now. However, as you say, this approach needs a way to tell the AOT stack what kind of processor we are targeting: scalar, DSP or MVE based.
2. Also a good question. To solve this, I imagine we carefully document the relation between the actual CMSIS-NN release version and the CMSIS-NN pybinding version, so that a user would be responsible for installing the right pip package version for the given runtime version (probably the same version). We could also have runtime checks, but it would be nice not to rely on them since they affect memory footprint (maybe just in debug builds).

rascani · 2026-01-15T19:33:07Z

> A very good question. The interim plan is to focus on MVE for now. However, as you say, this approach needs a way to tell the AOT stack what kind of processor we are targeting: scalar, DSP or MVE based.

SGTM, I'll prioritize looking at how to plumb those details through the AOT stack. Let me know if you or @AdrianLundell already had something in mind.

> Also a good question. To solve this, I imagine we carefully document the relation between the actual CMSIS-NN release version and the CMSIS-NN pybinding version, so that a user would be responsible for installing the right pip package version for the given runtime version (probably the same version). We could also have runtime checks, but it would be nice not to rely on them since they affect memory footprint (maybe just in debug builds).

I think it makes sense to have some version warning checks where we can. My biggest concern is the runtime compatibility check, because model artifacts have a tendency to stick around a long time. Since the compatibility is scoped to the buffer size, I do think it's pretty limited risk. The two scenarios seem to be:

* If the runtime wants a bigger buffer than what was memory planned, we reject the model and tell the user to regenerate the model (or downgrade the runtime).
* If the runtime wants a smaller buffer, we can still run the model (and maybe emit a debug log that a regenerated model could potentially save memory space).

Regardless, I think this is a pretty good plan.
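
The two scenarios above reduce to a single size comparison. A minimal sketch of that decision logic follows; the names `ScratchCheck` and `check_scratch` are hypothetical, and the real check would live in the C++ runtime rather than in Python.

```python
from enum import Enum


class ScratchCheck(Enum):
    OK = "ok"                # sizes match exactly
    OVERSIZED = "oversized"  # model planned more than the runtime needs
    REJECT = "reject"        # model planned less than the runtime needs


def check_scratch(planned_bytes, required_bytes):
    """Compare the AOT-planned scratch size with what the runtime kernel wants."""
    if required_bytes > planned_bytes:
        return (
            ScratchCheck.REJECT,
            f"scratch buffer too small ({planned_bytes} < {required_bytes} bytes): "
            "regenerate the model or downgrade the runtime",
        )
    if required_bytes < planned_bytes:
        return (
            ScratchCheck.OVERSIZED,
            f"scratch buffer larger than needed ({planned_bytes} > {required_bytes} bytes): "
            "a regenerated model could save memory",
        )
    return ScratchCheck.OK, ""


for planned, required in [(512, 512), (512, 384), (512, 640)]:
    status, message = check_scratch(planned, required)
    print(status.name, message)
```

Only the REJECT case has to stop execution; the OVERSIZED message is the kind of hint that could be limited to debug builds to keep the footprint down.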

mansnils · 2026-01-16T12:16:02Z

> > A very good question. The interim plan is to focus on MVE for now. However, as you say, this approach needs a way to tell the AOT stack what kind of processor we are targeting: scalar, DSP or MVE based.
>
> SGTM, I'll prioritize looking at how to plumb those details through the AOT stack. Let me know if you or @AdrianLundell already had something in mind.
>
> > Also a good question. To solve this, I imagine we carefully document the relation between the actual CMSIS-NN release version and the CMSIS-NN pybinding version, so that a user would be responsible for installing the right pip package version for the given runtime version (probably the same version). We could also have runtime checks, but it would be nice not to rely on them since they affect memory footprint (maybe just in debug builds).
>
> I think it makes sense to have some version warning checks where we can. My biggest concern is the runtime compatibility check, because model artifacts have a tendency to stick around a long time. Since the compatibility is scoped to the buffer size, I do think it's pretty limited risk. The two scenarios seem to be:
>
> * If the runtime wants a bigger buffer than what was memory planned, we reject the model and tell the user to regenerate the model (or downgrade the runtime).
> * If the runtime wants a smaller buffer, we can still run the model (and maybe emit a debug log that a regenerated model could potentially save memory space).
>
> Regardless, I think this is a pretty good plan.

SGTM, we will proceed with CMSIS-NN pybindings then.

github-actions · 2026-03-18T00:57:21Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

rascani · 2026-03-18T01:21:18Z

Any update here @mansnils?

mansnils · 2026-04-09T11:45:33Z

> Any update here @mansnils?

Sorry for the delay. Bindings are now merged: ARM-software/CMSIS-NN@ad95bdf

mansnils · 2026-04-22T09:28:22Z

More updates: #18940

@digantdesai

### Summary

Introduce a CortexMCompileConfig dataclass (cpu + isa) that carries Cortex-M target information from the `--target=cortex-m<variant>+int8` CLI string into CortexMPassManager. The full standard Cortex-M lineup is registered (M0, M0+, M3, M4, M7, M23, M33, M35P, M52, M55, M85), each with a sensible default ISA; the optional-DSP M33/M35P and optional-MVE M52/M55/M85 cases can be expressed via the isa= kwarg.

No pass reads the config yet, so this change is purely plumbing, but it positions both the upcoming AOT scratch-buffer sizing work (#16580) and the CMSIS-NN scalar (#17646) / DSP (#17644) backend support to plug in without re-plumbing the call site.

Actually building for the new variants still requires CortexMTester gains an optional config kwarg, and the Pico 2 MLP example now constructs CortexMPassManager with cpu='cortex-m33' to match the RP2350 hardware it targets.

Authored with Claude.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell
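
A rough sketch of what a cpu + isa config parsed from such a target string could look like is shown below. It is not the CortexMCompileConfig from #18940: the class name, the parser, the lower-case cpu spellings (including "cortex-m0plus") and the default-ISA table are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed defaults, for illustration only; the real table may differ.
DEFAULT_ISA = {
    "cortex-m0": "scalar",
    "cortex-m0plus": "scalar",
    "cortex-m3": "scalar",
    "cortex-m23": "scalar",
    "cortex-m4": "dsp",
    "cortex-m7": "dsp",
    "cortex-m33": "dsp",
    "cortex-m35p": "dsp",
    "cortex-m52": "mve",
    "cortex-m55": "mve",
    "cortex-m85": "mve",
}
VALID_ISAS = ("scalar", "dsp", "mve")


@dataclass(frozen=True)
class TargetConfigSketch:
    cpu: str
    isa: str


def parse_target(target: str, isa: Optional[str] = None) -> TargetConfigSketch:
    """Split 'cortex-m<variant>+int8' into a cpu name and pick an ISA."""
    cpu, sep, dtype = target.rpartition("+")
    if not sep or dtype != "int8":
        raise ValueError(f"expected 'cortex-m<variant>+int8', got {target!r}")
    if cpu not in DEFAULT_ISA:
        raise ValueError(f"unknown cpu {cpu!r}")
    chosen = isa if isa is not None else DEFAULT_ISA[cpu]
    if chosen not in VALID_ISAS:
        raise ValueError(f"unknown isa {chosen!r}")
    return TargetConfigSketch(cpu=cpu, isa=chosen)


print(parse_target("cortex-m55+int8"))
print(parse_target("cortex-m33+int8", isa="scalar"))
```

With a config like this available at the pass-manager call site, a scratch-sizing pass could select the scalar, DSP or MVE get_buffer_size variant instead of assuming MVE.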

mansnils added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Jan 14, 2026

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 14, 2026

rascani self-requested a review January 14, 2026 16:41

mergennachin requested a review from psiddh January 14, 2026 16:41

rascani requested review from psiddh and removed request for psiddh January 14, 2026 16:42

rascani reviewed Jan 14, 2026

View reviewed changes

github-actions Bot added the Stale PRs inactive for over 60 days label Mar 18, 2026

rascani removed the Stale PRs inactive for over 60 days label Mar 18, 2026

rascani mentioned this pull request May 11, 2026

Cortex-M: Thread target CPU/ISA through the AOT pass manager #19470

Merged

Erik-Lundell mentioned this pull request May 18, 2026

Cortex-M backend: Add AoT scratch-buffer planning. #19636

Open

mansnils wants to merge 1 commit into pytorch:main from mansnils:cmsis_nn
