
PoC Cortex_m backend: Add support for CMSIS-NN scratch buffers#16580

Draft
mansnils wants to merge 1 commit into
pytorch:mainfrom
mansnils:cmsis_nn

@mansnils
Collaborator

@mansnils mansnils commented Jan 14, 2026

Use exir.memory.alloc for CMSIS-NN scratch buffers, which is ideal since it has a TensorSpec and gets memory planned but creates no additional operator overhead.
Use CMSIS-NN pybind wrapper to get correct buffer size.

Improves: #16041

Comparing memory consumption with and without the patch for the conv2d_x3 unit test case. Without patch:

I [executorch:arm_executor_runner.cpp:842 log_mem_status()] model_pte_program_size: 10208 bytes.
I [executorch:arm_executor_runner.cpp:843 log_mem_status()] model_pte_loaded_size: 10208 bytes.
I [executorch:arm_executor_runner.cpp:848 log_mem_status()] input_file_allocator_used: 10976 / 62914560 free: 62903584 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:860 log_mem_status()] method_allocator_used: 5433 / 62914560 free: 62909127 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:867 log_mem_status()] method_allocator_planned: 2560 bytes
I [executorch:arm_executor_runner.cpp:871 log_mem_status()] method_allocator_loaded: 2857 bytes
I [executorch:arm_executor_runner.cpp:875 log_mem_status()] method_allocator_input: 16 bytes
I [executorch:arm_executor_runner.cpp:876 log_mem_status()] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:879 log_mem_status()] temp_allocator: 2097152

With patch:
I [executorch:arm_executor_runner.cpp:846 log_mem_status()] model_pte_program_size: 10336 bytes.
I [executorch:arm_executor_runner.cpp:847 log_mem_status()] model_pte_loaded_size: 10336 bytes.
I [executorch:arm_executor_runner.cpp:852 log_mem_status()] input_file_allocator_used: 11104 / 62914560 free: 62903456 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:864 log_mem_status()] method_allocator_used: 6414 / 62914560 free: 62908146 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:871 log_mem_status()] method_allocator_planned: 2560 bytes
I [executorch:arm_executor_runner.cpp:875 log_mem_status()] method_allocator_loaded: 3838 bytes
I [executorch:arm_executor_runner.cpp:879 log_mem_status()] method_allocator_input: 16 bytes
I [executorch:arm_executor_runner.cpp:880 log_mem_status()] method_allocator_executor: 0 bytes

Summary:

  • The large temp_allocator used for scratch is removed in the patch and no longer used, except for Linear/FC, but this is a PoC/draft PR anyway.
  • method_allocator_planned is 2560 bytes in both cases, which means the scratch buffers are 100% reused.
  • The model size and method_allocator_loaded both increase. This can be explained by the larger number of planned objects that describe scratch reuse (more metadata). This metadata should stay constant if scratch buffer sizes increase, so I think this is acceptable.
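
The deltas behind this summary can be read straight off the two logs. A small sketch that uses only the values logged above (the dictionary keys mirror the log field names, and the variable names are otherwise arbitrary):

```python
# Values copied from the log_mem_status() output above, in bytes.
without_patch = {
    "model_pte_program_size": 10208,
    "input_file_allocator_used": 10976,
    "method_allocator_used": 5433,
    "method_allocator_planned": 2560,
    "method_allocator_loaded": 2857,
}
with_patch = {
    "model_pte_program_size": 10336,
    "input_file_allocator_used": 11104,
    "method_allocator_used": 6414,
    "method_allocator_planned": 2560,
    "method_allocator_loaded": 3838,
}

# Per-field growth introduced by the patch.
deltas = {key: with_patch[key] - without_patch[key] for key in without_patch}

for key, delta in deltas.items():
    print(f"{key}: {delta:+d} bytes")
```

The planned arena is unchanged, the .pte grows by 128 bytes, and the 981-byte growth in method_allocator_used is exactly the growth in method_allocator_loaded.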

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @Sebastian-Larsson @robell

Use exir.memory.alloc for CMSIS-NN scratch buffers, which is ideal
since it has a TensorSpec and gets memory planned but creates no
additional operator overhead.
Use CMSIS-NN pybind wrapper to get correct buffer size.

Change-Id: Ia7ec8eda87833888a0639b480e531fd17818298a
@pytorch-bot

pytorch-bot Bot commented Jan 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16580

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Awaiting Approval, 2 New Failures, 1 Unrelated Failure

As of commit c197701 with merge base 8e8d97e (image):

AWAITING APPROVAL - The following workflow needs approval before CI can run:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mansnils mansnils added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Jan 14, 2026
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 14, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@rascani rascani self-requested a review January 14, 2026 16:41
@mergennachin mergennachin requested a review from psiddh January 14, 2026 16:41
@rascani rascani requested review from psiddh and removed request for psiddh January 14, 2026 16:42
@rascani
Contributor

rascani commented Jan 14, 2026

Thanks for putting this together @mansnils. I think the general approach makes sense to me, but I have a couple additional questions:

  1. The buffer sizes required will likely need to vary based on the specific cortex-m configuration (ie, dsp and/or mve or neither). I don't think we've currently plumbed that sort of knowledge into the AOT stack, but it is appealing (albeit making the model more specialized). In the interim, were you thinking of always allocating based on the worst case size or only focus on MVE optimizations for now?

  2. Another potential risk is drift between the CMSIS-NN version the model was compiled against vs the CMSIS-NN version the runtime uses. Model artifacts tend to have long lifetimes. Since we are also doing a runtime check that the compiled scratch buffer size can still fit, I think we're generally okay here as well. The only downside is that working models could start failing if the runtime is upgraded to a new version that wants more memory than the model was compiled with.

import torch
import torch.fx

from cmsisnn_sizes import convolve_wrapper_s8_buffer_size_mve
Contributor


Where does cmsisnn_sizes come from? I don't see it in the upstream repo, but from the summary it sounds like this is a pybinding around the underlying get_buffer_size functions from here.

If so, how portable are those get_buffer_size functions? We'll obviously be running those functions on x86 and 64-bit archs now. I would imagine the sizeof() usage would be the riskiest, but could be okay if we always stick to fixed width integers.

Maybe it would be good to have some compile time checks that verify certain shapes result in certain buffer sizes?
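
One way such a check could look is a small table of known shapes and expected sizes that is evaluated on the host. This is a minimal sketch only: `scratch_bytes_stub` is a hypothetical stand-in for the real binding, its formula is illustrative and not the CMSIS-NN formula, and the golden values are derived from the stub. Real golden values would have to be recorded from a trusted target build.

```python
import struct

# Fixed-width element size: "<h" is a standard-size int16, 2 bytes on any host.
INT16_BYTES = struct.calcsize("<h")


def scratch_bytes_stub(in_ch: int, kernel_h: int, kernel_w: int) -> int:
    """Hypothetical stand-in for a get_buffer_size binding (illustrative formula)."""
    return 2 * in_ch * kernel_h * kernel_w * INT16_BYTES


# (in_ch, kernel_h, kernel_w) -> expected scratch size in bytes.
# Assumption: these values come from the stub above, not from CMSIS-NN.
GOLDEN = {
    (3, 3, 3): 108,
    (16, 3, 3): 576,
    (8, 1, 1): 32,
}


def check_golden(size_fn, golden):
    """Return a list of (shape, expected, actual) for every mismatch."""
    mismatches = []
    for shape, expected in golden.items():
        actual = size_fn(*shape)
        if actual != expected:
            mismatches.append((shape, expected, actual))
    return mismatches


mismatches = check_golden(scratch_bytes_stub, GOLDEN)
```

An empty `mismatches` list means the host build agrees with the recorded sizes; any entry points at a portability or version difference.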

Collaborator Author


The "cmsisnn_sizes" module is just a local change for now and is not yet upstreamed. So if/when we commit to this approach, we need official pybinding support in CMSIS-NN.

The actual underlying get_buffer_size function in this case is this one: https://github.com/ARM-software/CMSIS-NN/blob/fc08374121353a41076389de7007f49a487bbdb6/Include/arm_nnfunctions.h#L216
So it is actually intended for compiling on the host, and we have an equivalent for DSP. We only have these host-friendly variants for the get_buffer_size functions, which is fine since that is what we need here.

@mansnils
Collaborator Author

  1. A very good question. The interim plan is to focus on MVE for now. However, as you say, for this approach we need a way to tell the AOT stack what kind of processor we are targeting: scalar, DSP, or MVE based.

  2. Also a good question. To solve this, I imagine we carefully document the relation between the actual CMSIS-NN release version and the CMSIS-NN pybinding version, so that the user is responsible for installing the right pip package version for the runtime version (probably the same version). We could also have runtime checks, but it would be nice not to rely on them since they affect memory footprint (maybe just in debug builds).
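
Point 1 could be sketched as a lookup from target ISA to size function, with a worst-case fallback when the target is not known. Everything named here (`Isa`, the three `*_stub` functions and their formulas, `scratch_bytes`) is hypothetical; the real sizes would come from the CMSIS-NN bindings.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Isa(Enum):
    SCALAR = "scalar"
    DSP = "dsp"
    MVE = "mve"


# Hypothetical per-ISA size functions. The formulas are placeholders that
# only illustrate that the required scratch differs per ISA.
def scalar_size_stub(patch_elems: int) -> int:
    return 0


def dsp_size_stub(patch_elems: int) -> int:
    return 4 * patch_elems


def mve_size_stub(patch_elems: int) -> int:
    return 4 * patch_elems + 16


SIZE_FNS: Dict[Isa, Callable[[int], int]] = {
    Isa.SCALAR: scalar_size_stub,
    Isa.DSP: dsp_size_stub,
    Isa.MVE: mve_size_stub,
}


def scratch_bytes(patch_elems: int, isa: Optional[Isa] = None) -> int:
    """Size for a known ISA, or the worst case over all ISAs if unknown."""
    if isa is not None:
        return SIZE_FNS[isa](patch_elems)
    return max(fn(patch_elems) for fn in SIZE_FNS.values())


mve_only = scratch_bytes(27, Isa.MVE)
worst_case = scratch_bytes(27)
```

Passing the ISA keeps the plan tight for one target, while the fallback trades memory for a model that runs on any variant.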

@rascani
Contributor

rascani commented Jan 15, 2026

> 1. A very good question. The interim plan is to focus on MVE for now. However, as you say, for this approach we need a way to tell the AOT stack what kind of processor we are targeting: scalar, DSP, or MVE based.

SGTM, I'll prioritize looking at how to plumb those details through the AOT stack. Let me know if you or @AdrianLundell already had something in mind.

> 2. Also a good question. To solve this, I imagine we carefully document the relation between the actual CMSIS-NN release version and the CMSIS-NN pybinding version, so that the user is responsible for installing the right pip package version for the runtime version (probably the same version). We could also have runtime checks, but it would be nice not to rely on them since they affect memory footprint (maybe just in debug builds).

I think it makes sense to have some version warning checks where we can. My biggest concern is the runtime compatibility check because model artifacts have a tendency to stick around a long time. Since the compatibility is scoped to the buffer size, I do think it's a pretty limited risk. The two scenarios seem to be:

  • If the runtime wants a bigger buffer than what was memory planned, we reject the model and tell the user to regenerate the model (or downgrade the runtime).
  • If the runtime wants a smaller buffer, we can still run the model (and maybe emit a debug log that a regenerated model could potentially save memory space).
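
The two scenarios reduce to a three-way comparison. This is a sketch of the decision logic only, written in Python for brevity; the real check would sit in the C++ runtime, and `Compat` and `check_scratch` are hypothetical names.

```python
from enum import Enum


class Compat(Enum):
    REJECT = "reject"                # runtime needs more than was planned
    RUN = "run"                      # sizes match exactly
    RUN_WITH_HINT = "run_with_hint"  # runtime needs less; log a hint


def check_scratch(planned_bytes: int, required_bytes: int) -> Compat:
    """Compare the AOT-planned scratch size with what the runtime requires."""
    if required_bytes > planned_bytes:
        # The planned buffer is too small: the model must be regenerated
        # (or the runtime downgraded).
        return Compat.REJECT
    if required_bytes < planned_bytes:
        # Still safe to run; a regenerated model could save memory.
        return Compat.RUN_WITH_HINT
    return Compat.RUN


result = check_scratch(planned_bytes=512, required_bytes=640)
```

Here `result` is `Compat.REJECT`, because the runtime asks for more scratch than the model was planned with.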

Regardless, I think this is a pretty good plan.

@mansnils
Collaborator Author

SGTM, we will proceed with CMSIS-NN pybindings then.

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale PRs inactive for over 60 days label Mar 18, 2026
@rascani rascani removed the Stale PRs inactive for over 60 days label Mar 18, 2026
@rascani
Contributor

rascani commented Mar 18, 2026

Any update here @mansnils?

@mansnils
Collaborator Author

mansnils commented Apr 9, 2026

> Any update here @mansnils?

Sorry for the delay. Bindings are now merged: ARM-software/CMSIS-NN@ad95bdf

@mansnils
Collaborator Author

More updates: #18940

rascani added a commit that referenced this pull request May 13, 2026
### Summary
Introduce a CortexMCompileConfig dataclass (cpu + isa) that carries
Cortex-M target information from the --target=cortex-m<variant>+int8 CLI
string into CortexMPassManager. The full standard Cortex-M lineup is
registered (M0, M0+, M3, M4, M7, M23, M33, M35P, M52, M55, M85), each
with a sensible default ISA; the optional-DSP M33/M35P and optional-MVE
M52/M55/M85 cases can be expressed via the isa= kwarg.

No pass reads the config yet, so this change is purely plumbing, but it
positions both the upcoming AOT scratch-buffer sizing work (#16580) and
the CMSIS-NN scalar (#17646) / DSP (#17644) backend support to plug in
without re-plumbing the call site. Actually building for the new
variants still requires

CortexMTester gains an optional config kwarg, and the Pico 2 MLP example
now constructs CortexMPassManager with cpu='cortex-m33' to match the
RP2350 hardware it targets.

Authored with Claude.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell