PoC Cortex_m backend: Add support for CMSIS-NN scratch buffers#16580
mansnils wants to merge 1 commit into
Use exir.memory.alloc for CMSIS-NN scratch buffers, which is ideal since it has a TensorSpec and gets memory planned but creates no additional operator overhead.
Use CMSIS-NN pybind wrapper to get correct buffer size.
Change-Id: Ia7ec8eda87833888a0639b480e531fd17818298a
Thanks for putting this together @mansnils. I think the general approach makes sense to me, but I have a couple additional questions:
import torch
import torch.fx

from cmsisnn_sizes import convolve_wrapper_s8_buffer_size_mve
Where does cmsisnn_sizes come from? I don't see it in the upstream repo, but from the summary it sounds like this is a pybinding around the underlying get_buffer_size functions from here.
If so, how portable are those get_buffer_size functions? We'll obviously be running those functions on x86 and 64-bit archs now. I would imagine the sizeof() usage would be the riskiest, but could be okay if we always stick to fixed width integers.
Maybe it would be good to have some compile time checks that verify certain shapes result in certain buffer sizes?
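One way such a check could look is a small golden-value test that runs on the host. This is only a sketch: `standin_buffer_size`, `ConvShape`, and the golden numbers are hypothetical and do not reproduce the CMSIS-NN formula. In practice the function under test would be the pybind wrapper, and the golden values would be recorded from a trusted target build.

```python
from typing import Callable, Dict, List, NamedTuple


class ConvShape(NamedTuple):
    """Hypothetical subset of the conv parameters that drive the scratch size."""
    in_ch: int
    kernel_h: int
    kernel_w: int


def standin_buffer_size(shape: ConvShape) -> int:
    """Stand-in for a host-side get_buffer_size binding.

    NOT the CMSIS-NN formula: it only models an int16 im2col-style scratch
    buffer so that the check below has something to call.
    """
    int16_bytes = 2  # fixed-width element size, independent of the host ABI
    return 2 * shape.in_ch * shape.kernel_h * shape.kernel_w * int16_bytes


# Illustrative golden values; real ones would come from a target build.
GOLDEN_SIZES: Dict[ConvShape, int] = {
    ConvShape(in_ch=3, kernel_h=3, kernel_w=3): 108,
    ConvShape(in_ch=16, kernel_h=1, kernel_w=1): 64,
}


def check_buffer_sizes(
    size_fn: Callable[[ConvShape], int], golden: Dict[ConvShape, int]
) -> List[str]:
    """Return one message per shape whose computed size differs from golden."""
    mismatches = []
    for shape, expected in golden.items():
        actual = size_fn(shape)
        if actual != expected:
            mismatches.append(f"{shape}: expected {expected}, got {actual}")
    return mismatches
```

An empty result means the host build agrees with the recorded sizes; any mismatch would flag a portability problem such as a `sizeof()` that differs between host and target.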
The "cmsisnn_sizes" module is just a local change for now and is not yet upstreamed. So if/when we commit to this approach, we need official pybinding support in CMSIS-NN.
The actual underlying get_buffer_size function in this case is this one: https://github.com/ARM-software/CMSIS-NN/blob/fc08374121353a41076389de7007f49a487bbdb6/Include/arm_nnfunctions.h#L216
So it is actually intended to be compiled on the host, and we have an equivalent for DSP. We only have host-friendly variants for the get_buffer_size functions, which is fine since those are what we need here.
SGTM, I'll prioritize looking at how to plumb those details through the AOT stack. Let me know if you or @AdrianLundell already had something in mind.
I think it makes sense to have some version warning checks where we can. My biggest concern is the runtime compatibility check, because model artifacts have a tendency to stick around for a long time. Since the compatibility is scoped to the buffer size, I do think it's a pretty limited risk. The two scenarios seem to be:
Regardless, I think this is a pretty good plan.
SGTM, we will proceed with CMSIS-NN pybindings then.
Any update here @mansnils?
Sorry for the delay. Bindings are now merged: ARM-software/CMSIS-NN@ad95bdf
More updates: #18940
### Summary
Introduce a CortexMCompileConfig dataclass (cpu + isa) that carries Cortex-M target information from the --target=cortex-m<variant>+int8 CLI string into CortexMPassManager. The full standard Cortex-M lineup is registered (M0, M0+, M3, M4, M7, M23, M33, M35P, M52, M55, M85), each with a sensible default ISA; the optional-DSP M33/M35P and optional-MVE M52/M55/M85 cases can be expressed via the isa= kwarg.
No pass reads the config yet, so this change is purely plumbing, but it positions both the upcoming AOT scratch-buffer sizing work (#16580) and the CMSIS-NN scalar (#17646) / DSP (#17644) backend support to plug in without re-plumbing the call site.
Actually building for the new variants still requires CortexMTester gains an optional config kwarg, and the Pico 2 MLP example now constructs CortexMPassManager with cpu='cortex-m33' to match the RP2350 hardware it targets.
Authored with Claude.
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell
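For orientation, a config of this shape could be sketched roughly as below. This is not the code from #18940: the ISA names ("scalar", "dsp", "mve"), the default-ISA table, the `cortex-m0plus` spelling (borrowed from GCC's `-mcpu` naming), and the helpers `make_config` and `parse_target` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default ISA per CPU; the real registry may use other names/defaults.
_DEFAULT_ISA = {
    "cortex-m0": "scalar",
    "cortex-m0plus": "scalar",
    "cortex-m3": "scalar",
    "cortex-m4": "dsp",
    "cortex-m7": "dsp",
    "cortex-m23": "scalar",
    "cortex-m33": "dsp",
    "cortex-m35p": "dsp",
    "cortex-m52": "mve",
    "cortex-m55": "mve",
    "cortex-m85": "mve",
}


@dataclass(frozen=True)
class CortexMCompileConfig:
    """Sketch of a (cpu, isa) target description."""
    cpu: str
    isa: str


def make_config(cpu: str, isa: Optional[str] = None) -> CortexMCompileConfig:
    """Build a config, falling back to the assumed default ISA for the CPU."""
    if cpu not in _DEFAULT_ISA:
        raise ValueError(f"unknown Cortex-M CPU: {cpu!r}")
    return CortexMCompileConfig(cpu=cpu, isa=isa or _DEFAULT_ISA[cpu])


def parse_target(target: str, isa: Optional[str] = None) -> CortexMCompileConfig:
    """Parse a 'cortex-m<variant>+int8' string into a config (hypothetical helper)."""
    cpu, sep, dtype = target.partition("+")
    if sep != "+" or dtype != "int8":
        raise ValueError(f"expected '<cpu>+int8', got {target!r}")
    return make_config(cpu, isa)
```

Under these assumptions, `parse_target("cortex-m55+int8")` yields an MVE config, while `parse_target("cortex-m33+int8", isa="scalar")` expresses an M33 built without the optional DSP extension.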
Use exir.memory.alloc for CMSIS-NN scratch buffers, which is ideal since it has a TensorSpec and gets memory planned but creates no additional operator overhead.
Use CMSIS-NN pybind wrapper to get correct buffer size.
Improves: #16041
Memory consumption for the conv2d_x3 unit test case, with and without the patch. Without patch:
I [executorch:arm_executor_runner.cpp:842 log_mem_status()] model_pte_program_size: 10208 bytes.
I [executorch:arm_executor_runner.cpp:843 log_mem_status()] model_pte_loaded_size: 10208 bytes.
I [executorch:arm_executor_runner.cpp:848 log_mem_status()] input_file_allocator_used: 10976 / 62914560 free: 62903584 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:860 log_mem_status()] method_allocator_used: 5433 / 62914560 free: 62909127 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:867 log_mem_status()] method_allocator_planned: 2560 bytes
I [executorch:arm_executor_runner.cpp:871 log_mem_status()] method_allocator_loaded: 2857 bytes
I [executorch:arm_executor_runner.cpp:875 log_mem_status()] method_allocator_input: 16 bytes
I [executorch:arm_executor_runner.cpp:876 log_mem_status()] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:879 log_mem_status()] temp_allocator: 2097152
With patch:
I [executorch:arm_executor_runner.cpp:846 log_mem_status()] model_pte_program_size: 10336 bytes.
I [executorch:arm_executor_runner.cpp:847 log_mem_status()] model_pte_loaded_size: 10336 bytes.
I [executorch:arm_executor_runner.cpp:852 log_mem_status()] input_file_allocator_used: 11104 / 62914560 free: 62903456 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:864 log_mem_status()] method_allocator_used: 6414 / 62914560 free: 62908146 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:871 log_mem_status()] method_allocator_planned: 2560 bytes
I [executorch:arm_executor_runner.cpp:875 log_mem_status()] method_allocator_loaded: 3838 bytes
I [executorch:arm_executor_runner.cpp:879 log_mem_status()] method_allocator_input: 16 bytes
I [executorch:arm_executor_runner.cpp:880 log_mem_status()] method_allocator_executor: 0 bytes
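To make the difference between the two runs easier to read, the per-counter deltas can be computed as in the short sketch below. The numbers are copied from the logs above; the dictionary and function names are only illustrative.

```python
# Values copied from the log_mem_status() output above (bytes).
WITHOUT_PATCH = {
    "model_pte_program_size": 10208,
    "input_file_allocator_used": 10976,
    "method_allocator_used": 5433,
    "method_allocator_planned": 2560,
    "method_allocator_loaded": 2857,
}
WITH_PATCH = {
    "model_pte_program_size": 10336,
    "input_file_allocator_used": 11104,
    "method_allocator_used": 6414,
    "method_allocator_planned": 2560,
    "method_allocator_loaded": 3838,
}


def deltas(before: dict, after: dict) -> dict:
    """Return after - before for every counter present in `before`."""
    return {name: after[name] - before[name] for name in before}


if __name__ == "__main__":
    for name, diff in deltas(WITHOUT_PATCH, WITH_PATCH).items():
        print(f"{name}: {diff:+d} bytes")
```

With these inputs the program size and input file allocator usage each grow by 128 bytes, method_allocator_used and method_allocator_loaded each grow by 981 bytes, and method_allocator_planned is unchanged.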
Summary:
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @Sebastian-Larsson @robell