Arm backend: Add Cortex-M as a first-class target in aot_arm_compiler #17075
psiddh wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17075
Note: Links to docs will display an error until the docs builds have been completed.
❌ 10 New Failures, 3 Cancelled Jobs as of commit 105498e with merge base f30d5ed.
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 39666cd to 7f14a9d
Force-pushed from 1b64ef3 to 41462be
Pull request overview
This PR enables full MobileNetV2 lowering to the CMSIS-NN backend for Cortex-M microcontrollers by implementing comprehensive support for quantized operations through a dedicated compilation path. The changes replace the previous delegation-based approach with a portable kernel-based architecture that converts all quantized operations to cortex_m::* operators.
Changes:
- Added dedicated Cortex-M compilation path (to_edge_cortex_m) in the AOT compiler with CortexMQuantizer-based quantization
- Implemented addmm operator support for decomposed linear layers through the new _get_addmm_replacement method
- Enhanced quantization parameter propagation with the new PropagateQParamsPass and passthrough op handling in FoldAndAnnotateQParamsPass
- Extended quantizer to mark parameter nodes as annotated and added passthrough ops (hardtanh, max_pool2d, dropout)
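The addmm handling above rests on the identity that a decomposed linear layer computes addmm(bias, x, Wᵀ) == linear(x, W, bias). A minimal pure-Python sketch of that identity on toy list-based matrices (illustration only, not the actual FX graph rewrite in _get_addmm_replacement):

```python
def addmm(bias, x, w_t):
    # addmm(bias, x, w_t) = x @ w_t + bias, with w_t the transposed weight
    rows, inner, cols = len(x), len(w_t), len(w_t[0])
    return [[sum(x[i][k] * w_t[k][j] for k in range(inner)) + bias[j]
             for j in range(cols)] for i in range(rows)]

def linear(x, w, bias):
    # linear(x, w, bias) = x @ w.T + bias, with w of shape [out, in]
    return [[sum(xi * wi for xi, wi in zip(row, w_row)) + b
             for w_row, b in zip(w, bias)] for row in x]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, -1.0], [2.0, 0.25]]      # shape [out=2, in=2]
w_t = [[0.5, 2.0], [-1.0, 0.25]]    # transpose of w
bias = [0.5, -0.25]

# The two decompositions agree, which is what lets the pass rewrite
# a decomposed addmm back into a single quantized_linear op.
assert addmm(bias, x, w_t) == linear(x, w, bias)
```

Because the weight reaches addmm through a transpose, the pass also needs the qparam propagation described above to find the weight's quantization parameters.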
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/arm/aot_arm_compiler.py | Adds to_edge_cortex_m function for Cortex-M compilation path using CortexMQuantizer and removes old transform_for_cortex_m_backend function |
| backends/cortex_m/quantizer/quantizer.py | Adds _mark_param_node_as_annotated method and extends passthrough ops list for MobileNetV2 support |
| backends/cortex_m/passes/propagate_qparams_pass.py | New pass to propagate qparams through passthrough ops (transpose/permute) to consumer nodes like addmm |
| backends/cortex_m/passes/cortex_m_pass_manager.py | Adds PropagateQParamsPass and DecomposeAdaptiveAvgPool2dPass to pass list, adds skip_passes parameter to __init__ |
| backends/cortex_m/passes/convert_to_cortex_m_pass.py | Implements _get_addmm_replacement method to convert decomposed linear (addmm) operations to cortex_m.quantized_linear |
| backends/arm/_passes/fold_qdq_with_annotated_qparams_pass.py | Adds passthrough ops (hardtanh, relu, clamp) support and second-pass qparams propagation logic |
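The qparam-propagation idea from the table above (PropagateQParamsPass) can be sketched in plain Python. This is a hypothetical, simplified model: the real pass walks torch.fx graph nodes, while this sketch uses dicts with a `meta` field:

```python
# Shape-only ops that preserve quantization parameters of their input.
PASSTHROUGH_OPS = {"transpose", "permute"}

def propagate_qparams(nodes):
    """Copy qparams through passthrough ops so consumers (e.g. addmm) see them."""
    for node in nodes:
        if node["op"] in PASSTHROUGH_OPS and node["inputs"]:
            src_meta = node["inputs"][0]["meta"]
            if "qparams" in src_meta and "qparams" not in node["meta"]:
                node["meta"]["qparams"] = src_meta["qparams"]
    return nodes

weight = {"op": "placeholder", "inputs": [],
          "meta": {"qparams": {"scale": 0.02, "zero_point": 0}}}
transposed = {"op": "transpose", "inputs": [weight], "meta": {}}
addmm_node = {"op": "addmm", "inputs": [transposed], "meta": {}}

propagate_qparams([weight, transposed, addmm_node])
# The addmm consumer can now read the weight qparams off its transpose input.
assert transposed["meta"]["qparams"]["scale"] == 0.02
```

The design point is that the transpose itself is not quantized; it merely forwards its input's qparams so a later rewrite to cortex_m.quantized_linear can find them.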
Force-pushed from d7d85fb to b222911
Pull request overview
Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.
examples/arm/aot_arm_compiler.py (Outdated)
@@ -396,6 +388,7 @@ def forward(self, x):
     "TOSA-1.0+INT",
     "TOSA-1.0+FP",
     "TOSA-1.0+INT+int16",
+    "cortex-m-int8",
Thanks for adding this. I think the flag might need to be different for different Cortex-M parts, as the memory planning might differ depending on the implementation, so maybe it should be cortex-m55* for now? Also, the other targets use +, so it might be more consistent as "cortex-m55+int8"?
Also, we need to update the two M55 examples (Corstone-300 and NXP N6) in
https://github.com/pytorch/executorch/tree/main/zephyr
to use this new nice flag :) We can maybe do that in a separate PR.
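The '+'-separated convention suggested here (matching existing targets like TOSA-1.0+INT+int16) could be parsed along these lines. parse_target is a hypothetical helper for illustration, not the actual aot_arm_compiler option handling:

```python
def parse_target(target: str):
    """Split a target flag like 'cortex-m55+int8' into (base, features)."""
    base, *features = target.split("+")
    return base, set(features)

# The base names the CPU/spec; everything after '+' selects variants.
assert parse_target("cortex-m55+int8") == ("cortex-m55", {"int8"})
assert parse_target("TOSA-1.0+INT+int16") == ("TOSA-1.0", {"INT", "int16"})
```

Keeping the CPU in the base (cortex-m55 rather than a generic cortex-m) leaves room for per-core memory-planning differences, which is the concern raised above.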
Updated the flag, and will update the M55 examples in a follow-up PR.
Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.
Are you referring to the CI failure on this PR: https://github.com/pytorch/executorch/actions/runs/22315191858/job/64557957144?pr=17075? If so, it seems unrelated; it is failing due to the size check.
Looked at the failing test-arm-backend-* jobs; they look unrelated to me. I will keep a close eye on the CI post-landing. The failed job for Zephyr:
I agree, I also think they are unrelated. Usually I would just rebase the PR to retrigger a rerun on the latest main to make sure, but I think it might mess up your internal-external Meta PR sync. That's why I just added a comment :) I can try to trigger a re-test without a rebase; that might work.
Rebased to HEAD.
Pull request overview
Copilot reviewed 1 out of 2 changed files in this pull request and generated no new comments.
The PTE size check still seems to fail, and the file might actually be bigger; maybe the strings are longer, or some other metadata is bigger. It seems the number in the test needs to be bumped to not fail CI.
    exported_program, args, model, example_inputs
)

# Cortex-m ops are never included in vgf or direct-drive
I wonder if this removal increased the size
Previously, Cortex-M op conversion was applied as an afterthought to all non-vgf targets via transform_for_cortex_m_backend(). This made the flow hard to follow, used a bare EdgeCompileConfig that decomposed ops like linear into addmm (requiring unnecessary workarounds), and didn't use the CortexMQuantizer or CortexMPassManager.

Add a dedicated to_edge_cortex_m() path selected via --target=cortex-m that owns the full pipeline: CortexMQuantizer for INT8 quantization, correct EdgeCompileConfig with preserve_ops to prevent premature decomposition, and CortexMPassManager.pass_list for op conversion. Remove the old scattered transform_for_cortex_m_backend() function.

Verified all ops fully lowered to cortex_m::quantized_* operators for both MobileNetV2 (70 nodes) and MobileNetV3 (122 nodes). E2E inference tested on Alif E8 board.

Test Plan:
- python3 -m examples.arm.aot_arm_compiler -m mv2 --target=cortex-m55+int8 --quantize --intermediates=./mv2_intermediates --output=./mv2_cortex_m.pte
- python3 -m examples.arm.aot_arm_compiler -m mv3 --target=cortex-m55+int8 --quantize --intermediates=./mv3_intermediates --output=./mv3_cortex_m.pte
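For context on what a cortex_m::quantized_* operator computes, the core of an int8 quantized linear is a requantization of the int32 accumulator back into the int8 output grid. The sketch below uses generic names and floating-point scales for clarity; real CMSIS-NN kernels use fixed-point multiplier/shift pairs instead, so this is illustrative, not the kernel's actual signature:

```python
def requantize(acc: int, in_scale: float, w_scale: float,
               out_scale: float, out_zp: int) -> int:
    """Map an int32 matmul accumulator onto the int8 output quantization grid."""
    real = acc * in_scale * w_scale          # accumulator -> real-valued result
    q = round(real / out_scale) + out_zp     # rescale into output units, add zero point
    return max(-128, min(127, q))            # saturate to int8 range

# An accumulator of 1000 with 0.02/0.01 input/weight scales is 0.2 in real
# terms; with out_scale=0.1 and zero_point=3 that lands on 2 + 3 = 5.
assert requantize(1000, 0.02, 0.01, 0.1, 3) == 5
```

The saturation step is why out-of-range activations clamp to -128/127 rather than wrapping, matching int8 semantics.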
Hi, there also seems to be a real problem with the ZephyrOS tests: I see this in the logs in 2 of them. As the Corstone FVP does not quit on these errors, the jobs time out instead. So maybe something needs to be added to the cmake/link rules now for cortex-m? In the last failed Zephyr job https://github.com/pytorch/executorch/actions/runs/22425903988/job/64933843797?pr=17075 I see e.g. a timeout like the other 2, BUT in a strange place. I suspect it could be the same, but the logs are not flushed. Let's see if this also goes away with the same fix.
Sorry that I was wrong and maybe caused false hope, more work, and unneeded delays for this PR.
I think with the Zephyr test failures (thinking aloud here),
…pass flag

Summary: Remove the transform_for_cortex_m_backend() function and the --enable_qdq_fusion_pass CLI flag from aot_arm_compiler.py. The function applied Cortex-M passes as a post-hoc step to all non-VGF targets, which made the compilation flow hard to follow and coupled the delegation path to Cortex-M-specific logic.

Instead, ReplaceQuantNodesPass is now applied directly inside to_edge_TOSA_delegate() to handle any boundary quantized_decomposed::* nodes that remain outside the delegated subgraph. This makes the delegation path self-contained and explicit about its runtime requirements.

This change is in preparation for an upcoming PR (#17075) that introduces Cortex-M as a first-class compilation target with its own dedicated pipeline, including CortexMQuantizer and CortexMPassManager.
Created this clean PR to remove all the legacy flow: #17740. Let's see if the CI passes clean on that PR. If it does, then I will rebase this ongoing PR on top of it.