
Arm backend: Add Cortex-M as a first-class target in aot_arm_compiler#17075

Open
psiddh wants to merge 1 commit into pytorch:main from psiddh:main

Conversation

@psiddh
Contributor

@psiddh psiddh commented Jan 30, 2026

Add a dedicated to_edge_cortex_m() path selected via --target=cortex-m that
owns the full pipeline: CortexMQuantizer for INT8 quantization, correct
EdgeCompileConfig with preserve_ops to prevent premature decomposition, and
CortexMPassManager.pass_list for op conversion. Remove the old scattered
transform_for_cortex_m_backend() function.

Verified all ops fully lowered to cortex_m::quantized_* operators for both
MobileNetV2 (70 nodes) and MobileNetV3 (122 nodes). E2E inference tested
on Alif E8 board.

Test Plan:
- python3 -m examples.arm.aot_arm_compiler -m mv2 --target=cortex-m55+int8 --quantize --intermediates=./mv2_intermediates --output=./mv2_cortex_m.pte
- python3 -m examples.arm.aot_arm_compiler -m mv3 --target=cortex-m55+int8 --quantize --intermediates=./mv3_intermediates --output=./mv3_cortex_m.pte

Also ran E2E inference on Alif E8 board
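The pipeline described above (quantize with CortexMQuantizer, lower to edge with preserve_ops, then run the Cortex-M pass list) can be sketched as a small self-contained mock. All function names below are simplified stand-ins for the real ExecuTorch APIs, and a "graph" is just a list of op names:

```python
# Illustrative sketch of the to_edge_cortex_m() flow. Names are stand-ins for
# the real CortexMQuantizer / EdgeCompileConfig / CortexMPassManager APIs.

PRESERVE_OPS = {"aten.linear"}  # ops kept whole instead of decomposed


def quantize_int8(graph):
    # Stand-in for CortexMQuantizer: tag every op as int8-quantized.
    return [f"quantized_{op}" for op in graph]


def to_edge(graph, preserve_ops):
    # Stand-in for to_edge() with EdgeCompileConfig(preserve_ops=...):
    # without preserve_ops, aten.linear would decompose into aten.addmm.
    out = []
    for op in graph:
        base = op.replace("quantized_", "")
        if base == "aten.linear" and base not in preserve_ops:
            out.append(op.replace("aten.linear", "aten.addmm"))
        else:
            out.append(op)
    return out


def cortex_m_passes(graph):
    # Stand-in for CortexMPassManager.pass_list: convert quantized ops
    # to their cortex_m::quantized_* equivalents.
    return [op.replace("quantized_aten.", "cortex_m.quantized_") for op in graph]


def to_edge_cortex_m(graph):
    return cortex_m_passes(to_edge(quantize_int8(graph), PRESERVE_OPS))


if __name__ == "__main__":
    print(to_edge_cortex_m(["aten.linear", "aten.conv2d"]))
```

With preserve_ops in place, linear reaches the pass manager undecomposed and lowers to cortex_m.quantized_linear; with an empty preserve set it would first decompose to addmm, which is the workaround the old flow needed.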

@pytorch-bot

pytorch-bot bot commented Jan 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17075

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 3 Cancelled Jobs

As of commit 105498e with merge base f30d5ed:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 30, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@psiddh psiddh force-pushed the main branch 6 times, most recently from 39666cd to 7f14a9d Compare February 4, 2026 09:06
@zingo zingo added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk module: microcontrollers For embedded MCUs like Cortex-M, or RTOS like Zephyr, does not track NPU backend like Arm Ethos. labels Feb 5, 2026
@psiddh psiddh force-pushed the main branch 5 times, most recently from 1b64ef3 to 41462be Compare February 6, 2026 07:48
@psiddh psiddh changed the title from "Summary:MV2 CortexM PassManager changes for Alif E8" to "Cortex-M: Enable full MobileNetV2 lowering to CMSIS-NN backend via Aot Compiler script" Feb 6, 2026
@psiddh psiddh marked this pull request as ready for review February 6, 2026 07:56
@psiddh psiddh requested a review from digantdesai as a code owner February 6, 2026 07:56
Copilot AI review requested due to automatic review settings February 6, 2026 07:56
Contributor

Copilot AI left a comment


Pull request overview

This PR enables full MobileNetV2 lowering to the CMSIS-NN backend for Cortex-M microcontrollers by implementing comprehensive support for quantized operations through a dedicated compilation path. The changes replace the previous delegation-based approach with a portable kernel-based architecture that converts all quantized operations to cortex_m::* operators.

Changes:

  • Added dedicated Cortex-M compilation path (to_edge_cortex_m) in the AOT compiler with CortexMQuantizer-based quantization
  • Implemented addmm operator support for decomposed linear layers through new _get_addmm_replacement method
  • Enhanced quantization parameter propagation with new PropagateQParamsPass and passthrough op handling in FoldAndAnnotateQParamsPass
  • Extended quantizer to mark parameter nodes as annotated and added passthrough ops (hardtanh, max_pool2d, dropout)
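The addmm handling in the second bullet can be illustrated with a tiny stand-alone rewrite. The tuples below are hypothetical stand-ins (the real _get_addmm_replacement operates on torch.fx nodes):

```python
# Stand-alone illustration of converting a decomposed linear, i.e.
# addmm(bias, x, weight_t), back into a single quantized_linear op, in the
# spirit of _get_addmm_replacement. Nodes are plain tuples, not fx nodes.

def get_addmm_replacement(node):
    """Map ('addmm', bias, x, weight_t) -> ('cortex_m.quantized_linear', x, weight_t, bias)."""
    op, bias, x, weight_t = node
    if op != "addmm":
        return node
    # aten.addmm computes bias + x @ weight_t; quantized_linear takes
    # (input, weight, bias), so the arguments are reordered.
    return ("cortex_m.quantized_linear", x, weight_t, bias)


if __name__ == "__main__":
    print(get_addmm_replacement(("addmm", "bias", "x", "w_t")))
```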

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
examples/arm/aot_arm_compiler.py Adds to_edge_cortex_m function for Cortex-M compilation path using CortexMQuantizer and removes old transform_for_cortex_m_backend function
backends/cortex_m/quantizer/quantizer.py Adds _mark_param_node_as_annotated method and extends passthrough ops list for MobileNetV2 support
backends/cortex_m/passes/propagate_qparams_pass.py New pass to propagate qparams through passthrough ops (transpose/permute) to consumer nodes like addmm
backends/cortex_m/passes/cortex_m_pass_manager.py Adds PropagateQParamsPass and DecomposeAdaptiveAvgPool2dPass to pass list, adds skip_passes parameter to __init__
backends/cortex_m/passes/convert_to_cortex_m_pass.py Implements _get_addmm_replacement method to convert decomposed linear (addmm) operations to cortex_m.quantized_linear
backends/arm/_passes/fold_qdq_with_annotated_qparams_pass.py Adds passthrough ops (hardtanh, relu, clamp) support and second-pass qparams propagation logic
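The qparam propagation described for propagate_qparams_pass.py can be sketched with plain dicts. This is a simplification under assumed structures, not the actual pass (which walks a torch.fx graph):

```python
# Illustrative sketch of propagating quantization parameters through
# passthrough ops (transpose/permute), in the spirit of PropagateQParamsPass.
# A "graph" here maps node name -> {"op", "input", "qparams"}.

PASSTHROUGH_OPS = {"transpose", "permute"}


def propagate_qparams(nodes):
    """Copy qparams from a producer through passthrough nodes to their outputs."""
    for node in nodes.values():
        if node["op"] in PASSTHROUGH_OPS and node["qparams"] is None:
            producer = nodes[node["input"]]
            # A transpose/permute does not change scale/zero_point, so the
            # producer's qparams apply unchanged to this node's output.
            node["qparams"] = producer["qparams"]
    return nodes


if __name__ == "__main__":
    g = {
        "weight": {"op": "placeholder", "input": None,
                   "qparams": {"scale": 0.02, "zero_point": 0}},
        "t": {"op": "transpose", "input": "weight", "qparams": None},
        "addmm": {"op": "addmm", "input": "t", "qparams": None},
    }
    propagate_qparams(g)
    print(g["t"]["qparams"])
```

After propagation, a consumer such as addmm can read qparams from its transposed input instead of finding none there.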


@psiddh psiddh force-pushed the main branch 2 times, most recently from d7d85fb to b222911 Compare February 6, 2026 09:09
Copilot AI review requested due to automatic review settings February 6, 2026 09:09
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.



@@ -396,6 +388,7 @@ def forward(self, x):
"TOSA-1.0+INT",
"TOSA-1.0+FP",
"TOSA-1.0+INT+int16",
"cortex-m-int8",
Collaborator


Thanks for adding this. I think the flag might need to be different for different Cortex-M variants, as the memory planning might differ depending on implementation, so maybe it should be cortex-m55* for now? Also, the other targets use +, so "cortex-m55+int8" might be more consistent?

Also, we need to update the two M55 examples (Corstone-300 and NXP N6) in
https://github.com/pytorch/executorch/tree/main/zephyr
to use this new nice flag :) We can maybe also do that in a separate PR.
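The "+"-separated convention suggested here (matching targets like "TOSA-1.0+INT") could be parsed as follows; this is a hypothetical helper, not the actual aot_arm_compiler code:

```python
# Hypothetical sketch of parsing a "+"-separated target string such as
# "cortex-m55+int8". Not the actual aot_arm_compiler implementation.

def parse_cortex_m_target(target: str):
    parts = target.split("+")
    cpu = parts[0]                 # e.g. "cortex-m55"
    features = set(parts[1:])      # e.g. {"int8"}
    if not cpu.startswith("cortex-m"):
        raise ValueError(f"not a Cortex-M target: {target!r}")
    return cpu, features


if __name__ == "__main__":
    print(parse_cortex_m_target("cortex-m55+int8"))
```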

Contributor Author


Updated the flag, and will update the M55 examples in follow up PR

@zingo
Collaborator

zingo commented Feb 25, 2026

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

@psiddh
Contributor Author

psiddh commented Feb 25, 2026

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

Are you referring to the CI failure on this PR: https://github.com/pytorch/executorch/actions/runs/22315191858/job/64557957144?pr=17075 ? If so, it seems unrelated; it is failing due to a size check.

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

Looked at the failing test-arm-backend-* jobs; looks unrelated to me. I will keep a close eye on the CI post-landing.
I think if CI tests are using the 'qdq' flag, which was introduced at the very beginning of Cortex-M development 6 months ago, they will fail. But I don't think there are any CI jobs atm that use this flag, afaik. Will keep a close eye on CI.

The failed job for Zephyr :

  • 2026-02-23T16:50:39.0579191Z E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'quantized_decomposed::dequantize_per_tensor.out' not found.

@zingo
Collaborator

zingo commented Feb 25, 2026

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

Are you referring to the CI failure on this PR: https://github.com/pytorch/executorch/actions/runs/22315191858/job/64557957144?pr=17075 ? If so, it seems unrelated; it is failing due to a size check.

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

Looked at the failing test-arm-backend-* jobs; looks unrelated to me. I will keep a close eye on the CI post-landing. I think if CI tests are using the 'qdq' flag, which was introduced at the very beginning of Cortex-M development 6 months ago, they will fail. But I don't think there are any CI jobs atm that use this flag, afaik. Will keep a close eye on CI.

The failed job for Zephyr :

  • 2026-02-23T16:50:39.0579191Z E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'quantized_decomposed::dequantize_per_tensor.out' not found.

I agree, I also think they are unrelated. Usually I would just rebase the PR to retrigger a rerun on the latest to make sure, but I think it might mess up your internal-external Meta PR sync. That's why I just added a comment :)
We just need to keep an eye on the HUD after it is merged, to make sure the errors do not start to appear there :) e.g. https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-arm%7Ccortex-m&useRegexFilter=true

I can try to trigger a re-test without a rebase, that might work.

Copilot AI review requested due to automatic review settings February 25, 2026 19:19
@psiddh
Contributor Author

psiddh commented Feb 25, 2026

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

Are you referring to the CI failure on this PR: https://github.com/pytorch/executorch/actions/runs/22315191858/job/64557957144?pr=17075 ? If so, it seems unrelated; it is failing due to a size check.

Hi @psiddh, this seems to give some errors in the test-arm-backend-* test runners. It might be (probably is) unrelated, but take care before merging, as it will break CI if not 🙏.

Looked at the failing test-arm-backend-* jobs; looks unrelated to me. I will keep a close eye on the CI post-landing. I think if CI tests are using the 'qdq' flag, which was introduced at the very beginning of Cortex-M development 6 months ago, they will fail. But I don't think there are any CI jobs atm that use this flag, afaik. Will keep a close eye on CI.
The failed job for Zephyr :

  • 2026-02-23T16:50:39.0579191Z E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'quantized_decomposed::dequantize_per_tensor.out' not found.

I agree, I also think they are unrelated. Usually I would just rebase the PR to retrigger a rerun on the latest to make sure, but I think it might mess up your internal-external Meta PR sync. That's why I just added a comment :) We just need to keep an eye on the HUD after it is merged, to make sure the errors do not start to appear there :) e.g. https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-arm%7Ccortex-m&useRegexFilter=true

I can try to trigger a re-test without a rebase, that might work.

Rebased to the HEAD

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 2 changed files in this pull request and generated no new comments.



@zingo
Collaborator

zingo commented Feb 25, 2026

PTE size seems to still fail and might actually be bigger; maybe strings are longer, or some other metadata is bigger. It seems the number in the test needs to be bumped to not fail CI.

@psiddh
Contributor Author

psiddh commented Feb 26, 2026

PTE size seems to still fail and might actually be bigger; maybe strings are longer, or some other metadata is bigger. It seems the number in the test needs to be bumped to not fail CI.

#17725

exported_program, args, model, example_inputs
)

# Cortex-m ops are never included in vgf or direct-drive
Contributor


I wonder if this removal increased the size

Previously, Cortex-M op conversion was applied as an afterthought to all
non-vgf targets via transform_for_cortex_m_backend(). This made the flow
hard to follow, used a bare EdgeCompileConfig that decomposed ops like
linear into addmm (requiring unnecessary workarounds), and didn't use the
CortexMQuantizer or CortexMPassManager.

@zingo
Collaborator

zingo commented Feb 26, 2026

Hi, there also seems to be a real problem with the ZephyrOS tests:

I see this in the logs in 2 of them
https://github.com/pytorch/executorch/actions/runs/22425903988/job/64933843809?pr=17075 and https://github.com/pytorch/executorch/actions/runs/22425903988/job/64933843807?pr=17075

...
Thu, 26 Feb 2026 03:22:53 GMT E [executorch:method.cpp:784 resolve_operator()] Missing operator: [0] quantized_decomposed::quantize_per_tensor.out
Thu, 26 Feb 2026 03:22:53 GMT E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'quantized_decomposed::dequantize_per_tensor.out' not found.
...
Thu, 26 Feb 2026 03:22:53 GMT I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
Thu, 26 Feb 2026 05:03:31 GMT Error: The operation was canceled.

As the Corstone FVP does not quit on these errors, the jobs time out instead.

So maybe something needs to be added to the cmake/link rules now for cortex-m?

In the last Zephyr failed job https://github.com/pytorch/executorch/actions/runs/22425903988/job/64933843797?pr=17075

I see

Thu, 26 Feb 2026 03:22:08 GMT                   from /pytorch/executorch/zephyr_scratch/modules/lib/executorch/../executorch/runtime/core/exec_aten/exec_aten.h:36,
Thu, 26 Feb 2026 03:22:08 GMT                   from /pytorch/executorch/zephyr_scratch/modules/lib/executorch/../executorch/runtime/core/evalue.h:10,
Thu, 26 Feb 2026 03:22:08 GMT                   from /pytorch/executorch/zephyr_scratch/modules/lib/executorch/../executorch/runtime/executor/method.h:18,
Thu, 26 Feb 2026 05:03:31 GMT  Error: The operation was canceled.

e.g. a timeout like the other 2, BUT in a strange place. I suspect it could be the same, but the logs are not flushed. Let's see if this also goes away with the same fix.

@zingo
Collaborator

zingo commented Feb 26, 2026

I agree I also think they are unrelated...

Sorry that I was wrong, and maybe caused false hope, more work, and unneeded delays for this PR.

@psiddh
Contributor Author

psiddh commented Feb 26, 2026

Thinking aloud here about the Zephyr test failures: with elif args.delegate → to_edge_TOSA_delegate(), though 100% of ops are delegated to the NPU, the input/output boundary QDQ nodes are still in the graph as quantized_decomposed::*. There is no ReplaceQuantNodesPass to convert them, and hence the missing kernel at runtime?
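The hypothesis above is what a ReplaceQuantNodesPass-style rewrite addresses: boundary ops must be renamed to ops whose kernels are actually registered in the runtime. A schematic, not the real pass (real graphs are torch.fx GraphModules; here a graph is just a list of op names, and "delegate::ethos_u" is a made-up placeholder for the delegated subgraph call):

```python
# Schematic of a ReplaceQuantNodesPass-style rewrite: swap boundary
# quantized_decomposed::* ops for cortex_m::* ops with registered kernels.

REPLACEMENTS = {
    "quantized_decomposed::quantize_per_tensor.out": "cortex_m::quantize_per_tensor.out",
    "quantized_decomposed::dequantize_per_tensor.out": "cortex_m::dequantize_per_tensor.out",
}


def replace_quant_nodes(graph):
    return [REPLACEMENTS.get(op, op) for op in graph]


def check_kernels(graph, registry):
    # Mimics resolve_operator(): every op must have a registered kernel,
    # otherwise the runtime fails with "Missing operator" / "kernel not found".
    return [op for op in graph if op not in registry]


if __name__ == "__main__":
    graph = ["quantized_decomposed::quantize_per_tensor.out",
             "delegate::ethos_u",  # hypothetical delegated subgraph call
             "quantized_decomposed::dequantize_per_tensor.out"]
    registry = {"cortex_m::quantize_per_tensor.out",
                "cortex_m::dequantize_per_tensor.out",
                "delegate::ethos_u"}
    print(check_kernels(graph, registry))                       # missing ops before the pass
    print(check_kernels(replace_quant_nodes(graph), registry))  # empty after the pass
```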

psiddh pushed a commit that referenced this pull request Feb 26, 2026
…pass flag

Summary:
Remove the transform_for_cortex_m_backend() function and the --enable_qdq_fusion_pass CLI flag from aot_arm_compiler.py. The function applied Cortex-M passes as a post-hoc step to all non-VGF targets, which made the compilation flow hard to follow and coupled the delegation path to Cortex-M-specific logic.

Instead, ReplaceQuantNodesPass is now applied directly inside to_edge_TOSA_delegate() to handle any boundary quantized_decomposed::* nodes that remain outside the delegated subgraph. This makes the delegation path self-contained and explicit about its runtime requirements.

This change is in preparation for an upcoming PR (#17075) that introduces Cortex-M as a first-class compilation target with its own dedicated pipeline, including CortexMQuantizer and CortexMPassManager.
@psiddh
Contributor Author

psiddh commented Feb 26, 2026

Thinking aloud here about the Zephyr test failures: with elif args.delegate → to_edge_TOSA_delegate(), though 100% of ops are delegated to the NPU, the input/output boundary QDQ nodes are still in the graph as quantized_decomposed::*. There is no ReplaceQuantNodesPass to convert them, and hence the missing kernel at runtime?

Created this clean PR to remove all the legacy flow: #17740. Let's see if the CI passes clean on that PR. If it does, I will rebase this ongoing PR on top of it.

psiddh pushed a commit that referenced this pull request Feb 27, 2026
@zingo zingo changed the title from "Add Cortex-M as a first-class target in aot_arm_compiler" to "Arm backend: Add Cortex-M as a first-class target in aot_arm_compiler" Feb 27, 2026
