
Fix merge conflicts #37

Merged
ynimmaga merged 585 commits into ynimmaga:openvino_backend from cavusmustafa:fix_merge_conflicts
Mar 24, 2025
Conversation

@cavusmustafa
Collaborator

Merged executorch main branch into openvino_backend and resolved conflicts.

dpalmasan and others added 30 commits March 11, 2025 20:28
Differential Revision: D71001041

Pull Request resolved: pytorch#9168
Adds TOSA support for logical not in Arm backend.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218

Signed-off-by: Måns Nilsson <mans.nilsson@arm.com>
Co-authored-by: Yufeng Shi <yufeng.shi@arm.com>
Summary:
Fix a typo in executorch documentation

https://pytorch.org/executorch/main/backends-xnnpack.html

Reviewed By: cccclai

Differential Revision: D70645356

Co-authored-by: Frank Yu <frankyu@meta.com>
### Summary
Just to make it consistent with its `linux` counterpart, let's update
the reference from M1 to just macOS.

### Test plan

CI

cc @larryliu0820 @lucylq
The old value of 4 min was too restrictive when running bigger models on
some machines and sometimes caused GitHub runners to fail.


cc @digantdesai @freddan80 @per @oscarandersson8218

Signed-off-by: Zingo Andersen <zingo.andersen@arm.com>
…torch#9083)

This makes it easier to run the scripts from various tools, such as your
editor.

Signed-off-by: Zingo Andersen <zingo.andersen@arm.com>
- Remove xfails
- Refactor test_mm to use testing pipelines

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
### Summary
- add SXR2330P

### Test plan
```bash
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator -s $SERIAL_NO -m SM8650 -b build-android
```
Needed after extension llm third-party deps moved into tokenizers subdir
### Summary
Update XNNPACK backend doc page to cover quantization schemes and
generally clean up the format. Update backend template doc with
additional detail.

### Test plan
Built docs locally and verified contents.

cc @mergennachin @byjlw
### Summary
We don't need to duplicate the deps source.

### Test plan

CI
…ytorch#9190)

This reverts commit 05a160e.

Revert "Revert "Make serial parallel_for "polyfill" iterate backwards in
debug builds (pytorch#9044)""

This reverts commit 815eaff.

Revert "Revert "Unbreak optimized kernels buck build (and check it in
unittest-buck) (pytorch#9159)""

This reverts commit 10bb615.
Needed to efficiently use parallel_for with BroadcastIndexesRange.
…ch#9058)

Now all the apply functions share a common implementation, which means
further changes (e.g., parallel_for, generating specialized dtypes for
the case where all inputs have the same type) don't need to be repeated
3 times.

(Interestingly, this seems to increase the effectiveness of the
following parallelization change. Not entirely sure why, but I checked
the generated code for optimized op_where and it seems to have improved,
which is surprising.)
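The commonization described above can be sketched conceptually; the names below are hypothetical and the actual ExecuTorch apply helpers in C++ handle dtypes and broadcasting, but the idea of one shared loop serving every elementwise op is the same:

```python
def apply_elementwise(fn, *inputs):
    # Single shared apply loop: each elementwise op supplies only its
    # scalar function; the iteration logic lives in one place.
    # Inputs are plain equal-length lists in this sketch; the real
    # implementation also handles dtypes and broadcasting.
    assert all(len(x) == len(inputs[0]) for x in inputs)
    return [fn(*vals) for vals in zip(*inputs)]

# Ops defined in terms of the shared helper:
def op_where(cond, a, b):
    return apply_elementwise(lambda c, x, y: x if c else y, cond, a, b)

def op_add(a, b):
    return apply_elementwise(lambda x, y: x + y, a, b)
```

With one implementation to maintain, later changes such as parallelization only need to be made once.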
Internal model got a 5.7% latency improvement (313.8 ms before, 296.0 ms
after).
Not sure why the threadpool extension isn't mentioned or built in here;
we should follow up on that.
… args (pytorch#9203)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#9173 by
@SS-JIA
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/SS-JIA/196/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/SS-JIA/196/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/SS-JIA/196/orig
@diff-train-skip-merge

Co-authored-by: Stephen Jia <ssjia@meta.com>
CI is assigning Python 3.9 machines, but ET requires 3.10.
Use the old one; no need to use the new API.
Differential Revision: D71023514

Pull Request resolved: pytorch#9196
This is step (5) of pytorch#8932.

At this exact moment, this rebuild is inefficient because it rebuilds
the whole portable op library, but ops don't support optional
parallelization just yet. This will become less true when we roll out
parallel_for support across portable ops immediately following this PR.
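A minimal sketch of the `parallel_for` idea (in Python for brevity; the actual ExecuTorch API is C++ and its signature differs): split the index range into grain-sized chunks and hand each chunk a `body(start, stop)` callback, falling back to a serial call for small ranges.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(begin, end, grain_size, body):
    # Split [begin, end) into chunks of at most grain_size elements
    # and run body(start, stop) on each; small ranges stay serial.
    if end - begin <= grain_size:
        body(begin, end)
        return
    chunks = []
    start = begin
    while start < end:
        stop = min(start + grain_size, end)
        chunks.append((start, stop))
        start = stop
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda chunk: body(*chunk), chunks))

# Each chunk writes a disjoint slice of the output, so no locking is needed.
out = [0] * 8
parallel_for(0, 8, 2,
             lambda s, e: out.__setitem__(slice(s, e),
                                          [i * i for i in range(s, e)]))
```

Because the callback only sees its own sub-range, an op written against this interface works unchanged whether the backend runs chunks serially or on a thread pool.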
I attempted to port `at::parallel_reduce` to ExecuTorch and use that
in reduce_util.h, but it turned out to be much trickier than expected.

(In brief: parallel reduction requires two steps: 1) split the input
range into chunks and reduce over them (easily done like
parallel_for), and then 2) combine the sub-results from chunks. The
reduction function accepted by reduce_over_dim is not well-suited to
step (2).)

Instead, I ported the parallelization strategy used by
binary_kernel_reduce_lastdim: just parallelize over the *non*-reduced
dimensions of the tensor. I don't understand why this strategy isn't
generally applicable and we aren't otherwise capable of parallelizing
reductions, so I haven't gated it to the case where we are reducing
over a contiguous last dimension.

I will send a follow-up that packages up this strategy nicely and uses
it in our reduction portable ops.
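The strategy described above (parallelizing over the non-reduced dimensions rather than splitting and recombining the reduced range) can be sketched conceptually; the names are hypothetical and the real code in reduce_util.h is C++:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def reduce_over_last_dim(op, identity, rows):
    # Parallelize over the non-reduced (outer) dimension: each row's
    # reduction is independent, so no cross-thread combine step is
    # needed, unlike a two-step chunked parallel_reduce.
    def reduce_row(row):
        return reduce(op, row, identity)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(reduce_row, rows))
```

This sidesteps step (2) entirely: the reduction function only ever combines elements within one row, which is exactly what the existing `reduce_over_dim`-style callbacks already know how to do.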
Everything but the Python test (which depends on //caffe2:torch) is
fine.
Differential Revision: D71073675

Pull Request resolved: pytorch#9206
jackzhxng and others added 26 commits March 22, 2025 01:31
Differential Revision: D70184325

Pull Request resolved: pytorch#8488
Differential Revision: D71404805

Pull Request resolved: pytorch#9456
Differential Revision: D71634148

Pull Request resolved: pytorch#9522
### Summary
- QC backend changes for adopting LPBQ
- test case: conv2d 16a4w
- refactor a bit

### Test plan
```bash
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator.test_qnn_backend_conv2d_block -s $SERIAL_NO -m SM8650 -b build-android
```
Differential Revision: D71699118

Pull Request resolved: pytorch#9529
Differential Revision: D71698626

Pull Request resolved: pytorch#9528
Differential Revision: D71591385

Pull Request resolved: pytorch#9493
…ytorch#9489)

Note that this includes ops that are not currently implemented. These
ops are added for completeness.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
- Change constant test data to random generators.
- Tests on Ethos-U55 are meant to xfail as int16 tables are currently
not supported.
- For other tests, add flaky marker. Remove increased qtol, since the
inaccuracies only show up sporadically.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
We've been seeing flakes due to missing torchgen that have become more
common. After some investigation, it appears that pytorch#8688 was probably
overzealous: installing pytorch was probably also installing torchgen,
so let's ~~install pytorch~~ just run the macos setup script to avoid
proliferating configurations.

Test Plan: unittest-buck / macos on this PR, monitor to see if failures
go away.
Differential Revision: D71699919

Pull Request resolved: pytorch#9530
…9434)

### Summary
The final diff as part of
pytorch#9117. This is the big one
that affects users — we finally move the core build scripts into
`scripts/`

### Test plan

CI

cc @larryliu0820 @lucylq
Differential Revision: D71667998

Pull Request resolved: pytorch#9523
Arm backend: support for CEIL op
- Update unary operator factory with CEIL op
- Rename and refactor test_floor to handle similar ops

Signed-off-by: Madeleine Dunn <madeleine.dunn@arm.com>
Differential Revision: D71713725

Pull Request resolved: pytorch#9546
…9548)

Updates torchao pin to enable shared embedding quantization.
Fix the maven deps. Should use fbjni.
Add AndroidManifest.xml for test.
Add pytorch#9354 by
[Inklingdq](https://github.com/Inklingdq) which was accidentally merged
to `viable/strict` to `main`.

Co-authored-by: Inkling <64665980+Inklingdq@users.noreply.github.com>
One of the current drawbacks of using a pinned PyTorch commit on CI is that
we need to build the PyTorch wheel on all macOS jobs because macOS has no
Docker image. Building the PyTorch wheel is usually not too bad because we
have sccache in place to make compilation faster. However, it's
still slower than using a prebuilt wheel, and sccache is also not
available on the GitHub macOS runner `macos-latest-xlarge` (no access to
S3).

As all MacOS jobs are building exactly the same PyTorch wheel, the
proposal here is to cache the wheel on S3 `gha-artifacts` bucket which
is publicly readable, i.e.
https://gha-artifacts.s3.us-east-1.amazonaws.com/cached_artifacts/pytorch/executorch/pytorch_wheels/Darwin/311/torch-2.7.0a0%2Bgit295f2ed-cp311-cp311-macosx_14_0_arm64.whl.
The job can check for a matching wheel on S3 and use it instead. If
there is no such wheel, it will continue building PyTorch normally. Once
a new wheel is built, and if the runner has write access to S3, it will
upload the wheel so that other jobs can pick it up going forward.
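The check-then-build-then-upload flow reads as follows (a minimal sketch with hypothetical names; the real job talks to the S3 bucket rather than a dict):

```python
def get_or_build_wheel(key, cache_fetch, build, cache_store, can_write):
    # Check the artifact cache for a prebuilt wheel; fall back to
    # building, and upload the result when the runner has write access.
    wheel = cache_fetch(key)
    if wheel is not None:
        return wheel  # cache hit: skip the expensive build
    wheel = build()
    if can_write:
        cache_store(key, wheel)
    return wheel

# Dict-backed stand-in for the S3 bucket:
cache = {}
builds = []

def build_wheel():
    builds.append(1)  # track how many times we actually build
    return "torch-wheel"

first = get_or_build_wheel("Darwin/311", cache.get, build_wheel,
                           cache.__setitem__, True)
second = get_or_build_wheel("Darwin/311", cache.get, build_wheel,
                            cache.__setitem__, True)
```

Since all macOS jobs build the identical wheel for a given pinned commit, the first job to finish populates the cache and every later job takes the fast path.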

### Testing

All CI jobs pass (failures are pre-existing from trunk). Here are some
quick numbers on how this helps reduce the durations of different macOS
jobs.

* Apple workflow:
* build-benchmark-app:
[BEFORE](https://github.com/pytorch/executorch/actions/runs/14002229786/job/39210715922)
~80m →
[AFTER](https://github.com/pytorch/executorch/actions/runs/14001343158/job/39214390843)
~44m
* build-frameworks-ios:
[BEFORE](https://github.com/pytorch/executorch/actions/runs/14002229786/job/39210732212)
~80m →
[AFTER](https://github.com/pytorch/executorch/actions/runs/14001343158/job/39214394644)
~ 44m
* build-demo-ios:
[BEFORE](https://github.com/pytorch/executorch/actions/runs/14003433493/job/39213882743)
~ 55m →
[AFTER](https://github.com/pytorch/executorch/actions/runs/14001343158/job/39214390955)
~23m
* Apple perf workflow:
* build-benchmark-app:
[BEFORE](https://github.com/pytorch/executorch/actions/runs/13982706236/job/39208203350)
~80m →
[AFTER](https://github.com/pytorch/executorch/actions/runs/14001347585/job/39214401072)
~48m
* export model (llama):
[BEFORE](https://github.com/pytorch/executorch/actions/runs/13982706236/job/39150917351)
~30m →
[AFTER](https://github.com/pytorch/executorch/actions/runs/14001347585/job/39214401617)
~13m
* All macOS jobs in pull and trunk:
* BEFORE ~417m on commit b195ed9 → AFTER
~268m

Overall, I'm seeing the duration of all macOS jobs reduced by close to
2x. This is very useful for reducing the cost of running macOS jobs (remember
the budget request to the OSS team because of the $$$ GitHub macOS runners)
No targets and currently isn't being run
@ynimmaga
Owner

LGTM

@ynimmaga ynimmaga merged commit 0030fb9 into ynimmaga:openvino_backend Mar 24, 2025
13 of 261 checks passed
@cavusmustafa cavusmustafa had a problem deploying to upload-benchmark-results March 26, 2025 22:33 — with GitHub Actions Error