
[asm] Async memory iter_arg tying safety in register allocator #978

Draft
Hardcode84 wants to merge 1 commit into iree-org:main from Hardcode84:vmem-iter-arg-tying

Conversation

@Hardcode84
Contributor

Problem

In software-pipelined loops, buffer_load / ds_read results are passed as
condition iter_args. The liveness analysis (Pass 3b) unconditionally tied ALL
iter_args to their corresponding block args, making them share a physical
register. This eliminates the back-edge v_mov copy but is unsafe when the
block arg is still live after the async memory op writes — the VMEM/LDS write
clobbers a value MFMAs are still reading.

  %data_next = buffer_load ...       // ← writes to shared register
  %acc_new   = mfma %data, ..., %acc // ← still reading block arg %data (BOOM)

A previous prototype by @panditsa put the filter in LinearScanPass.cpp, but the
liveness analysis still added unconditional ties, leaving two competing sources of truth.

Fix

Move the safety check into Liveness.cpp, the single place where loop ties are
decided:

  • Detection: use the MemoryOp trait (hasTrait<OpTrait::MemoryOp>() && getNumResults() > 0) instead of enumerating op types. Covers buffer_load_*,
    ds_read_*, global_load_*, flat_load_*.
  • Safety: for each iter_arg defined by an async load, compare its definition
    point against the last use of the corresponding block arg in opToIdx order.
    If lastBlockArgUse > defIdx, the tie is unsafe and is skipped.
  • Two guard sites: both the equivalence-class members collection and the
    tiedPairs insertion are guarded.

Additionally, the assembly emitter's back-edge copy logic is fixed to decompose
multi-register values into individual v_mov_b32 copies (previously only copied
one register out of N for wide types like pvreg<N, 4>).

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 requested a review from harsh-nod February 25, 2026 22:24
@harsh-nod
Collaborator

I think this is fine as a starting point, but we would eventually want to make the tying dependence aware (in a subsequent PR).

// destination register asynchronously. If the corresponding block arg
// is still read after the load issues, sharing a register would let the
// load clobber a value MFMAs are still consuming.
auto isUnsafeAsyncTie = [&](Value iterArg, BlockArgument blockArg) -> bool {
Collaborator

Rather than have this as a heuristic in the pass, could we add this as a verification hook on ConditionOp or LoopOp that flags when an async memory result is passed as an iter_arg with an overlapping block arg use. This would catch the hazard at IR validation time rather than relying on the liveness pass to silently handle it.

Contributor Author

Not sure what IR validation would give us besides random compilation failures for the user. Also, checking non-local properties (use-def chains) in a verifier is bad practice.

Collaborator

The idea with the verifier is that we always validate this as an invariant of the IR, rather than only evaluating it in this pass. Okay, then maybe we should model it as a normal form, but we can do that in a separate PR.

@harsh-nod left a comment (Collaborator)

thanks for fixing the bug, minor comments, but overall looks good

@Hardcode84
Contributor Author

Also, this is overconservative: if we have a waitcount before the loop end, this should be safe, but waitcounts are inserted in subsequent passes.

@harsh-nod
Collaborator

> Also, this is overconservative: if we have a waitcount before the loop end, this should be safe, but waitcounts are inserted in subsequent passes.

Couldn't we just move the ticketing pass before this one?

@Hardcode84
Contributor Author

> Couldn't we just move the ticketing pass before this one?

Maybe; we can also just check whether the value is consumed before the loop end (which will generate a waitcount later).
