
Fix int32 overflow deadlock and non-power-of-2 crash in Triton AlltoAllv (#2133)

Closed

snarayankh wants to merge 1 commit into meta-pytorch:main from snarayankh:export-D101236430

Conversation


@snarayankh snarayankh commented Apr 18, 2026

Summary:

This diff fixes two bugs in the Triton AlltoAllv kernel (`comms/pipes/collectives/triton/`):

**int32 iteration/completion counter overflow causes deadlock**

The iteration counter tensor and completion counters tensor were allocated as `torch.int32`. The kernel computes signal values as `sender_bpp * (iteration + 1)` in int32 arithmetic before passing to `wait_signal_from`, which sign-extends the result to int64 and then reinterprets it as uint64 for the C++ comparison. With `blocks_per_peer=16`, int32 overflows after ~134M iterations. The overflowed negative int32 becomes a huge uint64 (~18.4 quintillion) after sign-extension, making the unsigned comparison `actual >= expected` permanently false — causing all recv blocks to spin forever in a **silent deadlock**.

The full overflow chain:

1. `torch.zeros(1, dtype=torch.int32)` — iteration tensor created as int32
2. `tl.load(iteration_ptr)` — loaded as Triton i32 in the kernel
3. `sender_bpp * (iteration + 1)` — i32 × i32 = i32 **OVERFLOW** at ~134M iterations
4. `tl.full([], expected_value, tl.int64)` — sign-extends the negative i32 to i64 in the `wait_signal_from` wrapper
5. C++ `torchcomms_wait_signal_from(unsigned long long expected_value)` — reinterprets it as uint64 = ~18.4 quintillion
6. `cmp_op(CmpOp::GE, actual_uint64, expected_uint64)` — unsigned comparison, permanently false → **DEADLOCK**
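The wrap and reinterpretation can be reproduced numerically. A minimal NumPy sketch of the chain above for the `bpp=16` case (the variable names here are illustrative, not from the kernel):

```python
import numpy as np

bpp = np.int32(16)                 # blocks_per_peer
iteration = np.int32(134_217_727)  # first iteration where bpp * (iteration + 1) hits 2**31

with np.errstate(over="ignore"):   # silence NumPy's overflow warning
    expected_i32 = bpp * (iteration + np.int32(1))  # i32 * i32 wraps to -2**31
expected_i64 = np.int64(expected_i32)               # sign-extension to i64
expected_u64 = expected_i64.view(np.uint64)         # reinterpret the bits as uint64

print(int(expected_i32))  # -2147483648
print(int(expected_u64))  # 18446744071562067968  (~18.4 quintillion)
```

Once the receiver waits for this value, no sender can ever reach it, since the true cumulative signal grows by only 16 per iteration.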

Time to failure: ~77 days at 20 iter/s with bpp=16, ~1.5 days at 1000 iter/s (CUDA graph replay).
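The quoted failure times follow directly from the wrap point; a quick back-of-the-envelope check in Python (rates are the ones stated above):

```python
OVERFLOW_ITER = 2**31 // 16  # 134_217_728 iterations until i32 wrap at bpp=16

days_at_20_ips = OVERFLOW_ITER / 20 / 86_400      # steady state, 20 iter/s
days_at_1000_ips = OVERFLOW_ITER / 1000 / 86_400  # CUDA graph replay, 1000 iter/s

print(round(days_at_20_ips, 1), round(days_at_1000_ips, 2))  # 77.7 1.55
```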

The sender side is fine — `signal_block` promotes the small value to int64 via `tl.zeros([], dtype=tl.int64) + value` *before* arithmetic, so the cumulative uint64 signal is always correct. The mismatch is only on the receiver side, where the expected-value computation overflows in int32 before widening.
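The entire difference between the two paths is where the widening happens. A NumPy analogue of both, at the first wrapping iteration (variable names are illustrative):

```python
import numpy as np

bpp, iteration = 16, 134_217_727  # first wrapping iteration at bpp=16

with np.errstate(over="ignore"):
    # receiver path: multiply in i32, widen afterwards -> the value is already wrapped
    recv_expected = np.int64(np.int32(bpp) * (np.int32(iteration) + np.int32(1)))

# sender path: widen first (the tl.zeros([], dtype=tl.int64) + value idiom), then multiply
send_signal = (np.int64(0) + np.int64(bpp)) * (np.int64(iteration) + np.int64(1))

print(int(recv_expected), int(send_signal))  # -2147483648 2147483648
```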

Fix: Change both `_ITERATION_TENSOR_CACHE` and `_COMPLETION_COUNTERS_CACHE` from `dtype=torch.int32` to `dtype=torch.int64`. This makes all kernel arithmetic (iteration loads, bpp multiplications, counter atomics) operate in int64, matching the uint64 signal infrastructure. Zero performance cost — these are a single scalar tensor and a small per-rank array.

**`_sum_int64_kernel` crashes for non-power-of-2 world sizes**

`_sum_int64_kernel` passes `N=self.world_size` directly to `tl.arange(0, N)`. Triton requires `tl.arange`'s range to be a power of 2 — verified in `third-party/tp2/triton/2.0.x/triton/python/triton/language/semantic.py:603-614`: `if (range & (range - 1)) != 0: raise ValueError("arange's range must be a power of 2")`. Non-power-of-2 world sizes (3, 5, 6, 7...) cause a compile-time `ValueError` crash.
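The cited check is the classic bit trick for power-of-2 detection. A standalone sketch of the same predicate (not Triton's actual code):

```python
def arange_range_is_valid(r: int) -> bool:
    # mirrors the check quoted above: (range & (range - 1)) != 0 -> reject
    return r > 0 and (r & (r - 1)) == 0

# world sizes up to 8 that would crash _sum_int64_kernel at compile time
print([w for w in range(1, 9) if not arange_range_is_valid(w)])  # [3, 5, 6, 7]
```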

The sibling kernel `_prepare_alltoallv_kernel` in the same file correctly uses `BLOCK_SIZE=triton.next_power_of_2(self.world_size)` with masking (`mask = offsets < W`), but `_sum_int64_kernel` was not updated to match.

Fix: Apply the same power-of-2 rounding + masking pattern that `_prepare_alltoallv_kernel` already uses. The `_prep_block_size` is already computed and available.
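A plain-Python analogue of the rounding + masking pattern (function names here are illustrative; the real kernel uses `triton.next_power_of_2` and a masked `tl.load` with `other=0`):

```python
def next_power_of_2(n: int) -> int:
    # same result as triton.next_power_of_2 for positive n
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def masked_sum(vals):
    # round the block up to a power of 2 and zero the out-of-range lanes,
    # mirroring mask = offsets < W with other=0
    W = len(vals)
    BLOCK_SIZE = next_power_of_2(W)
    return sum(vals[i] if i < W else 0 for i in range(BLOCK_SIZE))

print(next_power_of_2(6), masked_sum([3, 1, 4, 1, 5, 9]))  # 8 23
```

The masked lanes contribute zero, so the sum is unchanged while the block shape satisfies the `tl.arange` power-of-2 constraint.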

Reviewed By: srinathb-meta

Differential Revision: D101236430

meta-cla bot added the CLA Signed label on Apr 18, 2026

meta-codesync bot commented Apr 18, 2026

@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101236430.

meta-codesync bot changed the title to "Fix int32 overflow deadlock and non-power-of-2 crash in Triton AlltoAllv (#2133)" on Apr 20, 2026

meta-codesync bot closed this in 59a4ca4 on Apr 20, 2026

meta-codesync bot commented Apr 20, 2026

This pull request has been merged in 59a4ca4.


Labels: CLA Signed, fb-exported, Merged, meta-exported
