AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258) by zhiyongww · Pull Request #2258 · meta-pytorch/torchcomms

zhiyongww · 2026-04-24T04:45:02Z

Summary:

P2pNvlTransportDevice::send / recv signatures are (group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout()). The two callers in AllToAllvAmd.cuh were passing timeout directly after nbytes, which positionally bound to int active_blocks and failed compilation:

no viable conversion from 'Timeout' to 'int'

Pass /*active_blocks=*/0, /*max_signal_bytes=*/0 explicitly so timeout lands on the right parameter.

Differential Revision: D102217579

meta-codesync · 2026-04-24T04:45:11Z

@zhiyongww has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102217579.

…d/recv (meta-pytorch#2258) Summary: `P2pNvlTransportDevice::send` / `recv` signatures are `(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout())`. The two callers in `AllToAllvAmd.cuh` were passing `timeout` directly after `nbytes`, which positionally bound to `int active_blocks` and failed compilation: no viable conversion from 'Timeout' to 'int' Pass `/*active_blocks=*/0, /*max_signal_bytes=*/0` explicitly so `timeout` lands on the right parameter. Differential Revision: D102217579

Summary: Foundation constant for the multi-NIC IBGDA transport. Defines `kMaxNicsPerGpu = 2`, the compile-time cap on the number of RDMA NICs per GPU. One "NIC" corresponds to one ibverbs path: a mlx5_X device + ibv_pd + QPs. Hardware mapping (per comms/tcp_devmem/HW.md): - H100 (Grand Teton): 1 NIC per GPU → numNics=1 - GB200 (Catalina): 2 NICs per GPU (2 ConnectX-7 chips) → numNics=2 - GB300 (Clemente): 2 NICs per GPU (1 dual-port ConnectX-8 exposed as 2 mlx5_X) → numNics=2 `kMaxNicsPerGpu` caps the static array sizes used by the IBGDA backend in subsequent commits: - NetworkLKeys / NetworkRKeys (per-NIC key array wrappers) - IbgdaLocalBuffer.lkeys / IbgdaRemoteBuffer.rkeys - LocalBufferRegistration.lkeys / RemoteBufferRegistration.rkeys - P2pIbgdaTransportDevice.sinkLkeys_ - IbgdaTransportExchInfoAll.nicInfo - MultipeerIbgdaTransport per-NIC IB resources (ibvCtxs_, ibvPds_, ...) - CachedMr.mrs Lives in `comms/pipes/rdma/NicDiscovery.h` next to the existing topology-aware NIC selection API rather than in a new IBGDA-specific header — per code review feedback (D101700973), the cap is a hardware property of "how many NICs a GPU can have," not specific to any single transport implementation. Subsequent commits in this stack use the constant via `#include "comms/pipes/rdma/NicDiscovery.h"`. The C-side mirror for the NCCLx ABI is `NCCLX_MAX_IBGDA_NICS` in `nccl.h` (introduced later in the stack), kept in lockstep with this constant via a static_assert at the bridge layer in `PipesDeviceBackend.cpp`. Increase the cap only if a future platform supports > 2 NICs per GPU; update both the C++ constant and the C define in lockstep, and audit the IBGDA wire format (size scales with `kMaxNicsPerGpu × kMaxQpsPerPeer × kMaxRanksForAllGather`). No behavior change: header-only constant addition. Wired into IbgdaBuffer / P2pIbgdaTransportDevice / DeviceWindow / RegisteredBuffer in subsequent commits. Reviewed By: goelayu Differential Revision: D101700973

meta-pytorch#2247) Summary: Builds the per-NIC key infrastructure on top of the `kMaxNicsPerGpu` constant from the previous commit: 1. **`NetworkLKeys` / `NetworkRKeys`** — named, fixed-size aggregate wrappers around `NetworkLKey[kMaxNicsPerGpu]` / `NetworkRKey[kMaxNicsPerGpu]`. Standard-layout aggregates so they remain usable in unions for the legacy `lkey` / `rkey` aliases. Brace-initializable: `NetworkLKeys{}` (all zero), `NetworkLKeys{lkey0, lkey1}` (per-NIC), `NetworkLKeys{lkey}` (single-NIC, brace-elision). Indexed via `operator[]`. 2. **`IbgdaLocalBuffer.lkeys` / `IbgdaRemoteBuffer.rkeys`** — switched from a single `NetworkLKey lkey` / `NetworkRKey rkey` field to a `NetworkLKeys` / `NetworkRKeys` per-NIC array. Anonymous union preserves the legacy `.lkey` / `.rkey` member access for read-side compatibility (still resolves to `lkeys[0]` / `rkeys[0]`). 3. **Single multi-key constructor only** — no deprecated single-key ctor. At `numNics > 1`, a single-key value would silently leave NIC[1..N-1] keys zero and trap on first multi-NIC use; callers always construct with the per-NIC `NetworkLKeys` / `NetworkRKeys` aggregate. Net behavior: at `numNics = 1`, all sites read `lkeys[0]` / `rkeys[0]` (or via the legacy `.lkey` / `.rkey` alias) — bit-identical to pre-multi-NIC. The infrastructure is wired into the IBGDA backend in subsequent commits (P2p ctor, DeviceWindow, HostWindow). **AMD/HIP compile fixes:** the legacy `.lkey` / `.rkey` union shim does not survive HIP compilation (HIP includes the header in contexts where the union members are not visible), so all AMD device sites are migrated to the explicit `lkey_per_device[0]` / `rkey_per_device[0]` accessor and AMD test/benchmark constructors are wrapped as `NetworkLKeys{NetworkLKey(...)}` / `NetworkRKeys{NetworkRKey(...)}`. AMD does not support multi-NIC yet, so reads are pinned to slot `[0]`. The `IBGDA_KEYS_OOB_TRAP` macro is also split: CUDA keeps the full printf + `__trap()` diagnostic; HIP becomes a no-op (the bounds check is unreachable at `size = 1` and `__trap()` / `blockIdx` / `threadIdx` are not portably available in the HIP host-include path). Reviewed By: goelayu Differential Revision: D102053818

…d/recv (meta-pytorch#2258) Summary: `P2pNvlTransportDevice::send` / `recv` signatures are `(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout())`. The two callers in `AllToAllvAmd.cuh` were passing `timeout` directly after `nbytes`, which positionally bound to `int active_blocks` and failed compilation: no viable conversion from 'Timeout' to 'int' Pass `/*active_blocks=*/0, /*max_signal_bytes=*/0` explicitly so `timeout` lands on the right parameter. Differential Revision: D102217579

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 24, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 24, 2026

meta-codesync Bot changed the title ~~AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv~~ AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258) Apr 24, 2026

zhiyongww force-pushed the export-D102217579 branch from 86d1286 to fd4b6a1 Compare April 24, 2026 20:18

zhiyongww added 3 commits April 24, 2026 13:27

zhiyongww force-pushed the export-D102217579 branch from fd4b6a1 to 775010e Compare April 24, 2026 20:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258)#2258

AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258)#2258
zhiyongww wants to merge 3 commits intometa-pytorch:mainfrom
zhiyongww:export-D102217579

zhiyongww commented Apr 24, 2026 •

edited by meta-codesync Bot

Loading

Uh oh!

meta-codesync Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhiyongww commented Apr 24, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meta-codesync Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhiyongww commented Apr 24, 2026 •

edited by meta-codesync Bot

Loading