AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258)#2258
Open
zhiyongww wants to merge 3 commits intometa-pytorch:mainfrom
Open
AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258)#2258zhiyongww wants to merge 3 commits intometa-pytorch:mainfrom
zhiyongww wants to merge 3 commits intometa-pytorch:mainfrom
Conversation
Contributor
|
@zhiyongww has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102217579. |
86d1286 to
fd4b6a1
Compare
zhiyongww
added a commit
to zhiyongww/torchcomms
that referenced
this pull request
Apr 24, 2026
…d/recv (meta-pytorch#2258) Summary: `P2pNvlTransportDevice::send` / `recv` signatures are `(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout())`. The two callers in `AllToAllvAmd.cuh` were passing `timeout` directly after `nbytes`, which positionally bound to `int active_blocks` and failed compilation: no viable conversion from 'Timeout' to 'int' Pass `/*active_blocks=*/0, /*max_signal_bytes=*/0` explicitly so `timeout` lands on the right parameter. Differential Revision: D102217579
Summary: Foundation constant for the multi-NIC IBGDA transport. Defines `kMaxNicsPerGpu = 2`, the compile-time cap on the number of RDMA NICs per GPU. One "NIC" corresponds to one ibverbs path: a mlx5_X device + ibv_pd + QPs. Hardware mapping (per comms/tcp_devmem/HW.md): - H100 (Grand Teton): 1 NIC per GPU → numNics=1 - GB200 (Catalina): 2 NICs per GPU (2 ConnectX-7 chips) → numNics=2 - GB300 (Clemente): 2 NICs per GPU (1 dual-port ConnectX-8 exposed as 2 mlx5_X) → numNics=2 `kMaxNicsPerGpu` caps the static array sizes used by the IBGDA backend in subsequent commits: - NetworkLKeys / NetworkRKeys (per-NIC key array wrappers) - IbgdaLocalBuffer.lkeys / IbgdaRemoteBuffer.rkeys - LocalBufferRegistration.lkeys / RemoteBufferRegistration.rkeys - P2pIbgdaTransportDevice.sinkLkeys_ - IbgdaTransportExchInfoAll.nicInfo - MultipeerIbgdaTransport per-NIC IB resources (ibvCtxs_, ibvPds_, ...) - CachedMr.mrs Lives in `comms/pipes/rdma/NicDiscovery.h` next to the existing topology-aware NIC selection API rather than in a new IBGDA-specific header — per code review feedback (D101700973), the cap is a hardware property of "how many NICs a GPU can have," not specific to any single transport implementation. Subsequent commits in this stack use the constant via `#include "comms/pipes/rdma/NicDiscovery.h"`. The C-side mirror for the NCCLx ABI is `NCCLX_MAX_IBGDA_NICS` in `nccl.h` (introduced later in the stack), kept in lockstep with this constant via a static_assert at the bridge layer in `PipesDeviceBackend.cpp`. Increase the cap only if a future platform supports > 2 NICs per GPU; update both the C++ constant and the C define in lockstep, and audit the IBGDA wire format (size scales with `kMaxNicsPerGpu × kMaxQpsPerPeer × kMaxRanksForAllGather`). No behavior change: header-only constant addition. Wired into IbgdaBuffer / P2pIbgdaTransportDevice / DeviceWindow / RegisteredBuffer in subsequent commits. Reviewed By: goelayu Differential Revision: D101700973
meta-pytorch#2247) Summary: Builds the per-NIC key infrastructure on top of the `kMaxNicsPerGpu` constant from the previous commit: 1. **`NetworkLKeys` / `NetworkRKeys`** — named, fixed-size aggregate wrappers around `NetworkLKey[kMaxNicsPerGpu]` / `NetworkRKey[kMaxNicsPerGpu]`. Standard-layout aggregates so they remain usable in unions for the legacy `lkey` / `rkey` aliases. Brace-initializable: `NetworkLKeys{}` (all zero), `NetworkLKeys{lkey0, lkey1}` (per-NIC), `NetworkLKeys{lkey}` (single-NIC, brace-elision). Indexed via `operator[]`. 2. **`IbgdaLocalBuffer.lkeys` / `IbgdaRemoteBuffer.rkeys`** — switched from a single `NetworkLKey lkey` / `NetworkRKey rkey` field to a `NetworkLKeys` / `NetworkRKeys` per-NIC array. Anonymous union preserves the legacy `.lkey` / `.rkey` member access for read-side compatibility (still resolves to `lkeys[0]` / `rkeys[0]`). 3. **Single multi-key constructor only** — no deprecated single-key ctor. At `numNics > 1`, a single-key value would silently leave NIC[1..N-1] keys zero and trap on first multi-NIC use; callers always construct with the per-NIC `NetworkLKeys` / `NetworkRKeys` aggregate. Net behavior: at `numNics = 1`, all sites read `lkeys[0]` / `rkeys[0]` (or via the legacy `.lkey` / `.rkey` alias) — bit-identical to pre-multi-NIC. The infrastructure is wired into the IBGDA backend in subsequent commits (P2p ctor, DeviceWindow, HostWindow). **AMD/HIP compile fixes:** the legacy `.lkey` / `.rkey` union shim does not survive HIP compilation (HIP includes the header in contexts where the union members are not visible), so all AMD device sites are migrated to the explicit `lkey_per_device[0]` / `rkey_per_device[0]` accessor and AMD test/benchmark constructors are wrapped as `NetworkLKeys{NetworkLKey(...)}` / `NetworkRKeys{NetworkRKey(...)}`. AMD does not support multi-NIC yet, so reads are pinned to slot `[0]`. The `IBGDA_KEYS_OOB_TRAP` macro is also split: CUDA keeps the full printf + `__trap()` diagnostic; HIP becomes a no-op (the bounds check is unreachable at `size = 1` and `__trap()` / `blockIdx` / `threadIdx` are not portably available in the HIP host-include path). Reviewed By: goelayu Differential Revision: D102053818
…d/recv (meta-pytorch#2258) Summary: `P2pNvlTransportDevice::send` / `recv` signatures are `(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout())`. The two callers in `AllToAllvAmd.cuh` were passing `timeout` directly after `nbytes`, which positionally bound to `int active_blocks` and failed compilation: no viable conversion from 'Timeout' to 'int' Pass `/*active_blocks=*/0, /*max_signal_bytes=*/0` explicitly so `timeout` lands on the right parameter. Differential Revision: D102217579
fd4b6a1 to
775010e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
P2pNvlTransportDevice::send/recvsignatures are(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout()). The two callers inAllToAllvAmd.cuhwere passingtimeoutdirectly afternbytes, which positionally bound toint active_blocksand failed compilation:no viable conversion from 'Timeout' to 'int'
Pass
/*active_blocks=*/0, /*max_signal_bytes=*/0explicitly sotimeoutlands on the right parameter.Differential Revision: D102217579