Skip to content

AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258)#2258

Open
zhiyongww wants to merge 3 commits intometa-pytorch:mainfrom
zhiyongww:export-D102217579
Open

AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258)#2258
zhiyongww wants to merge 3 commits intometa-pytorch:mainfrom
zhiyongww:export-D102217579

Conversation

@zhiyongww
Copy link
Copy Markdown
Contributor

@zhiyongww zhiyongww commented Apr 24, 2026

Summary:

P2pNvlTransportDevice::send / recv signatures are (group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout()). The two callers in AllToAllvAmd.cuh were passing timeout directly after nbytes, which positionally bound to int active_blocks and failed compilation:

no viable conversion from 'Timeout' to 'int'

Pass /*active_blocks=*/0, /*max_signal_bytes=*/0 explicitly so timeout lands on the right parameter.

Differential Revision: D102217579

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 24, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented Apr 24, 2026

@zhiyongww has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102217579.

@meta-codesync meta-codesync Bot changed the title AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv AllToAllvAmd: pass missing active_blocks/max_signal_bytes args to send/recv (#2258) Apr 24, 2026
zhiyongww added a commit to zhiyongww/torchcomms that referenced this pull request Apr 24, 2026
…d/recv (meta-pytorch#2258)

Summary:

`P2pNvlTransportDevice::send` / `recv` signatures are `(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout())`. The two callers in `AllToAllvAmd.cuh` were passing `timeout` directly after `nbytes`, which positionally bound to `int active_blocks` and failed compilation:

  no viable conversion from 'Timeout' to 'int'

Pass `/*active_blocks=*/0, /*max_signal_bytes=*/0` explicitly so `timeout` lands on the right parameter.

Differential Revision: D102217579
Summary:

Foundation constant for the multi-NIC IBGDA transport. Defines `kMaxNicsPerGpu = 2`, the compile-time cap on the number of RDMA NICs per GPU. One "NIC" corresponds to one ibverbs path: a mlx5_X device + ibv_pd + QPs.

Hardware mapping (per comms/tcp_devmem/HW.md):
- H100 (Grand Teton): 1 NIC per GPU → numNics=1
- GB200 (Catalina):   2 NICs per GPU (2 ConnectX-7 chips) → numNics=2
- GB300 (Clemente):   2 NICs per GPU (1 dual-port ConnectX-8 exposed as 2 mlx5_X) → numNics=2

`kMaxNicsPerGpu` caps the static array sizes used by the IBGDA backend in subsequent commits:
- NetworkLKeys / NetworkRKeys (per-NIC key array wrappers)
- IbgdaLocalBuffer.lkeys / IbgdaRemoteBuffer.rkeys
- LocalBufferRegistration.lkeys / RemoteBufferRegistration.rkeys
- P2pIbgdaTransportDevice.sinkLkeys_
- IbgdaTransportExchInfoAll.nicInfo
- MultipeerIbgdaTransport per-NIC IB resources (ibvCtxs_, ibvPds_, ...)
- CachedMr.mrs

Lives in `comms/pipes/rdma/NicDiscovery.h` next to the existing topology-aware NIC selection API rather than in a new IBGDA-specific header — per code review feedback (D101700973), the cap is a hardware property of "how many NICs a GPU can have," not specific to any single transport implementation. Subsequent commits in this stack use the constant via `#include "comms/pipes/rdma/NicDiscovery.h"`.

The C-side mirror for the NCCLx ABI is `NCCLX_MAX_IBGDA_NICS` in `nccl.h` (introduced later in the stack), kept in lockstep with this constant via a static_assert at the bridge layer in `PipesDeviceBackend.cpp`. Increase the cap only if a future platform supports > 2 NICs per GPU; update both the C++ constant and the C define in lockstep, and audit the IBGDA wire format (size scales with `kMaxNicsPerGpu × kMaxQpsPerPeer × kMaxRanksForAllGather`).

No behavior change: header-only constant addition. Wired into IbgdaBuffer / P2pIbgdaTransportDevice / DeviceWindow / RegisteredBuffer in subsequent commits.

Reviewed By: goelayu

Differential Revision: D101700973
meta-pytorch#2247)

Summary:

Builds the per-NIC key infrastructure on top of the `kMaxNicsPerGpu` constant from the previous commit:

1. **`NetworkLKeys` / `NetworkRKeys`** — named, fixed-size aggregate wrappers around `NetworkLKey[kMaxNicsPerGpu]` / `NetworkRKey[kMaxNicsPerGpu]`. Standard-layout aggregates so they remain usable in unions for the legacy `lkey` / `rkey` aliases. Brace-initializable: `NetworkLKeys{}` (all zero), `NetworkLKeys{lkey0, lkey1}` (per-NIC), `NetworkLKeys{lkey}` (single-NIC, brace-elision). Indexed via `operator[]`.

2. **`IbgdaLocalBuffer.lkeys` / `IbgdaRemoteBuffer.rkeys`** — switched from a single `NetworkLKey lkey` / `NetworkRKey rkey` field to a `NetworkLKeys` / `NetworkRKeys` per-NIC array. Anonymous union preserves the legacy `.lkey` / `.rkey` member access for read-side compatibility (still resolves to `lkeys[0]` / `rkeys[0]`).

3. **Single multi-key constructor only** — no deprecated single-key ctor. At `numNics > 1`, a single-key value would silently leave NIC[1..N-1] keys zero and trap on first multi-NIC use; callers always construct with the per-NIC `NetworkLKeys` / `NetworkRKeys` aggregate.

Net behavior: at `numNics = 1`, all sites read `lkeys[0]` / `rkeys[0]` (or via the legacy `.lkey` / `.rkey` alias) — bit-identical to pre-multi-NIC. The infrastructure is wired into the IBGDA backend in subsequent commits (P2p ctor, DeviceWindow, HostWindow).

**AMD/HIP compile fixes:** the legacy `.lkey` / `.rkey` union shim does not survive HIP compilation (HIP includes the header in contexts where the union members are not visible), so all AMD device sites are migrated to the explicit `lkey_per_device[0]` / `rkey_per_device[0]` accessor and AMD test/benchmark constructors are wrapped as `NetworkLKeys{NetworkLKey(...)}` / `NetworkRKeys{NetworkRKey(...)}`. AMD does not support multi-NIC yet, so reads are pinned to slot `[0]`. The `IBGDA_KEYS_OOB_TRAP` macro is also split: CUDA keeps the full printf + `__trap()` diagnostic; HIP becomes a no-op (the bounds check is unreachable at `size = 1` and `__trap()` / `blockIdx` / `threadIdx` are not portably available in the HIP host-include path).

Reviewed By: goelayu

Differential Revision: D102053818
…d/recv (meta-pytorch#2258)

Summary:

`P2pNvlTransportDevice::send` / `recv` signatures are `(group, ptr, nbytes, int active_blocks=0, std::size_t max_signal_bytes=0, const Timeout& = Timeout())`. The two callers in `AllToAllvAmd.cuh` were passing `timeout` directly after `nbytes`, which positionally bound to `int active_blocks` and failed compilation:

  no viable conversion from 'Timeout' to 'int'

Pass `/*active_blocks=*/0, /*max_signal_bytes=*/0` explicitly so `timeout` lands on the right parameter.

Differential Revision: D102217579
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant