
RTX 5090 / 595.58.03-p2p: vLLM TP=2 crashes only when custom-all-reduce + CUDA graphs are both enabled #21

@himekifee

Description


NVIDIA Open GPU Kernel Modules Version

595.58.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Talos OS v1.13.0-beta.1

Kernel Release

6.18.19-talos

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX 5090

Describe the bug

I’m seeing a reproducible crash with the P2P hack on RTX 5090 PCIe (PHB) that does not happen on stock NVIDIA 595.58.03.

Environment

  • GPUs: 2× RTX 5090 (from a 4×5090 node)
  • Topology: PHB
  • Driver base: 595.58.03
  • CUDA: 13.0
  • PyTorch: 2.11.0+cu130
  • NCCL: 2.28.9+cuda13.0
  • vLLM: 0.19.1rc1.dev118

Hack config:

OpenRmEnableUnsupportedGpus=1
DmaRemapPeerMmio=1
RegistryDwords="ForceP2P=17;RMForceP2PType=1;RMPcieP2PType=1;PeerMappingOverride=1;RMForceStaticBar1=1"

Situation

With the patched P2P driver, this vLLM command crashes during startup:

vllm serve /models/Qwen3.5-35B-A3B-NVFP4 \
  --host 0.0.0.0 \
  --port 8001 \
  --served-model-name qwen3.5-35b-a3b \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.88 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":6}' \
  --max-num-seqs 20 \
  --enable-log-requests \
  --language-model-only \
  --aggregate-engine-logging \
  --enable-sleep-mode \
  --api-server-count 1 \
  --no-enable-prefix-caching

The crash happens right after:

[kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=272
[gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=33 (largest=259), FULL=18 (largest=140)

Then:

Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'

The kernel log shows the following on both GPUs:

NVRM: Xid (PCI:0000:03:00): 31, pid=130010, name=python3, channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_4 faulted @ 0x7b4b_49707000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
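For triage, the relevant fields of an Xid 31 line can be pulled out mechanically. A minimal sketch (the regex and helper are mine, not from any NVIDIA tool):

```python
import re

# Hypothetical helper: extract fault details from an NVRM Xid 31 dmesg line.
XID31_RE = re.compile(
    r"Xid \(PCI:(?P<bdf>[0-9a-f:.]+)\): (?P<xid>\d+).*"
    r"faulted @ (?P<addr>0x[0-9a-f_]+)\. "
    r"Fault is of type (?P<fault>\S+) (?P<access>\S+)"
)

def parse_xid31(line: str) -> dict:
    m = XID31_RE.search(line)
    return m.groupdict() if m else {}

line = ("NVRM: Xid (PCI:0000:03:00): 31, pid=130010, name=python3, "
        "channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS "
        "GPC0 GPCCLIENT_T1_4 faulted @ 0x7b4b_49707000. "
        "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ")
print(parse_xid31(line))
```

FAULT_PDE on a VIRT_READ means the read hit a page-directory entry that was never populated, i.e. a mapping that was reported as valid but does not actually exist.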

Key configuration matrix

This is the key discriminator:

  • default run: FAIL
  • add --enforce-eager: PASS
  • remove --enforce-eager, add --disable-custom-all-reduce: PASS
  • remove --disable-custom-all-reduce: FAIL again
  • same workload on stock 595.58.03: PASS

So the failure requires both:

  1. CUDA graphs enabled
  2. vLLM custom all-reduce enabled

It is not a generic TP=2 failure.
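The matrix above can be encoded as a small truth table; a pure-Python check (my own encoding of the observed runs, not vLLM code) confirms the outcomes are exactly predicted by "graphs AND custom all-reduce":

```python
# Observed outcomes on the patched driver, keyed by
# (cuda_graphs_enabled, custom_all_reduce_enabled).
observed = {
    (True,  True):  "FAIL",  # default run
    (False, True):  "PASS",  # --enforce-eager
    (True,  False): "PASS",  # --disable-custom-all-reduce
}

# Hypothesis: the crash fires iff both features are on.
for (graphs, car), outcome in observed.items():
    predicted = "FAIL" if (graphs and car) else "PASS"
    assert predicted == outcome, (graphs, car)
print("matrix consistent: crash requires graphs AND custom all-reduce")
```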

Why I think this is in the hacked driver

On the patched driver, these all pass:

  • cudaDeviceCanAccessPeer
  • cudaDeviceEnablePeerAccess
  • p2pBandwidthLatencyTest
  • simpleIPC
  • vLLM can_actually_p2p
  • NCCL all_reduce completes

But vLLM still crashes only in the graph + custom-all-reduce path.

So this looks like:

basic P2P / basic IPC works, but the patched driver breaks the graph-captured peer/IPC registration path used by vLLM custom-all-reduce.

From vLLM source, the crash site custom_all_reduce.cuh:455 is in the graph-only metadata path around:

  • cuPointerGetAttribute(CU_POINTER_ATTRIBUTE_RANGE_START_ADDR, ...)
  • cudaIpcGetMemHandle(...)

This path is only exercised when graphs + custom all-reduce are both active.
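For context: cudaIpcGetMemHandle only accepts the base of an allocation, so the graph path first resolves the allocation base via CU_POINTER_ATTRIBUTE_RANGE_START_ADDR and then exchanges (handle, offset) pairs. The pointer arithmetic can be modeled in plain Python (integers standing in for device pointers; this is my sketch of the general scheme, not vLLM's actual code, and the addresses are made up):

```python
# Exporter side: CU_POINTER_ATTRIBUTE_RANGE_START_ADDR yields `base` for a
# pointer inside an allocation; the IPC handle is taken for `base`, and the
# offset travels alongside it.
def export(ptr: int, base: int) -> tuple[str, int]:
    assert base <= ptr, "pointer must lie inside the allocation"
    return (f"ipc-handle-for-{base:#x}", ptr - base)  # (handle, offset)

# Importer side: cudaIpcOpenMemHandle maps the peer allocation at some local
# address; adding the offset recovers the peer's buffer.
def resolve_peer_ptr(handle: str, offset: int, mapped_base: int) -> int:
    return mapped_base + offset

handle, off = export(ptr=0x7B4B_4970_7000, base=0x7B4B_4960_0000)
peer_ptr = resolve_peer_ptr(handle, off, mapped_base=0x7F00_0000_0000)
print(hex(off), hex(peer_ptr))
```

If the driver hands out a handle whose peer mapping is not actually backed, the arithmetic still "works" at registration time and the fault only appears on first real access, which matches the FAULT_PDE read above.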

Additional evidence

On the patched driver:

  • nvidia-smi topo -p2p r / w: OK

  • BAR1 per GPU: 32 GiB

  • simpleIPC: PASS

  • NCCL log shows:

    isAllCudaP2p=1
    isAllDirectP2p=0
    via SHM/direct/direct
    

    so NCCL is conservative and does not trust direct P2P fully.

Also, I built the patched source standalone on a stock host and compared built modules against stock 595.58.03:

  • only nvidia.ko differs
  • nvidia-uvm.ko, nvidia-drm.ko, nvidia-modeset.ko, nvidia-peermem.ko match stock

So the suspect is narrowed to core nvidia.ko changes, not UVM/DRM/etc.
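The module comparison amounts to hashing each built .ko against its stock counterpart; a sketch of how this was checked (directory layout is hypothetical):

```python
import hashlib
from pathlib import Path

def ko_digest(path: Path) -> str:
    # sha256 of the module file; any byte difference changes the digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def modules_match(patched_dir: Path, stock_dir: Path, names: list[str]) -> dict[str, bool]:
    # True means the patched module is byte-identical to stock.
    return {n: ko_digest(patched_dir / n) == ko_digest(stock_dir / n)
            for n in names}

# e.g. modules_match(Path("patched/"), Path("stock/"),
#                    ["nvidia.ko", "nvidia-uvm.ko", "nvidia-drm.ko"])
```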

Current suspect

The most likely culprit is the hacked repo's core BAR1 / peer-mapping path in nvidia.ko, specifically the patches that:

  • force BAR1/P2P capability reporting on RTX 5090 PCIe
  • override peer-mapping support checks
  • route non-datacenter GPUs into BAR1 P2P code paths
  • change graph-captured allocation export/import behavior
  • make cudaIpcGetMemHandle / cudaIpcOpenMemHandle appear valid for buffers that later fault under real access

In other words, this does not look like “P2P completely broken”; it looks more like:

the hacked driver enables a peer/BAR1 path that survives small tests, but becomes invalid when vLLM registers graph-captured buffers for custom all-reduce.

To Reproduce

Run the vllm serve command above with the patched P2P driver.

Bug Incidence

Always

nvidia-bug-report.log.gz

The attached file is an empty placeholder; generating a real nvidia-bug-report is not easy on Talos. I will try to produce one if it is really needed.

More Info

No response
