
RTX 5090 / 595.58.03-p2p: vLLM TP=2 crashes only when custom-all-reduce + CUDA graphs are both enabled #21

@himekifee

Description


NVIDIA Open GPU Kernel Modules Version

595.58.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Talos OS v1.13.0-beta.1

Kernel Release

6.18.19-talos

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX 5090

Describe the bug

I’m seeing a reproducible crash with the P2P hack on RTX 5090 PCIe (PHB) that does not happen on stock NVIDIA 595.58.03.

Environment

  • GPUs: 2× RTX 5090 (from a 4×5090 node)
  • Topology: PHB
  • Driver base: 595.58.03
  • CUDA: 13.0
  • PyTorch: 2.11.0+cu130
  • NCCL: 2.28.9+cuda13.0
  • vLLM: 0.19.1rc1.dev118

Hack config:

OpenRmEnableUnsupportedGpus=1
DmaRemapPeerMmio=1
RegistryDwords="ForceP2P=17;RMForceP2PType=1;RMPcieP2PType=1;PeerMappingOverride=1;RMForceStaticBar1=1"

Situation

With the patched P2P driver, this vLLM command crashes during startup:

vllm serve /models/Qwen3.5-35B-A3B-NVFP4 \
  --host 0.0.0.0 \
  --port 8001 \
  --served-model-name qwen3.5-35b-a3b \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.88 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":6}' \
  --max-num-seqs 20 \
  --enable-log-requests \
  --language-model-only \
  --aggregate-engine-logging \
  --enable-sleep-mode \
  --api-server-count 1 \
  --no-enable-prefix-caching

The crash happens right after:

[kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=272
[gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=33 (largest=259), FULL=18 (largest=140)

Then:

Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'

The kernel log shows the following on both GPUs:

NVRM: Xid (PCI:0000:03:00): 31, pid=130010, name=python3, channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_4 faulted @ 0x7b4b_49707000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
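For triage, the relevant fields of an Xid 31 line can be pulled out mechanically. A minimal sketch (the regex and helper are mine, not from any NVIDIA tool):

```python
import re

# Hypothetical helper: extract fault details from an NVRM Xid 31 dmesg line.
XID31_RE = re.compile(
    r"Xid \(PCI:(?P<bdf>[0-9a-f:.]+)\): (?P<xid>\d+).*"
    r"faulted @ (?P<addr>0x[0-9a-f_]+)\. "
    r"Fault is of type (?P<fault>\S+) (?P<access>\S+)"
)

def parse_xid31(line: str) -> dict:
    m = XID31_RE.search(line)
    return m.groupdict() if m else {}

line = ("NVRM: Xid (PCI:0000:03:00): 31, pid=130010, name=python3, "
        "channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS "
        "GPC0 GPCCLIENT_T1_4 faulted @ 0x7b4b_49707000. "
        "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ")
print(parse_xid31(line))
```

FAULT_PDE on a VIRT_READ means the read hit a page-directory entry that was never populated, i.e. a mapping that was reported as valid but does not actually exist.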

Key configuration matrix

This is the key discriminator:

  • default run: FAIL
  • add --enforce-eager: PASS
  • remove --enforce-eager, add --disable-custom-all-reduce: PASS
  • remove --disable-custom-all-reduce: FAIL again
  • same workload on stock 595.58.03: PASS

So the failure requires both:

  1. CUDA graphs enabled
  2. vLLM custom all-reduce enabled

It is not a generic TP=2 failure.
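The matrix above can be encoded as a small truth table; a pure-Python check (my own encoding of the observed runs, not vLLM code) confirms the outcomes are exactly predicted by "graphs AND custom all-reduce":

```python
# Observed outcomes on the patched driver, keyed by
# (cuda_graphs_enabled, custom_all_reduce_enabled).
observed = {
    (True,  True):  "FAIL",  # default run
    (False, True):  "PASS",  # --enforce-eager
    (True,  False): "PASS",  # --disable-custom-all-reduce
}

# Hypothesis: the crash fires iff both features are on.
for (graphs, car), outcome in observed.items():
    predicted = "FAIL" if (graphs and car) else "PASS"
    assert predicted == outcome, (graphs, car)
print("matrix consistent: crash requires graphs AND custom all-reduce")
```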

Why I think this is in the hacked driver

On the patched driver, these all pass:

  • cudaDeviceCanAccessPeer
  • cudaDeviceEnablePeerAccess
  • p2pBandwidthLatencyTest
  • simpleIPC
  • vLLM can_actually_p2p
  • NCCL all_reduce completes

But vLLM still crashes only in the graph + custom-all-reduce path.

So this looks like:

basic P2P / basic IPC works, but the patched driver breaks the graph-captured peer/IPC registration path used by vLLM custom-all-reduce.

From vLLM source, the crash site custom_all_reduce.cuh:455 is in the graph-only metadata path around:

  • cuPointerGetAttribute(CU_POINTER_ATTRIBUTE_RANGE_START_ADDR, ...)
  • cudaIpcGetMemHandle(...)

This path is only exercised when graphs + custom all-reduce are both active.
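For context: cudaIpcGetMemHandle only accepts the base of an allocation, so the graph path first resolves the allocation base via CU_POINTER_ATTRIBUTE_RANGE_START_ADDR and then exchanges (handle, offset) pairs. The pointer arithmetic can be modeled in plain Python (integers standing in for device pointers; this is my sketch of the general scheme, not vLLM's actual code, and the addresses are made up):

```python
# Exporter side: CU_POINTER_ATTRIBUTE_RANGE_START_ADDR yields `base` for a
# pointer inside an allocation; the IPC handle is taken for `base`, and the
# offset travels alongside it.
def export(ptr: int, base: int) -> tuple[str, int]:
    assert base <= ptr, "pointer must lie inside the allocation"
    return (f"ipc-handle-for-{base:#x}", ptr - base)  # (handle, offset)

# Importer side: cudaIpcOpenMemHandle maps the peer allocation at some local
# address; adding the offset recovers the peer's buffer.
def resolve_peer_ptr(handle: str, offset: int, mapped_base: int) -> int:
    return mapped_base + offset

handle, off = export(ptr=0x7B4B_4970_7000, base=0x7B4B_4960_0000)
peer_ptr = resolve_peer_ptr(handle, off, mapped_base=0x7F00_0000_0000)
print(hex(off), hex(peer_ptr))
```

If the driver hands out a handle whose peer mapping is not actually backed, the arithmetic still "works" at registration time and the fault only appears on first real access, which matches the FAULT_PDE read above.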

Additional evidence

On the patched driver:

  • nvidia-smi topo -p2p r / w: OK

  • BAR1 per GPU: 32 GiB

  • simpleIPC: PASS

  • NCCL log shows:

    isAllCudaP2p=1
    isAllDirectP2p=0
    via SHM/direct/direct
    

    so NCCL is conservative and does not trust direct P2P fully.

Also, I built the patched source standalone on a stock host and compared built modules against stock 595.58.03:

  • only nvidia.ko differs
  • nvidia-uvm.ko, nvidia-drm.ko, nvidia-modeset.ko, nvidia-peermem.ko match stock

So the suspect is narrowed to core nvidia.ko changes, not UVM/DRM/etc.
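The module comparison amounts to hashing each built .ko against its stock counterpart; a sketch of how this was checked (directory layout is hypothetical):

```python
import hashlib
from pathlib import Path

def ko_digest(path: Path) -> str:
    # sha256 of the module file; any byte difference changes the digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def modules_match(patched_dir: Path, stock_dir: Path, names: list[str]) -> dict[str, bool]:
    # True means the patched module is byte-identical to stock.
    return {n: ko_digest(patched_dir / n) == ko_digest(stock_dir / n)
            for n in names}

# e.g. modules_match(Path("patched/"), Path("stock/"),
#                    ["nvidia.ko", "nvidia-uvm.ko", "nvidia-drm.ko"])
```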

Current suspect

The most likely culprit is the hacked repo's core BAR1 / peer-mapping path in nvidia.ko, specifically the patches that:

  • force BAR1/P2P capability reporting on RTX 5090 PCIe
  • override peer-mapping support checks
  • route non-datacenter GPUs into BAR1 P2P code paths
  • change graph-captured allocation export/import behavior
  • make cudaIpcGetMemHandle / cudaIpcOpenMemHandle appear valid for buffers that later fault under real access

In other words, this does not look like “P2P completely broken”; it looks more like:

the hacked driver enables a peer/BAR1 path that survives small tests, but becomes invalid when vLLM registers graph-captured buffers for custom all-reduce.

To Reproduce

Run the vllm serve command above with the patched P2P driver.

Bug Incidence

Always

nvidia-bug-report.log.gz

The attached file is an empty placeholder; generating a real nvidia-bug-report is not easy on Talos. I will try to produce one if it is really needed.

More Info

No response
