NVIDIA Open GPU Kernel Modules Version
595.58.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Talos OS v1.13.0-beta.1
Kernel Release
6.18.19-talos
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
RTX 5090
Describe the bug
I’m seeing a reproducible crash with the P2P hack on RTX 5090 PCIe (PHB) that does not happen on stock NVIDIA 595.58.03.
Environment
- GPUs: 2× RTX 5090 (from a 4×5090 node)
- Topology: PHB
- Driver base: 595.58.03
- CUDA: 13.0
- PyTorch: 2.11.0+cu130
- NCCL: 2.28.9+cuda13.0
- vLLM: 0.19.1rc1.dev118
Hack config:
OpenRmEnableUnsupportedGpus=1
DmaRemapPeerMmio=1
RegistryDwords="ForceP2P=17;RMForceP2PType=1;RMPcieP2PType=1;PeerMappingOverride=1;RMForceStaticBar1=1"
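For context, these options would typically be applied through a modprobe configuration file. The `NVreg_`-prefixed parameter spellings below are my assumption for how the hack config maps onto the open driver's module parameters; this is a sketch, not verified against this exact build:

```
# /etc/modprobe.d/nvidia-p2p.conf -- assumed NVreg_* parameter names
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1 \
    NVreg_DmaRemapPeerMmio=1 \
    NVreg_RegistryDwords="ForceP2P=17;RMForceP2PType=1;RMPcieP2PType=1;PeerMappingOverride=1;RMForceStaticBar1=1"
```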
Situation
With the patched P2P driver, this vLLM command crashes during startup:
vllm serve /models/Qwen3.5-35B-A3B-NVFP4 \
--host 0.0.0.0 \
--port 8001 \
--served-model-name qwen3.5-35b-a3b \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.88 \
--speculative-config '{"method":"mtp","num_speculative_tokens":6}' \
--max-num-seqs 20 \
--enable-log-requests \
--language-model-only \
--aggregate-engine-logging \
--enable-sleep-mode \
--api-server-count 1 \
--no-enable-prefix-caching
The crash happens right after:
[kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=272
[gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=33 (largest=259), FULL=18 (largest=140)
Then:
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
And the kernel log shows, on both GPUs:
NVRM: Xid (PCI:0000:03:00): 31, pid=130010, name=python3, channel 0x00000002, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_4 faulted @ 0x7b4b_49707000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Xid 31 with FAULT_PDE on a virtual read means the GPU's page-table walk found no valid page directory entry for that address, i.e. the virtual address was never actually backed by a mapping.
Very important matrix
This is the key discriminator:
- default run: FAIL
- add --enforce-eager: PASS
- remove --enforce-eager, add --disable-custom-all-reduce: PASS
- remove --disable-custom-all-reduce: FAIL again
- same workload on stock 595.58.03: PASS
So the failure requires both:
- CUDA graphs enabled
- vLLM custom all-reduce enabled
It is not a generic TP=2 failure.
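The matrix above pins the failure to the conjunction of the two features. A trivial pure-Python model of the observed runs makes that explicit (bookkeeping only, no CUDA involved):

```python
# Model: on the patched driver, a run fails iff CUDA graphs AND
# vLLM's custom all-reduce are BOTH active.
def predicted_fail(cuda_graphs: bool, custom_all_reduce: bool,
                   patched_driver: bool) -> bool:
    return patched_driver and cuda_graphs and custom_all_reduce

# Observed runs: (cuda_graphs, custom_all_reduce, patched_driver) -> outcome
observed = {
    (True,  True,  True):  "FAIL",  # default run on patched driver
    (False, True,  True):  "PASS",  # --enforce-eager
    (True,  False, True):  "PASS",  # --disable-custom-all-reduce
    (True,  True,  False): "PASS",  # stock 595.58.03
}

for combo, outcome in observed.items():
    assert predicted_fail(*combo) == (outcome == "FAIL")
print("model matches all observed runs")
```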
Why I think this is in the hacked driver
On the patched driver, these all pass:
- cudaDeviceCanAccessPeer
- cudaDeviceEnablePeerAccess
- p2pBandwidthLatencyTest
- simpleIPC
- vLLM can_actually_p2p
- NCCL all_reduce completes
But vLLM still crashes only in the graph + custom-all-reduce path.
So this looks like:
basic P2P / basic IPC works, but the patched driver breaks the graph-captured peer/IPC registration path used by vLLM custom-all-reduce.
From vLLM source, the crash site custom_all_reduce.cuh:455 is in the graph-only metadata path around:
cuPointerGetAttribute(CU_POINTER_ATTRIBUTE_RANGE_START_ADDR, ...)
cudaIpcGetMemHandle(...)
This path is only exercised when graphs + custom all-reduce are both active.
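For intuition, here is a simplified pure-Python model of that exchange (not vLLM's actual code): each rank looks up its allocation base via the RANGE_START_ADDR query, exports the allocation as an IPC handle, and ships (handle, ptr - base) to peers; the peer reconstructs the pointer as its own mapped base plus the offset. Nothing in the exchange itself dereferences memory, which is why a bogus peer mapping only faults later, on first real access inside the replayed graph:

```python
# Simplified model of the (handle, offset) exchange for IPC'd buffers.
# All addresses are fake integers; this only illustrates the arithmetic.

def export_buffer(ptr: int, range_start: int):
    """Local side: stand-in for cudaIpcGetMemHandle + the
    CU_POINTER_ATTRIBUTE_RANGE_START_ADDR lookup."""
    handle = ("ipc-handle-for", range_start)  # opaque stand-in for the handle
    offset = ptr - range_start                # offset survives the VA change
    return handle, offset

def import_buffer(handle, offset: int, peer_mapped_base: int) -> int:
    """Peer side: opening the handle yields the peer's own base VA;
    the usable pointer is base + offset."""
    return peer_mapped_base + offset

# Rank 0 exports a tensor 0x420 bytes into its allocation.
handle, off = export_buffer(ptr=0x7000_0420, range_start=0x7000_0000)
# Rank 1 opened the handle and got its own base VA for the same allocation.
peer_ptr = import_buffer(handle, off, peer_mapped_base=0x9000_0000)
assert peer_ptr == 0x9000_0420
# Nothing above touches the memory: if the driver's peer/BAR1 mapping
# behind peer_mapped_base is invalid, the fault appears only on the
# first real load -- here, inside the captured CUDA graph.
```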
Additional evidence
On the patched driver:
- nvidia-smi topo -p2p r/w: OK
- BAR1 per GPU: 32 GiB
- simpleIPC: PASS
- the NCCL log indicates NCCL stays conservative and does not fully trust direct P2P
Also, I built the patched source standalone on a stock host and compared built modules against stock 595.58.03:
- only nvidia.ko differs
- nvidia-uvm.ko, nvidia-drm.ko, nvidia-modeset.ko, nvidia-peermem.ko match stock
So the suspect is narrowed to core nvidia.ko changes, not UVM/DRM/etc.
Current suspect
The most likely issue is in the hacked repo’s core BAR1 / peer-mapping path in nvidia.ko, especially anything that:
- forces BAR1 P2P capability on RTX 5090 PCIe
- overrides peer mapping checks
- changes graph-captured allocation export/import behavior
- makes cudaIpcGetMemHandle / cudaIpcOpenMemHandle appear valid for buffers that later fault under real access
The highest-suspicion areas are the nvidia.ko patches that:
- force BAR1/P2P capability reporting
- override peer mapping support
- route non-datacenter GPUs into BAR1 P2P code paths
In other words, this does not look like “P2P completely broken”; it looks more like:
the hacked driver enables a peer/BAR1 path that survives small tests, but becomes invalid when vLLM registers graph-captured buffers for custom all-reduce.
To Reproduce
Run the vLLM command above with the patched P2P driver.
Bug Incidence
Always
nvidia-bug-report.log.gz
The attached .gz is an empty placeholder; capturing nvidia-bug-report.log.gz is not straightforward on Talos OS. I will try to obtain a real one if it is needed.
More Info
No response