NVIDIA Open GPU Kernel Modules Version
590.44.01
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 24.04.3 LTS
Kernel Release
6.17.0-20
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
RTX 5090 (32 GB)
Describe the bug
Hardware Environment:
Setup A (Working): 2 Nodes with NVIDIA Pro 5000 (16 GPUs total), interconnected via dual-port 100G RoCE.
Setup B (Issue): 2 Nodes with RTX 5090 (16 GPUs total), same network setup.
Description:
I am seeing a nearly 2x difference in all_reduce throughput between the two setups.
On the NVIDIA Pro 5000 setup, GDR (GPUDirect RDMA) is enabled successfully, and all_reduce reaches approximately 28 GB/s.
On the RTX 5090 setup, GDR cannot be enabled, and all_reduce drops to around 15 GB/s.
Question:
Is there a known configuration or workaround to enable GDR on RTX 5090s? Specifically, is there a way to modify configuration files or environment variables to allow the system to recognize RTX 5090 as supporting GDR?
Any guidance would be appreciated. Thanks!
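One way to confirm which path NCCL actually takes is to run with debug logging and count the network channels that carry the GDRDMA suffix. The sketch below is a minimal helper under stated assumptions: `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are documented NCCL variables, but the exact log line shape ("via NET/IB/0/GDRDMA" versus "via NET/IB/0") is assumed from typical NCCL output and may differ between versions. The names `DIAG_ENV` and `summarize_gdr` are hypothetical, and the sample log text is made up for illustration.

```python
import re

# Diagnostic environment for the benchmark processes.
# NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL variables.
DIAG_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET",
}

# Assumed log shape: "... via NET/IB/0/GDRDMA" when GDR is active on a
# channel, "... via NET/IB/0" when data is staged through host memory.
NET_PATH = re.compile(r"via NET/(\S+)")


def summarize_gdr(log_text):
    """Count network channels with and without the GDRDMA suffix."""
    with_gdr = 0
    without_gdr = 0
    for line in log_text.splitlines():
        match = NET_PATH.search(line)
        if match is None:
            continue  # not a network channel line (e.g. P2P or SHM)
        if "GDRDMA" in match.group(1):
            with_gdr += 1
        else:
            without_gdr += 1
    return {"gdr": with_gdr, "no_gdr": without_gdr}


if __name__ == "__main__":
    # Illustrative log text only, not taken from the attached report.
    sample = (
        "NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA\n"
        "NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/1\n"
        "NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC\n"
    )
    print(summarize_gdr(sample))
```

If every network channel on the RTX 5090 nodes lands in `no_gdr` while the Pro 5000 nodes show `gdr`, that would be consistent with the behavior described above.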
To Reproduce
GPU: 8x RTX 5090 (32GB VRAM) per node.
CPU & Memory: Dual AMD EPYC 9654 processors with 24x 32GB DDR5 modules (configured in a cascade topology).
Network: Each node is equipped with a single dual-port 100GbE ConnectX-6 NIC.
Interconnect: The two nodes are directly connected via two 100G cables (direct link).
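For context on the numbers, the arithmetic below is a small sketch relating the reported throughput to the link hardware. It assumes both 100G ports carry traffic, and it uses the nccl-tests convention that all_reduce bus bandwidth equals algorithm bandwidth times 2(n-1)/n. The report does not state whether the 28 GB/s and 15 GB/s figures are algorithm or bus bandwidth, so the function names and the way the figures are plugged in are illustrative only.

```python
def line_rate_gbytes(ports, gbit_per_port):
    """Raw aggregate line rate in GB/s (no protocol overhead)."""
    return ports * gbit_per_port / 8.0


def allreduce_busbw(algbw, n_ranks):
    """nccl-tests convention: busbw = algbw * 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Dual-port 100GbE NIC per node, both ports assumed active.
    print("raw line rate per node (GB/s):", line_rate_gbytes(2, 100))
    # 16 ranks total (8 GPUs per node, 2 nodes).
    print("busbw/algbw factor for 16 ranks:", allreduce_busbw(1.0, 16))
```

Stating which of the two bandwidth definitions the measurements use makes the comparison against the raw link rate easier to interpret.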
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response