
Is there a GDR solution/workaround for RTX 5090? #20

@xzwgit

NVIDIA Open GPU Kernel Modules Version

590.44.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.3 LTS

Kernel Release

6.17.0-20

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX 5090 32GB

Describe the bug

Hardware Environment:
Setup A (working): 2 nodes with NVIDIA Pro 5000 GPUs (16 GPUs total), interconnected via dual-port 100G RoCE.
Setup B (issue): 2 nodes with RTX 5090 GPUs (16 GPUs total), same network setup.

Description:
I am seeing a significant performance difference between the two setups when running all_reduce.
On the NVIDIA Pro 5000 setup, GPUDirect RDMA (GDR) is enabled successfully, and all_reduce reaches approximately 28 GB/s.
On the RTX 5090 setup, GDR cannot be enabled, and performance drops to around 15 GB/s.

Question:
Is there a known configuration or workaround to enable GDR on the RTX 5090? Specifically, can configuration files or environment variables be changed so that the system treats the RTX 5090 as GDR-capable?
Any guidance would be appreciated. Thanks!
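
For anyone trying to reproduce the comparison, here is a minimal diagnostic sketch in Python. It is not a workaround and it cannot force GDR on; it only helps confirm what each setup is doing. It assumes the peer-memory kernel module is named `nvidia_peermem` (or the older `nv_peer_mem`), that the job uses NCCL, and that NCCL's INFO log marks GDR transports with the string `GDRDMA`. The helper names are hypothetical.

```python
import os
from pathlib import Path

# Kernel modules that provide GPUDirect RDMA peer-memory support.
# Assumption: nvidia_peermem (bundled with the driver) or the older nv_peer_mem.
PEERMEM_MODULES = {"nvidia_peermem", "nv_peer_mem"}


def peermem_loaded(proc_modules_text):
    """Return True if a peer-memory module is listed in /proc/modules text."""
    names = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return bool(names & PEERMEM_MODULES)


def nccl_debug_env(base=None):
    """Copy an environment and turn on NCCL's init/network logging."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
    return env


def log_reports_gdr(nccl_log_text):
    """Return True if the NCCL log mentions a GDR transport.

    Assumption: NCCL tags such channels with 'GDRDMA' in its INFO output.
    """
    return "GDRDMA" in nccl_log_text


if __name__ == "__main__":
    modules = Path("/proc/modules")
    text = modules.read_text() if modules.exists() else ""
    print("peer-memory module loaded:", peermem_loaded(text))
```

Running the same all_reduce job on both setups with the environment from `nccl_debug_env()` and passing each log to `log_reports_gdr()` should show whether the RTX 5090 nodes fall back to staging through host memory.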

To Reproduce

GPU: 8x RTX 5090 (32GB VRAM) per node.
CPU & Memory: Dual AMD EPYC 9654 processors with 24x 32GB DDR5 modules (configured in a cascade topology).
Network: Each node is equipped with a single dual-port 100GbE ConnectX-6 NIC.
Interconnect: The two nodes are directly connected via two 100G cables (direct link).
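
To put the 28 GB/s and 15 GB/s figures in context, the following Python sketch computes the raw line-rate ceiling of the two 100G links and the all_reduce bus-bandwidth factor used by nccl-tests. Whether the reported numbers are algorithm bandwidth or bus bandwidth is not stated above, so treating them as nccl-tests output is an assumption; the function names are hypothetical.

```python
def link_ceiling_gbytes(ports, gbit_per_port):
    """Raw line rate across all ports in GB/s (8 bits per byte, no protocol overhead)."""
    return ports * gbit_per_port / 8.0


def allreduce_busbw_factor(n_ranks):
    """nccl-tests convention for all_reduce: busbw = algbw * 2 * (n - 1) / n."""
    return 2.0 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Two 100G ports per node, 16 GPUs (ranks) across the two nodes.
    print("link ceiling (GB/s):", link_ceiling_gbytes(2, 100))
    print("busbw factor for 16 ranks:", allreduce_busbw_factor(16))
```

With two 100G ports the raw ceiling is 25 GB/s per direction, so a figure near 28 GB/s is more plausibly a bus-bandwidth value than an on-the-wire rate.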

Bug Incidence

Always

nvidia-bug-report.log.gz


More Info

No response

Labels

bug
