Summary
When using:

- ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04
- Slurm 25.11.4
- gres.conf with AutoDetect=nvml

the worker nodes report 0 GPUs to Slurm and are drained as invalid, even though the container can see all GPUs and NVML is present.
This looks like the Slurm build inside the image was compiled without NVML support (HAVE_NVML), so AutoDetect=nvml cannot work.
Environment
- Image: ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04
- Slurm version: slurmd -V → slurm 25.11.4; scontrol -V → slurm 25.11.4
- Arch: aarch64
- OS in worker: Ubuntu 24.04-based image
- GPUs: 4 NVIDIA GPUs visible in the worker container
Config
Relevant config:
configFiles:
  gres.conf: |
    AutoDetect=nvml
nodesets:
  slinky:
    slurmd:
      resources:
        limits:
          nvidia.com/gpu: 4
    extraConfMap:
      Gres: "gpu:4"
Observed behavior
Slurm drains the nodes as invalid.
sinfo -R:
REASON USER TIMESTAMP NODELIST
gres/gpu count repor slurm 2026-04-14T06:39:53 slinky-0
gres/gpu count repor slurm 2026-04-14T06:39:54 slinky-1
scontrol show node slinky-0:
Gres=gpu:4
State=IDLE+DRAIN+DYNAMIC_NORM+INVALID_REG
CfgTRES=cpu=80,mem=204420M,billing=80
Reason=gres/gpu count reported lower than configured (0 < 4)
Worker log:
[2026-04-14T06:39:53] We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
[2026-04-14T06:39:53] warning: Ignoring file-less GPU gpu:(null) from final GRES list
What works inside the worker container
The worker container can see the GPUs:
nvidia-smi -L
shows 4 GPUs.
NVML libraries are present:
find /usr /lib -name 'libnvidia-ml.so*' | sort
output:
/usr/lib/aarch64-linux-gnu/libnvidia-ml.so
/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1
/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.590.48.01
ldconfig -p | grep -i nvidia-ml:
libnvidia-ml.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libnvidia-ml.so
Symlinks also exist correctly under both /usr/lib/aarch64-linux-gnu and /lib/aarch64-linux-gnu.
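To rule out a dynamic-loader problem on top of the ldconfig check, the library can also be dlopen'ed directly, the same way a plugin would load it. A minimal probe sketch, assuming python3 is available in the image:

```shell
# Try to dlopen the NVML soname exactly as a plugin would; a loadable
# library prints "loadable", otherwise the loader error is shown.
python3 - <<'EOF'
import ctypes
try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1: loadable")
except OSError as exc:
    print("libnvidia-ml.so.1: not loadable:", exc)
EOF
```

On the worker above this prints "loadable", which supports the conclusion that the failure is not a runtime linking issue.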
Installed packages include:
ii nvslurm-plugin-pyxis 0.23.0-1
ii slurm-smd 25.11.4-1
ii slurm-smd-slurmd 25.11.4-1
Why I think this is a build/package issue
This does not look like a runtime NVML visibility problem, because:

- nvidia-smi works in the worker container
- libnvidia-ml.so is present
- libnvidia-ml.so is in the linker cache
The exact log line:
We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
suggests Slurm itself was built without NVML support, so AutoDetect=nvml cannot work regardless of runtime library presence.
Also, strings /usr/sbin/slurmd | grep -i nvml returns nothing.
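A more direct check than strings is to look for the NVML GPU plugin that an NVML-enabled build ships. A sketch; the plugin directory paths below are assumptions and vary by package:

```shell
# If Slurm was configured with NVML, the build produces a gpu_nvml plugin
# shared object in the Slurm plugin directory. The paths here are guesses;
# `scontrol show config | grep -i plugindir` gives the authoritative one.
found=no
for dir in /usr/lib/slurm /usr/lib/*/slurm /usr/lib64/slurm; do
  if [ -e "$dir/gpu_nvml.so" ]; then
    found=yes
    echo "gpu_nvml.so present in $dir"
  fi
done
echo "gpu_nvml plugin: $found"
```

If this reports "no" in the image, it corroborates that the packages were built without NVML support.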
Expected behavior
With:

  gres.conf: |
    AutoDetect=nvml

and Gres=gpu:4 configured on the nodeset, I would expect slurmd to detect the 4 GPUs via NVML and register the node with:

  Gres=gpu:4
  CfgTRES=...gres/gpu=4

without draining the node.
Actual behavior
slurmd reports 0 GPUs to Slurm, and the node is drained with:
Reason=gres/gpu count reported lower than configured (0 < 4)
Question
Can you confirm whether ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04 / the underlying slurm-smd packages were built without NVML support?
If yes, would it be possible to publish an image/package variant with NVML-enabled Slurm so AutoDetect=nvml works?
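As a stopgap while an NVML-enabled build is unavailable, GRES could be enumerated manually instead of autodetected. A gres.conf sketch; the /dev/nvidia* device paths are assumptions and should be verified with ls /dev/nvidia* in the worker:

```
# Sketch: disable autodetection and list GPU device files explicitly
# (device paths assumed; verify in the worker container first).
AutoDetect=off
Name=gpu File=/dev/nvidia[0-3]
```

This loses the topology information (Cores/Links) that NVML autodetection would provide, but should let the nodes register with gpu:4 instead of draining.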