
feat(nvcdi): Allow IPC sockets to not be discovered.#1790

Open
LandonTClipp wants to merge 1 commit intoNVIDIA:mainfrom
LandonTClipp:lclipp/disable-ipc

Conversation

@LandonTClipp

We are dealing with cases where containers should not have access to any NVIDIA IPC sockets, such as the fabricmanager and persistenced sockets, and we want the ability to disable their discovery.

This commit adds a feature flag `disable-ipc-discoverer` that, when set, causes the IPC `discoverer.Discoverer` to be nil. Ultimately, this is needed so that the CDI spec file generated by k8s-device-plugin does not include these sockets.
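The behavior described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: the `Discoverer` interface, `ipcDiscoverer` type, and `newIPCDiscoverer` constructor are hypothetical stand-ins for the `nvcdi` internals, showing only how a feature flag could short-circuit discovery by returning a nil discoverer.

```go
package main

import "fmt"

// Discoverer is an illustrative stand-in for the nvcdi discoverer interface.
type Discoverer interface {
	Sockets() []string
}

// ipcDiscoverer is a hypothetical discoverer for host NVIDIA IPC sockets.
type ipcDiscoverer struct{}

func (d *ipcDiscoverer) Sockets() []string {
	// Example host IPC sockets that would otherwise land in the CDI spec.
	return []string{
		"/var/run/nvidia-fabricmanager/socket",
		"/var/run/nvidia-persistenced/socket",
	}
}

// newIPCDiscoverer returns nil when the disable-ipc-discoverer feature flag
// is set, so no IPC sockets are added to the generated CDI spec.
func newIPCDiscoverer(disableIPC bool) Discoverer {
	if disableIPC {
		return nil
	}
	return &ipcDiscoverer{}
}

func main() {
	fmt.Println(newIPCDiscoverer(true) == nil)  // flag set: no discoverer
	fmt.Println(newIPCDiscoverer(false) == nil) // flag unset: sockets discovered
}
```

Callers that assemble the full list of discoverers would then skip a nil entry, which is what keeps the sockets out of the spec.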

@copy-pr-bot

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Signed-off-by: LandonTClipp <lclipp@coreweave.com>
@cdesiniotis
Contributor

/ok to test 6dad053

@coveralls

Coverage Report for CI Build 24740645194

Coverage increased (+0.005%) to 43.325%

Details

  • Coverage increased (+0.005%) from the base build.
  • Patch coverage: 2 uncovered changes across 1 file (4 of 6 lines covered, 66.67%).
  • No coverage regressions found.

Uncovered Changes

File                      Changed  Covered  %
pkg/nvcdi/driver-nvml.go  6        4        66.67%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 14862
Covered Lines: 6439
Line Coverage: 43.33%
Coverage Strength: 0.48 hits per line

💛 - Coveralls

Contributor

@cdesiniotis cdesiniotis left a comment


Thanks for the contribution @LandonTClipp.

I don't have any objections to introducing a feature flag to opt-out of this behavior. My one question would be -- are there any use cases where application containers need access to either the nvidia-persistenced or nvidia-fabricmanager sockets? IIUC excluding the nvidia-persistenced socket would only lead to cosmetic differences (e.g. running nvidia-smi in the container would report persistence mode as off even when it is on) and should not have any functional impact. I am not entirely certain about nvidia-fabricmanager.

@LandonTClipp
Author

> I don't have any objections to introducing a feature flag to opt-out of this behavior. My one question would be -- are there any use cases where application containers need access to either the nvidia-persistenced or nvidia-fabricmanager sockets?

Are you asking whether containers would need access to one but not the other? Perhaps. In that case a more granular feature flag could be used, but this change doesn't prevent us from adding that additional granularity in the future if we want it. For our particular workload-sandboxing use case, every host NVIDIA IPC socket needs to be blocked.
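The more granular variant discussed here could look like the sketch below: independent per-socket disable options instead of a single `disable-ipc-discoverer` switch. All names (`ipcOptions`, `ipcSockets`, the field names) are hypothetical and not part of the toolkit's actual API; the sketch only shows that the coarse flag and a future fine-grained one are compatible.

```go
package main

import "fmt"

// ipcOptions is a hypothetical, more granular configuration: one disable
// switch per host IPC socket rather than a single all-or-nothing flag.
type ipcOptions struct {
	disableFabricManager bool
	disablePersistenced  bool
}

// ipcSockets returns the host IPC sockets that would still be discovered
// under the given options.
func ipcSockets(opts ipcOptions) []string {
	var sockets []string
	if !opts.disableFabricManager {
		sockets = append(sockets, "/var/run/nvidia-fabricmanager/socket")
	}
	if !opts.disablePersistenced {
		sockets = append(sockets, "/var/run/nvidia-persistenced/socket")
	}
	return sockets
}

func main() {
	// Sandboxed workloads block everything, matching disable-ipc-discoverer.
	fmt.Println(ipcSockets(ipcOptions{disableFabricManager: true, disablePersistenced: true}))
	// Other workloads might block only the fabricmanager socket.
	fmt.Println(ipcSockets(ipcOptions{disableFabricManager: true}))
}
```

Setting both fields reproduces the behavior of this PR's flag, so the coarse flag could later be implemented on top of such options without a breaking change.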
