
nvidia-docker : not working due to cgroups v2
Open, Low, Public

Description

Problem:

Starting an nvidia-docker container results in the following error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

The root of the problem seems to be that the current kernel uses cgroups v2, which libnvidia-container does not currently support.
See the discussion in an issue posted in the nvidia-docker GitHub repo.
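
For reference, any GPU-enabled container run triggers it; something like the following (the image tag is only illustrative, not the exact command I used):

# Any run that uses the nvidia hook fails with the error above
docker run --rm --gpus all nvidia/cuda:11.4.2-base nvidia-smi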

To make nvidia-docker work, I followed the instructions given by user "Zethson" in a comment posted on 10 Apr in the issue linked above:

Edit the file

/usr/share/nvidia-container-runtime/config.toml

and change the line

# no-cgroups = false

to

no-cgroups = true

After

systemctl restart docker

everything works fine. My proposal would be to make this change in a post-build step when the package is built and installed.
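
For reference, the same change as shell commands (a sketch that assumes the stock config.toml shipped by the package; editing the file by hand works just as well):

# Uncomment and flip the cgroups setting in the packaged config
sudo sed -i 's/^#\? *no-cgroups = false/no-cgroups = true/' /usr/share/nvidia-container-runtime/config.toml
# Restart docker so the runtime hook picks up the new setting
sudo systemctl restart docker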

Event Timeline

saitam created this task. Sat, Oct 9, 10:54 AM
JoshStrobl renamed this task from "nvidia-docker : not working due to cgroups v2 used by current linux kernel and proposal for solving" to "nvidia-docker : not working due to cgroups v2". Sun, Oct 10, 1:57 PM
JoshStrobl assigned this task to xulongwu4.
JoshStrobl triaged this task as Low priority.
JoshStrobl edited projects, added Software; removed Lacks Project.
JoshStrobl moved this task from Backlog to System and Configuration Fixes on the Software board.

@saitam, the nvidia-container-runtime supports stateless configuration. Before we modify the packaging scripts to adopt your suggestion, you can copy /usr/share/nvidia-container-runtime/config.toml to /etc/nvidia-container-runtime/config.toml and make your desired changes. The file /etc/nvidia-container-runtime/config.toml takes precedence.
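
For example (a sketch; the directory under /etc may not exist yet):

# Create a local override; /etc/nvidia-container-runtime/config.toml takes precedence
sudo mkdir -p /etc/nvidia-container-runtime
sudo cp /usr/share/nvidia-container-runtime/config.toml /etc/nvidia-container-runtime/config.toml
# Then set no-cgroups = true in the copy and restart docker
sudo systemctl restart docker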

In the discussion you linked to above, people mentioned that changing no-cgroups = false to no-cgroups = true does not resolve the issue completely, as the NVIDIA devices are not accessible from inside Docker. They had to add the extra boot parameter systemd.unified_cgroup_hierarchy=0 (which reverts to the cgroups v1 hierarchy) to make things work. See this comment. Does that align with your observation?
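
If the boot parameter does turn out to be necessary, something along these lines should set it on Solus (a sketch; I am assuming clr-boot-manager manages the boot entries and picks up a drop-in under /etc/kernel/cmdline.d/ — adjust for your setup):

# Assumption: clr-boot-manager reads extra kernel parameters from /etc/kernel/cmdline.d/
echo "systemd.unified_cgroup_hierarchy=0" | sudo tee /etc/kernel/cmdline.d/60-cgroups-v1.conf
sudo clr-boot-manager update
# Reboot afterwards for the parameter to take effect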

@xulongwu4, in the discussion I also read that for some people the change in config.toml was not sufficient. For me, however, it resolved the issue, and I was able to use the GPU from within a Docker container.