
Nvidia driver is not able to open /dev/nvidia0
Closed, Wontfix · Public

Description

Hello,

I'm trying to use a Tesla K40 for ML tasks. The card is installed in a Cisco UCS M3-240 server running ESXi 6.7u3, passed through to an Ubuntu 20.04.1 LTS guest (GNU/Linux 5.4.0-58-generic x86_64).

But for some reason the driver cannot read from or write to /dev/nvidia0.

ikuchin@ikuchin:~$ lspci | grep NVIDIA
0b:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)


ikuchin@ikuchin:~$ systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon

  Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
  Active: active (running) since Wed 2020-12-30 19:40:23 UTC; 1min 57s ago
 Process: 893 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=0/SUCCESS)
Main PID: 902 (nvidia-persiste)
   Tasks: 1 (limit: 38412)
  Memory: 1.1M
  CGroup: /system.slice/nvidia-persistenced.service
          └─902 /usr/bin/nvidia-persistenced --verbose

Dec 30 19:40:22 ikuchin systemd[1]: Starting NVIDIA Persistence Daemon...
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Verbose syslog connection opened
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Started (902)
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - registered
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - failed to open.
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: Local RPC services initialized
Dec 30 19:40:23 ikuchin systemd[1]: Started NVIDIA Persistence Daemon.


ikuchin@ikuchin:~$ sudo journalctl -xe -u nvidia-persistenced
Dec 30 18:59:56 ikuchin systemd[1]: Started NVIDIA Persistence Daemon.

  • Subject: A start job for unit nvidia-persistenced.service has finished successfully
  • Defined-By: systemd
  • Support: http://www.ubuntu.com/support
  • A start job for unit nvidia-persistenced.service has finished successfully.
  • The job identifier is 144.

Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: Received signal 15
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: Socket closed.
Dec 30 19:01:26 ikuchin systemd[1]: Stopping NVIDIA Persistence Daemon...

  • Subject: A stop job for unit nvidia-persistenced.service has begun execution
  • Defined-By: systemd
  • Support: http://www.ubuntu.com/support
  • A stop job for unit nvidia-persistenced.service has begun execution.
  • The job identifier is 1266.

Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: PID file unlocked.
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: PID file closed.
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: Shutdown (873)
Dec 30 19:01:26 ikuchin systemd[1]: nvidia-persistenced.service: Succeeded.

  • Subject: Unit succeeded
  • Defined-By: systemd
  • Support: http://www.ubuntu.com/support
  • The unit nvidia-persistenced.service has successfully entered the 'dead' state.

Dec 30 19:01:26 ikuchin systemd[1]: Stopped NVIDIA Persistence Daemon.

  • Subject: A stop job for unit nvidia-persistenced.service has finished
  • Defined-By: systemd
  • Support: http://www.ubuntu.com/support
  • A stop job for unit nvidia-persistenced.service has finished.
  • The job identifier is 1266 and the job result is done.
-- Reboot --

Dec 30 19:40:22 ikuchin systemd[1]: Starting NVIDIA Persistence Daemon...

  • Subject: A start job for unit nvidia-persistenced.service has begun execution
  • Defined-By: systemd
  • Support: http://www.ubuntu.com/support
  • A start job for unit nvidia-persistenced.service has begun execution.
  • The job identifier is 149.

Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Verbose syslog connection opened
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Started (902)
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - registered
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - failed to open.
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: Local RPC services initialized
Dec 30 19:40:23 ikuchin systemd[1]: Started NVIDIA Persistence Daemon.

  • Subject: A start job for unit nvidia-persistenced.service has finished successfully
  • Defined-By: systemd
  • Support: http://www.ubuntu.com/support
  • A start job for unit nvidia-persistenced.service has finished successfully.
  • The job identifier is 149.

ikuchin@ikuchin:~$ ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 30 19:40 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec 30 19:40 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Dec 30 19:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237, 0 Dec 30 19:40 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237, 1 Dec 30 19:40 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x 2 root root 80 Dec 30 19:41 .
drwxr-xr-x 19 root root 4160 Dec 30 19:41 ..
cr-------- 1 root root 241, 1 Dec 30 19:41 nvidia-cap1
cr--r--r-- 1 root root 241, 2 Dec 30 19:41 nvidia-cap2


ikuchin@ikuchin:~$ sudo strace -ytff -o trace.out /usr/bin/nvidia-persistenced

root@ikuchin:/home/ikuchin# more trace.out.1940
19:03:55 set_robust_list(0x7f549117ea20, 24) = 0
19:03:55 umask(000) = 022
19:03:55 setsid() = 1940
19:03:55 getpid() = 1940
19:03:55 close(0</dev/pts/0>) = 0
19:03:55 close(1</dev/pts/0>) = 0
19:03:55 close(2</dev/pts/0>) = 0
19:03:55 close(3<pipe:[31658]>) = 0
19:03:55 openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 0</usr/share/zoneinfo/Etc/UTC>
19:03:55 fstat(0</usr/share/zoneinfo/Etc/UTC>, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
19:03:55 fstat(0</usr/share/zoneinfo/Etc/UTC>, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
19:03:55 read(0</usr/share/zoneinfo/Etc/UTC>, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 118
19:03:55 lseek(0</usr/share/zoneinfo/Etc/UTC>, -62, SEEK_CUR) = 56
19:03:55 read(0</usr/share/zoneinfo/Etc/UTC>, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 62
19:03:55 close(0</usr/share/zoneinfo/Etc/UTC>) = 0
19:03:55 socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 0<socket:[38967]>
19:03:55 connect(0<socket:[38967]>, {sa_family=AF_UNIX, sun_path="/dev/log"}, 110) = 0
19:03:55 sendto(0<socket:[38967]>, "<30>Dec 30 19:03:55 nvidia-persi"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
19:03:55 chdir("/") = 0
19:03:55 mkdir("/var/run/nvidia-persistenced", 0755) = 0
19:03:55 getuid() = 0
19:03:55 getgid() = 0
19:03:55 access("/var/run/nvidia-persistenced", R_OK|W_OK) = 0
19:03:55 openat(AT_FDCWD, "/var/run/nvidia-persistenced/nvidia-persistenced.pid", O_RDWR|O_CREAT, 0644) = 1</run/nvidia-persistenced/nvidia-persistenced.pid>
19:03:55 fcntl(1</run/nvidia-persistenced/nvidia-persistenced.pid>, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=0}) = 0
19:03:55 write(1</run/nvidia-persistenced/nvidia-persistenced.pid>, "1940\n", 5) = 5
19:03:55 sendto(0<socket:[38967]>, "<29>Dec 30 19:03:55 nvidia-persi"..., 55, MSG_NOSIGNAL, NULL, 0) = 55
19:03:55 futex(0x7f549139b0c8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
19:03:55 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 2</etc/ld.so.cache>
19:03:55 fstat(2</etc/ld.so.cache>, {st_mode=S_IFREG|0644, st_size=63700, ...}) = 0
19:03:55 mmap(NULL, 63700, PROT_READ, MAP_PRIVATE, 2</etc/ld.so.cache>, 0) = 0x7f549139e000
19:03:55 close(2</etc/ld.so.cache>) = 0
19:03:55 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnvidia-cfg.so.1", O_RDONLY|O_CLOEXEC) = 2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>
19:03:55 read(2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360M\0\0\0\0\0\0"..., 832) = 832
19:03:55 fstat(2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, {st_mode=S_IFREG|0644, st_size=217456, ...}) = 0
19:03:55 mmap(NULL, 2317568, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, 0) = 0x7f5490f48000
19:03:55 mprotect(0x7f5490f73000, 2093056, PROT_NONE) = 0
19:03:55 mmap(0x7f5491172000, 45056, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, 0x2a000) = 0x7f5491172000
19:03:55 mmap(0x7f549117d000, 3328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f549117d000
19:03:55 close(2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>) = 0
19:03:55 mprotect(0x7f5491172000, 32768, PROT_READ) = 0
19:03:55 munmap(0x7f549139e000, 63700) = 0
19:03:55 openat(AT_FDCWD, "/proc/modules", O_RDONLY) = 2</proc/modules>
19:03:55 fstat(2</proc/modules>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/modules>, "xt_conntrack 16384 1 - Live 0xff"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "y 114688 0 - Live 0xffffffffc24e"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "5 xt_conntrack,xt_MASQUERADE,xt_"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "000\nlinear 20480 0 - Live 0xffff"..., 1024) = 1024
19:03:55 close(2</proc/modules>) = 0
19:03:55 openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 2</proc/devices>
19:03:55 fstat(2</proc/devices>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/devices>, "Character devices:\n 1 mem\n 4 /"..., 1024) = 640
19:03:55 close(2</proc/devices>) = 0
19:03:55 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 2</proc/driver/nvidia/params>
19:03:55 fstat(2</proc/driver/nvidia/params>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/driver/nvidia/params>, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 791
19:03:55 close(2</proc/driver/nvidia/params>) = 0
19:03:55 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0xff), ...}) = 0
19:03:55 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 2</dev/nvidiactl>
19:03:55 fcntl(2</dev/nvidiactl>, F_SETFD, FD_CLOEXEC) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc866bbb10) = 0
19:03:55 openat(AT_FDCWD, "/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 3</sys/devices/system/memory/block_size_bytes>
19:03:55 read(3</sys/devices/system/memory/block_size_bytes>, "8000000\n", 99) = 8
19:03:55 close(3</sys/devices/system/memory/block_size_bytes>) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc866bbba0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0x900), 0x7f549117cac0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc866bbc70) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbc40) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbc40) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc866bbc50) = 0
19:03:55 close(2</dev/nvidiactl>) = 0
19:03:55 sendto(0<socket:[38967]>, "<31>Dec 30 19:03:55 nvidia-persi"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
19:03:55 openat(AT_FDCWD, "/proc/modules", O_RDONLY) = 2</proc/modules>
19:03:55 fstat(2</proc/modules>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/modules>, "xt_conntrack 16384 1 - Live 0xff"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "y 114688 0 - Live 0xffffffffc24e"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "5 xt_conntrack,xt_MASQUERADE,xt_"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "000\nlinear 20480 0 - Live 0xffff"..., 1024) = 1024
19:03:55 close(2</proc/modules>) = 0
19:03:55 openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 2</proc/devices>
19:03:55 fstat(2</proc/devices>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/devices>, "Character devices:\n 1 mem\n 4 /"..., 1024) = 640
19:03:55 close(2</proc/devices>) = 0
19:03:55 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 2</proc/driver/nvidia/params>
19:03:55 fstat(2</proc/driver/nvidia/params>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/driver/nvidia/params>, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 791
19:03:55 close(2</proc/driver/nvidia/params>) = 0
19:03:55 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0xff), ...}) = 0
19:03:55 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 2</dev/nvidiactl>
19:03:55 fcntl(2</dev/nvidiactl>, F_SETFD, FD_CLOEXEC) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc866bb980) = 0
19:03:55 openat(AT_FDCWD, "/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 3</sys/devices/system/memory/block_size_bytes>
19:03:55 read(3</sys/devices/system/memory/block_size_bytes>, "8000000\n", 99) = 8
19:03:55 close(3</sys/devices/system/memory/block_size_bytes>) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc866bba10) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0x900), 0x7f549117cac0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc866bbae0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbab0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbab0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbab0) = 0
19:03:55 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 3</proc/driver/nvidia/params>
19:03:55 fstat(3</proc/driver/nvidia/params>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(3</proc/driver/nvidia/params>, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 791
19:03:55 close(3</proc/driver/nvidia/params>) = 0
19:03:55 stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0), ...}) = 0
19:03:55 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
19:03:56 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd1, 0xc), 0x7ffc866bb9b4) = 0
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc866bbac0) = 0
19:03:56 close(2</dev/nvidiactl>) = 0
19:03:56 close(-5) = -1 EBADF (Bad file descriptor)
19:03:56 sendto(0<socket:[38967]>, "<27>Dec 30 19:03:56 nvidia-persi"..., 78, MSG_NOSIGNAL, NULL, 0) = 78
19:03:56 unlink("/var/run/nvidia-persistenced/socket") = -1 ENOENT (No such file or directory)
19:03:56 socket(AF_UNIX, SOCK_STREAM, 0) = 2<socket:[38968]>
19:03:56 bind(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 110) = 0
19:03:56 listen(2<socket:[38968]>, 128) = 0
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 getsockopt(2<socket:[38968]>, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 getpeername(2<socket:[38968]>, 0x7ffc866bbc60, [128]) = -1 ENOTCONN (Transport endpoint is not connected)
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 getsockopt(2<socket:[38968]>, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
19:03:56 sendto(0<socket:[38967]>, "<30>Dec 30 19:03:56 nvidia-persi"..., 71, MSG_NOSIGNAL, NULL, 0) = 71
19:03:56 write(4<pipe:[31658]>, "\1", 1) = 1
19:03:56 close(4<pipe:[31658]>) = 0
19:03:56 poll([{fd=2<socket:[38968]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, -1
root@ikuchin:/home/ikuchin#


IMO the problem is here:

19:03:55 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
19:03:56 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd1, 0xc), 0x7ffc866bb9b4) = 0
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc866bbac0) = 0
19:03:56 close(2</dev/nvidiactl>) = 0
19:03:56 close(-5) = -1 EBADF (Bad file descriptor)

I'm not sure how to fix this. Could you help?
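(Aside, for anyone triaging a similar trace: a small, hypothetical helper like the sketch below can pull the failing syscalls out of an strace log, so errors like the EIO above stand out among hundreds of successful calls. It only assumes strace's standard `= -1 ERRNAME (...)` error format; it is not part of any NVIDIA tooling.)

```python
# Sketch: scan strace output and report syscalls that returned an error.
import re

# strace renders failures as "... = -1 ERRNAME (description)"
ERR_RE = re.compile(r"=\s-1\s([A-Z]+)\s\(")

def failing_calls(trace_text):
    """Return (line, errno_name) pairs for syscalls that failed."""
    failures = []
    for line in trace_text.splitlines():
        m = ERR_RE.search(line)
        if m:
            failures.append((line.strip(), m.group(1)))
    return failures

# Example input, taken from the trace above
sample = """19:03:55 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
19:03:56 close(2</dev/nvidiactl>) = 0
19:03:56 close(-5) = -1 EBADF (Bad file descriptor)"""

for line, err in failing_calls(sample):
    print(err, "<-", line)
```

On the sample above this prints just the EIO and EBADF lines, which matches the manual analysis.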

Event Timeline

JoshStrobl added subscribers: xulongwu4, JoshStrobl.

I believe you would need CUDA support for this to function, which makes this a duplicate of T354. Other applications that require it for GPU acceleration, such as Blender, are affected as well (e.g. T238). The CUDA SDK's EULA does not permit redistribution, and even if it did, we can't consent to an EULA on behalf of the user. The solution has been to use nvidia-docker, though mileage may vary since @xulongwu4 isn't actively maintaining it.

Hi Josh,

Thank you for the hint, but it didn't help; the problem persists.

cuda-drivers and nvidia-docker2 are installed, but the Tesla is still not found:

ikuchin@ikuchin:~$ lspci | grep NVI
0b:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)

ikuchin@ikuchin:~$ sudo apt list nvidia-docker2
Listing... Done
nvidia-docker2/bionic,now 2.5.0-1 all [installed]
N: There are 23 additional versions. Please use the '-a' switch to see them.
ikuchin@ikuchin:~$ sudo apt list cuda-drivers
Listing... Done
cuda-drivers/unknown,unknown,now 460.27.04-1 amd64 [installed]
N: There are 6 additional versions. Please use the '-a' switch to see them.

The nvidia/cuda Docker images are installed too:

ikuchin@ikuchin:~$ sudo docker image ls
REPOSITORY              TAG          IMAGE ID       CREATED         SIZE
tensorflow/tensorflow   latest-gpu   cdbd4acb8a4c   2 weeks ago     5.53GB
nvidia/cuda             10.0-base    0f12aac8787e   3 months ago    109MB
nvidia/cuda             11.0-base    2ec708416bb8   4 months ago    122MB
hello-world             latest       bf756fb1ae65   12 months ago   13.3kB
ikuchin@ikuchin:~$
ikuchin@ikuchin:~$ sudo docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
No devices were found
ikuchin@ikuchin:~$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
No devices were found

Here is the log from nvidia-docker-plugin:

ikuchin@ikuchin:~/nvidia-docker$ sudo ./nvidia-docker-plugin
[sudo] password for ikuchin:
./nvidia-docker-plugin | 2021/01/03 20:16:58 Loading NVIDIA unified memory
./nvidia-docker-plugin | 2021/01/03 20:16:58 Loading NVIDIA management library
./nvidia-docker-plugin | 2021/01/03 20:16:59 Discovering GPU devices

It makes no difference how the container is launched: both "sudo docker run --gpus all ..." and "nvidia-docker ..." lead to "No devices were found".

ikuchin@ikuchin:~/nvidia-docker$ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
No devices were found
ikuchin@ikuchin:~/nvidia-docker$ sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
No devices were found


@IvanKuchin Are you running Ubuntu or Solus?

By the way, patches were submitted to update nvidia-docker and nvidia-container-toolkit to their latest versions.

I tried both Ubuntu and Solus; same issue.
The fix in my case was changing the VM's firmware from BIOS to EFI.

More details here: https://blogs.vmware.com/apps/2018/09/using-gpus-with-virtual-machines-on-vsphere-part-2-vmdirectpath-i-o.html
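For reference, the firmware change can be made in the vSphere VM settings or directly in the VM's .vmx file. The fragment below is a sketch of the VMX options the linked VMware article discusses for passing a large-BAR GPU through to a guest; the MMIO size value is illustrative and should be sized to your GPU's memory (roughly the next power of two above total GPU memory):

```ini
firmware = "efi"
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"
```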