Hello,
I'm trying to use a Tesla K40 for ML tasks. The card is installed in a Cisco UCS M3-240 server, and the stack is ESXi 6.7u3 -> Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-58-generic x86_64). For some reason, the system can't read from or write to /dev/nvidia0.
ikuchin@ikuchin:~$ lspci | grep NVIDIA
0b:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
ikuchin@ikuchin:~$ systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-12-30 19:40:23 UTC; 1min 57s ago
Process: 893 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=0/SUCCESS)
Main PID: 902 (nvidia-persiste)
Tasks: 1 (limit: 38412)
Memory: 1.1M
CGroup: /system.slice/nvidia-persistenced.service
└─902 /usr/bin/nvidia-persistenced --verbose
Dec 30 19:40:22 ikuchin systemd[1]: Starting NVIDIA Persistence Daemon...
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Verbose syslog connection opened
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Started (902)
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - registered
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - failed to open.
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: Local RPC services initialized
Dec 30 19:40:23 ikuchin systemd[1]: Started NVIDIA Persistence Daemon.
ikuchin@ikuchin:~$ sudo journalctl -xe -u nvidia-persistenced
Dec 30 18:59:56 ikuchin systemd[1]: Started NVIDIA Persistence Daemon.
- Subject: A start job for unit nvidia-persistenced.service has finished successfully
- Defined-By: systemd
- Support: http://www.ubuntu.com/support
- A start job for unit nvidia-persistenced.service has finished successfully.
- The job identifier is 144.
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: Received signal 15
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: Socket closed.
Dec 30 19:01:26 ikuchin systemd[1]: Stopping NVIDIA Persistence Daemon...
- Subject: A stop job for unit nvidia-persistenced.service has begun execution
- Defined-By: systemd
- Support: http://www.ubuntu.com/support
- A stop job for unit nvidia-persistenced.service has begun execution.
- The job identifier is 1266.
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: PID file unlocked.
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: PID file closed.
Dec 30 19:01:26 ikuchin nvidia-persistenced[873]: Shutdown (873)
Dec 30 19:01:26 ikuchin systemd[1]: nvidia-persistenced.service: Succeeded.
- Subject: Unit succeeded
- Defined-By: systemd
- Support: http://www.ubuntu.com/support
- The unit nvidia-persistenced.service has successfully entered the 'dead' state.
Dec 30 19:01:26 ikuchin systemd[1]: Stopped NVIDIA Persistence Daemon.
- Subject: A stop job for unit nvidia-persistenced.service has finished
- Defined-By: systemd
- Support: http://www.ubuntu.com/support
- A stop job for unit nvidia-persistenced.service has finished.
- The job identifier is 1266 and the job result is done.
-- Reboot --
Dec 30 19:40:22 ikuchin systemd[1]: Starting NVIDIA Persistence Daemon...
- Subject: A start job for unit nvidia-persistenced.service has begun execution
- Defined-By: systemd
- Support: http://www.ubuntu.com/support
- A start job for unit nvidia-persistenced.service has begun execution.
- The job identifier is 149.
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Verbose syslog connection opened
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: Started (902)
Dec 30 19:40:22 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - registered
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: device 0000:0b:00.0 - failed to open.
Dec 30 19:40:23 ikuchin nvidia-persistenced[902]: Local RPC services initialized
Dec 30 19:40:23 ikuchin systemd[1]: Started NVIDIA Persistence Daemon.
- Subject: A start job for unit nvidia-persistenced.service has finished successfully
- Defined-By: systemd
- Support: http://www.ubuntu.com/support
- A start job for unit nvidia-persistenced.service has finished successfully.
- The job identifier is 149.
ikuchin@ikuchin:~$ ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 30 19:40 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec 30 19:40 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Dec 30 19:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237, 0 Dec 30 19:40 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237, 1 Dec 30 19:40 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
drwxr-xr-x 2 root root 80 Dec 30 19:41 .
drwxr-xr-x 19 root root 4160 Dec 30 19:41 ..
cr-------- 1 root root 241, 1 Dec 30 19:41 nvidia-cap1
cr--r--r-- 1 root root 241, 2 Dec 30 19:41 nvidia-cap2
ikuchin@ikuchin:~$ sudo strace -ytff -o trace.out /usr/bin/nvidia-persistenced
root@ikuchin:/home/ikuchin# more trace.out.1940
19:03:55 set_robust_list(0x7f549117ea20, 24) = 0
19:03:55 umask(000) = 022
19:03:55 setsid() = 1940
19:03:55 getpid() = 1940
19:03:55 close(0</dev/pts/0>) = 0
19:03:55 close(1</dev/pts/0>) = 0
19:03:55 close(2</dev/pts/0>) = 0
19:03:55 close(3<pipe:[31658]>) = 0
19:03:55 openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 0</usr/share/zoneinfo/Etc/UTC>
19:03:55 fstat(0</usr/share/zoneinfo/Etc/UTC>, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
19:03:55 fstat(0</usr/share/zoneinfo/Etc/UTC>, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
19:03:55 read(0</usr/share/zoneinfo/Etc/UTC>, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 118
19:03:55 lseek(0</usr/share/zoneinfo/Etc/UTC>, -62, SEEK_CUR) = 56
19:03:55 read(0</usr/share/zoneinfo/Etc/UTC>, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 62
19:03:55 close(0</usr/share/zoneinfo/Etc/UTC>) = 0
19:03:55 socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 0<socket:[38967]>
19:03:55 connect(0<socket:[38967]>, {sa_family=AF_UNIX, sun_path="/dev/log"}, 110) = 0
19:03:55 sendto(0<socket:[38967]>, "<30>Dec 30 19:03:55 nvidia-persi"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
19:03:55 chdir("/") = 0
19:03:55 mkdir("/var/run/nvidia-persistenced", 0755) = 0
19:03:55 getuid() = 0
19:03:55 getgid() = 0
19:03:55 access("/var/run/nvidia-persistenced", R_OK|W_OK) = 0
19:03:55 openat(AT_FDCWD, "/var/run/nvidia-persistenced/nvidia-persistenced.pid", O_RDWR|O_CREAT, 0644) = 1</run/nvidia-persistenced/nvidia-persistenced.pid>
19:03:55 fcntl(1</run/nvidia-persistenced/nvidia-persistenced.pid>, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=0}) = 0
19:03:55 write(1</run/nvidia-persistenced/nvidia-persistenced.pid>, "1940\n", 5) = 5
19:03:55 sendto(0<socket:[38967]>, "<29>Dec 30 19:03:55 nvidia-persi"..., 55, MSG_NOSIGNAL, NULL, 0) = 55
19:03:55 futex(0x7f549139b0c8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
19:03:55 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 2</etc/ld.so.cache>
19:03:55 fstat(2</etc/ld.so.cache>, {st_mode=S_IFREG|0644, st_size=63700, ...}) = 0
19:03:55 mmap(NULL, 63700, PROT_READ, MAP_PRIVATE, 2</etc/ld.so.cache>, 0) = 0x7f549139e000
19:03:55 close(2</etc/ld.so.cache>) = 0
19:03:55 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnvidia-cfg.so.1", O_RDONLY|O_CLOEXEC) = 2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>
19:03:55 read(2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360M\0\0\0\0\0\0"..., 832) = 832
19:03:55 fstat(2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, {st_mode=S_IFREG|0644, st_size=217456, ...}) = 0
19:03:55 mmap(NULL, 2317568, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, 0) = 0x7f5490f48000
19:03:55 mprotect(0x7f5490f73000, 2093056, PROT_NONE) = 0
19:03:55 mmap(0x7f5491172000, 45056, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>, 0x2a000) = 0x7f5491172000
19:03:55 mmap(0x7f549117d000, 3328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f549117d000
19:03:55 close(2</usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.27.04>) = 0
19:03:55 mprotect(0x7f5491172000, 32768, PROT_READ) = 0
19:03:55 munmap(0x7f549139e000, 63700) = 0
19:03:55 openat(AT_FDCWD, "/proc/modules", O_RDONLY) = 2</proc/modules>
19:03:55 fstat(2</proc/modules>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/modules>, "xt_conntrack 16384 1 - Live 0xff"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "y 114688 0 - Live 0xffffffffc24e"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "5 xt_conntrack,xt_MASQUERADE,xt_"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "000\nlinear 20480 0 - Live 0xffff"..., 1024) = 1024
19:03:55 close(2</proc/modules>) = 0
19:03:55 openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 2</proc/devices>
19:03:55 fstat(2</proc/devices>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/devices>, "Character devices:\n 1 mem\n 4 /"..., 1024) = 640
19:03:55 close(2</proc/devices>) = 0
19:03:55 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 2</proc/driver/nvidia/params>
19:03:55 fstat(2</proc/driver/nvidia/params>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/driver/nvidia/params>, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 791
19:03:55 close(2</proc/driver/nvidia/params>) = 0
19:03:55 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0xff), ...}) = 0
19:03:55 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 2</dev/nvidiactl>
19:03:55 fcntl(2</dev/nvidiactl>, F_SETFD, FD_CLOEXEC) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc866bbb10) = 0
19:03:55 openat(AT_FDCWD, "/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 3</sys/devices/system/memory/block_size_bytes>
19:03:55 read(3</sys/devices/system/memory/block_size_bytes>, "8000000\n", 99) = 8
19:03:55 close(3</sys/devices/system/memory/block_size_bytes>) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc866bbba0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0x900), 0x7f549117cac0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc866bbc70) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbc40) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbc40) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc866bbc50) = 0
19:03:55 close(2</dev/nvidiactl>) = 0
19:03:55 sendto(0<socket:[38967]>, "<31>Dec 30 19:03:55 nvidia-persi"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
19:03:55 openat(AT_FDCWD, "/proc/modules", O_RDONLY) = 2</proc/modules>
19:03:55 fstat(2</proc/modules>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/modules>, "xt_conntrack 16384 1 - Live 0xff"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "y 114688 0 - Live 0xffffffffc24e"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "5 xt_conntrack,xt_MASQUERADE,xt_"..., 1024) = 1024
19:03:55 read(2</proc/modules>, "000\nlinear 20480 0 - Live 0xffff"..., 1024) = 1024
19:03:55 close(2</proc/modules>) = 0
19:03:55 openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 2</proc/devices>
19:03:55 fstat(2</proc/devices>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/devices>, "Character devices:\n 1 mem\n 4 /"..., 1024) = 640
19:03:55 close(2</proc/devices>) = 0
19:03:55 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 2</proc/driver/nvidia/params>
19:03:55 fstat(2</proc/driver/nvidia/params>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(2</proc/driver/nvidia/params>, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 791
19:03:55 close(2</proc/driver/nvidia/params>) = 0
19:03:55 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0xff), ...}) = 0
19:03:55 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 2</dev/nvidiactl>
19:03:55 fcntl(2</dev/nvidiactl>, F_SETFD, FD_CLOEXEC) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd2, 0x48), 0x7ffc866bb980) = 0
19:03:55 openat(AT_FDCWD, "/sys/devices/system/memory/block_size_bytes", O_RDONLY) = 3</sys/devices/system/memory/block_size_bytes>
19:03:55 read(3</sys/devices/system/memory/block_size_bytes>, "8000000\n", 99) = 8
19:03:55 close(3</sys/devices/system/memory/block_size_bytes>) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd6, 0x8), 0x7ffc866bba10) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xc8, 0x900), 0x7f549117cac0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc866bbae0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbab0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbab0) = 0
19:03:55 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc866bbab0) = 0
19:03:55 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 3</proc/driver/nvidia/params>
19:03:55 fstat(3</proc/driver/nvidia/params>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
19:03:55 read(3</proc/driver/nvidia/params>, "ResmanDebugLevel: 4294967295\nRmL"..., 1024) = 791
19:03:55 close(3</proc/driver/nvidia/params>) = 0
19:03:55 stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(0xc3, 0), ...}) = 0
19:03:55 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
19:03:56 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd1, 0xc), 0x7ffc866bb9b4) = 0
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc866bbac0) = 0
19:03:56 close(2</dev/nvidiactl>) = 0
19:03:56 close(-5) = -1 EBADF (Bad file descriptor)
19:03:56 sendto(0<socket:[38967]>, "<27>Dec 30 19:03:56 nvidia-persi"..., 78, MSG_NOSIGNAL, NULL, 0) = 78
19:03:56 unlink("/var/run/nvidia-persistenced/socket") = -1 ENOENT (No such file or directory)
19:03:56 socket(AF_UNIX, SOCK_STREAM, 0) = 2<socket:[38968]>
19:03:56 bind(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 110) = 0
19:03:56 listen(2<socket:[38968]>, 128) = 0
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 getsockopt(2<socket:[38968]>, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 getpeername(2<socket:[38968]>, 0x7ffc866bbc60, [128]) = -1 ENOTCONN (Transport endpoint is not connected)
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 getsockopt(2<socket:[38968]>, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
19:03:56 getsockname(2<socket:[38968]>, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128->38]) = 0
19:03:56 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
19:03:56 sendto(0<socket:[38967]>, "<30>Dec 30 19:03:56 nvidia-persi"..., 71, MSG_NOSIGNAL, NULL, 0) = 71
19:03:56 write(4<pipe:[31658]>, "\1", 1) = 1
19:03:56 close(4<pipe:[31658]>) = 0
19:03:56 poll([{fd=2<socket:[38968]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, -1
root@ikuchin:/home/ikuchin#
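Since the trace is long, here is a rough sketch of how the failing calls can be pulled out of it. It is only an illustration: the function name `failed_syscalls` is made up for this post, and the pattern assumes the usual strace convention that a failed call ends in `= -1 ERRNO (description)`.

```python
import re

# strace prints failed calls as "... = -1 EIO (Input/output error)".
# Assumption: only the "= -1 ENAME" part is needed to spot a failure.
FAILED_CALL = re.compile(r"=\s+-1\s+(E[A-Z0-9]+)\b")


def failed_syscalls(lines):
    """Return (errno_name, line) pairs for every traced call that returned -1."""
    failures = []
    for line in lines:
        text = line.rstrip("\n")
        match = FAILED_CALL.search(text)
        if match:
            failures.append((match.group(1), text))
    return failures
```

Feeding it the lines of trace.out.1940 should leave only a handful of entries: the two EIO results for /dev/nvidia0 plus the EBADF, ENOENT and ENOTCONN results that also appear in the trace.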
IMO the problem is here:
19:03:55 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
19:03:56 openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0xd1, 0xc), 0x7ffc866bb9b4) = 0
19:03:56 ioctl(2</dev/nvidiactl>, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc866bbac0) = 0
19:03:56 close(2</dev/nvidiactl>) = 0
19:03:56 close(-5) = -1 EBADF (Bad file descriptor)
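To check whether the failure is specific to nvidia-persistenced, the same open can be attempted directly. This is a minimal sketch, not anything from the NVIDIA tooling: `probe_open` is a hypothetical helper, and the two device paths are simply the ones from the `ls` output above.

```python
import errno
import os


def probe_open(path, flags=os.O_RDWR):
    """Try to open path; return None on success or the errno name on failure."""
    try:
        fd = os.open(path, flags)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 5 -> "EIO".
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return None


# Device nodes taken from the ls output earlier in this post.
for node in ("/dev/nvidiactl", "/dev/nvidia0"):
    print(node, "->", probe_open(node) or "opened OK")
```

If /dev/nvidiactl opens but /dev/nvidia0 reports EIO here as well, that would match the trace and suggest the error comes from the driver rather than from the daemon itself.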
I'm not sure how to fix that. Could you help?