
Nvidia driver not initializing for some hardware configs with kernel 5.3.1 and any Nvidia proprietary driver
Closed, Resolved · Public

Description

The Nvidia driver is not initializing for some hardware configs with kernel 5.3.1 and any Nvidia proprietary driver. We'll use this task to gather what we know.

Event Timeline

DataDrake triaged this task as Unbreak Now! priority.
DataDrake created this task.
DataDrake moved this task from Backlog to Kernel Drivers on the Hardware board.
Staudey added a subscriber: Staudey. Edited Sep 25 2019, 7:25 PM

Here's my journalctl log of an unsuccessful boot (with loglevel=7 as suggested by Josh): https://paste.ubuntu.com/p/3CZFZXJZ2Q/

Xorg log after an unsuccessful boot: https://paste.ubuntu.com/p/VdzXwjHsJ2/

GPU: NVIDIA GeForce GTX 1050 Ti

inxi -F output of my PC: https://paste.ubuntu.com/p/VW9zMkbHG9/

I've tried the nvidia-developer-driver, nvidia-glx-driver and nvidia-390-glx-driver packages (I'm not quite sure whether or not I've also tried the beta driver, and I can't use the 340 driver with my GPU).

nvidia-bug-report.sh output:

Plot thickens: https://devtalk.nvidia.com/default/topic/1050894/gpu-devices-lost-with-nvrm-rminitadapter-failed-when-cpu-or-network-is-busy/

Looking at the logs, though, I see the driver continuously initializing/deinitializing in fast succession, which points to nvidia-persistenced being missing.

So this happens because nvidia-persistenced fails to start, not the other way around.

From the log @Staudey posted:

Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.

However, I also have the same items in my logs:

Sep 25 00:51:44 goliath nvidia-persistenced[911]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.
Sep 25 00:51:44 goliath nvidia-persistenced[905]: nvidia-persistenced failed to initialize. Check syslog for more details.
Sep 25 00:51:44 goliath nvidia-persistenced[911]: Shutdown (911)

However, at login, persistenced is indeed active:

[~] systemctl status nvidia-persistenced                                                                                                                                                                                                                                         
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-09-25 00:51:45 EEST; 22h ago
  Process: 920 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
  Process: 979 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced (code=exited, status=0/SUCCESS)
 Main PID: 980 (nvidia-persiste)
    Tasks: 1 (limit: 4915)
   Memory: 1.0M
   CGroup: /system.slice/nvidia-persistenced.service
           └─980 /usr/bin/nvidia-persistenced --user nvidia-persistenced

@Staudey can you get me the journal entries for nvidia-persistenced on a failed boot? Something like:

sudo journalctl -xe -u nvidia-persistenced

@JoshStrobl Yeah, it definitely fails the first time since the kernel isn't ready. It tries again later multiple times. This is probably why it is fine on your machine. I need to see if there are more log entries that I am missing though.

Sure, here you go. (btw I've also edited my original comment to add the output file of nvidia-bug-report.sh)

-- Logs begin at Sun 2019-08-11 13:47:12 CEST, end at Wed 2019-09-25 22:13:29 CEST. --
Sep 25 21:15:42 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Started (693)
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.
Sep 25 21:15:42 solus-pc nvidia-persistenced[690]: nvidia-persistenced failed to initialize. Check syslog for more details.
Sep 25 21:15:42 solus-pc systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Shutdown (693)
Sep 25 21:15:42 solus-pc systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Sep 25 21:15:42 solus-pc systemd[1]: Failed to start NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has failed.
-- 
-- The result is RESULT.
Sep 25 21:15:42 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:42 solus-pc nvidia-persistenced[768]: Started (768)
Sep 25 21:15:43 solus-pc nvidia-persistenced[768]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:43 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished starting up.
-- 
-- The start-up result is RESULT.
Sep 25 21:15:43 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:43 solus-pc nvidia-persistenced[768]: Shutdown (768)
Sep 25 21:15:43 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:44 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:44 solus-pc nvidia-persistenced[813]: Started (813)
Sep 25 21:15:44 solus-pc nvidia-persistenced[813]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:44 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished starting up.
-- 
-- The start-up result is RESULT.
Sep 25 21:15:45 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:45 solus-pc nvidia-persistenced[813]: Shutdown (813)
Sep 25 21:15:45 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:45 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:45 solus-pc nvidia-persistenced[862]: Started (862)
Sep 25 21:15:45 solus-pc nvidia-persistenced[862]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:45 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished starting up.
-- 
-- The start-up result is RESULT.
Sep 25 21:15:45 solus-pc nvidia-persistenced[862]: Shutdown (862)
Sep 25 21:15:45 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:45 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:45 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:45 solus-pc nvidia-persistenced[900]: Started (900)
Sep 25 21:15:46 solus-pc nvidia-persistenced[900]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:46 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished starting up.
-- 
-- The start-up result is RESULT.
Sep 25 21:15:46 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:46 solus-pc nvidia-persistenced[900]: Shutdown (900)
Sep 25 21:15:46 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:43 solus-pc nvidia-persistenced[768]: device 0000:01:00.0 - failed to open.

That seems like the smoking gun.

Alright @Staudey, one more thing to stare at on a failed boot:

sudo strace -ytff -o trace.out /usr/bin/nvidia-persistenced --user nvidia-persistenced

The trace will be pretty big and split across multiple files starting with the prefix trace.out and ending with a numeric process ID, so you'll want to bundle them up in a compressed archive before uploading. Thanks!
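If it helps, bundling them could look something like this (a sketch; the trace.out.<PID> file names follow from the -ff/-o options above):

$ tar czf persistenced-trace.tar.gz trace.out.*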

Hey, sorry, I fell asleep yesterday while the trace was running. I stopped it in the morning but had to get to work immediately.
However, now that I look at the files there are just two ~10kb traces, so I will upload them directly.

@Staudey thank you! Looks like we have our first lead:

openat(AT_FDCWD, "/sys/devices/system/memory/block_size_bytes", O_RDONLY) = -1 ENOENT (No such file or directory)

This is disabled right now because we don't have CONFIG_MEMORY_HOTPLUG=y in the kernel config. I'm not 100% sure, but I think when the nvidia driver can't read the file, it guesses the memory block size poorly and gets told to piss off by the IOMMU guarding the PCI-E bus addresses, which is why we saw these in the journal:

kernel: resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000e0000-0x000e3fff window]
kernel: caller _nv030477rm+0x58/0x90 [nvidia] mapping multiple BARs

This is likely very system-specific, which might start to explain things.
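For reference, a quick way to check whether a given kernel exposes that file and has the option enabled (a sketch; kernel config paths vary by distro/build):

$ cat /sys/devices/system/memory/block_size_bytes   # ENOENT without CONFIG_MEMORY_HOTPLUG
$ zgrep CONFIG_MEMORY_HOTPLUG /proc/config.gz 2>/dev/null || grep CONFIG_MEMORY_HOTPLUG /boot/config-$(uname -r)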

chax added a subscriber: chax. Sep 27 2019, 12:23 PM

I'm in the middle of rebuilds of the out-of-tree modules for linux-current, with memory hotplug enabled in the kernel config. If all goes well, I'll enable it for linux-lts when the new release is out on Sunday.

chax added a comment. Sep 27 2019, 2:37 PM

chax added a comment. Sep 27 2019, 3:24 PM

linux kernel 5.2.13:

$ ls -alh /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 ruj  27 17:19 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 ruj  27 17:19 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 ruj  27 17:19 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240,   0 ruj  27 17:19 /dev/nvidia-uvm

linux kernel 5.3.1:

$ ls -alh /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 ruj  27 17:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 ruj  27 17:16 /dev/nvidiactl
crw-rw-rw- 1 root root 240,   0 ruj  27 17:16 /dev/nvidia-uvm

chax added a comment. Sep 27 2019, 3:38 PM

Putting iommu=off as a kernel parameter, as @DataDrake suggested, I successfully booted into Linux kernel 5.3.1.

Hello, sorry, I didn't come home yesterday and couldn't test; here's my update:

The new kernel package didn't fix things for me, but adding iommu=off as a kernel parameter, like @chax described, enabled me to boot into Budgie.
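For anyone else wanting to try the workaround, on Solus the parameter can be added roughly like this (a sketch assuming the usual clr-boot-manager layout; the .conf file name is arbitrary):

$ echo "iommu=off" | sudo tee /etc/kernel/cmdline.d/iommu-off.conf
$ sudo clr-boot-manager update
# reboot afterwards; delete the file and run update again to revert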

rjurga added a subscriber: rjurga. Oct 18 2019, 10:49 AM
mbeso added a subscriber: mbeso. Oct 23 2019, 12:21 PM
chax added a comment. Oct 24 2019, 7:08 AM

I pinpointed the issue to one of our kernel patches.
What I did was first compile kernel 5.3.7 with all the patches and the kernel config that we used to compile kernel 5.3.1 (R3571:70ca91bdea72).
After that I just recompiled nvidia-glx-driver against the new kernel, rebooted, and got the same error as we had on 5.3.1.
Then I removed all the patches by removing the %apply_patches line from package.yml.
Recompiled the kernel and the nvidia driver, rebooted and BINGO: a successful boot with the nvidia drivers loaded normally, everything working as it should.
Then I went through the list of patches we have, trying to find the main suspect for this problem, and this one seemed like the most probable one.
I re-added the %apply_patches line to package.yml, removed just the suspected patch from the list of patches (files/series), and did another rebuild of the kernel and nvidia drivers.
Another successful boot, which means that this patch, when applied to kernel >5.3.x, breaks something with the IOMMU, and the only way to boot successfully with that kernel is to turn off the IOMMU via a kernel parameter.
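Roughly, the bisection step above boils down to something like this (a sketch; the patch file name is a placeholder and the build commands assume the usual Solus packaging workflow):

# inside the kernel package recipe checkout
$ sed -i '/suspected-change.patch/d' files/series   # drop only the suspected patch (placeholder name)
$ sudo solbuild build package.yml                   # rebuild the kernel
# then rebuild nvidia-glx-driver against it, install both, and reboot to test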

chax added a comment. Oct 29 2019, 3:35 PM

Works now :)

Can confirm that the fix works. Thanks @chax and @DataDrake!

ender added a subscriber: ender. Edited Nov 10 2019, 3:19 PM

Still getting this error on each boot:
Nov 10 16:11:47 alecto nvidia-persistenced[814]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.

inxi -F:

System:    Host: alecto Kernel: 5.3.8-133.current x86_64 bits: 64 Console: N/A Distro: Solus 4.0 
Machine:   Type: Laptop System: Micro-Star product: GS70 2OD v: REV:1.0 serial: FFFFFFFF 
           Mobo: Micro-Star model: MS-1771 v: REV:0.B serial: BSS-0123456789 UEFI: American Megatrends v: E1771IMS.50G 
           date: 11/14/2014 
Battery:   ID-1: BAT1 charge: 60.6 Wh condition: 63.0/59.9 Wh (105%) 
CPU:       Topology: Quad Core model: Intel Core i7-4700HQ bits: 64 type: MT MCP L2 cache: 6144 KiB 
           Speed: 2195 MHz min/max: 800/3400 MHz Core speeds (MHz): 1: 2195 2: 2195 3: 2195 4: 2196 5: 2197 6: 2195 7: 2198 
           8: 2197 
Graphics:  Device-1: Intel 4th Gen Core Processor Integrated Graphics driver: i915 v: kernel 
           Device-2: NVIDIA GK106M [GeForce GTX 765M] driver: nvidia v: 390.129 
           Display: server: X.Org 1.20.5 driver: modesetting,nvidia resolution: 1920x1080~60Hz 
           OpenGL: renderer: GeForce GTX 765M/PCIe/SSE2 v: 4.6.0 NVIDIA 390.129 
Audio:     Device-1: Intel 8 Series/C220 Series High Definition Audio driver: snd_hda_intel 
           Device-2: NVIDIA GK106 HDMI Audio driver: snd_hda_intel 
           Sound Server: ALSA v: k5.3.8-133.current 
Network:   Device-1: Qualcomm Atheros Killer E220x Gigabit Ethernet driver: alx 
           IF: enp4s0 state: down mac: 6c:62:6d:35:ae:18 
           Device-2: Qualcomm Atheros AR9462 Wireless Network Adapter driver: ath9k 
           IF: wlp5s0 state: up mac: 3c:77:e6:68:4b:cd 
           Device-3: Qualcomm Atheros AR3012 Bluetooth 4.0 type: USB driver: btusb 
Drives:    Local Storage: total: 1.14 TiB used: 34.05 GiB (2.9%) 
           ID-1: /dev/sda vendor: SanDisk model: SD6SF1M128G size: 119.24 GiB 
           ID-2: /dev/sdb vendor: Smart Modular Tech. model: SH00M120GB size: 111.79 GiB 
           ID-3: /dev/sdc vendor: HGST (Hitachi) model: HTS721010A9E630 size: 931.51 GiB 
Partition: ID-1: / size: 105.39 GiB used: 34.05 GiB (32.3%) fs: ext4 dev: /dev/dm-1 
           ID-2: swap-1 size: 3.73 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-0 
Sensors:   System Temperatures: cpu: 50.0 C mobo: 27.8 C gpu: nvidia temp: 56 C 
           Fan Speeds (RPM): N/A 
Info:      Processes: 227 Uptime: 6m Memory: 7.76 GiB used: 1.55 GiB (20.0%) Shell: bash inxi: 3.0.36

That might happen the first time due to an early boot race condition; it's fine as long as it doesn't happen a second time. It's clearly loading correctly, or Xorg would have fallen back to Intel.
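A quick way to confirm it only fires once per boot (a simple grep sketch):

$ journalctl -b -u nvidia-persistenced | grep -c "Failed to query NVIDIA devices"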

ender added a comment. Nov 10 2019, 4:04 PM

Ah, OK! It does indeed happen only once per boot (if I understood you correctly), and without any further problems that I'm aware of. Thanks for your reply.