The Nvidia driver fails to initialize on some hardware configurations with kernel 5.3.1, regardless of which Nvidia proprietary driver is used. We'll use this task to gather what we know.
Description
Revisions and Commits
- R3571 linux-current
- R3571:dad3cd0991cd: Updated to 5.3.7
Related Objects
- Mentioned In
- T8449: Meta: Week 44/45 Task List
- Mentioned Here
- R3571:70ca91bdea72: Updated to 5.3.1
Event Timeline
Here's my journalctl log of an unsuccessful boot (with loglevel=7 as suggested by Josh): https://paste.ubuntu.com/p/3CZFZXJZ2Q/
Xorg log after the unsuccessful boot: https://paste.ubuntu.com/p/VdzXwjHsJ2/
GPU: NVIDIA GeForce GTX 1050 Ti
inxi -F output of my PC: https://paste.ubuntu.com/p/VW9zMkbHG9/
I've tried the nvidia-developer-driver, nvidia-glx-driver and nvidia-390-glx-driver packages (I'm not quite sure whether or not I've also tried the beta driver, and I can't use the 340 driver with my GPU).
nvidia-bug-report.sh output:
Plot thickens: https://devtalk.nvidia.com/default/topic/1050894/gpu-devices-lost-with-nvrm-rminitadapter-failed-when-cpu-or-network-is-busy/
Looking at the logs, though, I see the driver continuously initializing and deinitializing in quick succession, which suggests that nvidia-persistenced isn't running.
So this happens because nvidia-persistenced fails to start, not the other way around.
From the log @Staudey posted:
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.
However, I also have the same items in my logs:
Sep 25 00:51:44 goliath nvidia-persistenced[911]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.
Sep 25 00:51:44 goliath nvidia-persistenced[905]: nvidia-persistenced failed to initialize. Check syslog for more details.
Sep 25 00:51:44 goliath nvidia-persistenced[911]: Shutdown (911)
However, at login, persistenced is indeed active:
[~] systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2019-09-25 00:51:45 EEST; 22h ago
Process: 920 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
Process: 979 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced (code=exited, status=0/SUCCESS)
Main PID: 980 (nvidia-persiste)
Tasks: 1 (limit: 4915)
Memory: 1.0M
CGroup: /system.slice/nvidia-persistenced.service
└─980 /usr/bin/nvidia-persistenced --user nvidia-persistenced
@Staudey, can you get me the journal entries for nvidia-persistenced on a failed boot? Something like:
sudo journalctl -xe -u nvidia-persistenced
@JoshStrobl Yeah, it definitely fails the first time since the kernel isn't ready. It tries again later multiple times. This is probably why it is fine on your machine. I need to see if there are more log entries that I am missing though.
Sure, here you go. (btw I've also edited my original comment to add the output file of nvidia-bug-report.sh)
-- Logs begin at Sun 2019-08-11 13:47:12 CEST, end at Wed 2019-09-25 22:13:29 CEST. --
Sep 25 21:15:42 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Started (693)
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.
Sep 25 21:15:42 solus-pc nvidia-persistenced[690]: nvidia-persistenced failed to initialize. Check syslog for more details.
Sep 25 21:15:42 solus-pc systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Sep 25 21:15:42 solus-pc nvidia-persistenced[693]: Shutdown (693)
Sep 25 21:15:42 solus-pc systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Sep 25 21:15:42 solus-pc systemd[1]: Failed to start NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has failed.
--
-- The result is RESULT.
Sep 25 21:15:42 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:42 solus-pc nvidia-persistenced[768]: Started (768)
Sep 25 21:15:43 solus-pc nvidia-persistenced[768]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:43 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished starting up.
--
-- The start-up result is RESULT.
Sep 25 21:15:43 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:43 solus-pc nvidia-persistenced[768]: Shutdown (768)
Sep 25 21:15:43 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:44 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:44 solus-pc nvidia-persistenced[813]: Started (813)
Sep 25 21:15:44 solus-pc nvidia-persistenced[813]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:44 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished starting up.
--
-- The start-up result is RESULT.
Sep 25 21:15:45 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:45 solus-pc nvidia-persistenced[813]: Shutdown (813)
Sep 25 21:15:45 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:45 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:45 solus-pc nvidia-persistenced[862]: Started (862)
Sep 25 21:15:45 solus-pc nvidia-persistenced[862]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:45 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished starting up.
--
-- The start-up result is RESULT.
Sep 25 21:15:45 solus-pc nvidia-persistenced[862]: Shutdown (862)
Sep 25 21:15:45 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:45 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:45 solus-pc systemd[1]: Starting NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun starting up.
Sep 25 21:15:45 solus-pc nvidia-persistenced[900]: Started (900)
Sep 25 21:15:46 solus-pc nvidia-persistenced[900]: device 0000:01:00.0 - failed to open.
Sep 25 21:15:46 solus-pc systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished starting up.
--
-- The start-up result is RESULT.
Sep 25 21:15:46 solus-pc systemd[1]: Stopping NVIDIA Persistence Daemon...
-- Subject: Unit nvidia-persistenced.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has begun shutting down.
Sep 25 21:15:46 solus-pc nvidia-persistenced[900]: Shutdown (900)
Sep 25 21:15:46 solus-pc systemd[1]: Stopped NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished shutting down.
Sep 25 21:15:43 solus-pc nvidia-persistenced[768]: device 0000:01:00.0 - failed to open.
That seems to be the smoking gun.
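For anyone who wants to check their own journal for the same pattern, here is a minimal sketch that counts the failed device opens per PCI address. It assumes the journal text is piped in on stdin (for example from the journalctl command above), and the function name count_open_failures is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count "failed to open" messages per PCI device
# in nvidia-persistenced journal text read from stdin.
count_open_failures() {
    # Matching lines look like:
    #   ... nvidia-persistenced[768]: device 0000:01:00.0 - failed to open.
    # sed keeps only the PCI address; uniq -c counts repeats per address.
    sed -n 's/.*device \([0-9a-fA-F:.]*\) - failed to open\..*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (not run here):
#   sudo journalctl -u nvidia-persistenced | count_open_failures
```

Each output line is a PCI address followed by the number of failed opens, so a device that keeps failing across restarts stands out right away.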
Alright @Staudey, one more thing to stare at on a failed boot:
sudo strace -ytff -o trace.out /usr/bin/nvidia-persistenced --user nvidia-persistenced
The trace will be pretty big and multiple files starting with the prefix trace.out and ending with a numerical Process ID, so you'll want to bundle them up in a compressed archive before uploading. Thanks!
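In case it helps, bundling the per-process trace files could look roughly like this. It is only a sketch: the function name bundle_traces and the archive name traces.tar.gz are placeholders, and it assumes strace wrote its files as trace.out.PID in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: pack every strace output file that starts with the
# given prefix (e.g. trace.out.693, trace.out.768) into one gzip'd tarball.
bundle_traces() {
    prefix=$1
    archive=$2
    # The trailing glob is left unquoted on purpose so the shell expands it.
    tar -czf "$archive" "$prefix".*
}

# Usage (run in the directory where strace wrote its files):
#   bundle_traces trace.out traces.tar.gz
```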
Hey, sorry, I fell asleep yesterday while the trace was running. I stopped it in the morning but had to get to work immediately.
However, now that I look at the files there are just two ~10kb traces, so I will upload them directly.
@Staudey thank you! Looks like we have our first lead:
openat(AT_FDCWD, "/sys/devices/system/memory/block_size_bytes", O_RDONLY) = -1 ENOENT (No such file or directory)
This file doesn't exist right now because we don't have CONFIG_MEMORY_HOTPLUG=y in the kernel config. I'm not 100% sure, but I think that when nvidia can't read the file, it guesses the memory block size poorly and gets told to piss off by the IOMMU guarding the PCI-E bus addresses, which is why we saw these in the journal:
kernel: resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000e0000-0x000e3fff window]
kernel: caller _nv030477rm+0x58/0x90 [nvidia] mapping multiple BARs
This is likely very system-specific which might start to explain things.
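To see whether a given machine has the file that the trace shows as missing, something like the following could be used. This is a sketch under assumptions: check_block_size is a made-up name, and the optional argument exists only so the sysfs root can be swapped out for testing.

```shell
#!/bin/sh
# Hypothetical helper: report whether the memory block size file that the
# strace output shows nvidia-persistenced trying to open is present.
# The optional argument is the sysfs root (defaults to /sys).
check_block_size() {
    sysfs_root=${1:-/sys}
    f="$sysfs_root/devices/system/memory/block_size_bytes"
    if [ -r "$f" ]; then
        # Print the file content as-is (the kernel reports the size in hex).
        echo "present: $(cat "$f")"
    else
        echo "missing: $f"
    fi
}

# Usage (not run here):
#   check_block_size
```

If the running kernel happens to expose its config at /proc/config.gz (not every build does), zgrep CONFIG_MEMORY_HOTPLUG /proc/config.gz would show the setting directly.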
I'm in the middle of rebuilds for linux-current out-of-tree modules, with the hotplug driver enabled. If all goes well, I'll enable it for linux-lts when the new release is out on Sunday.
linux kernel 5.2.13:
$ ls -alh /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 ruj 27 17:19 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 ruj 27 17:19 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 ruj 27 17:19 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240, 0 ruj 27 17:19 /dev/nvidia-uvm
linux kernel 5.3.1:
$ ls -alh /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 ruj 27 17:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 ruj 27 17:16 /dev/nvidiactl
crw-rw-rw- 1 root root 240, 0 ruj 27 17:16 /dev/nvidia-uvm
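The difference between the two listings is the /dev/nvidia-modeset node, which is absent on 5.3.1. A rough way to spot that kind of difference automatically, assuming the device paths from each kernel were saved one per line into two files (the file names and the function name missing_nodes are illustrative only):

```shell
#!/bin/sh
# Hypothetical helper: print device nodes listed in the first file
# (one path per line) that do not appear in the second file.
missing_nodes() {
    good=$1
    bad=$2
    # -v: invert match, -x: whole-line match, -F: fixed strings,
    # -f: read the patterns to exclude from the second file.
    grep -vxFf "$bad" "$good"
}

# Usage (not run here):
#   ls /dev/nvidia* > nodes-5.2.13.txt   # on the working kernel
#   ls /dev/nvidia* > nodes-5.3.1.txt    # on the failing kernel
#   missing_nodes nodes-5.2.13.txt nodes-5.3.1.txt
```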
After adding iommu=off as a kernel parameter, as @DataDrake suggested, I successfully booted into Linux kernel 5.3.1.
Hello, sorry I didn't come home yesterday and couldn't test, here's my update:
The new kernel package didn't fix things for me, but adding iommu=off as a kernel parameter, as @chax described, enabled me to boot into Budgie.
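For others trying the workaround, a small check that the parameter actually reached the running kernel may be useful. This is a sketch with a made-up function name, has_iommu_off. On Solus the parameter is normally made persistent through clr-boot-manager (a .conf file under /etc/kernel/cmdline.d/ followed by clr-boot-manager update), but treat that as an assumption and check the Solus help pages for your setup.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line file
# (defaults to /proc/cmdline) contains the iommu=off parameter.
has_iommu_off() {
    cmdline_file=${1:-/proc/cmdline}
    # Split the command line into one parameter per line, then require an
    # exact match so that e.g. intel_iommu=off is not counted.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF 'iommu=off'
}

# Usage (not run here):
#   if has_iommu_off; then echo "iommu=off is active"; fi
```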
I pinpointed the issue to one of our kernel patches.
What I did first was compile kernel 5.3.7 with all the patches and the kernel config that we used for kernel 5.3.1 (R3571:70ca91bdea72).
After that I recompiled nvidia-glx-driver against the new kernel, rebooted, and got the same error as we had on 5.3.1.
Then I removed all the patches by removing the %apply_patches line from package.yml.
I recompiled the kernel and the nvidia driver, rebooted, and BINGO: a successful boot with the nvidia drivers loaded normally and everything working as it should.
Then I went through our list of patches looking for the main suspect, and this one seemed the most probable.
I re-added the %apply_patches line to package.yml, removed only the suspected patch from the list of patches (files/series), and did another rebuild of the kernel and nvidia drivers.
Another successful boot, which means that this patch, when applied to kernel 5.3.x or newer, breaks something with IOMMU, and the only way to boot successfully with that kernel is to turn off IOMMU via a kernel parameter.
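The last step of that process, dropping a single entry from files/series while keeping the rest, can be sketched as follows. The function name series_without and the patch name suspect.patch are placeholders, since the patch is not named in the text here.

```shell
#!/bin/sh
# Hypothetical helper: print a series file with one patch entry removed,
# leaving every other line untouched.
series_without() {
    series=$1
    patch=$2
    # -v: invert match, -x: whole-line match, -F: treat the name literally.
    grep -vxF -- "$patch" "$series"
}

# Usage (not run here; suspect.patch is a placeholder name):
#   series_without files/series suspect.patch > files/series.new
#   mv files/series.new files/series
```

Repeating this for one patch at a time, with a rebuild and reboot in between, narrows things down the same way as the manual steps above.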
Still getting this error on each boot:
Nov 10 16:11:47 alecto nvidia-persistenced[814]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 143 has read and write permissions for those files.
inxi -F:
System: Host: alecto Kernel: 5.3.8-133.current x86_64 bits: 64 Console: N/A Distro: Solus 4.0
Machine: Type: Laptop System: Micro-Star product: GS70 2OD v: REV:1.0 serial: FFFFFFFF
Mobo: Micro-Star model: MS-1771 v: REV:0.B serial: BSS-0123456789 UEFI: American Megatrends v: E1771IMS.50G
date: 11/14/2014
Battery: ID-1: BAT1 charge: 60.6 Wh condition: 63.0/59.9 Wh (105%)
CPU: Topology: Quad Core model: Intel Core i7-4700HQ bits: 64 type: MT MCP L2 cache: 6144 KiB
Speed: 2195 MHz min/max: 800/3400 MHz Core speeds (MHz): 1: 2195 2: 2195 3: 2195 4: 2196 5: 2197 6: 2195 7: 2198
8: 2197
Graphics: Device-1: Intel 4th Gen Core Processor Integrated Graphics driver: i915 v: kernel
Device-2: NVIDIA GK106M [GeForce GTX 765M] driver: nvidia v: 390.129
Display: server: X.Org 1.20.5 driver: modesetting,nvidia resolution: 1920x1080~60Hz
OpenGL: renderer: GeForce GTX 765M/PCIe/SSE2 v: 4.6.0 NVIDIA 390.129
Audio: Device-1: Intel 8 Series/C220 Series High Definition Audio driver: snd_hda_intel
Device-2: NVIDIA GK106 HDMI Audio driver: snd_hda_intel
Sound Server: ALSA v: k5.3.8-133.current
Network: Device-1: Qualcomm Atheros Killer E220x Gigabit Ethernet driver: alx
IF: enp4s0 state: down mac: 6c:62:6d:35:ae:18
Device-2: Qualcomm Atheros AR9462 Wireless Network Adapter driver: ath9k
IF: wlp5s0 state: up mac: 3c:77:e6:68:4b:cd
Device-3: Qualcomm Atheros AR3012 Bluetooth 4.0 type: USB driver: btusb
Drives: Local Storage: total: 1.14 TiB used: 34.05 GiB (2.9%)
ID-1: /dev/sda vendor: SanDisk model: SD6SF1M128G size: 119.24 GiB
ID-2: /dev/sdb vendor: Smart Modular Tech. model: SH00M120GB size: 111.79 GiB
ID-3: /dev/sdc vendor: HGST (Hitachi) model: HTS721010A9E630 size: 931.51 GiB
Partition: ID-1: / size: 105.39 GiB used: 34.05 GiB (32.3%) fs: ext4 dev: /dev/dm-1
ID-2: swap-1 size: 3.73 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-0
Sensors: System Temperatures: cpu: 50.0 C mobo: 27.8 C gpu: nvidia temp: 56 C
Fan Speeds (RPM): N/A
Info: Processes: 227 Uptime: 6m Memory: 7.76 GiB used: 1.55 GiB (20.0%) Shell: bash inxi: 3.0.36
That might happen the first time due to an early boot race condition, it's fine as long as it doesn't happen a second time. It's clearly loading correctly or Xorg would have fallen back to Intel.
ah ! ok, it indeed does happen only once per boot (if I understood you well), and indeed without any further problem that I'm aware of anyway. thx for your reply