
GPU hangs and crashes XOrg on Unstable
Closed, Resolved, Public

Description

After recently switching to the Unstable repo, my Xorg session dies when resuming from suspend, and I am greeted by a fresh lightdm login screen.
On the stable repo everything was working fine.
After digging through the logs I found that my Intel HD 630 hangs and resets itself, which crashes Xorg.
The /sys/class/drm/card0/error log is attached.
Because it was working fine on shannon, I suspect that some recent kernel/firmware/driver upgrade that hasn't been synced to stable yet is causing this crash.
This may be an upstream regression, but I don't know that yet.
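
For anyone wanting to capture the same error state for a report, here is a minimal sketch. It is not part of the original report: the function name save_gpu_error is made up for illustration, and it simply copies the sysfs file named above into a regular file. Reading the sysfs path may require root on some setups.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): copy a GPU error
# state file to a regular file so it can be attached to a bug report.
# $1: source file; defaults to the sysfs path named in this report
# $2: destination file
# Returns non-zero if the source is unreadable or the copy ends up empty.
save_gpu_error() {
    src=${1:-/sys/class/drm/card0/error}
    dest=$2
    [ -r "$src" ] || return 1
    # cat rather than cp, since sysfs files do not report a meaningful size
    cat "$src" > "$dest" || return 1
    [ -s "$dest" ]
}
```

It could be called as: save_gpu_error "" "gpu-error-$(date +%Y%m%d-%H%M%S).txt" (the empty first argument selects the default sysfs path).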

Here is the drm-related part of my dmesg:

➜  ~ dmesg | grep drm 
[    1.833094] [drm] Initialized
[    2.856236] [drm] Memory usable by graphics device = 4096M
[    2.856239] fb: switching to inteldrmfb from simple
[    2.856436] [drm] Replacing VGA console driver
[    2.908890] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    2.908890] [drm] Driver supports precise vblank timestamp query.
[    2.918644] [drm] Finished loading i915/kbl_dmc_ver1_01.bin (v1.1)
[    2.920516] [drm] GuC firmware load skipped
[    3.276490] [drm] Initialized i915 1.6.0 20160919 for 0000:00:02.0 on minor 0
[    3.283265] fbcon: inteldrmfb (fb0) is primary device
[    3.283430] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    4.784248] [drm] RC6 on
[   34.199871] [drm] GuC firmware load skipped
[   35.278609] [drm] RC6 on
[   46.772197] [drm] GPU HANG: ecode 9:0:0x89074f16, in Xorg [793], reason: Hang on render ring, action: reset
[   46.772199] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   46.772199] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   46.772199] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   46.772199] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   46.772200] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   47.530492] drm/i915: Resetting chip after gpu hang
[   47.530814] [drm] RC6 on
[   47.546779] [drm] GuC firmware load skipped
[   56.755606] drm/i915: Resetting chip after gpu hang
[   56.755661] [drm] RC6 on
[   56.772176] [drm] GuC firmware load skipped
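
The hang events in output like the above can be pulled out with a short filter. This is a minimal sketch rather than anything from the original report: the function names gpu_hang_summary and gpu_reset_count are invented for illustration, and the pattern assumes the "GPU HANG: ecode ..., in PROCESS [PID], reason: ..., action: ..." line format seen in this log, which other kernel versions may not use.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print one line per
# GPU hang in the form "<timestamp> <process> <reason>".
# Assumes the i915 message format shown in the dmesg output above.
gpu_hang_summary() {
    sed -n 's/^\[ *\([0-9.]*\)] \[drm] GPU HANG: .*, in \([^ ]*\) \[[0-9]*], reason: \(.*\), action: .*$/\1 \2 \3/p'
}

# Hypothetical helper: count the "Resetting chip after gpu hang" messages
# found on stdin.
gpu_reset_count() {
    grep -c 'Resetting chip after gpu hang'
}
```

Typical use would be: dmesg | gpu_hang_summary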

Event Timeline

taaem created this task. Jul 12 2017, 12:36 AM

How recently? I.e. what does eopkg info xorg-server show, and which release is installed? We need to pinpoint the issue; it is most likely one of the kernel update, the xorg-server update, or the mesa update.

taaem added a comment. Jul 12 2017, 8:43 AM

I switched to unstable last Saturday, I think, and I have the latest xorg from unstable, so 1.18.something. I updated that one yesterday, but I don't think it's an xorg issue, more likely kernel or Mesa. I may locally revert the latest changes and test it myself.
If it is relevant, this is on a Dell XPS 15 9560.

sunnyflunk added a comment. Edited Jul 12 2017, 9:52 AM

Well, it was more that I had literally just pushed an xorg-server update not long before this post. Being able to rule that out would be useful. You should be able to boot into the old kernel, which will indicate whether the kernel is the issue.

On UEFI, you probably have to spam space during boot to get the boot menu to show.

If it's GRUB, the old kernel is in a submenu of the boot menu.

There was also a linux-firmware update, I think.

Yeah, I ran clr-boot-manager update again to remove some of my cmdline options, to make sure those aren't the reason, and thus I only have the current kernel.
I'm on UEFI and I set a timeout for the boot menu, so I always see the options.
If you mean the CVE update to xorg yesterday, it was happening both before and after that.
According to the logs, after the GPU hangs systemd tries to restart all services, which creates new sessions for my user and lightdm.

taaem added a comment. Jul 13 2017, 3:45 PM

Okay, this seems to be a regression in the kernel.
At least the latest 4.9.37 kernel update seems to have fixed the issue; I tried it a few times and it always resumed normally without any crashes.
I'll leave this open until I can confirm that it is definitely fixed.

DataDrake edited projects, added Hardware; removed Lacks Project. Jul 20 2017, 1:53 AM
taaem added a comment. Jul 20 2017, 9:46 AM

Next update on this: I was experimenting with systemd services that wake the computer from suspend in order to hibernate it if it is left untouched for some time. Waking from suspend this way always crashed Xorg, while a normal suspend only did that sometimes.
After rolling back from the latest linux-firmware to the version we have now, the suspend-to-hibernate setup worked like a charm, so I suspect the issue was somewhere in the firmware stack.

Is this still valid with the latest kernel + mesa? 17.1.4 was especially janky; 17.1.5 seems much nicer.

So I just updated everything on unstable and it seems to be working just fine now, so it's either mesa, the kernel, or linux-firmware.

I'm betting on mesa. Blaz was having segfaults in budgie-wm on startup with 17.1.4; 17.1.5 resolved it.

Okay, I'll leave this open for a few days so I get some time to test this, and then I'll close it as resolved.

taaem added a comment. Jul 20 2017, 2:47 PM

Okay, so the crash happened again, and now I'm more and more thinking it's because I use a systemd service to switch from normal suspend to hibernate after one hour, and for some reason it can't restore from that because something is screwed up.
Just for reference, here is my systemd unit:

[Unit]
Description=Delayed hibernation trigger
Before=suspend.target
Conflicts=hibernate.target hybrid-sleep.target
StopWhenUnneeded=true

[Service]
Type=oneshot
RemainAfterExit=yes
Environment="WAKEALARM=/sys/class/rtc/rtc0/wakealarm"
Environment="ALARM_SEC=3600"
ExecStart=/usr/sbin/rtcwake --seconds $ALARM_SEC --auto --mode no
ExecStop=/bin/sh -c '\
  alarm=$(cat $WAKEALARM); \
  now=$(date +%%s); \
  if [ -z "$alarm" ] || [ "$now" -ge "$alarm" ]; then \
     echo "suspend-to-hibernate: Woke up - no alarm set. Hibernating..."; \
     systemctl hibernate; \
  else \
     echo "suspend-to-hibernate: Woke up before alarm - normal wakeup"; \
  fi; \
  /usr/sbin/rtcwake --auto --mode disable; \
'

[Install]
WantedBy=sleep.target
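
The decision made in the ExecStop step above can be tried out on its own, outside systemd. This is a minimal sketch with an invented function name (wake_decision); it takes the wakealarm contents and the current time as arguments instead of reading them itself, so the logic can be checked without touching the RTC. As far as I understand it, the wakealarm file reads back empty when no alarm is pending, which is the case the -z test covers.

```shell
#!/bin/sh
# Hypothetical helper mirroring the ExecStop check in the unit above.
# $1: contents of the RTC wakealarm file (assumed empty when no alarm is pending)
# $2: current time in seconds since the epoch
# Prints "hibernate" if the alarm is gone or already due, otherwise "normal".
wake_decision() {
    alarm=$1
    now=$2
    if [ -z "$alarm" ] || [ "$now" -ge "$alarm" ]; then
        echo "hibernate"
    else
        echo "normal"
    fi
}
```

On a live system it could be called as: wake_decision "$(cat /sys/class/rtc/rtc0/wakealarm)" "$(date +%s)". Note the single % here: the doubled %% in the unit file is only needed because systemd treats % as the start of a specifier.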

taaem added a comment. Jul 20 2017, 6:40 PM

Okay, waking up after a normal suspend seems to crash the lock screen, but it just restarts and it's fine after that; I can unlock the session and everything is back to how it was.

Regarding the lock screen: try with the latest gnome-screensaver update.

taaem added a comment. Jul 21 2017, 1:12 PM

Updated to the latest gnome-screensaver and tried normal hibernation, which worked perfectly fine, and gnome-screensaver didn't crash.
Also, my suspend-to-hibernate setup then worked flawlessly two times in a row. So it was probably related to some bug in gnome-screensaver.

Okay, so I updated to linux-current 2 weeks ago and have had no lockup/crash since then, so I suppose it was a kernel bug, because linux-lts doesn't like my shiny new XPS 15 2017.
Anyway, I think this can be closed as fixed in the updated kernel.

taaem closed this task as Resolved. Aug 20 2017, 10:23 AM
taaem claimed this task.