
NVMe regression Macbook Air 7,1 (early 2015) - SSD not detected on recent kernels
Open, Needs More Info; Public

Description

Running Solus 3 on a Macbook Air 7,1 (early 2015). Recent kernel updates fail to boot; I suspect there might be a regression in the nvme driver.

4.16.15-76.current is the last kernel I've managed to boot on my system. Every update since then fails to boot: dracut cannot find the root partition, and no /dev/nvme* devices are present at all. The Solus 3.9999 live USB also fails to detect the SSD; an earlier version (Solus 3? It's been a while since I installed...) detected it fine.

It's worth noting that the Fedora 29 and Ubuntu 18.10 live USBs also fail to detect the SSD, so I *think* this issue might come from upstream? There was an upstream bug in 2015 to work around problems with this exact NVMe SSD/controller: https://bugzilla.kernel.org/show_bug.cgi?id=105621

Note that in the following output Solus does not report an SSD as such, merely an Apple NVMe controller.

~$ uname -a
Linux hyatt 4.16.15-76.current #1 SMP PREEMPT Tue Jun 12 20:51:13 UTC 2018 x86_64 GNU/Linux

~$ lspci -vvkx -s 04:00.0
04:00.0 Mass storage controller: Apple Inc. S1X NVMe Controller (rev 01) (prog-if 02)
	Subsystem: Apple Inc. S1X NVMe Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin A routed to IRQ 48
	NUMA node: 0
	Region 0: Memory at c1300000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/8 Maskable- 64bit+
		Address: 00000000fee00358  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <32us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [148 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Kernel driver in use: nvme
00: 6b 10 01 20 06 04 10 08 01 02 80 01 40 00 00 00
10: 04 00 30 c1 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 6b 10 01 20
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00

Event Timeline

fpiesche renamed this task from NVMe regression - upstream? to NVMe regression Macbook Air 7,1 (early 2015) - SSD not detected on recent kernels. Oct 30 2018, 10:30 PM
fpiesche updated the task description.
DataDrake edited projects, added Hardware; removed Lacks Project. Oct 30 2018, 11:49 PM

So two questions for ya:

  1. Are you sure that it was exactly that version? Looking at the git log, there is nothing in release 77 to indicate that NVMe support changed at all.
  2. Where do I get a time machine? That kernel is so old, it was literally the last one published before I took over kernel updates. Seriously though, if you don't mind me asking: why did you wait so long to report, and how the hell have you survived on such an old kernel?
JoshStrobl triaged this task as Needs More Info priority. Oct 31 2018, 12:10 AM
fpiesche added a comment. Edited Oct 31 2018, 12:15 AM

I'm not sure if it was exactly that version, but it is certainly the only one I've got installed that works, and the one I've been running ever since the problem occurred (a few months now). I have intermittently tried the newer kernels as updates rolled out, but none of the ones I've tried since have worked. What was the next version published after 4.16.15-76? I'll be happy to install newer versions incrementally, release by release, until I find the point where it breaks :)
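Stepping through every release one at a time works, but a bisection needs far fewer reboots. The following is only a minimal sketch of that idea, not anything Solus provides: `first_bad_release` and the `boots` callback are hypothetical names, the callback stands in for manually installing and booting a release, and the sketch assumes there is a single point after which every release fails.

```python
from typing import Callable


def first_bad_release(good: int, bad: int, boots: Callable[[int], bool]) -> int:
    """Return the lowest release number in (good, bad] that fails to boot.

    Assumes `good` boots, `bad` does not, and that there is exactly one
    breakpoint between them (every release after it also fails).
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots(mid):
            good = mid  # breakage was introduced after `mid`
        else:
            bad = mid   # breakage was introduced at or before `mid`
    return bad


if __name__ == "__main__":
    # 76 and 96 are the known-good and known-bad releases from this task.
    # The breakpoint of 83 is made up purely to demonstrate the search;
    # in practice `boots` would be answered by actually booting the kernel.
    simulated = lambda release: release < 83
    print(first_bad_release(76, 96, simulated))
```

With 20 releases between the two known endpoints, this needs about five test boots instead of up to twenty.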

The main reason I've waited so long before reporting is that I've had A Lot going on life-wise (health stuff, new baby, you name it) so I've only had very intermittent spots of time to try and fix the problem - though I did post on the Solus forums about it a few weeks back, before I noticed all the nvme devices were missing in the dracut shell.

As for survival on that old a kernel, it's not too terrible in day-to-day use. It's only this week or so that I noticed the headers for 4.16.15-76 seem to have aged off my system and I can't compile the FaceTime camera drivers anymore, which is what prompted me to make some time to dig into the issue properly...

bwat47 added a subscriber: bwat47. Oct 31 2018, 12:39 AM

I was playing with Solus on my 2013 Macbook Air the other day and noticed the SSD not being detected at all, as you're describing. I found that it works and detects the SSD if I boot with intel_iommu=off; might be worth a try.

I've tried booting 4.18.16-96 with intel_iommu=off, which sadly still ran into the same issue. For what it's worth, the switch to NVMe SSDs came with Apple's 2015 hardware revision, which is likely why that fix doesn't work here...

I could really use a dmesg log to debug this. Afaict the flag for that controller is still enabled in the kernel.

fpiesche added a comment. Edited Oct 31 2018, 11:26 AM

I'll run the live USB tonight and get a dmesg log off of there. Anything else that might be useful while I'm there?

It only now strikes me: the lspci output above is from the Solus system with it running under 4.16. I'll check if that differs at all on the 3.9999 live USB as well, just in case...

sudo journalctl -b wouldn't hurt

Here you go!

Just to follow up on this (even though it's been a while): it seems this is related to the issue discussed at https://github.com/Dunedan/mbp-2016-linux/issues/71#issuecomment-507325112 (see also https://twitter.com/mjg59/status/1149060258116956160).

The tl;dr is that some of Apple's NVMe SSDs have wrong PCI device class identifiers, which identify them as controllers without attached storage (or something to that effect), and there may be other subtle differences in behaviour that trip up the nvme module in the kernel itself. Most certainly an upstream issue, in either case.
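The class identifier can be read directly from the `lspci -x` hex dump in the task description. The following is a minimal sketch, assuming the standard PCI configuration header layout (vendor at offset 0x00, device at 0x02, prog-if/subclass/base class at 0x09 to 0x0b) and assuming 0x010802 is the class code a spec-conformant NVMe controller would report; `decode_header` is a hypothetical helper name, not part of any tool.

```python
# First 16 bytes of PCI config space, copied from the "00:" row of the
# lspci -x dump in the task description.
HEADER_ROW = "6b 10 01 20 06 04 10 08 01 02 80 01 40 00 00 00"

# Assumption: class code of a spec-conformant NVMe controller
# (base class 0x01 mass storage, subclass 0x08 NVM, prog-if 0x02).
NVME_CLASS_CODE = 0x010802


def decode_header(row: str) -> dict:
    """Decode vendor, device, revision and class code from a hex row."""
    raw = bytes.fromhex(row.replace(" ", ""))
    return {
        "vendor": int.from_bytes(raw[0:2], "little"),
        "device": int.from_bytes(raw[2:4], "little"),
        "revision": raw[8],
        # Offsets 0x09, 0x0a, 0x0b: prog-if, subclass, base class.
        "class_code": int.from_bytes(raw[9:12], "little"),
    }


info = decode_header(HEADER_ROW)

if __name__ == "__main__":
    print(f"vendor=0x{info['vendor']:04x} device=0x{info['device']:04x}")
    print(f"class=0x{info['class_code']:06x} "
          f"(generic NVMe would be 0x{NVME_CLASS_CODE:06x})")
    print("matches generic NVMe class:", info["class_code"] == NVME_CLASS_CODE)
```

For this controller the subclass byte is 0x80 ("other mass storage") rather than 0x08, which is consistent with lspci labelling it only as a "Mass storage controller". A driver would then have to recognise the device by its vendor and device IDs rather than by class.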