How do I troubleshoot an NVMe drive that drops out under load?
I'm running a self-built machine with a MATX-CS612 motherboard. It runs Ubuntu Server 24.04 with a ZFS root. The machine tends to reboot under a very specific workload (running a Selenium workflow that downloads large data dumps from a set of websites), and I've narrowed the issue down to the drive.
I'm 99% certain the machine is otherwise fine: I've stress-tested it and worked through testing nearly every component. I'd initially suspected a PSU issue, and memtest passes.
The drive in question is a 1 TB Crucial P3+ NVMe drive.
I've run short and long self-tests, and the logs look OK:
```
geek@beepy:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 41 °C (314 K)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 1%
endurance group critical warning summary: 0
Data Units Read : 3014830 (1.54 TB)
Data Units Written : 17018276 (8.71 TB)
host_read_commands : 22947270
host_write_commands : 313549319
controller_busy_time : 3544
power_cycles : 347
power_on_hours : 4134
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 41 °C (314 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 20
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 10
```
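To make the health log easier to eyeball, I pull out the counters I understand to be the failure-relevant ones. A sketch against a saved, abridged dump of the output above (the file path is arbitrary):

```shell
# Sketch: extract failure-relevant counters from a saved smart-log dump.
# The dump below is abridged from the real output; the path is arbitrary.
cat > /tmp/smart-log.txt <<'EOF'
critical_warning : 0
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 0
EOF

out=$(awk -F':' '/critical_warning|unsafe_shutdowns|media_errors|num_err_log_entries/ {
  gsub(/ /, "", $1); gsub(/ /, "", $2); printf "%s=%s\n", $1, $2
}' /tmp/smart-log.txt)
echo "$out"
# -> critical_warning=0
#    unsafe_shutdowns=250
#    media_errors=0
#    num_err_log_entries=0
```

Both `media_errors` and `num_err_log_entries` being zero is why I say the logs look OK, even though `unsafe_shutdowns` reflects all the crashes.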
(The reads look a little high).
I've also pulled the self-test log, but the output seems very arcane:
```
geek@beepy:~$ sudo nvme self-test-log /dev/nvme0n1
Device Self Test Log for NVME device:nvme0n1
Current operation : 0
Current Completion : 0%
Self Test Result[0]:
Operation Result : 0
Self Test Code : 2
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1027
Vendor Specific : 0 0
Self Test Result[1]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1026
Vendor Specific : 0 0
Self Test Result[2]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1026
Vendor Specific : 0 0
Self Test Result[3]:
Operation Result : 0xf
Self Test Result[4]:
Operation Result : 0xf
...
Repeated results
...
Self Test Result[19]:
Operation Result : 0xf
geek@beepy:~$
```
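For anyone else decoding this: per my reading of the NVMe spec, a Self Test Code of 1 is a short test and 2 is an extended test, while an Operation Result of 0xf just marks an unused log slot. A minimal sketch of that mapping:

```shell
# Sketch: decode the two fields in the self-test log, per my reading of
# the NVMe spec (values outside these tables are passed through as-is).
decode_self_test_code() {
  case "$1" in
    1) echo "short self-test" ;;
    2) echo "extended self-test" ;;
    *) echo "other ($1)" ;;
  esac
}
decode_operation_result() {
  case "$1" in
    0)   echo "completed without error" ;;
    0xf) echo "entry unused" ;;
    *)   echo "other ($1)" ;;
  esac
}

decode_self_test_code 2       # -> extended self-test
decode_operation_result 0xf   # -> entry unused
```

Read that way, Result[0] is the extended test completing without error, and the 0xf entries from [3] onward are simply empty slots.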
Now, troubleshooting a drive that causes complete PC shutdowns is tricky. It works fine otherwise, and since I need that specific workload to trigger the issue, I figure I could do a fresh install on another, known-good drive, replicate the trigger, and proceed from there.
My problem here is: with the clean OS install, and assuming a similar failure mode, what should I be looking out for? What sort of logs would be useful for monitoring an NVMe drive that might be randomly dropping off under load?
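For concreteness, this is the kind of filtering I have in mind so far (a sketch: the sample lines are hypothetical kernel messages of the sort I'd expect, and the `journalctl` usage assumes a persistent journal so logs from before the crash survive):

```shell
# Sketch: filter kernel messages for NVMe resets, timeouts, and dropouts.
filter_nvme_errors() {
  grep -Ei 'nvme|I/O [0-9]+ QID|controller is down|probe failure'
}

# Intended use after a crash (assumes Storage=persistent in journald.conf):
#   journalctl -k -b -1 | filter_nvme_errors

# Demo on hypothetical sample lines:
printf '%s\n' \
  'nvme nvme0: I/O 512 QID 4 timeout, aborting' \
  'EXT4-fs (sda1): mounted filesystem' \
  | filter_nvme_errors
# -> nvme nvme0: I/O 512 QID 4 timeout, aborting
```

I don't know whether these are the right patterns to watch for, which is really what I'm asking.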
