How do I troubleshoot an NVMe drive that drops out under load?
I'm running a self-built machine with a MATX-CS612 motherboard. It runs Ubuntu Server 24.04 with a ZFS root. The machine tends to reboot under a very specific workload (running a Selenium workflow that downloads large data dumps from a set of websites), and I've narrowed the issue down to the drive.
I'm 99% certain the machine is otherwise fine: I've stress-tested it and worked through testing nearly every component. I'd initially suspected a PSU issue, and memtest passes.
The drive in question is a 1 TB Crucial P3+ NVMe drive.
I've run short and long self-tests, and the logs look OK:
```
geek@beepy:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 41 °C (314 K)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 1%
endurance group critical warning summary: 0
Data Units Read : 3014830 (1.54 TB)
Data Units Written : 17018276 (8.71 TB)
host_read_commands : 22947270
host_write_commands : 313549319
controller_busy_time : 3544
power_cycles : 347
power_on_hours : 4134
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 41 °C (314 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 20
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 10
```
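To make the health log easier to eyeball, I pull out the counters I understand to be the failure-relevant ones. A sketch against a saved, abridged dump of the output above (the file path is arbitrary):

```shell
# Sketch: extract failure-relevant counters from a saved smart-log dump.
# The dump below is abridged from the real output; the path is arbitrary.
cat > /tmp/smart-log.txt <<'EOF'
critical_warning : 0
unsafe_shutdowns : 250
media_errors : 0
num_err_log_entries : 0
EOF

out=$(awk -F':' '/critical_warning|unsafe_shutdowns|media_errors|num_err_log_entries/ {
  gsub(/ /, "", $1); gsub(/ /, "", $2); printf "%s=%s\n", $1, $2
}' /tmp/smart-log.txt)
echo "$out"
# -> critical_warning=0
#    unsafe_shutdowns=250
#    media_errors=0
#    num_err_log_entries=0
```

Both `media_errors` and `num_err_log_entries` being zero is why I say the logs look OK, even though `unsafe_shutdowns` reflects all the crashes.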
(The reads look a little high).
I've also pulled the self-test log, but the output seems very arcane:
```
geek@beepy:~$ sudo nvme self-test-log /dev/nvme0n1
Device Self Test Log for NVME device:nvme0n1
Current operation : 0
Current Completion : 0%
Self Test Result[0]:
Operation Result : 0
Self Test Code : 2
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1027
Vendor Specific : 0 0
Self Test Result[1]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1026
Vendor Specific : 0 0
Self Test Result[2]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1026
Vendor Specific : 0 0
Self Test Result[3]:
Operation Result : 0xf
Self Test Result[4]:
Operation Result : 0xf
...
Repeated results
...
Self Test Result[19]:
Operation Result : 0xf
geek@beepy:~$
```
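For anyone else decoding this: per my reading of the NVMe spec, a Self Test Code of 1 is a short test and 2 is an extended test, while an Operation Result of 0xf just marks an unused log slot. A minimal sketch of that mapping:

```shell
# Sketch: decode the two fields in the self-test log, per my reading of
# the NVMe spec (values outside these tables are passed through as-is).
decode_self_test_code() {
  case "$1" in
    1) echo "short self-test" ;;
    2) echo "extended self-test" ;;
    *) echo "other ($1)" ;;
  esac
}
decode_operation_result() {
  case "$1" in
    0)   echo "completed without error" ;;
    0xf) echo "entry unused" ;;
    *)   echo "other ($1)" ;;
  esac
}

decode_self_test_code 2       # -> extended self-test
decode_operation_result 0xf   # -> entry unused
```

Read that way, Result[0] is the extended test completing without error, and the 0xf entries from [3] onward are simply empty slots.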
Now, troubleshooting a drive that causes complete PC shutdowns is tricky. It works fine otherwise, and since I need that specific workload to trigger the issue, I figure I could do a fresh install on another, known-good drive, replicate the trigger, and proceed from there.
My problem here is: with the clean OS install, and assuming a similar failure mode, what should I be looking out for? What sort of logs would be useful for monitoring an NVMe drive that might be randomly dropping off under load?
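For concreteness, this is the kind of filtering I have in mind so far (a sketch: the sample lines are hypothetical kernel messages of the sort I'd expect, and the `journalctl` usage assumes a persistent journal so logs from before the crash survive):

```shell
# Sketch: filter kernel messages for NVMe resets, timeouts, and dropouts.
filter_nvme_errors() {
  grep -Ei 'nvme|I/O [0-9]+ QID|controller is down|probe failure'
}

# Intended use after a crash (assumes Storage=persistent in journald.conf):
#   journalctl -k -b -1 | filter_nvme_errors

# Demo on hypothetical sample lines:
printf '%s\n' \
  'nvme nvme0: I/O 512 QID 4 timeout, aborting' \
  'EXT4-fs (sda1): mounted filesystem' \
  | filter_nvme_errors
# -> nvme nvme0: I/O 512 QID 4 timeout, aborting
```

I don't know whether these are the right patterns to watch for, which is really what I'm asking.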
