
Welcome to the Power Users community on Codidact!

How do I troubleshoot an NVMe drive that drops out under load?


I'm running a self-built machine with a MATX-CS612 motherboard. It runs Ubuntu Server 24.04 with a ZFS root. The machine tends to reboot under a very specific workload (running a Selenium workflow that downloads large data dumps from a set of websites), and I've narrowed the issue down to the drive.

I'm 99% certain the machine is otherwise fine: I've stress-tested it and worked through testing nearly every component. I initially suspected a PSU issue, and memtest passes.

The drive in question is a 1 TB Crucial P3+ NVMe drive.

I've run short and long self-tests, and the logs look OK:

geek@beepy:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 41 °C (314 K)
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 1%
endurance group critical warning summary: 0
Data Units Read                         : 3014830 (1.54 TB)
Data Units Written                      : 17018276 (8.71 TB)
host_read_commands                      : 22947270
host_write_commands                     : 313549319
controller_busy_time                    : 3544
power_cycles                            : 347
power_on_hours                          : 4134
unsafe_shutdowns                        : 250
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 41 °C (314 K)
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 20
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 10

(The reads look a little high.)
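As a sanity check on those totals: NVMe "Data Units" are reported in units of 1000 × 512-byte blocks, i.e. 512,000 bytes each, so the TB figures in the log line up with the raw counts:

```shell
# Each NVMe "Data Unit" is 1000 * 512-byte blocks = 512,000 bytes.
echo $(( 3014830  * 512000 ))   # 1543592960000 bytes read    (~1.54 TB)
echo $(( 17018276 * 512000 ))   # 8713357312000 bytes written (~8.71 TB)
```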

I've tried running SMART tests, but the output seems very arcane:

geek@beepy:~$ sudo nvme self-test-log /dev/nvme0n1
Device Self Test Log for NVME device:nvme0n1
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1027
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1026
  Vendor Specific              : 0 0
Self Test Result[2]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1026
  Vendor Specific              : 0 0
Self Test Result[3]:
  Operation Result             : 0xf
Self Test Result[4]:
  Operation Result             : 0xf
...
Repeated results
...
Self Test Result[19]:
  Operation Result             : 0xf
geek@beepy:~$
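For what it's worth, my reading of those fields, going by the NVMe spec (so take it with a grain of salt): `Self Test Code` 1 is a short test and 2 is an extended one, `Operation Result` 0 means the test completed without error, and 0xf marks an unused log entry; the POH values are power-on hours in hex at test time (0x1027 = 4135, consistent with the smart-log above). A small decoder along those lines:

```shell
# Decode one self-test log entry (my interpretation of the NVMe spec fields).
decode_selftest() {
  local code="$1" result="$2" desc status
  case "$code" in
    1) desc="short device self-test" ;;
    2) desc="extended device self-test" ;;
    *) desc="unknown self-test code $code" ;;
  esac
  case "$result" in
    0)   status="completed without error" ;;
    0xf) status="entry not used" ;;
    *)   status="result code $result" ;;
  esac
  echo "$desc: $status"
}

decode_selftest 2 0   # -> extended device self-test: completed without error
decode_selftest 1 0   # -> short device self-test: completed without error
```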

Now, troubleshooting a drive that causes complete PC shutdowns is tricky. It works fine otherwise, but since I need that specific workload to trigger the issue, I figure I could do a fresh install on another, known-good drive, replicate the trigger, and proceed from there.

My problem here is: with the clean OS install, and assuming there's a similar failure mode, what should I be looking out for? What sort of logs would be useful for monitoring an NVMe drive that might be randomly dropping off under load?
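For reference, this is roughly the snapshot loop I was planning to leave running on the fresh install (a sketch: the device path and log location are my choices, and it assumes nvme-cli is installed):

```shell
#!/usr/bin/env bash
# Append a timestamped SMART snapshot plus recent NVMe-related kernel
# messages to a log file, so the last entry before a crash survives.
snapshot() {
  local dev="$1" log="$2"
  {
    date --iso-8601=seconds
    nvme smart-log "$dev"
    dmesg --ctime | grep -i 'nvme' | tail -n 20
    echo '---'
  } >> "$log"
  sync   # flush the entry to the (known-good) disk before any sudden reboot
}

# e.g. run once a minute:
# while true; do snapshot /dev/nvme0n1 /var/log/nvme-watch.log; sleep 60; done
```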

1 comment thread

Is the loss of the root drive fatal? (2 comments)