DGX Spark: low fan speed, high temps, device very hot

I’m not sure if anyone else is experiencing this. I received the device on Monday, got it booted and updated.

Yesterday I got around to actually setting it up. I’m just running models via ollama to test things out, and I’m noticing it gets very hot while the air movement out the back is practically nonexistent regardless of load or temps.

It burned my hand when I set my palm on it. It was so hot that it still kind of hurts a few hours later, and I wouldn’t consider my hands soft (I do yard work, woodwork, and other stuff on weekends).

There’s nothing in the BIOS to adjust pretty much anything, and I’m not seeing any way to control the fan via software. Surely it’s not supposed to get this hot? I don’t even hear the fan, just a slight coil whine.

The system is fully updated.

-> % nvidia-smi                      
Thu Oct 23 17:32:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   40C    P8              4W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I can see a fan detected in sensors, but it doesn’t appear to be detected correctly.

-> % sensors
mlx5-pci-20101
Adapter: PCI adapter
asic:         +44.0°C  (crit = +105.0°C, highest = +67.0°C)

mlx5-pci-0101
Adapter: PCI adapter
asic:         +44.0°C  (crit = +105.0°C, highest = +67.0°C)

nvme-pci-40100
Adapter: PCI adapter
Composite:    +39.9°C  (low  = -273.1°C, high = +82.8°C)
                       (crit = +84.8°C)
Sensor 1:     +39.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +39.9°C  (low  = -273.1°C, high = +65261.8°C)

acpi_fan-acpi-0
Adapter: ACPI interface
fan1:           2 RPM
power1:        5.00 mW 

mt7925_phy0-pci-90100
Adapter: PCI adapter
temp1:        +38.0°C  

mlx5-pci-20100
Adapter: PCI adapter
asic:         +44.0°C  (crit = +105.0°C, highest = +67.0°C)

mlx5-pci-0100
Adapter: PCI adapter
asic:         +44.0°C  (crit = +105.0°C, highest = +67.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +38.8°C  
temp2:        +37.8°C  
temp3:        +38.7°C  
temp4:        +37.8°C  
temp5:        +38.7°C  
temp6:        +38.8°C  
temp7:        +37.8°C  

Fans are controlled entirely by firmware. No PWM or BMC control is possible. I tried, with no result …

Your unit could be defective. Mine gets hot, and the fan is not loud but clearly audible at high loads. What temperatures are you seeing under load? I’ve seen up to 90°C so far (on the CPU side; the GPU doesn’t get that hot).
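One way to answer the under-load question with numbers is to log the kernel's thermal zones while the workload runs. The following is a minimal Python sketch, not an NVIDIA tool. It assumes the standard Linux sysfs layout (/sys/class/thermal/thermal_zone*/temp, reported in millidegrees Celsius); whether those zones line up one-to-one with the temp1 to temp7 entries that sensors prints on the Spark is an assumption. The function names, the 5-second interval, and the sample count are made up for illustration.

```python
import time
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: (zone_type, temp_celsius)} for every readable zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            # The kernel exposes the temperature in millidegrees Celsius.
            temp_c = int((zone / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone disappeared or returned an unreadable value
        readings[zone.name] = (zone_type, temp_c)
    return readings


def log_temperatures(interval_s=5.0, samples=12, base="/sys/class/thermal"):
    """Print one timestamped line per sample with every zone's temperature."""
    for _sample in range(samples):
        readings = read_thermal_zones(base)
        line = "  ".join(
            f"{name}={temp:.1f}C" for name, (_type, temp) in readings.items()
        )
        print(time.strftime("%H:%M:%S"), line, flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    log_temperatures()
```

Starting this in a second terminal before launching the model and pasting the output here would show how high the zones climb and how quickly they come back down once the fan reacts.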

These thermal performance reports are mostly BS. Did a lot of benchmarks and the fans work really well under load and continuous hours of operation.

Possibly early firmware issue or some defective units. I’ve had mine crunching for hours on both CPU and GPU and it didn’t overheat. Granted, it’s not that hot here right now, but not cold either.

It does get hot to the point where it would be uncomfortable to pick it up while it’s working; this is normal based on the reviews I’ve seen. Aluminium is a good conductor of heat, and this obviously wasn’t designed with the idea that you’d be picking it up while it’s running. It’s more of a device you put in a corner and forget about. You’d know if it was getting too hot, as it would reset or crash; GPUs are quite sensitive like that. Personally, I’ve done training runs that lasted 24 hours with no issues.

The only time the DGX Spark seems to get upset is if it runs out of memory and then bleeds into swap space. I’ve found that just causes the entire system to lock up, and the only way to get it working again seems to be to turn it off and back on again. I’ll probably start a topic on this issue at some point. In terms of thermals though, I haven’t observed anything worrying. The highest I’ve seen the GPU go is the late 80s, and then the fan kicks in pretty quickly to bring it back down. I’m training on it at the moment and I can see it hit 84°C every now and then.


Just experienced this for the first time today. I forgot that I was already running a vllm instance that was consuming almost all the memory, and launched llama-server with gpt-oss-120b on top of it :) By the time I realized my mistake, it had started swapping and crawled to an almost complete stop. I quickly typed killall -KILL llama-server, but it took about 5 minutes to process my command and actually kill the process, after which everything went back to normal. I was almost ready to just turn it off, like you said.

I guess running into swap is a no-no on this system for some reason. I was on stock DGX OS; I guess I need to try that on my Fedora install too, to see if it improved in 6.17.


I didn’t mention this, but I’ve actually disabled my swap file because of this issue. Then the process just crashes, or worst case the system resets, but at least doesn’t lock up so I don’t have to restart it manually. You’re lucky you managed to issue a kill command. Normally it doesn’t give you the chance.

I was connected via ssh and had to wait a couple of minutes until my command even appeared in the terminal :) X11 session was dead though :)


Oh, that I went through already. I’m actually writing an emergency safeguard service to address those situations where things go off track pretty quickly. Will share the repo next week.
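Until that repo is shared, here is a rough idea of what a low-memory check could look like. This is not the service mentioned above, only a minimal Python sketch that reads /proc/meminfo and warns when the kernel's MemAvailable estimate drops below a threshold. The 5% threshold, the 1-second polling interval, and the function names are hypothetical, and how well MemAvailable tracks GPU allocations on the Spark's shared memory is an assumption.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a {field: value} dict (values in kB)."""
    info = {}
    for line in text.splitlines():
        key, _sep, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Fraction of RAM the kernel estimates can be used without swapping."""
    return info["MemAvailable"] / info["MemTotal"]


def memory_is_low(info, min_fraction=0.05):
    """True when available memory is below the (hypothetical) threshold."""
    return available_fraction(info) < min_fraction


if __name__ == "__main__":
    import time

    while True:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        if memory_is_low(info):
            print(
                f"WARNING: only {available_fraction(info):.1%} of RAM available",
                flush=True,
            )
        time.sleep(1.0)
```

The sketch only prints a warning. Deciding what to kill, and doing it before the system starts swapping, is the hard part that a real safeguard would have to handle.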


I’m using an Asus GX10. The device is too hot. When running at full load, the temperature rises to 95°C.

I thought they would shut down over 90°C. From what I’ve read, the temps are lower on the ASUS unit than on the NVIDIA model due to enclosure materials.

Can you share the workload? I haven’t seen this on my Spark, even under heavy training.

Not really. The Spark’s thermal performance under heavy training workloads has been great. I haven’t tested on the Asus yet.


Can we get a more aggressive fan curve?

This is all based on observation, but as of this week I’ve been getting a bunch of random shutdowns. If the cause is a thermal shutdown, there are too many factors (workload, software update, etc.) for me to isolate it.

From monitoring netdata and correlating metrics, the fans only seem to start ramping up in the high 80s or once past 90°C. Or is it tied to the GPU temperature (that seems generally correlated with 80°C, though)? Maybe they could start spinning at 85-90°C, which would create some leeway?

I haven’t gotten any shutdowns under the same workload since moving the Spark to a different location, so maybe I got hit with a weird combination of environment and a power surge that spiked the temperature enough to trigger a shutdown?

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +92.8°C  
temp2:        +84.3°C  
temp3:        +74.7°C  
temp4:        +84.5°C  
temp5:        +74.7°C  
temp6:        +92.8°C  
temp7:        +77.8°C  

What is measured by temp6 (temp1 is probably the max of the rest)? It has been teetering upwards of 95°C with the fan at a low ramp (as of posting this, it just shut down on me).

Just execute this in the command line to report the issue:

sudo nvidia-bug-report.sh

The system is shut down? How would you run that to get any useful information if it is just off?
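After an unexpected power-off, the kernel log from the previous boot is sometimes the only place a thermal shutdown leaves a trace. Below is a rough Python sketch for pulling it out after the next boot. It assumes systemd's journal is persistent on the install (otherwise journalctl has no previous boot to show) and that the kernel had a chance to log something before power was cut. The keyword list is a guess, not a list of messages known to appear on the Spark.

```python
import subprocess

# Hypothetical keyword list; adjust after looking at the real log.
KEYWORDS = ("thermal", "critical temperature", "overheat", "throttl")


def thermal_lines(log_text, keywords=KEYWORDS):
    """Return log lines mentioning any thermal keyword (case-insensitive)."""
    return [
        line
        for line in log_text.splitlines()
        if any(k in line.lower() for k in keywords)
    ]


def previous_boot_kernel_log():
    """Kernel messages from the previous boot (-b -1), without a pager."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    for line in thermal_lines(previous_boot_kernel_log()):
        print(line)
```

If this prints nothing, that does not rule out a thermal cause: firmware can cut power without the kernel logging anything first.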

Could you share directions on how to reproduce your workload? I can run it on my setup and see if I can replicate the issue. Maybe you have defective hardware?

The jobs are assigned dynamically, so it is not reproducible. The difference in the workload is that more GPU-resident jobs are being assigned to run in parallel.

Can you share the type of the most expensive operation you’re performing in those jobs and specify the top load scenario? I can try to simulate that locally with a stress tool.