Status and Experience on Thermal Performance

There are several heat-related reports in this forum (e.g. “DGX Spark. low fan speed, high temps, device very hot” and “DGX Spark Bundle Cooling Makeshift Rack - #4 by maiia”).

What is your latest experience? Some have successfully trained nanochat on the Spark (a straight run of 9+ days) and have not reported any heat-related issues.

And what’s your experience with the partners’ (Asus, Acer, Dell, HP, MSI, Lenovo, ..) models, in terms of thermal performance?

I was running a TensorRT-LLM optimization on gpt-oss-20b, and it got hot:

From my own script (AI wrote, but I directed it).

I wish that DGX Spark Bundle Cooling Makeshift Rack - #4 by maiia existed. I would totally buy one if it were <$50. But for the time being, I got my Arduino Uno Q to go with the DGX Spark. I am thinking of building something like this: Total waste of a new state-of-the-art device, a perfect way to show off ;-)

That makeshift rack cost four figures and won’t do much.

This is what I currently have, but I’m replacing my TARGUS laptop fan with the QUICARD.

Please be VERY MINDFUL with the TECs: they DO make ICE. That’s why I have them below the vents and not touching the Sparks.

That actually exists? :o Wow, I feel so .. But now that you tested, yea, it would be bad to get that condensation forming inside the DGX. Thanks for the update.

If you’re asking about the TEC from the previous thread, that’s the FS12-45 Tablet cooler, and it is ~$29.

The QUICARD I’m replacing the TARGUS with is $69.99.

The acrylic stand and TEC are $39.99.

The actual “rack” idea with proper fans and TECs was over $4K and was not worth the build for minimal impact given how insulated the Sparks are.

https://www.aliexpress.us/item/3256809021967972.html?gatewayAdapt=glo2usa4itemAdapt

With the laptop fan base with built-in TECs, the units propped up on the open acrylic stands, and the vents kept unobstructed with the mini TECs below, the units have been cold on standby and cool when running AI Workbench. I haven’t run anything heavy or overnight yet. So far this setup looks promising, or at least supportive.

Can you share the script?

Dang! That looks like it’s about to take off to explore the Galaxy… or take over the world at least. Well, at least with that 4i4, it will be DJing killer death-rock music while destroying humans.

A tip for future human-survivors:

When the AI singularity does happen, in order to fight the AI Lord, just push a broken change as a new version of the flashinfer Python library; at least it will slow it down a bit ;-)


@haidij Yes I will, but the script is super dangerous, as you can imagine, so I need to make it usable by others and add all the warning bells. Also, it doesn’t and can’t include the manual part, in which you need to reimage the DGX Spark to create a smaller partition. That said, I found that I can use the Ubuntu 26.04 ARM64 Live ISO to resize the existing DGX Spark partition, but I can’t exactly reproduce the process because mine is already out of the oven… UNLESS… the @NVIDIA team sends me another DGX Spark ;-)

I will write up instructions from my notes and script, and will share them soon.

@maiia Is that setup not causing any condensation on the DGX Spark at all? Any chance of that would be deadly to the DGX Spark, given its all-aluminum exterior.

I am planning to add an Arduino Uno Q with a temperature and humidity sensor and shut the DGX down if any value goes critical.
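To make the condensation concern concrete, here is a minimal sketch of the check such a monitor could run, assuming the sensor reports air temperature in °C and relative humidity in percent. The dew point uses the standard Magnus approximation; the function names and the 2 °C safety margin are hypothetical choices for illustration, not part of any Arduino or NVIDIA API.

```python
import math

# Magnus approximation constants (a commonly used pair for water vapor).
MAGNUS_A = 17.62
MAGNUS_B = 243.12  # degrees Celsius


def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate the dew point (°C) from air temperature and relative humidity."""
    if not 0 < rel_humidity_pct <= 100:
        raise ValueError("relative humidity must be in (0, 100]")
    gamma = (math.log(rel_humidity_pct / 100.0)
             + MAGNUS_A * air_temp_c / (MAGNUS_B + air_temp_c))
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)


def condensation_risk(surface_temp_c: float, air_temp_c: float,
                      rel_humidity_pct: float, margin_c: float = 2.0) -> bool:
    """Return True when a surface is at or within margin_c of the dew point.

    margin_c is a hypothetical safety margin, not a measured value.
    """
    return surface_temp_c <= dew_point_c(air_temp_c, rel_humidity_pct) + margin_c
```

In an 18 °C room at 50% relative humidity the dew point comes out around 7 °C, so a TEC plate cold enough to make ice is well inside the risk zone, while a warm Spark case is not.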

Our vision of AI singularity is human-centered, benevolent intelligence co-creating an elevated state of being for humanity. The Sparks do look like they are about to remix some code, and we should have them compose something to commemorate not crashing for a full 48 hours! :D
Re: condensation
I have both Sparks on 1.5-inch-high acrylic mounts. The Sparks don’t touch the TECs; they also don’t touch the fans or the TEC built into the laptop cooling pad. The external TECs (the two little ones on both sides of the laptop cooling pad digital panel) are positioned away from the Spark body.

I never placed the TECs on the Spark, given the high temp delta. However, although I had a similar setup (Sparks on acrylic mounts) with the Targus, the little TECs did create ice below them when they were sitting directly on the Targus cooling pad.

My old Targus has 2 fans; the QUICARD has 4. The Targus was pushing more air on an equally open body, while the QUICARD has 4 localized fans, and the TEC in the middle doesn’t get that cold.

Depending on how we progress over the next week or two I may add this side fan from the back for ‘emotional support’.

You have to bear in mind that the problem is a combination of the storage size, the air circulation, and how insulated the NVIDIA DGX Spark Founders Edition enclosure is. That’s why creating a data-center-like makeshift rack and investing in proper rack fans and magnetic TECs won’t give the expected thermal support.

Some partners claim to have built their units with better thermal management, but it’s too early to tell whether they managed to mitigate any potential thermal throttling on their end. Also, the ASUS units may appear to operate at cooler temperatures because they ship with 1TB and 2TB drives vs. 4TB; that alone might impact thermal efficiency, just as you described when sharing your logic for choosing a smaller 1TB SSD to manage overheating.

I expect we’ll see incremental improvements over the next 3 months through NVIDIA firmware updates, and this won’t be a persistent issue.

I’m not familiar with the Arduino Uno Q with a temperature and humidity sensor. How would you integrate it with the Spark, and wouldn’t an external forced shutdown risk data loss or disk corruption?
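One possible way to avoid a hard power cut, sketched under loud assumptions: the external board only reports readings (for example as text lines over USB serial), and the Spark itself requests an orderly OS shutdown. The `T=...,H=...` line format, the thresholds, and the class and function names below are all hypothetical and are not the design described in this thread; `systemctl poweroff` is the standard systemd command for a clean shutdown.

```python
import subprocess
from collections import deque


def parse_reading(line: str) -> dict:
    """Parse a hypothetical 'T=23.5,H=41.0' sensor line into {'T': 23.5, 'H': 41.0}."""
    fields = (item.split("=", 1) for item in line.strip().split(","))
    return {key.strip(): float(value) for key, value in fields}


class ShutdownGuard:
    """Trip only after `needed` consecutive critical readings, to ignore glitches."""

    def __init__(self, max_temp_c=45.0, max_humidity_pct=80.0, needed=3):
        # Thresholds are placeholders for illustration, not recommended limits.
        self.max_temp_c = max_temp_c
        self.max_humidity_pct = max_humidity_pct
        self.recent = deque(maxlen=needed)

    def update(self, reading: dict) -> bool:
        critical = (reading["T"] >= self.max_temp_c
                    or reading["H"] >= self.max_humidity_pct)
        self.recent.append(critical)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


def power_off(dry_run: bool = True) -> list:
    """Ask systemd for a clean shutdown; with dry_run=True only return the command."""
    cmd = ["systemctl", "poweroff"]
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs sufficient privileges
    return cmd
```

Because the operating system performs the shutdown, filesystems are unmounted normally, which is the main difference from cutting power externally.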

Just to report back my experience.

My DGX Spark has been running nanochat training for 9 hours. I did a few spot checks. The GPU temperature is around 77C to 80C (86W to 90W). The CPU is around 85C. The case is warm/hot to touch, not toasty. The room temperature is around 18C.
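For anyone who wants more than spot checks, here is a minimal sketch that samples GPU temperature and power through `nvidia-smi`. It assumes `nvidia-smi` is on the PATH and that the driver reports both fields on this platform; fields that are not reported come back as non-numeric text and are returned as `None`. The function names are illustrative only.

```python
import subprocess

# Standard nvidia-smi query flags: one CSV line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def to_float(field: str):
    """Convert a CSV field to float, or None if it is not numeric (e.g. '[N/A]')."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_sample(line: str) -> dict:
    """Turn one line such as '78, 88.12' into temperature and power values."""
    temp, power = (to_float(field.strip()) for field in line.split(","))
    return {"temp_c": temp, "power_w": power}


def read_gpu() -> list:
    """Run nvidia-smi once and return one parsed sample per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]
```

Calling `read_gpu()` in a loop with a sleep of a few seconds and appending the results to a file would give a temperature and power trace for a whole training run.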

On the Dell Pro Max with GB10 I’ve been testing, I don’t see the temps rise above 80°C even after a couple hours of HPL and LLM tests, and they’re pretty quiet too.

The facade is more open on the Dell units, it seems; my theory is that gold front on the Spark cuts off enough pass-through airflow (making the bottom vent more essential) that it just can’t cool down as efficiently.

The outside of the Dell unit (top and sides) stays below 40°C during all my testing, and noise is under 48 dBA measured 1’ away. The rear shows parts up to 50-55°C.

I think I’ve been having thermal issues, though my problem has never been the GPU, which has mainly been in the 80C range.

According to lm_sensors, whatever is being recorded as temp6 is reaching 95C+ on my runs and is the outlier among the recorded temperatures (the others are in the 70s and 80s). The only other time I record 90C+ is with a CPU stress test, though in that case it is temp5, and temp6 is relatively cool.
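One way to narrow down what temp6 corresponds to is to read the same hwmon sysfs files that lm_sensors uses and print the chip name and any label next to each channel. This is a sketch assuming the standard Linux layout under `/sys/class/hwmon`, where `tempN_input` holds millidegrees Celsius; whether the Spark's drivers expose a `temp6_label` file is an assumption, and the label may simply be missing.

```python
from pathlib import Path


def read_hwmon(root: str = "/sys/class/hwmon") -> list:
    """Return (chip_name, channel, label, celsius) for every readable temp channel."""
    readings = []
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for temp_file in sorted(chip.glob("temp*_input")):
            channel = temp_file.name.removesuffix("_input")  # e.g. 'temp6'
            label_file = chip / f"{channel}_label"
            label = label_file.read_text().strip() if label_file.exists() else ""
            try:
                # sysfs reports temperatures in millidegrees Celsius.
                celsius = int(temp_file.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # some channels cannot be read; skip them
            readings.append((chip_name, channel, label, celsius))
    return readings
```

Printing the result during a run shows which chip owns the hot channel, which is usually enough to tell a CPU cluster sensor from, say, an SSD or board sensor.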

Interesting. Level1Techs https://www.youtube.com/watch?v=sx6ANedcIfI found that the MSI EdgeXpert had “better thermal and power dissipation characteristics and as a result is 5-10% better for most tasks” than the DGX Spark, and he seems to be attributing most of it to thermal management. I had returned two Sparks after seeing people have thermal issues, and was anxious to see how partner units would perform. It seems like they might perform better than the Spark.
