Model outputting NaNs

Description

I’m running into an issue while compiling my model from .pt to .onnx to .engine.
I have DINOv3 and I’m trying to compile two optimized versions of this model: one for inputs of dimension 112, and one for 512.
After converting to ONNX (FP32) and then to a TensorRT engine (FP16), my model for 112 works fine and outputs real numbers, but for some reason the model optimized for 512 outputs NaNs.
I have tried using polygraphy debug precision and the --layerPrecisions flag to allow some layers to remain in FP32, but nothing helps.
It’s important to note that the model works very well for the size of 112.
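
One detail worth double-checking with --layerPrecisions: to my knowledge, trtexec only applies per-layer precisions when --precisionConstraints is set to obey or prefer. The sketch below only assembles such a command line in Python so the flags are easy to see. The file names and layer names are hypothetical placeholders, not taken from the repo, and the real layer names would have to come from the ONNX graph.

```python
import shlex


def build_trtexec_command(onnx_path, engine_path, fp32_layers):
    """Assemble a trtexec command that builds an FP16 engine while
    requesting FP32 for the named layers."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--fp16",
    ]
    if fp32_layers:
        spec = ",".join(f"{name}:fp32" for name in fp32_layers)
        # Per-layer precisions need a constraint mode to take effect.
        cmd.append("--precisionConstraints=obey")
        cmd.append(f"--layerPrecisions={spec}")
    return cmd


# Hypothetical paths and layer names, for illustration only.
command = build_trtexec_command(
    "dinov3_512.onnx",
    "dinov3_512_fp16.engine",
    ["/blocks.0/attn/Softmax", "/blocks.0/norm1/LayerNormalization"],
)
print(shlex.join(command))
```

If the constraint mode is missing, the builder may be free to ignore the requested precisions, which could explain why pinning layers to FP32 appears to change nothing.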

Thank you so much for your help!

Environment

TensorRT Version: 10.4.0
GPU Type: V100-PCIE-16GB
Nvidia Driver Version: 535.129.03
CUDA Version: 12.6
CUDNN Version: 9.4.0
Operating System + Version: Red Hat Enterprise Linux 8.10
Python Version (if applicable): 3.11
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.7.1
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tritonserver:24.09-py3

Relevant Files

https://drive.google.com/drive/folders/1Jg3Jad5DkvKsN4rtHtFkV4Au16Hxov1n?usp=drive_link

Steps To Reproduce

This is what I get for 112:

And this is what I get for 512:

  • Clone the repo: zakcory/model-opt on GitHub ("Converting .pt to .onnx to .engine")
  • Copy the .pt DINOv3 model into sources/models/raw after cloning
  • Run the ./convert_all_models.sh script to convert to .onnx and then to .engine
  • Run the ./test_all_models.sh to run the engines both for 112 and for 512 and see the outputs

NOTES:

  1. Make sure uv is installed so the bash scripts run successfully (they use uv run and uv sync). If you decide to use pip instead, install the dependencies from the requirements.txt file and change uv run inside the bash scripts to python
  2. Make sure to run the scripts on the V100: for me they run fine on an A10, but the problem persists with the input size of 512 on the V100 specifically
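
To narrow down where the NaNs first appear, one option is to dump intermediate outputs from the 512 engine and scan them in execution order. This is a minimal sketch in plain Python, assuming the dumped values are already available as flat lists of floats keyed by layer name. The layer names and values below are made up for illustration.

```python
import math


def count_non_finite(values):
    """Return (nan_count, inf_count) for a flat iterable of floats."""
    values = list(values)
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    return nan_count, inf_count


def first_non_finite_layer(layer_outputs):
    """Return (name, nan_count, inf_count) for the first layer whose output
    contains NaN or Inf, or None if every output is finite.

    layer_outputs: dict of layer name -> flat list of floats, inserted in
    execution order (dicts preserve insertion order)."""
    for name, values in layer_outputs.items():
        nan_count, inf_count = count_non_finite(values)
        if nan_count or inf_count:
            return name, nan_count, inf_count
    return None


# Made-up example data: an Inf shows up one layer before the NaNs do.
example_outputs = {
    "patch_embed": [0.12, -0.40, 1.75],
    "attn_scores": [3.0, float("inf"), -2.5],
    "attn_softmax": [float("nan"), float("nan"), float("nan")],
}
print(first_non_finite_layer(example_outputs))
```

Reporting Inf separately from NaN is deliberate: an overflow to Inf usually appears a layer or two before the NaNs, so the first Inf tends to be closer to the root cause.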

I’m seeing the same issue on my side. The model optimized for input size 112 runs correctly and produces valid outputs, but the 512 version produces NaNs after conversion to TensorRT FP16. I also tried adjusting precision settings and keeping some layers in FP32, but it didn’t resolve the problem.

Thanks in advance,
Avishai

Experiencing the same issue with my V100 GPU.
Compared it with my personal RTX 4070.
On the V100 it keeps outputting NaNs for larger input dimensions, whereas on the newer GPU the issue does not appear.

Happens with FP16 conversion only for some reason.

I am seeing the exact same behavior on a similar setup (V100, TensorRT 10.x). Like the OP, my DINOv3 conversions work perfectly at smaller input resolutions (112x112), but yield NaN outputs as soon as the spatial dimension scales up to 512x512.

I suspect this might be related to the specific tactics (kernels) selected for the V100’s Volta architecture during the FP16 optimization phase for larger tensor shapes. I’ve also attempted to isolate the issue using --layerPrecisions to keep the attention heads in FP32, but the NaNs persist.

Given that this works on Ampere (A10) but fails on Volta (V100) for the exact same 512 input, could an NVIDIA engineer look into whether there is a known precision overflow in the scaled dot-product attention or LayerNorm kernels specifically for the V100 on TRT 10.4? This is a major blocker for deploying high-resolution Vision Transformer models on older enterprise hardware.
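
To illustrate the kind of overflow being asked about (this is not a confirmed diagnosis): FP16 tops out at 65504, and any reduction over the token axis sums far more terms at 512 than at 112. The sketch below emulates FP16 rounding with Python's struct module. The patch size of 16 and the activation magnitude of 8.0 are assumptions chosen purely for illustration, and real kernels may accumulate in FP32, in which case this exact path would not apply.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x):
    """Round a float to the nearest FP16 value, saturating to +/-inf on overflow."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range; treat as inf.
        return math.copysign(math.inf, x)


def num_patch_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image (patch size is an assumption)."""
    return (image_size // patch_size) ** 2


def fp16_sum_of_squares(value, count):
    """Sum value**2 over `count` terms, rounding to FP16 after every step."""
    square = to_fp16(value * value)
    acc = 0.0
    for _ in range(count):
        acc = to_fp16(acc + square)
    return acc


for size in (112, 512):
    tokens = num_patch_tokens(size)
    total = fp16_sum_of_squares(8.0, tokens)  # 8.0 is an arbitrary magnitude
    print(size, tokens, total)

# Once a value has overflowed, a later subtraction (as in a max-subtracted
# softmax or a mean-centered normalization) turns it into NaN.
overflowed = fp16_sum_of_squares(8.0, num_patch_tokens(512))
print(overflowed - overflowed)
```

With these assumed numbers the 112 case stays well inside the FP16 range while the 512 case overflows, which matches the pattern of "small input fine, large input NaN" without depending on the weights.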