Description
TensorRT: Quantization issues with convtranspose3D
Environment
TensorRT Version: 10.10
GPU Type: RTX5070
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version: Win11
Python Version (if applicable): 3.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
My code for exporting the model (quantized model) and my model
code.txt (12.0 KB)
model.txt (59.6 KB)
Hello,
I don’t understand why ConvTranspose3d runs in INT8 with implicit quantization, but as soon as I use explicit quantization or a QAT phase with the Model Optimizer, ConvTranspose3d falls back to FP16.
I am using the Model Optimizer with the default INT8 configuration.
I think this is only a problem with the Model Optimizer, but I’m not sure.
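For reference, the explicit path I mean looks roughly like this; a minimal sketch using Model Optimizer’s default INT8 config (the model class, calibration loader, shapes, and file names here are placeholders, the full version is in code.txt):

```python
import torch
import modelopt.torch.quantization as mtq

model = MyUNet3D().cuda().eval()            # placeholder for my 3D model with ConvTranspose3d layers

def forward_loop(m):
    # Calibration pass over a few representative batches (placeholder loader).
    for batch in calib_loader:
        m(batch.cuda())

# Insert fake-quant (Q/DQ) modules using the default INT8 recipe.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export the explicitly quantized model to ONNX with Q/DQ nodes.
dummy = torch.randn(1, 1, 64, 64, 64, device="cuda")   # placeholder input shape
torch.onnx.export(model, dummy, "model_explicit.onnx", opset_version=17)
```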
Here is my implicit ONNX.
Here is the result after TensorRT quantization.
So here, ConvTranspose3D runs in INT8.
Inference time in implicit mode: 9 ms.
Here is explicit ONNX quantization.
Here is the result after TensorRT quantization.
Inference time in explicit mode: 11 ms.
*** Please note that this reply is generated by LLM automatically ***
The issue you’re seeing, where ConvTranspose3D runs in INT8 with implicit quantization but not with explicit quantization or QAT through the Model Optimizer, is likely due to how TensorRT handles quantization for this specific layer.
With implicit quantization, TensorRT determines the quantization scheme for each layer automatically, including ConvTranspose3D. With explicit quantization or QAT, the quantization scheme (the Q/DQ placement) has to be specified for each layer explicitly.
In your case, it seems the resulting engine falls back to FP16 for ConvTranspose3D when explicit quantization or QAT is used. This is likely because TensorRT cannot find an INT8 implementation for ConvTranspose3D that fits the explicit Q/DQ pattern on your specific hardware (RTX 5070).
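Before assuming a hardware limitation, it is worth checking whether the exported ONNX actually has QuantizeLinear/DequantizeLinear pairs feeding the ConvTranspose node; a small sketch (the file name is a placeholder for your explicit export):

```python
import onnx

model = onnx.load("model_explicit.onnx")     # placeholder: your explicitly quantized export
producer = {out: n.op_type for n in model.graph.node for out in n.output}

for node in model.graph.node:
    if node.op_type == "ConvTranspose":      # 3D is determined by the kernel shape, the op type is the same
        # TensorRT can only run this node in INT8 if its activation and weight inputs
        # come from DequantizeLinear nodes (i.e. the layer is wrapped in Q/DQ).
        sources = [producer.get(i, "graph input / initializer") for i in node.input]
        print(node.name, "<-", sources)
```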
There are a few possible reasons for this:
- Hardware limitations: As mentioned in the TensorRT documentation, SM 7.5 and earlier devices may lack INT8 implementations for some layers. Your RTX 5070 is a much newer device, but it is still possible that an INT8 implementation for ConvTranspose3D is not available or not optimized for your hardware; the engine-inspector sketch after this list shows how to check which precision TensorRT actually selected.
- Model Optimizer configuration: The Model Optimizer may not be inserting Q/DQ nodes around ConvTranspose3D, or the TensorRT builder may not be asked to keep that layer in INT8. You can try enabling the INT8 builder flag and requesting INT8 precision on the ConvTranspose3D layer to push TensorRT toward an INT8 kernel.
- QAT limitations: QAT is a more complex process than implicit quantization and may not always produce the same result. It is possible that the QAT path does not map to an optimal INT8 implementation for ConvTranspose3D on your hardware.
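To see which precision TensorRT actually chose for the layer, you can dump the engine’s layer information; a sketch along these lines (the engine path is a placeholder, and the engine should be built with detailed profiling verbosity to get full per-layer detail):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)

with open("model_explicit.engine", "rb") as f:       # placeholder path to your built engine
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
# The JSON output lists every (fused) layer with the precision and tactic TensorRT selected;
# search it for the ConvTranspose/Deconvolution layer to see whether it ran in Int8 or Half.
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```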
To resolve this issue, you can try the following:
- Check the ONNX operator support matrix: Verify that ConvTranspose (3D) is supported in the ONNX opset you are using and that TensorRT has an INT8 implementation available for it.
- Use the builder precision controls: Enable the INT8 builder flag, request INT8 precision on the ConvTranspose3D layer, and set OBEY_PRECISION_CONSTRAINTS so the request is honored; see the sketch after this list.
- Try a different QAT configuration: Experiment with different QAT configurations to see whether one of them lets TensorRT keep ConvTranspose3D in INT8.
- Use different hardware: If possible, run the model on another NVIDIA GPU to check whether the issue is specific to your RTX 5070.
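For the precision-constraint suggestion above, a minimal sketch with the TensorRT Python API (the ONNX path is a placeholder; note that with a Q/DQ network the placement of the Q/DQ nodes ultimately governs precision, so this mainly helps the implicit/calibration path):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)                  # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:                  # placeholder ONNX path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
# Honor per-layer precision requests instead of treating them as hints.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type == trt.LayerType.DECONVOLUTION:    # ConvTranspose imports as a deconvolution layer
        layer.precision = trt.int8

# Implicit INT8 additionally needs a calibrator or per-tensor dynamic ranges (omitted here).
engine_bytes = builder.build_serialized_network(network, config)
```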
In terms of inference time, the difference between implicit and explicit quantization is most likely due to the extra quantize/dequantize operations that remain in the graph, especially around a layer that falls back to FP16. That said, the difference of 2 ms (9 ms vs 11 ms) is relatively small, and you may want to weigh other factors such as model accuracy and robustness when choosing between implicit and explicit quantization.