I noticed the spec lists the FP16 Tensor performance as 209 TFLOPS and 419 TFLOPS. Could anyone please explain what “acc fp32” means here?
Does the 209 TFLOPS figure correspond to the computing power of the MMA instruction SM80_16x8x16_F32F16F16F32_TN? If so, is the computing power of the MMA instruction SM80_16x8x16_F16F16F16F16_TN then 419 TFLOPS?
Hi there, and thanks for the question. However, this is really a forum for GPU hardware as it relates to Omniverse, not GPU hardware in general. I will move this thread.
Thank you. I have another question. Generally speaking, is it normal for measured GEMM FLOPS to exceed the spec by 10%?
This situation occurs with the SM80_16x8x16_F32F16F16F32_TN instruction.
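(For context on how I compute the measured number: the GEMM I time is built on the instruction above, and the achieved figure is the usual 2·M·N·K divided by elapsed time. Below is a cuBLAS-based stand-in that measures the same quantity — a minimal sketch with placeholder sizes, not my actual benchmark.)

```cpp
// Minimal sketch: time an FP16 GEMM with FP32 accumulation via cuBLAS and
// report achieved TFLOPS as 2*M*N*K / time. Sizes and iteration count are
// placeholders; error checking and input initialization are omitted.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int M = 8192, N = 8192, K = 8192;
    const int iters = 100;

    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * M * K);
    cudaMalloc(&B, sizeof(__half) * K * N);
    cudaMalloc(&C, sizeof(__half) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // FP32 scalars because the compute type is CUBLAS_COMPUTE_32F ("acc fp32").
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so clocks ramp up and the kernel is selected before timing.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                 &beta,  C, CUDA_R_16F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                     &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                     &beta,  C, CUDA_R_16F, M,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double seconds = ms / 1e3 / iters;                  // average time per GEMM
    double tflops  = 2.0 * M * N * K / seconds / 1e12;  // 1 multiply + 1 add per MAC
    printf("achieved: %.1f TFLOPS\n", tflops);

    cublasDestroy(handle);
    return 0;
}
```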
Hi @pengxuan.peng,
Accumulate refers to the accumulator (ACC) that basic machine instructions were originally built around.
Data is loaded into the accumulator and subsequent operations add their results onto it. With Tensor Core operations this is a bit more complex, but it is essentially the same idea: “acc fp32” means the products of the FP16 inputs are accumulated in FP32.
Going from FP32 to FP16 accumulation allows for theoretically double the number of operations.
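At the machine-instruction level the difference is just the accumulator type of the HMMA operation. A minimal sketch of the two PTX mma.sync variants behind the CuTe atoms named above (the wrapper function names are placeholders, not CUTLASS code):

```cpp
// Sketch of the two m16n8k16 HMMA variants on sm_80. A and B are FP16 in both
// cases; only the C/D accumulator registers differ (FP32 vs packed FP16).
#include <cstdint>

// "acc fp32": D(f32) = A(f16) x B(f16) + C(f32) — cf. SM80_16x8x16_F32F16F16F32_TN
__device__ void mma_16816_f32acc(float d[4], const uint32_t a[4],
                                 const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

// "acc fp16": D(f16) = A(f16) x B(f16) + C(f16) — cf. SM80_16x8x16_F16F16F16F16_TN
__device__ void mma_16816_f16acc(uint32_t d[2], const uint32_t a[4],
                                 const uint32_t b[2], const uint32_t c[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
}
```

Both compute the same 16x8x16 tile per warp; on GPUs whose spec lists two different FP16 Tensor numbers, the FP16-accumulate form is the one rated at the higher figure.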
To your second question, the listed performance numbers are best estimates based on standardized conditions in terms of power, system setup, temperature, etc. Any deviation from those can cause different performance results.
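To make that concrete: the listed peak is essentially the product of the per-Tensor-Core MAC rate, the Tensor Core count, and an assumed boost clock, so a run that sustains a higher clock than the spec assumes can land above the listed figure. A toy illustration of that relation (the numbers below are placeholders, not any particular GPU's specs):

```cpp
#include <cstdio>

// Theoretical peak from the spec's ingredients: each Tensor-Core MAC counts
// as 2 FLOPs (multiply + add). All inputs here are toy placeholders.
double peak_tflops(double macs_per_tc_per_clk, double tensor_cores, double clock_ghz) {
    return 2.0 * macs_per_tc_per_clk * tensor_cores * clock_ghz * 1e9 / 1e12;
}

int main() {
    // Hypothetical GPU: 100 Tensor Cores, 64 FP16 MACs per core per clock.
    printf("at 1.50 GHz: %.1f TFLOPS\n", peak_tflops(64, 100, 1.50));  // spec's assumed clock
    printf("at 1.65 GHz: %.1f TFLOPS\n", peak_tflops(64, 100, 1.65));  // 10% higher sustained clock
    return 0;
}
```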
I hope that helped.
Thank you. Is CuTe's SM80_16x8x16_F32F16F16F32_TN (HMMA.16816.F32) the corresponding “acc fp32” instruction? In other words, does HMMA.16816.F32 translate to an upper limit of 209.5 TFLOPS?
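(For reference, this is how I associated the CuTe atom with HMMA.16816.F32 — a sketch assuming a local CUTLASS checkout; the file and kernel names are placeholders.)

```cpp
// check_mma.cu — minimal kernel for inspecting which SASS instruction the
// CuTe atom lowers to. Build and inspect (paths are placeholders):
//   nvcc -arch=sm_80 -I/path/to/cutlass/include -cubin check_mma.cu -o check_mma.cubin
//   cuobjdump -sass check_mma.cubin | grep HMMA
#include <cstdint>
#include <cute/arch/mma_sm80.hpp>  // CuTe MMA atoms (CUTLASS 3.x)

__global__ void one_mma(float* out, const uint32_t* a, const uint32_t* b) {
    float d[4] = {0.f, 0.f, 0.f, 0.f};
    float c[4] = {0.f, 0.f, 0.f, 0.f};
    // Single warp-wide 16x8x16 MMA with FP16 inputs and FP32 accumulation.
    cute::SM80_16x8x16_F32F16F16F32_TN::fma(
        d[0], d[1], d[2], d[3],
        a[0], a[1], a[2], a[3],
        b[0], b[1],
        c[0], c[1], c[2], c[3]);
    // Store the result so the MMA is not optimized away.
    out[threadIdx.x & 3] = d[threadIdx.x & 3];
}
```

The SASS for this kernel is where the HMMA.16816.F32 opcode in my question comes from; what I am still unsure about is whether that opcode is what the 209.5 TFLOPS figure refers to.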
I am sorry, but that I do not know.