I noticed the spec lists the FP16 Tensor performance as 209 TFLOPS and 419 TFLOPS. Could anyone please explain what “acc fp32” means here?
Does the 209 TFLOPS figure correspond to the computing power of the MMA instruction SM80_16x8x16_F32F16F16F32_TN? If so, is the computing power of the MMA instruction SM80_16x8x16_F16F16F16F16_TN then 419 TFLOPS?
Hi there, and thanks for the question. However, this is really a forum for GPU hardware as it relates to Omniverse, not GPU hardware in general. I will move this thread.
Thank you. I have another question. Generally speaking, is it normal for measured GEMM FLOPS to exceed the spec by 10%?
This situation occurs with the SM80_16x8x16_F32F16F16F32_TN instruction.
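(For context on how I compute the measured number: the GEMM I time is built on the instruction above, and the achieved figure is the usual 2·M·N·K divided by elapsed time. Below is a cuBLAS-based stand-in that measures the same quantity — a minimal sketch with placeholder sizes, not my actual benchmark.)

```cpp
// Minimal sketch: time an FP16 GEMM with FP32 accumulation via cuBLAS and
// report achieved TFLOPS as 2*M*N*K / time. Sizes and iteration count are
// placeholders; error checking and input initialization are omitted.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int M = 8192, N = 8192, K = 8192;
    const int iters = 100;

    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * M * K);
    cudaMalloc(&B, sizeof(__half) * K * N);
    cudaMalloc(&C, sizeof(__half) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // FP32 scalars because the compute type is CUBLAS_COMPUTE_32F ("acc fp32").
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so clocks ramp up and the kernel is selected before timing.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                 &beta,  C, CUDA_R_16F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                     &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                     &beta,  C, CUDA_R_16F, M,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double seconds = ms / 1e3 / iters;                  // average time per GEMM
    double tflops  = 2.0 * M * N * K / seconds / 1e12;  // 1 multiply + 1 add per MAC
    printf("achieved: %.1f TFLOPS\n", tflops);

    cublasDestroy(handle);
    return 0;
}
```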
Hi @pengxuan.peng,
Accumulate refers to the accumulator (ACC) that basic machine instructions were originally built around.
Data is loaded into the accumulator and subsequent operations add their results onto it. With Tensor Core operations this is a bit more complex, but it is essentially the same idea: “acc fp32” means the products of the FP16 inputs are accumulated in FP32.
Going from FP32 to FP16 accumulation allows for theoretically double the number of operations.
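At the machine-instruction level the difference is just the accumulator type of the HMMA operation. A minimal sketch of the two PTX mma.sync variants behind the CuTe atoms named above (the wrapper function names are placeholders, not CUTLASS code):

```cpp
// Sketch of the two m16n8k16 HMMA variants on sm_80. A and B are FP16 in both
// cases; only the C/D accumulator registers differ (FP32 vs packed FP16).
#include <cstdint>

// "acc fp32": D(f32) = A(f16) x B(f16) + C(f32) — cf. SM80_16x8x16_F32F16F16F32_TN
__device__ void mma_16816_f32acc(float d[4], const uint32_t a[4],
                                 const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

// "acc fp16": D(f16) = A(f16) x B(f16) + C(f16) — cf. SM80_16x8x16_F16F16F16F16_TN
__device__ void mma_16816_f16acc(uint32_t d[2], const uint32_t a[4],
                                 const uint32_t b[2], const uint32_t c[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
}
```

Both compute the same 16x8x16 tile per warp; on GPUs whose spec lists two different FP16 Tensor numbers, the FP16-accumulate form is the one rated at the higher figure.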
To your second question, the listed performance numbers are best estimates based on standardized conditions in terms of power, system setup, temperature, etc. Any deviation from those can cause different performance results.
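To make that concrete: the listed peak is essentially the product of the per-Tensor-Core MAC rate, the Tensor Core count, and an assumed boost clock, so a run that sustains a higher clock than the spec assumes can land above the listed figure. A toy illustration of that relation (the numbers below are placeholders, not any particular GPU's specs):

```cpp
#include <cstdio>

// Theoretical peak from the spec's ingredients: each Tensor-Core MAC counts
// as 2 FLOPs (multiply + add). All inputs here are toy placeholders.
double peak_tflops(double macs_per_tc_per_clk, double tensor_cores, double clock_ghz) {
    return 2.0 * macs_per_tc_per_clk * tensor_cores * clock_ghz * 1e9 / 1e12;
}

int main() {
    // Hypothetical GPU: 100 Tensor Cores, 64 FP16 MACs per core per clock.
    printf("at 1.50 GHz: %.1f TFLOPS\n", peak_tflops(64, 100, 1.50));  // spec's assumed clock
    printf("at 1.65 GHz: %.1f TFLOPS\n", peak_tflops(64, 100, 1.65));  // 10% higher sustained clock
    return 0;
}
```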
I hope that helped.
Thank you. Is CuTe's SM80_16x8x16_F32F16F16F32_TN (HMMA.16816.F32) the corresponding “acc fp32” instruction? In other words, does HMMA.16816.F32 translate to an upper limit of 209.5 TFLOPS?
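(For reference, this is how I associated the CuTe atom with HMMA.16816.F32 — a sketch assuming a local CUTLASS checkout; the file and kernel names are placeholders.)

```cpp
// check_mma.cu — minimal kernel for inspecting which SASS instruction the
// CuTe atom lowers to. Build and inspect (paths are placeholders):
//   nvcc -arch=sm_80 -I/path/to/cutlass/include -cubin check_mma.cu -o check_mma.cubin
//   cuobjdump -sass check_mma.cubin | grep HMMA
#include <cstdint>
#include <cute/arch/mma_sm80.hpp>  // CuTe MMA atoms (CUTLASS 3.x)

__global__ void one_mma(float* out, const uint32_t* a, const uint32_t* b) {
    float d[4] = {0.f, 0.f, 0.f, 0.f};
    float c[4] = {0.f, 0.f, 0.f, 0.f};
    // Single warp-wide 16x8x16 MMA with FP16 inputs and FP32 accumulation.
    cute::SM80_16x8x16_F32F16F16F32_TN::fma(
        d[0], d[1], d[2], d[3],
        a[0], a[1], a[2], a[3],
        b[0], b[1],
        c[0], c[1], c[2], c[3]);
    // Store the result so the MMA is not optimized away.
    out[threadIdx.x & 3] = d[threadIdx.x & 3];
}
```

The SASS for this kernel is where the HMMA.16816.F32 opcode in my question comes from; what I am still unsure about is whether that opcode is what the 209.5 TFLOPS figure refers to.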
I am sorry, but that I do not know.