Disagreement between Eigen::bfloat16 and Intel on conversion from NaN to bfloat16
There appears to be a disagreement between Eigen::bfloat16 and Intel's implementations regarding the conversion from a float NaN to bfloat16.
When the argument of Eigen::bfloat16::bfloat16(float f) is NaN, the constructed Eigen::bfloat16 always gets the fixed bit pattern 0x7fc0, ignoring the particular bit pattern of the float NaN: https://gitlab.com/libeigen/eigen/-/blob/b8ca93842c02f37ed398613b03f064c707d02fdc/Eigen/src/Core/arch/Default/BFloat16.h#L259
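For illustration, that behavior can be sketched roughly as follows (a simplified, hypothetical reimplementation, not Eigen's actual code; the function name is made up):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Simplified sketch of the behavior at the link above (not Eigen's actual code).
static uint16_t eigen_style_float_to_bf16(float f) {
    if (std::isnan(f)) {
        // Every NaN input collapses to the single quiet-NaN pattern 0x7fc0,
        // regardless of the input's sign and mantissa (payload) bits.
        return 0x7FC0;
    }
    // Other values: round to nearest even by adding a bias before truncating
    // to the upper 16 bits.
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    const uint32_t lsb = (bits >> 16) & 1;
    return static_cast<uint16_t>((bits + 0x7FFF + lsb) >> 16);
}
```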
On the other hand, the approach followed by Intel is described as follows in the Intel Architecture, Instruction Set Extensions and Future Features Programming Reference:
Define convert_fp32_to_bfloat16(x):
    IF x is zero or denormal:
        dest[15] := x[31] // sign-preserving zero (denormals go to zero)
        dest[14:0] := 0
    ELSE IF x is infinity:
        dest[15:0] := x[31:16]
    ELSE IF x is NAN:
        dest[15:0] := x[31:16] // truncate and set MSB of the mantissa to force QNAN
        dest[6] := 1
    ELSE: // normal number
        LSB := x[16]
        rounding_bias := 0x00007FFF + LSB
        temp[31:0] := x[31:0] + rounding_bias // integer add
        dest[15:0] := temp[31:16]
    RETURN dest
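A direct C++ transcription of this pseudocode might look like the following (a sketch for illustration; this is a standalone function, not the oneDNN or Eigen API):

```cpp
#include <cstdint>
#include <cstring>

// Standalone transcription of the Intel pseudocode above (illustration only).
static uint16_t convert_fp32_to_bfloat16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));  // reinterpret float as raw bits
    const uint32_t exp  = (bits >> 23) & 0xFF;
    const uint32_t mant = bits & 0x007FFFFF;
    if (exp == 0x00) {
        // Zero or denormal: sign-preserving zero (denormals go to zero).
        return static_cast<uint16_t>((bits >> 16) & 0x8000);
    }
    if (exp == 0xFF && mant == 0) {
        // Infinity: plain truncation.
        return static_cast<uint16_t>(bits >> 16);
    }
    if (exp == 0xFF) {
        // NaN: truncate and set the MSB of the mantissa to force a QNaN.
        // The sign bit and mantissa bits x[21:16] survive as the payload.
        return static_cast<uint16_t>((bits >> 16) | (1u << 6));
    }
    // Normal number: round to nearest even via an integer add.
    const uint32_t lsb = (bits >> 16) & 1;
    const uint32_t rounding_bias = 0x00007FFF + lsb;
    return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}
```

With this version, a quiet NaN with bit pattern 0xFFC30000 converts to 0xFFC3, keeping the sign and payload bits, whereas Eigen would return 0x7FC0 for any NaN.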
And Intel's oneDNN library handles NaN the same way in bfloat16_t::operator=(float f), at https://github.com/oneapi-src/oneDNN/blob/v2.0-beta07/src/cpu/bfloat16.cpp#L56-L60:
case FP_NAN:
    // truncate and set MSB of the mantissa force QNAN
    raw_bits_ = iraw[1];
    raw_bits_ |= 1 << 6;
    break;
So with this approach, the sign bit and six of the original significand (mantissa) bits of a NaN float, namely x[21:16], are preserved in the bfloat16 representation.
Could this approach also be considered for Eigen::bfloat16? Or is there possibly a specific reason why the single bfloat16 bit pattern 0x7fc0 was chosen to represent every NaN float for TensorFlow?
[Edit Jul 25, 2020: mentioned the sign bit, for the sake of completeness]