
[APFloat] Wrong result when truncating double to float #55838

Closed
danilaml opened this issue Jun 2, 2022 · 20 comments

@danilaml
Collaborator

danilaml commented Jun 2, 2022

For the following C(++) code, Clang produces output that differs from every other major compiler (and from the internal test it was reduced from):

#include <cstdio>

int main() {
    long x = 503489155L;
    double y = *(double*)&x;
    float fy = y;
    printf("%lx, %e (%a)\n%f (%a)\n", x, y, y, fy, fy);
}

output:

1e02a283, 2.487567e-315 (0x0.000001e02a283p-1022)
0.000000 (0x1p-149)

expected:

1e02a283, 2.487567e-315 (0x0.000001e02a283p-1022)
0.000000 (0x0p+0)

I've pinpointed the issue to Val.convert in the FPTrunc case of llvm::ConstantFoldCastInstruction, which for the double 0x0.000001e02a283p-1022 returns the float 0x1p-149, whereas one would expect (assuming the other compilers are correct here) the returned value to be 0.
Furthermore, it is represented as float 0x36A0000000000000 in the IR, which may or may not be expected (if that were the hex pattern of a plain float, it would have eight extra zeroes; but if the printer uses double-width hex even for regular floats, then it's fine).

godbolt: https://godbolt.org/z/YM3rvY3hM
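For reference, the same wrong result can be reproduced directly at the APFloat level, bypassing Clang entirely (a sketch against the public APFloat API; the expected and observed values are taken from the report above):

#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"
#include <cassert>
using namespace llvm;

int main() {
  // Reinterpret the i64 bit pattern 503489155 (0x1E02A283) as a double:
  // a denormal, approximately 2.49e-315.
  APFloat Val(APFloat::IEEEdouble(), APInt(64, 503489155));
  bool LosesInfo = false;
  Val.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven,
              &LosesInfo);
  // Expected: +0.0, since the value is far below the smallest positive
  // float. Observed with the bug: 0x1p-149.
  assert(Val.isZero());
  return 0;
}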

@danilaml
Collaborator Author

danilaml commented Jun 2, 2022

@rotateright @efriedma-quic pinging you as people who seem familiar with this code. I'm not sure where to even begin looking to understand why it currently behaves as it does, or what the appropriate fix would be.

@danilaml
Collaborator Author

danilaml commented Jun 2, 2022

Perhaps the biggest indication that this is a bug in the conversion function is that if I hide the conversion behind a non-inlinable helper (preventing constant folding), the result is the expected 0: https://godbolt.org/z/cehTrKrcj
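Such a helper can be as simple as the following sketch (not necessarily the exact godbolt code):

// noinline keeps the conversion out of reach of the constant folder, so it
// is performed by hardware at run time and correctly yields 0.0f.
__attribute__((noinline)) float truncate(double d) {
  return static_cast<float>(d);
}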

@efriedma-quic
Collaborator

IEEEFloat::opStatus IEEEFloat::convert(const fltSemantics &toSemantics, ...)

is the code that actually does constant floating-point conversions. I wouldn't be surprised if there's a bug somewhere.

@pmor13

pmor13 commented Jun 2, 2022

Just a note: in C the double y = *(double*)&x; violates the aliasing rules. Use

memcpy(&y, &x, sizeof y);

or

double y = ( union { unsigned long long u; double f; } ){ 503489155 }.f;
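Putting that together, an aliasing-clean version of the reproducer could look like this (a sketch; like the original, it assumes an LP64 target where long is 64 bits wide):

#include <cstdio>
#include <cstring>

int main() {
    long x = 503489155L;
    double y;
    std::memcpy(&y, &x, sizeof y); // well-defined replacement for *(double*)&x
    float fy = static_cast<float>(y);
    std::printf("%lx, %e (%a)\n%f (%a)\n", x, y, y, (double)fy, (double)fy);
}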

@nunoplopes
Member

Here's what Alive2 has to say (https://alive2.llvm.org/ce/z/TFYgXt):

  %a = bitcast i32 503489155 to float
  %b = fpext float %a to double
  %c = fptrunc double %b to float

; Evaluate to:
float %a  = #x1e02a283 (0.000000000000?)
double %b = #x3bc0545060000000 (0.000000000000?)
float %c  = #x1e02a283 (0.000000000000?)

@efriedma-quic
Collaborator

@nunoplopes The original testcase is equivalent to something more like this:

define float @src() {
  %a = bitcast i64 503489155 to double
  %b = fptrunc double %a to float
  ret float %b
}

See this alive link

@nunoplopes
Member

Thanks! Uhm, the output truncates precision in the optimized IR so it almost looks correct.

@danilaml
Collaborator Author

danilaml commented Jun 2, 2022

@pmor13, the original test case this was reduced from was in IR, so it had no aliasing concerns. I didn't want to bother with memcpy/std::bit_cast for a simple C++ reproducer. The problematic place in the IR is more or less just a bitcast + fptrunc, like in @efriedma-quic's example.

@pcordes

pcordes commented Jun 3, 2022

Just a note: in C the double y = *(double*)&x; violates the aliasing rules. Use [...]

Or in C++20, we have #include <bit> for std::bit_cast<double>(x), which provides a constexpr-safe way to do it, with nicer syntax.
But yes, unless you compile with -fno-strict-aliasing, the original test case has compile-time-visible UB, which is never a good thing when you're testing something else, since other readers are going to wonder whether the UB is the problem.

In this case, that seems not to be the source of the problem, since GCC and clang make minor efforts to support some basic idioms like this. -fno-strict-aliasing doesn't change the results with either GCC or clang.

But clang at -O0, which does the conversion at run time with x86 instructions (cvtsd2ss to convert to float, then cvtss2sd to promote back to double), does produce 0.000000 (0x0p+0) for the second line. https://godbolt.org/z/hhxPKKboa
So just for the record, 0.0 is the correct result.

@danilaml
Collaborator Author

danilaml commented Jun 3, 2022

Looks like the wrong result only appears when x is in the range [0x10000001, 0x1FFFFFFF] (inclusive).
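That range lines up with the shift arithmetic in convert (my own reasoning, not verified in the thread): truncating double to float shifts the significand right by 53 - 24 = 29 bits, so bit 28 of these bit patterns is the rounding ("half") bit and bits 0-27 are the sticky bits.

#include <cstdint>

// A hypothetical mirror of APFloat's lost-fraction classification for this
// case: the result misrounds exactly when the shifted significand is zero
// (omsb == 0) while the lost fraction is lfMoreThanHalf.
bool misrounds(std::uint64_t pattern) {
  bool half = (pattern & (1ull << 28)) != 0;          // last bit shifted out
  bool sticky = (pattern & ((1ull << 28) - 1)) != 0;  // everything below it
  bool significandZero = (pattern >> 29) == 0;
  return significandZero && half && sticky;
}
// misrounds(0x10000000) == false  (exact tie; ties-to-even gives 0)
// misrounds(0x10000001) == true   (start of the bad range)
// misrounds(0x1FFFFFFF) == true   (end of the bad range)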

@danilaml
Collaborator Author

danilaml commented Jun 3, 2022

Does anyone know/understand why this whole code is guarded by omsb being non-zero?

if (omsb) {
  /* OMSB is numbered from 1. We want to place it in the integer
     bit numbered PRECISION if possible, with a compensating change in
     the exponent. */
  exponentChange = omsb - semantics->precision;

  /* If the resulting exponent is too high, overflow according to
     the rounding mode. */
  if (exponent + exponentChange > semantics->maxExponent)
    return handleOverflow(rounding_mode);

  /* Subnormal numbers have exponent minExponent, and their MSB
     is forced based on that. */
  if (exponent + exponentChange < semantics->minExponent)
    exponentChange = semantics->minExponent - exponent;

  /* Shifting left is easy as we don't lose precision. */
  if (exponentChange < 0) {
    assert(lost_fraction == lfExactlyZero);
    shiftSignificandLeft(-exponentChange);
    return opOK;
  }

  if (exponentChange > 0) {
    lostFraction lf;

    /* Shift right and capture any new lost fraction. */
    lf = shiftSignificandRight(exponentChange);
    lost_fraction = combineLostFractions(lf, lost_fraction);

    /* Keep OMSB up-to-date. */
    if (omsb > (unsigned) exponentChange)
      omsb -= exponentChange;
    else
      omsb = 0;
  }
}

I don't know about the handleOverflow stuff, but the rest seems to be handling denormals, and this is the exact piece that gets skipped when converting the double with bit pattern 0x10000001 to float: after shifting right by 29, the significand is 0 and lost_fraction is lfMoreThanHalf. Since significandMSB() + 1 is 0 for a zero significand, this if is never executed:

if (exponentChange > 0) {
      lostFraction lf;

      /* Shift right and capture any new lost fraction.  */
      lf = shiftSignificandRight(exponentChange);

      lost_fraction = combineLostFractions(lf, lost_fraction);

      /* Keep OMSB up-to-date.  */
      if (omsb > (unsigned) exponentChange)
        omsb -= exponentChange;
      else
        omsb = 0;
    }

so lost_fraction remains lfMoreThanHalf, which leads to roundAwayFromZero rounding 0 up to the next float.
While I understand exactly why the current code produces this number, I'm still having trouble reconciling it with its intended function (the code is somewhat abstracted from the plain conversion algorithm, so I don't know which invariants it's supposed to uphold), and so I can't really propose a "correct" fix.

@rotateright
Contributor

I haven't made sense of that code either yet, but all double denorms should be 0.0 or -0.0 when rounded to float, so maybe we can bail out sooner.

And that's really any sufficiently small magnitude double - as long as the top 2 exponent bits are not set, it's going to be rounded to 0.0 if this is correct:
https://alive2.llvm.org/ce/z/NeeSVq
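A quick check of that bound (my arithmetic, not from the thread): with the top two bits of the 11-bit exponent field clear, the biased exponent is at most 511, so the magnitude is below 2^-511, far under the round-to-nearest flush-to-zero threshold of 2^-150 (half of the smallest positive float, 2^-149):

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
  // Largest double whose top two exponent bits are clear: exponent field
  // 0x1FF, all-ones significand; just under 2^-511.
  std::uint64_t bits = (0x1FFull << 52) | ((1ull << 52) - 1);
  double d = std::bit_cast<double>(bits); // C++20
  std::printf("%a -> %a\n", d, (double)(float)d); // expect "... -> 0x0p+0"
}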

@efriedma-quic
Collaborator

"normalize" is the last step of most floating-point operations in APFloat. It ensures that the most significant bit is in the correct position, and performs rounding.

"omsb" is zero if the significand is exactly zero. I assume the idea here is that if the result is zero, we can't do exponent adjustment: there is no "most significant bit" because all the bits are zero. So there's nothing to adjust, and the meaning of the lost_fraction is ambiguous; it's not clear what it's supposed to be half of, since half of zero is just zero.

Probably the right fix here is to change IEEEFloat::convert so when the operand is denormal, it performs a smaller shift, so at least one bit of the significand is still set. Then normalize() should do the right thing.

And maybe normalize() should assert that if the significand is exactly zero, the lost_fraction is also exactly zero.

@danilaml
Collaborator Author

danilaml commented Jun 5, 2022

Hm, so normalize can't be given non-zero lost fraction with zero significand in other codepaths?
Is this a problem with normalize or with how it's called in convert? I.e. should normalize be adjusted (like just do category = fcZero if we are in subnormal land and significand is zero, ignoring the omsb) or should convert be fixed to either change the shift (hm, feels hacky), or maybe do something with lost_fraction if it detects a subnormal number at the previous step?

@rotateright
Contributor

I was thinking the same as your last option, but it's not clear to me whether shifting less, as Eli suggested, is necessary:

diff --git a/llvm/lib/Support/APFloat.cpp b/llvm/lib/Support/APFloat.cpp
index 4b75c9db8526..c6aec7269d26 100644
--- a/llvm/lib/Support/APFloat.cpp
+++ b/llvm/lib/Support/APFloat.cpp
@@ -2226,9 +2226,15 @@ IEEEFloat::opStatus IEEEFloat::convert(const fltSemantics &toSemantics,
   }
 
   // If this is a truncation, perform the shift before we narrow the storage.
-  if (shift < 0 && (isFiniteNonZero() || category==fcNaN))
+  if (shift < 0 && (isFiniteNonZero() || category==fcNaN)) {
     lostFraction = shiftRight(significandParts(), oldPartCount, -shift);
 
+    // If we are truncating a denormal, the lostFraction is not relevant when
+    // normalizing, so just set it to zero.
+    if (isDenormal())
+      lostFraction = lfExactlyZero;
+  }
+
   // Fix the storage so it can hold to new value.
   if (newPartCount > oldPartCount) {
     // The new type requires more storage; make it available.

@danilaml
Collaborator Author

danilaml commented Jun 6, 2022

@rotateright I'll submit a patch then, but I think lostFraction should be changed to lfLessThanHalf or similar, instead of simply lfExactlyZero, since otherwise the truncated number would be treated as a plain zero, and losesInfo would be set to false, which is wrong. I believe it should hit the final return statement in normalize instead:

  /* The fcZero case is a denormal that underflowed to zero.  */
  return (opStatus) (opUnderflow | opInexact);

Also, I don't know whether we can rely on isDenormal at this point of convert (it's checked manually and a bit differently earlier).
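Stated as a unit-test-style check, the expectation would be (a sketch; the status combination follows the final return in normalize quoted above):

#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"
#include <cassert>
using namespace llvm;

int main() {
  APFloat Val(APFloat::IEEEdouble(), APInt(64, 0x1E02A283));
  bool LosesInfo = false;
  APFloat::opStatus S = Val.convert(
      APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &LosesInfo);
  assert(Val.isPosZero()); // the denormal underflows to +0.0...
  assert(LosesInfo);       // ...but information was lost, so this is true
  assert(S == (APFloat::opUnderflow | APFloat::opInexact));
  return 0;
}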

@rotateright
Contributor

rotateright commented Jun 6, 2022

Sounds good - I added some IR regression tests, but we might want to supplement that with unit tests for APFloat directly:
abb21b5

@efriedma-quic
Collaborator

Is this a problem with normalize or with how it's called in convert? I.e. should normalize be adjusted (like just do category = fcZero if we are in subnormal land and significand is zero, ignoring the omsb) or should convert be fixed to either change the shift (hm, feels hacky), or maybe do something with lost_fraction if it detects a subnormal number at the previous step?

I'd probably say it's a problem with how it's called? I mean, in practice, I don't think any of the other callers can call normalize() with both a zero significand and a lost fraction. And the expected semantics aren't really clear.

I don't think it's true in general that converting a denormal to a float with narrower significand storage necessarily produces zero. Consider conversions between half and bfloat. (I don't think there's any way to write that in LLVM IR at the moment, but the APFloat API allows it.)
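A concrete instance of the half/bfloat case (my own example via the APFloat API, not from the thread): half's smallest denormal is 2^-24, and bfloat has a narrower significand (8 bits vs. 11) but a much wider exponent range, so that value converts to an exact, nonzero normal bfloat rather than to zero.

#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"
#include <cassert>
using namespace llvm;

int main() {
  // 0x0001 is half's smallest denormal, 2^-24.
  APFloat H(APFloat::IEEEhalf(), APInt(16, 1));
  bool LosesInfo = true;
  APFloat::opStatus S =
      H.convert(APFloat::BFloat(), APFloat::rmNearestTiesToEven, &LosesInfo);
  // 2^-24 is a normal bfloat (min normal is 2^-126), so nothing is lost.
  assert(S == APFloat::opOK && !LosesInfo && !H.isZero());
  return 0;
}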

@pmor13

pmor13 commented Jun 9, 2022

A little off-topic: why is rounding a part of normalization in LLVM?

IEEEFloat::normalize(roundingMode rounding_mode,
                     lostFraction lost_fraction) 

In other compilers I see that normalization and rounding are separate operations.

@efriedma-quic
Collaborator

I'm not the original author... but I guess it just made sense to put them together because they both need to happen after every floating-point computation.
