
TST/BENCH: Adding test coverage and benchmarks for floating point umath functions#19485

Merged
mattip merged 7 commits into numpy:main from r-devulap:fp-tests
Aug 4, 2021

Conversation

@r-devulap
Member

Expands test coverage for the umath module in the following ways:

  1. Adds tests to verify correctness for special floating-point values (NaN, Inf, etc.)
  2. Adds tests to ensure FP exceptions are raised when appropriate (divide-by-zero, invalid, overflow, and underflow)
  3. Expands the accuracy-validation datasets to cover more umath functions; the source code used to generate them is also provided.
  4. Expands benchmarking to cover all the unary umath functions.
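For illustration, the first two kinds of checks can be sketched with np.errstate; this is a generic sketch of the testing pattern, not the PR's actual test code:

```python
import numpy as np

# 1. log(0) must signal the IEEE divide-by-zero exception:
with np.errstate(divide='raise'):
    try:
        np.log(np.float32(0.0))
        raised = False
    except FloatingPointError:
        raised = True
assert raised

# 2. sqrt of a negative must signal 'invalid':
with np.errstate(invalid='raise'):
    try:
        np.sqrt(np.float32(-1.0))
        raised = False
    except FloatingPointError:
        raised = True
assert raised

# 3. Special inputs must propagate per IEEE 754: NaN in, NaN out,
#    and sin/cos of infinity are NaN.
assert np.isnan(np.sin(np.float32(np.nan)))
assert np.isnan(np.cos(np.float32(np.inf)))
```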

@tylerjereddy
Contributor

That's a lot of CSV data files!

I see there's a README at that path explaining what is going on.

Super minor point, but the README says: Add a file 'umath-validation-set-<ufuncname>.txt'; looks like the team is going with .csv instead.

@r-devulap
Member Author

That's a lot of CSV data files!

yup, but it is in line with this wish list #13515. This patch covers most of the umath functions and hence the long list of csv files. Commit a655aa8 also adds the C++ code I wrote to generate these files. I am not sure if you would want to include that in the repository. If you don't care for it, I am okay reverting that commit.

Super minor point, but the README says: Add a file 'umath-validation-set-<ufuncname>.txt'; looks like the team is going with .csv instead.

Thanks for pointing that out. I will update the README file.

@r-devulap
Member Author

The umath validation tests fail on the manylinux2010 container, which seems to have really bad ULP errors (max difference of 9.21431e+18 ULP for sin and cos). For now, I have enabled these tests only if both the compiler and the CPU support AVX512 (which essentially restricts testing to just the AVX512 code paths in NumPy).

@r-devulap
Member Author

The umath validation tests fail on the manylinux2010 container, which seems to have really bad ULP errors (max difference of 9.21431e+18 ULP for sin and cos). For now, I have enabled these tests only if both the compiler and the CPU support AVX512 (which essentially restricts testing to just the AVX512 code paths in NumPy).

umm, on second thought, this still might not work :/

@charris
Member

charris commented Jul 16, 2021

We have dropped 2010 for the development numpy wheels, looks like we need to do it locally also.

@r-devulap
Member Author

We have dropped 2010 for the development numpy wheels, looks like we need to do it locally also.

I will stop worrying about fixing the CI failures then, both the failing checks are from manylinux2010.

@charris
Member

charris commented Jul 16, 2021

See #19498.

@r-devulap
Member Author

See #19498.

I will rebase after that gets merged.

@charris
Member

charris commented Jul 16, 2021

I will stop worrying about fixing the CI failures then

I think we should wait to see if things work with 2014.

@r-devulap
Member Author

I will stop worrying about fixing the CI failures then

I think we should wait to see if things work with 2014.

It passes on manylinux2014_i686 on my local machine at least; I'm optimistic it will be okay after I rebase.

@charris
Member

charris commented Jul 16, 2021

OK, the change to manylinux2014 is in.

@r-devulap
Member Author

rebased ..

@mattip
Member

mattip commented Jul 18, 2021

This extends the existing set of csv files (for np.float32) with more functions and adds np.float64 data as well. It ups the tolerated ULP from 2 to 4. Should we be less tolerant of np.float64 errors? I realize that this might slow down processing, but I think our approach has always been accuracy before performance.

@r-devulap
Member Author

It ups the tolerated ULP from 2 to 4. Should we be less tolerant of np.float64 errors?

A max error of 4 ULP means that up to two bits at the end of a result's significand may be incorrect, so the top 51 bits are still correct. I think the results are still pretty accurate, and it's not a major concern IMO. For comparison, note that such an error can easily accumulate after just a few (possibly just 8) floating-point additions, multiplications, or divisions in a row.
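The "two trailing bits" claim can be checked directly. A minimal sketch, where ulp_diff is an illustrative helper (not NumPy's API) valid for finite, same-signed doubles:

```python
import numpy as np

# Illustrative helper: ULP distance between two finite, same-signed
# doubles via their ordered integer bit patterns.
def ulp_diff(a, b):
    ia = np.array(a, dtype=np.float64).view(np.int64)
    ib = np.array(b, dtype=np.float64).view(np.int64)
    return abs(int(ia) - int(ib))

x = np.float64(1.0)
y = x
for _ in range(4):                 # walk 4 representable values upward
    y = np.nextafter(y, np.float64(2.0))

assert ulp_diff(x, y) == 4         # 4 ULP apart
assert y - x == 4 * np.spacing(x)  # only the last two significand bits differ
```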

What level of accuracy would be acceptable for these functions?

@charris
Member

charris commented Jul 21, 2021

What level of accuracy would be acceptable for these functions?

Hard to say, I expect it would depend on the application. Astronomy might be an area where exact coordinate transformations or timing might matter. Outside of that, my rule of thumb is that if you end up with 10 digits of accuracy at the end of a calculation, you have done well enough. @mhvk might have some thoughts.

@mhvk
Contributor

mhvk commented Jul 21, 2021

Hard to make a generic statement for astropy - I assume this is mostly for more complicated functions (trig, exp, log, etc.)? For those the loss of a few bits should probably be OK, though it may well be that tests would start failing as they assume more stringent constraints even if those are not needed scientifically. For addition/multiplication, we do try very hard to keep losses as small as possible (especially in our Time class).

But I am obviously very intrigued by the very large speed-ups mentioned in #19478!

@charris
Member

charris commented Jul 21, 2021

@r-devulap Are there timings available for the high precision versions?

@r-devulap
Member Author

@r-devulap Are there timings available for the high precision versions?

I don't have specific numbers at the moment. But since higher accuracy means lower performance (which is the trade-off), they certainly will be slower than the ones in #19478.

@charris
Member

charris commented Jul 21, 2021

which is the trade-off

It would be nice to know what the trade-off is.

@seberg
Member

seberg commented Jul 21, 2021

The one thing I wonder is whether we could set the max-ulp to something function-specific (and lower than 4) where we know it should be more precise (i.e. most functions, both for Intel SVML and other math libraries).

Not because I think 4 is unreasonable for pretty much any function, but it also "documents" what our precision typically is. And it might help us get a comparison with the current precision; I still have the feeling that these are more comparable to glibc than the 4 ULP may sound like.

It would be interesting to report if some exotic system is not within the precision limits we would like to see (rather than fail a test), but that is far beyond here :).

@mattip
Member

mattip commented Jul 25, 2021

If you tighten the ULP to 2, what tests fail?

@r-devulap
Member Author

If you tighten the ULP to 2, what tests fail?

Nothing fails, actually. Depending on the polynomial approximation used, the max ULP can occur at different points, and it's hard for a validation suite to capture those points for all the functions. I am working on getting hold of some data that might help with understanding the performance and accuracy trade-offs.

@r-devulap
Member Author

The one thing I wonder is whether we could set the max-ulp to something function-specific (and lower than 4) where we know it should be more precise (i.e. most functions, both for Intel SVML and other math libraries).

We absolutely could set the max ulp to something function specific. I am looking into it.

It would be interesting to report if some exotic system is not within the precision limits we would like to see (rather than fail a test), but that is far beyond here :).

We already know that the sin and cos implementations in some 32-bit libraries are terrible for largish numbers (see 742f3f1). I would imagine it is quite a bit of work to document precision limits across all the platforms that NumPy supports.

@r-devulap
Member Author

Here are the ULP errors per function for the low-accuracy (LA) SVML functions in #19478. Accuracy for single-precision functions of one argument was determined by exhaustive testing, so these are the true maximum ULP errors. For the double-precision functions, the worst-case error was determined by measuring accuracy on billions of test points per function (exhaustive testing is infeasible for double precision).

SVML     float32 LA ULP   float64 LA ULP
tan      3.91             3.93
cos      2.36             3.38
sin      2.37             3.3
asin     3.12             2.55
atan     2.3              2.52
atan2    3.26             2.37
expm1    2.62             2.1
cosh     2.48             1.97
log1p    1.96             1.93
log10    3.5              1.92
sinh     1.55             1.89
log2     2.12             1.84
cbrt     1.94             1.82
acos     2.1              1.67
log      1.84             1.67
exp      3.76             1.53
asinh    1.01             1.48
atanh    1.45             1.46
tanh     1.38             1.19
pow      1.23             1.06
acosh    1.16             1.05
exp2     1.01             1.04

Is it safe to assume that an accuracy of 2 ULP is acceptable for double-precision functions? For those with worse than 2 ULP error (tan, cos, sin, asin, atan, expm1), moving from the low-accuracy (LA) SVML implementation to high-accuracy (HA) SVML slows them down by roughly 1.3x to 1.6x (depending on the function).
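The exhaustive float32 approach described above can be sketched by sweeping bit patterns directly, since every 32-bit pattern is a float32 input. This is an illustrative sketch (max_ulp_error is a hypothetical helper, not the actual SVML harness); only the bit range covering [1.0, 2.0) is swept here to keep it quick:

```python
import numpy as np

def max_ulp_error(ufunc, ref, start, stop, chunk=1 << 16):
    """Worst ULP error of `ufunc` vs a float64 reference over a bit range."""
    worst = 0
    for lo in range(start, stop, chunk):
        bits = np.arange(lo, min(lo + chunk, stop), dtype=np.uint32)
        x = bits.view(np.float32)              # reinterpret bits as inputs
        got = ufunc(x)
        want = ref(x.astype(np.float64)).astype(np.float32)
        mask = np.isfinite(got) & np.isfinite(want)
        # int32 views give ULP distance; valid here since sin on [1, 2)
        # yields same-signed (positive) results.
        d = np.abs(got[mask].view(np.int32).astype(np.int64)
                   - want[mask].view(np.int32).astype(np.int64))
        if d.size:
            worst = max(worst, int(d.max()))
    return worst

# Bit patterns 0x3F800000..0x40000000 cover float32 values in [1.0, 2.0).
err = max_ulp_error(np.sin, np.sin, 0x3F800000, 0x40000000)
assert err <= 4   # within the tolerance discussed in this thread
```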

@charris
Member

charris commented Jul 28, 2021

@r-devulap Thanks. It is unfortunate that the biggest errors belong to some of the most used functions :) I wonder how much of that is due to the range over which the tests are run?

@r-devulap
Member Author

@r-devulap Thanks. It is unfortunate that the biggest errors belong to some of the most used functions :) I wonder how much of that is due to the range over which the tests are run?

Not 100% sure, but that is likely the case. When I was developing float32 sin, cos, exp, and log algorithms on my own (#13368 and #13134), that was indeed the case. The range-reduction step that reduces large numbers to a certain range introduces some error, and then the polynomial approximation adds some more.

@seberg
Member

seberg commented Jul 28, 2021

Nice, this table is cool to have! Frankly, it may even be cool to have as some utility script that users can run if they are curious.

Hmm, the table here looks a bit different? And some of these, e.g. tan, seem ⪅1 ULP in typical math libraries. For sin/cos, I guess the problem may just be very large/small values?

Looking at the official tables, the high-accuracy 1 ULP versions seem in the ballpark of about 50% slower on high-end CPUs (less so on older CPUs).

@r-devulap
Member Author

Hmm, the table here looks a bit different?

The table you are looking at is VML which is part of oneAPI Math Kernel Library. As @akolesov-intel clarified in #19478 (comment), SVML is the internal library in Intel Compilers which has its own implementations.

And some of these, e.g. tan, seem ⪅1 ULP in typical math libraries. For sin/cos, I guess the problem may just be very large/small values?

Yes, generally ULP errors tend to be worse for very large values. sin, cos, and tan should be fairly accurate for very small values.

Looking at the official tables, the high-accuracy 1 ULP versions seem in the ballpark of about 50% slower on high-end CPUs (less so on older CPUs).

That looks to be about the same for SVML too.

@r-devulap
Member Author

Updated csv files to reflect the appropriate ULP error.

@rgommers
Member

@r-devulap it seems to me that it would also be quite useful to see typical distributions of error. You are reporting only the worst accuracy over the full range, but it also matters whether the median error is similar to that or way lower. Maybe just take one or a couple of functions and plot a histogram of abs(actual - expected)? The mean or median of such a distribution would be relevant as well.

It looks like if you tell people to choose between "high accuracy" and "low accuracy", they will go for high. But if the average error is < 1 ULP, I wouldn't call it "low accuracy" exactly.

@carlkl
Member

carlkl commented Jul 31, 2021

You may also be interested in Paul Zimmermann's (@zimmermann6) work:
https://members.loria.fr/PZimmermann/papers/accuracy.pdf compares the accuracy of math functions across several math libraries.
https://gitlab.inria.fr/zimmerma/math_accuracy provides tools to check the accuracy of the mathematical functions mentioned in the paper above.

@carlkl
Member

carlkl commented Jul 31, 2021

As for precision vs accuracy: this was discussed, e.g., in this issue thread: shibatch/sleef#107, as well as in some other threads of the sleef project. Basically:

  1. Correctly performed domain reduction (range reduction) reduces performance regardless of the accuracy of the function.
  2. Precision around 1 ULP needs higher intermediate precision. In the case of sleef this is done by its own SIMD implementation of double-double arithmetic.
  3. FMA instructions help with 2. Needs Haswell and above.

I agree with @rgommers: if you are not sure what you need, you will tend toward higher precision anyway.
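Point 2 above can be illustrated with the classic error-free transformation at the heart of double-double arithmetic. A minimal scalar sketch (sleef implements the SIMD analogue):

```python
# Knuth's branch-free two-sum: s is the rounded sum and e the exact
# rounding error, so the pair (s, e) represents a + b without any loss.
# This is the building block of double-double intermediate precision.
def two_sum(a, b):
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

s, e = two_sum(1.0, 1e-20)
assert s == 1.0      # the rounded double sum loses the tiny term...
assert e == 1e-20    # ...but the error term captures it exactly
```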

@r-devulap
Member Author

r-devulap commented Aug 2, 2021

@r-devulap it seems to me that it would also be quite useful to see typical distributions of error. You are reporting only the worst accuracy over the full range, but it also matter if the median error is similar to that, or way lower. Maybe just take one or a couple of functions and plot a histogram of abs(actual - expected)? The mean or median of such a distribution would be relevant as well.

This is pretty easy to compute. We can compare the double-precision output computed via SVML to the corresponding higher-accuracy 128-bit float using the nulp_diff function in NumPy:

import numpy as np
from numpy.testing._private.utils import nulp_diff  # private helper

N = 10000000
arr1 = np.random.uniform(low=-1.0, high=1.0, size=N)
svml_out = np.sin(arr1)
true_out = np.sin(arr1.astype(np.float128))
ulp = nulp_diff(svml_out, true_out, dtype=np.float64)

Here are the stats for roughly a billion uniformly sampled numbers for the worst ULP error LA functions:

ufunc    0 ulp     1 ulp     2 ulp    3 ulp
sin      39.35 %   54.95 %   5.62 %   0.07 %
cos      36.13 %   55.11 %   8.66 %   0.10 %
tan      59.80 %   39.15 %   1.02 %   0.02 %
arctan   72.42 %   27.59 %   0.00 %   0.00 %
arcsin   73.74 %   25.40 %   0.86 %   0.00 %
expm1    96.67 %    3.33 %   0.00 %   0.00 %

It looks like if you tell people to choose between "high accuracy" and "low accuracy", they will go for high. But if the average error is < 1 ULP, I wouldn't call it "low accuracy" exactly.

From the table it is clear that more than 90% of the errors are within 1 ULP (i.e. under 1.5 ULP). For comparison, the corresponding output for the libm library in glibc 2.31 shows that errors are closer to 0.5 ULP:

ufunc    0 ulp     1 ulp     2 ulp    3 ulp
sin      99.88 %    0.12 %   0.00 %   0.00 %
cos      99.87 %    0.13 %   0.00 %   0.00 %
tan      99.97 %    0.03 %   0.00 %   0.00 %
arctan   99.97 %    0.03 %   0.00 %   0.00 %
arcsin   99.97 %    0.03 %   0.00 %   0.00 %
expm1    90.79 %    9.21 %   0.00 %   0.00 %

@r-devulap
Member Author

For the sake of completeness: the nulp_diff function snaps the float128 value to the nearest float64 and then computes the ULP diff. So a 0-ulp diff really means < 0.5 ULP, 1 ulp means < 1.5 ULP, and so on.
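The percentage tables above can be derived from a per-element ULP array with a simple bincount. A sketch with stand-in data (the values here are illustrative only, not the measured ones):

```python
import numpy as np

# Stand-in for the real per-element ULP differences (e.g. the `ulp`
# array produced by nulp_diff).
ulp = np.array([0, 0, 1, 1, 1, 2, 0, 1, 0, 3])

counts = np.bincount(ulp, minlength=4)   # occurrences per ULP bucket
pct = 100.0 * counts / ulp.size          # percentage per bucket
print(["%.2f %%" % p for p in pct])
```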

//{"log",log,log,1.84,1.67},
{"log10",log10,log10,3.5,1.92},
{"log1p",log1p,log1p,1.96,1.93},
{"log2",log2,log2,2.12,1.84},
Member

I am confused about the f32ulp and f64ulp values here (maybe move the struct definition closer to this code so it is clear what the fields are). How did you calculate them as floats?

Member

Cleared up on a community call: these are actually the difference between the values so they should be floating point.

@mattip merged commit 07cbfff into numpy:main Aug 4, 2021
@mattip
Member

mattip commented Aug 4, 2021

Thanks @r-devulap


8 participants