Skip to content

Conversation

@r-devulap
Copy link
Member

@r-devulap r-devulap commented Jun 20, 2023

Todo:

  • Add sin, cos and tanh
  • Benchmark numbers

@r-devulap
Copy link
Member Author

r-devulap commented Jun 21, 2023

Benchmark numbers: On average the double precision high accuracy version seems 1.23x slower than the LA versions (results do not include sin, cos and tanh):

       before           after         ratio
     [92ee012a]       [a0bb1b5e]
     <main>           <svml_ha>
+     14.2±0.07μs      22.7±0.02μs     1.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
+     13.2±0.02μs       19.3±0.1μs     1.46  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'expm1'>, 1, 1, 'd')
+     27.0±0.05μs      34.8±0.01μs     1.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccosh'>, 1, 1, 'd')
+     29.6±0.02μs      37.5±0.02μs     1.27  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'power'>, 1, 1, 1, 'd')
+     11.9±0.01μs         15.0±0μs     1.26  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log10'>, 1, 1, 'd')
+     14.9±0.01μs       18.6±0.5μs     1.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsin'>, 1, 1, 'd')
+     12.0±0.03μs      14.9±0.01μs     1.24  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'd')
+     11.9±0.03μs      14.7±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log'>, 1, 1, 'd')
+      15.7±0.2μs       19.0±0.3μs     1.21  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log1p'>, 1, 1, 'd')
+      25.0±0.6μs         30.2±2μs     1.21  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'arctan2'>, 1, 1, 1, 'd')
+     17.3±0.05μs       20.2±0.1μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 1, 1, 'd')
+     15.7±0.03μs         17.7±0μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sinh'>, 1, 1, 'd')
+      21.5±0.4μs      24.1±0.01μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctanh'>, 1, 1, 'd')
+     11.1±0.03μs         12.4±0μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cosh'>, 1, 1, 'd')
+      18.6±0.9μs      20.0±0.02μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 1, 'd')
+     14.3±0.08μs       15.2±0.2μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cbrt'>, 1, 1, 'd')

@seberg
Copy link
Member

seberg commented Jun 28, 2023

seems 1.23x slower than the LA versions

Wow, I thought it would be much more of a difference, although I will note that those benchmarks seem to run in 10-20μs and may have ~1μs overhead, so the slowdown should be a bit more in practice.
Don't we have larger benchmarks, or do they just report a constant time because all of these are in practice memory throughput bound?

TBH, if this slowdown is what it is, I am comfortable with the high precision as default.

Some of the tests seem to fail due to lower error bounds (i.e. the smoke test), I am not sure if that is the system lib or not, but I guess just means we need to relax those bounds slightly.

This test is failing due to a compiler error because inlining failed?

@r-devulap
Copy link
Member Author

Wow, I thought it would be much more of a difference, although I will note that those benchmarks seem to run in 10-20μs and may have ~1μs overhead, so the slowdown should be a bit more in practice. Don't we have larger benchmarks, or do they just report a constant time because all of these are in practice memory throughput bound?

We benchmark 10000 elements. Let me try increasing the array size. But i suspect the numbers wont be too different.

TBH, if this slowdown is what it is, I am comfortable with the high precision as default.

I agree. 20-30% speed isn't worth all the fuss about accuracy.

Some of the tests seem to fail due to lower error bounds (i.e. the smoke test), I am not sure if that is the system lib or not, but I guess just means we need to relax those bounds slightly.

4 ULP tanh was converted to universal intrinsic and we aren't using SVML for now. I was thinking I could backport SVML to universal intrinsic but it might need more time than I thought.

This test is failing due to a compiler error because inlining failed?

Not sure, will spend some time today/tomorrow in fixing up the CI failures.

@r-devulap
Copy link
Member Author

I think I will defer sin, cos and tanh to another PR. I will try and convert sin/cos HA SVML to universal intrinsics, which might take me some time.

@r-devulap
Copy link
Member Author

huh, it looks like libm float64 crbt has a 2 ULP accuracy.

@r-devulap
Copy link
Member Author

The CI looks stuck?

@charris
Copy link
Member

charris commented Jun 30, 2023

The pyodide error looks unrelated. @hoodmane Could you take a look?

@charris charris merged commit d87c7e6 into numpy:main Jun 30, 2023
@charris
Copy link
Member

charris commented Jun 30, 2023

Thanks @r-devulap .

@hoodmane
Copy link
Contributor

Definitely looks like it's our fault in some way. Probably we needed to pin pydantic, looks like they released a breaking change and we don't have a pin on them. Pinging more Pyodide people who might have a better idea about this: @rth @ryanking13

@hoodmane
Copy link
Contributor

hoodmane commented Jul 1, 2023

Opened #24091.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants