Skip to content

Conversation

@r-devulap
Copy link
Member

@r-devulap r-devulap commented Oct 13, 2023

Updating x86-simd-sort to latest commit. Includes 2 major updates:

  1. Speed up np.sort by up-to 2x for 32-bit and up-to 1.5x for 64-bit data. Ref: Various performance improvements x86-simd-sort#83 adds optimal sorting networks and minor improvements to vectorized partitioning. This also speeds up np.partition by up-to 1.3x.
  2. np.argsort and np.argparition relied heavily on the gather instruction which unfortunately is terrible for performance because of a new vulnerability DOWNFALL (see https://www.phoronix.com/review/downfall). We reverted back to using scalar emulation of gather (see Use scalar emulation of gather instruction for arg methods x86-simd-sort#65) which then improves performance by about 1.6x for both np.argsort and np.argpartition.

Benchmarks for random data:

| Change   | Before [67539a40] <main>   | After [da55324f] <xss-perf>   |   Ratio | Benchmark (Parameter)                                                                    |
|----------|----------------------------|-------------------------------|---------|------------------------------------------------------------------------------------------|
| -        | 124±2μs                    | 118±0.9μs                     |    0.95 | bench_function_base.Partition.time_partition('int64', ('random',), 10)                   |
| -        | 124±0.8μs                  | 118±0.3μs                     |    0.95 | bench_function_base.Partition.time_partition('int64', ('random',), 1000)                 |
| -        | 124±0.3μs                  | 107±1μs                       |    0.87 | bench_function_base.Partition.time_partition('float32', ('random',), 10)                 |
| -        | 123±0.4μs                  | 105±0.4μs                     |    0.86 | bench_function_base.Partition.time_partition('float32', ('random',), 1000)               |
| -        | 123±0.4μs                  | 105±0.2μs                     |    0.85 | bench_function_base.Partition.time_partition('float32', ('random',), 100)                |
| -        | 56.5±0.5μs                 | 44.7±1μs                      |    0.79 | bench_function_base.Partition.time_partition('int32', ('random',), 10)                   |
| -        | 57.4±0.4μs                 | 43.6±1μs                      |    0.76 | bench_function_base.Partition.time_partition('int32', ('random',), 100)                  |
| -        | 58.1±0.8μs                 | 43.4±0.8μs                    |    0.75 | bench_function_base.Partition.time_partition('int32', ('random',), 1000)                 |
| -        | 73.2±0.06μs                | 50.5±0.02μs                   |    0.69 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                        |
| -        | 210±0.2μs                  | 142±0.2μs                     |    0.68 | bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))                     |
| -        | 60.0±0.1μs                 | 40.2±0.2μs                    |    0.67 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                      |
| -        | 447±0.6μs                  | 295±2μs                       |    0.66 | bench_function_base.Partition.time_argpartition('float64', ('random',), 10)              |
| -        | 449±1μs                    | 298±1μs                       |    0.66 | bench_function_base.Partition.time_argpartition('float64', ('random',), 100)             |
| -        | 445±1μs                    | 294±1μs                       |    0.66 | bench_function_base.Partition.time_argpartition('float64', ('random',), 1000)            |
| -        | 34.4±0.05μs                | 22.7±0.03μs                   |    0.66 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                        |
| -        | 218±0.5μs                  | 142±0.2μs                     |    0.65 | bench_function_base.Sort.time_argsort('quick', 'float64', ('random',))                   |
| -        | 211±0.7μs                  | 136±1μs                       |    0.64 | bench_function_base.Sort.time_argsort('quick', 'float32', ('random',))                   |
| -        | 35.7±0.1μs                 | 22.8±0.05μs                   |    0.64 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                       |
| -        | 418±3μs                    | 264±1μs                       |    0.63 | bench_function_base.Partition.time_argpartition('float32', ('random',), 10)              |
| -        | 419±3μs                    | 264±2μs                       |    0.63 | bench_function_base.Partition.time_argpartition('float32', ('random',), 100)             |
| -        | 416±3μs                    | 261±1μs                       |    0.63 | bench_function_base.Partition.time_argpartition('float32', ('random',), 1000)            |
| -        | 424±0.7μs                  | 264±1μs                       |    0.62 | bench_function_base.Partition.time_argpartition('int64', ('random',), 10)                |
| -        | 427±1μs                    | 266±0.5μs                     |    0.62 | bench_function_base.Partition.time_argpartition('int64', ('random',), 100)               |
| -        | 423±0.7μs                  | 264±1μs                       |    0.62 | bench_function_base.Partition.time_argpartition('int64', ('random',), 1000)              |
| -        | 410±3μs                    | 251±2μs                       |    0.61 | bench_function_base.Partition.time_argpartition('int32', ('random',), 100)               |
| -        | 408±3μs                    | 248±2μs                       |    0.61 | bench_function_base.Partition.time_argpartition('int32', ('random',), 1000)              |
| -        | 190±0.7μs                  | 116±0.2μs                     |    0.61 | bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))                     |
| -        | 190±0.6μs                  | 115±0.2μs                     |    0.61 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('random',))                    |
| -        | 412±3μs                    | 249±2μs                       |    0.6  | bench_function_base.Partition.time_argpartition('int32', ('random',), 10)                |
| -        | 44.5±0.07μs                | 22.6±0.04μs                   |    0.51 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))     

Detailed benchmarks can be seen here: https://gist.github.com/r-devulap/21dc3afdbab47c7aa08087c1445954b7

@seberg
Copy link
Member

seberg commented Oct 17, 2023

@r-devulap the test failures here do look real, at least in the form of compiler warnings.

EDIT: If you should want to backport the DOWNFALL fix, that probably would need a more targeted diff. OTOH, I am not sure it's worth the trouble since it "only" restores performance.

@r-devulap
Copy link
Member Author

@seberg Indeed, still working on fixing them. And I don't think we need to worry about backporting the DOWNFALL fix.

@r-devulap
Copy link
Member Author

Update: We have changed API to use size_t for all array indices and the calls to avx512 routines are updated accordingly. That made it easy for us to vectorize np.argpartition and np.argsort on 32-bit platforms (see PR numpy/x86-simd-sort#88).

@seberg
Copy link
Member

seberg commented Oct 18, 2023

We have changed API to use size_t for all array indices

So that is technically be broken or need a guard until we would do gh-24888? Although I guess that the use of that code implies we are not on a niche platform.

@r-devulap
Copy link
Member Author

We have changed API to use size_t for all array indices

So that is technically be broken or need a guard until we would do gh-24888? Although I guess that the use of that code implies we are not on a niche platform.

I don't think that's necessary.

  1. For the sort and partition methods npy_intp will get cast to size_t. Array size passed to sort/partition won't be negative and so this implicit cast isn't a problem.
  2. For argsort, we need to reinterpret_cast npy_intp* to size_t*. Whether npy_intp is Py_intptr_t or ssize_t shouldn't make a difference to this cast.

let me know if my assumptions are wrong.

@r-devulap
Copy link
Member Author

hmm, the macOS x86_64 failure seems hard to figure out. It's using x86_64-apple-darwin13.4.0-clang++ (clang 15.0.7 "clang version 15.0.7") but I don't see this build fail with clang++-15.0.7 locally.

Fails with: error: implicit instantiation of undefined template 'zmm_vector<unsigned long>'. With long as 8 bytes, unsigned long should just default to uin64_t and it should find the template definition like all other cases.

@r-devulap
Copy link
Member Author

yay, finally. Decided to explicitly instantiate to zmm_vector<uint64_t> on 64-bit macOS.

@r-devulap
Copy link
Member Author

Friendly ping :)

@r-devulap
Copy link
Member Author

Changes in NumPy here are pretty minor in this PR. Perhaps we can close this and merge it into a single PR #25045. Please reopen if you disagree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants