Improve performance of `merge_percentiles` #7172

shwina · 2021-02-04T19:48:36Z

As discussed in #7162 (comment), this PR speeds up `merge_percentiles` by using NumPy operations rather than `merge_sorted`. This makes it significantly faster, especially for CuPy arrays:

merge_percentiles timings for NumPy arrays, before and after this PR:

merge_percentiles timings for CuPy arrays, before and after this PR:
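For readers unfamiliar with the trade-off, here is a minimal sketch of the two merging strategies being compared. It is not the code from this PR: the function names are hypothetical, and the standard library's `heapq.merge` stands in for `merge_sorted` so the example runs without extra dependencies. The iterator version walks the pre-sorted inputs one pair at a time in Python, while the array version concatenates everything and reorders it with a single stable `argsort`.

```python
import heapq

import numpy as np


def merge_with_iterators(vals, counts):
    """Lazily merge pre-sorted (value, count) streams, one pair at a time.

    heapq.merge is used here as a stand-in for merge_sorted (an assumption
    made for this illustration only).
    """
    streams = [zip(v.tolist(), c.tolist()) for v, c in zip(vals, counts)]
    merged_vals, merged_counts = zip(*heapq.merge(*streams))
    return np.array(merged_vals), np.array(merged_counts)


def merge_with_arrays(vals, counts):
    """Concatenate all inputs, then reorder them with one stable argsort."""
    combined_vals = np.concatenate(vals)
    combined_counts = np.concatenate(counts)
    order = np.argsort(combined_vals, kind="stable")
    return combined_vals[order], combined_counts[order]


# Two datasets, each already sorted by value.
vals = [np.array([1.0, 4.0, 9.0]), np.array([2.0, 3.0, 10.0])]
counts = [np.array([5, 5, 5]), np.array([7, 7, 7])]

iter_vals, iter_counts = merge_with_iterators(vals, counts)
arr_vals, arr_counts = merge_with_arrays(vals, counts)
```

One difference worth noting: merging tuples orders ties by value and then by count, whereas the stable `argsort` orders by value only and keeps tied entries in input order. The two agree whenever the values are distinct, as they are in this example.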

Additional info:

This PR depends on "Add percentile support for NEP-35" (#7162) and should not be merged before that one.
The script I used to generate the timings:


import timeit

import cupy as cp
import numpy as np
import pandas as pd
from dask.array.percentile import merge_percentiles


def bench(nquantiles, ndatasets, lib):
    """Return the mean time of one merge_percentiles call over `repeat` runs."""
    scale = 100
    # Each dataset reports half as many percentiles as are finally requested.
    calculated_quantiles = nquantiles // 2

    # Random integer-valued percentiles in [0, scale) and random values;
    # `lib` is the array module under test (np or cp).
    finalq = lib.floor(lib.random.rand(nquantiles) * scale)
    qs = list(lib.floor(lib.random.rand(ndatasets, calculated_quantiles) * scale))
    vals = list(lib.random.rand(ndatasets, calculated_quantiles))
    Ns = lib.ones(ndatasets) * 100  # number of observations per dataset

    dt = 0.0
    repeat = 10
    for _ in range(repeat):
        t1 = timeit.default_timer()
        merge_percentiles(finalq, qs, vals, Ns=Ns)
        t2 = timeit.default_timer()
        dt += t2 - t1
    return dt / repeat


results = pd.DataFrame(columns=("nquantiles", "ndatasets", "time"))

for nquantiles in (2, 10, 100):
    for ndatasets in (2, 10, 100):
        results.loc[len(results)] = (nquantiles, ndatasets, bench(nquantiles, ndatasets, cp))

results
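A note on timing GPU code in general: CuPy launches kernels asynchronously, so a wall-clock timer can stop before the device has finished its work unless something forces a synchronization. The sketch below is a hypothetical helper (`time_call` and its `sync` parameter are names invented for this illustration, not part of the script above) that adds a warm-up call and an optional synchronization hook. Whether this would change the numbers reported in this PR is not something the thread establishes.

```python
import timeit

import numpy as np


def time_call(fn, repeat=10, sync=None):
    """Return the mean wall-clock seconds per call of fn().

    `sync` is an optional zero-argument callable run after each call, so
    that asynchronous work is finished before the timer stops.
    """
    if sync is None:
        def sync():
            return None

    fn()  # warm-up call, excluded from the measurement
    sync()

    total = 0.0
    for _ in range(repeat):
        start = timeit.default_timer()
        fn()
        sync()
        total += timeit.default_timer() - start
    return total / repeat


# CPU example; no synchronization is needed for NumPy.
data = np.random.rand(1000)
mean_seconds = time_call(lambda: np.sort(data), repeat=3)

# With CuPy, one option would be to pass
#     sync=cp.cuda.Stream.null.synchronize
# so each measurement waits for the default stream to drain.
```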


eriknw · 2021-02-05T19:33:03Z

dask/array/percentile.py

    # comparable to, but typically slower than, `merge_sorted`.
    #
    # >>> A = np.concatenate(map(np.array, map(zip, vals, counts)))
    # >>> A.sort(0, kind='mergesort')


Nice! You can probably update or remove the comment block above. I hope it was helpful. Your solution below looks good. (I wish I could go back and look at the variations I tried and my benchmarks, but I think they are lost to time)

Sanity check: you have cytoolz installed in the benchmark environment, right? Based on your timings, I think it must be.

Thanks for the feedback! Looking at the comment above, should it be:

# Sort by calculated percentile values

rather than

# Sort by calculated percentile values, then number of observations.

?

And yes, I do have cytoolz installed.


shwina · 2021-02-05T20:52:09Z

Apologies for the multitude of commits here - got myself in a bit of a git mess. Should be good now.

jakirkham · 2021-02-11T08:09:21Z

No worries we do squash merge here 🙂

jsignell · 2021-02-22T16:03:08Z

It looks like the tests are passing other than the known mac one. There is a fix on master for that, so if you merge or rebase, the tests should all pass.


jakirkham · 2021-02-24T20:33:08Z

Looks like Ashwin merged in master. Were there other things we were waiting on here?

jakirkham · 2021-02-26T03:14:09Z

Taking silence here to mean no

jakirkham · 2021-02-26T03:14:52Z

Thanks Ashwin! 😄

If anything else comes up, please let us know and we can follow up in a new issue/PR as appropriate 🙂

pentschev and others added 6 commits February 3, 2021 12:18

Add array_safe function supporting like= kwarg

271d33d

Support for like= in percentile

6b10e3b

Add CuPy tests for percentiles

d8824a3

Fix array_like_safe for NumPy < 1.20

610bd4b

Improve percentile perf

a7078bb

Merge branch 'master' of github.com:dask/dask into improve-percentile…

9d9842d

…-perf

jakirkham mentioned this pull request Feb 4, 2021

Add percentile support for NEP-35 #7162

Merged

eriknw reviewed Feb 5, 2021


crusaderky and others added 10 commits February 5, 2021 15:25

Generically rebuild a collection with different keys (dask#7142)

b9c2968

Add array_safe function supporting like= kwarg

31c3e5b

Support for like= in percentile

51cf7cb

Add CuPy tests for percentiles

1707136

Fix array_like_safe for NumPy < 1.20

fd9bd3c

Improve percentile perf

dd72537

Merge branch 'improve-percentile-perf' of github.com:shwina/dask into…

fff520c

… improve-percentile-perf

Merge remote-tracking branch 'upstream/master' into improve-percentil…

86c2497

…e-perf

Remove note comparing np/merge_sorted

8e2f7ba

Update low-level graph spec to use any hashable for keys (dask#7163)

09fe536

Remove unused merge_sorted

30e0314

shwina changed the title ~~[WIP] Improve performance of merge_percentiles~~ Improve performance of merge_percentiles Feb 5, 2021

Merge branch 'master' of github.com:dask/dask into improve-percentile…

17b81a9

…-perf

jakirkham merged commit d540d73 into dask:master Feb 26, 2021

pentschev mentioned this pull request Mar 24, 2021

[BUG] dask-cudf .describe() broken with NumPy 1.20 rapidsai/cudf#7289

Closed


7 participants
