Skip to content

Conversation

@shwina
Copy link
Contributor

@shwina shwina commented Feb 4, 2021

As discussed in #7162 (comment), this PR adds an improvement to merge_percentiles - by using numpy operations rather than merge_sorted. This makes it significantly faster, especially for CuPy arrays:

merge_percentiles timings for NumPy arrays, before and after this PR:

before-numpy

after-numpy

merge_percentiles timings for CuPy arrays, before and after this PR:

before-cupy

after-cupy

Additional info:

Details
from dask.array.percentile import merge_percentiles
import numpy as np
import cupy as cp
import timeit

def bench(nquantiles, ndatasets, lib):
    scale = 100
    calculated_quantiles = nquantiles//2

    finalq = lib.floor(lib.random.rand(nquantiles) * scale)
    qs = list(lib.floor(lib.random.rand(ndatasets, calculated_quantiles) * scale))
    vals = list(lib.random.rand(ndatasets, calculated_quantiles))
    Ns = lib.ones(ndatasets) *  100
    
    dt = 0.
    repeat = 10
    for i in range(repeat):
        t1 = timeit.default_timer()
        merge_percentiles(finalq, qs, vals, Ns=Ns)
        t2 = timeit.default_timer()
        dt += (t2 - t1)
    return (dt) / repeat

import pandas as pd

results = pd.DataFrame(columns=("nquantiles", "ndatasets", "time"))

for nquantiles in (2, 10, 100):
    for ndatasets in (2, 10, 100):
        results.loc[len(results)] = (nquantiles, ndatasets, bench(nquantiles, ndatasets, cp))

results

# comparable to, but typically slower than, `merge_sorted`.
#
# >>> A = np.concatenate(map(np.array, map(zip, vals, counts)))
# >>> A.sort(0, kind='mergesort')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! You can probably update or remove the comment block above. I hope it was helpful. Your solution below looks good. (I wish I could go back and look at the variations I tried and my benchmarks, but I think they are lost to time)

Sanity check: you have cytoolz installed in the benchmark environment, right? Based on your timings, I think it must be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! Looking at the comment above, should it be:

# Sort by calculated percentile values

rather than

# Sort by calculated percentile values, then number of observations.

?

And yes, I do have cytoolz installed.

@shwina
Copy link
Contributor Author

shwina commented Feb 5, 2021

Apologies for the multitude of commits here - got myself in a bit of a git mess. Should be good now.

@shwina shwina changed the title [WIP] Improve performance of merge_percentiles Improve performance of merge_percentiles Feb 5, 2021
@jakirkham
Copy link
Member

No worries we do squash merge here 🙂

@jsignell
Copy link
Member

It looks like the tests are passing other than the known mac one. There is a fix on master for that, so if you merge or rebase, the tests should all pass.

@jakirkham
Copy link
Member

Looks like Ashwin merged in master. Were there other things we were waiting on here?

@jakirkham
Copy link
Member

Taking silence here to mean no

@jakirkham jakirkham merged commit d540d73 into dask:master Feb 26, 2021
@jakirkham
Copy link
Member

Thanks Ashwin! 😄

If anything else comes up, please let us know and we can follow up in a new issue/PR as appropriate 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants