Refactor array.percentile and dataframe.quantile to use t-digest #4677

Dimplexion · 2019-04-08T19:27:54Z

Tests added / passed
Passes flake8 dask

Related to #1225

This PR changes the public array.percentile and dataframe.quantile methods to use t-digest when possible. Depending on the input parameters, it uses t-digest where it can and falls back to the old implementation for data types that are not integer or float, or when interpolation is not allowed.

I also noticed that array.percentile was being used for calculating divisions and figured someone else might want to be able to do that too. So I added a parameter use_tdigest that defaults to True but can be disabled to revert to the old behavior. I had to add it in a few places for it to make sense. Let me know if you think this is a good idea or if it should be changed.

I will still need to write some tests to cover cases when use_tdigest == True for good coverage before this is ready to be merged (if we will keep it).

mrocklin · 2019-04-08T19:29:06Z

cc @jcrist

jcrist

Thanks @Dimplexion for working on this. I've only left high-level api comments for now, will try to give a deeper review later.

jcrist · 2019-04-08T19:34:06Z

continuous_integration/travis/test_imports.sh

 (test_import "Delayed" "toolz" "import dask.delayed") && \
 (test_import "Bag" "toolz partd cloudpickle" "import dask.bag") && \
-(test_import "Array" "toolz numpy" "import dask.array") && \
+(test_import "Array" "toolz numpy crick" "import dask.array") && \


crick should not be a mandatory dependency for dask array - it could be optional for certain functionality (e.g. percentile), but should only be needed if requested.

Removed crick from being a mandatory dependency.

Btw is there a list of optional dependencies somewhere where I should add it?

We should add it to the install scripts on travis and appveyor, but other than that no.

There's also dask.utils.import_required, which is used in a few places to provide a nice error message when importing a dependency fails for some optional functionality. I'd probably add this to quantile and percentile so they error nicely when they're called but crick isn't installed. See e.g.

dask/dask/bag/avro.py

Lines 95 to 97 in 9f870d1

import_required('fastavro',
                "fastavro is a required dependency for using "
                "bag.read_avro().")

I added it to the install scripts and also added those import warnings.
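A minimal sketch of the pattern discussed in this thread, under stated assumptions: `require_optional` is a hypothetical stand-in that only mirrors the `import_required(name, message)` call shape quoted above (it is not dask's code), and `tdigest_backend` is an illustrative name. Because the import happens inside the function, importing the module never needs the optional package.

```python
import importlib


def require_optional(mod_name, error_msg):
    """Import ``mod_name`` or raise ImportError carrying ``error_msg``.

    Hypothetical stand-in for the helper referenced above; not dask's code.
    """
    try:
        return importlib.import_module(mod_name)
    except ImportError as exc:
        raise ImportError(error_msg) from exc


def tdigest_backend():
    # Imported lazily: the dependency is needed only when this is called.
    return require_optional(
        "crick",
        "crick is a required dependency for using the t-digest method.",
    )
```

With this shape, a missing package surfaces as a readable message at call time instead of breaking `import dask.array` for everyone.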

jcrist · 2019-04-08T19:34:26Z

dask/array/percentile.py

 from numbers import Number

 import numpy as np
+from crick import TDigest


This import should be local to functions using it to prevent crick from being a required dependency.

Moved it to be imported only where needed.

jcrist · 2019-04-08T19:38:47Z

dask/dataframe/core.py

            return result

-    def quantile(self, q=0.5, axis=0):
+    def quantile(self, q=0.5, axis=0, use_tdigest=True):


Since we may add more methods in the future, I'd prefer this to be a string. Current values would be:

method='tdigest': uses tdigest

method='dask': uses dask's custom algorithm

method='default': the default value for this kwarg just means use the default for this version of dask. We should feel free to change this if we find a new best method. For now this would map to 'dask', but could be changed later.

Implemented this change as described here. I must raise the concern, though, that defaulting to 'dask' can lead to some pretty nasty surprises. For example, in the case that brought this issue to my attention, I wanted to get the 95th percentile from data and the call ddf.quantile(0.95) was giving me a value of ~12. A week later I learned about the issues with the quantile method and ran it using t-digest, which gave me a result of ~8. The internal implementation sometimes fails massively and it can be very difficult to spot. I think it's dangerous to use it as the default without a warning. I'm not sure what would be the best way to handle it without breaking existing code, though, as t-digest can't handle all the cases the internal one can.

@jcrist what are your thoughts on what the default should be? Should we make crick mandatory for dask dataframe?

Maybe? My thought was we should add the implementations in this PR, let them sit for a bit so they can get some use, then maybe pull the switch and change the default. I think we do still want the current algorithm for partitioning logic (no external dependency, handles more dtypes), but for numerical results tdigest is likely better.

Yeah I agree we still want to use the custom algorithm for handling partitioning. I'm fine with having it as 'dask' for now, just wanted to make sure I bring that up since it was a pretty nasty surprise for me when I was starting to use Dask.
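The keyword semantics agreed on in this thread can be sketched as a small resolver. This is an illustration only, not the code merged in the PR: `resolve_method` and `NUMERIC_KINDS` are hypothetical names, the dtype is represented by its NumPy `dtype.kind` character so the sketch needs no imports, and treating unsigned integers as "ints" is an assumption.

```python
# NumPy dtype.kind codes: signed int, unsigned int, float (assumed numeric set).
NUMERIC_KINDS = {"i", "u", "f"}


def resolve_method(method="default", dtype_kind="f"):
    """Return the concrete algorithm name for a ``method`` keyword."""
    allowed = ("default", "dask", "tdigest")
    if method not in allowed:
        raise ValueError(f"method can only be one of {allowed}, got {method!r}")
    if method == "default":
        # Per the discussion, 'default' currently maps to 'dask'.
        return "dask"
    if method == "tdigest" and dtype_kind not in NUMERIC_KINDS:
        # t-digest handles only numeric data; fall back to the custom algorithm.
        return "dask"
    return method
```

Keeping 'default' as a separate value from 'dask' is what lets a later release switch the default without changing the meaning of an explicit `method='dask'`.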

jcrist · 2019-04-08T19:39:26Z

dask/dataframe/core.py


    @derived_from(pd.DataFrame)
-    def describe(self, split_every=False, percentiles=None):
+    def describe(self, split_every=False, percentiles=None, use_tdigest=True):


perhaps percentiles_method=... here?

jcrist · 2019-04-08T19:39:48Z

dask/dataframe/core.py

        return df

-    def quantile(self, q=0.5):
+    def quantile(self, q=0.5, use_tdigest=True):


perhaps method=... here?

jcrist · 2019-04-08T19:40:01Z

dask/dataframe/core.py



-def quantile(df, q):
+def quantile(df, q, use_tdigest=True):


perhaps method=... here?

jcrist · 2019-04-08T19:40:59Z

setup.py

 # you modify these, make sure to change the corresponding line there.
 extras_require = {
-  'array': ['numpy >= 1.11.0', 'toolz >= 0.7.3'],
+  'array': ['numpy >= 1.11.0', 'toolz >= 0.7.3', 'crick >= 0.0.3'],


crick should be an optional dependency for dask array, and probably an optional dependency for dask dataframe (for now).

mrocklin · 2019-04-10T14:55:42Z

It looks like tests are fine except for linting errors

(From travis-ci logs)

$ if [[ $LINT == 'true' ]]; then flake8 dask; fi
dask/dataframe/core.py:4016:17: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4020:45: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4027:17: E128 continuation line under-indented for visual indent
dask/array/percentile.py:92:17: E128 continuation line under-indented for visual indent
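For readers unfamiliar with E128, here is a small illustration of what the check flags and one accepted fix. The function and values are made up for the example; they are not the lines from core.py or percentile.py.

```python
def summarize(values, lower, upper):
    """Return (min, max) of the values inside [lower, upper]."""
    kept = [v for v in values if lower <= v <= upper]
    return min(kept), max(kept)


# Flagged as E128 (continuation indented less than the opening bracket):
#
#     result = summarize([1, 5, 9, 12],
#         lower=2,
#         upper=10)
#
# Accepted: continuation lines aligned with the first argument.
result = summarize([1, 5, 9, 12],
                   lower=2,
                   upper=10)
```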

Dimplexion · 2019-04-10T17:05:09Z

It looks like tests are fine except for linting errors

(From travis-ci logs)

$ if [[ $LINT == 'true' ]]; then flake8 dask; fi
dask/dataframe/core.py:4016:17: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4020:45: E128 continuation line under-indented for visual indent
dask/dataframe/core.py:4027:17: E128 continuation line under-indented for visual indent
dask/array/percentile.py:92:17: E128 continuation line under-indented for visual indent

Thanks for pointing that out, those should be fixed now. For some reason I'm getting this weird error when I'm trying to run flake8 locally so I'm gonna have to rely on travis for this.

The error I'm getting is this: pkg_resources.DistributionNotFound: The 'pyflakes<2.1.0,>=2.0.0' distribution was not found and is required by flake8. I'll see if I can get it fixed later.

Dimplexion · 2019-04-11T07:47:20Z

The interface is probably set now so I'll start adding the tests so that we have similar coverage with method='tdigest' as we have with method='dask'.

Dimplexion · 2019-04-15T08:15:33Z

Pushed some test cases now. Still need to add some tests for DataFrame.describe() and DataFrame.quantile().

Dimplexion · 2019-04-16T12:32:40Z

I pushed the tests for the other functions that were still missing. This PR is ready from my perspective so feel free to check it whenever you have time @jcrist. I also removed WIP from the title now.

jcrist · 2019-04-18T21:13:37Z

Thanks @Dimplexion, I'll give this a review tomorrow. Looks like this has developed merge conflicts, if you have time at some point could you resolve these?

jcrist

Thanks @Dimplexion. Overall the implementation looks good - I left a few comments about the tests and docstrings but this is pretty close to mergeable.

jcrist · 2019-04-19T19:28:11Z

dask/array/percentile.py

+    t = TDigest()
+    t.merge(*digests)
+
+    return np.array([t.quantile(q / 100.0) for q in qs])


quantile supports array arguments:

t.quantile(q / 100.0)

Good catch, didn't notice that.
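To make the per-chunk, merge, then query flow concrete, here is a toy sketch. `ToyDigest` is a hypothetical, exact, list-backed stand-in written for this example. It is not crick's `TDigest` and keeps every value rather than a compressed summary. Only the call shapes `merge(*digests)` and `quantile(q)` are taken from the diff above.

```python
import math


class ToyDigest:
    """Exact stand-in for a mergeable quantile summary (illustration only)."""

    def __init__(self, values=()):
        self.values = sorted(values)

    def merge(self, *others):
        # Combine the contents of other digests into this one.
        for other in others:
            self.values.extend(other.values)
        self.values.sort()

    def quantile(self, q):
        # Accept either one quantile or a sequence of them.
        if isinstance(q, (list, tuple)):
            return [self.quantile(x) for x in q]
        # Linear interpolation between the two nearest ranks.
        pos = q * (len(self.values) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        frac = pos - lo
        return self.values[lo] * (1 - frac) + self.values[hi] * frac


# One digest per chunk, merged into a single summary, queried once.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
digests = [ToyDigest(c) for c in chunks]
total = ToyDigest()
total.merge(*digests)
qs = [0, 50, 100]  # percentiles, converted to quantiles in [0, 1] below
result = total.quantile([q / 100.0 for q in qs])
```

Passing all quantiles in one call, as suggested in the review, avoids a Python-level loop over `qs`.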

jcrist · 2019-04-19T19:33:55Z

dask/array/tests/test_percentiles.py

 from dask.array.utils import assert_eq, same_keys


 def test_percentile():


I'd write this with pytest.mark.parametrize over the method arg:

@pytest.mark.parametrize('method', ['tdigest', 'dask'])
def test_percentile(method):
    ...

For arguments that don't support one type I'd just special case around them inside the test:

if method != 'tdigest':
    x = np.array(['a', 'a', 'd', 'd', 'd', 'e'])
    d = da.from_array(x, chunks=(3,))
    assert_eq(da.percentile(d, [0, 50, 100]),
              np.array(['a', 'd', 'e'], dtype=x.dtype))

Doing it this way ensures we have coverage across both methods and minimizes duplicating code.

That's a really nice way of doing it, thanks for pointing that out!

jcrist · 2019-04-19T19:34:25Z

dask/array/tests/test_percentiles.py

+    assert_eq(da.percentile(d, q, method='tdigest'), np.array([1], dtype=d.dtype))


 def test_unknown_chunk_sizes():


Could use pytest.mark.parametrize here.

jcrist · 2019-04-19T19:35:52Z

dask/dataframe/core.py

+
+        Note: this implementation will use t-digest for columns with floating
+              dtype if axis is set to 0 and `method` is set to `tdigest`.
+              Otherwise it falls back to the internal implementation.


Doesn't it use tdigest only if that's specified explicitly?

Also, should put this as a parameter docstring:

"""
...
method : {'default', 'tdigest', 'dask'}:
    What method to use. By default will use dask's internal custom
    algorithm (``'dask'``). If set to ``'tdigest'`` will use tdigest for
    floats and ints and fallback to the ``'dask'`` otherwise.
...
"""

Yes it does, your version is much better.

jcrist · 2019-04-19T19:36:59Z

dask/dataframe/core.py

        q : list/array of floats, default 0.5 (50%)
            Iterable of numbers ranging from 0 to 1 for the desired quantiles
+
+        Note: this implementation will use t-digest is `method` is set to `tdigest` and the


Should put this as a parameter docstring:

"""
...
method : {'default', 'tdigest', 'dask'}:
    What method to use. By default will use dask's internal custom
    algorithm (``'dask'``). If set to ``'tdigest'`` will use tdigest for
    floats and ints and fallback to the ``'dask'`` otherwise.
...
"""

jcrist · 2019-04-19T19:53:43Z

dask/dataframe/core.py

        Iterable of numbers ranging from 0 to 100 for the desired quantiles
+
+    Note: this implementation will use t-digest is `method` is set to `tdigest` and the
+          dtype of df is float. Otherwise it falls back to the internal implementation.


Should put this as a parameter docstring:

"""
...
method : {'default', 'tdigest', 'dask'}:
    What method to use. By default will use dask's internal custom
    algorithm (``'dask'``). If set to ``'tdigest'`` will use tdigest for
    floats and ints and fallback to the ``'dask'`` otherwise.
...
"""

jcrist · 2019-04-19T19:56:50Z

dask/dataframe/tests/test_dataframe.py

    assert ds.describe(split_every=2)._name != ds.describe()._name
    assert ddf.describe(split_every=2)._name != ddf.describe()._name

+    s = pd.Series(list(range(10)) * 6)


General comment for this whole file: it would be good to use pytest.mark.parametrize here to minimize duplicating code.

Dimplexion · 2019-04-25T12:51:01Z

Everything that was listed has been fixed now, including the conflict with the base branch. I also added tests for some cases that were still missing. test_quantile_for_possibly_unsorted_q is not being run with 'tdigest' yet, as it wasn't trivial to make it work the way it is now. I can look into that a bit later.

jcrist · 2019-04-26T15:43:44Z

Thanks @Dimplexion. I pushed an extra commit to make the tests run without crick installed, and also formatted a few of the docstrings. Everything looks good, will merge once tests pass.
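One common way to keep a test suite runnable without an optional dependency is to skip the affected tests when the package is missing. The sketch below uses only the standard library's `unittest`; it is an assumption-laden illustration, not the contents of the extra commit, which may well have used a different mechanism (pytest, for instance, offers `pytest.importorskip`). The class and test names are made up.

```python
import importlib.util
import unittest

# True only when the optional package can be found; nothing is imported yet.
HAS_CRICK = importlib.util.find_spec("crick") is not None


class TestTDigestPath(unittest.TestCase):
    @unittest.skipUnless(HAS_CRICK, "crick is not installed")
    def test_needs_crick(self):
        import crick  # imported only when the test actually runs
        self.assertTrue(hasattr(crick, "TDigest"))

    def test_always_runs(self):
        # Stands in for tests of the dependency-free 'dask' method.
        self.assertEqual(sorted([3, 1, 2])[1], 2)


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTDigestPath)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A skipped test is reported as skipped rather than failed, so the suite stays green on machines without crick while still exercising the t-digest path where it is available.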

jcrist · 2019-04-26T16:59:49Z

Thanks @Dimplexion!

Dimplexion · 2019-04-27T07:15:35Z

Always happy to help!

…k#4677)

* Use t-digest for arrays when possible
* implement using t-digest for DataFrame quantiles when possible
* Change dataframe/io/from_bcolz to use the old percentile implementation
* Change tests to use the old DataFrame.quantile() and Array.percentile() versions to allow them to pass
* Add crick as a dependency
* Remove crick from being a mandatory requirement
* Change "use_tdigest" parameter to a more general "method"
* Update tests to work with the new "method" parameter to quantile and percentile functions
* Fix flake8 warnings.
* Add warning when attempting to use t-digest function without crick being installed
* Add crick as an optional dependency to appveyor and travis
* Add tests for array.percentile() when using 'tdigest' method
* Add some tests for DataFrame.quantile with method 'tdigest'
* Fix styling error.
* Add tests for DataFrame.quantile when method='tdigest'
* Add tests for DataFrame.describe() when using method='tdigest'
* Change array.percentile to use list parameter for crick.quantile().
* Change to use pytest.mark.parametrize when needed
* Fix doc strings in dataframe/core.
* Refactor quantile tests in test_dataframe to use pytest.mark.parametrize
* Fixups
  - Make tests not require crick to run
  - Fix docstring formatting
  - A few other nits

Dimplexion added 6 commits April 8, 2019 22:12

Use t-digest for arrays when possible

e10be3b

implement using t-digest for DataFrame quantiles when possible

849764c

Change dataframe/io/from_bcolz to use the old percentile implementation

844a67f

Change tests to use the old DataFrame.quantile() and Array.percentile…

323727e

…() versions to allow them to pass

Add crick as a dependency

5dd5b14

Merge branch 'master' of https://github.com/dask/dask into refactor-p…

ad865c1

…ublic-quantile

jcrist reviewed Apr 8, 2019

Dimplexion and others added 3 commits April 9, 2019 09:31

Remove crick from being a mandatory requirement

433a117

Change "use_tdigest" parameter to a more general "method"

3c640e8

Update tests to work with the new "method" parameter to quantile and …

b4c38cc

…percentile functions

Dimplexion added 2 commits April 10, 2019 20:01

Fix flake8 warnings.

e68ce1f

Merge branch 'refactor-public-quantile' of https://github.com/Dimplex…

e959c2a

…ion/dask into refactor-public-quantile

Dimplexion added 2 commits April 11, 2019 10:12

Add warning when attempting to use t-digest function without crick be…

9074725

…ing installed

Add crick as an optional dependency to appveyor and travis

91fe1ee

Dimplexion added 2 commits April 12, 2019 10:37

Add tests for array.percentile() when using 'tdigest' method

3e32ccb

Add some tests for DataFrame.quantile with method 'tdigest'

f4a1a50

Dimplexion added 4 commits April 16, 2019 09:02

Fix styling error.

cd90456

Add tests for DataFrame.quantile when method='tdigest'

637e44b

Merge branch 'master' of https://github.com/dask/dask into refactor-p…

aadbc5a

…ublic-quantile

Add tests for DataFrame.describe() when using method='tdigest'

dd48c4f

Dimplexion changed the title ~~WIP: Refactor array.percentile and dataframe.quantile to use t-digest~~ Refactor array.percentile and dataframe.quantile to use t-digest Apr 16, 2019

jcrist reviewed Apr 19, 2019

Merge branch 'master' of https://github.com/dask/dask into refactor-p…

1ba033e

…ublic-quantile

Dimplexion added 5 commits April 23, 2019 21:45

Change array.percentile to use list parameter for crick.quantile().

277a25c

Change to use pytest.mark.parametrize when needed

b565516

Fix doc strings in dataframe/core.

24ff7c2

Merge branch 'master' of https://github.com/dask/dask into refactor-p…

25a252b

…ublic-quantile

Refactor quantile tests in test_dataframe to use pytest.mark.parametrize

79007a3

Fixups

c33a979

- Make tests not require crick to run
- Fix docstring formatting
- A few other nits

jcrist merged commit 87a46eb into dask:master Apr 26, 2019

Dimplexion deleted the refactor-public-quantile branch April 27, 2019 07:15

dcherian mentioned this pull request Jun 5, 2019

median on dask arrays pydata/xarray#2999

Closed

rafa-guedes mentioned this pull request Jul 22, 2019

(trivial) xarray.quantile silently resolves dask arrays pydata/xarray#1524

Closed

rabernat mentioned this pull request Sep 22, 2019

Redo Quantiles #1225

Closed

jrbourbeau mentioned this pull request Sep 24, 2019

dask.dataframe quantile fails spectacularly in some edge cases #731

Open

stsievert mentioned this pull request Nov 10, 2019

Error if using dask_ml.preprocessing RobustScaler dask/dask-ml#582

Open

rabernat mentioned this pull request Aug 27, 2020

Add axis= keyword to percentile #2824

Open
