-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Refactor array.percentile and dataframe.quantile to use t-digest #4677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…() versions to allow them to pass
|
cc @jcrist |
jcrist
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Dimplexion for working on this. I've only left high-level api comments for now, will try to give a deeper review later.
| (test_import "Delayed" "toolz" "import dask.delayed") && \ | ||
| (test_import "Bag" "toolz partd cloudpickle" "import dask.bag") && \ | ||
| (test_import "Array" "toolz numpy" "import dask.array") && \ | ||
| (test_import "Array" "toolz numpy crick" "import dask.array") && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crick should not be a mandatory dependency for dask array - it could be optional for certain functionality (e.g. percentile), but should only be needed if requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed crick from being a mandatory dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw is there a list of optional dependencies somewhere where I should add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add it to the install scripts on travis and appveyor, but other than that no.
There's also dask.utils.import_required which is used in a few places to provide a nice error message when importing a dependency fails for some optional functionality. I'd probably add this to quantile and percentile to error nicely when they're called by crick isn't installed. See e.g.
Lines 95 to 97 in 9f870d1
| import_required('fastavro', | |
| "fastavro is a required dependency for using " | |
| "bag.read_avro().") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it to the install scripts and also added those import warnings.
dask/array/percentile.py
Outdated
| from numbers import Number | ||
|
|
||
| import numpy as np | ||
| from crick import TDigest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import should be local to functions using it to prevent crick from being a required dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it to be imported only where needed.
dask/dataframe/core.py
Outdated
| return result | ||
|
|
||
| def quantile(self, q=0.5, axis=0): | ||
| def quantile(self, q=0.5, axis=0, use_tdigest=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we may add more methods in the future, I'd prefer this to be a string. Current values would be:
- `method='tdigest': uses tdigest
- `method='dask': uses dask's custom algorithm
method='default': the default value for this kwarg just means use the default for this version of dask. We should feel free to change this if we find a new best method. For now this would map to'dask', but could be changed later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this change as described here. Although I must raise the concern here that defaulting to 'dask' can lead to some pretty nasty surprises. For example in the case that brought this issue to my knowledge I wanted to get the 95th percentile from data and the call ddf.quantile(0.95) was giving me a value of ~12. A week later I learned about the issues with the quantile method and ran it using t-digest which game me a result of ~8. The internal implementation sometimes fails massively and it can be very difficult to spot. I think it's dangerous to use it as the default without a warning. I'm not sure what would be the best way to handle it without breaking existing code, though, as t-digest can't handle all the cases the internal one can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jcrist what are your thoughts on what the default should be? Should we make crick mandatory for dask dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? My thought was we should add the implementation's in this PR, let them sit for a bit so they can get some use, then maybe pull the switch and change the default. I think we do still want the current algorithm for partitioning logic (no external dependency, handles more dtypes), but for numerical results tdigest is likely better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree we still want to use the custom algorithm for handling partitioning. I'm fine with having it as 'dask' for now, just wanted to make sure I bring that up since it was a pretty nasty surprise for me when I was starting to use Dask.
dask/dataframe/core.py
Outdated
|
|
||
| @derived_from(pd.DataFrame) | ||
| def describe(self, split_every=False, percentiles=None): | ||
| def describe(self, split_every=False, percentiles=None, use_tdigest=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps percentiles_method=... here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dask/dataframe/core.py
Outdated
| return df | ||
|
|
||
| def quantile(self, q=0.5): | ||
| def quantile(self, q=0.5, use_tdigest=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps method=... here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dask/dataframe/core.py
Outdated
|
|
||
|
|
||
| def quantile(df, q): | ||
| def quantile(df, q, use_tdigest=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps method=... here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
setup.py
Outdated
| # you modify these, make sure to change the corresponding line there. | ||
| extras_require = { | ||
| 'array': ['numpy >= 1.11.0', 'toolz >= 0.7.3'], | ||
| 'array': ['numpy >= 1.11.0', 'toolz >= 0.7.3', 'crick >= 0.0.3'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crick should be an optional dependency for dask array, and probably an optional dependency for dask dataframe (for now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
It looks like tests are fine except for linting errors (From travis-ci logs) |
Thanks for pointing that out, those should be fixed now. For some reason I'm getting this weird error when I'm trying to run flake8 locally so I'm gonna have to rely on travis for this. The error I'm getting is this: |
|
The interface is probably set now so I'll start adding the tests so that we have similar coverage with |
|
Pushed some test cases now. Still need to add some tests for |
|
I pushed the tests for the other functions that were still missing. This PR is ready from my perspective so feel free to check it whenever you have time @jcrist. I also removed WIP from the title now. |
|
Thanks @Dimplexion, I'll give this a review tomorrow. Looks like this has developed merge conflicts, if you have time at some point could you resolve these? |
jcrist
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Dimplexion. Overall the implementation looks good - I left a few comments about the tests and docstrings but this is pretty close to mergeable.
dask/array/percentile.py
Outdated
| t = TDigest() | ||
| t.merge(*digests) | ||
|
|
||
| return np.array([t.quantile(q / 100.0) for q in qs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quantile supports array arguments:
t.quantile(q / 100.0)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, didn't notice that.
dask/array/tests/test_percentiles.py
Outdated
| from dask.array.utils import assert_eq, same_keys | ||
|
|
||
|
|
||
| def test_percentile(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd write this with pytest.mark.parametrize over the method arg:
@pytest.mark.parametrize('method', ['tdigest', 'dask'])
def test_percentile(method):
...For arguments that don't support one type I'd just special case around them inside the test:
if method != 'tdigest':
x = np.array(['a', 'a', 'd', 'd', 'd', 'e'])
d = da.from_array(x, chunks=(3,))
assert_eq(da.percentile(d, [0, 50, 100]),
np.array(['a', 'd', 'e'], dtype=x.dtype))Doing it this way ensures we have coverage across both methods and minimizes duplicating code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a really nice way of doing it, thanks for pointing that out!
dask/array/tests/test_percentiles.py
Outdated
| assert_eq(da.percentile(d, q, method='tdigest'), np.array([1], dtype=d.dtype)) | ||
|
|
||
|
|
||
| def test_unknown_chunk_sizes(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use pytest.mark.parametrize here.
dask/dataframe/core.py
Outdated
| Note: this implementation will use t-digest for columns with floating | ||
| dtype if axis is set to 0 and `method` is set to `tdigest`. | ||
| Otherwise it falls back to the internal implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't it use tdigest only if that's specified explicitly?
Also, should put this as a parameter docstring:
"""
...
method : {'default', 'tdigest', 'dask'}:
What method to use. By default will use dask's internal custom algorithm (``'dask'``).
If set to ``'tdigest'`` will use tdigest for floats and ints and fallback to the ``'dask'``
otherwise.
...
"""There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does, your version is much better.
dask/dataframe/core.py
Outdated
| q : list/array of floats, default 0.5 (50%) | ||
| Iterable of numbers ranging from 0 to 1 for the desired quantiles | ||
| Note: this implementation will use t-digest is `method` is set to `tdigest` and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should put this as a parameter docstring:
"""
...
method : {'default', 'tdigest', 'dask'}:
What method to use. By default will use dask's internal custom algorithm (``'dask'``).
If set to ``'tdigest'`` will use tdigest for floats and ints and fallback to the ``'dask'``
otherwise.
...
"""
dask/dataframe/core.py
Outdated
| Iterable of numbers ranging from 0 to 100 for the desired quantiles | ||
| Note: this implementation will use t-digest is `method` is set to `tdigest` and the | ||
| dtype of df is float. Otherwise it falls back to the internal implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should put this as a parameter docstring:
"""
...
method : {'default', 'tdigest', 'dask'}:
What method to use. By default will use dask's internal custom algorithm (``'dask'``).
If set to ``'tdigest'`` will use tdigest for floats and ints and fallback to the ``'dask'``
otherwise.
...
"""| assert ds.describe(split_every=2)._name != ds.describe()._name | ||
| assert ddf.describe(split_every=2)._name != ddf.describe()._name | ||
|
|
||
| s = pd.Series(list(range(10)) * 6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment for this whole file: it would be good to use pytest.mark.parametrize here to minimize duplicating code.
|
Everything that was listed has be fixed now, including the conflict with the base branch. I also added testing on some cases that were still missing previously. |
|
Thanks @Dimplexion. I pushed an extra commit to make the tests run without |
|
Thanks @Dimplexion! |
|
Always happy to help! |
…k#4677) * Use t-digest for arrays when possible * implement using t-digest for DataFrame quantiles when possible * Change dataframe/io/from_bcolz to use the old percentile implementation * Change tests to use the old DataFrame.quantile() and Array.percentile() versions to allow them to pass * Add crick as a dependency * Remove crick from being a mandatory requirement * Change "use_tdigest" parameter to a more general "method" * Update tests to work with the new "method" parameter to quantile and percentile functions * Fix flake8 warnings. * Add warning when attempting to use t-digest function without crick being installed * Add crick as an optional dependency to appveyor and travis * Add tests for array.percentile() when using 'tdigest' method * Add some tests for DataFrame.quantile with method 'tdigest' * Fix styling error. * Add tests for DataFrame.quantile when method='tdigest' * Add tests for DataFrame.describe() when using method='tdigest' * Change array.percentile to use list parameter for crick.quantile(). * Change to use pytest.mark.parametrize when needed * Fix doc strings in dataframe/core. * Refactor quantile tests in test_dataframe to use pytest.mark.parametrize * Fixups - Make tests not require crick to run - Fix docstring formatting - A few other nits
flake8 daskRelated to #1225
This PR changes the public
array.percentileanddataframe.quantilemethods to use t-digest when possible. Meaning that depending on the input parameters it uses t-digest when it's possible and falls back to the old implementation for data types that are not integer or float and if interpolation is not allowed.I also noticed that
array.percentilewas being used for calculating divisions and figured someone else might want to be able to do that also. So I added a parameteruse_tdigestthat defaults toTruebut can be disabled to revert back to the old behavior. I had to add it to a few places for it to make it sense, let me know if you think this is a good idea or if it should be changed.I will still need to write some tests to cover cases when
use_tdigest == Truefor good coverage before this is ready to be merged (if we will keep it).