Groupby Covariance/Correlation #4889

quasiben · 2019-06-05T21:12:32Z

Tests added / passed
Passes flake8 dask

This PR will close #4828. The algorithm follows a similar path to variance: in each block we calculate means, sums, and counts, then aggregate all the results at the end to compute

cov(x, y) = (sum(x_i * y_i) - n * x_mu * y_mu) / (n - 1)
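The block-wise approach described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the helper names (`chunk_stats`, `combine`, `finalize`) are hypothetical, and the final step assumes the standard sample-covariance identity with `ddof=1`.

```python
from collections import defaultdict


def chunk_stats(rows):
    """Per-block step: accumulate count, sum(x), sum(y), sum(x*y) per group.

    `rows` is an iterable of (group_key, x, y) tuples (a stand-in for one
    partition of a dataframe).
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for key, x, y in rows:
        s = stats[key]
        s[0] += 1
        s[1] += x
        s[2] += y
        s[3] += x * y
    return dict(stats)


def combine(partials):
    """Aggregate step: add the per-block statistics together, group by group."""
    total = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for part in partials:
        for key, s in part.items():
            t = total[key]
            for i in range(4):
                t[i] += s[i]
    return dict(total)


def finalize(total, ddof=1):
    """Final step: cov = (sum(x*y) - sum(x) * sum(y) / n) / (n - ddof)."""
    out = {}
    for key, (n, sx, sy, sxy) in total.items():
        if n > ddof:
            out[key] = (sxy - sx * sy / n) / (n - ddof)
        else:
            out[key] = float("nan")  # not enough observations in this group
    return out


# Two "blocks" of (group, x, y) rows, processed independently and then merged.
blocks = [
    [("a", 1.0, 2.0), ("b", 1.0, 1.0)],
    [("a", 2.0, 4.5), ("a", 4.0, 7.0), ("b", 3.0, 0.0)],
]
result = finalize(combine([chunk_stats(block) for block in blocks]))
```

Only four numbers per group cross block boundaries, which is what makes the reduction cheap to parallelize. The sum-of-products form can lose precision when the means are large relative to the spread, so a production implementation may prefer a more stable update.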

dask/dataframe/tests/test_groupby.py

mrocklin

Thanks @quasiben !

It looks like you've also run black on a couple files. To aid review, can I ask you to not include the style changes here? There are a couple reasons for this:

1. It's hard to see what is substantive and what is a style change, which makes reviewing hard.
2. Style changes are going to screw with git blame. I'd rather that we do this once in a mega-PR rather than file by file attached to various commits.

I'm all in favor of running black on this codebase, but that should probably be a separate effort.

dask/dataframe/groupby.py

quasiben · 2019-06-06T01:48:40Z

I thought dask had black applied to it recently -- perhaps that was dask-cudf? Apologies.

mrocklin · 2019-06-06T02:10:02Z

We're rolling it out across the repositories. It hasn't hit dask/dask yet.
#4319

mrocklin

Some tiny comments, but in general things here look really clean to me.

dask/dataframe/groupby.py

mrocklin · 2019-06-06T02:12:33Z

dask/dataframe/groupby.py

+            split_out_setup=split_out_on_index,
+        )
+
+        if isinstance(self.obj, Series):


We might want to use is_series_like here, though it's not critical at this point

I'm curious. It looks like everything here would also work with cudf. Is that the case? I would be surprised if so, but pleasantly surprised :)
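To illustrate why a duck-typed check helps with non-pandas backends such as cudf, here is a minimal sketch. The function `looks_series_like` and the `FakeSeries`/`FakeFrame` classes are hypothetical stand-ins written for this example; they are not dask's `is_series_like` implementation, and the attribute list is an assumption.

```python
def looks_series_like(obj):
    """Duck-typed check: does `obj` expose a Series-style interface?

    Unlike isinstance(obj, Series), this does not require `obj` to be a
    pandas object, so a Series from another library can pass as well.
    """
    has_interface = all(hasattr(obj, name) for name in ("name", "dtype", "groupby"))
    # Index objects also have `name` and `dtype`, so exclude them by type name.
    return has_interface and "index" not in type(obj).__name__.lower()


class FakeSeries:
    """Stand-in for a Series from some non-pandas dataframe library."""
    name = "x"
    dtype = "float64"

    def groupby(self, by):
        raise NotImplementedError


class FakeFrame:
    """Stand-in for a DataFrame: has `dtypes` and `columns`, but no `name`/`dtype`."""
    columns = ["x", "y"]
    dtypes = {"x": "float64", "y": "float64"}

    def groupby(self, by):
        raise NotImplementedError
```

With a check like this, the branch in the diff above would depend on what the object can do rather than on which library it comes from.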

dask/dataframe/groupby.py

quasiben · 2019-06-07T20:13:39Z

dask/dataframe/groupby.py

+
+        if self._slice:
+            sliced_plus = self._slice + list(self.index)
+            self.obj = self.obj[sliced_plus]


@TomAugspurger this looks wrong to me -- I am trying to handle the following case:

df.groupby(df['a'])[['b', 'c']]

aka

dask/dataframe/tests/test_groupby.py::test_groupby_multilevel_getitem[cov-3]

Seems ok. May want to ensure _slice is a list before doing +. Not sure, but that may break with

.groupby('a')['b'].cov()

and

.groupby('a')[pd.Index(['a', 'b'])].cov()

(not sure if the second is valid).
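The concern about `+` can be made concrete with a small sketch. The helpers `slice_as_list` and `columns_to_keep` are hypothetical names for this example, not functions from the PR, and the de-duplication step is an added assumption rather than something the diff does.

```python
def slice_as_list(key):
    """Return a column selection as a plain list.

    A single label (for example 'b') becomes ['b']; any other iterable
    (list, tuple, Index-like) is copied into a list. Treating tuples as
    several labels is an assumption; a tuple can also be one
    MultiIndex column label.
    """
    if isinstance(key, (str, bytes)) or not hasattr(key, "__iter__"):
        return [key]
    return list(key)


def columns_to_keep(selection, group_keys):
    """Selected columns followed by the grouping columns, without repeats."""
    cols = slice_as_list(selection)
    for key in slice_as_list(group_keys):
        if key not in cols:
            cols.append(key)
    return cols
```

Without the normalization, `'b' + ['a']` raises a `TypeError`, and adding a list to an Index-like object may do element-wise arithmetic instead of concatenation.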

Is there a better way of doing this? The variance calculations don't

This is also not going to support an index which is a mask. This currently fails in the test_dataframe_aggregations_multilevel test with lambda df: [df['a'] > 2, df['b'] > 1]:

KeyError('[(True, True) (True, True)] not found in axis')

Should cov handle this case?

quasiben · 2019-06-11T02:37:28Z

All tests are passing now; if someone has time, this could use another review.

TomAugspurger

Looks quite nice overall, thanks.

Can you also add this to the dataframe-api.rst page?

dask/dataframe/groupby.py

dask/dataframe/tests/test_groupby.py

TomAugspurger

I think this looks nice @quasiben. Thanks for working on it.

dask/dataframe/groupby.py

TomAugspurger · 2019-06-12T16:10:19Z

Haven't looked too closely at the CI failures, but if just Python < 3.5 fail with https://travis-ci.org/dask/dask/jobs/544800310#L1826, then you may have a dict ordering issue. It could be a dict created internally, or passed to the DataFrame constructor (or something else entirely)
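One way to rule out this kind of failure is to never rely on dict iteration order when building columns. The sketch below is an illustration only; `ordered_columns` is a hypothetical helper, not code from this PR.

```python
from collections import OrderedDict


def ordered_columns(data, order):
    """Return the columns of `data` in exactly the order given by `order`.

    Plain dicts only guarantee insertion order from Python 3.7 onward, so
    on older interpreters the column order must be pinned explicitly.
    """
    missing = [name for name in order if name not in data]
    if missing:
        raise KeyError("missing columns: {}".format(missing))
    return OrderedDict((name, data[name]) for name in order)


# The input dict may iterate in any order; the output order is fixed.
cols = ordered_columns({"c": [3], "a": [1], "b": [2]}, ["a", "b", "c"])
```

In tests, comparing against an explicit column list (or sorting both sides) avoids results that differ between interpreter versions.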

quasiben · 2019-06-12T16:24:54Z

Thanks @TomAugspurger! I'll build a Python 3.5 env and debug.

TomAugspurger · 2019-06-13T02:15:24Z

All green @quasiben. Do you think this is ready to go?

quasiben · 2019-06-13T02:27:23Z

I think so, yeah. I haven't tested with cudf, and I probably won't get to it for another week or so. When I do, I can make incremental changes.

quasiben · 2019-06-13T14:35:09Z

@TomAugspurger @mrocklin ok to merge ?

TomAugspurger · 2019-06-13T14:38:03Z

I think so, thanks!

mrocklin · 2019-06-13T14:51:27Z

Nice work @quasiben !

initial add of groupby covariance with test

8038bf8

quasiben commented Jun 5, 2019

mrocklin reviewed Jun 5, 2019

quasiben added 3 commits June 5, 2019 21:41

de-black things

ad70557

using dispatched concat

e627370

clean up breakpoint, increase testing, generalize dataframe creation

f0a3348

mrocklin reviewed Jun 6, 2019

quasiben commented Jun 6, 2019

handle multicolumn groupby and selections

40aba4d

quasiben commented Jun 7, 2019

quasiben added 4 commits June 7, 2019 16:30

ensure slice is a list

0f80f7d

cov is not a series aggregation

873b326

handle combiner in aggregation and better test coverage

7d69567

handle boolean masks and lint

2891422

quasiben changed the title from "[WIP] Groupby Covariance/Correlation" to "Groupby Covariance/Correlation" Jun 10, 2019

list-ify things for all pandas version and fix test

82bf617

TomAugspurger reviewed Jun 11, 2019

quasiben added 6 commits June 11, 2019 09:28

better variable names and docstrings

44ae048

handle arbitrary column names and add tests to reflect this change

ab3ae45

remove random seed

5404928

update docs

46b3073

Merge branch 'master' of github.com:dask/dask into feat/cov

116c841

lint

b2140e0

TomAugspurger reviewed Jun 12, 2019

quasiben added 2 commits June 12, 2019 09:35

wrap docstring lines properly

bcc803f

handle dict_keys error

e21e0a1

resolving column ordering test failures

2788287

TomAugspurger merged commit 985cdf2 into dask:master Jun 13, 2019

quasiben mentioned this pull request Aug 19, 2019

add correlation calculation and add test #5296

Merged

2 tasks
