Fix correlation of series with same names #4934

Chilipp · 2019-06-13T09:13:25Z

closes #4906

Tests added / passed
Passes flake8 dask

see #4906

TomAugspurger · 2019-06-13T11:44:11Z

Thanks. Do you have time to push an update with a small test that tests the original problem?

see #4934 (comment)

Chilipp · 2019-06-13T12:51:17Z

Sure! See c8248dd where I just duplicated the current test_corr for series correlations

TomAugspurger · 2019-06-13T17:05:36Z

dask/dataframe/tests/test_dataframe.py

+    sol = da.corr(db)
+    sol2 = da.corr(db, min_periods=10)
+    assert_eq(res, sol)
+    assert_eq(res2, sol)


Seems this is failing CI.

I think a simpler test will suffice here. Something basic like

result = ddf.A.corr(ddf.A)
expected = df.A.corr(df.A)
assert_eq(result, expected)

True. But nevertheless: any idea why it fails? I tried it locally and the values disagree by something around 0.4!

TomAugspurger · 2019-06-13T18:06:51Z

Oh, should it be `assert_eq(res2, sol2)` instead of `sol`?

On Thu, Jun 13, 2019 at 12:44 PM Philipp S. Sommer wrote:

> In dask/dataframe/tests/test_dataframe.py:
>
> @@ -2434,6 +2434,27 @@ def test_corr():
>      assert res3._name != res4._name
>      assert res._name != res3._name
> +    # Series with same names (see #4906)
> +    a = df.A
> +    b = df.B.rename('A')
> +    da = dd.from_pandas(a, npartitions=6)
> +    db = dd.from_pandas(b, npartitions=7)
> +
> +    res = da.corr(db)
> +    res2 = da.corr(db, split_every=2)
> +    res3 = da.corr(db, min_periods=10)
> +    res4 = da.corr(db, min_periods=10, split_every=2)
> +    sol = da.corr(db)
> +    sol2 = da.corr(db, min_periods=10)
> +    assert_eq(res, sol)
> +    assert_eq(res2, sol)

Chilipp · 2019-06-13T18:35:56Z

Hmm, I'm afraid it doesn't. In fact, I literally copied the original test and only changed df.B into df.B.rename('A').

The further test implemented in d27b54c also shows that renaming the column changes the result a lot, so the results no longer agree. I can try to dig deeper into this tomorrow, but so far I have no idea where this comes from.

Chilipp · 2019-06-14T10:39:05Z

Alright @TomAugspurger, I found the issue and solved it in 8db2cb4. It was another access to df[col] that fails if there are duplicated column names. It should be solved now.

Do you want me to reduce the number of tests that I implemented in d27b54c and c8248dd?

jcrist · 2019-06-17T14:16:59Z

@Chilipp, yes it would be good to reduce the length of your added tests to something minimal that would catch this error. This makes our test suite faster, and our test code easier to understand (for future maintenance).

Chilipp · 2019-06-17T14:37:30Z

> yes it would be good to reduce the length of your added tests

Alright @jcrist, thanks for the feedback! This has been implemented in 40cf5cb as a separate, shorter test function, test_corr_same_name.

TomAugspurger · 2019-06-17T20:12:43Z

@Chilipp can you clarify #4934 (comment) a bit? Is the issue that we don't correctly handle duplicate columns on master? Or does your PR not correctly handle duplicates?

Chilipp · 2019-06-17T21:24:42Z

> Is the issue that we don't correctly handle duplicate columns on master? Or does your PR not correctly handle duplicates?

The problem is the current implementation. It iterates over the columns and accesses each one by name through the __getitem__ method:

for idx, col in enumerate(df):
    mask = df[col].notnull()

However, col appears twice in the dataframe columns, so df[col] returns both columns and mask is an array of shape (N, 2) instead of (N,). I did not see this immediately because, in contrast to what was fixed by 453162c, it did not raise an error but silently changed the result. Additionally, and I am not entirely sure why, it only produced wrong results when split_every was set. Nevertheless, I have fixed it now, and the tests implemented in 06dc7cc show that it works.
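
To make the shape problem concrete, here is a small pandas-only sketch. It is not the dask code from this PR; the series values and the variable names `mask_by_label` and `mask_by_position` are made up for illustration. It only shows how label-based access behaves when two columns share a name.

```python
import numpy as np
import pandas as pd

# Two series that share the name "A", placed side by side.
# The values are arbitrary illustration data.
a = pd.Series([1.0, 2.0, np.nan, 4.0], name="A")
b = pd.Series([4.0, np.nan, 2.0, 1.0], name="A")
df = pd.concat([a, b], axis=1)  # columns are ["A", "A"]

col = df.columns[0]

# Label-based access matches every column named "A", so this is a DataFrame.
mask_by_label = df[col].notnull()

# Position-based access selects exactly one column, so this is a Series.
mask_by_position = df.iloc[:, 0].notnull()

print(type(mask_by_label).__name__, mask_by_label.shape)
print(type(mask_by_position).__name__, mask_by_position.shape)
```

Nothing raises here, which matches the observation above: the two-dimensional mask simply flows into the later arithmetic.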

I think the important message from this issue is that you should not use column names to access the series in a dataframe, but rather the positional index of each column.
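
As a rough illustration of that pattern, the sketch below iterates over column positions instead of labels. The helper `pairwise_complete_counts` is hypothetical and is not part of dask or of this PR; it just counts, for each pair of columns, the rows where both values are present, which is the kind of per-column mask bookkeeping discussed above.

```python
import numpy as np
import pandas as pd


def pairwise_complete_counts(df):
    """Hypothetical helper: count rows where both columns of a pair are non-null."""
    ncols = df.shape[1]
    # Select by position; df[col] would return every column sharing that label.
    masks = [df.iloc[:, idx].notnull().to_numpy() for idx in range(ncols)]
    counts = np.zeros((ncols, ncols), dtype=int)
    for i in range(ncols):
        for j in range(ncols):
            counts[i, j] = (masks[i] & masks[j]).sum()
    return counts


# Arbitrary illustration data with duplicated column names.
a = pd.Series([1.0, 2.0, np.nan, 4.0], name="A")
b = pd.Series([4.0, np.nan, 2.0, 1.0], name="A")
df = pd.concat([a, b], axis=1)  # columns are ["A", "A"]

counts = pairwise_complete_counts(df)
print(counts)
```

Because every mask is one-dimensional, the result has one row and one column per dataframe column, whether or not the labels are unique.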

TomAugspurger · 2019-06-17T21:29:27Z

Yep, thanks for confirming.

Fix correlation of series with same names

453162c

see #4906

Chilipp mentioned this pull request Jun 13, 2019

dask.dataframe.core.Series.corr fails when other Series has the same name #4906

Closed

Chilipp added 2 commits June 13, 2019 14:39

Added test for corr with duplicated series names

c8248dd

see #4934 (comment)

Removed duplicated test

20590c4

TomAugspurger reviewed Jun 13, 2019

Added further test for renaming

d27b54c

Fixed another df[col] issue in cov_corr_chunk

8db2cb4

Reduce tests for series with same names

40cf5cb

jcrist reviewed Jun 17, 2019

typo

06dc7cc

TomAugspurger merged commit bfc0e6b into dask:master Jun 17, 2019

Chilipp deleted the patch-1 branch June 17, 2019 21:30

mathause mentioned this pull request Feb 23, 2021

Upstream CI failing silently pydata/xarray#4945

Closed

3 participants