Skip to content

Conversation

@Chilipp
Copy link
Contributor

@Chilipp Chilipp commented Jun 13, 2019

closes #4906

  • Tests added / passed
  • Passes flake8 dask

@TomAugspurger
Copy link
Member

Thanks. Do you have time to push an update with a small test that tests the original problem?

@Chilipp
Copy link
Contributor Author

Chilipp commented Jun 13, 2019

Sure! See c8248dd where I just duplicated the current test_corr for series correlations

sol = da.corr(db)
sol2 = da.corr(db, min_periods=10)
assert_eq(res, sol)
assert_eq(res2, sol)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this is failing CI.

I think a simpler test will suffice here. Something basic like

result = ddf.A.corr(ddf.A)
expected = df.A.corr(df.A)
assert_eq(result, expected)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. But nevertheless: any idea why it fails? I tried it locally and the values disagree by something around 0.4!

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 13, 2019 via email

@Chilipp
Copy link
Contributor Author

Chilipp commented Jun 13, 2019

Hmm, I'm afraid, it doesn't. And in fact, I just literally copied the original test and just changed df.B into df.B.rename('A').

You can also see with the further test implemented in d27b54c that it changes a lot if I rename the column and the results do not agree anymore. I can try to dig deeper into this tomorrow, but so far, I have no idea where this comes from.

@Chilipp
Copy link
Contributor Author

Chilipp commented Jun 14, 2019

Alright @TomAugspurger, I found the issue and solved it in 8db2cb4. It was another access to df[col] that fails if there are duplicated column names. It should be solved now.

Do you want me to reduce the number of tests that I implemented in d27b54c and c8248dd?

@jcrist
Copy link
Member

jcrist commented Jun 17, 2019

@Chilipp, yes it would be good to reduce the length of your added tests to something minimal that would catch this error. This makes our test suite faster, and our test code easier to understand (for future maintenance).

@Chilipp
Copy link
Contributor Author

Chilipp commented Jun 17, 2019

yes it would be good to reduce the length of your added tests

alright @jcrist, thanks for the feedback! This has been implemented in 40cf5cb as a separate but shorter test_corr_same_name test function

@TomAugspurger
Copy link
Member

@Chilipp can you clarify #4934 (comment) a bit? Is the issue that we don't correctly handle duplicate columns on master? Or does your PR not correctly handle duplicates?

@Chilipp
Copy link
Contributor Author

Chilipp commented Jun 17, 2019

Is the issue that we don't correctly handle duplicate columns on master? Or does your PR not correctly handle duplicates?

The problem is the current implementation. It iterates over the columns and accesses it using the __getitem__ method:

for idx, col in enumerate(df):
    mask = df[col].notnull()

The col however appears twice in the dataframe columns, and therefore mask is an array of shape (N, 2). I did not see this immediately, because, in contrast to what has been fixed by 453162c, it did not raise an error but just changed the result. Additionally, and I am not entirely sure why, it did only provide wrong results if split_every is set. Nevertheless, I fixed it now and the tests implemented in 06dc7cc show that it works.

I think the important message from this issue is that you should not use the column names to access the series in a dataframe but rather the index of the column.

@TomAugspurger
Copy link
Member

Yep, thanks for confirming.

@TomAugspurger TomAugspurger merged commit bfc0e6b into dask:master Jun 17, 2019
@Chilipp Chilipp deleted the patch-1 branch June 17, 2019 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dask.dataframe.core.Series.corr fails when other Series has the same name

3 participants