-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix correlation of series with same names #4934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks. Do you have time to push an update with a small test that tests the original problem? |
|
Sure! See c8248dd where I just duplicated the current test_corr for series correlations |
| sol = da.corr(db) | ||
| sol2 = da.corr(db, min_periods=10) | ||
| assert_eq(res, sol) | ||
| assert_eq(res2, sol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this is failing CI.
I think a simpler test will suffice here. Something basic like
result = ddf.A.corr(ddf.A)
expected = df.A.corr(df.A)
assert_eq(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. But nevertheless: any idea why it fails? I tried it locally and the values disagree by something around 0.4!
|
Oh, should it be `assert_eq(res2, sol2)` instead of `sol`?
…On Thu, Jun 13, 2019 at 12:44 PM Philipp S. Sommer ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In dask/dataframe/tests/test_dataframe.py
<#4934 (comment)>:
> @@ -2434,6 +2434,27 @@ def test_corr():
assert res3._name != res4._name
assert res._name != res3._name
+ # Series with same names (see #4906)
+ a = df.A
+ b = df.B.rename('A')
+ da = dd.from_pandas(a, npartitions=6)
+ db = dd.from_pandas(b, npartitions=7)
+
+ res = da.corr(db)
+ res2 = da.corr(db, split_every=2)
+ res3 = da.corr(db, min_periods=10)
+ res4 = da.corr(db, min_periods=10, split_every=2)
+ sol = da.corr(db)
+ sol2 = da.corr(db, min_periods=10)
+ assert_eq(res, sol)
+ assert_eq(res2, sol)
True. But nevertheless: any idea why it fails? I tried it locally and the
values disagree by something around 0.4!
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#4934?email_source=notifications&email_token=AAKAOISIV34GHZ6R2QDT3MDP2KBPRA5CNFSM4HXXQNX2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB3PSJHY#discussion_r293499886>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIR4F4UCFE3JEEZP7R3P2KBPRANCNFSM4HXXQNXQ>
.
|
|
Hmm, I'm afraid, it doesn't. And in fact, I just literally copied the original test and just changed You can also see with the further test implemented in d27b54c that it changes a lot if I rename the column and the results do not agree anymore. I can try to dig deeper into this tomorrow, but so far, I have no idea where this comes from. |
|
Alright @TomAugspurger, I found the issue and solved it in 8db2cb4. It was another access to Do you want me to reduce the number of tests that I implemented in d27b54c and c8248dd? |
|
@Chilipp, yes it would be good to reduce the length of your added tests to something minimal that would catch this error. This makes our test suite faster, and our test code easier to understand (for future maintenance). |
|
@Chilipp can you clarify #4934 (comment) a bit? Is the issue that we don't correctly handle duplicate columns on master? Or does your PR not correctly handle duplicates? |
The problem is the current implementation. It iterates over the columns and accesses it using the for idx, col in enumerate(df):
mask = df[col].notnull()The I think the important message from this issue is that you should not use the column names to access the series in a dataframe but rather the index of the column. |
|
Yep, thanks for confirming. |
closes #4906
flake8 dask