Skip to content

Add check for missing values to GroupByLabels#380

Closed
sebhofer wants to merge 8 commits intofslaborg:masterfrom
sebhofer:master
Closed

Add check for missing values to GroupByLabels#380
sebhofer wants to merge 8 commits intofslaborg:masterfrom
sebhofer:master

Conversation

@sebhofer
Copy link
Copy Markdown

@sebhofer sebhofer commented Mar 8, 2018

Throw error when trying to group by column with missing values. This leads to unexpected results (see, e.g., this issue). This is not really a fix, but grouping on missing values seemsill-defined anyway.

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Jul 27, 2018

@sebhofer I checked Pandas' behavior on similar case. It basically drops rows of the group column that has missing value first and then group by it. I would prefer that instead of throwing out an error message. I'll dig a bit to see how to accomplish this efficiently.

@sebhofer
Copy link
Copy Markdown
Author

I'm not sure I agree. If you just drop some rows and move on quietly, the user might never notice that something is off. In an interactive environment, which is, I think, where Deedle is mostly used, I rather I get an error message, and I can fix the problem myself.

@tpetricek
Copy link
Copy Markdown
Member

I think both throwing values away and throwing an exception are sensible design choices - I would have slight preference for exception, because I feel the F# style is to be a bit more explicit than in the Python style (it just means people have to do some extra work to drop before grouping, but that seems OK).

We might make that easier if we added Frame.dropSparseRowsBy "Column" which would drop rows where the value in Column is missing.

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Aug 1, 2018

@tpetricek I'm ok with throwing error if users got the message that some data are missing so that they can check data and clean it up.

@sebhofer Could you submit another commit to throw a little bit more explicit message to suggest user to check missing data in grouping columns?

@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 1, 2018

@zyzhu Sure, will do. I can also try to come up with a PR Frame.dropSparseRowsBy "Column" if that's desired. I really like the idea @tpetricek

@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 1, 2018

I had a quick look at the grouping function. As frame.GroupByLabels is used by different grouping functions (I guess the relevant ones are GroupRowsBy and GroupRowsUsing), I think it would be most clear and most helpful to the user if there was a different error for each function. Additionally, I will still keep the check in GroupByLabels to be safe. For each case, I will try to find a simple way to correct the problem and give a hint to the user in the error message.
@zyzhu please let me know if you don't like this approach.

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Aug 2, 2018

@sebhofer That makes a lot of sense to me. Thanks!

@kblohm
Copy link
Copy Markdown
Contributor

kblohm commented Aug 7, 2018

Hi,
i noticed that you already made a PR for Frame.dropSparseRowsBy, but you also add the function in this PR. Was this intended? There are also quite a lot of merge-commits. Would you mind rebasing/squashing this?

@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 7, 2018

@kblohm No, that was not intended! I am new to forking projects and creating PRs so I really messed this up yesterday. Also, I didn't realize that everything I push to the master branch gets added to the PR. I'm not quite sure how to proceed with this. I'm guessing it's best to close this and open a new PR on a dedicated branch. Any input is welcome!

@sebhofer sebhofer closed this Aug 7, 2018
@sebhofer sebhofer reopened this Aug 7, 2018
@kblohm
Copy link
Copy Markdown
Contributor

kblohm commented Aug 7, 2018

You could clean this one up, but as far as i know you can not change the source branch.
I would just close this and create a branch for your changes and then open a new PR. If you need any help with this, let me know. :)

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Aug 7, 2018

@sebhofer You could sync your fork and then create another branch. Then commit just the change of throwing exception e85244d and submit the pull.

I'll merge it after that. BTW, @kblohm just changed the build using FAKE 5. After syncing with upstream, you need to do the following to get started.

  1. Install FAKE cli
    dotnet tool install fake-cli -g
  2. To build, run the following command under Deedle folder
    fake build target allcore

@kblohm
Copy link
Copy Markdown
Contributor

kblohm commented Aug 7, 2018

You do not need to install fake globally, you can just run fake.cmd / fake.sh and that will install it just for this project. I am also using the global version, but it is not a requirement.

@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 7, 2018

@kblohm Yeah, that seems like the cleanest solution. I think I'll be able to handle it... but you never know. I'll let you know otherwise. Thanks for your help!

@zyzhu I'll try to do this tonight. On another note, did you see the issue I opened about working with the Deedle src? I'm struggling a bit with this and could use some advice. Or is this a trivial problem and I'm just missing something really obvious?

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Aug 7, 2018

@sebhofer I did see that. I am traveling for work this week. Will write up some drafts next week when I get back to office. But I will need some help from @tpetricek as he knows the original work more inside out than me as new comer too.

Thanks to @kblohm, we are on FAKE 5 already. Just follow what he wrote above and you shall be able to build the source.

As for how to work with fsx and source code. I will try to write up some sample cases next week.

@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 7, 2018

@zyzhu Perfect, thanks. I just wasn't sure if it was just me being stupid!

@sebhofer sebhofer closed this Aug 7, 2018
@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 7, 2018

@kblohm I might take you up on offer afterall :) How do I best clean this up? As I now have all these merge-commits in both master and origin/master I'm worried that by rebasing I'll make things worse... or is that not a problem?

@kblohm
Copy link
Copy Markdown
Contributor

kblohm commented Aug 7, 2018

Did you mange to fix this?
If you still need to cleanup your master branch, you can

git fetch upstream
git checkout master
git reset upstream/master --hard

This will reset all your changes, but it looks like you already have a new branch with the stuff you wanted to keep.
To get this up to your origin, you will have to force push.

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Aug 7, 2018

I've just merged the new pull.

@sebhofer
Copy link
Copy Markdown
Author

sebhofer commented Aug 7, 2018

@kblohm I did manage to create a new branch for PR, but I still didn’t know how to fix my master. So this is exactly what I needed, thx! Will try tomorrow.

@zyzhu Great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants