-
Notifications
You must be signed in to change notification settings - Fork 1k
Fix bug when joining on columns of different types #2734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… types since I can't imagine an example.
|
The decreased code coverage is due to a section, where I implement the behaviour for columns where all of the following return FALSE:
If option 1 turns out to be infeasible, I would opt for 2. |
|
Looks great and much more robust. New tests are extensive!
where y = yes types match as expected normally, e = error, w = no coerce but warn, cw = coerce and warn
|
|
Great, I like your suggested behaviour very much!
|
|
I have refactored the code according to @mattdowle 's proposals with 2 exceptions:
|
R/bmerge.R
Outdated
| ## where newclass is the class on the right of == | ||
| typeCastCoercionX = c("integer64==realDouble") | ||
|
|
||
| ## Establish a lookup table with coercion strategies for each pair of types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very nice but maybe we can move that to proper Rd? (doesn't have to be now, can be separate PR)
We can export coerceClass and put that to man.
jangorecki
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice reorganization of code.
None of existing tests were altered and many new tests added, coverage increase, very nice.
I did not do that so far since this would affect > 100 existing unit tests, including cases, where internal data.table code would cause warnings. So, before doing so, I would propose to clean the code and tests for unnecessary type mismatches and then implementing the warnings, which is as simple as replacing ci and cx with ciw and cxw in the type conversion matrix respectively (see table under 1.).
Such significant change would need to be implemented with option (by default disabled) anyway, which could be eventually added now in this PR: if (getOption("datatable.warntypemismatch", FALSE)) "ciw" else "ci", but let's wait for Matt's comment.
R/bmerge.R
Outdated
| } | ||
| else { | ||
| ## we need as.'to'() conversion | ||
| converter <- match.fun(paste0("as.", to)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are just few choices here, why not just few if (...) as.x else if (...) as.y else if ...? making such converted make perfect sense if we want to support arbitrary coercion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to be ready for other as.xxx conversion, but if you want, I can change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any "other" on roadmap? I am not aware of any. Still it will need to be added to matrix anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No there isn't and you are completely riht about the matrix. If you want me to change the implementation of coerceClass, please indicate briefly, I will then do it.
|
Should the type conversion matrix be symmetrical* (assuming Currently it doesn't, for example there's |
|
You are right, @st-pasha, I also think that this would be the ideal behavior. However, this would alter existing behavior beyond fixing bugs IMHO. That is why I didn't implement it so far. If there is an agreement to change the behavior to be symmetrical as suggested by you, I am more than willing to implement. Note, however, that very likely existing tests have to be updated. |
|
My suggestion is to add warnings and make matrix symmetric in this PR but as option. This is big breaking change and it might need to wait for more than 1 release cycle to be changed to TRUE by default. We need to release soon to CRAN, so putting this as an option gives us much more time to check revdeps and inform maintainers. This we can do already after release to CRAN so maintainers can smoothly migrate by using option on recent CRAN build. Later before future releases we can check how many packages adapted to new rules and depending on that change default at some point. @mattdowle what do you think? |
|
While I like the idea of using this PR to consolidate type conversion behaviour in joins in general, we should remember that its primary goal is to fix two pretty terrible bugs in master: So maybe, if the decision on further changes needs more discussion, it can be seperated and this PR merged as a bugfix. |
|
@mattd As you are going to look at |
|
@jangorecki I am not sure, what you are hinting at with |
|
@MarkusBonsch I mean separation of code as in fread and fwrite, where |
|
Thank you so much for taking care of |
|
|
Closes #2592
Main purpose
The old behaviour of joining could lead to severe bugs when joins with different column types were involved (see #2592). This has been fixed, namely:
Changes
bmerge.Rhas been overhauled according to the suggestions in the second comment of joins with type coercion: 1.5 == 1 is TRUE #2592.xo(the order in thexdata.table) into bmerge after all type conversions are over. Otherwise,xowould not reflect the latest state after all modifications and problems similar to [bug] %in% statement fails if the category contains both lowercase and uppercase letters #2881 could occur.Tests
Others
freadinternally. This would need to be cleaned before.xandi(see his post below). While I agree, this is IMHO infeasible for now, since it would introduce breaking changes.data.tableoptions myself. So I didn't implement so far.