Skip to content

Conversation

@MarkusBonsch
Copy link
Contributor

@MarkusBonsch MarkusBonsch commented Apr 10, 2018

Closes #2592

Main purpose
The old behaviour of joining could lead to severe bugs when joins with different column types were involved (see #2592). This has been fixed, namely:

  • Joining on numeric to integer doesn't lead to weird matches of 1 == 1.5 anymore.
  • Joining on factor to integer throws an error now and doesn't confusingly join the integer to the factor level number anymore.

Changes

Tests

  • All existing tests are passing without adaptation.
  • 106 new tests have been added to check the type conversion routine extensively.

Others

  • Originally, the idea was to add warnings on type conversions during joins. Unfortunately this is infeasible for now since it would make several hundred tests fail. For example, such conversion happens in fread internally. This would need to be cleaned before.
  • @st-pasha suggested that type conversion should be symmetric between x and i (see his post below). While I agree, this is IMHO infeasible for now, since it would introduce breaking changes.
  • @jangorecki suggested to implement warnings and symmetric type conversion as an option. I am not sure, what the benefit would be since I don't work with data.table options myself. So I didn't implement so far.

@MarkusBonsch
Copy link
Contributor Author

MarkusBonsch commented Apr 10, 2018

The decreased code coverage is due to a section, where I implement the behaviour for columns where all of the following return FALSE: is.logical, is.integer, base::is.double, is.character, is.factor., is.integer64
I haven't managed to come up with an example that satisfies all of these. I tried raw or complex vectors, but these fail earlier in forder and can't be used to cover this section. There are three options from my perspective:

  1. Find a test column that satisfies the above-mentioned criteria.
  2. Remove the code that deals with this case. If in reality the case occurs, it will raise an error and can be fixed. If we can't come up with an example for the tests, this is very unlikely to occur.
  3. Live with the decreased coverage, leave the code in and don't test it.

If option 1 turns out to be infeasible, I would opt for 2.

@mattdowle
Copy link
Member

mattdowle commented Apr 12, 2018

Looks great and much more robust. New tests are extensive!

  1. How about defining a data matrix in the code, something like this :
i type down rows
x type along cols
                1     2     3     4   ...
1. logical      y     w     w     w
2. integer      e     y     w
3. intAsReal    e     w     w
4. reallyReal   e    cw           y
5. ...

where y = yes types match as expected normally, e = error, w = no coerce but warn, cw = coerce and warn
Then the code would look up the behaviour from the matrix and do that action code. That would be easier to read, debug and maintain; e.g. adding new types in future. For example, in writing this comment I looked at position [3,3] and changed it from "y" to "w". Since, although the types match in that case we'd want to suggest to the user they change both sides from double to integer for space and time efficiency.

  1. For the "other" type, maybe that should just be error? What happens now if either or both typeof()=="complex" for example? I'd expect/hope error. That error could be tested and then the coverage would be 100%.

  2. Needs a new item in NEWS.md bug fix section please.

@MarkusBonsch
Copy link
Contributor Author

MarkusBonsch commented Apr 12, 2018

Great, I like your suggested behaviour very much!

  • Concerning 1: I will update so that the matrix is used for lookup. For the coerce case, we need to specify if i or x gets coerced, so I will introduce ci (coerce i so that it matches x) and cx (coerce x so that it matches i).
  • Concerning 2: I will implement that any column that !is.integer & !base::is.double& !is.character & !is.factor & !inherits(col,"integer64") throws an error. Currently, complex column in x throws an error before arriving at bmerge in forder when attempting to create an index. Maybe, complex in i will work to cover the code in unit tests, I will test that.
  • Concerning 3: I will update the NEWS.md

@MarkusBonsch
Copy link
Contributor Author

I have refactored the code according to @mattdowle 's proposals with 2 exceptions:

  1. I had difficulties defining the warning message for the flagw for 'no coercion but warning': what should it warn about? Would it only be for intAsReal columns as suggested by @mattdowle? Then it could only be entered at very specific positions in the matrix. This is not good from my perspective. Therefore, I would propose to treat this warning seperately, outside the matrix. After further thought, I decided not to implement it at all for a very simple reason: I don't believe that bmerge is the correct place to warn users about storing integers as integers. If that is intended, it should happen on creation of intAsReal columns in my opinion. I am very open for other suggestions on this topic. The current possible coercion strategies are the following:
  ##------------------|-----------------------------------------------------
  ## y (yes):         | no coercion necessary, types match
  ##------------------|-----------------------------------------------------
  ## e (error):       | throw an error because of incompatible types
  ##------------------|-----------------------------------------------------
  ## ci (coercion i): | coerce the column in i to the type of the column in x
  ##------------------|-----------------------------------------------------
  ## ciw:             | same as above, but with warning.
  ##------------------|-----------------------------------------------------
  ## cx (coercion x): | coerce the column in x to the type of the column in i
  ##------------------|-----------------------------------------------------
  ## cxw:             | same as above, but with warning.
  ##------------------|-----------------------------------------------------
  1. Originally, we wanted to issue warnings to users on joins with non-matching types (currently, there are only verbose message warnings). I did not do that so far since this would affect > 100 existing unit tests, including cases, where internal data.table code would cause warnings. So, before doing so, I would propose to clean the code and tests for unnecessary type mismatches and then implementing the warnings, which is as simple as replacing ci and cx with ciw and cxw in the type conversion matrix respectively (see table under 1.).

R/bmerge.R Outdated
## where newclass is the class on the right of ==
typeCastCoercionX = c("integer64==realDouble")

## Establish a lookup table with coercion strategies for each pair of types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice but maybe we can move that to proper Rd? (doesn't have to be now, can be separate PR)
We can export coerceClass and put that to man.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice reorganization of code.
None of existing tests were altered and many new tests added, coverage increase, very nice.

I did not do that so far since this would affect > 100 existing unit tests, including cases, where internal data.table code would cause warnings. So, before doing so, I would propose to clean the code and tests for unnecessary type mismatches and then implementing the warnings, which is as simple as replacing ci and cx with ciw and cxw in the type conversion matrix respectively (see table under 1.).

Such significant change would need to be implemented with option (by default disabled) anyway, which could be eventually added now in this PR: if (getOption("datatable.warntypemismatch", FALSE)) "ciw" else "ci", but let's wait for Matt's comment.

R/bmerge.R Outdated
}
else {
## we need as.'to'() conversion
converter <- match.fun(paste0("as.", to))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are just few choices here, why not just few if (...) as.x else if (...) as.y else if ...? making such converted make perfect sense if we want to support arbitrary coercion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to be ready for other as.xxx conversion, but if you want, I can change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any "other" on roadmap? I am not aware of any. Still it will need to be added to matrix anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there isn't and you are completely riht about the matrix. If you want me to change the implementation of coerceClass, please indicate briefly, I will then do it.

@st-pasha
Copy link
Contributor

Should the type conversion matrix be symmetrical* (assuming ci <-> cx exchange upon transpose)? That is, do we expect that for any two types A, B the following should hold:

  [A, B] = y    <=>  [B, A] = y
  [A, B] = e    <=>  [B, A] = e
  [A, B] = ci   <=>  [B, A] = cx
  [A, B] = ciw  <=>  [B, A] = cxw

Currently it doesn't, for example there's ct["integer", "logical"] = "e" whereas ct["logical", "integer"] = "ci" (there are other mismatched pairs like this too).

@MarkusBonsch
Copy link
Contributor Author

You are right, @st-pasha, I also think that this would be the ideal behavior. However, this would alter existing behavior beyond fixing bugs IMHO. That is why I didn't implement it so far. If there is an agreement to change the behavior to be symmetrical as suggested by you, I am more than willing to implement. Note, however, that very likely existing tests have to be updated.

@jangorecki
Copy link
Member

jangorecki commented Apr 20, 2018

My suggestion is to add warnings and make matrix symmetric in this PR but as option. This is big breaking change and it might need to wait for more than 1 release cycle to be changed to TRUE by default. We need to release soon to CRAN, so putting this as an option gives us much more time to check revdeps and inform maintainers. This we can do already after release to CRAN so maintainers can smoothly migrate by using option on recent CRAN build. Later before future releases we can check how many packages adapted to new rules and depending on that change default at some point. @mattdowle what do you think?

@MarkusBonsch
Copy link
Contributor Author

While I like the idea of using this PR to consolidate type conversion behaviour in joins in general, we should remember that its primary goal is to fix two pretty terrible bugs in master:

library(data.table)
dt <- data.table(intCol = 1:2, numCol = c(1.5, 2.5), factCol = factor(c("a","a")))
dt[dt, on="intCol==numCol", nomatch = 0L]
#   intCol numCol factCol i.intCol i.factCol
#  1:      1         1.5          a           1           a
#  2:      2         2.5          a           2           a
## 1.5 is joine dto 1 and 2.5 to 2 !

dt[dt, on="intCol==factCol", nomatch = 0L]
#   intCol numCol factCol i.intCol i.numCol
# 1:       a       1.5       a          1         1.5
# 2:       a       1.5       a          2         2.5
## 1 gets joined to "a" since "a" is factor level 1.

So maybe, if the decision on further changes needs more discussion, it can be seperated and this PR merged as a bugfix.

@mattdowle mattdowle added this to the v1.11.2 milestone Apr 29, 2018
@jangorecki
Copy link
Member

jangorecki commented Oct 9, 2018

@mattd As you are going to look at bmerge, could you also briefly evaluate how difficult it would be to split current bmerge into bmergeR and bmerge C for future releases?

@MarkusBonsch
Copy link
Contributor Author

@jangorecki I am not sure, what you are hinting at with bmergeR and bmergeC. Currently, the R function bmerge is used to take care about type conversions and factor level consolidations. Then, Cbmerge is called to do the real merge.

@jangorecki
Copy link
Member

jangorecki commented Oct 9, 2018

@MarkusBonsch I mean separation of code as in fread and fwrite, where funR is used for R C API, and fun is used to operate in plain C. This allows to use those inner function without going through R C API. AFAIR currently both depends on R C API.

@MarkusBonsch
Copy link
Contributor Author

Thank you so much for taking care of trimws @mattdowle. I didn't find the time. If anything else pops up, I will try to fix it.

@mattdowle mattdowle removed this from the 1.12.0 milestone Jan 11, 2019
@mattdowle mattdowle added this to the 1.12.4 milestone May 17, 2019
@Rdatatable Rdatatable deleted a comment from codecov-io May 17, 2019
@mattdowle
Copy link
Member

mattdowle commented May 17, 2019

ciw and cxw aren't used anywhere in the ct matrix so there isn't any coverage on the warnings. Maybe the warning code can be dropped then, or if not, there should be entries using them. Still looking ...

@mattdowle mattdowle merged commit 6790cd6 into master May 22, 2019
@mattdowle mattdowle deleted the joinTypeFix branch May 22, 2019 02:15
@mattdowle mattdowle mentioned this pull request May 22, 2019
31 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

joins with type coercion: 1.5 == 1 is TRUE

4 participants