
%notin% is not safe in the way it handles NA by default #5481

@Kamgang-B

IMHO, the typical use of %notin% is likely to be DT[lhs %notin% rhs, ...], where (1) rhs contains no missing values and (2) the user wants to return/modify the rows whose lhs values are not in rhs.

Also, I don't expect users to write something like !lhs %notin% rhs (since %in% is already convenient for that operation).

For these reasons, I think it is better to be on the safe side and have DT[lhs %notin% rhs, ...] return/modify only rows whose lhs values are non-missing and not in rhs. The user would then have to explicitly add NA to the rhs if they also want to include rows with missing values.

Consider the following example:

 dt = data.table(x=c(1:3, NA, 4L, NA), y=1:6, z=10*c(3, 1, 4, 8, 3, 8))

       x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA     4    80
5:     4     5    30
6:    NA     6    80

dt[x %notin% 1:3, y := z]

       x     y     z
   <int> <int> <num>
1:     1     1    30
2:     2     2    10
3:     3     3    40
4:    NA    80    80
5:     4    30    30
6:    NA    80    80

In doing this operation, I don't really think users expect the rows where x is NA to be modified.
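
For reference, here is a minimal base-R sketch of why the NA rows get selected. It assumes %notin% gives the same result as !(lhs %in% rhs), which is what the operator is described as replacing, and it uses only base R so it does not depend on the installed data.table version. The explicit !is.na() guard is just one possible workaround, not something data.table applies for you.

```r
x <- c(1:3, NA, 4L, NA)

# %in% is built on match(); an NA in x finds no match in 1:3,
# so negating the result marks the NA elements as TRUE
current <- !(x %in% 1:3)
print(current)

# Explicit guard: keep only non-missing values that are absent from the table
guarded <- !is.na(x) & !(x %in% 1:3)
print(guarded)
```

Used in i, e.g. dt[!is.na(x) & !(x %in% 1:3), y := z], the guarded condition would modify only row 5 of the example table.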

So, even if %notin% is meant to provide a more memory-efficient version of !lhs %in% rhs (IIRC), I also think it would be better to handle missing values more safely.
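
To make the requested behaviour concrete, here is a rough sketch of an NA-safe variant written in base R. The name %notin_safe% is hypothetical and not part of data.table, and this says nothing about how data.table implements %notin% internally; it only illustrates the semantics being asked for.

```r
# Hypothetical NA-safe variant; NOT part of data.table
`%notin_safe%` <- function(x, table) {
  # match() returns 0L (via nomatch) where x has no match in table
  no_match <- match(x, table, nomatch = 0L) == 0L
  # missing values in x are never selected
  no_match & !is.na(x)
}

x <- c(1:3, NA, 4L, NA)
res <- x %notin_safe% 1:3
print(res)
```

With this definition only the element 4 is flagged, so the NA rows in the example would be left untouched.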

P.S.: I wonder if it's also possible to export a functional alternative to %notin%, something like notin(x, table, nomatch=-1L).
