%notin% is not safe in the way it handles NA by default

IMHO, the typical use of `%notin%` is likely to be `DT[lhs %notin% rhs, ...]`, where (1) rhs contains no missing values and (2) the user wants to return/modify only the rows where lhs holds a non-missing value that is not in rhs.

Also, I don't expect users to write something like `!lhs %notin% rhs` (since `%in%` is already convenient for that operation).

For these reasons, I think it is better to be on the safe side and have `DT[lhs %notin% rhs, ...]` return/modify only rows whose non-missing values are not in rhs. In doing so, the user will have to explicitly add NA to the rhs if they also want to include rows with missing values.

Consider the following example:

     dt = data.table(x=c(1:3, NA, 4L, NA), y=1:6, z=10*c(3, 1, 4, 8, 3, 8))

	       x     y     z
	   <int> <int> <num>
	1:     1     1    30
	2:     2     2    10
	3:     3     3    40
	4:    NA     4    80
	5:     4     5    30
	6:    NA     6    80

	dt[x %notin% 1:3, y := z]

	       x     y     z
	   <int> <int> <num>
	1:     1     1    30
	2:     2     2    10
	3:     3     3    40
	4:    NA    80    80
	5:     4    30    30
	6:    NA    80    80


I doubt that users performing this operation expect the rows where x is NA to be modified.

So, even if `%notin%` is meant to provide a more memory-efficient version of `!lhs %in% rhs` (IIRC), I also think that it would be better to handle missing values more safely.
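
The NA behavior can be reproduced on a plain vector, without data.table. This is only a sketch: it assumes `x %notin% table` gives the same result as `!(x %in% table)`, which is consistent with the output shown above, and the names `current` and `guarded` are just for illustration.

```r
# Same x column as in the example above.
x <- c(1:3, NA, 4L, NA)

# Assumed to mirror what `x %notin% 1:3` currently returns:
# NA has no match in 1:3, so it counts as "not in".
current <- !(x %in% 1:3)

# Explicit NA guard: select only non-missing values absent from 1:3.
guarded <- !is.na(x) & !(x %in% 1:3)

print(current)
print(guarded)
```

Under the same assumption, `dt[!is.na(x) & x %notin% 1:3, y := z]` would be a workaround that leaves the rows where x is NA untouched.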


P.S.: I wonder if it's also possible to export a functional alternative to `%notin%`, something like `notin(x, table, nomatch=-1L)`.
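
To make the idea concrete, here is a rough base R sketch of what an NA-safe functional form might look like. `notin_na_safe` and its `keep_na` argument are hypothetical names, not part of data.table, and the sketch does not try to interpret the `nomatch` argument suggested above.

```r
# Hypothetical helper, not part of data.table: an NA-safe "not in".
# keep_na is an illustrative parameter, not the nomatch argument above.
notin_na_safe <- function(x, table, keep_na = FALSE) {
  out <- is.na(match(x, table))  # TRUE where x has no match in table
  out[is.na(x)] <- keep_na       # missing x is selected only on request
  out
}

x <- c(1:3, NA, 4L, NA)

safe_default <- notin_na_safe(x, 1:3)                  # NA rows excluded
with_missing <- notin_na_safe(x, 1:3, keep_na = TRUE)  # NA rows included

print(safe_default)
print(with_missing)
```

With this design the default never selects rows with a missing lhs, and including them is an explicit choice at the call site.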
