
allow.cartesian should be more precise #4383

@jangorecki

Description


Rather than having a particular threshold, which is currently nrow(x)+nrow(i), I propose raising the row-explosion exception (balancing backward compatibility, of course) in either of these cases:

  • a) there are duplicates in on columns in x
  • b) there are duplicates in on columns in x AND i

Note that there is a related issue #2455
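As a rough illustration, the two cases could be told apart by checking the on columns for duplicates. A minimal base-R sketch, using plain data.frames; the helper name explosion_case is hypothetical, not data.table API:

```r
# Hypothetical helper illustrating the proposed cases (not data.table API).
# Case a) fires when the on columns of x contain duplicates;
# case b) fires when both x and i contain duplicates in the on columns.
explosion_case <- function(x, i, on) {
  dup_x <- anyDuplicated(x[on]) > 0L
  dup_i <- anyDuplicated(i[on]) > 0L
  if (dup_x && dup_i) "b" else if (dup_x) "a" else "none"
}

explosion_case(data.frame(x = c(1L, 1L)), data.frame(x = 1L), "x")         # "a"
explosion_case(data.frame(x = c(1L, 1L)), data.frame(x = c(1L, 1L)), "x")  # "b"
```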


Case a)
This looks quite aggressive, but if we just printed a message (or cat to the console) it shouldn't be a big deal.
It gives a nice property: the number of rows in the result is never bigger than nrow(i), and in the case of nomatch=NA it is always equal to nrow(i). So it corresponds to the mult="error" feature (#1260). This behaviour is commonly expected when we join tables to look up fields from one table in another.
Note that when we use update-on-join for that (join and :=), AFAIK it automatically behaves like mult="last" rather than mult="all" (#3747, #2837), so row explosion is not happening. Also note that during update-on-join, x and i are swapped.
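To see why update-on-join ends up behaving like mult="last", here is a base-R sketch of the assignment semantics using plain data.frames (not the actual data.table implementation): each matching row of i writes into the same rows of x in order, so the last match wins and no rows are added.

```r
# Base-R sketch: duplicate matches in i overwrite the same x rows in
# order, so the value from the last matching i row survives.
x <- data.frame(k = 1L, v = NA_integer_)
i <- data.frame(k = c(1L, 1L), v = c(10L, 20L))
for (r in seq_len(nrow(i))) {
  hit <- x$k == i$k[r]
  x$v[hit] <- i$v[r]   # later matches overwrite earlier ones
}
x$v  # 20L -- analogous to mult="last"; nrow(x) is unchanged
```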

Case b) basically addresses the cartesian product of duplicated entries.

I think it makes sense to add these as options for now, so people can evaluate them more easily. Maybe allow.cartesian should handle case b), and a new option case a)? Or even just a mult option: options("datatable.mult"="error").
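If that route were taken, the option would be consulted like any other R option; a tiny sketch ("datatable.mult" is the name proposed above, not an option that exists today):

```r
# "datatable.mult" is the proposed option name, not an existing one.
options(datatable.mult = "error")
getOption("datatable.mult", default = "all")  # "error"
```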

library(data.table)
d = function(n) as.data.table(list(x=rep(1L, n)))
d(1)[d(1), on="x"]
#   x
#1: 1
d(1)[d(2), on="x"]
#   x
#1: 1
#2: 1
d(2)[d(1), on="x"] # error in a) because we look up from `x` and there are multiple matches!
#   x
#1: 1
#2: 1
d(2)[d(2), on="x"] # error in b) because there is cartesian product happening
#   x
#1: 1
#2: 1
#3: 1
#4: 1
d(1)[d(3), on="x"]
#   x
#1: 1
#2: 1
#3: 1
d(3)[d(1), on="x"] # error in a): duplicated entries in the lookup table
#   x
#1: 1
#2: 1
#3: 1
d(3)[d(2), on="x"] # already errors
d(2)[d(3), on="x"] # already errors (6 rows > 5 = nrow(x)+nrow(i))
d(3)[d(3), on="x"]
#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#  Join results in 9 rows; more than 6 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
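For comparison, base R's merge has no such guard and happily returns the same cartesian product that the data.table call above refuses by default:

```r
# merge() returns the full 3 x 3 = 9-row cartesian product that
# d(3)[d(3), on="x"] refuses without allow.cartesian=TRUE.
a <- data.frame(x = rep(1L, 3))
b <- data.frame(x = rep(1L, 3))
nrow(merge(a, b, by = "x"))  # 9
```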

Labels: joins