Description
Rather than having a particular threshold, which is currently `nrow(x)+nrow(i)`, I propose to raise a row-explosion exception (of course balancing backward compatibility) in either of these cases:
- a) there are duplicates in `on` columns in `x`
- b) there are duplicates in `on` columns in `x` AND `i`
Note that there is a related issue #2455.
Case a)
This looks quite aggressive, but if we just printed a message (or `cat` to the console) instead of raising an error, it shouldn't be a big deal.
It gives a nice property: the number of rows in the result is never bigger than `nrow(i)`, and with `nomatch=NA` it is always equal to `nrow(i)`. So it corresponds to the `mult="error"` feature (#1260). This behaviour is commonly expected when we join tables to look up fields from one table in another.
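As a minimal sketch of what that lookup semantic could look like at user level (the `lookup` helper below is hypothetical, not a data.table function):

```r
library(data.table)

# Hypothetical helper emulating mult="error" for lookup joins: refuse to
# join when x has duplicated on= columns, so the result can never have
# more rows than i.
lookup = function(x, i, on) {
  if (anyDuplicated(x, by = on))
    stop("duplicates in x's on= columns; join would multiply rows of i")
  x[i, on = on]
}

dt  = data.table(id = c(1L, 2L), v = c("a", "b"))
lkp = data.table(id = 1L)
lookup(dt, lkp, on = "id")  # exactly nrow(lkp) rows, as with nomatch=NA
```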
Note that when we use update-on-join for that (join plus `:=`), AFAIK it automatically behaves like `mult="last"` rather than `mult="all"` (#3747, #2837), so row explosion is not happening. Also note that during update-on-join, the roles of x and i are swapped.
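For illustration (a toy example of mine, not taken from those issues):

```r
library(data.table)

dt  = data.table(id = 1L)                          # table being updated (x)
lkp = data.table(id = c(1L, 1L), v = c(10L, 20L))  # lookup with a duplicated key (i)
dt[lkp, v := i.v, on = "id"]
dt
#    id  v
# 1:  1 20   # last match wins; nrow(dt) unchanged, no row explosion
```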
Case b)
Basically addresses the Cartesian product of duplicated entries.
I think it makes sense to add those as options for now, so people can more easily evaluate them. Maybe `allow.cartesian` should handle case b), and a new option handle case a)? Or even just a `mult` option: `options("datatable.mult"="error")`.
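As a rough sketch, such an option could be consulted before materializing the join; both the option name and the check below are hypothetical, nothing like this exists in data.table today:

```r
# Hypothetical gate, consulted by the join code before allocating the result.
check_mult = function(x, on) {
  if (identical(getOption("datatable.mult", "all"), "error") &&
      anyDuplicated(x, by = on) > 0L)
    stop("mult='error': duplicated entries in x's on= columns")
  invisible(TRUE)
}
options(datatable.mult = "error")  # opt in; the default keeps current behaviour
```

The examples below show which calls would error under each proposed case: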
```r
library(data.table)
d = function(n) as.data.table(list(x = rep(1L, n)))
d(1)[d(1), on="x"]
# x
#1: 1
d(1)[d(2), on="x"]
# x
#1: 1
#2: 1
d(2)[d(1), on="x"] # error in a) because we look up from `x` and there are multiple matches!
# x
#1: 1
#2: 1
d(2)[d(2), on="x"] # error in b) because a Cartesian product is happening
# x
#1: 1
#2: 1
#3: 1
#4: 1
d(1)[d(3), on="x"]
# x
#1: 1
#2: 1
#3: 1
d(3)[d(1), on="x"] # error in a): duplicated entries in the lookup table
# x
#1: 1
#2: 1
#3: 1
d(3)[d(2), on="x"] # already errors
d(2)[d(3), on="x"]
d(3)[d(3), on="x"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  :
#   Join results in 9 rows; more than 6 = nrow(x)+nrow(i). Check for duplicate key values
#   in i each of which join to the same group in x over and over again. If that's ok, try
#   by=.EACHI to run j for each group to avoid the large allocation. If you are sure you
#   wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this
#   error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
```
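For reference, the current escape hatch for that last call is explicit opt-in:

```r
# allow.cartesian=TRUE is the existing way to permit the 9-row result
d(3)[d(3), on = "x", allow.cartesian = TRUE]
```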