
allow.cartesian should be more precise #4383

@jangorecki

Description


Rather than having a particular threshold, which is currently nrow(x)+nrow(i), I propose raising the row-explosion exception (balancing backward compatibility, of course) in either of these cases:

  • a) there are duplicates in on columns in x
  • b) there are duplicates in on columns in x AND i

Note that there is a related issue #2455
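As a rough illustration, the two cases could be told apart by checking the on columns for duplicates. A minimal base-R sketch, using plain data.frames; the helper name explosion_case is hypothetical, not data.table API:

```r
# Hypothetical helper illustrating the proposed cases (not data.table API).
# Case a) fires when the on columns of x contain duplicates;
# case b) fires when both x and i contain duplicates in the on columns.
explosion_case <- function(x, i, on) {
  dup_x <- anyDuplicated(x[on]) > 0L
  dup_i <- anyDuplicated(i[on]) > 0L
  if (dup_x && dup_i) "b" else if (dup_x) "a" else "none"
}

explosion_case(data.frame(x = c(1L, 1L)), data.frame(x = 1L), "x")         # "a"
explosion_case(data.frame(x = c(1L, 1L)), data.frame(x = c(1L, 1L)), "x")  # "b"
```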


Case a)
This looks quite aggressive, but if we just printed a message (or cat to the console) it shouldn't be a big deal.
It gives a nice property: the number of rows in the result is never bigger than nrow(i), and in the case of nomatch=NA it is always equal to nrow(i). So it corresponds to the mult="error" feature (#1260). This behaviour is commonly expected when we join tables to look up fields from one table in another.
Note that when we use update-on-join for that (join and :=), AFAIK it automatically behaves like mult="last" rather than mult="all" (#3747, #2837), so row explosion is not happening. Also note that during update-on-join, x and i are swapped.
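To see why update-on-join ends up behaving like mult="last", here is a base-R sketch of the assignment semantics using plain data.frames (not the actual data.table implementation): each matching row of i writes into the same rows of x in order, so the last match wins and no rows are added.

```r
# Base-R sketch: duplicate matches in i overwrite the same x rows in
# order, so the value from the last matching i row survives.
x <- data.frame(k = 1L, v = NA_integer_)
i <- data.frame(k = c(1L, 1L), v = c(10L, 20L))
for (r in seq_len(nrow(i))) {
  hit <- x$k == i$k[r]
  x$v[hit] <- i$v[r]   # later matches overwrite earlier ones
}
x$v  # 20L -- analogous to mult="last"; nrow(x) is unchanged
```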

Case b) basically addresses the cartesian product of duplicated entries.

I think it makes sense to add these as options for now, so people can evaluate them more easily. Maybe allow.cartesian should handle case b), and a new option case a)? Or even just a mult option: options("datatable.mult"="error").
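If that route were taken, the option would be consulted like any other R option; a tiny sketch ("datatable.mult" is the name proposed above, not an option that exists today):

```r
# "datatable.mult" is the proposed option name, not an existing one.
options(datatable.mult = "error")
getOption("datatable.mult", default = "all")  # "error"
```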

library(data.table)
d = function(n) as.data.table(list(x=rep(1L, n)))
d(1)[d(1), on="x"]
#   x
#1: 1
d(1)[d(2), on="x"]
#   x
#1: 1
#2: 1
d(2)[d(1), on="x"] # error in a) because we look up from `x` and there are multiple matches!
#   x
#1: 1
#2: 1
d(2)[d(2), on="x"] # error in b) because there is cartesian product happening
#   x
#1: 1
#2: 1
#3: 1
#4: 1
d(1)[d(3), on="x"]
#   x
#1: 1
#2: 1
#3: 1
d(3)[d(1), on="x"] # error in a): duplicated entries in the lookup table
#   x
#1: 1
#2: 1
#3: 1
d(3)[d(2), on="x"] # already errors
d(2)[d(3), on="x"] # already errors (6 rows > 5 = nrow(x)+nrow(i))
d(3)[d(3), on="x"]
#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#  Join results in 9 rows; more than 6 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
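For comparison, base R's merge has no such guard and happily returns the same cartesian product that the data.table call above refuses by default:

```r
# merge() returns the full 3 x 3 = 9-row cartesian product that
# d(3)[d(3), on="x"] refuses without allow.cartesian=TRUE.
a <- data.frame(x = rep(1L, 3))
b <- data.frame(x = rep(1L, 3))
nrow(merge(a, b, by = "x"))  # 9
```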

Labels: joins