Description
One of the few times I find myself using allow.cartesian = TRUE is when I'm doing clustered bootstrap estimates, for example:
library(data.table)

# sample to induce staggered group sizes
set.seed(203940)
x = as.data.table(iris)[sample(.N, .N, TRUE)]
setkey(x, Species)
x[ , .N, keyby = Species]
# Species N
# 1: setosa 43
# 2: versicolor 59
# 3: virginica 48
spec = unique(x$Species)
BB = 100
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    # sample _groups_ with replacement
    x[.(sample(spec, length(spec), TRUE)),
      lm(Sepal.Length ~ Sepal.Width)$coefficients]
    # success
    TRUE
  # failure
  }, error = function(x) FALSE)
})
mean(runs)
# [1] 0.63
This fails about 40% of the time: because the group sizes are staggered, the join often returns more rows than the original table, even though we pull the same number of groups at each iteration. This is expected behavior, so it's somewhat bothersome to have to specify allow.cartesian, especially since the argument doesn't really capture what I'm trying to do (this is nothing like a Cartesian join).
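For reference, the workaround is just to pass allow.cartesian = TRUE on the bootstrap join itself, which silences the error but mislabels what's going on:

# workaround: allow.cartesian = TRUE on the (non-Cartesian) grouped-bootstrap join
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    x[.(sample(spec, length(spec), TRUE)),
      lm(Sepal.Length ~ Sepal.Width)$coefficients,
      allow.cartesian = TRUE]
    TRUE
  }, error = function(x) FALSE)
})
mean(runs)
# should be 1: every iteration succeeds once the size check is bypassed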
Diagnosing a bit more, we see:
sizes = replicate(1e5, x[.(sample(spec, length(spec), TRUE)),
.N, allow.cartesian = TRUE])
table(sizes)
# sizes
# 129 134 139 144 145 150 155 161 166 177
# 3684 11086 11084 3832 11073 22338 11007 10998 11219 3679
The number of rows never exceeds the original table size by more than about 20% (of course this depends on the underlying group sizes). 1.2*(nrow(x) + nrow(i)) seems as good a threshold as any... not sure how useful this would be to other users, so just throwing it out there for now.
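As a rough sketch of what I mean (purely illustrative, not data.table's actual internal check; the function and its arguments are made up):

# hypothetical guard, for illustration only: error only when the join result
# is more than 20% larger than the combined row count of x and i
check_join_size = function(result_rows, nx, ni, tol = 1.2) {
  if (result_rows > tol * (nx + ni))
    stop("Join results in ", result_rows, " rows, more than ", tol,
         "*(nrow(x)+nrow(i)); use allow.cartesian = TRUE if this is intended.")
  invisible(TRUE)
}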
Could also consider basing the threshold on proximity to nrow(x)*nrow(i) (i.e., the full Cartesian size) instead of excess over nrow(x) + nrow(i): say, if the result is more than 40% of the way to being Cartesian, throw the error? (That threshold would be 180 in this case, i.e., roughly the same as 20% over the summative row total.)
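For the example above (reading "40% of the way" as 40% of nrow(x)*nrow(i)), the two candidate thresholds land in about the same place:

nx = nrow(x)        # 150
ni = length(spec)   # 3 rows in i, one per sampled group
1.2 * (nx + ni)     # 183.6 -- 20% over the summative row total
0.4 * (nx * ni)     # 180   -- 40% of the full Cartesian size
max(sizes)          # 177   -- largest bootstrap join observed above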