Increase threshold to trigger allow.cartesian error? #2455

@MichaelChirico

One of the only times I find myself using allow.cartesian = TRUE is when I'm doing clustered bootstrap estimates, for example:

library(data.table)
# sample to induce staggered group sizes
set.seed(203940)
x = as.data.table(iris)[sample(.N, .N, TRUE)]
setkey(x, Species)
x[ , .N, keyby = Species]
#       Species  N
# 1:     setosa 43
# 2: versicolor 59
# 3:  virginica 48
spec = unique(x$Species)
BB = 100
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    #sample _groups_ with replacement
    x[.(sample(spec, length(spec), TRUE)), 
      lm(Sepal.Length ~ Sepal.Width)$coefficients]
    #success
    TRUE
    #failure
  }, error = function(x) FALSE)
})
mean(runs)
# [1] 0.63

This fails about 40% of the time, because the staggered group sizes mean that, even though we pull the same number of groups at each iteration, the resulting number of rows often exceeds that of the original table. This is expected behavior, so it's somewhat bothersome to have to specify allow.cartesian, especially since the argument doesn't really capture what I'm trying to do (this is nowhere near a Cartesian join).

Diagnosing a bit more, we see:

sizes = replicate(1e5, x[.(sample(spec, length(spec), TRUE)),
                         .N, allow.cartesian = TRUE])
table(sizes)
# sizes
#   129   134   139   144   145   150   155   161   166   177 
#  3684 11086 11084  3832 11073 22338 11007 10998 11219  3679 
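
The simulated sizes can be cross-checked by enumerating every possible draw directly. The sketch below uses base R only and takes the three group sizes from the `x[ , .N, keyby = Species]` output above. It assumes the error is triggered when the join result exceeds nrow(x) + nrow(i), which is the baseline this issue refers to; the variable names are illustrative.

```r
# Group sizes as reported by x[ , .N, keyby = Species]
group_sizes = c(setosa = 43, versicolor = 59, virginica = 48)

# All 3^3 = 27 equally likely ordered draws of three groups with replacement
draws = expand.grid(g1 = group_sizes, g2 = group_sizes, g3 = group_sizes)
join_rows = unname(rowSums(draws))

# Assumed limit: nrow(x) + nrow(i) = 150 + 3
limit = sum(group_sizes) + length(group_sizes)

# Distinct result sizes and the share of draws that exceed the limit
possible_sizes = sort(unique(join_rows))
p_fail = mean(join_rows > limit)
possible_sizes
p_fail
```

Under that assumption, 10 of the 27 draws exceed the limit of 153, so the expected failure rate is 10/27 (about 37%), in line with the observed `mean(runs)` of 0.63.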

The number of rows never exceeds the original table size by more than about 20% (of course, this depends on the underlying group sizes). 1.2*(nrow(x) + nrow(i)) seems as good a threshold as any... I'm not sure how useful this would be to other users, so I'm just throwing it out there for now.

Could also consider basing the threshold on proximity to nrow(x)*nrow(i) (i.e., full Cartesian) instead of excess over nrow(x) + nrow(i): say, if the result is more than 40% of the way to being Cartesian, throw the error? (That threshold would be 180 in this case, i.e., about the same as 20% over the summed row total, which comes to 183.6.)
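
To make the two proposals concrete, here is a small sketch of the arithmetic for this example. The limits are hypothetical and only mirror the formulas suggested in this issue; none of this is data.table's actual check, and the variable names are made up for illustration.

```r
# Sizes taken from the example in this issue
nx = 150L            # nrow(x)
ni = 3L              # nrow(i): three groups drawn per iteration
max_observed = 177L  # largest join result in the table(sizes) output

# Hypothetical limits based on the formulas discussed above
current_limit   = nx + ni          # baseline: nrow(x) + nrow(i)
additive_limit  = 1.2 * (nx + ni)  # 20% over the summed row total
cartesian_limit = 0.4 * nx * ni    # 40% of the full Cartesian size

# Would the largest observed bootstrap draw trip each limit?
would_error = c(
  current   = max_observed > current_limit,
  additive  = max_observed > additive_limit,
  cartesian = max_observed > cartesian_limit
)
would_error
```

For these sizes, the largest bootstrap draw (177 rows) exceeds the baseline of 153 but stays under both proposed limits (183.6 and 180), so neither proposal would raise the error in this use case.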
