Description
One of the few times I find myself using allow.cartesian = TRUE is when I'm doing clustered bootstrap estimates, for example:
library(data.table)

# sample to induce staggered group sizes
set.seed(203940)
x = as.data.table(iris)[sample(.N, .N, TRUE)]
setkey(x, Species)
x[ , .N, keyby = Species]
# Species N
# 1: setosa 43
# 2: versicolor 59
# 3: virginica 48
spec = unique(x$Species)
BB = 100
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    # sample _groups_ with replacement
    x[.(sample(spec, length(spec), TRUE)),
      lm(Sepal.Length ~ Sepal.Width)$coefficients]
    # success
    TRUE
  # failure
  }, error = function(x) FALSE)
})
mean(runs)
# [1] 0.63
This fails about 40% of the time: because the group sizes are staggered, the join often returns more rows than the original table, even though we pull the same number of groups at each iteration. This is expected behavior, so it's somewhat bothersome to have to specify allow.cartesian, especially since the argument doesn't really capture what I'm trying to do (this is nothing like a Cartesian join).
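For reference, the workaround is just to pass allow.cartesian = TRUE on the bootstrap join itself, which silences the error but mislabels what's going on:

# workaround: allow.cartesian = TRUE on the (non-Cartesian) grouped-bootstrap join
runs = sapply(seq_len(BB), function(ii) {
  tryCatch({
    x[.(sample(spec, length(spec), TRUE)),
      lm(Sepal.Length ~ Sepal.Width)$coefficients,
      allow.cartesian = TRUE]
    TRUE
  }, error = function(x) FALSE)
})
mean(runs)
# should be 1: every iteration succeeds once the size check is bypassed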
Diagnosing a bit more, we see:
sizes = replicate(1e5, x[.(sample(spec, length(spec), TRUE)),
.N, allow.cartesian = TRUE])
table(sizes)
# sizes
# 129 134 139 144 145 150 155 161 166 177
# 3684 11086 11084 3832 11073 22338 11007 10998 11219 3679
The number of rows never exceeds the original table size by more than about 20% (of course this depends on the underlying group sizes). 1.2*(nrow(x) + nrow(i)) seems as good a threshold as any... not sure how useful this would be to other users, so just throwing it out there for now.
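As a rough sketch of what I mean (purely illustrative, not data.table's actual internal check; the function and its arguments are made up):

# hypothetical guard, for illustration only: error only when the join result
# is more than 20% larger than the combined row count of x and i
check_join_size = function(result_rows, nx, ni, tol = 1.2) {
  if (result_rows > tol * (nx + ni))
    stop("Join results in ", result_rows, " rows, more than ", tol,
         "*(nrow(x)+nrow(i)); use allow.cartesian = TRUE if this is intended.")
  invisible(TRUE)
}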
Could also consider basing the threshold on proximity to nrow(x)*nrow(i) (i.e., the full Cartesian size) instead of excess over nrow(x) + nrow(i): say, if the result is more than 40% of the way to being Cartesian, throw the error? (That threshold would be 180 in this case, i.e., roughly the same as 20% over the summative row total.)
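For the example above (reading "40% of the way" as 40% of nrow(x)*nrow(i)), the two candidate thresholds land in about the same place:

nx = nrow(x)        # 150
ni = length(spec)   # 3 rows in i, one per sampled group
1.2 * (nx + ni)     # 183.6 -- 20% over the summative row total
0.4 * (nx * ni)     # 180   -- 40% of the full Cartesian size
max(sizes)          # 177   -- largest bootstrap join observed above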