Description
The performance of a keyed update dt[selector, foo := bar] can drop significantly when an index is present. My use case and a reproducible example follow:
library(data.table)
# 1000 symbols x 5000 dates each (5,000,000 rows), keyed by symbol and date
dt <- data.table(symbol = rep(1:1000, each = 5000))
dt[, date := seq_len(.N), by = symbol]
setkeyv(dt, c("symbol", "date"))
# 5000 date ranges, each attached to a randomly chosen symbol in 1:500
flag_dt <- data.table(symbol = sample.int(500, 5000, replace = TRUE))
flag_dt[, start_date := sample.int(3000, .N, replace = TRUE)]
flag_dt[, end_date := start_date + sample.int(3000, .N, replace = TRUE)]
flag_dt[, id := seq_len(.N)]
# all distinct dates in dt, sorted
calendar <- dt[, sort(unique(date))]
When dt has no index, the following code, which repeatedly uses a (symbol, date) selector to set flag, is fast enough:
system.time({
  dt[, flag := 0L]
  flag_dt[, {
    dates <- calendar[calendar %between% c(start_date, end_date)]
    if (length(dates)) {
      selector <- list(symbol, dates)
      dt[selector, flag := 1L]
    }
    NULL
  }, by = id]
})
   user  system elapsed
 26.189   0.399   3.344
   user  system elapsed
 24.648   0.220   3.119
However, if an index is created intentionally, or (as often happens) unintentionally by auto-indexing triggered by something like dt[flag0 == 1, ...], the same code becomes much slower and its timing becomes unstable:
dt[, flag0 := sample(0:1, .N, replace = TRUE)]
setindexv(dt, "flag0")
system.time({
  dt[, flag := 0L]
  flag_dt[, {
    dates <- calendar[calendar %between% c(start_date, end_date)]
    if (length(dates)) {
      selector <- list(symbol, dates)
      dt[selector, flag := 1L]
    }
    NULL
  }, by = id]
})
   user  system elapsed
386.415  27.380  52.938
   user  system elapsed
212.908   7.289  27.665
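For anyone trying to reproduce this, it may help to confirm whether the table actually carries an index before timing the loop, since auto-indexing can add one silently. The following is only a minimal sketch on a tiny stand-in table (small_dt is a hypothetical name, not part of the example above), and it assumes the default setting options(datatable.auto.index = TRUE):

```r
library(data.table)

# Tiny stand-in for dt; small_dt is a hypothetical name used only here.
small_dt <- data.table(symbol = rep(1:3, each = 4))
small_dt[, date := seq_len(.N), by = symbol]
small_dt[, flag0 := rep(0:1, length.out = .N)]
setkeyv(small_dt, c("symbol", "date"))

# The table is keyed, but no index has been created yet.
print(indices(small_dt))

# Under default options, a subset using == on a non-key column may
# create an index as a side effect (auto-indexing).
invisible(small_dt[flag0 == 1L])
print(indices(small_dt))

# An explicit index is created with setindexv(), as in the example above.
setindexv(small_dt, "flag0")
print(indices(small_dt))
```

Calling indices(dt) right before the slow system.time() block should show whether the slowdown coincides with an index being present.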
I also tried specifying the join columns explicitly, as in dt[selector, flag := 1L, on = .(symbol, date)], but it made no difference.
Not creating an index, or disabling auto-indexing, avoids the problem, but I'm still curious what adds so much overhead to a keyed := when an index is present.
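The two workarounds mentioned above could look roughly like the sketch below. It is an illustration on a small stand-in table (demo_dt and old_opts are hypothetical names), not a fix for the underlying overhead, and it assumes that dropping all indices is acceptable for the rest of the session's queries:

```r
library(data.table)

# Small stand-in table; demo_dt is a hypothetical name used only here.
demo_dt <- data.table(symbol = rep(1:3, each = 4))
demo_dt[, date := seq_len(.N), by = symbol]
demo_dt[, flag0 := rep(0:1, length.out = .N)]
setkeyv(demo_dt, c("symbol", "date"))
setindexv(demo_dt, "flag0")

# Workaround 1: stop data.table from creating indices automatically,
# keeping the previous setting so it can be restored afterwards.
old_opts <- options(datatable.auto.index = FALSE)

# Workaround 2: drop any existing indices before the update loop.
setindex(demo_dt, NULL)

# The keyed update itself is unchanged from the example above.
demo_dt[, flag := 0L]
demo_dt[list(2L, 1:2), flag := 1L]

options(old_opts)  # restore the previous option value
```

Dropping the index only before the hot loop and recreating it afterwards with setindexv() may be preferable if other queries rely on it.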