Major performance drop of keyed := when index is present #4311

@renkun-ken

The performance of a keyed update, dt[selector, foo := bar], can drop significantly when an index is present on the table. Below is my use case as a reproducible example:

library(data.table)

dt <- data.table(symbol = rep(1:1000, each = 5000))
dt[, date := seq_len(.N), by = symbol]
setkeyv(dt, c("symbol", "date"))

flag_dt <- data.table(symbol = sample.int(500, 5000, replace = TRUE))
flag_dt[, start_date := sample.int(3000, .N, replace = TRUE)]
flag_dt[, end_date := start_date + sample.int(3000, .N, replace = TRUE)]
flag_dt[, id := seq_len(.N)]

calendar <- dt[, sort(unique(date))]

When dt has no index, the following code, which repeatedly uses a (symbol, date) selector to set flag, is fast enough:

system.time({
  dt[, flag := 0L]
  flag_dt[, {
    dates <- calendar[calendar %between% c(start_date, end_date)]
    if (length(dates)) {
      selector <- list(symbol, dates)
      dt[selector, flag := 1L]
    }
    NULL
  }, by = id]
})
   user  system elapsed 
 26.189   0.399   3.344 
   user  system elapsed 
 24.648   0.220   3.119 

However, if an index is created intentionally or, as often happens, unintentionally (an auto index triggered by something like dt[flag0 == 1, ...]), the same code becomes much slower and its timing is unstable. In the two runs below, elapsed time goes from about 3 seconds to roughly 28 and 53 seconds:

dt[, flag0 := sample(0:1, .N, replace = TRUE)]
setindexv(dt, "flag0")
system.time({
  dt[, flag := 0L]
  flag_dt[, {
    dates <- calendar[calendar %between% c(start_date, end_date)]
    if (length(dates)) {
      selector <- list(symbol, dates)
      dt[selector, flag := 1L]
    }
    NULL
  }, by = id]
})
   user  system elapsed 
386.415  27.380  52.938
   user  system elapsed 
212.908   7.289  27.665
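
To check whether a table has picked up an index without being asked to, indices() can be called before and after an equality subset. The sketch below uses a small hypothetical table named small (not the dt from this report) and assumes the default setting datatable.auto.index = TRUE:

```r
library(data.table)

# Hypothetical small table, used only to illustrate auto-indexing.
small <- data.table(flag0 = rep(0:1, 5), x = 1:10)

# Make sure auto-indexing is on (the default); remember the old setting.
old <- options(datatable.auto.index = TRUE)

idx_before <- indices(small)   # no index has been created yet
res <- small[flag0 == 1L]      # equality subset on a non-key column
idx_after <- indices(small)    # expected to now list an index on flag0

options(old)                   # restore the previous setting
print(idx_before)
print(idx_after)
```

If idx_after lists flag0, the subset alone was enough to attach an index to the table, which is the unintentional case described above.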

I also tried spelling out the join columns explicitly with dt[selector, flag := 1L, on = .(symbol, date)], but it was just as slow.

Not creating an index, or disabling auto-indexing, avoids the problem, but I'm still curious what adds so much overhead to a keyed := when an index is present.
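
A minimal sketch of the two workarounds mentioned above, again on a small hypothetical table named small rather than the real dt: setindex(x, NULL) drops all existing indices while leaving the key alone, and the datatable.auto.index option stops equality subsets from creating new ones. This only sidesteps the slowdown; it does not explain it.

```r
library(data.table)

# Hypothetical keyed table with one explicit index, mirroring the setup above.
small <- data.table(symbol = rep(1:2, each = 3), date = rep(1:3, 2))
setkeyv(small, c("symbol", "date"))
small[, flag0 := rep(0:1, 3)]
setindexv(small, "flag0")
idx_with <- indices(small)       # "flag0"

# Workaround 1: drop every index; the key is not affected.
setindex(small, NULL)
idx_dropped <- indices(small)    # NULL
key_after <- key(small)          # "symbol" "date"

# Workaround 2: stop equality subsets from creating indices.
old <- options(datatable.auto.index = FALSE)
res <- small[flag0 == 1L]
idx_no_auto <- indices(small)    # still NULL
options(old)                     # restore the previous setting
```

Dropping indices right before the update loop and restoring the option afterwards keeps the change local to the slow section.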
