uniqueN() is very slow compared to length(unique()) #3739

@shrektan

I'm using the latest dev version of data.table. uniqueN() is still an order of magnitude slower than length(unique()), slow enough that I think it should be tagged as a bug. See the reprex below.

Note that the code below uses 4 threads (on a Win7 computer). Setting the thread count to 1 roughly halves the time, but uniqueN() is still significantly slower than length(unique()).
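
For anyone who wants to repeat the single-thread comparison, here is a minimal sketch. It assumes the thread count is changed with setDTthreads(); the report does not show which call was actually used. The table is the same small integer table as in the second reprex, and system.time() stands in for microbenchmark only to avoid the extra dependency. Timings will differ by machine, so none are shown.

```r
library(data.table)

set.seed(1000)
n <- 1e4
tbl <- data.table(
  a = sample(1:1e3, size = n, replace = TRUE),
  b = sample(1:1e3, size = n, replace = TRUE)
)

# setDTthreads() returns the previous thread count invisibly,
# so it can be stored and restored afterwards.
old_threads <- setDTthreads(1L)

res_uniqueN <- tbl[, .(N = uniqueN(b)), keyby = a]
res_unique  <- tbl[, .(N = length(unique(b))), keyby = a]

# Both expressions should report the same distinct count per group.
stopifnot(identical(res_uniqueN$N, res_unique$N))

# Rough single-run timings with one thread.
system.time(tbl[, .(N = uniqueN(b)), keyby = a])
system.time(tbl[, .(N = length(unique(b))), keyby = a])

# Restore the previous thread setting.
setDTthreads(old_threads)
```

The stopifnot() check only confirms that the two forms agree on this table, which is what makes the timing comparison meaningful.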

(In fact, I noticed this because a daily routine script of mine takes maybe 20 minutes, and trying to speed it up led me to the cause: uniqueN().)

Character

library(data.table)
set.seed(1000)
mk_rd_words <- function(min = 4, max = 20) {
  n <- floor(runif(1, min, max))
  paste0(sample(c(letters, LETTERS), size = n, replace = TRUE), collapse = '')
}
words <- vapply(1:1000, function(x) mk_rd_words(4, 50), FUN.VALUE = 'a')

n <- 1e4
tbl <- data.table(
  a = sample(words, size = n, replace = TRUE),
  b = sample(words, size = n, replace = TRUE)
)
microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr       min         lq
#>         tbl[, .(N = uniqueN(b)), keyby = a] 169.13260 171.651133
#>  tbl[, .(N = length(unique(b))), keyby = a]   8.12066   8.607032
#>        mean     median         uq       max neval
#>  176.649940 173.972808 181.373316 201.70846   100
#>    9.233874   8.738746   9.208014  16.28779   100

Created on 2019-08-02 by the reprex package (v0.2.1)

Integer

library(data.table)
set.seed(1000)
n <- 1e4
tbl <- data.table(
  a = sample(1:1e3, size = n, replace = TRUE),
  b = sample(1:1e3, size = n, replace = TRUE)
)
microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr        min         lq
#>         tbl[, .(N = uniqueN(b)), keyby = a] 107.329319 111.531912
#>  tbl[, .(N = length(unique(b))), keyby = a]   5.777745   5.980306
#>        mean     median         uq       max neval
#>  119.497038 115.412347 124.303334 158.04780   100
#>    6.992722   6.314294   7.678619  13.08759   100

Created on 2019-08-02 by the reprex package (v0.2.1)

Labels: bug, top request (one of our most-requested issues)
