I'm using the latest dev version of data.table, and uniqueN() called by group is still an order of magnitude slower than length(unique()): roughly 20 times slower by median in the benchmarks below. It is slow enough that I think it should be tagged as a bug. See the reprex examples below.
Note that the code below uses 4 threads (on a Win7 computer). If I set the number of threads to 1, the time is roughly halved, but uniqueN() is still significantly slower than length(unique()).
(In fact, I noticed this because I have a daily routine script that takes maybe 20 minutes, and trying to speed it up led me to the cause: uniqueN().)
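For anyone who wants to repeat the single-thread comparison, the thread count can be inspected with getDTthreads() and changed with setDTthreads(). The following is only a minimal sketch: the table `dt` is a small hypothetical example, not the benchmark data, and the snippet shows how to switch to one thread and restore the previous setting, not how the timings above were produced.

```r
library(data.table)

# Hypothetical toy table, used only to illustrate the thread switch
dt <- data.table(
  a = rep(1:3, each = 4),
  b = c(1, 1, 2, 3, 5, 5, 5, 5, 7, 8, 9, 7)
)

old_threads <- getDTthreads()   # remember the current thread setting
setDTthreads(1L)                # restrict data.table to a single thread

# Same shape of call as in the benchmarks, now run single-threaded
res_single <- dt[, .(N = uniqueN(b)), keyby = a]

setDTthreads(old_threads)       # restore the previous setting
print(res_single)
```

Wrapping the benchmark calls between the two setDTthreads() calls keeps the rest of the session on its original thread count.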
Character
library(data.table)
set.seed(1000)

# Build one random word from upper- and lower-case letters
mk_rd_words <- function(min = 4, max = 20) {
  n <- floor(runif(1, min, max))
  paste0(sample(c(letters, LETTERS), size = n, replace = TRUE), collapse = '')
}

# Pool of 1000 random words
words <- vapply(1:1000, function(x) mk_rd_words(4, 50), FUN.VALUE = 'a')

n <- 1e4
tbl <- data.table(
  a = sample(words, size = n, replace = TRUE),
  b = sample(words, size = n, replace = TRUE)
)

# Count distinct values of b within each group of a, two ways
microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr       min         lq       mean     median         uq       max neval
#>         tbl[, .(N = uniqueN(b)), keyby = a] 169.13260 171.651133 176.649940 173.972808 181.373316 201.70846   100
#>  tbl[, .(N = length(unique(b))), keyby = a]   8.12066   8.607032   9.233874   8.738746   9.208014  16.28779   100

Created on 2019-08-02 by the reprex package (v0.2.1)
Integer
library(data.table)
set.seed(1000)

n <- 1e4
tbl <- data.table(
  a = sample(1:1e3, size = n, replace = TRUE),
  b = sample(1:1e3, size = n, replace = TRUE)
)

# Count distinct values of b within each group of a, two ways
microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr        min         lq       mean     median         uq       max neval
#>         tbl[, .(N = uniqueN(b)), keyby = a] 107.329319 111.531912 119.497038 115.412347 124.303334 158.04780   100
#>  tbl[, .(N = length(unique(b))), keyby = a]   5.777745   5.980306   6.992722   6.314294   7.678619  13.08759   100

Created on 2019-08-02 by the reprex package (v0.2.1)
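As a possible stopgap for scripts like my daily routine, the per-group function call can be avoided entirely: deduplicate the (a, b) pairs once with unique(..., by = ) and then count the rows per group with .N. This is only a sketch on a small hypothetical table `dt`, and I have not benchmarked it here; the check at the end just confirms that it returns the same counts as length(unique(b)).

```r
library(data.table)
set.seed(1)

# Hypothetical small table, used only to check that the results agree
dt <- data.table(
  a = sample(1:5, size = 100, replace = TRUE),
  b = sample(1:10, size = 100, replace = TRUE)
)

# Reference: distinct count of b per group, computed group by group
ref <- dt[, .(N = length(unique(b))), keyby = a]

# Alternative: keep one row per distinct (a, b) pair, then count rows per a
alt <- unique(dt, by = c("a", "b"))[, .N, keyby = a]

# Both approaches should give the same groups and the same counts
stopifnot(identical(ref$a, alt$a), identical(ref$N, alt$N))
```

Like length(unique()), this counts NA as a distinct value, so the two should agree on columns with missing data as well.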