Description
Grouped assignment by reference, [i, j := sum(...), by=], appears to give wrong results in data.table version 1.14.3 (development version), compiled with OpenMP. In the example below, the sums are not computed within each (id, t) group as expected:
require(data.table)
require(magrittr)
require(dplyr)
dt_test <- data.table(
t = sample(c(1:3), size = 15, replace = T),
id = sample(LETTERS[1:3], size = 15, replace = T),
v1 = sample(c(1:10), size = 15, replace = T),
v2 = 1
) %>%
`[`(, sum_v2_idT := sum(v2), by = c("id", "t"), verbose = F) %>%
`[`(, n_idT := dim(.SD)[[1]], by = list(t, id), verbose = F) %>%
`[`(, sum_v2_id := sum(v2), by = .(id), verbose = F) %>%
`[`(, sum_v1_idT := sum(v1), by = c("id", "t"), verbose = T) %>%
`[`(, sum_v1_id := sum(v1), by = c("id"), verbose = F) %>%
arrange(id, t)
#> output
> dt_test
t id v1 v2 sum_v2_idT n_idT sum_v2_id sum_v1_idT sum_v1_id
<int> <char> <int> <num> <num> <int> <num> <int> <int>
1: 2 A 8 1 *1 2 3 (10)*7 17
2: 2 A 2 1 2 2 3 (10)*5 17
3: 3 A 7 1 2 1 3 (7)*5 17
4: 1 B 3 1 3 2 5 (5)*17 29
5: 1 B 2 1 3 2 5 (5)*17 29
6: 2 B 9 1 2 2 5 14 29
7: 2 B 5 1 2 2 5 14 29
8: 3 B 10 1 1 1 5 10 29
9: 1 C 4 1 4 4 7 21 38
10: 1 C 5 1 4 4 7 21 38
11: 1 C 3 1 *2 4 7 (21)*10 38
12: 1 C 9 1 *2 4 7 (21)*10 38
13: 3 C 8 1 3 3 7 17 38
14: 3 C 5 1 *4 3 7 21 38
15: 3 C 4 1 *4 3 7 (17)*21 38
The correct values are noted in parentheses, and * marks the incorrect values that were actually returned. When only one column is passed to by, the outcome is correct. This is the verbose output of the 4th calculation, i.e., sum_v1_idT:
Argument 'by' after substitute: c("id", "t")
Detected that j uses these columns: [v1]
Finding groups using forderv ... forder.c received 15 rows and 2 columns
0.001s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 8
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'sum(v1)'
GForce optimized j to 'gsum(v1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gsum (narm=FALSE) took ... gather took ... 0.000s
0.000s
gforce eval took 0.000
0.000s elapsed (0.001s cpu)
Assigning to 15 row subset of 15 rows
RHS_list_of_columns == false
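The verbose output above reports that GForce optimized j to gsum(v1). One way to check whether that optimization is involved is sketched below, as an assumption and not a confirmed diagnosis: compute a reference result with base R's ave(), then repeat the grouped assignment with the datatable.optimize option lowered to 1, which is documented to keep the lapply optimization while switching GForce off. The seed, the table name dt_chk and the column names sum_default and sum_no_gforce are hypothetical.

```r
library(data.table)

# Hypothetical seed and table name; any data with unsorted (id, t) groups would do.
set.seed(42)
dt_chk <- data.table(
  t  = sample(1:3, size = 15, replace = TRUE),
  id = sample(LETTERS[1:3], size = 15, replace = TRUE),
  v1 = sample(1:10, size = 15, replace = TRUE)
)

# Reference per-row group sums from base R, independent of data.table grouping.
expected <- ave(dt_chk$v1, dt_chk$id, dt_chk$t, FUN = sum)

# Default optimization level (GForce may be applied to sum()).
dt_chk[, sum_default := sum(v1), by = c("id", "t")]

# Level 1 keeps the lapply optimization but should switch GForce off.
old_opt <- options(datatable.optimize = 1L)
dt_chk[, sum_no_gforce := sum(v1), by = c("id", "t")]
options(old_opt)  # restore the previous setting

# Rows where each result disagrees with the reference (integer(0) means none).
bad_default   <- which(dt_chk$sum_default != expected)
bad_no_gforce <- which(dt_chk$sum_no_gforce != expected)
print(bad_default)
print(bad_no_gforce)
```

If bad_default lists rows while bad_no_gforce is empty, that would point at the GForce path for grouped := and would be worth including in the report.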
While trying to write a reproducible example, I found something even odder: after replacing the sample() calls with the same fixed values, the output was correct for this 15-observation data. In my real data, about 130k rows, the problem still exists.
dt_test <- data.table(
t = c(2, 2, 3, 1, 1, 2, 2, 3, 1, 1, 1, 1, 3, 3, 3),
id = rep(LETTERS[1:3], c(3, 5, 7)),
v1 = c(8, 2, 7, 3, 2, 9, 5, 10, 4, 5, 3, 9, 8, 5, 4),
v2 = 1
) %>%
`[`(, sum_v2_idT := sum(v2), by = c("id", "t")) %>%
`[`(, n_idT := dim(.SD)[[1]], by = list(t, id)) %>%
`[`(, sum_v2_id := sum(v2), by = .(id)) %>%
`[`(, sum_v1_idT := sum(v1), by = c("id", "t")) %>%
`[`(, sum_v1_id := sum(v1), by = c("id")) %>%
arrange(id, t)
# dt_test is the same as above (but with fixed values),
# the output is correct
Note that this problem disappeared after re-installing package version 1.14.2 with OpenMP.
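The fixed-value table is not quite the same input as the random one: its rows are already ordered by id, and t and v1 are doubles (c(2, 2, ...)) where sample() on 1:3 and 1:10 returns integers. Either difference might explain why it does not reproduce, though that is only a guess. A minimal sketch of keeping the random, unsorted input while still making it repeatable with set.seed() follows; the helper make_dt and the seed value are hypothetical, since the issue does not state which draw produced the bad output.

```r
library(data.table)

# Hypothetical helper: rebuilds the random test table from a given seed.
make_dt <- function(seed) {
  set.seed(seed)
  data.table(
    t  = sample(1:3, size = 15, replace = TRUE),
    id = sample(LETTERS[1:3], size = 15, replace = TRUE),
    v1 = sample(1:10, size = 15, replace = TRUE),
    v2 = 1
  )
}

# Hypothetical seed value; the same seed yields the same rows in the same order.
dt_a <- make_dt(20220301)
dt_b <- make_dt(20220301)
same_input <- isTRUE(all.equal(dt_a, dt_b))
print(same_input)

# Column types stay as in the random example: t and v1 integer, v2 double.
print(sapply(dt_a, class))
```

Trying a handful of seeds and reporting one that triggers the wrong sums would give others an exact input to run.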
Package information:
# install
> require(data.table)
Loading required package: data.table
data.table 1.14.3 IN DEVELOPMENT built 2021-12-21 03:03:48 UTC; root using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
**********
This development version of data.table was built more than 4 weeks ago. Please update: data.table::update.dev.pkg()
**********
> data.table::update.dev.pkg()
R data.table package is up-to-date at eed712ef45fd9198de6aa1ac1b672a7347253d18 (1.14.3)
setDTthreads(1) did not fix this bug. I did not test any functions other than sum.
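For reference, a single-threaded run can be set up roughly as follows. This is a sketch of the kind of check described, not the exact commands that were used; the table dt_thr and the column sum_v1_idT_1thread are hypothetical names.

```r
library(data.table)

# Pin data.table to one thread; setDTthreads() returns the previous setting invisibly.
old_threads <- setDTthreads(1)
threads_now <- getDTthreads()
print(threads_now)

# Hypothetical small table; repeat the grouped assignment while single-threaded.
dt_thr <- data.table(
  t  = c(2L, 1L, 2L, 1L, 3L, 2L),
  id = c("B", "A", "A", "B", "A", "B"),
  v1 = c(8L, 2L, 7L, 3L, 2L, 9L)
)
dt_thr[, sum_v1_idT_1thread := sum(v1), by = c("id", "t")]
print(dt_thr)

# Restore the previous thread setting.
setDTthreads(old_threads)
```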
Finally, for your information:
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.8 magrittr_2.0.2 data.table_1.14.3
loaded via a namespace (and not attached):
[1] tidyr_1.2.0 fansi_1.0.2 assertthat_0.2.1 utf8_1.2.2 crayon_1.5.0
[6] R6_2.5.1 DBI_1.1.2 lifecycle_1.0.1 pillar_1.7.0 rlang_1.0.2
[11] cli_3.2.0 vctrs_0.3.8 generics_0.1.1 ellipsis_0.3.2 tools_4.1.2
[16] glue_1.6.1 purrr_0.3.4 compiler_4.1.2 pkgconfig_2.0.3 tidyselect_1.1.1
[21] tibble_3.1.6