Some functions using by= operator return wrong output in version 1.14.3 #5345

@SeanShao98

Description

Grouped assignment of the form [i, j := sum(...), by=] appears to return wrong results in data.table version 1.14.3 (development build, compiled with OpenMP). In the example below, the sums are not computed within each (id, t) group as expected:

require(data.table)  # provides data.table() and :=
require(magrittr)    # provides %>%
require(dplyr)       # provides arrange()
dt_test <- data.table(
  t  = sample(c(1:3),       size = 15, replace = TRUE),
  id = sample(LETTERS[1:3], size = 15, replace = TRUE),
  v1 = sample(c(1:10),      size = 15, replace = TRUE),
  v2 = 1
) %>%
  `[`(, sum_v2_idT := sum(v2),        by = c("id", "t"), verbose = FALSE) %>%
  `[`(, n_idT      := dim(.SD)[[1]],  by = list(t, id),  verbose = FALSE) %>%
  `[`(, sum_v2_id  := sum(v2),        by = .(id),        verbose = FALSE) %>%
  `[`(, sum_v1_idT := sum(v1),        by = c("id", "t"), verbose = TRUE) %>%
  `[`(, sum_v1_id  := sum(v1),        by = c("id"),      verbose = FALSE) %>%
  arrange(id, t)
#> output
> dt_test
        t     id    v1    v2 sum_v2_idT n_idT sum_v2_id sum_v1_idT sum_v1_id
    <int> <char> <int> <num>      <num> <int>     <num>      <int>     <int>
 1:     2      A     8     1         *1     2         3     (10)*7        17
 2:     2      A     2     1          2     2         3     (10)*5        17
 3:     3      A     7     1          2     1         3      (7)*5        17
 4:     1      B     3     1          3     2         5     (5)*17        29
 5:     1      B     2     1          3     2         5     (5)*17        29
 6:     2      B     9     1          2     2         5         14        29
 7:     2      B     5     1          2     2         5         14        29
 8:     3      B    10     1          1     1         5         10        29
 9:     1      C     4     1          4     4         7         21        38
10:     1      C     5     1          4     4         7         21        38
11:     1      C     3     1         *2     4         7    (21)*10        38
12:     1      C     9     1         *2     4         7    (21)*10        38
13:     3      C     8     1          3     3         7         17        38
14:     3      C     5     1         *4     3         7    (17)*21        38
15:     3      C     4     1         *4     3         7    (17)*21        38

Wrong values in the output are marked with *; where given, the correct value precedes it in parentheses, i.e. (correct)*wrong. If by is given only one column, the result is correct. This is the verbose output of the fourth calculation (sum_v1_idT):

Argument 'by' after substitute: c("id", "t")
Detected that j uses these columns: [v1]
Finding groups using forderv ... forder.c received 15 rows and 2 columns
0.001s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
Getting back original order ... forder.c received a vector type 'integer' length 8
0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'sum(v1)'
GForce optimized j to 'gsum(v1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gsum (narm=FALSE) took ... gather took ... 0.000s
0.000s
gforce eval took 0.000
0.000s elapsed (0.001s cpu) 
Assigning to 15 row subset of 15 rows
RHS_list_of_columns == false
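
The mismatches annotated by hand in the table above can also be flagged programmatically. The following is a minimal sketch in base R, not part of data.table: check_group_sum is a hypothetical helper name, and the data frame simply retypes the t, id, v1 and sum_v1_idT columns from the reported output.

```r
# Hypothetical helper (not part of data.table): recompute a grouped sum with
# base R only and return the row numbers where an existing column disagrees.
check_group_sum <- function(dt, value, result, by) {
  keys  <- lapply(by, function(col) dt[[col]])   # `[[` works for data.frame and data.table
  group <- do.call(paste, c(keys, sep = "\r"))   # one key string per (id, t) combination
  expected <- ave(dt[[value]], group, FUN = sum) # reference group sums
  which(dt[[result]] != expected)
}

# The reported output, retyped (sum_v1_idT holds the values as printed).
reported <- data.frame(
  t  = c(2, 2, 3, 1, 1, 2, 2, 3, 1, 1, 1, 1, 3, 3, 3),
  id = rep(LETTERS[1:3], c(3, 5, 7)),
  v1 = c(8, 2, 7, 3, 2, 9, 5, 10, 4, 5, 3, 9, 8, 5, 4),
  sum_v1_idT = c(7, 5, 5, 17, 17, 14, 14, 10, 21, 21, 10, 10, 17, 21, 21)
)

bad_rows <- check_group_sum(reported, "v1", "sum_v1_idT", c("id", "t"))
print(bad_rows)  # 1 2 3 4 5 11 12 14 15
```

The same call should work on the 130k-row table, since the helper only relies on `[[` column access.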

While trying to write a reproducible example, I found something even odder: when the sample() calls are replaced with the same values written out explicitly, the output is correct for this 15-row data. In my real data (about 130k rows), the problem persists.

dt_test <- data.table(
  t  = c(2, 2, 3, 1, 1, 2, 2, 3, 1, 1, 1, 1, 3, 3, 3),
  id = rep(LETTERS[1:3], c(3, 5, 7)),
  v1 = c(8, 2, 7, 3, 2, 9, 5, 10, 4, 5, 3, 9, 8, 5, 4),
  v2 = 1
) %>%
  `[`(, sum_v2_idT := sum(v2),        by = c("id", "t")) %>%
  `[`(, n_idT      := dim(.SD)[[1]],  by = list(t, id)) %>%
  `[`(, sum_v2_id  := sum(v2),        by = .(id)) %>%
  `[`(, sum_v1_idT := sum(v1),        by = c("id", "t")) %>%
  `[`(, sum_v1_id  := sum(v1),        by = c("id")) %>%
  arrange(id, t)
# dt_test holds the same values as above (but hard-coded),
# and the output is correct
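
The hard-coded table differs from the random one in two ways that may or may not matter; this report does not establish either as the cause. Its numeric columns are double rather than integer, and its rows are already sorted by (id, t) because the values were copied from the arranged output. The sketch below, in base R, checks both points and builds a hard-coded variant that is closer to the random case. The seed and the name shuffled are arbitrary choices for illustration.

```r
fixed <- data.frame(
  t  = c(2, 2, 3, 1, 1, 2, 2, 3, 1, 1, 1, 1, 3, 3, 3),
  id = rep(LETTERS[1:3], c(3, 5, 7)),
  v1 = c(8, 2, 7, 3, 2, 9, 5, 10, 4, 5, 3, 9, 8, 5, 4)
)

# 1. Column types: c(2, 2, 3, ...) is double, sample(1:3, ...) is integer.
print(typeof(fixed$t))
print(typeof(sample(1:3, 15, replace = TRUE)))

# 2. Row order: TRUE means the rows are already sorted by (id, t).
already_sorted <- identical(order(fixed$id, fixed$t), seq_len(nrow(fixed)))
print(already_sorted)

# A variant closer to the random input: integer columns, rows in shuffled order.
set.seed(42)  # arbitrary seed so the shuffle can be repeated
shuffled    <- fixed[sample(nrow(fixed)), ]
shuffled$t  <- as.integer(shuffled$t)
shuffled$v1 <- as.integer(shuffled$v1)
rownames(shuffled) <- NULL
```

Passing shuffled through data.table::as.data.table() and rerunning the grouped assignments would show whether a fixed but unsorted integer table behaves like the random one.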

Note that the problem disappeared after re-installing version 1.14.2 of the package, also with OpenMP.

Package information:

# install
> require(data.table)
Loading required package: data.table
data.table 1.14.3 IN DEVELOPMENT built 2021-12-21 03:03:48 UTC; root using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
**********
This development version of data.table was built more than 4 weeks ago. Please update: data.table::update.dev.pkg()
**********
> data.table::update.dev.pkg()
R data.table package is up-to-date at eed712ef45fd9198de6aa1ac1b672a7347253d18 (1.14.3)

setDTthreads(1) did not fix this bug. I did not test any functions other than sum.
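
Since the verbose output above reports "GForce optimized j to 'gsum(v1)'", one further check would be to repeat the grouped assignment with GForce switched off and compare. The sketch below assumes the documented datatable.optimize option, where level 1 keeps basic optimizations but disables GForce. The seed and variable names are arbitrary, and whether the two results differ on the affected build is not something this report confirms.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the draw can be repeated
dt <- data.table(
  t  = sample(1:3,          size = 15, replace = TRUE),
  id = sample(LETTERS[1:3], size = 15, replace = TRUE),
  v1 = sample(1:10,         size = 15, replace = TRUE)
)

# Grouped sum with GForce disabled (optimization level 1).
old <- options(datatable.optimize = 1L)
no_gforce <- copy(dt)[, s := sum(v1), by = c("id", "t")]$s
options(old)  # restore the previous optimization level

# Grouped sum at the default optimization level (GForce allowed).
with_gforce <- copy(dt)[, s := sum(v1), by = c("id", "t")]$s

# FALSE here would point at the GForce code path.
print(identical(no_gforce, with_gforce))
```

copy() is used so that each run assigns into a fresh table rather than modifying dt by reference.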

Finally, for your information:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.8       magrittr_2.0.2    data.table_1.14.3

loaded via a namespace (and not attached):
 [1] tidyr_1.2.0      fansi_1.0.2      assertthat_0.2.1 utf8_1.2.2       crayon_1.5.0    
 [6] R6_2.5.1         DBI_1.1.2        lifecycle_1.0.1  pillar_1.7.0     rlang_1.0.2     
[11] cli_3.2.0        vctrs_0.3.8      generics_0.1.1   ellipsis_0.3.2   tools_4.1.2     
[16] glue_1.6.1       purrr_0.3.4      compiler_4.1.2   pkgconfig_2.0.3  tidyselect_1.1.1
[21] tibble_3.1.6    
