
Weird inconsistency between sort order, GForce (I think), and lots of by categories in 1.14.3 #5326

@dcaseykc

I don't fully understand what's going on, but it seems that the sort order of the table affects the result of a by calculation (sum, which I assume is being GForce-optimized). The answers also appear to be simply wrong: t1 and t2 are both computed by id2, so there should be only one unique value per id2, and that is not the case.

library('data.table')
set.seed(10)
n = 100000
a = data.table(id1 = 1:n, id2 = sample(1:900,n,replace = T), flag = sample(c(0,0,0,1),n, replace = T))
b = copy(a)
#shuffle
a = a[sample(seq_len(nrow(a)), nrow(a))]
a[, t1 := sum(flag, na.rm = T), id2]
setorder(a,id1)
a[, t2 := sum(flag, na.rm = T), id2]
any(a[,t1!=t2])
#> [1] TRUE
any(a[, length(unique(t1))>1, id2]$V1)
#> [1] TRUE
any(a[, length(unique(t2))>1, id2]$V1)
#> [1] TRUE
# Without GForce optimization: sum2 is an alias of sum, so GForce should not kick in
a = copy(b)
# shuffle
sum2 = sum
a = a[sample(seq_len(nrow(a)), nrow(a))]
a[, t1 := sum2(flag, na.rm = T), id2]
setorder(a,id1)
a[, t2 := sum2(flag, na.rm = T), id2]
any(a[,t1!=t2])
#> [1] FALSE
any(a[, length(unique(t1))>1, id2]$V1)
#> [1] FALSE
any(a[, length(unique(t2))>1, id2]$V1)
#> [1] FALSE
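One more way to check whether GForce is the culprit, without aliasing sum: lower the documented datatable.optimize option below 2, which should turn GForce off while leaving the call to sum itself unchanged. This is only a sketch of that check, not a confirmed diagnosis. The names old, mismatch, ref, and matches_reference are mine, and the tapply() comparison is just an independent base-R reference for the per-id2 sums.

```r
library(data.table)

# Assumption: GForce is only applied at datatable.optimize >= 2,
# so level 1 keeps the same sum() call but skips the GForce path.
old = options(datatable.optimize = 1L)

set.seed(10)
n = 100000
a = data.table(id1 = 1:n,
               id2 = sample(1:900, n, replace = TRUE),
               flag = sample(c(0, 0, 0, 1), n, replace = TRUE))

# shuffle, compute by group, reorder, compute again
a = a[sample(nrow(a))]
a[, t1 := sum(flag, na.rm = TRUE), by = id2]
setorder(a, id1)
a[, t2 := sum(flag, na.rm = TRUE), by = id2]

# TRUE here would mean sort order still changes the result
mismatch = a[, any(t1 != t2)]

# Independent reference: per-id2 sums computed with base R
ref = tapply(a$flag, a$id2, sum)
matches_reference = all(a$t2 == ref[as.character(a$id2)])

# restore the previous setting
options(old)

print(mismatch)
print(matches_reference)
```

If mismatch is FALSE and matches_reference is TRUE at level 1, but the original example still fails at the default level, that would point at the GForce path rather than at sum or at the grouping itself.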

Session info:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13 knitr_1.34      magrittr_2.0.1  R6_2.5.1        rlang_0.4.11    fastmap_1.1.0  
 [7] fansi_0.4.2     highr_0.9       tools_4.1.2     xfun_0.23       utf8_1.2.1      cli_3.1.0      
[13] clipr_0.7.1     withr_2.4.2     htmltools_0.5.2 ellipsis_0.3.2  yaml_2.2.1      digest_0.6.27  
[19] tibble_3.1.2    lifecycle_1.0.0 crayon_1.4.1    processx_3.5.2  callr_3.7.0     ps_1.6.0       
[25] vctrs_0.3.8     fs_1.5.0        glue_1.4.2      evaluate_0.14   rmarkdown_2.11  reprex_2.0.1   
[31] compiler_4.1.2  pillar_1.6.1    pkgconfig_2.0.3

Labels: GForce (issues relating to optimized grouping calculations), dev
