Problematic behaviour when using := and by in dev version only #5307

@clerousset

Hi,

I'm using the dev version:

> data.table::update.dev.pkg()
R data.table package is up-to-date at eed712ef45fd9198de6aa1ac1b672a7347253d18 (1.14.3)

I'm on the dev version because I can't wait for some of the new features (especially the optimized shift with by and the great new env argument!).
I ran into a problematic behaviour when := is combined with by:

> dt <- data.table(by1 = c("a","a","b","b"), by2 = c("c","d","c","d"), value=c("ac","ad","bc","bd"))
> dt[,same_value:=value[1], .(by1, by2)][]
Argument 'by' after substitute: .(by1, by2)
Detected that j uses these columns: [value]
Finding groups using forderv ... forder.c received 4 rows and 2 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'value[1]'
GForce optimized j to '`g[`(value, 1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
gforce eval took 0.000
0.000s elapsed (0.000s cpu) 
Assigning to all 4 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==4; length(cols)==1)
      by1    by2  value same_value
   <char> <char> <char>     <char>
1:      a      c     ac         ac
2:      a      d     ad         ad
3:      b      c     bc         bc
4:      b      d     bd         bd
> dt[,same_value:=value[1], .(by2, by1)][]
Argument 'by' after substitute: .(by2, by1)
Detected that j uses these columns: [same_value, value]
Finding groups using forderv ... forder.c received 4 rows and 2 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
Getting back original order ... forder.c received a vector type 'integer' length 4
0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'value[1]'
GForce optimized j to '`g[`(value, 1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.001
gforce eval took 0.000
0.000s elapsed (0.000s cpu) 
Assigning to 4 row subset of 4 rows
RHS_list_of_columns == false
      by1    by2  value same_value
   <char> <char> <char>     <char>
1:      a      c     ac         ac
2:      a      d     ad         bc
3:      b      c     bc         ad
4:      b      d     bd         bd

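Each (by1, by2) group contains exactly one row, so same_value should equal value whatever the order of the by columns. In the second call, with by = .(by2, by1), the values for rows 2 and 3 are swapped. Version 1.14.2 gives the correct result in both cases.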
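The expectation above can be turned into a small self-check. This is only a sketch: `check_by_order` is a hypothetical helper name, not part of data.table, and running it with `options(datatable.optimize = 1L)` (which turns GForce off) is an assumption about where the problem lies, based on the verbose log reporting "GForce TRUE" for both calls. It has not been confirmed as the cause.

```r
library(data.table)

dt <- data.table(by1 = c("a", "a", "b", "b"),
                 by2 = c("c", "d", "c", "d"),
                 value = c("ac", "ad", "bc", "bd"))

# Hypothetical helper: repeats the two grouped assignments from the report
# on a copy and checks whether same_value still matches value afterwards.
check_by_order <- function(dt) {
  x <- copy(dt)                                   # leave the caller's table untouched
  x[, same_value := value[1], by = .(by1, by2)]   # first call, original by order
  x[, same_value := value[1], by = .(by2, by1)]   # second call, by columns swapped
  identical(x$same_value, x$value)                # should be TRUE on a correct build
}

# Default optimization level (GForce enabled)
check_by_order(dt)

# Assumption: lowering the optimization level disables GForce for this call
old <- options(datatable.optimize = 1L)
check_by_order(dt)
options(old)                                      # restore the previous setting
```

If the first call returns FALSE and the second returns TRUE, that would point at the GForce path for grouped := rather than at the grouping itself.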

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0   

Labels: GForce (issues relating to optimized grouping calculations), High, dev
