
`on` performing slower than double `setkey` #1232

@MichaelChirico

Description

I recently integrated the new `on` functionality into some code of mine that was being dragged down by repetitive key switching (here for some context), so I was hoping `on` would speed things up. I was quite surprised to find that the code actually ran about 30% slower with `on` (45 minutes instead of 35).
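
For context, here are the two idioms side by side, as a minimal sketch on toy tables (the names dt1/dt2 are just for illustration):

library(data.table)
dt1 <- data.table(x = c("a", "b", "c"), v = 1:3)
dt2 <- data.table(x = c("b", "c"), w = 4:5)

# keyed join: physically sort both tables by x first, then join
setkey(dt1, x)
setkey(dt2, x)
dt1[dt2]

# ad hoc join: no setkey needed, the join column is passed via on=
dt1[dt2, on = "x"]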

I was able to reproduce this using large data.tables beefed up from @jangorecki's `join_on` tests:

library(data.table)
library(microbenchmark)  # for get_nanotime()

nn <- 1e6  # rows in DT1
mm <- 1e2  # rows in DT2

times <- 50L

set.seed(45L)
DT1 <- data.table(x=sample(letters[1:3], nn, TRUE), y=sample(6:10, nn, TRUE),
                  a=sample(100, nn, TRUE), b=runif(nn))
DT2 <- CJ(x=letters[1:3], y=6:10)[, mul := sample(20, 15)][sample(15L, mm, TRUE)]

times2 <- times1 <- numeric(times)
for (ii in 1:times) {
  # time the ad hoc join via on= (fresh, unkeyed copies each iteration)
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  cp1[cp2, on="x", allow.cartesian=TRUE]
  stp <- get_nanotime()
  times1[ii] <- stp - strt

  # time setkey on both tables plus the keyed join
  cp1 <- copy(DT1); cp2 <- copy(DT2)
  strt <- get_nanotime()
  setkey(cp1, x)[setkey(cp2, x), allow.cartesian=TRUE]
  stp <- get_nanotime()
  times2[ii] <- stp - strt
}
> median(times1)/median(times2)
[1] 1.274535

So about 27% slower here. Maybe I'm not understanding the purpose of `on`, but I thought the double-`setkey` approach should basically be an upper bound on how long `on` takes, since `on` skips physically reordering both tables.
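
One way to dig into where the extra time goes (a sketch; verbose = TRUE makes data.table print timings of the internal join steps, which should show whether the overhead is in computing the join order for i or in the merge itself):

# diagnostic only, not part of the benchmark loop above
cp1 <- copy(DT1); cp2 <- copy(DT2)
cp1[cp2, on = "x", allow.cartesian = TRUE, verbose = TRUE]

# and the keyed equivalent for comparison
cp1 <- copy(DT1); cp2 <- copy(DT2)
setkey(cp1, x)
setkey(cp2, x)
cp1[cp2, allow.cartesian = TRUE, verbose = TRUE]

And indeed, `on` is faster when the tables are smaller: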

nn <- 1e3

> median(times1)/median(times2)
[1] 0.9491699

So, roughly 5% faster when DT1 is smaller.

nn <- 1e6; mm <- 5
> median(times1)/median(times2)
[1] 0.9394226

And roughly 6% faster when `DT2` is smaller.
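
For what it's worth, the same comparison can be run through microbenchmark() directly (a sketch; the copy() calls fall inside the timed expressions here, so absolute times are inflated for both, but the relative on-vs-setkey comparison is what matters):

library(microbenchmark)
microbenchmark(
  on     = copy(DT1)[copy(DT2), on = "x", allow.cartesian = TRUE],
  setkey = setkey(copy(DT1), x)[setkey(copy(DT2), x), allow.cartesian = TRUE],
  times = 50L
)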
