-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
I am filling this issue as a placeholder to evaluate users demand for such feature, at present there are no plans for incorporating it, so if you would need it be sure to upvote.
Extension of #2778.
Rolling functions and shift has been implemented to operate on physical order of data, which means that they do not handle "gaps" in, for example, time/date fields. If one wants to shift an IDate type vector by one day, one has to ensure that every single day is included in vector. If it isn't then one has to expand vector (or eventually a data.table) and perform shift afterwards. This can be flexibly and time efficiently solved using "rolling join" but the problem is memory consumption, especially for very sparse data. In an ideal world we would prefer to isolate roll functionality of rolling joins into helper function and re-use it in those cases.
Some examples of expected output for input x:
library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x
# date value
#1: 1970-01-01 1
#2: 1970-01-02 2
#3: 1970-01-03 3
#4: 1970-01-06 4
#5: 1970-01-07 5
#6: 1970-01-09 6
## shift value by 1 date
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,NA,4,NA)))
# date value s_date s_value
#1: 1970-01-01 1 1969-12-31 NA
#2: 1970-01-02 2 1970-01-01 1
#3: 1970-01-03 3 1970-01-02 2
#4: 1970-01-06 4 1970-01-05 NA
#5: 1970-01-07 5 1970-01-06 4
#6: 1970-01-09 6 1970-01-08 NA
## shift value by 1 date locf
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,3,4,5)))
# date value s_date s_value
#1: 1970-01-01 1 1969-12-31 NA
#2: 1970-01-02 2 1970-01-01 1
#3: 1970-01-03 3 1970-01-02 2
#4: 1970-01-06 4 1970-01-05 3
#5: 1970-01-07 5 1970-01-06 4
#6: 1970-01-09 6 1970-01-08 5
## shift value by 1 date nocb
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(1,1,2,4,4,6)))
# date value s_date s_value
#1: 1970-01-01 1 1969-12-31 1
#2: 1970-01-02 2 1970-01-01 1
#3: 1970-01-03 3 1970-01-02 2
#4: 1970-01-06 4 1970-01-05 4
#5: 1970-01-07 5 1970-01-06 4
#6: 1970-01-09 6 1970-01-08 6
## shift value by -1 date
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,NA,5,NA,NA)))
# date value s_date s_value
#1: 1970-01-01 1 1970-01-02 2
#2: 1970-01-02 2 1970-01-03 3
#3: 1970-01-03 3 1970-01-04 NA
#4: 1970-01-06 4 1970-01-07 5
#5: 1970-01-07 5 1970-01-08 NA
#6: 1970-01-09 6 1970-01-10 NA
## shift value by -1 date locf
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,3,5,5,6)))
# date value s_date s_value
#1: 1970-01-01 1 1970-01-02 2
#2: 1970-01-02 2 1970-01-03 3
#3: 1970-01-03 3 1970-01-04 3
#4: 1970-01-06 4 1970-01-07 5
#5: 1970-01-07 5 1970-01-08 5
#6: 1970-01-09 6 1970-01-10 6
## shift value by -1 date nocb
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,4,5,6,NA)))
# date value s_date s_value
#1: 1970-01-01 1 1970-01-02 2
#2: 1970-01-02 2 1970-01-03 3
#3: 1970-01-03 3 1970-01-04 4
#4: 1970-01-06 4 1970-01-07 5
#5: 1970-01-07 5 1970-01-08 6
#6: 1970-01-09 6 1970-01-10 NA
## rollsum value by 3 date
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(NA,NA,4),sum(NA,4,5),sum(5,NA,6))))
# date value w_date w_value
#1: 1970-01-01 1 1969-12-30,1969-12-31,1970-01-01 NA
#2: 1970-01-02 2 1969-12-31,1970-01-01,1970-01-02 NA
#3: 1970-01-03 3 1970-01-01,1970-01-02,1970-01-03 6
#4: 1970-01-06 4 1970-01-04,1970-01-05,1970-01-06 NA
#5: 1970-01-07 5 1970-01-05,1970-01-06,1970-01-07 NA
#6: 1970-01-09 6 1970-01-07,1970-01-08,1970-01-09 NA
## rollsum value by 3 date locf
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(3,3,4),sum(3,4,5),sum(5,5,6))))
# date value w_date w_value
#1: 1970-01-01 1 1969-12-30,1969-12-31,1970-01-01 NA
#2: 1970-01-02 2 1969-12-31,1970-01-01,1970-01-02 NA
#3: 1970-01-03 3 1970-01-01,1970-01-02,1970-01-03 6
#4: 1970-01-06 4 1970-01-04,1970-01-05,1970-01-06 10
#5: 1970-01-07 5 1970-01-05,1970-01-06,1970-01-07 12
#6: 1970-01-09 6 1970-01-07,1970-01-08,1970-01-09 16
## rollsum value by 3 date nocb
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(1,1,1),sum(1,1,2),sum(1,2,3),sum(4,4,4),sum(4,4,5),sum(5,6,6))))
# date value w_date w_value
#1: 1970-01-01 1 1969-12-30,1969-12-31,1970-01-01 3
#2: 1970-01-02 2 1969-12-31,1970-01-01,1970-01-02 4
#3: 1970-01-03 3 1970-01-01,1970-01-02,1970-01-03 6
#4: 1970-01-06 4 1970-01-04,1970-01-05,1970-01-06 12
#5: 1970-01-07 5 1970-01-05,1970-01-06,1970-01-07 13
#6: 1970-01-09 6 1970-01-07,1970-01-08,1970-01-09 17Related issue tagged as data.table: https://stackoverflow.com/questions/33553230/calculate-moving-average-every-n-hours
Worth to note that pandas, as of 0.23.4, do support rolling functions by logical order when window argument received offset instead of int: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0