Skip to content

rolling funs / shift could support logical window #3241

@jangorecki

Description

@jangorecki

I am filling this issue as a placeholder to evaluate users demand for such feature, at present there are no plans for incorporating it, so if you would need it be sure to upvote.
Extension of #2778.

Rolling functions and shift has been implemented to operate on physical order of data, which means that they do not handle "gaps" in, for example, time/date fields. If one wants to shift an IDate type vector by one day, one has to ensure that every single day is included in vector. If it isn't then one has to expand vector (or eventually a data.table) and perform shift afterwards. This can be flexibly and time efficiently solved using "rolling join" but the problem is memory consumption, especially for very sparse data. In an ideal world we would prefer to isolate roll functionality of rolling joins into helper function and re-use it in those cases.
Some examples of expected output for input x:

library(data.table)
id = c(0L,1L,2L,5L,6L,8L)
x = data.table(date=as.IDate(id), value=c(1,2,3,4,5,6))
x
#         date value
#1: 1970-01-01     1
#2: 1970-01-02     2
#3: 1970-01-03     3
#4: 1970-01-06     4
#5: 1970-01-07     5
#6: 1970-01-09     6

## shift value by 1 date
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,NA,4,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31      NA
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05      NA
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08      NA

## shift value by 1 date locf
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(NA,1,2,3,4,5)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31      NA
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05       3
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08       5

## shift value by 1 date nocb
cbind(x, data.table(s_date=as.IDate(id-1), s_value=c(1,1,2,4,4,6)))
#         date value     s_date s_value
#1: 1970-01-01     1 1969-12-31       1
#2: 1970-01-02     2 1970-01-01       1
#3: 1970-01-03     3 1970-01-02       2
#4: 1970-01-06     4 1970-01-05       4
#5: 1970-01-07     5 1970-01-06       4
#6: 1970-01-09     6 1970-01-08       6

## shift value by -1 date
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,NA,5,NA,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04      NA
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08      NA
#6: 1970-01-09     6 1970-01-10      NA

## shift value by -1 date locf
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,3,5,5,6)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04       3
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08       5
#6: 1970-01-09     6 1970-01-10       6

## shift value by -1 date nocb
cbind(x, data.table(s_date=as.IDate(id+1), s_value=c(2,3,4,5,6,NA)))
#         date value     s_date s_value
#1: 1970-01-01     1 1970-01-02       2
#2: 1970-01-02     2 1970-01-03       3
#3: 1970-01-03     3 1970-01-04       4
#4: 1970-01-06     4 1970-01-07       5
#5: 1970-01-07     5 1970-01-08       6
#6: 1970-01-09     6 1970-01-10      NA

## rollsum value by 3 date
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(NA,NA,4),sum(NA,4,5),sum(5,NA,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01      NA
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02      NA
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      NA
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      NA
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      NA

## rollsum value by 3 date locf
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(NA,NA,1),sum(NA,1,2),sum(1,2,3),sum(3,3,4),sum(3,4,5),sum(5,5,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01      NA
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02      NA
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      10
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      12
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      16

## rollsum value by 3 date nocb
cbind(x, data.table(w_date=sapply(as.IDate(id), function(x) paste(x+((-2):0), collapse=",")), w_value=c(sum(1,1,1),sum(1,1,2),sum(1,2,3),sum(4,4,4),sum(4,4,5),sum(5,6,6))))
#         date value                           w_date w_value
#1: 1970-01-01     1 1969-12-30,1969-12-31,1970-01-01       3
#2: 1970-01-02     2 1969-12-31,1970-01-01,1970-01-02       4
#3: 1970-01-03     3 1970-01-01,1970-01-02,1970-01-03       6
#4: 1970-01-06     4 1970-01-04,1970-01-05,1970-01-06      12
#5: 1970-01-07     5 1970-01-05,1970-01-06,1970-01-07      13
#6: 1970-01-09     6 1970-01-07,1970-01-08,1970-01-09      17

Related issue tagged as data.table: https://stackoverflow.com/questions/33553230/calculate-moving-average-every-n-hours


Worth to note that pandas, as of 0.23.4, do support rolling functions by logical order when window argument received offset instead of int: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions