0% found this document useful (0 votes)
1K views1 page

Datatable Cheat Sheet R

Data table is an R package that provides a high-performance version of data.frame. It allows for fast aggregation, joining, and updating of large data sets. Key features include: - Subsetting rows using i and calculating columns grouped by by for fast aggregation - Adding or updating columns by reference using := for efficient programming - Using .SD to work with subsets of the data within each group for aggregation operations - Chaining operations together for concise code

Uploaded by

loshude
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1K views1 page

Datatable Cheat Sheet R

Data table is an R package that provides a high-performance version of data.frame. It allows for fast aggregation, joining, and updating of large data sets. Key features include: - Subsetting rows using i and calculating columns grouped by by for fast aggregation - Adding or updating columns by reference using := for efficient programming - Using .SD to work with subsets of the data within each group for aggregation operations - Chaining operations together for concise code

Uploaded by

loshude
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 1

R For Data Science Cheat Sheet General form: DT[i, j, by] Advanced Data Table Operations

> DT[.N-1]
data.table Take DT, subset rows using i, then calculate j grouped by by > DT[,.N]
Return the penultimate row of the DT
Return the number of rows
> DT[,.(V2,V3)] Return V2 and V3 as a data.table
Learn R for data science Interactively at www.DataCamp.com
Adding/Updating Columns By Reference in j Using := >
>
DT[,list(V2,V3)]
DT[,mean(V3),by=.(V1,V2)]
Return V2 and V3 as a data.table
Return the result of j, grouped by all possible
> DT[,V1:=round(exp(V1),2)] V1 is updated by what is after := V1 V2 V1 combinations of groups specified in by
> DT Return the result by calling DT 1: 1 A 0.4053
2: 1 B 0.4053
V1 V2 V3 V4
data.table 1: 2.72 A -0.1107 1
2: 7.39 B -0.1427 2
3:
4:
1 C 0.4053
2 A -0.6443
5: 2 B -0.6443
data.table is an R package that provides a high-performance 3: 2.72 C -1.8893 3 6: 2 C -0.6443
4: 7.39 A -0.3571 4
version of base Rs data.frame with syntax and feature ... .SD & .SDcols
enhancements for ease of use, convenience and > DT[,c("V1","V2"):=list(round(exp(V1),2), Columns V1 and V2 are updated by
> DT[,print(.SD),by=V2] Look at what .SD contains
LETTERS[4:6])] what is after :=
programming speed. > DT[,':='(V1=round(exp(V1),2), Alternative to the above one. With [], > DT[,.SD[c(1,.N)],by=V2] Select the first and last row grouped by V2
V2=LETTERS[4:6])][] you print the result to the screen > DT[,lapply(.SD,sum),by=V2] Calculate sum of columns in .SD grouped by
Load the package: V1 V2 V3 V4
V2
1: 15.18 D -0.1107 1 > DT[,lapply(.SD,sum),by=V2, Calculate sum of V3 and V4 in .SD grouped by
> library(data.table) .SDcols=c("V3","V4")] V2
2: 1619.71 E -0.1427 2
V2 V3 V4

Creating A data.table
3: 15.18 F -1.8893 3
1: A -0.478 22
4: 1619.71 D -0.3571 4 2: B -0.478 26

> set.seed(45L) Create a data.table > DT[,V1:=NULL] Remove V1 3: C -0.478 30


> DT[,c("V1","V2"):=NULL] Remove columns V1 and V2 > DT[,lapply(.SD,sum),by=V2, Calculate sum of V3 and V4 in .SD grouped by
> DT <- data.table(V1=c(1L,2L), and call it DT .SDcols=paste0("V",3:4)] V2
V2=LETTERS[1:3], > Cols.chosen=c("A","B")
V3=round(rnorm(4),4), > DT[,Cols.Chosen:=NULL] Delete the column with column name
V4=1:12)
> DT[,(Cols.Chosen):=NULL]
Cols.chosen
Delete the columns specified in the Chaining
variable Cols.chosen
Subsetting Rows Using i > DT <- DT[,.(V4.SUM=sum(V4)),
by=V1]
Calculate sum of V4, grouped by V1

> DT[3:5,] Select 3rd to 5th row Indexing And Keys 1:


V1 V4.Sum
1 36
> DT[3:5] Select 3rd to 5th row
2: 2 42
> DT[V2=="A"] Select all rows that have value A in column V2 > setkey(DT,V2) A key is set on V2; output is returned invisibly
> DT["A"] Return all rows where the key column (set to V2) has > DT[V4.Sum>40] Select that group of which the sum is >40
> DT[V2 %in% c("A","C")] Select all rows that have value A or C in column V2
V1 V2 V3 V4 the value A > DT[,.(V4.Sum=sum(V4)), Select that group of which the sum is >40
by=V1][V4.Sum>40] (chaining)
Manipulating on Columns in j
1: 1 A -0.2392 1
2: 2 A -1.6148 4 V1 V4.Sum
3: 1 A 1.0498 7 1: 2 42
> DT[,V2] Return V2 as a vector 4: 2 A 0.3262 10 2: 1 36
[1] A B C A B C ... > DT[c("A","C")] Return all rows where the key column (V2) has value A or C > DT[,.(V4.Sum=sum(V4)), Calculate sum of V4, grouped by V1,
> DT[,.(V2,V3)] Return V2 and V3 as a data.table > DT["A",mult="first"] Return first row of all rows that match value A in key by=V1][order(-V1)] ordered on V1
> DT[,sum(V1)] Return the sum of all elements of V1 in a column V2 V1 V4.Sum
[1] 18 vector > DT["A",mult="last"] Return last row of all rows that match value A in key 1: 2 42
> DT[,.(sum(V1),sd(V3))] Return the sum of all elements of V1 and the column V2
2: 1 36
V1 V2 std. dev. of V3 in a data.table > DT[c("A","D")] Return all rows where key column V2 has value A or D
1: 18 0.4546055
> DT[,.(Aggregate=sum(V1), The same as the above, with new names
V1 V2 V3 V4
1: 1 A -0.2392 1 set()-Family
Sd.V3=sd(V3))] 2: 2 A -1.6148 4
Aggregate Sd.V3 3: 1 A 1.0498 7 set()
1: 18 0.4546055 4: 2 A 0.3262 10
> DT[,.(V1,Sd.V3=sd(V3))] Select column V2 and compute std. dev. of V3, 5: NA D NA NA Syntax: for (i in from:to) set(DT, row, column, new value)
which returns a single value and gets recycled > DT[c("A","D"),nomatch=0] Return all rows where key column V2 has value A or D > rows <- list(3:4,5:6)
V1 V2 V3 V4
> DT[,.(print(V2), Print column V2 and plot V3 > cols <- 1:2
1: 1 A -0.2392 1
plot(V3), > for(i in seq_along(rows)) Sequence along the values of rows, and
2: 2 A -1.6148 4
NULL)] {set(DT, for the values of cols, set the values of
3: 1 A 1.0498 7
4: 2 A 0.3262 10 i=rows[[i]], those elements equal to NA (invisible)
j=cols[i],
Doing j by Group > DT[c("A","C"),sum(V4)] Return total sum of V4, for rows of key column V2 that
have values A or C value=NA)}
> DT[,.(V4.Sum=sum(V4)),by=V1] Calculate sum of V4 for every group in V1
V1 V4.Sum
> DT[c("A","C"),
sum(V4),
Return sum of column V4 for rows of V2 that have value A,
and anohter sum for rows of V2 that have value C setnames()
1: 1 36 by=.EACHI] Syntax: setnames(DT,"old","new")[]
2: 2 42 V2 V1
1: A 22 > setnames(DT,"V2","Rating") Set name of V2 to Rating (invisible)
> DT[,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in V1 Change 2 column names (invisible)
by=.(V1,V2)] and V2 2: C 30 > setnames(DT,
> DT[,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in > setkey(DT,V1,V2) Sort by V1 and then by V2 within each group of V1 (invisible) c("V2","V3"),
by=sign(V1-1)] sign(V1-1) > DT[.(2,"C")] Select rows that have value 2 for the first key (V1) and the c("V2.rating","V3.DC"))
value C for the second key (V2)
setnames()
V1 V2 V3 V4
sign V4.Sum
1: 0 36 1: 2 C 0.3262 6
2: 1 42 2: 2 C -1.6148 12
Syntax: setcolorder(DT,"neworder")
> DT[,.(V4.Sum=sum(V4)), The same as the above, with new name > DT[.(2,c("A","C"))] Select rows that have value 2 for the first key (V1) and within
V1 V2 V3 V4 those rows the value A or C for the second key (V2) > setcolorder(DT, Change column ordering to contents
by=.(V1.01=sign(V1-1))] for the variable youre grouping by
> DT[1:5,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in V1 1: 2 A -1.6148 4 c("V2","V1","V4","V3")) of the specified vector (invisible)
2: 2 A 0.3262 10
by=V1] after subsetting on the first 5 rows
3: 2 C 0.3262 6
> DT[,.N,by=V1] Count number of rows for every group in
4: 2 C -1.6148 12
DataCamp
V1 Learn Python for Data Science Interactively

You might also like