0% found this document useful (0 votes)

11 views9 pages

Datatable Intro

This document introduces the data.table package in R, focusing on its key features such as keys, fast grouping, and time series joins. It highlights the differences between data.frames and data.tables, emphasizing the need for users to learn and utilize data.table's features effectively to achieve performance benefits. The document includes examples of creating data.tables, setting keys, and performing operations that demonstrate the speed advantages of using data.table over traditional data.frames.

Uploaded by

markim.gov

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views9 pages

Datatable Intro

Uploaded by

markim.gov

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Introduction to the data.

table package in R
Matthew Dowle
Revised: June 26, 2014
(A later revision may be available on the homepage)

Introduction
This vignette is aimed at those who are already familiar with R—in particular, creating and using
objects of class [Link]. We aim for this quick introduction to be readable in 10 minutes,
covering the main features in brief, namely: 1. Keys; 2. Fast Grouping; and 3. Fast time series join.
For the context in which this document sits, please briefly check the last section, Further Resources.
[Link] is not automatically better or faster. The user has to climb a short learning curve,
experiment, and then use its features well. For example, this document explains the difference
between a vector scan and a binary search. Although both extraction methods are available in
[Link], if a user continues to use vector scans (as in a [Link]), it will ‘work’, but one
will miss out on the benefits that [Link] provides.

Creation
Recall that we create a [Link] using the function [Link]():
> DF = [Link](x=c("b","b","b","a","a"),v=rnorm(5))
> DF

x v
1 b 1.4129672
2 b 0.6255852
3 b 1.0321743
4 a -0.2482623
5 a 2.1497087

A [Link] is created in exactly the same way:

> DT = [Link](x=c("b","b","b","a","a"),v=rnorm(5))
> DT

x v
1: b 0.38350043
2: b -0.04438885
3: b 1.37531316
4: a -1.09037378
5: a 2.48324556

Observe that a [Link] prints the row numbers slightly differently. There is nothing significant
about that. We can easily convert existing [Link] objects to [Link].
> CARS = [Link](cars)
> head(CARS)

1
speed dist
1: 4 2
2: 4 10
3: 7 4
4: 7 22
5: 8 16
6: 9 10

We have just created two [Link]: DT and CARS. It is often useful to see a list of all
[Link] in memory:

> tables()

NAME NROW NCOL MB COLS KEY

[1,] CARS 50 2 1 speed,dist
[2,] DT 5 2 1 x,v
Total: 2MB

The MB column is useful to quickly assess memory use and to spot if any redundant tables can
be removed to free up memory. Just like [Link], [Link] must fit inside RAM.
Some users regularly work with 20 or more tables in memory, rather like a database. The result
of tables() is itself a [Link], returned silently, so that tables() can be used in programs.
tables() is unrelated to the base function table().
To see the column types1 :

> sapply(DT,class)

x v
"character" "numeric"

You may have noticed the empty column KEY in the result of tables() earlier above. This is
the subject of the next section, the first of the 3 main features of the package.

1. Keys
Let’s start by considering [Link], specifically rownames (or in English, row names). That is,
the multiple names belonging to a single row. The multiple names belonging to the single row?
That is not what we are used to in a [Link]. We know that each row has at most one name.
A person has at least two names, a first name and a second name. That is useful to organise a
telephone directory, for example, which is sorted by surname, then first name. However, each row
in a [Link] can only have one name.
A key consists of one or more columns of rownames, which may be integer, factor, character or
some other class, not simply character. Furthermore, the rows are sorted by the key. Therefore, a
[Link] can have at most one key, because it cannot be sorted in more than one way.
Uniqueness is not enforced, i.e., duplicate key values are allowed. Since the rows are sorted by
the key, any duplicates in the key will appear consecutively.
Let’s remind ourselves of our tables:
> tables()

NAME NROW NCOL MB COLS KEY

[1,] CARS 50 2 1 speed,dist
[2,] DT 5 2 1 x,v
Total: 2MB

> DT
1 As from v1.8.0, [Link]() no longer converts character to factor.

2
x v
1: b 0.38350043
2: b -0.04438885
3: b 1.37531316
4: a -1.09037378
5: a 2.48324556

No keys have been set yet. We can use [Link] syntax in a [Link], too.

> DT[2,]

x v
1: b -0.04438885

> DT[DT$x=="b",]

x v
1: b 0.38350043
2: b -0.04438885
3: b 1.37531316

But since there are no rownames, the following does not work:
> cat(try(DT["b",],silent=TRUE))

Error in `[.[Link]`(DT, "b", ) :

When i is a [Link] (or character vector), x must be keyed (i.e. sorted, and, marked as sorte

The error message tells us we need to use setkey():

> setkey(DT,x)
> DT

x v
1: a -1.09037378
2: a 2.48324556
3: b 0.38350043
4: b -0.04438885
5: b 1.37531316

Notice that the rows in DT have been re-ordered according to the values of x. The two "a" rows
have moved to the top. We can confirm that DT does indeed have a key using haskey(), key(),
attributes(), or just running tables().

> tables()

NAME NROW NCOL MB COLS KEY

[1,] CARS 50 2 1 speed,dist
[2,] DT 5 2 1 x,v x
Total: 2MB

Now that we are sure DT has a key, let’s try again:

> DT["b",]

x v
1: b 0.38350043
2: b -0.04438885
3: b 1.37531316

3
By default all the rows in the group are returned2 . The mult argument (short for multiple)
allows the first or last row of the group to be returned instead.

> DT["b",mult="first"]

x v
1: b 0.3835004

> DT["b",mult="last"]

x v
1: b 1.375313

The comma is optional.

> DT["b"]

x v
1: b 0.38350043
2: b -0.04438885
3: b 1.37531316

Let’s now create a new [Link]. We will make it large enough to demonstrate the difference
between a vector scan and a binary search.
> grpsize = ceiling(1e7/26^2) # 10 million rows, 676 groups

[1] 14793

> tt=[Link]( DF <- [Link](

+ x=rep(LETTERS,each=26*grpsize),
+ y=rep(letters,each=grpsize),
+ v=runif(grpsize*26^2),
+ stringsAsFactors=FALSE)
+ )

user system elapsed

1.048 0.216 2.449

> head(DF,3)

x y v
1 A a 0.8004513
2 A a 0.7763523
3 A a 0.5914273

> tail(DF,3)

x y v
10000066 Z z 0.6575659
10000067 Z z 0.5805366
10000068 Z z 0.5529532

> dim(DF)

[1] 10000068 3
2 In contrast to a [Link] where only the first rowname is returned when the rownames contain duplicates.

4
We might say that R has created a 3 column table and inserted 10,000,068 rows. It took
2.449 secs, so it inserted 4,083,327 rows per second. This is normal in base R. Notice that we set
stringsAsFactors=FALSE. This makes it a little faster for a fairer comparison, but feel free to
experiment.
Let’s extract an arbitrary group from DF:

> tt=[Link](ans1 <- DF[DF$x=="R" & DF$y=="h",]) # 'vector scan'

user system elapsed

2.032 0.300 3.450

> head(ans1,3)

x y v
6642058 R h 0.47423463
6642059 R h 0.05985042
6642060 R h 0.60237692

> dim(ans1)

[1] 14793 3

Now convert to a [Link] and extract the same group:

> DT = [Link](DF) # but normally use fread() or [Link]() directly, originally

> [Link](setkey(DT,x,y)) # one-off cost, usually

user system elapsed

0.136 0.032 0.253

> ss=[Link](ans2 <- DT[J("R","h")]) # binary search

user system elapsed

0.000 0.000 0.002

> head(ans2,3)

x y v
1: R h 0.47423463
2: R h 0.05985042
3: R h 0.60237692

> dim(ans2)

[1] 14793 3

> identical(ans1$v, ans2$v)

[1] TRUE

At 0.002 seconds, this was 1724 times faster than 3.450 seconds, and produced precisely the
same result. If you are thinking that a few seconds is not much to save, it’s the relative speedup
that’s important. The vector scan is linear, but the binary search is O(log n). It scales. If a task
taking 10 hours is sped up by 100 times to 6 minutes, that is significant3 .
We can do vector scans in [Link], too. In other words we can use [Link] badly.

> [Link](ans1 <- DT[x=="R" & y=="h",]) # works but is using [Link] badly
3 We wonder how many people are deploying parallel techniques to code that is vector scanning

5
user system elapsed
1.076 0.092 1.873

> [Link](ans2 <- DF[DF$x=="R" & DF$y=="h",]) # the [Link] way

user system elapsed

1.772 0.348 3.225

> mapply(identical,ans1,ans2)

x y v
TRUE TRUE TRUE

If the phone book analogy helped, the 1724 times speedup should not be surprising. We use
the key to take advantage of the fact that the table is sorted and use binary search to find the
matching rows. We didn’t vector scan; we didn’t use ==.
When we used DT$x=="R" we scanned the entire column x, testing each and every value to see
if it equalled ”R”. We did it again in the y column, testing for ”h”. Then & combined the two logical
results to create a single logical vector which was passed to the [ method, which in turn searched
it for TRUE and returned those rows. These were vectorized operations. They occurred internally
in R and were very fast, but they were scans. We did those scans because we wrote that R code.
When i is itself a [Link], we say that we are joining the two [Link]. In this case,
we are joining DT to the 1 row, 2 column table returned by [Link]("R","h"). Since we do
this a lot, there is an alias for [Link] called J(), short for join.

> identical( DT[J("R","h"),],

+ DT[[Link]("R","h"),])

[1] TRUE

Both vector scanning and binary search are available in [Link], but one way of using
[Link] is much better than the other.
The join syntax is a short, fast to write and easy to maintain. Passing a [Link] into a
[Link] subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column
matrix4 . In fact, the A[B] syntax in base R inspired the [Link] package. There are other
types of join and further arguments which are beyond the scope of this quick introduction.
The merge method of [Link] is very similar to X[Y], but there are some differences. See
FAQ 1.12.
This first section has been about the first argument to [, namely i. The next section has to do
with the 2nd argument j.

2. Fast grouping
The second argument to [ is j, which may consist of one or more expressions whose arguments are
(unquoted) column names, as if the column names were variables.

> DT[,sum(v)]

[1] 5001774

When we supply a j expression and a ’by’ list of expressions, the j expression is repeated for
each ’by’ group:

> DT[,sum(v),by=x]
4 Subsetting a keyed [Link] by a n-column [Link] is consistent with subsetting a n-dimension array by

a n-column matrix in base R

6
x V1
1: A 192298.9
2: B 192276.6
3: C 192277.3
4: D 192522.5
5: E 192422.4
6: F 192248.5
7: G 192317.1
8: H 192548.6
9: I 192542.1
10: J 192496.3
11: K 192438.1
12: L 192514.7
13: M 192443.7
14: N 192549.1
15: O 192255.8
16: P 192406.0
17: Q 192165.6
18: R 192429.0
19: S 192399.2
20: T 192313.5
21: U 192318.7
22: V 192133.4
23: W 192281.7
24: X 192461.3
25: Y 192381.7
26: Z 192332.8
x V1

The by in [Link] is fast. Let’s compare it to tapply.

> ttt=[Link](tt <- tapply(DT$v,DT$x,sum)); ttt

user system elapsed

1.444 0.800 3.116

> sss=[Link](ss <- DT[,sum(v),by=x]); sss

user system elapsed

0.152 0.036 0.393

> head(tt)

A B C D E F
192298.9 192276.6 192277.3 192522.5 192422.4 192248.5

> head(ss)

x V1
1: A 192298.9
2: B 192276.6
3: C 192277.3
4: D 192522.5
5: E 192422.4
6: F 192248.5

> identical([Link](tt), ss$V1)

[1] TRUE

7
At 0.393 sec, this was 7 times faster than 3.116 sec, and produced precisely the same result.
Next, let’s group by two columns:
> ttt=[Link](tt <- tapply(DT$v,list(DT$x,DT$y),sum)); ttt
user system elapsed
2.272 1.352 5.932
> sss=[Link](ss <- DT[,sum(v),by="x,y"]); sss
user system elapsed
0.180 0.040 0.223
> tt[1:5,1:5]
a b c d e
A 7404.636 7391.323 7382.284 7450.424 7415.376
B 7402.222 7390.865 7441.711 7429.015 7328.845
C 7416.403 7401.231 7371.955 7388.700 7376.935
D 7417.088 7417.991 7442.599 7409.437 7458.976
E 7411.580 7499.448 7397.866 7414.710 7392.116
> head(ss)
x y V1
1: A a 7404.636
2: A b 7391.323
3: A c 7382.284
4: A d 7450.424
5: A e 7415.376
6: A f 7405.022
> identical([Link](t(tt)), ss$V1)
[1] TRUE
This was 26 times faster, and the syntax is a little simpler and easier to read.

The following features are mentioned only briefly here; further examples are in ?[Link] and
the FAQ vignette.
• To return several expressions, pass a list() to j.
• Each item of the list is recycled to match the length of the longest item.
• You can pass a list() of expressions of column names to by e.g.
DT[,sum(v),by=list(month(dateCol),region)]
where calling month() on dateCol is what we mean by expressions of column names.
• Any R functions from any package can be used in j and by.

3. Fast time series join

This is also known as last observation carried forward (LOCF) or a rolling join.
Recall that x[i] is a join between [Link] x and [Link] i. If i has 2 columns, the
first column is matched to the first column of the key of x, and the 2nd column to the 2nd. An
equi-join is performed, meaning that the values must be equal.
The syntax for fast rolling join is
x[i,roll=TRUE]
As before the first column of i is matched to x where the values are equal. The last column
of i though, the 2nd one in this example, is treated specially. If no match is found, then the row
before is returned, provided the first column still matches.
For examples type example([Link]) and follow the output at the prompt.

8
Other resources
This was a quick start guide. Further resources include :

• The help page describes each and every argument: ?[Link]

• The FAQs deal with distinct topics: vignette("datatable-faq")
• The performance tests contain more examples: vignette("datatable-timings")
• [Link]() contains over 250 low level tests of the features

• Website: [Link]
• Presentations:
– [Link]
– [Link]

• YouTube Demo: [Link]

• R-Forge commit logs: [Link]
• Mailing list : datatable-help@[Link]

• User reviews : [Link]

Introduction To The Data - Table Package in R: Revised: September 18, 2015 (A Later Revision May Be Available On The)
No ratings yet
Introduction To The Data - Table Package in R: Revised: September 18, 2015 (A Later Revision May Be Available On The)
8 pages
Introduction To The Data - Table Package in R: Revised: October 2, 2014 (A Later Revision May Be Available On The)
No ratings yet
Introduction To The Data - Table Package in R: Revised: October 2, 2014 (A Later Revision May Be Available On The)
8 pages
Data Table
No ratings yet
Data Table
2 pages
Data Transformation With Data - Table: Cheat Sheet
No ratings yet
Data Transformation With Data - Table: Cheat Sheet
2 pages
Data Transformation With Data - Table: Cheat Sheet
No ratings yet
Data Transformation With Data - Table: Cheat Sheet
2 pages
Data Transformation With Data - Table: Cheat Sheet
No ratings yet
Data Transformation With Data - Table: Cheat Sheet
2 pages
Datatable
No ratings yet
Datatable
2 pages
Enhanced Data
No ratings yet
Enhanced Data
12 pages
R Programming Cheatsheet
100% (2)
R Programming Cheatsheet
6 pages
R Programming Cheat Sheet: Data Structures
No ratings yet
R Programming Cheat Sheet: Data Structures
2 pages
Data.table Transformation Cheat Sheet
No ratings yet
Data.table Transformation Cheat Sheet
2 pages
Faqs About The Data - Table Package in R: Revised: October 2, 2014 (A Later Revision May Be Available On The)
No ratings yet
Faqs About The Data - Table Package in R: Revised: October 2, 2014 (A Later Revision May Be Available On The)
21 pages
R Programming Basics: Vectors, Matrices, Dataframes
No ratings yet
R Programming Basics: Vectors, Matrices, Dataframes
13 pages
UL2
No ratings yet
UL2
2 pages
Presentation 1
No ratings yet
Presentation 1
34 pages
Understanding the abs() Function in R
No ratings yet
Understanding the abs() Function in R
34 pages
Descriptive Statistics in R Programming
No ratings yet
Descriptive Statistics in R Programming
3 pages
Week 1-B. Data in R
No ratings yet
Week 1-B. Data in R
5 pages
DR - Pierpaolo-Delser - Introduction R
No ratings yet
DR - Pierpaolo-Delser - Introduction R
83 pages
M2 Dar
No ratings yet
M2 Dar
46 pages
Base R
No ratings yet
Base R
9 pages
BIO259 Note
No ratings yet
BIO259 Note
55 pages
R Course Own English HS
No ratings yet
R Course Own English HS
70 pages
R
No ratings yet
R
15 pages
R Study Material I
No ratings yet
R Study Material I
8 pages
R data.table Guide: 50 Examples
No ratings yet
R data.table Guide: 50 Examples
13 pages
Module 2.9
No ratings yet
Module 2.9
12 pages
Datatable Cheat Sheet R
No ratings yet
Datatable Cheat Sheet R
1 page
R Programming: © 2016 SMART Training Resources Pvt. LTD
No ratings yet
R Programming: © 2016 SMART Training Resources Pvt. LTD
28 pages
N2 Data in R
No ratings yet
N2 Data in R
7 pages
R Basics for Economics Students
No ratings yet
R Basics for Economics Students
7 pages
Introduction To R: Nihan Acar-Denizli, Pau Fonseca
No ratings yet
Introduction To R: Nihan Acar-Denizli, Pau Fonseca
50 pages
R Reference Card
No ratings yet
R Reference Card
6 pages
R Reference Guide for Programmers
No ratings yet
R Reference Guide for Programmers
6 pages
R Programming Essentials
No ratings yet
R Programming Essentials
27 pages
ISYS3447 - Week 3 Notes
No ratings yet
ISYS3447 - Week 3 Notes
3 pages
R Lecture 2-1
No ratings yet
R Lecture 2-1
28 pages
R - A Practical Course
No ratings yet
R - A Practical Course
42 pages
Solutions For QB3
No ratings yet
Solutions For QB3
14 pages
Teaching R
No ratings yet
Teaching R
15 pages
Saving R Environment to RData
No ratings yet
Saving R Environment to RData
60 pages
R Studio Commands
No ratings yet
R Studio Commands
19 pages
A Crash Course in R - Intro To Statistical Programming
No ratings yet
A Crash Course in R - Intro To Statistical Programming
53 pages
Lesson 7 - The Data Frame
No ratings yet
Lesson 7 - The Data Frame
7 pages
Basics of R Programming - Ghosh - Tagged
No ratings yet
Basics of R Programming - Ghosh - Tagged
18 pages
Data Manipulation and Visualization in R
No ratings yet
Data Manipulation and Visualization in R
58 pages
Data - Analysis - With - R - 24
No ratings yet
Data - Analysis - With - R - 24
47 pages
R Reference Card
No ratings yet
R Reference Card
1 page
Create A Simple Servlet Program Using
No ratings yet
Create A Simple Servlet Program Using
45 pages
1.1 01 Online Judge
No ratings yet
1.1 01 Online Judge
10 pages
The Pearl Study Guide
100% (2)
The Pearl Study Guide
24 pages
Disable Logging of Changes On Customizing Tables
No ratings yet
Disable Logging of Changes On Customizing Tables
2 pages
NFS320 Panel: Invrep Troubleshooting Guide
100% (1)
NFS320 Panel: Invrep Troubleshooting Guide
52 pages
British Airways Forage Report
No ratings yet
British Airways Forage Report
12 pages
First and Second Conditional
No ratings yet
First and Second Conditional
4 pages
Fast Majority Vote Algorithm
No ratings yet
Fast Majority Vote Algorithm
14 pages
As at May 2025
No ratings yet
As at May 2025
30 pages
2021 Y9 Da Vinci AT2
No ratings yet
2021 Y9 Da Vinci AT2
8 pages
Class 11 Maths Detailed Formulas
No ratings yet
Class 11 Maths Detailed Formulas
2 pages
5777-Article Text-10632-1-10-20210605
No ratings yet
5777-Article Text-10632-1-10-20210605
10 pages
Dictation
No ratings yet
Dictation
2 pages
Oracle Unix/Linux Performance Tuning
No ratings yet
Oracle Unix/Linux Performance Tuning
35 pages
MALLAKIN WAYE COMPLETE Document From Queen Samy - TXT by Abooks - Com.ng
100% (1)
MALLAKIN WAYE COMPLETE Document From Queen Samy - TXT by Abooks - Com.ng
265 pages
Engaging Activities on Canadian Symbols
No ratings yet
Engaging Activities on Canadian Symbols
2 pages
Exploring The Premed Discourse Community
No ratings yet
Exploring The Premed Discourse Community
4 pages
Cambridge International AS & A Level: English Language 9093/32
No ratings yet
Cambridge International AS & A Level: English Language 9093/32
8 pages
Graphic Novel Exam Review Assessment
No ratings yet
Graphic Novel Exam Review Assessment
5 pages
Man Bni Iol 302 XXX K006 e B15 Dok 887780 04 000
No ratings yet
Man Bni Iol 302 XXX K006 e B15 Dok 887780 04 000
25 pages
AI Curriculum for Grade 8 Students
No ratings yet
AI Curriculum for Grade 8 Students
4 pages
Essentials of News Writing Techniques
No ratings yet
Essentials of News Writing Techniques
53 pages
Sandeep - CH Resume
No ratings yet
Sandeep - CH Resume
4 pages
Ch03 OS9e
No ratings yet
Ch03 OS9e
53 pages
TOEFL Test 2
No ratings yet
TOEFL Test 2
3 pages
File Transfer Protocol
No ratings yet
File Transfer Protocol
11 pages
Graphic Organizer - Activity Sheet in RWS - Selecting and Organizing Information
No ratings yet
Graphic Organizer - Activity Sheet in RWS - Selecting and Organizing Information
2 pages
Advanced Work Vocabulary Guide
No ratings yet
Advanced Work Vocabulary Guide
2 pages
SailPoint IdentityIQ Training
No ratings yet
SailPoint IdentityIQ Training
7 pages
Sid Resume Feb 11
No ratings yet
Sid Resume Feb 11
2 pages

Datatable Intro

Uploaded by

Datatable Intro

Uploaded by

Introduction to the data.

A [Link] is created in exactly the same way:

NAME NROW NCOL MB COLS KEY

NAME NROW NCOL MB COLS KEY

Error in `[.[Link]`(DT, "b", ) :

The error message tells us we need to use setkey():

NAME NROW NCOL MB COLS KEY

Now that we are sure DT has a key, let’s try again:

The comma is optional.

> tt=[Link]( DF <- [Link](

user system elapsed

> tt=[Link](ans1 <- DF[DF$x=="R" & DF$y=="h",]) # 'vector scan'

user system elapsed

Now convert to a [Link] and extract the same group:

> DT = [Link](DF) # but normally use fread() or [Link]() directly, originally

user system elapsed

> ss=[Link](ans2 <- DT[J("R","h")]) # binary search

user system elapsed

> identical(ans1$v, ans2$v)

> [Link](ans2 <- DF[DF$x=="R" & DF$y=="h",]) # the [Link] way

user system elapsed

> identical( DT[J("R","h"),],

a n-column matrix in base R

The by in [Link] is fast. Let’s compare it to tapply.

> ttt=[Link](tt <- tapply(DT$v,DT$x,sum)); ttt

user system elapsed

> sss=[Link](ss <- DT[,sum(v),by=x]); sss

user system elapsed

> identical([Link](tt), ss$V1)

3. Fast time series join

• The help page describes each and every argument: ?[Link]

• YouTube Demo: [Link]

• User reviews : [Link]

You might also like