Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Mining
of
Massive
Datasets
Jure
Leskovec,
Anand
Rajaraman,
Jeff
Ullman
Stanford
University
http://www.mmds.org
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
3
Data
contains
value
and
knowledge
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
4
¡ But
to
extract
the
knowledge
data
needs
to
be
§ Stored
§ Managed
§ And
ANALYZED
!
this
class
Data
Mining
≈
Big
Data
≈
Predic@ve
Analy@cs
≈
Data
Science
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
5
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
6
¡ Given
lots
of
data
¡ Discover
paFerns
and
models
that
are:
§ Valid:
hold
on
new
data
with
some
certainty
§ Useful:
should
be
possible
to
act
on
the
item
§ Unexpected:
non-‐obvious
to
the
system
§ Understandable:
humans
should
be
able
to
interpret
the
pa=ern
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
7
¡ Descrip@ve
methods
§ Find
human-‐interpretable
pa=erns
that
describe
the
data
§ Example:
Clustering
¡ Predic@ve
methods
§ Use
some
variables
to
predict
unknown
or
future
values
of
other
variables
§ Example:
Recommender
systems
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
8
¡ A
risk
with
“Data
mining”
is
that
an
analyst
can
“discover”
paFerns
that
are
meaningless
¡ StaOsOcians
call
it
Bonferroni’s
principle:
§ Roughly,
if
you
look
in
more
places
for
interesOng
pa=erns
than
your
amount
of
data
will
support,
you
are
bound
to
find
crap
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
9
Example:
¡ We
want
to
find
(unrelated)
people
who
at
least
twice
have
stayed
at
the
same
hotel
on
the
same
day
§ 109
people
being
tracked
§ 1,000
days
§ Each
person
stays
in
a
hotel
1%
of
Ome
(1
day
out
of
100)
§ Hotels
hold
100
people
(so
105
hotels)
§ If
everyone
behaves
randomly
(i.e.,
no
terrorists)
will
the
data
mining
detect
anything
suspicious?
¡ Expected
number
of
“suspicious”
pairs
of
people:
§ 250,000
§ …
too
many
combinaOons
to
check
–
we
need
to
have
some
addiOonal
evidence
to
find
“suspicious”
pairs
of
people
in
some
more
efficient
way
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
10
Usage
Quality
Context
Streaming
Scalability
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
11
¡ Data
mining
overlaps
with:
§ Databases:
Large-‐scale
data,
simple
queries
§ Machine
learning:
Small
data,
Complex
models
§ CS
Theory:
(Randomized)
Algorithms
¡ Different
cultures:
§ To
a
DB
person,
data
mining
is
an
extreme
form
of
analy@c
processing
–
queries
that
examine
large
amounts
of
data
CS
Machine
Theory
§ Result
is
the
query
answer
Learning
Data
§ To
a
ML
person,
data-‐mining
Mining
is
the
inference
of
models
§ Result
is
the
parameters
of
the
model
Database
¡ In
this
class
we
will
do
both!
systems
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
12
¡ This
class
overlaps
with
machine
learning,
sta@s@cs,
ar@ficial
intelligence,
databases
but
more
stress
on
§ Scalability
(big
data)
Statistics
§ Algorithms
Machine
Learning
§ Compu@ng
architectures
§ AutomaOon
for
handling
Data
Mining
large
data
Database
systems
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
13
¡ We
will
learn
to
mine
different
types
of
data:
§ Data
is
high
dimensional
§ Data
is
a
graph
§ Data
is
infinite/never-‐ending
§ Data
is
labeled
¡ We
will
learn
to
use
different
models
of
computa@on:
§ MapReduce
§ Streams
and
online
algorithms
§ Single
machine
in-‐memory
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
14
¡ We
will
learn
to
solve
real-‐world
problems:
§ Recommender
systems
§ Market
Basket
Analysis
§ Spam
detecOon
§ Duplicate
document
detecOon
¡ We
will
learn
various
“tools”:
§ Linear
algebra
(SVD,
Rec.
Sys.,
CommuniOes)
§ OpOmizaOon
(stochasOc
gradient
descent)
§ Dynamic
programming
(frequent
itemsets)
§ Hashing
(LSH,
Bloom
filters)
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
15
High
dim.
Graph
Infinite
Machine
Apps
data
data
data
learning
Locality
Filtering
PageRank,
Recommen
sensiOve
data
SVM
SimRank
der
systems
hashing
streams
Community
Web
Decision
AssociaOon
Clustering
DetecOon
adverOsing
Trees
Rules
Dimensional Duplicate
Spam
Queries
on
Perceptron,
ity
document
DetecOon
streams
kNN
reducOon
detecOon
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
16
How
do
you
want
that
data?
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
17
¡ TAs:
§ We
have
9
great
TAs!
§ Sean
Choi
(Head
TA),
Sumit
ArrawaOa,
Jus@n
Chen,
Dingyi
Li,
Anshul
Mi=al,
Rose
Marie
Philip,
Robi
Robaszkiewicz,
Le
Yu,
Tongda
Zhang
¡ Office
hours:
§ Jure:
Wednesdays
9-‐10am,
Gates
418
§ See
course
website
for
TA
office
hours
§ For
SCPD
students
we
will
use
Google
Hangout
§ We
will
post
Google
Hangout
links
on
Piazza
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
19
¡ Course
website:
hFp://cs246.stanford.edu
§ Lecture
slides
(at
least
30min
before
the
lecture)
§ Homeworks,
soluOons
§ Readings
¡ Readings:
Book
Mining
of
Massive
Datasets
with
A.
Rajaraman
and
J.
Ullman
Free
online:
hFp://www.mmds.org
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
20
¡ Piazza
Q&A
website:
§ h=ps://piazza.com/class#winter2013/cs246
§ Use
Piazza
for
all
quesOons
and
public
communicaOon
with
the
course
staff
§ If
you
don’t
have
@stanford.edu
email
address,
send
us
your
email
and
we
will
manually
register
you
to
Piazza
¡ For
e-‐mailing
us,
always
use:
§ cs246-‐win1213-‐staff@lists.stanford.edu
¡ We
will
post
course
announcements
to
Piazza
(make
sure
you
check
it
regularly)
Auditors
are
welcome
to
sit-‐in
&
audit
the
class
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
21
¡ (1+)4
longer
homeworks:
40%
§ TheoreOcal
and
programming
quesOons
§ HW0
(Hadoop
tutorial)
has
just
been
posted
§ Assignments
take
lots
of
@me.
Start
early!!
¡ How
to
submit?
§ Homework
write-‐up:
§ Stanford
students:
In
class
or
in
Gates
submission
box
§ SCPD
students:
Submit
write-‐ups
via
SCPD
§ AFach
the
HW
cover
sheet
(and
SCPD
rouOng
form)
§ Upload
code:
§ Put
the
code
for
1
quesOon
into
1
file
and
submit
at:
h=p://snap.stanford.edu/submit/
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
22
¡ Short
weekly
quizzes:
20%
§ Short
e-‐quizzes
on
Gradiance
§ You
have
exactly
7
days
to
complete
it
No
late
days!
§ First
quiz
is
already
online
¡ Final
exam:
40%
§ Friday,
March
22
12:15pm-‐3:15pm
¡ It’s
going
to
be
fun
and
hard
work.
☺
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
23
¡ Homework
schedule:
Date
Out In
01/08, Tue HW0
01/10, Thu HW1
01/15, Tue HW0
01/24, Thu HW2 HW1
02/07, Thu HW3 HW2
02/21, Thu HW4 HW3
03/07, Thu HW4
§ 2
late
“days”
(late
periods)
for
HWs
for
the
quarter:
§ 1
late
day
expires
at
the
start
of
next
class
§ You
can
use
max
1
late
day
per
assignment
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
24
¡ Algorithms
(CS161)
§ Dynamic
programming,
basic
data
structures
¡ Basic
probability
(CS109
or
Stat116)
§ Moments,
typical
distribuOons,
MLE,
…
¡ Programming
(CS107
or
CS145)
§ Your
choice,
but
C++/Java
will
be
very
useful
¡ We
provide
some
background,
but
the
class
will
be
fast
paced
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
25
¡ 3
recita@on
sessions:
§ Hadoop:
Thurs.
1/10,
5:15-‐6:30pm
§ We
prepared
a
virtual
machine
with
Hadoop
preinstalled
§ HW0
helps
you
write
your
first
Hadoop
program
§ Review
of
probability&stats:
1/17,
5:15-‐6:30pm
§ Review
of
linear
algebra:
1/18,
5:15-‐6:30pm
§ All
sessions
will
be
held
in
Thornton
102,
Thornton
Center
(Terman
Annex)
§ Sessions
will
be
video
recorded!
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
26
¡ InfoSeminar
(CS545):
§ h=p://i.stanford.edu/infoseminar
§ Great
industrial
&
academic
speakers
§ Topics
include
data
mining
and
large
scale
data
processing
¡ CS341:
Project
in
Data
Mining
(Spring
2013)
§ Research
project
on
big
data
§ Groups
of
3
students
§ We
provide
interesOng
data,
compuOng
resources
(Amazon
EC2)
and
mentoring
¡ We
have
big-‐data
RA
posi@ons
open!
§ I
will
post
details
on
Piazza
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
27
¡ 3
To-‐do
items
for
you:
§ Register
to
Piazza
§ Complete
HW0:
Hadoop
tutorial
§ HW0
should
take
your
about
1
hour
to
complete
(Note
this
is
a
“toy”
homework
to
get
you
started.
Real
homeworks
will
be
much
more
challenging
and
longer)
§ Register
to
Gradiance
and
complete
the
first
quiz
§ Use
your
SUNet
ID
to
register!
(so
we
can
match
grading
records)
§ You
have
7
days
(sharp!)
to
do
so
§ Quizzes
typically
take
several
hours
¡ Addi@onal
details/instruc@ons
at
hFp://cs246.stanford.edu
J.
Leskovec,
A.
Rajaraman,
J.
Ullman:
Mining
of
Massive
Datasets,
h=p://www.mmds.org
28