Mining of Massive Datasets
Leskovec, Rajaraman, and Ullman
Stanford University
[Diagram: a single machine with CPU, Memory, and Disk. Machine Learning / Statistics operate on data in memory; Classical Data Mining operates on data on disk.]
- 10 billion web pages
- Average size of webpage = 20KB
- 10 billion * 20KB = 200 TB
- Disk read bandwidth = 50 MB/sec
- Time to read = 4 million seconds = 46+ days
- Even longer to do something useful with the data
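The back-of-envelope arithmetic above can be checked directly; all constants below come from the slide itself:

```python
# Sanity check of the slide's numbers: 10 billion pages at 20KB each,
# read sequentially from one disk at 50 MB/sec.
PAGES = 10e9        # 10 billion web pages
PAGE_SIZE = 20e3    # 20 KB per page, in bytes
DISK_BW = 50e6      # 50 MB/sec disk read bandwidth, in bytes/sec

total_bytes = PAGES * PAGE_SIZE       # 2e14 bytes = 200 TB
read_seconds = total_bytes / DISK_BW  # 4 million seconds
read_days = read_seconds / 86400      # ~46.3 days

print(f"{total_bytes / 1e12:.0f} TB, {read_seconds:.0f} s, {read_days:.1f} days")
```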
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 commodity Linux nodes

[Diagram: racks of nodes, each node with CPU, Mem, and Disk; nodes in a rack connect to an in-rack switch, and the rack switches connect to a backbone switch.]
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO
Node failures
- A single server can stay up for 3 years (1000 days)
- 1000 servers in cluster => 1 failure/day
- 1M servers in cluster => 1000 failures/day
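The failure rates above follow from the slide's assumption of a 1000-day mean lifetime per server; expected failures per day scale linearly with cluster size:

```python
# Expected failures per day given the slide's assumption of a
# mean server lifetime of 1000 days.
MEAN_LIFETIME_DAYS = 1000

def expected_failures_per_day(num_servers):
    return num_servers / MEAN_LIFETIME_DAYS

print(expected_failures_per_day(1_000))      # ~1 failure/day
print(expected_failures_per_day(1_000_000))  # ~1000 failures/day
```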
- How to store data persistently and keep it available if nodes can fail?
- How to deal with node failures during a long-running computation?
- Network bottleneck
  - Network bandwidth = 1 Gbps
  - Moving 10TB takes approximately 1 day
- Distributed programming is hard!
  - Need a simple model that hides most of the complexity
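The network-bottleneck figure above checks out: 10 TB over a 1 Gbps link is about a day of transfer time.

```python
# Moving 10 TB over a 1 Gbps link (note: link speed is in bits/sec,
# data size in bytes, hence the factor of 8).
DATA_BYTES = 10e12   # 10 TB
LINK_BPS = 1e9       # 1 Gbps

seconds = DATA_BYTES * 8 / LINK_BPS  # 80,000 seconds
days = seconds / 86400               # ~0.93 days
print(f"{seconds:.0f} s = {days:.2f} days")
```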
Map-Reduce addresses the challenges of cluster computing
- Store data redundantly on multiple nodes for persistence and availability
- Move computation close to data to minimize data movement
- Simple programming model to hide the complexity of all this magic
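The programming model can be illustrated with a minimal in-memory sketch of word count, the classic Map-Reduce example; the function names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative, not part of any real framework:

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit (word, 1) for every word in the document.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for a given word.
    return word, sum(counts)

def map_reduce(documents):
    # Shuffle: group intermediate (key, value) pairs by key,
    # then apply the reduce function to each group.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(map_reduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

In a real cluster the map and reduce calls run in parallel on the chunk servers holding the data, and the shuffle moves intermediate pairs over the network; here everything runs in one process to show only the model.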
Distributed File System
- Provides global file namespace, redundancy, and availability
- E.g., Google GFS; Hadoop HDFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common
- Data kept in chunks spread across machines
- Each chunk replicated on different machines
  - Ensures persistence and availability
[Diagram: chunks C0-C5 and D0-D1 replicated across Chunk server 1, Chunk server 2, Chunk server 3, ..., Chunk server N.]
Chunk servers also serve as compute servers
- Bring computation to data!
Chunk servers
- File is split into contiguous chunks (16-64MB)
- Each chunk replicated (usually 2x or 3x)
- Try to keep replicas in different racks

Master node
- a.k.a. Name Node in Hadoop's HDFS
- Stores metadata about where files are stored
- Might be replicated
Client library for file access
- Talks to master to find chunk servers
- Connects directly to chunk servers to access data
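The read path just described can be sketched as follows. This is a toy model under the stated assumptions, not the GFS or HDFS API; the class and method names (`Master`, `ChunkServer`, `client_read`) are invented for illustration:

```python
class Master:
    """Stores only metadata: which chunks make up a file, and where."""
    def __init__(self, chunk_locations):
        # filename -> list of (chunk_id, [replica server names])
        self.chunk_locations = chunk_locations

    def lookup(self, filename):
        return self.chunk_locations[filename]

class ChunkServer:
    """Stores the actual chunk data."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk_id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]

def client_read(master, servers, filename):
    # Ask the master where the chunks are, then read each chunk
    # directly from a replica; the master never serves file data.
    data = b""
    for chunk_id, replicas in master.lookup(filename):
        data += servers[replicas[0]].read(chunk_id)
    return data

# Example: a file "f" made of two chunks stored on different servers.
master = Master({"f": [("c0", ["s1"]), ("c1", ["s2", "s1"])]})
servers = {
    "s1": ChunkServer({"c0": b"hello ", "c1": b"world"}),
    "s2": ChunkServer({"c1": b"world"}),
}
print(client_read(master, servers, "f"))  # b'hello world'
```

Keeping the master on the metadata path only is what lets one node coordinate a large cluster: the heavy data traffic flows client-to-chunk-server.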