Mining of Massive Datasets
Leskovec, Rajaraman, and Ullman
Stanford University
• Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
  ◦ E.g., popular words in the word count example
• Can save network time by pre-aggregating values in the mapper:
  ◦ combine(k, list(v1)) → v2
  ◦ Combiner is usually the same as the reduce function
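The effect of a combiner on word count can be sketched in a few lines of Python. This is a single-process simulation rather than a real MapReduce job; the function names (`map_words`, `combine`, `reduce_counts`) and the two input splits are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in document.split()]

def combine(pairs):
    """Combiner: sum the values per key over one mapper's own output."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())

def reduce_counts(pairs):
    """Reduce: sum the values per key across all mappers (same logic as combine)."""
    return dict(combine(pairs))

# Two hypothetical input splits, one per mapper.
splits = ["the cat saw the dog", "the dog saw the the bird"]

raw = [map_words(s) for s in splits]           # pairs emitted by each mapper
combined = [combine(pairs) for pairs in raw]   # pairs left after local combining

# Number of pairs that would cross the network in each case.
pairs_without = sum(len(p) for p in raw)
pairs_with = sum(len(p) for p in combined)

# The final counts are the same either way.
counts_without = reduce_counts([kv for p in raw for kv in p])
counts_with = reduce_counts([kv for p in combined for kv in p])
```

Here the combiner reuses the reduce logic unchanged, which is possible because summation is commutative and associative. The saving grows with how often a key repeats within one mapper's input.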
• Back to our word counting example:
  ◦ Combiner combines the values of all keys of a single mapper (single node)
  ◦ Much less data needs to be copied and shuffled!
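• Combiner trick works only if the reduce function is commutative and associative
  ◦ Sum: works, since partial sums can be added in any order
  ◦ Average: does not work directly, since an average of averages is not the overall average; it works if the combiner emits (sum, count) pairs and the reducer divides at the end
  ◦ Median: does not work, since the median of per-mapper medians is not the overall median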
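To see why averaging needs care with a combiner, the following sketch compares a naive "average of averages" with a combiner that emits (sum, count) pairs. The input values and the helper names `combine_avg` and `reduce_avg` are assumptions made for this example, not part of any MapReduce API.

```python
def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical values for one key, split unevenly across two mappers.
mapper_a = [1, 2, 3, 4]
mapper_b = [10]

# Ground truth: the mean over all values together.
true_mean = mean(mapper_a + mapper_b)

# Naive combiner: each mapper sends only its local average,
# and the reducer averages those. This loses the group sizes.
naive = mean([mean(mapper_a), mean(mapper_b)])

def combine_avg(values):
    """Combiner: emit a (sum, count) pair instead of an average."""
    return (sum(values), len(values))

def reduce_avg(partials):
    """Reduce: add up sums and counts, then divide once at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

fixed = reduce_avg([combine_avg(mapper_a), combine_avg(mapper_b)])
```

With these inputs `naive` differs from `true_mean`, while `fixed` matches it, because adding (sum, count) pairs component-wise is commutative and associative.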
• Want to control how keys get partitioned
  ◦ The set of keys that go to a single reduce worker
• System uses a default partition function:
  ◦ hash(key) mod R
• Sometimes useful to override the hash function:
  ◦ E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
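The two partition functions above can be sketched as follows. This is an illustration only: the value of `R`, the helper names, and the use of CRC32 as the hash are assumptions for the example, and a real system would use its own hash and partitioner interface.

```python
import zlib
from urllib.parse import urlparse

R = 4  # hypothetical number of reduce tasks

def stable_hash(text):
    """Deterministic hash. Python's built-in hash() of str can vary between runs."""
    return zlib.crc32(text.encode("utf-8"))

def default_partition(key, num_reducers=R):
    """Default scheme: hash(key) mod R."""
    return stable_hash(key) % num_reducers

def host_partition(url, num_reducers=R):
    """Override: hash(hostname(URL)) mod R.

    Assumes the URL includes a scheme such as http://, so that
    urlparse can extract a hostname.
    """
    return stable_hash(urlparse(url).hostname) % num_reducers

# Hypothetical URLs: the first two share a host.
urls = [
    "http://example.com/a.html",
    "http://example.com/b/c.html",
    "http://example.org/index.html",
]
by_host = {u: host_partition(u) for u in urls}
```

With `host_partition`, the two `example.com` URLs always receive the same reducer index and so end up in the same output file. With `default_partition` applied to the full URL they may or may not be placed together.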
• Google MapReduce
  ◦ Uses Google File System (GFS) for stable storage
  ◦ Not available outside Google
• Hadoop
  ◦ Open-source implementation in Java
  ◦ Uses HDFS for stable storage
  ◦ Download: http://lucene.apache.org/hadoop/
• Hive, Pig
  ◦ Provide SQL-like abstractions on top of the Hadoop Map-Reduce layer
• Ability to rent computing by the hour
  ◦ Additional services, e.g., persistent storage
• E.g., Amazon’s “Elastic Compute Cloud” (EC2)
  ◦ S3 (stable storage)
  ◦ Elastic Map Reduce (EMR)
• Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
  ◦ http://labs.google.com/papers/mapreduce.html
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
  ◦ http://labs.google.com/papers/gfs.html
• Hadoop Wiki
  ◦ Introduction: http://wiki.apache.org/lucene-hadoop/
  ◦ Getting Started: http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
  ◦ Map/Reduce Overview: http://wiki.apache.org/lucene-hadoop/HadoopMapReduce and http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
  ◦ Eclipse Environment: http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
• Javadoc
  ◦ http://lucene.apache.org/hadoop/docs/api/
• Releases from Apache download mirrors
  ◦ http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
• Nightly builds of source
  ◦ http://people.apache.org/dist/lucene/hadoop/nightly/
• Source code from subversion
  ◦ http://lucene.apache.org/hadoop/version_control.html