Introduction to Data Science with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@[Link]
Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Terms

I will cover:
- Hadoop, Hadoop ecosystem
- HDFS
- MapReduce
- Sqoop
- Flume
- Hive
- Pig
- Mahout
- Machine learning
- Data science using Hadoop

with a few extras:
- YARN
- HBase
- Impala
- Oozie
- data products

Hadoop

Hadoop is:
- a platform for big data
- several Apache Software Foundation (ASF) projects
- free open source software

Major parts:
- Hadoop Core
- Hadoop ecosystem

Hadoop Core Main Features: File System and Batch Programming

Hadoop Core

Hadoop Core consists of:
- HDFS (Hadoop Distributed File System), for storage
- MapReduce, for batch programming

HDFS Writes

HDFS Reads

HDFS Strengths and Weaknesses

HDFS is good at:
- storing enormous files
- storing a lot of data reliably
- throughput on sequential writes
- throughput on sequential reads of a file or part of a file

HDFS is not good at:
- high speed random reads of parts of a file

HDFS cannot:
- update any part of a file once written*

* but you can always write a new file, and/or delete, move, and rename files and directories
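
The write-once rule and its workaround can be pictured with a toy in-memory model. This is only a sketch: the class `WriteOnceStore`, its method names, and the paths are hypothetical and are not the HDFS API.

```python
class WriteOnceStore:
    """Toy in-memory model of write-once file semantics (not the HDFS API)."""

    def __init__(self):
        self._files = {}  # path -> bytes

    def create(self, path, data):
        # A path can be written exactly once; there is no in-place update.
        if path in self._files:
            raise FileExistsError(path)
        self._files[path] = bytes(data)

    def read(self, path, offset=0, length=None):
        # Read a whole file, or one contiguous part of it.
        data = self._files[path]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    def rename(self, src, dst):
        if dst in self._files:
            raise FileExistsError(dst)
        self._files[dst] = self._files.pop(src)

    def delete(self, path):
        del self._files[path]


store = WriteOnceStore()
store.create("/logs/day1", b"alpha,beta")

# "Updating" means writing a new file, then swapping it in with delete + rename.
store.create("/logs/day1.tmp", b"alpha,gamma")
store.delete("/logs/day1")
store.rename("/logs/day1.tmp", "/logs/day1")

print(store.read("/logs/day1"))        # whole file
print(store.read("/logs/day1", 6, 5))  # part of a file
```

The model has no `update` method on purpose: changed data always arrives as a new file.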

MapReduce: Programming with Simple Functions
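
As an illustration of the "simple functions" idea, here is a minimal single-process sketch of word counting with a map function and a reduce function. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are assumptions made for this example; a real Hadoop job distributes the same steps across many machines.

```python
from collections import defaultdict

def map_fn(line):
    # Emit one (word, 1) pair per word in the input record.
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine every value that was emitted for one key.
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

lines = ["the quick brown fox", "the lazy dog", "The end"]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
```

Neither function shares state with any other call, which is what lets the framework run many copies in parallel.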

MapReduce Chains
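
A chain simply feeds the output of one job into the next. The sketch below simulates two jobs in one process: the first counts words, the second groups words by their count. The helper `run_job` and all function names are hypothetical.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    # One simulated job: map every record, group by key, reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: count occurrences of each word.
def count_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return word, sum(ones)

# Job 2: consumes job 1's output and groups words by their count.
def invert_mapper(pair):
    word, count = pair
    return [(count, word)]

def collect_reducer(count, words):
    return count, sorted(words)

lines = ["a b a", "c b a"]
word_counts = run_job(lines, count_mapper, count_reducer)
by_count = run_job(word_counts, invert_mapper, collect_reducer)
print(by_count)
```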

MapReduce at Scale

MapReduce in Hadoop

MapReduce Strengths and Weaknesses

MapReduce is good at:
- processing enormous amounts of data
- scaling out as you add more machines
- continuing to completion, even when some machines die

MapReduce is not good at:
- running any algorithm you can think up
- algorithms that require shared state overall*

* but maybe you can get clever with your algorithm design

MapReduce cannot:
- run in real time: MapReduce jobs are batch jobs

Detour: YARN, Yet Another Resource Negotiator
(near future)

Hadoop Ecosystem

The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:
- Sqoop, for RDBMS integration
- Flume, for event ingestion
- Hive, for "SQL"-like high-level programming
- Pig, another high-level programming paradigm
- Mahout, a Java library for machine learning in Hadoop

Plus:
- HBase, a "NoSQL" database system
- Oozie, a workflow manager for Hadoop actions
- ....

Sqoop: RDBMS to Hadoop and Back
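
The round trip can be pictured as "table rows out to delimited text, delimited text back into a table". The sketch below imitates that idea with Python's built-in `sqlite3` and `csv` modules; it does not use Sqoop. The table names and the helpers `import_table` and `export_rows` are assumptions for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for an RDBMS: an in-memory SQLite database with one table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

def import_table(conn, table):
    # "Import": dump every row of the table as comma-delimited text.
    # The table name is a trusted, hard-coded value here, not user input.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(conn.execute(f"SELECT id, name FROM {table} ORDER BY id"))
    return buffer.getvalue()

def export_rows(conn, table, text):
    # "Export": parse the delimited text and insert it into a table.
    rows = [(int(r[0]), r[1]) for r in csv.reader(io.StringIO(text))]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

text = import_table(db, "customers")
print(text)

db.execute("CREATE TABLE customers_copy (id INTEGER PRIMARY KEY, name TEXT)")
export_rows(db, "customers_copy", text)
copied = db.execute("SELECT id, name FROM customers_copy ORDER BY id").fetchall()
print(copied)
```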

Flume: Ingesting Continuing Event Data

Detour: General File Input/Output

MapReduce revisited: How to write MapReduce programs?

Java MapReduce API
- The most expressive technique possible
- The most work, by far
- (Can be easier with Hadoop Streaming: a way to use streaming programming such as shell scripting or Python)
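
For a feel of the Streaming style in Python, here is a minimal word-count sketch. Streaming programs read text lines on standard input and write tab-separated key/value lines on standard output. To keep the example runnable anywhere, the two steps are written as functions over lists of lines and wired together locally; in a real job each would be a separate script reading `sys.stdin`, passed to the Hadoop Streaming jar with its `-mapper` and `-reducer` options. The function names and sample data are assumptions.

```python
import itertools

def mapper(lines):
    # Streaming mappers read raw text lines and write "key<TAB>value" lines.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # The framework delivers mapper output sorted by key, so equal keys
    # arrive on consecutive lines and can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Local simulation of: cat input | mapper | sort | reducer
sample = ["to be or not", "to be"]
output = list(reducer(sorted(mapper(sample))))
print(output)
```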

Hive: MapReduce as "SQL"

- Familiar language and programming paradigm
- Provides interface to many SQL-compliant tools
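
To see why a SQL-style query maps naturally onto MapReduce, consider a GROUP BY aggregate. The sketch below is a conceptual picture only, not how Hive actually plans queries; the `orders` table, its columns, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical query, shown only as a comment:
#   SELECT customer, SUM(amount) FROM orders GROUP BY customer;

orders = [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)]

def map_phase(rows):
    # The GROUP BY column becomes the key; the aggregated column is the value.
    for customer, amount in rows:
        yield customer, amount

def reduce_phase(pairs):
    # All values for one key are summed, as SUM(amount) requires.
    totals = defaultdict(float)
    for customer, amount in pairs:
        totals[customer] += amount
    return dict(totals)

totals = reduce_phase(map_phase(orders))
print(totals)
```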

Detour: Impala, High Speed Analytics in Hadoop

- 5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)
- Cloudera exclusive offering, but Apache licensed, so it's free and open source

Impala Does Not Use MapReduce

Detour: HBase, A NoSQL Database System

Detour: A bit more about HBase

HBase is a NoSQL database system:
- programmers create and use database tables
- high volume, high performance access to individual cells
- much weaker query language than SQL
- lacks ACID-compliant transactions

HBase is:
- not strictly needed to do "data science"
- a resource hog; competes with analytical programs
- often deployed on its own separate cluster
- may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*

* (or other NoSQL system)

Pig: Another Language for MapReduce

Mahout: Machine Learning in MapReduce

Mahout is:
- a collection of algorithms, mainly focused on "the three C's" of machine learning
- written in Java
- largely implemented over Hadoop MapReduce
- invocable from the command line
- extensible, with the Java API

Mahout is not:
- a turnkey solution for doing machine learning
- always user-friendly

Machine Learning

"The three C's" of machine learning:
- Classification
- Clustering
- Collaborative filtering (recommenders)

Supervised Machine Learning: Classification
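
A small worked example may help: learn from labeled examples, then predict labels for new inputs. The sketch below uses a nearest-centroid classifier, chosen here only because it is short; it is not a Mahout algorithm, and the data and labels are made up. It needs Python 3.8 or later for `math.dist`.

```python
import math

def train_centroids(examples):
    # examples: list of (feature_vector, label). Average the vectors per label.
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            total[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def classify(centroids, features):
    # Predict the label whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

training = [([1.0, 1.0], "spam"), ([2.0, 1.0], "spam"),
            ([8.0, 9.0], "ham"), ([9.0, 8.0], "ham")]
model = train_centroids(training)
print(classify(model, [1.5, 2.0]))
print(classify(model, [7.0, 7.0]))
```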

Machine Learning: Clustering
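
Clustering groups unlabeled points by similarity. As one possible illustration, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The starting centroids are fixed so the run is deterministic; the points and the function name `kmeans` are assumptions for the example. It needs Python 3.8 or later for `math.dist`.

```python
import math

def kmeans(points, centroids, iterations=10):
    # Lloyd's algorithm: alternate assignment and centroid-update steps.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
centroids, clusters = kmeans(points, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(centroids)
```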

Machine Learning: Collaborative Filtering for Recommenders
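
One simple form of collaborative filtering recommends items that often appear together with items a user already has. The sketch below counts item co-occurrence across user histories; it is an illustration under made-up data, not Mahout's implementation, and the names `cooccurrence` and `recommend` are hypothetical.

```python
from collections import Counter
from itertools import permutations

def cooccurrence(histories):
    # Count how often each ordered pair of distinct items shares a user.
    counts = Counter()
    for items in histories.values():
        counts.update(permutations(sorted(set(items)), 2))
    return counts

def recommend(histories, user, top_n=2):
    counts = cooccurrence(histories)
    seen = set(histories[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

histories = {
    "u1": ["hive", "pig", "sqoop"],
    "u2": ["hive", "pig"],
    "u3": ["hive", "mahout"],
    "u4": ["pig"],
}
print(recommend(histories, "u4"))
```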

Simple Enterprise Deployment: Hadoop as ETL Appliance

Detour: Oozie, Workflow within Hadoop

Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
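
The four steps above can be sketched as an ordered list of actions that stops at the first failure. This is a Python analogy only: Oozie itself defines workflows in XML, and every function below is a hypothetical stand-in that edits a dictionary instead of launching a real job.

```python
def run_workflow(steps):
    # Run named actions in order; stop at the first failure, much as a
    # workflow manager would route to an error node.
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None

# Hypothetical stand-ins for the four actions; real ones would launch jobs.
staging = {"files": ["old.csv"], "rows": [], "warehouse": []}

def clear_staging():
    staging["files"].clear()

def sqoop_import():
    staging["rows"] = [("ann", 3), ("bob", 5)]

def hive_transform():
    staging["rows"] = [(name.upper(), n) for name, n in staging["rows"]]

def sqoop_export():
    staging["warehouse"].extend(staging["rows"])

steps = [("clear-staging", clear_staging), ("sqoop-import", sqoop_import),
         ("hive-transform", hive_transform), ("sqoop-export", sqoop_export)]
completed, error = run_workflow(steps)
print(completed, error)
```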

Hadoop: The Bigger Picture

Data Science with Hadoop

A data scientist will:
1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).

Visualization: Needs Visualization Software

Thank you!
Questions? Contributions?
Glynn Durham, Senior Instructor, Cloudera
glynn@[Link]