Introduc)on
to
Impala
and
Hive
Chapter
5
201509
Course
Chapters
1
Introduc)on
Course
Introduc)on
2
Introduc)on
to
Hadoop
and
the
Hadoop
Ecosystem
Introduc)on
to
Hadoop
3
Hadoop
Architecture
and
HDFS
4
Impor)ng
Rela)onal
Data
with
Apache
Sqoop
5
Introduc*on
to
Impala
and
Hive
Impor*ng
and
Modeling
6
Working
with
Tables
in
Impala
Structured
Data
7
Data
Formats
8
Data
File
Par))oning
9
Capturing
Data
with
Apache
Flume
Inges)ng
Streaming
Data
10
Spark
Basics
11
Working
with
RDDs
in
Spark
12
Aggrega)ng
Data
with
Pair
RDDs
13
Wri)ng
and
Deploying
Spark
Applica)ons
Distributed
Data
Processing
with
14
Parallel
Processing
in
Spark
Spark
15
Spark
RDD
Persistence
16
Common
PaDerns
in
Spark
Data
Processing
17
Spark
SQL
and
DataFrames
18
Conclusion
Course
Conclusion
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐2
Introduc)on
to
Impala
and
Hive
In
this
chapter
you
will
learn
§ What
Hive
is
§ What
Impala
is
§ How
Impala
and
Hive
Compare
§ How
to
query
data
using
Impala
and
Hive
§ How
Hive
and
Impala
differ
from
a
rela*onal
database
§ Ways
in
which
organiza*ons
use
Hive
and
Impala
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐3
Chapter
Topics
Impor*ng
and
Modeling
Structured
Introduc*on
to
Impala
and
Hive
Data
§ Introduc*on
to
Impala
and
Hive
§ Why
Use
Impala
and
Hive?
§ Querying
Data
With
Impala
and
Hive
§ Comparing
Hive
and
Impala
to
Tradi)onal
Databases
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐4
Introduc)on
to
Impala
and
Hive
(1)
§ Impala
and
Hive
are
both
SELECT zipcode, SUM(cost) AS total
tools
that
provide
SQL
FROM customers
querying
of
data
stored
in
JOIN orders
ON (customers.cust_id = orders.cust_id)
HDFS
/
HBase
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;
Hadoop
Cluster
HDFS
/
HBase
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐5
Introduc)on
to
Impala
and
Hive
(2)
§ Apache
Hive
is
a
high-‐level
abstrac*on
on
top
of
MapReduce
– Uses
HiveQL
– Generates
MapReduce
or
Spark*
jobs
that
run
on
the
Hadoop
cluster
– Originally
developed
at
Facebook
around
2007
– Now
an
open-‐source
Apache
project
§ Cloudera
Impala
is
a
high-‐performance
dedicated
SQL
engine
– Uses
Impala
SQL
– Inspired
by
Google’s
Dremel
project
– Query
latency
measured
in
milliseconds
– Developed
at
Cloudera
in
2012
– Open-‐source
with
an
Apache
license
*
Hive-‐on-‐Spark
is
currently
in
beta
tes)ng
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐6
What’s
the
Difference?
§ Hive
has
more
features
– E.g.
Complex
data
types
(arrays,
maps)
and
full
support
for
windowing
analy)cs
– Highly
extensible
– Commonly
used
for
batch
processing
§ Impala
is
much
faster
– Specialized
SQL
engine
offers
5x
to
50x
beDer
performance
– Ideal
for
interac)ve
queries
and
data
analysis
– More
features
being
added
over
)me
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐7
High-‐Level
Overview
SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;
• Parse
HiveQL
• Parse
Impala
SQL
• Make
op)miza)ons
• Make
op)miza)ons
• Plan
execu)on
• Plan
execu)on
• Submit
job(s)
to
cluster
• Execute
query
on
the
• Monitor
progress
cluster
Data
Processing
Engine
(MapReduce)
Hadoop
Cluster
Hadoop
HDFS
Cluster
HDFS
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐8
Chapter
Topics
Impor*ng
and
Modeling
Structured
Introduc*on
to
Impala
and
Hive
Data
§ Introduc)on
to
Impala
and
Hive
§ Why
Use
Impala
and
Hive?
§ Querying
Data
With
Impala
and
Hive
§ Comparing
Hive
to
Tradi)onal
Databases
§ Conclusion
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐9
Why
Use
Hive
and
Impala?
§ Brings
large-‐scale
data
analysis
to
a
broader
audience
– No
sofware
development
experience
required
– Leverage
exis)ng
knowledge
of
SQL
§ More
produc*ve
than
wri*ng
MapReduce
or
Spark
directly
– Five
lines
of
HiveQL/Impala
SQL
might
be
equivalent
to
200
lines
or
more
of
Java
§ Offers
interoperability
with
other
systems
– Extensible
through
Java
and
external
scripts
– Many
business
intelligence
(BI)
tools
support
Hive
and/or
Impala
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐10
Use
Case:
Sen)ment
Analysis
§ Many
organiza*ons
use
Hive
or
Impala
to
analyze
social
media
coverage
Mentions of Dualcore on Social Media (by Hour)
Negative
Neutral
Positive
07 08 09 10 11 12 13 14 15 16 17 18
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐12
Use
Case:
Business
Intelligence
§ Many
leading
business
intelligence
tools
support
Hive
and
Impala
Dualcore Inc. Dashboard
https://dashboard.example.com/ Google
Revenue by Period Order Shipments Per Month
Top States for In-Store Sales
Suppliers by Region
Japan: 31 suppliers
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐13
Chapter
Topics
Impor*ng
and
Modeling
Structured
Introduc*on
to
Impala
and
Hive
Data
§ Introduc)on
to
Impala
and
Hive
§ Why
Use
Impala
and
Hive?
§ Querying
Data
With
Hive
and
Impala
§ Comparing
Hive
to
Tradi)onal
Databases
§ Conclusion
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐14
Interac)ng
with
Hive
and
Impala
§ Hive
and
Impala
offer
many
interfaces
for
running
queries
– Command-‐line
shell
– Impala:
Impala-shell
– Hive:
Hive
– Hue
Web
UI
– Hive
Query
Editor
– Impala
Query
Editor
– Metastore
Manager
– ODBC
/
JDBC
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐15
Star)ng
the
Impala
Shell
§ You
can
execute
statements
in
the
Impala
shell
– This
interac)ve
tool
is
similar
to
the
shell
in
MySQL
§ Execute
the
impala-shell
command
to
start
the
shell
– Some
log
messages
truncated
to
beDer
fit
the
slide
$ impala-shell
Connected to localhost.localdomain:21000
Server version: impalad version 2.1.0-cdh5 (…)
Welcome to the Impala shell.
[localhost.localdomain:21000] >
§ Use
-i hostname:port
op*on
to
connect
to
a
different
server
$ impala-shell –i myserver.example.com:21000
[myserver.example.com:21000] >
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐16
Using
the
Impala
Shell
§ Enter
semicolon-‐terminated
statements
at
the
prompt
– Hit
[Enter]
to
execute
a
query
or
command
– Use
the
quit
command
to
exit
the
shell
§ Use impala-shell --help for
a
full
list
of
op*ons
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐17
Using
Hue
with
Hive
and
Impala
You
can
use
Hue
to…
Query
data
with
Hive
or
Impala
View
and
manage
the
Metastore
©
Copyright
2010a 2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-18
The
Hue
Query
Editor
§ The
Impala
and
Hive
Query
editors
are
nearly
iden*cal
Enter,
edit,
save
and
execute
queries
Choose
a
database
Explore
schema
and
sample
data
View
results,
logs,
reports,
etc.
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐19
Chapter
Topics
Impor*ng
and
Modeling
Structured
Introduc*on
to
Impala
and
Hive
Data
§ Introduc)on
to
Impala
and
Hive
§ Why
Use
Impala
and
Hive?
§ Querying
Data
With
Impala
and
Hive
§ Comparing
Hive
and
Impala
to
Tradi*onal
Databases
§ Conclusion
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐20
Your
Cluster
is
Not
a
Database
Server
§ Client-‐server
database
management
systems
have
many
strengths
– Very
fast
response
)me
– Support
for
transac)ons
– Allow
modifica)on
of
exis)ng
records
– Can
serve
thousands
of
simultaneous
clients
§ Your
Hadoop
cluster
is
not
an
RDBMS
– Hive
generates
processing
engine
jobs
(MapReduce)
from
HiveQL
queries
– Limita)ons
of
HDFS
and
MapReduce
s)ll
apply
– Impala
is
faster
but
not
intended
for
the
throughput
speed
required
for
an
OLTP
database
– No
transac)on
support
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐21
Comparing
Hive
and
Impala
To
A
Rela)onal
Database
Rela*onal
Database
Hive
Impala
Query
language SQL
(full) SQL
(subset) SQL
(subset)
Update
individual
Yes No No
records
Delete
individual
Yes No No
records
Transac*ons Yes No No
Index
support Extensive Limited No
Latency Very
low High Low
Data
size Terabytes Petabytes Petabytes
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐22
Chapter
Topics
Impor*ng
and
Modeling
Structured
Introduc*on
to
Impala
and
Hive
Data
§ Introduc)on
to
Impala
and
Hive
§ Why
Use
Impala
and
Hive?
§ Querying
Data
With
Impala
and
Hive
§ Comparing
Hive
and
Impala
to
Tradi)onal
Databases
§ Conclusion
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐23
Essen)al
Points
§ Impala
and
Hive
are
tools
for
performing
SQL
queries
on
data
in
HDFS
§ HiveQL
and
Impala
SQL
are
very
similar
to
SQL-‐92
– Easy
to
learn
for
those
with
rela)onal
database
experience
– However,
does
not
replace
your
RDBMS
§ Hive
generates
jobs
that
run
on
the
Hadoop
cluster
data
processing
engine
– Runs
MapReduce
jobs
on
Hadoop
based
on
HiveQL
statements
§ Impala
execute
queries
directly
on
the
Hadoop
cluster
– Uses
a
very
fast
specialized
SQL
engine,
not
MapReduce
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐24
Bibliography
The
following
offer
more
informa*on
on
topics
discussed
in
this
chapter
§ Programming
Hive
(O’Reilly
book)
– http://tiny.cloudera.com/programminghive
§ Data
Analysis
with
Hadoop
and
Hive
at
Orbitz
– http://tiny.cloudera.com/dac09b
§ Wired
Ar*cle
on
Impala
– http://tiny.cloudera.com/wiredimpala
©
Copyright
2010-‐2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriDen
consent
from
Cloudera.
5-‐25