05 ImpalaHiveIntro

This chapter introduces Impala and Hive, two SQL querying tools for data stored in HDFS/HBase. It covers their functionalities, comparisons to traditional databases, and how organizations utilize them for data analysis. The chapter also highlights the differences in performance and features between Impala and Hive.

Uploaded by

priyanka chowdary


Introduction to Impala and Hive

Chapter 5

201509

Course Chapters

Course Introduction
1 Introduction

Introduction to Hadoop
2 Introduction to Hadoop and the Hadoop Ecosystem
3 Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4 Importing Relational Data with Apache Sqoop
5 Introduction to Impala and Hive
6 Working with Tables in Impala
7 Data Formats
8 Data File Partitioning

Ingesting Streaming Data
9 Capturing Data with Apache Flume

Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames

Course Conclusion
18 Conclusion

© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.

Introduction to Impala and Hive

In this chapter you will learn

§ What Hive is
§ What Impala is
§ How Impala and Hive compare
§ How to query data using Impala and Hive
§ How Hive and Impala differ from a relational database
§ Ways in which organizations use Hive and Impala

Chapter Topics

Importing and Modeling Structured Data: Introduction to Impala and Hive

§ Introduction to Impala and Hive
§ Why Use Impala and Hive?
§ Querying Data With Impala and Hive
§ Comparing Hive and Impala to Traditional Databases

Introduction to Impala and Hive (1)

§ Impala and Hive are both tools that provide SQL querying of data stored in HDFS / HBase

SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

(Diagram: Hadoop cluster, with data stored in HDFS / HBase)

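To see what the sample query returns, it can be tried on a few rows locally. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Hive or Impala, and the table layouts and data values are assumptions made up for the example, not part of the course data set.

```python
import sqlite3

# Hypothetical toy data; SQLite is only a local stand-in for Hive/Impala here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, zipcode TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, cost REAL);
    INSERT INTO customers VALUES (1, '63101'), (2, '63105'), (3, '94110');
    INSERT INTO orders VALUES
        (10, 1, 20.0), (11, 1, 5.0), (12, 2, 40.0), (13, 3, 99.0);
""")

# The query text from the slide, unchanged.
QUERY = """
    SELECT zipcode, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON (customers.cust_id = orders.cust_id)
    WHERE zipcode LIKE '63%'
    GROUP BY zipcode
    ORDER BY total DESC;
"""

# Each row is (zipcode, total), largest total first.
rows = conn.execute(QUERY).fetchall()
for zipcode, total in rows:
    print(zipcode, total)
```

Only zip codes starting with 63 survive the WHERE clause, so the customer in 94110 is excluded before the totals are computed.
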
Introduction to Impala and Hive (2)

§ Apache Hive is a high-level abstraction on top of MapReduce
– Uses HiveQL
– Generates MapReduce or Spark* jobs that run on the Hadoop cluster
– Originally developed at Facebook around 2007
– Now an open-source Apache project
§ Cloudera Impala is a high-performance dedicated SQL engine
– Uses Impala SQL
– Inspired by Google’s Dremel project
– Query latency measured in milliseconds
– Developed at Cloudera in 2012
– Open-source with an Apache license

* Hive-on-Spark is currently in beta testing

What’s the Difference?

§ Hive has more features
– E.g. complex data types (arrays, maps) and full support for windowing analytics
– Highly extensible
– Commonly used for batch processing
§ Impala is much faster
– Specialized SQL engine offers 5x to 50x better performance
– Ideal for interactive queries and data analysis
– More features being added over time

High-Level Overview

SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

Hive
• Parse HiveQL
• Make optimizations
• Plan execution
• Submit job(s) to cluster
• Monitor progress
(Diagram: Data Processing Engine (MapReduce), on the Hadoop cluster, on HDFS)

Impala
• Parse Impala SQL
• Make optimizations
• Plan execution
• Execute query on the cluster
(Diagram: Hadoop cluster, on HDFS)

Why Use Hive and Impala?

§ Brings large-scale data analysis to a broader audience
– No software development experience required
– Leverage existing knowledge of SQL
§ More productive than writing MapReduce or Spark directly
– Five lines of HiveQL/Impala SQL might be equivalent to 200 lines or more of Java
§ Offers interoperability with other systems
– Extensible through Java and external scripts
– Many business intelligence (BI) tools support Hive and/or Impala

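The productivity point can be illustrated on a small scale. The sketch below is not MapReduce or Java; it is a hedged Python comparison, with made-up records and hypothetical function names, between writing the filter, group, sum, and sort steps by hand and stating the same result as one SQL statement (run here through sqlite3 as a local stand-in for Hive or Impala).

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records: (zipcode, cost) pairs.
records = [("63101", 20.0), ("63105", 40.0), ("63101", 5.0), ("94110", 99.0)]


def totals_by_hand(rows):
    """Filter, group, sum, and sort explicitly, as procedural code must."""
    totals = defaultdict(float)
    for zipcode, cost in rows:
        if zipcode.startswith("63"):   # WHERE zipcode LIKE '63%'
            totals[zipcode] += cost    # GROUP BY zipcode, SUM(cost)
    # ORDER BY total DESC
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def totals_with_sql(rows):
    """State the same result declaratively (SQLite as a local stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (zipcode TEXT, cost REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn.execute(
        "SELECT zipcode, SUM(cost) AS total FROM sales "
        "WHERE zipcode LIKE '63%' GROUP BY zipcode ORDER BY total DESC"
    ).fetchall()


by_hand = totals_by_hand(records)
with_sql = totals_with_sql(records)
print(by_hand == with_sql)
```

On a real cluster the hand-written version would also need code for distributing the work, which is where the gap in line counts grows.
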
Use Case: Sentiment Analysis

§ Many organizations use Hive or Impala to analyze social media coverage

(Chart: Mentions of Dualcore on Social Media (by Hour), split into Negative, Neutral, and Positive, for hours 07 through 18)

Use Case: Business Intelligence

§ Many leading business intelligence tools support Hive and Impala

(Screenshot: Dualcore Inc. Dashboard, with panels for Revenue by Period, Order Shipments Per Month, Top States for In-Store Sales, and Suppliers by Region)

Querying Data With Hive and Impala

Interacting with Hive and Impala

§ Hive and Impala offer many interfaces for running queries
– Command-line shell
  – Impala: impala-shell
  – Hive: hive
– Hue Web UI
  – Hive Query Editor
  – Impala Query Editor
  – Metastore Manager
– ODBC / JDBC

Starting the Impala Shell

§ You can execute statements in the Impala shell
– This interactive tool is similar to the shell in MySQL
§ Execute the impala-shell command to start the shell
– Some log messages truncated to better fit the slide

$ impala-shell
Connected to localhost.localdomain:21000
Server version: impalad version 2.1.0-cdh5 (…)
Welcome to the Impala shell.
[localhost.localdomain:21000] >

§ Use the -i hostname:port option to connect to a different server

$ impala-shell -i myserver.example.com:21000
[myserver.example.com:21000] >

§ Enter semicolon-terminated statements at the prompt
– Press [Enter] to execute a query or command
– Use the quit command to exit the shell
§ Use impala-shell --help for a full list of options
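
The shell can also be driven from a script rather than the interactive prompt. The sketch below is one possible way to do that from Python: it only builds the argument list and prints it, so it runs without a cluster. The helper name impala_shell_command is hypothetical, the -q option (pass a single query on the command line) is not covered on the slides, and the table name is taken from the earlier sample query.

```python
import shlex


def impala_shell_command(query, host=None, port=21000):
    """Build an argument list for a non-interactive impala-shell call."""
    args = ["impala-shell"]
    if host is not None:
        # -i hostname:port selects the server, as shown on the slide.
        args += ["-i", f"{host}:{port}"]
    # -q passes one query on the command line instead of at the prompt.
    args += ["-q", query]
    return args


cmd = impala_shell_command(
    "SELECT COUNT(*) FROM customers", host="myserver.example.com"
)
# Print the command as it would be typed in a terminal.
print(shlex.join(cmd))
# To actually run it on a machine with impala-shell installed:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list, rather than one string, avoids quoting problems when the query itself contains spaces or quotes.
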

You can use Hue to…
– Query data with Hive or Impala
– View and manage the Metastore

§ The Impala and Hive Query editors are nearly identical
– Enter, edit, save and execute queries
– Choose a database
– Explore schema and sample data
– View results, logs, reports, etc.

Comparing Hive and Impala to Traditional Databases

§ Client-server database management systems have many strengths
– Very fast response time
– Support for transactions
– Allow modification of existing records
– Can serve thousands of simultaneous clients
§ Your Hadoop cluster is not an RDBMS
– Hive generates processing engine jobs (MapReduce) from HiveQL queries
– Limitations of HDFS and MapReduce still apply
– Impala is faster but not intended for the throughput speed required for an OLTP database
– No transaction support
                            Relational Database   Hive           Impala
Query language              SQL (full)            SQL (subset)   SQL (subset)
Update individual records   Yes                   No             No
Delete individual records   Yes                   No             No
Transactions                Yes                   No             No
Index support               Extensive             Limited        No
Latency                     Very low              High           Low
Data size                   Terabytes             Petabytes      Petabytes
Conclusion

§ Impala and Hive are tools for performing SQL queries on data in HDFS
§ HiveQL and Impala SQL are very similar to SQL-92
– Easy to learn for those with relational database experience
– However, they do not replace your RDBMS
§ Hive generates jobs that run on the Hadoop cluster data processing engine
– Runs MapReduce jobs on Hadoop based on HiveQL statements
§ Impala executes queries directly on the Hadoop cluster
– Uses a very fast specialized SQL engine, not MapReduce

The following offer more information on topics discussed in this chapter
§ Programming Hive (O’Reilly book)
– http://tiny.cloudera.com/programminghive
§ Data Analysis with Hadoop and Hive at Orbitz
– http://tiny.cloudera.com/dac09b
§ Wired Article on Impala
– http://tiny.cloudera.com/wiredimpala
