What is Hive?
• Hive is a data warehousing framework that runs on top of Hadoop.
• It enables users to run queries over huge volumes of data.
• Its basic function is to convert SQL queries into MapReduce jobs.
• Hive was initially developed by Facebook; the Apache Software Foundation later took it up and developed it further as open source under the name Apache Hive.
What is Hive?
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive makes it easy to perform operations like:
• Data encapsulation
• Ad-hoc queries
• Analysis of huge datasets
Important characteristics of Hive
• In Hive, tables and databases are created first, and data is then loaded into these tables.
• As a data warehouse, Hive is designed for managing and querying only structured data that is stored in tables.
• When dealing with structured data, plain MapReduce lacks the optimization and usability features that the Hive framework provides.
• Query optimization refers to executing a query in an effective way in terms of performance.
• Hive's SQL-inspired language shields the user from the complexity of MapReduce programming.
• It reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, for ease of learning.
• Hadoop programs work on flat files.
• So, Hive can use directory structures to "partition" data to improve performance on certain queries.
Hive vs. Relational Databases
• Relational databases follow "schema on write": a table is created first, and data inserted into it must conform to that schema. On relational database tables, functions like insertions, updates, and modifications can be performed.
• Hive is "schema on read" only, so functions like update and modification traditionally don't work with it: a Hive query in a typical cluster runs on multiple DataNodes, where it is not practical to update and modify data in place across multiple nodes.
• Hive instead supports a "write once, read many" pattern, although the latest Hive versions add limited support for updates on transactional tables (a sketch follows below).
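A minimal sketch of that last point, assuming a recent Hive version with ACID transactions enabled on the cluster (the user_accounts table is hypothetical; row-level updates require an ORC-backed table marked transactional):

-- Hypothetical transactional table; ACID updates in Hive require ORC
-- storage and the 'transactional' table property.
CREATE TABLE user_accounts (id INT, email STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level update, possible only on such transactional tables.
UPDATE user_accounts SET email = 'new@example.com' WHERE id = 42;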
Hive Components
Two main components:
• A high-level language (HiveQL): a set of commands
• Two execution modes:
  • Local: reads/writes to the local file system
  • MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Two modes of use:
• Interactive mode: a console
• Batch mode: submit a script
Hive deals with Structured Data
• Hive Data Models:
• The Hive data models contain the following components:
• Databases: namespaces organized at three levels: tables, partitions, and buckets
• Tables: each maps to an HDFS directory
• Partitions: map to sub-directories under the table's directory
• Buckets (or clusters): map to files under each partition
• Very similar to SQL and relational DBs
Partitions:
• Partitioning means dividing a table into coarse-grained parts based on the value of a partition column, such as 'date'. This makes it faster to run queries on slices of the data.
• The partition keys determine how data is stored: each unique value of a partition key defines a partition of the table. Partitions are often named after dates for convenience. This is similar to 'block splitting' in HDFS.
• Allows users to efficiently retrieve rows; a sketch follows below.
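A minimal HiveQL sketch of partitioning (the page_views table and its columns are hypothetical):

-- Each distinct value of the partition column 'ds' becomes its own
-- HDFS sub-directory, e.g. .../page_views/ds=2012-02-24/
CREATE TABLE page_views (userid BIGINT, url STRING)
PARTITIONED BY (ds STRING);

-- Filtering on the partition column lets Hive scan only the matching
-- sub-directory rather than the whole table (partition pruning).
SELECT url FROM page_views WHERE ds = '2012-02-24';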
• Buckets:
• Buckets give extra structure to the data that can be used for more efficient queries: data is split based on the hash of a column, mainly for parallelism.
• Data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table; a sketch follows below.
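A minimal sketch of bucketing (the clicks table is hypothetical):

-- Rows within each partition are hashed on userid into 32 files.
CREATE TABLE clicks (userid BIGINT, url STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Bucketing also enables efficient sampling: read just one bucket file
-- instead of the full partition.
SELECT * FROM clicks TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);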
Hive Architecture
Hive consists of three core parts:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Hive Clients
• Hive provides different drivers for communicating with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
• For Java-related applications, it provides JDBC drivers.
• For any other type of application, it provides ODBC drivers.
• These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services
• Client interactions with Hive are performed through Hive services.
• If a client wants to perform any query-related operation in Hive, it has to communicate through Hive services.
• The CLI is the command-line interface that acts as a Hive service for DDL (Data Definition Language) operations.
• All drivers communicate with the Hive server and then with the main driver in the Hive services, as shown in the architecture diagram above.
• The driver present in the Hive services is the main driver; it communicates with all types of Thrift, JDBC, ODBC, and other client-specific applications.
• The driver passes the requests from the different applications to the metastore and file systems for further processing.
Hive Storage and Computing
Hive services such as the metastore, file system, and job client in turn communicate with Hive storage and perform the following actions:
• Metadata information of tables created in Hive is stored in the Hive "metastore database".
• Query results and data loaded into the tables are stored on the Hadoop cluster in HDFS.
Component diagram depicts the architecture of Hive
Job execution flow
The data flow in Hive follows this pattern:
1. The query is executed from the UI (user interface).
2. The driver interacts with the compiler to get the plan. (Here, "plan" refers to the query execution process and the gathering of its related metadata.)
3. The compiler creates the plan for the job to be executed, communicating with the metastore to request the metadata it needs.
4. The metastore sends the metadata information back to the compiler.
5. The compiler communicates the proposed plan for executing the query back to the driver.
6. The driver sends the execution plan to the execution engine.
Job execution flow (cont.)
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
• The EE first contacts the NameNode and then the DataNodes to get the values stored in tables.
• The EE fetches the desired records from the DataNodes; the actual table data resides only on the DataNodes.
• From the NameNode it fetches only the metadata information for the query.
• It collects the actual data from the DataNodes related to the mentioned query.
• The EE also communicates bi-directionally with the metastore in Hive to perform DDL (Data Definition Language) operations; this is where operations like CREATE, DROP, and ALTER on tables and databases are done.
• The metastore stores only information such as database names, table names, and column names; the data the query asks for is fetched from HDFS.
• The EE in turn communicates with Hadoop daemons such as the NameNode, the DataNodes, and the JobTracker to execute the query on top of the Hadoop file system.
Job execution flow (cont.)
8. The driver fetches the results.
9. The driver sends the results to the UI: once the execution engine (EE) has fetched the results from the DataNodes, it sends them back to the driver and on to the UI (front end).
• Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine.
• The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.
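The plan the compiler produces in steps 2-6 can be inspected directly with EXPLAIN; a small sketch, reusing the sample table defined in the Hive DDL Commands section below:

-- Prints the compiled execution plan (stages and operators) instead of
-- running the query.
EXPLAIN SELECT bar, count(1) FROM sample GROUP BY bar;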
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in MySQL and many other relational databases
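The metadata the metastore holds for a table can be viewed from HiveQL; a small sketch using the sample table from the Hive DDL Commands section below:

-- Shows the columns, partition keys, HDFS location, and SerDe
-- recorded in the metastore for this table.
DESCRIBE FORMATTED sample;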
Physical Layout
• Warehouse directory in HDFS
• E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
• Partitions form subdirectories of tables
• Each table has a corresponding HDFS directory
• Actual data stored in flat files
• Users can associate a table with a serialization format
• Control-character-delimited text, or SequenceFiles
• With a custom SerDe, an arbitrary format can be used; a sketch follows below
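A minimal sketch of associating a table with a serialization format (the logs table is hypothetical):

-- Control-A-delimited text files stored under the warehouse directory,
-- e.g. /user/hive/warehouse/logs/
CREATE TABLE logs (ts STRING, line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;

-- The same table could instead be stored as a SequenceFile by ending
-- the statement with: STORED AS SEQUENCEFILE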
Hive DDL Commands
-- Create a table with two columns, partitioned on the ds column.
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- List tables whose names match the regular expression (ending in 's').
SHOW TABLES '.*s';
-- Show the table's columns and their types.
DESCRIBE sample;
-- Add a column to the existing schema.
ALTER TABLE sample ADD COLUMNS (new_col INT);
-- Drop the table (for managed tables, this also deletes the data).
DROP TABLE sample;
• A table in Hive is an HDFS directory in Hadoop.
• The schema is known at creation time (like a DB schema).
• Partitioned tables have "sub-directories", one for each partition (see the sketch below).
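Partition-related DDL; a small sketch using the same sample table (its partition column ds was declared above):

-- Adds a partition, creating the sub-directory .../sample/ds=2012-02-24/
ALTER TABLE sample ADD PARTITION (ds='2012-02-24');
-- Lists the partitions (sub-directories) of the table.
SHOW PARTITIONS sample;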
Hive DML
• Load data from the local file system; OVERWRITE deletes any previous data in the table:
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample;
• Load data from HDFS; without OVERWRITE, it augments the existing data:
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds='2012-02-24');
• A specific partition must be given for partitioned tables.
• Loaded data are files copied into HDFS under the corresponding table directory.
Hive QL – Join
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (join result):
pageid  age
1       25
2       25
1       32
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
The join runs as a Map phase, a shuffle/sort phase, and a Reduce phase:

Map: each table emits key = userid and value = <tag, column>, where the tag identifies the source table:
from page_view: 111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
from user: 111 → <2, 25>, 222 → <2, 32>

Shuffle and sort: records are grouped by userid, so each reducer sees all tagged values for one key:
111 → <1, 1>, <1, 2>, <2, 25>
222 → <1, 1>, <2, 32>

Reduce: for each userid, the page_view values are combined with the matching user values, producing the pv_users rows:
pageid  age
1       25
2       25
1       32
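When one side of the join (here, user) is small enough to fit in memory, the shuffle shown above can be avoided with a map-side join; a hedged sketch using Hive's MAPJOIN hint:

-- The small user table is loaded into each mapper's memory, so the
-- join completes in the map phase with no shuffle/reduce step.
INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);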
Hive QL – Group By
pv_users:
pageid  age
1       25
2       25
1       32
2       25

pageid_age_sum (result):
pageid  age  count
1       25   1
2       25   2
1       32   1
• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
Each mapper emits key = <pageid, age> and value = 1 for its rows; shuffle and sort group identical keys; each reducer sums the values:

Map output (mapper 1, rows (1,25) and (2,25)): <1,25> → 1, <2,25> → 1
Map output (mapper 2, rows (1,32) and (2,25)): <1,32> → 1, <2,25> → 1

After shuffle/sort: <1,25> → {1}, <1,32> → {1}, <2,25> → {1, 1}

Reduce output (pageid_age_sum):
pageid  age  count
1       25   1
1       32   1
2       25   2
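Hive can also pre-aggregate inside the mappers (a combiner-style optimization), shrinking what is shuffled to the reducers; a small sketch using the hive.map.aggr setting:

-- With map-side aggregation on, each mapper emits partial counts per
-- <pageid, age> instead of one record per input row.
SET hive.map.aggr=true;
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;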
Hive QL – Group By with Distinct
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

result:
pageid  count_distinct_userid
1       2
2       1
• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;