What is Hive?
• Hive is a data warehousing framework that runs on top of Hadoop.
• It enables users to run queries over huge volumes of data.
• Its basic function is to convert SQL queries into MapReduce jobs.
• Hive was initially developed by Facebook; the Apache Software Foundation later took it up and developed it further as open source under the name Apache Hive.
What is Hive?
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive makes it easy to perform operations like:
• Data encapsulation
• Ad-hoc queries
• Analysis of huge datasets
Important characteristics of Hive
• In Hive, tables and databases are created first, and data is then loaded into these tables.
• As a data warehouse, Hive is designed for managing and querying only structured data that is stored in tables.
• When dealing with structured data, plain MapReduce lacks the optimization and usability features that the Hive framework provides.
• Query optimization refers to executing a query in an effective way in terms of performance.
• Hive's SQL-inspired language shields the user from the complexity of MapReduce programming.
• It reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, for ease of learning.
• Hadoop programs work on flat files.
• So, Hive can use directory structures to "partition" data to improve performance on certain queries.
Hive vs. Relational Databases
• Relational databases follow "schema on write": a table is created first, and data inserted into it must conform to that schema. On relational database tables, functions like insertions, updates, and modifications can be performed.
• Hive is "schema on read" only, so functions like update and modification traditionally don't work with it: a Hive query in a typical cluster runs on multiple DataNodes, where it is not practical to update and modify data in place across multiple nodes.
• Hive instead supports a "write once, read many" pattern, although the latest Hive versions add limited support for updates on transactional tables (a sketch follows below).
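A minimal sketch of that last point, assuming a recent Hive version with ACID transactions enabled on the cluster (the user_accounts table is hypothetical; row-level updates require an ORC-backed table marked transactional):

-- Hypothetical transactional table; ACID updates in Hive require ORC
-- storage and the 'transactional' table property.
CREATE TABLE user_accounts (id INT, email STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level update, possible only on such transactional tables.
UPDATE user_accounts SET email = 'new@example.com' WHERE id = 42;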
Hive Components
Two main components:
• A high-level language (HiveQL): a set of commands
• Two execution modes:
  • Local: reads/writes to the local file system
  • MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Two modes of use:
• Interactive mode: a console
• Batch mode: submit a script
Hive deals with Structured Data
• Hive Data Models:
• The Hive data models contain the following components:
• Databases: namespaces organized at three levels: tables, partitions, and buckets
• Tables: each maps to an HDFS directory
• Partitions: map to sub-directories under the table's directory
• Buckets (or clusters): map to files under each partition
• Very similar to SQL and relational DBs
Partitions:
• Partitioning means dividing a table into coarse-grained parts based on the value of a partition column, such as 'date'. This makes it faster to run queries on slices of the data.
• The partition keys determine how data is stored: each unique value of a partition key defines a partition of the table. Partitions are often named after dates for convenience. This is similar to 'block splitting' in HDFS.
• Allows users to efficiently retrieve rows; a sketch follows below.
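A minimal HiveQL sketch of partitioning (the page_views table and its columns are hypothetical):

-- Each distinct value of the partition column 'ds' becomes its own
-- HDFS sub-directory, e.g. .../page_views/ds=2012-02-24/
CREATE TABLE page_views (userid BIGINT, url STRING)
PARTITIONED BY (ds STRING);

-- Filtering on the partition column lets Hive scan only the matching
-- sub-directory rather than the whole table (partition pruning).
SELECT url FROM page_views WHERE ds = '2012-02-24';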
• Buckets:
• Buckets give extra structure to the data that can be used for more efficient queries: data is split based on the hash of a column, mainly for parallelism.
• Data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table; a sketch follows below.
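A minimal sketch of bucketing (the clicks table is hypothetical):

-- Rows within each partition are hashed on userid into 32 files.
CREATE TABLE clicks (userid BIGINT, url STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Bucketing also enables efficient sampling: read just one bucket file
-- instead of the full partition.
SELECT * FROM clicks TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);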
Hive Architecture
Hive consists of three core parts:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Hive Clients
• Hive provides different drivers for communicating with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
• For Java-related applications, it provides JDBC drivers.
• For any other type of application, it provides ODBC drivers.
• These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services
• Client interactions with Hive are performed through Hive services.
• If a client wants to perform any query-related operation in Hive, it has to communicate through Hive services.
• The CLI is the command-line interface that acts as a Hive service for DDL (Data Definition Language) operations.
• All drivers communicate with the Hive server and then with the main driver in the Hive services, as shown in the architecture diagram above.
• The driver present in the Hive services is the main driver; it communicates with all types of Thrift, JDBC, ODBC, and other client-specific applications.
• The driver passes the requests from the different applications to the metastore and file systems for further processing.
Hive Storage and Computing
Hive services such as the metastore, file system, and job client in turn communicate with Hive storage and perform the following actions:
• Metadata information of tables created in Hive is stored in the Hive "metastore database".
• Query results and data loaded into the tables are stored on the Hadoop cluster in HDFS.
Component diagram depicts the architecture of Hive
Job execution flow
The data flow in Hive follows this pattern:
1. The query is executed from the UI (user interface).
2. The driver interacts with the compiler to get the plan. (Here, "plan" refers to the query execution process and the gathering of its related metadata.)
3. The compiler creates the plan for the job to be executed, communicating with the metastore to request the metadata it needs.
4. The metastore sends the metadata information back to the compiler.
5. The compiler communicates the proposed plan for executing the query back to the driver.
6. The driver sends the execution plan to the execution engine.
Job execution flow (cont.)
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
• The EE first contacts the NameNode and then the DataNodes to get the values stored in tables.
• The EE fetches the desired records from the DataNodes; the actual table data resides only on the DataNodes.
• From the NameNode it fetches only the metadata information for the query.
• It collects the actual data from the DataNodes related to the mentioned query.
• The EE also communicates bi-directionally with the metastore in Hive to perform DDL (Data Definition Language) operations; this is where operations like CREATE, DROP, and ALTER on tables and databases are done.
• The metastore stores only information such as database names, table names, and column names; the data the query asks for is fetched from HDFS.
• The EE in turn communicates with Hadoop daemons such as the NameNode, the DataNodes, and the JobTracker to execute the query on top of the Hadoop file system.
Job execution flow (cont.)
8. The driver fetches the results.
9. The driver sends the results to the UI: once the execution engine (EE) has fetched the results from the DataNodes, it sends them back to the driver and on to the UI (front end).
• Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine.
• The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.
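The plan the compiler produces in steps 2-6 can be inspected directly with EXPLAIN; a small sketch, reusing the sample table defined in the Hive DDL Commands section below:

-- Prints the compiled execution plan (stages and operators) instead of
-- running the query.
EXPLAIN SELECT bar, count(1) FROM sample GROUP BY bar;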
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in MySQL and many other relational databases
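The metadata the metastore holds for a table can be viewed from HiveQL; a small sketch using the sample table from the Hive DDL Commands section below:

-- Shows the columns, partition keys, HDFS location, and SerDe
-- recorded in the metastore for this table.
DESCRIBE FORMATTED sample;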
Physical Layout
• Warehouse directory in HDFS
• E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
• Partitions form subdirectories of tables
• Each table has a corresponding HDFS directory
• Actual data stored in flat files
• Users can associate a table with a serialization format
• Control-character-delimited text, or SequenceFiles
• With a custom SerDe, an arbitrary format can be used; a sketch follows below
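A minimal sketch of associating a table with a serialization format (the logs table is hypothetical):

-- Control-A-delimited text files stored under the warehouse directory,
-- e.g. /user/hive/warehouse/logs/
CREATE TABLE logs (ts STRING, line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;

-- The same table could instead be stored as a SequenceFile by ending
-- the statement with: STORED AS SEQUENCEFILE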
Hive DDL Commands
-- Create a table with two columns, partitioned on the ds column.
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- List tables whose names match the regular expression (ending in 's').
SHOW TABLES '.*s';
-- Show the table's columns and their types.
DESCRIBE sample;
-- Add a column to the existing schema.
ALTER TABLE sample ADD COLUMNS (new_col INT);
-- Drop the table (for managed tables, this also deletes the data).
DROP TABLE sample;
• A table in Hive is an HDFS directory in Hadoop.
• The schema is known at creation time (like a DB schema).
• Partitioned tables have "sub-directories", one for each partition (see the sketch below).
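Partition-related DDL; a small sketch using the same sample table (its partition column ds was declared above):

-- Adds a partition, creating the sub-directory .../sample/ds=2012-02-24/
ALTER TABLE sample ADD PARTITION (ds='2012-02-24');
-- Lists the partitions (sub-directories) of the table.
SHOW PARTITIONS sample;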
Hive DML
• Load data from the local file system; OVERWRITE deletes any previous data in the table:
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample;
• Load data from HDFS; without OVERWRITE, it augments the existing data:
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds='2012-02-24');
• A specific partition must be given for partitioned tables.
• Loaded data are files copied into HDFS under the corresponding table directory.
Hive QL – Join
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (join result):
pageid  age
1       25
2       25
1       32
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
The join runs as a Map phase, a shuffle/sort phase, and a Reduce phase:

Map: each table emits key = userid and value = <tag, column>, where the tag identifies the source table:
from page_view: 111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
from user: 111 → <2, 25>, 222 → <2, 32>

Shuffle and sort: records are grouped by userid, so each reducer sees all tagged values for one key:
111 → <1, 1>, <1, 2>, <2, 25>
222 → <1, 1>, <2, 32>

Reduce: for each userid, the page_view values are combined with the matching user values, producing the pv_users rows:
pageid  age
1       25
2       25
1       32
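When one side of the join (here, user) is small enough to fit in memory, the shuffle shown above can be avoided with a map-side join; a hedged sketch using Hive's MAPJOIN hint:

-- The small user table is loaded into each mapper's memory, so the
-- join completes in the map phase with no shuffle/reduce step.
INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);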
Hive QL – Group By
pv_users:
pageid  age
1       25
2       25
1       32
2       25

pageid_age_sum (result):
pageid  age  count
1       25   1
2       25   2
1       32   1
• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
Each mapper emits key = <pageid, age> and value = 1 for its rows; shuffle and sort group identical keys; each reducer sums the values:

Map output (mapper 1, rows (1,25) and (2,25)): <1,25> → 1, <2,25> → 1
Map output (mapper 2, rows (1,32) and (2,25)): <1,32> → 1, <2,25> → 1

After shuffle/sort: <1,25> → {1}, <1,32> → {1}, <2,25> → {1, 1}

Reduce output (pageid_age_sum):
pageid  age  count
1       25   1
1       32   1
2       25   2
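Hive can also pre-aggregate inside the mappers (a combiner-style optimization), shrinking what is shuffled to the reducers; a small sketch using the hive.map.aggr setting:

-- With map-side aggregation on, each mapper emits partial counts per
-- <pageid, age> instead of one record per input row.
SET hive.map.aggr=true;
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;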
Hive QL – Group By with Distinct
page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

result:
pageid  count_distinct_userid
1       2
2       1
• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;