Module-4
MapReduce, Hive and Pig
HIVE
• Hive was created by Facebook.
• Hive is a data warehousing tool and is also a data store on
the top of Hadoop.
• Enterprises uses a data warehouse as large data repositories
that are designed to enable the Searching, managing, and
analyzing the data.
• Additionally, also manages the volumes of data.
HIVE Features
Hive Characteristics
1. Has the capability to translate queries into MapReduce jobs.
2. Supports web interfaces as well.
3. Provides an SQL dialect (Hive Query Language, abbreviated HiveQL or
HQL).
HIVE Limitations
1. Not a full database. Main disadvantage is that Hive does not provide update,
alter and deletion of records in the database.
2. Not developed for unstructured data.
3. Not designed for real-time queries.
4. Performs the partition always from the last column.
Hive Architecture
Hive architecture components are:
• Hive Server(Thrift) - An optional service that allows a remote client to
submit requests to Hive and retrieve results.
• Hive CLI (Command Line Interface) - Popular interface to interact with
Hive.
• Web Interface - Hive can be accessed using a web browser as well. The
URL http:// hadoop:<port no.> / hwi command can be used to access
Hive through the web.
• Metastore - It is a central repository that stores the schema or
metadata of tables, databases, columns in a table, their data
types.
• Hive Driver - It processes and manages the execution of
queries.
Hive Data Types and File Formats
Hive has three Collection data types
HIVE file formats and their descriptions
Hive Data Model
Hive Integration and Workflow Steps
Hive integrates with the MapReduce and HDFS. Figure below shows the dataflow
sequences and workflow steps between Hive and Hadoop.
1. Execute Query: Hive interface sends a query to DatabaseDriver to
execute the query.
2. Get Plan: Driver sends the query to query compiler that parses the
query to check the syntax and query plan or the requirement of the
query.
3. Get Metadata: Compiler sends metadata request to Metastore (of
any database, such as MySQL).
4. Send Metadata: Metastore sends metadata as a response to
compiler.
5. Send Plan: Compiler checks the requirement and resends the plan
to driver.
6. Execute Plan: Driver sends the execute plan to execution engine.
7. Execute Job: Internally, the process of execution job is a
MapReduce job. The execution engine sends the job to JobTracker,
which is in Name node and it assigns this job to TaskTracker, which is
in Data node. Then , the query executes the job.
8. Metadata Operations: Meanwhile the execution engine can
execute the metadata operations with Metastore.
9. Fetch Result: Execution engine receives the results from Data
nodes.
10. Send Results: Execution engine sends the result to Driver.
11. Send Results: Driver sends the results to Hive Interfaces.
HIVEQL
Hive Query Language (abbreviated HiveQL) is for querying the large datasets which
reside in the HDFS environment.
HiveQL script commands enable data definition, data manipulation and query
processing.
HiveQL Data Definition Language (DDL)
HiveQL database commands for data definition for DBs and Tables are
CREATE DATABASE,
SHOW DATABASE (list of all DBs),
CREATE SCHEMA,
CREATE TABLE.
Following are HiveQL commands which create a table:
Example: on Creating a Table
Creating a Database
Showing Database
Dropping a Database
RESTRICT-Think of it as safe mode: You can’t delete the database unless it’s empty.
CASCADE-Forces deletion of the database and all its contents (tables, views, functions, etc.).
Example- Usages of database commands for CREATE, SHOW and DROP.
CREATE DATABASE IF NOT EXISTS toys_companyDB;
SHOW DATABASES;
Drop Database toys_companyDB.
HiveQL Data Manipulation Language (DML)
HiveQL commands for data manipulation are
USE <database name>,
DROP DATABASE,
DROP SCHEMA,
ALTER TABLE,
DROP TABLE, and
LOAD DATA.
Loading Data into HIVE DB
LOCAL (Optional) -If used, it means the file is on the local file system (like your laptop).
If omitted, Hive assumes the file is in HDFS (Hadoop Distributed File System).
INPATH '<file path>‘ - Specifies the path of the file you want to load.
OVERWRITE (Optional)- If used, it will delete existing data in the table (or partition) before loading the
new data.
PARTITION (...) (Optional)- If the table is partitioned, you must specify which partition to load the data
into.
Partitioning
• Table partitioning refers to dividing the table data into some parts based on the
values of particular set of columns.
• Hive organizes tables into partitions.
• Partition makes querying easy and fast.
• This is because SELECT is then from the smaller number of column fields.
Aggregation
Hive supports the following built-in aggregation functions.
Join
• A JOIN clause combines columns of two or more tables, based on a relation
between them.
• HiveQL Join is more or less similar to SQL JOINS.
• Example:
Join Example
Left outer Join Example
Right Outer Join Example
Full Outer Join Example
Group By Clause