0% found this document useful (0 votes)
19 views39 pages

Big Data Analytics Module-4

Hive is a data warehousing tool developed by Facebook that operates on top of Hadoop, enabling data searching, management, and analysis. It features an SQL dialect called HiveQL, but has limitations such as lack of support for updates and real-time queries. Hive integrates with MapReduce and HDFS, allowing for efficient data processing and querying through its architecture components like Hive Server, Metastore, and execution engine.

Uploaded by

Yohima Shetty
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views39 pages

Big Data Analytics Module-4

Hive is a data warehousing tool developed by Facebook that operates on top of Hadoop, enabling data searching, management, and analysis. It features an SQL dialect called HiveQL, but has limitations such as lack of support for updates and real-time queries. Hive integrates with MapReduce and HDFS, allowing for efficient data processing and querying through its architecture components like Hive Server, Metastore, and execution engine.

Uploaded by

Yohima Shetty
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 39

Module-4

MapReduce, Hive and Pig


HIVE
• Hive was created by Facebook.

• Hive is a data warehousing tool and is also a data store on


the top of Hadoop.

• Enterprises uses a data warehouse as large data repositories


that are designed to enable the Searching, managing, and
analyzing the data.

• Additionally, also manages the volumes of data.


HIVE Features
Hive Characteristics
1. Has the capability to translate queries into MapReduce jobs.

2. Supports web interfaces as well.

3. Provides an SQL dialect (Hive Query Language, abbreviated HiveQL or


HQL).
HIVE Limitations
1. Not a full database. Main disadvantage is that Hive does not provide update,
alter and deletion of records in the database.

2. Not developed for unstructured data.

3. Not designed for real-time queries.

4. Performs the partition always from the last column.


Hive Architecture
Hive architecture components are:

• Hive Server(Thrift) - An optional service that allows a remote client to


submit requests to Hive and retrieve results.

• Hive CLI (Command Line Interface) - Popular interface to interact with


Hive.

• Web Interface - Hive can be accessed using a web browser as well. The
URL http:// hadoop:<port no.> / hwi command can be used to access
Hive through the web.
• Metastore - It is a central repository that stores the schema or
metadata of tables, databases, columns in a table, their data
types.

• Hive Driver - It processes and manages the execution of


queries.
Hive Data Types and File Formats
Hive has three Collection data types
HIVE file formats and their descriptions
Hive Data Model
Hive Integration and Workflow Steps
Hive integrates with the MapReduce and HDFS. Figure below shows the dataflow
sequences and workflow steps between Hive and Hadoop.
1. Execute Query: Hive interface sends a query to DatabaseDriver to
execute the query.

2. Get Plan: Driver sends the query to query compiler that parses the
query to check the syntax and query plan or the requirement of the
query.

3. Get Metadata: Compiler sends metadata request to Metastore (of


any database, such as MySQL).

4. Send Metadata: Metastore sends metadata as a response to


compiler.
5. Send Plan: Compiler checks the requirement and resends the plan
to driver.

6. Execute Plan: Driver sends the execute plan to execution engine.

7. Execute Job: Internally, the process of execution job is a


MapReduce job. The execution engine sends the job to JobTracker,
which is in Name node and it assigns this job to TaskTracker, which is
in Data node. Then , the query executes the job.
8. Metadata Operations: Meanwhile the execution engine can
execute the metadata operations with Metastore.

9. Fetch Result: Execution engine receives the results from Data


nodes.

10. Send Results: Execution engine sends the result to Driver.

11. Send Results: Driver sends the results to Hive Interfaces.


HIVEQL
Hive Query Language (abbreviated HiveQL) is for querying the large datasets which
reside in the HDFS environment.

HiveQL script commands enable data definition, data manipulation and query
processing.
HiveQL Data Definition Language (DDL)
HiveQL database commands for data definition for DBs and Tables are

CREATE DATABASE,

SHOW DATABASE (list of all DBs),

CREATE SCHEMA,

CREATE TABLE.
Following are HiveQL commands which create a table:
Example: on Creating a Table
Creating a Database

Showing Database
Dropping a Database

RESTRICT-Think of it as safe mode: You can’t delete the database unless it’s empty.
CASCADE-Forces deletion of the database and all its contents (tables, views, functions, etc.).

Example- Usages of database commands for CREATE, SHOW and DROP.

CREATE DATABASE IF NOT EXISTS toys_companyDB;


SHOW DATABASES;
Drop Database toys_companyDB.
HiveQL Data Manipulation Language (DML)
HiveQL commands for data manipulation are

USE <database name>,

DROP DATABASE,

DROP SCHEMA,

ALTER TABLE,

DROP TABLE, and

LOAD DATA.
Loading Data into HIVE DB

LOCAL (Optional) -If used, it means the file is on the local file system (like your laptop).
If omitted, Hive assumes the file is in HDFS (Hadoop Distributed File System).

INPATH '<file path>‘ - Specifies the path of the file you want to load.

OVERWRITE (Optional)- If used, it will delete existing data in the table (or partition) before loading the
new data.

PARTITION (...) (Optional)- If the table is partitioned, you must specify which partition to load the data
into.
Partitioning
• Table partitioning refers to dividing the table data into some parts based on the
values of particular set of columns.

• Hive organizes tables into partitions.

• Partition makes querying easy and fast.

• This is because SELECT is then from the smaller number of column fields.
Aggregation
Hive supports the following built-in aggregation functions.
Join
• A JOIN clause combines columns of two or more tables, based on a relation
between them.
• HiveQL Join is more or less similar to SQL JOINS.
• Example:
Join Example
Left outer Join Example
Right Outer Join Example
Full Outer Join Example
Group By Clause

You might also like