Module4 Hive Notes

Hive is a data warehousing tool built on Hadoop designed to process structured data and enable users to query and analyze Big Data. It utilizes a SQL-like language called HiveQL for ease of use and operates on data stored in Hadoop's HDFS, making it suitable for large-scale data analytics in various industries. Key components of Hive include Hive clients, Hive services, and a Meta Store that manages metadata for efficient query execution.

Hive Background

Big Data refers to huge amounts of data that are:


High in volume – large size
High in velocity – generated very fast
High in variety – different formats like text, images, logs, etc.
Traditional databases cannot handle such massive and fast-growing
data.

Why Hive?
Hive is a tool built on top of Hadoop.
It is designed to:
Process structured data (like rows & columns)
Allow users to query, analyze, and summarize Big Data.
Developed By:
Originally created by Facebook to manage their growing data.
Later, it became an open-source project under the Apache Software
Foundation.
Now widely used by many companies and industries.
Key Features of Hive:
Data Warehouse – Hive works like a warehouse for big structured data.
SQL-like Language (HiveQL) – You can write queries similar to SQL to work with Big Data.
Built on Hadoop – Uses Hadoop’s HDFS for storage and MapReduce for processing.
Handles Large Data Easily – Useful for querying TBs or PBs of data.
Used In:
Data analytics, Business intelligence, Big Data processing in companies
like Facebook, Netflix, and Amazon
ARCHITECTURE
Important components of HIVE:
● HIVE clients ● HIVE Services ● Storage and computing

Hive Clients
Hive clients are tools or drivers that allow applications to connect and
interact with the Hive system.
These clients act as a bridge between user applications and the Hive
server.
Thrift → For Thrift-based cross-language apps
JDBC → For Java-based applications
ODBC → For applications like Excel, Python, and BI tools
These clients communicate with the Hive server to run queries and
fetch results.
These drivers send user queries to the Hive Server.
Hive Server then processes the queries and returns the result back to
the client.
They help applications interact with Hive without knowing the internal
details of Hadoop or Hive.
Hive Services
Hive Services are core components that allow clients to interact with
Hive.
Any application or user that wants to run queries or commands in Hive
must use these services. When a user or application wants to run a Hive
query, it must go through Hive Services.
These services help manage, process, and route queries to the right
components in Hive.

A client (Java, Python, Excel, etc.) sends a query using JDBC/ODBC/Thrift.


The Hive server receives this query.
The main driver processes it.
The driver communicates with Meta Store to get table info.
The query is sent to the execution engine (MapReduce/Tez).
Final results are returned back to the client.
Interfaces and their roles:
Command Line Interface (CLI) – Used by users to execute DDL commands (like create table, drop table) directly from the terminal.
Hive Server – Main component that connects with all drivers (JDBC, ODBC, Thrift) and sends queries to the Hive engine.
Main Driver – Core part of Hive Services. It handles all requests from JDBC, ODBC, CLI, and other clients.
Meta Store – Stores all the metadata (like table names, column types, partitions).
Execution Engine / File System – Actually runs the query and processes the data stored in HDFS.
Storage and Computing in Hive:

Key Components Involved:


1 Meta Store (Meta Storage Database):
• Stores metadata of Hive tables.
• Metadata means: Table names, Column names and types, Partitions,
• File locations in HDFS.
• Helps Hive understand the structure of stored data.

2 File System (HDFS):


All actual table data and query results are saved in HDFS.
Hive reads/writes data from HDFS, not from a traditional RDBMS.

3 Job Client:
Responsible for submitting Hive queries to the Hadoop engine.
It communicates with the execution engine (like MapReduce) to
perform the processing tasks.
Hive Job Execution Flow in Hadoop

Flow of Query Execution in Hive:


✅ 1. Query Execution from UI
User writes and submits a query from Hive UI or command-line.

✅ 2. Driver → Compiler
The Driver sends the query to the Compiler.
Compiler starts creating an execution plan for the query.

✅ 3. Compiler → Meta Store


Compiler asks Meta Store for metadata (table names, columns, etc.).

✅ 4. Meta Store → Compiler


Meta Store returns the required metadata to the Compiler.

✅ 5. Compiler → Driver
Compiler shares the execution plan with the Driver.
✅ 6. Driver → Execution Engine
Driver sends the plan to the Execution Engine (EE) to start execution.

⚙️ 7. Execution Engine (EE):


EE connects Hive with Hadoop to run the query.
🔸 EE first contacts:
NameNode: To know where data is located.
DataNodes: To get the actual data stored in HDFS.
🔸 EE handles:
Reading data from DataNodes.
Reading metadata from NameNode.
🔸 Talking to Meta Store for any DDL (CREATE, DROP) operations.

💡 Hadoop Daemons involved:


NameNode: Stores metadata
DataNodes: Store real data
Job Tracker: Manages MapReduce jobs (if using MapReduce)
🔁 Final Steps:
✅ 8. Data Returned to Driver
EE fetches the result data from DataNodes and gives it to Driver.
✅ 9. Driver → UI
Driver sends final result to the UI or user.
Component roles:
Driver – Manages query flow and communication
Compiler – Parses and plans the query
Meta Store – Stores table/column metadata
Execution Engine – Connects Hive to Hadoop; handles actual data execution
NameNode – Stores metadata about where files are located
DataNode – Stores actual table data
Warehouse Directory and Meta-store
1. Hive Table Creation: Managed vs External Tables
✅ Managed Table (Default):
When you create a table in Hive without specifying it as EXTERNAL,
Hive automatically manages the table.
Hive will move the data into its warehouse directory (usually
/user/hive/warehouse).
Hive takes full control — creating, reading, and deleting both table and
data.
✅ External Table:
You can create a table as EXTERNAL if your data is already stored in
another HDFS location.
Hive will not move or manage the data; it just refers to the location
you specify.
Deleting the external table will not delete the actual data.
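A minimal sketch of both table types (the HDFS path '/data/logs' is an illustrative assumption):

-- Managed table: Hive moves its data under the warehouse directory
CREATE TABLE managed_logs (user_id INT, action STRING);

-- External table: Hive only points at data you manage yourself
CREATE EXTERNAL TABLE external_logs (user_id INT, action STRING)
LOCATION '/data/logs';

-- DROP TABLE managed_logs deletes metadata AND data;
-- DROP TABLE external_logs deletes only the metadata.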
2. Hive Component Role in Hadoop Ecosystem
Hive handles:
Table structure (columns, data types)
Partitions
Input/output formats
Metadata info
It also knows where in HDFS the actual table data is stored.
3. Metadata Stored in the Meta-store
The Meta-store keeps all metadata information for Hive tables, including:
1. Database IDs – Unique IDs for each Hive database
2. Table and Index IDs – Unique IDs for tables and indexes
3. Table Creation Time – When the table was created
4. Index Creation Time – When the index was created
5. Input Format – Format used to read data
6. Output Format – Format used to write data
7. Partition Info – Details of how data is partitioned
8. Table Location – HDFS path where data is stored
4. Meta-store Details
Meta-store is a central repository for Hive's metadata.
It is implemented using a relational database like:
MySQL, PostgreSQL
It provides API access through the Meta-store Service, so that:
Hive
External tools
Hive clients (JDBC/ODBC)
can fetch metadata info quickly and efficiently.

Hive by default creates managed tables and stores their data in its
warehouse directory.
You can create external tables that point to data stored outside.
All metadata (like schema, table structure, formats, creation time) is
stored in the Meta-store.
The Meta-store is crucial for query planning, execution, and data
management in Hive.
HIVE QUERY LANGUAGE

HiveQL (Hive Query Language) is a SQL-like language used to query and process structured data in Apache Hive.
It allows users to interact with big data using familiar SQL-style commands without writing complex MapReduce code.
HiveQL is inspired by traditional relational database SQL (like MySQL and Oracle SQL).
Features and benefits:
SQL-like syntax – Easy for SQL users to learn
Abstracts MapReduce – No need to write low-level code
CLI Support – Command Line Interface to run HiveQL commands
Good Integration – Supports MySQL-like, SQL-92, and SQL:2003 features
Useful Extensions – Special Hive features for Big Data like TRANSFORM, MAP, REDUCE
Syntax Basics
HiveQL supports commands like:
CREATE TABLE ...
SELECT * FROM ...
INSERT INTO ...
SHOW TABLES;
Every command must end with a semicolon ;
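For instance, a short session (the schema below is a plausible assumption for the employee table used later in these notes):

CREATE TABLE employee (id INT, name STRING, salary DOUBLE, age INT);
SHOW TABLES;
SELECT * FROM employee;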

Characteristics:
Case Insensitive – Commands like SHOW TABLES and show tables both work (except string comparisons).
Auto-completion – Press the Tab key to auto-complete keywords/functions in the Hive CLI.
Schema-Oriented – Works with tables, rows, columns, and data types like an RDBMS.
Lazy Initialization – On first use, Hive takes a few seconds to set up the Meta-store in the background.
SQL Features in HiveQL
SQL-92 Standard – ✅ Supported (improving over time)
MySQL Syntax – ✅ Borrowed a lot
Oracle Syntax – ✅ Partial
SQL:2003 Window Functions – ✅ Supported
Custom Extensions – ✅ MAP, REDUCE, TRANSFORM, MULTITABLE INSERT

Advanced HiveQL Capabilities


✅ Window/Analytic Functions (e.g., RANK(), ROW_NUMBER()) – see the sketch after this list
✅ User-Defined Functions (UDFs) for custom logic
✅ Joins, Group By, Order By, Having
✅ Partitioning Support
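An illustrative window-function sketch, reusing the employee table from these notes; the dept column is an assumed addition:

-- Rank employees by salary within each department (dept is an assumed column)
SELECT name, dept, salary,
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM employee;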
Methods of loading data:
1. Insert using VALUES – Hive DML
2. Insert using SELECT query – Hive DML
3. Load using HDFS commands – Hadoop-level operation

Loading Data Into Hive:


1 Insert Using VALUES (Manual Row Insertion)
You can add new rows manually into a Hive table using INSERT INTO
with VALUES.
Syntax: INSERT INTO TABLE employee VALUES (15, 'Bala', 150000, 35);
This command appends a new row to the employee table.
Check data using:
SELECT * FROM employee;
2 Insert Using SELECT Query (Query-Based Insertion)
You can use a query to insert multiple rows into a table from another
table.
✅ Syntax: INSERT INTO TABLE employee
SELECT * FROM emp WHERE dno = 45;
This will copy matching rows from the emp table into the employee
table.

3 Load Using LOAD Command


You can load bulk data from a file into a Hive table using the LOAD
DATA command.
✅ From Local File System:
LOAD DATA LOCAL INPATH '/path/to/localfile'
INTO TABLE employee;

✅ From HDFS:
LOAD DATA INPATH '/path/in/hdfs'
INTO TABLE employee;
Using HDFS Shell Commands (Direct Upload)
Hive tables are stored as directories in HDFS. You can manually copy
files using HDFS commands.
Find table path:
DESCRIBE FORMATTED employee;
Check the Location: field, for example: /data/employee

Upload using HDFS:


hadoop fs -put /path/to/localfile /data/employee

hadoop fs -copyFromLocal /path/to/localfile /data/employee

hadoop fs -moveFromLocal /path/to/localfile /data/employee


HIVE BUILT-IN FUNCTIONS
Mathematical Functions
round(double a) → BIGINT – Returns the nearest BIGINT value to a. Example: round(10.7) = 11
floor(double a) → BIGINT – Returns the greatest BIGINT ≤ a. Example: floor(10.7) = 10
ceil(double a) → BIGINT – Returns the smallest BIGINT ≥ a. Example: ceil(10.2) = 11
rand() / rand(int seed) → DOUBLE – Returns a pseudo-random number (different for each row)
String Functions
concat(string A, string B, ...) → STRING – Concatenates strings. Example: concat('Hive','QL') = 'HiveQL'
substr(string A, int start) → STRING – Returns the substring from start to the end of the string
substr(string A, int start, int length) → STRING – Returns the substring of the specified length from start
upper(string A) / ucase(string A) → STRING – Converts all characters of A to uppercase
lower(string A) / lcase(string A) → STRING – Converts all characters of A to lowercase
trim(string A) → STRING – Removes spaces from both ends
ltrim(string A) → STRING – Removes spaces from the beginning
rtrim(string A) → STRING – Removes spaces from the end
regexp_replace(string A, string B, string C) → STRING – Replaces parts of A matching regex B with C
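A quick illustrative check of a few of these, using the temp table that appears in the examples later in these notes:

hive> SELECT concat('Hive','QL'), upper('hive'), trim('  hi  ') FROM temp;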
Collection Functions
size(Map<K, V>) → INT – Returns the number of elements in a Map
size(Array<T>) → INT – Returns the number of elements in an Array
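For example, on literal collections built with Hive's array() and map() constructors (this returns 3 and 2):

hive> SELECT size(array(1, 2, 3)), size(map('a', 1, 'b', 2)) FROM temp;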

Type Conversion Functions:
cast(<expr> as <type>) → value of <type> – Converts expr to <type>. Example: cast('1' as BIGINT) = 1
JSON Function
get_json_object(string json_string, string path) → STRING – Extracts a value from JSON using JSON path syntax. Returns NULL if the JSON is invalid.
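An illustrative call (the JSON literal and the '$.name' path are sample inputs; this returns 'Hive'):

hive> SELECT get_json_object('{"name":"Hive"}', '$.name') FROM temp;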
Date and Time Functions
from_unixtime(int unixtime) → STRING – Converts Unix epoch time to a readable datetime
to_date(string timestamp) → STRING – Extracts the date part from a timestamp. Example: "2024-01-01 10:30:00" → "2024-01-01"
year(string date) → INT – Extracts the year. Example: "2024-07-19" → 2024
month(string date) → INT – Extracts the month. Example: "2024-07-19" → 7
day(string date) → INT – Extracts the day. Example: "2024-07-19" → 19
Hive Built-in Function Examples
1. round() Function
Purpose: Rounds a decimal (floating-point number) to the nearest
whole number.
Syntax: SELECT round(<number>) FROM <table>;
Example: hive> SELECT round(2.6) FROM temp;

2. floor() Function
Purpose: Returns the largest integer less than or equal to the input
number.
Syntax: SELECT floor(<number>) FROM <table>;
Example: hive> SELECT floor(2.6) FROM temp;

3. ceil() Function
Purpose: Returns the smallest integer greater than or equal to the
input number.
Syntax: SELECT ceil(<number>) FROM <table>;
Example: hive> SELECT ceil(2.6) FROM temp;
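Following the same pattern, a hedged example for the date functions listed earlier (the date literals are sample inputs):

4. to_date() Function
Purpose: Extracts the date part from a timestamp string.
Example: hive> SELECT to_date('2024-01-01 10:30:00'), year('2024-07-19') FROM temp;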
What are Aggregate Functions?
Aggregate functions perform calculations on a set (group) of rows and
return a single value.
These are similar to SQL aggregate functions and are used with GROUP
BY, HAVING, and SELECT statements.
count(*) → BIGINT – Returns the total number of rows.
count(expr) → BIGINT – Returns the number of non-NULL values in the expression.
sum(col) → DOUBLE – Returns the sum of values in a column.
sum(DISTINCT col) → DOUBLE – Returns the sum of distinct values only.
avg(col) → DOUBLE – Returns the average (mean) of all values in a column.
avg(DISTINCT col) → DOUBLE – Returns the average of distinct values only.
min(col) → DOUBLE – Returns the minimum value in the column.
max(col) → DOUBLE – Returns the maximum value in the column.
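A minimal sketch combining these with GROUP BY and HAVING, assuming an employee table with dno and salary columns (dno follows the emp example in the loading section above):

SELECT dno, count(*), avg(salary), max(salary)
FROM employee
GROUP BY dno
HAVING count(*) > 1;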
Joins in Hive
What is a Join in Hive?
Joins are used to combine rows from two or more tables based on a
related column (usually primary key & foreign key).
Hive supports only equality joins (no inequality joins like > or <).
Hive Join is similar to SQL JOINs but with some restrictions and unique
characteristics.
Important Points
✅ Only equality joins are supported.
✅ You can join more than two tables in a single query.
✅ Hive joins are NOT commutative (i.e., A JOIN B ≠ B JOIN A).
✅ All joins are left-associative.
✅ Hive supports various types of joins including:

INNER JOIN LEFT OUTER JOIN RIGHT OUTER JOIN


FULL OUTER JOIN LEFT SEMI JOIN CROSS JOIN
Hive Join Syntax

SELECT cols
FROM table1
[INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER | LEFT SEMI |
CROSS] JOIN table2
ON (table1.col = table2.col);

Types of Hive Joins


1. INNER JOIN
Returns only matched rows from both tables.
Records without matches are excluded.
✅ Example:
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c
JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
2. LEFT OUTER JOIN
Returns all rows from the left table.
If no match found in the right table → NULL values will be returned for
right table columns.
✅ Example:
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

3. RIGHT OUTER JOIN


Returns all rows from the right table.
If no match found in the left table → NULL values will be returned for
left table columns.
✅ Example:
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
4. FULL OUTER JOIN
Returns all rows from both left and right tables.
If no match is found, fills with NULLs on the side with no match.
✅ Example:
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
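LEFT SEMI JOIN and CROSS JOIN are listed above but not illustrated; a brief sketch of LEFT SEMI JOIN, which returns left-table rows that have a match (only left-table columns may appear in the SELECT):

SELECT c.ID, c.NAME
FROM CUSTOMERS c
LEFT SEMI JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);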
Partitioning in Hive:
Partitioning is the technique of dividing a table’s data into smaller, more
manageable sections based on the values of specific columns (e.g.,
date, country).
How it works:
Input records are stored in separate files/directories depending on the
value of the partitioned column.
Multi-column Partitioning:
You can partition by more than one column to create a multi-level
directory structure.
Example: Partition by date and country stores data as:
/logs/date=2025-07-19/country=IN/

How to Create Partitions


Use the PARTITIONED BY clause when creating the Hive table.
Example:
CREATE TABLE logs (
user_id INT,
action STRING
)
PARTITIONED BY (log_date STRING, country STRING);
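To place data into a specific partition, the partition values can be named explicitly; a hedged sketch using the logs table above (the local file path is illustrative):

LOAD DATA LOCAL INPATH '/path/to/logfile'
INTO TABLE logs
PARTITION (log_date = '2025-07-19', country = 'IN');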

Advantages of Partitioning
Improves Query Speed
Only the relevant partition (directory) is scanned instead of the
entire dataset.
Example: If a table is partitioned by country and your query filters on
country='IN', only the country=IN directory is scanned.
Efficient Execution
Execution load is horizontally distributed across partitions.
Optimized Disk Usage
Data is stored in smaller slices, reducing the I/O overhead.
Limitations of Partitioning
Too Many Partitions = Overhead
Creates a large number of small files/directories in HDFS.
Puts pressure on the NameNode, as it stores metadata in memory.
Slower Grouping Queries
Partitioning improves WHERE clause filtering but may slow down
queries that use GROUP BY.
MapReduce Overhead
Each partition = one task = one JVM.
For small files or too many partitions, JVM startup/teardown time
can be more than actual processing.
Partitioning summary:
Purpose – Split data into sections based on column values
Command – PARTITIONED BY in table creation
Improves – Query performance and resource management
Risks – Too many partitions slow down performance
Best For – Filtering queries (WHERE clause)
Querying Data:

Sorting and Aggregating in Hive

1. ORDER BY
What it does:
Performs a total sort of the entire data globally (across all reducers).
Limitation:
Since it sorts all data using a single reducer, it is not scalable for large
datasets.
Use Case:
When you need a fully sorted result from start to end.
Syntax:
SELECT * FROM records ORDER BY year ASC;
2. SORT BY
What it does:
Sorts data within each reducer (i.e., partial/local sorting).
How it differs from ORDER BY:
Parallelized sorting — faster on large datasets.
Doesn't guarantee a globally sorted output.
Use Case:
When local sorting is enough (e.g., for intermediate processing).
Syntax:

SELECT year, temperature FROM records SORT BY year;


3. DISTRIBUTE BY
What it does:
Controls how rows are distributed among reducers.

Why use it:


Ensures all rows with the same key (e.g., year) go to the same reducer.

Common use:
When you want to group data by a column for further operations like
aggregation or ranking.

Example:
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
4. CLUSTER BY
What it does:
Shortcut for using both DISTRIBUTE BY and SORT BY on the same
column(s).
Use Case:
When you want to distribute and sort by the same key (e.g., year).
Syntax:
SELECT year, temperature FROM records CLUSTER BY year;
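For comparison, the CLUSTER BY query above behaves like spelling out both clauses on the same key (with an ascending sort):

SELECT year, temperature FROM records
DISTRIBUTE BY year
SORT BY year;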
