Module4 Hive Notes

Hive is a data warehousing tool built on Hadoop designed to process structured data and enable users to query and analyze Big Data. It utilizes a SQL-like language called HiveQL for ease of use and operates on data stored in Hadoop's HDFS, making it suitable for large-scale data analytics in various industries. Key components of Hive include Hive clients, Hive services, and a Meta Store that manages metadata for efficient query execution.

Hive Background

Big Data refers to huge amounts of data that are:


High in volume – large size
High in velocity – generated very fast
High in variety – different formats like text, images, logs, etc.
Traditional databases cannot handle such massive and fast-growing
data.

Why Hive?
Hive is a tool built on top of Hadoop.
It is designed to:
Process structured data (like rows & columns)
Allow users to query, analyze, and summarize Big Data.
Developed By:
Originally created by Facebook to manage their growing data.
Later, it became an open-source project under the Apache Software
Foundation.
Now widely used by many companies and industries.
Key Features of Hive:
Data Warehouse – Hive works like a warehouse for big structured data.
SQL-like Language (HiveQL) – You can write queries similar to SQL to work with Big Data.
Built on Hadoop – Uses Hadoop’s HDFS for storage and MapReduce for processing.
Handles Large Data Easily – Useful for querying TBs or PBs of data.
Used In:
Data analytics, Business intelligence, Big Data processing in companies
like Facebook, Netflix, and Amazon
ARCHITECTURE
Important components of HIVE:
● HIVE clients ● HIVE Services ● Storage and computing

Hive Clients
Hive clients are tools or drivers that allow applications to connect and
interact with the Hive system.
These clients act as a bridge between user applications and the Hive
server.
Thrift → For Thrift-based cross-language apps
JDBC → For Java-based applications
ODBC → For applications like Excel, Python, and BI tools
These clients communicate with the Hive server to run queries and
fetch results.
These drivers send user queries to the Hive Server.
Hive Server then processes the queries and returns the result back to
the client.
They help applications interact with Hive without knowing the internal
details of Hadoop or Hive.
Hive Services
Hive Services are core components that allow clients to interact with
Hive.
Any application or user that wants to run queries or commands in Hive
must use these services. When a user or application wants to run a Hive
query, it must go through Hive Services.
These services help manage, process, and route queries to the right
components in Hive.

A client (Java, Python, Excel, etc.) sends a query using JDBC/ODBC/Thrift.


The Hive server receives this query.
The main driver processes it.
The driver communicates with Meta Store to get table info.
The query is sent to the execution engine (MapReduce/Tez).
Final results are returned back to the client.
Interfaces and their roles:
Command Line Interface (CLI) – Used by users to execute DDL commands (like create table, drop table) directly from the terminal.
Hive Server – Main component that connects with all drivers (JDBC, ODBC, Thrift) and sends queries to the Hive engine.
Main Driver – Core part of Hive Services. It handles all requests from JDBC, ODBC, CLI, and other clients.
Meta Store – Stores all the metadata (like table names, column types, partitions).
Execution Engine / File System – Actually runs the query and processes the data stored in HDFS.
Storage and Computing in Hive:

Key Components Involved:


1 Meta Store (Meta Storage Database):
• Stores metadata of Hive tables.
• Metadata means: Table names, Column names and types, Partitions,
• File locations in HDFS.
• Helps Hive understand the structure of stored data.

2 File System (HDFS):


All actual table data and query results are saved in HDFS.
Hive reads/writes data from HDFS, not from a traditional RDBMS.

3 Job Client:
Responsible for submitting Hive queries to the Hadoop engine.
It communicates with the execution engine (like MapReduce) to
perform the processing tasks.
Hive Job Execution Flow in Hadoop

Flow of Query Execution in Hive:


✅ 1. Query Execution from UI
User writes and submits a query from Hive UI or command-line.

✅ 2. Driver → Compiler
The Driver sends the query to the Compiler.
Compiler starts creating an execution plan for the query.

✅ 3. Compiler → Meta Store


Compiler asks Meta Store for metadata (table names, columns, etc.).

✅ 4. Meta Store → Compiler


Meta Store returns the required metadata to the Compiler.

✅ 5. Compiler → Driver
Compiler shares the execution plan with the Driver.
✅ 6. Driver → Execution Engine
Driver sends the plan to the Execution Engine (EE) to start execution.

⚙️ 7. Execution Engine (EE):


EE connects Hive with Hadoop to run the query.
🔸 EE first contacts:
NameNode: To know where data is located.
DataNodes: To get the actual data stored in HDFS.
🔸 EE handles:
Reading data from DataNodes.
Reading metadata from NameNode.
🔸 Talking to Meta Store for any DDL (CREATE, DROP) operations.

💡 Hadoop Daemons involved:


NameNode: Stores metadata
DataNodes: Store real data
Job Tracker: Manages MapReduce jobs (if using MapReduce)
🔁 Final Steps:
✅ 8. Data Returned to Driver
EE fetches the result data from DataNodes and gives it to Driver.
✅ 9. Driver → UI
Driver sends final result to the UI or user.
Component roles:
Driver – Manages query flow and communication
Compiler – Parses and plans the query
Meta Store – Stores table/column metadata
Execution Engine – Connects Hive to Hadoop; handles actual data execution
NameNode – Stores metadata about where files are located
DataNode – Stores actual table data
Warehouse Directory and Meta-store
1. Hive Table Creation: Managed vs External Tables
✅ Managed Table (Default):
When you create a table in Hive without specifying it as EXTERNAL,
Hive automatically manages the table.
Hive will move the data into its warehouse directory (usually
/user/hive/warehouse).
Hive takes full control — creating, reading, and deleting both table and
data.
✅ External Table:
You can create a table as EXTERNAL if your data is already stored in
another HDFS location.
Hive will not move or manage the data; it just refers to the location
you specify.
Deleting the external table will not delete the actual data.
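A minimal sketch of both table types (the HDFS path '/data/logs' is an illustrative assumption):

-- Managed table: Hive moves its data under the warehouse directory
CREATE TABLE managed_logs (user_id INT, action STRING);

-- External table: Hive only points at data you manage yourself
CREATE EXTERNAL TABLE external_logs (user_id INT, action STRING)
LOCATION '/data/logs';

-- DROP TABLE managed_logs deletes metadata AND data;
-- DROP TABLE external_logs deletes only the metadata.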
2. Hive Component Role in Hadoop Ecosystem
Hive handles:
Table structure (columns, data types)
Partitions
Input/output formats
Metadata info
It also knows where in HDFS the actual table data is stored.
3. Metadata Stored in the Meta-store
The Meta-store keeps all metadata information for Hive tables, including:
1. Database IDs – Unique IDs for each Hive database
2. Table and Index IDs – Unique IDs for tables and indexes
3. Table Creation Time – When the table was created
4. Index Creation Time – When the index was created
5. Input Format – Format used to read data
6. Output Format – Format used to write data
7. Partition Info – Details of how data is partitioned
8. Table Location – HDFS path where data is stored
4. Meta-store Details
Meta-store is a central repository for Hive's metadata.
It is implemented using a relational database like:
MySQL, PostgreSQL
It provides API access through the Meta-store Service, so that:
Hive
External tools
Hive clients (JDBC/ODBC)
can fetch metadata info quickly and efficiently.

Hive by default creates managed tables and stores their data in its
warehouse directory.
You can create external tables that point to data stored outside.
All metadata (like schema, table structure, formats, creation time) is
stored in the Meta-store.
The Meta-store is crucial for query planning, execution, and data
management in Hive.
HIVE QUERY LANGUAGE

HiveQL (Hive Query Language) is a SQL-like language used to query and process structured data in Apache Hive.
It allows users to interact with big data using familiar SQL-style commands without writing complex MapReduce code.
HiveQL is inspired by traditional relational database SQL (like MySQL and Oracle SQL).
Features and benefits:
SQL-like syntax – Easy for SQL users to learn
Abstracts MapReduce – No need to write low-level code
CLI Support – Command Line Interface to run HiveQL commands
Good Integration – Supports MySQL-like, SQL-92, and SQL:2003 features
Useful Extensions – Special Hive features for Big Data like TRANSFORM, MAP, REDUCE
Syntax Basics
HiveQL supports commands like:
CREATE TABLE ...
SELECT * FROM ...
INSERT INTO ...
SHOW TABLES;
Every command must end with a semicolon ;
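For instance, a short session (the schema below is a plausible assumption for the employee table used later in these notes):

CREATE TABLE employee (id INT, name STRING, salary DOUBLE, age INT);
SHOW TABLES;
SELECT * FROM employee;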

Characteristics:
Case Insensitive – Commands like SHOW TABLES and show tables both work (except string comparisons).
Auto-completion – Press the Tab key to auto-complete keywords/functions in the Hive CLI.
Schema-Oriented – Works with tables, rows, columns, and data types like an RDBMS.
Lazy Initialization – On first use, Hive takes a few seconds to set up the Meta-store in the background.
SQL Features in HiveQL
SQL-92 Standard – ✅ Supported (improving over time)
MySQL Syntax – ✅ Borrowed a lot
Oracle Syntax – ✅ Partial
SQL:2003 Window Functions – ✅ Supported
Custom Extensions – ✅ MAP, REDUCE, TRANSFORM, MULTITABLE INSERT

Advanced HiveQL Capabilities


✅ Window/Analytic Functions (e.g., RANK(), ROW_NUMBER()) – see the sketch after this list
✅ User-Defined Functions (UDFs) for custom logic
✅ Joins, Group By, Order By, Having
✅ Partitioning Support
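An illustrative window-function sketch, reusing the employee table from these notes; the dept column is an assumed addition:

-- Rank employees by salary within each department (dept is an assumed column)
SELECT name, dept, salary,
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM employee;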
Methods of loading data:
1. Insert using VALUES – Hive DML
2. Insert using SELECT query – Hive DML
3. Load using HDFS commands – Hadoop-level operation

Loading Data Into Hive:


1 Insert Using VALUES (Manual Row Insertion)
You can add new rows manually into a Hive table using INSERT INTO
with VALUES.
Syntax: INSERT INTO TABLE employee VALUES (15, 'Bala', 150000, 35);
This command appends a new row to the employee table.
Check data using:
SELECT * FROM employee;
2 Insert Using SELECT Query (Query-Based Insertion)
You can use a query to insert multiple rows into a table from another
table.
✅ Syntax: INSERT INTO TABLE employee
SELECT * FROM emp WHERE dno = 45;
This will copy matching rows from the emp table into the employee
table.

3 Load Using LOAD Command


You can load bulk data from a file into a Hive table using the LOAD
DATA command.
✅ From Local File System:
LOAD DATA LOCAL INPATH '/path/to/localfile'
INTO TABLE employee;

✅ From HDFS:
LOAD DATA INPATH '/path/in/hdfs'
INTO TABLE employee;
Using HDFS Shell Commands (Direct Upload)
Hive tables are stored as directories in HDFS. You can manually copy
files using HDFS commands.
Find table path:
DESCRIBE FORMATTED employee;
Check the Location: field, for example: /data/employee

Upload using HDFS:


hadoop fs -put /path/to/localfile /data/employee

hadoop fs -copyFromLocal /path/to/localfile /data/employee

hadoop fs -moveFromLocal /path/to/localfile /data/employee


HIVE BUILT-IN FUNCTIONS
Mathematical Functions
round(double a) → BIGINT – Returns the nearest BIGINT value to a. Example: round(10.7) = 11
floor(double a) → BIGINT – Returns the greatest BIGINT ≤ a. Example: floor(10.7) = 10
ceil(double a) → BIGINT – Returns the smallest BIGINT ≥ a. Example: ceil(10.2) = 11
rand() / rand(int seed) → DOUBLE – Returns a pseudo-random number (different for each row)
String Functions
concat(string A, string B, ...) → STRING – Concatenates strings. Example: concat('Hive','QL') = 'HiveQL'
substr(string A, int start) → STRING – Returns the substring from start to the end of the string
substr(string A, int start, int length) → STRING – Returns the substring of the specified length from start
upper(string A) / ucase(string A) → STRING – Converts all characters of A to uppercase
lower(string A) / lcase(string A) → STRING – Converts all characters of A to lowercase
trim(string A) → STRING – Removes spaces from both ends
ltrim(string A) → STRING – Removes spaces from the beginning
rtrim(string A) → STRING – Removes spaces from the end
regexp_replace(string A, string B, string C) → STRING – Replaces parts of A matching regex B with C
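A quick illustrative check of a few of these, using the temp table that appears in the examples later in these notes:

hive> SELECT concat('Hive','QL'), upper('hive'), trim('  hi  ') FROM temp;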
Collection Functions
size(Map<K, V>) → INT – Returns the number of elements in a Map
size(Array<T>) → INT – Returns the number of elements in an Array
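For example, on literal collections built with Hive's array() and map() constructors (this returns 3 and 2):

hive> SELECT size(array(1, 2, 3)), size(map('a', 1, 'b', 2)) FROM temp;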

Type Conversion Functions:
cast(<expr> as <type>) → value of <type> – Converts expr to <type>. Example: cast('1' as BIGINT) = 1
JSON Function
get_json_object(string json_string, string path) → STRING – Extracts a value from JSON using JSON path syntax. Returns NULL if the JSON is invalid.
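An illustrative call (the JSON literal and the '$.name' path are sample inputs; this returns 'Hive'):

hive> SELECT get_json_object('{"name":"Hive"}', '$.name') FROM temp;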
Date and Time Functions
from_unixtime(int unixtime) → STRING – Converts Unix epoch time to a readable datetime
to_date(string timestamp) → STRING – Extracts the date part from a timestamp. Example: "2024-01-01 10:30:00" → "2024-01-01"
year(string date) → INT – Extracts the year. Example: "2024-07-19" → 2024
month(string date) → INT – Extracts the month. Example: "2024-07-19" → 7
day(string date) → INT – Extracts the day. Example: "2024-07-19" → 19
Hive Built-in Function Examples
1. round() Function
Purpose: Rounds a decimal (floating-point number) to the nearest
whole number.
Syntax: SELECT round(<number>) FROM <table>;
Example: hive> SELECT round(2.6) FROM temp;

2. floor() Function
Purpose: Returns the largest integer less than or equal to the input
number.
Syntax: SELECT floor(<number>) FROM <table>;
Example: hive> SELECT floor(2.6) FROM temp;

3. ceil() Function
Purpose: Returns the smallest integer greater than or equal to the
input number.
Syntax: SELECT ceil(<number>) FROM <table>;
Example: hive> SELECT ceil(2.6) FROM temp;
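Following the same pattern, a hedged example for the date functions listed earlier (the date literals are sample inputs):

4. to_date() Function
Purpose: Extracts the date part from a timestamp string.
Example: hive> SELECT to_date('2024-01-01 10:30:00'), year('2024-07-19') FROM temp;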
What are Aggregate Functions?
Aggregate functions perform calculations on a set (group) of rows and
return a single value.
These are similar to SQL aggregate functions and are used with GROUP
BY, HAVING, and SELECT statements.
count(*) → BIGINT – Returns the total number of rows.
count(expr) → BIGINT – Returns the number of non-NULL values in the expression.
sum(col) → DOUBLE – Returns the sum of values in a column.
sum(DISTINCT col) → DOUBLE – Returns the sum of distinct values only.
avg(col) → DOUBLE – Returns the average (mean) of all values in a column.
avg(DISTINCT col) → DOUBLE – Returns the average of distinct values only.
min(col) → DOUBLE – Returns the minimum value in the column.
max(col) → DOUBLE – Returns the maximum value in the column.
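A minimal sketch combining these with GROUP BY and HAVING, assuming an employee table with dno and salary columns (dno follows the emp example in the loading section above):

SELECT dno, count(*), avg(salary), max(salary)
FROM employee
GROUP BY dno
HAVING count(*) > 1;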
Joins in Hive
What is a Join in Hive?
Joins are used to combine rows from two or more tables based on a
related column (usually primary key & foreign key).
Hive supports only equality joins (no inequality joins like > or <).
Hive Join is similar to SQL JOINs but with some restrictions and unique
characteristics.
Important Points
✅ Only equality joins are supported.
✅ You can join more than two tables in a single query.
✅ Hive joins are NOT commutative (i.e., A JOIN B ≠ B JOIN A).
✅ All joins are left-associative.
✅ Hive supports various types of joins including:

INNER JOIN LEFT OUTER JOIN RIGHT OUTER JOIN


FULL OUTER JOIN LEFT SEMI JOIN CROSS JOIN
Hive Join Syntax

SELECT cols
FROM table1
[INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER | LEFT SEMI |
CROSS] JOIN table2
ON (table1.col = table2.col);

Types of Hive Joins


1. INNER JOIN
Returns only matched rows from both tables.
Records without matches are excluded.
✅ Example:
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c
JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
2. LEFT OUTER JOIN
Returns all rows from the left table.
If no match found in the right table → NULL values will be returned for
right table columns.
✅ Example:
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

3. RIGHT OUTER JOIN


Returns all rows from the right table.
If no match found in the left table → NULL values will be returned for
left table columns.
✅ Example:
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
4. FULL OUTER JOIN
Returns all rows from both left and right tables.
If no match is found, fills with NULLs on the side with no match.
✅ Example:
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
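LEFT SEMI JOIN and CROSS JOIN are listed above but not illustrated; a brief sketch of LEFT SEMI JOIN, which returns left-table rows that have a match (only left-table columns may appear in the SELECT):

SELECT c.ID, c.NAME
FROM CUSTOMERS c
LEFT SEMI JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);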
Partitioning in Hive:
Partitioning is the technique of dividing a table’s data into smaller, more
manageable sections based on the values of specific columns (e.g.,
date, country).
How it works:
Input records are stored in separate files/directories depending on the
value of the partitioned column.
Multi-column Partitioning:
You can partition by more than one column to create a multi-level
directory structure.
Example: Partition by date and country stores data as:
/logs/date=2025-07-19/country=IN/

How to Create Partitions


Use the PARTITIONED BY clause when creating the Hive table.
Example:
CREATE TABLE logs (
user_id INT,
action STRING
)
PARTITIONED BY (log_date STRING, country STRING);
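To place data into a specific partition, the partition values can be named explicitly; a hedged sketch using the logs table above (the local file path is illustrative):

LOAD DATA LOCAL INPATH '/path/to/logfile'
INTO TABLE logs
PARTITION (log_date = '2025-07-19', country = 'IN');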

Advantages of Partitioning
Improves Query Speed
Only the relevant partition (directory) is scanned instead of the
entire dataset.
Example: If a table is partitioned by country and your query filters on
country='IN', only the country=IN directory is scanned.
Efficient Execution
Execution load is horizontally distributed across partitions.
Optimized Disk Usage
Data is stored in smaller slices, reducing the I/O overhead.
Limitations of Partitioning
Too Many Partitions = Overhead
Creates a large number of small files/directories in HDFS.
Puts pressure on the NameNode, as it stores metadata in memory.
Slower Grouping Queries
Partitioning improves WHERE clause filtering but may slow down
queries that use GROUP BY.
MapReduce Overhead
Each partition = one task = one JVM.
For small files or too many partitions, JVM startup/teardown time
can be more than actual processing.
Partitioning summary:
Purpose – Split data into sections based on column values
Command – PARTITIONED BY in table creation
Improves – Query performance and resource management
Risks – Too many partitions slow down performance
Best For – Filtering queries (WHERE clause)
Querying Data:

Sorting and Aggregating in Hive

1. ORDER BY
What it does:
Performs a total sort of the entire data globally (across all reducers).
Limitation:
Since it sorts all data using a single reducer, it is not scalable for large
datasets.
Use Case:
When you need a fully sorted result from start to end.
Syntax:
SELECT * FROM records ORDER BY year ASC;
2. SORT BY
What it does:
Sorts data within each reducer (i.e., partial/local sorting).
How it differs from ORDER BY:
Parallelized sorting — faster on large datasets.
Doesn't guarantee a globally sorted output.
Use Case:
When local sorting is enough (e.g., for intermediate processing).
Syntax:

SELECT year, temperature FROM records SORT BY year;


3. DISTRIBUTE BY
What it does:
Controls how rows are distributed among reducers.

Why use it:


Ensures all rows with the same key (e.g., year) go to the same reducer.

Common use:
When you want to group data by a column for further operations like
aggregation or ranking.

Example:
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
4. CLUSTER BY
What it does:
Shortcut for using both DISTRIBUTE BY and SORT BY on the same
column(s).
Use Case:
When you want to distribute and sort by the same key (e.g., year).
Syntax:
SELECT year, temperature FROM records CLUSTER BY year;
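For comparison, the CLUSTER BY query above behaves like spelling out both clauses on the same key (with an ascending sort):

SELECT year, temperature FROM records
DISTRIBUTE BY year
SORT BY year;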
