Apache HIVE
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
Open source.
For querying and analyzing large datasets stored in Hadoop files.
Hive uses a language called HiveQL (HQL), which is similar to SQL.
HiveQL automatically translates SQL-like queries into MapReduce jobs.
Hive organizes data into tables.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive Shell
The shell is the primary way in which we interact with Hive.
We can issue our commands or queries in HiveQL inside the Hive shell.
The Hive shell is similar to the MySQL shell.
It is the command-line interface for Hive.
In the Hive shell, users can run HQL queries.
We can run the Hive Shell in two modes which are:
Non-Interactive mode
◦ With -f option we can specify the location of a file which contains HQL queries.
◦ For example: hive -f my-script.q
Interactive mode
◦ We directly need to go to the hive shell and run the queries there.
◦ In hive shell, we can submit required queries manually and get the result.
Hive Services
Hive provides various services, like HiveServer2, Beeline, etc., to perform queries.
1. Beeline
Beeline is a command shell supported by HiveServer2, where the user can submit queries
and commands to the system. It is a JDBC client based on the SQLLine CLI (a pure-Java
console utility for connecting to relational databases and executing SQL queries).
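As a sketch, a typical Beeline session connects to HiveServer2 over JDBC and then submits ordinary HiveQL (the host, port, and table name below are placeholder assumptions; 10000 is HiveServer2's default port):

```sql
-- From the operating-system shell:   $ beeline
-- Inside Beeline, connect to a HiveServer2 instance:
--   beeline> !connect jdbc:hive2://localhost:10000
-- Once connected, plain HiveQL can be submitted:
SHOW DATABASES;
SELECT * FROM employee LIMIT 5;
```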
2. Hive Server 2
HiveServer2 is the successor of HiveServer1. It enables clients to execute queries against
Hive, allows multiple clients to submit requests and retrieve the final results, and is
designed to provide the best support for open API clients such as JDBC and ODBC.
3. Hive Driver
The Hive driver receives the HiveQL statements submitted by the user through the command shell.
It creates the session handles for the query and sends the query to the compiler.
4. Hive Compiler
The Hive compiler parses the query. It performs semantic analysis and type-checking on the
different query blocks and query expressions using the metadata stored in the metastore,
and generates an execution plan.
The execution plan created by the compiler is a DAG (Directed Acyclic Graph), where each
stage is a map/reduce job, an operation on HDFS, or a metadata operation.
5. Optimizer
The optimizer performs transformation operations on the execution plan and splits the tasks
to improve efficiency and scalability.
6. Execution Engine
After the compilation and optimization steps, the execution engine executes the plan
created by the compiler, in order of its dependencies, using Hadoop.
7. Metastore
Metastore is a central repository that stores the metadata information about the structure of tables
and partitions, including column and column type information.
It also stores information about the serializer and deserializer required for read/write
operations, and the HDFS files where the data is stored. This metastore is generally a
relational database.
Metastore provides a Thrift interface for querying and manipulating Hive metadata.
We can configure metastore in any of the two modes:
Remote: In remote mode, metastore is a Thrift service and is useful for non-Java
applications.
Embedded: In embedded mode, the client can directly interact with the metastore using
JDBC.
8. HCatalog
HCatalog is the table and storage management layer for Hadoop. It enables users with different data
processing tools such as Pig, MapReduce, etc. to easily read and write data on the grid.
It is built on the top of Hive metastore and exposes the tabular data of Hive metastore to other data
processing tools.
9. WebHCat
WebHCat is the REST API for HCatalog. It is an HTTP interface for performing Hive metadata
operations. It provides a service for running Hadoop MapReduce (or YARN), Pig, and Hive
jobs.
Architecture of Hive
The following diagram depicts the Hive architecture. Its units and their operations are:

User Interface: Hive is data warehouse infrastructure software that creates interaction
between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI,
the Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL, for querying schema information in the
Metastore. It is one of the replacements for the traditional approach to MapReduce
programs: instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates the same results
as MapReduce. It uses the flavor of MapReduce.

HDFS or HBASE: HDFS (Hadoop Distributed File System) or HBase are the data storage
techniques used to store data in the file system.
Hive Vs RDBMS
An RDBMS is used to maintain a database; Hive is used to maintain a data warehouse.
An RDBMS uses SQL (Structured Query Language); Hive uses HQL (Hive Query Language).
The schema is fixed in an RDBMS; the schema varies in Hive.
An RDBMS stores normalized data; Hive stores both normalized and de-normalized data.
Tables in an RDBMS are sparse; tables in Hive are dense.
An RDBMS doesn't support partitioning; Hive supports automatic partitioning.
An RDBMS uses no partition method; Hive uses the sharding method for partitioning.
Schema on READ only
◦ Operations like updates and modifications don't work with this, because a Hive query
in a typical cluster runs on multiple DataNodes.
◦ In versions below 0.13 it is not possible to update and modify data across multiple
nodes.
READ Many WRITE Once
◦ In later versions, updates are possible after insertion.
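As an illustrative sketch of those later versions (the table name and columns are assumptions), row-level updates require an ACID transactional table, which in Hive means ORC storage with the transactional property set; depending on the Hive version, the transaction manager must also be enabled and the table may need to be bucketed:

```sql
-- ACID table: ORC storage plus the 'transactional' table property
CREATE TABLE employee_txn (
  id     INT,
  name   STRING,
  salary INT
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO employee_txn VALUES (1201, 'Gopal', 45000);

-- Row-level update: possible only on transactional tables (Hive 0.14 and later)
UPDATE employee_txn SET salary = 50000 WHERE id = 1201;
```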
Hive Data Model
Data in Apache Hive can be categorized into:
Table
Same as the tables present in a relational database.
The associated metadata describes the layout of the data in the table.
Hive stores the metadata in a relational database and not in HDFS. Hive has two
types of tables, which are as follows:
Managed Table
When we load data into a Managed table, Hive moves data into Hive
warehouse directory.
External Table
For an external table, Hive does not move the data into the warehouse
directory; the table points to the data at its existing location, and
dropping the table does not delete the underlying data.
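A minimal sketch of the two table types (the table names and HDFS path below are assumptions):

```sql
-- Managed table: data loaded into it is moved under the Hive warehouse directory,
-- and DROP TABLE deletes both metadata and data
CREATE TABLE managed_emp (id INT, name STRING);

-- External table: data stays at the given location, and DROP TABLE removes
-- only the metadata, leaving the files in place
CREATE EXTERNAL TABLE external_emp (id INT, name STRING)
LOCATION '/user/hive/data/emp';
```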
Partition
Apache Hive organizes tables into partitions for grouping the same type of data together
based on a column or partition key.
Each table in Hive can have one or more partition keys to identify a particular
partition.
Using partitions, we can also make queries on slices of the data faster.
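As a sketch (table and column names are assumptions), a partitioned table stores each distinct partition-key value in its own subdirectory, and a filter on the key lets Hive scan only that slice:

```sql
-- Partitioned table: each distinct dept value becomes its own
-- subdirectory under the table's location
CREATE TABLE emp_part (id INT, name STRING)
PARTITIONED BY (dept STRING);

-- A filter on the partition key prunes the scan to one partition
SELECT id, name FROM emp_part WHERE dept = 'HR';
```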
Bucket
Tables or partitions are subdivided into buckets, based on the hash of a column in the
table.
Buckets give extra structure to the data that may be used for more efficient queries.
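As a sketch (table and column names are assumptions), bucketing is declared with CLUSTERED BY; rows are assigned to a bucket by hashing the chosen column:

```sql
-- Bucketed table: rows are distributed into 4 buckets
-- by hashing the id column
CREATE TABLE emp_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
```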
Hive Data Types
1. Arrays
An ordered sequence of similar type elements that are indexable using the zero-based
integers.
Arrays in Hive are similar to the arrays in JAVA.
array<datatype>
Example: array('Apple', 'Orange'). The second element is accessed as array[1].
2. Maps
Map in Hive is a collection of key-value pairs.
Where the fields are accessed using array notations of keys (e.g., [‘key’]).
map<primitive_type, data_type>
Example: 'first' -> 'John', 'last' -> 'Deo', represented as map('first', 'John', 'last', 'Deo').
Now 'John' can be accessed with map['first'].
3. Structs
Similar to the STRUCT in C language.
It is a record type that encapsulates a set of named fields, which can be any primitive data
type.
We can access the elements in STRUCT type using DOT (.) notation.
STRUCT <col_name : data_type [ COMMENT col_comment], ...>
Example: For a column c3 of type STRUCT {c1 INTEGER; c2 INTEGER}, the c1 field is
accessed by the expression c3.c1.
4. Union
Similar to the UNION in C.
UNION types at any point of time can hold exactly one data type from its specified data
types.
The full support for UNIONTYPE data type in Hive is still incomplete.
UNIONTYPE<data_type, data_type, ...>
In Hive data types, missing values are represented by the special value NULL.
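As a combined sketch (the table and field names are assumptions; UNIONTYPE is omitted since its support is incomplete), the complex types above can appear together in one table definition:

```sql
CREATE TABLE person (
  phones ARRAY<STRING>,
  name   MAP<STRING, STRING>,
  addr   STRUCT<city : STRING, zip : INT>
);

-- Accessing elements of each complex type:
SELECT phones[0],        -- first array element (zero-based index)
       name['first'],    -- map lookup by key
       addr.city         -- struct field via dot notation
FROM person;
```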
HiveQL
Query language for Hive to process and analyze structured data in a Metastore.
SELECT
SELECT statement is used to retrieve the data from a table.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
SELECT ORDER BY
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
SELECT GROUP BY
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
JOINS
JOIN clause is used to combine and retrieve the records from multiple tables.
A plain JOIN is the same as an INNER JOIN in SQL.
There are different types of joins given as follows:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
Consider the following table named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table ORDERS as follows:
+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
LEFT OUTER JOIN
◦ The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there
are no matches in the right table.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
◦ The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if
there are no matches in the left table.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
FULL OUTER JOIN
◦ The HiveQL FULL OUTER JOIN combines the records of both the left and the right
outer tables that fulfil the JOIN condition.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+
VIEWS
Same as view in SQL.
CREATE VIEW
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
Example
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
DROP VIEW
DROP VIEW view_name;
◦ The following query drops a view named as emp_30000:
hive> DROP VIEW emp_30000;
MAP JOIN
Also called Auto Map Join, Map-Side Join, or Broadcast Join.
In a normal join, too much activity is spent shuffling data around between nodes.
This slows down Hive queries.
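As a sketch, a map join avoids the shuffle by loading the smaller table into memory on every mapper. Using the CUSTOMERS and ORDERS tables from the earlier examples, the MAPJOIN hint and the hive.auto.convert.join setting are the standard ways to trigger it:

```sql
-- Ask Hive to broadcast the small ORDERS table to every mapper
SELECT /*+ MAPJOIN(o) */ c.ID, c.NAME, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

-- Or let Hive convert joins to map joins automatically,
-- based on table size
SET hive.auto.convert.join = true;
```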