Unit V - Hive

Apache HIVE

 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.


 Open source.
 For querying and analyzing large datasets stored in Hadoop files.
 Hive uses a language called HiveQL (HQL), which is similar to SQL.
 HiveQL automatically translates SQL-like queries into MapReduce jobs.
 Hive organizes data into tables.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Hive Shell
 The shell is the primary way in which we interact with Hive.
 We can issue our commands or queries in HiveQL inside the Hive shell.
 The Hive shell is similar to the MySQL shell.
 It is the command-line interface for Hive.
 In the Hive shell, users can run HQL queries.
 We can run the Hive Shell in two modes which are:
 Non-Interactive mode
◦ With the -f option we can specify the location of a file containing HQL queries.
◦ For example- hive -f my-script.q
 Interactive mode
◦ We go directly to the Hive shell and run the queries there.
◦ In the Hive shell, we can submit the required queries manually and get the results (see the examples below).
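For example, the two modes can be used as follows (the script name and queries are illustrative; the employee table is the one used later in this unit):

$ hive -f my-script.q                      # non-interactive: run all HQL statements in the file
$ hive -e 'SELECT * FROM employee;'        # non-interactive: run a single query passed with -e
$ hive                                     # interactive: open the shell
hive> SELECT * FROM employee;
hive> quit;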
Hive Services

Hive provides various services, such as HiveServer2 and Beeline, to perform queries.

1. Beeline
Beeline is a command shell supported by HiveServer2, through which the user can submit queries
and commands to the system. It is a JDBC client based on the SQLLine CLI (a pure-Java console
utility for connecting to relational databases and executing SQL queries).

2. Hive Server 2
HiveServer2 is the successor of HiveServer1. It enables clients to execute queries against Hive,
allows multiple clients to submit requests and retrieve the final results, and is designed to provide
good support for open-API clients such as JDBC and ODBC.
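As an illustration, Beeline connects to HiveServer2 over JDBC as shown below (the host localhost, port 10000, and user name are assumed defaults and may differ in a real cluster):

$ beeline -u jdbc:hive2://localhost:10000 -n hive
0: jdbc:hive2://localhost:10000> SELECT * FROM employee LIMIT 5;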

3. Hive Driver
The Hive driver receives the HiveQL statements submitted by the user through the command shell.
It creates the session handles for the query and sends the query to the compiler.

4. Hive Compiler
Hive compiler parses the query. It performs semantic analysis and type-checking on the different
query blocks and query expressions by using the metadata stored in metastore and generates an
execution plan.
The execution plan created by the compiler is a DAG (Directed Acyclic Graph), where each
stage is a map/reduce job, an HDFS operation, or a metadata operation.
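The plan produced by the compiler can be inspected with the EXPLAIN statement; for example, using the employee table shown later in this unit:

hive> EXPLAIN SELECT Dept, count(*) FROM employee GROUP BY Dept;

The output lists the stages of the DAG (for example, a map/reduce stage followed by a fetch stage) and the operator tree inside each stage.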

5. Optimizer
The optimizer performs transformation operations on the execution plan and splits tasks to
improve efficiency and scalability.

6. Execution Engine
After the compilation and optimization steps, the execution engine executes the plan created
by the compiler, in the order of stage dependencies, using Hadoop.

7. Metastore
Metastore is a central repository that stores the metadata information about the structure of tables
and partitions, including column and column type information.
It also stores the serializer and deserializer (SerDe) information required for read/write operations,
and the HDFS files where the data is stored. This metastore is generally a relational database.
Metastore provides a Thrift interface for querying and manipulating Hive metadata.
We can configure metastore in any of the two modes:
 Remote: In remote mode, metastore is a Thrift service and is useful for non-Java
applications.
 Embedded: In embedded mode, the client can directly interact with the metastore using
JDBC.
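For example, a client is pointed at a remote metastore through the hive.metastore.uris property in hive-site.xml (the host name below is an assumption; 9083 is the usual default Thrift port):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>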

8. HCatalog
HCatalog is the table and storage management layer for Hadoop. It enables users with different data
processing tools such as Pig, MapReduce, etc. to easily read and write data on the grid.
It is built on top of the Hive metastore and exposes the tabular data of the Hive metastore to other
data processing tools.
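As a brief sketch, a Pig session started with the -useHCatalog flag can read a Hive table through HCatLoader (the table name is illustrative, and older releases use the org.apache.hcatalog.pig package instead):

$ pig -useHCatalog
grunt> orders = LOAD 'default.orders' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> DUMP orders;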

9. WebHCat
WebHCat is the REST API for HCatalog. It is an HTTP interface for performing Hive metadata
operations. It also provides a service for running Hadoop MapReduce (or YARN), Pig, and Hive
jobs.
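For instance, WebHCat can be called over HTTP as sketched below (the host localhost, the default port 50111, and the user name are assumptions):

$ curl -s 'http://localhost:50111/templeton/v1/status'
$ curl -s 'http://localhost:50111/templeton/v1/ddl/database?user.name=hive'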
Architecture of Hive

The following table describes the main units of the Hive architecture and the operation of each.

User Interface: Hive is a data warehouse infrastructure software that creates interaction between
the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command
line, and Hive HDInsight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the
Metastore. It is one of the replacements for the traditional approach of writing a MapReduce
program: instead of writing a MapReduce program in Java, we can write a query and have it
processed as a MapReduce job.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates the same results as
MapReduce. It uses the flavor of MapReduce.

HDFS or HBASE: HDFS (Hadoop Distributed File System) or HBase are the data storage
techniques used to store the data in the file system.

Hive Vs RDBMS

RDBMS                                          Hive
It is used to maintain a database.             It is used to maintain a data warehouse.
It uses SQL (Structured Query Language).       It uses HQL (Hive Query Language).
The schema is fixed.                           The schema varies.
Only normalized data is stored.                Both normalized and de-normalized data are stored.
Tables in an RDBMS are sparse.                 Tables in Hive are dense.
It does not support partitioning.              It supports automatic partitioning.
No partitioning method is used.                Sharding is used for partitioning.

 Schema on READ only

◦ Operations such as update and modification do not work with this model,
◦ because a Hive query in a typical cluster runs on multiple Data Nodes.
◦ In versions below 0.13 it is not possible to update and modify data across multiple
nodes.
 READ Many WRITE Once
◦ In later versions, updates are possible after insertion (a sketch follows below).
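A minimal sketch of such an update, assuming Hive 0.14 or later with ACID transactions enabled in the configuration, and an illustrative table name:

hive> CREATE TABLE emp_txn (id INT, name STRING, salary FLOAT)
      CLUSTERED BY (id) INTO 4 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ('transactional'='true');
hive> UPDATE emp_txn SET salary = 50000 WHERE id = 1201;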

Hive Data Model

Data in Apache Hive can be categorized into:


 Table
 Same as the tables present in a relational database.
 The associated metadata describes the layout of the data in the table.
 Hive stores the metadata in a relational database and not in HDFS. Hive has two
types of tables which are as follows:
 Managed Table
 When we load data into a Managed table, Hive moves the data into the Hive
warehouse directory.
 External Table
 When we create an External table, Hive does not move the data into its
warehouse directory; dropping an External table deletes only the metadata,
not the underlying data.
 Partition
 Apache Hive organizes tables into partitions for grouping the same type of data together
based on a column or partition key.
 Each table in Hive can have one or more partition keys to identify a particular
partition.
 Using partitions, queries on slices of the data become faster.
 Bucket
 Tables or partitions are subdivided into buckets based on the hash of a column in the
table.
 Buckets give extra structure to the data that may be used for more efficient queries
(a DDL sketch follows below).
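The following sketch illustrates managed, external, partitioned, and bucketed tables; the table names, columns, and HDFS path are illustrative:

hive> CREATE TABLE emp_managed (id INT, name STRING);   -- managed: data moves to the warehouse directory
hive> CREATE EXTERNAL TABLE emp_ext (id INT, name STRING)
      LOCATION '/user/data/emp';                        -- external: data stays at the given HDFS path
hive> CREATE TABLE sales (id INT, amount FLOAT)
      PARTITIONED BY (sale_year INT)                    -- one sub-directory per partition value
      CLUSTERED BY (id) INTO 8 BUCKETS;                 -- rows hashed on id into 8 bucket files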

Hive Data Types

1. Arrays
 An ordered sequence of elements of the same type that are indexable using zero-based
integers.
 Arrays in Hive are similar to the arrays in JAVA.
array<datatype>
Example: array(‘Apple’,’Orange’). The second element is accessed as array[1].

2. Maps
 Map in Hive is a collection of key-value pairs.
 Fields are accessed using array notation on keys (e.g., map[‘key’]).
map<primitive_type, data_type>
Example: ‘first’ -> ‘John’, ‘last’ -> ‘Deo’, represented as map(‘first’, ‘John’, ‘last’, ‘Deo’).
Now ‘John’ can be accessed with map[‘first’].

3. Structs
 Similar to the STRUCT in C language.
 It is a record type that encapsulates a set of named fields, which can be any primitive data
type.
 We can access the elements in STRUCT type using DOT (.) notation.
STRUCT <col_name : data_type [ COMMENT col_comment], ...>
Example: For a column c3 of type STRUCT {c1 INTEGER; c2 INTEGER}, the c1 field is
accessed by the expression c3.c1.

4. Union
 Similar to the UNION in C.
 A UNION type, at any point in time, can hold exactly one value from its specified data
types.
 Full support for the UNIONTYPE data type in Hive is still incomplete.
UNIONTYPE<data_type, data_type, ...>

 In Hive, missing values are represented by the special value NULL.
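A sketch combining these complex types in one table definition (the table name, columns, and delimiters are illustrative):

hive> CREATE TABLE contacts (
        name    STRING,
        phones  ARRAY<STRING>,
        address STRUCT<city:STRING, pin:INT>,
        emails  MAP<STRING, STRING>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';
hive> SELECT phones[0], address.city, emails['work'] FROM contacts;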

HiveQL

 Query language for Hive to process and analyze structured data in a Metastore.

SELECT

 SELECT statement is used to retrieve the data from a table.

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

 hive> SELECT * FROM employee WHERE salary>30000;


On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+

SELECT ORDER BY

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

 hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;

SELECT GROUP BY

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

 hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;

JOINS

 JOIN clause is used to combine and retrieve the records from multiple tables.
 A plain JOIN behaves like an INNER JOIN in SQL.
 There are different types of joins given as follows:
 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN
Consider the following table named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+

Consider another table ORDERS as follows:


+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT


FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+

 LEFT OUTER JOIN


◦ The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there
are no matches in the right table.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+

 RIGHT OUTER JOIN


◦ The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if
there are no matches in the left table.

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER
JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

 FULL OUTER JOIN


◦ The HiveQL FULL OUTER JOIN returns all the records from both the left and the right
tables, filling in NULL on the side that has no match.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+

VIEWS
 Same as a view in SQL.

 CREATE VIEW
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

 Example

hive> CREATE VIEW emp_30000 AS


SELECT * FROM employee
WHERE salary>30000;

 DROP VIEW

DROP VIEW view_name

◦ The following query drops a view named as emp_30000:

hive> DROP VIEW emp_30000;

MAP JOIN

 Also known as Auto Map Join, Map-Side Join, or Broadcast Join.

 In a normal join, a great deal of work is spent shuffling data between the map and reduce
phases, which slows down Hive queries.

 In a map join, the smaller table is held in memory and the join is completed in the map phase
alone, avoiding the shuffle (see the examples below).
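Examples of enabling a map join, either automatically or with an explicit hint, using the CUSTOMERS and ORDERS tables from above (the property values shown are common defaults, not requirements):

hive> SET hive.auto.convert.join=true;                 -- let Hive convert joins to map joins when one side is small
hive> SET hive.mapjoin.smalltable.filesize=25000000;   -- size threshold in bytes for the "small" table
hive> SELECT /*+ MAPJOIN(o) */ c.ID, c.NAME, o.AMOUNT
      FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);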
