Chapter 9
Introduction to Hive
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Learning Objectives and Learning Outcomes
Learning Objectives
1. Introduction to Hive.
2. To study the Hive architecture.
3. To study the Hive file formats.
4. To study the Hive Query Language.
Learning Outcomes
a) To understand the Hive architecture.
b) To create databases and tables, and execute data manipulation language statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.
Agenda
What is Hive?
Hive Architecture
Hive Data Types
Primitive Data Types
Collection Data Types
Hive File Format
Text File
Sequential File
RCFile (Record Columnar File)
Hive Queries
What is Hive?
Hive is a data warehousing tool, built on top of Hadoop, used to query
structured data. Facebook created the Hive component to manage their
ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage.
2. MapReduce for execution.
3. An RDBMS to store metadata.
Hive is suitable for data warehousing applications that process batch
jobs on huge data that is immutable.
Examples: analysis of web logs and application logs.
History of Hive
2007: Hive was born at Facebook to analyze their incoming log data.
2008: Hive became an Apache Hadoop sub-project.
Hive provides HQL (Hive Query Language), which is similar to SQL.
Hive compiles HQL queries into MapReduce jobs and then runs the jobs in
the Hadoop cluster.
Hive provides extensive data types, functions, and formats for data
summarization and analysis.
Features of Hive
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types/collection data types such
as structs, lists, and maps.
4. Custom types and custom functions can be defined (see the sketch below).
5. Hive supports SQL filters, group-by, and order-by
clauses.
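As a minimal sketch of point 4, a custom function written in Java can be registered from the Hive shell. The JAR path and class name below are hypothetical, not from the slides:
ADD JAR /home/hduser/udfs/my_udfs.jar;                               -- hypothetical JAR containing the UDF class
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpperUDF'; -- hypothetical class name
SELECT to_upper(name) FROM student;                                  -- use the custom function like a built-in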
Points to remember
Hive Query Language is similar to SQL; queries get
compiled into MapReduce jobs, which then run in
the Hadoop cluster.
Hive's default metastore database is Derby.
Hive integration and workflow
Hourly log data can be stored directly in HDFS, and
data cleansing is then performed on the log files. Finally, Hive tables
can be created to query the log files.
Hive Data Units
• Databases: The namespace for tables.
• Tables: Sets of records that share a schema.
• Partitions: Logical separations of data based on the classification of given
information as per specific attributes. Once Hive has partitioned the data
based on a specified key, it assembles the records into specific folders
as the records are inserted.
• Buckets (Clusters): Similar to partitions, but a hash function is used to
segregate the data and determine the cluster or bucket into which a record
should be placed.
• In Hive, a table is stored as a folder, a partition as a sub-directory,
and a bucket as a file; the sketch below illustrates this layout.
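As a hedged illustration of that layout, with a made-up table name, partition key, and bucket count:
-- hypothetical table: one folder per table, one sub-directory per partition, one file per bucket
CREATE TABLE sales (id INT, amount FLOAT)
PARTITIONED BY (region STRING)           -- e.g. .../sales/region=APAC/ sub-directories
CLUSTERED BY (id) INTO 4 BUCKETS;        -- 4 files inside each partition directory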
Hive Data Units
[Diagram: data units as arranged in a Hive cluster — Database → Tables → Partitions → Buckets, down to Columns and Data.]
Hive Architecture
Hive Command Line Interface: interface to interact with Hive.
Web interface: a graphical user interface to interact with Hive.
Hive server: used to submit Hive jobs from a client.
JDBC/ODBC: Java code can be written to connect to Hive and submit jobs
to it.
Driver: compiles, optimizes, and executes Hive queries.
Hive metastore: Hive table definitions and mappings to the data are stored
in a metastore.
A metastore consists of the following:
Metastore service: offers an interface to Hive.
Database: stores data definitions, mappings to the data, and other metadata.
HIVE METASTORE
The Hive metastore is a database that stores metadata about your Hive
tables (e.g., table name, column names and types, table location, storage
handler being used, number of buckets in the table, sorting columns if
any, partition columns if any, etc.).
When you create a table, the metastore is updated with the
information related to the new table, and this information is queried when you
issue queries on that table.
Metadata stored in the metastore includes:
IDs of databases, tables, and indexes
Table creation time
Input/output format used for a table
Metadata is updated whenever a table is created in or deleted from Hive.
HIVE METASTORE
Three kinds:
Embedded metastore: both the metastore database and the
metastore service run embedded in the main HiveServer
process.
Local metastore: in this mode the Hive metastore service runs
in the same process as the main HiveServer process, but the
metastore database runs in a separate process, and can be on a
separate host. The metastore service communicates
with the metastore database over JDBC.
Remote metastore: the Hive driver and the metastore interface run in
different JVMs.
[Diagram: the workflow between Hive and Hadoop.]
Hive File Format
• Text File
The default file format is text file. Each record is a line in the file.
In a text file, different control characters are used as delimiters:
^A (octal 001) separates fields (columns).
^B (octal 002) separates elements in an array/struct.
^C (octal 003) separates key-value pairs.
\n is the record delimiter.
The term "field" is used when overriding the default delimiter. Supported
formats: CSV, TSV, XML, JSON.
• Sequential File
Sequential files are flat files that store binary key-value pairs. They include
compression support, which reduces I/O requirements.
• RCFile (Record Columnar File)
RCFile stores data in a column-oriented manner, which ensures that
aggregation operations are not expensive.
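A file format is chosen with the STORED AS clause; a hedged sketch with made-up table names:
-- text file is the default, but it can be stated explicitly
CREATE TABLE logs_txt (line STRING) STORED AS TEXTFILE;
-- binary key-value pairs with compression support
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;
-- column-oriented storage, cheaper aggregations
CREATE TABLE logs_rc (line STRING) STORED AS RCFILE;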
Text file example. If the table schema is:
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
then a sample record, with the default ^A/^B/^C delimiters, looks like:
John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState
Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600
The same sample record can be represented as a JSON (JavaScript
Object Notation) document:
{
  "name": "John Doe",
  "salary": 100000.0,
  "subordinates": ["Mary Smith", "Todd Jones"],
  "deductions": {
    "Federal Taxes": 0.2,
    "State Taxes": 0.05,
    "Insurance": 0.1
  },
  "address": {
    "street": "1 Michigan Ave.",
    "city": "Chicago",
    "state": "IL",
    "zip": 60600
  }
}
This shows how complex data types like ARRAY, MAP, and STRUCT are
represented in JSON; the result is readable and easy to work with.
DDL and DML statements
• DDL statements are used to build and modify tables and other
objects. They deal with:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe
• DML statements are used to retrieve, store, modify, delete, and update
data in the database. They deal with (see the sketch below):
1. Loading files into tables.
2. Inserting data into Hive tables from queries.
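A hedged sketch of both DML forms; the paths and the student_copy table name are made up:
-- 1. loading a file into a table
LOAD DATA LOCAL INPATH '/tmp/student_data.csv' INTO TABLE student;
-- 2. inserting into a Hive table from a query (assumes student_copy already exists)
INSERT OVERWRITE TABLE student_copy SELECT rollno, name, gpa FROM student;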
Database
Starting the Hive shell
Go to the Hive installation path and type hive (Hadoop needs to be started in another terminal).
To create a database named "STUDENTS" with a comment and database properties:
CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT
'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');
hive> show databases;
hive> describe database students; //shows DB name, comment, and directory
hive> describe database extended students; //also shows the properties
Database
To alter the database properties:
hive> alter database students set dbproperties ('edited by'= 'David');
hive> describe database extended students;
hive> use students; //to make it the current database
hive> drop database students;
Tables
Hive provides two kinds of tables: managed and external.
Metadata of a table: when we create a table, the metastore keeps
information about its location, schema, list of partitions, etc.
Managed table
When a managed (internal) table is dropped, both the data and the metadata are dropped.
The table is stored under Hive's warehouse folder.
The life cycle of the table and its data is managed by Hive.
External table
When you drop this table, the data is retained in the underlying location.
The EXTERNAL keyword is used to create an external table.
A location needs to be specified to store the dataset in that particular
location.
hive> describe formatted <tablename>; (to see the table type)
Tables
To create a managed table named 'STUDENT'.
Format:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] STUDENT(rollno INT, name
STRING, gpa FLOAT) [COMMENT 'table_comment'] ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t';
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> describe STUDENT; //shows the schema of the table
Tables
To create an external table named 'EXT_STUDENT'.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name
STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
ALTER TABLE name RENAME TO new_name;
hive> ALTER TABLE employee RENAME TO emp;
DROP TABLE [IF EXISTS] table_name;
hive> DROP TABLE IF EXISTS employee;
CREATING A TEXT FILE TABLE
create table olympic(athlete STRING, age INT, country STRING, year STRING, closing STRING,
sport STRING, gold INT, silver INT, bronze INT, total INT) row format delimited fields terminated by '\t' stored as textfile;
Here we are creating a table named "olympic" with the schema specified
above. The data inside the input file is delimited by tabs. As
explained earlier, the file format is specified as TEXTFILE at the end. The schema of
the table can be checked using describe olympic;
Loading data into a table from a file
To load data into the table from the file named [Link]:
LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]' OVERWRITE INTO
TABLE EXT_STUDENT;
LOCAL is used to load data from the local file system; to load data that is
already in HDFS, remove the LOCAL keyword, as in the sketch below.
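A hedged example of the HDFS variant; the path is hypothetical:
-- without LOCAL, the path is on HDFS and the file is moved into the table's location
LOAD DATA INPATH '/user/hduser/student_data' OVERWRITE INTO TABLE EXT_STUDENT;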
To retrieve the student details from the EXT_STUDENT table:
hive> show tables;
hive> SELECT * FROM EXT_STUDENT;
hive> select rollno, name from EXT_STUDENT; //to view selected fields
Collection Data Types
ARRAY: ARRAY<data_type>
MAP: MAP<primitive_type, data_type>
STRUCT: STRUCT<col_name : data_type, ...>
hive> create table student_info(rollno INT, name STRING, sub ARRAY<STRING>, marks
MAP<STRING,INT>, addr struct<city:STRING, state:STRING, pin:BIGINT>) row format delimited
fields terminated by ',' collection items terminated by ':' map keys terminated by '!';
hive> load data local inpath '/home/hduser/hive/collection_demo.csv' into table student_info;
Raw contents of collection_demo.csv, following the delimiters declared above:
1001,John,English:Hindi,Mark1!45:Mark2!65,delhi:delhi:897654
1002,Jill,Physics:Maths,Mark1!43:Mark2!89,bhopal:MP:432024
Collection Data Types - Array
hive> select * from student_info;
1001 John ["English","Hindi"] {"Mark1":45,"Mark2":65} {"city":"delhi","state":"delhi","pin":897654}
1002 Jill ["Physics","Maths"] {"Mark1":43,"Mark2":89} {"city":"bhopal","state":"MP","pin":432024}
//Accessing complex data types
hive> select name, sub from student_info; //select the whole array
John ["English","Hindi"]
Jill ["Physics","Maths"]
hive> select name, sub[1] from student_info where rollno=1001;
hive> select name, sub[0] from student_info; //access a single array element
Collection Data Types - Map
hive> select name, marks from student_info; //display the whole map collection
John {"Mark1":45,"Mark2":65}
Jill {"Mark1":43,"Mark2":89}
hive> select name, marks["Mark1"], marks["Mark2"] from student_info;
John 45 65
Jill 43 89
hive> select name, addr.city from student_info; //access a struct field
John delhi
Jill bhopal
Partitions -> to execute Hive queries faster
Very often users need to filter the data on specific column values.
Example: select the employees whose salary is above 50000 from
a table of 10 lakh entries.
If users understand the domain of the data they are analyzing, they can
identify frequently queried columns, and partitioning can then
be applied on those columns.
When a WHERE clause is applied, Hive reads the entire dataset.
This decreases efficiency and becomes a bottleneck when we
are required to run queries on large tables.
This issue can be overcome by implementing partitions in Hive.
Partitions split a large dataset into more meaningful chunks.
Two kinds of partitions:
1) Static partition (comprises columns whose values are
known at compile time)
2) Dynamic partition (comprises columns whose values are
known only at execution time)
Static partition:
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/hduser/hive/student_data.csv' OVERWRITE
INTO TABLE student;
create table if not exists static_student(rollno int, name string) partitioned by
(gpa float) row format delimited fields terminated by ',';
insert overwrite table static_student partition (gpa=6.5) select rollno, name
from student where gpa=6.5;
To add one more static partition based on the "gpa" column using the alter statement:
alter table static_student add partition (gpa=8.0);
Under the table's warehouse directory (e.g., /user/hive/warehouse/students.db/static_student),
each partition appears as a sub-directory:
gpa=6.5 (dir)
gpa=8.0 (dir)
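A query that filters on the partition column only scans the matching sub-directory; a minimal sketch:
SELECT rollno, name FROM static_student WHERE gpa = 6.5; -- reads only the gpa=6.5 directory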
Dynamic partition:
To use dynamic partitioning in Hive, we need to set the parameters below in the Hive
shell or in the hive-site.xml file:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
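The creation of the dynamic_part_student table is not shown on the slide; a hedged sketch, assuming the same student source table as in the static example:
create table if not exists dynamic_part_student(rollno int, name string) partitioned by
(gpa float) row format delimited fields terminated by ',';
-- the partition column gpa is not given a fixed value; Hive derives it from the last column of the SELECT
insert overwrite table dynamic_part_student partition (gpa)
select rollno, name, gpa from student;
select * from dynamic_part_student;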
Suppose the file tab1/clientdata/file1 contains a client data table:
id, name, dept, yoj
1, sunny, SC, 2009
2, animesh, HR, 2009
3, sumeer, SC, 2010
4, sarthak, TP, 2010
Now, let us partition the above data into two files using the year.
Creating a partitioned table:
CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING);
Loading the per-year files into their partitions:
LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' OVERWRITE INTO TABLE
table_tab1 PARTITION (year='2009');
LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' OVERWRITE INTO TABLE
table_tab1 PARTITION (year='2010');
Contents of tab1/clientdata/2009/file2:
1, sunny, SC, 2009
2, animesh, HR, 2009
Contents of tab1/clientdata/2010/file3:
3, sumeer, SC, 2010
4, sarthak, TP, 2010
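The resulting partitions can then be listed; a hedged sketch of the expected outcome:
SHOW PARTITIONS table_tab1;
-- year=2009
-- year=2010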
Bucketing Features in Hive
Hive partitioning divides a table into a number of partitions, and these
partitions can be further subdivided into more manageable parts
known as buckets or clusters.
The bucketing concept is based on a hash function, which depends
on the type of the bucketing column. Records with the same value of the
bucketing column will always be saved in the same bucket.
Here, the CLUSTERED BY clause is used to divide the table into
buckets.
In Hive partitioning, each partition is created as a directory. In
Hive bucketing, each bucket is created as a file.
Bucketing can also be done even without partitioning Hive tables.
Buckets
To use bucketing in Hive, we need to set the parameter below in the Hive shell or in the
hive-site.xml file:
set hive.enforce.bucketing=true;
• To create a bucketed table having 3 buckets:
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (gpa) INTO 3 BUCKETS;
• Load data into the bucketed table:
FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET SELECT rollno, name, gpa;
• To display the content of the first bucket:
SELECT DISTINCT gpa FROM STUDENT_BUCKET TABLESAMPLE(BUCKET 1 OUT OF 3 ON
gpa);
When you create a table and bucket it using the CLUSTERED BY clause into 32
buckets (as an example), Hive buckets your data into 32 buckets using
deterministic hash functions. Then, when you use TABLESAMPLE(BUCKET x
OUT OF y), Hive divides your buckets into groups of y buckets and then picks the
x'th bucket of each group. For example:
If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive divides your 32
buckets into groups of 8, resulting in 4 groups of 8 buckets, and then
picks the 6th bucket of each group, hence picking buckets 6, 14, 22, and 30.
If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive divides your 32
buckets into groups of 32, resulting in only 1 group of 32 buckets, and then picks
the 23rd bucket as your result.
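In query form, a hedged sketch; the table name is hypothetical and assumed to be clustered into 32 buckets on column id:
SELECT * FROM big_table TABLESAMPLE(BUCKET 6 OUT OF 8 ON id); -- samples buckets 6, 14, 22, 30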
Partitioning vs. Bucketing
Basically, both partitioning and bucketing slice the data so that queries execute much more
efficiently than on the non-sliced data.
The major difference is that the number of slices keeps changing in the case of partitioning as
data is modified, whereas with bucketing the number of slices is fixed, specified while creating
the table.
Bucketing happens by applying a hash function and then a modulo on the number of buckets, so a row
may be inserted into any of the buckets. Bucketing can be used for sampling data, as well as for
joining two data sets much more effectively, and more.
When we do partitioning, we create a partition for each unique value of the column, but there may be
situations where we end up creating lots of tiny partitions.
With bucketing, you can limit the data to a number of buckets which you choose, and decompose your data
into those buckets. In Hive, a partition is a directory but a bucket is a file.
Now, if we want to perform partitioning on the basis of the department column, then the
information of all the employees belonging to a particular department will be stored
together in that very partition. Physically, a partition in Hive is nothing but a sub-
directory in the table directory.
Thus we will have three partitions in total, one for each of the departments, as shown
in the diagram. For each department we will have all the data regarding that
department residing in a separate sub-directory under the table directory.
So, for example, all the employee data for the Technical department will be
stored in user/hive/warehouse/employee_details/dept=Technical.
Hence, from the diagram, we can see how each partition is bucketed
into 2 buckets. Each partition, say Technical, will therefore have two files,
each of them storing the Technical employees' data; a sketch of a matching DDL follows.
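As a minimal sketch of that layout; the column names are assumptions, not from the slide:
CREATE TABLE employee_details (emp_id INT, name STRING, salary FLOAT)
PARTITIONED BY (dept STRING)           -- one sub-directory per department
CLUSTERED BY (emp_id) INTO 2 BUCKETS;  -- two files inside each partition directory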
Aggregations
Hive supports aggregation functions like avg, count, etc.
To apply the average and count aggregation functions:
SELECT AVG(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
Group By and Having
Data in a column or columns can be grouped on the basis of the values contained therein
by using GROUP BY.
The HAVING clause is used to filter out groups not meeting the specified condition;
it lets you specify conditions that determine which group results appear in the output.
select rollno, name, gpa from student group by rollno, name,
gpa having gpa > 4;
select gpa from student group by gpa having count(gpa) >= 3;
VIEWS: Views in SQL are a kind of virtual table. A view also has rows and columns as
they are in a real table in the database. We can create a view by selecting fields
from one or more tables present in the database.
A view can either have all the rows of a table or specific rows based on a certain condition.
CREATE VIEW DetailsView AS SELECT NAME, ADDRESS FROM StudentDetails
WHERE S_ID < 5;
CREATE VIEW MarksView AS SELECT StudentDetails.NAME, StudentDetails.ADDRESS,
StudentMarks.MARKS FROM StudentDetails, StudentMarks WHERE StudentDetails.NAME =
StudentMarks.NAME;
Querying the View
Select * from marksview limit 4;
Drop the View
Drop View marksview;
Join
Join: similar to the SQL JOIN.
To create a join between the Student and Department tables, we use rollno from both as the join key.
For example:
CREATE TABLE a (k1 string, v1 string);
CREATE TABLE b (k2 string, v2 string);
SELECT k1, v1, k2, v2 FROM a JOIN b ON k1 = k2;
i) Create two tab-separated files [Link] and [Link] with common rollnos.
ii) Create the two tables and load data into them:
hive> create table if not exists Stud(rollno INT, name STRING, gpa FLOAT) row format delimited
fields terminated by '\t';
hive> create table if not exists Dept(rollno INT, name STRING, gpa FLOAT) row format delimited
fields terminated by '\t';
hive> load data local inpath '/home/sangeetha/hivedemo/[Link]' overwrite into table stud;
hive> load data local inpath '/home/sangeetha/hivedemo/[Link]' overwrite into table dept;
iii) Join them on rollno:
SELECT a.rollno, a.name, b.name, b.gpa FROM STUD a JOIN DEPT b ON a.rollno = b.rollno;
SerDe
•SerDe stands for Serializer/Deserializer.
The serializer takes the Java objects Hive has been working with and
translates them into something that Hive can write to HDFS.
The deserializer takes the binary representation of a record and converts it
into the Java objects Hive works with.
•A SerDe contains the logic to convert unstructured data into records.
•Implemented using Java.
•Serializers are used at the time of writing.
•Deserializers are used at query time (SELECT statements).
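A SerDe is attached to a table with the ROW FORMAT SERDE clause; a hedged sketch using the CSV SerDe shipped with Hive (the table name is made up):
-- OpenCSVSerde parses CSV lines into records; it treats all columns as strings
CREATE TABLE csv_students (rollno STRING, name STRING, gpa STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;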
Sub Query
In Hive, sub queries are supported only in the FROM
clause (Hive 0.12).
A name must be specified for the sub query, because every table
in a FROM clause has a name.
The columns in the sub query select list should have
unique names.
The columns in the sub query select list are available to
the outer query just like the columns of a table.
In Hive 0.13, sub queries are supported in the WHERE
clause as well.
Sub query to count occurrences of similar words in
a file:
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]' OVERWRITE INTO TABLE docs;
CREATE TABLE word_count AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
The explode() function takes an array as input and outputs the elements of the array as
separate rows.
SELECT * FROM word_count;
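The same word count can be written without a named sub query using LATERAL VIEW, the usual idiom for explode(); a hedged, equivalent sketch:
SELECT word, count(1) AS count
FROM docs LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word
ORDER BY word;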