HIVE
Instructor: Oussama Derbel
Introduction
■ What is Hadoop?
– Apache Hadoop is an open-source software framework used to develop data-processing applications that are executed in a distributed computing environment.
– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.
■ Commodity computers are cheap and widely available. They are mainly useful for achieving greater computational power at low cost.
Introduction
■ Core of Hadoop
– HDFS (Hadoop Distributed File System): the storage part
– MapReduce: the processing part
Introduction
■ Apache Hadoop consists of two sub-projects
1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of computation nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.
Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
What is Hive?
■ Apache Hive is an open-source data warehouse system built on top of Hadoop, used for querying and analyzing large datasets stored in Hadoop files.
■ Previously, you had to write complex MapReduce jobs; with Hive, you merely submit SQL-like queries.
■ Hive is mainly targeted towards users who are comfortable with SQL.
■ Hive uses a language called HiveQL (HQL), which is similar to SQL.
■ Hive automatically translates the SQL-like HiveQL queries into MapReduce jobs.
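For example, the following aggregate query (a minimal sketch assuming a hypothetical employees table) is compiled into a MapReduce job behind the scenes:
-- Map tasks emit (department, 1) pairs; reduce tasks sum the counts
SELECT department, COUNT(*) AS num_employees
FROM employees
GROUP BY department;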
What is Hive?
■ Hive abstracts the complexity of Hadoop. Notably, there is no need to learn Java for Hive.
■ Hive generally runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster.
■ Apache Hive organizes data into tables. This provides a means of attaching structure to data stored in HDFS.
Why Apache Hive?
■ Facebook faced a lot of challenges before implementing Apache Hive: the size of the data being generated exploded, making it very difficult to handle.
■ The traditional RDBMS could not handle the pressure. As a result, Facebook was looking for better options.
■ To overcome this problem, Facebook initially tried using MapReduce.
■ But MapReduce was difficult to program, while most users knew only SQL, making it an impractical solution.
■ Apache Hive allowed them to overcome these challenges.
Why Apache Hive?
■ With Apache Hive, we are now able to perform the following:
– Schema flexibility and evolution
– Tables can be partitioned and bucketed
– Apache Hive tables are defined directly on HDFS
– JDBC (Java Database Connectivity) / ODBC (Open Database Connectivity) drivers are available
Why Apache Hive?
■ Apache Hive saves developers from writing complex Hadoop MapReduce jobs for ad-hoc requirements.
■ Hive provides summarization, analysis, and querying of data.
■ Hive is fast, scalable, and highly extensible. Since HiveQL is similar to SQL, it is very easy for SQL developers to learn and write Hive queries.
■ Hive reduces the complexity of MapReduce by providing an interface where the user can submit SQL queries. Business analysts can now work with Big Data using Apache Hive and generate insights.
■ It also provides file access on various data stores such as HDFS and HBase.
■ The most important feature of Apache Hive is that we don't have to learn Java to use it.
Hive Architecture
■ Metastore:
■ It stores metadata for each of the tables, such as their schema and location.
■ Hive also stores the partition metadata. This helps the driver track the progress of the various data sets distributed over the cluster. The metadata is stored in a traditional RDBMS format; it helps the driver keep track of the data and is highly crucial. A backup server regularly replicates the metadata so that it can be retrieved in case of data loss.
Hive Architecture
■ Driver:
■ It acts like a controller that receives the HiveQL statements.
■ The driver starts the execution of a statement by creating sessions. It monitors the life cycle and progress of the execution.
■ The driver stores the necessary metadata generated during the execution of a HiveQL statement. It also acts as a collection point for the data or query results obtained after the reduce operation.
Hive Architecture
■ Compiler:
■ It performs the compilation of the HiveQL query, converting the query into an execution plan. The plan contains the tasks and the steps MapReduce needs to perform to produce the output the query asks for.
■ The compiler converts the query into an Abstract Syntax Tree (AST), first checks it for compatibility and compile-time errors, and then converts the AST into a Directed Acyclic Graph (DAG).
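You can inspect the plan the compiler produces with EXPLAIN (a sketch assuming the nyse table created later in these slides):
-- prints the stages of the execution plan instead of running the query
EXPLAIN SELECT symbol1, AVG(price_open)
FROM nyse
GROUP BY symbol1;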
Hive Architecture
■ Optimizer
■ It performs various transformations on the execution plan to produce an optimized DAG.
■ It aggregates the transformations together, such as converting a pipeline of joins to a single join, for better performance.
■ The optimizer can also split the tasks, such as applying a transformation on data before a reduce operation, to provide
better performance.
Hive Architecture
■ Executor
■ Once compilation and optimization are complete, the executor executes the tasks. The executor takes care of pipelining the tasks.
■ CLI, UI, and Thrift Server: the CLI (command-line interface) provides a user interface for an external user to interact with Hive. The Thrift server in Hive allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
Hive Architecture
■ Architecture
Hive Components
Hive Limitations
• Does not provide row-level updates (in earlier versions)
• Not suitable for OLTP (Online Transaction Processing)
• Queries have higher latency
• Start-up overhead for MapReduce jobs
• Best when a large dataset is maintained and mined
Hive – Data Types – Numeric
• TINYINT (1-byte signed integer)
• SMALLINT (2-byte signed integer)
• INT (4-byte signed integer)
• BIGINT (8-byte signed integer)
• FLOAT (4-byte single-precision floating point number)
• DOUBLE (8-byte double-precision floating point number)
• DECIMAL (user-defined precision)
Hive – Data Types – Date/Time
• TIMESTAMP (Hive 0.8.0 and later)
• DATE (Hive 0.12.0 and later) – YYYY-MM-DD
Hive – Data Types – String
• STRING
• VARCHAR (Hive 0.12.0 and later)
• CHAR (Hive 0.13.0 and later)
Hive – Data Types – Misc
• BOOLEAN
• BINARY (Hive 0.8.0 and later)
Hive – Data Types – Complex
• arrays: ARRAY<data_type>
• maps: MAP<primitive_type, data_type>
• structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
• union: UNIONTYPE<data_type, data_type, ...> (Hive 0.7.0 and later)
Hive – Data Types – Example
CREATE TABLE employees(
  name STRING,                    -- e.g. "John"
  salary FLOAT,
  subordinates ARRAY<STRING>,     -- e.g. ["Michael", "Rumi"]
  deductions MAP<STRING, FLOAT>,  -- e.g. {"Insurance": 500.00, "Charity": 600.00}
  address STRUCT<street:STRING,
                 city:STRING,
                 state:STRING,
                 zip:INT>,
  auth UNIONTYPE<INT, INT, STRING> -- fbid, gid, or email (UNIONTYPE members are positional, not named)
);
Hive – Metastore
• Stores the metadata of tables in a relational database
• Metadata includes:
  • Details of databases and tables
  • Table definitions: name of the table, columns, partitions, etc.
Hive – Warehouse
• Hive tables are stored in the Hive warehouse directory
  • /apps/hive/warehouse on HDFS
  • or at the location specified in the table definition
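You can check which warehouse directory is configured from the Hive shell; SET with a property name but no value prints the current setting:
-- prints the configured warehouse directory
SET hive.metastore.warehouse.dir;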
Hive – Getting Started – Command Line
• Log in to the CloudxLab Linux console
• Type "hive" to access the Hive shell
• By default, the database named "default" is selected as the current database for the session
• Type "SHOW DATABASES;" to see the list of all databases
Hive – Getting Started – Command Line
• "SHOW TABLES;" lists the tables in the currently selected database, which is the "default" database.
• Create your own database with your login name:
• CREATE DATABASE abhinav9884;
• DESCRIBE DATABASE abhinav9884;
• DROP DATABASE abhinav9884;
Hive – Getting Started – Command Line
• CREATE DATABASE abhinav9884;
• USE abhinav9884;
• CREATE TABLE x(a INT);
Hive – Getting Started – Hue
• Log in to Hue
• Click on "Query Editors" and select "Hive"
• Select your database (abhinav9884) from the list
• SELECT * FROM x;
• DESCRIBE x;
• DESCRIBE FORMATTED x;
Hive – Tables
●Managed tables
●External tables
Hive – Managed Tables
●Also called internal tables
●Lifecycle is managed by Hive
●Data is stored in the warehouse directory
●Dropping the table deletes the data from the warehouse
Hive – External Tables
●The lifecycle is not managed by Hive
●Hive assumes that it does not own the data
●Dropping the table does not delete the underlying data
●Only the metadata is deleted
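A quick sketch of the difference, using the nyse (managed) and nyse_external (external) tables defined later in these slides:
-- Managed table: dropping it deletes both the metastore entry
-- and the data files under the warehouse directory
DROP TABLE nyse;
-- External table: dropping it deletes only the metastore entry;
-- the files at the table's LOCATION are left untouched
DROP TABLE nyse_external;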
Hive – Managed Tables – Example
CREATE TABLE nyse(
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

DESCRIBE nyse;
DESCRIBE FORMATTED nyse;
Hive – Loading Data – From Local Directory
● hadoop fs -copyToLocal /data/NYSE_daily
● Launch Hive
● USE yourdatabase;
● LOAD DATA LOCAL INPATH 'NYSE_daily' OVERWRITE INTO TABLE nyse;
● Copies the data from the local file system to the warehouse
Hive – Loading Data – From HDFS
CREATE TABLE nyse_hdfs(
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Hive – Loading Data – From HDFS
●Copy /data/NYSE_daily to your home directory in HDFS
●LOAD DATA INPATH 'hdfs:///user/abhinav9884/NYSE_daily' OVERWRITE INTO TABLE nyse_hdfs;
●Moves the data from the specified location to the warehouse
●Check whether NYSE_daily is still in your home directory in HDFS
Hive – External Tables
CREATE EXTERNAL TABLE nyse_external(
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/abhinav9884/NYSE_daily';

DESCRIBE FORMATTED nyse_external;
Hive – S3 Based External Table
create external table miniwikistats(
  projcode string,
  pagename string,
  pageviews int,
  bytes int)
partitioned by (dt string)
row format delimited fields terminated by ' '
lines terminated by '\n'
location 's3n://paid/default-datasets/miniwikistats/';
Hive – Select Statements
● Select all columns
SELECT * FROM nyse;
● Select only required columns
SELECT exchange1, symbol1 FROM nyse;
Hive – Aggregations
●Find the average opening price for each stock
SELECT symbol1, AVG(price_open) AS avg_price FROM nyse
GROUP BY symbol1;
●To improve performance, enable top-level aggregation in the map phase:
SET hive.map.aggr=true;
Hive – Saving Data
●In local filesystem
insert overwrite local directory '/home/abhinav9884/onlycmc'
select * from nyse where symbol1 = 'CMC';
●In HDFS
insert overwrite directory 'onlycmc' select * from nyse where symbol1 = 'CMC';
Hive – Tables – DDL – ALTER
● Rename a table
ALTER TABLE x RENAME TO x1;
● Change the datatype of a column
ALTER TABLE x1 CHANGE a a FLOAT;
● Add columns to an existing table
ALTER TABLE x1 ADD COLUMNS (b FLOAT, c INT);
Hive – Partitions
# First name, Department, Year of joining
Mark, Engineering, 2012
Jon, HR, 2012
Monica, Finance, 2015
Steve, Engineering, 2012
Michael, Marketing, 2015
Hive – Partitions – Hands-on
●Data is located at /data/bdhs/employees/ on HDFS
●Copy data to your home directory in HDFS
hadoop fs -cp /data/bdhs/employees .
●Create the table
CREATE TABLE employees(
  name STRING,
  department STRING,
  somedate DATE
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Hive – Partitions – Hands-on
● Load dataset 2012.csv
load data inpath 'hdfs:///user/sandeepgiri9034/employees/2012.csv' into table employees partition(year=2012);
●Load dataset 2015.csv
load data inpath 'hdfs:///user/sandeepgiri9034/employees/2015.csv' into table employees partition(year=2015);
● SHOW PARTITIONS employees;
● Check warehouse and metastore
Hive – Partitions – Summary
To avoid full table scans:
•The data is stored in different files in the warehouse, as defined by the partitions
•Define the partitions using "PARTITIONED BY" in "CREATE TABLE"
•We can also add a partition later (see the sketch below)
•Partitioning can happen on multiple columns (year=2012, month=10, day=12)
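A minimal sketch of adding a partition after table creation, assuming the employees table above (the HDFS path is hypothetical):
-- register an existing HDFS directory as a new partition
ALTER TABLE employees ADD PARTITION (year='2016')
LOCATION '/user/abhinav9884/employees/2016';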
Hive – Views
●SELECT * FROM employees WHERE department='Engineering';
●Create a view
CREATE VIEW employees_engineering AS
SELECT * FROM employees WHERE department='Engineering';
●Now query from the view
SELECT * FROM employees_engineering;
Hive – Views – Summary
•Allows a query to be saved and treated like a table
•Logical construct: does not store data
•Hides the query complexity
•Divides long and complicated queries into smaller, manageable pieces
•Similar to writing a function in a programming language
Hive – Load JSON Data
• Download the JSON-SerDe binaries
• ADD JAR
hdfs:///data/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;
• Create the table
CREATE EXTERNAL TABLE tweets_raw (
  -- the column list was elided on the original slide; the fields below are illustrative
  id BIGINT,
  created_at STRING,
  text STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/abhinav9884/senti/upload/data/tweets_raw';
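Once defined, the JSON records can be queried like any other table (using the illustrative columns above):
SELECT id, text FROM tweets_raw LIMIT 5;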
Hive – Sorting & Distributing – Order By
ORDER BY x
●Guarantees global ordering
●Data goes through just one reducer
●This is unacceptable for large datasets, as it will overload the reducer
●You end up with one sorted file as output
Hive – Sorting & Distributing – Sort By
SORT BY x
●Orders data at each of N reducers
●The number of reducers defaults to roughly one per 1 GB of input
●You end up with N or more sorted files with overlapping ranges
Hive – Sorting & Distributing – Distribute By
DISTRIBUTE BY x
●Ensures each of N reducers gets non-overlapping ranges of x
●But doesn't sort the output of each reducer
●You end up with N or more unsorted files with non-overlapping ranges
Hive – Sorting & Distributing – Cluster By
CLUSTER BY x
●Gives global ordering
●Is the same as (DISTRIBUTE BY x SORT BY x)
●CLUSTER BY is basically the more scalable version of ORDER BY
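A side-by-side sketch of the four clauses on the nyse table (the number of output files depends on your reducer configuration):
-- one reducer, one globally sorted output file
SELECT * FROM nyse ORDER BY symbol1;
-- N reducers; each output file is sorted, but ranges may overlap
SELECT * FROM nyse SORT BY symbol1;
-- N reducers; non-overlapping ranges, but each file is unsorted
SELECT * FROM nyse DISTRIBUTE BY symbol1;
-- shorthand for DISTRIBUTE BY + SORT BY on the same column
SELECT * FROM nyse CLUSTER BY symbol1;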
Hive – Bucketing
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
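Two common follow-ups, sketched assuming the page_view table above: older Hive versions must enforce bucketing before inserts, and bucketed tables support efficient sampling:
-- needed before Hive 2.0 so that INSERTs actually produce 32 bucket files
SET hive.enforce.bucketing = true;
-- read only the first of the 32 buckets (roughly 1/32 of the rows)
SELECT * FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);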
Hive – ORC files
●Optimized Row Columnar file format
●Provides a highly efficient way to store Hive data
●Improves performance when
○Reading
○Writing
○Processing
●Has a built-in index, min/max values, and other aggregations
●Proven in large-scale deployments
○Facebook uses the ORC file format for a 300+ PB deployment
Hive – ORC Files – Example
CREATE TABLE orc_table(
  first_name STRING,
  last_name STRING
) STORED AS ORC;

INSERT INTO orc_table VALUES('John', 'Gill');
SELECT * FROM orc_table;
To know more, please visit https://orc.apache.org
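A common way to migrate an existing text-format table to ORC is CREATE TABLE ... AS SELECT (a sketch assuming the nyse table from earlier):
-- copies the data, rewriting it in ORC format
CREATE TABLE nyse_orc STORED AS ORC
AS SELECT * FROM nyse;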
Hive – Quick Recap
•Each table has a location
•By default, the table is in a directory under the location
/apps/hive/warehouse
•We can override that location by specifying 'location' in the CREATE TABLE clause
•LOAD DATA copies the data if it is local
•LOAD DATA moves the data if it is on HDFS, for both external and managed tables
•Dropping a managed table deletes the data at the 'location'
•Dropping an external table does not delete the data at the 'location'
•The metadata is stored in a relational database, the Hive metastore
Hive – Connecting to Tableau
•Tableau is a visualization tool
•Tableau allows for instantaneous insight by transforming data into visually appealing, interactive visualizations
called dashboards
Hive – Connecting to Tableau - Steps
•Download and install Tableau desktop from https://www.tableau.com/products/desktop
•Download and install Hortonworks ODBC driver for Apache Hive for your OS
https://hortonworks.com/downloads/
Hive – Connecting to Tableau – Hands-on
Visualize the top 10 stocks with the highest opening price on Dec 31, 2009
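The underlying HiveQL for this visualization might look like the following (a sketch against the nyse table, assuming ymd stores dates as 'YYYY-MM-DD' strings):
SELECT symbol1, price_open
FROM nyse
WHERE ymd = '2009-12-31'
ORDER BY price_open DESC
LIMIT 10;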
Hive – Quick Demo
1. Copy the data from /data/ml100k/u.data into our HDFS home
2. Open Hive in Hue and run the following:
CREATE TABLE u_data(userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/sandeepgiri9034/u.data' OVERWRITE INTO TABLE u_data;
SELECT * FROM u_data LIMIT 5;
SELECT movieid, AVG(rating) ar FROM u_data GROUP BY movieid ORDER BY ar DESC;
Hive – Quick Demo
Join with MovieNames
CREATE VIEW top100m AS
SELECT movieid, AVG(rating) ar FROM u_data GROUP BY movieid ORDER BY ar DESC;
CREATE TABLE m_data(movieid INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/sandeepgiri9034/u.item' INTO TABLE m_data;
SELECT * FROM m_data LIMIT 100;
SELECT * FROM m_data, top100m WHERE top100m.movieid = m_data.movieid;
Hive – Lab
1. For each movie, how many users rated it?
2. For movies having more than 30 ratings, what is the average rating?
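One possible solution sketch against the u_data table from the demo (the column aliases are illustrative):
-- 1. number of users who rated each movie
SELECT movieid, COUNT(userid) AS num_ratings
FROM u_data
GROUP BY movieid;
-- 2. average rating for movies with more than 30 ratings
SELECT movieid, AVG(rating) AS avg_rating
FROM u_data
GROUP BY movieid
HAVING COUNT(userid) > 30;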
References
https://data-flair.training/
Thank you