HIVE
DATA WAREHOUSING USING HIVE QUERY
Additional Data Warehousing System
The table shows the problems related to data inflow & expressiveness, and
the solutions adapted to address the need for an additional data
warehousing system.
It was difficult to develop MapReduce program to express the
data , hence data lacked expressiveness.
Hence, HIVE came into BIG Data . .
What is HIVE ?
HIVE is defined as warehouse system for Hadoop that facilitates ad-hoc Queries
and the analysis of large data sets stored in Hadoop . .
FACTS !!
Hive provides SQL- like flavour to Big data
i.e. HiveQL(HQL). Because of that it’s a
popular choice for Big data Analytics on
Hadoop platform.
HIVE FACTS It provides massive scale-out & fault
tolerance capabilities for data storage
and processing of commodity hardware.
Relying on MapReduce for execution,
hive is batch-oriented and high latency
for query execution.
HIVE | Characteristics
Hive is a system for managing and querying unstructured data into structured data format.
It uses the concepts :
It uses HDFS for storage and retrieval of data.
The scripts of Hive uses MapReduce for execution
Hive commands are similar to SQL which Interoperability (extensible framework to
is data warehousing tool. support different files and data formats).
Principles
of
Hive Performance is better in Hive since Hive
Extensibility (pluggable MapReduce scripts
engine uses the best in-built script to
in the language of your choice – rich, user
reduce the execution time while enabling
def data types & user def functions.
high output.
System Architecture and Components of Hive
JDBC ODBC
Web
Command Line
Interface
Interface Thrift Server
MetaStore
Driver
(Compiler, Optimizer, Executor)
Hadoop
MapReduce + HDFS
Job Name
Tracker Node
Metastore is the component that stores the
system catalog and metadata about tables,
columns, partitions, & so on . .
Metadata is stored in traditional RDMS format,
Apache Hive uses by default DERBY database.
But its not compulosory its complimentary. If
METADATA you wish you can add any JDBC database like
MySQL.
Metadata client : metastore_db
MetaStore
Hive driver is the component that:
Manages the lifecycle of a HIVE Query
Language(HQL) statement as it moves through
Hive;
It maintains a session handle and any session
statistics.
DRIVER It includes three basic components:
Compiler
Optimizer
Executer
Driver
(Compiler, Optimizer, Executor)
Query Compiler is the driver components of Hive
and checks for error, if no error encountered it
converts HiveQL to Directed Acyclic Graph(DAG)
Query COMPILER of MapReduce tasks.
Driver
(Compiler, Optimizer, Executor)
Query optimizer optimizes the HiveQL scripts for
faster execution.
It consists of a chain of transformations, so that
the operator DAG resulting from on
transformations is passed as an input to the
Query OPTIMIZER next transformations.
Driver
(Compiler, Optimizer, Executor)
Hive Execution Engine:
Executes the tasks produced by the compiler in
proper dependency order.
Interacts with the underlying Hadoop Interface to
ensure perfect synchronization with Hadoop
Query EXECUTOR services.
Driver
(Compiler, Optimizer, Executor)
Hive Server
Hive Server is the main component, providing a thrift interface & it provides connectivity
to modules Java DB Connectivity/Open DB Connectivity server namely JDBC/ODBC.
It enables the integration of HIVE with other applications.
JDBC ODBC
Web
Command Line
Interface
Interface Thrift Server
HIVE
DATA MODEL & HIVE QUERY LANGUAGE
HIVE | DATA TYPES
HIVE have three different Data types that are involved in Table Creation.
PREMITIVE TYPES COMPLEX TYPES USER-DEFINED TYPES
Integers: TINYINT, SMALLINT, INT and BIGINT
Boolean: BOOLEAN
Floating Types: FLOAT , DOUBLE
String: STRING (VARCHAR, CHAR)
HIVE | DATA TYPES
HIVE have three different Data types that are involved in Table Creation.
PREMITIVE TYPES COMPLEX TYPES USER-DEFINED TYPES
Structs: {a INT; b:INT}
Maps: M[‘group’]
Arrays: [‘a’, ‘b’, ‘c’], A[1] returns ‘b’
HIVE | DATA TYPES
HIVE have three different Data types that are involved in Table Creation.
PREMITIVE TYPES COMPLEX TYPES USER-DEFINED TYPES
Structures with Attributes
Attributes can be of Any Type
HIVE | DATA MODELS
Tables in HIVE are analogous to Tables in Relational Databases. Tables can be filtered, projected,
joined and unioined. Additionally all the data of a table is stored in a directory in HDFS. Hive also
supports the notion of external tables wherein a table can be created on pre-existing files or
directories in HDFS by providing the appropriate location to the table creation.
Two Types of tables in HIVE
Managed Tables External Tables
HIVE | DATA MODELS TABLE
HQL Command used to create Tables:
CREATE [TEMPORARY] [EXTERNAL] TABLE IF NOT EXISTS [db_name.]
tab_name[(col_name data_type [COMMENT col_comment], . . . )]
[COMMENT table_comment]
[ROW FORMAT row_format_type]
[STORED AS file_format_type]
COMMENT ‘db_details’ ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’ STORED AS file_format_type
IT SORES TABLES IN HIVE HDFS WAREHOUSE . . . .
HIVE | DATA MODELS EXTERNAL TABLE
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive
does not use a default location for this table. This comes in handy if you
already have data generated.
Dropping an EXTERNAL table, data in the table is NOT deleted from the
file system.
An EXTERNAL table points to any HDFS location for its storage, rather
than being stored in a folder specified by the configuration
property.
HQL Command to create External Commands:
CREATE EXTERNAL TABLE weatherext ( wban INT, date STRING)
COMMENT ‘this is external table view’
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE LOCATION ‘ /hive/data/weatherext’;
HIVE | PARTITIONING TABLES
Hive stores tables in partitions. Partitions are used to divide the table into related parts.
Partitions make data querying more efficient. For example in the above weather table
the data can be partitioned on the basis of year and month and when query is fired on
weather table this partition can be used as one of the column.
HQL Commands to create the Partitioning the table:
CREATE EXTERNAL TABLE IF NOT EXSISTS weatherext
( wban INT, date STRING)
PARTITIONED BY (year INT, month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
LOCATION ‘location of text_file’;
HIVE | FLOW OF CSV FILE INTO HIVE
That’s All for Theory
LETS PRACTICE NOW . . . .