
COMPREHENSIVE GUIDE TO INTERVIEWS FOR HIVE FOR BIG DATA

ZEP ANALYTICS
Introduction
We've curated this series of interview guides to accelerate your learning and your mastery of data science skills and tools.

From job-specific technical questions to tricky behavioral inquiries and unexpected brainteasers and guesstimates, we will prepare you for any job candidacy in the fields of data science, data analytics, BI analytics, or Big Data.

These guides are the result of our data analytics expertise, direct experience interviewing at companies, and countless conversations with job candidates. Their goal is to teach by example - not only by giving you a list of interview questions and their answers, but also by sharing the techniques and thought processes behind each question and the expected answer.

Become a global tech talent and unleash your next, best self with all the knowledge and tools to succeed in a data analytics interview with this series of guides.


Data Science interview questions cover a wide scope of multidisciplinary topics. That means you can never be quite sure what challenges the interviewer(s) might send your way.

That being said, being familiar with the types of questions you can encounter is an important aspect of your preparation process.

Below you’ll find examples of real-life questions and answers. Reviewing those should help you assess the areas you’re confident in and where you should invest additional efforts to improve.

Become a Tech Blogger at Zep!!
Why don't you start your journey as a blogger and enjoy unlimited free perks and cash prizes every month.

Explore

1. What is Hive?
Apache Hive is an open-source data warehouse system built on top of Hadoop. It offers an SQL-like query language, so we can use Hive for analyzing and querying large data sets stored in Hadoop.

2. Why do we need Hive?
Hive is a tool in the Hadoop ecosystem that provides an interface to organize and query data in a database-like fashion and to write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.

3. What is a metastore in Hive?
It is a relational database that stores the metadata of Hive tables, partitions, Hive databases, and so on. When you create a table, this metastore gets updated with the information related to the new table, and it is consulted whenever you issue queries on that table.

4. Is Hive suitable to be used for OLTP systems? Why?
No. Hive does not provide row-level insert and update, so it is not suitable for OLTP systems.

5. Can you explain ACID transactions in Hive?
ACID stands for Atomicity, Consistency, Isolation and Durability. Hive supports ACID transactions at the row level, which enables row-level INSERT, UPDATE and DELETE operations on transactional tables.
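As a minimal sketch (table and column names hypothetical, and assuming the transaction manager is configured as shown), a transactional table must be stored as ORC and flagged accordingly:

SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- bucketed ORC table marked transactional
CREATE TABLE orders (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- row-level update now works
UPDATE orders SET status = 'shipped' WHERE id = 42;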

6. What are the types of tables in Hive?
There are two types of tables in Hive: the internal table (also known as the managed table) and the external table.

7. What kind of data warehouse application is suitable for Hive?
Hive is not a full database; the design constraints of Hadoop and HDFS put restrictions on what Hive can do. Hive is most suitable for data warehouse applications where the data being analyzed is relatively static, fast response times are not required, and the data does not change rapidly.
Hive does not provide the fundamental features required for OLTP (Online Transaction Processing); it is suited to data warehouse applications over large data sets.

8. Explain what a Hive variable is. What do we use it for?
A Hive variable is created in the Hive environment and can be referenced by Hive scripts and queries. It lets you pass values to a query when the query starts executing; variables are set with the SET command (and can be loaded from a script with the source command).
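A minimal sketch of passing a value into a query (variable, table and column names hypothetical):

SET hivevar:min_amount = 100;
-- ${hivevar:min_amount} is substituted into the query text before execution
SELECT * FROM orders WHERE amount > ${hivevar:min_amount};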

9. How do you change the warehouse.dir location for older tables?
To change the base location of new Hive tables, edit the hive.metastore.warehouse.dir parameter. This will not affect older tables; for those, the metadata needs to be changed directly in the metastore database (MySQL or Derby). The location of Hive tables is kept in the table SDS, column LOCATION.
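A hedged sketch of such a metadata change (paths hypothetical; this runs against the metastore RDBMS, not Hive, so back up the metastore first):

-- rewrite old warehouse paths in the metastore's SDS table (MySQL syntax)
UPDATE SDS
SET LOCATION = REPLACE(LOCATION, 'hdfs://nn:8020/user/hive/warehouse', 'hdfs://nn:8020/new/warehouse')
WHERE LOCATION LIKE 'hdfs://nn:8020/user/hive/warehouse%';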
10. What are the types of metastore available in Hive?
There are three types of metastore available in Hive:
Embedded metastore (Derby)
Local metastore
Remote metastore

11. Is it possible for multiple users to share the same metastore in case of an embedded Hive?
No, it is not possible to use the embedded metastore in sharing mode. It is recommended to use a standalone, real database like MySQL or PostgreSQL.

12. If you run the Hive server, what mechanisms are available for connecting to it from an application?
There are the following ways by which you can connect with the Hive server:
1. Thrift client: using Thrift you can call Hive commands from various programming languages, e.g. C++, Java, PHP, Python and Ruby.
2. JDBC driver: it supports the JDBC protocol for Java applications.
3. ODBC driver: it supports the ODBC protocol.

13. What is SerDe in Apache Hive?
SerDe is short for Serializer/Deserializer. Hive uses a SerDe, together with a FileFormat, to read and write data from tables. An important concept behind Hive is that it does NOT own the Hadoop File System format in which data is stored. Users are able to write files to HDFS with whatever tool or mechanism takes their fancy and expose them to Hive (via CREATE EXTERNAL TABLE or LOAD DATA INPATH), and Hive uses the SerDe to correctly parse that file format. A SerDe is thus a powerful and customizable mechanism that Hive uses to parse data stored in HDFS so that it can be used by Hive.

14. Which classes are used by Hive to read and write HDFS files?
The following classes are used by Hive to read and write HDFS files:
TextInputFormat / HiveIgnoreKeyTextOutputFormat: these two classes read/write data in plain text file format.
SequenceFileInputFormat / SequenceFileOutputFormat: these two classes read/write data in Hadoop SequenceFile format.

15. Give examples of the SerDe classes which Hive uses to serialize and deserialize data.
Hive currently uses these SerDe classes to serialize and deserialize data:

MetadataTypedColumnsetSerDe: used to read/write delimited records like CSV, tab-separated and control-A-separated records (quoting is not supported yet).

ThriftSerDe: used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.

DynamicSerDe: also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol and TCTLSeparatedProtocol (which writes data in delimited records).

16. How do you write your own custom SerDe, and what is the need for that?
In most cases, users want to write a Deserializer instead of a full SerDe, because they just want to read their own data format rather than write to it. For example, the RegexDeserializer will deserialize the data using the configuration parameter regex, and possibly a list of column names.
If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch. The reason is that the framework passes DDL to the SerDe through the Thrift DDL format, and it's non-trivial to write a Thrift DDL parser.
Depending on the nature of the data a user has, the inbuilt SerDes may not satisfy its format, in which case users need to write their own Java code to satisfy their data format requirements.

17. What is ObjectInspector functionality?
Hive uses ObjectInspector to analyze the internal structure of a row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
An instance of a Java class (Thrift or native Java)
A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map)
A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java object. The ObjectInspector not only tells us the structure of the object, but also gives us ways to access its internal fields. In simple terms, ObjectInspector functionality in Hive is used to analyze the internal structure of columns, rows, and complex objects, and it allows access to the internal fields inside those objects.

18. What is the functionality of the Query Processor in Apache Hive?
This component implements the processing framework for converting SQL into a graph of map/reduce jobs, plus the execution-time framework to run those jobs in the order of their dependencies, with the help of metastore details.

19. What is the limitation of the Derby database for the Hive metastore?
With the Derby database, you cannot have multiple connections or multiple sessions instantiated at the same time. Derby runs in local mode and creates a lock file, so multiple users cannot access Hive simultaneously.

20. What are managed and external tables?
We have got two things: the data, present in HDFS, and the metadata, present in some database. There are two categories of Hive tables: managed and external tables.

For managed tables, both the data and the metadata are managed by Hive, and if you drop a managed table, both data and metadata are deleted. There are situations where your data is controlled by some other application and you want to read that data through Hive, but you must not allow Hive to delete it. In such a case, you can create an external table in Hive.

For an external table, the metadata is controlled by Hive but the actual data is controlled by some other application. So, if you delete the table accidentally, only the metadata is lost, and the actual data stays wherever it is.
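A minimal sketch (table name and path hypothetical):

-- the files under /data/app_logs survive a DROP TABLE
CREATE EXTERNAL TABLE logs (line STRING)
LOCATION '/data/app_logs';

DROP TABLE logs;  -- deletes only the metadata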

21. What are the complex data types in Hive?
MAP: a map contains key-value pairs, where you can look up a value using its key.

STRUCT: a struct is a collection of elements of different data types. For example, an address combines different data types; the PIN code will be in integer format.

ARRAY: an array is a collection of homogeneous elements. For example, your skill set can contain any number of skills.

UNIONTYPE: it represents a column whose value can belong to any one of a set of data types of your choice.
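A short DDL sketch combining these types (table and field names hypothetical):

CREATE TABLE employee (
  name    STRING,
  skills  ARRAY<STRING>,                  -- homogeneous elements
  address STRUCT<city:STRING, pin:INT>,   -- mixed element types
  phones  MAP<STRING, STRING>             -- key-value lookup
);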

22. How does partitioning help in the faster execution of queries?
With partitioning, a sub-directory is created with the name of the partitioned column, and when you run a query with a WHERE clause on that column, only the matching sub-directory is scanned instead of the whole table. This gives you faster execution of queries.
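For example (table and partition names hypothetical), the query below scans only the country='IN' sub-directory:

CREATE TABLE sales (id INT, amount FLOAT)
PARTITIONED BY (country STRING);

SELECT SUM(amount) FROM sales WHERE country = 'IN';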

23. How do you enable dynamic partitioning in Hive?
There are two types of partitioning: static and dynamic. In static partitioning, you specify the partition column value while loading the data. In dynamic partitioning, you push the data into Hive and Hive decides which value should go into which partition. To enable dynamic partitioning, you have to set the properties below:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
Example: insert overwrite table emp_details_partitioned partition(location) select * from emp_details;

24. What is bucketing?
The values in a column are hashed into a number of buckets defined by the user. It is a way to avoid too many partitions or nested partitions while still ensuring optimized query output.
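A minimal bucketed-table sketch (names and bucket count hypothetical):

CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;  -- id values are hashed into 32 files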

25. How does bucketing help in the faster execution of queries?
If you have to join two large tables, you can go for a reduce-side join. But if both tables are bucketed into the same number of buckets (or multiples of it) and sorted on the same column, there is a possibility of a Sort-Merge-Bucket Map Join (SMBMJ), in which all the joins take place in the map phase itself by matching the corresponding buckets. Buckets are basically files created inside the HDFS directory.
There are different properties which you need to set for bucket map joins, as follows:
set hive.enforce.sortmergebucketmapjoin = false;
set hive.auto.convert.sortmerge.join = false;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
26. How do you enable bucketing in Hive?
By default, bucketing is disabled in Hive; you can enforce it by setting the property below:
set hive.enforce.bucketing = true;

27. What are the different file formats in Hive?
Every file format has its own characteristics, and Hive lets you easily choose the file format you want to use. The file formats supported by Hive include:
1. Text file format
2. Sequence file format
3. Parquet
4. Avro
5. RC file format
6. ORC

28. How is a SerDe different from a file format in Hive?
SerDe stands for Serializer and Deserializer. It determines how to encode and decode the field values, or the column values, from a record - that is, how you serialize and deserialize the values of a column. The file format, on the other hand, determines how records are stored as keys and values and how records are retrieved from the table.

29. What is RegexSerDe?
Regex stands for regular expression. Whenever you want to do pattern matching and store fields based on the matched pattern, you can use RegexSerDe, which is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe. In the SERDEPROPERTIES, you define your input pattern and output fields. For example, if you have lines like xyz/pq@def and want to extract xyz, pq and def separately, you can use:
"input.regex" = "(.*)/(.*)@(.*)"
To specify how to store the fields, you can use:
"output.format.string" = "%1$s %2$s %3$s"
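Putting it together, a hedged end-to-end sketch (table name, columns and location hypothetical):

CREATE EXTERNAL TABLE parsed (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.*)/(.*)@(.*)",
  "output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE
LOCATION '/user/data/parsed';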

30. How is the ORC file format optimized for data storage and analysis?
ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format. The columnar format is very easy to compress, which reduces a lot of storage cost. While querying, it reads only the particular columns instead of the whole row, as the records are stored in columnar format.
ORC also keeps an index on every block based on column statistics (min, max, sum, count), so when you query, it can skip blocks based on those indexes.

31. How do you access HBase tables from Hive?
Using the Hive-HBase storage handler, you can access HBase tables from Hive, and once you are connected, you can query HBase using SQL queries from Hive. You can also join multiple tables in HBase from Hive and retrieve the result.
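A sketch of the mapping DDL (Hive table, HBase table and column family names hypothetical):

CREATE TABLE hbase_users (key INT, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name")
TBLPROPERTIES ("hbase.table.name" = "users");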
32. When running a JOIN query, I see out-of-memory errors. Why?
This is usually caused by the order of the JOIN tables. Instead of [FROM tableA a JOIN tableB b ON ...], try [FROM tableB b JOIN tableA a ON ...]. Note that if you are using LEFT OUTER JOIN, you might want to change it to RIGHT OUTER JOIN. This trick usually solves the problem; the rule of thumb is to always put the table with a lot of rows having the same value in the join key on the rightmost side of the JOIN.
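For example (table names hypothetical), keeping the big table on the right lets Hive stream it through the reducers instead of buffering it:

-- small_t is buffered; big_t is streamed
SELECT s.k, b.v
FROM small_t s JOIN big_t b ON (s.k = b.k);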

33. Have you used MySQL as the metastore and faced errors like com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure?
This is usually caused by MySQL servers closing connections after the connection has been idle for some time. Running the following command on the MySQL server should solve the problem: [set global wait_timeout=120]
When using MySQL as a metastore, you may also see the error [com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes]. This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another character set, such as latin1, which is known to work.

34. Does Hive support Unicode?
You can use Unicode strings in data or comments, but you cannot use them for database, table or column names.
You can use UTF-8 encoding for Hive data. However, other encodings are not supported (HIVE-7142 introduced encoding support for LazySimpleSerDe; however, the implementation is not complete and does not address all cases).

35. Are Hive SQL identifiers (e.g. table names, column names, etc.) case sensitive?
No, Hive is case insensitive.

36. What is the best way to load XML data into Hive?
The easiest way is to use the Hive XML SerDe (com.ibm.spss.hive.serde2.xml.XmlSerDe), which will allow you to directly import and work with XML data.

37. When is Hive not suitable?
It does not provide OLTP transaction support, only OLAP. If an application requires OLTP, switch to a NoSQL database. HQL queries also have higher latency, due to MapReduce.

38. Mention the different modes of Hive.
Depending on the size of the data and the nodes in Hadoop, Hive can operate in two modes: local mode and MapReduce mode.

39. Mention what the Hive query processor does.
The Hive query processor converts a query into a graph of MapReduce jobs together with the execution-time framework, so that the jobs can be executed in the order of their dependencies.

40. Mention what (HS2) HiveServer2 is.
It is a server interface that performs the following functions:
It allows remote clients to execute queries against Hive and retrieve the results of those queries.
Based on Thrift RPC, its latest version includes advanced features such as multi-client concurrency and authentication.

41. Mention the components of the Hive query processor.
The components of the Hive query processor include:
1. Logical plan generation
2. Physical plan generation
3. Execution engine
4. Operators
5. UDFs and UDAFs
6. Optimizer
7. Parser
8. Semantic analyzer
9. Type checking

42. Explain how you can change a column data type in Hive.
You can change a column data type in Hive by using the command:
ALTER TABLE table_name CHANGE column_name column_name new_datatype;

43. Explain when to use explode in Hive.
Hadoop developers sometimes take an array as input and need to convert it into separate table rows. To convert complex data types into the desired table format, we can use the explode function.
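A small sketch (table and column names hypothetical):

-- emits one output row per element of the skills array
SELECT name, skill
FROM employee
LATERAL VIEW explode(skills) s AS skill;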

44. Mention the difference between ORDER BY and SORT BY in Hive.
SORT BY sorts the data within each reducer; you can use any number of reducers for a SORT BY operation. ORDER BY sorts all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer.
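For example (table name hypothetical):

SELECT * FROM sales SORT BY amount;   -- ordered within each reducer only
SELECT * FROM sales ORDER BY amount;  -- total order through a single reducer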

45. Mention how you can stop a partition from being queried.
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
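A sketch of the statement (table and partition spec hypothetical; the OFFLINE clause applies to older Hive releases):

ALTER TABLE sales PARTITION (year = 2015) ENABLE OFFLINE;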

46. Can we rename a Hive table?
Yes, using the command below:
ALTER TABLE table_name RENAME TO new_name;

47. What is the default location where Hive stores table data?
hdfs://namenode_server/user/hive/warehouse

48. Is there a date data type in Hive?
Yes. The TIMESTAMP data type stores dates in the java.sql.Timestamp format.

49. Can we run Unix shell commands from Hive? Give an example.
Yes, by using the ! mark just before the command. For example, !pwd at the Hive prompt will list the current directory.

50. Can Hive queries be executed from script files? How?
Yes, using the source command.
Example:
hive> source /path/to/file/file_with_query.hql

51. What is the importance of the .hiverc file?
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.

52. What are the default record and field delimiters used for Hive text files?
The default record delimiter is \n, and the default field delimiters are \001, \002 and \003.

53. What do you mean by schema on read?
The schema is validated against the data when reading the data, rather than enforced when writing the data.

54. How do you list all databases whose name starts with p?
SHOW DATABASES LIKE 'p.*';

55. What does the USE command in Hive do?
With the USE command you fix the database on which all subsequent Hive queries will run.
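For example (database name hypothetical):

USE sales_db;
SHOW TABLES;  -- now lists tables in sales_db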

56. How can you delete the DBPROPERTY in Hive?


There is no way you can delete the DBPROPERTY.

57. What is the significance of the line set hive.mapred.mode = strict;?
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents very large jobs from running for a long time.
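For example (table hypothetical, assuming it is partitioned by month):

SET hive.mapred.mode = strict;
-- rejected in strict mode: no partition filter
SELECT * FROM transactions;
-- allowed: prunes to a single partition
SELECT * FROM transactions WHERE month = 'Dec';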

58. How do you check if a particular partition exists?
This can be done with the following query:
SHOW PARTITIONS table_name PARTITION (partitioned_column = 'partition_value');

59. Which Java classes handle the input and output record encoding for files in Hive tables?
For input: org.apache.hadoop.mapred.TextInputFormat
For output: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

60. What is the significance of the 'IF EXISTS' clause while dropping a table?
When we issue DROP TABLE IF EXISTS table_name, Hive does not throw an error if the table being dropped does not exist in the first place.

61. When you point a partition of a Hive table to a new directory, what happens to the data?
The data stays in the old location. It has to be moved manually.

Write a query to insert a new column (new_col INT) into a Hive table (htab) at a position before an existing column (x_col):
ALTER TABLE htab ADD COLUMNS (new_col INT);
ALTER TABLE htab CHANGE COLUMN new_col new_col INT AFTER y_col;
Here y_col stands for the column immediately before x_col; Hive's CHANGE COLUMN clause supports FIRST and AFTER (not BEFORE) for positioning, so the new column is moved after the column that precedes x_col.

62. Does archiving Hive tables save any space in HDFS?
No. It only reduces the number of files, which makes them easier for the namenode to manage.

63. While loading data into a Hive table using the LOAD DATA clause, how do you specify that it is an HDFS file and not a local file?
By omitting the LOCAL clause in the LOAD DATA statement.
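For example (paths and table name hypothetical):

LOAD DATA INPATH '/user/data/file.txt' INTO TABLE mytable;        -- HDFS file
LOAD DATA LOCAL INPATH '/home/user/file.txt' INTO TABLE mytable;  -- local file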

64. If you omit the OVERWRITE clause while loading data into a Hive table, what happens to files which are new and files which already exist?
The new incoming files are just added to the target directory, and existing files whose names match incoming files are simply overwritten. Other files whose names do not match any of the incoming files continue to exist. If you add the OVERWRITE clause, all the existing data in the directory is deleted before the new data is written.

65. What does the following query do?
INSERT OVERWRITE TABLE employees PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
It creates partitions on the employees table, with the partition values coming from the corresponding columns in the SELECT clause. This is called a dynamic partition insert.

66. What is a table-generating function in Hive?
A table-generating function is a function which takes a single column as an argument and expands it into multiple columns or rows, for example explode().

67. How can Hive avoid MapReduce?
If we set the property hive.exec.mode.local.auto to true, then Hive will avoid MapReduce to fetch query results.

68. What is the difference between the LIKE and RLIKE operators in Hive?
The LIKE operator behaves the same way as the regular SQL operator used in SELECT queries, e.g. street_name LIKE '%Chi'.
The RLIKE operator uses the more advanced regular expressions available in Java, e.g. street_name RLIKE '.*(Chi|Oho).*', which will select any value containing either Chi or Oho.

69. Is it possible to create a Cartesian join between two tables using Hive?
Hive can express a Cartesian product, but it is strongly discouraged (and blocked in strict mode), because a Cartesian product cannot be parallelized effectively in MapReduce: all the data has to pass through a single reducer.

70. What should be the order of table sizes in a join query?
In a join query, the smallest table should be placed in the first position and the largest table in the last position.

71. What is the usefulness of the DISTRIBUTE BY clause in Hive?
It controls how the map output is distributed among the reducers. It is useful in case of streaming data.

72. How will you convert the string '51.2' to a float value in the price column?
SELECT CAST(price AS FLOAT)

73. What will be the result when you do CAST('abc' AS INT)?
Hive will return NULL.

74. Can we LOAD data into a view?
No. A view cannot be the target of an INSERT or LOAD statement.

75. What types of costs are associated with creating an index on Hive tables?
Indexes occupy space, and there is a processing cost in arranging the values of the column on which the index is created.

Give the command to see the indexes on a table.
SHOW INDEX ON table_name;
This will list all the indexes created on any of the columns in the table table_name.

76. What does /*+ STREAMTABLE(table_name) */ do?
It is a query hint to stream a table into memory before running the query. It is a query optimization technique.

77. Can a partition be archived? What are the advantages and disadvantages?
Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the namenode, and the archived file can still be queried using Hive. The disadvantage is that queries become less efficient, and archiving does not offer any space savings.

78. What is a generic UDF in Hive?
It is a UDF created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input arguments programmatically and provide an appropriate response.

79. The following statement failed to execute. What can be the cause?
LOAD DATA LOCAL INPATH '${env:HOME}/country/state/'
OVERWRITE INTO TABLE address;
The local inpath should point to a file, not a directory. The ${env:HOME} part is a valid variable available in the Hive environment.

80. How do you specify the table creator's name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table. It is added like this:
TBLPROPERTIES ('creator' = 'Joan')

81. Which method has to be overridden when we use a custom UDF in Hive?
Whenever you write a custom UDF in Hive, you have to extend the UDF class and override the evaluate() function.

82. Suppose I have installed Apache Hive on top of my Hadoop cluster using the default metastore configuration. What will happen if multiple clients try to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration in Apache Hive, to allow concurrent access for multiple clients.
Following are the steps to configure a MySQL database as the local metastore in Apache Hive. One should make the following changes in hive-site.xml:
1. javax.jdo.option.ConnectionURL should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
2. javax.jdo.option.ConnectionDriverName should be set to com.mysql.jdbc.Driver.
3. javax.jdo.option.ConnectionUserName should be set to the desired username.
4. javax.jdo.option.ConnectionPassword should be set to the desired password.
The JDBC driver JAR file for MySQL must be on the Hive classpath, i.e. the JAR file should be copied into the Hive lib directory.

Now, after restarting the Hive shell, it will
automatically connect to the MySQL database which
is running as a standalone metastore.

83. Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause LOCATION '<hdfs_path>'.

84. When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY when we have to sort huge datasets, because the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer. Using ORDER BY against a large number of inputs will therefore take a lot of time to execute.

85. What is dynamic partitioning and when is it used?
In dynamic partitioning, the values of the partition columns become known at runtime, i.e. while the data is being loaded into a Hive table.
One may use dynamic partitioning in the following two cases:
1. Loading data from an existing non-partitioned table, to improve sampling and therefore decrease the query latency.
2. When one does not know all the partition values beforehand, and finding those partition values manually in a huge data set would be a tedious task.

86. Suppose I create a table that contains details of all the transactions done by customers in the year 2016: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Now, after inserting 50,000 tuples into this table, I want to know the total revenue generated for each month, but Hive is taking too much time to process the query. How will you solve this problem, and what steps will you take?
We can solve this problem of query latency by partitioning the table by month. Then, for each month, we scan only the partitioned data for that month instead of the whole data set. As we know, we cannot partition an existing non-partitioned table directly, so we take the following steps:

1. Create a partitioned table, say partitioned_transaction:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non-partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
Now we can run the query against each partition and therefore decrease the query time.

87. How can you add a new partition for the month of December to the above partitioned table?
For adding a new partition to the above table partitioned_transaction, we issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month = 'Dec') LOCATION '/partitioned_transaction';

88. What is the default maximum number of dynamic partitions that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = <value>;

89. I am inserting data into a table based on partitions dynamically, but I received an error: FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error?
To remove this error, one has to execute the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

90. Suppose I have a CSV file sample.csv present in the temp directory with the following entries:
id, first_name, last_name, email, gender, ip_address
1, Hugh, Jackman, [email protected], Male, 136.90.241.52
2, David, Lawrence, [email protected], Male, 101.177.15.130
3, Andy, Hall, [email protected], Female, 114.123.153.64
4, Samuel, Jackson, [email protected], Male, 89.60.227.31
5, Emily, Rose, [email protected], Female, 119.92.21.19
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
SerDe stands for serializer/deserializer. A SerDe allows us to convert unstructured bytes into a record that we can process using Hive. SerDes are implemented in Java; Hive comes with several built-in SerDes, and many third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for sample.csv by issuing the following commands:
CREATE EXTERNAL TABLE sample (id INT, first_name STRING, last_name STRING, email STRING, gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION 'temp';
Now we can perform any query on the table sample:
SELECT first_name FROM sample WHERE gender = 'male';

91. Suppose I have a lot of small CSV files present in the input directory in HDFS, and I want to create a single Hive table corresponding to these files. The data in these files has the format {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files. So, how will you create a single Hive table for lots of small files without degrading the performance of the system?

One can use the SequenceFile format, which will group these small files together to form a single sequence file. The steps are as follows:

1. Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load the data into temp_table:
LOAD DATA INPATH 'input' INTO TABLE temp_table;

2. Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all the input files, and the problem of having lots of small files is eliminated.

92. Can we change settings within a Hive session? If yes, how?
Yes, we can change the settings within a Hive session using the SET command. It helps to change Hive job settings for an exact query.
Example: the following command enforces that buckets are populated according to the table definition:
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name:
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list will not include the Hadoop defaults. To include them, use:
SET -v
It will list all the properties, including the Hadoop defaults, in the system.

93. Is it possible to add 100 nodes when we already have 100 nodes in Hive? How?
Yes, we can add nodes by following the steps below:
Take a new system; create a new username and password.
Install SSH and set up SSH connections with the master node. Add the SSH public_rsa id key to the authorized_keys file.
Add the new data node's host name, IP address and other details in the /etc/hosts slaves file, e.g. 192.168.1.102 slave3.in slave3.
Start the DataNode on the new node.

Log in to the new node, e.g. su hadoop or ssh -X [email protected].
Start HDFS on the newly added slave node by using the command ./bin/hadoop-daemon.sh start datanode.
Check the output of the jps command on the new node.

94. Explain the concatenation function in Hive with an example.
The concatenation function joins the input strings; we can specify any number of strings separated by commas.
Example:
CONCAT('It', '-', 'is', '-', 'a', '-', 'eLearning', '-', 'provider');
Output:
It-is-a-eLearning-provider
Here we supply the separator between the strings each time. If the separator is common to every string, Hive provides another function, CONCAT_WS, where we specify the separator first:
CONCAT_WS('-', 'It', 'is', 'a', 'eLearning', 'provider');
Output: It-is-a-eLearning-provider

95. Explain the TRIM and REVERSE functions in Hive with examples.
The TRIM function deletes the spaces surrounding a string.
Example:
TRIM(' BHAVESH ');

Output:
BHAVESH
To remove only leading spaces: LTRIM(' BHAVESH');
To remove only trailing spaces: RTRIM('BHAVESH ');
The REVERSE function reverses the characters in a string.
Example:
REVERSE('BHAVESH');
Output:
HSEVAHB

96. Explain the process to access sub-directories recursively in Hive queries.
By using the commands below, we can access sub-directories recursively in Hive:
hive> SET mapred.input.dir.recursive=true;
hive> SET hive.mapred.supports.subdirectories=true;
Hive tables can then be pointed to the higher-level directory. This is suitable for a directory structure like /data/country/state/city/.

97. How do you skip header rows from a table in Hive?
Suppose a log file starts with header records such as:
System=...
Version=...
Sub-version=...
These are three header lines that we do not want to include in our Hive query. To skip header lines from our tables in Hive, set a table property that allows us to skip them:
CREATE EXTERNAL TABLE employee (name STRING, job STRING, dob STRING, id INT, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/user/data'
TBLPROPERTIES ('skip.header.line.count'='3');

98. What is the maximum size of the string data type supported by Hive? Mention the binary formats Hive supports.
The maximum size of the string data type supported by Hive is 2 GB.
Hive supports the text file format by default, and it supports the binary formats Sequence files, ORC files, Avro data files and Parquet files.
Sequence files: the general binary format; splittable, compressible and row-oriented.
ORC files: ORC is short for Optimized Row Columnar. It is a record-columnar, column-oriented storage file format. It divides the table into row splits, and within each split the values of each column are stored together.
Avro data files: like sequence files, splittable, compressible and row-oriented, but with support for schema evolution and multilingual bindings.

99. What is the precedence order of Hive configuration?
Hive uses a precedence hierarchy for setting properties (highest first):
1. The Hive SET command
2. The command-line -hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml
6. hadoop-default.xml

100. If you run a SELECT * query in Hive, why does it not run MapReduce?
The hive.fetch.task.conversion property of Hive lowers the latency caused by MapReduce overhead: when executing simple queries such as SELECT * with LIMIT, it skips the MapReduce function entirely.

101. How can Hive improve performance with ORC format tables?
We can store Hive data in a highly efficient manner using the Optimized Row Columnar (ORC) file format, which overcomes many Hive file format limitations. Using ORC files improves performance when reading, writing and processing data.
SET hive.compute.query.using.stats = true;
SET hive.stats.dbclass = fs;
CREATE TABLE orc_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
LINES TERMINATED BY '\n'
STORED AS ORC;

102. Explain the different types of joins in Hive.
HiveQL has 4 different types of joins:
JOIN: similar to an inner join in SQL.
FULL OUTER JOIN: combines the records of both the left and right outer tables that fulfil the join condition.
LEFT OUTER JOIN: all the rows from the left table are returned, even if there are no matches in the right table.
RIGHT OUTER JOIN: all the rows from the right table are returned, even if there are no matches in the left table.

103. How can you configure remote metastore mode in Hive?
To configure a remote metastore, the hive-site.xml file has to be configured with the property below, giving the IP address (or host name) and port of the metastore host:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node1:9083</value>
</property>

104. What happens on executing the below query? After executing it, if you modify the column, how will the changes be tracked?
hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
The query creates an index named index_bonuspay which points to the bonus column in the employee table. Whenever the value of bonus is modified, it will be stored using an index value.

105. How do you load data from a .txt file into a table stored as ORC in Hive?
LOAD DATA just copies the files to Hive's data files; Hive does not do any transformation while loading data into tables. So, in this case, the input file /home/user/test_details.txt would need to be in ORC format if you were loading it directly into an ORC table.
A possible workaround is to create a temporary table with STORED AS TEXTFILE, LOAD DATA into it, and then copy the data from this table into the ORC table.
Here is an example:
CREATE TABLE test_details_txt (visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc (visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into the text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy into the ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;

106. How do you create a Hive table with a multi-character delimiter?
FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe:
CREATE EXTERNAL TABLE tableex (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(\\d+)~\\*(.*)$")
STORED AS TEXTFILE LOCATION '/user/myusername';

107. Is there any way to get the column names along with the output when executing a query in Hive?
If we want to see the column names of a table in HiveQL output, the following Hive conf property should be set to true:
hive> set hive.cli.print.header=true;
If you prefer to always see the column names, update the $HOME/.hiverc file with the above setting in the first line. Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it contains, if any.

108. How do you improve Hive query performance on Hadoop?
1. Use the Tez engine
Apache Tez is an extensible framework for building high-performance batch and interactive data processing, coordinated by YARN in Hadoop. Tez improves on the MapReduce paradigm by increasing processing speed while maintaining MapReduce's ability to scale to petabytes of data.

The Tez engine can be enabled in your environment by setting hive.execution.engine to tez:
set hive.execution.engine=tez;

2. Use vectorization
Vectorization improves performance by fetching 1,024 rows in a single operation instead of fetching a single row each time. It improves performance for operations like filter, join, aggregation, etc. Vectorization can be enabled in the environment by executing the commands below:
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

3. Use ORCFile
The Optimized Row Columnar format provides highly efficient ways of storing Hive data, reducing the size of the stored data by up to 75% of the original. The ORCFile format is better than other Hive file formats when it comes to reading, writing, and processing data. It uses techniques like predicate push-down, compression, and more to improve query performance.

4. Use partitioning
With partitioning, data is stored in separate individual folders on HDFS. Instead of querying the whole dataset, Hive queries only the partitioned dataset:
1) Create a temporary table and load data into it
2) Create a partitioned table
3) Enable dynamic Hive partitioning
4) Import data from the temporary table into the partitioned table

5. Use bucketing
A Hive table is divided into a number of partitions (Hive partitions). Each partition can be further subdivided into clusters, or buckets; this is called bucketing or clustering.

6. Use cost-based query optimization
Hive optimizes each query's logical and physical execution plan before submitting it for final execution; however, in early versions of Hive this optimization was not based on the cost of the query. In later versions, queries are optimized according to their cost (for example, which type of join to perform, how to order joins, the degree of parallelism, etc.).

109. How do I query from a horizontal output to a vertical output?
The easiest way to achieve that is to use the explode function, selecting the data for the prev and next columns separately.
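A hedged sketch of the idea (table and column names hypothetical), unpivoting the prev and next columns of each row into two rows:

SELECT colname, value
FROM t
LATERAL VIEW explode(map('prev', prev, 'next', next)) x AS colname, value;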

110. Is there a simple way in Hive to strip non-numeric characters while keeping the minus sign, so that both negative and positive numbers are allowed?
We can try to use regexp_extract instead:
regexp_extract('abcd-9090', '.*(-[0-9]+)', 1)

111. What is the maximum character limit for a Hive table name?
The metastore service uses an RDBMS as its back end, and the standard MySQL schema defines:
CREATE TABLE IF NOT EXISTS TBLS (...
TBL_NAME varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
...)
So a table name can be at most 128 characters.

112. How can I convert a date in string format ("April 25, 2018") to a timestamp in Hive?
Use from_unixtime in conjunction with unix_timestamp:
select from_unixtime(unix_timestamp(`date`, 'MMM dd, yyyy'), 'yyyy-MM-dd')

113. How do you drop a Hive database when it contains tables?
Use the CASCADE keyword while dropping the database.
Example:
hive> drop database sampleDB cascade;

114. I dropped and recreated a Hive external table, but no data is shown. What should I do?
This happens because the table you created is a partitioned table. The insert you ran would have created a partition, e.g. for partition_column='abcxyz'. When you drop and re-create the table, Hive loses the information about the partitions; it only knows about the table. Run the command below to get Hive to re-create the partitions based on the data:
MSCK REPAIR TABLE user_dinesh.sampletable;

This brings our list of 100+ Hive for Big Data interview questions to an end.

We believe this concise guide will help you “expect the unexpected” and enter your first data analytics interview with confidence.

We at Zep provide a platform for education where your learning demands are fulfilled, without costing you extra.

ZEP ANALYTICS

Ready to take the next steps?

Zep offers a platform for education to learn, grow & earn.

Become a Tech Blogger at Zep!!
Why don't you start your journey as a blogger and enjoy unlimited free perks and cash prizes every month.

Explore

zepanalytics.com
