0% found this document useful (0 votes)
14 views24 pages

05 ImpalaHiveIntro

This chapter introduces Impala and Hive, two SQL querying tools for data stored in HDFS/HBase. It covers their functionalities, comparisons to traditional databases, and how organizations utilize them for data analysis. The chapter also highlights the differences in performance and features between Impala and Hive.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views24 pages

05 ImpalaHiveIntro

This chapter introduces Impala and Hive, two SQL querying tools for data stored in HDFS/HBase. It covers their functionalities, comparisons to traditional databases, and how organizations utilize them for data analysis. The chapter also highlights the differences in performance and features between Impala and Hive.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

Introduc)on

 to  Impala  and  Hive  


Chapter  5  

201509  
Course  Chapters  

1   Introduc)on   Course  Introduc)on  


2   Introduc)on  to  Hadoop  and  the  Hadoop  Ecosystem  
Introduc)on  to  Hadoop  
3   Hadoop  Architecture  and  HDFS  
4   Impor)ng  Rela)onal  Data  with  Apache  Sqoop  
5   Introduc*on  to  Impala  and  Hive  
Impor*ng  and  Modeling  
6   Working  with  Tables  in  Impala    
Structured  Data  
7   Data  Formats  
8   Data  File  Par))oning  
9   Capturing  Data  with  Apache  Flume     Inges)ng  Streaming  Data  

10   Spark  Basics  
11   Working  with  RDDs  in  Spark  
12   Aggrega)ng  Data  with  Pair  RDDs  
13   Wri)ng  and  Deploying  Spark  Applica)ons   Distributed  Data  Processing  with  
14   Parallel  Processing  in  Spark   Spark  
15   Spark  RDD  Persistence    
16   Common  PaDerns  in  Spark  Data  Processing  
17   Spark  SQL  and  DataFrames  

18   Conclusion   Course  Conclusion  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐2  
Introduc)on  to  Impala  and  Hive  

In  this  chapter  you  will  learn  


§ What  Hive  is  
§ What  Impala  is  
§ How  Impala  and  Hive  Compare  
§ How  to  query  data  using  Impala  and  Hive  
§ How  Hive  and  Impala  differ  from  a  rela*onal  database  
§ Ways  in  which  organiza*ons  use  Hive  and  Impala  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐3  
Chapter  Topics  

Impor*ng  and  Modeling  Structured  


Introduc*on  to  Impala  and  Hive  
Data  

§ Introduc*on  to  Impala  and  Hive  


§ Why  Use  Impala  and  Hive?  
§ Querying  Data  With  Impala  and  Hive  
§ Comparing  Hive  and  Impala  to  Tradi)onal  Databases  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐4  
Introduc)on  to  Impala  and  Hive  (1)  

§ Impala  and  Hive  are  both   SELECT zipcode, SUM(cost) AS total


tools  that  provide  SQL   FROM customers
querying  of  data  stored  in   JOIN orders
ON (customers.cust_id = orders.cust_id)
HDFS  /  HBase   WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

Hadoop  
Cluster  

HDFS  /  HBase  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐5  
Introduc)on  to  Impala  and  Hive  (2)  

§ Apache  Hive  is  a  high-­‐level  abstrac*on  on  top  of  MapReduce  


– Uses  HiveQL  
– Generates  MapReduce  or  Spark*  jobs  that  run  on  the  Hadoop  cluster  
– Originally  developed  at  Facebook  around  2007  
– Now  an  open-­‐source  Apache  project  
§ Cloudera  Impala  is  a  high-­‐performance  dedicated  SQL  engine    
– Uses  Impala  SQL  
– Inspired  by  Google’s  Dremel  project  
– Query  latency  measured  in  milliseconds  
– Developed  at  Cloudera  in  2012  
– Open-­‐source  with  an  Apache  license  

*  Hive-­‐on-­‐Spark  is  currently  in  beta  tes)ng      

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐6  
What’s  the  Difference?  

§ Hive  has  more  features  


– E.g.  Complex  data  types  (arrays,  maps)  and  full  support  for  
windowing  analy)cs  
– Highly  extensible  
– Commonly  used  for  batch  processing  
§ Impala  is  much  faster  
– Specialized  SQL  engine  offers  5x  to  50x  beDer  performance  
– Ideal  for  interac)ve  queries  and  data  analysis  
– More  features  being  added  over  )me  
 

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐7  
High-­‐Level  Overview  
SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

• Parse  HiveQL   • Parse  Impala  SQL  


• Make  op)miza)ons   • Make  op)miza)ons  
• Plan  execu)on   • Plan  execu)on  
• Submit  job(s)  to  cluster   • Execute  query  on  the    
• Monitor  progress   cluster  

Data  Processing  Engine  (MapReduce)   Hadoop  


Cluster  

Hadoop   HDFS  
Cluster  

HDFS  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐8  
Chapter  Topics  

Impor*ng  and  Modeling  Structured  


Introduc*on  to  Impala  and  Hive  
Data  

§ Introduc)on  to  Impala  and  Hive  


§ Why  Use  Impala  and  Hive?  
§ Querying  Data  With  Impala  and  Hive  
§ Comparing  Hive  to  Tradi)onal  Databases  
§ Conclusion  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐9  
Why  Use  Hive  and  Impala?  

§ Brings  large-­‐scale  data  analysis  to  a  broader  audience  


– No  sofware  development  experience  required  
– Leverage  exis)ng  knowledge  of  SQL  
§ More  produc*ve  than  wri*ng  MapReduce  or  Spark  directly    
– Five  lines  of  HiveQL/Impala  SQL  might  be  equivalent  to  200  lines  or  
more  of  Java  
§ Offers  interoperability  with  other  systems  
– Extensible  through  Java  and  external  scripts    
– Many  business  intelligence  (BI)  tools  support  Hive  and/or  Impala  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐10  
Use  Case:  Sen)ment  Analysis  

§ Many  organiza*ons  use  Hive  or  Impala  to  analyze  social  media  coverage  

Mentions of Dualcore on Social Media (by Hour)

Negative
Neutral
Positive

07 08 09 10 11 12 13 14 15 16 17 18

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐12  
Use  Case:  Business  Intelligence    

§ Many  leading  business  intelligence  tools  support  Hive  and  Impala  


Dualcore Inc. Dashboard
https://dashboard.example.com/ Google

Revenue by Period Order Shipments Per Month

Top States for In-Store Sales

Suppliers by Region

Japan: 31 suppliers

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐13  
Chapter  Topics  

Impor*ng  and  Modeling  Structured  


Introduc*on  to  Impala  and  Hive  
Data  

§ Introduc)on  to  Impala  and  Hive  


§ Why  Use  Impala  and  Hive?  
§ Querying  Data  With  Hive  and  Impala  
§ Comparing  Hive  to  Tradi)onal  Databases  
§ Conclusion  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐14  
Interac)ng  with  Hive  and  Impala  

§ Hive  and  Impala  offer  many  interfaces  for  running  queries  


– Command-­‐line  shell  
– Impala:  Impala-shell  
– Hive:  Hive  
– Hue  Web  UI  
– Hive  Query  Editor    
– Impala  Query  Editor  
– Metastore  Manager  
– ODBC  /  JDBC    

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐15  
Star)ng  the  Impala  Shell  

§ You  can  execute  statements  in  the  Impala  shell  


– This  interac)ve  tool  is  similar  to  the  shell  in  MySQL    
§ Execute  the  impala-shell  command  to  start  the  shell  
– Some  log  messages  truncated  to  beDer  fit  the  slide  

$ impala-shell
Connected to localhost.localdomain:21000
Server version: impalad version 2.1.0-cdh5 (…)
Welcome to the Impala shell.
[localhost.localdomain:21000] >
 
§ Use  -i hostname:port  op*on  to  connect  to  a  different  server  

$ impala-shell –i myserver.example.com:21000
[myserver.example.com:21000] >

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐16  
Using  the  Impala  Shell  

§ Enter  semicolon-­‐terminated  statements  at  the  prompt  


– Hit  [Enter]  to  execute  a  query  or  command  
– Use  the  quit  command  to  exit  the  shell  
§ Use impala-shell --help for  a  full  list  of  op*ons  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐17  
Using  Hue  with  Hive  and  Impala  

You  can  use  Hue  to…  

Query  data  with  


Hive  or  Impala  

View  and  manage  


the  Metastore  

©  Copyright  2010a 2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-18
The  Hue  Query  Editor  

§ The  Impala  and  Hive  Query  editors  are  nearly  iden*cal

Enter,  edit,  save  


and  execute  
queries  

Choose  a  
database  

Explore  
schema  and  
sample  data  

View  results,  logs,  


reports,  etc.  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐19  
Chapter  Topics  

Impor*ng  and  Modeling  Structured  


Introduc*on  to  Impala  and  Hive  
Data  

§ Introduc)on  to  Impala  and  Hive  


§ Why  Use  Impala  and  Hive?  
§ Querying  Data  With  Impala  and  Hive  
§ Comparing  Hive  and  Impala  to  Tradi*onal  Databases  
§ Conclusion  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐20  
Your  Cluster  is  Not  a  Database  Server  

§ Client-­‐server  database  management  systems  have  many  strengths  


– Very  fast  response  )me  
– Support  for  transac)ons  
– Allow  modifica)on  of  exis)ng  records  
– Can  serve  thousands  of  simultaneous  clients  
§ Your  Hadoop  cluster  is  not  an  RDBMS  
– Hive  generates  processing  engine  jobs  (MapReduce)  from  HiveQL  
queries  
– Limita)ons  of  HDFS  and  MapReduce  s)ll  apply  
– Impala  is  faster  but  not  intended  for  the  throughput  speed  required  for  
an  OLTP  database  
– No  transac)on  support  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐21  
Comparing  Hive  and  Impala  To  A  Rela)onal  Database  

Rela*onal  Database   Hive   Impala  


Query  language SQL  (full) SQL  (subset) SQL  (subset)
Update  individual   Yes No No
records
Delete  individual   Yes No No
records
Transac*ons Yes No No
Index  support Extensive Limited No
Latency Very  low High Low
Data  size Terabytes Petabytes Petabytes

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐22  
Chapter  Topics  

Impor*ng  and  Modeling  Structured  


Introduc*on  to  Impala  and  Hive  
Data  

§ Introduc)on  to  Impala  and  Hive  


§ Why  Use  Impala  and  Hive?  
§ Querying  Data  With  Impala  and  Hive  
§ Comparing  Hive  and  Impala  to  Tradi)onal  Databases  
§ Conclusion  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐23  
Essen)al  Points  

§ Impala  and  Hive  are  tools  for  performing  SQL  queries  on  data  in  HDFS  
§ HiveQL  and  Impala  SQL  are  very  similar  to  SQL-­‐92  
– Easy  to  learn  for  those  with  rela)onal  database  experience  
– However,  does  not  replace  your  RDBMS  
§ Hive  generates  jobs  that  run  on  the  Hadoop  cluster  data  processing  engine  
– Runs  MapReduce  jobs  on  Hadoop  based  on  HiveQL  statements  
§ Impala  execute  queries  directly  on  the  Hadoop  cluster  
– Uses  a  very  fast  specialized  SQL  engine,  not  MapReduce  

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐24  
Bibliography  

The  following  offer  more  informa*on  on  topics  discussed  in  this  chapter  
§ Programming  Hive  (O’Reilly  book)
– http://tiny.cloudera.com/programminghive
§ Data  Analysis  with  Hadoop  and  Hive  at  Orbitz
– http://tiny.cloudera.com/dac09b
§ Wired  Ar*cle  on  Impala
– http://tiny.cloudera.com/wiredimpala

©  Copyright  2010-­‐2015  Cloudera.  All  rights  reserved.  Not  to  be  reproduced  or  shared  without  prior  wriDen  consent  from  Cloudera.   5-­‐25  

You might also like