Cloudera Apache Impala Guide
Important Notice
© 2010-2018 Cloudera, Inc. All rights reserved.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and company
names or logos mentioned in this document are the property of their respective owners.
Reference to any products, services, processes or other information, by trade name,
trademark, manufacturer, supplier or otherwise does not constitute or imply
endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Cloudera.
The information in this document is subject to change without notice. Cloudera shall
not be liable for any damages resulting from technical errors or omissions which may
be present in this document, or from use of this document.
Cloudera, Inc.
395 Page Mill Road
Palo Alto, CA 94306
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Installing Impala.....................................................................................................27
What is Included in an Impala Installation.........................................................................................................27
Installing Impala with Cloudera Manager..........................................................................................................27
Installing Impala without Cloudera Manager.....................................................................................................28
Managing Impala...................................................................................................30
Post-Installation Configuration for Impala.........................................................................................................30
Configuring Impala to Work with ODBC.............................................................................................................31
Downloading the ODBC Driver.............................................................................................................................................32
Configuring the ODBC Port..................................................................................................................................................32
Example of Setting Up an ODBC Application for Impala......................................................................................................32
Notes about JDBC and ODBC Interaction with Impala SQL Features...................................................................................33
Configuring Impala to Work with JDBC..............................................................................................................34
Configuring the JDBC Port....................................................................................................................................................34
Choosing the JDBC Driver.....................................................................................................................................................34
Enabling Impala JDBC Support on Client Systems................................................................................................................35
Establishing JDBC Connections.............................................................................................................................................36
Notes about JDBC and ODBC Interaction with Impala SQL Features...................................................................................37
Kudu Considerations for DML Statements...........................................................................................................................37
Upgrading Impala...................................................................................................38
Upgrading Impala through Cloudera Manager - Parcels....................................................................................38
Upgrading Impala through Cloudera Manager - Packages.................................................................................39
Upgrading Impala without Cloudera Manager...................................................................................................40
Starting Impala......................................................................................................42
Starting Impala through Cloudera Manager.......................................................................................................42
Starting Impala from the Command Line...........................................................................................................42
Modifying Impala Startup Options.....................................................................................................................43
Configuring Impala Startup Options through Cloudera Manager.......................................................................................43
Configuring Impala Startup Options through the Command Line.......................................................................................43
Checking the Values of Impala Configuration Options........................................................................................................46
Startup Options for impalad Daemon..................................................................................................................................46
Startup Options for statestored Daemon.............................................................................................................................46
Startup Options for catalogd Daemon.................................................................................................................................46
Impala Tutorials.....................................................................................................47
Tutorials for Getting Started...............................................................................................................................47
Explore a New Impala Instance............................................................................................................................................47
Load CSV Data from Local Files............................................................................................................................................52
Point an Impala Table at Existing Data Files........................................................................................................................53
Describe the Impala Table....................................................................................................................................................55
Query the Impala Table........................................................................................................................................................55
Data Loading and Querying Examples.................................................................................................................................56
Advanced Tutorials.............................................................................................................................................58
Attaching an External Partitioned Table to an HDFS Directory Structure............................................................................58
Switching Back and Forth Between Impala and Hive...........................................................................................................60
Cross Joins and Cartesian Products with the CROSS JOIN Operator....................................................................................61
Dealing with Parquet Files with Unknown Schema............................................................................................62
Impala Administration...........................................................................................75
Admission Control and Query Queuing..............................................................................................................75
Overview of Impala Admission Control................................................................................................................................75
How Impala Admission Control Relates to YARN.................................................................................................................76
How Impala Schedules and Enforces Limits on Concurrent Queries....................................................................................76
How Admission Control works with Impala Clients (JDBC, ODBC, HiveServer2)..................................................................77
Configuring Admission Control............................................................................................................................................77
Guidelines for Using Admission Control...............................................................................................................................82
Integrated Resource Management with YARN...................................................................................................83
The Llama Daemon..............................................................................................................................................................83
How Resource Limits Are Enforced.......................................................................................................................................84
impala-shell Query Options for Resource Management......................................................................................................84
Limitations of Resource Management for Impala...............................................................................................................84
Setting Timeout Periods for Daemons, Queries, and Sessions...........................................................................84
Increasing the Statestore Timeout.......................................................................................................................................84
Setting the Idle Query and Idle Session Timeouts for impalad.............................................................................................84
Cancelling a Query...............................................................................................................................................................85
Using Impala through a Proxy for High Availability............................................................................................85
Overview of Proxy Usage and Load Balancing for Impala...................................................................................................86
Special Proxy Considerations for Clusters Using Kerberos...................................................................................................87
Example of Configuring HAProxy Load Balancer for Impala................................................................................................88
Managing Disk Space for Impala Data................................................................................................................89
Impala Security......................................................................................................91
Security Guidelines for Impala...........................................................................................................................91
Securing Impala Data and Log Files....................................................................................................................92
Installation Considerations for Impala Security..................................................................................................93
Securing the Hive Metastore Database..............................................................................................................93
Securing the Impala Web User Interface............................................................................................................93
Configuring TLS/SSL for Impala..........................................................................................................................94
Using Cloudera Manager.....................................................................................................................................................94
Using the Command Line.....................................................................................................................................................95
Using TLS/SSL with Business Intelligence Tools....................................................................................................................95
Enabling Sentry Authorization for Impala..........................................................................................................95
The Sentry Privilege Model..................................................................................................................................................96
Starting the impalad Daemon with Sentry Authorization Enabled......................................................................................96
Using Impala with the Sentry Service (CDH 5.1 or higher only)...........................................................................................97
Using Impala with the Sentry Policy File..............................................................................................................................98
Setting Up Schema Objects for a Secure Impala Deployment............................................................................................102
Privilege Model and Object Hierarchy...............................................................................................................................102
Debugging Failed Sentry Authorization Requests..............................................................................................................105
Managing Sentry for Impala through Cloudera Manager.................................................................................................106
The DEFAULT Database in a Secure Deployment...............................................................................................................106
Impala Authentication......................................................................................................................................106
Enabling Kerberos Authentication for Impala....................................................................................................................106
Enabling LDAP Authentication for Impala..........................................................................................................................109
Using Multiple Authentication Methods with Impala........................................................................................................111
Configuring Impala Delegation for Hue and BI Tools.........................................................................................................112
Auditing Impala Operations.............................................................................................................................112
Durability and Performance Considerations for Impala Auditing......................................................................................112
Format of the Audit Log Files.............................................................................................................................................113
Which Operations Are Audited..........................................................................................................................................113
Reviewing the Audit Logs...................................................................................................................................................114
Viewing Lineage Information for Impala Data..................................................................................................114
Troubleshooting Impala.......................................................................................578
Troubleshooting Impala SQL Syntax Issues......................................................................................................578
Troubleshooting I/O Capacity Problems...........................................................................................................578
Impala Troubleshooting Quick Reference........................................................................................................579
Impala Web User Interface for Debugging.......................................................................................................580
Debug Web UI for impalad................................................................................................................................................581
Note: Impala graduated from the Apache Incubator on November 15, 2017. In places where the
documentation formerly referred to “Cloudera Impala”, now the official name is “Apache Impala”.
Impala Benefits
Impala provides:
• Familiar SQL interface that data scientists and analysts already know.
• Ability to query high volumes of data (“big data”) in Apache Hadoop.
• Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity
hardware.
• Ability to share data files between different components with no copy or export/import step; for example, to
write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling
simple data interchange using Impala for analytics on Hive-produced data.
• Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for
analytics.
• Hive Metastore - Stores information about the data available to Impala. For example, the metastore lets Impala
know what databases are available and what the structure of those databases is. As you create, drop, and alter
schema objects, load data into tables, and so on through Impala SQL statements, the relevant metadata changes
are automatically broadcast to all Impala nodes by the dedicated catalog service introduced in Impala 1.2.
• Impala - This process, which runs on DataNodes, coordinates and executes queries. Each instance of Impala can
receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these
nodes then act as workers, executing parallel query fragments.
• HBase and HDFS - Storage for data to be queried.
Queries executed using Impala are handled as follows:
1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying
interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator
for the query.
2. Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances
across the cluster. Execution is planned for optimal efficiency.
3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
4. Each impalad returns data to the coordinating impalad, which sends these results to the client.
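For example, an application or impala-shell session can connect to any impalad, which then acts as the coordinator
for that query. (The hostname and table name below are placeholders.)
[impalad-host.example.com:21000] > select count(*) from sample_table;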
Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and
catalogd daemons do not have special requirements for high availability, because problems with those daemons do
not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the
Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and
restart the Impala service.
Related information:
Scalability Considerations for the Impala Statestore on page 516, Modifying Impala Startup Options on page 43, Starting
Impala on page 42, Increasing the Statestore Timeout on page 84, Ports Used by Impala on page 583
By default, the metadata loading and caching on startup happens asynchronously, so Impala can begin accepting
requests promptly. To enable the original behavior, where Impala waited until all metadata was loaded before accepting
any requests, set the catalogd configuration option --load_catalog_in_background=false.
Note:
In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table
is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload
of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast
mechanism faster and more responsive, especially during Impala startup. See New Features in Impala
Version 1.2.4 on page 617 for details.
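For example, after creating a table through Hive, you might make only that table visible to Impala rather than
reloading the whole catalog. (The table name here is illustrative.)
[localhost:21000] > invalidate metadata new_db.new_table;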
Related information: Modifying Impala Startup Options on page 43, Starting Impala on page 42, Ports Used by Impala
on page 583
• In Impala 1.2 and higher, UDFs let you perform custom comparisons and transformation logic during SELECT and
INSERT...SELECT statements.
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the
SQL dialect might require some learning and practice for you to become proficient in the Hadoop environment:
• Impala SQL is focused on queries and includes relatively little DML. There is no UPDATE or DELETE statement.
Stale data is typically discarded (by DROP TABLE or ALTER TABLE ... DROP PARTITION statements) or
replaced (by INSERT OVERWRITE statements).
• All data creation is done by INSERT statements, which typically insert data in bulk by querying from other tables.
There are two variations, INSERT INTO which appends to the existing data, and INSERT OVERWRITE which
replaces the entire contents of a table or partition (similar to TRUNCATE TABLE followed by a new INSERT).
Although there is an INSERT ... VALUES syntax to insert a small number of rows in a single statement, it is
far more efficient to use INSERT ... SELECT to copy and transform large amounts of data from one table
to another in a single operation. (See the sketch after this list.)
• You often construct Impala table definitions and data files in some other environment, and then attach Impala so
that it can run real-time queries. The same data files and table metadata are shared with other components of
the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data inserted by Hive, and Hive
can access tables and data produced by Impala. Many other Hadoop components can write files in formats such
as Parquet and Avro, that can then be queried by Impala.
• Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL
includes some idioms that you might find in the import utilities for traditional database systems. For example,
you can create a table that reads comma-separated or tab-separated text files, specifying the separator in the
CREATE TABLE statement. You can create external tables that read existing data files but do not move or transform
them.
• Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does not require
length constraints on string data types. For example, you can define a database column as STRING with unlimited
length, rather than CHAR(1) or VARCHAR(64). (Although in Impala 2.0 and later, you can also use
length-constrained CHAR and VARCHAR types.)
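The following sketch ties several of these points together, using hypothetical table names: an external table that
reads comma-separated text files in place, followed by a bulk INSERT ... SELECT that copies and transforms the
data into another existing table.
CREATE EXTERNAL TABLE csv_staging (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/impala/csv_staging';
INSERT OVERWRITE TABLE analytics_table SELECT id, upper(name) FROM csv_staging;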
Related information: Impala SQL Language Reference on page 116, especially Impala SQL Statements on page 214 and
Impala Built-In Functions on page 338
INVALIDATE METADATA by itself retrieves metadata for all the tables tracked by the metastore. If you know that only
specific tables have been changed outside of Impala, you can issue REFRESH table_name for each affected table to
retrieve the latest metadata for just those tables.
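For example, if an ETL job added new data files for a single table, a targeted statement avoids reloading metadata
for the whole catalog. (The table name is illustrative.)
[localhost:21000] > refresh sales_data;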
Impala Requirements
To perform as expected, Impala depends on the availability of the software, hardware, and configurations described
in the following sections.
Note:
Installing and configuring a Hive metastore is an Impala requirement. Impala does not work
without the metastore database. For the process of installing and configuring the metastore, see
Installing Impala on page 27.
Always configure a Hive metastore service rather than connecting directly to the metastore
database. The Hive metastore service is required to interoperate between possibly different levels
of metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to
the metastore database. The Hive metastore service is set up for you by default if you install
through Cloudera Manager 4.5 or higher.
A summary of the metastore installation process is as follows:
• Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
• Download the MySQL connector or the PostgreSQL connector and place it in the
/usr/share/java/ directory.
• Use the appropriate command line tool for your database to create the metastore database.
• Use the appropriate command line tool for your database to grant privileges for the metastore
database to the hive user.
• Modify hive-site.xml to include information matching your particular database: its URL,
user name, and password. You will copy the hive-site.xml file to the Impala Configuration
Directory later in the Impala installation process.
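As a sketch of the create-and-grant steps for MySQL (the database name, user, host, and password are all
placeholders; the PostgreSQL equivalents differ):
CREATE DATABASE metastore;
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';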
• Optional: Hive. Although only the Hive metastore database is required for Impala to function, you might install
Hive on some client machines to create and load data into tables that use certain file formats. See How Impala
Works with Hadoop File Formats on page 527 for details. Hive does not need to be installed on the same DataNodes
as Impala; it just needs access to the same metastore database.
Java Dependencies
Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop components:
• The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically resulting in a
failure at impalad startup. In particular, the JamVM used by default on certain levels of Ubuntu systems can
cause impalad to fail to start.
• Internally, the impalad daemon relies on the JAVA_HOME environment variable to locate the system Java libraries.
Make sure the impalad service is not run from an environment with an incorrect setting for this variable.
• All Java dependencies are packaged in the impala-dependencies.jar file, which is located at
/usr/lib/impala/lib/. These map to everything that is built under fe/target/dependency.
In the majority of cases, this automatic detection works correctly. If you need to explicitly set the hostname, do so by
setting the --hostname flag.
Hardware Requirements
During join operations, portions of data from each joined table are loaded into memory. Data sets can be very large,
so ensure your hardware has sufficient memory to accommodate the joins you anticipate completing.
While requirements vary according to data set size, the following is generally recommended:
• CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors.
Note: This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and
2.1 releases had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
• Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
processing on a particular node exceed the amount of memory available to Impala on that node, the query writes
temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and
intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join
tables that are much larger than the memory available on an individual node.
• Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk performance
with Impala. Ensure that you have sufficient disk space to store the data Impala will be querying.
For correct file deletion during DROP TABLE operations, Impala must be able to move files to the HDFS trashcan. You
might need to create an HDFS directory /user/impala, writable by the impala user, so that the trashcan can be
created. Otherwise, data files might remain behind after a DROP TABLE statement.
Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not permitted to
use direct reads. Therefore, running Impala as root negatively affects performance.
By default, any user can connect to Impala and access all the associated databases and tables. You can enable
authorization and authentication based on the Linux OS user who connects to the Impala server, and the associated
groups for that user. See Impala Security on page 91 for details. These security features do not change the underlying file
permission requirements; the impala user still needs to be able to access the data files.
For example, if you have only a few tens of megabytes of data per day, partitioning by YEAR, MONTH, and DAY columns
might be too granular. Most of your cluster might be sitting idle during queries that target a single day, or each node
might have very little work to do. Consider reducing the number of partition key columns so that each partition directory
contains several gigabytes worth of data.
For example, consider a Parquet table where each data file is 1 HDFS block, with a maximum block size of 1 GB. (In
Impala 2.0 and later, the default Parquet block size is reduced to 256 MB. For this exercise, let's assume you have
bumped the size back up to 1 GB by setting the query option PARQUET_FILE_SIZE=1g.) If you have a 10-node cluster,
you need 10 data files (up to 10 GB) to give each node some work to do for a query. But each core on each machine
can process a separate data block in parallel. With 16-core machines on a 10-node cluster, a query could process up
to 160 GB fully in parallel. If there are only a few data files per partition, not only do most cluster nodes sit idle
during queries, but so do most cores on those machines.
You can reduce the Parquet block size to as low as 128 MB or 64 MB to increase the number of files per partition and
improve parallelism. But also consider reducing the level of partitioning so that analytic queries have enough data to
work with.
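For example, before a bulk load you might lower the Parquet file size for the session so that each partition produces
more, smaller files. (The table and column names here are hypothetical.)
set PARQUET_FILE_SIZE=128m;
insert overwrite sales partition (year=2018)
  select id, amount from staging_sales where year = 2018;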
Installing Impala
Impala is an open-source add-on to the Cloudera Enterprise Core that returns rapid responses to queries.
Note:
Under CDH 5, Impala is included as part of the CDH installation and no separate steps are needed.
Therefore, the instruction steps in this section apply to CDH 4 only.
Before doing the installation, ensure that you have all necessary prerequisites. See Impala Requirements on page 23
for details.
Note:
To install the latest Impala under CDH 4, upgrade Cloudera Manager to 4.8 or higher. Cloudera Manager
4.8 is the first release that can manage the Impala catalog service introduced in Impala 1.2. Cloudera
Manager 4.8 requires this service to be present, so if you upgrade to Cloudera Manager 4.8, also
upgrade Impala to the most recent version at the same time.
For information on installing Impala in a Cloudera Manager-managed environment, see Installing Impala.
Managing your Impala installation through Cloudera Manager has a number of advantages. For example, when you
make configuration changes to CDH components using Cloudera Manager, it automatically applies changes to the
copies of configuration files, such as hive-site.xml, that Impala keeps under /etc/impala/conf. It also sets up
the Hive Metastore service that is required for Impala running under CDH 4.1.
In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular
component configuration details in some of the free-form option fields on the Impala configuration pages within
Cloudera Manager. In Cloudera Manager 4, these fields are labelled Safety Valve; in Cloudera Manager 5, they are
called Advanced Configuration Snippet.
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore_server_host:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>3600</value>
<description>MetaStore Client socket timeout in seconds</description>
</property>
3. (Optional) If you installed the full Hive component on any host, you can verify that the metastore is configured
properly by starting the Hive console and querying for the list of available tables. Once you confirm that the console
starts, exit the console to continue the installation:
$ hive
Hive history file=/tmp/root/hive_job_log_root_201207272011_678722950.txt
hive> show tables;
table1
table2
hive> quit;
$
4. Confirm that your package management command is aware of the Impala repository settings, as described in
Impala Requirements on page 23. (For CDH 4, this is a different repository than for CDH.) You might need to
download a repo or list file into a system directory underneath /etc.
5. Use one of the following sets of commands to install the Impala package:
Note: Cloudera recommends that you not install Impala on any HDFS NameNode. Installing
Impala on NameNodes provides no additional data locality, and executing queries with such a
configuration might cause memory contention and negatively impact the HDFS NameNode.
6. Copy the client hive-site.xml, core-site.xml, hdfs-site.xml, and hbase-site.xml configuration files
to the Impala configuration directory, which defaults to /etc/impala/conf. Create this directory if it does not
already exist.
7. Use one of the following commands to install impala-shell on the machines from which you want to issue
queries. You can install impala-shell on any supported machine that can connect to DataNodes that are running
impalad.
8. Complete any required or recommended configuration, as described in Post-Installation Configuration for Impala
on page 30. Some of these configuration changes are mandatory. (They are applied automatically when you install
using Cloudera Manager.)
Once installation and configuration are complete, see Starting Impala on page 42 for how to activate the software on
the appropriate nodes in your cluster.
If this is your first time setting up and using Impala in this cluster, run through some of the exercises in Impala Tutorials
on page 47 to verify that you can do basic operations such as creating tables and querying them.
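As a minimal smoke test (the names are arbitrary), you might run statements such as the following in impala-shell:
create table smoke_test (id int, s string);
insert into smoke_test values (1, 'hello');
select * from smoke_test;
drop table smoke_test;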
Managing Impala
This section explains how to configure Impala to accept connections from applications that use popular programming
APIs:
• Post-Installation Configuration for Impala on page 30
• Configuring Impala to Work with ODBC on page 31
• Configuring Impala to Work with JDBC on page 34
This type of configuration is especially useful when using Impala in combination with Business Intelligence tools, which
use these standard interfaces to query different kinds of database and Big Data systems.
You can also configure these other aspects of Impala:
• Impala Security on page 91
• Modifying Impala Startup Options on page 43
Note: If you use Cloudera Manager, you can enable short-circuit reads through a checkbox in the
user interface and that setting takes effect for Impala as well.
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hdfs-sockets/dn</value>
</property>
<property>
<name>dfs.client.file-block-storage-locations.timeout.millis</name>
<value>10000</value>
</property>
Note: If you are also going to enable block location tracking, you can skip copying configuration
files and restarting DataNodes and go straight to Optional: Block Location Tracking. Configuring
short-circuit reads and block location tracking require the same process of copying files and
restarting services, so you can complete that process once when you have completed all
configuration changes. Whether you copy files and restart services now or during configuring
block location tracking, short-circuit reads are not enabled until you complete those final steps.
<property>
<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
<value>true</value>
</property>
2. Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory
to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
3. After applying these changes, restart all DataNodes.
Verify that the ODBC driver, the operating system, and the third-party product have all been approved for use together.
Before configuring your systems to use ODBC, download a connector. You may need to sign in and accept license
agreements before accessing the pages required for downloading ODBC connectors.
Important: As of late 2015, most business intelligence applications are certified with the 2.x ODBC
drivers. Although the instructions on this page cover both the 2.x and 1.x drivers, expect to use the
2.x drivers exclusively for most ODBC applications connecting to Impala.
$ ls -1
Cloudera-ODBC-Driver-for-Impala-Install-Guide.pdf
BI_Tool_Installer.dmg
iodbc-sdk-3.52.7-macosx-10.5.dmg
ClouderaImpalaODBC.dmg
$ open iodbc-sdk-3.52.7-macosx-10.5.dmg
Install the IODBC driver using its installer
$ open ClouderaImpalaODBC.dmg
Install the Cloudera ODBC Connector using its installer
$ installer_dir=$(pwd)
$ cd /opt/cloudera/impalaodbc
$ ls -1
Cloudera ODBC Driver for Impala Install Guide.pdf
Readme.txt
Setup
lib
ErrorMessages
Release Notes.txt
Tools
$ cd Setup
$ ls
odbc.ini odbcinst.ini
$ cp odbc.ini ~/.odbc.ini
$ vi ~/.odbc.ini
$ cat ~/.odbc.ini
[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.
# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=hostname.sample.example.com
PORT=21050
Schema=default
# General settings
TSaslTransportBufSize=1000
RowsFetchedPerBlock=10000
SocketTimeout=0
StringColumnLength=32767
UseNativeQuery=0
$ pwd
/opt/cloudera/impalaodbc/Setup
$ cd $installer_dir
$ open BI_Tool_Installer.dmg
Install the BI tool using its installer
$ ls /Applications | grep BI_Tool
BI_Tool.app
$ open -a BI_Tool.app
In the BI tool, connect to a data source using port 21050
Notes about JDBC and ODBC Interaction with Impala SQL Features
Most Impala SQL features work equivalently through the impala-shell interpreter or through the JDBC or ODBC APIs. The
following are some exceptions to keep in mind when switching between the interactive shell and applications using
the APIs:
Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy,
be cautious about reusing the connections. If the load balancer has set up connection timeout values,
either check the connection frequently so that it never sits idle longer than the load balancer timeout
value, or check the connection validity before using it and create a new one if the connection has
been closed.
• The Impala complex types (STRUCT, ARRAY, or MAP) are available in CDH 5.5 / Impala 2.3 and higher. To use these
types with JDBC requires version 2.5.28 or higher of the Cloudera JDBC Connector for Impala. To use these types
with ODBC requires version 2.5.30 or higher of the Cloudera ODBC Connector for Impala. Consider upgrading all
JDBC and ODBC drivers at the same time you upgrade to CDH 5.5 or higher.
• Although the result sets from queries involving complex types consist of all scalar values, the queries involve join
notation and column references that might not be understood by a particular JDBC or ODBC connector. Consider
defining a view that represents the flattened version of a table containing complex type columns, and pointing
the JDBC or ODBC application at the view. See Complex Types (CDH 5.5 or higher only) on page 156 for details.
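A minimal sketch of that flattening approach, assuming a hypothetical table customers with scalar columns id and
name and an ARRAY<STRING> column phones:
CREATE VIEW customers_flat AS
  SELECT c.id, c.name, p.item AS phone
  FROM customers c, c.phones p;
The JDBC or ODBC application then queries customers_flat like any table of scalar columns.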
Note: The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance
improvements for Impala queries that return large result sets. Impala 2.0 and later are compatible
with the Hive 0.13 driver. If you already have an older JDBC driver installed, and are running Impala
2.0 or higher, consider upgrading to the latest Hive JDBC driver for best performance with JDBC
applications.
If you are using JDBC-enabled applications on hosts outside the CDH cluster, you cannot use the CDH install procedure
on the non-CDH hosts. Install the JDBC driver on at least one CDH host using the preceding procedure. Then download
the JAR files to each client machine that will use JDBC with Impala:
commons-logging-X.X.X.jar
hadoop-common.jar
hive-common-X.XX.X-cdhX.X.X.jar
hive-jdbc-X.XX.X-cdhX.X.X.jar
hive-metastore-X.XX.X-cdhX.X.X.jar
hive-service-X.XX.X-cdhX.X.X.jar
httpclient-X.X.X.jar
httpcore-X.X.X.jar
libfb303-X.X.X.jar
libthrift-X.X.X.jar
log4j-X.X.XX.jar
slf4j-api-X.X.X.jar
slf4j-logXjXX-X.X.X.jar
To enable JDBC support for Impala on the system where you run the JDBC application:
1. Download the JAR files listed above to each client machine.
Note: For Maven users, see this sample github page for an example of the dependencies you
could add to a pom file instead of downloading the individual JARs.
2. Store the JAR files in a location of your choosing, ideally a directory already referenced in your CLASSPATH setting.
For example:
• On Linux, you might use a location such as /opt/jars/.
• On Windows, you might use a subdirectory underneath C:\Program Files.
3. To successfully load the Impala JDBC driver, client programs must be able to locate the associated JAR files. This
often means setting the CLASSPATH for the client process to include the JARs. Consult the documentation for
your JDBC client for more details on how to install new JDBC drivers, but some examples of how to set CLASSPATH
variables include:
• On Linux, if you extracted the JARs to /opt/jars/, you might issue the following command to prepend the
JAR files path to an existing classpath:
export CLASSPATH=/opt/jars/*.jar:$CLASSPATH
• On Windows, use the System Properties control panel item to modify the Environment Variables for your
system. Modify the environment variables to include the path to which you extracted the files.
Note: If the existing CLASSPATH on your client machine refers to some older version of the
Hive JARs, ensure that the new JARs are the first ones listed. Either put the new JAR files
earlier in the listings, or delete the other references to Hive JAR files.
Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy,
be cautious about reusing the connections. If the load balancer has set up connection timeout values,
either check the connection frequently so that it never sits idle longer than the load balancer timeout
value, or check the connection validity before using it and create a new one if the connection has
been closed.
jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
jdbc:hive2://myhost.example.com:21050/;auth=noSasl
To connect to an instance of Impala that requires Kerberos authentication, use a connection string of the form
jdbc:hive2://host:port/;principal=principal_name. The principal must be the same user principal you
used when starting Impala. For example, you might use:
jdbc:hive2://myhost.example.com:21050/;principal=impala/[email protected]
To connect to an instance of Impala that requires LDAP authentication, use a connection string of the form
jdbc:hive2://host:port/db_name;user=ldap_userid;password=ldap_password. For example, you might
use:
jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123
Note:
Currently, the Hive JDBC driver does not support connections that use both Kerberos authentication
and SSL encryption. To use both of these security features with Impala through a JDBC application,
use the Cloudera JDBC Connector as the JDBC driver.
Notes about JDBC and ODBC Interaction with Impala SQL Features
Most Impala SQL features work equivalently through the impala-shell interpreter or through the JDBC or ODBC APIs. The
following are some exceptions to keep in mind when switching between the interactive shell and applications using
the APIs:
• Complex type considerations:
– Queries involving the complex types (ARRAY, STRUCT, and MAP) require notation that might not be available
in all levels of JDBC and ODBC drivers. If you have trouble querying such a table due to the driver level or
inability to edit the queries used by the application, you can create a view that exposes a “flattened” version
of the complex columns and point the application at the view. See Complex Types (CDH 5.5 or higher only)
on page 156 for details.
– The complex types available in CDH 5.5 / Impala 2.3 and higher are supported by the JDBC getColumns()
API. Both MAP and ARRAY are reported as the JDBC SQL Type ARRAY, because this is the closest matching Java
SQL type. This behavior is consistent with Hive. STRUCT types are reported as the JDBC SQL Type STRUCT.
To be consistent with Hive's behavior, the TYPE_NAME field is populated with the primitive type name for
scalar types, and with the full toSql() for complex types. The resulting type names are somewhat inconsistent,
because nested types are printed differently than top-level types. For example, the following list shows how
the toSql() output for Impala types is translated to TYPE_NAME values:
Upgrading Impala
Upgrading Impala involves stopping Impala services, using your operating system's package management tool to
upgrade Impala to the latest version, and then restarting Impala services.
Note:
• Each version of CDH 5 has an associated version of Impala. When you upgrade from CDH 4 to
CDH 5, you get whichever version of Impala comes with the associated level of CDH. Depending
on the version of Impala you were running on CDH 4, this could install a lower level of Impala on
CDH 5. For example, if you upgrade to CDH 5.0 from CDH 4 plus Impala 1.4, the CDH 5.0 installation
comes with Impala 1.3. Always check the associated level of Impala before upgrading to a specific
version of CDH 5. Where practical, upgrade from CDH 4 to the latest CDH 5, which also has the
latest Impala.
• When you upgrade Impala, also upgrade Cloudera Manager if necessary:
– Users running Impala on CDH 5 must upgrade to Cloudera Manager 5.0.0 or higher.
– Users running Impala on CDH 4 must upgrade to Cloudera Manager 4.8 or higher. Cloudera
Manager 4.8 includes management support for the Impala catalog service, and is the minimum
Cloudera Manager version you can use.
– Cloudera Manager is continually updated with configuration settings for features introduced
in the latest Impala releases.
• If you are upgrading from CDH 5 beta to CDH 5.0 production, make sure you are using the
appropriate CDH 5 repositories shown on the CDH version and packaging page, then follow the
procedures throughout the rest of this section.
• Every time you upgrade to a new major or minor Impala release, see Incompatible Changes and
Limitations in Apache Impala on page 625 in the Release Notes for any changes needed in your
source code, startup scripts, and so on.
• Also check Known Issues and Workarounds in Impala on page 635 in the Release Notes for any
issues or limitations that require workarounds.
Important: In CDH 5, there is not a separate Impala parcel; Impala is part of the main CDH 5 parcel.
Each level of CDH 5 has a corresponding version of Impala, and you upgrade Impala by upgrading CDH.
See the CDH 5 upgrade instructions and choose the instructions for parcels. The remainder of this
section only covers parcel upgrades for Impala under CDH 4.
and then remove the packages using one of the following commands:
5. Use one of the following sets of commands to update Impala shell on each node on which it is installed:
For RHEL, Oracle Linux, or CentOS systems:
2. Check if there are new recommended or required configuration settings to put into place in the configuration
files, typically under /etc/impala/conf. See Post-Installation Configuration for Impala on page 30 for settings
related to performance and scalability.
3. Use one of the following sets of commands to update Impala on each Impala node in your cluster:
For RHEL, Oracle Linux, or CentOS systems:
4. Use one of the following sets of commands to update Impala shell on each node on which it is installed:
For RHEL, Oracle Linux, or CentOS systems:
5. Depending on which release of Impala you are upgrading from, you might find that the symbolic links
/etc/impala/conf and /usr/lib/impala/sbin are missing. If so, see Known Issues and Workarounds in
Impala on page 635 for the procedure to work around this problem.
6. Restart Impala services:
a. Restart the Impala state store service on the desired nodes in your cluster. Expect to see a process named
statestored if the service started successfully.
Restart the state store service before the Impala server service to avoid “Not connected” errors when you
run impala-shell.
b. Restart the Impala catalog service on whichever host it runs on in your cluster. Expect to see a process named
catalogd if the service started successfully.
c. Restart the Impala daemon service on each node in your cluster. Expect to see a process named impalad if
the service started successfully.
Note:
If the services did not start successfully (even though the sudo service command might display
[OK]), check for errors in the Impala log file, typically in /var/log/impala.
Starting Impala
To activate Impala if it is installed but not yet started:
1. Set any necessary configuration options for the Impala services. See Modifying Impala Startup Options on page
43 for details.
2. Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently, and to
continue running in the event of availability problems for other Impala nodes. If the statestore becomes unavailable,
Impala continues to function.
3. Start one instance of the Impala catalog service.
4. Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local processing
and avoid network traffic due to remote reads.
Once Impala is running, you can conduct interactive experiments using the instructions in Impala Tutorials on page 47
and try Using the Impala Shell (impala-shell Command) on page 470.
Note:
Currently, Impala UDFs and UDAs are not persisted in the metastore database. Information about
these functions is held in the memory of the catalogd daemon. You must reload them by running
the CREATE FUNCTION statements again each time you restart the catalogd daemon.
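For example, a script run after each catalogd restart might re-create the functions; the library path and symbol
name here are placeholders:
create function my_lower(string) returns string
  location '/user/impala/udfs/libudfsample.so' symbol='MyLower';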
Start the Impala service on each DataNode using a command similar to the following:
$ sudo service impala-server start
IMPALA_STATE_STORE_HOST=127.0.0.1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_SERVICE_HOST=...
IMPALA_STATE_STORE_HOST=...
export IMPALA_STATE_STORE_ARGS=${IMPALA_STATE_STORE_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}}
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}"
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}
To use alternate values, edit the defaults file, then restart all the Impala-related services so that the changes take
effect. For example, restart the Impala server using the following command:
$ sudo service impala-server restart
To point the impalad daemons at a statestore host with the IP address 192.168.0.27 rather than the loopback address,
change:
IMPALA_STATE_STORE_HOST=127.0.0.1
to:
IMPALA_STATE_STORE_HOST=192.168.0.27
• Catalog server address (including both the hostname and the port number). Update the value of the
IMPALA_CATALOG_SERVICE_HOST variable. Cloudera recommends the catalog server be on the same host as
the statestore. In that recommended configuration, the impalad daemon cannot refer to the catalog server using
the loopback address. If the catalog service is hosted on a machine with an IP address of 192.168.0.27, add the
following line:
IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000
The /etc/default/impala defaults file currently does not define an IMPALA_CATALOG_ARGS environment
variable, but if you add one it will be recognized by the service startup/shutdown script. Add a definition for this
variable to /etc/default/impala and add the option -catalog_service_host=hostname. If the port is
different than the default 26000, also add the option -catalog_service_port=port.
• Memory limits. You can limit the amount of memory available to Impala. For example, to allow Impala to use no
more than 70% of system memory, change:
export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}}
to:
export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}
You can specify the memory limit using absolute notation such as 500m or 2G, or as a percentage of physical
memory such as 60%.
Note: Queries that exceed the specified memory limit are aborted. Percentage limits are based
on the physical memory of the machine and do not consider cgroups.
• Core dump enablement. To enable core dumps on systems not managed by Cloudera Manager, change:
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}
to:
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}
On systems managed by Cloudera Manager, enable the Enable Core Dump setting for the Impala service.
Note:
• The location of core dump files may vary according to your operating system configuration.
• Other security settings may prevent Impala from writing core dumps even when this option
is enabled.
• On systems managed by Cloudera Manager, the default location for core dumps is on a
temporary filesystem, which can lead to out-of-space issues if the core dumps are large,
frequent, or not removed promptly. To specify an alternative location for the core dumps,
filter the Impala configuration settings to find the core_dump_dir option, which is available
in Cloudera Manager 5.4.3 and higher. This option lets you specify a different directory for
core dumps for each of the Impala-related daemons.
• Authorization using the open source Sentry plugin. Specify the -server_name and
-authorization_policy_file options as part of the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS
settings to enable the core Impala support for authentication. See Starting the impalad Daemon with Sentry
Authorization Enabled on page 96 for details.
• Auditing for successful or blocked Impala queries, another aspect of security. Specify the
-audit_event_log_dir=directory_path option and optionally the
-max_audit_event_log_file_size=number_of_queries and -abort_on_failed_audit_event options
as part of the IMPALA_SERVER_ARGS settings, for each Impala node, to enable and customize auditing. See
Auditing Impala Operations on page 112 for details.
• Password protection for the Impala web UI, which listens on port 25000 by default. This feature involves adding
some or all of the --webserver_password_file, --webserver_authentication_domain, and
--webserver_certificate_file options to the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS
settings. See Security Guidelines for Impala on page 91 for details.
• Another setting you might add to IMPALA_SERVER_ARGS is a comma-separated list of query options and values:
-default_query_options='option=value,option=value,...'
These options control the behavior of queries performed by this impalad instance. The option values you specify
here override the default values for Impala query options, as shown by the SET statement in impala-shell.
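For example (the specific query options shown here are illustrative):
export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} \
-default_query_options='mem_limit=2g,sync_ddl=true'}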
• During troubleshooting, Cloudera Support might direct you to change other values, particularly for
IMPALA_SERVER_ARGS, to work around issues or gather debugging information.
Note:
These startup options for the impalad daemon are different from the command-line options for the
impala-shell command. For the impala-shell options, see impala-shell Configuration Options
on page 470.
Impala Tutorials
This section includes tutorial scenarios that demonstrate how to begin using Impala once the software is installed. It
focuses on techniques for loading data, because once you have some data in tables and can query that data, you can
quickly progress to more advanced Impala features.
Note:
Where practical, the tutorials take you from “ground zero” to having the desired Impala tables and
data. In some cases, you might need to download additional files from outside sources, set up additional
software components, modify commands or scripts to fit your own configuration, or substitute your
own sample data.
Before trying these tutorial lessons, install Impala using one of these procedures:
• If you already have a CDH environment set up and just need to add Impala to it, follow the installation process
described in Installing Impala on page 27. Make sure to also install the Hive metastore service if you do not already
have Hive configured.
• To set up Impala and all its prerequisites at once, in a minimal configuration that you can use for small-scale
experiments, set up the Cloudera QuickStart VM, which includes CDH and Impala on CentOS. Use this single-node
VM to try out basic SQL functionality, not anything related to performance and scalability. For more information,
see the Cloudera QuickStart VM.
[localhost:21000] > show databases;
+--------------------------+
| name |
+--------------------------+
| _impala_builtins |
| ctas |
| d1 |
| d2 |
| d3 |
| default |
| explain_plans |
| external_table |
| file_formats |
| tpc |
+--------------------------+
[localhost:21000] > select current_database();
+--------------------+
| current_database() |
+--------------------+
| default |
+--------------------+
[localhost:21000] > show tables;
+-------+
| name |
+-------+
| ex_t |
| t1 |
+-------+
[localhost:21000] > show tables in d3;
Once you know what tables and databases are available, you descend into a database with the USE statement. To
understand the structure of each table, you use the DESCRIBE command. Once inside a database, you can issue
statements such as INSERT and SELECT that operate on particular tables.
The following example explores a database named TPC whose name we learned in the previous example. It shows
how to filter the table names within a database based on a search string, examine the columns of a table, and run
queries to examine the characteristics of the table data. For example, for an unfamiliar table you might want to know
the number of rows, the number of different values for a column, and other properties such as whether the column
contains any NULL values. When sampling the actual data values from a table, use a LIMIT clause to avoid excessive
output if the table contains more rows or distinct values than you expect.
| Dr. |
| |
| Miss |
| Sir |
| Mrs. |
+--------------+
When you graduate from read-only exploration, you use statements such as CREATE DATABASE and CREATE TABLE
to set up your own database objects.
The following example demonstrates creating a new database holding a new table. Although the last example ended
inside the TPC database, the new EXPERIMENTS database is not nested inside TPC; all databases are arranged in a
single top-level list.
The following example creates a new table, T1. To illustrate a common mistake, it creates this table inside the wrong
database, the TPC database where the previous example ended. The ALTER TABLE statement lets you move the table
to the intended database, EXPERIMENTS, as part of a rename operation. The USE statement is always needed to switch
to a new database, and the current_database() function confirms which database the session is in, to avoid these
kinds of mistakes.
+--------------------+
| tpc |
+--------------------+
[localhost:21000] > alter table t1 rename to experiments.t1;
[localhost:21000] > use experiments;
[localhost:21000] > show tables;
+------+
| name |
+------+
| t1 |
+------+
[localhost:21000] > select current_database();
+--------------------+
| current_database() |
+--------------------+
| experiments |
+--------------------+
For your initial experiments with tables, you can use ones with just a few columns and a few rows, and text-format
data files.
Note: As you graduate to more realistic scenarios, you will use more elaborate tables with many
columns, features such as partitioning, and file formats such as Parquet. When dealing with realistic
data volumes, you will bring in data using LOAD DATA or INSERT ... SELECT statements to operate
on millions or billions of rows at once.
The following example sets up a couple of simple tables with a few rows, and performs queries involving sorting,
aggregate functions and joins.
$ whoami
cloudera
$ hdfs dfs -ls /user
Found 3 items
drwxr-xr-x - cloudera cloudera 0 2013-04-22 18:54 /user/cloudera
drwxrwx--- - mapred mapred 0 2013-03-15 20:11 /user/history
drwxr-xr-x - hue supergroup 0 2013-03-15 20:10 /user/hive
Here is some sample data, for two tables named TAB1 and TAB2.
Copy the following content to .csv files in your local filesystem:
tab1.csv:
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
tab2.csv:
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
Put each .csv file into a separate HDFS directory using commands like the following, which use paths available in the
Impala Demo VM:
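(The sample_data directory names below are illustrative; substitute any HDFS paths that your user has permission to write to.)
$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2
$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2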
The name of each data file is not significant. In fact, when Impala examines the contents of the data directory for the
first time, it considers all files in the directory to make up the data of the table, regardless of how many files there are
or what the files are named.
To understand what paths are available within your own HDFS filesystem and what the permissions are for the various
directories and files, issue hdfs dfs -ls / and work your way down the tree doing -ls operations for the various
directories.
Use the impala-shell command to create tables, either interactively or through a SQL script.
The following example shows creating three tables. For each table, the example shows creating columns with various
attributes such as Boolean or integer types. The example also includes commands that provide information about how
the data is formatted, such as rows terminating with commas, which makes sense in the case of importing data from
a .csv file. Where we already have .csv files containing data in the HDFS directory tree, we specify the location of
the directory containing the appropriate .csv file. Impala considers all the data from all the files in that directory to
represent the data for the table.
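A sketch of those statements, reusing the illustrative sample_data paths from the -put commands above; the column
types follow from the sample .csv data, and the third table, TAB3, holds transformed results used later:
DROP TABLE IF EXISTS tab1;
-- The EXTERNAL clause means the data files live outside the central
-- Impala data location and are preserved if the table is dropped.
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';
DROP TABLE IF EXISTS tab2;
CREATE EXTERNAL TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2';
DROP TABLE IF EXISTS tab3;
-- Leaving out the EXTERNAL clause means the data is managed
-- by Impala in its central warehouse location.
CREATE TABLE tab3
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';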
Note: Getting through these CREATE TABLE statements successfully is an important validation step
to confirm everything is configured correctly with the Hive metastore and HDFS permissions. If you
receive any errors during the CREATE TABLE statements:
• Make sure you followed the installation instructions closely, in Installing Impala on page 27.
• Make sure the hive.metastore.warehouse.dir property points to a directory that Impala
can write to. The ownership should be hive:hive, and the impala user should also be a member
of the hive group.
• If the value of hive.metastore.warehouse.dir is different in the Cloudera Manager dialogs
and in the Hive shell, you might need to designate the hosts running impalad with the “gateway”
role for Hive, and deploy the client configuration files to those hosts.
Instead of typing statements interactively, you can pass impala-shell a *.sql file with contents similar to the
following, to create an Impala table that accesses an existing data file used by Hive.
The following examples set up 2 tables, referencing the paths and sample data supplied with the Cloudera QuickStart
VM. For historical reasons, the data physically resides in an HDFS directory tree under /user/hive, although this
particular data is entirely managed by Impala rather than Hive. When we create an external table, we specify the
directory containing one or more data files, and Impala queries the combined content of all the files inside that directory.
Here is how we examine the directories and files within the HDFS filesystem:
$ cd ~/cloudera/datasets
$ ./tpcds-setup.sh
... Downloads and unzips the kit, builds the data and loads it into HDFS ...
$ hdfs dfs -ls /user/hive/tpcds/customer
Found 1 items
-rw-r--r-- 1 cloudera supergroup 13209372 2013-03-22 18:09
/user/hive/tpcds/customer/customer.dat
$ hdfs dfs -cat /user/hive/tpcds/customer/customer.dat | more
1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javie
[email protected]|2452508|
2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|TOGO||Amy.Moses@
Ovk9KjHH.com|2452318|
3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|18|9|1979|NIUE||
[email protected]|2452313|
4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|1983|MEXICO||Mic
[email protected]|2452361|
5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|1956|FIJI||Robert.
[email protected]|2452469|
...
Here is a SQL script to set up Impala tables pointing to some of these data files in HDFS. (The script in the VM sets up
tables like this through Hive; ignore those tables for purposes of this demonstration.) Save the following as
customer_setup.sql:
--
-- store_sales fact table and surrounding dimension tables only
--
create database tpcds;
use tpcds;
create external table customer_address
(
ca_address_sk bigint,
ca_address_id string,
ca_street_number string,
ca_street_name string,
ca_street_type string,
ca_suite_number string,
ca_city string,
ca_county string,
ca_state string,
ca_zip string,
ca_country string,
ca_gmt_offset float,
ca_location_type string
)
row format delimited fields terminated by '|'
location '/user/hive/tpcds/customer_address';
$ impala-shell -i impala-host
Connected to impala-host:21000
• Passing a single command to the impala-shell command. The query is executed, the results are returned, and
the shell exits. Make sure to quote the command, preferably with single quotation marks to avoid shell expansion
of characters such as *.
Loading Data
Loading data involves:
• Establishing a data set. The example below uses .csv files.
• Creating tables into which to load the data.
• Loading the data into the tables you created.
Sample Queries
To run these sample queries, create a SQL query file query.sql, copy and paste each query into the query file, and
then run the query file using the shell. For example, to run query.sql on impala-host, you might use the command:
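A representative command (the -f option runs the statements from a file, and -i specifies the host to connect to):
impala-shell -i impala-host -f query.sql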
The examples and results below assume you have loaded the sample data into the tables as described above.
Example: Examining Contents of Tables
Let's start by verifying that the tables do contain the data we expect. Because Impala often deals with tables containing
millions or billions of rows, when examining tables of unknown size, include the LIMIT clause to avoid huge amounts
of unnecessary output, as in the final query. (If your interactive query starts displaying an unexpected volume of data,
press Ctrl-C in impala-shell to cancel the query.)
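Queries along the following lines produce the result sets shown below; they are reconstructions for illustration,
not the only possible phrasing:
SELECT * FROM tab1;
SELECT * FROM tab2;
SELECT * FROM tab2 LIMIT 5;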
Results:
+----+-------+------------+-------------------------------+
| id | col_1 | col_2 | col_3 |
+----+-------+------------+-------------------------------+
| 1 | true | 123.123 | 2012-10-24 08:55:00 |
| 2 | false | 1243.5 | 2012-10-25 13:40:00 |
| 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 |
| 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5 | true | 243.325 | 1953-04-22 09:11:33 |
+----+-------+------------+-------------------------------+
+----+-------+---------------+
| id | col_1 | col_2 |
+----+-------+---------------+
| 1 | true | 12789.123 |
| 2 | false | 1243.5 |
| 3 | false | 24453.325 |
| 4 | false | 2423.3254 |
| 5 | true | 243.325 |
| 60 | false | 243565423.325 |
| 70 | true | 243.325 |
| 80 | false | 243423.325 |
| 90 | true | 243.325 |
+----+-------+---------------+
+----+-------+-----------+
| id | col_1 | col_2 |
+----+-------+-----------+
| 1 | true | 12789.123 |
| 2 | false | 1243.5 |
| 3 | false | 24453.325 |
| 4 | false | 2423.3254 |
| 5 | true | 243.325 |
+----+-------+-----------+
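A join-plus-aggregation query of roughly this shape produces the results that follow (illustrative):
SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2)
FROM tab2 JOIN tab1 USING (id)
GROUP BY tab1.col_1
ORDER BY 1 LIMIT 5;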
Results:
+-------+-----------------+-----------------+
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
+-------+-----------------+-----------------+
| false | 24453.325 | 1243.5 |
| true | 12789.123 | 243.325 |
+-------+-----------------+-----------------+
SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
FROM tab2, tab1
WHERE tab1.id = tab2.id
GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
Results:
+----+-------+-----------+
| id | col_1 | col_2 |
+----+-------+-----------+
| 1 | true | 12789.123 |
| 3 | false | 24453.325 |
+----+-------+-----------+
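A statement sequence along these lines, using the TAB3 table sketched earlier, produces the final result set
(illustrative):
INSERT OVERWRITE TABLE tab3
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1 WHERE YEAR(col_3) = 2012;
SELECT * FROM tab3;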
Results:
+----+-------+---------+-------+-----+
| id | col_1 | col_2 | month | day |
+----+-------+---------+-------+-----+
| 1 | true | 123.123 | 10 | 24 |
| 2 | false | 1243.5 | 10 | 25 |
+----+-------+---------+-------+-----+
Advanced Tutorials
These tutorials walk you through advanced scenarios or specialized features.
Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in a different
location; for historical reasons, it is sometimes under the HDFS path /user/hive/warehouse.) We use the hdfs
dfs -ls command to examine the nested subdirectories corresponding to each partitioning column, with separate
subdirectories at each level (with = in their names) representing the different values for each partitioning column.
When we get to the lowest level of subdirectory, we use the hdfs dfs -cat command to examine the data file and
see CSV-formatted data produced by the INSERT statement in Impala.
Found 2 items
drwxr-xr-x - impala hive 0 2013-08-07 12:23
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
drwxr-xr-x - impala hive 0 2013-08-07 12:24
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=08
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
Found 2 items
drwxr-xr-x - impala hive 0 2013-08-07 12:22
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
drwxr-xr-x - impala hive 0 2013-08-07 12:23
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=29
$ hdfs dfs -ls
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
Found 2 items
drwxr-xr-x - impala hive 0 2013-08-07 12:21
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
drwxr-xr-x - impala hive 0 2013-08-07 12:22
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -ls
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
Found 1 items
-rw-r--r-- 3 impala hive 12 2013-08-07 12:21
/user/impala/warehouse/external_partiti
ons.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0
$ hdfs dfs -cat
/user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/\
host=host1/3981726974111751120--8907184999369517436_822630111_data.0
foo,foo,foo
Still in the Linux shell, we use hdfs dfs -mkdir to create several data directories outside the HDFS directory tree
that Impala controls (/user/impala/warehouse in this example, maybe different in your case). Depending on your
configuration, you might need to log in as a user with permission to write into this HDFS directory tree; for example,
the commands shown here were run while logged in as the hdfs user.
We make a tiny CSV file, with values different than in the INSERT statements used earlier, and put a copy within each
subdirectory that we will use as an Impala partition.
$ cat >dummy_log_data
bar,baz,bletch
$ hdfs dfs -mkdir -p
/user/impala/data/logs/year=2013/month=08/day=01/host=host1
$ hdfs dfs -mkdir -p
/user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -mkdir -p
/user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -mkdir -p
/user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=08/day=01/host=host1
Back in the impala-shell interpreter, we move the original Impala-managed table aside, and create a new external
table with a LOCATION clause pointing to the directory under which we have set up all the partition subdirectories
and data files.
use external_partitions;
alter table logs rename to logs_original;
create external table logs (field1 string, field2 string, field3 string)
partitioned by (year string, month string, day string, host string)
location '/user/impala/data/logs';
Because partition subdirectories and data files come and go during the data lifecycle, you must identify each of the
partitions through an ALTER TABLE statement before Impala recognizes the data files they contain.
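For example, statements along these lines register the partitions set up earlier:
alter table logs add partition (year="2013",month="07",day="28",host="host1");
alter table logs add partition (year="2013",month="07",day="28",host="host2");
alter table logs add partition (year="2013",month="07",day="29",host="host1");
alter table logs add partition (year="2013",month="08",day="01",host="host1");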
We issue a REFRESH statement for the table, always a safe practice when data files have been manually added, removed,
or changed. Then the data is ready to be queried. The SELECT * statement illustrates that the data from our trivial
CSV file was recognized in each of the partitions where we copied it. Although in this case there are only a few rows,
we include a LIMIT clause on this test query just in case there is more data than we expect.
refresh logs;
select * from logs limit 100;
+--------+--------+--------+------+-------+-----+-------+
| field1 | field2 | field3 | year | month | day | host |
+--------+--------+--------+------+-------+-----+-------+
| bar | baz | bletch | 2013 | 07 | 28 | host1 |
| bar | baz | bletch | 2013 | 08 | 01 | host1 |
| bar | baz | bletch | 2013 | 07 | 29 | host1 |
| bar | baz | bletch | 2013 | 07 | 28 | host2 |
+--------+--------+--------+------+-------+-----+-------+
For examples showing how this process works for the REFRESH statement, look at the examples of creating RCFile and
SequenceFile tables in Impala, loading data through Hive, and then querying the data through Impala. See Using the
RCFile File Format with Impala Tables on page 552 and Using the SequenceFile File Format with Impala Tables on page
554 for those examples.
For examples showing how this process works for the INVALIDATE METADATA statement, look at the example of
creating and loading an Avro table in Hive, and then querying the data through Impala. See Using the Avro File Format
with Impala Tables on page 547 for that example.
Note:
Originally, Impala did not support UDFs, but this feature is available in Impala starting in Impala 1.2.
Some INSERT ... SELECT transformations that you originally did through Hive can now be done
through Impala. See Impala User-Defined Functions (UDFs) on page 447 for details.
Prior to Impala 1.2, the REFRESH and INVALIDATE METADATA statements needed to be issued on
each Impala node to which you connected and issued queries. In Impala 1.2 and higher, when you
issue either of those statements on any Impala node, the results are broadcast to all the Impala nodes
in the cluster, making it truly a one-step operation after each round of DDL or ETL operations in Hive.
Cross Joins and Cartesian Products with the CROSS JOIN Operator
Originally, Impala restricted join queries so that they had to include at least one equality comparison between the
columns of the tables on each side of the join operator. With the huge tables typically processed by Impala, any
miscoded query that produced a full Cartesian product as a result set could consume a huge amount of cluster resources.
In Impala 1.2.2 and higher, this restriction is lifted when you use the CROSS JOIN operator in the query. You still
cannot remove all WHERE clauses from a query like SELECT * FROM t1 JOIN t2 to produce all combinations of
rows from both tables. But you can use the CROSS JOIN operator to explicitly request such a Cartesian product.
Typically, this operation is applicable for smaller tables, where the result set still fits within the memory of a single
Impala node.
The following example sets up data for use in a series of comic books where characters battle each other. At first, we
use an equijoin query, which only allows characters from the same time period and the same planet to meet.
[localhost:21000] > create table heroes (name string, era string, planet string);
[localhost:21000] > create table villains (name string, era string, planet string);
[localhost:21000] > insert into heroes values
> ('Tesla','20th century','Earth'),
> ('Pythagoras','Antiquity','Earth'),
> ('Zopzar','Far Future','Mars');
Inserted 3 rows in 2.28s
[localhost:21000] > insert into villains values
> ('Caligula','Antiquity','Earth'),
> ('John Dillinger','20th century','Earth'),
> ('Xibulor','Far Future','Venus');
Inserted 3 rows in 1.93s
[localhost:21000] > select concat(heroes.name,' vs. ',villains.name) as battle
> from heroes join villains
> where heroes.era = villains.era and heroes.planet = villains.planet;
+--------------------------+
| battle |
+--------------------------+
| Tesla vs. John Dillinger |
| Pythagoras vs. Caligula |
+--------------------------+
Returned 2 row(s) in 0.47s
Readers demanded more action, so we added elements of time travel and space travel so that any hero could face
any villain. Prior to Impala 1.2.2, this type of query was impossible because all joins had to reference matching values
between the two tables:
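For example, a query like the following, containing no equality predicate between the two tables, would be rejected
by those older releases (the exact error message varies by version):
select concat(heroes.name,' vs. ',villains.name) as battle
from heroes join villains;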
With Impala 1.2.2, we rewrite the query slightly to use CROSS JOIN rather than JOIN, and now the result set includes
all combinations:
[localhost:21000] > -- Cartesian product available in Impala 1.2.2 with the CROSS JOIN
syntax.
> select concat(heroes.name,' vs. ',villains.name) as battle from
heroes cross join villains;
+-------------------------------+
| battle |
+-------------------------------+
| Tesla vs. Caligula |
| Tesla vs. John Dillinger |
| Tesla vs. Xibulor |
| Pythagoras vs. Caligula |
| Pythagoras vs. John Dillinger |
| Pythagoras vs. Xibulor |
| Zopzar vs. Caligula |
| Zopzar vs. John Dillinger |
| Zopzar vs. Xibulor |
+-------------------------------+
Returned 9 row(s) in 0.33s
The full combination of rows from both tables is known as the Cartesian product. This type of result set is often used
for creating grid data structures. You can also filter the result set by including WHERE clauses that do not explicitly
compare columns between the two tables. The following example shows how you might produce a list of combinations
of year and quarter for use in a chart, and then a shorter list with only selected quarters.
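A minimal sketch of that idea, using hypothetical single-column tables named years and quarters:
create table years (year int);
create table quarters (quarter int);
insert into years values (2013),(2014);
insert into quarters values (1),(2),(3),(4);
-- Full grid of all year/quarter combinations.
select year, quarter from years cross join quarters
order by year, quarter;
-- Shorter list, filtered by a WHERE clause that does not
-- compare columns between the two tables.
select year, quarter from years cross join quarters
where quarter in (1,4)
order by year, quarter;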
The data used in this tutorial represents airline on-time arrival statistics, from October 1987 through April 2008. See
the details on the 2009 ASA Data Expo web site. You can also see the explanations of the columns; for purposes of this
exercise, wait until after following the tutorial before examining the schema, to better simulate a real-life situation
where you cannot rely on assumptions and assertions about the ranges and representations of data values.
We will download Parquet files containing this data from the Ibis blog. First, we download and unpack the data files.
There are 8 files totaling 1.4 GB. Each file is less than 256 MB.
$ wget -O airlines_parquet.tar.gz
https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0
...
Length: 1245204740 (1.2G) [application/octet-stream]
Saving to: “airlines_parquet.tar.gz”
Next, we put the Parquet data files in HDFS, all together in a single directory, with permissions on the directory and
the files so that the impala user will be able to read them.
Note: After unpacking, we saw the largest Parquet file was 253 MB. When copying Parquet files into
HDFS for Impala to use, for maximum query performance, make sure that each file resides in a single
HDFS data block. Therefore, we pick a size larger than any single file and specify that as the block size,
using the argument -Ddfs.block.size=256m on the hdfs dfs -put command.
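The commands look something like the following; the staging path is illustrative and matches the LOCATION used for
the table below:
$ hdfs dfs -mkdir -p /user/impala/staging/airlines
$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines
$ hdfs dfs -ls /user/impala/staging/airlines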
With the files in an accessible location in HDFS, we create a database table that uses the data in those files. The CREATE
EXTERNAL syntax and the LOCATION attribute point Impala at the appropriate HDFS directory. The LIKE PARQUET
'path_to_any_parquet_file' clause means we skip the list of column names and types; Impala automatically
gets the column names and data types straight from the data files. (Currently, this technique only works for Parquet
files.) We ignore the warning about lack of READ_WRITE access to the files in HDFS; the impala user can read the
files, which will be sufficient for us to experiment with queries and perform some copy and transform operations into
other tables.
$ impala-shell -i localhost
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
Server version: impalad version 2.2.0-cdh5 RELEASE (build
2ffd73a4255cefd521362ffe1cfb37463f67f75c)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
(Shell build version: Impala Shell v2.1.2-cdh5 (92438b7) built on Tue Feb 24 12:36:33
PST 2015)
[localhost:21000] > create database airline_data;
[localhost:21000] > use airline_data;
[localhost:21000] > create external table airlines_external
> like parquet
'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq'
> stored as parquet location
'hdfs://demo_host.example.com:8020/user/impala/staging/airlines';
With the table created, we examine its physical and logical characteristics to confirm that the data is really there and
in a format and shape that we can work with. The SHOW TABLE STATS statement gives a very high-level summary of
the table, showing how many files and how much total data it contains. Also, it confirms that the table is expecting all
the associated data files to be in Parquet format. (The ability to work with all kinds of HDFS data files in different
formats means that it is possible to have a mismatch between the format of the data files, and the format that the
table expects the data files to be in.) The SHOW FILES statement confirms that the data in the table has the expected
number, names, and sizes of the original Parquet files. The DESCRIBE statement (or its abbreviation DESC) confirms
the names and types of the columns that Impala automatically created after reading that metadata from the Parquet
file. The DESCRIBE FORMATTED statement prints out some extra detail along with the column definitions; the pieces
we care about for this exercise are the containing database for the table, the location of the associated data files in
HDFS, the fact that it's an external table so Impala will not delete the HDFS files when we finish the experiments and
drop the table, and the fact that the table is set up to work exclusively with files in the Parquet format.
| 239.95MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
| 252.90MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
| 15.89MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
| 176.92MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
| 212.47MB | |
+----------------------------------------------------------------------------------------+----------+-----------+
[localhost:21000] > describe airlines_external;
+---------------------+--------+---------------------------------------------------+
| name | type | comment |
+---------------------+--------+---------------------------------------------------+
| year | int | inferred from: optional int32 year |
| month | int | inferred from: optional int32 month |
| day | int | inferred from: optional int32 day |
| dayofweek | int | inferred from: optional int32 dayofweek |
| dep_time | int | inferred from: optional int32 dep_time |
| crs_dep_time | int | inferred from: optional int32 crs_dep_time |
| arr_time | int | inferred from: optional int32 arr_time |
| crs_arr_time | int | inferred from: optional int32 crs_arr_time |
| carrier | string | inferred from: optional binary carrier |
| flight_num | int | inferred from: optional int32 flight_num |
| tail_num | int | inferred from: optional int32 tail_num |
| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time |
| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time |
| airtime | int | inferred from: optional int32 airtime |
| arrdelay | int | inferred from: optional int32 arrdelay |
| depdelay | int | inferred from: optional int32 depdelay |
| origin | string | inferred from: optional binary origin |
| dest | string | inferred from: optional binary dest |
| distance | int | inferred from: optional int32 distance |
| taxi_in | int | inferred from: optional int32 taxi_in |
| taxi_out | int | inferred from: optional int32 taxi_out |
| cancelled | int | inferred from: optional int32 cancelled |
| cancellation_code | string | inferred from: optional binary cancellation_code |
| diverted | int | inferred from: optional int32 diverted |
| carrier_delay | int | inferred from: optional int32 carrier_delay |
| weather_delay | int | inferred from: optional int32 weather_delay |
| nas_delay | int | inferred from: optional int32 nas_delay |
| security_delay | int | inferred from: optional int32 security_delay |
| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay |
+---------------------+--------+---------------------------------------------------+
[localhost:21000] > desc formatted airlines_external;
+------------------------------+-------------------------------
| name | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
| Database: | airline_data
| Owner: | jrussell
...
| Location: | /user/impala/staging/airlines
| Table Type: | EXTERNAL_TABLE
...
| # Storage Information | NULL
| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe
| InputFormat: | parquet.hive.DeprecatedParquetInputFormat
| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat
...
Now that we are confident that the connections are solid between the Impala table and the underlying Parquet files,
we run some initial queries to understand the characteristics of the data: the overall number of rows, and the ranges
and how many different values are in certain columns. For convenience in understanding the magnitude of the COUNT(*)
result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows
in the table.
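For example:
select count(*) from airlines_external;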
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 1.32s
[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external;
+------------------+
| millions of rows |
+------------------+
| 123.534969 |
+------------------+
Fetched 1 row(s) in 1.24s
The NDV() function stands for “number of distinct values”, which for performance reasons is an estimate when there
are lots of different values in the column, but is precise when the cardinality is less than 16 K. Use NDV() calls for this
kind of exploration rather than COUNT(DISTINCT colname), because Impala can evaluate multiple NDV() functions
in a single query, but only a single instance of COUNT DISTINCT. Here we see that there are modest numbers of
different airlines, flight numbers, and origin and destination airports. Two things jump out from this query: the number
of tail_num values is much smaller than we might have expected, and there are more destination airports than origin
airports. Let's dig further. What we find is that most tail_num values are NULL. It looks like this was an experimental
column that wasn't filled in accurately. We make a mental note that if we use this data as a starting point, we'll ignore
this column. We also find that certain airports are represented in the ORIGIN column but not the DEST column; now
we know that we cannot rely on the assumption that those sets of airport codes are identical.
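A query along these lines drives that kind of exploration (illustrative):
select ndv(carrier), ndv(flight_num), ndv(tail_num),
ndv(origin), ndv(dest) from airlines_external;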
Note: A slight digression for some performance tuning. Notice how the first SELECT DISTINCT
DEST query takes almost 40 seconds. We expect all queries on such a small data set, less than 2 GB,
to take a few seconds at most. The reason is that the expression NOT IN (SELECT origin
FROM airlines_external) produces an intermediate result set of 123 million rows, then runs 123
million comparisons on each data node against the tiny set of destination airports. The way the NOT
IN operator works internally means that this intermediate result set with 123 million rows might be
transmitted across the network to each data node in the cluster. Applying another DISTINCT inside
the NOT IN subquery means that the intermediate result set is only 340 items, resulting in much less
network traffic and fewer comparison operations. The more efficient query with the added DISTINCT
is approximately 7 times as fast.
Next, we try doing a simple calculation, with results broken down by year. This reveals that some years have no data
in the AIRTIME column. That means we might be able to use that column in queries involving certain date ranges, but
we cannot count on it to always be reliable. The question of whether a column contains any NULL values, and if so
what is their number, proportion, and distribution, comes up again and again when doing initial exploration of a data
set.
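For example, a query like:
select year, sum(airtime) from airlines_external
group by year order by year desc;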
With the notion of NULL values in mind, let's come back to the TAILNUM column that we discovered had a lot of NULLs.
Let's quantify the NULL and non-NULL values in that column for better understanding. First, we just count the overall
number of rows versus the non-NULL values in that column. That initial result gives the appearance of relatively few
non-NULL values, but we can break it down more clearly in a single query. Once we have the COUNT(*) and the
COUNT(colname) numbers, we can encode that initial query in a WITH clause, then run a follow-on query that performs
multiple arithmetic operations on those values. Seeing that only one-third of one percent of all rows have non-NULL
values for the TAILNUM column clearly illustrates that that column won't be of much use.
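A sketch of that technique:
with t1 as
(select count(*) as total, count(tail_num) as nonnull_vals
from airlines_external)
select total, nonnull_vals, nonnull_vals / total * 100 as pct_nonnull
from t1;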
By examining other columns using these techniques, we can form a mental picture of the way data is distributed
throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly
on the fields likely to hold discrete values, rather than columns such as ACTUAL_ELAPSED_TIME whose names suggest
they hold measurements. We would dig deeper into those columns once we had a clear picture of which questions
were worthwhile to ask, and what kinds of trends we might look for. For the final piece of initial exploration, let's look
at the YEAR column. A simple GROUP BY query shows that it has a well-defined range, a manageable number of distinct
values, and relatively even distribution of rows across the different years.
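For example:
select year, count(*) as num_rows from airlines_external
group by year order by year desc;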
We could go quite far with the data in this initial raw format, just as we downloaded it from the web. If the data set
proved to be useful and worth persisting in Impala for extensive queries, we might want to copy it to an internal table,
letting Impala manage the data files and perhaps reorganizing a little for higher efficiency. In this next stage of the
tutorial, we copy the original data into a partitioned table, still in Parquet format. Partitioning based on the YEAR
column lets us run queries with clauses such as WHERE year = 2001 or WHERE year BETWEEN 1989 AND 1999,
which can dramatically cut down on I/O by ignoring all the data from years outside the desired range. Rather than
reading all the data and then deciding which rows are in the matching years, Impala can zero in on only the data files
from specific YEAR partitions. To do this, Impala physically reorganizes the data files, putting the rows from each year
into data files in a separate HDFS directory for each YEAR value. Along the way, we'll also get rid of the TAIL_NUM
column that proved to be almost entirely NULL.
The first step is to create a new table with a layout very similar to the original AIRLINES_EXTERNAL table. We'll do
that by reverse-engineering a CREATE TABLE statement for the first table, then tweaking it slightly to include a
PARTITIONED BY clause for YEAR, and excluding the TAIL_NUM column. The SHOW CREATE TABLE statement gives us
the starting point.
Although we could edit that output into a new SQL statement, all the ASCII box characters make such editing
inconvenient. To get a more stripped-down CREATE TABLE to start with, we restart the impala-shell command
with the -B option, which turns off the box-drawing behavior.
(Shell build version: Impala Shell v2.1.2-cdh5 (92438b7) built on Tue Feb 24 12:36:33
PST 2015)
[localhost:21000] > show create table airlines_external;
"CREATE EXTERNAL TABLE airline_data.airlines_external (
year INT COMMENT 'inferred from: optional int32 year',
month INT COMMENT 'inferred from: optional int32 month',
day INT COMMENT 'inferred from: optional int32 day',
After copying and pasting the CREATE TABLE statement into a text editor for fine-tuning, we quit and restart
impala-shell without the -B option, to switch back to regular output.
Next we run the CREATE TABLE statement that we adapted from the SHOW CREATE TABLE output. We kept the
STORED AS PARQUET clause because we want to rearrange the data somewhat but still keep it in the high-performance
Parquet format. The LOCATION and TBLPROPERTIES clauses are not relevant for this new table, so we edit those out.
Because we are going to partition the new table based on the YEAR column, we move that column name (and its type)
into a new PARTITIONED BY clause.
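The adapted statement looks like the following sketch: TAIL_NUM is dropped, and YEAR moves from the column list into
the PARTITIONED BY clause:
create table airlines
(
month int,
day int,
dayofweek int,
dep_time int,
crs_dep_time int,
arr_time int,
crs_arr_time int,
carrier string,
flight_num int,
actual_elapsed_time int,
crs_elapsed_time int,
airtime int,
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxi_in int,
taxi_out int,
cancelled int,
cancellation_code string,
diverted int,
carrier_delay int,
weather_delay int,
nas_delay int,
security_delay int,
late_aircraft_delay int
)
partitioned by (year int)
stored as parquet;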
Next, we copy all the rows from the original table into this new one with an INSERT statement. (We edited the CREATE
TABLE statement to make an INSERT statement with the column names in the same order.) The only change is to add
a PARTITION(year) clause, and move the YEAR column to the very end of the SELECT list of the INSERT statement.
Specifying PARTITION(year), rather than a fixed value such as PARTITION(year=2000), means that Impala figures
out the partition value for each row based on the value of the very last column in the SELECT list. This is the first SQL
statement that legitimately takes any substantial time, because the rows from different years are shuffled around the
cluster; the rows that go into each partition are collected on one node, before being written to one or more new data
files.
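The INSERT looks like this, with YEAR moved to the end of the SELECT list to match the PARTITION(year) clause:
insert into airlines
partition (year)
select
month, day, dayofweek, dep_time, crs_dep_time, arr_time, crs_arr_time,
carrier, flight_num, actual_elapsed_time, crs_elapsed_time, airtime,
arrdelay, depdelay, origin, dest, distance, taxi_in, taxi_out,
cancelled, cancellation_code, diverted, carrier_delay, weather_delay,
nas_delay, security_delay, late_aircraft_delay,
year
from airlines_external;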
Once partitioning or join queries come into play, it's important to have statistics that Impala can use to optimize queries
on the corresponding tables. The COMPUTE INCREMENTAL STATS statement is the way to collect statistics for
partitioned tables. Then the SHOW TABLE STATS statement confirms that the statistics are in place for each partition,
and also illustrates how many files and how much raw data is in each partition.
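For example:
compute incremental stats airlines;
show table stats airlines;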
At this point, we go through a quick thought process to sanity check the partitioning we did. All the partitions have
exactly one file, which is on the low side. A query that includes a clause WHERE year=2004 will only read a single data
block; that data block will be read and processed by a single data node; therefore, for a query targeting a single year,
all the other nodes in the cluster will sit idle while all the work happens on a single machine. It's even possible,
by chance (depending on the HDFS replication factor and the way data blocks are distributed across the cluster), that
multiple year partitions selected by a filter such as WHERE year BETWEEN 1999 AND 2001 could all be read and processed
by the same data node. The more data files each partition has, the more parallelism you can get and the lower the
probability of “hotspots” occurring on particular nodes; a larger CDH cluster therefore delivers a bigger performance
boost.
However, the more data files, the less data goes in each one. The overhead of dividing the work in a parallel query
might not be worth it if each node is only reading a few megabytes. 50 or 100 megabytes is a decent size for a Parquet
data block; 9 or 37 megabytes is on the small side. Which is to say, the data distribution we ended up with based on
this partitioning scheme is on the borderline between sensible (reasonably large files) and suboptimal (few files in
each partition). The way to see how well it works in practice is to run the same queries against the original flat table
and the new partitioned table, and compare times.
Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my particular
exploratory queries, queries against the partitioned table do consistently run faster than the same queries against the
unpartitioned table. But I could not be sure that would be the case without some real measurements. Here are some
queries I ran to draw that conclusion, first against AIRLINES_EXTERNAL (no partitioning), then against AIRLINES
(partitioned by year). The AIRLINES queries are consistently faster. Changing the volume of data, changing the size
of the cluster, running queries that did or didn't refer to the partition key columns, or other factors could change the
results to favor one table layout or the other.
Note: If you find the volume of each partition is only in the low tens of megabytes, consider lowering
the granularity of partitioning. For example, instead of partitioning by year, month, and day, partition
by year and month or even just by year. The ideal layout to distribute work efficiently in a parallel
query is many tens or even hundreds of megabytes per Parquet file, and the number of Parquet files
in each partition somewhat higher than the number of data nodes.
| 8662859484 |
+--------------+
Fetched 1 row(s) in 1.21s
Now we can finally do some serious analysis with this data set. Remember, a few minutes ago all we had were some raw
data files, and we did not even know what columns they contained. Let's see whether the “air time” of a flight
tends to be different depending on the day of the week. We can see that the average is a little higher on day number
6; perhaps Saturday is a busy flying day and planes have to circle for longer at the destination airport before landing.
To see if the apparent trend holds up over time, let's do the same breakdown by day of week, but also split up by year.
Now we can see that day number 6 consistently has a higher average air time in each year. We can also see that the
average air time increased over time across the board. And the presence of NULL for this column in years 1987 to 1994
shows that queries involving this column need to be restricted to a date range of 1995 and higher.
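A query of roughly this shape produces the breakdown that follows (illustrative):
select year, dayofweek, avg(airtime) from airlines
group by year, dayofweek
order by year desc, dayofweek;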
| 2008 | 6 | 107.4006306562128 |
| 2008 | 7 | 104.8648851041755 |
| 2007 | 1 | 102.2196114337825 |
| 2007 | 2 | 101.9317791906348 |
| 2007 | 3 | 102.0964767689043 |
| 2007 | 4 | 102.6215927201686 |
| 2007 | 5 | 102.4289399000661 |
| 2007 | 6 | 105.1477448215756 |
| 2007 | 7 | 103.6305945644095 |
...
| 1996 | 1 | 99.33860750862108 |
| 1996 | 2 | 99.54225446396656 |
| 1996 | 3 | 99.41129336113134 |
| 1996 | 4 | 99.5110373340348 |
| 1996 | 5 | 99.22120745027595 |
| 1996 | 6 | 101.1717447111921 |
| 1996 | 7 | 99.95410136133704 |
| 1995 | 1 | 96.93779698300494 |
| 1995 | 2 | 96.93458674589712 |
| 1995 | 3 | 97.00972311337051 |
| 1995 | 4 | 96.90843832024412 |
| 1995 | 5 | 96.78382115425562 |
| 1995 | 6 | 98.70872826057003 |
| 1995 | 7 | 97.85570478374616 |
| 1994 | 1 | NULL |
| 1994 | 2 | NULL |
| 1994 | 3 | NULL |
...
| 1987 | 5 | NULL |
| 1987 | 6 | NULL |
| 1987 | 7 | NULL |
+------+-----------+-------------------+
Impala Administration
As an administrator, you monitor Impala's use of resources and take action when necessary to keep Impala running
smoothly and avoid conflicts with other Hadoop components running on the same cluster. When you detect that an
issue has happened or could happen in the future, you reconfigure Impala or other components such as HDFS or even
the hardware of the cluster itself to resolve or avoid problems.
Related tasks:
As an administrator, you can expect to perform installation, upgrade, and configuration tasks for Impala on all machines
in a cluster. See Installing Impala on page 27, Upgrading Impala on page 38, and Managing Impala on page 30 for
details.
For security tasks typically performed by administrators, see Impala Security on page 91.
Administrators also decide how to allocate cluster resources so that all Hadoop components can run smoothly together.
For Impala, this task primarily involves:
• Deciding how many Impala queries can run concurrently and with how much memory, through the admission
control feature. See Admission Control and Query Queuing on page 75 for details.
• Dividing cluster resources such as memory between Impala and other components, using YARN for overall resource
management, and Llama to mediate resource requests from Impala to YARN. See Integrated Resource Management
with YARN on page 83 for details.
Note: Though Impala can be used together with YARN via simple configuration of Static Service Pools
in Cloudera Manager, the use of the general-purpose component Llama for integrated resource
management within YARN is no longer supported with CDH 5.5 / Impala 2.3 and higher.
being cancelled or running slowly and causing contention. As other queries finish, the queued queries are allowed to
proceed.
For details on the internal workings of admission control, see How Impala Schedules and Enforces Limits on Concurrent
Queries on page 76.
How Admission Control works with Impala Clients (JDBC, ODBC, HiveServer2)
Most aspects of admission control work transparently with client interfaces such as JDBC and ODBC:
• If a SQL statement is put into a queue rather than running immediately, the API call blocks until the statement is
dequeued and begins execution. At that point, the client program can request to fetch results, which might also
block until results become available.
• If a SQL statement is cancelled because it has been queued for too long or because it exceeded the memory limit
during execution, the error is returned to the client program with a descriptive error message.
If you want to submit queries to different resource pools through the REQUEST_POOL query option, as described in
REQUEST_POOL Query Option on page 320, note that in Impala 2.0 and higher you can change that query option through a
SQL SET statement submitted from the client application in the same session. Prior to Impala 2.0, that option could
only be set for a session through the impala-shell SET command, or cluster-wide through an impalad startup option.
Admission control has the following limitations or special behavior when used with JDBC or ODBC applications:
• The MEM_LIMIT query option, sometimes useful to work around problems caused by inaccurate memory estimates
for complicated queries, is only settable through the impala-shell interpreter and cannot be used directly
through JDBC or ODBC applications.
• Admission control does not use the other resource-related query options, RESERVATION_REQUEST_TIMEOUT or
V_CPU_CORES. Those query options only apply to the YARN resource management framework.
Type: int64
Default: 200
default_pool_max_requests
Purpose: Maximum number of concurrent outstanding requests allowed to run before incoming requests are
queued. Because this limit applies cluster-wide, but each Impala node makes independent decisions to run queries
immediately or queue them, it is a soft limit; the overall number of concurrent queries might be slightly higher
during times of heavy load. A negative value indicates no limit. Ignored if fair_scheduler_config_path and
llama_site_path are set.
Type: int64
Default: 200
default_pool_mem_limit
Purpose: Maximum amount of memory (across the entire cluster) that all outstanding requests in this pool can use
before new requests to this pool are queued. Specified in bytes, megabytes, or gigabytes by a number followed by
the suffix b (optional), m, or g, either uppercase or lowercase. You can specify floating-point values for megabytes
and gigabytes, to represent fractional numbers such as 1.5. You can also specify it as a percentage of the physical
memory by specifying the suffix %. 0 or no setting indicates no limit. Defaults to bytes if no unit is given. Because
this limit applies cluster-wide, but each Impala node makes independent decisions to run queries immediately or
queue them, it is a soft limit; the overall memory used by concurrent queries might be slightly higher during times
of heavy load. Ignored if fair_scheduler_config_path and llama_site_path are set.
Note: Impala relies on the statistics produced by the COMPUTE STATS statement to estimate
memory usage for each query. See COMPUTE STATS Statement on page 226 for guidelines about
how and when to use this statement.
Type: string
Default: "" (empty string, meaning unlimited)
disable_admission_control
Purpose: Turns off the admission control feature entirely, regardless of other configuration option settings.
Type: Boolean
Default: true
disable_pool_max_requests
Purpose: Disables all per-pool limits on the maximum number of running requests.
Type: Boolean
Default: false
disable_pool_mem_limits
Purpose: Disables all per-pool mem limits.
Type: Boolean
Default: false
fair_scheduler_allocation_path
Purpose: Path to the fair scheduler allocation file (fair-scheduler.xml).
Type: string
Default: "" (empty string)
Usage notes: Admission control only uses a small subset of the settings that can go in this file, as described below.
For details about all the Fair Scheduler configuration settings, see the Apache wiki.
llama_site_path
Purpose: Path to the Llama configuration file (llama-site.xml). If set, fair_scheduler_allocation_path
must also be set.
Type: string
Default: "" (empty string)
Usage notes: Admission control only uses a small subset of the settings that can go in this file, as described below.
For details about all the Llama configuration settings, see the documentation on Github.
queue_wait_timeout_ms
Purpose: Maximum amount of time (in milliseconds) that a request waits to be admitted before timing out.
Type: int64
Default: 60000
llama.am.throttling.maximum.placed.reservations.queue_name
llama.am.throttling.maximum.queued.reservations.queue_name
For details about all the Llama configuration settings, see Llama Default Configuration.
See Example Admission Control Configurations Using Configuration Files on page 81 for sample configuration files for
admission control using multiple resource pools, without Cloudera Manager.
Figure 1: Sample Settings for Cloudera Manager Dynamic Resource Pools Page
The following figure shows a sample of the Placement Rules page in Cloudera Manager, accessed through the Clusters >
Cluster name > Resource Management > Dynamic Resource Pools menu choice and then the Configuration > Placement
Rules tabs. The settings demonstrate a reasonable configuration of a pool named default to service all requests
where the specified resource pool does not exist, is not explicitly set, or the user or group is not authorized for the
specified pool.
<allocations>
<queue name="root">
<aclSubmitApps> </aclSubmitApps>
<queue name="default">
<maxResources>50000 mb, 0 vcores</maxResources>
<aclSubmitApps>*</aclSubmitApps>
</queue>
<queue name="development">
<maxResources>200000 mb, 0 vcores</maxResources>
<aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
</queue>
<queue name="production">
<maxResources>1000000 mb, 0 vcores</maxResources>
<aclSubmitApps> ops,admin</aclSubmitApps>
</queue>
</queue>
<queuePlacementPolicy>
<rule name="specified" create="false"/>
<rule name="default" />
</queuePlacementPolicy>
</allocations>
The statements affected by the admission control feature are primarily queries, but also include statements that write
data such as INSERT and CREATE TABLE AS SELECT. Most write operations in Impala are not resource-intensive,
but inserting into a Parquet table can require substantial memory due to buffering 1 GB of data before writing out
each Parquet data block. See Loading Data into Parquet Tables on page 536 for instructions about inserting data
efficiently into Parquet tables.
Although admission control does not scrutinize memory usage for other kinds of DDL statements, if a query is queued
due to a limit on concurrent queries or memory usage, subsequent statements in the same session are also queued
so that they are processed in the correct order.
If you set up different resource pools for different users and groups, consider reusing any classifications and hierarchy
you developed for use with Sentry security. See Enabling Sentry Authorization for Impala on page 95 for details.
For details about all the Fair Scheduler configuration settings, see Fair Scheduler Configuration, in particular the tags
such as <queue> and <aclSubmitApps> to map users and groups to particular resource pools (queues).
You can limit the CPU and memory resources used by Impala, to manage and prioritize workloads on clusters that run
jobs from many Hadoop components.
Requests from Impala to YARN go through an intermediary service called Llama. When the resource requests are
granted, Impala starts the query and places all relevant execution threads into the cgroup containers and sets up the
memory limit on each host. If sufficient resources are not available, the Impala query waits until other jobs complete
and the resources are freed. During query processing, as the need for additional resources arises, Llama can “expand”
already-requested resources, to avoid over-allocating at the start of the query.
After a query is finished, Llama caches the resources (for example, leaving memory allocated) in case they are needed
for subsequent Impala queries. This caching mechanism avoids the latency involved in making a whole new set of
resource requests for each query. If the resources are needed by YARN for other types of jobs, Llama returns them.
While the delays to wait for resources might make individual queries seem less responsive on a heavily loaded cluster,
the resource management feature makes the overall performance of the cluster smoother and more predictable,
without sudden spikes in utilization due to memory paging, CPUs pegged at 100%, and so on.
See Scalability Considerations for the Impala Statestore on page 516 for more details about statestore operation and
settings on clusters with a large number of Impala-related objects such as tables and partitions.
Setting the Idle Query and Idle Session Timeouts for impalad
To keep long-running queries or idle sessions from tying up cluster resources, you can set timeout intervals for both
individual queries, and entire sessions.
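For example, you might add startup flags like the following (both values are in seconds):
-idle_query_timeout=60 -idle_session_timeout=1800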
Note:
The timeout clock for queries and sessions only starts ticking when the query or session is idle. For
queries, this means the query has results ready but is waiting for a client to fetch the data. A query
can run for an arbitrary time without triggering a timeout, because the query is computing results
rather than sitting idle waiting for the results to be fetched. The timeout period is intended to prevent
unclosed queries from consuming resources and taking up slots in the admission count of running
queries, potentially preventing other queries from starting.
For sessions, this means that no query has been submitted for some period of time.
For instructions on changing impalad startup options, see Modifying Impala Startup Options on page 43.
Note:
To avoid excessive polling, Impala checks periodically for idle sessions and queries to cancel. The actual
idle time before cancellation might be up to 50% greater than the specified configuration setting. For
example, if the timeout setting was 60, the session or query might be cancelled after being idle between
60 and 90 seconds.
Cancelling a Query
Sometimes, an Impala query might run for an unexpectedly long time, tying up resources in the cluster. You can cancel
the query explicitly, independent of the timeout period, by going into the web UI for the impalad host (on port 25000
by default) and using the link on the /queries tab to cancel the running query. Various client applications also let you
interactively cancel queries submitted or monitored through them: for example, press ^C in impala-shell, click the
Cancel button on the Watch page in Hue, or click Actions > Cancel from the Queries list in Cloudera Manager.
Note:
Most considerations for load balancing and high availability apply to the impalad daemon. The
statestored and catalogd daemons do not have special requirements for high availability, because
problems with those daemons do not result in data loss. If those daemons become unavailable due
to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and
Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.
3. Specify the host and port settings for each Impala node. These are the hosts that the load balancer will choose
from when relaying each Impala query. See Ports Used by Impala on page 583 for when to use port 21000, 21050,
or another value depending on what type of connections you are load balancing.
Note:
In particular, if you are using Hue or JDBC-based applications, you typically set up load balancing
for both ports 21000 and 21050, because these client applications connect through port 21050
while the impala-shell command connects through port 21000.
4. Run the load-balancing proxy server, pointing it at the configuration file that you set up.
5. On systems managed by Cloudera Manager, on the page Impala > Configuration > Impala Daemon Default Group,
specify a value for the Impala Daemons Load Balancer field. Specify the address of the load balancer in host:port
format. This setting lets Cloudera Manager route all appropriate Impala-related operations through the proxy
server.
6. For any scripts, jobs, or configuration settings for applications that formerly connected to a specific datanode to
run Impala SQL statements, change the connection information (such as the -i option in impala-shell) to point
to the load balancer instead.
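For example, an invocation that formerly named one specific Impala daemon might change like this (the host names and the proxy port are placeholders):

# Before: connect directly to a single impalad host.
impala-shell -i impala-host-1.example.com:21000

# After: connect through the load-balancer listener instead.
impala-shell -i proxy-host.example.com:25003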
Note: The following sections use the HAProxy software as a representative example of a load balancer
that you can use with Impala. For information specifically about using Impala with the F5 BIG-IP load
balancer, see Impala HA with F5 BIG-IP.
$ ktutil
ktutil: read_kt proxy.keytab
ktutil: read_kt impala.keytab
ktutil: write_kt proxy_impala.keytab
ktutil: quit
Note: On systems managed by Cloudera Manager 5.1.0 and later, the keytab merging happens
automatically. To verify that Cloudera Manager has merged the keytabs, run the command:
klist -k keytabfile
which lists the credentials for both principal and be_principal on all nodes.
6. Make sure that the impala user has permission to read this merged keytab file.
7. Change some configuration settings for each host in the cluster that participates in the load balancing. Follow the
appropriate steps depending on whether you use Cloudera Manager or not:
• In the impalad option definition, or the Cloudera Manager safety valve (Cloudera Manager 4) or advanced
configuration snippet (Cloudera Manager 5), add:
--principal=impala/proxy_host@realm
--be_principal=impala/actual_host@realm
--keytab_file=path_to_merged_keytab
Note:
Every host has a different --be_principal because the actual hostname is different on
each host.
Specify the fully qualified domain name (FQDN) for the proxy host, not the IP address. Use
the exact FQDN as returned by a reverse DNS lookup for the associated IP address.
• On a cluster managed by Cloudera Manager, create a role group to set the configuration values from the
preceding step on a per-host basis.
• On a cluster not managed by Cloudera Manager, see Modifying Impala Startup Options on page 43 for the
procedure to modify the startup options.
8. Restart Impala to make the changes take effect. Follow the appropriate steps depending on whether you use
Cloudera Manager or not:
• On a cluster managed by Cloudera Manager, restart the Impala service.
• On a cluster not managed by Cloudera Manager, restart the impalad daemons on all hosts in the cluster, as
well as the statestored and catalogd daemons.
/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg
• In impala-shell, JDBC applications, or ODBC applications, connect to the listener port of the proxy host, rather
than port 21000 or 21050 on a host actually running impalad. The sample configuration file sets haproxy to listen
on port 25003, therefore you would send all requests to haproxy_host:25003.
global
# To have these messages end up in /var/log/haproxy.log you will
# need to:
#
# 1) configure syslog to accept network log events. This is done
# by adding the '-r' option to the SYSLOGD_OPTIONS in
# /etc/sysconfig/syslog
#
# 2) configure local2 events to go to the /var/log/haproxy.log
# file. A line like the following can be added to
# /etc/sysconfig/syslog
#
# local2.* /var/log/haproxy.log
#
log 127.0.0.1 local0
log 127.0.0.1 local1 notice
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 4000
user haproxy
group haproxy
daemon
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
maxconn 3000
contimeout 5000
clitimeout 50000
srvtimeout 50000
#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
balance
mode http
stats enable
stats auth username:password
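The excerpt above omits the listener for Impala itself; the following is a hedged sketch of that section, consistent with the port 25003 mentioned earlier (the host names are placeholders):

#
# This is the setup for Impala. Impala clients connect to load_balancer_host:25003.
# haproxy balances connections among the list of servers listed below.
#
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn
    server symbolic_name_1 impala-host-1.example.com:21000
    server symbolic_name_2 impala-host-2.example.com:21000
    server symbolic_name_3 impala-host-3.example.com:21000
    server symbolic_name_4 impala-host-4.example.com:21000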
Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy,
be cautious about reusing the connections. If the load balancer has set up connection timeout values,
either check the connection frequently so that it never sits idle longer than the load balancer timeout
value, or check the connection validity before using it and create a new one if the connection has
been closed.
See How Impala Works with Hadoop File Formats on page 527 for details about file formats, especially Using the Parquet
File Format with Impala Tables on page 535. See CREATE TABLE Statement on page 235 and ALTER TABLE Statement
on page 215 for syntax details.
• You manage underlying data files differently depending on whether the corresponding Impala table is defined as
an internal or external table:
– Use the DESCRIBE FORMATTED statement to check if a particular table is internal (managed by Impala) or
external, and to see the physical location of the data files in HDFS. See DESCRIBE Statement on page 245 for
details.
– For Impala-managed (“internal”) tables, use DROP TABLE statements to remove data files. See DROP TABLE
Statement on page 258 for details.
– For tables not managed by Impala (“external” tables), use appropriate HDFS-related commands such as
hadoop fs, hdfs dfs, or distcp, to create, move, copy, or delete files within HDFS directories that are
accessible by the impala user. Issue a REFRESH table_name statement after adding or removing any files
from the data directory of an external table (a sketch of this workflow follows this list). See REFRESH
Statement on page 277 for details.
– Use external tables to reference HDFS data files in their original location. With this technique, you avoid
copying the files, and you can map more than one Impala table to the same set of data files. When you drop
the Impala table, the data files are left undisturbed. See External Tables on page 209 for details.
– Use the LOAD DATA statement to move HDFS files into the data directory for an Impala table from inside
Impala, without the need to specify the HDFS path of the destination directory. This technique works for both
internal and external tables. See LOAD DATA Statement on page 274 for details.
• Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space might not
be reclaimed for use by other files until sometime later, when the trashcan is emptied. See DROP TABLE Statement
on page 258 and the FAQ entry Why is space not freed up when I issue DROP TABLE? on page 601 for details. See
User Account Requirements on page 24 for permissions needed for the HDFS trashcan to operate correctly.
• Drop all tables in a database before dropping the database itself. See DROP DATABASE Statement on page 252 for
details.
• Clean up temporary files after failed INSERT statements. If an INSERT statement encounters an error, and you
see a directory named .impala_insert_staging or _impala_insert_staging left behind in the data
directory for the table, it might contain temporary data files taking up space in HDFS. You might be able to salvage
these data files, for example if they are complete but could not be moved into place due to a permission error.
Or, you might delete those files through commands such as hadoop fs or hdfs dfs, to reclaim space before
re-trying the INSERT. Issue DESCRIBE FORMATTED table_name to see the HDFS path where you can check for
temporary files.
• By default, intermediate files used during large sort, join, aggregation, or analytic function operations are stored
in the directory /tmp/impala-scratch. These files are removed when the operation finishes. (Multiple concurrent
queries can perform operations that use the “spill to disk” technique, without any name conflicts for these
temporary files.) You can specify a different location by starting the impalad daemon with the
--scratch_dirs="path_to_directory" configuration option or the equivalent configuration option in the
Cloudera Manager user interface. You can specify a single directory, or a comma-separated list of directories. The
scratch directories must be on the local filesystem, not in HDFS. You might specify different directory paths for
different hosts, depending on the capacity and speed of the available storage devices. In CDH 5.5 / Impala 2.3 or
higher, Impala successfully starts (with a warning written to the log) if it cannot create or read and write files in
one of the scratch directories. If there is less than 1 GB free on the filesystem where that directory resides, Impala
still runs, but writes a warning message to its log. If Impala encounters an error reading or writing files in a scratch
directory during a query, Impala logs the error and the query fails.
• If you use the Amazon Simple Storage Service (S3) as a place to offload data to reduce the volume of local storage,
Impala 2.2.0 and higher can query the data directly from S3. See Using Impala to Query the Amazon S3 Filesystem
on page 566 for details.
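As referenced in the external tables item above, a minimal sketch of that file-management workflow (the paths and table name are placeholders):

$ hdfs dfs -put new_data.csv /user/impala/external/sales_data/

-- Then, in impala-shell, make the new file visible to Impala:
REFRESH sales_external;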
Impala Security
Impala includes a fine-grained authorization framework for Hadoop, based on the Sentry open source project. Sentry
authorization was added in Impala 1.1.0. Together with the Kerberos authentication framework, Sentry takes Hadoop
security to a new level needed for the requirements of highly regulated industries such as healthcare, financial services,
and government. Impala also includes an auditing capability; Impala generates the audit data, the Cloudera Navigator
product consolidates the audit data from all nodes in the cluster, and Cloudera Manager lets you filter, visualize, and
produce reports. The auditing feature was added in Impala 1.1.1.
The Impala security features have several objectives. At the most basic level, security prevents accidents or mistakes
that could disrupt application processing, delete or corrupt data, or reveal data to unauthorized users. More advanced
security features and practices can harden the system against malicious users trying to gain unauthorized access or
perform other disallowed operations. The auditing feature provides a way to confirm that no unauthorized access
occurred, and detect whether any such attempts were made. This is a critical set of features for production deployments
in large organizations that handle important or sensitive data. It sets the stage for multi-tenancy, where multiple
applications run concurrently and are prevented from interfering with each other.
The material in this section presumes that you are already familiar with administering secure Linux systems. That is,
you should know the general security practices for Linux and Hadoop, and their associated commands and configuration
files. For example, you should know how to create Linux users and groups, manage Linux group membership, set Linux
and HDFS file permissions and ownership, and designate the default permissions and ownership for new files. You
should be familiar with the configuration of the nodes in your Hadoop cluster, and know how to apply configuration
changes or run a set of commands across all the nodes.
The security features are divided into these broad categories:
authorization
Which users are allowed to access which resources, and what operations are they allowed to perform? Impala relies
on the open source Sentry project for authorization. By default (when authorization is not enabled), Impala does
all read and write operations with the privileges of the impala user, which is suitable for a development/test
environment but not for a secure production environment. When authorization is enabled, Impala uses the OS user
ID of the user who runs impala-shell or other client program, and associates various privileges with each user.
See Enabling Sentry Authorization for Impala on page 95 for details about setting up and managing authorization.
authentication
How does Impala verify the identity of the user to confirm that they really are allowed to exercise the privileges
assigned to that user? Impala relies on the Kerberos subsystem for authentication. See Enabling Kerberos
Authentication for Impala on page 106 for details about setting up and managing authentication.
auditing
What operations were attempted, and did they succeed or not? This feature provides a way to look back and
diagnose whether attempts were made to perform unauthorized operations. You use this information to track
down suspicious activity, and to see where changes are needed in authorization policies. The audit data produced
by this feature is collected by the Cloudera Manager product and then presented in a user-friendly form by the
Cloudera Manager product. See Auditing Impala Operations on page 112 for details about setting up and managing
auditing.
• Restrict membership in the sudoers list (in the /etc/sudoers file). The users who can run the sudo command
can do many of the same things as the root user.
• Ensure the Hadoop ownership and permissions for Impala data files are restricted.
• Ensure the Hadoop ownership and permissions for Impala log files are restricted.
• Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected.
See Impala Web User Interface for Debugging on page 580 for details.
• Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups (which
by default map to Linux OS groups). Create the associated Linux groups using the groupadd command if necessary.
• The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for background
information, see the CDH HDFS Permissions Guide. Set up users and assign them to groups at the OS level,
corresponding to the different categories of users with different access levels for various databases, tables, and
HDFS locations (URIs). Create the associated Linux users using the useradd command if necessary, and add them
to the appropriate groups with the usermod command.
• Design your databases, tables, and views with database and table structure to allow policy rules to specify simple,
consistent rules. For example, if all tables related to an application are inside a single database, you can assign
privileges for that database and use the * wildcard for the table name. If you are creating views with different
privileges than the underlying base tables, you might put the views in a separate database so that you can use
the * wildcard for the database containing the base tables, while specifying the precise names of the individual
views. (For specifying table or database names, you either specify the exact name or * to mean all the databases
on a server, or all the tables and views in a database.)
• Enable authorization by running the impalad daemons with the -server_name and
-authorization_policy_file options on all nodes. (The authorization feature does not apply to the
statestored daemon, which has no access to schema objects or data files.)
• Set up authentication using Kerberos, to make sure users really are who they say they are. With Kerberos in place,
even if someone connects an unauthorized host to the network and creates user IDs that match your privileged
IDs, they will be blocked from connecting to Impala at all from that host.
Note:
If the private key is encrypted using a passphrase, Impala will ask for that passphrase on startup, which
is not useful for a large cluster. In that case, remove the passphrase and make the .pem file readable
only by Impala and administrators.
When you turn on TLS/SSL for the Impala web UI, the associated URLs change from http:// prefixes
to https://. Adjust any bookmarks or application code that refers to those URLs.
Important:
• You can use either Cloudera Manager or the following command-line instructions to complete
this configuration.
• This information applies specifically to CDH 5.6.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
Property Description

Enable TLS/SSL for Impala Client Services
    Encrypt communication between clients (like ODBC, JDBC, and the Impala shell) and the Impala daemon
    using Transport Layer Security (TLS) (formerly known as Secure Socket Layer (SSL)).

SSL/TLS Certificate for Clients
    Local path to the X509 certificate that identifies the Impala daemon to clients during TLS/SSL
    connections. This file must be in PEM format.

SSL/TLS Private Key for Clients
    Local path to the private key that matches the certificate specified in SSL/TLS Certificate for
    Clients. This file must be in PEM format.

SSL/TLS Private Key Password for Clients
    A shell command for Impala to run on startup to retrieve the password for a password-protected
    private key file. The output of the command is truncated to a maximum of 1024 bytes, and any trailing
    whitespace (such as spaces or newline characters) is trimmed. If the command exits with an error,
    Impala does not start. If the password is incorrect, clients cannot connect to the server regardless
    of whether the public key is correct.

SSL/TLS CA Certificate
    Must be specified for TLS/SSL encryption to be enabled for communication between internal Impala
    components.

SSL/TLS Certificate for Impala component Webserver
    There are three of these configuration settings, one each for “Impala Daemon”, “Catalog Server”,
    and “Statestore”. Each of these Impala components has its own internal web server that powers the
    associated web UI with diagnostic information. The configuration setting represents the local path
    to the X509 certificate that identifies the web server to clients during TLS/SSL connections. This
    file must be in PEM format.
For information on configuring TLS/SSL communication with the impala-shell interpreter, see Configuring TLS/SSL
Communication for the Impala Shell on page 95.
Warning: Prior to CDH 5.5.2 / Impala 2.3.2, you could enable Kerberos authentication between Impala
internal components, or SSL encryption between Impala internal components, but not both at the
same time. This restriction has now been lifted. See IMPALA-2598 to see the maintenance releases
for different levels of CDH where the fix has been published.
If either of these flags is set, both must be set. In that case, Impala starts listening for Beeswax and HiveServer2
requests on SSL-secured ports only. (The port numbers stay the same; see Ports Used by Impala on page 583 for details.)
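The flags in question are presumably --ssl_server_certificate and --ssl_private_key; a minimal sketch of setting them in the impalad startup options (the paths are placeholders):

IMPALA_SERVER_ARGS=" \
    -ssl_server_certificate=/full/path/to/server-cert.pem \
    -ssl_private_key=/full/path/to/server-key.pem \
    ..."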
Since Impala uses passphrase-less certificates in PEM format, you can reuse a host's existing Java keystore by converting
it to the PEM format. For instructions, see
http://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_openssl_jks.html.
a secure production environment. When authorization is enabled, Impala uses the OS user ID of the user who runs
impala-shell or other client program, and associates various privileges with each user.
Note: Sentry is typically used in conjunction with Kerberos authentication, which defines which hosts
are allowed to connect to each server. Using the combination of Sentry and Kerberos prevents malicious
users from being able to connect by creating a named account on an untrusted machine. See Enabling
Kerberos Authentication for Impala on page 106 for details about Kerberos authentication.
• Specifying the -authorization_policy_file option in addition to -server_name makes Impala read privilege
information from a policy file, rather than from the metastore database. The argument to the
-authorization_policy_file option specifies the HDFS path to the policy file that defines the privileges on
different schema objects.
For example, you might adapt your /etc/default/impala configuration to contain lines like the following. To use
the Sentry service rather than the policy file:

IMPALA_SERVER_ARGS=" \
-server_name=server1 \
...

Or, to use the policy file instead of the Sentry service:

IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...
The preceding examples set up a symbolic name of server1 to refer to the current instance of Impala. This symbolic
name is used in the following ways:
• In an environment managed by Cloudera Manager, the server name is specified through Impala (Service-Wide) >
Category > Advanced > Sentry Service and Hive > Service-Wide > Advanced > Sentry Service. The values must
be the same for both, so that Impala and Hive can share the privilege rules. Restart the Impala and Hive services
after setting or changing this value.
• In an environment not managed by Cloudera Manager, you specify this value for the sentry.hive.server
property in the sentry-site.xml configuration file for Hive, as well as in the -server_name option for impalad.
If the impalad daemon is not already running, start it as described in Starting Impala on page 42. If it is already
running, restart it with the command sudo /etc/init.d/impala-server restart. Run the appropriate
commands on all the nodes where impalad normally runs.
• If you use the mode of operation using the policy file, the rules in the [roles] section of the policy file refer to
this same server1 name. For example, the following rule sets up a role report_generator that lets users with
that role query any table in a database named reporting_db on a node where the impalad daemon was started
up with the -server_name=server1 option:
[roles]
report_generator = server=server1->db=reporting_db->table=*->action=SELECT
When impalad is started with one or both of the -server_name=server1 and -authorization_policy_file
options, Impala authorization is enabled. If Impala detects any errors or inconsistencies in the authorization settings
or the policy file, the daemon refuses to start.
Using Impala with the Sentry Service (CDH 5.1 or higher only)
When you use the Sentry service rather than the policy file, you set up privileges through GRANT and REVOKE statements
in either Impala or Hive, and then both components use those same privileges automatically. (Impala added the GRANT
and REVOKE statements in Impala 2.0.0 / CDH 5.2.0.)
Hive already had GRANT and REVOKE statements prior to CDH 5.1, but those statements were not production-ready.
CDH 5.1 is the first release where those statements use the Sentry framework and are considered GA level. If you used
the Hive GRANT and REVOKE statements prior to CDH 5.1, you must set up these privileges with the CDH 5.1 versions
of GRANT and REVOKE to take advantage of Sentry authorization.
For information about using the updated Hive GRANT and REVOKE statements, see the Sentry service topic in the CDH 5
Security Guide.
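For illustration, a hedged sketch of managing privileges through the Sentry service with these SQL statements (the role, group, and table names are hypothetical):

CREATE ROLE report_generator;
GRANT ROLE report_generator TO GROUP analysts;
GRANT SELECT ON TABLE reporting_db.daily_summary TO ROLE report_generator;
-- Later, withdraw the privilege if the role no longer needs it:
REVOKE SELECT ON TABLE reporting_db.daily_summary FROM ROLE report_generator;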
Note:
In CDH 5 and higher, Cloudera recommends managing privileges through SQL statements, as described
in Using Impala with the Sentry Service (CDH 5.1 or higher only) on page 97. If you are still using policy
files, plan to migrate to the new approach some time in the future.
The location of the policy file is listed in the auth-site.xml configuration file. To minimize overhead, the security
information from this file is cached by each impalad daemon and refreshed automatically, with a default interval of
5 minutes. After making a substantial change to security policies, restart all Impala daemons to pick up the changes
immediately.
server=server_name->db=database_name->table=table_name->action=SELECT
server=server_name->db=database_name->table=table_name->action=CREATE
server=server_name->db=database_name->table=table_name->action=ALL
For the server_name value, substitute the same symbolic name you specify with the impalad -server_name option.
You can use * wildcard characters at each level of the privilege specification to allow access to all such objects. For
example:
server=impala-host.example.com->db=default->table=t1->action=SELECT
server=impala-host.example.com->db=*->table=*->action=CREATE
server=impala-host.example.com->db=*->table=audit_log->action=SELECT
server=impala-host.example.com->db=default->table=t1->action=*
When authorization is enabled, Impala uses the policy file as a whitelist, representing every privilege available to any
user on any object. That is, only operations specified for the appropriate combination of object, role, group, and user
are allowed; all other operations are not allowed. If a group or role is defined multiple times in the policy file, the last
definition takes precedence.
To understand the notion of whitelisting, set up a minimal policy file that does not provide any privileges for any object.
When you connect to an Impala node where this policy file is in effect, you get no results for SHOW DATABASES, and
an error when you issue any SHOW TABLES, USE database_name, DESCRIBE table_name, SELECT, or other
statements that expect to access databases or tables, even if the corresponding databases and tables exist.
The contents of the policy file are cached, to avoid a performance penalty for each query. The policy file is re-checked
by each impalad node every 5 minutes. When you make a non-time-sensitive change such as adding new privileges
or new users, you can let the change take effect automatically a few minutes later. If you remove or reduce privileges,
and want the change to take effect immediately, restart the impalad daemon on all nodes, again specifying the
-server_name and -authorization_policy_file options so that the rules from the updated policy file are
applied.
[groups]
supergroup = all_databases
[roles]
read_all_tables = server=server1->db=*->table=*->action=SELECT
all_tables = server=server1->db=*->table=*
all_databases = server=server1->db=*
one_database = server=server1->db=test_db
entire_server = server=server1
The following sample policy file shows some of the syntax that is appropriate as the policy file grows, such as the #
comment syntax, \ continuation syntax, and comma separation for roles assigned to groups or privileges assigned to
roles.
[groups]
cloudera = training_sysadmin, instructor
visitor = student
[roles]
training_sysadmin = server=server1->db=training, \
server=server1->db=instructor_private, \
server=server1->db=lesson_development
instructor = server=server1->db=training->table=*->action=*, \
server=server1->db=instructor_private->table=*->action=*, \
server=server1->db=lesson_development->table=lesson*
# This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
student = server=server1->db=training->table=lesson_*->action=SELECT
[groups]
cloudera = external_table, staging_dir
[roles]
external_table_admin = server=server1->db=external_table
external_table = server=server1->db=external_table->table=sample->action=*
staging_dir = server=server1->uri=hdfs://127.0.0.1:8020/user/cloudera/external_data->action=*
impala-shell session:

+--------+
| name   |
+--------+
| sample |
+--------+
Returned 1 row(s) in 0.02s
[groups]
supergroup = training_sysadmin
cloudera = instructor
visitor = student
[roles]
training_sysadmin = server=server1->db=training
instructor = server=server1->db=training->table=*->action=*
student = server=server1->db=training->table=*->action=SELECT
[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini
To enable URIs in per-DB policy files, add the following string in the Cloudera Manager field Impala Service Environment
Advanced Configuration Snippet (Safety Valve):
JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
Important: Enabling URIs in per-DB policy files introduces a security risk by allowing the owner of
the db-level policy file to grant himself/herself load privileges to anything the impala user has read
permissions for in HDFS (including data in other databases controlled by different db-level policy files).
• Server
• URI
• Database
• Table
The server name is specified by the -server_name option when impalad starts. Specify the same name for all
impalad nodes in the cluster.
URIs represent the HDFS paths you specify as part of statements such as CREATE EXTERNAL TABLE and LOAD DATA.
Typically, you specify what look like UNIX paths, but these locations can also be prefixed with hdfs:// to make clear
that they are really URIs. To set privileges for a URI, specify the name of a directory, and the privilege applies to all the
files in that directory and any directories underneath it.
In CDH 5.5 / Impala 2.3 and higher, you can specify privileges for individual columns, as described in
https://www.cloudera.com/documentation/enterprise/latest/topics/sg_hive_sql.html. Formerly, to specify read
privileges at this level, you created a view that queried specific columns and/or partitions from a base table, and gave
SELECT privilege on the view but not the underlying table.
URIs must start with either hdfs:// or file://. If a URI starts with anything else, it will cause an exception and the
policy file will be invalid. When defining URIs for HDFS, you must also specify the NameNode. For example:
data_read = server=server1->uri=file:///path/to/dir, \
server=server1->uri=hdfs://namenode:port/path/to/dir
Warning:
Because the NameNode host and port must be specified, Cloudera strongly recommends you use High
Availability (HA). This ensures that the URI will remain constant even if the NameNode changes.
data_read = server=server1->uri=file:///path/to/dir,\
server=server1->uri=hdfs://ha-nn-uri/path/to/dir
Privilege   Object
INSERT      DB, TABLE
SELECT      DB, TABLE, COLUMN
ALL         SERVER, TABLE, DB, URI
Note:
Although this document refers to the ALL privilege, currently if you use the policy file mode, you do
not use the actual keyword ALL in the policy file. When you code role entries in the policy file:
• To specify the ALL privilege for a server, use a role like server=server_name.
• To specify the ALL privilege for a database, use a role like
server=server_name->db=database_name.
• To specify the ALL privilege for a table, use a role like
server=server_name->db=database_name->table=table_name->action=*.
The resulting log messages indicate each evaluation Sentry makes. The FilePermission is from the policy file, while
RequestPermission is the privilege required for the query. A RequestPermission iterates over all appropriate
FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false, indicating
“Access Denied”.
Note: The Hive/Impala > Service-Wide > Policy File Based Sentry tab contains parameters only
relevant to configuring Sentry using policy files. In particular, make sure that the Enable Sentry
Authorization using Policy Files parameter is unchecked when using the Sentry service. Cloudera
Manager throws a validation error if you attempt to configure the Sentry service and policy file at the
same time.
Impala Authentication
Authentication is the mechanism to ensure that only specified hosts and users can connect to Impala. It also verifies
that when clients connect to Impala, they are connected to a legitimate server. This feature prevents spoofing such as
impersonation (setting up a phony client system with the same account and group names as a legitimate user) and
man-in-the-middle attacks (intercepting application requests before they reach Impala and eavesdropping on sensitive
information in the requests or the results).
Impala supports authentication using either Kerberos or LDAP.
Note: Regardless of the authentication mechanism used, Impala always creates HDFS directories and
data files owned by the same user (typically impala). To implement user-level access to different
databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained
in Enabling Sentry Authorization for Impala on page 95.
Once you are finished setting up authentication, move on to authorization, which involves specifying what databases,
tables, HDFS directories, and so on can be accessed by particular users when they connect through Impala. See Enabling
Sentry Authorization for Impala on page 95 for details.
In Impala 2.0 and later, user() returns the full Kerberos principal string, such as [email protected], in a Kerberized
environment.
Note: Regardless of the authentication mechanism used, Impala always creates HDFS directories and
data files owned by the same user (typically impala). To implement user-level access to different
databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained
in Enabling Sentry Authorization for Impala on page 95.
An alternative form of authentication you can use is LDAP, described in Enabling LDAP Authentication for Impala on
page 109.
Important:
• If you plan to use Impala in your cluster, you must configure your KDC to allow tickets to be
renewed, and you must configure krb5.conf to request renewable tickets. Typically, you can
do this by adding the max_renewable_life setting to your realm in kdc.conf, and by adding
the renew_lifetime parameter to the libdefaults section of krb5.conf.
For more information about renewable tickets, see the Kerberos documentation.
• The Impala Web UI does not support Kerberos authentication.
• You cannot use the Impala resource management feature on a cluster that has Kerberos
authentication enabled.
Start all impalad and statestored daemons with the --principal and --keytab_file flags set to the principal
and full path name of the keytab file containing the credentials for the principal.
Impala supports the Cloudera ODBC driver and the Kerberos interface it provides. To use Kerberos through the ODBC
driver, the host type must be set depending on the level of the ODBC driver:
• SecImpala for the ODBC 1.0 driver.
• SecBeeswax for the ODBC 1.2 driver.
• Blank for the ODBC 2.0 driver or higher, when connecting to a secure cluster.
• HS2NoSasl for the ODBC 2.0 driver or higher, when connecting to a non-secure cluster.
To enable Kerberos in the Impala shell, start the impala-shell command using the -k flag.
To enable Impala to work with Kerberos security on your Hadoop cluster, make sure you perform the installation and
configuration steps in Authentication in the CDH 5 Security Guide.
$ kadmin
kadmin: addprinc -requires_preauth -randkey impala/[email protected]
Note: The HTTP component of the service principal must be uppercase as shown in the preceding
example.
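The uppercase requirement applies to the separate HTTP service principal that is merged with the Impala principal in the next step. A hedged sketch of creating it (the hostname and realm are placeholders):

$ kadmin
kadmin: addprinc -requires_preauth -randkey HTTP/[email protected]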
4. Use ktutil to read the contents of the two keytab files and then write those contents to a new file. For example:
$ ktutil
ktutil: rkt impala.keytab
ktutil: rkt http.keytab
ktutil: wkt impala-http.keytab
ktutil: quit
5. (Optional) Test that credentials in the merged keytab file are valid, and that the “renew until” date is in the future.
For example:
$ klist -e -k -t impala-http.keytab
6. Copy the impala-http.keytab file to the Impala configuration directory. Change the permissions to be only
read for the file owner and change the file owner to the impala user. By default, the Impala user and group are
both named impala. For example:
$ cp impala-http.keytab /etc/impala/conf
$ cd /etc/impala/conf
$ chmod 400 impala-http.keytab
$ chown impala:impala impala-http.keytab
7. Add Kerberos options to the Impala defaults file, /etc/default/impala. Add the options for both the impalad
and statestored daemons, using the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS variables. For
example, you might add:
-kerberos_reinit_interval=60
-principal=impala_1/[email protected]
-keytab_file=/var/run/cloudera-scm-agent/process/3212-impala-IMPALAD/impala.keytab
For more information on changing the Impala defaults specified in /etc/default/impala, see Modifying Impala
Startup Options.
Note: Restart impalad and statestored for these configuration changes to take effect.
Note: Regardless of the authentication mechanism used, Impala always creates HDFS directories and
data files owned by the same user (typically impala). To implement user-level access to different
databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained
in Enabling Sentry Authorization for Impala on page 95.
An alternative form of authentication you can use is Kerberos, described in Enabling Kerberos Authentication for Impala
on page 106.
[impala]
auth_username=<LDAP username of Hue user to be authenticated>
auth_password=<LDAP password of Hue user to be authenticated>
These login details are only used by Impala to authenticate to LDAP. The Impala service trusts Hue to have already
validated the user being impersonated, rather than simply passing on the credentials.
Note: Make sure to use single quotes or escape characters to ensure that any * characters do not
undergo wildcard expansion when specified in command-line arguments.
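The wildcard caution applies to values such as the delegated user list. A hedged sketch using the impalad --authorized_proxy_user_config startup option (the hue proxy user and the wildcard value are illustrative):

IMPALA_SERVER_ARGS=" \
    -authorized_proxy_user_config='hue=*' \
    ..."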
See Modifying Impala Startup Options on page 43 for details about adding or changing impalad startup options. See
this Cloudera blog post for background information about the delegation capability in HiveServer2.
To set up authentication for the delegated users:
• On the server side, configure either user/password authentication through LDAP, or Kerberos authentication, for
all the delegated users. See Enabling LDAP Authentication for Impala on page 109 or Enabling Kerberos Authentication
for Impala on page 106 for details.
• On the client side, follow the instructions in the “Using User Name and Password” section in the ODBC driver
installation guide. Then search for “delegation” in that same installation guide to learn about the Delegation UID
field and DelegationUID configuration keyword to enable the delegation feature for ODBC-based BI tools.
Because any Impala host can process a query, enable auditing on all hosts where the impalad daemon runs. Each
host stores its own log files, in a directory in the local filesystem. The log data is periodically flushed to disk (through
an fsync() system call) to avoid loss of audit data in case of a crash.
The runtime overhead of auditing applies to whichever host serves as the coordinator for the query, that is, the host
you connect to when you issue the query. This might be the same host for all queries, or different applications or users
might connect to and issue queries through different hosts.
To avoid excessive I/O overhead on busy coordinator hosts, Impala syncs the audit log data (using the fsync() system
call) periodically rather than after every query. Currently, the fsync() calls are issued at a fixed interval, every 5
seconds.
By default, Impala avoids losing any audit log data in the case of an error during a logging operation (such as a disk full
error), by immediately shutting down impalad on the host where the auditing problem occurred. You can override
this setting by specifying the option -abort_on_failed_audit_event=false in the impalad startup options.
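Putting these options together, a minimal sketch of audit-related impalad startup settings (the log directory path is a placeholder, and -audit_event_log_dir is assumed as the option that enables audit logging):

IMPALA_SERVER_ARGS=" \
    -audit_event_log_dir=/var/log/impalad/audit \
    -abort_on_failed_audit_event=false \
    ..."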
The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a query that
fails due to a syntax error is not recorded in the audit log. The audit log also does not contain queries that fail due to
a reference to a table that does not exist, if you would be authorized to access the table if it did exist.
Certain statements in the impala-shell interpreter, such as CONNECT, SUMMARY, PROFILE, SET, and QUIT, do not
correspond to actual SQL queries, and these statements are not reflected in the audit log.
Column Lineage
Column lineage tracks information in fine detail, at the level of particular columns rather than entire tables.
For example, if you have a table with information derived from web logs, you might copy that data into other tables
as part of the ETL process. The ETL operations might involve transformations through expressions and function calls,
and rearranging the columns into more or fewer tables (normalizing or denormalizing the data). Then for reporting,
you might issue queries against multiple tables and views. In this example, column lineage helps you determine that
data that entered the system as RAW_LOGS.FIELD1 was then turned into WEBSITE_REPORTS.IP_ADDRESS through
an INSERT ... SELECT statement. Or, conversely, you could start with a reporting query against a view, and trace
the origin of the data in a field such as TOP_10_VISITORS.USER_ID back to the underlying table and even further
back to the point where the data was first loaded into Impala.
When you have tables where you need to track or control access to sensitive information at the column level, see
Enabling Sentry Authorization for Impala on page 95 for how to implement column-level security. You set up
authorization using the Sentry framework, create views that refer to specific sets of columns, and then assign
authorization privileges to those views rather than the underlying tables.
To enable or disable this feature on a system not managed by Cloudera Manager, set or remove the
-lineage_event_log_dir configuration option for the impalad daemon. For information about turning the lineage
feature on and off through Cloudera Manager, see
http://www.cloudera.com/documentation/enterprise/latest/topics/datamgmt_impala_lineage_log.html.
Impala supports data types with the same names and semantics as the equivalent Hive data types: TINYINT,
SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING, TIMESTAMP.
For full details about Impala SQL syntax and semantics, see Impala SQL Statements on page 214.
Most HiveQL SELECT and INSERT statements run unmodified with Impala. For information about Hive syntax not
available in Impala, see SQL Differences Between Impala and Hive on page 461.
For a list of the built-in functions available in Impala queries, see Impala Built-In Functions on page 338.
Comments
Impala supports the familiar styles of SQL comments:
• All text from a -- sequence to the end of the line is considered a comment and ignored. This type of comment
can occur on a single line by itself, or after all or part of a statement.
• All text from a /* sequence to the next */ sequence is considered a comment and ignored. This type of comment
can stretch over multiple lines. This type of comment can occur on one or more lines by itself, in the middle of a
statement, or before or after a statement.
For example:
/*
This is a multi-line comment about a query.
*/
select ...;
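For completeness, the single-line style looks like this:

-- This is a single-line comment about a query.
select ...;  -- A comment can also follow part or all of a statement.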
Data Types
Impala supports a set of data types that you can use for table columns, expression values, and function arguments and
return values.
Note: Currently, Impala supports only scalar types, not composite or nested types. Accessing a table
containing any columns with unsupported types causes an error.
For the notation to write literals of each of these data types, see Literals on page 184.
See SQL Differences Between Impala and Hive on page 461 for differences between Impala and Hive data types.
Usage notes:
Because complex types are often used in combination, for example an ARRAY of STRUCT elements, if you are unfamiliar
with the Impala complex types, start with Complex Types (CDH 5.5 or higher only) on page 156 for background information
and usage examples.
The elements of the array have no names. You refer to the value of the array item using the ITEM pseudocolumn, or
its position in the array with the POS pseudocolumn. See ITEM and POS Pseudocolumns on page 170 for information
about these pseudocolumns.
Each row can have a different number of elements (including none) in the array for that row.
When an array contains items of scalar types, you can use aggregation functions on the array elements without using
join notation. For example, you can find the COUNT(), AVG(), SUM(), and so on of numeric array elements, or the
MAX() and MIN() of any scalar array elements by referring to table_name.array_column in the FROM clause of
the query. When you need to cross-reference values from the array with scalar values from the same row, such as by
including a GROUP BY clause to produce a separate aggregated result for each row, then the join clause is required.
A common usage pattern with complex types is to have an array as the top-level type for the column: an array of
structs, an array of maps, or an array of arrays. For example, you can model a denormalized table by creating a column
that is an ARRAY of STRUCT elements; each item in the array represents a row from a table that would normally be
used in a join query. This kind of data structure lets you essentially denormalize tables by associating multiple rows
from one table with the matching row in another table.
You typically do not create more than one top-level ARRAY column, because if there is some relationship between the
elements of multiple arrays, it is convenient to model the data as an array of another complex type element (either
STRUCT or MAP).
You can pass a qualified name to DESCRIBE to specify an ARRAY, STRUCT, or MAP column and visualize its structure
as if it were a table. An ARRAY is shown as a two-column table, with ITEM and POS columns. A STRUCT is shown as a
table with each field representing a column in the table. A MAP is shown as a two-column table, with KEY and VALUE
columns.
Added in: CDH 5.5.0 (Impala 2.3.0)
Restrictions:
• Columns with this data type can only be used in tables or partitions with the Parquet file format.
• Columns with this data type cannot be used as partition key columns in a partitioned table.
• The COMPUTE STATS statement does not produce any statistics for columns of this data type.
• The maximum length of the column definition for any complex type, including declarations for any nested types,
is 4000 characters.
• See Limitations and Restrictions for Complex Types on page 161 for a full list of limitations and associated guidelines
about complex type columns.
Examples:
Note: Many of the complex type examples refer to tables such as CUSTOMER and REGION adapted
from the tables used in the TPC-H benchmark. See Sample Schema and Data for Experimenting with
Impala Complex Types on page 178 for the table definitions.
The following example shows how to construct a table with various kinds of ARRAY columns, both at the top level and
nested within other complex types. Whenever the ARRAY consists of a scalar value, such as in the PETS column or the
CHILDREN field, you can see that future expansion is limited. For example, you could not easily evolve the schema to
record the kind of pet or the child's birthday alongside the name. Therefore, it is more common to use an ARRAY whose
elements are of STRUCT type, to associate multiple fields with each array element.
Note: Practice the CREATE TABLE and query notation for complex type columns using empty tables,
until you can visualize a complex data structure and construct corresponding SQL statements reliably.
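The CREATE TABLE statement itself is not reproduced here, but its shape can be inferred from the DESCRIBE output that follows; a sketch (Parquet is assumed, since complex types require the Parquet file format):

CREATE TABLE array_demo
(
  id BIGINT,
  name STRING,
  pets ARRAY<STRING>,
  marriages ARRAY<STRUCT<
    spouse: STRING,
    children: ARRAY<STRING>
  >>,
  places_lived ARRAY<STRUCT<
    place: STRING,
    start_year: INT
  >>,
  ancestors MAP<STRING, ARRAY<STRING>>
)
STORED AS PARQUET;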
The following example shows how to examine the structure of a table containing one or more ARRAY columns by using
the DESCRIBE statement. You can visualize each ARRAY as its own two-column table, with columns ITEM and POS.
DESCRIBE array_demo;
+--------------+---------------------------+
| name | type |
+--------------+---------------------------+
| id | bigint |
| name | string |
| pets | array<string> |
| marriages | array<struct< |
| | spouse:string, |
| | children:array<string> |
| | >> |
| places_lived | array<struct< |
| | place:string, |
| | start_year:int |
| | >> |
| ancestors | map<string,array<string>> |
+--------------+---------------------------+
DESCRIBE array_demo.pets;
+------+--------+
| name | type |
+------+--------+
| item | string |
| pos | bigint |
+------+--------+
DESCRIBE array_demo.marriages;
+------+--------------------------+
| name | type |
+------+--------------------------+
| item | struct< |
| | spouse:string, |
| | children:array<string> |
| | > |
| pos | bigint |
+------+--------------------------+
DESCRIBE array_demo.places_lived;
+------+------------------+
| name | type |
+------+------------------+
| item | struct< |
| | place:string, |
| | start_year:int |
| | > |
| pos | bigint |
+------+------------------+
DESCRIBE array_demo.ancestors;
+-------+---------------+
| name | type |
+-------+---------------+
| key | string |
| value | array<string> |
+-------+---------------+
The following example shows queries involving ARRAY columns containing elements of scalar or complex types. You
“unpack” each ARRAY column by referring to it in a join query, as if it were a separate table with ITEM and POS columns.
If the array element is a scalar type, you refer to its value using the ITEM pseudocolumn. If the array element is a
STRUCT, you refer to the STRUCT fields using dot notation and the field names. If the array element is another ARRAY
or a MAP, you use another level of join to unpack the nested collection elements.
-- Array of structs.
-- Now each array element has named fields, possibly of different types.
-- You can consider an ARRAY of STRUCT to represent a table inside another table.
SELECT id, name, places_lived.pos, places_lived.item.place, places_lived.item.start_year
FROM array_demo, array_demo.places_lived;
-- The .ITEM name is optional for array elements that are structs.
-- The following query is equivalent to the previous one, with .ITEM
-- removed from the column references.
SELECT id, name, places_lived.pos, places_lived.place, places_lived.start_year
FROM array_demo, array_demo.places_lived;
-- To filter specific items from the array, do comparisons against the .POS or .ITEM
-- pseudocolumns, or names of struct fields, in the WHERE clause.
SELECT id, name, pets.item FROM array_demo, array_demo.pets
WHERE pets.pos in (0, 1, 3);
Related information:
Complex Types (CDH 5.5 or higher only) on page 156, STRUCT Complex Type (CDH 5.5 or higher only) on page 142, MAP
Complex Type (CDH 5.5 or higher only) on page 136
column_name BIGINT
Usage notes:
BIGINT is a convenient type to use for column declarations because you can use any kind of integer values in INSERT
statements and they are promoted to BIGINT where necessary. However, BIGINT also requires the most bytes of
any integer type on disk and in memory, meaning your queries are not as efficient and scalable as possible if you
overuse this type. Therefore, prefer to use the smallest integer type with sufficient range to hold all input values, and
CAST() when necessary to the appropriate type.
For a convenient and automated way to check the bounds of the BIGINT type, call the functions MIN_BIGINT() and
MAX_BIGINT().
If an integer value is too large to be represented as a BIGINT, use a DECIMAL instead with sufficient digits of precision.
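A quick illustrative sketch of checking the range and promoting values explicitly (the literal values are arbitrary):

-- Check the bounds of the BIGINT type.
SELECT min_bigint(), max_bigint();

-- Smaller integer values are promoted to BIGINT where necessary;
-- use CAST() to make the conversion explicit when needed.
SELECT cast(1000000 AS BIGINT) * 1000000;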
NULL considerations: Casting any non-numeric value to this type produces a NULL value.
Partitioning: Prefer to use this type for a partition key column. Impala can process the numeric type more efficiently
than a STRING representation of the value.
HBase considerations: This data type is fully compatible with HBase tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as an 8-byte value.
Added in: Available in all versions of Impala.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Sqoop considerations:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE,
DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is
represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala
interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported
this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
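For example, a hedged sketch of compensating for the milliseconds-versus-seconds mismatch (the table and column names are hypothetical):

-- The Parquet INT64 value holds milliseconds, while Impala interprets
-- a BIGINT-to-TIMESTAMP cast as seconds, so divide by 1000 first.
SELECT cast(event_time_ms DIV 1000 AS TIMESTAMP)
FROM sqoop_imported_events;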
Related information:
Numeric Literals on page 184, TINYINT Data Type on page 153, SMALLINT Data Type on page 140, INT Data Type on page
135, BIGINT Data Type on page 120, DECIMAL Data Type (CDH 5.1 or higher only) on page 125, Impala Mathematical
Functions on page 339
column_name BOOLEAN
Range: TRUE or FALSE. Do not use quotation marks around the TRUE and FALSE literal values. You can write the literal
values in uppercase, lowercase, or mixed case. The values queried from a table are always returned in lowercase, true
or false.
Conversions: Impala does not automatically convert any other type to BOOLEAN. All conversions must use an explicit
call to the CAST() function.
You can use CAST() to convert any integer or floating-point type to BOOLEAN: a value of 0 represents false, and any
non-zero value is converted to true.
When you cast the opposite way, from BOOLEAN to a numeric type, the result becomes either 1 or 0:
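For instance, a minimal illustrative sketch of both directions:

SELECT CAST(true AS INT), CAST(false AS INT);     -- returns 1 and 0
SELECT CAST(100 AS BOOLEAN), CAST(0 AS BOOLEAN);  -- returns true and false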
You can cast DECIMAL values to BOOLEAN, with the same treatment of zero and non-zero values as the other numeric
types. You cannot cast a BOOLEAN to a DECIMAL.
You cannot cast a STRING value to BOOLEAN, although you can cast a BOOLEAN value to STRING, returning '1' for
true values and '0' for false values.
Although you can cast a TIMESTAMP to a BOOLEAN or a BOOLEAN to a TIMESTAMP, the results are unlikely to be useful.
Any non-zero TIMESTAMP (that is, any value other than 1970-01-01 00:00:00) becomes TRUE when converted to
BOOLEAN, while 1970-01-01 00:00:00 becomes FALSE. A value of FALSE becomes 1970-01-01 00:00:00 when
converted to TIMESTAMP, and TRUE becomes one second past this epoch date, that is, 1970-01-01 00:00:01.
NULL considerations: An expression of this type produces a NULL value if any argument of the expression is NULL.
Partitioning:
Do not use a BOOLEAN column as a partition key. Although you can create such a table, subsequent operations produce
errors:
ERROR: AnalysisException: INSERT into table with BOOLEAN partition column (truth) is
not supported: partitioning.truth_table
Examples:
SELECT 1 < 2;
SELECT 2 = 5;
SELECT 100 < NULL, 100 > NULL;
CREATE TABLE assertions (claim STRING, really BOOLEAN);
INSERT INTO assertions VALUES
("1 is less than 2", 1 < 2),
("2 is the same as 5", 2 = 5),
("Grass is green", true),
("The moon is made of green cheese", false);
SELECT claim FROM assertions WHERE really = TRUE;
HBase considerations: This data type is fully compatible with HBase tables.
Parquet considerations: This type is fully compatible with Parquet tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Related information: Boolean Literals on page 187, SQL Operators on page 188, Impala Conditional Functions on page
391
column_name CHAR(length)
• Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
• Parquet data files might contain values that are longer than allowed by the CHAR(n) length limit. Impala ignores
any extra trailing characters when it processes those values during a query.
Text table considerations:
Text data files might contain values that are longer than allowed for a particular CHAR(n) column. Any extra trailing
characters are ignored when Impala processes those values during a query. Text data files can also contain values that
are shorter than the defined length limit, and Impala pads them with trailing spaces up to the specified length. Any
text data files produced by Impala INSERT statements do not include any trailing blanks for CHAR columns.
Avro considerations:
The Avro specification allows string values up to 2**64 bytes in length. Impala queries for Avro tables use 32-bit integers
to hold string lengths. In CDH 5.7 / Impala 2.5 and higher, Impala truncates CHAR and VARCHAR values in Avro tables
to (2**31)-1 bytes. If a query encounters a STRING value longer than (2**31)-1 bytes in an Avro table, the query fails.
In earlier releases, encountering such long values in an Avro table could cause a crash.
Compatibility:
This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are no
compatibility issues with other components when exchanging data files or running Impala on CDH 4.
Some other database systems make the length specification optional. For Impala, the length is required.
Internal details: Represented in memory as a byte array with the same size as the length specification. Values that are
shorter than the specified length are padded on the right with trailing spaces.
Added in: CDH 5.2.0 (Impala 2.0.0)
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
UDF considerations: This type cannot be used for the argument or return type of a user-defined function (UDF) or
user-defined aggregate function (UDA).
Examples:
These examples show how trailing spaces are not considered significant when comparing or processing CHAR values.
CAST() truncates any longer string to fit within the defined length. If a CHAR value is shorter than the specified length,
it is padded on the right with spaces until it matches the specified length. Therefore, LENGTH() represents the length
including any trailing spaces, and CONCAT() also treats the column value as if it has trailing spaces.
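For example, a minimal sketch of this padding behavior:

SELECT LENGTH(CAST('a' AS CHAR(5)));                 -- Returns 5; the value is padded to 'a    '.
SELECT CONCAT(CAST('a' AS CHAR(5)), 'end');          -- Returns 'a    end'.
SELECT CAST('a ' AS CHAR(3)) = CAST('a' AS CHAR(3)); -- Returns true; trailing spaces are not significant.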
This example shows a case where data values are known to have a specific length, where CHAR is a logical data type
to use.
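A minimal sketch (the table name is hypothetical; two-letter codes have a known fixed length):

CREATE TABLE states (abbr CHAR(2), name STRING);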
The following example shows how values written by Impala do not physically include the trailing spaces. It creates a
table using text format, with CHAR values much shorter than the declared length, and then prints the resulting data
file to show that the delimited values are not separated by spaces. The same behavior applies to binary-format Parquet
data files.
The following example further illustrates the treatment of spaces. It replaces the contents of the previous table with
some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved within the data file,
but trailing spaces are discarded. Then when the values are retrieved by a query, the leading spaces are retrieved
verbatim while any necessary trailing spaces are supplied by Impala.
Restrictions:
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for scalability
reasons avoid declaring CHAR columns that are much longer than typical values in that column.
All data in CHAR and VARCHAR columns must be in a character encoding that is compatible with UTF-8. If you have
binary data from another database system (that is, a BLOB type), use a STRING column to hold it.
When an expression compares a CHAR with a STRING or VARCHAR, the CHAR value is implicitly converted to STRING
first, with trailing spaces preserved.
This behavior differs from other popular database systems. To get the expected result of TRUE, cast the expressions
on both sides to CHAR values of the appropriate length:
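For example, a hedged sketch of the comparison behavior:

SELECT CAST('x' AS CHAR(4)) = 'x';                  -- Returns false: compared as 'x   ' versus 'x'.
SELECT CAST('x' AS CHAR(4)) = CAST('x' AS CHAR(4)); -- Returns true: both sides are CHAR(4).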
column_name DECIMAL[(precision[,scale])]
• For addition and subtraction, the precision and scale are based on the maximum possible result, that is, if all the
digits of the input values were 9s and the absolute values were added together.
• For multiplication, the precision is the sum of the precisions of the input values. The scale is the sum of the scales
of the input values.
• For division, Impala sets the precision and scale to values large enough to represent the whole and fractional parts
of the result.
• For UNION, the scale is the larger of the scales of the input values, and the precision is increased if necessary to
accommodate any additional fractional digits. If the same input value has the largest precision and the largest
scale, the result value has the same precision and scale. If one value has a larger precision but smaller scale, the
scale of the result value is increased. For example, DECIMAL(20,2) UNION DECIMAL(8,6) produces a result
of type DECIMAL(24,6). The extra 4 fractional digits of scale (6-2) are accommodated by extending the precision
by the same amount (20+4).
• To doublecheck, you can always call the PRECISION() and SCALE() functions on the results of an arithmetic
expression to see the relevant values, or use a CREATE TABLE AS SELECT statement to define a column based
on the return type of the expression.
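For example, a minimal sketch of inspecting the result type of an expression:

SELECT PRECISION(1.23 + 4.5), SCALE(1.23 + 4.5);
-- 1.23 is DECIMAL(3,2) and 4.5 is DECIMAL(2,1), so the sum is DECIMAL(4,2):
-- the query returns 4 and 2.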
Compatibility:
• Using the DECIMAL type is only supported under CDH 5.1.0 and higher.
• Use the DECIMAL data type in Impala for applications where you used the NUMBER data type in Oracle. The Impala
DECIMAL type does not support the Oracle idioms of * for scale or negative values for precision.
To avoid potential conversion errors, you can use CAST() to convert DECIMAL values to FLOAT, TINYINT, SMALLINT,
INT, BIGINT, STRING, TIMESTAMP, or BOOLEAN. You can use exponential notation in DECIMAL literals or when casting
from STRING, for example 1.0e6 to represent one million.
If you cast a value with more fractional digits than the scale of the destination type, any extra fractional digits are
truncated (not rounded). Casting a value to a target type with not enough precision produces a result of NULL and
displays a runtime warning.
When you specify integer literals, for example in INSERT ... VALUES statements or arithmetic expressions, those
numbers are interpreted as the smallest applicable integer type. You must use CAST() calls for some combinations
of integer literals and DECIMAL precision. For example, INT has a maximum value that is 10 digits long, TINYINT has
a maximum value that is 3 digits long, and so on. If you specify a value such as 123456 to go into a DECIMAL column,
Impala checks if the column has enough precision to represent the largest value of that integer type, and raises an
error if not. Therefore, use an expression like CAST(123456 AS DECIMAL(9,0)) for DECIMAL columns with precision
9 or less, CAST(50 AS DECIMAL(2,0)) for DECIMAL columns with precision 2 or less, and so on. For DECIMAL
columns with precision 10 or greater, Impala automatically interprets the value as the correct DECIMAL type; however,
because DECIMAL(10) requires 8 bytes of storage while DECIMAL(9) requires only 4 bytes, only use precision of 10
or higher when actually needed.
Be aware that in memory and for binary file formats such as Parquet or Avro, DECIMAL(10) or higher consumes 8
bytes while DECIMAL(9) (the default for DECIMAL) or lower consumes 4 bytes. Therefore, to conserve space in large
tables, use the smallest-precision DECIMAL type that is appropriate and CAST() literal values where necessary, rather
than declaring DECIMAL columns with high precision for convenience.
To represent a very large or precise DECIMAL value as a literal, for example one that contains more digits than can be
represented by a BIGINT literal, use a quoted string or a floating-point value for the number, and CAST() to the
desired DECIMAL type:
insert into decimals_38_5 values (1), (2), (4), (8), (16), (1024), (32768), (65536),
(1000000),
(cast("999999999999999999999999999999" as decimal(38,5))),
(cast(999999999999999999999999999999. as decimal(38,5)));
• The result of the SUM() aggregate function on DECIMAL values is promoted to a precision of 38, with the same
scale as the underlying column. Thus, the result can represent the largest possible value at that particular
scale.
• STRING columns, literals, or expressions can be converted to DECIMAL as long as the overall number of digits and
digits to the right of the decimal point fit within the specified precision and scale for the declared DECIMAL type.
By default, a DECIMAL value with no specified scale or precision can hold a maximum of 9 digits of an integer
value. If there are more digits in the string value than are allowed by the DECIMAL scale and precision, the result
is NULL.
The following examples demonstrate how STRING values with integer and fractional parts are represented when
converted to DECIMAL. If the scale is 0, the number is treated as an integer value with a maximum of precision
digits. If the precision is greater than 0, the scale must be increased to account for the digits both to the left and
right of the decimal point. As the precision increases, output values are printed with additional trailing zeros after
the decimal point if needed. Any trailing zeros after the decimal point in the STRING value must fit within the
number of digits specified by the precision.
[localhost:21000] > select cast('100' as decimal); -- Small integer value fits within 9 digits of precision.
+-----------------------------+
| cast('100' as decimal(9,0)) |
+-----------------------------+
| 100 |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,0)); -- Small integer value fits within 3 digits of precision.
+-----------------------------+
| cast('100' as decimal(3,0)) |
+-----------------------------+
| 100 |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(2,0)); -- 2 digits of precision is not enough!
+-----------------------------+
| cast('100' as decimal(2,0)) |
+-----------------------------+
| NULL |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,1)); -- (3,1) = 2 digits left of the decimal point, 1 to the right. Not enough.
+-----------------------------+
| cast('100' as decimal(3,1)) |
+-----------------------------+
| NULL |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(4,1)); -- 4 digits total, 1 to the right of the decimal point.
+-----------------------------+
| cast('100' as decimal(4,1)) |
+-----------------------------+
| 100.0 |
+-----------------------------+
[localhost:21000] > select cast('98.6' as decimal(3,1)); -- (3,1) can hold a 3 digit number with 1 fractional digit.
+------------------------------+
| cast('98.6' as decimal(3,1)) |
+------------------------------+
| 98.6 |
+------------------------------+
[localhost:21000] > select cast('98.6' as decimal(15,1)); -- Larger precision allows bigger numbers but still only 1 fractional digit.
+-------------------------------+
| cast('98.6' as decimal(15,1)) |
+-------------------------------+
| 98.6 |
+-------------------------------+
[localhost:21000] > select cast('98.6' as decimal(15,5)); -- Larger scale allows more fractional digits, outputs trailing zeros.
+-------------------------------+
| cast('98.6' as decimal(15,5)) |
+-------------------------------+
| 98.60000 |
+-------------------------------+
[localhost:21000] > select cast('98.60000' as decimal(15,1)); -- Trailing zeros in the string must fit within 'scale' digits (1 in this case).
+-----------------------------------+
| cast('98.60000' as decimal(15,1)) |
+-----------------------------------+
| NULL |
+-----------------------------------+
• Most built-in arithmetic functions such as SIN() and COS() continue to accept only DOUBLE values because they
are so commonly used in scientific contexts for calculations of IEEE 754-compliant values. The built-in functions
that accept and return DECIMAL are:
– ABS()
– CEIL()
– COALESCE()
– FLOOR()
– FNV_HASH()
– GREATEST()
– IF()
– ISNULL()
– LEAST()
– NEGATIVE()
– NULLIF()
– POSITIVE()
– PRECISION()
– ROUND()
– SCALE()
– TRUNCATE()
– ZEROIFNULL()
[localhost:21000] > select cast(10 as decimal(1,1));
+--------------------------+
| cast(10 as decimal(1,1)) |
+--------------------------+
| 10.0 |
+--------------------------+
[localhost:21000] > select cast(100 as decimal(1,1));
+---------------------------+
| cast(100 as decimal(1,1)) |
+---------------------------+
| 100.0 |
+---------------------------+
[localhost:21000] > select cast(1000 as decimal(1,1));
+----------------------------+
| cast(1000 as decimal(1,1)) |
+----------------------------+
| 1000.0 |
+----------------------------+
• When a DECIMAL value is converted to any of the integer types, any fractional part is truncated (that is, rounded
towards zero); see the sketch following this list.
• You cannot directly cast TIMESTAMP or BOOLEAN values to or from DECIMAL values. You can turn a DECIMAL
value into a time-related representation using a two-step process, by converting it to an integer value and then
using that result in a call to a date and time function such as from_unixtime().
• Because values in INSERT statements are checked rigorously for type compatibility, be prepared to use CAST()
function calls around literals, column references, or other expressions that you are inserting into a DECIMAL
column.
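The following sketch illustrates the truncation and INSERT points from the list above (the table name is
hypothetical):

SELECT CAST(CAST('98.6' AS DECIMAL(4,1)) AS INT);  -- Returns 98.
SELECT CAST(CAST('-98.6' AS DECIMAL(4,1)) AS INT); -- Returns -98, truncated towards zero.
-- Assuming a table t1 with a column d DECIMAL(5,2):
INSERT INTO t1 VALUES (CAST(99.999 AS DECIMAL(5,2)));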
NULL considerations: Casting any non-numeric value to this type produces a NULL value.
DECIMAL differences from integer and floating-point types:
With the DECIMAL type, you are concerned with the number of overall digits of a number rather than powers of 2 (as
in TINYINT, SMALLINT, and so on). Therefore, the limits with integral values of DECIMAL types fall around 99, 999,
9999, and so on rather than 32767, 65535, 2**32-1, and so on. For fractional values, you do not need to account for
imprecise representation of the fractional part according to the IEEE 754 standard (as in FLOAT and DOUBLE). Therefore,
when you insert a fractional value into a DECIMAL column, you can compare, sum, query, GROUP BY, and so on that
column and get back the original values rather than some “close but not identical” value.
FLOAT and DOUBLE can cause problems or unexpected behavior due to inability to precisely represent certain fractional
values, for example dollar and cents values for currency. You might find output values slightly different than you
inserted, equality tests that do not match precisely, or unexpected values for GROUP BY columns. DECIMAL can help
reduce unexpected behavior and rounding errors, at the expense of some performance overhead for assignments and
comparisons.
Literals and expressions:
• When you use an integer literal such as 1 or 999 in a SQL statement, depending on the context, Impala will treat
it as either the smallest appropriate DECIMAL type, or the smallest integer type (TINYINT, SMALLINT, INT, or
BIGINT). To minimize memory usage, Impala prefers to treat the literal as the smallest appropriate integer type.
• When you use a floating-point literal such as 1.1 or 999.44 in a SQL statement, depending on the context, Impala
will treat it as either the smallest appropriate DECIMAL type, or the smallest floating-point type (FLOAT or DOUBLE).
To avoid loss of accuracy, Impala prefers to treat the literal as a DECIMAL.
Storage considerations:
• Only the precision determines the storage size for DECIMAL values; the scale setting has no effect on the storage
size.
• Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these text-based file formats, leading zeros
are not stored, but trailing zeros are stored. In these tables, each DECIMAL value takes up as many bytes as there
are digits in the value, plus an extra byte if the decimal point is present and an extra byte for negative values.
Once the values are loaded into memory, they are represented in 4, 8, or 16 bytes as described in the following
list items. The on-disk representation varies depending on the file format of the table.
• Parquet and Avro tables use binary formats. In these tables, Impala stores each value in as few bytes as possible
depending on the precision specified for the DECIMAL column:
– In memory, DECIMAL values with precision of 9 or less are stored in 4 bytes.
– In memory, DECIMAL values with precision of 10 through 18 are stored in 8 bytes.
– In memory, DECIMAL values with precision greater than 18 are stored in 16 bytes.
UDF considerations: When writing a C++ UDF, use the DecimalVal data type defined in
/usr/include/impala_udf/udf.h.
Partitioning:
You can use a DECIMAL column as a partition key. Doing so provides a better match between the partition key values
and the HDFS directory names than using a DOUBLE or FLOAT partitioning column:
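A minimal sketch, using hypothetical table and column names:

CREATE TABLE sales_by_rate (id BIGINT)
  PARTITIONED BY (tax_rate DECIMAL(5,2)) STORED AS PARQUET;
INSERT INTO sales_by_rate PARTITION (tax_rate = 6.75) VALUES (1);
-- The corresponding HDFS directory is named tax_rate=6.75, with no
-- floating-point representation artifacts in the name.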
Schema evolution considerations:
• For text-based formats (text, RCFile, and SequenceFile tables), you can issue an ALTER TABLE ... REPLACE
COLUMNS statement to change the precision and scale of an existing DECIMAL column. As long as the values in
the column fit within the new precision and scale, they are returned correctly by a query. Any values that do not
fit within the new precision and scale are returned as NULL, and Impala reports the conversion error. Leading
zeros do not count against the precision value, but trailing zeros after the decimal point do.
• For binary formats (Parquet and Avro tables), although an ALTER TABLE ... REPLACE COLUMNS statement
that changes the precision or scale of a DECIMAL column succeeds, any subsequent attempt to query the changed
column results in a fatal error. (The other columns can still be queried successfully.) This is because the metadata
about the columns is stored in the data files themselves, and ALTER TABLE does not actually make any updates
to the data files. If the metadata in the data files disagrees with the metadata in the metastore database, Impala
cancels the query.
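For example, a hedged sketch of the text-format case from the first bullet above (table and column names are
hypothetical):

CREATE TABLE text_decimals (d DECIMAL(5,2)) STORED AS TEXTFILE;
ALTER TABLE text_decimals REPLACE COLUMNS (d DECIMAL(7,3));
-- Values that fit within DECIMAL(7,3) are returned correctly by subsequent
-- queries; any that do not fit are returned as NULL.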
Examples:
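A minimal sketch of declaring and populating DECIMAL columns with various precision and scale:

CREATE TABLE decimal_demo (x DECIMAL, y DECIMAL(5,2), z DECIMAL(25,0));
INSERT INTO decimal_demo VALUES (5, 99.44, 123456);
SELECT x, y, z, PRECISION(y), SCALE(y) FROM decimal_demo;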
Restrictions:
Currently, the COMPUTE STATS statement under CDH 4 does not store any statistics for DECIMAL columns. When
Impala runs under CDH 5, which has better support for DECIMAL in the metastore database, COMPUTE STATS does
collect statistics for DECIMAL columns and Impala uses the statistics to optimize query performance.
HBase considerations: This data type is fully compatible with HBase tables.
Parquet considerations: This type is fully compatible with Parquet tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Related information:
Numeric Literals on page 184, TINYINT Data Type on page 153, SMALLINT Data Type on page 140, INT Data Type on page
135, BIGINT Data Type on page 120, DECIMAL Data Type (CDH 5.1 or higher only) on page 125, Impala Mathematical
Functions on page 339 (especially PRECISION() and SCALE())
column_name DOUBLE
Partitioning: Because fractional values of this type are not always represented precisely, when this type is used for a
partition key column, the underlying HDFS directories might not be named exactly as you expect. Prefer to partition
on a DECIMAL column instead.
HBase considerations: This data type is fully compatible with HBase tables.
Parquet considerations: This type is fully compatible with Parquet tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as an 8-byte value.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Restrictions:
Due to the way arithmetic on FLOAT and DOUBLE columns uses high-performance hardware instructions, and distributed
queries can perform these operations in different order for each query, results can vary slightly for aggregate function
calls such as SUM() and AVG() for FLOAT and DOUBLE columns, particularly on large data sets where millions or billions
of values are summed or averaged. For perfect consistency and repeatability, use the DECIMAL data type for such
operations instead of FLOAT or DOUBLE.
The inability to exactly represent certain floating-point values means that DECIMAL is sometimes a better choice than
DOUBLE or FLOAT when precision is critical, particularly when transferring data from other database systems that use
different representations or file formats.
Related information:
Numeric Literals on page 184, Impala Mathematical Functions on page 339, FLOAT Data Type on page 134
column_name FLOAT
Partitioning: Because fractional values of this type are not always represented precisely, when this type is used for a
partition key column, the underlying HDFS directories might not be named exactly as you expect. Prefer to partition
on a DECIMAL column instead.
HBase considerations: This data type is fully compatible with HBase tables.
Parquet considerations: This type is fully compatible with Parquet tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as a 4-byte value.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Restrictions:
Due to the way arithmetic on FLOAT and DOUBLE columns uses high-performance hardware instructions, and distributed
queries can perform these operations in different order for each query, results can vary slightly for aggregate function
calls such as SUM() and AVG() for FLOAT and DOUBLE columns, particularly on large data sets where millions or billions
of values are summed or averaged. For perfect consistency and repeatability, use the DECIMAL data type for such
operations instead of FLOAT or DOUBLE.
The inability to exactly represent certain floating-point values means that DECIMAL is sometimes a better choice than
DOUBLE or FLOAT when precision is critical, particularly when transferring data from other database systems that use
different representations or file formats.
Related information:
Numeric Literals on page 184, Impala Mathematical Functions on page 339, DOUBLE Data Type on page 133
column_name INT
Partitioning: Prefer to use this type for a partition key column. Impala can process the numeric type more efficiently
than a STRING representation of the value.
HBase considerations: This data type is fully compatible with HBase tables.
Parquet considerations:
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as a 4-byte value.
Added in: Available in all versions of Impala.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Related information:
Numeric Literals on page 184, TINYINT Data Type on page 153, SMALLINT Data Type on page 140, INT Data Type on page
135, BIGINT Data Type on page 120, DECIMAL Data Type (CDH 5.1 or higher only) on page 125, Impala Mathematical
Functions on page 339
Usage notes:
Because complex types are often used in combination, for example an ARRAY of STRUCT elements, if you are unfamiliar
with the Impala complex types, start with Complex Types (CDH 5.5 or higher only) on page 156 for background information
and usage examples.
The MAP complex data type represents a set of key-value pairs. Each element of the map is indexed by a primitive type
such as BIGINT or STRING, letting you define sequences that are not continuous or categories with arbitrary names.
You might find it convenient for modelling data produced in other languages, such as a Python dictionary or Java
HashMap, where a single scalar value serves as the lookup key.
In a big data context, the keys in a map column might represent a numeric sequence of events during a manufacturing
process, or TIMESTAMP values corresponding to sensor observations. The map itself is inherently unordered, so you
choose whether to make the key values significant (such as a recorded TIMESTAMP) or synthetic (such as a randomly
generated universally unique ID).
Note: Behind the scenes, the MAP type is implemented in a similar way as the ARRAY type. Impala
does not enforce any uniqueness constraint on the KEY values, and the KEY values are processed by
looping through the elements of the MAP rather than by a constant-time lookup. Therefore, this type
is primarily for ease of understanding when importing data and algorithms from non-SQL contexts,
rather than optimizing the performance of key lookups.
You can pass a qualified name to DESCRIBE to specify an ARRAY, STRUCT, or MAP column and visualize its structure
as if it were a table. An ARRAY is shown as a two-column table, with ITEM and POS columns. A STRUCT is shown as a
table with each field representing a column in the table. A MAP is shown as a two-column table, with KEY and VALUE
columns.
Added in: CDH 5.5.0 (Impala 2.3.0)
Restrictions:
• Columns with this data type can only be used in tables or partitions with the Parquet file format.
• Columns with this data type cannot be used as partition key columns in a partitioned table.
• The COMPUTE STATS statement does not produce any statistics for columns of this data type.
• The maximum length of the column definition for any complex type, including declarations for any nested types,
is 4000 characters.
• See Limitations and Restrictions for Complex Types on page 161 for a full list of limitations and associated guidelines
about complex type columns.
Examples:
Note: Many of the complex type examples refer to tables such as CUSTOMER and REGION adapted
from the tables used in the TPC-H benchmark. See Sample Schema and Data for Experimenting with
Impala Complex Types on page 178 for the table definitions.
The following example shows a table with various kinds of MAP columns, both at the top level and nested within other
complex types. Each row represents information about a specific country, with complex type fields of various levels
of nesting to represent different information associated with the country: factual measurements such as area and
population, notable people in different categories, geographic features such as cities, points of interest within each
city, and mountains with associated facts. Practice the CREATE TABLE and query notation for complex type columns
using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements
reliably.
CREATE TABLE map_demo
(
  country_id BIGINT,
  metrics MAP < STRING, BIGINT >,
  notables MAP < STRING, ARRAY < STRING > >,
  cities ARRAY < STRUCT < name: STRING, points_of_interest: MAP < STRING, ARRAY < STRING > > > >,
-- MAP that is an element within an ARRAY. The MAP is inside a STRUCT field to associate
-- the mountain name with all the facts about the mountain.
-- The "key" of the map (the first STRING field) represents the name of some fact whose value
-- can be expressed as an integer, such as 'Height', 'Year First Climbed', and so on.
  mountains ARRAY < STRUCT < name: STRING, facts: MAP < STRING, INT > > >
)
STORED AS PARQUET;
DESCRIBE map_demo;
+------------+------------------------------------------------+
| name | type |
+------------+------------------------------------------------+
| country_id | bigint |
| metrics | map<string,bigint> |
| notables | map<string,array<string>> |
| cities | array<struct< |
| | name:string, |
| | points_of_interest:map<string,array<string>> |
| | >> |
| mountains | array<struct< |
| | name:string, |
| | facts:map<string,int> |
| | >> |
+------------+------------------------------------------------+
DESCRIBE map_demo.metrics;
+-------+--------+
| name | type |
+-------+--------+
| key | string |
| value | bigint |
+-------+--------+
DESCRIBE map_demo.notables;
+-------+---------------+
| name | type |
+-------+---------------+
| key | string |
| value | array<string> |
+-------+---------------+
DESCRIBE map_demo.notables.value;
+------+--------+
| name | type |
+------+--------+
| item | string |
| pos | bigint |
+------+--------+
DESCRIBE map_demo.cities;
+------+------------------------------------------------+
| name | type |
+------+------------------------------------------------+
| item | struct< |
| | name:string, |
| | points_of_interest:map<string,array<string>> |
| | > |
| pos | bigint |
+------+------------------------------------------------+
DESCRIBE map_demo.cities.item.points_of_interest;
+-------+---------------+
| name | type |
+-------+---------------+
| key | string |
| value | array<string> |
+-------+---------------+
DESCRIBE map_demo.cities.item.points_of_interest.value;
+------+--------+
| name | type |
+------+--------+
| item | string |
| pos | bigint |
+------+--------+
DESCRIBE map_demo.mountains;
+------+-------------------------+
| name | type |
+------+-------------------------+
| item | struct< |
| | name:string, |
| | facts:map<string,int> |
| | > |
| pos | bigint |
+------+-------------------------+
DESCRIBE map_demo.mountains.item.facts;
+-------+--------+
| name | type |
+-------+--------+
| key | string |
| value | int |
+-------+--------+
The following example shows a table that uses a variety of data types for the MAP “key” field. Typically, you use BIGINT
or STRING to use numeric or character-based keys without worrying about exceeding any size or length constraints.
CREATE TABLE celebrities (name STRING, birth_year MAP < STRING, SMALLINT >) STORED AS PARQUET;
-- A typical row might represent values with 2 different birth years, such as:
-- ("Joe Movie Star", { "real": 1972, "claimed": 1977 })

CREATE TABLE countries (name STRING, famous_leaders MAP < INT, STRING >) STORED AS PARQUET;
-- A typical row might represent values with different leaders, with key values
-- corresponding to their numeric sequence, such as:
-- ("United States", { 1: "George Washington", 3: "Thomas Jefferson", 16: "Abraham Lincoln" })

CREATE TABLE airlines (name STRING, special_meals MAP < STRING, MAP < STRING, STRING > >) STORED AS PARQUET;
-- A typical row might represent values with multiple kinds of meals, each with several components:
-- ("Elegant Airlines",
--   {
--     "vegetarian": { "breakfast": "pancakes", "snack": "cookies", "dinner": "rice pilaf" },
--     "gluten free": { "breakfast": "oatmeal", "snack": "fruit", "dinner": "chicken" }
--   } )
Related information:
Complex Types (CDH 5.5 or higher only) on page 156, ARRAY Complex Type (CDH 5.5 or higher only) on page 117, STRUCT
Complex Type (CDH 5.5 or higher only) on page 142
These examples show how you can use the type names REAL and DOUBLE interchangeably; behind the scenes,
Impala always treats REAL as an alias for DOUBLE.
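A minimal sketch (the table name is hypothetical):

CREATE TABLE temperature (observation REAL);
DESCRIBE temperature; -- The observation column is reported with type double.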
column_name SMALLINT
Parquet considerations:
Physically, Parquet files represent TINYINT and SMALLINT values as 32-bit integers. Although Impala rejects attempts
to insert out-of-range values into such columns, if you create a new table with the CREATE TABLE ... LIKE PARQUET
syntax, any TINYINT or SMALLINT columns in the original table turn into INT columns in the new table.
Partitioning: Prefer to use this type for a partition key column. Impala can process the numeric type more efficiently
than a STRING representation of the value.
HBase considerations: This data type is fully compatible with HBase tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as a 2-byte value.
column_name STRING
Length: Maximum of 32,767 bytes. Do not use any length constraint when declaring STRING columns, as you might
be familiar with from VARCHAR, CHAR, or similar column types from relational database systems. If you do need to
manipulate string values with precise or maximum lengths, in Impala 2.0 and higher you can declare columns as
VARCHAR(max_length) or CHAR(length), but for best performance use STRING where practical.
Character sets: For full support in all Impala subsystems, restrict string values to the ASCII character set. Although
some UTF-8 character data can be stored in Impala and retrieved through queries, UTF-8 strings containing non-ASCII
characters are not guaranteed to work properly in combination with many SQL aspects, including but not limited to:
• String manipulation functions.
• Comparison operators.
• The ORDER BY clause.
• Values in partition key columns.
For any national language aspects such as collation order or interpreting extended ASCII variants such as ISO-8859-1
or ISO-8859-2 encodings, Impala does not include such metadata with the table definition. If you need to sort, manipulate,
or display data depending on those national language characteristics of string data, use logic on the application side.
Conversions:
• Impala does not automatically convert STRING to any numeric type. Impala does automatically convert STRING
to TIMESTAMP if the value matches one of the accepted TIMESTAMP formats; see TIMESTAMP Data Type on page
148 for details.
• You can use CAST() to convert STRING values to TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, or
TIMESTAMP.
• You cannot directly cast a STRING value to BOOLEAN. You can use a CASE expression to evaluate string values
such as 'T', 'true', and so on and return Boolean true and false values as appropriate.
• You can cast a BOOLEAN value to STRING, returning '1' for true values and '0' for false values.
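For example, a hedged sketch of the CASE technique from the list above, assuming a hypothetical table FLAGS
with a STRING column S:

SELECT CASE
         WHEN upper(s) IN ('T', 'TRUE', 'Y', 'YES') THEN true
         WHEN upper(s) IN ('F', 'FALSE', 'N', 'NO') THEN false
         ELSE NULL
       END AS b
FROM flags;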
Partitioning:
Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers,
for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although
the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key
columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on
are declared as INT, SMALLINT, and so on.
Zero-length strings: For purposes of clauses such as DISTINCT and GROUP BY, Impala considers zero-length strings
(""), NULL, and space to all be different values.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Avro considerations:
The Avro specification allows string values up to 2**64 bytes in length. Impala queries for Avro tables use 32-bit integers
to hold string lengths. In CDH 5.7 / Impala 2.5 and higher, Impala truncates CHAR and VARCHAR values in Avro tables
to (2**31)-1 bytes. If a query encounters a STRING value longer than (2**31)-1 bytes in an Avro table, the query fails.
In earlier releases, encountering such long values in an Avro table could cause a crash.
Column statistics considerations: Because the values of this type have variable size, none of the column statistics fields
are filled in until you run the COMPUTE STATS statement.
Examples:
The following examples demonstrate double-quoted and single-quoted string literals, and required escaping for
quotation marks within string literals:
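A minimal sketch of the quoting styles and escapes:

SELECT 'Single-quoted literal';
SELECT "Double-quoted literal";
SELECT 'It\'s escaped with a backslash inside single quotes';
SELECT "It's fine unescaped inside double quotes";
SELECT "An escaped double quote: \" inside a double-quoted literal";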
The following examples demonstrate calls to string manipulation functions to concatenate strings, convert numbers
to strings, or pull out substrings:
SELECT CONCAT("Once upon a time, there were ", CAST(3 AS STRING), ' little pigs.');
SELECT SUBSTR("hello world",7,5);
The following examples show how to perform operations on STRING columns within a table:
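A hedged sketch, using a hypothetical table:

CREATE TABLE string_demo (s STRING);
INSERT INTO string_demo VALUES ('hello'), ('world');
SELECT s, upper(s), length(s), concat(s, '!') FROM string_demo;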
Related information:
String Literals on page 185, CHAR Data Type (CDH 5.2 or higher only) on page 122, VARCHAR Data Type (CDH 5.2 or
higher only) on page 154, Impala String Functions on page 394, Impala Date and Time Functions on page 363
The names and number of fields within the STRUCT are fixed. Each field can be a different type. A field within a STRUCT
can also be another STRUCT, or an ARRAY or a MAP, allowing you to create nested data structures with a maximum
nesting depth of 100.
A STRUCT can be the top-level type for a column, or can itself be an item within an ARRAY or the value part of the
key-value pair in a MAP.
When a STRUCT is used as an ARRAY element or a MAP value, you use a join clause to bring the ARRAY or MAP elements
into the result set, and then refer to array_name.ITEM.field or map_name.VALUE.field. In the case of a STRUCT
directly inside an ARRAY or MAP, you can omit the .ITEM and .VALUE pseudocolumns and refer directly to
array_name.field or map_name.field.
Usage notes:
Because complex types are often used in combination, for example an ARRAY of STRUCT elements, if you are unfamiliar
with the Impala complex types, start with Complex Types (CDH 5.5 or higher only) on page 156 for background information
and usage examples.
A STRUCT is similar conceptually to a table row: it contains a fixed number of named fields, each with a predefined
type. To combine two related tables, while using complex types to minimize repetition, the typical way to represent
that data is as an ARRAY of STRUCT elements.
Because a STRUCT has a fixed number of named fields, it typically does not make sense to have a STRUCT as the type
of a table column. In such a case, you could just make each field of the STRUCT into a separate column of the table.
The STRUCT type is most useful as an item of an ARRAY or the value part of the key-value pair in a MAP. A nested type
column with a STRUCT at the lowest level lets you associate a variable number of row-like objects with each row of
the table.
The STRUCT type is straightforward to reference within a query. You do not need to include the STRUCT column in a
join clause or give it a table alias, as is required for the ARRAY and MAP types. You refer to the individual fields using
dot notation, such as struct_column_name.field_name, without any pseudocolumn such as ITEM or VALUE.
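For example, a minimal sketch using the STRUCT_DEMO table described below:

SELECT id, name, employee_info.employer, employee_info.address FROM struct_demo;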
You can pass a qualified name to DESCRIBE to specify an ARRAY, STRUCT, or MAP column and visualize its structure
as if it were a table. An ARRAY is shown as a two-column table, with ITEM and POS columns. A STRUCT is shown as a
table with each field representing a column in the table. A MAP is shown as a two-column table, with KEY and VALUE
columns.
Internal details:
Within the Parquet data file, the values for each STRUCT field are stored adjacent to each other, so that they can be
encoded and compressed using all the Parquet techniques for storing sets of similar or repeated values. The adjacency
applies even when the STRUCT values are part of an ARRAY or MAP. During a query, Impala avoids unnecessary I/O by
reading only the portions of the Parquet data file containing the requested STRUCT fields.
Added in: CDH 5.5.0 (Impala 2.3.0)
Restrictions:
• Columns with this data type can only be used in tables or partitions with the Parquet file format.
• Columns with this data type cannot be used as partition key columns in a partitioned table.
• The COMPUTE STATS statement does not produce any statistics for columns of this data type.
• The maximum length of the column definition for any complex type, including declarations for any nested types,
is 4000 characters.
• See Limitations and Restrictions for Complex Types on page 161 for a full list of limitations and associated guidelines
about complex type columns.
Examples:
Note: Many of the complex type examples refer to tables such as CUSTOMER and REGION adapted
from the tables used in the TPC-H benchmark. See Sample Schema and Data for Experimenting with
Impala Complex Types on page 178 for the table definitions.
The following example shows a table with various kinds of STRUCT columns, both at the top level and nested within
other complex types. Practice the CREATE TABLE and query notation for complex type columns using empty tables,
until you can visualize a complex data structure and construct corresponding SQL statements reliably.
The following example shows how to examine the structure of a table containing one or more STRUCT columns by
using the DESCRIBE statement. You can visualize each STRUCT as its own table, with columns named the same as
each field of the STRUCT. If the STRUCT is nested inside another complex type, such as ARRAY, you can extend the
qualified name passed to DESCRIBE until the output shows just the STRUCT fields.
DESCRIBE struct_demo;
+-------------------+--------------------------+
| name | type |
+-------------------+--------------------------+
| id | bigint |
| name | string |
| employee_info | struct< |
| | employer:string, |
| | id:bigint, |
| | address:string |
| | > |
| places_lived | array<struct< |
| | street:string, |
| | city:string, |
| | country:string |
| | >> |
| memorable_moments | map<string,struct< |
| | year:int, |
| | place:string, |
| | details:string |
| | >> |
| current_address | struct< |
| | street_address:struct< |
| | street_number:int, |
| | street_name:string, |
| | street_type:string |
| | >, |
| | country:string, |
| | postal_code:string |
| | > |
+-------------------+--------------------------+
The top-level column EMPLOYEE_INFO is a STRUCT. Describing table_name.struct_name displays the fields of the
STRUCT as if they were columns of a table:
DESCRIBE struct_demo.employee_info;
+----------+--------+
| name | type |
+----------+--------+
| employer | string |
| id | bigint |
| address | string |
+----------+--------+
Because PLACES_LIVED is a STRUCT inside an ARRAY, the initial DESCRIBE shows the structure of the ARRAY:
DESCRIBE struct_demo.places_lived;
+------+------------------+
| name | type |
+------+------------------+
| item | struct< |
| | street:string, |
| | city:string, |
| | country:string |
| | > |
| pos | bigint |
+------+------------------+
Ask for the details of the ITEM field of the ARRAY to see just the layout of the STRUCT:
DESCRIBE struct_demo.places_lived.item;
+---------+--------+
| name | type |
+---------+--------+
| street | string |
| city | string |
| country | string |
+---------+--------+
Likewise, MEMORABLE_MOMENTS has a STRUCT inside a MAP, which requires an extra level of qualified name to see just
the STRUCT part:
DESCRIBE struct_demo.memorable_moments;
+-------+------------------+
| name | type |
+-------+------------------+
| key | string |
| value | struct< |
| | year:int, |
| | place:string, |
| | details:string |
| | > |
+-------+------------------+
For a MAP, ask to see the VALUE field to see the corresponding STRUCT fields in a table-like structure:
DESCRIBE struct_demo.memorable_moments.value;
+---------+--------+
| name | type |
+---------+--------+
| year | int |
| place | string |
| details | string |
+---------+--------+
For a STRUCT inside a STRUCT, we can see the fields of the outer STRUCT:
DESCRIBE struct_demo.current_address;
+----------------+-----------------------+
| name | type |
+----------------+-----------------------+
| street_address | struct< |
| | street_number:int, |
| | street_name:string, |
| | street_type:string |
| | > |
| country | string |
| postal_code | string |
+----------------+-----------------------+
Then we can use a further qualified name to see just the fields of the inner STRUCT:
DESCRIBE struct_demo.current_address.street_address;
+---------------+--------+
| name | type |
+---------------+--------+
| street_number | int |
| street_name | string |
| street_type | string |
+---------------+--------+
The following query flattens the PLACES_LIVED array, referring to each STRUCT field with dot notation:

SELECT places_lived.street, places_lived.city, places_lived.country
FROM struct_demo, struct_demo.places_lived;
For example, this table uses a struct that encodes several data values for each phone number associated with a person.
Each person can have a variable-length array of associated phone numbers, and queries can refer to the category field
to locate specific home, work, mobile, and so on kinds of phone numbers.
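A hedged sketch of such a table (the names are hypothetical):

CREATE TABLE contacts_phones
(
  id BIGINT,
  phone_numbers ARRAY < STRUCT < category: STRING, number: STRING > >
)
STORED AS PARQUET;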
Because structs are naturally suited to composite values where the fields have different data types, you might use
them to decompose things such as addresses:
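A hedged sketch of an address decomposed into STRUCT fields (the names are hypothetical):

CREATE TABLE contacts_addresses
(
  id BIGINT,
  address STRUCT < house_number: INT, street_name: STRING, city: STRING, postal_code: STRING >
)
STORED AS PARQUET;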
In a big data context, splitting out data fields such as the number part of the address and the street name could let
you do analysis on each field independently. For example, which streets have the largest number range of addresses,
what are the statistical properties of the street names, which areas have a higher proportion of “Roads”, “Courts” or
“Boulevards”, and so on.
Related information:
Complex Types (CDH 5.5 or higher only) on page 156, ARRAY Complex Type (CDH 5.5 or higher only) on page 117, MAP
Complex Type (CDH 5.5 or higher only) on page 136
column_name TIMESTAMP
Range: Allowed date values range from 1400-01-01 to 9999-12-31; this range is different from the Hive TIMESTAMP
type. Internally, the resolution of the time portion of a TIMESTAMP value is in nanoseconds.
INTERVAL expressions:
You can perform date arithmetic by adding or subtracting a specified number of time units, using the INTERVAL
keyword and the + and - operators or date_add() and date_sub() functions. You can specify units as YEAR[S],
MONTH[S], WEEK[S], DAY[S], HOUR[S], MINUTE[S], SECOND[S], MILLISECOND[S], MICROSECOND[S], and
NANOSECOND[S]. You can only specify one time unit in each interval expression, for example INTERVAL 3 DAYS or
INTERVAL 25 HOURS, but you can produce any granularity by adding together successive INTERVAL values, such as
timestamp_value + INTERVAL 3 WEEKS - INTERVAL 1 DAY + INTERVAL 10 MICROSECONDS.
For example:
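-- A few representative expressions (a hedged sketch; NOW() output varies):
SELECT NOW() + INTERVAL 1 DAY;
SELECT NOW() - INTERVAL 2 WEEKS;
SELECT CAST('1966-07-30' AS TIMESTAMP) + INTERVAL 5 YEARS + INTERVAL 3 DAYS;
SELECT DATE_ADD(NOW(), INTERVAL 6 MONTHS);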
Time zones:
By default, Impala does not store timestamps using the local timezone, to avoid undesired results from unexpected
time zone issues. Timestamps are stored and interpreted relative to UTC, both when written to or read from data files
and when converted to or from Unix time values through functions such as from_unixtime() or unix_timestamp().
To convert such a TIMESTAMP value to one that represents the date and time in a specific time zone, convert the
original value with the from_utc_timestamp() function.
Because Impala does not assume that TIMESTAMP values are in any particular time zone, you must be conscious of
the time zone aspects of data that you query, insert, or convert.
For consistency with Unix system calls, the TIMESTAMP returned by the now() function represents the local time in
the system time zone, rather than in UTC. To store values relative to the current time in a portable way, convert any
now() return values using the to_utc_timestamp() function first. For example, the following example shows that
the current time in California (where this Impala cluster is located) is shortly after 2 PM. If that value was written to a
data file, and shipped off to a distant server to be analyzed alongside other data from far-flung locations, the dates
and times would not match up precisely because of time zone differences. Therefore, the to_utc_timestamp()
function converts it using a common reference point, the UTC time zone (descended from the old Greenwich Mean
Time standard). The 'PDT' argument indicates that the original value is from the Pacific time zone with Daylight Saving
Time in effect. When servers in all geographic locations run the same transformation on any local date and time values
(with the appropriate time zone argument), the stored data uses a consistent representation. Impala queries can use
functions such as EXTRACT(), MIN(), AVG(), and so on to do time-series analysis on those timestamps.
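A hedged sketch of the conversion just described (output values are illustrative):

SELECT NOW() AS california_time, TO_UTC_TIMESTAMP(NOW(), 'PDT') AS utc_time;
-- california_time: 2015-04-09 14:07:46 (illustrative)
-- utc_time:        2015-04-09 21:07:46 (PDT is UTC-7)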
The converse function, from_utc_timestamp(), lets you take stored TIMESTAMP data or calculated results and
convert back to local date and time for processing on the application side. The following example shows how you might
represent some future date (such as the ending date and time of an auction) in UTC, and then convert back to local
time when convenient for reporting or other processing. The final query in the example tests whether this arbitrary
UTC date and time has passed yet, by converting it back to the local time zone and comparing it against the current
date and time.
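A hedged sketch (the auction end time is a made-up value):

SELECT FROM_UTC_TIMESTAMP(CAST('2015-06-01 01:00:00' AS TIMESTAMP), 'PDT') AS auction_end_pdt;
SELECT FROM_UTC_TIMESTAMP(CAST('2015-06-01 01:00:00' AS TIMESTAMP), 'PDT') < NOW() AS already_ended;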
If you have data files written by Hive, those TIMESTAMP values represent the local timezone of the host where the
data was written, potentially leading to inconsistent results when processed by Impala. To avoid compatibility problems
or having to code workarounds, you can specify one or both of these impalad startup flags:
-use_local_tz_for_unix_timestamp_conversions=true
-convert_legacy_hive_parquet_utc_timestamps=true. Although
-convert_legacy_hive_parquet_utc_timestamps is turned off by default to avoid performance overhead,
Cloudera recommends turning it on when processing TIMESTAMP columns in Parquet files written by Hive, to avoid
unexpected behavior.
The -use_local_tz_for_unix_timestamp_conversions setting affects conversions from TIMESTAMP to BIGINT,
or from BIGINT to TIMESTAMP. By default, Impala treats all TIMESTAMP values as UTC, to simplify analysis of time-series
data from different geographic regions. When you enable the -use_local_tz_for_unix_timestamp_conversions
setting, these operations treat the input values as if they are in the local time zone of the host doing the processing. See
Impala Date and Time Functions on page 363 for the list of functions affected by the
-use_local_tz_for_unix_timestamp_conversions setting.
The following sequence of examples shows how the interpretation of TIMESTAMP values in Parquet tables is affected
by the setting of the -convert_legacy_hive_parquet_utc_timestamps setting.
Regardless of the -convert_legacy_hive_parquet_utc_timestamps setting, TIMESTAMP columns in text tables
can be written and read interchangeably by Impala and Hive:
| 2015-04-08 15:43:02.892403000 |
+-------------------------------+
[localhost:21000] > select to_utc_timestamp(x, 'PDT') from t1;
+-------------------------------+
| to_utc_timestamp(x, 'pdt') |
+-------------------------------+
| 2015-04-07 22:43:02.892403000 |
| 2015-04-08 22:43:02.892403000 |
+-------------------------------+
When the table uses Parquet format, Impala expects any time zone adjustment to be applied prior to writing, while
TIMESTAMP values written by Hive are adjusted to be in the UTC time zone. When Hive queries Parquet data files that
it wrote, it adjusts the TIMESTAMP values back to the local time zone, while Impala does no conversion. Hive does no
time zone conversion when it queries Impala-written Parquet files.
The discrepancy arises when Impala queries the Hive-created Parquet table. The underlying values in the TIMESTAMP
column are different from the ones written by Impala, even though they were copied from one table to another by an
INSERT ... SELECT statement in Hive. Hive did an implicit conversion from the local time zone to UTC as it wrote
the values to Parquet.
Impala query for TIMESTAMP values from Impala-written and Hive-written data:
| 2015-04-07 15:43:02.892403000 |
| 2015-04-08 15:43:02.892403000 |
+-------------------------------+
Fetched 2 row(s) in 0.29s
[localhost:21000] > select * from h1;
+-------------------------------+
| x |
+-------------------------------+
| 2015-04-07 22:43:02.892403000 |
| 2015-04-08 22:43:02.892403000 |
+-------------------------------+
Fetched 2 row(s) in 0.41s
Conversions:
Impala automatically converts STRING literals of the correct format into TIMESTAMP values. Timestamp values are
accepted in the format "yyyy-MM-dd HH:mm:ss.SSSSSS", and can consist of just the date, or just the time, with
or without the fractional second portion. For example, you can specify TIMESTAMP values such as '1966-07-30',
'08:30:00', or '1985-09-25 17:45:30.005'. Casting an integer or floating-point value N to TIMESTAMP produces
a value that is N seconds past the start of the epoch date (January 1, 1970). By default, the result value represents a
date and time in the UTC time zone. If the setting --use_local_tz_for_unix_timestamp_conversions=true
is in effect, the resulting TIMESTAMP represents a date and time in the local time zone.
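For example, a minimal sketch of numeric-to-TIMESTAMP casts (results shown for the default UTC interpretation):

SELECT CAST(0 AS TIMESTAMP);       -- Returns 1970-01-01 00:00:00.
SELECT CAST(1000000 AS TIMESTAMP); -- Returns 1970-01-12 13:46:40.
SELECT CAST(1.5 AS TIMESTAMP);     -- Returns 1970-01-01 00:00:01.500000000.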
In Impala 1.3 and higher, the FROM_UNIXTIME() and UNIX_TIMESTAMP() functions allow a wider range of format
strings, with more flexibility in element order, repetition of letter placeholders, and separator characters. In CDH 5.5
/ Impala 2.3 and higher, the UNIX_TIMESTAMP() function also allows a numeric timezone offset to be specified as
part of the input string. See Impala Date and Time Functions on page 363 for details.
In Impala 2.2.0 and higher, built-in functions that accept or return integers representing TIMESTAMP values use the
BIGINT type for parameters and return values, rather than INT. This change lets the date and time functions avoid
an overflow error that would otherwise occur on January 19th, 2038 (known as the “Year 2038 problem” or “Y2K38
problem”). This change affects the from_unixtime() and unix_timestamp() functions. You might need to change
application code that interacts with these functions, change the types of columns that store the return values, or add
CAST() calls to SQL statements that call these functions.
Partitioning:
Although you cannot use a TIMESTAMP column as a partition key, you can extract the individual years, months, days,
hours, and so on and partition based on those columns. Because the partition key column values are represented in
HDFS directory names, rather than as fields in the data files themselves, you can also keep the original TIMESTAMP
values if desired, without duplicating data or wasting storage space. See Partition Key Columns on page 525 for more
details on partitioning with date and time values.
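A hedged sketch of this technique (table and column names are hypothetical):

CREATE TABLE events_by_day (ts TIMESTAMP, payload STRING)
  PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET;
INSERT INTO events_by_day PARTITION (year, month, day)
  SELECT ts, payload, YEAR(ts), MONTH(ts), DAY(ts) FROM raw_events;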
Examples:
NULL considerations: Casting any unrecognized STRING value to this type produces a NULL value.
Partitioning: Because this type potentially has so many distinct values, it is often not a sensible choice for a partition
key column. For example, events 1 millisecond apart would be stored in different partitions. Consider using the TRUNC()
function to condense the number of distinct values, and partition on a new column with the truncated values.
HBase considerations: This data type is fully compatible with HBase tables.
Parquet considerations: This type is fully compatible with Parquet tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as a 16-byte value.
Added in: Available in all versions of Impala.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Sqoop considerations:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE,
DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is
represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala
interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported
this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Restrictions:
If you cast a STRING with an unrecognized format to a TIMESTAMP, the result is NULL rather than an error. Make sure
to test your data pipeline to be sure any textual date and time values are in a format that Impala TIMESTAMP can
recognize.
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables,
as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the
UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the
EXTRACT() function.
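For example, with a hypothetical Avro table avro_events that stores the value in a STRING column event_time_str:

-- Workaround 1: keep a STRING representation and convert on the fly.
select cast(event_time_str as timestamp) from avro_events;
-- Workaround 2: derive a BIGINT epoch value with UNIX_TIMESTAMP().
select unix_timestamp(event_time_str) from avro_events;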
Related information:
• Timestamp Literals on page 187.
• To convert to or from different date formats, or perform date arithmetic, use the date and time functions described
in Impala Date and Time Functions on page 363. In particular, the from_unixtime() function requires a
case-sensitive format string such as "yyyy-MM-dd HH:mm:ss.SSSS", matching one of the allowed variations
of a TIMESTAMP value (date plus time, only date, only time, optional fractional seconds).
• See SQL Differences Between Impala and Hive on page 461 for details about differences in TIMESTAMP handling
between Impala and Hive.
column_name TINYINT
For a convenient and automated way to check the bounds of the TINYINT type, call the functions MIN_TINYINT()
and MAX_TINYINT().
If an integer value is too large to be represented as a TINYINT, use a SMALLINT instead.
NULL considerations: Casting any non-numeric value to this type produces a NULL value.
Examples:
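Illustrative queries for this type (a minimal sketch; the table definition is hypothetical):

-- Check the bounds of the type.
select min_tinyint(), max_tinyint();
-- Use TINYINT for a column whose values fit in a small range.
create table small_counters (id int, retry_count tinyint);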
Parquet considerations:
Physically, Parquet files represent TINYINT and SMALLINT values as 32-bit integers. Although Impala rejects attempts
to insert out-of-range values into such columns, if you create a new table with the CREATE TABLE ... LIKE PARQUET
syntax, any TINYINT or SMALLINT columns in the original table turn into INT columns in the new table.
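For instance (a sketch using a hypothetical HDFS path):

-- Any TINYINT or SMALLINT columns in the source schema
-- become INT columns in the new table.
create table t2 like parquet '/user/hive/warehouse/sample/data.parq';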
HBase considerations: This data type is fully compatible with HBase tables.
Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other
binary formats.
Internal details: Represented in memory as a 1-byte value.
Added in: Available in all versions of Impala.
Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always
filled in for column statistics, even before you run the COMPUTE STATS statement.
Related information:
Numeric Literals on page 184, TINYINT Data Type on page 153, SMALLINT Data Type on page 140, INT Data Type on page
135, BIGINT Data Type on page 120, DECIMAL Data Type (CDH 5.1 or higher only) on page 125, Impala Mathematical
Functions on page 339
column_name VARCHAR(max_length)
Text data files can contain values that are longer than allowed by the VARCHAR(n) length limit. Any extra trailing
characters are ignored when Impala processes those values during a query.
Avro considerations:
The Avro specification allows string values up to 2**64 bytes in length. Impala queries for Avro tables use 32-bit integers
to hold string lengths. In CDH 5.7 / Impala 2.5 and higher, Impala truncates CHAR and VARCHAR values in Avro tables
to (2**31)-1 bytes. If a query encounters a STRING value longer than (2**31)-1 bytes in an Avro table, the query fails.
In earlier releases, encountering such long values in an Avro table could cause a crash.
Schema evolution considerations:
You can use ALTER TABLE ... CHANGE to switch column data types to and from VARCHAR. You can convert from
STRING to VARCHAR(n), or from VARCHAR(n) to STRING, or from CHAR(n) to VARCHAR(n), or from VARCHAR(n)
to CHAR(n). When switching back and forth between VARCHAR and CHAR, you can also change the length value. This
schema evolution works the same for tables using any file format. If a table contains values longer than the maximum
length defined for a VARCHAR column, Impala does not return an error. Any extra trailing characters are ignored when
Impala processes those values during a query.
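For example (a hypothetical table t1 with a column s, stepping through the conversions described above):

alter table t1 change s s varchar(20);  -- from STRING to VARCHAR(20)
alter table t1 change s s char(10);     -- from VARCHAR to CHAR, changing the length
alter table t1 change s s varchar(35);  -- from CHAR back to VARCHAR, again with a new length
alter table t1 change s s string;       -- and from VARCHAR back to STRING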
Compatibility:
This type is available on CDH 5.2 or higher.
Internal details: Represented in memory as a byte array with the minimum size needed to represent each value.
Added in: CDH 5.2.0 (Impala 2.0.0)
Column statistics considerations: Because the values of this type have variable size, none of the column statistics fields
are filled in until you run the COMPUTE STATS statement.
Restrictions:
All data in CHAR and VARCHAR columns must be in a character encoding that is compatible with UTF-8. If you have
binary data from another database system (that is, a BLOB type), use a STRING column to hold it.
Examples:
The following examples show how long and short VARCHAR values are treated. Values longer than the maximum
specified length are truncated by CAST(), or when queried from existing data files. Values shorter than the maximum
specified length are represented as the actual length of the value, with no extra padding as seen with CHAR values.
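The varchar_20 table queried below could be set up along these lines (a hypothetical reconstruction of the setup statements):

create table varchar_20 (s varchar(20));
insert into varchar_20 values
  (cast('a' as varchar(20))), (cast('b' as varchar(20))),
  (cast('hello' as varchar(20))), (cast('world' as varchar(20)));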
[localhost:21000] > select * from varchar_20;
+-------+
| s |
+-------+
| a |
| b |
| hello |
| world |
+-------+
[localhost:21000] > select concat('[',s,']') as s from varchar_20;
+---------+
| s |
+---------+
| [a] |
| [b] |
| [hello] |
| [world] |
+---------+
The following example shows how identical VARCHAR values compare as equal, even if the columns are defined with
different maximum lengths. Both tables contain 'a' and 'b' values. The longer 'hello' and 'world' values from
the VARCHAR_20 table were truncated when inserted into the VARCHAR_1 table.
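A join along these lines demonstrates the behavior (a hypothetical sketch, assuming the varchar_1 table is populated as described):

select v1.s, v20.s
  from varchar_1 v1 join varchar_20 v20 on (v1.s = v20.s);
-- Returns the common values 'a' and 'b'; the truncated 'h' and 'w'
-- rows in varchar_1 do not match 'hello' and 'world'.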
The following examples show how VARCHAR values are freely interchangeable with STRING values in contexts such
as comparison operators and built-in functions:
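For instance (illustrative queries against the same varchar_20 table):

select s from varchar_20 where s = 'hello';  -- comparison with a STRING literal
select upper(s), length(s), concat('[', s, ']')
  from varchar_20;                           -- STRING functions applied directly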
UDF considerations: This type cannot be used for the argument or return type of a user-defined function (UDF) or
user-defined aggregate function (UDA).
Related information:
STRING Data Type on page 141, CHAR Data Type (CDH 5.2 or higher only) on page 122, String Literals on page 185, Impala
String Functions on page 394
Once you understand the basics of complex types, refer to the individual type topics when you need to refresh your
memory about syntax and examples:
• ARRAY Complex Type (CDH 5.5 or higher only) on page 117
Impala's complex type support produces result sets consisting entirely of scalar values, and the scalar components of complex
types can be used with all SQL clauses, such as GROUP BY, ORDER BY, all kinds of joins, subqueries, and inline views.
The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or
other programming languages to deconstruct the underlying data structures.
Note:
Although Impala can query complex types that are present in Parquet files, Impala currently cannot
create new Parquet files containing complex types. Therefore, the discussion and examples presume
that you are working with existing Parquet data produced through Hive, Spark, or some other source.
See Constructing Parquet Files with Complex Columns Using Hive on page 179 for examples of
constructing Parquet data files with complex type columns.
For learning purposes, you can create empty tables with complex type columns and practice query
syntax, even if you do not have sample data with the required structure.
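For example, such a practice table might be declared as follows (a hypothetical sketch):

create table practice_complex
(
  id bigint,
  tags array<string>,
  attributes map<string, string>,
  address struct<street: string, city: string, zip: int>
)
stored as parquet;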
Complex types represent a middle ground that addresses these performance and volume concerns. By physically
locating related data within the same data files, complex types increase locality and reduce the expense of join
queries. By associating an arbitrary amount of data with a single row, complex types avoid the need to repeat
lengthy values such as strings. Because Impala knows which complex type values are associated with each row,
you can save storage by avoiding artificial foreign key values that are only used for joins. The flexibility of the
STRUCT, ARRAY, and MAP types lets you model familiar constructs such as fact and dimension tables from a data
warehouse, and wide tables representing sparse matrixes.
• For an ARRAY containing scalar values, all those values (represented by the ITEM pseudocolumn) are stored
adjacent to each other.
• For a MAP, the values of the KEY pseudocolumn are stored adjacent to each other. If the VALUE pseudocolumn is
a scalar type, its values are also stored adjacent to each other.
• If an ARRAY element, STRUCT field, or MAP VALUE part is another complex type, the column-oriented storage
applies to the next level down (or the next level after that, and so on for deeply nested types) where the final
elements, fields, or values are of scalar types.
The numbers represented by the POS pseudocolumn of an ARRAY are not physically stored in the data files. They are
synthesized at query time based on the order of the ARRAY elements associated with each row.
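For example, POS and ITEM can be referenced like regular columns (a sketch using the hypothetical practice_complex table shown earlier):

-- Each ARRAY element becomes a separate row; the POS values
-- are synthesized at query time, not read from the data files.
select p.id, t.pos, t.item
  from practice_complex p, p.tags t
 order by p.id, t.pos;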
File Format Support for Impala Complex Types
Currently, Impala queries support complex type data only in the Parquet file format. See Using the Parquet File Format
with Impala Tables on page 535 for details about the performance benefits and physical layout of this file format.
Each table, or each partition within a table, can have a separate file format, and you can change file format at the table
or partition level through an ALTER TABLE statement. Because this flexibility makes it difficult to guarantee ahead of
time that all the data files for a table or partition are in a compatible format, Impala does not throw any errors when
you change the file format for a table or partition using ALTER TABLE. Any errors come at runtime when Impala
actually processes a table or partition that contains nested types and is not in one of the supported formats. If a query
on a partitioned table only processes some partitions, and all those partitions are in one of the supported formats, the
query succeeds.
Because Impala does not parse the data structures containing nested types for unsupported formats such as text, Avro,
SequenceFile, or RCFile, you cannot use data files in these formats with Impala, even if the query does not refer to the
nested type columns. Also, if a table using an unsupported format originally contained nested type columns, and then
those columns were dropped from the table using ALTER TABLE ... DROP COLUMN, any existing data files in the
table still contain the nested type data and Impala queries on that table will generate errors.
You can perform DDL operations (even CREATE TABLE) for tables involving complex types in file formats other than
Parquet. The DDL support lets you set up intermediate tables in your ETL pipeline, to be populated by Hive, before the
final stage where the data resides in a Parquet table and is queryable by Impala. Also, you can have a partitioned table
with complex type columns that uses a non-Parquet format, and use ALTER TABLE to change the file format to Parquet
for individual partitions. When you put Parquet data files into those partitions, Impala can execute queries against
that data as long as the query does not involve any of the non-Parquet partitions.
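For example (a hypothetical partitioned table where a single partition is converted to Parquet):

-- Switch one partition to the Parquet file format.
alter table nested_events partition (year = 2016) set fileformat parquet;
-- After loading Parquet data files into that partition, queries succeed
-- as long as they do not involve the non-Parquet partitions.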
If you use the parquet-tools command to examine the structure of a Parquet data file that includes complex types,
you see that both ARRAY and MAP are represented as a Bag in Parquet terminology, with all fields marked Optional
because Impala allows any column to be nullable.
Impala supports both 2-level and 3-level encoding within each Parquet data file. When constructing Parquet data
files outside Impala, use either encoding style, but do not mix 2-level and 3-level encoding within the same data file.
Choosing Between Complex Types and Normalized Tables
Choosing between multiple normalized fact and dimension tables, or a single table containing complex types, is an
important design decision.
• If you are coming from a traditional database or data warehousing background, you might be familiar with how
to split up data between tables. Your business intelligence tools might already be optimized for dealing with this
kind of multi-table scenario through join queries.
• If you are pulling data from Impala into an application written in a programming language that has data structures
analogous to the complex types, such as Python or Java, complex types in Impala could simplify data interchange
and improve understandability and reliability of your program logic.
• You might already be faced with existing infrastructure, or receive high volumes of data, that assume one layout
or the other. For example, complex types are popular with web-oriented applications, where they keep all the
information about an online user in one place for convenient lookup and analysis, or accommodate sparse or
constantly evolving data fields.
• If some parts of the data change over time while related data remains constant, using multiple normalized tables
lets you replace certain parts of the data without reloading the entire data set. Conversely, if you receive related
data all bundled together, such as in JSON files, using complex types can save the overhead of splitting the related
items across multiple tables.
• From a performance perspective:
– In Parquet tables, Impala can skip columns that are not referenced in a query, avoiding the I/O penalty of
reading the embedded data. When complex types are nested within a column, the data is physically divided
at a very granular level; for example, a query referring to data nested multiple levels deep in a complex type
column does not have to read all the data from that column, only the data for the relevant parts of the column
type hierarchy.
– Complex types avoid the possibility of expensive join queries when data from fact and dimension tables is
processed in parallel across multiple hosts. All the information for a row containing complex types typically
resides in the same data block, and therefore does not need to be transmitted across the network when joining
fields that are all part of the same row.
– The tradeoff with complex types is that fewer rows fit in each data block. Whether it is better to have more
data blocks with fewer rows, or fewer data blocks with many rows, depends on the distribution of your data
and the characteristics of your query workload. If the complex columns are rarely referenced, using them
might lower efficiency. If you are seeing low parallelism due to a small volume of data (relatively few data
blocks) in each table partition, increasing the row size by including complex columns might produce more
data blocks and thus spread the work more evenly across the cluster. See Scalability Considerations for Impala
on page 516 for more on this advanced topic.