Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.
Amazon Redshift Database Developer Guide
Table of Contents
System overview
    Are you a first-time Amazon Redshift user?
    Are you a database developer?
    Prerequisites
    System and architecture overview
        Data warehouse system architecture
        Performance
        Columnar storage
        Workload management
        Using Amazon Redshift with other services
Best practices
    Conducting a proof of concept
        Overview of the process
        Identify the business goals and success criteria
        Set up your proof of concept
        Checklist for a complete evaluation
        Develop a project plan for your evaluation
        Additional resources to help your evaluation
        Need help?
    Best practices for designing tables
        Choose the best sort key
        Choose the best distribution style
        Use automatic compression
        Define constraints
        Use the smallest possible column size
        Use date/time data types for date columns
    Best practices for loading data
        Take the loading data tutorial
        Use a COPY command to load data
        Use a single COPY command
        Split your load data
        Compress your data files
        Verify data files before and after a load
        Use a multi-row insert
        Use a bulk insert
        Load data in sort key order
        Load data in sequential blocks
        Use time-series tables
        Use a staging table to perform a merge
        Schedule around maintenance windows
    Best practices for designing queries
    Working with Advisor
        Amazon Redshift Regions
        Access Advisor
        Advisor recommendations
Tutorials
Working with automatic table optimization
    Enabling automatic table optimization
    Removing automatic table optimization
    Monitoring actions of automatic table optimization
    Working with column compression
        Compression encodings
        Testing compression encodings
        Example: Choosing compression encodings for the CUSTOMER table
    Getting started data sharing using the SQL interface
    Getting started data sharing using the console
    Getting started data sharing with CloudFormation
    Limitations
Ingesting and querying semistructured data in Amazon Redshift
    Use cases for the SUPER data type
    Concepts for SUPER data type use
    Considerations for SUPER data
    SUPER sample dataset
    Loading semistructured data into Amazon Redshift
        Parsing JSON documents to SUPER columns
        Using COPY to load JSON data in Amazon Redshift
    Unloading semistructured data
        Unloading semistructured data in CSV or text formats
        Unloading semistructured data in the Parquet format
    Querying semistructured data
        Navigation
        Unnesting queries
        Object unpivoting
        Dynamic typing
        Lax semantics
        Types of introspection
        Order by
    Operators and functions
        Arithmetic operators
        Arithmetic functions
        Array functions
    SUPER configurations
        Lax and strict modes for SUPER
        Accessing JSON fields with upper or mixed case letters
    Limitations
    Using SUPER data type with materialized views
        Accelerating PartiQL queries
        Limitations for using the SUPER data type with materialized views
Using machine learning in Amazon Redshift
    Machine learning overview
        How machine learning can solve a problem
        Terms and concepts for Amazon Redshift ML
    Machine learning for novices and experts
    Costs for using Amazon Redshift ML
    Getting started with Amazon Redshift ML
        Administrative setup
        Using model explainability with Amazon Redshift ML
        Tutorial: Creating a customer churn model
Tuning query performance
    Query processing
        Query planning and execution workflow
        Query plan
        Reviewing query plan steps
        Factors affecting query performance
    Analyzing and improving queries
        Query analysis workflow
        Reviewing query alerts
        Analyzing the query plan
        Analyzing the query summary
        Improving query performance
        Diagnostic queries for query tuning
Are you a first-time Amazon Redshift user?
Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.
This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with
databases as a designer, software developer, or administrator, it gives you the information you need to
design, build, query, and maintain your data warehouse.
• Service Highlights and Pricing – The product detail page provides the Amazon Redshift value
proposition, service highlights, and pricing.
• Getting Started – Amazon Redshift Getting Started Guide includes an example that walks you
through the process of creating an Amazon Redshift data warehouse cluster, creating database tables,
uploading data, and testing queries.
After you complete the Getting Started guide, we recommend that you explore one of the following
guides:
• Amazon Redshift Cluster Management Guide – The Cluster Management guide shows you how to
create and manage Amazon Redshift clusters.
If you are an application developer, you can use the Amazon Redshift Query API to manage clusters
programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift
API can help simplify your programming tasks. If you prefer a more interactive way of managing
clusters, you can use the Amazon Redshift console and the AWS Command Line Interface (AWS CLI).
For information about the API and CLI, go to the following manuals:
• API reference
• CLI reference
• Amazon Redshift Database Developer Guide (this document) – If you are a database developer, the
Database Developer Guide explains how to design, build, query, and maintain the databases that make
up your data warehouse.
If you are transitioning to Amazon Redshift from another relational database system or data warehouse
application, you should be aware of important differences in how Amazon Redshift is implemented. For
a summary of the most important considerations for designing tables and loading data, see Amazon
Redshift best practices for designing tables (p. 16) and Amazon Redshift best practices for loading
data (p. 19). Amazon Redshift is based on PostgreSQL. For a detailed list of the differences between
Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).
Are you a database developer?
• Quickly start using Amazon Redshift – Begin by following the steps in the Amazon Redshift Getting
Started Guide to quickly deploy a cluster, connect to a database, and try out some queries. When you
are ready to build your database, load data into tables, and write queries to manipulate data in the
data warehouse, return here to the Database Developer Guide.
• Learn about the internal architecture of the Amazon Redshift data warehouse – The System and
architecture overview (p. 3) gives a high-level overview of Amazon Redshift's internal architecture. If
you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product
detail page.
• Create databases, tables, users, and other database objects – Getting started using databases is a
quick introduction to the basics of SQL development. Amazon Redshift SQL (p. 469) has the syntax
and examples for Amazon Redshift SQL commands and functions and other SQL elements. Amazon
Redshift best practices for designing tables (p. 16) provides a summary of our recommendations for
choosing sort keys, distribution keys, and compression encodings.
• Learn how to design tables for optimum performance – Working with automatic table
optimization (p. 37) details considerations for applying compression to the data in table columns and
choosing distribution and sort keys.
• Load data – Loading data (p. 65) explains the procedures for loading large datasets from Amazon
DynamoDB tables or from flat files stored in Amazon S3 buckets. Amazon Redshift best practices for
loading data (p. 19) provides tips for loading your data quickly and effectively.
• Manage users, groups, and database security – Managing database security (p. 457) covers database
security topics.
• Monitor and optimize system performance – The System tables reference (p. 1241) details system
tables and views that you can query for the status of the database and monitor queries and
processes. You should also consult the Amazon Redshift Cluster Management Guide to learn how to use
the AWS Management Console to check the system health, monitor metrics, and back up and restore
clusters.
• Analyze and report information from very large datasets – Many popular software vendors are
certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use
today. For more information, see the Amazon Redshift partner page. The SQL reference (p. 469) has all
the details for the SQL expressions, commands, and functions Amazon Redshift supports.
Prerequisites
Before you use this guide, you should complete the setup tasks in the Amazon Redshift Getting Started
Guide, such as creating a cluster and connecting to it from a SQL client.
You should also know how to use your SQL client and should have a fundamental understanding of the
SQL language.
System and architecture overview
An Amazon Redshift data warehouse is an enterprise-class relational database query and management
system.
Amazon Redshift supports client connections with many types of applications, including business
intelligence (BI), reporting, data, and analytics tools.
When you run analytic queries, you are retrieving, comparing, and evaluating large amounts of data in
multiple-stage operations to produce a final result.
Amazon Redshift achieves efficient storage and optimum query performance through a combination
of massively parallel processing, columnar data storage, and very efficient, targeted data compression
encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.
Data warehouse system architecture
Client applications
Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools
and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on
industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal
changes. For information about important differences between Amazon Redshift SQL and PostgreSQL,
see Amazon Redshift and PostgreSQL (p. 470).
Clusters
A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more
compute nodes, an additional leader node coordinates the compute nodes and handles external
communication. Your client application interacts directly only with the leader node. The compute nodes
are transparent to external applications.
Leader node
The leader node manages communications with client programs and all communication with compute
nodes. It parses and develops execution plans to carry out database operations, in particular, the series
of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node
compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to
each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables
that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon
Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any
of these functions will return an error if it references tables that reside on the compute nodes. For more
information, see SQL functions supported on the leader node (p. 469).
Compute nodes
The leader node compiles code for individual elements of the execution plan and assigns the code to
individual compute nodes. The compute nodes run the compiled code and send intermediate results
back to the leader node for final aggregation.
Each compute node has its own dedicated CPU, memory, and attached disk storage, which are
determined by the node type. As your workload grows, you can increase the compute capacity and
storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift provides several node types for your compute and storage needs. For details of each
node type, see Amazon Redshift clusters in the Amazon Redshift Cluster Management Guide.
Node slices
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and
disk space, where it processes a portion of the workload assigned to the node. The leader node manages
distributing data to the slices and apportions the workload for any queries or other database operations
to the slices. The slices then work in parallel to complete the operation.
The number of slices per node is determined by the node size of the cluster. For more information about
the number of slices for each node size, go to About clusters and nodes in the Amazon Redshift Cluster
Management Guide.
When you create a table, you can optionally specify one column as the distribution key. When the table
is loaded with data, the rows are distributed to the node slices according to the distribution key that is
defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing
to load data and run queries efficiently. For information about choosing a distribution key, see Choose
the best distribution style (p. 17).
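To see how many slices your own cluster has and how they map to compute nodes, one option is to
query the STV_SLICES system view from your SQL client. A minimal sketch:

-- Show the node-to-slice mapping for the current cluster
select node, slice
from stv_slices
order by node, slice;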
Internal network
Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom
communication protocols to provide private, very high-speed network communication between the
leader node and compute nodes. The compute nodes run on a separate, isolated network that client
applications never access directly.
Databases
A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client
communicates with the leader node, which in turn coordinates query execution with the compute nodes.
Amazon Redshift is based on PostgreSQL. Amazon Redshift and PostgreSQL have a number of
very important differences that you need to take into account as you design and develop your data
warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see
Amazon Redshift and PostgreSQL (p. 470).
Performance
Amazon Redshift achieves extremely fast query execution by employing these performance features.
Topics
• Massively parallel processing (p. 6)
• Columnar data storage (p. 6)
• Data compression (p. 6)
• Query optimizer (p. 6)
• Result caching (p. 6)
• Compiled code (p. 7)
Massively parallel processing
Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed
in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution
of data to balance the workload and minimize movement of data from node to node. For more
information, see Choose the best distribution style (p. 17).
Loading data from flat files takes advantage of parallel processing by spreading the workload across
multiple nodes while simultaneously reading from multiple files. For more information about how to
load data into tables, see Amazon Redshift best practices for loading data (p. 19).
When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of
data blocks. For more information, see Choose the best sort key (p. 16).
Data compression
Data compression reduces storage requirements, thereby reducing disk I/O, which improves query
performance. When you run a query, the compressed data is read into memory, then uncompressed
during query execution. Loading less data into memory enables Amazon Redshift to allocate more
memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon
Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The
best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal
compression encodings when you load the table with data. To learn more about using automatic data
compression, see Loading tables with automatic compression (p. 89).
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and
also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer
implements significant enhancements and extensions for processing complex analytic queries that often
include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see
Tuning query performance (p. 383).
Result caching
To reduce query execution time and improve system performance, Amazon Redshift caches the results of
certain types of queries in memory on the leader node. When a user submits a query, Amazon Redshift
checks the results cache for a valid, cached copy of the query results. If a match is found in the result
cache, Amazon Redshift uses the cached results and doesn't run the query. Result caching is transparent
to the user.
Result caching is turned on by default. To turn off result caching for the current session, set the
enable_result_cache_for_session (p. 1486) parameter to off.
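For example, you can toggle the parameter for the current session with a SET command, as in this
brief sketch:

-- Turn off result caching for the current session
set enable_result_cache_for_session to off;

-- ... run queries that should bypass the result cache ...

-- Turn result caching back on for the session
set enable_result_cache_for_session to on;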
Amazon Redshift uses cached results for a new query when all of the following are true:
• The user submitting the query has access privilege to the objects used in the query.
• The table or views in the query haven't been modified.
• The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
• The query doesn't reference Amazon Redshift Spectrum external tables.
• Configuration parameters that might affect query results are unchanged.
• The query syntactically matches the cached query.
To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn't cache some
large query result sets. Amazon Redshift determines whether to cache query results based on a number
of factors. These factors include the number of entries in the cache and the instance type of your
Amazon Redshift cluster.
To determine whether a query used the result cache, query the SVL_QLOG (p. 1385) system view. If a
query used the result cache, the source_query column returns the query ID of the source query. If result
caching wasn't used, the source_query column value is NULL.
The following example shows that queries submitted by userid 104 and userid 102 use the result cache
from queries run by userid 100.
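A sketch of such a check, using the SVL_QLOG columns described above (the userid filter and row limit
are only illustrative):

-- A nonnull source_query value is the query ID of the original query
-- whose cached result was reused; NULL means the cache wasn't used.
select userid, query, elapsed, source_query
from svl_qlog
where userid > 1
order by query desc
limit 20;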
Compiled code
The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling
the query eliminates the overhead associated with an interpreter and therefore increases the execution
speed, especially for complex queries. The compiled code is cached and shared across sessions on the
same cluster, so subsequent executions of the same query will be faster, often even with different
parameters.
The execution engine compiles different code for the JDBC connection protocol and for the ODBC and psql
(libpq) connection protocols, so two clients using different protocols will each incur the first-time cost
of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the
cached code.
Columnar storage
Columnar storage for database tables is an important factor in optimizing analytic query performance
because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need
to load from disk.
The following series of illustrations describe how columnar data storage implements efficiencies and
how that translates into efficiencies when retrieving data into memory.
This first illustration shows how records from database tables are typically stored into disk blocks by row.
In a typical relational database table, each row contains field values for a single record. In row-wise
database storage, data blocks store values sequentially for each consecutive column making up the
entire row. If block size is smaller than the size of a record, storage for an entire record may take more
than one block. If block size is larger than the size of a record, storage for an entire record may take
less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP)
applications, most transactions involve frequently reading and writing all of the values for entire records,
typically one record or a small number of records at a time. As a result, row-wise storage is optimal for
OLTP databases.
The next illustration shows how with columnar storage, the values for each column are stored
sequentially into disk blocks.
Using columnar storage, each data block stores values of a single column for multiple rows. As records
enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the
columns.
In this simplified example, using columnar storage, each data block holds column field values for as
many as three times as many records as row-based storage. This means that reading the same number
of column field values for the same number of records requires a third of the I/O operations compared
to row-wise storage. In practice, using tables with very large numbers of columns and very large row
counts, storage efficiency is even greater.
An added advantage is that, since each block holds the same type of data, block data can use a
compression scheme selected specifically for the column data type, further reducing disk space and
I/O. For more information about compression encodings based on data types, see Compression
encodings (p. 40).
The savings in space for storing data on disk also carries over to retrieving and then storing that data in
memory. Since many database operations only need to access or operate on one or a small number of
columns at a time, you can save memory space by only retrieving blocks for columns you actually need
for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small
number of records, data warehouse queries commonly read only a few columns for a very large number
of rows. This means that reading the same number of column field values for the same number of rows
requires a fraction of the I/O operations and uses a fraction of the memory that would be required for
processing row-wise blocks. In practice, using tables with very large numbers of columns and very large
row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100
columns. A query that uses five columns will only need to read about five percent of the data contained
in the table. This savings is repeated for possibly billions or even trillions of records for large databases.
In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB,
which is more efficient and further reduces the number of I/O requests needed to perform any database
loading or other operations that are part of query execution.
Workload management
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within
workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.
Amazon Redshift WLM creates query queues at runtime according to service classes, which define the
configuration parameters for various types of queues, including internal system queues and user-
accessible queues. From a user perspective, a user-accessible service class and a queue are functionally
equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service
class as well as a runtime queue.
When you run a query, WLM assigns the query to a queue according to the user's user group or by
matching a query group that is listed in the queue configuration with a query group label that the user
sets at runtime.
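For example, a user can label the queries that follow so that WLM matches them to a queue whose
configuration lists that query group. The group name 'reports' and the table name sales below are
hypothetical placeholders for names you would have defined yourself:

-- Route subsequent queries to the queue that lists the 'reports' query group
set query_group to 'reports';

select count(*) from sales;

-- Remove the label for any queries that follow
reset query_group;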
Currently, the default for clusters using the default parameter group is to use automatic WLM.
Automatic WLM manages query concurrency and memory allocation. For more information, see
Implementing automatic WLM (p. 413).
With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables
up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level
of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency
level of 50. The maximum total concurrency level for all user-defined queues (not including the
Superuser queue) is 50.
The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console.
You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.
For more information about implementing and using workload management, see Implementing
workload management (p. 411).
Using Amazon Redshift with other services
You can also use parallel processing to export data from your Amazon Redshift data warehouse to
multiple data files on Amazon S3. For more information, see Unloading data (p. 143).
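As a sketch, an UNLOAD command writes query results in parallel to a set of files in Amazon S3; the
table, bucket, and IAM role below are placeholders in the same style as the COPY example later in this
guide:

unload ('select * from sales')
to 's3://mybucket/unload/sales_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
gzip;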
Amazon Redshift best practices
Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the
Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively
parallel processing, columnar data storage, and columnar data compression. If your data loading and
query execution times are longer than you expect, or longer than you want, you might be overlooking
key information.
If you are an experienced SQL database developer, we strongly recommend that you review this topic
before you begin developing your Amazon Redshift data warehouse.
If you are new to developing SQL databases, this topic is not the best place to start. We recommend that
you begin by reading Getting started using databases and trying the examples yourself.
In this topic, you can find an overview of the most important development principles, along with specific
tips, examples, and best practices for implementing those principles. No single practice can apply to
every application. You should evaluate all of your options before finalizing a database design. For more
information, see Working with automatic table optimization (p. 37), Loading data (p. 65), Tuning
query performance (p. 383), and the reference chapters.
Topics
• Conducting a proof of concept for Amazon Redshift (p. 11)
• Amazon Redshift best practices for designing tables (p. 16)
• Amazon Redshift best practices for loading data (p. 19)
• Amazon Redshift best practices for designing queries (p. 22)
• Working with recommendations from Amazon Redshift Advisor (p. 24)
Conducting a proof of concept for Amazon Redshift
In the following sections, you can find a framework for building a proof of concept with Amazon
Redshift. The framework helps you to use architectural best practices for designing and operating a
secure, high-performing, and cost-effective data warehouse. This guidance is based on reviewing designs
of thousands of customer architectures across a wide variety of business types and use cases. We have
compiled customer experiences to develop this set of best practices to help you develop criteria for
evaluating your data warehouse workload.
1. Identify the goals of the proof of concept – you can work backward from your business requirements
and success criteria, and translate them into a technical proof of concept project plan.
2. Set up the proof of concept environment – most of the setup process takes just a few clicks to
create your resources. Within minutes, you can have a data warehouse environment ready with data
loaded.
3. Complete the proof of concept project plan to ensure that the goals are met.
After you have identified your business requirements, you should be able to establish SMART goals and
success criteria for building your proof of concept. For information about setting goals, see SMART
criteria in Wikipedia.
Set up your proof of concept
Broadly, setting up the Amazon Redshift proof of concept environment is a two-step process. It involves
launching a data warehouse and then converting the schema and datasets for evaluation.
1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. On the navigation pane, choose Create cluster to open the configuration page.
3. For Cluster identifier, enter a name for your cluster.
4. Choose one of the following methods to size your cluster:
Note
The following step describes an Amazon Redshift console that is running in an AWS Region
that supports RA3 node types. For a list of AWS Regions that support RA3 node types, see
Overview of RA3 node types in the Amazon Redshift Cluster Management Guide.
• If your AWS Region supports RA3 node types, choose either Production or Free trial to answer the
question What are you planning to use this cluster for?
If your organization is eligible, you might be able to create a cluster under the Amazon Redshift free
trial program. To do this, choose Free trial to create a configuration with the dc2.large node type.
For more information about choosing a free trial, see Amazon Redshift free trial.
• If you don't know how large to size your cluster, choose Help me choose. Doing this starts a sizing
calculator that asks you questions about the size and query characteristics of the data that you plan
to store in your data warehouse.
If you know the required size of your cluster (that is, the node type and number of nodes), choose
I'll choose. Then choose the Node type and number of Nodes to size your cluster for the proof of
concept.
5. After you enter all required cluster properties, choose Create cluster to launch your data warehouse.
For more details about creating clusters with the Amazon Redshift console, see Creating a cluster in
the Amazon Redshift Cluster Management Guide.
Convert the schema and set up the datasets for the proof of
concept
If you don't have an existing data warehouse, skip this section and see Amazon Redshift Getting Started
Guide. Amazon Redshift Getting Started Guide provides a tutorial to create a cluster and examples of
setting up data in Amazon Redshift.
When migrating from your existing data warehouse, you can convert schema, code, and data using the
AWS Schema Conversion Tool and the AWS Database Migration Service. Your choice of tools depends on
the source of your data and optional ongoing replications. For more information, see What Is the AWS
Schema Conversion Tool? in the AWS Schema Conversion Tool User Guide and What Is AWS Database
Migration Service? in the AWS Database Migration Service User Guide. The following can help you set up
your data in Amazon Redshift:
• Migrate Your Data Warehouse to Amazon Redshift Using the AWS Schema Conversion Tool – this blog
post provides an overview of how you can use the AWS SCT data extractors to migrate your existing
data warehouse to Amazon Redshift. The AWS SCT tool can migrate your data from many legacy
platforms (such as Oracle, Greenplum, Netezza, Teradata, Microsoft SQL Server, or Vertica).
• Optionally, you can also use the AWS Database Migration Service for ongoing replications of changed
data from the source. For more information, see Using an Amazon Redshift Database as a Target for
AWS Database Migration Service in the AWS Database Migration Service User Guide.
Amazon Redshift is a relational database management system (RDBMS). As such, it can run many
types of data models including star schemas, snowflake schemas, data vault models, and simple, flat,
or normalized tables. After setting up your schemas in Amazon Redshift, you can take advantage of
massively parallel processing and columnar data storage for fast analytical queries out of the box. For
information about types of schemas, see star schema, snowflake schema, and data vault modeling in
Wikipedia.
Checklist for a complete evaluation
• Data load time – using the COPY command is a common way to test how long it takes to load data. For
more information, see Amazon Redshift best practices for loading data (p. 19).
• Throughput of the cluster – measuring queries per hour is a common way to determine throughput.
To do so, set up a test to run typical queries for your workload.
• Data security – you can easily encrypt data at rest and in transit with Amazon Redshift. You also
have a number of options for managing keys. Amazon Redshift also supports single sign-on (SSO)
integration. Amazon Redshift pricing includes built-in security, data compression, backup storage, and
data transfer.
• Third-party tools integration – you can use either a JDBC or ODBC connection to integrate with
business intelligence and other external tools.
• Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such
as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Amazon Kinesis. You can use this
integration when setting up and managing your data warehouse.
• Backups and snapshots – backups and snapshots are created automatically. You can also create a
point-in-time snapshot at any time or on a schedule. Try using a snapshot and creating a second
cluster as part of your evaluation. Evaluate if your development and testing organizations can use the
cluster.
• Resizing – your evaluation should include increasing the number or types of Amazon Redshift nodes.
Verify that workload throughput before and after a resize keeps up with any variability in the volume of
your workload. For more information, see Resizing clusters in Amazon Redshift in the Amazon Redshift
Cluster Management Guide.
• Concurrency scaling – this feature helps you handle variability of traffic volume in your data
warehouse. With concurrency scaling, you can support virtually unlimited concurrent users and
concurrent queries, with consistently fast query performance. For more information, see Working with
concurrency scaling (p. 438).
• Automatic workload management (WLM) – prioritize your business critical queries over other queries
by using automatic WLM. Try setting up queues based on your workloads (for example, a queue for
ETL and a queue for reporting). Then enable automatic WLM to allocate the concurrency and memory
resources dynamically. For more information, see Implementing automatic WLM (p. 413).
• Amazon Redshift Advisor – the Advisor develops customized recommendations to increase
performance and optimize costs by analyzing your workload and usage metrics for your cluster. Sign in
to the Amazon Redshift console to view Advisor recommendations. For more information, see Working
with recommendations from Amazon Redshift Advisor (p. 24).
• Table design – Amazon Redshift provides great performance out of the box for most workloads.
When you create a table, the default sort key and distribution key is AUTO. For more information, see
Working with automatic table optimization (p. 37).
• Support – we strongly recommend that you evaluate AWS Support as part of your evaluation. Also,
make sure to talk to your account manager about your proof of concept. AWS can help with technical
guidance and credits for the proof of concept if you qualify. If you don't find the help you're looking
for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request support
for your Amazon Redshift proof-of-concept.
• Lake house integration – with built-in integration, try using the out-of-box Amazon Redshift
Spectrum feature. With Redshift Spectrum, you can extend the data warehouse into your data lake and
run queries against petabytes of data in Amazon S3 using your existing cluster. For more information,
see Querying external data using Amazon Redshift Spectrum (p. 232).
Develop a project plan for your evaluation
• Assemble a list of queries for each runtime category. Having a sufficient number (for example, 30 per
category) helps ensure that your evaluation reflects a real-world data warehouse implementation.
Add a unique identifier to associate each query that you include in your evaluation with one of the
categories you establish for your evaluation. You can then use these unique identifiers to determine
throughput from the system tables.
You can also create a query group to organize your evaluation queries. For example, if you have
established a "Reporting" category for your evaluation, you might create a coding system to tag your
evaluation queries with the word "Report." You can then identify individual queries within reporting as
R1, R2, and so on. The following example demonstrates this approach.
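One hedged sketch of this approach: label each evaluation query with SET QUERY_GROUP before running
it, then use that label to pull runtimes from the STL_QUERY system table. The label 'R1' and the table
name reporting_sales are illustrative only:

-- Tag an evaluation query so it can be identified later
set query_group to 'R1';
select count(*) from reporting_sales;
reset query_group;

-- Later, pull runtimes for the tagged query from the system tables
select query, label, datediff(ms, starttime, endtime) as elapsed_ms
from stl_query
where label = 'R1'
order by starttime;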
When you have associated a query with an evaluation category, you can use a unique identifier to
determine throughput from the system tables for each category.
• Test throughput with historical user or ETL queries that have a variety of runtimes in your existing data
warehouse. You might use a load testing utility, such as the open-source JMeter or a custom utility. If
so, make sure that your utility does the following:
• It can take the network transmission time into account.
• It evaluates execution time based on throughput of the internal system tables. For information
about how to do this, see Analyzing the query summary (p. 396).
• Identify all the various permutations that you plan to test during your evaluation. The following list
provides some common variables:
• Cluster size
• Node type
• Load testing duration
• Concurrency settings
• Reduce the cost of your proof of concept by pausing your cluster during off-hours and weekends.
When a cluster is paused, on-demand compute billing is suspended. To run tests on the cluster, resume
per-second billing. You can also create a schedule to pause and resume your cluster automatically.
For more information, see Pausing and resuming clusters in the Amazon Redshift Cluster Management
Guide.
At this stage, you're ready to complete your project plan and evaluate results.
Additional resources to help your evaluation
• Service highlights and pricing – this product detail page provides the Amazon Redshift value
proposition, service highlights, and pricing.
• Amazon Redshift Getting Started Guide – this guide provides a tutorial of using Amazon Redshift to
create a sample cluster and work with sample data.
• Getting started with Amazon Redshift Spectrum (p. 234) – in this tutorial, you learn how to use
Redshift Spectrum to query data directly from files on Amazon S3.
• Amazon Redshift management overview – this topic in the Amazon Redshift Cluster Management Guide
provides an overview of Amazon Redshift.
• Optimize Amazon Redshift for performance with BI tools – consider integration with tools such as
Tableau, Power BI, and others.
• Amazon Redshift Advisor recommendations (p. 26) – contains explanations and details for each
Advisor recommendation.
• What's new in Amazon Redshift – announcements that help you keep track of new features and
enhancements.
• Improved speed and scalability – this blog post summarizes recent Amazon Redshift improvements.
Need help?
Make sure to talk to your account manager to let them know about your proof of concept. AWS can help
with technical guidance and credits for the proof of concept if you qualify. If you don't find the help you
are looking for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request
support for your Amazon Redshift proof-of-concept.
Amazon Redshift best practices for designing tables
In this section, you can find a summary of the most important design decisions and best practices for
optimizing query performance. Working with automatic table optimization (p. 37) provides more
detailed explanations and examples of table design options.
Topics
• Choose the best sort key (p. 16)
• Choose the best distribution style (p. 17)
• Let COPY choose compression encodings (p. 18)
• Define primary key and foreign key constraints (p. 18)
• Use the smallest possible column size (p. 18)
• Use date/time data types for date columns (p. 19)
Choose the best sort key
• To have Amazon Redshift choose the appropriate sort order, specify AUTO for the sort key.
• If recent data is queried most frequently, specify the timestamp column as the leading column for the
sort key.
Queries are more efficient because they can skip entire blocks that fall outside the time range.
• If you do frequent range filtering or equality filtering on one column, specify that column as the sort
key.
Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks
the minimum and maximum column values stored on each block and can skip blocks that don't apply
to the predicate range.
• If you frequently join a table, specify the join column as both the sort key and the distribution key.
Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join.
Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of
the sort merge join.
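As a sketch of these recommendations, the first table below lets Amazon Redshift choose the sort
order, and the second names a timestamp column as the leading sort key column. The table and column
names are hypothetical:

-- Let Amazon Redshift choose the sort key
create table event_auto (
  event_id   bigint,
  event_time timestamp,
  event_type varchar(20)
)
sortkey auto;

-- Recent data is queried most often, so lead the sort key with the timestamp
create table event_by_time (
  event_id   bigint,
  event_time timestamp,
  event_type varchar(20)
)
sortkey (event_time);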
Choose the best distribution style
1. Distribute the fact table and one dimension table on their common columns.
Your fact table can have only one distribution key. Any tables that join on another key aren't
collocated with the fact table. Choose one dimension to collocate based on how frequently it is joined
and the size of the joining rows. Designate both the dimension table's primary key and the fact table's
corresponding foreign key as the DISTKEY.
2. Choose the largest dimension based on the size of the filtered dataset.
Only the rows that are used in the join need to be distributed, so consider the size of the dataset after
filtering, not the size of the table.
3. Choose a column with high cardinality in the filtered result set.
If you distribute a sales table on a date column, for example, you should probably get fairly even data
distribution, unless most of your sales are seasonal. However, if you commonly use a range-restricted
predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices
and the query workload is skewed.
4. Change some dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, you
can improve query performance significantly by distributing the entire table to all of the nodes. Using
ALL distribution multiplies storage space requirements and increases load times and maintenance
operations, so you should weigh all factors before choosing ALL distribution.
To have Amazon Redshift choose the appropriate distribution style, specify AUTO for the distribution
style.
For more information about choosing distribution styles, see Working with data distribution
styles (p. 50).
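A minimal sketch of these choices, using hypothetical star-schema tables: the fact table and its most
frequently joined dimension share a distribution key so their joins are collocated, a small dimension
that can't be collocated is replicated with ALL, and a third table lets Amazon Redshift decide with
AUTO:

-- Collocate the fact table and its most frequently joined dimension
create table dim_customer (
  customer_id   integer,
  customer_name varchar(100)
)
diststyle key distkey (customer_id);

create table fact_sales (
  sale_id     bigint,
  customer_id integer,
  sold_date   date,
  amount      decimal(12,2)
)
diststyle key distkey (customer_id);

-- Replicate a small dimension to every node
create table dim_region (
  region_id   integer,
  region_name varchar(50)
)
diststyle all;

-- Let Amazon Redshift choose the distribution style
create table dim_product (
  product_id   integer,
  product_name varchar(100)
)
diststyle auto;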
Use automatic compression
ENCODE AUTO is the default for tables: Amazon Redshift automatically manages compression encoding
for all columns in the table. If you specify a compression encoding for any column, the table is no
longer set to ENCODE AUTO, and Amazon Redshift stops automatically managing compression encodings for
the table's columns. To return to automatic management, specify the ENCODE AUTO option for the table.
For more information, see CREATE TABLE (p. 708) and ALTER TABLE (p. 542).
The COPY command analyzes your data and applies compression encodings to an empty table
automatically as part of the load operation.
Automatic compression balances overall performance when choosing compression encodings. Range-
restricted scans might perform poorly if sort key columns are compressed much more highly than other
columns in the same query. As a result, automatic compression chooses a less efficient compression
encoding to keep the sort key columns balanced with other columns.
Suppose that your table's sort key is a date or timestamp and the table uses many large varchar columns.
In this case, you might get better performance by not compressing the sort key column at all. Run
the ANALYZE COMPRESSION (p. 566) command on the table, then use the encodings to create a new
table, but leave out the compression encoding for the sort key.
There is a performance cost for automatic compression encoding, but only if the table is empty
and does not already have compression encoding. For short-lived tables and tables that you create
frequently, such as staging tables, load the table once with automatic compression or run the ANALYZE
COMPRESSION command. Then use those encodings to create new tables. You can add the encodings to
the CREATE TABLE statement, or use CREATE TABLE LIKE to create a new table with the same encoding.
For more information, see Loading tables with automatic compression (p. 89).
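A hedged sketch of the workflow described above: inspect the encodings that ANALYZE COMPRESSION
suggests for a loaded table, then create a new table that uses them but leaves the sort key column
uncompressed. The table names, column names, and chosen encodings are hypothetical:

-- Review suggested encodings for an existing, loaded table
analyze compression event_staging;

-- Apply the suggested encodings to a new table, but keep the
-- timestamp sort key column uncompressed (RAW)
create table event_new (
  event_id    bigint      encode az64,
  event_time  timestamp   encode raw,
  event_type  varchar(20) encode lzo
)
sortkey (event_time);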
Define primary key and foreign key constraints
Do not define primary key and foreign key constraints unless your application enforces the constraints.
Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.
See Defining table constraints (p. 64) for additional information about how Amazon Redshift uses
constraints.
Use the smallest possible column size
Don't make it a practice to use the maximum column size for convenience. Instead, consider the largest values you are likely to store in a VARCHAR column, for example, and size your columns accordingly. Because Amazon Redshift compresses column data very effectively, creating columns much larger than necessary has minimal impact on the size of data tables. During processing for complex queries, however, intermediate query results might need to be stored in temporary tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory and temporary disk space, which can affect query performance.
Best practices for loading data
Loading very large datasets can take a long time and consume a lot of computing resources. How your
data is loaded can also affect query performance. This section presents best practices for loading data
efficiently using COPY commands, bulk inserts, and staging tables.
For more information about using the COPY command, see Loading data from Amazon S3 (p. 68) and
Loading data from an Amazon DynamoDB table (p. 86).
Split your load data
Use a single COPY command to load a table from multiple files. If you instead use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load. This type of load is much slower and requires a VACUUM process at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see Loading data from Amazon S3 (p. 68).
In contrast, when you load delimited data from a large, uncompressed file, Amazon Redshift makes
use of multiple slices. These slices work in parallel, automatically. This provides fast load performance.
Specifically, when Amazon Redshift loads uncompressed, delimited data, data is split into ranges and
handled by slices in each node.
If you intend to load data from a large, compressed file, we recommend that you split your data into
smaller files that are about equal size, from 1 MB to 1 GB after compression. For optimum parallelism,
the ideal file size is 1–125 MB after compression. Make the number of files a multiple of the number of
slices in your cluster. For more information about how to split your data into files and examples of using
COPY to load data, see Loading data from Amazon S3 (p. 68).
Compress your data files
We recommend that you compress your load files. Specify the GZIP, LZOP, BZIP2, or ZSTD option with the COPY command. This example loads the TIME table from a pipe-delimited lzop file.
copy time
from 's3://mybucket/data/timerows.lzo'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
lzop
delimiter '|';
There are instances when you don't have to split uncompressed data files. For more information
about splitting your data and examples of using COPY to load data, see Loading data from Amazon
S3 (p. 68).
Verify data files before and after a load
After the load operation is complete, query the STL_LOAD_COMMITS (p. 1304) system table to verify that the expected files were loaded. For more information, see Verifying that the data loaded correctly (p. 88).
Use a multi-row insert
Multi-row inserts improve performance by batching up a series of inserts. The following example inserts three rows into a four-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert.
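A statement of that form, assuming a hypothetical four-column staging table whose columns have default values defined, might look like the following.
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);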
Use a bulk insert
Use the INSERT (p. 780) and CREATE TABLE AS (p. 723) commands when you need to move data or a
subset of data from one table into another.
For example, the following INSERT statement selects all of the rows from the CATEGORY table and
inserts them into the CATEGORY_STAGE table.
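A minimal statement matching that description might be:
insert into category_stage
(select * from category);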
The following example creates CATEGORY_STAGE as a copy of CATEGORY and inserts all of the rows in
CATEGORY into CATEGORY_STAGE.
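A corresponding CREATE TABLE AS statement might be:
create table category_stage as
select * from category;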
Load your data in sort key order
Load your data in sort key order to avoid needing to vacuum. If each batch of new data follows the existing rows in your table, your data is properly stored in sort order, and you don't need to run a vacuum. You don't need to presort the rows in each load because COPY sorts each batch of incoming data as it loads.
For example, suppose that you load data every day based on the current day's activity. If your sort key is
a timestamp column, your data is stored in sort order. This order occurs because the current day's data is
always appended at the end of the previous day's data. For more information, see Loading your data in
sort key order (p. 115).
Load data in sequential blocks
If you need to add a large quantity of data, load the data in sequential blocks according to sort order to eliminate the need to vacuum. For example, suppose that you need to load a table with events from January 2017 to December 2017. Assuming each month is in a single file, load the rows for January, then February, and so on. Your table is completely sorted when your load completes, and you don't need to run a vacuum. For more information, see Use time-series tables (p. 21).
When loading very large datasets, the space required to sort might exceed the total available space. By loading data in smaller blocks, you use much less intermediate sort space during each load. In addition, loading smaller blocks makes it easier to restart if the COPY fails and is rolled back.
Use time-series tables
If your data has a fixed retention period, you can organize your data as a sequence of time-series tables. In such a sequence, each table is identical but contains data for different time ranges. You can then easily remove old data by running a DROP TABLE command on the corresponding tables.
This approach is much faster than running a large-scale DELETE process and saves you from having to
run a subsequent VACUUM process to reclaim space. To hide the fact that the data is stored in different
tables, you can create a UNION ALL view. When you delete old data, simply refine your UNION ALL view
to remove the dropped tables. Similarly, as you load new time periods into new tables, add the new
tables to the view. To signal the optimizer to skip the scan on tables that don't match the query filter,
your view definition filters for the date range that corresponds to each table.
Avoid having too many tables in the UNION ALL view. Each additional table adds a small processing
time to the query. Tables don't need to use the same time frame. For example, you might have tables for
differing time periods, such as daily, monthly, and yearly.
If you use time-series tables with a timestamp column for the sort key, you effectively load your data in
sort key order. Doing this eliminates the need to vacuum to re-sort the data. For more information, see
Loading your data in sort key order (p. 115).
Use a staging table to perform a merge
Amazon Redshift doesn't support a single merge statement (update or insert, also known as an upsert)
to insert and update data from a single data source. However, you can effectively perform a merge
operation. To do so, load your data into a staging table and then join the staging table with your target
table for an UPDATE statement and an INSERT statement. For instructions, see Updating and inserting
new data (p. 96).
Best practices for designing queries
To maximize query performance, follow these recommendations when creating queries.
• Design tables according to best practices to provide a solid foundation for query performance. For more information, see Amazon Redshift best practices for designing tables (p. 16).
• Avoid using select *. Include only the columns you specifically need.
• Use a CASE expression (p. 914) to perform complex aggregations instead of selecting from the same
table multiple times.
• Don't use cross-joins unless absolutely necessary. These joins without a join condition result in the
Cartesian product of two tables. Cross-joins are typically run as nested-loop joins, which are the
slowest of the possible join types.
• Use subqueries in cases where one table in the query is used only for predicate conditions and
the subquery returns a small number of rows (less than about 200). The following example uses a
subquery to avoid joining the LISTING table.
select sum(sales.qtysold)
from sales
where salesid in (select listid from listing where listtime > '2008-12-26');
• Use predicates to restrict the dataset as much as possible. For example, suppose that you want to join SALES and LISTING to find ticket sales for tickets listed after December, grouped by seller. Both tables are sorted by date. The following query joins the tables on their common key and filters for listing.listtime values greater than December 1.
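A query matching that description, using the TICKIT sample schema, might look like the following.
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
group by 1 order by 1;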
The WHERE clause doesn't include a predicate for sales.saletime, so the execution engine is forced
to scan the entire SALES table. If you know the filter would result in fewer rows participating in the
join, then add that filter as well. The following example cuts execution time significantly.
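With the additional range-restricted predicate on sales.saletime, the query might look like this.
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
and sales.saletime > '2008-12-01'
group by 1 order by 1;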
• Use sort keys in the GROUP BY clause so the query planner can use more efficient aggregation. A
query might qualify for one-phase aggregation when its GROUP BY list contains only sort key columns,
one of which is also the distribution key. The sort key columns in the GROUP BY list must include the
first sort key, then other sort keys that you want to use in sort key order. For example, it is valid to use
the first sort key, the first and second sort keys, the first, second, and third sort keys, and so on. It is
not valid to use the first and third sort keys.
You can confirm the use of one-phase aggregation by running the EXPLAIN (p. 762) command and
looking for XN GroupAggregate in the aggregation step of the query.
• If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same
order in both. That is, use the approach just following.
group by a, b, c
order by a, b, c
Don't use the following approach, where the GROUP BY and ORDER BY columns are in different orders.
group by b, c, a
order by a, b, c
Working with Advisor
Advisor bases its recommendations on observations regarding performance statistics or operations data.
Advisor develops observations by running tests on your clusters to determine if a test value is within
a specified range. If the test result is outside of that range, Advisor generates an observation for your
cluster. At the same time, Advisor creates a recommendation about how to bring the observed value
back into the best-practice range. Advisor only displays recommendations that should have a significant
impact on performance and operations. When Advisor determines that a recommendation has been
addressed, it removes it from your recommendation list.
For example, suppose that your data warehouse contains a large number of uncompressed table
columns. In this case, you can save on cluster storage costs by rebuilding tables using the ENCODE
parameter to specify column compression. In another example, suppose that Advisor observes that your cluster contains a significant amount of uncompressed table data. In this case, it provides you with a SQL code block to find the table columns that are candidates for compression, along with resources
that describe how to compress those columns.
Access Advisor
New console
To view Amazon Redshift Advisor recommendations on the console
1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. On the navigation menu, choose ADVISOR.
3. Expand each recommendation to see more details. On this page, you can sort and group
recommendations.
Original console
To view Amazon Redshift Advisor recommendations on the console
1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. In the navigation pane, choose Advisor.
3. Choose the cluster that you want to get recommendations for.
Advisor recommendations
Topics
• Compress Amazon S3 file objects loaded by COPY (p. 26)
• Isolate multiple active databases (p. 27)
• Reallocate workload management (WLM) memory (p. 28)
• Skip compression analysis during COPY (p. 29)
• Split Amazon S3 objects loaded by COPY (p. 30)
• Update table statistics (p. 31)
• Enable short query acceleration (p. 32)
• Alter distribution keys on tables (p. 33)
• Alter sort keys on tables (p. 34)
• Alter compression encodings on columns (p. 34)
• Data type recommendations (p. 35)
Compress Amazon S3 file objects loaded by COPY
When loading large amounts of data, we strongly recommend using the COPY command to load compressed data files from Amazon S3. Compressing large datasets saves time uploading the files to Amazon S3. COPY can also speed up the load process by uncompressing the files as they are read.
Analysis
Long-running COPY commands that load large uncompressed datasets often have an opportunity for
considerable performance improvement. The Advisor analysis identifies COPY commands that load large
uncompressed datasets. In such a case, Advisor generates a recommendation to implement compression
on the source files in S3.
Recommendation
Ensure that each COPY that loads a significant amount of data, or runs for a significant duration, ingests
compressed data objects from S3. You can identify the COPY commands that load large uncompressed
datasets from S3 by running the following SQL command as a superuser.
SELECT
wq.userid, query, exec_start_time AS starttime, COUNT(*) num_files,
ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,
ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
SUBSTRING(querytxt,1,60) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
AND POSITION('COPY ANALYZE' IN querytxt) = 0
AND aborted = 0 AND final_state='Completed'
GROUP BY 1, 2, 3, 7
HAVING SUM(transfer_size) = SUM(data_size)
AND SUM(transfer_size)/(1024*1024) >= 5
ORDER BY 6 DESC, 5 DESC;
Implementation tips
• If the staged data remains in Amazon S3 after you load it, which is common in data lake architectures, storing this data in a compressed form can reduce your storage costs.
Isolate multiple active databases
Analysis
The Advisor analysis reviews all databases on the cluster for active workloads running at the same time. If multiple databases have active workloads running at the same time, Advisor generates a recommendation to consider migrating databases to separate Amazon Redshift clusters.
Recommendation
Consider moving each actively queried database to a separate dedicated cluster. Using a separate cluster can reduce resource contention and improve query performance, because you can size each cluster for the storage, cost, and performance needs of each workload. Also, unrelated workloads often benefit from different workload management configurations.
To identify which databases are actively used, you can run this SQL command as a superuser.
SELECT database,
COUNT(*) as num_queries,
AVG(DATEDIFF(sec,starttime,endtime)) avg_duration,
MIN(starttime) as oldest_ts,
MAX(endtime) as latest_ts
FROM stl_query
WHERE userid > 1
GROUP BY database;
Implementation tips
• Because a user must connect to each database specifically, and queries can only access a single
database, moving databases to separate clusters has minimal impact for users.
• One option to move a database is to take the following steps:
1. Temporarily restore a snapshot of the current cluster to a cluster of the same size.
2. Delete all databases from the new cluster except the target database to be moved.
3. Resize the cluster to an appropriate node type and count for the database's workload.
Reallocate workload management (WLM) memory
When a queue is configured with more slots than the workload requires, the memory allocated to these unused slots is underutilized. Reducing the configured slots to match the peak workload requirements redistributes the underutilized memory to active slots and can result in improved query performance.
Analysis
The Advisor analysis reviews workload concurrency requirements to identify query queues with unused
slots. Advisor generates a recommendation to reduce the number of slots in a queue when it finds the
following:
• A queue with slots that are completely inactive throughout the analysis
• A queue with more than four slots that had at least two inactive slots throughout the analysis
Recommendation
Reducing the configured slots to match peak workload requirements redistributes underutilized memory
to active slots. Consider reducing the configured slot count for queues where the slots have never been
fully utilized. To identify these queues, you can compare the peak hourly slot requirements for each
queue by running the following SQL command as a superuser.
WITH
generate_dt_series AS (select sysdate - (n * interval '5 second') as dt from (select
row_number() over () as n from stl_scan limit 17280)),
apex AS (
SELECT iq.dt, iq.service_class, iq.num_query_tasks, count(iq.slot_count) as
service_class_queries, sum(iq.slot_count) as service_class_slots
FROM
(select gds.dt, wq.service_class, wscc.num_query_tasks, wq.slot_count
FROM stl_wlm_query wq
JOIN stv_wlm_service_class_config wscc ON (wscc.service_class = wq.service_class
AND wscc.service_class > 5)
JOIN generate_dt_series gds ON (wq.service_class_start_time <= gds.dt AND
wq.service_class_end_time > gds.dt)
WHERE wq.userid > 1 AND wq.service_class > 5) iq
The max_service_class_slots column represents the maximum number of WLM query slots in the
query queue for that hour. If underutilized queues exist, implement the slot reduction optimization by
modifying a parameter group, as described in the Amazon Redshift Cluster Management Guide.
Implementation tips
• If your workload is highly variable in volume, make sure that the analysis captured a peak utilization
period. If it didn't, run the preceding SQL repeatedly to monitor peak concurrency requirements.
• For more details on interpreting the query results from the preceding SQL code, see the
wlm_apex_hourly.sql script on GitHub.
Skip compression analysis during COPY
Analysis
The Advisor analysis checks for COPY operations that were delayed by automatic compression analysis. The automatic compression analysis determines the compression encodings by sampling the data while it's being loaded. This sampling is similar to that performed by the ANALYZE COMPRESSION (p. 566) command.
When you load data as part of a structured process, such as in an overnight extract, transform, load
(ETL) batch, you can define the compression beforehand. You can also optimize your table definitions to
permanently skip this phase without any negative impacts.
Recommendation
To improve COPY responsiveness by skipping the compression analysis phase, implement either of the
following two options:
• Use the column ENCODE parameter when creating any tables that you load using the COPY command.
• Turn off compression altogether by supplying the COMPUPDATE OFF parameter in the COPY
command.
The best solution is generally to use column encoding during table creation, because this approach also
maintains the benefit of storing compressed data on disk. You can use the ANALYZE COMPRESSION
command to suggest compression encodings, but you must recreate the table to apply these encodings.
To automate this process, you can use the AWSColumnEncodingUtility, found on GitHub.
To identify recent COPY operations that triggered automatic compression analysis, run the following SQL
command.
WITH xids AS (
SELECT xid FROM stl_query WHERE userid>1 AND aborted=0
AND querytxt = 'analyze compression phase 1' GROUP BY xid
INTERSECT SELECT xid FROM stl_commit_stats WHERE node=-1)
SELECT a.userid, a.query, a.xid, a.starttime, b.complyze_sec,
a.copy_sec, a.copy_sql
FROM (SELECT q.userid, q.query, q.xid, date_trunc('s',q.starttime)
starttime, substring(querytxt,1,100) as copy_sql,
ROUND(datediff(ms,starttime,endtime)::numeric / 1000.0, 2) copy_sec
FROM stl_query q JOIN xids USING (xid)
WHERE (querytxt ilike 'copy %from%' OR querytxt ilike '% copy %from%')
AND querytxt not like 'COPY ANALYZE %') a
LEFT JOIN (SELECT xid,
ROUND(sum(datediff(ms,starttime,endtime))::numeric / 1000.0,2) complyze_sec
FROM stl_query q JOIN xids USING (xid)
WHERE (querytxt like 'COPY ANALYZE %'
OR querytxt like 'analyze compression phase %')
GROUP BY xid ) b ON a.xid = b.xid
WHERE b.complyze_sec IS NOT NULL ORDER BY a.copy_sql, a.starttime;
Implementation tips
• Ensure that all tables of significant size created during your ETL processes (for example, staging tables
and temporary tables) declare a compression encoding for all columns except the first sort key.
• Estimate the expected lifetime size of the table being loaded for each of the COPY commands identified by the preceding SQL command. If you are confident that the table will remain extremely
small, turn off compression altogether with the COMPUPDATE OFF parameter. Otherwise, create the
table with explicit compression before loading it with the COPY command.
Split Amazon S3 objects loaded by COPY
Analysis
The Advisor analysis identifies COPY commands that load large datasets contained in a small number of
files staged in S3. Long-running COPY commands that load large datasets from a few files often have
an opportunity for considerable performance improvement. When Advisor identifies that these COPY
commands are taking a significant amount of time, it creates a recommendation to increase parallelism
by splitting the data into additional files in S3.
Recommendation
1. Optimize COPY commands that load fewer files than the number of cluster nodes.
2. Optimize COPY commands that load fewer files than the number of cluster slices.
3. Optimize COPY commands where the number of files is not a multiple of the number of cluster slices.
Certain COPY commands load a significant amount of data or run for a significant duration. For these
commands, we recommend that you load a number of data objects from S3 that is equivalent to a
multiple of the number of slices in the cluster. To identify how many S3 objects each COPY command has
loaded, run the following SQL code as a superuser.
SELECT
query, COUNT(*) num_files,
ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,
ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
SUBSTRING(querytxt,1,60) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
AND POSITION('COPY ANALYZE' IN querytxt) = 0
AND aborted = 0 AND final_state='Completed'
GROUP BY query, querytxt
HAVING (SUM(transfer_size)/(1024*1024))/COUNT(*) >= 2
ORDER BY CASE
WHEN COUNT(*) < (SELECT max(node)+1 FROM stv_slices) THEN 1
WHEN COUNT(*) < (SELECT COUNT(*) FROM stv_slices WHERE node=0) THEN 2
ELSE 2+((COUNT(*) % (SELECT COUNT(*) FROM stv_slices))/(SELECT COUNT(*)::DECIMAL FROM
stv_slices))
END, (SUM(transfer_size)/(1024.0*1024.0))/COUNT(*) DESC;
Implementation tips
• The number of slices in a node depends on the node size of the cluster. For more information about
the number of slices in the various node types, see Clusters and Nodes in Amazon Redshift in the
Amazon Redshift Cluster Management Guide.
• You can load multiple files by specifying a common prefix, or prefix key, for the set, or by explicitly
listing the files in a manifest file. For more information about loading files, see Loading data from
compressed and uncompressed files (p. 68).
• Amazon Redshift doesn't take file size into account when dividing the workload. Split your load data
files so that the files are about equal size, between 1 MB and 1 GB after compression.
Update table statistics
Analysis
The Advisor analysis tracks tables whose statistics are out-of-date or missing. It reviews table access
metadata associated with complex queries. If tables that are frequently accessed with complex patterns
are missing statistics, Advisor creates a critical recommendation to run ANALYZE. If tables that are
frequently accessed with complex patterns have out-of-date statistics, Advisor creates a suggested
recommendation to run ANALYZE.
Recommendation
Whenever table content changes significantly, update statistics with ANALYZE. We recommend running
ANALYZE whenever a significant number of new data rows are loaded into an existing table with COPY
or INSERT commands. We also recommend running ANALYZE whenever a significant number of rows are
modified using UPDATE or DELETE commands. To identify tables with missing or out-of-date statistics,
run the following SQL command as a superuser. The results are ordered from largest to smallest table.
SELECT
ti.schema||'.'||ti."table" tablename,
ti.size table_size_mb,
ti.stats_off statistics_accuracy
FROM svv_table_info ti
WHERE ti.stats_off > 5.00
ORDER BY ti.size DESC;
Implementation tips
The default ANALYZE threshold is 10 percent. This default means that the ANALYZE command skips a
given table if fewer than 10 percent of the table's rows have changed since the last ANALYZE. As a result,
you might choose to issue ANALYZE commands at the end of each ETL process. Taking this approach
means that ANALYZE is often skipped but also ensures that ANALYZE runs when needed.
ANALYZE statistics have the most impact for columns that are used in joins (for example, JOIN tbl_a
ON col_b) or as predicates (for example, WHERE col_b = 'xyz'). By default, ANALYZE collects
statistics for all columns in the table specified. If needed, you can reduce the time required to run
ANALYZE by running ANALYZE only for the columns where it has the most impact. You can run the
following SQL command to identify columns used as predicates. You can also let Amazon Redshift
choose which columns to analyze by specifying ANALYZE PREDICATE COLUMNS.
WITH predicate_column_info as (
SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num, a.attname
as col_name,
CASE
WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||')
WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')
WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')
WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')
ELSE NULL::varchar
END AS pred_ts
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_namespace ns ON c.relnamespace = ns.oid
JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)
SELECT schema_name, table_name, col_num, col_name,
pred_ts NOT LIKE '2000-01-01%' AS is_predicate,
CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts,
'||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,
CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts,
'||',2))::timestamp ELSE NULL::timestamp END as last_analyze
FROM predicate_column_info;
Enable short query acceleration
Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries. If you enable SQA, you can reduce or eliminate workload management (WLM) queues that are dedicated
to running short queries. In addition, long-running queries don't need to contend with short queries
for slots in a queue, so you can configure your WLM queues to use fewer query slots. When you use
lower concurrency, query throughput is increased and overall system performance is improved for most
workloads. For more information, see Working with short query acceleration (p. 442).
Analysis
Advisor checks for workload patterns and reports the number of recent queries where SQA would reduce
latency and the daily queue time for SQA-eligible queries.
Recommendation
Modify the WLM configuration to enable SQA. Amazon Redshift uses a machine learning algorithm
to analyze each eligible query. Predictions improve as SQA learns from your query patterns. For more
information, see Configuring Workload Management.
When you enable SQA, WLM sets the maximum run time for short queries to dynamic by default. We
recommend keeping the dynamic setting for SQA maximum run time.
Implementation tips
To check whether SQA is enabled, run the following query. If the query returns a row, then SQA is
enabled.
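One way to check, assuming SQA uses its default service class (14), is a query along these lines.
select * from stv_wlm_service_class_config
where service_class = 14;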
Alter distribution keys on tables
An appropriate DISTKEY places a similar number of rows on each node slice and is frequently referenced
in join conditions. An optimized join occurs when tables are joined on their DISTKEY columns,
accelerating query performance.
Analysis
Advisor analyzes your cluster’s workload to identify the most appropriate distribution key for the tables
that can significantly benefit from a KEY distribution style.
Recommendation
Advisor provides ALTER TABLE (p. 542) statements that alter the DISTSTYLE and DISTKEY of a table
based on its analysis. To realize a significant performance benefit, make sure to implement all SQL
statements within a recommendation group.
Redistributing a large table with ALTER TABLE consumes cluster resources and requires temporary table
locks at various times. Implement each recommendation group when other cluster workload is light.
For more details on optimizing table distribution properties, see the Amazon Redshift Engineering's
Advanced Table Design Playbook: Distribution Styles and Distribution Keys.
For more information about ALTER DISTSTYLE and DISTKEY, see ALTER TABLE (p. 542).
Note
If you don't see a recommendation, that doesn't necessarily mean that the current distribution
styles are the most appropriate. Advisor doesn't provide recommendations when there isn't
enough data or the expected benefit of redistribution is small.
Advisor recommendations apply to a particular table and don't necessarily apply to a table that
contains a column with the same name. Tables that share a column name can have different
characteristics for those columns unless data inside the tables is the same.
If you see recommendations for staging tables that are created or dropped by ETL jobs, modify
your ETL processes to use the Advisor recommended distribution keys.
Alter sort keys on tables
Sorting a table on an appropriate sort key can accelerate performance of queries, especially those with
range-restricted predicates, by requiring fewer table blocks to be read from disk.
Analysis
Advisor analyzes your cluster's workload over several days to identify a beneficial sort key for your tables.
Recommendation
Advisor provides two groups of ALTER TABLE statements that alter the sort key of a table based on its
analysis:
• Statements that alter a table that currently doesn't have a sort key to add a COMPOUND sort key.
• Statements that alter a sort key from INTERLEAVED to COMPOUND or no sort key.
Using compound sort keys significantly reduces maintenance overhead. Tables with compound sort
keys don't need the expensive VACUUM REINDEX operations that are necessary for interleaved sorts.
In practice, compound sort keys are more effective than interleaved sort keys for the vast majority of
Amazon Redshift workloads. However, if a table is small, it's more efficient not to have a sort key to
avoid sort key storage overhead.
When sorting a large table with ALTER TABLE, cluster resources are consumed and table locks are required at various times. Implement each recommendation when a cluster's workload is moderate. More
details on optimizing table sort key configurations can be found in the Amazon Redshift Engineering's
Advanced Table Design Playbook: Compound and Interleaved Sort Keys.
For more information about ALTER SORTKEY, see ALTER TABLE (p. 542).
Note
If you don't see a recommendation for a table, that doesn't necessarily mean that the current
configuration is the best. Advisor doesn't provide recommendations when there isn't enough
data or the expected benefit of sorting is small.
Advisor recommendations apply to a particular table and don’t necessarily apply to a table that
contains a column with the same name and data type. Tables that share column names can have
different recommendations based on the data in the tables and the workload.
Alter compression encodings on columns
Analysis
Advisor performs analysis of your cluster's workload and database schema continually to identify the
optimal compression encoding for each table column.
Recommendation
Advisor provides ALTER TABLE statements that change the compression encoding of particular columns,
based on its analysis.
Changing column compression encodings with ALTER TABLE (p. 542) consumes cluster resources and
requires table locks at various times. It's best to implement recommendations when the cluster workload
is light.
For reference, ALTER TABLE examples (p. 551) shows several statements that change the encoding for a
column.
Note
Advisor doesn't provide recommendations when there isn't enough data or the expected benefit
of changing the encoding is small.
Data type recommendations
Analysis
Advisor performs analysis of your cluster's workload and database schema continually to identify
columns that can benefit significantly from a data type change.
Recommendation
Advisor provides an ALTER TABLE statement that adds a new column with the suggested data type. An
accompanying UPDATE statement copies data from the existing column to the new column. After you
create the new column and load the data, change your queries and ingestion scripts to access the new
column. Then leverage features and functions specialized to the new data type, found in SQL functions
reference (p. 878).
Copying existing data to the new column can take time. We recommend that you implement each
advisor recommendation when the cluster’s workload is light. Reference the list of available data types at
Data types (p. 478).
Note that Advisor doesn't provide recommendations when there isn't enough data or the expected
benefit of changing the data type is small.
Enabling automatic table optimization
Automatic table optimization continuously observes how queries interact with tables. It uses advanced
artificial intelligence methods to choose sort and distribution keys to optimize performance for the
cluster's workload. If Amazon Redshift determines that applying a key improves cluster performance,
tables are automatically altered within hours from the time the cluster was created, with minimal impact
to queries.
To take advantage of this automation, an Amazon Redshift administrator creates a new table, or alters
an existing table to enable it to use automatic optimization. Existing tables with a distribution style
or sort key of AUTO are already enabled for automation. When you run queries against those tables,
Amazon Redshift determines if a sort key or distribution key will improve performance. If so, then
Amazon Redshift automatically modifies the table without requiring administrator intervention. If
a minimum number of queries are run, optimizations are applied within hours of the cluster being
launched.
If Amazon Redshift determines that a distribution key improves the performance of queries, tables where
distribution style is AUTO can have their distribution style changed to KEY.
Topics
• Enabling automatic table optimization (p. 37)
• Removing automatic table optimization from a table (p. 38)
• Monitoring actions of automatic table optimization (p. 38)
• Working with column compression (p. 39)
• Working with data distribution styles (p. 50)
• Working with sort keys (p. 62)
• Defining table constraints (p. 64)
To enable an existing table to be automatically optimized, use the ALTER statement options to change
the table to AUTO. You might choose to define automation for sort keys, but not for distribution keys
(and vice versa). If you run an ALTER statement to convert a table to be an automated table, existing sort
keys and distribution styles are preserved.
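For example, statements like the following (the table name is hypothetical) convert an existing table's sort key and distribution style to automatic management.
alter table mytable alter sortkey auto;
alter table mytable alter diststyle auto;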
Removing automatic table optimization
Initially, a table has no distribution key or sort key. The distribution style is set to either EVEN or ALL
depending on table size. As the table grows in size, Amazon Redshift applies the optimal distribution
keys and sort keys. Optimizations are applied within hours after a minimum number of queries are run.
When determining sort key optimizations, Amazon Redshift attempts to optimize the data blocks read
from disk during a table scan. When determining distribution style optimizations, Amazon Redshift tries
to optimize the number of bytes transferred between cluster nodes.
To change a sort key, you can define a sort key or choose none.
Monitoring actions of automatic table optimization
To view whether a table is defined for automatic optimization, query the system view SVV_TABLE_INFO.
Entries appear only for tables visible in the current session's database. Recommendations are inserted
into the view twice per day starting within hours from the time the cluster was created. After a
recommendation is available, it's started within an hour. After a recommendation has been applied
(either by Amazon Redshift or by you), it no longer appears in the view.
The system view SVL_AUTO_WORKER_ACTION shows an audit log of all the actions taken by Amazon Redshift, and the previous state of the table.
The system view SVV_TABLE_INFO lists all of the tables in the system, along with a column to indicate
whether the sort key and distribution style of the table is set to AUTO.
For more information about these system views, see System views (p. 1275).
Working with column compression
ENCODE AUTO is the default for tables. Amazon Redshift automatically manages compression encoding
for all columns in the table. You can specify the ENCODE AUTO option for the table to enable Amazon
Redshift to automatically manage compression encoding for all columns in the table. For more
information, see CREATE TABLE (p. 708) and ALTER TABLE (p. 542).
However, if you specify compression encoding for any column in the table, the table is no longer set
to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all
columns in the table.
You can apply a compression type, or encoding, to the columns in a table manually when you create the
table. Or you can use the COPY command to analyze and apply compression automatically. For more
information, see Let COPY choose compression encodings (p. 18). For details about applying automatic
compression, see Loading tables with automatic compression (p. 89).
Note
We strongly recommend using the COPY command to apply automatic compression.
You might choose to apply compression encodings manually if the new table shares the same data
characteristics as another table. Or you might do so if in testing you discover that the compression
encodings that are applied during automatic compression are not the best fit for your data. If you choose
to apply compression encodings manually, you can run the ANALYZE COMPRESSION (p. 566) command
against an already populated table and use the results to choose compression encodings.
To apply compression manually, you specify compression encodings for individual columns as part of the
CREATE TABLE statement. The syntax is as follows.
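A sketch of the general form is as follows.
create table table_name (column_name data_type encode encoding-type)[, ...];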
Here, encoding-type is taken from the keyword table in the following section.
For example, the following statement creates a two-column table, PRODUCT. When data is loaded into
the table, the PRODUCT_ID column is not compressed, but the PRODUCT_NAME column is compressed,
using the byte dictionary encoding (BYTEDICT).
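Such a statement might look like the following.
create table product(
product_id int encode raw,
product_name char(20) encode bytedict);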
You can specify the encoding for a column when it is added to a table using the ALTER TABLE command.
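For example, a column added with an explicit encoding might be declared as follows (the column name is illustrative).
alter table product add column product_category char(20) encode lzo;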
Topics
• Compression encodings (p. 40)
• Testing compression encodings (p. 46)
• Example: Choosing compression encodings for the CUSTOMER table (p. 48)
Compression encodings
A compression encoding specifies the type of compression that is applied to a column of data values as
rows are added to a table.
ENCODE AUTO is the default for tables. Amazon Redshift automatically manages compression encoding
for all columns in the table. You can specify the ENCODE AUTO option for the table to enable Amazon
Redshift to automatically manage compression encoding for all columns in the table. For more
information, see CREATE TABLE (p. 708) and ALTER TABLE (p. 542).
However, if you specify compression encoding for any column in the table, the table is no longer set
to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all
columns in the table.
If you specify compression encoding for any column in the table or if you don't specify the ENCODE
AUTO option for the table, Amazon Redshift automatically assigns compression encoding to columns for
which you don't specify compression encoding as follows:
• Columns that are defined as sort keys are assigned RAW compression.
• Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW
compression.
• Columns that are defined as SMALLINT, INTEGER, BIGINT, DECIMAL, DATE, TIMESTAMP, or
TIMESTAMPTZ data types are assigned AZ64 compression.
• Columns that are defined as CHAR or VARCHAR data types are assigned LZO compression.
The following table identifies the supported compression encodings and the data types that support the
encoding.
Raw encoding
Raw encoding is the default encoding for columns that are designated as sort keys and columns that are
defined as BOOLEAN, REAL, or DOUBLE PRECISION data types. With raw encoding, data is stored in raw,
uncompressed form.
AZ64 encoding
AZ64 is a proprietary compression encoding algorithm designed by Amazon to achieve a high
compression ratio and improved query processing. At its core, the AZ64 algorithm compresses smaller
groups of data values and uses single instruction, multiple data (SIMD) instructions for parallel
processing. Use AZ64 to achieve significant storage savings and high performance for numeric, date, and
time data types.
You can use AZ64 as the compression encoding when defining columns using CREATE TABLE and ALTER
TABLE statements with the following data types:
• SMALLINT
• INTEGER
• BIGINT
• DECIMAL
• DATE
• TIMESTAMP
• TIMESTAMPTZ
Byte-dictionary encoding
In byte dictionary encoding, a separate dictionary of unique values is created for each block of column
values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-
byte values that are stored as indexes to the original data values. If more than 256 values are stored in a
single block, the extra values are written into the block in raw, uncompressed form. The process repeats
for each disk block.
This encoding is very effective when a column contains a limited number of unique values. This encoding
is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding
is especially space-efficient if a CHAR column holds long character strings.
Note
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using
BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly
recommend using a different encoding, such as LZO, for VARCHAR columns.
Suppose that a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon
Redshift creates the dictionary and populates the COUNTRY column with the index value. The dictionary
contains the indexed unique values, and the table itself contains only the one-byte subscripts of the
corresponding values.
Note
Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column,
every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.
The following table represents the dictionary for the COUNTRY column.
Unique data value | Dictionary index | Size (fixed length, 30 bytes per value)
England | 0 | 30
United States of America | 1 | 30
Venezuela | 2 | 30
Sri Lanka | 3 | 30
Argentina | 4 | 30
Japan | 5 | 30
Total | | 180
Original data value | Original size (fixed length, 30 bytes per value) | Compressed value (index) | New size (bytes)
England | 30 | 0 | 1
England | 30 | 0 | 1
United States of America | 30 | 1 | 1
United States of America | 30 | 1 | 1
Venezuela | 30 | 2 | 1
Sri Lanka | 30 | 3 | 1
Argentina | 30 | 4 | 1
Japan | 30 | 5 | 1
Sri Lanka | 30 | 3 | 1
Argentina | 30 | 4 | 1
Total | 300 | | 10
The total compressed size in this example is calculated as follows: 6 different entries are stored in the
dictionary (6 * 30 = 180), and the table contains 10 1-byte compressed values, for a total of 190 bytes.
Delta encoding
Delta encodings are very useful for date and time columns.
Delta encoding compresses data by recording the difference between values that follow each other in the
column. This difference is recorded in a separate dictionary for each block of column values on disk. (An
Amazon Redshift disk block occupies 1 MB.) For example, suppose that the column contains 10 integers in sequence from 1 to 10. The first value is stored as a 4-byte integer (plus a 1-byte flag). The next nine values are each stored as a byte with the value 1, indicating that each is one greater than the previous value.
If most of the values in the column could be compressed by using a single byte, the 1-byte variation
is very effective. However, if the deltas are larger, this encoding, in the worst case, is somewhat less
effective than storing the uncompressed data. Similar logic applies to the 16-bit version.
If the difference between two values exceeds the 1-byte range (DELTA) or 2-byte range (DELTA32K), the
full original value is stored, with a leading 1-byte flag. The 1-byte range is from -127 to 127, and the 2-
byte range is from -32K to 32K.
The following table shows how delta encoding works for a numeric column.

Original data value | Original size (bytes) | Difference (delta) | Compressed value | Compressed size (bytes)
1 | 4 | | 1 | 1 + 4 (flag + actual value)
5 | 4 | 4 | 4 | 1
50 | 4 | 45 | 45 | 1
200 | 4 | 150 | 150 | 1 + 4 (flag + actual value)
185 | 4 | -15 | -15 | 1
220 | 4 | 35 | 35 | 1
221 | 4 | 1 | 1 | 1
Totals | 28 | | | 15
LZO encoding
LZO encoding provides a very high compression ratio with good performance. LZO encoding works
especially well for CHAR and VARCHAR columns that store very long character strings. They are
especially good for free-form text, such as product descriptions, user comments, or JSON strings.
Mostly encoding
Mostly encodings are useful when the data type for a column is larger than most of the stored values
require. By specifying a mostly encoding for this type of column, you can compress the majority of the
values in the column to a smaller standard storage size. The remaining values that cannot be compressed
are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to
8-bit storage.
In general, the mostly encodings work with the following data types:
• SMALLINT/INT2 (16-bit)
• INTEGER/INT (32-bit)
• BIGINT/INT8 (64-bit)
• DECIMAL/NUMERIC (64-bit)
Choose the appropriate variation of the mostly encoding to suit the size of the data type for the
column. For example, apply MOSTLY8 to a column that is defined as a 16-bit integer column. Applying
MOSTLY16 to a column with a 16-bit data type or MOSTLY32 to a column with a 32-bit data type is
disallowed.
Mostly encodings might be less effective than no compression when a relatively high number of the
values in the column cannot be compressed. Before applying one of these encodings to a column,
perform a check. Most of the values that you are going to load now (and are likely to load in the future)
should fit into the ranges shown in the following table.
Note
For decimal values, ignore the decimal point to determine whether the value fits into the range.
For example, 1,234.56 is treated as 123,456 and can be compressed in a MOSTLY32 column.
For example, the VENUEID column in the VENUE table is defined as a raw integer column, which means
that its values consume 4 bytes of storage. However, the current range of values in the column is 0 to
309. Therefore, recreating and reloading this table with MOSTLY16 encoding for VENUEID would reduce
the storage of every value in that column to 2 bytes.
If the VENUEID values referenced in another table were mostly in the range of 0 to 127, it might make
sense to encode that foreign-key column as MOSTLY8. Before making the choice, you have to run some
queries against the referencing table data to find out whether the values mostly fall into the 8-bit, 16-
bit, or 32-bit range.
The following table shows compressed sizes for specific numeric values when the MOSTLY8, MOSTLY16, and MOSTLY32 encodings are used. Blank cells indicate that the value is too large to be compressed by that encoding and is stored at its raw size.

Original value | Original size for INT or BIGINT (bytes) | MOSTLY8 compressed size (bytes) | MOSTLY16 compressed size (bytes) | MOSTLY32 compressed size (bytes)
1 | 4 | 1 | 2 | 4
10 | 4 | 1 | 2 | 4
100 | 4 | 1 | 2 | 4
20000 | 4 | | 2 | 4
100000 | 8 | | | 4
2000000000 | 8 | | | 4
Runlength encoding
Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run). For example, suppose that a column in a large dimension table has a predictably small domain, such as a COLOR column with fewer than 10 possible values. These values are likely to fall in long sequences throughout the table, even if the data is not sorted.
We don't recommend applying run length encoding on any column that is designated as a sort key.
Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns
are compressed much more highly than other columns in the same query, range-restricted scans might
perform poorly.
The following table uses the COLOR column example to show how the run length encoding works.
Original data value | Original size (bytes) | Compressed value (token) | Compressed size (bytes)
Blue | 4 | {2,Blue} | 5
Blue | 4 | | 0
Green | 5 | {3,Green} | 6
Green | 5 | | 0
Green | 5 | | 0
Blue | 4 | {1,Blue} | 5
Yellow | 6 | {4,Yellow} | 7
Yellow | 6 | | 0
Yellow | 6 | | 0
Yellow | 6 | | 0
Total | 51 | | 23
Text255 and Text32k encodings
Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often. For the text255 encoding, a separate dictionary of unique words is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the first 245 unique words in the column, and those words are replaced on disk by a one-byte index value; any words that are not in the dictionary are stored uncompressed. For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a specific number of words. Instead, the dictionary indexes each unique word it finds until the combined entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.
For example, consider the VENUENAME column in the VENUE table. Words such as Arena, Center, and
Theatre recur in this column and are likely to be among the first 245 words encountered in each block if
text255 compression is applied. If so, this column benefits from compression. This is because every time
those words appear, they occupy only 1 byte of storage (instead of 5, 6, or 7 bytes, respectively).
Zstandard encoding
Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse
datasets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long
and short strings, such as product descriptions, user comments, logs, and JSON strings. Where some
algorithms, such as Delta (p. 43) encoding or Mostly (p. 44) encoding, can potentially use more
storage space than no compression, ZSTD is very unlikely to increase disk usage.
ZSTD supports SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR,
VARCHAR, DATE, TIMESTAMP, and TIMESTAMPTZ data types.
Testing compression encodings
To perform a meaningful test of data compression, you need a large number of rows. For this example, we create a table and insert rows by using a statement that selects from two tables, VENUE and LISTING.
We leave out the WHERE clause that would normally join the two tables. The result is that each row
in the VENUE table is joined to all of the rows in the LISTING table, for a total of over 32 million rows.
This is known as a Cartesian join and normally is not recommended. However, for this purpose, it's a
convenient method of creating a lot of rows. If you have an existing table with data that you want to
test, you can skip this step.
After we have a table with sample data, we create a table with seven columns. Each has a different
compression encoding: raw, bytedict, lzo, run length, text255, text32k, and zstd. We populate each
column with exactly the same data by running an INSERT command that selects the data from the first
table.
1. (Optional) First, use a Cartesian join to create a table with a large number of rows. Skip this step if you
want to test an existing table.
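A sketch of this step, using the TICKIT sample tables (the table and column names here follow that sample schema), might be:
create table cartesian_venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);

insert into cartesian_venue
select venueid, venuename, venuecity, venuestate, venueseats
from venue, listing;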
2. Next, create a table with the encodings that you want to compare.
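A table with one VARCHAR column per encoding to be compared might be defined like this.
create table encodingvenue (
venueraw varchar(100) encode raw,
venuebytedict varchar(100) encode bytedict,
venuelzo varchar(100) encode lzo,
venuerunlength varchar(100) encode runlength,
venuetext255 varchar(100) encode text255,
venuetext32k varchar(100) encode text32k,
venuezstd varchar(100) encode zstd);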
3. Insert the same data into all of the columns using an INSERT statement with a SELECT clause.
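The insert copies the same VENUENAME values into every column, for example:
insert into encodingvenue
select venuename as venueraw, venuename as venuebytedict, venuename as venuelzo,
venuename as venuerunlength, venuename as venuetext255, venuename as venuetext32k,
venuename as venuezstd
from cartesian_venue;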
4. (Optional) Verify the number of rows that were inserted. The count looks similar to the following.
count
----------
38884394
(1 row)
5. Query the STV_BLOCKLIST (p. 1244) system table to compare the number of 1 MB disk blocks used
by each column.
The MAX aggregate function returns the highest block number for each column. The STV_BLOCKLIST
table includes details for three system-generated columns. This example uses col < 7 in the WHERE clause to exclude the system-generated columns.
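A query of this shape (assuming the hypothetical ENCODINGVENUE table created earlier) returns the highest block number for each of the seven columns.
select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl = p.id) and name = 'encodingvenue'
and col < 7
group by name, col
order by col;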
The query returns the following results. The columns are numbered beginning with zero. Depending
on how your cluster is configured, your result might have different numbers, but the relative sizes
should be similar. You can see that BYTEDICT encoding on the second column produced the best
results for this dataset. This approach has a compression ratio of better than 20:1. LZO and ZSTD
encoding also produced excellent results. Different data sets produce different results, of course.
When a column contains longer text strings, LZO often produces the best compression results.
col | max
-----+-----
0 | 203
1 | 10
2 | 22
3 | 204
4 | 56
5 | 72
6 | 20
(7 rows)
If you have data in an existing table, you can use the ANALYZE COMPRESSION (p. 566) command
to view the suggested encodings for the table. For example, the following example shows the
recommended encoding for a copy of the VENUE table, CARTESIAN_VENUE, that contains 38 million
rows. Notice that ANALYZE COMPRESSION recommends LZO encoding for the VENUENAME column.
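The command itself is a single statement; the suggested encodings and reduction percentages in its output depend on your data.
analyze compression cartesian_venue;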
ANALYZE COMPRESSION chooses optimal compression based on multiple factors, which include percent
of reduction. In this specific case, BYTEDICT provides better compression, but LZO also produces greater
than 90 percent compression.
Example: Choosing compression encodings for the CUSTOMER table
The following table shows the column encodings that were chosen for the CUSTOMER table and gives an explanation for the choices.
Working with data distribution styles
Topics
• Data distribution concepts (p. 51)
• Distribution styles (p. 51)
• Viewing distribution styles (p. 52)
• Evaluating query patterns (p. 53)
• Designating distribution styles (p. 54)
Data distribution concepts
An Amazon Redshift cluster is a set of nodes. Each node in the cluster has its own operating system,
dedicated memory, and dedicated disk storage. One node is the leader node, which manages the
distribution of data and query processing tasks to the compute nodes. The compute nodes provide
resources to do those tasks.
The disk storage for a compute node is divided into a number of slices. The number of slices per node
depends on the node size of the cluster. For example, each DS2.XL compute node has two slices, and
each DS2.8XL compute node has 16 slices. The nodes all participate in running parallel queries, working
on data that is distributed as evenly as possible across the slices. For more information about the
number of slices that each node size has, see About clusters and nodes in the Amazon Redshift Cluster
Management Guide.
Data redistribution
When you load data into a table, Amazon Redshift distributes the rows of the table to each of the
node slices according to the table's distribution style. As part of a query plan, the optimizer determines
where blocks of data need to be located to best run the query. The data is then physically moved, or
redistributed, while the query runs. Redistribution might involve either sending specific rows to nodes for
joining or broadcasting an entire table to all of the nodes.
Data redistribution can account for a substantial portion of the cost of a query plan, and the network
traffic it generates can affect other database operations and slow overall system performance. To the
extent that you anticipate where best to locate data initially, you can minimize the impact of data
redistribution.
When you load data into a table, Amazon Redshift distributes the table's rows to the compute nodes and
slices according to the distribution style that you chose when you created the table. Data distribution has
two primary goals:
• To distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data
distribution skew, forces some nodes to do more work than others, which impairs query performance.
• To minimize data movement as a query runs. If the rows that participate in joins or aggregates are
already collocated on the nodes with their joining rows in other tables, the optimizer doesn't need to
redistribute as much data when queries run.
The distribution strategy that you choose for your database has important consequences for query
performance, storage requirements, data loading, and maintenance. By choosing the best distribution
style for each table, you can balance your data distribution and significantly improve overall system
performance.
Distribution styles
When you create a table, you can designate one of four distribution styles: AUTO, EVEN, KEY, or ALL.
If you don't specify a distribution style, Amazon Redshift uses AUTO distribution.
AUTO distribution
With AUTO distribution, Amazon Redshift assigns an optimal distribution style based on the size of
the table data. For example, Amazon Redshift initially assigns ALL distribution to a small table, then
changes to EVEN distribution when the table grows larger. When a table is changed from ALL to
EVEN distribution, storage utilization might change slightly. The change in distribution occurs in the
background, in a few seconds.
When you set DISTSTYLE to AUTO, Amazon Redshift might change the distribution of your table
data to have a KEY-based distribution style. To view actions that Amazon Redshift automatically
performed to alter a table distribution key, see SVL_AUTO_WORKER_ACTION (p. 1376).
To view current recommendations regarding altering a table distribution key, see
SVV_ALTER_TABLE_RECOMMENDATIONS (p. 1419).
To view the distribution style applied to a table, query the PG_CLASS_INFO system catalog view. For
more information, see Viewing distribution styles (p. 52). If you don't specify a distribution style with
the CREATE TABLE statement, Amazon Redshift applies AUTO distribution.
EVEN distribution
The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in
any particular column. EVEN distribution is appropriate when a table doesn't participate in joins. It's also
appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
KEY distribution
The rows are distributed according to the values in one column. The leader node places matching values
on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates
the rows on the slices according to the values in the joining columns. This way, matching values from the
common columns are physically stored together.
ALL distribution
A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution places
only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for
every join that the table participates in.
ALL distribution multiplies the storage required by the number of nodes in the cluster, and so it takes
much longer to load, update, or insert data into multiple tables. ALL distribution is appropriate only for
relatively slow moving tables; that is, tables that are not updated frequently or extensively. Because the
cost of redistributing small tables during a query is low, there isn't a significant benefit in defining small
dimension tables as DISTSTYLE ALL.
Note
After you have specified a distribution style for a column, Amazon Redshift handles data
distribution at the cluster level. Amazon Redshift does not require or support the concept
of partitioning data within database objects. You don't need to create table spaces or define
partitioning schemes for tables.
In certain scenarios, you can change the distribution style of a table after it is created. For more
information, see ALTER TABLE (p. 542). For scenarios when you can't change the distribution style of a
table after it's created, you can recreate the table and populate the new table with a deep copy. For more
information, see Performing a deep copy (p. 102).
Viewing distribution styles
The RELEFFECTIVEDISTSTYLE column in PG_CLASS_INFO indicates the current distribution style for the
table. If the table uses automatic distribution, RELEFFECTIVEDISTSTYLE is 10, 11, or 12, which indicates
whether the effective distribution style is AUTO (ALL), AUTO (EVEN), or AUTO (KEY). If the table uses
automatic distribution, the distribution style might initially show AUTO (ALL), then change to AUTO
(EVEN) or AUTO (KEY) when the table grows.
The following table gives the distribution style for each value in the RELEFFECTIVEDISTSTYLE column:
RELEFFECTIVEDISTSTYLE | Distribution style
0                     | EVEN
1                     | KEY
8                     | ALL
10                    | AUTO (ALL)
11                    | AUTO (EVEN)
12                    | AUTO (KEY)
The DISTSTYLE column in SVV_TABLE_INFO indicates the current distribution style for the table. If the
table uses automatic distribution, DISTSTYLE is AUTO (ALL), AUTO (EVEN), or AUTO (KEY).
The following example creates four tables using the three distribution styles and automatic distribution,
then queries SVV_TABLE_INFO to view the distribution styles.
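The CREATE TABLE statements and query output for that example are not reproduced in this excerpt. A minimal sketch of what such an example might look like (the table and column names here are illustrative, not from the original):
create table public.t1(col1 int distkey, col2 int) diststyle key;  -- KEY distribution
create table public.t2(col1 int, col2 int) diststyle even;         -- EVEN distribution
create table public.t3(col1 int, col2 int) diststyle all;          -- ALL distribution
create table public.t4(col1 int, col2 int);                        -- no DISTSTYLE clause, so AUTO is applied

select "schema", "table", diststyle
from svv_table_info
where "table" in ('t1', 't2', 't3', 't4');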
Evaluating query patterns
Test your system with data that is as close to real data as possible.
In order to make good choices for distribution styles, you need to understand the query patterns for
your Amazon Redshift application. Identify the most costly queries in your system and base your initial
database design on the demands of those queries. Factors that determine the total cost of a query
include how long the query takes to run and how many computing resources it consumes. Other factors
that determine query cost are how often it is run, and how disruptive it is to other queries and database
operations.
Identify the tables that are used by the most costly queries, and evaluate their role in query execution.
Consider how the tables are joined and aggregated.
Use the guidelines in this section to choose a distribution style for each table. When you have done so,
create the tables and load them with data that is as close as possible to real data. Then test the tables
for the types of queries that you expect to use. You can evaluate the query explain plans to identify
tuning opportunities. Compare load times, storage space, and query execution times in order to balance
your system's overall requirements.
Designating distribution styles
1. Specify the primary key and foreign keys for all your tables.
Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer
uses them when it generates query plans. If you set primary keys and foreign keys, your application
must maintain the validity of the keys.
2. Distribute the fact table and its largest dimension table on their common columns.
Choose the largest dimension based on the size of the dataset that participates in the most common join,
not just the size of the table. If a table is commonly filtered using a WHERE clause, only a portion
of its rows participates in the join. Such a table has less impact on redistribution than a smaller table
that contributes more data. Designate both the dimension table's primary key and the fact table's
corresponding foreign key as the DISTKEY. If multiple tables use the same distribution key, they are also
collocated with the fact table. Your fact table can have only one distribution key. Any table that joins
on another key isn't collocated with the fact table.
3. Designate distribution keys for the other dimension tables.
Distribute the tables on their primary keys or their foreign keys, depending on how they most
commonly join with other tables.
4. Evaluate whether to change some of the dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, you
can improve query performance significantly by distributing the entire table to all of the nodes. Using
ALL distribution multiplies storage space requirements and increases load times and maintenance
operations, so you should weigh all factors before choosing ALL distribution. The following section
explains how to identify candidates for ALL distribution by evaluating the EXPLAIN plan.
5. Use EVEN distribution for the remaining tables.
If a table is largely denormalized and does not participate in joins, or if you don't have a clear choice
for another distribution style, use EVEN distribution.
To let Amazon Redshift choose the appropriate distribution style, don't explicitly specify a distribution
style.
Evaluating the query plan
After making your initial design decisions, create your tables, load them with data, and test them. Use
a test dataset that is as close as possible to the real data. Measure load times to use as a baseline for
comparisons.
Evaluate queries that are representative of the most costly queries you expect to run, specifically queries
that use joins and aggregations. Compare execution times for various design options. When you compare
execution times, don't count the first time the query is run, because the first runtime includes the
compilation time.
DS_DIST_NONE
No redistribution is required, because corresponding slices are collocated on the compute nodes.
You typically have only one DS_DIST_NONE step, the join between the fact table and one dimension
table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is
located on every node.
DS_DIST_INNER
The inner table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH
Both the inner and outer tables are redistributed.
DS_DIST_NONE and DS_DIST_ALL_NONE are good. They indicate that no distribution was required for
that step because all of the joins are collocated.
DS_DIST_INNER means that the step probably has a relatively high cost because the inner table is being
redistributed to the nodes. DS_DIST_INNER indicates that the outer table is already properly distributed
on the join key. Set the inner table's distribution key to the join key to convert this to DS_DIST_NONE.
In some cases, distributing the inner table on the join key isn't possible because the outer table isn't
distributed on the join key. If this is the case, evaluate whether to use ALL distribution for the inner table.
If the table isn't updated frequently or extensively, and it's large enough to carry a high redistribution
cost, change the distribution style to ALL and test again. ALL distribution causes increased load times, so
when you retest, include the load time in your evaluation factors.
DS_DIST_ALL_INNER is not good. It means the entire inner table is redistributed to a single slice because
the outer table uses DISTSTYLE ALL, so that a copy of the entire outer table is located on each node. This
results in inefficient serial execution of the join on a single node instead of taking advantage of parallel
execution using all of the nodes. DISTSTYLE ALL is meant to be used only for the inner join table.
Instead, specify a distribution key or use even distribution for the outer table.
DS_BCAST_INNER and DS_DIST_BOTH are not good. Usually these redistributions occur because the
tables are not joined on their distribution keys. If the fact table does not already have a distribution
key, specify the joining column as the distribution key for both tables. If the fact table already has a
distribution key on another column, evaluate whether changing the distribution key to collocate this join
improves overall performance. If changing the distribution key of the outer table isn't an optimal choice,
you can achieve collocation by specifying DISTSTYLE ALL for the inner table.
The following example shows a portion of a query plan with DS_BCAST_INNER and DS_DIST_NONE
labels.
After changing the dimension tables to use DISTSTYLE ALL, the query plan for the same query shows
DS_DIST_ALL_NONE in place of DS_BCAST_INNER. Also, there is a dramatic change in the relative cost
for the join steps.
Query plan example
Run the following query with an EXPLAIN command to produce a query plan.
explain
select lastname, catname, venuename, venuecity, venuestate, eventname,
month, sum(pricepaid) as buyercost, max(totalprice) as maxtotalprice
from category join event on category.catid = event.catid
join venue on venue.venueid = event.venueid
join sales on sales.eventid = event.eventid
join listing on sales.listid = listing.listid
join date on sales.dateid = date.dateid
join users on users.userid = sales.buyerid
group by lastname, catname, venuename, venuecity, venuestate, eventname, month
having sum(pricepaid)>9999
order by catname, buyercost desc;
In the TICKIT database, SALES is a fact table and LISTING is its largest dimension. In order to collocate
the tables, SALES is distributed on LISTID, which is the foreign key for LISTING, and LISTING is
distributed on its primary key, LISTID. The following example shows the CREATE TABLE commands for
SALES and LISTING.
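Those CREATE TABLE statements are not reproduced in this excerpt. A sketch of the relevant parts, with column definitions assumed from the TICKIT sample schema, looks like the following; the point is that LISTID is the distribution key on both tables:
create table listing(
listid integer not null distkey sortkey,   -- primary key, used as the distribution key
sellerid integer not null,
eventid integer not null,
dateid smallint not null,
numtickets smallint not null,
priceperticket decimal(8,2),
totalprice decimal(8,2),
listtime timestamp);

create table sales(
salesid integer not null,
listid integer not null distkey,           -- foreign key to LISTING, used as the distribution key
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);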
In the following query plan, the Merge Join step for the join on SALES and LISTING shows
DS_DIST_NONE, which indicates that no redistribution is required for the step. However, moving up
the query plan, the other inner joins show DS_BCAST_INNER, which indicates that the inner table is
broadcast as part of the query execution. Because only one pair of tables can be collocated using key
distribution, five tables need to be rebroadcast.
QUERY PLAN
XN Merge (cost=1015345167117.54..1015345167544.46 rows=1000 width=103)
Merge Key: category.catname, sum(sales.pricepaid)
-> XN Network (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
Send to leader
-> XN Sort (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
Sort Key: category.catname, sum(sales.pricepaid)
-> XN HashAggregate (cost=15345150568.37..15345152276.08 rows=170771
width=103)
Filter: (sum(pricepaid) > 9999.00)
-> XN Hash Join DS_BCAST_INNER (cost=742.08..15345146299.10
rows=170771 width=103)
Hash Cond: ("outer".catid = "inner".catid)
-> XN Hash Join DS_BCAST_INNER (cost=741.94..15342942456.61
rows=170771 width=97)
Hash Cond: ("outer".dateid = "inner".dateid)
-> XN Hash Join DS_BCAST_INNER
(cost=737.38..15269938609.81 rows=170766 width=90)
Hash Cond: ("outer".buyerid = "inner".userid)
-> XN Hash Join DS_BCAST_INNER
(cost=112.50..3272334142.59 rows=170771 width=84)
Hash Cond: ("outer".venueid = "inner".venueid)
-> XN Hash Join DS_BCAST_INNER
(cost=109.98..3167290276.71 rows=172456 width=47)
Hash Cond: ("outer".eventid =
"inner".eventid)
-> XN Merge Join DS_DIST_NONE
(cost=0.00..6286.47 rows=172456 width=30)
One solution is to recreate the tables with DISTSTYLE ALL. If you can't change a table's distribution style
after it is created, recreate the tables with a different distribution style by using a deep copy.
Run the following script to recreate USERS, VENUE, CATEGORY, DATE, EVENT. Don't make any changes to
SALES and LISTING.
-- Only the VENUE statement is shown; the column list is assumed from the TICKIT sample schema.
create table venue(
venueid smallint not null,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer,
primary key(venueid)) diststyle all;
Insert the data back into the tables and run an ANALYZE command to update the statistics.
analyze;
Run the same query with EXPLAIN again, and examine the new query plan. The joins now show
DS_DIST_ALL_NONE, indicating that no redistribution is required because the data was distributed to
every node using DISTSTYLE ALL.
QUERY PLAN
XN Merge (cost=1000000047117.54..1000000047544.46 rows=1000 width=103)
Merge Key: category.catname, sum(sales.pricepaid)
-> XN Network (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
Send to leader
Distribution examples
The following examples show how data is distributed according to the options that you define in the
CREATE TABLE statement.
DISTKEY examples
Look at the schema of the USERS table in the TICKIT database. USERID is defined as the SORTKEY
column and the DISTKEY column:
...
USERID is a good choice for the distribution column on this table. If you query the SVV_DISKUSAGE
system view, you can see that the table is very evenly distributed. Column numbers are zero-based, so
USERID is column 0.
The table contains 49,990 rows. The rows (num_values) column shows that each slice contains about the
same number of rows. The minvalue and maxvalue columns show the range of values on each slice. Each
slice includes nearly the entire range of values, so there's a good chance that every slice participates in
running a query that filters for a range of user IDs.
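The SVV_DISKUSAGE output itself is not shown in this excerpt. A query of roughly the following shape returns the per-slice row counts and value ranges described above (column 0 is USERID); the filter on num_values skips empty blocks:
select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name = 'users' and col = 0 and num_values > 0
order by slice, col;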
This example demonstrates distribution on a small test system. The total number of slices is typically
much higher.
If you commonly join or group using the STATE column, you might choose to distribute on the STATE
column. The following example shows a case where you create a new table with the same data as the
USERS table but set the DISTKEY to the STATE column. In this case, the distribution isn't as even. Slice
0 (13,587 rows) holds approximately 30 percent more rows than slice 3 (10,150 rows). In a much larger
table, this amount of distribution skew can have an adverse impact on query processing.
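The statements for that example are not shown in this excerpt. A sketch, assuming the new table is created as a copy of USERS named userskey, might look like the following; the same SVV_DISKUSAGE query then reveals the skew:
create table userskey distkey(state) as
select * from users;

select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name = 'userskey' and col = 0 and num_values > 0
order by slice, col;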
By contrast, if the table uses EVEN distribution instead of a distribution key, the rows are spread evenly
across the slices. However, because distribution is not based on a specific column, query processing can be degraded,
especially if the table is joined to other tables. The lack of distribution on a joining column often
influences the type of join operation that can be performed efficiently. Joins, aggregations, and grouping
operations are optimized when both tables are distributed and sorted on their respective joining
columns.
Working with sort keys
When you create a table, you can define one or more of its columns as sort keys. When data
is initially loaded into the empty table, the rows are stored on disk in sorted order. Information about
sort key columns is passed to the query planner, and the planner uses this information to construct plans
that exploit the way that the data is sorted. For more information, see CREATE TABLE (p. 708).
Sorting enables efficient handling of range-restricted predicates. Amazon Redshift stores columnar data
in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query
uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip
over large numbers of blocks during table scans. For example, suppose that a table stores five years of
data sorted by date and a query specifies a date range of one month. In this case, you can eliminate up
to 98 percent of the disk blocks from the scan. If the data is not sorted, more of the disk blocks (possibly
all of them) have to be scanned.
You can specify either a compound or interleaved sort key. A compound sort key is more efficient when
query predicates use a prefix, which is a subset of the sort key columns in order. An interleaved sort key
gives equal weight to each column in the sort key, so query predicates can use any subset of the columns
that make up the sort key, in any order.
To understand the impact of the chosen sort key on query performance, use the EXPLAIN (p. 762)
command. For more information, see Query planning and execution workflow (p. 383).
To define a sort type, use either the INTERLEAVED or COMPOUND keyword with your CREATE TABLE
or CREATE TABLE AS statement. The default is COMPOUND. COMPOUND is recommended
unless your tables are rarely updated with INSERT, UPDATE, or DELETE. An INTERLEAVED sort key
can use a maximum of eight columns. Depending on your data and cluster size, VACUUM REINDEX takes
significantly longer than VACUUM FULL because it makes an additional pass to analyze the interleaved
sort keys. The sort and merge operation can take longer for interleaved tables because the interleaved
sort might need to rearrange more rows than a compound sort.
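As an illustration of the syntax only (the table and column names below are not from the original), a compound and an interleaved sort key are declared as table attributes after the column list:
-- Compound sort key (the default sort style)
create table sales_by_date(
salesid integer not null,
dateid smallint not null,
pricepaid decimal(8,2))
compound sortkey(dateid, salesid);

-- Interleaved sort key
create table sales_multi_filter(
salesid integer not null,
dateid smallint not null,
pricepaid decimal(8,2))
interleaved sortkey(dateid, salesid);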
To view the sort keys for a table, query the SVV_TABLE_INFO (p. 1450) system view.
Topics
• Compound sort key (p. 63)
• Interleaved sort key (p. 63)
Compound sort key
Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions
that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join,
is feasible when the data is distributed and presorted on the joining columns. Compound sort keys also
help improve compression.
As you add rows to a sorted table that already contains data, the unsorted region grows, which has
a significant effect on performance. The effect is greater when the table uses interleaved sorting,
especially when the sort columns include data that increases monotonically, such as date or timestamp
columns. You should run a VACUUM operation regularly, especially after large data loads, to re-sort and
re-analyze the data. For more information, see Managing the size of the unsorted region (p. 111). After
vacuuming to resort the data, it's a good practice to run an ANALYZE command to update the statistical
metadata for the query planner. For more information, see Analyzing tables (p. 103).
Interleaved sort key
The performance improvements you gain by implementing an interleaved sort key should be weighed
against increased load and vacuum times.
Interleaved sorts are most effective with highly selective queries that filter on one or more of the sort
key columns in the WHERE clause, for example select c_name from customer where c_region
= 'ASIA'. The benefits of interleaved sorting increase with the number of sorted columns that are
restricted.
An interleaved sort is more effective with large tables. Sorting is applied on each slice. Thus, an
interleaved sort is most effective when a table is large enough to require multiple 1 MB blocks per slice.
Here, the query processor can skip a significant proportion of the blocks using restrictive predicates. To
view the number of blocks a table uses, query the STV_BLOCKLIST (p. 1244) system view.
When sorting on a single column, an interleaved sort might give better performance than a compound
sort if the column values have a long common prefix. For example, URLs commonly begin with "http://
www". Compound sort keys use a limited number of characters from the prefix, which results in a lot
of duplication of keys. Interleaved sorts use an internal compression scheme for zone map values that
enables them to better discriminate among column values that have a long common prefix.
VACUUM REINDEX
As you add rows to a sorted table that already contains data, performance might deteriorate over
time. This deterioration occurs for both compound and interleaved sorts, but it has a greater effect on
interleaved tables. A VACUUM restores the sort order, but the operation can take longer for interleaved
tables because merging new interleaved data might involve modifying every data block.
When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort
key columns and uses that information for optimal interleaving of the sort key columns. As a table
grows, the distribution of the values in the sort key columns can change, or skew, especially with date or
timestamp columns. If the skew becomes too large, performance might be affected. To re-analyze the
sort keys and restore performance, run the VACUUM command with the REINDEX keyword. Because it
needs to take an extra analysis pass over the data, VACUUM REINDEX can take longer than a standard
VACUUM for interleaved tables. To view information about key distribution skew and last reindex time,
query the SVV_INTERLEAVED_COLUMNS (p. 1434) system view.
For more information about how to determine how often to run VACUUM and when to run a VACUUM
REINDEX, see Deciding whether to reindex (p. 110).
Defining table constraints
Uniqueness, primary key, and foreign key constraints are informational only; Amazon Redshift does not
enforce them, but the query planner uses them. For example, the query planner uses primary and foreign
keys in certain statistical computations. It does this to infer uniqueness and referential relationships that
affect subquery decorrelation techniques. By doing this, it can order large numbers of joins and eliminate
redundant joins.
The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are
valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return
incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key
is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand,
you should always declare primary and foreign keys and uniqueness constraints when you know that
they are valid.
For more information about table constraints, see CREATE TABLE (p. 708).
Loading data
Topics
• Using a COPY command to load data (p. 65)
• Updating tables with DML commands (p. 96)
• Updating and inserting new data (p. 96)
• Performing a deep copy (p. 102)
• Analyzing tables (p. 103)
• Vacuuming tables (p. 108)
• Managing concurrent write operations (p. 116)
• Tutorial: Loading data from Amazon S3 (p. 122)
A COPY command is the most efficient way to load a table. You can also add data to your tables using
INSERT commands, though it is much less efficient than using COPY. The COPY command is able to
read from multiple data files or multiple data streams simultaneously. Amazon Redshift allocates the
workload to the cluster nodes and performs the load operations in parallel, including sorting the rows
and distributing data across node slices.
Note
Amazon Redshift Spectrum external tables are read-only. You can't COPY or INSERT to an
external table.
To access data on other AWS resources, your cluster must have permission to access those resources and
to perform the necessary actions to access the data. You can use Identity and Access Management (IAM)
to limit the access users have to your cluster resources and data.
After your initial data load, if you add, modify, or delete a significant amount of data, you should follow
up by running a VACUUM command to reorganize your data and reclaim space after deletes. You should
also run an ANALYZE command to update table statistics.
This section explains how to load data and troubleshoot data loads and presents best practices for
loading data.
Using a COPY command to load data
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture
to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output
from one or more remote hosts.
Note
We strongly recommend using the COPY command to load large amounts of data. Using
individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if
your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT
or CREATE TABLE AS to improve performance. For information, see INSERT (p. 780) or CREATE
TABLE AS (p. 723).
To load data from another AWS resource, your cluster must have permission to access the resource and
perform the necessary actions.
To grant or revoke the privilege to load data into a table using a COPY command, grant or revoke the
INSERT privilege.
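For example, a statement of the following form (the table and user names are illustrative) lets a user load the SALES table with COPY:
grant insert on table sales to dbuser;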
Your data needs to be in the proper format for loading into your Amazon Redshift table. This section
presents guidelines for preparing and verifying your data before the load and for validating a COPY
statement before you run it.
To protect the information in your files, you can encrypt the data files before you upload them to your
Amazon S3 bucket; COPY will decrypt the data as it performs the load. You can also limit access to your
load data by providing temporary security credentials to users. Temporary security credentials provide
enhanced security because they have short life spans and cannot be reused after they expire.
Amazon Redshift has features built in to COPY to load uncompressed, delimited data quickly. But you
can compress your files using gzip, lzop, or bzip2 to save time uploading the files.
If the following keywords are in the COPY query, automatic splitting of uncompressed data is not
supported: ESCAPE, REMOVEQUOTES, and FIXEDWIDTH. But the CSV keyword is supported.
To help keep your data secure in transit within the AWS cloud, Amazon Redshift uses hardware
accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup,
and restore operations.
When you load your table directly from an Amazon DynamoDB table, you have the option to control the
amount of Amazon DynamoDB provisioned throughput you consume.
You can optionally let COPY analyze your input data and automatically apply optimal compression
encodings to your table as part of the load process.
Credentials and access permissions
To obtain authorization to access a resource, your cluster must be authenticated. You can choose either
role-based access control or key-based access control. This section presents an overview of the two
methods. For complete details and examples, see Permissions to access other AWS Resources (p. 611).
We recommend using role-based access control because it provides more secure, fine-grained control
of access to AWS resources and sensitive user data, in addition to safeguarding your AWS credentials.
To use role-based access control, you must first create an IAM role using the Amazon Redshift service
role type, and then attach the role to your cluster. The role must have, at a minimum, the permissions
listed in IAM permissions for COPY, UNLOAD, and CREATE LIBRARY (p. 614). For steps to create an IAM
role and attach it to your cluster, see Creating an IAM Role to Allow Your Amazon Redshift Cluster to
Access AWS Services in the Amazon Redshift Cluster Management Guide.
You can add a role to a cluster or view the roles associated with a cluster by using the Amazon Redshift
Management Console, CLI, or API. For more information, see Authorizing COPY and UNLOAD Operations
Using IAM Roles in the Amazon Redshift Cluster Management Guide.
When you create an IAM role, IAM returns an Amazon Resource Name (ARN) for the role. To run a COPY
command using an IAM role, provide the role ARN using the IAM_ROLE parameter or the CREDENTIALS
parameter.
The following COPY command example uses the IAM_ROLE parameter with the role MyRedshiftRole for
authentication.
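That example is not reproduced at this point in the excerpt; it follows the same pattern as the COPY examples shown later in this section, roughly:
copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';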
The AWS IAM user must have, at a minimum, the permissions listed in IAM permissions for COPY,
UNLOAD, and CREATE LIBRARY (p. 614).
Preparing your input data
Use the following guidelines to help ensure that your input data is valid:
• Your data can only contain UTF-8 characters up to four bytes long.
• Verify that CHAR and VARCHAR strings are no longer than the lengths of the corresponding columns.
VARCHAR strings are measured in bytes, not characters, so, for example, a four-character string of
Chinese characters that occupy four bytes each requires a VARCHAR(16) column.
• Multibyte characters can only be used with VARCHAR columns. Verify that multibyte characters are no
more than four bytes long.
• Verify that data for CHAR columns only contains single-byte characters.
• Do not include any special characters or syntax to indicate the last field in a record. This field can be a
delimiter.
• If your data includes null terminators, also referred to as NUL (UTF-8 0000) or binary zero (0x000),
you can load these characters as NULLS into CHAR or VARCHAR columns by using the NULL AS
option in the COPY command: null as '\0' or null as '\000' . If you do not use NULL AS, null
terminators will cause your COPY to fail.
• If your strings contain special characters, such as delimiters and embedded newlines, use the ESCAPE
option with the COPY (p. 576) command.
• Verify that all single and double quotation marks are appropriately matched.
• Verify that floating-point strings are in either standard floating-point format, such as 12.123, or an
exponential format, such as 1.0E4.
• Verify that all timestamp and date strings follow the specifications for DATEFORMAT and
TIMEFORMAT strings (p. 620). The default timestamp format is YYYY-MM-DD hh:mm:ss, and the
default date format is YYYY-MM-DD.
• For more information about boundaries and limitations on individual data types, see Data
types (p. 478). For information about multibyte character errors, see Multibyte character load
errors (p. 94)
Loading data from Amazon S3
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture
to read and load data in parallel from a file or multiple files in an Amazon S3 bucket. You can take
maximum advantage of parallel processing by splitting your data into multiple files, in cases where the
files are compressed. You can also take maximum advantage of parallel processing by setting distribution
keys on your tables. For more information about distribution keys, see Working with data distribution
styles (p. 50).
Data is loaded into the target table, one line per row. The fields in the data file are matched to table
columns in order, left to right. Fields in the data files can be fixed-width or character delimited; the
default delimiter is a pipe (|). By default, all the table columns are loaded, but you can optionally define
a comma-separated list of columns. If a table column is not included in the column list specified in the
COPY command, it is loaded with a default value. For more information, see Loading default column
values (p. 91).
Split your data into files so that the number of files is a multiple of the number of slices in your cluster.
That way, Amazon Redshift can divide the data evenly among the slices. The number of slices per node
depends on the node size of the cluster. For example, each ds2.xl compute node has two slices, and each
ds2.8xl compute node has 16 slices. For more information about the number of slices that each node size
has, see About clusters and nodes in the Amazon Redshift Cluster Management Guide.
The nodes all participate in running parallel queries, working on data that is distributed as evenly as
possible across the slices. If you have a cluster with two ds2.xl nodes, you might split your data into four
files or some multiple of four. Amazon Redshift doesn't take file size into account when dividing the
workload. Thus, you need to ensure that the files are roughly the same size, from 1 MB to 1 GB after
compression.
To use object prefixes to identify the load files, name each file with a common prefix. For example, the
venue.txt file might be split into four files, as follows.
venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4
If you put multiple files in a folder in your bucket and specify the folder name as the prefix, COPY loads
all of the files in the folder. If you explicitly list the files to be loaded by using a manifest file, the files can
reside in different buckets or folders.
For more information about manifest files, see Example: COPY from Amazon S3 using a
manifest (p. 624).
Automatic, parallel data loading isn't supported when the COPY query includes any of the following
keywords: ESCAPE, REMOVEQUOTES, and FIXEDWIDTH.
Data from the file or files is loaded into the target table, one line per row. The fields in the data file are
matched to table columns in order, left to right. Fields in the data files can be fixed-width or character
delimited; the default delimiter is a pipe (|). By default, all the table columns are loaded, but you can
optionally define a comma-separated list of columns. If a table column isn't included in the column list
specified in the COPY command, it's loaded with a default value. For more information, see Loading
default column values (p. 91).
Follow this general process to load data from Amazon S3, when your data is uncompressed and
delimited:
1. Split your data into multiple files.
2. Upload your files to Amazon S3.
3. Run a COPY command to load the table.
4. Verify that the data was loaded correctly.
For examples of COPY commands, see COPY examples (p. 622). For information about data loaded into
Amazon Redshift, check the STL_LOAD_COMMITS (p. 1304) and STL_LOAD_ERRORS (p. 1307) system
tables.
For more information about nodes and the slices contained in each, see About clusters and nodes in the
Amazon Redshift Cluster Management Guide.
There are a couple of approaches to take when uploading text files to Amazon S3:
• If you have compressed files, we recommend that you split large files to take advantage of parallel
processing in Amazon Redshift.
• On the other hand, COPY automatically splits large, uncompressed, text-delimited file data to facilitate
parallelism and effectively distribute the data from large files.
Create an Amazon S3 bucket to hold your data files, and then upload the data files to the bucket. For
information about creating buckets and uploading files, see Working with Amazon S3 Buckets in the
Amazon Simple Storage Service User Guide.
Important
The Amazon S3 bucket that holds the data files must be created in the same AWS Region as
your cluster unless you use the REGION (p. 583) option to specify the Region in which the
Amazon S3 bucket is located.
You can create an Amazon S3 bucket in a specific Region either by selecting the Region when you create
the bucket by using the Amazon S3 console, or by specifying an endpoint when you create the bucket
using the Amazon S3 API or CLI.
Following the data load, verify that the correct files are present on Amazon S3.
Server-side encryption
Server-side encryption is data encryption at rest—that is, Amazon S3 encrypts your data as it uploads
it and decrypts it for you when you access it. When you load tables using a COPY command, there is no
difference in the way you load from server-side encrypted or unencrypted objects on Amazon S3. For
more information about server-side encryption, see Using Server-Side Encryption in the Amazon Simple
Storage Service User Guide.
Client-side encryption
In client-side encryption, your client application manages encryption of your data, the encryption keys,
and related tools. You can upload data to an Amazon S3 bucket using client-side encryption, and then
load the data using the COPY command with the ENCRYPTED option and a private encryption key to
provide greater security.
You encrypt your data using envelope encryption. With envelope encryption, your application handles
all encryption exclusively. Your private encryption keys and your unencrypted data are never sent to
AWS, so it's very important that you safely manage your encryption keys. If you lose your encryption
keys, you won't be able to unencrypt your data, and you can't recover your encryption keys from AWS.
Envelope encryption combines the performance of fast symmetric encryption while maintaining the
greater security that key management with asymmetric keys provides. A one-time-use symmetric key
(the envelope symmetric key) is generated by your Amazon S3 encryption client to encrypt your data,
then that key is encrypted by your root key and stored alongside your data in Amazon S3. When Amazon
Redshift accesses your data during a load, the encrypted symmetric key is retrieved and decrypted with
your root key, then the data is decrypted.
To work with Amazon S3 client-side encrypted data in Amazon Redshift, follow the steps outlined in
Protecting Data Using Client-Side Encryption in the Amazon Simple Storage Service User Guide, with the
additional requirements that you use:
• Symmetric encryption – The AWS SDK for Java AmazonS3EncryptionClient class uses envelope
encryption, described preceding, which is based on symmetric key encryption. Use this class to create
an Amazon S3 client to upload client-side encrypted data.
• A 256-bit AES root symmetric key – A root key encrypts the envelope key. You pass the root key to
your instance of the AmazonS3EncryptionClient class. Save this key, because you will need it to
copy data into Amazon Redshift.
• Object metadata to store encrypted envelope key – By default, Amazon S3 stores the envelope key
as object metadata for the AmazonS3EncryptionClient class. The encrypted envelope key that is
stored as object metadata is used during the decryption process.
Note
If you get a cipher encryption error message when you use the encryption API for the first time,
your version of the JDK may have a Java Cryptography Extension (JCE) jurisdiction policy file
that limits the maximum key length for encryption and decryption transformations to 128 bits.
For information about addressing this issue, go to Specifying Client-Side Encryption Using the
AWS SDK for Java in the Amazon Simple Storage Service User Guide.
For information about loading client-side encrypted files into your Amazon Redshift tables using the
COPY command, see Loading encrypted data files from Amazon S3 (p. 75).
The second option shows the choices you must make during client-side encryption so that the data
can be loaded in Amazon Redshift. Specifically, the example shows using object metadata to store the
encrypted envelope key and the use of a 256-bit AES root symmetric key.
This example provides code using the AWS SDK for Java to create a 256-bit AES symmetric root
key and save it to a file. Then the example uploads an object to Amazon S3 using an S3 encryption client
that first encrypts sample data on the client side. The example also downloads the object and verifies
that the data is the same.
If you want to control specifically which files are loaded, you can use a manifest file to
explicitly list the data files. For more information about using a manifest file, see the
copy_from_s3_manifest_file (p. 582) option for the COPY command and Example: COPY from Amazon
S3 using a manifest (p. 624) in the COPY examples.
For more information about listing the contents of the bucket, see Listing Object Keys in the Amazon S3
Developer Guide.
Use the COPY (p. 576) command to load a table in parallel from data files on Amazon S3. You can
specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.
The manifest file is a JSON-formatted file that lists the data files to be loaded. The syntax to specify the
files to be loaded by using a manifest file is as follows:
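The syntax is not reproduced in this excerpt; it is essentially the standard COPY syntax with the MANIFEST option, along these lines (the angle-bracket values are placeholders):
copy <table_name>
from 's3://<bucket_name>/<manifest_file>'
iam_role '<iam-role-arn>'
manifest;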
The table to be loaded must already exist in the database. For information about creating a table, see
CREATE TABLE (p. 708) in the SQL Reference.
The values for authorization provide the AWS authorization your cluster needs to access the Amazon
S3 objects. For information about required permissions, see IAM permissions for COPY, UNLOAD,
and CREATE LIBRARY (p. 614). The preferred method for authentication is to specify the IAM_ROLE
parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary
permissions. For more information, see Role-based access control (p. 611) .
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
For more information about other authorization options, see Authorization parameters (p. 590)
If you want to validate your data without actually loading the table, use the NOLOAD option with the
COPY (p. 576) command.
The following example shows the first few rows of pipe-delimited data in a file named venue.txt.
1|Toyota Park|Bridgeview|IL|0
2|Columbus Crew Stadium|Columbus|OH|0
3|RFK Stadium|Washington|DC|0
Before uploading the file to Amazon S3, split the file into multiple files so that the COPY command
can load it using parallel processing. The number of files should be a multiple of the number of slices
in your cluster. Split your load data files so that the files are about equal size, between 1 MB and 1
GB after compression. For more information, see Loading data from compressed and uncompressed
files (p. 68).
For example, the venue.txt file might be split into four files, as follows:
venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4
The following COPY command loads the VENUE table using the pipe-delimited data in the data files with
the prefix 'venue' in the Amazon S3 bucket mybucket.
Note
The Amazon S3 bucket mybucket in the following examples does not exist. For sample COPY
commands that use real data in an existing Amazon S3 bucket, see Load sample data.
If no Amazon S3 objects with the key prefix 'venue' exist, the load fails.
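A sketch of that command (the bucket and role ARN are placeholders, and the pipe delimiter is the default):
copy venue
from 's3://mybucket/venue'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';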
For more information about manifest files, see the COPY example Using a manifest to specify data
files (p. 624).
The following example shows the JSON to load files from different buckets and with file names that
begin with date stamps.
{
"entries": [
{"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
{"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
]
}
The optional mandatory flag specifies whether COPY should return an error if the file is not found. The
default of mandatory is false. Regardless of any mandatory settings, COPY will terminate if no files
are found.
The following example runs the COPY command with the manifest in the previous example, which is
named cust.manifest.
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
A manifest created by an UNLOAD (p. 850) operation using the MANIFEST parameter might have keys
that are not required for the COPY operation. For example, the following UNLOAD manifest includes a
meta key that is required for an Amazon Redshift Spectrum external table and for loading data files
in an ORC or Parquet file format. The meta key contains a content_length key with a value that
is the actual size of the file in bytes. The COPY operation requires only the url key and an optional
mandatory key.
{
"entries": [
{"url":"s3://mybucket/unload/manifest_0000_part_00", "meta": { "content_length":
5956875 }},
{"url":"s3://mybucket/unload/unload/manifest_0001_part_00", "meta": { "content_length":
5997091 }}
]
}
For more information about manifest files, see Example: COPY from Amazon S3 using a
manifest (p. 624).
COPY does not support files compressed using the lzop --filter option.
For example, the following command loads from files that were compressed using lzop.
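That command is not shown here; a sketch with placeholder paths, assuming pipe-delimited data compressed with lzop:
copy customer
from 's3://mybucket/customer.lzo'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
lzop;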
To load a fixed-width data file into an existing table, use the FIXEDWIDTH parameter in the COPY
command. Your table specifications must match the value of fixedwidth_spec in order for the data to
load correctly.
To load fixed-width data from a file to a table, issue the following command:
copy table_name from 's3://mybucket/prefix' iam_role '<iam-role-arn>'
fixedwidth 'fixedwidth_spec';
The fixedwidth_spec parameter is a string that contains an identifier for each column and the width of
each column, separated by a colon. The column:width pairs are delimited by commas. The identifier
can be anything that you choose: numbers, letters, or a combination of the two. The identifier has no
relation to the names of the table columns, so the specification must contain the columns in the same
order as the table definition.
The following two examples show the same specification, with the first using numeric identifiers and the
second using string identifiers:
'0:3,1:25,2:12,3:2,4:6'
'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6'
The following example shows fixed-width sample data that could be loaded into the VENUE table using
the above specifications:
The following COPY command loads this data set into the VENUE table:
copy venue
from 's3://mybucket/data/venue_fw.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';
To check which encoding an input file uses, use the Linux file command:
$ file ordersdata.txt
ordersdata.txt: ASCII English text
$ file uni_ordersdata.dat
uni_ordersdata.dat: UTF-8 Unicode text
The COPY command doesn't support the following types of Amazon S3 encryption:
For more information about Amazon S3 encryption, see Protecting Data Using Server-Side Encryption
and Protecting Data Using Client-Side Encryption in the Amazon Simple Storage Service User Guide.
The UNLOAD (p. 850) command automatically encrypts files using SSE-S3. You can also unload using
SSE-KMS or client-side encryption with a customer-managed symmetric key. For more information, see
Unloading encrypted data files (p. 145)
The COPY command automatically recognizes and loads files encrypted using SSE-S3 and SSE-KMS. You
can load files encrypted using a client-side symmetric root key by specifying the ENCRYPTED option and
providing the key value. For more information, see Uploading encrypted data to Amazon S3 (p. 70).
To load client-side encrypted data files, provide the root key value using the MASTER_SYMMETRIC_KEY
parameter and include the ENCRYPTED option.
To load encrypted data files that are gzip, lzop, or bzip2 compressed, include the GZIP, LZOP, or BZIP2
option along with the root key value and the ENCRYPTED option.
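Putting these options together, a COPY of client-side encrypted, lzop-compressed files might look like the following sketch (the paths and key value are placeholders):
copy customer
from 's3://mybucket/encrypted/customer'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root-key-value>'
encrypted
delimiter '|'
lzop;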
Loading data from Amazon EMR
The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must
have the necessary permissions.
• Step 2: Create an Amazon EMR cluster (p. 77)
Configure the cluster to output text files to the Hadoop Distributed File System (HDFS). You will need
the Amazon EMR cluster ID and the cluster's master public DNS (the endpoint for the Amazon EC2
instance that hosts the cluster).
• Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses (p. 78)
The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the hosts.
You will use the IP address for each cluster node to configure the host security groups to permit access
from your Amazon Redshift cluster using these IP addresses.
• Step 4: Add the Amazon Redshift cluster public key to each Amazon EC2 host's authorized keys
file (p. 79)
You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will
recognize the Amazon Redshift cluster and accept the SSH connection.
• Step 5: Configure the hosts to accept all of the Amazon Redshift cluster's IP addresses (p. 80)
Modify the Amazon EMR instance's security groups to add ingress rules to accept the Amazon Redshift
IP addresses.
• Step 6: Run the COPY command to load the data (p. 80)
From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift
table.
1. Add the following permissions for the IAM user that will create the Amazon EMR cluster.
ec2:DescribeSecurityGroups
ec2:RevokeSecurityGroupIngress
ec2:AuthorizeSecurityGroupIngress
redshift:DescribeClusters
2. Add the following permission for the IAM role or IAM user that will run the COPY command.
elasticmapreduce:ListInstances
3. Add the following permission to the Amazon EMR cluster's IAM role.
redshift:DescribeClusters
1. Create an Amazon EMR cluster in the same AWS Region as the Amazon Redshift cluster.
If the Amazon Redshift cluster is in a VPC, the Amazon EMR cluster must be in the same VPC group.
If the Amazon Redshift cluster uses EC2-Classic mode (that is, it is not in a VPC), the Amazon EMR
cluster must also use EC2-Classic mode. For more information, see Managing Clusters in Virtual
Private Cloud (VPC) in the Amazon Redshift Cluster Management Guide.
2. Configure the cluster to output data files to the cluster's HDFS. The HDFS file names must not
include asterisks (*) or question marks (?).
Important
The file names must not include asterisks ( * ) or question marks ( ? ).
3. Specify No for the Auto-terminate option in the Amazon EMR cluster configuration so that the
cluster remains available while the COPY command runs.
Important
If any of the data files are changed or deleted before the COPY completes, you might have
unexpected results, or the COPY operation might fail.
4. Note the cluster ID and the master public DNS (the endpoint for the Amazon EC2 instance that hosts
the cluster). You will use that information in later steps.
Note the Cluster Public Key and Node IP addresses. You will use them in later steps.
You will use the Private IP addresses in Step 3 to configure the Amazon EC2 host to accept the
connection from Amazon Redshift.
To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift CLI, run the describe-clusters command. For example:
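The command itself is not reproduced at this point in the excerpt; it takes the following form, where the cluster identifier is a placeholder:
aws redshift describe-clusters --cluster-identifier <cluster-identifier>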
The response will include a ClusterPublicKey value and the list of private and public IP addresses, similar
to the following:
{
"Clusters": [
{
"VpcSecurityGroups": [],
"ClusterStatus": "available",
"ClusterNodes": [
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "LEADER",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-0",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-1",
"PublicIPAddress": "10.nnn.nnn.nnn"
}
],
"AutomatedSnapshotRetentionPeriod": 1,
"PreferredMaintenanceWindow": "wed:05:30-wed:06:00",
"AvailabilityZone": "us-east-1a",
"NodeType": "ds2.xlarge",
"ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift",
...
...
}
To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift API, use the DescribeClusters action. For more information, see describe-clusters in the
Amazon Redshift CLI Guide or DescribeClusters in the Amazon Redshift API Guide.
To add the Amazon Redshift cluster public key to the host's authorized keys file
1. Access the host using an SSH connection. For information about connecting to an instance using
SSH, see Connect to Your Instance in the Amazon EC2 User Guide.
2. Copy the Amazon Redshift public key from the console or from the CLI response text.
3. Copy and paste the contents of the public key into the /home/<ssh_username>/.ssh/
authorized_keys file on the host. Include the complete string, including the prefix "ssh-rsa "
and suffix "Amazon-Redshift". For example:
In the COPY command, specify the Amazon EMR cluster ID and the HDFS file path and file name.
copy sales
from 'emr://myemrclusterid/myoutput/part*' credentials
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
You can use the wildcard characters asterisk ( * ) and question mark ( ? ) as part of the file name
argument. For example, part* loads the files part-0000, part-0001, and so on. If you specify only a
folder name, COPY attempts to load all files in the folder.
Important
If you use wildcard characters or use only the folder name, verify that no unwanted files will be
loaded or the COPY command will fail. For example, some processes might write a log file to the
output folder.
Loading data from remote hosts
The remote host can be an Amazon EC2 Linux instance or another Unix or Linux computer configured
to accept SSH connections. This guide assumes your remote host is an Amazon EC2 instance. Where the
procedure is different for another computer, the guide will point out the difference.
Amazon Redshift can connect to multiple hosts, and can open multiple SSH connections to each host.
Amazon Redshift sends a unique command through each connection to generate text output to the
host's standard output, which Amazon Redshift then reads as it would a text file.
• One or more host machines, such as Amazon EC2 instances, that you can connect to using SSH.
• Data sources on the hosts.
You will provide commands that the Amazon Redshift cluster will run on the hosts to generate the text
output. After the cluster connects to a host, the COPY command runs the commands, reads the text
from the hosts' standard output, and loads the data in parallel into an Amazon Redshift table. The text
output must be in a form that the COPY command can ingest. For more information, see Preparing
your input data (p. 67)
• Access to the hosts from your computer.
For an Amazon EC2 instance, you will use an SSH connection to access the host. You will need to access
the host to add the Amazon Redshift cluster's public key to the host's authorized keys file.
• A running Amazon Redshift cluster.
For information about how to launch a cluster, see Amazon Redshift Getting Started Guide.
The process of loading data from remote hosts consists of the following steps.
• Step 1: Retrieve the cluster public key and cluster node IP addresses (p. 82)
The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the remote
hosts. You will use the IP address for each cluster node to configure the host security groups or firewall
to permit access from your Amazon Redshift cluster using these IP addresses.
• Step 2: Add the Amazon Redshift cluster public key to the host's authorized keys file (p. 83)
You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will
recognize the Amazon Redshift cluster and accept the SSH connection.
• Step 3: Configure the host to accept all of the Amazon Redshift cluster's IP addresses (p. 83)
For Amazon EC2, modify the instance's security groups to add ingress rules to accept the Amazon
Redshift IP addresses. For other hosts, modify the firewall so that your Amazon Redshift nodes are
able to establish SSH connections to the remote host.
• Step 4: Get the public key for the host (p. 84)
You can optionally specify that Amazon Redshift should use the public key to identify the host. You
will need to locate the public key and copy the text into your manifest file.
• Step 5: Create a manifest file (p. 84)
The manifest is a JSON-formatted text file with the details Amazon Redshift needs to connect to the
hosts and fetch the data.
• Step 6: Upload the manifest file to an Amazon S3 bucket (p. 86)
Amazon Redshift reads the manifest and uses that information to connect to the remote host. If the
Amazon S3 bucket does not reside in the same Region as your Amazon Redshift cluster, you must use
the REGION (p. 583) option to specify the Region in which the data is located.
• Step 7: Run the COPY command to load the data (p. 86)
From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift
table.
Step 1: Retrieve the cluster public key and cluster node IP addresses
Note the Cluster Public Key and Node IP addresses. You will use them in later steps.
You will use the IP addresses in Step 3 to configure the host to accept the connection from Amazon
Redshift. Depending on what type of host you connect to and whether it is in a VPC, you will use
either the public IP addresses or the private IP addresses.
To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift CLI, run the describe-clusters command.
For example:
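A minimal form of the command follows; the cluster identifier shown is a placeholder:
aws redshift describe-clusters --cluster-identifier <cluster-identifier>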
The response will include the ClusterPublicKey and the list of Private and Public IP addresses, similar to
the following:
{
"Clusters": [
{
"VpcSecurityGroups": [],
"ClusterStatus": "available",
"ClusterNodes": [
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "LEADER",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-0",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-1",
"PublicIPAddress": "10.nnn.nnn.nnn"
}
],
"AutomatedSnapshotRetentionPeriod": 1,
"PreferredMaintenanceWindow": "wed:05:30-wed:06:00",
"AvailabilityZone": "us-east-1a",
"NodeType": "ds2.xlarge",
"ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift",
...
...
}
To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift API, use the DescribeClusters action. For more information, see describe-clusters in the Amazon
Redshift CLI Guide or DescribeClusters in the Amazon Redshift API Guide.
Step 2: Add the Amazon Redshift cluster public key to the host's
authorized keys file
You add the cluster public key to each host's authorized keys file so that the host will recognize Amazon
Redshift and accept the SSH connection.
To add the Amazon Redshift cluster public key to the host's authorized keys file
1. Access the host using an SSH connection.
For information about connecting to an instance using SSH, see Connect to Your Instance in the
Amazon EC2 User Guide.
2. Copy the Amazon Redshift public key from the console or from the CLI response text.
3. Copy and paste the contents of the public key into the /home/<ssh_username>/.ssh/
authorized_keys file on the remote host. The <ssh_username> must match the value for the
"username" field in the manifest file. Include the complete string, including the prefix "ssh-rsa "
and suffix "Amazon-Redshift". For example:
Step 3: Configure the host to accept all of the Amazon Redshift cluster's IP addresses
For information about adding rules to an Amazon EC2 security group, see Authorizing Inbound Traffic for Your Instances in the
Amazon EC2 User Guide.
Use the private IP addresses when either of the following is true:
• You have an Amazon Redshift cluster that is not in a Virtual Private Cloud (VPC), and an Amazon EC2-Classic instance, both of which are in the same AWS Region.
• You have an Amazon Redshift cluster that is in a VPC, and an Amazon EC2-VPC instance, both of which are in the same AWS Region and in the same VPC.
For more information about using Amazon Redshift in a VPC, see Managing Clusters in Virtual Private
Cloud (VPC) in the Amazon Redshift Cluster Management Guide.
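For an Amazon EC2 instance, one way to add such an ingress rule is with the AWS CLI; the security group ID and node IP address shown here are placeholders:
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 192.0.2.11/32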
Step 4: Get the public key for the host
You can find the host's public key in the following location, where <ssh_host_rsa_key_name> is the
unique name for the host's public key:
/etc/ssh/<ssh_host_rsa_key_name>.pub
Note
Amazon Redshift only supports RSA keys. We do not support DSA keys.
When you create your manifest file in Step 5, you will paste the text of the public key into the "Public
Key" field in the manifest file entry.
Step 5: Create a manifest file
Create the manifest file on your local computer. In a later step, you upload the file to Amazon S3. The manifest file has the following format:
{
"entries": [
{"endpoint":"<ssh_endpoint_or_IP>",
"command": "<remote_command>",
"mandatory":true,
"publickey": "<public_key>",
"username": "<host_user_name>"},
{"endpoint":"<ssh_endpoint_or_IP>",
"command": "<remote_command>",
"mandatory":true,
"publickey": "<public_key>",
"username": "host_user_name"}
]
The manifest file contains one entry in the "entries" array for each SSH connection; each entry represents
a single SSH connection. You can have multiple connections to a single host or multiple connections to
multiple hosts. The double quotation marks are required as shown, both for the field names and the values.
The only value that does not need double quotation marks is the Boolean value true or false for the
mandatory field.
endpoint
The URL address or IP address of the host.
command
The command that will be run by the host to generate text or binary (gzip, lzop, or bzip2) output.
The command can be any command that the user "host_user_name" has permission to run. The
command can be as simple as printing a file, or it could query a database or launch a script. The
output (text file, gzip binary file, lzop binary file, or bzip2 binary file) must be in a form the Amazon
Redshift COPY command can ingest. For more information, see Preparing your input data (p. 67).
publickey
(Optional) The public key of the host. If provided, Amazon Redshift will use the public key to identify
the host. If the public key is not provided, Amazon Redshift will not attempt host identification. For
example, if the remote host's public key is ssh-rsa AbcCbaxxx…xxxDHKJ <user>@<host>, enter
the following text in the publickey field: AbcCbaxxx…xxxDHKJ.
mandatory
(Optional) Indicates whether the COPY command should fail if the connection fails. The default is
false. If Amazon Redshift does not successfully make at least one connection, the COPY command
fails.
username
(Optional) The username that will be used to log on to the host system and run the remote
command. The user login name must be the same as the login that was used to add the public key to
the host's authorized keys file in Step 2. The default username is "redshift".
The following example shows a completed manifest to open four connections to the same host and run a
different command through each connection:
{
"entries": [
{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
"command": "cat loaddata1.txt",
"mandatory":true,
"publickey": "ec2publickeyportionoftheec2keypair",
"username": "ec2-user"},
{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
"command": "cat loaddata2.txt",
"mandatory":true,
"publickey": "ec2publickeyportionoftheec2keypair",
"username": "ec2-user"},
{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
"command": "cat loaddata3.txt",
"mandatory":true,
"publickey": "ec2publickeyportionoftheec2keypair",
"username": "ec2-user"},
{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
"command": "cat loaddata4.txt",
"mandatory":true,
"publickey": "ec2publickeyportionoftheec2keypair",
"username": "ec2-user"}
]
}
To load the data, run the COPY command with the manifest file and specify the SSH option. For example:
copy sales
from 's3://mybucket/ssh_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
ssh;
Note
If you use automatic compression, the COPY command performs two data reads, which means it
runs the remote command twice. The first read is to provide a sample for compression analysis,
then the second read actually loads the data. If executing the remote command twice might
cause a problem because of potential side effects, you should disable automatic compression.
To disable automatic compression, run the COPY command with the COMPUPDATE option set to
OFF. For more information, see Loading tables with automatic compression (p. 89).
Loading data from an Amazon DynamoDB table
You can use the COPY command to load a table with data from a single Amazon DynamoDB table.
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to
read and load data in parallel from an Amazon DynamoDB table. You can take maximum advantage of
parallel processing by setting distribution styles on your Amazon Redshift tables. For more information,
see Working with data distribution styles (p. 50).
Important
When the COPY command reads data from the Amazon DynamoDB table, the resulting data
transfer is part of that table's provisioned throughput.
To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not
load data from Amazon DynamoDB tables that are in production environments. If you do load data from
production tables, we recommend that you set the READRATIO option much lower than the average
percentage of unused provisioned throughput. A low READRATIO setting will help minimize throttling
issues. To use the entire provisioned throughput of an Amazon DynamoDB table, set READRATIO to 100.
The COPY command matches attribute names in the items retrieved from the DynamoDB table to
column names in an existing Amazon Redshift table by using the following rules:
• Amazon Redshift table columns are case-insensitively matched to Amazon DynamoDB item attributes.
If an item in the DynamoDB table contains multiple attributes that differ only in case, such as Price and
PRICE, the COPY command will fail.
• Amazon Redshift table columns that do not match an attribute in the Amazon DynamoDB table are
loaded as either NULL or empty, depending on the value specified with the EMPTYASNULL option in
the COPY (p. 576) command.
• Amazon DynamoDB attributes that do not match a column in the Amazon Redshift table are
discarded. Attributes are read before they are matched, and so even discarded attributes consume part
of that table's provisioned throughput.
• Only Amazon DynamoDB attributes with scalar STRING and NUMBER data types are supported. The
Amazon DynamoDB BINARY and SET data types are not supported. If a COPY command tries to load
an attribute with an unsupported data type, the command will fail. If the attribute does not match an
Amazon Redshift table column, COPY does not attempt to load it, and it does not raise an error.
The COPY command uses the following syntax to load data from an Amazon DynamoDB table:
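The general form is sketched below; the names in angle brackets are placeholders, and authorization stands for the authorization parameters described next:
copy <redshift_table_name>
from 'dynamodb://<dynamodb_table_name>'
authorization
readratio <integer>;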
The values for authorization provide the AWS authorization your cluster needs to access the Amazon
DynamoDB table. If these credentials correspond to an IAM user, that IAM user must have permission to
SCAN and DESCRIBE the Amazon DynamoDB table that is being loaded. For more information about
required permissions, see IAM permissions for COPY, UNLOAD, and CREATE LIBRARY (p. 614). The
preferred method for authentication is to specify the IAM_ROLE parameter and provide the Amazon
Resource Name (ARN) for an IAM role with the necessary permissions. For more information, see
Role-based access control (p. 611).
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
copy favoritemovies
from 'dynamodb://ProductCatalog'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
For more information about other authorization options, see Authorization parameters (p. 590).
If you want to validate your data without actually loading the table, use the NOLOAD option with the
COPY (p. 576) command.
The following example loads the FAVORITEMOVIES table with data from the DynamoDB table my-
favorite-movies-table. The read activity can consume up to 50% of the provisioned throughput.
copy favoritemovies
from 'dynamodb://my-favorite-movies-table'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;
To maximize throughput, the COPY command loads data from an Amazon DynamoDB table in parallel
across the compute nodes in the cluster.
By default, the COPY command applies automatic compression whenever it loads into an empty target
table that has no compression encodings. The compression analysis first samples a large number of rows
from the Amazon DynamoDB table. After sampling, the sample rows are discarded and the entire table is
loaded. As a result, many rows are read twice. For more information about how automatic compression
works, see Loading tables with automatic compression (p. 89).
Important
When the COPY command reads data from the Amazon DynamoDB table, including the rows
used for sampling, the resulting data transfer is part of that table's provisioned throughput.
Verifying that the data loaded correctly
After the load operation is complete, verify that the expected files and tables were loaded. For example,
the following query returns entries for loading the tables in the TICKIT database:
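A sketch of such a query against the STL_LOAD_COMMITS system table (the column choice is illustrative):
select query, trim(filename) as filename, curtime, status
from stl_load_commits
where filename like '%tickit%'
order by query;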
For example, if you specify an incorrect Amazon S3 path for the input file, Amazon Redshift returns an
S3ServiceException error indicating that the specified key does not exist.
To troubleshoot error messages, see the Load error reference (p. 95).
Loading tables with automatic compression
You can apply compression encodings to columns in tables manually, based on your own evaluation
of the data. Or you can use the COPY command with COMPUPDATE set to ON to analyze and apply
compression automatically based on sample data.
You can use automatic compression when you create and load a brand new table. The COPY command
performs a compression analysis. You can also perform a compression analysis without loading data
or changing the compression on a table by running the ANALYZE COMPRESSION (p. 566) command
on an already populated table. For example, you can run ANALYZE COMPRESSION when you want to
analyze compression on a table for future use, while preserving the existing data definition language
(DDL) statements.
Automatic compression balances overall performance when choosing compression encodings. Range-
restricted scans might perform poorly if sort key columns are compressed much more highly than other
columns in the same query. As a result, automatic compression skips the data analyzing phase on the
sort key columns and keeps the user-defined encoding types.
For sort key columns, automatic compression chooses RAW encoding if you haven't explicitly defined an
encoding type. ANALYZE COMPRESSION behaves the same way. For optimal query performance, consider
using RAW encoding for sort key columns.
To apply automatic compression to an empty table, regardless of its current compression encodings, run
the COPY command with the COMPUPDATE option set to ON. To disable automatic compression, run the
COPY command with the COMPUPDATE option set to OFF.
You cannot apply automatic compression to a table that already contains data.
Note
Automatic compression analysis requires enough rows in the load data (at least 100,000 rows
per slice) to generate a meaningful sample.
Automatic compression performs these operations in the background as part of the load transaction:
1. An initial sample of rows is loaded from the input file. Sample size is based on the value of the
COMPROWS parameter. The default is 100,000.
2. Compression options are chosen for each column.
3. The sample rows are removed from the table.
4. The table is recreated with the chosen compression encodings.
5. The entire input file is loaded and compressed using the new encodings.
After you run the COPY command, the table is fully loaded, compressed, and ready for use. If you load
more data later, appended rows are compressed according to the existing encoding.
If you only want to perform a compression analysis, run ANALYZE COMPRESSION, which is more
efficient than running a full COPY. Then you can evaluate the results to decide whether to use automatic
compression or recreate the table manually.
Automatic compression is supported only for the COPY command. Alternatively, you can manually apply
compression encoding when you create the table. For information about manual compression encoding,
see Working with column compression (p. 39).
1. Ensure that the table is empty. You can apply automatic compression only to an empty table:
truncate biglist;
2. Load the table with a single COPY command. Although the table is empty, some earlier encoding
might have been specified. To ensure that Amazon Redshift performs a compression analysis, set the
COMPUPDATE parameter to ON.
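A sketch of such a COPY command; the Amazon S3 path and the IAM role are placeholders:
copy biglist
from 's3://mybucket/biglist.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' compupdate on;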
Because no COMPROWS option is specified, the default and recommended sample size of 100,000
rows per slice is used.
3. Look at the new schema for the BIGLIST table in order to review the automatically chosen encoding
schemes.
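One way to do this is to query the PG_TABLE_DEF system catalog view; a minimal sketch:
select "column", type, encoding
from pg_table_def
where tablename = 'biglist';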
4. Verify that the expected number of rows were loaded, for example by running a count query:
select count(*) from biglist;
count
---------
3079952
(1 row)
When rows are later appended to this table using COPY or INSERT statements, the same compression
encodings are applied.
Optimizing storage for narrow tables
In order to optimize compression of the hidden metadata columns, such as INSERT_XID, load the table
in a single COPY transaction where possible. If you load the table with multiple separate COPY commands,
the INSERT_XID column will not compress well, and running a vacuum operation afterward does not
improve its compression.
Loading default column values
If COPY attempts to assign NULL to a column that is defined as NOT NULL, the COPY command fails. For
information about assigning the DEFAULT option, see CREATE TABLE (p. 708).
When loading from data files on Amazon S3, the columns in the column list must be in the same order as
the fields in the data file. If a field in the data file does not have a corresponding column in the column
list, the COPY command fails.
When loading from an Amazon DynamoDB table, order does not matter. Amazon DynamoDB attributes
that do not match a column in the Amazon Redshift table are discarded.
The following restrictions apply when using the COPY command to load DEFAULT values into a table:
• If an IDENTITY (p. 711) column is included in the column list, the EXPLICIT_IDS option must also be
specified in the COPY (p. 576) command, or the COPY command will fail. Similarly, if an IDENTITY
column is omitted from the column list, and the EXPLICIT_IDS option is specified, the COPY operation
will fail.
• Because the evaluated DEFAULT expression for a given column is the same for all loaded rows, a
DEFAULT expression that uses a RANDOM() function assigns the same value to all the rows.
• DEFAULT expressions that contain CURRENT_DATE or SYSDATE are set to the timestamp of the current
transaction.
For an example, see "Load data from a file with default values" in COPY examples (p. 622).
Troubleshooting data loads
This section provides information about identifying and resolving data loading errors.
S3ServiceException errors
The most common S3ServiceException errors are caused by an improperly formatted or incorrect
credentials string, having your cluster and your bucket in different AWS Regions, and insufficient Amazon
S3 privileges.
Invalid credentials string
If your credentials string is improperly formatted, verify that it does not contain any spaces or line
breaks and that it is enclosed in single quotation marks.
Invalid access key ID
This is often a copy and paste error. Verify that the access key ID was entered correctly. Also, if you are
using temporary session keys, check that the value for token is set.
Invalid secret access key
This is often a copy and paste error. Verify that the secret access key was entered correctly and that it is
the correct key for the access key ID.
Bucket is in a different Region
The Amazon S3 bucket specified in the COPY command must be in the same AWS Region as the cluster.
If your bucket and your cluster are in different Regions, you receive an error similar to the following:
ERROR: S3ServiceException:The bucket you are attempting to access must be addressed using
the specified endpoint.
You can create an Amazon S3 bucket in a specific Region either by selecting the Region when you create
the bucket by using the Amazon S3 Management Console, or by specifying an endpoint when you
create the bucket using the Amazon S3 API or CLI. For more information, see Uploading files to Amazon
S3 (p. 69).
For more information about Amazon S3 regions, see Accessing a Bucket in the Amazon Simple Storage
Service User Guide.
Alternatively, you can specify the Region using the REGION (p. 583) option with the COPY command.
Access denied
The user account identified by the credentials must have LIST and GET access to the Amazon S3 bucket.
If the user does not have sufficient privileges, you receive an S3ServiceException: Access Denied error.
For information about managing user access to buckets, see Access Control in the Amazon S3 Developer
Guide.
You can use the following system tables to identify and diagnose load errors:
• Query STL_LOAD_ERRORS (p. 1307) to discover the errors that occurred during specific loads.
• Query STL_FILE_SCAN (p. 1297) to view load times for specific files or to see if a specific file was even
read.
• Query STL_S3CLIENT_ERROR (p. 1332) to find details for errors encountered while transferring data
from Amazon S3.
To find and diagnose load errors, use the following process.
1. Create a view or define a query that returns details about load errors. The following example joins
the STL_LOAD_ERRORS table to the STV_TBL_PERM table to match table IDs with actual table
names.
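A sketch of such a view, named LOADVIEW so that the later steps can refer to it:
create view loadview as
(select distinct tbl, trim(name) as table_name, query, starttime,
trim(filename) as input, line_number, colname, err_code,
trim(err_reason) as reason
from stl_load_errors sl, stv_tbl_perm sp
where sl.tbl = sp.id);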
2. Set the MAXERRORS option in your COPY command to a large enough value to enable COPY to
return useful information about your data. If the COPY encounters errors, an error message directs
you to consult the STL_LOAD_ERRORS table for details.
3. Query the LOADVIEW view to see error details. For example:
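A query along these lines (the table name is illustrative) returns rows similar to the output shown below:
select * from loadview where table_name='venue';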
+----------------+-------------+-------+----------+---------------------
| venue_pipe.txt | 1 | 0 | 1214 | Delimiter not found
4. Fix the problem in the input file or the load script, based on the information that the view returns.
Some typical load errors to watch for include:
• Mismatch between data types in table and values in input data fields.
• Mismatch between number of columns in table and number of fields in input data.
• Mismatched quotation marks. Amazon Redshift supports both single and double quotation marks;
however, these quotation marks must be balanced appropriately.
• Incorrect format for date/time data in input files.
• Out-of-range values in input files (for numeric columns).
• Number of distinct values for a column exceeds the limitation for its compression encoding.
If a line in your load data contains a character that is invalid for the column data type, COPY returns
an error and logs a row in the STL_LOAD_ERRORS system log table with error number 1220. The
ERR_REASON field includes the byte sequence, in hex, for the invalid character.
An alternative to fixing invalid characters in your load data is to replace the invalid characters during the
load process. To replace invalid UTF-8 characters, specify the ACCEPTINVCHARS option with the COPY
command. If the ACCEPTINVCHARS option is set, the character you specify replaces each invalid code
point. If the ACCEPTINVCHARS option isn't set, COPY returns an error when it encounters an invalid
UTF-8 character. For more information, see ACCEPTINVCHARS (p. 603).
Certain code points are valid UTF-8, so COPY operations don't return an error for them even if the
ACCEPTINVCHARS option is not set. However, these code points are invalid characters. You can use the
ACCEPTINVCHARS (p. 603) option to replace such a code point with a character that you specify. These
code points include the range of values from 0xFDD0 to 0xFDEF and values up to 0x10FFFF ending with
FFFE or FFFF.
For example, if COPY attempts to load the UTF-8 character e0 a1 c7a4 into a CHAR column, the error
reason reports that multibyte characters are not supported for CHAR and shows the invalid character.
If the error is related to a VARCHAR data type, the error reason includes an error code as well as the
invalid UTF-8 hex sequence, such as a4.
The following table lists the descriptions and suggested workarounds for VARCHAR load errors. If one of
these errors occurs, replace the character with a valid UTF-8 code sequence or remove the character.
1 The UTF-8 byte sequence exceeds the four-byte maximum supported by VARCHAR.
2 The UTF-8 byte sequence is incomplete. COPY did not find the expected number of
continuation bytes for a multibyte character before the end of the string.
3 The UTF-8 single-byte character is out of range. The starting byte must not be 254,
255 or any character between 128 and 191 (inclusive).
4 The value of the trailing byte in the byte sequence is out of range. The continuation
byte must be between 128 and 191 (inclusive).
9 The UTF-8 byte sequence does not have a matching code point.
The following error codes indicate other common data load errors:
1202 Input data had more columns than were defined in the DDL.
1203 Input data had fewer columns than were defined in the DDL.
1204 Input data exceeded the acceptable range for the data type.
1205 Date format is invalid. See DATEFORMAT and TIMEFORMAT strings (p. 620) for valid
formats.
1206 Timestamp format is invalid. See DATEFORMAT and TIMEFORMAT strings (p. 620) for
valid formats.
1218 When using NULL AS '\0', a field containing a null terminator (NUL, or UTF-8 0000)
contained more than one byte.
1221 Encoding of the file is not the same as that specified in the COPY command.
8001 COPY with MANIFEST parameter requires full path of an Amazon S3 object.
Updating tables with DML commands
Amazon Redshift supports standard data manipulation language (DML) commands (INSERT, UPDATE, and
DELETE) that you can use to modify rows in tables. If you insert, update, or delete a significant number
of rows in a table, relative to the number of rows before the changes, run the ANALYZE and VACUUM
commands against the table when you are done. If a number of small changes accumulate over time in
your application, you might want to schedule the ANALYZE and VACUUM commands to run at regular
intervals. For more information, see Analyzing tables (p. 103) and Vacuuming tables (p. 108).
Updating and inserting new data
You can efficiently update and insert new data by first loading your data into a staging table and then
merging the staging table into your target table.
Topics
• Merge method 1: Replacing existing rows (p. 97)
• Merge method 2: Specifying a column list (p. 97)
• Creating a temporary staging table (p. 97)
• Performing a merge operation by replacing existing rows (p. 98)
• Performing a merge operation by specifying a column list (p. 98)
Note
You should run the entire merge operation, except for creating and dropping the temporary
staging table, in a single transaction so that the transaction will roll back if any step fails. Using
a single transaction also reduces the number of commits, which saves time and resources.
Merge method 1: Replacing existing rows
Use this method if all of the following are true:
• Your target table and your staging table contain the same columns.
• You intend to replace all of the data in the target table columns with all of the staging table columns.
• You will use all of the rows in the staging table in the merge.
If any of these criteria do not apply, use Merge method 2: Specifying a column list, described in the
following section.
If you will not use all of the rows in the staging table, you can filter the DELETE and INSERT statements
by using a WHERE clause to leave out rows that are not actually changing. However, if most of the rows
in the staging table will not participate in the merge, we recommend performing an UPDATE and an
INSERT in separate steps, as described later in this section.
Creating a temporary staging table
A merge operation requires a join between the staging table and the target table. To collocate the
joining rows, set the staging table's distribution key to the same column as the target table's distribution
key. For example, if the target table uses a foreign key column as its distribution key, use the same
column for the staging table's distribution key. If you create the staging table by using a CREATE TABLE
LIKE (p. 714) statement, the staging table will inherit the distribution key from the parent table. If
you use a CREATE TABLE AS statement, the new table does not inherit the distribution key. For more
information, see Working with data distribution styles (p. 50).
If the distribution key is not the same as the primary key and the distribution key is not updated as part
of the merge operation, add a redundant join predicate on the distribution key columns to enable a
collocated join. For example:
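A sketch of such a predicate, written as pseudocode with the same placeholder names used in the merge steps later in this section:
delete from target
using stage
where target.primarykey = stage.primarykey
and target.distkey = stage.distkey;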
To verify that the query will use a collocated join, run the query with EXPLAIN (p. 762) and check for
DS_DIST_NONE on all of the joins. For more information, see Evaluating the query plan (p. 55).
Performing a merge operation by replacing existing rows
To perform a merge operation by replacing existing rows
1. Create a staging table, and then populate it with data to be merged, as shown in the following
pseudocode.
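A sketch of that pseudocode; source and filter_expression are placeholders:
create temp table stage (like target);

insert into stage
select * from source
where source.filter = 'filter_expression';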
2. Use an inner join with the staging table to delete the rows from the target table that are being
updated.
Put the delete and insert operations in a single transaction block so that if there is a problem,
everything will be rolled back.
begin transaction;
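-- (Sketch only: the delete and insert statements are not shown in this excerpt;
-- this pseudocode follows the pattern used in the merge examples later in this section)
delete from target
using stage
where target.primarykey = stage.primarykey;

insert into target
select * from stage;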
end transaction;
Performing a merge operation by specifying a column list
To perform a merge operation by specifying a column list
1. Put the entire operation in a single transaction block so that if there is a problem, everything will be
rolled back.
begin transaction;
…
end transaction;
2. Create a staging table, and then populate it with data to be merged, as shown in the following
pseudocode.
3. Update the target table by using an inner join with the staging table.
For example:
begin transaction;
update target
set col1 = stage.col1,
col2 = stage.col2,
col3 = 'expression'
from stage
where target.primarykey = stage.primarykey
and target.distkey = stage.distkey
and target.col3 > 'last_update_time'
and (target.col1 != stage.col1
or target.col2 != stage.col2
or target.col3 = 'filter_expression');
4. Delete unneeded rows from the staging table by using an inner join with the target table. Some
rows in the target table already match the corresponding rows in the staging table, and others were
updated in the previous step. In either case, they are not needed for the insert.
5. Insert the remaining rows from the staging table. Use the same column list in the VALUES clause
that you used in the UPDATE statement in step two.
end transaction;
Merge examples
The following examples perform a merge to update the SALES table. The first example uses the simpler
method of deleting from the target table and then inserting all of the rows from the staging table. The
second example requires updating on select columns in the target table, so it includes an extra update
step.
The examples in this section need a sample data source that includes both updates and inserts. For the
examples, we will create a sample table named SALES_UPDATE that uses data from the SALES table.
We'll populate the new table with random data that represents new sales activity for December. We will
use the SALES_UPDATE sample table to create the staging table in the examples that follow.
update sales_update
set qtysold = qtysold*2,
pricepaid = pricepaid*0.8,
commission = commission*1.1
where saletime > '2008-11-30'
and mod(sellerid, 5) = 0;
The following script uses the SALES_UPDATE table to perform a merge operation on the SALES table
with new data for December sales activity. This example deletes rows in the SALES table that have
updates so they can be replaced with the updated rows in the staging table. The staging table should
contain only rows that will participate in the merge, so the CREATE TABLE statement includes a filter to
exclude rows that have not changed.
-- Create a staging table and populate it with updated rows from SALES_UPDATE
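-- (Sketch only: the original statement is not shown in this excerpt. One possible filter
-- keeps December rows that are new or whose values differ from the current SALES rows.)
create temp table stagesales as
select su.*
from sales_update su
where su.saletime > '2008-11-30'
and not exists
  (select 1 from sales s
   where s.salesid = su.salesid
   and s.listid = su.listid
   and s.qtysold = su.qtysold
   and s.pricepaid = su.pricepaid
   and s.commission = su.commission);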
-- Delete any rows from SALES that exist in STAGESALES, because they are updates
-- The join includes a redundant predicate to collocate on the distribution key
-- A filter on saletime enables a range-restricted scan on SALES
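-- (Sketch only: reconstructed to match the join predicates used in the second example below)
delete from sales
using stagesales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid
and stagesales.saletime > '2008-11-30';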
-- Insert all the rows from the staging table into the target table
insert into sales
select * from stagesales;
The following example performs a merge operation to update SALES with new data for December
sales activity. We need sample data that includes both updates and inserts, along with rows that have
not changed. For this example, we want to update the QTYSOLD and PRICEPAID columns but leave
COMMISSION and SALETIME unchanged. The following script uses the SALES_UPDATE table to perform
a merge operation on the SALES table.
-- Create a staging table and populate it with rows from SALES_UPDATE for Dec
create temp table stagesales as select * from sales_update
where saletime > '2008-11-30';
-- Update the target table using an inner join with the staging table
-- The join includes a redundant predicate to collocate on the distribution key
-- A filter on saletime enables a range-restricted scan on SALES
update sales
set qtysold = stagesales.qtysold,
pricepaid = stagesales.pricepaid
from stagesales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid
and stagesales.saletime > '2008-11-30'
and (sales.qtysold != stagesales.qtysold
or sales.pricepaid != stagesales.pricepaid);
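-- Delete matching rows from the staging table using an inner join with the target table
-- (sketch only; these rows were either unchanged or already updated in the previous step)
delete from stagesales
using sales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid;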
-- Insert the remaining rows from the staging table into the target table
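-- (Sketch only, matching the pattern of the first example)
insert into sales
select * from stagesales;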
Performing a deep copy
A deep copy recreates and repopulates a table by using a bulk insert, which automatically re-sorts the
table. If a table has a large unsorted region, a deep copy is much faster than a vacuum.
You can choose one of the following methods to create a copy of the original table:
• Use the original table DDL.
If the CREATE TABLE DDL is available, this is the fastest and preferred method. If you create a new
table, you can specify all table and column attributes, including primary key and foreign keys.
Note
If the original DDL is not available, you might be able to recreate the DDL by running a script
called v_generate_tbl_ddl. You can download the script from amazon-redshift-utils,
which is part of the Amazon Web Services - Labs GitHub repository.
• Use CREATE TABLE LIKE.
If the original DDL is not available, you can use CREATE TABLE LIKE to recreate the original table. The
new table inherits the encoding, distribution key, sort key, and not-null attributes of the parent table.
The new table doesn't inherit the primary key and foreign key attributes of the parent table, but you
can add them using ALTER TABLE (p. 542).
• Create a temporary table and truncate the original table.
If you need to retain the primary key and foreign key attributes of the parent table, or if the parent
table has dependencies, you can use CREATE TABLE ... AS (CTAS) to create a temporary table, then
truncate the original table and populate it from the temporary table.
Using a temporary table improves performance significantly compared to using a permanent table,
but there is a risk of losing data. A temporary table is automatically dropped at the end of the session
in which it is created. TRUNCATE commits immediately, even if it is inside a transaction block. If the
TRUNCATE succeeds but the session terminates before the subsequent INSERT completes, the data is
lost. If data loss is unacceptable, use a permanent table.
After you create a copy of a table, you may need to grant access to the new table. You can use
GRANT (p. 767) to define access privileges.
The following example performs a deep copy on the SALES table using a duplicate of SALES named
SALESCOPY.
The following example performs a deep copy on the SALES table using CREATE TABLE LIKE.
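A sketch of the CREATE TABLE LIKE approach; LIKESALES is an arbitrary name for the copy:
create table likesales (like sales);
insert into likesales (select * from sales);
drop table sales;
alter table likesales rename to sales;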
To perform a deep copy by creating a temporary table and truncating the original table
1. Use CREATE TABLE AS to create a temporary table with the rows from the original table.
2. Truncate the current table.
3. Use an INSERT INTO … SELECT statement to copy the rows from the temporary table to the original
table.
4. Drop the temporary table.
The following example performs a deep copy on the SALES table by creating a temporary table and
truncating the original table:
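A sketch of those four steps; SALESTEMP is an arbitrary name for the temporary table:
create temp table salestemp as select * from sales;
truncate sales;
insert into sales (select * from salestemp);
drop table salestemp;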
Analyzing tables
The ANALYZE operation updates the statistical metadata that the query planner uses to choose optimal
plans.
In most cases, you don't need to explicitly run the ANALYZE command. Amazon Redshift monitors
changes to your workload and automatically updates statistics in the background. In addition, the COPY
command performs an analysis automatically when it loads data into an empty table.
To explicitly analyze a table or the entire database, run the ANALYZE (p. 564) command.
Topics
• Automatic analyze (p. 104)
Automatic analyze
Amazon Redshift continuously monitors your database and automatically performs analyze operations in
the background. To minimize impact to your system performance, automatic analyze runs during periods
when workloads are light.
Automatic analyze is enabled by default. To disable automatic analyze, set the auto_analyze
parameter to false by modifying your cluster's parameter group.
To reduce processing time and improve overall system performance, Amazon Redshift skips automatic
analyze for any table where the extent of modifications is small.
An analyze operation skips tables that have up-to-date statistics. If you run ANALYZE as part of your
extract, transform, and load (ETL) workflow, automatic analyze skips tables that have current statistics.
Similarly, an explicit ANALYZE skips tables when automatic analyze has updated the table's statistics.
Amazon Redshift also analyzes new tables that you create with the following commands:
• CREATE TABLE AS (CTAS)
• CREATE TEMP TABLE AS
• SELECT INTO
Amazon Redshift returns a warning message when you run a query against a new table that was not
analyzed after its data was initially loaded. No warning occurs when you query a table after a subsequent
update or load. The same warning message is returned when you run the EXPLAIN command on a query
that references tables that have not been analyzed.
Whenever adding data to a nonempty table significantly changes the size of the table, you can explicitly
update statistics. You do so either by running an ANALYZE command or by using the STATUPDATE ON
option with the COPY command. To view details about the number of rows that have been inserted or
deleted since the last ANALYZE, query the PG_STATISTIC_INDICATOR (p. 1472) system catalog table.
You can specify the scope of the ANALYZE (p. 564) command to one of the following:
• The entire current database
• A single table
• One or more specific columns in a single table
• Columns that are likely to be used as predicates in queries
The ANALYZE command gets a sample of rows from the table, does some calculations, and saves
resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column
and another sample pass for all of the other columns in the table. If you want to generate statistics for
a subset of columns, you can specify a comma-separated column list. You can run ANALYZE with the
PREDICATE COLUMNS clause to skip columns that aren’t used as predicates.
ANALYZE operations are resource intensive, so run them only on tables and columns that actually require
statistics updates. You don't need to analyze all columns in all tables regularly or on the same schedule.
If the data changes substantially, analyze the columns that are frequently used in the following:
• Sorting and grouping operations
• Joins
• Query predicates
To reduce processing time and improve overall system performance, Amazon Redshift skips
ANALYZE for any table that has a low percentage of changed rows, as determined by the
analyze_threshold_percent (p. 1481) parameter. By default, the analyze threshold is set to 10 percent.
You can change the analyze threshold for the current session by running a SET (p. 837) command.
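For example, the following statement lowers the threshold to 20 percent for the current session (the value is illustrative):
set analyze_threshold_percent to 20;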
Columns that are less likely to require frequent analysis are those that represent facts and measures and
any related attributes that are never actually queried, such as large VARCHAR columns. For example,
consider the LISTING table in the TICKIT database.
If this table is loaded every day with a large number of new records, the LISTID column, which is
frequently used in queries as a join key, needs to be analyzed regularly. If TOTALPRICE and LISTTIME are
the frequently used constraints in queries, you can analyze those columns and the distribution key on
every weekday.
Suppose that the sellers and events in the application are much more static, and the date IDs refer to a
fixed set of days covering only two or three years. In this case, the unique values for these columns don't
change significantly. However, the number of instances of each unique value will increase steadily.
In addition, consider the case where the NUMTICKETS and PRICEPERTICKET measures are queried
infrequently compared to the TOTALPRICE column. In this case, you can run the ANALYZE command on
the whole table once every weekend to update statistics for the five columns that are not analyzed daily:
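A sketch of that schedule, using the columns discussed above: analyze the join key and the frequently used constraints on weekdays, and the whole table on the weekend:
analyze listing(listid, totalprice, listtime);
analyze listing;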
Predicate columns
As a convenient alternative to specifying a column list, you can choose to analyze only the columns
that are likely to be used as predicates. When you run a query, any columns that are used in a join, filter
condition, or group by clause are marked as predicate columns in the system catalog. When you run
ANALYZE with the PREDICATE COLUMNS clause, the analyze operation includes only columns that meet
the following criteria:
• The column is marked as a predicate column.
• The column is a distribution key.
• The column is part of a sort key.
If none of a table's columns are marked as predicates, ANALYZE includes all of the columns, even when
PREDICATE COLUMNS is specified. If no columns are marked as predicate columns, it might be because
the table has not yet been queried.
You might choose to use PREDICATE COLUMNS when your workload's query pattern is relatively stable.
When the query pattern is variable, with different columns frequently being used as predicates, using
PREDICATE COLUMNS might temporarily result in stale statistics. Stale statistics can lead to suboptimal
query execution plans and long execution times. However, the next time you run ANALYZE using
PREDICATE COLUMNS, the new predicate columns are included.
To view details for predicate columns, use the following SQL to create a view named
PREDICATE_COLUMNS.
Suppose you run the following query against the LISTING table. Note that LISTID, LISTTIME, and
EVENTID are used in the join, filter, and group by clauses.
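An illustrative query of that shape (not the original example) might look like the following:
select listing.eventid, sum(listing.totalprice)
from listing
join sales on sales.listid = listing.listid
where listing.listtime >= '2008-12-01'
group by listing.eventid;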
When you query the PREDICATE_COLUMNS view, as shown in the following example, you see that
LISTID, EVENTID, and LISTTIME are marked as predicate columns.
Keeping statistics current improves query performance by enabling the query planner to choose optimal
plans. Amazon Redshift refreshes statistics automatically in the background, and you can also explicitly
run the ANALYZE command. If you choose to explicitly run ANALYZE, run it after regular load or update
cycles and on any new tables, or existing tables and columns, that undergo significant change.
ANALYZE command history
Query STL_ANALYZE to view the history of analyze operations. If Amazon Redshift analyzes a table using
automatic analyze, the is_background column is set to t (true). Otherwise, it is set to f (false). The
following example joins STV_TBL_PERM to show the table name and execution details.
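A sketch of such a query; the column names are taken from STL_ANALYZE and STV_TBL_PERM and should be treated as assumptions:
select distinct a.xid, trim(t.name) as name, a.status, a.rows, a.modified_rows,
a.starttime, a.endtime
from stl_analyze a
join stv_tbl_perm t on t.id = a.table_id
where trim(t.name) = 'users'
order by a.starttime;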
-------+-------+-----------------+-------+---------------+---------------------+--------------------
 1582  | users | Full            | 49990 |         49990 | 2016-09-22 22:02:23 | 2016-09-22 22:02:28
Alternatively, you can run a more complex query that returns all the statements that ran in every
completed transaction that included an ANALYZE command.
Vacuuming tables
Amazon Redshift can automatically sort and perform a VACUUM DELETE operation on tables in the
background. To clean up tables after a load or a series of incremental updates, you can also run the
VACUUM (p. 873) command, either against the entire database or against individual tables.
Note
Only the table owner or a superuser can effectively vacuum a table. If you don't have owner or
superuser privileges for a table, a VACUUM operation that specifies a single table fails. If you
run a VACUUM of the entire database without specifying a table name, the operation completes
successfully. However, the operation has no effect on tables for which you don't have owner or
superuser privileges.
For this reason, we recommend vacuuming individual tables as needed. We also recommend this
approach because vacuuming the entire database is potentially an expensive operation.
Depending on the load on the system, Amazon Redshift automatically initiates the sort. This automatic
sort lessens the need to run the VACUUM command to keep data in sort key order. If you need data
fully sorted in sort key order, for example after a large data load, then you can still manually run the
VACUUM command. To determine whether your table will benefit by running VACUUM SORT, monitor
the vacuum_sort_benefit column in SVV_TABLE_INFO (p. 1450).
Amazon Redshift tracks scan queries that use the sort key on each table. Amazon Redshift estimates
the maximum percentage of improvement in scanning and filtering of data for each table (if
the table was fully sorted). This estimate is visible in the vacuum_sort_benefit column in
SVV_TABLE_INFO (p. 1450). You can use this column, along with the unsorted column, to determine
when queries can benefit from manually running VACUUM SORT on a table. The unsorted column
reflects the physical sort order of a table. The vacuum_sort_benefit column specifies the impact of
sorting a table by manually running VACUUM SORT.
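For example, a query like the following returns both columns for each table (output not shown):
select "table", unsorted, vacuum_sort_benefit
from svv_table_info
order by 1;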
For the table “sales”, even though the table is ~86% physically unsorted, the query performance impact
from the table being 86% unsorted is only 5%. This might be either because only a small portion of
the table is accessed by queries, or very few queries accessed the table. For the table “event”, the table
is ~45% physically unsorted. But the query performance impact of 67% indicates that either a larger
portion of the table was accessed by queries, or the number of queries accessing the table was large. The
table "event" can potentially benefit from running VACUUM SORT.
Topics
• VACUUM frequency (p. 109)
• Sort stage and merge stage (p. 110)
• Vacuum threshold (p. 110)
• Vacuum types (p. 110)
• Managing vacuum times (p. 110)
VACUUM frequency
You should vacuum as often as you need to in order to maintain consistent query performance. Consider
these factors when determining how often to run your VACUUM command.
• Run VACUUM during time periods when you expect minimal activity on the cluster, such as evenings or
during designated database administration windows.
• A large unsorted region results in longer vacuum times. If you delay vacuuming, the vacuum will take
longer because more data has to be reorganized.
• VACUUM is an I/O intensive operation, so the longer it takes for your vacuum to complete, the more
impact it will have on concurrent queries and other database operations running on your cluster.
• VACUUM takes longer for tables that use interleaved sorting. To evaluate whether interleaved tables
need to be re-sorted, query the SVV_INTERLEAVED_COLUMNS (p. 1434) view.
Sort stage and merge stage
Users can access tables while they are being vacuumed. You can perform queries and write operations
while a table is being vacuumed, but when DML and a vacuum run concurrently, both might take longer.
If you run UPDATE and DELETE statements during a vacuum, system performance might be reduced.
Incremental merges temporarily block concurrent UPDATE and DELETE operations, and UPDATE and
DELETE operations in turn temporarily block incremental merge steps on the affected tables. DDL
operations, such as ALTER TABLE, are blocked until the vacuum operation finishes with the table.
Note
Various modifiers to VACUUM control the way that it works. You can use them to tailor the
vacuum operation for the current need. For example, using VACUUM RECLUSTER shortens
the vacuum operation by not performing a full merge operation. For more information, see
VACUUM (p. 873).
Vacuum threshold
By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows
are already sorted. Skipping the sort phase can significantly improve VACUUM performance. To change
the default sort threshold for a single table, include the table name and the TO threshold PERCENT
parameter when you run the VACUUM command.
Vacuum types
For information about different vacuum types, see VACUUM (p. 873).
Managing vacuum times
Topics
• Deciding whether to reindex (p. 110)
• Managing the size of the unsorted region (p. 111)
• Managing the volume of merged rows (p. 112)
• Loading your data in sort key order (p. 115)
• Using time series tables (p. 115)
Deciding whether to reindex
When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Amazon Redshift
automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you
need to run VACUUM REINDEX afterwards to initialize the interleaved index.
Over time, as you add rows with new sort key values, performance might degrade if the distribution of
the values in the sort key columns changes. If your new rows fall primarily within the range of existing
sort key values, you don’t need to reindex. Run VACUUM SORT ONLY or VACUUM FULL to restore the
sort order.
The query engine is able to use sort order to efficiently select which data blocks need to be scanned
to process a query. For an interleaved sort, Amazon Redshift analyzes the sort key column values
to determine the optimal sort order. If the distribution of key values changes, or skews, as rows
are added, the sort strategy will no longer be optimal, and the performance benefit of sorting
will degrade. To reanalyze the sort key distribution you can run a VACUUM REINDEX. The reindex
operation is time consuming, so to decide whether a table will benefit from a reindex, query the
SVV_INTERLEAVED_COLUMNS (p. 1434) view.
For example, the following query shows details for tables that use interleaved sort keys.
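A sketch of such a query; the join to STV_TBL_PERM to resolve the table name is an assumption:
select tbl as tbl_id, trim(stv_tbl_perm.name) as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;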
The value for interleaved_skew is a ratio that indicates the amount of skew. A value of 1 means there
is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless
the skew is inherent in the underlying set.
You can use the date value in last_reindex to determine how long it has been since the last reindex.
Managing the size of the unsorted region
To keep the unsorted region small, use the following practices:
• Run VACUUM on a regular schedule.
If you load your tables in small increments (such as daily updates that represent a small percentage
of the total number of rows in the table), running VACUUM regularly will help ensure that individual
vacuum operations go quickly.
• Run the largest load first.
If you need to load a new table with multiple COPY operations, run the largest load first. When you
run an initial load into a new or truncated table, all of the data is loaded directly into the sorted
region, so no vacuum is required.
• Truncate a table instead of deleting all of the rows.
Deleting rows from a table does not reclaim the space that the rows occupied until you perform a
vacuum operation; however, truncating a table empties the table and reclaims the disk space, so no
vacuum is required. Alternatively, drop the table and re-create it.
If you are loading a small number of rows into a table for test purposes, don't delete the rows when
you are done. Instead, truncate the table and reload those rows as part of the subsequent production
load operation.
• Perform a deep copy.
If a table that uses a compound sort key has a large unsorted region, a deep copy is much faster than
a vacuum. A deep copy recreates and repopulates a table by using a bulk insert, which automatically
re-sorts the table. The trade off is that you cannot make concurrent updates during a deep copy
operation, which you can do during a vacuum. For more information, see Amazon Redshift best practices
for designing queries (p. 22).
Managing the volume of merged rows
Prior to a vacuum, a table consists of a sorted region at the head of the table, followed by an unsorted
region, which grows whenever rows are added or updated. When a set of rows is added by a COPY
operation, the new set of rows is sorted on the sort key as it is added to the unsorted region at the end
of the table. The new rows are ordered within their own set, but not within the unsorted region.
The following diagram illustrates the unsorted region after two successive COPY operations, where the
sort key is CUSTID. For simplicity, this example shows a compound sort key, but the same principles
apply to interleaved sort keys, except that the impact of the unsorted region is greater for interleaved
tables.
A vacuum operation restores the table's sort order in two stages:
1. Sort the unsorted region into a newly-sorted region.
The first stage is relatively cheap, because only the unsorted region is rewritten. If the range of sort
key values of the newly-sorted region is higher than the existing range, only the new rows need to be
rewritten, and the vacuum is complete. For example, if the sorted region contains ID values 1 to 500
and subsequent copy operations add key values greater than 500, then only the unsorted region
needs to be rewritten.
2. Merge the newly-sorted region with the previously-sorted region.
If the keys in the newly sorted region overlap the keys in the sorted region, then VACUUM needs to
merge the rows. Starting at the beginning of the newly-sorted region (at the lowest sort key), the
vacuum writes the merged rows from the previously sorted region and the newly sorted region into a
new set of blocks.
The extent to which the new sort key range overlaps the existing sort keys determines the extent
to which the previously-sorted region will need to be rewritten. If the unsorted keys are scattered
throughout the existing sort range, a vacuum might need to rewrite existing portions of the table.
The following diagram shows how a vacuum would sort and merge rows that are added to a table where
CUSTID is the sort key. Because each copy operation adds a new set of rows with key values that overlap
the existing keys, almost the entire table needs to be rewritten. The diagram shows single sort and
merge, but in practice, a large vacuum consists of a series of incremental sort and merge steps.
If the range of sort keys in a set of new rows overlaps the range of existing keys, the cost of the merge
stage continues to grow in proportion to the table size as the table grows while the cost of the sort
stage remains proportional to the size of the unsorted region. In such a case, the cost of the merge stage
overshadows the cost of the sort stage, as the following diagram shows.
To determine what proportion of a table was remerged, query SVV_VACUUM_SUMMARY after the
vacuum operation completes. The following query shows the effect of six successive vacuums as
CUSTSALES grew larger over time.
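A simple form of that query (output not shown):
select * from svv_vacuum_summary
where table_name = 'custsales';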
The merge_increments column gives an indication of the amount of data that was merged for each
vacuum operation. If the number of merge increments over consecutive vacuums increases in proportion
to the growth in table size, that is an indication that each vacuum operation is remerging an increasing
number of rows in the table because the existing and newly sorted regions overlap.
Loading your data in sort key order
If you load your data in sort key order using a COPY command, you might reduce or even eliminate the
need to vacuum.
COPY automatically adds new rows to the table's sorted region when all of the following are true:
• The table uses a compound sort key with only one sort column.
• The sort column is NOT NULL.
• The table is 100 percent sorted or empty.
• All the new rows are higher in sort order than the existing rows, including rows marked for deletion. In
this instance, Amazon Redshift uses the first eight bytes of the sort key to determine sort order.
For example, suppose you have a table that records customer events using a customer ID and time. If
you sort on customer ID, it’s likely that the sort key range of new rows added by incremental loads will
overlap the existing range, as shown in the previous example, leading to an expensive vacuum operation.
If you set your sort key to a timestamp column, your new rows will be appended in sort order at the end
of the table, as the following diagram shows, reducing or even eliminating the need to vacuum.
Create a new table each time you add a set of data, then delete the oldest table in the series. You gain a
double benefit:
• You avoid the added cost of deleting rows, because a DROP TABLE operation is much more efficient
than a mass DELETE.
• If the tables are sorted by timestamp, no vacuum is needed. If each table contains data for one month,
a vacuum will at most have to rewrite one month’s worth of data, even if the tables are not sorted by
timestamp.
You can create a UNION ALL view for use by reporting queries that hides the fact that the data is stored
in multiple tables. If a query filters on the sort key, the query planner can efficiently skip all the tables
that aren't used. A UNION ALL can be less efficient for other types of queries, so you should evaluate
query performance in the context of all queries that use the tables.
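For example, a reporting view over a set of monthly tables might look like the following sketch; the table
and view names here are hypothetical.
create view sales_all as
select * from sales_2023_01
union all
select * from sales_2023_02
union all
select * from sales_2023_03;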
Managing concurrent write operations
Amazon Redshift allows tables to be read while they are being incrementally loaded or modified.
In some traditional data warehousing and business intelligence applications, the database is available
to users only when the nightly load is complete. In such cases, no updates are allowed during regular
work hours, when analytic queries are run and reports are generated; however, an increasing number of
applications remain live for long periods of the day or even all day, making the notion of a load window
obsolete.
Amazon Redshift supports these types of applications by allowing tables to be read while they are being
incrementally loaded or modified. Queries simply see the latest committed version, or snapshot, of the
data, rather than waiting for the next version to be committed. If you want a particular query to wait for
a commit from another write operation, you have to schedule it accordingly.
The following topics describe some of the key concepts and use cases that involve transactions, database
snapshots, updates, and concurrent behavior.
Serializable isolation
Some applications require not only concurrent querying and loading, but also the ability to write to
multiple tables or the same table concurrently. In this context, concurrently means overlapping, not
scheduled to run at precisely the same time. Two transactions are considered to be concurrent if the
second one starts before the first commits. Concurrent operations can originate from different sessions
that are controlled either by the same user or by different users.
Note
Amazon Redshift supports a default automatic commit behavior in which each separately
run SQL command commits individually. If you enclose a set of commands in a transaction
block (defined by BEGIN (p. 568) and END (p. 760) statements), the block commits as one
transaction, so you can roll it back if necessary. Exceptions to this behavior are the TRUNCATE
and VACUUM commands, which automatically commit all outstanding changes made in the
current transaction.
Some SQL clients issue BEGIN and COMMIT commands automatically, so the client controls
whether a group of statements are run as a transaction or each individual statement is run as its
own transaction. Check the documentation for the interface you are using. For example, when
using the Amazon Redshift JDBC driver, a JDBC PreparedStatement with a query string that
contains multiple (semicolon separated) SQL commands runs all the statements as a single
transaction. In contrast, if you use SQL Workbench/J and set AUTO COMMIT ON, then if you run
multiple statements, each statement runs as its own transaction.
Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks
on tables and the principle of serializable isolation. Serializable isolation preserves the illusion that
a transaction running against a table is the only transaction that is running against that table. For
example, two concurrently running transactions, T1 and T2, must produce the same results as at least
one of the following:
• T1 and T2 run serially in that order.
• T2 and T1 run serially in that order.
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each
concurrent transaction will create a snapshot of the database at the beginning of the transaction. A
database snapshot is created within a transaction on the first occurrence of most SELECT statements,
DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL
commands:
• ALTER TABLE (to add or drop columns)
• CREATE TABLE
• DROP TABLE
• TRUNCATE TABLE
If any serial execution of the concurrent transactions produces the same results as their concurrent
execution, those transactions are deemed "serializable" and can be run safely. If no serial execution of
those transactions can produce the same results, the transaction that runs a statement that might break
the ability to serialize is stopped and rolled back.
System catalog tables (PG) and other Amazon Redshift system tables (STL and STV) are not locked in a
transaction. Therefore, changes to database objects that arise from DDL and TRUNCATE operations are
visible on commit to any concurrent transactions.
For example, suppose that table A exists in the database when two concurrent transactions, T1 and T2,
start. Suppose that T2 returns a list of tables by selecting from the PG_TABLES catalog table. Then T1
drops table A and commits, and then T2 lists the tables again. Table A is now no longer listed. If T2 tries
to query the dropped table, Amazon Redshift returns a "relation does not exist" error. The catalog query
that returns the list of tables to T2 or checks that table A exists isn't subject to the same isolation rules as
operations performed on user tables.
Transactions for updates to these tables run in a read committed isolation mode. PG-prefix catalog
tables don't support snapshot isolation.
To address a serializable isolation error, you can try the following methods:
• Retry the canceled transaction.
Amazon Redshift detected that a concurrent workload is not serializable. It suggests gaps in the
application logic, which can usually be worked around by retrying the transaction that encountered
the error. If the issue persists, try one of the other methods.
• Move any operations that don't have to be in the same atomic transaction outside of the transaction.
This method applies when individual operations inside two transactions cross-reference each other in
a way that can affect the outcome of the other transaction. For example, the following two sessions
each start a transaction.
Session1_Redshift=# begin;
Session2_Redshift=# begin;
The result of a SELECT statement in each transaction might be affected by an INSERT statement in the
other. In other words, suppose that you run the following statements serially, in any order. In every
case, the result is one of the SELECT statements returning one more row than if the transactions were
run concurrently. There is no order in which the operations can run serially that produces the same
result as when run concurrently. Thus, the last operation that is run results in a serializable isolation
error.
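A sketch of that cross-referencing pattern, using the hypothetical tables tab1 and tab2 that also appear
in the sessions below:
Session1_Redshift=# begin;
Session1_Redshift=# insert into tab1 values (1);
Session1_Redshift=# select * from tab2;
Session1_Redshift=# end;

Session2_Redshift=# begin;
Session2_Redshift=# insert into tab2 values (1);
Session2_Redshift=# select * from tab1;
Session2_Redshift=# end;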
In many cases, the result of the SELECT statements isn't important. In other words, the atomicity of
the operations in the transactions isn't important. In these cases, move the SELECT statements outside
of their transactions, as shown in the following examples.
Session1_Redshift=# begin;
Session1_Redshift=# insert into tab1 values (1);
Session1_Redshift=# end;
Session1_Redshift=# select * from tab2;
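The second session follows the same pattern. A sketch of its counterpart, assuming the same two tables:
Session2_Redshift=# begin;
Session2_Redshift=# insert into tab2 values (1);
Session2_Redshift=# end;
Session2_Redshift=# select * from tab1;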
In these examples, there are no cross-references in the transactions. The two INSERT statements
don't affect each other. In these examples, there is at least one order in which the transactions can
run serially and produce the same result as if run concurrently. This means that the transactions are
serializable.
• Force serialization by locking all tables in each session.
The LOCK (p. 786) command blocks operations that can result in serializable isolation errors. When
you use the LOCK command, be sure to do the following (a brief example follows this list):
• Lock all tables affected by the transaction, including those affected by read-only SELECT statements
inside the transaction.
• Lock tables in the same order, regardless of the order that operations are performed in.
• Lock all tables at the beginning of the transaction, before performing any operations.
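A sketch of this pattern, reusing the hypothetical tables tab1 and tab2 from the earlier example; both
tables are locked up front, in a fixed order, before any reads or writes.
begin;
lock tab1, tab2;
insert into tab1 values (1);
select * from tab2;
end;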
Transactions in Amazon Redshift follow snapshot isolation. After a transaction begins, Amazon Redshift
takes a snapshot of the database. For the entire lifecycle of the transaction, the transaction operates on
the state of the database as reflected in the snapshot. If the transaction reads from a table that doesn't
exist in the snapshot, it throws the 1018 error message shown previously. Even when another concurrent
transaction creates a table after the transaction has taken the snapshot, the transaction can't read from
the newly created table.
To address this serialization isolation error, you can try to move the start of the transaction to a point
where you know the table exists.
If the table is created by another transaction, this point is at least after that transaction has been
committed. Also, ensure that no concurrent transaction has been committed that might have dropped
the table.
session1 = # BEGIN;
session1 = # DROP TABLE A;
session1 = # COMMIT;
session2 = # BEGIN;
session3 = # BEGIN;
session3 = # CREATE TABLE A (id INT);
session3 = # COMMIT;
session2 = # SELECT * FROM A;
The last operation that runs, the read by session2, results in a serializable isolation error.
This error happens when session2 takes a snapshot and the table has already been dropped by a
committed session1. In other words, even though a concurrent session3 has created the table, session2
doesn't see the table because it's not in the snapshot.
session1 = # BEGIN;
session1 = # DROP TABLE A;
session1 = # COMMIT;
session3 = # BEGIN;
session3 = # CREATE TABLE A (id INT);
session3 = # COMMIT;
session2 = # BEGIN;
session2 = # SELECT * FROM A;
Now when session2 takes its snapshot, session3 has already been committed, and the table is in the
database. Session2 can read from the table without any error.
Write and read/write operations
COPY and INSERT operations are pure write operations, but DELETE and UPDATE operations are
read/write operations. (For rows to be deleted or updated, they have to be read first.) The results of
concurrent write operations depend on the specific commands that are being run concurrently. COPY
and INSERT operations against the same table are held in a wait state until the lock is released, then they
proceed as normal.
UPDATE and DELETE operations behave differently because they rely on an initial table read before they
do any writes. Given that concurrent transactions are invisible to each other, both UPDATEs and DELETEs
have to read a snapshot of the data from the last commit. When the first UPDATE or DELETE releases its
lock, the second UPDATE or DELETE needs to determine whether the data that it is going to work with is
potentially stale. It will not be stale, because the second transaction does not obtain its snapshot of data
until after the first transaction has released its lock.
For example, suppose that transactions T1 and T2 start at roughly the same time. If T1 starts writing to
table A and T2 starts writing to table B, both transactions can proceed without conflict; however, if T1
finishes writing to table A and needs to start writing to table B, it will not be able to proceed because T2
still holds the lock on B. Conversely, if T2 finishes writing to table B and needs to start writing to table A,
it will not be able to proceed either because T1 still holds the lock on A. Because neither transaction can
release its locks until all its write operations are committed, neither transaction can proceed.
In order to avoid this kind of deadlock, you need to schedule concurrent write operations carefully. For
example, you should always update tables in the same order in transactions and, if specifying locks, lock
tables in the same order before you perform any DML operations.
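A sketch of the interleaving described above, with the hypothetical tables table_a and table_b standing
in for tables A and B:
-- Transaction T1 (session 1)
begin;
update table_a set col = 1;   -- T1 acquires the write lock on table_a

-- Transaction T2 (session 2)
begin;
update table_b set col = 1;   -- T2 acquires the write lock on table_b

-- Transaction T1
update table_b set col = 2;   -- waits for T2 to release table_b

-- Transaction T2
update table_a set col = 2;   -- waits for T1 to release table_a; deadlock detected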
Concurrent write examples
Transaction 1 copies rows into the LISTING table:
begin;
copy listing from ...;
end;
Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING
table. Transaction 2 must wait until transaction 1 releases the write lock on the LISTING table, then it can
proceed.
begin;
[waits]
copy listing from ...;
end;
The same behavior would occur if one or both transactions contained an INSERT command instead of a
COPY command.
Transaction 1 deletes rows from the LISTING table:
begin;
delete from listing where ...;
end;
Transaction 2 starts concurrently and attempts to delete rows from the same table. It will succeed
because it waits for transaction 1 to complete before attempting to delete rows.
begin;
[waits]
delete from listing where ...;
end;
The same behavior would occur if one or both transactions contained an UPDATE command to the same
table instead of a DELETE command.
Transaction 1 deletes a row from the USERS table, copies additional rows into the table, queries the row
count, and then analyzes the table:
begin;
delete one row from USERS table;
copy ...;
select count(*) from users;
analyze ;
end;
Meanwhile, transaction 2 starts. This transaction attempts to copy additional rows into the USERS table,
analyze the table, and then run the same COUNT(*) query as the first transaction:
begin;
[waits]
copy users from ...;
select count(*) from users;
analyze;
end;
The second transaction will succeed because it must wait for the first to complete. Its COUNT query will
return the count based on the load it has completed.
Tutorial: Loading data from Amazon S3
In this tutorial, you do the following:
• Download data files that use comma-separated value (CSV), character-delimited, and fixed width
formats.
• Create an Amazon S3 bucket and then upload the data files to the bucket.
• Launch an Amazon Redshift cluster and create database tables.
• Use COPY commands to load the tables from the data files on Amazon S3.
• Troubleshoot load errors and modify your COPY commands to correct the errors.
Prerequisites
You need the following prerequisites:
• An AWS account to launch an Amazon Redshift cluster and to create a bucket in Amazon S3.
• Your AWS credentials (IAM role) to load test data from Amazon S3. If you need to create a new IAM
role, go to Creating IAM roles.
• A SQL client such as the Amazon Redshift console query editor.
This tutorial is designed so that it can be taken by itself. In addition to this tutorial, we recommend
completing the following tutorials to gain a more complete understanding of how to design and use
Amazon Redshift databases:
• Amazon Redshift Getting Started Guide walks you through the process of creating an Amazon Redshift
cluster and loading sample data.
Overview
You can add data to your Amazon Redshift tables either by using an INSERT command or by using a
COPY command. At the scale and speed of an Amazon Redshift data warehouse, the COPY command is
many times faster and more efficient than INSERT commands.
The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to
read and load data in parallel from multiple data sources. You can load from data files on Amazon S3,
Amazon EMR, or any remote host accessible through a Secure Shell (SSH) connection. Or you can load
directly from an Amazon DynamoDB table.
In this tutorial, you use the COPY command to load data from Amazon S3. Many of the principles
presented here apply to loading from other data sources as well.
To learn more about using the COPY command, see Amazon Redshift best practices for loading
data (p. 19) and the COPY command reference.
Steps
• Step 1: Create a cluster (p. 123)
• Step 2: Download the data files (p. 125)
• Step 3: Upload the files to an Amazon S3 bucket (p. 125)
• Step 4: Create the sample tables (p. 127)
• Step 5: Run the COPY commands (p. 129)
• Step 6: Vacuum and analyze the database (p. 141)
• Step 7: Clean up your resources (p. 141)
New console
To create a cluster
1. Sign in to the AWS Management Console and open the Amazon Redshift console at
https://console.aws.amazon.com/redshift/.
Important
If you use IAM user credentials, make sure that you have the necessary permissions to
perform the cluster operations. For more information, see Controlling access to IAM users in
the Amazon Redshift Cluster Management Guide.
2. At top right, choose the AWS Region in which you want to create the cluster. For the purposes of this
tutorial, choose US West (Oregon).
3. On the navigation menu, choose CLUSTERS, then choose Create cluster. The Create cluster page
appears.
4. On the Create cluster page, enter parameters for your cluster. Choose your own values for the
parameters, except change the following values:
• Node type: Choose dc2.large.
• Number of nodes: Choose 4.
Original console
Follow the steps in Amazon Redshift Getting Started Guide, but choose Multi Node for Cluster Type and
set Number of Compute Nodes to 4.
Follow the Amazon Redshift Getting Started Guide steps to connect to your cluster from a SQL client
and test a connection. You don't need to complete the remaining Getting Started steps to create tables,
upload data, and try example queries.
Next step
Step 2: Download the data files (p. 125)
After you download and extract the sample data files, your download folder contains the following files:
customer-fw-manifest
customer-fw.tbl-000
customer-fw.tbl-000.bak
customer-fw.tbl-001
customer-fw.tbl-002
customer-fw.tbl-003
customer-fw.tbl-004
customer-fw.tbl-005
customer-fw.tbl-006
customer-fw.tbl-007
customer-fw.tbl.log
dwdate-tab.tbl-000
dwdate-tab.tbl-001
dwdate-tab.tbl-002
dwdate-tab.tbl-003
dwdate-tab.tbl-004
dwdate-tab.tbl-005
dwdate-tab.tbl-006
dwdate-tab.tbl-007
part-csv.tbl-000
part-csv.tbl-001
part-csv.tbl-002
part-csv.tbl-003
part-csv.tbl-004
part-csv.tbl-005
part-csv.tbl-006
part-csv.tbl-007
Next step
Step 3: Upload the files to an Amazon S3 bucket (p. 125)
a. Sign in to the AWS Management Console and open the Amazon S3 console at
https://console.aws.amazon.com/s3/.
b. Click Create Bucket.
c. In the Bucket Name box of the Create a Bucket dialog box, type a bucket name.
The bucket name you choose must be unique among all existing bucket names in Amazon
S3. One way to help ensure uniqueness is to prefix your bucket names with the name of your
organization. Bucket names must comply with certain rules. For more information, go to Bucket
restrictions and limitations in the Amazon Simple Storage Service User Guide.
d. Select a Region.
Create the bucket in the same Region as your cluster. If your cluster is in the US West (Oregon)
Region, choose US West (Oregon) Region (us-west-2).
e. Choose Create.
When Amazon S3 successfully creates your bucket, the console displays your empty bucket in
the Buckets panel.
2. Create a folder named load in the bucket.
3. Upload the data files to the load folder.
Follow the Amazon S3 console instructions to upload all of the files you downloaded and
extracted, then choose Start Upload.
User Credentials
The Amazon Redshift COPY command must have access to read the file objects in the Amazon S3 bucket.
If you use the same user credentials to create the Amazon S3 bucket and to run the Amazon Redshift
COPY command, the COPY command has all necessary permissions. If you want to use different user
credentials, you can grant access by using the Amazon S3 access controls. The Amazon Redshift COPY
command requires at least ListBucket and GetObject permissions to access the file objects in the Amazon
S3 bucket. For more information about controlling access to Amazon S3 resources, go to Managing
access permissions to your Amazon S3 resources.
Next step
Step 4: Create the sample tables (p. 127)
The SSB tables might already exist in the current database. If so, drop the tables to remove them from
the database before you create them using the CREATE TABLE commands in the next step. The tables
used in this tutorial might have different attributes than the existing tables.
1. To drop the SSB tables, run the following commands in your SQL client.
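A sketch of the DROP statements for the five SSB tables used in this tutorial (PART, SUPPLIER,
CUSTOMER, DWDATE, and LINEORDER); the exact statements in the original tutorial may differ.
drop table if exists part cascade;
drop table if exists supplier;
drop table if exists customer;
drop table if exists dwdate;
drop table if exists lineorder;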
2. Create the tables by running the CREATE TABLE commands for the five SSB tables: PART, SUPPLIER,
CUSTOMER, DWDATE, and LINEORDER.
Next step
Step 5: Run the COPY commands (p. 129)
Topics
• COPY command syntax (p. 129)
• Loading the SSB tables (p. 131)
Table name
The target table for the COPY command. The table must already exist in the database. The table can be
temporary or persistent. The COPY command appends the new input data to any existing rows in the
table.
Column list
By default, COPY loads fields from the source data to the table columns in order. You can optionally
specify a column list, that is a comma-separated list of column names, to map data fields to specific
columns. You don't use column lists in this tutorial. For more information, see Column List (p. 592) in
the COPY command reference.
Data source
You can use the COPY command to load data from an Amazon S3 bucket, an Amazon EMR cluster, a
remote host using an SSH connection, or an Amazon DynamoDB table. For this tutorial, you load from
data files in an Amazon S3 bucket. When loading from Amazon S3, you must provide the name of the
bucket and the location of the data files. To do this, provide either an object path for the data files or the
location of a manifest file that explicitly lists each data file and its location.
• Key prefix
An object stored in Amazon S3 is uniquely identified by an object key, which includes the bucket name,
folder names, if any, and the object name. A key prefix refers to a set of objects with the same prefix.
The object path is a key prefix that the COPY command uses to load all objects that share the key
prefix. For example, the key prefix custdata.txt can refer to a single file or to a set of files, including
custdata.txt.001, custdata.txt.002, and so on.
• Manifest file
In some cases, you might need to load files with different prefixes, for example from multiple buckets
or folders. In others, you might need to exclude files that share a prefix. In these cases, you can use a
manifest file. A manifest file explicitly lists each load file and its unique object key. You use a manifest
file to load the PART table later in this tutorial.
Credentials
To access the AWS resources that contain the data to load, you must provide AWS access credentials
for an AWS user or an IAM user with sufficient privileges. These credentials include an IAM role
Amazon Resource Name (ARN). To load data from Amazon S3, the credentials must include ListBucket
and GetObject permissions. Additional credentials are required if your data is encrypted. For more
information, see Authorization parameters (p. 590) in the COPY command reference. For more
information about managing access, go to Managing access permissions to your Amazon S3 resources.
Options
You can specify a number of parameters with the COPY command to specify file formats, manage
data formats, manage errors, and control other features. In this tutorial, you use the following COPY
command options and features:
• Key prefix
For information on how to load from multiple files by specifying a key prefix, see Load the PART table
using NULL AS (p. 131).
• CSV format
For information on how to load data that is in CSV format, see Load the PART table using NULL
AS (p. 131).
• NULL AS
For information on how to load PART using the NULL AS option, see Load the PART table using NULL
AS (p. 131).
• Character-delimited format
For information on how to use the DELIMITER option, see Load the SUPPLIER table using
REGION (p. 134).
• REGION
For information on how to use the REGION option, see Load the SUPPLIER table using
REGION (p. 134).
• Fixed-width format
For information on how to load the CUSTOMER table from fixed-width data, see Load the CUSTOMER
table using MANIFEST (p. 134).
• MAXERROR
For information on how to use the MAXERROR option, see Load the CUSTOMER table using
MANIFEST (p. 134).
• ACCEPTINVCHARS
For information on how to use the ACCEPTINVCHARS option, see Load the CUSTOMER table using
MANIFEST (p. 134).
• MANIFEST
For information on how to use the MANIFEST option, see Load the CUSTOMER table using
MANIFEST (p. 134).
• DATEFORMAT
For information on how to use the DATEFORMAT option, see Load the DWDATE table using
DATEFORMAT (p. 138).
• GZIP, LZOP and BZIP2
For information on how to compress your files, see Load the LINEORDER table using multiple
files (p. 139).
• COMPUPDATE
For information on how to use the COMPUPDATE option, see Load the LINEORDER table using
multiple files (p. 139).
• Multiple files
For information on how to load multiple files, see Load the LINEORDER table using multiple
files (p. 139).
1. Replace <your-bucket-name> with the name of a bucket in the same region as your cluster.
This step assumes the bucket and the cluster are in the same region. Alternatively, you can specify the
region using the REGION (p. 583) option with the COPY command.
2. Replace <aws-account-id> and <role-name> with your own AWS account and IAM role. The
segment of the credentials string that is enclosed in single quotation marks must not contain any
spaces or line breaks.
The COPY command can load data from multiple files in parallel, which is much faster than loading from
a single file. To demonstrate this principle, the data for each table in this tutorial is split into eight files,
even though the files are very small. In a later step, you compare the time difference between loading
from a single file and loading from multiple files. For more information, see Split your load data (p. 20).
Key prefix
You can load from multiple files by specifying a key prefix for the file set, or by explicitly listing the files
in a manifest file. In this step, you use a key prefix. In a later step, you use a manifest file. The key prefix
's3://mybucket/load/part-csv.tbl' loads the following set of the files in the load folder.
part-csv.tbl-000
part-csv.tbl-001
part-csv.tbl-002
part-csv.tbl-003
part-csv.tbl-004
part-csv.tbl-005
part-csv.tbl-006
part-csv.tbl-007
CSV format
CSV, which stands for comma separated values, is a common format used for importing and exporting
spreadsheet data. CSV is more flexible than comma-delimited format because it enables you to include
quoted strings within fields. The default quotation mark character for COPY from CSV format is a
double quotation mark ( " ), but you can specify another quotation mark character by using the QUOTE
AS option. When you use the quotation mark character within the field, escape the character with an
additional quotation mark character.
The following excerpt from a CSV-formatted data file for the PART table shows strings enclosed in
double quotation marks ("LARGE ANODIZED BRASS"). It also shows a string enclosed in two double
quotation marks within a quoted string ("MEDIUM ""BURNISHED"" TIN").
The data for the PART table contains characters that cause COPY to fail. In this exercise, you
troubleshoot the errors and correct them.
To load data that is in CSV format, add csv to your COPY command. Run the following command to load
the PART table.
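A sketch of that command, assuming the bucket layout from Step 3 and the placeholder values described
earlier; the original tutorial's command may differ slightly. This first attempt is expected to fail.
copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;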
ERROR: Load into table 'part' failed. Check 'stl_load_errors' system table for details.
[SQL State=XX000]
1 statement(s) failed.
To get more information about the error, query the STL_LOAD_ERRORS table. The following query uses
the SUBSTRING function to shorten columns for readability and uses LIMIT 10 to reduce the number of
rows returned. You can adjust the values in substring(filename,22,25) to allow for the length of
your bucket name.
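A sketch of such a query; the column names come from STL_LOAD_ERRORS, and the exact SUBSTRING
lengths in the original tutorial may differ.
select query, substring(filename,22,25) as filename, line_number as line,
substring(colname,0,12) as col_name, type, position as pos,
substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as error_reason
from stl_load_errors
order by query desc
limit 10;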
NULL AS
The part-csv.tbl data files use the NUL terminator character (\x000 or \x0) to indicate NULL values.
Note
Despite very similar spelling, NUL and NULL are not the same. NUL is a UTF-8 character with
codepoint x000 that is often used to indicate end of record (EOR). NULL is a SQL value that
represents an absence of data.
By default, COPY treats a NUL terminator character as an EOR character and terminates the record,
which often results in unexpected results or an error. There is no single standard method of indicating
NULL in text data. Thus, the NULL AS COPY command option enables you to specify which character
to substitute with NULL when loading the table. In this example, you want COPY to treat the NUL
terminator character as a NULL value.
Note
The table column that receives the NULL value must be configured as nullable. That is, it must
not include the NOT NULL constraint in the CREATE TABLE specification.
To load PART using the NULL AS option, run the following COPY command.
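A sketch of that command, adding NULL AS for the NUL terminator character; placeholders are as
before, and the original command may differ slightly.
copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
null as '\000';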
To verify that COPY loaded NULL values, run the following command to select only the rows that contain
NULL.
select p_partkey, p_name, p_mfgr, p_category from part where p_mfgr is null;
The query should return two rows that contain NULL in the p_mfgr column.
Character-Delimited Format
The fields in a character-delimited file are separated by a specific character, such as a pipe character ( | ),
a comma ( , ) or a tab ( \t ). Character-delimited files can use any single ASCII character, including one
of the nonprinting ASCII characters, as the delimiter. You specify the delimiter character by using the
DELIMITER option. The default delimiter is a pipe character ( | ).
The following excerpt from the data for the SUPPLIER table uses pipe-delimited format.
1|1|257368|465569|41365|19950218|2-HIGH|0|17|2608718|9783671|4|2504369|92072|2|19950331|
TRUCK
1|2|257368|201928|8146|19950218|2-HIGH|0|36|6587676|9783671|9|5994785|109794|6|19950416|
MAIL
REGION
Whenever possible, you should locate your load data in the same AWS region as your Amazon Redshift
cluster. If your data and your cluster are in the same region, you reduce latency and avoid cross-region
data transfer costs. For more information, see Amazon Redshift best practices for loading data (p. 19).
If you must load data from a different AWS region, use the REGION option to specify the AWS region in
which the load data is located. If you specify a region, all of the load data, including manifest files, must
be in the named region. For more information, see REGION (p. 583).
If your cluster is in the US East (N. Virginia) Region, run the following command to load the SUPPLIER
table from pipe-delimited data in an Amazon S3 bucket located in the US West (Oregon) Region. For this
example, do not change the bucket name.
If your cluster is not in the US East (N. Virginia) region, run the following command to load the SUPPLIER
table from pipe-delimited data in an Amazon S3 bucket located in the US East (N. Virginia) region. For
this example, do not change the bucket name.
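A sketch of such a command for the first case (cluster in US East, data in US West (Oregon)); the AWS
sample bucket name and object key are shown as placeholders because they aren't given here, and the
REGION value must name the Region where the load data is located.
copy supplier from 's3://<sample-bucket>/load/supplier.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
delimiter '|'
region 'us-west-2';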
The sample data for this exercise contains characters that cause errors when COPY attempts to load
them. You use the MAXERROR option and the STL_LOAD_ERRORS system table to troubleshoot the
load errors and then use the ACCEPTINVCHARS and MANIFEST options to eliminate the errors.
Fixed-Width Format
Fixed-width format defines each field as a fixed number of characters, rather than separating fields with
a delimiter. The following excerpt from the data for the CUSTOMER table uses fixed-width format.
The order of the label/width pairs must match the order of the table columns exactly. For more
information, see FIXEDWIDTH (p. 594).
The fixed-width specification string for the CUSTOMER table data is as follows.
To load the CUSTOMER table from fixed-width data, run the following command.
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12,
c_phone:15,c_mktsegment:10';
ERROR: Load into table 'customer' failed. Check 'stl_load_errors' system table for
details. [SQL State=XX000]
1 statement(s) failed.
MAXERROR
By default, the first time COPY encounters an error, the command fails and returns an error message. To
save time during testing, you can use the MAXERROR option to instruct COPY to skip a specified number
of errors before it fails. Because we expect errors the first time we test loading the CUSTOMER table
data, add maxerror 10 to the COPY command.
To test using the FIXEDWIDTH and MAXERROR options, run the following command.
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12,
c_phone:15,c_mktsegment:10'
maxerror 10;
This time, instead of an error message, you get a warning message similar to the following.
Warnings:
Load into table 'customer' completed, 112497 record(s) loaded successfully.
Load into table 'customer' completed, 7 record(s) could not be loaded. Check
'stl_load_errors' system table for details.
The warning indicates that COPY encountered seven errors. To check the errors, query the
STL_LOAD_ERRORS table, as shown in the following example.
The results of the STL_LOAD_ERRORS query should look similar to the following.
By examining the results, you can see that there are two messages in the error_reasons column:
•
Invalid digit, Value '#', Pos 0, Type: Integ
These errors are caused by the customer-fw.tbl.log file. The problem is that it is a log file, not a
data file, and should not be loaded. You can use a manifest file to avoid loading the wrong file.
•
String contains invalid or unsupported UTF8
The VARCHAR data type supports multibyte UTF-8 characters up to three bytes. If the load data
contains unsupported or invalid characters, you can use the ACCEPTINVCHARS option to replace each
invalid character with a specified alternative character.
Another problem with the load is more difficult to detect—the load produced unexpected results. To
investigate this problem, run the following command to query the CUSTOMER table.
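A sketch of a query for that check; the original tutorial's query may select different columns. Looking at
the first few rows by customer key helps reveal unexpected or duplicated data.
select c_custkey, c_name, c_address
from customer
order by c_custkey
limit 10;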
Another way to check for unexpected results is to verify the number of rows that were loaded. In our
case, 100000 rows should have been loaded, but the load message reported loading 112497 records.
The extra rows were loaded because the COPY loaded an extraneous file, customer-fw.tbl-000.bak.
In this exercise, you use a manifest file to avoid loading the wrong files.
ACCEPTINVCHARS
By default, when COPY encounters a character that is not supported by the column's data type, it skips
the row and returns an error. For information about invalid UTF-8 characters, see Multibyte character
load errors (p. 94).
You could use the MAXERRORS option to ignore errors and continue loading, then query
STL_LOAD_ERRORS to locate the invalid characters, and then fix the data files. However, MAXERRORS
is best used for troubleshooting load problems and should generally not be used in a production
environment.
The ACCEPTINVCHARS option is usually a better choice for managing invalid characters.
ACCEPTINVCHARS instructs COPY to replace each invalid character with a specified valid character
and continue with the load operation. You can specify any valid ASCII character, except NUL, as the
replacement character. The default replacement character is a question mark ( ? ). COPY replaces
multibyte characters with a replacement string of equal length. For example, a 4-byte character would
be replaced with '????'.
COPY returns the number of rows that contained invalid UTF-8 characters. It also adds an entry to the
STL_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice.
Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded.
For this step, you add the ACCEPTINVCHARS option with the replacement character '^'.
MANIFEST
When you COPY from Amazon S3 using a key prefix, there is a risk that you might load unwanted files.
For example, the s3://mybucket/load/ folder contains eight data files that share the key prefix
customer-fw.tbl: customer-fw.tbl-000, customer-fw.tbl-001, and so on. However, the same
folder also contains the extraneous files customer-fw.tbl.log and customer-fw.tbl-000.bak.
To ensure that you load all of the correct files, and only the correct files, use a manifest file. The manifest
is a text file in JSON format that explicitly lists the unique object key for each source file to be loaded.
The file objects can be in different folders or different buckets, but they must be in the same region. For
more information, see MANIFEST (p. 583).
{
"entries": [
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-000"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-001"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-002"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-003"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-004"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-005"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-006"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-007"}
]
}
To load the data for the CUSTOMER table using the manifest file, upload the customer-fw-manifest file
to the load folder of your bucket and then run a COPY command that specifies the MANIFEST option.
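A sketch of the resulting command, combining the FIXEDWIDTH, MAXERROR, and ACCEPTINVCHARS
options shown earlier with the MANIFEST option; the original command may differ slightly.
copy customer
from 's3://<your-bucket-name>/load/customer-fw-manifest'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10
acceptinvchars as '^'
manifest;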
When loading DATE and TIMESTAMP columns, COPY expects the default format, which is YYYY-MM-DD
for dates and YYYY-MM-DD HH:MI:SS for timestamps. If the load data does not use a default format, you
can use DATEFORMAT and TIMEFORMAT to specify the format.
The following excerpt shows date formats in the DWDATE table. Notice that the date formats in column
two are inconsistent.
DATEFORMAT
You can specify only one date format. If the load data contains inconsistent formats, possibly in
different columns, or if the format is not known at load time, you use DATEFORMAT with the 'auto'
argument. When 'auto' is specified, COPY recognizes any valid date or time format and converts it to
the default format. The 'auto' option recognizes several formats that are not supported when using
a DATEFORMAT and TIMEFORMAT string. For more information, see Using automatic recognition with
DATEFORMAT and TIMEFORMAT (p. 621).
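A sketch of a command that loads the DWDATE table from the tab-delimited files with DATEFORMAT
'auto'; the original tutorial's command may differ slightly.
copy dwdate from 's3://<your-bucket-name>/load/dwdate-tab.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
delimiter '\t'
dateformat 'auto';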
In this exercise, you load the LINEORDER table from a single data file and then load it again from
multiple files. Doing this enables you to compare the load times for the two methods.
Note
The files for loading the LINEORDER table are provided in an AWS sample bucket. You don't
need to upload files for this step.
You can compress your files using either gzip, lzop, or bzip2 compression formats. When loading from
compressed files, COPY uncompresses the files during the load process. Compressing your files saves
storage space and shortens upload times.
COMPUPDATE
When COPY loads an empty table with no compression encodings, it analyzes the load data to determine
the optimal encodings. It then alters the table to use those encodings before beginning the load. This
analysis process takes time, but it occurs, at most, once per table. To save time, you can skip this step by
turning COMPUPDATE off. To enable an accurate evaluation of COPY times, you turn COMPUPDATE off
for this step.
Multiple Files
The COPY command can load data very efficiently when it loads from multiple files in parallel instead
of from a single file. You can split your data into files so that the number of files is a multiple of the
number of slices in your cluster. If you do, Amazon Redshift divides the workload and distributes the data
evenly among the slices. The number of slices per node depends on the node size of the cluster. For more
information about the number of slices that each node size has, go to About clusters and nodes in the
Amazon Redshift Cluster Management Guide.
For example, the dc2.large compute nodes used in this tutorial have two slices each, so the four-node
cluster has eight slices. In previous steps, the load data was contained in eight files, even though the files
are very small. In this step, you compare the time difference between loading from a single large file and
loading from multiple files.
The files you use for this tutorial contain about 15 million records and occupy about 1.2 GB. These files
are very small in Amazon Redshift scale, but sufficient to demonstrate the performance advantage of
loading from multiple files. The files are large enough that the time required to download them and then
upload them to Amazon S3 is excessive for this tutorial. Thus, you load the files directly from an AWS
sample bucket.
1. Run the following command to COPY from a single file. Do not change the bucket name.
2. Your results should be similar to the following. Note the execution time.
Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
0 row(s) affected.
copy executed successfully
3. Run the following command to COPY from multiple files. Do not change the bucket name.
4. Your results should be similar to the following. Note the execution time.
Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
0 row(s) affected.
copy executed successfully
In our example, the time to load 15 million records decreased from 51.56 seconds to 17.7 seconds, a
reduction of 65.7 percent.
These results are based on using a four-node cluster. If your cluster has more nodes, the time savings
is multiplied. For typical Amazon Redshift clusters, with tens to hundreds of nodes, the difference
is even more dramatic. If you have a single node cluster, there is little difference between the
execution times.
Next step
Step 6: Vacuum and analyze the database (p. 141)
If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number
of rows, but you added them to empty tables. That being the case, there is no need to re-sort, and you
didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your
statistics should be up-to-date. However, as a matter of good housekeeping, you complete this tutorial
by vacuuming and analyzing your database.
vacuum;
analyze;
Next step
Step 7: Clean up your resources (p. 141)
If you want to keep the cluster, but recover the storage used by the SSB tables, run the following
commands.
Next
Summary (p. 142)
Summary
In this tutorial, you uploaded data files to Amazon S3 and then used COPY commands to load the data
from the files into Amazon Redshift tables.
You loaded data in the following formats:
• Character-delimited
• CSV
• Fixed-width
You used the STL_LOAD_ERRORS system table to troubleshoot load errors, and then used the REGION,
MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the errors.
For more information about Amazon Redshift best practices, see Amazon Redshift best practices for
loading data (p. 19).
Unloading data
Topics
• Unloading data to Amazon S3 (p. 143)
• Unloading encrypted data files (p. 145)
• Unloading data in delimited or fixed-width format (p. 147)
• Reloading unloaded data (p. 148)
Unloading data to Amazon S3
To unload data from database tables to a set of files in an Amazon S3 bucket, you can use the
UNLOAD (p. 850) command with a SELECT statement. You can unload text data in either delimited
format or fixed-width format, regardless of the data format that was used to load it. You can also specify
whether to create compressed GZIP files.
You can limit the access users have to your Amazon S3 bucket by using temporary security credentials.
You can use any select statement in the UNLOAD command that Amazon Redshift supports, except
for a select that uses a LIMIT clause in the outer select. For example, you can use a select statement
that includes specific columns or that uses a where clause to join multiple tables. If your query contains
quotation marks (enclosing literal values, for example), you need to escape them in the query text (\').
For more information, see the SELECT (p. 801) command reference. For more information about using a
LIMIT clause, see the Usage notes (p. 856) for the UNLOAD command.
For example, the following UNLOAD command sends the contents of the VENUE table to the Amazon S3
bucket s3://mybucket/tickit/unload/.
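A sketch of such a command; the IAM role ARN shown here matches the one used in the reload example
later in this chapter, and the original example may differ slightly.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';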
The file names created by the previous example include the prefix 'venue_'.
venue_0000_part_00
venue_0001_part_00
venue_0002_part_00
venue_0003_part_00
By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the
cluster. To write data to a single file, specify PARALLEL OFF. UNLOAD writes the data serially, sorted
absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. If
the data size is greater than the maximum, UNLOAD creates additional files, up to 6.2 GB each.
The following example writes the contents of VENUE to a single file. Only one file is required because the
file size is less than 6.2 GB.
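A sketch of a serial unload, reusing the prefix and role from the previous example; the original example
may differ.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
parallel off;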
Note
The UNLOAD command is designed to use parallel processing. We recommend leaving
PARALLEL enabled for most cases, especially if the files will be used to load tables using a COPY
command.
Assuming the total data size for VENUE is 5 GB, the following example writes the contents of VENUE to
50 files, each 100 MB in size.
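A sketch of such a command, assuming the MAXFILESIZE option is used to cap each output file at
100 MB; the original example may differ.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
maxfilesize 100 mb;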
If you include a prefix in the Amazon S3 path string, UNLOAD will use that prefix for the file names.
You can create a manifest file that lists the unload files by specifying the MANIFEST option in the
UNLOAD command. The manifest is a text file in JSON format that explicitly lists the URL of each file
that was written to Amazon S3.
{
"entries": [
{"url":"s3://mybucket/tickit/venue_0000_part_00"},
{"url":"s3://mybucket/tickit/venue_0001_part_00"},
{"url":"s3://mybucket/tickit/venue_0002_part_00"},
{"url":"s3://mybucket/tickit/venue_0003_part_00"}
]
}
The manifest file can be used to load the same files by using a COPY with the MANIFEST option. For
more information, see Using a manifest to specify data files (p. 73).
After you complete an UNLOAD operation, confirm that the data was unloaded correctly by navigating to
the Amazon S3 bucket where UNLOAD wrote the files. You will see one or more numbered files per slice,
starting with the number zero. If you specified the MANIFEST option, you will also see a file ending with
'manifest'. For example:
mybucket/tickit/venue_0000_part_00
mybucket/tickit/venue_0001_part_00
mybucket/tickit/venue_0002_part_00
mybucket/tickit/venue_0003_part_00
mybucket/tickit/venue_manifest
You can programmatically get a list of the files that were written to Amazon S3 by calling an Amazon S3
list operation after the UNLOAD completes. You can also query STL_UNLOAD_LOG.
The following query returns the pathname for files that were created by an UNLOAD. The
PG_LAST_QUERY_ID (p. 1234) function returns the most recent query.
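A sketch of that query; STL_UNLOAD_LOG records one row per file written.
select query, path
from stl_unload_log
where query = pg_last_query_id()
order by path;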
query | path
-------+--------------------------------------
2320 | s3://my-bucket/venue0000_part_00
2320 | s3://my-bucket/venue0001_part_00
2320 | s3://my-bucket/venue0002_part_00
2320 | s3://my-bucket/venue0003_part_00
(4 rows)
If the amount of data is very large, Amazon Redshift might split the files into multiple parts per slice. For
example:
venue_0000_part_00
venue_0000_part_01
venue_0000_part_02
venue_0001_part_00
venue_0001_part_01
venue_0001_part_02
...
The following UNLOAD command includes a quoted string in the select statement, so the quotation
marks are escaped (=\'OH\' ').
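A sketch of a command with an escaped quoted literal; the WHERE column venuestate is an assumption
inferred from the ='OH' fragment above, and the prefix is a placeholder.
unload ('select * from venue where venuestate=\'OH\' ')
to 's3://mybucket/tickit/unload/venue_oh_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';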
By default, UNLOAD will fail rather than overwrite existing files in the destination bucket. To overwrite
the existing files, including the manifest file, specify the ALLOWOVERWRITE option.
Unloading encrypted data files
By default, UNLOAD creates files using Amazon S3 server-side encryption with AWS-managed encryption
keys (SSE-S3). You can also specify server-side encryption with an AWS Key Management Service key
(SSE-KMS) or client-side encryption with a customer managed key. UNLOAD doesn't support
Amazon S3 server-side encryption using a customer managed key. For more information, see Protecting
data using server-side encryption.
To unload to Amazon S3 using server-side encryption with an AWS KMS key, use the KMS_KEY_ID
parameter to provide the key ID as shown in the following example.
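A sketch of such a command; the key ID is a placeholder for your own AWS KMS key.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
kms_key_id '<kms-key-id>'
encrypted;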
If you want to provide your own encryption key, you can create client-side encrypted data files in
Amazon S3 by using the UNLOAD command with the ENCRYPTED option. UNLOAD uses the same
envelope encryption process that Amazon S3 client-side encryption uses. You can then use the COPY
command with the ENCRYPTED option to load the encrypted files.
1. You create a base64 encoded 256-bit AES key that you will use as your private encryption key, or root
symmetric key.
2. You issue an UNLOAD command that includes your root symmetric key and the ENCRYPTED option.
3. UNLOAD generates a one-time-use symmetric key (called the envelope symmetric key) and an
initialization vector (IV), which it uses to encrypt your data.
4. UNLOAD encrypts the envelope symmetric key using your root symmetric key.
5. UNLOAD then stores the encrypted data files in Amazon S3 and stores the encrypted envelope key
and IV as object metadata with each file. The encrypted envelope key is stored as object metadata
x-amz-meta-x-amz-key and the IV is stored as object metadata x-amz-meta-x-amz-iv.
For more information about the envelope encryption process, see the Client-side data encryption with
the AWS SDK for Java and Amazon S3 article.
To unload encrypted data files, add the root key value to the credentials string and include the
ENCRYPTED option. If you use the MANIFEST option, the manifest file is also encrypted.
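A sketch of a client-side encrypted unload; <root_key> stands for the base64-encoded AES key described
above, and the prefix is a placeholder.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
encrypted;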
To unload encrypted data files that are GZIP compressed, include the GZIP option along with the root
key value and the ENCRYPTED option.
To load the encrypted data files, add the MASTER_SYMMETRIC_KEY parameter with the same root key
value and include the ENCRYPTED option.
copy venue from 's3://mybucket/venue'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
encrypted;
Unloading data in delimited or fixed-width format
To unload a result set to a tab-delimited file, issue a command like the following:
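A sketch of a tab-delimited unload; the query and prefix are placeholders.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_tab_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter as '\t';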
Alternatively, you can use a FIXEDWIDTH specification. This specification consists of an identifier for
each table column and the width of the column (number of characters). The UNLOAD command will fail
rather than truncate data, so specify a width that is at least as long as the longest entry for that column.
Unloading fixed-width data works similarly to unloading delimited data, except that the resulting output
contains no delimiting characters. For example:
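A sketch of a fixed-width unload; the column labels and widths here are hypothetical and must be
adjusted to match the columns and longest values in your own table.
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_fw_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'venueid:8,venuename:100,venuecity:30,venuestate:2,venueseats:10';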
For more details about FIXEDWIDTH specifications, see the UNLOAD (p. 850) command.
Reloading unloaded data
The following example shows a simple case in which the VENUE table is unloaded using a manifest file,
truncated, and reloaded.
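A sketch of the unload half of this example; the prefix matches the manifest path used in the COPY
command below, and the original example may differ.
unload ('select * from venue')
to 's3://mybucket/tickit/venue/reload_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';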
truncate venue;
copy venue
from 's3://mybucket/tickit/venue/reload_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';
Creating user-defined functions
For Python UDFs, in addition to using the standard Python functionality, you can import your own
custom Python modules. For more information, see Python language support for UDFs (p. 152).
You can also create AWS Lambda UDFs that use custom functions defined in Lambda as part of your SQL
queries. Lambda UDFs enable you to write complex UDFs and integrate with third-party components.
They also can help you overcome some of the limitations of current Python and SQL UDFs. For example,
they can help you access network and storage resources and write more full-fledged SQL statements.
You can create Lambda UDFs in any of the programming languages supported by Lambda, such as Java,
Go, PowerShell, Node.js, C#, Python, and Ruby. Or you can use a custom runtime.
By default, all users can run UDFs. For more information about privileges, see UDF security and
privileges (p. 149).
Topics
• UDF security and privileges (p. 149)
• Creating a scalar SQL UDF (p. 150)
• Creating a scalar Python UDF (p. 151)
• Creating a scalar Lambda UDF (p. 156)
• Example uses of user-defined functions (UDFs) (p. 162)
UDF security and privileges
To create a UDF, you must have permission for usage on language for SQL or plpythonu (Python). By
default, USAGE ON LANGUAGE SQL is granted to PUBLIC, but you must explicitly grant USAGE ON
LANGUAGE PLPYTHONU to specific users or groups.
To revoke usage for SQL, first revoke usage from PUBLIC. Then grant usage on SQL only to the specific
users or groups permitted to create SQL UDFs. The following example revokes usage on SQL from
PUBLIC. Then it grants usage to the user group udf_devs.
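A sketch of those statements:
revoke usage on language sql from PUBLIC;
grant usage on language sql to group udf_devs;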
To run a UDF, you must have permission to do so for each function. By default, permission to run new
UDFs is granted to PUBLIC. To restrict usage, revoke this permission from PUBLIC for the function. Then
grant the privilege to specific individuals or groups.
The following example revokes execution on function f_py_greater from PUBLIC. Then it grants usage
to the user group udf_devs.
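A sketch of those statements; the argument list (float, float) is an assumption based on the
greater-of-two-numbers example.
revoke execute on function f_py_greater(float, float) from PUBLIC;
grant execute on function f_py_greater(float, float) to group udf_devs;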
For more information, see GRANT (p. 767) and REVOKE (p. 791).
Creating a scalar SQL UDF
The input and return data types can be any standard Amazon Redshift data type.
Don't include a FROM clause in your SELECT clause. Instead, include the FROM clause in the SQL
statement that calls the SQL UDF.
The SELECT clause can't include any of the following types of clauses:
• FROM
• INTO
• WHERE
• GROUP BY
• ORDER BY
• LIMIT
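A sketch of a SQL UDF that fits these rules and returns the greater of two numbers; the original guide's
definition may differ slightly.
create function f_sql_greater (float, float)
returns float
stable
as $$
  select case when $1 > $2 then $1 else $2 end
$$ language sql;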
The following query calls the new f_sql_greater function to query the SALES table and return either
COMMISSION or 20 percent of PRICEPAID, whichever is greater.
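A sketch of that query, assuming SALES has COMMISSION and PRICEPAID columns as described:
select f_sql_greater(commission, pricepaid * 0.20)
from sales;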
Creating a scalar Python UDF
A scalar Python UDF incorporates the following:
• (Optional) Input arguments. Each argument must have a name and a data type.
• One return data type.
• One executable Python program.
The input and return data types can be SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE
PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, or TIMESTAMP. In addition, Python UDFs can use the data
type ANYELEMENT, which Amazon Redshift automatically converts to a standard data type based on the
arguments supplied at runtime. For more information, see ANYELEMENT data type (p. 152)
When an Amazon Redshift query calls a scalar UDF, the following steps occur at runtime:
1. The function converts the input arguments to Python data types.
For a mapping of Amazon Redshift data types to Python data types, see Python UDF data
types (p. 151).
2. The function runs the Python program, passing the converted input arguments.
3. The Python code returns a single value. The data type of the return value must correspond to the
RETURNS data type specified by the function definition.
4. The function converts the Python return value to the specified Amazon Redshift data type, then
returns that value to the query.
The following query calls the new f_greater function to query the SALES table and return either
COMMISSION or 20 percent of PRICEPAID, whichever is greater.
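A sketch of such a Python UDF and the query that calls it; the function body here is an assumed
implementation of "return the greater of two numbers," and the original guide's example may differ.
create function f_greater (a float, b float)
  returns float
stable
as $$
  if a > b:
    return a
  return b
$$ language plpythonu;

select f_greater(commission, pricepaid * 0.20)
from sales;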
Python UDF data types
During execution, Amazon Redshift converts the arguments from Amazon Redshift data types to
Python data types for processing. It then converts the return value from the Python data type to the
corresponding Amazon Redshift data type. For more information about Amazon Redshift data types, see
Data types (p. 478).
The following list maps Amazon Redshift data types to Python data types:
• smallint, integer, bigint, short, long map to int
• double, real map to float
• boolean maps to bool
• char, varchar map to string
• timestamp maps to datetime
ANYELEMENT data type
If a function uses multiple ANYELEMENT data types, they must all resolve to the same actual data type
when the function is called. All ANYELEMENT argument data types are set to the actual data type of the
first argument passed to an ANYELEMENT. For example, a function declared as f_equal(anyelement,
anyelement) will take any two input values, so long as they are of the same data type.
If the return value of a function is declared as ANYELEMENT, at least one input argument must be
ANYELEMENT. The actual data type for the return value is the same as the actual data type supplied for
the ANYELEMENT input argument.
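For example, a minimal sketch of such a function follows; the name and body are illustrative. Both arguments resolve to the data type of the first value passed at runtime.

create function f_equal (a anyelement, b anyelement)
returns boolean
immutable
as $$
    return a == b
$$ language plpythonu;

select f_equal(1, 2);           -- both arguments resolve to integer
select f_equal('abc', 'abc');   -- both arguments resolve to a character type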
Python language support
The Python Standard Library is available to Python UDFs, with the exception of the following modules:
• ScrolledText
• Tix
• Tkinter
• tk
• turtle
• smtpd
In addition to the Python Standard Library, the following modules are part of the Amazon Redshift
implementation:
• numpy 1.8.2
• pandas 0.14.1
• python-dateutil 2.2
• pytz 2014.7
• scipy 0.12.1
• six 1.3.0
• wsgiref 0.1.2
You can also import your own custom Python modules and make them available for use in UDFs by
executing a CREATE LIBRARY (p. 684) command. For more information, see Importing custom Python
library modules (p. 153).
Important
Amazon Redshift blocks all network access and write access to the file system through UDFs.
You cannot create a library that contains a module with the same name as a Python Standard Library
module or an Amazon Redshift preinstalled Python module. If an existing user-installed library uses the
same Python package as a library you create, you must drop the existing library before installing the new
library.
You must be a superuser or have USAGE ON LANGUAGE plpythonu privilege to install custom libraries;
however, any user with sufficient privileges to create functions can use the installed libraries. You can
query the PG_LIBRARY (p. 1471) system catalog to view information about the libraries installed on your
cluster.
In this example, let's suppose that you create UDFs to work with positions and distances in your data.
Connect to your Amazon Redshift cluster from a SQL client tool, and run the following commands to
create the functions.
CREATE FUNCTION f_distance (x1 float, y1 float, x2 float, y2 float) RETURNS float IMMUTABLE
as $$
    def distance(x1, y1, x2, y2):
        import math
        return math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)

    return distance(x1, y1, x2, y2)
$$ LANGUAGE plpythonu;

CREATE FUNCTION f_within_range (x1 float, y1 float, x2 float, y2 float) RETURNS bool
IMMUTABLE as $$
    def distance(x1, y1, x2, y2):
        import math
        return math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)

    # Two points are considered within range if they are less than 20 units apart.
    return distance(x1, y1, x2, y2) < 20
$$ LANGUAGE plpythonu;
Note that a few lines of code are duplicated in the previous functions. This duplication is necessary
because a UDF cannot reference the contents of another UDF, and both functions require the same
functionality. However, instead of duplicating code in multiple functions, you can create a custom library
and configure your functions to use it.
1. Create a folder named geometry. This folder is the top level package of the library.
2. In the geometry folder, create a file named __init__.py. Note that the file name contains two
double underscore characters. This file indicates to Python that the package can be initialized.
3. Also in the geometry folder, create a folder named trig. This folder is the subpackage of the library.
4. In the trig folder, create another file named __init__.py and a file named line.py. In this folder,
__init__.py indicates to Python that the subpackage can be initialized and that line.py is the file
that contains library code.
Your folder and file structure should be the same as the following:
geometry/
__init__.py
trig/
__init__.py
line.py
For more information about package structure, go to Modules in the Python tutorial on the Python
website.
5. The following code contains a class and member functions for the library. Copy and paste it into
line.py.
class LineSegment:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2

    def angle(self):
        import math
        return math.atan2(self.y2 - self.y1, self.x2 - self.x1)

    def distance(self):
        import math
        return math.sqrt((self.y2 - self.y1) ** 2 + (self.x2 - self.x1) ** 2)
After you have created the package, do the following to prepare the package and upload it to Amazon
S3.
1. Compress the contents of the geometry folder into a .zip file named geometry.zip. Do not include the
geometry folder itself; only include the contents of the folder as shown following:
geometry.zip
__init__.py
trig/
__init__.py
line.py
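2. Upload geometry.zip to an Amazon S3 bucket in the same AWS Region as your cluster, and then install the
library by running a CREATE LIBRARY (p. 684) command. The bucket name and IAM role ARN in the following
sketch are placeholders; adjust the authorization to match your setup.

create library geometry
language plpythonu
from 's3://my-bucket/geometry.zip'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role';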
After you install the library in your cluster, you need to configure your functions to use the library. To do
this, run the following commands.
CREATE OR REPLACE FUNCTION f_distance (x1 float, y1 float, x2 float, y2 float) RETURNS
float IMMUTABLE as $$
    from trig.line import LineSegment
    return LineSegment(x1, y1, x2, y2).distance()
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION f_within_range (x1 float, y1 float, x2 float, y2 float) RETURNS
bool IMMUTABLE as $$
    from trig.line import LineSegment
    return LineSegment(x1, y1, x2, y2).distance() < 20
$$ LANGUAGE plpythonu;
In the preceding commands, importing from trig.line eliminates the duplicated code from the original
functions in this section. You can reuse the functionality provided by this library in multiple UDFs. Note
that to import the module, you only need to specify the subpackage and module name (trig.line).
UDF constraints
Within the constraints listed in this topic, you can use UDFs anywhere you use the Amazon Redshift built-
in scalar functions. For more information, see SQL functions reference (p. 878).
• Python UDFs cannot access the network or read or write to the file system.
• The total size of user-installed Python libraries cannot exceed 100 MB.
• The number of Python UDFs that can run concurrently per cluster is limited to one-fourth of the
total concurrency level for the cluster. For example, if the cluster is configured with a concurrency of
15, a maximum of three UDFs can run concurrently. After the limit is reached, UDFs are queued for
execution within workload management queues. SQL UDFs don't have a concurrency limit. For more
information, see Implementing workload management (p. 411).
• When using Python UDFs, Amazon Redshift doesn't support the SUPER and HLLSKETCH data types.
Creating a scalar Lambda UDF
Lambda UDFs are defined and managed in Lambda, and you can control the access privileges to invoke
these UDFs in Amazon Redshift. You can invoke multiple Lambda functions in the same query or invoke
the same function multiple times.
Use Lambda UDFs in any clause of a SQL statement where scalar functions are supported. You can
also use Lambda UDFs in any SQL statement, such as SELECT, UPDATE, INSERT, or DELETE.
Note
Using Lambda UDFs can incur additional charges from the Lambda service. Whether it does
so depends on factors such as the number of Lambda requests (UDF invocations) and the
total duration of the Lambda program execution. However, there is no additional charge to
use Lambda UDFs in Amazon Redshift. For information about AWS Lambda pricing, see AWS
Lambda Pricing.
The number of Lambda requests varies depending on the specific SQL statement clause where
the Lambda UDF is used. For example, suppose the function is used in a WHERE clause such as
the following.
SELECT a, b FROM t1 WHERE lambda_multiply(a, b) = 64;
SELECT a, b FROM t1 WHERE a*b = lambda_multiply(2, 32);
In this case, for the first SELECT statement, Amazon Redshift invokes the Lambda function for
each row it evaluates. For the second SELECT statement, it invokes the Lambda function only
once, because the arguments are constant.
However, using a UDF in the projection part of the query might only invoke the Lambda
function once for every qualified or aggregated row in the result set. You can configure batching
of multiple invocations of your Lambda function to improve performance and lower costs.
For information about CREATE EXTERNAL FUNCTION, see CREATE EXTERNAL FUNCTION (p. 648).
The input and return data types for this function can be any standard Amazon Redshift data type.
Amazon Redshift ensures that the external function can send and receive batched arguments and results.
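A sketch of the command follows; the Lambda function name and IAM role ARN are placeholders, and the exfunc_sum signature is assumed to take two integers to match the usage example later in this section.

CREATE EXTERNAL FUNCTION exfunc_sum(int, int)
RETURNS int
VOLATILE
LAMBDA 'lambda-sum-function'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-lambda-role';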
The following example revokes usage on exfunc from PUBLIC and then grants usage to the user group
lambda_udf_devs.
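A sketch consistent with that description follows.

revoke usage on language exfunc from PUBLIC;
grant usage on language exfunc to group lambda_udf_devs;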
To run a Lambda UDF, make sure that you have permission for each function called. By default,
permission to run new Lambda UDFs is granted to PUBLIC. To restrict usage, revoke this permission from
PUBLIC for the function. Then, grant the privilege to specific users or groups.
The following example revokes execution on the function exfunc_sum from PUBLIC. Then, it grants
usage to the user group lambda_udf_devs.
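A sketch consistent with that description, assuming exfunc_sum takes two integer arguments:

revoke execute on function exfunc_sum(int, int) from PUBLIC;
grant execute on function exfunc_sum(int, int) to group lambda_udf_devs;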
For more information about granting and revoking privileges, see GRANT (p. 767) and
REVOKE (p. 791).
Configuring the authorization parameter for Lambda UDFs
If there is an existing IAM role with permissions to invoke Lambda functions attached to your cluster,
then you can substitute your role's Amazon Resource Name (ARN) in the IAM_ROLE parameter for
the command. The following sections describe the steps for using an IAM role in the CREATE EXTERNAL
FUNCTION command.
To grant the IAM role permission to invoke Lambda functions, do one of the following:
• Attach the AWSLambdaRole policy on the Attach permissions policy page while creating an IAM role.
The AWSLambdaRole policy grants permission to invoke Lambda functions, which is the minimal
requirement. For more information and other policies, see Identity-based IAM policies for AWS Lambda
in the AWS Lambda Developer Guide.
• Create your own custom policy to attach to your IAM role with the lambda:InvokeFunction
permission, scoped either to all resources or to a particular Lambda function by its ARN. For
more information on how to create a policy, see Creating IAM policies in the IAM User Guide.
The following example policy enables invoking Lambda on a particular Lambda function.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Invoke",
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:my-function"
        }
    ]
}
For more information on resources for Lambda functions, see Resources and conditions for Lambda
actions in the IAM API Reference.
After creating your custom policy with the required permissions, you can attach your policy to the IAM
role on the Attach permissions policy page while creating an IAM role.
For steps to create an IAM role, see Authorizing Amazon Redshift to access other AWS services on your
behalf in the Amazon Redshift Cluster Management Guide.
If you don't want to create a new IAM role, you can add the permissions mentioned previously to your
existing IAM role.
Then, reference the role in the IAM_ROLE parameter of the CREATE EXTERNAL FUNCTION command, as follows.
IAM_ROLE 'arn:aws:iam::aws-account-id:role/role-name'
To invoke Lambda functions which reside in other accounts within the same Region, see Chaining IAM
roles in Amazon Redshift.
Using the JSON interface between Amazon Redshift and Lambda
The following input fields are included in the JSON payload that Amazon Redshift sends to the designated
Lambda function.
external_function – The fully qualified name of the external function that makes the call. The value is a
valid fully qualified function name.
arguments – The data payload in the specified format. The data must be a JSON array. Each element is a
record, which is itself an array if the number of arguments is larger than 1. By using an array, Amazon
Redshift preserves the order of the records in the payload.
The order of the JSON array determines the order of batch processing. The Lambda function must
process the arguments iteratively and produce the exact number of records. The following is an example
of a payload.
{
    "request_id" : "23FF1F97-F28A-44AA-AB67-266ED976BF40",
    "cluster" : "arn:aws:redshift:xxxx",
    "user" : "adminuser",
    "database" : "db1",
    "external_function": "public.foo",
    "query_id" : 5678234,
    "num_records" : 4,
    "arguments" : [
        [ 1, 2 ],
        [ 3, null ],
        null,
        [ 4, 6 ]
    ]
}
The return output of the Lambda function contains the following fields.
{
    "success": true,    // true indicates the call succeeded
    "error_msg" : "my function isn't working",    // shall only exist when success != true
    "num_records": 4,    // number of records in this payload
    "results" : [
        1,
        4,
        null,
        7
    ]
}
When you call Lambda functions from SQL queries, Amazon Redshift ensures the security of the
connection with the following considerations:
• GRANT and REVOKE permissions. For more information about UDF security and privileges, see UDF
security and privileges (p. 149).
• Amazon Redshift only submits the minimum set of data to the designated Lambda function.
• Amazon Redshift only calls the designated Lambda function with the designated IAM role.
Naming UDFs
You can avoid potential conflicts and unexpected results by considering your UDF naming conventions
before implementation. Because function names can be overloaded, they can collide with existing and
future Amazon Redshift function names. This topic discusses overloading and presents a strategy for
avoiding conflict.
When you run a query, the query engine determines which function to call based on the number
of arguments you provide and the data types of the arguments. You can use overloading to
simulate functions with a variable number of arguments, up to the limit allowed by the CREATE
FUNCTION (p. 679) command.
We recommend naming all UDFs with the prefix f_. Amazon Redshift reserves the f_ prefix for UDF names,
so using it ensures that your UDF name won't conflict with any existing or future Amazon Redshift built-in
SQL function name. For example, if you name a new function f_fibonacci, you avoid conflict if Amazon
Redshift adds a function named FIBONACCI in a future release.
You can create a UDF with the same name and signature as an existing Amazon Redshift built-in SQL
function without the function name being overloaded if the UDF and the built-in function exist in
different schemas. Because built-in functions exist in the system catalog schema, pg_catalog, you
can create a UDF with the same name in another schema, such as public or a user-defined schema. In
some cases, you might call a function that is not explicitly qualified with a schema name. If so, Amazon
Redshift searches the pg_catalog schema first by default. Thus, the built-in function runs instead of a new
UDF with the same name.
You can change this behavior by setting the search path to place pg_catalog at the end. If you do so,
your UDFs take precedence over built-in functions, but the practice can cause unexpected results.
Adopting a unique naming strategy, such as using the reserved prefix f_, is a more reliable practice. For
more information, see SET (p. 837) and search_path (p. 1490).
Logging errors and warnings
During query execution, the log handler writes messages to the SVL_UDF_LOG system view, along with
the corresponding function name, node, and slice. The log handler writes one row to the SVL_UDF_LOG
per message, per slice. Messages are truncated to 4096 bytes. The UDF log is limited to 500 rows per
slice. When the log is full, the log handler discards older messages and adds a warning message to
SVL_UDF_LOG.
Note
The Amazon Redshift UDF log handler escapes newlines ( \n ), pipe ( | ) characters, and
backslash ( \ ) characters with a backslash ( \ ) character.
By default, the UDF log level is set to WARNING. Messages with a log level of WARNING, ERROR, and
CRITICAL are logged. Messages with lower severity INFO, DEBUG, and NOTSET are ignored. To set the
UDF log level, use the Python logger method. For example, the following sets the log level to INFO.
logger.setLevel(logging.INFO)
For more information about using the Python logging module, see Logging facility for Python in the
Python documentation.
The following example creates a function named f_pyerror that imports the Python logging module,
instantiates the logger, and logs an error.
CREATE OR REPLACE FUNCTION f_pyerror()
RETURNS INTEGER
VOLATILE AS
$$
    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.info('Your info message here')
    return 0
$$ language plpythonu;
The following example queries SVL_UDF_LOG to view the message logged in the previous example.
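A query along the following lines retrieves the logged messages; the columns shown here are representative, so check the SVL_UDF_LOG view for the full column list.

select funcname, node, slice, trim(message) as message
from svl_udf_log;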
Example uses of user-defined functions (UDFs)
For more examples of using UDFs, see the following resources:
• Accessing external components using Amazon Redshift Lambda UDFs – describes how Amazon
Redshift Lambda UDFs work and walks through creating a Lambda UDF.
• Translate and analyze text using SQL functions with Amazon Redshift, Amazon Translate, and Amazon
Comprehend – provides prebuilt Amazon Redshift Lambda UDFs that you can install with a few clicks
to translate, redact, and analyze text fields.
• Access Amazon Location Service from Amazon Redshift – describes how to use Amazon Redshift
Lambda UDFs to integrate with Amazon Location Service.
• Data Tokenization with Amazon Redshift and Protegrity – describes how to integrate Amazon Redshift
Lambda UDFs with the Protegrity Serverless product.
• Amazon Redshift UDFs – a collection of Amazon Redshift SQL, Lambda, and Python UDFs.
Stored procedure overview
Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL)
and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn't need
to return a value. You can use procedural language, including looping and conditional expressions, to
control logical flow.
For details about SQL commands to create and manage stored procedures, see the following command
topics:
Topics
• Overview of stored procedures in Amazon Redshift (p. 163)
• PL/pgSQL language reference (p. 176)
For fine-grained access control, you can create stored procedures to perform functions without giving
a user access to the underlying tables. For example, only the owner or a superuser can truncate a table,
and a user needs write permission to insert data into a table. Instead of granting a user permissions on
the underlying tables, you can create a stored procedure that performs the task. You then give the user
permission to run the stored procedure.
A stored procedure with the DEFINER security attribute runs with the privileges of the stored procedure's
owner. By default, a stored procedure has INVOKER security, which means the procedure uses the
permissions of the user that calls the procedure.
To create a stored procedure, use the CREATE PROCEDURE (p. 702) command. To run a procedure, use
the CALL (p. 569) command. Examples follow later in this section.
Note
Some clients might throw the following error when creating an Amazon Redshift stored
procedure.
This error occurs when the client can't correctly parse the CREATE PROCEDURE statement, which
contains semicolons delimiting statements and dollar sign ($) quoting. As a result, only part of the
statement is sent to the Amazon Redshift server. You can often work around this error by using the
Run as batch or Execute selected option of the client.
For example, when using an Aginity client, use the Run entire script as batch option.
When you use SQL Workbench/J, we recommend version 124. When you use SQL Workbench/J
version 125, consider specifying an alternate delimiter as a workaround.
CREATE PROCEDURE contains SQL statements delimited with a semicolon (;). Defining an
alternate delimiter such as a slash (/) and placing it at the end of the CREATE PROCEDURE
statement sends the statement to the Amazon Redshift server for processing. Following is an
example.
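The following sketch shows the idea; the procedure name and body are illustrative, and the trailing slash is the alternate delimiter.

CREATE OR REPLACE PROCEDURE test_sp1(f1 int, f2 varchar)
AS $$
BEGIN
    RAISE INFO 'f1 = %, f2 = %', f1, f2;
END;
$$ LANGUAGE plpgsql
/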
For more information, see Alternate delimiter in the SQL Workbench/J documentation. Or use a
client with better support for parsing CREATE PROCEDURE statements, such as the query editor
in the Amazon Redshift console or TablePlus.
Topics
• Naming stored procedures (p. 165)
• Security and privileges for stored procedures (p. 166)
• Returning a result set (p. 167)
• Managing transactions (p. 168)
• Trapping errors (p. 173)
• Logging stored procedures (p. 174)
• Limits and differences for stored procedure support (p. 175)
The following example shows a procedure with no output arguments. By default, arguments are input
(IN) arguments.
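A call consistent with the output shown following, assuming a procedure such as the test_sp1 sketch shown earlier:

call test_sp1(5, 'abc');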
INFO: f1 = 5, f2 = abc
CALL
The following example shows a procedure with output arguments. Arguments are input (IN), input and
output (INOUT), and output (OUT).
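The procedure body isn't shown here; a minimal sketch consistent with the call and output that follow might look like the following (the column names in your output can differ).

CREATE OR REPLACE PROCEDURE test_sp2(f1 IN int, f2 INOUT varchar(256), out_var OUT int)
AS $$
BEGIN
    FOR loop_var IN 1..f1 LOOP
        f2 := f2 || '+' || f2;   -- double the string on each iteration
        out_var := loop_var;     -- report the last iteration number
    END LOOP;
END;
$$ LANGUAGE plpgsql;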
call test_sp2(2,'2019');
f2 | column2
---------------------+---------
2019+2019+2019+2019 | 2
(1 row)
The owner or a superuser can replace the body of a stored procedure with a new one with the same
signature. To change the signature or return types of a stored procedure, drop the stored procedure and
recreate it. For more information, see DROP PROCEDURE (p. 751) and CREATE PROCEDURE (p. 702).
Naming stored procedures
You can avoid potential conflicts and unexpected results by considering your naming conventions for
stored procedures before implementing them. Because you can overload procedure names, they can
collide with existing and future Amazon Redshift procedure names.
When you run a procedure, the query engine determines which procedure to call based on the number of
arguments that you provide and the data types of the arguments. You can use overloading to simulate
procedures with a variable number of arguments, up to the limit allowed by the CREATE PROCEDURE
command. For more information, see CREATE PROCEDURE (p. 702).
Security and privileges for stored procedures
To call a procedure, you must be granted EXECUTE permission on the procedure. By default, EXECUTE
permission for new procedures is granted to the procedure owner and superusers. For more information,
see GRANT (p. 767).
The user creating a procedure is the owner by default. The owner has CREATE, DROP, and EXECUTE
privileges on the procedure by default. Superusers have all privileges.
The SECURITY attribute controls a procedure's privileges to access database objects. When you create
a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify
SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify
SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the
default.
Because a SECURITY DEFINER procedure runs with the privileges of the user that owns it, take care to
ensure that the procedure can't be misused. To ensure that SECURITY DEFINER procedures can't be
misused, do the following:
• Grant EXECUTE on SECURITY DEFINER procedures to specific users, and not to PUBLIC.
• Qualify all database objects that the procedure needs to access with the schema names. For example,
use myschema.mytable instead of just mytable.
• If you can't qualify an object name by its schema, set search_path when creating the procedure by
using the SET option. Set search_path to exclude any schemas that are writable by untrusted users.
This approach prevents any callers of this procedure from creating objects (for example, tables or
views) that mask objects intended to be used by the procedure. For more information about the SET
option, see CREATE PROCEDURE (p. 702).
The following example sets search_path to admin to ensure that the user_creds table is accessed
from the admin schema and not from public or any other schema in the caller's search_path.
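A sketch of such a procedure follows; the procedure name and the columns of the user_creds table are illustrative.

CREATE OR REPLACE PROCEDURE sp_get_credentials(acct_id int, creds OUT varchar)
AS $$
BEGIN
    SELECT creds_value INTO creds
    FROM user_creds
    WHERE account_id = acct_id;
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER
-- Resolve user_creds to admin.user_creds rather than a table in the caller's search_path.
SET search_path = admin;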
Returning a result set
Returning a cursor
To return a cursor, create a procedure with an INOUT argument defined with a refcursor data type.
When you call the procedure, give the cursor a name, then you can fetch the results from the cursor by
name.
The following example creates a procedure named get_result_set with an INOUT argument named
rs_out using the refcursor data type. The procedure opens the cursor using a SELECT statement.
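A sketch follows; the table name fact_tbl is illustrative.

CREATE OR REPLACE PROCEDURE get_result_set(param IN integer, rs_out INOUT refcursor)
AS $$
BEGIN
    OPEN rs_out FOR SELECT * FROM fact_tbl WHERE id >= param;
END;
$$ LANGUAGE plpgsql;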
The following CALL command opens the cursor with the name mycursor. Use cursors only within
transactions.
BEGIN;
CALL get_result_set(1, 'mycursor');
After the cursor is opened, you can fetch from the cursor, as the following example shows.
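For example, the following statement fetches all remaining rows from the named cursor.

FETCH ALL FROM mycursor;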
id | secondary_id | name
-------+--------------+---------
1 | 1 | Joe
1 | 2 | Ed
2 | 1 | Mary
1 | 3 | Mike
(4 rows)
COMMIT;
A cursor returned by a stored procedure is subject to the same constraints and performance
considerations as described in DECLARE CURSOR. For more information, see Cursor constraints (p. 740).
The following example shows the calling of the get_result_set stored procedure using a
refcursor data type from JDBC. The literal 'mycursor' (the name of the cursor) is passed to the
prepareStatement. Then the results are fetched from the ResultSet.
Managing transactions
The default automatic commit behavior causes each SQL command that runs separately to commit
individually. A call to a stored procedure is treated as a single SQL command. The SQL statements inside
a procedure behave as if they are in a transaction block that implicitly begins when the call starts and
ends when the call finishes. A nested call to another procedure is treated like any other SQL statement
and operates within the context of the same transaction as the caller. For more information about
automatic commit behavior, see Serializable isolation (p. 117).
However, suppose that you call a stored procedure from within a user specified transaction block
(defined by BEGIN...COMMIT). In this case, all statements in the stored procedure run in the context of
the user-specified transaction. The procedure doesn't commit implicitly on exit. The caller controls the
procedure commit or rollback.
If any error is encountered while running a stored procedure, all changes made in the current transaction
are rolled back.
You can use the following transaction control statements in a stored procedure:
• COMMIT – commits all work done in the current transaction and implicitly begins a new transaction.
For more information, see COMMIT (p. 575).
• ROLLBACK – rolls back the work done in the current transaction and implicitly begins a new
transaction. For more information, see ROLLBACK (p. 800).
TRUNCATE is another statement that you can issue from within a stored procedure and influences
transaction management. In Amazon Redshift, TRUNCATE issues a commit implicitly. This behavior stays
the same in the context of stored procedures. When a TRUNCATE statement is issued from within a
stored procedure, it commits the current transaction and begins a new one. For more information, see
TRUNCATE (p. 849).
All statements that follow a COMMIT, ROLLBACK, or TRUNCATE statement run in the context of a new
transaction. They do so until a COMMIT, ROLLBACK, or TRUNCATE statement is encountered or the
stored procedure exits.
When you use a COMMIT, ROLLBACK, or TRUNCATE statement from within a stored procedure, the
following constraints apply:
• If the stored procedure is called from within a transaction block, it can't issue a COMMIT, ROLLBACK, or
TRUNCATE statement. This restriction applies within the stored procedure's own body and within any
nested procedure call.
• If the stored procedure is created with SET config options, it can't issue a COMMIT, ROLLBACK, or
TRUNCATE statement. This restriction applies within the stored procedure's own body and within any
nested procedure call.
• Any cursor that is open (explicitly or implicitly) is closed automatically when a COMMIT, ROLLBACK,
or TRUNCATE statement is processed. For constraints on explicit and implicit cursors, see Limits and
differences for stored procedure support (p. 175).
Additionally, you can't run COMMIT or ROLLBACK using dynamic SQL. However, you can run TRUNCATE
using dynamic SQL. For more information, see Dynamic SQL (p. 182).
When working with stored procedures, consider that the BEGIN and END statements in PL/pgSQL are
only for grouping. They don't start or end a transaction. For more information, see Block (p. 177).
The following example demonstrates transaction behavior when calling a stored procedure from within
an explicit transaction block. The two insert statements issued from outside the stored procedure and
the one from within it are all part of the same transaction (3382). The transaction is committed when the
user issues the explicit commit.
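The procedure itself isn't shown in this example; a minimal sketch consistent with the description is the following.

CREATE OR REPLACE PROCEDURE sp_insert_table_a(a int)
AS $$
BEGIN
    INSERT INTO test_table_a VALUES (a);
END;
$$ LANGUAGE plpgsql;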
Begin;
insert into test_table_a values (1);
Call sp_insert_table_a(2);
insert into test_table_a values (3);
Commit;
In contrast, take an example when the same statements are issued from outside of an explicit transaction
block and the session has autocommit set to ON. In this case, each statement runs in its own transaction.
The following example issues a TRUNCATE statement after inserting into test_table_a. The
TRUNCATE statement issues an implicit commit that commits the current transaction (3335) and starts a
new one (3336). The new transaction is committed when the procedure exits.
Call sp_truncate_proc(1,2);
The following example issues a TRUNCATE from a nested call. The TRUNCATE commits all work done so
far in the outer and inner procedures in a transaction (3344). It starts a new transaction (3345). The new
transaction is committed when the outer procedure exits.
CREATE OR REPLACE PROCEDURE sp_outer(a int, b int, c int, d int) LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO outer_table values (a);
    Call sp_inner(c, d);
    INSERT INTO outer_table values (b);
END;
$$;
The following example shows that cursor cur1 was closed when the TRUNCATE statement committed.
call sp_open_cursor_truncate();
ERROR: cursor "cur1" does not exist
CONTEXT: PL/pgSQL function "sp_open_cursor_truncate" line 8 at fetch
The following example issues a TRUNCATE statement and can't be called from within an explicit
transaction block.
Begin;
Call sp_truncate_atomic();
ERROR: TRUNCATE cannot be invoked from a procedure that is executing in an atomic context.
HINT: Try calling the procedure as a top-level call i.e. not from within an explicit
transaction block.
Or, if this procedure (or one of its ancestors in the call chain) was created with SET
config options, recreate the procedure without them.
CONTEXT: SQL statement "TRUNCATE test_table_b"
PL/pgSQL function "sp_truncate_atomic" line 2 at SQL statement
The following example shows that a user who is not a superuser or the owner of a table can issue a
TRUNCATE statement on the table. The user does this using a Security Definer stored procedure.
The example shows the following actions:
The following example issues COMMIT twice. The first COMMIT commits all work done in transaction
10363 and implicitly starts transaction 10364. Transaction 10364 is committed by the second COMMIT
statement.
call sp_commit(1,2);
The following example issues a ROLLBACK statement if sum_vals is greater than 2. The first ROLLBACK
statement rolls back all the work done in transaction 10377 and starts a new transaction 10378.
Transaction 10378 is committed when the procedure exits.
Trapping errors
When an error occurs in a stored procedure, no additional code is run and the transaction is rolled back.
But you can handle errors using an EXCEPTION block.
[ <<label>> ]
[ DECLARE
    declarations ]
BEGIN
    statements
EXCEPTION
    WHEN OTHERS THEN
        statements
END;
When an exception occurs and you have added an exception-handling block, you can write RAISE statements
and most other PL/pgSQL statements in that block. For example, you can raise an exception with a custom
message or insert a record into a logging table.
When entering the exception-handling block, the current transaction is rolled back and a new transaction
is created to run the statements in the block. If the statements in the block run without error, the
transaction is committed and the exception is re-thrown. Lastly, the stored procedure exits.
The only supported condition in an exception block is OTHERS, which matches every error type except
query cancellation. Also, if an error occurs in an exception-handling block, it can be caught by an outer
exception-handling block.
Example
The following example shows how to write statements in the exception-handling block.
CALL update_employee_sp();
firstname | lastname
-----------+----------
Tomas | Smith
message
------------------------------------------------
Error message: column "invalid" does not exist
Logging stored procedures
Details about stored procedures are logged in the following system tables and views:
• SVL_STORED_PROC_CALL – details are logged about the stored procedure call's start time
and end time, and whether the call is ended before completion. For more information, see
SVL_STORED_PROC_CALL (p. 1408).
• SVL_STORED_PROC_MESSAGES – messages in stored procedures emitted by the RAISE
query are logged with the corresponding logging level. For more information, see
SVL_STORED_PROC_MESSAGES (p. 1409).
• SVL_QLOG – the query ID of the procedure call is logged for each query called from a stored
procedure. For more information, see SVL_QLOG (p. 1385).
• STL_UTILITYTEXT – stored procedure calls are logged after they are completed. For more information,
see STL_UTILITYTEXT (p. 1348).
• PG_PROC_INFO – this system catalog view shows information about stored procedures. For more
information, see PG_PROC_INFO (p. 1472).
Limits and differences for stored procedure support
• Amazon Redshift doesn't support subtransactions, and hence has limited support for exception
handling blocks.
Limits
The following are limits on stored procedures in Amazon Redshift:
• The registerOutParameter method call through the Java Database Connectivity (JDBC) driver
isn't supported for the refcursor data type. For an example of using the refcursor data type, see
Returning a result set (p. 167).
PL/pgSQL language reference
Topics
• PL/pgSQL reference conventions (p. 176)
• Structure of PL/pgSQL (p. 176)
• Supported PL/pgSQL statements (p. 180)
PL/pgSQL reference conventions
Character       Description
{}              Braces indicate that you are required to choose one of the arguments inside the braces.
red italics     Words in red italics indicate placeholders. Insert the appropriate value in place of the word in red italics.
...             An ellipsis indicates that you can repeat the preceding element.
'               Words in single quotation marks indicate that you must type the quotes.
Structure of PL/pgSQL
PL/pgSQL is a procedural language with many of the same constructs as other procedural languages.
Topics
• Block (p. 177)
• Variable declaration (p. 178)
• Alias declaration (p. 178)
• Built-in variables (p. 178)
Block
PL/pgSQL is a block-structured language. The complete body of a procedure is defined in a block, which
contains variable declarations and PL/pgSQL statements. A statement can also be a nested block, or
subblock.
End declarations and statements with a semicolon. Follow the END keyword in a block or subblock with a
semicolon. Don't use semicolons after the keywords DECLARE and BEGIN.
You can write all keywords and identifiers in mixed uppercase and lowercase. Identifiers are implicitly
converted to lowercase unless enclosed in double quotation marks.
A double hyphen (--) starts a comment that extends to the end of the line. A /* starts a block comment
that extends to the next occurrence of */. You can't nest block comments. However, you can enclose
double-hyphen comments in a block comment, and a double hyphen can hide the block comment
delimiters /* and */.
Any statement in the statement section of a block can be a subblock. You can use subblocks for logical
grouping or to localize variables to a small group of statements.
[ <<label>> ]
[ DECLARE
    declarations ]
BEGIN
    statements
END [ label ];
The variables declared in the declarations section preceding a block are initialized to their default values
every time the block is entered. In other words, they're not initialized only once per function call.
Use a label to identify the block to use in an EXIT statement or to qualify the names of the variables
declared in the block.
Don't confuse the use of BEGIN/END for grouping statements in PL/pgSQL with the database commands
for transaction control. The BEGIN and END in PL/pgSQL are only for grouping. They don't start or end a
transaction.
Variable declaration
Declare all variables in a block, with the exception of loop variables, in the block's DECLARE
section. Variables can use any valid Amazon Redshift data type. For supported data types, see Data
types (p. 478).
PL/pgSQL variables can be any Amazon Redshift supported data type, plus RECORD and refcursor. For
more information about RECORD, see Record types (p. 180). For more information about refcursor,
see Cursors (p. 188).
DECLARE
    name [ CONSTANT ] type [ NOT NULL ] [ { DEFAULT | := } expression ];
    customerID integer;
    numberofitems numeric(6);
    link varchar;
    onerow RECORD;
The loop variable of a FOR loop iterating over a range of integers is automatically declared as an integer
variable.
The DEFAULT clause, if given, specifies the initial value assigned to the variable when the block is
entered. If the DEFAULT clause is not given, then the variable is initialized to the SQL NULL value. The
CONSTANT option prevents the variable from being assigned to, so that its value remains constant for
the duration of the block. If NOT NULL is specified, an assignment of a null value results in a runtime
error. All variables declared as NOT NULL must have a non-null default value specified.
The default value is evaluated every time the block is entered. So, for example, assigning now() to a
variable of type timestamp causes the variable to have the time of the current function call, not the
time when the function was precompiled.
The refcursor data type is the data type of cursor variables within stored procedures. A refcursor
value can be returned from within a stored procedure. For more information, see Returning a result
set (p. 167).
Alias declaration
If a stored procedure's signature omits the argument name, you can declare an alias for the argument.
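The alias takes the form name ALIAS FOR $n, where $n refers to the nth argument by position. A minimal sketch follows; the procedure name is illustrative.

CREATE OR REPLACE PROCEDURE sp_alias_demo(int)
AS $$
DECLARE
    num_val ALIAS FOR $1;   -- alias for the first (unnamed) argument
BEGIN
    RAISE INFO 'value = %', num_val;
END;
$$ LANGUAGE plpgsql;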
Built-in variables
The following built-in variables are supported:
• FOUND
• SQLSTATE
• SQLERRM
• GET DIAGNOSTICS integer_var := ROW_COUNT;
FOUND is a special variable of type Boolean. FOUND starts out false within each procedure call. FOUND
is set by the following types of statements:
• SELECT INTO
Sets FOUND to true if at least one row is affected, false if no row is affected.
• FETCH
Sets FOUND to true if it returns a row, and false if no row is returned.
• FOR statement
Sets FOUND to true if the FOR statement iterates one or more times, and otherwise false. This applies
to all three variants of the FOR statement: integer FOR loops, record-set FOR loops, and dynamic
record-set FOR loops.
FOUND is set when the FOR loop exits. Inside the execution of the loop, FOUND isn't modified by the
FOR statement. However, it can be changed by the execution of other statements within the loop
body.
Within an exception handler, the special variable SQLSTATE contains the error code that corresponds to
the exception that was raised. The special variable SQLERRM contains the error message associated with
the exception. These variables are undefined outside exception handlers and throw an error if used.
ROW_COUNT is used with the GET DIAGNOSTICS command. It shows the number of rows processed by
the last SQL command sent down to the SQL engine.
Record types
A RECORD type is not a true data type, only a placeholder. Record type variables assume the actual row
structure of the row that they are assigned during a SELECT or FOR command. The substructure of a
record variable can change each time it is assigned a value. Until a record variable is first assigned to, it
has no substructure. Any attempt to access a field in it throws a runtime error.
name RECORD;