Amazon Redshift: Database Developer Guide

Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents
System overview ................................................................................................................................ 1
Are you a first-time Amazon Redshift user? ................................................................................... 1
Are you a database developer? .................................................................................................... 2
Prerequisites .............................................................................................................................. 3
System and architecture overview ................................................................................................ 3
Data warehouse system architecture ..................................................................................... 3
Performance ...................................................................................................................... 5
Columnar storage ............................................................................................................... 7
Workload management ....................................................................................................... 9
Using Amazon Redshift with other services ........................................................................... 9
Best practices .................................................................................................................................. 11
Conducting a proof of concept .................................................................................................. 11
Overview of the process .................................................................................................... 11
Identify the business goals and success criteria .................................................................... 12
Set up your proof of concept ............................................................................................. 12
Checklist for a complete evaluation .................................................................................... 13
Develop a project plan for your evaluation .......................................................................... 14
Additional resources to help your evaluation ........................................................................ 15
Need help? ...................................................................................................................... 16
Best practices for designing tables ............................................................................................. 16
Choose the best sort key ................................................................................................... 16
Choose the best distribution style ...................................................................................... 17
Use automatic compression ............................................................................................... 18
Define constraints ............................................................................................................ 18
Use the smallest possible column size ................................................................................. 18
Use date/time data types for date columns ......................................................................... 19
Best practices for loading data ................................................................................................... 19
Take the loading data tutorial ............................................................................................ 19
Use a COPY command to load data .................................................................................... 19
Use a single COPY command ............................................................................................. 19
Split your load data .......................................................................................................... 20
Compress your data files ................................................................................................... 20
Verify data files before and after a load .............................................................................. 20
Use a multi-row insert ...................................................................................................... 20
Use a bulk insert .............................................................................................................. 21
Load data in sort key order ............................................................................................... 21
Load data in sequential blocks ........................................................................................... 21
Use time-series tables ....................................................................................................... 21
Use a staging table to perform a merge .............................................................................. 22
Schedule around maintenance windows .............................................................................. 22
Best practices for designing queries ............................................................................................ 22
Working with Advisor ................................................................................................................ 24
Amazon Redshift Regions .................................................................................................. 24
Access Advisor ................................................................................................................. 25
Advisor recommendations ................................................................................................. 26
Tutorials .......................................................................................................................................... 36
Working with automatic table optimization ......................................................................................... 37
Enabling automatic table optimization ........................................................................................ 37
Removing automatic table optimization ...................................................................................... 38
Monitoring actions of automatic table optimization ...................................................................... 38
Working with column compression ............................................................................................. 39
Compression encodings ..................................................................................................... 40
Testing compression encodings .......................................................................................... 46
Example: Choosing compression encodings for the CUSTOMER table ....................................... 48
Working with data distribution styles .......................................................................... 50
Data distribution concepts ................................................................................................. 51
Distribution styles ............................................................................................................ 51
Viewing distribution styles ................................................................................................. 52
Evaluating query patterns ................................................................................................. 53
Designating distribution styles ........................................................................................... 54
Evaluating the query plan ................................................................................................. 55
Query plan example ......................................................................................................... 56
Distribution examples ....................................................................................................... 60
Working with sort keys ............................................................................................................. 62
Compound sort key .......................................................................................................... 63
Interleaved sort key .......................................................................................................... 63
Defining table constraints ......................................................................................................... 64
Loading data ................................................................................................................................... 65
Using COPY to load data ........................................................................................................... 65
Credentials and access permissions ..................................................................................... 66
Preparing your input data ................................................................................................. 67
Loading data from Amazon S3 ........................................................................................... 68
Loading data from Amazon EMR ........................................................................................ 76
Loading data from remote hosts ........................................................................................ 80
Loading from Amazon DynamoDB ...................................................................................... 86
Verifying that the data loaded correctly .............................................................................. 88
Validating input data ........................................................................................................ 88
Automatic compression ..................................................................................................... 89
Optimizing for narrow tables ............................................................................................. 91
Default values .................................................................................................................. 91
Troubleshooting ............................................................................................................... 91
Updating with DML .................................................................................................................. 96
Updating and inserting ............................................................................................................. 96
Merge method 1: Replacing existing rows ............................................................................ 97
Merge method 2: Specifying a column list ........................................................................... 97
Creating a temporary staging table .................................................................................... 97
Performing a merge operation by replacing existing rows ...................................................... 98
Performing a merge operation by specifying a column list ..................................................... 98
Merge examples ............................................................................................................. 100
Performing a deep copy .......................................................................................................... 102
Analyzing tables ..................................................................................................................... 103
Automatic analyze .......................................................................................................... 104
Analysis of new table data .............................................................................................. 104
ANALYZE command history .............................................................................................. 107
Vacuuming tables ................................................................................................................... 108
Automatic table sort ....................................................................................................... 108
Automatic vacuum delete ................................................................................................ 109
VACUUM frequency ......................................................................................................... 109
Sort stage and merge stage ............................................................................................. 110
Vacuum threshold ........................................................................................................... 110
Vacuum types ................................................................................................................ 110
Managing vacuum times .................................................................................................. 110
Managing concurrent write operations ...................................................................................... 116
Serializable isolation ....................................................................................................... 117
Write and read/write operations ....................................................................................... 120
Concurrent write examples .............................................................................................. 121
Tutorial: Loading data from Amazon S3 .................................................................................... 122
Prerequisites .................................................................................................................. 122
Overview ....................................................................................................................... 123
Steps ............................................................................................................................ 123
Step 1: Create a cluster ................................................................................................... 123
Step 2: Download the data files ....................................................................... 125
Step 3: Upload the files to an Amazon S3 bucket ............................................................... 125
Step 4: Create the sample tables ...................................................................................... 127
Step 5: Run the COPY commands ..................................................................................... 129
Step 6: Vacuum and analyze the database ......................................................................... 141
Step 7: Clean up your resources ....................................................................................... 141
Summary ....................................................................................................................... 142
Unloading data .............................................................................................................................. 143
Unloading data to Amazon S3 ................................................................................................. 143
Unloading encrypted data files ................................................................................................ 145
Unloading data in delimited or fixed-width format ..................................................................... 147
Reloading unloaded data ......................................................................................................... 148
Creating user-defined functions ........................................................................................................ 149
UDF security and privileges ...................................................................................................... 149
Creating a scalar SQL UDF ....................................................................................................... 150
Scalar SQL function example ........................................................................................... 150
Creating a scalar Python UDF .................................................................................................. 151
Scalar Python UDF example ............................................................................................. 151
Python UDF data types ................................................................................................... 151
ANYELEMENT data type .................................................................................................. 152
Python language support ................................................................................................ 152
UDF constraints .............................................................................................................. 155
Creating a scalar Lambda UDF ................................................................................................. 156
Registering a Lambda UDF .............................................................................................. 156
Managing Lambda UDF security and privileges ................................................................... 156
Configuring the authorization parameter for Lambda UDFs .................................................. 157
Using the JSON interface between Amazon Redshift and Lambda ......................................... 156
Naming UDFs ................................................................................................................. 160
Logging errors and warnings ............................................................................................ 161
Example uses of UDFs ............................................................................................................. 162
Creating stored procedures .............................................................................................................. 163
Stored procedure overview ...................................................................................................... 163
Naming stored procedures ............................................................................................... 165
Security and privileges .................................................................................................... 166
Returning a result set ..................................................................................................... 167
Managing transactions .................................................................................................... 168
Trapping errors .............................................................................................................. 173
Logging stored procedures .............................................................................................. 174
Limits and differences ..................................................................................................... 175
PL/pgSQL language reference .................................................................................................. 176
PL/pgSQL reference conventions ...................................................................................... 176
Structure of PL/pgSQL .................................................................................................... 176
Supported PL/pgSQL statements ..................................................................................... 180
Creating materialized views ............................................................................................................. 192
Querying a materialized view ................................................................................................... 194
Automatic query rewriting to use materialized views ................................................................... 195
Usage notes ................................................................................................................... 195
Limitations ..................................................................................................................... 195
Refreshing a materialized view ................................................................................................. 196
Autorefreshing a materialized view ................................................................................... 198
Automated materialized views (preview) .................................................................................... 198
Streaming ingestion (preview) .................................................................................................. 199
Data flow ...................................................................................................................... 200
Streaming ingestion use cases .......................................................................................... 200
Getting started with streaming ingestion ........................................................................... 200
Limitations ..................................................................................................................... 201
Electric vehicle station-data tutorial (preview) .................................................................... 202
Querying spatial data ..................................................................................................................... 205
Tutorial: Using spatial SQL functions ......................................................................................... 207
Prerequisites .................................................................................................................. 207
Step 1: Create tables and load test data ............................................................................ 208
Step 2: Query spatial data ............................................................................................... 210
Step 3: Clean up your resources ....................................................................................... 212
Loading a shapefile ................................................................................................................ 213
Terminology ........................................................................................................................... 214
Bounding box ................................................................................................................. 214
Geometric validity .......................................................................................................... 214
Geometric simplicity ....................................................................................................... 216
Limitations ............................................................................................................................. 217
Querying data with federated queries ............................................................................................... 218
Getting started with using federated queries to PostgreSQL ........................................................ 218
Getting started using federated queries to PostgreSQL with CloudFormation .................................. 219
Launching a CloudFormation stack for Redshift federated queries ......................................... 220
Querying data from the external schema ........................................................................... 221
Getting started with using federated queries to MySQL ............................................................... 221
Creating a secret and an IAM role ............................................................................................. 222
Prerequisites .................................................................................................................. 222
Examples of using a federated query ........................................................................................ 224
Example of using a federated query with PostgreSQL ......................................................... 224
Example of using a mixed-case name ................................................................................ 225
Example of using a federated query with MySQL ................................................................ 227
Data type differences .............................................................................................................. 227
Limitations and considerations ................................................................................................. 230
Querying external data using Amazon Redshift Spectrum .................................................................... 232
Amazon Redshift Spectrum overview ........................................................................................ 232
Amazon Redshift Spectrum Regions .................................................................................. 233
Amazon Redshift Spectrum considerations ......................................................................... 233
Getting started with Amazon Redshift Spectrum ........................................................................ 234
Prerequisites .................................................................................................................. 234
CloudFormation .............................................................................................................. 234
Getting started with Redshift Spectrum step by step .......................................................... 234
Step 1. Create an IAM role ............................................................................................... 234
Step 2: Associate the IAM role with your cluster ................................................................. 237
Step 3: Create an external schema and an external table ..................................................... 238
Step 4: Query your data in Amazon S3 ............................................................................. 238
Launch your CloudFormation stack and then query your data ............................................... 241
IAM policies for Amazon Redshift Spectrum ............................................................................... 243
Amazon S3 permissions ................................................................................................... 244
Cross-account Amazon S3 permissions .............................................................................. 244
Grant or restrict access using Redshift Spectrum ................................................................ 245
Minimum permissions ..................................................................................................... 245
Chaining IAM roles .......................................................................................................... 246
Accessing AWS Glue data ................................................................................................ 247
Using Redshift Spectrum with Lake Formation ........................................................................... 253
Creating data files for queries in Amazon Redshift Spectrum ........................................................ 254
Data formats for Redshift Spectrum ................................................................................. 254
Compression types for Redshift Spectrum ......................................................................... 255
Encryption for Redshift Spectrum ..................................................................................... 255
Creating external schemas ....................................................................................................... 256
Working with external catalogs ........................................................................................ 257
Creating external tables .......................................................................................................... 261
Pseudocolumns .............................................................................................................. 262
Partitioning Redshift Spectrum external tables ................................................................... 263
Mapping to ORC columns ................................................................................................ 267
Creating external tables for Hudi-managed data ................................................................. 269
Creating external tables for Delta Lake data ...................................................................... 270
Improving Amazon Redshift Spectrum query performance ........................................................... 271
Setting data handling options .................................................................................................. 273
Monitoring metrics ................................................................................................................. 274
Troubleshooting queries .......................................................................................................... 274
Retries exceeded ............................................................................................................. 275
Access throttled ............................................................................................................. 275
Resource limit exceeded .................................................................................................. 276
No rows returned for a partitioned table ........................................................................... 276
Not authorized error ....................................................................................................... 276
Incompatible data formats ............................................................................................... 276
Syntax error when using Hive DDL in Amazon Redshift ........................................................ 277
Permission to create temporary tables .............................................................................. 277
Invalid range .................................................................................................................. 277
Invalid Parquet version number ........................................................................................ 278
Tutorial: Querying nested data with Amazon Redshift Spectrum ................................................... 278
Overview ....................................................................................................................... 278
Step 1: Create an external table that contains nested data ................................................... 279
Step 2: Query your nested data in Amazon S3 with SQL extensions ....................................... 280
Nested data use cases ..................................................................................................... 283
Nested data limitations ................................................................................................... 285
Serializing complex nested JSON ...................................................................................... 285
Using HyperLogLog sketches in Amazon Redshift ............................................................................... 288
Considerations ........................................................................................................................ 288
Limitations ............................................................................................................................. 289
Examples ............................................................................................................................... 289
Example: Return cardinality in a subquery ......................................................................... 289
Example: Return an HLLSKETCH type from combined sketches in a subquery .......................... 290
Example: Return a HyperLogLog sketch from combining multiple sketches ............................. 290
Example: Generate HyperLogLog sketches over S3 data using external tables ......................... 291
Querying data across databases ....................................................................................................... 293
Considerations ........................................................................................................................ 294
Limitations ............................................................................................................................. 294
Examples of using a cross-database query ................................................................................. 295
Using cross-database queries with the query editor .................................................................... 298
Sharing data across clusters ............................................................................................................. 300
Data sharing overview ............................................................................................................. 300
Data sharing use cases .................................................................................................... 300
Data sharing concepts ..................................................................................................... 301
Sharing data at different levels ........................................................................................ 302
Managing data consistency .............................................................................................. 302
Accessing shared data ..................................................................................................... 302
Data sharing considerations in Amazon Redshift ................................................................. 303
How data sharing works .................................................................................................. 303
Controlling access for cross-account datashares .................................................................. 304
Working with views in Amazon Redshift data sharing .................................................................. 307
Managing the data sharing lifecycle .......................................................................................... 308
Managing permissions for datashares ........................................................................................ 309
Tracking usage and auditing in data sharing .............................................................................. 310
Cluster management and data sharing ...................................................................................... 310
Integrating Amazon Redshift data sharing with business intelligence tools ..................................... 311
Accessing metadata for datashares ........................................................................................... 312
Working with AWS Data Exchange for Amazon Redshift .............................................................. 312
How AWS Data Exchange datashares work ......................................................................... 313
Considerations when using AWS Data Exchange for Amazon Redshift .................................... 313
Getting started data sharing .................................................................................................... 314
Getting started data sharing using the SQL interface .......................................................... 314
Getting started data sharing using the console ................................................................... 334
Getting started data sharing with CloudFormation .............................................................. 343
Limitations ..................................................................................................................... 345
Ingesting and querying semistructured data in Amazon Redshift .......................................................... 346
Use cases for the SUPER data type ........................................................................................... 346
Concepts for SUPER data type use ........................................................................................... 347
Considerations for SUPER data ................................................................................................. 348
SUPER sample dataset ............................................................................................................ 348
Loading semistructured data into Amazon Redshift .................................................................... 350
Parsing JSON documents to SUPER columns ...................................................................... 350
Using COPY to load JSON data in Amazon Redshift ............................................................ 351
Unloading semistructured data ................................................................................................. 354
Unloading semistructured data in CSV or text formats ........................................................ 354
Unloading semistructured data in the Parquet format ......................................................... 355
Querying semistructured data .................................................................................................. 355
Navigation ..................................................................................................................... 355
Unnesting queries ........................................................................................................... 356
Object unpivoting ........................................................................................................... 357
Dynamic typing .............................................................................................................. 358
Lax semantics ................................................................................................................ 360
Types of introspection ..................................................................................................... 360
Order by ........................................................................................................................ 361
Operators and functions .......................................................................................................... 362
Arithmetic operators ....................................................................................................... 362
Arithmetic functions ....................................................................................................... 362
Array functions ............................................................................................................... 363
SUPER configurations .............................................................................................................. 364
Lax and strict modes for SUPER ....................................................................................... 364
Accessing JSON fields with upper or mixed case letters ....................................................... 364
Limitations ............................................................................................................................. 365
Using SUPER data type with materialized views ......................................................................... 367
Accelerating PartiQL queries ............................................................................................ 367
Limitations for using the SUPER data type with materialized views ....................................... 369
Using machine learning in Amazon Redshift ...................................................................................... 370
Machine learning overview ...................................................................................................... 371
How machine learning can solve a problem ....................................................................... 371
Terms and concepts for Amazon Redshift ML ..................................................................... 372
Machine learning for novices and experts .................................................................................. 373
Costs for using Amazon Redshift ML ......................................................................................... 374
Getting started with Amazon Redshift ML ................................................................................. 375
Administrative setup ....................................................................................................... 376
Using model explainability with Amazon Redshift ML .......................................................... 379
Tutorial: Creating a customer churn model ........................................................................ 379
Tuning query performance .............................................................................................................. 383
Query processing .................................................................................................................... 383
Query planning and execution workflow ........................................................................... 383
Query plan .................................................................................................................... 385
Reviewing query plan steps ............................................................................................. 390
Factors affecting query performance ................................................................................. 392
Analyzing and improving queries .............................................................................................. 393
Query analysis workflow ................................................................................................. 393
Reviewing query alerts .................................................................................................... 394
Analyzing the query plan ................................................................................................ 395
Analyzing the query summary .......................................................................................... 396
Improving query performance .......................................................................................... 401
Diagnostic queries for query tuning .................................................................................. 403
Troubleshooting queries .......................................................................................................... 406
Connection fails ............................................................................................................. 407
Query hangs .................................................................................................................. 407
Query takes too long ...................................................................................................... 408
Load fails ...................................................................................................................... 409
Load takes too long ........................................................................................................ 409
Load data is incorrect ..................................................................................................... 409
Setting the JDBC fetch size parameter .............................................................................. 409
Implementing workload management ............................................................................................... 411
Modifying the WLM configuration ............................................................................................ 412
Migrating from manual WLM to automatic WLM ................................................................ 412
Automatic WLM ...................................................................................................................... 413
Priority .......................................................................................................................... 414
Concurrency scaling mode ............................................................................................... 414
User groups ................................................................................................................... 414
Query groups ................................................................................................................. 414
Wildcards ....................................................................................................................... 415
Query monitoring rules ................................................................................................... 415
Checking for automatic WLM ........................................................................................... 415
Query priority ................................................................................................................ 415
Manual WLM .......................................................................................................................... 419
Concurrency scaling mode ............................................................................................... 420
Concurrency level ........................................................................................................... 420
User groups ................................................................................................................... 421
Query groups ................................................................................................................. 421
Wildcards ....................................................................................................................... 421
WLM memory percent to use ........................................................................................... 421
WLM timeout ................................................................................................................. 422
Query monitoring rules ................................................................................................... 422
WLM query queue hopping .............................................................................................. 422
Tutorial: Configuring manual WLM queues ......................................................................... 425
Concurrency scaling ................................................................................................................ 438
Concurrency scaling capabilities ....................................................................................... 438
Limitations for concurrency scaling ................................................................................... 439
Regions for concurrency scaling ....................................................................................... 439
Concurrency scaling candidates ........................................................................................ 440
Configuring concurrency scaling queues ............................................................................ 416
Monitoring concurrency scaling ........................................................................................ 441
Concurrency scaling system views ..................................................................................... 441
Short query acceleration ......................................................................................................... 442
Maximum SQA runtime ................................................................................................... 442
Monitoring SQA .............................................................................................................. 443
WLM queue assignment rules ................................................................................................... 443
Queue assignments example ............................................................................................ 444
Assigning queries to queues ..................................................................................................... 446
Assigning queries to queues based on user groups .............................................................. 446
Assigning a query to a query group .................................................................................. 446
Assigning queries to the superuser queue .......................................................................... 446
Dynamic and static properties .................................................................................................. 447
WLM dynamic memory allocation ..................................................................................... 448
Dynamic WLM example ................................................................................................... 448
Query monitoring rules ........................................................................................................... 450
Defining a query monitor rule .......................................................................................... 450
Query monitoring metrics ................................................................................................ 451
Query monitoring rules templates .................................................................................... 453
System tables and views for query monitoring rules ........................................................... 454
WLM system tables and views .................................................................................................. 455
WLM service class IDs ..................................................................................................... 456
Managing database security ............................................................................................................. 457
Amazon Redshift security overview ........................................................................................... 457
Default database user privileges ............................................................................................... 458
Superusers ............................................................................................................................. 458
Users ..................................................................................................................................... 459
Creating, altering, and deleting users ................................................................................ 459
Role-based access control ........................................................................................................ 460
Role hierarchy ................................................................................................................ 460
Role assignment ............................................................................................................. 460
Amazon Redshift system-defined roles .............................................................................. 461
System privileges ............................................................................................................ 461
Database object permissions ............................................................................................ 464
ALTER DEFAULT PRIVILEGES for RBAC ............................................................................... 465
Considerations for role usage ........................................................................................... 465
Managing roles .............................................................................................................. 465
Groups .................................................................................................................................. 465
Creating, altering, and deleting groups .............................................................................. 466
Schemas ................................................................................................................................ 466
Creating, altering, and deleting schemas ........................................................................... 466
Search path ................................................................................................................... 467
Schema-based privileges ................................................................................................. 467
Example for controlling user and group access ........................................................................... 467
SQL reference ................................................................................................................................ 469
Amazon Redshift SQL ............................................................................................................. 469
SQL functions supported on the leader node ..................................................................... 469
Amazon Redshift and PostgreSQL .................................................................................... 470
Using SQL ............................................................................................................................. 475
SQL reference conventions .............................................................................................. 476
Basic elements ............................................................................................................... 476
Expressions .................................................................................................................... 505
Conditions ..................................................................................................................... 509
SQL commands ...................................................................................................................... 527
ABORT .......................................................................................................................... 529
ALTER DATABASE ........................................................................................................... 530
ALTER DATASHARE ......................................................................................................... 532
ALTER DEFAULT PRIVILEGES ............................................................................................ 534
ALTER GROUP ................................................................................................................ 536
ALTER IDENTITY PROVIDER .............................................................................................. 537
ALTER MATERIALIZED VIEW ............................................................................................. 538
ALTER ROLE ................................................................................................................... 538
ALTER PROCEDURE ......................................................................................................... 540
ALTER SCHEMA .............................................................................................................. 541
ALTER TABLE ................................................................................................................. 542
ALTER TABLE APPEND ..................................................................................................... 557
ALTER USER ................................................................................................................... 561
ANALYZE ....................................................................................................................... 564
ANALYZE COMPRESSION ................................................................................................. 566
BEGIN ........................................................................................................................... 568
CALL ............................................................................................................................. 569
CANCEL ......................................................................................................................... 571
CLOSE ........................................................................................................................... 573
COMMENT ..................................................................................................................... 574
COMMIT ........................................................................................................................ 575
COPY ............................................................................................................................ 576
CREATE DATABASE .......................................................................................................... 641
CREATE DATASHARE ....................................................................................................... 647

CREATE EXTERNAL FUNCTION ......................................................... 648
CREATE EXTERNAL SCHEMA ............................................................................................ 655
CREATE EXTERNAL TABLE ................................................................................................ 661
CREATE FUNCTION ......................................................................................................... 679
CREATE GROUP .............................................................................................................. 683
CREATE IDENTITY PROVIDER ............................................................................................ 684
CREATE LIBRARY ............................................................................................................ 684
CREATE MATERIALIZED VIEW ........................................................................................... 687
CREATE MODEL .............................................................................................................. 690
CREATE PROCEDURE ....................................................................................................... 702
CREATE ROLE ................................................................................................................. 705
CREATE SCHEMA ............................................................................................................ 706
CREATE TABLE ............................................................................................................... 708
CREATE TABLE AS ........................................................................................................... 723
CREATE USER ................................................................................................................. 731
CREATE VIEW ................................................................................................................. 735
DEALLOCATE .................................................................................................................. 738
DECLARE ....................................................................................................................... 739
DELETE ......................................................................................................................... 741
DESC DATASHARE ........................................................................................................... 743
DESC IDENTITY PROVIDER ............................................................................................... 744
DROP DATABASE ............................................................................................................ 745
DROP DATASHARE .......................................................................................................... 746
DROP FUNCTION ............................................................................................................ 747
DROP GROUP ................................................................................................................ 748
DROP IDENTITY PROVIDER .............................................................................................. 749
DROP LIBRARY ............................................................................................................... 749
DROP MODEL ................................................................................................................ 750
DROP MATERIALIZED VIEW .............................................................................................. 751
DROP PROCEDURE ......................................................................................................... 751
DROP ROLE ................................................................................................................... 752
DROP SCHEMA ............................................................................................................... 753
DROP TABLE .................................................................................................................. 755
DROP USER ................................................................................................................... 757
DROP VIEW ................................................................................................................... 759
END .............................................................................................................................. 760
EXECUTE ....................................................................................................................... 761
EXPLAIN ........................................................................................................................ 762
FETCH ........................................................................................................................... 766
GRANT .......................................................................................................................... 767
INSERT .......................................................................................................................... 780
INSERT (external table) ................................................................................................... 784
LOCK ............................................................................................................................ 786
PREPARE ....................................................................................................................... 787
REFRESH MATERIALIZED VIEW ......................................................................................... 788
RESET ........................................................................................................................... 790
REVOKE ......................................................................................................................... 791
ROLLBACK ..................................................................................................................... 800
SELECT .......................................................................................................................... 801
SELECT INTO .................................................................................................................. 836
SET ............................................................................................................................... 837
SET SESSION AUTHORIZATION ........................................................................................ 840
SET SESSION CHARACTERISTICS ....................................................................................... 841
SHOW ........................................................................................................................... 841
SHOW EXTERNAL TABLE ................................................................................................. 841
SHOW MODEL ............................................................................................................... 843
SHOW DATASHARES ....................................................................................................... 845

SHOW PROCEDURE ........................................................ 846
SHOW TABLE ................................................................................................................. 847
SHOW VIEW .................................................................................................................. 848
START TRANSACTION ...................................................................................................... 849
TRUNCATE ..................................................................................................................... 849
UNLOAD ........................................................................................................................ 850
UPDATE ......................................................................................................................... 869
VACUUM ........................................................................................................................ 873
SQL functions reference .......................................................................................................... 878
Leader node–only functions ............................................................................................. 879
Compute node–only functions .......................................................................................... 879
Aggregate functions ........................................................................................................ 880
Array functions ............................................................................................................... 898
VARBYTE functions ......................................................................................................... 902
Bit-wise aggregate functions ............................................................................................ 908
Conditional expressions ................................................................................................... 913
Date and time functions .................................................................................................. 922
Spatial functions ............................................................................................................ 962
Hash functions ............................................................................................................. 1056
HyperLogLog functions .................................................................................................. 1061
JSON functions ............................................................................................................. 1065
Math functions ............................................................................................................. 1072
Machine learning functions ............................................................................................ 1096
String functions ............................................................................................................ 1098
SUPER type information functions .................................................................................. 1143
Window functions ......................................................................................................... 1152
Data type formatting functions ...................................................................................... 1195
System administration functions ..................................................................................... 1212
System information functions ........................................................................................ 1219
Reserved words .................................................................................................................... 1238
System tables reference ................................................................................................................ 1241
System tables and views ........................................................................................................ 1241
Types of system tables and views ........................................................................................... 1241
Visibility of data in system tables and views ............................................................................ 1242
Filtering system-generated queries ................................................................................. 1242
STV tables for snapshot data ................................................................................................. 1242
STV_ACTIVE_CURSORS .................................................................................................. 1243
STV_BLOCKLIST ............................................................................................................ 1244
STV_CURSOR_CONFIGURATION ...................................................................................... 1247
STV_EXEC_STATE .......................................................................................................... 1247
STV_INFLIGHT .............................................................................................................. 1248
STV_LOAD_STATE ......................................................................................................... 1250
STV_LOCKS .................................................................................................................. 1251
STV_ML_MODEL_INFO ................................................................................................... 1252
STV_MV_DEPS .............................................................................................................. 1252
STV_MV_INFO .............................................................................................................. 1253
STV_NODE_STORAGE_CAPACITY ..................................................................................... 1255
STV_PARTITIONS .......................................................................................................... 1256
STV_QUERY_METRICS ................................................................................................... 1257
STV_RECENTS .............................................................................................................. 1261
STV_SESSIONS ............................................................................................................. 1262
STV_SLICES .................................................................................................................. 1263
STV_STARTUP_RECOVERY_STATE .................................................................................... 1264
STV_TBL_PERM ............................................................................................................ 1264
STV_TBL_TRANS ........................................................................................................... 1266
STV_WLM_QMR_CONFIG ............................................................................................... 1268
STV_WLM_CLASSIFICATION_CONFIG ............................................................................... 1268

STV_WLM_QUERY_QUEUE_STATE ................................................... 1269
STV_WLM_QUERY_STATE ............................................................................................... 1270
STV_WLM_QUERY_TASK_STATE ...................................................................................... 1272
STV_WLM_SERVICE_CLASS_CONFIG ................................................................................ 1273
STV_WLM_SERVICE_CLASS_STATE .................................................................................. 1274
System views ....................................................................................................................... 1275
STL views for logging .................................................................................................... 1276
SVCS views .................................................................................................................. 1357
SVL views .................................................................................................................... 1375
SVV views .................................................................................................................... 1415
Monitoring views (preview) ............................................................................................ 1457
System catalog tables ........................................................................................................... 1466
PG_ATTRIBUTE_INFO ..................................................................................................... 1467
PG_CLASS_INFO ........................................................................................................... 1467
PG_DATABASE_INFO ...................................................................................................... 1468
PG_DEFAULT_ACL ......................................................................................................... 1469
PG_EXTERNAL_SCHEMA ................................................................................................. 1470
PG_LIBRARY ................................................................................................................. 1471
PG_PROC_INFO ............................................................................................................ 1472
PG_STATISTIC_INDICATOR .............................................................................................. 1472
PG_TABLE_DEF ............................................................................................................. 1473
Querying the catalog tables ........................................................................................... 1475
Configuration reference ................................................................................................................. 1480
Modifying the server configuration ......................................................................................... 1480
activate_aqua ....................................................................................................................... 1481
Values ......................................................................................................................... 1481
Description ................................................................................................................... 1481
Example ...................................................................................................................... 1481
analyze_threshold_percent ..................................................................................................... 1481
Values (default in bold) ................................................................................................. 1481
Description ................................................................................................................... 1482
Examples ..................................................................................................................... 1482
cast_super_null_on_error ....................................................................................................... 1482
Values (default in bold) ................................................................................................. 1482
Description ................................................................................................................... 1481
datashare_break_glass_session_var .......................................................................................... 1482
Values (default in bold) ................................................................................................. 1482
Description ................................................................................................................... 1482
Example ...................................................................................................................... 1483
datestyle ............................................................................................................................. 1483
Values (default in bold) ................................................................................................. 1483
Description ................................................................................................................... 1481
Example ...................................................................................................................... 1483
default_geometry_encoding ................................................................................................... 1483
Values (default in bold) ................................................................................................. 1483
Description ................................................................................................................... 1481
describe_field_name_in_uppercase .......................................................................................... 1484
Values (default in bold) ................................................................................................. 1484
Description ................................................................................................................... 1481
Example ...................................................................................................................... 1483
downcase_delimited_identifier ................................................................................................ 1484
Values (default in bold) ................................................................................................. 1484
Description ................................................................................................................... 1481
enable_case_sensitive_identifier .............................................................................................. 1484
Values (default in bold) ................................................................................................. 1484
enable_result_cache_for_session ............................................................................................. 1486
Values (default in bold) ................................................................................................. 1486

Description ................................................... 1481
enable_vacuum_boost ........................................................................................................... 1486
Values (default in bold) ................................................................................................. 1486
Description ................................................................................................................... 1481
error_on_nondeterministic_update .......................................................................................... 1486
Values (default in bold) ................................................................................................. 1486
Description ................................................................................................................... 1481
Example ...................................................................................................................... 1483
extra_float_digits .................................................................................................................. 1487
Values (default in bold) ................................................................................................. 1487
Description ................................................................................................................... 1487
json_serialization_enable ....................................................................................................... 1487
Values (default in bold) ................................................................................................. 1487
Description ................................................................................................................... 1481
json_serialization_parse_nested_strings .................................................................................... 1487
Values (default in bold) ................................................................................................. 1487
Description ................................................................................................................... 1481
max_concurrency_scaling_clusters ........................................................................................... 1488
Values (default in bold) ................................................................................................. 1488
Description ................................................................................................................... 1488
max_cursor_result_set_size ..................................................................................................... 1488
Values (default in bold) ................................................................................................. 1488
Description ................................................................................................................... 1488
mv_enable_aqmv_for_session ................................................................................................. 1488
Values (default in bold) ................................................................................................. 1488
Description ................................................................................................................... 1488
navigate_super_null_on_error ................................................................................................. 1488
Values (default in bold) ................................................................................................. 1488
Description ................................................................................................................... 1481
parse_super_null_on_error ..................................................................................................... 1489
Values (default in bold) ................................................................................................. 1489
Description ................................................................................................................... 1481
query_group ......................................................................................................................... 1489
Values (default in bold) ................................................................................................. 1489
Description ................................................................................................................... 1489
search_path ......................................................................................................................... 1490
Values (default in bold) ................................................................................................. 1490
Description ................................................................................................................... 1490
Example ...................................................................................................................... 1490
spectrum_query_maxerror ...................................................................................................... 1491
Values (default in bold) ................................................................................................. 1491
Description ................................................................................................................... 1491
Example ...................................................................................................................... 1491
statement_timeout ............................................................................................................... 1492
Values (default in bold) ................................................................................................. 1492
Description ................................................................................................................... 1492
Example ...................................................................................................................... 1492
stored_proc_log_min_messages .............................................................................................. 1492
Values (default in bold) ................................................................................................. 1492
Description ................................................................................................................... 1481
timezone ............................................................................................................................. 1493
Values (default in bold) ................................................................................................. 1493
Syntax ......................................................................................................................... 1493
Description ................................................................................................................... 1493
Time zone formats ........................................................................................................ 1493
Examples ..................................................................................................................... 1495
wlm_query_slot_count ........................................................................................................... 1495

Values (default in bold) ................................................. 1495
Description ................................................................................................................... 1495
Examples ..................................................................................................................... 1496
Sample database .......................................................................................................................... 1497
CATEGORY table ................................................................................................................... 1498
DATE table ........................................................................................................................... 1498
EVENT table ......................................................................................................................... 1499
VENUE table ........................................................................................................................ 1499
USERS table ......................................................................................................................... 1500
LISTING table ....................................................................................................................... 1500
SALES table ......................................................................................................................... 1501
Time zone names and abbreviations ............................................................................................... 1502
Time zone names ................................................................................................................. 1502
Time zone abbreviations ........................................................................................................ 1511
Document history ......................................................................................................................... 1515
Earlier updates ..................................................................................................................... 1521

Amazon Redshift system overview


Topics
• Are you a first-time Amazon Redshift user? (p. 1)
• Are you a database developer? (p. 2)
• Prerequisites (p. 3)
• System and architecture overview (p. 3)

This is the Amazon Redshift Database Developer Guide.

Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.

This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with
databases as a designer, software developer, or administrator, it gives you the information you need to
design, build, query, and maintain your data warehouse.

Are you a first-time Amazon Redshift user?


If you are a first-time user of Amazon Redshift, we recommend that you begin by reading the following
sections.

• Service Highlights and Pricing – The product detail page provides the Amazon Redshift value
proposition, service highlights, and pricing.
• Getting Started – Amazon Redshift Getting Started Guide includes an example that walks you
through the process of creating an Amazon Redshift data warehouse cluster, creating database tables,
uploading data, and testing queries.

After you complete the Getting Started guide, we recommend that you explore one of the following
guides:

• Amazon Redshift Cluster Management Guide – The Cluster Management guide shows you how to
create and manage Amazon Redshift clusters.

If you are an application developer, you can use the Amazon Redshift Query API to manage clusters
programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift
API can help simplify your programming tasks. If you prefer a more interactive way of managing
clusters, you can use the Amazon Redshift console and the AWS Command Line Interface (AWS CLI).
For information about the API and CLI, go to the following manuals:
• API reference
• CLI reference
• Amazon Redshift Database Developer Guide (this document) – If you are a database developer, the
Database Developer Guide explains how to design, build, query, and maintain the databases that make
up your data warehouse.

If you are transitioning to Amazon Redshift from another relational database system or data warehouse
application, you should be aware of important differences in how Amazon Redshift is implemented. For
a summary of the most important considerations for designing tables and loading data, see Amazon
Redshift best practices for designing tables (p. 16) and Amazon Redshift best practices for loading
data (p. 19). Amazon Redshift is based on PostgreSQL. For a detailed list of the differences between
Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 470).

Are you a database developer?


If you are a database user, database designer, database developer, or database administrator, the
following table will help you find what you're looking for.

If you want to ...                                  We recommend

Quickly start using Amazon Redshift
    Begin by following the steps in Amazon Redshift Getting Started Guide to quickly deploy a
    cluster, connect to a database, and try out some queries.

    When you are ready to build your database, load data into tables, and write queries to
    manipulate data in the data warehouse, return here to the Database Developer Guide.

Learn about the internal architecture of the Amazon Redshift data warehouse.
    The System and architecture overview (p. 3) gives a high-level overview of Amazon Redshift's
    internal architecture.

    If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift
    product detail page.

Create databases, tables, users, and other database objects.
    Getting started using databases is a quick introduction to the basics of SQL development.

    Amazon Redshift SQL (p. 469) has the syntax and examples for Amazon Redshift SQL commands and
    functions and other SQL elements.

    Amazon Redshift best practices for designing tables (p. 16) provides a summary of our
    recommendations for choosing sort keys, distribution keys, and compression encodings.

Learn how to design tables for optimum performance.
    Working with automatic table optimization (p. 37) details considerations for applying
    compression to the data in table columns and choosing distribution and sort keys.

Load data.
    Loading data (p. 65) explains the procedures for loading large datasets from Amazon DynamoDB
    tables or from flat files stored in Amazon S3 buckets.

    Amazon Redshift best practices for loading data (p. 19) provides tips for loading your data
    quickly and effectively.

Manage users, groups, and database security.
    Managing database security (p. 457) covers database security topics.

Monitor and optimize system performance.
    The System tables reference (p. 1241) details system tables and views that you can query for
    the status of the database and monitor queries and processes.

    You should also consult the Amazon Redshift Cluster Management Guide to learn how to use the
    AWS Management Console to check the system health, monitor metrics, and back up and restore
    clusters.

Analyze and report information from very large datasets.
    Many popular software vendors are certifying Amazon Redshift with their offerings to enable you
    to continue to use the tools you use today. For more information, see the Amazon Redshift
    partner page.

    The SQL reference (p. 469) has all the details for the SQL expressions, commands, and functions
    Amazon Redshift supports.

Prerequisites
Before you use this guide, you should complete these tasks.

• Install a SQL client.
• Launch an Amazon Redshift cluster.
• Connect your SQL client to a database in your cluster.

For step-by-step instructions, see Amazon Redshift Getting Started Guide.

You should also know how to use your SQL client and should have a fundamental understanding of the
SQL language.

System and architecture overview


Topics
• Data warehouse system architecture (p. 3)
• Performance (p. 5)
• Columnar storage (p. 7)
• Workload management (p. 9)
• Using Amazon Redshift with other services (p. 9)

An Amazon Redshift data warehouse is an enterprise-class relational database query and management
system.

Amazon Redshift supports client connections with many types of applications, including business
intelligence (BI), reporting, data, and analytics tools.

When you run analytic queries, you are retrieving, comparing, and evaluating large amounts of data in
multiple-stage operations to produce a final result.

Amazon Redshift achieves efficient storage and optimum query performance through a combination
of massively parallel processing, columnar data storage, and very efficient, targeted data compression
encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.

Data warehouse system architecture


This section introduces the elements of the Amazon Redshift data warehouse architecture as shown in
the following figure.


Client applications

Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools
and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on
industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal
changes. For information about important differences between Amazon Redshift SQL and PostgreSQL,
see Amazon Redshift and PostgreSQL (p. 470).

Clusters

The core infrastructure component of an Amazon Redshift data warehouse is a cluster.

A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more
compute nodes, an additional leader node coordinates the compute nodes and handles external
communication. Your client application interacts directly only with the leader node. The compute nodes
are transparent to external applications.

Leader node

The leader node manages communications with client programs and all communication with compute
nodes. It parses and develops execution plans to carry out database operations, in particular, the series
of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node
compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to
each compute node.

The leader node distributes SQL statements to the compute nodes only when a query references tables
that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon
Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any
of these functions will return an error if it references tables that reside on the compute nodes. For more
information, see SQL functions supported on the leader node (p. 469).

Compute nodes

The leader node compiles code for individual elements of the execution plan and assigns the code to
individual compute nodes. The compute nodes run the compiled code and send intermediate results
back to the leader node for final aggregation.


Each compute node has its own dedicated CPU, memory, and attached disk storage, which are
determined by the node type. As your workload grows, you can increase the compute capacity and
storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Amazon Redshift provides several node types for your compute and storage needs. For details of each
node type, see Amazon Redshift clusters in the Amazon Redshift Cluster Management Guide.

Node slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and
disk space, where it processes a portion of the workload assigned to the node. The leader node manages
distributing data to the slices and apportions the workload for any queries or other database operations
to the slices. The slices then work in parallel to complete the operation.

The number of slices per node is determined by the node size of the cluster. For more information about
the number of slices for each node size, go to About clusters and nodes in the Amazon Redshift Cluster
Management Guide.

When you create a table, you can optionally specify one column as the distribution key. When the table
is loaded with data, the rows are distributed to the node slices according to the distribution key that is
defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing
to load data and run queries efficiently. For information about choosing a distribution key, see Choose
the best distribution style (p. 17).
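
For example, a CREATE TABLE statement can declare the distribution key directly. The following is a
minimal sketch; the table and column names are illustrative only:

create table sales_summary (
    salesid  integer not null,
    listid   integer not null distkey,   -- illustrative column chosen as the distribution key
    qtysold  smallint not null,
    saletime timestamp
);

Rows are then distributed to slices according to the values of the listid column as the table is loaded.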

Internal network

Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom
communication protocols to provide private, very high-speed network communication between the
leader node and compute nodes. The compute nodes run on a separate, isolated network that client
applications never access directly.

Databases

A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client
communicates with the leader node, which in turn coordinates query execution with the compute nodes.

Amazon Redshift is a relational database management system (RDBMS), so it is compatible with
other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including
online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is
optimized for high-performance analysis and reporting of very large datasets.

Amazon Redshift is based on PostgreSQL. Amazon Redshift and PostgreSQL have a number of
very important differences that you need to take into account as you design and develop your data
warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see
Amazon Redshift and PostgreSQL (p. 470).

Performance
Amazon Redshift achieves extremely fast query execution by employing these performance features.

Topics
• Massively parallel processing (p. 6)
• Columnar data storage (p. 6)
• Data compression (p. 6)
• Query optimizer (p. 6)
• Result caching (p. 6)
• Compiled code (p. 7)


Massively parallel processing


Massively parallel processing (MPP) enables fast execution of the most complex queries operating on
large amounts of data. Multiple compute nodes handle all query processing leading up to final result
aggregation, with each core of each node executing the same compiled query segments on portions of
the entire data.

Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed
in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution
of data to balance the workload and minimize movement of data from node to node. For more
information, see Choose the best distribution style (p. 17).

Loading data from flat files takes advantage of parallel processing by spreading the workload across
multiple nodes while simultaneously reading from multiple files. For more information about how to
load data into tables, see Amazon Redshift best practices for loading data (p. 19).

Columnar data storage


Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an
important factor in optimizing analytic query performance. Storing database table information in a
columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to
load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory
processing when executing queries. See Columnar storage (p. 7) for a more detailed explanation.

When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of
data blocks. For more information, see Choose the best sort key (p. 16).
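
For example, a sort key can be declared when the table is created; the table and column names here are
illustrative only:

create table event_by_time (
    eventid   integer not null,
    starttime timestamp                  -- illustrative column used as the sort key
)
sortkey (starttime);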

Data compression
Data compression reduces storage requirements, thereby reducing disk I/O, which improves query
performance. When you run a query, the compressed data is read into memory, then uncompressed
during query execution. Loading less data into memory enables Amazon Redshift to allocate more
memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon
Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The
best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal
compression encodings when you load the table with data. To learn more about using automatic data
compression, see Loading tables with automatic compression (p. 89).
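
If a table is already loaded, you can also check the encodings that Amazon Redshift would suggest by
running ANALYZE COMPRESSION. The following is an illustrative example; the table name is a placeholder:

analyze compression sales;   -- reports a suggested encoding and estimated space savings per column

The command analyzes a sample of the table's data and reports its recommendations without changing the
table's encodings.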

Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and
also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer
implements significant enhancements and extensions for processing complex analytic queries that often
include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see
Tuning query performance (p. 383).

Result caching
To reduce query execution time and improve system performance, Amazon Redshift caches the results of
certain types of queries in memory on the leader node. When a user submits a query, Amazon Redshift
checks the results cache for a valid, cached copy of the query results. If a match is found in the result
cache, Amazon Redshift uses the cached results and doesn't run the query. Result caching is transparent
to the user.

Result caching is turned on by default. To turn off result caching for the current session, set the
enable_result_cache_for_session (p. 1486) parameter to off.
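
For example, the following command turns off result caching for your current session:

set enable_result_cache_for_session to off;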


Amazon Redshift uses cached results for a new query when all of the following are true:

• The user submitting the query has access privilege to the objects used in the query.
• The table or views in the query haven't been modified.
• The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
• The query doesn't reference Amazon Redshift Spectrum external tables.
• Configuration parameters that might affect query results are unchanged.
• The query syntactically matches the cached query.

To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn't cache some
large query result sets. Amazon Redshift determines whether to cache query results based on a number
of factors. These factors include the number of entries in the cache and the instance type of your
Amazon Redshift cluster.

To determine whether a query used the result cache, query the SVL_QLOG (p. 1385) system view. If a
query used the result cache, the source_query column returns the query ID of the source query. If result
caching wasn't used, the source_query column value is NULL.

The following example shows that queries submitted by userid 104 and userid 102 use the result cache
from queries run by userid 100.

select userid, query, elapsed, source_query from svl_qlog
where userid > 1
order by query desc;

userid | query  | elapsed  | source_query
-------+--------+----------+-------------
   104 | 629035 |       27 |       628919
   104 | 629034 |       60 |       628900
   104 | 629033 |       23 |       628891
   102 | 629017 |  1229393 |
   102 | 628942 |       28 |       628919
   102 | 628941 |       57 |       628900
   102 | 628940 |       26 |       628891
   100 | 628919 | 84295686 |
   100 | 628900 | 87015637 |
   100 | 628891 | 58808694 |

Compiled code
The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling
the query eliminates the overhead associated with an interpreter and therefore increases the execution
speed, especially for complex queries. The compiled code is cached and shared across sessions on the
same cluster, so subsequent executions of the same query will be faster, often even with different
parameters.

The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql
(libpq) connection protocols, so two clients using different protocols will each incur the first-time cost
of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the
cached code.

Columnar storage
Columnar storage for database tables is an important factor in optimizing analytic query performance
because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need
to load from disk.


The following series of illustrations describes how columnar data storage implements efficiencies and
how that translates into efficiencies when retrieving data into memory.

This first illustration shows how records from database tables are typically stored into disk blocks by row.

In a typical relational database table, each row contains field values for a single record. In row-wise
database storage, data blocks store values sequentially for each consecutive column making up the
entire row. If block size is smaller than the size of a record, storage for an entire record may take more
than one block. If block size is larger than the size of a record, storage for an entire record may take
less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP)
applications, most transactions involve frequently reading and writing all of the values for entire records,
typically one record or a small number of records at a time. As a result, row-wise storage is optimal for
OLTP databases.

The next illustration shows how with columnar storage, the values for each column are stored
sequentially into disk blocks.

Using columnar storage, each data block stores values of a single column for multiple rows. As records
enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the
columns.

In this simplified example, using columnar storage, each data block holds column field values for as
many as three times as many records as row-based storage. This means that reading the same number
of column field values for the same number of records requires a third of the I/O operations compared
to row-wise storage. In practice, using tables with very large numbers of columns and very large row
counts, storage efficiency is even greater.

An added advantage is that, since each block holds the same type of data, block data can use a
compression scheme selected specifically for the column data type, further reducing disk space and
I/O. For more information about compression encodings based on data types, see Compression
encodings (p. 40).

The savings in space for storing data on disk also carries over to retrieving and then storing that data in
memory. Since many database operations only need to access or operate on one or a small number of
columns at a time, you can save memory space by only retrieving blocks for columns you actually need
for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small
number of records, data warehouse queries commonly read only a few columns for a very large number
of rows. This means that reading the same number of column field values for the same number of rows
requires a fraction of the I/O operations and uses a fraction of the memory that would be required for
processing row-wise blocks. In practice, using tables with very large numbers of columns and very large
row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100
columns. A query that uses five columns will only need to read about five percent of the data contained
in the table. This savings is repeated for possibly billions or even trillions of records for large databases.
In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.

Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB,
which is more efficient and further reduces the number of I/O requests needed to perform any database
loading or other operations that are part of query execution.

Workload management
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within
workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

Amazon Redshift WLM creates query queues at runtime according to service classes, which define the
configuration parameters for various types of queues, including internal system queues and user-
accessible queues. From a user perspective, a user-accessible service class and a queue are functionally
equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service
class as well as a runtime queue.

When you run a query, WLM assigns the query to a queue according to the user's user group or by
matching a query group that is listed in the queue configuration with a query group label that the user
sets at runtime.
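
For example, assuming a manual WLM queue has been configured to match a query group labeled 'reports' (a hypothetical label), a user can route a query to that queue at runtime:

set query_group to 'reports';
select avg(qtysold) from sales;   -- runs in the queue that matches the 'reports' query group
reset query_group;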

Currently, the default for clusters using the default parameter group is to use automatic WLM.
Automatic WLM manages query concurrency and memory allocation. For more information, see
Implementing automatic WLM (p. 413).

With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables
up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level
of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency
level of 50. The maximum total concurrency level for all user-defined queues (not including the
Superuser queue) is 50.

The easiest way to modify the WLM configuration is by using the Amazon Redshift console.
You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.

For more information about implementing and using workload management, see Implementing
workload management (p. 411).

Using Amazon Redshift with other services


Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your
data quickly and reliably, using data security features.

Moving data between Amazon Redshift and Amazon S3


Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon
Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3
buckets. For more information, see Loading data from Amazon S3 (p. 68).

You can also use parallel processing to export data from your Amazon Redshift data warehouse to
multiple data files on Amazon S3. For more information, see Unloading data (p. 143).
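
As a minimal sketch of both directions (the bucket, prefix, and IAM role shown are placeholders):

-- load from Amazon S3 in parallel
copy sales
from 's3://mybucket/tickit/sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

-- unload query results to multiple files on Amazon S3
unload ('select * from sales where saletime > ''2008-12-01''')
to 's3://mybucket/unload/sales_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';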

Using Amazon Redshift with Amazon DynamoDB


Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load
an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see
Loading data from an Amazon DynamoDB table (p. 86).
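
A sketch of the command, assuming a hypothetical DynamoDB table named ProductCatalog; READRATIO limits how much of the table's provisioned read capacity the load consumes:

copy product_catalog
from 'dynamodb://ProductCatalog'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;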

Importing data from remote hosts over SSH


You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such
as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote
hosts using SSH and runs commands on the remote hosts to generate data. Amazon Redshift supports
multiple simultaneous connections. The COPY command reads and loads the output from multiple host
sources in parallel. For more information, see Loading data from remote hosts (p. 80).
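
A sketch of the command shape, assuming an SSH manifest file has already been uploaded to a hypothetical Amazon S3 location:

copy sales
from 's3://mybucket/ssh_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
ssh;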

Automating data loads using AWS Data Pipeline


You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon
Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and
run recurring jobs without having to write your own complex data transfer or transformation logic.
For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB
into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that
periodically moves data from Amazon S3 to Amazon Redshift, see Copy data to Amazon Redshift using
AWS Data Pipeline in the AWS Data Pipeline Developer Guide.

Migrating data using AWS Database Migration Service (AWS DMS)

You can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can
migrate your data to and from most widely used commercial and open-source databases such as Oracle,
PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and
MySQL. For more information, see Using an Amazon Redshift database as a target for AWS Database
Migration Service.

Amazon Redshift best practices


Following, you can find best practices for planning a proof of concept, designing tables, loading data into
tables, and writing queries for Amazon Redshift, and also a discussion of working with Amazon Redshift
Advisor.

Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the
Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively
parallel processing, columnar data storage, and columnar data compression. If your data loading and
query execution times are longer than you expect, or longer than you want, you might be overlooking
key information.

If you are an experienced SQL database developer, we strongly recommend that you review this topic
before you begin developing your Amazon Redshift data warehouse.

If you are new to developing SQL databases, this topic is not the best place to start. We recommend that
you begin by reading Getting started using databases and trying the examples yourself.

In this topic, you can find an overview of the most important development principles, along with specific
tips, examples, and best practices for implementing those principles. No single practice can apply to
every application. You should evaluate all of your options before finalizing a database design. For more
information, see Working with automatic table optimization (p. 37), Loading data (p. 65), Tuning
query performance (p. 383), and the reference chapters.

Topics
• Conducting a proof of concept for Amazon Redshift (p. 11)
• Amazon Redshift best practices for designing tables (p. 16)
• Amazon Redshift best practices for loading data (p. 19)
• Amazon Redshift best practices for designing queries (p. 22)
• Working with recommendations from Amazon Redshift Advisor (p. 24)

Conducting a proof of concept for Amazon Redshift

Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze
all your data using standard SQL with your existing business intelligence (BI) tools. Amazon Redshift
offers fast performance in a low-cost cloud data warehouse. It uses sophisticated query optimization,
accelerated cache, columnar storage on high-performance local disks, and massively parallel query
execution.

In the following sections, you can find a framework for building a proof of concept with Amazon
Redshift. The framework helps you to use architectural best practices for designing and operating a
secure, high-performing, and cost-effective data warehouse. This guidance is based on reviewing designs
of thousands of customer architectures across a wide variety of business types and use cases. We have
compiled customer experiences to develop this set of best practices to help you develop criteria for
evaluating your data warehouse workload.

Overview of the process


Conducting a proof of concept is a three-step process:

1. Identify the goals of the proof of concept – you can work backward from your business requirements
and success criteria, and translate them into a technical proof of concept project plan.

2. Set up the proof of concept environment – most of the setup process takes just a few clicks to
create your resources. Within minutes, you can have a data warehouse environment ready with data
loaded.
3. Complete the proof of concept project plan to ensure that the goals are met.

In the following sections, we go into the details of each step.

Identify the business goals and success criteria


Identifying the goals of the proof of concept plays a critical role in determining what you want to
measure as part of the evaluation process. The evaluation criteria should include the current scaling
challenges, enhancements to improve your customer's experience of the data warehouse, and methods
of addressing your current operational pain points. You can use the following questions to identify the
goals of the proof of concept:

• What are your goals for scaling your data warehouse?


• What are the specific service-level agreements whose terms you want to improve?
• What new datasets do you need to include in your data warehouse?
• What are the business-critical SQL queries that you need to test and measure? Make sure to include
the full range of SQL complexities, such as the different types of queries (for example, select, insert,
update, and delete).
• What are the general types of workloads you plan to test? Examples might include extract-transform-
load (ETL) workloads, reporting queries, and batch extracts.

After you have answered these questions, you should be able to establish SMART goals and success
criteria for building your proof of concept. For information about setting goals, see SMART criteria in
Wikipedia.

Set up your proof of concept


Because Amazon Redshift eliminates the hardware provisioning, networking, and software installation
required for an on-premises data warehouse, trying it with your own dataset has never been easier. Many
of the sizing decisions and estimations that used to be required are now simply a click away. You can
flexibly resize your cluster or adjust the ratio of storage versus compute.

Broadly, setting up the Amazon Redshift proof of concept environment is a two-step process. It involves
the launching of a data warehouse and then the conversion of the schema and datasets for evaluation.

Choose a starting cluster size


You can choose the node type and number of nodes using the Amazon Redshift console. We recommend
that you also test resizing the cluster as part of your proof of concept plan. To get the initial sizing for
your cluster, take the following steps:

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. On the navigation pane, choose Create cluster to open the configuration page.
3. For Cluster identifier, enter a name for your cluster.
4. Choose one of the following methods to size your cluster:
Note
The following step describes an Amazon Redshift console that is running in an AWS Region
that supports RA3 node types. For a list of AWS Regions that support RA3 node types, see
Overview of RA3 node types in the Amazon Redshift Cluster Management Guide.

• If your AWS Region supports RA3 node types, choose either Production or Free trial to answer the
question What are you planning to use this cluster for?

If your organization is eligible, you might be able to create a cluster under the Amazon Redshift free
trial program. To do this, choose Free trial to create a configuration with the dc2.large node type.
For more information about choosing a free trial, see Amazon Redshift free trial.
• If you don't know how large to size your cluster, choose Help me choose. Doing this starts a sizing
calculator that asks you questions about the size and query characteristics of the data that you plan
to store in your data warehouse.

If you know the required size of your cluster (that is, the node type and number of nodes), choose
I'll choose. Then choose the Node type and number of Nodes to size your cluster for the proof of
concept.
5. After you enter all required cluster properties, choose Create cluster to launch your data warehouse.

For more details about creating clusters with the Amazon Redshift console, see Creating a cluster in
the Amazon Redshift Cluster Management Guide.

Convert the schema and set up the datasets for the proof of
concept
If you don't have an existing data warehouse, skip this section and see Amazon Redshift Getting Started
Guide. Amazon Redshift Getting Started Guide provides a tutorial to create a cluster and examples of
setting up data in Amazon Redshift.

When migrating from your existing data warehouse, you can convert schema, code, and data using the
AWS Schema Conversion Tool and the AWS Database Migration Service. Your choice of tools depends on
the source of your data and optional ongoing replications. For more information, see What Is the AWS
Schema Conversion Tool? in the AWS Schema Conversion Tool User Guide and What Is AWS Database
Migration Service? in the AWS Database Migration Service User Guide. The following can help you set up
your data in Amazon Redshift:

• Migrate Your Data Warehouse to Amazon Redshift Using the AWS Schema Conversion Tool – this blog
post provides an overview on how you can use the AWS SCT data extractors to migrate your existing
data warehouse to Amazon Redshift. The AWS SCT tool can migrate your data from many legacy
platforms (such as Oracle, Greenplum, Netezza, Teradata, Microsoft SQL Server, or Vertica).
• Optionally, you can also use the AWS Database Migration Service for ongoing replications of changed
data from the source. For more information, see Using an Amazon Redshift Database as a Target for
AWS Database Migration Service in the AWS Database Migration Service User Guide.

Amazon Redshift is a relational database management system (RDBMS). As such, it can run many
types of data models including star schemas, snowflake schemas, data vault models, and simple, flat,
or normalized tables. After setting up your schemas in Amazon Redshift, you can take advantage of
massively parallel processing and columnar data storage for fast analytical queries out of the box. For
information about types of schemas, see star schema, snowflake schema, and data vault modeling in
Wikipedia.

Checklist for a complete evaluation


Make sure that a complete evaluation meets all your data warehouse needs. Consider including the
following items in your success criteria:

• Data load time – using the COPY command is a common way to test how long it takes to load data. For
more information, see Amazon Redshift best practices for loading data (p. 19).

• Throughput of the cluster – measuring queries per hour is a common way to determine throughput.
To do so, set up a test to run typical queries for your workload.
• Data security – you can easily encrypt data at rest and in transit with Amazon Redshift. You also
have a number of options for managing keys. Amazon Redshift also supports single sign-on (SSO)
integration. Amazon Redshift pricing includes built-in security, data compression, backup storage, and
data transfer.
• Third-party tools integration – you can use either a JDBC or ODBC connection to integrate with
business intelligence and other external tools.
• Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such
as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Amazon Kinesis. You can use this
integration when setting up and managing your data warehouse.
• Backups and snapshots – backups and snapshots are created automatically. You can also create a
point-in-time snapshot at any time or on a schedule. Try using a snapshot and creating a second
cluster as part of your evaluation. Evaluate if your development and testing organizations can use the
cluster.
• Resizing – your evaluation should include increasing the number or types of Amazon Redshift nodes.
Evaluate whether workload throughput before and after a resize keeps up with the variability in your
workload volume. For more information, see Resizing clusters in Amazon Redshift in the Amazon Redshift
Cluster Management Guide.
• Concurrency scaling – this feature helps you handle variability of traffic volume in your data
warehouse. With concurrency scaling, you can support virtually unlimited concurrent users and
concurrent queries, with consistently fast query performance. For more information, see Working with
concurrency scaling (p. 438).
• Automatic workload management (WLM) – prioritize your business critical queries over other queries
by using automatic WLM. Try setting up queues based on your workloads (for example, a queue for
ETL and a queue for reporting). Then enable automatic WLM to allocate the concurrency and memory
resources dynamically. For more information, see Implementing automatic WLM (p. 413).
• Amazon Redshift Advisor – the Advisor develops customized recommendations to increase
performance and optimize costs by analyzing your workload and usage metrics for your cluster. Sign in
to the Amazon Redshift console to view Advisor recommendations. For more information, see Working
with recommendations from Amazon Redshift Advisor (p. 24).
• Table design – Amazon Redshift provides great performance out of the box for most workloads.
When you create a table, the default sort key and distribution key is AUTO. For more information, see
Working with automatic table optimization (p. 37).
• Support – we strongly recommend that you evaluate AWS Support as part of your evaluation. Also,
make sure to talk to your account manager about your proof of concept. AWS can help with technical
guidance and credits for the proof of concept if you qualify. If you don't find the help you're looking
for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request support
for your Amazon Redshift proof-of-concept.
• Lake house integration – with built-in integration, try using the out-of-box Amazon Redshift
Spectrum feature. With Redshift Spectrum, you can extend the data warehouse into your data lake and
run queries against petabytes of data in Amazon S3 using your existing cluster. For more information,
see Querying external data using Amazon Redshift Spectrum (p. 232).

Develop a project plan for your evaluation


Some of the following techniques for creating query benchmarks might help support your Amazon
Redshift evaluation:

• Assemble a list of queries for each runtime category. Having a sufficient number (for example, 30 per
category) helps ensure that your evaluation reflects a real-world data warehouse implementation.
Add a unique identifier to associate each query that you include in your evaluation with one of the
categories you establish for your evaluation. You can then use these unique identifiers to determine
throughput from the system tables.

You can also create a query group to organize your evaluation queries. For example, if you have
established a "Reporting" category for your evaluation, you might create a coding system to tag your
evaluation queries with the word "Report." You can then identify individual queries within reporting as
R1, R2, and so on. The following example demonstrates this approach.

SELECT 'Reporting' AS query_category, 'R1' as query_id, * FROM customers;

SELECT query, datediff(seconds, starttime, endtime)
FROM stl_query
WHERE querytxt LIKE '%Reporting%'
and starttime >= '2018-04-15 00:00'
and endtime < '2018-04-15 23:59';

When you have associated a query with an evaluation category, you can use a unique identifier to
determine throughput from the system tables for each category.
• Test throughput with historical user or ETL queries that have a variety of runtimes in your existing data
warehouse. You might use a load testing utility, such as the open-source JMeter or a custom utility. If
so, make sure that your utility does the following:
• It can take the network transmission time into account.
• It evaluates execution time based on throughput of the internal system tables. For information
about how to do this, see Analyzing the query summary (p. 396).
• Identify all the various permutations that you plan to test during your evaluation. The following list
provides some common variables:
• Cluster size
• Node type
• Load testing duration
• Concurrency settings
• Reduce the cost of your proof of concept by pausing your cluster during off-hours and weekends.
When a cluster is paused, on-demand compute billing is suspended. To run tests on the cluster, resume
per-second billing. You can also create a schedule to pause and resume your cluster automatically.
For more information, see Pausing and resuming clusters in the Amazon Redshift Cluster Management
Guide.

At this stage, you're ready to complete your project plan and evaluate results.

Additional resources to help your evaluation


To help your Amazon Redshift evaluation, see the following:

• Service highlights and pricing – this product detail page provides the Amazon Redshift value
proposition, service highlights, and pricing.
• Amazon Redshift Getting Started Guide – this guide provides a tutorial of using Amazon Redshift to
create a sample cluster and work with sample data.
• Getting started with Amazon Redshift Spectrum (p. 234) – in this tutorial, you learn how to use
Redshift Spectrum to query data directly from files on Amazon S3.
• Amazon Redshift management overview – this topic in the Amazon Redshift Cluster Management Guide
provides an overview of Amazon Redshift.

• Optimize Amazon Redshift for performance with BI tools – consider integration with tools such as
Tableau, Power BI, and others.
• Amazon Redshift Advisor recommendations (p. 26) – contains explanations and details for each
Advisor recommendation.
• What's new in Amazon Redshift – announcements that help you keep track of new features and
enhancements.
• Improved speed and scalability – this blog post summarizes recent Amazon Redshift improvements.

Need help?
Make sure to talk to your account manager to let them know about your proof of concept. AWS can help
with technical guidance and credits for the proof of concept if you qualify. If you don't find the help you
are looking for, you can talk directly to the Amazon Redshift team. For help, submit the form at Request
support for your Amazon Redshift proof-of-concept.

Amazon Redshift best practices for designing tables

As you plan your database, certain key table design decisions heavily influence overall query
performance. These design choices also have a significant effect on storage requirements, which in
turn affects query performance by reducing the number of I/O operations and minimizing the memory
required to process queries.

In this section, you can find a summary of the most important design decisions and best
practices for optimizing query performance. Working with automatic table optimization (p. 37)
provides more detailed explanations and examples of table design options.

Topics
• Choose the best sort key (p. 16)
• Choose the best distribution style (p. 17)
• Let COPY choose compression encodings (p. 18)
• Define primary key and foreign key constraints (p. 18)
• Use the smallest possible column size (p. 18)
• Use date/time data types for date columns (p. 19)

Choose the best sort key


Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift
query optimizer uses sort order when it determines optimal query plans.
Note
When you use automatic table optimization, you don't need to choose the sort key of your table.
For more information, see Working with automatic table optimization (p. 37).

Some suggestions for the best approach follow:

• To have Amazon Redshift choose the appropriate sort order, specify AUTO for the sort key.
• If recent data is queried most frequently, specify the timestamp column as the leading column for the
sort key.

Queries are more efficient because they can skip entire blocks that fall outside the time range.
• If you do frequent range filtering or equality filtering on one column, specify that column as the sort
key.

Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks
the minimum and maximum column values stored on each block and can skip blocks that don't apply
to the predicate range.
• If you frequently join a table, specify the join column as both the sort key and the distribution key.

Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join.
Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of
the sort merge join.
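
Putting these suggestions into a brief sketch (the tables and columns are hypothetical), the first table lets Amazon Redshift choose the sort order and the second leads the sort key with a timestamp column:

create table web_events_auto (
    event_id  bigint,
    event_ts  timestamp,
    user_id   integer
)
sortkey auto;

create table web_events_by_time (
    event_id  bigint,
    event_ts  timestamp,
    user_id   integer
)
sortkey (event_ts);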

Choose the best distribution style


When you run a query, the query optimizer redistributes the rows to the compute nodes as needed to
perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the
impact of the redistribution step by locating the data where it needs to be before the query is run.
Note
When you use automatic table optimization, you don't need to choose the distribution style of
your table. For more information, see Working with automatic table optimization (p. 37).

Some suggestions for the best approach follow:

1. Distribute the fact table and one dimension table on their common columns.

Your fact table can have only one distribution key. Any tables that join on another key aren't
collocated with the fact table. Choose one dimension to collocate based on how frequently it is joined
and the size of the joining rows. Designate both the dimension table's primary key and the fact table's
corresponding foreign key as the DISTKEY.
2. Choose the largest dimension based on the size of the filtered dataset.

Only the rows that are used in the join need to be distributed, so consider the size of the dataset after
filtering, not the size of the table.
3. Choose a column with high cardinality in the filtered result set.

If you distribute a sales table on a date column, for example, you should probably get fairly even data
distribution, unless most of your sales are seasonal. However, if you commonly use a range-restricted
predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices
and the query workload is skewed.
4. Change some dimension tables to use ALL distribution.

If a dimension table cannot be collocated with the fact table or other important joining tables, you
can improve query performance significantly by distributing the entire table to all of the nodes. Using
ALL distribution multiplies storage space requirements and increases load times and maintenance
operations, so you should weigh all factors before choosing ALL distribution.

To have Amazon Redshift choose the appropriate distribution style, specify AUTO for the distribution
style.

For more information about choosing distribution styles, see Working with data distribution
styles (p. 50).
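
The following sketch illustrates the DISTKEY and ALL approaches described in this section; the fact and dimension tables are hypothetical:

-- collocate the fact table with its most frequently joined dimension
create table sales_fact (
    salesid   integer,
    listid    integer distkey,
    sellerid  integer,
    qtysold   smallint
);

-- replicate a small dimension that can't be collocated to every node
create table region_dim (
    regionid   smallint,
    regionname varchar(30)
)
diststyle all;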

Let COPY choose compression encodings


You can specify compression encodings when you create a table, but in most cases, automatic
compression produces the best results.

ENCODE AUTO is the default for tables. Amazon Redshift automatically manages compression encoding
for all columns in the table. If you specify compression encoding for any column in the table, the table
is no longer set to ENCODE AUTO. Amazon Redshift no longer automatically manages compression
encoding for all columns in the table. You can specify the ENCODE AUTO option for the table to enable
Amazon Redshift to automatically manage compression encoding for all columns in the table. For more
information, see CREATE TABLE (p. 708) and ALTER TABLE (p. 542).

The COPY command analyzes your data and applies compression encodings to an empty table
automatically as part of the load operation.

Automatic compression balances overall performance when choosing compression encodings. Range-
restricted scans might perform poorly if sort key columns are compressed much more highly than other
columns in the same query. As a result, automatic compression chooses a less efficient compression
encoding to keep the sort key columns balanced with other columns.

Suppose that your table's sort key is a date or timestamp and the table uses many large varchar columns.
In this case, you might get better performance by not compressing the sort key column at all. Run
the ANALYZE COMPRESSION (p. 566) command on the table, then use the encodings to create a new
table, but leave out the compression encoding for the sort key.
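
A sketch of that workflow follows; the table name and the encodings shown are illustrative, and the sort key column is left uncompressed with ENCODE RAW:

-- review the suggested encodings for an existing, loaded table
analyze compression lineitem_staging;

-- create the new table with those encodings, but leave the sort key column raw
create table lineitem (
    l_shipdate  date        encode raw,
    l_orderkey  bigint      encode az64,
    l_comment   varchar(44) encode zstd
)
sortkey (l_shipdate);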

There is a performance cost for automatic compression encoding, but only if the table is empty
and does not already have compression encoding. For short-lived tables and tables that you create
frequently, such as staging tables, load the table once with automatic compression or run the ANALYZE
COMPRESSION command. Then use those encodings to create new tables. You can add the encodings to
the CREATE TABLE statement, or use CREATE TABLE LIKE to create a new table with the same encoding.

For more information, see Loading tables with automatic compression (p. 89).

Define primary key and foreign key constraints


Define primary key and foreign key constraints between tables wherever appropriate. Even though they
are informational only, the query optimizer uses those constraints to generate more efficient query
plans.

Do not define primary key and foreign key constraints unless your application enforces the constraints.
Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.
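
A minimal sketch with hypothetical tables; the constraints are informational only and are not enforced by Amazon Redshift:

create table customers (
    customer_id integer primary key,
    city        varchar(30)
);

create table orders (
    order_id    integer primary key,
    customer_id integer references customers(customer_id),
    order_total decimal(10,2)
);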

See Defining table constraints (p. 64) for additional information about how Amazon Redshift uses
constraints.

Use the smallest possible column size


Don't make it a practice to use the maximum column size for convenience.

Instead, consider the largest values you are likely to store in a VARCHAR column, for example, and size
your columns accordingly. Because Amazon Redshift compresses column data very effectively, creating
columns much larger than necessary has minimal impact on the size of data tables. During processing
for complex queries, however, intermediate query results might need to be stored in temporary tables.
Because temporary tables are not compressed, unnecessarily large columns consume excessive memory
and temporary disk space, which can affect query performance.
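
For illustration only (hypothetical tables), compare a definition sized for convenience with one sized to the data that is actually stored:

-- oversized: wastes memory and temporary disk space for intermediate results
create table comments_loose (comment_text varchar(65535));

-- sized to the data: the same rows, far less memory during complex queries
create table comments_tight (comment_text varchar(500));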

Use date/time data types for date columns


Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which
results in better query performance. Use the DATE or TIMESTAMP data type, depending on the resolution
you need, rather than a character type when storing date/time information. For more information, see
Datetime types (p. 490).
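
As a quick sketch (hypothetical table), storing dates in a native DATE column keeps range filters and date arithmetic efficient:

create table page_views (
    view_day  date,      -- preferred over a character type such as varchar(10)
    page_id   integer
);

select count(*)
from page_views
where view_day between '2008-01-01' and '2008-01-31';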

Amazon Redshift best practices for loading data


Topics
• Take the loading data tutorial (p. 19)
• Use a COPY command to load data (p. 19)
• Use a single COPY command to load from multiple files (p. 19)
• Split your load data (p. 20)
• Compress your data files (p. 20)
• Verify data files before and after a load (p. 20)
• Use a multi-row insert (p. 20)
• Use a bulk insert (p. 21)
• Load data in sort key order (p. 21)
• Load data in sequential blocks (p. 21)
• Use time-series tables (p. 21)
• Use a staging table to perform a merge (upsert) (p. 22)
• Schedule around maintenance windows (p. 22)

Loading very large datasets can take a long time and consume a lot of computing resources. How your
data is loaded can also affect query performance. This section presents best practices for loading data
efficiently using COPY commands, bulk inserts, and staging tables.

Take the loading data tutorial


Tutorial: Loading data from Amazon S3 (p. 122) walks you beginning to end through the steps to
upload data to an Amazon S3 bucket and then use the COPY command to load the data into your tables.
The tutorial includes help with troubleshooting load errors and compares the performance difference
between loading from a single file and loading from multiple files.

Use a COPY command to load data


The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or
multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than
using INSERT statements, and stores the data more effectively as well.

For more information about using the COPY command, see Loading data from Amazon S3 (p. 68) and
Loading data from an Amazon DynamoDB table (p. 86).

Use a single COPY command to load from multiple files

Amazon Redshift can automatically load in parallel from multiple compressed data files.

However, if you use multiple concurrent COPY commands to load one table from multiple files, Amazon
Redshift is forced to perform a serialized load. This type of load is much slower and requires a VACUUM
process at the end if the table has a sort column defined. For more information about using COPY to
load data in parallel, see Loading data from Amazon S3 (p. 68).

Split your load data


When you use the COPY command to load compressed data from multiple files, the data loads in parallel.
This divides the workload among the nodes in your cluster. When you load all the data from a single
large, compressed file, Amazon Redshift is forced to perform a serialized load, which is much slower.

In contrast, when you load delimited data from a large, uncompressed file, Amazon Redshift makes
use of multiple slices. These slices work in parallel, automatically. This provides fast load performance.
Specifically, when Amazon Redshift loads uncompressed, delimited data, data is split into ranges and
handled by slices in each node.

If you intend to load data from a large, compressed file, we recommend that you split your data into
smaller files that are about equal size, from 1 MB to 1 GB after compression. For optimum parallelism,
the ideal file size is 1–125 MB after compression. Make the number of files a multiple of the number of
slices in your cluster. For more information about how to split your data into files and examples of using
COPY to load data, see Loading data from Amazon S3 (p. 68).
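
As a sketch, assuming the data has been split into files that share a common key prefix such as venue.txt.1 through venue.txt.4 (hypothetical names), a single COPY that references the prefix loads all of the files in parallel:

copy venue
from 's3://mybucket/load/venue.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';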

Compress your data files


When you want to compress large load files, we recommend that you use gzip, lzop, bzip2, or Zstandard
to compress them and split the data into multiple smaller files.

Specify the GZIP, LZOP, BZIP2, or ZSTD option with the COPY command. This example loads the TIME
table from a pipe-delimited lzop file.

copy time
from 's3://mybucket/data/timerows.lzo'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
lzop
delimiter '|';

There are instances when you don't have to split uncompressed data files. For more information
about splitting your data and examples of using COPY to load data, see Loading data from Amazon
S3 (p. 68).

Verify data files before and after a load


When you load data from Amazon S3, first upload your files to your Amazon S3 bucket, then verify that
the bucket contains all the correct files, and only those files. For more information, see Verifying that the
correct files are present in your bucket (p. 71).

After the load operation is complete, query the STL_LOAD_COMMITS (p. 1304) system table to
verify that the expected files were loaded. For more information, see Verifying that the data loaded
correctly (p. 88).
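
As one way to perform this check (a sketch that uses the system table named above), you can list the files committed by the most recent COPY in your session:

select query, trim(filename) as file, curtime, status
from stl_load_commits
where query = pg_last_copy_id();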

Use a multi-row insert


If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever
possible. Data compression is inefficient when you add data only one row or a few rows at a time.

Multi-row inserts improve performance by batching up a series of inserts. The following example inserts
three rows into a four-column table using a single INSERT statement. This is still a small insert, shown
simply to illustrate the syntax of a multi-row insert.

insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);

For more details and examples, see INSERT (p. 780).

Use a bulk insert


Use a bulk insert operation with a SELECT clause for high-performance data insertion.

Use the INSERT (p. 780) and CREATE TABLE AS (p. 723) commands when you need to move data or a
subset of data from one table into another.

For example, the following INSERT statement selects all of the rows from the CATEGORY table and
inserts them into the CATEGORY_STAGE table.

insert into category_stage
(select * from category);

The following example creates CATEGORY_STAGE as a copy of CATEGORY and inserts all of the rows in
CATEGORY into CATEGORY_STAGE.

create table category_stage as
select * from category;

Load data in sort key order


Load your data in sort key order to avoid needing to vacuum.

If each batch of new data follows the existing rows in your table, your data is properly stored in sort
order, and you don't need to run a vacuum. You don't need to presort the rows in each load because
COPY sorts each batch of incoming data as it loads.

For example, suppose that you load data every day based on the current day's activity. If your sort key is
a timestamp column, your data is stored in sort order. This order occurs because the current day's data is
always appended at the end of the previous day's data. For more information, see Loading your data in
sort key order (p. 115).

Load data in sequential blocks


If you need to add a large quantity of data, load the data in sequential blocks according to sort order to
eliminate the need to vacuum.

For example, suppose that you need to load a table with events from January 2017 to December 2017.
Assuming each month is in a single file, load the rows for January, then February, and so on. Your
table is completely sorted when your load completes, and you don't need to run a vacuum. For more
information, see Use time-series tables (p. 21).

When loading very large datasets, the space required to sort might exceed the total available space. By
loading data in smaller blocks, you use much less intermediate sort space during each load. In addition,
loading smaller blocks makes it easier to restart if the COPY fails and is rolled back.

Use time-series tables


If your data has a fixed retention period, you can organize your data as a sequence of time-series tables.
In such a sequence, each table is identical but contains data for different time ranges.

You can easily remove old data simply by running a DROP TABLE command on the corresponding tables.
This approach is much faster than running a large-scale DELETE process and saves you from having to
run a subsequent VACUUM process to reclaim space. To hide the fact that the data is stored in different
tables, you can create a UNION ALL view. When you delete old data, simply redefine your UNION ALL view
to remove the dropped tables. Similarly, as you load new time periods into new tables, add the new
tables to the view. To signal the optimizer to skip the scan on tables that don't match the query filter,
your view definition filters for the date range that corresponds to each table.

Avoid having too many tables in the UNION ALL view. Each additional table adds a small processing
time to the query. Tables don't need to use the same time frame. For example, you might have tables for
differing time periods, such as daily, monthly, and yearly.

If you use time-series tables with a timestamp column for the sort key, you effectively load your data in
sort key order. Doing this eliminates the need to vacuum to re-sort the data. For more information, see
Loading your data in sort key order (p. 115).
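
A minimal sketch of the pattern follows; the monthly tables and date filters are hypothetical, and the view is redefined before the oldest table is dropped:

create view sales_all as
select * from sales_2017_01 where saletime < '2017-02-01'
union all
select * from sales_2017_02 where saletime >= '2017-02-01' and saletime < '2017-03-01';

-- retire the oldest month: remove it from the view first, then drop its table
create or replace view sales_all as
select * from sales_2017_02 where saletime >= '2017-02-01' and saletime < '2017-03-01';

drop table sales_2017_01;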

Use a staging table to perform a merge (upsert)


You can efficiently update and insert new data by loading your data into a staging table first.

Amazon Redshift doesn't support a single merge statement (update or insert, also known as an upsert)
to insert and update data from a single data source. However, you can effectively perform a merge
operation. To do so, load your data into a staging table and then join the staging table with your target
table for an UPDATE statement and an INSERT statement. For instructions, see Updating and inserting
new data (p. 96).
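
The following sketch outlines one way to perform the merge; the table names, join column, and Amazon S3 path are hypothetical:

begin transaction;

-- stage the incoming data
copy sales_staging
from 's3://mybucket/new_sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

-- update rows that already exist in the target table
update sales
set qtysold = s.qtysold, pricepaid = s.pricepaid
from sales_staging s
where sales.salesid = s.salesid;

-- insert rows that are new
insert into sales
select s.*
from sales_staging s
left join sales t on s.salesid = t.salesid
where t.salesid is null;

end transaction;

-- TRUNCATE commits, so run it after the transaction completes
truncate sales_staging;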

Schedule around maintenance windows


If a scheduled maintenance occurs while a query is running, the query is terminated and rolled back
and you need to restart it. Schedule long-running operations, such as large data loads or VACUUM
operations, to avoid maintenance windows. You can also minimize the risk, and make restarts easier
when they are needed, by performing data loads in smaller increments and managing the size of your
VACUUM operations. For more information, see Load data in sequential blocks (p. 21) and Vacuuming
tables (p. 108).

Amazon Redshift best practices for designing queries

To maximize query performance, follow these recommendations when creating queries:

• Design tables according to best practices to provide a solid foundation for query performance. For
more information, see Amazon Redshift best practices for designing tables (p. 16).
• Avoid using select *. Include only the columns you specifically need.
• Use a CASE expression (p. 914) to perform complex aggregations instead of selecting from the same
table multiple times. (A brief sketch follows this list.)
• Don't use cross-joins unless absolutely necessary. These joins without a join condition result in the
Cartesian product of two tables. Cross-joins are typically run as nested-loop joins, which are the
slowest of the possible join types.
• Use subqueries in cases where one table in the query is used only for predicate conditions and
the subquery returns a small number of rows (less than about 200). The following example uses a
subquery to avoid joining the LISTING table.

select sum(sales.qtysold)
from sales
where salesid in (select listid from listing where listtime > '2008-12-26');

• Use predicates to restrict the dataset as much as possible.


• In the predicate, use the least expensive operators that you can. Comparison condition (p. 509)
operators are preferable to LIKE (p. 515) operators. LIKE operators are still preferable to SIMILAR
TO (p. 518) or POSIX operators (p. 520).
• Avoid using functions in query predicates. Using them can drive up the cost of the query by requiring
large numbers of rows to resolve the intermediate steps of the query.
• If possible, use a WHERE clause to restrict the dataset. The query planner can then use row order to
help determine which records match the criteria, so it can skip scanning large numbers of disk blocks.
Without this, the query execution engine must scan participating columns entirely.
• Add predicates to filter tables that participate in joins, even if the predicates apply the same filters.
The query returns the same result set, but Amazon Redshift is able to filter the join tables before the
scan step and can then efficiently skip scanning blocks from those tables. Redundant filters aren't
needed if you filter on a column that's used in the join condition.

For example, suppose that you want to join SALES and LISTING to find ticket sales for tickets listed
after December 1, grouped by seller. Both tables are sorted by date. The following query joins the tables
on their common key and filters for listing.listtime values greater than December 1.

select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
group by 1 order by 1;

The WHERE clause doesn't include a predicate for sales.saletime, so the execution engine is forced
to scan the entire SALES table. If you know the filter would result in fewer rows participating in the
join, then add that filter as well. The following example cuts execution time significantly.

select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
and sales.saletime > '2008-12-01'
group by 1 order by 1;

• Use sort keys in the GROUP BY clause so the query planner can use more efficient aggregation. A
query might qualify for one-phase aggregation when its GROUP BY list contains only sort key columns,
one of which is also the distribution key. The sort key columns in the GROUP BY list must include the
first sort key, then other sort keys that you want to use in sort key order. For example, it is valid to use
the first sort key, the first and second sort keys, the first, second, and third sort keys, and so on. It is
not valid to use the first and third sort keys.

You can confirm the use of one-phase aggregation by running the EXPLAIN (p. 762) command and
looking for XN GroupAggregate in the aggregation step of the query.
• If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same
order in both. That is, use the approach just following.

group by a, b, c
order by a, b, c

Don't use the following approach.

group by b, c, a
order by a, b, c
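
As a sketch of the CASE expression guideline earlier in this list (the date boundary is illustrative), a single scan with conditional aggregation can replace two separate queries against the same table:

select
    sum(case when saletime <  '2008-07-01' then qtysold else 0 end) as first_half_qty,
    sum(case when saletime >= '2008-07-01' then qtysold else 0 end) as second_half_qty
from sales;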

Working with recommendations from Amazon Redshift Advisor

To help you improve the performance and decrease the operating costs for your Amazon Redshift
cluster, Amazon Redshift Advisor offers you specific recommendations about changes to make. Advisor
develops its customized recommendations by analyzing performance and usage metrics for your cluster.
These tailored recommendations relate to operations and cluster settings. To help you prioritize your
optimizations, Advisor ranks recommendations by order of impact.

Advisor bases its recommendations on observations regarding performance statistics or operations data.
Advisor develops observations by running tests on your clusters to determine if a test value is within
a specified range. If the test result is outside of that range, Advisor generates an observation for your
cluster. At the same time, Advisor creates a recommendation about how to bring the observed value
back into the best-practice range. Advisor only displays recommendations that should have a significant
impact on performance and operations. When Advisor determines that a recommendation has been
addressed, it removes it from your recommendation list.

For example, suppose that your data warehouse contains a large number of uncompressed table
columns. In this case, you can save on cluster storage costs by rebuilding tables using the ENCODE
parameter to specify column compression. In another example, suppose that Advisor observes that your
cluster contains a significant amount of uncompressed table data. In this case, it provides you
with the SQL code block to find the table columns that are candidates for compression and resources
that describe how to compress those columns.

Amazon Redshift Regions


The Amazon Redshift Advisor feature is available only in the following AWS Regions:

• US East (N. Virginia) Region (us-east-1)


• US East (Ohio) Region (us-east-2)
• US West (N. California) Region (us-west-1)
• US West (Oregon) Region (us-west-2)
• Asia Pacific (Hong Kong) Region (ap-east-1)
• Asia Pacific (Mumbai) Region (ap-south-1)
• Asia Pacific (Seoul) Region (ap-northeast-2)
• Asia Pacific (Singapore) Region (ap-southeast-1)
• Asia Pacific (Sydney) Region (ap-southeast-2)
• Asia Pacific (Tokyo) Region (ap-northeast-1)
• Canada (Central) Region (ca-central-1)
• China (Beijing) Region (cn-north-1)
• China (Ningxia) Region (cn-northwest-1)
• Europe (Frankfurt) Region (eu-central-1)
• Europe (Ireland) Region (eu-west-1)
• Europe (London) Region (eu-west-2)
• Europe (Paris) Region (eu-west-3)
• Europe (Stockholm) Region (eu-north-1)
• Middle East (Bahrain) Region (me-south-1)
• South America (São Paulo) Region (sa-east-1)

Topics
• Viewing Amazon Redshift Advisor recommendations on the console (p. 25)
• Amazon Redshift Advisor recommendations (p. 26)

Viewing Amazon Redshift Advisor recommendations on the console

You can view Amazon Redshift Advisor analysis results and recommendations on the AWS Management
Console.
Note
A new console is available for Amazon Redshift. Choose either the New console or the Original
console instructions based on the console that you are using. The New console instructions are
open by default.

New console
To view Amazon Redshift Advisor recommendations on the console

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. On the navigation menu, choose ADVISOR.
3. Expand each recommendation to see more details. On this page, you can sort and group
recommendations.

Original console
To view Amazon Redshift Advisor recommendations on the console

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.
2. In the navigation pane, choose Advisor.
3. Choose the cluster that you want to get recommendations for.

4. Expand each recommendation to see more details.

Amazon Redshift Advisor recommendations


Amazon Redshift Advisor offers recommendations about how to optimize your Amazon Redshift
cluster to increase performance and save on operating costs. You can find explanations for each
recommendation in the console, as described preceding. You can find further details on these
recommendations in the following sections.

Topics
• Compress Amazon S3 file objects loaded by COPY (p. 26)
• Isolate multiple active databases (p. 27)
• Reallocate workload management (WLM) memory (p. 28)
• Skip compression analysis during COPY (p. 29)
• Split Amazon S3 objects loaded by COPY (p. 30)
• Update table statistics (p. 31)
• Enable short query acceleration (p. 32)
• Alter distribution keys on tables (p. 33)
• Alter sort keys on tables (p. 34)
• Alter compression encodings on columns (p. 34)
• Data type recommendations (p. 35)

Compress Amazon S3 file objects loaded by COPY


The COPY command takes advantage of the massively parallel processing (MPP) architecture in Amazon
Redshift to read and load data in parallel. It can read files from Amazon S3, DynamoDB tables, and text
output from one or more remote hosts.

When loading large amounts of data, we strongly recommend using the COPY command to load
compressed data files from S3. Compressing large datasets saves time uploading the files to S3. COPY
can also speed up the load process by uncompressing the files as they are read.

Analysis

Long-running COPY commands that load large uncompressed datasets often have an opportunity for
considerable performance improvement. The Advisor analysis identifies COPY commands that load large
uncompressed datasets. In such a case, Advisor generates a recommendation to implement compression
on the source files in S3.

Recommendation

Ensure that each COPY that loads a significant amount of data, or runs for a significant duration, ingests
compressed data objects from S3. You can identify the COPY commands that load large uncompressed
datasets from S3 by running the following SQL command as a superuser.

SELECT
wq.userid, query, exec_start_time AS starttime, COUNT(*) num_files,
ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,
ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
SUBSTRING(querytxt,1,60) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
AND POSITION('COPY ANALYZE' IN querytxt) = 0
AND aborted = 0 AND final_state='Completed'
GROUP BY 1, 2, 3, 7
HAVING SUM(transfer_size) = SUM(data_size)
AND SUM(transfer_size)/(1024*1024) >= 5
ORDER BY 6 DESC, 5 DESC;

If the staged data remains in S3 after you load it, which is common in data lake architectures, storing this
data in a compressed form can reduce your storage costs.

Implementation tips

• The ideal object size is 1–128 MB after compression.


• You can compress files with gzip, lzop, or bzip2 format.

Isolate multiple active databases


As a best practice, we recommend isolating databases in Amazon Redshift from one another. Queries
run in a specific database and can't access data from any other database on the cluster. However, the
queries that you run in all databases of a cluster share the same underlying cluster storage space and
compute resources. When a single cluster contains multiple active databases, their workloads are usually
unrelated.

Analysis

The Advisor analysis reviews all databases on the cluster for active workloads running at the same time.
If there are active workloads running at the same time, Advisor generates a recommendation to consider
migrating databases to separate Amazon Redshift clusters.

Recommendation

Consider moving each actively queried database to a separate dedicated cluster. Using a separate cluster
can reduce resource contention and improve query performance. It can do so because it enables you
to set the size for each cluster for the storage, cost, and performance needs of each workload. Also,
unrelated workloads often benefit from different workload management configurations.

To identify which databases are actively used, you can run this SQL command as a superuser.

SELECT database,
COUNT(*) as num_queries,
AVG(DATEDIFF(sec,starttime,endtime)) avg_duration,
MIN(starttime) as oldest_ts,
MAX(endtime) as latest_ts
FROM stl_query
WHERE userid > 1
GROUP BY database;

Implementation tips

• Because a user must connect to each database specifically, and queries can only access a single
database, moving databases to separate clusters has minimal impact for users.
• One option to move a database is to take the following steps:
1. Temporarily restore a snapshot of the current cluster to a cluster of the same size.
2. Delete all databases from the new cluster except the target database to be moved.
3. Resize the cluster to an appropriate node type and count for the database's workload.

Reallocate workload management (WLM) memory


Amazon Redshift routes user queries to query queues for processing (see Implementing manual WLM (p. 419)).
Workload management (WLM) defines how those queries are routed to the queues. Amazon Redshift allocates each
queue a portion of the cluster's available memory. A queue's memory is divided among the queue's query
slots.

When a queue is configured with more slots than the workload requires, the memory allocated to these
unused slots goes underutilized. Reducing the configured slots to match the peak workload requirements
redistributes the underutilized memory to active slots, and can result in improved query performance.

Analysis

The Advisor analysis reviews workload concurrency requirements to identify query queues with unused
slots. Advisor generates a recommendation to reduce the number of slots in a queue when it finds the
following:

• A queue with slots that are completely inactive throughout the analysis
• A queue with more than four slots that had at least two inactive slots throughout the analysis

Recommendation

Reducing the configured slots to match peak workload requirements redistributes underutilized memory
to active slots. Consider reducing the configured slot count for queues where the slots have never been
fully utilized. To identify these queues, you can compare the peak hourly slot requirements for each
queue by running the following SQL command as a superuser.

WITH
generate_dt_series AS (select sysdate - (n * interval '5 second') as dt from (select
row_number() over () as n from stl_scan limit 17280)),
apex AS (
SELECT iq.dt, iq.service_class, iq.num_query_tasks, count(iq.slot_count) as
service_class_queries, sum(iq.slot_count) as service_class_slots
FROM
(select gds.dt, wq.service_class, wscc.num_query_tasks, wq.slot_count
FROM stl_wlm_query wq
JOIN stv_wlm_service_class_config wscc ON (wscc.service_class = wq.service_class
AND wscc.service_class > 5)
JOIN generate_dt_series gds ON (wq.service_class_start_time <= gds.dt AND
wq.service_class_end_time > gds.dt)
WHERE wq.userid > 1 AND wq.service_class > 5) iq
GROUP BY iq.dt, iq.service_class, iq.num_query_tasks),
maxes as (SELECT apex.service_class, trunc(apex.dt) as d, date_part(h,apex.dt) as
dt_h, max(service_class_slots) max_service_class_slots
from apex group by apex.service_class, apex.dt, date_part(h,apex.dt))
SELECT apex.service_class - 5 AS queue, apex.service_class, apex.num_query_tasks AS
max_wlm_concurrency, maxes.d AS day, maxes.dt_h || ':00 - ' || maxes.dt_h || ':59' as
hour, MAX(apex.service_class_slots) as max_service_class_slots
FROM apex
JOIN maxes ON (apex.service_class = maxes.service_class AND apex.service_class_slots =
maxes.max_service_class_slots)
GROUP BY apex.service_class, apex.num_query_tasks, maxes.d, maxes.dt_h
ORDER BY apex.service_class, maxes.d, maxes.dt_h;

The max_service_class_slots column represents the maximum number of WLM query slots in the
query queue for that hour. If underutilized queues exist, implement the slot reduction optimization by
modifying a parameter group, as described in the Amazon Redshift Cluster Management Guide.

Implementation tips

• If your workload is highly variable in volume, make sure that the analysis captured a peak utilization
period. If it didn't, run the preceding SQL repeatedly to monitor peak concurrency requirements.
• For more details on interpreting the query results from the preceding SQL code, see the
wlm_apex_hourly.sql script on GitHub.

Skip compression analysis during COPY


When you use the COPY command to load data into an empty table whose columns don't have compression
encodings declared, Amazon Redshift analyzes the data and applies storage compression automatically. This
optimization ensures that data in your cluster is stored efficiently even when loaded by end users. The
analysis required to apply compression can require significant time.

Analysis

The Advisor analysis checks for COPY operations that were delayed by automatic compression analysis.
The analysis determines the compression encodings by sampling the data while it's being loaded. This
sampling is similar to that performed by the ANALYZE COMPRESSION (p. 566) command.

When you load data as part of a structured process, such as in an overnight extract, transform, load
(ETL) batch, you can define the compression beforehand. You can also optimize your table definitions to
permanently skip this phase without any negative impacts.

Recommendation

To improve COPY responsiveness by skipping the compression analysis phase, implement either of the
following two options:

• Use the column ENCODE parameter when creating any tables that you load using the COPY command.
• Turn off compression altogether by supplying the COMPUPDATE OFF parameter in the COPY
command.

The best solution is generally to use column encoding during table creation, because this approach also
maintains the benefit of storing compressed data on disk. You can use the ANALYZE COMPRESSION
command to suggest compression encodings, but you must recreate the table to apply these encodings.
To automate this process, you can use the AWSColumnEncodingUtility, found on GitHub.
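
For example, the following sketch shows both options together. The table definition, Amazon S3 path, and
IAM role are hypothetical placeholders, not values from your cluster.

-- Option 1: declare column encodings when creating the table.
create table stage_sales(
sale_id bigint encode az64,
sale_ts timestamp encode az64,
item_desc varchar(200) encode lzo);

-- Option 2: turn off compression analysis for the load.
copy stage_sales
from 's3://mybucket/sales/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
compupdate off;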

To identify recent COPY operations that triggered automatic compression analysis, run the following SQL
command.


WITH xids AS (
SELECT xid FROM stl_query WHERE userid>1 AND aborted=0
AND querytxt = 'analyze compression phase 1' GROUP BY xid
INTERSECT SELECT xid FROM stl_commit_stats WHERE node=-1)
SELECT a.userid, a.query, a.xid, a.starttime, b.complyze_sec,
a.copy_sec, a.copy_sql
FROM (SELECT q.userid, q.query, q.xid, date_trunc('s',q.starttime)
starttime, substring(querytxt,1,100) as copy_sql,
ROUND(datediff(ms,starttime,endtime)::numeric / 1000.0, 2) copy_sec
FROM stl_query q JOIN xids USING (xid)
WHERE (querytxt ilike 'copy %from%' OR querytxt ilike '% copy %from%')
AND querytxt not like 'COPY ANALYZE %') a
LEFT JOIN (SELECT xid,
ROUND(sum(datediff(ms,starttime,endtime))::numeric / 1000.0,2) complyze_sec
FROM stl_query q JOIN xids USING (xid)
WHERE (querytxt like 'COPY ANALYZE %'
OR querytxt like 'analyze compression phase %')
GROUP BY xid ) b ON a.xid = b.xid
WHERE b.complyze_sec IS NOT NULL ORDER BY a.copy_sql, a.starttime;

Implementation tips

• Ensure that all tables of significant size created during your ETL processes (for example, staging tables
and temporary tables) declare a compression encoding for all columns except the first sort key.
• Estimate the expected lifetime size of the table being loaded for each of the COPY commands
identified by the SQL command preceding. If you are confident that the table will remain extremely
small, turn off compression altogether with the COMPUPDATE OFF parameter. Otherwise, create the
table with explicit compression before loading it with the COPY command.

Split Amazon S3 objects loaded by COPY


The COPY command takes advantage of the massively parallel processing (MPP) architecture in
Amazon Redshift to read and load data from files on Amazon S3. The COPY command loads the data in
parallel from multiple files, dividing the workload among the nodes in your cluster. To achieve optimal
throughput, we strongly recommend that you divide your data into multiple files to take advantage of
parallel processing.

Analysis

The Advisor analysis identifies COPY commands that load large datasets contained in a small number of
files staged in S3. Long-running COPY commands that load large datasets from a few files often have
an opportunity for considerable performance improvement. When Advisor identifies that these COPY
commands are taking a significant amount of time, it creates a recommendation to increase parallelism
by splitting the data into additional files in S3.

Recommendation

In this case, we recommend the following actions, listed in priority order:

1. Optimize COPY commands that load fewer files than the number of cluster nodes.
2. Optimize COPY commands that load fewer files than the number of cluster slices.
3. Optimize COPY commands where the number of files is not a multiple of the number of cluster slices.

Certain COPY commands load a significant amount of data or run for a significant duration. For these
commands, we recommend that you load a number of data objects from S3 that is equivalent to a
multiple of the number of slices in the cluster. To identify how many S3 objects each COPY command has
loaded, run the following SQL code as a superuser.

SELECT
query, COUNT(*) num_files,
ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,
ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
SUBSTRING(querytxt,1,60) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
AND POSITION('COPY ANALYZE' IN querytxt) = 0
AND aborted = 0 AND final_state='Completed'
GROUP BY query, querytxt
HAVING (SUM(transfer_size)/(1024*1024))/COUNT(*) >= 2
ORDER BY CASE
WHEN COUNT(*) < (SELECT max(node)+1 FROM stv_slices) THEN 1
WHEN COUNT(*) < (SELECT COUNT(*) FROM stv_slices WHERE node=0) THEN 2
ELSE 2+((COUNT(*) % (SELECT COUNT(*) FROM stv_slices))/(SELECT COUNT(*)::DECIMAL FROM
stv_slices))
END, (SUM(transfer_size)/(1024.0*1024.0))/COUNT(*) DESC;

Implementation tips

• The number of slices in a node depends on the node size of the cluster. For more information about
the number of slices in the various node types, see Clusters and Nodes in Amazon Redshift in the
Amazon Redshift Cluster Management Guide.
• You can load multiple files by specifying a common prefix, or prefix key, for the set, or by explicitly
listing the files in a manifest file (see the sketch after this list). For more information about loading
files, see Loading data from compressed and uncompressed files (p. 68).
• Amazon Redshift doesn't take file size into account when dividing the workload. Split your load data
files so that the files are about equal size, between 1 MB and 1 GB after compression.
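
For example, the following sketch assumes the load data has been split into multiple gzip files that share
the hypothetical prefix s3://mybucket/sales/part-. One COPY command then loads all of the files in parallel.

copy sales
from 's3://mybucket/sales/part-'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip;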

Update table statistics


Amazon Redshift uses a cost-based query optimizer to choose the optimum execution plan for queries.
The cost estimates are based on table statistics gathered using the ANALYZE command. When statistics
are out of date or missing, the database might choose a less efficient plan for query execution, especially
for complex queries. Maintaining current statistics helps complex queries run in the shortest possible
time.

Analysis

The Advisor analysis tracks tables whose statistics are out-of-date or missing. It reviews table access
metadata associated with complex queries. If tables that are frequently accessed with complex patterns
are missing statistics, Advisor creates a critical recommendation to run ANALYZE. If tables that are
frequently accessed with complex patterns have out-of-date statistics, Advisor creates a suggested
recommendation to run ANALYZE.

Recommendation

Whenever table content changes significantly, update statistics with ANALYZE. We recommend running
ANALYZE whenever a significant number of new data rows are loaded into an existing table with COPY
or INSERT commands. We also recommend running ANALYZE whenever a significant number of rows are
modified using UPDATE or DELETE commands.

To identify tables with missing or out-of-date statistics, run the following SQL command as a superuser.
The results are ordered from largest to smallest table.

SELECT
ti.schema||'.'||ti."table" tablename,
ti.size table_size_mb,
ti.stats_off statistics_accuracy
FROM svv_table_info ti
WHERE ti.stats_off > 5.00
ORDER BY ti.size DESC;

Implementation tips

The default ANALYZE threshold is 10 percent. This default means that the ANALYZE command skips a
given table if fewer than 10 percent of the table's rows have changed since the last ANALYZE. As a result,
you might choose to issue ANALYZE commands at the end of each ETL process. Taking this approach
means that ANALYZE is often skipped but also ensures that ANALYZE runs when needed.
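
For example, an ETL batch might end with an ANALYZE of each table that it loaded; the table name here is a
placeholder.

-- Skipped automatically if fewer than 10 percent of rows changed since the last ANALYZE.
analyze sales;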

ANALYZE statistics have the most impact for columns that are used in joins (for example, JOIN tbl_a
ON col_b) or as predicates (for example, WHERE col_b = 'xyz'). By default, ANALYZE collects
statistics for all columns in the table specified. If needed, you can reduce the time required to run
ANALYZE by running ANALYZE only for the columns where it has the most impact. You can run the
following SQL command to identify columns used as predicates. You can also let Amazon Redshift
choose which columns to analyze by specifying ANALYZE PREDICATE COLUMNS.

WITH predicate_column_info as (
SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num, a.attname
as col_name,
CASE
WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||')
WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')
WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')
WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')
ELSE NULL::varchar
END AS pred_ts
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_namespace ns ON c.relnamespace = ns.oid
JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)
SELECT schema_name, table_name, col_num, col_name,
pred_ts NOT LIKE '2000-01-01%' AS is_predicate,
CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts,
'||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,
CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts,
'||',2))::timestamp ELSE NULL::timestamp END as last_analyze
FROM predicate_column_info;

For more information, see Analyzing tables (p. 103).

Enable short query acceleration


Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running
queries. SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait
in queues behind longer queries. SQA only prioritizes queries that are short-running and are in a user-
defined queue. With SQA, short-running queries begin running more quickly and users see results sooner.


If you enable SQA, you can reduce or eliminate workload management (WLM) queues that are dedicated
to running short queries. In addition, long-running queries don't need to contend with short queries
for slots in a queue, so you can configure your WLM queues to use fewer query slots. When you use
lower concurrency, query throughput is increased and overall system performance is improved for most
workloads. For more information, see Working with short query acceleration (p. 442).

Analysis

Advisor checks for workload patterns and reports the number of recent queries where SQA would reduce
latency and the daily queue time for SQA-eligible queries.

Recommendation

Modify the WLM configuration to enable SQA. Amazon Redshift uses a machine learning algorithm
to analyze each eligible query. Predictions improve as SQA learns from your query patterns. For more
information, see Configuring Workload Management.

When you enable SQA, WLM sets the maximum run time for short queries to dynamic by default. We
recommend keeping the dynamic setting for SQA maximum run time.

Implementation tips

To check whether SQA is enabled, run the following query. If the query returns a row, then SQA is
enabled.

select * from stv_wlm_service_class_config
where service_class = 14;

For more information, see Monitoring SQA (p. 443).

Alter distribution keys on tables


Amazon Redshift distributes table rows throughout the cluster according to the table distribution style.
Tables with KEY distribution require a column as the distribution key (DISTKEY). A table row is assigned
to a node slice of a cluster based on its DISTKEY column value.

An appropriate DISTKEY places a similar number of rows on each node slice and is frequently referenced
in join conditions. An optimized join occurs when tables are joined on their DISTKEY columns,
accelerating query performance.

Analysis

Advisor analyzes your cluster’s workload to identify the most appropriate distribution key for the tables
that can significantly benefit from a KEY distribution style.

Recommendation

Advisor provides ALTER TABLE (p. 542) statements that alter the DISTSTYLE and DISTKEY of a table
based on its analysis. To realize a significant performance benefit, make sure to implement all SQL
statements within a recommendation group.
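
A recommendation group generally resembles the following sketch, in which two joined tables are
redistributed on the same column. The table and column names are placeholders, not actual Advisor output.

alter table sales alter diststyle key distkey buyerid;

alter table web_sales alter diststyle key distkey buyerid;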

Redistributing a large table with ALTER TABLE consumes cluster resources and requires temporary table
locks at various times. Implement each recommendation group when other cluster workload is light.
For more details on optimizing table distribution properties, see the Amazon Redshift Engineering's
Advanced Table Design Playbook: Distribution Styles and Distribution Keys.

For more information about ALTER DISTSTYLE and DISTKEY, see ALTER TABLE (p. 542).


Note
If you don't see a recommendation, that doesn't necessarily mean that the current distribution
styles are the most appropriate. Advisor doesn't provide recommendations when there isn't
enough data or the expected benefit of redistribution is small.
Advisor recommendations apply to a particular table and don't necessarily apply to a table that
contains a column with the same name. Tables that share a column name can have different
characteristics for those columns unless data inside the tables is the same.
If you see recommendations for staging tables that are created or dropped by ETL jobs, modify
your ETL processes to use the Advisor recommended distribution keys.

Alter sort keys on tables


Amazon Redshift sorts table rows according to the table sort key (p. 62). The sorting of table rows is
based on the sort key column values.

Sorting a table on an appropriate sort key can accelerate performance of queries, especially those with
range-restricted predicates, by requiring fewer table blocks to be read from disk.

Analysis

Advisor analyzes your cluster's workload over several days to identify a beneficial sort key for your tables.

Recommendation

Advisor provides two groups of ALTER TABLE statements that alter the sort key of a table based on its
analysis:

• Statements that alter a table that currently doesn't have a sort key to add a COMPOUND sort key.
• Statements that alter a sort key from INTERLEAVED to COMPOUND or no sort key.

Using compound sort keys significantly reduces maintenance overhead. Tables with compound sort
keys don't need the expensive VACUUM REINDEX operations that are necessary for interleaved sorts.
In practice, compound sort keys are more effective than interleaved sort keys for the vast majority of
Amazon Redshift workloads. However, if a table is small, it's more efficient not to have a sort key to
avoid sort key storage overhead.

When you sort a large table with ALTER TABLE, the operation consumes cluster resources and requires table
locks at various times. Implement each recommendation when the cluster's workload is moderate. For more
details on optimizing table sort key configurations, see the Amazon Redshift Engineering's
Advanced Table Design Playbook: Compound and Interleaved Sort Keys.

For more information about ALTER SORTKEY, see ALTER TABLE (p. 542).
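
For example, the two groups of statements resemble the following sketch; the table and column names are
placeholders, not actual Advisor output. The first statement adds a compound sort key to a table that has
none, and the second removes an interleaved sort key.

alter table sales alter sortkey (saletime, sellerid);

alter table event alter sortkey none;
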
Note
If you don't see a recommendation for a table, that doesn't necessarily mean that the current
configuration is the best. Advisor doesn't provide recommendations when there isn't enough
data or the expected benefit of sorting is small.
Advisor recommendations apply to a particular table and don’t necessarily apply to a table that
contains a column with the same name and data type. Tables that share column names can have
different recommendations based on the data in the tables and the workload.

Alter compression encodings on columns


Compression is a column-level operation that reduces the size of data when it's stored. Compression is
used in Amazon Redshift to conserve storage space and improve query performance by reducing the
amount of disk I/O. We recommend an optimal compression encoding for each column based on its
data type and on query patterns. With optimal compression, queries can run more efficiently and the
database can take up minimal storage space.


Analysis

Advisor performs analysis of your cluster's workload and database schema continually to identify the
optimal compression encoding for each table column.

Recommendation

Advisor provides ALTER TABLE statements that change the compression encoding of particular columns,
based on its analysis.
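
Such a statement generally resembles the following sketch; the table, column, and target encoding are
placeholders, not actual Advisor output.

alter table sales alter column commission encode az64;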

Changing column compression encodings with ALTER TABLE (p. 542) consumes cluster resources and
requires table locks at various times. It's best to implement recommendations when the cluster workload
is light.

For reference, ALTER TABLE examples (p. 551) shows several statements that change the encoding for a
column.
Note
Advisor doesn't provide recommendations when there isn't enough data or the expected benefit
of changing the encoding is small.

Data type recommendations


Amazon Redshift has a library of SQL data types for various use cases. These include integer types like
INT and types to store characters, like VARCHAR. Redshift stores types in an optimized way to provide
fast access and good query performance. Also, Redshift provides functions for specific types, which you
can use to format or perform calculations on query results.

Analysis

Advisor performs analysis of your cluster's workload and database schema continually to identify
columns that can benefit significantly from a data type change.

Recommendation

Advisor provides an ALTER TABLE statement that adds a new column with the suggested data type. An
accompanying UPDATE statement copies data from the existing column to the new column. After you
create the new column and load the data, change your queries and ingestion scripts to access the new
column. Then leverage features and functions specialized to the new data type, found in SQL functions
reference (p. 878).
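
A sketch of that pattern follows; the table, column names, cast, and target data type are hypothetical, and
an actual recommendation supplies its own.

-- Add a column with the suggested data type, then copy the existing values into it.
alter table orders add column order_ts timestamp;

update orders set order_ts = order_dt::timestamp;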

Copying existing data to the new column can take time. We recommend that you implement each
advisor recommendation when the cluster’s workload is light. Reference the list of available data types at
Data types (p. 478).

Note that Advisor doesn't provide recommendations when there isn't enough data or the expected
benefit of changing the data type is small.


Tutorials for Amazon Redshift


Follow the steps in these tutorials to learn about Amazon Redshift features:

• Tutorial: Loading data from Amazon S3 (p. 122)


• Tutorial: Querying nested data with Amazon Redshift Spectrum (p. 278)
• Tutorial: Configuring manual workload management (WLM) queues (p. 425)
• Tutorial: Using spatial SQL functions with Amazon Redshift (p. 207)


Working with automatic table optimization
Automatic table optimization is a self-tuning capability that automatically optimizes the design
of tables by applying sort and distribution keys without the need for administrator intervention.
By using automation to tune the design of tables, you can get started more easily and get the
fastest performance quickly without needing to invest time to manually tune and implement table
optimizations.

Automatic table optimization continuously observes how queries interact with tables. It uses advanced
artificial intelligence methods to choose sort and distribution keys to optimize performance for the
cluster's workload. If Amazon Redshift determines that applying a key improves cluster performance,
tables are automatically altered within hours from the time the cluster was created, with minimal impact
to queries.

To take advantage of this automation, an Amazon Redshift administrator creates a new table, or alters
an existing table to enable it to use automatic optimization. Existing tables with a distribution style
or sort key of AUTO are already enabled for automation. When you run queries against those tables,
Amazon Redshift determines if a sort key or distribution key will improve performance. If so, then
Amazon Redshift automatically modifies the table without requiring administrator intervention. If
a minimum number of queries are run, optimizations are applied within hours of the cluster being
launched.

If Amazon Redshift determines that a distribution key improves the performance of queries, tables where
distribution style is AUTO can have their distribution style changed to KEY.

Topics
• Enabling automatic table optimization (p. 37)
• Removing automatic table optimization from a table (p. 38)
• Monitoring actions of automatic table optimization (p. 38)
• Working with column compression (p. 39)
• Working with data distribution styles (p. 50)
• Working with sort keys (p. 62)
• Defining table constraints (p. 64)

Enabling automatic table optimization


By default, tables created without explicitly defining sort keys or distribution keys are set to AUTO. At
the time of table creation, you can also explicitly set a sort or a distribution key manually. If you set the
sort or distribution key, then the table is not automatically managed.

To enable an existing table to be automatically optimized, use the ALTER statement options to change
the table to AUTO. You might choose to define automation for sort keys, but not for distribution keys
(and vice versa). If you run an ALTER statement to convert a table to be an automated table, existing sort
keys and distribution styles are preserved.

ALTER TABLE table_name ALTER SORTKEY AUTO;


ALTER TABLE table_name ALTER DISTSTYLE AUTO;

For more information, see ALTER TABLE (p. 542).

Initially, a table has no distribution key or sort key. The distribution style is set to either EVEN or ALL
depending on table size. As the table grows in size, Amazon Redshift applies the optimal distribution
keys and sort keys. Optimizations are applied within hours after a minimum number of queries are run.
When determining sort key optimizations, Amazon Redshift attempts to optimize the data blocks read
from disk during a table scan. When determining distribution style optimizations, Amazon Redshift tries
to optimize the number of bytes transferred between cluster nodes.

Removing automatic table optimization from a table
You can remove a table from automatic optimization. Removing a table from automation involves
selecting a sort key or distribution style. To change distribution style, specify a specific distribution style.

ALTER TABLE table_name ALTER DISTSTYLE EVEN;

ALTER TABLE table_name ALTER DISTSTYLE ALL;

ALTER TABLE table_name ALTER DISTSTYLE KEY DISTKEY c1;

To change a sort key, you can define a sort key or choose none.

ALTER TABLE table_name ALTER SORTKEY(c1, c2);

ALTER TABLE table_name ALTER SORTKEY NONE;

Monitoring actions of automatic table optimization
The system view SVV_ALTER_TABLE_RECOMMENDATIONS records the current Amazon Redshift Advisor
recommendations for tables. This view shows recommendations for all tables, those that are defined for
automatic optimization and those that aren't.

To view if a table is defined for automatic optimization, query the system view SVV_TABLE_INFO.
Entries appear only for tables visible in the current session's database. Recommendations are inserted
into the view twice per day starting within hours from the time the cluster was created. After a
recommendation is available, it's started within an hour. After a recommendation has been applied
(either by Amazon Redshift or by you), it no longer appears in the view.

The system view SVL_AUTO_WORKER_ACTION shows an audit log of all the actions taken by Amazon Redshift
and the previous state of the table.

The system view SVV_TABLE_INFO lists all of the tables in the system, along with a column to indicate
whether the sort key and distribution style of the table is set to AUTO.
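
For example, you can run queries such as the following to list pending recommendations and to check which
tables have their sort key and distribution style set to AUTO.

select * from svv_alter_table_recommendations;

select "table", diststyle, sortkey1
from svv_table_info;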


For more information about these system views, see System views (p. 1275).

Working with column compression


Compression is a column-level operation that reduces the size of data when it is stored. Compression
conserves storage space and reduces the size of data that is read from storage, which reduces the
amount of disk I/O and therefore improves query performance.

ENCODE AUTO is the default for tables. Amazon Redshift automatically manages compression encoding
for all columns in the table. You can specify the ENCODE AUTO option for the table to enable Amazon
Redshift to automatically manage compression encoding for all columns in the table. For more
information, see CREATE TABLE (p. 708) and ALTER TABLE (p. 542).

However, if you specify compression encoding for any column in the table, the table is no longer set
to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all
columns in the table.

You can apply a compression type, or encoding, to the columns in a table manually when you create the
table. Or you can use the COPY command to analyze and apply compression automatically. For more
information, see Let COPY choose compression encodings (p. 18). For details about applying automatic
compression, see Loading tables with automatic compression (p. 89).
Note
We strongly recommend using the COPY command to apply automatic compression.

You might choose to apply compression encodings manually if the new table shares the same data
characteristics as another table. Or you might do so if in testing you discover that the compression
encodings that are applied during automatic compression are not the best fit for your data. If you choose
to apply compression encodings manually, you can run the ANALYZE COMPRESSION (p. 566) command
against an already populated table and use the results to choose compression encodings.

To apply compression manually, you specify compression encodings for individual columns as part of the
CREATE TABLE statement. The syntax is as follows.

CREATE TABLE table_name (column_name data_type ENCODE encoding-type)[, ...]

Here, encoding-type is taken from the keyword table in the following section.

For example, the following statement creates a two-column table, PRODUCT. When data is loaded into
the table, the PRODUCT_ID column is not compressed, but the PRODUCT_NAME column is compressed,
using the byte dictionary encoding (BYTEDICT).

create table product(
product_id int encode raw,
product_name char(20) encode bytedict);

You can specify the encoding for a column when it is added to a table using the ALTER TABLE command.

ALTER TABLE table-name ADD [ COLUMN ] column_name column_type ENCODE encoding-type
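
For example, the following statement adds a hypothetical column to the PRODUCT table created earlier and
compresses it with LZO.

alter table product add column product_category varchar(50) encode lzo;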

Topics
• Compression encodings (p. 40)
• Testing compression encodings (p. 46)
• Example: Choosing compression encodings for the CUSTOMER table (p. 48)


Compression encodings
A compression encoding specifies the type of compression that is applied to a column of data values as
rows are added to a table.

ENCODE AUTO is the default for tables. Amazon Redshift automatically manages compression encoding
for all columns in the table. You can specify the ENCODE AUTO option for the table to enable Amazon
Redshift to automatically manage compression encoding for all columns in the table. For more
information, see CREATE TABLE (p. 708) and ALTER TABLE (p. 542).

However, if you specify compression encoding for any column in the table, the table is no longer set
to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all
columns in the table.

If you specify compression encoding for any column in the table or if you don't specify the ENCODE
AUTO option for the table, Amazon Redshift automatically assigns compression encoding to columns for
which you don't specify compression encoding as follows:

• Columns that are defined as sort keys are assigned RAW compression.
• Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW
compression.
• Columns that are defined as SMALLINT, INTEGER, BIGINT, DECIMAL, DATE, TIMESTAMP, or
TIMESTAMPTZ data types are assigned AZ64 compression.
• Columns that are defined as CHAR or VARCHAR data types are assigned LZO compression.

The following table identifies the supported compression encodings and the data types that support the
encoding.

Encoding type Keyword in CREATE TABLE and ALTER TABLE Data types

Raw (no compression) RAW All

AZ64 AZ64 SMALLINT, INTEGER, BIGINT,


DECIMAL, DATE, TIMESTAMP,
TIMESTAMPTZ

Byte dictionary BYTEDICT SMALLINT, INTEGER, BIGINT,


DECIMAL, REAL, DOUBLE PRECISION,
CHAR, VARCHAR, DATE, TIMESTAMP,
TIMESTAMPTZ

Delta DELTA SMALLINT, INT, BIGINT, DATE,


TIMESTAMP, DECIMAL
DELTA32K
INT, BIGINT, DATE, TIMESTAMP,
DECIMAL

LZO LZO SMALLINT, INTEGER, BIGINT,


DECIMAL, CHAR, VARCHAR, DATE,
TIMESTAMP, TIMESTAMPTZ, SUPER

Mostly MOSTLY8 SMALLINT, INT, BIGINT, DECIMAL

MOSTLY16 INT, BIGINT, DECIMAL

MOSTLY32 BIGINT, DECIMAL

Run-length RUNLENGTH SMALLINT, INTEGER, BIGINT,


DECIMAL, REAL, DOUBLE PRECISION,
BOOLEAN, CHAR, VARCHAR, DATE,
TIMESTAMP, TIMESTAMPTZ

Text TEXT255 VARCHAR only

TEXT32K VARCHAR only

Zstandard ZSTD SMALLINT, INTEGER, BIGINT,


DECIMAL, REAL, DOUBLE PRECISION,
BOOLEAN, CHAR, VARCHAR, DATE,
TIMESTAMP, TIMESTAMPTZ, SUPER

Raw encoding
Raw encoding is the default encoding for columns that are designated as sort keys and columns that are
defined as BOOLEAN, REAL, or DOUBLE PRECISION data types. With raw encoding, data is stored in raw,
uncompressed form.

AZ64 encoding
AZ64 is a proprietary compression encoding algorithm designed by Amazon to achieve a high
compression ratio and improved query processing. At its core, the AZ64 algorithm compresses smaller
groups of data values and uses single instruction, multiple data (SIMD) instructions for parallel
processing. Use AZ64 to achieve significant storage savings and high performance for numeric, date, and
time data types.

You can use AZ64 as the compression encoding when defining columns using CREATE TABLE and ALTER
TABLE statements with the following data types:

• SMALLINT
• INTEGER
• BIGINT
• DECIMAL
• DATE
• TIMESTAMP
• TIMESTAMPTZ

Byte-dictionary encoding
In byte dictionary encoding, a separate dictionary of unique values is created for each block of column
values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-
byte values that are stored as indexes to the original data values. If more than 256 values are stored in a
single block, the extra values are written into the block in raw, uncompressed form. The process repeats
for each disk block.

This encoding is very effective when a column contains a limited number of unique values. This encoding
is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding
is especially space-efficient if a CHAR column holds long character strings.


Note
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using
BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly
recommend using a different encoding, such as LZO, for VARCHAR columns.

Suppose that a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon
Redshift creates the dictionary and populates the COUNTRY column with the index value. The dictionary
contains the indexed unique values, and the table itself contains only the one-byte subscripts of the
corresponding values.
Note
Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column,
every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.

The following table represents the dictionary for the COUNTRY column.

Unique data value Dictionary index Size (fixed length, 30 bytes per
value)

England 0 30

United States of America 1 30

Venezuela 2 30

Sri Lanka 3 30

Argentina 4 30

Japan 5 30

Total 180

The following table represents the values in the COUNTRY column.

Original data value Original size (fixed Compressed value New size (bytes)
length, 30 bytes per (index)
value)

England 30 0 1

England 30 0 1

United States of 30 1 1
America

United States of 30 1 1
America

Venezuela 30 2 1

Sri Lanka 30 3 1

Argentina 30 4 1

Japan 30 5 1

Sri Lanka 30 3 1

Argentina 30 4 1

Total 300 10

The total compressed size in this example is calculated as follows: 6 different entries are stored in the
dictionary (6 * 30 = 180), and the table contains 10 1-byte compressed values, for a total of 190 bytes.

Delta encoding
Delta encodings are very useful for date time columns.

Delta encoding compresses data by recording the difference between values that follow each other in the
column. This difference is recorded in a separate dictionary for each block of column values on disk. (An
Amazon Redshift disk block occupies 1 MB.) For example, suppose that the column contains 10 integers
in sequence from 1 to 10. The first are stored as a 4-byte integer (plus a 1-byte flag). The next nine are
each stored as a byte with the value 1, indicating that it is one greater than the previous value.

Delta encoding comes in two variations:

• DELTA records the differences as 1-byte values (8-bit integers)


• DELTA32K records differences as 2-byte values (16-bit integers)

If most of the values in the column could be compressed by using a single byte, the 1-byte variation
is very effective. However, if the deltas are larger, this encoding, in the worst case, is somewhat less
effective than storing the uncompressed data. Similar logic applies to the 16-bit version.

If the difference between two values exceeds the 1-byte range (DELTA) or 2-byte range (DELTA32K), the
full original value is stored, with a leading 1-byte flag. The 1-byte range is from -127 to 127, and the 2-
byte range is from -32K to 32K.

The following table shows how a delta encoding works for a numeric column.

Original data Original size Difference (delta) Compressed value Compressed size
value (bytes) (bytes)

1 4 (none) 1 1+4 (flag + actual value)

5 4 4 4 1

50 4 45 45 1

200 4 150 150 1+4 (flag + actual value)

185 4 -15 -15 1

220 4 35 35 1

221 4 1 1 1

Totals 28 15


LZO encoding
LZO encoding provides a very high compression ratio with good performance. LZO encoding works
especially well for CHAR and VARCHAR columns that store very long character strings. They are
especially good for free-form text, such as product descriptions, user comments, or JSON strings.

Mostly encoding
Mostly encodings are useful when the data type for a column is larger than most of the stored values
require. By specifying a mostly encoding for this type of column, you can compress the majority of the
values in the column to a smaller standard storage size. The remaining values that cannot be compressed
are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to
8-bit storage.

In general, the mostly encodings work with the following data types:

• SMALLINT/INT2 (16-bit)
• INTEGER/INT (32-bit)
• BIGINT/INT8 (64-bit)
• DECIMAL/NUMERIC (64-bit)

Choose the appropriate variation of the mostly encoding to suit the size of the data type for the
column. For example, apply MOSTLY8 to a column that is defined as a 16-bit integer column. Applying
MOSTLY16 to a column with a 16-bit data type or MOSTLY32 to a column with a 32-bit data type is
disallowed.

Mostly encodings might be less effective than no compression when a relatively high number of the
values in the column cannot be compressed. Before applying one of these encodings to a column,
perform a check. Most of the values that you are going to load now (and are likely to load in the future)
should fit into the ranges shown in the following table.

Encoding Compressed storage size Range of values that can be compressed


(values outside the range are stored raw)

MOSTLY8 1 byte (8 bits) -128 to 127

MOSTLY16 2 bytes (16 bits) -32768 to 32767

MOSTLY32 4 bytes (32 bits) -2147483648 to +2147483647

Note
For decimal values, ignore the decimal point to determine whether the value fits into the range.
For example, 1,234.56 is treated as 123,456 and can be compressed in a MOSTLY32 column.

For example, the VENUEID column in the VENUE table is defined as a raw integer column, which means
that its values consume 4 bytes of storage. However, the current range of values in the column is 0 to
309. Therefore, recreating and reloading this table with MOSTLY16 encoding for VENUEID would reduce
the storage of every value in that column to 2 bytes.

If the VENUEID values referenced in another table were mostly in the range of 0 to 127, it might make
sense to encode that foreign-key column as MOSTLY8. Before making the choice, you have to run some
queries against the referencing table data to find out whether the values mostly fall into the 8-bit, 16-
bit, or 32-bit range.
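
A minimal sketch of such a check follows, assuming a referencing table such as EVENT with a VENUEID column;
adjust the table and column names for your own schema.

select
min(venueid) as min_venueid,
max(venueid) as max_venueid,
sum(case when venueid between -128 and 127 then 1 else 0 end) as fits_mostly8,
sum(case when venueid between -32768 and 32767 then 1 else 0 end) as fits_mostly16,
count(*) as total_rows
from event;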


The following table shows compressed sizes for specific numeric values when the MOSTLY8, MOSTLY16,
and MOSTLY32 encodings are used:

Original value | Original INT or BIGINT size (bytes) | MOSTLY8 compressed size (bytes) | MOSTLY16 compressed size (bytes) | MOSTLY32 compressed size (bytes)

1 | 4 | 1 | 2 | 4

10 | 4 | 1 | 2 | 4

100 | 4 | 1 | 2 | 4

1000 | 4 | Same as raw data size | 2 | 4

10000 | 4 | Same as raw data size | 2 | 4

20000 | 4 | Same as raw data size | 2 | 4

40000 | 8 | Same as raw data size | Same as raw data size | 4

100000 | 8 | Same as raw data size | Same as raw data size | 4

2000000000 | 8 | Same as raw data size | Same as raw data size | 4

Run length encoding


Run length encoding replaces a value that is repeated consecutively with a token that consists of
the value and a count of the number of consecutive occurrences (the length of the run). A separate
dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk
block occupies 1 MB.) This encoding is best suited to a table in which data values are often repeated
consecutively, for example, when the table is sorted by those values.

For example, suppose that a column in a large dimension table has a predictably small domain, such as
a COLOR column with fewer than 10 possible values. These values are likely to fall in long sequences
throughout the table, even if the data is not sorted.

We don't recommend applying run length encoding on any column that is designated as a sort key.
Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns
are compressed much more highly than other columns in the same query, range-restricted scans might
perform poorly.

The following table uses the COLOR column example to show how the run length encoding works.

Original data value Original size (bytes) Compressed value Compressed size
(token) (bytes)

Blue 4 {2,Blue} 5

Blue 4 0

Green 5 {3,Green} 6

Green 5 0

Green 5 0

Blue 4 {1,Blue} 5

Yellow 6 {4,Yellow} 7

Yellow 6 0

Yellow 6 0

Yellow 6 0

Total 51 23

Text255 and Text32k encodings


Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words
recur often. A separate dictionary of unique words is created for each block of column values on disk.
(An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the first 245 unique words in the
column. Those words are replaced on disk by a one-byte index value representing one of the 245 values,
and any words that are not represented in the dictionary are stored uncompressed. The process repeats
for each 1 MB disk block. If the indexed words occur frequently in the column, the column yields a high
compression ratio.

For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a
specific number of words. Instead, the dictionary indexes each unique word it finds until the combined
entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.

For example, consider the VENUENAME column in the VENUE table. Words such as Arena, Center, and
Theatre recur in this column and are likely to be among the first 245 words encountered in each block if
text255 compression is applied. If so, this column benefits from compression. This is because every time
those words appear, they occupy only 1 byte of storage (instead of 5, 6, or 7 bytes, respectively).

Zstandard encoding
Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse
datasets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long
and short strings, such as product descriptions, user comments, logs, and JSON strings. Where some
algorithms, such as Delta (p. 43) encoding or Mostly (p. 44) encoding, can potentially use more
storage space than no compression, ZSTD is very unlikely to increase disk usage.

ZSTD supports SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR,
VARCHAR, DATE, TIMESTAMP, and TIMESTAMPTZ data types.

Testing compression encodings


If you decide to manually specify column encodings, you might want to test different encodings with
your data.
Note
We recommend that you use the COPY command to load data whenever possible, and allow
the COPY command to choose the optimal encodings based on your data. Or you can use the
ANALYZE COMPRESSION (p. 566) command to view the suggested encodings for existing
data. For details about applying automatic compression, see Loading tables with automatic
compression (p. 89).

To perform a meaningful test of data compression, you need a large number of rows. For this example,
we create a table and insert rows by using a statement that selects from two tables: VENUE and LISTING.


We leave out the WHERE clause that would normally join the two tables. The result is that each row
in the VENUE table is joined to all of the rows in the LISTING table, for a total of over 32 million rows.
This is known as a Cartesian join and normally is not recommended. However, for this purpose, it's a
convenient method of creating a lot of rows. If you have an existing table with data that you want to
test, you can skip this step.

After we have a table with sample data, we create a table with seven columns. Each has a different
compression encoding: raw, bytedict, lzo, run length, text255, text32k, and zstd. We populate each
column with exactly the same data by running an INSERT command that selects the data from the first
table.

To test compression encodings, do the following:

1. (Optional) First, use a Cartesian join to create a table with a large number of rows. Skip this step if you
want to test an existing table.

create table cartesian_venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);

insert into cartesian_venue
select venueid, venuename, venuecity, venuestate, venueseats
from venue, listing;

2. Next, create a table with the encodings that you want to compare.

create table encodingvenue (
venueraw varchar(100) encode raw,
venuebytedict varchar(100) encode bytedict,
venuelzo varchar(100) encode lzo,
venuerunlength varchar(100) encode runlength,
venuetext255 varchar(100) encode text255,
venuetext32k varchar(100) encode text32k,
venuezstd varchar(100) encode zstd);

3. Insert the same data into all of the columns using an INSERT statement with a SELECT clause.

insert into encodingvenue
select venuename as venueraw, venuename as venuebytedict, venuename as venuelzo,
venuename as venuerunlength, venuename as venuetext32k, venuename as venuetext255,
venuename as venuezstd
from cartesian_venue;

4. Verify the number of rows in the new table.

select count(*) from encodingvenue

count
----------
38884394
(1 row)

5. Query the STV_BLOCKLIST (p. 1244) system table to compare the number of 1 MB disk blocks used
by each column.

The MAX aggregate function returns the highest block number for each column. The STV_BLOCKLIST
table includes details for three system-generated columns. This example uses col < 7 in the WHERE
clause to exclude the system-generated columns.


select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name ='encodingvenue'
and col < 7
group by name, col
order by col;

The query returns the following results. The columns are numbered beginning with zero. Depending
on how your cluster is configured, your result might have different numbers, but the relative sizes
should be similar. You can see that BYTEDICT encoding on the second column produced the best
results for this dataset. This approach has a compression ratio of better than 20:1. LZO and ZSTD
encoding also produced excellent results. Different data sets produce different results, of course.
When a column contains longer text strings, LZO often produces the best compression results.

col | max
-----+-----
0 | 203
1 | 10
2 | 22
3 | 204
4 | 56
5 | 72
6 | 20
(7 rows)

If you have data in an existing table, you can use the ANALYZE COMPRESSION (p. 566) command
to view the suggested encodings for the table. For example, the following example shows the
recommended encoding for a copy of the VENUE table, CARTESIAN_VENUE, that contains 38 million
rows. Notice that ANALYZE COMPRESSION recommends LZO encoding for the VENUENAME column.
ANALYZE COMPRESSION chooses optimal compression based on multiple factors, which include percent
of reduction. In this specific case, BYTEDICT provides better compression, but LZO also produces greater
than 90 percent compression.

analyze compression cartesian_venue;

Table           | Column     | Encoding | Est_reduction_pct
----------------+------------+----------+------------------
cartesian_venue | venueid    | lzo      | 97.54
cartesian_venue | venuename  | lzo      | 91.71
cartesian_venue | venuecity  | lzo      | 96.01
cartesian_venue | venuestate | lzo      | 97.68
cartesian_venue | venueseats | lzo      | 98.21

Example: Choosing compression encodings for the CUSTOMER table
The following statement creates a CUSTOMER table that has columns with various data types. This
CREATE TABLE statement shows one of many possible combinations of compression encodings for these
columns.

create table customer(
custkey int encode delta,
custname varchar(30) encode raw,
gender varchar(7) encode text255,
address varchar(200) encode text255,
city varchar(30) encode text255,
state char(2) encode raw,
zipcode char(5) encode bytedict,
start_date date encode delta32k);

The following table shows the column encodings that were chosen for the CUSTOMER table and gives an
explanation for each choice:

Column Data type Encoding Explanation

CUSTKEY int delta CUSTKEY consists of


unique, consecutive
integer values. Because
the differences are one
byte, DELTA is a good
choice.

CUSTNAME varchar(30) raw CUSTNAME has a


large domain with few
repeated values. Any
compression encoding
would probably be
ineffective.

GENDER varchar(7) text255 GENDER is a very small


domain with many
repeated values.
Text255 works well
with VARCHAR columns
in which the same
words recur.

ADDRESS varchar(200) text255 ADDRESS is a large


domain, but contains
many repeated words,
such as Street, Avenue,
North, South, and
so on. Text 255 and
text 32k are useful for
compressing VARCHAR
columns in which the
same words recur. The
column length is short,
so text255 is a good
choice.

CITY varchar(30) text255 CITY is a large domain,


with some repeated
values. Certain city
names are used much
more commonly than
others. Text255 is
a good choice for
the same reasons as
ADDRESS.

STATE char(2) raw In the United States,


STATE is a precise
domain of 50 two-
character values.
Bytedict encoding
would yield some
compression, but
because the column
size is only two
characters, compression
might not be worth
the overhead of
uncompressing the
data.

ZIPCODE char(5) bytedict ZIPCODE is a known


domain of fewer than
50,000 unique values.
Certain zip codes
occur much more
commonly than others.
Bytedict encoding is
very effective when
a column contains
a limited number of
unique values.

START_DATE date delta32k Delta encodings


are very useful for
date time columns,
especially if the rows
are loaded in date
order.

Working with data distribution styles


When you load data into a table, Amazon Redshift distributes the rows of the table to each of the
compute nodes according to the table's distribution style. When you run a query, the query optimizer
redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal
in choosing a table distribution style is to minimize the impact of the redistribution step by locating the
data where it needs to be before the query is run.
Note
This section will introduce you to the principles of data distribution in an Amazon Redshift
database. We recommend that you create your tables with DISTSTYLE AUTO. If you do so, then
Amazon Redshift uses automatic table optimization to choose the data distribution style. For
more information, see Working with automatic table optimization (p. 37). The rest of this
section provides details about distribution styles.

Topics
• Data distribution concepts (p. 51)
• Distribution styles (p. 51)
• Viewing distribution styles (p. 52)
• Evaluating query patterns (p. 53)
• Designating distribution styles (p. 54)
• Evaluating the query plan (p. 55)
• Query plan example (p. 56)
• Distribution examples (p. 60)

Data distribution concepts


Some data distribution concepts for Amazon Redshift follow.

Nodes and slices

An Amazon Redshift cluster is a set of nodes. Each node in the cluster has its own operating system,
dedicated memory, and dedicated disk storage. One node is the leader node, which manages the
distribution of data and query processing tasks to the compute nodes. The compute nodes provide
resources to do those tasks.

The disk storage for a compute node is divided into a number of slices. The number of slices per node
depends on the node size of the cluster. For example, each DS2.XL compute node has two slices, and
each DS2.8XL compute node has 16 slices. The nodes all participate in running parallel queries, working
on data that is distributed as evenly as possible across the slices. For more information about the
number of slices that each node size has, see About clusters and nodes in the Amazon Redshift Cluster
Management Guide.

Data redistribution

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the
node slices according to the table's distribution style. As part of a query plan, the optimizer determines
where blocks of data need to be located to best run the query. The data is then physically moved, or
redistributed, while the query runs. Redistribution might involve either sending specific rows to nodes for
joining or broadcasting an entire table to all of the nodes.

Data redistribution can account for a substantial portion of the cost of a query plan, and the network
traffic it generates can affect other database operations and slow overall system performance. To the
extent that you anticipate where best to locate data initially, you can minimize the impact of data
redistribution.

Data distribution goals

When you load data into a table, Amazon Redshift distributes the table's rows to the compute nodes and
slices according to the distribution style that you chose when you created the table. Data distribution has
two primary goals:

• To distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data
distribution skew, forces some nodes to do more work than others, which impairs query performance.
• To minimize data movement as a query runs. If the rows that participate in joins or aggregates are
already collocated on the nodes with their joining rows in other tables, the optimizer doesn't need to
redistribute as much data when queries run.

The distribution strategy that you choose for your database has important consequences for query
performance, storage requirements, data loading, and maintenance. By choosing the best distribution
style for each table, you can balance your data distribution and significantly improve overall system
performance.

Distribution styles
When you create a table, you can designate one of four distribution styles: AUTO, EVEN, KEY, or ALL.

If you don't specify a distribution style, Amazon Redshift uses AUTO distribution.


AUTO distribution

With AUTO distribution, Amazon Redshift assigns an optimal distribution style based on the size of
the table data. For example, Amazon Redshift initially assigns ALL distribution to a small table, then
changes to EVEN distribution when the table grows larger. When a table is changed from ALL to
EVEN distribution, storage utilization might change slightly. The change in distribution occurs in the
background, in a few seconds.

When you set DISTSTYLE to AUTO, Amazon Redshift might change the distribution of your table
data to have a KEY-based distribution style. To view actions that Amazon Redshift automatically
performed to alter a table distribution key, see SVL_AUTO_WORKER_ACTION (p. 1376).
To view current recommendations regarding altering a table distribution key, see
SVV_ALTER_TABLE_RECOMMENDATIONS (p. 1419).

To view the distribution style applied to a table, query the PG_CLASS_INFO system catalog view. For
more information, see Viewing distribution styles (p. 52). If you don't specify a distribution style with
the CREATE TABLE statement, Amazon Redshift applies AUTO distribution.

EVEN distribution

The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in
any particular column. EVEN distribution is appropriate when a table doesn't participate in joins. It's also
appropriate when there isn't a clear choice between KEY distribution and ALL distribution.

KEY distribution

The rows are distributed according to the values in one column. The leader node places matching values
on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates
the rows on the slices according to the values in the joining columns. This way, matching values from the
common columns are physically stored together.

ALL distribution

A copy of the entire table is distributed to every node. Whereas EVEN distribution or KEY distribution
places only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated
for every join that the table participates in.

ALL distribution multiplies the storage required by the number of nodes in the cluster, and so it takes
much longer to load, update, or insert data into multiple tables. ALL distribution is appropriate only for
relatively slow-moving tables; that is, tables that are not updated frequently or extensively. Because the
cost of redistributing small tables during a query is low, there isn't a significant benefit to defining small
dimension tables as DISTSTYLE ALL.
Note
After you have specified a distribution style for a column, Amazon Redshift handles data
distribution at the cluster level. Amazon Redshift does not require or support the concept
of partitioning data within database objects. You don't need to create table spaces or define
partitioning schemes for tables.

In certain scenarios, you can change the distribution style of a table after it is created. For more
information, see ALTER TABLE (p. 542). For scenarios when you can't change the distribution style of a
table after it's created, you can recreate the table and populate the new table with a deep copy. For more
information, see Performing a deep copy (p. 102).
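
For example, the following ALTER TABLE statements show the general forms for changing a table's
distribution style in place. The SALES table and LISTID column from the TICKIT sample database are used
here only as an illustration; substitute your own table and column names.

alter table sales alter diststyle key distkey listid;
alter table sales alter diststyle even;
alter table sales alter diststyle all;
alter table sales alter diststyle auto;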

Viewing distribution styles


To view the distribution style of a table, query the PG_CLASS_INFO view or the SVV_TABLE_INFO view.

The RELEFFECTIVEDISTSTYLE column in PG_CLASS_INFO indicates the current distribution style for the
table. If the table uses automatic distribution, RELEFFECTIVEDISTSTYLE is 10, 11, or 12, which indicates
whether the effective distribution style is AUTO (ALL), AUTO (EVEN), or AUTO (KEY). If the table uses
automatic distribution, the distribution style might initially show AUTO (ALL), then change to AUTO
(EVEN) or AUTO (KEY) when the table grows.

The following table gives the distribution style for each value in the RELEFFECTIVEDISTSTYLE column:

RELEFFECTIVEDISTSTYLE   Current distribution style
0                       EVEN
1                       KEY
8                       ALL
10                      AUTO (ALL)
11                      AUTO (EVEN)
12                      AUTO (KEY)

The DISTSTYLE column in SVV_TABLE_INFO indicates the current distribution style for the table. If the
table uses automatic distribution, DISTSTYLE is AUTO (ALL), AUTO (EVEN), or AUTO (KEY).

The following example creates four tables using the three distribution styles and automatic distribution,
then queries SVV_TABLE_INFO to view the distribution styles.

create table public.dist_key (col1 int)
diststyle key distkey (col1);

insert into public.dist_key values (1);

create table public.dist_even (col1 int)
diststyle even;

insert into public.dist_even values (1);

create table public.dist_all (col1 int)
diststyle all;

insert into public.dist_all values (1);

create table public.dist_auto (col1 int);

insert into public.dist_auto values (1);

select "schema", "table", diststyle from SVV_TABLE_INFO
where "table" like 'dist%';

schema | table | diststyle
------------+-----------------+------------
public | dist_key | KEY(col1)
public | dist_even | EVEN
public | dist_all | ALL
public | dist_auto | AUTO(ALL)
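
You can run a similar check against PG_CLASS_INFO. The following query is a sketch that assumes the four
tables created in the preceding example; it returns the numeric RELEFFECTIVEDISTSTYLE values described
in the earlier table.

select trim(relname) as table, releffectivediststyle
from pg_class_info
where relname like 'dist%';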

Evaluating query patterns


Choosing distribution styles is just one aspect of database design. You should consider distribution styles
only within the context of the entire system, balancing distribution with other important factors such as
cluster size, compression encoding methods, sort keys, and table constraints.


Test your system with data that is as close to real data as possible.

In order to make good choices for distribution styles, you need to understand the query patterns for
your Amazon Redshift application. Identify the most costly queries in your system and base your initial
database design on the demands of those queries. Factors that determine the total cost of a query
include how long the query takes to run and how much computing resources it consumes. Other factors
that determine query cost are how often it is run, and how disruptive it is to other queries and database
operations.

Identify the tables that are used by the most costly queries, and evaluate their role in query execution.
Consider how the tables are joined and aggregated.

Use the guidelines in this section to choose a distribution style for each table. When you have done so,
create the tables and load them with data that is as close as possible to real data. Then test the tables
for the types of queries that you expect to use. You can evaluate the query explain plans to identify
tuning opportunities. Compare load times, storage space, and query execution times in order to balance
your system's overall requirements.

Designating distribution styles


The considerations and recommendations for designating distribution styles in this section use a star
schema as an example. Your database design might be based on a star schema, some variant of a star
schema, or an entirely different schema. Amazon Redshift is designed to work effectively with whatever
schema design you choose. The principles in this section can be applied to any design schema.

1. Specify the primary key and foreign keys for all your tables.

Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer
uses them when it generates query plans. If you set primary keys and foreign keys, your application
must maintain the validity of the keys.
2. Distribute the fact table and its largest dimension table on their common columns.

Choose the largest dimension based on the size of the dataset that participates in the most common join,
not just the size of the table. If a table is commonly filtered, using a WHERE clause, only a portion
of its rows participate in the join. Such a table has less impact on redistribution than a smaller table
that contributes more data. Designate both the dimension table's primary key and the fact table's
corresponding foreign key as DISTKEY. If multiple tables use the same distribution key, they are also
collocated with the fact table. Your fact table can have only one distribution key. Any tables that join
on another key aren't collocated with the fact table.
3. Designate distribution keys for the other dimension tables.

Distribute the tables on their primary keys or their foreign keys, depending on how they most
commonly join with other tables.
4. Evaluate whether to change some of the dimension tables to use ALL distribution.

If a dimension table cannot be collocated with the fact table or other important joining tables, you
can improve query performance significantly by distributing the entire table to all of the nodes. Using
ALL distribution multiplies storage space requirements and increases load times and maintenance
operations, so you should weigh all factors before choosing ALL distribution. The following section
explains how to identify candidates for ALL distribution by evaluating the EXPLAIN plan.
5. Use EVEN distribution for the remaining tables.

If a table is largely denormalized and does not participate in joins, or if you don't have a clear choice
for another distribution style, use EVEN distribution.

To let Amazon Redshift choose the appropriate distribution style, don't explicitly specify a distribution
style.


Evaluating the query plan


You can use query plans to identify candidates for optimizing the distribution style.

After making your initial design decisions, create your tables, load them with data, and test them. Use
a test dataset that is as close as possible to the real data. Measure load times to use as a baseline for
comparisons.

Evaluate queries that are representative of the most costly queries you expect to run, specifically queries
that use joins and aggregations. Compare execution times for various design options. When you compare
execution times, don't count the first time the query is run, because the first runtime includes the
compilation time.

In the query plan, look for the following labels, which indicate how data is moved for the join steps:

DS_DIST_NONE

No redistribution is required, because corresponding slices are collocated on the compute nodes.
You typically have only one DS_DIST_NONE step, the join between the fact table and one dimension
table.
DS_DIST_ALL_NONE

No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is
located on every node.
DS_DIST_INNER

The inner table is redistributed.


DS_DIST_OUTER

The outer table is redistributed.


DS_BCAST_INNER

A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER

The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH

Both tables are redistributed.

DS_DIST_NONE and DS_DIST_ALL_NONE are good. They indicate that no distribution was required for
that step because all of the joins are collocated.

DS_DIST_INNER means that the step probably has a relatively high cost because the inner table is being
redistributed to the nodes. DS_DIST_INNER indicates that the outer table is already properly distributed
on the join key. Set the inner table's distribution key to the join key to convert this to DS_DIST_NONE.
In some cases, distributing the inner table on the join key isn't possible because the outer table isn't
distributed on the join key. If this is the case, evaluate whether to use ALL distribution for the inner table.
If the table isn't updated frequently or extensively, and it's large enough to carry a high redistribution
cost, change the distribution style to ALL and test again. ALL distribution causes increased load times, so
when you retest, include the load time in your evaluation factors.

DS_DIST_ALL_INNER is not good. It means the entire inner table is redistributed to a single slice because
the outer table uses DISTSTYLE ALL, so that a copy of the entire outer table is located on each node. This
results in inefficient serial execution of the join on a single node instead of taking advantage of parallel
execution using all of the nodes. DISTSTYLE ALL is meant to be used only for the inner join table.
Instead, specify a distribution key or use even distribution for the outer table.


DS_BCAST_INNER and DS_DIST_BOTH are not good. Usually these redistributions occur because the
tables are not joined on their distribution keys. If the fact table does not already have a distribution
key, specify the joining column as the distribution key for both tables. If the fact table already has a
distribution key on another column, evaluate whether changing the distribution key to collocate this join
improves overall performance. If changing the distribution key of the outer table isn't an optimal choice,
you can achieve collocation by specifying DISTSTYLE ALL for the inner table.

The following example shows a portion of a query plan with DS_BCAST_INNER and DS_DIST_NONE
labels.

-> XN Hash Join DS_BCAST_INNER (cost=112.50..3272334142.59 rows=170771 width=84)
Hash Cond: ("outer".venueid = "inner".venueid)
-> XN Hash Join DS_BCAST_INNER (cost=109.98..3167290276.71 rows=172456 width=47)
Hash Cond: ("outer".eventid = "inner".eventid)
-> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
Merge Cond: ("outer".listid = "inner".listid)
-> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
-> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)

After changing the dimension tables to use DISTSTYLE ALL, the query plan for the same query shows
DS_DIST_ALL_NONE in place of DS_BCAST_INNER. Also, there is a dramatic change in the relative cost
for the join steps.

-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
Hash Cond: ("outer".venueid = "inner".venueid)
-> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
Hash Cond: ("outer".eventid = "inner".eventid)
-> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
Merge Cond: ("outer".listid = "inner".listid)
-> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
-> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)

Query plan example


This example shows how to evaluate a query plan to find opportunities to optimize the distribution.

Run the following query with an EXPLAIN command to produce a query plan.

explain
select lastname, catname, venuename, venuecity, venuestate, eventname,
month, sum(pricepaid) as buyercost, max(totalprice) as maxtotalprice
from category join event on category.catid = event.catid
join venue on venue.venueid = event.venueid
join sales on sales.eventid = event.eventid
join listing on sales.listid = listing.listid
join date on sales.dateid = date.dateid
join users on users.userid = sales.buyerid
group by lastname, catname, venuename, venuecity, venuestate, eventname, month
having sum(pricepaid)>9999
order by catname, buyercost desc;

In the TICKIT database, SALES is a fact table and LISTING is its largest dimension. In order to collocate
the tables, SALES is distributed on LISTID, which is the foreign key for LISTING, and LISTING is
distributed on its primary key, LISTID. The following example shows the CREATE TABLE commands for
SALES and LISTING.

create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null encode mostly16,
dateid smallint not null,
qtysold smallint not null encode mostly8,
pricepaid decimal(8,2) encode delta32k,
commission decimal(8,2) encode delta32k,
saletime timestamp,
primary key(salesid),
foreign key(listid) references listing(listid),
foreign key(sellerid) references users(userid),
foreign key(buyerid) references users(userid),
foreign key(dateid) references date(dateid))
sortkey(listid,sellerid);

create table listing(
listid integer not null distkey sortkey,
sellerid integer not null,
eventid integer not null encode mostly16,
dateid smallint not null,
numtickets smallint not null encode mostly8,
priceperticket decimal(8,2) encode bytedict,
totalprice decimal(8,2) encode mostly32,
listtime timestamp,
primary key(listid),
foreign key(sellerid) references users(userid),
foreign key(eventid) references event(eventid),
foreign key(dateid) references date(dateid));

In the following query plan, the Merge Join step for the join on SALES and LISTING shows
DS_DIST_NONE, which indicates that no redistribution is required for the step. However, moving up
the query plan, the other inner joins show DS_BCAST_INNER, which indicates that the inner table is
broadcast as part of the query execution. Because only one pair of tables can be collocated using key
distribution, five tables need to be rebroadcast.

QUERY PLAN
XN Merge (cost=1015345167117.54..1015345167544.46 rows=1000 width=103)
Merge Key: category.catname, sum(sales.pricepaid)
-> XN Network (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
Send to leader
-> XN Sort (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
Sort Key: category.catname, sum(sales.pricepaid)
-> XN HashAggregate (cost=15345150568.37..15345152276.08 rows=170771
width=103)
Filter: (sum(pricepaid) > 9999.00)
-> XN Hash Join DS_BCAST_INNER (cost=742.08..15345146299.10
rows=170771 width=103)
Hash Cond: ("outer".catid = "inner".catid)
-> XN Hash Join DS_BCAST_INNER (cost=741.94..15342942456.61
rows=170771 width=97)
Hash Cond: ("outer".dateid = "inner".dateid)
-> XN Hash Join DS_BCAST_INNER
(cost=737.38..15269938609.81 rows=170766 width=90)
Hash Cond: ("outer".buyerid = "inner".userid)
-> XN Hash Join DS_BCAST_INNER
(cost=112.50..3272334142.59 rows=170771 width=84)
Hash Cond: ("outer".venueid = "inner".venueid)
-> XN Hash Join DS_BCAST_INNER
(cost=109.98..3167290276.71 rows=172456 width=47)
Hash Cond: ("outer".eventid =
"inner".eventid)
-> XN Merge Join DS_DIST_NONE
(cost=0.00..6286.47 rows=172456 width=30)
Merge Cond: ("outer".listid =
"inner".listid)
-> XN Seq Scan on listing
(cost=0.00..1924.97 rows=192497 width=14)
-> XN Seq Scan on sales
(cost=0.00..1724.56 rows=172456 width=24)
-> XN Hash (cost=87.98..87.98
rows=8798 width=25)
-> XN Seq Scan on event
(cost=0.00..87.98 rows=8798 width=25)
-> XN Hash (cost=2.02..2.02 rows=202
width=41)
-> XN Seq Scan on venue
(cost=0.00..2.02 rows=202 width=41)
-> XN Hash (cost=499.90..499.90 rows=49990
width=14)
-> XN Seq Scan on users (cost=0.00..499.90
rows=49990 width=14)
-> XN Hash (cost=3.65..3.65 rows=365 width=11)
-> XN Seq Scan on date (cost=0.00..3.65 rows=365
width=11)
-> XN Hash (cost=0.11..0.11 rows=11 width=10)
-> XN Seq Scan on category (cost=0.00..0.11 rows=11
width=10)

One solution is to recreate the tables with DISTSTYLE ALL. This example changes the distribution style of
the dimension tables by recreating them with a deep copy.

First, rename the tables.

alter table users rename to userscopy;
alter table venue rename to venuecopy;
alter table category rename to categorycopy;
alter table date rename to datecopy;
alter table event rename to eventcopy;

Run the following script to recreate USERS, VENUE, CATEGORY, DATE, and EVENT. Don't make any changes
to SALES and LISTING.

create table users(
userid integer not null sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean,
primary key(userid)) diststyle all;

create table venue(
venueid smallint not null sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer,
primary key(venueid)) diststyle all;

create table category(
catid smallint not null,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50),
primary key(catid)) diststyle all;

create table date(
dateid smallint not null sortkey,
caldate date not null,
day character(3) not null,
week smallint not null,
month character(5) not null,
qtr character(5) not null,
year smallint not null,
holiday boolean default('N'),
primary key (dateid)) diststyle all;

create table event(
eventid integer not null sortkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null,
eventname varchar(200),
starttime timestamp,
primary key(eventid),
foreign key(venueid) references venue(venueid),
foreign key(catid) references category(catid),
foreign key(dateid) references date(dateid)) diststyle all;

Insert the data back into the tables and run an ANALYZE command to update the statistics.

insert into users select * from userscopy;
insert into venue select * from venuecopy;
insert into category select * from categorycopy;
insert into date select * from datecopy;
insert into event select * from eventcopy;

analyze;

Finally, drop the copies.

drop table userscopy;
drop table venuecopy;
drop table categorycopy;
drop table datecopy;
drop table eventcopy;

Run the same query with EXPLAIN again, and examine the new query plan. The joins now show
DS_DIST_ALL_NONE, indicating that no redistribution is required because the data was distributed to
every node using DISTSTYLE ALL.

QUERY PLAN
XN Merge (cost=1000000047117.54..1000000047544.46 rows=1000 width=103)
Merge Key: category.catname, sum(sales.pricepaid)
-> XN Network (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
Send to leader
-> XN Sort (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
Sort Key: category.catname, sum(sales.pricepaid)
-> XN HashAggregate (cost=30568.37..32276.08 rows=170771 width=103)
Filter: (sum(pricepaid) > 9999.00)
-> XN Hash Join DS_DIST_ALL_NONE (cost=742.08..26299.10 rows=170771
width=103)
Hash Cond: ("outer".buyerid = "inner".userid)
-> XN Hash Join DS_DIST_ALL_NONE (cost=117.20..21831.99
rows=170766 width=97)
Hash Cond: ("outer".dateid = "inner".dateid)
-> XN Hash Join DS_DIST_ALL_NONE (cost=112.64..17985.08
rows=170771 width=90)
Hash Cond: ("outer".catid = "inner".catid)
-> XN Hash Join DS_DIST_ALL_NONE
(cost=112.50..14142.59 rows=170771 width=84)
Hash Cond: ("outer".venueid = "inner".venueid)
-> XN Hash Join DS_DIST_ALL_NONE
(cost=109.98..10276.71 rows=172456 width=47)
Hash Cond: ("outer".eventid =
"inner".eventid)
-> XN Merge Join DS_DIST_NONE
(cost=0.00..6286.47 rows=172456 width=30)
Merge Cond: ("outer".listid =
"inner".listid)
-> XN Seq Scan on listing
(cost=0.00..1924.97 rows=192497 width=14)
-> XN Seq Scan on sales
(cost=0.00..1724.56 rows=172456 width=24)
-> XN Hash (cost=87.98..87.98 rows=8798
width=25)
-> XN Seq Scan on event
(cost=0.00..87.98 rows=8798 width=25)
-> XN Hash (cost=2.02..2.02 rows=202
width=41)
-> XN Seq Scan on venue
(cost=0.00..2.02 rows=202 width=41)
-> XN Hash (cost=0.11..0.11 rows=11 width=10)
-> XN Seq Scan on category (cost=0.00..0.11
rows=11 width=10)
-> XN Hash (cost=3.65..3.65 rows=365 width=11)
-> XN Seq Scan on date (cost=0.00..3.65 rows=365
width=11)
-> XN Hash (cost=499.90..499.90 rows=49990 width=14)
-> XN Seq Scan on users (cost=0.00..499.90 rows=49990
width=14)

Distribution examples
The following examples show how data is distributed according to the options that you define in the
CREATE TABLE statement.

DISTKEY examples
Look at the schema of the USERS table in the TICKIT database. USERID is defined as the SORTKEY
column and the DISTKEY column:

select "column", type, encoding, distkey, sortkey


from pg_table_def where tablename = 'users';

column | type | encoding | distkey | sortkey


---------------+------------------------+----------+---------+---------
userid | integer | none | t | 1

60
Amazon Redshift Database Developer Guide
Distribution examples

username | character(8) | none | f | 0


firstname | character varying(30) | text32k | f | 0

...

USERID is a good choice for the distribution column on this table. If you query the SVV_DISKUSAGE
system view, you can see that the table is very evenly distributed. Column numbers are zero-based, so
USERID is column 0.

select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name='users' and col=0 and rows>0
order by slice, col;

slice| col | rows | minvalue | maxvalue
-----+-----+-------+----------+----------
0 | 0 | 12496 | 4 | 49987
1 | 0 | 12498 | 1 | 49988
2 | 0 | 12497 | 2 | 49989
3 | 0 | 12499 | 3 | 49990
(4 rows)

The table contains 49,990 rows. The rows (num_values) column shows that each slice contains about the
same number of rows. The minvalue and maxvalue columns show the range of values on each slice. Each
slice includes nearly the entire range of values, so there's a good chance that every slice participates in
running a query that filters for a range of user IDs.

This example demonstrates distribution on a small test system. The total number of slices is typically
much higher.

If you commonly join or group using the STATE column, you might choose to distribute on the STATE
column. The following example shows a case where you create a new table with the same data as the
USERS table but set the DISTKEY to the STATE column. In this case, the distribution isn't as even. Slice
0 (13,587 rows) holds approximately 30 percent more rows than slice 3 (10,150 rows). In a much larger
table, this amount of distribution skew can have an adverse impact on query processing.

create table userskey distkey(state) as select * from users;

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'userskey' and col=0 and rows>0
order by slice, col;

slice | col | rows | minvalue | maxvalue
------+-----+-------+----------+----------
0 | 0 | 13587 | 5 | 49989
1 | 0 | 11245 | 2 | 49990
2 | 0 | 15008 | 1 | 49976
3 | 0 | 10150 | 4 | 49986
(4 rows)

DISTSTYLE EVEN example


If you create a new table with the same data as the USERS table but set the DISTSTYLE to EVEN, rows
are always evenly distributed across slices.

create table userseven diststyle even as
select * from users;

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'userseven' and col=0 and rows>0
order by slice, col;

slice | col | rows | minvalue | maxvalue
------+-----+-------+----------+----------
0 | 0 | 12497 | 4 | 49990
1 | 0 | 12498 | 8 | 49984
2 | 0 | 12498 | 2 | 49988
3 | 0 | 12497 | 1 | 49989
(4 rows)

However, because distribution is not based on a specific column, query processing can be degraded,
especially if the table is joined to other tables. The lack of distribution on a joining column often
influences the type of join operation that can be performed efficiently. Joins, aggregations, and grouping
operations are optimized when both tables are distributed and sorted on their respective joining
columns.

DISTSTYLE ALL example


If you create a new table with the same data as the USERS table but set the DISTSTYLE to ALL, all the
rows are distributed to the first slice of each node.

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'usersall' and col=0 and rows > 0
order by slice, col;

slice | col | rows | minvalue | maxvalue
------+-----+-------+----------+----------
0 | 0 | 49990 | 4 | 49990
2 | 0 | 49990 | 2 | 49990
(2 rows)

Working with sort keys


Note
We recommend that you create your tables with SORTKEY AUTO. If you do so, then Amazon
Redshift uses automatic table optimization to choose the sort key. For more information, see
Working with automatic table optimization (p. 37). The rest of this section provides details
about the sort order.

When you create a table, you can alternatively define one or more of its columns as sort keys. When data
is initially loaded into the empty table, the rows are stored on disk in sorted order. Information about
sort key columns is passed to the query planner, and the planner uses this information to construct plans
that exploit the way that the data is sorted. For more information, see CREATE TABLE (p. 708).

Sorting enables efficient handling of range-restricted predicates. Amazon Redshift stores columnar data
in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query
uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip
over large numbers of blocks during table scans. For example, suppose that a table stores five years of
data sorted by date and a query specifies a date range of one month. In this case, you can eliminate up
to 98 percent of the disk blocks from the scan. If the data is not sorted, more of the disk blocks (possibly
all of them) have to be scanned.

You can specify either a compound or interleaved sort key. A compound sort key is more efficient when
query predicates use a prefix, which is a subset of the sort key columns in order. An interleaved sort key
gives equal weight to each column in the sort key, so query predicates can use any subset of the columns
that make up the sort key, in any order.


To understand the impact of the chosen sort key on query performance, use the EXPLAIN (p. 762)
command. For more information, see Query planning and execution workflow (p. 383).

To define a sort type, use either the INTERLEAVED or COMPOUND keyword with your CREATE TABLE
or CREATE TABLE AS statement. The default is COMPOUND. The default COMPOUND is recommended
unless your tables aren't updated regularly with INSERT, UPDATE, or DELETE. An INTERLEAVED sort key
can use a maximum of eight columns. Depending on your data and cluster size, VACUUM REINDEX takes
significantly longer than VACUUM FULL because it makes an additional pass to analyze the interleaved
sort keys. The sort and merge operation can take longer for interleaved tables because the interleaved
sort might need to rearrange more rows than a compound sort.
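
For example, the following statements show the general form of each sort type. The table definitions are
minimal sketches for illustration only, not recommendations for a particular workload.

create table sales_compound(
listid integer not null,
sellerid integer not null,
saletime timestamp)
compound sortkey(listid, sellerid);

create table sales_interleaved(
listid integer not null,
sellerid integer not null,
saletime timestamp)
interleaved sortkey(listid, sellerid);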

To view the sort keys for a table, query the SVV_TABLE_INFO (p. 1450) system view.
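
For example, a query along the following lines returns the first sort key column and the number of sort key
columns for each table. This is a sketch; see the SVV_TABLE_INFO reference for the full list of columns.

select "table", sortkey1, sortkey_num
from svv_table_info;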

Topics
• Compound sort key (p. 63)
• Interleaved sort key (p. 63)

Compound sort key


A compound key is made up of all of the columns listed in the sort key definition, in the order they are
listed. A compound sort key is most useful when a query's filter applies conditions, such as filters and
joins, that use a prefix of the sort keys. The performance benefits of compound sorting decrease when
queries depend only on secondary sort columns, without referencing the primary columns. COMPOUND
is the default sort type.

Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions
that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join,
is feasible when the data is distributed and presorted on the joining columns. Compound sort keys also
help improve compression.

As you add rows to a sorted table that already contains data, the unsorted region grows, which has
a significant effect on performance. The effect is greater when the table uses interleaved sorting,
especially when the sort columns include data that increases monotonically, such as date or timestamp
columns. You should run a VACUUM operation regularly, especially after large data loads, to re-sort and
re-analyze the data. For more information, see Managing the size of the unsorted region (p. 111). After
vacuuming to resort the data, it's a good practice to run an ANALYZE command to update the statistical
metadata for the query planner. For more information, see Analyzing tables (p. 103).
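
For example, after a large load into the SALES table you might run the following commands. This is a
sketch of the general pattern rather than a tuning recommendation.

vacuum sales;
analyze sales;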

Interleaved sort key


An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple
queries use different columns for filters, then you can often improve performance for those queries by
using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns,
interleaved sorting significantly improves query performance as compared to compound sorting.
Important
Don't use an interleaved sort key on columns with monotonically increasing attributes, such as
identity columns, dates, or timestamps.

The performance improvements you gain by implementing an interleaved sort key should be weighed
against increased load and vacuum times.

Interleaved sorts are most effective with highly selective queries that filter on one or more of the sort
key columns in the WHERE clause, for example select c_name from customer where c_region
= 'ASIA'. The benefits of interleaved sorting increase with the number of sorted columns that are
restricted.


An interleaved sort is more effective with large tables. Sorting is applied on each slice. Thus, an
interleaved sort is most effective when a table is large enough to require multiple 1 MB blocks per slice.
Here, the query processor can skip a significant proportion of the blocks using restrictive predicates. To
view the number of blocks a table uses, query the STV_BLOCKLIST (p. 1244) system view.

When sorting on a single column, an interleaved sort might give better performance than a compound
sort if the column values have a long common prefix. For example, URLs commonly begin with "http://
www". Compound sort keys use a limited number of characters from the prefix, which results in a lot
of duplication of keys. Interleaved sorts use an internal compression scheme for zone map values that
enables them to better discriminate among column values that have a long common prefix.

VACUUM REINDEX

As you add rows to a sorted table that already contains data, performance might deteriorate over
time. This deterioration occurs for both compound and interleaved sorts, but it has a greater effect on
interleaved tables. A VACUUM restores the sort order, but the operation can take longer for interleaved
tables because merging new interleaved data might involve modifying every data block.

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort
key columns and uses that information for optimal interleaving of the sort key columns. As a table
grows, the distribution of the values in the sort key columns can change, or skew, especially with date or
timestamp columns. If the skew becomes too large, performance might be affected. To re-analyze the
sort keys and restore performance, run the VACUUM command with the REINDEX key word. Because it
needs to take an extra analysis pass over the data, VACUUM REINDEX can take longer than a standard
VACUUM for interleaved tables. To view information about key distribution skew and last reindex time,
query the SVV_INTERLEAVED_COLUMNS (p. 1434) system view.
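
For example, assuming a hypothetical interleaved table named customer_interleaved, the following sketch
reindexes the table and then checks the skew and last reindex time.

vacuum reindex customer_interleaved;

select tbl, col, interleaved_skew, last_reindex
from svv_interleaved_columns;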

For more information about how to determine how often to run VACUUM and when to run a VACUUM
REINDEX, see Deciding whether to reindex (p. 110).

Defining table constraints


Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by
Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should
be declared if your ETL process or some other process in your application enforces their integrity.

For example, the query planner uses primary and foreign keys in certain statistical computations. It does
this to infer uniqueness and referential relationships that affect subquery decorrelation techniques. By
doing this, it can order large numbers of joins and eliminate redundant joins.

The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are
valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return
incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key
is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand,
you should always declare primary and foreign keys and uniqueness constraints when you know that
they are valid.
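
For example, the following sketch declares informational primary key and foreign key constraints on two
hypothetical tables. Amazon Redshift does not enforce these constraints; they only inform the planner.

create table category_dim(
catid smallint not null,
catname varchar(10),
primary key(catid));

create table event_fact(
eventid integer not null,
catid smallint not null,
primary key(eventid),
foreign key(catid) references category_dim(catid));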

Amazon Redshift does enforce NOT NULL column constraints.

For more information about table constraints, see CREATE TABLE (p. 708).


Loading data
Topics
• Using a COPY command to load data (p. 65)
• Updating tables with DML commands (p. 96)
• Updating and inserting new data (p. 96)
• Performing a deep copy (p. 102)
• Analyzing tables (p. 103)
• Vacuuming tables (p. 108)
• Managing concurrent write operations (p. 116)
• Tutorial: Loading data from Amazon S3 (p. 122)

A COPY command is the most efficient way to load a table. You can also add data to your tables using
INSERT commands, though it is much less efficient than using COPY. The COPY command is able to
read from multiple data files or multiple data streams simultaneously. Amazon Redshift allocates the
workload to the cluster nodes and performs the load operations in parallel, including sorting the rows
and distributing data across node slices.
Note
Amazon Redshift Spectrum external tables are read-only. You can't COPY or INSERT to an
external table.

To access data on other AWS resources, your cluster must have permission to access those resources and
to perform the necessary actions to access the data. You can use Identity and Access Management (IAM)
to limit the access users have to your cluster resources and data.

After your initial data load, if you add, modify, or delete a significant amount of data, you should follow
up by running a VACUUM command to reorganize your data and reclaim space after deletes. You should
also run an ANALYZE command to update table statistics.

This section explains how to load data and troubleshoot data loads and presents best practices for
loading data.

Using a COPY command to load data


Topics
• Credentials and access permissions (p. 66)
• Preparing your input data (p. 67)
• Loading data from Amazon S3 (p. 68)
• Loading data from Amazon EMR (p. 76)
• Loading data from remote hosts (p. 80)
• Loading data from an Amazon DynamoDB table (p. 86)
• Verifying that the data loaded correctly (p. 88)
• Validating input data (p. 88)
• Loading tables with automatic compression (p. 89)
• Optimizing storage for narrow tables (p. 91)
• Loading default column values (p. 91)
• Troubleshooting data loads (p. 91)


The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture
to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output
from one or more remote hosts.
Note
We strongly recommend using the COPY command to load large amounts of data. Using
individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if
your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT
or CREATE TABLE AS to improve performance. For information, see INSERT (p. 780) or CREATE
TABLE AS (p. 723).

To load data from another AWS resource, your cluster must have permission to access the resource and
perform the necessary actions.

To grant or revoke privilege to load data into a table using a COPY command, grant or revoke the INSERT
privilege.
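
For example, the following statements grant and then revoke the ability to load the SALES table for a
hypothetical user named dataloader.

grant insert on table sales to dataloader;
revoke insert on table sales from dataloader;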

Your data needs to be in the proper format for loading into your Amazon Redshift table. This section
presents guidelines for preparing and verifying your data before the load and for validating a COPY
statement before you run it.

To protect the information in your files, you can encrypt the data files before you upload them to your
Amazon S3 bucket; COPY will decrypt the data as it performs the load. You can also limit access to your
load data by providing temporary security credentials to users. Temporary security credentials provide
enhanced security because they have short life spans and cannot be reused after they expire.

Amazon Redshift has features built in to COPY to load uncompressed, delimited data quickly. But you
can compress your files using gzip, lzop, or bzip2 to save time uploading the files.
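
For example, a COPY command for gzip-compressed, pipe-delimited files might look like the following
sketch. The bucket, prefix, and role ARN are placeholders.

copy sales
from 's3://mybucket/data/sales/part'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
gzip;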

If the following keywords are in the COPY query, automatic splitting of uncompressed data is not
supported: ESCAPE, REMOVEQUOTES, and FIXEDWIDTH. But the CSV keyword is supported.

To help keep your data secure in transit within the AWS cloud, Amazon Redshift uses hardware
accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup,
and restore operations.

When you load your table directly from an Amazon DynamoDB table, you have the option to control the
amount of Amazon DynamoDB provisioned throughput you consume.
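
For example, the READRATIO option limits how much of the table's provisioned throughput a load
consumes. The following sketch uses placeholder table and role names and caps consumption at 50
percent.

copy favoritemovies
from 'dynamodb://ProductCatalog'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;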

You can optionally let COPY analyze your input data and automatically apply optimal compression
encodings to your table as part of the load process.

Credentials and access permissions


To load or unload data using another AWS resource, such as Amazon S3, Amazon DynamoDB, Amazon
EMR, or Amazon EC2, your cluster must have permission to access the resource and perform the
necessary actions to access the data. For example, to load data from Amazon S3, COPY must have LIST
access to the bucket and GET access for the bucket objects.

To obtain authorization to access a resource, your cluster must be authenticated. You can choose either
role-based access control or key-based access control. This section presents an overview of the two
methods. For complete details and examples, see Permissions to access other AWS Resources (p. 611).

Role-based access control


With role-based access control, your cluster temporarily assumes an AWS Identity and Access
Management (IAM) role on your behalf. Then, based on the authorizations granted to the role, your
cluster can access the required AWS resources.


We recommend using role-based access control because it provides more secure, fine-grained control
of access to AWS resources and sensitive user data, in addition to safeguarding your AWS credentials.

To use role-based access control, you must first create an IAM role using the Amazon Redshift service
role type, and then attach the role to your cluster. The role must have, at a minimum, the permissions
listed in IAM permissions for COPY, UNLOAD, and CREATE LIBRARY (p. 614). For steps to create an IAM
role and attach it to your cluster, see Creating an IAM Role to Allow Your Amazon Redshift Cluster to
Access AWS Services in the Amazon Redshift Cluster Management Guide.

You can add a role to a cluster or view the roles associated with a cluster by using the Amazon Redshift
Management Console, CLI, or API. For more information, see Authorizing COPY and UNLOAD Operations
Using IAM Roles in the Amazon Redshift Cluster Management Guide.

When you create an IAM role, IAM returns an Amazon Resource Name (ARN) for the role. To run a COPY
command using an IAM role, provide the role ARN using the IAM_ROLE parameter or the CREDENTIALS
parameter.

The following COPY command example uses IAM_ROLE parameter with the role MyRedshiftRole for
authentication.

copy customer from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::12345678901:role/MyRedshiftRole';

The AWS IAM user must have, at a minimum, the permissions listed in IAM permissions for COPY,
UNLOAD, and CREATE LIBRARY (p. 614).

Key-based access control


With key-based access control, you provide the access key ID and secret access key for an IAM user that is
authorized to access the AWS resources that contain the data.
Note
We strongly recommend using an IAM role for authentication instead of supplying a plain-text
access key ID and secret access key. If you choose key-based access control, never use your AWS
account (root) credentials. Always create an IAM user and provide that user's access key ID and
secret access key. For steps to create an IAM user, see Creating an IAM User in Your AWS Account.

Preparing your input data


If your input data is not compatible with the table columns that will receive it, the COPY command will
fail.

Use the following guidelines to help ensure that your input data is valid:

• Your data can only contain UTF-8 characters up to four bytes long.
• Verify that CHAR and VARCHAR strings are no longer than the lengths of the corresponding columns.
VARCHAR strings are measured in bytes, not characters, so, for example, a four-character string of
Chinese characters that occupy four bytes each requires a VARCHAR(16) column.
• Multibyte characters can only be used with VARCHAR columns. Verify that multibyte characters are no
more than four bytes long.
• Verify that data for CHAR columns only contains single-byte characters.
• Do not include any special characters or syntax to indicate the last field in a record. This field can be a
delimiter.
• If your data includes null terminators, also referred to as NUL (UTF-8 0000) or binary zero (0x000),
you can load these characters as NULLS into CHAR or VARCHAR columns by using the NULL AS
option in the COPY command: null as '\0' or null as '\000' . If you do not use NULL AS, null
terminators will cause your COPY to fail.


• If your strings contain special characters, such as delimiters and embedded newlines, use the ESCAPE
option with the COPY (p. 576) command.
• Verify that all single and double quotation marks are appropriately matched.
• Verify that floating-point strings are in either standard floating-point format, such as 12.123, or an
exponential format, such as 1.0E4.
• Verify that all timestamp and date strings follow the specifications for DATEFORMAT and
TIMEFORMAT strings (p. 620). The default timestamp format is YYYY-MM-DD hh:mm:ss, and the
default date format is YYYY-MM-DD.
• For more information about boundaries and limitations on individual data types, see Data
types (p. 478). For information about multibyte character errors, see Multibyte character load
errors (p. 94)

Loading data from Amazon S3


Topics
• Loading data from compressed and uncompressed files (p. 68)
• Uploading files to Amazon S3 (p. 69)
• Using the COPY command to load from Amazon S3 (p. 72)

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture
to read and load data in parallel from a file or multiple files in an Amazon S3 bucket. You can take
maximum advantage of parallel processing by splitting your data into multiple files, in cases where the
files are compressed. You can also take maximum advantage of parallel processing by setting distribution
keys on your tables. For more information about distribution keys, see Working with data distribution
styles (p. 50).

Data is loaded into the target table, one line per row. The fields in the data file are matched to table
columns in order, left to right. Fields in the data files can be fixed-width or character delimited; the
default delimiter is a pipe (|). By default, all the table columns are loaded, but you can optionally define
a comma-separated list of columns. If a table column is not included in the column list specified in the
COPY command, it is loaded with a default value. For more information, see Loading default column
values (p. 91).

Loading data from compressed and uncompressed files


When you load compressed data, we recommend that you split the data for each table into multiple files.
When you load uncompressed, delimited data, the COPY command uses massively parallel processing
(MPP) and scan ranges to load data from large files in an Amazon S3 bucket.

Loading data from multiple compressed files


In cases where you have compressed data, we recommend that you split the data for each table into
multiple files. The COPY command can load data from multiple files in parallel. You can load multiple
files by specifying a common prefix, or prefix key, for the set, or by explicitly listing the files in a manifest
file.

Split your data into files so that the number of files is a multiple of the number of slices in your cluster.
That way, Amazon Redshift can divide the data evenly among the slices. The number of slices per node
depends on the node size of the cluster. For example, each ds2.xl compute node has two slices, and each
ds2.8xl compute node has 32 slices. For more information about the number of slices that each node size
has, see About clusters and nodes in the Amazon Redshift Cluster Management Guide.

The nodes all participate in running parallel queries, working on data that is distributed as evenly as
possible across the slices. If you have a cluster with two ds2.xl nodes, you might split your data into four
files or some multiple of four. Amazon Redshift doesn't take file size into account when dividing the
workload. Thus, you need to ensure that the files are roughly the same size, from 1 MB to 1 GB after
compression.

To use object prefixes to identify the load files, name each file with a common prefix. For example, you
might split the venue.txt file into four files, as follows.

venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4

If you put multiple files in a folder in your bucket and specify the folder name as the prefix, COPY loads
all of the files in the folder. If you explicitly list the files to be loaded by using a manifest file, the files can
reside in different buckets or folders.

For more information about manifest files, see Example: COPY from Amazon S3 using a
manifest (p. 624).

Loading data from uncompressed, delimited files


When you load uncompressed, delimited data, the COPY command uses the massively parallel
processing (MPP) architecture in Amazon Redshift. Amazon Redshift automatically uses slices working in
parallel to load ranges of data from a large file in an Amazon S3 bucket. The file must be delimited (for
example, pipe-delimited) for parallel loading to occur. Automatic, parallel data loading with the COPY
command is also available for CSV files. You can also take advantage of parallel processing by setting
distribution keys on your tables. For more information about distribution keys, see Working with data
distribution styles (p. 50).

Automatic, parallel data loading isn't supported when the COPY query includes any of the following
keywords: ESCAPE, REMOVEQUOTES, and FIXEDWIDTH.

Data from the file or files is loaded into the target table, one line per row. The fields in the data file are
matched to table columns in order, left to right. Fields in the data files can be fixed-width or character
delimited; the default delimiter is a pipe (|). By default, all the table columns are loaded, but you can
optionally define a comma-separated list of columns. If a table column isn't included in the column list
specified in the COPY command, it's loaded with a default value. For more information, see Loading
default column values (p. 91).

Follow this general process to load data from Amazon S3, when your data is uncompressed and
delimited:

1. Upload your files to Amazon S3.


2. Run a COPY command to load the table.
3. Verify that the data was loaded correctly.

For examples of COPY commands, see COPY examples (p. 622). For information about data loaded into
Amazon Redshift, check the STL_LOAD_COMMITS (p. 1304) and STL_LOAD_ERRORS (p. 1307) system
tables.

For more information about nodes and the slices contained in each, see About clusters and nodes in the
Amazon Redshift Cluster Management Guide.

Uploading files to Amazon S3


Topics


• Managing data consistency (p. 70)


• Uploading encrypted data to Amazon S3 (p. 70)
• Verifying that the correct files are present in your bucket (p. 71)

There are a couple of approaches to take when uploading text files to Amazon S3:

• If you have compressed files, we recommend that you split large files to take advantage of parallel
processing in Amazon Redshift.
• On the other hand, COPY automatically splits large, uncompressed, text-delimited file data to facilitate
parallelism and effectively distribute the data from large files.


Create an Amazon S3 bucket to hold your data files, and then upload the data files to the bucket. For
information about creating buckets and uploading files, see Working with Amazon S3 Buckets in the
Amazon Simple Storage Service User Guide.
Important
The Amazon S3 bucket that holds the data files must be created in the same AWS Region as
your cluster unless you use the REGION (p. 583) option to specify the Region in which the
Amazon S3 bucket is located.

You can create an Amazon S3 bucket in a specific Region either by selecting the Region when you create
the bucket by using the Amazon S3 console, or by specifying an endpoint when you create the bucket
using the Amazon S3 API or CLI.

After you upload your files, verify that the correct files are present on Amazon S3.

Managing data consistency


Amazon S3 provides strong read-after-write consistency for COPY, UNLOAD, INSERT (external table),
CREATE EXTERNAL TABLE AS, and Amazon Redshift Spectrum operations on Amazon S3 buckets in
all AWS Regions. In addition, read operations on Amazon S3 Select, Amazon S3 Access Control Lists,
Amazon S3 Object Tags, and object metadata (for example, HEAD object) are strongly consistent. For
more information about data consistency, see Amazon S3 Data Consistency Model in the Amazon Simple
Storage Service User Guide.

Uploading encrypted data to Amazon S3


Amazon S3 supports both server-side encryption and client-side encryption. This topic discusses the
differences between the server-side and client-side encryption and describes the steps to use client-side
encryption with Amazon Redshift. Server-side encryption is transparent to Amazon Redshift.

Server-side encryption

Server-side encryption is data encryption at rest—that is, Amazon S3 encrypts your data as it uploads
it and decrypts it for you when you access it. When you load tables using a COPY command, there is no
difference in the way you load from server-side encrypted or unencrypted objects on Amazon S3. For
more information about server-side encryption, see Using Server-Side Encryption in the Amazon Simple
Storage Service User Guide.

Client-side encryption

In client-side encryption, your client application manages encryption of your data, the encryption keys,
and related tools. You can upload data to an Amazon S3 bucket using client-side encryption, and then
load the data using the COPY command with the ENCRYPTED option and a private encryption key to
provide greater security.
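
For example, a COPY command for client-side encrypted files might look like the following sketch, where
the root symmetric key is supplied as a base64-encoded placeholder value.

copy customer
from 's3://mybucket/encrypted/customer'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<base64-encoded-root-key>'
encrypted;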


You encrypt your data using envelope encryption. With envelope encryption, your application handles
all encryption exclusively. Your private encryption keys and your unencrypted data are never sent to
AWS, so it's very important that you safely manage your encryption keys. If you lose your encryption
keys, you won't be able to unencrypt your data, and you can't recover your encryption keys from AWS.
Envelope encryption combines the performance of fast symmetric encryption while maintaining the
greater security that key management with asymmetric keys provides. A one-time-use symmetric key
(the envelope symmetric key) is generated by your Amazon S3 encryption client to encrypt your data,
then that key is encrypted by your root key and stored alongside your data in Amazon S3. When Amazon
Redshift accesses your data during a load, the encrypted symmetric key is retrieved and decrypted with
your root key, then the data is decrypted.

To work with Amazon S3 client-side encrypted data in Amazon Redshift, follow the steps outlined in
Protecting Data Using Client-Side Encryption in the Amazon Simple Storage Service User Guide, with the
additional requirements that you use:

• Symmetric encryption – The AWS SDK for Java AmazonS3EncryptionClient class uses envelope
encryption, described preceding, which is based on symmetric key encryption. Use this class to create
an Amazon S3 client to upload client-side encrypted data.
• A 256-bit AES root symmetric key – A root key encrypts the envelope key. You pass the root key to
your instance of the AmazonS3EncryptionClient class. Save this key, because you will need it to
copy data into Amazon Redshift.
• Object metadata to store encrypted envelope key – By default, Amazon S3 stores the envelope key
as object metadata for the AmazonS3EncryptionClient class. The encrypted envelope key that is
stored as object metadata is used during the decryption process.

Note
If you get a cipher encryption error message when you use the encryption API for the first time,
your version of the JDK may have a Java Cryptography Extension (JCE) jurisdiction policy file
that limits the maximum key length for encryption and decryption transformations to 128 bits.
For information about addressing this issue, go to Specifying Client-Side Encryption Using the
AWS SDK for Java in the Amazon Simple Storage Service User Guide.

For information about loading client-side encrypted files into your Amazon Redshift tables using the
COPY command, see Loading encrypted data files from Amazon S3 (p. 75).

Example: Uploading client-side encrypted data


For an example of how to use the AWS SDK for Java to upload client-side encrypted data, go to
Protecting data using client-side encryption in the Amazon Simple Storage Service User Guide.

The second option in that topic shows the choices you must make during client-side encryption so that
the data can be loaded into Amazon Redshift. Specifically, the example shows using object metadata to
store the encrypted envelope key and the use of a 256-bit AES root symmetric key.

The example provides code that uses the AWS SDK for Java to create a 256-bit AES symmetric root
key and save it to a file. Then the example uploads an object to Amazon S3 using an S3 encryption client
that first encrypts sample data on the client side. The example also downloads the object and verifies
that the data is the same.

Verifying that the correct files are present in your bucket


After you upload your files to your Amazon S3 bucket, we recommend listing the contents of the
bucket to verify that all of the correct files are present and that no unwanted files are present. For
example, if the bucket mybucket holds a file named venue.txt.back, that file will be loaded, perhaps
unintentionally, by the following command:

copy venue from 's3://mybucket/venue' … ;


If you want to control specifically which files are loaded, you can use a manifest file to
explicitly list the data files. For more information about using a manifest file, see the
copy_from_s3_manifest_file (p. 582) option for the COPY command and Example: COPY from Amazon
S3 using a manifest (p. 624) in the COPY examples.

For more information about listing the contents of the bucket, see Listing Object Keys in the Amazon S3
Developer Guide.

Using the COPY command to load from Amazon S3


Topics
• Using a manifest to specify data files (p. 73)
• Loading compressed data files from Amazon S3 (p. 74)
• Loading fixed-width data from Amazon S3 (p. 74)
• Loading multibyte data from Amazon S3 (p. 75)
• Loading encrypted data files from Amazon S3 (p. 75)

Use the COPY (p. 576) command to load a table in parallel from data files on Amazon S3. You can
specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.

The syntax to specify the files to be loaded by using a prefix is as follows:

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
authorization;

The manifest file is a JSON-formatted file that lists the data files to be loaded. The syntax to specify the
files to be loaded by using a manifest file is as follows:

copy <table_name> from 's3://<bucket_name>/<manifest_file>'
authorization
manifest;

The table to be loaded must already exist in the database. For information about creating a table, see
CREATE TABLE (p. 708) in the SQL Reference.

The values for authorization provide the AWS authorization your cluster needs to access the Amazon
S3 objects. For information about required permissions, see IAM permissions for COPY, UNLOAD,
and CREATE LIBRARY (p. 614). The preferred method for authentication is to specify the IAM_ROLE
parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary
permissions. For more information, see Role-based access control (p. 611).

To authenticate using the IAM_ROLE parameter, replace <aws-account-id> and <role-name> as shown
in the following syntax.

IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'

The following example shows authentication using an IAM role.

copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

For more information about other authorization options, see Authorization parameters (p. 590).

If you want to validate your data without actually loading the table, use the NOLOAD option with the
COPY (p. 576) command.


The following example shows the first few rows of pipe-delimited data in a file named venue.txt.

1|Toyota Park|Bridgeview|IL|0
2|Columbus Crew Stadium|Columbus|OH|0
3|RFK Stadium|Washington|DC|0

Before uploading the file to Amazon S3, split the file into multiple files so that the COPY command
can load it using parallel processing. The number of files should be a multiple of the number of slices
in your cluster. Split your load data files so that the files are about equal size, between 1 MB and 1
GB after compression. For more information, see Loading data from compressed and uncompressed
files (p. 68).

For example, the venue.txt file might be split into four files, as follows:

venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4

The following COPY command loads the VENUE table using the pipe-delimited data in the data files with
the prefix 'venue' in the Amazon S3 bucket mybucket.
Note
The Amazon S3 bucket mybucket in the following examples does not exist. For sample COPY
commands that use real data in an existing Amazon S3 bucket, see Load sample data.

copy venue from 's3://mybucket/venue'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

If no Amazon S3 objects with the key prefix 'venue' exist, the load fails.

Using a manifest to specify data files


You can use a manifest to ensure that the COPY command loads all of the required files, and only the
required files, for a data load. You can use a manifest to load files from different buckets or files that do
not share the same prefix. Instead of supplying an object path for the COPY command, you supply the
name of a JSON-formatted text file that explicitly lists the files to be loaded. The URL in the manifest
must specify the bucket name and full object path for the file, not just a prefix.

For more information about manifest files, see the COPY example Using a manifest to specify data
files (p. 624).

The following example shows the JSON to load files from different buckets and with file names that
begin with date stamps.

{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}

The optional mandatory flag specifies whether COPY should return an error if the file is not found. The
default of mandatory is false. Regardless of any mandatory settings, COPY will terminate if no files
are found.


The following example runs the COPY command with the manifest in the previous example, which is
named cust.manifest.

copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;

Using a manifest created by UNLOAD

A manifest created by an UNLOAD (p. 850) operation using the MANIFEST parameter might have keys
that are not required for the COPY operation. For example, the following UNLOAD manifest includes a
meta key that is required for an Amazon Redshift Spectrum external table and for loading data files
in an ORC or Parquet file format. The meta key contains a content_length key with a value that
is the actual size of the file in bytes. The COPY operation requires only the url key and an optional
mandatory key.

{
  "entries": [
    {"url":"s3://mybucket/unload/manifest_0000_part_00", "meta": { "content_length": 5956875 }},
    {"url":"s3://mybucket/unload/unload/manifest_0001_part_00", "meta": { "content_length": 5997091 }}
  ]
}

For more information about manifest files, see Example: COPY from Amazon S3 using a
manifest (p. 624).

Loading compressed data files from Amazon S3


To load data files that are compressed using gzip, lzop, or bzip2, include the corresponding option: GZIP,
LZOP, or BZIP2.

COPY does not support files compressed using the lzop --filter option.

For example, the following command loads from files that were compressed using lzop.

copy customer from 's3://mybucket/customer.lzo'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' lzop;

Loading fixed-width data from Amazon S3


Fixed-width data files have uniform lengths for each column of data. Each field in a fixed-width data file
has exactly the same length and position. For character data (CHAR and VARCHAR) in a fixed-width data
file, you must include leading or trailing spaces as placeholders in order to keep the width uniform. For
integers, you must use leading zeros as placeholders. A fixed-width data file has no delimiter to separate
columns.

To load a fixed-width data file into an existing table, use the FIXEDWIDTH parameter in the COPY
command. Your table specifications must match the value of fixedwidth_spec in order for the data to
load correctly.

To load fixed-width data from a file to a table, issue the following command:

copy table_name from 's3://mybucket/prefix'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'fixedwidth_spec';

The fixedwidth_spec parameter is a string that contains an identifier for each column and the width of
each column, separated by a colon. The column:width pairs are delimited by commas. The identifier
can be anything that you choose: numbers, letters, or a combination of the two. The identifier has no
relation to the table itself, but the specification must list the columns in the same order as the table
definition.

The following two examples show the same specification, with the first using numeric identifiers and the
second using string identifiers:

'0:3,1:25,2:12,3:2,4:6'

'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6'

The following example shows fixed-width sample data that could be loaded into the VENUE table using
the above specifications:

1  Toyota Park              Bridgeview  IL0
2  Columbus Crew Stadium    Columbus    OH0
3  RFK Stadium              Washington  DC0
4  CommunityAmerica BallparkKansas City KS0
5  Gillette Stadium         Foxborough  MA68756

The following COPY command loads this data set into the VENUE table:

copy venue
from 's3://mybucket/data/venue_fw.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';

Loading multibyte data from Amazon S3


If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must
load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters,
but the CHAR data type only accepts single-byte ASCII characters. You cannot load five-byte or longer
characters into Amazon Redshift tables. For more information about CHAR and VARCHAR, see Data
types (p. 478).
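
For example, the following is a minimal sketch of a table definition sized for multibyte data; the table
and column names are hypothetical. Because the length of a VARCHAR column is specified in bytes,
allow up to four bytes per character when sizing a column that will hold multibyte UTF-8 data.

create table product_names (
    product_id integer,
    -- VARCHAR length is in bytes, so 100 bytes allows roughly 25
    -- four-byte UTF-8 characters (for example, Chinese characters).
    product_name varchar(100)
);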

To check which encoding an input file uses, use the Linux file command:

$ file ordersdata.txt
ordersdata.txt: ASCII English text
$ file uni_ordersdata.dat
uni_ordersdata.dat: UTF-8 Unicode text

Loading encrypted data files from Amazon S3


You can use the COPY command to load data files that were uploaded to Amazon S3 using server-side
encryption, client-side encryption, or both.

The COPY command supports the following types of Amazon S3 encryption:

• Server-side encryption with Amazon S3-managed keys (SSE-S3)


• Server-side encryption with AWS KMS keys (SSE-KMS)
• Client-side encryption using a client-side symmetric root key


The COPY command doesn't support the following types of Amazon S3 encryption:

• Server-side encryption with customer-provided keys (SSE-C)


• Client-side encryption using an AWS KMS key
• Client-side encryption using a customer-provided asymmetric root key

For more information about Amazon S3 encryption, see Protecting Data Using Server-Side Encryption
and Protecting Data Using Client-Side Encryption in the Amazon Simple Storage Service User Guide.

The UNLOAD (p. 850) command automatically encrypts files using SSE-S3. You can also unload using
SSE-KMS or client-side encryption with a customer-managed symmetric key. For more information, see
Unloading encrypted data files (p. 145).
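
As a minimal sketch, an UNLOAD that encrypts the output files with SSE-KMS might look like the
following; the bucket, prefix, and KMS key ID shown here are placeholders, not values from this guide.

unload ('select * from venue')
to 's3://mybucket/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
-- hypothetical KMS key ID; KMS_KEY_ID requires the ENCRYPTED option
kms_key_id '1234abcd-12ab-34cd-56ef-1234567890ab'
encrypted;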

The COPY command automatically recognizes and loads files encrypted using SSE-S3 and SSE-KMS. You
can load files encrypted using a client-side symmetric root key by specifying the ENCRYPTED option and
providing the key value. For more information, see Uploading encrypted data to Amazon S3 (p. 70).

To load client-side encrypted data files, provide the root key value using the MASTER_SYMMETRIC_KEY
parameter and include the ENCRYPTED option.

copy customer from 's3://mybucket/encrypted/customer'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
encrypted
delimiter '|';

To load encrypted data files that are gzip, lzop, or bzip2 compressed, include the GZIP, LZOP, or BZIP2
option along with the root key value and the ENCRYPTED option.

copy customer from 's3://mybucket/encrypted/customer'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
encrypted
delimiter '|'
gzip;

Loading data from Amazon EMR


You can use the COPY command to load data in parallel from an Amazon EMR cluster configured to
write text files to the cluster's Hadoop Distributed File System (HDFS) in the form of fixed-width files,
character-delimited files, CSV files, or JSON-formatted files.

Process for loading data from Amazon EMR


This section walks you through the process of loading data from an Amazon EMR cluster. The following
sections provide the details you need to accomplish each step.

• Step 1: Configure IAM permissions (p. 77)

The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must
have the necessary permissions.
• Step 2: Create an Amazon EMR cluster (p. 77)

Configure the cluster to output text files to the Hadoop Distributed File System (HDFS). You will need
the Amazon EMR cluster ID and the cluster's master public DNS (the endpoint for the Amazon EC2
instance that hosts the cluster).


• Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses (p. 78)

The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the hosts.
You will use the IP address for each cluster node to configure the host security groups to permit access
from your Amazon Redshift cluster using these IP addresses.
• Step 4: Add the Amazon Redshift cluster public key to each Amazon EC2 host's authorized keys
file (p. 79)

You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will
recognize the Amazon Redshift cluster and accept the SSH connection.
• Step 5: Configure the hosts to accept all of the Amazon Redshift cluster's IP addresses (p. 80)

Modify the Amazon EMR instance's security groups to add ingress rules to accept the Amazon Redshift
IP addresses.
• Step 6: Run the COPY command to load the data (p. 80)

From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift
table.

Step 1: Configure IAM permissions


The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must have
the necessary permissions.

To configure IAM permissions

1. Add the following permissions for the IAM user that will create the Amazon EMR cluster.

ec2:DescribeSecurityGroups
ec2:RevokeSecurityGroupIngress
ec2:AuthorizeSecurityGroupIngress
redshift:DescribeClusters

2. Add the following permission for the IAM role or IAM user that will run the COPY command.

elasticmapreduce:ListInstances

3. Add the following permission to the Amazon EMR cluster's IAM role.

redshift:DescribeClusters

Step 2: Create an Amazon EMR cluster


The COPY command loads data from files on the Amazon EMR Hadoop Distributed File System (HDFS).
When you create the Amazon EMR cluster, configure the cluster to output data files to the cluster's
HDFS.

To create an Amazon EMR cluster

1. Create an Amazon EMR cluster in the same AWS Region as the Amazon Redshift cluster.

If the Amazon Redshift cluster is in a VPC, the Amazon EMR cluster must be in the same VPC group.
If the Amazon Redshift cluster uses EC2-Classic mode (that is, it is not in a VPC), the Amazon EMR
cluster must also use EC2-Classic mode. For more information, see Managing Clusters in Virtual
Private Cloud (VPC) in the Amazon Redshift Cluster Management Guide.


2. Configure the cluster to output data files to the cluster's HDFS.
Important
The HDFS file names must not include asterisks (*) or question marks (?).
3. Specify No for the Auto-terminate option in the Amazon EMR cluster configuration so that the
cluster remains available while the COPY command runs.
Important
If any of the data files are changed or deleted before the COPY completes, you might have
unexpected results, or the COPY operation might fail.
4. Note the cluster ID and the master public DNS (the endpoint for the Amazon EC2 instance that hosts
the cluster). You will use that information in later steps.

Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses
To retrieve the Amazon Redshift cluster public key and cluster node IP addresses for your
cluster using the console

1. Access the Amazon Redshift Management Console.
2. Click the Clusters link in the navigation pane.
3. Select your cluster from the list.
4. Locate the SSH Ingestion Settings group.

Note the Cluster Public Key and Node IP addresses. You will use them in later steps.

You will use the Private IP addresses in Step 5 to configure the Amazon EC2 hosts to accept the
connection from Amazon Redshift.

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift CLI, run the describe-clusters command. For example:


aws redshift describe-clusters --cluster-identifier <cluster-identifier>

The response will include a ClusterPublicKey value and the list of private and public IP addresses, similar
to the following:

{
"Clusters": [
{
"VpcSecurityGroups": [],
"ClusterStatus": "available",
"ClusterNodes": [
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "LEADER",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-0",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-1",
"PublicIPAddress": "10.nnn.nnn.nnn"
}
],
"AutomatedSnapshotRetentionPeriod": 1,
"PreferredMaintenanceWindow": "wed:05:30-wed:06:00",
"AvailabilityZone": "us-east-1a",
"NodeType": "ds2.xlarge",
"ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift",
...
...
}

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift API, use the DescribeClusters action. For more information, see describe-clusters in the
Amazon Redshift CLI Guide or DescribeClusters in the Amazon Redshift API Guide.

Step 4: Add the Amazon Redshift cluster public key to each Amazon EC2 host's authorized keys file
You add the cluster public key to each host's authorized keys file for all of the Amazon EMR cluster nodes
so that the hosts will recognize Amazon Redshift and accept the SSH connection.

To add the Amazon Redshift cluster public key to the host's authorized keys file

1. Access the host using an SSH connection.

For information about connecting to an instance using SSH, see Connect to Your Instance in the
Amazon EC2 User Guide.
2. Copy the Amazon Redshift public key from the console or from the CLI response text.
3. Copy and paste the contents of the public key into the /home/<ssh_username>/.ssh/
authorized_keys file on the host. Include the complete string, including the prefix "ssh-rsa "
and suffix "Amazon-Redshift". For example:


ssh-rsa AAAACTP3isxgGzVWoIWpbVvRCOzYdVifMrh… uA70BnMHCaMiRdmvsDOedZDOedZ Amazon-Redshift

Step 5: Configure the hosts to accept all of the Amazon Redshift cluster's IP addresses
To allow inbound traffic to the host instances, edit the security group and add one Inbound rule for each
Amazon Redshift cluster node. For Type, select SSH with TCP protocol on Port 22. For Source, enter the
Amazon Redshift cluster node Private IP addresses you retrieved in Step 3: Retrieve the Amazon Redshift
cluster public key and cluster node IP addresses (p. 78). For information about adding rules to an
Amazon EC2 security group, see Authorizing Inbound Traffic for Your Instances in the Amazon EC2 User
Guide.

Step 6: Run the COPY command to load the data


Run a COPY (p. 576) command to connect to the Amazon EMR cluster and load the data into an
Amazon Redshift table. The Amazon EMR cluster must continue running until the COPY command
completes. For example, do not configure the cluster to auto-terminate.
Important
If any of the data files are changed or deleted before the COPY completes, you might have
unexpected results, or the COPY operation might fail.

In the COPY command, specify the Amazon EMR cluster ID and the HDFS file path and file name.

copy sales
from 'emr://myemrclusterid/myoutput/part*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

You can use the wildcard characters asterisk ( * ) and question mark ( ? ) as part of the file name
argument. For example, part* loads the files part-0000, part-0001, and so on. If you specify only a
folder name, COPY attempts to load all files in the folder.
Important
If you use wildcard characters or use only the folder name, verify that no unwanted files will be
loaded or the COPY command will fail. For example, some processes might write a log file to the
output folder.

Loading data from remote hosts


You can use the COPY command to load data in parallel from one or more remote hosts, such as Amazon
EC2 instances or other computers. COPY connects to the remote hosts using SSH and runs commands on
the remote hosts to generate text output.

The remote host can be an Amazon EC2 Linux instance or another Unix or Linux computer configured
to accept SSH connections. This guide assumes your remote host is an Amazon EC2 instance. Where the
procedure is different for another computer, the guide will point out the difference.

Amazon Redshift can connect to multiple hosts, and can open multiple SSH connections to each host.
Amazon Redshift sends a unique command through each connection to generate text output to the
host's standard output, which Amazon Redshift then reads as it would a text file.

Before you begin


Before you begin, you should have the following in place:


• One or more host machines, such as Amazon EC2 instances, that you can connect to using SSH.
• Data sources on the hosts.

You will provide commands that the Amazon Redshift cluster will run on the hosts to generate the text
output. After the cluster connects to a host, the COPY command runs the commands, reads the text
from the hosts' standard output, and loads the data in parallel into an Amazon Redshift table. The text
output must be in a form that the COPY command can ingest. For more information, see Preparing
your input data (p. 67)
• Access to the hosts from your computer.

For an Amazon EC2 instance, you will use an SSH connection to access the host. You will need to access
the host to add the Amazon Redshift cluster's public key to the host's authorized keys file.
• A running Amazon Redshift cluster.

For information about how to launch a cluster, see Amazon Redshift Getting Started Guide.

Loading data process


This section walks you through the process of loading data from remote hosts. The following sections
provide the details you need to accomplish each step.

• Step 1: Retrieve the cluster public key and cluster node IP addresses (p. 82)

The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the remote
hosts. You will use the IP address for each cluster node to configure the host security groups or firewall
to permit access from your Amazon Redshift cluster using these IP addresses.
• Step 2: Add the Amazon Redshift cluster public key to the host's authorized keys file (p. 83)

You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will
recognize the Amazon Redshift cluster and accept the SSH connection.
• Step 3: Configure the host to accept all of the Amazon Redshift cluster's IP addresses (p. 83)

For Amazon EC2, modify the instance's security groups to add ingress rules to accept the Amazon
Redshift IP addresses. For other hosts, modify the firewall so that your Amazon Redshift nodes are
able to establish SSH connections to the remote host.
• Step 4: Get the public key for the host (p. 84)

You can optionally specify that Amazon Redshift should use the public key to identify the host. You
will need to locate the public key and copy the text into your manifest file.
• Step 5: Create a manifest file (p. 84)

The manifest is a JSON-formatted text file with the details Amazon Redshift needs to connect to the
hosts and fetch the data.
• Step 6: Upload the manifest file to an Amazon S3 bucket (p. 86)

Amazon Redshift reads the manifest and uses that information to connect to the remote host. If the
Amazon S3 bucket does not reside in the same Region as your Amazon Redshift cluster, you must use
the REGION (p. 583) option to specify the Region in which the data is located.
• Step 7: Run the COPY command to load the data (p. 86)

From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift
table.


Step 1: Retrieve the cluster public key and cluster node IP addresses
To retrieve the cluster public key and cluster node IP addresses for your cluster using the
console

1. Access the Amazon Redshift Management Console.
2. Click the Clusters link in the navigation pane.
3. Select your cluster from the list.
4. Locate the SSH Ingestion Settings group.

Note the Cluster Public Key and Node IP addresses. You will use them in later steps.

You will use the IP addresses in Step 3 to configure the host to accept the connection from Amazon
Redshift. Depending on what type of host you connect to and whether it is in a VPC, you will use
either the public IP addresses or the private IP addresses.

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift CLI, run the describe-clusters command.

For example:

aws redshift describe-clusters --cluster-identifier <cluster-identifier>

The response will include the ClusterPublicKey and the list of Private and Public IP addresses, similar to
the following:

{
"Clusters": [
{
"VpcSecurityGroups": [],
"ClusterStatus": "available",
"ClusterNodes": [


{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "LEADER",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-0",
"PublicIPAddress": "10.nnn.nnn.nnn"
},
{
"PrivateIPAddress": "10.nnn.nnn.nnn",
"NodeRole": "COMPUTE-1",
"PublicIPAddress": "10.nnn.nnn.nnn"
}
],
"AutomatedSnapshotRetentionPeriod": 1,
"PreferredMaintenanceWindow": "wed:05:30-wed:06:00",
"AvailabilityZone": "us-east-1a",
"NodeType": "ds2.xlarge",
"ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift",
...
...
}

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon
Redshift API, use the DescribeClusters action. For more information, see describe-clusters in the Amazon
Redshift CLI Guide or DescribeClusters in the Amazon Redshift API Guide.

Step 2: Add the Amazon Redshift cluster public key to the host's authorized keys file
You add the cluster public key to each host's authorized keys file so that the host will recognize Amazon
Redshift and accept the SSH connection.

To add the Amazon Redshift cluster public key to the host's authorized keys file

1. Access the host using an SSH connection.

For information about connecting to an instance using SSH, see Connect to Your Instance in the
Amazon EC2 User Guide.
2. Copy the Amazon Redshift public key from the console or from the CLI response text.
3. Copy and paste the contents of the public key into the /home/<ssh_username>/.ssh/
authorized_keys file on the remote host. The <ssh_username> must match the value for the
"username" field in the manifest file. Include the complete string, including the prefix "ssh-rsa "
and suffix "Amazon-Redshift". For example:

ssh-rsa AAAACTP3isxgGzVWoIWpbVvRCOzYdVifMrh… uA70BnMHCaMiRdmvsDOedZDOedZ Amazon-Redshift

Step 3: Configure the host to accept all of the Amazon Redshift cluster's IP addresses
If you are working with an Amazon EC2 instance or an Amazon EMR cluster, add Inbound rules to the
host's security group to allow traffic from each Amazon Redshift cluster node. For Type, select SSH with
TCP protocol on Port 22. For Source, enter the Amazon Redshift cluster node IP addresses you retrieved
in Step 1: Retrieve the cluster public key and cluster node IP addresses (p. 82). For information about


adding rules to an Amazon EC2 security group, see Authorizing Inbound Traffic for Your Instances in the
Amazon EC2 User Guide.

Use the Private IP addresses when:

• You have an Amazon Redshift cluster that is not in a Virtual Private Cloud (VPC), and an Amazon
EC2-Classic instance, both of which are in the same AWS Region.
• You have an Amazon Redshift cluster that is in a VPC, and an Amazon EC2-VPC instance, both of
which are in the same AWS Region and in the same VPC.

Otherwise, use the Public IP addresses.

For more information about using Amazon Redshift in a VPC, see Managing Clusters in Virtual Private
Cloud (VPC) in the Amazon Redshift Cluster Management Guide.

Step 4: Get the public key for the host


You can optionally provide the host's public key in the manifest file so that Amazon Redshift can identify
the host. The COPY command does not require the host public key but, for security reasons, we strongly
recommend using a public key to help prevent 'man-in-the-middle' attacks.

You can find the host's public key in the following location, where <ssh_host_rsa_key_name> is the
unique name for the host's public key:

/etc/ssh/<ssh_host_rsa_key_name>.pub

Note
Amazon Redshift only supports RSA keys. We do not support DSA keys.

When you create your manifest file in Step 5, you will paste the text of the public key into the "Public
Key" field in the manifest file entry.

Step 5: Create a manifest file


The COPY command can connect to multiple hosts using SSH, and can create multiple SSH connections
to each host. COPY runs a command through each host connection, and then loads the output from the
commands in parallel into the table. The manifest file is a text file in JSON format that Amazon Redshift
uses to connect to the host. The manifest file specifies the SSH host endpoints and the commands that
are run on the hosts to return data to Amazon Redshift. Optionally, you can include the host public key,
the login user name, and a mandatory flag for each entry.

Create the manifest file on your local computer. In a later step, you upload the file to Amazon S3.

The manifest file is in the following format:

{
  "entries": [
    {"endpoint":"<ssh_endpoint_or_IP>",
     "command": "<remote_command>",
     "mandatory":true,
     "publickey": "<public_key>",
     "username": "<host_user_name>"},
    {"endpoint":"<ssh_endpoint_or_IP>",
     "command": "<remote_command>",
     "mandatory":true,
     "publickey": "<public_key>",
     "username": "<host_user_name>"}
  ]
}

The manifest file contains one entry in the "entries" array for each SSH connection; each entry
represents a single SSH connection. You can have multiple connections to a single host or multiple
connections to multiple hosts. The double quotation marks are required as shown, both for the field
names and the values. The only value that does not require double quotation marks is the Boolean value
true or false for the mandatory field.

The following describes the fields in the manifest file.

endpoint

The URL address or IP address of the host. For example,
"ec2-111-222-333.compute-1.amazonaws.com" or "22.33.44.56".
command

The command that will be run by the host to generate text or binary (gzip, lzop, or bzip2) output.
The command can be any command that the user "host_user_name" has permission to run. The
command can be as simple as printing a file, or it could query a database or launch a script. The
output (text file, gzip binary file, lzop binary file, or bzip2 binary file) must be in a form the Amazon
Redshift COPY command can ingest. For more information, see Preparing your input data (p. 67).
publickey

(Optional) The public key of the host. If provided, Amazon Redshift will use the public key to identify
the host. If the public key is not provided, Amazon Redshift will not attempt host identification. For
example, if the remote host's public key is: ssh-rsa AbcCbaxxx…xxxDHKJ [email protected]
enter the following text in the publickey field: AbcCbaxxx…xxxDHKJ.
mandatory

(Optional) Indicates whether the COPY command should fail if the connection fails. The default is
false. If Amazon Redshift does not successfully make at least one connection, the COPY command
fails.
username

(Optional) The username that will be used to log on to the host system and run the remote
command. The user login name must be the same as the login that was used to add the public key to
the host's authorized keys file in Step 2. The default username is "redshift".

The following example shows a completed manifest to open four connections to the same host and run a
different command through each connection:

{
  "entries": [
    {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
     "command": "cat loaddata1.txt",
     "mandatory":true,
     "publickey": "ec2publickeyportionoftheec2keypair",
     "username": "ec2-user"},
    {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
     "command": "cat loaddata2.txt",
     "mandatory":true,
     "publickey": "ec2publickeyportionoftheec2keypair",
     "username": "ec2-user"},
    {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
     "command": "cat loaddata3.txt",
     "mandatory":true,
     "publickey": "ec2publickeyportionoftheec2keypair",
     "username": "ec2-user"},
    {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",
     "command": "cat loaddata4.txt",
     "mandatory":true,
     "publickey": "ec2publickeyportionoftheec2keypair",
     "username": "ec2-user"}
  ]
}

Step 6: Upload the manifest file to an Amazon S3 bucket


Upload the manifest file to an Amazon S3 bucket. If the Amazon S3 bucket does not reside in the same
AWS Region as your Amazon Redshift cluster, you must use the REGION (p. 583) option to specify the
AWS Region in which the manifest is located. For information about creating an Amazon S3 bucket and
uploading a file, see Amazon Simple Storage Service User Guide.

Step 7: Run the COPY command to load the data


Run a COPY (p. 576) command to connect to the host and load the data into an Amazon Redshift table.
In the COPY command, specify the explicit Amazon S3 object path for the manifest file and include the
SSH option. For example,

copy sales
from 's3://mybucket/ssh_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
ssh;

Note
If you use automatic compression, the COPY command performs two data reads, which means it
runs the remote command twice. The first read is to provide a sample for compression analysis,
then the second read actually loads the data. If executing the remote command twice might
cause a problem because of potential side effects, you should disable automatic compression.
To disable automatic compression, run the COPY command with the COMPUPDATE option set to
OFF. For more information, see Loading tables with automatic compression (p. 89).
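
As a minimal sketch that reuses the same hypothetical manifest as the preceding example, a COPY over
SSH with automatic compression disabled might look like the following.

copy sales
from 's3://mybucket/ssh_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
ssh
-- disable automatic compression so the remote command runs only once
compupdate off;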

Loading data from an Amazon DynamoDB table


You can use the COPY command to load a table with data from a single Amazon DynamoDB table.
Important
The Amazon DynamoDB table that provides the data must be created in the same AWS Region
as your cluster unless you use the REGION (p. 583) option to specify the AWS Region in which
the Amazon DynamoDB table is located.

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to
read and load data in parallel from an Amazon DynamoDB table. You can take maximum advantage of
parallel processing by setting distribution styles on your Amazon Redshift tables. For more information,
see Working with data distribution styles (p. 50).
Important
When the COPY command reads data from the Amazon DynamoDB table, the resulting data
transfer is part of that table's provisioned throughput.

To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not
load data from Amazon DynamoDB tables that are in production environments. If you do load data from
production tables, we recommend that you set the READRATIO option much lower than the average
percentage of unused provisioned throughput. A low READRATIO setting will help minimize throttling
issues. To use the entire provisioned throughput of an Amazon DynamoDB table, set READRATIO to 100.


The COPY command matches attribute names in the items retrieved from the DynamoDB table to
column names in an existing Amazon Redshift table by using the following rules:

• Amazon Redshift table columns are case-insensitively matched to Amazon DynamoDB item attributes.
If an item in the DynamoDB table contains multiple attributes that differ only in case, such as Price and
PRICE, the COPY command will fail.
• Amazon Redshift table columns that do not match an attribute in the Amazon DynamoDB table are
loaded as either NULL or empty, depending on the value specified with the EMPTYASNULL option in
the COPY (p. 576) command.
• Amazon DynamoDB attributes that do not match a column in the Amazon Redshift table are
discarded. Attributes are read before they are matched, and so even discarded attributes consume part
of that table's provisioned throughput.
• Only Amazon DynamoDB attributes with scalar STRING and NUMBER data types are supported. The
Amazon DynamoDB BINARY and SET data types are not supported. If a COPY command tries to load
an attribute with an unsupported data type, the command will fail. If the attribute does not match an
Amazon Redshift table column, COPY does not attempt to load it, and it does not raise an error.

The COPY command uses the following syntax to load data from an Amazon DynamoDB table:

copy <redshift_tablename> from 'dynamodb://<dynamodb_table_name>'
authorization
readratio '<integer>';

The values for authorization provide the AWS authorization your cluster needs to access the Amazon
DynamoDB table. The permissions must include SCAN and DESCRIBE for the Amazon DynamoDB table
that is being loaded; if the credentials correspond to an IAM user, that IAM user must have permission
to SCAN and DESCRIBE the table. For more information about required permissions, see IAM
permissions for COPY, UNLOAD, and CREATE LIBRARY (p. 614). The preferred method for
authentication is to specify the IAM_ROLE parameter and provide the Amazon Resource Name (ARN) for
an IAM role with the necessary permissions. For more information, see Role-based access
control (p. 611).

To authenticate using the IAM_ROLE parameter, replace <aws-account-id> and <role-name> as shown
in the following syntax.

IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'

The following example shows authentication using an IAM role.

copy favoritemovies
from 'dynamodb://ProductCatalog'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

For more information about other authorization options, see Authorization parameters (p. 590).

If you want to validate your data without actually loading the table, use the NOLOAD option with the
COPY (p. 576) command.

The following example loads the FAVORITEMOVIES table with data from the DynamoDB table my-
favorite-movies-table. The read activity can consume up to 50% of the provisioned throughput.

copy favoritemovies from 'dynamodb://my-favorite-movies-table'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;

To maximize throughput, the COPY command loads data from an Amazon DynamoDB table in parallel
across the compute nodes in the cluster.

Provisioned throughput with automatic compression


By default, the COPY command applies automatic compression whenever you specify an empty target
table with no compression encoding. The automatic compression analysis initially samples a large
number of rows from the Amazon DynamoDB table. The sample size is based on the value of the
COMPROWS parameter. The default is 100,000 rows per slice.

After sampling, the sample rows are discarded and the entire table is loaded. As a result, many rows
are read twice. For more information about how automatic compression works, see Loading tables with
automatic compression (p. 89).
Important
When the COPY command reads data from the Amazon DynamoDB table, including the rows
used for sampling, the resulting data transfer is part of that table's provisioned throughput.
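
A minimal sketch that combines these options might look like the following; the table name, role ARN,
and sample size are illustrative only.

copy favoritemovies from 'dynamodb://my-favorite-movies-table'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50
compupdate on
-- hypothetical sample size; COMPROWS controls how many rows are sampled per slice
comprows 1000000;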

Loading multibyte data from Amazon DynamoDB


If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must
load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters,
but the CHAR data type only accepts single-byte ASCII characters. You cannot load five-byte or longer
characters into Amazon Redshift tables. For more information about CHAR and VARCHAR, see Data
types (p. 478).

Verifying that the data loaded correctly


After the load operation is complete, query the STL_LOAD_COMMITS (p. 1304) system table to verify
that the expected files were loaded. Run the COPY command and load verification within the same
transaction so that if there is a problem with the load you can roll back the entire transaction.
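
For example, a minimal sketch of that pattern (table, bucket, and role ARN are placeholders) might look
like the following; roll back instead of committing if the verification query doesn't return the files you
expect.

begin;

copy venue from 's3://mybucket/venue'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

-- verify the load before making it permanent
select query, trim(filename) as filename, curtime, status
from stl_load_commits
where filename like '%venue%' order by query;

commit;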

The following query returns entries for loading the tables in the TICKIT database:

select query, trim(filename) as filename, curtime, status
from stl_load_commits
where filename like '%tickit%' order by query;

 query |         filename          |          curtime           | status
-------+---------------------------+----------------------------+--------
 22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 | 1
 22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 | 1
 22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 | 1
 22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 | 1
 22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  | 1
 22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 | 1
 22489 | tickit/sales_tab.txt      | 2013-02-08 20:58:37.632939 | 1
(7 rows)

Validating input data


To validate the data in the Amazon S3 input files or Amazon DynamoDB table before you actually load
the data, use the NOLOAD option with the COPY (p. 576) command. Use NOLOAD with the same COPY
commands and options you would use to actually load the data. NOLOAD checks the integrity of all of
the data without loading it into the database. The NOLOAD option displays any errors that would occur if
you had attempted to load the data.
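
A minimal sketch of a validation-only run (the bucket path and role ARN are placeholders) might look
like the following.

copy venue from 's3://mybucket/data/venue_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
-- validate the input files; no rows are written to the table
noload;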


For example, if you specified the incorrect Amazon S3 path for the input file, Amazon Redshift would
display the following error:

ERROR: No such file or directory
DETAIL:
-----------------------------------------------
Amazon Redshift error: The specified key does not exist
code: 2
context: S3 key being read :
location: step_scan.cpp:1883
process: xenmaster [pid=22199]
-----------------------------------------------

To troubleshoot error messages, see the Load error reference (p. 95).

Loading tables with automatic compression


Topics
• How automatic compression works (p. 89)
• Automatic compression example (p. 90)

You can apply compression encodings to columns in tables manually, based on your own evaluation
of the data. Or you can use the COPY command with COMPUPDATE set to ON to analyze and apply
compression automatically based on sample data.

You can use automatic compression when you create and load a brand new table. The COPY command
performs a compression analysis. You can also perform a compression analysis without loading data
or changing the compression on a table by running the ANALYZE COMPRESSION (p. 566) command
on an already populated table. For example, you can run ANALYZE COMPRESSION when you want to
analyze compression on a table for future use, while preserving the existing data definition language
(DDL) statements.
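
For example, a compression-only analysis of an already populated table (the LISTING table from the
TICKIT sample schema) might look like the following; the command reports suggested encodings
without changing the table.

-- report suggested column encodings for an existing, populated table
analyze compression listing;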

Automatic compression balances overall performance when choosing compression encodings. Range-
restricted scans might perform poorly if sort key columns are compressed much more highly than other
columns in the same query. As a result, automatic compression skips the data analyzing phase on the
sort key columns and keeps the user-defined encoding types.

For sort key columns, automatic compression chooses RAW encoding if you haven't explicitly defined a
type of encoding, and ANALYZE COMPRESSION behaves the same way. For optimal query performance,
consider using RAW encoding for sort key columns.

How automatic compression works


When the COMPUPDATE parameter is ON, the COPY command applies automatic compression whenever
you run the COPY command with an empty target table and all of the table columns either have RAW
encoding or no encoding.

To apply automatic compression to an empty table, regardless of its current compression encodings, run
the COPY command with the COMPUPDATE option set to ON. To disable automatic compression, run the
COPY command with the COMPUPDATE option set to OFF.

You cannot apply automatic compression to a table that already contains data.
Note
Automatic compression analysis requires enough rows in the load data (at least 100,000 rows
per slice) to generate a meaningful sample.

Automatic compression performs these operations in the background as part of the load transaction:


1. An initial sample of rows is loaded from the input file. Sample size is based on the value of the
COMPROWS parameter. The default is 100,000.
2. Compression options are chosen for each column.
3. The sample rows are removed from the table.
4. The table is recreated with the chosen compression encodings.
5. The entire input file is loaded and compressed using the new encodings.

After you run the COPY command, the table is fully loaded, compressed, and ready for use. If you load
more data later, appended rows are compressed according to the existing encoding.

If you only want to perform a compression analysis, run ANALYZE COMPRESSION, which is more
efficient than running a full COPY. Then you can evaluate the results to decide whether to use automatic
compression or recreate the table manually.

Automatic compression is supported only for the COPY command. Alternatively, you can manually apply
compression encoding when you create the table. For information about manual compression encoding,
see Working with column compression (p. 39).
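
For example, a minimal sketch of manually specified encodings at table creation time might look like
the following; the table name and encoding choices are illustrative, not recommendations.

create table sales_history (
    -- encodings chosen manually instead of by automatic compression
    sale_id   integer      encode az64,
    item_name varchar(100) encode zstd,
    sale_note varchar(200) encode lzo
);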

Automatic compression example


In this example, assume that the TICKIT database contains a copy of the LISTING table called BIGLIST,
and you want to apply automatic compression to this table when it is loaded with approximately 3
million rows.

To load and automatically compress the table

1. Ensure that the table is empty. You can apply automatic compression only to an empty table:

truncate biglist;

2. Load the table with a single COPY command. Although the table is empty, some earlier encoding
might have been specified. To ensure that Amazon Redshift performs a compression analysis, set the
COMPUPDATE parameter to ON.

copy biglist from 's3://mybucket/biglist.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' COMPUPDATE ON;

Because no COMPROWS option is specified, the default and recommended sample size of 100,000
rows per slice is used.
3. Look at the new schema for the BIGLIST table in order to review the automatically chosen encoding
schemes.

select "column", type, encoding
from pg_table_def where tablename = 'biglist';

     Column     |            Type             | Encoding
----------------+-----------------------------+----------
 listid         | integer                     | delta
 sellerid       | integer                     | delta32k
 eventid        | integer                     | delta32k
 dateid         | smallint                    | delta
 numtickets     | smallint                    | delta
 priceperticket | numeric(8,2)                | delta32k
 totalprice     | numeric(8,2)                | mostly32
 listtime       | timestamp without time zone | none


4. Verify that the expected number of rows were loaded:

select count(*) from biglist;

count
---------
3079952
(1 row)

When rows are later appended to this table using COPY or INSERT statements, the same compression
encodings are applied.

Optimizing storage for narrow tables


If you have a table with very few columns but a very large number of rows, the three hidden metadata
identity columns (INSERT_XID, DELETE_XID, ROW_ID) will consume a disproportionate amount of the
disk space for the table.

In order to optimize compression of the hidden columns, load the table in a single COPY transaction
where possible. If you load the table with multiple separate COPY commands, the INSERT_XID column
will not compress well. You will need to perform a vacuum operation if you use multiple COPY
commands, but it will not improve compression of INSERT_XID.
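
For example, rather than issuing one COPY per file, a single COPY that uses a manifest loads all of the
files in one transaction; the table and manifest names here are hypothetical.

copy narrow_fact
from 's3://mybucket/narrow_fact.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
-- one COPY transaction for all files keeps the hidden INSERT_XID column compressible
manifest;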

Loading default column values


You can optionally define a column list in your COPY command. If a column in the table is omitted from
the column list, COPY will load the column with either the value supplied by the DEFAULT option that
was specified in the CREATE TABLE command, or with NULL if the DEFAULT option was not specified.

If COPY attempts to assign NULL to a column that is defined as NOT NULL, the COPY command fails. For
information about assigning the DEFAULT option, see CREATE TABLE (p. 708).

When loading from data files on Amazon S3, the columns in the column list must be in the same order as
the fields in the data file. If a field in the data file does not have a corresponding column in the column
list, the COPY command fails.

When loading from Amazon DynamoDB table, order does not matter. Any fields in the Amazon
DynamoDB attributes that do not match a column in the Amazon Redshift table are discarded.
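
For example, a minimal sketch of a COPY with a column list (the data file is hypothetical) loads only the
listed columns; the omitted columns receive their DEFAULT values or NULL.

copy venue (venueid, venuename, venuecity)
from 's3://mybucket/data/venue_partial.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';
-- venuestate and venueseats are omitted from the column list, so they are
-- loaded with their DEFAULT values, or NULL if no default is defined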

The following restrictions apply when using the COPY command to load DEFAULT values into a table:

• If an IDENTITY (p. 711) column is included in the column list, the EXPLICIT_IDS option must also be
specified in the COPY (p. 576) command, or the COPY command will fail. Similarly, if an IDENTITY
column is omitted from the column list, and the EXPLICIT_IDS option is specified, the COPY operation
will fail.
• Because the evaluated DEFAULT expression for a given column is the same for all loaded rows, a
DEFAULT expression that uses a RANDOM() function will assign the same value to all the rows.
• DEFAULT expressions that contain CURRENT_DATE or SYSDATE are set to the timestamp of the current
transaction.

For an example, see "Load data from a file with default values" in COPY examples (p. 622).

Troubleshooting data loads


Topics


• S3ServiceException errors (p. 92)
• System tables for troubleshooting data loads (p. 93)
• Multibyte character load errors (p. 94)
• Load error reference (p. 95)

This section provides information about identifying and resolving data loading errors.

S3ServiceException errors
The most common S3ServiceException errors are caused by an improperly formatted or incorrect
credentials string, having your cluster and your bucket in different AWS Regions, or insufficient Amazon
S3 privileges.

This section provides troubleshooting information for each type of error.

Invalid credentials string


If your credentials string was improperly formatted, you will receive the following error message:

ERROR: Invalid credentials. Must be of the format: credentials
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>
[;token=<temporary-session-token>]'

Verify that the credentials string does not contain any spaces or line breaks, and is enclosed in single
quotation marks.

Invalid access key ID


If your access key ID does not exist, you will receive the following error message:

[Amazon](500310) Invalid operation: S3ServiceException:The AWS Access Key Id you provided
does not exist in our records.

This is often a copy and paste error. Verify that the access key ID was entered correctly. Also, if you are
using temporary session keys, check that the value for token is set.

Invalid secret access key


If your secret access key is incorrect, you will receive the following error message:

[Amazon](500310) Invalid operation: S3ServiceException:The request signature we calculated
does not match the signature you provided.
Check your key and signing method.,Status 403,Error SignatureDoesNotMatch

This is often a copy and paste error. Verify that the secret access key was entered correctly and that it is
the correct key for the access key ID.

Bucket is in a different Region


The Amazon S3 bucket specified in the COPY command must be in the same AWS Region as the cluster.
If your Amazon S3 bucket and your cluster are in different Regions, you will receive an error similar to the
following:

ERROR: S3ServiceException:The bucket you are attempting to access must be addressed using
the specified endpoint.


You can create an Amazon S3 bucket in a specific Region either by selecting the Region when you create
the bucket by using the Amazon S3 Management Console, or by specifying an endpoint when you
create the bucket using the Amazon S3 API or CLI. For more information, see Uploading files to Amazon
S3 (p. 69).

For more information about Amazon S3 regions, see Accessing a Bucket in the Amazon Simple Storage
Service User Guide.

Alternatively, you can specify the Region using the REGION (p. 583) option with the COPY command.

Access denied
The user account identified by the credentials must have LIST and GET access to the Amazon S3 bucket.
If the user does not have sufficient privileges, you will receive the following error message:

ERROR: S3ServiceException:Access Denied,Status 403,Error AccessDenied

For information about managing user access to buckets, see Access Control in the Amazon S3 Developer
Guide.

System tables for troubleshooting data loads


The following Amazon Redshift system tables can be helpful in troubleshooting data load issues:

• Query STL_LOAD_ERRORS (p. 1307) to discover the errors that occurred during specific loads.
• Query STL_FILE_SCAN (p. 1297) to view load times for specific files or to see if a specific file was even
read.
• Query STL_S3CLIENT_ERROR (p. 1332) to find details for errors encountered while transferring data
from Amazon S3.

To find and diagnose load errors

1. Create a view or define a query that returns details about load errors. The following example joins
the STL_LOAD_ERRORS table to the STV_TBL_PERM table to match table IDs with actual table
names.

create view loadview as
(select distinct tbl, trim(name) as table_name, query, starttime,
trim(filename) as input, line_number, colname, err_code,
trim(err_reason) as reason
from stl_load_errors sl, stv_tbl_perm sp
where sl.tbl = sp.id);

2. Set the MAXERRORS option in your COPY command to a large enough value to enable COPY to
return useful information about your data. If the COPY encounters errors, an error message directs
you to consult the STL_LOAD_ERRORS table for details.
3. Query the LOADVIEW view to see error details. For example:

select * from loadview where table_name='venue';

  tbl   | table_name | query |         starttime
--------+------------+-------+----------------------------
 100551 | venue      | 20974 | 2013-01-29 19:05:58.365391

 |     input      | line_number | colname | err_code |       reason
 +----------------+-------------+---------+----------+---------------------
 | venue_pipe.txt |           1 |       0 |     1214 | Delimiter not found

4. Fix the problem in the input file or the load script, based on the information that the view returns.
Some typical load errors to watch for include:

• Mismatch between data types in table and values in input data fields.
• Mismatch between number of columns in table and number of fields in input data.
• Mismatched quotation marks. Amazon Redshift supports both single and double quotation marks;
however, these quotation marks must be balanced appropriately.
• Incorrect format for date/time data in input files.
• Out-of-range values in input files (for numeric columns).
• Number of distinct values for a column exceeds the limitation for its compression encoding.

Multibyte character load errors


Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F
hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a
maximum of four bytes. For more information, see Character types (p. 487).

If a line in your load data contains a character that is invalid for the column data type, COPY returns
an error and logs a row in the STL_LOAD_ERRORS system log table with error number 1220. The
ERR_REASON field includes the byte sequence, in hex, for the invalid character.

An alternative to fixing invalid characters in your load data is to replace the invalid characters during the
load process. To replace invalid UTF-8 characters, specify the ACCEPTINVCHARS option with the COPY
command. If the ACCEPTINVCHARS option is set, the character you specify replaces the code point. If
the ACCEPTINVCHARS option isn't set, Amazon Redshift accepts the characters as valid UTF-8. For more
information, see ACCEPTINVCHARS (p. 603).
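
For example, the following sketch replaces each invalid UTF-8 character with a caret (^) during the load.
The bucket name and IAM role are placeholders, not values from this guide.

copy venue
from 's3://mybucket/venue_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|'
acceptinvchars as '^';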

The following code points are valid UTF-8, so COPY operations don't return an error for them even if the
ACCEPTINVCHARS option is not set. However, these code points are invalid characters. You can use the
ACCEPTINVCHARS (p. 603) option to replace a code point with a character that you specify. These code
points include the range of values from 0xFDD0 to 0xFDEF and values up to 0x10FFFF that end with
FFFE or FFFF:

• 0xFFFE, 0x1FFFE, 0x2FFFE, …, 0xFFFFE, 0x10FFFE


• 0xFFFF, 0x1FFFF, 0x2FFFF, …, 0xFFFFF, 0x10FFFF

The following example shows the error reason when COPY attempts to load UTF-8 character e0 a1
c7a4 into a CHAR column:

Multibyte character not supported for CHAR
(Hint: Try using VARCHAR). Invalid char: e0 a1 c7a4

If the error is related to a VARCHAR data type, the error reason includes an error code as well as the
invalid UTF-8 hex sequence. The following example shows the error reason when COPY attempts to load
UTF-8 a4 into a VARCHAR field:

String contains invalid or unsupported UTF-8 codepoints.
Bad UTF-8 hex sequence: a4 (error 3)

The following table lists the descriptions and suggested workarounds for VARCHAR load errors. If one of
these errors occurs, replace the character with a valid UTF-8 code sequence or remove the character.


Error code Description

1 The UTF-8 byte sequence exceeds the four-byte maximum supported by VARCHAR.

2 The UTF-8 byte sequence is incomplete. COPY did not find the expected number of
continuation bytes for a multibyte character before the end of the string.

3 The UTF-8 single-byte character is out of range. The starting byte must not be 254,
255 or any character between 128 and 191 (inclusive).

4 The value of the trailing byte in the byte sequence is out of range. The continuation
byte must be between 128 and 191 (inclusive).

5 The UTF-8 character is reserved as a surrogate. Surrogate code points (U+D800
through U+DFFF) are invalid.

8 The byte sequence exceeds the maximum UTF-8 code point.

9 The UTF-8 byte sequence does not have a matching code point.

Load error reference


If any errors occur while loading data from a file, query the STL_LOAD_ERRORS (p. 1307) table to
identify the error and determine the possible explanation. The following table lists all error codes that
might occur during data loads:

Load error codes

Error code Description

1200 Unknown parse error. Contact support.

1201 Field delimiter was not found in the input file.

1202 Input data had more columns than were defined in the DDL.

1203 Input data had fewer columns than were defined in the DDL.

1204 Input data exceeded the acceptable range for the data type.

1205 Date format is invalid. See DATEFORMAT and TIMEFORMAT strings (p. 620) for valid
formats.

1206 Timestamp format is invalid. See DATEFORMAT and TIMEFORMAT strings (p. 620) for
valid formats.

1207 Data contained a value outside of the expected range of 0-9.

1208 FLOAT data type format error.

1209 DECIMAL data type format error.

1210 BOOLEAN data type format error.

1211 Input line contained no data.

1212 Load file was not found.

1213 A field specified as NOT NULL contained no data.


1214 Delimiter not found.

1215 CHAR field error.

1216 Invalid input line.

1217 Invalid identity column value.

1218 When using NULL AS '\0', a field containing a null terminator (NUL, or UTF-8 0000)
contained more than one byte.

1219 UTF-8 hexadecimal contains an invalid digit.

1220 String contains invalid or unsupported UTF-8 code points.

1221 Encoding of the file is not the same as that specified in the COPY command.

8001 COPY with MANIFEST parameter requires full path of an Amazon S3 object.

9005 Invalid end key specified.

Updating tables with DML commands


Amazon Redshift supports standard data manipulation language (DML) commands (INSERT, UPDATE,
and DELETE) that you can use to modify rows in tables. You can also use the TRUNCATE command to do
fast bulk deletes.
Note
We strongly encourage you to use the COPY (p. 576) command to load large amounts of
data. Using individual INSERT statements to populate a table might be prohibitively slow.
Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT
INTO ... SELECT FROM or CREATE TABLE AS to improve performance. For information, see
INSERT (p. 780) or CREATE TABLE AS (p. 723).

If you insert, update, or delete a significant number of rows in a table, relative to the number of rows
before the changes, run the ANALYZE and VACUUM commands against the table when you are done.
If a number of small changes accumulate over time in your application, you might want to schedule
the ANALYZE and VACUUM commands to run at regular intervals. For more information, see Analyzing
tables (p. 103) and Vacuuming tables (p. 108).
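
As a minimal sketch of these recommendations (source_table and target_table are hypothetical names),
you might bulk-load from another Amazon Redshift table and then refresh statistics and reclaim space:

-- Bulk-load from an existing Amazon Redshift table instead of using single-row INSERT statements
insert into target_table
select * from source_table;

-- After a significant number of changes, update statistics and reclaim space
analyze target_table;
vacuum target_table;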

Updating and inserting new data


You can efficiently add new data to an existing table by using a combination of updates and inserts from
a staging table. While Amazon Redshift does not support a single merge, or upsert, command to update a
table from a single data source, you can perform a merge operation by creating a staging table and then
using one of the methods described in this section to update the target table from the staging table.

Topics
• Merge method 1: Replacing existing rows (p. 97)
• Merge method 2: Specifying a column list (p. 97)
• Creating a temporary staging table (p. 97)
• Performing a merge operation by replacing existing rows (p. 98)
• Performing a merge operation by specifying a column list (p. 98)

96
Amazon Redshift Database Developer Guide
Merge method 1: Replacing existing rows

• Merge examples (p. 100)

Note
You should run the entire merge operation, except for creating and dropping the temporary
staging table, in a single transaction so that the transaction will roll back if any step fails. Using
a single transaction also reduces the number of commits, which saves time and resources.

Merge method 1: Replacing existing rows


If you are overwriting all of the columns in the target table, the fastest method for performing a merge
is by replacing the existing rows because it scans the target table only once, by using an inner join to
delete rows that will be updated. After the rows are deleted, they are replaced along with new rows by a
single insert operation from the staging table.

Use this method if all of the following are true:

• Your target table and your staging table contain the same columns.
• You intend to replace all of the data in the target table columns with all of the staging table columns.
• You will use all of the rows in the staging table in the merge.

If any of these criteria do not apply, use Merge Method 2: Specifying a column list, described in the
following section.

If you will not use all of the rows in the staging table, you can filter the DELETE and INSERT statements
by using a WHERE clause to leave out rows that are not actually changing. However, if most of the rows
in the staging table will not participate in the merge, we recommend performing an UPDATE and an
INSERT in separate steps, as described later in this section.

Merge method 2: Specifying a column list


Use this method to update specific columns in the target table instead of overwriting entire rows.
This method takes longer than the previous method because it requires an extra update step. Use this
method if any of the following are true:

• Not all of the columns in the target table are to be updated.


• Most rows in the staging table will not be used in the updates.

Creating a temporary staging table


The staging table is a temporary table that holds all of the data that will be used to make changes to the
target table, including both updates and inserts.

A merge operation requires a join between the staging table and the target table. To collocate the
joining rows, set the staging table's distribution key to the same column as the target table's distribution
key. For example, if the target table uses a foreign key column as its distribution key, use the same
column for the staging table's distribution key. If you create the staging table by using a CREATE TABLE
LIKE (p. 714) statement, the staging table will inherit the distribution key from the parent table. If
you use a CREATE TABLE AS statement, the new table does not inherit the distribution key. For more
information, see Working with data distribution styles (p. 50).

If the distribution key is not the same as the primary key and the distribution key is not updated as part
of the merge operation, add a redundant join predicate on the distribution key columns to enable a
collocated join. For example:


where target.primarykey = stage.primarykey
and target.distkey = stage.distkey

To verify that the query will use a collocated join, run the query with EXPLAIN (p. 762) and check for
DS_DIST_NONE on all of the joins. For more information, see Evaluating the query plan (p. 55).

Performing a merge operation by replacing existing rows
To perform a merge operation by replacing existing rows

1. Create a staging table, and then populate it with data to be merged, as shown in the following
pseudocode.

create temp table stage (like target);

insert into stage
select * from source
where source.filter = 'filter_expression';

2. Use an inner join with the staging table to delete the rows from the target table that are being
updated.

Put the delete and insert operations in a single transaction block so that if there is a problem,
everything will be rolled back.

begin transaction;

delete from target
using stage
where target.primarykey = stage.primarykey;

3. Insert all of the rows from the staging table.

insert into target
select * from stage;

end transaction;

4. Drop the staging table.

drop table stage;

Performing a merge operation by specifying a column list
To perform a merge operation by specifying a column list

1. Put the entire operation in a single transaction block so that if there is a problem, everything will be
rolled back.

begin transaction;


end transaction;

2. Create a staging table, and then populate it with data to be merged, as shown in the following
pseudocode.

create temp table stage (like target);


insert into stage
select * from source
where source.filter = 'filter_expression';

3. Update the target table by using an inner join with the staging table.

• In the UPDATE clause, explicitly list the columns to be updated.


• Perform an inner join with the staging table.
• If the distribution key is different from the primary key and the distribution key is not being
updated, add a redundant join on the distribution key. To verify that the query will use a
collocated join, run the query with EXPLAIN (p. 762) and check for DS_DIST_NONE on all of the
joins. For more information, see Evaluating the query plan (p. 55).
• If your target table is sorted by timestamp, add a predicate to take advantage of range-restricted
scans on the target table. For more information, see Amazon Redshift best practices for designing
queries (p. 22).
• If you will not use all of the rows in the merge, add a clause to filter the rows that need to be
changed. For example, add an inequality filter on one or more columns to exclude rows that have
not changed.
• Put the update, delete, and insert operations in a single transaction block so that if there is a
problem, everything will be rolled back.

For example:

begin transaction;

update target
set col1 = stage.col1,
col2 = stage.col2,
col3 = 'expression'
from stage
where target.primarykey = stage.primarykey
and target.distkey = stage.distkey
and target.col3 > 'last_update_time'
and (target.col1 != stage.col1
or target.col2 != stage.col2
or target.col3 = 'filter_expression');

4. Delete unneeded rows from the staging table by using an inner join with the target table. Some
rows in the target table already match the corresponding rows in the staging table, and others were
updated in the previous step. In either case, they are not needed for the insert.

delete from stage
using target
where stage.primarykey = target.primarykey;

5. Insert the remaining rows from the staging table. Use the same column list that you used in the
UPDATE statement in step 3.

insert into target
(select col1, col2, 'expression'
from stage);


end transaction;

6. Drop the staging table.

drop table stage;

Merge examples
The following examples perform a merge to update the SALES table. The first example uses the simpler
method of deleting from the target table and then inserting all of the rows from the staging table. The
second example updates only selected columns in the target table, so it includes an extra update
step.

Sample merge data source

The examples in this section need a sample data source that includes both updates and inserts. For the
examples, we will create a sample table named SALES_UPDATE that uses data from the SALES table.
We'll populate the new table with random data that represents new sales activity for December. We will
use the SALES_UPDATE sample table to create the staging table in the examples that follow.

-- Create a sample table as a copy of the SALES table


create table sales_update as
select * from sales;

-- Change every fifth row so we have updates

update sales_update
set qtysold = qtysold*2,
pricepaid = pricepaid*0.8,
commission = commission*1.1
where saletime > '2008-11-30'
and mod(sellerid, 5) = 0;

-- Add some new rows so we have insert examples


-- This example creates a duplicate of every fourth row

insert into sales_update
select (salesid + 172456) as salesid, listid, sellerid, buyerid, eventid, dateid, qtysold,
pricepaid, commission, getdate() as saletime
from sales_update
where saletime > '2008-11-30'
and mod(sellerid, 4) = 0;

Example of a merge that replaces existing rows

The following script uses the SALES_UPDATE table to perform a merge operation on the SALES table
with new data for December sales activity. This example deletes rows in the SALES table that have
updates so they can be replaced with the updated rows in the staging table. The staging table should
contain only rows that will participate in the merge, so the CREATE TABLE statement includes a filter to
exclude rows that have not changed.

-- Create a staging table and populate it with updated rows from SALES_UPDATE

create temp table stagesales as
select * from sales_update
where sales_update.saletime > '2008-11-30'
and sales_update.salesid = (select sales.salesid from sales
where sales.salesid = sales_update.salesid
and sales.listid = sales_update.listid
and (sales_update.qtysold != sales.qtysold
or sales_update.pricepaid != sales.pricepaid));

-- Start a new transaction

begin transaction;

-- Delete any rows from SALES that exist in STAGESALES, because they are updates
-- The join includes a redundant predicate to collocate on the distribution key
-- A filter on saletime enables a range-restricted scan on SALES

delete from sales
using stagesales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid
and sales.saletime > '2008-11-30';

-- Insert all the rows from the staging table into the target table
insert into sales
select * from stagesales;

-- End transaction and commit


end transaction;

-- Drop the staging table


drop table stagesales;

Example of a merge that specifies a column list

The following example performs a merge operation to update SALES with new data for December
sales activity. We need sample data that includes both updates and inserts, along with rows that have
not changed. For this example, we want to update the QTYSOLD and PRICEPAID columns but leave
COMMISSION and SALETIME unchanged. The following script uses the SALES_UPDATE table to perform
a merge operation on the SALES table.

-- Create a staging table and populate it with rows from SALES_UPDATE for Dec
create temp table stagesales as select * from sales_update
where saletime > '2008-11-30';

-- Start a new transaction


begin transaction;

-- Update the target table using an inner join with the staging table
-- The join includes a redundant predicate to collocate on the distribution key
-- A filter on saletime enables a range-restricted scan on SALES

update sales
set qtysold = stagesales.qtysold,
pricepaid = stagesales.pricepaid
from stagesales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid
and stagesales.saletime > '2008-11-30'
and (sales.qtysold != stagesales.qtysold
or sales.pricepaid != stagesales.pricepaid);

-- Delete matching rows from the staging table
-- using an inner join with the target table

delete from stagesales
using sales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid;

-- Insert the remaining rows from the staging table into the target table


insert into sales
select * from stagesales;

-- End transaction and commit


end transaction;

-- Drop the staging table


drop table stagesales;

Performing a deep copy


A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table.
If a table has a large unsorted region, a deep copy is much faster than a vacuum. The trade-off is that
you should not make concurrent updates during a deep copy operation unless you can track them and move
the delta updates into the new table after the process has completed. A VACUUM operation supports
concurrent updates automatically.

You can choose one of the following methods to create a copy of the original table:

• Use the original table DDL.

If the CREATE TABLE DDL is available, this is the fastest and preferred method. If you create a new
table, you can specify all table and column attributes, including primary key and foreign keys.
Note
If the original DDL is not available, you might be able to recreate the DDL by running a script
called v_generate_tbl_ddl. You can download the script from amazon-redshift-utils,
which is part of the AWS Labs GitHub repository.
• Use CREATE TABLE LIKE.

If the original DDL is not available, you can use CREATE TABLE LIKE to recreate the original table. The
new table inherits the encoding, distribution key, sort key, and not-null attributes of the parent table.
The new table doesn't inherit the primary key and foreign key attributes of the parent table, but you
can add them using ALTER TABLE (p. 542).
• Create a temporary table and truncate the original table.

If you need to retain the primary key and foreign key attributes of the parent table, or if the parent
table has dependencies, you can use CREATE TABLE ... AS (CTAS) to create a temporary table, then
truncate the original table and populate it from the temporary table.

Using a temporary table improves performance significantly compared to using a permanent table,
but there is a risk of losing data. A temporary table is automatically dropped at the end of the session
in which it is created. TRUNCATE commits immediately, even if it is inside a transaction block. If the
TRUNCATE succeeds but the session terminates before the subsequent INSERT completes, the data is
lost. If data loss is unacceptable, use a permanent table.

After you create a copy of a table, you may need to grant access to the new table. You can use
GRANT (p. 767) to define access privileges.
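
For example, the following sketch grants read access on the rebuilt table to a hypothetical user group
named analysts:

grant select on table sales to group analysts;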

To perform a deep copy using the original table DDL

1. (Optional) Recreate the table DDL by running a script called v_generate_tbl_ddl.


2. Create a copy of the table using the original CREATE TABLE DDL.
3. Use an INSERT INTO … SELECT statement to populate the copy with data from the original table.
4. Drop the original table.
5. Use an ALTER TABLE statement to rename the copy to the original table name.


The following example performs a deep copy on the SALES table using a duplicate of SALES named
SALESCOPY.

create table salescopy ( … );


insert into salescopy (select * from sales);
drop table sales;
alter table salescopy rename to sales;

To perform a deep copy using CREATE TABLE LIKE

1. Create a new table using CREATE TABLE LIKE.


2. Use an INSERT INTO … SELECT statement to copy the rows from the current table to the new table.
3. Drop the current table.
4. Use an ALTER TABLE statement to rename the new table to the original table name.

The following example performs a deep copy on the SALES table using CREATE TABLE LIKE.

create table likesales (like sales);


insert into likesales (select * from sales);
drop table sales;
alter table likesales rename to sales;

To perform a deep copy by creating a temporary table and truncating the original table

1. Use CREATE TABLE AS to create a temporary table with the rows from the original table.
2. Truncate the current table.
3. Use an INSERT INTO … SELECT statement to copy the rows from the temporary table to the original
table.
4. Drop the temporary table.

The following example performs a deep copy on the SALES table by creating a temporary table and
truncating the original table:

create temp table salestemp as select * from sales;


truncate sales;
insert into sales (select * from salestemp);
drop table salestemp;

Analyzing tables
The ANALYZE operation updates the statistical metadata that the query planner uses to choose optimal
plans.

In most cases, you don't need to explicitly run the ANALYZE command. Amazon Redshift monitors
changes to your workload and automatically updates statistics in the background. In addition, the COPY
command performs an analysis automatically when it loads data into an empty table.

To explicitly analyze a table or the entire database, run the ANALYZE (p. 564) command.

Topics
• Automatic analyze (p. 104)


• Analysis of new table data (p. 104)


• ANALYZE command history (p. 107)

Automatic analyze
Amazon Redshift continuously monitors your database and automatically performs analyze operations in
the background. To minimize impact to your system performance, automatic analyze runs during periods
when workloads are light.

Automatic analyze is enabled by default. To disable automatic analyze, set the auto_analyze
parameter to false by modifying your cluster's parameter group.

To reduce processing time and improve overall system performance, Amazon Redshift skips automatic
analyze for any table where the extent of modifications is small.

An analyze operation skips tables that have up-to-date statistics. If you run ANALYZE as part of your
extract, transform, and load (ETL) workflow, automatic analyze skips tables that have current statistics.
Similarly, an explicit ANALYZE skips tables when automatic analyze has updated the table's statistics.

Analysis of new table data


By default, the COPY command performs an ANALYZE after it loads data into an empty table. You can
force an ANALYZE regardless of whether a table is empty by setting STATUPDATE ON. If you specify
STATUPDATE OFF, an ANALYZE is not performed. Only the table owner or a superuser can run the
ANALYZE command or run the COPY command with STATUPDATE set to ON.

Amazon Redshift also analyzes new tables that you create with the following commands:

• CREATE TABLE AS (CTAS)


• CREATE TEMP TABLE AS
• SELECT INTO

Amazon Redshift returns a warning message when you run a query against a new table that was not
analyzed after its data was initially loaded. No warning occurs when you query a table after a subsequent
update or load. The same warning message is returned when you run the EXPLAIN command on a query
that references tables that have not been analyzed.

Whenever adding data to a nonempty table significantly changes the size of the table, you can explicitly
update statistics. You do so either by running an ANALYZE command or by using the STATUPDATE ON
option with the COPY command. To view details about the number of rows that have been inserted or
deleted since the last ANALYZE, query the PG_STATISTIC_INDICATOR (p. 1472) system catalog table.
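
For example, a minimal query of that catalog table:

select * from pg_statistic_indicator;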

You can specify the scope of the ANALYZE (p. 564) command to one of the following:

• The entire current database


• A single table
• One or more specific columns in a single table
• Columns that are likely to be used as predicates in queries

The ANALYZE command gets a sample of rows from the table, does some calculations, and saves
resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column
and another sample pass for all of the other columns in the table. If you want to generate statistics for
a subset of columns, you can specify a comma-separated column list. You can run ANALYZE with the
PREDICATE COLUMNS clause to skip columns that aren’t used as predicates.


ANALYZE operations are resource intensive, so run them only on tables and columns that actually require
statistics updates. You don't need to analyze all columns in all tables regularly or on the same schedule.
If the data changes substantially, analyze the columns that are frequently used in the following:

• Sorting and grouping operations


• Joins
• Query predicates

To reduce processing time and improve overall system performance, Amazon Redshift skips
ANALYZE for any table that has a low percentage of changed rows, as determined by the
analyze_threshold_percent (p. 1481) parameter. By default, the analyze threshold is set to 10 percent.
You can change the analyze threshold for the current session by running a SET (p. 837) command.
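
For example, the following statement raises the analyze threshold to 20 percent for the current session:

set analyze_threshold_percent to 20;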

Columns that are less likely to require frequent analysis are those that represent facts and measures and
any related attributes that are never actually queried, such as large VARCHAR columns. For example,
consider the LISTING table in the TICKIT database.

select "column", type, encoding, distkey, sortkey


from pg_table_def where tablename = 'listing';

column | type | encoding | distkey | sortkey


---------------+--------------------+----------+---------+---------
listid | integer | none | t | 1
sellerid | integer | none | f | 0
eventid | integer | mostly16 | f | 0
dateid | smallint | none | f | 0
numtickets | smallint | mostly8 | f | 0
priceperticket | numeric(8,2) | bytedict | f | 0
totalprice | numeric(8,2) | mostly32 | f | 0
listtime | timestamp with... | none | f | 0

If this table is loaded every day with a large number of new records, the LISTID column, which is
frequently used in queries as a join key, needs to be analyzed regularly. If TOTALPRICE and LISTTIME are
the frequently used constraints in queries, you can analyze those columns and the distribution key on
every weekday.

analyze listing(listid, totalprice, listtime);

Suppose that the sellers and events in the application are much more static, and the date IDs refer to a
fixed set of days covering only two or three years. In this case, the unique values for these columns don't
change significantly. However, the number of instances of each unique value will increase steadily.

In addition, consider the case where the NUMTICKETS and PRICEPERTICKET measures are queried
infrequently compared to the TOTALPRICE column. In this case, you can run the ANALYZE command on
the whole table once every weekend to update statistics for the five columns that are not analyzed daily:
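
analyze listing;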

Predicate columns

As a convenient alternative to specifying a column list, you can choose to analyze only the columns
that are likely to be used as predicates. When you run a query, any columns that are used in a join, filter
condition, or group by clause are marked as predicate columns in the system catalog. When you run
ANALYZE with the PREDICATE COLUMNS clause, the analyze operation includes only columns that meet
the following criteria:

• The column is marked as a predicate column.


• The column is a distribution key.


• The column is part of a sort key.

If none of a table's columns are marked as predicates, ANALYZE includes all of the columns, even when
PREDICATE COLUMNS is specified. If no columns are marked as predicate columns, it might be because
the table has not yet been queried.

You might choose to use PREDICATE COLUMNS when your workload's query pattern is relatively stable.
When the query pattern is variable, with different columns frequently being used as predicates, using
PREDICATE COLUMNS might temporarily result in stale statistics. Stale statistics can lead to suboptimal
query execution plans and long execution times. However, the next time you run ANALYZE using
PREDICATE COLUMNS, the new predicate columns are included.
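
For example, the following command analyzes only the predicate columns (plus the distribution key and
sort key columns) of the LISTING table:

analyze listing predicate columns;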

To view details for predicate columns, use the following SQL to create a view named
PREDICATE_COLUMNS.

CREATE VIEW predicate_columns AS
WITH predicate_column_info as (
SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num, a.attname
as col_name,
CASE
WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||')
WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')
WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')
WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')
ELSE NULL::varchar
END AS pred_ts
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_namespace ns ON c.relnamespace = ns.oid
JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)
SELECT schema_name, table_name, col_num, col_name,
pred_ts NOT LIKE '2000-01-01%' AS is_predicate,
CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts,
'||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,
CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts,
'||',2))::timestamp ELSE NULL::timestamp END as last_analyze
FROM predicate_column_info;

Suppose you run the following query against the LISTING table. Note that LISTID, LISTTIME, and
EVENTID are used in the join, filter, and group by clauses.

select s.buyerid, l.eventid, sum(l.totalprice)
from listing l
join sales s on l.listid = s.listid
where l.listtime > '2008-12-01'
group by l.eventid, s.buyerid;

When you query the PREDICATE_COLUMNS view, as shown in the following example, you see that
LISTID, EVENTID, and LISTTIME are marked as predicate columns.

select * from predicate_columns
where table_name = 'listing';

schema_name | table_name | col_num | col_name       | is_predicate | first_predicate_use | last_analyze
------------+------------+---------+----------------+--------------+---------------------+---------------------
public      | listing    |       1 | listid         | true         | 2017-05-05 19:27:59 | 2017-05-03 18:27:41
public      | listing    |       2 | sellerid       | false        |                     | 2017-05-03 18:27:41
public      | listing    |       3 | eventid        | true         | 2017-05-16 20:54:32 | 2017-05-03 18:27:41
public      | listing    |       4 | dateid         | false        |                     | 2017-05-03 18:27:41
public      | listing    |       5 | numtickets     | false        |                     | 2017-05-03 18:27:41
public      | listing    |       6 | priceperticket | false        |                     | 2017-05-03 18:27:41
public      | listing    |       7 | totalprice     | false        |                     | 2017-05-03 18:27:41
public      | listing    |       8 | listtime       | true         | 2017-05-16 20:54:32 | 2017-05-03 18:27:41

Keeping statistics current improves query performance by enabling the query planner to choose optimal
plans. Amazon Redshift refreshes statistics automatically in the background, and you can also explicitly
run the ANALYZE command. If you choose to explicitly run ANALYZE, do the following:

• Run the ANALYZE command before running queries.


• Run the ANALYZE command on the database routinely at the end of every regular load or update
cycle.
• Run the ANALYZE command on any new tables that you create and any existing tables or columns that
undergo significant change.
• Consider running ANALYZE operations on different schedules for different types of tables and
columns, depending on their use in queries and their propensity to change.
• To save time and cluster resources, use the PREDICATE COLUMNS clause when you run ANALYZE.


ANALYZE command history


It's useful to know when the last ANALYZE command was run on a table or database. When an ANALYZE
command is run, Amazon Redshift runs multiple queries that look like this:

padb_fetch_sample: select * from table_name

Query STL_ANALYZE to view the history of analyze operations. If Amazon Redshift analyzes a table using
automatic analyze, the is_background column is set to t (true). Otherwise, it is set to f (false). The
following example joins STV_TBL_PERM to show the table name and execution details.

select distinct a.xid, trim(t.name) as name, a.status, a.rows, a.modified_rows,
a.starttime, a.endtime
from stl_analyze a
join stv_tbl_perm t on t.id=a.table_id
where name = 'users'
order by starttime;

xid    | name  | status  | rows  | modified_rows | starttime           | endtime
-------+-------+---------+-------+---------------+---------------------+---------------------
1582   | users | Full    | 49990 |         49990 | 2016-09-22 22:02:23 | 2016-09-22 22:02:28
244287 | users | Full    | 24992 |         74988 | 2016-10-04 22:50:58 | 2016-10-04 22:51:01
244712 | users | Full    | 49984 |         24992 | 2016-10-04 22:56:07 | 2016-10-04 22:56:07
245071 | users | Skipped | 49984 |             0 | 2016-10-04 22:58:17 | 2016-10-04 22:58:17
245439 | users | Skipped | 49984 |          1982 | 2016-10-04 23:00:13 | 2016-10-04 23:00:13
(5 rows)

Alternatively, you can run a more complex query that returns all the statements that ran in every
completed transaction that included an ANALYZE command:

select xid, to_char(starttime, 'HH24:MM:SS.MS') as starttime,
datediff(sec,starttime,endtime ) as secs, substring(text, 1, 40)
from svl_statementtext
where sequence = 0
and xid in (select xid from svl_statementtext s where s.text like 'padb_fetch_sample%' )
order by xid desc, starttime;

xid  | starttime    | secs | substring
-----+--------------+------+------------------------------------------
1338 | 12:04:28.511 | 4 | Analyze date
1338 | 12:04:28.511 | 1 | padb_fetch_sample: select count(*) from
1338 | 12:04:29.443 | 2 | padb_fetch_sample: select * from date
1338 | 12:04:31.456 | 1 | padb_fetch_sample: select * from date
1337 | 12:04:24.388 | 1 | padb_fetch_sample: select count(*) from
1337 | 12:04:24.388 | 4 | Analyze sales
1337 | 12:04:25.322 | 2 | padb_fetch_sample: select * from sales
1337 | 12:04:27.363 | 1 | padb_fetch_sample: select * from sales
...

Vacuuming tables
Amazon Redshift can automatically sort and perform a VACUUM DELETE operation on tables in the
background. To clean up tables after a load or a series of incremental updates, you can also run the
VACUUM (p. 873) command, either against the entire database or against individual tables.
Note
Only the table owner or a superuser can effectively vacuum a table. If you don't have owner or
superuser privileges for a table, a VACUUM operation that specifies a single table fails. If you
run a VACUUM of the entire database without specifying a table name, the operation completes
successfully. However, the operation has no effect on tables for which you don't have owner or
superuser privileges.
For this reason, we recommend vacuuming individual tables as needed. We also recommend this
approach because vacuuming the entire database is potentially an expensive operation.

Automatic table sort


Amazon Redshift automatically sorts data in the background to maintain table data in the order of its
sort key. Amazon Redshift keeps track of your scan queries to determine which sections of the table will
benefit from sorting.

Depending on the load on the system, Amazon Redshift automatically initiates the sort. This automatic
sort lessens the need to run the VACUUM command to keep data in sort key order. If you need data
fully sorted in sort key order, for example after a large data load, then you can still manually run the
VACUUM command. To determine whether your table will benefit by running VACUUM SORT, monitor
the vacuum_sort_benefit column in SVV_TABLE_INFO (p. 1450).


Amazon Redshift tracks scan queries that use the sort key on each table. Amazon Redshift estimates
the maximum percentage of improvement in scanning and filtering of data for each table (if
the table was fully sorted). This estimate is visible in the vacuum_sort_benefit column in
SVV_TABLE_INFO (p. 1450). You can use this column, along with the unsorted column, to determine
when queries can benefit from manually running VACUUM SORT on a table. The unsorted column
reflects the physical sort order of a table. The vacuum_sort_benefit column specifies the impact of
sorting a table by manually running VACUUM SORT.

For example, consider the following query:

select "table", unsorted,vacuum_sort_benefit from svv_table_info order by 1;

table | unsorted | vacuum_sort_benefit
-------+----------+---------------------
sales | 85.71 | 5.00
event | 45.24 | 67.00

For the table “sales”, even though the table is ~86% physically unsorted, the query performance impact
from the table being 86% unsorted is only 5%. This might be either because only a small portion of
the table is accessed by queries, or very few queries accessed the table. For the table “event”, the table
is ~45% physically unsorted. But the query performance impact of 67% indicates that either a larger
portion of the table was accessed by queries, or the number of queries accessing the table was large. The
table "event" can potentially benefit from running VACUUM SORT.
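
For example, the following command sorts the rows of the EVENT table without reclaiming space from
deleted rows:

vacuum sort only event;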

Automatic vacuum delete


When you perform a delete, the rows are marked for deletion, but not removed. Amazon Redshift
automatically runs a VACUUM DELETE operation in the background based on the number of deleted
rows in database tables. Amazon Redshift schedules the VACUUM DELETE to run during periods of
reduced load and pauses the operation during periods of high load.

Topics
• VACUUM frequency (p. 109)
• Sort stage and merge stage (p. 110)
• Vacuum threshold (p. 110)
• Vacuum types (p. 110)
• Managing vacuum times (p. 110)

VACUUM frequency
You should vacuum as often as you need to in order to maintain consistent query performance. Consider
these factors when determining how often to run your VACUUM command.

• Run VACUUM during time periods when you expect minimal activity on the cluster, such as evenings or
during designated database administration windows.
• A large unsorted region results in longer vacuum times. If you delay vacuuming, the vacuum will take
longer because more data has to be reorganized.
• VACUUM is an I/O intensive operation, so the longer it takes for your vacuum to complete, the more
impact it will have on concurrent queries and other database operations running on your cluster.
• VACUUM takes longer for tables that use interleaved sorting. To evaluate whether interleaved tables
need to be re-sorted, query the SVV_INTERLEAVED_COLUMNS (p. 1434) view.


Sort stage and merge stage


Amazon Redshift performs a vacuum operation in two stages: first, it sorts the rows in the unsorted
region, then, if necessary, it merges the newly sorted rows at the end of the table with the existing
rows. When vacuuming a large table, the vacuum operation proceeds in a series of steps consisting
of incremental sorts followed by merges. If the operation fails or if Amazon Redshift goes offline
during the vacuum, the partially vacuumed table or database will be in a consistent state, but you will
need to manually restart the vacuum operation. Incremental sorts are lost, but merged rows that were
committed before the failure do not need to be vacuumed again. If the unsorted region is large, the lost
time might be significant. For more information about the sort and merge stages, see Managing the
volume of merged rows (p. 112).

Users can access tables while they are being vacuumed. You can perform queries and write operations
while a table is being vacuumed, but when DML and a vacuum run concurrently, both might take longer.
If you run UPDATE and DELETE statements during a vacuum, system performance might be reduced.
Incremental merges temporarily block concurrent UPDATE and DELETE operations, and UPDATE and
DELETE operations in turn temporarily block incremental merge steps on the affected tables. DDL
operations, such as ALTER TABLE, are blocked until the vacuum operation finishes with the table.
Note
Various modifiers to VACUUM control the way that it works. You can use them to tailor the
vacuum operation for the current need. For example, using VACUUM RECLUSTER shortens
the vacuum operation by not performing a full merge operation. For more information, see
VACUUM (p. 873).

Vacuum threshold
By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows
are already sorted. Skipping the sort phase can significantly improve VACUUM performance. To change
the default sort threshold for a single table, include the table name and the TO threshold PERCENT
parameter when you run the VACUUM command.
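
For example, the following command vacuums the SALES table and skips the sort phase if at least 75
percent of its rows are already sorted:

vacuum sales to 75 percent;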

Vacuum types
For information about different vacuum types, see VACUUM (p. 873).

Managing vacuum times


Depending on the nature of your data, we recommend following the practices in this section to minimize
vacuum times.

Topics
• Deciding whether to reindex (p. 110)
• Managing the size of the unsorted region (p. 111)
• Managing the volume of merged rows (p. 112)
• Loading your data in sort key order (p. 115)
• Using time series tables (p. 115)

Deciding whether to reindex


You can often significantly improve query performance by using an interleaved sort style, but over time
performance might degrade if the distribution of the values in the sort key columns changes.


When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Amazon Redshift
automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you
need to run VACUUM REINDEX afterwards to initialize the interleaved index.

Over time, as you add rows with new sort key values, performance might degrade if the distribution of
the values in the sort key columns changes. If your new rows fall primarily within the range of existing
sort key values, you don’t need to reindex. Run VACUUM SORT ONLY or VACUUM FULL to restore the
sort order.

The query engine is able to use sort order to efficiently select which data blocks need to be scanned
to process a query. For an interleaved sort, Amazon Redshift analyzes the sort key column values
to determine the optimal sort order. If the distribution of key values changes, or skews, as rows
are added, the sort strategy will no longer be optimal, and the performance benefit of sorting
will degrade. To reanalyze the sort key distribution you can run a VACUUM REINDEX. The reindex
operation is time consuming, so to decide whether a table will benefit from a reindex, query the
SVV_INTERLEAVED_COLUMNS (p. 1434) view.

For example, the following query shows details for tables that use interleaved sort keys.

select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;

tbl_id | table_name | col | interleaved_skew | last_reindex
--------+------------+-----+------------------+--------------------
100048 | customer | 0 | 3.65 | 2015-04-22 22:05:45
100068 | lineorder | 1 | 2.65 | 2015-04-22 22:05:45
100072 | part | 0 | 1.65 | 2015-04-22 22:05:45
100077 | supplier | 1 | 1.00 | 2015-04-22 22:05:45
(4 rows)

The value for interleaved_skew is a ratio that indicates the amount of skew. A value of 1 means there
is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless
the skew is inherent in the underlying set.

You can use the date value in last_reindex to determine how long it has been since the last reindex.
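
If a reindex is warranted, you can run it against a single table. For example, for the lineorder table shown
in the preceding output:

vacuum reindex lineorder;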

Managing the size of the unsorted region


The unsorted region grows when you load large amounts of new data into tables that already contain
data or when you do not vacuum tables as part of your routine maintenance operations. To avoid long-
running vacuum operations, use the following practices:

• Run vacuum operations on a regular schedule.

If you load your tables in small increments (such as daily updates that represent a small percentage
of the total number of rows in the table), running VACUUM regularly will help ensure that individual
vacuum operations go quickly.
• Run the largest load first.

If you need to load a new table with multiple COPY operations, run the largest load first. When you
run an initial load into a new or truncated table, all of the data is loaded directly into the sorted
region, so no vacuum is required.
• Truncate a table instead of deleting all of the rows.

Deleting rows from a table does not reclaim the space that the rows occupied until you perform a
vacuum operation; however, truncating a table empties the table and reclaims the disk space, so no
vacuum is required. Alternatively, drop the table and re-create it.


• Truncate or drop test tables.

If you are loading a small number of rows into a table for test purposes, don't delete the rows when
you are done. Instead, truncate the table and reload those rows as part of the subsequent production
load operation.
• Perform a deep copy.

If a table that uses a compound sort key has a large unsorted region, a deep copy is much
faster than a vacuum. A deep copy recreates and repopulates a table by using a bulk insert, which
automatically re-sorts the table. The trade-off is that you cannot make concurrent updates during a
deep copy operation, which you can do during a vacuum. For more information, see Amazon Redshift
best practices for designing queries (p. 22).

Managing the volume of merged rows


If a vacuum operation needs to merge new rows into a table's sorted region, the time required for a
vacuum will increase as the table grows larger. You can improve vacuum performance by reducing the
number of rows that must be merged.

Prior to a vacuum, a table consists of a sorted region at the head of the table, followed by an unsorted
region, which grows whenever rows are added or updated. When a set of rows is added by a COPY
operation, the new set of rows is sorted on the sort key as it is added to the unsorted region at the end
of the table. The new rows are ordered within their own set, but not within the unsorted region.

The following diagram illustrates the unsorted region after two successive COPY operations, where the
sort key is CUSTID. For simplicity, this example shows a compound sort key, but the same principles
apply to interleaved sort keys, except that the impact of the unsorted region is greater for interleaved
tables.


A vacuum restores the table's sort order in two stages:

1. Sort the unsorted region into a newly-sorted region.

The first stage is relatively cheap, because only the unsorted region is rewritten. If the range of sort
key values of the newly-sorted region is higher than the existing range, only the new rows need to be
rewritten, and the vacuum is complete. For example, if the sorted region contains ID values 1 to 500
and subsequent copy operations add key values greater than 500, then only the unsorted region
needs to be rewritten.
2. Merge the newly-sorted region with the previously-sorted region.

If the keys in the newly sorted region overlap the keys in the sorted region, then VACUUM needs to
merge the rows. Starting at the beginning of the newly-sorted region (at the lowest sort key), the
vacuum writes the merged rows from the previously sorted region and the newly sorted region into a
new set of blocks.

The extent to which the new sort key range overlaps the existing sort keys determines the extent
to which the previously-sorted region will need to be rewritten. If the unsorted keys are scattered
throughout the existing sort range, a vacuum might need to rewrite existing portions of the table.

The following diagram shows how a vacuum would sort and merge rows that are added to a table where
CUSTID is the sort key. Because each copy operation adds a new set of rows with key values that overlap
the existing keys, almost the entire table needs to be rewritten. The diagram shows single sort and
merge, but in practice, a large vacuum consists of a series of incremental sort and merge steps.

If the range of sort keys in a set of new rows overlaps the range of existing keys, the cost of the merge
stage continues to grow in proportion to the table size as the table grows while the cost of the sort
stage remains proportional to the size of the unsorted region. In such a case, the cost of the merge stage
overshadows the cost of the sort stage, as the following diagram shows.


To determine what proportion of a table was remerged, query SVV_VACUUM_SUMMARY after the
vacuum operation completes. The following query shows the effect of six successive vacuums as
CUSTSALES grew larger over time.

select * from svv_vacuum_summary
where table_name = 'custsales';

table_name | xid  | sort_partitions | merge_increments | elapsed_time | row_delta | sortedrow_delta | block_delta | max_merge_partitions
-----------+------+-----------------+------------------+--------------+-----------+-----------------+-------------+----------------------
custsales  | 7072 |               3 |                2 |    143918314 |         0 |        88297472 |        1524 |                   47
custsales  | 7122 |               3 |                3 |    164157882 |         0 |        88297472 |         772 |                   47
custsales  | 7212 |               3 |                4 |    187433171 |         0 |        88297472 |         767 |                   47
custsales  | 7289 |               3 |                4 |    255482945 |         0 |        88297472 |         770 |                   47
custsales  | 7420 |               3 |                5 |    316583833 |         0 |        88297472 |         769 |                   47
custsales  | 9007 |               3 |                6 |    306685472 |         0 |        88297472 |         772 |                   47
(6 rows)

The merge_increments column gives an indication of the amount of data that was merged for each
vacuum operation. If the number of merge increments over consecutive vacuums increases in proportion
to the growth in table size, that is an indication that each vacuum operation is remerging an increasing
number of rows in the table because the existing and newly sorted regions overlap.


Loading your data in sort key order


If you load your data in sort key order using a COPY command, you might reduce or even eliminate the
need to vacuum.

COPY automatically adds new rows to the table's sorted region when all of the following are true:

• The table uses a compound sort key with only one sort column.
• The sort column is NOT NULL.
• The table is 100 percent sorted or empty.
• All the new rows are higher in sort order than the existing rows, including rows marked for deletion. In
this instance, Amazon Redshift uses the first eight bytes of the sort key to determine sort order.

For example, suppose you have a table that records customer events using a customer ID and time. If
you sort on customer ID, it’s likely that the sort key range of new rows added by incremental loads will
overlap the existing range, as shown in the previous example, leading to an expensive vacuum operation.

If you set your sort key to a timestamp column, your new rows will be appended in sort order at the end
of the table, as the following diagram shows, reducing or even eliminating the need to vacuum.
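
As a minimal sketch, a hypothetical table of customer events that is sorted on its timestamp column might
be defined as follows:

create table customer_events (
custid integer not null,
event_type varchar(30),
event_time timestamp not null
)
sortkey (event_time);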

Using time series tables


If you maintain data for a rolling time period, use a series of tables, as the following diagram illustrates.


Create a new table each time you add a set of data, then delete the oldest table in the series. You gain a
double benefit:

• You avoid the added cost of deleting rows, because a DROP TABLE operation is much more efficient
than a mass DELETE.
• If the tables are sorted by timestamp, no vacuum is needed. If each table contains data for one month,
a vacuum will at most have to rewrite one month’s worth of data, even if the tables are not sorted by
timestamp.

You can create a UNION ALL view for use by reporting queries that hides the fact that the data is stored
in multiple tables. If a query filters on the sort key, the query planner can efficiently skip all the tables
that aren't used. A UNION ALL can be less efficient for other types of queries, so you should evaluate
query performance in the context of all queries that use the tables.
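
As a minimal sketch, assuming hypothetical monthly tables named sales_202301 through sales_202303,
the view might look like the following. Note that dropping the oldest member table later requires
recreating the view (or using a late-binding view).

create view sales_ts as
select * from sales_202301
union all
select * from sales_202302
union all
select * from sales_202303;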

Managing concurrent write operations


Topics
• Serializable isolation (p. 117)
• Write and read/write operations (p. 120)
• Concurrent write examples (p. 121)

Amazon Redshift allows tables to be read while they are being incrementally loaded or modified.

In some traditional data warehousing and business intelligence applications, the database is available
to users only when the nightly load is complete. In such cases, no updates are allowed during regular
work hours, when analytic queries are run and reports are generated; however, an increasing number of
applications remain live for long periods of the day or even all day, making the notion of a load window
obsolete.

Amazon Redshift supports these types of applications by allowing tables to be read while they are being
incrementally loaded or modified. Queries simply see the latest committed version, or snapshot, of the
data, rather than waiting for the next version to be committed. If you want a particular query to wait for
a commit from another write operation, you have to schedule it accordingly.

The following topics describe some of the key concepts and use cases that involve transactions, database
snapshots, updates, and concurrent behavior.


Serializable isolation
Some applications require not only concurrent querying and loading, but also the ability to write to
multiple tables or the same table concurrently. In this context, concurrently means overlapping, not
scheduled to run at precisely the same time. Two transactions are considered to be concurrent if the
second one starts before the first commits. Concurrent operations can originate from different sessions
that are controlled either by the same user or by different users.
Note
Amazon Redshift supports a default automatic commit behavior in which each separately
run SQL command commits individually. If you enclose a set of commands in a transaction
block (defined by BEGIN (p. 568) and END (p. 760) statements), the block commits as one
transaction, so you can roll it back if necessary. Exceptions to this behavior are the TRUNCATE
and VACUUM commands, which automatically commit all outstanding changes made in the
current transaction.
Some SQL clients issue BEGIN and COMMIT commands automatically, so the client controls
whether a group of statements are run as a transaction or each individual statement is run as its
own transaction. Check the documentation for the interface you are using. For example, when
using the Amazon Redshift JDBC driver, a JDBC PreparedStatement with a query string that
contains multiple (semicolon separated) SQL commands runs all the statements as a single
transaction. In contrast, if you use SQL Workbench/J and set AUTO COMMIT ON, then if you run
multiple statements, each statement runs as its own transaction.
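For example, a minimal sketch of an explicit transaction block, using a hypothetical staging table, looks like the following. Both statements commit together at END, or you can issue ROLLBACK before END to undo both:

begin;
insert into sales_stage values (1, 100.00);        -- sales_stage is a hypothetical table
update sales_stage set amount = 200.00 where id = 1;
end;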

Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks
on tables and the principle of serializable isolation. Serializable isolation preserves the illusion that
a transaction running against a table is the only transaction that is running against that table. For
example, two concurrently running transactions, T1 and T2, must produce the same results as at least
one of the following:

• T1 and T2 run serially in that order.


• T2 and T1 run serially in that order.

Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each
concurrent transaction will create a snapshot of the database at the beginning of the transaction. A
database snapshot is created within a transaction on the first occurrence of most SELECT statements,
DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL
commands:

• ALTER TABLE (to add or drop columns)


• CREATE TABLE
• DROP TABLE
• TRUNCATE TABLE

If any serial execution of the concurrent transactions produces the same results as their concurrent
execution, those transactions are deemed "serializable" and can be run safely. If no serial execution of
those transactions can produce the same results, the transaction that runs a statement that might break
the ability to serialize is stopped and rolled back.

System catalog tables (PG) and other Amazon Redshift system tables (STL and STV) are not locked in a
transaction. Therefore, changes to database objects that arise from DDL and TRUNCATE operations are
visible on commit to any concurrent transactions.

For example, suppose that table A exists in the database when two concurrent transactions, T1 and T2,
start. Suppose that T2 returns a list of tables by selecting from the PG_TABLES catalog table. Then T1
drops table A and commits, and then T2 lists the tables again. Table A is now no longer listed. If T2 tries
to query the dropped table, Amazon Redshift returns a "relation does not exist" error. The catalog query


that returns the list of tables to T2 or checks that table A exists isn't subject to the same isolation rules as
operations performed on user tables.

Transactions for updates to these tables run in a read committed isolation mode. PG-prefix catalog
tables don't support snapshot isolation.

Serializable isolation for system tables and catalog tables


A database snapshot is also created in a transaction for any SELECT query that references a user-created
table or Amazon Redshift system table (STL or STV). SELECT queries that don't reference any table don't
create a new transaction database snapshot. INSERT, DELETE, and UPDATE statements that operate
solely on system catalog tables (PG) also don't create a new transaction database snapshot.

How to fix serializable isolation errors


ERROR:1023 DETAIL: Serializable isolation violation on a table in Redshift
When Amazon Redshift detects a serializable isolation error, you see an error message such as the
following.

ERROR:1023 DETAIL: Serializable isolation violation on table in Redshift

To address a serializable isolation error, you can try the following methods:

• Retry the canceled transaction.

Amazon Redshift detected that the concurrent workload is not serializable. This usually points to gaps in the application logic, which you can often work around by retrying the transaction that encountered the error. If the issue persists, try one of the other methods.
• Move any operations that don't have to be in the same atomic transaction outside of the transaction.

This method applies when individual operations inside two transactions cross-reference each other in
a way that can affect the outcome of the other transaction. For example, the following two sessions
each start a transaction.

Session1_Redshift=# begin;

Session2_Redshift=# begin;

The result of a SELECT statement in each transaction might be affected by an INSERT statement in the
other. In other words, suppose that you run the following statements serially, in any order. In every
case, one of the SELECT statements returns one more row than it would if the transactions ran
concurrently. Because no serial ordering of the operations can produce the same result as the
concurrent execution, the last operation that runs results in a serializable isolation error.

Session1_Redshift=# select * from tab1;


Session1_Redshift=# insert into tab2 values (1);

Session2_Redshift=# insert into tab1 values (1);


Session2_Redshift=# select * from tab2;

In many cases, the result of the SELECT statements isn't important. In other words, the atomicity of
the operations in the transactions isn't important. In these cases, move the SELECT statements outside
of their transactions, as shown in the following examples.


Session1_Redshift=# begin;
Session1_Redshift=# insert into tab1 values (1);
Session1_Redshift=# end;
Session1_Redshift=# select * from tab2;

Session2_Redshift=# select * from tab1;

Session2_Redshift=# begin;
Session2_Redshift=# insert into tab2 values (1);
Session2_Redshift=# end;

In these examples, there are no cross-references in the transactions. The two INSERT statements
don't affect each other. In these examples, there is at least one order in which the transactions can
run serially and produce the same result as if run concurrently. This means that the transactions are
serializable.
• Force serialization by locking all tables in each session.

The LOCK (p. 786) command blocks operations that can result in serializable isolation errors. When
you use the LOCK command, be sure to do the following:
• Lock all tables affected by the transaction, including those affected by read-only SELECT statements
inside the transaction.
• Lock tables in the same order, regardless of the order that operations are performed in.
• Lock all tables at the beginning of the transaction, before performing any operations.
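A minimal sketch of this locking pattern, reusing the tab1 and tab2 tables from the earlier example, runs the same preamble in every session that writes to those tables:

begin;
lock tab1, tab2;   -- same tables, same order, before any other work in the transaction
select * from tab1;
insert into tab2 values (1);
end;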

ERROR:1018 DETAIL: Relation does not exist


When you run concurrent Amazon Redshift operations in different sessions, you might see an error message
such as the following.

ERROR: 1018 DETAIL: Relation does not exist.

Transactions in Amazon Redshift follow snapshot isolation. After a transaction begins, Amazon Redshift
takes a snapshot of the database. For the entire lifecycle of the transaction, the transaction operates on
the state of the database as reflected in the snapshot. If the transaction reads from a table that doesn't
exist in the snapshot, it throws the 1018 error message shown previously. Even when another concurrent
transaction creates a table after the transaction has taken the snapshot, the transaction can't read from
the newly created table.

To address this serialization isolation error, you can try to move the start of the transaction to a point
where you know the table exists.

If the table is created by another transaction, this point is at least after that transaction has been
committed. Also, ensure that no concurrent transaction has been committed that might have dropped
the table.

session1 = # BEGIN;
session1 = # DROP TABLE A;
session1 = # COMMIT;

session2 = # BEGIN;

session3 = # BEGIN;
session3 = # CREATE TABLE A (id INT);


session3 = # COMMIT;

session2 = # SELECT * FROM A;

The last operation that is run as the read operation by session2 results in a serializable isolation error.
This error happens when session2 takes a snapshot and the table has already been dropped by a
committed session1. In other words, even though a concurrent session3 has created the table, session2
doesn't see the table because it's not in the snapshot.

To resolve this error, you can reorder the sessions as follows.

session1 = # BEGIN;
session1 = # DROP TABLE A;
session1 = # COMMIT;

session3 = # BEGIN;
session3 = # CREATE TABLE A (id INT);
session3 = # COMMIT;

session2 = # BEGIN;
session2 = # SELECT * FROM A;

Now when session2 takes its snapshot, session3 has already been committed, and the table is in the
database. Session2 can read from the table without any error.

Write and read/write operations


You can manage the specific behavior of concurrent write operations by deciding when and how to run
different types of commands. The following commands are relevant to this discussion:

• COPY commands, which perform loads (initial or incremental)


• INSERT commands that append one or more rows at a time
• UPDATE commands, which modify existing rows
• DELETE commands, which remove rows

COPY and INSERT operations are pure write operations, but DELETE and UPDATE operations are
read/write operations. (For rows to be deleted or updated, they have to be read first.) The results of
concurrent write operations depend on the specific commands that are being run concurrently. COPY
and INSERT operations against the same table are held in a wait state until the lock is released, then they
proceed as normal.

UPDATE and DELETE operations behave differently because they rely on an initial table read before they
do any writes. Given that concurrent transactions are invisible to each other, both UPDATEs and DELETEs
have to read a snapshot of the data from the last commit. When the first UPDATE or DELETE releases its
lock, the second UPDATE or DELETE needs to determine whether the data that it is going to work with is
potentially stale. It will not be stale, because the second transaction does not obtain its snapshot of data
until after the first transaction has released its lock.

Potential deadlock situation for concurrent write transactions


Whenever transactions involve updates of more than one table, there is always the possibility of
concurrently running transactions becoming deadlocked when they both try to write to the same set of
tables. A transaction releases all of its table locks at once when it either commits or rolls back; it does not
relinquish locks one at a time.


For example, suppose that transactions T1 and T2 start at roughly the same time. If T1 starts writing to
table A and T2 starts writing to table B, both transactions can proceed without conflict; however, if T1
finishes writing to table A and needs to start writing to table B, it will not be able to proceed because T2
still holds the lock on B. Conversely, if T2 finishes writing to table B and needs to start writing to table A,
it will not be able to proceed either because T1 still holds the lock on A. Because neither transaction can
release its locks until all its write operations are committed, neither transaction can proceed.

In order to avoid this kind of deadlock, you need to schedule concurrent write operations carefully. For
example, you should always update tables in the same order in transactions and, if specifying locks, lock
tables in the same order before you perform any DML operations.
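The following sketch shows the safe ordering for the example above; the column names are hypothetical. Because both sessions write to table A first and table B second, the second session simply waits instead of deadlocking:

-- Session 1
begin;
update a set qty = 10 where id = 1;
update b set qty = 10 where id = 1;
end;

-- Session 2 touches the same tables in the same order (A, then B).
begin;
update a set qty = 20 where id = 1;   -- waits here until session 1 commits
update b set qty = 20 where id = 1;
end;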

Concurrent write examples


The following pseudo-code examples demonstrate how transactions either proceed or wait when they
are run concurrently.

Concurrent COPY operations into the same table


Transaction 1 copies rows into the LISTING table:

begin;
copy listing from ...;
end;

Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING
table. Transaction 2 must wait until transaction 1 releases the write lock on the LISTING table, then it can
proceed.

begin;
[waits]
copy listing from ...;
end;

The same behavior would occur if one or both transactions contained an INSERT command instead of a
COPY command.

Concurrent DELETE operations from the same table


Transaction 1 deletes rows from a table:

begin;
delete from listing where ...;
end;

Transaction 2 starts concurrently and attempts to delete rows from the same table. It will succeed
because it waits for transaction 1 to complete before attempting to delete rows.

begin;
[waits]
delete from listing where ...;
end;

The same behavior would occur if one or both transactions contained an UPDATE command to the same
table instead of a DELETE command.


Concurrent transactions with a mixture of read and write


operations
In this example, transaction 1 deletes rows from the USERS table, reloads the table, runs a COUNT(*)
query, and then ANALYZE, before committing:

begin;
delete one row from USERS table;
copy ;
select count(*) from users;
analyze ;
end;

Meanwhile, transaction 2 starts. This transaction attempts to copy additional rows into the USERS table,
analyze the table, and then run the same COUNT(*) query as the first transaction:

begin;
[waits]
copy users from ...;
select count(*) from users;
analyze;
end;

The second transaction will succeed because it must wait for the first to complete. Its COUNT query will
return the count based on the load it has completed.

Tutorial: Loading data from Amazon S3


In this tutorial, you walk through the process of loading data into your Amazon Redshift database tables
from data files in an Amazon S3 bucket from beginning to end.

In this tutorial, you do the following:

• Download data files that use comma-separated value (CSV), character-delimited, and fixed width
formats.
• Create an Amazon S3 bucket and then upload the data files to the bucket.
• Launch an Amazon Redshift cluster and create database tables.
• Use COPY commands to load the tables from the data files on Amazon S3.
• Troubleshoot load errors and modify your COPY commands to correct the errors.

Estimated time: 60 minutes

Estimated cost: $1.00 per hour for the cluster

Prerequisites
You need the following prerequisites:

• An AWS account to launch an Amazon Redshift cluster and to create a bucket in Amazon S3.
• Your AWS credentials (IAM role) to load test data from Amazon S3. If you need to create a new IAM role, go
to Creating IAM roles.
• An SQL client such as the Amazon Redshift console query editor.


This tutorial is designed so that it can be taken by itself. In addition to this tutorial, we recommend
completing the following tutorials to gain a more complete understanding of how to design and use
Amazon Redshift databases:

• Amazon Redshift Getting Started Guide walks you through the process of creating an Amazon Redshift
cluster and loading sample data.

Overview
You can add data to your Amazon Redshift tables either by using an INSERT command or by using a
COPY command. At the scale and speed of an Amazon Redshift data warehouse, the COPY command is
many times faster and more efficient than INSERT commands.

The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to
read and load data in parallel from multiple data sources. You can load from data files on Amazon S3,
Amazon EMR, or any remote host accessible through a Secure Shell (SSH) connection. Or you can load
directly from an Amazon DynamoDB table.

In this tutorial, you use the COPY command to load data from Amazon S3. Many of the principles
presented here apply to loading from other data sources as well.

To learn more about using the COPY command, see these resources:

• Amazon Redshift best practices for loading data (p. 19)


• Loading data from Amazon EMR (p. 76)
• Loading data from remote hosts (p. 80)
• Loading data from an Amazon DynamoDB table (p. 86)

Steps
• Step 1: Create a cluster (p. 123)
• Step 2: Download the data files (p. 125)
• Step 3: Upload the files to an Amazon S3 bucket (p. 125)
• Step 4: Create the sample tables (p. 127)
• Step 5: Run the COPY commands (p. 129)
• Step 6: Vacuum and analyze the database (p. 141)
• Step 7: Clean up your resources (p. 141)

Step 1: Create a cluster


If you already have a cluster that you want to use, you can skip this step.

For the exercises in this tutorial, use a four-node cluster.

New console
To create a cluster

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://
console.aws.amazon.com/redshift/.


Important
If you use IAM user credentials, make sure that you have the necessary permissions to
perform the cluster operations. For more information, see Controlling access to IAM users in
the Amazon Redshift Cluster Management Guide.
2. At top right, choose the AWS Region in which you want to create the cluster. For the purposes of this
tutorial, choose US West (Oregon).
3. On the navigation menu, choose CLUSTERS, then choose Create cluster. The Create cluster page
appears.
4. On the Create cluster page, enter parameters for your cluster. Choose your own values for the
parameters, except change the following values:

• Choose dc2.large for the node type.


• Choose 4 for the Number of nodes.
• In the Cluster permissions section, choose an IAM role from Available IAM roles. This role should
be one that you previously created and that has access to Amazon S3. Then choose Associate IAM
role to add it to the list of Associated IAM roles for the cluster.
5. Choose Create cluster.

Original console
Follow the steps in Amazon Redshift Getting Started Guide, but choose Multi Node for Cluster Type and
set Number of Compute Nodes to 4.


Follow the Amazon Redshift Getting Started Guide steps to connect to your cluster from a SQL client
and test a connection. You don't need to complete the remaining Getting Started steps to create tables,
upload data, and try example queries.

Next step
Step 2: Download the data files (p. 125)

Step 2: Download the data files


In this step, you download a set of sample data files to your computer. In the next step, you upload the
files to an Amazon S3 bucket.

To download the data files

1. Download the zipped file: LoadingDataSampleFiles.zip.


2. Extract the files to a folder on your computer.
3. Verify that your folder contains the following files.

customer-fw-manifest
customer-fw.tbl-000
customer-fw.tbl-000.bak
customer-fw.tbl-001
customer-fw.tbl-002
customer-fw.tbl-003
customer-fw.tbl-004
customer-fw.tbl-005
customer-fw.tbl-006
customer-fw.tbl-007
customer-fw.tbl.log
dwdate-tab.tbl-000
dwdate-tab.tbl-001
dwdate-tab.tbl-002
dwdate-tab.tbl-003
dwdate-tab.tbl-004
dwdate-tab.tbl-005
dwdate-tab.tbl-006
dwdate-tab.tbl-007
part-csv.tbl-000
part-csv.tbl-001
part-csv.tbl-002
part-csv.tbl-003
part-csv.tbl-004
part-csv.tbl-005
part-csv.tbl-006
part-csv.tbl-007

Next step
Step 3: Upload the files to an Amazon S3 bucket (p. 125)

Step 3: Upload the files to an Amazon S3 bucket


In this step, you create an Amazon S3 bucket and upload the data files to the bucket.

To upload the files to an Amazon S3 bucket

1. Create a bucket in Amazon S3.


a. Sign in to the AWS Management Console and open the Amazon S3 console at https://
console.aws.amazon.com/s3/.
b. Click Create Bucket.
c. In the Bucket Name box of the Create a Bucket dialog box, type a bucket name.

The bucket name you choose must be unique among all existing bucket names in Amazon
S3. One way to help ensure uniqueness is to prefix your bucket names with the name of your
organization. Bucket names must comply with certain rules. For more information, go to Bucket
restrictions and limitations in the Amazon Simple Storage Service User Guide.
d. Select a Region.

Create the bucket in the same Region as your cluster. If your cluster is in the US West (Oregon)
Region, choose US West (Oregon) Region (us-west-2).
e. Choose Create.

When Amazon S3 successfully creates your bucket, the console displays your empty bucket in
the Buckets panel.
2. Create a folder.

a. Choose the name of the new bucket.


b. Choose the Actions button, and then choose Create Folder in the drop-down list.
c. Name the new folder load.
Note
The bucket that you created is not in a sandbox. In this exercise, you add objects to a
real bucket. You're charged a nominal amount for the time that you store the objects
in the bucket. For more information about Amazon S3 pricing, go to the Amazon S3
pricing page.
3. Upload the data files to the new Amazon S3 bucket.

a. Choose the name of the load folder.


b. In the Upload - Select Files wizard, choose Add Files.

Follow the Amazon S3 console instructions to upload all of the files that you downloaded and
extracted.
c. Choose Start Upload.

User Credentials

The Amazon Redshift COPY command must have access to read the file objects in the Amazon S3 bucket.
If you use the same user credentials to create the Amazon S3 bucket and to run the Amazon Redshift
COPY command, the COPY command has all necessary permissions. If you want to use different user
credentials, you can grant access by using the Amazon S3 access controls. The Amazon Redshift COPY
command requires at least ListBucket and GetObject permissions to access the file objects in the Amazon
S3 bucket. For more information about controlling access to Amazon S3 resources, go to Managing
access permissions to your Amazon S3 resources.

Next step
Step 4: Create the sample tables (p. 127)


Step 4: Create the sample tables


For this tutorial, you use a set of five tables based on the Star Schema Benchmark (SSB) schema. The
following diagram shows the SSB data model.

The SSB tables might already exist in the current database. If so, drop the tables to remove them from
the database before you create them using the CREATE TABLE commands in the next step. The tables
used in this tutorial might have different attributes than the existing tables.

To create the sample tables

1. To drop the SSB tables, run the following commands in your SQL client.

drop table part cascade;


drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;

2. Run the following CREATE TABLE commands in your SQL client.

CREATE TABLE part


(
p_partkey INTEGER NOT NULL,
p_name VARCHAR(22) NOT NULL,
p_mfgr VARCHAR(6),
p_category VARCHAR(7) NOT NULL,
p_brand1 VARCHAR(9) NOT NULL,
p_color VARCHAR(11) NOT NULL,
p_type VARCHAR(25) NOT NULL,
p_size INTEGER NOT NULL,
p_container VARCHAR(10) NOT NULL
);

CREATE TABLE supplier
(
s_suppkey INTEGER NOT NULL,
s_name VARCHAR(25) NOT NULL,
s_address VARCHAR(25) NOT NULL,
s_city VARCHAR(10) NOT NULL,
s_nation VARCHAR(15) NOT NULL,
s_region VARCHAR(12) NOT NULL,
s_phone VARCHAR(15) NOT NULL
);

CREATE TABLE customer


(
c_custkey INTEGER NOT NULL,
c_name VARCHAR(25) NOT NULL,
c_address VARCHAR(25) NOT NULL,
c_city VARCHAR(10) NOT NULL,
c_nation VARCHAR(15) NOT NULL,
c_region VARCHAR(12) NOT NULL,
c_phone VARCHAR(15) NOT NULL,
c_mktsegment VARCHAR(10) NOT NULL
);

CREATE TABLE dwdate


(
d_datekey INTEGER NOT NULL,
d_date VARCHAR(19) NOT NULL,
d_dayofweek VARCHAR(10) NOT NULL,
d_month VARCHAR(10) NOT NULL,
d_year INTEGER NOT NULL,
d_yearmonthnum INTEGER NOT NULL,
d_yearmonth VARCHAR(8) NOT NULL,
d_daynuminweek INTEGER NOT NULL,
d_daynuminmonth INTEGER NOT NULL,
d_daynuminyear INTEGER NOT NULL,
d_monthnuminyear INTEGER NOT NULL,
d_weeknuminyear INTEGER NOT NULL,
d_sellingseason VARCHAR(13) NOT NULL,
d_lastdayinweekfl VARCHAR(1) NOT NULL,
d_lastdayinmonthfl VARCHAR(1) NOT NULL,
d_holidayfl VARCHAR(1) NOT NULL,
d_weekdayfl VARCHAR(1) NOT NULL
);
CREATE TABLE lineorder
(
lo_orderkey INTEGER NOT NULL,
lo_linenumber INTEGER NOT NULL,
lo_custkey INTEGER NOT NULL,
lo_partkey INTEGER NOT NULL,
lo_suppkey INTEGER NOT NULL,
lo_orderdate INTEGER NOT NULL,
lo_orderpriority VARCHAR(15) NOT NULL,
lo_shippriority VARCHAR(1) NOT NULL,
lo_quantity INTEGER NOT NULL,
lo_extendedprice INTEGER NOT NULL,
lo_ordertotalprice INTEGER NOT NULL,
lo_discount INTEGER NOT NULL,
lo_revenue INTEGER NOT NULL,
lo_supplycost INTEGER NOT NULL,
lo_tax INTEGER NOT NULL,
lo_commitdate INTEGER NOT NULL,
lo_shipmode VARCHAR(10) NOT NULL
);


Next step
Step 5: Run the COPY commands (p. 129)

Step 5: Run the COPY commands


You run COPY commands to load each of the tables in the SSB schema. The COPY command
examples demonstrate loading from different file formats, using several COPY command options, and
troubleshooting load errors.

Topics
• COPY command syntax (p. 129)
• Loading the SSB tables (p. 131)

COPY command syntax


The basic COPY (p. 576) command syntax is as follows.

COPY table_name [ column_list ] FROM data_source CREDENTIALS access_credentials [options]

To run a COPY command, you provide the following values.

Table name

The target table for the COPY command. The table must already exist in the database. The table can be
temporary or persistent. The COPY command appends the new input data to any existing rows in the
table.

Column list

By default, COPY loads fields from the source data to the table columns in order. You can optionally
specify a column list (a comma-separated list of column names) to map data fields to specific
columns. You don't use column lists in this tutorial. For more information, see Column List (p. 592) in
the COPY command reference.

Data source

You can use the COPY command to load data from an Amazon S3 bucket, an Amazon EMR cluster, a
remote host using an SSH connection, or an Amazon DynamoDB table. For this tutorial, you load from
data files in an Amazon S3 bucket. When loading from Amazon S3, you must provide the name of the
bucket and the location of the data files. To do this, provide either an object path for the data files or the
location of a manifest file that explicitly lists each data file and its location.

• Key prefix

An object stored in Amazon S3 is uniquely identified by an object key, which includes the bucket name,
folder names, if any, and the object name. A key prefix refers to a set of objects with the same prefix.
The object path is a key prefix that the COPY command uses to load all objects that share the key
prefix. For example, the key prefix custdata.txt can refer to a single file or to a set of files, including
custdata.txt.001, custdata.txt.002, and so on.
• Manifest file

In some cases, you might need to load files with different prefixes, for example from multiple buckets
or folders. In others, you might need to exclude files that share a prefix. In these cases, you can use a


manifest file. A manifest file explicitly lists each load file and its unique object key. You use a manifest
file to load the PART table later in this tutorial.

Credentials

To access the AWS resources that contain the data to load, you must provide AWS access credentials
for an AWS user or an IAM user with sufficient privileges. These credentials include an IAM role
Amazon Resource Name (ARN). To load data from Amazon S3, the credentials must include ListBucket
and GetObject permissions. Additional credentials are required if your data is encrypted. For more
information, see Authorization parameters (p. 590) in the COPY command reference. For more
information about managing access, go to Managing access permissions to your Amazon S3 resources.

Options

You can specify a number of parameters with the COPY command to specify file formats, manage
data formats, manage errors, and control other features. In this tutorial, you use the following COPY
command options and features:

• Key prefix

For information on how to load from multiple files by specifying a key prefix, see Load the PART table
using NULL AS (p. 131).
• CSV format

For information on how to load data that is in CSV format, see Load the PART table using NULL
AS (p. 131).
• NULL AS

For information on how to load PART using the NULL AS option, see Load the PART table using NULL
AS (p. 131).
• Character-delimited format

For information on how to use the DELIMITER option, see Load the SUPPLIER table using
REGION (p. 134).
• REGION

For information on how to use the REGION option, see Load the SUPPLIER table using
REGION (p. 134).
• Fixed-format width

For information on how to load the CUSTOMER table from fixed-width data, see Load the CUSTOMER
table using MANIFEST (p. 134).
• MAXERROR

For information on how to use the MAXERROR option, see Load the CUSTOMER table using
MANIFEST (p. 134).
• ACCEPTINVCHARS

For information on how to use the ACCEPTINVCHARS option, see Load the CUSTOMER table using
MANIFEST (p. 134).
• MANIFEST

For information on how to use the MANIFEST option, see Load the CUSTOMER table using
MANIFEST (p. 134).
• DATEFORMAT


For information on how to use the DATEFORMAT option, see Load the DWDATE table using
DATEFORMAT (p. 138).
• GZIP, LZOP and BZIP2

For information on how to compress your files, see Load the LINEORDER table using multiple
files (p. 139).
• COMPUPDATE

For information on how to use the COMPUPDATE option, see Load the LINEORDER table using
multiple files (p. 139).
• Multiple files

For information on how to load multiple files, see Load the LINEORDER table using multiple
files (p. 139).

Loading the SSB tables


You use the following COPY commands to load each of the tables in the SSB schema. The command to
each table demonstrates different COPY options and troubleshooting techniques.

To load the SSB tables, follow these steps:

1. Replace the bucket name and AWS credentials (p. 131)


2. Load the PART table using NULL AS (p. 131)
3. Load the SUPPLIER table using REGION (p. 134)
4. Load the CUSTOMER table using MANIFEST (p. 134)
5. Load the DWDATE table using DATEFORMAT (p. 138)
6. Load the LINEORDER table using multiple files (p. 139)

Replace the bucket name and AWS credentials


The COPY commands in this tutorial are presented in the following format.

copy table from 's3://<your-bucket-name>/load/key_prefix'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
options;

For each COPY command, do the following:

1. Replace <your-bucket-name> with the name of a bucket in the same region as your cluster.

This step assumes the bucket and the cluster are in the same region. Alternatively, you can specify the
region using the REGION (p. 583) option with the COPY command.
2. Replace <aws-account-id> and <role-name> with your own AWS account and IAM role. The
segment of the credentials string that is enclosed in single quotation marks must not contain any
spaces or line breaks.

Load the PART table using NULL AS


In this step, you use the CSV and NULL AS options to load the PART table.

The COPY command can load data from multiple files in parallel, which is much faster than loading from
a single file. To demonstrate this principle, the data for each table in this tutorial is split into eight files,


even though the files are very small. In a later step, you compare the time difference between loading
from a single file and loading from multiple files. For more information, see Split your load data (p. 20).

Key prefix

You can load from multiple files by specifying a key prefix for the file set, or by explicitly listing the files
in a manifest file. In this step, you use a key prefix. In a later step, you use a manifest file. The key prefix
's3://mybucket/load/part-csv.tbl' loads the following set of the files in the load folder.

part-csv.tbl-000
part-csv.tbl-001
part-csv.tbl-002
part-csv.tbl-003
part-csv.tbl-004
part-csv.tbl-005
part-csv.tbl-006
part-csv.tbl-007

CSV format

CSV, which stands for comma separated values, is a common format used for importing and exporting
spreadsheet data. CSV is more flexible than comma-delimited format because it enables you to include
quoted strings within fields. The default quotation mark character for COPY from CSV format is a
double quotation mark ( " ), but you can specify another quotation mark character by using the QUOTE
AS option. When you use the quotation mark character within the field, escape the character with an
additional quotation mark character.

The following excerpt from a CSV-formatted data file for the PART table shows strings enclosed in
double quotation marks ("LARGE ANODIZED BRASS"). It also shows a string enclosed in two double
quotation marks within a quoted string ("MEDIUM ""BURNISHED"" TIN").

15,dark sky,MFGR#3,MFGR#47,MFGR#3438,indigo,"LARGE ANODIZED BRASS",45,LG CASE


22,floral beige,MFGR#4,MFGR#44,MFGR#4421,medium,"PROMO, POLISHED BRASS",19,LG DRUM
23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR

The data for the PART table contains characters that cause COPY to fail. In this exercise, you
troubleshoot the errors and correct them.

To load data that is in CSV format, add csv to your COPY command. Run the following command to load
the PART table.

copy part from 's3://<your-bucket-name>/load/part-csv.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;

You might get an error message similar to the following.

An error occurred when executing the SQL command:


copy part from 's3://mybucket/load/part-csv.tbl'
credentials' ...

ERROR: Load into table 'part' failed. Check 'stl_load_errors' system table for details.
[SQL State=XX000]

Execution time: 1.46s

1 statement(s) failed.


To get more information about the error, query the STL_LOAD_ERRORS table. The following query uses
the SUBSTRING function to shorten columns for readability and uses LIMIT 10 to reduce the number of
rows returned. You can adjust the values in substring(filename,22,25) to allow for the length of
your bucket name.

select query, substring(filename,22,25) as filename,line_number as line,


substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as
line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as reason
from stl_load_errors
order by query desc
limit 10;

query | filename | line | column | type | pos |


--------+-------------------------+-----------+------------+------------+-----+----
333765 | part-csv.tbl-000 | 1 | | | 0 |

line_text | field_text | reason


------------------+------------+----------------------------------------------
15,NUL next, | | Missing newline: Unexpected character 0x2c f

NULL AS

The part-csv.tbl data files use the NUL terminator character (\x000 or \x0) to indicate NULL values.
Note
Despite very similar spelling, NUL and NULL are not the same. NUL is a UTF-8 character with
codepoint x000 that is often used to indicate end of record (EOR). NULL is a SQL value that
represents an absence of data.

By default, COPY treats a NUL terminator character as an EOR character and terminates the record,
which often leads to unexpected results or an error. There is no single standard method of indicating
NULL in text data. Thus, the NULL AS COPY command option enables you to specify which character
to substitute with NULL when loading the table. In this example, you want COPY to treat the NUL
terminator character as a NULL value.
Note
The table column that receives the NULL value must be configured as nullable. That is, it must
not include the NOT NULL constraint in the CREATE TABLE specification.

To load PART using the NULL AS option, run the following COPY command.

copy part from 's3://<your-bucket-name>/load/part-csv.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
null as '\000';

To verify that COPY loaded NULL values, run the following command to select only the rows that contain
NULL.

select p_partkey, p_name, p_mfgr, p_category from part where p_mfgr is null;

p_partkey | p_name | p_mfgr | p_category


-----------+----------+--------+------------
15 | NUL next | | MFGR#47
81 | NUL next | | MFGR#23
133 | NUL next | | MFGR#44


(3 rows)

Load the SUPPLIER table using REGION


In this step, you use the DELIMITER and REGION options to load the SUPPLIER table.
Note
The files for loading the SUPPLIER table are provided in an AWS sample bucket. You don't need
to upload files for this step.

Character-Delimited Format

The fields in a character-delimited file are separated by a specific character, such as a pipe character ( | ),
a comma ( , ) or a tab ( \t ). Character-delimited files can use any single ASCII character, including one
of the nonprinting ASCII characters, as the delimiter. You specify the delimiter character by using the
DELIMITER option. The default delimiter is a pipe character ( | ).

The following excerpt from the data for the SUPPLIER table uses pipe-delimited format.

1|1|257368|465569|41365|19950218|2-HIGH|0|17|2608718|9783671|4|2504369|92072|2|19950331|
TRUCK
1|2|257368|201928|8146|19950218|2-HIGH|0|36|6587676|9783671|9|5994785|109794|6|19950416|
MAIL

REGION

Whenever possible, you should locate your load data in the same AWS region as your Amazon Redshift
cluster. If your data and your cluster are in the same region, you reduce latency and avoid cross-region
data transfer costs. For more information, see Amazon Redshift best practices for loading data (p. 19)

If you must load data from a different AWS region, use the REGION option to specify the AWS region in
which the load data is located. If you specify a region, all of the load data, including manifest files, must
be in the named region. For more information, see REGION (p. 583).

If your cluster is in the US East (N. Virginia) Region, run the following command to load the SUPPLIER
table from pipe-delimited data in an Amazon S3 bucket located in the US West (Oregon) Region. For this
example, do not change the bucket name.

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
delimiter '|'
gzip
region 'us-west-2';

If your cluster is not in the US East (N. Virginia) region, run the following command to load the SUPPLIER
table from pipe-delimited data in an Amazon S3 bucket located in the US East (N. Virginia) region. For
this example, do not change the bucket name.

copy supplier from 's3://awssampledb/ssbgz/supplier.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
delimiter '|'
gzip
region 'us-east-1';

Load the CUSTOMER table using MANIFEST


In this step, you use the FIXEDWIDTH, MAXERROR, ACCEPTINVCHARS, and MANIFEST options to load the
CUSTOMER table.


The sample data for this exercise contains characters that cause errors when COPY attempts to load
them. You use the MAXERROR option and the STL_LOAD_ERRORS system table to troubleshoot the
load errors, and then use the ACCEPTINVCHARS and MANIFEST options to eliminate the errors.

Fixed-Width Format

Fixed-width format defines each field as a fixed number of characters, rather than separating fields with
a delimiter. The following excerpt from the data for the CUSTOMER table uses fixed-width format.

1 Customer#000000001 IVhzIApeRb MOROCCO 0MOROCCO AFRICA 25-705


2 Customer#000000002 XSTf4,NCwDVaWNe6tE JORDAN 6JORDAN MIDDLE EAST 23-453
3 Customer#000000003 MG9kdTD ARGENTINA5ARGENTINAAMERICA 11-783

The order of the label/width pairs must match the order of the table columns exactly. For more
information, see FIXEDWIDTH (p. 594).

The fixed-width specification string for the CUSTOMER table data is as follows.

fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,


c_region :12, c_phone:15,c_mktsegment:10'

To load the CUSTOMER table from fixed-width data, run the following command.

copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12,
c_phone:15,c_mktsegment:10';

You should get an error message, similar to the following.

An error occurred when executing the SQL command:


copy customer
from 's3://mybucket/load/customer-fw.tbl'
credentials'...

ERROR: Load into table 'customer' failed. Check 'stl_load_errors' system table for
details. [SQL State=XX000]

Execution time: 2.95s

1 statement(s) failed.

MAXERROR

By default, the first time COPY encounters an error, the command fails and returns an error message. To
save time during testing, you can use the MAXERROR option to instruct COPY to skip a specified number
of errors before it fails. Because we expect errors the first time we test loading the CUSTOMER table
data, add maxerror 10 to the COPY command.

To test using the FIXEDWIDTH and MAXERROR options, run the following command.

copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12,
c_phone:15,c_mktsegment:10'
maxerror 10;


This time, instead of an error message, you get a warning message similar to the following.

Warnings:
Load into table 'customer' completed, 112497 record(s) loaded successfully.
Load into table 'customer' completed, 7 record(s) could not be loaded. Check
'stl_load_errors' system table for details.

The warning indicates that COPY encountered seven errors. To check the errors, query the
STL_LOAD_ERRORS table, as shown in the following example.

select query, substring(filename,22,25) as filename,line_number as line,


substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as
line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as error_reason
from stl_load_errors
order by query desc, filename
limit 7;

The results of the STL_LOAD_ERRORS query should look similar to the following.

query | filename | line | column | type | pos |


line_text | field_text | error_reason
--------+---------------------------+------+-----------
+------------+-----+-------------------------------+------------
+----------------------------------------------
334489 | customer-fw.tbl.log | 2 | c_custkey | int4 | -1 | customer-fw.tbl
| customer-f | Invalid digit, Value 'c', Pos 0, Type: Integ
334489 | customer-fw.tbl.log | 6 | c_custkey | int4 | -1 | Complete
| Complete | Invalid digit, Value 'C', Pos 0, Type: Integ
334489 | customer-fw.tbl.log | 3 | c_custkey | int4 | -1 | #Total rows
| #Total row | Invalid digit, Value '#', Pos 0, Type: Integ
334489 | customer-fw.tbl.log | 5 | c_custkey | int4 | -1 | #Status
| #Status | Invalid digit, Value '#', Pos 0, Type: Integ
334489 | customer-fw.tbl.log | 1 | c_custkey | int4 | -1 | #Load file
| #Load file | Invalid digit, Value '#', Pos 0, Type: Integ
334489 | customer-fw.tbl000 | 1 | c_address | varchar | 34 | 1
Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8
334489 | customer-fw.tbl000 | 1 | c_address | varchar | 34 | 1
Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8
(7 rows)

By examining the results, you can see that there are two messages in the error_reason column:


Invalid digit, Value '#', Pos 0, Type: Integ

These errors are caused by the customer-fw.tbl.log file. The problem is that it is a log file, not a
data file, and should not be loaded. You can use a manifest file to avoid loading the wrong file.

String contains invalid or unsupported UTF8

The VARCHAR data type supports multibyte UTF-8 characters up to three bytes. If the load data
contains unsupported or invalid characters, you can use the ACCEPTINVCHARS option to replace each
invalid character with a specified alternative character.

Another problem with the load is more difficult to detect—the load produced unexpected results. To
investigate this problem, run the following command to query the CUSTOMER table.


select c_custkey, c_name, c_address


from customer
order by c_custkey
limit 10;

c_custkey | c_name | c_address


-----------+---------------------------+---------------------------
2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE
2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE
3 | Customer#000000003 | MG9kdTD
3 | Customer#000000003 | MG9kdTD
4 | Customer#000000004 | XxVSJsL
4 | Customer#000000004 | XxVSJsL
5 | Customer#000000005 | KvpyuHCplrB84WgAi
5 | Customer#000000005 | KvpyuHCplrB84WgAi
6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx
6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx
(10 rows)

The rows should be unique, but there are duplicates.

Another way to check for unexpected results is to verify the number of rows that were loaded. In our
case, 100000 rows should have been loaded, but the load message reported loading 112497 records.
The extra rows were loaded because COPY loaded an extraneous file, customer-fw.tbl-000.bak.

In this exercise, you use a manifest file to avoid loading the wrong files.

ACCEPTINVCHARS

By default, when COPY encounters a character that is not supported by the column's data type, it skips
the row and returns an error. For information about invalid UTF-8 characters, see Multibyte character
load errors (p. 94).

You could use the MAXERROR option to ignore errors and continue loading, then query
STL_LOAD_ERRORS to locate the invalid characters, and then fix the data files. However, MAXERROR
is best used for troubleshooting load problems and generally should not be used in a production
environment.

The ACCEPTINVCHARS option is usually a better choice for managing invalid characters.
ACCEPTINVCHARS instructs COPY to replace each invalid character with a specified valid character
and continue with the load operation. You can specify any valid ASCII character, except NULL, as the
replacement character. The default replacement character is a question mark ( ? ). COPY replaces
multibyte characters with a replacement string of equal length. For example, a 4-byte character would
be replaced with '????'.

COPY returns the number of rows that contained invalid UTF-8 characters. It also adds an entry to the
STL_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice.
Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded.

ACCEPTINVCHARS is valid only for VARCHAR columns.

For this step, you add the ACCEPTINVCHARS with the replacement character '^'.
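After the load completes, one way to see which rows had characters replaced is to query the STL_REPLACEMENTS system table. The following query is a sketch that assumes the documented columns of that table:

select query, session, filename, line_number, colname,
substring(raw_line, 1, 30) as raw_line
from stl_replacements
order by query desc
limit 10;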

MANIFEST

When you COPY from Amazon S3 using a key prefix, there is a risk that you might load unwanted files.
For example, the 's3://mybucket/load/' folder contains eight data files that share the key prefix
customer-fw.tbl: customer-fw.tbl-000, customer-fw.tbl-001, and so on. However, the same
folder also contains the extraneous files customer-fw.tbl.log and customer-fw.tbl-000.bak.


To ensure that you load all of the correct files, and only the correct files, use a manifest file. The manifest
is a text file in JSON format that explicitly lists the unique object key for each source file to be loaded.
The file objects can be in different folders or different buckets, but they must be in the same region. For
more information, see MANIFEST (p. 583).

The following shows the customer-fw-manifest text.

{
"entries": [
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-000"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-001"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-002"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-003"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-004"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-005"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-006"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-007"}
]
}

To load the data for the CUSTOMER table using the manifest file

1. Open the file customer-fw-manifest in a text editor.


2. Replace <your-bucket-name> with the name of your bucket.
3. Save the file.
4. Upload the file to the load folder on your bucket.
5. Run the following COPY command.

copy customer from 's3://<your-bucket-name>/load/customer-fw-manifest'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,
c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10
acceptinvchars as '^'
manifest;

Load the DWDATE table using DATEFORMAT


In this step, you use the DELIMITER and DATEFORMAT options to load the DWDATE table.

When loading DATE and TIMESTAMP columns, COPY expects the default format, which is YYYY-MM-DD
for dates and YYYY-MM-DD HH:MI:SS for timestamps. If the load data does not use a default format, you
can use DATEFORMAT and TIMEFORMAT to specify the format.

The following excerpt shows date formats in the DWDATE table. Notice that the date formats in column
two are inconsistent.

19920104 1992-01-04 Sunday January 1992 199201 Jan1992 1 4 4 1...


19920112 January 12, 1992 Monday January 1992 199201 Jan1992 2 12 12 1...
19920120 January 20, 1992 Tuesday January 1992 199201 Jan1992 3 20 20 1...

DATEFORMAT

You can specify only one date format. If the load data contains inconsistent formats, possibly in
different columns, or if the format is not known at load time, you use DATEFORMAT with the 'auto'


argument. When 'auto' is specified, COPY recognizes any valid date or time format and converts it to
the default format. The 'auto' option recognizes several formats that are not supported when using
a DATEFORMAT and TIMEFORMAT string. For more information, see Using automatic recognition with
DATEFORMAT and TIMEFORMAT (p. 621).

To load the DWDATE table, run the following COPY command.

copy dwdate from 's3://<your-bucket-name>/load/dwdate-tab.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
delimiter '\t'
dateformat 'auto';

Load the LINEORDER table using multiple files


This step uses the GZIP and COMPUPDATE options to load the LINEORDER table.

In this exercise, you load the LINEORDER table from a single data file and then load it again from
multiple files. Doing this enables you to compare the load times for the two methods.
Note
The files for loading the LINEORDER table are provided in an AWS sample bucket. You don't
need to upload files for this step.

GZIP, LZOP and BZIP2

You can compress your files using either gzip, lzop, or bzip2 compression formats. When loading from
compressed files, COPY uncompresses the files during the load process. Compressing your files saves
storage space and shortens upload times.

COMPUPDATE

When COPY loads an empty table with no compression encodings, it analyzes the load data to determine
the optimal encodings. It then alters the table to use those encodings before beginning the load. This
analysis process takes time, but it occurs, at most, once per table. To save time, you can skip this step by
turning COMPUPDATE off. To enable an accurate evaluation of COPY times, you turn COMPUPDATE off
for this step.

Multiple Files

The COPY command can load data very efficiently when it loads from multiple files in parallel instead
of from a single file. You can split your data into files so that the number of files is a multiple of the
number of slices in your cluster. If you do, Amazon Redshift divides the workload and distributes the data
evenly among the slices. The number of slices per node depends on the node size of the cluster. For more
information about the number of slices that each node size has, go to About clusters and nodes in the
Amazon Redshift Cluster Management Guide.
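If you aren't sure how many slices your cluster has, one quick check (a sketch that assumes access to the STV_SLICES system view) is to count its rows; split your load data into a multiple of this number of files:

select count(*) as total_slices
from stv_slices;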

For example, the dc2.large compute nodes used in this tutorial have two slices each, so the four-node
cluster has eight slices. In previous steps, the load data was contained in eight files, even though the files
are very small. In this step, you compare the time difference between loading from a single large file and
loading from multiple files.

The files you use for this tutorial contain about 15 million records and occupy about 1.2 GB. These files
are very small by Amazon Redshift standards, but sufficient to demonstrate the performance advantage of
loading from multiple files. The files are large enough that the time required to download them and then
upload them to Amazon S3 is excessive for this tutorial. Thus, you load the files directly from an AWS
sample bucket.

The following screenshot shows the data files for LINEORDER.


To evaluate the performance of COPY with multiple files

1. Run the following command to COPY from a single file. Do not change the bucket name.

copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
gzip
compupdate off
region 'us-east-1';

2. Your results should be similar to the following. Note the execution time.

Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.

0 row(s) affected.
copy executed successfully

Execution time: 51.56s

3. Run the following command to COPY from multiple files. Do not change the bucket name.

copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'


credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
gzip
compupdate off
region 'us-east-1';

4. Your results should be similar to the following. Note the execution time.

Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.

0 row(s) affected.
copy executed successfully

Execution time: 17.7s


5. Compare execution times.

In our example, the time to load 15 million records decreased from 51.56 seconds to 17.7 seconds, a
reduction of 65.7 percent.

These results are based on using a four-node cluster. If your cluster has more nodes, the time savings
is multiplied. For typical Amazon Redshift clusters, with tens to hundreds of nodes, the difference
is even more dramatic. If you have a single node cluster, there is little difference between the
execution times.

Next step
Step 6: Vacuum and analyze the database (p. 141)

Step 6: Vacuum and analyze the database


Whenever you add, delete, or modify a significant number of rows, you should run a VACUUM command
and then an ANALYZE command. A vacuum recovers the space from deleted rows and restores the sort
order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to
generate more accurate query plans. For more information, see Vacuuming tables (p. 108).

If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number
of rows, but you added them to empty tables. That being the case, there is no need to re-sort, and you
didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your
statistics should be up-to-date. However, as a matter of good housekeeping, you complete this tutorial
by vacuuming and analyzing your database.

To vacuum and analyze the database, run the following commands.

vacuum;
analyze;

Next step
Step 7: Clean up your resources (p. 141)

Step 7: Clean up your resources


Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial,
you should return your environment to the previous state by following the steps in Step 5: Revoke access
and delete your sample cluster in the Amazon Redshift Getting Started Guide.

If you want to keep the cluster, but recover the storage used by the SSB tables, run the following
commands.

drop table part;


drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;

Next
Summary (p. 142)


Summary
In this tutorial, you uploaded data files to Amazon S3 and then used COPY commands to load the data
from the files into Amazon Redshift tables.

You loaded data using the following formats:

• Character-delimited
• CSV
• Fixed-width

You used the STL_LOAD_ERRORS system table to troubleshoot load errors, and then used the REGION,
MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the errors.

You applied the following best practices for loading data:

• Use a COPY command to load data (p. 19)


• Split your load data (p. 20)
• Use a single COPY command to load from multiple files (p. 19)
• Compress your data files (p. 20)
• Verify data files before and after a load (p. 20)

For more information about Amazon Redshift best practices, see the following links:

• Amazon Redshift best practices for loading data (p. 19)


• Amazon Redshift best practices for designing tables (p. 16)
• Amazon Redshift best practices for designing queries (p. 22)


Unloading data
Topics
• Unloading data to Amazon S3 (p. 143)
• Unloading encrypted data files (p. 145)
• Unloading data in delimited or fixed-width format (p. 147)
• Reloading unloaded data (p. 148)

To unload data from database tables to a set of files in an Amazon S3 bucket, you can use the
UNLOAD (p. 850) command with a SELECT statement. You can unload text data in either delimited
format or fixed-width format, regardless of the data format that was used to load it. You can also specify
whether to create compressed GZIP files.

You can limit the access users have to your Amazon S3 bucket by using temporary security credentials.
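For example, an UNLOAD command might supply temporary credentials through the credentials string, as in the following sketch. The access key, secret key, and token values are placeholders and must come from a temporary session (for example, credentials issued by AWS STS).

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
credentials 'aws_access_key_id=<temporary-access-key-id>;aws_secret_access_key=<temporary-secret-access-key>;token=<temporary-token>';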

Unloading data to Amazon S3


Amazon Redshift splits the results of a select statement across a set of files, one or more files per node
slice, to simplify parallel reloading of the data. Alternatively, you can specify that UNLOAD (p. 850)
should write the results serially to one or more files by adding the PARALLEL OFF option. You can limit
the size of the files in Amazon S3 by specifying the MAXFILESIZE parameter. UNLOAD automatically
encrypts data files using Amazon S3 server-side encryption (SSE-S3).

You can use any select statement in the UNLOAD command that Amazon Redshift supports, except
for a select that uses a LIMIT clause in the outer select. For example, you can use a select statement
that includes specific columns or that uses a where clause to join multiple tables. If your query contains
quotation marks (enclosing literal values, for example), you need to escape them in the query text (\').
For more information, see the SELECT (p. 801) command reference. For more information about using a
LIMIT clause, see the Usage notes (p. 856) for the UNLOAD command.
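If you need to unload only a subset of rows, one approach (a sketch based on the usage notes) is to place the LIMIT in a nested subquery so that the outer select has no LIMIT clause.

unload ('select * from venue where venueid in
(select venueid from venue order by venueid desc limit 10)')
to 's3://mybucket/tickit/unload/venue_subset_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';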

For example, the following UNLOAD command sends the contents of the VENUE table to the Amazon S3
bucket s3://mybucket/tickit/unload/.

unload ('select * from venue')


to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

The file names created by the previous example include the prefix 'venue_'.

venue_0000_part_00
venue_0001_part_00
venue_0002_part_00
venue_0003_part_00

By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the
cluster. To write data to a single file, specify PARALLEL OFF. UNLOAD writes the data serially, sorted
absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. If
the data size is greater than the maximum, UNLOAD creates additional files, up to 6.2 GB each.


The following example writes the contents of VENUE to a single file. Only one file is required because the
file size is less than 6.2 GB.

unload ('select * from venue')


to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
parallel off;

Note
The UNLOAD command is designed to use parallel processing. We recommend leaving
PARALLEL enabled for most cases, especially if the files will be used to load tables using a COPY
command.

Assuming the total data size for VENUE is 5 GB, the following example writes the contents of VENUE to
50 files, each 100 MB in size.

unload ('select * from venue')


to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
parallel off
maxfilesize 100 mb;

If you include a prefix in the Amazon S3 path string, UNLOAD will use that prefix for the file names.

unload ('select * from venue')


to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

You can create a manifest file that lists the unload files by specifying the MANIFEST option in the
UNLOAD command. The manifest is a text file in JSON format that explicitly lists the URL of each file
that was written to Amazon S3.

The following example includes the manifest option.

unload ('select * from venue')


to 's3://mybucket/tickit/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;

The following example shows a manifest for four unload files.

{
"entries": [
{"url":"s3://mybucket/tickit/venue_0000_part_00"},
{"url":"s3://mybucket/tickit/venue_0001_part_00"},
{"url":"s3://mybucket/tickit/venue_0002_part_00"},
{"url":"s3://mybucket/tickit/venue_0003_part_00"}
]
}

The manifest file can be used to load the same files by using a COPY with the MANIFEST option. For
more information, see Using a manifest to specify data files (p. 73).

After you complete an UNLOAD operation, confirm that the data was unloaded correctly by navigating to
the Amazon S3 bucket where UNLOAD wrote the files. You will see one or more numbered files per slice,
starting with the number zero. If you specified the MANIFEST option, you will also see a file ending with
'manifest'. For example:


mybucket/tickit/venue_0000_part_00
mybucket/tickit/venue_0001_part_00
mybucket/tickit/venue_0002_part_00
mybucket/tickit/venue_0003_part_00
mybucket/tickit/venue_manifest

You can programmatically get a list of the files that were written to Amazon S3 by calling an Amazon S3
list operation after the UNLOAD completes. You can also query STL_UNLOAD_LOG.

The following query returns the pathname for files that were created by an UNLOAD. The
PG_LAST_QUERY_ID (p. 1234) function returns the most recent query.

select query, substring(path,0,40) as path


from stl_unload_log
where query=2320
order by path;

query | path
-------+--------------------------------------
2320 | s3://my-bucket/venue0000_part_00
2320 | s3://my-bucket/venue0001_part_00
2320 | s3://my-bucket/venue0002_part_00
2320 | s3://my-bucket/venue0003_part_00
(4 rows)

If the amount of data is very large, Amazon Redshift might split the files into multiple parts per slice. For
example:

venue_0000_part_00
venue_0000_part_01
venue_0000_part_02
venue_0001_part_00
venue_0001_part_01
venue_0001_part_02
...

The following UNLOAD command includes a quoted string in the select statement, so the quotation
marks are escaped (=\'OH\' ').

unload ('select venuename, venuecity from venue where venuestate=\'OH\' ')


to 's3://mybucket/tickit/venue/ '
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

By default, UNLOAD will fail rather than overwrite existing files in the destination bucket. To overwrite
the existing files, including the manifest file, specify the ALLOWOVERWRITE option.

unload ('select * from venue')


to 's3://mybucket/venue_pipe_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
allowoverwrite;

Unloading encrypted data files


UNLOAD automatically creates files using Amazon S3 server-side encryption with AWS-managed
encryption keys (SSE-S3). You can also specify server-side encryption with an AWS Key Management


Service key (SSE-KMS) or client-side encryption with a customer managed key. UNLOAD doesn't support
Amazon S3 server-side encryption using a customer managed key. For more information, see Protecting
data using server-side encryption.

To unload to Amazon S3 using server-side encryption with an AWS KMS key, use the KMS_KEY_ID
parameter to provide the key ID as shown in the following example.

unload ('select venuename, venuecity from venue')


to 's3://mybucket/encrypted/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
KMS_KEY_ID '1234abcd-12ab-34cd-56ef-1234567890ab'
encrypted;

If you want to provide your own encryption key, you can create client-side encrypted data files in
Amazon S3 by using the UNLOAD command with the ENCRYPTED option. UNLOAD uses the same
envelope encryption process that Amazon S3 client-side encryption uses. You can then use the COPY
command with the ENCRYPTED option to load the encrypted files.

The process works like this:

1. You create a base64 encoded 256-bit AES key that you will use as your private encryption key, or root
symmetric key.
2. You issue an UNLOAD command that includes your root symmetric key and the ENCRYPTED option.
3. UNLOAD generates a one-time-use symmetric key (called the envelope symmetric key) and an
initialization vector (IV), which it uses to encrypt your data.
4. UNLOAD encrypts the envelope symmetric key using your root symmetric key.
5. UNLOAD then stores the encrypted data files in Amazon S3 and stores the encrypted envelope key
and IV as object metadata with each file. The encrypted envelope key is stored as object metadata
x-amz-meta-x-amz-key and the IV is stored as object metadata x-amz-meta-x-amz-iv.

For more information about the envelope encryption process, see the Client-side data encryption with
the AWS SDK for Java and Amazon S3 article.

To unload encrypted data files, add the root key value to the credentials string and include the
ENCRYPTED option. If you use the MANIFEST option, the manifest file is also encrypted.

unload ('select venuename, venuecity from venue')


to 's3://mybucket/encrypted/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
manifest
encrypted;

To unload encrypted data files that are GZIP compressed, include the GZIP option along with the root
key value and the ENCRYPTED option.

unload ('select venuename, venuecity from venue')


to 's3://mybucket/encrypted/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
encrypted gzip;

To load the encrypted data files, add the MASTER_SYMMETRIC_KEY parameter with the same root key
value and include the ENCRYPTED option.

copy venue from 's3://mybucket/encrypted/venue_'


iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
master_symmetric_key '<root_key>'
encrypted;

Unloading data in delimited or fixed-width format


You can unload data in delimited format or fixed-width format. The default output is pipe-delimited
(using the '|' character).

The following example specifies a comma as the delimiter:

unload ('select * from venue')


to 's3://mybucket/tickit/venue/comma'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ',';

The resulting output files look like this:

20,Air Canada Centre,Toronto,ON,0


60,Rexall Place,Edmonton,AB,0
100,U.S. Cellular Field,Chicago,IL,40615
200,Al Hirschfeld Theatre,New York City,NY,0
240,San Jose Repertory Theatre,San Jose,CA,0
300,Kennedy Center Opera House,Washington,DC,0
...

To unload the same result set to a tab-delimited file, issue the following command:

unload ('select * from venue')


to 's3://mybucket/tickit/venue/tab'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter as '\t';

Alternatively, you can use a FIXEDWIDTH specification. This specification consists of an identifier for
each table column and the width of the column (number of characters). The UNLOAD command will fail
rather than truncate data, so specify a width that is at least as long as the longest entry for that column.
Unloading fixed-width data works similarly to unloading delimited data, except that the resulting output
contains no delimiting characters. For example:

unload ('select * from venue')


to 's3://mybucket/tickit/venue/fw'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth '0:3,1:100,2:30,3:2,4:6';

The fixed-width output looks like this:

20 Air Canada Centre Toronto ON0


60 Rexall Place Edmonton AB0
100U.S. Cellular Field Chicago IL40615
200Al Hirschfeld Theatre New York CityNY0
240San Jose Repertory TheatreSan Jose CA0
300Kennedy Center Opera HouseWashington DC0

For more details about FIXEDWIDTH specifications, see the UNLOAD (p. 850) command.


Reloading unloaded data


To reload the results of an unload operation, you can use a COPY command.

The following example shows a simple case in which the VENUE table is unloaded using a manifest file,
truncated, and reloaded.

unload ('select * from venue order by venueid')


to 's3://mybucket/tickit/venue/reload_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';

truncate venue;

copy venue
from 's3://mybucket/tickit/venue/reload_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';

After it is reloaded, the VENUE table looks like this:

select * from venue order by venueid limit 5;

venueid | venuename | venuecity | venuestate | venueseats


---------+---------------------------+-------------+------------+-----------
1 | Toyota Park | Bridgeview | IL | 0
2 | Columbus Crew Stadium | Columbus | OH | 0
3 | RFK Stadium | Washington | DC | 0
4 | CommunityAmerica Ballpark | Kansas City | KS | 0
5 | Gillette Stadium | Foxborough | MA | 68756
(5 rows)


Creating user-defined functions


You can create a custom scalar user-defined function (UDF) using either a SQL SELECT clause or a
Python program. The new function is stored in the database and is available for any user with sufficient
privileges to run. You run a custom scalar UDF in much the same way as you run existing Amazon
Redshift functions.

For Python UDFs, in addition to using the standard Python functionality, you can import your own
custom Python modules. For more information, see Python language support for UDFs (p. 152).

You can also create AWS Lambda UDFs that use custom functions defined in Lambda as part of your SQL
queries. Lambda UDFs enable you to write complex UDFs and integrate with third-party components.
They also can help you overcome some of the limitations of current Python and SQL UDFs. For example,
they can help you access network and storage resources and write more full-fledged SQL statements.
You can create Lambda UDFs in any of the programming languages supported by Lambda, such as Java,
Go, PowerShell, Node.js, C#, Python, and Ruby. Or you can use a custom runtime.

By default, all users can run UDFs. For more information about privileges, see UDF security and
privileges (p. 149).

Topics
• UDF security and privileges (p. 149)
• Creating a scalar SQL UDF (p. 150)
• Creating a scalar Python UDF (p. 151)
• Creating a scalar Lambda UDF (p. 156)
• Example uses of user-defined functions (UDFs) (p. 162)

UDF security and privileges


To create a UDF, you must have permission for usage on language for SQL or plpythonu (Python). By
default, USAGE ON LANGUAGE SQL is granted to PUBLIC, but you must explicitly grant USAGE ON
LANGUAGE PLPYTHONU to specific users or groups.
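For example, the following statement grants usage on plpythonu to the user group udf_devs, the same group used in the examples that follow.

grant usage on language plpythonu to group udf_devs;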

To revoke usage for SQL, first revoke usage from PUBLIC. Then grant usage on SQL only to the specific
users or groups permitted to create SQL UDFs. The following example revokes usage on SQL from
PUBLIC. Then it grants usage to the user group udf_devs.

revoke usage on language sql from PUBLIC;


grant usage on language sql to group udf_devs;

To run a UDF, you must have permission to do so for each function. By default, permission to run new
UDFs is granted to PUBLIC. To restrict usage, revoke this permission from PUBLIC for the function. Then
grant the privilege to specific individuals or groups.

The following example revokes execution on function f_py_greater from PUBLIC. Then it grants usage
to the user group udf_devs.


revoke execute on function f_py_greater(a float, b float) from PUBLIC;


grant execute on function f_py_greater(a float, b float) to group udf_devs;

Superusers have all privileges by default.

For more information, see GRANT (p. 767) and REVOKE (p. 791).

Creating a scalar SQL UDF


A scalar SQL UDF incorporates a SQL SELECT clause that runs when the function is called and returns a
single value. The CREATE FUNCTION (p. 679) command defines the following parameters:

• (Optional) Input arguments. Each argument must have a data type.


• One return data type.
• One SQL SELECT clause. In the SELECT clause, refer to the input arguments using $1, $2, and so on,
according to the order of the arguments in the function definition.

The input and return data types can be any standard Amazon Redshift data type.

Don't include a FROM clause in your SELECT clause. Instead, include the FROM clause in the SQL
statement that calls the SQL UDF.

The SELECT clause can't include any of the following types of clauses:

• FROM
• INTO
• WHERE
• GROUP BY
• ORDER BY
• LIMIT

Scalar SQL function example


The following example creates a function that compares two numbers and returns the larger value. For
more information, see CREATE FUNCTION (p. 679).

create function f_sql_greater (float, float)


returns float
stable
as $$
select case when $1 > $2 then $1
else $2
end
$$ language sql;

The following query calls the new f_sql_greater function to query the SALES table and return either
COMMISSION or 20 percent of PRICEPAID, whichever is greater.

select f_sql_greater(commission, pricepaid*0.20) from sales;


Creating a scalar Python UDF


A scalar Python UDF incorporates a Python program that runs when the function is called and returns a
single value. The CREATE FUNCTION (p. 679) command defines the following parameters:

• (Optional) Input arguments. Each argument must have a name and a data type.
• One return data type.
• One executable Python program.

The input and return data types can be SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE
PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, or TIMESTAMP. In addition, Python UDFs can use the data
type ANYELEMENT, which Amazon Redshift automatically converts to a standard data type based on the
arguments supplied at runtime. For more information, see ANYELEMENT data type (p. 152).

When an Amazon Redshift query calls a scalar UDF, the following steps occur at runtime:

1. The function converts the input arguments to Python data types.

For a mapping of Amazon Redshift data types to Python data types, see Python UDF data
types (p. 151).
2. The function runs the Python program, passing the converted input arguments.
3. The Python code returns a single value. The data type of the return value must correspond to the
RETURNS data type specified by the function definition.
4. The function converts the Python return value to the specified Amazon Redshift data type, then
returns that value to the query.

Scalar Python UDF example


The following example creates a function that compares two numbers and returns the larger value. Note
that the indentation of the code between the double dollar signs ($$) is a Python requirement. For more
information, see CREATE FUNCTION (p. 679).

create function f_py_greater (a float, b float)


returns float
stable
as $$
if a > b:
return a
return b
$$ language plpythonu;

The following query calls the new f_py_greater function to query the SALES table and return either
COMMISSION or 20 percent of PRICEPAID, whichever is greater.

select f_py_greater (commission, pricepaid*0.20) from sales;

Python UDF data types


Python UDFs can use any standard Amazon Redshift data type for the input arguments and the
function's return value. In addition to the standard data types, UDFs support the data type ANYELEMENT,
which Amazon Redshift automatically converts to a standard data type based on the arguments supplied
at runtime. Scalar UDFs can return a data type of ANYELEMENT. For more information, see ANYELEMENT
data type (p. 152).


During execution, Amazon Redshift converts the arguments from Amazon Redshift data types to
Python data types for processing. It then converts the return value from the Python data type to the
corresponding Amazon Redshift data type. For more information about Amazon Redshift data types, see
Data types (p. 478).

The following table maps Amazon Redshift data types to Python data types.

Amazon Redshift data type                  Python data type

smallint, integer, bigint, short, long     int

decimal or numeric                         decimal

double, real                               float

boolean                                    bool

char, varchar                              string

timestamp                                  datetime

ANYELEMENT data type


ANYELEMENT is a polymorphic data type. This means that if a function is declared using ANYELEMENT
for an argument's data type, the function can accept any standard Amazon Redshift data type as input
for that argument when the function is called. The ANYELEMENT argument is set to the data type
actually passed to it when the function is called.

If a function uses multiple ANYELEMENT data types, they must all resolve to the same actual data type
when the function is called. All ANYELEMENT argument data types are set to the actual data type of the
first argument passed to an ANYELEMENT. For example, a function declared as f_equal(anyelement,
anyelement) will take any two input values, so long as they are of the same data type.

If the return value of a function is declared as ANYELEMENT, at least one input argument must be
ANYELEMENT. The actual data type for the return value is the same as the actual data type supplied for
the ANYELEMENT input argument.
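The following sketch illustrates this behavior with a hypothetical Python UDF named f_py_first_non_null. Both arguments and the return value are declared as ANYELEMENT, so the actual types are resolved from the first argument at call time.

create function f_py_first_non_null (a anyelement, b anyelement)
  returns anyelement
stable
as $$
  # return the first argument unless it is null
  return a if a is not None else b
$$ language plpythonu;

select f_py_first_non_null(5, 10), f_py_first_non_null('abc', 'xyz');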

Python language support for UDFs


You can create a custom UDF based on the Python programming language. The Python 2.7 standard
library is available for use in UDFs, with the exception of the following modules:

• ScrolledText
• Tix
• Tkinter


• tk
• turtle
• smtpd

In addition to the Python Standard Library, the following modules are part of the Amazon Redshift
implementation:

• numpy 1.8.2
• pandas 0.14.1
• python-dateutil 2.2
• pytz 2014.7
• scipy 0.12.1
• six 1.3.0
• wsgiref 0.1.2

You can also import your own custom Python modules and make them available for use in UDFs by
executing a CREATE LIBRARY (p. 684) command. For more information, see Importing custom Python
library modules (p. 153).
Important
Amazon Redshift blocks all network access and write access to the file system through UDFs.

Importing custom Python library modules


You define scalar functions using Python language syntax. You can use the Python Standard Library
modules and Amazon Redshift preinstalled modules. You can also create your own custom Python library
modules and import the libraries into your clusters, or use existing libraries from Python or third parties.

You cannot create a library that contains a module with the same name as a Python Standard Library
module or an Amazon Redshift preinstalled Python module. If an existing user-installed library uses the
same Python package as a library you create, you must drop the existing library before installing the new
library.

You must be a superuser or have USAGE ON LANGUAGE plpythonu privilege to install custom libraries;
however, any user with sufficient privileges to create functions can use the installed libraries. You can
query the PG_LIBRARY (p. 1471) system catalog to view information about the libraries installed on your
cluster.
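For example, a minimal check is to select all rows from the catalog.

select * from pg_library;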

To import a custom Python module into your cluster


This section provides an example of importing a custom Python module into your cluster. To perform the
steps in this section, you must have an Amazon S3 bucket, where you upload the library package. You
then install the package in your cluster. For more information about creating buckets, go to Creating a
bucket in the Amazon Simple Storage Service User Guide.

In this example, let's suppose that you create UDFs to work with positions and distances in your data.
Connect to your Amazon Redshift cluster from a SQL client tool, and run the following commands to
create the functions.

CREATE FUNCTION f_distance (x1 float, y1 float, x2 float, y2 float) RETURNS float IMMUTABLE
as $$
def distance(x1, y1, x2, y2):
import math
return math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)

return distance(x1, y1, x2, y2)


$$ LANGUAGE plpythonu;

CREATE FUNCTION f_within_range (x1 float, y1 float, x2 float, y2 float) RETURNS bool
IMMUTABLE as $$
def distance(x1, y1, x2, y2):
import math
return math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)

return distance(x1, y1, x2, y2) < 20


$$ LANGUAGE plpythonu;

Note that a few lines of code are duplicated in the previous functions. This duplication is necessary
because a UDF cannot reference the contents of another UDF, and both functions require the same
functionality. However, instead of duplicating code in multiple functions, you can create a custom library
and configure your functions to use it.

To do so, first create the library package by following these steps:

1. Create a folder named geometry. This folder is the top level package of the library.
2. In the geometry folder, create a file named __init__.py. Note that the file name contains two
double underscore characters. This file indicates to Python that the package can be initialized.
3. Also in the geometry folder, create a folder named trig. This folder is the subpackage of the library.
4. In the trig folder, create another file named __init__.py and a file named line.py. In this folder,
__init__.py indicates to Python that the subpackage can be initialized and that line.py is the file
that contains library code.

Your folder and file structure should be the same as the following:

geometry/
__init__.py
trig/
__init__.py
line.py

For more information about package structure, go to Modules in the Python tutorial on the Python
website.
5. The following code contains a class and member functions for the library. Copy and paste it into
line.py.

class LineSegment:
def __init__(self, x1, y1, x2, y2):
self.x1 = x1
self.y1 = y1
self.x2 = x2
self.y2 = y2
def angle(self):
import math
return math.atan2(self.y2 - self.y1, self.x2 - self.x1)
def distance(self):
import math
return math.sqrt((self.y2 - self.y1) ** 2 + (self.x2 - self.x1) ** 2)

After you have created the package, do the following to prepare the package and upload it to Amazon
S3.

1. Compress the contents of the geometry folder into a .zip file named geometry.zip. Do not include the
geometry folder itself; only include the contents of the folder as shown following:


geometry.zip
__init__.py
trig/
__init__.py
line.py

2. Upload geometry.zip to your Amazon S3 bucket.


Important
If the Amazon S3 bucket does not reside in the same region as your Amazon Redshift cluster,
you must use the REGION option to specify the region in which the data is located. For more
information, see CREATE LIBRARY (p. 684).
3. From your SQL client tool, run the following command to install the library. Replace <bucket_name>
with the name of your bucket, and replace <access key id> and <secret key> with an access key
and secret access key from your AWS Identity and Access Management (IAM) user credentials.

CREATE LIBRARY geometry LANGUAGE plpythonu FROM 's3://<bucket_name>/geometry.zip'


CREDENTIALS 'aws_access_key_id=<access key id>;aws_secret_access_key=<secret key>';

After you install the library in your cluster, you need to configure your functions to use the library. To do
this, run the following commands.

CREATE OR REPLACE FUNCTION f_distance (x1 float, y1 float, x2 float, y2 float) RETURNS
float IMMUTABLE as $$
from trig.line import LineSegment

return LineSegment(x1, y1, x2, y2).distance()


$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION f_within_range (x1 float, y1 float, x2 float, y2 float) RETURNS
bool IMMUTABLE as $$
from trig.line import LineSegment

return LineSegment(x1, y1, x2, y2).distance() < 20


$$ LANGUAGE plpythonu;

In the preceding commands, importing trig.line eliminates the duplicated code from the original
functions in this section. You can reuse the functionality provided by this library in multiple UDFs. Note
that to import the module, you only need to specify the path to the subpackage and module name
(trig/line).

UDF constraints
Within the constraints listed in this topic, you can use UDFs anywhere you use the Amazon Redshift built-
in scalar functions. For more information, see SQL functions reference (p. 878).

Amazon Redshift Python UDFs have the following constraints:

• Python UDFs cannot access the network or read or write to the file system.
• The total size of user-installed Python libraries cannot exceed 100 MB.
• The number of Python UDFs that can run concurrently per cluster is limited to one-fourth of the
total concurrency level for the cluster. For example, if the cluster is configured with a concurrency of
15, a maximum of three UDFs can run concurrently. After the limit is reached, UDFs are queued for
execution within workload management queues. SQL UDFs don't have a concurrency limit. For more
information, see Implementing workload management (p. 411).
• When using Python UDFs, Amazon Redshift doesn't support the SUPER and HLLSKETCH data types.


Creating a scalar Lambda UDF


Amazon Redshift can use custom functions defined in AWS Lambda as part of SQL queries. You can write
scalar Lambda UDFs in any programming languages supported by Lambda, such as Java, Go, PowerShell,
Node.js, C#, Python, and Ruby. Or you can use a custom runtime.

Lambda UDFs are defined and managed in Lambda, and you can control the access privileges to invoke
these UDFs in Amazon Redshift. You can invoke multiple Lambda functions in the same query or invoke
the same function multiple times.

Use Lambda UDFs in any clauses of the SQL statements where scalar functions are supported. You can
also use Lambda UDFs in any SQL statement such as SELECT, UPDATE, INSERT, or DELETE.
Note
Using Lambda UDFs can incur additional charges from the Lambda service. Whether it does
so depends on factors such as the number of Lambda requests (UDF invocations) and the
total duration of the Lambda program execution. However, there is no additional charge to
use Lambda UDFs in Amazon Redshift. For information about AWS Lambda pricing, see AWS
Lambda Pricing.
The number of Lambda requests varies depending on the specific SQL statement clause where
the Lambda UDF is used. For example, suppose the function is used in a WHERE clause such as
the following.
SELECT a, b FROM t1 WHERE lambda_multiply(a, b) = 64;
SELECT a, b FROM t1 WHERE a*b = lambda_multiply(2, 32);
In the first SELECT statement, Amazon Redshift invokes the Lambda function for each row it
evaluates; in the second, it invokes the Lambda function only once.
However, using a UDF in the projection part of the query might only invoke the Lambda
function once for every qualified or aggregated row in the result set. You can configure batching
of multiple invocations of your Lambda function to improve performance and lower costs.

Registering a Lambda UDF


The CREATE EXTERNAL FUNCTION (p. 648) command defines the following parameters:

• (Optional) A list of arguments with data type.


• One return data type.
• One function name of the external function that is called by Amazon Redshift.
• One IAM role that the Amazon Redshift cluster is authorized to assume and call to Lambda.
• One Lambda function name that the Lambda UDF invokes.

For information about CREATE EXTERNAL FUNCTION, see CREATE EXTERNAL FUNCTION (p. 648).

The input and return data types for this function can be any standard Amazon Redshift data type.

Amazon Redshift ensures that the external function can send and receive batched arguments and results.
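As an illustration only, the following sketch registers a hypothetical Lambda function named lambda_multiply as a scalar Lambda UDF named exfunc_multiply. The IAM role ARN is a placeholder; it must identify a role that is associated with the cluster and allowed to invoke the Lambda function.

CREATE EXTERNAL FUNCTION exfunc_multiply (int, int)
RETURNS int
VOLATILE
LAMBDA 'lambda_multiply'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyLambdaRole';

SELECT exfunc_multiply(2, 32);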

Managing Lambda UDF security and privileges


To create a Lambda UDF, make sure that you have permission for usage on the LANGUAGE EXFUNC.
You must explicitly grant or revoke USAGE ON LANGUAGE EXFUNC for specific users, groups, or PUBLIC.

The following example grants usage on EXFUNC to PUBLIC.

grant usage on language exfunc to PUBLIC;


The following example revokes usage on exfunc from PUBLIC and then grants usage to the user group
lambda_udf_devs.

revoke usage on language exfunc from PUBLIC;


grant usage on language exfunc to group lambda_udf_devs;

To run a Lambda UDF, make sure that you have permission for each function called. By default,
permission to run new Lambda UDFs is granted to PUBLIC. To restrict usage, revoke this permission from
PUBLIC for the function. Then, grant the privilege to specific users or groups.

The following example revokes execution on the function exfunc_sum from PUBLIC. Then, it grants
usage to the user group lambda_udf_devs.

revoke execute on function exfunc_sum(int, int) from PUBLIC;


grant execute on function exfunc_sum(int, int) to group lambda_udf_devs;

Superusers have all privileges by default.

For more information about granting and revoking privileges, see GRANT (p. 767) and
REVOKE (p. 791).

Configuring the authorization parameter for Lambda UDFs
The CREATE EXTERNAL FUNCTION command requires authorization to invoke Lambda functions in AWS
Lambda. To start authorization, specify an AWS Identity and Access Management (IAM) role when you
run the CREATE EXTERNAL FUNCTION command. For more information about IAM roles, see IAM roles in
the IAM User Guide.

If an IAM role with permissions to invoke Lambda functions is already attached to your cluster,
you can substitute that role's Amazon Resource Name (ARN) in the IAM_ROLE parameter of the
command. The following sections describe the steps for using an IAM role in the CREATE EXTERNAL
FUNCTION command.

Creating an IAM role for Lambda


The IAM role requires permission to invoke Lambda functions. While creating the IAM role, provide the
permission in one of the following ways:

• Attach the AWSLambdaRole policy on the Attach permissions policy page while creating an IAM role.
The AWSLambdaRole policy grants permissions to invoke Lambda functions, which is the minimal
requirement. For more information and other policies, see Identity-based IAM policies for AWS Lambda
in the AWS Lambda Developer Guide.
• Create your own custom policy to attach to your IAM role with the lambda:InvokeFunction
permission of either all resources or a particular Lambda function with the ARN of that function. For
more information on how to create a policy, see Creating IAM policies in the IAM User Guide.

The following example policy enables invoking Lambda on a particular Lambda function.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Invoke",
"Effect": "Allow",
"Action": [


"lambda:InvokeFunction"
],
"Resource": "arn:aws:lambda:us-west-2:123456789012:function:my-function"
}
]
}

For more information on resources for Lambda functions, see Resources and conditions for Lambda
actions in the IAM API Reference.

After creating your custom policy with the required permissions, you can attach your policy to the IAM
role on the Attach permissions policy page while creating an IAM role.

For steps to create an IAM role, see Authorizing Amazon Redshift to access other AWS services on your
behalf in the Amazon Redshift Cluster Management Guide.

If you don't want to create a new IAM role, you can add the permissions mentioned previously to your
existing IAM role.

Associating an IAM role with the cluster


Attach the IAM role to your cluster. You can add a role to a cluster or view the roles associated with
a cluster by using the Amazon Redshift Management Console, CLI, or API. For more information, see
Associating an IAM Role With a Cluster in the Amazon Redshift Cluster Management Guide.

Including the IAM role in the command


Include the IAM role ARN in the CREATE EXTERNAL FUNCTION command. When you create an IAM role,
IAM returns an Amazon Resource Name (ARN) for the role. To specify an IAM role, provide the role ARN
with the IAM_ROLE parameter. The following shows the syntax for the IAM_ROLE parameter.

IAM_ROLE 'arn:aws:iam::aws-account-id:role/role-name'

To invoke Lambda functions which reside in other accounts within the same Region, see Chaining IAM
roles in Amazon Redshift.

Using the JSON interface between Amazon Redshift and AWS Lambda
Amazon Redshift uses a common interface for all Lambda functions that it communicates with.

The following are the input fields that the designated Lambda function can expect in the JSON payload.

request_id - A universally unique identifier (UUID) that uniquely identifies each invoke request. Value range: a valid UUID.

cluster - The full Amazon Resource Name (ARN) of the cluster. Value range: a valid cluster ARN.

user - The name of the user that makes the call. Value range: a valid user name.

database - The name of the database that the query is running on. Value range: a valid database name.

external_function - The fully qualified name of the external function that makes the call. Value range: a valid fully qualified function name.

query_id - The query ID of the query that is making the call. Value range: a valid query ID.

num_records - The number of arguments in the payload. Value range: a value of 1 - 2^64.

arguments - The data payload in the specified format. The data must be a JSON array. Each element is a record, which is itself an array if the number of arguments is larger than 1. By using an array, Amazon Redshift preserves the order of the records in the payload.

The order of the JSON array determines the order of batch processing. The Lambda function must
process the arguments iteratively and produce the exact number of records. The following is an example
of a payload.

{
"request_id" : "23FF1F97-F28A-44AA-AB67-266ED976BF40",
"cluster" : 'arn:aws:redshift:xxxx',
"user" : "adminuser",
"database" : "db1",
"external_function": "public.foo",
"query_id" : 5678234,
"num_records" : 4,
"arguments" : [
[ 1, 2 ],
[ 3, null],
null,
[ 4, 6]
]
}

The return output of the Lambda function contains the following fields.

success - The indication of success or failure for the function. Value range: a value of "true" or "false".

error_msg - The error message if the success value is "false" (if the function fails); otherwise, this field is ignored. Value range: a valid message.

num_records - The number of records in the payload. Value range: a value of 1 - 2^64.

results - The results of the call in the specified format. Value range: N/A.

The following is an example of the Lambda function output.

{
"success": true, // true indicates the call succeeded
"error_msg" : "my function isn't working", // shall only exist when success != true
"num_records": 4, // number of records in this payload
"results" : [
1,
4,
null,
7
]
}

When you call Lambda functions from SQL queries, Amazon Redshift ensures the security of the
connection with the following considerations:

• GRANT and REVOKE permissions. For more information about UDF security and privileges, see UDF
security and privileges (p. 149).
• Amazon Redshift only submits the minimum set of data to the designated Lambda function.
• Amazon Redshift only calls the designated Lambda function with the designated IAM role.

Naming UDFs
You can avoid potential conflicts and unexpected results by considering your UDF naming conventions
before implementation. Because function names can be overloaded, they can collide with existing and
future Amazon Redshift function names. This topic discusses overloading and presents a strategy for
avoiding conflict.

Overloading function names


A function is identified by its name and signature, which is the number of input arguments and the
data types of the arguments. Two functions in the same schema can have the same name if they have
different signatures. In other words, the function names can be overloaded.

When you run a query, the query engine determines which function to call based on the number
of arguments you provide and the data types of the arguments. You can use overloading to
simulate functions with a variable number of arguments, up to the limit allowed by the CREATE
FUNCTION (p. 679) command.
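For example, the following sketch (using hypothetical function names) creates two SQL UDFs that share the name f_describe but have different signatures. The query engine chooses between them based on the argument type.

create function f_describe (int)
returns varchar
stable
as $$ select 'integer: ' || $1::varchar $$ language sql;

create function f_describe (varchar)
returns varchar
stable
as $$ select 'varchar: ' || $1 $$ language sql;

select f_describe(5), f_describe('five'::varchar);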

Preventing UDF naming conflicts


We recommend that you name all UDFs using the prefix f_. Amazon Redshift reserves the f_ prefix
exclusively for UDFs and by prefixing your UDF names with f_, you ensure that your UDF name won't
conflict with any existing or future Amazon Redshift built-in SQL function names. For example, by
naming a new UDF f_sum, you avoid conflict with the Amazon Redshift SUM function. Similarly, if you


name a new function f_fibonacci, you avoid conflict if Amazon Redshift adds a function named
FIBONACCI in a future release.

You can create a UDF with the same name and signature as an existing Amazon Redshift built-in SQL
function without the function name being overloaded if the UDF and the built-in function exist in
different schemas. Because built-in functions exist in the system catalog schema, pg_catalog, you
can create a UDF with the same name in another schema, such as public or a user-defined schema. In
some cases, you might call a function that is not explicitly qualified with a schema name. If so, Amazon
Redshift searches the pg_catalog schema first by default. Thus, a built-in function runs before a new UDF
with the same name.

You can change this behavior by setting the search path to place pg_catalog at the end. If you do so,
your UDFs take precedence over built-in functions, but the practice can cause unexpected results.
Adopting a unique naming strategy, such as using the reserved prefix f_, is a more reliable practice. For
more information, see SET (p. 837) and search_path (p. 1490).
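For example, a session-level sketch of that setting might look like the following; the schema list shown is illustrative.

set search_path to '$user', public, pg_catalog;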

Logging errors and warnings in UDFs


You can use the Python logging module to create user-defined error and warning messages in your UDFs.
Following query execution, you can query the SVL_UDF_LOG (p. 1412) system view to retrieve logged
messages.
Note
UDF logging consumes cluster resources and might affect system performance. We recommend
implementing logging only for development and troubleshooting.

During query execution, the log handler writes messages to the SVL_UDF_LOG system view, along with
the corresponding function name, node, and slice. The log handler writes one row to the SVL_UDF_LOG
per message, per slice. Messages are truncated to 4096 bytes. The UDF log is limited to 500 rows per
slice. When the log is full, the log handler discards older messages and adds a warning message to
SVL_UDF_LOG.
Note
The Amazon Redshift UDF log handler escapes newlines ( \n ), pipe ( | ) characters, and
backslash ( \ ) characters with a backslash ( \ ) character.

By default, the UDF log level is set to WARNING. Messages with a log level of WARNING, ERROR, and
CRITICAL are logged. Messages with lower severity INFO, DEBUG, and NOTSET are ignored. To set the
UDF log level, use the Python logger method. For example, the following sets the log level to INFO.

logger.setLevel(logging.INFO)

For more information about using the Python logging module, see Logging facility for Python in the
Python documentation.

The following example creates a function named f_pyerror that imports the Python logging module,
instantiates the logger, and logs an error.

CREATE OR REPLACE FUNCTION f_pyerror()


RETURNS INTEGER
VOLATILE AS
$$
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.info('Your info message here')
return 0
$$ language plpythonu;


The following example queries SVL_UDF_LOG to view the message logged in the previous example.

select funcname, node, slice, trim(message) as message


from svl_udf_log;

funcname | node | slice | message
------------+------+-------+------------------------
f_pyerror | 1 | 1 | Your info message here

Example uses of user-defined functions (UDFs)


You can use user-defined functions to solve business problems by integrating Amazon Redshift with
other components. Following are some examples of how others have used UDFs for their use cases:

• Accessing external components using Amazon Redshift Lambda UDFs – describes how Amazon
Redshift Lambda UDFs work and walks through creating a Lambda UDF.
• Translate and analyze text using SQL functions with Amazon Redshift, Amazon Translate, and Amazon
Comprehend – provides prebuilt Amazon Redshift Lambda UDFs that you can install with a few clicks
to translate, redact, and analyze text fields.
• Access Amazon Location Service from Amazon Redshift – describes how to use Amazon Redshift
Lambda UDFs to integrate with Amazon Location Service.
• Data Tokenization with Amazon Redshift and Protegrity – describes how to integrate Amazon Redshift
Lambda UDFs with the Protegrity Serverless product.
• Amazon Redshift UDFs – a collection of Amazon Redshift SQL, Lambda, and Python UDFs.


Creating stored procedures in Amazon Redshift
You can define an Amazon Redshift stored procedure using the PostgreSQL procedural language PL/
pgSQL to perform a set of SQL queries and logical operations. The procedure is stored in the database
and is available for any user with sufficient privileges to run.

Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL)
and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn't need
to return a value. You can use procedural language, including looping and conditional expressions, to
control logical flow.

For details about SQL commands to create and manage stored procedures, see the following command
topics:

• CREATE PROCEDURE (p. 702)


• ALTER PROCEDURE (p. 540)
• DROP PROCEDURE (p. 751)
• SHOW PROCEDURE (p. 846)
• CALL (p. 569)
• GRANT (p. 767)
• REVOKE (p. 791)
• ALTER DEFAULT PRIVILEGES (p. 534)

Topics
• Overview of stored procedures in Amazon Redshift (p. 163)
• PL/pgSQL language reference (p. 176)

Overview of stored procedures in Amazon Redshift


Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and
business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round
trips between your applications and the database.

For fine-grained access control, you can create stored procedures to perform functions without giving
a user access to the underlying tables. For example, only the owner or a superuser can truncate a table,
and a user needs write permission to insert data into a table. Instead of granting a user permissions on
the underlying tables, you can create a stored procedure that performs the task. You then give the user
permission to run the stored procedure.

A stored procedure with the DEFINER security attribute runs with the privileges of the stored procedure's
owner. By default, a stored procedure has INVOKER security, which means the procedure uses the
permissions of the user that calls the procedure.
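For example, the following sketch (the procedure, table, and user names are hypothetical) wraps a TRUNCATE in a SECURITY DEFINER procedure and grants only EXECUTE to a user, so that the user can empty the staging table without owning it.

CREATE OR REPLACE PROCEDURE sp_truncate_staging()
AS $$
BEGIN
  TRUNCATE staging_table;
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;

GRANT EXECUTE ON PROCEDURE sp_truncate_staging() TO dbuser;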


To create a stored procedure, use the CREATE PROCEDURE (p. 702) command. To run a procedure, use
the CALL (p. 569) command. Examples follow later in this section.
Note
Some clients might throw the following error when creating an Amazon Redshift stored
procedure.

ERROR: 42601: [Amazon](500310) unterminated dollar-quoted string at or near "$$

This error occurs due to the inability of the client to correctly parse the CREATE PROCEDURE
statement with semicolons delimiting statements and with dollar sign ($) quoting. This results in
only a part of the statement sent to the Amazon Redshift server. You can often work around this
error by using the Run as batch or Execute selected option of the client.
For example, when using an Aginity client, use the Run entire script as batch option.
When you use SQL Workbench/J, we recommend version 124. When you use SQL Workbench/J
version 125, consider specifying an alternate delimiter as a workaround.
CREATE PROCEDURE contains SQL statements delimited with a semicolon (;). Defining an
alternate delimiter such as a slash (/) and placing it at the end of the CREATE PROCEDURE
statement sends the statement to the Amazon Redshift server for processing. Following is an
example.

CREATE OR REPLACE PROCEDURE test()


AS $$
BEGIN
SELECT 1 a;
END;
$$
LANGUAGE plpgsql
;
/

For more information, see Alternate delimiter in the SQL Workbench/J documentation. Or use a
client with better support for parsing CREATE PROCEDURE statements, such as the query editor
in the Amazon Redshift console or TablePlus.

Topics
• Naming stored procedures (p. 165)
• Security and privileges for stored procedures (p. 166)
• Returning a result set (p. 167)
• Managing transactions (p. 168)
• Trapping errors (p. 173)
• Logging stored procedures (p. 174)
• Limits and differences for stored procedure support (p. 175)

The following example shows a procedure with no output arguments. By default, arguments are input
(IN) arguments.

CREATE OR REPLACE PROCEDURE test_sp1(f1 int, f2 varchar)


AS $$
BEGIN
RAISE INFO 'f1 = %, f2 = %', f1, f2;
END;
$$ LANGUAGE plpgsql;

call test_sp1(5, 'abc');


INFO: f1 = 5, f2 = abc
CALL

The following example shows a procedure with output arguments. Arguments are input (IN), input and
output (INOUT), and output (OUT).

CREATE OR REPLACE PROCEDURE test_sp2(f1 IN int, f2 INOUT varchar(256), out_var OUT


varchar(256))
AS $$
DECLARE
loop_var int;
BEGIN
IF f1 is null OR f2 is null THEN
RAISE EXCEPTION 'input cannot be null';
END IF;
DROP TABLE if exists my_etl;
CREATE TEMP TABLE my_etl(a int, b varchar);
FOR loop_var IN 1..f1 LOOP
insert into my_etl values (loop_var, f2);
f2 := f2 || '+' || f2;
END LOOP;
SELECT INTO out_var count(*) from my_etl;
END;
$$ LANGUAGE plpgsql;

call test_sp2(2,'2019');

f2 | column2
---------------------+---------
2019+2019+2019+2019 | 2
(1 row)

Naming stored procedures


If you define a procedure with the same name and different input argument data types, or signature, you
create a new procedure. In other words, the procedure name is overloaded. For more information, see
Overloading procedure names (p. 165). Amazon Redshift doesn't enable procedure overloading based
on output arguments. In other words, you can't have two procedures with the same name and input
argument data types but different output argument types.

The owner or a superuser can replace the body of a stored procedure with a new one with the same
signature. To change the signature or return types of a stored procedure, drop the stored procedure and
recreate it. For more information, see DROP PROCEDURE (p. 751) and CREATE PROCEDURE (p. 702).

You can avoid potential conflicts and unexpected results by considering your naming conventions for
stored procedures before implementing them. Because you can overload procedure names, they can
collide with existing and future Amazon Redshift procedure names.

Overloading procedure names


A procedure is identified by its name and signature, which is the number of input arguments and the
data types of the arguments. Two procedures in the same schema can have the same name if they have
different signatures. In other words, you can overload procedure names.

When you run a procedure, the query engine determines which procedure to call based on the number of
arguments that you provide and the data types of the arguments. You can use overloading to simulate

165
Amazon Redshift Database Developer Guide
Security and privileges

procedures with a variable number of arguments, up to the limit allowed by the CREATE PROCEDURE
command. For more information, see CREATE PROCEDURE (p. 702).
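As an illustration (the procedure name is hypothetical), the following sketch overloads sp_log_event with one-argument and two-argument signatures.

CREATE OR REPLACE PROCEDURE sp_log_event(msg varchar)
AS $$
BEGIN
  RAISE INFO 'event: %', msg;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE PROCEDURE sp_log_event(msg varchar, severity int)
AS $$
BEGIN
  RAISE INFO 'event: % (severity %)', msg, severity;
END;
$$ LANGUAGE plpgsql;

CALL sp_log_event('load started');
CALL sp_log_event('load failed', 3);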

Preventing naming conflicts


We recommend that you name all procedures using the prefix sp_. Amazon Redshift reserves the sp_
prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that
your procedure name won't conflict with any existing or future Amazon Redshift procedure names.

Security and privileges for stored procedures


By default, all users have permission to create a procedure. To create a procedure, you must have USAGE
permission on the language PL/pgSQL, which is granted to PUBLIC by default. Only superusers and
owners have the permission to call a procedure by default. Superusers can run REVOKE USAGE on PL/
pgSQL from a user if they want to prevent the user from creating a stored procedure.

To call a procedure, you must be granted EXECUTE permission on the procedure. By default, EXECUTE
permission for new procedures is granted to the procedure owner and superusers. For more information,
see GRANT (p. 767).

The user creating a procedure is the owner by default. The owner has CREATE, DROP, and EXECUTE
privileges on the procedure by default. Superusers have all privileges.

The SECURITY attribute controls a procedure's privileges to access database objects. When you create
a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify
SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify
SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the
default.

Because a SECURITY DEFINER procedure runs with the privileges of the user that owns it, take care to
ensure that the procedure can't be misused. To ensure that SECURITY DEFINER procedures can't be
misused, do the following:

• Grant EXECUTE on SECURITY DEFINER procedures to specific users, and not to PUBLIC.
• Qualify all database objects that the procedure needs to access with the schema names. For example,
use myschema.mytable instead of just mytable.
• If you can't qualify an object name by its schema, set search_path when creating the procedure by
using the SET option. Set search_path to exclude any schemas that are writable by untrusted users.
This approach prevents any callers of this procedure from creating objects (for example, tables or
views) that mask objects intended to be used by the procedure. For more information about the SET
option, see CREATE PROCEDURE (p. 702).

The following example sets search_path to admin to ensure that the user_creds table is accessed
from the admin schema and not from public or any other schema in the caller's search_path.

CREATE OR REPLACE PROCEDURE sp_get_credentials(userid int, o_creds OUT varchar)
AS $$
BEGIN
SELECT creds INTO o_creds
FROM user_creds
WHERE user_id = $1;
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER
-- Set a secure search_path
SET search_path = admin;


Returning a result set


You can return a result set using a cursor or a temp table.

Returning a cursor
To return a cursor, create a procedure with an INOUT argument defined with a refcursor data type.
When you call the procedure, give the cursor a name. You can then fetch the results from the cursor by
that name.

The following example creates a procedure named get_result_set with an INOUT argument named
rs_out using the refcursor data type. The procedure opens the cursor using a SELECT statement.

CREATE OR REPLACE PROCEDURE get_result_set (param IN integer, rs_out INOUT refcursor)
AS $$
BEGIN
OPEN rs_out FOR SELECT * FROM fact_tbl where id >= param;
END;
$$ LANGUAGE plpgsql;

The following CALL command opens the cursor with the name mycursor. Use cursors only within
transactions.

BEGIN;
CALL get_result_set(1, 'mycursor');

After the cursor is opened, you can fetch from the cursor, as the following example shows.

FETCH ALL FROM mycursor;

id | secondary_id | name
-------+--------------+---------
1 | 1 | Joe
1 | 2 | Ed
2 | 1 | Mary
1 | 3 | Mike
(4 rows)

In the end, the transaction is either committed or rolled back.

COMMIT;

A cursor returned by a stored procedure is subject to the same constraints and performance
considerations as described in DECLARE CURSOR. For more information, see Cursor constraints (p. 740).

The following example shows the calling of the get_result_set stored procedure using a
refcursor data type from JDBC. The literal 'mycursor' (the name of the cursor) is passed to the
prepareStatement. Then the results are fetched from the ResultSet.

static void refcursor_example(Connection conn) throws SQLException {
    conn.setAutoCommit(false);
    PreparedStatement proc = conn.prepareStatement("CALL get_result_set(1, 'mycursor')");
    proc.execute();
    // Create a Statement on the same connection to fetch from the named cursor
    Statement statement = conn.createStatement();
    ResultSet rs = statement.executeQuery("fetch all from mycursor");
    while (rs.next()) {
        int n = rs.getInt(1);
        System.out.println("n " + n);
    }
}

Using a temp table


To return results, you can return a handle to a temp table containing result rows. The client can supply a
name as a parameter to the stored procedure. Inside the stored procedure, dynamic SQL can be used to
operate on the temp table. The following shows an example.

CREATE PROCEDURE get_result_set(param IN integer, tmp_name INOUT varchar(256)) as $$
DECLARE
row record;
BEGIN
EXECUTE 'drop table if exists ' || tmp_name;
EXECUTE 'create temp table ' || tmp_name || ' as select * from fact_tbl where id >= ' ||
param;
END;
$$ LANGUAGE plpgsql;

CALL get_result_set(2, 'myresult');


tmp_name
-----------
myresult
(1 row)

SELECT * from myresult;


id | secondary_id | name
----+--------------+------
1 | 1 | Joe
2 | 1 | Mary
1 | 2 | Ed
1 | 3 | Mike
(4 rows)

Managing transactions
The default automatic commit behavior causes each SQL command that runs separately to commit
individually. A call to a stored procedure is treated as a single SQL command. The SQL statements inside
a procedure behave as if they are in a transaction block that implicitly begins when the call starts and
ends when the call finishes. A nested call to another procedure is treated like any other SQL statement
and operates within the context of the same transaction as the caller. For more information about
automatic commit behavior, see Serializable isolation (p. 117).

However, suppose that you call a stored procedure from within a user-specified transaction block
(defined by BEGIN...COMMIT). In this case, all statements in the stored procedure run in the context of
the user-specified transaction. The procedure doesn't commit implicitly on exit. The caller controls the
procedure commit or rollback.

If any error is encountered while running a stored procedure, all changes made in the current transaction
are rolled back.

You can use the following transaction control statements in a stored procedure:

• COMMIT – commits all work done in the current transaction and implicitly begins a new transaction.
For more information, see COMMIT (p. 575).
• ROLLBACK – rolls back the work done in the current transaction and implicitly begins a new
transaction. For more information, see ROLLBACK (p. 800).

TRUNCATE is another statement that you can issue from within a stored procedure and influences
transaction management. In Amazon Redshift, TRUNCATE issues a commit implicitly. This behavior stays
the same in the context of stored procedures. When a TRUNCATE statement is issued from within a
stored procedure, it commits the current transaction and begins a new one. For more information, see
TRUNCATE (p. 849).

All statements that follow a COMMIT, ROLLBACK, or TRUNCATE statement run in the context of a new
transaction. They do so until a COMMIT, ROLLBACK, or TRUNCATE statement is encountered or the
stored procedure exits.

When you use a COMMIT, ROLLBACK, or TRUNCATE statement from within a stored procedure, the
following constraints apply:

• If the stored procedure is called from within a transaction block, it can't issue a COMMIT, ROLLBACK, or
TRUNCATE statement. This restriction applies within the stored procedure's own body and within any
nested procedure call.
• If the stored procedure is created with SET config options, it can't issue a COMMIT, ROLLBACK, or
TRUNCATE statement. This restriction applies within the stored procedure's own body and within any
nested procedure call.
• Any cursor that is open (explicitly or implicitly) is closed automatically when a COMMIT, ROLLBACK,
or TRUNCATE statement is processed. For constraints on explicit and implicit cursors, see Limits and
differences for stored procedure support (p. 175).

Additionally, you can't run COMMIT or ROLLBACK using dynamic SQL. However, you can run TRUNCATE
using dynamic SQL. For more information, see Dynamic SQL (p. 182).
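
For example, the following sketch truncates a table whose name is supplied at run time by using
EXECUTE. The procedure and table names are hypothetical. As with a static TRUNCATE, the dynamic
statement issues an implicit commit.

CREATE OR REPLACE PROCEDURE sp_truncate_by_name(tbl varchar)
LANGUAGE plpgsql
AS $$
BEGIN
  -- TRUNCATE is allowed through dynamic SQL; COMMIT and ROLLBACK are not
  EXECUTE 'TRUNCATE ' || tbl;
END;
$$;

CALL sp_truncate_by_name('staging_events');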

When working with stored procedures, consider that the BEGIN and END statements in PL/pgSQL are
only for grouping. They don't start or end a transaction. For more information, see Block (p. 177).

The following example demonstrates transaction behavior when calling a stored procedure from within
an explicit transaction block. The two insert statements issued from outside the stored procedure and
the one from within it are all part of the same transaction (3382). The transaction is committed when the
user issues the explicit commit.

CREATE OR REPLACE PROCEDURE sp_insert_table_a(a int) LANGUAGE plpgsql
AS $$
BEGIN
INSERT INTO test_table_a values (a);
END;
$$;

Begin;
insert into test_table_a values (1);
Call sp_insert_table_a(2);
insert into test_table_a values (3);
Commit;

select userid, xid, pid, type, trim(text) as stmt_text
from svl_statementtext where pid = pg_backend_pid() order by xid , starttime , sequence;

userid | xid  | pid | type    | stmt_text
--------+------+-----+---------+----------------------------------------
103 | 3382 | 599 | UTILITY | Begin;
103 | 3382 | 599 | QUERY | insert into test_table_a values (1);
103 | 3382 | 599 | UTILITY | Call sp_insert_table_a(2);
103 | 3382 | 599 | QUERY | INSERT INTO test_table_a values ( $1 )
103 | 3382 | 599 | QUERY | insert into test_table_a values (3);
103 | 3382 | 599 | UTILITY | COMMIT

In contrast, take an example when the same statements are issued from outside of an explicit transaction
block and the session has autocommit set to ON. In this case, each statement runs in its own transaction.


insert into test_table_a values (1);
Call sp_insert_table_a(2);
insert into test_table_a values (3);

select userid, xid, pid, type, trim(text) as stmt_text
from svl_statementtext where pid = pg_backend_pid() order by xid , starttime , sequence;

userid | xid  | pid | type    | stmt_text
--------+------+-----+---------+------------------------------------------------
103 | 3388 | 599 | QUERY | insert into test_table_a values (1);
103 | 3388 | 599 | UTILITY | COMMIT
103 | 3389 | 599 | UTILITY | Call sp_insert_table_a(2);
103 | 3389 | 599 | QUERY | INSERT INTO test_table_a values ( $1 )
103 | 3389 | 599 | UTILITY | COMMIT
103 | 3390 | 599 | QUERY | insert into test_table_a values (3);
103 | 3390 | 599 | UTILITY | COMMIT

The following example issues a TRUNCATE statement after inserting into test_table_a. The
TRUNCATE statement issues an implicit commit that commits the current transaction (3335) and starts a
new one (3336). The new transaction is committed when the procedure exits.

CREATE OR REPLACE PROCEDURE sp_truncate_proc(a int, b int) LANGUAGE plpgsql
AS $$
BEGIN
INSERT INTO test_table_a values (a);
TRUNCATE test_table_b;
INSERT INTO test_table_b values (b);
END;
$$;

Call sp_truncate_proc(1,2);

select userid, xid, pid, type, trim(text) as stmt_text
from svl_statementtext where pid = pg_backend_pid() order by xid , starttime , sequence;

userid | xid  | pid   | type    | stmt_text
--------+------+-------+---------+------------------------------------------------
103 | 3335 | 23636 | UTILITY | Call sp_truncate_proc(1,2);
103 | 3335 | 23636 | QUERY | INSERT INTO test_table_a values ( $1 )
103 | 3335 | 23636 | UTILITY | TRUNCATE test_table_b
103 | 3335 | 23636 | UTILITY | COMMIT
103 | 3336 | 23636 | QUERY | INSERT INTO test_table_b values ( $1 )
103 | 3336 | 23636 | UTILITY | COMMIT

The following example issues a TRUNCATE from a nested call. The TRUNCATE commits all work done so
far in the outer and inner procedures in a transaction (3344). It starts a new transaction (3345). The new
transaction is committed when the outer procedure exits.

CREATE OR REPLACE PROCEDURE sp_inner(c int, d int) LANGUAGE plpgsql
AS $$
BEGIN
INSERT INTO inner_table values (c);
TRUNCATE outer_table;
INSERT INTO inner_table values (d);
END;
$$;

CREATE OR REPLACE PROCEDURE sp_outer(a int, b int, c int, d int) LANGUAGE plpgsql
AS $$
BEGIN
INSERT INTO outer_table values (a);
Call sp_inner(c, d);
INSERT INTO outer_table values (b);
END;
$$;

Call sp_outer(1, 2, 3, 4);

select userid, xid, pid, type, trim(text) as stmt_text
from svl_statementtext where pid = pg_backend_pid() order by xid , starttime , sequence;

userid | xid  | pid   | type    | stmt_text
--------+------+-------+---------+------------------------------------------------
103 | 3344 | 23636 | UTILITY | Call sp_outer(1, 2, 3, 4);
103 | 3344 | 23636 | QUERY | INSERT INTO outer_table values ( $1 )
103 | 3344 | 23636 | UTILITY | CALL sp_inner( $1 , $2 )
103 | 3344 | 23636 | QUERY | INSERT INTO inner_table values ( $1 )
103 | 3344 | 23636 | UTILITY | TRUNCATE outer_table
103 | 3344 | 23636 | UTILITY | COMMIT
103 | 3345 | 23636 | QUERY | INSERT INTO inner_table values ( $1 )
103 | 3345 | 23636 | QUERY | INSERT INTO outer_table values ( $1 )
103 | 3345 | 23636 | UTILITY | COMMIT

The following example shows that cursor cur1 was closed when the TRUNCATE statement committed.

CREATE OR REPLACE PROCEDURE sp_open_cursor_truncate()
LANGUAGE plpgsql
AS $$
DECLARE
rec RECORD;
cur1 cursor for select * from test_table_a order by 1;
BEGIN
open cur1;
TRUNCATE table test_table_b;
Loop
fetch cur1 into rec;
raise info '%', rec.c1;
exit when not found;
End Loop;
END
$$;

call sp_open_cursor_truncate();
ERROR: cursor "cur1" does not exist
CONTEXT: PL/pgSQL function "sp_open_cursor_truncate" line 8 at fetch

The following example procedure issues a TRUNCATE statement, so it can't be called from within an
explicit transaction block.

CREATE OR REPLACE PROCEDURE sp_truncate_atomic() LANGUAGE plpgsql
AS $$
BEGIN
TRUNCATE test_table_b;
END;
$$;

Begin;
Call sp_truncate_atomic();
ERROR: TRUNCATE cannot be invoked from a procedure that is executing in an atomic context.
HINT: Try calling the procedure as a top-level call i.e. not from within an explicit
transaction block.
Or, if this procedure (or one of its ancestors in the call chain) was created with SET
config options, recreate the procedure without them.
CONTEXT: SQL statement "TRUNCATE test_table_b"
PL/pgSQL function "sp_truncate_atomic" line 2 at SQL statement

The following example shows that a user who is not a superuser or the owner of a table can issue a
TRUNCATE statement on the table. The user does this using a Security Definer stored procedure.
The example shows the following actions:

• The user user1 creates the table test_tbl.
• The user user1 creates the stored procedure sp_truncate_test_tbl.
• The user user1 grants EXECUTE privilege on the stored procedure to user2.
• The user user2 runs the stored procedure to truncate the table test_tbl. The example shows the row
count before and after the TRUNCATE command.

set session_authorization to user1;
create table test_tbl(id int, name varchar(20));
insert into test_tbl values (1,'john'), (2, 'mary');
CREATE OR REPLACE PROCEDURE sp_truncate_test_tbl() LANGUAGE plpgsql
AS $$
DECLARE
tbl_rows int;
BEGIN
select count(*) into tbl_rows from test_tbl;
RAISE INFO 'RowCount before Truncate: %', tbl_rows;
TRUNCATE test_tbl;
select count(*) into tbl_rows from test_tbl;
RAISE INFO 'RowCount after Truncate: %', tbl_rows;
END;
$$ SECURITY DEFINER;
grant execute on procedure sp_truncate_test_tbl() to user2;
reset session_authorization;

set session_authorization to user2;
call sp_truncate_test_tbl();
INFO: RowCount before Truncate: 2
INFO: RowCount after Truncate: 0
CALL
reset session_authorization;

The following example issues COMMIT twice. The first COMMIT commits all work done in transaction
10363 and implicitly starts transaction 10364. Transaction 10364 is committed by the second COMMIT
statement.

CREATE OR REPLACE PROCEDURE sp_commit(a int, b int) LANGUAGE plpgsql
AS $$
BEGIN
INSERT INTO test_table values (a);
COMMIT;
INSERT INTO test_table values (b);
COMMIT;
END;
$$;

call sp_commit(1,2);

select userid, xid, pid, type, trim(text) as stmt_text
from svl_statementtext where pid = pg_backend_pid() order by xid , starttime , sequence;


userid | xid   | pid  | type    | stmt_text
--------+-------+------+---------+------------------------------------------------
100 | 10363 | 3089 | UTILITY | call sp_commit(1,2);
100 | 10363 | 3089 | QUERY | INSERT INTO test_table values ( $1 )
100 | 10363 | 3089 | UTILITY | COMMIT
100 | 10364 | 3089 | QUERY | INSERT INTO test_table values ( $1 )
100 | 10364 | 3089 | UTILITY | COMMIT

The following example issues a ROLLBACK statement if sum_vals is greater than 2. The first ROLLBACK
statement rolls back all the work done in transaction 10377 and starts a new transaction 10378.
Transaction 10378 is committed when the procedure exits.

CREATE OR REPLACE PROCEDURE sp_rollback(a int, b int) LANGUAGE plpgsql
AS $$
DECLARE
sum_vals int;
BEGIN
INSERT INTO test_table values (a);
SELECT sum(c1) into sum_vals from test_table;
IF sum_vals > 2 THEN
ROLLBACK;
END IF;

INSERT INTO test_table values (b);
END;
$$;

call sp_rollback(1, 2);

select userid, xid, pid, type, trim(text) as stmt_text
from svl_statementtext where pid = pg_backend_pid() order by xid , starttime , sequence;

userid | xid   | pid  | type    | stmt_text
--------+-------+------+---------+------------------------------------------------
100 | 10377 | 3089 | UTILITY | call sp_rollback(1, 2);
100 | 10377 | 3089 | QUERY | INSERT INTO test_table values ( $1 )
100 | 10377 | 3089 | QUERY | SELECT sum(c1) from test_table
100 | 10377 | 3089 | QUERY | Undoing 1 transactions on table 133646 with current xid
10377 : 10377
100 | 10378 | 3089 | QUERY | INSERT INTO test_table values ( $1 )
100 | 10378 | 3089 | UTILITY | COMMIT

Trapping errors
When an error occurs in a stored procedure, no additional code is run and the transaction is rolled back.
But you can handle errors using an EXCEPTION block.

[ <<label>> ]
[ DECLARE
declarations ]
BEGIN
statements
EXCEPTION
WHEN OTHERS THEN
statements
END;


When an exception occurs, and you add an exception-handling block, you can write RAISE statements
and most other PL/pgSQL statements. For example, you can raise an exception with a custom message
or insert a record into a logging table.
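
For example, the following sketch (the procedure name is hypothetical) traps any error and raises a
new exception with a custom message that includes the original error text.

CREATE OR REPLACE PROCEDURE sp_report_failure() AS
$$
BEGIN
  EXECUTE 'select invalid';
EXCEPTION WHEN OTHERS THEN
  -- wrap the original error text in a custom message
  RAISE EXCEPTION 'sp_report_failure stopped: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;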

When entering the exception-handling block, the current transaction is rolled back and a new transaction
is created to run the statements in the block. If the statements in the block run without error, the
transaction is committed and the exception is re-thrown. Lastly, the stored procedure exits.

The only supported condition in an exception block is OTHERS, which matches every error type except
query cancellation. Also, if an error occurs in an exception-handling block, it can be caught by an outer
exception-handling block.

Example
The following example shows how to write statements in the exception-handling block.

CREATE TABLE employee (firstname varchar, lastname varchar);
INSERT INTO employee VALUES ('Tomas','Smith');
CREATE TABLE employee_error_log (message varchar);

CREATE OR REPLACE PROCEDURE update_employee_sp() AS
$$
BEGIN
UPDATE employee SET firstname = 'Adam' WHERE lastname = 'Smith';
EXECUTE 'select invalid';
EXCEPTION WHEN OTHERS THEN
RAISE INFO 'An exception occurred.';
INSERT INTO employee_error_log VALUES ('Error message: ' || SQLERRM);
END;
$$
LANGUAGE plpgsql;

CALL update_employee_sp();

INFO: An exception occurred.
ERROR: column "invalid" does not exist
CONTEXT: SQL statement "select invalid"
PL/pgSQL function "update_employee_sp" line 3 at execute statement

In this example, when update_employee_sp is called, the informational message An exception occurred.
is raised and the error message is inserted into the employee_error_log logging table. The original
exception is thrown again before the stored procedure exits. The following queries show records that
result from running the example.

SELECT * from employee;

firstname | lastname
-----------+----------
Tomas | Smith

SELECT * from employee_error_log;

message
------------------------------------------------
Error message: column "invalid" does not exist

Logging stored procedures


Details about stored procedures are logged in the following system tables and views (an example query
follows the list):


• SVL_STORED_PROC_CALL – details are logged about the stored procedure call's start time
and end time, and whether the call ended before completion. For more information, see
SVL_STORED_PROC_CALL (p. 1408).
• SVL_STORED_PROC_MESSAGES – messages emitted in stored procedures by the RAISE
statement are logged with the corresponding logging level. For more information, see
SVL_STORED_PROC_MESSAGES (p. 1409).
• SVL_QLOG – the query ID of the procedure call is logged for each query called from a stored
procedure. For more information, see SVL_QLOG (p. 1385).
• STL_UTILITYTEXT – stored procedure calls are logged after they are completed. For more information,
see STL_UTILITYTEXT (p. 1348).
• PG_PROC_INFO – this system catalog view shows information about stored procedures. For more
information, see PG_PROC_INFO (p. 1472).
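
For example, the following query returns recent stored procedure calls made by the current session.
The column names shown are assumptions based on the descriptions above; check the
SVL_STORED_PROC_CALL reference for the exact schema.

SELECT query, xid, pid, starttime, endtime, aborted
FROM svl_stored_proc_call
WHERE pid = pg_backend_pid()
ORDER BY starttime DESC
LIMIT 10;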

Limits and differences for stored procedure support


The following constraints apply when you use Amazon Redshift stored procedures.

Differences between Amazon Redshift and PostgreSQL for stored procedure support

The following are differences between stored procedure support in Amazon Redshift and PostgreSQL:

• Amazon Redshift doesn't support subtransactions, and hence has limited support for exception
handling blocks.

Limits
The following are limits on stored procedures in Amazon Redshift:

• The maximum number of stored procedures for a database is 10,000.
• The maximum size of the source code for a procedure is 2 MB.
• The maximum number of explicit and implicit cursors that you can open concurrently in a user session
is one. FOR loops that iterate over the result set of a SQL statement open implicit cursors. Nested
cursors aren't supported.
• Explicit and implicit cursors have the same restrictions on the result set size as standard Amazon
Redshift cursors. For more information, see Cursor constraints (p. 740).
• The maximum number of levels for nested calls is 16.
• The maximum number of procedure parameters is 32 for input arguments and 32 for output
arguments.
• The maximum number of variables in a stored procedure is 1,024.
• Any SQL command that requires its own transaction context isn't supported inside a stored procedure.
Examples include:
• PREPARE
• CREATE/DROP DATABASE
• CREATE EXTERNAL TABLE
• VACUUM
• SET LOCAL
• ALTER TABLE APPEND
• Transaction statements, such as BEGIN and START TRANSACTION.
• The registerOutParameter method call through the Java Database Connectivity (JDBC) driver
isn't supported for the refcursor data type. For an example of using the refcursor data type, see
Returning a result set (p. 167).

PL/pgSQL language reference


Stored procedures in Amazon Redshift are based on the PostgreSQL PL/pgSQL procedural language,
with some important differences. In this reference, you can find details of PL/pgSQL syntax as
implemented by Amazon Redshift. For more information about PL/pgSQL, see PL/pgSQL - SQL
procedural language in the PostgreSQL documentation.

Topics
• PL/pgSQL reference conventions (p. 176)
• Structure of PL/pgSQL (p. 176)
• Supported PL/pgSQL statements (p. 180)

PL/pgSQL reference conventions


In this section, you can find the conventions that are used to write the syntax for the PL/pgSQL stored
procedure language.

Character Description

CAPS Words in capital letters are keywords.

[] Brackets denote optional arguments. Multiple arguments in brackets indicate
that you can choose any number of the arguments. In addition, arguments in
brackets on separate lines indicate that the Amazon Redshift parser expects the
arguments to be in the order that they are listed in the syntax.

{} Braces indicate that you are required to choose one of the arguments inside the
braces.

| Pipes indicate that you can choose between the arguments.

red italics Words in red italics indicate placeholders. Insert the appropriate value in place of
the word in red italics.

... An ellipsis indicates that you can repeat the preceding element.

' Words in single quotation marks indicate that you must type the quotes.

Structure of PL/pgSQL
PL/pgSQL is a procedural language with many of the same constructs as other procedural languages.

Topics
• Block (p. 177)
• Variable declaration (p. 178)
• Alias declaration (p. 178)
• Built-in variables (p. 178)
• Record types (p. 180)

Block
PL/pgSQL is a block-structured language. The complete body of a procedure is defined in a block, which
contains variable declarations and PL/pgSQL statements. A statement can also be a nested block, or
subblock.

End declarations and statements with a semicolon. Follow the END keyword in a block or subblock with a
semicolon. Don't use semicolons after the keywords DECLARE and BEGIN.

You can write all keywords and identifiers in mixed uppercase and lowercase. Identifiers are implicitly
converted to lowercase unless enclosed in double quotation marks.

A double hyphen (--) starts a comment that extends to the end of the line. A /* starts a block comment
that extends to the next occurrence of */. You can't nest block comments. However, you can enclose
double-hyphen comments in a block comment, and a double hyphen can hide the block comment
delimiters /* and */.

Any statement in the statement section of a block can be a subblock. You can use subblocks for logical
grouping or to localize variables to a small group of statements.

[ <<label>> ]
[ DECLARE
declarations ]
BEGIN
statements
END [ label ];

The variables declared in the declarations section preceding a block are initialized to their default values
every time the block is entered. In other words, they're not initialized only once per function call.

The following shows an example.

CREATE PROCEDURE update_value() AS $$
DECLARE
value integer := 20;
BEGIN
RAISE NOTICE 'Value here is %', value; -- Value here is 20
value := 50;
--
-- Create a subblock
--
DECLARE
value integer := 80;
BEGIN
RAISE NOTICE 'Value here is %', value; -- Value here is 80
END;

RAISE NOTICE 'Value here is %', value; -- Value here is 50
END;
$$ LANGUAGE plpgsql;

Use a label to identify the block to use in an EXIT statement or to qualify the names of the variables
declared in the block.

Don't confuse the use of BEGIN/END for grouping statements in PL/pgSQL with the database commands
for transaction control. The BEGIN and END in PL/pgSQL are only for grouping. They don't start or end a
transaction.


Variable declaration
Declare all variables in a block, with the exception of loop variables, in the block's DECLARE
section. Variables can use any valid Amazon Redshift data type. For supported data types, see Data
types (p. 478).

PL/pgSQL variables can be any Amazon Redshift supported data type, plus RECORD and refcursor. For
more information about RECORD, see Record types (p. 180). For more information about refcursor,
see Cursors (p. 188).

DECLARE
name [ CONSTANT ] type [ NOT NULL ] [ { DEFAULT | := } expression ];

Following, you can find example variable declarations.

customerID integer;
numberofitems numeric(6);
link varchar;
onerow RECORD;

The loop variable of a FOR loop iterating over a range of integers is automatically declared as an integer
variable.

The DEFAULT clause, if given, specifies the initial value assigned to the variable when the block is
entered. If the DEFAULT clause is not given, then the variable is initialized to the SQL NULL value. The
CONSTANT option prevents the variable from being assigned to, so that its value remains constant for
the duration of the block. If NOT NULL is specified, an assignment of a null value results in a runtime
error. All variables declared as NOT NULL must have a non-null default value specified.

The default value is evaluated every time the block is entered. So, for example, assigning now() to a
variable of type timestamp causes the variable to have the time of the current function call, not the
time when the function was precompiled.

quantity INTEGER DEFAULT 32;
url VARCHAR := 'http://mysite.com';
user_id CONSTANT INTEGER := 10;

The refcursor data type is the data type of cursor variables within stored procedures. A refcursor
value can be returned from within a stored procedure. For more information, see Returning a result
set (p. 167).

Alias declaration
If a stored procedure's signature omits an argument name, you can declare an alias for that argument.

name ALIAS FOR $n;
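
The following sketch (the procedure name is hypothetical) declares an alias for an unnamed integer
argument and references the argument through the alias.

CREATE OR REPLACE PROCEDURE sp_alias_demo(integer)
LANGUAGE plpgsql
AS $$
DECLARE
  min_qty ALIAS FOR $1;  -- alias for the first (unnamed) argument
BEGIN
  RAISE INFO 'minimum quantity is %', min_qty;
END;
$$;

CALL sp_alias_demo(5);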

Built-in variables
The following built-in variables are supported:

• FOUND
• SQLSTATE
• SQLERRM
• GET DIAGNOSTICS integer_var := ROW_COUNT;


FOUND is a special variable of type Boolean. FOUND starts out false within each procedure call. FOUND
is set by the following types of statements:

• SELECT INTO

Sets FOUND to true if it returns a row, false if no row is returned.

• UPDATE, INSERT, and DELETE

Sets FOUND to true if at least one row is affected, false if no row is affected.

• FETCH

Sets FOUND to true if it returns a row, false if no row is returned.

• FOR statement

Sets FOUND to true if the FOR statement iterates one or more times, and otherwise false. This applies
to all three variants of the FOR statement: integer FOR loops, record-set FOR loops, and dynamic
record-set FOR loops.

FOUND is set when the FOR loop exits. Inside the execution of the loop, FOUND isn't modified by the
FOR statement. However, it can be changed by the execution of other statements within the loop
body.

The following shows an example.

CREATE TABLE employee(empname varchar);
CREATE OR REPLACE PROCEDURE show_found()
AS $$
DECLARE
myrec record;
BEGIN
SELECT INTO myrec * FROM employee WHERE empname = 'John';
IF NOT FOUND THEN
RAISE EXCEPTION 'employee John not found';
END IF;
END;
$$ LANGUAGE plpgsql;

Within an exception handler, the special variable SQLSTATE contains the error code that corresponds to
the exception that was raised. The special variable SQLERRM contains the error message associated with
the exception. These variables are undefined outside exception handlers and throw an error if used.

The following shows an example.

CREATE OR REPLACE PROCEDURE sqlstate_sqlerrm() AS
$$
BEGIN
UPDATE employee SET firstname = 'Adam' WHERE lastname = 'Smith';
EXECUTE 'select invalid';
EXCEPTION WHEN OTHERS THEN
RAISE INFO 'error message SQLERRM %', SQLERRM;
RAISE INFO 'error message SQLSTATE %', SQLSTATE;
END;
$$ LANGUAGE plpgsql;

ROW_COUNT is used with the GET DIAGNOSTICS command. It shows the number of rows processed by
the last SQL command sent down to the SQL engine.

The following shows an example.


CREATE OR REPLACE PROCEDURE sp_row_count() AS
$$
DECLARE
integer_var int;
BEGIN
INSERT INTO tbl_row_count VALUES(1);
GET DIAGNOSTICS integer_var := ROW_COUNT;
RAISE INFO 'rows inserted = %', integer_var;
END;
$$ LANGUAGE plpgsql;

Record types
A RECORD type is not a true data type, only a placeholder. Record type variables assume the actual row
structure of the row that they are assigned during a SELECT or FOR command. The substructure of a
record variable can change each time it is assigned a value. Until a record variable is first assigned to, it
has no substructure. Any attempt to access a field in it throws a runtime error.

name RECORD;

The following shows an example.

CREATE TABLE tbl_record(a int, b int);
INSERT INTO tbl_record VALUES(1, 2);
CREATE OR REPLACE PROCEDURE record_example()
LANGUAGE plpgsql
AS $$
DECLARE
rec RECORD;
BEGIN
FOR rec IN SELECT a FROM tbl_record
LOOP
RAISE INFO 'a = %', rec.a;
END LOOP;
END;
$$;

Supported PL/pgSQL statements


PL/pg