Amazon Redshift Database Developer Guide
Amazon Redshift Database Developer Guide
Table of Contents
Welcome ..................................................................................................................................... 1
Are You a First-Time Amazon Redshift User? ............................................................................. 1
Are You a Database Developer? ............................................................................................... 2
Prerequisites ........................................................................................................................ 3
Amazon Redshift System Overview .................................................................................................. 4
Data Warehouse System Architecture ....................................................................................... 4
Performance ......................................................................................................................... 6
Columnar Storage ................................................................................................................. 8
Internal Architecture and System Operation ............................................................................... 9
Workload Management ......................................................................................................... 10
Using Amazon Redshift with Other Services ............................................................................. 11
Moving Data Between Amazon Redshift and Amazon S3 .................................................... 11
Using Amazon Redshift with Amazon DynamoDB ............................................................. 11
Importing Data from Remote Hosts over SSH ................................................................... 11
Automating Data Loads Using AWS Data Pipeline ............................................................. 11
Getting Started Using Databases ................................................................................................... 12
Step 1: Create a Database .................................................................................................... 12
Step 2: Create a Database User ............................................................................................. 13
Delete a Database User ................................................................................................ 13
Step 3: Create a Database Table ............................................................................................ 13
Insert Data Rows into a Table ........................................................................................ 14
Select Data from a Table ............................................................................................... 14
Step 4: Load Sample Data ..................................................................................................... 15
Step 5: Query the System Tables ............................................................................................ 18
View a List of Table Names ............................................................................................ 19
View Database Users ................................................................................................... 19
View Recent Queries .................................................................................................... 20
Determine the Process ID of a Running Query .................................................................. 20
Step 6: Cancel a Query ......................................................................................................... 21
Cancel a Query from Another Session ............................................................................. 22
Cancel a Query Using the Superuser Queue .................................................................... 22
Step 7: Clean Up Your Resources ........................................................................................... 22
Amazon Redshift Best Practices ..................................................................................................... 24
Best Practices for Designing Tables ......................................................................................... 24
Take the Tuning Table Design Tutorial .............................................................................. 25
Choose the Best Sort Key ............................................................................................. 25
Choose the Best Distribution Style .................................................................................. 25
Use Automatic Compression .......................................................................................... 26
Define Constraints ....................................................................................................... 26
Use the Smallest Possible Column Size ........................................................................... 26
Using Date/Time Data Types for Date Columns ................................................................. 26
Best Practices for Loading Data ............................................................................................. 27
Take the Loading Data Tutorial ....................................................................................... 27
Take the Tuning Table Design Tutorial .............................................................................. 27
Use a COPY Command to Load Data ............................................................................. 27
Use a Single COPY Command ...................................................................................... 28
Split Your Load Data into Multiple Files ............................................................................ 28
Compress Your Data Files with gzip or lzop ...................................................................... 28
Use a Manifest File ...................................................................................................... 28
Verify Data Files Before and After a Load ......................................................................... 28
Use a Multi-Row Insert ................................................................................................. 29
Use a Bulk Insert ......................................................................................................... 29
Load Data in Sort Key Order .......................................................................................... 29
Load Data in Sequential Blocks ...................................................................................... 29
Use Time-Series Tables ................................................................................................ 30
Prerequisites .............................................................................................................. 82
Sections ..................................................................................................................... 82
Section 1: Understanding the Default Queue Processing Behavior ................................................ 83
Step 1: Create the WLM_QUEUE_STATE_VW View .......................................................... 83
Step 2: Create the WLM_QUERY_STATE_VW View ........................................................... 84
Step 3: Run Test Queries .............................................................................................. 85
Section 2: Modifying the WLM Query Queue Configuration .......................................................... 86
Step 1: Create a Parameter Group .................................................................................. 87
Step 2: Configure WLM ................................................................................................. 87
Step 3: Associate the Parameter Group with Your Cluster .................................................... 89
Section 3: Routing Queries to Queues Based on User Groups and Query Groups ........................... 91
Step 1: View Query Queue Configuration in the Database ................................................... 91
Step 2: Run a Query Using the Query Group Queue .......................................................... 91
Step 3: Create a Database User and Group ...................................................................... 92
Step 4: Run a Query Using the User Group Queue ............................................................ 93
Section 4: Using wlm_query_slot_count to Temporarily Override Concurrency Level in a Queue ........ 94
Step 1: Override the Concurrency Level Using wlm_query_slot_count ................................... 94
Step 2: Run Queries from Different Sessions .................................................................... 95
Section 5: Cleaning Up Your Resources ................................................................................... 96
Managing Database Security ......................................................................................................... 97
Amazon Redshift Security Overview ........................................................................................ 97
Default Database User Privileges ........................................................................................... 98
Superusers ......................................................................................................................... 98
Users ................................................................................................................................ 99
Creating, Altering, and Deleting Users ............................................................................. 99
Groups ............................................................................................................................. 100
Creating, Altering, and Deleting Groups ......................................................................... 100
Schemas .......................................................................................................................... 100
Creating, Altering, and Deleting Schemas ...................................................................... 101
Search Path .............................................................................................................. 101
Schema-Based Privileges ............................................................................................ 101
Example for Controlling User and Group Access ..................................................................... 102
Designing Tables ........................................................................................................................ 104
Choosing a Column Compression Type .................................................................................. 104
Compression Encodings ............................................................................................. 105
Testing Compression Encodings ................................................................................... 111
Example: Choosing Compression Encodings for the CUSTOMER Table .............................. 113
Choosing a Data Distribution Style ........................................................................................ 115
Data Distribution Concepts .......................................................................................... 115
Distribution Styles ...................................................................................................... 116
Viewing Distribution Styles ........................................................................................... 116
Evaluating Query Patterns ........................................................................................... 117
Designating Distribution Styles ..................................................................................... 117
Evaluating the Query Plan ........................................................................................... 118
Query Plan Example .................................................................................................. 120
Distribution Examples ................................................................................................. 125
Choosing Sort Keys ............................................................................................................ 127
Defining Constraints ........................................................................................................... 127
Analyzing Table Design ....................................................................................................... 127
Loading Data ............................................................................................................................ 131
Using COPY to Load Data ................................................................................................... 131
Preparing Your Input Data ............................................................................................ 132
Loading Data from Amazon S3 ..................................................................................... 133
Loading Data from Amazon EMR .................................................................................. 142
Loading Data from Remote Hosts ................................................................................. 149
Loading from Amazon DynamoDB ................................................................................ 155
Verifying That the Data Was Loaded Correctly ................................................................. 157
Validating Input Data .................................................................................................. 157
TRUNCATE ..............................................................................................................
UNLOAD ..................................................................................................................
UPDATE ...................................................................................................................
VACUUM ..................................................................................................................
SQL Functions Reference ...................................................................................................
Leader Node-Only Functions .......................................................................
Aggregate Functions ..................................................................................................
Bit-Wise Aggregate Functions ......................................................................................
Window Functions ......................................................................................................
Conditional Expressions ..............................................................................................
Date Functions ..........................................................................................................
Math Functions ..........................................................................................................
String Functions ........................................................................................................
JSON Functions ........................................................................................................
Data Type Formatting Functions ...................................................................................
System Administration Functions ..................................................................................
System Information Functions ......................................................................................
Reserved Words ................................................................................................................
System Tables Reference ............................................................................................................
System Tables and Views ....................................................................................................
Types of System Tables and Views ........................................................................................
Visibility of Data in System Tables and Views ..........................................................................
Filtering System-Generated Queries ..............................................................................
STL Tables for Logging .......................................................................................................
STL_AGGR ..............................................................................................................
STL_ALERT_EVENT_LOG .........................................................................................
STL_BCAST .............................................................................................................
STL_COMMIT_STATS ................................................................................................
STL_CONNECTION_LOG ...........................................................................................
STL_DDLTEXT ..........................................................................................................
STL_DIST ................................................................................................................
STL_DELETE ...........................................................................................................
STL_ERROR ............................................................................................................
STL_EXPLAIN ..........................................................................................................
STL_FILE_SCAN .......................................................................................................
STL_HASH ...............................................................................................................
STL_HASHJOIN ........................................................................................................
STL_INSERT ............................................................................................................
STL_LIMIT ................................................................................................................
STL_LOAD_COMMITS ...............................................................................................
STL_LOAD_ERRORS ................................................................................................
STL_LOADERROR_DETAIL ........................................................................................
STL_MERGE ............................................................................................................
STL_MERGEJOIN .....................................................................................................
STL_NESTLOOP .......................................................................................................
STL_PARSE .............................................................................................................
STL_PLAN_INFO ......................................................................................................
STL_PROJECT .........................................................................................................
STL_QUERY .............................................................................................................
STL_QUERYTEXT .....................................................................................................
STL_REPLACEMENTS ..............................................................................................
STL_RETURN ...........................................................................................................
STL_SAVE ...............................................................................................................
STL_S3CLIENT .........................................................................................................
STL_S3CLIENT_ERROR ............................................................................................
STL_SCAN ...............................................................................................................
STL_SESSIONS ........................................................................................................
STL_SORT ...............................................................................................................
STL_SSHCLIENT_ERROR .........................................................................................
STL_STREAM_SEGS ................................................................................................
STL_TR_CONFLICT ..................................................................................................
STL_UNDONE ..........................................................................................................
STL_UNIQUE ...........................................................................................................
STL_UNLOAD_LOG ..................................................................................................
STL_USERLOG ........................................................................................................
STL_UTILITYTEXT ....................................................................................................
STL_VACUUM ..........................................................................................................
STL_WARNING .........................................................................................................
STL_WINDOW ..........................................................................................................
STL_WLM_ERROR ...................................................................................................
STL_WLM_QUERY ....................................................................................................
STV Tables for Snapshot Data ..............................................................................................
STV_ACTIVE_CURSORS ...........................................................................................
STV_BLOCKLIST ......................................................................................................
STV_CURSOR_CONFIGURATION ...............................................................................
STV_EXEC_STATE ....................................................................................................
STV_INFLIGHT .........................................................................................................
STV_LOAD_STATE ....................................................................................................
STV_LOCKS .............................................................................................................
STV_PARTITIONS .....................................................................................................
STV_RECENTS ........................................................................................................
STV_SLICES ............................................................................................................
STV_SESSIONS .......................................................................................................
STV_TBL_PERM .......................................................................................................
STV_TBL_TRANS .....................................................................................................
STV_WLM_CLASSIFICATION_CONFIG ........................................................................
STV_WLM_QUERY_QUEUE_STATE ............................................................................
STV_WLM_QUERY_STATE .........................................................................................
STV_WLM_QUERY_TASK_STATE ...............................................................................
STV_WLM_SERVICE_CLASS_CONFIG .......................................................................
STV_WLM_SERVICE_CLASS_STATE ..........................................................................
System Views ....................................................................................................................
SVL_COMPILE .........................................................................................................
SVV_DISKUSAGE .....................................................................................................
SVL_QERROR ..........................................................................................................
SVL_QLOG ..............................................................................................................
SVV_QUERY_INFLIGHT ............................................................................................
SVL_QUERY_QUEUE_INFO .......................................................................................
SVL_QUERY_REPORT ..............................................................................................
SVV_QUERY_STATE .................................................................................................
SVL_QUERY_SUMMARY ...........................................................................................
SVL_STATEMENTTEXT ..............................................................................................
SVV_TABLE_INFO ....................................................................................................
SVV_VACUUM_PROGRESS .......................................................................................
SVV_VACUUM_SUMMARY .........................................................................................
SVL_VACUUM_PERCENTAGE ....................................................................................
System Catalog Tables ........................................................................................................
PG_TABLE_DEF .......................................................................................................
Querying the Catalog Tables ........................................................................................
Configuration Reference ..............................................................................................................
Modifying the Server Configuration .......................................................................................
datestyle ..........................................................................................................................
Values (Default in Bold) ...............................................................................................
Description ...............................................................................................................
Example ...................................................................................................................
extra_float_digits ................................................................................................................
Welcome
Topics
Are You a First-Time Amazon Redshift User? (p. 1)
Are You a Database Developer? (p. 2)
Prerequisites (p. 3)
This is the Amazon Redshift Database Developer Guide.
Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.
This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with
databases as a designer, software developer, or administrator, it gives you the information you need to
design, build, query, and maintain your data warehouse.
Amazon Redshift Database Developer Guide (this document): If you are a database developer, this guide explains how to design, build, query, and maintain the databases that make up your data warehouse.
If you are transitioning to Amazon Redshift from another relational database system or data warehouse
application, you should be aware of important differences in how Amazon Redshift is implemented. For
a summary of the most important considerations for designing tables and loading data, see Best Practices
for Designing Tables (p. 24) and Best Practices for Loading Data (p. 27). Amazon Redshift is based on
PostgreSQL 8.0.2. For a detailed list of the differences between Amazon Redshift and PostgreSQL, see
Amazon Redshift and PostgreSQL (p. 232).
We recommend that you begin by following the steps in the Getting Started guide to quickly deploy a cluster, connect to a database, and try out some queries. When you are ready to build your database, load data into tables, and write queries to manipulate data in the data warehouse, return here to the Database Developer Guide.

To learn about the internal architecture of the Amazon Redshift data warehouse, see the Amazon Redshift System Overview (p. 4), which gives a high-level overview of Amazon Redshift's internal architecture. If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product detail page.

To create databases, tables, users, and other database objects, start with Getting Started Using Databases (p. 12), a quick introduction to the basics of SQL development. Amazon Redshift SQL (p. 231) has the syntax and examples for Amazon Redshift SQL commands and functions and other SQL elements. Best Practices for Designing Tables (p. 24) provides a summary of our recommendations for choosing sort keys, distribution keys, and compression encodings.

To load data, see Loading Data (p. 131), which explains the procedures for loading large datasets from Amazon DynamoDB tables or from flat files stored in Amazon S3 buckets. Best Practices for Loading Data (p. 27) provides tips for loading your data quickly and effectively.

The System Tables Reference (p. 626) details the system tables and views that you can query for the status of the database and to monitor queries and processes. You should also consult the Amazon Redshift Management Guide to learn how to use the AWS Management Console to check system health, monitor metrics, and back up and restore clusters.

Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use today. For more information, see the Amazon Redshift partner page.

The SQL Reference (p. 231) has all the details for the SQL expressions, commands, and functions Amazon Redshift supports.
Prerequisites
Before you use this guide, you should complete these tasks.
Install a SQL client.
Launch an Amazon Redshift cluster.
Connect your SQL client to the cluster master database.
For step-by-step instructions, see the Amazon Redshift Getting Started Guide.
You should also know how to use your SQL client and should have a fundamental understanding of the
SQL language.
Client applications
Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and
business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on
industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal
changes. For information about important differences between Amazon Redshift SQL and PostgreSQL,
see Amazon Redshift and PostgreSQL (p. 232).
Connections
Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC
and ODBC drivers. For more information, see Amazon Redshift and PostgreSQL JDBC and ODBC (p. 233).
Clusters
The core infrastructure component of an Amazon Redshift data warehouse is a cluster.
A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute
nodes, an additional leader node coordinates the compute nodes and handles external communication.
Your client application interacts directly only with the leader node. The compute nodes are transparent
to external applications.
Leader node
The leader node manages communications with client programs and all communication with compute
nodes. It parses and develops execution plans to carry out database operations, in particular, the series
of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node
compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to
each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables
that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon
Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any
of these functions will return an error if it references tables that reside on the compute nodes. For more
information, see SQL Functions Supported on the Leader Node (p. 231).
Compute nodes
The leader node compiles code for individual elements of the execution plan and assigns the code to
individual compute nodes. The compute nodes execute the compiled code and send intermediate results back
to the leader node for final aggregation.
Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined
by the node type. As your workload grows, you can increase the compute capacity and storage capacity
of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift provides two node types: dense storage nodes and dense compute nodes. Each node
provides two storage choices. You can start with a single 160 GB node and scale up to multiple 16 TB
nodes to support a petabyte of data or more.
For a more detailed explanation of data warehouse clusters and nodes, see Internal Architecture and
System Operation (p. 9).
Node slices
A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each
slice is allocated a portion of the node's memory and disk space, where it processes a portion of the
workload assigned to the node. The leader node manages distributing data to the slices and apportions
the workload for any queries or other database operations to the slices. The slices then work in parallel
to complete the operation.
When you create a table, you can optionally specify one column as the distribution key. When the table
is loaded with data, the rows are distributed to the node slices according to the distribution key that is
defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing
to load data and execute queries efficiently. For information about choosing a distribution key, see Choose
the Best Distribution Style (p. 25).
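For example, the DISTKEY attribute in a CREATE TABLE statement designates the distribution column. The table and column names in this minimal sketch are illustrative only:

create table sales_fact(
saleid integer not null,
listid integer not null distkey,
saletime timestamp);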
Internal network
Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom
communication protocols to provide private, very high-speed network communication between the leader
node and compute nodes. The compute nodes run on a separate, isolated network that client applications
never access directly.
Databases
A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client
communicates with the leader node, which in turn coordinates query execution with the compute nodes.
Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other
RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online
transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized
for high-performance analysis and reporting of very large datasets.
Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of
very important differences that you need to take into account as you design and develop your data
warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see
Amazon Redshift and PostgreSQL (p. 232).
Performance
Amazon Redshift achieves extremely fast query execution by employing these performance features:
Massively parallel processing
The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql
(libpq) connection protocols, so two clients using different protocols will each incur the first-time cost of
compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached
code.
Columnar Storage
Columnar storage for database tables is an important factor in optimizing analytic query performance
because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need
to load from disk.
The following series of illustrations describe how columnar data storage implements efficiencies and how
that translates into efficiencies when retrieving data into memory.
This first illustration shows how records from database tables are typically stored into disk blocks by row.
In a typical relational database table, each row contains field values for a single record. In row-wise
database storage, data blocks store values sequentially for each consecutive column making up the entire
row. If block size is smaller than the size of a record, storage for an entire record may take more than
one block. If block size is larger than the size of a record, storage for an entire record may take less than
one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP) applications,
most transactions involve frequently reading and writing all of the values for entire records, typically one
record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.
The next illustration shows how with columnar storage, the values for each column are stored sequentially
into disk blocks.
Using columnar storage, each data block stores values of a single column for multiple rows. As records
enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the
columns.
In this simplified example, using columnar storage, each data block holds column field values for as many
as three times as many records as row-based storage. This means that reading the same number of
column field values for the same number of records requires a third of the I/O operations compared to
row-wise storage. In practice, using tables with very large numbers of columns and very large row counts,
storage efficiency is even greater.
An added advantage is that, since each block holds the same type of data, block data can use a
compression scheme selected specifically for the column data type, further reducing disk space and I/O.
For more information about compression encodings based on data types, see Compression
Encodings (p. 105).
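For example, a column's encoding can be declared directly in the CREATE TABLE statement. The table, columns, and encodings below are a minimal sketch rather than a recommendation for your own data:

create table product(
product_id integer encode delta,
product_name varchar(100) encode lzo);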
The savings in space for storing data on disk also carries over to retrieving and then storing that data in
memory. Since many database operations only need to access or operate on one or a small number of
columns at a time, you can save memory space by only retrieving blocks for columns you actually need
for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small
number of records, data warehouse queries commonly read only a few columns for a very large number
of rows. This means that reading the same number of column field values for the same number of rows
requires a fraction of the I/O operations and uses a fraction of the memory that would be required for
processing row-wise blocks. In practice, using tables with very large numbers of columns and very large
row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100
columns. A query that uses five columns will only need to read about five percent of the data contained
in the table. This savings is repeated for possibly billions or even trillions of records for large databases.
In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB,
which is more efficient and further reduces the number of I/O requests needed to perform any database
loading or other operations that are part of query execution.
Workload Management
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads
so that short, fast-running queries won't get stuck in queues behind long-running queries.
Amazon Redshift WLM creates query queues at runtime according to service classes, which define the
configuration parameters for various types of queues, including internal system queues and user-accessible
queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent.
For consistency, this documentation uses the term queue to mean a user-accessible service class as
well as a runtime queue.
When you run a query, WLM assigns the query to a queue according to the user's user group or by
matching a query group that is listed in the queue configuration with a query group label that the user
sets at runtime.
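For example, a session can label its queries by setting a query group before running them, assuming a queue has been configured to match that label (the label 'report' here is illustrative):

set query_group to 'report';
select count(*) from sales;
reset query_group;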
By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to
five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one.
You can define up to eight queues. Each queue can be configured with a maximum concurrency level of
50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue)
is 50.
The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console.
You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.
For more information about implementing and using workload management, see Implementing Workload
Management (p. 218).
Important
The cluster that you deployed for this exercise will be running in a live environment. As long as
it is running, it will accrue charges to your AWS account. For more pricing information, go to the
Amazon Redshift pricing page.
To avoid unnecessary charges, you should delete your cluster when you are done with it. The
final step of the exercise explains how to do so.
For this exercise, we'll accept the defaults. For information about more command options, see CREATE
DATABASE (p. 336) in the SQL Command Reference.
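A minimal form of the command, accepting all of the defaults, creates the database used in this tutorial:

create database tickit;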
After you have created the TICKIT database, you can connect to the new database from your SQL client.
Use the same connection parameters as you used for your current connection, but change the database
name to tickit.
You do not need to change the database to complete the remainder of this tutorial. If you prefer not to
connect to the TICKIT database, you can try the rest of the examples in this section using the default
database.
For information about other command options, see CREATE USER (p. 354) in the SQL Command
Reference.
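For example, a command of the following form creates a new database user; the user name and password here are only examples, so substitute your own values:

create user guest password 'ABCd4321';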
The master user you created when you launched your cluster continues to have access to the database.
Important
Amazon Redshift strongly recommends that you do not delete the master user.
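To remove a database user that you no longer need, use the DROP USER command; for example, assuming a user named guest:

drop user guest;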
For information about command options, see DROP USER (p. 366) in the SQL Reference.
For example, to create a table named testtable with a single column named testcol for an integer
data type, issue the following command:
create table testtable (testcol int);
The PG_TABLE_DEF system table contains information about all the tables in the cluster. To verify the
result, issue the following SELECT command to query the PG_TABLE_DEF system table.
select * from pg_table_def where tablename = 'testtable';
By default, new database objects, such as tables, are created in a schema named "public". For more
information about schemas, see Schemas (p. 100) in the Managing Database Security section.
The encoding, distkey, and sortkey columns are used by Amazon Redshift for parallel processing.
For more information about designing tables that incorporate these elements, see Best Practices for
Designing Tables (p. 24).
Note
The INSERT (p. 379) command inserts individual rows into a database table. For standard bulk
loads, use the COPY (p. 302) command. For more information, see Use a COPY Command to
Load Data (p. 27).
For example, to insert a value of 100 into the testtable table (which contains a single column), issue
the following command:
insert into testtable values (100);
To view the row that you inserted, query the table:

select * from testtable;

testcol
---------
100
(1 row)
For more information about using the SELECT statement to query tables, see SELECT (p. 390) in the
SQL Command Reference.
Note
If you followed the steps to load data in the Amazon Redshift Getting Started Guide, these tables
already exist.
To load the sample data for the TICKIT database, you will first create the tables, then use the COPY
command to load the tables with sample data that is stored in an Amazon S3 bucket. For more information,
see Loading Data from Amazon S3 (p. 133).
You create tables using the CREATE TABLE command with a list of columns paired with data types. Many
of the CREATE TABLE statements in this example specify options for the column in addition to the data type,
such as not null, distkey, and sortkey. These are column attributes related to optimizing your
tables for query performance. See Designing Tables (p. 104) to learn how to choose these
options when you design your table structures.
1. Create the tables for the database.
The following SQL creates these tables: USERS, VENUE, CATEGORY, DATE, EVENT, LISTING,
and SALES.
create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);
create table venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);
create table category(
catid smallint not null distkey sortkey,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50));
create table date(
dateid smallint not null distkey sortkey,
caldate date not null,
day character(3) not null,
week smallint not null,
month character(5) not null,
qtr character(5) not null,
year smallint not null,
holiday boolean default('N'));
create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);
create table listing(
listid integer not null distkey,
sellerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
numtickets smallint not null,
priceperticket decimal(8,2),
totalprice decimal(8,2),
listtime timestamp);
create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);
The Amazon S3 buckets that contain the sample data give read access to all authenticated AWS users, so any valid AWS credentials that permit access to Amazon S3 will
work. If you need to create new access keys, go to Administering Access Keys for IAM Users.
a. In the following COPY commands, replace <access-key-id> and <your-secret-access-key>
with valid AWS account credentials.
b. The bucket that contains the load data must be in the same region as your cluster. Use the following
table to find the correct bucket name for your region. Alternatively, you can specify the region using
the REGION (p. 312) option with the COPY command.
Region                      <region-specific-bucket-name>
US East (N. Virginia)       awssampledb
US West (Oregon)            awssampledbuswest2
EU (Frankfurt)              awssampledbeucentral1
EU (Ireland)                awssampledbeuwest1
Asia Pacific (Singapore)    awssampledbapsoutheast1
Asia Pacific (Sydney)       awssampledbapsoutheast2
Asia Pacific (Tokyo)        awssampledbapnortheast1
c. Replace <region-specific-bucket-name> with the name of a sample data bucket in the same
region as your cluster.
d. Execute the following COPY commands.

copy users from 's3://<region-specific-bucket-name>/tickit/allusers_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';

copy venue from 's3://<region-specific-bucket-name>/tickit/venue_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';

copy category from 's3://<region-specific-bucket-name>/tickit/category_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';

copy date from 's3://<region-specific-bucket-name>/tickit/date2008_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';

copy event from 's3://<region-specific-bucket-name>/tickit/allevents_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS';

copy listing from 's3://<region-specific-bucket-name>/tickit/listings_pipe.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';

copy sales from 's3://<region-specific-bucket-name>/tickit/sales_tab.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';
To verify that each table loaded correctly, count its rows:

select count(*) from users;
select count(*) from venue;
select count(*) from category;
select count(*) from date;
select count(*) from event;
select count(*) from listing;
select count(*) from sales;
Note
The description for each table in the System Tables Reference indicates whether a table is
superuser visible or user visible. You must be logged in as a superuser to query tables that are
superuser visible.
Amazon Redshift provides access to the following types of system tables:
STL Tables for Logging (p. 628)
These system tables are generated from Amazon Redshift log files to provide a history of the system.
Logging tables have an STL prefix.
STV Tables for Snapshot Data (p. 692)
These tables are virtual system tables that contain snapshots of the current system data. Snapshot
tables have an STV prefix.
System Views (p. 715)
System views contain a subset of data found in several of the STL and STV system tables. System
views have an SVV or SVL prefix.
System Catalog Tables (p. 734)
The system catalog tables store schema metadata, such as information about tables and columns.
System catalog tables have a PG prefix.
You may need to specify the process ID associated with a query to retrieve system table information
about that query. For information, see Determine the Process ID of a Running Query (p. 20).
To view a list of database users, query the PG_USER catalog table:

select * from pg_user;

 usename    | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 rdsdb      |        1 | t           | t        | t         | ******** |          |
 masteruser |      100 | t           | t        | f         | ******** |          |
 dwuser     |      101 | f           | f        | f         | ******** |          |
 simpleuser |      102 | f           | f        | f         | ******** |          |
 poweruser  |      103 | f           | t        | f         | ******** |          |
 dbuser     |      104 | t           | f        | f         | ******** |          |
(6 rows)
The user name rdsdb is used internally by Amazon Redshift to perform routine administrative and
maintenance tasks. You can filter your query to show only user-defined user names by adding where
usesysid > 1 to your select statement.
select * from pg_user
where usesysid > 1;
 usename    | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 masteruser |      100 | t           | t        | f         | ******** |          |
 dwuser     |      101 | f           | f        | f         | ******** |          |
 simpleuser |      102 | f           | f        | f         | ******** |          |
 poweruser  |      103 | f           | t        | f         | ******** |          |
 dbuser     |      104 | t           | f        | f         | ******** |          |
(5 rows)
To determine the PID of a running query, issue the following SELECT statement:
select pid, user_name, starttime, query
from stv_recents
where status='Running';
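For example, the following query joins the SALES and USERS tables to total ticket sales by seller. The exact query you run for this step can differ; what matters is the WHERE clause that matches each sale to its seller:

select sellerid, firstname, lastname, sum(qtysold)
from sales, users
where sales.sellerid = users.userid
group by sellerid, firstname, lastname
order by 4 desc;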
Note
This is a complex query. For this tutorial, you don't need to worry about how this query is
constructed.
The previous query runs in seconds and returns 2,102 rows.
Suppose the user forgets to put in the WHERE clause.
select sellerid, firstname, lastname, sum(qtysold)
from sales, users
group by sellerid, firstname, lastname
order by 4 desc;
The result set will include all of the rows in the SALES table multiplied by all the rows in the USERS table
(49989*3766). This is called a Cartesian join, and it is not recommended. The result is over 188 million
rows and takes a long time to run.
To cancel a running query, use the CANCEL command with the query's PID.
To find the process ID, query the STV_RECENTS table, as shown in the previous step. The following
example shows how you can make the results more readable by using the TRIM function to trim trailing
spaces and by showing only the first 20 characters of the query string.
select pid, trim(user_name), starttime, substring(query,1,20)
from stv_recents
where status='Running';
To cancel the query with PID 18764, issue the following command:
cancel 18764;
Note
The CANCEL command will not abort a transaction. To abort or roll back a transaction, you must
use the ABORT or ROLLBACK command. To cancel a query associated with a transaction, first
cancel the query then abort the transaction.
If the query that you canceled is associated with a transaction, use the ABORT or ROLLBACK command
to cancel the transaction and discard any changes made to the data:
abort;
Unless you are signed on as a superuser, you can cancel only your own queries. A superuser can cancel
all queries.
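If the queue your session uses is busy, a superuser can issue the cancel through the Superuser queue by setting the query group first, as in this sketch (18764 is the example process ID from above):

set query_group to 'superuser';
cancel 18764;
reset query_group;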
For information about managing query queues, see Implementing Workload Management (p. 218).
If you want to keep the cluster, you might want to keep the sample data for reference. Most of the examples
in this guide use the tables you created in this exercise. The size of the data will not have any significant
effect on your available storage.
If you want to keep the cluster, but want to clean up the sample data, you can run the following command
to drop the TICKIT database:
drop database tickit;
If you didn't create a TICKIT database, or if you don't want to drop the database, run the following
commands to drop just the tables:
drop table testtable;
drop table users;
drop table venue;
drop table category;
drop table date;
drop table event;
drop table listing;
drop table sales;
and the size of the joining rows. Designate both the dimension table's primary key and the fact table's
corresponding foreign key as the DISTKEY.
2. Choose the largest dimension based on the size of the filtered data set.
Only the rows that are used in the join need to be distributed, so consider the size of the data
set after filtering, not the size of the table.
3. Change some dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, you can
improve query performance significantly by distributing the entire table to all of the nodes. Using ALL
distribution multiplies storage space requirements and increases load times and maintenance operations,
so you should weigh all factors before choosing ALL distribution.
For more information about choosing distribution styles, see Tutorial: Tuning Table Design (p. 33) and
Choosing a Data Distribution Style (p. 115).
Loading very large data sets can take a long time and consume a lot of computing resources. How your
data is loaded can also affect query performance. This section presents best practices for loading data
efficiently using COPY commands, bulk inserts, and staging tables.
The following example creates CATEGORY_STAGE as a copy of CATEGORY and inserts all of the rows
in CATEGORY into CATEGORY_STAGE.
create table category_stage as
select * from category;
For example, suppose you need to load a table with events from January, 2012 to December, 2012. Load
the rows for January, then February, and so on. Your table will be completely sorted when your load
completes and you will not need to run a vacuum. For more information, see Use Time-Series
Tables (p. ?).
When loading very large data sets, the space required to sort might exceed the total available space. By
loading data in smaller blocks, you'll use much less intermediate sort space during each load. In addition,
loading smaller blocks will make it easier to restart if the COPY fails and is rolled back.
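For instance, here is a minimal sketch of loading a year of events in monthly blocks; the bucket and key prefixes are hypothetical, but because each COPY arrives in sort key order, no intermediate vacuum is needed:
copy event from 's3://mybucket/tickit/2012-01/'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
copy event from 's3://mybucket/tickit/2012-02/'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
-- ...and so on through 2012-12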
Don't use cross-joins unless absolutely necessary. These joins without a join condition result in the
Cartesian product of two tables. Cross-joins are typically executed as nested-loop joins, which are the
slowest of the possible join types.
Use subqueries in cases where one table in the query is used only for predicate conditions. This situation
is a form of a cross-join and again results in an expensive nested-loop join.
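As a hedged illustration using the TICKIT tables from this guide (the city filter is arbitrary), a table that supplies only a filter can be pushed into a subquery:
-- VENUE and EVENT are used only to filter SALES
select sum(s.qtysold)
from sales s, event e, venue v
where s.eventid = e.eventid
and e.venueid = v.venueid
and v.venuecity = 'Boston';
-- The subquery form avoids carrying the extra tables through the join.
select sum(s.qtysold)
from sales s
where s.eventid in
(select eventid from event
where venueid in (select venueid from venue where venuecity = 'Boston'));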
Use predicates to restrict the dataset as much as possible.
In the predicate, use the least expensive operators that you can. Comparison Condition (p. 265) operators
are preferable to LIKE (p. 271) operators. LIKE operators are still preferable to SIMILAR TO (p. 273) or
POSIX Operators (p. 275).
Avoid using functions in query predicates. Using them can drive up the cost of the query by requiring
large numbers of rows to resolve the intermediate steps of the query.
If possible, use a WHERE clause based on the primary sort column of the largest table in the query to
restrict the dataset. The query planner can then use row order to help determine which records match
the criteria, so it can skip scanning large numbers of disk blocks. Without this, the query execution
engine must scan the entire table.
Add predicates to filter other tables that participate in the join, even when the predicates are redundant.
Amazon Redshift can then efficiently skip scanning blocks from those tables.
For example, suppose you want to join TAB1 and TAB2. The sort key for TAB1 is tab1.timestamp,
and the sort key for TAB2 is tab2.timestamp. The following query joins the tables on their common
key and filters for tab1.timestamp values greater than January 1, 2013.
select * from tab1, tab2
where tab1.key = tab2.key
and tab1.timestamp > '1/1/2013';
If the WHERE clause doesn't include a predicate for tab2.timestamp, the execution engine is forced
to scan the entire table. If the join would result in values from tab2.timestamp also being greater
than January 1, 2013, then add that filter also, even though it is redundant.
select * from tab1, tab2
where tab1.key = tab2.key
and tab1.timestamp > '1/1/2013'
and tab2.timestamp > '1/1/2013';
Use sort keys in the GROUP BY clause so the query planner can use more efficient aggregation. A
query might qualify for one-phase aggregation when its GROUP BY list contains only sort key columns,
one of which is also the distribution key. The sort key columns in the GROUP BY list must include the
first sort key, then other sort keys that you want to use in sort key order. For example, it is valid to use
the first sort key, the first and second sort keys, the first, second and third sort keys, and so on. It is
not valid to use the first and third sort keys.
You can confirm the use of one-phase aggregation by running the EXPLAIN (p. 370) command and
looking for XN GroupAggregate in the aggregation step of the query.
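As a hedged sketch using the SSB tables from the tuning tutorial, and assuming LINEORDER is sorted and distributed on lo_orderdate, you could check a candidate query like this:
explain
select lo_orderdate, sum(lo_revenue)
from lineorder
group by lo_orderdate;
An XN GroupAggregate operator in the resulting plan indicates one-phase aggregation; an XN HashAggregate operator indicates that the query did not qualify.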
If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same order
in both. That is, use the following approach:
group by a, b, c
order by a, b, c
Don't use the following approach:
group by b, c, a
order by a, b, c
Prerequisites
You will need your AWS credentials (access key ID and secret access key) to load test data from Amazon
S3. If you need to create new access keys, go to Administering Access Keys for IAM Users.
Steps
Step 1: Create a Test Data Set (p. 34)
Step 2: Test System Performance to Establish a Baseline (p. 38)
Step 3: Select Sort Keys (p. 42)
Step 4: Select Distribution Styles (p. 43)
Step 5: Review Compression Encodings (p. 46)
Step 6: Recreate the Test Data Set (p. 49)
Step 7: Retest System Performance After Tuning (p. 52)
Step 8: Evaluate the Results (p. 56)
Step 9: Clean Up Your Resources (p. 58)
Summary (p. 58)
2.
Note
If the SSB tables already exist in the current database, you will need to drop the tables first.
See Step 6: Recreate the Test Data Set (p. 49) for the DROP TABLE commands.
For the purposes of this tutorial, the first time you create the tables, they will not have sort keys,
distribution styles, or compression encodings.
Execute the following CREATE TABLE commands.
CREATE TABLE part
(
  p_partkey     INTEGER NOT NULL,
  p_name        VARCHAR(22) NOT NULL,
  p_mfgr        VARCHAR(6) NOT NULL,
  p_category    VARCHAR(7) NOT NULL,
  p_brand1      VARCHAR(9) NOT NULL,
  p_color       VARCHAR(11) NOT NULL,
  p_type        VARCHAR(25) NOT NULL,
  p_size        INTEGER NOT NULL,
  p_container   VARCHAR(10) NOT NULL
);
CREATE TABLE supplier
(
  s_suppkey     INTEGER NOT NULL,
  s_name        VARCHAR(25) NOT NULL,
  s_address     VARCHAR(25) NOT NULL,
  s_city        VARCHAR(10) NOT NULL,
  s_nation      VARCHAR(15) NOT NULL,
  s_region      VARCHAR(12) NOT NULL,
  s_phone       VARCHAR(15) NOT NULL
);
3.
This step assumes that the bucket that contains the load data is in the same region as your cluster.
Alternatively, you can specify the region using the REGION (p. 312) option with the COPY command.
a.
b.
Use the following table to find the correct bucket name for your region.
Region                       <region-specific-bucket-name>
US East (N. Virginia)        awssampledb
US West (Oregon)             awssampledbuswest2
EU (Frankfurt)               awssampledbeucentral1
EU (Ireland)                 awssampledbeuwest1
Asia Pacific (Singapore)     awssampledbapsoutheast1
Asia Pacific (Sydney)        awssampledbapsoutheast2
Asia Pacific (Tokyo)         awssampledbapnortheast1
Create a new text file named loadssb.sql containing the following SQL.
copy customer from 's3://<region-specific-bucket-name>/ssbgz/customer'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off;
copy dwdate from 's3://<region-specific-bucket-name>/ssbgz/dwdate'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off;
copy lineorder from 's3://<region-specific-bucket-name>/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off;
copy part from 's3://<region-specific-bucket-name>/ssbgz/part'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off;
copy supplier from 's3://<region-specific-bucket-name>/ssbgz/supplier'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off;
c.
d.
e.
Note
The load operation will take about 10 to 15 minutes for all five tables.
Your results should look similar to the following.
4.
5.
Sum the execution time for all five tables, or else note the total script execution time. You'll record
that number as the load time in the benchmarks table in Step 2, following.
To verify that each table loaded correctly, execute the following commands.
select count(*) from LINEORDER;
select count(*) from PART;
select count(*) from CUSTOMER;
select count(*) from SUPPLIER;
select count(*) from DWDATE;
The following results table shows the number of rows for each SSB table.
Table Name    Rows
LINEORDER     600,037,902
PART          1,400,000
CUSTOMER      3,000,000
SUPPLIER      1,000,000
DWDATE        2,556
Next Step
Step 2: Test System Performance to Establish a Baseline (p. 38)
Query performance
The examples in this tutorial are based on using a four-node dw2.large cluster. Your results will be different,
even if you use the same cluster configuration. System performance is influenced by many factors, and
no two systems will perform exactly the same.
You will record your results using the following benchmarks table.
Benchmark                          Before     After
Load time (five tables)
Storage Use (1 MB blocks)
  LINEORDER
  PART
  CUSTOMER
  DWDATE
  SUPPLIER
  Total storage
Query execution time (seconds)
  Query 1
  Query 2
  Query 3
  Total execution time
2.
Note the cumulative load time for all five tables and enter it in the benchmarks table in the Before
column.
This is the value you noted in the previous step.
Record storage use.
Determine how many 1 MB blocks of disk space are used for each table by querying the
STV_BLOCKLIST table and record the results in your benchmarks table.
select stv_tbl_perm.name as table, count(*) as mb
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('lineorder','part','customer','dwdate','supplier')
group by stv_tbl_perm.name
order by 1 asc;
table     |  mb
----------+-------
customer  |   384
dwdate    |   160
lineorder | 51024
part      |   200
supplier  |   152
3.
Your results for the second time will look something like this:
The following benchmarks table shows the example results for the cluster used in this tutorial.
Benchmark                          Before     After
Load time (five tables)            10m 23s
Storage Use (1 MB blocks)
  LINEORDER                        51024
  PART                             200
  CUSTOMER                         384
  DWDATE                           160
  SUPPLIER                         152
  Total storage                    51920
Query execution time (seconds)
  Query 1                          6.97
  Query 2                          12.81
  Query 3                          13.39
  Total execution time             33.17
Next Step
Step 3: Select Sort Keys (p. 42)
Evaluate your queries to find timestamp columns that are used to filter the results.
For example, LINEORDER frequently uses equality filters using lo_orderdate.
where lo_orderdate = d_datekey and d_year = 1997
2.
Look for columns that are used in range filters and equality filters. For example, LINEORDER also
uses lo_orderdate for range filtering.
where lo_orderdate = d_datekey and d_year >= 1992 and d_year <= 1997
3.
Based on the first two best practices, lo_orderdate is a good choice for sort key.
In the tuning table, specify lo_orderdate as the sort key for LINEORDER.
4.
The remaining tables are dimensions, so, based on the third best practice, specify their primary keys
as sort keys.
The following tuning table shows the chosen sort keys. You fill in the Distribution Style column in Step 4:
Select Distribution Styles (p. 43).
Table name    Sort Key        Distribution Style
LINEORDER     lo_orderdate
PART          p_partkey
CUSTOMER      c_custkey
SUPPLIER      s_suppkey
DWDATE        d_datekey
Next Step
Step 4: Select Distribution Styles (p. 43)
Distribution Styles
When you create a table, you designate one of three distribution styles: KEY, ALL, or EVEN.
KEY distribution
The rows are distributed according to the values in one column. The leader node will attempt to place
matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader
node collocates the rows on the slices according to the values in the joining columns so that matching
values from the common columns are physically stored together.
ALL distribution
A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place
only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for
every join that the table participates in.
EVEN distribution
The rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular
column. EVEN distribution is appropriate when a table does not participate in joins or when there is not
a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default distribution
style.
For more information, see Distribution Styles (p. 116).
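The distribution style is declared when you create the table. The following is a brief sketch with hypothetical tables, not part of the tutorial steps:
create table part_demo (p_partkey integer not null distkey, p_name varchar(22));  -- KEY distribution
create table dwdate_demo (d_datekey integer not null) diststyle all;              -- ALL distribution
create table staging_demo (c1 integer, c2 varchar(10)) diststyle even;            -- EVEN distribution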
To look for redistribution steps in the query plan, execute an EXPLAIN command followed by the
query. This example uses Query 2 from our set of test queries.
explain
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;
The following shows a portion of the query plan. Look for labels that begin with DS_BCAST or
DS_DIST.
QUERY PLAN
XN Merge (cost=1038007224737.84..1038007224738.54 rows=280 width=20)
  Merge Key: dwdate.d_year, part.p_brand1
  -> XN Network (cost=1038007224737.84..1038007224738.54 rows=280 width=20)
     Send to leader
     -> XN Sort (cost=1038007224737.84..1038007224738.54 rows=280 width=20)
        Sort Key: dwdate.d_year, part.p_brand1
        -> XN HashAggregate (cost=38007224725.76..38007224726.46 rows=280
           -> XN Hash Join DS_BCAST_INNER (cost=30674.95..38007188507.46
              Hash Cond: ("outer".lo_orderdate = "inner".d_datekey)
              -> XN Hash Join DS_BCAST_INNER (cost=30643.00..37598119820.65
                 Hash Cond: ("outer".lo_suppkey = "inner".s_suppkey)
                 -> XN Hash Join DS_BCAST_INNER
                    Hash Cond: ("outer".lo_partkey = "inner".p_partkey)
                    -> XN Seq Scan on lineorder
                    -> XN Hash (cost=17500.00..17500.00 rows=56000
                       -> XN Seq Scan on part (cost=0.00..17500.00
                          Filter: ((p_category)::text = 'MFGR#12'::text)
                 -> XN Hash (cost=12500.00..12500.00 rows=201200
                    -> XN Seq Scan on supplier (cost=0.00..12500.00
                       Filter: ((s_region)::text = 'AMERICA'::text)
              -> XN Hash
                 -> XN Seq Scan on dwdate (cost=0.00..25.56 rows=2556
2.
DS_BCAST_INNER indicates that the inner join table was broadcast to every slice. A DS_DIST_BOTH
label, if present, would indicate that both the outer join table and the inner join table were redistributed
across the slices. Broadcasting and redistribution can be expensive steps in terms of query
performance. You want to select distribution strategies that reduce or eliminate broadcast and
distribution steps. For more information about evaluating the EXPLAIN plan, see Evaluating Query
Patterns (p. 117).
Distribute the fact table and one dimension table on their common columns.
The following diagram shows the relationships between the fact table, LINEORDER, and the dimension
tables in the SSB schema.
Each table can have only one distribution key, which means that only one pair of tables in the schema
can be collocated on their common columns. The central fact table is the clear first choice. For the
second table in the pair, choose the largest dimension that commonly joins the fact table. In this
design, LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER
on its primary key, p_partkey.
3.
Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution
key for PART so that the matching values for the joining keys will be collocated on the same slices
when the data is loaded.
Change some dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, you
can often improve query performance significantly by distributing the entire table to all of the nodes.
ALL distribution guarantees that the joining rows will be collocated on every slice. You should weigh
all factors before choosing ALL distribution. Using ALL distribution multiplies storage space
requirements and increases load times and maintenance operations.
CUSTOMER, SUPPLIER, and DWDATE also join the LINEORDER table on their primary keys;
however, LINEORDER will be collocated with PART, so you will set the remaining tables to use
DISTSTYLE ALL. Because the tables are relatively small and are not updated frequently, using ALL
distribution will have minimal impact on storage and load times.
4.
All of the tables have been assigned with DISTKEY or ALL distribution styles, so you won't assign
EVEN to any tables. After evaluating your performance results, you might decide to change some
tables from ALL to EVEN distribution.
The following tuning table shows the chosen distribution styles.
Table name    Sort Key        Distribution Style
LINEORDER     lo_orderdate    lo_partkey
PART          p_partkey       p_partkey
CUSTOMER      c_custkey       ALL
SUPPLIER      s_suppkey       ALL
DWDATE        d_datekey       ALL
For more information, see Choose the Best Distribution Style (p. 25).
Next Step
Step 5: Review Compression Encodings (p. 46)
2.
b.
Insert the same data into all of the columns using an INSERT statement with a SELECT clause.
The command will take a couple minutes to execute.
insert into encodingshipmode
select lo_shipmode as moderaw, lo_shipmode as modebytedict, lo_shipmode as modelzo,
lo_shipmode as moderunlength, lo_shipmode as modetext255, lo_shipmode as modetext32k
from lineorder where lo_orderkey < 200000000;
c.
Query the STV_BLOCKLIST system table to compare the number of 1 MB disk blocks used by
each column.
select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name = 'encodingshipmode'
and col < 6
group by name, col
order by col;
The query returns results similar to the following. Depending on how your cluster is configured,
your results will be different, but the relative sizes should be similar.
col | max
-----+-----
  0  | 221
  1  |  26
  2  |  61
  3  | 192
  4  |  54
  5  | 105
(6 rows)
3.
Use the ANALYZE COMPRESSION command to view the suggested encodings for an existing table.
Execute the following command.
analyze compression lineorder;
Notice that ANALYZE COMPRESSION chose BYTEDICT encoding for the lo_shipmode column.
4.
For an example that walks through choosing manually applied compression encodings, see Example:
Choosing Compression Encodings for the CUSTOMER Table (p. 113).
Apply automatic compression to the SSB tables.
By default, the COPY command automatically applies compression encodings when you load data
into an empty table that has no compression encodings other than RAW encoding. For this tutorial,
you will let the COPY command automatically select and apply optimal encodings for the tables as
part of the next step, Recreate the test data set.
For more information, see Loading Tables with Automatic Compression (p. 158).
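As a hedged aside, you can also control this behavior explicitly with the COMPUPDATE option. COMPUPDATE ON applies automatic compression to an empty table even if its columns already have encodings other than RAW, and COMPUPDATE OFF (used in the first load) suppresses the analysis. For example:
copy lineorder from 's3://<region-specific-bucket-name>/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate on;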
Next Step
Step 6: Recreate the Test Data Set (p. 49)
You need to drop the SSB tables before you run the CREATE TABLE commands.
Execute the following commands.
drop table part cascade;
drop table supplier cascade;
drop table customer cascade;
drop table dwdate cascade;
drop table lineorder cascade;
2.
CREATE TABLE supplier
(
  s_suppkey     INTEGER NOT NULL sortkey,
  s_name        VARCHAR(25) NOT NULL,
  s_address     VARCHAR(25) NOT NULL,
  s_city        VARCHAR(10) NOT NULL,
  s_nation      VARCHAR(15) NOT NULL,
  s_region      VARCHAR(12) NOT NULL,
  s_phone       VARCHAR(15) NOT NULL
)
diststyle all;
CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL sortkey,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
)
diststyle all;
CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL distkey,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL sortkey,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);
3.
Open the loadssb.sql script that you created in the previous step.
b.
Delete compupdate off from each COPY statement. This time, you will allow COPY to apply
compression encodings.
For reference, the edited script should look like the following:
copy customer from 's3://<region-specific-bucket-name>/ssbgz/customer'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
copy dwdate from 's3://<region-specific-bucket-name>/ssbgz/dwdate'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
copy lineorder from 's3://<region-specific-bucket-name>/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
copy part from 's3://<region-specific-bucket-name>/ssbgz/part'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
copy supplier from 's3://<region-specific-bucket-name>/ssbgz/supplier'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;
c.
d.
Note
The load operation will take about 10 to 15 minutes. This might be a good time to get
another cup of tea or feed the fish.
Your results should look similar to the following.
Warnings:
Load into table 'customer' completed, 3000000 record(s) loaded successfully.
...
...
Script execution finished
Total script execution time: 12m 15s
e.
Benchmark                          Before     After
Load time (five tables)            10m 23s    12m 15s
Storage Use (1 MB blocks)
  LINEORDER                        51024
  PART                             200
  CUSTOMER                         384
  DWDATE                           160
  SUPPLIER                         152
  Total storage                    51920
Query execution time (seconds)
  Query 1                          6.97
  Query 2                          12.81
  Query 3                          13.39
  Total execution time             33.17
Next Step
Step 7: Retest System Performance After Tuning (p. 52)
customer  |   604
dwdate    |   160
lineorder | 27152
part      |   200
supplier  |   236
2.
The following chart illustrates the distribution of the three largest tables. (The columns are not to
scale.) Notice that because CUSTOMER uses ALL distribution, it was distributed to only one slice
per node.
The distribution is relatively even, so you don't need to adjust for distribution skew.
3.
Run an EXPLAIN command with each query to view the query plans.
The following example shows the EXPLAIN command with Query 2.
explain
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;
In the EXPLAIN plan for Query 2, notice that the DS_BCAST_INNER labels have been replaced by
DS_DIST_ALL_NONE and DS_DIST_NONE, which means that no redistribution was required for
those steps, and the query should run much more quickly.
QUERY PLAN
XN Merge (cost=1000014243538.45..1000014243539.15 rows=280 width=20)
  Merge Key: dwdate.d_year, part.p_brand1
  -> XN Network (cost=1000014243538.45..1000014243539.15 rows=280 width=20)
     Send to leader
     -> XN Sort (cost=1000014243538.45..1000014243539.15 rows=280 width=20)
        Sort Key: dwdate.d_year, part.p_brand1
        -> XN HashAggregate (cost=14243526.37..14243527.07 rows=280 width=20)
           -> XN Hash Join DS_DIST_ALL_NONE (cost=30643.30..14211277.03 rows=4299912
              Hash Cond: ("outer".lo_orderdate = "inner".d_datekey)
              -> XN Hash Join DS_DIST_ALL_NONE (cost=30611.35..14114497.06
                 Hash Cond: ("outer".lo_suppkey = "inner".s_suppkey)
                 -> XN Hash Join DS_DIST_NONE (cost=17640.00..13758507.64
                    Hash Cond: ("outer".lo_partkey = "inner".p_partkey)
                    -> XN Seq Scan on lineorder (cost=0.00..6000378.88
                    -> XN Hash (cost=17500.00..17500.00 rows=56000 width=16)
                       -> XN Seq Scan on part (cost=0.00..17500.00
                          Filter: ((p_category)::text = 'MFGR#12'::text)
                 -> XN Hash (cost=12500.00..12500.00 rows=188541 width=4)
                    -> XN Seq Scan on supplier (cost=0.00..12500.00
                       Filter: ((s_region)::text = 'AMERICA'::text)
              -> XN Hash
                 -> XN Seq Scan on dwdate (cost=0.00..25.56 rows=2556 width=8)
4.
The following benchmarks table shows the results based on the cluster used in this example.Your results
will vary based on a number of factors, but the relative results should be similar.
Benchmark                          Before     After
Load time (five tables)            10m 23s    12m 15s
Storage Use (1 MB blocks)
  LINEORDER                        51024      27152
  PART                             200        200
  CUSTOMER                         384        604
  DWDATE                           160        160
  SUPPLIER                         152        236
  Total storage                    51920      28352
Query execution time (seconds)
  Query 1                          6.97       3.19
  Query 2                          12.81      9.02
  Query 3                          13.39      10.54
  Total execution time             33.17      22.75
Next Step
Step 8: Evaluate the Results (p. 56)
Benchmark                          Before     After      Change
Load time (seconds)                623        732        109 (17.5%)
Storage Use (1 MB blocks)
  LINEORDER                        51024      27152      -23872 (-46.8%)
  PART                             200        200        0 (0%)
  CUSTOMER                         384        604        220 (57.3%)
  DWDATE                           160        160        0 (0%)
  SUPPLIER                         152        236        84 (55.3%)
  Total storage                    51920      28352      -23568 (-45.4%)
Query execution time (seconds)
  Query 1                          6.97       3.19       -3.78 (-54.2%)
  Query 2                          12.81      9.02       -3.79 (-29.6%)
  Query 3                          13.39      10.54      -2.85 (-21.3%)
  Total execution time             33.17      22.75      -10.42 (-31.4%)
Load time
Load time increased by 17.5%.
Sorting, compression, and distribution increase load time. In particular, in this case, you used automatic
compression, which increases the load time for empty tables that don't already have compression
encodings. Subsequent loads to the same tables would be faster. You also increased load time by using
ALL distribution. You could reduce load time by using EVEN or DISTKEY distribution instead for some
of the tables, but that decision needs to be weighed against query performance.
Storage requirements
Storage requirements were reduced by 45.4%.
Some of the storage improvement from using columnar compression was offset by using ALL distribution
on some of the tables. Again, you could improve storage use by using EVEN or DISTKEY distribution
instead for some of the tables, but that decision needs to be weighed against query performance.
Distribution
You verified that there is no distribution skew as a result of your distribution choices.
By checking the EXPLAIN plan, you saw that data redistribution was eliminated for the test queries.
Query execution time
Total query execution time was reduced by 31.4%.
The improvement in query performance was due to a combination of optimizing sort keys, distribution
styles, and compression. Often, query performance can be improved even further by rewriting queries
and configuring workload management (WLM). For more information, see Tuning Query
Performance (p. 194).
Next Step
Step 9: Clean Up Your Resources (p. 58)
drop table part cascade;
drop table supplier cascade;
drop table customer cascade;
drop table dwdate cascade;
drop table lineorder cascade;
Next Step
Summary (p. 58)
Summary
In this tutorial, you learned how to optimize the design of your tables by applying table design best
practices.
You chose sort keys for the SSB tables based on these best practices:
If recent data is queried most frequently, specify the timestamp column as the leading column for the
sort key.
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
If you frequently join a (dimension) table, specify the join column as the sort key.
You applied the following best practices to improve the distribution of the tables.
Distribute the fact table and one dimension table on their common columns
Change some dimension tables to use ALL distribution
You evaluated the effects of compression on a table and determined that using automatic compression
usually produces the best results.
For more information, see the following links:
Best Practices for Designing Tables (p. 24)
Choose the Best Sort Key (p. 25)
Choosing a Data Distribution Style (p. 115)
Choosing a Column Compression Type (p. 104)
Next Step
For your next step, if you haven't done so already, we recommend taking Tutorial: Loading Data from
Amazon S3 (p. 60).
Download data files that use CSV, character-delimited, and fixed width formats.
Create an Amazon S3 bucket and then upload the data files to the bucket.
Launch an Amazon Redshift cluster and create database tables.
Use COPY commands to load the tables from the data files on Amazon S3.
Troubleshoot load errors and modify your COPY commands to correct the errors.
Prerequisites
You will need the following prerequisites:
An AWS account to launch an Amazon Redshift cluster and to create a bucket in Amazon S3.
Your AWS credentials (an access key ID and secret access key) to load test data from Amazon S3. If
you need to create new access keys, go to Administering Access Keys for IAM Users.
This tutorial is designed so that it can be taken by itself. In addition to this tutorial, we recommend
completing the following tutorials to gain a more complete understanding of how to design and use Amazon
Redshift databases:
Getting Started with Amazon Redshift walks you through the process of creating an Amazon Redshift
cluster and loading sample data.
Tutorial: Tuning Table Design (p. 33) walks you step by step through the process of designing and
tuning tables, including choosing sort keys, distribution styles, and compression encodings, and
evaluating system performance before and after tuning.
Overview
You can add data to your Amazon Redshift tables either by using an INSERT command or by using a
COPY command. At the scale and speed of an Amazon Redshift data warehouse, the COPY command
is many times faster and more efficient than INSERT commands.
The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to
read and load data in parallel from multiple data sources. You can load from data files on Amazon S3,
Amazon EMR, or any remote host accessible through a Secure Shell (SSH) connection, or you can load
directly from an Amazon DynamoDB table.
In this tutorial, you will use the COPY command to load data from Amazon S3. Many of the principles
presented here apply to loading from other data sources as well.
To learn more about using the COPY command, see these resources:
Steps
Follow the Getting Started steps to connect to your cluster from a SQL client and test a connection. You
do not need to complete the remaining Getting Started steps to create tables, upload data, and try example
queries.
Next Step
Step 2: Download the Data Files (p. 62)
3.
dwdate-tab.tbl-000
dwdate-tab.tbl-001
dwdate-tab.tbl-002
dwdate-tab.tbl-003
dwdate-tab.tbl-004
dwdate-tab.tbl-005
dwdate-tab.tbl-006
dwdate-tab.tbl-007
part-csv.tbl-000
part-csv.tbl-001
part-csv.tbl-002
part-csv.tbl-003
part-csv.tbl-004
part-csv.tbl-005
part-csv.tbl-006
part-csv.tbl-007
Next Step
Step 3: Upload the Files to an Amazon S3 Bucket (p. 63)
Sign in to the AWS Management Console, and open the Amazon S3 console.
Click Create Bucket.
In the Bucket Name box of the Create a Bucket dialog box, type a bucket name.
4.
The bucket name you choose must be unique among all existing bucket names in Amazon S3.
One way to help ensure uniqueness is to prefix your bucket names with the name of your
organization. Bucket names must comply with certain rules. For more information, go to Bucket
Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.
Select a region.
Create the bucket in the same region as your cluster. If your cluster is in the Oregon region,
click Oregon.
5.
Click Create.
When Amazon S3 successfully creates your bucket, the console displays your empty bucket in
the Buckets panel.
2.
Create a folder.
1.
2.
3.
Click the Actions button, and click Create Folder in the drop-down list.
Name the new folder load.
Note
The bucket that you created is not in a sandbox. In this exercise, you will add objects
to a real bucket, and you will be charged a nominal amount for the time that you store
the objects in the bucket. For more information about Amazon S3 pricing, go to the
Amazon S3 Pricing page.
3.
3.
4.
User Credentials
The Amazon Redshift COPY command must have access to read the file objects in the Amazon S3
bucket. If you use the same user credentials to create the Amazon S3 bucket and to run the Amazon
Redshift COPY command, the COPY command will have all necessary permissions. If you want to use
different user credentials, you can grant access by using the Amazon S3 access controls. The Amazon
Redshift COPY command requires at least ListBucket and GetObject permissions to access the file
objects in the Amazon S3 bucket. For more information about controlling access to Amazon S3 resources,
go to Managing Access Permissions to Your Amazon S3 Resources.
Next Step
Step 4: Create the Sample Tables (p. 64)
If the SSB tables already exist in the current database, you will need to drop the tables to remove them
from the database before you create them using the CREATE TABLE commands in the next step. The
tables used in this tutorial might have different attributes than the existing tables.
drop table part cascade;
drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;
2.
CREATE TABLE supplier
(
  s_suppkey     INTEGER NOT NULL,
  s_name        VARCHAR(25) NOT NULL,
  s_address     VARCHAR(25) NOT NULL,
  s_city        VARCHAR(10) NOT NULL,
  s_nation      VARCHAR(15) NOT NULL,
  s_region      VARCHAR(12) NOT NULL,
  s_phone       VARCHAR(15) NOT NULL
);
CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);
CREATE TABLE dwdate
(
d_datekey
d_date
d_dayofweek
d_month
d_year
d_yearmonthnum
d_yearmonth
d_daynuminweek
d_daynuminmonth
d_daynuminyear
d_monthnuminyear
d_weeknuminyear
d_sellingseason
d_lastdayinweekfl
d_lastdayinmonthfl
d_holidayfl
d_weekdayfl
);
CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);
Next Step
Step 5: Run the COPY Commands (p. 67)
The object path is a key prefix that the COPY command uses to load all objects that share the key
prefix. For example, the key prefix custdata.txt can refer to a single file or to a set of files, including
custdata.txt.001, custdata.txt.002, and so on.
Manifest file
If you need to load files with different prefixes, for example, from multiple buckets or folders, or if you
need to exclude files that share a prefix, you can use a manifest file. A manifest file explicitly lists each
load file and its unique object key.You will use a manifest file to load the PART table later in this tutorial.
Credentials
To access the AWS resources that contain the data to load, you must provide AWS access credentials
(that is, an access key ID and a secret access key) for an AWS user or an IAM user with sufficient
privileges. To load data from Amazon S3, the credentials must include ListBucket and GetObject
permissions. Additional credentials are required if your data is encrypted or if you are using temporary
access credentials. For more information, see CREDENTIALS (p. 307) in the COPY command reference.
For more information about managing access, go to Managing Access Permissions to Your Amazon S3
Resources. If you do not have an access key ID and secret access key, you will need to get them. For
more information, go to Administering Access Keys for IAM Users.
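For example, if you are using temporary security credentials, the credentials string also carries the session token. The following is a sketch with placeholder values; it can replace the credentials clause in any of the COPY commands in this tutorial:
credentials 'aws_access_key_id=<temporary-access-key-id>;aws_secret_access_key=<temporary-secret-access-key>;token=<temporary-session-token>'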
Options
You can specify a number of parameters with the COPY command to specify file formats, manage data
formats, manage errors, and control other features. In this tutorial, you will use the following COPY
command options and features:
CSV Format
CSV, which stands for comma separated values, is a common format used for importing and exporting
spreadsheet data. CSV is more flexible than comma-delimited format because it enables you to include
quoted strings within fields. The default quote character for COPY from CSV format is a double quotation
mark ( " ), but you can specify another quote character by using the QUOTE AS option. When you use
the quote character within the field, escape the character with an additional quote character.
The following excerpt from a CSV-formatted data file for the PART table shows strings enclosed in double
quotation marks ("LARGE ANODIZED BRASS") and a string enclosed in two double quotation marks
within a quoted string ("MEDIUM ""BURNISHED"" TIN").
15,dark sky,MFGR#3,MFGR#47,MFGR#3438,indigo,"LARGE ANODIZED BRASS",45,LG CASE
22,floral beige,MFGR#4,MFGR#44,MFGR#4421,medium,"PROMO, POLISHED BRASS",19,LG DRUM
23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR
The data for the PART table contains characters that will cause COPY to fail. In this exercise, you will
troubleshoot the errors and correct them.
To load data that is in CSV format, add csv to your COPY command. Execute the following command
to load the PART table.
copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
csv;
To get more information about the error, query the STL_LOAD_ERRORS table. The following query uses
the SUBSTRING function to shorten columns for readability and uses LIMIT 10 to reduce the number of
rows returned. You can adjust the values in substring(filename,22,25) to allow for the length of
your bucket name.
select query, substring(filename,22,25) as filename, line_number as line,
substring(colname,0,12) as column, type, position as pos,
substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as reason
from stl_load_errors
order by query desc
limit 10;
 query  |     filename     | line | column | type | pos |  line_text   | field_text | reason
--------+------------------+------+--------+------+-----+--------------+------------+--------
 333765 | part-csv.tbl-000 |    1 |        |      |   0 | 15,NUL next, |            |
NULL AS
The part-csv.tbl data files use the NUL terminator character (\x000 or \x0) to indicate NULL values.
Note
Despite very similar spelling, NUL and NULL are not the same. NUL is a UTF-8 character with
codepoint x000 that is often used to indicate end of record (EOR). NULL is a SQL value that
represents an absence of data.
By default, COPY treats a NUL terminator character as an EOR character and terminates the record,
which often results in unexpected results or an error. Because there is no single standard method of
indicating NULL in text data, the NULL AS COPY command option enables you to specify which character
to substitute with NULL when loading the table. In this example, you want COPY to treat the NUL terminator
character as a NULL value.
Note
The table column that receives the NULL value must be configured as nullable. That is, it must
not include the NOT NULL constraint in the CREATE TABLE specification.
To load PART using the NULL AS option, execute the following COPY command.
copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
csv
null as '\000';
To verify that COPY loaded NULL values, execute the following command to select only the rows that
contain NULL.
select p_partkey, p_name, p_mfgr, p_category from part where p_mfgr is null;
1|1|257368|465569|41365|19950218|2HIGH|0|17|2608718|9783671|4|2504369|92072|2|19950331|TRUCK
1|2|257368|201928|8146|19950218|2HIGH|0|36|6587676|9783671|9|5994785|109794|6|19950416|MAIL
REGION
Whenever possible, you should locate your load data in the same AWS region as your Amazon Redshift
cluster. If your data and your cluster are in the same region, you reduce latency, minimize eventual
consistency issues, and avoid cross-region data transfer costs. For more information, see Best Practices
for Loading Data (p. 27)
If you must load data from a different AWS region, use the REGION option to specify the AWS region in
which the load data is located. If you specify a region, all of the load data, including manifest files, must
be in the named region. For more information, see REGION (p. 312).
If your cluster is in the US East (N. Virginia) region, execute the following command to load the SUPPLIER
table from pipe-delimited data in an Amazon S3 bucket located in the US West (Oregon) region. For this
example, do not change the bucket name.
copy supplier from 's3://awssampledbuswest2/ssbgz/supplier.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|'
gzip
region 'us-west-2';
If your cluster is not in the US East (N. Virginia) region, execute the following command to load the
SUPPLIER table from pipe-delimited data in an Amazon S3 bucket located in the US East (N. Virginia)
region. For this example, do not change the bucket name.
copy supplier from 's3://awssampledb/ssbgz/supplier.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|'
gzip
region 'us-east-1';
FIXEDWIDTH
The following excerpt shows fixed-width data for the CUSTOMER table.
1   Customer#000000001   IVhzIApeRb             MOROCCO  0MOROCCO  AFRICA
2   Customer#000000002   XSTf4,NCwDVaWNe6tE     JORDAN   6JORDAN   MIDDLE EAST  23-453
3   Customer#000000003   MG9kdTD                ARGENTINA5ARGENTINAAMERICA      11-783
The order of the label/width pairs must match the order of the table columns exactly. For more information,
see FIXEDWIDTH (p. 308).
The fixed-width specification string for the CUSTOMER table data is as follows.
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,
c_region :12, c_phone:15,c_mktsegment:10'
To load the CUSTOMER table from fixed-width data, execute the following command.
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,
c_region :12, c_phone:15,c_mktsegment:10';
MAXERROR
By default, the first time COPY encounters an error, the command fails and returns an error message.
To save time during testing, you can use the MAXERROR option to instruct COPY to skip a specified
number of errors before it fails. Because we expect errors the first time we test loading the CUSTOMER
table data, add maxerror 10 to the COPY command.
To test using the FIXEDWIDTH and MAXERROR options, execute the following command.
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,
c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10;
This time, instead of an error message, you get a warning message similar to the following.
Warnings:
Load into table 'customer' completed, 112497 record(s) loaded successfully.
Load into table 'customer' completed, 7 record(s) could not be loaded. Check
'stl_load_errors' system table for details.
The warning indicates that COPY encountered seven errors. To check the errors, query the
STL_LOAD_ERRORS table, as shown in the following example.
select query, substring(filename,22,25) as filename, line_number as line,
substring(colname,0,12) as column, type, position as pos,
substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as error_reason
from stl_load_errors
order by query desc, filename
limit 7;
The results of the STL_LOAD_ERRORS query should look similar to the following.
 query  |       filename       | line |  column   |  type   | pos |      line_text       | field_text |                 error_reason
--------+----------------------+------+-----------+---------+-----+----------------------+------------+----------------------------------------------
 334489 | customer-fw.tbl.log  |    2 | c_custkey | int4    |  -1 | customer-fw.tbl      | customer-f | Invalid digit, Value 'c', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log  |    6 | c_custkey | int4    |  -1 | Complete             | Complete   | Invalid digit, Value 'C', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log  |    3 | c_custkey | int4    |  -1 | #Total rows          | #Total row | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log  |    5 | c_custkey | int4    |  -1 | #Status              | #Status    | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log  |    1 | c_custkey | int4    |  -1 | #Load file           | #Load file | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl000   |    1 | c_address | varchar |  34 | 1 Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8
 334489 | customer-fw.tbl000   |    1 | c_address | varchar |  34 | 1 Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8
(7 rows)
By examining the results, you can see that there are two messages in the error_reasons column:
These errors are caused by the customer-fw.tbl.log file. The problem is that it is a log file, not a
data file, and should not be loaded. You can use a manifest file to avoid loading the wrong file.
The VARCHAR data type supports multibyte UTF-8 characters up to three bytes. If the load data
contains unsupported or invalid characters, you can use the ACCEPTINVCHARS option to replace
each invalid character with a specified alternative character.
Another problem with the load is more difficult to detect: the load produced unexpected results. To
investigate this problem, execute the following command to query the CUSTOMER table.
select c_custkey, c_name, c_address
from customer
order by c_custkey
limit 10;
 c_custkey |       c_name       |        c_address
-----------+--------------------+---------------------------
         2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE
         2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE
         3 | Customer#000000003 | MG9kdTD
         3 | Customer#000000003 | MG9kdTD
         4 | Customer#000000004 | XxVSJsL
         4 | Customer#000000004 | XxVSJsL
         5 | Customer#000000005 | KvpyuHCplrB84WgAi
         5 | Customer#000000005 | KvpyuHCplrB84WgAi
         6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx
         6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx
(10 rows)
COPY returns the number of rows that contained invalid UTF-8 characters, and it adds an entry to the
STL_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice.
Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded.
ACCEPTINVCHARS is valid only for VARCHAR columns.
For this step, you will add the ACCEPTINVCHARS option with the replacement character '^'.
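A sketch of that command, assuming the same fixed-width specification and MAXERROR setting used earlier:
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,
c_region :12, c_phone:15,c_mktsegment:10'
acceptinvchars as '^'
maxerror 10;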
MANIFEST
When you COPY from Amazon S3 using a key prefix, there is a risk that you will load unwanted files.
For example, the s3://mybucket/load/ folder contains eight data files that share the key prefix
customer-fw.tbl: customer-fw.tbl-000, customer-fw.tbl-001, and so on. However, the same
folder also contains the extraneous files customer-fw.tbl.log and customer-fw.tbl-0001.bak.
To ensure that you load all of the correct files, and only the correct files, use a manifest file. The manifest
is a text file in JSON format that explicitly lists the unique object key for each source file to be loaded.
The file objects can be in different folders or different buckets, but they must be in the same region. For
more information, see MANIFEST (p. 313).
The following shows the customer-fw-manifest text.
{
"entries": [
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-000"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-001"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-002"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-003"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-004"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-005"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-006"},
{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-007"}
]
}
To load the data for the CUSTOMER table using the manifest file
1.
2.
3.
4.
5.
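A sketch of the COPY command used in these steps, assuming the customer-fw-manifest file shown above has been uploaded to the load folder in your bucket:
copy customer
from 's3://<your-bucket-name>/load/customer-fw-manifest'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15,
c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10
acceptinvchars as '^'
manifest;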
DATEFORMAT
You can specify only one date format. If the load data contains inconsistent formats, possibly in different
columns, or if the format is not known at load time, you use DATEFORMAT with the 'auto' argument.
When 'auto' is specified, COPY will recognize any valid date or time format and convert it to the default
format.The 'auto' option recognizes several formats that are not supported when using a DATEFORMAT
and TIMEFORMAT string. For more information, see Using Automatic Recognition with DATEFORMAT
and TIMEFORMAT (p. 323).
To load the DWDATE table, execute the following COPY command.
copy dwdate from 's3://<your-bucket-name>/load/dwdate-tab.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '\t'
dateformat 'auto';
number of slices in your cluster, Amazon Redshift divides the workload and distributes the data evenly
among the slices. The number of slices is equal to the number of processor cores on the node.
For example, the dw2.large compute nodes used in this tutorial have two slices each, so the four-node
cluster has eight slices. In previous steps, the load data was contained in eight files, even though the files
are very small. In this step, you will compare the time difference between loading from a single large file
and loading from multiple files.
The files you will use for this tutorial contain about 15 million records and occupy about 1.2 GB. These
files are very small in Amazon Redshift scale, but sufficient to demonstrate the performance advantage
of loading from multiple files. The files are large enough that the time required to download them and
then upload them to Amazon S3 is excessive for this tutorial, so you will load the files directly from an
AWS sample bucket.
The following screenshot shows the data files for LINEORDER.
Execute the following command to COPY from a single file. Do not change the bucket name.
copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip
compupdate off
region 'us-east-1';
2.
Your results should be similar to the following. Note the execution time.
Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
0 row(s) affected.
copy executed successfully
Execution time: 51.56s
3.
Execute the following command to COPY from multiple files. Do not change the bucket name.
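A sketch of that command, assuming the multi-file copies of the sample data share a key prefix such as lineorder-multi (the prefix is an assumption; the rest mirrors the single-file command):
copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip
compupdate off
region 'us-east-1';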
4.
Your results should be similar to the following. Note the execution time.
Warnings:
Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
0 row(s) affected.
copy executed successfully
Execution time: 17.7s
5.
Next Step
Step 6: Vacuum and Analyze the Database (p. 79)
Next Step
Step 7: Clean Up Your Resources (p. 80)
drop table part;
drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;
Next Step
Summary (p. 80)
Summary
In this tutorial, you uploaded data files to Amazon S3 and then used COPY commands to load the data
from the files into Amazon Redshift tables.
You loaded data using the following formats:
Character-delimited
CSV
Fixed-width
You used the STL_LOAD_ERRORS system table to troubleshoot load errors, and then used the REGION,
MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the
errors.
You applied the following best practices for loading data:
Use a COPY Command to Load Data (p. 27)
Split Your Load Data into Multiple Files (p. 28)
Use a Single COPY Command to Load from Multiple Files (p. 28)
Compress Your Data Files with gzip or lzop (p. 28)
Use a Manifest File (p. 28)
Verify Data Files Before and After a Load (p. 28)
For more information about Amazon Redshift best practices, see the following links:
Next Step
For your next step, if you haven't done so already, we recommend taking Tutorial: Tuning Table
Design (p. 33).
Prerequisites
You will need an Amazon Redshift cluster, the sample TICKIT database, and the psql client tool. If you
do not already have these set up, go to Amazon Redshift Getting Started and Connect to Your Cluster
by Using the psql Tool.
Sections
Section 1: Understanding the Default Queue Processing Behavior (p. 83)
Column                Description
queue                 The number associated with the row that represents a queue. Queue number determines the order of the queues in the database.
description           A value that describes whether the queue is available only to certain user groups, to certain query groups, or all types of queries.
slots
working_mem
max_execution_time
user_*                A value that indicates whether wildcard characters are allowed in the WLM configuration to specify user groups.
query_*               A value that indicates whether wildcard characters are allowed in the WLM configuration to specify query groups.
queued
executing
executed
3. Run the following query to see the information that the view contains.
select * from wlm_queue_state_vw;
Column        Description
query
queue
slot_count
start_time
state
queue_time    The number of microseconds that the query has spent in the queue.
exec_time
2. Run the following query to see the information that the view contains.
select * from wlm_query_state_vw;
This query returns a self-referential result. The query that is currently executing is the SELECT statement
from this view. A query on this view will always return at least one result. You'll compare this result
with the result that occurs after starting the long-running query in the next step.
3. In psql window 2, run the following query from the TICKIT sample database.
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid
<40000;
Note
This query should run for approximately a minute so that you have time to explore the results
of the WLM_QUEUE_STATE_VW view and the WLM_QUERY_STATE_VW view that you
created earlier. If you find that the query does not run long enough for you to query both views,
you can increase the value of l.listid to make it run longer.
4. In psql window 1, query WLM_QUEUE_STATE_VW and WLM_QUERY_STATE_VW and compare
the results to your earlier results.
Note the following differences between your previous queries and the results in this step:
There are two rows now in WLM_QUERY_STATE_VW. One result is the self-referential query for
running a SELECT operation on this view. The second result is the long-running query from the previous
step.
The executing column in WLM_QUEUE_STATE_VW has increased from 1 to 2. This column entry
means that there are two queries running in the queue.
The executed column is incremented each time you run a query in the queue.
The WLM_QUEUE_STATE_VW view is useful for getting an overall view of the queues and how many
queries are being processed in each queue. The WLM_QUERY_STATE_VW view is useful for getting a
more detailed view of the individual queries that are currently running.
2. On the Parameters tab, review the parameters and values. At this point, all of the settings will have
default values. The wlm_json_configuration parameter is the one that will be modified when you
configure WLM in this tutorial.
3. Click the WLM tab. Click Add New Queue twice to add two new queues to this parameter group.
Configure the queues with the following values.
For queue 1, type 2 in the Concurrency box, test in the Query Groups box, and 30 in the %
Memory box. Leave the other boxes empty.
Note
You can ignore the warning about invalid memory allocation. After you specify memory
values to total 100 percent for all of the queues, the message will disappear.
For queue 2, type 3 in the Concurrency box, admin in the User Groups box, and 40 in the %
Memory box. Leave the other boxes empty.
For queue 3, type 30 in the % Memory box. Leave the other boxes empty.
3. In the Modify Cluster dialog box, select wlmtutorial from the Cluster Parameter Group menu,
and then click Modify.
The statuses shown in the Cluster Parameter Group and Parameter Group Apply Status will change
from in-sync to applying as shown in the following.
After the new parameter group is applied to the cluster, the Cluster Properties and Cluster Status
show the new parameter group that you associated with the cluster. You need to reboot the cluster so
that these settings can be applied to the database also.
4. In the Cluster menu, click Reboot. The status shown in Cluster Status will change from available to
rebooting. After the cluster is rebooted, the status will return to available.
Compare these results to the results you received in Step 1: Create the WLM_QUEUE_STATE_VW
View (p. 83). Notice that there are now two additional queues. Queue 1 is now the queue for the test
query group, and queue 2 is the queue for the admin user group.
Queue 3 is now the default queue. The last queue in the list is always the default queue, and that is
the queue to which queries are routed by default if no user group or query group is specified in a query.
2. Run the following query to confirm that your query now runs in queue 3.
select * from wlm_query_state_vw;
The query was routed to the test query group, which is queue 1 now.
3. Select all rows from the other view, as shown following, and you'll see a result similar to this.
select * from wlm_queue_state_vw;
4. Now, reset the query group and run the long query again:
reset query_group;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid
<40000;
The result should be that the query is now running in queue 3 again.
2. Then, run the following commands to create the new user group and add your new adminwlm user to
it.
create group admin;
alter group admin add user adminwlm;
2. In psql window 1, run the following query to see the query queue that the queries are routed to.
select * from wlm_query_state_vw;
select * from wlm_queue_state_vw;
Note that the queue this query ran in is queue 2, the admin user queue. Any time you run queries
logged in as this user, they will run in queue 2 unless you specify a different query group to use.
3. Now run the following query from psql window 2.
set query_group to test;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid
<40000;
4. In psql window 1, run the following query to see the query queue that the queries are routed to.
select * from wlm_queue_state_vw;
select * from wlm_query_state_vw;
set wlm_query_slot_count to 3;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid
<40000;
2. Now, query WLM_QUERY_STATE_VW using the masteruser account to see how the query is running.
select * from wlm_query_state_vw;
Notice that the slot count for the query is 3. This count means that the query is using all three slots to
process the query, allocating all of the resources in the queue to that query.
3. Now, run the following query.
select * from WLM_QUEUE_STATE_VW;
The wlm_query_slot_count configuration setting is valid for the current session only. If that session
expires, or another user runs a query, the WLM configuration is used.
4. Reset the slot count and rerun the test.
reset wlm_query_slot_count;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid
<40000;
3. While the long-running query is still running in psql window 1, run the following commands to increase the
slot count to use all of the slots for the queue, and then start the long-running query.
set wlm_query_slot_count to 2;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid
<40000;
4. Open a third psql window and query the views to see the results.
select * from wlm_queue_state_vw;
select * from wlm_query_state_vw;
Notice that the first query is using one of the slots allocated to queue 1 to run the query, and that there
is one query that is waiting in the queue (where queued is 1 and state is QueuedWaiting). Once
the first query completes, the second one will begin executing. This execution happens because both
queries are routed to the test query group, and the second query must wait for enough slots to begin
processing.
Sign-in credentials - Access to your Amazon Redshift Management Console is controlled by your
AWS account privileges. For more information, see Sign-In Credentials.
Access management - To control access to specific Amazon Redshift resources, you define AWS
Identity and Access Management (IAM) accounts. For more information, see Controlling Access to
Amazon Redshift Resources.
Cluster security groups - To grant other users inbound access to an Amazon Redshift cluster, you
define a cluster security group and associate it with a cluster. For more information, see Amazon
Redshift Cluster Security Groups.
VPC - To protect access to your cluster by using a virtual networking environment, you can launch
your cluster in a Virtual Private Cloud (VPC). For more information, see Managing Clusters in Virtual
Private Cloud (VPC).
Cluster encryption - To encrypt the data in all your user-created tables, you can enable cluster
encryption when you launch the cluster. For more information, see Amazon Redshift Clusters.
SSL connections - To encrypt the connection between your SQL client and your cluster, you can
use secure sockets layer (SSL) encryption. For more information, see Connect to Your Cluster Using
SSL.
Load data encryption - To encrypt your table load data files when you upload them to Amazon S3,
you can use either server-side encryption or client-side encryption. When you load from server-side
encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted
data, the Amazon Redshift COPY command decrypts the data as it loads the table. For more information,
see Uploading Encrypted Data to Amazon S3 (p. 136).
Data in transit - To protect your data in transit within the AWS cloud, Amazon Redshift uses
hardware-accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup,
and restore operations.
Superusers
Database superusers have the same privileges as database owners for all databases.
The masteruser, which is the user you created when you launched the cluster, is a superuser.
You must be a superuser to create a superuser.
Amazon Redshift system tables and system views are designated either "superuser visible" or "user
visible." Only superusers can query system tables and system views that are designated "superuser
visible." For information, see System Tables and Views (p. 626).
Superusers can query all PostgreSQL catalog tables. For information, see System Catalog Tables (p. 734).
A database superuser bypasses all permission checks. Be very careful when using a superuser role. We
recommend that you do most of your work as a role that is not a superuser. Superusers retain all privileges
regardless of GRANT and REVOKE commands.
To create a new database superuser, log on to the database as a superuser and issue a CREATE USER
command or an ALTER USER command with the CREATEUSER privilege.
create user adminuser createuser password '1234Admin';
alter user adminuser createuser;
Users
Amazon Redshift user accounts can only be created and dropped by a database superuser. Users are
authenticated when they log in to Amazon Redshift. They can own databases and database objects (for
example, tables) and can grant privileges on those objects to users, groups, and schemas to control who
has access to which object. Users with CREATE DATABASE rights can create databases and grant
privileges to those databases. Superusers have database ownership privileges for all databases.
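For reference, a minimal sketch of these operations in SQL, using a hypothetical user name and the SALES table from the sample TICKIT database (the CREATEDB option grants CREATE DATABASE rights):
create user report_user password 'Report1234' createdb;
grant select on table sales to report_user;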
Groups
Groups are collections of users who are all granted whatever privileges are associated with the group.
You can use groups to assign privileges by role. For example, you can create different groups for sales,
administration, and support and give the users in each group the appropriate access to the data they
require for their work. You can grant or revoke privileges at the group level, and those changes will apply
to all members of the group, except for superusers.
To view all user groups, query the PG_GROUP system catalog table:
select * from pg_group;
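For example, a sketch of role-based privileges along these lines, using hypothetical group and user names and the TICKIT SALES table:
create group sales_ro;
alter group sales_ro add user salesanalyst1; -- assumes the user salesanalyst1 already exists
grant select on table sales to group sales_ro;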
Schemas
A database contains one or more named schemas. Each schema in a database contains tables and other
kinds of named objects. By default, a database has a single schema, which is named PUBLIC. You can
use schemas to group database objects under a common name. Schemas are similar to operating system
directories, except that schemas cannot be nested.
Identical database object names can be used in different schemas in the same database without conflict.
For example, both MY_SCHEMA and YOUR_SCHEMA can contain a table named MYTABLE. Users
with the necessary privileges can access objects across multiple schemas in a database.
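As a sketch of that idea, using lowercase versions of the schema and table names mentioned above:
create schema my_schema;
create schema your_schema;
create table my_schema.mytable (id int);
create table your_schema.mytable (id int);
-- schema-qualified names identify which MYTABLE you mean
select count(*) from my_schema.mytable;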
By default, an object is created within the first schema in the search path of the database. For information,
see Search Path (p. 101) later in this section.
Schemas can help with organization and concurrency issues in a multi-user environment in the following
ways:
To allow many developers to work in the same database without interfering with each other.
To organize database objects into logical groups to make them more manageable.
To give applications the ability to put their objects into separate schemas so that their names will not
collide with the names of objects used by other applications.
To view a list of tables that belong to a schema, query the PG_TABLE_DEF system catalog table. For
example, the following query returns a list of tables in the PG_CATALOG schema.
select distinct(tablename) from pg_table_def
where schemaname = 'pg_catalog';
Search Path
The search path is defined in the search_path parameter with a comma-separated list of schema names.
The search path specifies the order in which schemas are searched when an object, such as a table or
function, is referenced by a simple name that does not include a schema qualifier.
If an object is created without specifying a target schema, the object is added to the first schema that is
listed in search path. When objects with identical names exist in different schemas, an object name that
does not specify a schema will refer to the first schema in the search path that contains an object with
that name.
To change the default schema for the current session, use the SET (p. 420) command.
For more information, see the search_path (p. 744) description in the Configuration Reference.
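A brief sketch of inspecting and changing the search path for the current session (the webapp schema here is illustrative and assumed to exist):
show search_path;
set search_path to webapp, public;
-- an unqualified CREATE TABLE now creates the table in webapp, the first schema in the path
create table mytable (id int);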
Schema-Based Privileges
Schema-based privileges are determined by the owner of the schema:
By default, all users have CREATE and USAGE privileges on the PUBLIC schema of a database. To
disallow users from creating objects in the PUBLIC schema of a database, use the REVOKE (p. 386)
command to remove that privilege.
Unless they are granted the USAGE privilege by the object owner, users cannot access any objects
in schemas they do not own.
If users have been granted the CREATE privilege to a schema that was created by another user, those
users can create objects in that schema.
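A sketch of these privilege controls in SQL; the schema and group names are taken from the example that follows:
revoke create on schema public from public;
grant usage on schema webapp to group webappusers;
grant create on schema webapp to group webdevusers;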
2. Create several database user accounts with different privileges and add them to the groups.
a. Create two users and add them to the WEBAPPUSERS group:
create user webappuser1 password 'webAppuser1pass'
in group webappusers;
create user webappuser2 password 'webAppuser2pass'
in group webappusers;
b. Create an account for a web developer and add it to the WEBDEVUSERS group:
create user webdevuser1 password 'webDevuser2pass'
in group webdevusers;
c. Create a superuser account. This user will have administrative rights to create other users:
create user webappadmin password 'webAppadminpass1'
createuser;
3. Create a schema to be associated with the database tables used by the web application, and grant
the various user groups access to this schema:
a. Create the WEBAPP schema:
create schema webapp;
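A sketch of the kind of GRANT statements this step describes, using the groups from this example (the exact privileges you grant will depend on your application):
grant usage on schema webapp to group webappusers;
grant usage on schema webapp to group webpowerusers;
grant usage, create on schema webapp to group webdevusers;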
The basic users and groups are now set up. You can now make changes to alter the users and groups.
4. For example, the following command alters the search_path parameter for the WEBAPPUSER1.
alter user webappuser1 set search_path to webapp, public;
The SEARCH_PATH specifies the schema search order for database objects, such as tables and
functions, when the object is referenced by a simple name with no schema specified.
5. You can also add users to a group after creating the group, such as adding WEBAPPUSER2 to the
WEBPOWERUSERS group:
alter group webpowerusers add user webappuser2;
Designing Tables
Topics
Choosing a Column Compression Type (p. 104)
Choosing a Data Distribution Style (p. 115)
Choosing Sort Keys (p. 127)
Defining Constraints (p. 127)
Analyzing Table Design (p. 127)
A data warehouse system has very different design goals as compared to a typical transaction-oriented
relational database system. An online transaction processing (OLTP) application is focused primarily on
single row transactions, inserts, and updates. Amazon Redshift is optimized for very fast execution of
complex analytic queries against very large data sets. Because of the massive amount of data involved
in data warehousing, you must specifically design your database to take full advantage of every available
performance optimization.
This section explains how to choose and implement compression encodings, data distribution keys, sort
keys, and table constraints, and it presents best practices for making these design decisions.
Note
We strongly recommend using the COPY command to apply automatic compression.
You might choose to apply compression encodings manually if the new table shares the same data
characteristics as another table, or if in testing you discover that the compression encodings that are
applied during automatic compression are not the best fit for your data. If you choose to apply compression
encodings manually, you can run the ANALYZE COMPRESSION (p. 295) command against an already
populated table and use the results to choose compression encodings.
To apply compression manually, you specify compression encodings for individual columns as part of
the CREATE TABLE statement. The syntax is as follows:
CREATE TABLE table_name (column_name
data_type ENCODE encoding-type)[, ...]
Where encoding-type is taken from the keyword table in the following section.
For example, the following statement creates a two-column table, PRODUCT. When data is loaded into
the table, the PRODUCT_ID column is not compressed, but the PRODUCT_NAME column is compressed,
using the byte dictionary encoding (BYTEDICT).
create table product(
product_id int,
product_name char(20) encode bytedict);
You cannot change the compression encoding for a column after the table is created. You can specify
the encoding for a column when it is added to a table by using the ALTER TABLE command.
ALTER TABLE table-name ADD [ COLUMN ] column_name column_type ENCODE encoding-type
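For example, a sketch of adding a compressed column to the PRODUCT table created above (the column name, type, and encoding here are illustrative):
alter table product add column product_desc varchar(200) encode lzo;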
Compression Encodings
Topics
Raw Encoding (p. 106)
Byte-Dictionary Encoding (p. 106)
Delta Encoding (p. 107)
LZO Encoding (p. 108)
Mostly Encoding (p. 108)
Runlength Encoding (p. 110)
Text255 and Text32k Encodings (p. 111)
A compression encoding specifies the type of compression that is applied to a column of data values as
rows are added to a table.
The following table identifies the supported compression encodings and the data types that support the
encoding.
Encoding type          Keyword in CREATE TABLE        Data types
Raw (no compression)   RAW                            All
Byte dictionary        BYTEDICT                       All except BOOLEAN
Delta                  DELTA                          SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
                       DELTA32K                       INT, BIGINT, DATE, TIMESTAMP, DECIMAL
LZO                    LZO                            All except BOOLEAN, REAL, and DOUBLE PRECISION
Mostly                 MOSTLY8                        SMALLINT, INT, BIGINT, DECIMAL
                       MOSTLY16                       INT, BIGINT, DECIMAL
                       MOSTLY32                       BIGINT, DECIMAL
Run-length             RUNLENGTH                      All
Text                   TEXT255                        VARCHAR only
                       TEXT32K                        VARCHAR only
Raw Encoding
Raw encoding is the default storage method. With raw encoding, data is stored in raw, uncompressed
form.
Byte-Dictionary Encoding
In byte dictionary encoding, a separate dictionary of unique values is created for each block of column
values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256
one-byte values that are stored as indexes to the original data values. If more than 256 values are stored
in a single block, the extra values are written into the block in raw, uncompressed form. The process
repeats for each disk block.
This encoding is very effective when a column contains a limited number of unique values. This encoding
is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding
is especially space-efficient if a CHAR column holds long character strings.
Note
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using
BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly
recommend using a different encoding, such as LZO, for VARCHAR columns.
Suppose a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon Redshift
creates the dictionary and populates the COUNTRY column with the index value. The dictionary contains
the indexed unique values, and the table itself contains only the one-byte subscripts of the corresponding
values.
Note
Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column,
every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.
In this example, the dictionary for the COUNTRY column holds one fixed-width 30-byte entry for each of
its six unique values, which include England, Venezuela, Sri Lanka, Argentina, and Japan. The COUNTRY
column itself contains 10 rows; in place of each original 30-byte value, the compressed column stores
only the corresponding one-byte dictionary index.
The total compressed size in this example is calculated as follows: 6 different entries are stored in the
dictionary (6 * 30 = 180), and the table contains 10 1-byte compressed values, for a total of 190 bytes.
Delta Encoding
Delta encodings are very useful for datetime columns.
Delta encoding compresses data by recording the difference between values that follow each other in
the column. This difference is recorded in a separate dictionary for each block of column values on disk.
(An Amazon Redshift disk block occupies 1 MB.) For example, if the column contains 10 integers in
sequence from 1 to 10, the first will be stored as a 4-byte integer (plus a 1-byte flag), and the next 9 will
each be stored as a byte with the value 1, indicating that it is one greater than the previous value.
The following table shows delta encoding applied to a column of 4-byte integer values. A difference
that does not fit in one byte is stored as a 1-byte flag plus the raw value.
Original value   Original size (bytes)   Difference (delta)   Compressed value   Compressed size (bytes)
1                4                                            1                  1 + 4 (flag + raw value)
5                4                       4                    4                  1
50               4                       45                   45                 1
200              4                       150                  150                1 + 4 (flag + raw value)
185              4                       -15                  -15                1
220              4                       35                   35                 1
221              4                       1                    1                  1
Totals           28                                                              15
LZO Encoding
LZO encoding provides a very high compression ratio with good performance. LZO encoding works
especially well for CHAR and VARCHAR columns that store very long character strings, especially free
form text, such as product descriptions, user comments, or JSON strings.
Note
COPY with automatic compression currently supports LZO encoding only for CHAR and
VARCHAR; however, ANALYZE COMPRESSION (p. 295) fully supports LZO encoding. For more
information, see Loading Tables with Automatic Compression (p. 158).
Mostly Encoding
Mostly encodings are useful when the data type for a column is larger than most of the stored values
require. By specifying a mostly encoding for this type of column, you can compress the majority of the
values in the column to a smaller standard storage size. The remaining values that cannot be compressed
are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column,
to 8-bit storage.
In general, the mostly encodings work with the following data types:
SMALLINT/INT2 (16-bit)
INTEGER/INT (32-bit)
BIGINT/INT8 (64-bit)
DECIMAL/NUMERIC (64-bit)
Choose the appropriate variation of the mostly encoding to suit the size of the data type for the column.
For example, apply MOSTLY8 to a column that is defined as a 16-bit integer column. Applying MOSTLY16
to a column with a 16-bit data type or MOSTLY32 to a column with a 32-bit data type is disallowed.
Mostly encodings might be less effective than no compression when a relatively high number of the values
in the column cannot be compressed. Before applying one of these encodings to a column, check that
most of the values that you are going to load now (and are likely to load in the future) fit into the ranges
shown in the following table.
Encoding    Compressed storage size   Range of values that can be compressed
MOSTLY8     1 byte (8 bits)           -128 to 127
MOSTLY16    2 bytes (16 bits)         -32768 to 32767
MOSTLY32    4 bytes (32 bits)         -2147483648 to +2147483647
Note
For decimal values, ignore the decimal point to determine whether the value fits into the range.
For example, 1,234.56 is treated as 123,456 and can be compressed in a MOSTLY32 column.
For example, the VENUEID column in the VENUE table is defined as a raw integer column, which means
that its values consume 4 bytes of storage. However, the current range of values in the column is 0 to
309. Therefore, re-creating and reloading this table with MOSTLY16 encoding for VENUEID would reduce
the storage of every value in that column to 2 bytes.
If the VENUEID values referenced in another table were mostly in the range of 0 to 127, it might make
sense to encode that foreign-key column as MOSTLY8. Before making the choice, you would have to
run some queries against the referencing table data to find out whether the values mostly fall into the
8-bit, 16-bit, or 32-bit range.
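As a sketch of the first case, re-creating a table with MOSTLY16 encoding for VENUEID might look like the following (the table name and the other columns shown are illustrative):
create table venue_mostly16 (
  venueid integer encode mostly16,
  venuename varchar(100),
  venueseats integer);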
The following table shows compressed sizes for specific numeric values when the MOSTLY8, MOSTLY16,
and MOSTLY32 encodings are used:
Original value   MOSTLY8 compressed size (bytes)   MOSTLY16 compressed size (bytes)   MOSTLY32 compressed size (bytes)
10               1                                 2                                  4
100              1                                 2                                  4
1000             Same as raw data size             2                                  4
10000            Same as raw data size             2                                  4
20000            Same as raw data size             2                                  4
40000            Same as raw data size             Same as raw data size              4
100000           Same as raw data size             Same as raw data size              4
2000000000       Same as raw data size             Same as raw data size              4
Runlength Encoding
Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value
and a count of the number of consecutive occurrences (the length of the run). A separate dictionary of
unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies
1 MB.) This encoding is best suited to a table in which data values are often repeated consecutively, for
example, when the table is sorted by those values.
For example, if a column in a large dimension table has a predictably small domain, such as a COLOR
column with fewer than 10 possible values, these values are likely to fall in long sequences throughout
the table, even if the data is not sorted.
We do not recommend applying runlength encoding on any column that is designated as a sort key.
Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns
are compressed much more highly than other columns in the same query, range-restricted scans might
perform poorly.
The following table uses the COLOR column example to show how the runlength encoding works:
Original data value   Original size (bytes)   Compressed value (token)   Compressed size (bytes)
Blue                  4                       {2,Blue}                   5
Blue                  4                                                  0
Green                 5                       {3,Green}                  6
Green                 5                                                  0
Green                 5                                                  0
Blue                  4                       {1,Blue}                   5
Yellow                6                       {4,Yellow}                 7
Yellow                6                                                  0
Yellow                6                                                  0
Yellow                6                                                  0
Totals                51                                                 23
Note
We recommend that you use the COPY command to load data whenever possible, and allow
the COPY command to choose the optimal encodings based on your data. Alternatively, you
can use the ANALYZE COMPRESSION (p. 295) command to view the suggested encodings for
existing data. For details about applying automatic compression, see Loading Tables with
Automatic Compression (p. 158).
To perform a meaningful test of data compression, you need a large number of rows. For this example,
we will create a table by using a CREATE TABLE AS statement that selects from two tables, VENUE
and LISTING. We will leave out the WHERE clause that would normally join the two tables; the result is
that each row in the VENUE table is joined to all of the rows in the LISTING table, for a total of over 32
million rows. This is known as a Cartesian join and normally is not recommended, but for this purpose,
it is a convenient method of creating a lot of rows. If you have an existing table with data that you want
to test, you can skip this step.
After we have a table with sample data, we create a table with six columns, each with a different
compression encoding: raw, bytedict, lzo, runlength, text255, and text32k. We populate each column with
exactly the same data by executing an INSERT command that selects the data from the first table.
To test compression encodings:
1. (Optional) First, we'll create a table with a large number of rows. Skip this step if you want to test an
existing table.
create table reallybigvenue as
select venueid, venuename, venuecity, venuestate, venueseats
from venue, listing;
2. Next, create a table with the encodings that you want to compare.
create table encodingvenue (
venueraw varchar(100) encode raw,
venuebytedict varchar(100) encode bytedict,
venuelzo varchar(100) encode lzo,
venuerunlength varchar(100) encode runlength,
venuetext32k varchar(100) encode text32k,
venuetext255 varchar(100) encode text255);
3. Insert the same data into all of the columns using an INSERT statement with a SELECT clause.
insert into encodingvenue
select venuename as venueraw, venuename as venuebytedict, venuename as venuelzo,
venuename as venuerunlength, venuename as venuetext32k, venuename as venuetext255
from reallybigvenue;
5. Query the STV_BLOCKLIST (p. 693) system table to compare the number of 1 MB disk blocks used
by each column.
The MAX aggregate function returns the highest block number for each column. The STV_BLOCKLIST
table includes details for three system-generated columns. This example uses col < 6 in the WHERE
clause to exclude the system-generated columns.
select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name ='encodingvenue'
and col < 6
group by name, col
order by col;
The query returns the following results. The columns are numbered beginning with zero. Depending
on how your cluster is configured, your result might have different numbers, but the relative sizes
should be similar. You can see that BYTEDICT encoding on the second column produced the best
results for this data set, with a compression ratio of better than 20:1. LZO encoding also produced
excellent results. Different data sets will produce different results, of course. When a column contains
longer text strings, LZO often produces the best compression results.
 col | max
-----+-----
   0 | 203
   1 |  10
   2 |  22
   3 | 204
   4 |  56
   5 |  72
(6 rows)
If you have data in an existing table, you can use the ANALYZE COMPRESSION (p. 295) command to
view the suggested encodings for the table. For example, the following example shows the recommended
encoding for a copy of the VENUE table, REALLYBIGVENUE, that contains 38 million rows. Notice that
ANALYZE COMPRESSION recommends BYTEDICT encoding for the VENUENAME column, which
agrees with the results of our previous test.
analyze compression reallybigvenue;
    Column     | Encoding
---------------+----------
 venueid       | lzo
 venuename     | bytedict
 venuecity     | lzo
 venuestate    | lzo
 venueseats    | lzo
(5 rows)
The following table shows the column encodings that were chosen for the CUSTOMER table and gives
an explanation for the choices:
Column       Data Type       Encoding
CUSTKEY      int             delta
CUSTNAME     varchar(30)     raw
GENDER       varchar(7)      text255
ADDRESS      varchar(200)    text255
CITY         varchar(30)     text255
STATE        char(2)         raw
ZIPCODE      char(5)         bytedict
START_DATE   date            delta32k
CUSTKEY (delta): CUSTKEY consists of unique, consecutive integer values. Since the differences will be
one byte, DELTA is a good choice.
ADDRESS (text255): ADDRESS is a large domain, but contains many repeated words, such as Street,
Avenue, North, South, and so on. Text255 and text32k are useful for compressing VARCHAR columns in
which the same words recur. The column length is short, so text255 is a good choice.
ZIPCODE (bytedict): ZIPCODE is a known domain of fewer than 50,000 unique values. Certain zip codes
occur much more commonly than others. Bytedict encoding is very effective when a column contains a
limited number of unique values.
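A sketch of a CREATE TABLE statement that applies these encodings (the CUSTOMER definition here is hypothetical and includes only the columns listed above):
create table customer(
  custkey int encode delta,
  custname varchar(30) encode raw,
  gender varchar(7) encode text255,
  address varchar(200) encode text255,
  city varchar(30) encode text255,
  state char(2) encode raw,
  zipcode char(5) encode bytedict,
  start_date date encode delta32k);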
When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute
nodes according to the table's distribution style. When you execute a query, the query optimizer
redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal
in selecting a table distribution style is to minimize the impact of the redistribution step by locating the
data where it needs to be before the query is executed.
This section will introduce you to the principles of data distribution in an Amazon Redshift database and
give you a methodology to choose the best distribution style for each of your tables.
Distribution Styles
When you create a table, you designate one of three distribution styles: EVEN, KEY, or ALL.
Even distribution
The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values
in any particular column. EVEN distribution is appropriate when a table does not participate in joins or
when there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the
default distribution style.
Key distribution
The rows are distributed according to the values in one column. The leader node will attempt to place
matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader
node collocates the rows on the slices according to the values in the joining columns so that matching
values from the common columns are physically stored together.
ALL distribution
A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place
only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for
every join that the table participates in.
ALL distribution multiplies the storage required by the number of nodes in the cluster, and so it takes
much longer to load, update, or insert data into multiple tables. ALL distribution is appropriate only for
relatively slow moving tables; that is, tables that are not updated frequently or extensively. Small dimension
tables do not benefit significantly from ALL distribution, because the cost of redistribution is low.
Note
After you have specified a distribution style for a column, Amazon Redshift handles data
distribution at the cluster level. Amazon Redshift does not require or support the concept of
partitioning data within database objects. You do not need to create table spaces or define
partitioning schemes for tables.
The RELDISTSTYLE column of the PG_CLASS system catalog table indicates the distribution style of a table:
RELDISTSTYLE   Distribution style
0              EVEN
1              KEY
8              ALL
The following example creates three tables using the three distribution styles and then queries PG_CLASS
to view the distribution styles.
create table alldiststyle (col1 int)
diststyle all;
create table evendiststyle (col1 int)
diststyle even;
create table keydiststyle (col1 int)
diststyle key distkey (col1);
select relname, reldiststyle from pg_class
where relname like '%diststyle';
    relname    | reldiststyle
---------------+--------------
 evendiststyle | 0
 keydiststyle  | 1
 alldiststyle  | 8
Whether your tables use a star schema, a snowflake schema, or an entirely different schema, Amazon
Redshift is designed to work effectively with whatever schema design you choose. The principles in this
section can be applied to any schema design.
1. Specify the primary key and foreign keys for all your tables.
Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer
uses them when it generates query plans. If you set primary keys and foreign keys, your application
must maintain the validity of the keys.
2. Distribute the fact table and its largest dimension table on their common columns.
Choose the largest dimension based on the size of data set that participates in the most common join,
not just the size of the table. If a table is commonly filtered, using a WHERE clause, only a portion of
its rows participate in the join. Such a table has less impact on redistribution than a smaller table that
contributes more data. Designate both the dimension table's primary key and the fact table's
corresponding foreign key as DISTKEY. If multiple tables use the same distribution key, they will also
be collocated with the fact table. Your fact table can have only one distribution key. Any tables that
join on another key will not be collocated with the fact table.
3. Designate distribution keys for the other dimension tables.
Distribute the tables on their primary keys or their foreign keys, depending on how they most commonly
join with other tables.
4. Evaluate whether to change some of the dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, you can
improve query performance significantly by distributing the entire table to all of the nodes. Using ALL
distribution multiplies storage space requirements and increases load times and maintenance operations,
so you should weigh all factors before choosing ALL distribution. The following section explains how
to identify candidates for ALL distribution by evaluating the EXPLAIN plan.
5. Use EVEN distribution for the remaining tables.
If a table is largely denormalized and does not participate in joins, or if you don't have a clear choice
for another distribution style, use EVEN distribution (the default).
You cannot change the distribution style of a table after it is created. To use a different distribution style,
you can recreate the table and populate the new table with a deep copy. For more information, see
Performing a Deep Copy (p. 171).
DS_DIST_INNER
The inner table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH
Both tables are redistributed.
DS_DIST_NONE and DS_DIST_ALL_NONE are good. They indicate that no distribution was required
for that step because all of the joins are collocated.
DS_DIST_INNER means that the step will probably have a relatively high cost because the inner table
is being redistributed to the nodes. DS_DIST_INNER indicates that the outer table is already properly
distributed on the join key. Set the inner table's distribution key to the join key to convert this to
DS_DIST_NONE. If distributing the inner table on the join key is not possible because the outer table is
not distributed on the join key, evaluate whether to use ALL distribution for the inner table. If the table is
relatively slow moving, that is, it is not updated frequently or extensively, and it is large enough to carry
a high redistribution cost, change the distribution style to ALL and test again. ALL distribution causes
increased load times, so when you retest, include the load time in your evaluation factors.
DS_DIST_ALL_INNER is not good. It means the entire inner table is redistributed to a single slice because
the outer table uses DISTSTYLE ALL, so that a copy of the entire outer table is located on each node.
This results in inefficient serial execution of the join on a single node instead of taking advantage of parallel
execution using all of the nodes. DISTSTYLE ALL is meant to be used only for the inner join table. Instead,
specify a distribution key or use even distribution for the outer table.
DS_BCAST_INNER and DS_DIST_BOTH are not good. Usually these redistributions occur because the
tables are not joined on their distribution keys. If the fact table does not already have a distribution key,
specify the joining column as the distribution key for both tables. If the fact table already has a distribution
key on another column, you should evaluate whether changing the distribution key to collocate this join
will improve overall performance. If changing the distribution key of the outer table is not an optimal choice,
you can achieve collocation by specifying DISTSTYLE ALL for the inner table.
The following example shows a portion of a query plan with DS_BCAST_INNER and DS_DIST_NONE
labels.
->  XN Hash Join DS_BCAST_INNER  (cost=112.50..3272334142.59 rows=170771 width=84)
      Hash Cond: ("outer".venueid = "inner".venueid)
      ->  XN Hash Join DS_BCAST_INNER  (cost=109.98..3167290276.71 rows=172456 width=47)
            Hash Cond: ("outer".eventid = "inner".eventid)
            ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                  Merge Cond: ("outer".listid = "inner".listid)
                  ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                  ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
After changing the dimension tables to use DISTSTYLE ALL, the query plan for the same query shows
DS_DIST_ALL_NONE in place of DS_BCAST_INNER. Also, there is a dramatic change in the relative
cost for the join steps.
In the TICKIT database, SALES is a fact table and LISTING is its largest dimension. In order to collocate
the tables, SALES is distributed on LISTID, which is the foreign key for LISTING, and LISTING is
distributed on its primary key, LISTID. The following example shows the CREATE TABLE command for
SALES.
create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null encode mostly16,
dateid smallint not null,
qtysold smallint not null encode mostly8,
pricepaid decimal(8,2) encode delta32k,
commission decimal(8,2) encode delta32k,
saletime timestamp,
primary key(salesid),
foreign key(listid) references listing(listid),
foreign key(sellerid) references users(userid),
foreign key(buyerid) references users(userid),
foreign key(dateid) references date(dateid))
sortkey(listid,sellerid);
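The corresponding definition for LISTING is not reproduced in this excerpt; a sketch based on the TICKIT LISTING columns used elsewhere in this guide, with LISTID as both the distribution key and sort key, might look like this (column types and constraints are assumptions):
create table listing(
  listid integer not null distkey sortkey,
  sellerid integer not null,
  eventid integer not null,
  dateid smallint not null,
  numtickets smallint not null,
  priceperticket decimal(8,2),
  totalprice decimal(8,2),
  listtime timestamp,
  primary key(listid),
  foreign key(sellerid) references users(userid),
  foreign key(eventid) references event(eventid),
  foreign key(dateid) references date(dateid));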
In the following query plan, the Merge Join step for the join on SALES and LISTING shows
DS_DIST_NONE, which indicates that no redistribution is required for the step. However, moving up the
query plan, the other inner joins show DS_BCAST_INNER, which indicates that the inner table is broadcast
as part of the query execution. Because only one pair of tables can be collocated using key distribution,
five tables need to be rebroadcast.
QUERY PLAN
XN Merge  (cost=1015345167117.54..1015345167544.46 rows=1000 width=103)
  Merge Key: category.catname, sum(sales.pricepaid)
  ->  XN Network  (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
        Send to leader
        ->  XN Sort  (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
              Sort Key: category.catname, sum(sales.pricepaid)
              ->  XN HashAggregate  (cost=15345150568.37..15345152276.08 rows=170771 width=103)
                    Filter: (sum(pricepaid) > 9999.00)
                    ->  XN Hash Join DS_BCAST_INNER  (cost=742.08..15345146299.10 rows=170771 width=103)
                          Hash Cond: ("outer".catid = "inner".catid)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=741.94..15342942456.61 rows=170771 width=97)
                                Hash Cond: ("outer".dateid = "inner".dateid)
                                ->  XN Hash Join DS_BCAST_INNER  (cost=737.38..15269938609.81 rows=170766 width=90)
                                      Hash Cond: ("outer".buyerid = "inner".userid)
                                      ->  XN Hash Join DS_BCAST_INNER  (cost=112.50..3272334142.59 rows=170771 width=84)
                                            Hash Cond: ("outer".venueid = "inner".venueid)
                                            ->  XN Hash Join DS_BCAST_INNER  (cost=109.98..3167290276.71 rows=172456 width=47)
                                                  Hash Cond: ("outer".eventid = "inner".eventid)
                                                  ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                                                        Merge Cond: ("outer".listid = "inner".listid)
                                                        ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                                                        ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
                                                  ->  XN Hash  (cost=87.98..87.98 rows=8798 width=25)
                                                        ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=25)
                                            ->  XN Hash  (cost=2.02..2.02 rows=202 width=41)
                                                  ->  XN Seq Scan on venue  (cost=0.00..2.02 rows=202 width=41)
                                      ->  XN Hash  (cost=499.90..499.90 rows=49990 width=14)
                                            ->  XN Seq Scan on users  (cost=0.00..499.90 rows=49990 width=14)
                                ->  XN Hash  (cost=3.65..3.65 rows=365 width=11)
                                      ->  XN Seq Scan on date  (cost=0.00..3.65 rows=365 width=11)
                          ->  XN Hash  (cost=0.11..0.11 rows=11 width=10)
                                ->  XN Seq Scan on category  (cost=0.00..0.11 rows=11 width=10)
One solution is to recreate the tables with DISTSTYLE ALL. You cannot change a table's distribution
style after it is created. To recreate tables with a different distribution style, use a deep copy.
First, rename the tables.
alter table users rename to userscopy;
alter table venue rename to venuecopy;
alter table category rename to categorycopy;
alter table date rename to datecopy;
alter table event rename to eventcopy;
Run the following script to recreate USERS, VENUE, CATEGORY, DATE, EVENT. Don't make any
changes to SALES and LISTING.
create table users(
userid integer not null sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean,
primary key(userid)) diststyle all;
Insert the data back into the tables and run an ANALYZE command to update the statistics.
insert into users select * from userscopy;
insert into venue select * from venuecopy;
insert into category select * from categorycopy;
insert into date select * from datecopy;
insert into event select * from eventcopy;
analyze;
Run the same query with EXPLAIN again, and examine the new query plan. The joins now show
DS_DIST_ALL_NONE, indicating that no redistribution is required because the data was distributed to
every node using DISTSTYLE ALL.
QUERY PLAN
XN Merge  (cost=1000000047117.54..1000000047544.46 rows=1000 width=103)
  Merge Key: category.catname, sum(sales.pricepaid)
  ->  XN Network  (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
        Send to leader
        ->  XN Sort  (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
              Sort Key: category.catname, sum(sales.pricepaid)
              ->  XN HashAggregate  (cost=30568.37..32276.08 rows=170771 width=103)
                    Filter: (sum(pricepaid) > 9999.00)
                    ->  XN Hash Join DS_DIST_ALL_NONE  (cost=742.08..26299.10 rows=170771 width=103)
                          Hash Cond: ("outer".buyerid = "inner".userid)
                          ->  XN Hash Join DS_DIST_ALL_NONE  (cost=117.20..21831.99 rows=170766 width=97)
                                Hash Cond: ("outer".dateid = "inner".dateid)
                                ->  XN Hash Join DS_DIST_ALL_NONE  (cost=112.64..17985.08 rows=170771 width=90)
                                      Hash Cond: ("outer".catid = "inner".catid)
                                      ->  XN Hash Join DS_DIST_ALL_NONE  (cost=112.50..14142.59 rows=170771 width=84)
                                            Hash Cond: ("outer".venueid = "inner".venueid)
                                            ->  XN Hash Join DS_DIST_ALL_NONE  (cost=109.98..10276.71 rows=172456 width=47)
                                                  Hash Cond: ("outer".eventid = "inner".eventid)
                                                  ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                                                        Merge Cond: ("outer".listid = "inner".listid)
                                                        ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                                                        ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
                                                  ->  XN Hash  (cost=87.98..87.98 rows=8798 width=25)
                                                        ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=25)
                                            ->  XN Hash  (cost=2.02..2.02 rows=202 width=41)
                                                  ->  XN Seq Scan on venue  (cost=0.00..2.02 rows=202 width=41)
                                      ->  XN Hash  (cost=0.11..0.11 rows=11 width=10)
                                            ->  XN Seq Scan on category  (cost=0.00..0.11 rows=11 width=10)
                                ->  XN Hash  (cost=3.65..3.65 rows=365 width=11)
                                      ->  XN Seq Scan on date  (cost=0.00..3.65 rows=365 width=11)
                          ->  XN Hash  (cost=499.90..499.90 rows=49990 width=14)
                                ->  XN Seq Scan on users  (cost=0.00..499.90 rows=49990 width=14)
Distribution Examples
The following examples show how data is distributed according to the options that you define in the
CREATE TABLE statement.
DISTKEY Examples
Look at the schema of the USERS table in the TICKIT database. USERID is defined as the SORTKEY
column and the DISTKEY column:
select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'users';
   column   |          type          | encoding | distkey | sortkey
------------+------------------------+----------+---------+---------
 userid     | integer                | none     | t       |       1
 username   | character(8)           | none     | f       |       0
 firstname  | character varying(30)  | text32k  | f       |       0
...
USERID is a good choice for the distribution column on this table. If you query the SVV_DISKUSAGE
system view, you can see that the table is very evenly distributed:
select slice, col, num_values, minvalue, maxvalue
from svv_diskusage
where name='users' and col =0
order by slice, col;
 slice | col | num_values | minvalue | maxvalue
-------+-----+------------+----------+----------
     0 |   0 |      12496 |        4 |    49987
     1 |   0 |      12498 |        1 |    49988
     2 |   0 |      12497 |        2 |    49989
     3 |   0 |      12499 |        3 |    49990
(4 rows)
The table contains 49,990 rows. The num_values column shows the number of rows on each of the four
slices. Each slice contains almost the same number of rows.
This example demonstrates distribution on a small test system. The total number of slices is typically
much higher.
If you create a new table with the same data as the USERS table, but you set the DISTKEY to the STATE
column, the distribution will not be as even. Slice 0 (13,587 rows) holds approximately 30% more rows
than slice 3 (10,150 rows). In a much larger table, this amount of distribution skew could have an adverse
impact on query processing.
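A sketch of creating such a table with CREATE TABLE AS, using an illustrative table name:
create table users_by_state
distkey(state)
as select * from users;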
If you instead use EVEN distribution, the rows are spread uniformly across the slices. However, because
distribution is not based on a specific column, query processing can be degraded, especially if the table
is joined to other tables. The lack of distribution on a joining column often influences the type of join
operation that can be performed efficiently. Joins, aggregations, and grouping operations are optimized
when both tables are distributed and sorted on their respective joining columns.