Greenplum SQL & Performance Tuning
Course Introduction
Greenplum Confidential
Objectives
Review Greenplum Shared Nothing Architecture & Concepts
By implementing a Greenplum Data Mart students will
Create a Greenplum Database and Schemas
Evaluate Logical and Physical Data Models
Determine appropriate Distribution Keys and create tables
Determine and implement partitioning and indexing strategies
Determine and implement optimal load methodologies
By fulfilling reporting requirements students will
Be able to Explain the Explain Plan
Create and use statistics
Create detail and summary reports using OLAP functions
Create and use temporary tables
Write basic PostgreSQL functions and create data types
Identify and optimize problem queries
Prerequisites
Greenplum Fundamentals Training Course
Experience with Unix or Linux
Experience with SQL (PostgreSQL experience is a plus!)
Experience with VI or EMACS
Experience with Shell Scripting
Module 1
Greenplum Architecture Review
Greenplum Architecture
MPP = Massively Parallel Processing
Multiple units of parallelism working on the same task
Parallel Database Operations
Parallel CPU Processing
Greenplum Units of Parallelism are Segments
Shared Nothing Architecture
Segments only operate on their portion of the data
With Greenplum, each segment is a separate PostgreSQL database
Segments are self-sufficient
Dedicated CPU Processes
Dedicated storage that is only accessible by the segment
Greenplum Architecture
Master Host, Segment Hosts and
Segment Instances
Interconnect
How does Architecture Relate to Queries?
Data is spread across the entire system, so:
Good queries reduce the amount of data motion between segments
Good physical implementation allows queries to easily eliminate rows
Good data distribution reduces skewing
The Greenplum database has many high availability features
High availability may incur a performance cost
Queries and Data Manipulation (DML) operations must be optimized with the architecture in mind.
Module 1
LAB
Greenplum Architecture Review
Module 2
Data Modeling,
Databases & Schemas
Data Models
Logical Data Model
Graphical representation of the business requirements
Contains the things of importance in an organization and how they relate to one another
Contains business textual definitions and examples
Enhanced Logical Data Model
Gathers additional information regarding objects in the LDM
Physical Data Model
How the Logical Model is implemented
Data Modeling Terms
Entity
Can exist on its own
Can be uniquely identified
Attribute
Defines a property of the entity
Relationship
How entities relate to each other
Primary Key
A single attribute, or combination of attributes, that uniquely identifies the entity.
Types of Logical Data Models
Dimensional Model (Star Schema)
- Easiest for users to understand
- Single layer of dimensions
- May have data redundancy
Dimensional Model Star Schema
Dimension
Known entry point to a Fact
A defined attribute of a Fact
Simple Primary Key
Fact
Measurable
Relates to a specific instance
May have compound Primary Key
Types of Logical Data Models
Snowflake Schema
- Greater reporting ability
- Can break up large dimensions into manageable chunks
- May have redundant data
Snowflake Schema
Types of Logical Data Models
Third Normal Form
- Most flexible option
- No redundant data
- Most difficult for users to use
Third Normal Form (3NF) Model
Data Warehousing & Models
Star and Snowflake Schemas are most common in Data Warehouses
Fast retrieval of large numbers of rows
Easier to aggregate
Intuitive to end users
Enhanced Logical Data Model
Provides more detail than the Logical Model
Defines the cardinality of the data attributes
Defines the anticipated initial data volume (in rows)
Defines the anticipated data growth over time
Enhanced Logical Data Model
Entity: Store
Source: SOLS (Stores On-Line System) stores table
Row Count at Source: 35
Anticipated Annual Growth Rate: 100

Attribute      Source   Uniqueness   Data Type       Data Example
Store ID       Stores   Unique       Small Integer   1, 2 ... N
Store Name     Stores   Non-unique   Character       Toms Groceries
Has Pharmacy   Stores   Non-unique   Character       T, F
Physical Data Model
Table
The physical manifestation of an entity
Columns
How the attributes of the entity are physically represented
Constraints
Foreign Key (disabled in Greenplum 3.1)
Uniqueness
Primary Key
Null / Not Null
Check Values
Logical to Physical: Logical Data Model
Dimensions
Store
Country
Customer
Fact
Transaction
Logical to Physical: Physical Data Model
Notes:
Entities become tables
Attributes become columns
Relationships may become foreign key constraints
Data Segregation
Database
May have multiple in a Greenplum system
Greenplum databases do not share data
Schema
A logical grouping of database objects within a database
Views
All or some table columns may be seen
Allows for user defined columns
- Instantiated at run time
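A minimal sketch of the view idea (table and column names are hypothetical):

```sql
-- A view exposing some table columns plus a user-defined (computed) column;
-- the view is instantiated at query run time, not stored on disk.
CREATE VIEW store_summary AS
SELECT store_id,
       store_name,
       UPPER(store_name) AS store_name_upper  -- user-defined column
FROM store;
```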
About Greenplum Databases
A Greenplum Database system may have
multiple databases
Clients connect to one database at a time
Default database template: template1
Other database templates
- template0
- postgres
NEVER DROP OR CHANGE TEMPLATE DATABASES!
Database SQL Commands
To Create: CREATE DATABASE or createdb
To Drop: DROP DATABASE or dropdb
To Modify: ALTER DATABASE
Change database name
Assign to a new owner
Set configuration parameters
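For example (database and role names are hypothetical):

```sql
CREATE DATABASE salesdb;                                   -- or: createdb salesdb
ALTER DATABASE salesdb RENAME TO sales_dw;                 -- change database name
ALTER DATABASE sales_dw OWNER TO dw_owner;                 -- assign to a new owner
ALTER DATABASE sales_dw SET search_path TO sales, public;  -- set a configuration parameter
DROP DATABASE sales_dw;                                    -- or: dropdb sales_dw
```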
PSQL Database Tips
PSQL prompt shows what database you are in.
EXAMPLE: template1=# (super user)
template1=> (non-super user)
To show a list of all databases:
\l (a lower case L)
To connect to a different database:
\c db_name
BEST PRACTICE: Use the PGDATABASE environment variable in the .bashrc
shell script to set the default database!
About Schemas
Schemas are a way to logically organize objects within a
database (tables, views, functions, etc)
Use fully qualified names to access objects in a schema
EXAMPLE: [Link]
Every database has a public schema
Other system level schemas include: pg_catalog, information_schema, pg_toast, pg_bitmapindex
Schema SQL Commands
To Create: CREATE SCHEMA
To Drop: DROP SCHEMA
To Modify: ALTER SCHEMA
Change schema name
Assign to a new owner
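For example (schema and role names are hypothetical):

```sql
CREATE SCHEMA sales;                   -- create a new schema
ALTER SCHEMA sales RENAME TO retail;   -- change schema name
ALTER SCHEMA retail OWNER TO dw_owner; -- assign to a new owner
DROP SCHEMA retail;                    -- add CASCADE to also drop contained objects
```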
PSQL Schema Tips
To see the current schema
select current_schema();
To see a list of all schemas in the database
\dn
To see the schema search path
show search_path;
Default Schemas
public schema
No objects at database creation
Generally used to store public objects
pg_catalog
This is the data dictionary
information_schema
This schema contains views and tables that make the data dictionary easier to understand with fewer joins.
pg_bitmapindex
Used to store bitmap index objects
TOAST Schema
This schema is used specifically to store large attribute objects (> 1 page of data)
TOAST = The Oversized Attribute Storage Technique
Module 2
LAB
Data Modeling,
Databases & Schemas
Module 3
Physical Design Decisions
Key Design Considerations
Data Distribution
Selecting the best distribution key
Checking for Data Skew
To Partition or not to Partition
Data Types
Select the smallest type that will store the data
Constraints
Table
Column
Primary Key VS Distribution Key
A primary key is a logical model concept which allows each
row to be uniquely identified
A distributed by key is a physical Greenplum database
concept which determines how a row is physically stored
A well designed Greenplum schema could have a large
percentage of tables where the physical distributed by key is
different from the logical primary key (tables with common
distributed by keys provide faster join operations by
reducing number of motion files)
Key Comparison
Primary Key | Distribution Key
Logical concept of data modeling | Physical mechanism for storage
Optional for Greenplum (becomes the distribution key if specified) | Required for every Greenplum table (first column if not specified in DISTRIBUTED BY)
No limit to number of columns | Limits apply (see distribution key considerations)
Documented in data model | Defined in CREATE TABLE command
Must be unique | Can be non-unique
Uniquely identifies each row | Can uniquely or non-uniquely identify each row
Value can not be changed | Value can not be changed
Can not be NULL | Can be NULL
Does not define an access path (DBMS independent) | Defines a storage path
Chosen for logical correctness | Chosen for distribution and performance
Data Distribution
CREATE TABLE tablename (
  column_name1 data_type NOT NULL,
  column_name2 data_type NOT NULL,
  column_name3 data_type NOT NULL DEFAULT default_value,
  ...)
[DISTRIBUTED BY (column_name)]   -- hash algorithm
OR
[DISTRIBUTED RANDOMLY]           -- round-robin algorithm
Every table in the Greenplum database has a distribution method
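For example (table and column names are hypothetical):

```sql
-- Hash distribution on a high-cardinality key:
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_name        VARCHAR(50)
) DISTRIBUTED BY (c_customer_id);

-- Round-robin distribution for a table with no suitable key:
CREATE TABLE audit_log (
    message TEXT
) DISTRIBUTED RANDOMLY;
```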
Distribution Key Considerations
A distribution key can not be changed or altered
after a table is created
If a table has a unique constraint it must be
declared as the distribution key
User defined data types and geometric data
types are not eligible as distribution keys
Default Distributions Not Recommended
If a table has a primary key and a distribution
key is NOT specified, by default the primary key
will be used as the distribution key
If the table does not have a primary key and a
distribution key is NOT specified, then the first
eligible column of the table will be used as the
distribution key
User defined data types and geometric data types are
not eligible as distribution keys
If a table does not have an eligible column then a
random distribution is used
Distributions and Performance
[Diagram: six segment instances (0-5), each with its own network, disk I/O, and CPU]
Response time is the completion time for ALL segment instances in a Greenplum database
To optimize performance use a hash distribution method (DISTRIBUTED BY) that distributes data evenly across all segment instances
Hash Distributions and Data Skew
[Diagram: rows distributed on Gender = M or F land on only two of six segment instances, skewing disk I/O and CPU]
Select a distribution key with unique values and high cardinality
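The effect of cardinality on hash distribution can be sketched in Python (segment count and column names are illustrative; Greenplum uses its own hash functions, not Python's):

```python
from collections import Counter

def segment_counts(rows, key, num_segments=6):
    """Simulate hash distribution: each row lands on hash(key) % num_segments."""
    return Counter(hash(key(r)) % num_segments for r in rows)

# Hypothetical customer rows: (customer_id, gender)
rows = [(i, 'M' if i % 2 else 'F') for i in range(10000)]

by_gender = segment_counts(rows, key=lambda r: r[1])  # low cardinality: 2 values
by_id     = segment_counts(rows, key=lambda r: r[0])  # high cardinality: 10000 values

print(len(by_gender), "segment(s) hold data when distributed by gender")
print(len(by_id), "segment(s) hold data when distributed by customer_id")
```

Distributing on gender crowds all rows onto at most two segments; distributing on a high-cardinality id spreads them across all six.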
Hash Distributions and Processing Skew
[Diagram: rows distributed on a DATE column (JAN through JUN) spread evenly across segment instances, but queries on one month hit a single instance's network, disk I/O, and CPU]
Using a DATE column as the distribution key may distribute rows evenly across segment instances but may result in processing skew
Select a distribution key that will not result in processing skew
Use the Same Distribution Key for
Commonly Joined Tables
[Diagram: on each segment host, the customer (c_customer_id) and freq_shopper (f_customer_id) rows for the same customer are co-located on the same segment instance]
Optimize for local joins! Distribute on the same key used in the join (WHERE clause): customer.c_customer_id = freq_shopper.f_customer_id
Avoid Redistribute Motion for Large
Tables
[Diagram: customer is distributed by c_customer_id but freq_shopper by f_trans_number, so rows for customer_id = 102 and customer_id = 745 live on different segment instances]
WHERE customer.c_customer_id = freq_shopper.f_customer_id
The freq_shopper table is dynamically redistributed on f_customer_id
Avoid Broadcast Motion for Large Tables
[Diagram: customer is distributed by c_customer_id; the small state table (s_statekey: AK, AL, AZ, CA, ...) is copied in full to every segment instance]
WHERE customer.c_statekey = state.s_statekey
The state table is dynamically broadcast to all segment instances
Commonly Joined Tables Use the Same
Data Type for Distribution Keys
customer (c_customer_id)
745::int
freq_shopper (f_customer_id)
745::varchar(10)
Values might appear the same but they are
stored differently at the disk level
Values might appear the same but they HASH
to different values
Resulting in like rows being stored on different
segment instances
Requiring a redistribution before the tables can be
joined
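In Python terms (purely illustrative; Greenplum's hash functions differ from Python's, but the principle is the same):

```python
# The same logical value stored as different types is not equal across
# types and hashes differently, so "matching" rows would land on
# different segment instances.
int_key = 745        # e.g. customer.c_customer_id stored as int
str_key = '745'      # e.g. freq_shopper.f_customer_id stored as varchar

print(int_key == str_key)  # False: not even equal across types
# Segment assignment under a 6-segment hash distribution:
print(hash(int_key) % 6, hash(str_key) % 6)
```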
DISTRIBUTED RANDOMLY
Uses a round robin algorithm
Any query that joins to a table that is
distributed randomly will require a
redistribute motion or a broadcast motion
Acceptable for small tables
For example dimension tables
Not acceptable for large tables
For example fact tables
Distribution Strategies to Avoid
Do not generate or fabricate columns in order
to create a unique distribution key simply to avoid
skew
DISTRIBUTED BY (col1, col2, col3)
if col1, col2, col3 is never used in joins
Homegrown serial key algorithm
Do not distribute on boolean data types
Do not distribute on floating point data types
Do not DISTRIBUTE RANDOMLY because it is
the easy choice!
Check for Data Skew
Use gpskew to check for data skew
gpskew -t [Link] -a database_name
For a partitioned table use the gpskew -h option
gpskew -t [Link] -a database_name -h
gpskew Example
gpadmin@mdw > gpskew -t [Link]
...
. . . Number of Distribution Column(s)
= {1}
. . . Distribution Column Name(s)
= cust_id
2008[Link]gpskew:mdw:gpadmin-[INFO]
. . . Total records (inc child tables)
= 5277811399
...
2008[Link]gpskew:mdw:gpadmin-[INFO]:-Skew result
. . . Maximum Record Count
= 66004567
. . . Segment instance hostname
= gpdbhost
. . . Segment instance name
= dw100
. . . Minimum Record Count
= 65942375
. . . Segment instance hostname
= gpdbhost
. . . Segment instance name
= dw100
. . . Record count variance
= 62192
Redistribute Using CREATE TABLE LIKE
To change the distribution of a table, use the CREATE TABLE LIKE statement and specify the new distribution key:
CREATE TABLE customer_tmp (LIKE customer)
DISTRIBUTED BY (customer_id);
INSERT INTO customer_tmp SELECT * FROM customer;
DROP TABLE customer;
ALTER TABLE customer_tmp RENAME TO customer;
Why Partition a Table?
Provide more efficiency in querying against a subset of large volumes of
transactional detail data as well as to manage this data more effectively
Businesses have recognized the analytic value of detailed transactions and are
storing larger and larger volumes of this data
Increase query efficiency by avoiding full table scans
- without the overhead and maintenance costs of an index for date
As the retention volume of detailed transactions increases, the percent of
transactions that the average query requires for execution decreases
Allow instantaneous dropping of old data and simple addition of
new data
Support a rolling n periods methodology for transactional data
Partitions in Greenplum
A mechanism in Greenplum for use in physical database design
Increases the available options to improve the performance of a
certain class of queries
Only the rows of the qualified partitions (child table) in a query need to be
accessed
Two types of partitioning:
Range partitioning (division of data based on a numerical range, such as
date or price)
List partitioning (division of data based on a list of values, such as state or
region)
Partitioning in Greenplum works using table inheritance and CHECK
table constraints
Partitioned tables are also distributed
Candidates for Partitioning
Any table with large data volumes where queries commonly select using a date range, or that has roll-off requirements
Examples:
- Fact tables with dates
- Transaction Tables with dates
- Activity Tables with dates
- Statement Tables with numeric statement periods (YYYYMM)
- Financial Summary Tables with end of month dates
Partitioning Column Candidates
Recommended and simple:
Dates using date expressions
- Trade date
- Order date
- Billing date
- End of Month date
Numeric periods that are integers
- Year and Month
- Cycle run date
- Julian date
Greenplum Table Partitioning
Greenplum partitions tables in the CREATE
TABLE DDL statement.
Partitions may be by range (date, id, etc.)
Partitions may be by value list (Male, Female, etc.)
Partition Operations include:
CREATE TABLE PARTITION BY
ALTER TABLE
ADD PARTITION
DROP PARTITION
RENAME PARTITION
EXCHANGE PARTITION
Partitioned Table Example
DDL to Create Table (prior slide)
CREATE TABLE [Link] ( id INT,rank INT, year INT, gender CHAR(1), count
INT )
DISTRIBUTED BY ( id, gender, year )
PARTITION BY LIST ( gender )
SUBPARTITION BY RANGE ( year )
SUBPARTITION TEMPLATE
( SUBPARTITION "2001" START ( 2001 ) INCLUSIVE,
SUBPARTITION "2002" START ( 2002 ) INCLUSIVE,
SUBPARTITION "2003" START ( 2003 ) INCLUSIVE,
SUBPARTITION "2004" START ( 2004 ) INCLUSIVE,
SUBPARTITION "2005" START ( 2005 ) INCLUSIVE END ( 2006 ) EXCLUSIVE )
( PARTITION females VALUES ( 'F' ),
PARTITION males VALUES ( 'M' ) );
Table Partitioning Best Practices
Use table partitioning on large distributed tables
to improve query performance
Table partitioning is not a substitute for table
distribution
If the table can be divided into roughly equal parts based on a defining criterion
- For example, range partitioning on date
And the defining partitioning criterion matches the access pattern used in query predicates
- WHERE date = '1/30/2008'
No overlapping ranges or duplicate list values
Table Partitioning Example
[Diagram: segment hosts with segment instances 0 through 51; on every instance the customer table (c_customer_id, week_number) holds child partitions for week_number = 200701 through 200752]
Distributed by c_customer_id
Partitioned by week_number
SELECT . . . FROM customer WHERE week_number = 200702
Partition Elimination
SELECT . . . FROM customer
WHERE week_number = 200702
Segment
Instance 0
customer
(c_customer_id)
(week_number)
week_number = 200701
week_number = 200702
week_number = 200703
week_number = 200704
week_number = 200705
week_number = 200706
week_number = 200707
week_number = 200708
week_number = 200709
...
week_number = 200750
week_number = 200751
week_number = 200752
The primary goal of table partitioning is to achieve partition elimination
Only the 200702 week partition is scanned
The other 51 partitions on the segment instance are eliminated from the query
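Partition elimination can be sketched as a lookup that consults partition metadata before touching any data (a toy model, not the Greenplum planner):

```python
# Toy model: 52 weekly partitions, each holding a few rows.
partitions = {week: [("row-%d" % i, week) for i in range(3)]
              for week in range(200701, 200753)}

def scan(week_predicate):
    # Elimination step: consult partition metadata first,
    # then touch only the matching child tables.
    qualified = [w for w in partitions if w == week_predicate]
    rows = [row for w in qualified for row in partitions[w]]
    return qualified, rows

qualified, rows = scan(200702)
print(len(partitions), "partitions,", len(qualified), "scanned,", len(rows), "rows read")
```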
CREATING PARTITIONED TABLES
CREATE TABLE customer (
  c_customer_id INT,
  week_number   INT,
  ...
) DISTRIBUTED BY (c_customer_id)
PARTITION BY RANGE ( week_number )
(START (200701) END (200752) INCLUSIVE);
Maintaining Rolling Windows
[Diagram: segment hosts with segment instances 0, 1, and 51; on every instance the customer table (c_customer_id, week_number) now holds child partitions for week_number = 200702 through 200801]
Use ALTER TABLE to add a new child table (week_number=200801)
Use ALTER TABLE to delete an old child table (week_number=200701)
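A sketch of the two ALTER TABLE operations (range-partition syntax as in Greenplum's partition DDL; exact clauses may vary by release):

```sql
-- Add the new week at the top of the rolling window:
ALTER TABLE customer ADD PARTITION
    START (200801) INCLUSIVE END (200802) EXCLUSIVE;

-- Drop the oldest week off the bottom:
ALTER TABLE customer DROP PARTITION FOR (200701);
```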
Loading Partitioned Tables
Data may be loaded into the parent table
- Parent table contains no data
- By default data is not automatically inserted into the correct child tables
- Use rewrite rules
Data may be loaded into the child tables
- For the best load performance load directly into child tables
- OR load into a clone of the child table and EXCHANGE partitions
[Diagram: segment instance 0 with the customer parent table and child tables for week_number = 200701 through 200752, loaded from NFS]
Loading Child Tables
Data must be partitioned to ensure it is loaded into the correct child table
- Manually partition the data files
- OR use external tables to filter the data:
INSERT INTO customer_200701
SELECT * FROM ext_customer
WHERE week_number = 200701;
[Diagram: segment instance 0 with the customer parent table and child tables for week_number = 200701 through 200752]
Partitioned Table with No Sub-Partitions
Heterogeneous Example
CREATE TABLE orders
( order_id     NUMERIC(12),
  order_date   TIMESTAMP WITH TIME ZONE,
  order_mode   VARCHAR(8),
  customer_id  NUMERIC(6),
  order_status NUMERIC(2),
  order_total  NUMERIC(8,2),
  sales_rep_id NUMERIC(6),
  promotion_id NUMERIC(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
PARTITIONS 100
(
starting '2005-12-01'
ending '2007-12-01' every 3 months,
ending '2008-12-01' every 4 weeks,
ending '2008-12-03' every 6 hours,
ending '2008-12-04' every 2 secs
);
Greenplum Release
Multi-level Partition example
CREATE TABLE orders
( order_id     NUMERIC(12),
  order_date   TIMESTAMP WITH TIME ZONE,
  order_mode   VARCHAR(8),
  customer_id  NUMERIC(6),
  order_status NUMERIC(2),
  order_total  NUMERIC(8,2),
  sales_rep_id NUMERIC(6),
  promotion_id NUMERIC(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
SUBPARTITION BY HASH (customer_id)
SUBPARTITIONS 8
(
partition minny,
starting '2004-12-01' ending '2006-12-01',
partition maxy
);
What is the difference between
heterogeneous and multi-level partitions?
Heterogeneous partitions: all partitions use the same column (for example, a date), but individual partitions are defined at different granularities (daily, weekly, monthly).
Multi-level partitions: partitions are nested in levels based on different columns.
Data Type Best Practices
Use data types to constrain your column data
Use character types to store strings
Use date or timestamp types to store dates
Use numeric types to store numbers
Choose the type that uses the least space
Don't use a BIGINT when an INTEGER will work
Use identical data types for columns used in
join operations
Converting data types prior to joining is resource
intensive
Greenplum Data Types
Greenplum Data Types (continued)
More Greenplum Data Types
Table Constraints
Primary Key Constraint *
EXAMPLE: CONSTRAINT mytable_pk PRIMARY KEY (mycolumn)
With OIDs
EXAMPLE: WITH (OIDS=TRUE)
This is used when updating or deleting rows via an ODBC
connection from another database. (Oracle, SQL Server, etc.)
* If a table has a primary key, this column (or group of columns) is chosen
as the distribution key for the table. If a table has a primary key, then it
can not have other columns with a unique constraint.
Column Constraints
Check Constraints
EXAMPLE:
price_amt DECIMAL(10,2) CHECK (price_amt > 0.00)
Not Null Constraint
EXAMPLE:
customer_id INTEGER NOT NULL
Uniqueness Constraint *
EXAMPLE:
customer_id INTEGER UNIQUE
* Greenplum Database has some special conditions for unique constraints. A
table can only have one unique constraint and the unique column (or set of
columns) must also be declared as the distribution key for the table.
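For example (hypothetical table), the unique column must also be the distribution key:

```sql
CREATE TABLE customer (
    customer_id INTEGER UNIQUE,  -- the one allowed unique constraint
    name        TEXT
) DISTRIBUTED BY (customer_id);  -- must match the unique column
```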
Module 3
LAB
Physical Design Decisions
Module 4
Data Loading
ETL versus ELT
ETL - Extract, Transform & Load
Extract data from one or more source systems
Transform data using the source system, an ETL server, or an ETL tool
(e.g. Informatica or DataStage) as required by business rules
Load data into the data warehouse.
The Greenplum Way - ELT
Extract data from one or more source systems
Load data in a parallel manner via Greenplum External Table
into stage or temporary table in the MPP data warehouse
Transform data in the warehouse in a parallel manner and load into target tables.
Loading Data - Greenplum Methods
External Tables
File Based
Web Based
gpfdist - Massively Parallel Load Tool
gpload
SQL Copy Command
ETL Tool
How to do ELT with a tool
Loading with External Tables
What are external tables?
read-only
data resides outside database
SELECT, JOIN, SORT, etc. just like a regular table
Advantages:
Parallelism
ELT flexibility
Error handling of badly formatted rows
Multiple data sources
Two Types:
External Tables: File-based
Web Tables: URL or Command-based
File-Based External Tables
Two Access Protocols
file
gpfdist
File Formats
text
csv
Example External Table Definition
CREATE EXTERNAL TABLE ext_expenses (name text, date date, amount float4, category text,
description text)
LOCATION (
'[Link]
'[Link]
'[Link]
'[Link]
'[Link]
'[Link]
)
FORMAT 'CSV' ( HEADER );
Parallel File Distribution Program
(gpfdist)
Ensures full parallelism for the best performance
Can be run on a server outside the Greenplum array
C program - uses HTTP protocol
200 MB/s data distribution rate per gpfdist
Configuration parameter: gp_external_max_segs
Example Start Commands:
gpfdist -d /var/load_files/expenses -p 8081 >> [Link] 2>&1 &
gpfdist -d /var/load_files/expenses -p 8082 >> [Link] 2>&1 &
Example External Table Definition:
CREATE EXTERNAL TABLE ext_expenses
( name text, date date, amount float4, description text )
LOCATION ('gpfdist://etlhost:8081/*','gpfdist://etlhost:8082/*')
FORMAT 'TEXT' (DELIMITER '|')
ENCODING 'UTF8'
LOG ERRORS INTO ext_expenses_loaderrors
SEGMENT REJECT LIMIT 10000 ROWS ;
Parallel File Distribution Program
(gpfdist)
About External and Web Tables
Two Types
External Tables
- Greenplum file server (gpfdist)
- File-based ([Link]
Web Tables
- URL-based ([Link]
- Output-based (OS commands, scripts)
Used for ETL and Data Loading
Data accessed at runtime by segments (in
parallel)
Read-only
External Table SQL Syntax
CREATE EXTERNAL TABLE table_name (column_name data_type [, ...])
LOCATION ('[Link] [, ...])
| ('gpfdist://filehost[:port]/file_pattern' [, ...])
FORMAT 'TEXT | CSV'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF']
[HEADER]
[QUOTE [AS] 'quote']
[FORCE NOT NULL column [, ...] )]
[ ENCODING 'encoding' ]
[ [LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS | PERCENT] ]
DROP EXTERNAL TABLE table_name;
External Web Tables
Dynamic data - not rescannable
Two Forms:
URLs
Commands or scripts
- executed on specified number of segment instances
- can disable by setting: gp_external_enable_exec=off
Output Data Formats
text
csv
Example Web Table Definitions
CREATE EXTERNAL WEB TABLE ext_expenses (name text, date date, amount float4, description text)
LOCATION ( '[Link]
'[Link]
'[Link] )
FORMAT 'CSV' ( HEADER );
CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
EXECUTE '/var/load_scripts/get_log_data.sh'
ON HOST
FORMAT 'TEXT' (DELIMITER '|');
CREATE EXTERNAL WEB TABLE du_space (storage text)
EXECUTE 'du -sh'
ON ALL
FORMAT 'TEXT';
External Table Error Handling
Load good rows and catch poorly formatted rows, such as:
rows with missing or extra attributes
rows with attributes of the wrong data type
rows with invalid client encoding sequences
Does not apply to constraint errors:
PRIMARY KEY, NOT NULL, CHECK or UNIQUE constraints
Optional error handling clause for external tables:
[LOG ERRORS INTO error_table] SEGMENT REJECT LIMIT count [ROWS | PERCENT]
( PERCENT based on gp_reject_percent_threshold parameter )
Example:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.txt' )
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5 ROWS;
External Tables and Planner Statistics
Data resides outside the database
No database statistics for external table data
Not meant for frequent or ad-hoc access
Can manually set rough statistics in pg_class:
UPDATE pg_class
SET reltuples=400000, relpages=400
WHERE relname='myexttable';
gpload
Runs a load job as defined in a YAML formatted control file
gpload is a data loading utility that acts as an interface to the external
table parallel loading feature. Using a load specification defined in a
YAML formatted control file, gpload executes a load by invoking the
Greenplum parallel file server (gpfdist), creating an external table
definition based on the source data defined, and executing an
INSERT, UPDATE or MERGE operation to load the source data into
the target table in the database.
Usage:
gpload -f control_file [-l log_file] [-h hostname] [-p port] [-U username]
[-d database] [-W] [-v | -V] [-q] [-D] gpload -? | --version
Example:
gpload -f my_load.yml
gpload - Control File Format
The gpload control file uses the YAML 1.1 document format and
then implements its own schema for defining the various steps of
a Greenplum Database load operation. The control file must be a
valid YAML document.
Example load control file:
VERSION: [Link]
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- etl1-1
- etl1-2
- etl1-3
gpload - Control File Format (contd.)
- etl1-4
PORT: 8081
FILE: /var/load/data/*
- COLUMNS:
- name: text
- amount: float4
- category: text
- desc: text
- date: date
- FORMAT: text
- DELIMITER: |
- ERROR_LIMIT: 25
- ERROR_TABLE: payables.err_expenses
OUTPUT:
- TABLE: [Link]
- MODE: INSERT
COPY SQL Command
PostgreSQL command
Optimized for loading a large number of rows
Loads all rows in one command (not parallel)
Loads data from a file or from standard input
Supports error handling, as do external tables
EXAMPLE:
COPY mytable FROM '/data/[Link]' WITH CSV HEADER;
When to use SQL Copy
When scripting a process that is loading small amounts of data (<
10K rows).
One-time data loads.
Use COPY to load multiple rows in one command, instead of using
a series of INSERT commands.
The COPY command is optimized for loading large numbers of
rows; it is less flexible than INSERT, but incurs significantly less
overhead for larger data loads.
Since COPY is a single command, there is no need to disable
autocommit if you use this method to populate a table.
Why use an ETL tool?
ETL tool may be corporate standard
ETL tool may be used to do standard data
quality processes
ETL tool may be used for code standardization (multiple source systems
with different values for the same code, e.g. status code)
ETL tool may be used for householding or new identification keys
(unique customer ids, product ids)
Use ETL tool to provide connectivity options, for
example MVS connectivity.
Architecting ELT Solutions with an
ETL tool
Use ETL tool to create file for external table
Use the ETL tool to execute SQL commands
Use ETL tool for job scheduling of external
table loads
System Generated Keys in Greenplum
Primary Key contains many columns?
Distribution Key rarely used in queries or joins?
Solution: A System Generated Key
Use a Sequence
Use SQL
About Sequences
Greenplum Master sequence generator process (seqserver)
Sequence SQL Commands:
CREATE SEQUENCE
ALTER SEQUENCE
DROP SEQUENCE
Sequence Function Limitations:
lastval and currval not supported
setval cannot be used in queries that update data
nextval not allowed in UPDATEs or DELETEs if mirrors are enabled
nextval may grab a block of values for some queries
PSQL Tips:
To list all sequences while in psql: \ds
To see a sequence definition: \d+ sequence_name
Sequence Example
Create a sequence named myseq
CREATE SEQUENCE myseq START 101;
Insert a row into a table that gets the next value:
INSERT INTO distributors
VALUES (nextval('myseq'), 'acme');
Reset the sequence counter value on the master:
SELECT setval('myseq', 201);
Not allowed in Greenplum DB (set sequence value on segments):
INSERT INTO product
VALUES (setval('myseq', 201), 'gizmo');
NOTE: THERE COULD BE GAPS IN THE SEQUENCE NUMBERS!
Using SQL
Create a temporary table with the current maximum value:
CREATE TEMPORARY TABLE MAXI AS (
SELECT COALESCE(MAX(CustomerID),0) AS MaxID
FROM [Link]
)
DISTRIBUTED BY (MaxID)
;
Join the single-row table with the max value to every row in the
source table and add the rank:
SELECT (RANK() OVER (ORDER BY phone)) + [Link],
custName,
address,
city,
state,
zipcode,
zipPlusFour,
countrycd,
phone
FROM public.customer_external
CROSS JOIN MAXI
;
Voila, no gaps!
Module 4
LAB
Data Loading
ETL versus ELT
Module 5
Explain the Explain Plan
Analyzing Queries
The Greenplum Explain Utility
Allows the user to view how the optimizer will
handle the query.
Shows the estimated cost (in units of disk page
fetches) and estimated rows for each plan node.
EXPLAIN ANALYZE will show the difference
between the estimates and the actual run costs.
EXPLAIN or EXPLAIN ANALYZE
Use EXPLAIN ANALYZE to run the query and
generate query plans and planner estimates
EXPLAIN ANALYZE <query>;
Run EXPLAIN ANALYZE on queries to identify
opportunities to improve query performance
EXPLAIN ANALYZE is also an engineering tool
used by Greenplum developers
Focus on the query plan elements that are relevant
from an end user perspective to improve query
performance
EXPLAIN Output
Query plans are a right-to-left tree of nodes,
read from the bottom up
Each node feeds its results to the node directly
above
There is one line for each node in the plan tree
Each node represents a single operation
Sequential Scan, Hash Join, Hash Aggregation, etc
The plan will also include motion nodes
responsible for moving rows between segment
instances
Redistribute Motion, Broadcast Motion, Gather Motion
EXPLAIN Example
Gather Motion 48:1 (slice1) (cost=14879102.69..14970283.82 rows=607875 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=14879102.69..14970283.82 rows=607875 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=14879102.69..14894299.54 rows=6078742 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..3188311.04 rows=6078742 width=548)
-> Append (cost=0.00..3188311.04 rows=6078742 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
      Filter: local_time >= '2007-03-01 [Link]'::timestamp without time zone AND local_time < '2007-04-01 [Link]'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
-> Seq Scan on display_run_child_2007_03_month b (cost=0.00..3188310.02 rows=6078741 width=50)
      Filter: local_time >= '2007-03-01 [Link]'::timestamp without time zone AND local_time < '2007-04-01 [Link]'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
Begin by Examining Rows and Cost
Aggregate (cost=16141.02..16141.03 rows=1 width=0)
-> Gather Motion 2:1 (slice2) (cost=16140.97..16141.00 rows=1 width=0)
-> Aggregate (cost=16140.97..16140.98 rows=1 width=0)
-> Hash Join (cost=2999.46..15647.43 rows=197414 width=0)
Hash Cond: ps.ps_partkey = part.p_partkey
-> Hash Join (cost=240.46..9920.71 rows=200020 width=4)
Hash Cond: ps.ps_suppkey = supplier.s_suppkey
-> Seq Scan on partsupp ps (cost=0.00..6180.00 rows=400000
width=8)
-> Hash (cost=177.97..177.97 rows=4999 width=4)
-> Broadcast Motion 2:2 (slice1) (cost=0.00..177.97 rows=4999
width=4)
-> Seq Scan on supplier (cost=0.00..77.99 rows=4999
width=4)
-> Hash (cost=1509.00..1509.00 rows=100000 width=4)
-> Seq Scan on part (cost=0.00..1509.00 rows=100000 width=4)
Identify plan nodes where the estimated cost is very high
and the number of rows is very large
(where the majority of time is spent)
Validate Partition Elimination
Gather Motion 48:1 (slice1) (cost=174933650.92..176041040.58 rows=7382598 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=174933650.92..176041040.58 rows=7382598 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=174933650.92..175118215.86 rows=73825977 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..31620003.26 rows=73825977 width=548)
-> Append (cost=0.00..31620003.26 rows=73825977 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_03_month b (cost=0.00..2635000.02 rows=6079950 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_04_month b (cost=0.00..2635000.02 rows=6182099 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_05_month b (cost=0.00..2635000.02 rows=6186057 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_06_month b (cost=0.00..2635000.02 rows=6153744 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_07_month b (cost=0.00..2635000.02 rows=6124871 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_08_month b (cost=0.00..2635000.02 rows=6147722 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_09_month b (cost=0.00..2635000.02 rows=6244890 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_10_month b (cost=0.00..2635000.02 rows=6083077 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_11_month b (cost=0.00..2635000.02 rows=6161248 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
-> Seq Scan on display_run_child_2007_12_month b (cost=0.00..2635000.02 rows=6156930 width=50)
      Filter: url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text
All child partitions are scanned
Partition Elimination
Gather Motion 48:1 (slice1) (cost=14879102.69..14970283.82 rows=607875 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=14879102.69..14970283.82 rows=607875 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=14879102.69..14894299.54 rows=6078742 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..3188311.04 rows=6078742 width=548)
-> Append (cost=0.00..3188311.04 rows=6078742 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
      Filter: local_time >= '2007-03-01 [Link]'::timestamp without time zone AND local_time < '2007-04-01 [Link]'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
-> Seq Scan on display_run_child_2007_03_month b (cost=0.00..3188310.02 rows=6078741 width=50)
      Filter: local_time >= '2007-03-01 [Link]'::timestamp without time zone AND local_time < '2007-04-01 [Link]'::timestamp without time zone AND (url::text <> 'delete'::text OR url::text <> 'estimate time'::text OR url::text <> 'user time'::text)
To eliminate partitions the WHERE clause must
be the same as the partition criteria AND must
contain an explicit value (no subquery)
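As an illustration of this rule, using the monthly partitioning on local_time shown above (the subquery table in the second statement is hypothetical):

```sql
-- Partitions eliminated: explicit values match the partition criteria
SELECT * FROM display_run
WHERE local_time >= '2007-03-01' AND local_time < '2007-04-01';

-- All partitions scanned: the bound comes from a subquery
SELECT * FROM display_run
WHERE local_time >= (SELECT start_dt FROM report_window);
```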
Optimal Plan Heuristics
Fast Operators
Sequential Scan
Hash Join
Hash Agg
Redistribute Motion
Slow Operators
Nested Loop Join
Merge Join
Sort
Broadcast Motion
Eliminate Nested Loop Joins
Gather Motion 24:1 (slice4) (cost=4517.05..214553.33 rows=416845 width=371)
-> Nested Loop (cost=4517.05..214553.33 rows=416845 width=371)
InitPlan (slice6)
-> Gather Motion 24:1 (slice3) (cost=1.09..10.23 rows=1 width=4)
-> Seq Scan on calendar_dim (cost=1.09..10.23 rows=1 width=4)
Filter: calendar_date = $0
InitPlan (slice5)
-> Gather Motion 24:1 (slice2) (cost=0.00..1.09 rows=1 width=8)
-> Seq Scan on event_status (cost=0.00..1.09 rows=1 width=8)
Filter: event_name::text = 'bonusbuy'::text
-> Seq Scan on primary_contact_events (cost=0.00..7866.45 rows=416845 width=109)
-> Materialize (cost=4506.82..4507.06 rows=24 width=262)
-> Broadcast Motion 24:24 (slice1) (cost=0.00..4506.80 rows=24 width=262)
-> Seq Scan on contact_type (cost=0.00..4506.55 rows=1 width=262)
Filter: call_type::text = 'Regional Group'::text AND created_dt::date = $1
Replace nested loop joins with hash joins
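One way to test whether the planner can do better is to disable nested loops for the session; enable_nestloop is a planner setting inherited from PostgreSQL (the query and its join column are illustrative):

```sql
-- Discourage nested loop joins for this session, then re-check the plan
SET enable_nestloop = off;
EXPLAIN
SELECT * FROM primary_contact_events e
JOIN contact_type t ON e.contact_id = t.contact_id;  -- hypothetical join
```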
Replace Large Sorts
Redistribute Motion 48:48 (slice2) (cost=8568413851.73..8683017216.78 rows=83317
width=154)
-> GroupAggregate (cost=8568413851.73..8683017216.78 rows=83317 width=154)
Group By: b."host", "customer_id"
-> Sort (cost=8568413851.73..8591334233.13 rows=9168152560 width=154)
Sort Key: b."host", "customer_id"
-> Redistribute Motion 48:48 (slice1) (cost=2754445.61..1748783807.47
rows=9168152560 width=154)
Hash Key: b."host", "customer_id"
-> Hash Join (cost=2754445.61..1565420756.27 rows=9168152560
width=154)
Hash Cond: a.pack_id = b.pack_id
Join Filter: (a.local_time - b.local_time) >= '-[Link]'::
Replace large sorts with HashAgg
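A hedged sketch of one way to steer the planner toward a hash aggregate (enable_groupagg is a Greenplum planner setting; verify it in your release):

```sql
-- Disfavor sort-based GroupAggregate so the planner considers HashAggregate
SET enable_groupagg = off;
```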
Re-write Sort Example
The optimizer dynamically re-writes a single count
distinct, replacing a sort with a HashAgg by pushing the
distinct into a subquery
For example
SELECT count( distinct l_quantity ) FROM lineitem;
SELECT count(*) FROM (SELECT l_quantity FROM lineitem
GROUP BY l_quantity) as foo;
The optimizer cannot rewrite multiple count distincts
Re-write the query to push the distinct into a
subquery that inserts the rows into a temp table
Then query the temp table that contains the distinct
values to obtain the count
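A sketch of that manual rewrite, reusing the lineitem example above:

```sql
-- Push the DISTINCT into a temp table...
CREATE TEMPORARY TABLE distinct_qty AS
SELECT l_quantity FROM lineitem GROUP BY l_quantity
DISTRIBUTED BY (l_quantity);

-- ...then count the distinct values from the temp table
SELECT count(*) FROM distinct_qty;
```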
Eliminate Large Table Broadcast Motion
-> Hash (cost=18039.92..18039.92 rows=20 width=66)
-> Redistribute Motion 24:24 (slice3) (cost=0.55..18039.92 rows=20 width=66)
Hash Key: cust_contact_activity.src_system_id::text
-> Hash Join (cost=0.55..18039.52 rows=20 width=66)
Hash Cond: cust_contact_activity.contact_id::text =
cust_contact.contact_id::text
-> Seq Scan on cust_contact_activity (cost=0.00..15953.63 rows=833663
width=39)
-> Hash (cost=0.25..0.25 rows=24 width=84)
-> Broadcast Motion 24:24 (slice2) (cost=0.00..0.25 rows=24 width=84)
-> Seq Scan on cust_contact (cost=0.00..0.00 rows=1 width=84)
Small table broadcast is acceptable
Use gp_segments_for_planner to increase the cost of
the motion to favor a redistribute motion
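For example (the value is illustrative and should reflect your segment count):

```sql
-- Cost motions as if more segments existed, making the broadcast
-- of a larger table less attractive than a redistribute
SET gp_segments_for_planner = 48;
```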
Increase work_mem
-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end, start offset by 3.533
ms.
Executor memory: 2645K bytes avg, 5019K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O
affecting 1 workers.
...
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1).
Work_mem: 4062K bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Total runtime: 73326.082 ms
(24 rows)
Spill Files
Operations performed in memory are optimal
Hash Join, Hash Agg, Sort, etc.
If there is not sufficient memory, rows will be written out
to disk as spill files
There is a certain amount of overhead with any disk I/O operation
Spill files are located within the pgsql_temp directory for
the database
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh
total 334464
-rw------- 1 jeshle jeshle 163M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh
total 341632
-rw------- 1 jeshle jeshle 167M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp] ls -ltrh
total 341632
-rw------- 1 jeshle jeshle 167M Jan 9 23:41 pgsql_tmp_SortTape_Slice1_14022.205
[jesh:~/gpdb-data/seg1/base/16571/pgsql_tmp]
Identify Respill in Hash Agg Operations
...
-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end . . .
Executor memory: 2645K bytes avg, 5019K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O
affecting 1 workers.
(seg1) 2999671 groups total in 5 batches; 64 respill passes; 23343536 respill rows.
(seg1) Initial pass: 44020 groups made from 44020 rows; 2955651 rows spilled to workfile.
(seg1) Hash chain length 5.0 avg, 18 max, using 602986 of 607476 buckets.
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1). Work_mem: 4062K
bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Eliminate Respill in Hash Agg Operations
-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 9374 ms to first row, 19169 ms to end, start offset by ms.
Executor memory: 5690K bytes avg, 8063K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1) .
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O
affecting 1 workers.
(seg1) 2999671 groups total in 100 batches; 0 respill passes; 0 respill rows.
(seg1) Initial pass: 44020 groups made from 44020 rows; 2955651 rows spilled to workfile.
(seg1) Hash chain length 3.5 avg, 16 max, using 850893 of 880400 buckets.
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.480 ms to first row, 4199 ms to end, start offset by 18 ms.
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 5884K bytes avg x 2 workers, 8254K bytes max (seg1). Work_mem: 4062K
bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Total runtime: 20987.615 ms
Use gp_hashagg_spillbatch_min and
gp_hashagg_spillbatch_max
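For example (values are illustrative; check the documentation for your release):

```sql
-- Force more, smaller spill batches so each batch fits in work_mem
SET gp_hashagg_spillbatch_min = 100;
SET gp_hashagg_spillbatch_max = 100;
```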
Review Join Order
Aggregate (cost=16141.02..16141.03 rows=1 width=0)
-> Gather Motion 2:1 (slice2) (cost=16140.97..16141.00 rows=1 width=0)
-> Aggregate (cost=16140.97..16140.98 rows=1 width=0)
-> Hash Join (cost=2999.46..15647.43 rows=197414 width=0)
Hash Cond: ps.ps_partkey = part.p_partkey
-> Hash Join (cost=240.46..9920.71 rows=200020 width=4)
Hash Cond: ps.ps_suppkey = supplier.s_suppkey
-> Seq Scan on partsupp ps (cost=0.00..6180.00 rows=400000 width=8)
-> Hash (cost=177.97..177.97 rows=4999 width=4)
      -> Broadcast Motion 2:2 (slice1) (cost=0.00..177.97 rows=4999 width=4)
            -> Seq Scan on supplier (cost=0.00..77.99 rows=4999 width=4)
-> Hash (cost=1509.00..1509.00 rows=100000 width=4)
      -> Seq Scan on part (cost=0.00..1509.00 rows=100000 width=4)
Build the hash and probe on the largest table(s) last
Use join_collapse_limit to Specify Join
Order
Use join_collapse_limit to specify join order
Must be ANSI style join
For example
SELECT COUNT(*) FROM partsupp ps
JOIN supplier ON ps_suppkey = s_suppkey
JOIN part ON p_partkey = ps_partkey;
Non-ANSI style join
SELECT COUNT(*) FROM partsupp, supplier, part
WHERE ps_suppkey = s_suppkey
AND p_partkey = ps_partkey;
Set join_collapse_limit = 1
Analyzing Query Plans Summary
Identify plan nodes with a large number of rows and
high cost
Validate partitions are being eliminated
Eliminate large table broadcast motion
Replace slow operators to favor fast operators
Nested Loop Join, Sort Merge Join, Sort, Agg
Identify spill files and increase memory
work_mem used versus work_mem wanted
Eliminate respill in HashAggregate
Review join order
Module 5
LAB
Explain the Explain Plan
Analyzing Queries
Module 6
Improve Performance
With Statistics
EXPLAIN ANALYZE Estimated Costs
EXPLAIN ANALYZE provides cost estimates for
the execution of the plan node as follows
Cost
Measured in units of disk page fetches
Rows
The number of rows output by the plan node
Width
Total bytes of all the rows output by the plan node
ANALYZE and Database Statistics
Database statistics used by the optimizer and
query planner
Run ANALYZE after loading data
Run ANALYZE after large INSERT, UPDATE and
DELETE operations
Run ANALYZE after CREATE INDEX operations
Run ANALYZE after database restores from
backups
ANALYZE is the most important ongoing
administrative task for optimal query performance
ANALYZE and VACUUM ANALYZE
Use the ANALYZE command to generate
database statistics
May specify table and column names
ANALYZE [table [ (column [, ...] ) ]]
Use VACUUM ANALYZE to vacuum the database
and generate database statistics
May specify table and column names
VACUUM ANALYZE [table [(column [, ...] )]]
Use SET STATISTICS to Increase Sampling
Use ALTER TABLE SET STATISTICS to increase
sampling for statistics collected for a given column
Default statistics value is 25
Increasing the statistics value may improve query
planner estimates
For columns used in query predicates and JOINS
(WHERE clause)
Example
ALTER TABLE customer ALTER customer_id
SET STATISTICS 35;
Larger values increase the time it takes to
ANALYZE
default_statistics_target
Use default_statistics_target server configuration
parameter to increase sampling for statistics
collected for ALL columns
Increasing the target value may improve query
planner estimates
Default is 25
Larger values increase the time it takes to
ANALYZE
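For example (session level; the value 50 is illustrative):

```sql
-- Double the default sampling for all columns in this session
SET default_statistics_target = 50;
```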
gp_analyze_relative_error
The gp_analyze_relative_error server
configuration parameter sets the estimated
acceptable error in the cardinality of the table
A value of 0.5 is equivalent to an acceptable
error of 50%
The default value is 0.5
Decreasing the relative error fraction (accepting
less error) will increase the number of rows
sampled
gp_analyze_relative_error = .25
ANALYZE Best Practices
For large data warehouse environments it may
not be feasible to run ANALYZE on the entire
database or on all columns in a table
Run ANALYZE for
Columns used in a JOIN condition
Columns used in a WHERE clause
Columns used in a SORT clause
Columns used in a GROUP BY or HAVING
Clause
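For example, collecting statistics only for join and predicate columns (table and column names are hypothetical):

```sql
-- Analyze just the columns the workload filters and joins on
ANALYZE transaction (storeid, customerid, transdate);
```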
Module 6
LAB
Improve Performance
With Statistics
Module 7
Indexing Strategies
Indexes
Most data warehouse environments operate on
large volumes of data
Low selectivity
Sequential scan is the preferred method to read the data in a
Greenplum MPP environment
For queries with high selectivity indexes may
improve performance
Avoid indexes on frequently updated columns
Avoid overlapping indexes
Use bitmap indexes for columns with very low cardinality
Drop indexes before loading data and recreate indexes after the
load
Analyze after re-creating indexes
Greenplum Index Types
B-Tree
Use for fairly unique columns
Use for columns that are used in single-row lookups
Bitmap
Use for low cardinality columns
Use when column is often included in predicate
Hash
Available, but not recommended
Create Index Syntax
CREATE [UNIQUE] INDEX [CONCURRENTLY]
name ON table
[USING method]
( {column | (expression)} [opclass]
[, ...] )
[ WITH ( FILLFACTOR = value ) ]
[ WHERE predicate ]
B-Tree Index
Supports single value row lookups
Can be Unique or Non-Unique
Can be Single or Multi-Column
If multi-column, all columns in the index must be
included in the predicate for the index to be used.
EXAMPLE:
CREATE INDEX transid_btridx
ON [Link]
USING BTREE (transactionid)
;
Bitmap Index
Single column index is recommended
Provides very fast retrieval
Low cardinality columns
Gender
State / Province Code
EXAMPLE:
CREATE INDEX store_pharm_bmidx ON [Link]
USING BITMAP (pharmacy);
CREATE INDEX store_grocery_bmidx ON [Link]
USING BITMAP (grocery);
CREATE INDEX store_deli_bmidx ON [Link]
USING BITMAP (deli);
Index on Expressions
Only use when the expression appears often in
query predicates.
Very high overhead maintaining the index
during insert and update operations.
EXAMPLE:
CREATE INDEX lcase_storename_idx
ON store (LOWER(storename));
SUPPORTS:
SELECT * FROM store WHERE LOWER(storename) = 'top foods';
Index with Predicate (Partial Index)
Pre-selects rows based on predicate
Use to select small numbers of rows from large
tables
EXAMPLE:
CREATE INDEX canada_stores_idx
ON [Link] (storeid)
WHERE storeid IN (8, 32);
To Index or Not to Index
Will the optimizer use the index?
Test it with production volumes
No hints in Greenplum!
Is the column(s) used in query predicates?
Does the frequency of use justify the overhead?
Is the space available?
Should a unique B-Tree Index be the primary
key?
Consider a Primary Key Constraint instead
Index Costs
Indexes are not free!
Indexes take space
Indexes incur overhead during inserts and
updates
Indexes incur processing overhead during
creation
Index Best Practices
Only create indexes when full table scans do
not perform well.
Name the index.
Use B-Tree indexes for single column lookups
Use Bitmap Indexes for low cardinality columns
Use Partial Indexes for queries that frequently
access small subsets of much larger data sets.
Module 7
LAB
Indexing Strategies
Module 8
Advanced Reporting
Using
OLAP
What is OLAP?
On-Line Analytic Processing
Quickly provide answers to multi-dimensional
queries
Window Functions
Allow access to multiple rows in a single pass
OLAP Grouping Sets
More flexible than standard GROUP BY
SELECT - OLAP Grouping Extensions
Standard GROUP BY Example
ROLLUP
GROUPING SETS
CUBE
grouping(column [, ...]) function
group_id() function
Standard GROUP BY Example
summarize product sales by vendor
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
ORDER BY 1,2,3;
 pn  | vn | sum
-----+----+---------
 100 | 20 | 0
 100 | 40 | 2640000
 200 | 10 | 0
 200 | 40 | 0
 300 | 30 | 0
 400 | 50 | 0
 500 | 30 | 120
 600 | 30 | 60
 700 | 40 | 1
 800 | 40 | 1
(10 rows)
Standard GROUP BY Example Continued
now include subtotals and grand total
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY pn, vn
UNION ALL
SELECT pn, null, sum(prc*qty)
FROM sale
GROUP BY pn
UNION ALL
SELECT null, null, sum(prc*qty)
FROM sale
ORDER BY 1,2,3;

 pn  | vn | sum
-----+----+---------
 100 | 20 | 0
 100 | 40 | 2640000
 100 |    | 2640000
 200 | 10 | 0
 200 | 40 | 0
 200 |    | 0
 300 | 30 | 0
 300 |    | 0
 400 | 50 | 0
 400 |    | 0
 500 | 30 | 120
 500 |    | 120
 600 | 30 | 60
 600 |    | 60
ROLLUP Example
the same query using ROLLUP
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY ROLLUP(pn, vn)
ORDER BY 1,2,3;

 pn  | vn | sum
-----+----+---------
 100 | 20 | 0
 100 | 40 | 2640000
 100 |    | 2640000
 200 | 10 | 0
 200 | 40 | 0
 200 |    | 0
 300 | 30 | 0
 300 |    | 0
 400 | 50 | 0
 400 |    | 0
 500 | 30 | 120
 500 |    | 120
 600 | 30 | 60
 600 |    | 60
GROUPING SETS Example
the same query using GROUPING SETS
SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn), () )
ORDER BY 1,2,3;

 pn  | vn | sum
-----+----+---------
 100 | 20 | 0
 100 | 40 | 2640000
 100 |    | 2640000
 200 | 10 | 0
 200 | 40 | 0
 200 |    | 0
 300 | 30 | 0
 300 |    | 0
 400 | 50 | 0
 400 |    | 0
 500 | 30 | 120
 500 |    | 120
 600 | 30 | 60
 600 |    | 60
CUBE Example
CUBE creates subtotals for all possible
combinations of grouping columns, so

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY CUBE(pn, vn)
ORDER BY 1,2,3;

is the same as

SELECT pn, vn, sum(prc*qty)
FROM sale
GROUP BY GROUPING SETS
( (pn, vn), (pn), (vn), () )
ORDER BY 1,2,3;

 pn  | vn | sum
-----+----+---------
 100 | 20 | 0
 100 | 40 | 2640000
 100 |    | 2640000
 200 | 10 | 0
 200 | 40 | 0
 200 |    | 0
 300 | 30 | 0
 300 |    | 0
 400 | 50 | 0
 400 |    | 0
 500 | 30 | 120
 500 |    | 120
 600 | 30 | 60
 600 |    | 60
 700 | 40 | 1
 700 |    | 1
 800 | 40 | 1
 800 |    | 1
GROUPING Function Example
Distinguishes NULL values from summary
markers
SELECT * FROM dsales_null;

 store | customer | product | price
-------+----------+---------+-------
 s2    | c1       | p1      | 90
 s2    | c1       | p2      | 50
 s2    |          | p1      | 44
 s1    | c2       | p2      | 70
 s1    | c3       | p1      | 40
(5 rows)

SELECT store, customer, product, sum(price),
       grouping(customer)
FROM dsales_null
GROUP BY ROLLUP(store,customer,product);

 store | customer | product | sum | grouping
-------+----------+---------+-----+----------
 s1    | c2       | p2      |  70 | 0
 s1    | c2       |         |  70 | 0
 s1    | c3       | p1      |  40 | 0
 s1    | c3       |         |  40 | 0
 s1    |          |         | 110 | 1
 s2    | c1       | p1      |  90 | 0
 s2    | c1       | p2      |  50 | 0
 s2    | c1       |         | 140 | 0
 s2    |          | p1      |  44 | 0
GROUP_ID Function
Returns 0 for each output row in a unique grouping set
Assigns a serial number >0 to each duplicate grouping set
found
Useful when combining grouping extension clauses
Can be used to filter output rows of duplicate grouping sets:
SELECT a, b, c, sum(p*q), group_id()
FROM sales
GROUP BY ROLLUP(a,b), CUBE(b,c)
HAVING group_id()<1
ORDER BY a,b,c;
SELECT OLAP Windowing Extensions
About Window Functions
Constructing a Window Specification
OVER clause
WINDOW clause
Built-in Window Functions
About Window Functions
New class of function allowed only in the SELECT list
Returns a value per row (unlike aggregate functions)
Results interpreted in terms of the current row and its corresponding window
partition or frame
Characterized by the use of the OVER clause
Defines the window partitions (groups of rows) to apply the function
Defines ordering of data within a window
Defines the positional or logical framing of a row in respect to its window
Defining Window Specifications (OVER
Clause)
All window functions have an OVER() clause
Specifies the window of data to which the function applies
Defines:
Window partitions (PARTITION BY clause)
Ordering within a window partition (ORDER BY clause)
Framing within a window partition (ROWS/RANGE clauses)
OVER (PARTITION BY)
Example
SELECT * ,
row_number()
OVER()
FROM sale
ORDER BY cn;
SELECT * ,
row_number()
OVER(PARTITION BY cn)
FROM sale
ORDER BY cn;
 row_number | cn | vn | pn  | dt         | qty  | prc
------------+----+----+-----+------------+------+------
          1 |  1 | 10 | 200 | 1401-03-01 |    1 |    0
          2 |  1 | 30 | 300 | 1401-05-02 |    1 |    0
          3 |  1 | 50 | 400 | 1401-06-01 |    1 |    0
          4 |  1 | 30 | 500 | 1401-06-01 |   12 |    5
          5 |  1 | 20 | 100 | 1401-05-01 |    1 |    0
          6 |  2 | 50 | 400 | 1401-06-01 |    1 |    0
          7 |  2 | 40 | 100 | 1401-01-01 | 1100 | 2400
          8 |  3 | 40 | 200 | 1401-04-01 |    1 |    0
(8 rows)

 row_number | cn | vn | pn  | dt         | qty  | prc
------------+----+----+-----+------------+------+------
          1 |  1 | 10 | 200 | 1401-03-01 |    1 |    0
          2 |  1 | 30 | 300 | 1401-05-02 |    1 |    0
          3 |  1 | 50 | 400 | 1401-06-01 |    1 |    0
          4 |  1 | 30 | 500 | 1401-06-01 |   12 |    5
          5 |  1 | 20 | 100 | 1401-05-01 |    1 |    0
          1 |  2 | 50 | 400 | 1401-06-01 |    1 |    0
          2 |  2 | 40 | 100 | 1401-01-01 | 1100 | 2400
OVER (ORDER BY) Example
SELECT vn, sum(prc*qty)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;

 vn | sum
----+---------
 40 | 2640002
 30 |     180
 50 |       0
 20 |       0
 10 |       0
(5 rows)

SELECT vn, sum(prc*qty), rank()
OVER (ORDER BY sum(prc*qty) DESC)
FROM sale
GROUP BY vn
ORDER BY 2 DESC;

 vn | sum     | rank
----+---------+------
 40 | 2640002 |    1
 30 |     180 |    2
 50 |       0 |    3
 20 |       0 |    3
 10 |       0 |    3
OVER (ROWS) Example
Window Framing: Box car Average

SELECT vn, dt,
  AVG(prc*qty) OVER (
    PARTITION BY vn
    ORDER BY dt
    ROWS BETWEEN
      2 PRECEDING AND
      2 FOLLOWING)
FROM sale;

 vn | dt       | avg
----+----------+-----
 10 | 03012008 |  30
 30 | 05022008 |   0
 20 | 05012008 |  20
 30 | 06012008 |  60
 30 | 05022008 |   0
 30 | 06012008 |  60
 30 | 06012008 |  60
 30 | 06012008 |  60
 30 | 06012008 |  60
 30 | 06012008 |  60
 40 | 06012008 | 140
 40 | 06042008 |  90
 40 | 06052008 | 120
Built-In Window Functions
cume_dist()
dense_rank()
first_value(expr)
lag(expr [,offset] [,default])
last_value(expr)
lead(expr [,offset] [,default])
ntile(expr)
percent_rank()
rank()
row_number()
* Any aggregate function (used with the OVER clause) can
also be used as a window function
Designating the Moving Window
<function> OVER (
  PARTITION BY column1..n   -- this is like a GROUP BY for the function
  ORDER BY ...              -- required by some functions
  ROWS BETWEEN ... )        -- see below

ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW
N PRECEDING AND CURRENT ROW
CURRENT ROW AND N FOLLOWING
N PRECEDING AND N FOLLOWING
Moving Window Example
REQUIREMENT: By store, rank all customers for the month of June.
EXAMPLE:
SELECT [Link]
,[Link]
,[Link]
,RANK() OVER (PARTITION BY [Link]
ORDER BY [Link] DESC) AS ranking
FROM (SELECT storeid
,customerid
,SUM(salesamt) AS totalsalesamt
FROM transaction
WHERE transdate BETWEEN '2008-06-01' AND '2008-06-30'
GROUP BY 1,2 ) t
INNER JOIN store s ON [Link] = [Link]
INNER JOIN customer c ON [Link] = [Link]
;
Global Window Specifications (WINDOW
clause)
Useful for multiple window function queries
Define and name a window specification
Reuse window specification throughout the query
EXAMPLE:
SELECT
RANK() OVER (ORDER BY pn),
SUM(prc*qty) OVER (ORDER BY pn),
AVG(prc*qty) OVER (ORDER BY pn)
FROM sale;
SELECT
RANK() OVER (w1),
SUM(prc*qty) OVER (w1),
AVG(prc*qty) OVER (w1)
FROM sale
WINDOW w1 AS (ORDER BY pn);
Module 8
LAB
Advanced Reporting
Using
OLAP
Module 9
Using Temporary Tables
Temporary Tables
Local to the session
Data may not be shared between sessions
Temporary table DDL executed at run time
Dropped when session ends
Why Use Temporary Tables
When performing complicated transformations
When multiple results sets need to be
combined
To facilitate subsequent join operations
Move data to get same segment joins
Features
Distributed the same as any Greenplum table
May be indexed
May be Analyzed
Guidelines
Always designate DISTRIBUTED BY column(s)
Otherwise distribution defaults to the first valid column
Once data is inserted, ANALYZE the table
Analyzing temporary tables is not available in most other RDBMSs!
Same CREATE DDL as a permanent table
May not create a temporary external table
Greenplum Confidential
CREATE Temporary Table Example
CREATE TEMPORARY TABLE monthlytranssummary (
storeid     INTEGER,
customerid INTEGER,
transmonth SMALLINT,
salesamttot DECIMAL(10,2)
)
ON COMMIT PRESERVE ROWS
DISTRIBUTED BY (storeid,customerid)
;
ON COMMIT PRESERVE ROWS is the default action when creating
temporary tables.
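Following the guidelines from the previous slide, a load into this temporary table would be followed by an ANALYZE. A sketch using the table just created (the INSERT query is illustrative):

```sql
-- Populate the temporary table, then collect statistics on it.
INSERT INTO monthlytranssummary
SELECT storeid,
       customerid,
       EXTRACT(MONTH FROM transdate)::SMALLINT,
       SUM(salesamt)
FROM transaction
GROUP BY 1, 2, 3;

ANALYZE monthlytranssummary;
```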
Greenplum Confidential
Module 9
LAB
Using Temporary Tables
Greenplum Confidential
Module 10
PostGREs Functions
Greenplum Confidential
Types of Functions
Greenplum supports several function types.
query language functions (functions written in
SQL)
procedural language functions (functions
written in, for example, PL/pgSQL or PL/Tcl)
internal functions
C-language functions
This class will only present the first two types of functions:
Query Language and Procedural Language
Greenplum Confidential
Query Language Function - Rules
Executes an arbitrary set of SQL commands
Returns the results of the last query in the list
Returns the first row of the last query's result set
Can return a set of a previously defined data type
The body of the function must be a list of SQL
statements, each followed by a semicolon
Greenplum Confidential
Query Language Function More Rules
May contain SELECT statements
May contain DML statements
Insert, Update or Delete statements
May not contain:
Rollback, Savepoint, Begin or Commit commands
Last statement must be SELECT unless the
return type is void.
Greenplum Confidential
Query Language Function Example
This function has no parameters and no return
set:
CREATE FUNCTION public.clean_customer()
RETURNS void AS 'DELETE FROM [Link]
                 WHERE state = ''NA'';'
LANGUAGE SQL;
It is executed as a SELECT statement:
SELECT public.clean_customer();
 clean_customer
----------------

(1 row)
Greenplum Confidential
Query Language Function - Notes
Quoted values in the query must use the
doubled single quote syntax.
One or more parameters may be passed to a
function.
The first parameter is referenced as $1
The second parameter is referenced as $2, and so
forth
Greenplum Confidential
Query Language Function With
Parameter
CREATE FUNCTION public.clean_specific_customer(which char(2))
RETURNS void AS 'DELETE FROM [Link]
                 WHERE state = $1;'
LANGUAGE SQL;
It is executed as a SELECT statement:
SELECT public.clean_specific_customer('NA');
 clean_specific_customer
-------------------------

(1 row)
Greenplum Confidential
SQL Functions on Base Types
Used to return a single value
CREATE FUNCTION public.customer_cnt(char)
RETURNS bigint AS 'SELECT COUNT(*)
                   FROM [Link]
                   WHERE state = $1;'
LANGUAGE SQL;
It is executed as a SELECT statement:
SELECT public.customer_cnt('WA');
 customer_cnt
--------------
          176
Greenplum Confidential
SQL Functions on Composite Types
Allows you to pass in a composite parameter
(such as a table name)
CREATE FUNCTION [Link](transaction)
RETURNS numeric AS $$
SELECT $[Link] * .90;
$$ LANGUAGE SQL;
Note the syntactical difference using the double $ signs instead of
single quotes in the DDL and how the function is called.
SELECT transid, transdate, taxamt,
[Link](transaction)
FROM transaction;
Greenplum Confidential
SQL Functions with Output Parameters
Very similar to the previous example, but with no RETURNS clause.
CREATE FUNCTION public.cust_cnt(IN whichstate char,
                                OUT MyCount bigint)
AS 'SELECT COUNT(*)
    FROM [Link]
    WHERE state = $1;'
LANGUAGE SQL;
It is executed as a SELECT statement with only the input parameters
specified:
SELECT public.cust_cnt('WA');
 mycount
---------
     176
Greenplum Confidential
SQL Functions as Table Sources
All SQL functions can be used in the FROM clause of a query, but it is
particularly useful for functions returning composite types.
CREATE FUNCTION [Link](int)
RETURNS customer AS $$
SELECT *
FROM [Link] WHERE customerid = $1;
$$ LANGUAGE SQL;
Now we can access the columns in the function as if it were a table.
SELECT *,UPPER(custname)
FROM [Link](100);
Greenplum Confidential
SQL Functions Some Notes
Note that functions operating on tables must be
created in the same schema as the table.
To DROP the function, the DROP statement
must include the parameter types.
EXAMPLE:
DROP FUNCTION [Link](int);
Greenplum Confidential
SQL Functions Returning Sets
When an SQL function is declared as returning SETOF sometype, the
function's final SELECT query is executed to completion, and each
row it outputs is returned as an element of the result set.
CREATE FUNCTION [Link](char)
RETURNS SETOF customer AS $$
SELECT *
FROM [Link] WHERE state = $1;
$$ LANGUAGE SQL;
SELECT *, UPPER(city)
FROM [Link]('WA');
This function returns all qualifying rows.
Greenplum Confidential
Function Overloading
More than one function can be created with the
same name.
Must have different input parameter types or number
The optimizer selects which version to call based on
the input data type and number of arguments.
Greenplum Confidential
Function Overload - Example
CREATE FUNCTION [Link](int)
RETURNS customer AS $$
SELECT *
FROM [Link] WHERE customerid = $1;
$$ LANGUAGE SQL;
CREATE FUNCTION [Link](char)
RETURNS customer AS $$
SELECT * FROM [Link] WHERE state = $1;
$$ LANGUAGE SQL;
select *, upper(custname) from [Link]('WA');
select *, upper(custname) from [Link](1);
Greenplum Confidential
Function Volatility Categories
Volatility is a promise to the optimizer about
how the function will behave.
Every function falls into one of these categories
VOLATILE
IMMUTABLE
STABLE
Greenplum Confidential
Function Volatility - VOLATILE
A VOLATILE function can do anything,
including modifying the database.
It can return different results on successive
calls with the same arguments.
The optimizer makes no assumptions about the
behavior of such functions.
A query using a volatile function will reevaluate the function at every row where its
value is needed.
Greenplum Confidential
Function Volatility - IMMUTABLE
An IMMUTABLE function cannot modify the database.
It is guaranteed to return the same results given the
same arguments forever.
This category allows the optimizer to pre-evaluate the
function when a query calls it with constant arguments.
For example, a query like SELECT ... WHERE x = 2 + 2; can be
simplified on sight to SELECT ... WHERE x = 4; because the
function underlying the integer addition operator is
marked IMMUTABLE.
Greenplum Confidential
Function Volatility - STABLE
A STABLE function cannot modify the
database.
It is guaranteed to return the same results
given the same arguments for all rows within a
single statement.
This category allows the optimizer to optimize
multiple calls of the function to a single call.
In particular, it is safe to use an expression
containing such a function in an index scan
condition.
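Volatility is declared (or later altered) in the function DDL; VOLATILE is the default when nothing is specified. An illustrative sketch (add_tax and state_cnt are made-up functions; the customer table is the one used in earlier examples):

```sql
-- Safe to mark IMMUTABLE: output depends only on the argument.
CREATE FUNCTION add_tax(amt numeric)
RETURNS numeric AS $$
    SELECT $1 * 1.08;
$$ LANGUAGE SQL IMMUTABLE;

-- STABLE fits functions that read the database but do not modify it.
CREATE FUNCTION state_cnt(char)
RETURNS bigint AS $$
    SELECT COUNT(*) FROM customer WHERE state = $1;
$$ LANGUAGE SQL STABLE;
```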
Greenplum Confidential
Procedural Language Functions
PL/pgSQL Procedural Language for Greenplum
Can be used to create functions and procedures
Adds control structures to the SQL language
Can perform complex computations
Inherits all user-defined types, functions, and
operators
Can be defined to be trusted by the server
Is (relatively) easy to use.
Greenplum Confidential
Advantages of PLpgSQL Functions
There can be considerable savings of
client/server communication overhead.
Extra round trips between client and server are
eliminated
Intermediate results that the client does not need do
not have to be marshaled or transferred between
server and client
Multiple rounds of query parsing can be avoided
This can result in a considerable performance
increase as compared to an application that does not
use stored functions.
Greenplum Confidential
Structure of PL/pgSQL
PL/pgSQL is a block-structured language.
CREATE FUNCTION somefunc() RETURNS integer AS $$
DECLARE
    quantity integer := 30;
BEGIN
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 30
    quantity := 50;
    -- Create a subblock
    DECLARE
        quantity integer := 80;
    BEGIN
        RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 80
        RAISE NOTICE 'Outer quantity here is %', [Link];  -- Prints 50
    END;
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 50
    RETURN quantity;
END;
$$ LANGUAGE plpgsql;
Greenplum Confidential
PL/pgSQL Declarations
All variables used in a block must be declared in the declarations
section of the block.
SYNTAX:
name [ CONSTANT ] type [ NOT NULL ] [ { DEFAULT | := } expression ];
EXAMPLES:
user_id integer;
quantity numeric(5);
url varchar;
myrow tablename%ROWTYPE;
myfield [Link]%TYPE;
somerow RECORD;
Greenplum Confidential
PL/pgSQL %Types
%TYPE provides the data type of a variable or
table column.
You can use this to declare variables that will
hold database values.
SYNTAX: variable%TYPE
EXAMPLES:
custid [Link]%TYPE
tid    [Link]%TYPE
Greenplum Confidential
PL/pgSQL Row Types
A row variable can be declared to have the
same type as the rows of an existing table or
view.
Only the user-defined columns of a table row
are accessible in a row-type variable, not the
OID or other system columns.
SYNTAX: table_name%ROWTYPE
EXAMPLES:
cust customer%ROWTYPE
trans transaction%ROWTYPE
Greenplum Confidential
PL/pgSQL Record Types
Record variables are similar to row-type
variables, but they have no pre-defined
structure.
They take on the actual row structure of the row
they are assigned during a SELECT or FOR
command. (It is a placeholder.)
The substructure of a record variable can
change each time it is assigned to.
SYNTAX: name RECORD;
Greenplum Confidential
PL/pgSQL Basic Statements
Assignment
variable := expression;
Executing DML (no RETURN)
PERFORM myfunction(myparm1,myparm2);
Single row results
SELECT select_list INTO mytarget FROM ...;
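A short PL/pgSQL fragment combining the three statement forms above. Names are illustrative; cust_total and log_access are hypothetical functions:

```sql
DECLARE
    cust  customer%ROWTYPE;
    total numeric;
BEGIN
    -- Single-row result into a row variable
    SELECT * INTO cust FROM customer WHERE customerid = 100;
    -- Assignment
    total := cust_total(cust.customerid);   -- cust_total is hypothetical
    -- Function call whose result is discarded
    PERFORM log_access(cust.customerid);    -- log_access is hypothetical
END;
```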
Greenplum Confidential
PL/pgSQL Executing Dynamic SQL
Oftentimes you will want to generate dynamic
commands inside your PL/pgSQL functions
This is for commands that will involve different
tables or different data types each time they are
executed.
To handle this sort of problem, the EXECUTE
statement is provided:
EXECUTE command-string [ INTO [STRICT] target ];
Greenplum Confidential
PL/pgSQL Dynamic SQL Example
Dynamic values that are to be inserted into the constructed query
require careful handling since they might themselves contain quote
characters.
EXECUTE 'UPDATE tbl SET '
     || quote_ident(colname)
     || ' = '
     || quote_literal(newvalue)
     || ' WHERE key = '
     || quote_literal(keyvalue);
Greenplum Confidential
PL/pgSQL Getting the Results
There are several ways to determine the effect
of a command.
The first method is to use the GET
DIAGNOSTICS command, which has the form:
GET DIAGNOSTICS variable = item [ , ... ];
This command allows retrieval of system status
indicators.
GET DIAGNOSTICS integer_var = ROW_COUNT;
Greenplum Confidential
PL/pgSQL Getting the Results
(continued)
The second method to determine the effects of
a command is to check the special variable
named FOUND, which is of type boolean.
FOUND starts out false within each PL/pgSQL
function call.
Greenplum Confidential
PL/pgSQL FOUND
FOUND is set by:
A SELECT INTO statement sets FOUND true if a row is assigned, false if
no row is returned.
A PERFORM statement sets FOUND true if it produces (and discards)
one or more rows, false if no row is produced.
UPDATE, INSERT, and DELETE statements set FOUND true if at least
one row is affected, false if no row is affected.
A FETCH statement sets FOUND true if it returns a row, false if no row
is returned.
A MOVE statement sets FOUND true if it successfully repositions the
cursor, false otherwise.
A FOR statement sets FOUND true if it iterates one or more times, else
false.
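A typical use of FOUND right after a DML statement, sketched with the customer table used elsewhere in this course:

```sql
UPDATE customer SET state = 'WA' WHERE customerid = 100;
IF NOT FOUND THEN
    RAISE NOTICE 'customer 100 does not exist';
END IF;
```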
Greenplum Confidential
PL/pgSQL Control Structures
Control structures are probably the most useful
(and important) part of PL/pgSQL.
With PL/pgSQL's control structures, you can
manipulate PostgreSQL data in a very flexible
and powerful way.
Greenplum Confidential
PL/pgSQL Returning from a Function
There are two commands available that allow you to return data
from a function: RETURN and RETURN NEXT.
RETURN with an expression terminates the function and returns the
value of expression to the caller.
If you declared the function with output parameters, write just
RETURN with no expression. The current values of the output
parameter variables will be returned.
If you declared the function to return void, a RETURN statement can
be used to exit the function early; but do not write an expression
following RETURN.
The return value of a function cannot be left undefined.
Greenplum Confidential
PL/pgSQL Returning from a Function
RETURN NEXT and RETURN QUERY do not actually return from the
function; they simply append zero or more rows to the function's result
set.
EXAMPLE:
CREATE OR REPLACE FUNCTION getAllStores()
RETURNS SETOF store AS
$BODY$
DECLARE r store%rowtype;
BEGIN
FOR r IN SELECT * FROM store WHERE storeid > 0 LOOP
-- can do some processing here
RETURN NEXT r;
-- return current row of SELECT
END LOOP;
RETURN;
END
$BODY$
LANGUAGE plpgsql;
Greenplum Confidential
PL/pgSQL Conditionals
IF ... THEN ... ELSE conditionals let you execute commands based on
certain conditions.
IF boolean-expression THEN
    statements
[ ELSIF boolean-expression THEN
    statements
[ ELSIF boolean-expression THEN
    statements ] ]
[ ELSE
    statements ]
END IF;
TIP: Put the most commonly occurring condition first!
Greenplum Confidential
PL/pgSQL Simple Loops
With the LOOP, EXIT, CONTINUE, WHILE, and FOR statements,
you can arrange for your PL/pgSQL function to repeat a series of
commands.
LOOP SYNTAX:
[ <<label>> ]
LOOP
statements
END LOOP [ label ];
Greenplum Confidential
PL/pgSQL Simple Loops
EXIT EXAMPLES:
LOOP
-- some computations
IF count > 0
THEN EXIT;
-- exit loop
END IF;
END LOOP;
LOOP
-- some computations
EXIT WHEN count > 0;
-- same result as previous example
END LOOP;
BEGIN
-- some computations
IF stocks > 100000
THEN EXIT;
-- causes exit from the BEGIN block
END IF;
END;
Greenplum Confidential
PL/pgSQL Simple Loops
CONTINUE can be used within all loops.
CONTINUE EXAMPLE:
LOOP
-- some computations
EXIT WHEN count > 100;
CONTINUE WHEN count < 50;
-- some computations for count IN [50 .. 100]
END LOOP;
Greenplum Confidential
PL/pgSQL Simple Loops
The WHILE statement repeats a sequence of statements so long as
the boolean-expression evaluates to true. The expression is checked
just before each entry to the loop body.
WHILE EXAMPLE:
WHILE customerid < 50 LOOP
-- some computations
END LOOP;
Greenplum Confidential
PL/pgSQL Simple For (integer) Loops
This form of FOR creates a loop that iterates over a range of integer
values.
EXAMPLE:
FOR i IN 1..3 LOOP
-- i will take on the values 1,2,3 within the loop
END LOOP;
FOR i IN REVERSE 3..1 LOOP
-- i will take on the values 3,2,1 within the loop
END LOOP;
FOR i IN REVERSE 8..1 BY 2 LOOP
-- i will take on the values 8,6,4,2 within the loop
END LOOP;
Greenplum Confidential
PL/pgSQL Simple For (Query) Loops
Using a different type of FOR loop, you can iterate through the results
of a query and manipulate that data accordingly.
EXAMPLE:
CREATE FUNCTION nukedimensions() RETURNS integer AS $$
DECLARE mdims RECORD;
BEGIN
FOR mdims IN SELECT * FROM pg_class WHERE schema = 'dimensions' LOOP
-- Now "mdims" has one record from pg_class
EXECUTE 'TRUNCATE TABLE ' || quote_ident([Link]);
END LOOP;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
Greenplum Confidential
PL/pgSQL Trapping Errors
By default, any error occurring in a PL/pgSQL function aborts
execution of the function, and indeed of the surrounding
transaction as well.
You can trap errors and recover from them by using a BEGIN block
with an EXCEPTION clause.
BEGIN
statements
EXCEPTION WHEN condition [OR condition ... ] THEN
handler_statements
[ WHEN condition [ OR condition ... ] THEN
handler_statements ... ]
END;
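A minimal sketch of the pattern above (safe_div is a made-up function):

```sql
CREATE FUNCTION safe_div(n numeric, d numeric)
RETURNS numeric AS $$
BEGIN
    RETURN n / d;
EXCEPTION WHEN division_by_zero THEN
    RETURN NULL;   -- recover instead of aborting the transaction
END;
$$ LANGUAGE plpgsql;
```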
Greenplum Confidential
PL/pgSQL Common Exception
Conditions
NO_DATA
No data matches predicates
CONNECTION_FAILURE
SQL cannot connect
DIVISION_BY_ZERO
Division by zero attempted
INTERVAL_FIELD_OVERFLOW
Insufficient precision for timestamp
NULL_VALUE_VIOLATION
Tried to insert NULL into NOT NULL
column
RAISE_EXCEPTION
Raised by a RAISE EXCEPTION statement
PLPGSQL_ERROR
PL/pgSQL error encountered at run
time
NO_DATA_FOUND
Query returned zero rows
TOO_MANY_ROWS
Anticipated single row return,
received more than 1 row.
Greenplum Confidential
Why Use Functions?
To perform conditional logic
To provide common SQL to the user community
Ensures a single version of the metrics
Prevents poorly designed user queries
To execute complex report or transformation
logic
Note: We specifically did not cover cursors in this
course. Cursors can take a parallel process and make
it nicely serial!
Greenplum Confidential
User Defined Data Type
Often a function will need to return a data type
other than one based on a table.
CREATE TYPE queryreturn AS (
Customerid BIGINT,
CustomerName CHARACTER VARYING,
TotalSalesAmt DECIMAL(12,2)
);
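A function can then declare this type as its return type. A sketch reusing tables from earlier modules (top_customers is a made-up function; the join is illustrative):

```sql
CREATE FUNCTION top_customers()
RETURNS SETOF queryreturn AS $$
    SELECT c.customerid,
           c.custname,
           SUM(t.salesamt)::DECIMAL(12,2)
    FROM customer c
    JOIN transaction t ON t.customerid = c.customerid
    GROUP BY 1, 2;
$$ LANGUAGE SQL;
```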
Greenplum Confidential
Module 10
LAB
PostGREs Functions
Greenplum Confidential
Module 11
Controlling Access
Greenplum Confidential
System Level Security
Problems
Privileged users (gpadmin/root) have access to
everything (physical data, binaries, backups)
Sharing accounts makes auditing difficult
Users are tempted to run additional workloads on
master node, such as ETL
Once logged into the master, all other hosts are
available since ssh keys are exchanged
Network infrastructure isn't under our control
System administration isn't under our control
Greenplum Confidential
Database Security
Problems
gpadmin user has access to everything
Unless roles and privileges are configured, all users
have access to everything
Unless resource queues are configured, there are no
limits on what users can run
Default pg_hba.conf is loosely configured (see
Security in [Link])
Customers are tempted to share user accounts
Greenplum Confidential
Database Security
Roles
Each person should have their own role
Roles should be further divided into groups, which
are usually roles that can't log in
Privileges should be granted at the group level
whenever possible
Privileges should be as restrictive as possible
Column level access can be accomplished with views
Roles are not related to OS users or groups
Greenplum Confidential
Creating Roles
The Greenplum Database manages database access
permissions using the concept of roles.
The concept of roles subsumes the concepts of users
and groups.
A role can be a database user, a group, or both. Roles
can own database objects (for example, tables) and can
assign privileges on those objects to other roles to
control access to the objects.
Roles can be members of other roles, thus a member
role can inherit the attributes and privileges of its
parent role.
Greenplum Confidential
Database Architecture
A KEY COMPONENT OF ACCESS CONTROL
Isolate users from physical
layer through views
Isolate users from changes to
tables
Control access to columns and/or
rows
Enforce locking strategy
Separate functional areas into
separate schemas
Simplify security
Simplify backups/restores
[Diagram: Risk, Sales, and Mktg users access their tables only through views, with each functional area in its own schema]
Greenplum Confidential
ROLES AS USERS AND GROUPS
Every Greenplum Database system contains a
set of database roles (users and groups).
Those roles are separate from the users and groups
managed by the operating system on which the
server runs. However, for convenience you may want
to maintain a relationship between operating system
user names and Greenplum Database role names,
since many of the client applications use the current
operating system user name as the default.
A user is a role with a password!
Greenplum Confidential
Creating new roles (users)
A user-level role is considered to be a database
role that can log in to the database and initiate
a database session. Therefore, when you create
a new user-level role using the CREATE ROLE
command, you must specify the LOGIN
privilege.
Example: CREATE ROLE apatel WITH LOGIN;
Greenplum Confidential
Role attributes
A database role may have a number of
attributes that define what sort of tasks that
role can perform in the database.
These attributes can be set at role creation time or by
using the ALTER ROLE syntax.
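For instance (the role name and attribute choices are illustrative):

```sql
-- Set attributes at creation time...
CREATE ROLE etl_user WITH LOGIN CREATEDB;

-- ...or change them later with ALTER ROLE.
ALTER ROLE etl_user WITH NOCREATEDB;
```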
Greenplum Confidential
Common user attributes
Greenplum Confidential
Role membership (groups)
It is frequently convenient to group users together to ease
management of object privileges: that way, privileges can be
granted to, or revoked from, a group as a whole.
In the Greenplum Database this is done by creating a role that
represents the group, and then granting membership in the group
role to individual user roles.
Use the CREATE ROLE SQL command to create a new group role.
Example: CREATE ROLE admin CREATEROLE CREATEDB;
Once the group role exists, you can add and remove members (user
roles) using the GRANT and REVOKE commands.
Example: GRANT admin TO john, sally;
REVOKE admin FROM bob;
Greenplum Confidential
Object level privileges
For managing object privileges, you would then
grant the appropriate permissions to the group-level role only.
The member user roles then inherit the object
privileges of the group role.
Examples:
GRANT ALL ON TABLE mytable TO admin;
GRANT ALL ON SCHEMA myschema TO admin;
GRANT ALL ON DATABASE mydb TO admin;
Greenplum Confidential
Managing object privileges
When an object (table, view, sequence, database,
function, language, schema, or tablespace) is
created, it is assigned an owner.
The owner is normally the role that executed the
creation statement.
For most kinds of objects, the initial state is that
only the owner (or a superuser) can do anything
with the object.
To allow other roles to use it, privileges must be
granted.
Greenplum Confidential
Object privileges
Greenplum Confidential
Security examples
CREATE ROLE admin CREATEROLE CREATEDB;
CREATE ROLE batch;
GRANT select, insert, update, delete
ON [Link] TO batch;
CREATE ROLE batchuser LOGIN;
GRANT batch TO batchuser;
Greenplum Confidential
Module 11
LAB
Controlling Access
Greenplum Confidential
Module 12
Advanced SQL Topics
&
Performance Tips
Greenplum Confidential
UPDATE and DELETE Performance
Use TRUNCATE TABLE to delete all rows from a
table
Do not DROP the table and recreate the table
To perform an UPDATE or DELETE on many
rows, insert the rows into a temp table and then
perform an UPDATE JOIN or a DELETE JOIN
The WHERE condition must specify equality between
the distribution columns of the target table and all
joined tables
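A sketch of the DELETE-via-temp-table pattern described above, distributing the temp table on the target table's distribution key so the join stays segment-local (table and column names follow this course's examples; the predicate is illustrative):

```sql
-- Collect the keys to delete (assumes transaction is distributed by these columns).
CREATE TEMPORARY TABLE del_keys AS
SELECT storeid, customerid
FROM transaction
WHERE transdate < '2007-01-01'
DISTRIBUTED BY (storeid, customerid);

-- Delete by joining on the distribution columns.
DELETE FROM transaction t
USING del_keys d
WHERE t.storeid = d.storeid
  AND t.customerid = d.customerid;
```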
Greenplum Confidential
Temporal and Boolean Data Types
To avoid processing skew do not use temporal
data types (date, time, timestamp/datetime,
interval) for distribution columns
To avoid data skew never use a boolean data
type (t/f) for distribution columns
Greenplum Confidential
Avoid Approximate Numeric Data Types
Avoid floating point data types when possible
Use the absolute minimum precision required
Do not use approximate numerics (double, float,
etc) for:
Distribution Columns
Join Columns
Columns that will be used for mathematical operations
(SUM, AVG)
Greenplum Confidential
Data Types and Performance
Use data types that will minimize disk storage
requirements, minimize scan time and maximize
performance
Choose the smallest integer type
For example, INTEGER not NUMERIC(11,0)
Choose integer types for distribution columns
Choose the same integer types for join columns
Greenplum Confidential
Use CASE to Avoid Multiple Passes
Use the CASE statement instead of DECODE.
Put the most commonly occurring condition first
Put the least commonly occurring condition last
Always put a default (ELSE) condition
EXAMPLE:
CASE
    WHEN state = 'CA'
        THEN state_population * 1.2
    WHEN state = 'WY'
        THEN state_population * 10.7
    ELSE 0
END
Greenplum Confidential
SET Operations
SET operations work against queries, not
tables
SET operations can improve query performance
Greenplum set operations include
UNION
- Ensures uniqueness in the result set
UNION ALL
- Allows duplicates; use with caution!
INTERSECT
EXCEPT (same as MINUS)
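A small sketch of EXCEPT against two queries (returned_trans is a made-up table; transaction is from earlier examples):

```sql
-- Customers who purchased in June but returned nothing.
SELECT customerid FROM transaction
WHERE transdate BETWEEN '2008-06-01' AND '2008-06-30'
EXCEPT
SELECT customerid FROM returned_trans;
```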
Greenplum Confidential
SET Operation Cautions
Do not UNION a large number of queries
against tables with millions (or billions) of
rows.
The optimizer will attempt to launch each query in
parallel, you could run out of memory and temp
space.
Use a temporary table to store the results and
populate it with separate queries.
Use a CASE statement, if possible, to retrieve the
data in one pass.
Greenplum Confidential
Avoid Multiple DISTINCT Operations
Original Query:
SELECT customerId,
COUNT(DISTINCT transId),
COUNT(DISTINCT storeId)
FROM transaction
WHERE transDate > '03/20/2008'
AND transDate < '03/25/2008'
GROUP BY 1
ORDER BY 1
LIMIT 100;
Greenplum Confidential
Avoid Multiple DISTINCT Operations
More Efficient Version
SELECT [Link],
sub2.cd_transId,
sub4.cd_storeId
FROM (SELECT customerId, COUNT(transId) as cd_transId
FROM ( SELECT customerId, transId
FROM transaction
WHERE transDate > '03/20/2008' AND transDate < '03/25/2008'
GROUP BY 1,2
) sub1
GROUP BY 1
) sub2,
(SELECT customerId, COUNT(storeId) as cd_storeId
FROM ( SELECT customerId, storeId
FROM transaction
WHERE transDate > '03/20/2008' AND transDate < '03/25/2008'
GROUP BY 1,2
) sub3
GROUP BY 1
) sub4
WHERE [Link] = [Link]
ORDER BY 1 LIMIT 100;
Greenplum Confidential
IsDate Function
Not part of the Greenplum suite; this function works like the
Oracle IsDate function.
CREATE OR REPLACE FUNCTION [Link](text)
RETURNS boolean AS $BODY$
begin
perform $1::date;
return true;
exception when others then
return false;
end
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
Greenplum Confidential
Using Arrays
With Greenplum it is possible to have an ARRAY data type.
Arrays may not be the distribution key for a table
You may not create indexes on ARRAY data type columns
EXAMPLE:
CREATE TEMPORARY TABLE dimensionsOID AS(
SELECT ARRAY(SELECT [Link]
FROM pg_class c,
pg_namespace n
WHERE [Link] = [Link]
AND [Link] = 'dimensions')
AS oidArray)
DISTRIBUTED RANDOMLY;
Greenplum Confidential
How to Parse Strings Efficiently
Here are some functions that will help parse a delimited string:
split_part(string, delimiter, occurrence);
EXAMPLE: SELECT split_part('name1=value1&name2=value2','&',2);
 split_part
--------------
 name2=value2

string_to_array(string, delimiter or pattern matching expression)
EXAMPLE: SELECT string_to_array('a,b,c,d,e,f,g',',');
 string_to_array
-----------------
 {a,b,c,d,e,f,g}
Greenplum Confidential
How to get Database Object Sizes
a. select pg_size_pretty(pg_database_size('datamart'));
b. select pg_size_pretty(pg_relation_size('[Link]'));
c. select pg_relation_size('[Link]');
d. select pg_database_size('datamart');
e. select pg_relation_size(schemaname||'.'||tablename) from pg_tables;
f. gpsizecalc -x dbname -s g
g. gpsizecalc -a dbname -t [Link] -s g
Greenplum Confidential
Skew Analysis
Finding uneven data distribution & fixing it!
Command: gpskew
The gpskew script can be used to determine if table data is equally distributed across
all of the active segments. gpskew reports the following information:
The total number of records in the specified table. Note that for parent tables
with inherited child tables, you must use the -h option to show child table skew as well.
The number of records on each segment.
The variance of records between segments. It takes the segment which has the
maximum count, and the segment which has the minimum count, and reports the
difference between these two segments.
The segment response times (if -r is supplied).
The distribution key column names (if -c is supplied).
Greenplum Confidential
Skew analysis
aparashar@mdw1 ~ $ gpskew -t public.t1
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Spawning parallel processes batch [1], please wait...
........
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Waiting for parallel processes batch [1], please wait...
2008[Link]gpskew:mdw1:aparashar-[INFO]:---------------------------------------
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Parallel process exit status
2008[Link]gpskew:mdw1:aparashar-[INFO]:---------------------------------------
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Number of Distribution Column(s)  = {1}
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Distribution Column Name(s)       = c1
2008[Link]gpskew:mdw1:aparashar-[INFO]:---------------------------------------
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Total records                     = 2560000
2008[Link]gpskew:mdw1:aparashar-[INFO]:---------------------------------------
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Skew calculation target type      = Primary Segments
2008[Link]gpskew:mdw1:aparashar-[INFO]:---------------------------------------
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Skew result
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Maximum Record Count              = 320512
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Segment instance hostname         = 3114
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Segment instance name             = gp
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Minimum Record Count              = 319488
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Segment instance hostname         = 3114
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Segment instance name             = gp
2008[Link]gpskew:mdw1:aparashar-[INFO]:-Record count variance             = 1024
2008[Link]gpskew:mdw1:aparashar-[INFO]:---------------------------------------
Greenplum Confidential
Skew analysis
Skew problems can also be seen through queries
Use EXPLAIN ANALYZE
ais_3114=# explain analyze SELECT count(*) from (SELECT count(*) from t2 group by c2) as a;
-> Subquery Scan a (cost=75.33..87.84 rows=1 width=0)
   Rows out: Avg 1250.0 rows x 8 workers.
   Max 1251 rows (seg2) with 6.305 ms to first row, 15 ms to end, start offset by -17 ms.
   -> HashAggregate (cost=75.33..87.83 rows=1000 width=4)
      Group By: t2.c2
      Rows out: Avg 1250.0 rows x 8 workers.
      Max 1251 rows (seg2) with 6.300 ms to first row, 15 ms to end, start offset by -17 ms.
      Executor memory: 1938K bytes avg, 1938K bytes max (seg0).
      Work_mem used: 138K bytes avg, 138K bytes max (seg2).
      Hash chain length 1.0 avg, 1 max, using 1251 of 110160 buckets.
      -> Redistribute Motion 8:8 (slice1) (cost=27.83..60.33 rows=1000 width=4)
         Hash Key: t2.c2
         Rows out: Avg 1250.0 rows x 8 workers at destination.
         Max 1251 rows (seg2) with 0.498 ms to first row, 2.729 ms to end, start offset by -16 ms.
         -> HashAggregate (cost=27.83..40.33 rows=1000 width=4)
            Group By: t2.c2
            Rows out: Avg 1250.0 rows x 8 workers.
            Max 1251 rows (seg2) with 3.400 ms to first row, 4.280 ms to end, start offset by -21 ms.
Greenplum Confidential
Skew Analysis
Example: (A Greenplum Customer)
> 75% of the tables - randomly distributed
Several very large tables had completely skewed distribution
Debug process:
Check the database skew
Start with big tables then work towards the smaller ones
Greenplum Confidential
Module 12
LAB
Advanced SQL Topics
Greenplum Confidential
Feedback and Questions
Please fill in the feedback form that you have
been given by your instructor. We value your
input in making this class as meaningful to our
students as possible!
THANK YOU!
Greenplum Confidential