A Data Warehousing Primer
Roland Bouman (Strukton Rail) http://rpbouman.blogspot.com/
Starring Sakila
Topics
Starring Sakila
Terminology
Business Intelligence Data Warehouse Dimensional Model Star Schema OLAP Cube
Data Warehousing Terminology
Business Intelligence (BI) Skills, technologies, applications and practices to acquire a better understanding of the commercial context of your business. Data Warehouse
Dimensional Model Star Schema OLAP Cube
What is Business Intelligence?
Business Intelligence Data Warehouse
A database designed to support Business Intelligence
Dimensional Model Star Schema OLAP Cube
What is a Data Warehouse?
Business Intelligence Data Warehouse Dimensional Model A logical data model that divides data in two kinds: Facts and Dimensions Star Schema
OLAP Cube
What is the Dimensional Model?
Business Intelligence Data Warehouse Dimensional Model Star Schema
Physical implementation of the Dimensional Model on a RDBMS which maps a dimension to a single table
OLAP Cube
What is a Star Schema?
Business Intelligence Data Warehouse Dimensional Model Star Schema OLAP On-Line Analytical Processing: querying muli-dimensional data, cornerstone of most BI applications Cube
What is OLAP?
Business Intelligence Data Warehouse Dimensional Model Star Schema OLAP Cube
Multi-dimensional data structure suitable for OLAP queries
What is a Cube
Business Intelligence
Understanding your Business
Front end Applications: Reports Charts and Graphs OLAP Pivot tables Data Mining Dashboards
Back end, Infrastructure ETL Extract Transformation Load Data Warehouse Data Mart Metadata ROLAP Cube
Business Intelligence
Back-end
Source Systems, External Data
Front-end
Business Intelligence Applications
Data Warehouse
Staging Area
Meta Data
Extract
Transform
Load
Present
High Level BI Architecture
Data Warehouse
Business Intelligence Database
Ultimately, it's just a Relational Database
Tables, Columns, Keys... Ease of use Performance Integration, Standardization, Data cleaning Add and maintain history
...But designed for BI applications
Data from various source systems
Data Warehouse
OLTP
OLAP
Operational 'Always' on All kinds of users Many users Directly supports business process Keep a Record of Current status
Tactical, Strategic Periodically Available Managers, Directors Few(er) users Decision support, long-term planning Maintain history
OLTP vs OLAP: Application Characterization
OLTP
OLAP
Subject Oriented Add, Modify, Remove single rows Human data entry Queries for small sets of rows with all their details Standard queries
Aspect Oriented Bulk load, rarely modify, never remove Automated ETL jobs Scan large sets to return aggregates over arbitrary groups Ad-hoc queries
OLTP vs OLAP: data processing
OLTP
OLAP
Entity-Relationship model Entities, Attributes, Relationships Foreign key constraints Indexes to increase performance Normalized to 3NF or BCNF
Dimensional model Facts, Dimensions, Hierarchies Ref. integrity ensured in loading process Scans on Fact table obliterates indexes Denormalized Dimensions (<= 1NF)
OLTP vs OLAP: database schema organization
Dimensional Model
Organizing data to suit Business Intelligence
Two kinds of data Facts Dimensions
The Dimensional Model
Facts Measures/Metrics of a Business Process Typical Metrics Cost, Units Sold, Profit
The Dimensional Model: Facts
Dimensions Describe aspects of Business Process Dimensions typically not inter-dependent Who? What? Where? When? Why? Typical Dimensions: Customer (who?), Product (what?), Date/Time (when?)
The Dimensional Model: Dimensions
Dimension Attributes organized in Hierarchies Date dimension examples: Year, Quarter, Month, Day Year, Week, Day Metrics typically numeric and additive
Navigate fact data Choose particular values for dimension Aggregate fact data at chosen level of hierarchy
The Dimensional Model: Navigating Facts with Dimensions
Date Dimension Location Dimension All locations America All America North South Europe All Europe East West
2008 Q4 All Months $ 3850 $ 2050 $ 1275 $ 775 $ 1800 $ 800 $ 1000 October $ 1000 $500 $ 300 $ 200 $ 500 $ 250 $ 250 November $ 1350 $ 750 $ 500 $ 250 $ 600 $ 250 $ 350 December $ 1500 $ 800 $ 475 $ 325 $ 700 $ 300 $ 400
Dimensional Example: Crosstab
Star Schema
Dimensional Model Implementation
Central Fact Table
Columns for storing Metrics 'Foreign Key' columns point to Dimension Typically normalized and not pre-aggregated Surrogate key Descriptive attributes organized in hierarchies No Foreign Keys to other tables Typically heavily denormalized
Dimension maps to a Dimension table
Stars Schema Characteristics
Store Date Film Time Rentals
Staff
Customer
Star Schema example: Sakila Rentals
Star schema is 'just' an implementation
Optimized for simplicity Optimized for performance (?) Heavily denormalized dimensions
Snowflake: Star Schema Alternative
Still a dimensional model Still a central fact table Normalized dimensions Easier maintenance of dimensions
Stars Schema Characteristics
Year Quarter Month Date Week Hour
Country City Store Film Rentals Minute Customer Country Staff City Rating Language
Snow Flake example: Sakila Rentals
Desinging Star Schemas
Starring Sakila
Select Business Process
Sales, Purchase, Storage, Transport, ... Facts: Key Event in Business Process Metrics (Fact Attributes): Count or Amount What? When? Where? Who? Why?
Define Facts and Key Metrics
Choose Dimensions and Hierarchies
Dimensional Design
MySQL Sample Database
http://dev.mysql.com/doc/sakila/en/sakila.html Overly simplified database schema
DVD rental business
Typical OLTP database
Dimensional Model example
Rental Business Process
Customer visits store, picks DVD DVD taken out of store inventory by staff member Customer returns home and enjoys DVD Customer returns to store with DVD DVD returned to staff member Staff member collects payment made by customer
Dimensional Model example
Category
Language
Actor
Country
Film
City
Store
Address
Inventory
Staff
Rental
Customer
3NF Source schema: Sakila Rentals
Select Business Process
Rentals Count (number of rentals) Rental Duration
Identify Facts
Choose Dimensions
What: Films When: Rental, Return
Who: Customer, Staff Where: Store
Example Business Process: Rentals
Who? Staff Customer When? What? Film Fact: Rentals Time Store Where? Date
Target Star Schema
Rental Star Schema
Inventory
Staff Rental
Customer
A star is born: Rentals 3NF
Category
Film Category Store
Film Inventory
Staff Rental
Customer
A star is born: Rentals 3NF
Category
Film Category Store
Film Inventory
Staff Rental
Customer
A star is born: Denormalize
Film Category Store
Store
Staff Rental
Customer
A star is born: Denormalize
Address
Film Category Store
Store
Staff Rental
Customer
A star is born
Address
Film Category Store
Store
Staff Rental
Customer
A star is born: Denormalize
Language Film Category Address Store Store Address
Staff Rental
Customer
A star is born: Denormalize
City Language Film Category Address Store Store
City
Address
Staff Rental
Customer
A star is born: Denormalize
Country
Country
City Language Film Category Address Store Store
City
Address
Staff Rental
Customer
A star is born: Rental Snowflake
What: Film Language Film Category
Where: Store Country City Address Store
Who: Staff Store Staff
Who: Customer Country City Address Customer
Rental
A star is born: Rental Star Schema
Something is missing....
Who ? (Customer, Staff) What ? (Film) Where ? (Store) .... ?
Dimensional Design
What: Film
Where: Store
Who: Staff
Who: Customer
Rental
When: Date
When: Time
A star is born: Rental Date and Time
What: Film
Where: Store
Who: Staff
Who: Customer
Rental
When: Rental Date
When: Rental Time
When: Return Date
When: Return Time
Role Playing: Date/Time for both Rentals and Returns
Denormalization through Joins
Denormalization through Flattening (Repeating Group)
Loading a Data Warehouse
ETL with Pentaho Data Integration
Pentaho Data Integration
sourceforge.net/projects/pentaho/
ETL and much more Transformations:
Extract, Load and Transform Organize multiple transformations to a complete ETL process
Jobs:
> 30 RDBMS-es, > 130 Transformation Steps
Dimensional Design
First load dimensions, finally load fact Mail notification in case of success / failure
Job: Rental ETL Process
First load dimensions, finally load fact Mail notification in case of success / failure
Job: Rental ETL Process
Get store, lookup address (subtransformation) and manager Load store dimension table
Job: Rental ETL Process
Get store, lookup address (subtransformation) and manager Load store dimension table
Job: Rental ETL Process
Get address, lookup city and country Concatenate address if necessary
Job: Rental ETL Process
This was just a simple example More complex example: importing XML
<?xml version="1.0" encoding="UTF-8"?> <result> <actors> <actor id="00000015">Anderson, Jeff</actor> <actor id="00000015">Anderson, Jeff</actor> .. </actors> <videos> <video> <title>The Fugitive</title> <genre>action</genre> .... </video> ... </videos> </result>
Job: Rental ETL Process
This was just a simple example More complex example: importing XML
Job: Rental ETL Process
OLAP
OLAP Pivot Table with Pentaho Analysis Services
Pentaho Analysis Services
Part of Pentaho BI Server
sourceforge.net/projects/pentaho/ Based on Mondrian ROLAP server sourceforge.net/projects/mondrian/
Dimensional Design
Pentaho Schema Workbench
Map data warehouse tables to a logical Cube
Dimensional Design
Pentaho Analysis View:
Dimensional Design
Pentaho Solutions Wiley ISBN 978-0-470-48432-6 September 2009 630+ page paperback Amazon pre-order $31.50 Regular: $50.00
Upcoming Book: Pentaho Solutions