Data Warehouse
Architecture
Objectives
Define Data Warehouse Architecture
Define Data Warehouse and Data Mart
Present a Data Warehouse Architectural
Framework
Demo – Data Enterprise Integration Server
Information Systems
Architecture
Information Systems Architecture is the
process of making the key choices that are
essential to the development of an
information system. Architecture includes:
◦ Guiding Principles:
◦ Approaches/philosophies
◦ “Logical” representations of a system
◦ Hardware/Operating System
◦ Computing model: client/server vs traditional vs
Web-based
◦ Tools and technologies
It is key, when making these choices that they
are:
◦ Requirements driven
◦ Take into consideration operational, technical and
financial feasibility
◦ Made within an architectural framework
◦
Architecture Drivers
There are a lot of Drivers of Architecture
Corporat
Corporat
ee
Politics
Politics
Business
Business System
System
Plan
Plan Qualities
Qualities
Architecture
Architecture
Emergin
Emergin
Current
Current gg
Systems
Systems Technolo
Technolo
gies
gies
End
EndUser
User
Require
Require
ments
ments
How is Architecture Different
from Design?
Its not – Architecture can be considered
‘high-level’ design
Architecture includes those aspects of the
design that are essential to the
information system
Architecture Example:
◦ Users must be able to self-serve (guiding
principle)
◦ “We will use a “hub and spoke” design where
data will be placed in a central data warehouse,
then be propagated to one or more data marts.
(approach)
◦ We will normalize data in the central warehouse
and use a dimensional design in the data marts
(approach)
◦ We will use Oracle 8i as our DBMS (technical
architecture)
Architecture vs Design
Not Architecture:
◦ The Order subject area will be composed of the
following tables: order_fact, customer_dim,
product_dim and time_dim
◦ The customer_dim table will have the following
attributes…….
The Value of
Communication:
Architecture
◦ To business sponsors, and business users
◦ Between members of the project team
Planning:
◦ Cross Check for Project Plan
◦ Ensure that all important components of the
data warehouse are accounted for
Flexibility and Growth
◦ Thinking about overall architecture will reduce
risk associated with the ‘success’ of the data
warehouse
Learning
Productivity and Reuse
What’s different about DW
Architecture?
Transaction processing systems – growth is
(relatively) predictable
Example:
◦ A company uses SAP for order processing
◦ They are opening a new retail store
◦ They predict (based on experience) 2000
transactions per week
◦ To process this volume, we need 3 workstations
to capture the transactions
◦ Peak time each day is 11-2 when 50% of
transactions occur
◦
What’s Different About Data
Warehouse Architecture?
Successdrives
explosive growth
Data ◦ More users
Warehouse ◦ More (complex)
queries
◦ More data
Performance is
unpredictable
Growth
SAP R/3
◦ Unpredictable
Siebel
queries
◦ Unpredictable use
patterns
Time
The Great Data Warehouse
Architecture Debate
Bill Inmon: “The
enterprise data If you build it,
They will come
warehouse”
Ralph Kimball: “data
marts”
The compromise:
“Hub and Spoke” or
“Federated” models
What is a Data Mart?
A data mart is a collection of subject areas
organized for decision support based on
the specific needs of a given user group.
Each mart may widely different from others
(as we will see)
Typically, data marts are built on the
dimensional data model:
◦ Facts – things that the organization wants to
measure: revenue, orders, shipments,
purchases, etc.
◦ Dimensions – the means by which the
organization wants to analyze the measures
(facts) – by customer, by time, by product –
BY ANY COMBINATION!!
What is a Data Mart?
There are two kinds of data marts--dependent
and independent.
A dependent data mart is one whose source is a
data warehouse.
An independent data mart is one whose source is
the legacy applications environment. All
dependent data marts are fed by the same
source--the data warehouse. Each independent
data mart is fed uniquely and separately by the
legacy applications environment.
Dependent data marts are architecturally and
structurally sound.
Independent data marts have a number of
significant issues
Data Warehouse vs.
Data Marts
What comes first
From the Data Warehouse to
Data Marts
Information
Individually Less
Structured
History
Departmentally Normalized
Structured Detailed
Organizationally More
Structured Data Warehouse
Data
Data Warehouse and Data
Marts
OLAP
Data Mart
Lightly summarized
Departmentally structured
Organizationally structured
Atomic
Detailed Data Warehouse Data
Data Mart Centric
Data Sources
Data Marts
Data Warehouse
Problems with Data Mart
Centric Solution
If you end up creating multiple warehouses, integrating them
is a problem
True Warehouse
Data Sources
Data Warehouse
Data Marts
Data Warehouse
Architectures
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational
Data Store
Logical Data Mart and Real-Time Data
Warehouse
Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL)
ETL
19
Generic two-level data warehousing architecture
L
One,
company-
wide
T warehouse
Periodic extraction data is not completely current in warehouse
20
Independent data mart data Data marts:
warehousing architecture Mini-warehouses, limited in scope
T
E
Separate ETL for each Data access complexity
independent data mart due to multiple data marts
21
Dependent data mart with operational data ODS provides option for
store: a three-level architecture obtaining current data
T
E Simpler data access
Single ETL for
enterprise data warehouse Dependent data marts
(EDW) loaded from EDW
22
Logical data mart and real time ODS and data warehouse
are one and the same
warehouse architecture
T
E
Near real-time ETL for Data marts are NOT separate databases,
Data Warehouse but logical views of the data warehouse
Easier to create new data marts
23
Three-layer data architecture for a data warehouse
24
The Major Data Warehouse
Architectures
Independent data marts
Hub and spoke architecture
Data mart bus architecture
Federated data warehouse
Independent data mart
architecture
End user
access/
applications
Data staging
Independent data marts
Data sources
•Developed independently.
•No conformed dimensions (i.e., does not have the same categories and labels
for data elements in data marts which would allowdata across data marts to be
combined).
•Built to a business unit or functional area.
Federated architecture
Data mart
Data staging
End user
access/
applications
Federated data store
Data stores
Data warehouse
•Key spokesperson: Doug Hackney (2000, 2002).
•Combines data in an organization’s existing data warehousing environment.
•Characterized by combing key metrics and measures in existing data marts,
data warehouses and legacy systems.
Data Warehouse
Architecture Selection
Architecture selection
Architecture selection factors
Information interdependence
Upper management’s information needs
Urgency of need
View of the data warehouse Data warehouse architectures
Compatibility with existing systems
Independent data mart
Nature of end user tasks Data mart bus architecture
Resource constraints Hub and spoke architecture
Perceived ability of the IT staff Federated
Source of sponsorship
Expert influence
Best Practice #1
ØUse a data model that is optimized for
information retrieval
◦ dimensional model
◦ denormalized
◦ hybrid approach
DW Architecture Best Practices12/04/09 32
Data Acquisition Processes
Extract Transform Load (ETL)
◦ the process of unloading or copying data
from the source systems, transforming it
into the format and data model required in
the BI environment, and loading it to the
DW
◦ also, a software development tool for
building ETL processes (an ETL tool)
◦ many production DWs use COBOL or other
general-purpose programming languages
to implement ETL
DW Architecture Best Practices12/04/09 33
The ETL Process
Capture/Extract
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load
34
Capture/Extract…obtaining a snapshot of a chosen subset
of the source data for loading into the data warehouse
Steps in data
reconciliation
Static extract = Incremental extract =
capturing a snapshot of the capturing changes that
source data at a point in have occurred since the
time last static extract
35
Scrub/Cleanse…uses pattern recognition and AI
techniques to upgrade data quality
Steps in data
reconciliation
(cont.)
Fixing errors: misspellings, Also: decoding, reformatting,
erroneous dates, incorrect field time stamping, conversion, key
usage, mismatched addresses, generation, merging, error
missing data, duplicate data, detection/logging, locating
inconsistencies missing data
36
Transform = convert data from format of operational
system to format of data warehouse
Steps in data
reconciliation
(cont.)
Record-level: Field-level:
Selection–data partitioning single-field–from one field to one field
Joining–data combining multi-field–from many fields to one, or
Aggregation–data summarization one field to many
37
Load/Index= place transformed data
into the warehouse and create indexes
Steps in data
reconciliation
(cont.)
Refresh mode: bulk rewriting Update mode: only changes in
of target data at periodic intervals source data are written to data
warehouse
38
Data Quality Assurance
data cleansing
◦ the process of validating and enriching
the data as it is published to the DW
◦ also, a software development tool for
building data cleansing processes (a
data cleansing tool)
◦ many production DWs have only very
rudimentary data quality assurance
processes
DW Architecture Best Practices12/04/09 39
Data Acquisition &
Cleansing
getting data loaded efficiently and
correctly is critical to the success of your
DW
◦ implementation of data acquisition &
cleansing processes represents from 50
to 80% of effort on typical DW projects
◦ inaccurate data content can be ‘the kiss
of death’ for user acceptance
DW Architecture Best Practices12/04/09 40
Best Practice #2
ØCarefully design the data acquisition and
cleansing processes for your DW
◦ Ensure the data is processed efficiently and
accurately
◦ Consider acquiring ETL and Data Cleansing tools
◦ Use them well!
DW Architecture Best Practices12/04/09 41
Data Model
Already discussed the benefits of a
dimensional model
No matter whether dimensional modeling
or any other design approach is used,
the data model must be documented
DW Architecture Best Practices12/04/09 42
Documenting the Data
Model
The best practice is to use some kind of data
modeling tool
◦ CA ERwin
◦ Sybase PowerDesigner
◦ Oracle Designer
◦ IBM Rational Rose
◦ Etc.
Different tools support different modeling notations,
but they are more or less equivalent anyway
Most tools allow sharing of their metadata with an ETL
tool
DW Architecture Best Practices12/04/09 43
Data Model Standards
data model standards appropriate for the
environment and tools chosen in your data
warehouse should be adopted
considerations should be given to data access
tool(s) and integration with overall enterprise
standards
standards must be documented and enforced
within the DW team
◦ someone must ‘own’ the data model
to ensure a quality data model, all changes should
be reviewed thru some formal process
DW Architecture Best Practices12/04/09 44
Data Model Metadata
Business definitions should be recorded
for every field (unless they are technical
fields only)
Domain of data should be recorded
Sample values should be included
As more metadata is populated into the
modeling tool it becomes increasingly
important to be able to share this data
across ETL and Data Access tools
DW Architecture Best Practices12/04/09 45
Metadata Architecture
The strategy for sharing data model and
other metadata should be formalized and
documented
Metadata management tools should be
considered & the overall metadata
architecture should be carefully planned
DW Architecture Best Practices12/04/09 46
Best Practice #3
ØDesign a metadata architecture that allows
sharing of metadata between components
of your DW
DW Architecture Best Practices12/04/09 47
Alternative Architecture
Approaches
Bill Inmon: “Corporate Information Factory”
Hub and Spoke philosophy
“JBOC” – just a bunch of cubes
Let it evolve naturally
DW Architecture Best Practices12/04/09 48
What We Want
(Architectural Principal)
In most cases, business and IT agree that
the data warehouse should provide a
‘single version of the truth’
Any approach that can result in disparate
data marts or cubes is undesireable
This is known as data silos or…
DW Architecture Best Practices12/04/09 49
Enterprise DW
Architecture
how to design an enterprise data
warehouse and ensure a ‘single
version of the truth’?
according to Kimball:
◦ start with an overall data architecture
phase
◦ use “Data Warehouse Bus” design to
integrate multiple data marts
◦ use incremental approach by building
one data mart at a time
DW Architecture Best Practices12/04/09 50
Data Warehouse Bus
Architecture
named for the bus in a computer
◦ standard interface that allows you to plug
in cdrom, disk drive, etc.
◦ these peripherals work together smoothly
provides framework for data marts to
fit together
allows separate data marts to be
implemented by different groups,
even at different times
DW Architecture Best Practices12/04/09 51
Data Mart Definition
data mart is a complete subset of the
overall data warehouse
◦ a single business process OR
◦ a group of related business processes
think of a data mart as a collection of
related fact tables sharing
conformed dimensions, aka a ‘fact
constellation’
DW Architecture Best Practices12/04/09 52
Designing The DW Bus
determine which dimensions will be
shared across multiple data marts
conform the shared dimensions
produce a master suite of shared dimensions
determine which facts will be shared
across data marts
conform the facts
standardize the definitions of facts
DW Architecture Best Practices12/04/09 53
Dimension Granularity
conformed dimensions will usually be
granular
◦ makes it easy to integrate with various base
level fact tables
◦ easy to extend fact table by adding new facts
◦ no need to drop or reload fact tables, and no
keys have to be changed
DW Architecture Best Practices12/04/09 54
Conforming Dimensions
by adhering to standards, the separate
data marts can be plugged together
◦ e.g. customer, product, time
they can even share data usefully, for
example in a drill across report
ensures reports or queries from different
data marts share the same context
DW Architecture Best Practices12/04/09 55
Data Consolidation
a current trend in BI/DW is ‘data
consolidation’
from a software vendor perspective, it is
tempting to simplify this:
◦ ‘we can keep all the tables for all your disparate
applications in one physical database’
DW Architecture Best Practices12/04/09 56
Data Integration
To truly achieve ‘a single version of
the truth’, must do more than
simply consolidating application
databases
Must integrate data models and
establish common terms of
reference
DW Architecture Best Practices12/04/09 57
Best Practice #4
ØTake an approach that consolidates data into
‘a single version of the truth’
◦ Data Warehouse Bus
conformed dimensions & facts
◦ OR?
DW Architecture Best Practices12/04/09 58
Operational Data Store
(ODS)
a single point of integration for disparate
operational systems
contains integrated data at the most
detailed level (transactional)
may be loaded in ‘near real time’ or
periodically
can be used for centralized operational
reporting
DW Architecture Best Practices12/04/09 59
Best Practice #5
ØConsider implementing an ODS only when
information retrieval requirements are
near the bottom of the data abstraction
pyramid and/or when there are multiple
operational sources that need to be
accessed
◦ Must ensure that the data model is integrated,
not just consolidated
◦ May consider 3NF data model
◦ Avoid at all costs a ‘data dumping ground’
DW Architecture Best Practices12/04/09 60
Capacity Planning
DW workloads are typically very
demanding, especially for I/O capacity
Successful implementations tend to grow
very quickly, both in number of users
and data volume
Rules of thumb do exist for sizing the
hardware platform to provide adequate
initial performance
◦ typically based on estimated ‘raw’ data size
of proposed database e.g. 100-150 Gb per
modern CPU
DW Architecture Best Practices12/04/09 61
SMP Server Scale Up
Scaling performance within a single SMP
server is referred to as ‘scale up’
Database benchmarks suggest Windows
scalability is near that of Linux
IBM claims near-linear scalability for Linux
(on commodity hardware) up to about 4
processors
◦ Probably not cost effective to scale up Linux
much beyond 4 processors
IBM claims near-linear scalability for AIX
on POWER5 up to about 8 processors
DW Architecture Best Practices12/04/09 62
Scale Up vs. Scale Out
To obtain the total number of processors
required for the estimated DW
workload, must plan either to scale up
or scale out
Both options are viable but, all other
things being equal, scaling up is less
disruptive to end users and requires
less work to implement
◦ scaling up can offer lower hardware
investment, if practical
◦ however, network bandwidth or latency
issues can limit effectiveness of parallelism
DW Architecture Best Practices12/04/09 63
Best Practice #6
ØCreate a capacity plan for your BI
application & monitor it carefully
ØConsider future additional performance
demands
◦ Establish standard performance benchmark
queries and regularly run them
◦ Implement capacity monitoring tools
◦ Build scalability into your architecture
◦ May need to allow for scaling both up and
out!
DW Architecture Best Practices12/04/09 64
Open Source Affordability
Another emerging trend in IT generally is to
utilize Open Source software running on
commodity hardware
◦ this is expected to offer lower total cost of ownership
◦ certainly, GNU/Linux and other Open Source
initiatives do provide very good functionality and
quality for minimal cost
This trend also applies to BI & DW:
◦ most traditional rdbms’s are now supported on Linux
◦ however, open source rdbms’s lag behind on
providing good performance for DW queries
DW Architecture Best Practices12/04/09 65
DW Appliances
DW appliances, consisting of
packaged solutions providing all
required software and hardware, are
beginning to offer very promising
price/performance
production experience is limited so
far, so this is not yet a ‘best practice’
DW Architecture Best Practices12/04/09 66
Role of an ODS in DW
Architecture
In the case where an ODS is a
necessary component of the overall
DW, it should be carefully integrated
into the overall architecture
Can also be used for:
◦ Staging area
◦ Master/reference data management
◦ Etc…
DW Architecture Best Practices12/04/09 67