Optimize Your Business Data Warehouse
https://www.coursera.org/specializations/data-warehousing
Evaluate business needs, design a data warehouse, and integrate and visualize
data using dashboards and visual analytics.
This Specialization covers data architecture skills that are increasingly critical across a broad
range of technology fields. You’ll learn the basics of structured data modeling, gain practical
SQL coding experience, and develop an in-depth understanding of data warehouse design and
data manipulation. You’ll have the opportunity to work with large data sets in a data
warehouse environment, using MicroStrategy, a leading BI tool, and its OLAP (online analytical
processing) and Visual Insights capabilities, to create dashboards and visual analytics.
In the final Capstone Project, you'll
apply your skills to build a small, basic data warehouse, populate it with data, and create
dashboards and other visualizations to analyze and communicate the data to a broad audience.
University of Colorado
* The scope and integration level are important factors in determining an appropriate
architecture.
- Scope refers to the breadth of an organization supported by a data warehouse. Scope can be
measured in various ways, such as the number of data sources used, the number of source systems
providing inputs, or the number of organizational units providing inputs to or using a data warehouse.
- Integration level refers to several data quality indicators across data sources. Integration level
involves completeness, consistency, conformity, and duplication across data sources, not within
individual data sources.
* Two opposing approaches provide limits on architectures for data warehouses.
- The top-down approach, known as the enterprise data warehouse, provides a single data
warehouse for an organization. Through logical centralization, the top-down approach achieves a
high integration level and a large scope.
--> The top-down architecture has three tiers: a staging area, the data warehouse, and data marts.
- The staging area provides temporary storage of transformed data before loading into
the data warehouse. The staging area is particularly useful for data warehouses with a
large number of operational databases and external data sources requiring complex
data transformations.
- The enterprise data model, known as an EDM, defines the structure of the data warehouse in the
detail required to access operational databases and external data sources. Departmental users
generally need access to small portions of the data warehouse instead of the entire data warehouse.
- To provide them with faster access, while isolating them from data needed by other
user groups, smaller data warehouses called data marts are often used. Data marts act
as the interface between business analysts and the enterprise data warehouse, storing a
derived part of the data warehouse, and refreshing this derived part on a periodic basis.
- The bottom up approach provides independent data marts with lower integration levels. A data
mart is a data warehouse with a small scope justified by tangible benefits from critical business
reporting needs and lower development risks from smaller project sizes.
- In a bottom-up architecture, important business reporting requirements
spanning operational databases drive the development of independent data marts. Each
data mart is relatively small in scope supporting several organizational units and data
sources. The integration is limited because of the small scope of data marts. The set of
data marts may evolve into a large data warehouse if the organization can justify the
expense of building an enterprise data model.
One approach, known as the data mart bus approach, provides an additional layer of data
integration to transform selected data from independent data marts. The level of data
integration in the data mart bus approach is much less than the level in an enterprise
data warehouse.
* The data warehouse maturity model provides a framework to view an organization's progress and
to guide investment decisions.
h) Employment opportunities
i) Course software requirements
For this course, you need to install a database server (Oracle or MySQL), Pivot4J plugin for the
Pentaho Business Analytics Platform, and Pentaho Data Integration. For both Pentaho products, you
should install the community editions. You will use Pivot4J in module 2 and Pentaho Data Integration
in module 5. The database server is used in module 5 as part of using Pentaho Data Integration. You
should also install a database client to create and populate the tables used in module 5.
WEEK 2
The slice summarize operator replaces one or more dimensions with summary calculations. The
summary calculation often indicates a total value across members or the central tendency of the
dimension, such as the average or median value.
The dice operator replaces a dimension with a subset of values of the dimension.
This example shows the result of the dice operation to display sales for the USA states of Utah,
Arizona, and Colorado for January 1, 2013.
A dice operation typically follows a slice operation and returns a subset of the cells displayed in the
preceding slice.
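In SQL terms, a dice like the one above can be sketched as a query over a hypothetical star schema (a Sales fact table joined to Store and TimeDim dimension tables; the table and column names are assumptions, not the course's actual schema):

    SELECT s.StoreState, SUM(f.SalesDollar) AS SumSales
    FROM Sales f
    JOIN Store s   ON f.StoreNo = s.StoreNo
    JOIN TimeDim t ON f.TimeNo  = t.TimeNo
    WHERE s.StoreState IN ('UT', 'AZ', 'CO')      -- dice: subset of the Store dimension
      AND t.TimeDate = DATE '2013-01-01'          -- slice fixed to a single day
    GROUP BY s.StoreState;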
* The drill-down operator allows analysts to navigate from a broader level to a narrower or more
detailed level, such as navigating from product group to individual products.
Roll-up is the opposite of drill-down. Roll-up involves moving from a more detailed level to a broader
level of a hierarchical dimension.
Pivot tables have become the standard interface for data cubes.
Overview of WebPivotTable
A table design for a data warehouse typically consists of dimension tables connected to fact tables
in one-to-many relationships. The attributes of a dimension map to columns in a dimension
table. Fact tables contain measures as well as foreign keys to dimension tables.
A common error early in the design process is to omit dimensions related to a fact table.
The size of a fact table can be estimated using the cardinality of each dimension discounted by the
rate of sparsity. Sparsity increases as the number of dimensions increases and the cardinality of a
dimension increases.
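A hypothetical worked example (the numbers are illustrative only): with dimensions of cardinality 1,000 items, 100 stores, and 365 days, the full cube has 1,000 x 100 x 365 = 36.5 million cells; with 90 percent sparsity (only 10 percent of item/store/day combinations actually have sales), the fact table holds roughly 3.65 million rows.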
Example:
The star schema can represent one data cube. This ERD consists of four dimension entity types, Item,
Customer, Store, and TimeDim, along with one transaction fact entity type called Sales.
When converted to a table design, the Sales table has foreign keys to each dimension table, Item,
Customer, Store, and TimeDim.
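As a minimal sketch of that conversion (column names and data types are assumptions, not the course's distributed scripts), the Sales fact table might be declared as:

    CREATE TABLE Sales
    ( SalesNo     INTEGER PRIMARY KEY,                         -- artificial identifier
      ItemNo      INTEGER NOT NULL REFERENCES Item(ItemNo),    -- FK to Item dimension
      CustNo      INTEGER NOT NULL REFERENCES Customer(CustNo),
      StoreNo     INTEGER NOT NULL REFERENCES Store(StoreNo),
      TimeNo      INTEGER NOT NULL REFERENCES TimeDim(TimeNo),
      SalesUnits  INTEGER,                                     -- measure
      SalesDollar DECIMAL(12,2) );                             -- measure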
* For related business processes that share some of the dimension tables, a star schema can be
extended into a constellation schema with multiple fact entity types, as shown in this ERD.
- When converted to a table design, the Inventory entity type becomes a fact table, and its one-to-many
relationships become foreign keys in the fact table.
- The Inventory entity type contains a number of measures, including the quantity on hand of an
item, the cost of an item, and the quantity returned.
- All dimension tables are shared among both fact tables, except for the supplier and customer
tables.
The constellation pattern is important because it shows shared dimensions to represent multiple
data cubes.
* A snowflake schema has multiple levels of dimension tables related to one or more fact tables.
In this ERD, the Store dimension has been split into two entity types, Store and Division, along with a
one-to-many relationship from Division to Store.
You should consider the snowflake schema instead of the star schema for small dimension tables
that are not fully normalized.
The Store table in the previously shown star schema is not fully normalized because division id
determines division name and division manager.
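A minimal sketch of the snowflake split (hypothetical column names):

    CREATE TABLE Division
    ( DivNo      INTEGER PRIMARY KEY,
      DivName    VARCHAR(50),
      DivManager VARCHAR(50) );

    CREATE TABLE Store
    ( StoreNo    INTEGER PRIMARY KEY,
      StoreCity  VARCHAR(50),
      StoreState VARCHAR(20),
      DivNo      INTEGER NOT NULL REFERENCES Division(DivNo) );  -- 1-M from Division to Store

In the star schema version, DivName and DivManager would instead be repeated in each Store row.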
* Time representation is crucial for data warehouses, because most data warehouse queries use
time in conditions.
The principal use of time is to record the occurrence of facts. The simplest representation is a
timestamp data type for a column in a fact table.
In place of a timestamp column, many data warehouses use a foreign key to a time dimension table,
as shown in previous examples in this lesson.
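A hedged sketch of a time dimension table (hypothetical columns) that a fact table references through a TimeNo foreign key instead of storing a raw timestamp:

    CREATE TABLE TimeDim
    ( TimeNo        INTEGER PRIMARY KEY,
      TimeDate      DATE,
      TimeMonth     INTEGER,
      TimeQuarter   INTEGER,
      TimeYear      INTEGER,
      TimeDayOfWeek INTEGER );

Queries can then group or filter on precomputed attributes such as month, quarter, or year without applying date functions to every fact row.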
Summarizability patterns involve the relationships among dimension levels and the relationships from
dimension tables to fact tables.
* Drill-down incompleteness: in this example, the parent college level drills down to the department child level with a smaller
total. The enrollment in the business college is omitted in the department level because the business
college does not have departments.
* Roll up incompleteness is the reverse of drill down incompleteness.
Rolling up from a child, that is a finer level, to a parent, that is a coarser level, shows a smaller total
indicating that measured values attributed to child members have not been allocated to parent
members.
In this example, the product child level rolls up to the category parent level with a smaller total.
The values of minimum and maximum cardinalities determine schema patterns for the three
dimension summarizability problems.
- The drill-down incompleteness problem involves a minimum cardinality of zero for the parent
entity type, as shown in ERD A.
- The roll-up incompleteness problem involves a child's minimum cardinality of zero, as shown in
ERD B.
Without an appropriate methodology, the best efforts will likely fail to produce a data warehouse
with high value for an organization.
Artifacts of a data warehouse design are dimensional models, a schema design using patterns
presented in other lessons, data integration procedures, and data marts to support business analysis.
* The demand-driven data warehouse design methodology, also known as the requirements-driven
approach, first proposed by Kimball in 1988, is one of the earliest data warehouse design
methodologies.
The demand-driven methodology has three phases: identifying data marts and the subsets of user
requirements they serve, building a matrix relating data marts and dimensions, and designing fact
tables.
* The supply driven methodology emphasizes the analysis of existing data sources.
Entities in ERDs of existing data sources are analyzed to provide a starting point for the data
warehouse design.
The supply-driven methodology has three phases: classify entities, refine dimensions, and refine the
schema.
* The hybrid data warehouse design methodology, proposed in 2001, combines the demand-driven
and supply-driven methodologies.
The hybrid methodology involves a demand-driven stage, a supply-driven stage, and a third
stage to integrate the demand-driven and supply-driven stages.
The demand and supply stages could be done independently as shown in this diagram.
The overall emphasis in the hybrid approach is to balance demand and supply aspects of data
warehouse design, possibly aided by automated tools. The demand-driven stage collects
requirements using the goal-question-metric (GQM) approach.
The GQM approach provides informal guidelines to define measures and dimensions from the
goals.
The second step of the hybrid methodology involves analysis of existing ERDs. The methodology
provides guidelines to identify fact and dimension tables in existing ERDs. Potential fact tables are
identified based on the number of additive attributes. Dimension tables are involved in one-to-many
relationships with fact tables.
The third step of the hybrid methodology integrates the dimensional model from the demand stage
and the star schema from the supply stage.
WEEK 4
Data Integration Concepts, Processes, and Techniques
Concepts of data integration processes
Data integration adds value to disparate data sources that contribute to enterprise data
warehouses. The primary goal of data integration is to provide a single source of truth for decision-
making.
Refresh processing updates the data warehouse with change data from source systems; it involves a
preparation phase, an integration phase, and an update phase.
- The Preparation Phase manipulates change data from individual source systems.
- Extraction retrieves data from individual data sources.
- Transportation moves extracted data to a staging area.
- Cleaning involves a variety of tasks to standardize and improve the quality of the extracted
data.
- Auditing records results of the cleaning process, performing completeness
and reasonableness checks and handling exceptions.
- The integration phase merges separate clean sources into one source.
- Merging can involve removal of inconsistencies among the source data.
- Auditing records results of the merging process, performing completeness
and reasonableness checks, and handling exceptions.
- The update phase involves propagating the integrated change data to various parts of the data
warehouse.
After propagation, notification can be sent to user groups and administrators. In addition to
periodic refreshment, data integration involves initial population of a data warehouse.
Change data concepts
This lesson emphasizes change data used in data integration processes to populate and refresh your
data warehouse.
- Source system requirements involve modifications to source systems to acquire change data.
Typical changes to source systems are new columns, such as timestamps required
for queryable change data, and trigger code required for cooperative change data.
- Processing level involves resource consumption and development required for data
integration procedures.
Logs and snapshot change data involve substantial processing.
1) Cooperative change data involve notification from a source system about changes. The notification
typically occurs at transaction time using a trigger.
A trigger is a rule executed by a DBMS when an event occurs, such as inserting a new row.
A trigger involves software development and execution as part of a source system. Cooperative
change data can be input immediately into a data warehouse or placed in a queue or staging area
for later processing, possibly with other changes.
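A minimal Oracle-style sketch of cooperative change data capture (the Orders source table, the queue table, and all column names are hypothetical):

    CREATE TABLE OrdersChangeQueue
    ( OrdNo      INTEGER,
      ChangeType CHAR(1),
      ChangeTime TIMESTAMP );

    CREATE OR REPLACE TRIGGER Orders_Change_Tr
    AFTER INSERT OR UPDATE ON Orders
    FOR EACH ROW
    DECLARE
      v_type CHAR(1);
    BEGIN
      IF INSERTING THEN v_type := 'I'; ELSE v_type := 'U'; END IF;
      -- record the change so a later data integration run can pick it up
      INSERT INTO OrdersChangeQueue (OrdNo, ChangeType, ChangeTime)
      VALUES (:NEW.OrdNo, v_type, SYSTIMESTAMP);
    END;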
2) Logged change data involve files that record changes or other user activity.
For example, a transaction log contains every change made by transactions, and a web log contains
page access histories, called clickstreams, of website visitors.
3) Queryable change data come directly from a data source via a query.
Queryable change data require time stamping in a data source. Since few data sources contain time
stamps for all data, queryable change data usually are augmented with other kinds of change data.
Queryable change data are most applicable for fact tables using columns such as order date,
shipment date and hire date, that are stored at operational data sources.
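A hedged sketch of a queryable change data extract, assuming a hypothetical Orders source table whose OrdDate column serves as the timestamp:

    SELECT OrdNo, CustNo, OrdDate, OrdAmount
    FROM Orders
    WHERE OrdDate > DATE '2013-01-01';   -- orders entered since the previous extract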
4) Snapshot change data involve periodic dumps of source data. To derive change data, a difference
operation uses the two most recent snapshots. The result of a difference operation is called a delta.
Generating a delta involves comparing source files to identify new rows, changed rows, and deleted
rows.
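A minimal sketch of a delta computed from two snapshots (table names are hypothetical; MINUS is Oracle syntax, other DBMSs use EXCEPT):

    -- new or changed rows since the previous snapshot
    SELECT * FROM CustomerSnapCurrent
    MINUS
    SELECT * FROM CustomerSnapPrevious;

    -- deleted or changed rows: run the difference in the other direction
    SELECT * FROM CustomerSnapPrevious
    MINUS
    SELECT * FROM CustomerSnapCurrent;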
Data cleaning tasks
Parsing decomposes complex objects, usually text, into their constituent parts.
For data integration, parsing is important for decomposing multi-purpose text data into individual
fields.
Pattern matching with regular expressions
Regular expressions specify patterns for parsing text fields with multiple components, common in
data integration tasks.
Regular expression tools are widely supported in data integration tools, DBMSs, application
programming interfaces, and website testing tools.
(..........
..................
..................)
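As one hedged illustration of a regular expression in SQL (Oracle's REGEXP_LIKE; the CustomerStage table and Phone column are hypothetical), this query keeps only rows whose phone number matches an NNN-NNN-NNNN pattern:

    SELECT CustNo, Phone
    FROM CustomerStage
    WHERE REGEXP_LIKE(Phone, '^[0-9]{3}-[0-9]{3}-[0-9]{4}$');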
Entity matching identifies duplicate records in two or more data sources when no common reliable
identifier exists.
The classic application involves identification of duplicate customers in data sources from different
firms.
Because a common identifier does not exist, duplicates must be identified from other common fields
such as names, address components, phone numbers and ages.
Because these common fields come from different data sources, inconsistencies and nonstandard
representations may exist, complicating the matching process.
Matched entities can be merged or linked. If merging two entities, sometimes old data from one
source is discarded. In addition, new fields can be added to obtain data unique to each data
source. Linking maintains separate entities but notes relationships.
Entity matching approaches use distance functions to determine if quasi identifiers in two entities
indicate the same entity.
In a geometric sense, distance is the amount of space between two points.
For entity matching, a point is a combination of values, one from each quasi-identifier. Numeric quasi-
identifiers are easy to compare, but text quasi-identifiers can be difficult to compare.
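A hedged sketch of approximate matching on a text quasi-identifier using Oracle's UTL_MATCH package (the CustomerA and CustomerB staging tables, their columns, and the similarity threshold are assumptions):

    SELECT a.CustNo AS CustA, b.CustNo AS CustB,
           UTL_MATCH.JARO_WINKLER_SIMILARITY(a.CustName, b.CustName) AS NameSim
    FROM CustomerA a
    JOIN CustomerB b ON a.CustZip = b.CustZip      -- cheap blocking field to limit comparisons
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(a.CustName, b.CustName) >= 90;  -- likely duplicates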
WEEK 5
Architectures, Features, and Details of Data Integration Tools
Architectures and marketplace
To support the complexity of data integration processes to populate and refresh a data warehouse,
software products for data integration have been developed.
b) The extraction, loading, and transformation architecture, abbreviated as ELT, uses a relational
DBMS to perform transformations after extraction and loading.
- The ELT Architecture is designed to utilize high performance features of the enterprise DBMSes.
- ETL architecture supporters emphasize the DBMS independence of ETL engines, while ELT architecture
supporters emphasize the superior optimization technology in relational DBMS engines.
- ETL architectures can usually support more complex operations in a single transformation than ELT
architectures, but ELT architecture may use less network bandwidth.
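A minimal sketch of the ELT pattern: raw rows are first loaded into a staging table, then the DBMS performs the transformation with set-oriented SQL (StageOrders, TimeDim, and Sales are hypothetical tables):

    INSERT INTO Sales (SalesNo, ItemNo, CustNo, StoreNo, TimeNo, SalesDollar)
    SELECT s.OrdNo, s.ItemNo, s.CustNo, s.StoreNo, t.TimeNo, s.OrdAmount
    FROM StageOrders s
    JOIN TimeDim t ON t.TimeDate = s.OrdDate;   -- transformation executes inside the DBMS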
- High execution indicates large firm size and resources to support current and future products.
- High vision indicates a firm with broad integrated product offerings.
- Workflow and component specifications are the most salient parts of an IDE for data integration.
- A repository is a design database for all features of a data integration tool.
The repository stores design objects and relationships, maintains dependencies among design
objects, and provides documentation, with predefined reports and convenient formatting.
- Job management supports scheduling and monitoring of jobs for executing data integration
workflows.
Scheduling typically handles base schedules, repetition, conditional execution, and stepping through
a workflow to identify problems.
Monitoring provides logging of events, performance alerts, and reports.
- Data profiling helps a data warehouse administrator and data owners understand and improve
the quality of data sources. A data profiling tool can reduce unexpected delays and surprises in
populating and refreshing a data warehouse.
Course conclusion
COURSE 3: Relational Database Support for
Data Warehouses
https://www.coursera.org/learn/dwrelational
WEEK 1
* The failure of operational databases to support higher level decision making is a combination of
inadequacy of database technology and limitations in employment of databases.
- Performance limitation
Organizations and DBMS vendors discovered that a single database could not be configured to provide
adequate performance for both transaction processing and business intelligence processing.
- Lack of integration
Organizations discovered that the lack of integration among operational databases hindered higher
level decision making.
This lack of integration was not a design failure; operational databases were designed primarily to
support transaction processing, not business intelligence processing.
- Missing data management feature
Organizations experienced unexpected development difficulties and poor performance for business
intelligence applications.
These difficulties were unexpected because of the promise of nonprocedural specification in SQL
and optimizing compilers.
Relational database schema patterns
* The traditional schema pattern for a data warehouse is known as the star schema,
--> consisting of one fact entity type per table, surrounded by the dimension entity types or tables in
one-to-many relationships. The star schema represents a single data cube.
This ERD consists of four dimension entity types, Item, Customer, Store, and TimeDim, along with
one transaction fact entity type called Sales. When converted to a table design, the Sales table has
foreign keys to each dimension table, Item, Customer, Store, and TimeDim.
In some designs, the fact entity type depends on the related dimension entity types for its primary
key. Since fact tables can have many relationships, it is generally preferred to use an artificial
identifier rather than a large combined primary key.
In this example, SalesNo is an artificial identifier for the sales entity type.
* For related business processes that share some of the dimensions, a star schema can be extended
into a constellation schema with multiple fact entity types.
This constellation ERD contains two fact entity types, Sales and Inventory, along with five
dimension entity types, some of which are shared between the fact entity types.
The constellation pattern is important, because it shows shared dimensions to represent multiple
data cubes. A data warehouse for a medium-size business may have many data cubes.
For a complex constellation diagram, a matrix provides a convenient mapping of dimensions and
facts. This matrix shows a mapping of the simplified schema diagram shown previously.
Most dimensions appear in two schemas. Shared dimensions are conformed or standardized across
schemas.
Supplier and Store are used in only one schema each.
* You should consider the snowflake schema instead of the star schema for small dimension tables
that are not fully normalized.
- The Store table in the previously shown star schema is not fully normalized because DivisionId
determines DivisionName and DivisionManager. Since the Store table is relatively small, query
performance will not suffer much from the need to join the Division table in a snowflake design.
- For large dimension tables, such as customer, however, query performance may suffer with extra
join operations required in a snowflake design.
* The inventory data warehouse supports business intelligence about inventory cycles. Inventory
bought, sold, consumed, and produced, is the heart of any manufacturing or distribution company.
- Inventory transactions are frequent and common.
- The volume and significance of inventory transactions makes them important in a data warehouse
design.
- The work order, sales, and purchase life cycles, affect the perpetual inventory balance as shown in
this diagram.
Before starting on assignments in Modules 2, 3, and 4, you need to create and populate the
inventory data warehouse tables. The course website contains files with CREATE TABLE and INSERT
statements for each table.
SalesDWStatements.txt (course download link)
InventoryDW.doc
The CREATE and INSERT statements for the Inventory data warehouse tables.
InventoryStarCREATEStatementOracle.txt
InventoryStarINSERTStatementOracle.txt
WEEK 2
The CUBE operator generates a complete set of subtotals for a collection of columns.
- CUBE is only appropriate for independent columns, usually fewer than four columns.
- The column specification is order independent because the operator generates all possible
subtotals. You can understand the results of a CUBE operation by comparison to the normal
GROUP BY results.
You can understand the results of a ROLLUP operation by comparison to the normal GROUP BY
results.
- The only syntax difference in the abbreviated statements is the ROLLUP operator in the GROUP
BY clause. Complete SELECT statements appear later in the lesson.
- The ROLLUP result contains additional rows for sub totals.
- The first six rows are identical in both results. Dashes in the ROLLUP results indicate subtotal
rows.
- For example, the seventh row shows the sum of sales as 325 in 2012, across all months.
- The last row with two dashes, shows the grand total of 675.
- This ROLLUP result, grouping on two columns (year and month), contains two sets of subtotal rows:
subtotals for each year and the grand total.
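A hedged sketch of the ROLLUP query behind the result described above (Oracle-style; the SalesFact and TimeDim names are hypothetical). Swapping ROLLUP for CUBE would instead generate every subtotal combination:

    SELECT t.TimeYear, t.TimeMonth, SUM(f.SalesDollar) AS SumSales
    FROM SalesFact f
    JOIN TimeDim t ON f.TimeNo = t.TimeNo
    GROUP BY ROLLUP (t.TimeYear, t.TimeMonth);
    -- one row per (year, month), a subtotal row per year (TimeMonth is NULL),
    -- and a grand total row (both grouping columns NULL)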
SQL GROUPING SETS operator
The GROUPING SETS operator provides complete flexibility to generate any set of subtotals; even the
normal GROUP BY results are not generated by default.
- To achieve this flexibility, you must provide an explicit specification of subtotal groups, which is
more tedious than the specification for the CUBE and ROLLUP operators.
- The explicit specification for the grouping sets operator has some similarity with writing a
union query to generate subtotals. For both the union query and grouping sets operator, you
must know the collections of subtotals to generate.
- The GROUPING SETS operator explicitly specifies the complete set of subtotals.
- The new syntax in this statement is the GROUPING SETS keyword with a list of subtotal groups
after the GROUP BY keywords. The GROUPING SETS operation contains four subtotal groups: the
combination of StoreZip and TimeMonth, StoreZip by itself, TimeMonth by itself, and the
grand total, specified as empty parentheses.
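A hedged sketch of that GROUPING SETS specification (Oracle-style; the SalesFact, Store, and TimeDim names are hypothetical):

    SELECT s.StoreZip, t.TimeMonth, SUM(f.SalesDollar) AS SumSales
    FROM SalesFact f
    JOIN Store s   ON f.StoreNo = s.StoreNo
    JOIN TimeDim t ON f.TimeNo  = t.TimeNo
    GROUP BY GROUPING SETS ((s.StoreZip, t.TimeMonth), (s.StoreZip), (t.TimeMonth), ());
    -- four subtotal groups: StoreZip/TimeMonth combinations, StoreZip subtotals,
    -- TimeMonth subtotals, and the grand total from the empty parentheses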
Window Comparisons I
Window Comparisons II
WEEK 4
WEEK 5
Data Governance
Closing
COURSE 4: Business Intelligence Concepts,
Tools, and Applications
https://www.coursera.org/learn/business-intelligence-tools
WEEK 1
Course Overview
Lesson 1.1: Overview of Decision Making
WEEK 2
Graded Quiz #1: Please complete this GRADED quiz, which covers content from Modules 1
and 2
WEEK 3
WEEK 4
WEEK 5
Graded Quiz #2: Please complete this GRADED quiz, which covers content from Modules 3,
4 and 5.
Closing Video