CHAPTER 3
Data Warehousing
A data warehouse (DW) is an organized collection of integrated, subject-
oriented databases designed to support decision support functions. DW
is organized at the right level of granularity to provide clean enterprise-
wide data in a standardized format for reports, queries, and analysis. DW
is physically and functionally separate from an operational and transac-
tional database. Creating a DW for analysis and queries represents signifi-
cant investment in time and effort. It has to be constantly kept up-to-date
for it to be useful. DW offers many business and technical benefits.
DW supports business reporting and data mining activities. It can
facilitate distributed access to up-to-date business knowledge for depart-
ments and functions, thus improving business efficiency and customer
service. DW can present a competitive advantage by facilitating decision
making and helping reform business processes.
DW enables a consolidated view of corporate data, all cleaned and or-
ganized. Thus, the entire organization can see an integrated view of itself.
DW thus provides better and timely information. It simplifies data access
and allows end users to perform extensive analysis. It enhances overall IT
performance by not burdening the operational databases used by Enter-
prise Resource Planning (ERP) and other systems.
Caselet: University Health System—BI in
Health Care
Indiana University Health (IUH), a large academic health care system,
decided to build an enterprise data warehouse (EDW) to foster a genuinely
data-driven management culture. IUH hired a DW vendor to develop
EDW, which also integrates with their electronic health record (EHR)
38 BUSINESS INTELLIGENCE AND DATA MINING
system. They loaded 14 billion rows of data into EDW—fully 10 years
of clinical data from across IUH’s network. Clinical events, patient en-
counters, lab and radiology, and other patient data were included, as were
IUH’s performance management, revenue cycle, and patient satisfaction
data. They soon put in a new interactive dashboard using the EDW that
provided IUH’s leadership with the daily operational insights they need to
solve the quality/cost equation. It offers visibility into key operational met-
rics and trends to easily track the performance measures critical to control-
ling costs and maintaining quality. EDW can easily be used across IUH’s
departments to analyze, track, and measure clinical, financial, and patient
experience outcomes. (Source: healthcatalyst.com)
Q1. What are the benefits of a single large comprehensive EDW?
Q1. What kinds of data would be needed for EDW for an airline
company?
Design Considerations for DW
The objective of DW is to provide business knowledge to support deci-
sion making. For DW to serve its objective, it should be aligned around
those decisions. It should be comprehensive, easy to access, and up-to-
date. Here are some requirements for a good DW:
1. Subject-oriented: To be effective, DW should be designed around a
subject domain, that is, to help solve a certain category of problems.
2. Integrated: DW should include data from many functions that can
shed light on a particular subject area. Thus, the organization can
benefit from a comprehensive view of the subject area.
3. Time-variant (time series): The data in DW should grow at daily or
other chosen intervals. That allows latest comparisons over time.
4. Nonvolatile: DW should be persistent, that is, it should not be cre-
ated on the fly from the operations databases. Thus, DW is consis-
tently available for analysis, across the organization and over time.
5. Summarized: DW contains rolled-up data at the right level for queries
and analysis. The rolling up helps create consistent granularity for effec-
tive comparisons. It helps reduces the number of variables or dimensions
of the data to make them more meaningful for the decision makers.
DATA WAREHOUSING 39
6. Not normalized: DW often uses a star schema, which is a rectangular
central table, surrounded by some lookup tables. The single-table
view significantly enhances speed of queries.
7. Metadata: Many of the variables in the database are computed from
other variables in the operational database. For example, total daily
sales may be a computed field. The method of its calculation for each
variable should be effectively documented. Every element in DW
should be sufficiently well-defined.
8. Near real-time and/or right-time (active): DWs should be updated in
near real-time in many high-transaction volume industries, such as air-
lines. The cost of implementing and updating DW in real time could
discourage others. Another downside of real-time DW is the possibili-
ties of inconsistencies in reports drawn just a few minutes apart.
DW Development Approaches
There are two fundamentally different approaches to developing DW: top
down and bottom up. The top-down approach is to make a comprehensive
DW that covers all the reporting needs of the enterprise. The bottom-up
approach is to produce small data marts, for the reporting needs of different
departments or functions, as needed. The smaller data marts will eventually
align to deliver comprehensive EDW capabilities. The top-down approach
provides consistency but takes time and resources. The bottom-up approach
leads to healthy local ownership and maintainability of data (Table 3.1).
Table 3.1 Comparing data mart and data warehouse
Functional Data Mart Enterprise Data
Warehouse
Scope One subject or functional area Complete enterprise data needs
Value Functional area reporting and Deeper insights connecting
insights multiple functional areas
Target Decentralized management Centralized management
organization
Time Low to medium High
Cost Low High
Size Small to medium Medium to large
Approach Bottom up Top down
Complexity Low (fewer data transformations) High (data standardization)
Technology Smaller scale servers and databases Industrial strength
40 BUSINESS INTELLIGENCE AND DATA MINING
Figure 3.1 Data warehousing architecture
DW Architecture
DW has four key elements (Figure 3.1). The first element is the data
sources that provide the raw data. The second element is the process of
transforming that data to meet the decision needs. The third element is
the methods of regularly and accurately loading of that data into EDW or
data marts. The fourth element is the data access and analysis part, where
devices and applications use the data from DW to deliver insights and
other benefits to users.
Data Sources
DWs are created from structured data sources. Unstructured data, such as
text data, would need to be structured before inserted into DW.
1. Operations data include data from all business applications, includ-
ing from ERPs systems that form the backbone of an organization’s IT
systems. The data to be extracted will depend upon the subject matter
of DW. For example, for a sales/marketing DW, only the data about
customers, orders, customer service, and so on would be extracted.
2. Other applications, such as point-of-sale (POS) terminals and
e-commerce applications, provide customer-facing data. Supplier
data could come from supply chain management systems. Planning
and budget data should also be added as needed for making com-
parisons against targets.
3. External syndicated data, such as weather or economic activity data,
could also be added to DW, as needed, to provide good contextual
information to decision makers.
DATA WAREHOUSING 41
Data Transformation Processes
The heart of a useful DW is the processes to populate the DW with good
quality data. This is called the extract-transform-load (ETL) cycle.
1. Data should be extracted from many operational (transactional) da-
tabase sources on a regular basis.
2. Extracted data should be aligned together by key fields. It should be
cleansed of any irregularities or missing values. It should be rolled
up together to the same level of granularity. Desired fields, such as
daily sales totals, should be computed. The entire data should then
be brought to the same format as the central table of DW.
3. The transformed data should then be uploaded into DW.
This ETL process should be run at a regular frequency. Daily trans-
action data can be extracted from ERPs, transformed, and uploaded to
the database the same night. Thus, DW is up-to-date next morning. If
DW is needed for near-real-time information access, then the ETL pro-
cesses would need to be executed more frequently. ETL work is usually
automated using programing scripts that are written, tested, and then
deployed for periodic updating DW.
DW Design
Star schema is the preferred data architecture for most DWs. There is a
central fact table that provides most of the information of interest. There
are lookup tables that provide detailed values for codes used in the central
table. For example, the central table may use digits to represent a sales
person. The lookup table will help provide the name for that sales person
code. Here is an example of a star schema for a data mart for monitoring
sales performance (Figure 3.2).
Other schemas include the snowflake architecture. The difference be-
tween a star and snowflake is that in the latter, the lookup tables can have
their own further lookup tables.
There are many technology choices for developing DW. This includes
selecting the right database management system and the right set of data
management tools. There are a few big and reliable providers of DW sys-
tems. The provider of the operational DBMS may be chosen for DW also.
42 BUSINESS INTELLIGENCE AND DATA MINING
Figure 3.2 Star schema architecture
Alternatively, a best-of-breed DW vendor could be used. There are also a
variety of tools out there for data migration, data upload, data retrieval,
and data analysis.
DW Access
Data from DW could be accessed for many purposes, through many
devices.
1. A primary use of DW is to produce routine management and moni-
toring reports. For example, a sales performance report would show
sales by many dimensions, and compared with plan. A dashboard-
ing system will use data from the warehouse and present analysis
to users. The data from DW can be used to populate customized
performance dashboards for executives. The dashboard could in-
clude drill-down capabilities to analyze the performance data for
root cause analysis.
2. The data from the warehouse could be used for ad hoc queries and
any other applications that make use of the internal data.
DATA WAREHOUSING 43
3. Data from DW is used to provide data for mining purposes. Parts of
the data would be extracted, and then combined with other relevant
data, for data mining.
DW Best Practices
A DW project reflects a significant investment into IT. All of the best
practices in implementing any IT project should be followed.
1. The DW project should align with the corporate strategy. Top manage-
ment should be consulted for setting objectives. Financial viability Re-
turn on Investment (ROI) should be established. The project must be
managed by both IT and business professionals. The DW design should
be carefully tested before beginning development work. It is often much
more expensive to redesign after development work has begun.
2. It is important to manage user expectations. DW should be built in-
crementally. Users should be trained in using the system, and absorb
the many features of the system.
3. Quality and adaptability should be built in from the start. Only
cleansed and high-quality data should be loaded. The system should
be able to adapt to new access tools. As business needs change, new
data marts can be created for new needs.
Conclusion
DWs are special data management facilities intended for creating reports
and analysis to support managerial decision making. They are designed to
make reporting and querying simple and efficient. The sources of data are
operational systems and external data sources. DW needs to be updated
with new data regularly to keep it useful. Data from DW provides a useful
input for data mining activities.
Review Questions
1. What is the purpose of a data warehouse?
2. What are the key elements of a data warehouse? Describe each.
44 BUSINESS INTELLIGENCE AND DATA MINING
3. What are the sources and types of data for a data warehouse?
4. How will data warehousing evolve in the age of social media?
Liberty Stores Case Exercise: Step 2
The Liberty Stores company wants to be fully informed about its sales of
products and take advantage of growth opportunities as they arise. It wants
to analyze sales of all its products by all store locations. The newly hired
chief knowledge officer has decided to build a data warehouse.
1. Design a DW structure for the company to monitor its sales perfor-
mance. (Hint: Design the central table and lookup tables.)
2. Design another DW for the company’s sustainability and charitable
activities.