Data Warehousing (Chapter 2)
Data Warehousing (Chapter 2)
CHAPTER 2
Introduction:
Bill Inmon (1996, p. 33), considered to be the father of data warehousing as noted in the previous
chapter, provides the following definition: “A Data Warehouse is a subject oriented, integrated, non-
volatile, and time variant collection of data in support of management’s decisions.”
The data in the data warehouse is: Separate, Available, Integrated, Time stamped, Subject
oriented, Non-volatile, Accessible.
DEFINING FEATURES :-
Subject-Oriented Data
➢ In operational systems, we store data by individual applications. Each application has its own
specific data set for its operations.
➢ An order processing application stores data for entering orders, checking stock, verifying
credit, and assigning shipment. This data is specific to order processing functions.
➢ Data sets for consumer loans, checking accounts, and savings accounts are each specific to
those applications within a bank.
➢ Different applications, such as automobile insurance, life insurance, and workers'
compensation, each have their own data sets.
➢ In operational systems, data sets are designed to support specific application functions,
making them efficient for those operations.
➢ Unlike operational systems, data warehouses store data by real-world business subjects or
events rather than by applications.
➢ Data in a data warehouse is organized in such a way that all data sets related to the same
business subject or event are connected.
➢ Business subjects are critical areas of a business, and they differ from one enterprise to
another. Examples include sales, shipments, and inventory.
➢ For a manufacturing company, sales, shipments, and inventory are critical subjects. For a retail
store, sales at the checkout counter is a critical subject.
pg. 1
2 DW Module-01
➢ The main purpose of a data warehouse is to organize data by important business topics,
making it easier to analyze business events and trends across the enterprise.
Figure 2-1 distinguishes between how data is stored in operational systems and in the data warehouse.
Data Format Flat files (e.g., .txt, .csv) Relational tables with predefined schemas
File names or paths (no formal Primary and foreign keys to maintain
Key Usage
keys) integrity
Data Retrieval Simple file access Complex SQL queries for data retrieval and
Method (open/read/write operations) analysis
pg. 2
3 DW Module-01
Integrated Data:
Data integration is a critical step in building a data warehouse, ensuring data consistency and
making it useful for comprehensive analysis and decision-making.
- The data warehouse pulls together data from multiple operational systems.
- Data may also come from external sources like Metro Mail, A. C. Nielsen, and IRI.
- These sources can have different file layouts, operating systems, and naming conventions.
- Different applications may have varying data attributes (e.g., different lengths for account
numbers).
- Data could have different naming conventions, formats, and codes across systems.
- Before being stored in the data warehouse, data must be transformed, consolidated, and integrated.
Standardization Requirements:
pg. 3
4 DW Module-01
- Data Attributes: Standardize attributes such as data types and field lengths.
Time-Variant Data
- Examples: Current balance in accounts receivable, current order status, current loan amount.
→ Data Warehouse
→Allows users to understand customer buying patterns, track changes over time, and identify
reasons for business trends.
→ Data can be related to specific time periods (e.g., day, week, month, quarter).
- Analysis of the Past: Users can analyze historical data to understand trends.
- Future Forecasting: Enables predictive analysis and forecasting based on past data.
Non-volatile Data
The nonvolatile nature of data in a data warehouse helps maintain a stable environment for consistent
reporting, trend analysis, and informed decision-making without being affected by ongoing business
transactions.
- Once data is loaded, it remains stable for query and analysis purposes.
- Operational systems handle *real-time updates* with every transaction (e.g., placing an order).
- Data from operational systems is moved to the data warehouse at scheduled intervals (e.g., daily,
weekly).
- Different data sets may have different update frequencies depending on business needs.
- The data warehouse is not updated with each transaction; it is nonvolatile compared to the frequent
updates in operational databases.
- Data in the warehouse is meant for queries and analysis, not for real-time business operations.
pg. 5
6 DW Module-01
Data Granularity
➢ In operational systems, data is typically stored at the lowest level of detail, known as fine
granularity.
➢ For instance, in a grocery store's point-of-sale system, units sold are captured per transaction
at the checkout.
➢ When summary data is needed, individual transactions are aggregated. For example, to
determine monthly sales, all relevant transactions are summed up, rather than storing pre-
calculated summaries.
➢ In contrast, a data warehouse usually stores data at multiple levels of granularity to facilitate
analysis.
➢ Users often start with high-level summaries, such as total sales by region, and drill down to
more detailed levels, like sales by state or individual stores.
➢ This hierarchy of summary levels enables efficient querying. The decision on data granularity
in a data warehouse depends on balancing the level of detail, data volume, and system
performance.
pg. 6
7 DW Module-01
➢ In the early days of data warehousing, the terms "data warehouse" and "data mart" were often
confused. A data warehouse is an enterprise-wide repository, while a data mart is a smaller,
departmental version.
➢ Organizations need to decide between a top-down approach (building a central data
warehouse first) or a bottom-up approach (starting with data marts).
➢ Key questions include whether to focus on enterprise-wide or departmental needs, whether to
build a data warehouse or data marts first, and whether to create dependent or independent
data marts. These considerations are crucial for planning a successful data warehousing
strategy.
Top-Down Approach
→ A data warehouse is defined as a centralized repository for the entire enterprise, storing data
at the lowest level of granularity based on a normalized data model.
→ In Inmon's vision, the data warehouse is at the heart of the Corporate Information Factory
(CIF), providing a framework for delivering business intelligence. Data from business operations
drives the CIF, and the centralized data warehouse feeds dependent data marts, which are typically
designed using a dimensional data model.
pg. 7
8 DW Module-01
Advantages:
2. Ensures that the data warehouse is architected, rather than a collection of disparate data marts.
Disadvantages:
The top-down approach focuses on building a large, integrated data warehouse, avoiding
fragmented information. However, it may take longer to implement and has a high risk of failure
if not managed by experienced professionals. Gaining support from senior management can also
be challenging, as they may not see immediate results.
Bottom-Up Approach
→Ralph Kimball, a prominent author and expert in data warehousing, advocates for the bottom-up
approach.
→Kimball envisions the corporate data warehouse as a collection of conformed data marts. The
primary focus is on conforming the dimensions among these separate data marts. In this approach,
data marts are developed first to provide analytical and reporting capabilities for specific business
subjects, utilizing a dimensional data model.
Key Features:
- Data Granularity: Data marts contain information at the lowest level of granularity and also include
summaries based on analytical needs.
pg. 8
9 DW Module-01
- Union of Data Marts: These data marts are integrated by conforming their dimensions, allowing
for cohesive analysis.
Advantages:
Disadvantages:
In the bottom-up approach, departmental data marts are built individually based on priority.
While this approach allows for quicker implementation and less risk, it can lead to data
fragmentation. Each independent data mart may not align with the overall needs of the
organization, potentially causing issues with data consistency and integration.
A Practical Approach
→ To decide on the best approach for data warehousing in your organization, you must first
understand its needs:
- Are you aiming for long-term, integrated data solutions or quick results for specific data subjects?
→ A balanced approach combining both top-down and bottom-up strategies is often the most
practical. This hybrid approach ensures that we:
pg. 9
10 DW Module-01
1. Plan and Define Requirements at the corporate level, keeping the entire organization in mind.
4. Implement Data Warehouse as a series of small data marts, referred to as supermarts, one at a time.
→ Each data mart is a logical subset of the overall data warehouse, designed for specific business
groups.
→ Together, all data marts make up the entire *enterprise data warehouse*.
→ A data mart is part of a larger data warehouse, focusing on a specific business area.
→ A data warehouse is the integrated union of all the individual data marts.
→ Consistent data definitions are crucial across all supermarts to prevent inconsistent or disparate
data.
ARCHITECTURAL TYPES
Over time, various architectural arrangements of data warehouses and data marts have evolved,
tailored to the specific requirements of different organizations. Before diving into the components of
a data warehouse, let's review the leading architectural types and see how data marts fit within these
architectures.
Relationships: The relationship between the data warehouse and data marts.
pg. 10
11 DW Module-01
→ Data is stored in the third normal form at the atomic level of granularity, meaning it is at its
most detailed level.
→ Occasionally, summarized data is also included. Queries and applications directly access the
centralized data warehouse, and there are no separate data marts in this setup.
→ This architectural type emerges when organizational units develop their own data marts for
specific purposes.
→ These independent data marts do not provide a single version of the truth and often have
inconsistent data definitions and standards, making it difficult to perform cross-data mart analysis.
pg. 11
12 DW Module-01
For example, sales and shipments data may be stored in separate data marts, complicating efforts to
analyze them together despite their related nature.
Federated
→ In this architectural type, companies with existing decision-support systems (such as operational
systems, extracted datasets, or primitive data marts) use a federated approach.
→ Instead of discarding their investments and starting over, data is integrated either physically or
logically through shared key fields, global metadata, and distributed queries.
Hub-and-Spoke
→ The key difference is the presence of dependent data marts, which obtain data from the
centralized warehouse.
→ The data warehouse acts as a hub, feeding data to these dependent data marts, which serve
various purposes like departmental analysis, specialized queries, and data mining.
→ The dependent data marts can have normalized, denormalized, summarized, or dimensional
data structures. Most queries are directed to the data marts, though the centralized warehouse can
also be queried. This approach uses a top-down development method.
Data-Mart Bus
→ This is the Kimball conformed supermarts approach, which follows an enhanced bottom-up
development method. It begins by analyzing the requirements for a specific business subject (e.g.,
orders, shipments, billings).
→ The first data mart is built using business dimensions and metrics, which are then shared
across future data marts to ensure consistency.
pg. 12
13 DW Module-01
→ Each data mart contains atomic data organized in a dimensional data model.
You see the Source Data component shown on the left. The Data Staging component serves as the
next building block. In the middle, you see the Data Storage component that manages the data
warehouse data. This component not only stores and manages the data, it also keeps track of the data
by means of the metadata repository. The Information Delivery component shown on the right
consists of all the different ways of making the information from the data warehouse available to the
users.
1. Production Data:
o Data formats, hardware platforms, and database systems can vary across operational
systems.
o Queries in these systems are narrow and predictable (e.g., finding a specific customer’s
data or a single invoice).
2. Internal Data:
o It includes detailed customer profiles, which are valuable for personalized marketing
and customer relationship management.
o Not all internal data may be included at the start, and decisions about how much to
include must be made.
3. Archived Data:
o Historical data archived from operational systems (often moved to separate storage
after a certain period).
o Data can be stored in various stages (online databases, flat files, tapes).
o Essential for providing historical snapshots and analyzing long-term trends in the data
warehouse.
4. External Data:
o Data from outside sources (e.g., industry statistics, competitor data, financial
benchmarks).
o Provides insights that internal data alone cannot offer (e.g., spotting trends, comparing
with industry benchmarks).
pg. 14
15 DW Module-01
These categories provide diverse data that must be standardized and integrated to create a
comprehensive and valuable data warehouse.
• After extracting data from different operational systems and external sources, the data must be
prepared for storage.
• The data needs to be changed, converted, and formatted for easy querying and analysis.
• ETL Process: Three key functions: Extract, Transform, and Load (ETL).
These functions happen in a dedicated staging area before the data enters the data warehouse.
• This area is essential because the data comes from many different sources and must be
integrated.
• Functions include cleaning, deduplication, standardization, and preparing data for storage.
• Unlike operational systems, data in a warehouse is subject-oriented and comes from many
diverse sources.
• Directly moving data into the warehouse without proper staging would lead to data
integration issues.
• Hence, a separate staging area is needed to ensure the data is properly cleaned and
standardized before loading.
• This staging area stores literal copies of source system data in a more convenient format
(e.g., an ODBC-accessible database).
Discuss the three major functions that take place in the staging area.
a. Data Extraction
pg. 15
16 DW Module-01
• Data extraction involves collecting data from various sources in different formats.
• Sources can include relational databases, legacy systems, flat files, spreadsheets, and
departmental datasets.
Outside tools: Ready-made tools for specific data sources, which may have high initial costs.
In-house programs: Custom-built programs for extraction, which involve ongoing development
and maintenance costs.
• After extraction, data is often moved to a separate environment for easier handling.
• This environment can consist of flat files, staging databases, or a combination of both.
• Sometimes, extraction may occur on the legacy platform itself, depending on the situation.
b. Data Transformation:
• Data transformation is critical when moving data from old systems to new ones.
• It involves converting data from different formats and making it ready for the data warehouse.
• Data for the warehouse comes from many disparate sources, making transformation
complex.
• Unlike operational systems, data in a warehouse is not just a one-time load but requires
ongoing updates and transformations.
• Cleaning involves tasks like correcting errors (e.g., misspellings), resolving conflicts (like
mismatched state codes), and removing duplicates.
• Semantic standardization: Synonyms (terms meaning the same) and homonyms (terms with
multiple meanings) are resolved.
• It also involves removing unnecessary data and organizing it into new forms.
• Primary keys in operational systems often have built-in meanings (e.g., product codes).
• In the warehouse, surrogate keys are used, which are unique identifiers without built-in
meanings.
c. Data Loading
• After building the data warehouse, the first step is to load large volumes of data for the initial
setup.
• This process takes a lot of time because it involves moving huge amounts of data into the
warehouse.
• After the initial load, data updates need to be added continuously as the warehouse operates.
• This involves extracting, transforming, and loading data changes regularly to keep the
warehouse up to date.
• Data is moved from the staging area (where data is prepared) to the data warehouse storage.
• Movements can be of both large batches (initial load) and incremental updates (ongoing
changes).
Figure 2-8 illustrates the common types of data movements from the staging area to the data
warehouse storage.
pg. 17
18 DW Module-01
• Data warehouses have separate storage from operational systems because they store large
amounts of historical data for analysis.
• Operational systems are highly normalized for fast processing, while data warehouses use
structures optimized for analysis.
• Data warehouses are read-only so that analysts can work with stable, unchanging snapshots
of data at specified periods.
• Unlike operational systems, data in a data warehouse is not updated continuously but loaded
in scheduled refresh cycles (e.g., daily, monthly).
• Data warehouses are designed to work with multiple tools from different vendors, making
them flexible.
• Some data warehouses use Multidimensional Databases (MDDBs) for storing summary
data, offering aggregated views of data for analysis.
• Casual users require information occasionally and also need prepackaged data.
• Power users create their own queries, drill into data, and make custom reports.
• Complex queries, multidimensional analysis, and statistical tools are for business analysts and
power users.
• Senior executives use Executive Information Systems EIS for high-level, summarized
information.
• Some warehouses include data mining tools to discover trends and patterns in the data.
• Information is delivered through online queries, scheduled reports via email, or using the
intranet.
pg. 18
19 DW Module-01
Metadata Component
o A data dictionary holds information like logical structures, file details, and indexes.
o Metadata helps in managing and understanding the data stored in the data warehouse.
o This component sits on top of all other components in the data warehouse architecture.
o It moderates information delivery to users, ensuring they receive the right data.
pg. 19
20 DW Module-01
o Monitors the movement of data into the staging area and from there into the data
warehouse storage.
o Uses metadata as the source of information for effective management and control.
Types of Metadata
1. Operational metadata
3. End-user metadata
1. Operational Metadata
• Contains information about various operational systems providing data to the warehouse.
• Includes details about selected data elements, such as field lengths and data types.
• Helps tie delivered information back to the original source data sets.
• Covers splitting records, combining parts from different sources, and managing multiple
coding schemes.
3. End-User Metadata
• Helps users look for information in a way that aligns with their understanding of the business.
pg. 21