0% found this document useful (0 votes)
21 views21 pages

Data Warehousing (Chapter 2)

This document provides an in-depth overview of data warehousing, defining its key features such as subject-oriented, integrated, time-variant, and non-volatile data. It contrasts data storage in operational systems versus data warehouses, discusses architectural types like centralized and hub-and-spoke, and outlines approaches for building data warehouses, including top-down and bottom-up strategies. The importance of data integration, standardization, and granularity for effective analysis and decision-making is emphasized throughout.

Uploaded by

bhagyasreeganuga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views21 pages

Data Warehousing (Chapter 2)

This document provides an in-depth overview of data warehousing, defining its key features such as subject-oriented, integrated, time-variant, and non-volatile data. It contrasts data storage in operational systems versus data warehouses, discusses architectural types like centralized and hub-and-spoke, and outlines approaches for building data warehouses, including top-down and bottom-up strategies. The importance of data integration, standardization, and granularity for effective analysis and decision-making is emphasized throughout.

Uploaded by

bhagyasreeganuga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

1 DW Module-01

CHAPTER 2

DATA WAREHOUSE: THE BUILDING BLOCKS

Introduction:

Bill Inmon (1996, p. 33), considered to be the father of data warehousing as noted in the previous
chapter, provides the following definition: “A Data Warehouse is a subject oriented, integrated, non-
volatile, and time variant collection of data in support of management’s decisions.”

The data in the data warehouse is: Separate, Available, Integrated, Time stamped, Subject
oriented, Non-volatile, Accessible.

DEFINING FEATURES :-

Subject-Oriented Data

➢ In operational systems, we store data by individual applications. Each application has its own
specific data set for its operations.
➢ An order processing application stores data for entering orders, checking stock, verifying
credit, and assigning shipment. This data is specific to order processing functions.
➢ Data sets for consumer loans, checking accounts, and savings accounts are each specific to
those applications within a bank.
➢ Different applications, such as automobile insurance, life insurance, and workers'
compensation, each have their own data sets.
➢ In operational systems, data sets are designed to support specific application functions,
making them efficient for those operations.
➢ Unlike operational systems, data warehouses store data by real-world business subjects or
events rather than by applications.
➢ Data in a data warehouse is organized in such a way that all data sets related to the same
business subject or event are connected.
➢ Business subjects are critical areas of a business, and they differ from one enterprise to
another. Examples include sales, shipments, and inventory.
➢ For a manufacturing company, sales, shipments, and inventory are critical subjects. For a retail
store, sales at the checkout counter is a critical subject.

pg. 1
2 DW Module-01

➢ The main purpose of a data warehouse is to organize data by important business topics,
making it easier to analyze business events and trends across the enterprise.

Figure 2-1 distinguishes between how data is stored in operational systems and in the data warehouse.

Aspect Operating System Storage Data Warehouse Storage

To manage files and system To consolidate and analyze large volumes of


Purpose
resources data

Hierarchical file system (files


Data Structure Structured tables in a relational database
and directories)

Data Format Flat files (e.g., .txt, .csv) Relational tables with predefined schemas

Independent files with minimal Interrelated tables using keys


Data Relationships
relationships (primary/foreign)

File names or paths (no formal Primary and foreign keys to maintain
Key Usage
keys) integrity

Extensive indexing to improve query


Indexing Limited indexing capabilities
performance

Data Retrieval Simple file access Complex SQL queries for data retrieval and
Method (open/read/write operations) analysis

pg. 2
3 DW Module-01

Aspect Operating System Storage Data Warehouse Storage

Everyday document storage


Use Case Business intelligence, reporting, and analytics
and management

Often normalized for data integrity; may be


Normalization Not typically applied
denormalized for performance

Performance Optimized for complex queries and large data


Optimized for quick file access
Optimization retrieval

Integrated Data:

Data integration is a critical step in building a data warehouse, ensuring data consistency and
making it useful for comprehensive analysis and decision-making.

- The data warehouse pulls together data from multiple operational systems.

- Data resides in different formats, databases, files, and platforms.

- Source data may come from a variety of internal operational applications.

- Data may also come from external sources like Metro Mail, A. C. Nielsen, and IRI.

- These sources can have different file layouts, operating systems, and naming conventions.

- Different applications may have varying data attributes (e.g., different lengths for account
numbers).

- Data could have different naming conventions, formats, and codes across systems.

- Before being stored in the data warehouse, data must be transformed, consolidated, and integrated.

- This includes removing inconsistencies and standardizing data elements.

- Integrated data allows for meaningful analysis and decision-making.

- It provides a unified view across different applications and data sources.

Standardization Requirements:
pg. 3
4 DW Module-01

- Naming Conventions: Standardize field names across sources.

- Codes: Ensure consistent code representations across data sets.

- Data Attributes: Standardize attributes such as data types and field lengths.

- Measurements: Ensure uniformity in measurement units.

Time-Variant Data

→ Operational Systems- Store current values that reflect day-to-day operations.

- Examples: Current balance in accounts receivable, current order status, current loan amount.

→ Data Warehouse

a. Stores historical data in addition to current data.

b. Provides snapshots of data at different points in time.

c. Records changes to data over time to allow for analysis.

→Supports analysis, decision-making, and trend identification.

→Allows users to understand customer buying patterns, track changes over time, and identify
reasons for business trends.

→ Every data structure in the data warehouse includes a *time element*.

→ Data can be related to specific time periods (e.g., day, week, month, quarter).

→Benefits of Time-Variant Data:


pg. 4
5 DW Module-01

- Analysis of the Past: Users can analyze historical data to understand trends.

- Present Insight: Helps relate historical information to the current situation.

- Future Forecasting: Enables predictive analysis and forecasting based on past data.

Non-volatile Data

The nonvolatile nature of data in a data warehouse helps maintain a stable environment for consistent
reporting, trend analysis, and informed decision-making without being affected by ongoing business
transactions.

- Data in the data warehouse is not frequently updated or changed.

- Once data is loaded, it remains stable for query and analysis purposes.

- Operational systems handle *real-time updates* with every transaction (e.g., placing an order).

- In a data warehouse, data is extracted from operational systems at specific intervals.

- Data from operational systems is moved to the data warehouse at scheduled intervals (e.g., daily,
weekly).

- Different data sets may have different update frequencies depending on business needs.

- Data warehouses do not support day-to-day business operations.

- They store historical snapshots of data for analysis and decision-making.

- Business transactions update operational systems in real time.

- The data warehouse is not updated with each transaction; it is nonvolatile compared to the frequent
updates in operational databases.

- Data in the warehouse is meant for queries and analysis, not for real-time business operations.

pg. 5
6 DW Module-01

Data Granularity

➢ In operational systems, data is typically stored at the lowest level of detail, known as fine
granularity.
➢ For instance, in a grocery store's point-of-sale system, units sold are captured per transaction
at the checkout.
➢ When summary data is needed, individual transactions are aggregated. For example, to
determine monthly sales, all relevant transactions are summed up, rather than storing pre-
calculated summaries.
➢ In contrast, a data warehouse usually stores data at multiple levels of granularity to facilitate
analysis.
➢ Users often start with high-level summaries, such as total sales by region, and drill down to
more detailed levels, like sales by state or individual stores.
➢ This hierarchy of summary levels enables efficient querying. The decision on data granularity
in a data warehouse depends on balancing the level of detail, data volume, and system
performance.

pg. 6
7 DW Module-01

DATA WAREHOUSES AND DATA MARTS:-

➢ In the early days of data warehousing, the terms "data warehouse" and "data mart" were often
confused. A data warehouse is an enterprise-wide repository, while a data mart is a smaller,
departmental version.
➢ Organizations need to decide between a top-down approach (building a central data
warehouse first) or a bottom-up approach (starting with data marts).
➢ Key questions include whether to focus on enterprise-wide or departmental needs, whether to
build a data warehouse or data marts first, and whether to create dependent or independent
data marts. These considerations are crucial for planning a successful data warehousing
strategy.

Top-Down Approach

→ Bill Inmon is a leading advocate of the top-down approach to data warehousing.

→ A data warehouse is defined as a centralized repository for the entire enterprise, storing data
at the lowest level of granularity based on a normalized data model.

→ In Inmon's vision, the data warehouse is at the heart of the Corporate Information Factory
(CIF), providing a framework for delivering business intelligence. Data from business operations
drives the CIF, and the centralized data warehouse feeds dependent data marts, which are typically
designed using a dimensional data model.
pg. 7
8 DW Module-01

Advantages:

1. Provides a holistic view of data across the organization.

2. Ensures that the data warehouse is architected, rather than a collection of disparate data marts.

3. Offers a single repository for all organizational data.

4. Maintains centralized rules and control over the data.

5. May yield quick results if implemented in iterations.

Disadvantages:

1. Takes longer to build, even with an iterative method.

2. High exposure to risk of failure.

3. Needs a high level of cross-functional skills within the team.

4. Requires a high outlay of resources without initial proof of concept.

The top-down approach focuses on building a large, integrated data warehouse, avoiding
fragmented information. However, it may take longer to implement and has a high risk of failure
if not managed by experienced professionals. Gaining support from senior management can also
be challenging, as they may not see immediate results.

Bottom-Up Approach

→Ralph Kimball, a prominent author and expert in data warehousing, advocates for the bottom-up
approach.

→Kimball envisions the corporate data warehouse as a collection of conformed data marts. The
primary focus is on conforming the dimensions among these separate data marts. In this approach,
data marts are developed first to provide analytical and reporting capabilities for specific business
subjects, utilizing a dimensional data model.

Key Features:

- Data Granularity: Data marts contain information at the lowest level of granularity and also include
summaries based on analytical needs.

pg. 8
9 DW Module-01

- Union of Data Marts: These data marts are integrated by conforming their dimensions, allowing
for cohesive analysis.

Advantages:

1. Enables quicker and easier implementation of smaller, manageable pieces.

2. Favorable return on investment and proof of concept.

3. Reduced risk of failure.

4. Allows for an incremental approach, prioritizing important data marts first.

5. Allows project team to learn and grow.

Disadvantages:

1. Each data mart has its own narrow view of data.

2. Permeates redundant data in every data mar

3. Perpetuates inconsistent and irreconcilable data

4. Proliferates unmanageable interfaces

In the bottom-up approach, departmental data marts are built individually based on priority.
While this approach allows for quicker implementation and less risk, it can lead to data
fragmentation. Each independent data mart may not align with the overall needs of the
organization, potentially causing issues with data consistency and integration.

A Practical Approach

→ To decide on the best approach for data warehousing in your organization, you must first
understand its needs:

- Are you aiming for long-term, integrated data solutions or quick results for specific data subjects?

- Do you need quick, throw-away implementations or something more comprehensive?

→ A balanced approach combining both top-down and bottom-up strategies is often the most
practical. This hybrid approach ensures that we:
pg. 9
10 DW Module-01

1. Maintain a big-picture view for the whole organization (top-down).

2. Implement solutions step by step based on priorities (bottom-up).

→ Steps of the Practical Approach:

1. Plan and Define Requirements at the corporate level, keeping the entire organization in mind.

2. Create a Surrounding Architecture for a complete data warehouse.

3. Conform and Standardize Data to ensure consistency across the system.

4. Implement Data Warehouse as a series of small data marts, referred to as supermarts, one at a time.

→ Supermarts are carefully structured data marts.

→ Each data mart is a logical subset of the overall data warehouse, designed for specific business
groups.

→ Together, all data marts make up the entire *enterprise data warehouse*.

→ A data mart is part of a larger data warehouse, focusing on a specific business area.

→ A data warehouse is the integrated union of all the individual data marts.

→ Consistent data definitions are crucial across all supermarts to prevent inconsistent or disparate
data.

ARCHITECTURAL TYPES

Over time, various architectural arrangements of data warehouses and data marts have evolved,
tailored to the specific requirements of different organizations. Before diving into the components of
a data warehouse, let's review the leading architectural types and see how data marts fit within these
architectures.

For each type of architecture, we will discuss:

Data Storage: How data is stored within the data warehouse.

Relationships: The relationship between the data warehouse and data marts.
pg. 10
11 DW Module-01

Figure 2-6 provides a visual representation of these different architectural types.

Centralized Data Warehouse:

→ This architectural type focuses on enterprise-level information requirements with an


established overall infrastructure.

→ Data is stored in the third normal form at the atomic level of granularity, meaning it is at its
most detailed level.

→ Occasionally, summarized data is also included. Queries and applications directly access the
centralized data warehouse, and there are no separate data marts in this setup.

Independent Data Marts

→ This architectural type emerges when organizational units develop their own data marts for
specific purposes.

→ These independent data marts do not provide a single version of the truth and often have
inconsistent data definitions and standards, making it difficult to perform cross-data mart analysis.
pg. 11
12 DW Module-01

For example, sales and shipments data may be stored in separate data marts, complicating efforts to
analyze them together despite their related nature.

Federated

→ In this architectural type, companies with existing decision-support systems (such as operational
systems, extracted datasets, or primitive data marts) use a federated approach.

→ Instead of discarding their investments and starting over, data is integrated either physically or
logically through shared key fields, global metadata, and distributed queries.

→ There is no single overall data warehouse in this setup.

Hub-and-Spoke

→ This is the Inmon Corporate Information Factory approach. It features an enterprise-wide


centralized data warehouse that stores atomic data in third normal form.

→ The key difference is the presence of dependent data marts, which obtain data from the
centralized warehouse.

→ The data warehouse acts as a hub, feeding data to these dependent data marts, which serve
various purposes like departmental analysis, specialized queries, and data mining.

→ The dependent data marts can have normalized, denormalized, summarized, or dimensional
data structures. Most queries are directed to the data marts, though the centralized warehouse can
also be queried. This approach uses a top-down development method.

Data-Mart Bus

→ This is the Kimball conformed supermarts approach, which follows an enhanced bottom-up
development method. It begins by analyzing the requirements for a specific business subject (e.g.,
orders, shipments, billings).

→ The first data mart is built using business dimensions and metrics, which are then shared
across future data marts to ensure consistency.

pg. 12
13 DW Module-01

→ By conforming dimensions across data marts, a logically integrated set of supermarts is


created, providing an enterprise view of the data.

→ Each data mart contains atomic data organized in a dimensional data model.

OVERVIEW OF THE COMPONENTS

Figure 2-7 shows the basic components of a typical warehouse.

You see the Source Data component shown on the left. The Data Staging component serves as the
next building block. In the middle, you see the Data Storage component that manages the data
warehouse data. This component not only stores and manages the data, it also keeps track of the data
by means of the metadata repository. The Information Delivery component shown on the right
consists of all the different ways of making the information from the data warehouse available to the
users.

1. Source Data Component

Different categories of source data for a data warehouse:


pg. 13
14 DW Module-01

1. Production Data:

o Comes from the organization’s operational systems (e.g., financial, manufacturing,


supply chain, CRM systems).

o Data formats, hardware platforms, and database systems can vary across operational
systems.

o Queries in these systems are narrow and predictable (e.g., finding a specific customer’s
data or a single invoice).

o Main challenge: Disparity in data—data must be standardized, transformed, and


integrated for the data warehouse.

2. Internal Data:

o Data held by individuals or departments in spreadsheets, documents, or local databases.

o It includes detailed customer profiles, which are valuable for personalized marketing
and customer relationship management.

o Adds complexity to data transformation and integration.

o Not all internal data may be included at the start, and decisions about how much to
include must be made.

3. Archived Data:

o Historical data archived from operational systems (often moved to separate storage
after a certain period).

o Data can be stored in various stages (online databases, flat files, tapes).

o Essential for providing historical snapshots and analyzing long-term trends in the data
warehouse.

4. External Data:

o Data from outside sources (e.g., industry statistics, competitor data, financial
benchmarks).

o Provides insights that internal data alone cannot offer (e.g., spotting trends, comparing
with industry benchmarks).
pg. 14
15 DW Module-01

o Needs to be converted to internal formats and organized for integration.

These categories provide diverse data that must be standardized and integrated to create a
comprehensive and valuable data warehouse.

2. Data Staging Component

• After extracting data from different operational systems and external sources, the data must be
prepared for storage.

• The data needs to be changed, converted, and formatted for easy querying and analysis.

• ETL Process: Three key functions: Extract, Transform, and Load (ETL).

These functions happen in a dedicated staging area before the data enters the data warehouse.

• A workbench where data is cleaned, transformed, and combined.

• This area is essential because the data comes from many different sources and must be
integrated.

• Functions include cleaning, deduplication, standardization, and preparing data for storage.

• Unlike operational systems, data in a warehouse is subject-oriented and comes from many
diverse sources.

• Directly moving data into the warehouse without proper staging would lead to data
integration issues.

• Hence, a separate staging area is needed to ensure the data is properly cleaned and
standardized before loading.

• For legacy systems, a literal staging area may be needed.

• This staging area stores literal copies of source system data in a more convenient format
(e.g., an ODBC-accessible database).

• Acts as a surrogate for the original source systems.

Discuss the three major functions that take place in the staging area.

a. Data Extraction

pg. 15
16 DW Module-01

• Data extraction involves collecting data from various sources in different formats.

• Sources can include relational databases, legacy systems, flat files, spreadsheets, and
departmental datasets.

• Extraction techniques vary based on the source and data format.

• There are two options for extracting data:

Outside tools: Ready-made tools for specific data sources, which may have high initial costs.

In-house programs: Custom-built programs for extraction, which involve ongoing development
and maintenance costs.

• After extraction, data is often moved to a separate environment for easier handling.

• This environment can consist of flat files, staging databases, or a combination of both.

• Sometimes, extraction may occur on the legacy platform itself, depending on the situation.

b. Data Transformation:

• Data transformation is critical when moving data from old systems to new ones.

• It involves converting data from different formats and making it ready for the data warehouse.

• Data for the warehouse comes from many disparate sources, making transformation
complex.

• Unlike operational systems, data in a warehouse is not just a one-time load but requires
ongoing updates and transformations.

• Cleaning involves tasks like correcting errors (e.g., misspellings), resolving conflicts (like
mismatched state codes), and removing duplicates.

• Missing data is filled in using default values when necessary.

• Data types and field lengths are standardized across sources.

• Semantic standardization: Synonyms (terms meaning the same) and homonyms (terms with
multiple meanings) are resolved.

• Transformation includes combining data from different sources and records.


pg. 16
17 DW Module-01

• It also involves removing unnecessary data and organizing it into new forms.

• Primary keys in operational systems often have built-in meanings (e.g., product codes).

• In the warehouse, surrogate keys are used, which are unique identifiers without built-in
meanings.

c. Data Loading

• After building the data warehouse, the first step is to load large volumes of data for the initial
setup.

• This process takes a lot of time because it involves moving huge amounts of data into the
warehouse.

• After the initial load, data updates need to be added continuously as the warehouse operates.

• This involves extracting, transforming, and loading data changes regularly to keep the
warehouse up to date.

• Data is moved from the staging area (where data is prepared) to the data warehouse storage.

• Movements can be of both large batches (initial load) and incremental updates (ongoing
changes).

Figure 2-8 illustrates the common types of data movements from the staging area to the data
warehouse storage.

pg. 17
18 DW Module-01

3. Data Storage Component

• Data warehouses have separate storage from operational systems because they store large
amounts of historical data for analysis.

• Operational systems are highly normalized for fast processing, while data warehouses use
structures optimized for analysis.

• Data warehouses are read-only so that analysts can work with stable, unchanging snapshots
of data at specified periods.

• Unlike operational systems, data in a data warehouse is not updated continuously but loaded
in scheduled refresh cycles (e.g., daily, monthly).

• Data warehouses are designed to work with multiple tools from different vendors, making
them flexible.

• Some data warehouses use Multidimensional Databases (MDDBs) for storing summary
data, offering aggregated views of data for analysis.

4. Information Delivery Component

• Novice users need pre-built reports and simple queries.

• Casual users require information occasionally and also need prepackaged data.

• Business analysts perform complex analysis using the data warehouse.

• Power users create their own queries, drill into data, and make custom reports.

• Predefined reports are for novice and casual users.

• Complex queries, multidimensional analysis, and statistical tools are for business analysts and
power users.

• Senior executives use Executive Information Systems EIS for high-level, summarized
information.

• Some warehouses include data mining tools to discover trends and patterns in the data.

• Information is delivered through online queries, scheduled reports via email, or using the
intranet.
pg. 18
19 DW Module-01

• Internet-based information delivery is becoming more common.

Figure 2-9 shows the different information delivery methods.

Metadata Component

o Metadata is data about the data in a data warehouse.

o It is similar to a data dictionary or data catalog in a database management system.

o A data dictionary holds information like logical structures, file details, and indexes.

o Metadata in a data warehouse includes this and more.

o Metadata helps in managing and understanding the data stored in the data warehouse.

o It provides crucial details for organizing and using data effectively.

o Metadata is listed as one of the important components in data warehouse architecture.

Management and Control Component

o This component sits on top of all other components in the data warehouse architecture.

o It coordinates services and activities within the data warehouse.

o Manages data transformation and data transfer into storage.

o It moderates information delivery to users, ensuring they receive the right data.

o Works with database management systems to store data properly in repositories.

pg. 19
20 DW Module-01

o Monitors the movement of data into the staging area and from there into the data
warehouse storage.

o Interacts with the metadata component to perform management functions.

o Uses metadata as the source of information for effective management and control.

METADATA IN THE DATA WAREHOUSE

Types of Metadata

Metadata in a data warehouse fall into three major categories:

1. Operational metadata

2. Extraction and transformation metadata

3. End-user metadata

1. Operational Metadata

• Contains information about various operational systems providing data to the warehouse.

• Includes details about selected data elements, such as field lengths and data types.

• Helps tie delivered information back to the original source data sets.

• Covers splitting records, combining parts from different sources, and managing multiple
coding schemes.

2. Extraction and Transformation Metadata

• Contains details on how data is extracted from source systems, including:

o Extraction Frequencies: How often data is extracted.

o Extraction Methods: Techniques used for extraction.

o Business Rules: Guidelines for data extraction processes.

• Documents all data transformations that occur in the staging area.

3. End-User Metadata

• Serves as a guide for end-users to navigate the data warehouse.


pg. 20
21 DW Module-01

• Allows users to find information using their own business language.

• Helps users look for information in a way that aligns with their understanding of the business.

pg. 21

You might also like