Unit 1
The idea of data warehousing dates to the late 1980s, when IBM researchers Barry
Devlin and Paul Murphy introduced the "Business Data Warehouse."
In essence, the data warehousing idea was intended to provide an architectural model
for the flow of data from operational systems to decision support environments. The
concept attempted to address the various problems associated with this flow, mainly
its high costs.
In the absence of a data warehousing architecture, a vast amount of space was required
to support multiple decision support environments. In large corporations, it was
common for various decision support environments to operate independently.
Source data coming into the data warehouse may be grouped into four broad
categories:
Production Data: This type of data comes from the various operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.
Internal Data: In every organization, users keep their "private" spreadsheets,
reports, customer profiles, and sometimes even departmental databases. This is the
internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external agencies.
After we have extracted data from the various operational systems and external
sources, we have to prepare the files for storage in the data warehouse. The extracted
data, coming from several different sources, needs to be changed, converted, and made
ready in a format suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even bigger ones. We perform several individual tasks
as part of data transformation.
First, we clean the data extracted from each source. Cleaning may mean correcting
misspellings, supplying default values for missing data elements, or eliminating
duplicates when we bring in the same data from several source systems.
Data transformation also includes purging source data that is not
useful and separating out source records into new combinations. Sorting and merging of
data take place on a large scale in the data staging area. When the data transformation
function ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
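A minimal sketch of these transformation tasks in Python follows. The records, field names, and default values are hypothetical, chosen purely for illustration; this is a sketch of the idea, not the method of any particular tool.

```python
# Sketch of typical transformation tasks: cleaning, standardizing,
# and eliminating duplicates across two hypothetical source systems.

def transform(records):
    cleaned, seen_ids = [], set()
    for rec in records:
        # Cleaning: correct a known misspelling and supply a default
        # value for a missing data element.
        if rec.get("country") == "Untied States":
            rec["country"] = "United States"
        rec.setdefault("segment", "UNKNOWN")
        # Standardizing: one common format for names across sources.
        rec["name"] = rec["name"].strip().title()
        # Duplicate elimination: the same customer may arrive from
        # several source systems; keep the first occurrence only.
        if rec["customer_id"] in seen_ids:
            continue
        seen_ids.add(rec["customer_id"])
        cleaned.append(rec)
    # Sorting happens on a large scale in the staging area.
    return sorted(cleaned, key=lambda r: r["customer_id"])

source_a = [{"customer_id": 1, "name": " alice ", "country": "Untied States"}]
source_b = [{"customer_id": 1, "name": "Alice", "country": "United States"}]
print(transform(source_a + source_b))
```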
3) Data Loading: Two distinct categories of tasks form the data loading function. When
we complete the design and construction of the data warehouse and go live for the first
time, we do the initial loading of the data into the data warehouse storage. The
initial load moves high volumes of data and uses up a substantial amount of time; after
that, ongoing incremental loads apply only the changes captured since the previous load.
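A compact sketch of the two load categories follows, using an in-memory SQLite database as a stand-in for warehouse storage; the table and column names are hypothetical.

```python
# Sketch of the two categories of data loading tasks.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    # One-time bulk load when the warehouse first goes live; in practice
    # this moves very high volumes and takes substantial time.
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    conn.commit()

def incremental_load(rows):
    # Ongoing loads apply only the changes captured since the last run.
    conn.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?)", rows)
    conn.commit()

initial_load([(1, 100.0), (2, 250.0)])
incremental_load([(2, 275.0), (3, 80.0)])  # one updated row, one new row
print(conn.execute("SELECT * FROM sales_fact ORDER BY sale_id").fetchall())
```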
Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data. Also, these data
repositories hold the data in highly normalized structures for fast and efficient
processing.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information about
the indexes, and so on.
Management and Control Component
The management and control elements coordinate the services and functions within the
data warehouse. These components control the data transformation and the data
transfer into the data warehouse storage. They also moderate the data
delivery to the clients, work with the database management systems to enable
data to be correctly stored in the repositories, and monitor the movement of data
into the staging area and from there into the data warehouse storage itself.
The information delivery element enables users to subscribe to data warehouse files
and have them transferred to one or more destinations according to a
customer-specified scheduling algorithm.
The following comparison contrasts a database with a data warehouse, parameter by parameter:
Purpose: A database is designed to record; a data warehouse is designed to analyze.
Processing method: A database uses Online Transaction Processing (OLTP); a data warehouse uses Online Analytical Processing (OLAP).
Usage: A database helps to perform fundamental operations for your business; a data warehouse allows you to analyze your business.
Tables and joins: Tables and joins in a database are complex, as they are normalized; tables and joins in a data warehouse are simple, because they are denormalized.
Orientation: A database is an application-oriented collection of data; a data warehouse is a subject-oriented collection of data.
Storage limit: A database is generally limited to a single application; a data warehouse stores data from any number of applications.
Availability: Database data is available in real time; data warehouse data is refreshed from the source systems as and when needed.
Modeling: ER modeling techniques are used for designing a database; dimensional modeling techniques are used for designing a data warehouse.
Technique: A database captures data; a data warehouse analyzes data.
Data type: Data stored in a database is up to date; a data warehouse stores current and historical data, which may not be up to date.
Storage of data: A database uses a flat relational approach for data storage; a data warehouse uses a dimensional approach, for example the star and snowflake schemas.
Query type: Simple transaction queries are used in a database; complex queries are used in a data warehouse for analysis purposes.
Data summary: Detailed data is stored in a database; highly summarized data is stored in a data warehouse.
The Three-Tier Data Warehouse Architecture is the most commonly used data warehouse
design. It is used to build a data warehouse by bringing together the required data
warehouse schema model, the required OLAP server type, and the required front-end
tools for reporting or analysis. As the name suggests, it contains three tiers, the Top
tier, the Middle tier, and the Bottom tier, which are procedurally linked with one another
from the Bottom tier (data sources) through the Middle tier (OLAP servers) to the Top
tier (front-end tools).
Data Warehouse Architecture is the design based on which a Data Warehouse is built, to
accommodate the desired type of Data Warehouse Schema, user interface application
and database management system, for data organization and repository structure. The
type of Architecture is chosen based on the requirement provided by the project team.
Three-tier Data Warehouse Architecture is the most commonly used choice, due to the
detail of its structure. The three tiers are termed:
Top-Tier
Middle-Tier
Bottom-Tier
Each tier can have different components based on the requirements presented by the
decision-makers of the project, but each component must fit the role of its respective tier.
Three-Tier Data Warehouse Architecture
[Figure: pictorial representation of the Three-Tier Data Warehouse Architecture]
1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of the data
repository. The data repository is the storage space for the data extracted from various
data sources; this data undergoes a series of activities as part of the ETL process. ETL
stands for Extract, Transform, and Load. As a preliminary step, before the data is
loaded into the repository, all the relevant and required data is identified from the
several source systems. This data is then cleaned to remove duplicate and junk
records. The next step is to transform all this data into a single storage
format. The final step of ETL is to load the data into the repository; a minimal
end-to-end sketch follows the tool list below. A few
commonly used ETL tools are:
Informatica
Microsoft SSIS
Snaplogic
Confluent
Apache Kafka
Alooma
Ab Initio
IBM Infosphere
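As promised above, here is a minimal end-to-end sketch of the three ETL steps in plain Python. The CSV content, column names, and target table are hypothetical; a production pipeline would use a tool like those listed above.

```python
# Minimal ETL sketch: extract from a hypothetical CSV export, transform
# everything into one storage format, and load it into SQLite.
import csv
import io
import sqlite3

raw = "order_id,amount,currency\n1,100.00,usd\n2,80.50,USD\n"

# Extract: read the relevant data from the source.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean up and convert to a single format.
for r in rows:
    r["order_id"] = int(r["order_id"])
    r["amount"] = float(r["amount"])
    r["currency"] = r["currency"].upper()

# Load: write the prepared rows into the repository.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
db.executemany("INSERT INTO orders VALUES (:order_id, :amount, :currency)", rows)
print(db.execute("SELECT * FROM orders").fetchall())
```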
The storage type of the repository can be a relational database management system or
a multidimensional database management system. A relational database system can
hold simple relational data, whereas a multidimensional database system can hold data
that has more than one dimension. Whenever the repository includes both relational and
multidimensional database management systems, there exists a metadata unit. As the
name suggests, the metadata unit consists of all the metadata fetched from both the
relational database and multidimensional database systems. This Metadata unit
provides incoming data to the next tier, that is, the middle tier. From the user’s
standpoint, the data from the bottom tier can be accessed only with the use of SQL
queries. The complexity of the queries depends on the type of database. Data from the
relational database system can be retrieved using simple queries, whereas the
multidimensional database system demands complex queries with multiple joins and
conditional statements.
2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can have
more than one OLAP server, and it can have more than one type of OLAP server model
as well, which depends on the volume of the data to be processed and the type of data
held in the bottom tier. There are three types of OLAP server models:
ROLAP (Relational OLAP)
MOLAP (Multidimensional OLAP)
HOLAP (Hybrid OLAP)
The Middle Tier acts as an intermediary component between the top tier and the data
repository, that is, the top tier and the bottom tier respectively. From the user’s
standpoint, the middle tier gives an idea about the conceptual outlook of the database.
3. Top Tier
The Top Tier is a front-end layer, that is, the user interface that allows the user to
connect with the database systems. This user interface is usually a tool or an API call,
which is used to fetch the required data for Reporting, Analysis, and Data Mining
purposes. The type of tool depends purely on the form of outcome expected. It could be
a Reporting tool, an Analysis tool, a Query tool or a Data mining tool.
It is essential that the Top Tier be uncomplicated in terms of usability, since only user-
friendly tools can give effective outcomes. A few commonly used top-tier tools are:
IBM Cognos
Microsoft BI Platform
SAP Business Objects Web
Pentaho
Crystal Reports
SAP BW
SAS Business Intelligence
This type of schema (the Fact Constellation, or Galaxy, schema) is used for sophisticated
requirements and for aggregated fact tables that are too complex to be supported by the
Star schema or the Snowflake schema. This schema is difficult to maintain due to its
complexity.
The three schemas compare as follows:
Data redundancy: high in the Star schema; low in the Snowflake schema; low in the Fact Constellation schema.
Performance: The Star schema has fewer foreign keys, resulting in increased performance. The Snowflake schema has decreased performance compared to the Star schema because of its higher number of foreign keys. The Fact Constellation schema has decreased performance compared to both Star and Snowflake and is used for complex data aggregation.
Complexity: The Star schema is simple and designed to be easy to understand. The Snowflake schema is more complicated than the Star schema and can be more challenging to understand. The Fact Constellation schema is the most complicated to understand and is reserved for highly complex data structures.
Extraction is the operation of extracting data from a source system for further use in a
data warehouse environment. This is the first step of the ETL process. After the
extraction, this data can be transformed and loaded into the data warehouse.
The source systems for a data warehouse are typically transaction processing
applications. For example, one of the source systems for a sales analysis data warehouse
might be an order entry system that records all of the current order activities.
Designing and creating the extraction process is often one of the most time-consuming
tasks in the ETL process and, indeed, in the entire data warehousing process. The source
systems might be very complex and poorly documented, and thus determining which
data needs to be extracted can be difficult. The data has to be extracted normally not
only once, but several times in a periodic manner to supply all changed data to the
warehouse and keep it up-to-date. Moreover, the source system typically cannot be
modified, nor can its performance or availability be adjusted, to accommodate the
needs of the data warehouse.
The extraction method you choose is highly dependent on the source system and
also on the business needs in the target data warehouse environment. Very often,
there is no possibility of adding additional logic to the source systems to support
incremental extraction of data, because of the performance impact or the increased
workload on these systems. Sometimes the customer is not even allowed to add
anything to an out-of-the-box application system.
The estimated amount of data to be extracted and the stage in the ETL process
(initial load or ongoing maintenance of data) may also impact the decision of how to
extract, from both a logical and a physical perspective. Basically, you have to decide
how to extract data logically and physically; the sketch below contrasts the two logical
options, full and incremental extraction.
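To make the logical choice concrete, the sketch below contrasts full extraction with incremental extraction driven by a change-tracking timestamp. The source table and column names are hypothetical, and the sketch assumes the source system exposes such a timestamp, which, as noted above, is often not possible to add.

```python
# Sketch of full vs. incremental logical extraction from a source system.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-03-15")])

def full_extract():
    # Full extraction: pull everything, typically for the initial load.
    return src.execute("SELECT * FROM orders").fetchall()

def incremental_extract(last_run):
    # Incremental extraction: only rows changed since the previous run,
    # relying on a change-tracking column in the source system.
    return src.execute("SELECT * FROM orders WHERE updated_at > ?",
                       (last_run,)).fetchall()

print(full_extract())                     # both rows
print(incremental_extract("2024-02-01"))  # only the row changed since then
```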
Data extraction involves locating and gathering data from diverse sources such as
databases, cloud storage, APIs, and flat files.
Tools & Techniques:
Code Generators: Generate code to automate the extraction process from specific
data sources.
Database Replication Tools: Replicate and transform data between different database
systems.
Web Scraping: Extract data from websites, though it requires careful handling of
dynamic content.
API Connectors: Tools that connect to and extract data from various APIs.
Data Cleanup
Improves data quality by addressing inconsistencies, errors, and redundancies.
Tools & Techniques:
Data Profiling: Analyzes data to identify patterns, anomalies, and potential issues.
Data Cleansing: Removes or corrects incorrect, incomplete, or irrelevant data.
Data Validation: Ensures data meets predefined rules and formats.
Data Aggregation: Combines data from multiple sources or records into a single,
summarized record.
Data Normalization: Organizes data to reduce redundancy and improve data
integrity.
Data Transformation: Converts data into a standardized format, ensuring
compatibility with the target data warehouse.
Rule-Driven Engines: Apply predefined rules to cleanse and transform data based on
specific business requirements, as sketched after this list.
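A rule-driven engine of this kind can be sketched as a list of named predicate rules applied to every record. The three rules below are hypothetical business requirements, not part of any specific tool.

```python
# Sketch of a rule-driven validation engine: each rule is a
# (name, predicate) pair, and a record fails if any predicate is false.
RULES = [
    ("amount is non-negative", lambda r: r["amount"] >= 0),
    ("currency is 3 letters",  lambda r: len(r["currency"]) == 3),
    ("order id is present",    lambda r: r.get("order_id") is not None),
]

def validate(record):
    # Return the names of every rule the record violates.
    return [name for name, check in RULES if not check(record)]

good = {"order_id": 7, "amount": 19.9, "currency": "EUR"}
bad = {"order_id": None, "amount": -5, "currency": "EURO"}
print(validate(good))  # [] -> passes all rules
print(validate(bad))   # lists all three violated rules
```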
A multidimensional model views data in the form of a data-cube. A data cube enables
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and
facts.
The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, and location. These dimensions allow the
store to keep track of things, for example, monthly sales of items and the locations at
which the items were sold. Each dimension has a table related to it, called a dimensional
table, which describes the dimension further. For example, a dimensional table for an
item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is
shown in the table. In this 2-D representation, the sales for Delhi are shown for the time
dimension (organized in quarters) and the item dimension (classified according to the
types of items sold). The fact or measure displayed is rupee_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example,
suppose the data according to time and item, as well as location, is considered for the
cities Chennai, Kolkata, Mumbai, and Delhi. These 3-D data are shown in the table,
represented as a series of 2-D tables.
Conceptually, the same data may also be represented in the form of a 3-D data
cube, as shown in the figure.
Data Cube
The data cube is used to represent data along some dimensions of interest. For example,
in OLAP such dimensions could be the subsidiaries a company has, the products the
company offers, and time; in this setup, a fact would be a sales event where a particular
product has been sold in a particular subsidiary at a particular time. In a satellite image
time series, the dimensions would be latitude and longitude coordinates and time, and a
fact would be a pixel at a given place and time as taken by the satellite. Even though it is
called a cube, it is generally a multidimensional concept that can be 1-dimensional,
2-dimensional, 3-dimensional, or higher. In any case, every dimension divides the data
into groups of cells, and each cell in the cube represents a single measure of interest.
Sometimes a cube holds only a few values with the rest being empty, i.e., undefined;
sometimes most or all cells hold a value. In the first case such data are called
sparse, in the second case they are called dense, although there is no hard delineation
between the two.
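The sparse case can be sketched as a mapping from coordinate tuples to cell values, where only the defined cells are stored at all. The dimension values and figures below are hypothetical.

```python
# Sketch of a sparse 3-D data cube: cells are keyed by their
# (time, item, location) coordinates, and empty cells are simply absent.
cube = {
    ("Q1", "Mobile", "Delhi"):  605,
    ("Q1", "Modem", "Mumbai"):  825,
    ("Q2", "Mobile", "Delhi"):  680,
}

def cell(time, item, location):
    # Undefined coordinates return None: the cube is sparse.
    return cube.get((time, item, location))

def total_for_time(time):
    # Each dimension divides the data into groups of cells; here we
    # aggregate every cell falling into one group of the time dimension.
    return sum(v for (t, _, _), v in cube.items() if t == time)

print(cell("Q1", "Mobile", "Delhi"))  # 605
print(cell("Q2", "Modem", "Delhi"))   # None -> empty cell
print(total_for_time("Q1"))           # 1430
```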
Star Schema
Star schema is the most fundamental of the data mart schemas, and it is the simplest.
This schema is widely used to develop or build a data warehouse and
dimensional data marts. It includes one or more fact tables referencing any number of
dimension tables. The star schema can be regarded as a special case of the snowflake
schema. It is said to be a star because its physical model resembles a star shape, with a
fact table at its center and the dimension tables at its periphery representing the star's
points.
Simpler Queries
The join logic of a star schema is quite simple in comparison to the join logic needed to
fetch data from a transactional schema that is highly normalized.
Feeding Cubes
Star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact,
major OLAP systems deliver a ROLAP mode of operation which can use a star schema as
a source without designing a cube structure.
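A minimal star schema can be sketched as one fact table at the center referencing its dimension tables, with a query whose join logic stays one simple hop per dimension. The table and column names here are hypothetical.

```python
# Sketch of a star schema in SQLite: dimension tables at the points,
# a fact table at the center holding foreign keys and measures.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE sales_fact (
    item_id INTEGER REFERENCES dim_item(item_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    amount  REAL
);
INSERT INTO dim_item VALUES (1, 'Mobile', 'Acme');
INSERT INTO dim_time VALUES (1, 'Q1');
INSERT INTO sales_fact VALUES (1, 1, 605.0);
""")

# One join per dimension: the simple query pattern the star shape allows.
print(db.execute("""
    SELECT t.quarter, i.item_name, SUM(f.amount)
    FROM sales_fact f
    JOIN dim_item i ON f.item_id = i.item_id
    JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.quarter, i.item_name
""").fetchall())
```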
-Data integrity is not enforced well, since the schema is in a highly denormalized state.
Characteristics of Snowflake:
-The main benefit of the snowflake schema is that it uses smaller disk space.
-The primary challenge that you will face while using the snowflake Schema is that you
need to perform more maintenance efforts because of the more lookup tables.
Fact Constellations
A fact constellation (also called a galaxy schema) contains multiple fact tables that share
common dimension tables. Its main advantage is its flexibility for complex, aggregated
requirements; its main disadvantage is that it is the most complex schema to build and
maintain.
Concept hierarchy
It defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. Consider a concept hierarchy for the dimension location. City values
for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can
be mapped to the province or state to which it belongs. For example, Vancouver can be
mapped to British Columbia, and Chicago to Illinois. The provinces and states can in turn
be mapped to the country (e.g., Canada or the United States) to which they belong.
These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
This concept hierarchy is illustrated in Figure 4.9.
Many concept hierarchies are implicit within the database schema. For example,
suppose that the dimension location is described by the attributes number, street, city,
province_or_state, zip_code, and country. These attributes are related by a total order,
forming a concept hierarchy such as "street < city < province_or_state < country." This
hierarchy is shown in Figure 4.10(a). Alternatively, the attributes of a dimension may be
organized in a partial order, forming a lattice. An example of a partial order for
the time dimension based on the attributes day, week, month, quarter, and year is “day
< {month < quarter; week} < year." This lattice structure is shown in Figure 4.10(b). A
concept hierarchy that is a total or partial order among attributes in a database schema
is called a schema hierarchy. Concept hierarchies that are common to many
applications (e.g., for time) may be predefined in the data mining system.
Figure 4.10. Hierarchical and lattice structures of attributes in warehouse dimensions:
(a) a hierarchy for location; (b) a lattice for time.
Concept hierarchies may also be defined by grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined
among groups of values. An example of a set-grouping hierarchy is shown in Figure for
the dimension price, where an interval ($X…$Y] denotes the range from $X (exclusive) to
$Y (inclusive).
There may be more than one concept hierarchy for a given attribute or dimension,
based on different user viewpoints. For instance, a user may prefer to organize price by
defining ranges for inexpensive, moderately_priced, and expensive.
Concept hierarchies may be provided manually by system users, domain experts, or
knowledge engineers, or may be automatically generated based on statistical analysis of
the data distribution.
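Both kinds of hierarchy can be sketched directly in code: the schema hierarchy for location as dictionary lookups from city up to country, and the set-grouping hierarchy for price as labeled intervals. The specific boundaries and group names below are illustrative assumptions.

```python
# Schema hierarchy: city < province_or_state < country, as mappings.
CITY_TO_PROVINCE = {"Vancouver": "British Columbia", "Chicago": "Illinois"}
PROVINCE_TO_COUNTRY = {"British Columbia": "Canada", "Illinois": "USA"}

def roll_up(city):
    # Map a low-level concept (city) to its higher-level concepts.
    province = CITY_TO_PROVINCE[city]
    return city, province, PROVINCE_TO_COUNTRY[province]

# Set-grouping hierarchy: intervals ($X...$Y] grouping the price dimension,
# where $X is exclusive and $Y is inclusive.
PRICE_GROUPS = [(0, 100, "inexpensive"),
                (100, 400, "moderately_priced"),
                (400, float("inf"), "expensive")]

def price_group(price):
    for low, high, label in PRICE_GROUPS:
        if low < price <= high:
            return label

print(roll_up("Vancouver"))  # ('Vancouver', 'British Columbia', 'Canada')
print(price_group(250))      # 'moderately_priced'
```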
We can define OLAP in a data warehouse as a computing technology that allows users to
query data and analyze it from different perspectives. The technology is a great solution
for business analysts who need to pre-aggregate and pre-calculate data for fast analysis.
OLAP architecture in a data warehouse is based on a multidimensional data structure.
The data takes the form of OLAP cubes, which have a star- or snowflake-shaped schema.
In the middle of the star (or snowflake) is a table that includes data aggregations and
reconciles the various dimensions.
To query the required data, OLAP usually uses the MDX language. MDX queries work
with OLAP members, dimensions, and hierarchies.
Drill-Down
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially, the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the
level of month.
When drill-down is performed, one or more dimensions from the data cube are
added.
It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting a single value along one dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows
the pivot operation.
OLAP vs OLTP
Sr.No. Data Warehouse (OLAP) Operational Database (OLTP)