
UNIT-1

History of Data Warehouse

The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry
Devlin and Paul Murphy developed the "Business Data Warehouse."

In essence, the data warehousing idea was intended to support an architectural model
for the flow of information from operational systems to decision support
environments. The concept attempted to address the various problems associated with
this flow, mainly the high costs associated with it.

In the absence of a data warehousing architecture, a vast amount of space was required
to support multiple decision support environments. In large corporations, it was
common for various decision support environments to operate independently.

Goals of Data Warehousing

o To support reporting as well as analysis
o To maintain the organization's historical information
o To be the foundation for decision making

Need for Data Warehouse

Data Warehouse is needed for the following reasons:

1. Business users: Business users require a data warehouse to view summarized
data from the past. Since these people are non-technical, the data may be
presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant
data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in
the data warehouse. So the data warehouse contributes to making strategic
decisions.
4. Data consistency and quality: By bringing data from different sources to a
common place, the user can effectively ensure uniformity and
consistency in the data.
5. High response time: The data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree of
flexibility and quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is easier for end-users to navigate,
understand, and query.
4. Queries that would be complex in many normalized databases can be easier to
build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical
data.

Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of the elements. We build a data warehouse
with software and hardware components. To suit the requirements of our organization,
we arrange these building blocks in a certain way; we may want to strengthen a
particular part with extra tools and services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data
component is shown on the left. The Data Staging element serves as the next building
block. In the middle, we see the Data Storage component that handles the data
warehouse's data. This element not only stores and manages the data; it also keeps track
of the data using the metadata repository. The Information Delivery component, shown on
the right, consists of all the different ways of making the information from the data
warehouse available to the users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad
categories:

Production Data: This type of data comes from the various operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.

Internal Data: In each organization, the client keeps their "private" spreadsheets,
reports, customer profiles, and sometimes even department databases. This is the
internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external agencies.

Data Staging Component

After we have extracted data from various operational systems and external
sources, we have to prepare the files for storing in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and made
ready in a format that is suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even more significant challenges. We perform several individual
tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.

Standardization of data components forms a large part of data transformation. Data
transformation also contains many forms of combining pieces of data from different sources.
We combine data from a single source record or related data parts from many source
records.

On the other hand, data transformation also involves purging source data that is not
useful and separating out source records into new combinations. Sorting and merging of
data take place on a large scale in the data staging area. When the data transformation
function ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
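To make the cleaning step concrete, here is a minimal SQL sketch of standardization, default values, and duplicate elimination in a staging area. All table and column names (stage_customer, extracted_at, and so on) are illustrative assumptions, not part of any particular tool:

    -- Hypothetical staging tables; names are illustrative only.
    -- Standardize codes, supply defaults for missing values, and keep a
    -- single row per customer when several sources deliver duplicates.
    CREATE TABLE stage_customer_clean AS
    SELECT
        customer_id,
        UPPER(TRIM(customer_name)) AS customer_name,          -- standardization
        CASE WHEN gender IN ('M', 'MALE')   THEN 'M'
             WHEN gender IN ('F', 'FEMALE') THEN 'F'
             ELSE 'U'
        END AS gender,                                        -- standardization
        COALESCE(country, 'UNKNOWN') AS country,              -- default for missing data
        extracted_at
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY extracted_at DESC) AS rn
        FROM stage_customer s
    ) latest
    WHERE rn = 1;                                             -- duplicate elimination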

3) Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the structure and construction of the data warehouse and go live for the first
time, we do the initial loading of the data into the data warehouse storage. The
initial load moves high volumes of data, using up a substantial amount of time; after
that, ongoing incremental loads apply the periodic changes and keep the warehouse current.
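A minimal sketch of the two categories of loads, reusing the hypothetical stage_customer_clean table above and an assumed warehouse table dw_customer:

    -- Initial load: bulk-copy the cleaned, integrated staging data into
    -- warehouse storage when the warehouse first goes live.
    INSERT INTO dw_customer (customer_key, customer_name, gender, country)
    SELECT customer_id, customer_name, gender, country
    FROM stage_customer_clean;

    -- Ongoing incremental load: apply only the rows changed since the last
    -- run; :last_load_time is a bind variable remembered between runs.
    INSERT INTO dw_customer (customer_key, customer_name, gender, country)
    SELECT customer_id, customer_name, gender, country
    FROM stage_customer_clean
    WHERE extracted_at > :last_load_time;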

Data Storage Components

Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data. Also, these data
repositories contain data structured in a highly normalized form for fast and efficient
processing.

Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information about
the indexes, and so on.
Data Marts

It includes a subset of corporate-wide data that is of value to a specific group of users.
The scope is confined to particular selected subjects. Data in a data warehouse should
be fairly current, but not necessarily up to the minute, although developments in the data
warehouse industry have made regular and incremental data loads more achievable.
Data marts are smaller than data warehouses and usually contain data for a single
department. The current trend in data warehousing is to develop a data warehouse with
several smaller related data marts for particular kinds of queries and reports.

Management and Control Component

The management and control elements coordinate the services and functions within the
data warehouse. These components control the data transformation and the data
transfer into the data warehouse storage. On the other hand, they moderate the data
delivery to the clients. They work with the database management systems and ensure that
data is correctly saved in the repositories. They monitor the movement of data
into the staging area and from there into the data warehouse storage itself.

Information Delivery Component

The information delivery element is used to enable the process of subscribing for data
warehouse files and having them transferred to one or more destinations according to some
customer-specified scheduling algorithm.
Database vs. Data Warehouse

Parameter | Database | Data Warehouse
Purpose | Is designed to record | Is designed to analyze
Processing Method | Uses Online Transactional Processing (OLTP) | Uses Online Analytical Processing (OLAP)
Usage | Helps to perform fundamental operations for your business | Allows you to analyze your business
Tables and Joins | Tables and joins are complex, as they are normalized | Tables and joins are simple, because they are denormalized
Orientation | Is an application-oriented collection of data | Is a subject-oriented collection of data
Storage Limit | Generally limited to a single application | Stores data from any number of applications
Availability | Data is available in real time | Data is refreshed from source systems as and when needed
Modeling | ER modeling techniques are used for designing | Data modeling techniques are used for designing
Technique | Capture data | Analyze data
Data Type | Data stored in the database is up to date | Current and historical data is stored; may not be up to date
Storage of Data | A flat relational approach is used for data storage | Uses a dimensional and normalized approach for the data structure (e.g., star and snowflake schemas)
Query Type | Simple transaction queries are used | Complex queries are used for analysis purposes
Data Summary | Detailed data is stored | Highly summarized data is stored

Data Warehouse Architecture

The Three-Tier Data Warehouse Architecture is the most commonly used Data Warehouse
design. It is used to build a Data Warehouse by including the required Data Warehouse
Schema Model, the required OLAP server type, and the required front-end tools for
Reporting or Analysis purposes. As the name suggests, it contains three tiers, the
Top Tier, the Middle Tier, and the Bottom Tier, which are procedurally linked with one
another from the Bottom Tier (data sources) through the Middle Tier (OLAP servers) to
the Top Tier (front-end tools).

Data Warehouse Architecture is the design based on which a Data Warehouse is built, to
accommodate the desired type of Data Warehouse Schema, user interface application
and database management system, for data organization and repository structure. The
type of Architecture is chosen based on the requirement provided by the project team.
Three-tier Data Warehouse Architecture is the commonly used choice, due to its
detailing in the structure. The three different tiers here are termed as:

 Top-Tier
 Middle-Tier
 Bottom-Tier

Each tier can have different components based on the prerequisites presented by the
decision-makers of the project, but each component is particular to the purpose of its respective tier.
Three-Tier Data Warehouse Architecture
Here is a pictorial representation for the Three-Tier Data Warehouse Architecture

1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of the Data
Repository. Data Repository is the storage space for the data extracted from various
data sources, which undergoes a series of activities as a part of the ETL process. ETL
stands for Extract, Transform and Load. As a preliminary process, before the data is
loaded into the repository, all the relevant and required data is identified from several
sources of the system. This data is then cleaned up, to avoid repeated or junk data
from its current storage units. The next step is to transform all this data into a single
storage format. The final step of ETL is to load the data into the repository. A few
commonly used ETL tools are:

 Informatica
 Microsoft SSIS
 Snaplogic
 Confluent
 Apache Kafka
 Alooma
 Ab Initio
 IBM Infosphere
The storage type of the repository can be a relational database management system or
a multidimensional database management system. A relational database system can
hold simple relational data, whereas a multidimensional database system can hold data
that has more than one dimension. Whenever the repository includes both relational and
multidimensional database management systems, there exists a metadata unit. As the
name suggests, the metadata unit consists of all the metadata fetched from both the
relational database and multidimensional database systems. This Metadata unit
provides incoming data to the next tier, that is, the middle tier. From the user’s
standpoint, the data from the bottom tier can be accessed only with the use of SQL
queries. The complexity of the queries depends on the type of database. Data from the
relational database system can be retrieved using simple queries, whereas the
multidimensional database system demands complex queries with multiple joins and
conditional statements.
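As a rough illustration of that contrast, consider the two queries below; every table and column name is assumed for the example:

    -- Simple retrieval from the relational repository:
    SELECT product_name, unit_price
    FROM product_dim
    WHERE brand = 'Acme';

    -- The multidimensional side typically demands joins and conditions:
    SELECT d.year, l.country, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d     ON f.date_key     = d.date_key
    JOIN location_dim l ON f.location_key = l.location_key
    WHERE d.year >= 2020
    GROUP BY d.year, l.country;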

2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can have
more than one OLAP server, and it can have more than one type of OLAP server model
as well, which depends on the volume of the data to be processed and the type of data
held in the bottom tier. There are three types of OLAP server models:

ROLAP

 Relational online analytical processing is a model of online analytical processing


which carries out an active multidimensional breakdown of data stored in a
relational database, instead of redesigning a relational database into a
multidimensional database.
 This is applied when the repository consists of only the relational database
system in it.

MOLAP

 Multidimensional online analytical processing is another model of online
analytical processing that catalogs and stores data directly in its
multidimensional database system.
 This is applied when the repository consists of only the multidimensional
database system in it.
HOLAP

 Hybrid online analytical processing is a hybrid of both relational and


multidimensional online analytical processing models.
 When the repository contains both the relational database management system
and the multidimensional database management system, HOLAP is the best
solution for a smooth functional flow between the database systems. HOLAP
allows storing data in both the relational and the multidimensional formats.

The Middle Tier acts as an intermediary component between the top tier and the data
repository, that is, the top tier and the bottom tier respectively. From the user’s
standpoint, the middle tier gives an idea about the conceptual outlook of the database.

3. Top Tier
The Top Tier is a front-end layer, that is, the user interface that allows the user to
connect with the database systems. This user interface is usually a tool or an API call,
which is used to fetch the required data for Reporting, Analysis, and Data Mining
purposes. The type of tool depends purely on the form of outcome expected. It could be
a Reporting tool, an Analysis tool, a Query tool or a Data mining tool.

It is essential that the Top Tier should be uncomplicated in terms of usability. Only user-
friendly tools can give effective outcomes.

Below are the few commonly used Top Tier tools.

 IBM Cognos
 Microsoft BI Platform
 SAP Business Objects Web
 Pentaho
 Crystal Reports
 SAP BW
 SAS Business Intelligence

Data Warehouse Schema


The Data Warehouse Schema is a structure that logically defines the contents of the
Data Warehouse by facilitating the operations performed on the Data Warehouse and
the maintenance activities of the Data Warehouse system. It usually includes the
detailed description of the databases, tables, views, indexes, and data, which are
regularly structured using predefined design types such as the Star Schema, Snowflake
Schema, and Galaxy Schema (also known as the Fact Constellation Schema).

What is a Star Schema in a Data Warehouse?


The star schema in a data warehouse is historically one of the most straightforward
designs. This schema follows some distinct design parameters, such as permitting only
one central table and a handful of single-dimension tables joined to it. In
following these design constraints, a star schema can resemble a star with one central
table and, say, five dimension tables joined to it (which is where the star schema got its name).
The star schema is known for creating denormalized dimension tables – a database structuring
strategy that organizes tables to introduce redundancy for improved performance.
Denormalization intends to introduce redundancy in additional dimensions so long as it
improves query performance.

Characteristics of the Star Schema:


 Star data warehouse schemas create a denormalized database that enables quick
querying responses
 The primary key in the dimension table is joined to the fact table by the foreign key
 Each dimension in the star schema maps to one dimension table
 Dimension tables within a star schema are not to be connected directly
 Star schema creates denormalized dimension tables

Benefits of Star Schema


 Queries use very simple joins while retrieving the data, and thereby query
performance is increased.
 It is simple to retrieve data for reporting, at any point of time for any period.

Disadvantages of Star Schema


 If there are many changes in the requirements, it is not recommended to
modify and reuse the existing star schema in the long run.
 Data redundancy is more as tables are not hierarchically divided.

An example of a Star Schema is given below.
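A minimal DDL sketch of such a schema: one central fact table referencing a handful of denormalized dimension tables (all names are illustrative assumptions):

    -- Denormalized dimension tables.
    CREATE TABLE date_dim (
        date_key INT PRIMARY KEY,
        day INT, month INT, quarter INT, year INT
    );
    CREATE TABLE product_dim (
        product_key INT PRIMARY KEY,
        product_name VARCHAR(100), brand VARCHAR(50), type VARCHAR(50)
    );
    CREATE TABLE store_dim (
        store_key INT PRIMARY KEY,
        store_name VARCHAR(100), city VARCHAR(50), state VARCHAR(50)
    );

    -- Central fact table: each foreign key points at one dimension.
    CREATE TABLE sales_fact (
        date_key     INT REFERENCES date_dim (date_key),
        product_key  INT REFERENCES product_dim (product_key),
        store_key    INT REFERENCES store_dim (store_key),
        units_sold   INT,
        sales_amount DECIMAL(12, 2)
    );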

What is a Snowflake Schema?

The Snowflake Schema is a data warehouse schema that encompasses a logical


arrangement of dimension tables. This data warehouse schema builds on the star
schema by adding additional sub-dimension tables that relate to first-order dimension
tables joined to the fact table.
Just like the relationship between the foreign key in the fact table and the primary key
in the dimension table, with the snowflake schema approach, a primary key in a sub-
dimension table will relate to a foreign key within the higher order dimension table.
Snowflake schema creates normalized dimension tables – a database structuring
strategy that organizes tables to reduce redundancy. The purpose of normalization is to
eliminate any redundant data to reduce overhead.
Benefits of SnowFlake Schema:
 Data redundancy is completely removed by creating new dimension tables.
 When compared with the star schema, less storage space is used by the snowflaked
dimension tables.
 It is easy to update (or) maintain the snowflaked tables.
Disadvantages of SnowFlake Schema:
 Due to the normalized dimension tables, the ETL system has to load a larger
number of tables.
 You may need complex joins to perform a query due to the number of tables
added. Hence query performance will be degraded.

An example of a SnowFlake Schema is given below.


The Dimension Tables in the above SnowFlake Diagram are normalized as explained
below:
 Date dimension is normalized into Quarterly, Monthly and Weekly tables by
leaving foreign key ids in the Date table.
 The store dimension is normalized to comprise the table for State.
 The product dimension is normalized into Brand.
 In the Customer dimension, the attributes connected to the city are moved into
the new City table by leaving a foreign key id in the Customer table.
In the same way, a single dimension can maintain multiple levels of hierarchy.
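A short DDL sketch of the Customer normalization just described, with assumed names: the city attributes move into a new City table, and the Customer table keeps only a foreign key to it.

    -- Sub-dimension table holding the attributes moved out of Customer.
    CREATE TABLE city_dim (
        city_key INT PRIMARY KEY,
        city VARCHAR(50), state VARCHAR(50), country VARCHAR(50)
    );

    -- Snowflaked Customer dimension: city details replaced by a foreign key.
    CREATE TABLE customer_dim (
        customer_key  INT PRIMARY KEY,
        customer_name VARCHAR(100),
        city_key      INT REFERENCES city_dim (city_key)
    );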

Characteristics of the Snowflake Schema:


 Snowflake schemas are permitted to have dimension tables joined to other
dimension tables
 Snowflake schemas are to have one fact table only
 Snowflake schemas create normalized dimension tables
 The normalized schema reduces the disk space required for running and
managing this data warehouse
 Snowflake schemas offer an easier way to implement a dimension

What is a Galaxy Schema?


A galaxy schema is also known as Fact Constellation Schema. In this schema, multiple
fact tables share the same dimension tables. The arrangement of fact tables and
dimension tables looks like a collection of stars in the Galaxy schema model.

The shared dimensions in this model are known as Conformed dimensions.

This type of schema is used for sophisticated requirements and for aggregated fact
tables that are more complex to be supported by the Star schema (or) SnowFlake
schema. This schema is difficult to maintain due to its complexity.

An example of Galaxy Schema is given below.
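A hedged DDL sketch of the idea, reusing the illustrative date_dim, product_dim, and sales_fact tables from the star schema sketch earlier: adding a second fact table that shares those dimensions turns the design into a constellation.

    -- Second fact table sharing the conformed dimensions date_dim and
    -- product_dim; together with sales_fact this forms a galaxy schema.
    CREATE TABLE shipping_fact (
        date_key      INT REFERENCES date_dim (date_key),
        product_key   INT REFERENCES product_dim (product_key),
        carrier       VARCHAR(50),
        shipping_cost DECIMAL(12, 2)
    );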

Characteristics of the Galaxy Schema:


 Galaxy Schema is multidimensional acting as a strong design consideration
for complex database systems
 Galaxy Schema reduces redundancy to near zero redundancy as a result of
normalization
 Galaxy Schema is known for high data quality and accuracy and lends itself to
effective reporting and analytics
Key Differences Between Star, Snowflake, and Galaxy schema

Aspect | Star Schema | Snowflake Schema | Galaxy Schema
Elements | Single fact table connected to multiple dimension tables, with no sub-dimension tables | Single fact table connects to multiple dimension tables that connect to multiple sub-dimension tables | Multiple fact tables connect to multiple dimension tables that connect to multiple sub-dimension tables
Normalization | Denormalized | Normalized | Normalized
Number of Dimensions | Multiple dimension tables map to a single fact table | Multiple dimension tables map to multiple dimension tables | Multiple dimension tables map to multiple fact tables
Data Redundancy | High | Low | Low
Performance | Fewer foreign keys, resulting in increased performance | Decreased performance compared to the star schema, from the higher number of foreign keys | Decreased performance compared to star and snowflake; used for complex data aggregation
Complexity | Simple, designed to be easy to understand | More complicated compared to the star schema; can be more challenging to understand | Most complicated to understand; reserved for highly complex data structures
Storage Usage | Higher disk space due to data redundancy | Lower disk space due to limited data redundancy | Low disk space usage compared to the level of sophistication, due to the limited data redundancy
Design Limitations | One fact table only; no sub-dimensions | One fact table only; multiple sub-dimensions permitted | Multiple fact tables permitted; only first-level dimensions are permitted

Extraction in Data Warehouses

Extraction is the operation of extracting data from a source system for further use in a
data warehouse environment. This is the first step of the ETL process. After the
extraction, this data can be transformed and loaded into the data warehouse.

The source systems for a data warehouse are typically transaction processing
applications. For example, one of the source systems for a sales analysis data warehouse
might be an order entry system that records all of the current order activities.

Designing and creating the extraction process is often one of the most time-consuming
tasks in the ETL process and, indeed, in the entire data warehousing process. The source
systems might be very complex and poorly documented, and thus determining which
data needs to be extracted can be difficult. The data has to be extracted normally not
only once, but several times in a periodic manner to supply all changed data to the
warehouse and keep it up-to-date. Moreover, the source system typically cannot be
modified, nor can its performance or availability be adjusted, to accommodate the
needs of the data warehouse.

The extraction method you should choose is highly dependent on the source system and
also on the business needs in the target data warehouse environment. Very often,
there is no possibility of adding additional logic to the source systems to enable
incremental extraction of data, due to the performance impact or the increased workload
on these systems. Sometimes the customer is not even allowed to add anything to an out-
of-the-box application system.

The estimated amount of the data to be extracted and the stage in the ETL process
(initial load or maintenance of data) may also impact the decision of how to extract,
from a logical and a physical perspective. Basically, you have to decide how to extract
data logically and physically.
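Logically, the choice usually comes down to full versus incremental extraction, sketched below against an assumed source table orders with an assumed last_modified change column:

    -- Full extraction (typically for the initial load):
    SELECT * FROM orders;

    -- Incremental extraction (maintenance): pull only the rows changed since
    -- the previous run; :last_extract_time is remembered between runs.
    SELECT *
    FROM orders
    WHERE last_modified > :last_extract_time;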

Extraction involves locating and gathering data from diverse sources like databases,
cloud storage, APIs, and flat files.
Tools & Techniques:
Code Generators: Generate code to automate the extraction process from specific
data sources.
Database Replication Tools: Replicate and transform data between different database
systems.
Web Scraping: Extract data from websites, though it requires careful handling of
dynamic content.
API Connectors: Tools that connect to and extract data from various APIs.

Data Cleanup
Improves data quality by addressing inconsistencies, errors, and redundancies.
Tools & Techniques:
Data Profiling: Analyzes data to identify patterns, anomalies, and potential issues.
Data Cleansing: Removes or corrects incorrect, incomplete, or irrelevant data.
Data Validation: Ensures data meets predefined rules and formats.
Data Aggregation: Combines data from multiple sources or records into a single,
summarized record.
Data Normalization: Organizes data to reduce redundancy and improve data
integrity.
Data Transformation: Converts data into a standardized format, ensuring
compatibility with the target data warehouse.
Rule-Driven Engines: Apply predefined rules to cleanse and transform data based on
specific business requirements.

Key Considerations for Choosing Tools:


 Data Volume and Velocity: Consider the volume and speed at which data needs
to be processed.
 Data Source Variety: Ensure the tools can handle the different data formats and
structures.
 Scalability and Performance: Choose tools that can handle increasing data
volumes and processing demands.
 Integration Capabilities: Ensure seamless integration with the target data
warehouse and other systems.
 Cost and Maintenance: Evaluate the licensing costs and ongoing maintenance
requirements.

Multi-Dimensional Data Model

A multidimensional model views data in the form of a data-cube. A data cube enables
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and
facts.

The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, and location. These dimensions allow the
store to keep track of things, for example, monthly sales of items and the locations at
which the items were sold. Each dimension has a table related to it, called a dimensional
table, which describes the dimension further. For example, a dimensional table for an
item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is
shown in the table. In this 2D representation, the sales for Delhi are shown for the time
dimension (organized in quarters) and the item dimension (classified according to the
types of items sold). The fact or measure displayed is rupee_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose
the data according to time and item, as well as location, is considered for the cities
Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D
data of the table are represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data
cube, as shown in the figure.
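A hedged sketch of how the cells of such a cube could be computed from flat sales records; the sales table and its columns are assumptions for the example:

    -- One cell per (quarter, item type, city) combination; the measure is
    -- rupee_sold, as in the tables above.
    SELECT quarter, item_type, city,
           SUM(amount_sold) AS rupee_sold
    FROM sales
    WHERE city IN ('Chennai', 'Kolkata', 'Mumbai', 'Delhi')
    GROUP BY quarter, item_type, city;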
Data Cube

In the computer programming context, a data cube is a multidimensional array of
values. Typically, the term data cube is applied in contexts where these arrays are
massively larger than the hosting computer's main memory; examples include multi-terabyte
data warehouses and time series of image data.

The data cube is used to represent data along some dimensions of interest. For example,
in OLAP such dimensions could be the subsidiaries a company has, the products the
company offers, and time. In this setup, a fact would be a sales event where a particular
product has been sold in a particular subsidiary at a particular time. In satellite image
time series, the dimensions would be latitude and longitude coordinates and time; a fact
would be a pixel at a given space and time as taken by the satellite. Even though it is
called a cube, it is generally a multidimensional concept, which can be 1-dimensional, 2-
dimensional, 3-dimensional, or higher. In any case, every dimension divides the data into
groups of cells, where each cell in the cube represents a single measure of interest.
Sometimes cubes hold only a few values, with the rest being empty, i.e., undefined;
sometimes most or all cells hold a value. In the first case such data are called
sparse, in the second case they are called dense, although there is no hard delineation
between the two.

Applications

Multi-dimensional arrays can meaningfully represent spatio-temporal (belonging to both
space and time, or to space) sensor, image, and simulation data, where the semantics
of the dimensions is not necessarily of a spatial or temporal nature. Generally, any kind of
axis can be combined with any other into a data cube.

Star Schema

The star schema is the fundamental schema among the data mart schemas, and it is the
simplest. This schema is widely used to develop or build a data warehouse and
dimensional data marts. It includes one or more fact tables indexing any number of
dimensional tables. The star schema is also the basis of the snowflake schema. It is
said to be a star as its physical model resembles a star shape, having a fact table at its
center and the dimension tables at its periphery representing the star's points.

Advantages of Star Schema:

Simpler Queries

The join logic of a star schema is quite simple in comparison to the join logic
needed to fetch data from a transactional schema that is highly normalized.

Simplified business reporting logic

In comparison to a transactional schema that is highly normalized, the star schema
simplifies common business reporting logic, such as as-of reporting and period-
over-period reporting.

Feeding Cubes

The star schema is widely used by all OLAP systems to design OLAP cubes efficiently. In fact,
major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a
source without designing a cube structure.

Disadvantages of Star Schema:

-Data integrity is not enforced well, since the schema is in a highly de-normalized state.

-Not as flexible in terms of analytical needs as a normalized data model.

-Star schemas don't support many-to-many relationships within business entities, at
least not frequently.
Snowflake Schema

A snowflake schema in a data warehouse is a logical arrangement of tables in a
multidimensional database such that the ER diagram resembles a snowflake shape. A
snowflake schema is an extension of a star schema, and it adds additional dimensions.
The dimension tables are normalized, which splits data into additional tables.

Characteristics of Snowflake:

-The main benefit of the snowflake schema is that it uses smaller disk space.

-It is easier to implement when a dimension is added to the schema.

-Due to multiple tables, query performance is reduced.

-The primary challenge that you will face while using the snowflake schema is that you
need to perform more maintenance effort because of the larger number of lookup tables.

Fact Constellations

A fact constellation is a schema for online analytical processing consisting of a collection of
multiple fact tables sharing dimension tables, viewed as a collection of stars. It is
possible to create a fact constellation schema by splitting the original star schema into more
star schemas. It has many fact tables and some common dimension tables.

Advantage:

Provides a flexible schema.

Disadvantage

It is much more complex and hence, hard to implement and maintain.

Concept hierarchy
It defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. Consider a concept hierarchy for the dimension location. City values
for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can
be mapped to the province or state to which it belongs. For example, Vancouver can be
mapped to British Columbia, and Chicago to Illinois. The provinces and states can in turn
be mapped to the country (e.g., Canada or the United States) to which they belong.
These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
This concept hierarchy is illustrated in Figure 4.9.

Many concept hierarchies are implicit within the database schema. For example,
suppose that the dimension location is described by the attributes number, street, city,
province_or_state, zip_code, and country. These attributes are related by a total order,
forming a concept hierarchy such as “street < city < province_or_state < country.” This
hierarchy is shown in Figure 4.10(a). Alternatively, the attributes of a dimension may be
organized in a partial order, forming a lattice. An example of a partial order for
the time dimension based on the attributes day, week, month, quarter, and year is “day
< {month < quarter; week} < year.” This lattice structure is shown in Figure 4.10(b). A
concept hierarchy that is a total or partial order among attributes in a database schema
is called a schema hierarchy. Concept hierarchies that are common to many
applications (e.g., for time) may be predefined in the data mining system.
Figure 4.10. Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a hierarchy for location and (b) a lattice for time.

Concept hierarchies may also be defined by grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined
among groups of values. An example of a set-grouping hierarchy is shown in Figure for
the dimension price, where an interval ($X…$Y] denotes the range from $X (exclusive) to
$Y (inclusive).

Figure. A concept hierarchy for price.

There may be more than one concept hierarchy for a given attribute or dimension,
based on different user viewpoints. For instance, a user may prefer to organize price by
defining ranges for inexpensive, moderately_priced, and expensive.
Concept hierarchies may be provided manually by system users, domain experts, or
knowledge engineers, or may be automatically generated based on statistical analysis of
the data distribution.
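For instance, a set-grouping hierarchy like the one for price can be imposed at query time with a CASE expression; the band boundaries and table name below are illustrative assumptions:

    -- Group raw prices into named ranges, then aggregate per band.
    SELECT price_band, COUNT(*) AS item_count
    FROM (
        SELECT CASE
                   WHEN price <= 100 THEN 'inexpensive'
                   WHEN price <= 400 THEN 'moderately_priced'
                   ELSE 'expensive'
               END AS price_band
        FROM product_dim
    ) banded
    GROUP BY price_band;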

OLAP (Online Analytical Processing)

We can define OLAP in a data warehouse as a computing technology that allows users to
query data and analyze it from different perspectives. The technology is a great solution for
business analysts who need to pre-aggregate and pre-calculate data for fast analysis.
OLAP architecture in a data warehouse is based on a multidimensional
data structure. The data takes the form of OLAP cubes, which have a star or snowflake-
shaped schema. In the middle of the star (or snowflake) is a table that includes data
aggregations and reconciles the various dimensions.
To query the data required, OLAP usually uses MDX language. MDX queries work with
OLAP members, dimensions and hierarchies.

Features of OLAP in data warehouse are the following:


 Has an intuitive, easy-to-use interface;
 Supports complex calculations;
 Provides a multidimensional view of the data;
 Time intelligence.

Different types of OLAP in data warehouse


Among OLAP variants, we can distinguish three main types and four additional.
There are following three major OLAP models in data warehouse:
1. ROLAP or Relational OLAP: the kind of system where users query data from a relational
database or from their own local tables. Thus, the number of potential questions is not
limited.
2. MOLAP or Multidimensional OLAP: this system stores the data in a multidimensional
database. It provides high speed of calculations.
3. HOLAP or Hybrid OLAP: a mix of the two above-mentioned systems, with pre-computed
aggregates and the cube structure stored in a multidimensional database.
Here are four additional types of OLAP server in data warehouse:
1. WOLAP – Web OLAP
2. SOLAP – Spatial OLAP
3. Mobile OLAP
4. DOLAP – Desktop OLAP
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
 By climbing up a concept hierarchy for a dimension
 By dimension reduction
The following diagram illustrates how roll-up works.

 Roll-up is performed by climbing up a concept hierarchy for the dimension location.


 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
 When roll-up is performed, one or more dimensions from the data cube are removed (a query sketch follows below).
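A rough SQL equivalent of this roll-up, against the illustrative star schema used earlier (a location_dim with city and country columns is assumed):

    -- Before roll-up: sales aggregated at the city level.
    SELECT l.city, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city;

    -- After roll-up: climb the location hierarchy from city to country.
    SELECT l.country, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.country;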
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following
ways −
 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension.
The following diagram illustrates how drill-down works −

 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the
level of month.
 When drill-down is performed, one or more dimensions from the data cube are
added.
 It navigates the data from less detailed data to highly detailed data (a sketch follows below).
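In the same rough SQL terms, drill-down descends the time hierarchy by grouping on a finer-grained attribute (names again illustrative):

    -- Descend from quarter to month: each quarter row fans out into months.
    SELECT d.quarter, d.month, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    GROUP BY d.quarter, d.month;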

Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
 Here Slice is performed for the dimension "time" using the criterion time = "Q1".
 It will form a new sub-cube by selecting one or more dimensions.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube, based on the following selection criteria, involves three
dimensions (see the sketch after this list).
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows
the pivot operation.
OLAP vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
