
Data Warehouse-Ccs341 Material

The document provides an overview of data warehouses including: 1. It defines a data warehouse as a relational database designed for query and analysis rather than transactions, containing historical data from multiple sources to support decision making. 2. The key components of a data warehouse are the source data, data staging area, data storage, and information delivery. The staging area prepares extracted data for storage through extraction, transformation, and loading processes. 3. Benefits of data warehouses include enabling better business decisions and forecasting through centralized access to large amounts of historical data from various sources.


CCS341- DATA WAREHOUSE

Unit-1
Data warehouse Introduction – Data warehouse components- operational database Vs data
warehouse – Data warehouse Architecture – Three-tier Data Warehouse Architecture –
Autonomous Data Warehouse- Autonomous Data Warehouse Vs Snowflake – Modern Data
Warehouse
Introduction

A Data Warehouse is a relational database management system (RDBMS) constructed to meet the
requirements of decision support rather than transaction processing. It can be loosely described
as any centralized data repository which can be queried for business benefit. It is a database
that stores information oriented to satisfy decision-making requests. It is a group of
decision-support technologies aimed at enabling the knowledge worker (executive, manager, and
analyst) to make better and faster decisions. Data Warehousing therefore supports architectures
and tools that let business executives systematically organize, understand, and use their
information to make strategic decisions.

A Data Warehouse environment contains an extraction, transformation, and loading (ETL)
solution, an online analytical processing (OLAP) engine, client analysis tools, and other
applications that handle the process of gathering data and delivering it to business users.

What is a Data Warehouse?


 A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction
data from single or multiple sources.
 A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
 A Data Warehouse is a group of data specific to the entire organization, not only to a
particular group of users.
 It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:

 It is a database designed for investigative tasks, using data from various applications.
 It supports a relatively small number of clients with relatively long interactions.
 It includes current and historical data to provide a historical perspective of information.
 Its usage is read-intensive.
 It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information


in support of management's decisions."

Features or Characteristics of Data Warehouse

Subject-Oriented
 A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of
a particular subject, such as customer, product, or sales, instead of the
organization's global ongoing operations.
 This is done by excluding data that are not useful concerning the subject and including all
data needed by the users to understand the subject.
Integrated
 A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records.
It requires performing data cleaning and integration during data warehousing to ensure
consistency in naming conventions, attribute types, etc., among the different data sources.

Time-Variant
 Historical information is kept in a data warehouse.
 For example, one can retrieve data from 3 months, 6 months, 12 months, or even
older periods from a data warehouse.
 This contrasts with a transaction system, where often only the most current data is kept.
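The time-variant idea above can be sketched in a few lines. This is only an illustration with invented records and dates (the `sales` rows and the fixed `today` are assumptions, not from the material): it retrieves rows from a 3-month or 12-month historical window.

```python
from datetime import date, timedelta

# Hypothetical sales records; each row carries its own time element.
sales = [
    {"day": date(2023, 1, 15), "amount": 120.0},
    {"day": date(2023, 7, 10), "amount": 250.0},
    {"day": date(2023, 12, 1), "amount": 90.0},
]

def window(records, months, today=date(2024, 1, 1)):
    """Return the records falling in roughly the last `months` months."""
    cutoff = today - timedelta(days=30 * months)
    return [r for r in records if r["day"] >= cutoff]

print(len(window(sales, 3)))   # 1 -- only the December row is recent enough
print(len(window(sales, 12)))  # 3 -- all rows fall inside the last year
```

A transaction system would typically hold only the latest row; the warehouse keeps all three so that any window can be queried.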

Non-Volatile
 The data warehouse is a physically separate data store, transformed from the
source operational RDBMS.
 Operational updates of data do not occur in the data warehouse, i.e., update, insert,
and delete operations are not performed.
 It usually requires only two procedures in data accessing:
 the initial loading of data, and read access to data.
Therefore, the DW does not require transaction processing, recovery, and concurrency-control
capabilities, which allows a substantial speedup of data retrieval.
Non-volatile means that once data is entered into the warehouse, it should not change.
History of Data Warehouse
 The idea of data warehousing dates to the late 1980s, when IBM researchers Barry
Devlin and Paul Murphy introduced the "Business Data Warehouse."
 In essence, the data warehousing idea was intended to provide an architectural model for
the flow of information from operational systems to decision-support environments.
 The concept attempted to address the various problems associated with this flow, mainly
the high costs associated with it.
 In the absence of a data warehousing architecture, a vast amount of redundant storage was
required to support multiple decision-support environments.
 In large corporations, it was common for multiple decision-support environments to
operate independently.

Goals of Data Warehousing

 To help reporting as well as analysis


 Maintain the organization's historical information
 Be the foundation for decision making.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:
Business User:
Business users require a data warehouse to view summarized data from the past. Since
these people are non-technical, the data may be presented to them in an elementary form.
Store historical data:
A data warehouse is required to store time-variant data from the past. This input is
used for various purposes.
Make strategic decisions:
Some strategies may depend upon the data in the data warehouse. So, the data
warehouse contributes to making strategic decisions.
For data consistency and quality:
By bringing data from different sources to a common place, the user can effectively
enforce uniformity and consistency in the data.
High response time:
The data warehouse has to be ready for somewhat unexpected loads and types of queries,
which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse

 Understand business trends and make better forecasting decisions.
 Data warehouses are designed to perform well with enormous amounts of data.
 The structure of data warehouses is easier for end-users to navigate, understand,
and query.
 Queries that would be complex in many normalized databases can be easier to build and
maintain in data warehouses.
 Data warehousing is an efficient method to manage demand for lots of information from
lots of users.
 Data warehousing provides the capability to analyze large amounts of historical data.

Components or Building Blocks of Data Warehouse


 Architecture is the proper arrangement of the elements.
 We build a data warehouse with software and hardware components.
 To suit the requirements of our organization, we arrange these building blocks in a
certain way; we may want to boost one part with extra tools and services.
 All of this depends on our circumstances.
 The figure shows the essential elements of a typical warehouse.
 The Source Data component is shown on the left.
 The Data Staging element serves as the next building block.
 In the middle is the Data Storage component that manages the data warehouse's
data.
 This element not only stores and manages the data; it also keeps track of the data using
the metadata repository.
 The Information Delivery component, shown on the right, consists of all the different
ways of making the information from the data warehouse available to the users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad categories:

Production Data:
This type of data comes from the different operational systems of the enterprise.
Based on the data requirements in the data warehouse, we choose segments of the data from the
various operational systems.
Internal Data:
In each organization, users keep their "private" spreadsheets, reports, customer profiles,
and sometimes even departmental databases. This is the internal data, part of which could be
useful in a data warehouse.

Archived Data:
Operational systems are mainly intended to run the current business. In every operational
system, we periodically take the old data and store it in archived files.

External Data:
Most executives depend on information from external sources for a large percentage of the
information they use. They use statistics relating to their industry produced by external
agencies.

Data Staging Component

 After we have extracted data from various operational systems and external
sources, we have to prepare it for storing in the data warehouse.
 The extracted data coming from several different sources needs to be changed, converted,
and made ready in a format suitable to be saved for querying and analysis.
 We will now discuss the three primary functions that take place in the staging area.

Data Extraction:
 This method has to deal with numerous data sources.
 We have to employ the appropriate techniques for each data source.

Data Transformation:
 As we know, data for a data warehouse comes from many different sources.
 If data extraction for a data warehouse poses big challenges, data transformation
presents even more significant challenges.
 We perform several individual tasks as part of data transformation.
 First, we clean the data extracted from each source.
 Cleaning may be the correction of misspellings, providing default values for missing
data elements, or the elimination of duplicates when we bring in the same data from
various source systems.
 Standardization of data elements forms a large part of data transformation.
 Data transformation also involves many forms of combining pieces of data from
different sources. We combine data from a single source record or related data
elements from many source records.
Data Loading:
 Two distinct categories of tasks form the data loading functions.
 When we complete the structure and construction of the data warehouse and go live for
the first time, we do the initial loading of the data into the data warehouse
storage.
 The initial load moves high volumes of data and uses up a substantial amount of time;
after it, incremental loads apply the ongoing changes on a periodic basis.
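The three staging functions can be sketched end-to-end in a few lines. This is only an illustration with invented source rows and column names; `sqlite3` stands in here for the warehouse storage, and the cleaning rules (title-casing names, defaulting missing regions, dropping duplicates) are assumptions chosen to mirror the tasks described above.

```python
import sqlite3

# Extraction: rows pulled from two hypothetical source systems.
source_a = [("Alice", "NY", "100"), ("alice", "NY", "100")]   # note the duplicate
source_b = [("Bob", None, "250")]                             # note the missing region

def transform(rows):
    """Clean, standardize, and de-duplicate extracted rows."""
    seen, out = set(), []
    for name, region, amount in rows:
        name = name.title()                 # standardize naming conventions
        region = region or "UNKNOWN"        # default value for a missing element
        key = (name, region)
        if key in seen:                     # eliminate duplicates across sources
            continue
        seen.add(key)
        out.append((name, region, float(amount)))  # convert amount to a number
    return out

# Loading: the initial load into the warehouse storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 transform(source_a + source_b))
total = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(total)  # 2 -- the duplicate row was eliminated during transformation
```

Incremental loads would rerun the same pipeline on only the changed source rows.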

Data Storage Components

 Data storage for the data warehouse is a separate repository.
 The data repositories for the operational systems generally include only the current data.
 Also, these data repositories hold the data structured in highly normalized form for fast
and efficient processing.

Information Delivery Component

The information delivery element is used to enable the process of subscribing to data
warehouse files and having them transferred to one or more destinations according to some
customer-specified scheduling algorithm.

Metadata Component

 Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system.
 In the data dictionary, we keep the data about the logical data structures, the data about
the records and addresses, the information about the indexes, and so on.

Data Marts

 A data mart includes a subset of corporate-wide data that is of value to a specific group
of users.
 The scope is confined to particular selected subjects.
 Data in a data warehouse should be fairly current, but not necessarily up to the minute,
although developments in the data warehouse industry have made standard and
incremental data loads more achievable.
 Data marts are smaller than data warehouses and usually contain data for a single
department or subject area.
 The current trend in data warehousing is to develop a data warehouse with
several smaller related data marts for particular kinds of queries and reports.

Management and Control Component

 The management and control elements coordinate the services and functions within the
data warehouse.
 These components control the data transformation and the data transfer into the data
warehouse storage.
 On the other hand, it moderates the data delivery to the clients.
 It works with the database management systems and ensures that data is correctly
saved in the repositories.
 It monitors the movement of data into the staging area and from there into the data
warehouse storage itself.
Difference between Database and Data Warehouse

1. A database is used for Online Transaction Processing (OLTP), though it can also serve
other objectives such as data warehousing; it records current data from the clients.
A data warehouse is used for Online Analytical Processing (OLAP); it reads historical
data about the customers for business decisions.
2. Database tables and joins are complicated since they are normalized for the RDBMS;
this reduces redundant data and saves storage space. Data warehouse tables and joins
are simple since they are de-normalized; this reduces the response time for analytical
queries.
3. Database data is dynamic; data warehouse data is largely static.
4. Entity-relationship modeling procedures are used for RDBMS database design; a
data-modeling approach is used for data warehouse design.
5. A database is optimized for write operations; a data warehouse is optimized for read
operations.
6. Database performance is low for analysis queries; a data warehouse gives high
performance for analytical queries.
7. The database is the place where the data is taken as a base and managed to get fast
and efficient access; the data warehouse is the place where the application data is
managed for analysis and reporting objectives.
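The de-normalization point above can be made concrete. In the sketch below (table and column names are invented, and `sqlite3` stands in for a warehouse RDBMS), one flat fact table answers an analytical question with a single `GROUP BY` and no joins, where a normalized OLTP design would need joins across several tables.

```python
import sqlite3

# A de-normalized, warehouse-style fact table: every row already carries
# the descriptive attributes an analyst filters and groups by.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_fact (
    product TEXT, region TEXT, year INTEGER, amount REAL)""")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    ("Widget", "East", 2023, 100.0),
    ("Widget", "West", 2023, 150.0),
    ("Gadget", "East", 2023, 80.0),
])

# One table, one GROUP BY -- no joins required for the analytical query.
rows = conn.execute("""SELECT product, SUM(amount) FROM sales_fact
                       GROUP BY product ORDER BY product""").fetchall()
print(rows)  # [('Gadget', 80.0), ('Widget', 250.0)]
```

The cost of this simplicity is redundancy: "East" and "Widget" repeat across rows, which an OLTP schema would normalize away.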

Difference between Operational Database and Data Warehouse

1. Operational systems are designed to support high-volume transaction processing.
Data warehousing systems are typically designed to support high-volume analytical
processing (i.e., OLAP).
2. Operational systems are usually concerned with current data. Data warehousing
systems are usually concerned with historical data.
3. Data within operational systems is updated regularly according to need. Data
warehouse data is non-volatile: new data may be added regularly, but once added it
is rarely changed.
4. Operational systems are designed for real-time business dealings and processes.
Data warehousing systems are designed for analysis of business measures by subject
area, categories, and attributes.
5. Operational systems are optimized for a simple set of transactions, generally adding
or retrieving a single row at a time per table. Data warehousing systems are
optimized for bulk loads and large, complex, unpredictable queries that access many
rows per table.
6. Operational systems are optimized for validation of incoming data during
transactions and use validation data tables. Data warehouses are loaded with
consistent, valid data and require no real-time validation.
7. Operational systems support thousands of concurrent clients. Data warehousing
systems support a few concurrent clients relative to OLTP.
8. Operational systems are widely process-oriented. Data warehousing systems are
widely subject-oriented.
9. Operational systems are usually optimized to perform fast inserts and updates of
relatively small volumes of data. Data warehousing systems are usually optimized to
perform fast retrievals of relatively large volumes of data.
10. In operational systems, a small number of records is accessed at a time. In data
warehousing systems, a large number of records is accessed at a time.
11. Relational databases are created for online transaction processing (OLTP). Data
warehouses are designed for online analytical processing (OLAP).
Difference between OLTP and OLAP

1. OLTP is a system used to manage operational data; OLAP is a system used to manage
informational data.
2. OLTP users are clerks, clients, and information technology professionals; OLAP users
are knowledge workers, including managers, executives, and analysts.
3. An OLTP system is customer-oriented: transaction and query processing is done by
clerks, clients, and IT professionals. An OLAP system is market-oriented: data
analysis is done by knowledge workers, including managers, executives, and analysts.
4. An OLTP system manages current data that is too detailed to be easily used for
decision making. An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages data at
different levels of granularity; this makes the data easier to use in informed
decision making.
5. Typical database size is 100 MB to GB for OLTP and 100 GB to TB for OLAP.
6. An OLTP system usually uses an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically uses either a star or
snowflake model and a subject-oriented database design.
7. An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations.
An OLAP system often spans multiple versions of a database schema, due to the
evolutionary nature of an organization; it also deals with data that originates from
different organizations, integrating information from many data stores.
8. OLTP databases are not very large. Because of their large volume, OLAP data are
stored on multiple storage media.
9. The access patterns of an OLTP system consist mainly of short, atomic transactions;
such a system requires concurrency-control and recovery mechanisms. Accesses to OLAP
systems are mostly read-only operations, because these data warehouses store
historical data.
10. OLTP queries are very fast. OLAP query speed depends on the amount of data
involved; batch data refreshes and complex queries may take many hours, and query
speed can be improved by creating indexes.
11. OLTP access mode is read/write; OLAP access is mostly read-only.
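Point 4 above — managing data at different levels of granularity — can be sketched without any database at all. The detail rows and the `summarize` helper below are invented for illustration: the same detail data is rolled up by quarter and region, by quarter alone, or to a grand total.

```python
from collections import defaultdict

# Hypothetical detail rows: (quarter, region, amount).
detail = [
    ("2023-Q1", "East", 100), ("2023-Q1", "West", 150),
    ("2023-Q2", "East", 120), ("2023-Q2", "West", 130),
]

def summarize(rows, level):
    """Aggregate at a granularity level:
    0 = by (quarter, region), 1 = by quarter, 2 = grand total."""
    totals = defaultdict(int)
    for quarter, region, amount in rows:
        key = (quarter, region)[:2 - level] or ("ALL",)
        totals[key] += amount
    return dict(totals)

print(summarize(detail, 1))  # {('2023-Q1',): 250, ('2023-Q2',): 250}
print(summarize(detail, 2))  # {('ALL',): 500}
```

An OLAP engine keeps such summaries precomputed at several levels so that coarse-grained questions never have to scan the detail rows.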
Data Warehouse Architecture

Operational System

An operational system is a term used in data warehousing to refer to a system that processes
the day-to-day transactions of an organization.

Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.

Meta Data
 Metadata is a set of data that defines and gives information about other data. Metadata
is used in a data warehouse for a variety of purposes, including:
 Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date built, date
changed, and file size are examples of very basic document metadata.
 Metadata is used to direct a query to the most appropriate data source.

Three-Tier Data Warehouse Architecture


Data Warehouses usually have a three-level (tier) architecture that includes:
 Bottom Tier (Data Warehouse Server)
 Middle Tier (OLAP Server)
 Top Tier (Front-end Tools).

A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It
may include several specialized data marts and a metadata repository.
 Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces called a gateway.
 A gateway is provided by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking
and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
A middle-tier which consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a particular purpose server that directly
implements multidimensional information and operations.

A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.

The overall Data Warehouse architecture is shown in the figure.


The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimension,


hierarchies, data mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e.,
active, archived or purged, and warehouse monitoring information, i.e., usage statistics,
error reports, audit, etc.
3. System performance data, which includes indices, used to improve data access and
retrieval performance.
4. Information about the mapping from operational databases, which provides
source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata, which
includes business terms and definitions, ownership information, etc.
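A metadata repository can be pictured as a catalog keyed by warehouse object. The toy catalog below is an assumption for illustration (all table names, sources, and rules are invented); it records, for each warehouse table, its source mapping, transformation rule, and currency level, and answers a simple lineage question from that metadata.

```python
# Toy metadata repository: one entry per warehouse object, covering the
# mapping from operational sources (item 4 above) and operational
# metadata such as the currency level (item 2 above).
catalog = {
    "sales_fact": {
        "source": "orders_oltp.order_lines",
        "rule": "sum amounts by product, region, month",
        "currency": "active",
    },
    "customer_dim": {
        "source": "crm.customers",
        "rule": "deduplicate on customer_id",
        "currency": "archived",
    },
}

def lineage(table):
    """Answer 'where did this table come from?' using only the metadata."""
    entry = catalog[table]
    return f"{table} <- {entry['source']} ({entry['rule']})"

print(lineage("sales_fact"))
```

Real repositories hold far more (indices, usage statistics, audit trails), but the access pattern — tools querying the catalog instead of the data itself — is the same.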

Principles of Data Warehousing

Load Performance

Data warehouses require incremental loading of new data on a periodic basis within narrow time
windows; performance on the load process should be measured in hundreds of millions of rows and
gigabytes per hour and must not artificially constrain the volume of data the business requires.

Load Processing

Many steps are required to load new or updated data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata updates.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance

Fact-based management must not be slowed by the performance of the data warehouse RDBMS;
large, complex queries must complete in seconds, not days.
Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today they range from a few hundred
gigabytes to terabyte-sized data warehouses.

Autonomous Data Warehouse

An autonomous data warehouse (ADW) is a modern solution that leverages artificial
intelligence (AI) and machine learning (ML) technologies to automate various aspects of data
management and analytics processes.

It represents a significant advancement in the field of data warehousing, offering numerous
benefits such as improved efficiency, reduced operational costs, and enhanced performance.
Here's a breakdown of key points to consider:

Definition:
An autonomous data warehouse is a cloud-based data warehouse platform that utilizes AI and
ML algorithms to automate tasks such as data loading, schema design, performance tuning,
security management, and optimization of data storage.

Automation:
 One of the defining features of an autonomous data warehouse is its ability to automate
routine administrative tasks traditionally performed by database administrators (DBAs)
and data engineers.
 This automation frees up human resources to focus on more strategic initiatives rather
than mundane maintenance tasks.

Self-Driving:
 The term "autonomous" refers to the self-driving nature of the data warehouse
platform. It continuously learns from user interactions, workload patterns, and
system performance metrics to make intelligent decisions autonomously.
 For example, it can automatically allocate resources, adjust storage
configurations, and optimize query execution plans to deliver optimal
performance.

Scalability:
 ADW platforms typically offer elastic scalability, allowing organizations to
dynamically adjust compute and storage resources based on changing workload
demands.
 This ensures that the data warehouse can efficiently handle varying workloads
without manual intervention.
Security:
 Autonomous data warehouses incorporate advanced security features such as data
encryption, access controls, and threat-detection mechanisms to protect sensitive
data assets.
 AI-driven security analytics can proactively identify and mitigate security threats,
helping organizations adhere to compliance regulations and safeguard their data.

Performance:
 By leveraging AI and ML algorithms for performance tuning and optimization,
autonomous data warehouses can deliver exceptional query performance and throughput.
 These platforms can analyze query patterns, data distribution, and system
statistics to dynamically optimize query execution and resource utilization.

Cost Efficiency:
 The automation capabilities of autonomous data warehouses lead to improved cost
efficiency by reducing the need for manual intervention, minimizing downtime, and
optimizing resource utilization.
 Organizations can benefit from a pay-as-you-go pricing model, where they only pay for
the resources consumed, resulting in a lower total cost of ownership (TCO).

Integration:
 Autonomous data warehouses are designed to seamlessly integrate with a wide range
of data sources, applications, and analytics tools.
 This enables organizations to consolidate disparate data sources into a centralized
repository for unified analytics and reporting.

Autonomous Data Warehouse Vs Snowflake


 Autonomous Data Warehouse (ADW) and Snowflake are both cloud-based data
warehousing solutions that offer advanced capabilities for storing, managing, and
analyzing large volumes of data.
 However, they differ in several aspects, including architecture, pricing model,
features, and target audience.
 Here's a comprehensive comparison between the two:

Architecture:

ADW:
 Oracle Autonomous Data Warehouse is built on Oracle's Exadata infrastructure, a
highly optimized hardware platform designed for performance and scalability.
 It leverages Oracle Database technology and is fully managed by Oracle, offering
automated provisioning, patching, backups, and optimization.
Snowflake:
 Snowflake's architecture is based on a multi-cluster, shared data architecture
that separates compute resources from storage.
 It utilizes virtual warehouses for processing queries, enabling independent
scaling of compute and storage resources.
 Snowflake is available on multiple cloud platforms, including AWS, Azure,
and Google Cloud.

Pricing Model:

ADW:
 Oracle ADW typically follows a subscription-based pricing model, where customers pay
for a predefined level of compute and storage resources on a monthly or annual basis.
 Pricing may include additional charges for data transfer, backup storage, and premium
features.
Snowflake:
 Snowflake offers a consumption-based pricing model, where customers pay for the
resources used on a per-second basis.
 This model provides flexibility and cost transparency, allowing organizations to scale
resources up or down based on demand without long-term commitments.
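The practical difference between the two pricing models is easiest to see as arithmetic. All numbers below are invented for illustration (neither vendor's actual rates): a flat monthly subscription is compared with per-second consumption billing for a workload that runs only part of the day.

```python
# Hypothetical flat subscription: the same charge regardless of usage.
subscription_per_month = 3000.00

# Hypothetical consumption billing: pay per second of compute actually used.
rate_per_second = 0.0014
seconds_used = 8 * 3600 * 22          # 8 busy hours a day, 22 working days
consumption_cost = rate_per_second * seconds_used

print(round(consumption_cost, 2))     # well under the flat subscription
```

With these made-up numbers, the bursty workload is far cheaper under consumption billing, while a warehouse that runs flat-out around the clock could favor the subscription. The crossover point depends entirely on the real rates and duty cycle.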

Features:

ADW:
 Oracle ADW offers a comprehensive set of features for data warehousing, including
automated data loading, schema management, performance tuning, security, and
analytics.
 It provides integration with Oracle Analytics Cloud and other Oracle Cloud services for
end-to-end analytics solutions.
Snowflake:
 Snowflake provides a wide range of features for data warehousing and analytics,
including automatic scaling, data sharing, data lake integration, multi-cloud support, and
native support for semi-structured data formats like JSON and Avro.
 It also offers built-in support for popular BI and analytics tools.

Ease of Use:

ADW:
 Oracle ADW is known for its simplicity and ease of use, offering a user-friendly
interface and automated management features that reduce the need for manual
intervention.
 It provides self-service capabilities for data loading, querying, and analytics, making it
accessible to users with varying levels of technical expertise.
Snowflake:
 Snowflake is praised for its ease of use and simplicity, with a SQL-based interface and
intuitive web-based console for managing data and resources.
 It offers seamless integration with popular data integration tools, making it easy to
ingest data from various sources and perform complex analytics tasks.

Target Audience:

ADW:
 Oracle ADW is primarily targeted at enterprises and large organizations that require a
robust, scalable, and fully managed data warehousing solution.
 It is particularly well-suited for Oracle Database customers looking to migrate to the
cloud or consolidate their data infrastructure.
Snowflake:
 Snowflake caters to a wide range of customers, including small startups, mid-sized
businesses, and large enterprises across various industries.
Its flexible pricing model and pay-as-you-go approach make it appealing to organizations of all
sizes, especially those with fluctuating workloads or budget constraints.

Modern Data Warehouse

 A modern data warehouse (MDW) represents a transformative approach to traditional
data warehousing, leveraging advanced technologies and methodologies to address the
evolving needs of data-driven organizations.
 It encompasses a combination of cloud-native architectures, scalable storage solutions,
real-time data processing capabilities, and advanced analytics tools to enable agile,
flexible, and cost-effective data management and analytics. Here's an in-depth exploration
of the concept:

Cloud-Native Architecture:
 A modern data warehouse is typically built on a cloud-native architecture, leveraging the
scalability, elasticity, and flexibility of cloud computing platforms such as AWS, Azure,
or Google Cloud.
 This allows organizations to scale resources up or down based on demand, reduce upfront
infrastructure costs, and benefit from pay-as-you-go pricing models.
Unified Data Storage:
 Unlike traditional data warehouses that rely on rigid, structured data models, modern
data warehouses embrace a unified approach to data storage, accommodating a variety of
data types, formats, and sources.
 They integrate structured, semi-structured, and unstructured data from internal and
external sources, including relational databases, NoSQL databases, data lakes, IoT
devices, and streaming data sources.
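Ingesting semi-structured data alongside tabular data can be sketched with the standard `json` module. The event below is invented for illustration; the snippet flattens a nested JSON record into a warehouse-style flat row, the kind of shredding a modern warehouse does on load (or defers via native JSON column types).

```python
import json

# A hypothetical semi-structured event, e.g. from an IoT device.
event = json.loads("""{
    "device": "sensor-7",
    "reading": {"temp_c": 21.5, "humidity": 0.4},
    "tags": ["lab", "floor-2"]
}""")

# Flatten the nested structure into one row with scalar columns.
row = {
    "device": event["device"],
    "temp_c": event["reading"]["temp_c"],
    "humidity": event["reading"]["humidity"],
    "tags": ",".join(event["tags"]),
}
print(row["temp_c"], row["tags"])  # 21.5 lab,floor-2
```

Platforms with native semi-structured support (e.g. Snowflake's VARIANT columns) skip the flattening and let queries path into the JSON directly.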
Real-Time Data Processing:
 Modern data warehouses prioritize real-time data processing and analytics capabilities to
enable timely insights and decision-making.
 They support streaming data ingestion, event-driven architectures, and near real-time
analytics to capture and analyze data as it is generated, ensuring that organizations can
respond rapidly to changing business conditions and emerging trends.
Advanced Analytics and AI:
 A key characteristic of modern data warehouses is their ability to leverage advanced
analytics techniques and artificial intelligence (AI) algorithms to extract actionable
insights from large volumes of data.
 This includes predictive analytics, machine learning, natural language processing, and
data visualization tools that empower users to uncover patterns, trends, and correlations
within their data.
Self-Service Data Exploration:
 Modern data warehouses emphasize self-service data exploration and analytics
capabilities, allowing business users, data analysts, and data scientists to access, query,
and visualize data without extensive technical expertise or reliance on IT departments
Scalability and Performance:
 Scalability and performance are critical considerations in modern data warehouse
architectures.
 They are designed to handle large volumes of data and concurrent user queries
efficiently, leveraging distributed computing and parallel processing techniques to
deliver high throughput and low latency query performance even under heavy workloads.
Security and Governance:
 Modern data warehouses prioritize security and governance to protect sensitive data
assets and ensure regulatory compliance.
 They incorporate robust authentication, authorization, encryption, and auditing
mechanisms to safeguard data privacy and integrity.

UNIT II ETL AND OLAP TECHNOLOGY


What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design
and Modeling - Delivery Process - Online Analytical Processing (OLAP) -
Characteristics of OLAP - Online Transaction Processing (OLTP) Vs OLAP - OLAP
operations- Types of OLAP- ROLAP Vs MOLAP Vs HOLAP.

INTRODUCTION:
ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the warehouse.
The process of ETL can be broken down into the following three stages:

1. Extract:
The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This
step involves reading data from the source systems and storing it in a
staging area.
2. Transform:
In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning
and validating the data, converting data types, combining data from
multiple sources, and creating new data fields.
3. Load:
After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the
data into the warehouse.

ETL is a process in Data Warehousing and it stands for Extract, Transform


and Load. It is a process in which an ETL tool extracts the data from various data
source systems, transforms it in the staging area, and then finally, loads it into the
Data Warehouse system.
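The three stages above can be sketched end-to-end in Python. This is a minimal illustrative pipeline, not any particular ETL tool: the CSV layout, the sales table, and the cleaning rules are assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (here, an in-memory CSV standing in
# for a flat file exported from a transactional system).
raw_csv = "order_id,amount,region\n1, 100.50 ,north\n2,  75.00,SOUTH\n"

def extract(source):
    return list(csv.DictReader(io.StringIO(source)))

# Transform: clean and validate in a staging step -- trim whitespace,
# normalize case, and convert data types.
def transform(rows):
    staged = []
    for row in rows:
        staged.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"].strip()),
            "region": row["region"].strip().lower(),
        })
    return staged

# Load: create the physical structure and insert the transformed rows.
def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(order_id INTEGER, amount REAL, region TEXT)")
    conn.executemany("INSERT INTO sales VALUES (:order_id, :amount, :region)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_csv)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 175.5
```

Note that the transformation happens before anything reaches the warehouse table, which is the defining property of ETL as opposed to ELT.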

Types of Data Warehouse


There are two main types of data warehouse:
Enterprise Data Warehouse (EDW)
Operational Data Store (ODS)

Enterprise Data Warehouse (EDW)


This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise. The advantage to this type of
warehouse is that it provides access to cross-organizational information, offers a
unified approach to data representation, and allows running complex queries.

Operational Data Store (ODS)


This type of data warehouse refreshes in real time. It is often preferred for
routine activities like storing employee records, and it is required when data
warehouse systems do not support the reporting needs of the business.

Data Warehouse Modelling


Data warehouse modelling is the process of designing the schemas of the
detailed and summarized information of the data warehouse. The goal of data
warehouse modelling is to develop a schema describing reality, or at least a
part of it, which the data warehouse needs to support. Data warehouse
modelling is an essential stage of building a data warehouse for two main
reasons. Firstly, through the schema, data warehouse clients can visualize the
relationships among the warehouse data, to use them with greater ease.
Secondly, a well-designed schema allows an effective data warehouse structure
to emerge, which helps decrease the cost of implementing the warehouse and
improves the efficiency of using it.

Conceptual Data Model


A conceptual data model recognizes the highest-level relationships
between the different entities.

Characteristics of the conceptual data model


It contains the essential entities and the relationships among them.
No attribute is specified.
No primary key is specified.

We can see that the only data shown via the conceptual data model is the
entities that define the data and the relationships between those entities; no
other detail is shown through the conceptual data model.

Logical Data Model


A logical data model describes the data in as much detail as possible,
without regard to how it will be physically implemented in the database.
The primary objective of logical data modeling is to document the business data
structures, processes, rules, and relationships in a single view - the logical data
model.

Features of a logical data model


It involves all entities and relationships among them.
All attributes for each entity are specified.
The primary key for each entity is stated.
Referential Integrity is specified (FK Relation).

The steps for designing the logical data model are as follows:
Specify primary keys for all entities.

List the relationships between different entities.


List all attributes for each entity.
Normalization.
No data types are listed

Physical Data Model


Physical data model describes how the model will be presented in the
database. A physical database model demonstrates all table structures, column
names, data types, constraints, primary key, foreign key, and relationships
between tables. The purpose of physical data modeling is the mapping of the
logical data model to the physical structures of the RDBMS system hosting the data
warehouse. This contains defining physical RDBMS structures, such as tables and
data types to use when storing the information. It may also include the definition
of new data structures for enhancing query performance.

Characteristics of a physical data model


Specification of all tables and columns.
Foreign keys are used to recognize relationships between tables.

The steps for physical data model design which are as follows:

Convert entities to tables.


Convert relationships to foreign keys.
Convert attributes to columns.
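The three steps can be illustrated with SQLite from Python. The Customer and Order entities here are hypothetical, chosen only to show entities becoming tables, a relationship becoming a foreign key, and attributes becoming typed columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity

# Step 1 + 3: the "Customer" entity becomes a table; its attributes become
# typed columns, and its identifier becomes the primary key.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")

# Step 2: the Customer-places-Order relationship becomes a foreign key
# column on the "orders" table referencing customer.
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (10, 1, 250.0)")

# Referential integrity is now enforced: an order for a missing
# customer raises an IntegrityError.
try:
    conn.execute("INSERT INTO orders VALUES (11, 99, 10.0)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True
```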

Types of Data Warehouse Models

Data Warehouse Design


A data warehouse is a single data repository where a record from multiple
data sources is integrated for online business analytical processing (OLAP). This
implies a data warehouse needs to meet the requirements from all the business
stages within the entire organization. Thus, data warehouse design is a hugely
complex, lengthy, and hence error-prone process. Furthermore, business analytical
functions change over time, which results in changes in the

requirements for the systems. Therefore, data warehouse and OLAP systems are
dynamic, and the design process is continuous. Data warehouse design takes a
different approach from view materialization as practised in industry: it sees
data warehouses as database systems with particular needs, such as answering
management-related queries. The target of the design becomes how the records
from multiple data sources should be extracted, transformed, and loaded (ETL)
to be organized in a database as the data warehouse.

There are two approaches


"top-down" approach
"bottom-up" approach

Top-down Design Approach


In the "Top-Down" design approach, a data warehouse is described as a
subject-oriented, time-variant, non-volatile and integrated data repository for the
entire enterprise. Data from different sources are validated, reformatted and saved
in a normalized (up to 3NF) database as the data warehouse. The data warehouse
stores "atomic" information, the data at the lowest level of granularity, from which
dimensional data marts can be built by selecting the data required for specific
business subjects or particular departments. This is a data-driven approach, as the
information is gathered and integrated first, and the business requirements by
subjects for building data marts are formulated afterwards. The advantage of this
method is that it provides a single integrated data source, so the data marts built
from it will be consistent when they overlap.

Bottom-Up Design Approach


In the "Bottom-Up" approach, a data warehouse is described as "a copy of
transaction data specifically structured for query and analysis," termed the star
schema. In this approach, a data mart is created first to provide the necessary
reporting and analytical capabilities for particular business processes (or subjects).
It is thus a business-driven approach, in contrast to Inmon's data-driven
approach. Data marts contain the lowest-grain data and, if needed, aggregated
data too. Instead of a normalized database for the data warehouse, a denormalized
dimensional database is adopted to meet the data delivery requirements of data
warehouses. To use the set of data marts as the enterprise data warehouse, data
marts should be built with conformed dimensions in mind, meaning that common
objects are represented the same way in different data marts. The conformed
dimensions connect the data marts to form a data warehouse, which is generally
called a virtual data warehouse. The advantage of the "bottom-up" design approach
is that it has quick ROI, as developing a data mart, a data warehouse for a single
subject, takes far less time and effort than developing an enterprise-wide data
warehouse. Also, the risk of failure is lower. This method is inherently
incremental, and it allows the project team to learn and grow.

Advantages of bottom-up design


Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units.
It is just a matter of developing new data marts and then integrating them with other data marts.

Differentiate between Top-Down Design Approach and Bottom-Up Design


Approach

Top-Down: Breaks the vast problem into smaller subproblems.
Bottom-Up: Solves the essential low-level problems and integrates them into a higher one.

Top-Down: Inherently architected; not a union of several data marts.
Bottom-Up: Inherently incremental; can schedule essential data marts first.

Top-Down: Single, central storage of information about the content.
Bottom-Up: Departmental information stored.

Top-Down: Centralized rules and control.
Bottom-Up: Departmental rules and control.

Top-Down: Includes redundant information.
Bottom-Up: Redundancy can be removed.

Top-Down: May see quick results if implemented with repetitions.
Bottom-Up: Less risk of failure, favorable return on investment, and proof of techniques.

Difference between ELT and ETL


 Extraction, Load and Transform (ELT):
Extraction, Load and Transform (ELT) is the technique of extracting
raw data from the source, storing it in the data warehouse of the target server,
and preparing it for downstream users.

ELT comprises three different operations performed on the data:


1. Extract:
Extracting data is the technique of identifying data from one or more
sources. The sources may be databases, files, ERP, CRM or any other useful
source of data.
2. Load:
Loading is the process of storing the extracted raw data in data warehouse or data
lakes.
3. Transform:
Data transformation is the process in which the raw data source
is transformed to the target format required for analysis.
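The ELT order of operations can be sketched with an embedded SQL engine: raw data is loaded first, and the transformation runs afterwards inside the target database. The staging-table layout and the SQL cleanup rules below are assumptions made for the illustration, with in-memory SQLite standing in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw records land in the warehouse untransformed,
# in a staging table that mirrors the source (all columns as text).
conn.execute("CREATE TABLE raw_sales (order_id TEXT, amount TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("1", " 100.50 ", "north"), ("2", "75.00", "SOUTH")],
)

# Transform: done afterwards *inside* the warehouse, using the target
# engine's SQL rather than an external staging area.
conn.execute("""
    CREATE TABLE sales AS
    SELECT CAST(order_id AS INTEGER)      AS order_id,
           CAST(TRIM(amount) AS REAL)     AS amount,
           LOWER(TRIM(region))            AS region
    FROM raw_sales
""")
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 175.5
```

Pushing the transformation into the warehouse like this is what lets ELT exploit the scalability of modern warehouse engines.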

 Extraction, Transform and Load (ETL):


ETL is the traditional technique of extracting raw data, transforming it for the users
as required and storing it in data warehouses. ELT was later developed, having ETL as its
base. The three operations happening in ETL and ELT are the same except that their order
of processing is slightly varied. This change in sequence was made to overcome some
drawbacks.
1. Extract:
It is the process of extracting raw data from all available data
sources such as databases,files, ERP, CRM or any other.
2. Transform:
The extracted data is immediately transformed as required by the user.
3. Load:
The transformed data is then loaded into the data warehouse, from where
the users can access it.

OLAP
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a category of
software technology which enables analysts, managers, and executives to gain
insight into information through fast, consistent, interactive access to a wide
variety of possible views of data that has been transformed from raw information
to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP implements the multidimensional analysis of business information and
supports the capability for complex estimations, trend analysis, and sophisticated
data modeling. It is rapidly becoming the essential foundation for intelligent
solutions, including Business Performance Management, Planning, Budgeting,
Forecasting, Financial Documenting, Analysis, Simulation Models, Knowledge
Discovery, and Data Warehouse Reporting.

The main characteristics of OLAP are as follows:


1. Multidimensional conceptual view: OLAP systems let business users have a
dimensional and logical view of the data in the data warehouse. It helps in
carrying out slice-and-dice operations.

2. Multi-User Support: Since OLAP systems are shared, they should provide
normal database operations, including retrieval, update, concurrency
control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-
end tools. The OLAP engine should sit between data sources (e.g., data
warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.

5. Uniform reporting performance: Increasing the number of dimensions
or the database size should not significantly degrade the reporting
performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so
that aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.

8. OLAP facilitates interactive query and complex analysis for the users.
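Points 6 and 7 can be made concrete with a toy aggregate: a correct average must skip missing values rather than treat them as zeros, because a genuine zero is a measurement while a missing value is not.

```python
# Monthly sales where None marks a missing measurement and 0.0 a
# genuine zero; conflating the two changes the aggregate.
sales = [120.0, 0.0, None, 80.0]

present = [v for v in sales if v is not None]
correct_avg = sum(present) / len(present)              # ignores missing values
naive_avg = sum(v or 0.0 for v in sales) / len(sales)  # wrongly counts None as 0

print(round(correct_avg, 2))  # 66.67
print(naive_avg)              # 50.0
```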

Benefits of OLAP
1. OLAP helps managers in decision-making through the multidimensional
record views that it is efficient in providing, thus increasing their
productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility of
support they provide to the organized databases.
3. It facilitates simulation of business models and problems through
extensive management of analysis capabilities.
4. In conjunction with a data warehouse, OLAP can be used to support a
reduction in the application backlog, faster data retrieval, and reduction in
query drag.

Difference between OLTP and OLAP
OLTP (On-Line Transaction Processing) is characterized by a large number of
short on-line transactions (INSERT, UPDATE, and DELETE). The primary emphasis
of OLTP systems is on very rapid query processing, maintaining record
integrity in multi-access environments, and effectiveness measured by the number
of transactions per second. An OLTP database contains accurate and current
records, and the schema used to store the transactional database is the entity
model (usually 3NF).

OLAP (On-Line Analytical Processing) is characterized by a relatively low

volume of transactions. Queries are often very complex and involve aggregations.
For OLAP systems, response time is an effectiveness measure. OLAP applications
are widely used with Data Mining techniques. An OLAP database contains
aggregated, historical information, stored in multi-dimensional schemas
(generally the star schema).

Following are the difference between OLAP and OLTP system.


1) Users: OLTP systems are designed for office workers while OLAP systems
are designed for decision-makers. Therefore, while an OLTP system may be
accessed by hundreds or even thousands of clients in a huge enterprise, an
OLAP system is likely to be accessed only by a select class of managers and
may be used only by dozens of users.

2) Functions: OLTP systems are mission-critical. They support the day-to-day

operations of an enterprise and are largely performance- and availability-driven.
These operations carry out simple, repetitive transactions. OLAP systems are
management-critical; they support enterprise decision-support tasks through
detailed investigation.

3) Nature: Although SQL queries return a set of data, OLTP systems are designed
to process one record at a time, for example, a record related to a customer who
may be on the phone or in the store. An OLAP system is not designed to deal with
individual customer records; instead, it handles queries that deal with many
records at a time and provides summary or aggregate information to a manager.
OLAP applications involve data stored in a data warehouse that has been
extracted from many tables and possibly from more than one enterprise
database.
4) Design: OLTP database operations are designed to be application-
oriented while OLAP operations are designed to be subject-oriented. OLTP
systems view the enterprise record as a collection of tables (possibly based
on an entity-relationship model), whereas OLAP operations view enterprise
information as multidimensional.
5) Data: OLTP systems usually deal only with the current status of data. For
example, a record about an employee who left three years ago may not be
feasible on the Human Resources System. The old data may have been
archived on some type of stable storage media and may not be accessible
online. On the other hand, OLAP systems needed historical data over several
years since trends are often essential in decision making.

6) Kind of use: OLTP systems are used for read and write operations while
OLAP systems usually do not update the data.

7) View: An OLTP system focuses primarily on the current data within an


enterprise or department, and does not refer to historical data or data in
various organizations. In contrast, an OLAP system spans multiple versions of a
database schema, due to the evolutionary process of an organization. OLAP
system also deals with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume,
these are stored on multiple storage media.

8) Access Patterns: The access pattern of an OLTP system consists primarily of

short, atomic transactions. Such a system requires concurrency control and
recovery techniques. However, access to OLAP systems is mostly read-only
operations because these data warehouses store historical information.

OLAP Operations in the Multidimensional Data Model


In the multidimensional model, the records are organized into various
dimensions, and each dimension includes multiple levels of abstraction described
by concept hierarchies. This organization provides users with the flexibility to view
data from various perspectives. A number of OLAP data cube operations exist to
materialize these different views, allowing interactive querying and searching of
the records at hand. Hence, OLAP supports a user-friendly environment for
interactive data analysis. Consider the OLAP operations which are to be performed
on multidimensional data. The figure shows a data cube for the sales of a shop.
The cube contains the dimensions location, time, and item, where location is
aggregated with respect to city values, time is aggregated with respect to quarters,
and item is aggregated with respect to item types.
Roll-Up
The roll-up operation (also known as drill-up or aggregation operation)
performs aggregation on a data cube, either by climbing up a concept hierarchy
or by dimension reduction. Roll-up is like zooming out on the data cube. The
figure shows the result of a roll-up operation performed on the dimension
location. The hierarchy for location is defined as the order street < city <
province or state < country. The roll-up operation aggregates the data by
ascending the location hierarchy from the level of the city to the level of the
country. When roll-up is performed by dimension reduction, one or more
dimensions are removed from the cube. For example, consider a sales data cube
having two dimensions, location and time. Roll-up may be performed by removing
the time dimension, resulting in an aggregation of the total sales by location,
rather than by location and by time.
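A roll-up along the location hierarchy can be sketched as a re-keyed aggregation; the city-to-country mapping and the sales figures below are made up for the illustration.

```python
# Sales recorded at city level; rolling up aggregates them one level
# higher in the location hierarchy (city -> country).
city_to_country = {"chennai": "india", "mumbai": "india", "toronto": "canada"}
sales_by_city = {"chennai": 400, "mumbai": 300, "toronto": 150}

def roll_up(measures, hierarchy):
    rolled = {}
    for city, amount in measures.items():
        parent = hierarchy[city]          # climb one level up the hierarchy
        rolled[parent] = rolled.get(parent, 0) + amount
    return rolled

print(roll_up(sales_by_city, city_to_country))  # {'india': 700, 'canada': 150}
```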

Drill-Down
The drill-down operation (also called roll-down) is the reverse of
roll-up. Drill-down is like zooming in on the data cube. It navigates from less
detailed data to more detailed data. Drill-down can be performed either by
stepping down a concept hierarchy for a dimension or by adding additional
dimensions. The figure shows a drill-down operation performed on the dimension
time by stepping down a concept hierarchy defined as day < month < quarter <
year. Drill-down occurs by descending the time hierarchy from the level of the
quarter to the more detailed level of the month. Because a drill-down adds more
detail to the given data, it can also be performed by adding a new dimension to a
cube. For example, a drill-down on the central cube of the figure can occur by
introducing an additional dimension, such as customer group.

Slice
A slice is a subset of the cube corresponding to a single value for one or
more members of a dimension. For example, a slice operation is executed when
the customer wants a selection on one dimension of a three-dimensional cube,
resulting in a two-dimensional slice. So, the slice operation performs a selection
on one dimension of the given cube, thus resulting in a subcube.

Dice
The dice operation defines a subcube by performing a selection on two or more
dimensions.
For example, Implement the selection (time = day 3 OR time = day 4) AND
(temperature = cool OR temperature = hot) to the original cubes we get the
following subcube (still two- dimensional)

Temperature cool hot

Day 3 0 1

Day 4 0 0
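Slice and dice can be sketched over a toy cube stored as a Python dictionary keyed by (day, temperature); the dice below reproduces the subcube shown in the table above, and the extra cells are made up for the illustration.

```python
# A tiny two-dimensional cube keyed by (day, temperature),
# holding a count measure in each cell.
cube = {
    ("day 1", "cool"): 2, ("day 1", "hot"): 0,
    ("day 2", "mild"): 1,
    ("day 3", "cool"): 0, ("day 3", "hot"): 1,
    ("day 4", "cool"): 0, ("day 4", "hot"): 0,
}

# Slice: fix a single value on one dimension,
# yielding a lower-dimensional view.
def slice_cube(cube, day):
    return {temp: v for (d, temp), v in cube.items() if d == day}

# Dice: select on two or more dimensions, yielding a subcube.
def dice(cube, days, temps):
    return {k: v for k, v in cube.items() if k[0] in days and k[1] in temps}

print(slice_cube(cube, "day 3"))  # {'cool': 0, 'hot': 1}
sub = dice(cube, {"day 3", "day 4"}, {"cool", "hot"})
print(sub[("day 4", "hot")])      # 0
```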

Pivot
The pivot operation is also called rotation. Pivot is a visualization
operation which rotates the data axes in view to provide an alternative
presentation of the data. It may involve swapping the rows and columns, or
moving one of the row dimensions into the column dimensions.

Consider the following diagram, which shows the pivot operation.

Types of OLAP
There are three main types of OLAP servers are as following:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on
multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.

Relational OLAP (ROLAP) Server :


These are intermediate servers which stand between a relational back-end
server and user front-end tools. They use a relational or extended-relational DBMS
to store and manage warehouse data, and OLAP middleware to provide missing
pieces. ROLAP servers contain optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This
model permits the multidimensional analysis of data. The technique relies on
manipulating the data stored in the relational database to give the appearance of
traditional OLAP's slicing and dicing functionality. In essence, each act of
slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
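That "WHERE"-clause equivalence can be demonstrated against a small relational fact table; the table and column names here are hypothetical, with in-memory SQLite standing in for the relational back end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "Q1", 100.0), ("north", "Q2", 150.0),
    ("south", "Q1", 80.0),  ("south", "Q2", 60.0),
])

# Slicing on the region dimension is just a WHERE predicate over the
# fact table; dicing adds predicates on further dimensions.
slice_sql = "SELECT SUM(amount) FROM sales WHERE region = ?"
dice_sql = "SELECT SUM(amount) FROM sales WHERE region = ? AND quarter = ?"

print(conn.execute(slice_sql, ("north",)).fetchone()[0])      # 250.0
print(conn.execute(dice_sql, ("south", "Q1")).fetchone()[0])  # 80.0
```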

Relational OLAP Architecture


ROLAP Architecture includes the following components:
Database server.
ROLAP server.
Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology

segment in the market. This method allows multiple multidimensional views of
two-dimensional relational tables to be created, avoiding the need to structure
records around the desired view. Some products in this segment have supported
strong SQL engines to handle the complexity of multidimensional analysis. This
includes creating multiple SQL statements to handle user requests, being 'RDBMS'
aware, and also being capable of generating SQL statements based on the
optimizer of the DBMS engine.

Advantages
Can handle large amounts of information: The data size limitation of ROLAP
technology depends on the data size of the underlying RDBMS. So, ROLAP
itself does not restrict the data amount.
Can leverage functionalities inherent in the relational database: The RDBMS
already comes with a lot of features, so ROLAP technologies (which work on
top of the RDBMS) can leverage these functionalities.
Disadvantages
Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL
queries) in the relational database, the query time can be prolonged if the
underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies on upon
developing SQL statements to query the relational database, and SQL
statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server :


A MOLAP system is based on a native logical model that directly supports
multidimensional data and operations. Data are stored physically in
multidimensional arrays, and positional techniques are used to access them. One
of the significant distinctions of MOLAP against ROLAP is that the data are
summarized and stored in an optimized format in a multidimensional cube,
instead of in a relational database. In the MOLAP model, data are structured into
proprietary formats according to the client's reporting requirements, with the
calculations pre-generated on the cubes.
MOLAP Architecture
MOLAP Architecture includes the following components
Database server.
MOLAP server.
Front-end tool.

MOLAP structure primarily reads the precompiled data. MOLAP structure has
limited capabilities to dynamically create aggregations or to evaluate results which
have not been pre-calculated and stored. Applications requiring iterative and
comprehensive time-series analysis of trends are well suited for MOLAP technology
(e.g., financial analysis and budgeting).

Advantages
Excellent performance: A MOLAP cube is built for fast information
retrieval, and is optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only possible,
but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations
are performed when the cube is built, it is not possible to contain a large
amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary
and does not already exist in the organization. Therefore, to adopt MOLAP
technology, chances are other investments in human and capital resources
are needed.

Hybrid OLAP (HOLAP) Server :


HOLAP incorporates the best features of MOLAP and ROLAP into a
single architecture. HOLAP systems store larger quantities of detailed
data in the relational tables while the aggregations are stored in pre-calculated
cubes. HOLAP can also drill through from the cube down to the relational tables
for detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.

Advantages of HOLAP
HOLAP provides the benefits of both MOLAP and ROLAP.
It provides fast access at all levels of aggregation.
HOLAP balances the disk space requirement, as it only stores the aggregate
information on the OLAP server and the detail record remains in the relational
database. So no duplicate copy of the detail record is maintained.

Disadvantages of HOLAP
HOLAP architecture is very complicated because it supports both MOLAP
and ROLAP servers.

Other Types :
There are also less popular types of OLAP, which one may stumble upon
every so often. We have listed some of the less common variants existing
in the OLAP industry.

Web-Enabled OLAP (WOLAP) Server


WOLAP pertains to OLAP application which is accessible via the web
browser. Unlike traditional client/server OLAP applications, WOLAP is considered
to have a three-tiered architecture which consists of three components: a client, a
middleware, and a database server.

Desktop OLAP (DOLAP) Server


DOLAP permits a user to download a section of the data from the
database or source, and work with that dataset locally on their desktop.
Mobile OLAP (MOLAP) Server
Mobile OLAP enables users to access and work on OLAP data and
applications remotely through the use of their mobile devices.

Difference between ROLAP, MOLAP, and HOLAP

ROLAP stands for Relational Online Analytical Processing, MOLAP stands for
Multidimensional Online Analytical Processing, and HOLAP stands for Hybrid
Online Analytical Processing.

Storage mode:
ROLAP: The ROLAP storage mode causes the aggregations of the division to be
stored in indexed views in the relational database that was specified in the
partition's data source.
MOLAP: The MOLAP storage mode causes the aggregations of the division and a
copy of its source information to be saved in a multidimensional structure in
Analysis Services when the partition is processed.
HOLAP: The HOLAP storage mode combines attributes of both MOLAP and
ROLAP. Like MOLAP, HOLAP causes the aggregations of the division to be stored
in a multidimensional structure in an SQL Server Analysis Services instance.

Copy of source data:
ROLAP: ROLAP does not cause a copy of the source information to be stored in
the Analysis Services data folders. Instead, when the outcome cannot be derived
from the query cache, the indexed views in the record source are accessed to
answer queries.
MOLAP: The MOLAP structure is highly optimized to maximize query
performance. The storage area can be on the computer where the partition is
defined or on another computer running Analysis Services. Because a copy of the
source information resides in the multidimensional structure, queries can be
resolved without accessing the partition's source record.
HOLAP: HOLAP does not cause a copy of the source information to be stored. For
queries that access only the summary records in the aggregations of a division,
HOLAP is the equivalent of MOLAP.

Query performance:
ROLAP: Query response is frequently slower with ROLAP storage than with the
MOLAP or HOLAP storage modes. Processing time is also frequently slower with
ROLAP.
MOLAP: Query response times can be reduced substantially by using
aggregations. The record in the partition's MOLAP structure is only as current as
the most recent processing of the partition.
HOLAP: Queries that access source records, for example a drill down to an atomic
cube cell for which there is no aggregation information, must retrieve data from
the relational database and will not be as fast as they would be if the source
information were stored in the MOLAP structure.

Difference between ROLAP and MOLAP

ROLAP stands for Relational Online Analytical Processing; MOLAP stands for Multidimensional Online Analytical Processing.
ROLAP is usually used when the data warehouse contains relational data; MOLAP is used when it contains relational as well as non-relational data.
ROLAP contains an analytical server; MOLAP contains the MDDB server.
ROLAP creates a multidimensional view of data dynamically; MOLAP contains prefabricated data cubes.
ROLAP is very easy to implement; MOLAP is difficult to implement.
ROLAP has a high response time; MOLAP has a lower response time due to its prefabricated cubes.
ROLAP requires less memory; MOLAP requires a large amount of memory.


UNIT III META DATA, DATA MART AND PARTITION STRATEGY


Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository –
Challenges for Meta Management - Data Mart – Need of Data Mart- Cost Effective
Data Mart-Designing Data Marts- Cost of Data Marts- Partitioning Strategy –
Vertical partition –Normalization – Row Splitting – Horizontal Partition

What is Meta Data?


Metadata is data about data: documentation about the information that is required by the users. In data warehousing, metadata is one of the essential aspects.
Metadata includes the following:
The location and descriptions of warehouse systems and components.
Names, definitions, structures, and content of the data warehouse and end-user views.
Identification of authoritative data sources.
Integration and transformation rules used to populate data.
Integration and transformation rules used to deliver information to end-user analytical tools.
Subscription information for information delivery to analysis subscribers.
Metrics used to analyze warehouses usage and performance.
Security authorizations, access control list, etc.

Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata allows users to understand the content and to find the data they need.
Several examples of metadata are:
A library catalog may be considered metadata. The directory metadata
consists of several predefined components representing specific
attributes of a resource, and each item can have one or more values.
These components could be the name of the author, the name of the
document, the publisher's name, the publication date, and the methods to
which it belongs.
The table of contents and the index in a book may be treated as the book's metadata.

Suppose we say that a data item about a person is 80. This must be defined by noting that it is the person's weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the data value 80.
Other examples of metadata are the data about the tables and figures in a report like this book. A table has a name (e.g., a table title), and the column names of the table may be treated as metadata. The figures also have titles or names.
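The weight example can be sketched in code: the bare value 80 only becomes meaningful once paired with its metadata. The structure below is purely illustrative.

```python
# The (weight, kilograms) example: the bare value 80 is ambiguous until
# descriptive metadata gives it an attribute name and a unit.
data_item = {
    "value": 80,
    "metadata": {"attribute": "weight", "unit": "kilograms"},
}

def describe(item):
    # Combine the raw value with its metadata into a readable statement.
    m = item["metadata"]
    return f'{item["value"]} is the {m["attribute"]} in {m["unit"]}'

print(describe(data_item))  # 80 is the weight in kilograms
```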

Types of Metadata
Metadata in a data warehouse fall into three major parts:
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata

Operational Metadata
As we know, data for the data warehouse comes from various operational
systems of the enterprise. These source systems include different data structures.
The data elements selected for the data warehouse have various field lengths and data types. In selecting information from the source systems for the data warehouse, we split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When we deliver information to the end-users, we must be able to tie that back to the source data sets. Operational metadata contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the removal of
data from the source systems, namely, the extraction frequencies, extraction
methods, and business rules for the data extraction. Also, this category of
metadata contains information about all the data transformation that takes place in
the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It
enables the end- users to find data from the data warehouses. The end-user
metadata allows the end-users to use their business terminology and look for the
information in those ways in which they usually think of the business.

Metadata Repository
The metadata itself is housed in and controlled by the metadata repository.
The software of metadata repository management can be used to map the source
data to the target database, integrate and transform the data, generate code for
data transformation, and to move data to the warehouse.
Benefits of Metadata Repository
It provides a set of tools for enterprise-wide metadata management.
It eliminates and reduces inconsistency, redundancy, and underutilization.
It improves organization control, simplifies management, and accounting of information assets.
It increases coordination, understanding, identification, and utilization
of information assets.
It enforces CASE development standards with the ability to share and reuse
metadata.
It leverages investment in legacy systems and utilizes existing applications.
It provides a relational model for heterogeneous RDBMS to share information.
It gives a useful data administration tool to manage corporate information assets with the data dictionary.
It increases reliability, control, and flexibility of the application development
process.

What is Data Mart?


A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data subject, which may be distributed to support business needs. Data marts are analytical data stores designed to focus on particular business functions for a specific community within an organization. Data marts are derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created from the union of organizational data marts. The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather, store, access, and analyze data. A data mart can be used by smaller businesses to utilize the data they have accumulated, since it is less expensive than implementing a data warehouse.

Why do we need Data Mart?

Data Mart helps to enhance the user's response time due to the reduction in the volume of data.
It provides easy access to frequently requested data.
Data marts are simpler to implement when compared to a corporate data warehouse. At the same time, the cost of implementing a data mart is certainly lower than that of implementing a full data warehouse.
Compared to a data warehouse, a data mart is agile. In case of a change in the model, a data mart can be rebuilt more quickly due to its smaller size.
A data mart is defined by a single Subject Matter Expert (SME). On the contrary, a data warehouse is defined by interdisciplinary SMEs from a variety of domains. Hence, a data mart is more open to change compared to a data warehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.

Steps in Implementing a Datamart

Implementing a Data Mart is a rewarding but complex procedure. Here are


the detailed steps to implement a Data Mart:

Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks from initiating the request for a data mart to gathering information about the requirements. Finally, we create the logical and physical Data Mart design.

The design step involves the following tasks:


Gathering the business & technical requirements and Identifying data sources.
Selecting the appropriate subset of data.
Designing the logical and physical structure of the data mart.

Data could be partitioned based on following criteria:


Date
Business or Functional Unit
Geography
Any combination of above

Data could be partitioned at the application or DBMS level, though it is recommended to partition at the application level, as it allows different data models each year with the change in the business environment.

Constructing
This is the second phase of implementation. It involves creating the physical database
and the logical structures.

This step involves the following tasks:


Implementing the physical database designed in the earlier phase. For
instance, database schema objects like table, indexes, views, etc. are
created.
Storage management: An RDBMS stores and manages the data to create, add, and delete data.
Fast data access: With a SQL query you can easily access data based on certain conditions/filters.
Data protection: The RDBMS system also offers a way to recover from system failures such as power failures. It also allows restoring data from backups in case the disk fails.
Multiuser support: The data management system offers concurrent access, the ability for multiple users to access and modify data without interfering with or overwriting changes made by another user.
Security: The RDBMS system also provides a way to regulate access by users to objects and certain types of operations.

Populating
In the third phase, data is populated in the data mart.
The populating step involves the following tasks:
Source data to target data Mapping
Extraction of source data
Cleaning and transformation operations on the data
Loading data into the data mart
Creating and storing metadata
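The populating tasks above can be sketched end to end in a minimal, hypothetical pipeline: extract rows from a source table, clean and transform them, load them into the mart, and record metadata about the load itself. All table and column names are assumptions for illustration.

```python
import sqlite3
import datetime

# Hypothetical operational source with messy, untyped values.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount TEXT, region TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, " 100 ", "north"), (2, "250", "SOUTH")])

# Hypothetical target data mart plus a table for load metadata.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE sales_mart (id INTEGER, amount REAL, region TEXT)")
mart.execute("CREATE TABLE load_metadata "
             "(loaded_at TEXT, source TEXT, row_count INTEGER)")

# Extraction of source data.
rows = src.execute("SELECT id, amount, region FROM orders").fetchall()

# Cleaning and transformation: trim strings, cast types, standardize codes.
clean = [(i, float(a.strip()), r.strip().upper()) for i, a, r in rows]

# Loading data into the data mart.
mart.executemany("INSERT INTO sales_mart VALUES (?, ?, ?)", clean)

# Creating and storing metadata about the load (source and row count).
mart.execute("INSERT INTO load_metadata VALUES (?, ?, ?)",
             (datetime.datetime.now().isoformat(), "orders", len(clean)))

print(mart.execute("SELECT COUNT(*) FROM sales_mart").fetchone()[0])  # 2
```

A real ETL tool performs the same mapping, cleansing, loading, and metadata-capture steps at much larger scale.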

What Products and Technologies Do You Need?


You accomplish these population tasks using an ETL (Extract
Transform Load) Tool. This tool allows you to look at the data sources, perform
source-to-target mapping, extract the data, transform, cleanse it, and load it
back into the data mart. In the process, the tool also creates some metadata
relating to things like where the data came from, how recent it is, what type of
changes were made to the data, and what level of summarization was done.
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data, creating reports and charts, and publishing them. End-users submit queries to the database and display the results of the queries.
The accessing step needs to perform the following tasks:
Set up a meta layer that translates database structures and object names into business terms. This helps non-technical users to access the data mart easily.
Set up and maintain database structures.
Set up API and interfaces if required

What Products and Technologies Do You Need?


You can access the data mart using the command line or a GUI. A GUI is preferred as it can easily generate graphs and is user-friendly compared to the command line.

Managing
This is the last step of Data Mart Implementation process. This step covers
management tasks such as

Ongoing user access management.


System optimization and fine-tuning to achieve enhanced performance.
Adding and managing fresh data in the data mart.
Planning recovery scenarios and ensuring system availability in case the system fails.

Reasons for creating a data mart


Creates collective data by a group of users
Easy access to frequently needed data
Ease of creation
Improves end-user response time
Lower cost than implementing a complete data warehouses
Potential clients are more clearly defined than in a comprehensive data
warehouse
It contains only essential business data and is less cluttered.


Data Warehousing - Partitioning Strategy


Partitioning is done to enhance performance and facilitate easy management
of data. Partitioning also helps in balancing the various requirements of the system.
It optimizes the hardware performance and simplifies the management of data
warehouse by partitioning each fact table into multiple separate partitions. In this
chapter, we will discuss different partitioning strategies.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal
partitioning, we have to keep in mind the requirements for manageability of the
data warehouse.
Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time
period. Here each time period represents a significant retention period within the
business. For example, if the user queries for month to date data then it is
appropriate to partition the data into monthly segments. We can reuse the
partitioned tables by removing the data in them.
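The idea can be sketched as follows, with a few hypothetical fact rows partitioned into monthly segments keyed by year and month:

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (transaction date, value).
facts = [
    (date(2023, 1, 5), 100), (date(2023, 1, 20), 40),
    (date(2023, 2, 3), 75),  (date(2023, 3, 11), 60),
]

# Partition by time into equal (monthly) segments; each partition holds
# one month and can later be emptied and reused as the retention window
# rolls forward.
partitions = defaultdict(list)
for txn_date, value in facts:
    partitions[txn_date.strftime("%Y_%m")].append((txn_date, value))

print(sorted(partitions))          # ['2023_01', '2023_02', '2023_03']
print(len(partitions["2023_01"]))  # 2
```

A month-to-date query then touches only the current month's partition instead of the whole fact table.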
Partition by Time into Different-sized Segments
This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.

This technique is not useful where the partitioning profile changes on a


regular basis, because repartitioning will increase the operation cost of
data warehouse.
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces
the operating cost.
This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.

Partition on a Different Dimension


The fact table can also be partitioned on the basis of dimensions other
than time such as product group, region, supplier, or any other dimension.
Let's have an example. Suppose a market function has been structured into
distinct regional departments like on a state by state basis. If each region
wants to query on information captured within its region, it would prove to be
more effective to partition the fact table into regional partitions. This will cause the queries to speed up because they do not need to scan information that is not relevant.
Partition by Size of Table
When there are no clear basis for partitioning the fact table on any
dimension, then we should partition the fact table on the basis of their size. We
can set the predetermined size as a critical point. When the table exceeds the
predetermined size, a new table partition is created.
Points to Note
This partitioning is complex to manage.
It requires metadata to identify what data is stored in each partition.
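A minimal sketch of this strategy, assuming a predetermined critical size of three rows per partition (the limit and names are illustrative); note how a metadata list is needed alongside the partitions to record what is stored where:

```python
# Predetermined critical point: when the current partition reaches this
# size, a new table partition is created (value assumed for illustration).
MAX_ROWS = 3

partitions = [[]]                                # list of table partitions
metadata = [{"partition": 0, "rows": 0}]         # what is in each partition

def insert(row):
    if len(partitions[-1]) >= MAX_ROWS:          # table exceeds the limit
        partitions.append([])                    # create a new partition
        metadata.append({"partition": len(partitions) - 1, "rows": 0})
    partitions[-1].append(row)
    metadata[-1]["rows"] += 1

for i in range(7):
    insert({"txn_id": i})

print(len(partitions))                 # 3
print([m["rows"] for m in metadata])   # [3, 3, 1]
```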

Partitioning Dimensions
If a dimension contains large number of entries, then it is required
to partition the dimensions. Here we have to check the size of a
dimension. Consider a large design that changes over time. If we need to
store all the variations in order to apply comparisons, that dimension
may be very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition. This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically, that is, by columns rather than rows.

Vertical partitioning can be performed in the following two ways


Normalization
Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, duplicate rows are collapsed into a single row, which reduces the space required.
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Note: While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.
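Row splitting can be sketched as follows. The table, columns, and hot/cold split below are hypothetical; the point is that the one-to-one map lets a full row always be rebuilt by joining the two partitions on the shared key.

```python
# Hypothetical wide account rows to be row-split vertically.
accounts = [
    {"account_id": 1, "balance": 500.0, "address": "12 High St", "notes": "vip"},
    {"account_id": 2, "balance": 75.0,  "address": "9 Low Rd",   "notes": ""},
]

# Split each row one-to-one into a frequently accessed (hot) partition
# and a rarely accessed (cold) partition sharing the same key, which
# shrinks the hot table and speeds up access to it.
hot  = [{"account_id": r["account_id"], "balance": r["balance"]}
        for r in accounts]
cold = [{"account_id": r["account_id"], "address": r["address"],
         "notes": r["notes"]} for r in accounts]

# The one-to-one map means a full row can be rebuilt by joining on
# account_id (here simply merging the matching dictionaries).
rebuilt = {**hot[0], **cold[0]}
print(rebuilt == accounts[0])  # True
```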

Identify Key to Partition


It is very crucial to choose the right partition key. Choosing a wrong
partition key will lead to reorganizing the fact table. Let's have an example.
Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name

We can choose to partition on any key. The two possible keys could be
region
transaction_date

Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business
region. If we partition by transaction_date instead of region, then the latest
transaction from every region will be in one partition. Now the user who wants to
look at data within his own region has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
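The effect of the two candidate keys can be sketched with a few hypothetical transactions: routing by region keeps each user's query inside one partition, while routing by transaction_date spreads every region's rows across the date partitions.

```python
from datetime import date

# Hypothetical rows from Account_Txn_Table.
txns = [
    {"region": "R1", "transaction_date": date(2023, 6, 1), "value": 10},
    {"region": "R1", "transaction_date": date(2023, 6, 2), "value": 20},
    {"region": "R2", "transaction_date": date(2023, 6, 1), "value": 30},
]

# Partition by region: a query restricted to the user's own region
# touches exactly one partition.
by_region = {}
for t in txns:
    by_region.setdefault(t["region"], []).append(t)
r1 = by_region["R1"]
print(len(r1))  # 2

# Partition by transaction_date: the same R1 query now has to scan
# every date partition that contains R1 rows.
by_date = {}
for t in txns:
    by_date.setdefault(t["transaction_date"], []).append(t)
touched = [d for d, rows in by_date.items()
           if any(x["region"] == "R1" for x in rows)]
print(len(touched))  # 2
```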
Types of Metadata:
There are many types of metadata that can be used to describe different
aspects of data, such as its content, format, structure, and provenance. Some
common types of metadata include:
Descriptive metadata: This type of metadata provides information about
the content, structure, and format of data, and may include elements such
as title, author, subject, and keywords. Descriptive metadata helps to
identify and describe the content of data and can be used to improve the
discoverability of data through search engines and other tools.
Administrative metadata: This type of metadata provides information
about the management and technical characteristics of data, and may include
elements such as file format, size, and creation date. Administrative
metadata helps to manage and maintain data over time and can be used to
support data governance and preservation.
Structural metadata: This type of metadata provides information about
the relationships and organization of data, and may include elements such as
links, tables of contents, and indices. Structural metadata helps to organize
and connect data and can be used to facilitate the navigation and discovery
of data.
Provenance metadata: This type of metadata provides information
about the history and origin of data, and may include elements such as the
creator, date of creation, and sources of data. Provenance metadata helps to


provide context and credibility to data and can be used to support data
governance and preservation.
Rights metadata: This type of metadata provides information about the
ownership, licensing, and access controls of data, and may include elements
such as copyright, permissions, and terms of use. Rights metadata helps to
manage and protect the intellectual property rights of data and can be used
to support data governance and compliance.
Educational metadata:
This type of metadata provides information about the educational
value and learning objectives of data, and may include elements such as learning
outcomes, educational levels, and competencies. Educational metadata can be
used to support the discovery and use of educational resources, and to support
the design and evaluation of learning environments.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be organized using metadata standards and schemas. Many metadata standards have been developed to facilitate the creation and management of metadata, such as Dublin Core, schema.org, and the Metadata Encoding and Transmission Standard (METS). Metadata schemas define the structure and format of the metadata elements.
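As an illustration, a small descriptive record can be serialized as XML using Dublin Core element names (title, creator, date). The record values and the enclosing wrapper element are hypothetical, and namespace declarations are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# A hypothetical descriptive metadata record using Dublin Core element
# names for its fields.
record = {"title": "Sales Data Mart", "creator": "BI Team",
          "date": "2023-01-01"}

# Serialize the record as a simple XML document.
root = ET.Element("metadata")
for element, value in record.items():
    ET.SubElement(root, element).text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A full Dublin Core document would additionally declare the `dc:` namespace and prefix each element with it.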

Metadata Repository
A metadata repository is a database or other storage mechanism that is
used to store metadata about data. A metadata repository can be used to
manage, organize, and maintain metadata in a consistent and structured
manner, and can facilitate the discovery, access, and use of data.
A metadata repository may contain metadata about a variety of types of
data, such as documents, images, audio and video files, and other types of
digital content. The metadata in a metadata repository may include
information about the content, format, structure, and other characteristics
of data, and may be organized using metadata standards and schemas.
There are many types of metadata repositories, ranging from simple file
systems or spreadsheets to complex database systems. The choice of
metadata repository will depend on the needs and requirements of the
organization, as well as the size and complexity of the data that is being
managed.
Metadata repositories can be used in a variety of contexts, such as libraries,
museums, archives, and online platforms. They can be used to improve the
discoverability and ranking of content in search engines, and to provide
context and additional information about search results. Metadata
repositories can also support data governance by providing information
about the ownership, use, and access controls of data, and can facilitate
interoperability by providing information about the content, format, and
structure of data, and by enabling the exchange of data between different
systems and applications. Metadata repositories can also support data
preservation by providing information about the context, provenance, and
preservation needs of data, and can support data visualization by providing
information about the data’s structure and content, and by enabling the
creation of interactive and customizable visualizations.
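A minimal sketch of such a repository, using an in-memory SQLite database; the asset names, columns, and values are illustrative only.

```python
import sqlite3

# A tiny metadata repository: a database that stores descriptive and
# administrative metadata about data assets so they can be discovered
# and managed consistently.
repo = sqlite3.connect(":memory:")
repo.execute("CREATE TABLE assets "
             "(name TEXT, format TEXT, owner TEXT, description TEXT)")
repo.executemany("INSERT INTO assets VALUES (?, ?, ?, ?)", [
    ("sales_mart", "table", "BI Team", "monthly sales aggregates"),
    ("customer_feed", "csv", "Ops", "daily customer extract"),
])

# Discovery: find every asset whose description mentions 'sales'.
hits = [row[0] for row in repo.execute(
    "SELECT name FROM assets WHERE description LIKE '%sales%'")]
print(hits)  # ['sales_mart']
```

A production repository adds the governance features described above (access controls, lineage, standards enforcement) on top of this same store-and-query core.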
Benefits of Metadata Repository
A metadata repository is a centralized database or system that is used to
store and manage metadata. Some of the benefits of using a metadata repository
include:
Improved data quality: A metadata repository can help ensure that
metadata is consistently structured and accurate, which can improve the
overall quality of the data.
Increased data accessibility: A metadata repository can make it easier
for users to access and understand the data, by providing context and
information about the data.
Enhanced data integration: A metadata repository can facilitate data
integration by providing a common place to store and manage metadata
from multiple sources.
Improved data governance: A metadata repository can help enforce
metadata standards and policies, making it easier to ensure that data is
being used and managed appropriately.
Enhanced data security: A metadata repository can help protect the
privacy and security of metadata, by providing controls to restrict access to
sensitive or confidential information.
Metadata repositories can provide many benefits in terms of improving the
quality, accessibility, and management of data.
Challenges for Metadata Management
There are several challenges that can arise when managing metadata:
Lack of standardization: Different organizations or systems may use
different standards or conventions for metadata, which can make it
difficult to effectively manage metadata across different sources.
Data quality: Poorly structured or incorrect metadata can lead to problems
with data quality, making it more difficult to use and understand the data.
Data integration: When integrating data from multiple sources, it can be
challenging to ensure that the metadata is consistent and aligned across
the different sources.

Data governance: Establishing and enforcing metadata standards and


policies can be difficult, especially in large organizations with multiple
stakeholders.
Data security: Ensuring the security and privacy of metadata can be a
challenge, especially when working with sensitive or confidential
information.
Modern Data Warehouse Pyramid
There are five different components of a Modern Data Warehouse.
Level 1: Data Acquisition
Data acquisition can come from a variety of sources such as:

IoT devices
Social media posts
YouTube videos
Website content
Customer data
Enterprise Resource Planning
Legacy data stores

Level 2: Data Engineering
Once you have acquired the data, you need to upload it into the data warehouse. Data engineering uses pipelines and ETL (extract, transform, load) tools; using these tools, you can upload that information to the data warehouse. Continuing the factory analogy, data engineering is similar to a truck bringing raw materials into a factory.

Level 3: Data Management Governance


Once the data comes into the factory, you need someone to evaluate the quality of the data. You then need to steward that data, because security and privacy must be considered. Data governance helps ensure the quality of the information by stewarding, prepping, and cleaning the data to ensure it is ready for analysis.

Level 4: Reporting and Business Intelligence


Once you prep and clean the data, you can start the factory's analysis step: taking that raw material (data) and turning it into a finished good (business intelligence). For our purposes, we will use Microsoft Power BI to help you visualize the information by using advanced analytics, KPIs, and workflow automation. When you are finished, you can see exactly what's going on with your data.
Level 5: Data Science
Modern Data Warehouse is about more than seeing the information; it’s
about using the data to make smarter decisions. That’s one of the key concepts
you should walk away with here today. There are several different programs to
help you leverage the data to your benefit, including:

AI
Deep learning
Machine learning
Statistical modeling
Natural language processing (NLP)
Keep in mind that all the algorithms above need data to work successfully. The more data you provide, the smarter your decisions and the smarter your results. If you want to understand your reports, it is essential that you leverage AI to get better answers, which leads us back to the Modern Data Warehouse. Again, it is about more than gathering and storing data; it is about making smart decisions.
