Data Warehouse - CCS341 Material
Unit-1
Data warehouse Introduction – Data warehouse components- operational database Vs data
warehouse – Data warehouse Architecture – Three-tier Data Warehouse Architecture –
Autonomous Data Warehouse- Autonomous Data Warehouse Vs Snowflake – Modern Data
Warehouse
Introduction
A Data Warehouse is a relational database management system (RDBMS) construct designed for query and analysis rather than for transaction processing. It can be loosely described as any centralized
data repository which can be queried for business benefit. It is a database that stores
information oriented to satisfying decision-making requests. It is a group of decision-support
technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to
make better and faster decisions. Data warehousing therefore supports architectures and tools that help
business executives systematically organize, understand, and use their information to make
strategic decisions.
It is a database designed for investigative tasks, using data from various applications.
It supports a relatively small number of clients with relatively long interactions.
It includes current and historical data to provide a historical perspective of information.
Its usage is read-intensive.
It contains a few large tables.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of
a particular subject, such as customer, product, or sales, instead of the
organization's ongoing global operations.
This is done by excluding data that are not useful concerning the subject and including all
data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records.
It requires performing data cleaning and integration during data warehousing to ensure
consistency in naming conventions, attribute types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse.
For example, one can retrieve files from 3 months, 6 months, 12 months, or even
previous data from a data warehouse.
This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the
source operational RDBMS.
Operational updates of data do not occur in the data warehouse, i.e., update, insert,
and delete operations are not performed.
It usually requires only two procedures for data access:
the initial loading of data and read access to the data.
Therefore, the DW does not require transaction processing, recovery, and concurrency-control
capabilities, which allows for a substantial speedup of data retrieval.
Non-volatile means that, once entered into the warehouse, data should not change.
History of Data Warehouse
The idea of data warehousing dates to the late 1980s, when IBM researchers Barry
Devlin and Paul Murphy introduced the "Business Data Warehouse."
In essence, the data warehousing idea was intended to provide an architectural model for
the flow of information from operational systems to decision-support environments.
The concept attempted to address the various problems associated with this flow, mainly the
high costs associated with it.
In the absence of a data warehousing architecture, a vast amount of redundant storage was required to
support multiple decision-support environments.
In large corporations, it was common for various decision-support environments to
operate independently.
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data:
This type of data comes from the different operational systems of the enterprise.
Based on the data requirements of the data warehouse, we choose segments of the data from the
various operational systems.
Internal Data:
In each organization, the client keeps their "private" spreadsheets, reports, customer profiles,
and sometimes even department databases. This is the internal data, part of which could be
useful in a data warehouse.
Archived Data:
Operational systems are mainly intended to run the current business. In every operational
system, we periodically take the old data and store it in archived files.
External Data:
Most executives depend on information from external sources for a large percentage of the
information they use. They use statistics relating to their industry produced by external
agencies.
After we have extracted data from the various operational systems and from external
sources, we have to prepare the files for storage in the data warehouse.
The extracted data coming from several different sources must be changed, converted,
and made ready in a format that is suitable to be saved for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
Data Extraction:
This method has to deal with numerous data sources.
We have to employ the appropriate techniques for each data source.
Data Transformation:
As we know, data for a data warehouse comes from many different sources.
If data extraction for a data warehouse poses big challenges, data transformation
presents even more significant challenges.
We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source.
Cleaning may be the correction of misspellings, the provision of default
values for missing data elements, or the elimination of duplicates when we bring in the same
data from various source systems.
Standardization of data components forms a large part of data transformation.
Data transformation also involves many forms of combining pieces of data from different
sources: we combine data from a single source record or related data parts from many
source records.
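A minimal sketch of these cleaning and standardization tasks is shown below. The field names (cust_id, cust_name, gender, city), the coding schemes, and the records are hypothetical examples, not drawn from any real system.

```python
# Cleaning tasks from the text: fix misspelling-prone text, supply default
# values for missing elements, eliminate duplicates, and standardize
# heterogeneous coding schemes. All names and records are hypothetical.
GENDER_CODES = {"M": "MALE", "F": "FEMALE", "0": "MALE", "1": "FEMALE"}

def transform(records):
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Eliminate duplicates brought in from multiple source systems.
        if rec["cust_id"] in seen_ids:
            continue
        seen_ids.add(rec["cust_id"])
        cleaned.append({
            "cust_id": rec["cust_id"],
            # Trim and normalize free text that is prone to misspellings.
            "cust_name": rec["cust_name"].strip().title(),
            # Standardize heterogeneous coding schemes to one convention.
            "gender": GENDER_CODES.get(str(rec.get("gender", "")).upper(), "UNKNOWN"),
            # Provide a default value for a missing data element.
            "city": rec.get("city") or "UNKNOWN",
        })
    return cleaned

print(transform([
    {"cust_id": 1, "cust_name": "  alice smith ", "gender": "F", "city": None},
    {"cust_id": 1, "cust_name": "Alice Smith", "gender": "1", "city": "Chennai"},
]))
```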
Data Loading:
Two distinct categories of tasks form data loading functions.
When we complete the structure and construction of the data warehouse and go live for
the first time, we do the initial loading of the information into the data warehouse
storage.
The initial load moves high volumes of data and uses up a substantial amount of time.
The information delivery element is used to enable the process of subscribing to data
warehouse files and having them transferred to one or more destinations according to some
customer-specified scheduling algorithm.
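As a hedged illustration of the initial load, the sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse RDBMS; the sales_fact table and its rows are hypothetical.

```python
# Initial bulk load followed by read-only access, mirroring the two data
# access procedures described above. Table and rows are made up.
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                (product TEXT, region TEXT, amount REAL)""")

# Initial load: move a high volume of rows in one bulk operation,
# rather than row-by-row transactional inserts.
rows = [("widget", "north", 120.0), ("gadget", "south", 75.5)]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
conn.commit()

# After loading, access is read-intensive: no updates or deletes.
for row in conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region"):
    print(row)
conn.close()
```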
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system.
In the data dictionary, we keep the data about the logical data structures, the data about
the records and addresses, the information about the indexes, and so on.
Management and Control Component
The management and control elements coordinate the services and functions within the
data warehouse.
These components control the data transformation and the data transfer into the data
warehouse storage.
On the other hand, it moderates the data delivery to the clients.
It works with the database management systems and allows data to be correctly
saved in the repositories.
It monitors the movement of information into the staging area and from there into the
data warehouse storage itself.
Difference between Database and Data Warehouse
| Database (OLTP) | Data Warehouse |
| --- | --- |
| The tables and joins are complex since they are normalized. | The tables and joins are simple since they are de-normalized. |
| Entity-relationship modeling techniques are used for RDBMS database design. | Data modeling techniques are used for data warehouse design. |
| Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data. |
| Data within operational systems is updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed. |
| It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes. |
| It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. |
| It is optimized for validation of incoming information during transactions; it uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation. |
| It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP. |
| Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented. |
| Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data. |
| A smaller amount of data is accessed. | A large amount of data is accessed. |
| Relational databases are created for On-Line Transaction Processing (OLTP). | Data warehouses are designed for On-Line Analytical Processing (OLAP). |
Difference between OLTP and OLAP
| OLTP | OLAP |
| --- | --- |
| It is a system used to manage operational data. | It is a system used to manage informational data. |
| Users: clerks, clients, and information technology professionals. | Users: knowledge workers, including managers, executives, and analysts. |
| An OLTP system is customer-oriented; transaction and query processing are done by clerks, clients, and IT professionals. | An OLAP system is market-oriented; data analysis is done by knowledge workers such as managers, executives, and analysts. |
| An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. | An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity. This makes the data easier to use for informed decision making. |
| Database size: 100 MB to GB. | Database size: 100 GB to TB. |
| An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design. | An OLAP system typically uses either a star or snowflake model and a subject-oriented database design. |
| An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. | An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with data that originates from different organizations, integrating information from many data stores. |
| Not very large. | Because of their large volume, OLAP data are stored on multiple storage media. |
| The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency-control and recovery techniques. | Accesses to OLAP systems are mostly read-only, because these data warehouses store historical data. |
| Query speed: very fast. | Query speed depends on the amount of data involved; batch data refresh and complex queries may take many hours; query speed can be improved by creating indexes. |
| Access mode: read/write. | Access mode: mostly read. |
Data Warehouse Architecture
Operational System
An operational system is the term used in data warehousing for a system that processes the day-to-day
transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
Metadata is a set of data that defines and gives information about other data. Metadata is used in the data
warehouse for a variety of purposes. Metadata summarizes necessary
information about data, which can make finding and working with particular instances of
data easier. For example, author, date created, date modified, and file size are
examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It
may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces called a gateway.
A gateway is provided by the underlying DBMS and allows customer programs to generate SQL
code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking
and Embedding for Databases) by Microsoft, and JDBC (Java Database Connectivity).
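As a hedged sketch of extracting data through such a gateway, the example below uses the pyodbc library; the DSN "SalesSource", the credentials, and the orders table are assumptions made purely for illustration.

```python
# Client program generates SQL; the ODBC gateway executes it at the server.
# DSN, credentials, and table are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=SalesSource;UID=etl_user;PWD=secret")
cur = conn.cursor()
cur.execute("SELECT order_id, amount FROM orders WHERE order_date >= ?",
            "2024-01-01")
for order_id, amount in cur.fetchall():
    print(order_id, amount)
conn.close()
```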
A middle-tier which consists of an OLAP server for fast querying of the data warehouse.
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
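To make the ROLAP idea in (1) concrete, here is a minimal sketch of how a multidimensional request (dimensions plus a measure) can be mapped onto a standard relational GROUP BY; the fact table and column names are hypothetical.

```python
# Map a multidimensional roll-up request to a plain relational query.
def rollup_sql(fact_table, dimensions, measure):
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) AS total "
            f"FROM {fact_table} GROUP BY {dims}")

# Roll up sales by quarter and region:
print(rollup_sql("sales_fact", ["quarter", "region"], "amount"))
# SELECT quarter, region, SUM(amount) AS total FROM sales_fact GROUP BY quarter, region
```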
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time windows;
performance on the load process should be measured in hundreds of millions of rows and gigabytes per
hour, and it must not artificially constrain the volume of data a business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata updates.
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS;
large, complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to
hundreds of gigabytes and even terabytes.
Autonomous Data Warehouse
Definition:
An autonomous data warehouse is a cloud-based data warehouse platform that utilizes AI and
ML algorithms to automate tasks such as data loading, schema design, performance tuning,
security management, and optimization of data storage.
Automation:
One of the defining features of an autonomous data warehouse is its ability to automate
routine administrative tasks traditionally performed by database administrators (DBAs)
and data engineers.
This automation frees up human resources to focus on more strategic initiatives rather
than mundane maintenance tasks.
Self-Driving:
The term "autonomous" refers to the self-driving nature of the data warehouse
platform. It continuously learns from user interactions, workload patterns, and
system performance metrics to make intelligent decisions autonomously.
For example, it can automatically allocate resources, adjust storage
configurations, and optimize query execution plans to deliver optimal
performance.
Scalability:
ADW platforms typically offer elastic scalability, allowing organizations to
dynamically adjust compute and storage resources based on changing workload
demands.
This ensures that the data warehouse can efficiently handle varying workloads
without manual intervention.
Security:
Autonomous data warehouses incorporate advanced security features such as data
encryption, access controls, and threat detection mechanisms to
protect sensitive data assets.
AI-driven security analytics can proactively identify and mitigate security threats,
helping organizations adhere to compliance regulations and safeguard their data.
Performance:
By leveraging AI and ML algorithms for performance tuning and optimization,
autonomous data warehouses can deliver exceptional query
performance and throughput.
These platforms can analyze query patterns, data distribution, and system
statistics to dynamically optimize query execution and resource utilization.
Cost Efficiency:
The automation capabilities of autonomous data warehouses lead to improved cost
efficiency by reducing the need for manual intervention, minimizing downtime, and
optimizing resource utilization.
Organizations can benefit from a pay-as-you-go pricing model, where they only pay for
the resources consumed, resulting in a lower total cost of ownership (TCO).
Integration:
Autonomous data warehouses are designed to seamlessly integrate with a wide range
of data sources, applications, and analytics tools.
This enables organizations to consolidate disparate data sources into a centralized
repository for unified analytics and reporting.
Autonomous Data Warehouse vs Snowflake
Architecture:
ADW:
Oracle Autonomous Data Warehouse is built on Oracle's Exadata infrastructure, which is a
highly optimized hardware platform designed for performance and scalability.
It leverages Oracle Database technology and is fully managed by Oracle, offering
automated provisioning, patching, backups, and optimization.
Snowflake:
Snowflake's architecture is based on a multi-cluster, shared data architecture
that separates compute resources from storage.
It utilizes virtual warehouses for processing queries, enabling independent
scaling of compute and storage resources.
Snowflake is available on multiple cloud platforms, including AWS, Azure,
and Google Cloud.
Pricing Model:
ADW:
Oracle ADW typically follows a subscription-based pricing model, where customers pay
for a predefined level of compute and storage resources on a monthly or annual basis.
Pricing may include additional charges for data transfer, backup storage, and premium
features.
Snowflake:
Snowflake offers a consumption-based pricing model, where customers pay for the
resources used on a per-second basis.
This model provides flexibility and cost transparency, allowing organizations to scale
resources up or down based on demand without long-term commitments.
Features:
ADW:
Oracle ADW offers a comprehensive set of features for data warehousing, including
automated data loading, schema management, performance tuning, security, and
analytics.
It provides integration with Oracle Analytics Cloud and other Oracle Cloud services for
end-to-end analytics solutions.
Snowflake:
Snowflake provides a wide range of features for data warehousing and analytics,
including automatic scaling, data sharing, data lake integration, multi-cloud support, and
native support for semi-structured data formats like JSON and Avro.
It also offers built-in support for popular BI and analytics tools.
Ease of Use:
ADW:
Oracle ADW is known for its simplicity and ease of use, offering a user-friendly
interface and automated management features that reduce the need for manual
intervention.
It provides self-service capabilities for data loading, querying, and analytics, making it
accessible to users with varying levels of technical expertise.
Snowflake:
Snowflake is praised for its ease of use and simplicity, with a SQL-based interface and
intuitive web-based console for managing data and resources.
It offers seamless integration with popular data integration tools, making it easy to
ingest data from various sources and perform complex analytics tasks.
Target Audience:
ADW:
Oracle ADW is primarily targeted at enterprises and large organizations that require a
robust, scalable, and fully managed data warehousing solution.
It is particularly well-suited for Oracle Database customers looking to migrate to the
cloud or consolidate their data infrastructure.
Snowflake:
Snowflake caters to a wide range of customers, including small startups, mid-sized
businesses, and large enterprises across various industries.
Its flexible pricing model and pay-as-you-go approach make it appealing to organizations of all
sizes, especially those with fluctuating workloads or budget constraints.
Modern Data Warehouse
Cloud-Native Architecture:
A modern data warehouse is typically built on a cloud-native architecture, leveraging the
scalability, elasticity, and flexibility of cloud computing platforms such as AWS, Azure,
or Google Cloud.
This allows organizations to scale resources up or down based on demand, reduce upfront
infrastructure costs, and benefit from pay-as-you-go pricing models.
Unified Data Storage:
Unlike traditional data warehouses that rely on rigid, structured data models, modern
data warehouses embrace a unified approach to data storage, accommodating a variety of
data types, formats, and sources.
They integrate structured, semi-structured, and unstructured data from internal and
external sources, including relational databases, NoSQL databases, data lakes, IoT
devices, and streaming data sources.
Real-Time Data Processing:
Modern data warehouses prioritize real-time data processing and analytics capabilities to
enable timely insights and decision-making.
They support streaming data ingestion, event-driven architectures, and near real-time
analytics to capture and analyze data as it is generated, ensuring that organizations can
respond rapidly to changing business conditions and emerging trends.
Advanced Analytics and AI:
A key characteristic of modern data warehouses is their ability to leverage advanced
analytics techniques and artificial intelligence (AI) algorithms to extract actionable
insights from large volumes of data.
This includes predictive analytics, machine learning, natural language processing, and
data visualization tools that empower users to uncover patterns, trends, and correlations
within their data.
Self-Service Data Exploration:
Modern data warehouses emphasize self-service data exploration and analytics
capabilities, allowing business users, data analysts, and data scientists to access, query,
and visualize data without extensive technical expertise or reliance on IT departments.
Scalability and Performance:
Scalability and performance are critical considerations in modern data warehouse
architectures.
They are designed to handle large volumes of data and concurrent user queries
efficiently, leveraging distributed computing and parallel processing techniques to
deliver high throughput and low latency query performance even under heavy workloads.
Security and Governance:
Modern data warehouses prioritize security and governance to protect sensitive data
assets and ensure regulatory compliance.
They incorporate robust authentication, authorization, encryption, and auditing
mechanisms to safeguard data privacy and integrity.
INTRODUCTION:
ETL stands for Extract, Transform, Load, and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the warehouse.
The process of ETL can be broken down into the following three stages:
1. Extract:
The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This
step involves reading data from the source systems and storing it in a
staging area.
2. Transform:
In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning
and validating the data, converting data types, combining data from
multiple sources, and creating new data fields.
3. Load:
After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the
data into the warehouse.
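The three stages can be tied together in a short, hedged Python sketch. The flat file sales.csv, its columns, and the warehouse table are all hypothetical; the built-in sqlite3 module merely stands in for the warehouse RDBMS.

```python
# End-to-end ETL sketch: extract from a flat file, transform, load to sqlite3.
import csv
import sqlite3

def extract(path):
    # Stage 1 - Extract: read raw rows from a flat-file source.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(raw_rows):
    # Stage 2 - Transform: validate, convert types, derive a new field.
    out = []
    for row in raw_rows:
        if not row["amount"]:                  # drop invalid records
            continue
        amount = float(row["amount"])          # data type conversion
        out.append((row["product"].strip().upper(), amount, amount * 0.18))
    return out

def load(rows, db="warehouse.db"):
    # Stage 3 - Load: create the physical structure and insert the rows.
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(product TEXT, amount REAL, tax REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

load(transform(extract("sales.csv")))
```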
Note that the only things shown via the conceptual data model are the
entities that define the data and the relationships between those entities. No other
detail is shown in the conceptual data model.
The steps for designing the logical data model are as follows:
Specify primary keys for all entities.
The steps for physical data model design are as follows:
Requirements for the systems change over time; therefore, data warehouse and OLAP systems are
dynamic, and the design process is continuous. Data warehouse design takes a
method different from view materialization in industry. It sees data
warehouses as database systems with particular needs, such as answering
management-related queries. The target of the design becomes how the records
from multiple data sources should be extracted, transformed, and loaded (ETL)
to be organized in a database as the data warehouse.
| Top-Down Design | Bottom-Up Design |
| --- | --- |
| Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems and integrates them into a higher one. |
| Inherently architected: not a union of several data marts. | Inherently incremental: can schedule essential data marts first. |
| Single, central storage of information about the content. | Departmental information is stored. |
OLAP
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a category of
software technology that authorizes analysts, managers, and executives to gain
insight into information through fast, consistent, interactive access to a wide
variety of possible views of data that has been transformed from raw information
to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP implements the multidimensional analysis of business information and
supports the capability for complex estimations, trend analysis, and sophisticated
data modeling. It is rapidly becoming the essential foundation for Intelligent
Solutions, including Business Performance Management, Planning, Budgeting,
Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge
Discovery, and Data Warehouse Reporting.
2. Multi-User Support: Since OLAP techniques are shared, the OLAP
operation should provide normal database operations, including retrieval,
update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-
ends. The OLAP operations should sit between data sources (e.g., data
warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
8. OLAP facilitates interactive querying and complex analysis for the users.
Benefits of OLAP
1. OLAP helps managers in decision-making through the multidimensional views of data it provides, increasing their productivity.
3) Nature: Although SQL queries return a set of data, OLTP methods are designed
to process one record at a time, for example, a record related to a user who may
be on the phone or in the store. OLAP systems are not designed to deal with
individual customer records. Instead, they include queries that deal with many
records at a time and provide summary or aggregate information to a manager.
OLAP applications include data stored in data warehouses that have been
extracted from many tables and possibly from more than one enterprise
database.
4) Design: OLTP database operations are designed to be application-
oriented, while OLAP operations are designed to be subject-oriented. OLTP
systems view the enterprise's records as a collection of tables (possibly based
on an entity-relationship model). OLAP operations view enterprise information
as multidimensional.
5) Data: OLTP systems usually deal only with the current status of data. For
example, a record about an employee who left three years ago may not be
available on the Human Resources System. The old data may have been
archived on some type of stable storage media and may not be accessible
online. On the other hand, OLAP systems need historical data over several
years, since trends are often essential in decision making.
6) Kind of use: OLTP methods are used for read and write operations, while
OLAP methods usually do not update the data.
Drill-Down
The drill-down operation (also called roll-down) is the reverse of
roll-up. Drill-down is like zooming in on the data cube. It navigates from less
detailed data to more detailed data. Drill-down can be performed by either
stepping down a concept hierarchy for a dimension or adding additional
dimensions. The figure shows a drill-down operation performed on the dimension time
by stepping down a concept hierarchy defined as day, month, quarter, and
year. Drill-down occurs by descending the time hierarchy from the level of the
quarter to the more detailed level of the month. Because a drill-down adds more
detail to the given data, it can also be performed by adding a new dimension to a
cube. For example, a drill-down on the central cube of the figure can occur by
introducing an additional dimension, such as customer group.
Slice
A slice is a subset of the cube corresponding to a single value for one or
more members of a dimension. For example, a slice operation is executed when
the customer wants a selection on one dimension of a three-dimensional cube,
resulting in a two-dimensional slice. So, the slice operation performs a selection
on one dimension of the given cube, thus resulting in a subcube.
Dice
The dice operation defines a subcube by performing a selection on two or more
dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND
(temperature = cool OR temperature = hot) to the original cube, we get the
following subcube (still two-dimensional):
|       | cool | hot |
| ----- | ---- | --- |
| Day 3 | 0    | 1   |
| Day 4 | 0    | 0   |
Pivot
The pivot operation is also called rotation. Pivot is a visualization
operation that rotates the data axes in view to provide an alternative
presentation of the data. It may involve swapping the rows and columns or moving
one of the row dimensions into the column dimensions.
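The four operations above can be sketched with pandas on a toy cube. The dimensions (day, temperature, location) and the counts are made-up values echoing the dice example; none of this is drawn from a real dataset.

```python
# Roll-up/drill-down, slice, dice, and pivot on a tiny illustrative cube.
import pandas as pd

cube = pd.DataFrame({
    "day":         ["day 3", "day 3", "day 4", "day 4"],
    "temperature": ["cool",  "hot",   "cool",  "hot"],
    "location":    ["A",     "A",     "B",     "B"],
    "count":       [0, 1, 0, 0],
})

# Roll-up: aggregate away a dimension (less detail). Drill-down is the
# reverse: include more dimensions or finer hierarchy levels.
rollup = cube.groupby("day")["count"].sum()

# Slice: a single value on one dimension yields a lower-dimensional subset.
cool_slice = cube[cube["temperature"] == "cool"]

# Dice: a selection on two or more dimensions yields a subcube.
subcube = cube[cube["day"].isin(["day 3", "day 4"]) &
               cube["temperature"].isin(["cool", "hot"])]

# Pivot: rotate the axes for an alternative presentation.
pivot = cube.pivot_table(index="day", columns="temperature",
                         values="count", aggfunc="sum")
print(pivot)
```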
Types of OLAP
There are three main types of OLAP servers, as follows:
Advantages
Can handle large amounts of information: The data size limitation of ROLAP
technology depends on the data size of the underlying RDBMS. So, ROLAP
itself does not restrict the data amount.
Can leverage functionalities inherent in the relational database: The RDBMS already comes with a lot of features, so ROLAP
technologies (which work on top of the RDBMS) can leverage these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL
queries) against the relational database, the query time can be long if the
underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies on
developing SQL statements to query the relational database, and SQL
statements do not suit all needs.
A MOLAP structure primarily reads precompiled data. MOLAP structures have
limited capabilities to dynamically create aggregations or to evaluate results that
have not been pre-calculated and stored. Applications requiring iterative and
comprehensive time-series analysis of trends are well suited for MOLAP technology
(e.g., financial analysis and budgeting).
Advantages
Excellent performance: A MOLAP cube is built for fast information
retrieval and is optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only possible,
but they also return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations
are performed when the cube is built, it is not possible to contain a large
amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary
and does not already exist in the organization. Therefore, to adopt MOLAP
technology, chances are other investments in human and capital resources
are needed.
Advantages of HOLAP
HOLAP provides the benefits of both MOLAP and ROLAP.
It provides fast access at all levels of aggregation.
HOLAP balances the disk space requirement, as it only stores the aggregate
information on the OLAP server while the detail records remain in the relational
database, so no duplicate copy of the detail records is maintained.
Disadvantages of HOLAP
HOLAP architecture is very complicated because it supports both MOLAP
and ROLAP servers.
Other Types:
There are also less popular styles of OLAP, upon which one could
stumble every so often in the OLAP industry.
| ROLAP | MOLAP | HOLAP |
| --- | --- | --- |
| ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. | HOLAP stands for Hybrid Online Analytical Processing. |
| The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database specified in the partition's data source. | The MOLAP storage mode causes the aggregations of the partition, and a copy of its source data, to be stored in a multidimensional structure in Analysis Services when the partition is processed. | The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance. |
| ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. | The MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. | HOLAP does not cause a copy of the source data to be stored. For queries that access only the summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. |
| Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP. | Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition. | Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation data) must retrieve the data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure. |
| ROLAP | MOLAP |
| --- | --- |
| ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. |
| It is usually used when the data warehouse contains relational data. | It is used when the data warehouse contains relational as well as non-relational data. |
| It has a high response time. | It has a lower response time due to prefabricated cubes. |
Suppose we say that a data item about a person is 80. This must be
defined by noting that it is the person's weight and that the unit is kilograms.
Therefore, (weight, kilograms) is the metadata about the data value 80.
Other examples of metadata are data about the tables and figures in a
report like this book. A table (which is a collection of records) has a name (e.g., a table
title), and the column names of the table may also be treated as
metadata. The figures also have titles or names.
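A small illustration of the (weight, kilograms) example in code: metadata kept alongside the bare data value it describes. The provenance field is a hypothetical addition.

```python
# Metadata is data about data: it tells us how to interpret the value 80.
data_value = 80
metadata = {
    "attribute": "weight",        # what the value means
    "unit": "kilograms",          # how to interpret it
    "source": "patient_records",  # hypothetical provenance
}
print(data_value, metadata)
```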
Types of Metadata
Metadata in a data warehouse fall into three major parts:
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational
systems of the enterprise. These source systems include different data structures.
The data elements selected for the data warehouse have various field lengths and
data types. In selecting data from the source systems for the data
warehouse, we split records, combine parts of records from different
source files, and deal with multiple coding schemes and field lengths. When we
deliver information to the end-users, we must be able to tie it back to the original
source data sets. Operational metadata contains all of this information about the
operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the extraction of
data from the source systems, namely, the extraction frequencies, extraction
methods, and business rules for the data extraction. Also, this category of
metadata contains information about all the data transformation that takes place in
the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouse. It
enables the end-users to find data in the data warehouse. The end-user
metadata allows the end-users to use their own business terminology and look for
information in the ways in which they usually think of the business.
Metadata Repository
The metadata itself is housed in and controlled by the metadata repository.
Metadata repository management software can be used to map the source
data to the target database, integrate and transform the data, generate code for
data transformation, and move data to the warehouse.
Benefits of Metadata Repository
It provides a set of tools for enterprise-wide metadata management.
It eliminates and reduces inconsistency, redundancy, and underutilization.
It improves organization control, simplifies management and
accounting of information assets.
It increases coordination, understanding, identification, and utilization
of information assets.
It enforces CASE development standards with the ability to share and reuse
metadata.
It leverages investment in legacy systems and utilizes existing applications.
It provides a relational model for heterogeneous RDBMS to share information.
It provides a useful data administration tool to manage corporate
information assets with the data dictionary.
It increases reliability, control, and flexibility of the application development
process.
Advantages of Data Mart
A data mart helps to enhance users' response time due to a reduction in the volume of
data.
It provides easy access to frequently requested data.
Data marts are simpler to implement when compared to a corporate
data warehouse. At the same time, the cost of implementing a data
mart is certainly lower compared with implementing a full data
warehouse.
Compared to a data warehouse, a data mart is agile. In case of a
change in the model, the data mart can be rebuilt quicker due to its smaller
size.
A data mart is defined by a single Subject Matter Expert (SME). On the
contrary, a data warehouse is defined by interdisciplinary SMEs from
a variety of domains. Hence, a data mart is more open to change
compared to a data warehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.
Designing
Designing is the first phase of data mart implementation. It covers all the
tasks from initiating the request for a data mart to gathering information about
the requirements. Finally, we create the logical and physical data mart design.
Constructing
This is the second phase of implementation. It involves creating the physical database
and the logical structures.
Populating
In the third phase, data is populated in the data mart.
The populating step involves the following tasks:
Source data to target data Mapping
Extraction of source data
Cleaning and transformation operations on the data
Loading data into the data mart
Creating and storing metadata
Managing
This is the last step of Data Mart Implementation process. This step covers
management tasks such as
Partitioning Dimensions
If a dimension contains a large number of entries, then it is required
to partition the dimension. Here we have to check the size of the
dimension. Consider a large dimension that changes over time. If we need to
store all the variations in order to apply comparisons, that dimension
may become very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one
is archived. It uses metadata to allow the user access tool to refer to the correct table
partition. This technique makes it easy to automate table management facilities
within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically, placing different groups of columns
in different tables. (The figure illustrating this is not reproduced here; the example
table contained the columns transaction_date, region, and branch_name.)
We can choose to partition on any key. The two possible keys could be
region
transaction_date
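A minimal sketch of horizontal partitioning on these candidate keys is given below. Records are routed into per-partition groups; the records themselves are hypothetical.

```python
# Route records into partitions keyed on region (or transaction_date).
from collections import defaultdict

def partition_by(records, key):
    partitions = defaultdict(list)
    for rec in records:
        partitions[rec[key]].append(rec)
    return partitions

sales = [
    {"transaction_date": "2024-01-05", "region": "north", "amount": 100},
    {"transaction_date": "2024-02-11", "region": "south", "amount": 250},
    {"transaction_date": "2024-02-20", "region": "north", "amount": 75},
]

# Partition on region; the same helper works for transaction_date.
for region, rows in partition_by(sales, "region").items():
    print(region, len(rows), "rows")
```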
Metadata Repository
A metadata repository is a database or other storage mechanism that is
used to store metadata about data. A metadata repository can be used to
manage, organize, and maintain metadata in a consistent and structured
manner, and can facilitate the discovery, access, and use of data.
A metadata repository may contain metadata about a variety of types of
data, such as documents, images, audio and video files, and other types of
digital content. The metadata in a metadata repository may include
information about the content, format, structure, and other characteristics
of data, and may be organized using metadata standards and schemas.
There are many types of metadata repositories, ranging from simple file
systems or spreadsheets to complex database systems. The choice of
metadata repository will depend on the needs and requirements of the
organization, as well as the size and complexity of the data that is being
managed.
Metadata repositories can be used in a variety of contexts, such as libraries,
museums, archives, and online platforms. They can be used to improve the
discoverability and ranking of content in search engines, and to provide
context and additional information about search results. Metadata
repositories can also support data governance by providing information
about the ownership, use, and access controls of data, and can facilitate
interoperability by providing information about the content, format, and
structure of data, and by enabling the exchange of data between different
systems and applications. Metadata repositories can also support data
preservation by providing information about the context, provenance, and
preservation needs of data, and can support data visualization by providing
information about the data’s structure and content, and by enabling the
creation of interactive and customizable visualizations.
Benefits of Metadata Repository
A metadata repository is a centralized database or system that is used to
store and manage metadata. Some of the benefits of using a metadata repository
include:
Improved data quality: A metadata repository can help ensure that
metadata is consistently structured and accurate, which can improve the
overall quality of the data.
Increased data accessibility: A metadata repository can make it easier
for users to access and understand the data, by providing context and
information about the data.
Enhanced data integration: A metadata repository can facilitate data
integration by providing a common place to store and manage metadata
from multiple sources.
Improved data governance: A metadata repository can help enforce
metadata standards and policies, making it easier to ensure that data is
being used and managed appropriately.
Enhanced data security: A metadata repository can help protect the
privacy and security of metadata, by providing controls to restrict access to
sensitive or confidential information.
Metadata repositories can provide many benefits in terms of improving the
quality, accessibility, and management of data.
Challenges for Metadata Management
There are several challenges that can arise when managing metadata:
Lack of standardization: Different organizations or systems may use
different standards or conventions for metadata, which can make it
difficult to effectively manage metadata across different sources.
Data quality: Poorly structured or incorrect metadata can lead to problems
with data quality, making it more difficult to use and understand the data.
Data integration: When integrating data from multiple sources, it can be
challenging to ensure that the metadata is consistent and aligned across
the different sources.
Level 1: Data Collection
IoT devices
Social media posts
YouTube videos
Website content
Customer data
Enterprise Resource Planning
Legacy data stores
Level 2: Data Engineering
Once you have acquired the data, you need to upload it into the data warehouse. Data
engineering uses pipelines and ETL (extract, transform, load) tools. Using these
tools, you can upload that information into the data warehouse. Data engineering is
similar to a truck bringing raw materials into a factory.
AI
Deep learning
Machine learning
Statistical modeling
Natural language processing (NLP)
Keep in mind that all the algorithms above need data to work successfully.
The more data you provide, the smarter your decisions and the smarter your
results. If you want to understand your reports, it is essential to leverage AI to get
better answers, which leads us back to the modern data warehouse.
Again, it is about more than gathering and storing data; it is about making smart decisions.