CCS341 DW QA (Final)
Mrs. D. SUJATHA, HoD/AI&DS
Unit-1
PART-A
ii) Data warehouses are designed to perform well with enormous amounts of data.
Data Selection
Data Cleaning and Preprocessing
Data Transformation and Reduction
Data Mining
Evaluation and Interpretation of Results
6. How is a data warehouse different from a database? Identify the similarity. (AU-N/D2024)
Similarity
Data Storage: Both store structured data in tables using rows and columns.
SQL Use: Both use SQL (Structured Query Language) for data operations.
Relational Model: Both can be based on a relational model, allowing relationships
between different data tables.
Data Integrity: Both emphasize accuracy and consistency of stored data.
10. List out the nine-step method followed in the design of a data warehouse.
PART-B
1. Explain three tier data warehouse architecture and its significance. (AU-
A/M2024)
Data warehousing is essential for businesses looking to make informed decisions based on
large amounts of information. The architecture of a data warehouse is key to its effectiveness,
influencing how easily data can be accessed and used. The Three/Multi-Tier Data Warehouse
Architecture is widely adopted due to its clear and organized framework. This architecture
divides data handling into three main layers:
Bottom Tier (Data Sources and Data Storage)
Middle Tier (OLAP Engine)
Top Tier (Front-End Tools)
Bottom Tier
The Bottom Tier of the Three-Tier Data Warehouse Architecture is fundamental, as
it serves as the foundation where data is initially collected and stored. This tier is typically
structured around a warehouse database server, commonly an RDBMS (Relational Database
Management System), which houses the data extracted from various operational and external
sources. The core activities in this tier involve the extraction, cleaning, transformation, and
loading of data, collectively known as the ETL process.
ETL, standing for Extract, Transform, and Load, is vital in the data warehousing process. In
the Bottom Tier, data undergoes transformation to align with business logic and analysis
needs before being loaded into a more query-friendly structure. This process ensures that data
is clean, consistent, and optimized for quick retrieval, which is essential for effective data
analysis. Several tools are used to perform the ETL process, such as IBM InfoSphere, Informatica, Confluent, Microsoft SSIS, SnapLogic, and Alooma.
Middle Tier:
The Middle Tier in the three-tier architecture of a data warehouse is where
the OLAP (Online Analytical Processing) server resides. This tier acts as the critical
processing layer that manages and enables complex analytical queries on data stored in the
bottom tier. It functions as a mediator between the data repository (bottom tier) and the end-
user interface (top tier), ensuring that data is efficiently processed and made ready for
analysis and reporting.
OLAP is a powerful technology for complex calculations, trend analysis, and data
modeling. It is designed for high-speed analytical processing. OLAP server models come in
three different categories, including:
ROLAP (Relational OLAP): This model uses a relational database to store and
manage warehouse data. It is ideal for handling large data volumes as it operates
directly on relational databases.
MOLAP (Multidimensional OLAP): This model stores data in a
multidimensional cube. The storage and retrieval processes are highly efficient,
making MOLAP suitable for complex analytical queries that require aggregation.
HOLAP (Hybrid OLAP): It is a combination of the relational and multidimensional
online analytical processing paradigms. HOLAP is the ideal option for a seamless
functional flow across the database systems when the repository houses both the
relational database management system and the multidimensional database
management system.
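The three models can be sketched in a few lines of Python. This is only an illustrative toy (the sales table, the pre-built cube, and the query function are invented for the example), not any vendor's implementation: ROLAP aggregates on the fly with SQL, MOLAP reads a pre-computed cube, and HOLAP answers summaries from the cube while falling back to the relational store for detail.

```python
import sqlite3

# --- ROLAP style: aggregate on the fly with SQL over relational tables ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", "Car", 120.0), ("North", "Bus", 80.0),
                  ("South", "Car", 200.0), ("South", "Bus", 50.0)])
rolap_total = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# --- MOLAP style: aggregations pre-computed and stored in a cube structure ---
molap_cube = {("North",): 200.0, ("South",): 250.0}   # built ahead of time

# --- HOLAP style: summaries come from the cube, detail falls back to SQL ---
def holap_query(region, detail=False):
    if detail:  # detailed rows -> relational store (ROLAP path)
        return conn.execute(
            "SELECT product, amount FROM sales WHERE region = ?", (region,)).fetchall()
    return molap_cube[(region,)]  # summary -> pre-computed cube (MOLAP path)

print(rolap_total, holap_query("North"), holap_query("North", detail=True))
```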
Top Tier
The Top Tier in the Three-Tier Data Warehouse Architecture comprises the front-end
client layer, which is essential for interacting with the data stored and processed in the lower
tiers. This layer includes a variety of business intelligence (BI) tools and techniques designed
to facilitate easy access and manipulation of data for reporting, analysis, and decision-
making.
BI tools are critical components of the Top Tier, providing robust platforms through which
users can query, report, and analyze data. Popular BI tools include:
IBM Cognos: Offers comprehensive reporting capabilities.
These technologies are often built into or layered on top of modern data
warehouses to improve speed, scalability, and reliability.
2. Columnar Storage
3. In-Memory Computing
4. Data Partitioning
6. Data Caching
Tools like Apache NiFi, Talend, and Informatica streamline data transformation and reduce the time needed to load and prepare data for analysis.
Alternate Technologies
• Data Lakes
4. With a neat sketch explain the steps for design and construction of data
warehouses and explain the three tier architecture. (refer question 1)
5. Describe the three layers of snowflake architecture.
Snowflake's Architecture
Snowflake’s architecture mainly consists of three layers.
Storage Layer
The Storage layer in Snowflake architecture is responsible for managing and storing data in
an effective manner. The functionalities supported by the storage layer are:
Elasticity: Snowflake's storage layer is elastic, allowing organizations to scale
their storage needs independently of compute resources. It can handle varying
data volumes without affecting performance.
Cloud Based Object Storage: Snowflake uses cloud based object storage to store
data. This separation of storage and compute allows for cost-effective and scalable
data storage.
Data Clustering: Snowflake organizes data into micro partitions within the
object storage, and these micro partitions are clustered based on metadata. This
clustering enhances query performance by minimizing the amount of data that
needs to be scanned.
Zero Copy Cloning: Snowflake enables efficient data cloning through zero-copy
cloning technology. This feature allows users to create a copy of a dataset instantly
without duplicating the actual data, saving both time and storage costs.
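As a small illustration of zero-copy cloning, the sketch below issues Snowflake's CREATE TABLE ... CLONE statement through the snowflake-connector-python driver. The account details, warehouse, and table names are placeholders; treat it as a sketch of the idea rather than a ready-to-run script.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials -- substitute real account details before running.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Zero-copy clone: SALES_DEV becomes a metadata-only copy of SALES.
# No micro-partitions are duplicated until either table is modified.
cur.execute("CREATE TABLE SALES_DEV CLONE SALES")

cur.close()
conn.close()
```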
Query Processing Layer
The SQL query execution is handled by Snowflake's Query Processing Layer,
which dynamically optimizes and parallelizes queries over several compute clusters. It
ensures great performance and scalability by decoupling computation and storage, allowing
for on-demand resource allocation based on query complexity and workload. Functionalities
of Query Processing Layer are:
Automatic Query Processing: Snowflake's Query Processing Layer
optimizes SQL queries automatically, modifying execution plans based on
underlying data distribution and query complexity to ensure efficient processing.
Architecture Overview
Use Cases
Benefits
Unit-2
PART-A
1. What does ETL stand for, and how does it differ from ELT? (AU-A/M2024)
Extract: Pulling data from various sources (e.g., databases, APIs, files).
Transform: Cleaning, modifying, and enriching the data to make it usable.
Load: Moving the transformed data into a target system like a data warehouse.
Aspect | OLAP / Data Warehouse | OLTP / Operational Database
Data Operation | Complex queries with aggregation and joins | Simple, fast queries for inserts, updates, deletes
Response Time | Slower (due to complex queries) | Fast (optimized for quick operations)
Database Design | Denormalized schema (star/snowflake schemas) | Highly normalized schema
Example Use | Sales trend analysis, forecasting, market research | Bank transactions, online ticket booking
Data Integrity | Not always critical (read-intensive) | Very critical (write-intensive)
Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area.
Dependent data marts are sourced directly from enterprise data
warehouses.
PART-B
The term FASMI is derived from the first letters of the following characteristics of OLAP methods:
Fast
It means that the system is targeted to deliver most responses to the client within
about five seconds, with elementary analysis taking no more than one second and
very few taking more than 20 seconds.
Analysis
It means that the method can cope with any business logic and statistical analysis
that is relevant for the function and the user, and keeps it easy enough for the target client.
Although some pre-programming may be needed, it is necessary to allow the user to define
new ad hoc calculations as part of the analysis and to report on the data in any desired way,
without having to program; so products (like Oracle Discoverer) that do not allow adequate
end-user-oriented calculation flexibility are excluded.
Share
It means that the system implements all the security requirements for confidentiality and,
if multiple write access is needed, concurrent update locking at an appropriate level.
Not all functions need the user to write data back, but for the increasing number
that do, the system should be able to manage multiple updates in a timely, secure
manner.
Multidimensional
The system must provide a multidimensional conceptual view of the data, including full support for hierarchies.
Information
The system should be able to hold all the data needed by the applications. Data sparsity
should be handled in an efficient manner.
2. Multi-User Support: Since OLAP systems are shared, the OLAP operations
should provide normal database operations, including retrieval, update, concurrency
control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end. The
OLAP operations should be sitting between data sources (e.g., data warehouses) and
an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitate interactive query and complex analysis for the users.
9. OLAP allows users to drill down for greater details or roll up for aggregations of
metrics along a single business dimension or across multiple dimension.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
The ETL process, which stands for Extract, Transform, and Load, is a critical
methodology used to prepare data for storage, analysis, and reporting in a data
warehouse. It involves three distinct stages that help to streamline raw data from
multiple sources into a clean, structured, and usable form. Here’s a detailed
breakdown of each phase:
ETL
1. Extraction
The Extract phase is the first step in the ETL process, where raw data is collected from
various data sources. These sources can be diverse, ranging from structured sources
like databases (SQL, NoSQL), to semi-structured data like JSON, XML, or unstructured
data such as emails or flat files. The main goal of extraction is to gather data without altering
its format, enabling it to be further processed in the next stage.
Types of data sources can include:
Structured: SQL databases, ERPs, CRMs
Semi-structured: JSON, XML
Unstructured: Emails, web pages, flat files
2. Transformation
The Transform phase is where the magic happens. Data extracted in the previous phase is
often raw and inconsistent. During transformation, the data is cleaned, aggregated, and
formatted according to business rules. This is a crucial step because it ensures that the data
meets the quality standards required for accurate analysis.
Common transformations include:
Data Filtering: Removing irrelevant or incorrect data.
Data Sorting: Organizing data into a required order for easier analysis.
Data Aggregating: Summarizing data to provide meaningful insights (e.g.,
averaging sales data).
The transformation stage can also involve more complex operations such as currency
conversions, text normalization, or applying domain-specific rules to ensure the data aligns
with organizational needs.
3. Loading
Once data has been cleaned and transformed, it is ready for the final step: Loading. This
phase involves transferring the transformed data into a data warehouse, data lake, or another
target system for storage. Depending on the use case, there are two types of loading methods:
Full Load: All data is loaded into the target system, often used during the initial
population of the warehouse.
Incremental Load: Only new or updated data is loaded, making this method
more efficient for ongoing data updates.
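A minimal sketch of the two loading methods, using SQLite as a stand-in target; the staging table, the warehouse table, and the watermark column are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE warehouse_sales (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                 [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-02-01"), (3, 30.0, "2024-03-01")])

def full_load():
    # Full load: wipe the target and copy everything from staging.
    conn.execute("DELETE FROM warehouse_sales")
    conn.execute("INSERT INTO warehouse_sales SELECT * FROM staging")

def incremental_load(last_loaded):
    # Incremental load: copy only rows newer than the last-load watermark.
    conn.execute(
        "INSERT INTO warehouse_sales SELECT * FROM staging WHERE updated_at > ?",
        (last_loaded,))

full_load()
incremental_load("2024-02-15")          # picks up only the March row
print(conn.execute("SELECT COUNT(*) FROM warehouse_sales").fetchone())
```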
Pipelining in ETL Process
Pipelining in the ETL process involves processing data in overlapping stages to enhance
efficiency. Instead of completing each step sequentially, data is extracted, transformed, and
loaded concurrently. As soon as data is extracted, it is transformed, and while transformed
data is being loaded into the warehouse, new data can continue being extracted and processed.
This parallel execution reduces downtime, speeds up the overall process, and improves
system resource utilization, making the ETL pipeline faster and more scalable.
ETL Pipelining
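The overlap described above can be imitated in plain Python with generators: each stage pulls records from the previous one as soon as they are available instead of waiting for the whole step to finish. The source rows and the cleaning rule are invented for the sketch.

```python
def extract():
    # Stage 1: yield raw records one at a time (e.g., read from files or APIs).
    for raw in [" 10 ", "bad", " 25 ", " 40 "]:
        yield raw

def transform(records):
    # Stage 2: clean and convert; starts as soon as the first record arrives.
    for raw in records:
        value = raw.strip()
        if value.isdigit():          # filter out records that cannot be parsed
            yield int(value)

def load(records, batch_size=2):
    # Stage 3: write out in small batches while upstream stages keep producing.
    batch = []
    for row in records:
        batch.append(row)
        if len(batch) == batch_size:
            print("loading batch:", batch)
            batch.clear()
    if batch:
        print("loading batch:", batch)

load(transform(extract()))
```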
In short, the ETL process involves extracting raw data from various sources, transforming it
into a clean format, and loading it into a target system for analysis. This is crucial for
organizations to consolidate data, improve quality, and enable actionable insights for
decision-making, reporting, and machine learning. ETL forms the foundation of effective
data management and advanced analytics.
Importance of ETL
Data Integration: ETL combines data from various sources,
including structured and unstructured formats, ensuring seamless integration for
a unified view.
Data Quality: By transforming raw data, ETL cleanses and standardizes it,
improving data accuracy and consistency for more reliable insights.
Essential for Data Warehousing: ETL prepares data for storage in data
warehouses, making it accessible for analysis and reporting by aligning it with the
target system's requirements.
Enhanced Decision-Making: ETL helps businesses derive actionable insights,
enabling better forecasting, resource allocation, and strategic planning.
Operational Efficiency: Automating the data pipeline through ETL speeds up
data processing, allowing organizations to make real-time decisions based on the
most current data.
Challenges in ETL Process
The ETL process, while essential for data integration, comes with its own set of challenges
that can hinder efficiency and accuracy. These challenges, if not addressed properly, can
impact the overall performance and reliability of data systems.
Data Quality Issues: Inconsistent, incomplete, or duplicate data from multiple
sources can impact transformation and loading, leading to inaccurate insights.
Performance Bottlenecks: Large datasets can slow down or cause ETL processes
to fail, particularly during complex transformations like cleansing and
aggregation.
Scalability Issues: Legacy ETL systems may struggle to scale with growing data
volumes, diverse sources, and more complex transformations.
3. Explain the different types of OLAP tools. What type of OLAP tool is best suited for a data set that requires both detailed and summarized analysis? (AU-N/D2023)
OLAP tools are used to analyze multidimensional data efficiently. Here are the major types:
Limitations: Not ideal for massive detailed datasets; cube size can become a bottleneck.
HOLAP (Hybrid OLAP)
Stores summarized data in cubes (like MOLAP) and detailed data in relational databases (like ROLAP).
Strengths: Balances performance and scalability.
The HOLAP model is generally the best fit when a data set requires both detailed and summarized analysis: it gives the flexibility of accessing detailed data like ROLAP, with the speed of summarized cube querying like MOLAP.
Comparison of ROLAP, MOLAP, and HOLAP:

ROLAP stands for Relational Online Analytical Processing. The ROLAP storage mode causes the aggregations of the division to be stored in indexed views in the relational database that was specified in the partition's data source. ROLAP does not cause a copy of the source information to be stored in the Analysis Services data folders.

MOLAP stands for Multidimensional Online Analytical Processing. The MOLAP storage mode causes the aggregations of the division and a copy of its source information to be saved in a multidimensional structure in Analysis Services when the partition is processed. This MOLAP structure is highly optimized to maximize query performance, and the storage area can be on the computer where the partition is defined or on another computer running Analysis Services.

HOLAP stands for Hybrid Online Analytical Processing. The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the division to be stored in a multidimensional structure in an SQL Server Analysis Services instance. HOLAP does not cause a copy of the source information to be stored; for queries that access only the summary records in the aggregations of a division, HOLAP behaves like MOLAP.
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in overview section, the drill down operation is performed by
moving down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
Location = "Delhi" or "Kolkata"
Time = "Q1" or "Q2"
Item = "Car" or "Bus"
4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube. In the cube given in the overview section, a slice is performed by selecting a single dimension, for example Time = "Q1".
5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation,
performing pivot operation gives a new view of it
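The five operations can be mimicked on a small pandas DataFrame. The figures referenced above are not reproduced here, so the sample data below (Quarter, Location, Item, Sales) is invented to match the dimensions used in the text.

```python
import pandas as pd

cube = pd.DataFrame({
    "Quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "Location": ["Delhi", "Kolkata", "Delhi", "Kolkata"],
    "Item":     ["Car", "Bus", "Car", "Bus"],
    "Sales":    [100, 80, 120, 90],
})

# Roll up: aggregate along a dimension (drop Item, climb the hierarchy).
rollup = cube.groupby(["Quarter", "Location"])["Sales"].sum()

# Drill down goes the other way: keep more detailed columns in the groupby.
drill = cube.groupby(["Quarter", "Location", "Item"])["Sales"].sum()

# Slice: fix a single dimension value.
slice_q1 = cube[cube["Quarter"] == "Q1"]

# Dice: select a sub-cube with criteria on two or more dimensions.
dice = cube[cube["Location"].isin(["Delhi", "Kolkata"]) & cube["Quarter"].isin(["Q1", "Q2"])]

# Pivot: rotate the view so Locations become columns.
pivot = cube.pivot_table(index="Quarter", columns="Location", values="Sales", aggfunc="sum")

print(rollup, drill, slice_q1, dice, pivot, sep="\n\n")
```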
These are intermediate servers which stand in between a relational back-end server and
user frontend tools.
They use a relational or extended-relational DBMS to save and handle warehouse data,
and OLAP middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This model permits
the multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action
of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
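For example, a slice and dice against a star-schema fact table reduces, in a ROLAP engine, to appending predicates to the WHERE clause of the generated SQL. The table and column names below are illustrative only.

```python
base_query = """
SELECT l.city, t.quarter, SUM(f.amount) AS sales
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
JOIN time_dim t     ON f.time_key = t.time_key
GROUP BY l.city, t.quarter
"""

# A slice/dice in the front-end tool simply adds predicates to the WHERE clause.
sliced_and_diced = base_query.replace(
    "GROUP BY",
    "WHERE l.city IN ('Delhi', 'Kolkata') AND t.quarter = 'Q1'\nGROUP BY")

print(sliced_and_diced)
```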
A typical ROLAP architecture has three components:
Database server.
ROLAP server.
Front-end tool.
Some products in this segment have supported strong SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being 'RDBMS'-aware, and being capable of generating SQL statements based on the
optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS, so ROLAP itself does not restrict the
amount of data.
Disadvantages
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries)
against the relational database, query time can be prolonged if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies upon developing SQL
statements to query the relational database, and SQL statements do not suit all needs.
A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically into multidimensional arrays, and positional
techniques are used to access them.
One of the significant distinctions of MOLAP from ROLAP is that data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to the
client's reporting requirements, with the calculations pre-generated on the cubes.
MOLAP Architecture
A MOLAP structure primarily reads precompiled data. It has limited capabilities to
dynamically create aggregations or to evaluate results that have not been pre-calculated
and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's
Lightship Server, Sniper's TM/1, Planning Science's Gentium, and Kenan Technology's
Multiway.
Some of the problems faced by clients are related to maintaining support for multiple subject
areas in an RDBMS. Some vendors address these problems by maintaining access from
MOLAP tools to detailed data in an RDBMS.
An example would be the creation of sales data measured by several dimensions (e.g., product
and sales region) to be stored and maintained in a persistent structure. This structure would be
provided to reduce the application overhead of performing calculations and building
aggregation during initialization. These structures can be automatically refreshed at
predetermined intervals established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal
for slicing and dicing operations.
Can perform complex calculations: All evaluations have been pre-generated when the cube
is created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed
when the cube is built, it is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are other
investments in human and capital resources are needed.
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture.
HOLAP systems save more substantial quantities of detailed data in the relational tables while
the aggregations are stored in the pre-calculated cubes. HOLAP also can drill through from
the cube down to the relational tables for detailed data. The Microsoft SQL Server 2000
provides a hybrid OLAP server.
Advantages of HOLAP
3. HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server and the detail records remain in the relational database, so no duplicate
copy of the detail records is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.
Other Types
There are also less popular OLAP styles that one may stumble upon every
so often. Some of the less common variants in the OLAP industry are listed below.
WOLAP (Web OLAP) pertains to an OLAP application that is accessible via a web browser. Unlike
traditional client/server OLAP applications, WOLAP is considered to have a three-tiered
architecture consisting of three components: a client, a middleware, and a database server.
Mobile OLAP enables users to access and work on OLAP data and applications remotely
through the use of their mobile devices.
SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP
into a single user interface. It facilitates the management of both spatial and non-spatial data.
Unit-3
1. How to choose the data partitioning strategy for data warehouse? (AU-A/M2024)
Query Patterns: Are users often querying by time, location, or category? Match partitions to common filters.
Data Volume and Growth: If the dataset grows rapidly (e.g., logs, transactions), use time-based range partitioning (a small sketch follows this list).
Load & Maintenance Efficiency: Choose partitions that simplify loading new data and removing old data.
Hardware and Infrastructure: In distributed systems, hash partitioning balances load better across nodes.
Schema Evolution: Vertical partitioning can isolate frequently changing columns for easier updates.
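A minimal sketch of time-based range partitioning, the case suggested under Data Volume and Growth above: incoming rows are routed to per-month partition tables so that loading new data and dropping old data stay cheap. SQLite and the table layout are stand-ins chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def partition_for(date_str):
    # Range partitioning by month: "2024-03-17" -> table sales_2024_03.
    name = "sales_" + date_str[:7].replace("-", "_")
    conn.execute(f"CREATE TABLE IF NOT EXISTS {name} (sale_date TEXT, amount REAL)")
    return name

def insert_sale(date_str, amount):
    conn.execute(f"INSERT INTO {partition_for(date_str)} VALUES (?, ?)",
                 (date_str, amount))

insert_sale("2024-03-17", 120.0)
insert_sale("2024-04-02", 75.0)

# Dropping an old month is a cheap DROP TABLE instead of a large DELETE.
conn.execute("DROP TABLE IF EXISTS sales_2024_03")
```

In a real warehouse the same idea is usually expressed with the platform's native range-partitioning DDL rather than hand-built tables.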
Schema Definitions
6. Why is data mart considered cost effective compared to a data warehouse? (AU-
N/D2023)
Data marts are generally more cost-effective, requiring less infrastructure and lower
maintenance costs since they pull from warehouses.
7. What is a Metadata Repository?
A metadata repository is a software tool that stores descriptive information about the
data model used to store and share metadata. Metadata repositories combine diagrams
and text, enabling metadata integration and change.
Vertical Partitioning
→ All tables contain the same number of rows, but fewer columns.
Row splitting tends to leave a one-to-one map between partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.
PART-B
To design a cost-effective data mart, you need to consider various factors.
Clearly define the objectives of your data mart and understand what data you need to store,
analyze, and report on. This will help in determining the most cost-effective design.
3. Partitioning Strategy: Utilize partitioning to manage and access data efficiently. Consider
both vertical and horizontal partitioning:
Vertical Partitioning: Divide tables into smaller vertical segments based on the
frequency of access. Frequently accessed columns can be stored in one segment while
less accessed ones in another. This reduces I/O overhead and storage costs.
Horizontal Partitioning: Split tables into smaller chunks horizontally based on ranges
of values (e.g., date ranges) or other criteria. This helps in managing large datasets
efficiently and can optimize query performance.
7. Data Governance and Quality: Implement robust data governance practices to ensure data
quality and integrity. Poor data quality can lead to increased storage costs due to unnecessary
duplicates or inaccuracies.
8. Performance Monitoring and Tuning: Continuously monitor and tune the performance
of your data mart. Identify and eliminate performance bottlenecks through query optimization,
index tuning, and hardware upgrades if necessary.
10.Scalability: Design the data mart to be scalable to accommodate future growth. Ensure that
the chosen architecture can easily scale both vertically (e.g., adding more resources to existing
servers) and horizontally (e.g., adding more nodes to a distributed database system).
There are several challenges that can arise when managing metadata:
1. Lack of standardization: Different organizations or systems may use different standards
or conventions for metadata, which can make it difficult to effectively manage metadata
across different sources.
2. Data quality: Poorly structured or incorrect metadata can lead to problems with data
quality, making it more difficult to use and understand the data.
3. Data integration: When integrating data from multiple sources, it can be challenging to
ensure that the metadata is consistent and aligned across the different sources.
4. Data governance: Establishing and enforcing metadata standards and policies can be
difficult, especially in large organizations with multiple stakeholders.
5. Data security: Ensuring the security and privacy of metadata can be a challenge,
especially when working with sensitive or confidential information.
3. Illustrate the various classifications of metadata with suitable examples and explain the same. (AU-N/D2023)
Types of Metadata
There are mainly five types of metadata as shown below.
Types of Metadata
1. Preservation Metadata
Preservation metadata is the type of metadata designed to support the long-term accessibility
of digital assets. It serves as a comprehensive record of the information necessary for the
preservation and management of digital collections, most often in the context of digital
libraries, museums, and other cultural institutions.
Preservation metadata stores information that stays with digital assets over the long term,
including information about authentication and the preservation actions taken on the content.
It preserves the information needed for long-term preservation and management of digital
assets, ensuring their integrity and usability over time.
Example: This metadata includes information about file formats, migration strategies,
metadata schemas, and preservation actions taken to maintain data accessibility. It supports
digital preservation activities such as data migration and risk management.
2. Descriptive Metadata
Descriptive metadata provides detailed information about the content and characteristics of a
particular piece of data, enabling users to understand the meaning and context of the content.
It plays a main role in enabling users to discover and understand digital resources effectively.
Descriptive metadata covers the title, author/creator, subject/keywords, abstract/summary,
date, format, language, identifier, spatial coverage, temporal coverage, and rights information.
Descriptive metadata describes the content by providing information such as the author, title,
summary, theme, dates, and key points, helping users better understand the content.
Example: This metadata type consists of attributes like title, author, subjects, keywords,
abstracts, and other descriptive parts of the document that explain what the data is about.
For digital documents, descriptive metadata also includes details such as location, creation
date, file format, and file size.
3. Technical Metadata
Technical metadata provides detailed information about the technical characteristics and
properties of a digital asset. It plays a main role in facilitating the management, processing,
and interoperability of digital resources across different platforms and environments.
Technical metadata describes the technical aspects of the content, i.e., hardware
specifications, encoding, file formats, resolution, the software used to produce the content, etc.
Example: This metadata records the file format, file size, color space, compression method,
and the software used to create and manipulate the data. For images, technical metadata can
include Exchangeable Image File Format (EXIF) data containing the camera settings and
other technical information.
4. Structural Metadata
Structural metadata provides information about the organization, arrangement, and
relationships within a digital asset. It helps users navigate and understand the internal
structure and hierarchical components of the content. By supporting navigation through
complex information architectures, it enables users to access and retrieve the data efficiently.
The purpose of structural metadata is to provide details about the organization and
relationships within the content or dataset. For example, in a book, structural metadata can
include chapter titles, chapter headings, tables of contents, indexes, etc.
Example: Structural metadata can describe chapters, sections, headings, paragraphs, and page
sequences. For multimedia files such as videos and recordings, it includes timestamps,
markers, and pointers.
5. Administrative Metadata
Administrative metadata gives information about the management, administration, and
governance of digital assets across their lifecycle. It serves as a foundation for resource
management, traceability, and compliance with organizational policies and standards,
enabling efficient administration of the content.
Administrative metadata supports the administration and management of the content and
might include data ownership, access rights, creation/update dates, and file formats.
Example: This metadata can include details such as the author, creation date, access
permissions, version history, and Digital Rights Management (DRM) restrictions. It also
describes where the data is stored and the strategies used for its preservation.
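The five categories can be illustrated for a single digital object. Every field and value below is a made-up example, not a prescribed schema.

```python
# Hypothetical metadata record for one scanned report, grouped by category.
document_metadata = {
    "descriptive": {"title": "Q1 Sales Report", "author": "Finance Team",
                    "keywords": ["sales", "quarterly"]},
    "technical": {"file_format": "PDF", "file_size_kb": 842, "software": "LibreOffice"},
    "structural": {"chapters": ["Summary", "Regional Detail"], "pages": 14},
    "administrative": {"owner": "Finance", "access": "internal", "created": "2024-04-01"},
    "preservation": {"checksum": "sha256-placeholder", "migration_plan": "PDF/A by 2026"},
}

print(document_metadata["descriptive"]["title"])
```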
4. Elaborate in detail on the various issues to be considered when designing and implementing a data warehouse environment. (AU-N/D2023)
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.
• Metadata acts as a directory.
• This directory helps the decision support system to locate the contents of the data
warehouse.
• Metadata helps the decision support system in mapping data when data is
transformed from the operational environment to the data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized
data.
• Metadata also helps in summarization between lightly summarized data and highly
summarized data.
6.Compare and contrast the advantages & disadvantages of vertical and horizontal
partitioning in a data warehousing context .
Use cases: Vertical partitioning is commonly used for tables with a wide range of columns, where not all columns are frequently accessed together. Horizontal partitioning is commonly used for tables with a large number of rows, where data can be grouped based on some criteria (e.g., date ranges).
Unit-4
PART-A
Storage Efficiency: Lower for the star schema (redundancy); higher for the snowflake schema (normalized tables).
Fact Table: Central table containing quantitative data (like sales or revenue).
Dimension Tables: Related tables providing descriptive context (like time, product,
location).
Sub-Dimension Tables: Further breakdown of dimensions into smaller related tables.
PART-B
1. Detail the process of dimensional modelling and its importance in data
warehousing. (AU-A/M2024)
Dimensional Data Modeling is one of the data modeling techniques used in data warehouse
design. The concept of Dimensional Modeling was developed by Ralph Kimball and is
comprised of fact and dimension tables. Since the main goal of this modeling is to improve
data retrieval, it is optimized for the SELECT operation. The advantage of using this model is
that data can be stored in a way that makes it easy to store and retrieve once it is in the data
warehouse. The dimensional model is the data model used by many OLAP systems.
Elements of Dimensional Data Model
Facts
Facts are the measurable data elements that represent the business metrics of interest. For
example, in a sales data warehouse, the facts might include sales revenue, units sold, and
profit margins. Each fact is associated with one or more dimensions, creating a relationship
between the fact and the descriptive data.
Dimension
Dimensions are the descriptive data elements that are used to categorize or classify the data.
For example, in a sales data warehouse, the dimensions might include product, customer,
time, and location. Each dimension is made up of a set of attributes that describe the
dimension. For example, the product dimension might include attributes such as product
name, product category, and product price.
Attributes
Characteristics of a dimension in data modeling are known as attributes. These are used
to filter, search facts, etc. For a dimension of location, attributes can be State, Country,
Zipcode, etc.
Fact Table
In a dimensional data model, the fact table is the central table that contains the measures or
metrics of interest, surrounded by the dimension tables that describe the attributes of the
measures. The dimension tables are related to the fact table through foreign key relationships
Dimension Table
Dimensions of a fact are described by the dimension table, and fact and dimension tables are
joined by a foreign key. Dimension tables are simply de-normalized tables. A dimension can
have one or more relationships.
Types of Dimensions in Data Warehouse Model
Conformed Dimension
Outrigger Dimension
Shrunken Dimension
Role-Playing Dimension
Dimension to Dimension Table
Junk Dimension
Degenerate Dimension
Swappable Dimension
Step Dimension
Steps to Create Dimensional Data Modeling
Step-1: Identifying the business objective: The first step is to identify the business
objective. Sales, HR, Marketing, etc. are some examples of organizational needs.
Since this is the most important step of data modeling, the selection of the business objective
also depends on the quality of data available for that process.
Step-2: Identifying Granularity: Granularity is the lowest level of information stored in the
table. The grain describes the level of detail for the business problem and its solution.
Step-3: Identifying Dimensions and their Attributes: Dimensions are objects or things.
Dimensions categorize and describe data warehouse facts and measures in a way that supports
meaningful answers to business questions. A data warehouse organizes descriptive attributes
as columns in dimension tables.
For example, the date dimension may contain data like year, month, and weekday.
Step-4: Identifying the Facts: Identify the measurable facts (e.g., sales amount, units sold) that the fact table will hold.
Step-5: Building of Schema: We implement the Dimension Model in this step. A schema is
a database structure. There are two popular schemes: Star Schema and Snowflake Schema.
i)Database parallelism
Parallelism is used to support speedup, where queries are executed faster because more
resources, such as processors and disks, are provided. Parallelism is also used to provide
scale-up, where increasing workloads are managed without increasing response time, via
an increase in the degree of parallelism.
Different architectures for parallel database systems are shared-memory, shared-disk,
shared-nothing, and hierarchical structures.
(a)Horizontal Parallelism: It means that the database is partitioned across multiple
disks, and parallel processing occurs within a specific task (i.e., table scan) that is
performed concurrently on different processors against different sets of data.
(b) Vertical Parallelism: It occurs among various tasks. All component query
operations (i.e., scan, join, and sort) are executed in parallel in a pipelined fashion. In
other words, the output of one operation (e.g., join) is consumed by the next operation
as soon as records become available.
A data warehouse keeps evolving and it is unpredictable what query the user is going to post
in the future. Therefore it becomes more difficult to tune a data warehouse system. This
section discusses how to tune the different aspects of a data warehouse, such as
performance, data load, and queries.
Performance Assessment
• Scan rates
• It is of no use trying to tune response times if they are already better than those
required.
• To hide the complexity of the system from the user, aggregations and views should
be used.
• It is also possible that the user can write a query you had not tuned for.
Data load is a critical part of overnight processing. Nothing else can run until data load is
complete. This is the entry point into the system.
There are various approaches of tuning data load that are discussed below −
• The very common approach is to insert data using the SQL Layer. In this approach,
normal checks and constraints need to be performed. When the data is inserted into
the table, the code will run to check for enough space to insert the data. If sufficient
space is not available, then more space may have to be allocated to these tables.
These checks take time to perform and are costly to CPU.
• The second approach is to bypass all these checks and constraints and place the data
directly into the preformatted blocks. These blocks are later written to the database. It
is faster than the first approach, but it can work only with whole blocks of data. This
can lead to some space wastage.
• The third approach is that while loading the data into a table that already contains
data, we can maintain the indexes.
• The fourth approach says that to load the data in tables that already contain data,
drop the indexes & recreate them when the data load is complete. The choice
between the third and the fourth approach depends on how much data is already
loaded and how many indexes need to be rebuilt.
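A sketch of the fourth approach using SQLite: drop the index, bulk-load the rows, then rebuild the index once at the end. The table, index, and generated rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

new_rows = [(i, "North" if i % 2 else "South", float(i)) for i in range(10_000)]

# Fourth approach: drop the index, load in bulk, rebuild the index once at the end.
conn.execute("DROP INDEX idx_sales_region")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.commit()
```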
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the points to
remember −
• Integrity checks need to be limited because they require heavy processing power.
Tuning Queries
• Fixed queries
• Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
• regular reports
• Canned queries
• Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database system. The
only difference is that the amount of data to be queried may be different. It is good to store
the most successful execution plan while testing fixed queries. Storing these execution plans
will allow us to spot changes in data size and data skew, as these will cause the execution plan
to change.
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse.
For each user or group of users, you need to know the following −
• It is important to track the user's profiles and identify the queries that are run on a
regular basis.
• It is also important that the tuning performed does not affect the performance.
• If these queries are identified, then the database will change and new indexes can be
added for those queries.
• If these queries are identified, then new aggregations can be created specifically for
those queries that would result in their efficient execution.
Testing
Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing performed on a data warehouse −
• Unit testing
• Integration testing
• System testing
Unit Testing
• Each module, i.e., procedure, program, SQL Script, Unix shell is tested.
Integration Testing
In integration testing, the various modules of the application are brought together and
then tested against the number of inputs.
System Testing
• The purpose of system testing is to check whether the entire system works correctly
together or not.
• Since the size of the whole data warehouse is very large, it is usually possible to
perform minimal system testing before the test plan can be enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In
this schedule, we predict the estimated time required for the testing of the entire data
warehouse system.
There are different methodologies available to create a test schedule, but none of them are
perfect because the data warehouse is very complex and large. Also the data warehouse
system is evolving in nature. One may face the following issues while creating a test
schedule −
• A simple problem may involve a very large query that can take a day or more to
complete.
• There may be hardware failures such as losing a disk or human errors such as
accidentally deleting a table or overwriting a large table.
Testing the backup recovery strategy is extremely important. Here is the list of scenarios for
which this testing is needed −
• Media failure
• Instance failure
There are a number of aspects that need to be tested. These aspects are listed below.
• Event manager
• System manager
• Database manager
• Configuration manager
5.Describe and brief about various schemas in multidimensional data model. (AU-
N/D2023)
• Star Schema
• Snowflakes Schema
1. Star Schema
It is the data warehouse schema that contains two types of tables: a fact table and
dimension tables. The fact table lies at the center and the dimension tables are
connected to it such that a star shape is formed.
→ Fact Tables: A fact table typically has two types of columns: foreign keys to dimension
tables and measures that contain numeric facts. A fact table can contain fact data on detail or
aggregated level.
→ Dimension Tables: Dimension tables usually have a relatively small number of records
compared to fact tables, but each record may have a very large number of attributes to describe
the fact data. Each dimension in the star schema has only one dimension table and each table
holds a set of attributes. This constraint may cause data redundancy. The following diagram
shows the sales data of a company with respect to the four dimensions, namely time, item,
branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Since the star schema contains de-normalized dimension tables, it leads to simpler queries
due to a smaller number of join operations, and it also leads to better system performance.
On the other hand, it is difficult to maintain the integrity of data in the star schema due to
the de-normalized tables. It is a widely used data warehouse schema and is also
recommended by Oracle.
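A minimal sketch of the star schema described above, with one fact table keyed to the four dimensions; the column names are illustrative rather than taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, month TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table at the centre: foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
```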
2. Snowflake Schema
The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
For example, the item dimension table in a star schema is normalized and split into two
dimension tables, namely item and supplier table.
Due to normalization, the tables are easier to maintain and storage space is saved. However, this
saving of space is negligible in comparison to the typical magnitude of the fact table.
Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not as popular as the
star schema in data warehouse design.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
A fact constellation schema allows dimension tables to be shared between fact tables.
6.Discuss the various types of database parallelism with suitable example. (AU-N/D2023)
Parallelism is used to support speedup, where queries are executed faster because more
resources, such as processors and disks, are provided. Parallelism is also used to provide scale-
up, where increasing workloads are managed without increasing response time, via an increase
in the degree of parallelism.
(a)Horizontal Parallelism: It means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task (i.e., table scan) that is performed concurrently
on different processors against different sets of data.
(b) Vertical Parallelism: It occurs among various tasks. All component query operations (i.e.,
scan, join, and sort) are executed in parallel in a pipelined fashion. In other words, the output
of one operation (e.g., join) is consumed by the next operation as soon as records become available.
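Horizontal parallelism can be imitated with Python's multiprocessing module: the same scan runs concurrently against different partitions of the data. The partitions here are just in-memory lists standing in for data spread across disks or nodes.

```python
from multiprocessing import Pool

# Four "partitions" of a sales table, as if stored on different disks/nodes.
partitions = [
    [("North", 10.0), ("North", 25.0)],
    [("South", 40.0), ("South", 5.0)],
    [("East", 12.5)],
    [("West", 30.0), ("West", 7.5)],
]

def scan_partition(rows):
    # The same task (a table scan with a local aggregate) runs on each partition.
    return sum(amount for _, amount in rows)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        partial_sums = pool.map(scan_partition, partitions)
    print("total sales:", sum(partial_sums))  # combine the partial results
```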
Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple processors
and disks. Using intraquery parallelism is essential for speeding up long-running queries.
Interquery parallelism does not help in this function since each query is run sequentially. To
improve the situation, many DBMS vendors developed versions of their products that utilized
intraquery parallelism.
This application of parallelism decomposes the serial SQL query into lower-level operations
such as scan, join, sort, and aggregation.
Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another.
This form of parallelism can increase transaction throughput. The response times of
individual transactions are not faster than they would be if the transactions were run in
isolation.
Thus, the primary use of interquery parallelism is to scale up a transaction processing system
to support a more significant number of transactions per second.
This approach naturally resulted in interquery parallelism, in which different server threads (or
processes) handle multiple requests at the same time.
Interquery parallelism has been successfully implemented on SMP systems, where it increased
the throughput and allowed the support of more concurrent users.
DLM components can be found in hardware, the operating system, and separate software layer,
all depending on the system vendor.
On the positive side, shared-disk architectures can reduce performance bottlenecks resulting
from data skew (uneven distribution of data), and can significantly increase system
availability.
The shared-disk distributed memory design eliminates the memory access bottleneck typical
of large SMP systems and helps reduce DBMS dependency on data partitioning.
Shared-Memory Architecture
Shared-memory or shared-everything style is the traditional approach of implementing an
RDBMS on SMP hardware.
It is relatively simple to implement and has been very successful up to the point where it runs
into the scalability limitations of the shared-everything architecture.
The key point of this technique is that a single RDBMS server can potentially use all
processors, access all memory, and access the entire database, thus providing the client with a
consistent single system image.
In shared-memory SMP systems, the DBMS considers that the multiple database components
executing SQL statements communicate with each other by exchanging messages and
information via the shared memory.
All processors have access to all data, which is partitioned across local disks.
Shared-Nothing Architecture
In a shared-nothing distributed memory environment, the data is partitioned across all disks,
and the DBMS is "partitioned" across multiple co-servers, each of which resides on individual
nodes of the parallel system and has an ownership of its disk and thus its database partition.
Each processor has its memory and disk and communicates with other processors by
exchanging messages and data over the interconnection network. This architecture is
optimized specifically for the MPP and cluster systems.
The shared-nothing architectures offer near-linear scalability. The number of processor nodes
is limited only by the hardware platform limitations (and budgetary constraints), and each node
itself can be a powerful SMP system.
7.Discuss the various access types to data stored in data warehouse and consider the data
types of online shopping .(AU-N/D2023)
Data warehouses are designed to support decision-making processes by enabling fast and
efficient access to structured data. The main access types include:
1. Batch Access
3. Real-Time Access
Data can be accessed via APIs for integration with apps or external tools.
Supports automated workflows and data-driven apps.
Online shopping platforms generate and utilize a variety of data types to offer personalized
and efficient experiences:
Data type | Examples | Typical use
Transaction Data | Orders, payment methods, invoices | Purchase history, financial reports
Social Data | Shares, likes, referral codes | Social proof, targeted promotions
8. Explain how using a familiar approach helps reduce training costs associated with building a query and reporting environment, with a suitable example. (AU-N/D2023)
Instead of introducing a complex BI platform from scratch, use tools that most
employees already understand:
Example: A retail company needs sales reporting across regions. Instead of training
everyone on Power BI or Tableau, they create Excel templates with built-in queries to
fetch sales data. Regional managers simply refresh the data and use pre-formatted pivot
tables. This cuts training time dramatically, as they already understand Excel.
🔹 SQL-Based Tools
If your staff knows SQL, opt for platforms that support SQL queries directly.
Tools like SQL Server Reporting Services (SSRS) or Looker allow custom
SQL queries with minimal additional training.
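To make this concrete, below is a hypothetical parameterized sales query of the kind that could sit behind an Excel template, an SSRS report, or a Looker view; the table and column names are assumptions, and SQLite is used only so the example is self-contained and runnable.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", "2024-05-02", 120.0),
    ("North", "2024-05-15", 80.0),
    ("South", "2024-05-09", 200.0),
])

# The report query is written once; only the region parameter changes per user.
REPORT_SQL = """
    SELECT region,
           strftime('%Y-%m', sale_date) AS month,
           SUM(amount)                  AS total_sales
    FROM sales
    WHERE region = ?
    GROUP BY region, month
    ORDER BY month
"""

for row in con.execute(REPORT_SQL, ("North",)):
    print(row)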
9. Propose a fact constellation schema for a health care data warehouse to support
complex analytical queries.
10. Explain how the snowflake schema differs from the star schema in terms of
normalization.
Unit-5
PART-A
1. Name two types of data warehousing system managers and their roles. (AU-
A/M2024)
Data Warehouse Administrator: manages the day-to-day operation, performance, and security of the data warehouse.
ETL Manager: oversees the extraction, transformation, and loading of data into the warehouse.
2. What is the function of load manager in data warehousing? (AU-A/M2024)
The load manager—sometimes called the data load subsystem—is a crucial component
of data warehousing. It serves as the gatekeeper for all data entering the warehouse.
PART-B
1. Discuss the role of system and process managers in a data warehousing environment.
(AU-A/M2024)
System Managers
• The system configuration manager is responsible for the management of the setup and
configuration of the data warehouse.
• The structure of the configuration manager varies from one operating system to another.
• In Unix, the structure of the configuration manager varies from vendor to vendor.
• The configuration manager has a single user interface.
• The interface of the configuration manager allows us to control all aspects of the system.
System Scheduling Manager is responsible for the successful implementation of the data
warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own
scheduler with some form of batch control mechanism. The list of features a system
scheduling manager must have is as follows −
• Work across cluster or MPP boundaries
• Deal with international time differences
• Handle job failure
• Handle multiple queries
• Support job priorities
• Restart or re-queue the failed jobs
• Notify the user or a process when job is completed
• Maintain the job schedules across system outages
• Re-queue jobs to other queues
• Support the stopping and starting of queues
• Log Queued jobs
• Deal with inter-queue processing
Note − The above list can be used as parameters for evaluating a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows −
• Backup
• Aggregation creation
• Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the system
scheduling manager must be capable of running across the architecture.
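The toy sketch below (my own illustration, not vendor scheduling software) shows how a few of the listed features, namely job priorities, failure handling, and re-queuing of failed jobs, might be expressed in code; the job names and the failure are simulated.

import heapq

def run_scheduler(jobs, max_retries=2):
    # jobs: list of (priority, name, callable); a lower number means higher priority.
    queue = list(jobs)
    heapq.heapify(queue)
    retries = {}
    while queue:
        priority, name, task = heapq.heappop(queue)
        try:
            task()
            print(f"completed: {name}")
        except Exception as err:
            attempts = retries.get(name, 0) + 1
            retries[name] = attempts
            print(f"failed: {name} ({err}); attempt {attempts}")
            if attempts <= max_retries:
                # Re-queue the failed job at a lower priority instead of dropping it.
                heapq.heappush(queue, (priority + 10, name, task))

def backup():
    print("running backup")

def create_aggregates():
    raise RuntimeError("disk full")   # simulated failure to exercise re-queuing

def transform_data():
    print("running data transformation")

run_scheduler([(1, "backup", backup),
               (2, "aggregation creation", create_aggregates),
               (3, "data transformation", transform_data)])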
The event manager is a kind of software. It manages the events that are defined on the data
warehouse system. We cannot manage the data warehouse manually because the structure of the
data warehouse is very complex. Therefore, we need a tool that handles all the events
automatically, without any intervention from the user.
Note − The event manager monitors event occurrences and deals with them. It also tracks the
myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that
an event is a measurable, observable occurrence of a defined action.
Given below is a list of common events that are required to be tracked.
• Hardware failure
• Running out of space on certain key disks
• A process dying
• A process returning an error
• CPU usage exceeding an 80% threshold
• Internal contention on database serialization points
• Buffer cache hit ratios exceeding or falling below the threshold
• A table reaching the maximum of its size
• Excessive memory swapping
• A table failing to extend due to lack of space
• Disk exhibiting I/O bottlenecks
• Usage of temporary or sort areas reaching certain thresholds
• Any other database shared memory usage
The most important thing about events is that they should be capable of executing on their
own. Event packages define the procedures for the predefined events. The code associated
with each event is known as an event handler; this code is executed whenever an event occurs
(a short sketch follows).
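A bare-bones illustration (my own sketch; the event names and handler actions are assumptions) of how predefined events can be bound to event handlers that run automatically when an event is raised:

EVENT_HANDLERS = {}

def on_event(name):
    # Decorator that registers a handler for a named, predefined event.
    def register(func):
        EVENT_HANDLERS[name] = func
        return func
    return register

def raise_event(name, **details):
    handler = EVENT_HANDLERS.get(name)
    if handler is None:
        print(f"unhandled event: {name} {details}")
    else:
        handler(**details)          # the event handler executes automatically

@on_event("disk_space_low")
def handle_disk_space_low(disk, free_gb):
    # In a real system this might purge temporary areas or notify an operator.
    print(f"ALERT: only {free_gb} GB free on {disk}; archiving old partitions")

@on_event("process_died")
def handle_process_died(process):
    print(f"restarting failed process: {process}")

raise_event("disk_space_low", disk="/dw/data01", free_gb=8)
raise_event("process_died", process="etl_loader")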
The system manager and the database manager may be two separate pieces of software, but they do
the same job.
The objective of these tools is to automate certain processes and to simplify the execution of
others. The criteria for choosing a system and the database manager are as follows −
• Increase the user's quota
• Assign and de-assign roles to the users
• Assign and de-assign profiles to the users
• Perform database space management
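For illustration only, the snippet below lists the kind of administrative SQL such a tool might issue for these tasks; the user, role, tablespace, and datafile names are invented, and the syntax follows Oracle-style conventions rather than any specific product used in the notes.

# Each statement corresponds to one of the management tasks listed above.
ADMIN_STATEMENTS = [
    "ALTER USER report_user QUOTA 500M ON users_ts",      # increase a user's quota
    "GRANT analyst_role TO report_user",                   # assign a role
    "REVOKE analyst_role FROM temp_user",                  # de-assign a role
    "ALTER DATABASE DATAFILE 'dw01.dbf' RESIZE 20G",       # database space management
]
for statement in ADMIN_STATEMENTS:
    print(statement)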
The backup and recovery tool makes it easy for operations and management staff to back
up the data. Note that the system backup manager must be integrated with the schedule
manager software being used. The important features required for the management
of backups are as follows −
• Scheduling
• Backup data tracking
• Database awareness
Backups are taken only to protect against data loss. Following are the important points to
remember −
• The backup software will keep some form of database of where and when the piece of
data was backed up.
• The backup recovery manager must have a good front-end to that database.
• The backup recovery software should be database aware.
• Being aware of the database, the software can then be addressed in database terms and
will not perform backups that would not be viable.
Data Warehousing - Process Managers
Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers −
• Load manager
• Warehouse manager
• Query manager
The load manager performs the operations required to extract and load the data into the
database. The size and complexity of a load manager vary between specific solutions, from
one data warehouse to another.
Fast Load
• In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
• Transformations affect the speed of data processing.
• It is more effective to load the data into a relational database prior to applying
transformations and checks.
• Gateway technology is not suitable, since gateways are inefficient when large data volumes
are involved.
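A compact sketch of the "load first, transform afterwards" idea described above (my own illustration; SQLite stands in for the warehouse and the table names are assumed): raw rows are bulk-inserted into a staging table with no checks on the load path, and the transformation runs inside the database afterwards.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage_sales (region TEXT, amount TEXT, extra TEXT)")

# Bulk insert with no transformation work inside the load window (fast load).
raw_rows = [("north", "120.50", "x"), ("south", "200.00", "y")]
con.executemany("INSERT INTO stage_sales VALUES (?, ?, ?)", raw_rows)

# Transformations and checks run afterwards as a set-based step inside the database.
con.execute("""
    CREATE TABLE fact_sales AS
    SELECT UPPER(region)        AS region,
           CAST(amount AS REAL) AS amount
    FROM stage_sales
""")
print(con.execute("SELECT * FROM fact_sales").fetchall())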
Simple Transformations
While loading, it may be required to perform simple transformations; after completing these,
we can do complex checks. Suppose we are loading the EPOS sales transactions; we need to
perform the following checks (a toy sketch follows the list):
• Strip out all the columns that are not required within the warehouse.
• Convert all the values to required data types.
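As a toy illustration of the two checks just listed (the field names of the EPOS record are assumptions):

from datetime import datetime

REQUIRED = ("store_id", "sale_time", "amount")   # columns the warehouse actually needs

def transform(record):
    # 1. Strip out all columns that are not required within the warehouse.
    kept = {key: record[key] for key in REQUIRED}
    # 2. Convert all values to the required data types.
    kept["store_id"] = int(kept["store_id"])
    kept["sale_time"] = datetime.strptime(kept["sale_time"], "%Y-%m-%d %H:%M")
    kept["amount"] = float(kept["amount"])
    return kept

raw = {"store_id": "042", "sale_time": "2024-05-09 14:31",
       "amount": "19.99", "till_operator": "A17", "promo_flag": "N"}
print(transform(raw))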
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists
of third-party system software, C programs, and shell scripts. The size and complexity of
a warehouse manager vary between specific solutions.
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing
the queries to appropriate tables, it speeds up the query request and response process. In
addition, the query manager is responsible for scheduling the execution of the queries
posted by the user.
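A simplified sketch of what "directing queries to suitable tables" can mean in practice (my own illustration; the table names and routing rule are assumptions): summary-level requests are routed to a pre-built aggregate table, while detail requests fall back to the base fact table.

# (fact table, requested grain) -> aggregate table that already holds that grain
AGGREGATE_TABLES = {
    ("sales_fact", "monthly"): "sales_monthly_agg",
}

def route(fact_table, grain):
    # Fall back to the detailed fact table when no suitable aggregate exists.
    return AGGREGATE_TABLES.get((fact_table, grain), fact_table)

def build_query(fact_table, grain, region):
    table = route(fact_table, grain)
    # Real code would use bind parameters; string formatting is for illustration only.
    return (f"SELECT region, SUM(amount) FROM {table} "
            f"WHERE region = '{region}' GROUP BY region")

print(build_query("sales_fact", "monthly", "South"))   # routed to sales_monthly_agg
print(build_query("sales_fact", "daily", "South"))     # falls back to sales_fact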
2. Outline a key strategy for ensuring a data warehousing solution remains scalable as a
tech startup grows. Highlight one specific technology or approach that could be utilized.
(AU-A/M2024)
To ensure your data warehousing solution scales seamlessly with startup growth, design for
elasticity from day one. That means building a modular, cloud-native architecture that can
expand (or contract) as data volumes, users, and analytical demands evolve. Focus on
decoupling storage and compute so you can scale each independently—this ensures you're
not locked into a rigid infrastructure that becomes a bottleneck as you grow.
A standout in this space is Snowflake, a cloud-native data platform built precisely for
scalability and flexibility:
Decoupled Storage & Compute: Snowflake allows you to scale compute resources up
or down without impacting your storage, or vice versa.
Multi-Cluster Warehouses: As workloads increase, Snowflake can automatically spin
up additional compute clusters to handle the demand without user intervention.
Zero Management: No infrastructure to manage—updates, scaling, and availability
are handled automatically.
Data Sharing & Marketplace: Easily share data across teams or even externally with
partners, reducing data duplication and silos.
Bonus: Snowflake integrates well with major cloud providers (AWS, Azure, GCP), giving you
flexibility and avoiding vendor lock-in.
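To make this concrete, the sketch below lists the kind of Snowflake SQL used to resize a virtual warehouse and enable multi-cluster scaling; the warehouse name and sizes are placeholders, the statements would be run in Snowflake itself rather than by this script, and multi-cluster warehouses depend on the Snowflake edition in use.

# Placeholder warehouse name; these are ordinary SQL statements in Snowflake.
SCALING_STATEMENTS = [
    # Scale compute up for a heavy reporting window; storage is untouched.
    "ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'",
    # Let additional clusters spin up automatically as concurrent users grow.
    "ALTER WAREHOUSE reporting_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4",
    # Scale back down (or rely on auto-suspend) when demand drops.
    "ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL'",
]
for statement in SCALING_STATEMENTS:
    print(statement)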
3. Describe how modernizing the data warehouse architecture could improve data
analytics and decision-making for the global retail company. Focus on one major
limitation of the old system and how a modern solution addresses it. (AU-A/M2024)
4. Discuss in detail the data warehouse data storage and accessibility for the
Government of Tamil Nadu for any two schemes in practice. (AU-N/D2023)
Tamil Nadu has adopted a centralized and federated data storage model under its Tamil
Nadu Data Policy 2022. The State Data Warehouse integrates datasets from various
departments, allowing real-time access, analytics, and monitoring of scheme performance.
Beneficiary Data: Collected from school and college databases, Aadhaar-linked for
verification.
Integration: Linked with the Education Department’s MIS and Treasury systems
for fund disbursement.
Accessibility:
o Officials access dashboards for real-time monitoring of enrollment and
payments.
o Students can track their application status via public portals.
Security: Data is encrypted and stored in state-run data centers under the Tamil
Nadu Data Centre Policy 2021.
Impact
Health Records: Digitized and stored in the Health Department’s data warehouse.
Hospital Integration: Linked with hospital management systems for claim
processing.
Accessibility:
o Government officials use analytics dashboards to monitor hospital
performance and claim volumes.
o Beneficiaries can access scheme details via mobile apps and e-Sevai centers.
Privacy: Complies with data classification and privacy norms under TNDP.
Impact
Component         | Description
Data Warehouse    | Central repository for structured data from multiple departments
Analytics Engines | Tools for trend analysis, forecasting, and performance evaluation
5. Explain the relationship between the load manager and the warehouse manager in the data
warehousing process. (Refer Question No. 1)
6. Explain the various data warehousing process managers and their collaboration and
interactions to ensure the operation and performance of the entire data warehousing
system. (Refer Q. No. 1)
7. Justify the statement "the testing process contributes to the overall quality of a data
warehousing system" and describe the types of tests conducted on the system.
Testing isn’t just a checkbox in the development lifecycle of a data warehouse—it’s a core
pillar that ensures:
Data Accuracy: Detects errors in data extraction, transformation, and loading (ETL)
processes.
System Reliability: Verifies that the warehouse performs consistently under varying
loads and queries.
Performance Optimization: Identifies bottlenecks in query performance or loading
operations.
Security Assurance: Ensures sensitive data is protected and access is appropriately
controlled.
User Satisfaction: Validates that the reports, dashboards, and analytics behave as
expected.
In short, without proper testing, a data warehouse might look fine from the outside—but could
collapse under real-world use. Quality testing helps guarantee that the system delivers
trustworthy data for critical decision-making.
Here are the key testing types used across the lifecycle of a data warehouse:
Test Type               | Purpose
Integration Testing     | Ensures that various components (e.g., data sources, ETL tools, the warehouse) interact properly.
Data Validation Testing | Confirms data quality, integrity, and accuracy post-load; checks for duplicates, missing fields, or mismatches.
Security Testing        | Tests data encryption, access controls, and auditing features.
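A small example of what an automated data validation test could look like once a load has finished (my own sketch; the table, columns, and rules are illustrative): the checks count duplicate keys and missing mandatory fields, and a real test harness would fail the load if either count is non-zero.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer_dim (customer_id INTEGER, name TEXT, city TEXT)")
con.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)", [
    (1, "Asha", "Chennai"),
    (2, "Ravi", None),        # missing mandatory field
    (2, "Ravi", "Madurai"),   # duplicate key
])

# Check 1: duplicate business keys after the load.
dup_count = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT customer_id FROM customer_dim
        GROUP BY customer_id HAVING COUNT(*) > 1)
""").fetchone()[0]

# Check 2: mandatory fields that arrived empty.
null_count = con.execute(
    "SELECT COUNT(*) FROM customer_dim WHERE city IS NULL").fetchone()[0]

print("duplicate customer_id values:", dup_count)
print("rows with missing city:", null_count)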