
SRRCET/AI&DS/III-YEAR/V-SEM/CCS341/DATA WAREHOUSE - Mrs. D. SUJATHA, HoD/AI&DS

Unit-1

PART-A

1. Define a Data warehouse. What are its key components? (AU-A/M2024)


"Data Warehouse is a subject-oriented, integrated, and time-variant store of information
in support of management's decisions."
Key components: Data source, data staging, ELT/ETL Process, data warehouse
storage, meta data , OLAP ENGINE.
2. Compare and contrast an operational database with data warehouse. (AU-
A/M2024)

Feature    | Database                                    | Data Warehouse
Purpose    | Designed for transactional processing       | Designed for analytical processing
Data Type  | Current, real-time data                     | Historical, aggregated data
Operations | Frequent inserts, updates, and deletes      | Mostly read-only, query-heavy access
Design     | Optimized for fast insert/update            | Optimized for complex queries and reports
Users      | Operational staff (e.g., customer service)  | Business analysts, decision-makers

3. List out the benefits of Data Warehouse. (AU-N/D2023)

i) Helps understand business trends and make better forecasting decisions.

ii) Data warehouses are designed to perform well with enormous amounts of data.

iii) The structure of data warehouses makes it easier for end-users to navigate, understand, and query the data.

iv) Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.

4. List out the views in the design of the data warehouse. (AU-N/D2023)

 Top-down view: allows the selection of the relevant information necessary for the data warehouse.

 Data source view: exposes the information being captured, stored, and managed by operational systems.

 Data warehouse view: includes the fact tables and dimension tables.

 Business query view: the data in the warehouse as seen from the end-user's viewpoint.

5. What are the different steps of knowledge discovery in databases (KDD)? (AU-N/D2024)

 Data Selection
 Data Cleaning and Preprocessing
 Data Transformation and Reduction
 Data Mining
 Evaluation and Interpretation of Results

6. How is data warehouse different from database? Identify the similarity. (AU-
N/D2024)

Feature    | Database                                    | Data Warehouse
Purpose    | Designed for transactional processing       | Designed for analytical processing
Data Type  | Current, real-time data                     | Historical, aggregated data
Operations | Frequent inserts, updates, and deletes      | Mostly read-only, query-heavy access
Design     | Optimized for fast insert/update            | Optimized for complex queries and reports
Users      | Operational staff (e.g., customer service)  | Business analysts, decision-makers

Similarity

 Data Storage: Both store structured data in tables using rows and columns.
 SQL Use: Both use SQL (Structured Query Language) for data operations.
 Relational Model: Both can be based on a relational model, allowing relationships
between different data tables.
 Data Integrity: Both emphasize accuracy and consistency of stored data.

7. Define top-down approach of data warehouse.


In the top-down approach, we build a centralized repository to house corporate-wide
business data. This repository is called the Enterprise Data Warehouse (EDW). The data
in the EDW is stored in a normalized form in order to avoid redundancy.

8. Define bottom-up approach of data warehouse.


The bottom-up approach is an incremental approach to building a data warehouse.
We build the data marts separately at different points in time, as and when the
requirements of a specific subject area become clear. The data marts are then
integrated to form the data warehouse.

9. Define conformed dimension.


A conformed dimension has consistent dimension keys, consistent attribute
names, and consistent values across separate data marts. A conformed
dimension means exactly the same thing when joined to every fact table.

10. List out the nine-step method followed in the design of a data warehouse.

1. Choosing the subject matter

2. Deciding what a fact table represents

3. Identifying and conforming the dimensions

4. Choosing the facts

5. Storing pre-calculations in the fact table

6. Rounding out the dimension table

7. Choosing the duration of the database

8. The need to track slowly changing dimensions

9. Deciding the query priorities and query models


PART-B
1. Explain three tier data warehouse architecture and its significance. (AU-
A/M2024)

Data warehousing is essential for businesses looking to make informed decisions based on
large amounts of information. The architecture of a data warehouse is key to its effectiveness,
influencing how easily data can be accessed and used. The Three/Multi-Tier Data Warehouse
Architecture is widely adopted due to its clear and organized framework. This architecture
divides data handling into three main layers:
 Bottom Tier (Data Sources and Data Storage)
 Middle Tier (OLAP Engine)
 Top Tier (Front-End Tools)

Bottom Tier
The Bottom Tier of the Three-Tier Data Warehouse Architecture is fundamental, as
it serves as the foundation where data is initially collected and stored. This tier is typically
structured around a warehouse database server, commonly an RDBMS (Relational Database
Management System), which houses the data extracted from various operational and external
sources. The core activities in this tier involve the extraction, cleaning, transformation, and
loading of data, collectively known as the ETL process.
ETL, standing for Extract, Transform, and Load, is vital in the data warehousing process. In
the Bottom Tier, data undergoes transformation to align with business logic and analysis
needs before being loaded into a more query-friendly structure. This process ensures that data
is clean, consistent, and optimized for quick retrieval, which is essential for effective data
analysis. Several tools are used to perform the ETL process, such as IBM InfoSphere,
Informatica, Confluent, Microsoft SSIS, SnapLogic, and Alooma.
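
To make the ETL flow above concrete, the following is a minimal, hypothetical Python sketch (the file name, table name, and cleaning rules are assumptions for illustration, not tied to any specific tool): it extracts rows from an operational CSV export, applies simple cleaning and standardization, and loads the result into a staging table.

```python
import csv
import sqlite3

# Extract: read raw rows from an operational export (hypothetical file name).
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: clean and standardize rows to match the warehouse's business rules.
def transform(rows):
    cleaned = []
    for row in rows:
        if not row["order_id"]:                          # drop records missing a key
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().upper(),     # standardize region codes
            "amount": round(float(row["amount"]), 2),    # normalize currency values
        })
    return cleaned

# Load: write the transformed rows into a warehouse staging table.
def load(rows, db_path="warehouse.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_sales
                    (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)""")
    conn.executemany("INSERT OR REPLACE INTO stg_sales VALUES (:order_id, :region, :amount)",
                     rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales.csv")))
```
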

Middle Tier:
The Middle Tier in the three-tier architecture of a data warehouse is where
the OLAP (Online Analytical Processing) server resides. This tier acts as the critical
processing layer that manages and enables complex analytical queries on data stored in the
bottom tier. It functions as a mediator between the data repository (bottom tier) and the end-
user interface (top tier), ensuring that data is efficiently processed and made ready for
analysis and reporting.
OLAP is a powerful technology for complex calculations, trend analysis, and data
modeling. It is designed for high-speed analytical processing. OLAP server models come in
three different categories, including:
 ROLAP (Relational OLAP): This model uses a relational database to store and
manage warehouse data. It is ideal for handling large data volumes as it operates
directly on relational databases.
 MOLAP (Multidimensional OLAP): This model stores data in a
multidimensional cube. The storage and retrieval processes are highly efficient,
making MOLAP suitable for complex analytical queries that require aggregation.
 HOLAP (Hybrid OLAP): It is a combination of the relational and multidimensional
online analytical processing paradigms. HOLAP is the ideal option for a seamless
functional flow across the database systems when the repository houses both the
relational database management system and the multidimensional database
management system.

Top Tier
The Top Tier in the Three-Tier Data Warehouse Architecture comprises the front-end
client layer, which is essential for interacting with the data stored and processed in the lower
tiers. This layer includes a variety of business intelligence (BI) tools and techniques designed
to facilitate easy access and manipulation of data for reporting, analysis, and decision-
making.
BI tools are critical components of the Top Tier, providing robust platforms through which
users can query, report, and analyze data. Popular BI tools include:
 IBM Cognos: Offers comprehensive reporting capabilities.


 Microsoft BI Platform: Integrates well with existing Microsoft products, providing a familiar interface for users.
 SAP BW: Specializes in managing large datasets and integrating with other SAP
products.
 Crystal Reports: Known for its powerful reporting features.
 SAS Business Intelligence: Provides advanced analytics.
 Pentaho: A versatile tool for data integration and visualization.
The Top Tier is crucial for decision-making as it provides the interface through which
insights are accessed and explored. By presenting data in visual formats such as graphs,
charts, and dashboards, these tools allow decision-makers to quickly grasp complex patterns,
trends, and anomalies, leading to faster and more effective decision-making.

2. Compare and contrast autonomous data warehouses with Snowflake data warehouse solutions. (AU-A/M2024)

3. Describe the technologies used to improve performance in a data warehouse environment. Also mention a few alternate technologies. (AU-N/D2023)
Core Technologies for Data Warehouse Performance

These technologies are often built into or layered on top of modern data
warehouses to improve speed, scalability, and reliability.


1. Massively Parallel Processing (MPP)

 Distributes data and workload across multiple processors/nodes.


 Enables simultaneous query execution for faster performance.
 Common in platforms like Amazon Redshift, Snowflake, and Azure Synapse.

2. Columnar Storage

 Stores data column-by-column instead of row-by-row.


 Improves I/O efficiency and compression for analytical workloads.
 Widely used in tools like Google BigQuery and Vertica.

3. In-Memory Computing

 Keeps frequently accessed data in RAM for ultra-fast query response.


 Reduces reliance on disk I/O.
 SAP HANA and Apache Ignite are good examples.

4. Data Partitioning

 Divides large tables into smaller pieces (e.g., by date or region).


 Reduces the amount of data scanned during queries.
 Improves indexing and manageability.

5. Indexing and Materialized Views

 Indexes speed up data retrieval by reducing full-table scans.


 Materialized views store precomputed query results for reuse.
 Often found in traditional relational warehouses like Oracle or SQL Server.
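
The following is a small sketch of these two ideas, assuming invented table and column names; since SQLite has no native materialized views, the "materialized view" is emulated here as a precomputed summary table that reports read instead of scanning the detail table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, sale_date TEXT, amount REAL);

-- Index: speeds up filters on region without a full-table scan.
CREATE INDEX idx_sales_region ON sales(region);

-- "Materialized view": a precomputed summary table refreshed periodically,
-- so dashboards reuse the stored aggregate instead of re-aggregating detail rows.
CREATE TABLE mv_sales_by_region AS
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM sales GROUP BY region;
""")

# Reports read the small summary table rather than the large detail table.
for row in conn.execute("SELECT region, total_amount FROM mv_sales_by_region"):
    print(row)
```
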

6. Data Caching

 Stores query results or commonly used datasets temporarily.


 Prevents repeated computations.
 Improves latency in dashboards and repeated analytics.
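
A minimal sketch of application-level result caching, using an invented query function; repeated calls with the same argument are answered from memory instead of re-running the query.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("NORTH", 120.0), ("SOUTH", 80.0), ("NORTH", 40.0)])

@lru_cache(maxsize=128)          # cache query results keyed by the region argument
def total_sales(region):
    cur = conn.execute("SELECT SUM(amount) FROM sales WHERE region = ?", (region,))
    return cur.fetchone()[0]

print(total_sales("NORTH"))      # computed once, hits the database
print(total_sales("NORTH"))      # served from the cache on repeat calls
```
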


7. ETL Optimization Tools

 Tools like Apache NiFi, Talend, and Informatica streamline data transformation.
 Reduce the time needed to load and prep data for analysis.

Alternate Technologies

These may serve as alternatives or complements to conventional data warehousing:

• Data Lakes

 Store raw, unstructured, and structured data.


 Typically built on cloud platforms like AWS S3 or Azure Data Lake.
 More flexible but require processing engines for querying (e.g., Presto, Hive).

• Lakehouses (e.g., Databricks Delta Lake)

 Combine features of data warehouses and lakes.


 Support ACID transactions and schema enforcement.
 Ideal for real-time analytics and machine learning.

• Cloud-Native Analytics Platforms

 Serverless architecture (e.g., Google BigQuery).


 Automatically scale resources based on demand.
 Ideal for dynamic workloads.

• Streaming Data Platforms

 Tools like Apache Kafka or Amazon Kinesis.


 Process real-time events and ingest continuous data.
 Complement traditional batch-based data warehouses.

4. With a neat sketch explain the steps for design and construction of data
warehouses and explain the three tier architecture. (refer question 1)
5. Describe the three layers of snowflake architecture.

Snowflake's Architecture
Snowflake’s architecture mainly consists of three layers.


Storage Layer
The Storage layer in snowflake architecture is responsible for managing and storing data in
an effective manner. The functionalities that were supported by the storage layer are:
 Elasticity: Snowflake's storage layer is elastic, allowing organizations to scale
their storage needs independently of compute resources. It ensures that varying
data volumes can be handled without affecting performance.
 Cloud Based Object Storage: Snowflake uses cloud based object storage to store
data. This separation of storage and compute allows for cost-effective and scalable
data storage.
 Data Clustering: Snowflake organizes data into micro partitions within the
object storage, and these micro partitions are clustered based on metadata. This
clustering enhances query performance by minimizing the amount of data that
needs to be scanned.
 Zero Copy Cloning: Snowflake enables efficient data cloning through zero-copy
cloning technology. This feature allows users to create a copy of a dataset instantly
without duplicating the actual data, saving both time and storage costs.
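
As an illustration only, the snippet below shows the SQL statement with which zero-copy cloning is typically invoked in Snowflake; the connection parameters and table names are placeholders, and it assumes the snowflake-connector-python package is installed and valid credentials are supplied.

```python
import snowflake.connector  # assumes: pip install snowflake-connector-python

# Placeholder credentials -- replace with real account details before running.
conn = snowflake.connector.connect(
    user="USER", password="PASSWORD", account="ACCOUNT",
    warehouse="COMPUTE_WH", database="DEMO_DB", schema="PUBLIC",
)

# Zero-copy clone: creates a new table that shares the original micro-partitions,
# so no data is physically duplicated until either table is later modified.
conn.cursor().execute("CREATE TABLE orders_dev CLONE orders")
conn.close()
```
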
Query Processing Layer
The SQL query execution is handled by Snowflake's Query Processing Layer,
which dynamically optimizes and parallelizes queries over several compute clusters. It
ensures great performance and scalability by decoupling computation and storage, allowing
for on-demand resource allocation based on query complexity and workload. Functionalities
of Query Processing Layer are:
 Automatic Query Processing: Snowflake's Query Processing Layer
optimizes SQL queries automatically, modifying execution plans based on
underlying data distribution and query complexity to ensure efficient processing.


 Parallel Execution across Clusters: Query execution is performed in parallel
across many compute clusters, leveraging Snowflake's multi-cluster
architecture to achieve high concurrency and faster results for complex
analytical workloads.
 On Demand Resource Allocation: Depending on the complexity and number of
queries, the Query Processing Layer dynamically distributes computational
resources as needed. This on-demand resource distribution provides peak
performance and cost efficiency.
 Compute and Storage Separation: Snowflake's architecture separates
compute and storage, allowing the Query Processing Layer to expand compute
resources independently. This separation improves flexibility by allowing
enterprises to change compute power without affecting stored data, optimizing
both performance and cost.
Cloud Services Layer
In Snowflake's architecture, the Cloud Services Layer serves as the control plane,
managing information, security, and user access. It serves as a centralized platform for
administration, authentication, and activity coordination across the data warehouse. This
layer ensures that users and the underlying computation and storage resources in a cloud
environment interact seamlessly. The functionalities of Cloud Services Layer are:
 Metadata Management: Snowflake's metadata management involves storing
comprehensive information about data objects, structures, and statistics,
facilitating efficient query optimization. This metadata layer is crucial for
dynamically organizing and processing data within the cloud-based data
warehousing platform.
 Authentication and Access Control: Snowflake employs robust authentication
methods, including multi-factor authentication, to secure user access. Access
control is granular, with role-based permissions and policies ensuring fine-tuned
control over data and system resources.
 Query Optimization: Snowflake's query optimization dynamically adjusts
execution plans based on data distribution and complexity, ensuring efficient
processing of SQL queries. It leverages a multi-cluster, parallel processing
architecture for faster and scalable query performance.
 Infrastructure Management: Snowflake automates infrastructure management
by dynamically allocating and deallocating computing resources based on
workload demands, ensuring optimal performance and cost efficiency. This
approach simplifies operations for users by abstracting the complexities of the
underlying cloud infrastructure.
 Security: Snowflake prioritizes security with end-to-end encryption, role-based
access controls, and features like data masking, ensuring comprehensive
protection of sensitive data within the cloud-based data warehousing platform.
Security measures are integrated at every level, safeguarding against unauthorized
access and data breaches.

6. Suppose that a data warehouse consists of four dimensions - customer, product,
salesperson, and sales time - and three measures - sales amount (in rupees), VAT (in
rupees), and payment (in rupees). Draw the different classes of schemas that are
popularly used for modelling data warehouses and explain them.
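
As a starting point for the answer, the following is a minimal sketch of the star schema variant for this scenario (column choices are illustrative assumptions): a single fact table holds the three measures and foreign keys to the four dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension named in the question.
CREATE TABLE dim_customer    (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);
CREATE TABLE dim_product     (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_salesperson (salesperson_id INTEGER PRIMARY KEY, salesperson_name TEXT, region TEXT);
CREATE TABLE dim_sales_time  (time_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);

-- Fact table: foreign keys to each dimension plus the three measures (in rupees).
CREATE TABLE fact_sales (
    customer_id    INTEGER REFERENCES dim_customer(customer_id),
    product_id     INTEGER REFERENCES dim_product(product_id),
    salesperson_id INTEGER REFERENCES dim_salesperson(salesperson_id),
    time_id        INTEGER REFERENCES dim_sales_time(time_id),
    sales_amount   REAL,
    vat            REAL,
    payment        REAL
);
""")
print("Star schema created.")
```

In the snowflake schema variant, the dimension tables would themselves be normalized (for example, splitting product category into its own table), and a fact constellation would share these dimension tables across multiple fact tables.
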
7. Discuss the autonomous data warehouse.

An Autonomous Data Warehouse is a cloud-based data management system


that uses machine learning and AI to automate key tasks like setup, tuning, security,
backups, and updates. Oracle pioneered this concept with its Oracle Autonomous
Data Warehouse, designed to simplify analytics and data warehousing.
Key Features

 Self-Driving: Automatically configures, optimizes, and scales based on workload.
 Self-Securing: Detects threats and applies patches without downtime.
 Self-Repairing: Identifies and fixes issues to ensure high availability.
 Scalable: Adjusts compute and storage independently to meet demand.
 Integrated AI/ML: Built-in tools for advanced analytics and model building.

Architecture Overview

 Storage Layer: Uses high-performance infrastructure like Oracle Exadata.


 Compute Layer: Massively parallel processing for fast query execution.
 Autonomous Engine: AI-driven automation for lifecycle management.


 Data Studio: A user-friendly interface for loading, transforming, and analyzing data.

Use Cases

 Business Intelligence dashboards


 Real-time analytics and reporting
 Predictive modeling and forecasting
 IoT data analysis
 Financial and customer behavior analytics

Benefits

 Reduced Operational Costs: Less need for manual database administration.


 Faster Time to Insights: Quick setup and high-speed querying.
 Improved Security: Automated threat detection and data masking.
 High Availability: Uptime of over 99.995% with disaster recovery.


Unit-2
PART-A

1. What does ETL stand for? And how does it differ from ELT? (AU-A/M2024)

ETL stands for:

 Extract: Pulling data from various sources (e.g., databases, APIs, files).
 Transform: Cleaning, modifying, and enriching the data to make it usable.
 Load: Moving the transformed data into a target system like a data warehouse.

ELT flips the order:

 Extract: Same as ETL—pull data from sources.


 Load: The raw data is loaded directly into the destination (usually a powerful data
warehouse).
 Transform: The transformation happens inside the destination system after loading.

2. List the difference between OLAP & OLTP. (AU-A/M2024)

Feature            | OLAP (Online Analytical Processing)                  | OLTP (Online Transaction Processing)
💼 Purpose         | Supports decision-making, data analysis & reporting  | Manages real-time transactional data
🔎 Data Operation  | Complex queries with aggregation and joins           | Simple, fast queries for inserts, updates, deletes
📊 Data Volume     | Handles large volumes of historical data             | Works with current, detailed transactional data
⏱️ Response Time   | Slower (due to complex queries)                      | Fast (optimized for quick operations)
🏗️ Database Design | Denormalized schema (star/snowflake schemas)         | Highly normalized schema
🧠 Users           | Data analysts, business managers                     | Clerks, front-end users
📈 Example Use     | Sales trend analysis, forecasting, market research   | Bank transactions, online ticket booking
🗃️ Data Integrity  | Not always critical (read-intensive)                 | Very critical (write-intensive)

3. Outline the characteristics of OLAP. (AU-N/D2023)


 Multidimensional analysis
 Complex query capabilities
 Aggregated data
 Fast response time
 Data consistency
4. List out the three data warehouse models.
 Enterprise warehouse
 Data mart
 Virtual warehouse
5. Define Enterprise warehouse.
An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration.
An enterprise data warehouse may be implemented on traditional mainframes,
computer super servers, or parallel architecture platforms.

6. Define Data mart.


A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects.

7. List out the types of data marts and explain.


 Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area.
 Dependent data marts are sourced directly from enterprise data
warehouses.

8. Define Virtual warehouse.


A virtual warehouse is a set of views over operational databases. A virtual
warehouse is easy to build but requires excess capacity on operational
database servers.

9. What is drill-down operation?


Drill-down is the reverse of the roll-up operation. It navigates from less
detailed data to more detailed data. The drill-down operation can be performed by
stepping down a concept hierarchy for a dimension.

10. What is slice operation?


The slice operation performs a selection on one dimension of the cube, resulting
in a sub-cube.

PART-B

1. Explain the characteristics of OLAP and its operations. (AU-A/M2024)

FASMI summarizes the characteristics of OLAP methods; the term is derived from the
first letters of the characteristics:

Fast

It means that the system is targeted to deliver most responses to users within
about five seconds, with the simplest analyses taking no more than one second and
very few taking more than 20 seconds.

Analysis

It means that the system can cope with any business logic and statistical analysis
that is relevant for the application and the user, and keeps it easy enough for the target
user. Although some pre-programming may be needed, the user must be able to define
new ad hoc calculations as part of the analysis and to report on the data in any desired
way, without having to program. Products (like Oracle Discoverer) that do not allow
adequate end-user-oriented calculation flexibility are therefore excluded.

Share

It means that the system implements all the security requirements for confidentiality and,
if multiple write access is needed, concurrent update locking at an appropriate level. Not
all applications need users to write data back, but for the growing number that do, the
system should be able to handle multiple updates in a timely, secure manner.

Multidimensional

This is the basic requirement. An OLAP system must provide a multidimensional
conceptual view of the data, including full support for hierarchies, as this is certainly
the most logical method to analyze businesses and organizations.

Information

The system should be able to hold all the data needed by the applications. Data sparsity
should be handled in an efficient manner.

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a
dimensional and logical view of the data in the data warehouse. It helps in carrying out
slice-and-dice operations.


2. Multi-User Support: Since OLAP techniques are shared, the OLAP operations
should provide normal database operations, including retrieval, update, concurrency
control, integrity, and security.

3. Accessibility: OLAP acts as a mediator between data warehouses and front-end tools.
The OLAP engine should sit between the data sources (e.g., data warehouses) and the
OLAP front-end.

4. Storing OLAP results: OLAP results are kept separate from data sources.

5. Uniform reporting performance: Increasing the number of dimensions or
database size should not significantly degrade the reporting performance of the OLAP
system.

6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.

7. OLAP system should ignore all missing values and compute correct aggregate values.

8. OLAP facilitate interactive query and complex analysis for the users.

9. OLAP allows users to drill down for greater detail or roll up for aggregations of
metrics along a single business dimension or across multiple dimensions.

10. OLAP provides the ability to perform intricate calculations and comparisons.

11. OLAP presents results in a number of meaningful ways, including charts and graphs.

2. Discuss the ETL process in Data warehousing. (AU-A/M2024)


ETL Process

The ETL process, which stands for Extract, Transform, and Load, is a critical
methodology used to prepare data for storage, analysis, and reporting in a data
warehouse. It involves three distinct stages that help to streamline raw data from
multiple sources into a clean, structured, and usable form. Here’s a detailed
breakdown of each phase:


ETL

1. Extraction
The Extract phase is the first step in the ETL process, where raw data is collected from
various data sources. These sources can be diverse, ranging from structured sources
like databases (SQL, NoSQL), to semi-structured data like JSON, XML, or unstructured
data such as emails or flat files. The main goal of extraction is to gather data without altering
its format, enabling it to be further processed in the next stage.
Types of data sources can include:
 Structured: SQL databases, ERPs, CRMs
 Semi-structured: JSON, XML
 Unstructured: Emails, web pages, flat files
2. Transformation
The Transform phase is where the magic happens. Data extracted in the previous phase is
often raw and inconsistent. During transformation, the data is cleaned, aggregated, and
formatted according to business rules. This is a crucial step because it ensures that the data
meets the quality standards required for accurate analysis.
Common transformations include:
 Data Filtering: Removing irrelevant or incorrect data.
 Data Sorting: Organizing data into a required order for easier analysis.
 Data Aggregating: Summarizing data to provide meaningful insights (e.g.,
averaging sales data).


The transformation stage can also involve more complex operations such as currency
conversions, text normalization, or applying domain-specific rules to ensure the data aligns
with organizational needs.
3. Loading
Once data has been cleaned and transformed, it is ready for the final step: Loading. This
phase involves transferring the transformed data into a data warehouse, data lake, or another
target system for storage. Depending on the use case, there are two types of loading methods:
 Full Load: All data is loaded into the target system, often used during the initial
population of the warehouse.
 Incremental Load: Only new or updated data is loaded, making this method
more efficient for ongoing data updates.
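
A minimal sketch of the incremental-load idea, using invented table names and a "watermark" timestamp recorded by the previous run; only rows newer than the watermark are moved on each run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (order_id INTEGER, updated_at TEXT, amount REAL);
CREATE TABLE dw_orders     (order_id INTEGER, updated_at TEXT, amount REAL);
CREATE TABLE etl_watermark (last_loaded TEXT);
INSERT INTO etl_watermark VALUES ('1970-01-01T00:00:00');
""")
conn.executemany("INSERT INTO source_orders VALUES (?, ?, ?)",
                 [(1, "2024-01-05T10:00:00", 250.0), (2, "2024-01-06T09:30:00", 125.0)])

def incremental_load():
    # Read the watermark left by the previous run.
    last = conn.execute("SELECT last_loaded FROM etl_watermark").fetchone()[0]
    # Load only rows that changed since then, instead of reloading everything.
    rows = conn.execute(
        "SELECT order_id, updated_at, amount FROM source_orders WHERE updated_at > ?",
        (last,)).fetchall()
    conn.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", rows)
    if rows:
        conn.execute("UPDATE etl_watermark SET last_loaded = ?",
                     (max(r[1] for r in rows),))
    conn.commit()

incremental_load()
print(conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0], "rows loaded")
```
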
Pipelining in ETL Process
Pipelining in the ETL process involves processing data in overlapping stages to enhance
efficiency. Instead of completing each step sequentially, data is extracted, transformed, and
loaded concurrently. As soon as data is extracted, it is transformed, and while transformed
data is being loaded into the warehouse, new data can continue being extracted and processed.
This parallel execution reduces downtime, speeds up the overall process, and improves
system resource utilization, making the ETL pipeline faster and more scalable.

ETL Pipelining
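
A small sketch of the pipelining idea using Python generators (the data and transformation rules are invented for the example): each record flows through extract, transform, and load without waiting for the previous stage to finish the whole dataset.

```python
def extract():
    # Stand-in for reading from a source system, one record at a time.
    for order_id in range(1, 6):
        yield {"order_id": order_id, "amount": order_id * 10.0}

def transform(records):
    # Records are transformed as soon as they are extracted.
    for rec in records:
        rec["amount_with_tax"] = round(rec["amount"] * 1.18, 2)
        yield rec

def load(records):
    # Loading starts while extraction of later records is still in progress.
    for rec in records:
        print("loaded:", rec)

load(transform(extract()))
```
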

In short, the ETL process involves extracting raw data from various sources, transforming it
into a clean format, and loading it into a target system for analysis. This is crucial for
organizations to consolidate data, improve quality, and enable actionable insights for
decision-making, reporting, and machine learning. ETL forms the foundation of effective
data management and advanced analytics.


Importance of ETL
 Data Integration: ETL combines data from various sources,
including structured and unstructured formats, ensuring seamless integration for
a unified view.
 Data Quality: By transforming raw data, ETL cleanses and standardizes it,
improving data accuracy and consistency for more reliable insights.
 Essential for Data Warehousing: ETL prepares data for storage in data
warehouses, making it accessible for analysis and reporting by aligning it with the
target system's requirements.
 Enhanced Decision-Making: ETL helps businesses derive actionable insights,
enabling better forecasting, resource allocation, and strategic planning.
 Operational Efficiency: Automating the data pipeline through ETL speeds up
data processing, allowing organizations to make real-time decisions based on the
most current data.
Challenges in ETL Process
The ETL process, while essential for data integration, comes with its own set of challenges
that can hinder efficiency and accuracy. These challenges, if not addressed properly, can
impact the overall performance and reliability of data systems.
 Data Quality Issues: Inconsistent, incomplete, or duplicate data from multiple
sources can impact transformation and loading, leading to inaccurate insights.
 Performance Bottlenecks: Large datasets can slow down or cause ETL processes
to fail, particularly during complex transformations like cleansing and
aggregation.
 Scalability Issues: Legacy ETL systems may struggle to scale with growing data
volumes, diverse sources, and more complex transformations.

3. Explain the different types of OLAP tools. What type of OLAP tool is best suited for a
data set that requires both detailed and summarized analysis? (AU-N/D2023)

OLAP tools are used to analyze multidimensional data efficiently. Here are the major types:

1. ROLAP (Relational OLAP)

 📚 Works on relational databases (e.g., SQL-based systems).


 📊 Suitable for large volumes of data.

 🧠 Stores data in tables rather than cubes.

 ✅ Strengths: Handles detailed data well and scales easily.


 ⚠️ Limitations: Slower query performance on summarized views compared to
MOLAP.

2. MOLAP (Multidimensional OLAP)

 🧠 Uses multidimensional data cubes.

 ⚡ Very fast querying, especially on summarized data.

 📦 Pre-aggregates data for rapid access.


 ✅ Strengths: Great for summarized analytics and visualizations.

 ⚠️ Limitations: Not ideal for massive detailed datasets; cube size can become a
bottleneck.

3. HOLAP (Hybrid OLAP)

 🔄 Combines features of ROLAP and MOLAP.

 💽 Stores summarized data in cubes (like MOLAP) and detailed data in relational
databases (like ROLAP).
 ✅ Strengths: Balances performance and scalability.

 ⚠️ Limitations: Can be more complex to set up and manage.

4. DOLAP (Desktop OLAP)

 🖥️ Installed on individual machines.

 📈 Designed for local analysis with limited data volumes.

 ✅ Strengths: Offline access and easy to use.

 ⚠️ Limitations: Not suited for enterprise-level analytics or large datasets.

5. WOLAP (Web OLAP)

 🌐 Browser-based OLAP tool.

 🕸️ Accessible online with minimal software installation.



 ✅ Strengths: Convenient and scalable for remote users.

 ⚠️ Limitations: Dependent on network and server performance.

🔍 Best Choice for Both Detailed and Summarized Analysis

The HOLAP model is generally the best fit when you need:

 Granular insights from detailed transactional data, and


 Aggregated summaries for trend analysis or dashboards.

It gives you the flexibility of accessing detailed data like ROLAP, with the speed of
summarized cube querying like MOLAP.

4. Diagrammatically illustrate and describe the architecture of MOLAP and ROLAP.
Find the major difference between MOLAP and ROLAP. (AU-N/D2023)

Definition:
 ROLAP stands for Relational Online Analytical Processing.
 MOLAP stands for Multidimensional Online Analytical Processing.
 HOLAP stands for Hybrid Online Analytical Processing.

Storage of aggregations:
 ROLAP: The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
 MOLAP: The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed.
 HOLAP: The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance.

Copy of the source data:
 ROLAP: ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.
 MOLAP: The MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data.
 HOLAP: HOLAP does not cause a copy of the source data to be stored. For queries that access only the summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.

Query response:
 ROLAP: Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP.
 MOLAP: Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition.
 HOLAP: Queries that access source data (for example, a drill-down to an atomic cube cell for which there is no aggregation data) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.


5. Compare OLAP with OLTP systems. (Refer two-marks Q.No. 2)

6. Discuss the OLAP operations with examples (KDD).

There are five basic analytical operations that can be performed on an OLAP cube:

1. Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension


 In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of Location dimension (City -> Country).


3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
 Location = "Delhi" or "Kolkata"
 Time = "Q1" or "Q2"
 Item = "Car" or "Bus"

4. Slice: It selects a single dimension from the OLAP cube, which results in the creation
of a new sub-cube. In the cube given in the overview section, slice is performed on the
dimension Time = "Q1".

5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation,
performing the pivot operation gives a new view of it.
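
The same five-operation idea can be expressed over a relational star schema; the hypothetical sketch below (invented table and dimension values) shows roll-up and drill-down as changes of grouping level, and slice and dice as WHERE-clause restrictions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, country TEXT, quarter TEXT, month TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Delhi",   "India", "Q1", "Jan", "Car", 100.0),
    ("Kolkata", "India", "Q1", "Feb", "Bus",  80.0),
    ("Delhi",   "India", "Q2", "Apr", "Car", 120.0),
])

# Roll-up: climb the Location hierarchy (City -> Country).
print(conn.execute("SELECT country, SUM(amount) FROM sales GROUP BY country").fetchall())

# Drill-down: step down the Time hierarchy (Quarter -> Month).
print(conn.execute("SELECT month, SUM(amount) FROM sales GROUP BY month").fetchall())

# Slice: fix a single dimension (Time = 'Q1').
print(conn.execute("SELECT * FROM sales WHERE quarter = 'Q1'").fetchall())

# Dice: restrict two or more dimensions at once.
print(conn.execute(
    "SELECT * FROM sales WHERE city IN ('Delhi','Kolkata') AND quarter IN ('Q1','Q2') "
    "AND item IN ('Car','Bus')").fetchall())
```
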


7.Diagramatically illustrate and describe the architecture of MOLAP, ROLAP, HOLAP.

There are three main types of OLAP servers, as follows:

 ROLAP stands for Relational OLAP, an application based on relational DBMSs.


 MOLAP stands for Multidimensional OLAP, an application based on
multidimensional DBMSs.
 HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.

Relational OLAP (ROLAP) Server

These are intermediate servers which stand in between a relational back-end server and
user frontend tools.

They use a relational or extended-relational DBMS to save and handle warehouse data,
and OLAP middleware to provide missing pieces.

ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.


ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This model permits
the multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action
of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.

Relational OLAP Architecture

ROLAP architecture includes the following components:

 Database server.
 ROLAP server.
 Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding the need to structure records around the desired view.

Some products in this segment have supported robust SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being 'RDBMS' aware, and being capable of generating the SQL statements based
on the optimizer of the DBMS engine.

Advantages

Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the
data amount.

<="" strong="">RDBMS already comes with a lot of features. So ROLAP technologies,


(works on top of the RDBMS) can control these functionalities.

Disadvantages

Performance can be slow: Since each ROLAP report is a SQL query (or multiple SQL queries) in
the relational database, the query time can be prolonged if the underlying data size is large.

Limited by SQL functionalities: ROLAP technology relies upon developing SQL
statements to query the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server

A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically into multidimensional arrays, and positional
techniques are used to access them.

One of the significant distinctions of MOLAP against ROLAP is that data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to the
client's reporting requirements, with the calculations pre-generated on the cubes.

MOLAP Architecture

MOLAP architecture includes the following components:
 Database server.
 MOLAP server.
 Front-end tool.


MOLAP structure primarily reads the precompiled data. MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results which have not been pre-
calculated and stored.

Applications requiring iterative and comprehensive time-series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting).

Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's
LightShip Server, Sinper's TM/1, Planning Sciences' Gentium, and Kenan Technology's
Multiway.

Some of the problems faced by clients are related to maintaining support for multiple subject
areas in an RDBMS. Some vendors solve these problems by providing access from
MOLAP tools to detailed data in an RDBMS.

This can be very useful for organizations with performance-sensitive multidimensional
analysis requirements that have built, or are in the process of building, a data warehouse
architecture that contains multiple subject areas.

An example would be the creation of sales data measured by several dimensions (e.g., product
and sales region) to be stored and maintained in a persistent structure. This structure would be
provided to reduce the application overhead of performing calculations and building
aggregation during initialization. These structures can be automatically refreshed at
predetermined intervals established by an administrator.

Advantages

Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal
for slicing and dicing operations.


Can perform complex calculations: All calculations have been pre-generated when the cube
is created. Hence, complex calculations are not only possible, but they return quickly.

Disadvantages

Limited in the amount of information it can handle: Because all calculations are performed
when the cube is built, it is not possible to contain a large amount of data in the cube itself.

Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are other
investments in human and capital resources are needed.

Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture.
HOLAP systems save more substantial quantities of detailed data in the relational tables while
the aggregations are stored in the pre-calculated cubes. HOLAP also can drill through from
the cube down to the relational tables for delineated data. The Microsoft SQL Server 2000
provides a hybrid OLAP server.

Advantages of HOLAP

1. HOLAP provide benefits of both MOLAP and ROLAP.

2. It provides fast access at all levels of aggregation.

3. HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server while the detail record remains in the relational database, so no duplicate
copy of the detail record is maintained.

Disadvantages of HOLAP

1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.


Other Types

There are also less popular OLAP styles that one could stumble upon every so often. We have
listed some of the less common variants existing in the OLAP industry.

Web-Enabled OLAP (WOLAP) Server

WOLAP pertains to OLAP application which is accessible via the web browser. Unlike
traditional client/server OLAP applications, WOLAP is considered to have a three-tiered
architecture which consists of three components: a client, a middleware, and a database server.

Desktop OLAP (DOLAP) Server


DOLAP permits a user to download a section of the data from the database or source, and work
with that dataset locally, or on their desktop.

Mobile OLAP (MOLAP) Server

Mobile OLAP enables users to access and work on OLAP data and applications remotely
through the use of their mobile devices.

Spatial OLAP (SOLAP) Server

SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP
into a single user interface. It facilitates the management of both spatial and non-spatial data.


Unit-3

1. How to choose the data partitioning strategy for data warehouse? (AU-A/M2024)

 Query patterns: Are users often querying by time, location, or category? Match
partitions to common filters.
 Data volume and growth: If the dataset grows rapidly (e.g., logs, transactions), use
time-based range partitioning.
 Load and maintenance efficiency: Choose partitions that simplify loading new data and
removing old data.
 Hardware and infrastructure: In distributed systems, hash partitioning balances load
better across nodes.
 Schema evolution: Vertical partitioning can isolate frequently changing columns for
easier updates.

2. What is Data mart? (AU-A/M2024).


A data mart is a storage component that is focused on a specific department of
an organization. It is a subset of the data stored in the data warehouse. A data mart is
focused only on a particular function of an organization and is maintained by a single
authority, e.g., finance or marketing. Data marts are small in size and flexible.

3. Differentiate metadata and data mart. (AU-N/D2023)

Feature    | Metadata                                   | Data Mart
Definition | Describes the characteristics of data      | A specialized database for a business unit
Purpose    | Helps organize, locate, and manage data    | Provides targeted data for decision-making
Scope      | Covers attributes of individual data sets  | Focused on departmental or functional data
Users      | Data analysts, IT staff                    | Business users, analysts, managers

4. Propose the features of a metadata repository in data warehousing. (AU-N/D2023)


Data Source Tracking

 Identifies origins of data (e.g., OLTP systems, external feeds).


 Helps in understanding data lineage and reliability.

ETL Process Documentation

 Captures details of extraction, transformation, and loading (ETL) steps.


 Useful for audits and troubleshooting issues in data pipelines.

Schema Definitions

 Stores detailed schema info: tables, columns, data types, constraints.


 Ensures consistent data structure across the warehouse.

5. Compare Data mart with Data warehouse. (AU-N/D2023)

Data Warehouse: A data warehouse is a vast repository of information collected from various organizations or departments within a corporation.
Data Mart: A data mart is only a subtype of a data warehouse. It is architected to meet the requirements of a specific user group.

Data Warehouse: It may hold multiple subject areas.
Data Mart: It holds only one subject area, for example, Finance or Sales.

Data Warehouse: It holds very detailed information.
Data Mart: It may hold more summarized data.

Data Warehouse: It works to integrate all data sources.
Data Mart: It concentrates on integrating data from a given subject area or set of source systems.

6. Why is a data mart considered cost-effective compared to a data warehouse? (AU-N/D2023)

Data marts are generally more cost-effective, requiring less infrastructure and lower
maintenance costs, since they pull a focused subset of data from existing warehouses.
7. What is a Metadata Repository?
A metadata repository is a software tool that stores descriptive information about the
data model used to store and share metadata. Metadata repositories combine diagrams
and text, enabling metadata integration and change.
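
As a simplified sketch of the idea (not any particular product), a metadata repository can be pictured as tables that describe other tables; the structure and sample entries below are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The repository stores descriptions of warehouse objects, not the data itself.
CREATE TABLE md_tables  (table_name TEXT PRIMARY KEY, subject_area TEXT, source_system TEXT, refresh_schedule TEXT);
CREATE TABLE md_columns (table_name TEXT, column_name TEXT, data_type TEXT, description TEXT);
""")
conn.execute("INSERT INTO md_tables VALUES ('fact_sales', 'Sales', 'Orders OLTP', 'daily 02:00')")
conn.execute("INSERT INTO md_columns VALUES ('fact_sales', 'sales_amount', 'REAL', 'Invoiced amount in rupees')")

# Analysts can query the repository to locate and understand warehouse data.
for row in conn.execute("SELECT table_name, source_system FROM md_tables"):
    print(row)
```
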

8. What is Metadata used for?


Simply defined, metadata is the summary and description of your data that is
used to classify, organize, label, and understand data, making sorting and searching for
data much easier. Without it, companies can't manage the huge amounts of data created
and collected across an enterprise.

9. Differentiate horizontal and vertical partitioning.

Vertical Partitioning

→ Moves specific columns into separate tables

→ All tables contain the same number of rows, but fewer columns

→ Ideal when different parts of an app only access certain attributes

Horizontal Partitioning (Sharding)

→ Splits tables into smaller sets of rows across multiple databases

→ All shards have the same columns, but fewer rows

→ Common in large-scale systems like social networks, ecommerce platforms, etc.
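
The contrast can be sketched with hypothetical tables (names and split criteria are assumptions): vertical partitioning keeps the same rows but separates columns into companion tables sharing a key, while horizontal partitioning keeps the same columns but splits rows across separate tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Vertical partitioning: same rows (same key), different columns per table.
CREATE TABLE customer_core    (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE customer_profile (customer_id INTEGER PRIMARY KEY, preferences TEXT, notes TEXT);

-- Horizontal partitioning: same columns, rows split by a range (here, by year).
CREATE TABLE sales_2023 (sale_id INTEGER, sale_date TEXT, amount REAL);
CREATE TABLE sales_2024 (sale_id INTEGER, sale_date TEXT, amount REAL);
""")

# A query for recent data only touches the relevant horizontal partition.
recent = conn.execute("SELECT COUNT(*) FROM sales_2024").fetchone()[0]
print("rows in 2024 partition:", recent)
```
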

10. Define Normalization & Row splitting

Normalization is the standard relational method of database organization. In this
method, the rows are collapsed into a single row, hence it reduces space.

Row splitting tends to leave a one-to-one map between partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.

PART-B

1. Describe the process of designing a cost-effective data mart. (AU-A/M2024)

To design a cost-effective data mart, you need to consider various factors ranging from

storage optimization to performance efficiency. Here's a comprehensive approach:

1. Define Clear Objectives:

Clearly define the objectives of your data mart. Understand what data you need to store,
analyze, and report on. This will help in determining the most cost-effective design.


2. Data Modeling: Employ dimensional modeling techniques such as star schema or


snowflake schema. These models are optimized for query performance and are typically more
cost-effective compared to normalized models for analytical workloads.

3. Partitioning Strategy: Utilize partitioning to manage and access data efficiently. Consider
both vertical and horizontal partitioning:

 Vertical Partitioning: Divide tables into smaller vertical segments based on the
frequency of access. Frequently accessed columns can be stored in one segment while
less accessed ones in another. This reduces I/O overhead and storage costs.
 Horizontal Partitioning: Split tables into smaller chunks horizontally based on ranges
of values (e.g., date ranges) or other criteria. This helps in managing large datasets
efficiently and can optimize query performance.

4. Data Compression: Implement data compression techniques to reduce storage costs


without sacrificing performance. Modern database systems offer various compression options
suitable for different types of data.

5. Storage Optimization: Consider cost-effective storage solutions such as tiered storage or


cloud-based storage options. Utilize the appropriate storage tier based on the access patterns of
your data. Evaluate data lifecycle management policies to archive or delete data that is no
longer needed for analysis, thereby reducing storage costs.

6. Indexing Strategy: Optimize indexing based on query patterns. Avoid over-indexing as it


can increase storage costs and maintenance overhead. Focus on creating indexes that
significantly improve query performance.

7. Data Governance and Quality: Implement robust data governance practices to ensure data
quality and integrity. Poor data quality can lead to increased storage costs due to unnecessary
duplicates or inaccuracies.

8. Performance Monitoring and Tuning: Continuously monitor and tune the performance
of your data mart. Identify and eliminate performance bottlenecks through query optimization,
index tuning, and hardware upgrades if necessary.

9. Cloud Considerations: If leveraging cloud services, consider cost optimization features
provided by cloud providers. Utilize reserved instances, spot instances, or auto-scaling
capabilities to optimize costs based on workload demands.


10. Scalability: Design the data mart to be scalable to accommodate future growth. Ensure that
the chosen architecture can easily scale both vertically (e.g., adding more resources to existing
servers) and horizontally (e.g., adding more nodes to a distributed database system).

2. Discuss the challenges associated with metadata management. (AU-A/M2024)

There are several challenges that can arise when managing metadata:
1. Lack of standardization: Different organizations or systems may use different standards
or conventions for metadata, which can make it difficult to effectively manage metadata
across different sources.
2. Data quality: Poorly structured or incorrect metadata can lead to problems with data
quality, making it more difficult to use and understand the data.
3. Data integration: When integrating data from multiple sources, it can be challenging to
ensure that the metadata is consistent and aligned across the different sources.
4. Data governance: Establishing and enforcing metadata standards and policies can be
difficult, especially in large organizations with multiple stakeholders.
5. Data security: Ensuring the security and privacy of metadata can be a challenge,
especially when working with sensitive or confidential information.

3. Illustrate the various classifications of metadata with suitable examples and explain the
same. (AU-N/D2023)

Types of Metadata
There are mainly five types of metadata as shown below.

Types of Metadata

1. Preservation Metadata
Preservation metadata is the type of metadata designed to support the long-term accessibility
of digital assets. It serves as a comprehensive record of the information necessary for the
preservation and management of digital collections, mostly in the context of digital libraries,
museums, and other cultural institutions.
Preservation metadata is used to store information that stays with digital assets over the long
term. It includes information about authentication and the preservation actions taken on the
content. This metadata preserves the information necessary for long-term preservation and
management of digital assets, ensuring their integrity and usability over time.
Example: This metadata includes information about file formats, migration strategies,
metadata schemas, and preservation actions for maintaining data accessibility. It supports
activities related to digital preservation, like data migration and risk management.
2. Descriptive Metadata
Descriptive metadata gives detailed information about the content and characteristics of a
particular piece of data, enabling users to understand its meaning and context. It plays the
main role in helping users discover and understand digital resources effectively. Descriptive
metadata typically covers the title, author/creator, subject/keywords, abstract/summary, date,
format, language, identifier, spatial coverage, temporal coverage, and rights information.
In short, descriptive metadata describes the content itself: the author, title, summary, theme,
dates, and key points of the content. This helps users understand the content better.
Example: This metadata type consists of attributes such as title, author, subjects, keywords,
abstracts, and other descriptive parts of a document. For digital objects such as documents,
descriptive metadata also includes details such as location, creation date, file format, and file
size.
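For instance, a minimal Python sketch of how descriptive metadata for one document might be represented and searched by keyword; all field names here are illustrative, not a fixed standard:

```python
# A minimal sketch of descriptive metadata for one digital document,
# kept as a plain dictionary; field names are illustrative only.
doc_metadata = {
    "title": "Quarterly Sales Report",
    "author": "Analytics Team",
    "subject_keywords": ["sales", "revenue", "Q1"],
    "abstract": "Summary of Q1 sales by region and product line.",
    "created": "2024-04-02",
    "language": "en",
    "format": "application/pdf",
    "identifier": "doc-2024-0117",
}

def matches_keyword(metadata, keyword):
    """Simple discovery helper: does this record mention the keyword?"""
    return keyword.lower() in (kw.lower() for kw in metadata["subject_keywords"])

print(matches_keyword(doc_metadata, "profit"))   # False
print(matches_keyword(doc_metadata, "sales"))    # True
```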
3. Technical Metadata
Technical metadata provides detailed information about the technical characteristics and
properties of a digital asset. Its main role is to facilitate the management, processing, and
interoperability of digital resources across different platforms and environments.
Technical metadata describes the technical aspects of the content, such as hardware
specifications, encoding, file formats, file resolution, and the software used to create the
content.
Example: This metadata records the file format, file size, color space, compression method,
and the software used to create and manipulate the data. For images, technical metadata can
include Exchangeable Image File Format (EXIF) data, which contains the camera settings and
other technical information.
4. Structural Metadata
Structural metadata provides information about the organization, arrangement, and
relationships within a digital asset. It helps users navigate and understand the internal
structure and hierarchy of its components. It supports interacting with and accessing the
content, enabling users to navigate complex information architectures and retrieve data
efficiently.
The purpose of structural metadata is to describe the organization and relationships within the
content or dataset. For example, in a book, structural metadata can include chapter titles,
headings, tables of contents, and indexes.
Example: Structural metadata can describe chapters, sections, headings, paragraphs, and page
sequences. For multimedia files such as videos and recordings, it includes timestamps,
markers, and pointers.
5. Administrative Metadata
Administrative metadata gives information about the management, administration, and
governance of a digital asset across its lifecycle. It serves as a foundation for resource
management, traceability, and compliance with organizational policies and standards.
Administrative metadata supports the administration and management of content. It might
include data ownership, access rights, creation/update dates, and file formats.
Example: This metadata can include details such as the author, creation date, access
permissions, history, and Digital Rights Management (DRM) restrictions. It can also describe
where the data is stored and the strategies used for its preservation.

4.Elaborate in detail the various issues to be considered when designing and implementing a
data warehouse environment. (AU-N/D2023)

5.Explain the following i) Metadata repository ii)Role of metadata.

i)Meta Data Repository


 As organizations work with increasing amounts of data, it is essential to manage and
organize that data efficiently. This is where metadata repositories come in. These
specialized databases store details about datasets: what they are, where they come from,
and how they have changed over time.
 Metadata repositories are essential for efficient data storage and access.
 Metadata repositories act like a central hub, making it easier to locate, track, and
manage data within an organization. As organizations use more and more data, it
becomes crucial to manage it efficiently.
 Metadata repositories store information about your data in one place, making it easy to
access and work with, which is why these repositories are so important.
Key Components of a Metadata Repository
1. Storing Metadata:
These repositories hold all the info about our datasets. This includes where the data comes
from, how it's set up, and any changes it's been through.
2. Guiding Users:
When someone needs specific data, the repository guides them to the right files. Users can
see which files are there, when they were added, and who put them there.
3. Tracking Data Changes:
These repositories also keep track of any changes to the data. So if a dataset gets updated or
changed in any way, the repository shows this, making sure everyone always has the
most accurate data.
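A minimal sketch in Python (with illustrative class and field names, not a real product's schema) of the kind of record such a repository might keep for one dataset, including its change history:

```python
from dataclasses import dataclass, field
from datetime import date

# What a metadata repository might track for one dataset: where it came from,
# what it contains, and how it has changed over time.
@dataclass
class DatasetRecord:
    name: str
    source_system: str
    columns: list
    added_by: str
    added_on: date
    change_log: list = field(default_factory=list)

    def record_change(self, description: str, changed_on: date):
        """Track every change so users always see the current state of the data."""
        self.change_log.append((changed_on, description))

sales = DatasetRecord(
    name="sales_fact",
    source_system="EPOS",
    columns=["date_key", "item_key", "store_key", "units_sold", "revenue"],
    added_by="etl_job_42",
    added_on=date(2024, 1, 5),
)
sales.record_change("Added 'discount' column during March load", date(2024, 3, 1))
print(sales.change_log)
```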
ii)Role of metadata.

Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.
• Metadata acts as a directory.
• This directory helps the decision support system to locate the contents of the data
warehouse.
• Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized
data.
• Metadata also helps in summarization between lightly detailed data and highly
summarized data.

• Metadata is used for query tools.


• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.

6.Compare and contrast the advantages and disadvantages of vertical and horizontal
partitioning in a data warehousing context.

• Definition: Vertical partitioning divides a table into smaller tables based on columns; horizontal partitioning divides a table into smaller tables based on rows (usually ranges of rows).

• Purpose: Vertical partitioning reduces the number of columns in a table to improve query performance and reduce I/O; horizontal partitioning divides a table into smaller tables to manage large volumes of data efficiently.

• Data distribution: In vertical partitioning, columns with related data are placed together in the same table; in horizontal partitioning, rows with related data (typically based on a range or a condition) are placed together in the same table.

• Query performance: Vertical partitioning improves query performance when queries only involve specific columns that are part of a partition; horizontal partitioning improves query performance when queries primarily access a subset of rows in a large table.

• Maintenance and indexing: Vertical partitioning makes it easier to manage and index specific columns based on their characteristics and access patterns; with horizontal partitioning, each partition can be indexed independently, making indexing more efficient.

• Joins: Vertical partitioning may require joins to combine data from multiple partitions when querying; with horizontal partitioning, joins between partitions are typically not needed, as they contain disjoint sets of data.

• Data integrity: With vertical partitioning, ensuring data consistency across partitions can be more challenging; horizontal partitioning makes it easier to maintain data integrity, as each partition contains a self-contained subset of data.

• Use cases: Vertical partitioning is commonly used for tables with a wide range of columns, where not all columns are frequently accessed together; horizontal partitioning is commonly used for tables with a large number of rows, where data can be grouped based on some criteria (e.g., date ranges).

• Examples: Vertical partitioning: splitting a customer table into one table for personal details and another for transaction history. Horizontal partitioning: partitioning a large sales order table by date, with each partition containing orders from a specific month or year.
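The sketch below illustrates the two styles in plain Python on a handful of in-memory rows; the table layout and column names are made up for illustration:

```python
# Contrasting the two partitioning styles on in-memory rows.
customers = [
    {"id": 1, "name": "Asha",  "city": "Chennai", "order_total": 1200, "order_date": "2024-01-15"},
    {"id": 2, "name": "Ravi",  "city": "Madurai", "order_total": 800,  "order_date": "2024-02-20"},
    {"id": 3, "name": "Meena", "city": "Salem",   "order_total": 450,  "order_date": "2024-01-28"},
]

# Vertical partitioning: split the columns into two narrower "tables".
personal_details = [{"id": r["id"], "name": r["name"], "city": r["city"]} for r in customers]
transactions     = [{"id": r["id"], "order_total": r["order_total"], "order_date": r["order_date"]}
                    for r in customers]

# Horizontal partitioning: split the rows by a range condition (here, the order month).
january  = [r for r in customers if r["order_date"].startswith("2024-01")]
february = [r for r in customers if r["order_date"].startswith("2024-02")]

print(len(personal_details), len(transactions), len(january), len(february))
```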


Unit-4

PART-A

1. Compare and contrast snowflake schema and star schema. (AU-A/M2024)

• Table Design: Star schema has a fact table with denormalized dimension tables; snowflake schema has a fact table with normalized dimension tables.
• Query Speed: Star schema is faster (fewer joins); snowflake schema is slower (more joins).
• Storage Efficiency: Star schema is lower (redundancy); snowflake schema is higher (normalized tables).
• Ease of Use: Star schema is high (simple structure); snowflake schema is moderate to low (complex structure).
• Maintenance: Star schema offers lower flexibility; snowflake schema offers higher flexibility.

2. What is Data Cube? (AU-A/M2024)


A data cube is a multi-dimensional array of values, typically used in data
warehousing and OLAP (Online Analytical Processing)

3. Define Star Schema.(AU-N/D2023)

Star Schema is a type of multidimensional model used for data warehouses. In a star schema,
the fact tables and dimension tables are included. This schema uses fewer foreign-key joins.
It forms a star structure with a central fact table connected to the surrounding dimension
tables.

4. What is Snowflake Schema? (AU-N/D2023)

A snowflake schema is a way of organizing data in a database, especially in a data warehouse,
that aims to optimize storage and efficiency through normalization.

Structure of Snowflake Schema

 Fact Table: Central table containing quantitative data (like sales or revenue).


 Dimension Tables: Related tables providing descriptive context (like time, product,
location).
 Sub-Dimension Tables: Further breakdown of dimensions into smaller related tables.

5. Name the types of data warehouse schema. .(AU-N/D2023)


 Star schema
 Snowflake schema
 Galaxy schema
 Normalized schema
6. What is the significance of fact constellation schema in dimensional modelling?
(AU-N/D2023)
The fact constellation schema, sometimes called a galaxy schema, is significant because it
allows multiple fact tables to share dimension tables, so a single dimensional model can
support several related business processes and the complex analytical queries that span them.
7. What are the types of database parallelism?
 Horizontal parallelism
 Vertical parallelism
8. Define Multidimension Data Model.

A Multidimensional Data Model is defined as a model that allows data to be organized and
viewed in multiple dimensions, such as time, item, branch, and location, enabling
organizations to analyze relationships between different perspectives and entities efficiently.

9. Evaluate the advantages and disadvantages of different schema types.


1. Star Schema
A central fact table connected to multiple dimension tables.
Advantages
 Simplified queries: Easier to understand and write
 High performance for read-heavy operations
 Ideal for OLAP (Online Analytical Processing)
Disadvantages
 Data redundancy in dimension tables
 Not ideal for complex many-to-many relationships
2. Snowflake Schema

An extension of the star schema where dimension tables are normalized.


Advantages
 Reduces data redundancy through normalization
 Efficient storage usage
Disadvantages
 Complex queries due to multiple joins
 Slower performance compared to star schema
3. Galaxy Schema (or Fact Constellation)
 Multiple fact tables sharing dimension tables.
Advantages
 Supports multiple business processes
 Flexible schema design
Disadvantages
 Increased complexity
 Harder to maintain and navigate
4. Normalized Schema (3NF or higher)
 Data is divided into logical tables to reduce redundancy.
Advantages
 Ensures data integrity and consistency
 Good for OLTP (Online Transaction Processing)
Disadvantages
 Requires more joins, which can slow down queries
 Not ideal for analytical queries
5. Denormalized Schema
 Tables are flattened for performance, with intentional redundancy.
Advantages
 Faster reads due to fewer joins
 Simpler queries for reporting
Disadvantages
 Data inconsistency risks
 Inefficient for frequent updates
10. Give some data warehouse tools
 Amazon redshift
 Google big query


 Microsoft Azure Synapse Analytics

PART-B
1. Detail the process of dimensional modelling and its importance in data
warehousing. (AU-A/M2024)

Dimensional Data Modeling is one of the data modeling techniques used in data warehouse
design. The concept of Dimensional Modeling was developed by Ralph Kimball which is
comprised of facts and dimension tables. Since the main goal of this modeling is to improve
the data retrieval so it is optimized for SELECT OPERATION. The advantage of using this
model is that we can store data in such a way that it is easier to store and retrieve the data
once stored in a data warehouse. The dimensional model is the data model used by many
OLAP systems.
Elements of Dimensional Data Model
Facts
Facts are the measurable data elements that represent the business metrics of interest. For
example, in a sales data warehouse, the facts might include sales revenue, units sold, and
profit margins. Each fact is associated with one or more dimensions, creating a relationship
between the fact and the descriptive data.

Dimension

Dimensions are the descriptive data elements that are used to categorize or classify the data.
For example, in a sales data warehouse, the dimensions might include product, customer,
time, and location. Each dimension is made up of a set of attributes that describe the
dimension. For example, the product dimension might include attributes such as product
name, product category, and product price.

Attributes

Attributes are the characteristics of a dimension in data modeling. They are used to filter and
search facts. For a dimension such as location, attributes can be State, Country, Zipcode, etc.

Fact Table


In a dimensional data model, the fact table is the central table that contains the measures or
metrics of interest, surrounded by the dimension tables that describe the attributes of the
measures. The dimension tables are related to the fact table through foreign key relationships

Dimension Table

Dimensions of a fact are mentioned by the dimension table and they are basically joined by
a foreign key. Dimension tables are simply de-normalized tables. The dimensions can be
having one or more relationships.
Types of Dimensions in Data Warehouse Model
 Conformed Dimension
 Outrigger Dimension
 Shrunken Dimension
 Role-Playing Dimension
 Dimension to Dimension Table
 Junk Dimension
 Degenerate Dimension
 Swappable Dimension
 Step Dimension
Steps to Create Dimensional Data Modeling
Step-1: Identifying the business objective: The first step is to identify the business
objective. Sales, HR, Marketing, etc. are some examples of the need of the organization.
Since it is the most important step of Data Modelling the selection of business objectives also
depends on the quality of data available for that process.

Step-2: Identifying Granularity: Granularity is the lowest level of information stored in the
table. The level of detail for business problems and its solution is described by Grain.

Step-3: Identifying Dimensions and their Attributes: Dimensions are objects or things.
Dimensions categorize and describe data warehouse facts and measures in a way that supports
meaningful answers to business questions. A data warehouse organizes descriptive attributes
as columns in dimension tables.
For Example, the data dimension may contain data like a year, month, and weekday.

Step-4: Identifying the Fact: The measurable data is held by the fact table. Most of the fact
table rows are numerical values like price or cost per unit.

Step-5: Building of Schema: We implement the Dimension Model in this step. A schema is
a database structure. There are two popular schemes: Star Schema and Snowflake Schema.
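A minimal, runnable sketch of Step-5 using Python's built-in sqlite3 module: one fact table joined to two dimension tables, followed by a typical analytical query. Table and column names are illustrative, not a prescribed design:

```python
import sqlite3

# Build a tiny star schema in an in-memory database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_item  (item_key INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE dim_time  (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO dim_item VALUES (1, 'Phone', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_time VALUES (10, '2024-01-05', 'Jan', 2024), (11, '2024-02-09', 'Feb', 2024);
INSERT INTO fact_sales VALUES (1, 10, 120, 36000.0), (2, 11, 15, 4500.0);
""")

# A typical analytical SELECT: facts grouped by a dimension attribute.
cur.execute("""
    SELECT d.category, SUM(f.dollars_sold)
    FROM fact_sales f JOIN dim_item d ON f.item_key = d.item_key
    GROUP BY d.category
""")
print(cur.fetchall())
```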

2.Discuss the advantages of using a multidimensional data model. (AU-A/M2024)

The following are the advantages of a multi-dimensional data model :


 A multi-dimensional data model is easy to handle.
 It is easy to maintain.
 Its performance is better than that of normal databases (e.g. relational databases).
 The representation of data is better than traditional databases. That is because the
multi-dimensional databases are multi-viewed and carry different types of factors.
 It is workable on complex systems and applications, contrary to the simple one-
dimensional database systems.
 The compatibility of this type of database is an advantage for projects that have limited
bandwidth for maintenance staff.
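As an illustration of the multi-viewed nature described above, the following sketch uses pandas (assumed to be available) to pivot a small fact set along two dimensions, similar to viewing a slice of a data cube; column names are illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2"],
    "item":       ["phone", "laptop", "phone", "laptop"],
    "location":   ["Chennai", "Chennai", "Madurai", "Madurai"],
    "units_sold": [120, 45, 90, 60],
})

# View the same facts along two dimensions at once (a small "cube" slice).
cube = pd.pivot_table(sales, values="units_sold",
                      index="item", columns="time", aggfunc="sum")
print(cube)
```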

3.Write notes on i) Database parallelism ii) Data warehouse tools. (AU-A/M2024)

i)Database parallelism

 Parallelism is used to support speedup, where queries are executed faster because more
resources, such as processors and disks, are provided. Parallelism is also used to provide
scale-up, where increasing workloads are managed without increase response-time, via
an increase in the degree of parallelism.
 Different architectures for parallel database systems are shared-memory, shared-disk,
shared-nothing, and hierarchical structures.
 (a)Horizontal Parallelism: It means that the database is partitioned across multiple
disks, and parallel processing occurs within a specific task (i.e., table scan) that is
performed concurrently on different processors against different sets of data.
 (b)Vertical Parallelism: It occurs among various tasks. All component query
operations (i.e., scan, join, and sort) are executed in parallel in a pipelined fashion. In


other words, an output from one function (e.g., join) as soon as records become
available.

ii)Data warehouse tools

Commonly used data warehouse tools and platforms include Amazon Redshift, Google
BigQuery, Microsoft Azure Synapse Analytics, and Snowflake. These managed platforms
provide scalable storage, query processing, and data loading facilities for analytical workloads.
4.Explain the concept of data warehouse tuning and testing. (AU-A/M2024)

A data warehouse keeps evolving and it is unpredictable what query the user is going to post
in the future. Therefore it becomes more difficult to tune a data warehouse system. In this
chapter, we will discuss how to tune the different aspects of a data warehouse such as
performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

Tuning a data warehouse is a difficult procedure due to the following reasons −

• Data warehouse is dynamic; it never remains constant.

• It is very difficult to predict what query the user is going to post in the future.

• Business requirements change with time.

• Users and their profiles keep changing.

• The user can switch from one group to another.

• The data load on the warehouse also changes with time.

Performance Assessment

Here is a list of objective measures of performance −

• Average query response time


• Scan rates

• Time used per day query

• Memory usage per process

• I/O throughput rates

• It is necessary to specify the measures in service level agreement (SLA).

• It is of no use trying to tune response time, if they are already better than those
required.

• It is essential to have realistic expectations while making performance assessment.

• It is also essential that the users have feasible expectations.

• To hide the complexity of the system from the user, aggregations and views should
be used.

• It is also possible that the user can write a query you had not tuned for.

Data Load Tuning

Data load is a critical part of overnight processing. Nothing else can run until data load is
complete. This is the entry point into the system.

There are various approaches of tuning data load that are discussed below −

• The very common approach is to insert data using the SQL Layer. In this approach,
normal checks and constraints need to be performed. When the data is inserted into
the table, the code will run to check for enough space to insert the data. If sufficient
space is not available, then more space may have to be allocated to these tables.
These checks take time to perform and are costly to CPU.

• The second approach is to bypass all these checks and constraints and place the data
directly into the preformatted blocks. These blocks are later written to the database. It
is faster than the first approach, but it can work only with whole blocks of data. This
can lead to some space wastage.

• The third approach is that while loading the data into the table that already contains
the table, we can maintain indexes.


• The fourth approach says that to load the data in tables that already contain data,
drop the indexes & recreate them when the data load is complete. The choice
between the third and the fourth approach depends on how much data is already
loaded and how many indexes need to be rebuilt.
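A small sketch of the fourth approach using Python's sqlite3 module: the index is dropped before the bulk load and rebuilt once the load completes. Table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (sale_id INTEGER, store TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_sales_store ON sales(store)")

incoming_rows = [(i, f"store_{i % 5}", 10.0 * i) for i in range(10_000)]

cur.execute("DROP INDEX idx_sales_store")                    # avoid index maintenance per row
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", incoming_rows)
cur.execute("CREATE INDEX idx_sales_store ON sales(store)")  # rebuild once, after the load
conn.commit()

cur.execute("SELECT COUNT(*) FROM sales")
print(cur.fetchone())
```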

Integrity Checks

Integrity checking highly affects the performance of the load. Following are the points to
remember −

• Integrity checks need to be limited because they require heavy processing power.

• Integrity checks should be applied on the source system to avoid performance degradation
during data load.

Tuning Queries

We have two kinds of queries in data warehouse −

• Fixed queries

• Ad hoc queries

Fixed Queries

Fixed queries are well defined. Following are the examples of fixed queries −

• regular reports

• Canned queries

• Common aggregations

Tuning the fixed queries in a data warehouse is same as in a relational database system. The
only difference is that the amount of data to be queried may be different. It is good to store
the most successful execution plan while testing fixed queries. Storing these executing plan
will allow us to spot changing data size and data skew, as it will cause the execution plan to
change.
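A hedged sketch of storing the execution plan of a fixed query, using SQLite's EXPLAIN QUERY PLAN as a stand-in for the plan facilities of a production warehouse; table and report names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [("south", 100.0), ("north", 250.0), ("south", 75.0)])

fixed_query = "SELECT region, SUM(amount) FROM orders GROUP BY region"
plan = cur.execute("EXPLAIN QUERY PLAN " + fixed_query).fetchall()

# Keep the plan alongside the query definition so later runs can be compared
# against it as data volumes and data skew change.
saved_plans = {"monthly_region_report": plan}
print(saved_plans)
```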

Ad hoc Queries


To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse.
For each user or group of users, you need to know the following −

• The number of users in the group

• Whether they use ad hoc queries at regular intervals of time

• Whether they use ad hoc queries frequently

• Whether they use ad hoc queries occasionally at unknown intervals.

• The maximum size of query they tend to run

• The average size of query they tend to run

• Whether they require drill-down access to the base data

• The elapsed login time per day

• The peak time of daily usage

• The number of queries they run per peak hour

Tuning

• It is important to track the user's profiles and identify the queries that are run on a
regular basis.

• It is also important that the tuning performed does not affect the performance.

• Identify similar and ad hoc queries that are frequently run.

• If these queries are identified, then the database will change and new indexes can be
added for those queries.

• If these queries are identified, then new aggregations can be created specifically for
those queries that would result in their efficient execution.

Testing

Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing performed on a data warehouse −

• Unit testing

• Integration testing

• System testing


Unit Testing

• In unit testing, each component is separately tested.

• Each module, i.e., procedure, program, SQL Script, Unix shell is tested.

• This test is performed by the developer.

Integration Testing

In integration testing, the various modules of the application are brought together and
then tested against the number of inputs.

It is performed to test whether the various components do well after integration.

System Testing

• In system testing, the whole data warehouse application is tested together.

• The purpose of system testing is to check whether the entire system works correctly
together or not.

• System testing is performed by the testing team.

• Since the size of the whole data warehouse is very large, it is usually possible to
perform minimal system testing before the test plan can be enacted.

Test Schedule

First of all, the test schedule is created in the process of developing the test plan. In
this schedule, we predict the estimated time required for the testing of the entire data
warehouse system.

There are different methodologies available to create a test schedule, but none of them are
perfect because the data warehouse is very complex and large. Also the data warehouse
system is evolving in nature. One may face the following issues while creating a test
schedule −

• A simple problem may involve a large query that can take a day or more to complete, i.e.,
the query does not complete in the desired time scale.


• There may be hardware failures such as losing a disk or human errors such as
accidentally deleting a table or overwriting a large table.

Testing Backup Recovery

Testing the backup recovery strategy is extremely important. Here is the list of scenarios for
which this testing is needed −

• Media failure

• Loss or damage of table space or data file

• Loss or damage of redo log file

• Loss or damage of control file

• Instance failure

• Loss or damage of archive file

• Loss or damage of table

• Failure during data movement

Testing Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

• Security − A separate security document is required for security testing. This document
contains a list of disallowed operations, and tests are devised for each.

• Scheduler − Scheduling software is required to control the daily operations of a data
warehouse. It needs to be tested during system testing. The scheduling software requires an
interface with the data warehouse, which will need the scheduler to control overnight
processing and the management of aggregations.

• Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks.
The test should be performed multiple times with different settings.

• Management Tools − It is required to test all the management tools during system testing.

Here is the list of tools that need to be tested.

• Event manager

• System manager

• Database manager

• Configuration manager

• Backup recovery manager

5.Describe and brief about various schemas in multidimensional data model. (AU-
N/D2023)

Schemas for Multidimensional Data Model are:-

• Star Schema

• Snowflakes Schema

• Fact Constellations Schema

1. Star Schema

 It is the data warehouse schema that contains two types of tables: Fact Table and
Dimension Tables. Fact Table lies at the center point and dimension tables are
connected with fact table such that star share is formed.

→ Fact Tables: A fact table typically has two types of columns: foreign keys to dimension
tables and measures that contain numeric facts. A fact table can contain fact data on detail or
aggregated level.

→ Dimension Tables: Dimension tables usually have a relatively small number of records
compared to fact tables, but each record may have a very large number of attributes to describe
the fact data. Each dimension in the star schema has only one dimension table and each table
holds a set of attributes. This constraint may cause data redundancy. The following diagram
shows the sales data of a company with respect to the four dimensions, namely time, item,
branch, and location.


 There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.

Advantages and Disadvantages of Star Schema

 Since star schema contains de-normalized dimension tables, it leads to simpler queries
due to a lesser number of join operations and it also leads to better system performance.

 On the other hand, it is difficult to maintain the integrity of data in star schema due to
de-normalized tables. It is the widely used data warehouse schema and is also
recommended by oracle

2. Snowflakes Schema.

 The snowflake schema is a variant of the star schema model, where some dimension
table is normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.


 For example, the item dimension table in a star schema is normalized and split into two
dimension tables, namely item and supplier table.

Advantages and Disadvantages of Snowflakes Schema

 Due to normalization table is easy to maintain and saves storage space. However, this
saving of space is negligible in comparison to the typical magnitude of the fact table.

 Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted.

 Hence, although the snowflake schema reduces redundancy, it is not as popular as the
star schema in data warehouse design.

3. Fact Constellations Schema


 This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.

 A fact constellation schema allows dimension tables to be shared between fact tables.

For example, the following schema specifies two fact tables, sales and shipping.


- The shipping table has five dimensions or keys: item key, time key, shipper key, from
location, and to location, and two measures: dollars cost and units shipped.

6.Discuss the various types of database parallelism with suitable example. (AU-N/D2023)

Types of Database Parallelism

Parallelism is used to support speedup, where queries are executed faster because more
resources, such as processors and disks, are provided. Parallelism is also used to provide scale-
up, where increasing workloads are managed without increase response-time, via an increase
in the degree of parallelism.

Different architectures for parallel database systems are shared-memory, shared-disk,
shared-nothing, and hierarchical structures.

(a)Horizontal Parallelism: It means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task (i.e., table scan) that is performed concurrently
on different processors against different sets of data.


(b)Vertical Parallelism: It occurs among various tasks. All component query operations (i.e.,
scan, join, and sort) are executed in parallel in a pipelined fashion. In other words, an output
from one function (e.g., join) as soon as records become available.
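A small Python sketch of horizontal parallelism as described above: the rows are partitioned and the same scan runs concurrently on each partition via a process pool; the data and the partitioning rule are illustrative:

```python
from multiprocessing import Pool

def scan_partition(rows):
    """The "table scan" performed concurrently on each partition."""
    return sum(amount for _, amount in rows)

if __name__ == "__main__":
    orders = [(i, float(i % 50)) for i in range(100_000)]
    # Partition the rows across four "disks" (here, four slices of the list).
    partitions = [orders[0::4], orders[1::4], orders[2::4], orders[3::4]]

    with Pool(processes=4) as pool:
        partial_sums = pool.map(scan_partition, partitions)

    print(sum(partial_sums))   # combine the per-partition results
```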

Intraquery Parallelism

Intraquery parallelism defines the execution of a single query in parallel on multiple processors
and disks. Using intraquery parallelism is essential for speeding up long-running queries.

Interquery parallelism does not help in this function since each query is run sequentially. To
improve the situation, many DBMS vendors developed versions of their products that utilized
intraquery parallelism.

This application of parallelism decomposes the serial SQL, query into lower-level operations
such as scan, join, sort, and aggregation.

These lower-level operations are executed concurrently, in parallel.

Interquery Parallelism

In interquery parallelism, different queries or transaction execute in parallel with one another.
This form of parallelism can increase transactions throughput. The response times of
individual transactions are not faster than they would be if the transactions were run in
isolation.

Thus, the primary use of interquery parallelism is to scale up a transaction processing system
to support a more significant number of transactions per second.


Database vendors started to take advantage of parallel hardware architectures by implementing
multiserver and multithreaded systems designed to handle a large number of client requests
efficiently.

This approach naturally resulted in interquery parallelism, in which different server threads (or
processes) handle multiple requests at the same time.

Interquery parallelism has been successfully implemented on SMP systems, where it increased
the throughput and allowed the support of more concurrent users.

Shared Disk Architecture

Shared-disk architecture implements a concept of shared ownership of the entire database
between RDBMS servers, each of which is running on a node of a distributed memory system.
Each RDBMS server can read, write, update, and delete information from the same shared
database, which would need the system to implement a form of a distributed lock manager
(DLM).

DLM components can be found in hardware, the operating system, and separate software layer,
all depending on the system vendor.

On the positive side, shared-disk architectures can reduce performance bottlenecks resulting
from data skew (uneven distribution of data), and can significantly increase system
availability.

The shared-disk distributed memory design eliminates the memory access bottleneck typically
of large SMP systems and helps reduce DBMS dependency on data partitioning.


Shared-Memory Architecture
Shared-memory or shared-everything style is the traditional approach of implementing an
RDBMS on SMP hardware.

It is relatively simple to implement and has been very successful up to the point where it runs
into the scalability limitations of the shared-everything architecture.

The key point of this technique is that a single RDBMS server can probably apply all
processors, access all memory, and access the entire database, thus providing the client with a
consistent single system image.

In shared-memory SMP systems, the DBMS considers that the multiple database components
executing SQL statements communicate with each other by exchanging messages and
information via the shared memory.

All processors have access to all data, which is partitioned across local disks.

Shared-Nothing Architecture
In a shared-nothing distributed memory environment, the data is partitioned across all disks,
and the DBMS is "partitioned" across multiple co-servers, each of which resides on individual
nodes of the parallel system and has an ownership of its disk and thus its database partition.

A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple
processing nodes.

Each processor has its memory and disk and communicates with other processors by
exchanging messages and data over the interconnection network. This architecture is
optimized specifically for the MPP and cluster systems.


The shared-nothing architectures offer near-linear scalability. The number of processor nodes
is limited only by the hardware platform limitations (and budgetary constraints), and each node
itself can be a powerful SMP system.

7.Discuss the various access types to data stored in a data warehouse and consider the data
types of online shopping. (AU-N/D2023)

Access Types in Data Warehouses

Data warehouses are designed to support decision-making processes by enabling fast and
efficient access to structured data. The main access types include:

1. Batch Access

 Used for large-scale processing jobs.


 Often scheduled during off-peak hours.
 Ideal for generating reports or populating dashboards.

2. Online Analytical Processing (OLAP)

 Enables multidimensional analysis of data (e.g., slicing, dicing, pivoting).


 Supports complex queries and trend analysis.
 Common in business intelligence tools.

3. Real-Time Access

 Allows for immediate querying of data as it arrives.


 Useful for time-sensitive operations, like fraud detection.
 Requires integration with streaming or event-based systems.

4. Ad-Hoc Query Access

 Lets users run spontaneous, customized queries.


 Helpful for exploring data without predefined reports.
 Often used by analysts for deeper insights.

5. API or Programmatic Access



 Data can be accessed via APIs for integration with apps or external tools.
 Supports automated workflows and data-driven apps.

Data Types in Online Shopping

Online shopping platforms generate and utilize a variety of data types to offer personalized
and efficient experiences:

• User Data (e.g., name, address, contact info): used for account creation, shipping, and customer support.
• Transaction Data (e.g., orders, payment methods, invoices): used for purchase history and financial reports.
• Product Data (e.g., SKU, descriptions, prices, stock levels): used for catalog management and dynamic pricing.
• Clickstream Data (e.g., page visits, time spent, actions taken): used for UX optimization and marketing strategy.
• Review and Rating Data (e.g., product feedback, star ratings): used for reputation management and product improvements.
• Social Data (e.g., shares, likes, referral codes): used for social proof and targeted promotions.
• Sensor/IoT Data (e.g., delivery status via GPS, smart device interactions): used for logistics tracking and smart reordering.

8.Explain how using a familiar approach can help reduce training costs associated with
building a query and reporting environment, with a suitable example. (AU-N/D2023)

Familiar Approach: Use of Microsoft Excel or SQL-Based Tools

Instead of introducing a complex BI platform from scratch, use tools that most
employees already understand:

🔹 Excel as a Reporting Tool

 Most business users are proficient in Excel.


 Excel can connect to databases using Power Query or ODBC connections.
 Pivot tables and charts allow for dynamic reporting without coding.


Example: A retail company needs sales reporting across regions. Instead of training
everyone on Power BI or Tableau, they create Excel templates with built-in queries to
fetch sales data. Regional managers simply refresh the data and use pre-formatted pivot
tables. This cuts training time dramatically, as they already understand Excel.

🔹 SQL-Based Tools

 If your staff knows SQL, opt for platforms that support SQL queries directly.
 Tools like SQL Server Reporting Services (SSRS) or Looker allow custom
SQL queries with minimal additional training.

Example: An insurance company wants agents to generate custom reports. Instead of
introducing a non-SQL based tool, they use SSRS with predefined SQL templates. Agents just
modify conditions (e.g., date or region), reducing the need for new training modules.
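A hedged sketch of the predefined-template idea in Python with sqlite3: the SQL text stays fixed and users change only the parameters, so no new tool has to be learned. The table, columns, and report contents are made up for illustration:

```python
import sqlite3

REPORT_TEMPLATE = """
    SELECT region, COUNT(*) AS policies, SUM(premium) AS total_premium
    FROM policies
    WHERE region = ? AND issue_date BETWEEN ? AND ?
    GROUP BY region
"""

def run_report(conn, region, start_date, end_date):
    """Users supply only the conditions; the query text never changes."""
    return conn.execute(REPORT_TEMPLATE, (region, start_date, end_date)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (region TEXT, premium REAL, issue_date TEXT)")
conn.executemany("INSERT INTO policies VALUES (?, ?, ?)",
                 [("south", 500.0, "2024-01-10"), ("south", 750.0, "2024-02-01"),
                  ("north", 300.0, "2024-01-20")])
print(run_report(conn, "south", "2024-01-01", "2024-03-31"))
```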

9.Propose a fact constellation schema for a healthcare data warehouse to support complex
analytical queries.

10.Describe the process architecture involved in designing and implementing a star schema.

11.Explain how the snowflake schema differs from the star schema in terms of normalization.


Unit-5

PART-A

1. Name two types of data warehousing system managers and their roles. (AU-
A/M2024)
 Data warehouse Administrator
 ETL Manager
2. What is the function of load manager in data warehousing? (AU-A/M2024)
The load manager—sometimes called the data load subsystem—is a crucial component
of data warehousing. It serves as the gatekeeper for all data entering the warehouse.

3. Define query manager. (AU-N/D2023)


The query manager is responsible for directing the queries to suitable tables. By
directing the queries to appropriate tables, it speeds up the query request and response
process. In addition, the query manager is responsible for scheduling the execution of
the queries posted by the user.

4. What are the various sources of data warehousing?(AU-N/D2023)


 Operational database
 Flat files
 Third party data providers
5. Recall the responsibilities of a data warehousing system configuration
manager.(AU-N/D2023)
 System setup and maintenance
 Version and patch management
 Performance monitoring
 Security management.
6. Outline the primary task of data warehousing system back up recovery manager.
(AU-N/D2023)
 Data back up operations
 Recovery Planning & Execution
 Version and configuration tracking
 Scheduling and automation
7. What is Database Tuning?


Database tuning is the process of optimizing the performance of a database system to
ensure it runs as efficiently and quickly as possible. It’s a bit like fine-tuning an
engine—making small adjustments that can lead to big gains in speed and
responsiveness.
8. What is Testing?
Testing is the process of evaluating a system or its component (s) with the intent to find
whether it satisfies the specified requirements or not.
9. What are the functions of Load Manager?
The load manager does performs the following functions −
 Extract data from the source system.
 Fast load the extracted data into temporary data store.
 Perform simple transformations into structure similar to the one in the data
warehouse.
10. What is the role of query manager?
A query manager is a user interface for data professionals such as database
administrators, data engineers, developers, and others to write, store and run queries.
Without query managers, database access would be limited to application programs.

PART-B

1.Discuss the role of system and process manager in a data warehousing environment.
(AU-A/M2024)

System Managers

System management is mandatory for the successful implementation of a data warehouse.


The most important system managers are −

• System configuration manager


• System scheduling manager
• System event manager
• System database manager
• System backup recovery manager

System Configuration Manager


• The system configuration manager is responsible for the management of the setup and
configuration of data warehouse.
• The structure of configuration manager varies from one operating system to another.
• In Unix structure of configuration, the manager varies from vendor to vendor.
• Configuration managers have single user interface.
• The interface of configuration manager allows us to control all aspects of the system.

Note − The most important configuration tool is the I/O manager.

System Scheduling Manager

System Scheduling Manager is responsible for the successful implementation of the data
warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own
scheduler with some form of batch control mechanism. The list of features a system
scheduling manager must have is as follows −
• Work across cluster or MPP boundaries
• Deal with international time differences
• Handle job failure
• Handle multiple queries
• Support job priorities
• Restart or re-queue the failed jobs
• Notify the user or a process when job is completed
• Maintain the job schedules across system outages
• Re-queue jobs to other queues
• Support the stopping and starting of queues
• Log Queued jobs
• Deal with inter-queue processing

Note − The above list can be used as evaluation parameters for the evaluation of a good
scheduler.
Some important jobs that a scheduler must be able to handle are as follows −

• Daily and ad hoc query scheduling


• Execution of regular report requirements
• Data load
• Data processing
• Index creation

• Backup
• Aggregation creation
• Data transformation

Note − If the data warehouse is running on a cluster or MPP architecture, then the system
scheduling manager must be capable of running across the architecture.

System Event Manager

The event manager is a kind of a software. The event manager manages the events that are
defined on the data warehouse system. We cannot manage the data warehouse manually
because the structure of data warehouse is very complex. Therefore we need a tool that
automatically handles all the events without any intervention of the user.

Note − The Event manager monitors the events occurrences and deals with them. The event
manager also tracks the myriad of things that can go wrong on this complex data warehouse
system.

Events

Events are the actions that are generated by the user or the system itself. It may be noted that
the event is a measurable, observable, occurrence of a defined action.
Given below is a list of common events that are required to be tracked.

• Hardware failure
• Running out of space on certain key disks
• A process dying
• A process returning an error
• CPU usage exceeding an 80% threshold
• Internal contention on database serialization points
• Buffer cache hit ratios exceeding or falling below threshold
• A table reaching to maximum of its size
• Excessive memory swapping
• A table failing to extend due to lack of space
• Disk exhibiting I/O bottlenecks
• Usage of temporary or sort area reaching a certain thresholds
• Any other database shared memory usage


The most important thing about events is that they should be capable of executing on their
own. Event packages define the procedures for the predefined events. The code associated
with each event is known as event handler. This code is executed whenever an event occurs.
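A minimal Python sketch of this idea: each predefined event name is mapped to an event handler, and the manager executes the handler whenever the event occurs. The event names, details, and thresholds are illustrative only:

```python
# Registry mapping predefined event names to their handler code.
event_handlers = {}

def on_event(name):
    def register(func):
        event_handlers[name] = func
        return func
    return register

@on_event("disk_space_low")
def handle_disk_space_low(details):
    print(f"ALERT: disk {details['disk']} has only {details['free_mb']} MB free")

@on_event("process_returned_error")
def handle_process_error(details):
    print(f"Re-queueing job {details['job_id']} after error {details['code']}")

def raise_event(name, details):
    handler = event_handlers.get(name)
    if handler:
        handler(details)        # execute the handler code for this event

raise_event("disk_space_low", {"disk": "/dw/data01", "free_mb": 512})
```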

System and Database Manager

System and database manager may be two separate pieces of software, but they do the same
job.
The objective of these tools is to automate certain processes and to simplify the execution of
others. The criteria for choosing a system and the database manager are as follows −
• increase user's quota.
• assign and de-assign roles to the users
• assign and de-assign the profiles to the users
• perform database space management

System Backup Recovery Manager

The backup and recovery tool makes it easy for operations and management staff to back-
up the data. Note that the system backup manager must be integrated with the schedule
manager software being used. The important features that are required for the management
of backups are as follows −
• Scheduling
• Backup data tracking
• Database awareness

Backups are taken only to protect against data loss. Following are the important points to
remember −
• The backup software will keep some form of database of where and when the piece of
data was backed up.
• The backup recovery manager must have a good front-end to that database.
• The backup recovery software should be database aware.
• Being aware of the database, the software then can be addressed in database terms, and
will not perform backups that would not be viable.

Data Warehousing - Process Managers
Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers −
• Load manager
• Warehouse manager
• Query manager


Data Warehouse Load Manager

Load manager performs the operations required to extract and load the data into the
database. The size and complexity of a load manager varies between specific solutions from
one data warehouse to another.

Load Manager Architecture

The load manager does performs the following functions −

• Extract data from the source system.


• Fast load the extracted data into temporary data store.
• Perform simple transformations into structure similar to the one in the data warehouse.

Extract Data from Source


The data is extracted from the operational databases or the external information providers.
Gateways are the application programs that are used to extract data. It is supported by
underlying DBMS and allows the client program to generate SQL to be executed at a server.
Open Database Connection (ODBC) and Java Database Connection (JDBC) are examples
of gateway.

Fast Load

• In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
• Transformations affect the speed of data processing.
• It is more effective to load the data into a relational database prior to applying
transformations and checks.
• Gateway technology is not suitable, since they are inefficient when large data volumes
are involved.

Simple Transformations


While loading, it may be required to perform simple transformations. After completing
simple transformations, we can do complex checks. Suppose we are loading the EPOS sales
transaction; we need to perform the following checks −

• Strip out all the columns that are not required within the warehouse.
• Convert all the values to required data types.
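A small Python sketch of these two checks for an EPOS-style sales record: unneeded columns are stripped and the remaining values are converted to the required data types; the field names are illustrative:

```python
# The columns the warehouse needs, with the data type each must be converted to.
REQUIRED_COLUMNS = {"txn_id": int, "store_code": str, "amount": float, "txn_date": str}

def transform(raw_record):
    clean = {}
    for column, required_type in REQUIRED_COLUMNS.items():
        clean[column] = required_type(raw_record[column])   # convert to required type
    return clean                                             # unneeded columns are stripped

raw = {"txn_id": "1001", "store_code": 57, "amount": "199.99",
       "txn_date": "2024-03-01", "cashier_note": "loyalty card used"}
print(transform(raw))
# {'txn_id': 1001, 'store_code': '57', 'amount': 199.99, 'txn_date': '2024-03-01'}
```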

Warehouse Manager

The warehouse manager is responsible for the warehouse management process. It consists
of a third-party system software, C programs, and shell scripts. The size and complexity of
a warehouse manager varies between specific solutions.

Warehouse Manager Architecture

A warehouse manager includes the following −

• The controlling process


• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts

Functions of Warehouse Manager

A warehouse manager performs the following functions −

• Analyzes the data to perform consistency and referential integrity checks.


• Creates indexes, business views, partition views against the base data.


• Generates new aggregations and updates the existing aggregations.


• Generates normalizations.

Query Manager

The query manager is responsible for directing the queries to suitable tables. By directing
the queries to appropriate tables, it speeds up the query request and response process. In
addition, the query manager is responsible for scheduling the execution of the queries
posted by the user.

Query Manager Architecture

A query manager includes the following components −

• Query redirection via C tool or RDBMS


• Stored procedures
• Query management tool
• Query scheduling via C tool or RDBMS
• Query scheduling via third-party software

Functions of Query Manager

• It presents the data to the user in a form they understand.



• It schedules the execution of the queries posted by the end-user.


• It stores query profiles to allow the warehouse manager to determine which indexes
and aggregations are appropriate.

2. Outline a key strategy for ensuring a data warehousing solution remains scalable as a
tech startup grows. Highlight one specific technology or approach that could be utilized.
(AU-A/M2024)

To ensure your data warehousing solution scales seamlessly with startup growth, design for
elasticity from day one. That means building a modular, cloud-native architecture that can
expand (or contract) as data volumes, users, and analytical demands evolve. Focus on
decoupling storage and compute so you can scale each independently—this ensures you're
not locked into a rigid infrastructure that becomes a bottleneck as you grow.

Specific Technology: Snowflake

A standout in this space is Snowflake, a cloud-native data platform built precisely for
scalability and flexibility:

 Decoupled Storage & Compute: Snowflake allows you to scale compute resources up
or down without impacting your storage, or vice versa.
 Multi-Cluster Warehouses: As workloads increase, Snowflake can automatically spin
up additional compute clusters to handle the demand without user intervention.
 Zero Management: No infrastructure to manage—updates, scaling, and availability
are handled automatically.
 Data Sharing & Marketplace: Easily share data across teams or even externally with
partners, reducing data duplication and silos.

Bonus: Snowflake integrates well with major cloud providers (AWS, Azure, GCP), giving you
flexibility and avoiding vendor lock-in.
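A heavily hedged sketch, assuming the snowflake-connector-python package and a warehouse named ANALYTICS_WH (the account details, table, and sizes are placeholders, and multi-cluster or resize options depend on the Snowflake edition in use), of how compute can be resized independently of storage:

```python
import snowflake.connector

# Placeholder connection details; storage (the data) stays in place while the
# virtual warehouse (compute) is resized around it.
conn = snowflake.connector.connect(
    account="my_account", user="analyst", password="***",
    warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Scale up for a heavy end-of-month workload, then scale back down afterwards.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE'")
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
print(cur.fetchall())
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL'")
```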

3.Describe how modernizing the data warehouse architecture could improve data analytics
and decision making for a global retail company. Focus on one major limitation of the old
system and how a modern solution addresses it. (AU-A/M2024)


4. Discuss in detail the data warehouse data storage and accessibility for the Government of
Tamil Nadu for any two schemes in practice. (AU-N/D2023)

Tamil Nadu has adopted a centralized and federated data storage model under its Tamil
Nadu Data Policy 2022, which aims to:

 Enable data-driven decision-making


 Promote inter-departmental coordination
 Ensure privacy and security
 Facilitate open data sharing for public good

The State Data Warehouse integrates datasets from various departments, allowing real-time
access, analytics, and monitoring of scheme performance.

1. Pudhumai Penn Scheme

Overview: This scheme provides monthly financial assistance of ₹1,000 to girl students
pursuing higher education after completing school in government institutions.

Data Storage & Accessibility

 Beneficiary Data: Collected from school and college databases, Aadhaar-linked for
verification.
 Integration: Linked with the Education Department’s MIS and Treasury systems
for fund disbursement.
 Accessibility:
o Officials access dashboards for real-time monitoring of enrollment and
payments.
o Students can track their application status via public portals.
 Security: Data is encrypted and stored in state-run data centers under the Tamil
Nadu Data Centre Policy 2021.

Impact

 Helps track dropout rates, gender parity, and educational outcomes.


 Enables predictive analytics for future policy planning.


2. Chief Minister’s Comprehensive Health Insurance Scheme (CMCHIS)

Overview: Provides free medical treatment to economically weaker sections in empaneled
hospitals across Tamil Nadu.

Data Storage & Accessibility

 Health Records: Digitized and stored in the Health Department’s data warehouse.
 Hospital Integration: Linked with hospital management systems for claim
processing.
 Accessibility:
o Government officials use analytics dashboards to monitor hospital
performance and claim volumes.
o Beneficiaries can access scheme details via mobile apps and e-Sevai centers.
 Privacy: Complies with data classification and privacy norms under TNDP.

Impact

 Enables fraud detection, cost analysis, and service quality tracking.


 Supports policy refinement based on treatment patterns and demographic data.

Key Technologies Used

• Data Warehouse: Central repository for structured data from multiple departments.
• Dashboards & Portals: Interfaces for officials and citizens to access scheme-related information.
• Analytics Engines: Tools for trend analysis, forecasting, and performance evaluation.
• Security Framework: Encryption, access control, and privacy compliance.

5.Explain the relationship between load manager and warehouse manager in data
warehousing process.(Refer Question No:1)


6.Explain the various data warehousing process managers and their collaboration and
interactions to ensure the operation and performance of the entire data warehousing
system. (Refer Q.No:1)

7.Justify the statement “the testing process contributes to the overall quality of a data
warehousing system” and describe the types of tests conducted on the system.

Testing isn’t just a checkbox in the development lifecycle of a data warehouse—it’s a core
pillar that ensures:

 Data Accuracy: Detects errors in data extraction, transformation, and loading (ETL)
processes.
 System Reliability: Verifies that the warehouse performs consistently under varying
loads and queries.
 Performance Optimization: Identifies bottlenecks in query performance or loading
operations.
 Security Assurance: Ensures sensitive data is protected and access is appropriately
controlled.
 User Satisfaction: Validates that the reports, dashboards, and analytics behave as
expected.

In short, without proper testing, a data warehouse might look fine from the outside—but could
collapse under real-world use. Quality testing helps guarantee that the system delivers
trustworthy data for critical decision-making.

Types of Tests in a Data Warehousing System

Here are the key testing types used across the lifecycle of a data warehouse:

• Unit Testing: Verifies individual ETL components, scripts, or functions work correctly in isolation.
• Integration Testing: Ensures that the various components (e.g., data sources, ETL tools, the warehouse) interact properly.
• System Testing: Evaluates the overall system for functionality, performance, and compliance.
• Data Validation Testing: Confirms data quality, integrity, and accuracy post-load; checks for duplicates, missing fields, or mismatches.
• Regression Testing: Makes sure that changes or updates have not broken existing features or data flows.
• Performance Testing: Gauges system speed, throughput, and scalability under real-world conditions.
• Security Testing: Tests data encryption, access controls, and auditing features.
• User Acceptance Testing (UAT): Involves end-users testing the system to validate real-world functionality and usability.
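A small Python sketch of a data validation test run after a load: it checks the loaded rows for duplicate keys and missing mandatory fields; the column names and sample data are illustrative:

```python
loaded_rows = [
    {"order_id": 1, "customer": "Asha", "amount": 120.0},
    {"order_id": 2, "customer": None,   "amount": 80.0},
    {"order_id": 2, "customer": "Ravi", "amount": 80.0},
]

def validate(rows, key_field, mandatory_fields):
    """Return duplicate key values and (key, field) pairs with missing values."""
    seen, duplicates, missing = set(), [], []
    for row in rows:
        key = row[key_field]
        if key in seen:
            duplicates.append(key)
        seen.add(key)
        for name in mandatory_fields:
            if row.get(name) in (None, ""):
                missing.append((key, name))
    return duplicates, missing

print(validate(loaded_rows, "order_id", ["customer", "amount"]))
# ([2], [(2, 'customer')])
```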
