1. Define and explain data warehouse.
(4/6 marks) (previous qp)
A data warehouse is a centralized system used to store, manage, and
analyze large volumes of data collected from multiple sources. It is designed
specifically for query and analysis rather than transaction processing.
Key Points:
- Data warehouses store historical and current data to support business intelligence (BI) activities.
- Data is extracted from different operational systems, cleaned, transformed, and loaded into the warehouse (the ETL process).
- They support decision-making by providing easy access to organized and integrated data.
Example: A supermarket chain collects data from all branches, integrates it
in a central warehouse, and analyzes sales trends, customer preferences,
etc.
3. Explain the following multidimensional database schemas with
examples. (3 marks each)
a) Star Schema: This schema has a central fact table surrounded by
dimension tables. It is simple and gives fast query performance, since each
dimension is reached through a single join. Example: A Sales fact table with
columns such as sale_id, date_id, product_id, and amount, connected to
dimension tables such as Time, Product, and Store (a short code sketch
follows part c below).
b) Snowflake Schema: This is a more complex version of the star schema in
which dimension tables are normalized into multiple related tables. This
saves storage space but can make queries slower because of the extra joins.
Example: The Product dimension may be split into Product and Category
tables, so the Sales fact table connects to Product, which in turn connects
to Category.
c) Fact Constellation Schema: Also known as a galaxy schema, it contains
multiple fact tables that share common dimension tables. Example: Sales and
Shipping are two fact tables connected to shared dimensions such as Time
and Product.
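For illustration, a minimal Python sketch of the star-schema idea (using pandas; the table and column names are made up): measures live in a fact table whose foreign keys join to denormalized dimension tables at query time.

import pandas as pd

# Dimension tables (denormalized, as in a star schema)
time_dim = pd.DataFrame({"date_id": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
product_dim = pd.DataFrame({"product_id": [10, 20], "name": ["Soap", "Rice"],
                            "category": ["Toiletries", "Grocery"]})

# Fact table: foreign keys to the dimensions plus numeric measures
sales_fact = pd.DataFrame({
    "sale_id": [100, 101, 102],
    "date_id": [1, 1, 2],
    "product_id": [10, 20, 10],
    "amount": [120.0, 450.0, 80.0],
})

# A typical analytical query: total sales amount per month and product category
result = (sales_fact
          .merge(time_dim, on="date_id")
          .merge(product_dim, on="product_id")
          .groupby(["month", "category"])["amount"].sum())
print(result)

In a snowflake schema the category column would instead sit in a separate Category table, requiring one more join.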
4. Define Measure. Explain different categories of measures. (4
marks) (previous qp)
A measure is a numeric value on which analysis is performed in a data
warehouse (e.g., sales amount, quantity sold). Measures are stored in fact
tables and are used in calculations and aggregations.
Types of Measures:
1. Distributive: Can be computed by partitioning the data, aggregating each
partition separately, and combining the partial results (e.g., count, sum,
min, max).
2. Algebraic: Can be computed by an algebraic function of a fixed number of
distributive measures (e.g., average = sum / count).
3. Holistic: Cannot be computed from partitions; the entire dataset is
required (e.g., median, mode, rank).
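A small Python sketch (with made-up numbers) contrasting the categories: partial sums from partitions combine into the exact total, the average is derived from two distributive measures, while the median generally needs all the values.

import statistics

# Hypothetical data split into two partitions (e.g., two branches)
part1 = [4, 8, 15]
part2 = [16, 23, 42]

# Distributive: combine per-partition aggregates
total = sum(part1) + sum(part2)        # equals sum(part1 + part2)
count = len(part1) + len(part2)

# Algebraic: a function of a fixed number of distributive measures
average = total / count

# Holistic: the overall median cannot, in general, be derived from the
# partition medians; it needs the whole dataset.
overall_median = statistics.median(part1 + part2)

print(total, average, overall_median)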
5. Explain any four/six OLAP operations. (4/6 marks) (previous qp)
OLAP (Online Analytical Processing) operations are used to analyze
multidimensional data from different perspectives.
OLAP Operations in Multidimensional Data Analysis
i. Slice
Definition: Slicing refers to selecting a single value or member from
one dimension of a multidimensional dataset, creating a subset of the
data for focused analysis.
Purpose: Allows users to examine data for a specific criterion within
one dimension while keeping all other dimensions fixed.
Example: Analyzing sales data for a specific month (e.g., March) to
evaluate sales performance during that time.
ii. Dice
Definition: Dicing involves selecting a subset of data by specifying
values or ranges for multiple dimensions simultaneously.
Purpose: Enables multidimensional analysis by narrowing down data
using conditions across different dimensions.
Example: Analyzing sales data for Q1 and for electronics products to
evaluate performance of that category within the selected time frame.
iii. Drill-Up (Roll-Up)
Definition: Roll-up is the process of aggregating data to a higher level
of abstraction in one or more dimensions.
Purpose: Helps users view summarized or higher-level data to identify
overall trends.
Example: Aggregating daily sales data into monthly, quarterly, or
yearly sales figures for a broader view of performance.
iv. Drill-Down
Definition: Drill-down is the process of breaking down summarized
data into more detailed sub-levels.
Purpose: Allows users to explore underlying details behind aggregate
values to identify specific causes or trends.
Example: Drilling down from yearly sales totals to monthly, weekly, or
daily sales data to detect seasonal patterns or anomalies.
v. Drill-Within
Definition: Drill-within refers to switching between different
classifications or categories within the same dimension.
Purpose: Provides a different perspective of analysis within a single
dimension.
Example: In the "Product" dimension, switching from "Product
Category" to "Product Brand" to analyze sales by brand instead of
category.
vi. Drill-Across
Definition: Drill-across involves moving analysis from one dimension
to another to examine related data from a different perspective.
Purpose: Facilitates cross-dimensional analysis to gain broader
insights.
Example: Switching from analyzing sales by region (geography
dimension) to analyzing by salesperson (staff dimension) for
performance comparisons.
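The sketch below (Python with pandas; column names and values are illustrative) shows how slice, dice, roll-up and drill-down can be expressed on a small sales table; an OLAP server performs the same operations on a multidimensional cube.

import pandas as pd

sales = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb", "Mar", "Mar"],
    "quarter":  ["Q1",  "Q1",  "Q1",  "Q1",  "Q1"],
    "category": ["Electronics", "Grocery", "Electronics", "Grocery", "Electronics"],
    "amount":   [500, 200, 300, 150, 400],
})

# Slice: fix a single member of one dimension (month = "Mar")
march = sales[sales["month"] == "Mar"]

# Dice: conditions on several dimensions (Q1 and Electronics)
q1_electronics = sales[(sales["quarter"] == "Q1") & (sales["category"] == "Electronics")]

# Roll-up: aggregate from the month level up to the quarter level
by_quarter = sales.groupby("quarter")["amount"].sum()

# Drill-down: break the quarter totals back down to the month level
by_month = sales.groupby(["quarter", "month"])["amount"].sum()

print(march, q1_electronics, by_quarter, by_month, sep="\n\n")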
8. Explain the three-tier architecture of Data warehouse with neat
diagram. (6 marks)
The three-tier architecture organizes the data warehouse system into three
levels:
1. Bottom Tier: The data warehouse database server. Data from operational
systems and external sources is extracted, cleaned, transformed, and loaded
into it by back-end (ETL) tools, and a metadata repository describes the
stored data.
2. Middle Tier: An OLAP (Online Analytical Processing) server that provides
fast querying and processing of multidimensional data. It is implemented as
either Relational OLAP (ROLAP) or Multidimensional OLAP (MOLAP).
3. Top Tier: The front-end client layer, which provides tools for querying,
reporting, analysis, and visualization (e.g., dashboards, charts).
(Draw a diagram with three layers: Bottom – ETL tools and Data sources,
Middle – OLAP server, Top – Reporting/Analysis tools)
┌────────────────────────────────────┐
│ Top Tier (Front End)               │
│  - Query and reporting tools       │
│  - Analysis / visualization tools  │
│  - Dashboards (BI tools)           │
└────────────────────────────────────┘
                  ▲
                  │
                  ▼
┌────────────────────────────────────┐
│ Middle Tier (OLAP Server)          │
│  - ROLAP and/or MOLAP servers      │
└────────────────────────────────────┘
                  ▲
                  │
                  ▼
┌────────────────────────────────────┐
│ Bottom Tier (Warehouse Server)     │
│  - Data warehouse database         │
│  - ETL (back-end) tools            │
│  - Metadata repository             │
│  Fed by: operational databases,    │
│  external sources, flat files/logs │
└────────────────────────────────────┘
10. Explain the different back-end tools and utilities included in data
warehouse. (4 marks)
Data warehouses use several back-end tools to prepare data:
1. Data Extraction Tools: Extract data from multiple heterogeneous
sources.
2. Data Cleaning Tools: Detect and correct errors or inconsistencies in
data.
3. Data Transformation Tools: Convert data into a suitable format or
structure.
4. Data Loading Tools: Load the transformed data into the warehouse.
5. Metadata Repository: Stores information about the structure,
operations, and contents of the data.
These tools ensure that only clean, consistent, and meaningful data is stored
in the warehouse.
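A minimal, illustrative Python sketch of the extract-clean-transform-load flow these tools automate (the file name and field names are made up; real back-end tools work at much larger scale):

import csv

def extract(path):
    # Extraction: read raw records from an operational source (here, a CSV file)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean(rows):
    # Cleaning: drop records with a missing amount and strip stray whitespace
    return [{k: v.strip() for k, v in r.items()} if True else r
            for r in rows if r.get("amount")]

def transform(rows):
    # Transformation: convert the amount field to a numeric type
    for r in rows:
        r["amount"] = float(r["amount"])
    return rows

def load(rows, warehouse):
    # Loading: append the prepared rows into the warehouse table (here, a list)
    warehouse.extend(rows)

# Example run (assumes a hypothetical daily_sales.csv source file exists):
# warehouse = []
# load(transform(clean(extract("daily_sales.csv"))), warehouse)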
14. Explain (any four) the different ways of handling missing values.
(4/6 marks) (previous qp)
Handling missing data is important to improve data quality. Here are
common methods:
1. Ignore the tuple: Remove the record; suitable when only a few records
have missing values and the record is not critical for analysis.
2. Manual entry: Experts fill missing values using domain knowledge.
3. Global constant: Replace missing value with a constant like
“Unknown”.
4. Mean/Median/Mode: Fill with the column’s average (mean), middle
(median), or most frequent (mode) value.
5. Prediction using models: Use regression or classification models to
guess the value.
6. Using previous/next values: For time series data, use nearby data
values.
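A short Python sketch (pandas; column names and values are made up) of some of these methods:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Delhi", None, "Pune", "Pune", "Mumbai"]})

df_drop  = df.dropna()                                # 1. ignore (drop) tuples with missing values
df_const = df.fillna({"city": "Unknown"})             # 3. replace with a global constant
df_mean  = df.fillna({"age": df["age"].mean()})       # 4. fill with the column mean
df_mode  = df.fillna({"city": df["city"].mode()[0]})  # 4. fill with the most frequent value (mode)
df_ffill = df.ffill()                                 # 6. carry the previous value forward (time series)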
15. Define noise and explain different data smoothing techniques. (6
marks)
Noise is random error or variance in measured data that distorts the true
values. It needs to be removed or reduced to improve data quality.
Data Smoothing Techniques:
1. Binning: Sort the data and split it into bins; smooth each bin by
replacing its values with the bin mean, bin median, or bin boundaries.
2. Clustering: Group similar data points. Values in a cluster are replaced
with the cluster average.
3. Regression: Fit a regression line (linear or nonlinear), and replace data
points with values predicted by the line.
4. Moving average: Replace a value with the average of its neighboring
values in a sequence.
These techniques help in making the data more useful for analysis.
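An illustrative Python sketch of two of these techniques, smoothing by bin means and by a moving average (the data values are made up):

import statistics

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])

# Binning: split the sorted data into equal-size bins and replace each value by its bin mean
bin_size = 3
smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    smoothed.extend([statistics.mean(bin_vals)] * len(bin_vals))

# Moving average: replace each value by the average of a small window around it
moving_avg = [statistics.mean(data[max(0, i - 1):i + 2]) for i in range(len(data))]

print(smoothed)
print(moving_avg)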
16. Explain the different steps involved in data transformation. (4
marks) (previous qp)
Data transformation involves converting data into a suitable format for
analysis. Steps include:
1. Smoothing: Removing noise from data using smoothing techniques.
2. Aggregation: Summarizing data (e.g., total sales per month).
3. Generalization: Replacing detailed data with higher-level summary
(e.g., replacing cities with states).
4. Normalization: Scaling attribute values to a small, common range such as
0 to 1, e.g., min-max normalization (see the sketch after this list).
5. Attribute construction: Creating new features using existing
attributes (e.g., age group from age).
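A minimal Python sketch of one of these steps, min-max normalization, which rescales values to the [0, 1] range (the values are made up):

values = [200, 400, 800, 1000]

# Min-max normalization: (v - min) / (max - min)
v_min, v_max = min(values), max(values)
normalized = [(v - v_min) / (v_max - v_min) for v in values]
print(normalized)   # [0.0, 0.25, 0.75, 1.0]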
19. Explain any two numerosity reduction techniques. (4 marks)
(previous qp)
Numerosity reduction replaces the original data with a smaller
representation, reducing data volume without significant loss of information.
1. Histogram: Data is grouped into ranges called bins. Each bin is
represented by a single value, reducing total data points.
2. Clustering: Similar records are grouped into clusters. Instead of
storing individual records, we store data about each cluster (like the
centroid).
The quality of a cluster can be measured by its diameter, the maximum
distance between any two objects in the cluster. Centroid distance is an
alternative quality measure, defined as the average distance of each object
in the cluster from the cluster centroid (the "average object", or average
point, of the cluster).
24. What is sampling? Explain the sampling methods used for data
reduction. (4/6 marks)
Sampling is selecting a small subset of data from a large dataset. It is used
in data reduction to make analysis faster and more efficient.
Sampling Techniques:
1. Simple Random Sampling: Each data point has an equal chance of
being selected.
2. Stratified Sampling: The dataset is divided into groups (strata), and
samples are taken from each group proportionally.
3. Systematic Sampling: Select every k-th record from a list after a
random starting point.
Sampling allows us to work on a smaller dataset that still represents the
original data accurately.
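A short Python sketch of the three techniques on a made-up dataset of 100 record ids:

import random

data = list(range(1, 101))                 # 100 records, identified by ids 1..100

# 1. Simple random sampling without replacement
srs = random.sample(data, 10)

# 2. Stratified sampling: sample proportionally (10%) from each group (stratum)
strata = {"regular": data[:70], "premium": data[70:]}
stratified = [x for group in strata.values() for x in random.sample(group, len(group) // 10)]

# 3. Systematic sampling: every k-th record after a random starting point
k = 10
start = random.randrange(k)
systematic = data[start::k]

print(srs, stratified, systematic, sep="\n")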