1. Define and explain data warehouse.
(4/6 marks) (previous qp)
A data warehouse is a centralized system used to store, manage, and
analyze large volumes of data collected from multiple sources. It is designed
specifically for query and analysis rather than transaction processing.
Key Points:
- Data warehouses store historical and current data to support business intelligence (BI) activities.
- Data is extracted from different operational systems, cleaned, transformed, and loaded into the warehouse (the ETL process).
- They support decision-making by providing easy access to organized and integrated data.
Example: A supermarket chain collects data from all branches, integrates it
in a central warehouse, and analyzes sales trends, customer preferences,
etc.
3. Explain the following multidimensional database schemas with
examples. (3 marks each)
a) Star Schema: This schema has a central fact table surrounded by
dimension tables. It is simple and gives fast query performance, since each
dimension is reached through a single join. Example: A Sales fact table with
columns such as sale_id, date_id, product_id, and amount, connected to
dimension tables such as Time, Product, and Store (a short code sketch
follows part c below).
b) Snowflake Schema: This is a more complex version of the star schema in
which dimension tables are normalized into multiple related tables. This
saves storage space but can make queries slower because of the extra joins.
Example: The Product dimension may be split into Product and Category
tables, so the Sales fact table connects to Product, which in turn connects
to Category.
c) Fact Constellation Schema: Also known as a galaxy schema, it contains
multiple fact tables that share common dimension tables. Example: Sales and
Shipping are two fact tables connected to shared dimensions such as Time
and Product.
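For illustration, a minimal Python sketch of the star-schema idea (using pandas; the table and column names are made up): measures live in a fact table whose foreign keys join to denormalized dimension tables at query time.

import pandas as pd

# Dimension tables (denormalized, as in a star schema)
time_dim = pd.DataFrame({"date_id": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
product_dim = pd.DataFrame({"product_id": [10, 20], "name": ["Soap", "Rice"],
                            "category": ["Toiletries", "Grocery"]})

# Fact table: foreign keys to the dimensions plus numeric measures
sales_fact = pd.DataFrame({
    "sale_id": [100, 101, 102],
    "date_id": [1, 1, 2],
    "product_id": [10, 20, 10],
    "amount": [120.0, 450.0, 80.0],
})

# A typical analytical query: total sales amount per month and product category
result = (sales_fact
          .merge(time_dim, on="date_id")
          .merge(product_dim, on="product_id")
          .groupby(["month", "category"])["amount"].sum())
print(result)

In a snowflake schema the category column would instead sit in a separate Category table, requiring one more join.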
4. Define Measure. Explain different categories of measures. (4
marks) (previous qp)
A measure is a numeric value on which analysis is performed in a data
warehouse (e.g., sales amount, quantity sold). Measures are stored in fact
tables and are used in calculations and aggregations.
Types of Measures:
1. Distributive: Can be computed by partitioning the data, aggregating each
partition separately, and combining the partial results (e.g., count, sum,
min, max).
2. Algebraic: Can be computed by an algebraic function of a fixed number of
distributive measures (e.g., average = sum / count).
3. Holistic: Cannot be computed from partitions; the entire dataset is
required (e.g., median, mode, rank).
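A small Python sketch (with made-up numbers) contrasting the categories: partial sums from partitions combine into the exact total, the average is derived from two distributive measures, while the median generally needs all the values.

import statistics

# Hypothetical data split into two partitions (e.g., two branches)
part1 = [4, 8, 15]
part2 = [16, 23, 42]

# Distributive: combine per-partition aggregates
total = sum(part1) + sum(part2)        # equals sum(part1 + part2)
count = len(part1) + len(part2)

# Algebraic: a function of a fixed number of distributive measures
average = total / count

# Holistic: the overall median cannot, in general, be derived from the
# partition medians; it needs the whole dataset.
overall_median = statistics.median(part1 + part2)

print(total, average, overall_median)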
5. Explain any four/six OLAP operations. (4/6 marks) (previous qp)
OLAP (Online Analytical Processing) operations are used to analyze
multidimensional data from different perspectives.
OLAP Operations in Multidimensional Data Analysis
i. Slice
Definition: Slicing refers to selecting a single value or member from
one dimension of a multidimensional dataset, creating a subset of the
data for focused analysis.
Purpose: Allows users to examine data for a specific criterion within
one dimension while keeping all other dimensions fixed.
Example: Analyzing sales data for a specific month (e.g., March) to
evaluate sales performance during that time.
ii. Dice
Definition: Dicing involves selecting a subset of data by specifying
values or ranges for multiple dimensions simultaneously.
Purpose: Enables multidimensional analysis by narrowing down data
using conditions across different dimensions.
Example: Analyzing sales data for Q1 and for electronics products to
evaluate performance of that category within the selected time frame.
iii. Drill-Up (Roll-Up)
Definition: Roll-up is the process of aggregating data to a higher level
of abstraction in one or more dimensions.
Purpose: Helps users view summarized or higher-level data to identify
overall trends.
Example: Aggregating daily sales data into monthly, quarterly, or
yearly sales figures for a broader view of performance.
iv. Drill-Down
Definition: Drill-down is the process of breaking down summarized
data into more detailed sub-levels.
Purpose: Allows users to explore underlying details behind aggregate
values to identify specific causes or trends.
Example: Drilling down from yearly sales totals to monthly, weekly, or
daily sales data to detect seasonal patterns or anomalies.
v. Drill-Within
Definition: Drill-within refers to switching between different
classifications or categories within the same dimension.
Purpose: Provides a different perspective of analysis within a single
dimension.
Example: In the "Product" dimension, switching from "Product
Category" to "Product Brand" to analyze sales by brand instead of
category.
vi. Drill-Across
Definition: Drill-across involves moving analysis from one dimension
to another to examine related data from a different perspective.
Purpose: Facilitates cross-dimensional analysis to gain broader
insights.
Example: Switching from analyzing sales by region (geography
dimension) to analyzing by salesperson (staff dimension) for
performance comparisons.
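The sketch below (Python with pandas; column names and values are illustrative) shows how slice, dice, roll-up and drill-down can be expressed on a small sales table; an OLAP server performs the same operations on a multidimensional cube.

import pandas as pd

sales = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb", "Mar", "Mar"],
    "quarter":  ["Q1",  "Q1",  "Q1",  "Q1",  "Q1"],
    "category": ["Electronics", "Grocery", "Electronics", "Grocery", "Electronics"],
    "amount":   [500, 200, 300, 150, 400],
})

# Slice: fix a single member of one dimension (month = "Mar")
march = sales[sales["month"] == "Mar"]

# Dice: conditions on several dimensions (Q1 and Electronics)
q1_electronics = sales[(sales["quarter"] == "Q1") & (sales["category"] == "Electronics")]

# Roll-up: aggregate from the month level up to the quarter level
by_quarter = sales.groupby("quarter")["amount"].sum()

# Drill-down: break the quarter totals back down to the month level
by_month = sales.groupby(["quarter", "month"])["amount"].sum()

print(march, q1_electronics, by_quarter, by_month, sep="\n\n")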
8. Explain the three-tier architecture of Data warehouse with neat
diagram. (6 marks)
The three-tier architecture organizes the data warehouse system into three
levels:
1. Bottom Tier: The data warehouse database server. Data from operational
systems and external sources is extracted, cleaned, transformed, and loaded
into it by back-end (ETL) tools, and a metadata repository describes the
stored data.
2. Middle Tier: An OLAP (Online Analytical Processing) server that provides
fast querying and processing of multidimensional data. It is implemented as
either Relational OLAP (ROLAP) or Multidimensional OLAP (MOLAP).
3. Top Tier: The front-end client layer, which provides tools for querying,
reporting, analysis, and visualization (e.g., dashboards, charts).
(Draw a diagram with three layers: Bottom – ETL tools and Data sources,
Middle – OLAP server, Top – Reporting/Analysis tools)
┌────────────────────────────────────┐
│ Top Tier (Front End)               │
│  - Query and reporting tools       │
│  - Analysis / visualization tools  │
│  - Dashboards (BI tools)           │
└────────────────────────────────────┘
                  ▲
                  │
                  ▼
┌────────────────────────────────────┐
│ Middle Tier (OLAP Server)          │
│  - ROLAP and/or MOLAP servers      │
└────────────────────────────────────┘
                  ▲
                  │
                  ▼
┌────────────────────────────────────┐
│ Bottom Tier (Warehouse Server)     │
│  - Data warehouse database         │
│  - ETL (back-end) tools            │
│  - Metadata repository             │
│  Fed by: operational databases,    │
│  external sources, flat files/logs │
└────────────────────────────────────┘
10. Explain the different back-end tools and utilities included in data
warehouse. (4 marks)
Data warehouses use several back-end tools to prepare data:
1. Data Extraction Tools: Extract data from multiple heterogeneous
sources.
2. Data Cleaning Tools: Detect and correct errors or inconsistencies in
data.
3. Data Transformation Tools: Convert data into a suitable format or
structure.
4. Data Loading Tools: Load the transformed data into the warehouse.
5. Metadata Repository: Stores information about the structure,
operations, and contents of the data.
These tools ensure that only clean, consistent, and meaningful data is stored
in the warehouse.
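A minimal, illustrative Python sketch of the extract-clean-transform-load flow these tools automate (the file name and field names are made up; real back-end tools work at much larger scale):

import csv

def extract(path):
    # Extraction: read raw records from an operational source (here, a CSV file)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean(rows):
    # Cleaning: drop records with a missing amount and strip stray whitespace
    return [{k: v.strip() for k, v in r.items()} if True else r
            for r in rows if r.get("amount")]

def transform(rows):
    # Transformation: convert the amount field to a numeric type
    for r in rows:
        r["amount"] = float(r["amount"])
    return rows

def load(rows, warehouse):
    # Loading: append the prepared rows into the warehouse table (here, a list)
    warehouse.extend(rows)

# Example run (assumes a hypothetical daily_sales.csv source file exists):
# warehouse = []
# load(transform(clean(extract("daily_sales.csv"))), warehouse)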
14. Explain (any four) the different ways of handling missing values.
(4/6 marks) (previous qp)
Handling missing data is important to improve data quality. Here are
common methods:
1. Ignore the tuple: Remove the record; suitable when only a few records
have missing values and the record is not critical for analysis.
2. Manual entry: Experts fill missing values using domain knowledge.
3. Global constant: Replace missing value with a constant like
“Unknown”.
4. Mean/Median/Mode: Fill with the column’s average (mean), middle
(median), or most frequent (mode) value.
5. Prediction using models: Use regression or classification models to
guess the value.
6. Using previous/next values: For time series data, use nearby data
values.
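A short Python sketch (pandas; column names and values are made up) of some of these methods:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Delhi", None, "Pune", "Pune", "Mumbai"]})

df_drop  = df.dropna()                                # 1. ignore (drop) tuples with missing values
df_const = df.fillna({"city": "Unknown"})             # 3. replace with a global constant
df_mean  = df.fillna({"age": df["age"].mean()})       # 4. fill with the column mean
df_mode  = df.fillna({"city": df["city"].mode()[0]})  # 4. fill with the most frequent value (mode)
df_ffill = df.ffill()                                 # 6. carry the previous value forward (time series)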
15. Define noise and explain different data smoothing techniques. (6
marks)
Noise is random error or variance in measured data that distorts the true
values. It needs to be removed or reduced to improve data quality.
Data Smoothing Techniques:
1. Binning: Sort the data and split it into bins; smooth each bin by
replacing its values with the bin mean, bin median, or bin boundaries.
2. Clustering: Group similar data points. Values in a cluster are replaced
with the cluster average.
3. Regression: Fit a regression line (linear or nonlinear), and replace data
points with values predicted by the line.
4. Moving average: Replace a value with the average of its neighboring
values in a sequence.
These techniques help in making the data more useful for analysis.
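An illustrative Python sketch of two of these techniques, smoothing by bin means and by a moving average (the data values are made up):

import statistics

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])

# Binning: split the sorted data into equal-size bins and replace each value by its bin mean
bin_size = 3
smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    smoothed.extend([statistics.mean(bin_vals)] * len(bin_vals))

# Moving average: replace each value by the average of a small window around it
moving_avg = [statistics.mean(data[max(0, i - 1):i + 2]) for i in range(len(data))]

print(smoothed)
print(moving_avg)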
16. Explain the different steps involved in data transformation. (4
marks) (previous qp)
Data transformation involves converting data into a suitable format for
analysis. Steps include:
1. Smoothing: Removing noise from data using smoothing techniques.
2. Aggregation: Summarizing data (e.g., total sales per month).
3. Generalization: Replacing detailed data with higher-level summary
(e.g., replacing cities with states).
4. Normalization: Scaling attribute values to a small, common range such as
0 to 1, e.g., min-max normalization (see the sketch after this list).
5. Attribute construction: Creating new features using existing
attributes (e.g., age group from age).
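A minimal Python sketch of one of these steps, min-max normalization, which rescales values to the [0, 1] range (the values are made up):

values = [200, 400, 800, 1000]

# Min-max normalization: (v - min) / (max - min)
v_min, v_max = min(values), max(values)
normalized = [(v - v_min) / (v_max - v_min) for v in values]
print(normalized)   # [0.0, 0.25, 0.75, 1.0]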
19. Explain any two numerosity reduction techniques. (4 marks)
(previous qp)
Numerosity reduction replaces the original data with a smaller
representation, reducing data volume without significant loss of information.
1. Histogram: Data is grouped into ranges called bins. Each bin is
represented by a single value, reducing total data points.
2. Clustering: Similar records are grouped into clusters. Instead of
storing individual records, we store data about each cluster (like the
centroid).
The quality of a cluster can be measured by its diameter, the maximum
distance between any two objects in the cluster. Centroid distance is an
alternative quality measure, defined as the average distance of each object
in the cluster from the cluster centroid (the "average object", or average
point, of the cluster).
24. What is sampling? Explain the sampling methods used for data
reduction. (4/6 marks)
Sampling is selecting a small subset of data from a large dataset. It is used
in data reduction to make analysis faster and more efficient.
Sampling Techniques:
1. Simple Random Sampling: Each data point has an equal chance of
being selected.
2. Stratified Sampling: The dataset is divided into groups (strata), and
samples are taken from each group proportionally.
3. Systematic Sampling: Select every k-th record from a list after a
random starting point.
Sampling allows us to work on a smaller dataset that still represents the
original data accurately.
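A short Python sketch of the three techniques on a made-up dataset of 100 record ids:

import random

data = list(range(1, 101))                 # 100 records, identified by ids 1..100

# 1. Simple random sampling without replacement
srs = random.sample(data, 10)

# 2. Stratified sampling: sample proportionally (10%) from each group (stratum)
strata = {"regular": data[:70], "premium": data[70:]}
stratified = [x for group in strata.values() for x in random.sample(group, len(group) // 10)]

# 3. Systematic sampling: every k-th record after a random starting point
k = 10
start = random.randrange(k)
systematic = data[start::k]

print(srs, stratified, systematic, sep="\n")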