Summary for Exam: Introduction to Data Warehouse & Data Mining
1. Need for Data Warehouses
• Business Questions: Complex decision-support queries (e.g., profitability analysis, trend
identification) require data from multiple sources.
• Shortcomings of SQL: Complex queries are time-consuming and involve diverse data
(internal/external, relational/non-relational).
• Solution: Data warehouses complement relational databases by enabling integrated,
historical, and summarized data analysis.
2. Transaction Processing (OLTP) vs. Decision Support (DSS)
Characteristic Operational Database (OLTP) Data Warehouse (DSS)
Purpose Daily transactions (e.g., order tracking) Historical analysis & decision-making
Data Current, detailed, process-oriented Historical, summarized, subject-oriented
Updates Volatile (frequent changes) Non-volatile (appended, not replaced)
Normalization Highly normalized Denormalized (e.g., star schemas)
Query Scope Few records Millions of records
3. Characteristics of Data Warehouses
• Subject-Oriented: Organized by business subjects (e.g., customers, products).
• Integrated: Consolidated data with consistent formats, naming, and units.
• Time-Variant: Historical snapshots for trend analysis.
• Non-Volatile: Data is appended, not updated/deleted.
4. Architectures
• Two-Tier:
o Operational data → Transformation → Data Warehouse.
o Uses an Enterprise Data Model (EDM) for metadata and integration rules.
• Three-Tier:
o Adds scalability for large datasets (common in big organizations).
• Data Marts:
o Subsets of data warehouses for departmental use (e.g., faster access, isolation).
5. Data Models
• Relational: Star schema (fact tables + dimension tables).
• Multidimensional: Data cubes for end-user analysis.
6. Data Warehouse Maturity Model
Stages: Prenatal → Infant → Child → Teenager → Adult → Sage.
• Progresses from operational reports to enterprise-wide and inter-enterprise analytics.
• Higher stages offer greater business value but face ROI challenges.
7. Data Mining
• Purpose: Discover hidden patterns (e.g., fraud detection, customer retention).
• Tools:
o Data visualization, clustering, predictive models (neural networks, decision trees).
o Parallel processing for scalability.
Key Exam Questions
1. Why are operational databases unsuitable for decision support?
o Inconsistent formats, process orientation, lack of historical/summarized data.
2. Data Warehouse vs. Data Mart?
o Data mart: Subset for departments; Data warehouse: Enterprise-wide.
3. Three-tier vs. Two-tier?
Three-tier for scalability in large organizations.
o
4. Components of EDM?
o Metadata, integration rules, data structure definitions.
Good luck! 📚✨
2 Introduction to Multidimentional Data Model.pdf
PDF 1.37MB
Summary for Exam: Multidimensional Data Model
1. OLTP vs. OLAP
OLTP (Online Transaction Processing) OLAP (Online Analytical Processing)
Focuses on daily transactions (e.g., order
Focuses on data analysis for decision-making.
entry, banking).
Handles real-time, short transactions with
Analyzes historical data from multiple sources.
high concurrency.
Ensures data integrity and speed in Uses multidimensional models (cubes) for insights
operational systems. (e.g., sales trends).
2. Multidimensional Data Model
• Purpose: Supports decision-making by organizing data into cubes with dimensions (e.g.,
Product, Location, Time) and measures (e.g., Sales).
• Advantages over Relational Models:
Flexibility: Adding dimensions (e.g., new products) requires minimal changes.
o
Efficiency: Faster query performance and easier aggregation (e.g., total sales by
o
region).
o Usability: Intuitive for business users to slice, dice, and drill down.
• Disadvantage: Potential storage inefficiency due to empty cells.
3. Key Terminologies
• Data Cube: A structure with dimensions (e.g., Product, Time) and measures (numeric values
like Sales).
• Dimensions: Categories for analysis (e.g., Location).
Hierarchies: Levels within a dimension (e.g., Country → State → City).
o
Members: Values within a dimension (e.g., California, Utah).
o
• Measures: Numeric data in cube cells (e.g., Sales, Profit).
4. Cube Operations
Operation Description Example
Focus on a subset by fixing one dimension (e.g.,
Slice View sales for 2023.
Time = 2023).
View sales in Utah and
Dice Select a subset of members in multiple dimensions.
California for Q1.
Drill- Expand Utah sales to Salt Lake
Navigate to finer granularity (e.g., State → City).
Down City.
Summarize to higher granularity (e.g., Daily → Aggregate daily sales into
Roll-Up
Quarterly). quarterly totals.
Rotate dimensions for better visualization (e.g., Switch Product and Location
Pivot
swap rows/columns). axes.
5. Hierarchies & Time-Series
• Hierarchies: Enable efficient aggregation (e.g., drill-down from Year to Month).
• Multiple Hierarchies: Flexibility in analysis (e.g., Location by state or postal code).
• Time as a Dimension: Allows trend analysis (e.g., sales growth over years).
Key Exam Questions
1. Advantages of Multidimensional vs. Relational?
o Faster queries, intuitive analysis, easier aggregation, flexible schema.
2. Why multiple hierarchies?
oTo support diverse analysis paths (e.g., Location by state or postal code).
3. Slice vs. Dice?
Slice fixes a dimension value; Dice selects subsets across dimensions.
o
4. Drill-Down vs. Roll-Up?
o Drill-down: Detailed view; Roll-Up: Summarized view.
5. Pivot Usefulness?
o Rearranges dimensions for clearer data visualization.
Good luck! 🚀📊
explain the Operations with diagram and example with ans
Multidimensional Data Cube Operations Explained
Below are explanations of key operations with text-based diagrams, examples, and answers based
on the sales data cube from the PDF:
1. Slice
Purpose: Focus on a subset of the cube by fixing one dimension.
Example: View sales for Time = Q1 2006.
Original Cube (3D: Product × Location × Time)
Location Mono Laser Ink Jet Photo Portable
California 80 110 60 25
Utah 40 90 50 30
Arizona 70 55 60 35
Washington 75 85 45 45
Colorado 65 45 85 60
Slice Operation: Fix Time = Q1 2006 → Result is a 2D table.
Answer:
The slice shows sales for Q1 2006. For example, California sold 80 Mono Laser printers.
2. Dice
Purpose: Select a subset of members across multiple dimensions.
Example: View sales for Location = California, Utah and Time = Q1 2006.
Original Cube
Location Mono Laser Ink Jet Photo Portable
California 80 110 60 25
Utah 40 90 50 30
Dice Operation: Filter rows for California and Utah.
Answer:
Total sales for California and Utah in Q1: 80 + 110 + 60 + 25 + 40 + 90 + 50 + 30 = 485.
3. Drill-Down
Purpose: Navigate from a higher hierarchy level (e.g., State) to a lower level (e.g., City).
Example: Expand Utah into cities (Salt Lake, Park City, Ogden).
Before Drill-Down (State Level):
Location Mono Laser Ink Jet Photo Portable
Utah 40 90 50 30
After Drill-Down (City Level):
Location Mono Laser Ink Jet Photo Portable
Salt Lake 20 20 10 15
Park City 5 30 10 5
Ogden 15 40 30 10
Answer:
Drilling down reveals Ogden sold the most Ink Jet printers (40 units).
4. Roll-Up
Purpose: Summarize data from a detailed level (e.g., City) to a higher level (e.g., State).
Example: Aggregate daily sales into Quarterly Sales.
Daily Sales (City Level)
Location Daily Sales
Salt Lake 20
Park City 5
Ogden 15
Roll-Up Operation: Sum daily sales into Utah’s Quarterly Sales = 20 + 5 + 15 = 40.
Answer:
Utah’s total quarterly sales for Mono Laser printers: 40.
5. Pivot
Purpose: Rearrange dimensions for better visualization.
Example: Swap Product (rows) and Location (columns).
Original Table (Product × Location):
Location Mono Laser Ink Jet Photo Portable
California 80 110 60 25
Pivoted Table (Location × Product):
Product California Utah Arizona
Mono Laser 80 40 70
Ink Jet 110 90 55
Answer:
Pivoting helps compare Mono Laser sales across states (e.g., California leads with 80 units).
Summary of Operations
Operation Input Output Example Use Case
Analyze sales for a specific
Slice Fix one dimension 2D table
quarter.
Filter multiple Compare sales in selected
Dice Subset of cube
dimensions states.
Drill-
Expand hierarchy level Detailed view (e.g., cities) Identify top-performing cities.
Down
Aggregate hierarchy Summarized view (e.g.,
Roll-Up Calculate quarterly totals.
level states)
Pivot Rearrange dimensions Reorganized axes Compare products vs. locations.
Relational Model for Data Warehouse Summary
1. Star Schema:
• A data modeling technique used in data warehouses to represent multidimensional data.
• Central fact table linked to multiple dimension tables (looks like a star).
• Fact Table: Stores numeric data (e.g., sales figures).
• Dimension Tables: Store descriptive data (e.g., product, customer, time, location).
• Each dimension table has a 1-M relationship with the fact table.
2. Example:
• Dimensions: Item, Customer, Store, Time.
• Fact: Sales (contains foreign keys to all dimension tables).
3. Time Representation & Historical Integrity:
• Problem: Updates to dimension tables can make related fact data historically inaccurate.
• Solutions (Kimball’s types):
o Type I: Overwrite old data (no historical tracking).
o Type II: Use version numbers to track changes by adding new rows.
o Type III: Add extra columns to track limited history (e.g., current and previous
values).
4. Constellation Schema:
• Extends star schema to model multiple related business processes.
• Multiple fact tables sharing common dimension tables.
• Example: Sales and Inventory fact tables sharing dimensions like Store, Item, Time.
5. Normalization:
• Fact tables are usually normalized.
• Dimension tables are often denormalized (not fully in 3NF) to enhance retrieval speed.
• Snowflake schema applies normalization to dimension tables, creating multiple layers
around the fact table.
6. Snowflake Schema:
• Dimension tables are normalized (opposite of denormalized star schema).
• Useful when dimension tables are large (e.g., Customer or Item tables).
7. Storage Engines:
• MOLAP (Multidimensional OLAP):
o Directly manipulates pre-computed data cubes.
o Optimized for multidimensional queries, better performance but struggles with
large cube sizes.
• ROLAP (Relational OLAP):
o Uses relational databases with star schema structures.
o Scales better for large datasets but can be slower than MOLAP for queries.
• HOLAP (Hybrid OLAP):
o Combines MOLAP (for summary cubes) and ROLAP (for detailed data).
o More flexible but adds system complexity.
8. MOLAP vs. ROLAP vs. HOLAP:
• MOLAP: Faster queries but limited scalability.
• ROLAP: Handles large data volumes, but queries might be slower.
• HOLAP: Combines strengths of both but may not always justify added complexity.