Data Mining, Data Warehousing, and
Distributed DBMS - Summary
1. Data Mining: Concepts and Techniques
What is Data Mining?
Data mining is the process of extracting useful, non-trivial, and previously unknown
patterns from large datasets.
Example: Predicting which customers are likely to buy a new product based on past
purchases.
Knowledge Discovery in Databases (KDD)
KDD includes steps like data cleaning, integration, selection, mining, pattern evaluation, and
presentation.
Example: Medical data preprocessing and mining patterns in patient symptoms.
Evolution of Database Technologies
1960s–1980s: Basic DBMS like IMS, RDBMS
1990s: Data mining, multimedia DBs
2000s+: Stream data, Web mining, Big Data
Data Mining Functions
1. Generalization: Summarize data. E.g., average sales by region.
2. Association & Correlation: Discover co-occurrence. E.g., bread → butter.
3. Classification: Predict labels. E.g., spam detection.
4. Clustering: Group similar items. E.g., customer segmentation.
5. Outlier Detection: Identify anomalies. E.g., fraud detection.
6. Trend Analysis: Sequential pattern recognition. E.g., TV → speakers.
Applications of Data Mining
Web analysis, Market basket analysis, Medical diagnosis, Fraud detection.
Issues in Data Mining
Methodology, scalability, data diversity, privacy, performance.
2. Data Warehouse
What is a Data Warehouse?
A centralized system storing integrated data from multiple sources for analysis and BI.
Types of Data Warehouses
1. EDW: Org-wide repository.
2. ODS: Real-time data for ops.
3. Data Mart: Subset for departments.
4. Virtual DW: No physical storage.
5. Cloud DW: Hosted online.
6. Federated DW: Data remains at source.
Case Study: XYZ Retail
XYZ implemented ETL, centralized DW, BI tools, and governance for better reporting and
decision-making.
3. Distributed DBMS Architectures
What is Distributed DBMS?
It manages data distributed across locations as one database.
Architectures
1. Client-Server
2. Peer-to-Peer
3. Three-Tier
Types of DDBMS
1. Homogeneous (same DBMS)
2. Heterogeneous (different DBMS)
Data Fragmentation
1. Horizontal: Row-wise
2. Vertical: Column-wise
3. Mixed
Replication
1. Full: All data copied
2. Partial: Some data replicated
Query Processing
Optimize queries to reduce communication across nodes.
Concurrency and Recovery
Use locking, timestamps, and protocols like 2PC to ensure consistency.
Data Warehousing Summary
1. What is a Data Warehouse?
A data warehouse is a centralized repository separate from operational databases. It
integrates data from various sources and supports analytical processing of historical data.
2. Major Features of Data Warehouse
🔹 Subject-oriented: Organized by key subjects like customer, product, and sales.
🔹 Integrated: Combines data from different sources using cleaning and integration.
🔹 Time-variant: Stores data for historical analysis (e.g., 5–10 years).
🔹 Nonvolatile: Data is stable and separate from transactional systems; mainly supports
reading and loading.
3. Uses of Data Warehouses in Organizations
• Analyze customer buying behavior
• Evaluate and reposition product strategies
• Optimize operations and identify profit sources
• Manage customer relationships and corporate assets
4. Traditional vs. Data Warehousing Approaches
🔹 Traditional: Query-driven; uses wrappers/mediators to fetch data from sources in real-
time.
🔹 Data Warehouse: Update-driven; integrates data beforehand for direct querying.
5. OLTP vs. OLAP
🔹 OLTP: Handles day-to-day operations (e.g., payroll, inventory); current, detailed data.
🔹 OLAP: Supports analysis and decision-making; large historical datasets,
summary/aggregated views.
Key Differences:
• OLTP is customer-oriented; OLAP is market-oriented.
• OLTP uses ER model; OLAP uses star/snowflake schema.
• OLTP handles short, atomic transactions; OLAP supports complex queries (read-heavy).
6. Data Warehousing: Multitiered Architecture
🔹 Top-down: Centralized warehouse first; systematic but costly and inflexible.
🔹 Bottom-up: Data marts first; cheaper and flexible but harder to integrate later.
Data, Information, and Knowledge - Summary
1. Definitions
🔹 Data: Raw, unprocessed facts (e.g., 3, 6, 9; cat, dog; 161.2, 175.3).
🔹 Information: Processed data with meaning (e.g., those numbers are student heights).
🔹 Knowledge: Application of information to solve problems or gain insights (e.g., tallest
student is 175.3cm).
Equations: Data + Meaning = Information | Information + Use = Knowledge
2. Types of Knowledge
🔹 Explicit Knowledge: Easy to share (e.g., facts in books).
🔹 Tacit Knowledge: Gained through experience, harder to communicate (e.g., baking,
programming).
3. Data Categorization
🔹 Structured vs. Unstructured: Tables vs. text/images
🔹 Quantitative vs. Qualitative: Numbers vs. opinions
🔹 Primary vs. Secondary: Collected first-hand vs. from existing sources
🔹 Internal vs. External: Organization’s data vs. third-party sources
🔹 Time-series vs. Cross-sectional: Over time vs. at a single point
4. Where Data Resides
🔹 Databases: Structured storage
🔹 Data Warehouses: Analytical storage
🔹 Data Lakes: Raw data storage
🔹 Cloud Storage: Scalable online storage (e.g., AWS S3)
🔹 File Systems & Devices: HDDs, SSDs, USBs
5. Do We Need a Server?
Not always, but beneficial for:
• Scalability
• Centralized data management
• Access control & security
• Redundancy & backup
6. Homogeneous vs. Heterogeneous Data
🔹 Homogeneous: Same type/format (e.g., customer info table)
🔹 Heterogeneous: Mixed types/formats (e.g., images + text + audio)
7. Real-World Use Case: Healthcare
Hospitals collect data from EHRs, devices, wearables. They use:
• ML for risk prediction
• NLP for patient feedback analysis
• Computer vision for X-ray/MRI analysis