Data Mining - Module I (Expanded Notes Based on Reference Textbooks)
1. **Data Mining:**
- Data mining is the process of discovering interesting, non-trivial, implicit, previously unknown,
and potentially useful patterns or knowledge from large amounts of data.
- Involves multiple disciplines including database systems, statistics, machine learning, and
artificial intelligence.
- Often used as a synonym for Knowledge Discovery in Databases (KDD); strictly speaking, data mining is the pattern-extraction step within the KDD process.
2. **KDD Process:**
- **Selection:** Choosing the relevant data from various sources.
- **Preprocessing:** Removing noise, handling missing values, and resolving inconsistencies.
- **Transformation:** Converting data into appropriate formats for mining.
- **Data Mining:** Applying algorithms to extract patterns.
- **Evaluation:** Interpreting and validating the mined knowledge.
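
The five stages above can be sketched end to end on a toy dataset. This is only an illustrative sketch: the records, the `age_band` helper, and the choice of "average spend per age band" as the mined pattern are all assumptions made for the example, not part of any standard KDD toolkit.

```python
# Toy records: (customer_id, age, purchase_amount); None marks a missing value.
raw = [
    (1, 25, 120.0),
    (2, None, 80.0),
    (3, 40, None),
    (4, 31, 200.0),
    (5, 58, 40.0),
]

# Selection: keep only the attributes relevant to the question (age, amount).
selected = [(age, amount) for _, age, amount in raw]

# Preprocessing: drop records with missing values (one simple cleaning strategy).
cleaned = [(a, m) for a, m in selected if a is not None and m is not None]

# Transformation: discretize age into two coarse bands (hypothetical cut-off).
def age_band(age):
    return "under_35" if age < 35 else "35_plus"

transformed = [(age_band(a), m) for a, m in cleaned]

# Data mining: a deliberately simple "pattern", the average spend per age band.
totals = {}
for band, amount in transformed:
    total, count = totals.get(band, (0.0, 0))
    totals[band] = (total + amount, count + 1)
pattern = {band: total / count for band, (total, count) in totals.items()}

# Evaluation: inspect the result and judge whether it is useful.
print(pattern)  # {'under_35': 160.0, '35_plus': 40.0}
```

In practice each stage is far more involved (for example, imputing missing values instead of dropping records), but the order of the stages is the same.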
3. **Data Mining Tasks:**
- **Classification:** Assign data objects to predefined classes using a model learned from labeled data (e.g., decision trees, k-NN, SVM).
- **Prediction:** Estimate unknown or future continuous values from existing data, typically using regression techniques.
- **Clustering:** Group data into clusters with similar characteristics (e.g., k-means).
- **Association Rule Mining:** Discover relationships between items (e.g., Market Basket Analysis
using Apriori).
- **Outlier Detection:** Identify anomalies or rare items that differ from the norm.
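
As a small illustration of association rule mining, the two measures that Apriori-style algorithms rely on, support and confidence, can be computed directly. The transactions and the helper names (`count`, `support`, `confidence`) are assumptions made for this sketch; it does not implement the Apriori candidate-generation procedure itself.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```

A rule such as bread => milk is usually reported only if both values exceed user-chosen minimum thresholds.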
4. **Types of Data in Data Mining:**
- **Structured Data:** Relational databases, data warehouses.
- **Semi-structured Data:** XML, JSON.
- **Unstructured Data:** Text, images, videos.
- **Data Streams:** Real-time continuous data.
5. **Data Objects and Attribute Types:**
- **Nominal:** Categorical values with no order (e.g., colors).
- **Binary:** Two values like 0/1, true/false.
- **Ordinal:** Ordered values (e.g., satisfaction level).
- **Numeric:**
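- **Interval:** Values with meaningful differences but no true zero, so ratios are not meaningful (e.g., temperature in Celsius or Fahrenheit, calendar dates).
- **Ratio:** Values with a true zero, so both differences and ratios are meaningful (e.g., height, weight, temperature in Kelvin).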
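
The interval/ratio distinction can be checked with a short calculation. This is a sketch with made-up temperatures; `celsius_to_kelvin` is a helper defined here for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low  # 10.0 degrees

# The Celsius ratio is numerically 2.0, but 20 C is not "twice as hot" as 10 C,
# because 0 C is an arbitrary reference point rather than a true zero.
naive_ratio = c_high / c_low

# On the Kelvin scale, which has a true zero, the ratio is physically meaningful.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, naive_ratio, round(kelvin_ratio, 3))
```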
6. **Data Preprocessing:**
- Essential for improving the quality of input data, and therefore the quality of the mining results.
- Steps: Data cleaning, integration, transformation, reduction, and discretization.
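
Three of these steps (cleaning, transformation, discretization) can be sketched on a single numeric attribute. The data and the specific choices below (mean imputation, min-max normalization, three equal-width bins) are assumptions made for illustration; other strategies are equally valid.

```python
# One numeric attribute with missing values marked as None (toy data).
values = [10.0, None, 20.0, 50.0, None, 40.0]

# Cleaning: replace missing values with the mean of the observed values.
observed = [v for v in values if v is not None]
mean_value = sum(observed) / len(observed)
filled = [mean_value if v is None else v for v in values]

# Transformation: min-max normalization onto the range [0, 1].
lo, hi = min(filled), max(filled)
normalized = [(v - lo) / (hi - lo) for v in filled]

# Discretization: map each normalized value to one of n equal-width bins.
def equal_width_bin(x, n_bins=3):
    # min(...) keeps x == 1.0 inside the last bin instead of creating a new one.
    return min(int(x * n_bins), n_bins - 1)

bins = [equal_width_bin(x) for x in normalized]

print(filled)      # [10.0, 30.0, 20.0, 50.0, 30.0, 40.0]
print(normalized)  # [0.0, 0.5, 0.25, 1.0, 0.5, 0.75]
print(bins)        # [0, 1, 0, 2, 1, 2]
```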
7. **Measures of Central Tendency:**
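- **Mean:** Arithmetic average, i.e., the sum of all values divided by the number of values; sensitive to outliers.
- **Median:** Middle value of the sorted data, separating the higher half from the lower half (the average of the two middle values when the count is even).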
- **Mode:** Most frequently occurring value.
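
All three measures are available in Python's standard `statistics` module. The salary figures below are invented to show how a single extreme value pulls the mean while leaving the median and mode unchanged.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 35, 35, 38, 400]

mean_salary = statistics.mean(salaries)      # 570 / 6 = 95
median_salary = statistics.median(salaries)  # average of the two middle values: 35
mode_salary = statistics.mode(salaries)      # most frequent value: 35

print(mean_salary, median_salary, mode_salary)
```

Here the mean (95) describes none of the typical salaries, which is why the median is often preferred for skewed data.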
8. **Classification of Data Mining Systems:**
- **Based on Data Type:** Relational, spatial, multimedia, text, time-series.
- **Based on Knowledge Type:** Characterization, discrimination, association, classification,
clustering.
- **Based on Technique:** Statistical, machine learning, neural networks, visualization-based.
9. **Major Issues in Data Mining:**
- **Data Quality:** Incomplete, noisy, or inconsistent data.
- **Scalability:** Efficient algorithms for large datasets.
- **Privacy & Security:** Sensitive data protection.
- **Interpretability:** Understandable models for decision-makers.
10. **Data Warehousing:**
- A subject-oriented, integrated, time-variant, non-volatile collection of data.
- Supports decision-making by providing a unified view of enterprise data.
11. **Multidimensional Data Model:**
- Allows data to be modeled and analyzed from multiple dimensions (e.g., time, geography,
product).
- Fundamental concept: **Data Cube** - a multi-dimensional array in which each cell holds an aggregated measure (e.g., total sales) for one combination of dimension values.
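
A data cube can be approximated in a few lines by grouping fact records on a chosen subset of dimensions. The records, the `DIMENSIONS` tuple, and the `aggregate` helper are assumptions made for this sketch; real OLAP engines precompute and index these aggregates.

```python
from collections import defaultdict

# Toy fact records: (year, region, product, sales).
facts = [
    (2023, "North", "Laptop", 100),
    (2023, "North", "Phone", 150),
    (2023, "South", "Laptop", 80),
    (2024, "North", "Laptop", 120),
    (2024, "South", "Phone", 90),
]
DIMENSIONS = ("year", "region", "product")

def aggregate(facts, keep):
    """Sum sales grouped by the dimensions named in keep (one cuboid of the cube)."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        out[key] += row[-1]  # the last field is the sales measure
    return dict(out)

# Roll-up: drop dimensions one at a time to get coarser summaries.
by_year_region = aggregate(facts, ("year", "region"))
by_year = aggregate(facts, ("year",))
grand_total = aggregate(facts, ())

print(by_year)      # {(2023,): 330, (2024,): 210}
print(grand_total)  # {(): 540}
```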
12. **Schemas for Multidimensional Data:**
- **Star Schema:** A central fact table connected to dimension tables.
- **Snowflake Schema:** A variant of the star schema in which dimension tables are normalized into additional, smaller tables.
- **Fact Constellation:** Multiple fact tables sharing dimension tables (also known as galaxy
schema).
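
A minimal star schema can be built with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and all rows are hypothetical and chosen only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- The central fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 2, 2000.0), (2, 1, 5, 2500.0), (1, 2, 1, 1000.0);
""")

# A typical star-schema query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON f.product_id = p.product_id
    JOIN dim_date AS d ON f.date_id = d.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()

print(rows)  # [('Computers', 2023, 3000.0), ('Mobile', 2023, 2500.0)]
```

In a snowflake version of the same design, `category` would move out of `dim_product` into its own table referenced by a key, at the cost of one more join per query.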
These notes provide a comprehensive overview of all key concepts covered in Module I of Data
Mining.