Data Mining - Module I Notes (Based on Reference Textbooks)
**1. Introduction to Data Mining** (Han & Kamber, Tan et al.):
- **What is Data Mining?**
  - The process of extracting previously unknown, potentially useful patterns from large data sets.
  - Integrates techniques from statistics, machine learning, and database systems.
**2. Data Mining Tasks** (Han & Kamber):
- **Classification:** Assigning items to predefined categories.
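- **Prediction:** Estimating unknown or future values, typically continuous ones (e.g., forecasting sales).
- **Cluster Analysis:** Grouping objects so that objects in the same group are more similar to each other than to objects in other groups.
- **Association Rule Mining:** Discovering interesting relationships among items or attributes that frequently occur together (e.g., items bought in the same transaction).
- **Outlier Detection:** Identifying objects that deviate markedly from the rest of the data.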
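
To make the association rule task concrete, here is a minimal sketch of the two standard rule measures, support and confidence. The transactions and the helper names `support` and `confidence` are assumptions made up for illustration; they are not taken from the reference texts.

```python
# Hypothetical market-basket transactions, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Rule under test: {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)

print(f"support({{diapers, beer}}) = {rule_support:.2f}")
print(f"confidence(diapers -> beer) = {rule_confidence:.2f}")
```

A rule is usually considered interesting only if both values exceed user-chosen minimum thresholds.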
**3. KDD Process** (Tan et al.):
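- **Steps:**
  1. Selection: choose the target data relevant to the analysis.
  2. Preprocessing: clean the data (handle noise and missing values).
  3. Transformation: convert the data into a form suitable for mining.
  4. Data Mining: apply algorithms to extract patterns.
  5. Interpretation/Evaluation: assess the patterns and present the useful ones.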
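
The five steps can be sketched as a chain of small functions. This is only an illustrative pipeline under simplifying assumptions: the records, the field names, the drop-missing-values policy, and the threshold split standing in for a real mining algorithm are all hypothetical.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"id": 1, "age": 23, "income": 30000, "notes": "n/a"},
    {"id": 2, "age": None, "income": 52000, "notes": "call back"},
    {"id": 3, "age": 41, "income": 61000, "notes": ""},
    {"id": 4, "age": 35, "income": None, "notes": "vip"},
    {"id": 5, "age": 58, "income": 90000, "notes": ""},
]


def select(records, fields):
    """1. Selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def preprocess(records):
    """2. Preprocessing: drop records with missing values (one simple policy)."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records, field):
    """3. Transformation: min-max scale one attribute into [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field + "_scaled": (r[field] - lo) / span} for r in records]


def mine(records, field, threshold=0.5):
    """4. Data mining: a stand-in 'pattern' that splits records at a threshold."""
    groups = {"low": [], "high": []}
    for r in records:
        groups["high" if r[field] >= threshold else "low"].append(r)
    return groups


def evaluate(groups):
    """5. Interpretation/Evaluation: summarise each non-empty group."""
    return {
        name: {"count": len(rs), "mean_age": mean(r["age"] for r in rs)}
        for name, rs in groups.items()
        if rs
    }


selected = select(raw_records, ["age", "income"])
cleaned = preprocess(selected)
scaled = transform(cleaned, "income")
patterns = mine(scaled, "income_scaled")
summary = evaluate(patterns)
print(summary)
```

In practice the process is iterative: results from the evaluation step often send the analyst back to an earlier step.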
**4. Data Mining Functionalities** (Han & Kamber):
- Finding patterns, associations, correlations, and trends.
- Classifying and predicting outcomes.
**5. Classification of Data Mining Systems**:
- Based on the type of data (relational, transactional, spatial, multimedia, time-series).
- Based on the kind of knowledge mined (characterization, discrimination, association).
- Based on the techniques used (machine learning, statistics, visualization, database-oriented).
**6. Issues in Data Mining**:
- Handling noisy and incomplete data
- Performance and scalability
- Integration of data mining with databases
- Data ownership and privacy issues
**7. Data Objects and Attribute Types**:
- **Nominal:** Categories with no order (e.g., hair color)
- **Binary:** Two categories (e.g., true/false)
- **Ordinal:** Categories with a meaningful order (e.g., small, medium, large)
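- **Numeric:**
  - **Interval:** Equal-sized units but no true zero, so differences are meaningful and ratios are not (e.g., temperature in °C or °F)
  - **Ratio:** True zero exists, so ratios are meaningful (e.g., age)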
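
The interval/ratio distinction can be checked with a small calculation. The sketch below uses made-up values and a hypothetical helper `to_kelvin`; it shows that a ratio computed on Celsius readings changes when the same temperatures are expressed on a scale with a true zero.

```python
# Interval attribute: temperature in degrees Celsius (the zero point is arbitrary).
celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = celsius_b - celsius_a

# The naive ratio suggests "twice as hot", which has no physical meaning.
naive_ratio = celsius_b / celsius_a


def to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15


# On the Kelvin scale the same two temperatures differ by only a few percent.
kelvin_ratio = to_kelvin(celsius_b) / to_kelvin(celsius_a)

# Ratio attribute: age in years, where zero genuinely means "no age".
age_ratio = 40 / 20  # a 40-year-old is twice as old as a 20-year-old

print(difference, naive_ratio, round(kelvin_ratio, 3), age_ratio)
```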
**8. Central Tendency Measures** (Han & Kamber):
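- **Mean:** Arithmetic average of the values; sensitive to extreme values
- **Median:** Middle value of the ordered data (the average of the two middle values when the count is even)
- **Mode:** Most frequent value; a data set may have more than one mode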
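
A short sketch of the three measures using Python's standard `statistics` module. The sample values are assumed for illustration and include one large value to show how it pulls the mean upward while leaving the median unchanged.

```python
from statistics import mean, median, multimode

# Hypothetical sample (e.g., salaries in thousands), already sorted.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

sample_mean = mean(values)         # sum of values / number of values
sample_median = median(values)     # average of the two middle values (even count)
sample_modes = multimode(values)   # every value tied for most frequent

# Dropping the extreme value 110 lowers the mean but not the median.
trimmed = values[:-1]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)

print("mean:", sample_mean)
print("median:", sample_median)
print("modes:", sample_modes)
print("mean without 110:", round(trimmed_mean, 2))
print("median without 110:", trimmed_median)
```

`multimode` (available from Python 3.8) is used instead of `mode` because this sample is bimodal.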
**9. Data Warehousing Concepts** (Paulraj Ponniah, Sam Anahory):
- **Definition:** A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management decision-making.
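- **Multidimensional Data Models:**
  - **Data Cubes:** Allow data to be modeled and viewed in multiple dimensions.
  - **Schemas:**
    - **Star Schema:** A central fact table linked to a set of denormalized dimension tables.
    - **Snowflake Schema:** A star schema whose dimension tables are normalized into further tables.
    - **Fact Constellation:** Multiple fact tables that share dimension tables.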
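
As a rough illustration of a star schema and of viewing the same facts along different dimensions, the sketch below stores a fact table and three dimension tables as plain Python structures. All table names, keys, and values are hypothetical, and the `aggregate` helper is an assumption standing in for a cube group-by.

```python
from collections import defaultdict

# Dimension tables (hypothetical): surrogate key -> descriptive attributes.
dim_time = {1: {"quarter": "Q1"}, 2: {"quarter": "Q2"}}
dim_item = {10: {"category": "computer"}, 11: {"category": "phone"}}
dim_location = {100: {"city": "Chicago"}, 101: {"city": "Toronto"}}

# Fact table: one row per (time, item, location) with a numeric measure.
fact_sales = [
    {"time_key": 1, "item_key": 10, "location_key": 100, "units_sold": 5},
    {"time_key": 1, "item_key": 11, "location_key": 100, "units_sold": 8},
    {"time_key": 1, "item_key": 10, "location_key": 101, "units_sold": 3},
    {"time_key": 2, "item_key": 10, "location_key": 100, "units_sold": 7},
    {"time_key": 2, "item_key": 11, "location_key": 101, "units_sold": 4},
]


# Each lookup joins a fact row to one dimension table.
def by_quarter(row):
    return dim_time[row["time_key"]]["quarter"]


def by_category(row):
    return dim_item[row["item_key"]]["category"]


def by_city(row):
    return dim_location[row["location_key"]]["city"]


def aggregate(facts, group_by):
    """Sum units_sold over the chosen dimensions (one view of the cube)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(lookup(row) for lookup in group_by)
        totals[key] += row["units_sold"]
    return dict(totals)


# A 2-D view of the cube: quarter x category.
cube_2d = aggregate(fact_sales, [by_quarter, by_category])

# Aggregating away the category dimension leaves totals per quarter.
rolled_up = aggregate(fact_sales, [by_quarter])

print(cube_2d)
print(rolled_up)
```

In a snowflake schema, a dimension such as `dim_location` would itself reference further tables (for example a separate city-to-country table) rather than holding all attributes directly.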
**10. Reference Texts Used**:
- Jiawei Han & Micheline Kamber, *Data Mining: Concepts and Techniques*
- Pang-Ning Tan et al., *Introduction to Data Mining*
- Arun K. Pujari, *Data Mining Techniques*
- Sam Anahory & Dennis Murray, *Data Warehousing in the Real World*
- Paulraj Ponniah, *Data Warehousing Fundamentals*