Data Mining Detailed Notes
UNIT I – Data Mining Fundamentals (8 hrs)
**Overview & Motivation:** Data mining is the process of discovering patterns, trends, and
useful information from large datasets. Applications include marketing, fraud detection,
medicine, etc.
**Definition & Functionalities:** Core data mining functionalities include classification, clustering, association
analysis, prediction, outlier detection, and evolution analysis (modeling how data trends change over time).
**Data Processing:** Preprocessing steps include cleaning, integration, transformation, and
reduction.
**Data Cleaning:** Techniques to handle missing values (mean/mode substitution), noisy
data (binning, regression), and inconsistent data.
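A minimal sketch of two of the cleaning techniques named above: mean substitution for missing values and smoothing noisy values by bin means. The function names and sample values are illustrative assumptions, and equal-frequency bins are just one of several binning choices.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins of
    bin_size items, and replace each value with its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

ages = [25, None, 35, 30]                      # hypothetical column with a gap
imputed = impute_mean(ages)

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # hypothetical noisy data
smoothed = smooth_by_bin_means(prices, 4)
```

Mode substitution for categorical attributes follows the same pattern, with `statistics.mode` in place of `mean`.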
**Data Integration & Transformation:** Combining multiple data sources and transforming
data (normalization, aggregation).
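As a rough illustration of the normalization step, the sketch below shows min-max scaling and z-score standardization. The function names are assumptions, and the population standard deviation is used here for simplicity; some texts use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center values on the mean and scale by the population std. dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

raw = [10, 20, 30, 40, 50]      # hypothetical attribute values
scaled = min_max(raw)
standardized = z_score(raw)
```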
**Data Reduction:** Shrinking the data volume while preserving its analytical value, using data cube
aggregation, dimensionality reduction (e.g., PCA), and data compression techniques.
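Data cube aggregation can be pictured as a roll-up along one dimension. The sketch below, with made-up sales figures and a hypothetical `roll_up` helper, collapses quarterly records into yearly totals so that fewer rows need to be stored and analyzed.

```python
from collections import defaultdict

# Hypothetical (year, quarter, sales) records.
sales = [
    (2022, "Q1", 224.0), (2022, "Q2", 408.0), (2022, "Q3", 350.0), (2022, "Q4", 586.0),
    (2023, "Q1", 300.0), (2023, "Q2", 416.0), (2023, "Q3", 380.0), (2023, "Q4", 600.0),
]

def roll_up(records):
    """Aggregate away the quarter dimension, keeping one total per year."""
    totals = defaultdict(float)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)

yearly = roll_up(sales)
```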
UNIT II – Classification, Clustering, and Association Rules (8 hrs)
**Classification:** Assigning items to categories using decision trees, Naïve Bayes, k-NN, etc.
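To make the idea of assigning items to categories concrete, here is a small k-NN sketch: a query point takes the majority label of its k nearest training points. The data and the `knn_predict` name are assumptions for illustration; Euclidean distance is one common choice, not the only one.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_tuple, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

label_near_a = knn_predict(train, (1.5, 1.5))
label_near_b = knn_predict(train, (6.5, 6.5))
```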
**Attribute Relevance & Class Comparisons:** Identifying significant attributes and
comparing classes statistically.
**Clustering:** Grouping data based on similarity, using hierarchical methods (CURE, Chameleon) and
partitional methods (k-means).
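The partitional approach can be sketched with a bare-bones k-means loop: assign each point to its nearest centroid, recompute centroids as cluster means, and repeat until nothing changes. Initial centroids are passed in explicitly here (an assumption made to keep the example deterministic); real implementations usually pick them randomly or with k-means++.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Lloyd-style k-means on tuples of numbers, starting from given centroids."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # hypothetical data
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
```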
**Association Rules:** Discovering associations among items using Apriori, FP-Growth, and
neural-network-based approaches.
UNIT III – Data Mining Process using CRISP-DM (8 hrs)
**CRISP-DM Methodology:** Business understanding, data understanding, preparation,
modeling, evaluation, and deployment.
**Data Import in R:** Using read.csv(), read.table(), and tidyverse readers such as readr's read_csv() to import structured data.
**Data Preprocessing in R:** Cleaning, transforming, and reducing data using packages like
dplyr and caret.
**Modeling in R:** Exploratory data analysis (EDA), association rules (arules), clustering (kmeans(), hclust()),
and anomaly detection.
UNIT IV – Predictive Analytics (8 hrs)
**Evaluation Metrics:** Accuracy, Precision, Recall, F1-score, ROC-AUC.
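The metrics listed above can be computed directly from confusion-matrix counts. The sketch below is a minimal binary-classification version with made-up labels and scores; `classification_metrics` and `roc_auc` are illustrative names, and AUC is computed as the fraction of positive/negative pairs ranked correctly (ties count as half).

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                    # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]                    # hypothetical predictions
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]    # hypothetical model scores

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```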
**Tree-Based Models and SVM:** Decision Trees, Random Forests, and Support Vector
Machines for classification tasks.
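Decision trees grow by repeatedly choosing the split that makes child nodes purest. A minimal sketch of that single step, assuming Gini impurity as the criterion (entropy is a common alternative) and a single numeric attribute; the helper names and data are hypothetical:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(xs, ys):
    """Try midpoints between consecutive distinct values; keep the purest split."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 4, 5, 6]                       # hypothetical attribute values
ys = ["no", "no", "no", "yes", "yes", "yes"]  # hypothetical class labels
threshold, impurity = best_threshold(xs, ys)
```

A full tree applies this search recursively to each child node; a random forest trains many such trees on bootstrap samples with random attribute subsets.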
**Artificial Neural Networks:** Neural network models, including deep learning architectures such as CNNs and RNNs.
**Model Ensembles:** Bagging, Boosting (XGBoost), and Stacking.
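Bagging trains several base models on bootstrap samples (drawn with replacement) and combines them by majority vote. The sketch below assumes a deliberately simple base learner, a nearest-class-mean classifier on one numeric feature, purely for illustration; in practice decision trees are the usual choice, and all names and data here are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def train_nearest_mean(sample):
    """Base learner: remember the mean feature value of each class."""
    by_label = {}
    for x, label in sample:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def predict_nearest_mean(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def bagging_fit(data, n_models=25, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_nearest_mean(sample))
    return models

def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_nearest_mean(m, x) for m in models)
    return votes.most_common(1)[0][0]

data = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"), (3.0, "low"),
        (7.0, "high"), (7.5, "high"), (8.0, "high"), (8.5, "high"), (9.0, "high")]
models = bagging_fit(data)
```

Boosting differs in that models are trained sequentially, each focusing on the errors of its predecessors, while stacking feeds base-model predictions into a second-level model.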
**Evaluation Techniques:** Holdout, cross-validation, and bootstrapping for estimating model performance,
followed by deployment practices.
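As a sketch of how cross-validation partitions the data, the generator below yields train/test index lists for k folds so that every sample is used for testing exactly once. `k_fold_indices` is a hypothetical helper; it does not shuffle, so in practice the indices would usually be shuffled (or stratified by class) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

Holdout corresponds to using a single such split, while bootstrapping draws the training set with replacement and tests on the samples left out.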
UNIT V – Market Basket and Sequence Analysis (8 hrs)
**Transactional Dataset & Apriori:** Frequent itemset mining using Apriori.
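A compact sketch of level-wise frequent itemset mining in the spirit of Apriori: frequent k-itemsets are joined into (k+1)-candidates, candidates with any infrequent subset are pruned, and the survivors are counted against the transactions. The basket data and the `apriori` function are illustrative assumptions, not a tuned implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=0.6)
```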
**Rule Generation:** Filtering candidate rules by support, confidence, and lift.
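The three rule measures can be computed from simple counts. For a rule X → Y, support is the fraction of transactions containing both X and Y, confidence is that count divided by the count of transactions containing X, and lift is confidence divided by the support of Y. A minimal sketch on made-up baskets (`rule_metrics` is a hypothetical helper):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes both sides occur in at least one transaction.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_ante
    lift = confidence / (n_cons / n)   # > 1 suggests a positive association
    return support, confidence, lift

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
support, confidence, lift = rule_metrics(transactions, {"diapers"}, {"beer"})
```

Rules are then kept only if they clear chosen minimum thresholds on these measures.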
**Plotting & Visualization:** Using arulesViz in R.
**Sequential Dataset:** Analyzing time-ordered transactions using algorithms such as SPADE and GSP.
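Sequential pattern miners such as GSP and SPADE rest on one basic operation: checking whether a customer's time-ordered sequence contains a pattern, then counting support across customers. The sketch below shows only that building block, not either full algorithm; the data and function names are assumptions.

```python
def contains_pattern(sequence, pattern):
    """True if the pattern's itemsets appear, in order, within the sequence.

    Both arguments are lists of sets; each pattern element must be a subset
    of a distinct, later-or-equal-position element of the sequence.
    """
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1   # greedy earliest match is sufficient for containment
    return pos == len(pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains_pattern(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list of baskets per customer.
sequences = [
    [{"phone"}, {"case"}, {"charger"}],
    [{"phone", "case"}, {"charger"}],
    [{"case"}, {"phone"}],
    [{"phone"}, {"charger", "case"}],
]
support_phone_charger = sequence_support(sequences, [{"phone"}, {"charger"}])
support_phone_case = sequence_support(sequences, [{"phone"}, {"case"}])
```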
**Business Applications:** Retail bundling, fraud detection, and recommendation systems.