Detailed Notes on Classification in Data Mining
1. What is Classification?
Classification is a data mining technique used to predict the categorical class labels of new
instances based on past observations. It is a type of supervised learning, where the target variable
is categorical (discrete values).
2. Classification Process:
- Training Phase: The algorithm learns from a labeled dataset.
- Testing Phase: The trained model is tested with unseen data to predict labels.
- Evaluation: Accuracy, precision, recall, and F1-score are used to evaluate model performance.
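The three phases above can be sketched in plain Python. This is only an illustration: the nearest-centroid classifier, the toy data, and the names fit_centroids and predict are assumptions chosen for brevity, not something these notes prescribe.

```python
# Toy labeled data: (feature vector, class label). Values are made up for illustration.
train = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.1, 0.9), "A"),
         ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.8, 3.0), "B")]
test = [((0.9, 1.1), "A"), ((3.2, 3.1), "B"), ((1.2, 1.0), "A"), ((2.9, 2.7), "B")]

def fit_centroids(samples):
    """Training phase: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(s / counts[y] for s in acc) for y, acc in sums.items()}

def predict(centroids, x):
    """Testing phase: assign the label of the closest class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Train on labeled data, then predict labels for data the model has not seen.
centroids = fit_centroids(train)
predictions = [predict(centroids, x) for x, _ in test]

# Evaluation: fraction of test instances whose predicted label matches the true label.
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Keeping the test set separate from the training set is what makes the accuracy an estimate of performance on new instances rather than a measure of memorization.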
3. Common Classification Algorithms:
a. Decision Trees (ID3, C4.5, CART):
- Use a tree-like structure where internal nodes represent tests on attributes and branches represent the outcomes of those tests.
- Leaves represent class labels.
- Easy to interpret and visualize.
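To make the idea of "tests on attributes" concrete, here is a minimal sketch of how an attribute can be scored for a split using entropy and information gain, the criterion ID3 uses. The toy rows, the attribute names, and the helper names are assumptions for illustration; C4.5 and CART use related but different criteria (gain ratio and Gini impurity).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the data on one attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy dataset: "outlook" fully determines the label, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {attr: information_gain(rows, labels, attr) for attr in rows[0]}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

The attribute with the highest gain would become the test at the current node, and the same scoring would then be repeated on each resulting subset.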
b. Naïve Bayes:
- Based on Bayes’ Theorem; assumes the features are conditionally independent given the class.
- Fast and effective for large datasets and text classification.
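As a rough sketch of how this works for text, the following multinomial Naïve Bayes classifier multiplies a class prior by per-word likelihoods (summed as logarithms). The four training messages, the add-one (Laplace) smoothing, and the names train_nb and predict_nb are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = log(n_docs / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages, made up for illustration.
docs = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting schedule today".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(docs)
print(predict_nb(model, "free money".split()))
print(predict_nb(model, "meeting today".split()))
```

The independence assumption is what lets the per-word terms simply be added in log space; training is a single counting pass, which is why the method scales well to large text collections.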
c. k-Nearest Neighbors (k-NN):
- Instance-based learning technique.
- Classifies a new instance by the majority class among its k nearest neighbors in the training data.
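A minimal k-NN sketch, assuming Euclidean distance and a simple majority vote. The toy points and the function name knn_predict are illustrative, not part of the notes.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data: (point, label).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (3.1, 3.0)))
```

There is no separate training step: the "model" is the stored training set, which is why k-NN is called instance-based. An odd k avoids tied votes in two-class problems.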
d. Support Vector Machines (SVM):
- Finds the optimal (maximum-margin) hyperplane that separates the data into different classes.
- Effective in high-dimensional spaces.
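The sense in which a hyperplane is "optimal" can be illustrated by comparing margins. The sketch below does not train an SVM; it only measures, for two hand-picked candidate hyperplanes (hypothetical values), the distance to the closest training point. An SVM solver searches for the hyperplane that maximizes this quantity.

```python
from math import sqrt

def geometric_margin(w, b, points):
    """Smallest value of y * (w.x + b) / ||w|| over all points; labels are +1 or -1."""
    norm = sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in points)

# Hypothetical linearly separable data: (point, label).
points = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), +1), ((5.0, 4.0), +1)]

# Two candidate separating hyperplanes w.x + b = 0, chosen by hand for illustration.
candidates = {
    "diagonal": ((1.0, 1.0), -5.5),
    "vertical": ((1.0, 0.0), -3.0),
}

margins = {name: geometric_margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates classify every point correctly (all margins are positive), but the one with the larger margin leaves more room for error on unseen data, which is the preference an SVM formalizes.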
e. Neural Networks:
- Consist of input, hidden, and output layers of connected units.
- Learn complex patterns and are the foundation of deep learning.
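The layered structure can be sketched as a forward pass through a tiny network. The weights below are hand-picked assumptions (not learned) so that the network computes XOR, a pattern that no single linear unit can represent; a real network would learn its weights from labeled data.

```python
def relu(v):
    """Rectified linear activation: negative inputs become zero."""
    return max(0.0, v)

def forward(x1, x2):
    """Forward pass: 2 inputs -> 2 hidden ReLU units -> 1 linear output."""
    # Hidden layer (weights and biases are illustrative, chosen by hand).
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: linear combination of the hidden activations.
    return 1.0 * h1 - 2.0 * h2

outputs = {(a, b): forward(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)
```

The hidden layer is what makes the non-linear pattern representable; training algorithms such as backpropagation adjust the weights automatically instead of setting them by hand.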
4. Applications of Classification:
- Email spam detection.
- Medical diagnosis (e.g., cancer detection).
- Credit scoring.
- Image and speech recognition.
- Customer segmentation (assigning customers to predefined segments; discovering unknown segments is a clustering task).
5. Advantages of Classification:
- Handles both binary and multi-class problems.
- Wide variety of algorithms available.
- Can reach high accuracy when the model is properly tuned and the training data is representative.
6. Challenges in Classification:
- Imbalanced datasets.
- Noisy or missing data.
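- Overfitting (model too complex; it fits noise in the training data and generalizes poorly).
- Underfitting (model too simple; it fails to capture the underlying pattern).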
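The imbalanced-data challenge can be made concrete with a small, made-up example: when 95% of instances are negative, a model that always predicts the negative class looks accurate while finding none of the positives. The class proportions here are assumptions chosen for illustration.

```python
# Hypothetical imbalanced dataset: 5 positive instances (1) and 95 negative (0).
y_true = [1] * 5 + [0] * 95

# A useless "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: how many actual positives were found.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy, recall)
```

This is why accuracy alone is a weak measure on imbalanced datasets and is usually reported together with precision, recall, or F1-score.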
7. Model Evaluation Metrics:
- Accuracy: Correct predictions / total predictions.
- Precision: True Positives / (True Positives + False Positives).
- Recall (Sensitivity): True Positives / (True Positives + False Negatives).
- F1-Score: Harmonic mean of precision and recall, 2 * Precision * Recall / (Precision + Recall).
- Confusion Matrix: Tabular summary of prediction results.
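The metrics above can all be derived from the four cells of a binary confusion matrix. The sketch below assumes a two-class problem with label 1 as the positive class; the example labels and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and predictions for ten instances.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
results = classification_metrics(y_true, y_pred)
print(counts, results)
```

Returning 0.0 when a denominator is zero is one common convention, not the only one; some tools raise a warning or report the metric as undefined instead.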