1. Compare Linear & Non-Linear SVMs with Suitable Example
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It works by finding the hyperplane that best separates
data points of different classes, i.e., the one that maximizes the margin between them.
Linear SVM:
- Used when data is linearly separable.
- Separates data using a straight line (2D), plane (3D), or hyperplane (nD).
- Hyperplane Equation: w·x + b = 0
- Example: Classifying emails as spam or not spam using keyword and link count.
Non-Linear SVM:
- Used when data is not linearly separable.
- Uses kernel trick to transform data into higher dimensions.
- Common kernels: Polynomial, RBF, Sigmoid.
- Example: Tumor classification using size and shape, forming circular clusters.
Comparison Table:
| Feature | Linear SVM | Non-Linear SVM |
|---------------|-------------------|----------------------|
| Data Type | Linearly separable| Non-linearly separable|
| Kernel Used | Not required | Required |
| Complexity | Low | High |
| Speed | Fast | Slower |
| Example | Spam detection | Image classification |
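The contrast above can be seen on a toy dataset. This is a minimal sketch assuming scikit-learn (the notes name no library); `make_circles` produces concentric rings that no straight line can separate, so the linear kernel struggles while the RBF kernel, via the kernel trick, fits cleanly:

```python
# Sketch: linear vs non-linear SVM on a toy dataset (assumes scikit-learn).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit map to higher dimensions

print("linear accuracy:", linear_svm.score(X, y))  # near chance level
print("rbf accuracy:", rbf_svm.score(X, y))        # near perfect
```

The RBF kernel never computes the higher-dimensional coordinates explicitly; it only evaluates similarities between points, which is what makes the kernel trick cheap.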
2. Short Note on LDA
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used to discover topics
in a collection of documents. Each document is considered a mixture of topics, and each
topic is a distribution over words.
Key Concepts:
- Document: A collection of words.
- Topic: A group of related words.
- Dirichlet Distribution: Used to model topic and word distributions.
Working of LDA:
1. Choose number of topics.
2. Randomly assign a topic to each word.
3. Iterate to improve assignment using statistical inference.
4. Output topic distribution for each document and word distribution for each topic.
Example:
Documents:
- Doc1: "apple banana mango"
- Doc2: "football cricket hockey"
- Doc3: "apple mango football"
LDA Topics:
- Topic 1: apple, banana, mango (fruits)
- Topic 2: football, cricket, hockey (sports)
Applications:
- Topic modeling, recommender systems, search engines, summarization.
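The three example documents above can be run through LDA directly. A minimal sketch assuming scikit-learn (the notes specify no library); with such a tiny corpus the learned topics are only illustrative:

```python
# Sketch: LDA on the three example documents (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apple banana mango",     # Doc1
        "football cricket hockey",  # Doc2
        "apple mango football"]   # Doc3

# Step 1: build the document-word count matrix.
counts = CountVectorizer().fit_transform(docs)

# Steps 2-4: fit LDA with 2 topics; inference refines the initial assignments.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic mixture

print(doc_topics.round(2))  # one row per document; each row sums to 1
```

`doc_topics` is the "topic distribution for each document" from step 4; `lda.components_` holds the word distribution for each topic.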
3. Demonstrate K-Nearest Neighbour Algorithm with Use Case
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for classification and
regression. It classifies a data point based on the majority label of its nearest neighbors.
How it Works:
1. Choose K (number of neighbors).
2. Calculate distance from the new point to all training points.
3. Select K nearest neighbors.
4. Assign class label by majority voting.
Use Case: Customer Classification
Classify a new customer as High or Low spender based on Age and Income.
Example:
| Customer | Age | Income | Class |
|----------|-----|--------|-------------|
| C1 | 25 | 25K | Low spender |
| C2 | 45 | 70K | High spender|
| C3 | 30 | 30K | Low spender |
New customer: Age 28, Income 28K → with K = 3, two of the three nearest neighbors (C1, C3) are Low spenders, so KNN predicts: Low spender
Advantages:
- Simple, no explicit training phase (lazy learner), works well with small datasets.
Disadvantages:
- Slow with large data, sensitive to irrelevant features, needs scaling.
4. Random Forests Algorithm
Random Forest is an ensemble learning algorithm used for classification and regression. It
builds multiple decision trees and combines their outputs.
How it Works:
1. Create random samples from the dataset.
2. Build decision trees on each sample.
3. Randomly select features at each split.
4. Aggregate results (majority vote or average).
Key Concepts:
- Bagging (bootstrap aggregating): trains each tree on a random bootstrap sample and combines their results.
- Random feature selection at each split reduces correlation between trees.
Use Case: Disease Prediction
Predict disease based on age, blood pressure, sugar levels, etc.
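The use case can be sketched on synthetic data, assuming scikit-learn (an assumption; the feature order age / blood pressure / sugar level is hypothetical, and `make_classification` stands in for real patient records):

```python
# Sketch: Random Forest for a disease-prediction-style task on synthetic data
# (assumes scikit-learn; real patient features would replace make_classification).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# 100 trees, each built on a bootstrap sample with random feature choices at
# each split; the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("train accuracy:", forest.score(X, y))
```

On held-out data the accuracy would be lower than on the training set; a train/test split would be used to measure that in practice.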
Advantages:
- High accuracy, handles missing data, less overfitting, works for classification and
regression.
Disadvantages:
- Slower, less interpretable, high memory usage.