Machine Learning

Machine Learning (ML) is a subset of Artificial Intelligence that enables systems to learn from data and improve performance. It is classified into three types: supervised learning, unsupervised learning, and reinforcement learning, each with distinct algorithms and applications. The document also discusses decision trees, confusion matrices, support vector machines, and convolutional neural networks, providing insights into their algorithms, advantages, and use cases.

Machine Learning is a subset of Artificial Intelligence (AI) that enables systems to automatically learn from data and improve their performance without being explicitly programmed. ML focuses on designing algorithms that can identify patterns, make predictions, and make decisions based on input data.

Types of Machine Learning: Machine Learning is broadly classified into three types:
1. Supervised Learning: In supervised learning, the model is trained using labeled data (i.e., input-output pairs).
The goal is to learn a function that maps inputs to correct outputs.
Examples: Email spam detection, house price prediction, sentiment analysis.
Algorithms: Linear Regression, Decision Tree, Support Vector Machine.
2. Unsupervised Learning: In this type, data is not labeled. The model tries to identify hidden patterns or structures
from the input data without predefined output.
Examples: Customer segmentation, market basket analysis, anomaly detection.
Algorithms: K-Means Clustering, Hierarchical Clustering, PCA.
3. Reinforcement Learning : An agent interacts with an environment and learns to make decisions by receiving
rewards or penalties. It aims to maximize cumulative reward.
Examples: Game playing (chess, Go), robotics, self-driving cars.
Algorithms: Q-Learning, Deep Q Network, SARSA.

Applications of Machine Learning:


1. Email Spam Filtering: ML algorithms like Naive Bayes detect and block spam emails based on keyword
patterns, sender information, and metadata.
2. Speech Recognition: Converts spoken language into text using models such as HMMs and deep neural
networks, widely used in virtual assistants.
3. Self-Driving Cars: ML enables cars to detect lanes, traffic signs, pedestrians, and make navigation decisions
using reinforcement learning and computer vision.
4. Recommendation Systems: E-commerce and OTT platforms use ML to suggest products or content using
collaborative and content-based filtering techniques.
5. Fraud Detection: Financial institutions apply anomaly detection and classification to identify suspicious activities
and transactions.
6. Customer Segmentation: Clustering algorithms divide customers into groups based on behavior, preferences,
and purchase history for targeted marketing.
7. Natural Language Processing (NLP): ML is used for language translation, chatbots, sentiment analysis, and
voice assistants like Siri or Alexa.
8. Face Recognition: Applied in surveillance, user authentication, and social media tagging using deep learning
and pattern recognition.
Q1. Write algorithm for Top Down Induction of Decision Tree (ID3).
ID3 (Iterative Dichotomiser 3) is a top-down, recursive, greedy approach used to build decision trees based on
Information Gain.

Steps of ID3 Algorithm:

1.​ Input: Dataset D, attributes A, target attribute T.


2.​ If all samples in D belong to the same class, return a leaf with that class.
3.​ If A is empty, return a leaf node with the majority class label in D.
4.​ Compute Information Gain for each attribute in A.
5.​ Select the attribute A_best with the highest gain.
6.​ Create a decision node for A_best.
7.​ For each value v of A_best:​
a. Create a branch where A_best = v.​
b. Form subset Dv of D where A_best = v.​
c. If Dv is empty, attach a leaf with majority class of D.​
d. Else, recursively apply ID3 on Dv with remaining attributes.

Output: A decision tree where each path forms a classification rule.
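Below is a minimal Python sketch of this recursion. It assumes the dataset is a list of dictionaries (one per sample) and that information_gain(rows, attribute, target) is an assumed helper implementing the gain formula from Q4; the names are illustrative, not part of the original algorithm statement.

from collections import Counter

def majority_class(rows, target):
    # Majority class label among the remaining samples (used in steps 3 and 7c)
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def id3(rows, attributes, target, information_gain):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                      # Step 2: all samples share one class
        return classes.pop()
    if not attributes:                         # Step 3: no attributes left
        return majority_class(rows, target)
    # Steps 4-5: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    node = {best: {}}                          # Step 6: decision node for A_best
    for v in {r[best] for r in rows}:          # Step 7: one branch per value of A_best
        subset = [r for r in rows if r[best] == v]
        rest = [a for a in attributes if a != best]
        node[best][v] = id3(subset, rest, target, information_gain) if subset \
                        else majority_class(rows, target)
    return node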

Q2. Explain Decision Tree Algorithm with Example.


Ans: A Decision Tree is a supervised learning method used for classification and regression. It splits data into
subsets based on the value of input features.

Working:

1.​ Start with the full dataset.


2.​ Select the best attribute using Information Gain or Gini Index.
3.​ Split the dataset into subsets where attribute has specific values.
4.​ Repeat the process recursively on each subset.
5.​ Stop when:​

○​ All instances have same class label


○​ No more attributes left
○​ Minimum sample condition reached

Example: (Play Tennis)​


Attributes: Outlook, Temperature, Humidity, Wind​
Target: Play (Yes/No)

If Outlook=Sunny, then check Humidity;​


If Outlook=Overcast, then Play=Yes;​
If Outlook=Rain, then check Wind, and so on.

Advantages:

●​ Easy to understand and visualize


●​ Handles both categorical and numerical data
●​ Generates understandable rules
Q3. Explain Overfitting and Pruning with respect to Decision Tree.

Ans: Overfitting occurs when a decision tree learns noise and details from training data, resulting in poor
performance on unseen data. It often happens when the tree becomes too deep or complex.
Symptoms of Overfitting:

●​ High training accuracy


●​ Low testing accuracy
●​ Large, complicated tree

Pruning: A technique to reduce the size of a tree and avoid overfitting by removing nodes that add little predictive
power.
Types:

1.​ Pre-Pruning (Early Stopping):​

○​ Stop growing tree if node has too few instances


○​ Set max depth or minimum gain threshold​

2.​ Post-Pruning:​

○​ Grow the complete tree


○​ Then remove branches that don’t improve accuracy on validation set

Advantages of Pruning:

●​ Simplifies model
●​ Increases generalization
●​ Prevents overfitting

Example: A leaf node classifies only one sample incorrectly; pruning it might improve overall accuracy.
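For illustration, the sketch below shows pre-pruning (depth and leaf-size limits) and cost-complexity post-pruning with scikit-learn's DecisionTreeClassifier; the dataset and parameter values are arbitrary placeholders, not settings from the text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via depth / minimum-samples limits
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# Post-pruning: grow fully, then prune using the cost-complexity parameter ccp_alpha
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)

print("pre-pruned accuracy :", pre.score(X_te, y_te))
print("post-pruned accuracy:", post.score(X_te, y_te))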

Q4. How do you choose best split in the Decision Tree algorithm?

The best split at each node of a decision tree is chosen by measuring how well an attribute separates data into
distinct classes. This is done using impurity measures:
1. Entropy: Measures disorder or impurity in a dataset.
   Entropy(S) = − Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i in S.

2. Information Gain (used in ID3): Gain = Entropy(Parent) − Weighted Entropy(Children)
   The higher the gain, the better the attribute.

3. Gini Index (used in CART): Gini(S) = 1 − Σ pᵢ²
   Lower Gini means higher purity.

4. Gain Ratio (used in C4.5): Gain Ratio = Information Gain / Split Information
   Normalizes Information Gain to avoid bias toward attributes with many values.
Process:

●​ Compute the chosen metric for each attribute


●​ Select attribute with highest Information Gain or lowest Gini
●​ Use that to split data at that node

Example: If Outlook has the highest Information Gain, it becomes the root node and data is split based on its
values.
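These measures are easy to compute directly. The NumPy sketch below evaluates the information gain of an Outlook-style split; the 9-Yes/5-No class counts are assumed values for illustration only.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, child_label_groups):
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_label_groups)
    return entropy(parent_labels) - weighted

# Play-Tennis style split on Outlook (labels are Play = Yes/No)
parent = ["Yes"] * 9 + ["No"] * 5
sunny, overcast, rain = (["Yes"] * 2 + ["No"] * 3), ["Yes"] * 4, (["Yes"] * 3 + ["No"] * 2)
print(information_gain(parent, [sunny, overcast, rain]))   # about 0.247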
Q. Explain Confusion Matrix for classification problem Or Explain
Confusion Matrix with example and state the need of confusion matrix.
Ans: A Confusion Matrix is a performance evaluation table used for classification problems, where the output can
be divided into multiple classes. It compares the actual target values with those predicted by the model. In binary
classification, it is a 2×2 matrix; for multiclass, it extends to n×n format. It provides a complete picture of prediction
outcomes, separating correct and incorrect predictions into four categories.

Structure (Binary Classification):

                       Predicted Positive       Predicted Negative
Actual Positive        True Positive (TP)       False Negative (FN)
Actual Negative        False Positive (FP)      True Negative (TN)

●​ True Positive (TP): Actual = Positive, Predicted = Positive


●​ True Negative (TN): Actual = Negative, Predicted = Negative
●​ False Positive (FP): Actual = Negative, Predicted = Positive (Type I Error)
●​ False Negative (FN): Actual = Positive, Predicted = Negative (Type II Error)​

Example: In a disease test dataset of 100 people: 50 are positive, 50 negative.​


Model prediction: TP=45, TN=47, FP=3, FN=5


Matrix:

                       Predicted Positive       Predicted Negative
Actual Positive        45 (TP)                  5 (FN)
Actual Negative        3 (FP)                   47 (TN)

Derived Metrics:

●​ Accuracy: (TP + TN) / Total = (45 + 47)/100 = 92%


●​ Precision: TP / (TP + FP) = 45 / (45 + 3) ≈ 93.75%
●​ Recall (Sensitivity): TP / (TP + FN) = 45 / (45 + 5) = 90%
●​ F1-Score: 2 × (Precision × Recall) / (Precision + Recall) = 91.8% approx
●​ Specificity: TN / (TN + FP) = 47 / (47 + 3) = 94%​
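These derived metrics can be verified with a few lines of plain Python; the counts are the ones from the example above.

TP, TN, FP, FN = 45, 47, 3, 5
total = TP + TN + FP + FN

accuracy    = (TP + TN) / total                                # 0.92
precision   = TP / (TP + FP)                                   # 0.9375
recall      = TP / (TP + FN)                                   # 0.90
specificity = TN / (TN + FP)                                   # 0.94
f1          = 2 * precision * recall / (precision + recall)    # ~0.918

print(accuracy, precision, recall, specificity, round(f1, 3))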

Need of Confusion Matrix: It gives deeper insight than accuracy alone, especially in imbalanced datasets where
one class dominates. For example, in fraud detection or medical diagnosis, a model with 95% accuracy might still be
missing all actual fraud cases. Confusion matrix shows the type of misclassifications—whether the model is giving
false alarms or missing real positives. It helps evaluate model robustness, guides model tuning, helps in threshold
selection, and ensures performance is acceptable for sensitive applications. It is also the base for generating metrics
like ROC curve, AUC, precision-recall curve, and more.
Q. How to find right Hyperplane in Support Vector Machine? Explain with
suitable example Or
How Linear SVM works? Explain with example.
Ans: Support Vector Machine (SVM) is a supervised learning algorithm used for binary and multiclass classification.
A hyperplane is a decision boundary that separates the classes. In linear SVM, the goal is to find the optimal
hyperplane that maximizes the margin between data points of different classes. Margin is the distance between the
hyperplane and the nearest data points from each class, called support vectors. The larger the margin, the better
the generalization of the model.

Equation of Hyperplane: For 2D data, the separating hyperplane is a line:

w · x + b = 0   (i.e., w₁x₁ + w₂x₂ + b = 0)

where w is the weight vector, x is the input feature vector, and b is the bias.
SVM tries to choose w and b such that the margin between classes is maximized. The constraints are:

yᵢ (w · xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), with yᵢ ∈ {+1, −1}.
Example: Consider 2D data points:​


Class A (Positive): (2,3), (3,4), (4,5)​
Class B (Negative): (6,5), (7,7), (8,6)

SVM will find a straight line (hyperplane) that separates these two sets and lies at maximum distance from closest
points of both classes. The support vectors may be (4,5) and (6,5). The optimal hyperplane lies exactly midway and
is orthogonal to the line connecting the support vectors.

Linear SVM Working:

1.​ It checks whether the data is linearly separable.


2.​ It finds all possible separating hyperplanes.
3.​ It selects the one with the maximum margin using quadratic optimization.
4.​ Predictions are made based on which side of the hyperplane a point lies.​
For linearly separable data, it guarantees a unique, optimal separating boundary.

Advantages:

●​ Effective in high-dimensional spaces


●​ Works well even with a small dataset
●​ Guarantees global minimum due to convex objective function

Conclusion:​
SVM selects the hyperplane that not only separates classes but does so with maximum margin for better accuracy
and robustness. For non-linear data, kernel tricks like RBF or polynomial can be used to map data to higher
dimensions for linear separation.
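A quick way to see this on the example points is scikit-learn's SVC with a linear kernel; the data below matches the Class A / Class B points given above, and the large C value is an assumption used to approximate a hard-margin SVM.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 4], [4, 5],      # Class A (+1)
              [6, 5], [7, 7], [8, 6]])     # Class B (-1)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard-margin SVM

print("weights w      :", clf.coef_)          # orientation of the hyperplane
print("bias b         :", clf.intercept_)
print("support vectors:", clf.support_vectors_)
print("prediction for (5, 5):", clf.predict([[5, 5]]))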
Q1. Explain Convolutional Neural Network (CNN) for image classification.

Ans: CNN is a deep learning model used primarily for image-related tasks such as classification, detection, and
segmentation. It mimics the human visual cortex by processing image data in layers. CNN automatically extracts
local and hierarchical features from images using convolution and pooling operations, removing the need for manual
feature engineering. It is especially effective for spatial data and captures relationships like edges, corners, and
textures.

Architecture Components:

1.​ Input Layer: Accepts image as a 2D or 3D matrix of pixel values (e.g., 28×28×1 for grayscale).
2.​ Convolution Layer: Applies filters/kernels (e.g., 3×3) to scan the image and detect patterns.
3.​ ReLU Activation: Applies non-linearity by replacing negative values with zero.
4.​ Pooling Layer: Downsamples the feature map using Max or Average pooling to reduce size and
computation.
5.​ Fully Connected Layer (FC): Flattens feature maps and connects to output neurons.
6.​ Softmax Output: Provides class probabilities.

Example: In handwritten digit recognition (MNIST dataset), CNN identifies stroke patterns through convolutional
layers and classifies the digit using FC + softmax. CNNs generalize well on image datasets and are widely used in
real-time applications like face recognition, object detection, and medical imaging.
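A minimal Keras sketch of this architecture for 28×28 grayscale digits is shown below; the layer sizes and filter counts are illustrative choices, not values taken from the text, and would normally be tuned.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                      # input: grayscale image
    layers.Conv2D(32, (3, 3), activation="relu"),        # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                         # downsample feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                    # 2D feature maps -> 1D vector
    layers.Dense(64, activation="relu"),                 # fully connected layer
    layers.Dense(10, activation="softmax"),              # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5) would train it on MNIST-style data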

Q2. Draw Multilayer Feedforward Backpropagation Neural Network and explain.

Ans: A Multilayer Feedforward Backpropagation Neural Network (MLP with backprop) is a neural network where
data flows from input to output in one direction (feedforward), and error signals are propagated backward during
learning (backpropagation). It can approximate any non-linear function and is widely used in classification,
regression, and time-series prediction.

Architecture:

● Input Layer: Takes an n-dimensional feature vector (x₁, x₂, ..., xₙ).


●​ Hidden Layers: Perform weighted sum + activation (sigmoid, tanh, ReLU). Each neuron passes its output
to all neurons in the next layer.
●​ Output Layer: Computes final prediction using activation like softmax or linear function.
●​ Backpropagation: Computes error at output and propagates it backward to adjust weights using gradient
descent.

Working:
1. In the forward pass, data moves layer-by-layer to compute the output.
2. The error (loss) is calculated using a loss function.
3. In the backward pass, each layer's weights are updated to minimize the error.
This process repeats for many iterations (epochs) over the training dataset.

Example: To implement XOR logic, we use 2 input neurons, 2 hidden neurons, 1 output neuron. Initially, random
weights give wrong output, but after several epochs, backprop adjusts them to correctly compute XOR logic.
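The XOR example can be reproduced with a tiny NumPy network trained by backpropagation; this is only a sketch, with the learning rate, epoch count, and random seed chosen arbitrarily, and 4 hidden neurons used instead of the 2 mentioned above purely to make convergence from a random start more reliable.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))       # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))       # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))
eta = 0.5

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: gradients of squared error via the chain rule
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ d_h;    b1 -= eta * d_h.sum(axis=0, keepdims=True)

print(out.round(3))   # moves toward [0, 1, 1, 0] for most random initializations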

Diagram: [Input Layer] → [Hidden Layer(s)] → [Output Layer]​


(Arrows for forward flow, dashed arrows for error backprop)

MLPs are universal function approximators and are used in digit recognition, stock prediction, and natural language
processing.
Q3. What is Backpropagation? Explain in brief with neat diagram.

Ans: Backpropagation is the core algorithm for training neural networks. It minimizes the prediction error by propagating the error backward from the output layer and adjusting the network weights accordingly. It uses the chain rule of calculus to compute how much each weight contributes to the total error, then updates the weights to reduce that error iteratively.

1. Forward Pass: Input is fed layer-by-layer to compute the predicted output.
2. Loss Calculation: Compute the error using a loss function (e.g., MSE, cross-entropy).
3. Backward Pass: Calculate gradients ∂E/∂w for each weight using the chain rule.
4. Weight Update:

   w_new = w_old − η · ∂E/∂w

   where η = learning rate and ∂E/∂w = gradient of the error with respect to the weight.

Backpropagation is used in all deep networks, including CNNs and RNNs. It ensures faster convergence and allows deep architectures to be trained effectively.

Use Case: In image classification or speech recognition, backpropagation ensures that the correct features are learned during training, reducing classification errors over the epochs.

Diagram: [Input Layer] → [Hidden Layer(s)] → [Output Layer], with error gradients propagated backward from the output to the input.

Q4. Explain the Architecture of Convolutional Neural Network.


Ans: The architecture of a CNN is a layered structure designed to automatically extract hierarchical features from
input images. Each layer transforms the input volume to a more abstract representation, making it easier for the
network to detect high-level patterns. CNNs exploit spatial and local correlations in data by using kernels and
pooling mechanisms.

Typical CNN Architecture Includes:

1.​ Input Layer: Receives raw pixel data (e.g., 224×224×3 for RGB image).
2.​ Convolution Layer(s): Applies filters to detect local features such as edges and textures. Output is a
feature map.
3.​ Activation Layer (ReLU): Applies non-linearity to maintain complex mapping.
4.​ Pooling Layer (Max or Avg): Reduces the feature map size and increases computation efficiency.
5.​ Dropout Layer (optional): Randomly drops some neurons to prevent overfitting.
6.​ Flatten Layer: Converts 2D feature maps to 1D vector.
7.​ Fully Connected Layer (Dense): Combines all features to make final classification.
8.​ Softmax Output: Converts scores into probabilities for multi-class output.

Example: In a dog-vs-cat classifier, lower layers detect edges and corners, middle layers detect ears/tails, and final
layers classify full images. CNNs are efficient, require fewer parameters than fully connected networks, and achieve
high accuracy in image tasks.
Q. Explain Perceptron. How it works? Explain with diagram.
Ans: A Perceptron is the simplest type of artificial neural network and the foundation of deep learning. It is a
binary linear classifier that maps input features to output classes using a linear decision boundary. It was
introduced by Frank Rosenblatt in 1958 and works well for linearly separable problems.

Architecture:

● Inputs (x₁, x₂, ..., xₙ): Feature values
● Weights (w₁, w₂, ..., wₙ): Associated with each input
● Bias (b): Adjusts the output independently of the input
● Summation Unit: Computes the weighted sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
● Activation Function (usually Step Function): output = 1 if z ≥ 0, else 0

Working:

1.​ Inputs are multiplied by corresponding weights and summed.


2.​ The sum is passed through an activation function.
3.​ The result is binary output (0 or 1).
4. During training, weights are updated using the perceptron learning rule:
   wᵢ = wᵢ + η (y − ŷ) xᵢ,   b = b + η (y − ŷ)
   where η is the learning rate, y the true label, and ŷ the predicted output.

Example: For OR gate:


Inputs (0,1), (1,0), (1,1) → Output = 1
Input (0,0) → Output = 0
Perceptron can learn OR/AND logic but not XOR (non-linear problem).

Limitation: Perceptron fails on non-linearly separable problems like XOR. For that, multi-layer perceptrons (MLPs)
with hidden layers are used.

Conclusion: Perceptron is fast, simple, and effective for linear classification tasks. It laid the foundation for deeper
neural networks.
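A minimal NumPy sketch of the perceptron learning rule on the OR gate follows; the learning rate and number of epochs are arbitrary choices that happen to suffice for this linearly separable problem.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])                 # OR gate targets

w, b, eta = np.zeros(2), 0.0, 0.1
step = lambda z: 1 if z >= 0 else 0        # step activation

for _ in range(10):                        # a few epochs are enough for OR
    for xi, target in zip(X, y):
        pred = step(w @ xi + b)
        w += eta * (target - pred) * xi    # perceptron weight update
        b += eta * (target - pred)

print([step(w @ xi + b) for xi in X])      # -> [0, 1, 1, 1]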

Diagram :
Q1. Write a note on PAC (Probably Approximately Correct) Learning Model.

Ans: PAC Learning is a theoretical model in machine learning that studies whether a concept can be learned
efficiently from data. Introduced by Leslie Valiant in 1984, PAC learning provides a mathematical framework to
define the feasibility and limits of learning algorithms under uncertainty. The term “Probably Approximately
Correct” refers to the idea that a learner can find a hypothesis that is approximately correct (with low error) and
probably correct (with high confidence).
Let:

●​ ε (epsilon) = maximum allowed error (approximation)


●​ δ (delta) = probability of failure (risk)​
A hypothesis h is PAC-learned if, with probability at least (1 − δ), the error of h is less than ε.

Key Terms:

●​ Concept c: The actual function or pattern to be learned.


●​ Hypothesis h: The learner’s guess of the concept.
●​ Instance space X: All possible inputs.
●​ Hypothesis space H: Set of all hypotheses that the learner can consider.

Mathematical Bound for PAC Learning:
To ensure PAC learnability of a finite hypothesis space H with a consistent learner, the number of training examples needed is:

m ≥ (1/ε) (ln|H| + ln(1/δ))

This means that as ε decreases (higher accuracy) or δ decreases (higher confidence), more samples are needed.

Example: If you’re learning to identify spam emails using features like sender, subject, etc., PAC learning can tell
you how many emails (examples) you need to see before your spam filter performs well with high confidence.
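Using the bound above, a short calculation shows how the required sample size grows as ε and δ shrink; the hypothesis-space size |H| used here is an assumed example value.

import math

def pac_sample_size(h_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

print(pac_sample_size(h_size=1000, eps=0.10, delta=0.05))   # -> 100 examples
print(pac_sample_size(h_size=1000, eps=0.01, delta=0.01))   # -> 1152 examples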

Importance: PAC model ensures theoretical learning guarantees. It helps determine whether an algorithm can be
trained efficiently on a given dataset and helps prove the learnability of classes like decision trees, neural networks,
and SVMs. It is a foundation for computational learning theory.

Q3. Explain Ensemble Learning in detail. ( Bagging & Boosting:)

Ans: Ensemble Learning is a powerful machine learning technique where multiple weak or base models are
trained and combined to produce better overall performance than any single model. It is based on the idea that a
group of diverse models, when combined properly, can outperform even strong individual learners.

Why Ensemble Learning?

●​ Reduces overfitting (variance)


●​ Decreases bias
●​ Improves accuracy and stability
●​ More robust to noise or outliers​
Ensemble methods are especially useful when a single model struggles with high variance or high bias.

Types of Ensemble Learning:

1) Bagging (Bootstrap Aggregating):

Bagging builds several independent models in parallel using different subsets of the training data (sampled with
replacement). Each model is trained separately, and final output is obtained by majority voting (for classification) or
averaging (for regression).
Steps:

●​ Create multiple random datasets by bootstrapping.


●​ Train a separate model on each dataset.
●​ Aggregate their predictions.

Advantages:

●​ Reduces overfitting and variance.


●​ Improves stability for unstable models like decision trees.

Example: Random Forest is an ensemble of decision trees where each tree is trained on a random bootstrapped
sample and random subset of features.

2) Boosting:

Boosting trains models sequentially, where each new model focuses more on the errors made by the previous
model. Weights are increased for misclassified samples so that the next model pays more attention to them. Final
prediction is made using weighted majority voting.

Steps:

●​ Initialize equal weights for all training examples.


●​ Train first model and compute errors.
●​ Increase weight of misclassified instances.
●​ Train next model on updated weights.
●​ Repeat and combine predictions.

Advantages:

●​ Reduces both bias and variance.


●​ Boosting often achieves very high accuracy.

Popular Algorithms:

●​ AdaBoost: Uses decision stumps (1-level trees), adjusts weights after each round.
●​ Gradient Boosting: Minimizes loss function via gradient descent at each stage.
●​ XGBoost: Optimized gradient boosting, faster and regularized.

Comparison Table:

Aspect              Bagging                       Boosting
Model Training      Parallel                      Sequential
Goal                Reduce variance               Reduce bias
Sample Selection    Bootstrap sampling            Reweighted samples
Output              Majority voting / averaging   Weighted voting
Example             Random Forest                 AdaBoost, XGBoost

Real-life Example: In email spam detection:

●​ Bagging may combine many decision trees to vote if an email is spam.


●​ Boosting will refine the classifier over iterations, focusing more on misclassified spam emails.
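Both approaches are available off the shelf; the sketch below contrasts a bagging ensemble (Random Forest) with a boosting ensemble (AdaBoost) on a synthetic dataset, with all settings chosen purely for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

bagging  = RandomForestClassifier(n_estimators=100, random_state=42)   # parallel trees
boosting = AdaBoostClassifier(n_estimators=100, random_state=42)       # sequential stumps

for name, model in [("Bagging (Random Forest)", bagging),
                    ("Boosting (AdaBoost)", boosting)]:
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", model.score(X_te, y_te))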
Q5(a). Write a note on Adaptive Hierarchical Clustering.

Ans: Adaptive Hierarchical Clustering is a flexible version of traditional hierarchical clustering where the
linkage method or distance metric can adapt based on data distribution. It combines the strengths of agglomerative
and divisive clustering with dynamic thresholding and stopping criteria.

Key Features:

●​ Dynamically adjusts merging or splitting criteria.


●​ Can handle non-uniform cluster sizes or shapes.
●​ Incorporates statistical tests or machine learning techniques to decide hierarchy changes.

Applications:

●​ Biological taxonomy
●​ Image segmentation
●​ Multiscale pattern analysis

Advantage: More accurate cluster detection in real-world, non-linear data.​


Limitation: More computationally complex than basic hierarchical clustering.

Q5(b). Write a note on Gaussian Mixture Model (GMM).

Ans: GMM is a probabilistic clustering algorithm that models data as a mixture of multiple Gaussian
distributions. Unlike k-means, which assigns points to only one cluster, GMM assigns probability of belonging to
each cluster. It is based on Expectation-Maximization (EM) algorithm.

Model Structure: Each cluster is represented by a Gaussian (Normal) distribution defined by:

●​ Mean (μ): Center of the cluster


●​ Covariance (Σ): Shape of the cluster
●​ Weight (π): Proportion of data points in that cluster

Algorithm Steps (EM):

1.​ Initialize parameters (means, covariances, weights).


2.​ E-step: Estimate probability that each point belongs to each cluster.
3.​ M-step: Update parameters to maximize likelihood.
4.​ Repeat E–M steps until convergence.

Advantages:

●​ Handles elliptical clusters better than k-means.


●​ Provides soft clustering (probability-based).
●​ Suitable for overlapping clusters.

Applications:

●​ Image segmentation
●​ Speaker identification
●​ Market segmentation
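A short scikit-learn sketch of the EM-based fit is shown below; the two synthetic Gaussian blobs are assumed toy data, not an example from the text.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two overlapping 2-D Gaussian blobs as toy data
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print("means:\n", gmm.means_)            # cluster centers (mu)
print("weights:", gmm.weights_)          # mixing proportions (pi)
print("soft assignment of first point:", gmm.predict_proba(X[:1]))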
Q1. Explain K-Means Clustering Algorithm with example and working.

Ans: K-Means is an unsupervised machine learning algorithm used to partition a dataset into K distinct,
non-overlapping clusters. It works by grouping similar data points together and assigning them to the cluster
whose centroid is nearest. The algorithm minimizes the Within-Cluster Sum of Squares (WCSS), also called
inertia. It assumes that the number of clusters (K) is known in advance and all clusters are
spherical in shape and similar in size.
Working of K-Means Algorithm:
Step 1: Choose the number of clusters, K.
Step 2: Randomly initialize K cluster centroids.
Step 3: Assign each data point to the nearest centroid using Euclidean distance:
        d(x, c) = sqrt( Σ (xᵢ − cᵢ)² )
Step 4: Compute new centroids by taking the mean of all data points in each cluster.
Step 5: Repeat Steps 3–4 until cluster assignments do not change or a maximum number of iterations is reached.

Example: Dataset: (2,3), (3,3), (10,15), (11,14), (8,10); let K = 2.
1. Randomly select centroids, say (2,3) and (10,15).
2. Assign each point to the nearest centroid: (2,3), (3,3) go to cluster 1; (10,15), (11,14), (8,10) go to cluster 2.
3. Recalculate centroids: Cluster 1 mean = (2.5, 3), Cluster 2 mean = (9.67, 13).
4. Reassign points and update centroids again.
5. Continue until the centroids stabilize.

Final result: two distinct clusters with clearly separated data points.

Advantages: 1. Easy to implement and understand. 2. Efficient for large datasets. 3. Works well when clusters are well-separated and spherical.
Limitations: 1. Requires K in advance. 2. Sensitive to initial centroid placement. 3. Fails on non-spherical or overlapping clusters. 4. Affected by outliers.
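The worked example above can be reproduced with scikit-learn's KMeans; this is only a sketch, with n_init and random_state set explicitly to make the run deterministic.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 3], [10, 15], [11, 14], [8, 10]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("labels   :", km.labels_)           # cluster assignment of each point
print("centroids:", km.cluster_centers_)  # ~ (2.5, 3) and (9.67, 13)
print("WCSS     :", km.inertia_)          # within-cluster sum of squares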

Q2. How to choose the optimal value of K in K-Means Clustering?
Choosing the correct value of K is crucial for effective clustering. An incorrect K may lead to underfitting (too few clusters) or overfitting (too many clusters).

Common Techniques to Choose Optimal K:

Elbow Method:
1. Plot the WCSS (Within-Cluster Sum of Squares) vs. the number of clusters K.
2. WCSS decreases as K increases.
3. Find the “elbow point” where the rate of decrease sharply reduces.
4. That K is optimal.
5. The graph shows diminishing returns in reducing error beyond the elbow.


Silhouette Score:
1. Measures how well-separated and cohesive the clusters are.
2. The score ranges from −1 to 1.
3. A higher score indicates a better cluster structure.
4. The average silhouette score is computed for different values of K.

Gap Statistic:
1. Compares the total intra-cluster variation for different values of K with its expected value under a reference null distribution.
2. A higher gap value indicates better clustering.

Domain Knowledge:
1. In some cases, expert understanding of the dataset is used to determine a logical number of clusters.
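In practice, the elbow and silhouette checks look like the sketch below; the synthetic blob data stands in for a real feature matrix and the range of K is an arbitrary choice.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
# Look for the elbow in WCSS and the peak silhouette score (here around K = 4).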
Q1. Write the Nearest Neighbor Clustering Algorithm.
Ans: Nearest Neighbor Clustering is a bottom-up agglomerative hierarchical clustering method where clusters are formed by repeatedly merging the closest pair of data points or clusters based on distance. It uses single linkage, where the distance between two clusters is the minimum distance between any two points in the two clusters.

Algorithm Steps:
1. Start with each data point as a separate cluster.
2. Compute pairwise distances between all clusters.
3. Merge the two clusters that are closest (minimum distance).
4. Recalculate distances between the new cluster and the remaining clusters.
5. Repeat steps 3–4 until all points are in one cluster or the desired number of clusters is formed.

Distance Metric: Usually Euclidean or Manhattan distance.
Applications: 1. Bioinformatics (gene similarity) 2. Document clustering 3. Anomaly detection
Limitation: Sensitive to noise and outliers; chaining effect due to single linkage.

Q. Explain Agglomerative Algorithm in Hierarchical Clustering in detail.


Agglomerative Hierarchical Clustering is a bottom-up clustering method where each data point is initially treated
as an individual cluster. The algorithm repeatedly merges the closest pair of clusters until all points are grouped
into a single cluster or a predefined number of clusters is achieved. The result is typically represented using a
dendrogram (tree-like diagram).

Working:
1. Start: Treat each of the n data points as a separate cluster → n clusters.
2. Compute Distance Matrix: Find the distance between all pairs of clusters using a chosen metric (e.g., Euclidean distance).
3. Merge Closest Clusters: Identify and merge the pair of clusters that are closest to each other based on a linkage criterion.
4. Update Distance Matrix: After merging, recompute distances between the new cluster and all other existing clusters.
5. Repeat: Steps 3–4 until only one cluster remains or the desired number of clusters (k) is reached.

Linkage Criteria:
Single Linkage: Minimum distance between points of two clusters
Complete Linkage: Maximum distance between points of two clusters
Average Linkage: Average distance between all pairs of points in two clusters
Centroid Linkage: Distance between centroids of two clusters

Dendrogram:
1. A tree diagram showing how clusters are merged step-by-step.
2. The vertical axis represents the distance at which clusters are joined.
3. Cutting the dendrogram at a chosen level gives the desired number of clusters.

Example: Given data points A, B, C, D:
Step 1: Start with 4 clusters: {A}, {B}, {C}, {D}
Step 2: Merge the two closest points (say, A & B → {AB})
Step 3: Merge the next closest pair (say, C & D → {CD})
Step 4: Merge {AB} and {CD} → Final cluster {ABCD}
The merging process can be visualized using a dendrogram.


Advantages: 1. No need to specify the number of clusters (K) in advance. 2. Produces hierarchical relationships. 3. Works well with small datasets and irregular cluster shapes.

Disadvantages: 1. Computationally expensive for large datasets (O(n² log n)). 2. Not suitable for very large datasets. 3. Sensitive to noise and outliers. 4. Once merged, clusters cannot be undone.

Applications: 1. Taxonomy and phylogenetic trees (biology) 2. Document and text clustering 3. Image segmentation 4. Customer segmentation
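In SciPy, the same bottom-up merging can be run and visualized as a dendrogram; the four 2-D points below are assumed stand-ins for A, B, C, D, and the cut level is an arbitrary choice.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2]])   # A, B, C, D

Z = linkage(X, method="single")                 # single linkage = nearest-neighbour merging
labels = fcluster(Z, t=2, criterion="maxclust") # cut the tree into 2 clusters
print("cluster labels:", labels)                # A, B in one cluster; C, D in the other

dendrogram(Z, labels=["A", "B", "C", "D"])
plt.ylabel("merge distance")
plt.show()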
Q. What is Curse of Dimensionality? What are measures to resolve it?
Ans: The Curse of Dimensionality refers to the set of problems and challenges that arise when working with data
in high-dimensional spaces. As the number of features (dimensions) increases, the data becomes sparse,
distance metrics lose meaning, and model performance often degrades. This curse affects clustering, classification,
regression, and nearest neighbor algorithms.

Key Problems in High Dimensions:


1.​ Data Sparsity: In high dimensions, data points become far apart, making it difficult to detect patterns.
2.​ Distance Metrics Fail: Euclidean or cosine distances become less discriminative, reducing the
effectiveness of clustering or k-NN.
3.​ Overfitting: More dimensions → more complexity → model learns noise instead of useful patterns.
4.​ Increased Computation Cost: Time and space complexity increases exponentially with dimensionality.
5.​ Visualization Difficulties: Hard to visualize beyond 3D, making interpretation difficult.

Measures to Resolve Curse of Dimensionality:

1. Dimensionality Reduction Techniques:

●​ Principal Component Analysis (PCA): Projects data into fewer dimensions while preserving
variance.
●​ Linear Discriminant Analysis (LDA): Reduces dimensions while preserving class separability.
●​ t-SNE / UMAP: Non-linear techniques for visualization in 2D/3D.​
These help remove redundant and less-informative features.

2. Feature Selection:

●​ Select only the most relevant features using techniques like:​

○​ Filter methods (e.g., correlation, variance)


○​ Wrapper methods (e.g., recursive feature elimination)
○​ Embedded methods (e.g., Lasso regularization)​
This helps reduce noise and improve accuracy.

3. Regularization Techniques:

●​ Apply L1 (Lasso) or L2 (Ridge) regularization to penalize large coefficients and prevent overfitting in high
dimensions.

4. Data Sampling / Clustering First:

●​ Cluster high-dimensional data to reduce variance, then model within clusters.

5. Domain Knowledge:

●​ Use expert insight to select only meaningful features before model training.
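For example, PCA in scikit-learn can compress the 64-dimensional digits dataset into a handful of components while retaining most of the variance; the choice of 10 components here is arbitrary.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 1797 samples, 64 features
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print("original shape :", X.shape)           # (1797, 64)
print("reduced shape  :", X_reduced.shape)   # (1797, 10)
print("variance kept  :", pca.explained_variance_ratio_.sum())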
Logistic Regression is a supervised classification algorithm used to predict the probability of a categorical
dependent variable, especially for binary outcomes (0 or 1). Unlike linear regression which outputs continuous
values, logistic regression uses the sigmoid function to convert the linear combination of inputs into a probability
score between 0 and 1.

The hypothesis function is the sigmoid applied to a linear combination of the inputs:

h(x) = 1 / (1 + e^−(w·x + b))

If h(x) > 0.5, predict class 1; else predict class 0. It estimates the probability that the given input belongs to the
positive class.

Example: Suppose we want to predict whether a student passes an exam based on hours studied. Logistic
regression takes the input features (like study time) and outputs the probability of passing. If the probability > 0.5,
model predicts "Pass" (1), else "Fail" (0).

(a) Features of Logistic Regression:

●​ Used for classification, not regression problems.


●​ Outputs probabilities, not direct class labels.
●​ Based on sigmoid activation function which maps values to [0, 1].
●​ Uses log-odds (logit) to model the relationship between input features and target.
●​ Fast, interpretable, and easy to implement.
●​ Can be regularized using L1 (Lasso) or L2 (Ridge) to avoid overfitting.
●​ Performs best when classes are linearly separable.​

(b) Types of Logistic Regression:

1.​ Binary Logistic Regression: Predicts two outcomes (e.g., Pass/Fail, Yes/No, Spam/Not Spam).​

2.​ Multinomial Logistic Regression: Used when the dependent variable has more than two unordered
categories (e.g., predicting fruit type: Apple, Banana, Orange).​

3.​ Ordinal Logistic Regression: Used when the dependent variable has ordered categories (e.g., Poor,
Average, Good, Excellent).​
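A sketch of the pass/fail example with scikit-learn follows; the hours-studied data and labels are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1], [1.5], [2], [2.5], [3], [3.5], [4], [4.5], [5]])
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])       # 0 = Fail, 1 = Pass

model = LogisticRegression().fit(hours, passed)

print("P(pass | 2.6 hours) =", model.predict_proba([[2.6]])[0, 1])
print("prediction for 2.6 hours:", model.predict([[2.6]])[0])   # 1 if probability > 0.5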
(a) Cross Validation:
Cross Validation is a model evaluation technique used to assess how well a machine learning model generalizes
to unseen data. It helps avoid overfitting and ensures that the model performs well across different subsets of the
dataset. The most common type is k-fold cross-validation, where:

●​ The dataset is divided into k equal parts (folds).


●​ The model is trained on k−1 folds and tested on the remaining 1 fold.
●​ This process is repeated k times, each time with a different fold used as the test set.
●​ The final performance is the average of all k results.

Other variants include Leave-One-Out CV and Stratified K-Fold.​


Benefits: Provides a better estimate of model accuracy and reduces variance in evaluation.
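A typical 5-fold cross-validation run with scikit-learn looks like this; the model and dataset are placeholders chosen only to make the sketch self-contained.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())   # average of the k results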

(b) Inductive Bias:


Inductive Bias refers to the set of assumptions a learning algorithm makes to generalize from limited training data
to unseen data. Since data alone is insufficient for unique generalization, a bias is necessary to prefer one
hypothesis over another.
Example: 1. A decision tree algorithm assumes the shortest tree that fits the data is better (Occam’s Razor). 2. A linear classifier assumes that the data is linearly separable.

Types: 1. Language Bias: Restrictions on hypothesis representation (e.g., only linear models). 2. Preference Bias: Preferences over hypotheses (e.g., simpler is better).

Importance: Inductive bias enables learning to happen, but the wrong bias may lead to poor generalization. It
directly affects algorithm performance and accuracy.

1. Binomial Logistic Regression:


Binomial (or binary) logistic regression is used when the dependent variable has exactly two possible outcomes
(e.g., 0 or 1, Yes or No, Pass or Fail).​
It models the log odds of the probability of the default class using the sigmoid function.​
The goal is to separate the two classes using a single decision boundary.​
Training involves minimizing the binary cross-entropy loss.​
It is widely used in applications like spam detection, loan approval, and medical diagnosis (disease: yes/no).​
Only one sigmoid function is required to compute the probability of class 1 (and 1 − p for class 0).​
Output is converted to a label using a threshold (commonly 0.5).

2. Multinomial Logistic Regression:


Multinomial logistic regression is used when the target variable has more than two unordered categories (e.g.,
classifying animals as cat, dog, or horse).​
Instead of a single sigmoid, it uses the softmax function to compute probabilities for all classes.​
Each class gets its own set of coefficients, and the model learns multiple linear functions.​
The output probabilities of all classes sum up to 1.​
It minimizes the categorical cross-entropy loss during training.​
Used in applications like text classification, sentiment analysis, and multi-class image recognition.​
It generalizes binary logistic regression to handle multiple classes simultaneously.
Write a note on Bayes Theorem and illustrate Maximum Likelihood Method for predicting
probabilities in Bayesian Learning with an example.

Ans: Bayes Theorem is a fundamental concept in probability theory used in Bayesian learning to update the
probability of a hypothesis based on observed evidence. It provides a way to combine prior knowledge with new
data.
Bayes Theorem Formula:

P(H|D) = [ P(D|H) × P(H) ] / P(D)

Where:

●​ P(H∣D): Posterior probability (probability of hypothesis H given data D)


●​ P(D∣H): Likelihood (probability of data D given H)
●​ P(H): Prior probability of hypothesis H
●​ P(D): Marginal probability of data D

It helps in computing the updated probability of a hypothesis after seeing the data.

Maximum Likelihood Estimation (MLE): In Bayesian learning, MLE is used to estimate parameters that maximize the likelihood P(D|H), i.e., the probability of observing data D under hypothesis H. It ignores the prior and focuses on finding parameters that best explain the observed data.

Example – Email Spam Classification: Let’s say we want to classify an email as Spam (S) or Not Spam (¬S) based on the presence of the word “Free”.
We are given: 1. P(S): Prior probability of spam = 0.4  2. P(¬S): Prior probability of not spam = 0.6  3. P(Free|S) = 0.8, P(Free|¬S) = 0.1
Applying Bayes theorem:
P(S|Free) = P(Free|S) × P(S) / [ P(Free|S) × P(S) + P(Free|¬S) × P(¬S) ] = (0.8 × 0.4) / (0.8 × 0.4 + 0.1 × 0.6) = 0.32 / 0.38 ≈ 0.84
Since P(S|Free) is much greater than P(¬S|Free), an email containing “Free” is classified as Spam.
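The same posterior can be checked with a few lines of plain Python using the numbers above.

p_spam, p_not = 0.4, 0.6
p_free_given_spam, p_free_given_not = 0.8, 0.1

p_free = p_free_given_spam * p_spam + p_free_given_not * p_not   # P(Free) = 0.38
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))   # ~0.842 -> classify as Spam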

Q. What is Unsupervised Machine Learning? Explain with a suitable example.
Unsupervised Machine Learning is a type of machine learning where the model is trained on unlabeled data,
meaning no predefined class or output is given. The objective is to let the algorithm discover hidden patterns or
structures in the data without supervision. It is mainly used for clustering, association rule mining, and
dimensionality reduction.

Unlike supervised learning, there is no feedback signal to guide the learning process. The system organizes data
based on similarity, density, or statistical features.

Example:​
In customer segmentation, K-Means Clustering groups customers based on purchasing patterns without prior
labels. These insights help businesses design personalized marketing strategies and product recommendations.
Q. What is Instance-Based Learning? Explain the importance of feature
reduction while solving problems using machine learning techniques.
Ans:​
Instance-Based Learning is a type of lazy learning algorithm where the model stores training instances and
delays generalization until a query is made. Instead of learning a function from the training data, it compares new
inputs directly to stored examples using a similarity (distance) metric, such as Euclidean or Manhattan
distance.

The most common instance-based algorithms are K-Nearest Neighbors (KNN) and Case-Based Reasoning
(CBR). These models predict output by retrieving the most similar instances from memory and combining their
outputs (e.g., by voting or averaging).

Characteristics:

●​ No explicit training phase​

●​ Easy to implement​

●​ Memory-intensive and sensitive to irrelevant features​

●​ Works well with well-labeled and noise-free data​

Importance of Feature Reduction in ML Problems:


Feature Reduction (or dimensionality reduction) involves removing irrelevant, redundant, or noisy features from
the dataset before applying a machine learning model.

Why it is important:

1.​ Improves Model Accuracy: Irrelevant features can confuse the model and degrade performance.​

2.​ Reduces Overfitting: Fewer features reduce model complexity, lowering the chance of fitting noise.​

3.​ Speeds Up Computation: Smaller input size means faster training and prediction.​

4.​ Enhances Interpretability: Easier to understand and visualize models with fewer features.​

5.​ Improves Distance Calculations: In instance-based learning, fewer and relevant features make distance
metrics more meaningful, improving prediction quality.​

Techniques used:

●​ Feature selection: correlation, mutual information, chi-square​

●​ Dimensionality reduction: PCA, LDA, t-SNE​
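A small K-Nearest Neighbors sketch illustrating instance-based prediction is shown below; the points and the choice of k = 3 are made up for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1],      # class 0 instances
              [6, 6], [7, 6], [6, 7]])     # class 1 instances
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "training" just stores the instances
print(knn.predict([[2, 2], [6, 5]]))                  # -> [0 1]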


Q. Explain the Random Forest Algorithm with examples.
Ans: Random Forest is a popular ensemble learning algorithm used for classification and regression tasks. It
works by creating a collection (forest) of decision trees during training and outputs the majority vote
(classification) or average (regression) of the individual trees. It improves accuracy and reduces overfitting
compared to a single decision tree.

Working of Random Forest Algorithm:


1.​ Bootstrap Sampling:​

○​ From the original dataset, multiple random samples are drawn with replacement (called
bootstrapped datasets).
○​ Each sample is used to train a different decision tree.​

2.​ Random Feature Selection:​

○​ At each split in the decision tree, only a random subset of features is considered, not all
features.
○​ This adds randomness and reduces correlation between trees.​

3.​ Build Multiple Trees:​

○​ Many trees are grown independently using different bootstrapped samples and random features.​

4.​ Prediction:​

○​ For classification: the final prediction is made using majority voting from all trees.
○​ For regression: the final output is the average of all tree predictions.

Example (Classification):
Suppose we have a dataset of students with features like age, study time, and attendance. The target is whether the
student will pass or fail.

●​ Random Forest creates multiple decision trees, each trained on different student samples and features.
●​ Each tree gives its own prediction (pass/fail), and the majority output is chosen as the final prediction.
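A sketch of that student example with scikit-learn follows; the feature values and labels are invented placeholders, not data from the text.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# columns: age, weekly study hours, attendance %
X = np.array([[18, 2, 60], [19, 8, 90], [20, 1, 50], [18, 6, 85],
              [21, 7, 95], [19, 3, 55], [20, 9, 92], [18, 2, 45]])
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])        # 1 = pass, 0 = fail

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.predict([[19, 5, 80]]))           # majority vote of the 50 trees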

Advantages:
●​ High accuracy and robustness to overfitting
●​ Works well with large datasets and high dimensional data
●​ Handles missing values and non-linear data effectively
●​ Reduces variance by averaging multiple models

Disadvantages:
●​ Slower than single decision tree due to multiple trees
●​ Less interpretable than a single decision tree
●​ Requires more memory and computation
