Machine Learning
Machine Learning (ML) is a branch of Artificial Intelligence that enables systems to learn from
data and improve performance without being explicitly programmed. ML focuses on designing algorithms that can
identify patterns, make predictions, and make decisions based on input data.
Types of Machine Learning: Machine Learning is broadly classified into three types (a short illustrative code sketch follows the list):
1. Supervised Learning: In supervised learning, the model is trained using labeled data (i.e., input-output pairs).
The goal is to learn a function that maps inputs to correct outputs.
Examples: Email spam detection, house price prediction, sentiment analysis.
Algorithms: Linear Regression, Decision Tree, Support Vector Machine.
2. Unsupervised Learning: In this type, data is not labeled. The model tries to identify hidden patterns or structures
from the input data without predefined output.
Examples: Customer segmentation, market basket analysis, anomaly detection.
Algorithms: K-Means Clustering, Hierarchical Clustering, PCA.
3. Reinforcement Learning: An agent interacts with an environment and learns to make decisions by receiving
rewards or penalties. It aims to maximize cumulative reward.
Examples: Game playing (chess, Go), robotics, self-driving cars.
Algorithms: Q-Learning, Deep Q Network, SARSA.
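Illustrative code sketch (for reference only): the snippet below contrasts supervised and unsupervised learning using scikit-learn; the tiny house-price and 2-D point datasets are made up for illustration.

```python
# A minimal sketch contrasting supervised and unsupervised learning
# with scikit-learn (toy data invented for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: labeled input-output pairs (house size -> price).
X = np.array([[50], [80], [120], [200]])        # inputs (size in m^2)
y = np.array([100, 160, 240, 400])              # labels (price in $1000s)
reg = LinearRegression().fit(X, y)
print(reg.predict([[150]]))                     # predict price for an unseen size

# Unsupervised: no labels, the algorithm finds structure on its own.
points = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)                               # cluster assignment per point
```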
Working: An ML system learns a model from historical (training) data, evaluates it on unseen data, and then uses the learned model to make predictions or decisions on new inputs; performance generally improves as more data becomes available.
Advantages: automates decision making, handles large and complex datasets, improves continuously with more data, and applies across domains such as vision, speech, recommendation, and fraud detection.
Q. What is overfitting in Decision Trees? How is it avoided using pruning?
Ans: Overfitting occurs when a decision tree learns noise and details from the training data, resulting in poor
performance on unseen data. It often happens when the tree becomes too deep or complex.
Symptoms of Overfitting: very high accuracy on training data but poor accuracy on test/validation data; a very deep, complex tree with many branches that fit individual samples.
Pruning: A technique to reduce the size of a tree and avoid overfitting by removing nodes that add little predictive
power.
Types:
1. Pre-Pruning (early stopping): Stop growing the tree early using conditions such as maximum depth, minimum samples per node, or minimum information gain.
2. Post-Pruning: Grow the full tree first, then remove branches that do not improve accuracy on a validation set.
Advantages of Pruning:
● Simplifies model
● Increases generalization
● Prevents overfitting
Example: A leaf node classifies only one sample incorrectly; pruning it might improve overall accuracy.
Q4. How do you choose the best split in the Decision Tree algorithm?
The best split at each node of a decision tree is chosen by measuring how well an attribute separates data into
distinct classes. This is done using impurity measures:
1. Entropy: Measures disorder or impurity in a dataset: Entropy(S) = −Σ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i.
2. Information Gain: The reduction in entropy after splitting on an attribute; the attribute with the highest Information Gain is chosen for the split.
Example: If Outlook has the highest Information Gain, it becomes the root node and data is split based on its
values.
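Illustrative code sketch: small Python helpers for entropy and information gain; the class counts below follow the classic play-tennis example (9 yes / 5 no, split on Outlook) and are an assumption for illustration.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2 p_i) over the class proportions."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, groups):
    """IG = Entropy(parent) - weighted sum of child entropies after a split."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Hypothetical class labels before and after splitting on an attribute
# (counts follow the classic play-tennis / Outlook example).
parent = ['yes'] * 9 + ['no'] * 5
children = [['yes'] * 2 + ['no'] * 3,   # e.g. value "sunny"
            ['yes'] * 4,                # "overcast"
            ['yes'] * 3 + ['no'] * 2]   # "rain"
print(entropy(parent))                     # ~0.940
print(information_gain(parent, children))  # ~0.247
```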
Q. Explain Confusion Matrix for classification problem Or Explain
Confusion Matrix with example and state the need of confusion matrix.
Ans: A Confusion Matrix is a performance evaluation table used for classification problems, where the output can
be divided into multiple classes. It compares the actual target values with those predicted by the model. In binary
classification, it is a 2×2 matrix; for multiclass, it extends to n×n format. It provides a complete picture of prediction
outcomes, separating correct and incorrect predictions into four categories.
Matrix (binary classification):
                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)
Derived Metrics: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall (Sensitivity) = TP / (TP + FN), F1-Score = 2 × (Precision × Recall) / (Precision + Recall).
Need of Confusion Matrix: It gives deeper insight than accuracy alone, especially in imbalanced datasets where
one class dominates. For example, in fraud detection or medical diagnosis, a model with 95% accuracy might still be
missing all actual fraud cases. Confusion matrix shows the type of misclassifications—whether the model is giving
false alarms or missing real positives. It helps evaluate model robustness, guides model tuning, helps in threshold
selection, and ensures performance is acceptable for sensitive applications. It is also the base for generating metrics
like ROC curve, AUC, precision-recall curve, and more.
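Illustrative code sketch: computing a confusion matrix and the derived metrics with scikit-learn on hypothetical predictions.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical ground truth and predictions for a binary classifier (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))
```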
Q. How to find right Hyperplane in Support Vector Machine? Explain with
suitable example Or
How Linear SVM works? Explain with example.
Ans: Support Vector Machine (SVM) is a supervised learning algorithm used for binary and multiclass classification.
A hyperplane is a decision boundary that separates the classes. In linear SVM, the goal is to find the optimal
hyperplane that maximizes the margin between data points of different classes. Margin is the distance between the
hyperplane and the nearest data points from each class, called support vectors. The larger the margin, the better
the generalization of the model.
The separating hyperplane is defined as w·x + b = 0, where w is the weight vector, x is the input feature vector, and b is the bias.
SVM chooses w and b so that the margin between the classes is maximized, subject to the constraints yᵢ(w·xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), with yᵢ ∈ {+1, −1}.
Example: Suppose we have two linearly separable classes of 2-D points. SVM will find a straight line (hyperplane) that separates these two sets and lies at maximum distance from the closest
points of both classes. The support vectors may be (4,5) and (6,5). The optimal hyperplane lies exactly midway and
is orthogonal to the line connecting the support vectors.
Advantages: effective in high-dimensional spaces, memory-efficient (only the support vectors define the boundary), robust with a soft margin, and extendable to non-linear problems via kernels.
Conclusion:
SVM selects the hyperplane that not only separates classes but does so with maximum margin for better accuracy
and robustness. For non-linear data, kernel tricks like RBF or polynomial can be used to map data to higher
dimensions for linear separation.
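Illustrative code sketch: a linear SVM in scikit-learn on a small hypothetical 2-D dataset arranged so that (4,5) and (6,5) are the closest points of the two classes, matching the example above.

```python
import numpy as np
from sklearn.svm import SVC

# Two hypothetical 2-D classes; (4,5) and (6,5) are the closest cross-class points.
X = np.array([[1, 5], [2, 4], [4, 5],      # class 0
              [6, 5], [8, 6], [9, 4]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1e3).fit(X, y)   # large C ~ hard margin
print(clf.support_vectors_)                   # expected to include (4,5) and (6,5)
print(clf.coef_, clf.intercept_)              # w and b of the hyperplane w.x + b = 0
print(clf.predict([[5.5, 5]]))                # a point just right of the boundary
```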
Q1. Explain Convolutional Neural Network (CNN) for image classification.
Ans: CNN is a deep learning model used primarily for image-related tasks such as classification, detection, and
segmentation. It mimics the human visual cortex by processing image data in layers. CNN automatically extracts
local and hierarchical features from images using convolution and pooling operations, removing the need for manual
feature engineering. It is especially effective for spatial data and captures relationships like edges, corners, and
textures.
Architecture Components:
1. Input Layer: Accepts image as a 2D or 3D matrix of pixel values (e.g., 28×28×1 for grayscale).
2. Convolution Layer: Applies filters/kernels (e.g., 3×3) to scan the image and detect patterns.
3. ReLU Activation: Applies non-linearity by replacing negative values with zero.
4. Pooling Layer: Downsamples the feature map using Max or Average pooling to reduce size and
computation.
5. Fully Connected Layer (FC): Flattens feature maps and connects to output neurons.
6. Softmax Output: Provides class probabilities.
Example: In handwritten digit recognition (MNIST dataset), CNN identifies stroke patterns through convolutional
layers and classifies the digit using FC + softmax. CNNs generalize well on image datasets and are widely used in
real-time applications like face recognition, object detection, and medical imaging.
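Illustrative code sketch (assuming TensorFlow/Keras is installed): a minimal CNN for MNIST following the layer order described above; the filter counts, dense-layer size, and single training epoch are illustrative choices, not fixed values.

```python
# A minimal Keras sketch of the CNN described above; hyperparameters are illustrative.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0        # shape (60000, 28, 28, 1), scaled to [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),   # class probabilities
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```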
Q2. Explain Multilayer Feedforward Neural Network with Backpropagation.
Ans: A Multilayer Feedforward Backpropagation Neural Network (MLP with backprop) is a neural network where
data flows from input to output in one direction (feedforward), and error signals are propagated backward during
learning (backpropagation). It can approximate any non-linear function and is widely used in classification,
regression, and time-series prediction.
Architecture: an input layer, one or more hidden layers, and an output layer. Every neuron in one layer is connected to every neuron in the next layer, and each connection carries a weight that is adjusted during training.
Example: To implement XOR logic, we use 2 input neurons, 2 hidden neurons, 1 output neuron. Initially, random
weights give wrong output, but after several epochs, backprop adjusts them to correctly compute XOR logic.
MLPs are universal function approximators and are used in digit recognition, stock prediction, and natural language
processing.
Q3. What is Backpropagation? Explain in brief with neat diagram.
Ans: Backpropagation is the core algorithm for training neural networks. It minimizes the prediction error by propagating the error backward from the output layer toward the input layer and adjusting the network weights accordingly. It uses the chain rule of calculus to compute how much each weight contributes to the total error, then updates the weights (w ← w − η ∂E/∂w) to reduce that error iteratively.
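Illustrative code sketch: a from-scratch NumPy implementation of backpropagation on the XOR problem from the previous answer (2 inputs, 2 hidden sigmoid neurons, 1 output). The learning rate, epoch count, and random seed are illustrative; a poor random initialization can occasionally get stuck in a local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)     # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))  # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))  # hidden -> output
lr = 0.5

for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule on the squared error
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Weight updates (gradient descent)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round())   # should approach [[0], [1], [1], [0]]; re-run with another seed if stuck
```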
CNN Architecture for Image Classification (layer by layer):
1. Input Layer: Receives raw pixel data (e.g., 224×224×3 for an RGB image).
2. Convolution Layer(s): Applies filters to detect local features such as edges and textures. Output is a
feature map.
3. Activation Layer (ReLU): Applies non-linearity to maintain complex mapping.
4. Pooling Layer (Max or Avg): Reduces the feature map size and increases computation efficiency.
5. Dropout Layer (optional): Randomly drops some neurons to prevent overfitting.
6. Flatten Layer: Converts 2D feature maps to 1D vector.
7. Fully Connected Layer (Dense): Combines all features to make final classification.
8. Softmax Output: Converts scores into probabilities for multi-class output.
Example: In a dog-vs-cat classifier, lower layers detect edges and corners, middle layers detect ears/tails, and final
layers classify full images. CNNs are efficient, require fewer parameters than fully connected networks, and achieve
high accuracy in image tasks.
Q. Explain Perceptron. How it works? Explain with diagram.
Ans: A Perceptron is the simplest type of artificial neural network and the foundation of deep learning. It is a
binary linear classifier that maps input features to output classes using a linear decision boundary. It was
introduced by Frank Rosenblatt in 1958 and works well for linearly separable problems.
Architecture:
● Inputs (x₁, x₂, ..., xₙ): Feature values
● Weights (w₁, w₂, ..., wₙ): Associated with each input
● Bias (b): Adjusts the output independently of the input
● Summation Unit: Computes the weighted sum z = Σ wᵢxᵢ + b
● Activation (Step) Function: Outputs 1 if z ≥ 0, otherwise 0
Working: For each training example, the perceptron computes the weighted sum z = w·x + b, applies the step activation to produce the output ŷ, and, whenever the prediction is wrong, updates the weights using the perceptron learning rule wᵢ ← wᵢ + η(y − ŷ)xᵢ and b ← b + η(y − ŷ). This repeats over the training set until all examples are classified correctly or a maximum number of epochs is reached, as shown in the sketch below.
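Illustrative code sketch: a perceptron trained on the AND function (a linearly separable toy problem) using the weighted sum, step activation, and learning rule described above; the learning rate and epoch count are illustrative.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])              # AND function (linearly separable)

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b           # summation unit
        y_hat = 1 if z >= 0 else 0      # step activation
        w += lr * (target - y_hat) * xi # perceptron learning rule
        b += lr * (target - y_hat)

print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])   # expect [0, 0, 0, 1]
```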
Limitation: Perceptron fails on non-linearly separable problems like XOR. For that, multi-layer perceptrons (MLPs)
with hidden layers are used.
Conclusion: Perceptron is fast, simple, and effective for linear classification tasks. It laid the foundation for deeper
neural networks.
Diagram:
Q1. Write a note on PAC (Probably Approximately Correct) Learning Model.
Ans: PAC Learning is a theoretical model in machine learning that studies whether a concept can be learned
efficiently from data. Introduced by Leslie Valiant in 1984, PAC learning provides a mathematical framework to
define the feasibility and limits of learning algorithms under uncertainty. The term “Probably Approximately
Correct” refers to the idea that a learner can find a hypothesis that is approximately correct (with low error) and
probably correct (with high confidence).
Let:
● ε (epsilon): the maximum allowed error of the learned hypothesis ("approximately correct")
● δ (delta): the maximum allowed probability of failure, so the guarantee holds with confidence 1 − δ ("probably")
● H: the hypothesis space, and m: the number of training examples
Key Terms: A concept class is PAC-learnable if, for any ε and δ, a learner can output a hypothesis with error at most ε with probability at least 1 − δ, using a number of examples polynomial in 1/ε, 1/δ, and the problem size. For a finite hypothesis space, a sufficient sample size is m ≥ (1/ε)(ln|H| + ln(1/δ)).
This means that as ε decreases (higher accuracy) or δ decreases (higher confidence), more samples are needed.
Example: If you’re learning to identify spam emails using features like sender, subject, etc., PAC learning can tell
you how many emails (examples) you need to see before your spam filter performs well with high confidence.
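Illustrative numeric sketch: plugging made-up values of |H|, ε, and δ into the finite-hypothesis-space sample bound m ≥ (1/ε)(ln|H| + ln(1/δ)) quoted above.

```python
import math

H_size = 10**6     # size of the hypothesis space |H| (assumed for illustration)
eps = 0.05         # allowed error ("approximately correct")
delta = 0.01       # allowed failure probability ("probably")

m = (1.0 / eps) * (math.log(H_size) + math.log(1.0 / delta))
print(math.ceil(m))   # roughly 369 examples for these settings
```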
Importance: PAC model ensures theoretical learning guarantees. It helps determine whether an algorithm can be
trained efficiently on a given dataset and helps prove the learnability of classes like decision trees, neural networks,
and SVMs. It is a foundation for computational learning theory.
Q. What is Ensemble Learning? Explain Bagging and Boosting.
Ans: Ensemble Learning is a powerful machine learning technique where multiple weak or base models are
trained and combined to produce better overall performance than any single model. It is based on the idea that a
group of diverse models, when combined properly, can outperform even strong individual learners.
1) Bagging (Bootstrap Aggregating):
Bagging builds several independent models in parallel using different subsets of the training data (sampled with
replacement). Each model is trained separately, and final output is obtained by majority voting (for classification) or
averaging (for regression).
Steps: 1. Draw several bootstrap samples (random sampling with replacement) from the training set. 2. Train one base model on each sample independently. 3. Combine the predictions by majority voting (classification) or averaging (regression).
Advantages: reduces variance, resists overfitting, and the models can be trained in parallel.
Example: Random Forest is an ensemble of decision trees where each tree is trained on a random bootstrapped
sample and random subset of features.
2) Boosting:
Boosting trains models sequentially, where each new model focuses more on the errors made by the previous
model. Weights are increased for misclassified samples so that the next model pays more attention to them. Final
prediction is made using weighted majority voting.
Steps: 1. Assign equal weights to all training samples. 2. Train a weak learner and evaluate it. 3. Increase the weights of misclassified samples and train the next learner on the re-weighted data. 4. Combine all learners using weighted majority voting.
Advantages: reduces bias, often achieves very high accuracy, and turns weak learners into a strong learner.
Popular Algorithms:
● AdaBoost: Uses decision stumps (1-level trees), adjusts weights after each round.
● Gradient Boosting: Minimizes loss function via gradient descent at each stage.
● XGBoost: Optimized gradient boosting, faster and regularized.
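Illustrative code sketch: bagging, AdaBoost boosting, and Random Forest compared with scikit-learn on a synthetic dataset; the dataset and hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)     # default base model: decision tree
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)   # default base model: decision stump
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting), ("random forest", forest)]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))   # test accuracy of each ensemble
```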
Q. Write a note on Adaptive Hierarchical Clustering.
Ans: Adaptive Hierarchical Clustering is a flexible version of traditional hierarchical clustering where the
linkage method or distance metric can adapt based on data distribution. It combines the strengths of agglomerative
and divisive clustering with dynamic thresholding and stopping criteria.
Key Features: the linkage method and distance metric can change as clusters grow, thresholds for merging or splitting are set dynamically from the data distribution, and stopping criteria determine the final number of clusters automatically.
Applications:
● Biological taxonomy
● Image segmentation
● Multiscale pattern analysis
Q. Explain Gaussian Mixture Model (GMM) clustering.
Ans: GMM is a probabilistic clustering algorithm that models data as a mixture of multiple Gaussian
distributions. Unlike k-means, which assigns points to only one cluster, GMM assigns probability of belonging to
each cluster. It is based on Expectation-Maximization (EM) algorithm.
Model Structure: Each cluster k is represented by a Gaussian (Normal) distribution defined by a mean vector μₖ, a covariance matrix Σₖ, and a mixing weight πₖ (with Σ πₖ = 1). The EM algorithm alternates between the E-step (compute each point's membership probabilities) and the M-step (re-estimate μₖ, Σₖ, πₖ).
Advantages: soft (probabilistic) cluster assignments, clusters can be elliptical and of different sizes, and the model provides a likelihood useful for model selection.
Applications:
● Image segmentation
● Speaker identification
● Market segmentation
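Illustrative code sketch: GMM clustering with scikit-learn on synthetic 2-D blobs, showing both hard and soft (probabilistic) assignments.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, purely illustrative 2-D data with three groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42).fit(X)
print(gmm.means_)                # estimated cluster means (mu_k)
print(gmm.predict(X[:5]))        # hard assignment of the first five points
print(gmm.predict_proba(X[:5]))  # soft membership probabilities per cluster
```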
Q1. Explain K-Means Clustering Algorithm with example and working.
Ans: K-Means is an unsupervised machine learning algorithm used to partition a dataset into K distinct,
non-overlapping clusters. It works by grouping similar data points together and assigning them to the cluster
whose centroid is nearest. The algorithm minimizes the Within-Cluster Sum of Squares (WCSS), also called
inertia. It assumes that the number of clusters (K) is known in advance and all clusters are
spherical in shape and similar in size.
Working of K-Means Algorithm:
Step 1: Choose the number of clusters, K.
Step 2: Randomly initialize K cluster centroids.
Step 3: Assign each data point to the nearest centroid using the Euclidean distance d(x, c) = √(Σᵢ (xᵢ − cᵢ)²).
Step 4: Compute new centroids by taking the mean of all data points in each cluster.
Step 5: Repeat Steps 3–4 until cluster assignments do not change or a maximum number of iterations is reached.
Example: For a small 2-D dataset with K = 2, the algorithm converges to two distinct clusters with clearly separated data points.
Advantages: 1. Easy to implement and understand 2. Efficient for large datasets 3. Works well when clusters are well-separated and spherical
Limitations: 1. Requires K in advance 2. Sensitive to initial centroid placement 3. Fails on non-spherical or overlapping clusters 4. Affected by outliers
Choosing the number of clusters (K):
Elbow Method: 1. Plot the WCSS (Within-Cluster Sum of Squares) vs. the number of clusters K 2. WCSS decreases as K increases
3. Find the “elbow point” where the rate of decrease sharply reduces 4. That K is optimal
5. The graph shows diminishing returns in reducing error beyond the elbow
Silhouette Score:
1. Measures how well-separated and cohesive the clusters are 2. Score ranges from −1 to 1
3. Higher score → better cluster structure 4. Average silhouette score is computed for different values of K
Gap Statistic:
1. Compares the total within intra-cluster variation for different values of K with their expected values under a
reference null distribution 2. Higher gap value indicates better clustering
Domain Knowledge: 1. In some cases, expert understanding of the dataset is used to determine a logical number
of clusters
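Illustrative code sketch: K-Means with scikit-learn on synthetic blobs, printing the WCSS (inertia) for the elbow method and the silhouette score for each K; the data and range of K are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # synthetic data

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          round(km.inertia_, 1),                       # WCSS for the elbow plot
          round(silhouette_score(X, km.labels_), 3))   # higher is better
# The "elbow" in the inertia column and the silhouette peak should both point to K = 4.
```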
Q1. Write the Nearest Neighbor Clustering Algorithm.
Ans: Nearest Neighbor Clustering is a bottom-up
agglomerative hierarchical clustering method where clusters are formed by repeatedly merging the closest
pair of data points or clusters based on distance. It uses single linkage, where the distance between two clusters is
the minimum distance between any two points in the two clusters. Algorithm
Steps: 1. Start with each data point as a separate cluster.
2. Compute pairwise distances between all clusters.
3. Merge the two clusters that are closest (minimum distance).
4. Recalculate distances between the new cluster and the remaining clusters.
5. Repeat steps 3–4 until all points are in one cluster or desired number of clusters is formed.
Distance Metric: Usually Euclidean or Manhattan distance.
Applications: 1. Bioinformatics (gene similarity) 2. Document clustering 3. Anomaly detection
Limitation: Sensitive to noise and outliers, chaining effect due to single linkage.
Q. Explain Agglomerative Hierarchical Clustering with example.
Working: 1. Start: Treat each of the n data points as a separate cluster → n clusters.
2. Compute Distance Matrix: Find the distance between all pairs of clusters using a chosen metric (e.g., Euclidean
distance). 3. Merge Closest Clusters: Identify and merge the pair of clusters that are closest to each other
based on a linkage criterion. 4. Update Distance Matrix: After merging, recompute distances between the new
cluster and all other existing clusters. 5. Repeat: Steps 3–4 until only one cluster remains or the desired number of
clusters (k) is reached.
Linkage Criteria: Single Linkage: Minimum distance between points of two clusters
Complete Linkage: Maximum distance between points of two clusters Average Linkage: Average distance
between all pairs of points in two clusters Centroid Linkage: Distance between centroids of two clusters
Dendrogram: 1. A tree diagram showing how clusters are merged step-by-step. 2. Vertical axis represents
the distance at which clusters are joined. 3. Cutting the dendrogram at a chosen level gives desired number of
clusters.
Example: Given data points: A, B, C, D Step 1: Start with 4 clusters: {A}, {B}, {C}, {D}
Step 2: Merge the two closest points (say, A & B → {AB}) Step 3: Merge next closest pair (say, C & D → {CD})
Step 4: Merge {AB} and {CD} → Final cluster {ABCD} The merging process can be visualized using a dendrogram.
Advantages: 1. No need to specify the number of clusters (K) in advance 2. Produces hierarchical relationships 3. Works well with small datasets and irregular cluster shapes
Applications: 1. Taxonomy and phylogenetic trees (biology) 2. Document and text clustering 3. Image segmentation 4. Customer segmentation
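Illustrative code sketch: single-linkage agglomerative clustering with SciPy on four made-up 2-D points (A–D), mirroring the steps and dendrogram described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1.0, 1.0],   # A
                   [1.5, 1.0],   # B
                   [8.0, 8.0],   # C
                   [8.5, 8.0]])  # D

Z = linkage(points, method='single', metric='euclidean')  # merge history (single linkage)
print(Z)                                       # each row: the two clusters merged + merge distance
print(fcluster(Z, t=2, criterion='maxclust'))  # cut the tree into 2 clusters -> [1 1 2 2]

# dendrogram(Z) draws the merge tree when matplotlib is available.
```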
Q. What is Curse of Dimensionality? What are measures to resolve it?
Ans: The Curse of Dimensionality refers to the set of problems and challenges that arise when working with data
in high-dimensional spaces. As the number of features (dimensions) increases, the data becomes sparse,
distance metrics lose meaning, and model performance often degrades. This curse affects clustering, classification,
regression, and nearest neighbor algorithms.
Measures to resolve the Curse of Dimensionality:
1. Dimensionality Reduction:
● Principal Component Analysis (PCA): Projects data into fewer dimensions while preserving
variance.
● Linear Discriminant Analysis (LDA): Reduces dimensions while preserving class separability.
● t-SNE / UMAP: Non-linear techniques for visualization in 2D/3D.
These help remove redundant and less-informative features.
2. Feature Selection:
● Keep only the most relevant features using filter, wrapper, or embedded methods (e.g., correlation analysis, recursive feature elimination).
3. Regularization Techniques:
● Apply L1 (Lasso) or L2 (Ridge) regularization to penalize large coefficients and prevent overfitting in high
dimensions.
5. Domain Knowledge:
● Use expert insight to select only meaningful features before model training.
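Illustrative code sketch: PCA with scikit-learn reducing the 64-dimensional digits dataset to 10 components; the dataset and component count are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # shape (1797, 64): high-dimensional input

pca = PCA(n_components=10).fit(X)            # keep 10 principal components
X_reduced = pca.transform(X)                 # shape (1797, 10)
print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())   # fraction of variance preserved
```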
Q. What is Logistic Regression? Explain its types with example.
Ans: Logistic Regression is a supervised classification algorithm used to predict the probability of a categorical
dependent variable, especially for binary outcomes (0 or 1). Unlike linear regression which outputs continuous
values, logistic regression uses the sigmoid function to convert the linear combination of inputs into a probability
score between 0 and 1.
The hypothesis is h(x) = σ(wᵀx + b) = 1 / (1 + e^(−(wᵀx + b))). If h(x) > 0.5, predict class 1; else predict class 0. It estimates the probability that the given input belongs to the
positive class.
Example: Suppose we want to predict whether a student passes an exam based on hours studied. Logistic
regression takes the input features (like study time) and outputs the probability of passing. If the probability > 0.5,
model predicts "Pass" (1), else "Fail" (0).
Types of Logistic Regression:
1. Binary Logistic Regression: Predicts two outcomes (e.g., Pass/Fail, Yes/No, Spam/Not Spam).
2. Multinomial Logistic Regression: Used when the dependent variable has more than two unordered
categories (e.g., predicting fruit type: Apple, Banana, Orange).
3. Ordinal Logistic Regression: Used when the dependent variable has ordered categories (e.g., Poor,
Average, Good, Excellent).
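Illustrative code sketch: binary logistic regression with scikit-learn for the pass/fail example above; the study-hours data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # 1 = pass

clf = LogisticRegression().fit(hours, passed)
print(clf.predict_proba([[4.5]]))   # [P(fail), P(pass)] from the sigmoid
print(clf.predict([[4.5]]))         # class 1 if P(pass) > 0.5
```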
Q. Write short notes on: (a) Cross Validation (b) Inductive Bias.
(a) Cross Validation:
Cross Validation is a model evaluation technique used to assess how well a machine learning model generalizes
to unseen data. It helps avoid overfitting and ensures that the model performs well across different subsets of the
dataset. The most common type is k-fold cross-validation, where the data is split into k equal folds; the model is trained on k − 1 folds and tested on the remaining fold, and this is repeated k times so that each fold serves once as the test set. The final score is the average of the k test scores.
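Illustrative code sketch: 5-fold cross-validation with scikit-learn; the iris dataset and the decision tree classifier are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # average estimate of generalization performance
```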
(b) Inductive Bias: Inductive bias is the set of assumptions a learning algorithm uses to generalize from the training data to unseen examples.
Types: 1. Language Bias: Restrictions on hypothesis representation (e.g., only linear models).
2. Preference Bias: Preferences over hypotheses (e.g., simpler is better).
Importance: Inductive bias enables learning to happen, but the wrong bias may lead to poor generalization. It
directly affects algorithm performance and accuracy.
Q. Explain Bayes Theorem and Maximum Likelihood Estimation with example.
Ans: Bayes Theorem is a fundamental concept in probability theory used in Bayesian learning to update the
probability of a hypothesis based on observed evidence. It provides a way to combine prior knowledge with new
data.
Bayes Theorem Formula: P(H|D) = [P(D|H) × P(H)] / P(D)
Where: P(H|D) is the posterior probability of hypothesis H given data D, P(H) is the prior probability of H, P(D|H) is the likelihood of the data under H, and P(D) is the probability of the data (evidence).
It helps in computing the updated probability of a hypothesis after seeing the data.
Maximum Likelihood Estimation (MLE): In Bayesian learning, MLE is used to estimate parameters that
maximize the likelihood P(D|H), i.e., the probability of observing data D under hypothesis H.
It ignores the prior and focuses on finding parameters that best explain the observed data.
Example – Email Spam Classification: Let’s say we want to classify an email as Spam (S) or Not Spam (¬S)
based on the presence of the word “Free”.
We calculate: 1. P(S): Prior probability of spam = 0.4
2. P(¬S): Prior probability of not spam = 0.6
3. P(Free|S) = 0.8, P(Free|¬S) = 0.1
4. Apply Bayes theorem: P(S|Free) = P(Free|S)·P(S) / [P(Free|S)·P(S) + P(Free|¬S)·P(¬S)] = 0.32 / 0.38 ≈ 0.84, so an email containing “Free” is classified as Spam.
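Illustrative code sketch: computing the posterior for the spam example directly from the probabilities given above.

```python
# Posterior for the spam example, using the probabilities stated above:
# P(S) = 0.4, P(Free|S) = 0.8, P(Free|not S) = 0.1.
p_s, p_not_s = 0.4, 0.6
p_free_given_s, p_free_given_not_s = 0.8, 0.1

p_free = p_free_given_s * p_s + p_free_given_not_s * p_not_s   # total probability of "Free"
p_s_given_free = p_free_given_s * p_s / p_free                 # Bayes theorem
print(round(p_s_given_free, 3))   # ~0.842, so the email is classified as spam
```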
Q. What is Unsupervised Learning? Explain with example.
Ans: Unsupervised Learning is a type of machine learning in which the model learns patterns and structure from unlabeled data. Unlike supervised learning, there is no feedback signal to guide the learning process. The system organizes data
based on similarity, density, or statistical features.
Example:
In customer segmentation, K-Means Clustering groups customers based on purchasing patterns without prior
labels. These insights help businesses design personalized marketing strategies and product recommendations.
Q. What is Instance-Based Learning? Explain the importance of feature
reduction while solving problems using machine learning techniques.
Ans:
Instance-Based Learning is a type of lazy learning algorithm where the model stores training instances and
delays generalization until a query is made. Instead of learning a function from the training data, it compares new
inputs directly to stored examples using a similarity (distance) metric, such as Euclidean or Manhattan
distance.
The most common instance-based algorithms are K-Nearest Neighbors (KNN) and Case-Based Reasoning
(CBR). These models predict output by retrieving the most similar instances from memory and combining their
outputs (e.g., by voting or averaging).
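Illustrative code sketch: K-Nearest Neighbors (instance-based learning) with scikit-learn; the iris dataset is used as a convenient labeled example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # "training" only stores the instances (lazy learning)
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))                # prediction = vote of the 3 nearest neighbours
```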
Characteristics:
● Lazy learning: no explicit training phase; generalization is deferred until prediction time
● Easy to implement
● Prediction cost grows with the size of the stored training set
● Sensitive to irrelevant features and to feature scaling
Feature Reduction: Feature reduction (dimensionality reduction) is the process of reducing the number of input features while retaining the information that matters for prediction.
Why it is important:
1. Improves Model Accuracy: Irrelevant features can confuse the model and degrade performance.
2. Reduces Overfitting: Fewer features reduce model complexity, lowering the chance of fitting noise.
3. Speeds Up Computation: Smaller input size means faster training and prediction.
4. Enhances Interpretability: Easier to understand and visualize models with fewer features.
5. Improves Distance Calculations: In instance-based learning, fewer and relevant features make distance
metrics more meaningful, improving prediction quality.
Techniques used: PCA, LDA, t-SNE/UMAP, and feature selection (filter and wrapper methods).
Q. Explain Random Forest algorithm with example.
Ans: Random Forest is an ensemble learning method that builds many decision trees on bootstrapped samples with random feature subsets and combines their outputs.
Working:
1. Bootstrap Sampling:
○ From the original dataset, multiple random samples are drawn with replacement (called bootstrapped datasets).
○ Each sample is used to train a different decision tree.
2. Random Feature Selection:
○ At each split in the decision tree, only a random subset of features is considered, not all features.
○ This adds randomness and reduces correlation between trees.
3. Building Trees:
○ Many trees are grown independently using different bootstrapped samples and random features.
4. Prediction:
○ For classification: the final prediction is made using majority voting from all trees.
○ For regression: the final output is the average of all tree predictions.
Example (Classification):
Suppose we have a dataset of students with features like age, study time, and attendance. The target is whether the
student will pass or fail.
● Random Forest creates multiple decision trees, each trained on different student samples and features.
● Each tree gives its own prediction (pass/fail), and the majority output is chosen as the final prediction.
Advantages:
● High accuracy and robustness to overfitting
● Works well with large datasets and high dimensional data
● Handles missing values and non-linear data effectively
● Reduces variance by averaging multiple models
Disadvantages:
● Slower than single decision tree due to multiple trees
● Less interpretable than a single decision tree
● Requires more memory and computation
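Illustrative code sketch: the Random Forest workflow described above with scikit-learn on a made-up student pass/fail dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [age, study hours per week, attendance %]; target: 1 = pass.
X = np.array([[18, 10, 90], [19, 2, 60], [20, 8, 85], [21, 1, 50],
              [18, 12, 95], [22, 3, 55], [19, 9, 80], [20, 2, 65]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[19, 7, 75]]))     # majority vote across all trees
print(forest.feature_importances_)       # relative contribution of each feature
```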