
Hope Foundation's

International Institute of Information Technology (I2IT), Pune


DEPARTMENT OF COMPUTER ENGINEERING
Academic Year: 2025-26, Semester - I
Class Test-2 Solution
Class: BE CE (A&B)
Subject: Machine Learning Date:

Q.1 A What are different techniques to reduce underfitting and overfitting? Explain each in detail.
Ans: Underfitting occurs when the model is too simple and cannot capture patterns in data.

Techniques to reduce Underfitting:

1. Increase Model Complexity
Use more complex models – e.g., deeper neural networks, more decision tree depth.

2. Increase Training Time / Iterations
Train longer so the model learns patterns properly.

3. Feature Engineering
Add new relevant features or transform features (polynomial features).

4. Reduce Regularization
Lower L1/L2 regularization strength so the model does not become too simple.

Overfitting occurs when the model learns noise or memorizes the training data and performs poorly on test data.

Techniques to reduce Overfitting:

1. Regularization (L1/L2)
Penalizes large weights and simplifies the model.

2. Dropout (in Neural Networks)
Randomly removes neurons during training to prevent reliance on specific paths.

3. Early Stopping
Stop training when validation error starts increasing.

4. Cross-validation
Ensures model generalization.

5. Data Augmentation
Expand the dataset artificially (rotation, scaling, etc. for images).

6. Pruning (Decision Trees)
Reduce complexity by cutting unnecessary branches.
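A minimal sketch (assuming scikit-learn; the synthetic dataset and hyperparameter values are illustrative, not part of the question) showing how model complexity and pruning move a decision tree between underfitting and overfitting:

# Sketch: controlling underfitting/overfitting with tree depth and pruning (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("shallow tree (underfits)", DecisionTreeClassifier(max_depth=1, random_state=0)),
    ("deep tree (overfits)", DecisionTreeClassifier(max_depth=None, random_state=0)),
    ("pruned tree (balanced)", DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)),
]:
    model.fit(X_train, y_train)
    print(name,
          "| train acc:", round(model.score(X_train, y_train), 3),
          "| test acc:", round(model.score(X_test, y_test), 3))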

Q.1 B What is Linear Regression? Difference between Lasso & Ridge Regression.
Ans: Linear Regression:

A supervised learning algorithm that models the relationship between input variables (X) and output variable (Y) by fitting a linear equation:

Y = b0 + b1X1 + b2X2 + … + bnXn

Used for predicting continuous values.

Difference Between Lasso and Ridge:

Feature                | Lasso Regression (L1)              | Ridge Regression (L2)
Penalty                | Sum of absolute weights            | Sum of squared weights
Feature Selection      | Yes, sets some weights to zero     | No feature elimination
Best Use               | When many features are irrelevant  | When all features contribute
Effect on Coefficients | Produces sparse model              | Shrinks coefficients smoothly
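A small illustrative sketch (assuming scikit-learn; the data is synthetic, with only two truly relevant features) showing that Lasso drives some coefficients exactly to zero while Ridge only shrinks them:

# Sketch: Lasso vs Ridge on synthetic data with irrelevant features.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only features 0 and 1 matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant features become exactly 0
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # small but non-zero values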

Q.2 A Explain the following Evaluation Metrics with examples:


i) MAE
ii) RMSE
iii) R2
Ans: Mean Absolute Error (MAE)
MAE represents the average magnitude of errors in a set of predictions, giving equal weight to all errors. It is the most intuitive metric as it is in the same unit as the target variable.

Root Mean Squared Error (RMSE)

RMSE is the square root of the average squared differences between predicted and actual values,
penalizing larger errors more heavily than MAE.

R-squared (R2)
R2, or the coefficient of determination, measures the proportion of the variance in the target variable
that is explained by the model. Unlike MAE and RMSE, it is not an error metric but a measure of
explanatory power or "goodness of fit".
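A brief sketch (assuming scikit-learn; the actual and predicted values are made up) computing the three metrics:

# Sketch: computing MAE, RMSE and R2 for example actual vs. predicted values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = square root of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, R2 = {r2:.3f}")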

Q. 2 B What is the bias-variance trade-off for a machine learning model?
Ans: The bias-variance tradeoff is the relationship between a model's complexity and its error,
where bias is the error from wrong assumptions in the learning algorithm, and variance is the error
from sensitivity to small fluctuations in the training set. A simple model has high bias and low
variance, while a complex model has high variance and low bias; the tradeoff is about finding the
right balance to minimize the total error, which is key to good generalization.
The Tradeoff
 The balance: The tradeoff is the need to balance high bias and low variance with low bias and high
variance to achieve the best possible performance on unseen data.

 Minimizing error: The goal is to find a model complexity that minimizes the total error, which is a
combination of bias, variance, and an irreducible error.

 Finding the sweet spot: A model with the right level of complexity will be able to capture the true
patterns in the data (low bias) without being overly sensitive to the specifics of the training set (low
variance).

 Regularization: Techniques like regularization are used to manage this tradeoff by controlling model
complexity.
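The "total error" referred to above is commonly written as the standard bias-variance decomposition (for squared loss), which makes the tradeoff explicit:

Total Error = Bias² + Variance + Irreducible Error

Increasing model complexity lowers the bias term but raises the variance term, so the minimum of the total error lies at an intermediate level of complexity.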
Q. 3 A Define Ensemble Learning and Differentiate between bagging and boosting.
Ans: Ensemble learning is a machine learning paradigm that combines the predictions of multiple
individual models, often referred to as "weak learners," to achieve a more accurate and robust final
prediction than any single model could achieve alone. The core idea is that by aggregating the
insights of diverse models, the ensemble can reduce errors due to bias, variance, or noise in the data.
Differentiation between Bagging and Boosting:
Bagging (Bootstrap Aggregating) and Boosting are two prominent ensemble learning techniques that
differ in their approach to combining models:
Bagging:
 Parallel Training: Bagging trains multiple base models independently and in parallel.

 Data Sampling: It uses bootstrap sampling, which involves creating multiple subsets of the original
training data by sampling with replacement. Each base model is trained on a different bootstrap
sample.

 Goal: Primarily aims to reduce variance by averaging or voting the predictions of diverse,
independently trained models. This makes bagging robust to overfitting.

 Weighting: Each base model in a bagging ensemble typically has an equal weight in the final
prediction.

 Examples: Random Forest is a well-known example of a bagging algorithm.


Boosting:
 Sequential Training:
Boosting trains base models sequentially, where each subsequent model attempts to correct the errors
made by its predecessors.

 Data Weighting:

It assigns higher weights to data points that were misclassified or had higher errors by previous
models. This forces subsequent models to focus on these difficult examples.

 Goal:

Primarily aims to reduce bias and convert weak learners into a strong learner by iteratively improving
performance on misclassified data.

 Weighting:

Models in a boosting ensemble are typically weighted based on their performance, with better-
performing models having a greater influence on the final prediction.

 Examples:
AdaBoost, Gradient Boosting, and XGBoost are common examples of boosting algorithms.
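A compact sketch (assuming a recent version of scikit-learn; the dataset and parameters are illustrative) contrasting a bagging ensemble with a boosting ensemble:

# Sketch: bagging (parallel, bootstrap samples) vs boosting (sequential, reweighted errors).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: full-depth trees trained independently on bootstrap samples, predictions voted.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, random_state=0)

# Boosting: shallow "weak learners" trained sequentially, each focusing on previous errors.
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

print("Bagging 5-fold CV accuracy :", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting 5-fold CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())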

Q. 3 B. What is K-fold cross-validation? In K-fold cross-validation, comment on the following situations: i) When the value of K is too large ii) When the value of K is too small. How do you decide the value of k in k-fold cross-validation?
Ans: K-fold cross-validation is a robust technique for evaluating the performance of a machine
learning model and ensuring its generalizability to unseen data. It involves partitioning the original
dataset into a number of "folds" (subsets) to systematically train and test the model multiple times.
The process involves the following steps:

 Data Partitioning: The entire dataset is randomly divided into K non-overlapping subsets (folds) of approximately equal size.
 Iterative Training & Testing: The procedure is repeated K times, once per fold. In each iteration:
o One fold is reserved as the validation set (or test set).
o The remaining K−1 folds are combined to form the training set.
 Performance Averaging: After K iterations, each data point has been used in a validation set exactly once. The performance metric (e.g., accuracy, error rate) is calculated for each iteration, and the final reported performance of the model is the average of these K scores.

Impact of the Value of K


The choice of K significantly impacts the trade-offs between computational cost, variance, and bias.
i) When the value of K is too large

 Pros: The size of the training sets is large (close to the original dataset size), which leads to a
lower bias in the performance estimate, as the model is trained on a substantial amount of data
in each fold. The results are also more robust and have a lower variance.
 Cons: The computational cost increases significantly because the model must be trained K times. A common extreme is Leave-One-Out Cross-Validation (LOOCV), where K equals the number of data points.

ii) When the value of K is too small

 Pros: The computational cost is much lower because fewer iterations are required to train the
model.
 Cons: The size of the training sets is small, leading to a higher bias in the performance
estimate, as the model does not get enough data to learn effectively in each fold. This results
in a performance estimate that can be highly variable and less reliable.
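In practice, K = 5 or K = 10 is the usual choice, as it balances bias, variance, and computational cost between the two extremes above. A minimal sketch (assuming scikit-learn; the model and dataset are illustrative) of K-fold evaluation:

# Sketch: 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5 folds
scores = cross_val_score(model, X, y, cv=kfold)          # one accuracy score per fold

print("Per-fold accuracy:", scores.round(3))
print("Average accuracy :", scores.mean().round(3))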

Q. 4 A What is Confusion Matrix? Why it is important?

Ans: A confusion matrix is a table used to evaluate the performance of a classification model, showing how many predictions were correct and incorrect. It compares the actual values of a dataset against the model's predicted values and breaks predictions down into four categories: true positives, true negatives, false positives, and false negatives, providing a more detailed picture than accuracy alone. It is important because it allows for a deeper understanding of a model's specific errors and the calculation of other critical metrics like precision and recall, which is vital for making informed decisions in applications such as medical diagnosis or fraud detection.
 True Positive (TP): The model correctly predicted a positive case.

 True Negative (TN): The model correctly predicted a negative case.

 False Positive (FP): The model incorrectly predicted a positive case (a Type I error). For example,
marking a legitimate email as spam.

 False Negative (FN): The model incorrectly predicted a negative case (a Type II error). For example,
failing to identify a spam email and letting it into the inbox.

Why is it important?
 Provides more than just accuracy: Accuracy can be misleading, especially with imbalanced datasets.
A confusion matrix breaks down errors, giving a clearer picture of how the model performs on
different classes.

 Identifies specific errors: It helps to understand where the model is making mistakes by showing how
many of the positive predictions were incorrect (false positives) and how many of the negative
predictions were incorrect (false negatives).

 Enables calculation of key metrics: It is the foundation for calculating other important metrics, such
as:

o Precision: Of all the instances predicted as positive, how many were actually positive?

o Recall (Sensitivity): Of all the actual positive instances, how many did the model correctly identify?

o F1-Score: A single metric that balances precision and recall.

 Crucial for real-world applications: In critical areas like medical diagnosis, a false negative (failing to
detect a disease) can have severe consequences, while a false positive (incorrectly identifying a
disease) can lead to unnecessary stress and further testing. The matrix helps tune the model to
minimize the most impactful type of error for the specific application.
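A small sketch (assuming scikit-learn; the label vectors are made up) extracting TP/TN/FP/FN and the metrics derived from them:

# Sketch: confusion matrix and the metrics derived from it.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP =", tp, "TN =", tn, "FP =", fp, "FN =", fn)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))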

Q. 4 B. Define following terms with reference to SVM.


i) Separating hyperplane

ii) Margin

Ans: i) Separating Hyperplane


In SVMs, a separating hyperplane is a decision boundary that separates data points belonging to
different classes in a feature space. For linearly separable data, this hyperplane is a straight line in a
2D space, a flat plane in a 3D space, and a hyperplane in higher-dimensional spaces. The goal of an
SVM is to find the optimal separating hyperplane that best distinguishes between the classes.

ii) Margin
The margin in an SVM refers to the maximal width of the slab parallel to the separating hyperplane
that contains no interior data points. It is the distance between the separating hyperplane and the
closest data points of each class, known as support vectors. SVMs aim to maximize this margin
because a larger margin generally indicates better generalization performance and a more robust
classifier. The wider the margin, the greater the confidence in the classification of new, unseen data
points.
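For a linear SVM these two definitions are often written compactly (standard formulation, stated here for clarity): the separating hyperplane is the set of points x satisfying

w · x + b = 0

the support vectors lie on the parallel planes w · x + b = +1 and w · x + b = −1, and the margin (the width of the slab between them) equals 2 / ||w||, which is the quantity the SVM maximizes.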

Q.5(A) Write short note on elbow method. Explain K means clustering with essential steps used
in it.
Ans: Elbow Method
The Elbow Method helps to find the optimal number of clusters (k) in K-Means clustering.
It involves plotting the Within-Cluster Sum of Squares (WCSS) against different values of k.
As k increases, WCSS decreases, but after a certain point, the rate of decrease slows down.
The point where the curve bends like an “elbow” is considered the best value of k.
K-Means Clustering
K-Means is an unsupervised learning algorithm that divides data into k clusters based on similarity.
Essential Steps:
1. Choose the number of clusters (k).
2. Initialize k centroids randomly.
3. Assign each data point to the nearest centroid.
4. Recalculate centroids as the mean of all points in each cluster.
5. Repeat the assignment and update steps until centroids no longer change.
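A short sketch (assuming scikit-learn and matplotlib; the blob data is synthetic) of the elbow method, using the inertia_ attribute of KMeans as the WCSS:

# Sketch: elbow method - plot WCSS (inertia) against k and look for the "bend".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 10), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()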

Q.5(B) Cluster the following nine points (with (x,y) representing locations) into three clusters:
P1(1,3), P2(2,2), P3(5,8), P4(8,5), P5(3,9), P6(10,7), P7(3,3), P8(9,4), P9(3,7). Use the K-Means
Algorithm to find the three clusters.
Ans:
Step 1: Choose initial centroids
We need 3 centroids (can be randomly selected). Let's pick:
C1 = P1(1,3)
C2 = P3(5,8)
C3 = P4(8,5)
Step 2: Assign points to the nearest centroid
We calculate Euclidean distance and assign each point to the closest centroid.
Distances (rounded for simplicity):
Point    | Distance to C1 | Distance to C2 | Distance to C3 | Assigned Cluster
P1(1,3)  | 0              | 6.40           | 7.28           | C1
P2(2,2)  | 1.41           | 6.71           | 6.71           | C1
P3(5,8)  | 6.40           | 0              | 4.24           | C2
P4(8,5)  | 7.28           | 4.24           | 0              | C3
P5(3,9)  | 6.32           | 2.24           | 6.40           | C2
P6(10,7) | 9.85           | 5.10           | 2.83           | C3
P7(3,3)  | 2.00           | 5.39           | 5.39           | C1
P8(9,4)  | 8.06           | 5.66           | 1.41           | C3
P9(3,7)  | 4.47           | 2.24           | 5.39           | C2

Cluster Assignment after Step 2:
Cluster 1 (C1): P1, P2, P7
Cluster 2 (C2): P3, P5, P9
Cluster 3 (C3): P4, P6, P8
Step 3: Update centroids
New centroid = mean of points in each cluster
C1_new: P1(1,3), P2(2,2), P7(3,3)
x = (1+2+3)/3 = 2
y = (3+2+3)/3 = 2.67 C1_new(2, 2.67)
C2_new: P3(5,8), P5(3,9), P9(3,7)
x = (5+3+3)/3 = 3.67
y = (8+9+7)/3 = 8 C2_new(3.67, 8)
C3_new: P4(8,5), P6(10,7), P8(9,4)
x = (8+10+9)/3 = 9
y = (5+7+4)/3 = 5.33 C3_new(9, 5.33)
Step 4: Reassign points to nearest centroid
Calculate distances to new centroids.
After recalculation, the cluster assignments remain the same, so the algorithm converges.
Final Clusters
Cluster 1 (C1): P1(1,3), P2(2,2), P7(3,3)
Cluster 2 (C2): P3(5,8), P5(3,9), P9(3,7)
Cluster 3 (C3): P4(8,5), P6(10,7), P8(9,4)
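The hand computation above can be double-checked with a short script (assuming scikit-learn and NumPy) by seeding KMeans with the same three initial centroids:

# Sketch: verifying the worked example - same points, same initial centroids P1, P3, P4.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 3], [2, 2], [5, 8], [8, 5], [3, 9],
                   [10, 7], [3, 3], [9, 4], [3, 7]])      # P1 .. P9
init_centroids = np.array([[1, 3], [5, 8], [8, 5]])       # C1 = P1, C2 = P3, C3 = P4

km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(points)
print("Labels         :", km.labels_)             # cluster index of each point
print("Final centroids:\n", km.cluster_centers_)  # ≈ (2, 2.67), (3.67, 8), (9, 5.33)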

Q.6(A). Why are K-medoids used? Explain Hierarchical and Density based Clustering with
examples.
Ans: K-Medoids is a clustering algorithm similar to K-Means but uses actual data points as cluster
centers (medoids) instead of the mean of points.
Advantages over K-Means:
Robust to outliers and noise – extreme values do not distort medoids.
Works with any distance metric – not limited to Euclidean distance.
Example: In customer segmentation, if a few customers have extreme purchase values, K-Medoids
avoids skewing cluster centers.
Hierarchical Clustering
Builds a tree-like structure (dendrogram) representing nested clusters.
Two types:
Agglomerative (Bottom-Up): Start with each point as a cluster and merge closest clusters step by
step.
Divisive (Top-Down): Start with all points in one cluster and recursively split them.
Distance measures: Euclidean, Manhattan, Cosine, etc.
Example: Organizing species into a taxonomy tree based on genetic similarity.
Key Steps (Agglomerative):
1. Compute distance between all points.
2. Merge the two closest points/clusters.
3. Update distance matrix.
4. Repeat until all points form a single cluster.

Density-Based Clustering (DBSCAN)


Forms clusters based on density of data points rather than distance to a centroid.
Key Concepts:
Eps: Radius to consider neighbors.
MinPts: Minimum points to form a dense region.
Core point:
Has MinPts neighbors within Eps.
Border point: Within Eps of a core point but has < MinPts neighbors.
Noise point: Does not belong to any cluster.
Advantages: Can find arbitrarily shaped clusters and handle outliers.
Example: Detecting groups of stars in astronomical data where clusters are irregularly shaped.
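A brief sketch (assuming scikit-learn; the two-moons data is illustrative) comparing agglomerative (hierarchical) clustering and DBSCAN on non-spherical clusters:

# Sketch: hierarchical clustering vs DBSCAN on crescent-shaped data.
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts

print("Agglomerative labels:", set(agg.labels_))
print("DBSCAN labels       :", set(db.labels_))  # label -1 marks noise points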

Q.6(B). Write short note on: (i) LOF (ii) Isolation Forest (iii) Extrinsic methods (iv) Intrinsic methods
Ans.
(i) LOF (Local Outlier Factor)
A density-based method to detect outliers. Compares the local density of a point with that of its neighbors.
Points with much lower density than their neighbors are flagged as outliers.
Example: Detecting fraudulent transactions in banking.

(ii) Isolation Forest


An ensemble-based method for anomaly detection.

Isolates anomalies by randomly partitioning data; outliers require fewer splits.

Example: Detecting network intrusions or rare defects in manufacturing.

(iii) Extrinsic Methods


Evaluate clustering or model quality using external knowledge or labels.

Example: Using ground truth labels to measure clustering accuracy.

Metrics: Purity, F-measure, Adjusted Rand Index.

(iv) Intrinsic Methods


Evaluate clustering or model quality without external labels.
Based on internal measures like compactness and separation.
Example: Silhouette Score, Davies-Bouldin Index.
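A minimal sketch (assuming scikit-learn; the data is synthetic with a few injected outliers) of the two outlier-detection methods:

# Sketch: LOF and Isolation Forest both mark detected outliers with the label -1.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # normal points
               [[8, 8], [9, -7], [-10, 6]]])      # obvious outliers

lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
iso_labels = IsolationForest(random_state=0).fit_predict(X)

print("LOF outlier indices             :", np.where(lof_labels == -1)[0])
print("Isolation Forest outlier indices:", np.where(iso_labels == -1)[0])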

Q.7(A) Describe multi-layer neural network. Explain artificial neural network based on perceptron concept with diagram.
Ans: Multi-Layer Neural Network
A fully connected multi-layered neural network is known as a Multi-Layer Perceptron (MLP). A multi-layered neural network consists of multiple layers of artificial neurons or nodes: an input layer, one or more hidden layers, and an output layer, with each neuron in one layer connected to the neurons in the next. Unlike single-layer neural networks, most networks in recent times are multi-layered.
Artificial Neural Network (ANN) – Perceptron Concept
Perceptron: The basic building block of an ANN, introduced by Frank Rosenblatt.

Purpose: Used for binary classification, deciding if an input belongs to a class or not.

Components of a Perceptron
Inputs (x1, x2, …, xn) – Features of the data.

Weights (w1, w2, …, wn) – Each input has an associated weight representing its importance.

Bias (b) – Helps shift the activation function to better fit the data.

Summation Function: Computes the weighted sum of inputs plus bias:
z = w1·x1 + w2·x2 + … + wn·xn + b

Activation Function (f): Determines the output of the perceptron. Common examples: Step function, Sigmoid, ReLU.
y = f(z)

Output (y): Produces the final decision (0 or 1 in a basic perceptron).

Diagram of a Perceptron
x1 ---- w1 \
\
x2 ---- w2 ----> Σ --> f(z) --> Output (y)
/
x3 ---- w3 /
Bias (b)

Working
Multiply each input by its corresponding weight.
Sum all weighted inputs and add the bias.
Apply the activation function to produce the output.
Update weights during training using a learning rule (e.g., perceptron learning rule) until the network
correctly classifies the data.
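A small NumPy sketch (illustrative; trained on the AND truth table) of the working steps listed above:

# Sketch: a single perceptron trained with the perceptron learning rule on the AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                # AND truth table

w = np.zeros(2)                           # weights w1, w2
b = 0.0                                   # bias
lr = 0.1                                  # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b             # weighted sum plus bias
        y_hat = 1 if z >= 0 else 0        # step activation function
        w += lr * (target - y_hat) * xi   # perceptron learning rule
        b += lr * (target - y_hat)

print("Weights:", w, "Bias:", b)
print("Predictions:", [1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])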

Q.7(B) What are different types of padding used in CNN? Compare backpropagation network with feedforward network.
Ans:
1. Types of Padding in CNN
Padding is used to control the spatial size of the output feature maps in Convolutional Neural
Networks (CNNs). The main types are:
Valid Padding (No Padding)
No extra pixels are added.
Output size shrinks after convolution.

Formula: Output Size = (N − F)/S + 1, where N = input size, F = filter size, S = stride

Use: When reduction in size is acceptable.


Same Padding (Zero Padding)
Add zeros around the input so that output size is same as input.
Formula: Output Size = N/S (equal to the input size N when stride S = 1)

Use: Preserve spatial dimensions, useful in deep networks.


Full Padding
○ Add enough zeros so that the filter passes all positions including edges.

○ Output size is larger than input.

○ Rarely used, mainly in specific applications.
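A tiny helper (illustrative plain Python, not tied to any particular framework) that applies these formulas per spatial dimension for input size N, filter size F and stride S:

# Sketch: output feature-map size for the three padding modes.
import math

def conv_output_size(n, f, s, padding="valid"):
    # n = input size, f = filter size, s = stride
    if padding == "valid":   # no padding: output shrinks
        return (n - f) // s + 1
    if padding == "same":    # zero padding: output is about n / s (equals n when s = 1)
        return math.ceil(n / s)
    if padding == "full":    # pad by f - 1 so the filter covers every overlap, including edges
        return (n - f + 2 * (f - 1)) // s + 1
    raise ValueError("unknown padding mode")

print(conv_output_size(32, 3, 1, "valid"))  # 30
print(conv_output_size(32, 3, 1, "same"))   # 32
print(conv_output_size(32, 3, 1, "full"))   # 34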

2. Comparison: Backpropagation Network vs Feedforward Network

Feature           | Feedforward Network (FFN)                                                           | Backpropagation Network (BPN)
Structure         | Neurons arranged in layers (input → hidden → output) with forward connections only  | Same structure as FFN, but trained using the backpropagation algorithm
Information Flow  | Forward only                                                                         | Forward pass + backward pass (error propagation)
Learning          | Can use random weights; no automatic weight adjustment                               | Weights are adjusted using the error gradient to minimize loss
Purpose           | Computes outputs for given inputs                                                    | Learns from data to reduce error and improve predictions
Training          | May require other methods                                                            | Supervised learning using gradient descent
Complexity        | Simple                                                                               | More computationally intensive due to the backward pass

Q.8(A)What is Functional Link Artificial Neural Network (FLANN)? Explain its merits over
other ANNs.
Ans:
FLANN is a type of single-layer neural network that expands the input space using functional
transformations (like polynomial, trigonometric, or Fourier functions) before feeding it to the
network.
Unlike traditional multi-layer ANNs, FLANN does not use hidden layers, but can still model non-
linear relationships by increasing the dimensionality of the input.

Merits of FLANN over Other ANNs

1. Simplicity:

○ Single-layer network, easier to implement and train than multi-layer ANNs.

2. Faster Training:

○ Requires less computational time because there are no hidden layers.


3. Non-linear Mapping:

○ Functional expansion allows the network to capture non-linear relationships


effectively.

4. Reduced Risk of Local Minima:

○ Fewer parameters and simpler structure reduce the likelihood of getting stuck in local
minima.

5. Good for Real-Time Applications:

○ Faster computation makes it suitable for real-time prediction tasks.
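A minimal sketch (illustrative; NumPy only) of the FLANN idea: a trigonometric functional expansion of the input followed by a single linear layer, here fitted with least squares:

# Sketch: FLANN-style functional expansion - no hidden layer, only an expanded input space.
import numpy as np

def functional_expansion(x):
    # Expand each scalar input x into [x, sin(pi x), cos(pi x), sin(2 pi x), cos(2 pi x)].
    return np.column_stack([x, np.sin(np.pi * x), np.cos(np.pi * x),
                            np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=200)   # non-linear target function

Phi = functional_expansion(x)                        # expanded (higher-dimensional) inputs
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # single layer of weights, no hidden layer

y_pred = Phi @ w
print("Training MSE:", np.mean((y - y_pred) ** 2))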

Q.8(B) What are the different activation functions used in NN? Compare Recurrent Neural Networks (RNN) with Convolutional Neural Networks (CNN).

Ans: 1. Activation Functions in Neural Networks

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.
Common types include:

Activation Function          | Formula                                  | Range           | Usage
Sigmoid                      | f(x) = 1 / (1 + e^(−x))                  | (0, 1)          | Binary classification, output layer
Tanh                         | f(x) = (e^x − e^(−x)) / (e^x + e^(−x))   | (−1, 1)         | Hidden layers, zero-centered outputs
ReLU (Rectified Linear Unit) | f(x) = max(0, x)                         | [0, ∞)          | Most hidden layers, reduces vanishing gradient problem
Leaky ReLU                   | f(x) = max(0.01x, x)                     | (−∞, ∞)         | Avoids dying ReLU problem
Softmax                      | f(x_i) = e^(x_i) / Σ_j e^(x_j)           | (0, 1), sum = 1 | Multi-class classification, output layer
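A compact NumPy sketch (illustrative) implementing the activation functions from the table above:

# Sketch: common activation functions implemented with NumPy.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print("sigmoid   :", sigmoid(z).round(3))
print("tanh      :", np.tanh(z).round(3))
print("ReLU      :", relu(z).round(3))
print("leaky ReLU:", leaky_relu(z).round(3))
print("softmax   :", softmax(z).round(3), "sum =", softmax(z).sum().round(3))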

2. Comparison: RNN vs CNN

Feature              | Recurrent Neural Network (RNN)                            | Convolutional Neural Network (CNN)
Data Type            | Sequential data (time series, text, speech)               | Grid-like data (images, videos)
Architecture         | Loops in the network to remember previous states          | Layers of convolutions and pooling, no loops
Memory               | Maintains internal state (short-term memory)              | No internal memory, processes local features
Purpose              | Captures temporal dependencies                            | Captures spatial features (edges, textures)
Training Complexity  | Harder, suffers from vanishing/exploding gradients        | Easier with standard backpropagation
Example Applications | Language modeling, speech recognition, stock prediction   | Image classification, object detection, video analysis
