
1baia103 Module3 Notes

Module 3 on Machine Learning provides an overview of its significance in modern systems, emphasizing the need for data-driven approaches given the complexity and volume of information generated today. It outlines the fundamental components of a machine learning system (data, features, and models) and the machine learning pipeline, which spans problem definition, data collection, cleaning, and model evaluation. The module also categorizes machine learning algorithms into supervised, unsupervised, and reinforcement learning, highlighting their distinct methodologies and applications.


MODULE 3: MACHINE LEARNING

IMPORTANT NOTE TO STUDENTS


These notes are strictly prepared in alignment with the textbook “Artificial Intelligence: Beyond Classical
Approaches” by Reema Thareja (Oxford University Press) and the prescribed syllabus. You must first
refer to the textbook for every topic. If you still find any concept difficult to understand, then use these notes as
your supporting reference. If the topic remains unclear even after that, please come to me directly for further
explanation.

3.1 Introduction to Machine Learning


Machine Learning is a subfield of Artificial Intelligence concerned with enabling
computational systems to improve their performance on a task through experience.
Traditional programming depends on a human explicitly specifying every rule and decision in
the form of code. The computer simply executes these rules mechanically. This approach is
effective for problems where the rules are well understood and can be written precisely.
However, many modern problems—such as classifying complex images, detecting subtle
credit card fraud, predicting market trends, or understanding natural language—do not lend
themselves easily to fixed, manually crafted rules. The pattern structures in such problems are
often too complicated, too high-dimensional, or too dynamic for humans to encode manually.

Machine Learning addresses this limitation by allowing computers to infer patterns from
data. Instead of instructing the computer with a rigid list of rules, we provide it with a large
collection of examples. These examples represent past observations or historical cases that
capture the underlying structure of the problem. The learning system analyses these
examples, identifies relationships between inputs and outputs, and produces a mathematical
model. This model can then generalize those relationships to new, unseen situations. For
example, a system trained on historical student performance data can predict how a new
student may perform based on similar attributes such as attendance, internal marks, and past
academic habits.

One important conceptual point is that Machine Learning is not merely curve-fitting or
memorization. A naive system that memorizes all training examples would fail completely on
new input. True learning involves performance improvement, generalization capability, and
adaptation to new patterns. The central objective of Machine Learning is to build models that
not only fit observed data but also perform accurately on unseen data. In academic terms, this
ability is known as generalization. A system that performs well on training data but poorly on
new test data is experiencing overfitting, a key phenomenon studied in ML theory.

Another key dimension of ML is that learning occurs through optimization. Algorithms adjust their internal parameters to minimize a loss function, which quantifies the discrepancy
between predicted output and true output. This optimization process is analogous to how a
student refines understanding: after repeated practice, mistakes are identified and corrected,
leading to stronger performance. Thus, learning in machines mirrors learning in humans,
although implemented through mathematical computations.
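The optimization idea above can be made concrete with a tiny illustrative script (not from the textbook): a single parameter w is fitted by gradient descent to minimize mean squared error on made-up data.

```python
# Toy gradient descent: learn w so that y ≈ w * x, minimizing mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]    # roughly y = 2x, with some noise

w = 0.0                      # initial parameter guess
lr = 0.01                    # learning rate

for step in range(1000):
    # Gradient of MSE = (1/n) * Σ (w*x - y)^2 with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # move against the gradient, reducing the loss

print(round(w, 2))           # converges near 1.99, the least-squares solution
```

Each pass through the loop is one "practice attempt": the error signal (the gradient) tells the model which direction to adjust its parameter, mirroring the student analogy in the text.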

Prepared by Ritam Rajak | Dept. of CSE - AIML


In the modern era, the reason Machine Learning has become essential is that data has grown
exponentially. The availability of large-scale data from digital transactions, social media
interactions, sensors, healthcare systems, and industrial machines provides a rich foundation
for learning complex behavior. As the scale of data grows beyond human comprehension,
ML becomes indispensable for extracting knowledge, making predictions, enhancing
decision-making, and automating sophisticated tasks.

3.2 Need for Machine Learning in Modern Systems

Machine Learning has become vital due to the rapidly increasing volume, velocity, and
variety of data produced by digital systems. The sheer scale of data generated every second
makes it impossible for humans to manually analyse or interpret patterns. For instance,
financial markets produce millions of transaction records in microseconds; health monitoring
devices generate continuous streams of patient data; e-commerce platforms observe browsing
behaviour, purchase patterns, and user preferences on a massive scale. A rule-based approach
cannot cope with such complexity and dynamism, as it requires explicit rules that quickly
become outdated or inadequate.

Another reason ML has become indispensable is that many real-world problems are
inherently ambiguous and probabilistic rather than deterministic. Consider medical diagnosis:
the symptoms of many diseases overlap, and individual patients react differently to
conditions. A rigid rule such as "if fever > 101°F and rash present, then diagnosis is X" is
insufficient. ML models can instead learn nuanced relationships from thousands of patient
cases, identifying subtle patterns that human physicians may miss due to cognitive
limitations. Such systems assist doctors by offering data-driven insights rather than replacing
their expertise.

Additionally, Machine Learning enables automation of decision-making in scenarios where rapid response is critical. Fraud detection is a key example. Banks must identify fraudulent
activity within seconds, before transactions are approved. ML systems continuously monitor
real-time transaction streams, comparing them with millions of past cases to detect suspicious
behaviour. Rule-based systems would fail because fraudsters constantly change their
strategies. Only adaptive ML models, capable of learning new fraud patterns, can keep pace
with evolving criminal methods.

Personalized systems also rely heavily on ML. Streaming services such as Netflix or Spotify
do not use fixed rules to recommend movies or songs. Instead, they analyse millions of user
interactions—watch time, skipped items, viewing history, time of day, and genre
preferences—to build recommendation models. The recommendation improves the longer the
user interacts with the system because the model continues learning from ongoing behaviour.

In summary, ML is necessary because the world no longer operates in fixed, rule-driven environments. Instead, it is dynamic, data-intensive, and filled with patterns too subtle for
human rule-making. ML provides the analytical and predictive capability needed to operate
effectively in this landscape.



3.3 Techniques in Artificial Intelligence and Relationship to ML

Although Machine Learning has become the dominant paradigm in AI applications, it is
essential to understand that ML is only one of several techniques used within the broader
field of Artificial Intelligence. A foundational understanding of other AI techniques allows
students to appreciate where ML fits conceptually and why it has become central in recent
decades.

The first major technique is classical rule-based AI, particularly expert systems. In expert
systems, domain experts manually encode their knowledge into a structured form, typically as
production rules. For example, a medical expert might provide hundreds of If-Then rules
representing diagnostic pathways. Early AI systems like MYCIN operated in this manner.
These systems were valuable in domains where knowledge was well understood and could be
articulated explicitly. However, they lacked adaptability, struggled with uncertain or
incomplete information, and required constant human updates.

Machine Learning, in contrast, automates knowledge acquisition by learning directly from examples rather than rules. Instead of asking experts to write diagnostic conditions manually,
ML systems learn the conditions from patient data. In many domains, data-driven learning
has proven far more effective than manually designed logic because real-world situations
often contain hidden interactions that experts themselves may not articulate.

Another significant AI technique is Natural Language Processing (NLP). Historically, NLP relied heavily on linguistic rules and grammar structures crafted by human linguists.
However, modern NLP is deeply intertwined with Machine Learning, particularly through
models that learn semantic and syntactic patterns from massive text corpora. Today,
advanced NLP systems use deep learning architectures such as transformers, which learn
representations of language far beyond what rule-based systems could achieve.

Computer vision is another major branch of AI where ML plays an integral role. Classical
image processing used edge detectors, filters, and manually designed features. Modern vision
systems rely largely on ML—specifically deep neural networks—to learn features
automatically. Tasks such as object detection, facial recognition, scene understanding, and
gesture detection are now dominated by ML techniques.

Robotics integrates ML into perception, planning, and control. A robot navigating a warehouse, for example, uses ML to recognise obstacles, classify objects, predict human
movement, and optimise paths. Earlier generations of robots followed pre-defined motion
paths, but modern robots adapt behaviour dynamically through reinforcement learning and
real-time data analysis.

Other AI methods include fuzzy logic, which handles uncertainty by allowing partial truth
values, and evolutionary algorithms, which use biological principles of mutation and
selection for optimization tasks. These approaches remain useful in specific problem settings
but have been overshadowed by the superior adaptability and accuracy of ML systems.



In conclusion, Machine Learning today serves as the central engine powering many modern
AI categories. While classical AI techniques still play roles in niche areas, the rise of big data
and computational power has made ML the primary method for achieving intelligence in
machines.

3.4 Fundamental Components of a Machine Learning System

Any machine learning system, regardless of its algorithm or application, consists of several
fundamental components that work together to achieve learning. Understanding these
components helps students grasp the architecture of ML systems at a conceptual and practical
level.

Data is the foundation of any ML system. It may be numerical, textual, visual, audio-based,
or any combination thereof. The quality, quantity, and relevance of data determine the upper
limit of model performance. A sophisticated algorithm cannot compensate for inadequate or
noisy data. For example, a system predicting hospital readmission rates requires accurate,
complete patient records; missing or incorrect clinical values can severely degrade predictive
quality.

Features are measurable properties extracted from raw data. In a student performance
prediction system, features may include attendance percentage, assignment scores, internal
exam marks, hours of study, or participation level. Raw data often must be transformed into
features that capture meaningful information. This process may involve scaling,
normalization, dimensionality reduction, encoding of categorical variables, or extraction of
higher-level attributes.
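As a small illustration of such a transformation, the sketch below rescales a hypothetical feature to the [0, 1] range (min-max normalization); the values are invented.

```python
# Min-max scaling: rescale a feature to [0, 1] so that attributes measured on
# different scales (e.g. marks out of 25 vs. attendance %) contribute comparably.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

internal_marks = [12, 18, 25, 9, 22]   # made-up marks out of 25

print(min_max_scale(internal_marks))   # [0.1875, 0.5625, 1.0, 0.0, 0.8125]
```

After scaling, the minimum value maps to 0, the maximum to 1, and everything else falls proportionally in between.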

The target or label represents the value the system aims to predict. In supervised learning
tasks, each training example pairs input features with a target output. In the student example,
the target might be the final exam result or a pass/fail indicator.

The model is the mathematical structure that learns the mapping from inputs to outputs. In
regression tasks, the model may take the form of a linear equation. In classification tasks, the
model may consist of decision boundaries. Neural networks represent models as layered
transformations. Each model type has strengths and limitations, and choosing the appropriate
model involves understanding both the data characteristics and the problem domain.

Training is the process during which the model analyses data, adjusts its internal parameters,
and minimizes prediction errors. Training algorithms, whether based on gradient descent or
probabilistic estimation, aim to achieve optimal performance on training data while avoiding
overfitting.

Testing evaluates the model on unseen data to determine whether it generalises well beyond
training examples. Good performance on training data but poor performance on test data
indicates overfitting, while the reverse suggests underfitting.

Prediction is the final phase where the trained model is used to compute outputs for new
inputs. An ML system deployed in a bank or hospital typically operates continuously, receiving new inputs and producing predictions in real time. Ensuring stable, robust
predictions is essential for reliability in real-world environments.

3.5 The Machine Learning Pipeline


A professional-quality ML solution follows a carefully structured pipeline, which ensures the
model is reliable, robust, accurate, and suitable for deployment. Students often mistakenly
believe that ML begins with model training, but in fact, most effort lies in earlier stages.

The first step in the pipeline is problem definition. A poorly defined problem leads to
incorrect modelling choices and misleading outcomes. For example, predicting whether a
customer will buy a product is different from predicting how much they will spend. One is a
classification problem, the other is a regression problem. The clarity of problem definition
determines the nature of the dataset needed and the appropriate ML algorithms to use.

The next step is data collection. This may involve gathering transactional data, sensor
readings, medical measurements, survey responses, log files, or external datasets. The
characteristics of the data—such as its volume, variety, and veracity—greatly affect the
choice of techniques and the feasibility of learning.

Data must then be cleaned. Real-world datasets are rarely perfect; they contain missing
values, incorrect entries, duplicate records, and inconsistencies. Cleaning ensures the dataset
is reliable. For instance, a missing blood pressure value in a medical dataset may require
imputation, whereas a negative blood pressure value is an error that must be removed.
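A minimal sketch of such a cleaning step, using invented blood-pressure readings: the impossible negative value is treated like a missing entry, and missing entries are imputed with the mean of the valid ones.

```python
import statistics

# Illustrative cleaning pass on made-up systolic BP readings.
# None stands for a missing entry; -5 is a physically impossible error.
readings = [120, None, 118, -5, 131, None, 125]

valid = [r for r in readings if r is not None and r > 0]   # keep only plausible values
mean = statistics.mean(valid)                              # imputation value
cleaned = [r if (r is not None and r > 0) else round(mean, 1) for r in readings]

print(cleaned)   # [120, 123.5, 118, 123.5, 131, 123.5, 125]
```

Mean imputation is only one of several strategies; median imputation or model-based imputation may be preferable when the data is skewed.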

Feature selection is the fourth step. Not all variables contribute equally to prediction.
Irrelevant or redundant features can confuse the model and reduce accuracy. Techniques such
as correlation analysis, variance filtering, and domain knowledge help identify essential
features.

Data splitting follows, where the dataset is divided into training and testing portions. The
most common split is 80 percent for training and 20 percent for testing. Some systems also
use validation sets to tune hyperparameters.
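The split described above can be sketched as follows; the 80/20 cut and the fixed random seed are illustrative choices.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the dataset into training and testing portions."""
    rows = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)    # fixed seed makes the split reproducible
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))                  # stand-in for 100 labelled examples
train, test = train_test_split(data)
print(len(train), len(test))             # 80 20
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), a straight cut would give training and test sets with different distributions.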

Model training involves choosing an appropriate ML technique, feeding training data into the
model, and allowing the algorithm to learn patterns. This process may require considerable
computational resources for large datasets or complex models such as deep neural networks.

Model evaluation is the next stage. Based on predefined metrics, the performance of the
model is assessed. Classification problems use accuracy, precision, recall, F1-score, and ROC
curves, while regression models use mean absolute error, mean squared error, and R-squared
values.
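For the classification metrics named above, a quick illustration with made-up confusion-matrix counts:

```python
# TP/FP/FN/TN counts are invented for illustration.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)             # of predicted positives, how many were right
recall    = tp / (tp + fn)             # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.85 0.8 0.889 0.842
```

Note how precision and recall answer different questions; the F1-score balances the two, which matters when the classes are imbalanced and raw accuracy is misleading.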

Deployment involves integrating the trained model into a production environment such as a
hospital database, banking system, web application, or embedded device.

Finally, continuous monitoring ensures the model remains accurate over time. Real-world
data drifts, patterns change, and outdated models must be retrained regularly.



3.6 Types of Machine Learning Algorithms
Machine learning systems can broadly be categorized into three principal paradigms:
supervised learning, unsupervised learning, and reinforcement learning. These paradigms are
not arbitrary labels; rather, they reflect fundamentally different approaches by which a
machine acquires knowledge from experience. The classification is based on the nature of the
data available to the algorithm and the kind of interaction the learning system has with its
environment. Understanding these distinctions is essential because the selection of a learning
paradigm determines the structure of the dataset required, the form of the model to be
learned, the evaluation mechanism, and ultimately, the applicability of the algorithm to real-
world problems. Each paradigm represents a different philosophy of learning, analogous to
how humans learn in different contexts: learning through instruction, learning through
exploration, and learning through consequences.

3.6.1 Supervised Learning


Supervised learning is the most widely used paradigm in machine learning and forms the
backbone of most predictive analytics systems. The term “supervised” refers to the presence
of a guiding signal: for every input example in the training dataset, there is a corresponding
output value or label that acts as the teacher. The role of the algorithm is to learn a functional
mapping from inputs to outputs by studying many such input–output pairs. This form of
learning resembles a human learning environment in which a student practices problems with
known answers, verifying and adjusting their understanding based on feedback. Once
proficiency is developed, the student applies that internalized understanding to new problems
without guidance.

To appreciate the mechanism of supervised learning, consider a school scenario where we attempt to predict a student's final examination performance. The historical dataset includes
attendance, assignment performance, internal test scores, behavioural observations, and the
final examination results. Each row in such a dataset contains input attributes and the
corresponding output label. The supervised learning model examines these examples and
constructs an internal representation—a mathematical function—that captures the
relationship between the inputs and the final result. When presented with a new student’s
data, the learned function can predict the likely outcome, even though the model has not
encountered that student before.

The essence of supervised learning lies in generalization: the model must not simply
memorize the training data but must infer a general rule that applies to unseen cases. The
quality of generalization determines the practicality of the model. Training involves
iteratively adjusting model parameters so that the predictions gradually approach the correct
answers. Testing evaluates whether the learned pattern holds for future inputs. Each
algorithm, whether linear regression, decision trees, random forests, neural networks, or
support vector machines, implements supervised learning in a different mathematical form.

Supervised learning encompasses two major categories of tasks: classification and regression.
Classification deals with categorical outputs such as spam/not spam, diabetic/non-diabetic,
pass/fail, or selecting between multiple classes such as types of fruits or categories of news
articles. Regression, on the other hand, predicts numerical values such as house prices, sales figures, temperature forecasts, or load predictions. Although the problem types differ, the
underlying principle remains the same: learning from labelled examples to predict future
outcomes.
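A minimal supervised classifier in this spirit is 1-nearest-neighbour: the sketch below predicts pass/fail from invented (attendance %, internal marks) pairs by copying the label of the closest training example.

```python
import math

# Labelled training examples: (attendance %, internal marks) -> "pass"/"fail".
# The numbers are invented for illustration.
train = [
    ((90, 22), "pass"),
    ((85, 18), "pass"),
    ((95, 24), "pass"),
    ((40,  8), "fail"),
    ((55, 10), "fail"),
    ((35,  6), "fail"),
]

def predict(features):
    """1-nearest-neighbour: return the label of the closest training example."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

print(predict((88, 20)))   # pass
print(predict((45, 7)))    # fail
```

Even this trivial model generalizes in the sense the text describes: it produces a sensible answer for students it has never seen, based purely on similarity to labelled history.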

In real-world practice, supervised learning is dominant because many domains naturally produce labelled data. Banks maintain historical records linking customer behaviour to
default or repayment; hospitals maintain datasets connecting symptoms and test results to
diagnoses; e-commerce platforms record user interactions along with resulting purchases.
These naturally labelled datasets serve as rich foundations for supervised models. The
success of supervised learning lies in its ability to capture complex mappings that humans
may be unable to describe explicitly but which are nonetheless embedded in historical data.

3.6.2 Unsupervised Learning


Unsupervised learning addresses a fundamentally different class of problems where the
dataset contains inputs without corresponding output labels. In such situations, the machine
must independently discover structure, patterns, groupings, or representations within the data.
This learning resembles the cognitive process by which humans identify natural groupings or
similarities without being explicitly taught. For example, a child may group toys by colour or
shape simply by observing them, without an adult instructing them to do so.

In unsupervised learning, because there is no teacher-provided label, the algorithm’s objective shifts from prediction to pattern discovery. The algorithm attempts to organize the
data into meaningful structures by identifying similarities or differences among observations.
One of the most widely practiced forms of unsupervised learning is clustering, where the goal
is to group data points based on similarity. For instance, a retailer may want to segment
customers into behavioural groups such as budget buyers, premium buyers, trend seekers, or
infrequent visitors. The retailer might not initially know how many groups exist or what
defines each group. An unsupervised learning algorithm like K-Means can analyse the
transactional histories and form clusters representing natural divisions in the customer base.
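A minimal one-dimensional K-Means sketch (invented monthly spending figures; real use involves many features and a library implementation that also handles edge cases such as empty clusters):

```python
# Minimal 1-D K-Means: cluster customers by monthly spend (made-up values).
spend = [120, 150, 130, 900, 950, 880, 400, 420]
centroids = [120.0, 400.0, 900.0]           # initial guesses, one per cluster

for _ in range(10):                          # a few assign/update rounds suffice here
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for x in spend:
        i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[i].append(x)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print([round(c) for c in centroids])         # [133, 410, 910]
```

The three final centroids correspond to natural "budget", "mid-range", and "premium" groups that were never labelled in the data — the structure emerged from similarity alone.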

Another important objective of unsupervised learning is dimensionality reduction. Modern datasets—such as gene expression measurements, high-resolution images, or document-term
matrices—contain thousands of variables. Many of these variables are correlated or
redundant. Techniques such as Principal Component Analysis (PCA) reduce dimensionality
by identifying a smaller set of features that preserve most of the dataset’s variability. Such
transformations are vital in enhancing computational efficiency and improving the
performance of downstream supervised learning models.

Unsupervised learning is also important in anomaly detection. When a system learns what
constitutes normal behaviour, it can recognize unusual patterns without explicit examples of
anomalies. For example, an electricity grid monitoring system may detect abnormal load
patterns without ever having been provided labelled instances of failure. This is possible
because unsupervised techniques learn compact representations of normal patterns and
identify deviations as anomalies.

Unlike supervised learning, unsupervised learning must grapple with ambiguity. There is
often no ground truth against which to measure performance. Evaluating the quality of
clusters or representations requires domain knowledge and careful interpretation. Despite these challenges, unsupervised learning remains invaluable in early exploratory analysis,
segmentation, pattern recognition, and feature extraction, especially in environments where
labelled data is scarce or expensive to obtain.

3.6.3 Reinforcement Learning


Reinforcement learning represents a distinct paradigm inspired by behavioural psychology,
where an agent learns to act by interacting with an environment and receiving feedback in the
form of rewards or penalties. It does not rely on labelled examples nor on discovering static
patterns. Instead, it focuses on learning optimal sequences of actions to maximize long-term
cumulative rewards. Reinforcement learning captures the essence of learning by trial and
error, analogous to how humans learn to ride a bicycle, play a game, or operate machinery:
through repeated attempts, feedback, correction, and refinement.

A reinforcement learning scenario involves four main components: an agent, an environment, a set of possible actions, and a reward mechanism. The agent observes the environment’s
state and selects an action. The environment responds with a new state and a scalar reward.
The agent’s objective is to learn a policy—a strategy mapping states to actions—that
maximizes total reward over time. This policy evolves as the agent conducts more
interactions with the environment and evaluates the consequences of its actions.

To illustrate reinforcement learning with a concrete example, consider the case of an autonomous warehouse robot tasked with delivering items from storage shelves to packaging
stations. At each moment, the robot must decide how to move: forward, backward, turn left,
turn right, adjust speed, and avoid obstacles. The environment returns feedback: a positive
reward when the robot moves toward its goal efficiently, a negative reward when it collides
with an obstacle or takes a longer route. Over hundreds or thousands of episodes, the robot
learns behaviours that minimize collisions and optimize navigation time. The robot acquires
skill without explicit programming, relying solely on reinforcement signals.
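The trial-and-error loop can be sketched with tabular Q-Learning on a toy one-dimensional "corridor" — a drastically simplified stand-in for the warehouse; all numbers are illustrative.

```python
import random

# Tabular Q-Learning on a toy corridor: states 0..4, goal at state 4.
# Reward: +10 on reaching the goal, -1 for every other move (favours short paths).
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                        # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.2     # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
rng = random.Random(0)

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 10 if s2 == GOAL else -1
        # Q-Learning update: nudge Q(s,a) toward reward + discounted best future value.
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned policy moves right from every non-goal state.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(policy)   # [1, 1, 1, 1]
```

Nothing told the agent that "right" is correct: the policy emerged from repeated episodes of acting, receiving rewards, and updating value estimates — exactly the trial-and-error process described above.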

Reinforcement learning is also central to game-playing AI systems. Algorithms such as Q-Learning and policy-based methods have enabled artificial agents to master complex games
including chess, Go, and real-time strategy games. These systems discover strategies that
human players take years to learn, and in some cases surpass human performance entirely.
Notably, Google DeepMind’s AlphaGo and AlphaZero systems used reinforcement learning
combined with deep neural networks to achieve historic milestones in AI capability.

Industrial and commercial applications of reinforcement learning extend beyond games and
robotics. It is used in elevator control systems, energy grid optimization, traffic light signal
scheduling, stock trading, and recommendation systems that adapt dynamically to user
interactions. The strength of reinforcement learning lies in its ability to operate in sequential
decision-making environments where actions have delayed consequences and where optimal
solutions are not known beforehand.

However, reinforcement learning also presents challenges. It requires extensive exploration, which may be costly or unsafe in real-world environments. Reward design demands careful
thought; poorly chosen rewards can lead to unintended behaviours. Moreover, convergence to
optimal policies can be slow. Despite these challenges, reinforcement learning remains a
powerful framework for autonomous decision-making systems.



3.6.4 Comparative Understanding of the Three Paradigms

Understanding the distinctions between supervised, unsupervised, and reinforcement learning
requires deep conceptual clarity. Supervised learning focuses on learning a direct mapping
from inputs to outputs using labelled data. Unsupervised learning seeks hidden structures in
unlabelled data without predefined outcomes. Reinforcement learning centers on optimizing
actions through sequential feedback. Each paradigm suits different problem types. For
instance, predicting stock prices is a supervised regression problem; segmenting customers
into behavioural groups requires unsupervised learning; navigating a drone autonomously
demands reinforcement learning.

The three paradigms also differ fundamentally in the nature of feedback. Supervised learning
provides explicit error signals during training. Unsupervised learning provides no feedback;
the model must infer structure. Reinforcement learning supplies delayed and often noisy
reward signals. The structure of the problem determines the appropriate paradigm: where
labelled examples exist, supervised learning excels; where structure must be discovered,
unsupervised learning is ideal; and where decisions unfold over time, reinforcement learning
becomes essential.

In modern AI, hybrid approaches are increasingly common. For example, semi-supervised
learning combines labelled and unlabelled data; self-supervised learning converts unlabelled
data into pseudo-supervised tasks; and deep reinforcement learning merges reinforcement
learning with deep neural networks to solve high-dimensional tasks. These advances show
that the boundaries between the three paradigms, though conceptually distinct, can be bridged
to create highly effective learning systems.

3.7 Regression Analysis in Machine Learning

Regression analysis is not only a theoretical modelling technique but also an intensely
practical tool whose power lies in quantifying relationships using data and producing
measurable predictions. Its usefulness becomes clear only when one supplements conceptual
understanding with detailed numerical examples. Without working through actual
calculations, the learner cannot fully appreciate how regression lines are constructed, how
coefficients emerge from data, and how predictions behave for new observations. Therefore,
this expanded section integrates both conceptual depth and concrete problem solving.

Regression seeks to model the functional relationship between dependent and independent
variables. While the concept appears straightforward, the underlying mathematics and
interpretation require careful attention. Every coefficient in a regression model carries
numerical meaning, and understanding how these coefficients arise from raw data allows
students to appreciate the predictive process instead of treating it as a black box. In the
following subsections, each type of regression is illustrated with one or more realistic
examples.



3.7.1 Simple Linear Regression
To understand simple linear regression concretely, consider a small dataset representing the
relationship between hours of study and marks obtained by five students. Although real
datasets contain many more observations, working with a smaller sample clarifies the
concepts of slope, intercept, residuals, and prediction.

Suppose we have the following data:

Hours studied (X): 2, 3, 5, 7, 9
Marks obtained (Y): 50, 55, 65, 70, 78

We want to construct the regression line Y = mX + c using the least squares method.

The slope m is computed using the formula:

m = Σ[(xi − x̄)(yi − ȳ)] / Σ[(xi − x̄)²]

To apply this, we must compute means:

x̄ = (2 + 3 + 5 + 7 + 9) / 5 = 26 / 5 = 5.2
ȳ = (50 + 55 + 65 + 70 + 78) / 5 = 318 / 5 = 63.6

Next, we compute the required summations:

For each pair:

x = 2, y = 50
(x − x̄) = −3.2
(y − ȳ) = −13.6
Product = 43.52
(x − x̄)² = 10.24

x = 3, y = 55
(x − x̄) = −2.2
(y − ȳ) = −8.6
Product = 18.92
(x − x̄)² = 4.84

x = 5, y = 65
(x − x̄) = −0.2
(y − ȳ) = 1.4
Product = −0.28
(x − x̄)² = 0.04

x = 7, y = 70
(x − x̄) = 1.8



(y − ȳ) = 6.4
Product = 11.52
(x − x̄)² = 3.24

x = 9, y = 78
(x − x̄) = 3.8
(y − ȳ) = 14.4
Product = 54.72
(x − x̄)² = 14.44

Summations:
Σ(x − x̄)(y − ȳ) = 43.52 + 18.92 − 0.28 + 11.52 + 54.72 = 128.4
Σ(x − x̄)² = 10.24 + 4.84 + 0.04 + 3.24 + 14.44 = 32.8

Thus,
m = 128.4 / 32.8 = 3.9146 (approximately)

The intercept c is computed using:

c = ȳ − m x̄
c = 63.6 − (3.9146 × 5.2)
c = 63.6 − 20.356 ≈ 43.244

Therefore, the regression line is:

Y = 3.9146 X + 43.244

Now, suppose a student studies for 6 hours. The predicted marks would be:

Y = 3.9146 × 6 + 43.244 = 23.4876 + 43.244 ≈ 66.73

This example demonstrates how regression produces a specific prediction rather than a vague
guess. Moreover, one can interpret the slope as meaning that each additional hour of study
increases expected marks by approximately 3.9 marks, under the assumptions of linearity.
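As a quick check, the whole computation can be reproduced in a short Python sketch (illustrative code, not part of the textbook):

```python
# Least-squares fit for the study-hours example above (illustrative sketch).
def fit_line(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    m = num / den            # slope
    c = y_mean - m * x_mean  # intercept
    return m, c

hours = [2, 3, 5, 7, 9]
marks = [50, 55, 65, 70, 78]
m, c = fit_line(hours, marks)
print(round(m, 4), round(c, 3))   # ≈ 3.9146 43.244
print(round(m * 6 + c, 2))        # predicted marks for 6 hours ≈ 66.73
```

Running this reproduces the slope, intercept, and prediction derived by hand above.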

3.7.2 Residual Analysis


Residuals measure how far each actual point lies from the regression line. To illustrate
residuals, compute ŷ for each X:

For X = 2: ŷ = 3.9146 × 2 + 43.244 = 51.07


Actual y = 50
Residual = y − ŷ = −1.07

For X = 3: ŷ = 55.0
Actual y = 55
Residual = 0



For X = 5: ŷ = 62.82
Actual y = 65
Residual = 2.18

For X = 7: ŷ = 70.65
Actual y = 70
Residual = −0.65

For X = 9: ŷ = 78.47
Actual y = 78
Residual = −0.47

The pattern of residuals helps determine whether the linear model fits well. In this example,
residuals are small and not systematically positive or negative, suggesting the linear model is
appropriate.

3.7.3 Multiple Linear Regression


Multiple linear regression extends the simple case to multiple independent variables.
Consider predicting house prices based on area, number of bedrooms, and age of the
building. Suppose we have the following simplified dataset:

Area (sq ft) | Bedrooms | Age (years) | Price (lakhs)


1000 | 2 | 10 | 45
1500 | 3 | 5 | 60
2000 | 3 | 8 | 68
1800 | 2 | 15 | 55
2200 | 4 | 4 | 80

The general model is:

Price = b0 + b1(Area) + b2(Bedrooms) + b3(Age)

To compute coefficients, one solves a system of linear equations using matrix algebra:

b = (XᵀX)⁻¹ XᵀY

Where X is the matrix of features including a column of 1s for b0. An illustrative


computation is too lengthy to carry out entirely by hand, but the essential point is that this
formula yields coefficient values such as:

b0 = 12.5
b1 = 0.027
b2 = 5.2
b3 = −0.63

Interpretation is critical. b1 = 0.027 means each additional square foot adds ₹2700 to the
price. b2 = 5.2 means each additional bedroom increases the price by ₹5.2 lakhs. The
negative sign on b3 indicates older buildings are slightly cheaper.



Using the model, one can predict the price of a 1700 sq ft, 3-bedroom, 7-year-old house:

Price = 12.5 + (0.027 × 1700) + (5.2 × 3) − (0.63 × 7)


Price = 12.5 + 45.9 + 15.6 − 4.41
Price ≈ 69.59 lakhs

Such predictions help builders, buyers, and sellers make informed decisions.
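The normal-equation formula b = (XᵀX)⁻¹XᵀY can be applied directly with numpy. The sketch below fits the five-row table; note that the coefficients it produces are the actual least-squares values for this small dataset, which will differ from the purely illustrative coefficient values quoted above:

```python
# Normal-equation sketch for the house-price table above (illustrative code).
import numpy as np

area  = [1000, 1500, 2000, 1800, 2200]
beds  = [2, 3, 3, 2, 4]
age   = [10, 5, 8, 15, 4]
price = [45, 60, 68, 55, 80]

# Design matrix with a leading column of 1s for the intercept b0.
X = np.column_stack([np.ones(5), area, beds, age])
y = np.array(price, dtype=float)

# Solving the normal equations is numerically safer than an explicit inverse.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Predict a 1700 sq ft, 3-bedroom, 7-year-old house.
pred = b @ [1, 1700, 3, 7]
print(np.round(b, 4), round(float(pred), 2))
```

The exact coefficient values depend on the data, but the mechanics — build X with an intercept column, solve the normal equations, multiply by a new feature vector — are exactly the procedure described above.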

3.7.4 Polynomial Regression


To illustrate polynomial regression, consider the following relationship between engine
horsepower (X) and fuel efficiency (Y in km/l).

Horsepower: 70, 90, 110, 130, 150


Fuel efficiency: 22, 20, 17, 14, 12

Plotting this data reveals a curved downward trend. A linear model would give inaccurate
predictions because the rate of decrease in efficiency accelerates at higher horsepower levels.

Polynomial regression fits a model of the form:

Y = a0 + a1X + a2X²

Using least squares for polynomial terms, we obtain:

a0 = 37.2
a1 = −0.265
a2 = −0.0007

To predict efficiency for a car with 120 horsepower:

Y = 37.2 − (0.265 × 120) − (0.0007 × 120²)


Compute stepwise:
0.265 × 120 = 31.8
120² = 14400
0.0007 × 14400 = 10.08

Thus,
Y = 37.2 − 31.8 − 10.08 = −4.68

A negative value is impossible for fuel efficiency, and since 120 horsepower lies inside the observed data range (where efficiency is between 14 and 17 km/l), the assumed coefficients clearly do not fit this data and must be reconsidered. This example shows that polynomial models can fit curves, but poorly estimated coefficients produce absurd predictions, and even well-fitted polynomials can behave inaccurately outside the data range, emphasizing careful model selection and validation.
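For comparison, the sketch below fits a degree-2 polynomial to the same data with numpy.polyfit; the fitted coefficients differ from the illustrative ones above, and the resulting in-range prediction is sensible:

```python
# Degree-2 polynomial fit for the horsepower data above (illustrative sketch).
import numpy as np

hp  = np.array([70, 90, 110, 130, 150], dtype=float)
eff = np.array([22, 20, 17, 14, 12], dtype=float)

a2, a1, a0 = np.polyfit(hp, eff, 2)       # polyfit returns highest power first
predict = lambda x: a0 + a1 * x + a2 * x * x

print(round(predict(120.0), 2))           # ≈ 15.7, between the neighbours 17 and 14
```

Unlike the hand-assumed coefficients in the text, a least-squares fit keeps the prediction consistent with the surrounding data points.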

3.7.5 Logistic Regression


Although logistic regression is used for classification, a worked example clarifies its
operation.



Consider predicting whether a student passes (1) or fails (0) based on attendance. Suppose we
have:

Attendance (%) | Pass/Fail


40 | 0
50 | 0
60 | 1
70 | 1
80 | 1

Using logistic regression, we compute coefficients b0 and b1 that define:

p(pass) = 1 / (1 + e^(−(b0 + b1X)))

Assume the computed values are:


b0 = −12.3
b1 = 0.18

For a student with 65 percent attendance:

Linear predictor z = b0 + b1X = −12.3 + (0.18 × 65)


0.18 × 65 = 11.7
z = −12.3 + 11.7 = −0.6

Probability = 1 / (1 + e^(0.6)) = 1 / (1 + 1.822) ≈ 0.354

Thus, logistic regression estimates a 35.4% probability of passing at 65% attendance. If the
threshold is 0.5, the student is predicted to fail. This example shows logistic regression
provides probabilistic insight rather than binary decisions.
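The probability computation can be scripted directly, using the b0 and b1 values assumed above:

```python
# Logistic (sigmoid) probability for the attendance example (illustrative sketch).
import math

def pass_probability(attendance, b0=-12.3, b1=0.18):
    z = b0 + b1 * attendance          # linear predictor
    return 1.0 / (1.0 + math.exp(-z)) # sigmoid maps z into (0, 1)

p = pass_probability(65)
print(round(p, 3))                      # ≈ 0.354
print("Pass" if p >= 0.5 else "Fail")   # threshold 0.5 → Fail
```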

3.7.6 Regression Evaluation Metrics


Using the earlier simple regression dataset, compute MAE and RMSE.

Actual Y: 50, 55, 65, 70, 78


Predicted ŷ: 51.07, 55, 62.82, 70.65, 78.47

Absolute errors:
1.07, 0, 2.18, 0.65, 0.47

MAE = (1.07 + 0 + 2.18 + 0.65 + 0.47) / 5 = 4.37 / 5 = 0.874

To compute RMSE:
Square errors:
1.1449, 0, 4.7524, 0.4225, 0.2209
Sum = 6.5407
MSE = 6.5407 / 5 = 1.3081
RMSE = sqrt(1.3081) ≈ 1.143



MAE and RMSE tell how close predictions are to actual values in absolute and squared terms
respectively.
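Both metrics follow mechanically from the error lists, as this short sketch shows:

```python
# MAE and RMSE for the predictions tabulated above (illustrative sketch).
import math

actual    = [50, 55, 65, 70, 78]
predicted = [51.07, 55, 62.82, 70.65, 78.47]

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(round(mae, 3), round(rmse, 3))   # ≈ 0.874 1.144
```

Note that RMSE ≥ MAE always holds, because squaring weights larger errors more heavily.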

3.8 Classification Techniques in Machine Learning
Classification is one of the most fundamental tasks in supervised learning. The primary
objective of classification is to assign an input instance to one of several predefined
categories based on the characteristics of the data. In contrast to regression, where the output
variable is continuous, classification deals with discrete labels such as “spam” or “not spam”,
“benign” or “malignant” tumour, “approved” or “rejected” loan application, or any multi-
class outcomes like types of fruits or genres of movies. Most decision-making systems in
banking, medicine, insurance, commerce, and social media moderation rely heavily on
classification algorithms, making their understanding crucial for mastering Machine
Learning.

The mathematical foundations of classification revolve around estimating decision


boundaries in the feature space. A decision boundary separates regions corresponding to
different classes. For example, in a two-dimensional attribute space representing “income”
and “expenditure”, a bank may identify regions that correspond to “likely defaulter” and
“likely non-defaulter”. Classification algorithms construct these boundaries either through
analytical functions, distance-based reasoning, probability estimation, or hierarchical
partitioning, depending on the technique employed.

To understand classification in a meaningful way, it is necessary not only to study conceptual


ideas but also to examine concrete numerical examples. This approach reveals why
algorithms behave as they do and how their decisions can be interpreted or validated.

3.8.1 The K-Nearest Neighbour (KNN) Classifier
The K-Nearest Neighbour (KNN) algorithm is one of the simplest yet surprisingly effective
classification methods. Its logic is purely intuitive: when determining the class of a new
instance, examine the “nearest” previously seen instances in the dataset, and assign the class
most common among them. A human analogy is straightforward. If a person wants to guess
the college major of a new student, they might look at the student’s closest friends. If most of
them study engineering, the new student is likely to belong to the same department.

KNN does not construct an explicit mathematical model during training. Instead, it stores all
training data and makes decisions during prediction time. Thus, learning is deferred until
classification, and the model is often called a “lazy learner”. The core requirement of KNN is
a distance metric that measures similarity between data points. The most commonly used
metric is Euclidean distance, although Manhattan distance and others can be used depending
on the data distribution.

To appreciate KNN clearly, consider a detailed worked example using a small dataset.



Worked Example: Predicting Whether a Student Passes or Fails

Suppose we have the following dataset of students, where the attributes are “Hours Studied”
and “Internal Marks”, and the class label is “Pass” or “Fail”.

Student Hours Studied (X1) Internal Marks (X2) Class


A 2 35 Fail
B 3 40 Fail
C 4 50 Pass
D 6 65 Pass
E 7 70 Pass

Now suppose a new student F has


Hours Studied = 5
Internal Marks = 55

We want to classify F as Pass or Fail using KNN.

We compute Euclidean distances between F and each existing student.

Distance formula:
Distance = √[(X1 − X1_new)² + (X2 − X2_new)²]

Compute distances:

For Student A:
= √[(2 − 5)² + (35 − 55)²]
= √[9 + 400]
= √409 ≈ 20.22

For Student B:
= √[(3 − 5)² + (40 − 55)²]
= √[4 + 225]
= √229 ≈ 15.13

For Student C:
= √[(4 − 5)² + (50 − 55)²]
= √[1 + 25]
= √26 ≈ 5.10

For Student D:
= √[(6 − 5)² + (65 − 55)²]
= √[1 + 100]
= √101 ≈ 10.05

For Student E:
= √[(7 − 5)² + (70 − 55)²]
= √[4 + 225]
= √229 ≈ 15.13



Now we must choose a value of K.

If K = 3, we select the three nearest neighbours.


Sorting distances:

C (5.10) → Pass
D (10.05) → Pass
B (15.13) → Fail and E (15.13) → Pass tie for third place

Thus, however the tie is broken, at least two of the 3 nearest points are Pass.
Therefore, the predicted class for Student F is Pass.

This example shows clearly how KNN uses proximity in attribute space to guide
classification.
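A minimal KNN implementation reproducing this worked example might look as follows (an illustrative sketch, not a library implementation):

```python
# KNN sketch for the student dataset above, with K = 3 (illustrative code).
import math
from collections import Counter

train = [((2, 35), "Fail"), ((3, 40), "Fail"), ((4, 50), "Pass"),
         ((6, 65), "Pass"), ((7, 70), "Pass")]

def knn_predict(point, data, k=3):
    # sort the training points by Euclidean distance and keep the k nearest
    neighbours = sorted(data, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]   # majority class

print(knn_predict((5, 55), train))      # → Pass
```

The nearest neighbours of student F (5, 55) are C and D (both Pass), so the majority vote is Pass, matching the hand computation.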

3.8.2 Decision Trees


Decision Trees represent one of the most interpretable and widely used classification models.
A decision tree is a hierarchical structure that recursively splits the dataset based on attribute
values, constructing a sequence of decisions that lead to a class label. Unlike KNN, which
uses distances, decision trees use tests such as “Is income > 50,000?” or “Is age < 30?” to
partition data. This structure resembles human reasoning: we ask a series of questions to
arrive at a conclusion.

To construct a decision tree, algorithms measure how “good” each split is. Popular impurity
measures include Gini index and entropy. The goal is to choose splits that reduce impurity,
meaning that each subgroup becomes more homogeneous in class composition.

To illustrate decision tree construction, consider a detailed example from a student


performance dataset.

Worked Example: Building a Small Decision Tree

Consider the following dataset, where students are classified as Pass or Fail based on two
attributes: Attendance and Hours Studied.

Student Attendance (%) Hours Studied Class


A 45 2 Fail
B 55 3 Fail
C 65 4 Pass
D 80 2 Pass
E 85 5 Pass

We want to build a decision tree using Attendance as the first splitting variable.

Compute entropy at root:

Out of 5 students, 3 are Pass and 2 are Fail.



Entropy = −(3/5) log₂(3/5) − (2/5) log₂(2/5)
Compute each term:
−(3/5) log₂(3/5) = −0.6 × (−0.7369) = 0.4421
−(2/5) log₂(2/5) = −0.4 × (−1.3219) = 0.5287
Total entropy = 0.4421 + 0.5287 ≈ 0.9708

Now consider splitting on Attendance at 60%.

Group 1: Attendance ≤ 60
Students A, B
Both Fail → entropy = 0

Group 2: Attendance > 60


Students C, D, E
All Pass → entropy = 0

Weighted entropy after split:


= (2/5)*0 + (3/5)*0 = 0

Information gain = 0.9708 − 0 = 0.9708

This is the maximum possible reduction, so the test Attendance ≤ 60 becomes the root decision.

The resulting tree is:

If Attendance ≤ 60: Fail


If Attendance > 60: Pass

This extremely simple tree achieves perfect classification for this dataset.
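The entropy and information-gain arithmetic can be verified with a few lines of code:

```python
# Entropy and information gain for the Attendance ≤ 60 split above (sketch).
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

root  = ["Fail", "Fail", "Pass", "Pass", "Pass"]
left  = ["Fail", "Fail"]           # Attendance ≤ 60 (students A, B)
right = ["Pass", "Pass", "Pass"]   # Attendance > 60 (students C, D, E)

weighted = len(left) / 5 * entropy(left) + len(right) / 5 * entropy(right)
gain = entropy(root) - weighted
print(round(entropy(root), 4), round(gain, 4))   # ≈ 0.971 0.971
```

Since both child groups are pure (entropy 0), the gain equals the full root entropy, which is why this split is chosen.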

3.8.3 Random Forest Classifier


While decision trees are interpretable, they are sensitive to fluctuations in the data. Small
changes in training samples may alter the entire structure of the tree. Random Forest
addresses this weakness by constructing multiple decision trees and combining their
predictions. Each tree is trained on a different bootstrap sample (randomly selected with
replacement), and at each split, a random subset of attributes is considered.

The final prediction is typically made through majority voting.

A simple example illustrates the idea clearly.

Example: Random Forest on Student Performance

Suppose we train three decision trees on slightly different samples of the earlier dataset. Each
tree may make slightly different decisions:



Tree 1:
If Attendance > 60 → Pass
Else → Fail

Tree 2:
If Hours Studied > 3 → Pass
Else → Fail

Tree 3:
If Attendance > 70 → Pass
Else → Fail

Now classify a new student:

Attendance = 65
Hours Studied = 2

Tree 1: Attendance > 60 → Pass


Tree 2: Hours Studied ≤ 3 → Fail
Tree 3: Attendance ≤ 70 → Fail

Votes: Pass (1), Fail (2) → Final prediction: Fail

Random Forest thus stabilizes predictions by aggregating multiple models.
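The voting mechanism is easy to express in code; the three hand-written rules above become three small functions (an illustrative sketch of the voting step only, not a real Random Forest training procedure):

```python
# Majority voting over the three hand-written trees above (illustrative sketch).
from collections import Counter

tree1 = lambda att, hrs: "Pass" if att > 60 else "Fail"
tree2 = lambda att, hrs: "Pass" if hrs > 3 else "Fail"
tree3 = lambda att, hrs: "Pass" if att > 70 else "Fail"

def forest_predict(att, hrs):
    votes = Counter(t(att, hrs) for t in (tree1, tree2, tree3))
    return votes.most_common(1)[0][0]

print(forest_predict(65, 2))   # → Fail (votes: Pass 1, Fail 2)
```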

3.8.4 Naïve Bayes Classifier


Naïve Bayes is a probabilistic classifier based on Bayes’ theorem. The central assumption is
conditional independence: given the class, the attributes are assumed to be independent.
While this assumption is rarely true in practice, Naïve Bayes often performs remarkably well,
particularly in text classification tasks.

To understand Naïve Bayes, one must work through a full probability-based example.

Worked Example: Classifying Emails as Spam or Not Spam

Suppose we consider two keywords: “offer” and “win”. We have 10 emails in training:

Email offer win Class


e1 1 1 Spam
e2 1 0 Spam
e3 1 1 Spam
e4 0 1 Spam
e5 0 1 Spam
e6 1 0 NotSpam
e7 0 0 NotSpam
e8 0 0 NotSpam
e9 1 0 NotSpam



e10 0 1 NotSpam

Count Spam emails = 5


Count NotSpam emails = 5

Prior probabilities:
P(Spam) = 5/10 = 0.5
P(NotSpam) = 5/10 = 0.5

Compute conditional probabilities:

P(offer=1 | Spam): In Spam, 3 out of 5 have offer=1 → 3/5 = 0.6


P(win=1 | Spam): In Spam, 4 out of 5 have win=1 → 4/5 = 0.8

P(offer=1 | NotSpam): In NotSpam, 2 out of 5 have offer=1 → 2/5 = 0.4


P(win=1 | NotSpam): In NotSpam, 1 out of 5 have win=1 → 1/5 = 0.2

Now classify a new email containing both offer=1 and win=1.

Compute:

P(Spam | offer=1, win=1) ∝ P(offer=1|Spam) × P(win=1|Spam) × P(Spam)


= 0.6 × 0.8 × 0.5 = 0.24

P(NotSpam | offer=1, win=1) ∝ 0.4 × 0.2 × 0.5 = 0.04

Since 0.24 > 0.04, classify the email as Spam.

This example demonstrates, step by step, how Naïve Bayes computes class probabilities.
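The same arithmetic in code:

```python
# Naïve Bayes scores for the email example above (illustrative sketch).
p_spam, p_not = 0.5, 0.5                 # prior probabilities
p_offer_spam, p_win_spam = 3 / 5, 4 / 5  # conditionals estimated from the table
p_offer_not,  p_win_not  = 2 / 5, 1 / 5

# Unnormalized posterior scores for an email with offer=1 and win=1.
score_spam = p_offer_spam * p_win_spam * p_spam
score_not  = p_offer_not  * p_win_not  * p_not
print(round(score_spam, 2), round(score_not, 2))      # → 0.24 0.04
print("Spam" if score_spam > score_not else "NotSpam")  # → Spam
```

Normalizing the scores (0.24 / 0.28 ≈ 0.857) would give the actual posterior probability of Spam, but for classification only the comparison matters.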

3.8.5 Classification Summary and Interpretive Notes
Classification techniques vary in their assumptions, mathematical structure, and strengths.
KNN relies on distance and is simple but computationally heavy for large datasets. Decision
trees provide transparency but can overfit without pruning. Random forests reduce overfitting
and enhance stability at the cost of interpretability. Naïve Bayes excels in high-dimensional
problems and text data, offering fast and robust performance.

3.9 Clustering Techniques


Clustering is one of the most central and widely applied techniques in unsupervised learning.
In contrast to classification—where the algorithm learns from labelled data—clustering
operates without predefined categories. Its objective is to discover natural groupings within a
dataset by analysing similarities between observations. These similarities are expressed
mathematically through distance or density measures, leading the algorithm to arrange data



into clusters such that points within the same cluster are more similar to one another than to
points in different clusters.

Clustering is not a single algorithm but a family of techniques, each based on a distinct notion
of what constitutes a “group” or “similarity”. For example, some clustering methods assume
clusters are spherical and well-separated; others assume that clusters may be of arbitrary
shape or density. Some require specifying the number of clusters beforehand, while others
automatically infer it. This diversity makes clustering suitable for a wide range of
applications, such as market segmentation, social network analysis, anomaly detection,
document organization, genetic pattern discovery, and image segmentation.

To understand clustering deeply, one must follow worked numerical examples rather than
rely solely on conceptual descriptions. Only by computing distances, updating centroids, and
analysing cluster assignments repeatedly can students truly comprehend how data points
migrate between clusters and stabilize into meaningful groups.

3.9.1 K-Means Clustering


K-Means is the most widely used clustering algorithm due to its conceptual simplicity and
computational efficiency. The algorithm partitions data into K clusters, where each cluster is
represented by a centroid. These centroids are iteratively updated until the assignments of
points to clusters stabilize. At its heart, K-Means minimizes within-cluster variance by
repeatedly performing two steps: assigning each point to the nearest centroid, and
recomputing centroids as the mean of assigned points.

To fully understand K-Means, let us examine a detailed numerical example involving a small,
two-dimensional dataset. This transparency reveals how centroids gradually move and how
cluster memberships evolve.

Worked Example: Clustering Students Based on Study Hours and Marks

Consider six students with the following characteristics:

Student Hours Studied (X) Marks (Y)


A 2 40
B 3 50
C 4 45
D 10 80
E 11 85
F 12 82

Visually, it appears that students A, B, C belong to one cluster (low study, low marks), and
D, E, F belong to another cluster (high study, high marks). K-Means should ideally identify
these groups.

Let K = 2.

We must initialize the centroids. One common method is selecting any two points randomly.
Let us choose:



Initial Centroid 1 (C1): Student A → (2, 40)
Initial Centroid 2 (C2): Student D → (10, 80)

Now, compute distances from each point to each centroid using Euclidean distance.

Distance formula:
d = √[(x − cx)² + (y − cy)²]

Compute distances and assign clusters.

Iteration 1: Assign Each Point to the Nearest Centroid

Student (X, Y) | Distance to C1 (2, 40) | Distance to C2 (10, 80) | Assigned Cluster
A (2, 40) | 0 | √[(2−10)² + (40−80)²] = √1664 ≈ 40.79 | C1
B (3, 50) | √[(3−2)² + (50−40)²] = √101 ≈ 10.05 | √[(3−10)² + (50−80)²] = √949 ≈ 30.81 | C1
C (4, 45) | √[(4−2)² + (45−40)²] = √29 ≈ 5.39 | √[(4−10)² + (45−80)²] = √1261 ≈ 35.51 | C1
D (10, 80) | √1664 ≈ 40.79 | 0 | C2
E (11, 85) | √(81 + 2025) = √2106 ≈ 45.89 | √(1 + 25) = √26 ≈ 5.10 | C2
F (12, 82) | √(100 + 1764) = √1864 ≈ 43.17 | √(4 + 4) = √8 ≈ 2.83 | C2

Cluster Assignments after Iteration 1:


Cluster 1: A, B, C
Cluster 2: D, E, F

Recompute Centroids

Centroid 1 (mean of A, B, C):


X̄ = (2 + 3 + 4)/3 = 3
Ȳ = (40 + 50 + 45)/3 = 45
New C1 = (3, 45)

Centroid 2 (mean of D, E, F):


X̄ = (10 + 11 + 12)/3 = 11
Ȳ = (80 + 85 + 82)/3 ≈ 82.33
New C2 = (11, 82.33)

Iteration 2: Reassign Points

Compute distances again using new centroids.

For example:

Distance of B (3,50) to C1:


√[(3−3)² + (50−45)²] = √25 = 5



Distance to C2:
√[(3−11)² + (50−82.33)²]
= √[64 + 1045.2] = √1109.2 ≈ 33.30

Clearly B remains in Cluster 1.

Repeating for all points confirms all assignments remain identical.

Since no point changes its cluster, K-Means converges.

The two clusters are:

Cluster 1: A, B, C
Cluster 2: D, E, F

This matches our intuitive understanding.
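The full assign-and-update loop can be sketched in a few lines (illustrative code; ties are assigned to the first centroid):

```python
# Two-cluster K-Means sketch for the six students, initialized at A and D.
import math

points = {"A": (2, 40), "B": (3, 50), "C": (4, 45),
          "D": (10, 80), "E": (11, 85), "F": (12, 82)}

def kmeans(points, c1, c2, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearer centroid
        g1 = {k for k, p in points.items() if math.dist(p, c1) <= math.dist(p, c2)}
        g2 = set(points) - g1
        # update step: each centroid moves to the mean of its members
        mean = lambda grp: tuple(sum(points[k][i] for k in grp) / len(grp)
                                 for i in (0, 1))
        c1, c2 = mean(g1), mean(g2)
    return sorted(g1), sorted(g2)

print(kmeans(points, (2, 40), (10, 80)))   # → (['A', 'B', 'C'], ['D', 'E', 'F'])
```

As in the hand computation, the assignments stabilize after the first iteration, so extra iterations leave the clusters unchanged.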

3.9.2 Interpretation of K-Means Results


Clustering reveals underlying structure: low-performing students naturally cluster together,
while high-performing students form another group. In real applications, this could help an
instructor identify which group requires intervention, additional tutoring, or different teaching
strategies. K-Means is therefore both analytically powerful and operationally practical.

3.9.3 Limitations of K-Means


Despite its advantages, K-Means assumes clusters are spherical and equally sized, which may
not hold in many datasets. It is sensitive to initial centroids; different initial choices can lead
to different cluster outcomes. It also performs poorly when clusters overlap heavily or when
dataset contains noise or outliers. These limitations necessitate alternative clustering methods
such as hierarchical clustering or density-based clustering.

3.9.4 Hierarchical Clustering


Hierarchical clustering builds clusters step-by-step, either by agglomerating individual points
into groups (agglomerative clustering) or by successively dividing a large cluster into smaller
ones (divisive clustering). Agglomerative clustering is more commonly used. It begins with
each point as its own cluster and merges the two closest clusters at each step until all points
form one cluster.

Hierarchical clustering produces a dendrogram—a tree structure that visually represents


merging operations. Analysts may choose the number of clusters by cutting the dendrogram
at a certain height.

Let us examine a worked numerical example.

Worked Example: Agglomerative Clustering of Students



Consider the following students characterized by study hours:

Student Hours Studied


A 2
B 3
C 4
D 10
E 12

Compute pairwise distances:

Pair Distance
A-B 1
A-C 2
A-D 8
A-E 10
B-C 1
B-D 7
B-E 9
C-D 6
C-E 8
D-E 2

Step-by-Step Merging

1. Closest pair: A-B (distance = 1). Merge A and B → cluster AB.


2. Next closest pair: B-C (distance = 1), but B is already in AB. So merge AB with C →
cluster ABC.
3. Next closest pair: D-E (distance = 2). Merge D and E → cluster DE.

Now we have two clusters: ABC and DE.

4. Distance between ABC and DE: the closest pair between them is C-D = 6.

Thus, ABC merges with DE at height 6.

Dendrogram Interpretation

The dendrogram will show A and B merging at height 1, C joining them shortly after, and D and E merging at height 2, with the two groups joining much later at height 6 (the closest cross-cluster pair, C-D). This visually conveys two well-separated clusters.
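An agglomerative single-linkage merge loop for this one-dimensional example can be sketched as follows (illustrative code):

```python
# Single-linkage agglomerative clustering for the 1-D example above (sketch).
points = {"A": 2, "B": 3, "C": 4, "D": 10, "E": 12}

clusters = [{k} for k in points]   # start: every point is its own cluster
merges = []
while len(clusters) > 1:
    # single-linkage distance between two clusters: their closest member pair
    link = lambda i, j: min(abs(points[a] - points[b])
                            for a in clusters[i] for b in clusters[j])
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))), key=lambda ij: link(*ij))
    merges.append((sorted(clusters[i] | clusters[j]), link(i, j)))
    clusters[i] |= clusters[j]
    del clusters[j]

print(merges)
# → [(['A', 'B'], 1), (['A', 'B', 'C'], 1), (['D', 'E'], 2),
#    (['A', 'B', 'C', 'D', 'E'], 6)]
```

The recorded merge heights (1, 1, 2, 6) are exactly the dendrogram heights discussed above.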

3.9.5 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN takes a fundamentally different approach: it groups points based on density rather
than distance. A point belongs to a cluster if it has a sufficient number of neighbours within a



certain radius. Points in low-density regions are labelled as noise. This makes DBSCAN
extremely effective for datasets with irregular shapes, varied densities, and noise—conditions
where K-Means fails.

A dense region satisfying minimum thresholds forms a cluster. Points with insufficient
neighbours are considered outliers.

To illustrate DBSCAN, we work through a simplified example.

Worked Example: Density-Based Clustering

Consider a one-dimensional dataset representing student test scores:

Values: 45, 46, 47, 90, 91, 92, 93, 150

If we choose:
ε (radius) = 3
minPts (minimum neighbours) = 3

Cluster formation:

Scores 45, 46, 47 form a dense region → one cluster.


Scores 90, 91, 92, 93 form a dense region → another cluster.
Score 150 has no neighbours → noise.

This outcome would be impossible for K-Means, which would try to force 150 into one of the
clusters.

DBSCAN naturally separates noise and provides clusters of arbitrary shape.
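A simplified one-dimensional sketch of this density logic follows (illustrative code; real DBSCAN distinguishes core and border points more carefully):

```python
# 1-D density-based clustering sketch for the test scores above (ε=3, minPts=3).
values = [45, 46, 47, 90, 91, 92, 93, 150]
eps, min_pts = 3, 3

def neighbours(v):
    return [u for u in values if abs(u - v) <= eps]  # the point counts itself

clusters, noise, assigned = [], [], set()
for v in values:
    if v in assigned:
        continue
    if len(neighbours(v)) >= min_pts:          # v is dense: start a cluster
        cluster, changed = set(neighbours(v)), True
        while changed:                          # absorb neighbours of dense members
            changed = False
            for u in list(cluster):
                if len(neighbours(u)) >= min_pts:
                    new = set(neighbours(u)) - cluster
                    if new:
                        cluster |= new
                        changed = True
        clusters.append(sorted(cluster))
        assigned |= cluster
    else:
        noise.append(v)                         # too sparse: outlier

print(clusters, noise)   # → [[45, 46, 47], [90, 91, 92, 93]] [150]
```

The isolated score 150 never reaches the density threshold, so it is reported as noise rather than forced into a cluster.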

3.10 Neural Networks


Neural Networks form one of the most influential families of Machine Learning models,
inspired loosely by the structure and functioning of biological neurons. Although originally
conceptualized in the mid-20th century, neural networks have re-emerged as the foundation
of many modern AI systems due to advances in computational power, optimization
techniques, and availability of large datasets. The strength of neural networks lies in their
ability to approximate highly complex, nonlinear relationships, making them indispensable
for tasks such as image classification, speech recognition, natural language processing,
medical diagnosis, and many other domains where simpler linear models fail.

A neural network consists of layers of interconnected processing units called neurons. Each
neuron receives input signals, applies a weighted summation, passes the result through a
nonlinear activation function, and produces an output. By combining many such neurons into
multiple layers, the network constructs sophisticated internal representations of the data. The
weights of the network serve as adjustable parameters that are tuned during training through
optimization techniques such as gradient descent and backpropagation.



To understand neural networks meaningfully, one must not only learn the conceptual
framework but also examine fully worked numerical examples. Only by manually computing
neuron outputs, activations, and gradients can a student appreciate how information flows
through the network and how learning occurs.

3.10.1 The Structure of an Artificial Neuron


An artificial neuron is the fundamental unit of a neural network. It receives one or more
inputs, each multiplied by a weight, and combines them using the weighted sum function. A
bias term is also added to shift the activation threshold. The output of this summation is then
passed through an activation function.

If a neuron receives inputs x₁, x₂, ..., xₙ with corresponding weights w₁, w₂, ..., wₙ and bias b,
then the net input is:

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

The output is:

a = f(z)

where f is an activation function such as sigmoid, tanh, or ReLU.

To illustrate this with a concrete numerical example, suppose a neuron has two inputs:

x₁ = 3
x₂ = 5

Let the weights be:

w₁ = 0.4
w₂ = 0.1

and bias b = –2.

Compute the net input:

z = (0.4 × 3) + (0.1 × 5) – 2
= 1.2 + 0.5 – 2
= –0.3

Now apply the sigmoid activation function:

sigmoid(z) = 1 / (1 + e^(–z))
= 1 / (1 + e^(0.3))
≈ 1 / (1 + 1.3499)
≈ 1 / 2.3499
≈ 0.425



Thus, the neuron outputs approximately 0.425.
This illustrates how even a simple neuron applies linear summation and nonlinear activation
to produce its output.
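The same neuron in code:

```python
# Single-neuron computation: weighted sum, bias, sigmoid (illustrative sketch).
import math

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

a = neuron([3, 5], [0.4, 0.1], -2)      # the example values from the text
print(round(a, 4))                      # ≈ 0.4256
```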

3.10.2 Architecture of a Neural Network


A neural network arranges neurons into layers. The first (input) layer takes raw input
features. The final (output) layer generates predictions. Between them are hidden layers,
where most of the network’s representational power arises. A network with multiple hidden
layers is known as a deep neural network.

To illustrate, consider a feedforward network with:

• 2 inputs
• 2 hidden neurons
• 1 output neuron

Information flows forward from input to hidden to output.

The mathematical flow is sequential. Each hidden neuron computes its activation from the
inputs. The output neuron then computes its activation from the hidden layer outputs. The
network thereby converts raw data into a final prediction.

To understand this architecture more concretely, we will carry out a complete forward pass
through such a network.

3.10.3 Worked Example: Forward Pass Through a Simple Neural Network
Consider a neural network designed to classify whether a student will pass or fail based on
two features: “Hours Studied” and “Internal Marks”. For simplicity, we will assume the
network has:

• Two inputs: X₁ (hours studied), X₂ (internal marks)


• One hidden layer with two neurons
• One output neuron with sigmoid activation producing probability of passing

Suppose the input for a particular student is:

X₁ = 5
X₂ = 60

Assume the weights and biases are as follows:

Hidden Neuron h₁:


w₁₁ = 0.1 (weight connecting X₁ to h₁)



w₂₁ = 0.05 (weight connecting X₂ to h₁)
b₁ = –4

Hidden Neuron h₂:


w₁₂ = 0.2
w₂₂ = 0.1
b₂ = –10

Output neuron o:
v₁ = 0.8 (weight from h₁ to o)
v₂ = 0.5 (weight from h₂ to o)
b₃ = –2

We now compute all values step by step.

Step 1: Compute Hidden Neuron h₁ Output

Net input to h₁ is:


z₁ = (0.1 × 5) + (0.05 × 60) – 4
= 0.5 + 3 – 4
= –0.5

Activation using sigmoid:


a₁ = 1 / (1 + e^(0.5))
≈ 1 / (1 + 1.6487)
≈ 1 / 2.6487
≈ 0.378

Step 2: Compute Hidden Neuron h₂ Output

z₂ = (0.2 × 5) + (0.1 × 60) – 10


= 1 + 6 – 10
= –3

Sigmoid:
a₂ = 1 / (1 + e^(3))
≈ 1 / (1 + 20.0855)
≈ 1 / 21.0855
≈ 0.047

Step 3: Compute Output Neuron o

Net input:
z₃ = (0.8 × a₁) + (0.5 × a₂) – 2
= (0.8 × 0.378) + (0.5 × 0.047) – 2
= 0.3024 + 0.0235 – 2
= –1.6741

Activation:
a₃ = sigmoid(–1.6741)



≈ 1 / (1 + e^(1.6741))
≈ 1 / (1 + 5.33)
≈ 1 / 6.33
≈ 0.158

Thus, the network predicts a probability of 15.8% that the student will pass.

This example shows how a simple feedforward network converts input values through
weighted summations and nonlinear transformations to produce a probability-based
classification.
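The full forward pass, using the weights assumed above, can be written compactly:

```python
# Forward pass through the 2-2-1 network above (illustrative sketch).
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2):
    a1 = sigmoid(0.1 * x1 + 0.05 * x2 - 4)    # hidden neuron h1
    a2 = sigmoid(0.2 * x1 + 0.1 * x2 - 10)    # hidden neuron h2
    return sigmoid(0.8 * a1 + 0.5 * a2 - 2)   # output neuron

print(round(forward(5, 60), 3))   # ≈ 0.158, i.e. a 15.8% pass probability
```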

3.10.4 Activation Functions


Activation functions introduce nonlinearity into the network, enabling it to model complex
patterns. Without nonlinear activation, a neural network collapses into a linear model
regardless of its number of layers.

A neural model with only linear activations cannot capture curved decision boundaries,
interactions, or hierarchies within data. Sigmoid, tanh, and ReLU are the most common
activations.

To understand their influence, consider how different activation functions respond to the
same input z = –3, –1, 0, 1, and 3.

z Sigmoid(z) tanh(z) ReLU(z)


–3 0.047 –0.995 0
–1 0.269 –0.761 0
0 0.5 0 0
1 0.731 0.761 1
3 0.953 0.995 3

Sigmoid compresses all inputs into (0,1), ideal for probabilities.


Tanh compresses into (−1,1), useful in some hidden layers.
ReLU preserves positive inputs and zeroes negatives, enabling deep networks to train
effectively without saturation.

3.10.5 Learning in Neural Networks: Gradient Descent and Backpropagation
Training a neural network means adjusting weights such that the predicted outputs closely
match the target outputs. The adjustment is achieved by minimizing a loss function,
commonly mean squared error or cross-entropy.

Backpropagation is the algorithm used to compute gradients of the loss function with respect
to each network weight. These gradients indicate how much each weight contributes to the
error and in which direction it should be adjusted.



The update rule for a weight w is:

w_new = w_old − η ∂L/∂w

where η is the learning rate.

To illustrate the learning process intuitively rather than mathematically, consider the earlier
network predicting 15.8% passing probability for a student whose true label is Pass (1). The
error is therefore:

Error = ½ (Target − Output)²


= ½ (1 − 0.158)²
= ½ (0.842)²
= 0.354

Backpropagation computes how each weight must change to reduce this error. For instance,
weights connected to h₂ may be adjusted more aggressively because h₂’s output (0.047) was
extremely low, suggesting an inadequate response. Repeated iterations gradually move the
network toward a configuration where predictions improve.

3.10.6 Complete Example: One Training Update in a Neural Network
Let us examine one simplified weight update.

Suppose the error derivative for the weight v₁ (from h₁ to output) is ∂L/∂v₁ = −0.25, meaning
the error reduces if v₁ increases. If learning rate η = 0.1:

v₁_new = v₁_old − η(∂L/∂v₁)


= 0.8 − 0.1(−0.25)
= 0.8 + 0.025
= 0.825

Thus, the weight increases slightly.

Such micro-adjustments accumulate over hundreds or thousands of iterations to refine the network’s internal representation.

3.10.7 Capacity and Limitations of Neural Networks
Neural networks excel at modelling complex, nonlinear relationships but rely heavily on
high-quality data, appropriate architecture, and sufficient computational resources. They are
universal function approximators, capable of representing any continuous function given
sufficient neurons. However, they are prone to overfitting if trained on small datasets, may
become computationally expensive, and often lack interpretability. Despite these limitations,
their ability to extract layered features from raw data explains their dominance in modern
Machine Learning systems.

3.11 Support Vector Machines (SVM)


Support Vector Machines represent one of the most elegant and theoretically grounded
classification algorithms in machine learning. Although neural networks have gained
widespread popularity in recent years, SVMs continue to be indispensable in applications
requiring high accuracy, robustness, and strong mathematical guarantees, especially when
datasets are small or moderately sized. The essential idea behind SVMs is the construction of
an optimal separating boundary—called a hyperplane—between two classes. While many
algorithms can create such boundaries, SVMs are unique because they find the boundary that
maximizes the margin, that is, the distance between the separating hyperplane and the nearest
data points from each class. These nearest points are known as support vectors. The
maximization of margin provides geometric robustness: a decision boundary with larger
margin tends to generalize better to unseen data.

To understand SVMs deeply, one must combine geometric intuition with algebraic
formulation. Geometry explains the concept of hyperplanes and margins, while algebra
shows how such boundaries are computed through optimization. Both perspectives are
incomplete without numerical examples, as the real essence of SVM lies in seeing how the
algorithm identifies support vectors, computes margin width, and defines classification
decisions.

3.11.1 Geometric Intuition of SVM


Consider the simplest case of two-dimensional data where each instance has two features and
belongs to one of two classes: +1 or –1. Data points of each class occupy regions in the plane.
A classifier’s task is to draw a line that separates the regions. Many such lines may exist, but
SVM selects the one that maximizes the margin. To imagine this, picture two clouds of
points. If you draw several lines dividing the clouds, some lines pass too close to points of
either class, making the classifier unstable: even small perturbations in data may change
predictions. The SVM line sits in the middle of the widest possible corridor that can be drawn
between the two clusters of points. This geometric margin plays a crucial role: it ensures the
decision boundary is far from critical examples and is therefore less sensitive to noise.

In higher dimensions, the separating boundary is not a line but a hyperplane. For two
features, a hyperplane is a line. For three features (x, y, z), a hyperplane becomes a plane. For
n features, the hyperplane is an (n–1)-dimensional object cutting through the n-dimensional
feature space. Regardless of dimension, the idea remains constant: SVM constructs the
hyperplane that maximizes the perpendicular distance to the nearest points from both classes.

3.11.2 Mathematical Formulation of Linear SVM
A hyperplane in an n-dimensional space is represented by the equation:

w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0

where w = (w₁, w₂, …, wₙ) is the weight vector and b is the bias. A point x lies on one side of
the hyperplane if:

w · x + b > 0

and on the other side if:

w · x + b < 0.

During training, SVM imposes the stronger margin constraints on correctly classified points:

Class +1 if w · x + b ≥ 1
Class –1 if w · x + b ≤ –1

(At prediction time, a new point is simply assigned the sign of w · x + b.)

The region between the planes w·x + b = 1 and w·x + b = –1 is the margin band. No training points lie inside this margin in the case of a linearly separable (hard-margin) SVM.

The distance between these two planes equals:

Margin width = 2 / ||w||

Thus, maximizing the margin is equivalent to minimizing ||w||², subject to correct classification constraints. The optimization problem becomes:

Minimize ½ ||w||²
Subject to yᵢ (w·xᵢ + b) ≥ 1 for all i

This is a convex quadratic optimization problem that has a unique global minimum. The
points that satisfy the equality constraint yᵢ(w·xᵢ + b) = 1 are the support vectors.
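The margin formula is straightforward to verify numerically. The sketch below computes 2 / ||w|| for the weight vector w = (–1, 1) that appears in the worked example of Section 3.11.3:

```python
import math

def margin_width(w):
    # Distance between the planes w·x + b = 1 and w·x + b = -1,
    # which equals 2 / ||w||
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# Weight vector from the worked example: w = (-1, 1), so ||w|| = sqrt(2)
print(round(margin_width((-1.0, 1.0)), 4))  # 1.4142
```

Minimizing ||w|| therefore directly maximizes this width, which is why the objective uses ½||w||².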

3.11.3 Fully Worked Numerical Example of Linear SVM (2D Case)
To make the concept tangible, consider the following dataset of five points in two-
dimensional space with class labels +1 and –1:

Point x₁ x₂ Class
A 2 2 +1
B 4 4 +1
C 4 0 –1
D 6 2 –1
E 3 1 –1

Plotting these points reveals that the +1 class is clustered near the top-left and the –1 class
occupies the lower-right region, suggesting a linear boundary.

While solving a quadratic optimization problem manually is impractical, it is feasible to illustrate the solution for a small dataset. Through geometric inference, suppose the separating hyperplane obtained (used here for illustration; the exact optimum would come from solving the quadratic program) is:

x₂ = x₁ – 1.5

We can rewrite this in the standard form w₁x₁ + w₂x₂ + b = 0:

x₂ – x₁ + 1.5 = 0
Thus w = (–1, 1) and b = 1.5.

Now we verify that this hyperplane correctly classifies each point.

For point A (2,2):


w·x + b = –1(2) + 1(2) + 1.5 = 1.5 > 1 → correctly +1.

For point B (4,4):


= –4 + 4 + 1.5 = 1.5 → correctly +1.

For point C (4,0):


= –4 + 0 + 1.5 = –2.5 < –1 → correctly –1.

For point D (6,2):


= –6 + 2 + 1.5 = –2.5 < –1 → correctly –1.

For point E (3,1):

= –3 + 1 + 1.5 = –0.5
Since –0.5 < 0, E lies on the correct side of the decision boundary (the –1 region), but because –0.5 > –1 it sits inside the margin band, violating the hard-margin constraint.
Thus, E becomes a (margin-violating) support vector.
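The point-by-point verification above can be automated. This small sketch evaluates w·x + b for every point, using the w and b derived in this section:

```python
def decision_value(w, b, x):
    # Computes w·x + b for a 2-D point x
    return w[0] * x[0] + w[1] * x[1] + b

w, b = (-1.0, 1.0), 1.5
points = {"A": (2, 2), "B": (4, 4), "C": (4, 0), "D": (6, 2), "E": (3, 1)}
for name, x in points.items():
    val = decision_value(w, b, x)
    side = "+1" if val > 0 else "-1"
    print(f"{name}: w·x + b = {val:+.1f} -> class {side}")
```

Running this reproduces the values 1.5, 1.5, –2.5, –2.5, and –0.5 computed by hand above.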

3.11.4 Identifying Support Vectors in the Example
Support vectors are the points that lie closest to the separating hyperplane and therefore determine the margin. In the strict hard-margin case they satisfy:

yᵢ (w·xᵢ + b) = 1

Check the product yᵢ(w·xᵢ + b) for each point:

For A: w·x + b = 1.5, y = +1 → product = 1.5
For B: = 1.5, y = +1 → product = 1.5
For C: = –2.5, y = –1 → product = 2.5
For D: = –2.5, y = –1 → product = 2.5
For E: = –0.5, y = –1 → product = 0.5 < 1 → margin violation

E lies closest to the boundary among all negative examples and violates the margin constraint, so (in the soft-margin sense) it is a support vector. On the positive side, A and B are the closest points, both with product 1.5, so they constrain the margin from above; C and D, with product 2.5, sit well clear of the margin. (Because the hyperplane here was chosen for illustration rather than solved exactly, A and B do not land precisely at product 1.)

Thus support vectors = A, B, and E.

The margin is determined by these vectors.

3.11.5 Soft-Margin SVM


Real-world datasets rarely achieve perfect linear separability. Points may overlap, and noise
or measurement errors make clean separation impossible. To handle such cases, SVM
introduces slack variables ξᵢ that allow some points to violate margin constraints.

The modified constraint becomes:

yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ
ξᵢ ≥ 0

and the objective function transforms into:

Minimize ½||w||² + C Σ ξᵢ

where C controls the trade-off between maximizing margin and minimizing misclassification.
If C is large, the model penalizes margin violations heavily, producing a hard boundary.
If C is small, the model tolerates misclassifications in favour of a larger margin.

This mechanism allows SVMs to operate robustly in noisy environments.
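The soft-margin objective can be evaluated directly. The sketch below computes ½||w||² + C Σ ξᵢ for the five-point dataset of Section 3.11.3 with the hyperplane w = (–1, 1), b = 1.5, where each slack is ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)). Only point E incurs slack (ξ = 0.5):

```python
def soft_margin_objective(w, b, data, C):
    # data: list of ((x1, x2), label) pairs with label in {+1, -1}
    half_norm_sq = 0.5 * sum(wi * wi for wi in w)
    # Hinge slack: zero for points outside the margin, positive for violators
    slack = sum(max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b))
                for x, y in data)
    return half_norm_sq + C * slack

data = [((2, 2), +1), ((4, 4), +1), ((4, 0), -1), ((6, 2), -1), ((3, 1), -1)]
w, b = (-1.0, 1.0), 1.5
print(soft_margin_objective(w, b, data, C=1))   # 1.5: only E contributes slack 0.5
print(soft_margin_objective(w, b, data, C=10))  # 6.0: larger C punishes the violation more
```

Comparing the two outputs shows the trade-off concretely: the same margin violation costs 0.5 when C = 1 but 5.0 when C = 10.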

3.11.6 SVM with Nonlinear Boundaries and the Kernel Trick
Many datasets cannot be separated by linear hyperplanes. For example, consider a dataset
where positive points lie inside a circle and negative points lie outside the circle. No straight
line can separate them.

SVM resolves this through the kernel trick. The core idea is to transform the input data into a
higher-dimensional space where linear separation becomes possible. Instead of performing an
explicit transformation, which could be computationally expensive, kernels compute inner
products in the higher-dimensional space directly.

Common kernels include:

Polynomial kernel: K(x, x') = (x·x' + 1)^d


Gaussian RBF kernel: K(x, x') = exp(−γ||x − x'||²)

To illustrate the idea concretely, consider a dataset shaped like two concentric rings. The
original space is not linearly separable. But mapping each point from (x, y) to (x² + y², x, y)
lifts the circular pattern into three dimensions where separation is easier.

Using RBF kernel, SVM does this implicitly. The algorithm never needs to compute the
transformation explicitly; it computes similarities using the kernel, and the optimization
operates in the transformed space.

This mechanism is one reason SVMs remain competitive with deep learning on small to
medium datasets.

3.11.7 Complete Worked Example of Nonlinear SVM Using RBF Kernel (Conceptual)
Suppose we have four points:

Point x₁ x₂ Class
A 0 1 +1
B 0 –1 +1
C 2 0 –1
D –2 0 –1

In the original space, no straight line separates the classes cleanly because positive points lie
above and below the origin, and negative points lie left and right.

Using RBF kernel:

K(x, x') = exp(−γ||x − x'||²)

For γ = 0.5:

Compute similarity between A(0,1) and B(0,–1):

Distance = √[(0−0)² + (1+1)²] = 2


Thus K(A,B) = exp(−0.5×4) = exp(−2) ≈ 0.135

Between A(0,1) and C(2,0):


Distance = √[(2)² + (1)²] = √5
K(A,C) = exp(−0.5×5) ≈ exp(−2.5) ≈ 0.082

Similarly:

K(A,D) = exp(−0.5×5) ≈ 0.082

Thus A and B are similar to each other but dissimilar to C and D in the kernel space. In this
space, a separating hyperplane becomes possible.

While full optimization is lengthy, this example illustrates how kernels reshape similarity
patterns.
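The kernel values computed above are reproduced by this short sketch (γ = 0.5, points from the table):

```python
import math

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

A, B, C, D = (0, 1), (0, -1), (2, 0), (-2, 0)
print(round(rbf_kernel(A, B, 0.5), 3))  # 0.135: A and B are similar
print(round(rbf_kernel(A, C, 0.5), 3))  # 0.082: A and C are dissimilar
print(round(rbf_kernel(A, D, 0.5), 3))  # 0.082: A and D are dissimilar
```

In the kernel-induced space, within-class similarities (A–B) exceed between-class similarities (A–C, A–D), which is what makes a separating hyperplane possible there.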

3.11.8 Advantages and Limitations of SVM


SVMs excel in high-dimensional spaces, remain effective even when the number of features
exceeds the number of samples, and produce decision boundaries with strong generalization
properties. They work exceptionally well for text classification, bioinformatics, handwriting
recognition, and small-sample problems.

However, SVMs become slow for very large datasets, struggle with overlapping classes
unless kernels are tuned carefully, and lack easy interpretability compared with simple
models like decision trees.

3.12 Overfitting, Underfitting, and Generalization in Machine Learning
One of the central challenges in machine learning is ensuring that a model not only fits the
training data well but also performs reliably on unseen data. This ability to extend learned
patterns beyond the examples encountered during training is called generalization. A model
that generalizes well is one that has captured the true structure of the underlying data-
generating process rather than memorizing specific examples. The degree to which a model
generalizes depends on how it balances two competing forces: underfitting and overfitting.
Understanding these two phenomena, along with the bias–variance trade-off that connects
them, forms the theoretical backbone of modern machine learning. Without this
understanding, one cannot properly evaluate a model or diagnose failure.

At its core, a machine learning model is exposed to three different datasets across its
lifecycle: a training set, used to learn internal parameters; a validation set, used to tune
hyperparameters; and a test set, used to evaluate final performance. The behaviour of the
model across these datasets reveals whether it is underfitting, overfitting, or achieving good
generalization. A complete understanding requires careful attention not only to performance
numbers but also to the structure of the data, the complexity of the model, and the choice of
regularization techniques used to avoid overfitting.

In the following sections, we will explore these ideas in depth, including detailed numerical
examples that illustrate exactly how underfitting, overfitting, and proper model capacity
manifest in practice.

3.12.1 Underfitting: When the Model is Too Simple
Underfitting occurs when a model is unable to capture the underlying structure of the data.
The model is too simple, too rigid, or too constrained to approximate the true relationship
between features and target values. This usually results in high error on both training and test
data. Underfitting is analogous to a student who uses an overly simplistic strategy to solve a
complex math problem, leading to consistent mistakes.

To see this concretely, consider a dataset relating hours studied (X) to marks obtained (Y).
Suppose the true relationship resembles a curved upward trend. If we attempt to fit a linear
regression model to this dataset, the model will force a straight line through a curved pattern.
The residuals (differences between actual and predicted marks) will be large across the
dataset. Both training and testing error remain high because the model lacks the expressive
power needed to accommodate the curvature inherent in the data.

Let us illustrate underfitting numerically. Suppose we have the following training data:

Hours Studied (X) Marks Obtained (Y)


2 30
3 40
4 55
5 70
6 85

This dataset follows a rising trend in which marks increase faster at higher hours of study. Attempting to fit a straight line leaves large residuals, because no single line can track both the shallow early growth and the steeper later growth.

If the fitted model is:

Ŷ = 10X – 5

Then for X = 6 hours, the model predicts 55, whereas the true value is 85. Similarly, for X =
5 hours, the model predicts 45, far below the actual 70.

The training error is high, indicating underfitting. If we test this model on new data, the test
error will also be high because the model simply cannot accommodate the inherent curvature.

Thus, underfitting arises from insufficient model complexity, overly strong regularization,
shallow decision trees, too few features, or inappropriate model choice.
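The size of the underfit can be quantified. The sketch below evaluates the straight-line model Ŷ = 10X – 5 on the training data and computes its mean squared error:

```python
X = [2, 3, 4, 5, 6]
Y = [30, 40, 55, 70, 85]

def linear_model(x):
    # The underfit straight line from the text
    return 10 * x - 5

predictions = [linear_model(x) for x in X]  # [15, 25, 35, 45, 55]
mse = sum((y - p) ** 2 for y, p in zip(Y, predictions)) / len(X)
print(mse)  # 475.0: large training error -> underfitting
```

A training MSE of 475 on marks that range from 30 to 85 confirms that the model is far too rigid for this data.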

3.12.2 Overfitting: When the Model Memorizes Instead of Learning
Overfitting occurs when a model fits the noise, fluctuations, and random variations in the
training data rather than the true pattern. In such cases, the model performs exceptionally well
on training data but poorly on unseen data. Overfitting is analogous to a student who
memorizes every solved example in a textbook but fails when faced with novel questions.

To illustrate overfitting, consider again the earlier dataset. Suppose that instead of fitting a straight line, we fit a high-degree polynomial such as:

Ŷ = a₀ + a₁X + a₂X² + a₃X³ + a₄X⁴ + a₅X⁵

Given five training points, a polynomial of degree 4 (or higher) can pass exactly through every single point, achieving zero training error. But such a curve oscillates between points and behaves unpredictably outside the training range.

If we test this model on a new student who studied 7 hours and scored 90, the polynomial
might output something absurd like 140 marks or even negative marks, depending on the
oscillations. The test error becomes significantly higher than the training error—a classic sign
of overfitting.

Overfitting is particularly severe in models with high capacity, such as deep neural networks,
decision trees with many levels, or polynomial regressions of high degree. Unless constrained
properly, these models memorize rather than learn.
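To see this concretely, the sketch below constructs the unique degree-4 polynomial that passes exactly through the five training points (zero training error) via Lagrange interpolation, then extrapolates to X = 7. For this particular interpolant the prediction is about 105 marks where the true score is 90, already a sizable error; noisier data or higher-degree fits can stray far further:

```python
def lagrange_predict(xs, ys, x):
    # Evaluate the unique degree-(n-1) polynomial through the points at x
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

X = [2, 3, 4, 5, 6]
Y = [30, 40, 55, 70, 85]

# Zero training error: the curve passes through every training point...
print(round(lagrange_predict(X, Y, 6), 6))  # 85.0
# ...but extrapolation is poor: ~105 predicted vs the true score of 90
print(round(lagrange_predict(X, Y, 7), 6))  # 105.0
```

Training error says nothing about this failure; only evaluation on unseen inputs reveals it.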

3.12.3 Generalization: The Central Goal of Machine Learning
Generalization refers to a model’s ability to perform well on unseen data after being trained
on a limited set of examples. It is the core requirement of all ML models. A model that
generalizes well has internalized the fundamental structure of the data rather than
idiosyncratic noise.

Generalization cannot be evaluated on the training set. It must be evaluated using data the
model has never encountered before—this is why test sets and validation sets exist. Poor
generalization is the direct consequence of overfitting; poor training performance is a
consequence of underfitting.

3.12.4 Train, Validation, and Test Sets


A dataset is typically divided into three parts:

1. Training set, used to adjust model parameters.


2. Validation set, used to tune hyperparameters and select the model.
3. Test set, used to measure final performance.

The validation set helps determine model complexity, regularization strength, degree of
polynomial, depth of a decision tree, learning rate of a neural network, or C parameter in
SVM. If validation error is low but test error is high, it indicates the model selection process
relied too heavily on the validation set—an effect known as validation overfitting.

A numerical example illustrates the importance of proper splitting.

Suppose we train a polynomial regression model with different degrees:

Degree Training Error Validation Error
1 32 35
2 18 20
3 5 8
4 0 18
5 0 40

The 4th and 5th degree polynomials achieve zero training error but generalize poorly, as
validation error increases drastically. The 3rd-degree polynomial likely gives the best
balance.
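Model selection from such a table amounts to choosing the degree with the lowest validation error, never the lowest training error. A sketch with the numbers above:

```python
results = {
    # degree: (training_error, validation_error)
    1: (32, 35),
    2: (18, 20),
    3: (5, 8),
    4: (0, 18),
    5: (0, 40),
}

# Select by validation error; training error would misleadingly pick degree 4 or 5
best_degree = min(results, key=lambda d: results[d][1])
print(best_degree)  # 3
```

Selecting by training error would tie degrees 4 and 5 at zero, exactly the overfit models; the validation column breaks that illusion.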

3.12.5 The Bias–Variance Trade-off


Bias refers to error arising from incorrect assumptions about data. A high-bias model is
overly rigid and underfits.
Variance refers to error arising from excessive sensitivity to small fluctuations in training
data. A high-variance model overfits.

A model with high bias is like fitting a straight line to inherently nonlinear data.
A model with high variance is like fitting a very wiggly curve that reacts to every minor noise
fluctuation.

To understand bias–variance numerically, consider predicting house prices:

Suppose the true price function is:


Price = 200 + 50×Area – 0.3×Age

A linear model can capture this well (low bias), but a simple constant model like Price = 300
for all houses introduces large bias.

Now suppose we fit a 10th-degree polynomial to only 20 training houses. It will match the
training data perfectly but vary drastically for small changes in input, giving high variance
and poor generalization.

Optimal models minimize the sum of bias and variance.

3.12.6 Regularization (L1, L2) as a Tool to Combat Overfitting
Regularization techniques deliberately constrain the model to prevent excessive complexity.
Two major regularizers are:

L2 regularization (Ridge Regression)


Adds λ Σ wᵢ² to the loss.
This discourages large weight values.

L1 regularization (Lasso Regression)
Adds λ Σ |wᵢ| to the loss.
This encourages sparsity by forcing some weights to exactly zero.

Consider a model with weights:

w = [5.1, 6.7, –3.2, 8.5]

Large weights often indicate overfitting.


Applying L2 regularization with λ = 0.1 penalizes:

0.1 × (5.1² + 6.7² + 3.2² + 8.5²)


= 0.1 × (26.01 + 44.89 + 10.24 + 72.25)
= 0.1 × 153.39
= 15.339

This penalty discourages weights from growing too large.


L1 regularization would shrink some weights entirely to zero, simplifying the model.
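Both penalty terms are one-liners to compute. The sketch below uses the weight vector from the example with λ = 0.1:

```python
w = [5.1, 6.7, -3.2, 8.5]
lam = 0.1

l2_penalty = lam * sum(wi ** 2 for wi in w)   # ridge term: lambda * sum of squares
l1_penalty = lam * sum(abs(wi) for wi in w)   # lasso term: lambda * sum of absolute values

print(round(l2_penalty, 3))  # 15.339, matching the hand calculation above
print(round(l1_penalty, 2))  # 2.35
```

Note how L2 punishes the largest weight (8.5) far more than L1 does, since squaring amplifies big values; this is why L2 shrinks weights smoothly while L1 tends to zero some out entirely.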

3.12.7 Epochs, Iterations, and Convergence


When training models, especially neural networks, the training process occurs in cycles
called epochs. An epoch represents a complete pass through the dataset.
Within each epoch, the model updates weights by evaluating gradients across batches of data.

If too few epochs are used, the model may underfit because it has not learned enough.
If too many epochs are used, the model may overfit because it memorizes the training data.

For example, training loss and validation loss might behave like this:

Epoch Training Loss Validation Loss


1 0.45 0.48
3 0.32 0.37
5 0.21 0.29
10 0.10 0.24
15 0.05 0.30
20 0.01 0.42

Validation loss begins rising after epoch 10, indicating overfitting beyond that point.
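Early stopping automates this observation: training halts (or the best checkpoint is kept) at the epoch where validation loss is lowest. A sketch with the numbers above:

```python
history = [
    # (epoch, training_loss, validation_loss)
    (1, 0.45, 0.48),
    (3, 0.32, 0.37),
    (5, 0.21, 0.29),
    (10, 0.10, 0.24),
    (15, 0.05, 0.30),
    (20, 0.01, 0.42),
]

# Keep the checkpoint with the lowest validation loss
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 10 0.24
```

Although training loss keeps falling all the way to epoch 20, the selection rule correctly stops at epoch 10, before the model starts memorizing.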
