Machine Learning
UNIT-1
1. What is learning and hence define Machine Learning Algorithm?
Learning is the process of acquiring new understanding, knowledge, behaviours, skills, values,
attitudes, and preferences. It can be seen as a relatively permanent change in behaviour or knowledge
that results from experience, practice, or study over time
Definition of Machine Learning Algorithm
A Machine Learning Algorithm is a set of mathematical instructions or processes that enables
computers and artificial intelligence (AI) systems to learn from data, recognize patterns, and make
decisions or predictions based on input data—without being explicitly programmed for every
possible scenario.
Key Characteristics
Uses data and experience: These algorithms analyze and learn from existing data to improve
their performance over time.
Predictive capability: They are used to predict outputs (such as classifications or numerical
values) based on input data.
Adaptability: As more data is provided, machine learning algorithms can improve their accuracy
and decision-making ability.
Types of Machine Learning Algorithms
Supervised Learning: Learns from labeled data (input-output pairs) to make predictions.
Unsupervised Learning: Identifies patterns or groupings in unlabeled data.
Reinforcement Learning: Learns by interacting with an environment, using rewards to guide actions.
2. Define well-posed problem and hence model a real time task of your choice as a well-posed problem.
Well-Posed Problem:
A well-posed problem, as defined by mathematician Jacques Hadamard, is a problem that
satisfies the following three conditions:
1. Existence: There must be a solution to the problem.
2. Uniqueness: The solution must be unique.
3. Stability: The solution’s behavior changes continuously with initial conditions (small changes in
input do not cause large changes in the solution).
If a problem does not meet any of these three criteria, it is considered ill-posed.
Example: House Price Prediction
Let’s model the task of predicting the price of a house given its features as a well-posed problem.
1. Problem Statement
Given property attributes (such as size, location, number of bedrooms, age, etc.), predict the
market price of the house.
2. Formulating as a Well-Posed Problem
Criterion: Application to Task
Existence: For any valid set of input features, a price can be predicted using a model (e.g., linear regression).
Uniqueness: The model provides exactly one predicted price for a specific set of inputs.
Stability: Small changes in input features (e.g., size or number of bedrooms) result in small, predictable changes in the price prediction.
Mathematical Formulation
Let, x = vector of input features (size, age, etc.)
f(x) = function/model predicting house price
Goal: Find f(x) such that for every valid x,
A price y=f(x) exists (Existence)
There’s only one y for a given x (Uniqueness)
If x changes slightly, y changes smoothly (Stability)
Example (Attribute: Example Value)
Size (sq ft): 2,000
Location (Zone): Urban
Number of Bedrooms: 3
Age (years): 10
Predicted Price ($): 300,000
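As a sketch of how such a well-posed predictor might be fitted in practice, the following uses scikit-learn's LinearRegression on a tiny made-up dataset; the feature values, prices, and feature set are purely illustrative.

# A minimal sketch of the house-price task above; data and prices are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: [size_sqft, bedrooms, age_years]
X_train = np.array([
    [1500, 3, 20],
    [2000, 3, 10],
    [2500, 4, 5],
    [1200, 2, 30],
])
y_train = np.array([210_000, 300_000, 390_000, 160_000])  # prices in $

model = LinearRegression().fit(X_train, y_train)

# Existence and uniqueness: the fitted model returns exactly one price for a given input.
x_new = np.array([[2000, 3, 10]])
print(model.predict(x_new))

# Stability: a small change in the input produces a small change in the prediction.
x_perturbed = np.array([[2010, 3, 10]])
print(model.predict(x_perturbed))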
3. Applications of Machine Learning in Diverse Fields
Machine learning (ML) has revolutionized a wide variety of industries by enabling systems to learn
from data, identify patterns, and make intelligent decisions. Here is a list and explanation of major
applications of machine learning across different fields:
Applications of Machine Learning in Diverse Fields
1. Healthcare
Disease Diagnosis: ML models analyse medical images (such as X-rays, MRIs) to detect
diseases like cancer or pneumonia.
Predictive Analytics: Forecasts patient outcomes, readmission risks, and spread of diseases.
Drug Discovery: Accelerates identification of potential drug candidates and simulates
chemical interactions efficiently.
2. Finance
Fraud Detection: Identifies unusual patterns in financial transactions to catch fraudulent
activities.
Credit Scoring: Assesses creditworthiness based on multiple data sources for effective loan
approval.
Algorithmic Trading: Executes trades in stock markets at optimal times by analyzing
historical data and trends.
3. Retail & E-Commerce
Recommendation Systems: Suggest products or content tailored for individual users (e.g.,
Amazon recommendations).
Demand Forecasting: Predicts future sales, optimizing inventory and supply chains.
Dynamic Pricing: Adjusts prices in real time based on demand, competitor pricing, and
inventory.
4. Transportation
Autonomous Vehicles: Uses ML for navigation, obstacle detection, and traffic management in
self-driving cars.
Route Optimization: Reduces delivery times and increases efficiency for ridesharing and
logistics companies.
Traffic Prediction: Analyzes real-time data to manage congestion and plan optimal routes.
5. Social Media
Content Personalization: Platforms like Facebook and Instagram tailor feeds and ads based on
user preferences.
Image and Facial Recognition: Automates tagging and photo categorization.
Sentiment Analysis: Understands user opinions and tracks trends from posts and comments.
6. Natural Language Processing (NLP)
Chatbots & Virtual Assistants: Powers intelligent assistants like Siri, Alexa, and customer
support chatbots.
Language Translation: Automatically translates text and speech between languages with high
accuracy.
Spam Filtering: Detects and filters out unwanted emails and harmful content.
7. Manufacturing & Industry
Predictive Maintenance: Forecasts equipment failures, enabling scheduled repairs and
reducing downtime.
Quality Control: Inspects products for defects using sensor data and image analysis.
8. Agriculture
Crop Yield Prediction: Estimates harvest quantity to optimize planting and resource
allocation.
Disease Management: Identifies crop diseases and pest infestations from images and sensor
data.
9. Education
Personalized Learning: Adapts educational content to match student abilities and learning
pace.
Automated Grading: Uses ML to assess and grade essays or assignments efficiently.
10. Security & Surveillance
Anomaly Detection: Monitors for unusual behaviour in surveillance systems, preventing
security breaches.
Facial Recognition: Controls access to secure areas and enhances identity management.
4. Briefly explain various learning techniques in Machine Learning.
Machine learning employs several primary learning techniques to enable systems to learn from data
and make predictions or decisions. Below are the main types with brief explanations:
1. Supervised Learning
Description: The model is trained on labelled data, where each input has a corresponding
known output.
Goal: Learn the mapping between input and output to predict labels for new, unseen data.
Common Applications: Classification (e.g., spam detection, image recognition) and
regression (e.g., predicting house prices).
Examples of Algorithms: Linear regression, support vector machines, decision trees, random
forests.
2. Unsupervised Learning
Description: The model analyses data without labelled outputs and seeks to find patterns or
groupings.
Goal: Discover underlying structure in data.
Common Applications: Clustering (e.g., customer segmentation), dimensionality reduction
(e.g., data visualization), anomaly detection.
Examples of Algorithms: K-means clustering, hierarchical clustering, principal component
analysis (PCA), autoencoders.
3. Semi-Supervised Learning
Description: A combination of a small amount of labelled data and a large amount of
unlabelled data.
Goal: Improve learning accuracy when labelling data is expensive or time-consuming.
Common Applications: Image classification with partially labelled datasets, web content
categorization.
Benefit: Leverages the abundance of unlabelled data while using some labelled data for
guidance.
4. Reinforcement Learning
Description: Models, called agents, learn by interacting with an environment and receiving
feedback in the form of rewards or penalties.
Goal: Learn optimal actions or strategies to maximize cumulative rewards.
Common Applications: Robotics, game playing (e.g., AlphaGo), self-driving cars.
Examples of Algorithms: Q-learning, SARSA, deep reinforcement learning.
5. Additional and Specialized Techniques
Feature Learning: Algorithms discover useful representations of the input data automatically,
either supervised (e.g., neural networks) or unsupervised (e.g., autoencoders).
Transfer Learning & Fine-Tuning: Adaptation of a pre-trained model to a new but related task
for improved efficiency, especially with limited labelled data.
Multitask Learning: Training a model to perform multiple tasks simultaneously to boost
generalization and robust performance
UNIT-2
1. Explain Occam's Razor Principle with simple example.
Occam’s razor is one of the simplest examples of inductive bias. It involves a preference for a simpler
hypothesis that best fits the data. Though the razor can be used to eliminate other hypotheses, relevant
justification may be needed to do so.
Simple Example
Imagine you wake up and see that the grass in your yard is wet.
Possible Explanations:
A: It rained last night.
B: A neighbor came over at midnight and watered your lawn while you were sleeping.
Both explanations account for the wet grass. However, explanation A (rain) is simpler and relies on
fewer assumptions than explanation B (neighbor at night). According to Occam’s Razor, you should
prefer the simpler answer unless you have evidence to suggest otherwise.
Application in Machine Learning
Suppose you want to predict whether an email is spam. You have two models:
Model 1: Uses 3 straightforward rules based on common spam indicators.
Model 2: Uses 20 complex rules, including subtle patterns and word combinations.
If both models work equally well on actual email data, Occam’s Razor suggests choosing Model 1,
because it's simpler and less likely to "overfit" or latch onto random noise in the data.
Occam’s Razor helps us choose solutions that are simple, efficient, and more likely to generalize well
to new situations, unless further evidence requires a more complex explanation.
2. What is bias? State significance of bias and variance with respect to ML problems.
Bias in machine learning refers to the error introduced by approximating a real-world problem (which
may be complex) by a much simpler model. In essence, bias measures how closely a model’s predictions
match actual outcomes when simplifying assumptions are made.
High bias means the model makes strong assumptions about the data, often leading it to miss
important trends (underfitting).
Low bias means the model more closely follows the data, capturing its underlying relationships.
Significance of Bias and Variance in ML Problems
Bias and variance are two critical sources of error that influence a machine learning model's
performance:
Bias
Represents errors from erroneous or overly simplistic assumptions in the learning algorithm.
High bias results in models that underfit the data, missing relevant patterns.
Common in simple models (e.g., linear regression on nonlinear data).
Variance
Measures how much model predictions fluctuate for different training datasets.
High variance indicates the model is too sensitive to small fluctuations in the training data, often
capturing noise as if it were signal (overfitting).
Common in overly complex models (e.g., deep neural networks with insufficient data).
Importance in Machine Learning
Balance is Crucial: The goal is to achieve the right trade-off between bias and variance.
Too much bias (high bias/low variance): Model is too simple, underfits, and performs
poorly on both training and test data.
Too much variance (low bias/high variance): Model overfits the training data, capturing
noise, and performs poorly on new data.
Generalization: Good machine learning models need low overall error, achieved by balancing bias and
variance to generalize well to unseen data.
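A small illustration of the trade-off, fitting polynomials of increasing degree to noisy synthetic data; the degrees and noise level are arbitrary choices.

# Low-degree fits underfit (high bias); high-degree fits chase noise (high variance).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

for degree in (1, 3, 9):               # underfit, reasonable, overfit
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    print(f"degree={degree}  training MSE={train_mse:.3f}")
# Degree 1 keeps a large training error (bias); degree 9 drives training error down
# but would fluctuate strongly on new samples (variance).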
3. Define classification and regression analysis and hence list metrics for classification and regression.
Classification: Classification is a supervised machine learning task where the goal is to assign input data
to one of several predefined categories or classes.
Example: Email spam detection classifies emails as "spam" or "not spam."
Output: Discrete class labels (e.g., cat, dog, or bird).
Common Metrics for Classification
Accuracy: Proportion of correct predictions among all predictions.
Precision: Ratio of true positives to all predicted positives — measures correctness among
positive predictions.
Recall (Sensitivity): Ratio of true positives to all actual positives — measures how many actual
positives were captured.
F1-Score: Harmonic mean of precision and recall; balances both concerns.
ROC-AUC: Area Under the Receiver Operating Characteristic curve — measures the ability of
the model to distinguish between classes.
Regression: Regression is a supervised machine learning technique used to predict a continuous
numerical value based on input data.
Example: Predicting the price of a house given attributes like size, location, and age.
Output: Continuous numerical value (e.g., $307,000.50).
Common Metrics for Regression
Mean Squared Error (MSE): Average squared difference between predicted and actual values.
Root Mean Squared Error (RMSE): Square root of MSE, providing error in original units.
Mean Absolute Error (MAE): Average absolute difference between predicted and true values.
R² Score (Coefficient of Determination): Proportion of variance in the dependent variable
explained by the model; values closer to 1 indicate better fit.
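A brief sketch of computing these metrics with scikit-learn, using tiny made-up label and prediction arrays purely for illustration.

# Classification and regression metrics on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, mean_squared_error, mean_absolute_error, r2_score)

# Classification
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]          # predicted probabilities for ROC-AUC
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_score))

# Regression
y_true_r = [300_000, 250_000, 400_000]
y_pred_r = [310_000, 240_000, 395_000]
mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true_r, y_pred_r))
print("R2  :", r2_score(y_true_r, y_pred_r))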
4. Define overfitting and hence briefly explain how regularization influences overfitting.
Overfitting: Definition and the Role of Regularization
What Is Overfitting?
Overfitting occurs in machine learning when a model learns not only the underlying patterns in the
training data but also the noise and random fluctuations.
This means:
The model performs extremely well on the training data but poorly on new, unseen data.
It captures irrelevant details, mistaking them for true signal, resulting in poor generalization.
Key characteristics of overfitting:
High accuracy on training data, low accuracy on test data.
The model is too complex for the amount of data or the noise level present.
How Regularization Influences Overfitting:
Regularization is a set of techniques designed to prevent overfitting by discouraging overly complex
models. It works by adding a penalty to the model's loss function, typically targeting large weights or
complex model structures.
Main Types of Regularization
L1 Regularization (Lasso): Adds a penalty equal to the absolute value of the magnitude of
coefficients. Can result in some weights being zeroed out, effectively performing feature
selection.
L2 Regularization (Ridge): Adds a penalty equal to the square of the magnitude of coefficients.
Tends to shrink coefficients evenly but rarely makes them exactly zero.
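A minimal sketch of L1 and L2 regularization with scikit-learn; the synthetic data and alpha values are illustrative only.

# Only feature 0 matters in the synthetic data; compare how the penalties treat the coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives some coefficients exactly to zero

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))   # note the exact zeros (feature selection)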
5. Discuss various errors in Machine Learning?
Various types of errors occur in machine learning that impact model performance and generalization.
These errors broadly fall into categories related to model assumptions, data quality, and the inherent
limits of learning algorithms. Here's a detailed discussion of the key types of errors in machine learning:
1. Bias Error (Underfitting)
Definition: Bias refers to errors due to overly simplistic assumptions made by the model. It
measures how far off the model’s predictions are from the true values on average.
Cause: Using models that are too simple to capture the underlying data patterns, e.g., linear
models for nonlinear data.
Effect: The model misses important trends, resulting in poor accuracy on both training and
new data.
Example: Predicting a complex medical diagnosis with a simple linear regression causing
consistent errors.
Significance: High bias leads to underfitting; the model fails to generalize well.
2. Variance Error (Overfitting)
Definition: Variance reflects the model's sensitivity to fluctuations in the training data.
Cause: Models that are too complex may fit noise and minor variations in the training set
rather than the true data distribution.
Effect: Poor generalization with low training error but high error on unseen data.
Example: A very deep decision tree memorizing training examples but failing on test data.
Significance: High variance causes overfitting, limiting real-world predictive effectiveness.
3. Irreducible Error
Definition: Error that cannot be reduced by any model because of inherent variability or noise
in the data.
Cause: Factors like measurement noise, unknown variables affecting the target, or
randomness.
Effect: Sets a baseline for minimum achievable error, despite perfect modeling.
Example: Biological variability in patient responses to drugs.
Significance: Understanding irreducible error sets realistic expectations for model accuracy.
4. Classification Errors: False Positives and False Negatives
False Positive (Type I Error): Model predicts a positive class when the true class is negative.
False Negative (Type II Error): Model predicts a negative class when the true class is
positive.
Impact: False positives may cause unnecessary actions; false negatives can cause missed
opportunities or risks.
Example: Spam email classifier marking valid email as spam (false positive) or missing spam
mail (false negative).
5. Data-Related Errors
Data Imbalance: When dataset classes are unevenly represented, causing bias toward majority
class predictions.
Data Leakage: Information from outside the training set improperly influences the model,
producing overly optimistic performance.
Outliers: Extreme values that can skew model training or lead to instability.
Data Collection/Storage Errors: Incorrect, missing, or corrupted data impacts model quality.
Significance: These errors introduce noise and bias in training, often requiring careful data
preprocessing and validation.
6. Model-Specific Errors (Representation and Learner Errors)
Representation Error: The hypothesis space (model class) does not contain a function that can
represent the true relationship between inputs and outputs.
Learner Error: The algorithm fails to find the best model within the chosen hypothesis space
due to optimization issues or limitations.
Boundary Errors: Incorrect decisions near classification boundaries due to model limitations
or insufficient training data.
7. General Concept: Prediction Errors
Prediction errors arise due to all the above sources and manifest as deviations in both training
and validation performance.
UNIT-3
1. Define conditional probability and hence state and prove Bayes’ theorem.
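In outline, using the standard definitions:
Conditional probability: the probability of event A given that event B has occurred,
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0.
Bayes’ theorem (statement): P(A|B) = [P(B|A) · P(A)] / P(B).
Proof: by the definition of conditional probability, P(A ∩ B) = P(A|B) · P(B) and also
P(A ∩ B) = P(B|A) · P(A). Equating the two expressions and dividing by P(B) gives
P(A|B) = P(B|A) · P(A) / P(B). If A1, …, An form a partition of the sample space, the
denominator expands by the theorem of total probability: P(B) = ∑ P(B|Ai) · P(Ai).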
2. List various descriptive statistical measures in learning.
Descriptive statistics are essential in learning and data analysis, providing tools and
summaries to help understand and interpret data before modeling or deeper statistical inference.
Categories of Descriptive Statistical Measures
1. Measures of Central Tendency: Describe where the centre of a data set lies.
Mean: The average of all observations.
Median: The middle value when data is sorted.
Mode: The most frequently occurring value in the dataset.
2. Measures of Variability (Dispersion): Indicate how much the data spreads out from the centre.
Range: Difference between the maximum and minimum values.
Variance: The average squared deviation from the mean.
Standard Deviation: The square root of the variance, indicating spread in the units of the data.
Interquartile Range (IQR): The range within which the central 50% of data lie (between the
25th and 75th percentiles).
3. Measures of Distribution and Shape: Summarize the overall pattern or structure of data.
Frequency Distribution: Counts of how often values or ranges of values occur.
Skewness: Indicates asymmetry in the data distribution (positive, negative, or zero for
symmetrical data).
Kurtosis: Measures "tailedness", or how concentrated data are at the extremes versus the
mean.
4. Measures of Position: Specify the relative standing of data points.
Percentiles: The value below which a given percentage of data falls (e.g., the 90th percentile).
Quartiles: Divide the data into four equal parts.
Minimum and Maximum: The smallest and largest values in the dataset.
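A short sketch computing these measures on a toy sample with NumPy and SciPy; the data values are arbitrary.

# Descriptive statistics of a small sample.
import numpy as np
from scipy import stats

data = np.array([12, 15, 15, 18, 21, 24, 24, 24, 30, 45])

vals, counts = np.unique(data, return_counts=True)
print("mean    :", np.mean(data))
print("median  :", np.median(data))
print("mode    :", vals[np.argmax(counts)])
print("range   :", np.max(data) - np.min(data))
print("variance:", np.var(data, ddof=1))
print("std dev :", np.std(data, ddof=1))
q1, q3 = np.percentile(data, [25, 75])
print("IQR     :", q3 - q1)
print("skewness:", stats.skew(data))
print("kurtosis:", stats.kurtosis(data))
print("90th pct:", np.percentile(data, 90))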
3. Write the K-Nearest Neighbor (KNN) Classifier algorithm and hence explain which statistical
measure of central tendency is used in the KNN algorithm.
The K-Nearest Neighbor (KNN) algorithm is a simple, intuitive, and non-parametric
supervised machine learning method used for classification (and regression). It classifies data points
based on the classes of their nearest neighbors in the feature space.
KNN Classifier Algorithm Steps:
Choose a value for K: Select the number of nearest neighbors (K) to consider for making a
classification decision.
Calculate Distances: For the new (unlabeled) instance, compute its distance (commonly
Euclidean distance) to all data points in the training set.
Find the K Nearest Neighbours: Identify the K data points in the training set that are closest
to the new instance.
Majority Vote: The new instance is assigned to the class that is most frequent (the mode)
among the K nearest neighbours.
Output Prediction: Return the predicted class for the new data point.
Statistical Measure of Central Tendency Used in KNN
Mode is used in Classification:
In KNN classification, the mode (the most frequently occurring class label among the K
nearest neighbors) is used as the statistical measure of central tendency. The new data point is
assigned the class that appears most often among its nearest neighbors.
Why Not Mean or Median?
For classification tasks, the output is a category (not a numeric value), so mean and median
are not meaningful. The mode allows the algorithm to make a categorical decision based on
neighbor majority.
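A from-scratch sketch of the steps above, using Euclidean distance and the mode of the neighbours' class labels; the training points are made up.

# KNN classification: distance, k nearest neighbours, majority vote (mode).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)    # step 2: distances to all points
    nearest = np.argsort(distances)[:k]                    # step 3: indices of k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # step 4: mode of their labels

X_train = np.array([[1.0, 1.1], [1.2, 0.9], [4.0, 4.2], [4.1, 3.9], [0.8, 1.0]])
y_train = np.array(["A", "A", "B", "B", "A"])
print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=3))   # expected: "A"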
4. List various types of regression and hence explain influence of Least square Error criterion on
Linear Regression.
Types of Regression in Machine Learning:
Regression techniques are fundamental in machine learning for modelling and predicting continuous
outcomes. Here are the main types:
1. Linear Regression: Models the linear relationship between a dependent variable and one or more
independent variables. Widely used for its simplicity and interpretability.
2. Multiple Linear Regression: An extension of linear regression involving multiple independent
variables predicting a single dependent variable.
3. Polynomial Regression: Used when the relationship between variables is nonlinear but can be
modelled as an nth-degree polynomial.
4. Ridge Regression: Adds a penalty term to the loss function to prevent overfitting when
multicollinearity exists among predictors.
5. Lasso Regression: Like ridge regression but encourages sparsity by shrinking some coefficients
entirely to zero, useful for feature selection.
6. Elastic Net Regression: A hybrid of ridge and lasso regression, combining their penalties for
balanced regularization.
7. Logistic Regression: Used for classification tasks but technically a regression method due to its
mathematical formulation.
8. Support Vector Regression (SVR): Adaptation of Support Vector Machines for regression tasks,
effective for capturing complex relationships.
9. Decision Tree Regression: Utilizes tree-like models for predicting continuous values, handling
nonlinearities and interactions well.
10. Random Forest Regression: An ensemble of decision trees aggregated for more stable and
accurate predictions.
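On the influence of the least-squares error criterion: linear regression chooses its coefficients to minimize the sum of squared residuals ∑(yi − ŷi)². This gives a unique closed-form solution (via the normal equations), weights large errors heavily, and therefore makes the fit sensitive to outliers. A minimal NumPy sketch on synthetic data; the true slope, intercept, and noise level are illustrative choices.

# Least-squares fit via the normal equations: beta = (X^T X)^(-1) X^T y.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 2.5 * x + 4.0 + rng.normal(scale=1.0, size=40)    # true line plus noise

X = np.column_stack([np.ones_like(x), x])             # add intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)              # minimizes the sum of squared residuals
residuals = y - X @ beta
print("intercept, slope      :", np.round(beta, 2))
print("sum of squared errors :", round(float(residuals @ residuals), 2))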
5. Briefly explain various Discriminant and Regression functions.
1. Discriminant Functions (Assigning Class Labels):
Discriminant functions are used to assign new observations to predefined classes, primarily in
classification problems. They evaluate which class a sample most likely belongs to, usually by
maximizing the probability or minimizing classification risk.
Common Discriminant Functions
Bayes (Optimal) Discriminant Function
Assigns a sample to the class with the highest posterior probability P(ωi∣x).
Minimizes expected classification error if true data distributions are known.
Linear Discriminant Function (LDA/Fisher)
Assumes classes share a common covariance matrix and are normally distributed.
Decision boundary is a straight line (or hyperplane in higher dimensions).
Function form: δi(x) = wi⊤x + w0i.
Quadratic Discriminant Function (QDA)
Allows each class to have its own covariance matrix.
Decision boundary is quadratic (curve, ellipse, parabola).
More flexible than LDA, requires more data to estimate parameters.
Regularized/Flexible Discriminant Functions (RDA, Kernel Methods)
Blend properties of LDA and QDA by regularizing covariance estimates or
mapping features to higher-dimensional spaces.
2. Regression Functions (Predicting Continuous Outcomes)
Regression functions model the relationship between input variables and a continuous target variable.
They are foundational for predicting numeric values.
Types of Regression Functions
Linear Regression
Models a linear relationship: y = β0 + β1x1 + ⋯ + βpxp.
Parameters are estimated by minimizing the sum of squared errors (least squares).
Polynomial Regression
Extends linear regression by incorporating polynomial terms, capturing
curves: y = β0 + β1x + β2x² + ⋯.
Ridge and Lasso Regression
Add regularization to linear regression to prevent overfitting and manage predictor
importance.
Ridge uses an L2 penalty (squared coefficients); keeps all predictors but shrinks
them.
Lasso uses an L1 penalty (sum of absolute coefficients); can shrink some
coefficients to zero for feature selection.
ElasticNet Regression
Combines both L1 (Lasso) and L2 (Ridge) penalties, balancing sparsity and
shrinkage.
Logistic Regression
Technically a classification model, it predicts the probability of binary outcomes
using the logistic function but is commonly grouped with regression functions due
to its mathematical form.
6. What is Fisher's Linear Discriminant Analysis (LDA) and explain how is it used for
classification?
Fisher's Linear Discriminant Analysis (LDA) is a statistical and machine learning technique
designed for classification and dimensionality reduction. LDA seeks a projection—a linear
combination of input features—that maximizes the separation between two or more classes while
minimizing the spread (variance) within each class.
Key objectives of Fisher's LDA:
Maximize the distance between the class means (between-class variance)
Minimize the variance within each class (within-class variance)
This approach yields an axis or a set of axes onto which data points are projected for optimal class
separation.
How LDA Works for Classification
Step-by-Step Process
1. Compute Class Statistics:
Calculate the mean vector for each class.
Compute the within-class scatter matrix (measures variance within each group).
Compute the between-class scatter matrix (measures variance between different
groups).
2. Determine the Optimal Projection:
Find the direction (vector w) that maximizes the ratio of between-class to within-class
scatter: J(w) = (w⊤ SB w) / (w⊤ SW w).
For two classes, w is proportional to the inverse of the within-class scatter matrix
multiplied by the difference between class means: w ∝ SW⁻¹ (m1 − m2).
3. Project the Data:
Each data sample x is projected onto the direction w:
y = w⊤x
The projected values give a new representation, typically in a lower-dimensional space.
4. Classification:
Set a threshold (or, for multiple classes, use several projection axes) for assigning class
labels.
New samples are projected and classified based on where they fall relative to the
threshold(s).
Why Is LDA Effective for Classification?
Reduces Dimensionality: Simplifies data, making further classification steps faster and less
prone to overfitting.
Maximizes Separability: Ensures that classes are as far apart as possible in the projected
space while keeping the data within each class tightly grouped.
Robust and Interpretable: Particularly useful when class distributions are approximately
Gaussian and share similar covariances.
Example: Binary Classification
Suppose you have two classes (e.g., approved and rejected loan applications) described by
features such as income and credit score:
LDA computes a line in the feature space onto which all data are projected.
A cut-off point is set along this line. Applications projected on one side are labelled
"approved," and those on the other side as "rejected".
UNIT-4
1. What is the significance of the SVM algorithm? Explain.
1. Effective in High-Dimensional Spaces
SVM is highly effective when the number of features is large (e.g., in text classification, gene
expression data).
It can handle thousands of features and still produce reliable results.
2. Robust to Overfitting (especially with proper regularization)
SVM focuses on finding the maximum margin hyperplane, which reduces the chance of
overfitting compared to algorithms that try to minimize error on training data directly.
3. Works Well with Clear Margin of Separation
SVM is particularly powerful when there is a clear margin of separation between classes.
It finds the optimal separating hyperplane that maximizes this margin.
4. Uses Kernel Trick for Non-Linearly Separable Data
SVM can efficiently perform non-linear classification using the kernel trick, which transforms
the input space into a higher-dimensional feature space.
Common kernels: Linear, Polynomial, RBF (Gaussian), and Sigmoid.
5. Versatility
SVM can be adapted to both binary and multi-class classification problems.
It also supports regression tasks through Support Vector Regression (SVR).
6. Good Generalization Ability
Because it focuses on the points that are most difficult to classify (support vectors), it tends to
generalize well to unseen data.
Explanation of How SVM Works
1. Separation: SVM tries to find the best possible dividing line (hyperplane) that separates classes in
your dataset.
2. Optimization: It selects the boundary that maximizes the margin—the gap between classes—which
helps with generalization.
3. Kernel Application: If the classes are not linearly separable, SVM transforms the data into a higher-
dimensional space using kernels to find a suitable hyperplane.
4. Decision Making: Only the closest data points to the margin (support vectors) influence the final
boundary, improving computational efficiency and robustness.
Applications of SVM
Text classification (spam detection, sentiment analysis)
Image recognition and object detection
Bioinformatics (e.g., cancer classification using gene expression)
Face detection
Handwriting recognition
2. What is Perceptron? Explain with neat sketch
Perceptron: A perceptron is a type of neural network that performs binary classification,
mapping input features to an output decision and classifying data into one of two categories,
such as 0 or 1. A perceptron consists of a single layer of input nodes that are fully connected
to a layer of output nodes. It is particularly good at learning linearly separable patterns. It
utilizes a variation of artificial neurons called Threshold Logic Units (TLUs), first introduced
by Warren McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial
role in the development of more advanced neural networks and machine learning algorithms.
Types of Perceptron
1. Single-Layer Perceptron: limited to learning linearly separable patterns. It is effective for
tasks where the data can be divided into distinct categories by a straight line. While powerful in
its simplicity, it struggles with more complex problems where the relationship between inputs and
outputs is non-linear.
2. Multi-Layer Perceptron: possesses enhanced processing capabilities, consisting of two or more
layers and adept at handling more complex patterns and relationships within the data.
How a Perceptron Works:
A weight is assigned to each input node of a perceptron, indicating the importance of that input in
determining the output. The Perceptron’s output is calculated as a weighted sum of the inputs, which is
then passed through an activation function to decide whether the Perceptron will fire.
The weighted sum is computed as:
z = w1x1 + w2x2 + ⋯ + wnxn + b
The step function compares this weighted sum to a threshold. If the input is larger than the threshold
value, the output is 1; otherwise, it is 0. This is the most common activation function used in
perceptrons and is represented by the Heaviside step function:
h(z) = 1 if z ≥ threshold, 0 otherwise
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.
The output of the fully connected layer is computed as:
fW,b(X)=h(XW+b)
where X is the input, W is the weight matrix connecting the input neurons to the outputs, b is the
bias, and h is the step function.
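A from-scratch sketch of a single-TLU perceptron trained on the logical AND function; the learning rate and epoch count are arbitrary illustrative choices.

# Perceptron: weighted sum -> step function, weights updated from the prediction error.
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)         # Heaviside step activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                # logical AND

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for xi, target in zip(X, y):
        output = step(xi @ w + b)         # weighted sum through the step function
        update = eta * (target - output)  # perceptron learning rule
        w += update * xi
        b += update

print("weights:", w, "bias:", b)
print("predictions:", step(X @ w + b))    # should reproduce the AND outputs [0 0 0 1]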
3. State and explain the Widrow-Hoff learning rule.
The Widrow-Hoff learning rule, also known as the Delta rule or Least Mean Squares (LMS)
algorithm, is a supervised learning algorithm used to update the weights in artificial neural networks
—especially for linear units such as the ADALINE (Adaptive Linear Neuron). Its primary aim is to
minimize the mean squared error between the predicted output and the desired output for a given
training example.
w(t+1)=w(t)+η⋅[d(t)−y(t)]⋅x(t)
w(t): Weight vector at iteration t
η: Learning rate (a small positive constant)
d(t): Desired (target) output for the current input
y(t): Actual (predicted) output for the current input
x(t): Input vector at iteration t
Key Concepts and Working:
The Widrow-Hoff rule seeks to adjust the weights so that the neural network's output is as close as
possible to the target value for each input, effectively minimizing the overall error or loss.
Error Calculation:
The rule computes the difference between the desired and the actual output:
e(t)=d(t)−y(t)
This error constitutes the feedback used to tune the weights.
Weight Update Mechanism:
The weights are updated at each step by moving them in the direction that reduces the error,
proportional to the size of the error and the value of the input:
If the error is large, the correction is larger.
The direction of update depends on the sign of the error (positive: increase weights;
negative: decrease weights).
Learning Rate (η):
Controls the size of adjustment. Too large a value may lead to overshooting, too small a value
results in slow learning.
Steps in the Algorithm
1. Initialize the weights (often randomly).
2. For each training example:
Compute the predicted output y(t).
Calculate the error e(t) = d(t) − y(t).
Update the weights using the Widrow-Hoff rule.
3. Repeat until the error converges or reaches a satisfactory minimum.
Illustration: Single Neuron Update
Given input vector x, weight w, target output d, and learning rate η:
Prediction: y=w⋅x
Error: e=d−y
Weight update: wnew = wold + η⋅e⋅x
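A minimal sketch of the rule applied to a single linear (ADALINE-style) neuron on synthetic data; the target weights, noise level, and learning rate are illustrative.

# LMS / Widrow-Hoff: w <- w + eta * (d - y) * x, applied sample by sample.
import numpy as np

rng = np.random.default_rng(3)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
d = X @ true_w + rng.normal(scale=0.05, size=200)   # desired outputs

w, eta = np.zeros(2), 0.05
for x_t, d_t in zip(X, d):
    y_t = w @ x_t                # actual (linear) output
    e_t = d_t - y_t              # error = desired - actual
    w = w + eta * e_t * x_t      # Widrow-Hoff weight update

print("learned weights:", np.round(w, 2))   # should approach [2.0, -1.0]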
4. Explain how an SVM uses linear discriminant functions for binary classification.
Support Vector Machine (SVM) is a robust supervised learning algorithm primarily used for binary
classification. Central to its approach is the use of a linear discriminant function, which mathematically
defines the optimal boundary separating the two classes in the data.
Linear Discriminant Function in SVM
Mathematical Formulation
For binary classification, the SVM discriminant function is expressed as:
f(x)=w⊤x+b
w: Weight vector (normal to the hyperplane)
x: Input feature vector
b: Bias (offset) term
The classification decision rule is:
class(x) = sign(f(x)) = +1 if w⊤x + b ≥ 0, and −1 otherwise
This means SVM assigns a data point to one of two classes according to which side of the hyperplane
(w⊤x + b = 0) it falls on.
SVM's Strategy: Maximum Margin Classification
Finding the Optimal Hyperplane: SVM searches for the hyperplane that maximizes the margin—the
distance between the hyperplane and the nearest data points from each class (support vectors).
Why Maximizing Margin?
Larger margin reduces generalization error.
Ensures that new, unseen data points have a higher chance of being classified correctly.
Working Steps
1. Construct Linear Discriminant: Define the function f(x)=w⊤x+b
2. Label Assignment: Label data points as +1 or -1 based on their position relative to the hyperplane.
3. Margin Maximization: Optimize w and b so the margin, subject to correct classification, is as
large as possible.
The critical points closest to the boundary are called support vectors—they determine the
position of the hyperplane.
5. What are kernel functions, and how do they enable SVMs to handle nonlinear classification?
Kernel functions are mathematical tools that measure the similarity between pairs of data points. In the
context of Support Vector Machines (SVM), a kernel function computes the inner product of two data points
in a transformed, high-dimensional feature space—without explicitly performing that transformation. This
technique is known as the kernel trick.
Common kernel functions include:
Linear Kernel: K(x,y)=x⋅y
Polynomial Kernel: K(x,y)=(x⋅y+c)d
Radial Basis Function (RBF)/Gaussian Kernel: K(x,y)=exp(−γ∥x−y∥2)
Sigmoid Kernel: K(x,y)=tanh(α(x⋅y)+c)
Each kernel maps data into a higher-dimensional space in a unique way, suitable for different data structures
and problem domains.
How Kernel Functions Enable Nonlinear SVM Classification
The Challenge: Nonlinear Data
Standard SVMs construct a linear boundary (hyperplane) to separate classes—this works well only when
data is linearly separable. Many real-world datasets, however, are nonlinear: they can’t be divided
correctly by a straight line (in two dimensions) or a flat hyperplane (in higher dimensions). Examples
include data shaped like concentric circles or complex spirals.
The Solution: The Kernel Trick
Kernel functions solve this problem by:
1. Implicitly Mapping Data: Rather than explicitly projecting data points into a higher-dimensional
space—a process that would be computationally expensive—kernels compute the inner product of
the transformed features directly in the original space.
2. Linear Separation in New Space: In this higher-dimensional space, classes that are nonlinearly
mixed in the original space often become linearly separable. The SVM then finds the optimal linear
boundary in this new feature space.
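A brief sketch of the kernel trick in practice, comparing a linear-kernel and an RBF-kernel SVM on concentric-circle data; the dataset and hyperparameters are illustrative.

# Nonlinearly separable circles: the linear kernel fails, the RBF kernel separates them.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", round(linear_svm.score(X, y), 2))   # near chance level
print("RBF kernel accuracy   :", round(rbf_svm.score(X, y), 2))      # close to 1.0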
6. Explain neural networks and how they relate to cognitive machines.
Neural networks are computational models inspired by the structure and functioning of the human
brain. Sometimes called artificial neural networks (ANNs), these systems are central to machine learning
and artificial intelligence. They consist of many interconnected nodes—artificial neurons—grouped into
layers: an input layer, one or more hidden layers, and an output layer. Each connection has a weight, which
determines how strongly signals between neurons influence each other.
The basic operation of a neural network involves:
Input: Data enters the input layer.
Propagation: Inputs are combined and passed through activation functions in each neuron.
Output: The processed information leads to a prediction, classification, or control action at the
output layer.
Learning: The network adjusts its weights and biases using a learning algorithm (such as
backpropagation), improving performance based on experience
Cognitive Machines:
Cognitive machines refer to automated systems designed to simulate human thought processes, enabling
them to understand, reason, learn, and interact in ways that mimic human cognition. This field is also known
as cognitive computing and is a subset of artificial intelligence.
Components of cognitive machines include:
Machine Learning and Neural Networks: The primary technologies that enable machines to learn
from data.
Natural Language Processing (NLP): Allows understanding and generation of human language.
Speech and Vision: Facilitates speech recognition and interpretation of images.
Reasoning and Problem-Solving: Enables decision-making with incomplete or ambiguous
information.
Cognitive machines analyse large amounts of unstructured data, recognize patterns, and make informed
decisions much like humans do but with greater speed and consistency.
How Neural Networks Relate to Cognitive Machines:
Neural networks are a foundational element of cognitive machines:
Mimic Human Brain Function: Both aim to simulate the way the human brain processes
information—neural networks via a mathematical abstraction, and cognitive machines by integrating
such models with other AI technologies.
Enable Adaptive, Intelligent Behavior: Neural networks give cognitive machines the ability to
learn from examples, adapt to new data, and improve performance without explicit rule
programming.
Drive Core Cognitive Functions: Tasks such as speech recognition, image understanding, and
decision-making in cognitive machines rely on neural networks for their pattern recognition,
learning, and predictive capabilities.
Integration with Other Techniques: While neural networks handle perception and pattern learning,
cognitive machines often combine them with symbolic reasoning and probabilistic algorithms to
achieve broader intelligence
7. List applications of SVM
Support Vector Machines (SVMs) have a wide range of applications across various domains due to their
effectiveness in classification and regression tasks. Some key real-world applications of SVM include:
Face Detection: SVM classifies parts of images as face or non-face, creating boundaries to detect
faces accurately.
Text and Hypertext Categorization: Used to classify documents such as emails, news articles, and
web pages into categories, useful in spam detection, sentiment analysis, and information retrieval.
Image Classification: Improves search accuracy and classification results in image datasets by
handling high-dimensional data well.
Bioinformatics: Applied in protein classification, cancer detection, gene expression analysis, and
protein remote homology detection.
Handwriting Recognition: Recognizes handwritten characters for use in postal services, document
digitization, and pattern recognition.
Generalized Predictive Control (GPC): Used in controlling chaotic dynamic systems by tuning
parameters efficiently.
Financial Fraud Detection: Detects fraudulent transactions with high accuracy by analyzing user
behavior and transaction patterns.
Customer Segmentation and Marketing: Models customer behavior to segment markets and
minimize churn.
Security and Encryption Analysis: Helps in analyzing and detecting encryption patterns within
image data or secure transmissions.
Facial Expression Classification: Classifies facial emotions useful in life-care systems and
personalized filters.
UNIT-5
1. Backpropagation Algorithm
Backpropagation is one of the most important algorithms in training artificial neural networks. It provides a
systematic way to update network weights and biases to minimize the difference between the predicted
output and the actual target values.
Step-by-Step Backpropagation Algorithm
1. Initialization
Assign small random values to all weights and biases in the neural network.
Set a learning rate (η).
2. Forward Pass
Feed the input vector through the network layer by layer.
For each neuron, compute its input as:
zj=∑iwjixi+bj
Apply the activation function f(zj)to get the output of the neuron.
3. Compute Error at Output
For each output neuron, calculate the error (usually using Mean Squared Error or cross-entropy loss):
E = (1/2) ∑ (ytarget − ypredicted)²
4. Backward Pass ("Backpropagation")
a) Calculate Output Layer Gradients
For each output neuron:
δoutput = (ypredicted − ytarget) ⋅ f′(z)
where f′(z) is the derivative of the activation function.
b) Calculate Hidden Layer Gradients
For each hidden neuron:
δhidden=(∑kwkh⋅δk)⋅f′(zh)
where the sum is over all neurons 'k' in the next layer.
5. Update Weights and Biases
For each weight:
wij=wij−η⋅δj⋅xi
For each bias: bj=bj−η⋅δj
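A compact NumPy sketch of the steps above for a one-hidden-layer network with sigmoid activations, trained on XOR; the layer size, learning rate, and iteration count are illustrative choices.

# Backpropagation: forward pass, output/hidden deltas, gradient-descent weight updates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)                  # XOR targets

W1, b1 = rng.normal(scale=1.0, size=(2, 4)), np.zeros((1, 4))    # step 1: initialization
W2, b2 = rng.normal(scale=1.0, size=(4, 1)), np.zeros((1, 1))
eta = 1.0

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                                     # step 2: forward pass
    out = sigmoid(h @ W2 + b2)
    delta_out = (out - y) * out * (1 - out)                      # step 4a: output deltas
    delta_h = (delta_out @ W2.T) * h * (1 - h)                   # step 4b: hidden deltas
    W2 -= eta * h.T @ delta_out                                  # step 5: weight updates
    b2 -= eta * delta_out.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ delta_h
    b1 -= eta * delta_h.sum(axis=0, keepdims=True)

print(np.round(out.ravel(), 2))   # should move toward the XOR targets [0, 1, 1, 0]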
2. Give different types of neural network architectures and hence explain Multilayer Perceptron
Networks.
Types of Neural Network Architectures
Neural networks come in various architectures, each designed for specific types of data and tasks. Below are
some of the most used neural network architectures:
1. Feedforward Neural Networks (FNN)
The simplest type of artificial neural network.
Connections between nodes do not form cycles.
Data moves in only one direction: from input to output through hidden layers.
Used for basic pattern recognition and classification tasks.
2. Convolutional Neural Networks (CNN)
Specialize in processing grid-like data such as images.
Use convolutional layers to extract spatial features and pooling layers to reduce dimensionality.
Widely used in computer vision, image and video analysis, and object detection.
3. Recurrent Neural Networks (RNN)
Designed for sequential or time-series data.
Have feedback loops allowing information to persist from previous inputs.
Applied in language modeling, translation, and stock market prediction.
4. Long Short-Term Memory Networks (LSTM)
A type of RNN that overcomes the vanishing gradient problem.
Equipped with memory cells and gates for better handling of long-term dependencies in sequences.
5. Generative Adversarial Networks (GAN)
Consist of two neural networks (generator and discriminator) trained in opposition.
Used for data generation, such as image synthesis.
6. Autoencoders
Composed of encoder and decoder networks.
Used for unsupervised learning, dimensionality reduction, feature extraction, and denoising.
7. Deep Belief Networks (DBN) and Others
DBNs are layered networks of restricted Boltzmann machines.
Additional architectures include Radial Basis Function Networks (RBFN), Self-Organizing Maps
(SOM), Hopfield Networks, Transformer networks, etc.
Multilayer Perceptron (MLP) Networks
What Is an MLP?
A Multilayer Perceptron is one of the most fundamental and widely used feedforward neural network
architectures. It consists of three or more layers of nodes:
Input Layer: Receives the initial data.
Hidden Layer(s): One or more layers where computation happens; each neuron applies an
activation function (commonly ReLU, sigmoid, or tanh) to the weighted sum of its inputs.
Output Layer: Produces the final prediction, such as class labels or regression values.
Structure and Operation
Fully Connected: Each node in one layer connects to every node in the next layer.
Nonlinear Activations: Hidden layers use nonlinear activation functions, which allow the MLP to
model complex, non-linear relationships and interactions among features.
Forward Propagation: Data travels from the input layer through hidden layers to the output.
Backpropagation: During training, the network uses backpropagation to update weights and biases,
minimizing the loss function (e.g., mean squared error for regression, cross-entropy for
classification).
Supervised Learning: MLPs are typically used for tasks where output labels are available for
training.
Advantages and Applications
Versatile: Can handle classification, regression, and function approximation problems.
Nonlinearity: Models highly non-linear relationships due to activation functions in hidden layers.
Foundation: Serves as the backbone for more complex networks like deep neural networks and
forms the basis for many other architectures.
Typical Applications:
Image and speech recognition
Medical diagnosis
Prediction problems (e.g., stock price, weather)
Handwriting and character recognition
Multilayer Perceptrons (MLPs) are a core type of feedforward neural network, utilizing multiple layers
and nonlinear activations to solve both simple and complex supervised learning tasks. Their structure
and learning capability have made them foundational in the development of modern neural network
architectures.
3. Briefly explain Radial Basis Function (RBF) networks:
Radial Basis Function (RBF) Network: Brief Explanation
A Radial Basis Function (RBF) Network is a type of artificial neural network designed for tasks such as
classification, regression, and function approximation. It is particularly adept at modelling complex,
nonlinear relationships and can learn rapidly compared to many other neural network architectures.
Key Structure
RBF networks have a three-layer architecture:
Input Layer: Passes input data to the network without computation.
Hidden Layer: Consists of neurons using a radial basis function (most commonly a Gaussian
function) as the activation function. Each hidden neuron has a “centre” and computes its response
based on the distance between its centre and the input.
Output Layer: Produces the network’s result as a weighted sum of hidden layer outputs. This output
can be used for classification or regression tasks.
How It Works
1. Input: The network receives an input vector.
2. Hidden Layer Response: Each neuron calculates the distance between the input and its centre. The
neuron's output is highest when the input is near its centre and decreases as the distance grows.
3. Activation Function: Typically, a Gaussian function:
ϕ(∥x−c∥) = exp(−∥x−c∥² / (2σ²))
where c is the centre and σ controls the spread.
4. Output Layer: Computes a weighted sum of all hidden neurons’ outputs to produce the final
prediction.
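A minimal NumPy sketch of an RBF network: Gaussian hidden units with fixed centres and an output layer fitted by least squares; the centres, spread, and training data are illustrative.

# RBF network for 1-D function approximation (noisy sine curve).
import numpy as np

def rbf_features(X, centres, sigma):
    # Gaussian activation of each hidden unit for each input
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-dists**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)      # noisy target function

centres = np.linspace(-3, 3, 10).reshape(-1, 1)             # fixed hidden-unit centres
sigma = 0.7
Phi = rbf_features(X, centres, sigma)                       # hidden-layer outputs
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # output-layer weights

y_hat = Phi @ weights
print("training MSE:", round(float(np.mean((y - y_hat) ** 2)), 4))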
4. List various impurity measures in attribute splitting with mathematical formulas in decision tree
classification and hence list impurity measures used in ID3, C4.5 and CART for attribute splitting.
Impurity Measures in Decision Tree Classification:
In decision tree classification, impurity measures are mathematical criteria used to evaluate how well an
attribute splits the dataset into homogeneous (pure) subsets. Lower impurity indicates better separation of
classes. The most used impurity measures are:
1. Entropy
Formula:
Entropy(S) = − ∑ pi log2(pi), summed over the classes i = 1, …, n
S: dataset or subset.
pi: proportion of class i in S.
Purpose: Measures the disorder or randomness. Entropy is highest with evenly mixed classes and
zero when all elements are of the same class.
2. Gini Impurity (Gini Index)
Formula:
Gini(S) = 1 − ∑ pi², summed over the classes i = 1, …, n
pi: proportion of class i in the subset.
Purpose: Measures the probability of incorrectly classifying a randomly chosen element if it was
labelled according to the class distribution in the subset.
3. Classification Error
Formula:
Classification Error(S) = 1 − max_i(pi)
pi: proportion of class i in the subset.
Purpose: Simply represents the proportion of misclassified samples if the majority class is predicted.
4. Information Gain & Gain Ratio:
Information Gain(S, A) = Entropy(S) − ∑ (|Sv| / |S|) ⋅ Entropy(Sv), summed over the values v of attribute A.
Gain Ratio(S, A) = Information Gain(S, A) / Split Information(S, A), where Split Information penalizes attributes with many distinct values.
Impurity measures used for attribute splitting: ID3 uses information gain (based on entropy), C4.5 uses the gain ratio, and CART uses the Gini index.
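Small Python sketches of the three impurity formulas above, computed from a node's class proportions; the example class counts are arbitrary.

# Impurity of a node with 6 samples of one class and 4 of another.
import numpy as np

def entropy(p):
    p = p[p > 0]                        # avoid log2(0)
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

counts = np.array([6, 4])
p = counts / counts.sum()
print("entropy             :", round(entropy(p), 3))              # ~0.971
print("gini                :", round(gini(p), 3))                 # 0.48
print("classification error:", round(classification_error(p), 3)) # 0.4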
5. Discuss strengths and weaknesses of decision tree classification
Strengths of Decision Tree Classification
Easy to Understand and Interpret
Decision trees mimic human decision-making; their flowchart-like structure makes them
highly interpretable even to non-experts.
The rules generated are clear, and the reasoning for each decision is transparent.
Handles Both Numerical and Categorical Data
Can process a mixture of variable types without the need for data normalization or scaling.
Little Data Preparation Required
No need for feature scaling or centring. Missing values can be handled internally without
preprocessing.
Works Well for Non-Linear Relationships
Can capture complex patterns and interactions between features without explicit
mathematical formulations.
Performs Automatic Feature Selection
Attributes that best split the data are chosen during tree construction, highlighting the most
informative features.
Robust to Outliers
Tree splits only depend on the majority of data in each node, making them less sensitive to
extreme values.
Weaknesses of Decision Tree Classification
Prone to Overfitting
If trees are allowed to grow deep without restrictions (pruning), they can fit noise in the
training data, resulting in poor generalization on unseen data.
Unstable with Respect to Small Data Changes
Minor fluctuations in data can result in completely different splits, leading to drastically
different tree structures.
Biased with Imbalanced Data
Trees may favor classes that are more frequent in the dataset unless special techniques are
used to handle class imbalance.
Poor at Approximating Certain Functions
Struggle with relationships that require smooth transitions or combinations of many variables
—piecewise constant prediction can be limiting.
Can Create Biased Trees with Many Levels or Categories
Attributes with many distinct values may dominate splits unless careful impurity measures
(like gain ratio) are used.
Not Ideal for Extrapolation
Predictive regions outside the training data are handled poorly—trees can't predict values
beyond the observed range.