Machine Learning
Unit 1
1. What is Machine Learning? Explain its need and relevance in today’s world.
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to
learn from data and improve their performance on tasks without being explicitly programmed.
Instead of following fixed instructions, ML algorithms identify patterns in data and make
decisions or predictions based on that data.
• The volume of data generated today is enormous and too complex for traditional
programming methods to handle effectively.
• ML automates decision-making by learning from data, reducing the need for manual
intervention.
• It enables systems to adapt and improve over time, making them more efficient and accurate.
Despite these benefits, applying Machine Learning in practice comes with several challenges:
• Data Quality and Quantity: ML models require large amounts of high-quality, relevant
data. Poor, noisy, or insufficient data can lead to inaccurate models.
• Overfitting and Underfitting: Overfitting occurs when a model learns noise in the
training data, reducing accuracy on new data. Underfitting happens when the model is too
simple to capture patterns.
• Feature Selection and Engineering: Choosing the right input features is crucial but
often difficult and time-consuming.
• Computational Complexity: Training complex models demands significant processing
power and time.
• Interpretability: Some models act as “black boxes,” making it hard to understand their
decision process.
• Generalization: Ensuring the model works well on unseen data remains a key challenge.
• Bias and Fairness: Models may reflect biases present in data, leading to unfair results.
• Data Privacy and Security: Handling sensitive data requires strict privacy and security
measures.
2. Describe the different types of Machine Learning with examples: Supervised,
Unsupervised, and Reinforcement Learning.
• Supervised Learning is a type of machine learning where the model is trained using labeled
data (input with the correct output).
• The machine learns from these examples and uses the patterns to predict outcomes for new,
unseen data.
• It’s called “supervised” because the learning is guided — like a teacher helping a student by
giving the right answers during practice.
Imagine a basket full of different fruits, and we want the machine to identify them.
During training:
• If the fruit is round, red, and has a small dip at the top, it is labeled as an Apple.
• If the fruit is long, curved, and yellow, it is labeled as a Banana.
Now, when we give the machine a new fruit that is yellow and curved, it compares it to what it
has learned, and predicts it’s a Banana.
Types of Supervised Learning (a small sketch follows this list):
1. Classification:
o The goal is to predict a discrete class label.
o Used when the output variable is categorical (e.g., yes/no, spam/not spam,
apple/banana).
o Example: Email spam detection, disease diagnosis (positive/negative), image
classification.
2. Regression:
o The goal is to predict a continuous numerical value.
o Used when the output variable is quantitative.
o Example: Predicting housing prices, stock market trends, temperature
forecasting.
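The difference between the two task types can be illustrated with a small, hypothetical scikit-learn sketch (scikit-learn is assumed to be installed; the toy data and values are made up purely for illustration):

```python
# Hypothetical sketch: classification vs. regression with scikit-learn (toy data).
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a discrete label (0 = not spam, 1 = spam) from two made-up features.
X_cls = [[0.1, 3], [0.9, 15], [0.2, 4], [0.8, 20]]
y_cls = [0, 1, 0, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0.85, 18]]))      # -> a class label, e.g. [1]

# Regression: predict a continuous value (house price) from size in square feet.
X_reg = [[800], [1000], [1200], [1500]]
y_reg = [100000, 125000, 150000, 190000]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1100]]))          # -> a continuous number, roughly 137,000
```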
Unsupervised Learning is a type of machine learning where the model is trained on unlabeled data and must discover structure in it on its own. Its main types are:
1. Clustering:
o Groups similar data points into clusters based on feature similarity.
o Example: Customer segmentation in marketing, grouping users with similar
behavior.
2. Dimensionality Reduction:
o Reduces the number of features in a dataset while preserving important
information.
o Example: PCA (Principal Component Analysis) used in image compression,
noise reduction.
3. Association Rule Learning:
o Discovers interesting relationships or associations among variables in large
datasets.
o Example: Market Basket Analysis – people who buy bread often buy butter.
Imagine a basket filled with fruits, but no labels (no mention of which is apple, banana, or
orange). The machine has to analyze the features of each fruit (like color, shape, size) and
group similar fruits together on its own.
🔹 Example: The machine might place all the round, red fruits in one group and all the long, yellow fruits in another. It does not know the names of the fruits; it only groups them based on similarity. This is how clustering works, and it is useful when you don't know the categories in advance.
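As a rough illustration of this idea, here is a small hypothetical sketch using scikit-learn's KMeans, with made-up fruit measurements (length and a yellowness score) and no labels:

```python
# Hypothetical sketch: clustering unlabeled "fruit" data with K-Means (scikit-learn assumed).
from sklearn.cluster import KMeans

# Made-up features: [length_cm, yellowness_score]. No labels are given.
fruits = [[7.0, 0.10], [7.5, 0.20], [6.8, 0.15],      # roundish, red fruits
          [18.0, 0.90], [19.5, 0.95], [17.0, 0.85]]   # long, yellow fruits

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fruits)
print(kmeans.labels_)                  # e.g. [0 0 0 1 1 1] -- two groups found without any names
print(kmeans.predict([[18.5, 0.9]]))   # a new fruit falls into the "long, yellow" cluster
```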
Reinforcement Learning is based on the concept of trial and error. Unlike supervised learning, there are no labeled input/output pairs. The agent learns from the consequences of its actions rather than being explicitly taught.
Core Components: the agent (the learner), the environment it interacts with, the actions the agent can take, and the rewards (positive or negative feedback) that guide learning.
Example: Consider a robot learning to walk in a simulated environment. It begins without any prior knowledge. When it attempts to walk, it receives feedback: a positive reward when it moves forward without falling, and a negative reward (penalty) when it falls. Over many trials, it learns which actions lead to the most reward.
Applications: robotics and autonomous control, game playing (such as chess or Go agents), recommendation systems, and self-driving vehicles.
4. Explain Testing and Validation in Machine Learning.
Testing and validation are fundamental phases in the machine learning workflow. They are used
to evaluate the performance and generalizability of a trained model on data it has never
encountered before. These steps ensure that the model not only fits the training data but also
performs robustly on new, real-world data.
Validation in Machine Learning:- Validation evaluates the model during training on a held-out validation set that is not used to fit the model's parameters.
Purpose: To tune hyperparameters, compare model configurations, and detect overfitting before the final evaluation.
Process:
• Split the dataset into training and validation subsets (commonly 70-80% training, 20-30%
validation).
• Train the model on the training set.
• After each training iteration or epoch, evaluate the model on the validation set using
metrics like accuracy, loss, precision, recall, etc.
• Use validation results to optimize hyperparameters and select the best performing model
configuration.
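A minimal sketch of this split, assuming scikit-learn is available; the iris dataset and the accuracy metric are stand-ins for illustration:

```python
# Hypothetical sketch: splitting data into training and validation sets (scikit-learn assumed).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)      # stand-in dataset for illustration

# 80% for training, 20% held out for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the validation set to guide hyperparameter tuning / model selection.
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_acc:.3f}")
```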
Testing in Machine Learning:- Testing is the final evaluation step where the trained
and validated model is tested on a test set, which remains completely unseen during
both training and validation.
Purpose: To provide an unbiased, final estimate of how well the model will perform on completely unseen, real-world data.
Process: After training and validation, apply the final model on the test data. Measure
performance using appropriate metrics (e.g., accuracy, F1-score, ROC-AUC for classification;
mean squared error for regression). Use the results to make conclusive judgments about the
model's effectiveness.
5. What is classification in Machine Learning? Explain the MNIST dataset.
Classification is a supervised learning task where the goal is to assign input data points to one or
more predefined categories or classes based on their features. The type of classification problem
depends on the number and nature of the classes involved. There are three main types of
classification:
1. Binary Classification :- In binary classification, each input data point is assigned to one of exactly two possible classes.
• How it works:
The model learns from labeled examples of the two classes and predicts whether new
input belongs to Class A or Class B.
• Example:
Email spam detection is a classic example. The system analyzes emails and classifies
them as either “spam” or “not spam” based on features like certain keywords, sender
address, or message structure.
• Use Cases:
o Fraud detection (fraud / no fraud)
o Disease diagnosis (disease present / not present)
o Credit approval (approve / reject)
2. Multiclass Classification :- In multiclass classification, each input data point is assigned to exactly one class out of three or more possible classes.
• How it works:
The model learns decision boundaries between multiple classes and assigns the input to
the single class that best fits its features.
• Example:
An image recognition system classifies pictures of animals into categories like “cat,”
“dog,” “bird,” and so on. The model evaluates the features of the image such as shape,
texture, and color to predict the correct label.
• Use Cases:
o Handwritten digit recognition (digits 0-9). The MNIST dataset is the classic benchmark here: 70,000 grayscale images of handwritten digits, each 28×28 pixels, conventionally split into 60,000 training and 10,000 test images (see the sketch after this list).
o Document classification (news, sports, entertainment)
o Species classification in biology
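A hedged sketch of multiclass digit recognition on MNIST, assuming scikit-learn with internet access for fetch_openml; only a slice of the training data is used so the example runs quickly:

```python
# Hypothetical sketch: multiclass classification of MNIST digits with scikit-learn.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# MNIST: 70,000 grayscale 28x28 digit images, flattened to 784 features each.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target           # scale pixel values to [0, 1]

# Conventional split: first 60,000 images for training, last 10,000 for testing.
X_train, y_train = X[:10000], y[:10000]           # small slice to keep the demo fast
X_test,  y_test  = X[60000:], y[60000:]

clf = SGDClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```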
3. Multi-Label Classification :- In multi-label classification, each input data point can be
assigned multiple labels simultaneously.
• How it works:
Unlike multiclass classification, where the categories are mutually exclusive, multi-label
allows overlapping classes. The model predicts all relevant classes for each input.
• Example:
A movie recommendation system that tags a movie as both “action” and “comedy.”
Based on features like plot, actors, and genre, the model assigns multiple labels to the
same movie.
• Use Cases:
o Text categorization where documents may belong to multiple topics
o Music genre classification where a song can be tagged with multiple genres
o Medical diagnosis where a patient may have multiple concurrent conditions
Applications of Classification:
• Email Spam Filtering: Classifies emails as spam or not spam by analyzing keywords
and sender information.
• Credit Risk Assessment: Predicts loan default risk using credit score, income, and loan
history to help banks decide on approvals.
• Medical Diagnosis: Identifies diseases (e.g., cancer, diabetes) from test results and
patient data to assist doctors in diagnosis.
• Image Classification: Used in facial recognition, autonomous driving, and medical
imaging to identify objects or conditions.
• Sentiment Analysis: Determines if text sentiment is positive, negative, or neutral,
helping businesses understand customer feedback.
• Fraud Detection: Detects fraudulent transactions by analyzing patterns in financial data
to prevent credit card and insurance fraud.
• Recommendation Systems: Suggests movies, products, or content based on user
preferences to improve personalization and sales.
6. Discuss performance evaluation metrics in Machine Learning: Confusion Matrix, Precision,
Recall, and ROC Curve.
1. Confusion Matrix
The confusion matrix is a fundamental tool for visualizing the performance of a classification
algorithm. It displays the counts of correct and incorrect predictions made by the model
compared to the actual outcomes. For binary classification it is a 2×2 table:
Actual Positive: True Positives (TP) | False Negatives (FN)
Actual Negative: False Positives (FP) | True Negatives (TN)
(rows show the actual class, columns show the predicted positive / predicted negative outcome)
This matrix is the basis for calculating many other performance metrics.
2. Precision
Formula:
Precision = TP / (TP + FP)
Interpretation: It indicates how many of the predicted positive cases were actually positive.
Example: If TP = 30 and FP = 5, then
Precision = 30 / (30 + 5) = 0.857
Use Case: Precision is critical when the cost of false positives is high, such as in spam
detection, fraud alerts, or automated medical diagnoses.
3. Recall (Sensitivity or True Positive Rate)
Recall measures the model’s ability to identify all relevant positive cases.
Formula:
Recall = TP / (TP + FN)
Interpretation: It tells us how many of the actual positive cases were captured by the model.
Example: If TP = 30 and FN = 10, then
Recall = 30 / (30 + 10) = 0.75
Use Case: Recall is essential when false negatives are more dangerous, such as in disease
screening or security breach detection.
4. F1 Score
The F1 Score is the harmonic mean of precision and recall. It provides a single score that
balances both concerns.
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use Case: F1 is ideal when both false positives and false negatives carry significant costs (e.g.,
loan approvals or cancer detection).
5. ROC Curve
The Receiver Operating Characteristic (ROC) Curve is a plot of the True Positive Rate
(Recall) against the False Positive Rate (FPR = FP / (FP + TN)) across different classification
thresholds.
The Area Under the ROC Curve (AUC) quantifies the overall ability of the model to
discriminate between positive and negative classes.
Use Case: ROC and AUC are particularly useful for comparing multiple classification models
or evaluating model performance in imbalanced datasets.
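The metrics above can be computed directly with scikit-learn; a brief sketch with made-up true labels, predictions, and scores (the values are illustrative only):

```python
# Hypothetical sketch: confusion matrix, precision, recall, F1 and ROC-AUC with scikit-learn.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]        # actual classes (made up)
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]        # hard predictions from some model
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))                 # scikit-learn layout: [[TN FP], [FN TP]]
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_scores))    # uses scores/probabilities, not hard labels
```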
7. What is the Precision/Recall trade-off? How is it handled in practice?
In machine learning classification tasks, the precision/recall trade-off refers to the inverse
relationship between two key evaluation metrics:
• Precision = TP / (TP + FP): Measures how many predicted positive instances are actually
correct.
• Recall = TP / (TP + FN): Measures how many actual positive instances were correctly
identified.
Improving one of these metrics usually reduces the other. You cannot maximize both
simultaneously. This is called the precision/recall trade-off.
• YouTube Restricted Mode: To protect children, the model must only allow truly safe
videos. It should avoid classifying harmful content as safe. Hence, the focus is on high
precision.
• Shoplifting Detection in Malls: If a system wrongly identifies innocent customers as
shoplifters, it can lead to major issues. Hence, the model must have high precision,
reducing false positives.
• Disease Detection Model: In medical diagnosis, missing a disease case (false negative)
is dangerous. The goal is to detect as many true cases as possible, i.e., high recall.
• Loan Default Prediction: A model predicting whether a loan applicant is a defaulter
should aim to catch all true defaulters. False negatives (failing to flag actual defaulters)
can lead to large financial losses. Hence, high recall is desired.
How It Is Handled in Practice
• Adjust the decision threshold: most classifiers output a probability or score, and the threshold that converts it into a class can be moved. Raising the threshold increases precision (fewer false positives), while lowering it increases recall (fewer false negatives).
• Plot the precision-recall curve, which shows precision and recall at every possible threshold, and pick the threshold that satisfies the application's requirement (for example, "at least 90% precision").
• Use the F1 score when a single metric that balances both is needed.
A sketch of threshold tuning follows.
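A minimal sketch, assuming scikit-learn and a probabilistic classifier; the dataset and threshold values are stand-ins:

```python
# Hypothetical sketch: trading precision against recall by moving the decision threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)       # stand-in binary dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]          # probability of the positive class

# Lower threshold -> higher recall; higher threshold -> higher precision.
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print(threshold,
          "precision:", round(precision_score(y_te, pred), 3),
          "recall:", round(recall_score(y_te, pred), 3))
```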
8. What is Multiclass Classification? How is it handled and what are its challenges?
Multiclass classification means teaching a computer to put things into more than two groups or
categories. Instead of just “yes” or “no” (which is binary classification), it decides between many
classes.
Example: If you have to recognize handwritten numbers from 0 to 9, the model has to choose
between 10 different classes. This is multiclass classification.
Strategies for handling multiclass classification (see the sketch after this list):
• One-vs-Rest (OvR): We make one model for each class. Each model learns to separate
that class from all others. Then, we pick the class with the strongest prediction.
• One-vs-One (OvO): We make a model for every pair of classes. For example, if there
are 3 classes (A, B, C), we make models for A vs B, A vs C, and B vs C. The class that
wins the most matches is chosen.
• Special Algorithms: Some models like Decision Trees or Neural Networks can directly
handle many classes at once.
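A small sketch of the OvR and OvO strategies using scikit-learn's wrapper classes (scikit-learn applies such strategies automatically for many estimators; the iris dataset is a stand-in):

```python
# Hypothetical sketch: One-vs-Rest and One-vs-One wrappers around a binary classifier.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                          # 3 classes of iris flowers

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # one model per class vs. the rest
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # one model per pair of classes

print(len(ovr.estimators_), "OvR models")    # -> 3 (one per class)
print(len(ovo.estimators_), "OvO models")    # -> 3 pairs for 3 classes (A-B, A-C, B-C)
print(ovr.predict(X[:1]), ovo.predict(X[:1]))
```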
Challenges of Multiclass Classification:
1. Unequal Data (Class Imbalance): Some classes might have many examples, others very
few. The model might get better at recognizing common classes and ignore rare ones.
2. More Classes, More Confusion: When there are many classes, it’s harder for the model
to tell them apart.
3. More Work and Time: With many classes, the computer needs to do more calculations
and use more memory.
4. Harder to Measure Success:
Accuracy alone doesn’t tell the full story. We need other measures like precision and
recall for each class.
9. What is error analysis in Machine Learning? How can it improve model performance?
Error analysis is the systematic process of examining the mistakes or errors made by a machine
learning model after training. Instead of just looking at overall accuracy or loss, error analysis
digs deeper into what types of errors occur, why they happen, and under what conditions.
This detailed understanding is crucial for making targeted improvements to the model.
When you first train a machine learning model, it rarely performs perfectly. Error analysis helps answer important questions such as: Which kinds of inputs does the model get wrong most often? Are the errors concentrated in particular classes or conditions? Are they caused by noisy labels, missing features, or too little data?
By answering these, error analysis enables more efficient and effective improvements, rather
than random guessing.
Consider a speech recognition system that converts spoken words into text. The model might
work well in quiet environments but struggle with noisy backgrounds, accents, or different
microphones. Instead of blindly trying new models or features, error analysis allows you to:
• Tag errors based on environment types (quiet office, noisy car, street, etc.)
• Quantify which environment causes the most mistakes
• Focus your improvement efforts (like adding noise-robust features) where they matter
most
Unit 2
1. What is Linear Regression? Explain its types and applications.
Linear regression is a type of supervised machine-learning algorithm that learns from labelled datasets and fits the most optimized linear function to the data points, which can then be used for prediction on new data. It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
For example, suppose we want to predict a student's exam score based on how many hours they studied. We observe that as students study more hours, their scores go up. In this example:
• Independent variable (input): Hours studied because it's the factor we control or
observe.
• Dependent variable (output): Exam score because it depends on how many hours were
studied.
Simple Linear Regression :- Simple Linear Regression is a technique used to predict the value
of one dependent variable using only one independent variable. It assumes a linear (straight-line)
relationship between the two variables.
Multiple Linear Regression :- Multiple Linear Regression is a technique used to predict the
value of one dependent variable using two or more independent variables. It assumes a linear
relationship between the dependent variable and the combination of all independent variables.
Example: Predicting a person’s salary based on their years of experience, education level, and
location.
Applications of Multiple Linear Regression:
1. Real Estate Pricing :- MLR is used to predict property prices based on factors such as
location, size, number of bedrooms, and property type.
2. Financial Forecasting :- Financial analysts apply MLR to forecast stock prices or economic
indicators using variables like interest rates, inflation rates, and market trends.
2. What is Gradient Descent? Explain its types.
Gradient Descent is an optimization algorithm that iteratively adjusts a model's parameters in the direction that most reduces the error (loss). It is widely used in machine learning and deep learning for training models by adjusting weights to reduce prediction error.
Imagine you are at the top of a mountain and want to reach the lowest point (valley). You can’t
see the entire path, but you can feel the slope under your feet. Each step you take in the direction
of the steepest descent gets you closer to the bottom.
This is how Gradient Descent works: it updates model parameters step-by-step to minimize the
error.
The difference between the three types lies in how much data they use to calculate the gradient
for each step.
1. Batch Gradient Descent
Definition:
Batch Gradient Descent calculates the gradient using the entire training dataset before updating
the model parameters. This means the model makes one update per epoch, after seeing all the
training examples.
Advantages:
• The updates are stable and smooth because they are based on the entire dataset.
• It converges steadily toward the minimum when the loss function is smooth.
Disadvantages:
• It is slow and requires high memory when the dataset is very large because it processes
all data for each update.
Example:
If you have 1,000 house price records, batch gradient descent calculates the error for all 1,000
houses first, averages them, and then updates the model once.
2. Stochastic Gradient Descent (SGD)
Definition:
Stochastic Gradient Descent updates the model parameters after processing each individual
training example. Instead of waiting for all data, it adjusts weights step-by-step using one sample
at a time.
Explanation:
Here, the model looks at one house price record, calculates the error, and immediately updates
the parameters. Then it moves to the next record and repeats the process.
Advantages:
• Updates are very fast and memory-efficient because only one example is processed at a time.
• The noisy updates can sometimes help the model escape shallow local minima.
Disadvantages:
• Updates are noisy and less stable, so the loss may fluctuate rather than smoothly
decrease.
• May take longer to fully converge due to the fluctuations.
3. Mini-Batch Gradient Descent
Definition:
Mini-Batch Gradient Descent uses a small fixed number of examples (called mini-batches) to
compute the gradient and update the model. This method balances between Batch and Stochastic
Gradient Descent.
Advantages:
• Faster than Batch Gradient Descent and more stable than SGD.
• Works well with modern hardware, since mini-batches can be processed with vectorized operations.
Disadvantages:
• Introduces an extra hyperparameter to tune: the batch size.
• Updates are still somewhat noisy compared to full-batch updates.
Example:
The model updates itself after seeing batches of 32 house price records instead of all 1,000 or
just one.
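A from-scratch sketch of the three variants for a simple linear model y = w*x + b (NumPy assumed; the data, learning rate, and epoch count are made up for illustration):

```python
# Hypothetical sketch: batch, stochastic and mini-batch gradient descent for y = w*x + b.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3 * X + 5 + rng.normal(0, 1, size=100)       # made-up data with true w=3, b=5

def gradient_step(w, b, x_batch, y_batch, lr=0.01):
    """One parameter update computed from whatever batch of examples is passed in."""
    error = (w * x_batch + b) - y_batch
    grad_w = 2 * np.mean(error * x_batch)        # d(MSE)/dw
    grad_b = 2 * np.mean(error)                  # d(MSE)/db
    return w - lr * grad_w, b - lr * grad_b

w = b = 0.0
for epoch in range(500):
    # Batch GD: one update per epoch using ALL examples.
    # w, b = gradient_step(w, b, X, y)

    # Stochastic GD: one update per single example.
    # for i in rng.permutation(len(X)):
    #     w, b = gradient_step(w, b, X[i:i+1], y[i:i+1])

    # Mini-batch GD: one update per small batch (here, 32 examples).
    for start in range(0, len(X), 32):
        w, b = gradient_step(w, b, X[start:start+32], y[start:start+32])

print("learned w, b:", round(w, 2), round(b, 2))   # should move toward w=3, b=5
```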
3. What is Polynomial Regression?
Polynomial regression is a type of regression where the relationship between the independent
variable and the dependent variable is modeled as a curve rather than a straight line. This allows
the model to capture more complex patterns in the data.
Unlike simple linear regression, which assumes a straight-line relationship between the input and
output, polynomial regression can fit data where the effect of the input variable on the output
changes direction or speed. It does this by including powers of the input variable, which creates a
curved line that better fits the data points.
Example:
Imagine you want to predict an employee’s salary based on their years of experience. At the start, salary
increases slowly as the employee gains some experience. Then, during the middle years, salary grows
faster as they gain valuable skills and responsibilities. After many years, the salary growth slows down or
levels off. This curved pattern can’t be captured well by a straight line, but polynomial regression fits this
curve and models the salary growth more accurately.
Use Cases: modeling salary growth over years of experience, and other relationships where the rate of change itself changes (for example, growth curves or demand forecasting).
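A minimal sketch, assuming scikit-learn; the salary figures are invented purely to show the curved fit:

```python
# Hypothetical sketch: polynomial regression = linear regression on polynomial features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

years = np.array([[1], [3], [5], [8], [12], [16], [20]])   # made-up years of experience
salary = np.array([30, 38, 55, 85, 120, 135, 140])          # made-up salary (in thousands)

# Degree-2 polynomial: the model fits salary = b0 + b1*x + b2*x^2 (a curve, not a line).
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(years, salary)

print(model.predict([[10]]))   # predicted salary for 10 years of experience
```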
4. What are Learning Curves? How do they help diagnose underfitting and overfitting?
Learning curves are graphs that show how well a machine learning model performs as it learns
from more data. They plot:
• Training Error: How well the model fits the data it learned from (training data).
• Test Error: How well the model predicts new, unseen data (test data).
• Underfitting (high bias): Underfitting happens when your model is too simple to
capture variations and patterns in your data. The machine doesn’t learn the right
characteristics and relationships from the training data, and thus performs poorly with
subsequent data sets.
• Overfitting (high variance): Training error is very low, but test error is high. The model
memorizes training data but fails on new data.
• Adding more training data usually reduces overfitting (variance) but does not help much
with underfitting (bias).
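scikit-learn can compute such curves directly; a rough sketch (the dataset and model are stand-ins chosen for illustration):

```python
# Hypothetical sketch: training vs. validation score as the training set grows.
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A large gap between training and validation scores suggests overfitting (high variance);
# two low, close scores suggest underfitting (high bias).
print("train accuracy:", train_scores.mean(axis=1).round(3))
print("valid accuracy:", val_scores.mean(axis=1).round(3))
```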
5. Explain Bias, Variance, and the Bias-Variance Tradeoff.
Bias is the error that happens when a model makes wrong assumptions about the data. It causes
the model’s predictions to be systematically different from the true values. Simply put, bias
means the model is too simple to learn the true pattern in the data.
• High bias :- A model with high bias does not match the dataset closely. It means the model is very simple and does not fit the training data well; it ignores important patterns and results in underfitting.
• Low bias :- A model with low bias closely matches the training dataset. It means the model is flexible and can fit the training data well.
Example:
Imagine trying to predict salary based only on years of experience, assuming a simple straight
line relationship. But if the real relationship is more complex (like salary grows faster after some
years), a simple straight line model will have high bias and will give wrong predictions.
Variance measures how much the model’s predictions change when trained on different subsets
of data. It shows how sensitive the model is to small changes in the training data.
• High variance means the model fits the training data too closely, including noise or
random details. It performs very well on training data but badly on new, unseen data —
this is called overfitting.
• Low variance means the model is stable and produces similar predictions even if the
training data changes.
Example:
If a model memorizes the salaries of individual employees perfectly but cannot predict well for
new employees, it has high variance.
Bias-Variance Tradeoff
• Ideally, you want a model with low bias and low variance, meaning it fits the data well
and generalizes to new data.
• However, decreasing bias by making the model more complex usually increases variance.
• Decreasing variance by simplifying the model usually increases bias.
• This balancing act is the bias-variance tradeoff — finding the right model complexity to
minimize total error.
6. Compare Ridge Regression and Lasso Regression. When should each be used?
In Machine Learning, sometimes linear regression models can become too complex and overfit
the data — meaning they work well on training data but perform poorly on new data. To handle
this, we use regularization techniques, and the two most common ones are Ridge Regression
and Lasso Regression.
These are advanced versions of linear regression that add a penalty to the model to reduce
overfitting and improve generalization.
Ridge Regression (L2 Regularization) :- Ridge Regression is a type of linear regression that adds
a penalty to the squared values of the coefficients (weights). This penalty term helps shrink
the size of the coefficients but does not make them zero. It is useful when we have many
features and we want to keep all of them but avoid overfitting.
➤ Example (Salary prediction): Suppose you are predicting someone's salary using 10 features
like age, experience, education level, number of languages known, etc. Ridge regression will use
all the features, but it will shrink the influence of the less important ones.
Lasso Regression (L1 Regularization) :- Lasso Regression is another form of linear regression
that adds a penalty to the absolute values of the coefficients. Unlike Ridge, Lasso can reduce
some coefficients to zero, effectively removing those features from the model. So, it performs
both regularization and feature selection.
Lasso regression helps remove unnecessary features automatically. This leads to simpler models
that are easier to interpret and work well on new data.
Using the same example, if “number of siblings” or “distance from home” doesn’t really affect
salary much, Lasso will remove those features by making their coefficient zero.
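A brief sketch comparing the two, assuming scikit-learn; the data is synthetic and alpha (the penalty strength) is an illustrative value:

```python
# Hypothetical sketch: Ridge shrinks coefficients, Lasso can set some exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # 10 features, but only 3 actually matter
true_coef = np.array([5, 0, 0, 3, 0, 0, 0, 2, 0, 0])
y = X @ true_coef + rng.normal(0, 0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: all coefficients shrink, none become zero
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: irrelevant coefficients become exactly 0

print("Ridge:", ridge.coef_.round(2))
print("Lasso:", lasso.coef_.round(2))             # expect zeros for the unimportant features
```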
7. What is Early Stopping in model training?
While training a model, the performance on the training dataset improves as the model sees more
data. Initially, both training and validation errors decrease. But after a certain point, the model
starts to overfit — learning patterns specific to the training set instead of general patterns. This is
visible when the training error keeps decreasing while the validation error stops improving and then starts to rise.
Early stopping captures this point, and returns the model parameters from that iteration,
not the final one. Hence, it helps maintain good generalization and low variance.
If we continue training after overfitting starts, the model becomes too tailored to the training
data and performs poorly on new, unseen data. Early stopping ensures we pause at the best
moment, balancing training accuracy and generalization.
This technique is considered an implicit form of regularization — it does not add a penalty
term like L1 or L2 (as in Lasso or Ridge), but it still helps control complexity and overfitting.
Advantages of Early Stopping:
• Reduces overfitting
• Improves generalization
• Simple and easy to implement
• Requires less training data compared to other regularization methods
• Saves training time
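One simple way to get this behaviour in practice is scikit-learn's built-in early_stopping option; a hedged sketch (the dataset and parameter values are illustrative):

```python
# Hypothetical sketch: early stopping with a validation set in scikit-learn's SGDClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier

X, y = load_breast_cancer(return_X_y=True)

model = SGDClassifier(
    early_stopping=True,        # hold out part of the training data as a validation set
    validation_fraction=0.1,    # 10% of training data used to monitor performance
    n_iter_no_change=5,         # stop if the validation score doesn't improve for 5 epochs
    max_iter=1000,
    random_state=0,
)
model.fit(X, y)
print("epochs actually run:", model.n_iter_)   # usually far fewer than max_iter
```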
8. What is Logistic Regression? Explain its types and how it works.
Logistic Regression is a supervised machine learning algorithm used for classification tasks.
Despite its name, it is actually a classification algorithm, not a regression one. It is used when the
target variable is categorical, such as binary classification (e.g., spam or not spam, pass or fail, 0
or 1).
Binomial Logistic Regression: This type is used when the dependent variable has only two
possible outcomes. Examples include predicting whether a student passes or fails, or whether a
customer will buy a product or not. It is the most common type and is used for binary
classification problems.
Multinomial Logistic Regression: This is used when the dependent variable has three or more
categories that do not follow any specific order. For example, classifying types of transport like
bus, car, or train. These categories are distinct and unordered.
Ordinal Logistic Regression: This type is used when the dependent variable has three or more
categories with a natural order or ranking. Examples include rating a service as poor, average, or
excellent. The order of the categories is considered while modeling.
How Logistic Regression Works:
Step 1: Take Input Data The model takes input values (called features), like marks, age,
income, hours studied, etc.
Step 2: Calculate a Score It combines all the input values using a simple formula to calculate a
single number (score). This score can be any number – positive or negative.
Step 3: Apply the Sigmoid Function The score is passed through a sigmoid function, σ(z) = 1 / (1 + e^(-z)), which converts it into a probability between 0 and 1.
Step 4: Make a Decision Now, the model checks the probability and decides:
If p >= 0.5, predict class A
If p < 0.5, predict class B
If our threshold was 0.5 and our prediction function returned 0.7, we would classify this observation as class A. If our prediction was 0.2, we would classify the observation as class B.
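A short sketch of these steps with scikit-learn (the dataset is a stand-in; the 0.5 threshold matches the rule above):

```python
# Hypothetical sketch: logistic regression = score -> sigmoid -> probability -> class.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

score = model.decision_function(X[:1])            # Step 2: raw score (any real number)
prob = 1 / (1 + np.exp(-score))                   # Step 3: sigmoid squashes it into (0, 1)
pred = (prob >= 0.5).astype(int)                  # Step 4: threshold at 0.5

print(score, prob, pred)          # consistent with model.predict_proba / model.predict
```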
9. What is a Decision Boundary? Explain its types.
A linear decision boundary is a straight line (or plane in higher dimensions) that separates the
data into two classes. It is the simplest form of boundary and is used when the data is linearly
separable.
It can be expressed using a linear equation such as:
y = mx + b,
where m is the slope and b is the intercept.
A non-linear decision boundary is a curved or flexible line that separates the classes. It is used
when the data cannot be separated by a straight line. Models like SVM with RBF kernel or
Neural Networks can learn such boundaries. These boundaries adapt to the complex shape of
the data distribution.
Axis-Aligned (Rectangular) Decision Boundary: This type of decision boundary is made of horizontal and vertical lines that split the space like
blocks or steps. It is used in models like Decision Trees or Random Forests, which split data
using feature thresholds (e.g., age > 30). These boundaries are not smooth but look like stairs or
rectangles.
Purpose: A decision boundary defines the regions of the feature space assigned to each class, and therefore determines how new data points will be classified.
Visualization: You can often visualize decision boundaries by plotting the data points and the boundary line or surface.
Learning: During training, machine learning algorithms learn the optimal decision boundary that best separates the classes based on the training data.
10. What is Softmax Regression?
Softmax Regression is a method used in machine learning when you want to classify things into more than two groups or categories, for example deciding whether a picture shows a dog, a cat, or a bird. The model computes a raw score for each category, and the softmax function converts those scores into probabilities that add up to 1; the category with the highest score gets the highest probability.
Why do we need this “softmax” step? If you just had raw scores, they might not be easy to
compare or interpret. For example, a score of 5 for “dog” and 3 for “cat” doesn’t directly tell you
how sure the computer is. By converting to probabilities, you get a clearer picture of how
confident the model is about each option.
Before Softmax Regression can classify things well, it needs to learn from examples:
• It looks at many labeled examples (like pictures that are already tagged as “dog” or
“cat”).
• It adjusts the way it calculates scores to make its predictions better and better.
• It tries to give higher probabilities to the correct categories on the training examples. This
learning process is usually done by trying to minimize mistakes, so the model improves
over time.
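A tiny NumPy sketch of the softmax step itself, using made-up scores for three categories:

```python
# Hypothetical sketch: turning raw class scores into probabilities with softmax.
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exp / exp.sum()

scores = np.array([5.0, 3.0, 1.0])          # made-up raw scores for "dog", "cat", "bird"
probs = softmax(scores)
print(probs)            # roughly [0.867 0.117 0.016] -- probabilities that sum to 1
print(probs.sum())      # 1.0
```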
11. Explain Cross Entropy and its role in classification.
Cross Entropy is a way to measure how close or far your model’s predictions are from the actual
answers in classification problems.
Imagine you have a test, and your model gives probabilities for each possible answer. Cross
Entropy tells you how bad or good your guesses are compared to the correct answers.
When training a model, you want it to get better at making predictions. But to improve, the
model needs to know how wrong it is on each guess. Cross Entropy acts like a score or penalty
for wrong guesses — the bigger the penalty, the worse the prediction.
• It penalizes confident wrong guesses heavily — if the model is very sure about a wrong
answer, the penalty is large. This encourages the model to be cautious when it is unsure.
• It rewards confident correct guesses — if the model is sure about the correct class, it
gets a low penalty.
• It works well with probabilities, which are the model’s natural output for classification
tasks.
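A small NumPy sketch of this penalty, with made-up predicted probabilities (the formula is loss = -Σ yᵢ·log(pᵢ), where yᵢ is 1 for the true class and 0 otherwise):

```python
# Hypothetical sketch: cross-entropy penalizes confident wrong predictions heavily.
import numpy as np

def cross_entropy(true_class, predicted_probs):
    # -log(probability assigned to the correct class)
    return -np.log(predicted_probs[true_class])

# Model A: confident and correct about class 0 -> small penalty.
print(cross_entropy(0, np.array([0.9, 0.05, 0.05])))   # ~0.105

# Model B: unsure -> moderate penalty.
print(cross_entropy(0, np.array([0.4, 0.3, 0.3])))     # ~0.916

# Model C: confident but WRONG (puts 0.9 on class 1) -> large penalty.
print(cross_entropy(0, np.array([0.05, 0.9, 0.05])))   # ~3.0
```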
Unit 3
1. Explain SVM Classification and its types in detail.
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It tries to find the best boundary known as hyperplane that
separates different classes in the data. It is useful when you want to do binary classification like
spam vs. not spam or cat vs. dog.
The main goal of SVM is to maximize the margin between the two classes. The larger the margin
the better the model performs on new and unseen data.
1. SVM looks at the data and tries to find a straight line (or plane) that divides the two
groups.
2. It doesn’t just find any line — it finds the one that keeps the two groups as far apart as
possible.
3. The points that are closest to this line are called Support Vectors — these are the most
important points because they help SVM decide where the line should go.
4. The gap between these closest points and the line is called the Margin — the bigger the
gap, the better the SVM thinks the model will work on new data.
1. Linear (Hard Margin) SVM Classification: Linear SVM is used when the data can be separated perfectly by a straight line (or hyperplane).
Explanation: In Linear SVM, the algorithm finds the best straight boundary (hyperplane) that
divides the two classes.
• The main goal is to choose the hyperplane in such a way that the distance (called
margin) between the closest points of the two classes is maximized.
• The points that are closest to the hyperplane are called support vectors, and they play an
important role in defining the position of the hyperplane.
2. Soft Margin SVM Classification: Soft Margin SVM is used when the data is not perfectly
separable — i.e., when some points from different classes may overlap or be misclassified.
Explanation: Real-world data is rarely perfect — some points might lie on the wrong side of the
separating boundary.
When to use:
• When data is noisy, has outliers, or cannot be separated perfectly by a straight line.
3. Nonlinear SVM Classification: Nonlinear SVM is used when the data cannot be separated
by a straight line at all, no matter how hard you try.
Explanation: In such cases, SVM uses a special technique called the "Kernel Trick".
• A kernel function transforms the data into a higher-dimensional space where a straight
line (or hyperplane) can separate the data properly.
• This way, even if the original data is tangled or circular, SVM can find a good separation
in the new space.
• Common kernel functions:
o Polynomial Kernel , Radial Basis Function (RBF) or Gaussian Kernel ,
Sigmoid Kernel
When to use:
• When data is not linearly separable — like spirals, circles, or other complex shapes.
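A compact sketch of these variants with scikit-learn's SVC (the moon-shaped dataset and the C/gamma values are stand-ins for illustration):

```python
# Hypothetical sketch: linear, soft-margin and non-linear (kernel) SVM classifiers.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # two interleaving half-circles

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)    # straight-line boundary
soft_svm   = SVC(kernel="linear", C=0.1).fit(X, y)    # smaller C = softer margin, tolerates errors
rbf_svm    = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)    # kernel trick: curved boundary

for name, model in [("linear", linear_svm), ("soft", soft_svm), ("rbf", rbf_svm)]:
    print(name, "training accuracy:", round(model.score(X, y), 3))
# The RBF kernel usually fits the moon-shaped data much better than a straight line.
```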
2. Explain the Polynomial Kernel and Gaussian RBF Kernel used in SVM.
1. Polynomial Kernel: The Polynomial Kernel is a kernel function that allows the SVM to create curved decision
boundaries instead of straight lines.
The formula of the Polynomial kernel is: K(x, y) = (x · y + c)^d, where c is a constant and d is the polynomial degree.
It is used in Complex problems like image recognition where relationships between features can
be non-linear.
Explanation:
• Sometimes, data is not separable by a straight line but can be separated by a curve.
• The Polynomial Kernel transforms the input data into a higher-dimensional space
where the SVM can draw a boundary shaped like a curve or even a complex shape.
• The degree of the polynomial (like square, cube, etc.) decides how complex the curve
will be.
o For example:
▪ Degree 2 (Quadratic) — makes a parabolic curve.
▪ Degree 3 (Cubic) — makes more flexible curves.
When to use:
• When the data needs a curved decision boundary and you have some idea of how complex the curve should be (the degree controls this).
2. Gaussian RBF Kernel (Radial Basis Function Kernel): The Gaussian RBF Kernel is the most popular kernel; it allows the SVM to make round or radial decision boundaries. Its formula is: K(x, y) = exp(-γ ||x − y||²).
We use RBF kernel When the decision boundary is highly non-linear and we have no prior
knowledge about the data’s structure is available.
Explanation:
• This kernel transforms the data in such a way that each data point influences the space
around it like a small hill.
• The RBF Kernel can create very flexible boundaries that bend, curve, and wrap
around data clusters, no matter how complex they are.
• A parameter called gamma (γ) controls the influence:
o High gamma: Points have more local influence (sharp peaks).
o Low gamma: Points influence a wider area (gentler hills).
When to use:
• When data is highly non-linear and complex — like circular or spiral patterns.
• It is a good default choice when the data’s pattern is unknown because it can adapt to
various shapes.
3. Explain Support Vector Regression (SVR).
Support Vector Regression (SVR) is a version of Support Vector Machine (SVM) that is used for
predicting continuous values instead of classifying data into categories.
SVR tries to find a function (like a line or curve) that fits the data points, but with some
flexibility. It draws a margin of tolerance (called epsilon, ε) around the function where small
errors are ignored. If a predicted value falls inside this margin, it is accepted without penalty.
If some points fall outside this margin, SVR adds a penalty based on how far they are. The goal
is to keep the function as simple as possible (to avoid overfitting), while fitting the data well
within this margin.
Kernels in SVR: SVR can use the same kernels as SVM classification, such as linear, polynomial, and RBF kernels.
Choosing the right kernel depends on the shape and complexity of your data.
Important Parameters:
• C (Regularization): Controls how much error SVR is willing to tolerate outside the
margin.
o A large C means less tolerance to errors (fits training data closely).
o A small C allows more errors, helping the model generalize better.
• Epsilon (ε): Defines how wide the margin of tolerance is.
o A larger epsilon means a wider margin and fewer penalties.
o A smaller epsilon means the model tries to fit the data more closely.
Evaluating SVR:
After training SVR, you check how well it predicts new data using metrics like Mean Squared
Error (MSE) or Mean Absolute Error (MAE). Lower values of these metrics mean better
predictions.
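A brief sketch of SVR with its main parameters and MSE/MAE evaluation (synthetic data; the C and epsilon values are illustrative, not recommendations):

```python
# Hypothetical sketch: Support Vector Regression with C, epsilon and an RBF kernel.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)     # made-up noisy curve

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# C: tolerance for errors outside the margin; epsilon: width of the "no penalty" tube.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X_tr, y_tr)

pred = svr.predict(X_te)
print("MSE:", mean_squared_error(y_te, pred))
print("MAE:", mean_absolute_error(y_te, pred))
```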
4. Describe Decision Trees and their role in Machine Learning.
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes. It works like a flowchart that helps to make decisions step by step, where:
• Root Node:
This is the very first point of the tree. It has all the data and is where the first question or
split happens.
• Leaf Node:
These are the end points of the tree. They don’t split anymore and tell you the final
answer or prediction.
• Splitting:
This means dividing the data into smaller parts based on some rule or question about a
feature.
• Branch:
A branch is like a path that connects one question (node) to the next question or final
answer.
• Decision Node:
A place in the tree where a question is asked to decide how to split the data further.
• Pruning:
Cutting off extra parts of the tree that don’t help improve predictions. This stops the tree
from being too complicated and makes it better at working with new data.
Advantages of Decision Trees:
• No Need for Feature Scaling: They don't require you to normalize or scale your data.
Limitations of Decision Trees:
• Overfitting: Overfitting occurs when a decision tree captures noise and details in the training data and then performs poorly on new data.
• Instability: Instability means that the model can be unreliable; slight variations in the input data can lead to significant differences in predictions.
• Bias towards Features with More Levels: Decision trees can become biased towards features with many categories, focusing too much on them during decision-making. This can cause the model to miss other important features, leading to less accurate predictions.
Applications of Decision Trees:
• Loan Approval: A bank wants to decide whether to approve a loan application.
o Input features include income, credit score, employment status, and loan history.
o The decision tree predicts loan approval or rejection, helping the bank make quick
and reliable decisions.
• Medical Diagnosis: A healthcare provider wants to predict whether a patient has diabetes
based on clinical test results.
o Features like glucose levels, BMI, and blood pressure are used to make a decision
tree.
• Predicting Exam Results in Education : School wants to predict whether a student will
pass or fail based on study habits.
Visualizing a Decision Tree:
• Once the tree is trained, you can visualize it as a flowchart or tree diagram.
• Each node shows the feature used for splitting and the condition (e.g., Age ≤ 30).
• Branches represent outcomes of these conditions.
• Leaf nodes show the predicted class or value.
• Visualization helps you understand how the model makes decisions step by step.
Why visualize? Visualization makes the model transparent: you can explain any prediction by following the path from the root to a leaf, spot overly deep branches that hint at overfitting, and communicate the model's logic to non-technical stakeholders. A small sketch follows.
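A minimal sketch of training and visualizing a small tree, assuming scikit-learn and matplotlib are installed (the iris dataset and max_depth value are stand-ins):

```python
# Hypothetical sketch: training a small decision tree and drawing it as a flowchart.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)   # shallow tree, easy to read

plt.figure(figsize=(10, 6))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)   # each node shows its split condition
plt.show()
```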
6. What is the CART algorithm?
CART stands for Classification and Regression Trees, which is a popular algorithm used to
build decision trees for both classification (predicting categories) and regression (predicting
numbers).
CART builds a decision tree by repeatedly splitting the dataset into two parts, aiming to group
similar data points together. It uses binary splits, meaning each decision splits the data into
exactly two groups.
• CART’s binary splitting makes the tree easy to understand and interpret.
• It works for both classification and regression problems.
• It’s the foundation for many advanced models like Random Forests and Gradient Boosted
Trees.
7. Explain Gini Impurity and Entropy as criteria in Decision Trees.
When a decision tree tries to split data into groups, it wants to make those groups as pure as
possible — meaning, groups where most items belong to the same category.
To decide which split is best, the tree uses measures to check how mixed or pure each group is
after splitting. The two common measures are:
1. Gini Impurity :- The Gini index is a metric for the classification tasks in CART.
• Imagine you have a bag of colored balls (say red and blue).
• Gini Impurity tells us how mixed the colors are in that bag.
• If the bag has only red balls, then impurity is 0 — it’s pure.
• If the bag has half red and half blue balls, the impurity is higher because it’s mixed.
• So, the lower the Gini Impurity, the better — because the group is more pure.
Think of it like this: If you randomly pick a ball from the bag, Gini Impurity tells you the
chance that the ball is not from the dominant color.
2. Entropy
• Entropy is another way to measure how mixed a group is, but it comes from information
theory.
• It measures how uncertain or confusing the group is.
• If all balls are red, entropy is 0 — no confusion at all.
• If the balls are evenly split red and blue, entropy is at its highest — the group is very
confusing or uncertain.
• The goal is to split the data so the entropy decreases — meaning groups become less
mixed and easier to predict.
Think of it like this: Entropy measures how surprised you would be if you tried to guess the
color of a randomly picked ball from the bag.
The decision tree tries different ways to split the data and picks the split that makes the groups
as pure as possible — so it can make accurate predictions.
• If the groups after splitting have low Gini Impurity or low Entropy, that’s a good split.
• If not, the split is bad because the groups are still mixed and confusing.
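A small NumPy sketch computing both measures for a group of labels (Gini = 1 − Σ pᵢ², Entropy = −Σ pᵢ·log₂ pᵢ, where pᵢ is the proportion of class i in the group):

```python
# Hypothetical sketch: measuring how "mixed" a group of labels is.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure  = ["red"] * 10                       # only red balls
mixed = ["red"] * 5 + ["blue"] * 5         # half red, half blue

print(gini(pure), entropy(pure))           # 0.0, 0.0 -> completely pure group
print(gini(mixed), entropy(mixed))         # 0.5, 1.0 -> maximally mixed (for two classes)
```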
8. What are Regularization Hyperparameters in SVM and Decision Trees?
Regularization hyperparameters control how complex a model is allowed to become, and therefore how prone it is to overfitting.
• In SVM: C controls the trade-off between a wide margin and misclassified training points (a large C fits the training data closely, a small C tolerates more violations and generalizes better); gamma (for the RBF kernel) controls how far the influence of each training point reaches; and epsilon (in SVR) sets the width of the error-tolerant margin.
• In Decision Trees: limiting the depth of the tree, requiring a minimum number of samples before a node can split or become a leaf, and pruning all keep the tree simpler and reduce overfitting (in scikit-learn these appear as max_depth, min_samples_split, and min_samples_leaf).
A brief sketch follows.
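A hedged sketch of where these hyperparameters appear in scikit-learn (the values are illustrative, not recommendations):

```python
# Hypothetical sketch: regularization hyperparameters for an SVM and a decision tree.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# SVM: C trades margin width against training errors; gamma controls RBF kernel reach.
svm = SVC(kernel="rbf",
          C=1.0,        # smaller C -> wider margin, more tolerance to misclassification
          gamma=0.1)    # smaller gamma -> smoother, less wiggly boundary

# Decision tree: limits on growth act as regularization.
tree = DecisionTreeClassifier(
    max_depth=4,            # cap on how deep the tree may grow
    min_samples_split=10,   # a node needs at least 10 samples before it can split
    min_samples_leaf=5)     # every leaf must keep at least 5 samples

print(svm.get_params()["C"], tree.get_params()["max_depth"])
```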
Unit 4
1. What is Deep Learning? Explain its need.
Deep Learning is a part of Artificial Intelligence (AI) and Machine Learning (ML) that
focuses on teaching computers to learn and make decisions by themselves, especially when
dealing with complex data.
It uses special models called Artificial Neural Networks inspired by how the human brain
works. These networks have many layers, so they are called deep networks.
2. What is an Artificial Neural Network? Explain its key components.
Artificial Neural Networks contain artificial neurons, which are called units. These units are
arranged in a series of layers that together constitute the whole Artificial Neural Network in a
system.
A layer can have only a dozen units or millions of units; this depends on how complex the neural network needs to be to learn the hidden patterns in the dataset.
Commonly, an Artificial Neural Network has an input layer, an output layer, as well as hidden
layers. The input layer receives data from the outside world, which the neural network needs to
analyze or learn about. Then, this data passes through one or multiple hidden layers that
transform the input into data that is valuable for the output layer. Finally, the output layer
provides an output in the form of a response of the Artificial Neural Networks to the input data
provided.
1. Neurons
• The basic processing units of the network: each neuron receives inputs, combines them, and passes the result on.
2. Layers
• Input layer: Takes in the raw data (like images, text, or numbers).
• Hidden layers: Intermediate layers where the network processes and learns features.
• Output layer: Produces the final answer (like class labels or predictions).
3. Weights and Biases
• Weights: Numbers that control how much influence one neuron's output has on the next
neuron.
• Biases: Extra numbers added to the neuron's input to give the network flexibility.
4. Forward Propagation
• The process where data passes through the network from input to output layer.
• Each neuron calculates a weighted sum of inputs, adds bias, applies an activation
function, and passes the result forward.
5. Activation Functions
• Functions such as sigmoid, ReLU, tanh, or softmax applied to each neuron's output to add non-linearity (covered in more detail later).
6. Loss Functions
• A way to measure how wrong the network's predictions are compared to the actual
answers.
• The goal during training is to minimize this loss.
7. Backpropagation
• The process of propagating the error backwards through the network and computing gradients, which are then used to update the weights and biases so that the loss decreases.
8. Learning Rate
• A small number that controls how much the weights and biases are changed during each
update.
• Too high = may overshoot best solution; too low = slow learning.
• Radial Basis Function (RBF) networks, for example, work by checking how close the input is to certain points (like distance from a center) and are good for predicting trends in data.
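A compact NumPy sketch tying these pieces together: a forward pass through one hidden layer (weights, biases, activation) followed by a single gradient-descent style update on the output layer. The shapes, data, and learning rate are made up for illustration:

```python
# Hypothetical sketch: forward propagation and one weight update in a tiny network.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))            # one input sample with 3 features
y_true = np.array([[1.0]])             # target output

# Weights and biases for input->hidden (3->4) and hidden->output (4->1).
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Forward propagation: weighted sum -> add bias -> activation, layer by layer.
h = sigmoid(x @ W1 + b1)               # hidden layer output
y_pred = sigmoid(h @ W2 + b2)          # network output
loss = np.mean((y_pred - y_true) ** 2) # loss function (mean squared error here)

# Backpropagation (output layer only, to keep the sketch short):
grad_out = 2 * (y_pred - y_true) * y_pred * (1 - y_pred)   # dLoss/d(pre-activation)
W2 -= 0.1 * (h.T @ grad_out)           # learning rate 0.1 scales the update
b2 -= 0.1 * grad_out

print("loss before update:", round(float(loss), 4))
```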
3. Explain the architecture of a Multi-Layer Perceptron (MLP).
A Multi-Layer Perceptron is made up of three kinds of layers:
1. Input Layer
2. One or more Hidden Layers
3. Output Layer
In MLP, each neuron in one layer is fully connected to every neuron in the next layer, hence
it is called a fully connected network.
1. Input Layer
• This is the first layer; it simply receives the raw input features and passes them on to the hidden layers.
2. Hidden Layers
• These are middle layers between the input and output layers.
• They process the data and find patterns or relationships in the data.
• Each neuron here gets input from all neurons of the previous layer and gives output to
the next layer.
• You can have 1 or more hidden layers — this is what makes the network “deep” or
“shallow.”
3. Output Layer
• This is the last layer that gives the final output or prediction.
• Example:
o In binary classification, 1 or 2 neurons (for class 0 or 1).
o In multi-class classification, as many neurons as there are classes.
o In regression, usually 1 neuron (for the predicted value).
4. Weights (Strength of Connection)
• Every connection between two neurons has a weight that controls how strongly one neuron's output influences the next neuron; these weights are what the network learns during training.
5. Bias
• Each layer (except the input layer) has an extra bias neuron.
• Bias helps the network adjust the output — like an offset in a line equation (y = mx + c,
where 'c' is the bias).
• This makes the network more flexible to learn different patterns.
6. Activation Function
• After calculating the weighted sum, each neuron passes the result through an activation
function.
• This adds non-linearity so the network can solve complex problems.
• Common activation functions:
o Sigmoid: Output between 0 and 1.
o ReLU: Outputs zero or positive number.
o Tanh: Output between -1 and 1.
o Softmax: Used in classification to give probabilities.
4. What are Activation Functions? Why are they needed?
• An activation function decides whether a neuron should "fire" or not by changing the
output of the neuron.
• Without activation functions, a neural network would behave like simple linear regression
— unable to learn complex patterns.
• Activation functions add non-linearity so the network can handle more complex data like
images, sounds, or texts.
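A small NumPy sketch of the common activation functions mentioned above, applied to the same made-up inputs:

```python
# Hypothetical sketch: common activation functions applied to the same inputs.
import numpy as np

z = np.array([-2.0, 0.0, 2.0])

sigmoid = 1 / (1 + np.exp(-z))                 # squashes values into (0, 1)
relu    = np.maximum(0, z)                     # zero for negatives, identity for positives
tanh    = np.tanh(z)                           # squashes values into (-1, 1)
softmax = np.exp(z) / np.exp(z).sum()          # probabilities over the three values

print("sigmoid:", sigmoid.round(3))            # [0.119 0.5   0.881]
print("relu:   ", relu)                        # [0. 0. 2.]
print("tanh:   ", tanh.round(3))               # [-0.964  0.     0.964]
print("softmax:", softmax.round(3))            # [0.016 0.117 0.867]
```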
5. What are Tensors? Describe basic tensor operations.
Tensors are mathematical objects that describe linear relationships between sets of multidimensional data. They are a generalization of scalars (0-D), vectors (1-D), and matrices (2-D), which are all types of tensors. Basic tensor operations include element-wise addition, subtraction and multiplication, matrix multiplication, transposition, reshaping, and reductions such as sum and mean.
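A short sketch of these basic operations using TensorFlow (NumPy would look almost identical); the tensor values are made up:

```python
# Hypothetical sketch: creating tensors and applying basic operations in TensorFlow.
import tensorflow as tf

scalar = tf.constant(3.0)                         # rank-0 tensor
vector = tf.constant([1.0, 2.0, 3.0])             # rank-1 tensor
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])    # rank-2 tensor

print(tf.add(matrix, matrix))          # element-wise addition
print(tf.multiply(matrix, 2.0))        # element-wise (scalar) multiplication
print(tf.matmul(matrix, matrix))       # matrix multiplication
print(tf.transpose(matrix))            # transpose
print(tf.reshape(vector, (3, 1)))      # reshape to a column vector
print(tf.reduce_sum(matrix))           # reduction: sum of all elements -> 10.0
```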
6. Give a brief introduction to the TensorFlow framework.
TensorFlow is an open-source framework for machine learning (ML) and artificial intelligence (AI) that
was developed by Google Brain. It was designed to facilitate the development of machine learning
models, particularly deep learning models by providing tools to easily build, train and deploy them
across different platforms.
TensorFlow supports a wide range of applications from natural language processing (NLP) and computer
vision (CV) to time series forecasting and reinforcement learning.
1. Scalability
TensorFlow is designed to scale across a variety of platforms from desktops and servers to mobile
devices and embedded systems. It supports distributed computing allowing models to be trained on
large datasets efficiently.
2. Comprehensive Ecosystem
• TensorFlow Core: The base API for TensorFlow that allows users to define models, build
computations and execute them.
• Keras: A high-level API for building neural networks that runs on top of TensorFlow, simplifying
model development.
• TensorFlow Lite: A lightweight solution for deploying models on mobile and embedded devices.
• TensorFlow.js: A library for running machine learning models directly in the browser using
JavaScript.
• TensorFlow Hub: A repository of pre-trained models that can be easily integrated into
applications.
3. Automatic Differentiation
TensorFlow automatically calculates gradients for all trainable variables in the model which simplifies
the backpropagation process during training. This is a core feature that enables efficient model
optimization using techniques like gradient descent.
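A minimal sketch of this feature using tf.GradientTape; the loss function being differentiated is made up:

```python
# Hypothetical sketch: TensorFlow computes gradients automatically with GradientTape.
import tensorflow as tf

w = tf.Variable(2.0)                 # a trainable variable

with tf.GradientTape() as tape:
    loss = w ** 2 + 3 * w + 1        # some loss that depends on w

grad = tape.gradient(loss, w)        # d(loss)/dw = 2w + 3 = 7 at w = 2
w.assign_sub(0.1 * grad)             # one gradient-descent step: w <- w - lr * grad

print(float(grad), float(w))         # 7.0, 1.3
```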
4. Multi-language Support
TensorFlow is primarily designed for Python but it also provides APIs for other languages like C++, Java
and JavaScript making it accessible to developers with different programming backgrounds.