Machine Learning

Machine Learning (ML) is a branch of AI that allows computers to learn from data and improve performance without explicit programming, addressing the need for automation and efficiency in handling large datasets. It encompasses various types, including supervised, unsupervised, and reinforcement learning, each with distinct applications and challenges. Testing and validation are crucial for evaluating model performance, ensuring generalization, and preventing overfitting, with metrics like precision, recall, and the ROC curve used for performance assessment.


Unit 1

1. What is Machine Learning? Explain its need and relevance in today’s world.

Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to
learn from data and improve their performance on tasks without being explicitly programmed.
Instead of following fixed instructions, ML algorithms identify patterns in data and make
decisions or predictions based on that data.

Need for Machine Learning:

• The volume of data generated today is enormous and too complex for traditional
programming methods to handle effectively.
• ML automates decision-making by learning from data, reducing the need for manual
intervention.
• It enables systems to adapt and improve over time, making them more efficient and
accurate.

Relevance in Today’s World:

• ML powers many real-world applications like speech recognition, image classification,
recommendation systems, fraud detection, and autonomous vehicles.
• It helps businesses gain insights from large datasets, driving better strategies and
personalized services.
• With the growth of big data and AI, ML is essential for innovation across industries such
as healthcare, finance, retail, and transportation.
• It supports automation and smart technologies, making daily life easier and enhancing
productivity.
Key Challenges in Machine Learning:

• Data Quality and Quantity: ML models require large amounts of high-quality, relevant
data. Poor, noisy, or insufficient data can lead to inaccurate models.
• Overfitting and Underfitting: Overfitting occurs when a model learns noise in the
training data, reducing accuracy on new data. Underfitting happens when the model is too
simple to capture patterns.
• Feature Selection and Engineering: Choosing the right input features is crucial but
often difficult and time-consuming.
• Computational Complexity: Training complex models demands significant processing
power and time.
• Interpretability: Some models act as “black boxes,” making it hard to understand their
decision process.
• Generalization: Ensuring the model works well on unseen data remains a key challenge.
• Bias and Fairness: Models may reflect biases present in data, leading to unfair results.
• Data Privacy and Security: Handling sensitive data requires strict privacy and security
measures.
2. Describe the different types of Machine Learning with examples: Supervised,
Unsupervised, and Reinforcement Learning.

• Supervised Learning is a type of machine learning where the model is trained using labeled
data (input with the correct output).

• The machine learns from these examples and uses the patterns to predict outcomes for new,
unseen data.

• It’s called “supervised” because the learning is guided — like a teacher helping a student by
giving the right answers during practice.

Example (Fruit Basket Analogy):

Imagine a basket full of different fruits, and we want the machine to identify them.

During training:

• If the fruit is round, red, and has a small dip at the top, it is labeled as an Apple.
• If the fruit is long, curved, and yellow, it is labeled as a Banana.

Now, when we give the machine a new fruit that is yellow and curved, it compares it to what it
has learned, and predicts it’s a Banana.

Types of Supervised Learning:

1. Classification:
o The goal is to predict a discrete class label.
o Used when the output variable is categorical (e.g., yes/no, spam/not spam,
apple/banana).
o Example: Email spam detection, disease diagnosis (positive/negative), image
classification.
2. Regression:
o The goal is to predict a continuous numerical value.
o Used when the output variable is quantitative.
o Example: Predicting housing prices, stock market trends, temperature
forecasting.
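
For illustration (this sketch is not part of the original notes; the feature values and data points are made up), here is a minimal Python example with scikit-learn showing the two kinds of supervised learning side by side:

# Minimal sketch: classification vs. regression with scikit-learn (toy data, illustrative only)
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict spam (1) or not spam (0) from two made-up features
X_cls = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
y_cls = [1, 1, 0, 0]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0.85, 0.2]]))   # -> a discrete class label, e.g. [1]

# Regression: predict a house price (a continuous value) from size in square metres
X_reg = [[50], [80], [120], [160]]
y_reg = [150_000, 220_000, 310_000, 400_000]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))         # -> a continuous value (a price estimate)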

Applications of Supervised Learning:

• Image Recognition: Recognizing faces, objects, handwriting.


• Natural Language Processing (NLP): Sentiment analysis, spam filtering.
• Healthcare: Disease classification, medical image analysis.
• Finance: Credit scoring, fraud detection.
• Marketing: Customer segmentation and behavior prediction.
Unsupervised Learning is a type of machine learning where the algorithm is given input data
without any corresponding output labels. The main goal is for the algorithm to find hidden
patterns, structures, or relationships in the data without prior training signals. For example,
unsupervised learning can analyze animal data and group the animals by their traits and
behavior.

Types of Unsupervised Learning:

1. Clustering:
o Groups similar data points into clusters based on feature similarity.
o Example: Customer segmentation in marketing, grouping users with similar
behavior.
2. Dimensionality Reduction:
o Reduces the number of features in a dataset while preserving important
information.
o Example: PCA (Principal Component Analysis) used in image compression,
noise reduction.
3. Association Rule Learning:
o Discovers interesting relationships or associations among variables in large
datasets.
o Example: Market Basket Analysis – people who buy bread often buy butter.

Example – Fruit Grouping Without Labels:

Imagine a basket filled with fruits, but no labels (no mention of which is apple, banana, or
orange). The machine has to analyze the features of each fruit (like color, shape, size) and
group similar fruits together on its own.

🔹 Example:

• The algorithm might create:


o Cluster 1: Round, red fruits → likely apples
o Cluster 2: Long, yellow fruits → likely bananas
o Cluster 3: Round, orange fruits → likely oranges

Here, the machine doesn't know the names of the fruits — it only groups them based on
similarity. This is how clustering works — useful when you don’t know the categories in
advance.
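
The fruit-grouping idea can be sketched in a few lines of Python with scikit-learn's KMeans (the two features, colour score and length, and their values are invented purely for illustration):

# Minimal sketch: grouping unlabeled "fruits" by two made-up features (colour score, length in cm)
from sklearn.cluster import KMeans

X = [[0.9, 7.0], [0.85, 7.5],    # round red fruits
     [0.2, 18.0], [0.25, 19.0],  # long yellow fruits
     [0.6, 8.0], [0.65, 8.5]]    # round orange fruits
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # one cluster index per fruit; the algorithm never sees any fruit names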

💡 Applications of Unsupervised Learning:

• Customer Segmentation: Grouping customers based on purchasing behavior.


• Anomaly Detection: Identifying unusual data points (e.g., fraud detection).
• Document Clustering: Grouping news articles or research papers by topic.
• Recommender Systems: Suggesting products based on similar user preferences.
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make
decisions by interacting with an environment. The agent takes actions, observes the outcomes,
and receives feedback in the form of rewards or penalties. Over time, it learns an optimal policy
to maximize cumulative reward.

It is based on the concept of trial and error. Unlike supervised learning, there are no labeled
input/output pairs. The agent learns from the consequences of its actions rather than being
explicitly taught.

Core Components:

1. Agent: The learner or decision-maker that interacts with the environment.


2. Environment: The external system with which the agent interacts.
3. State (S): The current situation or configuration of the environment.
4. Action (A): All possible moves the agent can take.
5. Reward (R): Immediate feedback received after taking an action. Positive for good
actions, negative for bad ones.
6. Policy (π): The strategy used by the agent to decide which action to take in each state.
7. Value Function: A measure of the expected long-term return with respect to a state or
state-action pair.

Example: Consider a robot learning to walk in a simulated environment. It begins without any
prior knowledge. When it attempts to walk, it receives feedback:

• A reward of +1 for successfully taking a step without falling.


• A penalty of -10 for falling.
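
To make the agent/state/action/reward/policy terms concrete, here is a small tabular Q-learning sketch in Python (Q-learning is one standard RL algorithm, not necessarily the one the notes have in mind; the corridor environment, rewards, and hyperparameters are invented for illustration):

# Minimal Q-learning sketch on a made-up 1-D corridor (states 0..4, goal at state 4)
import random

n_states, actions = 5, [0, 1]          # action 0 = move left, action 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy policy: explore sometimes, otherwise act greedily
        action = random.choice(actions) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else -0.01   # +1 at the goal, small step penalty
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([q.index(max(q)) for q in Q[:-1]])  # learned policy: should print [1, 1, 1, 1] (always move right)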

Types of Reinforcement Learning:

1. Positive Reinforcement: Strengthens behavior by providing a positive outcome after the desired action. It improves the performance of the agent over time.
2. Negative Reinforcement: Strengthens behavior by removing a negative outcome when
the desired action is taken. It helps the agent avoid undesired states.

Applications:

• Game playing (e.g., AlphaGo, reinforcement-based chess engines)


• Robotics (e.g., motion control, path planning)
• Autonomous vehicles (e.g., self-driving cars learning to navigate)
• Industrial automation and manufacturing systems
• Resource allocation and scheduling problems
• Healthcare (e.g., treatment policy optimization)
3. Explain the importance of testing and validation in Machine Learning. How are they
performed?

Testing and validation are fundamental phases in the machine learning workflow. They are used
to evaluate the performance and generalizability of a trained model on data it has never
encountered before. These steps ensure that the model not only fits the training data but also
performs robustly on new, real-world data.

Validation in Machine Learning :- Validation refers to the evaluation of the machine learning model on a validation set, which is distinct from the training data but still labeled. It is primarily used during the training phase to tune model parameters and prevent overfitting.

Purpose:

• To monitor model performance on unseen data during training.


• To fine-tune hyperparameters such as learning rate, regularization coefficients, and
model architecture.
• To serve as an early indicator of overfitting or underfitting.
• To decide when to stop training (e.g., via early stopping criteria).

Process:

• Split the dataset into training and validation subsets (commonly 70-80% training, 20-30%
validation).
• Train the model on the training set.
• After each training iteration or epoch, evaluate the model on the validation set using
metrics like accuracy, loss, precision, recall, etc.
• Use validation results to optimize hyperparameters and select the best performing model
configuration.

Testing in Machine Learning:- Testing is the final evaluation step where the trained
and validated model is tested on a test set, which remains completely unseen during
both training and validation.

Purpose:

• To provide an unbiased assessment of the model's performance.


• To evaluate the model’s generalization capability on truly new data.
• To confirm that the model is ready for deployment or practical use.

Process: After training and validation, apply the final model on the test data. Measure
performance using appropriate metrics (e.g., accuracy, F1-score, ROC-AUC for classification;
mean squared error for regression). Use the results to make conclusive judgments about the
model's effectiveness.
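
As a concrete sketch (illustrative only; it uses scikit-learn's built-in breast-cancer dataset and arbitrary split ratios), the following Python code holds out a test set, carves a validation set out of the remainder, and reports a validation score for tuning plus a final test score:

# Minimal sketch: train / validation / test split and final evaluation
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score

X, y = load_breast_cancer(return_X_y=True)

# First hold out a test set, then split the rest into training and validation sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))  # used for tuning decisions
print("test F1-score:", f1_score(y_test, model.predict(X_test)))            # final, unbiased check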
5. What is classification in Machine Learning? Explain the MNIST dataset.

Classification is a supervised learning task where the goal is to assign input data points to one or
more predefined categories or classes based on their features. A classic example is the MNIST dataset:
a collection of 70,000 grayscale images (28 x 28 pixels) of handwritten digits from 0 to 9, split into
60,000 training and 10,000 test images, where the task is to classify each image into one of the ten
digit classes. The type of classification problem depends on the number and nature of the classes
involved. There are three main types of classification:

1. Binary Classification :- Binary classification is the simplest form of classification where the data is categorized into exactly two classes or categories.

• How it works:
The model learns from labeled examples of the two classes and predicts whether new
input belongs to Class A or Class B.
• Example:
Email spam detection is a classic example. The system analyzes emails and classifies
them as either “spam” or “not spam” based on features like certain keywords, sender
address, or message structure.
• Use Cases:
o Fraud detection (fraud / no fraud)
o Disease diagnosis (disease present / not present)
o Credit approval (approve / reject)

2. Multiclass Classification :- Multiclass classification involves categorizing data into more than two classes, but each data point can belong to only one class.

• How it works:
The model learns decision boundaries between multiple classes and assigns the input to
the single class that best fits its features.
• Example:
An image recognition system classifies pictures of animals into categories like “cat,”
“dog,” “bird,” and so on. The model evaluates the features of the image such as shape,
texture, and color to predict the correct label.
• Use Cases:
o Handwritten digit recognition (digits 0-9)
o Document classification (news, sports, entertainment)
o Species classification in biology
3. Multi-Label Classification :- In multi-label classification, each input data point can be
assigned multiple labels simultaneously.

• How it works:
Unlike multiclass classification, where the categories are mutually exclusive, multi-label
allows overlapping classes. The model predicts all relevant classes for each input.
• Example:
A movie recommendation system that tags a movie as both “action” and “comedy.”
Based on features like plot, actors, and genre, the model assigns multiple labels to the
same movie.
• Use Cases:
o Text categorization where documents may belong to multiple topics
o Music genre classification where a song can be tagged with multiple genres
o Medical diagnosis where a patient may have multiple concurrent conditions

Machine Learning classification is widely used in many practical applications, including:

• Email Spam Filtering: Classifies emails as spam or not spam by analyzing keywords
and sender information.
• Credit Risk Assessment: Predicts loan default risk using credit score, income, and loan
history to help banks decide on approvals.
• Medical Diagnosis: Identifies diseases (e.g., cancer, diabetes) from test results and
patient data to assist doctors in diagnosis.
• Image Classification: Used in facial recognition, autonomous driving, and medical
imaging to identify objects or conditions.
• Sentiment Analysis: Determines if text sentiment is positive, negative, or neutral,
helping businesses understand customer feedback.
• Fraud Detection: Detects fraudulent transactions by analyzing patterns in financial data
to prevent credit card and insurance fraud.
• Recommendation Systems: Suggests movies, products, or content based on user
preferences to improve personalization and sales.
6. Discuss performance evaluation metrics in Machine Learning: Confusion Matrix, Precision,
Recall, and ROC Curve.

1. Confusion Matrix

The confusion matrix is a fundamental tool for visualizing the performance of a classification
algorithm. It displays the counts of correct and incorrect predictions made by the model
compared to the actual outcomes.

It consists of four components:

• True Positive (TP): Instances correctly predicted as positive


• False Positive (FP): Instances incorrectly predicted as positive
• True Negative (TN): Instances correctly predicted as negative
• False Negative (FN): Instances incorrectly predicted as negative

Example: In a spam email classification problem:

• Total emails: 100


• Actual spam: 40
• Model predicts:
o TP = 30 (correctly identified spam)
o FP = 5 (non-spam marked as spam)
o FN = 10 (missed spam emails)
o TN = 55 (correctly identified non-spam)

This matrix is the basis for calculating many other performance metrics.

2. Precision

Precision evaluates the accuracy of positive predictions made by the model.

Formula:
Precision = TP / (TP + FP)

Interpretation: It indicates how many of the predicted positive cases were actually positive.

Example:
Precision = 30 / (30 + 5) = 0.857

Use Case: Precision is critical when the cost of false positives is high, such as in spam
detection, fraud alerts, or automated medical diagnoses.
3. Recall (Sensitivity or True Positive Rate)

Recall measures the model’s ability to identify all relevant positive cases.

Formula:
Recall = TP / (TP + FN)

Interpretation: It tells us how many of the actual positive cases were captured by the model.

Example:
Recall = 30 / (30 + 10) = 0.75

Use Case: Recall is essential when false negatives are more dangerous, such as in disease
screening or security breach detection.

4. F1 Score

The F1 Score is the harmonic mean of precision and recall. It provides a single score that
balances both concerns.

Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)

Using our previous values:


F1 = 2 × (0.857 × 0.75) / (0.857 + 0.75) ≈ 0.799

Use Case: F1 is ideal when both false positives and false negatives carry significant costs (e.g.,
loan approvals or cancer detection).
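
The numbers used above (TP = 30, FP = 5, FN = 10, TN = 55) can be reproduced in Python with scikit-learn; the label lists below are constructed only to match those counts from the spam example:

# Minimal sketch: confusion matrix, precision, recall and F1 for the spam example
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# 1 = spam, 0 = not spam
y_true = [1] * 30 + [0] * 5 + [1] * 10 + [0] * 55   # actual labels
y_pred = [1] * 30 + [1] * 5 + [0] * 10 + [0] * 55   # model predictions

print(confusion_matrix(y_true, y_pred))   # [[TN FP], [FN TP]] = [[55  5], [10 30]]
print(precision_score(y_true, y_pred))    # 30 / 35 ≈ 0.857
print(recall_score(y_true, y_pred))       # 30 / 40 = 0.75
print(f1_score(y_true, y_pred))           # ≈ 0.799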

5. ROC Curve and AUC (Area Under the Curve)

The Receiver Operating Characteristic (ROC) Curve is a plot of the True Positive Rate
(Recall) against the False Positive Rate (FPR = FP / (FP + TN)) across different classification
thresholds.

The Area Under the ROC Curve (AUC) quantifies the overall ability of the model to
discriminate between positive and negative classes.

• AUC = 1 indicates perfect classification


• AUC = 0.5 implies no better than random guessing
• AUC > 0.8 is generally considered good

Use Case: ROC and AUC are particularly useful for comparing multiple classification models
or evaluating model performance in imbalanced datasets.
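
For reference, a short Python sketch (the true labels and predicted probabilities below are invented toy values) showing how the ROC points and AUC are computed with scikit-learn:

# Minimal sketch: ROC curve points and AUC from predicted probabilities (toy values)
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]   # model's predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, y_score))       # a single number summarising the whole curve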
7. What is the Precision/Recall trade-off? How is it handled in practice?

In machine learning classification tasks, the precision/recall trade-off refers to the inverse
relationship between two key evaluation metrics:

• Precision = TP / (TP + FP): Measures how many predicted positive instances are actually
correct.
• Recall = TP / (TP + FN): Measures how many actual positive instances were correctly
identified.

Improving one of these metrics usually reduces the other. You cannot maximize both
simultaneously. This is called the precision/recall trade-off.

Why the Trade-off Happens


Models output probabilities. A threshold (e.g., 0.5) is used to classify results.

• Lower threshold → more positives → higher recall, lower precision.


• Higher threshold → fewer positives → higher precision, lower recall.

Examples of the Trade-off

1. High Precision Scenarios


These focus on reducing false positives.

• YouTube Restricted Mode: To protect children, the model must only allow truly safe
videos. It should avoid classifying harmful content as safe. Hence, the focus is on high
precision.
• Shoplifting Detection in Malls: If a system wrongly identifies innocent customers as
shoplifters, it can lead to major issues. Hence, the model must have high precision,
reducing false positives.

2. High Recall Scenarios


These focus on reducing false negatives.

• Disease Detection Model: In medical diagnosis, missing a disease case (false negative)
is dangerous. The goal is to detect as many true cases as possible, i.e., high recall.
• Loan Default Prediction: A model predicting whether a loan applicant is a defaulter
should aim to catch all true defaulters. False negatives (failing to flag actual defaulters)
can lead to large financial losses. Hence, high recall is desired.
How It Is Handled in Practice

1. Adjusting the Threshold (see the sketch after this list):
o Lower threshold → higher recall, lower precision.
o Higher threshold → higher precision, lower recall.
2. F1 Score:
The harmonic mean of precision and recall. Useful when both are equally important.
3. Precision-Recall Curve:
A visual tool to help choose a suitable threshold and balance.
4. Application-Driven Focus:
o If false positives are costly → focus on precision.
o If false negatives are risky → focus on recall.
5. Cost-sensitive Learning:
Models can be modified to penalize FP or FN differently, depending on the business
context.
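
As an illustration of point 1 above, here is a small Python sketch (using scikit-learn's breast-cancer dataset; the three threshold values are chosen only for demonstration) that moves the decision threshold and prints the resulting precision and recall:

# Minimal sketch: trading precision against recall by moving the decision threshold
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]            # predicted probability of the positive class

for threshold in (0.3, 0.5, 0.7):                    # lower threshold -> more predicted positives
    y_pred = (probs >= threshold).astype(int)
    print(threshold, precision_score(y_test, y_pred), recall_score(y_test, y_pred))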

8. Explain Multiclass Classification and its challenges.

Multiclass classification means teaching a computer to put things into more than two groups or
categories. Instead of just “yes” or “no” (which is binary classification), it decides between many
classes.

Example: If you have to recognize handwritten numbers from 0 to 9, the model has to choose
between 10 different classes. This is multiclass classification.

How Do We Do Multiclass Classification?

• One-vs-Rest (OvR): We make one model for each class. Each model learns to separate
that class from all others. Then, we pick the class with the strongest prediction.
• One-vs-One (OvO): We make a model for every pair of classes. For example, if there
are 3 classes (A, B, C), we make models for A vs B, A vs C, and B vs C. The class that
wins the most matches is chosen.
• Special Algorithms: Some models like Decision Trees or Neural Networks can directly
handle many classes at once.
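
A short Python sketch of the OvR and OvO strategies using scikit-learn's wrappers (the base estimator and the digits dataset are just convenient examples, not the only choices):

# Minimal sketch: One-vs-Rest and One-vs-One wrappers on the digits dataset (classes 0-9)
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000)).fit(X, y)  # one binary model per class (10 models)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=2000)).fit(X, y)   # one model per pair of classes (45 models)
print(ovr.predict(X[:1]), ovo.predict(X[:1]))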

Challenges in Multiclass Classification

1. Unequal Data (Class Imbalance): Some classes might have many examples, others very
few. The model might get better at recognizing common classes and ignore rare ones.
2. More Classes, More Confusion: When there are many classes, it’s harder for the model
to tell them apart.
3. More Work and Time: With many classes, the computer needs to do more calculations
and use more memory.
4. Harder to Measure Success:
Accuracy alone doesn’t tell the full story. We need other measures like precision and
recall for each class.
9. What is error analysis in Machine Learning? How can it improve model performance?

Error analysis is the systematic process of examining the mistakes or errors made by a machine
learning model after training. Instead of just looking at overall accuracy or loss, error analysis
digs deeper into what types of errors occur, why they happen, and under what conditions.
This detailed understanding is crucial for making targeted improvements to the model.

Why is Error Analysis Important?

When you first train a machine learning model, it rarely performs perfectly. Error analysis helps
answer important questions:

• Where exactly does the model fail?


• Are certain types of inputs causing more errors?
• Is the problem due to the data, the model, or both?

By answering these, error analysis enables more efficient and effective improvements, rather
than random guessing.

Example to Illustrate Error Analysis

Consider a speech recognition system that converts spoken words into text. The model might
work well in quiet environments but struggle with noisy backgrounds, accents, or different
microphones. Instead of blindly trying new models or features, error analysis allows you to:

• Tag errors based on environment types (quiet office, noisy car, street, etc.)
• Quantify which environment causes the most mistakes
• Focus your improvement efforts (like adding noise-robust features) where they matter
most

How Error Analysis Helps Improve Model Performance

• Data Quality Improvements:


Identifies mislabeled or low-quality data points. Cleaning or augmenting these data can
boost performance.
• Feature Engineering:
Reveals missing or weak features responsible for errors, guiding better feature design.
• Model Selection and Tuning:
Helps decide whether a different algorithm or parameter tuning is needed for difficult
cases.
• Focused Data Collection:
Suggests collecting more data from error-prone categories or environments to balance the
training set.
• Refining Preprocessing:
Shows if preprocessing steps like normalization or noise reduction need adjustment.
Unit 2
1. What is Linear Regression?

Linear regression is a supervised machine-learning algorithm that learns from labelled datasets and fits the most optimized linear function to the data points, which can then be used for prediction on new data. It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.

For example, suppose we want to predict a student's exam score based on how many hours they studied. We observe that as students study more hours, their scores go up. In this example:

• Independent variable (input): Hours studied because it's the factor we control or
observe.

• Dependent variable (output): Exam score because it depends on how many hours were
studied.

Types Of Linear Regression

Simple Linear Regression :- Simple Linear Regression is a technique used to predict the value
of one dependent variable using only one independent variable. It assumes a linear (straight-line)
relationship between the two variables.

Example: Predicting a person’s salary based on their years of experience.

Multiple Linear Regression :- Multiple Linear Regression is a technique used to predict the
value of one dependent variable using two or more independent variables. It assumes a linear
relationship between the dependent variable and the combination of all independent variables.
Example: Predicting a person’s salary based on their years of experience, education level, and
location.
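
A minimal Python sketch of both variants with scikit-learn (the salary figures and the extra features, education level and city cost index, are made up for illustration):

# Minimal sketch: simple and multiple linear regression (made-up salary data)
from sklearn.linear_model import LinearRegression

# Simple: salary from years of experience only
X_simple = [[1], [3], [5], [8]]
y_salary = [30_000, 45_000, 60_000, 85_000]
simple = LinearRegression().fit(X_simple, y_salary)

# Multiple: salary from experience, education level (coded 1-3) and city cost index
X_multi = [[1, 1, 0.8], [3, 2, 1.0], [5, 2, 1.1], [8, 3, 1.3]]
multi = LinearRegression().fit(X_multi, y_salary)

print(simple.predict([[4]]))            # predicted salary for 4 years of experience
print(multi.coef_, multi.intercept_)    # one coefficient per input feature, plus an intercept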

Use Cases of Multiple Linear Regression (MLR)

1. Real Estate Pricing :- MLR is used to predict property prices based on factors such as
location, size, number of bedrooms, and property type.

2. Financial Forecasting :- Financial analysts apply MLR to forecast stock prices or economic
indicators using variables like interest rates, inflation rates, and market trends.

3. Agricultural Yield Prediction


MLR helps estimate crop yields by analyzing factors such as rainfall, temperature, soil quality,
and fertilizer usage. This enables farmers to plan their agricultural practices more effectively.
2. Explain Gradient Descent and its types: Batch, Stochastic, and Mini-batch Gradient Descent.

Gradient Descent is an iterative optimization algorithm used to minimize a cost (loss) function by updating the model parameters in the direction of the negative gradient of the function.

It is widely used in machine learning and deep learning for training models by adjusting weights
to reduce prediction error.

How it Works (Simple Explanation):

Imagine you are at the top of a mountain and want to reach the lowest point (valley). You can’t
see the entire path, but you can feel the slope under your feet. Each step you take in the direction
of the steepest descent gets you closer to the bottom.
This is how Gradient Descent works: it updates model parameters step-by-step to minimize the
error.

Types of Gradient Descent

The difference between the three types lies in how much data they use to calculate the gradient
for each step.

1. Batch Gradient Descent

Definition:
Batch Gradient Descent calculates the gradient using the entire training dataset before updating
the model parameters. This means the model makes one update per epoch, after seeing all the
training examples.

Advantages:

• The updates are stable and smooth because they are based on the entire dataset.
• It converges steadily toward the minimum when the loss function is smooth.

Disadvantages:

• It is slow and requires high memory when the dataset is very large because it processes
all data for each update.

Example:
If you have 1,000 house price records, batch gradient descent calculates the error for all 1,000
houses first, averages them, and then updates the model once.
2. Stochastic Gradient Descent (SGD)

Definition:
Stochastic Gradient Descent updates the model parameters after processing each individual
training example. Instead of waiting for all data, it adjusts weights step-by-step using one sample
at a time.

Explanation:
Here, the model looks at one house price record, calculates the error, and immediately updates
the parameters. Then it moves to the next record and repeats the process.

Advantages:

• Faster updates, especially for very large datasets.


• It can escape local minimum points because of the randomness in updates.

Disadvantages:

• Updates are noisy and less stable, so the loss may fluctuate rather than smoothly
decrease.
• May take longer to fully converge due to the fluctuations.

3. Mini-Batch Gradient Descent

Definition:
Mini-Batch Gradient Descent uses a small fixed number of examples (called mini-batches) to
compute the gradient and update the model. This method balances between Batch and Stochastic
Gradient Descent.

Advantages:

• Combines the speed of SGD and stability of batch gradient descent.


• Allows faster computation using vectorized operations, especially on GPUs.

Disadvantages:

• Requires choosing a good mini-batch size.


• Some noise remains, but less than SGD.

Example:
The model updates itself after seeing batches of 32 house price records instead of all 1,000 or
just one.
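
The three variants can be compared with a short NumPy sketch (not from the notes; the synthetic data, learning rates, and epoch counts below are illustrative, not tuned). All three fit a simple linear regression whose true relationship is y = 4 + 3x plus noise; the only difference is how much data each update uses:

# Minimal sketch: batch, stochastic and mini-batch gradient descent on synthetic data
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(1000, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, size=1000)
Xb = np.c_[np.ones(len(X)), X]                 # add a bias (intercept) column

def gradient_step(theta, Xb_part, y_part, lr):
    grad = 2 / len(Xb_part) * Xb_part.T @ (Xb_part @ theta - y_part)   # gradient of the MSE loss
    return theta - lr * grad

theta_batch = np.zeros(2)
for epoch in range(1000):                      # Batch: one update per epoch using ALL 1,000 samples
    theta_batch = gradient_step(theta_batch, Xb, y, lr=0.1)

theta_sgd = np.zeros(2)
for epoch in range(5):                         # Stochastic: one update per single sample
    for i in rng.permutation(len(Xb)):
        theta_sgd = gradient_step(theta_sgd, Xb[i:i + 1], y[i:i + 1], lr=0.01)

theta_mini = np.zeros(2)
for epoch in range(50):                        # Mini-batch: one update per batch of 32 samples
    for start in range(0, len(Xb), 32):
        theta_mini = gradient_step(theta_mini, Xb[start:start + 32], y[start:start + 32], lr=0.1)

print(theta_batch, theta_sgd, theta_mini)      # all three should end up close to [4, 3]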
3. What is Polynomial Regression?

Polynomial regression is a type of regression where the relationship between the independent
variable and the dependent variable is modeled as a curve rather than a straight line. This allows
the model to capture more complex patterns in the data.

The general form of a polynomial regression equation of degree n is:

y = β0 + β1·x + β2·x^2 + … + βn·x^n + ε

where,

• y is the dependent variable.

• x is the independent variable.

• β0, β1, …, βn are the coefficients of the polynomial terms.

• n is the degree of the polynomial.

• ε represents the error term.

Unlike simple linear regression, which assumes a straight-line relationship between the input and
output, polynomial regression can fit data where the effect of the input variable on the output
changes direction or speed. It does this by including powers of the input variable, which creates a
curved line that better fits the data points.

Example:
Imagine you want to predict an employee’s salary based on their years of experience. At the start, salary
increases slowly as the employee gains some experience. Then, during the middle years, salary grows
faster as they gain valuable skills and responsibilities. After many years, the salary growth slows down or
levels off. This curved pattern can’t be captured well by a straight line, but polynomial regression fits this
curve and models the salary growth more accurately.
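
A short Python sketch of the idea with scikit-learn (the synthetic quadratic data and the choice of degree 2 are purely illustrative): polynomial features are generated first, and an ordinary linear regression is fitted on them.

# Minimal sketch: polynomial regression on synthetic curved data
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.3, size=100)   # a quadratic relationship

# Degree-2 polynomial features (x, x^2) feeding an ordinary linear regression
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))   # follows the curve, which a plain straight line could not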

Use Cases:

• Predicting sales that grow rapidly then stabilize.
• Modeling population growth that speeds up then slows down.
• Any situation where the relationship between variables is not a straight line but a smooth curve.
4. What are Learning Curves in Machine Learning? Explain the Bias/Variance Tradeoff.

Learning curves are graphs that show how well a machine learning model performs as it learns
from more data. They plot:

• Training Error: How well the model fits the data it learned from (training data).
• Test Error: How well the model predicts new, unseen data (test data).

They help identify if the model is underfitting or overfitting.

• Underfitting (high bias): Underfitting happens when your model is too simple to
capture variations and patterns in your data. The machine doesn’t learn the right
characteristics and relationships from the training data, and thus performs poorly with
subsequent data sets.
• Overfitting (high variance): Training error is very low, but test error is high. The model
memorizes training data but fails on new data.
• Adding more training data usually reduces overfitting (variance) but does not help much
with underfitting (bias).

Bias in Machine Learning

Bias is the error that happens when a model makes wrong assumptions about the data. It causes
the model’s predictions to be systematically different from the true values. Simply put, bias
means the model is too simple to learn the true pattern in the data.

• High bias :- A model with higher bias does not match the data set closely. It means the model is very simple and does not fit the training data well; it ignores important patterns and results in underfitting.
• Low bias :- A model with low bias closely matches the training data set. It means the model is flexible and can fit the training data well.

Example:
Imagine trying to predict salary based only on years of experience, assuming a simple straight
line relationship. But if the real relationship is more complex (like salary grows faster after some
years), a simple straight line model will have high bias and will give wrong predictions.

Variance in Machine Learning

Variance measures how much the model’s predictions change when trained on different subsets
of data. It shows how sensitive the model is to small changes in the training data.

• High variance means the model fits the training data too closely, including noise or
random details. It performs very well on training data but badly on new, unseen data —
this is called overfitting.
• Low variance means the model is stable and produces similar predictions even if the
training data changes.
Example:
If a model memorizes the salaries of individual employees perfectly but cannot predict well for
new employees, it has high variance.

Bias-Variance Tradeoff

• Ideally, you want a model with low bias and low variance, meaning it fits the data well
and generalizes to new data.
• However, decreasing bias by making the model more complex usually increases variance.
• Decreasing variance by simplifying the model usually increases bias.
• This balancing act is the bias-variance tradeoff — finding the right model complexity to
minimize total error.
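
For illustration, scikit-learn can compute learning curves directly; the sketch below (model choice, dataset, and sizes are arbitrary) plots training and validation scores against training set size, which is one way to see bias and variance in practice:

# Minimal sketch: plotting learning curves to inspect bias vs. variance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size"); plt.ylabel("accuracy"); plt.legend(); plt.show()
# A large persistent gap between the curves suggests high variance (overfitting);
# two low curves close together suggest high bias (underfitting).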

6. Compare Ridge Regression and Lasso Regression. When should each be used?

In Machine Learning, sometimes linear regression models can become too complex and overfit
the data — meaning they work well on training data but perform poorly on new data. To handle
this, we use regularization techniques, and the two most common ones are Ridge Regression
and Lasso Regression.

These are advanced versions of linear regression that add a penalty to the model to reduce
overfitting and improve generalization.

Ridge Regression (L2 Regularization) :- Ridge Regression is a type of linear regression that adds
a penalty to the squared values of the coefficients (weights). This penalty term helps shrink
the size of the coefficients but does not make them zero. It is useful when we have many
features and we want to keep all of them but avoid overfitting.

➤ Example (Salary prediction): Suppose you are predicting someone's salary using 10 features
like age, experience, education level, number of languages known, etc. Ridge regression will use
all the features, but it will shrink the influence of the less important ones.

Lasso Regression (L1 Regularization) :- Lasso Regression is another form of linear regression
that adds a penalty to the absolute values of the coefficients. Unlike Ridge, Lasso can reduce
some coefficients to zero, effectively removing those features from the model. So, it performs
both regularization and feature selection.

Lasso regression helps remove unnecessary features automatically. This leads to simpler models
that are easier to interpret and work well on new data.

➤ Example (Salary prediction):

Using the same example, if “number of siblings” or “distance from home” doesn’t really affect
salary much, Lasso will remove those features by making their coefficient zero.
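
A minimal Python sketch of the two, assuming scikit-learn (the salary data, the "siblings" feature, and the alpha values are invented for illustration; how many coefficients Lasso zeroes out depends on alpha and the data):

# Minimal sketch: Ridge vs. Lasso on the same made-up salary features
from sklearn.linear_model import Ridge, Lasso

X = [[1, 1, 0], [3, 2, 2], [5, 2, 1], [8, 3, 3], [10, 3, 0]]   # experience, education level, siblings
y = [30_000, 45_000, 60_000, 85_000, 100_000]

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1000.0).fit(X, y)

print(ridge.coef_)   # all coefficients shrunk, but none forced exactly to zero
print(lasso.coef_)   # with a large enough alpha, unhelpful features are driven exactly to zero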
7. What is Early Stopping in model training?

Early stopping is a regularization technique used in training machine learning models to prevent overfitting. It works by monitoring the model’s performance on a validation set, and stopping training when the validation performance starts to degrade, even though training accuracy may still improve.

While training a model, the performance on the training dataset improves as the model sees more
data. Initially, both training and validation errors decrease. But after a certain point, the model
starts to overfit — learning patterns specific to the training set instead of general patterns. This is
visible when:

• Training loss keeps decreasing


• Validation loss starts increasing
• Validation accuracy drops

This point is called the optimal stopping point.

Early stopping captures this point, and returns the model parameters from that iteration,
not the final one. Hence, it helps maintain good generalization and low variance.

Why is Early Stopping Important?

If we continue training after overfitting starts, the model becomes too tailored to the training
data and performs poorly on new, unseen data. Early stopping ensures we pause at the best
moment, balancing training accuracy and generalization.

This technique is considered an implicit form of regularization — it does not add a penalty
term like L1 or L2 (as in Lasso or Ridge), but it still helps control complexity and overfitting.
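
One way to implement this by hand is sketched below in Python (illustrative only; it assumes a recent scikit-learn, and the patience value, learning rate, and epoch count are arbitrary). Many libraries also ship built-in early-stopping options, so this manual loop is just to show the mechanism: keep the parameters from the best validation epoch and stop after several epochs without improvement.

# Minimal sketch of manual early stopping with a validation set and a patience counter
from copy import deepcopy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01, random_state=0)
best_score, best_model, patience, bad_epochs = -1.0, None, 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train, classes=[0, 1])   # one pass over the training data
    score = model.score(X_val, y_val)                     # monitor performance on the validation set
    if score > best_score:
        best_score, best_model, bad_epochs = score, deepcopy(model), 0   # remember the best epoch
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # stop when validation stops improving
            break

print("best validation accuracy:", best_score)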

Advantages of Early Stopping:

• Reduces overfitting
• Improves generalization
• Simple and easy to implement
• Requires less training data compared to other regularization methods
• Saves training time

Limitations of Early Stopping:

• If stopped too early, the model may underfit


• Not ideal for all models or datasets
• If validation data is not well chosen, it may give wrong stopping signals
• Running it many times may cause overfitting to the validation set
8. What is Logistic Regression? Explain its working.

Logistic Regression is a supervised machine learning algorithm used for classification tasks.
Despite its name, it is actually a classification algorithm, not a regression one. It is used when the
target variable is categorical, such as binary classification (e.g., spam or not spam, pass or fail, 0
or 1).

Types Of Logistic Regression :-

Binomial Logistic Regression: This type is used when the dependent variable has only two
possible outcomes. Examples include predicting whether a student passes or fails, or whether a
customer will buy a product or not. It is the most common type and is used for binary
classification problems.

Multinomial Logistic Regression: This is used when the dependent variable has three or more
categories that do not follow any specific order. For example, classifying types of transport like
bus, car, or train. These categories are distinct and unordered.

Ordinal Logistic Regression: This type is used when the dependent variable has three or more
categories with a natural order or ranking. Examples include rating a service as poor, average, or
excellent. The order of the categories is considered while modeling.

How Logistic Regression Works

Step 1: Take Input Data The model takes input values (called features), like marks, age,
income, hours studied, etc.

Step 2: Calculate a Score It combines all the input values using a simple formula to calculate a
single number (score). This score can be any number – positive or negative.

Step 3: Apply the Sigmoid Function The score is passed through a sigmoid function, which
converts it into a probability between 0 and 1.

• High score → Probability near 1


• Low score → Probability near 0
• Middle score → Probability around 0.5

Step 4: Make a Decision Now, the model checks the probability and decides:

• If the probability ≥ 0.5 → Predicts Class 1 (like Yes/Pass/Spam)


• If the probability < 0.5 → Predicts Class 0 (like No/Fail/Not Spam)

Step 5: Repeat & Improve (During Training)


While training, the model keeps adjusting itself to improve the accuracy using a method called
gradient descent.
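
A small Python sketch of these steps (the pass/fail data below is made up; the sigmoid function is written out explicitly just to show Step 3, and scikit-learn then handles Steps 2-5 internally):

# Minimal sketch: the sigmoid step and a scikit-learn logistic regression (made-up pass/fail data)
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(score):
    return 1 / (1 + np.exp(-score))    # squashes any score into a probability between 0 and 1

print(sigmoid(3.0), sigmoid(0.0), sigmoid(-3.0))   # ~0.95, 0.5, ~0.05

# Hours studied -> pass (1) or fail (0)
X = [[1], [2], [3], [4], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[5]]))   # probabilities for class 0 and class 1
print(clf.predict([[5]]))         # the class whose probability is >= 0.5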
9. What are Decision Boundaries in classification?

In classification problems, a decision boundary is a hypersurface that separates the feature space into regions, each associated with a different class label. It is the boundary that a classification algorithm creates to distinguish between different categories or classes based on the input features.

If p >= 0.5, predict class A.

If p < 0.5, predict class B.

If our threshold is 0.5 and our prediction function returns 0.7, we classify the observation as belonging to class A. If the prediction is 0.2, we classify the observation as belonging to class B.

So, the line where p = 0.5 is called the decision boundary.

1. Linear Decision Boundary

A linear decision boundary is a straight line (or plane in higher dimensions) that separates the
data into two classes. It is the simplest form of boundary and is used when the data is linearly
separable.
It can be expressed using a linear equation such as:
y = mx + b,
where m is the slope and b is the intercept.

2. Non-Linear Decision Boundary

A non-linear decision boundary is a curved or flexible line that separates the classes. It is used
when the data cannot be separated by a straight line. Models like SVM with RBF kernel or
Neural Networks can learn such boundaries. These boundaries adapt to the complex shape of
the data distribution.

3. Step-like (Axis-aligned) Decision Boundary

This type of decision boundary is made of horizontal and vertical lines that split the space like
blocks or steps. It is used in models like Decision Trees or Random Forests, which split data
using feature thresholds (e.g., age > 30). These boundaries are not smooth but look like stairs or
rectangles.

Purpose:

• Visualization: You can often visualize decision boundaries by plotting the data points and the boundary line or surface.
• Learning: During training, machine learning algorithms learn the optimal decision boundary that best separates the classes based on the training data.
10. What is Softmax Regression?

Softmax Regression is a method used in machine learning when you want to classify things into
more than two groups or categories. For example:

• Deciding if a photo shows a cat, a dog, or a bird (three categories).


• Classifying emails as work, personal, or spam.
• Recognizing handwritten numbers from 0 to 9 (ten categories).

How does Softmax Regression work? Step by step:

1. Look at the data features:


The computer starts with information about what it needs to classify. For example, in a
photo, it might look at colors, edges, shapes, or other details that describe the picture.
2. Calculate scores for each category:
Using what it has learned, the computer gives a “score” to each possible category for the
given data. Think of it like how confident the computer is about each category — but
these scores can be any number, positive or negative.
3. Convert scores into probabilities:
Because the scores themselves don’t add up to something meaningful, the computer
changes these scores into probabilities. This means it transforms the scores so that:
o All probabilities are between 0% and 100%.
o The probabilities for all categories add up exactly to 100%. This way, the
computer says something like:
o 60% chance it’s a dog
o 30% chance it’s a cat
o 10% chance it’s a bird
4. Pick the category with the highest probability:
The category with the largest probability is the one the computer chooses as the final
answer.

Why do we need this “softmax” step? If you just had raw scores, they might not be easy to
compare or interpret. For example, a score of 5 for “dog” and 3 for “cat” doesn’t directly tell you
how sure the computer is. By converting to probabilities, you get a clearer picture of how
confident the model is about each option.
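
The softmax step itself is only a few lines of Python with NumPy (the three scores below are invented, chosen so the output roughly matches the 60% / 30% / 10% example above):

# Minimal sketch: turning raw class scores into probabilities with the softmax function
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.3, 0.2])           # made-up scores for dog, cat, bird
print(softmax(scores))                       # roughly [0.60, 0.30, 0.10]; always sums to 1
print(softmax(scores).argmax())              # index of the most probable class (0 = dog)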

How does the computer learn these scores?

Before Softmax Regression can classify things well, it needs to learn from examples:

• It looks at many labeled examples (like pictures that are already tagged as “dog” or
“cat”).
• It adjusts the way it calculates scores to make its predictions better and better.
• It tries to give higher probabilities to the correct categories on the training examples. This
learning process is usually done by trying to minimize mistakes, so the model improves
over time.
11. Explain Cross Entropy and its role in classification.

Cross Entropy is a way to measure how close or far your model’s predictions are from the actual
answers in classification problems.

Imagine you have a test, and your model gives probabilities for each possible answer. Cross
Entropy tells you how bad or good your guesses are compared to the correct answers.

Why do we need Cross Entropy?

When training a model, you want it to get better at making predictions. But to improve, the
model needs to know how wrong it is on each guess. Cross Entropy acts like a score or penalty
for wrong guesses — the bigger the penalty, the worse the prediction.

What does Cross Entropy do during training?

1. Model makes a guess:


The model predicts probabilities for each class. For example, for an image of a cat, the
model might say:
o Cat: 70%
o Dog: 20%
o Bird: 10%
2. Check the real answer:
The real answer is “Cat.” So, the model should ideally give a probability close to 100%
for “Cat” and near 0% for others.
3. Calculate the Cross Entropy loss:
o If the model’s guess for “Cat” is close to 1 (100%), the Cross Entropy is low —
meaning the guess was good.
o If the model’s guess for “Cat” is low (like 20%), the Cross Entropy is high —
meaning the guess was bad.
4. Use Cross Entropy to improve:
The model looks at this penalty and adjusts its internal parameters to reduce this penalty
in the future — so it guesses better next time.
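
The penalty can be computed directly; the NumPy sketch below (probability values taken from the cat/dog/bird example above) shows how a confident correct guess gets a small loss and a confident wrong guess gets a large one:

# Minimal sketch: cross-entropy loss for one example whose true class is "Cat" (index 0)
import numpy as np

def cross_entropy(true_index, predicted_probs):
    return -np.log(predicted_probs[true_index])   # only the probability given to the true class matters

good_guess = np.array([0.70, 0.20, 0.10])   # Cat, Dog, Bird
bad_guess  = np.array([0.20, 0.70, 0.10])

print(cross_entropy(0, good_guess))   # ~0.36  (fairly confident and correct -> small penalty)
print(cross_entropy(0, bad_guess))    # ~1.61  (confident in the wrong class -> large penalty)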

Why is Cross Entropy a good choice for classification?

• It penalizes confident wrong guesses heavily — if the model is very sure about a wrong
answer, the penalty is large. This encourages the model to be cautious when it is unsure.
• It rewards confident correct guesses — if the model is sure about the correct class, it
gets a low penalty.
• It works well with probabilities, which are the model’s natural output for classification
tasks.
Unit 3
1. Explain SVM Classification and its types in detail.

Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It tries to find the best boundary known as hyperplane that
separates different classes in the data. It is useful when you want to do binary classification like
spam vs. not spam or cat vs. dog.

The main goal of SVM is to maximize the margin between the two classes. The larger the margin
the better the model performs on new and unseen data.

How does SVM work?

1. SVM looks at the data and tries to find a straight line (or plane) that divides the two
groups.
2. It doesn’t just find any line — it finds the one that keeps the two groups as far apart as
possible.
3. The points that are closest to this line are called Support Vectors — these are the most
important points because they help SVM decide where the line should go.
4. The gap between these closest points and the line is called the Margin — the bigger the
gap, the better the SVM thinks the model will work on new data.

Other Important Terms (in simple words):

• Hyperplane: The line or boundary SVM draws to separate the groups.


• Support Vectors: Special points that are closest to the boundary — they help SVM
decide the best line.
• Margin: The space between the two groups and the boundary — wider margin means
better and safer separation.
• Kernel: A magic tool that helps SVM to handle complex, non-straight-line problems.
• Hard Margin: No mistakes allowed — used when data is perfectly separable.
• Soft Margin: Some mistakes allowed — used when data is messy or noisy.
• C Parameter: Controls how many mistakes SVM can allow — a strict teacher or a
lenient one!
• Dual Problem: A method SVM uses in the background to make calculations easier and
faster.
1. Linear SVM Classification: Linear SVM is used when the given data can be separated
perfectly by a straight line (in 2D), a flat plane (in 3D), or a hyperplane (in higher dimensions).

Explanation: In Linear SVM, the algorithm finds the best straight boundary (hyperplane) that
divides the two classes.

• The main goal is to choose the hyperplane in such a way that the distance (called
margin) between the closest points of the two classes is maximized.
• The points that are closest to the hyperplane are called support vectors, and they play an
important role in defining the position of the hyperplane.

2. Soft Margin SVM Classification: Soft Margin SVM is used when the data is not perfectly
separable — i.e., when some points from different classes may overlap or be misclassified.

Explanation: Real-world data is rarely perfect — some points might lie on the wrong side of the
separating boundary.

• Soft Margin SVM allows some mistakes (misclassifications) by introducing a flexibility term.
• This flexibility ensures that SVM doesn’t try to perfectly separate every point (which
could cause overfitting) but instead finds a balance between accuracy and
generalization.
• A parameter called C (regularization parameter) controls how much error is allowed:
o A high C tries to reduce errors strictly (less flexible).
o A low C allows more errors but may generalize better.

When to use:

• When data is noisy, has outliers, or cannot be separated perfectly by a straight line.

3. Nonlinear SVM Classification: Nonlinear SVM is used when the data cannot be separated
by a straight line at all, no matter how hard you try.

Explanation: In such cases, SVM uses a special technique called the "Kernel Trick".

• A kernel function transforms the data into a higher-dimensional space where a straight
line (or hyperplane) can separate the data properly.
• This way, even if the original data is tangled or circular, SVM can find a good separation
in the new space.
• Common kernel functions:
o Polynomial Kernel, Radial Basis Function (RBF) or Gaussian Kernel, Sigmoid Kernel

When to use:

• When data is not linearly separable — like spirals, circles, or other complex shapes.
2. Explain the Polynomial Kernel and Gaussian RBF Kernel used in SVM.

The Polynomial Kernel is a kernel function that allows the SVM to create curved decision
boundaries instead of straight lines.

The formula of the Polynomial kernel is: K(x, y) = (x·y + c)^d, where c is a constant and d is the polynomial degree.

It is used in Complex problems like image recognition where relationships between features can
be non-linear.

Explanation:

• Sometimes, data is not separable by a straight line but can be separated by a curve.
• The Polynomial Kernel transforms the input data into a higher-dimensional space
where the SVM can draw a boundary shaped like a curve or even a complex shape.
• The degree of the polynomial (like square, cube, etc.) decides how complex the curve
will be.
o For example:
▪ Degree 2 (Quadratic) — makes a parabolic curve.
▪ Degree 3 (Cubic) — makes more flexible curves.

When to use:

• When the relationship between features is polynomial in nature.


• Suitable for problems where the data shows a curvy or wavy separation pattern.

2. Gaussian RBF Kernel (Radial Basis Function Kernel): The Gaussian RBF Kernel is the
most popular kernel that allows the SVM to make round or radial decision boundaries.

We use the RBF kernel when the decision boundary is highly non-linear and no prior knowledge about the data’s structure is available.
Explanation:

• This kernel transforms the data in such a way that each data point influences the space
around it like a small hill.
• The RBF Kernel can create very flexible boundaries that bend, curve, and wrap
around data clusters, no matter how complex they are.
• A parameter called gamma (γ) controls the influence:
o High gamma: Points have more local influence (sharp peaks).
o Low gamma: Points influence a wider area (gentler hills).

When to use:

• When data is highly non-linear and complex — like circular or spiral patterns.
• It is a good default choice when the data’s pattern is unknown because it can adapt to
various shapes.
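
A small Python sketch comparing the kernels on a toy non-linear dataset (scikit-learn's make_moons; the C, degree, coef0, and gamma values are illustrative, not tuned):

# Minimal sketch: linear, polynomial and RBF kernels on two interleaving half-circles
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

for model in (SVC(kernel="linear", C=1.0),
              SVC(kernel="poly", degree=3, coef0=1, C=1.0),   # coef0 plays the role of c in (x·y + c)^d
              SVC(kernel="rbf", gamma=0.5, C=1.0)):
    print(model.kernel, model.fit(X, y).score(X, y))          # the non-linear kernels fit this shape better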

Real World Applications of SVM Kernels

• Polynomial kernels are frequently applied in image classification tasks to identify objects or patterns in images. They help capture the complex relationships between pixel features, making them suitable for tasks like facial recognition or object detection.

• In text analysis such as sentiment analysis (classifying text as positive, negative, or neutral), SVMs with various kernels can handle different types of text data. Non-linear kernels, especially RBF, are useful when the relationships in the text features are complex.
3. What is SVM Regression?

Support Vector Regression (SVR) is a version of Support Vector Machine (SVM) that is used for
predicting continuous values instead of classifying data into categories.

How SVR Works:

SVR tries to find a function (like a line or curve) that fits the data points, but with some
flexibility. It draws a margin of tolerance (called epsilon, ε) around the function where small
errors are ignored. If a predicted value falls inside this margin, it is accepted without penalty.

If some points fall outside this margin, SVR adds a penalty based on how far they are. The goal
is to keep the function as simple as possible (to avoid overfitting), while fitting the data well
within this margin.

Kernels in SVR:

SVR can work in both linear and non-linear ways:

• Linear Kernel: Assumes a straight-line relationship between input and output.


• Non-linear Kernels (like RBF): Transform the input into a higher-dimensional space so
SVR can fit more complex, curved patterns in the data.

Choosing the right kernel depends on the shape and complexity of your data.

Important Parameters:

• C (Regularization): Controls how much error SVR is willing to tolerate outside the
margin.
o A large C means less tolerance to errors (fits training data closely).
o A small C allows more errors, helping the model generalize better.
• Epsilon (ε): Defines how wide the margin of tolerance is.
o A larger epsilon means a wider margin and fewer penalties.
o A smaller epsilon means the model tries to fit the data more closely.

Evaluating SVR:

After training SVR, you check how well it predicts new data using metrics like Mean Squared
Error (MSE) or Mean Absolute Error (MAE). Lower values of these metrics mean better
predictions.
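
A minimal Python sketch of SVR with scikit-learn (the synthetic sine-shaped data and the C and epsilon values are illustrative only):

# Minimal sketch: Support Vector Regression on noisy synthetic data
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=80)   # a curved relationship with noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)        # epsilon sets the width of the "no penalty" margin
svr.fit(X, y)
print("MSE:", mean_squared_error(y, svr.predict(X)))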
4. Describe Decision Trees and their role in Machine Learning.

A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes. It works like a flowchart that helps to make decisions step by step, where:

• Internal nodes represent attribute tests

• Branches represent attribute values

• Leaf nodes represent final decisions or predictions.

• Root Node:
This is the very first point of the tree. It has all the data and is where the first question or
split happens.
• Leaf Node:
These are the end points of the tree. They don’t split anymore and tell you the final
answer or prediction.
• Splitting:
This means dividing the data into smaller parts based on some rule or question about a
feature.
• Branch:
A branch is like a path that connects one question (node) to the next question or final
answer.
• Decision Node:
A place in the tree where a question is asked to decide how to split the data further.
• Pruning:
Cutting off extra parts of the tree that don’t help improve predictions. This stops the tree
from being too complicated and makes it better at working with new data.

Advantages of Decision Trees

• Simplicity and Interpretability: Decision trees are straightforward and easy to understand. You can visualize them like a flowchart, which makes it simple to see how decisions are made.

• Versatility: They can be used for different types of tasks and work well for both classification and regression.

• No Need for Feature Scaling: They don’t require you to normalize or scale your data.

• Handles Non-linear Relationships: They are capable of capturing non-linear relationships between features and target variables.

Disadvantages of Decision Trees

• Overfitting: Overfitting occurs when a decision tree captures noise and details in the training data, and it then performs poorly on new data.

• Instability: Instability means that the model can be unreliable; slight variations in the input can lead to significant differences in predictions.

• Bias towards Features with More Levels: Decision trees can become biased towards features with many categories, focusing too much on them during decision-making. This can cause the model to miss other important features, leading to less accurate predictions.

Applications of Decision Trees

• Loan Approval in Banking: A bank needs to decide whether to approve a loan
application based on customer profiles.

o Input features include income, credit score, employment status, and loan history.

o The decision tree predicts loan approval or rejection, helping the bank make quick
and reliable decisions.

• Medical Diagnosis: A healthcare provider wants to predict whether a patient has diabetes
based on clinical test results.

o Features like glucose levels, BMI, and blood pressure are used to make a decision
tree.

o The tree classifies patients into diabetic or non-diabetic, assisting doctors in
diagnosis.

• Predicting Exam Results in Education: A school wants to predict whether a student will
pass or fail based on study habits.

o Data includes attendance, time spent studying, and previous grades.

o The decision tree identifies at-risk students, allowing teachers to provide
additional support.
5. Explain the process of Training and Visualizing a Decision Tree.

Training a Decision Tree

1. Prepare the Data:
First, you gather your dataset with input features (like age, income, etc.) and the target
variable (what you want to predict).
2. Choose the Feature to Split:
The algorithm looks at all features and decides which one best separates the data into
groups based on how pure or mixed the groups are. This is done using measures like
information gain or Gini impurity.
3. Split the Data:
Based on the best feature, the data is split into smaller subsets. This creates branches in
the tree.
4. Repeat the Process:
For each subset, the algorithm again finds the best feature to split and repeats the
splitting. This continues until stopping rules are met, like reaching a maximum tree depth
or having very small subsets.
5. Create Leaf Nodes:
When no more splitting is possible or needed, the subsets become leaf nodes that give the
final prediction (a class label or value).

Visualizing a Decision Tree

• Once the tree is trained, you can visualize it as a flowchart or tree diagram.
• Each node shows the feature used for splitting and the condition (e.g., Age ≤ 30).
• Branches represent outcomes of these conditions.
• Leaf nodes show the predicted class or value.
• Visualization helps you understand how the model makes decisions step by step.

Why visualize?

• It makes the model easy to explain to others.
• It helps you see whether the tree is too complex or too simple.
• You can spot mistakes or overfitting by looking at the tree structure.
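
To make the steps above concrete, here is a minimal scikit-learn sketch (not part of the original notes). The Iris dataset and the parameter values are arbitrary choices for illustration.

```python
# Train a small decision tree and visualize it as a flowchart-like diagram.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

data = load_iris()
X, y = data.data, data.target

# max_depth acts as a stopping rule: it limits how many times the tree can split
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X, y)

# Each box in the plot shows the splitting feature, the condition, and the prediction
plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=data.feature_names, filled=True)
plt.show()
```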
6. What is the CART Training Algorithm?

CART stands for Classification and Regression Trees, which is a popular algorithm used to
build decision trees for both classification (predicting categories) and regression (predicting
numbers).

What CART Does:

CART builds a decision tree by repeatedly splitting the dataset into two parts, aiming to group
similar data points together. It uses binary splits, meaning each decision splits the data into
exactly two groups.

How CART Training Works:

1. Start with the whole dataset as the root of the tree.
2. Find the best split:
o The algorithm examines every feature (like age, salary, or any attribute) and tries
different split points.
o For each possible split, it measures how well the split separates the data into pure
groups.
o For classification problems, CART uses a measure called Gini impurity to find
the best split, which helps to minimize mixing of different classes.
o For regression problems, it uses measures like mean squared error (MSE) to
find splits that reduce prediction errors.
3. Split the data into two groups based on the best split found.
4. Repeat the splitting process on each group separately, applying the same method to find
the best splits for the smaller groups.
5. Stop splitting when:
o The groups are pure enough (mostly same class or very close values),
o Or a pre-set stopping rule is reached (like max depth of the tree, or minimum
number of samples in a node).
6. Assign predictions to the leaf nodes:
o For classification, the leaf node predicts the class that is most common in that
group.
o For regression, the leaf node predicts the average value of the data points in that
group.

Why is CART important?

• CART’s binary splitting makes the tree easy to understand and interpret.
• It works for both classification and regression problems.
• It’s the foundation for many advanced models like Random Forests and Gradient Boosted
Trees.
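
As an illustrative aside, scikit-learn's decision trees are based on CART, so a regression tree with MSE-style splits can be sketched as below. The data and parameter values are made up for the example.

```python
# CART-style regression tree: splits chosen to reduce squared error,
# leaf predictions are averages of the training targets in that leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny synthetic dataset: y is roughly a noisy function of x
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# "squared_error" corresponds to the MSE criterion described above;
# max_depth and min_samples_leaf are pre-set stopping rules
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, min_samples_leaf=5)
reg.fit(X, y)

print(reg.predict([[2.5], [7.0]]))  # each prediction is a leaf's average target value
```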
7. Explain Gini Impurity and Entropy as criteria in Decision Trees.

When a decision tree tries to split data into groups, it wants to make those groups as pure as
possible — meaning, groups where most items belong to the same category.

To decide which split is best, the tree uses measures to check how mixed or pure each group is
after splitting. The two common measures are:

1. Gini Impurity: The Gini index is the impurity metric used for classification tasks in CART.

• Imagine you have a bag of colored balls (say red and blue).
• Gini Impurity tells us how mixed the colors are in that bag.
• If the bag has only red balls, then impurity is 0 — it’s pure.
• If the bag has half red and half blue balls, the impurity is higher because it’s mixed.
• So, the lower the Gini Impurity, the better — because the group is more pure.

Think of it like this: If you randomly pick a ball from the bag and then guess its color
according to how common each color is, Gini Impurity tells you the chance that your guess is wrong.

2. Entropy

• Entropy is another way to measure how mixed a group is, but it comes from information
theory.
• It measures how uncertain or confusing the group is.
• If all balls are red, entropy is 0 — no confusion at all.
• If the balls are evenly split red and blue, entropy is at its highest — the group is very
confusing or uncertain.
• The goal is to split the data so the entropy decreases — meaning groups become less
mixed and easier to predict.

Think of it like this: Entropy measures how surprised you would be if you tried to guess the
color of a randomly picked ball from the bag.

Why use these measures?

The decision tree tries different ways to split the data and picks the split that makes the groups
as pure as possible — so it can make accurate predictions.

• If the groups after splitting have low Gini Impurity or low Entropy, that’s a good split.
• If not, the split is bad because the groups are still mixed and confusing.
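
As a small illustration (not from the notes), both measures can be computed directly from a list of class labels:

```python
# Compute Gini impurity and entropy for a group of labels.
import math
from collections import Counter

def gini_impurity(labels):
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def entropy(labels):
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

pure = ["red"] * 6                     # only red balls
mixed = ["red"] * 3 + ["blue"] * 3     # half red, half blue

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0  -> pure group
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0  -> maximally mixed group
```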
8. What are Regularization Hyperparameters in SVM and Decision Trees?
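
A short summary based on the material covered earlier in these notes: for SVMs, the main regularization hyperparameters are C (how much error is tolerated outside the margin) and, for SVR, epsilon (the width of the tolerance margin); for decision trees, regularization comes from limits such as maximum depth, minimum samples per split or leaf, and pruning. The sketch below shows where these appear in scikit-learn; the specific values are arbitrary examples.

```python
# Regularization hyperparameters for SVMs and decision trees in scikit-learn.
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier

# SVM: C trades off margin width against training errors;
# for SVR, epsilon sets the width of the error-tolerant tube.
svm_clf = SVC(kernel="rbf", C=1.0)
svm_reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)

# Decision tree: these limits stop the tree from growing too complex (overfitting).
tree_clf = DecisionTreeClassifier(
    max_depth=4,            # maximum number of split levels
    min_samples_split=10,   # a node must have at least this many samples to split
    min_samples_leaf=5,     # each leaf must keep at least this many samples
    ccp_alpha=0.01,         # cost-complexity pruning strength
)
```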

Unit 4
1. What is Deep Learning? Explain its need.

Deep Learning is a part of Artificial Intelligence (AI) and Machine Learning (ML) that
focuses on teaching computers to learn and make decisions by themselves, especially when
dealing with complex data.

It uses special models called Artificial Neural Networks inspired by how the human brain
works. These networks have many layers, so they are called deep networks.

How Does Deep Learning Work?

1. Neural Networks Basics


o Think of a neural network as a series of connected layers of nodes (called
neurons).
o Each neuron takes input, processes it, and passes the result to the next layer.
o The first layer takes raw data (like images, text, or sounds).
o The last layer gives the output (like classifying an image as “cat” or “dog”).
2. Layers
o Input layer: Receives raw data.
o Hidden layers: These layers do most of the work by extracting features and
learning patterns. More layers mean a “deeper” network.
o Output layer: Produces the final result, like a prediction or classification.
3. Learning Process
o The network learns by adjusting the connections (weights) between neurons.
o It tries to reduce the difference between its prediction and the actual answer using
something called a loss function.
o This adjustment is done repeatedly through a process called backpropagation
and optimization (e.g., gradient descent).

Where Is Deep Learning Used?

• Image Recognition: Identifying faces, objects, or handwriting.


• Speech Recognition: Understanding spoken language (e.g., Siri, Alexa).
• Natural Language Processing: Translating languages, answering questions, chatbots.
• Recommendation Systems: Suggesting movies, products, or music.
• Healthcare: Detecting diseases from medical images.
• Autonomous Vehicles: Self-driving cars use deep learning to recognize surroundings.
2. Introduce Artificial Neural Networks (ANN) and their core components.

Artificial Neural Networks contain artificial neurons, which are called units. These units are
arranged in a series of layers that together constitute the whole Artificial Neural Network in a
system.

A layer can have anywhere from a dozen units to millions of units, depending on how complex
the neural network needs to be to learn the hidden patterns in the dataset.

Commonly, an Artificial Neural Network has an input layer, an output layer, as well as hidden
layers. The input layer receives data from the outside world, which the neural network needs to
analyze or learn about. Then, this data passes through one or multiple hidden layers that
transform the input into data that is valuable for the output layer. Finally, the output layer
provides an output in the form of a response of the Artificial Neural Networks to the input data
provided.

Basic Components of Neural Networks

1. Neurons

• These are the tiny units or “nodes” in the network.


• Each neuron takes input, does some calculation, and sends output to the next layer.

2. Layers in Neural Networks

• Input layer: Takes in the raw data (like images, text, or numbers).
• Hidden layers: Intermediate layers where the network processes and learns features.
• Output layer: Produces the final answer (like class labels or predictions).

3. Weights and Biases

• Weights: Numbers that control how much influence one neuron’s output has on the next
neuron.
• Biases: Extra numbers added to the neuron's input to give the network flexibility.
4. Forward Propagation

• The process where data passes through the network from input to output layer.
• Each neuron calculates a weighted sum of inputs, adds bias, applies an activation
function, and passes the result forward.

5. Activation Functions

• Functions applied to each neuron's output to introduce non-linearity, enabling the
network to learn complex patterns.
• Common ones: ReLU, Sigmoid, Tanh.

6. Loss Functions

• A way to measure how wrong the network's predictions are compared to the actual
answers.
• The goal during training is to minimize this loss.

7. Backpropagation

• The process of adjusting weights and biases to reduce the loss.


• Errors are sent backward through the network to update the parameters using gradients.

8. Learning Rate

• A small number that controls how much the weights and biases are changed during each
update.
• Too high = may overshoot best solution; too low = slow learning.
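
To tie together the components above (weights, biases, activation, forward propagation, and the loss), here is a tiny NumPy sketch with made-up numbers. It is illustrative only and not from the notes.

```python
# One forward pass through a single layer of two neurons.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.2, 0.1])          # input layer: 3 features
W = np.array([[0.4, -0.6, 0.9],
              [0.3,  0.8, -0.5]])      # weights: 2 neurons x 3 inputs
b = np.array([0.1, -0.2])              # one bias per neuron

z = W @ x + b                          # weighted sum plus bias
a = sigmoid(z)                         # activation adds non-linearity

y_true = np.array([1.0, 0.0])
loss = np.mean((a - y_true) ** 2)      # a simple loss: mean squared error
print(a, loss)                         # backpropagation would now adjust W and b
```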

Types of Artificial Neural Networks (ANNs) — Easy Version

1. Feedforward Neural Network

• Data goes straight from input to output, no going back.


• Like a one-way street.
• Used for simple tasks like recognizing patterns.

2. Convolutional Neural Network (CNN)

• Special for images and speech.


• Finds small patterns (like edges in pictures).
• Used in things like face recognition or voice assistants.
3. Modular Neural Network

• Made of smaller networks working separately on parts of a problem.


• Each one does its own job, then their results combine.
• Helps with big, complicated tasks.

4. Radial Basis Function Network

• Works by checking how close the input is to certain points (like distance from a center).
• Good for predicting trends in data.

5. Recurrent Neural Network (RNN)

• Remembers past information to help make better decisions.


• Great for things like speech, text, or time series (like stock prices).

Where Neural Networks Are Used — Simple Examples

• Social Media: Suggesting friends you might know on Facebook.


• Shopping Websites: Recommending products based on what you looked at before.
• Healthcare: Helping doctors find diseases early from pictures or scans.
• Personal Assistants: Alexa or Siri understanding what you say and responding.
3. What is a Multi-Layer Perceptron (MLP)?

A Multi-Layer Perceptron (MLP) is a type of feedforward artificial neural network that
consists of multiple layers of nodes (also called neurons). It is called "multi-layer" because it
contains at least three layers:

1. Input Layer
2. One or more Hidden Layers
3. Output Layer

In MLP, each neuron in one layer is fully connected to every neuron in the next layer, hence
it is called a fully connected network.

Component Of Multi-Layer Perceptron

1. Input Layer (Starting Point)

• This is the first layer where the data goes in.


• Each input feature (like height, weight, age) gets its own neuron.
• Example: If you have 3 features, the input layer will have 3 neurons.
• This layer just passes the data to the next layer — it does not do any processing.

2. Hidden Layers (The Brain of MLP)

• These are middle layers between the input and output layers.
• They process the data and find patterns or relationships in the data.
• Each neuron here gets input from all neurons of the previous layer and gives output to
the next layer.
• You can have 1 or more hidden layers — this is what makes the network “deep” or
“shallow.”

3. Output Layer (Final Result)

• This is the last layer that gives the final output or prediction.
• Example:
o In binary classification, 1 or 2 neurons (for class 0 or 1).
o In multi-class classification, as many neurons as there are classes.
o In regression, usually 1 neuron (for the predicted value).
4. Weights (Strength of Connection)

• Every connection between two neurons has a weight.


• It shows how much importance the input has.
• Example: If "age" is more important than "height," its weight will be higher.
• The network learns and changes these weights during training to make better
predictions.

5. Bias Neuron (Special Helper)

• Each layer (except the input layer) has an extra bias neuron.
• Bias helps the network adjust the output — like an offset in a line equation (y = mx + c,
where 'c' is the bias).
• This makes the network more flexible to learn different patterns.

6. Activation Function (Making Data Smarter)

• After calculating the weighted sum, each neuron passes the result through an activation
function.
• This adds non-linearity so the network can solve complex problems.
• Common activation functions:
o Sigmoid: Output between 0 and 1.
o ReLU: Outputs zero or positive number.
o Tanh: Output between -1 and 1.
o Softmax: Used in classification to give probabilities.

7. Feedforward & Backpropagation (Learning Process)

• Feedforward: Data flows from input → hidden layer → output.


• Loss is calculated (how wrong the prediction is).
• Backpropagation: Error flows backward to adjust weights & biases to reduce the error.
• This happens again and again until the network gives good predictions.
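
The MLP structure described above can be sketched in Keras roughly as follows. The data here is random placeholder data and the layer sizes are arbitrary choices, not values from the notes.

```python
# A small fully connected MLP for binary classification.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Fake data: 100 samples, 3 input features, binary target (placeholder only)
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=(100,))

model = keras.Sequential([
    keras.Input(shape=(3,)),                 # input layer: 3 features
    layers.Dense(8, activation="relu"),      # hidden layer
    layers.Dense(4, activation="relu"),      # second hidden layer
    layers.Dense(1, activation="sigmoid"),   # output layer for class 0 or 1
])

# The loss measures how wrong predictions are; the optimizer applies
# backpropagation updates to the weights and biases during training.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```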
4. Explain activation functions with examples: Sigmoid and ReLU.

• An activation function decides whether a neuron should "fire" or not by changing the
output of the neuron.
• Without activation functions, a neural network would behave like simple linear regression
— unable to learn complex patterns.
• Activation functions add non-linearity so the network can handle more complex data like
images, sounds, or texts.
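
Since the question asks specifically about Sigmoid and ReLU: Sigmoid squashes any input into the range 0 to 1 (useful for probabilities), while ReLU passes positive values through unchanged and turns negative values into 0. A brief sketch using their standard definitions is below; the sample inputs are arbitrary.

```python
# Standard definitions of the two activation functions named in the question.
import numpy as np

def sigmoid(x):
    # Output is always between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Output is 0 for negative inputs, the input itself for positive inputs
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(sigmoid(x))  # values between 0 and 1
print(relu(x))     # [0. 0. 0. 1. 3.]
```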
5. What are Tensors? Describe basic tensor operations.

Tensors are mathematical objects that describe linear relationships between sets
of multidimensional data. They are a generalization of scalars, vectors, and matrices, which are
all types of tensors.
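
To show what basic tensor operations look like in practice, here is a small TensorFlow sketch (the framework introduced in the next answer); the shapes and values are chosen arbitrarily.

```python
# Scalars, vectors, and matrices are all tensors of different ranks.
import tensorflow as tf

scalar = tf.constant(3.0)                       # rank-0 tensor (a scalar)
vector = tf.constant([1.0, 2.0, 3.0])           # rank-1 tensor (a vector)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank-2 tensor (a matrix)

added = matrix + matrix                  # element-wise addition
scaled = 2.0 * matrix                    # scalar multiplication
product = tf.matmul(matrix, matrix)      # matrix multiplication
reshaped = tf.reshape(vector, (3, 1))    # change shape without changing data
summed = tf.reduce_sum(matrix)           # reduce all elements to a single value

print(product.numpy(), summed.numpy())
```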
6. Give a brief introduction to the TensorFlow framework.

TensorFlow is an open-source framework for machine learning (ML) and artificial intelligence (AI) that
was developed by Google Brain. It was designed to facilitate the development of machine learning
models, particularly deep learning models, by providing tools to easily build, train, and deploy them
across different platforms.

TensorFlow supports a wide range of applications from natural language processing (NLP) and computer
vision (CV) to time series forecasting and reinforcement learning.

Key Features of TensorFlow

1. Scalability

TensorFlow is designed to scale across a variety of platforms from desktops and servers to mobile
devices and embedded systems. It supports distributed computing allowing models to be trained on
large datasets efficiently.

2. Comprehensive Ecosystem

TensorFlow offers a broad set of tools and libraries including:

• TensorFlow Core: The base API for TensorFlow that allows users to define models, build
computations and execute them.

• Keras: A high-level API for building neural networks that runs on top of TensorFlow, simplifying
model development.

• TensorFlow Lite: A lightweight solution for deploying models on mobile and embedded devices.

• TensorFlow.js: A library for running machine learning models directly in the browser using
JavaScript.

• TensorFlow Extended (TFX): A production-ready solution for deploying machine learning
models in production environments.

• TensorFlow Hub: A repository of pre-trained models that can be easily integrated into
applications.

3. Automatic Differentiation (Autograd)

TensorFlow automatically calculates gradients for all trainable variables in the model which simplifies
the backpropagation process during training. This is a core feature that enables efficient model
optimization using techniques like gradient descent.

4. Multi-language Support

TensorFlow is primarily designed for Python but it also provides APIs for other languages like C++, Java
and JavaScript making it accessible to developers with different programming backgrounds.
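
As a quick illustration of the automatic differentiation feature mentioned above, tf.GradientTape can compute the gradient of a toy function; the values here are chosen arbitrarily.

```python
# Automatic differentiation with TensorFlow's GradientTape.
import tensorflow as tf

w = tf.Variable(2.0)  # a trainable variable

with tf.GradientTape() as tape:
    loss = w ** 2 + 3.0 * w  # a toy "loss" as a function of w

grad = tape.gradient(loss, w)  # d(loss)/dw = 2*w + 3 = 7 at w = 2
print(grad.numpy())
```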
