Solution
1. a) Differentiate between supervised and unsupervised learning.
Answer:
Supervised and unsupervised learning are two main ways machines learn from data. They are different in how they
use data.
• Supervised Learning:
o How it learns: The computer learns from "labeled" data. This means that for every piece of
information it gets, the correct answer or "label" is already given. Think of it like a student with a
teacher.
o Goal: To learn a rule to map inputs to known outputs. Then, it predicts outputs for new, unseen data.
o Data: Requires input data that has correct answers already attached.
o Example: Training a program to tell if an email is spam. You give it many emails, each already marked
as "spam" or "not spam." The program learns from these examples.
• Unsupervised Learning:
o How it learns: This type of learning uses "unlabeled" data. The computer gets data without any
correct answers. It has to find patterns or structures in the data on its own. It's like a student
exploring a new topic without a teacher.
o Goal: To find hidden patterns or groups within the data. It tries to organize the data into meaningful sets.
▪ Clustering: Grouping similar data points together (like grouping similar customers).
▪ Association: Finding rules about how things go together (like what products are often bought together).
o Example: Grouping customers for a store. You give the program data about their purchases, but you
don't tell it what groups to make. It finds similar customer groups on its own.
In Summary: Supervised learning uses data with answers to predict new answers. Unsupervised learning explores
data without answers to find hidden patterns or groups.
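To make the contrast concrete, here is a minimal Python sketch (a rough illustration, not part of the original answer): a scikit-learn classifier learns from labeled examples, while K-Means groups the same points without any labels. The tiny dataset and the "spam" interpretation of the labels are invented for illustration.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])   # labels: 0 = "not spam", 1 = "spam" (hypothetical)

# Supervised: learn from (X, y) pairs, then predict labels for new inputs.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[5.5, 8.5]]))        # predicted label for a new point

# Unsupervised: only X is given; the algorithm finds groups on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                       # cluster assignment per point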
1. b) Briefly explain the need for Inductive Bias in Decision Tree Learning.
Answer:
Inductive bias is a set of assumptions that a learning program uses to guess outputs for information it hasn't seen
before. It's very important for decision trees to learn well.
• Limited Training Data: Real-world data is often incomplete. Without inductive bias, a decision tree might just
memorize the training data perfectly. This is called "overfitting." When overfitting happens, the tree works
badly on new data. Inductive bias helps the tree make smart guesses for new situations.
• Too Many Possible Trees: There are often many, many different decision trees that could perfectly fit the
same training data. Inductive bias helps the program pick the best and often simplest tree from all these
choices. Without it, the program wouldn't know which tree is better.
• Guiding How to Split Data: When building a decision tree, the program needs to decide the best way to split
the data at each step. This choice is guided by a form of inductive bias. It helps the tree focus on features
that make the best splits.
• Preventing Overfitting: Inductive biases, like "pruning" (cutting off branches from the tree), help prevent the
tree from becoming too complicated and too focused on the training data. This makes the tree perform
better on data it hasn't seen.
In simple terms: Inductive bias gives the learning program a "hint" or a "preference" on how to learn and generalize
from limited data, so it doesn't just memorize and can make good predictions for new situations.
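As a small illustration of one such bias (a preference for simpler trees), the sketch below compares an unrestricted decision tree with a depth-limited (pre-pruned) one on synthetic data. The dataset and the max_depth value are arbitrary choices for demonstration; the deep tree typically memorizes the training set while the shallow one generalizes better.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # no simplicity bias
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr) # pre-pruned tree

print("deep   : train", deep.score(X_tr, y_tr), "test", deep.score(X_te, y_te))
print("shallow: train", shallow.score(X_tr, y_tr), "test", shallow.score(X_te, y_te))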
2. a) What is Principal Component Analysis (PCA) in machine learning? How does PCA help in reducing the
dimensionality of data?
Answer:
Principal Component Analysis (PCA) is a common method in machine learning that helps simplify large datasets. It's
used when you have too many features (or columns) in your data. PCA works by turning many original features into a
smaller number of new, simpler features called Principal Components (PCs).
1. Finds Key Directions: Datasets often have features that are related to each other (like house size and number
of rooms). PCA finds new, independent "directions" in the data where the information is most spread out.
o The first Principal Component (PC1) captures the most information from the original data.
o The second Principal Component (PC2) captures the next most important information, in a direction
that is totally separate from the first PC.
2. Ranks by Importance: PCA ranks these new PCs by how much information they hold. The first few PCs
usually contain most of the important stuff.
3. Reduces Number of Features: You then choose to keep only the top PCs that hold most of the important
information. For example, if you had 10 original features, you might reduce them to just 2 or 3 PCs.
o This way, you simplify your data, making it much easier to work with, visualize, and faster for other
machine learning programs to process. You keep most of the important information but with fewer
features.
In simple terms: PCA finds the main patterns in your data and combines related features into fewer, stronger
features. This shrinks your data without losing too much important detail.
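A minimal sketch of this idea with scikit-learn's PCA, using randomly generated correlated features (all numbers below are illustrative, not real data):
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))            # 2 hidden underlying factors
X = base @ rng.normal(size=(2, 10))         # 10 correlated observed features
X += 0.05 * rng.normal(size=X.shape)        # a little noise

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)            # 10 features compressed to 2 PCs
print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # share of information kept by PC1 and PC2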
2. b) Evaluate the effectiveness of biologically inspired neural network architectures in solving complex machine
learning tasks.
Answer:
Neural networks, which are inspired by the human brain, are very effective at solving tough machine learning
problems. Their design is based on how our brains work.
1. Doing Many Things at Once (Parallel Processing): Our brains can handle many tasks at the same time. Neural
networks do this too. They have many small parts (like brain neurons) that work together. This makes them
very fast for big tasks like recognizing images or understanding speech in real-time.
2. Learning Complex Patterns: Our brain processes information in complex, non-linear ways. Neural
networks use special "activation functions" that let them learn these tricky, non-linear patterns in
data. This helps them solve hard problems like recognizing faces or understanding complex language.
3. Storing Information Widely: In the brain, information isn't stored in just one spot; it's spread out. Neural
networks also store their learning (weights) across the whole network. This means if a small part of the
network is damaged, it can often still work, making it tough and reliable.
4. Adapting to Incomplete Information: Neural networks can still make good guesses even if some of the input
data is missing or a bit messy. They learn general patterns, so they can adapt and fill in gaps.
5. Handling Different Data Types: They can work with many kinds of data, from numbers to pictures and
sounds. This makes them useful for a wide range of tasks like spam filters or identifying objects in photos.
Even though they can be hard to understand and need powerful computers, their ability to learn like a brain makes
them very good at many complex machine learning tasks.
Answer:
• Hypothesis Space:
o When we choose a machine learning program (like a neural network or a decision tree), we are also
choosing all the possible "rules" or "answers" that this program could learn.
o This entire collection of all possible answers is called the hypothesis space.
o The learning program then searches within this space to find the best rule that fits the training data
well and can make good predictions for new data.
• Inductive Bias:
o The "hypothesis space" can be huge, sometimes endless. So, a learning program needs some built-in
assumptions or preferences to help it pick one rule over others. This is especially true when it needs
to guess for data it has never seen.
o It helps the program generalize from the data it has learned from to make predictions for new
situations. Without these assumptions, the program would just memorize the training data
(overfitting) and wouldn't be able to guess for new things.
In short: Hypothesis space is all the possible solutions a program could find. Inductive bias is the program's built-in
"preference" or "guesswork" that helps it choose the best solution from that space, especially for new data.
Answer:
Machine Learning, while very powerful, faces several important problems or challenges that can affect how well its
models work.
1. Lack of Sufficient Data:
o Issue: ML models learn from data, so they need a lot of it. A big problem is not having enough high-quality and relevant data to train a model properly. If data is scarce, the model can't learn strong patterns.
2. Poor Data Quality:
o Issue: Even if there's a lot of data, if it's messy, incomplete (missing info), inaccurate (wrong info), or contains noise (unwanted details), the model's accuracy will suffer badly. This is often summed up as "garbage in, garbage out."
3. Non-Representative Training Data:
o Issue: For a model to work well on new, real-world data, the data it trained on must be a good sample of all the situations it will face. If the training data doesn't properly represent new situations, the model won't make good predictions for them.
4. Overfitting:
o Issue: This happens when a model learns the training data too perfectly, including all its little errors
and random quirks. It becomes overly specific and essentially just memorizes the training examples.
Because of this, it works great on the training data but performs very poorly on new, unseen data.
5. Underfitting:
o Issue: This is the opposite of overfitting. It happens when a model is too simple or hasn't been
trained enough. It fails to learn the important patterns in the training data. This leads to poor
performance on both the training data and new data.
6. Lack of Interpretability:
o Issue: For complex ML models, it can be very difficult to understand why the model made a certain prediction. This "black box" nature can make people less trusting of the model, especially in important areas like healthcare.
7. High Computational Cost:
o Issue: Training very advanced and large ML models requires a lot of computing power, including special processors and memory. This can be expensive and require significant resources.
Answer:
Principal Component Analysis (PCA) is a popular method in machine learning that helps simplify large datasets. It's
used when you have too many features (or columns) in your data. PCA works by turning many original features into a
smaller number of new, simpler features called Principal Components (PCs).
Imagine you have a dataset about different cars. Each car has many features like:
• Engine size
• Horsepower
• Weight
• Fuel efficiency
• Top speed
• Number of seats
Some of these features might be closely related. For example, "Engine size," "Horsepower," and "Top speed" might all
be high for a sports car and low for an economy car. "Weight" might also be related to "Engine size."
1. Finds New Directions (Principal Components): PCA looks for the main "directions" in your data where the
information is most spread out.
o The First Principal Component (PC1) captures the biggest chunk of information from the original
data. For our car example, PC1 might represent "overall performance" (a mix of engine size,
horsepower, and top speed).
o The Second Principal Component (PC2) captures the next biggest chunk of information, but it's
completely separate from the first PC. PC2 might represent "practicality" (a mix of fuel efficiency and
number of seats).
2. Ranks by Importance: PCA then ranks these new PCs by how much information they explain. The first few
PCs usually hold most of the important details.
3. Reduces Number of Features: You can then choose to keep only the top-ranked PCs (e.g., PC1 and PC2). By
doing this, you've reduced your data's features from many original ones (like 6 in our example) down to just a
few (like 2 PCs).
o Even though you have fewer features, you've still kept most of the important information about the
cars, but in a simpler form. This makes your data easier to understand, visualize, and faster for other
ML programs to use.
Key Points:
• It changes related features into new, unrelated features called Principal Components.
• It helps simplify data by keeping most of the important information with fewer variables.
Answer:
Machine Learning (ML) is a field of Artificial Intelligence (AI) that allows computers to learn from data without
needing to be told exactly what to do. It helps computers find patterns, make decisions, or predict things based on
past information.
Machine Learning models, despite their power, often face several common challenges:
1. Lack of Sufficient Data:
o Problem: ML models need to learn from data. A big issue is not having enough data that is relevant and of high quality. If there isn't enough good data, the model can't learn well and will give poor results.
2. Poor Data Quality:
o Problem: Even if there's a lot of data, its quality matters a lot. Data that is messy, incomplete (missing information), inaccurate (wrong details), or has noise (unwanted extra details) will make the model's predictions bad.
3. Non-Representative Training Data:
o Problem: For a model to work well in the real world, the data it learned from must truly represent what it will see later. If the training data is not like the new data, the model won't make good predictions.
4. Overfitting:
o Problem: This happens when a model learns the training data too perfectly, including all its small
errors and random quirks. It becomes overly specific and basically just memorizes the training
examples. Because of this, it works great on the training data but very badly on new, unseen data.
5. Underfitting:
o Problem: This is the opposite of overfitting. It happens when a model is too simple or hasn't been
trained enough. It fails to learn the important patterns in the training data. This leads to poor
performance on both the training data and new data.
6. Lack of Interpretability:
o Problem: For complex ML models, it can be very difficult to know why the model made a certain prediction. This "black box" problem can make people trust the model less, especially in critical areas like healthcare.
7. High Computational Cost:
o Problem: Training very advanced and large ML models needs a lot of computer power, including special processors and memory. This can be costly and requires significant resources.
Unit II: Neural Networks
Artificial Neural Networks (ANNs) are like computer models that are inspired by the human brain. Our brain has
billions of tiny parts called neurons that are all connected. These connections help us learn, recognize things, and
make decisions very quickly.
ANNs try to copy this brain structure. They have "nodes" (like brain neurons) that are connected to each other in
layers. This design helps ANNs learn from data and adjust their connections to do tasks. It shows how computers can
process many things at once, just like our brains.
Key Ideas:
• The human brain is good at doing many things at once (parallel processing).
• Information in the brain is stored in many places, which helps us use our memory.
A Neural Network (NN) is a collection of connected "nodes" (like tiny processors) arranged in layers.
The Layers:
• Input Layer: This is where the data first enters the network.
• Hidden Layer(s): These layers are between the input and output. They do most of the complex calculations
to find patterns in the data.
• Output Layer: This layer gives the final result or prediction from the network.
Every connection between nodes has a "weight." This weight shows how strong or important that connection is.
When the network learns, it changes these weights. Each node takes inputs from the previous layer, adds them up
(with their weights), includes a "bias" (an extra value), and then puts it through an "activation function" to decide
what to pass on.
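A tiny sketch of what a single node computes, assuming a sigmoid activation and made-up inputs, weights, and bias (the numbers are purely illustrative):
```python
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias     # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid activation decides what to pass on

x = np.array([0.5, 0.2, 0.1])              # inputs from the previous layer
w = np.array([0.4, -0.6, 0.9])             # connection weights (learned during training)
b = 0.1                                    # bias
print(neuron(x, w, b))                     # value passed to the next layer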
Key Points:
Parallel processing means doing many things at the same time. Neural Networks are built to do this naturally.
Each node in a Neural Network can work independently, but also together with other nodes. When you give data to
the network, information travels through all the connections and nodes at the same time. This makes NNs very fast
and efficient, especially for big and complex tasks. This is like how our brain's billions of neurons work together
simultaneously.
Key Points:
• This involves distributing the network itself (model parallelism) or the training data (data parallelism) across
different processors.
What is a Perceptron?
The Perceptron is the simplest kind of Artificial Neural Network. It was one of the very first models and is used for
basic "yes" or "no" (binary) decisions.
How it Works:
A perceptron takes several inputs. Each input is multiplied by a "weight" (its importance). All these weighted inputs
are added together. Then, it uses a simple rule (an "activation function") to decide if the final output should be 0 or 1.
How it Learns:
If the perceptron makes a wrong guess, its weights are adjusted slightly. This helps it make better guesses next time.
This process repeats until it learns to correctly classify the data. However, a single perceptron can only solve "linearly
separable" problems, meaning data that can be perfectly split by a straight line.
Key Points:
• Can only solve linearly separable problems (data that can be split by a straight line).
Training a perceptron means repeatedly adjusting its weights and bias (an extra value) so it can correctly guess for
the training data.
The Steps:
1. Start with Guesses: Give the weights and bias small random starting numbers.
2. Make a Prediction: For each training example, the perceptron makes a guess (0 or 1) based on its current
weights.
3. Check the Guess: Compare the perceptron's guess to the actual correct answer.
4. Adjust if Wrong: If the guess is wrong, the weights and bias are changed using a special "learning rule." This
rule helps reduce the error for the next guess.
5. Repeat: Do this for all training examples many times until the perceptron guesses correctly for most or all of
them.
Key Points:
• The perceptron makes a prediction, checks it, and then updates its weights if wrong.
• The learning rate controls how big the weight adjustments are.
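A minimal sketch of this training loop in plain Python/NumPy, on the linearly separable logical AND problem; the learning rate and number of epochs are arbitrary illustrative choices.
```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                        # AND truth table

w = np.zeros(2)                                   # step 1: start with initial weights
b = 0.0                                           # ... and bias
lr = 0.1                                          # learning rate: size of each adjustment

for epoch in range(10):                           # step 5: repeat over the data
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0  # step 2: make a prediction
        error = target - pred                     # step 3: check the guess
        w += lr * error * xi                      # step 4: adjust weights only if wrong
        b += lr * error

print(w, b)                                                  # learned weights and bias
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])       # predictions for each input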
What is an MLP?
A Multilayer Perceptron (MLP) is a more advanced type of Neural Network. It's called "multilayer" because it has one
or more "hidden layers" between the input and output layers.
Unlike a simple perceptron, MLPs can learn much more complex patterns in data, even patterns that can't be
separated by a straight line. This is because the hidden layers, combined with special "activation functions," allow the
network to find very complicated relationships. Information flows forward through these layers, building up
understanding step by step.
Key Points:
• Has one or more hidden layers between the input and output layers.
• Can learn complex, non-linear patterns that a single perceptron cannot.
What is Backpropagation?
Backpropagation is a key algorithm used to train neural networks like MLPs. It's how the network learns from its
mistakes.
1. Forward Pass: The input data is passed forward through the network, layer by layer, to produce a prediction.
2. Backward Pass:
o First, we figure out how wrong the prediction was (the "error").
o Then, this error is sent backward through the network, from the output layer all the way to the input
layer.
o As the error moves backward, the algorithm calculates how much each connection's "weight"
contributed to the mistake.
o Finally, all the weights are adjusted a little bit to reduce the error for next time.
This process repeats many times until the network becomes good at making correct predictions. It uses calculus to
figure out the right adjustments.
Key Points:
• Involves a forward pass (making a prediction) and a backward pass (calculating and distributing error).
• Needs special "activation functions" that can be used for these calculations.
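A tiny numerical sketch of one forward-and-backward step for a 2-2-1 network with sigmoid activations; all inputs, weights, the target, and the learning rate are made up for illustration, and the loss is the simple squared error.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])                      # input
t = 1.0                                       # target output
W1 = np.array([[0.4, -0.2], [0.3, 0.8]])      # input -> hidden weights (illustrative)
W2 = np.array([0.7, -0.5])                    # hidden -> output weights (illustrative)
lr = 0.5                                      # learning rate

# Forward pass: compute the prediction.
h = sigmoid(W1 @ x)                           # hidden activations
y = sigmoid(W2 @ h)                           # network output

# Backward pass: send the error backward and compute each weight's share of it.
delta_out = (y - t) * y * (1 - y)             # error signal at the output node
delta_hid = delta_out * W2 * h * (1 - h)      # error signals at the hidden nodes

W2 -= lr * delta_out * h                      # adjust output weights
W1 -= lr * np.outer(delta_hid, x)             # adjust hidden weights

print("before update:", y)
print("after update :", sigmoid(W2 @ sigmoid(W1 @ x)))   # prediction moves toward the target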
Training:
This is the process where the Neural Network learns from the labeled data. The goal is for the network to adjust its
internal weights and biases so it can predict outputs very well. It keeps changing its settings to make the errors as
small as possible.
Validation:
Validation is a very important step during or just after training. We use a separate part of the data, called the
"validation set," that the network has never seen before.
• Checking Performance: We use this set to see how well the model is really learning and if it's getting too
specific to the training data (overfitting).
• Adjusting Settings: It helps us choose the best settings for the network (like how fast it learns).
• Stopping Early: If the model starts doing worse on the validation data, we can stop training early. This
prevents overfitting.
Key Points:
• Validation: Checks how well the network performs on new, unseen data.
Activation functions are special mathematical rules used inside each node (neuron) of a Neural Network. After a
node adds up all its weighted inputs, the activation function decides if that node should "turn on" or "fire."
• Add Non-Linearity: Without them, even a many-layered network would just act like a simple straight line.
Activation functions allow the network to learn complex, curvy, non-linear patterns.
• Decide "On" or "Off": They determine if a node's signal is strong enough to be passed to the next layer.
• Scale Output: They can also make sure the output of a node stays within a certain range (like between 0 and
1).
Common Types:
• ReLU (Rectified Linear Unit): Simple rule: If input is positive, output is the input; otherwise, output is 0. Very
popular because it's fast and helps solve some learning problems.
• Softmax: Used in the final output layer for problems with multiple categories (e.g., predicting if an image is a
cat, dog, or bird, giving probabilities for each).
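A short sketch of these two functions in NumPy (the input values are chosen only to show the behaviour):
```python
import numpy as np

def relu(z):
    return np.maximum(0, z)          # positive inputs pass through, negatives become 0

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()               # probabilities that sum to 1

z = np.array([2.0, -1.0, 0.5])
print(relu(z))                       # [2.  0.  0.5]
print(softmax(z))                    # e.g. scores for "cat", "dog", "bird" as probabilities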
These are issues that can happen when training deep Neural Networks, making them learn very slowly or become
unstable. They happen because of how errors are sent backward through many layers (backpropagation).
• Vanishing Gradients:
o What: The error signals become extremely tiny as they go backward through many layers.
o Why: This often happens when many small numbers are multiplied together, for example the derivatives of saturating activation functions (like sigmoid), which are close to zero when their inputs are very large or very small.
o Effect: The layers closer to the start of the network barely learn at all, stopping the overall learning
process.
• Exploding Gradients:
o What: The error signals become extremely large, blowing up during backpropagation.
o Why: This happens if the network's weights are too large, causing the error to grow out of control
with each step backward.
o Effect: The network becomes very unstable, weights change drastically, and learning breaks down.
How to Fix Them:
• For Vanishing: Use different activation functions (like ReLU), or special network designs (like LSTMs, which handle long-term memory).
• For Exploding: Use "gradient clipping" (limiting how large the error signals can get), or careful ways to set
initial weights.
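A toy illustration of why this happens: repeatedly multiplying per-layer factors smaller or larger than 1 makes the backpropagated signal shrink toward zero or blow up. The factors and the clipping threshold below are arbitrary numbers chosen only to show the effect.
```python
import numpy as np

layers = 50
small_factors = np.full(layers, 0.5)    # e.g. derivatives of a saturating activation
large_factors = np.full(layers, 1.5)    # e.g. the effect of overly large weights

print(np.prod(small_factors))   # ~8.9e-16  -> the gradient vanishes
print(np.prod(large_factors))   # ~6.4e+08  -> the gradient explodes

# Gradient clipping caps the size of the signal before the weight update.
grad = np.prod(large_factors)
clipped = np.clip(grad, -5.0, 5.0)
print(clipped)                  # 5.0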
Unit III: Supervised Learning Techniques
A Decision Tree is a supervised learning model that looks like a tree. It helps make decisions or predictions by breaking down data into smaller and smaller parts. It can be used for both "yes/no" type answers (classification) and guessing numbers (regression).
• Structure of a Decision Tree:
o Internal Node (Decision Node): This is where a question is asked about a feature (like "Is age > 30?"). It's labeled with an input feature.
o Branch (Link): The lines coming out of a node are "branches," representing the answers to the question or a decision rule.
o Leaf Node (Terminal Node): These are the very end points of the tree. They give the final answer or prediction, which is a class label or a continuous value. In classification, a leaf node's label is determined by the majority vote of the training examples that reach it.
o Root Node: This is the starting node at the very top of the tree.
• The tree keeps asking questions and splitting the data until it reaches a final answer. The goal is to make the purest groups possible at the end.
• Classification: A classification tree learns a set of logical "if-then" conditions for assigning categories, for example, discriminating between flower types based on their features.
• Regression: Used when the target variable is a number (continuous). Each split is chosen to minimize the sum of squared errors.
Key Ideas:
• Used for both classification and regression.
• Made of a root node, internal (decision) nodes, branches, and leaf nodes.
• Works by repeatedly splitting the data with simple questions until pure groups are reached. (See the sketch below.)
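A minimal sketch with scikit-learn, training a small tree on the classic iris flower dataset and printing its learned if-then rules; the depth limit is an arbitrary choice for readability.
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))  # the if-then rules
print(tree.predict(iris.data[:1]))                                # predicted class for one flower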
Naive Bayes is a classification algorithm that works based on probability, using a rule called Bayes' Theorem. It's "naive" because it assumes that all features (like a person's age, income, and job) are independent of each other when predicting a class (like whether they will buy a product). This means it treats each feature as if it does not affect the others.
How it Works:
It calculates the probability that a given input belongs to each possible category, and then picks the category with the highest probability. Even with its simple independence assumption, Naive Bayes often performs surprisingly well, especially with a lot of data, and it is very fast and efficient.
Key Ideas:
• It's a probabilistic classification algorithm.
• Based on Bayes' Theorem, with a "naive" assumption that features are independent.
• Good for tasks like spam detection or analyzing the sentiment of text. (See the sketch below.)
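A small sketch of a Naive Bayes spam filter, using word counts as features; the four-email corpus and its labels are invented purely for illustration.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "cheap money offer", "meeting at noon", "lunch tomorrow?"]
labels = [1, 1, 0, 0]                           # 1 = spam, 0 = not spam (hypothetical)

vec = CountVectorizer().fit(emails)             # turn text into word counts
model = MultinomialNB().fit(vec.transform(emails), labels)

print(model.predict(vec.transform(["free money offer"])))            # likely [1] (spam)
print(model.predict_proba(vec.transform(["team meeting tomorrow"]))) # probability per class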
What is Classification?
Classification is a type of supervised learning where the computer learns to put things into predefined groups or categories. The answer it gives is always one of these specific labels. It is one of the main tasks of supervised learning.
How it Differs:
• Output: The main thing about classification is that its output is always a category (like "yes" or "no", "cat" or "dog").
• Learning: The model learns from data that already has these categories marked (labeled data).
Examples:
• Is this email spam or not spam?
• Is this image a cat or a dog?
Key Points:
• Predicts a category (class label), not a continuous number.
• Learns from labeled training data.
An SVM is a powerful supervised learning algorithm used for both classification and regression problems. Its main goal is to find the best way to separate different groups of data by drawing a clear boundary.
Imagine you have data points scattered on a graph, and you want to draw a line to separate two different types of points (like circles and squares).
• Hyperplane: The SVM tries to find the "best" line (or plane, if you have more features) that separates these groups. This line is called a hyperplane. The dimensions of the hyperplane depend on the features in the dataset.
• Maximum Margin: The "best" line is the one that has the largest possible gap (or "margin") between it and the closest data points from each group. A large margin is considered a good margin.
• Support Vectors: The data points that are closest to this separating line are called "support vectors." These are the critical points that "support" or define the position of the hyperplane.
Key Points:
• Goal: To find the hyperplane with the maximum margin separating the classes.
• The support vectors are the points that define this boundary. (See the sketch below.)
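A minimal sketch with scikit-learn's SVC on invented 2-D points, showing the support vectors it selects; the kernel and C value are default-style illustrative choices.
```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)            # the points that define the margin
print(svm.predict([[3, 3], [7, 5]]))   # classify two new points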
A Random Forest is a very popular and powerful ML algorithm for both classification and regression problems. It's based on an idea called "Ensemble Learning," which means combining many simpler models (specifically, decision trees) to get a better, more robust result. As its name suggests, it builds a "forest" of decision trees.
How it Works:
• It builds many decision trees, each trained on a different random subset of the given data.
• For classification problems, the Random Forest then takes a "vote" from all the trees and chooses the prediction that the majority of trees agreed on.
• Having many trees helps improve overall accuracy and prevents overfitting (where a single tree might be too specific to the training data).
Key Ideas:
• Combines many decision trees (ensemble learning).
• Uses random subsets of the data to train each tree.
• Takes a majority vote (for classification) or an average (for regression) across the trees.
• Helps achieve higher accuracy and less overfitting than a single tree. (See the sketch below.)
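A minimal Random Forest sketch on synthetic data; the number of trees and the dataset size are arbitrary illustrative choices.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
print("trees in the forest:", len(forest.estimators_))   # the ensemble that gets to "vote"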
What is Linear Regression?
Linear Regression is a supervised learning algorithm used for regression problems. Its goal is to find the best straight line that describes the relationship between an input feature (or features) and a continuous numerical output.
Imagine you have data points on a graph (like house size vs. house price). Linear regression tries to draw a straight line that comes closest to all these points.
• Minimizing Errors: It does this by minimizing the "residuals" or "errors," which are the distances between each actual data point and the line. Specifically, it minimizes the sum of the squared differences between the observed values and the predicted values from the model. This is called the "Ordinary Least Squares (OLS)" method.
• Equation of the Line: The fitted line has an equation. For a single input it is Y = β0 + β1X + e; with several inputs it becomes Y = β0 + β1X1 + β2X2 + ... + βnXn + e. Here, Y is the output (dependent variable), the X values are inputs (independent variables), the β values are the coefficients (showing how much each input affects the output), and 'e' is the error term.
Before using Linear Regression, certain things should ideally be true about your data for the model to be reliable:
• Linearity: There should be a straight-line relationship between the input features and the output.
• Independence of Errors: The errors (differences between predicted and actual values) should not be related
to each other.
• Homoscedasticity: The spread of errors should be roughly constant across all levels of the input variables
(constant variance of errors).
• Normality of Errors: The errors should be normally distributed (follow a bell-shaped curve).
• No Multicollinearity: The independent input features should not be too highly correlated with each other.
Key Points:
• Used for predicting continuous numerical outputs.
• Finds the best-fitting straight line through the data.
• Minimizes the sum of squared errors (residuals).
• Has assumptions (linearity, independent errors, constant error variance, normal errors, no multicollinearity) that should hold for reliable results.
What is OLS?
Ordinary Least Squares (OLS) is the most common technique used for regression analysis. It's specifically how the "best-fitting line" in Linear Regression is found.
How it Works:
The main idea of OLS is to make the differences between the actual data points and the line as small as possible.
• Errors/Residuals: These are the differences between the actual value and the value predicted by the model.
• Sum of Squared Residuals (RSS): OLS squares each of these errors and then adds them up. Squaring the errors means larger errors are penalized more, and positive and negative errors don't cancel each other out.
• Minimizing This Sum: The OLS method then finds the line (by figuring out the best slope and intercept) that makes this "sum of squared residuals" as small as possible. This line is called the "Regression Line" and represents the best fit for the data.
Key Points:
• A standard technique for fitting a linear regression model.
• Its core is minimizing the sum of squared residuals (RSS) between actual and predicted values.
• Results in the best-fitting "Regression Line." (See the sketch below.)
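A small sketch of OLS on invented data, solving for the intercept and slope that minimize the sum of squared residuals (NumPy's least-squares solver does the minimization here):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. house size (made-up values)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # e.g. house price (made-up values)

X = np.column_stack([np.ones_like(x), x])       # add a column of 1s for the intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares (OLS) solution
intercept, slope = beta

print(intercept, slope)                         # fitted line: y = intercept + slope * x
print(((y - X @ beta) ** 2).sum())              # the sum of squared residuals being minimized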
What is Logistic Regression?
Logistic Regression is a supervised learning algorithm used for classification; despite its name, it predicts the probability that an input belongs to a category rather than a continuous number.
How it Works:
• Instead of fitting a straight line to the data (like linear regression), Logistic Regression uses a special S-shaped
curve called the "sigmoid function."
• This curve squashes any input value into a probability between 0 and 1.
• If the calculated probability is above a certain cutoff (e.g., 0.5), the model assigns it to one class; otherwise, it
assigns it to the other.
Key Points:
• A classification algorithm that uses the sigmoid (S-shaped) function to output probabilities between 0 and 1.
• A cutoff (commonly 0.5) turns the probability into a class label. (See the sketch below.)
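A small sketch showing the sigmoid squashing scores into probabilities, and a fitted logistic model applying the 0.5 cutoff; the 1-D data is invented for illustration.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-3.0, 0.0, 3.0])))   # ~[0.05, 0.5, 0.95]: any score becomes a probability

X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

print(model.predict_proba([[4.5]]))          # probability of each class
print(model.predict([[4.5]]))                # class chosen with the 0.5 cutoff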
4.1 Clustering
What is Clustering?
Clustering is a task in unsupervised learning. It's about taking a group of unlabeled data points and dividing them into different "clusters" or groups. The goal is to put similar data points into the same cluster, and points that are different into different clusters.
Imagine you run a store and want to understand your customers. You can't look at every single customer's details. Instead, clustering can group your customers into, say, 10 groups based on their buying habits. Then, you can create different marketing plans for each group. This helps make sense of lots of data without knowing the answers beforehand.
Key Points:
• It's an unsupervised learning task performed on unlabeled data.
• Goal: Put similar data points in the same cluster and dissimilar points in different clusters.
• Helps to make sense of large amounts of data without known answers.
• Example: Grouping a store's customers by their buying habits (customer segmentation); in finance, clustering is used for market segmentation rather than stock price prediction (which is usually supervised).
Advantages:
• Discover Hidden Patterns: Can find groups in data without prior knowledge.
• Anomaly Detection: Outliers might form their own small clusters or sit far away from existing clusters.
• No Labeled Data Needed: Works with unlabeled data, which is often abundant.
Disadvantages:
• Requires Choosing K (for some methods): For algorithms like K-Means, you need to decide the number of clusters (K) beforehand, which can be difficult.
• Sensitivity to Initial Conditions: Some algorithms (like K-Means) can be sensitive to where the cluster centers start.
• Difficulty with Irregular Shapes: Many algorithms struggle with clusters that are not round or clearly separated.
• Interpretation Can Be Hard: Understanding what each cluster truly means can sometimes be challenging.
K-Means Clustering is a very popular unsupervised learning algorithm. It groups unlabeled data into a specific number of clusters, which we call "K". For example, if K=2, it will create two clusters; if K=3, it will create three clusters. It is an iterative algorithm that divides unlabeled data into K different clusters.
1. Choose K: You first decide how many clusters (K) you want.
2. Pick Centers: The algorithm randomly picks K starting points called "centroids" (center points for each cluster).
3. Assign Points: Each data point is assigned to the cluster whose centroid is closest to it.
4. Update Centers: Once all points are assigned, the centroid of each cluster is recalculated to be the actual center of all points in that cluster.
5. Repeat: Steps 3 and 4 are repeated. Points might move to different clusters, and centroids keep shifting until they stop moving much, meaning the clusters are stable.
The main aim is to minimize the sum of distances between the data points and their corresponding cluster centers. Each data point will belong to only one group.
Key Points:
• An unsupervised algorithm that divides unlabeled data into K predefined clusters.
• It's an iterative algorithm.
• Each cluster is represented by its centroid.
• Goal: Minimize the sum of distances between data points and their cluster centers.
Example of K-Means Clustering (showing the process):
Let's cluster the data {2, 4, 10, 12, 3, 20, 30, 11, 25} into two groups (K=2), starting with centroids m1 = 2 and m2 = 4.
• Iteration 1: Assign each point to its nearest centroid: Cluster 1 = {2, 3}, Cluster 2 = {4, 10, 12, 20, 30, 11, 25}. Update the centroids: m1 = (2+3)/2 = 2.5, m2 = (4+10+12+20+30+11+25)/7 = 16.
• Iteration 2: Reassign the points: Cluster 1 = {2, 3, 4}, Cluster 2 = {10, 11, 12, 20, 25, 30}. Update: m1 = (2+3+4)/3 = 3, m2 = (10+11+12+20+25+30)/6 = 18.
• Iteration 3: Point 10 is now closer to m1 (|10-3| = 7) than to m2 (|10-18| = 8), so it moves: Cluster 1 = {2, 3, 4, 10}, Cluster 2 = {11, 12, 20, 25, 30}. Update: m1 = 4.75, m2 = 19.6.
• Iteration 4: Points 11 and 12 also move: Cluster 1 = {2, 3, 4, 10, 11, 12}, Cluster 2 = {20, 25, 30}. Update: m1 = 7, m2 = 25.
• Iteration 5: No points change clusters, so the algorithm stops.
• Final Clusters: Cluster 1 = {2, 3, 4, 10, 11, 12}, Cluster 2 = {20, 25, 30}.
(Note: The exact intermediate steps depend on the initial centroids chosen, but the assign-and-update process remains the same.)
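For comparison, a quick sketch that runs the same data through scikit-learn's KMeans (the exact result can depend on the random initialization):
```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25]).reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # final centroids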
Hierarchical Clustering (also known as HCA, hierarchical cluster analysis) is another unsupervised learning algorithm. It's used to group unlabeled data into clusters by building a hierarchy (like a tree) of clusters. This tree-like structure is called a "Dendrogram."
How it Differs from K-Means:
• In K-Means, you need to tell it how many clusters (K) you want upfront.
• In Hierarchical Clustering, you don't need to determine the number of clusters in advance. You can decide the number of clusters later by "cutting" the dendrogram at different levels.
Two Approaches:
1. Agglomerative (Bottom-Up):
o It starts by treating each data point as its own cluster.
o Then, it repeatedly merges the two closest clusters until all data points are merged into one big cluster. This is the most popular form of HCA.
2. Divisive (Top-Down):
o It starts with all data points in one big cluster and repeatedly splits it into smaller and smaller clusters.
Steps of Agglomerative Clustering:
1. Start: Treat each data point as a single cluster. If there are 'n' data points, there are 'n' clusters.
2. Merge Closest: Take the two closest data points or clusters and merge them to form one new cluster.
3. Repeat Merging: Continue taking the two closest clusters and merging them together. The number of clusters decreases by one each time.
4. Final Cluster: Repeat step 3 until only one single cluster is left (containing all data points).
5. Build Dendrogram: Once all clusters are combined, a Dendrogram (tree structure) is developed. You can then use this dendrogram to decide where to divide the clusters based on your problem.
Key Points:
• An unsupervised ML algorithm.
• Builds a hierarchy of clusters, shown as a dendrogram, which can be cut to get the desired number of clusters. (See the sketch below.)
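A minimal sketch with SciPy, building the merge hierarchy (agglomerative, single linkage) for the same toy data and cutting it into two clusters:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25]).reshape(-1, 1)
Z = linkage(data, method="single")                 # merge the closest clusters step by step
labels = fcluster(Z, t=2, criterion="maxclust")    # "cut" the dendrogram into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.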
Gaussian Mixture Models (GMMs) are a type of ML algorithm used for clustering and are based on probability. They assume that your data points come from a mix of different "Gaussian distributions" (which are like bell curves). Each bell curve represents a different cluster.
Instead of just finding a center for each cluster (like K-Means), GMMs try to figure out the shape (spread and direction) of each cluster, assuming points in a cluster follow a bell curve.
• Probabilistic: GMMs are "probabilistic" models. They estimate the probability that each data point belongs to each cluster. This means a point can belong to a cluster with a certain probability, not just 100%.
• Unknown Parameters: They assume data points are generated from Gaussian distributions with unknown parameters. The goal is to estimate these parameters and the proportion of data points from each distribution.
• Robust to Outliers: GMMs are generally good at handling unusual data points (outliers) because they can assign such points a low probability of belonging to any cluster, yielding accurate results even with outliers.
Advantages over K-Means:
• They can find clusters that are not perfectly round or equally sized, unlike K-Means.
• They give you the probability of a data point belonging to a cluster, which can be more informative.
Key Points:
• A probability-based model used for clustering.
• It's a probabilistic model: each point gets a probability of belonging to each cluster.
• Can find clusters of different shapes and sizes.
• Relatively robust to outliers. (See the sketch below.)
GMMs vs. K-Means:
• GMMs: Assume data points are generated from a mixture of Gaussian distributions with unknown parameters. The goal is to estimate these parameters and the proportion of data points from each distribution.
• K-Means: Makes no assumptions about the underlying distribution of data points. It simply divides the data into K clusters, where each cluster is defined by its centroid.
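A minimal GMM sketch with scikit-learn on the same toy data, showing the soft (probabilistic) assignments that K-Means does not provide:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_.ravel())           # estimated centers of the two bell curves
print(gmm.predict(data))            # hard assignment for each point
print(gmm.predict_proba(data)[:3])  # probability of belonging to each cluster (soft assignment)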
These are methods in machine learning inspired by how nature evolves, using ideas like "survival of the fittest". They are used to find the best solutions for difficult problems, especially optimization tasks (finding the best settings or values for something). They try many solutions, combine them, and keep the best ones, allowing them to "evolve" over time towards an optimal answer.
Typical Steps:
1. Initialize: Start with a population of random candidate solutions.
2. Evaluate: Score each solution using a "fitness" function.
3. Select: Keep the fittest solutions as parents for the next generation.
4. Reproduction/Mutation: Create new solutions by combining (crossing over) parts of the best ones and adding small random changes (mutations).
5. Repeat: Go back to step 2 and keep evolving the solutions over generations until a good answer is found.
Key Points:
• Inspired by natural evolution ("survival of the fittest").
• Used for optimization problems (finding the best settings or values). (See the sketch below.)
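A toy genetic-algorithm sketch that evolves 8-bit strings toward "all ones"; the fitness function, population size, crossover scheme, and mutation rate are all invented for illustration.
```python
import random

random.seed(0)

def fitness(bits):
    return sum(bits)                       # step 2: score a solution (more 1s = fitter)

pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]   # step 1: random population

for generation in range(30):               # step 5: repeat over generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                     # step 3: keep the fittest half
    children = []
    for _ in range(10):                    # step 4: crossover + mutation
        a, b = random.sample(parents, 2)
        cut = random.randint(1, 7)
        child = a[:cut] + b[cut:]          # combine parts of two good solutions
        if random.random() < 0.1:
            i = random.randrange(8)
            child[i] = 1 - child[i]        # small random change (mutation)
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print(best, fitness(best))                 # best solution found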
In clustering, especially K-Means, you often need to decide how many clusters (K) to create. This is an important choice because it affects how the data is grouped.
• Trial and Error: Sometimes, you try different values of K and see which one makes the most sense for your data or problem.
• Application Defined: For some problems, the number of clusters is already known or makes practical sense (e.g., if you want to group customers into 3 specific loyalty tiers).
• Evaluation Metrics: There are methods that help determine a good K by looking at how "tight" the clusters
are or how well separated they are.
o Elbow Method: You plot a graph showing how much error decreases as you add more clusters.
Often, the graph forms an "elbow" shape, and the "elbow point" suggests a good K.
o Silhouette Score: This measures how similar an object is to its own cluster compared to other
clusters. A higher score is better.
Key Points:
• Deciding the number of clusters (K) is an important choice that shapes the final grouping.
• Can be chosen by trial and error, by the needs of the application, or with evaluation methods such as the Elbow Method and the Silhouette Score. (See the sketch below.)
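A small sketch of the Elbow Method: fit K-Means for several values of K and watch how the total within-cluster error (inertia) drops; the toy data is reused from the earlier example.
```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25]).reshape(-1, 1)
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(km.inertia_, 1))   # the "elbow" in these numbers suggests a good K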
These methods define how the distance between two clusters is calculated in hierarchical clustering:
• Single Linkage: The distance between two clusters is the shortest distance between any two points from the different clusters.
• Complete Linkage: The distance between two clusters is the longest distance between any two points from the different clusters. This method tends to form tighter clusters than single linkage.
• Average Linkage: The distance between two clusters is the average distance over all possible pairs of points, one point from each cluster.
• Centroid Linkage: The distance between two clusters is the distance between their centroids (their center points).
Key Points:
• Linkage defines how the distance between two clusters is measured when deciding which clusters to merge.
• Common choices are single, complete, average, and centroid linkage.
The Expectation-Maximization (EM) algorithm is a powerful method used for finding hidden (or "latent") variables in data. It's often used to train models like Gaussian Mixture Models (GMMs) when some data is missing or when we don't know which cluster each data point belongs to.
Imagine you have a bag of coins, but you don't know if they're fair or biased. If you knew which coin was which, it would be easy to flip them and count heads/tails. But you don't know. EM helps in these situations where there's "missing information" or "hidden variables" that make direct calculation hard. It helps estimate parameters for models like GMMs, especially when clusters are not clearly defined.
1. Expectation (E-step): In this step, the algorithm guesses the "missing information" or the values of the hidden variables. For example, in GMMs, it guesses the probability that each data point belongs to each cluster, based on the current (guessed) cluster properties.
2. Maximization (M-step): In this step, the algorithm uses the guessed information from the E-step to update the model and maximize the likelihood of the observed data. For example, in GMMs, this step recalculates the best cluster properties (like their centers and shapes) based on the probabilities assigned in the E-step.
These two steps are repeated over and over. With each repetition, the algorithm's guesses get better and better,
leading to a good final model.
Key Points:
• Often used to fit models such as Gaussian Mixture Models.
• It's an iterative two-step algorithm (E-step, then M-step).
• Needed when the data has missing information or hidden (latent) variables.
In Machine Learning, an "experiment" is a planned series of actions where we try different settings to see how they affect a model's performance. It's like a scientific test: we want to understand what makes our ML model work best.
Why Do We Do Experiments?
• To learn about the model: We feed data to a learner (our ML model) and see what output it gives.
• To identify important factors: We vary different factors (like the algorithm used, the training set, or input features) to see how they change the outcome.
• To find the best settings: The goal might be to find the setup that makes the model perform its best.
• To get reliable results: We want to be sure our findings are statistically meaningful and not just due to chance.
Key Elements of an Experiment:
• Hypothesis Formulation: Clearly stating what we expect to happen or what we want to test.
• Variable Manipulation: Changing certain factors (inputs) to see their effect on the output.
• Control: Keeping other factors constant to make sure our changes are truly causing the observed effects.
Key Points:
• A good ML experiment looks for a model with high accuracy, minimal complexity, and one that is not easily affected by outside changes.
• Factors (Inputs): These are the things we can change or that vary in an ML experiment. They can be:
o The algorithm we choose.
o The training set used.
o The input features (representation) used.
• Response (Output): This is what we measure to see the result of our changes. It's usually the model's performance (e.g., accuracy, error rate).
Strategy of Experimentation:
• Planning: It's important to plan experiments carefully to identify the most important factors.
• Observation: We observe how the model's performance (response) changes when we adjust the factors.
• Information Extraction: The goal is to extract information and identify the most important factors.
• Eliminating Chance: We design experiments to make sure our conclusions are statistically sound and not just due to random luck.
Key Points:
• We change the factors (inputs) and measure the response (output).
• A trained learner can be shown as taking controllable and uncontrollable factors as input and producing an output.
General Guidelines:
• Clear Goals: Define what you want to achieve with the experiment (e.g., higher accuracy, faster training).
• Reproducibility: Make sure your experiment can be repeated by others (or yourself later) to get the same
results. This means keeping good records of your data, code, and settings.
• Baselines: Always compare your new model or approach against a simple, existing model (a "baseline") to
see if it's actually better.
• Proper Evaluation: Use appropriate evaluation metrics for your problem (e.g., accuracy for classification,
MSE for regression) and always test on unseen data.
Key Points:
• Good experiments have clear goals, are reproducible, are compared against baselines, and are evaluated with appropriate metrics on unseen data.
Resampling methods are techniques that involve repeatedly drawing samples from a training dataset. They are used to train and evaluate ML models, giving a more robust estimate of performance than a single train-test split. Cross-validation is a key resampling method.
Cross-Validation:
(We covered this in detail in Unit 1.8, but here's a recap in this context.)
• Purpose: It's a technique for checking how well a model will work on new, unseen data. It helps test the model's stability.
• How it works (e.g., k-Fold): The data is split into 'k' parts. The model is trained 'k' times. Each time, a different part is used for testing, and the rest for training. The final score is the average of all tests.
• Benefit: Gives a more reliable performance estimate and helps prevent overfitting.
Why Use Resampling?
• Better Performance Estimate: It helps get a more accurate idea of how the model will perform in the real world.
• Hyperparameter Tuning: It's used to find the best settings (hyperparameters) for the model.
• Model Selection: When comparing different models, resampling helps reliably identify which one is best.
Key Points:
• It helps estimate real-world performance, tune hyperparameters, and compare models reliably. (See the sketch below.)
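A minimal 5-fold cross-validation sketch with scikit-learn; the classifier and the synthetic dataset are arbitrary illustrative choices.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=5)

print(scores)           # accuracy on each of the 5 held-out folds
print(scores.mean())    # a more reliable estimate than a single train-test split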
Once a classification model is trained, we need to know how well it predicts categories. This is crucial for trusting the model and deciding if it's good enough for its purpose.
• Confusion Matrix: A table that shows all types of correct and incorrect predictions:
o True Positives (TP): Correctly predicted positive.
o True Negatives (TN): Correctly predicted negative.
o False Positives (FP): Predicted positive, but actually negative.
o False Negatives (FN): Predicted negative, but actually positive.
• Precision: Out of all the times the model predicted "positive," how many were actually correct.
• Recall: Out of all the actual "positives" in the data, how many did the model correctly identify.
Key Points:
• Essential to understand not just how often a model is right, but where and how it makes mistakes. (See the sketch below.)
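A small sketch computing these metrics for an invented set of labels and predictions:
```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (invented)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model's predictions (invented)

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(precision_score(y_true, y_pred))    # of predicted positives, how many were right
print(recall_score(y_true, y_pred))       # of actual positives, how many were found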
Hypothesis testing is a statistical method used to make decisions about a whole group (called a "population") based on data from a smaller part of that group (called a "sample"). We start with a belief or a proposed explanation (a "hypothesis") and then use data to see if there's enough evidence to support or reject it.
Why it Matters in ML:
• Confirm Observations: Data science projects often start by exploring data. Hypothesis testing helps confirm whether what we observe in our sample data is true for the larger population.
• Statistical Significance: It tells us if differences in model performance (e.g., between two algorithms) are truly meaningful or just due to random chance. If a difference is "statistically significant," it is probably not just a coincidence.
• Algorithm Selection: It's used to determine whether differences in performance between two data samples or metrics are statistically significant or just noise.
Key Points:
• A statistical method to decide whether an observed difference (e.g., between two models) is statistically significant or just due to chance. (See the sketch below.)
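A small sketch of such a test: comparing two models' cross-validation scores with a t-test from SciPy (the scores below are invented for illustration).
```python
from scipy.stats import ttest_ind

scores_a = [0.81, 0.79, 0.83, 0.80, 0.82]   # model A, 5 folds (hypothetical)
scores_b = [0.76, 0.78, 0.75, 0.77, 0.74]   # model B, 5 folds (hypothetical)

t_stat, p_value = ttest_ind(scores_a, scores_b)
print(t_stat, p_value)   # a small p-value suggests the difference is not just chance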
Comparing different ML algorithms and models is vital. It helps us choose the best one for a specific task and dataset. There are also non-obvious benefits to comparing effectively.
Goals of Comparison:
• Better Performance: The primary objective is to find the algorithm that gives the best results for our problem (e.g., highest accuracy, lowest error).
• Longer Lifetime: We want a model that understands the underlying patterns in data, so its predictions remain good even with new, unseen data, reducing the need for constant retraining.
• Easier Retraining: By comparing thoroughly, we record details that help us understand why a model was chosen or why it failed, making retraining quicker if needed.
• Speedy Production: Comparison helps identify models that are fast and use computer resources optimally, which is important for real-world use.
Methods of Comparison:
• Statistical Tests: Since ML models are based on statistics, we use statistical tests to compare them.
o Null Hypothesis Testing: Tests whether the differences in performance between two models are statistically significant or just random noise.
o ANOVA (Analysis of Variance): Checks whether the means (averages) of different groups are similar or not, often using one or more categorical features and one continuous target.
o Chi-square: Used for categorical features to evaluate the likelihood of association or correlation based on frequency distributions.
o Student's t-test: Compares the averages of different samples to determine if differences are statistically significant.
• Learning Curves: Plotting model performance over time (training vs. validation) helps see whether the model is learning well or overfitting.
o Training Learning Curve: Plots the evaluation metric score over time during training, tracking progress.
o Validation Learning Curve: Plots the evaluation score on the validation set over time, showing how well the model generalizes and helping identify overfitting.
• Bias-Variance Tradeoff: Comparing models involves understanding their bias (assumptions used to simplify learning) and variance (how much predictions change with changes in training data). The ultimate goal is to keep both bias and variance as low as possible.
Key Points:
• Crucial for choosing the best model for a given task and dataset.
• Aims for better performance, a longer model lifetime, easier retraining, and efficient production.
• Uses statistical tests (ANOVA, t-test, Chi-square, null hypothesis testing, ten-fold cross-validation) for objective comparison.
• Uses learning curves to track training progress and generalization.
• Involves balancing the bias-variance tradeoff.