Unit IV: Classification in Data Science

The document provides an overview of classification in data science, explaining its fundamental concepts, types, and common algorithms such as K-Nearest Neighbors, Logistic Regression, and Decision Trees. It details the classification process, including model construction, training and testing phases, and performance evaluation methods. Additionally, it discusses the importance of decision boundaries and the effects of parameters like 'k' in KNN on classification accuracy.

CLASSIFICATION

1
CLASSIFICATION
•Classification

• Nearest Neighbours

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

2
CLASSIFICATION
•Classification

• Nearest Neighbours

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

3
What is Classification in Data Science?
Classification:
•Understanding the behavior of the data and identifying the resulting groups

4
CLASSIFICATION
• Classification is a fundamental concept in data science and machine learning.
In simple terms, classification involves assigning input data to one of several
predefined categories or classes.

• In supervised learning, a classification model learns from labeled examples and then predicts the class of new, unseen data. For example, a model might learn to label emails as “spam” or “not spam” based on past data.

• In other words, a classification model sorts data points into predefined groups
called classes. Think of it like a mail sorter that learns to put each piece of mail
into the right mailbox (spam or inbox, say) based on its features.

• The most common types of classification algorithms are k-nearest neighbours, decision trees, logistic regression, naive Bayes, and support vector machines.

5
CLASSIFICATION- A TWO STEP PROCESS

6
Process (1) - Model Construction (Training Phase)

7
Process (2)- Using the model in
prediction (Testing Phase)

8
Types of Classification Problems

9
Types of Classification Problems
Binary Classification: This is the simplest case, where each input is assigned to one of two
classes. For example, predicting whether an email is spam or not spam, or whether a patient has
a disease (yes/no). In binary classification, the data is labeled in a binary way (e.g., 0/1,
true/false, positive/negative).

Multi-Class Classification: Here, there are more than two possible classes, but still exactly one
label per example. For example, an image classifier might label photos as cat, dog, or rabbit. The
model must pick one class out of many.

Multi-Label Classification: In some tasks, each instance can belong to multiple classes
simultaneously. For example, a photo might contain both a “bicycle” and an “apple,” so it has
two labels. In multi-label classification, a model predicts a set of classes for each example. This is
different from multi-class, since examples are not exclusive to one class.

Imbalanced Classification: Many real-world datasets are imbalanced, meaning some classes
have many more examples than others. Examples include fraud detection or rare disease
diagnosis.

10
Common Classification
Algorithms in Data Science

11
Algorithms in Classification in Data Science
Logistic Regression:
Logistic Regression is a classification algorithm used to predict a binary outcome (e.g. yes/no, 0/1,
true/false) based on independent variables. It uses an equation to determine the probability of an
event occurring, and then uses a threshold value to determine the outcome.
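As a quick illustration, here is a minimal scikit-learn sketch of this idea; the tiny one-feature dataset and the 0.5 threshold are illustrative assumptions, not part of the slides:

```python
# Minimal sketch: logistic regression as a thresholded probability model.
# The one-feature dataset below is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # independent variable
y = np.array([0, 0, 0, 1, 1, 1])                          # binary outcome

model = LogisticRegression().fit(X, y)

p = model.predict_proba([[3.5]])[0, 1]  # probability that the outcome is 1
label = int(p >= 0.5)                   # the threshold turns probability into a class
print(f"P(y=1) = {p:.3f} -> predicted class {label}")
```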

K-Nearest Neighbors (KNN):


K-Nearest Neighbors (KNN) is a non-parametric, supervised machine learning algorithm used for
classification. It works by finding the K (usually 3-5) nearest points in the dataset, and then
assigning a class label based on the majority class among them.

Support Vector Machines (SVM):


Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification
and regression. It works by finding a hyperplane that separates the data points into their respective
classes.

12
Algorithms in Classification in Data Science

Decision Tree:
Decision Tree is a supervised machine learning algorithm used for both classification and
regression. It works by constructing a decision tree from the training data, which is then used to
make predictions on unseen data points.

Random Forest:
Random Forest is an ensemble machine-learning algorithm used for both classification and
regression. It works by randomly selecting a subset of features, and then building multiple
decision trees from the dataset.

13
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

14
Nearest Neighbors
•"Nearest neighbors" refers to the concept of finding the closest data points to a given point,
often used in machine learning for classification (K-Nearest Neighbors or KNN) and in optimization
problems.

•In KNN, an unknown data point is assigned to a class based on a majority vote of its k nearest
neighbors in a training set, using distance metrics like Euclidean distance to determine closeness.

15
Nearest Neighbors
Example 1

A simple nearest-neighbor example involves classifying a new data point by finding its k closest neighbors in a dataset and assigning it the majority class of those neighbors. For instance, if you are classifying a new fruit based on its shape and size, and you choose k=3, the algorithm looks at the three closest known fruits. If two of those three neighbors are apples and one is a banana, the new fruit is classified as an apple.

Example 2

16
Nearest Neighbors

17
Understanding Decision Boundaries in K-
Nearest Neighbours

18
Understanding Decision Boundaries in K-
Nearest Neighbours
• A decision boundary is a line or surface that divides different groups in a
classification task.
• It shows which areas belong to which class based on what the model decides. The K-Nearest Neighbors (KNN) algorithm operates on the principle that similar data points exist in close proximity within a feature space.
• The shape of this boundary depends on:
•The value of K (how many neighbors are considered).

•How the data points are spread out in space.

• For example, given a dataset with two classes the decision boundary can be
visualized as the line or curve dividing the two regions where each class is
predicted.

19
How KNN creates decision boundaries
In KNN, decision boundaries are influenced by the choice of k and the distance metric
used:
1. Impact of 'K' on Decision Boundaries: The number of neighbors (k) affects the
shape and smoothness of the decision boundary.
•Small k: When k is small the decision boundary can become very complex, closely
following the training data. This can lead to overfitting.

•Large k: When k is large the decision boundary smooths out and becomes less sensitive to
individual data points, potentially leading to underfitting.
2. Distance Metric: The decision boundary is also affected by the distance metric used, such as Euclidean or Manhattan. Different metrics can lead to different boundary shapes.
•Euclidean Distance: Commonly used; leads to circular or elliptical decision boundaries in two-dimensional space.

•Manhattan Distance: Results in axis-aligned decision boundaries.
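A small sketch of how the two metrics measure closeness differently (the two points are made up for illustration):

```python
# Sketch: the same pair of points under two distance metrics.
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(a - b))          # |3| + |4| = 7.0

print(euclidean, manhattan)  # different metrics can rank neighbors differently
```

In scikit-learn the same switch is a single parameter, e.g. KNeighborsClassifier(metric="manhattan").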

20
Decision Boundaries in K-Nearest Neighbours For different ‘k’

21
Factors That Affect KNN Decision
Boundaries
Feature Scaling: KNN is sensitive to the scale of data. Features with
larger ranges can dominate distance calculations, affecting the boundary
shape.

Noise in Data: Outliers and noisy data points can shift or distort decision
boundaries, leading to incorrect classifications.

Data Distribution: How data points are spread across the feature space
influences how KNN separates classes.

Boundary Shape: A clear and accurate boundary improves classification accuracy, while a messy or unclear boundary can lead to errors.

22
How Does the K-Nearest Neighbors Algorithm Work?
• The K-NN algorithm compares a new data entry to the values in a given data set (with
different classes or categories).

• Based on its closeness or similarities in a given range (K) of neighbors, the algorithm assigns
the new data to a class or category in the data set (training data).

Step #1 - Assign a value to K.

Step #2 - Calculate the distance between the new data entry and all other existing data entries
(you'll learn how to do this shortly). Arrange them in ascending order.

Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.

Step #4 - Assign the new data entry to the majority class in the nearest neighbors.
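The four steps translate almost line-for-line into Python. This is a from-scratch sketch using a toy subset of the Brightness/Saturation example that appears later in this unit:

```python
# From-scratch sketch of the four K-NN steps.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):  # Step 1: choose K
    # Step 2: distance from the new entry to every existing entry
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: the k entries with the smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 4: majority class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[40, 20], [50, 50], [60, 90], [10, 25]])
y_train = np.array(["Red", "Blue", "Blue", "Red"])
print(knn_predict(X_train, y_train, np.array([20, 35])))  # -> Red
```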

23
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

24
Training and testing
How good is our nearest neighbor classifier?
• To answer this, we’ll need to find out how frequently our
classifications are correct. If a patient has chronic kidney disease, how
likely is our classifier to pick that up?

❖If the patient is in our training set, we can find out


immediately. We already know what class the patient is in. So,
we can just compare our prediction and the patient’s true
class.

❖But the point of the classifier is to make predictions


for new patients not in our training set. We don’t know what
class these patients are in but we can make a prediction based
on our classifier.

25
Training and testing
How to find out whether the prediction is correct?

❖One way is to wait for further medical tests on the patient and then
check whether or not our prediction agrees with the test results.
With that approach, by the time we can say how likely our prediction
is to be accurate, it is no longer useful for helping the patient.

❖Instead, we will try our classifier on some patients whose true classes
are known. Then, we will compute the proportion of the time our
classifier was correct. This proportion will serve as an estimate of the
proportion of all new patients whose class our classifier will
accurately predict. This is called testing.

26
Training and testing
Overly Optimistic “Testing”:
• Overly optimistic testing in nearest neighbor (k-NN) classification
occurs when the same data used to train the model is also used to test
its performance.

• Because the k-NN model simply memorizes the training data, this
practice will produce a misleadingly high accuracy score, often 100%,
that does not reflect how the model will perform on new, unseen
data.
The core reason for this bias
• The k-NN algorithm is an instance-based or "lazy" learner. Instead of
building a generalized model during a training phase, it stores the
entire training dataset.

• During the classification phase for a new data point, it finds the k nearest data points from the stored training set.
27
Training and testing
The core reason for this bias (Contd..)
• When you "test" a k-NN model on the same data used to train it (the
training set), a query point is compared against every point in the
training set, including itself.

• For any given point in the training set, its nearest neighbor will always
be itself, at a distance of zero. Therefore, a 1-nearest neighbor
classifier will always correctly classify every point in the training set.

• The result is a test accuracy score of 100%, which is overly optimistic and not representative of the model's true predictive power on new data.

28
Training and testing
How to avoid overly optimistic testing:
To get a more realistic and unbiased evaluation of a k-NN model, you must
measure its performance on a separate, unseen dataset. The standard
practice is to split the original dataset into two or three parts:
•Training set: A portion of the data used to "train" the model (i.e.,
to have it memorize the data points).
•Testing set: A separate portion of the data used to evaluate the
model's performance. The model has never seen this data
before.
•Validation set (optional): A third set used for fine-tuning
hyperparameters, like the value of k.

This process is known as a train-test split and can be extended with methods
like cross-validation, which repeatedly splits the data and averages the
performance scores to produce an even more robust evaluation.
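A sketch of this workflow in scikit-learn; the synthetic dataset and the 50/50 split are illustrative choices:

```python
# Hold-out evaluation vs. cross-validation, sketched with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Train-test split: the model never sees X_test while "training" (memorizing).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("train accuracy:", knn.score(X_train, y_train))  # optimistic
print("test accuracy: ", knn.score(X_test, y_test))    # realistic estimate

# Cross-validation: repeated splits, scores averaged for a more robust estimate.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```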
29
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

30
Rows of Tables
• In nearest neighbor classification, rows of a table represent the
individual data points or observations in the training dataset.

• Each row contains all the information about the feature values
and the known class label for a single instance.
How rows function in nearest neighbor classification
1.A row is a single data point: Each row is a complete set of attributes for one
observation. For example, in a medical diagnosis problem, a row might represent one
patient and include their age, blood pressure, and other lab results.
2.Rows are compared: When you want to classify a new, unlabeled data point, the algorithm
compares the features of this new point to the feature values in every row of your training
table. The comparison is done using a distance metric, like Euclidean distance, to determine
how "close" each training row is to the new point.
3.Rows identify neighbors: The distances are then sorted, and the top k rows (the "neighbors") with
the smallest distances are selected.
4.Rows determine the class: The class labels of the k nearest-neighbor rows are examined to make a prediction
for the new data point. For a classification problem, the new point is assigned the class that is most common
among its nearest neighbors (a process known as "majority voting").
31
Rows of Tables
Example using a "chronic kidney disease" table
Imagine a training table for predicting chronic kidney disease (CKD), with each
row representing a different patient.
Patient ID  Hemoglobin  Glucose  White Blood Cell Count  Class
P101        11.2        117      6700                    CKD
P102        9.5         70       12100                   CKD
P103        12.5        264      9600                    Not CKD
P104        10.0        70       18900                   CKD
...         ...         ...      ...                     ...

•Training data: The entire table, including all the rows, is the training data for
the classifier.
•A new patient: A new patient, Alice, comes in with a Hemoglobin level of
10.5, a Glucose level of 120, and a White Blood Cell Count of 8000.

32
Rows of Tables
The process:
• The algorithm calculates the distance between Alice's data and the
data in each row of the table.

• If k=3, the algorithm finds the three rows in the table that are
"closest" to Alice based on her Hemoglobin, Glucose, and White Blood
Cell Count values.

• It then looks at the "Class" column for those three nearest rows to see
if the majority of them are "CKD" or "Not CKD".

• The majority vote becomes the predicted class for Alice.
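A sketch of Alice's classification with scikit-learn, using the four complete rows from the table above. (As noted earlier under feature scaling, the unscaled White Blood Cell Count dominates these distances; scaling would normally come first.)

```python
# The Alice example: k=3 nearest rows vote on her class.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns: Hemoglobin, Glucose, White Blood Cell Count (rows P101-P104).
X_train = np.array([
    [11.2, 117,  6700],   # P101, CKD
    [ 9.5,  70, 12100],   # P102, CKD
    [12.5, 264,  9600],   # P103, Not CKD
    [10.0,  70, 18900],   # P104, CKD
])
y_train = np.array(["CKD", "CKD", "Not CKD", "CKD"])

alice = np.array([[10.5, 120, 8000]])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict(alice))  # majority vote of the 3 closest rows
```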

33
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

34
Implementing the classifier
• We are now ready to implement a nearest neighbor classifier based on multiple attributes.

• We have used only two attributes so far, for ease of visualization. But usually, predictions will
be based on many attributes.

• Here is an example that shows how using multiple attributes can be better than using just a pair.

This time we’ll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate.
Researchers have put together a data set for us, based on photographs of many individual banknotes:
some counterfeit, some legitimate. They computed a few numbers from each image, using techniques
that we won’t worry about for this course. So, for each banknote, we know a few numbers that were
computed from a photograph of it as well as its class (whether it is counterfeit or not). Let’s load it
into a table and take a look.

35
Implementing the classifier

36
Implementing the classifier
Let’s look at whether the first two numbers tell us anything about whether the banknote is counterfeit, using the scatterplot below, which considers only the two attributes WaveletCurt and WaveletVar.

37
Implementing the classifier

Observation: There is some overlap between the blue cluster and the gold cluster.

Inference: This indicates that there will be some images where it’s hard to tell whether the banknote is
legitimate based on just these two numbers. Still, the legitimacy of a banknote could be predicted using a
nearest neighbor classifier.

38
Implementing the classifier
The scatterplot obtained for a different pair of attributes (Entropy & WaveletSkew):

Observation: Here again, overlap between the blue and gold clusters produces a complex structure for this pair of attributes.
Inference: It is difficult to differentiate between counterfeit and legitimate banknotes using these two attributes alone.

39
Implementing the classifier
Multiple attributes
• So far, exactly 2 attributes were used to make our prediction.

What if we have more than 2? For instance, what if we have 3 attributes?

• The same ideas can be used for this case, too.
❖ Make use of a 3-dimensional scatterplot, instead of a 2-dimensional plot.
❖ The nearest neighbor classifier can still be used, but now computing distances in 3 dimensions instead of just 2.
❖ This all works for arbitrarily many attributes; we just work in a very high-dimensional space. It becomes impossible to visualize.

40
Implementing the classifier
Try to predict whether a banknote is counterfeit or not using 3 of the measurements, instead of just 2. The scatterplot is as follows:
• Observation: There is no overlap between the classes counterfeit and legitimate.

• Inference: A classifier that uses these 3 attributes will be more accurate than one that only uses the 2 attributes.

• This is a general phenomenon in classification. Each attribute can potentially give you new information, so more attributes sometimes helps you build a better classifier.

• Of course, the cost is that now we have to gather more information to measure the value of each attribute, but this cost may be well worth it if it significantly improves the accuracy of our classifier.

41
Implementing the classifier
How to use k-nearest neighbor classification to predict the answer to a yes/no question, based on the values of some attributes, assuming you have a training set with examples where the correct prediction is known:
1. Identify some attributes that you think might help you predict the answer to the question.

2. Gather a training set of examples where you know the values of the attributes as well as the correct prediction.

3. To make predictions in the future, measure the value of the attributes and then use k-nearest neighbor classification to predict the answer to the question.

42
Implementing the classifier
Distance in Multiple Dimensions
Euclidean distance for two dimensions:
d = √((x₀ − x₁)² + (y₀ − y₁)²)

Euclidean distance for three dimensions:
d = √((x₀ − x₁)² + (y₀ − y₁)² + (z₀ − z₁)²)

Euclidean distance for n dimensions:
d = √((a₀ − a₁)² + (b₀ − b₁)² + … + (n₀ − n₁)²)
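The n-dimensional formula is a one-line helper in Python:

```python
# General n-dimensional Euclidean distance, matching the formula above.
import math

def euclidean(p, q):
    # square root of the sum of squared coordinate-wise differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((0, 0), (3, 4)))        # 5.0  (two dimensions)
print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0  (three dimensions)
```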

43
K-Nearest Neighbors Algorithm -
Example
Brightness Saturation Class
40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue

The table above represents our data set. We have two columns Brightness and Saturation.
Each row in the table has a class of either Red or Blue.

44
K-Nearest Neighbors Algorithm -
Example
Before we introduce a new data entry, let's assume the value of K is 5.

How to Calculate Euclidean Distance in the K-Nearest Neighbors Algorithm


Here's the new data entry:
Brightness Saturation Class
20 35 ?

We have a new entry but it doesn't have a class yet. To know its class, we have to
calculate the distance from the new entry to other entries in the data set using
the Euclidean distance formula.

45
K-Nearest Neighbors Algorithm -
Example
d = √((X₂ − X₁)² + (Y₂ − Y₁)²)
Where:
•X₂ = New entry's brightness (20).
•X₁= Existing entry's brightness.
•Y₂ = New entry's saturation (35).
•Y₁ = Existing entry's saturation.
Let's do the calculation together. I'll calculate the first three.
Distance #1
For the first row, d1:

Brightness Saturation Class
40 20 Red

d1 = √((20 - 40)² + (35 - 20)²)
= √(400 + 225)
= √625
= 25

46
K-Nearest Neighbors Algorithm -
Example
We now know the distance from the new data entry to the first entry in the table. Let's
update the table.

Brightness Saturation Class Distance


40 20 Red 25
50 50 Blue ?
60 90 Blue ?
10 25 Red ?
70 70 Blue ?
60 10 Red ?
25 80 Blue ?

47
K-Nearest Neighbors Algorithm -Example
Distance #2
For the second row, d2:
Brightness Saturation Class Distance
50 50 Blue ?
d2 = √((20 - 50)² + (35 - 50)²)
= √(900 + 225)
= √1125
= 33.54
Here's the table with the updated distance:
Brightness Saturation Class Distance
40 20 Red 25
50 50 Blue 33.54
60 90 Blue ?
10 25 Red ?
70 70 Blue ?
60 10 Red ?
25 80 Blue ?
48
K-Nearest Neighbors Algorithm -Example
Distance #3
For the third row, d3:
Brightness Saturation Class Distance
60 90 Blue ?
d3 = √((20 - 60)² + (35 - 90)²)
= √(1600 + 3025)
= √4625
= 68.01

Updated table:

Brightness Saturation Class Distance


40 20 Red 25
50 50 Blue 33.54
60 90 Blue 68.01
10 25 Red ?
70 70 Blue ?
60 10 Red ?
25 80 Blue ?

49
K-Nearest Neighbors Algorithm -Example
Here's what the table will look like after all the distances have been calculated:
Updated table:
Brightness Saturation Class Distance
40 20 Red 25
50 50 Blue 33.54
60 90 Blue 68.01
10 25 Red 14.14
70 70 Blue 61.03
60 10 Red 47.17
25 80 Blue 45.28

50
K-Nearest Neighbors Algorithm -Example
Let's rearrange the distances in ascending order:

Brightness Saturation Class Distance

10 25 Red 14.14
40 20 Red 25
50 50 Blue 33.54
25 80 Blue 45.28
60 10 Red 47.17
70 70 Blue 61.03
60 90 Blue 68.01

51
K-Nearest Neighbors Algorithm -Example
Since we chose 5 as the value of K, we'll only consider the first five rows. That is:

Brightness Saturation Class Distance

10 25 Red 14.14
40 20 Red 25
50 50 Blue 33.54
25 80 Blue 45.28
60 10 Red 47.17

52
K-Nearest Neighbors Algorithm -Example
As you can see above, the majority class within the 5 nearest neighbors to the new entry
is Red. Therefore, we'll classify the new entry as Red.
Here's the updated table:

Brightness Saturation Class


40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue
20 35 Red
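The whole worked example can be checked with a short script; math.dist computes the same Euclidean distance used in the hand calculations:

```python
# Recomputing every distance, sorting, and taking the k=5 majority vote.
import math
from collections import Counter

data = [
    (40, 20, "Red"), (50, 50, "Blue"), (60, 90, "Blue"), (10, 25, "Red"),
    (70, 70, "Blue"), (60, 10, "Red"), (25, 80, "Blue"),
]
new = (20, 35)

# Ascending distances, as in the sorted table above.
scored = sorted((math.dist(new, (b, s)), cls) for b, s, cls in data)

votes = Counter(cls for _, cls in scored[:5])  # the 5 nearest neighbors
print(scored[:5])                              # starts (14.14.., 'Red'), (25.0, 'Red'), ...
print(votes.most_common(1)[0][0])              # Red
```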

53
How to Choose the Value of K in the K-NN
Algorithm
There is no particular way of choosing the value K, but here are some
common conventions to keep in mind:

•Choosing a very low value will most likely lead to inaccurate predictions.

•The commonly used value of K is 5.

•Always use an odd number as the value of K.

54
K-NN Algorithm

Advantages of K-NN Algorithm

•It is simple to implement.

•No explicit training phase is required before classification.

Disadvantages of K-NN Algorithm

•Can be cost-intensive when working with a large data set.

•A lot of memory is required for processing large data sets.

•Choosing the right value of K can be tricky.

55
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

56
Accuracy of the Classifier

• To see how well our classifier does, we might put 50% of the data
into the training set and the other 50% into the test set.

• Basically, we are setting aside some data for later use, so we can use
it to measure the accuracy of our classifier.

• We’ve been calling that the test set. Sometimes people will call the
data that you set aside for testing a hold-out set, and they’ll call this
strategy for estimating accuracy the hold-out method.

57
Accuracy of the Classifier
Cancer Diagnosis

• If a patient has a lump in a region, the doctors may want to take a biopsy to see if it is cancerous.
• The doctor gets a sample of the mass, puts it under a microscope, takes a picture, and a trained lab
tech analyzes the picture to determine whether it is cancer or not. We get a picture like one of the
following:

58
Accuracy of the Classifier
• Unfortunately, distinguishing between benign vs malignant can be tricky. So, researchers have
studied the use of machine learning to help with this task.

• The idea is that we’ll ask the lab tech to analyze the image and compute various attributes:
things like the typical size of a cell, how much variation there is among the cell sizes, and so on.

• Then, we’ll try to use this information to predict (classify) whether the sample is malignant or
not. We have a training set of past samples from patients where the correct diagnosis is known,
and we’ll hope that our machine learning algorithm can use those to learn how to predict the
diagnosis for future samples.

59
Accuracy of the Classifier
For improved visibility, only two attributes were considered for the scatterplot.

Observation: That plot is misleading, because many points have identical values for both the x- and y-coordinates, so they overlap on the plot.

Action: Incorporating a confidence score into the results helped the algorithm produce a 99% accurate result.

60
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

61
UPDATING PREDICTIONS

• Classification is just a prediction of the class, based on the most common class among
the training points that are nearest our new point.

• Suppose that we eventually find out the true class of our new point. Then we will
know whether we got the classification right. Also, we will have a new point that we
can add to our training set, because we know its class. This updates our training set.
So, naturally, we will want to update our classifier based on the new training set.

• Let us look at some simple scenarios where new data leads us to update our
predictions. While the examples in the chapter are simple in terms of calculation, the
method of updating can be generalized to work in complex settings and is one of the
most powerful tools used for machine learning.

62
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

63
A “More Likely Than Not” Binary Classifier
Let’s try to use data to classify a point into one of two categories, choosing the category that we think is
more likely than not. To do this, we not only need the data but also a clear description of how chances are
involved.
Suppose there is a university class with the following composition:
•60% are Second Years
•40% are Third Years
•50% of Second Years have declared their major
•80% of Third Years have declared their major
Pick a student at random from the class.

Can he/she be classified as Second Year or Third Year using the “more likely than not” criterion?

• 60% chance that the picked student is a Second Year

• 40% chance that the picked student is a Third Year

Irrespective of major, it is easy to predict the year of the student from the given proportions of Second and Third Years in the class: the student is more likely to be a Second Year.

64
A “More Likely Than Not” Binary Classifier
Year Declared Undeclared
Second 30 30
Third 32 8

• The total count is 100 students, of whom 60 are Second Years and 40 are Third Years.

• Among the Second Years, 50% are in each of the Major categories.

• Among the 40 Third Years, 20% are Undeclared and 80% Declared.

• So, this population of 100 students has the same proportions as the class in our problem, and we can
assume that our student has been picked at random from among all 100 students.

65
A “More Likely Than Not” Binary Classifier
Updating the Prediction Based on New Information

Now in addition to the above scenario, the student has declared a major.

• It becomes important to look at the relation between year and major declaration.

• More students are Second Years than Third Years. But it’s also true that among the Third Years, a much higher percent have declared their major than among the Second Years.

• Previous case (before adding the information that the major is declared):

The student could fall in any of four categories: Second Year Declared, Second Year Undeclared, Third Year Declared, Third Year Undeclared.
The student was therefore more likely to be in the top row (Second Year), because that row contains more students.

66
A “More Likely Than Not” Binary Classifier
Present case (after adding the information that the major is declared):

The student must be in one of the two “Declared” cells: Second Year Declared (30 students) or Third Year Declared (32 students).

There are 62 students in those cells, and 32 out of the 62 are Third Years. That’s more than half, even though not by much.

So, including the new information about the student’s major leads us to update our prediction: we now classify the student as a Third Year (since the majority of the students who have declared are Third Years).

What is the chance that our classification is correct?

• It will be right for all 32 Third Years who are Declared,
• and wrong for the 30 Second Years who are Declared.
• The chance that we are correct is therefore 32/(30 + 32) ≈ 0.516.

In other words, the chance that we are correct is the proportion of Third Years among the students who have Declared.

67
A “More Likely Than Not” Binary Classifier
Tree Diagram
The previous calculation depends only on the proportions in the different categories, not
on the counts. The proportions can be visualized in a tree diagram, shown directly below
the pivot table for ease of comparison.
Pivot Table:
Year   Declared  Undeclared
Second 30        30
Third  32        8

Tree Diagram: Second (0.6) → Declared (0.5), Undeclared (0.5); Third (0.4) → Declared (0.8), Undeclared (0.2)

68
A “More Likely Than Not” Binary Classifier
Note:
• The “Third Year, Declared” branch contains the proportion 0.4 x 0.8 =0.32 of the
students, corresponding to the 32 students in the “Third Year, Declared” cell of the
pivot table.

• The “Second Year, Declared” branch contains 0.6 x 0.5 = 0.3 of the students,
corresponding to the 30 in the “Second Year, Declared” cell of the pivot table.

• We know that the student who was picked belongs to a “Declared” branch; that is, the
student is either in the top branch or the third from top. Those two branches now form
our reduced space of possibilities, and all chances have to be calculated relative to the
total chance of this reduced space.

• So, given that the student is Declared, the chance of them being a Third Year can be
calculated directly from the tree. The answer is the proportion in the “Third Year,
Declared” branch relative to the total proportion in the two “Declared” branches.

69
A “More Likely Than Not” Binary Classifier
Bayes’ Rule
Bayes’ Rule solved what was called an “inverse probability” problem: given new data, how can you update the chances you had found earlier? It is widely used now in machine learning.
Terminologies:
Prior probabilities. Before we knew the chosen student’s major declaration status, the
chance that the student was a Second Year was 60% and the chance that the student
was a Third Year was 40%. These are the prior probabilities of the two categories.

Likelihoods. These are the chances of the Major status, given the category of student;
thus they can be read off the tree diagram. For example, the likelihood of Declared
status given that the student is a Second Year is 0.5.

Posterior probabilities. These are the chances of the two Year categories, after we have taken into account information about the Major declaration status. We computed one of these:
The posterior probability that the student is a Third Year, given that the student has Declared, denoted P(Third Year | Declared), is calculated as follows.
70
A “More Likely Than Not” Binary Classifier
The posterior probability that the student is a Third Year, given that the student has Declared, is denoted P(Third Year | Declared) and is calculated as follows:

P(Third Year | Declared) = (0.4 × 0.8)/(0.6 × 0.5 + 0.4 × 0.8) = 0.32/0.62 ≈ 0.516

The other posterior probability is

P(Second Year | Declared) = (0.6 × 0.5)/(0.6 × 0.5 + 0.4 × 0.8) = 0.30/0.62 ≈ 0.484
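The same computation as a short sketch: multiply each prior by its likelihood, then renormalize over the "Declared" branches:

```python
# Posteriors for the student example via Bayes' Rule.
prior = {"Second": 0.6, "Third": 0.4}
likelihood_declared = {"Second": 0.5, "Third": 0.8}

joint = {yr: prior[yr] * likelihood_declared[yr] for yr in prior}  # prior x likelihood
p_declared = sum(joint.values())                                   # P(Declared) = 0.62

posterior = {yr: joint[yr] / p_declared for yr in joint}
print(posterior)  # {'Second': 0.4838..., 'Third': 0.5161...}
```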

71
A “More Likely Than Not” Binary Classifier
• That’s about 0.484, which is less than half, consistent with our classification of Third Year.

• Notice that both the posterior probabilities have the same denominator: the chance of the new
information, which is that the student has Declared.

• Because of this, Bayes’ method is sometimes summarized as a statement about proportionality: the posterior is proportional to the prior times the likelihood.

72
A “More Likely Than Not” Binary Classifier

Example: Predicting rain based on cloudy weather.


•Prior: P(Rain) = 0.2
•Likelihood: P(Cloudy|Rain) = 0.8
•Compute P(Rain|Cloudy) using Bayes’ Rule →
about 0.36 (36%).
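The slide gives only the prior and one likelihood; to reach roughly 0.36, Bayes' Rule also needs the chance of clouds when it does not rain. The value 0.35 below is an assumed figure chosen to reproduce the stated answer:

```python
# Rain example. P(Cloudy|No Rain) = 0.35 is an ASSUMPTION, not given on the slide.
p_rain = 0.2
p_cloudy_given_rain = 0.8
p_cloudy_given_dry = 0.35  # assumed false-alarm rate

p_rain_given_cloudy = (p_rain * p_cloudy_given_rain) / (
    p_rain * p_cloudy_given_rain + (1 - p_rain) * p_cloudy_given_dry
)
print(round(p_rain_given_cloudy, 2))  # 0.36
```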

73
A “More Likely Than Not” Binary Classifier
How do Bayes’ Rule and probability help us make rational decisions when information is incomplete?

•Most real-world decisions are made without complete data.

•Bayesian inference allows us to update our beliefs when new evidence arrives.

•A decision depends on:


• The evidence (data we observe)
• The prior belief (what we assumed before)
• The cost or benefit of different outcomes

• Example: An online retailer updates its belief about product demand after new sales
data comes in.

74
CLASSIFICATION
•Classification

•Nearest Neighbors

•Training and Testing

•Rows of Tables

•Implementing the Classifier

•Performance Measures

•Updating Predictions

•Binary Classifier

• Making Decisions.

75
Making decisions
• A primary use of Bayes’ Rule is to make decisions based on incomplete information,
incorporating new information as it comes in. Many medical tests for diseases return
Positive or Negative results.

• A Positive result means that according to the test, the patient has the disease. A
Negative result means the test concludes that the patient doesn’t have the disease.
• Medical tests are carefully designed to be very accurate. But few tests are accurate 100% of the time. Almost all tests make errors of two kinds:

• A false positive is an error in which the test concludes Positive but the patient doesn’t have the disease.

• A false negative is an error in which the test concludes Negative but the patient does have the disease.
These errors can affect people’s decisions
• False positives can cause anxiety and unnecessary treatment (which in some cases is
expensive or dangerous).
• False negatives can have even more serious consequences if the patient doesn’t receive
treatment because of their Negative test result.

76
Making decisions
A Test for a Rare Disease

Suppose there is a large population and a disease that strikes a tiny proportion of the
population. The tree diagram below summarizes information about such a disease and about a
medical test for it.

Given:
4 in 1000 people have the disease (P(Disease) = 0.004)
Test accuracy:
P(Positive|Disease) = 0.99 (true positive rate)
P(Positive|No Disease) = 0.005 (false positive rate)
Question: If a person tests positive, what is P(Disease|Positive)?

77
Making decisions
Suppose a person is picked at random from the population and tested. If the test result is
Positive, how would you classify them: Disease, or No disease?

• We can answer this by applying Bayes’ Rule and using our “more likely than not” classifier.

• Given that the person has tested Positive, the chance that he or she has the disease is the
proportion in the top branch, relative to the total proportion in the Test Positive branches
(0.004 × 0.99)/(0.004 × 0.99 + 0.996 × 0.005) ≈ 0.443

• Interpretation: Even after a positive test, it’s still more likely (55.7%) that the person does not
have the disease.
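A sketch of the same calculation:

```python
# Bayes' Rule for the rare-disease test, using the numbers above.
p_disease = 0.004
p_pos_given_disease = 0.99   # true positive rate
p_pos_given_healthy = 0.005  # false positive rate

p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_healthy
print(p_disease * p_pos_given_disease / p_pos)  # ~0.443: still more likely healthy
```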

78
Making decisions
Explaining the Counterintuitive Result

• The test is accurate, but the disease is rare.


• Most people who test positive don’t actually have the disease because false positives dominate.
• Key Insight: Base rates (prior probabilities) are crucial.
• Example: If 100,000 people are tested:
400 have the disease → 396 test positive.
99,600 are healthy → 498 test positive.

Condition    Negative  Positive
Disease      4         396
No Disease   99102     498

79
Making decisions
• The cells of the table have the right counts. For example, according to the description
of the population, 4 in 1000 people have the disease.

• There are 100,000 people in the table, so 400 should have the disease. That’s what the
table shows: 4 + 396 = 400. Of these 400, 99% get a Positive test result: 0.99 x 400 =
396.

• Among the Positives, the proportion that have the disease is:
396/(396 + 498) ≈ 0.443
• That’s the answer we got by using Bayes’ Rule. The counts in the Positives column show
why it is less than 1/2. Among the Positives, more people don’t have the disease than
do have the disease.

• The reason is that a huge fraction of the population doesn’t have the disease in the first
place. The tiny fraction of those that falsely test Positive are still greater in number than
the people who correctly test Positive.

80
Making decisions
This is easier to visualize in the tree diagram:

• The proportion of true Positives is a large fraction (0.99) of a tiny fraction (0.004) of
the population.

• The proportion of false Positives is a tiny fraction (0.005) of a large fraction (0.996) of
the population.

• These two proportions are comparable; the second is a little larger. So, given that
the randomly chosen person tested positive, we were right to classify them as more
likely than not to not have the disease.

81
Making Decisions
A Subjective Prior
Focus: Understanding how subjective beliefs
(priors) affect Bayesian decision-making
outcomes.

83
Making Decisions
A Subjective Prior
When Being Right Feels Wrong
• Our earlier decision classified a Positive patient as “No Disease.”

Statistically correct, but intuitively unsatisfying. Why?

• Because the assumption of randomness doesn’t reflect how people are tested in real life.

Key Idea:
People are not tested at random — they get tested because
they or their doctor suspect illness.

84
Making Decisions
A Subjective Prior
The Problem with the Random Sampling Assumption
The previous calculation assumed a randomly chosen person from the population.

Reality: Patients get tested because of symptoms or medical suspicion. Therefore, the prior probability P(Disease) is higher for those tested than in the general population.

Result: The earlier analysis underestimated the true probability for tested patients.

85
Making Decisions
A Subjective Prior
Introducing the Subjective Prior
A subjective prior represents personal or expert opinion rather
than population frequency.
Here: “The doctor thinks there’s a 5% chance the patient has
the disease.”

86
Making Decisions
A Subjective Prior
A subjective prior is a belief-based probability reflecting expert judgment.

Examples of subjective probabilities:
“Chance a candidate wins the election.”
“Chance of an earthquake in the next decade.”
“Chance a team wins the World Cup.”

These are not measurable frequencies, but informed opinions. Here, the doctor thinks there’s a 5% chance the patient has the disease.

87
Making Decisions
A Subjective Prior
Prior: P(Disease) = 0.05
P(Positive | Disease) = 0.99
P(Positive | No Disease) = 0.005
Apply Bayes’ Rule:
P(Disease | Positive) = (0.05 × 0.99)/(0.05 × 0.99 + 0.95 × 0.005) ≈ 0.912
Interpretation:
The posterior probability of disease is now 91.2%.
A positive test almost certainly means the patient has the disease.

88
Making Decisions
A Subjective Prior
Tree Structure:
Start with 100,000 patients.
5% (5000) believed to have disease.
Apply test accuracy and false positive rates.

Among Positive results:
4950/(4950 + 475) ≈ 0.912

89
Making Decisions
Confirming the Answer
Creating an artificial population:
• Though the doctor’s opinion is subjective, we can generate an
artificial population in which 5% of the people have the disease
and are tested using the same test.

• Then we can count people in different categories to see if the counts are consistent with the answer we got by using Bayes’ Rule.

Condition    Negative  Positive
Disease      50        4950
No Disease   94525     475

90
Making Decisions
Confirming the Answer
• In this artificially created population of 100,000 people, 5000
people (5%) have the disease, and 99% of them test Positive,
leading to 4950 true Positives.

• Compare this with the 475 false Positives: among the Positives, the proportion that have the disease is the same as what we got by Bayes’ Rule:

4950/(4950 + 475) ≈ 0.912
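A sketch that rebuilds this artificial population and confirms the posterior:

```python
# Artificial population of 100,000 under the doctor's 5% subjective prior.
n = 100_000
diseased = int(n * 0.05)                  # 5000 believed to have the disease
true_pos = int(diseased * 0.99)           # 4950 true Positives
false_pos = int((n - diseased) * 0.005)   # 475 false Positives

print(true_pos / (true_pos + false_pos))  # ~0.912, matching Bayes' Rule
```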

91
MAKING DECISIONS

•Bayesian inference connects belief, evidence, and decision.

•Subjective priors make probabilistic reasoning human-centered.

•The posterior combines data and expert intuition into actionable insight.

Inference + Expertise = Realistic Decisions

92