
CHAPTER 7 – UNIT 5

PROBABILISTIC LEARNING

BY

NAGARATNA B. CHITTARAGI

• One criticism that is often made of neural networks—especially the
MLP—is that it is not clear exactly what it is doing: while we can go and have
a look at the activations of the neurons and the weights, they don’t tell us
much.
• We’ve already seen some methods that don’t have this problem, principally
the decision tree in Chapter 12.
• In this chapter we are going to look at methods that are based on statistics,
and that are therefore more transparent, in that we can always extract and
look at the probabilities and see what they are, rather than having to worry
about weights that have no obvious meaning.
• We will look at how to perform classification by using the frequency with which
examples appear in the training data, and then we will see how we can deal
with our first example of unsupervised learning, when the labels are not
present for the training examples.
• If the data comes from known probability distributions, then we will see that it
is possible to solve this problem with a very neat algorithm, the
Expectation-Maximisation (EM) algorithm, which we will also see in other
guises in later chapters.
• Finally, we will have a look at a rather different way of using the dataset when
we look at nearest neighbour methods.

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model (GMM) is a mixture model whose overall probability distribution is a weighted combination of several Gaussian component distributions.

•Fitting a GMM therefore requires estimating statistics such as the mean and standard deviation (or covariance) of each component, together with the mixing weights, as its parameters.
•It is used to estimate the parameters of the probability distributions that best fit the density of a given training dataset.
•Although there are plenty of techniques available to estimate the parameters of a Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular techniques among them.
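As a concrete illustration, the short sketch below fits a two-component GMM by maximum likelihood using scikit-learn's GaussianMixture (which runs the EM algorithm internally). The synthetic data and all parameter values are illustrative assumptions, not part of these slides.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative 1-D data drawn from two Gaussians (means -2 and 3).
data = np.concatenate([rng.normal(-2.0, 0.5, 300),
                       rng.normal(3.0, 1.0, 700)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(data)                                        # maximum-likelihood fit via EM

print("weights:", gmm.weights_)                      # estimated mixing proportions
print("means:", gmm.means_.ravel())                  # estimated component means
print("stds:", np.sqrt(gmm.covariances_.ravel()))    # estimated standard deviations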

Unsupervised Machine Learning
•There may be many cases in which we do not have labeled data and need to find the hidden patterns in a given dataset.
•So, to solve such cases in machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?

•As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset.
•Instead, the models themselves find the hidden patterns and insights in the given data.
•It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have the input
data but no corresponding output data.

The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning
algorithm is given an input dataset containing
images of different types of cats and dogs. The
algorithm is never trained on the given
dataset, which means it has no idea
about the features of the dataset. The task of
the unsupervised learning algorithm is to
identify the image features on its own.
The unsupervised learning algorithm will perform
this task by clustering the image dataset into
groups according to the similarities between
images.
Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised
Learning:
▪Unsupervised learning is helpful for finding useful insights from the data.
▪Unsupervised learning is similar to how a human learns to think from their own
experiences, which makes it closer to real AI.
▪Unsupervised learning works on unlabeled and uncategorized data, which
makes unsupervised learning all the more important.
▪In the real world, we do not always have input data with the corresponding output,
so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning
The working of unsupervised learning can be understood from the diagram below:

Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
▪ Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have little or no similarity with the objects of another group.
▪ Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.
▪ Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database.
▪ It determines the sets of items that occur together in the dataset.
▪ Association rules make marketing strategies more effective. For example, people who buy item X (say, bread) also tend to purchase item Y (say, butter or jam). A typical example of an association rule is Market Basket Analysis.

Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning
algorithms:

▪K-means clustering
▪KNN (k-nearest neighbors)
▪Hierarchical clustering
▪Anomaly detection
▪Neural Networks
▪Principal Component Analysis
▪Independent Component Analysis
▪Apriori algorithm
▪Singular value decomposition

Real world applications of unsupervised algorithms:

◦ Bank/Internet Security: fraud/spam pattern discovery


◦ Biology: taxonomy of living things such as kingdom, phylum, class, order,
family, genus and species
◦ City-planning: Identifying groups of houses according to their house type,
value, and geographical location
◦ Climate change: understanding the Earth's climate, finding patterns in atmospheric
and ocean data
◦ Finance: stock clustering analysis to uncover correlation underlying shares
◦ Image Compression/segmentation: coherent pixels grouped
◦ Information retrieval/organization: Google search, topic-based news
◦ Land use: Identification of areas of similar land use in an earth observation
database
◦ Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
◦ Social network mining: automatic discovery of special-interest groups
Here is another example.

•Let’s say you have a YouTube channel. You may have a lot of data about the
subscribers of your channel. If you want to detect groups of similar subscribers,
then you may need to run a clustering algorithm.

•You don’t need to tell the algorithm which group a subscriber belongs to. The
algorithm can find those connections without your help.

•For example, it may tell you that 35% of your subscribers are from Canada,
while 20% of them are from the United States.

•Similarly, it can give you a lot of other information, and this will help you to target your
videos at each group. You can use a hierarchical clustering algorithm to
subdivide each group into smaller groups.

•That is how clustering works with unsupervised machine learning. A lot of advanced things can be achieved using this strategy.

Quiz

NEAREST NEIGHBOUR METHODS
• Think of how groups form in a classroom.
• Or how people arrange themselves into groups in a function hall.

• This is pretty much exactly the idea behind nearest neighbour methods: if
we don’t have a model that describes the data, then the best thing to do is to look
at similar data and choose to be in the same class as them.

• We have the datapoints positioned within the input space, so for a new test point we just need to work out which of the training data are close to it.

• This requires computing the distance to each datapoint in the training set, which is
relatively expensive:
• if we are in normal Euclidean space, then we have to compute d subtractions and
d squarings (we can ignore the square root since we only want to know which
points are the closest, not the actual distance), and this has to be done O(N²)
times.

• We can then identify the k nearest neighbours to the test point, and then set
the class of the test point to be the most common one out of those for the
nearest neighbours.
• The choice of k is not trivial. Make it too small and nearest neighbour methods
are sensitive to noise; make it too large and the accuracy falls, because points
that are too far away are being considered.
• Some possible effects of changing the size of k on the decision boundary are
shown in Figure 7.3.

• This method suffers from the curse of dimensionality (Section 2.1.2). First, as
shown above, the computational costs get higher as the number of
dimensions grows.
• This is not as bad as it might appear at first: there are sets of methods such
as KD-Trees (see Section 7.2.2 for more details) that compute this in O(N
logN) time.

• However, more importantly, as the number of dimensions increases, so the distance to other datapoints tends to increase.

• In addition, they can be far away in a variety of different directions—there might be points that are relatively close in some dimensions, but a long way in others.

• There are methods for dealing with these problems, known as adaptive
nearest neighbour methods, and there is a reference to them in the Further
Reading section at the end of the chapter.

• The only part of this that requires any care during the implementation is what
to do when there is more than one class found in the closest points, but even
with that the implementation is nice and simple:
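The listing itself is not reproduced on these slides, so the following is a minimal NumPy sketch of the idea; the function name, the tie-breaking rule, and the sample data are assumptions of the sketch, not the book's code.

import numpy as np

def knn_classify(k, train_X, train_y, test_point):
    # Squared Euclidean distances to every training point (the square root is
    # omitted because it does not change the ordering).
    d2 = np.sum((train_X - test_point) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]             # indices of the k closest points
    votes = np.bincount(train_y[nearest])    # count the class labels among them
    return np.argmax(votes)                  # most common class; ties go to the lowest label

# Illustrative usage with made-up 2-D points and class labels 0/1.
X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y = np.array([0, 0, 1, 1])
print(knn_classify(3, X, y, np.array([1.2, 1.5])))   # -> 0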

K-Nearest Neighbor (KNN) Algorithm for Machine Learning

▪ KNearest Neighbour is one of the simplest Machine Learning algorithms


based on Supervised Learning technique.
▪ KNN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar
to the available categories.
▪ KNN algorithm stores all the available data and classifies a new data point
based on the similarity. This means when new data appears then it can be
easily classified into a well suited category by using K NN algorithm.
▪ KNN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.

▪ KNN is a non-parametric algorithm, which means it does not make any


assumption on underlying data.
▪ It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
▪ KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to
the new data.

K-nearest neighbours (KNN) algorithm uses ‘feature similarity’ to predict the
values of new data points which further means that the new data point will be
assigned a value based on how closely it matches the points in the training set.

We can understand its working with the help of following steps −

Step 1 − We need a dataset, so during the first step of KNN
we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the
number of nearest data points to consider. K can be any integer.
Step 3 − For each point in the test data, do the following:

◦ 3.1 − Calculate the distance between the test data and each row of the training
data, using any suitable method, namely Euclidean, Manhattan, or Hamming distance.
◦ The most commonly used method to calculate distance is Euclidean.
◦ 3.2 − Now, based on the distance values, sort them in ascending order.
◦ 3.3 − Next, choose the top K rows from the sorted array.
◦ 3.4 − Now, assign a class to the test point based on the most
frequent class of these rows.

Step 4 − End

Let us start with a simple example. Consider the following table – it consists of the
height, age, and weight (target) values for 10 people. As you can see, the weight value
of ID11 is missing.
We need to predict the weight of this person based on their height and age.

Note: The data in this table does not represent actual values. It is merely used as an
example to explain this concept.

How Does the KNN Algorithm Work?

•As we saw above, the KNN algorithm can be used for both classification and
regression problems.
•The KNN algorithm uses ‘feature similarity’ to predict the values of any new
data points.
•This means that the new point is assigned a value based on how closely it
resembles the points in the training set.
• From our example, we know that ID11 has height and age similar to ID1 and
ID5, so the weight would also approximately be the same.

Below is a stepwise explanation of the
algorithm:
1. First, the distance between the new point
and each training point is calculated.

2. The closest k data points are selected (based on the distance). In this example,
points 1, 5, and 6 will be selected if the value of k is 3.
We will further explore the method to select the right value of k later in this article.

3. The average of these data points is the final prediction for the new point. Here,
we have the weight of ID11 = (77+72+60)/3 = 69.66 kg.

How to Calculate the Distance Between Points?

The first step is to calculate the distance between the new point and each
training point. There are various methods for calculating this distance, of which
the most commonly known are Euclidean, Manhattan (for continuous variables),
and Hamming distance (for categorical variables).

1. Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

2. Manhattan Distance: This is the distance between real vectors, calculated as the sum of
their absolute differences.

3. Hamming Distance: It is used for categorical variables. If the value (x) and the value (y) are
the same, the distance D will be equal to 0. Otherwise, D = 1.
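A short sketch of the three distance measures just described; the function names and example vectors are illustrative, not taken from the slides.

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def hamming(x, y):
    # Counts the positions where the (categorical) values differ.
    return int(np.sum(np.asarray(x) != np.asarray(y)))

print(euclidean([5, 4], [3, 5]))             # ~2.236
print(manhattan([5, 4], [3, 5]))             # 3
print(hamming(['red', 'M'], ['red', 'L']))   # 1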

There is also Minkowski distance which is a generalized form of Euclidean and
Manhattan distances.

Once the distance of a new observation from the points in our training set has
been measured, the next step is to pick the closest points. The number of points
to be considered is defined by the value of k.

The second step is to select the k value. This determines the number of neighbors we
look at when we assign a value to any new observation.
In our example, for a value k = 3, the closest points are ID1, ID5, and ID6.

ID11 = (77+72+60)/3
ID11 = 69.66 kg
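A minimal sketch of this regression step: take the k nearest training points and average their target values. The (height, age) feature values below are hypothetical; only the three neighbour weights 77, 72 and 60 kg come from the example.

import numpy as np

def knn_regress(k, train_X, train_t, test_point):
    d2 = np.sum((train_X - test_point) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(d2)[:k]                        # indices of the k closest points
    return np.mean(train_t[nearest])                    # average of their target values

X = np.array([[5.0, 45.0], [5.0, 26.0], [5.5, 36.0]])  # hypothetical (height, age) rows
t = np.array([77.0, 72.0, 60.0])                        # weights of the three neighbours
print(knn_regress(3, X, t, np.array([5.5, 38.0])))      # (77 + 72 + 60) / 3 = 69.66... kg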

Problem 1: Apply the K-Nearest Neighbour algorithm
for classification on the following dataset. Find the
class to which the data point X = {Maths = 6, Physics = 8}
belongs, using K = 3.

Apply each step and find the class for X = {Maths = 6, Physics = 8}.
• From these distances we have to consider the three nearest neighbours (K = 3).

• Here, the 2nd, 3rd, and 5th data elements are nearest to the new test data.

• At the same time, these three belong to the class Pass in the given dataset.

• Hence we should label the new test data with the class Pass.

• If different class labels are found among the neighbours, e.g. two Pass and one Fail
(or two Fail and one Pass), then the class label with the higher frequency among the
neighbours (the majority vote) must be chosen.

• This shows how the new test data is assigned a class label by
considering its three nearest neighbours (K = 3).

• Similarly, you can find the class for the data point X = {Maths = 5,
Physics = 7}.

• We are going to look next at how we can use these methods for regression,
before we turn to the question of how to perform the distance calculations as
efficiently as possible, something that is done simply but inefficiently in the code
above.

• We will then consider briefly whether or not the Euclidean distance is always the
most useful way to calculate distances, and what alternatives there are.
• For the k-nearest neighbours algorithm the bias-variance decomposition can be
computed as:
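In its usual textbook form, with σ² denoting the noise variance and x_(i) the i-th nearest neighbour of x, the decomposition reads:

E\left[ (y - \hat{f}(x))^2 \right] = \sigma^2 + \left( f(x) - \frac{1}{k} \sum_{i=1}^{k} f(x_{(i)}) \right)^2 + \frac{\sigma^2}{k}

Here the first term is the irreducible noise, the middle term is the squared bias, and the final term is the variance.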

• The way to interpret this is that when k is small, so that only a few neighbours are
considered, the model is flexible and can represent the underlying model well,
but it makes mistakes (has high variance) because it is relying on relatively little
data.

• As k increases, the variance decreases, but at the cost of less flexibility and so
more bias.
Efficient Distance Computations: the KD-Tree (K-Dimensional Tree)

• As was mentioned above, computing the distances between all pairs of points
is very computationally expensive.
• Fortunately, as with many problems in computer science, designing an efficient
data structure can reduce the computational overhead a lot.
• For the problem of finding nearest neighbours the data structure of choice is the
KD-Tree.
• It has been around since the late 1970s, when it was devised by Friedman and
Bentley, and it reduces the cost of finding a nearest neighbour to O(logN) for
O(N) storage.

• The construction of the tree is O(N log² N), with much of the computational cost
being in the computation of the median, which with a naïve algorithm requires a
sort and is therefore O(N log N), or can be computed with a randomised
algorithm in O(N) time.

• The idea behind the KD-tree is very simple. You create a binary tree by
choosing one dimension at a time to split into two, and placing the line through
the median of the point coordinates of that dimension.

• Not that different to a decision tree (Chapter 12), really. The points themselves
end up as leaves of the tree.
• Making the tree follows pretty much the same steps as usual for constructing
a binary tree: we identify a place to split into two choices, left and right, and
then carry on down the tree.
• This makes it natural to write the algorithm recursively. The choice of what to
split and where is what makes the KD-tree special.
• Just one dimension is split in each step, and the position of the split is found
by computing the median of the points that are to be split in that one
dimension, and putting the line there.
• In general, the choice of which dimension to split alternates through the
different choices, or it can be made randomly.
• The algorithm below cycles through the possible dimensions based on the
depth of the tree so far, so that in two dimensions it alternates horizontal and
vertical splits.
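A hedged sketch of this recursive median-split construction is given below (it is not the book's listing; the dictionary-based node layout is an assumption of the sketch). The seven points are the ones used in the worked example that follows.

import numpy as np

def build_kdtree(points, depth=0):
    points = np.asarray(points, dtype=float)
    if len(points) == 0:
        return None
    if len(points) == 1:
        return {'point': points[0]}              # a leaf holds a single datapoint
    d = depth % points.shape[1]                  # dimension to split, cycling with depth
    points = points[np.argsort(points[:, d])]    # sort along that dimension
    m = len(points) // 2                         # median position
    return {'split_dim': d,
            'split_value': points[m, d],         # split line through the median
            'left': build_kdtree(points[:m], depth + 1),
            'right': build_kdtree(points[m:], depth + 1)}

pts = [(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8)]
tree = build_kdtree(pts)
print(tree['split_dim'], tree['split_value'])    # 0 5.0 : first split is on x, through x = 5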

FIGURE 7.6 The splits and leaf points found by the KD-tree.

• Suppose that we had seven two-dimensional points to make a tree from:

(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8) (as plotted in Figure 7.5).

• The algorithm will pick the first coordinate to split on initially, and the median
point here is 5, so the split is through x = 5. Of those on the left of the line, the
median y coordinate is 6, and for those on the right it is 5.
• At this point we have separated all the points, and so the algorithm terminates
with the split shown in Figure 7.6 and the tree shown in Figure 7.7.

• Searching the tree is the same as any other binary tree; we are more
interested in finding the nearest neighbours of a test point.
• This is fairly easy: starting at the root of the tree you recurse down through the
tree comparing just one dimension at a time until you find a leaf node that is in
the region containing the test point.
• Using the tree shown in Figure 7.7 we introduce the test point (3, 5), which
finds (2, 2) as the leaf for the box that (3, 5) is in.
• However, looking at Figure 7.8 we see that this is not the closest point at all, so
we need to do some more work.

• The first thing we do is label the leaf we have found as a potential nearest
neighbour, and compute the distance between the test point and this point,
since any other candidate for nearest neighbour has to be closer than this.
• Now we need to check any other boxes that could contain something closer.
• Looking at Figure 7.8 you can see that point (2, 7) is closer, and that it is the
label of a leaf in the sibling box to the one that was returned, so the
algorithm also needs to check the sibling box.
• However, suppose that we used (4.5, 2) as the test point. In that case the
sibling is too far away, but another point (6, 1) is closer.
• So just checking the sibling is not enough — we also need to check the
siblings of the parent node, together with its descendants (the cousins of the
first point).
• A look at the figure again should convince you that the algorithm can then
terminate in most cases; very occasionally it can be necessary to go even
further afield, but it is easy to see which branches to prune.
• This leads to the following Python program:
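The program itself is not reproduced here; the following is a hedged sketch of the search just described, continuing the construction sketch above (it reuses build_kdtree and the point list pts). It descends to the leaf containing the test point and then backs up, searching the other side of a split only when the split line is closer than the best distance found so far.

import numpy as np

def nearest(node, test, best=None):
    # 'best' is a (point, distance) pair for the closest point found so far.
    if node is None:
        return best
    if 'point' in node:                                    # leaf: a candidate neighbour
        d = np.sqrt(np.sum((node['point'] - test) ** 2))
        if best is None or d < best[1]:
            best = (node['point'], d)
        return best
    dim, val = node['split_dim'], node['split_value']
    if test[dim] < val:
        near, far = node['left'], node['right']
    else:
        near, far = node['right'], node['left']
    best = nearest(near, test, best)                       # search the box containing the test point first
    if best is None or abs(test[dim] - val) < best[1]:     # the other side could still hold a closer point
        best = nearest(far, test, best)
    return best

point, dist = nearest(build_kdtree(pts), np.array([3.0, 5.0]))
print(point, dist)    # -> [2. 7.] 2.236..., closer than the leaf (2, 2) that is found first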
FIGURE 7.8 Two test points for the example KD-tree.

• For the sake of simplicity, let us understand a 2-D Tree with an example.

• The root would have an x-aligned plane, the root’s children would both have
y-aligned planes, the root’s grandchildren would all have x-aligned planes, and the
root’s great-grandchildren would all have y-aligned planes and so on.

• Generalization:

Let us number the planes as 0, 1, 2, …, (K – 1). From the above example, it is quite
clear that a point (node) at depth D will have an A-aligned splitting plane, where A is
calculated as:
A = D mod K

• How to determine if a point will lie in the left subtree or in right subtree?

• If a node's splitting plane is aligned in plane A, then its left subtree will contain all points
whose coordinate in that plane is smaller than that of the node.
• Similarly, the right subtree will contain all points whose coordinate in that plane is
greater than or equal to that of the node.
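A small sketch of point-by-point insertion under exactly this rule is given below (class and function names are illustrative): at depth D it compares on axis A = D mod K, sending smaller coordinates to the left subtree and greater-or-equal ones to the right.

class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def insert(root, point, depth=0, k=2):
    if root is None:
        return Node(point)
    axis = depth % k                        # A = D mod K
    if point[axis] < root.point[axis]:      # smaller coordinate -> left subtree
        root.left = insert(root.left, point, depth + 1, k)
    else:                                   # greater-or-equal -> right subtree
        root.right = insert(root.right, point, depth + 1, k)
    return root

root = None
for p in [(3, 6), (17, 15), (13, 15), (6, 12), (9, 1), (2, 7), (10, 19)]:
    root = insert(root, p)
print(root.point, root.left.point, root.right.point)   # (3, 6) (2, 7) (17, 15)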

Creation of a 2-D Tree:
Consider following points in a 2-D plane:
(3, 6), (17, 15), (13, 15), (6, 12), (9, 1), (2, 7), (10, 19)

How is space partitioned?
All 7 points will be plotted in the X-Y
plane as follows:

1.Point (3, 6) will divide the space into two


parts: Draw line X = 3.

2. Point (2, 7) will divide the space to the left of
line X = 3 into two parts horizontally.
Draw line Y = 7 to the left of line X = 3.

3. Point (17, 15) will divide the space to
the right of line X = 3 into two parts
horizontally.
Draw line Y = 15 to the right of line X = 3.

4. Point (6, 12) will divide the space below line Y = 15 and to the
right of line X = 3 into two parts.
Draw line X = 6 to the right of line X = 3 and below line Y = 15.

5. Point (13, 15) will divide the space below line Y = 15 and to
the right of line X = 6 into two parts.
Draw line X = 13 to the right of line X = 6 and below line Y = 15.

6. Point (9, 1) will divide the space
between lines X = 3, X = 6 and Y = 15 into
two parts.
Draw line Y = 1 between lines X = 3 and
X = 6.

7. Point (10, 19) will divide the space to the
right of line X = 3 and above line Y = 15 into
two parts.
Draw line Y = 19 to the right of line X = 3
and above line Y = 15.

Distance Measures
• We have computed the distance between points as the Euclidean distance,
which is something that you learnt about in high school.
• However, it is not the only option, nor is it necessarily the most useful.
• In this section we will look at the underlying idea behind distance calculations
and possible alternatives.
• If I were to ask you to find the distance between my house and the nearest
shop, then your first guess might involve taking a map of my town, locating my
house and the shop, and using a ruler to measure the distance between them.
• By careful application of the map scale you can now tell me how far it is.
• However, when I set out to buy some milk I’m liable to find that I have to walk
rather further than you’ve told me, since the direct line that you measured would
involve walking through (or over) several houses, and some serious
fence-scaling.
• Your ‘as the crow flies’ distance is the shortest possible path, and it is the
straight-line, or Euclidean, distance.
• You can measure it on the map by just using a ruler, but it essentially consists
of measuring the distance in one direction (we’ll call it north-south) and then the
distance in another direction that is perpendicular to the first (let’s call it
east-west) and then squaring them, adding them together, and then taking the
square root of that.

• Writing that out, the Euclidean distance that we are all used to is:
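In symbols, this is Equation (7.11):

d_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}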

• where (x1, y1) is the location of my house in some coordinate system (say by
using a GPS tracker) and (x2, y2) is the location of the shop.
• If I told you that my town was laid out on a grid block system, as is common in
towns that were built in the interval between the invention of the motor car and
the invention of innovative town planners, then you would probably use a
different measure.
• You would measure the distance between my house and the shop in the
‘north-south’ direction and the distance in the ‘east-west’ direction, and then add
the two distances together.
• This would correspond to the distance I actually had to walk. It is often known as
the city-block or Manhattan distance and looks like:
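In symbols, this is Equation (7.12):

d_C = |x_1 - x_2| + |y_1 - y_2|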

The point of this discussion is to show that there is more than one way to measure a
distance, and that they can provide radically different answers.
These two different distances can be seen in Figure 7.9.

• Mathematically, these distance measures are known as metrics.
• A metric function or norm takes two inputs and gives a scalar (the distance) back,
which is positive, and 0 if and only if the two points are the same, symmetric (so
that the distance to the shop is the same as the distance back), and obeys the
triangle inequality, which says that the distance from a to b plus the distance from
b to c should not be less than the direct distance from a to c.
• Most of the data that we are going to have to analyse lives in rather more than
two dimensions.
• Fortunately, the Euclidean distance that we know about generalises very well to
higher dimensions (and so does the city-block metric).
• In fact, these two measures are both instances of a class of metrics that work in
any number of dimensions.

The general measure is the Minkowski metric and it is written as:
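In its standard form, for d-dimensional points x and y, it is:

L_k(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{d} |x_i - y_i|^k \right)^{1/k}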

• If we put k = 1 then we get the city-block distance (Equation (7.12)), and k = 2 gives the Euclidean distance (Equation (7.11)).
• Thus, you might possibly see the Euclidean metric written as the L2 norm and the
city-block distance as the L1 norm.
• These norms have another interesting feature.
• Remember that we can define different averages of a set of numbers. If we
define the average as the point that minimises the sum of the distance to every
datapoint, then it turns out that the mean minimises the Euclidean distance (the
sum-of-squares distance), and the median minimises the L1 metric.
• We met another distance measure earlier: the Mahalanobis distance.
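For reference, with Σ the covariance matrix of the data, the Mahalanobis distance between points x and y is usually written as:

d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} \Sigma^{-1} (\mathbf{x} - \mathbf{y})}

which reduces to the Euclidean distance when Σ is the identity matrix.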

