NEURAL NETWORKS STUDY NOTES.
Chapter 1
This chapter covers the principal concepts of pattern recognition. It deals with:
Polynomial curve fitting
Parameter optimization
Generalization
Model complexity
Pattern recognition -> probability, decision criteria, Bayes' theorem
1.1 Character recognition
Pattern recognition covers a wide range of information processing problems.
An approach will be developed based on sound theoretical concepts.
Information is normally probabilistic in nature, and a statistical framework
provides a means of both operating on the data and representing the
results.
Consider image data represented by a vector x = (x1, …, xd)T, where d is the
total number of variables. This image has to be classified into one of k
classes C1, C2, …, Ck. The aim is to develop an algorithm or system to
classify such data. To do this, a collection of known data (a data set or
sample) is gathered for each class and used in the development of a
classification algorithm. The collection of available data is known as a
training set. It is normally much smaller than the total number of possible
variations in a particular class. The algorithm should be able to correctly
classify data which was not used in the training set - generalization.
e.g. for a d-dimensional space with d = 20 and a four-bit representation
per variable, there are 2^(20×4) = 2^80 possible elements in a sample.
Too big a number.
To reduce this, data can be combined into features. This reduces the 'data
set' considerably.
These could consist of ratios, averages etc. depending on the problem. A
feature can then be used to determine a decision boundary or threshold,
provided there is a marked distinction between the classes for that
particular feature.
e.g. elements of class C2 have bigger values of feature x than elements of
class C1. The class distributions may overlap. See (fig 1.2.)
(Fig 1.1, Fig 1.2)
Misclassifications are inevitable. However, they can be reduced by
increasing the number of features, though this number should not exceed a
certain limit. Better decision boundaries can be obtained with more features.
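The single-feature threshold idea above can be sketched in a few lines of Python. The feature values, labels, and threshold below are invented for illustration; the point is only that overlapping class distributions force some misclassifications.

```python
# Minimal sketch: classifying by a threshold on a single feature.
# All values here are made up for illustration.

def classify(x, threshold=0.5):
    """Assign class C2 if feature x exceeds the threshold, else C1."""
    return "C2" if x > threshold else "C1"

# C2 tends to have larger feature values than C1, but the ranges overlap,
# so some misclassification is inevitable whatever threshold is chosen.
samples = [(0.2, "C1"), (0.4, "C1"), (0.6, "C1"),   # the 0.6 point overlaps C2's range
           (0.7, "C2"), (0.9, "C2"), (0.45, "C2")]  # the 0.45 point overlaps C1's range

errors = sum(classify(x) != label for x, label in samples)
print(errors)  # 2 misclassified: the two overlapping points
```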
1.2 Classification and regression
The overall system maps a set of input variables (from which features are
extracted) to an output variable y, or to several outputs yk, where
k = 1, …, c and c is the number of outputs. The mapping is determined
using a set of training data. This data is fed into a mapping function with
adjustable weights: yk = yk(x; w), where w is a vector/matrix of
parameters or weights.
A neural network is a particular choice of a set of functions (or functional
forms) yk(x; w), together with procedures to optimize the mapping.
In classification problems we seek to approximate the probabilities of
membership of the different classes, expressed as functions of the input
variables.
Apart from classification, another pattern recognition task is regression. In
regression, future or unknown values are predicted given past values, and
the output represents the value of a continuous variable. Given a data set,
the relationship between the data points, i.e. the function connecting
them, is determined. This approximated function is then used to
determine values not present in the data set.
In regression, it is the regression function which is approximated (see
example).
A classical definition of regression
A term coined by Galton for the tendency of the quantitative traits of
offspring to be closer to the population mean than are their parents' traits.
It arises from a combination of factors - dominance, gene interactions, and
environmental influences on traits.
1.3 Pre-processing and feature extraction
The sequence of operations on data is:
pre-processing/feature extraction - a fixed transformation of variables;
classification - contains the adaptive parameters;
post-processing - converting the output to the required form.
Some definitions
Translational invariance - a classification system whose decisions are
insensitive to the location of the object (data) in an image (space).
Scale invariance - a feature does not depend on the size of the object.
Prior knowledge - information known about the desired form of the
solution.
1.4 The curse of dimensionality
Increasing the number of features can improve the performance of the
classification system; however, beyond a critical point, more features
reduce its performance.
Imagine two input variables x1 and x2 (d = 2). If each variable is divided
into M divisions, the number of cells is M^d (Fig 1.3):
M = 2: M^d = 2^2 = 4 cells;
M = 3: M^d = 3^2 = 9 cells.
The number of cells grows exponentially with the dimensionality of the
input space. This means the training data must increase by at least the
same amount. That's the curse.
-Since data is somewhat correlated, it tends only to fill a subspace of lower
dimensionality - the intrinsic dimensionality.
-Outputs do not change arbitrarily from one region of input space to
another, so output values at intermediate points can be inferred, as in
interpolation. These properties make a reduction of the number of inputs
possible, with the aim of reducing dimensionality.
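The exponential growth of the number of cells can be seen directly by computing M^d for a few dimensionalities. A minimal sketch (the choice of M = 3 is arbitrary):

```python
# Number of cells when each of d input variables is split into M divisions.
def n_cells(M, d):
    return M ** d

# The cell count, and hence the training data required to populate the
# cells, grows exponentially with the dimensionality d.
for d in (2, 3, 10, 20):
    print(f"d={d}: {n_cells(3, d)} cells")
```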
Polynomial curve fitting
We are trying to fit a polynomial of order M to N data points:
y(x; w) = w0 + w1 x + … + wM x^M = Σ (j=0..M) wj x^j
The parameters wj are determined by minimizing the error function
E = (1/2) Σ (n=1..N) [y(xn; w) − tn]^2,
where tn is the desired output (it can be xn or some other 'prior' value) and
xn is the nth data point.
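Minimizing E is a linear least-squares problem in the weights. A minimal sketch in Python, assuming NumPy is available; the data points below are made up for illustration:

```python
import numpy as np

# Least-squares polynomial fit: choose w to minimize
# E = (1/2) * sum_n (y(x_n; w) - t_n)**2. The data points are invented.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = np.array([0.1, 0.9, 1.1, 0.8, 0.3])

M = 3  # polynomial order
A = np.vander(x, M + 1, increasing=True)   # columns x**0, x**1, ..., x**M
w, *_ = np.linalg.lstsq(A, t, rcond=None)  # minimizes the sum of squares

y = A @ w                        # fitted values y(x_n; w) at the data points
E = 0.5 * np.sum((y - t) ** 2)   # the error function being minimized
print(E)
```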
Supervised learning - learning which involves some known target value.
Unsupervised learning - no target. The goal is not an input-output
mapping but to model the probability distribution of the data or some
other inherent structure.
Reinforcement learning - information is supplied to determine the quality
of the output, but no actual target values are given.
A system must be able to generalize well to cope with noise. Test data is
used to determine a system's generalization capability. Test data is
generated in the same way as the training data (including noise) but is not
used during training.
The order of the polynomial determines the number of degrees of freedom.
Curve fitting improves as the degrees of freedom increase, but only up to a
certain point. The curve should not fit the data exactly: that would be
over-fitting, which gives a poor representation of the original underlying
function.
1.5 Generalization
E_RMS = sqrt( (1/N) Σ (n=1..N) [y(xn; w*) − tn]^2 )
This is known as the root-mean-square error, where w* denotes the best
set of weights (minimum error).
Example.
For a given function
h(x) = 0.5 + 0.4 sin(2πx)
Fig 1.6: M = 1, bad fit. Fig 1.7: M = 3, good fit. Fig 1.8: M = 10, over-fitting.
Fig 1.9: the test-set error can be used to determine the best order;
M = 3 gives the minimum.
For 2-D data: Fig 1.11, a good boundary; Fig 1.12, an over-fitted boundary.
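The experiment in these figures can be sketched as follows, assuming NumPy is available; the sample sizes, noise level, and random seed are arbitrary choices. Training error always falls as the order M grows; it is the test error that exposes over-fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# The underlying function of the example, sampled with additive noise.
def h(x):
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 15)
t_train = h(x_train) + rng.normal(0, 0.05, x_train.size)
x_test = np.linspace(0.03, 0.97, 15)
t_test = h(x_test) + rng.normal(0, 0.05, x_test.size)

def rms(x, t, w):
    """Root-mean-square error E_RMS for weights w (lowest order first)."""
    y = np.polyval(w[::-1], x)
    return np.sqrt(np.mean((y - t) ** 2))

results = {}
for M in (1, 3, 10):
    A = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t_train, rcond=None)
    results[M] = (rms(x_train, t_train, w), rms(x_test, t_test, w))
    print(M, results[M])  # (training error, test error) per order M
```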
A model with little flexibility has a high bias.
A model with too much flexibility has a high variance.
1.6 Model complexity
As can be seen above (and from Occam's razor), simple models are better
than complex ones. But they should not be too simple (M = 1).
Another approach to controlling a model's effective complexity, apart from
the one mentioned above, is to introduce a penalty Ω (regularization term)
into the error function:
Ê = E + νΩ, where ν determines how much Ω influences the solution.
Ω increases with complexity (curvature):
Ω = (1/2) ∫ (d²y/dx²)² dx, therefore, after substitution,
Ê = (1/2) Σ (n=1..N) [y(xn; w) − tn]^2 + (ν/2) ∫ (d²y/dx²)² dx
This helps to control the balance between bias and variance. The limit on achievable error is set by the amount of noise.
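A rough sketch of regularized fitting, assuming NumPy is available. For simplicity a squared-weights penalty (weight decay) stands in for the curvature penalty Ω above; the data, the order M, and the value of ν are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of the sinusoidal example function (noise level arbitrary).
x = np.linspace(0, 1, 12)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0, 0.05, x.size)

M = 10                                     # deliberately over-flexible model
A = np.vander(x, M + 1, increasing=True)   # columns x**0 ... x**M

# Unregularized least-squares fit: minimizes E alone.
w_free, *_ = np.linalg.lstsq(A, t, rcond=None)

def fit_regularized(nu):
    """Minimize E_hat = E + nu * (1/2) * sum(w**2).

    Setting the gradient to zero gives (A.T A + nu I) w = A.T t.
    (A squared-weights penalty is used as a simple stand-in for the
    curvature penalty Omega in the text.)"""
    return np.linalg.solve(A.T @ A + nu * np.eye(M + 1), A.T @ t)

w_reg = fit_regularized(1e-3)
# The penalty shrinks the weights, which smooths the fitted curve.
print(np.abs(w_free).max(), np.abs(w_reg).max())
```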
1.7 Multivariate non- linear functions
Mapping polynomials can be extended to higher dimensions. E.g. for d
input variables and one output variable, the mapping could be represented,
up to third order, as
y = w0 + Σi1 wi1 xi1 + Σi1 Σi2 wi1i2 xi1 xi2 + Σi1 Σi2 Σi3 wi1i2i3 xi1 xi2 xi3
(each sum running from 1 to d). However, the number of independent
parameters would grow as d^M. This would need lots of training data. The
importance of neural networks is how they deal with this problem of
scaling and dimensionality. These models represent non-linear functions of
many variables in terms of superpositions of non-linear functions of a
single variable. These are known as hidden functions and are adapted to
the data. We shall consider the multi-layer perceptron and the radial basis
function network. For such networks the error falls as O(1/M), where M is
the number of hidden units; for polynomials it decreases as O(1/M^(2/d)).
However, training is computationally intensive and has the problem of
multiple minima in the error function.
1.8 Bayes’ Theorem
Without knowing anything about the features or characteristics of the
data, prior probabilities can be determined from a set of data. In
classification, the fraction of a class in the total number of data samples
is taken as its prior probability.
e.g. a sample of letters contains 3 'A's and 4 'B's; we can say the prior
probabilities of A and B are P(A) = 3/7 and P(B) = 4/7 respectively. This
means every new character would be assigned to 'B', since it has the
higher prior probability. If 4 more characters were sampled and the prior
probabilities became A: 3/11, B: 4/11, C: 4/11, then each new character
would be an arbitrary choice between B and C. This shows that prior
probabilities alone are an insufficient means of classification. However,
they still have to be taken into account. Additional classification criteria
are a necessity, and this is where the features of the data come in.
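The prior-only decision rule of the letters example can be sketched directly from the class counts:

```python
from collections import Counter

# Prior probabilities estimated from class counts, as in the letters example:
# a sample with 3 'A's and 4 'B's gives P(A) = 3/7 and P(B) = 4/7.
sample = list("AAABBBB")
counts = Counter(sample)
total = sum(counts.values())
priors = {c: n / total for c, n in counts.items()}

# Using priors alone, every new character is assigned the most probable class.
best = max(priors, key=priors.get)
print(priors, best)  # 'B' wins, since P(B) > P(A)
```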
Some definitions
Joint probability P(Ck, Xl) - the probability that the object has feature
value Xl and belongs to class Ck, where k = 1, 2, …, n (n is the total
number of classes) and l runs over the feature values.
Conditional probability P(Xl|Ck) - the probability that the observation has
feature Xl given that it belongs to class Ck. It is a measure of how strongly
the feature is associated with that particular class.
Example.
Consider two classes C1 and C2 and a feature X taking values X1 and X2:

        X1   X2   Total
C1      19   41    60
C2      12   28    40
Total   31   69   100

Row totals: C1 = 60, C2 = 40. Total samples = 60 + 40 = 100.
P(C1) = 60/Total Samples = 60/100 {probability of a randomly selected
sample belonging to C1}
P(X2) = 69/100 {probability of a randomly selected sample having
feature X2}
P(C1|X2) = 41/69, from the definition. It can also be said that
P(X2|C1) = 41/60.
P(X2|C1) is the probability that a sample has feature X2 given that it
belongs to class C1. It is also known as the class-conditional probability
of X2 for class C1.
P(C1, X2) = P(X2|C1) P(C1) = (41/60) × (60/100) = 0.41, OR
P(C1, X2) = P(C1|X2) P(X2) = (41/69) × (69/100) = 0.41
It can also be seen directly from the table that the (joint) probability of a
sample having feature X2 and belonging to class C1 is 41 out of 100
samples = 0.41.
Since P(X2|C1) P(C1) = P(C1|X2) P(X2), then
P(C1|X2) = P(X2|C1) P(C1) / P(X2)
The conditional probability on the left is known as the posterior
probability, and the relation above is known as Bayes' theorem.
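The numbers in the example can be checked mechanically. A sketch using exact fractions (all values come from the table above):

```python
from fractions import Fraction as F

# Probabilities read off the contingency table in the example above.
total = 100
P_C1 = F(60, total)            # prior: 60 of 100 samples are class C1
P_X2 = F(69, total)            # marginal: 69 of 100 samples show feature X2
P_X2_given_C1 = F(41, 60)      # class-conditional probability
P_C1_given_X2 = F(41, 69)      # posterior probability

# Both factorizations give the same joint probability P(C1, X2) = 41/100 ...
assert P_X2_given_C1 * P_C1 == P_C1_given_X2 * P_X2 == F(41, 100)
# ... and Bayes' theorem recovers the posterior from the other three terms.
assert P_X2_given_C1 * P_C1 / P_X2 == P_C1_given_X2
print(P_C1_given_X2)  # 41/69
```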
An Excerpt from the web
Let's use the same example, but shorten each event to its one letter initial, ie: A, B,
C, and D instead of Aberations, Brochmailians, Chompieliens, and Defective.
P(D|B) is not a Bayes problem. This is given in the problem. Bayes' formula finds
the reverse conditional probability P(B|D).
It is based on the fact that the given event (D) is made up of three parts: the
part of D in A, the part of D in B, and the part of D in C.
P(B and D)
P(B|D) = -----------------------------------------
P(A and D) + P(B and D) + P(C and D)
Inserting the multiplication rule for each of these joint probabilities gives
P(D|B)*P(B)
P(B|D) = -----------------------------------------
P(D|A)*P(A) + P(D|B)*P(B) + P(D|C)*P(C)
However, and I hope you agree, it is much easier to take the joint probability
divided by the marginal probability. The table does the adding for you and
makes the problems doable without having to memorize the formulas.
Company Good Defective Total
(A) Aberations 0.50-0.025 = 0.475 0.05(0.50) = 0.025 0.50
(B) Brochmailians 0.30-0.021 = 0.279 0.07(0.30) = 0.021 0.30
(C ) Chompieliens 0.20-0.020 = 0.180 0.10(0.20) = 0.020 0.20
Total 0.934 0.066 1.00
N.B. the marginal probability here is P(D) (= 0.066). P(C1) and
P(C2) are also marginal probabilities. Note that the marginal
probabilities add up to 1.00.
Whereas,
P(A and D) + P(B and D) + P(C and D) = 0.025 + 0.021 + 0.020 = 0.066 = P(D)
Thus it can be said that the denominator acts as a normalizing factor, i.e.
the conditional probabilities all add up to unity. Therefore,
P(C1|X2) + P(C2|X2) = 1
From the excerpt, P(B|D) + P(C|D) + P(A|D) = P(B,D)/P(D) + P(C,D)/P(D) + P(A,D)/P(D)
= 0.021/0.066 + 0.020/0.066 + 0.025/0.066 = 1
N.B. P(D) ensures that the sum of the conditional probabilities equals unity.
Without it, the sum would be 0.066.
It is a NORMALIZING FACTOR.
From the example it is clear that P(C1, X2) + P(C2, X2) = P(X2);
these are 41% + 28% = 69%.
The relation above can also be written as
P(X2) = P(X2|C1)P(C1) + P(X2|C2)P(C2)
P(X2) has thus been expressed in terms of the prior probabilities and
class-conditional probabilities.
1.8.1 Inference and decision
The importance of Bayes' theorem is that posterior probabilities can be
calculated and expressed in terms of easily obtainable quantities. In the
examples, these are the values in the tables, which are obtained directly
from the data.
Classification of an object with a particular feature value Xl is done by
assigning the object to the class Ck for which the posterior probability
P(Ck|Xl) is greatest. This is Bayes' rule.
In practice, care must be taken in data collection to avoid large disparities
between our expectations (calculated posterior probabilities) and reality.
Prior probabilities must be correctly accounted for.
Implementation of Bayes' theorem means evaluating the class-conditional
and prior probabilities separately, then using Bayes' theorem to evaluate
the posterior probabilities.
Outputs of neural networks can be interpreted as posterior probabilities
provided that the error function is chosen appropriately.
1.8.2 Bayesian versus frequentist statistics
We assumed that the number of samples or observations tends to infinity.
This is the frequentist view of probabilities.
More generally, probabilities express our degree of belief that a particular
event will occur. Bayes' theorem is then a precise quantitative prescription
for updating these beliefs when new data is presented.
1.8.3. Probability densities
Feature variables can be regarded as continuous. The notation is as
follows: upper case denotes probabilities, while lower case denotes
probability densities.
P(x ∈ [a, b]) = ∫ (a..b) p(x) dx    (1.14)
p(x) is normalized so that P(x ∈ [a, b]) = 1 when [a, b] covers the whole
of x-space.
A more general form, for a region R of the space, is:
P(x ∈ R) = ∫ (R) p(x) dx    (1.15)
The average or expectation of a particular function Q(x) is:
E[Q] = ∫ Q(x) p(x) dx    (1.16)
For a finite set of data points drawn from p(x),
E[Q] = ∫ Q(x) p(x) dx ≅ (1/N) Σ (n=1..N) Q(xn)    (1.17)
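The finite-sample approximation of an expectation can be sketched with a case where the integral is known exactly. Here p(x) is taken to be uniform on [0, 1] and Q(x) = x², both arbitrary choices, so E[Q] = 1/3:

```python
import random

random.seed(0)

# Approximating E[Q] = integral of Q(x) p(x) dx by the finite-sample average
# (1/N) * sum_n Q(x_n), with the x_n drawn from p(x). With p uniform on
# [0, 1] and Q(x) = x**2, the exact expectation is 1/3.
N = 100_000
estimate = sum(random.random() ** 2 for _ in range(N)) / N
print(estimate)  # close to 1/3
```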
1.8.4 Baye’s theory in general
This incorporates the probability densities:
P(Ck|x) = p(x|Ck) P(Ck) / p(x)    (1.21)
p(x) = Σ (k=1..c) p(x|Ck) P(Ck)    (1.22)
Σ (k=1..c) P(Ck|x) = 1    (1.23)
When the class-conditional densities are viewed as parameterized
functions, they are referred to as likelihood functions. Bayes' theorem can
then be summarized as
posterior = (likelihood × prior) / normalization factor    (1.24)
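The density form of Bayes' theorem can be sketched numerically. Gaussian class-conditional densities are assumed here purely for illustration, and the means, standard deviations, and priors are invented:

```python
import math

# posterior = likelihood x prior / normalization, with Gaussian
# class-conditional densities p(x|Ck). All parameter values are made up.
def gauss(x, mu, sigma):
    """Value at x of the Gaussian density with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"C1": 0.6, "C2": 0.4}
params = {"C1": (0.0, 1.0), "C2": (2.0, 1.0)}  # (mean, std) per class

def posteriors(x):
    # Numerators: p(x|Ck) * P(Ck) for each class.
    joint = {k: gauss(x, *params[k]) * priors[k] for k in priors}
    p_x = sum(joint.values())  # the normalizing factor p(x)
    return {k: v / p_x for k, v in joint.items()}

post = posteriors(1.0)
print(post)  # the posteriors sum to one because of the normalizing factor
```

At x = 1.0 the two likelihoods happen to be equal, so the posteriors reduce to the priors; elsewhere the likelihoods shift the balance.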
1.9 Decision Boundaries
1.10 Minimizing risk