NEURAL NETWORKS STUDY NOTES.
Chapter 1
This chapter covers the principal concepts of pattern recognition. It deals with:
Polynomial curve fitting
Parameter optimization
Generalization
Model complexity
Pattern recognition -> probability, decision criteria, Bayes' theorem
1.1 Character recognition
Pattern recognition covers a wide range of information processing problems.
An approach will be developed based on sound theoretical concepts.
Information is normally probabilistic in nature, and a statistical framework
provides a means of both operating on the data and representing the
results.
Consider image data represented by a vector x = (x1, …, xd)T, where d is the
total number of variables. This image has to be classified into one of k
classes C1, C2, …, Ck. The aim is to develop an algorithm or system to
classify such data. To do this, a collection of known data (a data set or
sample) is gathered for each class and used in the development of a
classification algorithm. The collection of available data is known as a
training set. It is normally much smaller than the total number of possible
variations in a particular class. The algorithm should be able to correctly
classify data which was not used in the training set - generalization.
e.g. for a d-dimensional space with d = 20 and a four-bit representation
per variable, there are 2^(20×4) = 2^80 possible elements in a sample.
Too big a number.
To reduce this, data can be combined into features. This reduces the 'data
set' considerably.
These could consist of ratios, averages etc. depending on the problem. A
feature can then be used to determine a decision boundary or threshold,
provided there is a marked distinction between the classes for that
particular feature.
e.g. elements of class C2 have bigger values of feature x than elements of
class C1. The class distributions may overlap. See (fig 1.2.)
(Fig 1.1, Fig 1.2)
Misclassifications are inevitable. However, they can be reduced by
increasing the number of features, though this number should not exceed a
certain limit. Better decision boundaries can be obtained with more features.
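The single-feature threshold idea above can be sketched in a few lines of Python. The feature values, labels, and threshold below are invented for illustration; the point is only that overlapping class distributions force some misclassifications.

```python
# Minimal sketch: classifying by a threshold on a single feature.
# All values here are made up for illustration.

def classify(x, threshold=0.5):
    """Assign class C2 if feature x exceeds the threshold, else C1."""
    return "C2" if x > threshold else "C1"

# C2 tends to have larger feature values than C1, but the ranges overlap,
# so some misclassification is inevitable whatever threshold is chosen.
samples = [(0.2, "C1"), (0.4, "C1"), (0.6, "C1"),   # the 0.6 point overlaps C2's range
           (0.7, "C2"), (0.9, "C2"), (0.45, "C2")]  # the 0.45 point overlaps C1's range

errors = sum(classify(x) != label for x, label in samples)
print(errors)  # 2 misclassified: the two overlapping points
```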
1.2 Classification and regression
The overall system maps a set of input variables (from which features are
extracted) to an output variable y, or to several outputs yk, where
k = 1, …, c and c is the number of outputs. The mapping is determined
using a set of training data. This data is fed into a mapping function with
adjustable weights: yk = yk(x; w), where w is a vector/matrix of
parameters or weights.
A neural network is a particular choice of a set of functions (or functional
forms) yk(x; w), together with procedures to optimize the mapping.
In classification problems we seek to approximate the probabilities of
membership of the different classes, expressed as functions of the input
variables.
Apart from classification, another pattern recognition task is regression. In
regression, future or unknown values are predicted given past values, and
the output represents the value of a continuous variable. Given a data set,
the relationship between the data points, i.e. the function connecting
them, is determined. This approximated function is then used to
determine values not present in the data set.
In regression, it is the regression function which is approximated (see
example).
A classical definition of regression
A term coined by Galton for the tendency of the quantitative traits of
offspring to be closer to the population mean than are their parents' traits.
It arises from a combination of factors - dominance, gene interactions, and
environmental influences on traits.
1.3 Pre-processing and feature extraction
The sequence of operations on data is:
pre-processing/feature extraction - a fixed transformation of variables;
classification - contains the adaptive parameters;
post-processing - converting the output to the required form.
Some definitions
Translational invariance - a classification system whose decisions are
insensitive to the location of the object (data) in an image (space).
Scale invariance - a feature does not depend on the size of the object.
Prior knowledge - information known about the desired form of the
solution.
1.4 The curse of dimensionality
Increasing the number of features can improve the performance of the
classification system; however, beyond a critical point, more features
reduce its performance.
Imagine two input variables x1 and x2 (d = 2). If each variable is divided
into M divisions, the number of cells is M^d (Fig 1.3):
M = 2: M^d = 2^2 = 4 cells;
M = 3: M^d = 3^2 = 9 cells.
The number of cells grows exponentially with the dimensionality of the
input space. This means the training data must increase by at least the
same amount. That's the curse.
-Since data is somewhat correlated, it tends only to fill a subspace of lower
dimensionality - the intrinsic dimensionality.
-Outputs do not change arbitrarily from one region of input space to
another, so output values at intermediate points can be inferred, as in
interpolation. These properties make a reduction of the number of inputs
possible, with the aim of reducing dimensionality.
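The exponential growth of the number of cells can be seen directly by computing M^d for a few dimensionalities. A minimal sketch (the choice of M = 3 is arbitrary):

```python
# Number of cells when each of d input variables is split into M divisions.
def n_cells(M, d):
    return M ** d

# The cell count, and hence the training data required to populate the
# cells, grows exponentially with the dimensionality d.
for d in (2, 3, 10, 20):
    print(f"d={d}: {n_cells(3, d)} cells")
```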
Polynomial curve fitting
We are trying to fit a polynomial of order M to N data points:
y(x; w) = w0 + w1 x + … + wM x^M = Σ (j=0..M) wj x^j
The parameters wj are determined by minimizing the error function
E = (1/2) Σ (n=1..N) [y(xn; w) − tn]^2,
where tn is the desired output (it can be xn or some other 'prior' value) and
xn is the nth data point.
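Minimizing E is a linear least-squares problem in the weights. A minimal sketch in Python, assuming NumPy is available; the data points below are made up for illustration:

```python
import numpy as np

# Least-squares polynomial fit: choose w to minimize
# E = (1/2) * sum_n (y(x_n; w) - t_n)**2. The data points are invented.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = np.array([0.1, 0.9, 1.1, 0.8, 0.3])

M = 3  # polynomial order
A = np.vander(x, M + 1, increasing=True)   # columns x**0, x**1, ..., x**M
w, *_ = np.linalg.lstsq(A, t, rcond=None)  # minimizes the sum of squares

y = A @ w                        # fitted values y(x_n; w) at the data points
E = 0.5 * np.sum((y - t) ** 2)   # the error function being minimized
print(E)
```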
Supervised learning - learning which involves some known target value.
Unsupervised learning - no target. The goal is not an input-output
mapping but to model the probability distribution of the data or some
other inherent structure.
Reinforcement learning - information is supplied to determine the quality
of the output, but no actual target values are given.
A system must be able to generalize well to cope with noise. Test data is
used to determine a system's generalization capability. Test data is
generated in the same way as the training data (including noise) but is not
used during training.
The order of the polynomial determines the number of degrees of freedom.
Curve fitting improves as the degrees of freedom increase, but only up to a
certain point. The curve should not fit the data exactly: that would be
over-fitting, which gives a poor representation of the original underlying
function.
1.5 Generalization
E_RMS = sqrt( (1/N) Σ (n=1..N) [y(xn; w*) − tn]^2 )
This is known as the root-mean-square error, where w* denotes the best
set of weights (minimum error).
Example.
For a given function
h(x) = 0.5 + 0.4 sin(2πx)
Fig 1.6: M = 1, bad fit. Fig 1.7: M = 3, good fit. Fig 1.8: M = 10, over-fitting.
Fig 1.9: the test-set error can be used to determine the best order;
M = 3 gives the minimum.
For 2-D data: Fig 1.11, a good boundary; Fig 1.12, an over-fitted boundary.
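The experiment in these figures can be sketched as follows, assuming NumPy is available; the sample sizes, noise level, and random seed are arbitrary choices. Training error always falls as the order M grows; it is the test error that exposes over-fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# The underlying function of the example, sampled with additive noise.
def h(x):
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 15)
t_train = h(x_train) + rng.normal(0, 0.05, x_train.size)
x_test = np.linspace(0.03, 0.97, 15)
t_test = h(x_test) + rng.normal(0, 0.05, x_test.size)

def rms(x, t, w):
    """Root-mean-square error E_RMS for weights w (lowest order first)."""
    y = np.polyval(w[::-1], x)
    return np.sqrt(np.mean((y - t) ** 2))

results = {}
for M in (1, 3, 10):
    A = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t_train, rcond=None)
    results[M] = (rms(x_train, t_train, w), rms(x_test, t_test, w))
    print(M, results[M])  # (training error, test error) per order M
```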
A model with little flexibility has a high bias.
A model with too much flexibility has a high variance.
1.6 Model complexity
As can be seen above (and from Occam's razor), simple models are better
than complex ones. But they should not be too simple (M = 1).
Another approach to controlling a model's effective complexity, apart from
the one mentioned above, is to introduce a penalty Ω (regularization term)
into the error function:
Ê = E + νΩ, where ν determines how much Ω influences the solution.
Ω increases with complexity (curvature):
Ω = (1/2) ∫ (d²y/dx²)² dx, therefore, after substitution,
Ê = (1/2) Σ (n=1..N) [y(xn; w) − tn]^2 + (ν/2) ∫ (d²y/dx²)² dx
This helps to control the balance between bias and variance. The limit on achievable error is set by the amount of noise.
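A rough sketch of regularized fitting, assuming NumPy is available. For simplicity a squared-weights penalty (weight decay) stands in for the curvature penalty Ω above; the data, the order M, and the value of ν are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of the sinusoidal example function (noise level arbitrary).
x = np.linspace(0, 1, 12)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0, 0.05, x.size)

M = 10                                     # deliberately over-flexible model
A = np.vander(x, M + 1, increasing=True)   # columns x**0 ... x**M

# Unregularized least-squares fit: minimizes E alone.
w_free, *_ = np.linalg.lstsq(A, t, rcond=None)

def fit_regularized(nu):
    """Minimize E_hat = E + nu * (1/2) * sum(w**2).

    Setting the gradient to zero gives (A.T A + nu I) w = A.T t.
    (A squared-weights penalty is used as a simple stand-in for the
    curvature penalty Omega in the text.)"""
    return np.linalg.solve(A.T @ A + nu * np.eye(M + 1), A.T @ t)

w_reg = fit_regularized(1e-3)
# The penalty shrinks the weights, which smooths the fitted curve.
print(np.abs(w_free).max(), np.abs(w_reg).max())
```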
1.7 Multivariate non- linear functions
Mapping polynomials can be extended to higher dimensions. E.g. for d
input variables and one output variable, the mapping could be represented,
up to third order, as
y = w0 + Σi1 wi1 xi1 + Σi1 Σi2 wi1i2 xi1 xi2 + Σi1 Σi2 Σi3 wi1i2i3 xi1 xi2 xi3
(each sum running from 1 to d). However, the number of independent
parameters would grow as d^M. This would need lots of training data. The
importance of neural networks is how they deal with this problem of
scaling and dimensionality. These models represent non-linear functions of
many variables in terms of superpositions of non-linear functions of a
single variable. These are known as hidden functions and are adapted to
the data. We shall consider the multi-layer perceptron and the radial basis
function network. For such networks the error falls as O(1/M), where M is
the number of hidden units; for polynomials it decreases as O(1/M^(2/d)).
However, training is computationally intensive and has the problem of
multiple minima in the error function.
1.8 Bayes’ Theorem
Without knowing anything about the features or characteristics of the
data, prior probabilities can be determined from a set of data. In
classification, the fraction of a class in the total number of data samples
is taken as its prior probability.
e.g. a sample of letters contains 3 'A's and 4 'B's; we can say the prior
probabilities of A and B are P(A) = 3/7 and P(B) = 4/7 respectively. This
means every new character would be assigned to 'B', since it has the
higher prior probability. If 4 more characters were sampled and the prior
probabilities became A: 3/11, B: 4/11, C: 4/11, then each new character
would be an arbitrary choice between B and C. This shows that prior
probabilities alone are an insufficient means of classification. However,
they still have to be taken into account. Additional classification criteria
are a necessity, and this is where the features of the data come in.
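The prior-only decision rule of the letters example can be sketched directly from the class counts:

```python
from collections import Counter

# Prior probabilities estimated from class counts, as in the letters example:
# a sample with 3 'A's and 4 'B's gives P(A) = 3/7 and P(B) = 4/7.
sample = list("AAABBBB")
counts = Counter(sample)
total = sum(counts.values())
priors = {c: n / total for c, n in counts.items()}

# Using priors alone, every new character is assigned the most probable class.
best = max(priors, key=priors.get)
print(priors, best)  # 'B' wins, since P(B) > P(A)
```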
Some definitions
Joint probability P(Ck, Xl) - the probability that the object has feature
value Xl and belongs to class Ck, where k = 1, 2, …, n (n is the total
number of classes) and l runs over the feature values.
Conditional probability P(Xl|Ck) - the probability that the observation has
feature Xl given that it belongs to class Ck. It is a measure of how strongly
the feature is associated with that particular class.
Example.
Consider two classes C1 and C2 and a feature X taking values X1 and X2:

        X1   X2   Total
C1      19   41    60
C2      12   28    40
Total   31   69   100

Row totals: C1 = 60, C2 = 40. Total samples = 60 + 40 = 100.
P(C1) = 60/Total Samples = 60/100 {probability of a randomly selected
sample belonging to C1}
P(X2) = 69/100 {probability of a randomly selected sample having
feature X2}
P(C1|X2) = 41/69, from the definition. It can also be said that
P(X2|C1) = 41/60.
P(X2|C1) is the probability that a sample has feature X2 given that it
belongs to class C1. It is also known as the class-conditional probability
of X2 for class C1.
P(C1, X2) = P(X2|C1) P(C1) = (41/60) × (60/100) = 0.41, OR
P(C1, X2) = P(C1|X2) P(X2) = (41/69) × (69/100) = 0.41
It can also be seen directly from the table that the (joint) probability of a
sample having feature X2 and belonging to class C1 is 41 out of 100
samples = 0.41.
Since P(X2|C1) P(C1) = P(C1|X2) P(X2), then
P(C1|X2) = P(X2|C1) P(C1) / P(X2)
The conditional probability on the left is known as the posterior
probability, and the relation above is known as Bayes' theorem.
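The numbers in the example can be checked mechanically. A sketch using exact fractions (all values come from the table above):

```python
from fractions import Fraction as F

# Probabilities read off the contingency table in the example above.
total = 100
P_C1 = F(60, total)            # prior: 60 of 100 samples are class C1
P_X2 = F(69, total)            # marginal: 69 of 100 samples show feature X2
P_X2_given_C1 = F(41, 60)      # class-conditional probability
P_C1_given_X2 = F(41, 69)      # posterior probability

# Both factorizations give the same joint probability P(C1, X2) = 41/100 ...
assert P_X2_given_C1 * P_C1 == P_C1_given_X2 * P_X2 == F(41, 100)
# ... and Bayes' theorem recovers the posterior from the other three terms.
assert P_X2_given_C1 * P_C1 / P_X2 == P_C1_given_X2
print(P_C1_given_X2)  # 41/69
```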
An Excerpt from the web
Let's use the same example, but shorten each event to its one letter initial, ie: A, B,
C, and D instead of Aberations, Brochmailians, Chompieliens, and Defective.
P(D|B) is not a Bayes problem. This is given in the problem. Bayes' formula finds
the reverse conditional probability P(B|D).
It is based on the fact that the given event (D) is made up of three parts: the
part of D in A, the part of D in B, and the part of D in C.
P(B and D)
P(B|D) = -----------------------------------------
P(A and D) + P(B and D) + P(C and D)
Inserting the multiplication rule for each of these joint probabilities gives
P(D|B)*P(B)
P(B|D) = -----------------------------------------
P(D|A)*P(A) + P(D|B)*P(B) + P(D|C)*P(C)
However, and I hope you agree, it is much easier to take the joint probability
divided by the marginal probability. The table does the adding for you and
makes the problems doable without having to memorize the formulas.
Company Good Defective Total
(A) Aberations 0.50-0.025 = 0.475 0.05(0.50) = 0.025 0.50
(B) Brochmailians 0.30-0.021 = 0.279 0.07(0.30) = 0.021 0.30
(C ) Chompieliens 0.20-0.020 = 0.180 0.10(0.20) = 0.020 0.20
Total 0.934 0.066 1.00
N.B. the marginal probability here is P(D) (= 0.066). P(C1) and
P(C2) are also marginal probabilities. Note that the marginal
probabilities add up to 1.00.
Whereas,
P(A and D) + P(B and D) + P(C and D) = 0.025 + 0.021 + 0.020 = 0.066 = P(D)
Thus it can be said that the denominator acts as a normalizing factor, i.e.
the conditional probabilities all add up to unity. Therefore,
P(C1|X2) + P(C2|X2) = 1
From the excerpt, P(B|D) + P(C|D) + P(A|D) = P(B,D)/P(D) + P(C,D)/P(D) + P(A,D)/P(D)
= 0.021/0.066 + 0.020/0.066 + 0.025/0.066 = 1
N.B. P(D) ensures that the sum of the conditional probabilities equals unity.
Without it, the sum would be 0.066.
It is a NORMALIZING FACTOR.
From the example it is clear that P(C1, X2) + P(C2, X2) = P(X2);
these are 41% + 28% = 69%.
The relation above can also be written as
P(X2) = P(X2|C1)P(C1) + P(X2|C2)P(C2)
P(X2) has thus been expressed in terms of the prior probabilities and
class-conditional probabilities.
1.8.1 Inference and decision
The importance of Bayes' theorem is that posterior probabilities can be
calculated and expressed in terms of easily obtainable quantities. In the
examples, these are the values in the tables, which are obtained directly
from the data.
Classification of an object with a particular feature value Xl is done by
assigning the object to the class Ck for which the posterior probability
P(Ck|Xl) is greatest. This is Bayes' rule.
In practice, care must be taken in data collection to avoid large disparities
between our expectations (calculated posterior probabilities) and reality.
Prior probabilities must be correctly accounted for.
Implementation of Bayes' theorem means evaluating the class-conditional
and prior probabilities separately, then using Bayes' theorem to evaluate
the posterior probabilities.
Outputs of neural networks can be interpreted as posterior probabilities
provided that the error function is chosen appropriately.
1.8.2 Bayesian versus frequentist statistics
We assumed that the number of samples or observations tends to infinity.
This is the frequentist view of probabilities.
More generally, probabilities express our degree of belief that a particular
event will occur. Bayes' theorem is then a precise quantitative prescription
for updating these beliefs when new data is presented.
1.8.3. Probability densities
Feature variables can be regarded as continuous. The notation is as
follows: upper case denotes probabilities, while lower case denotes
probability densities.
P(x ∈ [a, b]) = ∫ (a..b) p(x) dx    (1.14)
p(x) is normalized so that P(x ∈ [a, b]) = 1 when [a, b] covers the whole
of x-space.
A more general form, for a region R of the space, is:
P(x ∈ R) = ∫ (R) p(x) dx    (1.15)
The average or expectation of a particular function Q(x) is:
E[Q] = ∫ Q(x) p(x) dx    (1.16)
For a finite set of data points drawn from p(x),
E[Q] = ∫ Q(x) p(x) dx ≅ (1/N) Σ (n=1..N) Q(xn)    (1.17)
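The finite-sample approximation of an expectation can be sketched with a case where the integral is known exactly. Here p(x) is taken to be uniform on [0, 1] and Q(x) = x², both arbitrary choices, so E[Q] = 1/3:

```python
import random

random.seed(0)

# Approximating E[Q] = integral of Q(x) p(x) dx by the finite-sample average
# (1/N) * sum_n Q(x_n), with the x_n drawn from p(x). With p uniform on
# [0, 1] and Q(x) = x**2, the exact expectation is 1/3.
N = 100_000
estimate = sum(random.random() ** 2 for _ in range(N)) / N
print(estimate)  # close to 1/3
```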
1.8.4 Baye’s theory in general
This incorporates the probability densities:
P(Ck|x) = p(x|Ck) P(Ck) / p(x)    (1.21)
p(x) = Σ (k=1..c) p(x|Ck) P(Ck)    (1.22)
Σ (k=1..c) P(Ck|x) = 1    (1.23)
When the class-conditional densities are viewed as parameterized
functions, they are referred to as likelihood functions. Bayes' theorem can
then be summarized as
posterior = (likelihood × prior) / normalization factor    (1.24)
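The density form of Bayes' theorem can be sketched numerically. Gaussian class-conditional densities are assumed here purely for illustration, and the means, standard deviations, and priors are invented:

```python
import math

# posterior = likelihood x prior / normalization, with Gaussian
# class-conditional densities p(x|Ck). All parameter values are made up.
def gauss(x, mu, sigma):
    """Value at x of the Gaussian density with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"C1": 0.6, "C2": 0.4}
params = {"C1": (0.0, 1.0), "C2": (2.0, 1.0)}  # (mean, std) per class

def posteriors(x):
    # Numerators: p(x|Ck) * P(Ck) for each class.
    joint = {k: gauss(x, *params[k]) * priors[k] for k in priors}
    p_x = sum(joint.values())  # the normalizing factor p(x)
    return {k: v / p_x for k, v in joint.items()}

post = posteriors(1.0)
print(post)  # the posteriors sum to one because of the normalizing factor
```

At x = 1.0 the two likelihoods happen to be equal, so the posteriors reduce to the priors; elsewhere the likelihoods shift the balance.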
1.9 Decision Boundaries
1.10 Minimizing risk