The gradient can be obtained by differentiating $E$ from Equation (4.2), as

$$
\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d) \, \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_{d \in D} (t_d - o_d) \, \frac{\partial}{\partial w_i} \left( t_d - \vec{w} \cdot \vec{x}_d \right) \\
\frac{\partial E}{\partial w_i} &= \sum_{d \in D} (t_d - o_d)(-x_{id}) \qquad (4.6)
\end{aligned}
$$
where $x_{id}$ denotes the single input component $x_i$ for training example $d$. We now have an equation that gives $\partial E / \partial w_i$ in terms of the linear unit inputs $x_{id}$, outputs $o_d$, and target values $t_d$ associated with the training examples. Substituting Equation (4.6) into Equation (4.5) yields the weight update rule for gradient descent

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id} \qquad (4.7)$$
To summarize, the gradient descent algorithm for training linear units is as follows: pick an initial random weight vector, apply the linear unit to all training examples, then compute $\Delta w_i$ for each weight according to Equation (4.7). Update each weight $w_i$ by adding $\Delta w_i$, then repeat this process. This algorithm is given in Table 4.1. Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate $\eta$ is used. If $\eta$ is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows.
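To make this concrete, here is a minimal NumPy sketch of this batch procedure. The function name, step count, and synthetic data are our own illustration; only the update rule itself, Equation (4.7), comes from the text.

```python
import numpy as np

def train_linear_unit(X, t, eta=0.001, n_steps=1000):
    """Batch gradient descent for a linear unit (a sketch of Table 4.1).

    X : (m, n) array of training inputs, t : (m,) array of target values.
    Returns the learned weight vector w.
    """
    m, n = X.shape
    w = np.random.uniform(-0.05, 0.05, size=n)  # small random initial weights
    for _ in range(n_steps):
        o = X @ w                      # linear unit outputs for all examples
        delta_w = eta * (t - o) @ X    # eq. (4.7): sum over all d in D
        w += delta_w                   # update only after the full sweep
    return w

# Hypothetical usage: fit a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.01, size=100)
w = train_linear_unit(X, t)
```

Note that the summed update grows with the number of training examples, which is why $\eta$ is kept small here, echoing the caution above about overstepping the minimum.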
4.3.3 STOCHASTIC APPROXIMATION TO GRADIENT DESCENT
Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever (1) the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and (2) the error can be differentiated with respect to these hypothesis parameters. The key practical difficulties in applying gradient descent are (1) converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent steps), and (2) if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.
BACKPROPAGATION(training_examples, $\eta$, $n_{in}$, $n_{out}$, $n_{hidden}$)

Each training example is a pair of the form $\langle \vec{x}, \vec{t} \rangle$, where $\vec{x}$ is the vector of network input values, and $\vec{t}$ is the vector of target network output values.
$\eta$ is the learning rate (e.g., .05). $n_{in}$ is the number of network inputs, $n_{hidden}$ the number of units in the hidden layer, and $n_{out}$ the number of output units.
The input from unit $i$ into unit $j$ is denoted $x_{ji}$, and the weight from unit $i$ to unit $j$ is denoted $w_{ji}$.

- Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ hidden units, and $n_{out}$ output units.
- Initialize all network weights to small random numbers (e.g., between $-.05$ and $.05$).
- Until the termination condition is met, Do
  - For each $\langle \vec{x}, \vec{t} \rangle$ in training_examples, Do
    Propagate the input forward through the network:
    1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit $u$ in the network.
    Propagate the errors backward through the network:
    2. For each network output unit $k$, calculate its error term $\delta_k$
       $$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$$
    3. For each hidden unit $h$, calculate its error term $\delta_h$
       $$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \, \delta_k$$
    4. Update each network weight $w_{ji}$
       $$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \quad \text{where} \quad \Delta w_{ji} = \eta \, \delta_j \, x_{ji}$$

TABLE 4.2
The stochastic gradient descent version of the BACKPROPAGATION algorithm for feedforward networks containing two layers of sigmoid units.
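A compact NumPy sketch of one stochastic update of this algorithm follows. It is our own illustration, not the book's code: the layer shapes, variable names, and the sigmoid helper are assumptions, while the error terms and weight updates follow steps 2-4 of Table 4.2 (bias terms omitted for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_out, eta=0.05):
    """One stochastic-gradient BACKPROPAGATION update (Table 4.2 sketch).

    x : (n_in,) input vector, t : (n_out,) target vector.
    W_hidden : (n_hidden, n_in), W_out : (n_out, n_hidden) weight matrices.
    """
    # Step 1: propagate the input forward through the network.
    o_h = sigmoid(W_hidden @ x)           # hidden unit outputs
    o_k = sigmoid(W_out @ o_h)            # output unit outputs
    # Step 2: error terms for the output units.
    delta_k = o_k * (1 - o_k) * (t - o_k)
    # Step 3: error terms for the hidden units.
    delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
    # Step 4: weight updates, w_ji <- w_ji + eta * delta_j * x_ji.
    W_out += eta * np.outer(delta_k, o_h)
    W_hidden += eta * np.outer(delta_h, x)
    return W_hidden, W_out
```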
The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance $x$ be described by the feature vector

$$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$$

where $a_r(x)$ denotes the value of the $r$th attribute of instance $x$. Then the distance between two instances $x_i$ and $x_j$ is defined to be $d(x_i, x_j)$, where

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$$

In nearest-neighbor learning the target function may be either discrete-valued or real-valued. Let us first consider learning discrete-valued target functions of the form $f : \mathbb{R}^n \to V$, where $V$ is the finite set $\{v_1, \ldots, v_s\}$. The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued target function is given in Table 8.1. As shown there, the value $\hat{f}(x_q)$ returned by this algorithm as its estimate of $f(x_q)$ is just the most common value of $f$ among the $k$ training examples nearest to $x_q$. If we choose $k = 1$, then the 1-NEAREST NEIGHBOR algorithm assigns to $\hat{f}(x_q)$ the value $f(x_i)$, where $x_i$ is the training instance nearest to $x_q$. For larger values of $k$, the algorithm assigns the most common value among the $k$ nearest training examples.

Figure 8.1 illustrates the operation of the k-NEAREST NEIGHBOR algorithm for the case where the instances are points in a two-dimensional space and where the target function is boolean-valued. The positive and negative training examples are shown by "+" and "−" respectively. A query point $x_q$ is shown as well. Note the 1-NEAREST NEIGHBOR algorithm classifies $x_q$ as a positive example in this figure, whereas the 5-NEAREST NEIGHBOR algorithm classifies it as a negative example.

What is the nature of the hypothesis space $H$ implicitly considered by the k-NEAREST NEIGHBOR algorithm? Note the k-NEAREST NEIGHBOR algorithm never forms an explicit general hypothesis $\hat{f}$ regarding the target function $f$. It simply computes the classification of each new query instance as needed. Nevertheless, we can still ask what the implicit general function is, or what classifications would be assigned if we were to hold the training examples constant and query the algorithm with every possible instance.
Training algorithm:
- For each training example $\langle x, f(x) \rangle$, add the example to the list training_examples.

Classification algorithm:
- Given a query instance $x_q$ to be classified,
  - Let $x_1, \ldots, x_k$ denote the $k$ instances from training_examples that are nearest to $x_q$.
  - Return
    $$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$
    where $\delta(a, b) = 1$ if $a = b$, and where $\delta(a, b) = 0$ otherwise.

TABLE 8.1
The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued function $f : \mathbb{R}^n \to V$.
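A direct NumPy rendering of Table 8.1 might look like the following. It is a minimal sketch with function and variable names of our own choosing; the distance is the Euclidean $d(x_i, x_j)$ defined above, and the majority vote implements the $\arg\max$ over $\delta(v, f(x_i))$.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=5):
    """k-NEAREST NEIGHBOR classification (a sketch of Table 8.1).

    X_train : (m, n) training instances, y_train : (m,) discrete labels,
    x_q : (n,) query instance.
    """
    # Euclidean distance from x_q to every training instance.
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    # Indices of the k nearest training examples.
    nearest = np.argsort(dists)[:k]
    # Return the most common label among the k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```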
By making the value of $\eta$ (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely. The key differences between standard gradient descent and stochastic gradient descent are:

- In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
- Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
- In cases where there are multiple local minima with respect to $E(\vec{w})$, stochastic gradient descent can sometimes avoid falling into these local minima because it uses the various $\nabla E_d(\vec{w})$ rather than $\nabla E(\vec{w})$ to guide its search.

Both stochastic and standard gradient descent methods are commonly used in practice.
The training rule in Equation (4.10) is known as the delta rule, or sometimes the LMS (least-mean-square) rule, Adaline rule, or Widrow-Hoff rule (after its inventors). In Chapter 1 we referred to it as the LMS weight-update rule when describing its use for learning an evaluation function for game playing. Notice the delta rule in Equation (4.10) is similar to the perceptron training rule in Equation (4.4.2). In fact, the two expressions appear to be identical. However, the rules are different, because in the delta rule $o$ refers to the linear unit output $o(\vec{x}) = \vec{w} \cdot \vec{x}$, whereas for the perceptron rule $o$ refers to the thresholded output $o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$.

Although we have presented the delta rule as a method for learning weights for unthresholded linear units, it can easily be used to train thresholded perceptron units as well. Suppose that $o = \vec{w} \cdot \vec{x}$ is the unthresholded linear unit output as above, and $o' = \mathrm{sgn}(\vec{w} \cdot \vec{x})$ is the result of thresholding $o$ as in the perceptron. Now if we wish to train a perceptron to fit training examples with target values $\pm 1$ for $o'$, we can use these same target values and examples to train $o$ instead, using the delta rule. Clearly, if the unthresholded output $o$ can be trained to fit these values perfectly, then the threshold output $o'$ will fit them as well (because $\mathrm{sgn}(1) = 1$ and $\mathrm{sgn}(-1) = -1$). Even when the target values cannot be fit perfectly, the thresholded $o'$ value will correctly fit the $\pm 1$ target value whenever the linear unit output $o$ has the correct sign. Notice, however, that while this procedure will learn weights that minimize the error in the linear unit output $o$, these weights will not necessarily minimize the number of training examples misclassified by the thresholded output $o'$.

We have considered two similar algorithms for iteratively learning perceptron weights. The key difference between these algorithms is that the perceptron training rule updates weights based on the error in the thresholded output, whereas the delta rule updates weights based on the error in the unthresholded linear combination of inputs.
GRADIENT-DESCENT(training_examples, $\eta$)

Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values, and $t$ is the target output value. $\eta$ is the learning rate (e.g., .05).

- Initialize each $w_i$ to some small random value.
- Until the termination condition is met, Do
  - Initialize each $\Delta w_i$ to zero.
  - For each $\langle \vec{x}, t \rangle$ in training_examples, Do
    - Input the instance $\vec{x}$ to the unit and compute the output $o$.
    - For each linear unit weight $w_i$, Do
      $$\Delta w_i \leftarrow \Delta w_i + \eta (t - o) \, x_i \qquad (T4.1)$$
  - For each linear unit weight $w_i$, Do
      $$w_i \leftarrow w_i + \Delta w_i \qquad (T4.2)$$

TABLE 4.1
GRADIENT-DESCENT algorithm for training a linear unit. To implement the stochastic approximation to gradient descent, Equation (T4.2) is deleted, and Equation (T4.1) replaced by $w_i \leftarrow w_i + \eta (t - o) \, x_i$.
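Following the note in the caption of Table 4.1, a stochastic variant of the earlier batch sketch would drop the accumulation step and update the weights immediately after each example. A minimal sketch, again with names of our own choosing:

```python
import numpy as np

def train_linear_unit_sgd(X, t, eta=0.05, n_epochs=100):
    """Stochastic gradient descent for a linear unit (sketch).

    Same interface as the batch version; the weights change
    immediately after each training example is examined.
    """
    w = np.random.uniform(-0.05, 0.05, size=X.shape[1])
    for _ in range(n_epochs):
        for x_d, t_d in zip(X, t):
            o = w @ x_d                  # output for this single example
            w += eta * (t_d - o) * x_d   # eq. (4.10): per-example update
    return w
```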
One common variation on gradient descent intended to alleviate these difficulties is called incremental gradient descent, or alternatively stochastic gradient descent. Whereas the gradient descent training rule presented in Equation (4.7) computes weight updates after summing over all the training examples in $D$, the idea behind stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally, following the calculation of the error for each individual example. The modified training rule is like the training rule given by Equation (4.7) except that as we iterate through each training example we update the weight according to

$$\Delta w_i = \eta (t - o) \, x_i \qquad (4.10)$$

where $t$, $o$, and $x_i$ are the target value, unit output, and $i$th input for the training example in question. To modify the gradient descent algorithm of Table 4.1 to implement this stochastic approximation, Equation (T4.2) is simply deleted and Equation (T4.1) replaced by $w_i \leftarrow w_i + \eta (t - o) \, x_i$. One way to view this stochastic gradient descent is to consider a distinct error function $E_d(\vec{w})$ defined for each individual training example $d$ as follows

$$E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2 \qquad (4.11)$$

where $t_d$ and $o_d$ are the target value and the unit output value for training example $d$. Stochastic gradient descent iterates over the training examples $d$ in $D$, at each iteration altering the weights according to the gradient with respect to $E_d(\vec{w})$. The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to our original error function $E(\vec{w})$.
…the distance between each datapoint and all of the cluster centres (note that we can reduce the computation using the algorithm that was described in Section …). We then compute the mean of the points in each cluster, and we iterate the algorithm until the cluster centres stop moving. Here is the algorithmic description:
The k-Means Algorithm

- Initialisation
  - choose a value for $k$
  - choose $k$ random positions in the input space
  - assign the cluster centres $\mu_j$ to those positions
- Learning
  - repeat
    - for each datapoint $x_i$:
      - compute the distance to each cluster centre
      - assign the datapoint to the nearest cluster centre with distance
        $$d_i = \min_j d(x_i, \mu_j) \qquad (14.1)$$
    - for each cluster centre:
      - move the position of the centre to the mean of the points in that cluster ($N_j$ is the number of points in cluster $j$):
        $$\mu_j = \frac{1}{N_j} \sum_{i=1}^{N_j} x_i \qquad (14.2)$$
  - until the cluster centres stop moving
- Usage
  - for each test point:
    - compute the distance to each cluster centre
    - assign the datapoint to the nearest cluster centre with distance
      $$d_i = \min_j d(x_i, \mu_j) \qquad (14.3)$$
Our implementation follows these steps almost exactly, and we can take advantage of the np.argmin() function, which returns the index of the minimum value, to find the nearest cluster. The code that computes the distances, finds the nearest cluster centre, and updates them can be written as:
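The original listing did not survive in this copy, so the following NumPy sketch reconstructs the step the text describes: compute all point-to-centre distances, pick the nearest centre with argmin, and move each centre to the mean of its points. The array names (data, centres) are our own assumptions.

```python
import numpy as np

def kmeans_update(data, centres):
    """One k-means iteration: assign points to the nearest centre, move centres.

    data : (n_points, n_dim) array, centres : (k, n_dim) array.
    Returns the updated centres and the per-point cluster assignments.
    """
    # Squared Euclidean distance from every point to every centre: (n_points, k).
    distances = ((data[:, np.newaxis, :] - centres[np.newaxis, :, :]) ** 2).sum(axis=2)
    closest = distances.argmin(axis=1)          # nearest centre per point, eq. (14.1)
    for j in range(centres.shape[0]):
        members = data[closest == j]            # points currently in cluster j
        if len(members) > 0:                    # leave empty clusters where they are
            centres[j] = members.mean(axis=0)   # eq. (14.2): move centre to the mean
    return centres, closest
```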
The Multi-Layer Perceptron Algorithm
- Initialisation
  - initialise all weights to small (positive and negative) random values
- Training
  - repeat:
    - for each input vector:
      Forwards phase:
      - compute the activation of each neuron in the hidden layer(s) using:
        $$h_\zeta = \sum_i x_i v_{i\zeta} \qquad (4.4)$$
        $$a_\zeta = g(h_\zeta) = \frac{1}{1 + \exp(-\beta h_\zeta)} \qquad (4.5)$$
      - work through the network until you get to the output layer neurons, which have activations (although see also Section 4.2.3):
        $$h_\kappa = \sum_j a_j w_{j\kappa} \qquad (4.6)$$
        $$y_\kappa = g(h_\kappa) = \frac{1}{1 + \exp(-\beta h_\kappa)} \qquad (4.7)$$
      Backwards phase:
      - compute the error at the output using:
        $$\delta_o(\kappa) = (y_\kappa - t_\kappa) \, y_\kappa (1 - y_\kappa) \qquad (4.8)$$
      - compute the error in the hidden layer(s) using:
        $$\delta_h(\zeta) = a_\zeta (1 - a_\zeta) \sum_\kappa w_{\zeta\kappa} \, \delta_o(\kappa) \qquad (4.9)$$
      - update the output layer weights using:
        $$w_{\zeta\kappa} \leftarrow w_{\zeta\kappa} - \eta \, \delta_o(\kappa) \, a_\zeta^{\mathrm{hidden}} \qquad (4.10)$$
      - update the hidden layer weights using:
        $$v_{\iota\zeta} \leftarrow v_{\iota\zeta} - \eta \, \delta_h(\zeta) \, x_\iota \qquad (4.11)$$
    - (if using sequential updating) randomise the order of the input vectors so that you don't train in exactly the same order each iteration
  - until learning stops (see Section 4.3.3)
- Recall
  - use the Forwards phase in the training section above
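As a vectorised illustration, here is a minimal NumPy sketch of one training epoch of this algorithm. It is our own code with assumed shapes and names, and it processes all input vectors in a single batch rather than sequentially as above; biases are omitted for brevity. The computations follow equations (4.4)-(4.11) directly.

```python
import numpy as np

def mlp_epoch(X, T, V, W, eta=0.25, beta=1.0):
    """One batch epoch of the multi-layer perceptron algorithm (sketch).

    X : (m, n_in) inputs, T : (m, n_out) targets,
    V : (n_in, n_hidden) input-to-hidden weights,
    W : (n_hidden, n_out) hidden-to-output weights.
    """
    g = lambda h: 1.0 / (1.0 + np.exp(-beta * h))   # sigmoid, eqs (4.5)/(4.7)
    # Forwards phase: eqs (4.4)-(4.7).
    A = g(X @ V)                              # hidden activations
    Y = g(A @ W)                              # output activations
    # Backwards phase: eqs (4.8)-(4.9).
    delta_o = (Y - T) * Y * (1 - Y)           # output error terms
    delta_h = A * (1 - A) * (delta_o @ W.T)   # hidden error terms
    # Weight updates: eqs (4.10)-(4.11).
    W -= eta * A.T @ delta_o
    V -= eta * X.T @ delta_h
    return V, W
```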