Backpropagation Error

...delineating the absolute indigeneity of amino acids in fossils. As AMS techniques are refined to handle smaller samples, it may also become possible to date individual amino acid enantiomers by the 14C method. If one enantiomer is entirely derived from the other by racemization during diagenesis, the individual D- and L-enantiomers for a given amino acid should have identical 14C ages.

Older, more poorly preserved fossils may not always prove amenable to the determination of amino acid indigeneity by the stable isotope method, as the prospects for complete replacement of indigenous amino acids with non-indigenous amino acids increase with time. As non-indigenous amino acids undergo racemization, the enantiomers may have identical isotopic compositions and still not be related to the original organisms. Such a circumstance may, however, become easier to recognize as more information becomes available concerning the distribution and stable isotopic composition of the amino acid constituents of modern representatives of fossil organisms. Also, AMS dates on individual amino acid enantiomers may, in some cases, help to clarify indigeneity problems, in particular when stratigraphic controls can be used to estimate a general age range for the fossil in question.

Finally, the development of techniques for determining the stable isotopic composition of amino acid enantiomers may enable us to establish whether non-racemic amino acids in some carbonaceous meteorites27 are indigenous, or result in part from terrestrial contamination.

M.H.E. thanks the NSF, Division of Earth Sciences (grant EAR-8352085), and the following contributors to his Presidential Young Investigator Award for partial support of this research: Arco, Exxon, Phillips Petroleum, Texaco Inc. and The Upjohn Co. We also acknowledge the donors of the Petroleum Research Fund, administered by the American Chemical Society (grant 16144-AC2 to M.H.E., grant 14805-AC2 to S.A.M.), for support. S.A.M. acknowledges NSERC (grant 42644) for partial support.

Received 19 May; accepted 15 July 1986.

1. Bada, J. L. & Protsch, R. Proc. natn. Acad. Sci. U.S.A. 70, 1331-1334 (1973).
2. Bada, J. L., Schroeder, R. A. & Carter, G. F. Science 184, 791-793 (1974).
3. Boulton, G. S. et al. Nature 298, 437-441 (1982).
4. Wehmiller, J. F. in Quaternary Dating Methods (ed. Mahaney, W. C.) 171-193 (Elsevier, Amsterdam, 1984).
5. Engel, M. H., Zumberge, J. E. & Nagy, B. Analyt. Biochem. 82, 415-422 (1977).
6. Bada, J. L. A. Rev. Earth planet. Sci. 13, 241-268 (1985).
7. Chisholm, B. S., Nelson, D. E. & Schwarcz, H. P. Science 216, 1131-1132 (1982).
8. Ambrose, S. H. & DeNiro, M. J. Nature 319, 321-324 (1986).
9. Macko, S. A., Estep, M. L. F., Hare, P. E. & Hoering, T. C. Yb. Carnegie Instn Wash. 82, 404-410 (1983).
10. Hare, P. E. & Estep, M. L. F. Yb. Carnegie Instn Wash. 82, 410-414 (1983).
11. Engel, M. H. & Hare, P. E. in Chemistry and Biochemistry of the Amino Acids (ed. Barrett, G. C.) 462-479 (Chapman and Hall, London, 1985).
12. Johnstone, R. A. W. & Rose, M. E. in Chemistry and Biochemistry of the Amino Acids (ed. Barrett, G. C.) 480-524 (Chapman and Hall, London, 1985).
13. Weinstein, S., Engel, M. H. & Hare, P. E. in Practical Protein Chemistry - A Handbook (ed. Darbre, A.) 337-344 (Wiley, New York, 1986).
14. Bada, J. L., Gillespie, R., Gowlett, J. A. J. & Hedges, R. E. M. Nature 312, 442-444 (1984).
15. Mitterer, R. M. & Kriausakul, N. Org. Geochem. 7, 91-98 (1984).
16. Williams, K. M. & Smith, G. G. Origins Life 8, 91-144 (1977).
17. Engel, M. H. & Hare, P. E. Yb. Carnegie Instn Wash. 81, 425-430 (1982).
18. Hare, P. E. Yb. Carnegie Instn Wash. 73, 576-581 (1974).
19. Eillinger, C. T. Nature 296, 862 (1982).
20. Neuberger, A. Adv. Protein Chem. 4, 298-383 (1948).
21. Engel, M. H. & Macko, S. A. Analyt. Chem. 56, 2598-2600 (1984).
22. Dungworth, G. Chem. Geol. 17, 135-153 (1976).
23. Weinstein, S., Engel, M. H. & Hare, P. E. Analyt. Biochem. 121, 370-377 (1982).
24. Macko, S. A., Lee, W. Y. & Parker, P. L. J. exp. mar. Biol. Ecol. 63, 145-149 (1982).
25. Macko, S. A., Estep, M. L. F. & Hoering, T. C. Yb. Carnegie Instn Wash. 81, 413-417 (1982).
26. Vallentyne, J. R. Geochim. cosmochim. Acta 28, 157-188 (1964).
27. Engel, M. H. & Nagy, B. Nature 296, 837-840 (1982).
Learning representations by back-propagating errors

David E. Rumelhart*, Geoffrey E. Hinton† & Ronald J. Williams*

* Institute for Cognitive Science, C-015, University of California, San Diego, La Jolla, California 92093, USA
† Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213, USA

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure1.

There have been many attempts to design self-organizing neural networks. The aim is to find a powerful synaptic modification rule that will allow an arbitrarily connected neural network to develop an internal structure that is appropriate for a particular task domain. The task is specified by giving the desired state vector of the output units for each state vector of the input units. If the input units are directly connected to the output units it is relatively easy to find learning rules that iteratively adjust the relative strengths of the connections so as to progressively reduce the difference between the actual and desired output vectors2. Learning becomes more interesting but more difficult when we introduce hidden units whose actual or desired states are not specified by the task. (In perceptrons, there are 'feature analysers' between the input and output that are not true hidden units because their input connections are fixed by hand, so their states are completely determined by the input vector: they do not learn representations.) The learning procedure must decide under what circumstances the hidden units should be active in order to help achieve the desired input-output behaviour. This amounts to deciding what these units should represent. We demonstrate that a general purpose and relatively simple procedure is powerful enough to construct appropriate internal representations.

The simplest form of the learning procedure is for layered networks which have a layer of input units at the bottom; any number of intermediate layers; and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. Then the states of the units in each layer are determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.

The total input, x_j, to unit j is a linear function of the outputs, y_i, of the units that are connected to j and of the weights, w_ji, on these connections

    x_j = \sum_i y_i w_{ji}    (1)

Units can be given biases by introducing an extra input to each unit which always has a value of 1. The weight on this extra input is called the bias and is equivalent to a threshold of the opposite sign. It can be treated just like the other weights.

A unit has a real-valued output, y_j, which is a non-linear function of its total input
    y_j = \frac{1}{1 + e^{-x_j}}    (2)

Fig. 1  A network that has learned to detect mirror symmetry in the input vector. The numbers on the arcs are weights and the numbers inside the nodes are biases. The learning required 1,425 sweeps through the set of 64 possible input vectors, with the weights being adjusted on the basis of the accumulated gradient after each sweep. The values of the parameters in equation (9) were ε = 0.1 and α = 0.9. The initial weights were random and were uniformly distributed between -0.3 and 0.3. The key property of this solution is that for a given hidden unit, weights that are symmetric about the middle of the input vector are equal in magnitude and opposite in sign. So if a symmetrical pattern is presented, both hidden units will receive a net input of 0 from the input units, and, because the hidden units have a negative bias, both will be off. In this case the output unit, having a positive bias, will be on. Note that the weights on each side of the midpoint are in the ratio 1:2:4. This ensures that each of the eight patterns that can occur above the midpoint sends a unique activation sum to each hidden unit, so the only pattern below the midpoint that can exactly balance this sum is the symmetrical one. For all non-symmetrical patterns, both hidden units will receive non-zero activations from the input units. The two hidden units have identical patterns of weights but with opposite signs, so for every non-symmetric pattern one hidden unit will come on and suppress the output unit.

Fig. 2  Two isomorphic family trees. The information can be expressed as a set of triples of the form (person 1)(relationship)(person 2), where the possible relationships are {father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, niece}. A layered net can be said to 'know' these triples if it can produce the third term of each triple when given the first two. The first two terms are encoded by activating two of the input units, and the network must then complete the proposition by activating the output unit that represents the third term.

Fig. 3  Activity levels in a five-layer network after it has learned. The bottom layer has 24 input units on the left for representing (person 1) and 12 input units on the right for representing the relationship. The white squares inside these two groups show the activity levels of the units. There is one active unit in the first group representing Colin and one in the second group representing the relationship 'has-aunt'. Each of the two input groups is totally connected to its own group of 6 units in the second layer. These groups learn to encode people and relationships as distributed patterns of activity. The second layer is totally connected to the central layer of 12 units, and these are connected to the penultimate layer of 6 units. The activity in the penultimate layer must activate the correct output units, each of which stands for a particular (person 2). In this case, there are two correct answers (marked by black dots) because Colin has two aunts. Both the input units and the output units are laid out spatially with the English people in one row and the isomorphic Italians immediately below.

It is not necessary to use exactly the functions given in equations (1) and (2). Any input-output function which has a bounded derivative will do. However, the use of a linear function for combining the inputs to a unit before applying the nonlinearity greatly simplifies the learning procedure.
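For concreteness, the forward pass of equations (1) and (2) can be sketched in a few lines of Python/NumPy. This is an illustrative reconstruction rather than the authors' program: the function names are ours, the bias is folded in as an extra weight column as described above, and connections that skip layers are not supported.

    import numpy as np

    def logistic(x):
        # Equation (2): y = 1 / (1 + e^(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def forward(inputs, weights):
        # Forward pass through a layered net (equations (1) and (2)).
        # `weights` is a list of matrices; weights[k][j, i] plays the role of w_ji.
        # A constant 1 is appended to every layer's output so that the last
        # column of each matrix acts as the bias described in the text.
        states = [np.asarray(inputs, dtype=float)]
        for w in weights:
            y = np.append(states[-1], 1.0)   # extra input clamped at 1 (the bias input)
            x = w @ y                        # equation (1): x_j = sum_i y_i w_ji
            states.append(logistic(x))       # equation (2)
        return states                        # states[0] is the input vector, states[-1] the output

For a 6-2-1 net such as the symmetry detector of Fig. 1, `weights` would hold a 2 x 7 and a 1 x 3 matrix, the last column of each acting as the biases.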
The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector. If there is a fixed, finite set of input-output cases, the total error in the performance of the network with a particular set of weights can be computed by comparing the actual and desired output vectors for every case. The total error, E, is defined as

    E = \frac{1}{2} \sum_c \sum_j (y_{j,c} - d_{j,c})^2    (3)

where c is an index over cases (input-output pairs), j is an index over output units, y is the actual state of an output unit and d is its desired state. To minimize E by gradient descent it is necessary to compute the partial derivative of E with respect to each weight in the network. This is simply the sum of the partial derivatives for each of the input-output cases. For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. We have already described the forward pass in which the units in each layer have their states determined by the input they receive from units in lower layers using equations (1) and (2). The backward pass which propagates derivatives from the top layer back to the bottom one is more complicated.

The backward pass starts by computing ∂E/∂y for each of the output units. Differentiating equation (3) for a particular case, c, and suppressing the index c gives

    \partial E/\partial y_j = y_j - d_j    (4)

We can then apply the chain rule to compute ∂E/∂x_j

    \partial E/\partial x_j = \partial E/\partial y_j \cdot \mathrm{d}y_j/\mathrm{d}x_j

Differentiating equation (2) to get the value of dy_j/dx_j and substituting gives

    \partial E/\partial x_j = \partial E/\partial y_j \cdot y_j (1 - y_j)    (5)

This means that we know how a change in the total input x to an output unit will affect the error. But this total input is just a linear function of the states of the lower level units and it is also a linear function of the weights on the connections, so it is easy to compute how the error will be affected by changing these states and weights. For a weight w_ji, from i to j the derivative is

    \partial E/\partial w_{ji} = \partial E/\partial x_j \cdot \partial x_j/\partial w_{ji}
                               = \partial E/\partial x_j \cdot y_i    (6)

and for the output of the ith unit the contribution to ∂E/∂y_i resulting from the effect of i on j is simply

    \partial E/\partial x_j \cdot \partial x_j/\partial y_i = \partial E/\partial x_j \cdot w_{ji}

so taking into account all the connections emanating from unit i we have

    \partial E/\partial y_i = \sum_j \partial E/\partial x_j \cdot w_{ji}    (7)

We have now seen how to compute ∂E/∂y for any unit in the penultimate layer when given ∂E/∂y for all units in the last layer. We can therefore repeat this procedure to compute this term for successively earlier layers, computing ∂E/∂w for the weights as we go.
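The backward pass of equations (4)-(7) can be sketched in the same style, continuing the hypothetical forward() routine above; the same simplifying assumptions apply (one weight matrix per adjacent pair of layers, biases as an extra column, no skipped layers).

    def backward(states, weights, desired):
        # Backward pass for a single input-output case (equations (4)-(7)).
        # `states` is the list returned by forward(); `desired` is the target vector d.
        # Returns one gradient matrix dE/dw per weight matrix (equation (6)).
        grads = [np.zeros_like(w) for w in weights]
        dE_dy = states[-1] - np.asarray(desired, dtype=float)   # equation (4)
        for k in range(len(weights) - 1, -1, -1):
            y = states[k + 1]
            dE_dx = dE_dy * y * (1.0 - y)                       # equation (5)
            y_below = np.append(states[k], 1.0)                 # include the bias input of 1
            grads[k] = np.outer(dE_dx, y_below)                 # equation (6): dE/dw_ji = dE/dx_j * y_i
            dE_dy = weights[k][:, :-1].T @ dE_dx                # equation (7), bias column excluded
        return grads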
Fig. 4  The weights from the 24 input units that represent people to the 6 units in the second layer that learn distributed representations of people. White rectangles, excitatory weights; black rectangles, inhibitory weights; area of the rectangle encodes the magnitude of the weight. The weights from the 12 English people are in the top row of each unit. Unit 1 is primarily concerned with the distinction between English and Italian and most of the other units ignore this distinction. This means that the representation of an English person is very similar to the representation of their Italian equivalent. The network is making use of the isomorphism between the two family trees to allow it to share structure and it will therefore tend to generalize sensibly from one tree to the other. Unit 2 encodes which generation a person belongs to, and unit 6 encodes which branch of the family they come from. The features captured by the hidden units are not at all explicit in the input and output encodings, since these use a separate unit for each person. Because the hidden features capture the underlying structure of the task domain, the network generalizes correctly to the four triples on which it was not trained. We trained the network for 1500 sweeps, using ε = 0.005 and α = 0.5 for the first 20 sweeps and ε = 0.01 and α = 0.9 for the remaining sweeps. To make it easier to interpret the weights we introduced 'weight-decay' by decrementing every weight by 0.2% after each weight change. After prolonged learning, the decay was balanced by ∂E/∂w, so the final magnitude of each weight indicates its usefulness in reducing the error. To prevent the network needing large weights to drive the outputs to 1 or 0, the error was considered to be zero if output units that should be on had activities above 0.8 and output units that should be off had activities below 0.2.

One way of using ∂E/∂w is to change the weights after every input-output case. This has the advantage that no separate memory is required for the derivatives. An alternative scheme, which we used in the research reported here, is to accumulate ∂E/∂w over all the input-output cases before changing the weights. The simplest version of gradient descent is to change each weight by an amount proportional to the accumulated ∂E/∂w

    \Delta w = -\varepsilon \, \partial E/\partial w    (8)

This method does not converge as rapidly as methods which make use of the second derivatives, but it is much simpler and can easily be implemented by local computations in parallel hardware. It can be significantly improved, without sacrificing the simplicity and locality, by using an acceleration method in which the current gradient is used to modify the velocity of the point in weight space instead of its position

    \Delta w(t) = -\varepsilon \, \partial E/\partial w(t) + \alpha \, \Delta w(t-1)    (9)

where t is incremented by 1 for each sweep through the whole set of input-output cases, and α is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

To break symmetry we start with small random weights. Variants on the learning procedure have been discovered independently by David Parker (personal communication) and by Yann Le Cun3.
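Accumulating ∂E/∂w over all the cases and applying the momentum rule of equation (9) could then look as follows. This again builds on the illustrative forward() and backward() sketches above, with eps and alpha standing for ε and α and the number of sweeps chosen arbitrarily; it is a sketch, not the authors' implementation.

    def train(weights, cases, eps=0.1, alpha=0.9, sweeps=1000):
        # Accumulate dE/dw over all cases, then apply the momentum update of
        # equation (9); with alpha = 0 this reduces to the plain rule of equation (8).
        velocity = [np.zeros_like(w) for w in weights]
        for _ in range(sweeps):
            accum = [np.zeros_like(w) for w in weights]
            for inputs, desired in cases:                 # accumulate over every input-output case
                states = forward(inputs, weights)
                for a, g in zip(accum, backward(states, weights, desired)):
                    a += g
            for k in range(len(weights)):
                velocity[k] = -eps * accum[k] + alpha * velocity[k]   # equation (9)
                weights[k] = weights[k] + velocity[k]
        return weights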
One simple task that cannot be done by just connecting the input units to the output units is the detection of symmetry. To detect whether the binary activity levels of a one-dimensional array of input units are symmetrical about the centre point, it is essential to use an intermediate layer because the activity in an individual input unit, considered alone, provides no evidence about the symmetry or non-symmetry of the whole input vector, so simply adding up the evidence from the individual input units is insufficient. (A more formal proof that intermediate units are required is given in ref. 2.) The learning procedure discovered an elegant solution using just two intermediate units, as shown in Fig. 1.
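Putting the three sketches together, the mirror-symmetry experiment of Fig. 1 can be approximated as below. The architecture (6 inputs, 2 hidden units, 1 output), ε = 0.1, α = 0.9 and the ±0.3 initial weight range follow the figure caption; the random seed and the 2,000-sweep budget are our arbitrary choices, and a particular run may need more sweeps (the caption reports 1,425) or a different seed to converge.

    from itertools import product

    rng = np.random.default_rng(0)

    # The 64 possible binary input vectors of length 6, labelled 1 if symmetric
    # about the centre point and 0 otherwise.
    cases = [(np.array(bits, dtype=float),
              np.array([1.0 if bits == bits[::-1] else 0.0]))
             for bits in product([0, 1], repeat=6)]

    # 6 input units -> 2 hidden units -> 1 output unit; the extra column in each
    # matrix holds the bias. Initial weights uniform in [-0.3, 0.3] as in Fig. 1.
    weights = [rng.uniform(-0.3, 0.3, size=(2, 7)),
               rng.uniform(-0.3, 0.3, size=(1, 3))]

    weights = train(weights, cases, eps=0.1, alpha=0.9, sweeps=2000)

    for inputs, desired in cases[:4]:
        print(inputs, np.round(forward(inputs, weights)[-1], 2), desired)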
Another interesting task is to store the information in the two family trees (Fig. 2). Figure 3 shows the network we used, and Fig. 4 shows the 'receptive fields' of some of the hidden units after the network was trained on 100 of the 104 possible triples.

So far, we have only dealt with layered, feed-forward networks. The equivalence between layered networks and recurrent networks that are run iteratively is shown in Fig. 5.

Fig. 5  A synchronous iterative net that is run for three iterations and the equivalent layered net. Each time-step in the recurrent net corresponds to a layer in the layered net. The learning procedure for layered nets can be mapped into a learning procedure for iterative nets. Two complications arise in performing this mapping: first, in a layered net the output levels of the units in the intermediate layers during the forward pass are required for performing the backward pass (see equations (5) and (6)). So in an iterative net it is necessary to store the history of output states of each unit. Second, for a layered net to be equivalent to an iterative net, corresponding weights between different layers must have the same value. To preserve this property, we average ∂E/∂w for all the weights in each set of corresponding weights and then change each weight in the set by an amount proportional to this average gradient. With these two provisos, the learning procedure can be applied directly to iterative nets. These nets can then either learn to perform iterative searches or learn sequential structures4.

The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.
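The weight-averaging step used when an iterative net is unrolled into a layered net (Fig. 5) can also be sketched under the same assumptions as the earlier code. In this toy version the same matrix is reused for every time-step, the external input is presented only at the first step, the target is checked only after the final iteration, and ∂E/∂w is averaged over the tied copies so that corresponding weights keep the same value; w_shared is assumed to have shape (n, n+1), so each iteration maps an n-vector to an n-vector.

    def train_unrolled(w_shared, cases, steps=3, eps=0.1, alpha=0.9, sweeps=1000):
        # Train a synchronous iterative net by unrolling it for `steps` iterations.
        # Every layer of the unrolled net reuses the same matrix `w_shared`, and
        # dE/dw is averaged over the copies so corresponding weights stay equal.
        velocity = np.zeros_like(w_shared)
        for _ in range(sweeps):
            accum = np.zeros_like(w_shared)
            for inputs, desired in cases:
                layers = [w_shared] * steps              # one tied copy per time-step
                states = forward(inputs, layers)         # stored history of output states
                grads = backward(states, layers, desired)
                accum += sum(grads) / len(grads)         # average over corresponding weights
            velocity = -eps * accum + alpha * velocity   # momentum update of equation (9)
            w_shared = w_shared + velocity
        return w_shared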
1. Rosenblatt, F. Principles of Neurodynamics (Spartan, Washington, DC, 1961).
2. Minsky, M. L. & Papert, S. Perceptrons (MIT, Cambridge, 1969).
3. Le Cun, Y. Proc. Cognitiva 85, 599-604 (1985).
4. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations (eds Rumelhart, D. E. & McClelland, J. L.) 318-362 (MIT, Cambridge, 1986).
Bilateral amblyopia after a short period of reverse occlusion in kittens

Kathryn M. Murphy* & Donald E. Mitchell

Department of Psychology, Dalhousie University, Halifax, Nova Scotia, Canada B3H 4J1

The majority of neurones in the visual cortex of both adult cats and kittens can be excited by visual stimulation of either eye.