The distributed cohort model
As we indicated earlier, not all researchers believe that localist models
like the TRACE model are a truthful simulation of how the brain works.
According to them, there is little evidence that there would be a
group of neurons specifically devoted to the phoneme “s” or to the
spoken word “sun.” Rather, the meanings of stimuli are represented by
activity patterns across a complete layer of nodes.
Figure 8.7 shows such a model for speech perception, called the
distributed cohort model (Gaskell & Marslen-Wilson, 1997; see also
Magnuson et al., 2020).
The input layer consisted of 11 nodes representing the speech
signal at a particular moment in time (i.e., the activity pattern when a
particular sound of the word was pronounced). All nodes of the input
layer were connected to a hidden layer of 200 units. At each processing
cycle, the hidden layer was copied to a context layer, which fed back to
the hidden layer. In this way, the hidden layer not only received input
from the current processing cycle but also from the previous processing
cycle. This top-down part of the model allowed it to capture systematic
transitions in time. The hidden layer fed activation forward to a layer
of 50 semantic nodes and three layers of output phonology representing
the sounds (phonemes) of the word. The three layers of output
phonology represented the three parts of monosyllabic words: the
consonants before the vowel (called the onset of the word; can be
empty), the vowel (called the nucleus of the word), and the consonants
after the vowel (the coda; can also be empty). By using this coding
scheme all monosyllabic words could be represented.
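To make the recurrence concrete, here is a minimal sketch of the context-layer mechanism in Python. The input and hidden layer sizes come from the text; the weight values, activation function and inputs are placeholder assumptions, not the published model.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HID = 11, 200                                # sizes from the text
W_in  = rng.normal(scale=0.1, size=(N_HID, N_IN))    # input -> hidden
W_ctx = rng.normal(scale=0.1, size=(N_HID, N_HID))   # context -> hidden

def step(x, context):
    # One processing cycle: the hidden layer receives the current input
    # AND the copy of the previous hidden state held in the context layer.
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    return hidden, hidden.copy()     # the new context is a copy of hidden

context = np.zeros(N_HID)
for x in rng.random((5, N_IN)):      # five successive phonetic-feature bundles
    hidden, context = step(x, context)
```

Because the context layer simply stores a copy of the previous hidden state, the network's response at each moment depends on the whole sequence heard so far, which is what lets it capture systematic transitions in time.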
The input to the model consisted of monosyllabic English words.
Importantly, the words were attached to each other, with no pauses in
between, to simulate the fact that words in a stream of spoken language
are not separated. The model was trained by presenting the continuous
sequence of phonetic feature bundles representing incoming speech to
the input layer. The output of the network was compared to a second sequence of the same length, which
represented the semantics and
phonology of the words. The second sequence was used for comparison
with the network output, allowing connection weights to be updated, so
that the output of the network gradually approached that of the comparison
sequence. During training, the words were presented a few
hundred times in various orders.
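As a rough sketch of this training regime, the following uses PyTorch as a stand-in for the original simulator. The input, hidden and semantic layer sizes come from the text; the phonological slot width, the random stand-in data, the loss and the optimiser are all assumptions.

```python
import torch
import torch.nn as nn

class DistributedCohortNet(nn.Module):
    # Elman-style network: 11 phonetic-feature inputs, 200 hidden units,
    # 50 semantic outputs (sizes from the text) and three phonological
    # output slots; the slot width n_phon is an assumption.
    def __init__(self, n_phon=20):
        super().__init__()
        self.rnn = nn.RNN(11, 200, batch_first=True)
        self.semantic = nn.Linear(200, 50)
        self.phonology = nn.Linear(200, 3 * n_phon)  # onset | nucleus | coda

    def forward(self, x):
        h, _ = self.rnn(x)            # hidden state carries the context forward
        return self.semantic(h), self.phonology(h)

net = DistributedCohortNet()
opt = torch.optim.SGD(net.parameters(), lr=0.05)

# Stand-in data: a continuous stream of 100 feature bundles with no word
# boundaries, plus the target semantic/phonological sequences to match.
speech = torch.rand(1, 100, 11)
sem_target, phon_target = torch.rand(1, 100, 50), torch.rand(1, 100, 60)

for _ in range(200):                  # repeated presentations of the corpus
    sem_out, phon_out = net(speech)
    loss = nn.functional.mse_loss(sem_out, sem_target) \
         + nn.functional.mse_loss(phon_out, phon_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```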
After training, the model performed very similarly to human
observers. In particular, the model showed activation of cohorts (both
at the phonological and semantic level) when the onset and the nucleus
of the word were presented, and ended up with a unique representation
after the coda was presented, even though there were no gaps in
the input. The model was also able to decide whether a real word
(“sun”) or a monosyllabic non-word (“suph”) had been presented.
Notice that the model was capable of doing so without
explicit top-down effects from the semantic level. The only top-down
information came from the context layer that kept a record of the
preceding processing step.
10.8 LANGUAGE PRODUCTION
Speech production – deciding what we want to say, and articulating
this accurately and fluently – is a behaviour we take very much for
granted and typically perform extremely well: it has been estimated
that any one talker uses a production vocabulary of around 20,000
words, yet makes mistakes of word selection only around once in every
million words produced.
The processes involved in turning thoughts into spoken words are
called lexicalisation, and two main stages have been hypothesised
(Levelt, 1989, 1992). The first stage comprises a link between conceptual
thoughts and word forms which include semantic and syntactic
information, but not phonological detail. This is called the ‘lemma’,
and the process of identifying and choosing the correct word is called
lemma selection. In the second stage, the lemma makes contact with
the phonetic representation of the word, called the lexeme, and the
specifying of this form is called lexeme selection. Much of the evidence
for this two-stage form of word selection in speech production comes
from a frustrating state that many people have experienced, called the
tip-of-the-tongue state. When in this condition, people are absolutely
certain that they know the word they want to say, yet have no sense of
how to say it. In this state, people can
often access a lot of information about a word, such as what it means
and aspects of its syntax, and this has been ascribed to being able to
access lemma information, without being able to make contact with
the lexeme detail (Harley, 2001).
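A toy illustration of why the two stages can dissociate: if lemma information (meaning and syntax) is stored separately from the phonological lexeme, the first lookup can succeed while the second fails. The dictionaries and the example word below are hypothetical, not part of Levelt's model.

```python
# Hypothetical two-stage lookup: lemmas (meaning + syntax) are stored
# separately from lexemes (phonological forms).
lemmas = {"sextant": {"meaning": "instrument for navigation by the stars",
                      "syntax": "noun"}}
lexemes = {}   # the phonological entry is missing or inaccessible

word = "sextant"
lemma = lemmas[word]                  # lemma selection succeeds
print("meaning known:", lemma["meaning"], "/", lemma["syntax"])

form = lexemes.get(word)              # lexeme selection fails
if form is None:
    print("sound form unavailable -> tip-of-the-tongue state")
```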
There is also experimental evidence for these stages from studies
of priming: for example, participants name pictures more quickly if
they have previously named or defined the word, but not if they have
produced a homophone, which sounds the same but has a different
meaning. This suggests that priming at the lemma level (semantic and
syntax) can operate separately from lexeme (phonological) priming.
Historically, another influence on our understanding of speech production
processes comes from studies of speech errors, or ‘slips of the
tongue’. Fromkin (1973) said these errors ‘provide a window into linguistic
processes’ (pp. 43–44), although it has also been pointed out
that these errors rely on accurate acoustic and phonetic decoding by
the listener, which comprises a complex set of psychological processes
(Boucher, 1994).
There are consistencies in the kinds of errors speakers make: the
errors occur at the level of phonemes, morphemes or words, rather than
forming random noisy patterns. This gives weight to the suggestion
that they result from specific kinds of failure in the speech production
system (Fromkin, 1971, 1973; Garrett, 1975; Dell, 1986; Harley,
2001).
Garrett developed a model of speech production based on a set of
speech errors which he considered to be particularly informative:
Word substitutions – these affect content words, not (typically)
form words, such as ‘man’ for ‘woman’ or ‘day’ for ‘night’.
Word exchanges – these occur when words from the same category swap
positions with each other, such that nouns swap with nouns, verbs with
verbs, etc.
Sound exchanges – classic spoonerisms such as ‘wastey term’ for
‘tasty worm’, where the onsets of words swap positions with each
other, commonly across words which are next to each other.
Morpheme exchanges – these occur when word endings (morphological
inflections) move to other points in the sentence, such as ‘Have you
seen Hector Easter’s egg?’ for ‘Have you seen Hector’s Easter egg?’.
Morpheme exchange errors can also include ‘stranding’ errors such
as ‘Have you seen Easter’s Hector egg?’.
In Garrett’s model there are several independent levels involved in
speech production:
1. the message level, which represents the concepts and thoughts that
the speaker wants to express;
2. the functional level, at which these concepts are expressed as semantic
lexical representations, and thematic aspects of the sentence (the
subject and object, for example) are also represented – i.e. the roles
that these semantic items will take in the sentence;
3. the positional level, at which the semantic-lexical representations
are implemented as phonological items, with a syntactic structure;
4. the phonetic level, at which the phonological and syntactic representations
are realised as detailed phonetic sequences, precisely
articulating the word forms and inflections specified by the positional
level;
5. the articulation level, at which the vocal apparatus is controlled
to express the utterance as speech.
In Garrett’s model, the semantic information about content words
is specified at the functional level, while function words and bound
morphemes (such as ‘-ing’ endings) are added to the sentence structure
at the positional level, where they are associated with their
phonetic forms; in contrast, the phonological forms of the content
words need to be generated within the sentence at the positional
level. This kind of constraint in the model accounts for word substitution
errors, which are generated at the functional level (or the lexical
level in Levelt’s model), and which infrequently affect form words.
Likewise, sound exchange errors arise when content words are being phonologically constructed at the
positional level, and again, affect
form words much less (as they are phonologically prespecified at
this level). Stranding errors occur when the content words are being
positioned in the sentence, which occurs before syntactic structure
and inflections are added to the sentence.
This is a serial model of speech production: speech production
is a result of a series of independent output stages in which there
are distinct computational processes specified in a serial, non-interacting
fashion; this is also true of the Levelt model. There are
other approaches to modelling speech production which proceed
along more parallel lines, and which are typically modelled within
the connectionist, interactive framework – for example, the speech
production model of Gary Dell (Dell, 1986; Dell and O’Seaghdha,
1991; Dell et al., 1997). In this model, a spoken sentence is represented
as a sentence frame, and is planned simultaneously across
semantic, syntactic, morphological and phonological levels, with
spreading activation permitting different levels to affect each other.
This allows speech errors to be ‘mixed’: as Dell has pointed out,
many speech errors (such as ‘The wind is strowing strongly’) represent
several different kinds of errors.
Functionally, the Dell model works via different points in the sentence
frame activating items in a lexicon – for example, when a verb
is specified, there will be activation across interconnected nodes for
concepts, words, morphemes and phonemes. When a node is activated,
there is a spread of activation across all the nodes connected to it.
Thus if the node for the verb ‘run’ is activated, there will also be activation
for the verb ‘walk’. Selection is based on the node with the
highest activation, and after a node has been selected its activation
is reset to zero (to prevent the same word from being continuously produced).
In this way, word substitution errors occur when the wrong
word becomes more highly activated than the correct target word. The
model contains categorical rules which act as constraints on the types
of items which are activated at each level in the model, and these rules
place limits on the kinds of errors that can be made – nouns swapping
places with nouns, for example. In contrast, exchange errors occur
as a result of fluctuations in activation: a lexical element (a
phoneme, or a word) can appear earlier in a sentence than was
intended if its activation unexpectedly increases; as the activation
is immediately set to zero once an item has been selected, another
highly activated item is likely to take its place in the intended part of
the sentence frame.
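The selection-by-highest-activation and reset-to-zero mechanics can be sketched in a few lines. The words, connection strengths and noise level below are invented for illustration; they are not Dell's parameters.

```python
import numpy as np

rng = np.random.default_rng()
words = ["run", "walk", "jog", "table"]

# Assumed connection strengths: "run" is linked to its semantic
# neighbours "walk" and "jog", and not to the unrelated "table".
links = np.array([[0.0, 0.6, 0.5, 0.0],
                  [0.6, 0.0, 0.4, 0.0],
                  [0.5, 0.4, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])

activation = np.array([1.0, 0.0, 0.0, 0.0])      # the frame activates "run"

for _ in range(3):                               # a few spreading cycles
    activation = activation + links @ activation
    activation += rng.normal(scale=0.3, size=4)  # noise in the system

chosen = int(np.argmax(activation))              # select the most active node
print("produced:", words[chosen])                # usually "run"; occasionally a
                                                 # neighbour wins -> substitution
activation[chosen] = 0.0                         # reset, so it is not re-selected
```

Because noise occasionally pushes a connected neighbour above the target, the same mechanism that normally produces the intended word also generates plausible substitution errors.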
As in other areas of cognitive psychology, there has been a
lively debate about the extent to which speech production is well
modelled by interactive connectionist models, or by more rulebased,
serial, symbolic models. The two-stage model of Levelt was
developed into a more complex six-stage model of spoken word
production (Levelt, 1989; Bock and Levelt, 1994; Levelt et al.,
1999), called WEAVER++ (Word-form Encoding by Activation and
VERification). The stages are:
1. conceptual preparation;
2. lexical selection (the stage at which the abstract lemma is selected);
3. morphological encoding;
4. phonological encoding;
5. phonetic encoding;
6. articulation.
Like Dell’s model, WEAVER++ is a spreading activation model, but
unlike Dell’s model, activation is fed forward in one direction only,
from concepts to articulation: furthermore, the WEAVER++ model
is truly serial, as each stage is completed before the next stage is
started.
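A schematic way to see that strict seriality: each stage runs to completion and passes its result forward, with nothing flowing back upstream. The stage names follow the list above, but the stage bodies and data are placeholders, not the actual WEAVER++ computations.

```python
# Placeholder stages mirroring the six-stage pipeline listed above.
# Each completes fully before the next begins, and activation never
# flows backwards -- the defining property of a truly serial model.
def conceptual_preparation(message):  return {"concept": message}
def lexical_selection(s):             return {**s, "lemma": "run (verb)"}
def morphological_encoding(s):        return {**s, "morphemes": ["run", "-s"]}
def phonological_encoding(s):         return {**s, "phonemes": "/rʌnz/"}
def phonetic_encoding(s):             return {**s, "plan": "articulatory gestures"}
def articulation(s):                  return s["plan"]

stages = [conceptual_preparation, lexical_selection, morphological_encoding,
          phonological_encoding, phonetic_encoding, articulation]

state = "MOTION-EVENT(3rd person)"
for stage in stages:                  # one direction only, stage by stage
    state = stage(state)
```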
In an experimental attempt to generate speech errors, Levelt
et al. (1991) required participants to name pictures while also listening
to words, and to press a button when they recognised a
word. The relationships between the seen objects and heard words
varied – there were semantic relationships, phonological relationships
and unrelated pairs, and some had a ‘mediated’ relationship
to the picture, that is, linked through a semantic and phonological
connection. If the picture was a dog, a mediated relationship word
could be ‘cot’, which is phonologically similar to ‘cat’, which in turn
has a semantic relationship with ‘dog’. The study was specifically
designed to test the hypothesis, inferred from Dell’s model, that a
model of speech production in which different levels can interact
would predict a facilitation of naming ‘dog’ when ‘cot’ is heard
(Levelt et al., 1991). Experimentally, this predicted facilitation was
not found: there was no phonological activation of semantically
related items. In contrast, the results supported a sequential model.
Specifically, there was priming of lexical decisions to the heard
word from semantically related words only at very short intervals
(around 70 ms), while priming of lexical decisions from phonologically
related words was only significant at longer intervals (around
600 ms). These results were taken to support a sequential, stage-based
implementation of word naming.
There is evidence in favour of the Dell model, however: Morsella
and Miozzo (2002) asked participants to name pictures in the presence
of other (distractor) pictures; there was facilitation of picture
naming when there was a phonological relationship between the target
and distractor pictures. This was taken to show a beneficial effect
of phonological information at an earlier stage in word selection and
production than would be predicted by a feed-forward, sequential
model like Levelt’s.
Speech production has been somewhat less closely studied than
other aspects of language in cognitive psychology (especially when
compared with the detailed investigations of speech production
seen in the aphasia literature, as will be seen in Chapter 11); however,
that profile is changing rapidly as a range of experimental
techniques are becoming available to researchers (Griffin and Crew,
2012).