Sequence To Sequence Model

Lecture 11 focuses on the application of recurrent neural networks (RNNs) for sequence-to-sequence models, which can convert one sequence into another, such as translating text from one language to another. The lecture discusses the construction of language models, including the use of start and end tokens to manage sentence generation, and the integration of convolutional networks for encoding input data. Additionally, it covers how to condition RNNs on various inputs, such as images or other sequences, to generate appropriate text outputs.

Welcome to Lecture 11.

Today we're going to talk about how we can use recurrent neural networks, which we learned about last time, to actually solve some interesting problems. The culmination of today's lecture will be how we can train and utilize sequence-to-sequence models, which are a very powerful class of recurrent neural network models that can convert one sequence into another: for example, text in one language into text in another language, or text representing a question into text representing the answer, and so forth.

Last time, when we talked about RNNs, towards the end we saw that recurrent neural networks are very flexible and can be used to solve a wide range of sequence processing problems. They can take a single input and turn it into a sequence, for example for image captioning, where the input is an image and the output is a sequence representing a textual description. They can turn a sequence into a single output, for example for activity recognition, where you take a sequence of video frames and produce a label representing the activity portrayed in the video. They can turn sequences into other sequences, for example for machine translation, and they can turn sequences into other sequences one step at a time, for example for frame-level video annotation. Of course, the applications of RNNs that output sequences typically also take full sequences as input, to deal with the correlations between sequential tokens, as we saw at the end of the previous lecture. In today's lecture we're mainly going to focus on these many-to-many transduction problems, and specifically on sequence-to-sequence models, but before we do that, let's first discuss how we can build a basic neural language model.

Okay, so what is a language model?
A language model is a model that assigns probabilities to sequences representing text. Language models are a very important concept for a lot of what we're going to discuss, because they can not only assign probabilities but can also often generate text. Here is the example from last time of a neural network model that generates phrases. The training data for this would be a large collection of natural language sentences; here I'm showing just three, but in reality real language models might be trained on several million of them.

How are these sentences represented? There are a few choices here, and we'll go into a lot more depth two lectures from now, but a very simple choice is to tokenize each of the sentences, which means that every word becomes a separate time step, and then to encode each word in some way. A very simple way to encode a word is what's called a one-hot vector. A one-hot vector is just a vector whose length equals the number of possible words, the number of words in the dictionary, and every element in that vector is zero except for the element corresponding to the index of the word, which is set to one. So in the sentence "i think therefore i am" there are five words, therefore there will be five time steps, and each of those time steps will be represented by a very large vector, potentially thousands of entries long, that has zeros everywhere except for the single position representing the word at that step. There are more complex ways to represent words: one approach is to use what's called a word embedding, which is a continuous vector-valued representation of words that is meant to reflect their semantic similarity, so that words that mean similar things are located closer together in terms of Euclidean distance. There are ways to actually construct these word embeddings, which we will most likely discuss next week. For now, let's just say that words are one-hot vectors, which means that our natural language training data simply consists of sequences of one-hot vectors. This is a very simplistic view of language models, and we'll talk about real language models in much more detail later, but for now I just want to introduce the basic idea of an RNN language model, because we're going to build on it to develop sequence-to-sequence models.
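To make the one-hot representation concrete, here is a minimal sketch, assuming a PyTorch setup with a single made-up sentence as the corpus; the variable names and the tiny vocabulary are illustrative, not the lecture's actual code.

```python
import torch
import torch.nn.functional as F

# Toy corpus; a real language model would be trained on millions of sentences.
sentence = "i think therefore i am"

# Build a word-level dictionary: every distinct word gets an integer index.
vocab = sorted(set(sentence.split()))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Tokenize: every word becomes one time step, represented by its index.
tokens = torch.tensor([word_to_idx[w] for w in sentence.split()])

# One-hot encoding: a vector of length |vocab| that is all zeros except for
# a single 1 at the position corresponding to the word's index.
one_hot = F.one_hot(tokens, num_classes=len(vocab)).float()
print(one_hot.shape)  # torch.Size([5, 4]) -> 5 time steps, 4 distinct words
```

In a real model the dictionary would contain thousands of words, so each one-hot vector would be correspondingly longer.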
All right, there are a few details that we have to get right. So far we've seen a fairly cartoony view of RNNs trained on sentences; to make this actually work, we need a few small details. One detail is that we need to know when the model is done outputting a sentence. For example, how can we use such a model to output a complete sentence? If you give it "i" and you want it to complete it with "i think therefore i am", how do you know that you should stop at "am"? Maybe "am" is not the last word; maybe it's going to say "i think therefore i am a hippopotamus", which is also a valid sentence. So what we're going to do is add a special token and include it in the training data at the end of all of our sentences: the end-of-sequence token, also sometimes referred to as the end-of-sentence token. This is a special word that comes at the end of the dictionary and doesn't represent any actual word; it just represents the fact that the model is done. Every sequence in the training data will end with this token, so the model will learn that when it has completed a sentence it should, at the very last time step, output this end-of-sentence or end-of-sequence token. That helps us understand when the model is done with a sentence.

Another problem we're faced with is that if you want such a model to generate a sentence, you need to somehow kick it off. At the second time step it will use the output of the first time step as its input, which is what we learned at the end of the previous lecture, but what do we use at the very first time step? We could just give it a totally random word, and that's reasonable, but not all words are equally likely to start a sentence. One thing we could do, which would actually work, is to compute the frequency with which each word starts a sentence, sample from that distribution, and feed the sampled word in at the first time step; that would be a reasonable solution. A slightly more elegant solution, which doesn't require a special extra component, is to introduce a start-of-sentence token, in the same way that we introduced an end-of-sentence token. We add an additional time step prior to the first one whose input is this special start token, and in the training data every sentence starts with the start token. The model will then learn that after the start token it should output a random word with probability proportional to the probability of that word actually starting a sentence. So if we want to come up with an entirely new sequence, we start with the special start token and let the model generate the rest.
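Here is a minimal sketch of how these special tokens are used at generation time, assuming PyTorch; the TinyLM class, the hidden size, and the particular token indices are placeholder choices of mine, not the lecture's implementation.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A minimal RNN language model: embedding -> LSTM cell -> logits over the vocabulary."""
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, token, state):
        # One time step: read a token, update the hidden state, predict the next word.
        h, c = self.cell(self.embed(token), state)
        return self.out(h), (h, c)

# Placeholder special-token indices; in practice they are just entries in the dictionary.
START, EOS, VOCAB_SIZE = 0, 1, 1000
model = TinyLM(VOCAB_SIZE)

# Kick off with the start token, feed each sampled word back in as the next input,
# and stop as soon as the model emits the end-of-sequence token.
token = torch.tensor([START])
state = None              # None means the LSTM cell starts from all zeros
generated = []
for _ in range(50):       # hard cap on length, just in case EOS never appears
    logits, state = model.step(token, state)
    token = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(1)
    if token.item() == EOS:
        break
    generated.append(token.item())
```

Sampling from the softmax at each step and feeding the sample back in is exactly the procedure from the previous lecture; the start and EOS tokens only determine how the loop begins and ends.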
Now, a little pop quiz, just for all of you to test your understanding of what's going on. Now that we've learned about these start tokens and end-of-sentence tokens, we know how to get RNNs to generate completely random sentences. But what if we want an RNN to complete a sentence? What if you want to say to the RNN: here's a sentence starting with "i think", and I want you to finish it with a random but reasonable continuation? For example, it could say "i think therefore i'm really smart", or "i think therefore i am", or "i think therefore i am is a famous quote by a well-known philosopher". These are all valid English sentences, and they're probably pretty representative of what's in the training data. So can we get the network to complete these sentences in a variety of ways, always starting with the start token, "i", "think"? How would we do this? As a hint, doing this does not require changing the training procedure at all: the model is trained exactly the same way we discussed in the previous lecture.

It's actually fairly simple to get a language model to complete a sentence. You still feed in the tokens one at a time: you feed in the start token and let the model make a prediction, but you don't sample from that prediction; instead, you directly feed in the next word you're conditioning on, which is "i". Then you let it make another prediction, ignore it, and at the next time step you feed in "think". The model then makes a prediction at the third step, and that one you actually sample from and feed in as the input to the fourth step, and so on. So all you have to do in order to condition your generation on a particular starting snippet is to force those first few inputs to agree with that snippet, regardless of what the network was outputting, and that's it. That's a very simple way to get an RNN to complete a sentence.
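As a sketch of that recipe, here is one way it might look in code, reusing the one-step interface of the hypothetical TinyLM above; the function name and arguments are mine, not the lecture's.

```python
import torch

def complete(step, prompt_tokens, eos_id, max_len=50):
    """Complete a sentence from a prompt.

    `step(token, state) -> (logits, state)` is assumed to be the same one-step
    interface as the TinyLM sketch above; `prompt_tokens` should begin with the
    start token.
    """
    state, logits = None, None
    # Force the first inputs to be the prompt: feed each prompt token in and
    # ignore every prediction except the one made after the last prompt word.
    for t in prompt_tokens:
        logits, state = step(torch.tensor([t]), state)

    generated = list(prompt_tokens)
    # From here on, sample each next word and feed it back in, stopping at EOS.
    for _ in range(max_len):
        token = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(1)
        if token.item() == eos_id:
            break
        generated.append(token.item())
        logits, state = step(token, state)
    return generated
```

Calling it with the token indices for the start token, "i", and "think" would force those three inputs and then sample the rest of the sentence until EOS.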
So if you start with a start token and end with an EOS token, you can get the model to generate entirely random sentences, conditioning on nothing except the start token, or you can get it to complete a sentence just by forcing a few words in right after the start token. As a little thought exercise to do at home, you might even consider how you could get a neural network language model to fill in missing words in a sentence. Maybe what I want is start, "i", "think", blank, "i", "am", and I want it to fill in that missing word. Could I get a language model to do that, and if so, how? The answer is actually a little bit subtle, but by the end of today's lecture you should have enough information to guess it, and we can talk about it in class.
Okay, but back to the main topic. So far we've discussed unconditional language models; now I'm going to talk about how we can build conditional language models. For these conditional language models, the text is the output of the model, which is why I've switched to denoting it with y, and you condition the model on some input that tells it what text you want it to generate. For example, you could imagine a conditional language model for image captioning, where you condition the model on a picture: you show it a picture, and the model's job is to generate text describing what's in that picture. The way we can do this is to have some kind of encoder model read in the conditioning information, for example the picture of the puppy. Since the input is a picture, we would use, for example, a convolutional network, but this convolutional network wouldn't produce a label; it would produce the vector representing the starting state of the RNN. Previously we would have started the RNN with a big vector of zeros, which is the kind of RNN we learned about last time; now we're going to set the initial state of the RNN to something that is output by our convolutional network, and this whole thing will be trained end to end, so the convolutional network and the RNN together are trained to produce the correct text. The training procedure for this is very similar to the RNNs we saw in Lecture 10. The main difference is that it's essentially a convnet with an RNN stapled to the end of it, and the job of the convnet is to output the initial state of the RNN. In a sense, this initial state of the RNN is a kind of representation of what is going on in the picture, so it needs to contain all the information the RNN needs to generate the correct sentence. The RNN internally understands what is a valid sentence versus an invalid sentence, but this vector needs to contain all the information it needs to figure out which valid sentence is appropriate here. Intuitively, you can think of this a0 as a vector that represents the thought that there is a cute puppy, and the RNN's job is then to turn that thought into valid English text.

Just to summarize again, because it's very important to understand how this works: this is one big neural network that consists of a few convolutional layers, then maybe some fully connected layers, and that last fully connected layer goes into an RNN. You can think of the convolutional network and the subsequent fully connected layers as producing a0, the initial hidden state of the RNN. If the RNN is a vanilla RNN, this is just a vector; if it's an LSTM, it would essentially be the initial cell state and the initial h. We refer to the RNN part of this model as an RNN decoder, because its job is to decode the thought contained in a0 into English text, and we call the convolutional part a CNN encoder, or in general a neural network encoder, because its job is to take the input, in this case the photograph, and encode it into a0. And a0 is this vector of initial RNN activations that represents what is going on in the picture, and that basically contains everything the RNN needs to know about the input in order to produce the right output.
Now, just to check your understanding of what's going on here, a quick pop quiz: what do we expect the training data for this to look like? If we want to train this thing end to end to generate text for pictures, what do we need our training data to be? Our training data is clearly going to contain pictures, and it's going to have labels, but those labels are no longer categories; they're actually text. So the training data is just tuples of a picture and an English-language sentence. And how do we tell the RNN what to generate? That's determined by what goes into that vector a0: essentially, the RNN internally knows how the English language works, but it needs the right information to be contained in a0 in order to know which English sentence to actually produce.
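Here is a minimal sketch of that wiring in PyTorch; the tiny convolutional stack, the layer sizes, and the choice to zero-initialize the LSTM cell state are placeholder assumptions of mine rather than the architecture on the slides.

```python
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        # CNN encoder: a few conv layers, then a fully connected layer whose
        # output is a0, the initial hidden state of the RNN decoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_size),
        )
        # RNN decoder: generates the caption one word per time step.
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image, caption_in):
        a0 = self.encoder(image)            # (batch, hidden): the "thought" vector
        h0 = a0.unsqueeze(0)                # (1, batch, hidden), as the LSTM API expects
        c0 = torch.zeros_like(h0)           # one simple choice for the initial cell state
        dec_out, _ = self.decoder(self.embed(caption_in), (h0, c0))
        return self.out(dec_out)            # logits over the vocabulary at every step

# Shape check only: a batch of 2 images and 2 captions of 7 token indices each.
model = CaptioningModel(vocab_size=1000)
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```

Training end to end then just means backpropagating the caption loss through both the RNN decoder and the convolutional encoder, so the encoder learns to put the right information into a0.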
Now, we don't have to condition on photographs; we can condition our language model on anything, so we could even condition it on another sequence. Instead of a photograph and a convnet, you could have an encoder that is another RNN. Let's say you want to translate French into English: you could have an RNN that reads in French text and produces the initial hidden state activations for another RNN that produces English text. The first RNN reads in French and produces a0, and the second RNN takes a0 and produces English text, and you could use this, for example, to train a model that translates French into English. So now, instead of a CNN encoder, you have an RNN encoder, and this a0 is in some sense virtual: you could think of the whole thing as one big RNN, where whatever happens at the last step of the French sentence just passes directly to the first step of the English sentence. In fact, you could even train it that way: take one giant RNN whose job is to read in a French sentence and then an English sentence, and at every step of the English sentence it should produce the next word of that English sentence. That would be a valid model for translating French into English. In practice, it's much more common to have two separate RNNs, meaning that the weights in the French part and in the English part are actually different; the weights between time steps within the French part are all the same, and the weights between time steps within the English part are all the same, but between the two languages they differ. You don't have to do it this way, though: you could have just one giant RNN, and that would be technically correct, although it would probably be a little harder to train. If you think of this as one giant RNN, then the start token for the English sentence is actually also the end token of the French sentence; it doesn't really matter whether we call it a start token or an end token, it works the same way. Seeing that token tells the network that it's now time to stop reading French and start generating English.
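Here is a minimal sketch of the two-RNN version, assuming PyTorch; the vocabulary sizes, the single-layer LSTMs, and the tensor shapes are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder with separate weights for the source and target languages."""
    def __init__(self, src_vocab, tgt_vocab, hidden_size=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden_size)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src_tokens, tgt_in_tokens):
        # The encoder reads the (say, French) sentence; its final state plays the role of a0.
        _, (h, c) = self.encoder(self.src_embed(src_tokens))
        # The decoder starts from a0 and reads the target-side inputs
        # (the start token followed by the previous English words during training).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in_tokens), (h, c))
        return self.out(dec_out)    # per-step logits over the English vocabulary

# Shape check: a batch of 2 source sentences (9 tokens) and target-side inputs (6 tokens).
model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 9)), torch.randint(0, 8000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 8000])
```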
Now, there are a few details needed to get this to actually work. One small thing that people sometimes do, although not always, is to read in the input sequence in reverse: instead of starting at the beginning of the sentence and going to the end, you start at the end of the sentence and go backwards. You don't have to do it this way, but it is one common choice. As a little thought exercise, and this will be important later, why do you think that is? Why might it be better to read in the French sentence backwards and then produce the English sentence forwards? The reason this might be better is that the beginning of the English sentence probably has more to do with the beginning of the French sentence than with the end of the French sentence, so by reversing the input sentence, its beginning comes last, which puts it closest to the beginning of the output sentence. Essentially, those dependencies become a little shorter. You could argue that this might be bad because the dependency for the end of the sentence becomes even longer, but in practice this sometimes works a little bit better, although not always, and it's not the only way to do it. This notion of long-term dependencies will come up again towards the end of the lecture, when we talk about attention.

As I mentioned, we tend to use different RNNs for the input and the output, so the encoder and the decoder actually have different weights, and the encoder produces the initial hidden state activation for the decoder. By the way, when I say hidden state, that's just another word for activation. So typically these are two separate RNNs with different weights, but they are still trained end to end: your training data now consists of paired sentences, basically tuples where there is a French sentence and its corresponding English translation, and you train the whole thing end to end in exactly the same way as you would train all the RNNs from the previous lecture.
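As a sketch of what a single end-to-end update on such paired data might look like (building on the hypothetical Seq2Seq sketch above; the padding and special-token indices are assumptions of mine): the decoder input is the target sentence shifted right by one, starting with the start token, and the loss compares the per-step predictions against the target sentence ending in EOS.

```python
import torch
import torch.nn.functional as F

# Assumed special-token indices, shared with the earlier sketches.
PAD, START = 0, 1

def training_step(model, optimizer, src, tgt):
    """One end-to-end update on a batch of (source, target) sentence pairs.

    src: (batch, src_len) source-language token IDs
    tgt: (batch, tgt_len) target-language token IDs, each sequence ending in EOS
         and padded with PAD to a common length
    """
    start_col = torch.full((tgt.size(0), 1), START, dtype=tgt.dtype)
    tgt_in = torch.cat([start_col, tgt[:, :-1]], dim=1)     # teacher forcing: shift right
    logits = model(src, tgt_in)                             # (batch, tgt_len, vocab)

    # Cross-entropy at every time step, ignoring the padded positions.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt.reshape(-1), ignore_index=PAD)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```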
Okay, here's a more realistic example of a sequence-to-sequence model. Here I've made a few modifications: I read in the input sentence backwards, and I've also added multiple layers. It's very common for these kinds of designs to use LSTM cells and to stack multiple LSTM cells on top of each other; this is sometimes referred to as a stacked RNN or a stacked LSTM. The number of layers that you stack is typically a bit lower than what you would use for a large convnet: convnets might have tens or even hundreds of layers, whereas for LSTMs it's typically in the range of about two to four, maybe five. Later on we'll talk about another type of model, called a transformer, which tends to be quite a lot deeper, but LSTMs tend to be around two to four layers deep. That's partly because they use tanh non-linearities, which don't train as well in very deep stacks, and partly because the sequential nature of the LSTM already adds a lot to its representational power; it has depth both vertically and horizontally, so it doesn't need too many stacked layers, and two to four tends to be pretty good. So we have multiple RNN layers, each of which would typically use an LSTM cell or some variant of an LSTM, like a GRU cell. The model is trained end to end on pairs of sequences, and the sequences can have different lengths: the input and the output can be different lengths, and of course the lengths can also differ across training tuples.
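A small sketch of those practical details, assuming PyTorch (the sizes and toy data are placeholders of mine): stacking LSTM layers is just `num_layers`, reading the input in reverse is a flip along the time axis, and sentences of different lengths in one batch are handled by padding and packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# A stacked, two-layer LSTM encoder; two to four layers is the typical range mentioned above.
encoder = nn.LSTM(input_size=256, hidden_size=256, num_layers=2, batch_first=True)

# Two already-embedded source sentences of different lengths (toy data).
seqs = [torch.randn(9, 256), torch.randn(5, 256)]
lengths = torch.tensor([s.size(0) for s in seqs])

# Optionally read each input sentence in reverse, as discussed earlier.
seqs = [torch.flip(s, dims=[0]) for s in seqs]

# Pad to a common length, then pack so the LSTM skips the padded steps.
padded = pad_sequence(seqs, batch_first=True)       # (2, 9, 256)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
_, (h, c) = encoder(packed)                         # h, c: (num_layers=2, batch=2, 256)
```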
These kinds of sequence-to-sequence models are very flexible; they can be used to do a wide variety of different things, not just labeling pictures of cute puppies or translating sentences about puppies. They can, of course, be used to translate one language into another, and they can also be used to summarize long sentences into short sentences: the input can be a long sentence and the output a short one. They can be used to respond to a question with an answer, in which case the input is the question and the output is the text of the answer. They can even be used for some fairly exotic applications: for instance, the input can be a textual description of a piece of code and the output can be Python code that implements that description. In general, for anything that you can formulate as pairs of sequences that are, let's say, less than 100 tokens in length, you could probably set up a seq2seq model, a sequence-to-sequence model, to try to do it. If you want to read more about sequence-to-sequence models, I would recommend the paper by Ilya Sutskever, Oriol Vinyals, and Quoc Le called "Sequence to Sequence Learning with Neural Networks", although there are many follow-ups that extend this basic recipe in a number of different ways.

Okay, so that's the basic seq2seq model, and in the rest of the lecture we'll develop this idea further and discuss a few details needed to use it effectively.