"The cat sat on the mat.
"
Without Masking (Incorrect):
If the model has access to the entire sequence without any masking, it can "peek" at future words
while making predictions. For example, it might see the word "mat" while predicting "sat," which
amounts to cheating and is not representative of real generation, where future tokens do not yet exist.
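To make this concrete, here is a minimal PyTorch sketch (with random stand-in attention scores) of how a causal mask blocks the "peek": -inf is written into every position above the diagonal before the softmax, so each token receives zero weight on the tokens that come after it.

```python
import torch
import torch.nn.functional as F

seq_len = 6  # e.g. the six tokens of "The cat sat on the mat" (illustrative)
scores = torch.randn(seq_len, seq_len)  # random stand-in attention scores

# Causal mask: True above the diagonal marks the "future" positions.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights)  # each row sums to 1, with exact zeros above the diagonal
```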
1. Masked Multi-Head Attention:
o Self-Attention: The decoder first performs self-attention on its own output. This
allows the decoder to focus on different parts of the output sequence it has
generated so far.
o Masking: To prevent the decoder from "peeking" at future tokens in the output
sequence, a mask is applied. This ensures that the decoder only attends to previous
tokens.
o Multi-Head Attention: Multiple attention heads are used to capture different aspects
of the output sequence.
2. Encoder-Decoder Attention:
o Cross-Attention: The decoder then performs attention over the encoder's output.
This allows the decoder to align its output with the relevant parts of the input
sequence.
o Multi-Head Attention: Multiple attention heads are used to capture different
relationships between the input and output sequences.
3. Feed-Forward Network (FFN):
o Position-wise Feed-Forward Networks: Each position in the output sequence is fed
through a fully connected feed-forward network. This introduces non-linearity and
allows the model to learn complex relationships between the input and output.
4. Linear Layer:
o Projection: A linear layer projects the output of the FFN to a vector whose dimension
equals the vocabulary size.
5. Softmax:
o Probability Distribution: Softmax is applied to the output of the linear layer to
obtain a probability distribution over the vocabulary.
o Next Token Prediction: The token with the highest probability is selected as the next
token in the output sequence.
Each decoder layer processes the output of the previous layer; a minimal code sketch of one such layer follows.
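As a rough sketch of steps 1-3 above, here is one decoder layer in PyTorch. The dimensions are illustrative assumptions, dropout is omitted, and residual connections with layer normalization are included as in the original Transformer.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of one decoder layer: masked self-attention, cross-attention, FFN.
    Residual connections and layer norm are kept; dropout is omitted."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory):
        # 1. Masked multi-head self-attention over the tokens generated so far.
        t = tgt.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + attn_out)
        # 2. Encoder-decoder (cross) attention: queries come from the decoder,
        #    keys and values come from the encoder's output ("memory").
        attn_out, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + attn_out)
        # 3. Position-wise feed-forward network (two linear layers with ReLU).
        return self.norm3(tgt + self.ffn(tgt))
```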
Iteration 1:
1. Self-Attention (Current Sequence):
o The decoder uses self-attention to focus on different parts of the generated
sequence (initially just <s>).
2. Encoder-Decoder Attention:
o The decoder uses cross-attention to attend to the encoder's output, the
contextualized matrix of "Hi, how are you?"
o It gathers the relevant contextual information from the encoder's representation
of the input.
3. Feed-Forward Network:
o The combined information from the attention mechanisms is processed through a
feed-forward neural network.
4. Softmax Layer:
o The output is passed through a softmax layer to generate a probability distribution
over the vocabulary for the next token.
5. Token Selection:
o The token with the highest probability (e.g., "I'm") is selected as the next token.
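A hedged sketch of this single step, assuming hypothetical components `embed` (token embedding), `decoder` (a stack of layers like the one sketched earlier), `memory` (the encoder's output for "Hi, how are you?"), and `lm_head` (the vocabulary projection):

```python
import torch
import torch.nn.functional as F

BOS_ID = 1  # hypothetical id of the start token <s>

# One greedy decoding step; embed, decoder, lm_head, and memory are
# hypothetical stand-ins for the components described above.
tokens = torch.tensor([[BOS_ID]])                # current sequence: just <s>
h = decoder(embed(tokens), memory)               # self-attention + cross-attention + FFN
logits = lm_head(h[:, -1, :])                    # linear layer: project to vocab size
probs = F.softmax(logits, dim=-1)                # probability distribution over tokens
next_token = probs.argmax(dim=-1, keepdim=True)  # greedy pick, e.g. the id of "I'm"
```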
1. Feed-Forward Network:
Role: After the self-attention and cross-attention mechanisms, the feed-forward network
(FFN) processes the outputs to transform the encoded information into the required
format.
Function: It consists of two linear layers with a ReLU activation in between. This helps in
capturing complex patterns and relationships in the data.
2. Linear Layer:
Role: The linear (or dense) layer acts as a transformation step. It maps the output of the
feed-forward network to the vocabulary size.
Function: This layer projects the high-dimensional output of the FFN to the dimension of
the vocabulary, creating a vector where each position corresponds to a token in the
vocabulary.
3. Softmax Layer:
Role: The softmax layer converts the output from the linear layer into a probability
distribution over the vocabulary.
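Putting these three stages together, a minimal sketch with illustrative sizes (the 512/2048/32000 dimensions are assumptions, not values from the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff, vocab_size = 512, 2048, 32000  # illustrative sizes (assumed)

# 1. Feed-forward network: two linear layers with a ReLU in between.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
# 2. Linear layer: one score (logit) per token in the vocabulary.
lm_head = nn.Linear(d_model, vocab_size)

h = torch.randn(1, d_model)         # stand-in for one position's decoder state
logits = lm_head(ffn(h))            # shape (1, vocab_size)
# 3. Softmax: logits -> probability distribution that sums to 1.
probs = F.softmax(logits, dim=-1)
```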
Iteration 2:
1. Next Input Sequence:
o The input sequence now includes the previously generated token: <s> I'm
2. Self-Attention (Current Sequence):
o The decoder focuses on the current sequence <s> I'm.
3. Encoder-Decoder Attention:
o It attends to the encoder's contextualized matrix of "Hi, how are you?" again to
gather the relevant information.
4. Feed-Forward Network:
o The output is processed through the feed-forward network.
5. Softmax Layer:
o A probability distribution is generated for the next token.
6. Token Selection:
o The token with the highest probability (e.g., "good") is selected.
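In code terms, Iteration 2 only differs from Iteration 1 in its input: the previously selected token is appended before the same stack runs again. This continues the earlier snippet and reuses the same hypothetical names.

```python
# Iteration 2 continues the snippet above: append the chosen token, rerun.
tokens = torch.cat([tokens, next_token], dim=1)    # sequence is now <s> I'm
h = decoder(embed(tokens), memory)
probs = F.softmax(lm_head(h[:, -1, :]), dim=-1)
next_token = probs.argmax(dim=-1, keepdim=True)    # e.g. the id of "good"
```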
Iteration 3:
1. Next Input Sequence:
o The input sequence is now: <s> I'm good.
2. Self-Attention (Current Sequence):
o The decoder focuses on the sequence <s> I'm good.
3. Encoder-Decoder Attention:
o It attends to the encoder's output again.
4. Feed-Forward Network:
o The output is processed.
5. Softmax Layer:
o A probability distribution is generated.
6. Token Selection:
o The token with the highest probability (e.g., "how") is selected.
Final Iterations:
1. Repeat Steps 1-6:
o The process repeats, generating tokens such as "are" and "you?" until a stopping
criterion is met (e.g., the end token </s>).
Final Output:
The final output sequence might be: "I'm good, how are you?"
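All of these iterations can be collapsed into one loop. The sketch below assumes the same hypothetical `embed`, `decoder`, and `lm_head` components as before and stops when the end token is produced or a length limit is reached.

```python
import torch
import torch.nn.functional as F

def greedy_decode(memory, embed, decoder, lm_head, bos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding (sketch; all components are assumed).
    Each pass repeats the iteration above: run the decoder on the sequence
    so far, project to the vocabulary, and append the argmax token."""
    tokens = torch.tensor([[bos_id]])        # start with <s>
    for _ in range(max_len):
        h = decoder(embed(tokens), memory)   # masked self-attn + cross-attn + FFN
        probs = F.softmax(lm_head(h[:, -1, :]), dim=-1)
        next_token = probs.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == eos_id:      # stop once </s> is generated
            break
    return tokens
```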