# Part 4
## Abstract
- **Sequence Transduction Models:** These are models designed to convert one sequence of data into another. In this paper, they are used for tasks like machine translation: turning a sequence of words in one language into a sequence of words in another.
- **Include an Encoder and a Decoder:** In these models, there are two main parts. The encoder takes the input sequence
(e.g., a sentence in one language), processes it, and converts it into a different representation. The decoder then takes this
representation and generates the output sequence (e.g., a translated sentence in another language).
- **Connected Through an Attention Mechanism:** This is a crucial detail. The attention mechanism allows the model to focus on specific parts of the input sequence when generating each part of the output sequence. It's like telling the model, "Pay more attention to these words when translating this part." A minimal sketch of this idea follows the list below.
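
A minimal sketch of that encoder-decoder attention idea, written in plain NumPy. It uses a single unprojected decoder query over the encoder's outputs purely for illustration; the paper's model learns separate query/key/value projections and uses multiple heads, and the function names here are made up for the example:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_query: np.ndarray, encoder_states: np.ndarray) -> np.ndarray:
    """The decoder 'pays attention' to the encoder's representation of the source
    sentence: each weight says how relevant a source position is to the output
    position currently being generated."""
    d = encoder_states.shape[-1]
    weights = softmax(decoder_query @ encoder_states.T / np.sqrt(d))  # (1, src_len)
    return weights @ encoder_states                                    # (1, d)

np.random.seed(0)
encoder_states = np.random.randn(6, 16)   # 6 source tokens, 16-dim encoder outputs
decoder_query  = np.random.randn(1, 16)   # state of the output position being generated
context = cross_attention(decoder_query, encoder_states)
print(context.shape)  # (1, 16)
```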
## Introduction
- **Sequential Nature of Recurrent Models:** These models process sequences step by step, generating a hidden state for each position based on the previous hidden state and the input at that position. This sequential processing makes it hard to parallelize computation within a training example, which becomes a real bottleneck for longer sequences (see the sketch after this list).
- **Attention Mechanisms:** These have become an integral part of sequence modeling. They allow the model to relate different parts of the input or output sequence, regardless of how far apart those parts are. Traditionally, attention mechanisms are used in combination with recurrent networks.
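
The sketch below (plain NumPy, an assumed Elman-style recurrence rather than the LSTM/GRU variants the paper cites) shows why that step-by-step dependence blocks parallelization: each hidden state needs the previous one, so the loop over time steps cannot be vectorized away.

```python
import numpy as np

def rnn_forward(inputs: np.ndarray, W_h: np.ndarray, W_x: np.ndarray) -> np.ndarray:
    """Plain recurrence: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t).
    Each h_t depends on h_{t-1}, so the time steps must be computed in order."""
    seq_len, _ = inputs.shape
    d_hidden = W_h.shape[0]
    h = np.zeros(d_hidden)
    states = []
    for t in range(seq_len):          # the sequential bottleneck: a loop over time
        h = np.tanh(W_h @ h + W_x @ inputs[t])
        states.append(h)
    return np.stack(states)           # (seq_len, d_hidden)

np.random.seed(0)
x = np.random.randn(10, 4)            # 10 time steps, 4 input features
W_h = np.random.randn(8, 8) * 0.1
W_x = np.random.randn(8, 4) * 0.1
print(rnn_forward(x, W_h, W_x).shape) # (10, 8)
```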
## Background
- **Previous Models Using Convolutional Neural Networks:** The Extended Neural GPU, ByteNet, and ConvS2S are
introduced as models that use convolutional neural networks (CNNs) as the fundamental building block. These models
perform computations in parallel for all input and output positions.
- **Challenges in Previous Models:** In ConvS2S and ByteNet, the number of operations required to relate signals from two arbitrary positions grows with the distance between those positions (linearly for ConvS2S, logarithmically for ByteNet). This makes it harder for these models to learn dependencies between distant positions; a rough back-of-the-envelope illustration follows at the end of this section.
- **Introduction of Self-Attention:** Self-attention, also called intra-attention, is introduced as an attention mechanism that
relates different positions within a single sequence to compute a representation of that sequence. Self-attention has been
successfully used in various tasks, such as reading comprehension, summarization, and learning task-independent
sentence representations.
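
As a concrete (and heavily simplified) illustration of self-attention, the NumPy sketch below lets every position of a single sequence attend to every other position of that same sequence. Unlike the paper's version, it skips the learned query/key/value projections and multiple heads:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal single-head self-attention with no learned projections:
    every position attends to every position of the same sequence.
    x has shape (seq_len, d_model)."""
    d_model = x.shape[-1]
    # Using x itself as queries, keys, and values; the paper learns
    # separate projection matrices for each of these roles.
    scores = x @ x.T / np.sqrt(d_model)               # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ x                                 # (seq_len, d_model)

np.random.seed(0)
sequence = np.random.randn(5, 8)        # 5 tokens, 8-dimensional embeddings
print(self_attention(sequence).shape)   # (5, 8)
```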
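
To make the distance argument concrete, here is a rough back-of-the-envelope count (my own simplification, not the paper's exact analysis) of how many stacked convolution layers are needed before two positions a given distance apart can interact, compared with a single self-attention layer:

```python
import math

def conv_layers_to_connect(distance: int, kernel_size: int = 3, dilated: bool = False) -> int:
    """Rough count of stacked convolution layers needed before two positions
    `distance` apart can interact (an assumed simplification; exact counts
    depend on the architecture)."""
    if dilated:
        # ByteNet-style dilated convolutions: the receptive field grows exponentially
        # with depth, so the layer count grows roughly logarithmically with distance.
        return max(1, math.ceil(math.log2(distance)))
    # Contiguous kernels (ConvS2S-style): the receptive field grows linearly
    # with depth, so the layer count grows roughly linearly with distance.
    return max(1, math.ceil(distance / (kernel_size - 1)))

for d in (4, 64, 1024):
    print(d, conv_layers_to_connect(d), conv_layers_to_connect(d, dilated=True))
# A single self-attention layer connects any two positions in one step,
# i.e., a constant number of sequential operations regardless of distance.
```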