Attention mechanisms and positional encoding
INTRODUCTION TO LLMS IN PYTHON
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Why attention mechanisms?
Positional encoding
Create a positional encoding vector PE for each token embedding E
Based on sine and cosine functions
Add PE to E
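For a position pos and embedding dimension index i, the standard sinusoidal encoding implemented by the class below is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))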
Positional encoder class

import math
import torch
import torch.nn as nn

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_length=512):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_seq_length = max_seq_length
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

Set max_seq_length and the embedding size d_model
Create the positional encoding matrix pe for sequences up to max_seq_length
position: position indices in the sequence
div_term: a term to scale position indices
Alternately apply sine and cosine to pe
register_buffer(): registers pe as non-trainable
forward(): adds the positional encodings pe to the input tensor of sequence embeddings, x
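As a quick shape check (a minimal sketch; the batch size, sequence length, and d_model below are arbitrary example values):

pos_enc = PositionalEncoder(d_model=512)
embeddings = torch.randn(2, 10, 512)    # (batch_size, seq_length, d_model) dummy token embeddings
encoded = pos_enc(embeddings)           # same shape, now carrying positional information
print(encoded.shape)                    # torch.Size([2, 10, 512])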
Let's practice!
Multi-headed self-attention
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Self-attention mechanism anatomy
Q, K, V: three linear projections of each token embedding
Similarity between Q and K pairs: attention scores matrix
Softmax scaling: attention weights matrix
Example sentence: "Orange is my favorite fruit"
Query: Orange
Attention weights: .21, .03, .05, .31, .40 (one weight per token, summing to 1)
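A minimal numeric sketch of these steps (toy tensors with 5 tokens and 4 dimensions to mirror the example sentence; the random values are illustrative, and the 1/sqrt(d) scaling shown here is the standard scaled dot-product formulation, which the simplified compute_attention() later on omits):

import torch

torch.manual_seed(0)
Q = torch.randn(5, 4)                     # one query vector per token
K = torch.randn(5, 4)                     # one key vector per token
V = torch.randn(5, 4)                     # one value vector per token

scores = Q @ K.T / (4 ** 0.5)             # similarity between Q and K pairs: attention scores matrix
weights = torch.softmax(scores, dim=-1)   # softmax scaling: attention weights matrix, rows sum to 1
output = weights @ V                      # each token's output is a weighted mix of the value vectors
print(weights[0])                         # attention weights for the first token's query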
Multi-headed self-attention
Multi-headed attention class

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)

num_heads: number of attention heads, each handling embeddings of size head_dim
Setting up linear transformations (nn.Linear()) for the attention inputs and output
    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(
            batch_size * self.num_heads, -1, self.head_dim)

    def compute_attention(self, query, key, mask=None):
        # transpose the key so it can be matrix-multiplied with the query
        scores = torch.matmul(query, key.permute(0, 2, 1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-1e9"))
        attention_weights = F.softmax(scores, dim=-1)
        return attention_weights

split_heads(): splits the input across attention heads
compute_attention(): computes the attention weights using F.softmax()
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)
        attention_weights = self.compute_attention(query, key, mask)
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(
            0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)

forward(): orchestrates the multi-headed attention workflow
Concatenate and project the head outputs with self.output_linear()
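A quick sanity check (a minimal sketch with arbitrary example sizes; it assumes the MultiHeadAttention class above is defined):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)    # (batch_size, seq_length, d_model) dummy embeddings
out = mha(x, x, x)             # self-attention: query, key, and value are the same tensor
print(out.shape)               # torch.Size([2, 10, 512])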
Let's practice!
Building an encoder transformer
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
From original to encoder-only transformer
Transformer body: encoder stack with N encoder layers
Encoder layer: multi-headed self-attention, feed-forward (sub)layers, layer normalizations, skip connections, dropouts
Transformer head: processes the encoded inputs (hidden states) to produce the output prediction
Supervised tasks: classification, regression
Feed-forward sublayer in encoder layer

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

Two fully connected layers + ReLU activation
d_ff: dimension between the linear layers
forward(): processes the attention outputs to capture complex, non-linear patterns
Encoder layer

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

Multi-headed self-attention, then the feed-forward sublayer, each followed by layer normalization and dropout
forward(): forward pass through the encoder layer
mask prevents processing padding tokens
Masking the attention process
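In practice, the mask passed to forward() can be derived from the token IDs (a minimal sketch; it assumes padding uses token ID 0, a common but not universal convention, and a batch of one sequence so the mask broadcasts cleanly against the flattened attention scores):

input_ids = torch.tensor([[5, 21, 9, 0, 0]])     # one sequence padded to length 5 with ID 0
padding_mask = (input_ids != 0).unsqueeze(1)     # shape (1, 1, seq_length): 1 = real token, 0 = padding
# Inside compute_attention(), positions where the mask is 0 are filled with a large
# negative score, so the softmax assigns them (near-)zero attention weight.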
Transformer body: encoder

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads,
                 d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_sequence_length)
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout)
             for _ in range(num_layers)]
        )

    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

Encoder structure:
Input embeddings based on vocab_size
Positional encoding (PositionalEncoding: the sinusoidal positional encoder, called PositionalEncoder in the earlier lesson)
Stack of num_layers encoder layers, using nn.ModuleList()
forward(): forward pass of x through the transformer body
Transformer head

Classification head
Tasks: text classification, sentiment analysis, NER, extractive QA, etc.
fc: fully connected linear layer
Transforms encoder hidden states into num_classes class probabilities

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x)
        return F.log_softmax(logits, dim=-1)

Regression head
Tasks: estimate text readability, language complexity, etc.
output_dim is 1 when predicting a single numerical value

class RegressionHead(nn.Module):
    def __init__(self, d_model, output_dim):
        super(RegressionHead, self).__init__()
        self.fc = nn.Linear(d_model, output_dim)

    def forward(self, x):
        return self.fc(x)
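Putting the body and a head together for classification (a minimal sketch: the hyperparameters, token IDs, and the choice of classifying from the first position's hidden state are illustrative assumptions, not prescribed by the slides):

encoder = TransformerEncoder(vocab_size=10000, d_model=512, num_layers=6,
                             num_heads=8, d_ff=2048, dropout=0.1,
                             max_sequence_length=256)
classifier = ClassifierHead(d_model=512, num_classes=3)

input_ids = torch.randint(0, 10000, (2, 20))     # dummy batch of token IDs
mask = torch.ones(1, 1, 20)                      # toy mask: no padding anywhere
hidden_states = encoder(input_ids, mask)         # (batch_size, seq_length, d_model)
log_probs = classifier(hidden_states[:, 0, :])   # sequence-level prediction from the first token's state
print(log_probs.shape)                           # torch.Size([2, 3])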
Let's practice!
Building a decoder transformer
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
From original to decoder-only transformer
Autoregressive sequence generation: text generation and completion
Masked multi-head self-attention: an upper triangular mask hides future positions in the sequence, so each token only attends to the tokens before the current one
Decoder-only transformer head: linear layer + softmax over the vocabulary to predict the most likely next token
Masked self-attention
Key to autoregressive (causal) behavior
Triangular (causal) attention mask: a token only pays attention to "past" (left-hand side) information in the sequence
"favorite" attends to "orange", "is", "my", and "favorite" itself
Enforced causal attention: predict the next word to generate, e.g. "fruit"

self_attention_mask = (1 - torch.triu(
    torch.ones(1, sequence_length, sequence_length),
    diagonal=1)).bool()
(...)
output = decoder(input_sequence, self_attention_mask)

The mask self_attention_mask is built and passed to the model
Same multi-headed attention mechanism: only the mask is different
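To see the mask concretely (a minimal sketch; sequence_length=5 matches the five-token example sentence, but the value is otherwise arbitrary):

sequence_length = 5
self_attention_mask = (1 - torch.triu(
    torch.ones(1, sequence_length, sequence_length),
    diagonal=1)).bool()
print(self_attention_mask[0].int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
# Row i: token i can attend only to positions 0..i (itself and the tokens before it).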
Transformer body (decoder) and head

class DecoderOnlyTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads,
                 d_ff, dropout, max_sequence_length):
        super(DecoderOnlyTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_sequence_length)
        # DecoderLayer: decoder block with masked self-attention
        # (same attention mechanism as the encoder; only the mask differs)
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout)
             for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x, self_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, self_mask)
        x = self.fc(x)
        return F.log_softmax(x, dim=-1)

self.fc: output linear layer with vocab_size neurons
Add self.fc and the softmax activation in the forward pass
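A minimal generation-step sketch (hyperparameters and token IDs are arbitrary; it assumes the decoder classes above are defined):

decoder = DecoderOnlyTransformer(vocab_size=10000, d_model=512, num_layers=6,
                                 num_heads=8, d_ff=2048, dropout=0.1,
                                 max_sequence_length=256)
input_sequence = torch.randint(0, 10000, (1, 5))    # dummy prompt of 5 token IDs
sequence_length = input_sequence.size(1)
self_attention_mask = (1 - torch.triu(
    torch.ones(1, sequence_length, sequence_length),
    diagonal=1)).bool()
log_probs = decoder(input_sequence, self_attention_mask)   # (1, 5, vocab_size)
next_token = log_probs[:, -1, :].argmax(dim=-1)            # most likely next token ID
print(next_token)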
Let's practice!
Building an encoder-decoder transformer
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Transformer architecture: encoder recap
Cross-attention mechanism
Cross-attention: double inputs
1. Information processed throughout the decoder
2. Final hidden states from the encoder block
Look back at the processed input sequence
Find out the next output token to generate

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        ...

    def forward(self, x, y, causal_mask, cross_mask):
        self_attn_output = self.self_attn(x, x, x, causal_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        cross_attn_output = self.cross_attn(x, y, y, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        ...

x: decoder information flow, becomes the cross-attention query
y: encoder output, becomes the cross-attention keys and values
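A shape-level sketch of how such a decoder layer is called (the tensors and sizes are illustrative; the elided parts of DecoderLayer, such as the norms, dropout, and feed-forward sublayer, are assumed to be defined as in EncoderLayer):

x = torch.randn(1, 7, 512)    # decoder states for 7 target tokens (cross-attention queries)
y = torch.randn(1, 9, 512)    # encoder output for 9 source tokens (cross-attention keys/values)
causal_mask = (1 - torch.triu(torch.ones(1, 7, 7), diagonal=1)).bool()   # hides future target positions
cross_mask = torch.ones(1, 7, 9)    # here: attend to every source position (no source padding)
# decoder_layer(x, y, causal_mask, cross_mask) would return updated decoder states of shape (1, 7, 512)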
Encoder meets decoder
Transformer head
Example output probabilities over the target vocabulary:
jugar (to play): 0.03
viajar (to travel): 0.96
dormir (to sleep): 0.01
Everything brought together!
Previously defined components:

class PositionalEncoding(nn.Module): ...
class MultiHeadAttention(nn.Module): ...
class FeedForwardSubLayer(nn.Module): ...
class EncoderLayer(nn.Module): ...
class DecoderLayer(nn.Module): ...
class TransformerEncoder(nn.Module): ...
class TransformerDecoder(nn.Module): ...
class ClassificationHead(nn.Module): ...

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads,
                 num_layers, d_ff, max_seq_len, dropout):
        super(Transformer, self).__init__()
        self.encoder = TransformerEncoder(vocab_size, d_model, num_layers,
                                          num_heads, d_ff, dropout, max_seq_len)
        self.decoder = TransformerDecoder(vocab_size, d_model, num_layers,
                                          num_heads, d_ff, dropout, max_seq_len)

    def forward(self, src, src_mask, causal_mask):
        encoder_output = self.encoder(src, src_mask)
        # src_mask doubles as the cross-attention mask over the encoder output
        decoder_output = self.decoder(src, encoder_output,
                                      causal_mask, src_mask)
        return decoder_output
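An end-to-end sketch (arbitrary example hyperparameters; it assumes the component classes listed above are defined with the signatures shown in the earlier lessons):

model = Transformer(vocab_size=10000, d_model=512, num_heads=8,
                    num_layers=6, d_ff=2048, max_seq_len=256, dropout=0.1)
src = torch.randint(0, 10000, (1, 12))       # dummy source token IDs
src_mask = torch.ones(1, 1, 12)              # toy mask: no padding
causal_mask = (1 - torch.triu(torch.ones(1, 12, 12), diagonal=1)).bool()
output = model(src, src_mask, causal_mask)   # per-position outputs; with a vocabulary-sized
print(output.shape)                          # head this is (1, 12, vocab_size)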
Let's practice!