Chapter 2

Attention mechanisms and positional encoding
INTRODUCTION TO LLMS IN PYTHON
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager

Why attention mechanisms?



Positional encoding

Create positional encodings vector PE for token embedding E
Based on sine and cosine functions
Add PE to E
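
The sine and cosine pattern referred to above is the standard sinusoidal encoding (implemented by the class on the next slide):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the token's position in the sequence and i indexes pairs of embedding dimensions.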


Positional encoder class
import math
import torch
import torch.nn as nn

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_length=512):
        super(PositionalEncoder, self).__init__()
        # Set max_seq_length and embedding size d_model
        self.d_model = d_model
        self.max_seq_length = max_seq_length
        # Positional encoding matrix pe, for sequences up to max_seq_length
        pe = torch.zeros(max_seq_length, d_model)
        # position: position indices in the sequence
        position = torch.arange(0, max_seq_length,
                                dtype=torch.float).unsqueeze(1)
        # div_term: a term to scale position indices
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * -(math.log(10000.0) / d_model))
        # Alternately apply sine and cosine to pe
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        # register_buffer(): set pe as non-trainable
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add positional encodings pe to the input tensor of sequence embeddings, x
        x = x + self.pe[:, :x.size(1)]
        return x
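
A quick usage sketch of the class above (the vocabulary size and shapes below are arbitrary):

embedding = nn.Embedding(10000, 512)          # hypothetical vocabulary of 10,000 tokens
pos_encoder = PositionalEncoder(d_model=512)
tokens = torch.randint(0, 10000, (2, 16))     # batch of 2 sequences, 16 tokens each
x = pos_encoder(embedding(tokens))            # token embeddings + positional encodings
print(x.shape)                                # torch.Size([2, 16, 512])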


Let's practice!

Multi-headed self-attention

Self-attention mechanism anatomy
Q, K, V: three linear projections of each token embedding
Similarity between Q and K pairs -> attention scores matrix
Softmax scaling -> attention weights matrix

Example: "Orange is my favorite fruit"
Query: "Orange"
Attention weights: .21  .03  .05  .31  .40
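
An illustrative, single-head sketch of these steps (random weights and embeddings, purely for intuition):

import torch
import torch.nn.functional as F

d = 4                                   # toy embedding size
E = torch.randn(5, d)                   # 5 token embeddings: "Orange is my favorite fruit"
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

Q, K, V = E @ W_q, E @ W_k, E @ W_v     # three linear projections of the embeddings
scores = Q @ K.T / d ** 0.5             # similarity between Q and K pairs (scaled)
weights = F.softmax(scores, dim=-1)     # attention weights matrix, each row sums to 1
context = weights @ V                   # weighted sum of values
print(weights[0])                       # how the first token attends to each token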


Multi-headed self-attention

Multi-headed attention class
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)

num_heads: number of attention heads, each handling embeddings of size head_dim

Setting up linear transformations (nn.Linear()) for attention inputs and output


    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(
            batch_size * self.num_heads, -1, self.head_dim)

    def compute_attention(self, query, key, mask=None):
        scores = torch.matmul(query, key.permute(0, 2, 1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-1e9"))
        attention_weights = F.softmax(scores, dim=-1)
        return attention_weights

split_heads(): splits the input across attention heads

compute_attention(): compute attention weights using F.softmax()


    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)

        attention_weights = self.compute_attention(query, key, mask)
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(
            0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)

forward(): orchestrate the multi-headed attention mechanism workflow

Concatenate and project head outputs with self.output_linear()
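
A quick smoke test of the class above (a minimal sketch; the sizes are arbitrary):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 16, 512)     # batch of 2 sequences, 16 token embeddings each
out = mha(x, x, x)              # self-attention: query, key, and value are all x
print(out.shape)                # torch.Size([2, 16, 512])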


Let's practice!

Building an encoder transformer

From original to encoder-only transformer
Transformer body: encoder stack with N encoder layers
  Encoder layer:
    Multi-headed self-attention
    Feed-forward (sub)layers
    Layer normalizations, skip connections, dropouts

Transformer head: process encoded inputs (hidden states) to produce output prediction
  Supervised task: classification, regression


Feed-forward sublayer in encoder layer
2 x fully connected layers + ReLU activation
d_ff: dimension between the linear layers
forward(): processes attention outputs to capture complex, non-linear patterns

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))


Encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        # Multi-headed self-attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Feed-forward sublayer
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        # Layer normalizations and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Forward pass through the encoder layer;
        # mask prevents processing padding tokens
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x


Masking the attention process
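
The masking figure itself is not reproduced here; as a minimal sketch (assuming padding token ID 0, an arbitrary choice), a padding mask can be built from the token IDs and applied to the attention scores like this:

import torch

seq = torch.tensor([[5, 27, 3, 0, 0]])        # one tokenized sequence, padded with 0s (hypothetical padding ID)
padding_mask = (seq != 0).unsqueeze(1)        # shape (1, 1, seq_len): padded positions are False
scores = torch.randn(1, 5, 5)                 # dummy attention scores, for illustration only
scores = scores.masked_fill(padding_mask == 0, float("-1e9"))
# after softmax, the padded key positions receive (near-)zero attention weight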


Transformer body: encoder
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers,
                 num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        # Input embeddings based on vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding
        self.positional_encoding = PositionalEncoding(d_model, max_sequence_length)
        # Stack of num_layers encoder layers, using nn.ModuleList()
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout)
             for _ in range(num_layers)]
        )

    def forward(self, x, mask):
        # Forward pass of x through the transformer body
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x


Transformer head
Classification head
  Tasks: text classification, sentiment analysis, NER, extractive QA, etc.
  fc: fully connected linear layer transforming encoder hidden states into num_classes class probabilities

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x)
        return F.log_softmax(logits, dim=-1)

Regression head
  Tasks: estimate text readability, language complexity, etc.
  output_dim is 1 when predicting a single numerical value

class RegressionHead(nn.Module):
    def __init__(self, d_model, output_dim):
        super(RegressionHead, self).__init__()
        self.fc = nn.Linear(d_model, output_dim)

    def forward(self, x):
        return self.fc(x)
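
Putting the body and a head together, a minimal sketch (assuming the positional encoding module from earlier; hyperparameter values are arbitrary, and pooling on the first token is just one common choice):

encoder = TransformerEncoder(vocab_size=10000, d_model=512, num_layers=6,
                             num_heads=8, d_ff=2048, dropout=0.1,
                             max_sequence_length=256)
classifier = ClassifierHead(d_model=512, num_classes=3)

input_ids = torch.randint(1, 10000, (2, 32))     # batch of 2 sequences, 32 tokens each
mask = torch.ones(1, 32, 32)                     # no padding in this toy batch
hidden_states = encoder(input_ids, mask)         # (2, 32, 512)
log_probs = classifier(hidden_states[:, 0, :])   # (2, 3) class log-probabilities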


Let's practice!

Building a decoder transformer

From original to decoder-only transformer
Autoregressive sequence generation: text generation and completion

Masked multi-head self-attention
  Upper triangular mask
  Hide future positions in the sequence
  Only attend to tokens before the current one

Decoder-only transformer head
  Linear + Softmax over the vocabulary
  Predict the most likely next token


Masked self-attention
Key to autoregressive or causal behavior
Triangular (causal) attention mask
A token only pays attention to "past" (left-hand side) information in the sequence
  "favorite" attends to: "orange", "is", "my", "favorite"
Enforced causal attention: predict the next word to generate, e.g. "fruit"

The mask is built and passed to the model as self_attention_mask; the multi-headed attention mechanism stays the same, only the mask is different:

self_attention_mask = (1 - torch.triu(
    torch.ones(1, sequence_length, sequence_length),
    diagonal=1)).bool()
(...)
output = decoder(input_sequence, self_attention_mask)
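
For instance, with sequence_length = 4 the mask above evaluates to a lower-triangular Boolean matrix (a quick check, nothing more):

import torch

sequence_length = 4
self_attention_mask = (1 - torch.triu(
    torch.ones(1, sequence_length, sequence_length), diagonal=1)).bool()
print(self_attention_mask[0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)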


Transformer body (decoder) and head
class DecoderOnlyTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads,
                 d_ff, dropout, max_sequence_length):
        super(DecoderOnlyTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_sequence_length)
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout)
             for _ in range(num_layers)])
        # self.fc: output linear layer with vocab_size neurons
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x, self_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, self_mask)
        # Apply self.fc and a softmax activation in the forward pass
        x = self.fc(x)
        return F.log_softmax(x, dim=-1)
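
A sketch of how this body would be used for next-token prediction (assuming a DecoderLayer with masked self-attention is defined; the sizes and token IDs are arbitrary):

model = DecoderOnlyTransformer(vocab_size=10000, d_model=512, num_layers=6,
                               num_heads=8, d_ff=2048, dropout=0.1,
                               max_sequence_length=256)
input_sequence = torch.randint(1, 10000, (1, 8))     # 8 tokens generated so far
seq_len = input_sequence.size(1)
causal_mask = (1 - torch.triu(
    torch.ones(1, seq_len, seq_len), diagonal=1)).bool()
log_probs = model(input_sequence, causal_mask)       # (1, 8, vocab_size)
next_token = log_probs[:, -1, :].argmax(dim=-1)      # most likely next token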


Let's practice!

Building an encoder-decoder transformer

Transformer architecture: encoder recap



Cross-attention mechanism
Cross-attention: two inputs
  1. Information processed throughout the decoder (x: the decoder information flow, which becomes the cross-attention query)
  2. Final hidden states from the encoder block (y: the encoder output, which becomes the cross-attention keys and values)
Look back at the processed input sequence to find out the next output token to generate.

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        ...

    def forward(self, x, y, causal_mask, cross_mask):
        self_attn_output = self.self_attn(x, x, x, causal_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        cross_attn_output = self.cross_attn(x, y, y, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        ...
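
The elided parts (...) are not shown on the slide; a plausible completion, mirroring EncoderLayer above (the feed-forward sublayer and the third normalization are assumptions), would look like this:

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)   # assumed, mirroring EncoderLayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, y, causal_mask, cross_mask):
        # Masked self-attention on the decoder stream
        self_attn_output = self.self_attn(x, x, x, causal_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        # Cross-attention: query from the decoder, keys/values from the encoder output y
        cross_attn_output = self.cross_attn(x, y, y, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        # Feed-forward sublayer (assumed)
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x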


Encoder meets decoder


Transformer head

Output probabilities over the target vocabulary for the next token, e.g.:
  viajar (to travel): 0.96
  jugar (to play): 0.03
  dormir (to sleep): 0.01


Everything brought together!
# Building blocks defined in the previous videos:
class PositionalEncoding(nn.Module):
    ...
class MultiHeadAttention(nn.Module):
    ...
class FeedForwardSubLayer(nn.Module):
    ...
class EncoderLayer(nn.Module):
    ...
class DecoderLayer(nn.Module):
    ...
class TransformerEncoder(nn.Module):
    ...
class TransformerDecoder(nn.Module):
    ...
class ClassifierHead(nn.Module):
    ...

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads,
                 num_layers, d_ff, max_seq_len, dropout):
        super(Transformer, self).__init__()
        self.encoder = TransformerEncoder(vocab_size, d_model, num_layers,
                                          num_heads, d_ff, dropout, max_seq_len)
        self.decoder = TransformerDecoder(vocab_size, d_model, num_layers,
                                          num_heads, d_ff, dropout, max_seq_len)

    def forward(self, src, src_mask, causal_mask):
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(src, encoder_output,
                                      causal_mask, src_mask)
        return decoder_output
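
A final smoke test of the assembled model (a sketch only: masks, sizes, and token IDs are arbitrary, and the encoder and decoder pieces listed above must be fully defined):

model = Transformer(vocab_size=10000, d_model=512, num_heads=8,
                    num_layers=6, d_ff=2048, max_seq_len=128, dropout=0.1)
src = torch.randint(1, 10000, (2, 16))                             # batch of 2 source sequences
src_mask = torch.ones(1, 16, 16)                                   # no padding in this toy batch
causal_mask = (1 - torch.triu(torch.ones(1, 16, 16), diagonal=1)).bool()
output = model(src, src_mask, causal_mask)                         # per-position distribution over the vocabulary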


Let's practice!