Lecture 8. Transformers¶

Maybe attention is all you need

Joaquin Vanschoren

Overview¶

  • Basics: word embeddings
    • Word2Vec, FastText, GloVe
  • Sequence-to-sequence and autoregressive models
  • Self-attention and transformer models
  • Vision Transformers

Bag of words representation¶

  • First, build a vocabulary of all occurring words, mapping every word to an index.
  • Represent each document as an $N$-dimensional vector (using the top-$N$ most frequent words)
    • One-hot (sparse) encoding: 1 if the word occurs in the document
  • Destroys the order of the words in the text (hence, a 'bag' of words)

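As a small illustration (not part of the lecture code), scikit-learn's CountVectorizer can build such vectors; binary=True gives the one-hot-style 'occurs or not' encoding:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["awesome movie with a creative plot",
        "terrible movie terrible plot"]

# binary=True: 1 if the word occurs in the document, 0 otherwise
vectorizer = CountVectorizer(binary=True, max_features=10)
X = vectorizer.fit_transform(docs)           # sparse document-term matrix
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # one row per document; word order is lost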

Text preprocessing pipelines¶

  • Tokenization: how do you split text into words / tokens?
  • Stemming: naive reduction to word stems. E.g. 'the meeting' to 'the meet'
  • Lemmatization: NLP-based reduction, e.g. distinguishes between nouns and verbs
  • Discard stop words ('the', 'an',...)
  • Only use the $N$ (e.g. 10000) most frequent words, or hash words into a fixed number of buckets (feature hashing)
  • n-grams: Use combinations of $n$ adjacent words in addition to individual words
    • e.g. 2-grams: "awesome movie", "movie with", "with creative", ...
  • Character n-grams: combinations of $n$ adjacent letters: 'awe', 'wes', 'eso',...
  • Subword tokenizers: split words gracefully into frequent subwords, e.g. "unbelievability" -> un, believ, abil, ity
  • Useful libraries: nltk, spaCy, gensim, HuggingFace tokenizers,... (see the sketch below)
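
A rough sketch of a few of these steps with nltk (not the lecture's own pipeline; the example sentence is made up):

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)                                  # stop word list

tokens = "An awesome movie with a creative plot".lower().split()        # naive tokenization
stems = [PorterStemmer().stem(t) for t in tokens]                       # stemming
filtered = [t for t in tokens if t not in stopwords.words("english")]   # drop stop words
bigrams = list(nltk.bigrams(tokens))                                    # 2-grams of adjacent words
char_3grams = [w[i:i+3] for w in tokens for i in range(len(w) - 2)]     # character 3-grams

print(stems, filtered, bigrams, char_3grams, sep="\n")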

Neural networks on bag of words¶

  • We can build neural networks on bag-of-word vectors
    • Do a one-hot encoding with the 10000 most frequent words
    • Simple model with 2 hidden dense layers (ReLU activation, dropout) and a single output unit
self.model = nn.Sequential(
    nn.Linear(10000, 16),   # input: 10000-dim bag-of-words vector
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 1)        # single output logit: positive vs. negative
)
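
To make the input and output conventions explicit, here is a minimal sketch (with a random dummy batch, using the same architecture outside of any class) of how such a model is used:

import torch
import torch.nn as nn

model = nn.Sequential(                          # same architecture as above
    nn.Linear(10000, 16), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(16, 16), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(16, 1),
)

# dummy batch: 4 documents encoded as 0/1 bag-of-words vectors over the 10000-word vocabulary
x = (torch.rand(4, 10000) < 0.01).float()
y = torch.tensor([[1.], [0.], [1.], [0.]])      # 1 = positive review, 0 = negative

logits = model(x)                               # shape (4, 1), one logit per document
loss = nn.BCEWithLogitsLoss()(logits, y)        # sigmoid + binary cross-entropy
loss.backward()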

Evaluation¶

  • IMDB dataset of movie reviews (label is 'positive' or 'negative')
  • Take a validation set of 10,000 samples from the training set
  • Works pretty well (88% accuracy), but overfits easily
`Trainer.fit` stopped: `max_epochs=15` reached.

Predictions¶

Let's look at a few predictions. Why is the last one so negative?

Review 0:
 [START] please give this one a miss br br [UNK] [UNK] and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite [UNK] so all you madison fans give this a miss
Predicted positiveness: 0.15110373

Review 16:
 [START] from 1996 first i watched this movie i feel never reach the end of my satisfaction i feel that i want to watch more and more until now my god i don't believe it was ten years ago and i can believe that i almost remember every word of the dialogues i love this movie and i love this novel absolutely perfection i love willem [UNK] he has a strange voice to spell the words black night and i always say it for many times never being bored i love the music of it's so much made me come into another world deep in my heart anyone can feel what i feel and anyone could make the movie like this i don't believe so thanks thanks
Predicted positiveness: 0.99687344

Review X:
 [START] the restaurant is not too terrible
Predicted positiveness: 0.8728
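
A toy sketch (with a made-up six-word vocabulary, not the actual IMDB encoding) of why such reviews are hard for this model: a bag-of-words vector is identical for any reordering of the same words, so a negation like 'not too terrible' cannot be distinguished from reorderings with a different meaning:

import torch

vocab = {"the": 0, "restaurant": 1, "is": 2, "not": 3, "too": 4, "terrible": 5}   # toy vocabulary

def bag_of_words(text, vocab_size=6):
    """0/1 vector: 1 for every vocabulary word that occurs in the text."""
    x = torch.zeros(vocab_size)
    for word in text.lower().split():
        if word in vocab:
            x[vocab[word]] = 1.0
    return x

a = bag_of_words("the restaurant is not too terrible")
b = bag_of_words("the restaurant is too terrible not")   # same words, different meaning
print(torch.equal(a, b))                                  # True: the model sees identical inputs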

Word Embeddings¶

  • A word embedding is a numeric vector representation of a word
    • Can be constructed manually or learned from an existing representation (e.g. one-hot)

Learning embeddings from scratch¶

  • Input layer uses fixed-length documents (with 0-padding).
  • Add an embedding layer to learn the embedding
    • Create an $n$-dimensional one-hot encoding ($n$ = vocabulary size).
    • To learn an $m$-dimensional embedding, use $m$ hidden nodes with weight matrix $W \in \mathbb{R}^{n \times m}$
    • Linear activation function: $\mathbf{X}_{embed} = \mathbf{X}_{orig} W$, where each row of $\mathbf{X}_{orig}$ is a one-hot word vector (see the check below).
  • Combine all word embeddings into a document embedding (e.g. global pooling).
  • Add layers to map word embeddings to the output. Learn embedding weights from data.
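
A minimal check (not from the lecture) that an embedding layer implements exactly this linear map on one-hot inputs:

import torch
import torch.nn as nn
import torch.nn.functional as F

n, m = 10000, 20                        # vocabulary size, embedding dimension
embedding = nn.Embedding(n, m)          # weight matrix W of shape (n, m)

idx = torch.tensor([42, 7])                        # two word indices
one_hot = F.one_hot(idx, num_classes=n).float()    # X_orig, shape (2, n)

# embedding lookup == one-hot input times the weight matrix
assert torch.allclose(embedding(idx), one_hot @ embedding.weight)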

Let's try this:

max_length = 100 # pad documents to a maximum number of words
vocab_size = 10000 # vocabulary size
embedding_length = 20 # embedding length (more would be better)

self.model = nn.Sequential(
    # EmbeddingBag = embedding lookup + global average pooling over the sequence:
    # each document of token indices becomes one averaged embedding vector
    nn.EmbeddingBag(vocab_size, embedding_length, mode='mean'),
    nn.Linear(embedding_length, 1),
)
  • Training on the IMDB dataset: slightly worse than using bag-of-words?
    • Embedding of dim 20 is very small, should be closer to 100 (or 300)
    • We don't have enough data to learn a really good embedding from scratch
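
One common remedy is to initialize the embedding layer from pretrained word vectors (e.g. Word2Vec, FastText or GloVe). A minimal sketch with a random stand-in matrix, since loading the actual vectors is beyond this snippet:

import torch
import torch.nn as nn

# stand-in for a pretrained embedding matrix, e.g. 300-dimensional GloVe
# vectors for our 10000-word vocabulary (in practice, loaded from disk)
pretrained_vectors = torch.randn(10000, 300)

embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)   # keep the vectors fixed
print(embedding(torch.tensor([42])).shape)                                  # torch.Size([1, 300])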