Lecture 8. Transformers¶

Maybe attention is all you need

Joaquin Vanschoren

Overview¶

  • Basics: word embeddings
    • Word2Vec, FastText, GloVe
  • Sequence-to-sequence and autoregressive models
  • Self-attention and transformer models
  • Vision Transformers

Bag of words representation¶

  • First, build a vocabulary of all occurring words, mapping every word to an index.
  • Represent each document as an $N$-dimensional vector (using the top-$N$ most frequent words)
    • One-hot (sparse) encoding: 1 if the word occurs in the document
  • Destroys the order of the words in the text (hence, a 'bag' of words)

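As a small illustration (not part of the lecture code), scikit-learn's CountVectorizer can build such vectors; binary=True gives the one-hot-style 'occurs or not' encoding:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["awesome movie with a creative plot",
        "terrible movie terrible plot"]

# binary=True: 1 if the word occurs in the document, 0 otherwise
vectorizer = CountVectorizer(binary=True, max_features=10)
X = vectorizer.fit_transform(docs)           # sparse document-term matrix
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # one row per document; word order is lost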

Text preprocessing pipelines¶

  • Tokenization: how do you split text into words / tokens?
  • Stemming: naive reduction to word stems. E.g. 'the meeting' to 'the meet'
  • Lemmatization: NLP-based reduction, e.g. distinguishes between nouns and verbs
  • Discard stop words ('the', 'an',...)
  • Only use the $N$ (e.g. 10000) most frequent words, or hash words into a fixed number of buckets (feature hashing)
  • n-grams: Use combinations of $n$ adjacent words in addition to individual words
    • e.g. 2-grams: "awesome movie", "movie with", "with creative", ...
  • Character n-grams: combinations of $n$ adjacent letters: 'awe', 'wes', 'eso',...
  • Subword tokenizers: split words gracefully into frequent subwords, e.g. "unbelievability" -> un, believ, abil, ity
  • Useful libraries: nltk, spaCy, gensim, HuggingFace tokenizers,... (see the sketch below)
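
A rough sketch of a few of these steps with nltk (not the lecture's own pipeline; the example sentence is made up):

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)                                  # stop word list

tokens = "An awesome movie with a creative plot".lower().split()        # naive tokenization
stems = [PorterStemmer().stem(t) for t in tokens]                       # stemming
filtered = [t for t in tokens if t not in stopwords.words("english")]   # drop stop words
bigrams = list(nltk.bigrams(tokens))                                    # 2-grams of adjacent words
char_3grams = [w[i:i+3] for w in tokens for i in range(len(w) - 2)]     # character 3-grams

print(stems, filtered, bigrams, char_3grams, sep="\n")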

Neural networks on bag of words¶

  • We can build neural networks on bag-of-word vectors
    • Do a one-hot encoding with the 10000 most frequent words
    • Simple model with 2 hidden dense layers (ReLU activation, dropout) and a single output unit
self.model = nn.Sequential(
    nn.Linear(10000, 16),   # input: 10000-dim bag-of-words vector
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 1)        # single output logit: positive vs. negative
)
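
To make the input and output conventions explicit, here is a minimal sketch (with a random dummy batch, using the same architecture outside of any class) of how such a model is used:

import torch
import torch.nn as nn

model = nn.Sequential(                          # same architecture as above
    nn.Linear(10000, 16), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(16, 16), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(16, 1),
)

# dummy batch: 4 documents encoded as 0/1 bag-of-words vectors over the 10000-word vocabulary
x = (torch.rand(4, 10000) < 0.01).float()
y = torch.tensor([[1.], [0.], [1.], [0.]])      # 1 = positive review, 0 = negative

logits = model(x)                               # shape (4, 1), one logit per document
loss = nn.BCEWithLogitsLoss()(logits, y)        # sigmoid + binary cross-entropy
loss.backward()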

Evaluation¶

  • IMDB dataset of movie reviews (label is 'positive' or 'negative')
  • Take a validation set of 10,000 samples from the training set
  • Works pretty well (88% accuracy), but overfits easily
`Trainer.fit` stopped: `max_epochs=15` reached.

Predictions¶

Let's look at a few predictions. Why is the last one so negative?

Review 0:
 [START] please give this one a miss br br [UNK] [UNK] and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite [UNK] so all you madison fans give this a miss
Predicted positiveness: 0.15110373

Review 16:
 [START] from 1996 first i watched this movie i feel never reach the end of my satisfaction i feel that i want to watch more and more until now my god i don't believe it was ten years ago and i can believe that i almost remember every word of the dialogues i love this movie and i love this novel absolutely perfection i love willem [UNK] he has a strange voice to spell the words black night and i always say it for many times never being bored i love the music of it's so much made me come into another world deep in my heart anyone can feel what i feel and anyone could make the movie like this i don't believe so thanks thanks
Predicted positiveness: 0.99687344

Review X:
 [START] the restaurant is not too terrible
Predicted positiveness: 0.8728
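
A toy sketch (with a made-up six-word vocabulary, not the actual IMDB encoding) of why such reviews are hard for this model: a bag-of-words vector is identical for any reordering of the same words, so a negation like 'not too terrible' cannot be distinguished from reorderings with a different meaning:

import torch

vocab = {"the": 0, "restaurant": 1, "is": 2, "not": 3, "too": 4, "terrible": 5}   # toy vocabulary

def bag_of_words(text, vocab_size=6):
    """0/1 vector: 1 for every vocabulary word that occurs in the text."""
    x = torch.zeros(vocab_size)
    for word in text.lower().split():
        if word in vocab:
            x[vocab[word]] = 1.0
    return x

a = bag_of_words("the restaurant is not too terrible")
b = bag_of_words("the restaurant is too terrible not")   # same words, different meaning
print(torch.equal(a, b))                                  # True: the model sees identical inputs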

Word Embeddings¶

  • A word embedding is a numeric vector representation of a word
    • Can be constructed manually or learned from an existing representation (e.g. one-hot)

Learning embeddings from scratch¶

  • Input layer uses fixed-length documents (with 0-padding).
  • Add an embedding layer to learn the embedding
    • Create an $n$-dimensional one-hot encoding ($n$ = vocabulary size).
    • To learn an $m$-dimensional embedding, use $m$ hidden nodes with weight matrix $W \in \mathbb{R}^{n \times m}$
    • Linear activation function: $\mathbf{X}_{embed} = \mathbf{X}_{orig} W$, where each row of $\mathbf{X}_{orig}$ is a one-hot word vector (see the check below).
  • Combine all word embeddings into a document embedding (e.g. global pooling).
  • Add layers to map word embeddings to the output. Learn embedding weights from data.
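
A minimal check (not from the lecture) that an embedding layer implements exactly this linear map on one-hot inputs:

import torch
import torch.nn as nn
import torch.nn.functional as F

n, m = 10000, 20                        # vocabulary size, embedding dimension
embedding = nn.Embedding(n, m)          # weight matrix W of shape (n, m)

idx = torch.tensor([42, 7])                        # two word indices
one_hot = F.one_hot(idx, num_classes=n).float()    # X_orig, shape (2, n)

# embedding lookup == one-hot input times the weight matrix
assert torch.allclose(embedding(idx), one_hot @ embedding.weight)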

Let's try this:

max_length = 100 # pad documents to a maximum number of words
vocab_size = 10000 # vocabulary size
embedding_length = 20 # embedding length (more would be better)

self.model = nn.Sequential(
    # EmbeddingBag = embedding lookup + global average pooling over the sequence:
    # each document of token indices becomes one averaged embedding vector
    nn.EmbeddingBag(vocab_size, embedding_length, mode='mean'),
    nn.Linear(embedding_length, 1),
)
  • Training on the IMDB dataset: slightly worse than using bag-of-words?
    • Embedding of dim 20 is very small, should be closer to 100 (or 300)
    • We don't have enough data to learn a really good embedding from scratch
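
One common remedy is to initialize the embedding layer from pretrained word vectors (e.g. Word2Vec, FastText or GloVe). A minimal sketch with a random stand-in matrix, since loading the actual vectors is beyond this snippet:

import torch
import torch.nn as nn

# stand-in for a pretrained embedding matrix, e.g. 300-dimensional GloVe
# vectors for our 10000-word vocabulary (in practice, loaded from disk)
pretrained_vectors = torch.randn(10000, 300)

embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)   # keep the vectors fixed
print(embedding(torch.tensor([42])).shape)                                  # torch.Size([1, 300])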