Overview¶
- Basics: word embeddings
- Word2Vec, FastText, GloVe
- Sequence-to-sequence and autoregressive models
- Self-attention and transformer models
- Vision Transformers
Bag-of-words representation¶
- First, build a vocabulary of all occurring words, mapping every word to an index.
- Represent each document as an $N$-dimensional vector (top-$N$ most frequent words)
- One-hot (sparse) encoding: 1 if the word occurs in the document (see the sketch below)
- Destroys the order of the words in the text (hence, a 'bag' of words)
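A minimal sketch of this encoding using scikit-learn's CountVectorizer (an illustrative choice, not necessarily the tooling used in this lecture):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["an awesome movie with creative visuals",
        "an awful movie, the worst movie of the year"]

# binary=True gives presence/absence (one-hot) instead of raw counts;
# max_features keeps only the N most frequent words
vectorizer = CountVectorizer(binary=True, max_features=10)
X = vectorizer.fit_transform(docs)         # sparse (n_docs, N) matrix
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word order within a document is lost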

Text preprocessing pipelines¶
- Tokenization: how do you split text into words / tokens?
- Stemming: naive reduction to word stems. E.g. 'the meeting' to 'the meet'
- Lemmatization: NLP-based reduction, e.g. distinguishes between nouns and verbs
- Discard stop words ('the', 'an',...)
- Only use $N$ (e.g. 10000) most frequent words, or a hash function
- n-grams: use combinations of $n$ adjacent words in addition to individual words (see the sketch after this list)
- e.g. 2-grams: "awesome movie", "movie with", "with creative", ...
- Character n-grams: combinations of $n$ adjacent letters: 'awe', 'wes', 'eso',...
- Subword tokenizers: split words gracefully into subword units, e.g. "unbelievability" -> un, believ, abil, ity
- Useful libraries: nltk, spaCy, gensim, HuggingFace tokenizers,...
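A small sketch of word and character n-grams in plain Python (libraries such as nltk or spaCy provide more robust tokenization):

text = "awesome movie with creative visuals"

words = text.split()   # naive whitespace tokenization
bigrams = [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
print(bigrams)         # ['awesome movie', 'movie with', 'with creative', 'creative visuals']

char_trigrams = [text[i:i+3] for i in range(len(text) - 2)]
print(char_trigrams[:4])  # ['awe', 'wes', 'eso', 'som']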
Neural networks on bag of words¶
- We can build neural networks on bag-of-word vectors
- One-hot encode each document over the 10,000 most frequent words (see the encoding sketch below the model)
- Simple model with 2 dense layers, ReLU activation, dropout
self.model = nn.Sequential(
    nn.Linear(10000, 16),  # 10000-dim bag-of-words vector -> 16 hidden units
    nn.ReLU(),
    nn.Dropout(0.5),       # dropout to reduce overfitting
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 1)       # single output (positive/negative score)
)
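To feed reviews into this model, each document first has to be turned into a 10,000-dimensional binary ('one-hot') vector. A minimal sketch, assuming the reviews are already given as lists of word indices (as in the common preprocessed IMDB format); the helper name multi_hot is illustrative:

import torch

def multi_hot(sequences, vocab_size=10000):
    # sequences: list of documents, each a list of word indices
    out = torch.zeros(len(sequences), vocab_size)
    for i, seq in enumerate(sequences):
        out[i, seq] = 1.0  # set a 1 for every word that occurs in the document
    return out

X = multi_hot([[1, 4, 9], [2, 4, 4, 7]])  # two toy 'documents'
print(X.shape)  # torch.Size([2, 10000])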
Evaluation¶
- IMDB dataset of movie reviews (label is 'positive' or 'negative')
- Take a validation set of 10,000 samples from the training set (see the split sketch below)
- Works pretty well (88% accuracy), but overfits easily
`Trainer.fit` stopped: `max_epochs=15` reached.
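A rough sketch of holding out that validation set, assuming the training data is available as a PyTorch Dataset named train_set (the actual data pipeline is not shown here):

import torch
from torch.utils.data import random_split

val_size = 10_000
train_subset, val_subset = random_split(
    train_set, [len(train_set) - val_size, val_size],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)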
Predictions¶
Let's look at a few predictions. Why is the last one so negative?
Review 0: [START] please give this one a miss br br [UNK] [UNK] and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite [UNK] so all you madison fans give this a miss
Predicted positiveness: 0.15110373

Review 16: [START] from 1996 first i watched this movie i feel never reach the end of my satisfaction i feel that i want to watch more and more until now my god i don't believe it was ten years ago and i can believe that i almost remember every word of the dialogues i love this movie and i love this novel absolutely perfection i love willem [UNK] he has a strange voice to spell the words black night and i always say it for many times never being bored i love the music of it's so much made me come into another world deep in my heart anyone can feel what i feel and anyone could make the movie like this i don't believe so thanks thanks
Predicted positiveness: 0.99687344

Review X: [START] the restaurant is not too terrible
Predicted positiveness: 0.8728
Word Embeddings¶
- A word embedding is a numeric vector representation of a word
- Can be manual or learned from an existing representation (e.g. one-hot)
Learning embeddings from scratch¶
- Input layer uses fixed length documents (with 0-padding).
- Add an embedding layer to learn the embedding
- Create $n$-dimensional one-hot encoding.
- To learn an $m$-dimensional embedding, use $m$ hidden nodes with weight matrix $W^{n \times m}$
- Linear activation function: $\mathbf{X}_{embed} = \mathbf{X}_{orig} W$ (see the check below)
- Combine all word embeddings into a document embedding (e.g. global pooling).
- Add layers to map word embeddings to the output. Learn embedding weights from data.
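A quick illustrative check (not from the lecture) that an embedding lookup is exactly this matrix product with a one-hot input:

import torch
import torch.nn as nn

n, m = 10000, 20                # vocabulary size, embedding dimension
emb = nn.Embedding(n, m)        # weight matrix W has shape (n, m)

word_index = 42
one_hot = torch.zeros(1, n)
one_hot[0, word_index] = 1.0

x_lookup = emb(torch.tensor([word_index]))  # direct lookup, shape (1, m)
x_matmul = one_hot @ emb.weight             # X_embed = X_orig W, shape (1, m)
print(torch.allclose(x_lookup, x_matmul))   # True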
Let's try this:
max_length = 100       # pad documents to a maximum number of words
vocab_size = 10000     # vocabulary size
embedding_length = 20  # embedding length (more would be better)
self.model = nn.Sequential(
    # nn.EmbeddingBag embeds every word and averages the embeddings over the
    # sequence (global average pooling); with 0-padded inputs, padding_idx=0
    # keeps the padding tokens out of the average
    nn.EmbeddingBag(vocab_size, embedding_length, mode='mean', padding_idx=0),
    nn.Linear(embedding_length, 1),
)
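For reference, a usage sketch (illustrative; assumes the nn.Sequential above is reachable as model): the network takes a batch of 0-padded word-index sequences and returns one score per review.

import torch
fake_batch = torch.randint(1, vocab_size, (32, max_length))  # 32 padded reviews
print(model(fake_batch).shape)  # torch.Size([32, 1])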
- Training on the IMDB dataset: slightly worse than using bag-of-words?
- Embedding of dim 20 is very small, should be closer to 100 (or 300)
- We don't have enough data to learn a really good embedding from scratch