K. HarshaVardhan
22 WU0104101
AIML-B
NLP ASSIGNMENT
1 . Computational Cost of Training the Skip-gram Model
and Negative Sampling
Core Function:
The Skip-gram model, a key part of the word2vec framework, generates word
embeddings.
It predicts surrounding words based on a given target word.
Developed by Tomas Mikolov et al., it aims to capture semantic relationships
between words.
Computational Challenges:
Training is extremely resource-intensive due to large vocabulary sizes.
Probability calculations, especially with the softmax function, add to the complexity.
Large NLP datasets exacerbate these computational demands.
Operational Mechanism:
The model inputs a target word and outputs predicted context words.
It calculates the probability of each context word appearing nearby.
The softmax function, which considers every vocabulary word, is used for probability
calculation.
Illustrative Example:
In a vocabulary of 1 million words, each training step must compute a score for all 1 million words and sum over them to normalize the softmax.
This extensive calculation significantly slows down the training process and demands
considerable computational power.
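To make this concrete, the short NumPy sketch below (an illustration only, with a deliberately reduced vocabulary size so it runs quickly) computes the full-softmax context distribution for a single target word. The dot products and the normalizing sum both touch every vocabulary word, which is exactly the cost that explodes at real vocabulary sizes.

```python
import numpy as np

# Illustrative sizes only; real vocabularies can reach millions of words.
V, d = 50_000, 100                                # vocabulary size, embedding dim
W_in = np.random.rand(V, d).astype(np.float32)    # input (target) embeddings
W_out = np.random.rand(V, d).astype(np.float32)   # output (context) embeddings

def softmax_context_probs(target_idx):
    """P(context word | target word) over the ENTIRE vocabulary: O(V * d) work."""
    scores = W_out @ W_in[target_idx]     # V dot products -- the expensive part
    scores -= scores.max()                # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalization sums over all V words

probs = softmax_context_probs(target_idx=42)
print(probs.shape)   # (50000,) -- one probability per vocabulary word
```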
Factors Contributing to High Computational Cost:
Large Vocabulary Size:
o NLP datasets commonly include millions of unique words.
o Calculating the softmax function across such extensive vocabularies
demands substantial memory and processing capabilities.
Frequent Probability Calculations:
o The Skip-gram model optimizes the probability of correct word-context
pairings.
o This optimization requires iterating through all possible words, resulting in
significant computational overhead.
Gradient Computation Complexity:
o Word embeddings are updated via backpropagation, necessitating
adjustments to each word's vector.
o With a large vocabulary, updating all embeddings in every step slows down
the training process.
Storage and Memory Issues:
o Storing and updating vectors for millions of words requires considerable
memory resources.
o The large size of the embedding matrix itself creates training difficulties on
hardware with limited resources.
Addressing Computational Burden:
Negative Sampling, introduced by Mikolov et al., aims to lessen the computational
load of Skip-gram training.
It reduces the number of words processed during each training step.
How Negative Sampling Functions:
Instead of calculating probabilities for the entire vocabulary, it uses a small set of
sampled words for model updates.
Positive Sample Selection:
o A valid word-context pair from the actual text is designated as a positive
example.
Negative Sample Selection:
o Rather than updating embeddings for all words, the model chooses a few
random words as negative examples (words unlikely to be context words).
o Typically, 5-20 negative samples are selected for each positive word-context
pair.
Simplified Probability Calculation:
o The model updates embeddings only for the target word, the correct context
word, and the selected negative words.
o This significantly decreases the required number of calculations.
Mathematical Formulation:
o Negative Sampling replaces the softmax function with a binary classification objective, written out below.
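For a target word w_I and an observed context word w_O, the per-pair objective introduced by Mikolov et al. is:

```latex
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
      \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

Here σ is the sigmoid function, v and v′ are the input and output embedding vectors, and the k negative words w_i are drawn from a noise distribution P_n(w), in practice the unigram distribution raised to the power 3/4.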
Advantages of Negative Sampling:
Faster Training Time:
o Instead of processing millions of words, it updates only a small sample.
o This dramatically reduces the number of calculations per step.
Lower Memory Consumption:
o The embedding matrix requires fewer updates, lessening memory demands.
Scalability:
o Negative Sampling enables efficient handling of large datasets.
o It makes training practical on standard hardware.
Improved Performance on Rare Words:
o Rare words receive updates through relevant negative examples, leading to
better representations.
Overall Impact:
Negative Sampling offers an efficient alternative by focusing on sampled words,
significantly reducing training time and resource needs.
This optimization allows word2vec to learn high-quality word representations
efficiently, contributing to its widespread use in NLP.
It does have drawbacks compared with more sophisticated sampling schemes that can yield higher accuracy.
In general, however, Negative Sampling greatly increases the speed of the training process.
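As a practical illustration, here is a minimal training sketch assuming the gensim library (4.x API) and a toy corpus rather than any specific dataset; sg=1 selects Skip-gram, negative=5 sets the number of negative samples per positive pair, and hs=0 disables hierarchical softmax so that Negative Sampling is used.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimension
    window=2,         # context window size
    sg=1,             # 1 = Skip-gram (0 would be CBOW)
    negative=5,       # number of negative samples per positive pair
    hs=0,             # disable hierarchical softmax; use Negative Sampling
    min_count=1,
    epochs=50,
)

print(model.wv["cat"][:5])           # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat"))  # nearest neighbours in the toy space
```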
2 . Comparison of Skip-gram and FastText Word
Embeddings
Introduction to Word Embeddings:
Word embeddings in NLP convert words into vectors, capturing semantic
relationships.
Key methods include the Skip-gram model (word2vec) and the FastText model.
Both generate meaningful representations but differ in approach, efficiency, and
linguistic structure handling.
Overview of Skip-gram Model:
Skip-gram encodes each word as a unique vector.
It predicts context words from a target word by scanning a large text corpus.
A neural network learns word relationships based on co-occurrence within a defined
window.
Strengths:
o Generates good-quality embeddings with small training sets.
o Captures semantic and syntactic relationships.
Limitations:
o Treats words independently, ignoring internal structure.
o Challenges with rare or out-of-vocabulary words.
o Limited to the information it was trained on.
Overview of FastText Model:
FastText, developed by Facebook AI, extends Skip-gram with subword embeddings.
It splits words into character n-grams (subword sequences).
Example: "delaying" becomes "delaying", "de", "lay", "ing", etc.
Benefits:
o Encodes morphological variations.
o Enhances word representations, especially for languages with complex word
forms (prefixes, suffixes, inflections).
o Useful for languages that combine words, like German.
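A small sketch of this decomposition in plain Python (the "<" and ">" boundary markers and the 3-6 character n-gram range follow FastText's conventions; restricted to 3-grams here for readability):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the FastText-style character n-gram subwords of a word."""
    wrapped = f"<{word}>"              # mark the word boundaries
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

subwords = char_ngrams("delaying", min_n=3, max_n=3)
print(sorted(subwords))
# ['<de', 'ayi', 'del', 'ela', 'ing', 'lay', 'ng>', 'yin']
```

In FastText, a word's final vector is the sum of the vectors of these subwords (plus a whole-word vector when the word is in the vocabulary), which is what lets related forms such as "run", "running", and "runner" share information.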
Key Differences Between Skip-gram and FastText:
Word Representation:
o Skip-gram: Assigns a single vector to each word, treating "run" and "running"
as distinct.
o FastText: Decomposes words into subwords, recognizing similarities between
"run," "running," and "runner."
o This subword approach is beneficial for morphologically rich languages like
German and Arabic.
o Example: the German compound "Untergrundbahnhöfen" ("underground railway stations") breaks down into the parts "unter" (under), "grund" (ground), "bahn" (rail), and "höfen" (yards), which subword n-grams can partially capture.
Handling Rare/Unseen Words:
o Skip-gram: Struggles with infrequent words; cannot generate embeddings for
out-of-vocabulary words.
o FastText: Infers meaning from subword components, approximating unseen
word meanings.
o FastText is therefore more effective with rare and unique words (see the sketch after this list).
Computational Efficiency:
o Skip-gram: More computationally efficient due to simpler vector processing.
o FastText: Slower, because each word must be processed through multiple subword n-grams.
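A minimal sketch of the out-of-vocabulary behaviour, again assuming gensim (4.x) and a toy corpus: FastText can return a vector for a word it never saw by combining the vectors of its character n-grams, while a Word2Vec Skip-gram model raises a KeyError for the same word.

```python
from gensim.models import Word2Vec, FastText

sentences = [
    ["running", "is", "healthy"],
    ["she", "was", "jogging", "daily"],
    ["he", "runs", "every", "morning"],
]

w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, epochs=50)
ft = FastText(sentences, vector_size=50, sg=1, min_count=1, epochs=50,
              min_n=3, max_n=6)   # subword n-gram range

print(ft.wv["runner"][:5])   # unseen word: built from its character n-grams
try:
    w2v.wv["runner"]
except KeyError:
    print("Word2Vec has no vector for the unseen word 'runner'")
```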
Performance and Use Cases:
Skip-gram:
o Performs well with large, frequent-word datasets.
o Suitable for tasks prioritizing speed and treating words as discrete units.
o Used in search engines, machine translation, and document classification.
FastText:
o Useful with noisy or incomplete datasets (user-generated content, tweets,
reviews).
o Preferred for multilingual NLP and languages with complex word formations.
o Good for colloquial and social media language.
o If a project uses formal, well-structured text, Skip-gram is sufficient; if it must handle misspellings or other linguistic variation, FastText is preferred.
Conclusion:
Both models have pros and cons, depending on the NLP task.
Skip-gram:
o More computationally efficient, suitable for large-scale operations and speed.
o Limited by its inability to handle unseen or rare words.
FastText:
o Slightly slower and more memory-intensive.
o Better generalization through subword information.
o Adept at handling morphologically rich languages and unseen words.
o Preferred when linguistic variance is a key consideration.
3 . Visualizing Word Embeddings with t-SNE – Skip-gram
vs. FastText
• Word Embeddings and Visualization:
o Word embeddings represent word relationships as dense numerical vectors.
o These vectors exist in high-dimensional spaces (often hundreds of dimensions), making direct interpretation challenging.
o t-SNE (t-distributed Stochastic Neighbor Embedding) is a common visualization technique.
▪ It reduces high-dimensional data to two or three dimensions.
▪ It preserves local structures, allowing for visual analysis of word relationships.
The following steps demonstrate this by training both models and visualizing their embeddings with t-SNE:
Step 1: Scraping Data
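A minimal sketch of this step and the rest of the pipeline, assuming requests and BeautifulSoup for scraping a Wikipedia article, gensim for training Skip-gram and FastText, and scikit-learn's TSNE with matplotlib for the 2-D plots (the actual data source, URL, and parameters used may differ):

```python
import re
import requests
from bs4 import BeautifulSoup
from gensim.models import Word2Vec, FastText
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# --- Step 1: scrape raw text (hypothetical source URL) ---
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
html = requests.get(url, timeout=10).text
paragraphs = BeautifulSoup(html, "html.parser").find_all("p")
text = " ".join(p.get_text() for p in paragraphs).lower()

# --- Tokenize into sentences of words ---
sentences = [re.findall(r"[a-z]+", s) for s in text.split(".")]
sentences = [s for s in sentences if len(s) > 3]

# --- Train Skip-gram (Word2Vec) and FastText on the same corpus ---
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1, negative=5, min_count=5)
ft = FastText(sentences, vector_size=100, window=5, sg=1, negative=5, min_count=5)

# --- Project the most frequent words to 2-D with t-SNE and plot ---
def plot_tsne(model, title, top_n=200):
    words = model.wv.index_to_key[:top_n]
    vectors = model.wv[words]
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y), fontsize=7)
    plt.title(title)
    plt.show()

plot_tsne(w2v, "Skip-gram embeddings (t-SNE)")
plot_tsne(ft, "FastText embeddings (t-SNE)")
```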
Output: