Module 2 Cont... Text Classification
In this chapter, the focus is on the critical process of feature extraction in machine learning,
emphasizing its importance in determining the quality of results—summarized by the phrase
"garbage in, garbage out."
The chapter explores how to effectively perform feature engineering for text data, a process
known as text representation in NLP.
It explains the conversion of raw text into numerical form, which is essential for feeding data
into NLP and ML algorithms.
The chapter also reviews various methods for representing text as numeric vectors, providing
insights into transforming text for effective machine learning applications.
With respect to the larger picture for any NLP problem, the scope of this chapter is depicted
by the dotted box in Figure 3-1.
Feature representation is a common step in any ML project, whether the data is text, images,
videos, or speech. However, feature representation for text is often much more involved as
compared to other formats of data.
For example, images are stored as matrices of pixels, where each cell in the matrix
corresponds to a pixel's intensity, providing an accurate numerical representation of the
image.
Videos, being sequences of images, are represented as a collection of such matrices, with
each frame corresponding to a matrix.
The text notes that while these formats have straightforward numerical representations, text
data requires more involved methods for effective feature representation.
The real value stored at cell[i,j] represents the intensity of the corresponding pixel in the
image, as shown in Figure 3-2. This matrix representation accurately represents the complete
image. Video is similar: a video is just a collection of frames where each frame is an image.
Hence, any video can be represented as a sequential collection of matrices, one per frame, in
the same order.
Now consider speech—it’s transmitted as a wave. To represent it mathematically, we sample the
wave and record its amplitude (height), as shown in Figure 3-3.
This gives us a numerical array representing the amplitude of a sound wave at fixed time intervals, as
shown in Figure 3-4.
This section highlights the challenge of mathematically representing text compared to other
data formats like images, video, and speech, which are more straightforward.
The complexity of text representation has led to extensive research over the past decades.
The chapter focuses on exploring various methods, starting from simple approaches to
advanced, state-of-the-art techniques, for effectively representing text in a numerical form
suitable for machine learning.
These approaches fall into four broad categories, which the rest of the chapter works through in turn.
This section introduces the process of exploring different categories of text representation
algorithms.
Before delving into specific methods, it presents a scenario involving sentiment analysis,
where the goal is to build a model that accurately predicts the sentiment of a sentence.
To achieve this, the model must understand the sentence's meaning, which relies on
identifying the most crucial data points within the text.
To correctly extract the meaning of a sentence, we need to:
1. Break the sentence into lexical units such as lexemes, words, and phrases
2. Derive the meaning for each of the lexical units
3. Understand the syntactic (grammatical) structure of the sentence
4. Understand the context in which the sentence appears
This section emphasizes that the meaning (semantics) of a sentence comes from the
combination of key data points within the text.
A good text representation scheme must effectively extract these data points to accurately
reflect the linguistic properties of the text.
Without this capability, a text representation scheme would not be very useful for tasks like
sentiment analysis.
An aside on comparing vector representations: a common measure is cosine similarity, cos(A, B) = (sum_i Ai*Bi) / (sqrt(sum_i Ai^2) * sqrt(sum_i Bi^2)), where Ai and Bi are the ith components of vectors A and B, respectively. Sometimes, people also use the Euclidean distance between vectors to measure (dis)similarity.
This section describes a simple text representation approach where, after lowercasing the text
and ignoring punctuation, the vocabulary of the corpus consists of six words: [dog, bites, man,
eats, meat, food].
The vocabulary can be organized in any order, and each document in the corpus is
represented as a vector of size six, corresponding to the words in the vocabulary.
One-Hot Encoding
One-Hot Encoding is a method of representing characters or words by a vector where only one
element is set to one and all others are zero, based on their position in the vocabulary.
In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID wid that is
between 1 and |V|, where V is the set of the corpus vocabulary.
Each word is then represented by a |V|-dimensional binary vector of 0s and 1s: the vector is filled with 0s except at the index corresponding to the word's ID (index = wid), where we put a 1.
The representation for individual words is then combined to form a sentence representation.
Let's understand this via our toy corpus. We first map each of the six words to unique IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6. Let's consider the document D1: "dog bites man". As per the scheme, each word is a six-dimensional vector. Dog is represented as [1 0 0 0 0 0], as the word "dog" is mapped to ID 1. Bites is represented as [0 1 0 0 0 0], and so on and so forth. Thus, D1 is represented as [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]. D4 ("man eats food") is represented as [[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]]. Other documents in the corpus can be represented similarly.
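To make the scheme concrete, here is a minimal Python sketch (our own illustration, not code from the text) that builds the vocabulary from the toy corpus and one-hot encodes a document. Note that the IDs follow first-appearance order, so they may differ slightly from the IDs listed above.

# Minimal one-hot encoding sketch for the toy corpus
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Assign each word a unique ID (1-based, in order of first appearance)
vocab = {}
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's ID position
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1
    return vec

def encode(doc):
    # A document is a list of one-hot vectors, one per word
    return [one_hot(w) for w in doc.split()]

print(encode("dog bites man"))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]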
Advantages
On the positive side, one-hot encoding is intuitive to understand and straightforward to implement.
However, it suffers from a few shortcomings:
Disadvantages
The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies. This results in a sparse representation where most of the entries in the vectors are zeroes, making it computationally inefficient to store, compute with, and learn from (sparsity also makes models prone to overfitting).
This representation does not give a fixed-length representation for text, i.e., if a text has 10
words, you get a longer representation for it as compared to a text with 5 words. For most
learning algorithms, we need the feature vectors to be of the same length.
It treats words as atomic units and has no notion of (dis)similarity between words. For example, consider three words: run, ran, and apple. Run and ran have similar meanings, unlike run and apple. But if we take their respective one-hot vectors and compute the Euclidean distance between them, all pairs are equally far apart (a distance of √2). Thus, semantically, these vectors are very poor at capturing the meaning of a word in relation to other words.
Say we train a model using our toy corpus. At runtime, we get a sentence: "man eats fruits". The training data didn't include "fruits", and there's no way to represent it in our model. This is known as the out-of-vocabulary (OOV) problem. A one-hot encoding scheme cannot handle this. The only way is to retrain the model: start by expanding the vocabulary, give an ID to the new word, and so on.
Bag of Words
Bag of Words (BoW) is a classic technique used in NLP for text classification.
It represents text as a collection of words without considering their order or context.
The main idea is that a specific set of words characterizes each class in a dataset.
If two texts share similar words, they likely belong to the same class.
By analyzing the words in a text, BoW helps identify its corresponding class.
Similar to one-hot encoding, Bag of Words (BoW) assigns unique integer IDs to words in a
vocabulary.
Each document is then represented as a vector, where each component corresponds to the count
of a specific word's occurrences in the document.
The vector's dimensionality equals the size of the vocabulary, with each word's frequency
determining its score in the vector.
In the example corpus, words are assigned unique IDs (e.g., "dog" = 1, "bites" = 2). Document
D1 is represented as [1 1 1 0 0 0], indicating that the first three words appear once, while the
others do not. Similarly, D4 is represented as [0 0 1 0 1 1].
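In practice, BoW vectors are rarely built by hand. A common choice (our assumption; the text does not prescribe a library) is scikit-learn's CountVectorizer, sketched below. Its columns are ordered alphabetically, so the rows are a permutation of the vectors shown above.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# CountVectorizer builds the vocabulary and produces word-count (BoW) vectors
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(bow.toarray())                       # one row per document, one column per word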
The size of the vector increases with the size of the vocabulary. Thus, sparsity continues to be
a problem. One way to control it is by limiting the vocabulary to n number of the most
frequent words.
It does not capture the similarity between different words that mean the same thing. Say we have three documents: "I run", "I ran", and "I ate". The BoW vectors of all three documents will be equally apart.
This representation does not have any way to handle out of vocabulary words (i.e., new words
that were not seen in the corpus that was used to build the vectorizer).
As the name indicates, it is a "bag" of words: word order information is lost in this representation. Both D1 and D2 will have the same representation in this scheme.
However, despite these shortcomings, due to its simplicity and ease of implementation, BoW is a
commonly used text representation scheme, especially for text classification among other
NLP problems.
Bag of N-Grams
Previous text representation schemes treat words as independent units, ignoring phrases or
word order.
The bag-of-n-grams (BoN) approach addresses this by breaking text into chunks of n
contiguous words, called n-grams.
This method captures some context that earlier approaches missed.
The vocabulary consists of all unique n-grams in the corpus, and each document is
represented by a vector showing the frequency of n-grams within the document, with zeroes
for those not present.
In a 2-gram (bigram) model, text is represented by pairs of consecutive words.
For the example corpus, the set of all bigrams includes pairs like "dog bites" and "man bites."
Each document is represented by an eight-dimensional vector, indicating the frequency of
these bigrams.
For instance, D1 is represented as [1,1,0,0,0,0,0,0], and D2 as [0,0,1,1,0,0,0,0]. The Bag of
Words (BoW) model is a special case of this approach with n=1. As n increases, the context
captured grows, but so does the sparsity of the representation. This technique is also known as
"n-gram feature selection" in NLP.
TF-IDF
In the three approaches discussed so far, all words in a text are treated as equally important. TF-IDF (term frequency–inverse document frequency) addresses this by quantifying the importance of a word relative to other words in both the document and the entire corpus.
This technique is commonly used in information retrieval systems to identify relevant
documents for a given query.
The intuition behind TF-IDF is that a word's importance in a document increases with its
frequency in that document but decreases if it frequently appears in other documents within
the corpus.
This importance is quantified using two measures: Term Frequency (TF) and Inverse
Document Frequency (IDF). These are combined to calculate the TF-IDF score, which
reflects the significance of a word in a specific document relative to the entire corpus.
TF (term frequency) measures how often a term or word occurs in a given document.
Since different documents in the corpus may be of different lengths, a term may occur more
often in a longer document as compared to a shorter document.
To normalize these counts, we divide the number of occurrences by the length of the document. The TF of a term t in a document d is defined as: TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d).
IDF (inverse document frequency) measures the importance of the term across a corpus.
In computing TF, all terms are given equal importance (weightage).
However, it’s a well-known fact that stop words like is, are, am, etc., are not important, even
though they occur frequently.
To account for such cases, IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. The IDF of a term t is calculated as follows: IDF(t) = log(total number of documents in the corpus / number of documents in which the term t appears).
The TF-IDF score is a product of these two terms. Thus, TF-IDF score = TF * IDF.
Let’s compute TF-IDF scores for our toy corpus. Some terms appear in only one document,
some appear in two, while others appear in three documents. The size of our corpus is N=4.
Hence, corresponding TF-IDF values for each term are shown in Table 3-2.
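As a worked example of one entry (assuming a base-2 logarithm; the text does not state which base is used): in D1, "dog bites man", TF(bites, D1) = 1/3. The term "bites" appears in 2 of the N = 4 documents, so IDF(bites) = log2(4/2) = 1, and TF-IDF(bites, D1) = (1/3) × 1 ≈ 0.33.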
The TF-IDF vector representation for a document is then simply the TF-IDF score for each term in that document. So, for D1, the vector contains the TF-IDF scores of dog, bites, and man in their respective positions and zeros for the three remaining terms; the values can be read off Table 3-2.
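In code, TF-IDF vectors can be obtained with scikit-learn's TfidfVectorizer (an illustrative choice on our part). Note that scikit-learn uses a smoothed IDF and L2-normalizes each row, so the exact numbers differ from a hand computation such as Table 3-2, even though the relative weighting is similar.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())  # vocabulary, in alphabetical order
print(X.toarray().round(2))           # one TF-IDF vector per document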
Distributed Representations
Before diving into the methods that use neural network architectures to create dense, low-
dimensional representations, it's essential to understand some key terms:
Distributional similarity
The meaning of a word can be understood based on the context in which it appears, a concept
known as connotation—where meaning is defined by context. This contrasts with denotation,
which refers to the literal meaning of a word. For example, in the phrase "NLP rocks," the
word "rocks" literally means "stones," but contextually, it signifies something positive or
fashionable.
Distributional hypothesis
The distributional hypothesis in linguistics suggests that words appearing in similar contexts
have similar meanings. For example, "dog" and "cat" often occur in similar contexts,
indicating a strong similarity in their meanings. In vector space models (VSM), this idea
extends to word representations: if two words frequently appear in similar contexts, their
corresponding vectors in the model should be close to each other.
Distributional representation
Distributional representation schemes are based on the distributional hypothesis, which posits
that words appearing in similar contexts have related meanings. These schemes use high-
dimensional vectors derived from a co-occurrence matrix, capturing the relationship between
words and their contexts. The matrix's dimensions correspond to the vocabulary size of the
corpus. Examples of such schemes include one-hot encoding, bag of words, bag of n-grams,
and TF-IDF, all of which represent words based on their contextual distribution.
Distributed representation
Distributed representation is also based on the distributional hypothesis but addresses the
limitations of high-dimensional, sparse vectors in traditional distributional representations. It
significantly reduces dimensionality, resulting in compact, dense vectors with few or no
zeros. This makes them computationally efficient and more effective for learning. The vector
space created through this process is known as distributed representation, and the upcoming
methods discussed in the chapter are examples of this approach.
Embedding
For the set of words in a corpus, an embedding is a mapping from the vector space of a distributional representation to the lower-dimensional vector space of a distributed representation.
Vector semantics
This refers to the set of NLP methods that aim to learn the word representations based on
distributional properties of words in a large corpus.
Word Embeddings
When we say that a text representation should capture "distributional similarities between
words," it means that words appearing in similar contexts should have similar representations.
For example, words like "USA" might be distributionally similar to other countries or cities within the
USA, while "beautiful" might be similar to synonyms or antonyms. In 2013, Mikolov et al.
demonstrated this concept with their Word2Vec model, which uses distributional similarity to capture
word analogy relationships, such as "King – Man + Woman ≈ Queen."
Their model was able to correctly answer many more analogies like this. Figure 3-5 shows a
snapshot of a system based on Word2vec answering analogies. The Word2vec model is in
many ways the dawn of modern-day NLP.
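As an illustration (assuming gensim and its downloadable pre-trained Google News Word2vec model, which is a large download), such analogies can be queried with most_similar:

import gensim.downloader as api

# Assumption: the pre-trained Google News vectors from gensim's downloader (~1.6 GB);
# any pre-trained embedding loaded as KeyedVectors would work the same way.
wv = api.load("word2vec-google-news-300")

# King - Man + Woman ~= Queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))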
Word2Vec learns semantically rich word representations that are low-dimensional (50–500
dimensions) and dense, with most values being non-zero. These compact and efficient
representations make machine learning tasks more tractable. Word2Vec's success spurred
extensive research on using neural networks to learn text representations, known as
"embeddings." Next, we’ll explore how these embeddings work and how they can be used to
represent text.
The goal of Word2Vec is to learn embeddings for each word in a text corpus that
accurately capture the word's meaning.
It does this by using distributional similarity and the distributional hypothesis, which
infer a word's meaning from its context—words that appear nearby in the text.
If two words frequently occur in similar contexts, their meanings are likely similar.
Word2Vec projects these meanings into a vector space where words with similar
meanings cluster together, while words with different meanings are positioned far
apart.
Word2Vec takes a large text corpus and learns to represent words in a common vector
space based on their contexts.
For each word w in the corpus, it starts with a randomly initialized vector vw .
The model refines vw by predicting it based on the vectors of the words in its context
C, using a two-layer neural network.
This process adjusts the vector to best capture the word's meaning. Before exploring
how to train our own embeddings, we'll first discuss pre-trained embeddings.
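For example, pre-trained embeddings can be loaded through gensim's downloader. The sketch below assumes the small 50-dimensional GloVe vectors simply because they are quick to fetch, and then looks up distributionally similar words.

import gensim.downloader as api

# Assumption: 50-dimensional GloVe vectors, chosen here only for their small size
wv = api.load("glove-wiki-gigaword-50")

print(wv["beautiful"][:5])                   # first few dimensions of the dense vector
print(wv.most_similar("beautiful", topn=5))  # nearest neighbors in the embedding space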
The original Word2Vec approach includes two architectural variants for training word embeddings:
1. Continuous Bag of Words (CBOW): This model predicts a target word based on its
surrounding context words. It uses the context to estimate the likelihood of a word
occurring in that context.
2. Skip-Gram: This model, in contrast, predicts the surrounding context words given a
target word. It uses the target word to estimate the likelihood of various context words
appearing around it.
Both architectures are designed to learn word embeddings by leveraging the distributional
hypothesis, but they focus on different aspects of the word-context relationship.
CBOW
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
What is a language model? It is a (statistical) model that tries to give a probability
distribution over sequences of words.
Given a sentence of, say, m words, it assigns a probability Pr(w1, w2, ..., wm) to the whole sentence.
The objective of a language model is to assign probabilities in such a way that it gives a high probability to "good" sentences and a low probability to "bad" sentences. By good, we mean sentences that are semantically and syntactically correct. By bad, we mean sentences that are incorrect, semantically or syntactically or both. So, for a sentence like "The cat jumped over the dog," it will try to assign a probability close to 1.0, whereas for a sentence like "jumped over the the cat dog," it tries to assign a probability close to 0.0.
CBOW tries to learn a language model that predicts the "center" word from the words in its context. Let's understand this using the example sentence "The quick brown fox jumps over the lazy dog." If we take the word "jumps" as the center word, then its context is formed by words in its vicinity. If we take a context size of 2, then for our example, the context is given by brown, fox, over, the. CBOW uses the context words to predict the target word, jumps, as shown in Figure 3-7.
CBOW tries to do this for every word in the corpus; i.e., it takes every word in the
corpus as the target word and tries to predict the target word from its corresponding
context words.
The idea discussed in the previous paragraph is then extended to the entire corpus to
build the training set.
Details are as follows: we run a sliding window of size 2k+1 over the text corpus. For
our example, we took k as 2.
Each position of the window marks the set of 2k+1 words that are under
consideration.
The center word in the window is the target, and the k words on either side of the center word form the context. This gives us one data point: if the point is represented as (X, Y), then the context is X and the target word is Y. Concretely, a single data point consists of the 2k indices of the context words paired with the index of the target word.
To get the next data point, we simply shift the window to the right on the corpus by
one word and repeat the process. This way, we slide the window across the entire
corpus to create the training set. This is shown in Figure 3-8. Here, the target word is
shown in blue and k=2.
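The sliding-window construction can be sketched in a few lines of Python. The sentence and variable names below are our own illustration, with k = 2 and only full windows kept for simplicity.

# Build (context, target) CBOW training pairs with a sliding window
tokens = "the quick brown fox jumps over the lazy dog".split()
k = 2  # context size on each side; window size is 2k + 1

pairs = []
for i in range(k, len(tokens) - k):
    context = tokens[i - k:i] + tokens[i + 1:i + k + 1]  # 2k context words (X)
    target = tokens[i]                                    # center word (Y)
    pairs.append((context, target))

print(pairs[0])
# (['the', 'quick', 'fox', 'jumps'], 'brown')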
Now that we have the training data ready, let's focus on the model. For this, we construct a shallow net (it's shallow since it has a single hidden layer), as shown in Figure 3-9. We assume we want to learn d-dimensional word embeddings. Further, let V be the vocabulary of the text corpus.
The process involves learning an embedding matrix E of size |V| × d, where |V| is the vocabulary size and d is the embedding dimension. The matrix is initialized randomly. In the input layer, the indices of the context words are used to fetch the corresponding rows from the matrix E. These vectors are summed to produce a single d-dimensional vector, which is then passed to the next layer. This vector is multiplied by another matrix E′ of size d × |V|, resulting in a vector of size 1 × |V|. A softmax function is applied to this vector to produce a probability distribution over the vocabulary.
The distribution is compared with the label, and backpropagation is used to update both
E and E′. After training, the matrix E represents the learned word embeddings.
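The following is a minimal sketch of such a network in PyTorch (a framework choice of ours; the text does not prescribe one). The softmax is folded into CrossEntropyLoss, and the dimensions are toy values.

import torch
import torch.nn as nn

V, d = 6, 50  # toy vocabulary size and embedding dimension

class CBOW(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.E = nn.Embedding(vocab_size, dim)                 # |V| x d embedding matrix
        self.E_prime = nn.Linear(dim, vocab_size, bias=False)  # d x |V| output matrix

    def forward(self, context_ids):
        # context_ids: (batch, 2k) indices of context words; sum their embeddings -> (batch, d)
        summed = self.E(context_ids).sum(dim=1)
        # Project to vocabulary scores; softmax is applied inside the loss
        return self.E_prime(summed)

model = CBOW(V, d)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# One toy training step: context word IDs predict the center word ID
context = torch.tensor([[0, 1, 3, 4]])  # (batch=1, 2k=4)
target = torch.tensor([2])              # index of the center word
optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()

# After training, model.E.weight holds the learned word embeddings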
SkipGram
SkipGram is a model similar to CBOW but with a key difference: it predicts context
words from a given center word.
For example, with a context size of 2, the model takes the center word "jumps" and predicts each of its surrounding words: "brown," "fox," "over," "the." The process is repeated for every word in the corpus, as shown in Figure 3-10.
To train SkipGram, a sliding window of size 2k+1 is used across the text corpus.
The center word in the window is X, and each of the 2k words around it is a Y. Each window position therefore generates 2k data points, each consisting of a pair: the index of the center word and the index of one context word.
The window is then shifted one word to the right, and the process repeats across the entire corpus to create the training set, as shown in Figure 3-11.
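A sketch of the SkipGram pair construction, mirroring the CBOW sketch earlier (again with an illustrative sentence and k = 2):

# Build (center, context) SkipGram training pairs with a sliding window
tokens = "the quick brown fox jumps over the lazy dog".split()
k = 2

pairs = []
for i in range(k, len(tokens) - k):
    center = tokens[i]
    # each of the 2k surrounding words yields one (X = center, Y = context) data point
    for j in range(i - k, i + k + 1):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:4])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]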
The SkipGram model's shallow network is similar to the CBOW network with slight
modifications.
In the input layer, the index of the target word is used to fetch the corresponding row from the embedding matrix E of size |V| × d. This d-dimensional vector is then passed to the next layer, where it is multiplied by another matrix E′ of size d × |V|, resulting in a 1 × |V| vector.
A softmax function is applied to this vector to produce a probability distribution over
the vocabulary.
This distribution is compared to the label, and backpropagation is used to update both E and E′. After training, the matrix E contains the learned word embeddings.
Both CBOW and SkipGram models (see Figure 3-12) involve intricate details that interested readers can explore through additional resources like Sebastian Ruder's blog and Rong's (2016) paper on Word2vec parameter learning.
A crucial aspect to consider in these models is the selection of hyperparameters, such
as window size, vector dimensionality, learning rate, and the number of epochs, as
they significantly impact the model's quality.
Though there are practical implementations like gensim that abstract the mathematical
complexities, users still need to carefully decide on the hyperparameters before
starting the training process to ensure optimal results.
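For instance, with gensim (parameter names as in gensim 4.x), the hyperparameters mentioned above map directly onto the Word2Vec constructor. The toy corpus below is far too small for meaningful embeddings and serves only to show the API.

from gensim.models import Word2Vec

sentences = [doc.split() for doc in
             ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]]

# Key hyperparameters: vector_size (dimensionality), window (context size),
# sg (0 = CBOW, 1 = SkipGram), epochs, and min_count (ignore rare words)
model = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1, epochs=10)

print(model.wv["dog"][:5])           # first few dimensions of the learned vector
print(model.wv.most_similar("dog"))  # nearest neighbors in the embedding space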
Beyond word-level embeddings, the Doc2vec model (an extension of Word2vec) learns a vector for an entire piece of text, such as a sentence or paragraph, alongside the word vectors. Once a Doc2vec model is trained, it can infer paragraph vectors for new texts using the word vectors learned during training.
Doc2vec was one of the first accessible methods for generating embedding
representations for entire texts, rather than just aggregating individual word vectors.
By modeling context and encoding texts of any length into fixed, low-dimensional
vectors, Doc2vec has been widely used in NLP applications such as text
classification, document tagging, text recommendation systems, and simple chatbots.
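A minimal Doc2vec sketch with gensim (our illustration; the tags and toy documents are made up) that trains on tagged documents and then infers a vector for an unseen text:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tagged = [TaggedDocument(words=d.split(), tags=[str(i)]) for i, d in enumerate(docs)]

# vector_size is the dimensionality of the learned paragraph vectors
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer a fixed-length vector for a new, unseen piece of text
new_vector = model.infer_vector("man eats fruits".split())
print(new_vector[:5])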