Module 2 Cont... Text Classification

This chapter focuses on feature extraction in machine learning, particularly for text data, emphasizing the importance of effective text representation for NLP applications. It discusses various methods for converting raw text into numerical forms, including basic vectorization approaches, Bag of Words, and TF-IDF, while highlighting the challenges of representing text compared to other data formats. The chapter also introduces the Vector Space Model and its applications in measuring text similarity.


 Text Representation

 In this chapter, the focus is on the critical process of feature extraction in machine learning,
emphasizing its importance in determining the quality of results—summarized by the phrase
"garbage in, garbage out."
 The chapter explores how to effectively perform feature engineering for text data, a process
known as text representation in NLP.
 It explains the conversion of raw text into numerical form, which is essential for feeding data
into NLP and ML algorithms.
 The chapter also reviews various methods for representing text as numeric vectors, providing
insights into transforming text for effective machine learning applications.
 With respect to the larger picture for any NLP problem, the scope of this chapter is depicted
by the dotted box in Figure 3-1.

Figure 3-1. Scope of this chapter within the NLP pipeline

 Feature representation is a common step in any ML project, whether the data is text, images,
videos, or speech. However, feature representation for text is often much more involved as
compared to other formats of data.
 For example, images are stored as matrices of pixels, where each cell in the matrix
corresponds to a pixel's intensity, providing an accurate numerical representation of the
image.
 Videos, being sequences of images, are represented as a collection of such matrices, with
each frame corresponding to a matrix.
 The text notes that while these formats have straightforward numerical representations, text
data requires more involved methods for effective feature representation.
 The real value stored at cell[i,j] represents the intensity of the corresponding pixel in the
image, as shown in Figure 3-2. This matrix representation accurately represents the complete
image. Video is similar: a video is just a collection of frames where each frame is an image.
Hence, any video can be represented as a sequential collection of matrices, one per frame, in
the same order.
Now consider speech—it’s transmitted as a wave. To represent it mathematically, we sample the
wave and record its amplitude (height), as shown in Figure 3-3.

Figure 3-3. Sampling a speech wave

This gives us a numerical array representing the amplitude of a sound wave at fixed time intervals, as
shown in Figure 3-4.

Figure 3-4. Speech signal represented by a numerical vector

 This section highlights the challenge of mathematically representing text compared to other
data formats like images, video, and speech, which are more straightforward.
 The complexity of text representation has led to extensive research over the past decades.
 The chapter focuses on exploring various methods, starting from simple approaches to
advanced, state-of-the-art techniques, for effectively representing text in a numerical form
suitable for machine learning.
 These approaches are classified into four categories:

 Basic vectorization approaches


 Distributed representations
 Universal language representation
 Handcrafted features

 This section introduces the process of exploring different categories of text representation
algorithms.
 Before delving into specific methods, it presents a scenario involving sentiment analysis,
where the goal is to build a model that accurately predicts the sentiment of a sentence.
 To achieve this, the model must understand the sentence's meaning, which relies on
identifying the most crucial data points within the text.
In order to correctly extract the meaning of the sentence, the crucial steps are:

1. Break the sentence into lexical units such as lexemes, words, and phrases
2. Derive the meaning for each of the lexical units
3. Understand the syntactic (grammatical) structure of the sentence
4. Understand the context in which the sentence appears

 This section emphasizes that the meaning (semantics) of a sentence comes from the
combination of key data points within the text.
 A good text representation scheme must effectively extract these data points to accurately
reflect the linguistic properties of the text.
 Without this capability, a text representation scheme would not be very useful for tasks like
sentiment analysis.

 Vector Space Models


 This section explains that for machine learning algorithms to process text data, it must first
be converted into a mathematical form, typically represented as vectors in the Vector
Space Model (VSM).
 VSM is a fundamental algebraic model used to represent various text units, such as words
or documents, as vectors of numbers.
 It is widely used in information retrieval tasks like document scoring, classification, and
clustering.
 The most common method for measuring similarity between two text vectors in VSM is
cosine similarity, which calculates the cosine of the angle between the vectors, with values
ranging from 1 (identical) to -1 (completely opposite).

cos(A, B) = (A · B) / (|A| |B|) = Σi (Ai × Bi) / ( √(Σi Ai²) × √(Σi Bi²) )

where Ai and Bi are the ith components of vectors A and B, respectively. Sometimes, people also use Euclidean distance between vectors to capture similarity.
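As a minimal illustration (not from the text), cosine similarity between two vectors can be computed directly with NumPy; the toy vectors below are count vectors over a six-word vocabulary:

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine of the angle between vectors a and b."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Two toy document vectors (term counts over a six-word vocabulary).
    d1 = [1, 1, 1, 0, 0, 0]
    d4 = [0, 0, 1, 0, 1, 1]
    print(cosine_similarity(d1, d1))  # 1.0  -> identical direction
    print(cosine_similarity(d1, d4))  # ~0.33 -> partial overlap ('man' is shared)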

 Basic Vectorization Approaches


 This section introduces a basic approach to text representation by mapping each word in the
vocabulary (V) of a text corpus to a unique ID (integer value).
 Each sentence or document in the corpus is then represented as a V-dimensional vector, where
the vector's dimensions correspond to the words in the vocabulary.
 This method provides a straightforward way to convert text into a numerical format that can
be used for further processing in machine learning models.
 To understand this better, let’s take a toy corpus (shown in Table 3-1) with only four
documents—D1, D2, D3, D4 —as an example.

Table 3-1. Our toy corpus
D1: Dog bites man.
D2: Man bites dog.
D3: Dog eats meat.
D4: Man eats food.

 This section describes a simple text representation approach where, after lowercasing the text
and ignoring punctuation, the vocabulary of the corpus consists of six words: [dog, bites, man,
eats, meat, food].
 The vocabulary can be organized in any order, and each document in the corpus is
represented as a vector of size six, corresponding to the words in the vocabulary.
 One-Hot Encoding
One-hot encoding represents each character or word as a vector in which exactly one element is set to one and all the others are zero, based on the item's position in the vocabulary.

 In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID wid that is
between 1 and |V|, where V is the set of the corpus vocabulary.
Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.
This vector is all 0s except for a 1 at the position given by the word's ID (index = wid).
 The representation for individual words is then combined to form a sentence representation.
Let's understand this via our toy corpus. We first map each of the six words to unique IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6. Consider the document D1: "dog bites man". Under this scheme, each word is a six-dimensional vector. Dog is represented as [1 0 0 0 0 0], as the word "dog" is mapped to ID 1. Bites is represented as [0 1 0 0 0 0], and so on. Thus, D1 is represented as [ [1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0] ]. D4 ("man eats food") is represented as [ [0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0] ]. Other documents in the corpus can be represented similarly.
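A minimal sketch of this scheme in plain Python, using the toy corpus above (the helper names are illustrative, not from the text; note that the IDs follow first-occurrence order, so they may differ from the mapping used in the text):

    # Illustrative one-hot encoding for the toy corpus.
    corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

    # Build the vocabulary and assign each word a unique ID (1-based, as in the text).
    vocab = {}
    for doc in corpus:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1

    def one_hot_vector(word):
        """Return a |V|-dimensional vector with a 1 at the word's ID position."""
        vec = [0] * len(vocab)
        vec[vocab[word] - 1] = 1
        return vec

    def one_hot_document(doc):
        """Represent a document as a list of one-hot vectors, one per word."""
        return [one_hot_vector(w) for w in doc.lower().split()]

    print(one_hot_document("Dog bites man"))  # -> three 6-dimensional one-hot vectors (dog, bites, man)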

Advantages

 On the positive side, one-hot encoding is intuitive to understand and straightforward to implement.
However, it suffers from a few shortcomings:

Disadvantages

The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies. This results in a sparse representation where most of the entries in the vectors are zeroes, making it computationally inefficient to store, compute with, and learn from (sparsity leads to overfitting).

 This representation does not give a fixed-length representation for text, i.e., if a text has 10
words, you get a longer representation for it as compared to a text with 5 words. For most
learning algorithms, we need the feature vectors to be of the same length.

It treats words as atomic units and has no notion of (dis)similarity between words. For example, consider three words: run, ran, and apple. Run and ran have similar meanings, unlike run and apple. But if we take their respective one-hot vectors and compute the Euclidean distance between them, every pair is equally far apart (a distance of √2). Thus, semantically, one-hot vectors are very poor at capturing the meaning of a word in relation to other words.
Say we train a model using our toy corpus. At runtime, we get the sentence "man eats fruits." The training data didn't include "fruits," and there's no way to represent it in our model. This is known as the out-of-vocabulary (OOV) problem. A one-hot encoding scheme cannot handle this. The only way is to retrain the model: start by expanding the vocabulary, give an ID to the new word, and so on.
 Bag of Words
 Bag of Words (BoW) is a classic technique used in NLP for text classification.
 It represents text as a collection of words without considering their order or context.
 The main idea is that a specific set of words characterizes each class in a dataset.
 If two texts share similar words, they likely belong to the same class.
 By analyzing the words in a text, BoW helps identify its corresponding class.
 Similar to one-hot encoding, Bag of Words (BoW) assigns unique integer IDs to words in a
vocabulary.
 Each document is then represented as a vector, where each component corresponds to the count
of a specific word's occurrences in the document.
 The vector's dimensionality equals the size of the vocabulary, with each word's frequency
determining its score in the vector.
 In the example corpus, words are assigned unique IDs (e.g., "dog" = 1, "bites" = 2). Document
D1 is represented as [1 1 1 0 0 0], indicating that the first three words appear once, while the
others do not. Similarly, D4 is represented as [0 0 1 0 1 1].
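A quick way to reproduce these count vectors is scikit-learn's CountVectorizer; a small sketch (assuming a recent scikit-learn version) is shown below. Note that the column order follows the vectorizer's alphabetical vocabulary, not the ID order used in the text:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

    # Lowercasing and punctuation stripping are handled by the default tokenizer.
    count_vect = CountVectorizer()
    bow = count_vect.fit_transform(corpus)

    print(count_vect.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
    print(bow.toarray())                       # one count vector per document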

Advantages of this encoding:

 Like one-hot encoding, BoW is fairly simple to understand and implement.


With this representation, documents having the same words will have their vector representations closer to each other in Euclidean space as compared to documents with completely different words. The distance between D1 and D2 is 0, as compared to the distance between D1 and D4, which is 2. Thus, the vector space resulting from the BoW scheme captures the semantic similarity of documents: if two documents have similar vocabulary, they'll be closer to each other in the vector space, and vice versa. (A quick numeric check of these distances follows this list.)
 We have a fixed-length encoding for any sentence of arbitrary length.

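As a quick check of the distances quoted above (a sketch assuming NumPy and the BoW vectors of the toy corpus):

    import numpy as np

    # BoW vectors in the ID order used in the text: [dog, bites, man, meat, food, eats]
    d1 = np.array([1, 1, 1, 0, 0, 0])  # "Dog bites man."
    d2 = np.array([1, 1, 1, 0, 0, 0])  # "Man bites dog."
    d4 = np.array([0, 0, 1, 0, 1, 1])  # "Man eats food."

    print(np.linalg.norm(d1 - d2))  # 0.0 -> identical bags of words
    print(np.linalg.norm(d1 - d4))  # 2.0 -> four positions differ, sqrt(4) = 2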
Disadvantages of this encoding

The size of the vector increases with the size of the vocabulary. Thus, sparsity continues to be a problem. One way to control it is by limiting the vocabulary to the n most frequent words.
It does not capture the similarity between different words that mean the same thing. Say we have three documents: "I run", "I ran", and "I ate". BoW vectors of all three documents will be equally apart.
 This representation does not have any way to handle out of vocabulary words (i.e., new words
that were not seen in the corpus that was used to build the vectorizer).
As the name indicates, it is a "bag" of words: word order information is lost in this representation. Both D1 and D2 will have the same representation in this scheme.

However, despite these shortcomings, due to its simplicity and ease of implementation, BoW is a
commonly used text representation scheme, especially for text classification among other
NLP problems.

 Bag of N-Grams
 Previous text representation schemes treat words as independent units, ignoring phrases or
word order.
 The bag-of-n-grams (BoN) approach addresses this by breaking text into chunks of n
contiguous words, called n-grams.
 This method captures some context that earlier approaches missed.
 The vocabulary consists of all unique n-grams in the corpus, and each document is
represented by a vector showing the frequency of n-grams within the document, with zeroes
for those not present.
 In a 2-gram (bigram) model, text is represented by pairs of consecutive words.
 For the example corpus, the set of all bigrams includes pairs like "dog bites" and "man bites."
Each document is represented by an eight-dimensional vector, indicating the frequency of
these bigrams.
For instance, D1 is represented as [1,1,0,0,0,0,0,0], and D2 as [0,0,1,1,0,0,0,0]. The Bag of Words (BoW) model is a special case of this approach with n=1. As n increases, the context captured grows, but so does the sparsity of the representation. This technique is also known as "n-gram feature selection" in NLP. (A code sketch at the end of this section shows how to produce these bigram vectors.)

Here are the main pros and cons of BoN:

 It captures some context and word-order information in the form of n-grams.


Thus, the resulting vector space is able to capture some semantic similarity. Documents having the same n-grams will have their vectors closer to each other in Euclidean space as compared to documents with completely different n-grams.
As n increases, dimensionality (and therefore sparsity) increases rapidly.
 It still provides no way to address the OOV problem.
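The earlier CountVectorizer sketch extends to bigrams via the ngram_range parameter (again assuming a recent scikit-learn version):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

    # ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams and bigrams.
    bigram_vect = CountVectorizer(ngram_range=(2, 2))
    bon = bigram_vect.fit_transform(corpus)

    print(bigram_vect.get_feature_names_out())  # ['bites dog' 'bites man' 'dog bites' ...]
    print(bon.toarray())  # each document becomes an 8-dimensional bigram-count vector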

 TF-IDF
In the three approaches discussed so far, all words in a text are treated as equally important. TF-IDF (term frequency–inverse document frequency) addresses this by quantifying the importance of a word relative to the other words in the document and in the entire corpus.
 This technique is commonly used in information retrieval systems to identify relevant
documents for a given query.
 The intuition behind TF-IDF is that a word's importance in a document increases with its
frequency in that document but decreases if it frequently appears in other documents within
the corpus.
 This importance is quantified using two measures: Term Frequency (TF) and Inverse
Document Frequency (IDF). These are combined to calculate the TF-IDF score, which
reflects the significance of a word in a specific document relative to the entire corpus.
 TF (term frequency) measures how often a term or word occurs in a given document.
 Since different documents in the corpus may be of different lengths, a term may occur more
often in a longer document as compared to a shorter document.
To normalize these counts, we divide the number of occurrences by the length of the document. TF of a term t in a document d is defined as:

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)

 IDF (inverse document frequency) measures the importance of the term across a corpus.
 In computing TF, all terms are given equal importance (weightage).
 However, it’s a well-known fact that stop words like is, are, am, etc., are not important, even
though they occur frequently.
To account for such cases, IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. IDF of a term t is calculated as follows (the logarithm base is a convention; base 2, e, and 10 are all used in practice):

IDF(t) = log( (total number of documents in the corpus) / (number of documents containing term t) )

 The TF-IDF score is a product of these two terms. Thus, TF-IDF score = TF * IDF.
 Let’s compute TF-IDF scores for our toy corpus. Some terms appear in only one document,
some appear in two, while others appear in three documents. The size of our corpus is N=4.
Hence, corresponding TF-IDF values for each term are shown in Table 3-2.
 The TF-IDF vector representation for a document is then simply the TF-IDF score for each
term in that document. So, for D1 we get

term:    dog    bites  man    eats   meat   food
TF-IDF:  0.136  0.17   0.136  0      0      0
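A from-scratch sketch of these formulas on the toy corpus is given below. Note that the exact numbers depend on the logarithm base, and that libraries such as scikit-learn's TfidfVectorizer use a slightly smoothed variant of IDF, so values may not match the table above exactly:

    import math

    corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
    docs = [d.split() for d in corpus]
    N = len(docs)

    def tf(term, doc):
        # Term frequency: occurrences of the term divided by the document length.
        return doc.count(term) / len(doc)

    def idf(term):
        # Inverse document frequency: log of (corpus size / documents containing the term).
        df = sum(1 for doc in docs if term in doc)
        return math.log2(N / df)

    def tfidf_vector(doc, vocab):
        return [tf(term, doc) * idf(term) for term in vocab]

    vocab = ["dog", "bites", "man", "meat", "food", "eats"]  # ID order used in the text
    print([round(v, 3) for v in tfidf_vector(docs[0], vocab)])
    # e.g., [0.138, 0.333, 0.138, 0.0, 0.0, 0.0] with log base 2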

To summarize, here are the fundamental drawbacks of the basic vectorization approaches discussed so far:


 They’re discrete representations—i.e., they treat language units (words, n-grams, etc.) as
atomic units. This discreteness hampers their ability to capture relationships between words.
 The feature vectors are sparse and high-dimensional representations. The dimensionality
increases with the size of the vocabulary, with most values being zero for any vector. This
hampers learning capability. Further, high-dimensionality representation makes them
computationally inefficient.
 They cannot handle OOV words.

 Distributed Representations
 Before diving into the methods that use neural network architectures to create dense, low-
dimensional representations, it's essential to understand some key terms:

Distributional similarity

 The meaning of a word can be understood based on the context in which it appears, a concept
known as connotation—where meaning is defined by context. This contrasts with denotation,
which refers to the literal meaning of a word. For example, in the phrase "NLP rocks," the
word "rocks" literally means "stones," but contextually, it signifies something positive or
fashionable.
Distributional hypothesis

 The distributional hypothesis in linguistics suggests that words appearing in similar contexts
have similar meanings. For example, "dog" and "cat" often occur in similar contexts,
indicating a strong similarity in their meanings. In vector space models (VSM), this idea
extends to word representations: if two words frequently appear in similar contexts, their
corresponding vectors in the model should be close to each other.

Distributional representation

 Distributional representation schemes are based on the distributional hypothesis, which posits
that words appearing in similar contexts have related meanings. These schemes use high-
dimensional vectors derived from a co-occurrence matrix, capturing the relationship between
words and their contexts. The matrix's dimensions correspond to the vocabulary size of the
corpus. Examples of such schemes include one-hot encoding, bag of words, bag of n-grams,
and TF-IDF, all of which represent words based on their contextual distribution.

Distributed representation

 Distributed representation is also based on the distributional hypothesis but addresses the
limitations of high-dimensional, sparse vectors in traditional distributional representations. It
significantly reduces dimensionality, resulting in compact, dense vectors with few or no
zeros. This makes them computationally efficient and more effective for learning. The vector
space created through this process is known as distributed representation, and the upcoming
methods discussed in the chapter are examples of this approach.

Embedding

For the set of words in a corpus, an embedding is a mapping from the high-dimensional vector space of a distributional representation to the lower-dimensional vector space of a distributed representation.

Vector semantics

 This refers to the set of NLP methods that aim to learn the word representations based on
distributional properties of words in a large corpus.

 Word Embeddings
 When we say that a text representation should capture "distributional similarities between
words," it means that words appearing in similar contexts should have similar representations.

For example, words like "USA" might be distributionally similar to other countries or cities within the
USA, while "beautiful" might be similar to synonyms or antonyms. In 2013, Mikolov et al.
demonstrated this concept with their Word2Vec model, which uses distributional similarity to capture
word analogy relationships, such as "King – Man + Woman ≈ Queen."

 Their model was able to correctly answer many more analogies like this. Figure 3-5 shows a
snapshot of a system based on Word2vec answering analogies. The Word2vec model is in
many ways the dawn of modern-day NLP.
 Word2Vec learns semantically rich word representations that are low-dimensional (50–500
dimensions) and dense, with most values being non-zero. These compact and efficient
representations make machine learning tasks more tractable. Word2Vec's success spurred
extensive research on using neural networks to learn text representations, known as
"embeddings." Next, we’ll explore how these embeddings work and how they can be used to
represent text.
 The goal of Word2Vec is to learn embeddings for each word in a text corpus that
accurately capture the word's meaning.
 It does this by using distributional similarity and the distributional hypothesis, which
infer a word's meaning from its context—words that appear nearby in the text.
 If two words frequently occur in similar contexts, their meanings are likely similar.
Word2Vec projects these meanings into a vector space where words with similar
meanings cluster together, while words with different meanings are positioned far
apart.
 Word2Vec takes a large text corpus and learns to represent words in a common vector
space based on their contexts.
 For each word w in the corpus, it starts with a randomly initialized vector vw .
 The model refines vw by predicting it based on the vectors of the words in its context
C, using a two-layer neural network.
 This process adjusts the vector to best capture the word's meaning. Before exploring
how to train our own embeddings, we'll first discuss pre-trained embeddings.

Pre-trained word embeddings


 Pre-trained word embeddings are vectors for words that have already been trained on large
corpora, such as Wikipedia or news articles, by others.
 These embeddings are available for download and provide pre-computed word vectors,
which can be used directly without the need for training from scratch.
 They are essentially key-value pairs, with words as keys and their corresponding vectors as
values. Popular pre-trained embeddings include Word2Vec (Google), GloVe (Stanford), and
fastText (Facebook), and they come in various dimensions like 25, 50, 100, 200, 300, and
600.
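A hedged sketch of loading and querying pre-trained vectors with gensim's downloader (the model name below is from the gensim-data catalog; the download is roughly 1.6 GB, so verify availability and disk space first):

    import gensim.downloader as api

    # Download (once) and load the 300-dimensional Google News Word2Vec vectors.
    word_vectors = api.load("word2vec-google-news-300")

    print(word_vectors["king"].shape)                      # (300,)
    print(word_vectors.most_similar("beautiful", topn=3))  # nearest neighbors in the vector space
    # The classic analogy: king - man + woman ~= queen
    print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))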

Training our own embeddings

The original Word2Vec approach includes two architectural variants for training word embeddings:

1. Continuous Bag of Words (CBOW): This model predicts a target word based on its
surrounding context words. It uses the context to estimate the likelihood of a word
occurring in that context.
2. Skip-Gram: This model, in contrast, predicts the surrounding context words given a
target word. It uses the target word to estimate the likelihood of various context words
appearing around it.

Both architectures are designed to learn word embeddings by leveraging the distributional
hypothesis, but they focus on different aspects of the word-context relationship.
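A minimal gensim training sketch showing how the two variants are selected (parameter names follow gensim 4.x; the toy corpus is far too small for meaningful vectors and is used only to show the API):

    from gensim.models import Word2Vec

    # Tokenized toy corpus; real training needs a much larger corpus.
    sentences = [["dog", "bites", "man"], ["man", "bites", "dog"],
                 ["dog", "eats", "meat"], ["man", "eats", "food"]]

    # sg=0 selects CBOW, sg=1 selects Skip-Gram.
    cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=20)
    skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

    print(cbow_model.wv["dog"][:5])                     # first few dimensions of the learned vector
    print(skipgram_model.wv.most_similar("dog", topn=2))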
[Link]

In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which it appears.

 What is a language model? It is a (statistical) model that tries to give a probability
distribution over sequences of words.
Given a sentence of, say, m words, it assigns a probability Pr(w1, w2, …, wm) to the whole sentence.
The objective of a language model is to assign probabilities in such a way that it gives high probability to "good" sentences and low probabilities to "bad" sentences.
By good, we mean sentences that are semantically and syntactically correct. By bad, we mean sentences that are incorrect—semantically or syntactically or both. So, for a sentence like "The cat jumped over the dog," it will try to assign a probability close to 1.0, whereas for a sentence like "jumped over the the cat dog," it tries to assign a probability close to 0.0.
CBOW learns a language model that predicts the "center" word from the words in its context.
Let's understand this using the sentence "The quick brown fox jumps over the lazy dog." If we take the word "jumps" as the center word, then its context is formed by the words in its vicinity. With a context size of 2, the context is given by brown, fox, over, the. CBOW uses the context words to predict the target word, jumps, as shown in Figure 3-7.
 CBOW tries to do this for every word in the corpus; i.e., it takes every word in the
corpus as the target word and tries to predict the target word from its corresponding
context words.

 The idea discussed in the previous paragraph is then extended to the entire corpus to
build the training set.
 Details are as follows: we run a sliding window of size 2k+1 over the text corpus. For
our example, we took k as 2.
 Each position of the window marks the set of 2k+1 words that are under
consideration.
 The center word in the window is the target, and k words on either side of the center
word form the context. This gives us one data point. If the point is represented as (X, Y), then the context is X and the target word is Y. A single data point thus consists of a pair: (2k indices of the context words, index of the target word).
 To get the next data point, we simply shift the window to the right on the corpus by
one word and repeat the process. This way, we slide the window across the entire
corpus to create the training set. This is shown in Figure 3-8. Here, the target word is
shown in blue and k=2.
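A small sketch of this sliding-window construction of (context, target) pairs (assuming k=2 and a tokenized sentence; positions near the sentence boundaries are skipped for simplicity, whereas word2vec uses shorter contexts there):

    def cbow_training_pairs(tokens, k=2):
        """Slide a window of size 2k+1 over the tokens and emit (context, target) pairs."""
        pairs = []
        for i in range(k, len(tokens) - k):
            context = tokens[i - k:i] + tokens[i + 1:i + k + 1]  # 2k surrounding words
            target = tokens[i]                                   # center word
            pairs.append((context, target))
        return pairs

    sentence = "the quick brown fox jumps over the lazy dog".split()
    for context, target in cbow_training_pairs(sentence, k=2):
        print(context, "->", target)
    # e.g., ['the', 'quick', 'fox', 'jumps'] -> brown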
 Now that we have the training data ready, let’s focus on the model. For this,
we construct a shallow net (it’s shallow since it has a single hidden layer), as
shown in Figure 3-9. We assume we want to learn d-dimensional word embeddings. Further, let V be the vocabulary of the text corpus.
The process involves learning an embedding matrix E of size |V| × d, where |V| is the vocabulary size and d is the embedding dimension.

The matrix is initialized randomly. In the input layer, word indices are used to fetch the corresponding rows of E. These vectors are summed to produce a single d-dimensional vector, which is then passed to the next layer.

This vector is multiplied by another matrix E′ of size d × |V|, resulting in a vector of size 1 × |V|. A softmax function is applied to this vector to produce a probability distribution over the vocabulary.

 The distribution is compared with the label, and backpropagation is used to update both
E and E′. After training, the matrix E represents the learned word embeddings.
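A compact sketch of this two-layer architecture in PyTorch (an illustrative reimplementation, not the original word2vec code; the nn.Embedding weight plays the role of E and the output projection plays the role of E′):

    import torch
    import torch.nn as nn

    class CBOW(nn.Module):
        def __init__(self, vocab_size, embed_dim):
            super().__init__()
            self.embeddings = nn.Embedding(vocab_size, embed_dim)        # matrix E
            self.output = nn.Linear(embed_dim, vocab_size, bias=False)   # projection E': d -> |V|

        def forward(self, context_ids):
            # context_ids: (batch, 2k) indices of the context words
            summed = self.embeddings(context_ids).sum(dim=1)  # (batch, d)
            return self.output(summed)                        # (batch, |V|) scores before softmax

    model = CBOW(vocab_size=6, embed_dim=10)
    loss_fn = nn.CrossEntropyLoss()          # applies softmax internally and compares with the label
    context = torch.tensor([[0, 1, 3, 4]])   # indices of the 2k context words
    target = torch.tensor([2])               # index of the center word
    loss = loss_fn(model(context), target)
    loss.backward()  # computes gradients for both E and E' (an optimizer step would update them)
    # After training, model.embeddings.weight holds the learned word embeddings (matrix E).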

SkipGram

 SkipGram is a model similar to CBOW but with a key difference: it predicts context
words from a given center word.
For example, with a context size of 2, the model takes the center word "jumps" and predicts each of its surrounding words: "brown," "fox," "over," "the." The process is repeated for every word in the corpus. See Figure 3-10.

 To train SkipGram, a sliding window of size 2k+1 is used across the text corpus.
The center word in the window is X, and the words on either side of it are the Ys. This approach generates 2k data points, each consisting of a pair: the index of the center word and the index of one context word.
The window is then shifted one word to the right, and the process repeats across the entire corpus to create the training set. See Figure 3-11.

 The SkipGram model's shallow network is similar to the CBOW network with slight
modifications.
In the input layer, the index of the target word is used to fetch the corresponding row from the embedding matrix E of size |V| × d.
This d-dimensional vector is then passed to the next layer, where it is multiplied by another matrix E′ of size d × |V|, resulting in a 1 × |V| vector.
 A softmax function is applied to this vector to produce a probability distribution over
the vocabulary.
This distribution is compared to the label, and backpropagation is used to update both E and E′. After training, the matrix E contains the learned word embeddings.
Both CBOW and SkipGram models (see Figure 3-12) involve intricate details that interested readers can explore through additional resources like Sebastian Ruder's blog and Rong's (2016) paper on Word2vec parameter learning.
 A crucial aspect to consider in these models is the selection of hyperparameters, such
as window size, vector dimensionality, learning rate, and the number of epochs, as
they significantly impact the model's quality.
 Though there are practical implementations like gensim that abstract the mathematical
complexities, users still need to carefully decide on the hyperparameters before
starting the training process to ensure optimal results.

Let's look at two such hyperparameters.

 Dimensionality of the word vectors


As the name indicates, this decides the space of the learned embeddings. While there
is no ideal number, it’s common to construct word vectors with dimensions in the
range of 50–500 and evaluate them on the task we’re using them for to choose the
best option.
Context window
How long or short a context we look at when learning the vector representation.
 Distributed Representations Beyond Words and
Characters (Doc2Vec)

 So far, we've explored two methods for creating text representations:


 Word2vec, which generates word embeddings that can be aggregated to form text
representations, and
 fastText, which creates word representations from character n-grams that are then
aggregated into text representations.
 However, both approaches fail to consider word context, leading to identical
representations for phrases with different meanings, like "dog bites man" and "man
bites dog."
 Doc2vec addresses this issue by directly learning representations for texts of arbitrary
lengths (phrases, sentences, paragraphs, documents) while accounting for word
context.
 It builds on the paragraph vectors framework and shares a similar architecture with
Word2vec, but it adds a "paragraph vector" that represents the full text in context.
 When trained on a large corpus, Doc2vec learns unique paragraph vectors for each
text, while word vectors are shared across all texts.
 The shallow neural networks used in Doc2vec are similar to the CBOW and
SkipGram models in Word2vec.
The two architectures are called distributed memory (DM) and distributed bag of words (DBOW).
They are shown in Figure 3-13.

 Once the Doc2vec model is trained, it can infer paragraph vectors for new texts using
the word vectors learned during training.
 Doc2vec was one of the first accessible methods for generating embedding
representations for entire texts, rather than just aggregating individual word vectors.
By modeling context and encoding texts of any length into fixed, low-dimensional vectors, Doc2vec has been widely used in NLP applications such as text classification, document tagging, text recommendation systems, and simple chatbots.
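A brief gensim Doc2Vec sketch (gensim 4.x API; dm=1 gives the distributed memory variant, dm=0 the DBOW variant; the toy corpus is only to illustrate the calls, not to learn useful vectors):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
    tagged = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(corpus)]

    # dm=1 -> distributed memory (DM); dm=0 -> distributed bag of words (DBOW).
    model = Doc2Vec(tagged, vector_size=20, window=2, min_count=1, dm=1, epochs=40)

    # Infer a fixed-length vector for a new, unseen piece of text.
    print(model.infer_vector("man eats fruits".split()))
    print(model.dv.most_similar("3", topn=2))  # documents most similar to D4 ("man eats food")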
