Text Vectorization
Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical
representation
• Some popular methods to accomplish text vectorization:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
o Word2Vec
o etc
Binary Term Frequency
• Captures presence (1) or absence (0) of term in document
• token_pattern = '(?u)\\b\\w\\w+\\b' (the default):
The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is
completely ignored and always treated as a token separator).
• lowercase = True
• stop_words = 'english'
• max_df (default 1.0):
When building the vocabulary ignore terms that have a document frequency strictly higher
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.
• min_df (default 1):
When building the vocabulary ignore terms that have a document frequency strictly lower
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.
• max_features (default None):
If not None, build a vocabulary that only considers the top max_features terms ordered by
term frequency across the corpus.
• ngram_range (default (1,1)):
The lower and upper boundary of the range of n-values for different n-grams to be
extracted. All values of n such that min_n <= n <= max_n will be used.
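A minimal sketch of binary term frequency using scikit-learn's CountVectorizer with the parameters listed above (the toy corpus here is made up for illustration; the notebook may use different data and settings):

# Binary term frequency with scikit-learn's CountVectorizer
# binary=True records presence/absence only
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

vectorizer = CountVectorizer(
    binary=True,                       # 1 if term present in document, else 0
    token_pattern=r'(?u)\b\w\w+\b',    # default: tokens of 2+ alphanumeric characters
    lowercase=True,
    stop_words='english',
    ngram_range=(1, 1),
)
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                     # rows = documents, columns = vocabulary terms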
Bag of Words (BoW) Term Frequency
• Captures frequency of term in document
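The same CountVectorizer with the default binary=False produces raw counts instead of 0/1 indicators (a sketch on a made-up corpus):

# Bag of Words counts: CountVectorizer with binary=False (the default)
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat and the cat ran"]
bow = CountVectorizer(stop_words='english')
X = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X.toarray())   # each entry is the term's count in that document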
(L1) Normalized Term Frequency
• Captures normalized BoW term frequency in document
• TF typically L1-normalized
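One way to get L1-normalized term frequencies in scikit-learn is TfidfVectorizer with IDF weighting turned off (a sketch; dividing the raw BoW counts by their row sums gives the same result):

# L1-normalized term frequency: counts divided by the total count in the document
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
tf_l1 = TfidfVectorizer(use_idf=False, norm='l1', stop_words='english')
X = tf_l1.fit_transform(corpus)
print(X.toarray())   # each row sums to 1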
(L2) Normalized TFIDF
• Captures normalized TFIDF of term in document
• TFIDF typically L2-normalized
• Number of documents in corpus: N
• Number of documents in corpus with term t: Nt
• Term Frequency of term t in document d: TF(t, d)
o Bag of Words (BoW) Term Frequency
o The more frequent a term is, the higher the TF
o With sublinear TF scaling, TF is replaced by log(TF) + 1
• Inverse Document Frequency of term t in corpus: IDF(t) = log[N/Nt] + 1
o Measures how common a term is among all documents.
o The more common a term is, the lower its IDF.
o With smoothing: IDF(t) = log[(1+N)/(1+ Nt)] + 1
• TFIDF of term t in document d = Term Frequency * Inverse Document Frequency: TFIDF(t, d) = TF(t, d) * IDF(t)
o If a term appears frequently in a document, it's important - give the term a high score.
o If a term appears in many documents, it's not a unique identifier - give the term a low score.
• The TFIDF scores are then often L2-normalized (L1 normalization could also be considered)
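A sketch of L2-normalized TFIDF with scikit-learn's TfidfVectorizer, whose defaults match the formulas above (toy corpus made up for illustration):

# L2-normalized TFIDF with scikit-learn's TfidfVectorizer
# Defaults: norm='l2', smooth_idf=True  -> IDF(t) = log[(1+N)/(1+Nt)] + 1
#           sublinear_tf=False          -> set True to use log(TF) + 1
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
tfidf = TfidfVectorizer(norm='l2', smooth_idf=True, sublinear_tf=False)
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X.toarray())   # each row has unit L2 norm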
Word2Vec
• Captures embedded representation of terms
References:
Distributed Representations of Words and Phrases and their Compositionality
Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of words:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
n-grams try to capture some level of contextual information, but only do so to a
limited degree.
• Word2Vec provides a distributed or embedded representation of words
• Start with a one-hot encoded (OHE) representation of all words in the corpus
• Train an NN (with 1 hidden layer) on a very large corpus of data. The rows of the
resulting hidden-layer weight matrix are then used as the word vectors.
• One of two methods is typically used for training the NN:
o Continuous Bag of Words (CBOW): Predict vector representation of center/target word -
based on window of context words.
o Skip-Gram (SG): Predict vector representation of window of context words - based on
center/target word.
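A minimal training sketch using gensim (the library is not mentioned in the notes, so treat the choice and the gensim 4.x parameter names as assumptions):

# Training Word2Vec with gensim (assumed library; gensim 4.x API)
from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [
    ["you", "shall", "know", "a", "word", "by", "the", "company", "it", "keeps"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embedded vectors
    window=2,          # context words considered on each side of the target
    min_count=1,       # ignore words with total frequency below this
    sg=0,              # 0 = CBOW, 1 = Skip-Gram
)
vec = model.wv["word"]   # 100-dimensional vector for "word"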
[Context-window illustration: the center/target word w_t is surrounded by the context words w_(t-2), w_(t-1), w_(t+1), w_(t+2), as in "You shall know a word by the company it keeps."*]
*Quote by J. R. Firth
Several factors influence the quality of the word vectors including:
• Amount and quality of the training data.
If you don't have enough data, you may be able to use pre-trained vectors created by others (for
instance, Google has shared a model trained on ~100 billion words from its News dataset; the
model contains 300-dimensional vectors for 3 million words and phrases). If you do end up using
pre-trained vectors, make sure the training data domain is similar to the data you're working with.
• Size of the embedded vectors
In general, quality increases with higher dimensionality, but marginal gains typically diminish after
a threshold. Typically, the dimensionality of the vectors is set to be between 100 and 1000.
• Training algorithm
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data, and does a good job representing rare words or
phrases.
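If you go the pre-trained route, gensim's downloader can fetch the Google News vectors mentioned above (an assumed workflow; the model name is the one registered in gensim-data, and the download is large):

# Loading pre-trained Google News vectors via gensim's downloader (assumed workflow)
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")    # KeyedVectors: 3M words/phrases, 300 dims
print(wv["computer"].shape)                  # (300,)
print(wv.most_similar("computer", topn=3))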
Once we have the embedded vectors for each word, we can use them for downstream NLP
tasks, for instance:
• Compute similarity using cosine similarity between word vectors
• Create higher-order representations (sentence/document) using a weighted
average of the word vectors and feed them to a classification task
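A sketch of both uses with plain NumPy, assuming word_vectors is a dict-like mapping from word to vector (for example model.wv from the training sketch above); the function names are hypothetical:

# Using the embedded vectors: cosine similarity and a simple document vector
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def document_vector(tokens, word_vectors, weights=None):
    # weighted average of the word vectors (uniform weights by default);
    # the result can be fed to a downstream classifier
    kept = [t for t in tokens if t in word_vectors]
    vecs = np.array([word_vectors[t] for t in kept])
    if weights is None:
        return vecs.mean(axis=0)
    w = np.array([weights[t] for t in kept])
    return (vecs * w[:, None]).sum(axis=0) / w.sum()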