Text Vectorization
Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical
representation
• Some popular methods to accomplish text vectorization:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
o Word2Vec
o etc
Binary Term Frequency
• Captures presence (1) or absence (0) of term in document
• token_pattern = '(?u)\\b\\w\\w+\\b' (the default):
The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is
completely ignored and always treated as a token separator).
• lowercase = True
• stop_words = 'english'
• max_df (default 1.0):
When building the vocabulary ignore terms that have a document frequency strictly higher
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.
• min_df (default 1):
When building the vocabulary ignore terms that have a document frequency strictly lower
than the given threshold. If float, the parameter represents a proportion of documents, if
integer, the parameter represents absolute counts.
• max_features (default None):
If not None, build a vocabulary that only considers the top max_features terms ordered by
term frequency across the corpus.
• ngram_range (default (1,1)):
The lower and upper boundary of the range of n-values for different n-grams to be
extracted. All values of n such that min_n <= n <= max_n will be used.
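A minimal sketch of binary term frequency using scikit-learn's CountVectorizer with the parameters listed above (the toy corpus here is made up for illustration; the notebook may use different data and settings):

# Binary term frequency with scikit-learn's CountVectorizer
# binary=True records presence/absence only
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

vectorizer = CountVectorizer(
    binary=True,                       # 1 if term present in document, else 0
    token_pattern=r'(?u)\b\w\w+\b',    # default: tokens of 2+ alphanumeric characters
    lowercase=True,
    stop_words='english',
    ngram_range=(1, 1),
)
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                     # rows = documents, columns = vocabulary terms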
Bag of Words (BoW) Term Frequency
• Captures frequency of term in document
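The same CountVectorizer with the default binary=False produces raw counts instead of 0/1 indicators (a sketch on a made-up corpus):

# Bag of Words counts: CountVectorizer with binary=False (the default)
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat and the cat ran"]
bow = CountVectorizer(stop_words='english')
X = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X.toarray())   # each entry is the term's count in that document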
(L1) Normalized Term Frequency
• Captures normalized BoW term frequency in document
• TF typically L1-normalized
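One way to get L1-normalized term frequencies in scikit-learn is TfidfVectorizer with IDF weighting turned off (a sketch; dividing the raw BoW counts by their row sums gives the same result):

# L1-normalized term frequency: counts divided by the total count in the document
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
tf_l1 = TfidfVectorizer(use_idf=False, norm='l1', stop_words='english')
X = tf_l1.fit_transform(corpus)
print(X.toarray())   # each row sums to 1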
(L2) Normalized TFIDF
• Captures normalized TFIDF of term in document
• TFIDF typically L2-normalized
• Number of documents in corpus: N
• Number of documents in corpus with term t: Nt
• Term Frequency of term t in document d: TF(t, d)
o Bag of Words (BoW) Term Frequency
o The more frequent a term is, the higher the TF
o With sublinear TF scaling, TF is replaced by log(TF) + 1
• Inverse Document Frequency of term t in corpus: IDF(t) = log[N/Nt] + 1
o Measures how common a term is among all documents.
o The more common a term is, the lower its IDF.
o With smoothing: IDF(t) = log[(1+N)/(1+ Nt)] + 1
• TFIDF of term t in document d = Term Frequency * Inverse Document Frequency: TFIDF(t, d) = TF(t, d) * IDF(t)
o If a term appears frequently in a document, it's important - give the term a high score.
o If a term appears in many documents, it's not a unique identifier - give the term a low score.
• The TFIDF scores are then often L2-normalized (L1 normalization could also be considered)
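A sketch of L2-normalized TFIDF with scikit-learn's TfidfVectorizer, whose defaults match the formulas above (toy corpus made up for illustration):

# L2-normalized TFIDF with scikit-learn's TfidfVectorizer
# Defaults: norm='l2', smooth_idf=True  -> IDF(t) = log[(1+N)/(1+Nt)] + 1
#           sublinear_tf=False          -> set True to use log(TF) + 1
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
tfidf = TfidfVectorizer(norm='l2', smooth_idf=True, sublinear_tf=False)
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X.toarray())   # each row has unit L2 norm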
Word2Vec
• Captures embedded representation of terms
References:
Distributed Representations of Words and Phrases and their Compositionality
Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of words:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
n-grams try to capture some level of contextual information, but only do so to a
limited degree.
• Word2Vec provides a distributed or embedded representation of words
• Start with a one-hot encoded (OHE) representation of all words in the corpus
• Train an NN (with 1 hidden layer) on a very large corpus of data. The rows of the
resulting hidden-layer weight matrix are then used as the word vectors.
• One of two methods is typically used for training the NN:
o Continuous Bag of Words (CBOW): Predict vector representation of center/target word -
based on window of context words.
o Skip-Gram (SG): Predict vector representation of window of context words - based on
center/target word.
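A minimal training sketch using gensim (the library is not mentioned in the notes, so treat the choice and the gensim 4.x parameter names as assumptions):

# Training Word2Vec with gensim (assumed library; gensim 4.x API)
from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [
    ["you", "shall", "know", "a", "word", "by", "the", "company", "it", "keeps"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embedded vectors
    window=2,          # context words considered on each side of the target
    min_count=1,       # ignore words with total frequency below this
    sg=0,              # 0 = CBOW, 1 = Skip-Gram
)
vec = model.wv["word"]   # 100-dimensional vector for "word"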
[Context-window illustration: the center/target word w_t is surrounded by the context words w_(t-2), w_(t-1), w_(t+1), w_(t+2), as in "You shall know a word by the company it keeps."*]
*Quote by J. R. Firth
Several factors influence the quality of the word vectors including:
• Amount and quality of the training data.
If you don't have enough data, you may be able to use pre-trained vectors created by others (for
instance, Google has shared a model trained on ~100 billion words from its News dataset; the
model contains 300-dimensional vectors for 3 million words and phrases). If you do end up using
pre-trained vectors, make sure the training data domain is similar to the data you're working with.
• Size of the embedded vectors
In general, quality increases with higher dimensionality, but marginal gains typically diminish after
a threshold. Typically, the dimensionality of the vectors is set to be between 100 and 1000.
• Training algorithm
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data, and does a good job representing rare words or
phrases.
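If you go the pre-trained route, gensim's downloader can fetch the Google News vectors mentioned above (an assumed workflow; the model name is the one registered in gensim-data, and the download is large):

# Loading pre-trained Google News vectors via gensim's downloader (assumed workflow)
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")    # KeyedVectors: 3M words/phrases, 300 dims
print(wv["computer"].shape)                  # (300,)
print(wv.most_similar("computer", topn=3))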
Once we have the embedded vectors for each word, we can use them for downstream NLP
tasks, for instance:
• Compute similarity using cosine similarity between word vectors
• Create higher-order representations (sentence/document) using a weighted
average of the word vectors and feed them to a classification task
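A sketch of both uses with plain NumPy, assuming word_vectors is a dict-like mapping from word to vector (for example model.wv from the training sketch above); the function names are hypothetical:

# Using the embedded vectors: cosine similarity and a simple document vector
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def document_vector(tokens, word_vectors, weights=None):
    # weighted average of the word vectors (uniform weights by default);
    # the result can be fed to a downstream classifier
    kept = [t for t in tokens if t in word_vectors]
    vecs = np.array([word_vectors[t] for t in kept])
    if weights is None:
        return vecs.mean(axis=0)
    w = np.array([weights[t] for t in kept])
    return (vecs * w[:, None]).sum(axis=0) / w.sum()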