Introduction to NLP
What is NLP?
● Natural Language Processing (NLP) is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human language.
● With the help of NLP, we can communicate with computers using a natural language
such as English.
Advantages of NLP
● Today NLP is booming thanks to advances in access to data and in computational power.
● This is helping practitioners achieve meaningful results in areas like
○ Healthcare
○ Media
○ Finance
Problems with Text Data
● Today, huge volumes of data are generated through conversations, declarations and tweets, and most of this data is unstructured.
● Unstructured data does not fit into a row-and-column structure, which makes it difficult to analyze and manipulate.
Why should we learn NLP?
● With the help of NLP, machines can detect figures of speech such as irony and even perform sentiment analysis.
● We cannot always have data in numeric form, so to deal with textual data we use NLP, which takes raw language as input and derives meaningful insights from it.
Applications of NLP
● NLP enables the recognition and prediction of diseases based on electronic health
records and a patient’s own speech.
● Organizations can determine what customers are saying about a product by identifying and extracting information.
○ This sentiment analysis can tell a lot about customers’ choices and their decision drivers.
Applications of NLP
● Big companies like Google filter and classify emails with NLP by analyzing text in
emails and stopping spam emails before they enter your inbox.
● Amazon’s Alexa and Apple’s Siri are examples of intelligent voice-driven interfaces that use NLP to respond to vocal prompts and carry out a wide range of everyday tasks.
Applications of NLP
● NLP is used in both the search and selection phases of talent recruitment by identifying the skills of potential hires.
● NLP also powers search autocorrect and autocomplete.
Steps to solve NLP problems
● Gather Data
○ Gather textual data from emails, posts or tweets.
● Clean Data
○ A clean dataset allows the model to learn meaningful features and not overfit on
irrelevant noise.
■ Remove all irrelevant characters.
■ Tokenize the text by splitting it into individual words.
Steps to solve NLP problems
● Clean Data
■ Convert all characters to lowercase, and map misspelled or alternatively spelled words to a single representation.
■ Reduce words such as “am”, “are” and “is” to a common form.
● Finding good representation
○ Convert the textual data into a numeric form which algorithms can understand and derive insights from.
Steps to solve NLP problems
● Classification
○ Split the data into training and testing data.
○ Fit a classification model on the training data and check how well the model generalizes to unseen data using the testing dataset.
● Inspection
○ Understand the errors made by the model using a confusion matrix, as in the sketch below.
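A minimal sketch of the classification and inspection steps using scikit-learn. The toy messages, labels and the choice of Logistic Regression are illustrative assumptions, not part of the original material.

# Represent texts as numbers, fit a model, and inspect its errors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

texts = ["free money now", "meeting at noon", "win a big prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # hypothetical labels: 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(texts)   # good representation (bag of words)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)                  # fit on the training data
predictions = model.predict(X_test)          # predict on unseen data

print(confusion_matrix(y_test, predictions)) # inspect the errors made by the model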
What is Text Processing?
● Text processing means the analysis, manipulation and generation of text.
● It is an automated process which analyzes data to obtain structured information.
● It includes extracting smaller pieces of information from text data and assigning tags depending on their context.
Techniques to analyze text data
● Statistical Methods
○ We use statistical methods such as frequency distributions and TF-IDF to process and analyze text.
● Text Classification
○ Text classification assigns text to predefined groups based on its content. Popular tasks include:- topic analysis, sentiment analysis, intent detection and language classification.
Techniques to analyze text data
● Text Extraction
○ Text extraction is a text processing technique that identifies and obtains valuable pieces of data present within the text.
○ This method helps us to detect and extract the relevant words or expressions
from text.
Popular libraries used for NLP
● spaCy
○ spaCy is an open-source library which excels at large-scale information extraction tasks.
○ Its major features are:-
■ Part-of-speech tagging and tokenization
■ Dependency parsing
■ Sentence segmentation
■ Named entity recognition
■ Methods for cleaning and normalizing text
Popular libraries used for NLP
● NLTK (Natural Language ToolKit)
○ Its goal is to make learning and working with computational linguistics easier by offering features such as classification, stemming, tagging, parsing, semantic reasoning and wrappers for industrial-strength NLP libraries.
● Gensim
○ It is a library for Topic Modelling and similarity retrieval.
○ It excels at two things: processing large volumes of text and information retrieval.
Popular libraries used for NLP
● TextBlob
○ TextBlob is used for processing text-based data and offers a simple API for common NLP tasks such as:-
■ Part-of-speech tagging
■ Sentiment Analysis
■ Classification
■ Tokenization
■ N-grams
■ Parsing and spelling correction
What is Feature Engineering?
● Feature engineering is the process of creating new
features from the existing ones and removing the
unimportant features.
● Feature engineering is an art and a skill.
● It requires us to have creativity and domain
knowledge.
Why do we need Feature Engineering?
● Good features present in the data influence the results of the predictive model in a very positive way.
Why is it necessary to Clean Data?
● Data cleaning is a very crucial step in NLP because without cleaning, the dataset is just a jumble of words which the computer cannot understand.
● Textual data is unstructured and noisy and can contain:-
■ Typos, bad grammar, slang, URLs
■ Stopwords, expressions, punctuation etc.
Steps to clean Textual Data
● Remove punctuations and numbers
● Perform tokenization
● Remove special and accented characters
● Remove Stopwords
● Perform Stemming and Lemmatization
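The steps above can be chained together. Below is a sketch using NLTK, with an invented example sentence; the exact regexes and the use of PorterStemmer are illustrative choices.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download required data (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")

text = "Check out https://example.com!! It's AMAZING... 100% worth it :)"

text = text.lower()                        # convert to lowercase
text = re.sub(r"https?://\S+", "", text)   # strip URLs
text = re.sub(r"[^a-z\s]", "", text)       # remove punctuation, numbers, special characters
tokens = word_tokenize(text)               # tokenization

stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]  # remove stopwords

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # stemming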
What is Tokenization?
● Tokenization is simple, yet it is the building block of Natural Language Processing.
● It is a way of separating a piece of text into smaller units called tokens.
● The tokens can be:-
■ Words, Characters, Sentences
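A quick illustration with NLTK (assuming its tokenizer data has been downloaded); the sample sentence is invented.

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. Tokenization splits text into smaller units."
print(word_tokenize(text))  # word tokens: ['NLP', 'is', 'fun', '.', ...]
print(sent_tokenize(text))  # sentence tokens: ['NLP is fun.', 'Tokenization splits ...']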
Stopwords
● Stopwords are words which do not add much value to a sentence.
● They are removed from the vocabulary to reduce noise as well as the dimension of the
feature set.
● Examples of English stopwords are:-
■ “the”, “and”, “myself”, “this”, “into”, “here” etc.
■ and many more words that carry little meaning on their own.
Stemming
● Stemming is a process of removing a part of a word, or reducing a word to its stem or
root word.
● Example:-
■ We have three words: “Ask”, “asking” and “asked”
■ Stemming converts all three words into the root word “ask”.
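The same example with NLTK’s PorterStemmer (one common stemmer among several):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ask", "asking", "asked"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to "ask"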
Lemmatization
● Lemmatization reduces the word to its dictionary form. The root word in
lemmatization is called “lemma”.
● We have two words: “good” and “better”.
○ Lemmatization reduces both to the same root word “good”.
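The same example with NLTK’s WordNetLemmatizer; note that the part-of-speech hint "a" (adjective) is needed for “better” to map to “good”.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # dictionary data used by the lemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("good", pos="a"))    # good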
Difference between Stemming & Lemmatization
● Algorithms used in the stemming process don’t know the meaning behind the words; they simply chop off word endings.
● Algorithms used in the lemmatization process refer to a dictionary to understand the meaning of a word before reducing it.
Let’s Take an Example
● We have three words:- “play”, “playing” and “player”.
○ According to a stemmer, all three words have the same root word “play”.
○ According to a lemmatizer, “play” and “playing” have the same root, while “player” is a word with a different meaning.
Feature Extraction for NLP
What is Feature Extraction?
● Feature extraction means extracting and producing feature representations that are appropriate for the NLP task.
● Features that can be extracted from the text are:-
■ Number of words, characters, stopwords etc.
■ Length of the text.
■ Number of punctuation marks.
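A sketch of computing such count-based features in plain Python; the sample sentence and the stopword source are illustrative (and assume NLTK’s stopwords data is downloaded).

import string
from nltk.corpus import stopwords

text = "NLP turns raw text into useful, structured insights!"
words = text.split()
stop_words = set(stopwords.words("english"))

features = {
    "num_words": len(words),
    "text_length": len(text),  # number of characters
    "num_stopwords": sum(w.lower() in stop_words for w in words),
    "num_punctuation": sum(ch in string.punctuation for ch in text),
}
print(features)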
Feature Extraction Techniques
● Major feature extraction techniques for NLP are:-
■ Bag of words representation
■ TF-IDF
■ N-gram analysis
Bag of Words
● Bag of words is a way of extracting features from text for use in modelling.
● The bag of words approach is very simple and flexible and can be used in a number of
ways for extracting features from the documents.
Bag of words representation
● Bag of words works in the following way:-
○ It records which distinct words occur in the text.
○ All the distinct words form the columns of a matrix, and the values are 0 or 1 based on the absence or presence of the word in the text.
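A binary bag-of-words sketch with scikit-learn, using invented toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
vectorizer = CountVectorizer(binary=True)   # 1/0 for presence/absence
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # distinct words = columns
print(matrix.toarray())                     # rows = documents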
Introduction to TF-IDF
● TF-IDF is short for Term Frequency-Inverse Document Frequency.
● It is designed to reflect how important a word is to a document in a collection or corpus.
● The TF-IDF value increases proportionally to the number of times a word appears in a document, and is offset by how many documents in the corpus contain the word.
TF-IDF Score
● The TF-IDF value is calculated by multiplying two metrics:-
■ Term frequency: how many times a word appears in a document.
■ Inverse document frequency of the word across the set of documents.
● Inverse Document Frequency
○ Measures how common (or rare) a word is across the entire document set:-
■ IDF = log(Total no. of documents / no. of documents containing the word)
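A sketch with scikit-learn’s TfidfVectorizer on invented documents; note that scikit-learn uses a smoothed variant of the IDF formula above, so the exact numbers differ slightly.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the match was a great match",
        "we will win the election",
        "the game was a close game"]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))  # words rare across documents score higher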
Why TF-IDF?
● Information Retrieval
○ TF-IDF was invented for document search and is used to deliver results that are
most relevant to what we are searching for.
● Keyword extraction
○ TF-IDF is also useful for keyword extraction. The highest scoring words for a
document are the most relevant keywords.
N-grams
● An n-gram is a sequence of n words.
● For Example: I love reading books.
○ 1-gram or unigram will be:- “I”, “love”, “reading”, “books”.
○ 2-gram or bigram will be:- “I love”, “love reading”, “reading books”.
○ 3-gram or trigram will be:- “I love reading”, “love reading books”.
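Generating these n-grams with NLTK:

from nltk.util import ngrams

tokens = "I love reading books".split()
for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
# 1 ['I', 'love', 'reading', 'books']
# 2 ['I love', 'love reading', 'reading books']
# 3 ['I love reading', 'love reading books']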
Why N-grams?
● N-grams of texts are extensively used in text mining and NLP tasks such as auto-completion of sentences and automatic spell checking.
● Example:-
○ Using a 3-gram analysis, a bot will understand the difference between “What’s
the temperature” and “Set the temperature” which is not possible using 1-gram
or 2-grams.
What is Text Classification?
● Text classification, or text categorization, is the process of analyzing natural language text and then labelling it with a predefined set of labels or tags.
● Text classifiers have proven to be a great alternative for structuring textual data in a fast, cost-effective and scalable way.
● It allows us to easily get insights from data and automate business processes.
Examples
● Classifying emails as spam or not spam.
● Sentiment analysis:- Understanding if the text has positive, negative or neutral
sentiment.
● Language detection:- Detecting the language of a given text.
● Classifying content into categories to easily search and navigate within a website or
application.
Applications of Text Classification
● Tagging content or products using categories as a way to improve browsing or to
identify related content on the website.
● As marketing is becoming more targeted every day, automated classification of users into cohorts can make a marketer’s life simple.
● Text classification of content on a website helps Google crawl the website easily, which helps in SEO.
● Email providers use text classification to differentiate between legitimate and spam
mails.
Text Classification using ML
● Machine learning helps in text classification by learning to classify based on past observations.
● By using pre-labelled examples as training data, ML algorithms learn the associations between pieces of text and their labels.
Models for Text Classification
● Naive Bayes Family of Algorithms
● Support Vector Machines (SVM)
● Deep learning
Conditional Probability
● Conditional probability is the probability of an event occurring given that a previous event has occurred.
● We have a bag of 5 balls: 2 of one colour and 3 of another.
Example
● On the first draw, the probabilities of the two colours are 2/5 and 3/5.
● If a ball of the second colour is drawn and not replaced, the probabilities on the next draw become 2/4 and 2/4.
● The probability of one event thus depends on the probability of the previous event.
Bayes Theorem
● P(A|B) = P(A and B) / P(B)  =>  P(A and B) = P(A|B) P(B)
● P(B|A) = P(B and A) / P(A)  =>  P(B and A) = P(B|A) P(A)
● Since P(A and B) = P(B and A), equating the two expressions gives Bayes’ Theorem:
○ P(A|B) = [P(B|A) P(A)] / P(B)
Naive Bayes Classifier
● Naive Bayes is a family of probabilistic algorithms which use Bayes’ Theorem to predict the tag of a text.
● Being probabilistic means the algorithm calculates the probability of each tag for a given text and outputs the tag with the highest probability.
Example
● Suppose we track 4 major words across the categories “Sport” and “Not Sport”.
○ The 4 major words are: match, game, win and election.
Word       Sport count   P(word|Sport)   Not Sport count   P(word|Not Sport)
Match      6             6/15            1                 1/15
Game       5             5/15            2                 2/15
Win        3             3/15            5                 5/15
Election   1             1/15            7                 7/15
Example
● New message comes which has win election in it.
● The probability of a message in “sport” or “not sport” category is ½.
● P(sport/win election) = 0.006
● P(not sport/win election) = 0.07
● New message “win election” is in the “Not sport” category.
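The same idea in code, as a sketch with scikit-learn’s MultinomialNB; the training messages below are invented stand-ins for the word counts in the table above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["a great match", "an exciting game", "a close win for the team",
            "the election results", "candidates win the election", "election news"]
labels = ["sport", "sport", "sport", "not sport", "not sport", "not sport"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["win election"]))  # expected: ['not sport']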
Support Vector Machines(SVM)
● SVM is a powerful machine learning algorithm for text classification.
● SVM separates the two classes with a line (more generally, a hyperplane).
● The optimal line is the one with the largest margin between the classes.
● It works for both linearly and non-linearly separable data.
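A sketch of an SVM text classifier using scikit-learn’s LinearSVC on TF-IDF features; the toy texts and labels are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great match today", "our team will win", "what a game",
         "vote in the election", "the election campaign", "poll results are out"]
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["who won the game"]))  # expected: ['sport']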
More Things to Try!
● We can try some more classification models such as Logistic Regression and deep learning.
● Instead of the TF-IDF vectorizer, we can use bag of words to convert texts into numbers and check how the model accuracy changes.
○ With bag of words we again have two choices: a binary bag of words or a frequency bag of words.