
NLP Core using NLTK

Dr. Muhammad Nouman Durrani


NLTK
• NLTK is a leading platform for building Python programs to work with
human language data.
• It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing
libraries for:
• classification, tokenization, stemming, tagging, parsing, and semantic reasoning, along with wrappers for industrial-strength NLP libraries
Introduction to Text Processing:
Extracting, transforming and selecting features
Consider the following examples:
• Google search
• 2008 U.S. Presidential Elections
• Google Translate
Introduction to Text Processing:
Extracting, transforming and selecting features

• So what do the above examples have in common?


• Text processing. All three scenarios deal with massive amounts of text to perform their tasks
• Humans deal with text quite intuitively
Introduction to Text Processing
• A computer can match two strings and tell you whether they are the same or not.
• But how do we make computers tell you about football or Ronaldo
when you search for Messi?

• Word Embeddings
Tokenization
• The process of breaking down a text paragraph into smaller chunks, such as words or sentences, is called tokenization
• A token is a single entity that serves as a building block of a sentence or paragraph

Sentence Tokenization
• Sentence tokenizer breaks text paragraph into sentences
from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is
pinkish-blue.', "You shouldn't eat cardboard"]
Tokenization
Word Tokenization
• Word tokenizer breaks text paragraph into words

from nltk.tokenize import word_tokenize


tokenized_word=word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',',
'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
'cardboard']
Tokenization
• Frequency Distribution

from nltk.probability import FreqDist


fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 25 samples and 30 outcomes>

fdist.most_common(2)

[('is', 3), (',', 2)]


Tokenization
• Frequency Distribution Plot

import matplotlib.pyplot as plt


fdist.plot(30,cumulative=False)
plt.show()
Tokenize Non-English Languages Text

• To tokenize other languages, you can specify the language like this:

from nltk.tokenize import sent_tokenize

mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))

The result will be like this:

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]
Stopwords
• Stopwords are considered noise in the text. Text may contain stopwords such as is, am, are, this, a, an, the, etc.

• To remove stopwords with NLTK, you create a list of stopwords and filter your list of tokens against it
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{'their', 'then', 'not', 'ma', 'here', 'other', 'won', 'up', 'weren', 'being', 'we', 'those', 'an', 'them', 'which', 'him', 'so', 'yourselves', 'what', 'own',
'has', 'should', 'above', 'in', 'myself', 'against', 'that', 'before', 't', 'just', 'into', 'about', 'most', 'd', 'where', 'our', 'or', 'such', 'ours', 'of', 'doesn',
'further', 'needn', 'now', 'some', 'too', 'hasn', 'more', 'the', 'yours', 'her', 'below', 'same', 'how', 'very', 'is', 'did', 'you', 'his', 'when', 'few',
'does', 'down', 'yourself', 'i', 'do', 'both', 'shan', 'have', 'itself', 'shouldn', 'through', 'themselves', 'o', 'didn', 've', 'm', 'off', 'out', 'but', 'and',
'doing', 'any', 'nor', 'over', 'had', 'because', 'himself', 'theirs', 'me', 'by', 'she', 'whom', 'hers', 're', 'hadn', 'who', 'he', 'my', 'if', 'will', 'are',
'why', 'from', 'am', 'with', 'been', 'its', 'ourselves', 'ain', 'couldn', 'a', 'aren', 'under', 'll', 'on', 'y', 'can', 'they', 'than', 'after', 'wouldn', 'each',
'once', 'mightn', 'for', 'this', 'these', 's', 'only', 'haven', 'having', 'all', 'don', 'it', 'there', 'until', 'again', 'to', 'while', 'be', 'no', 'during', 'herself',
'as', 'mustn', 'between', 'was', 'at', 'your', 'were', 'isn', 'wasn'}
Stopwords
# tokenized_sent is assumed here to hold the word tokens of the first sentence,
# e.g. tokenized_sent = word_tokenize(tokenized_text[0])
filtered_sent=[]
for w in tokenized_sent:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:",tokenized_sent)
print("Filtered Sentence:",filtered_sent)

Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']
Get Synonyms From WordNet
• WordNet is a database built for natural language processing
• It includes groups of synonyms and a brief definition
from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

a symptom of some physical hurt or disorder

['the patient developed severe pain and distension']


Get Synonyms From WordNet
• You can use WordNet to get synonymous words like this:

Synsets represent the different senses of a particular word, whereas lemmas are the synonyms within each sense. The words in a Synset are known as Lemmas.

from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

The output is:

['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']
Get Antonyms From WordNet
• You can get the antonyms of words the same way
• Check each lemma before adding it to the list, to see whether it has an antonym or not
from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

['large', 'big', 'big']


NLTK Word Stemming

• Word stemming means removing affixes from words and returning the root word. (The stem of
the word working is work.)
• Search engines use this technique when indexing pages, so that the many different forms of a word that people write are all stemmed to the same root word
• NLTK has a class called PorterStemmer that uses this algorithm.

from nltk.stem import PorterStemmer


stemmer = PorterStemmer()
print(stemmer.stem('working'))

The result is: work.


Lemmatizing Words Using WordNet
• Word lemmatizing is similar to stemming, but the difference is the
result of lemmatizing is a real word
When we stem some words, it will result as follows:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('increases'))
The result is: increas.
When we lemmatize the same word using NLTK WordNet, the result is increase:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))
The result is increase.
Lemmatizing Words Using WordNet
• If we try to lemmatize a word like “playing”, it will end up with the same word
– This is because the default part of speech is nouns
– To get verbs, adjectives, or adverbs, we should specify the part of speech (see the example below)
– Actually, this is a very good level of text compression.
– We end up with about 50% to 60% compression
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))   # play
print(lemmatizer.lemmatize('playing', pos="n"))   # playing
print(lemmatizer.lemmatize('playing', pos="a"))   # playing
print(lemmatizer.lemmatize('playing', pos="r"))   # playing
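In practice, the part of speech is usually not hard-coded. A minimal sketch (an illustration, not part of NLTK itself; the helper name penn_to_wordnet is made up here) that maps the Penn Treebank tags produced by nltk.pos_tag to the WordNet POS constants, so the right pos can be passed automatically:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def penn_to_wordnet(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS constant;
    # default to noun, mirroring the lemmatizer's own default.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in nltk.pos_tag(word_tokenize("He was playing and eating at the same time")):
    print(lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag)))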
Part of speech tagging (POS)

• Part-of-speech tagging is used to assign parts of speech to each word of a given text (such as nouns, verbs, pronouns, adverbs, conjunctions, adjectives, interjections) based on its definition and its context.

text = "vote to choose a particular man or a group (party) to represent them in parliament"
tex = word_tokenize(text)  # Tokenize the text
for token in tex:
    print(nltk.pos_tag([token]))

Output:
[('vote', 'NN')]
[('to', 'TO')]
[('choose', 'NN')]
[('a', 'DT')]
[('particular', 'JJ')]
[('man', 'NN')]
[('or', 'CC')]
[('a', 'DT')]
[('group', 'NN')]
[('(', '(')]
[('party', 'NN')]
[(')', ')')]
[('to', 'TO')]
[('represent', 'NN')]
[('them', 'PRP')]
[('in', 'IN')]
[('parliament', 'NN')]
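Note that the loop above tags each token in isolation. A small variation (a sketch using the same text) passes the whole token list to nltk.pos_tag so the tagger can use the surrounding words as context, which generally gives more accurate tags for context-dependent words such as the verbs after "to":

import nltk
from nltk.tokenize import word_tokenize

text = "vote to choose a particular man or a group (party) to represent them in parliament"
tokens = word_tokenize(text)

# Tagging the full list at once lets the tagger see neighbouring words
print(nltk.pos_tag(tokens))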
Named entity recognition

• It is the process of detecting named entities such as person names, location names, company names, quantities, and monetary values.

text = "Google's CEO Sundar Pichai introduced the new Pixel at Minnesota Roi Centre Event"

# importing chunk library from nltk
from nltk import ne_chunk

# tokenize and POS tag before chunking
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
chunk

NER Output:

Tree('S', [Tree('GPE', [('Google', 'NNP')]), ("'s", 'POS'), Tree('ORGANIZATION',


[('CEO', 'NNP'), ('Sundar', 'NNP'), ('Pichai', 'NNP')]), ('introduced', 'VBD'), ('the',
'DT'), ('new', 'JJ'), ('Pixel', 'NNP'), ('at', 'IN'), Tree('ORGANIZATION',
[('Minnesota', 'NNP'), ('Roi', 'NNP'), ('Centre', 'NNP')]), ('Event', 'NNP')])
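A small follow-up sketch (based on the tree shown above): iterate over the ne_chunk output and collect the named-entity subtrees together with their labels. Only the subtrees carry a label; plain (word, tag) pairs do not.

named_entities = []
for node in chunk:
    if hasattr(node, 'label'):  # named-entity subtrees have labels such as 'GPE' or 'ORGANIZATION'
        entity = " ".join(word for word, tag in node.leaves())
        named_entities.append((entity, node.label()))
print(named_entities)
# Expected from the tree above:
# [('Google', 'GPE'), ('CEO Sundar Pichai', 'ORGANIZATION'), ('Minnesota Roi Centre', 'ORGANIZATION')]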
Chunking

• Chunking means picking up individual pieces of information and


grouping them into bigger pieces.
• In the context of NLP and text mining, chunking means grouping of
words or tokens into chunks.
text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
(S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))
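To pull out just the NP chunks from the parse tree above, one can filter its subtrees by label (a short sketch using the result object from the previous snippet):

for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print(subtree.leaves())
# [('the', 'DT'), ('yellow', 'JJ'), ('dog', 'NN')]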
What are Word Embeddings?
• Word Embeddings are texts converted into numbers
• There may be different numerical representations of the same text

Why do we need Word Embeddings?

• Many Machine Learning algorithms and almost all Deep Learning architectures are incapable of processing strings or plain text in their raw form
• With the huge amount of data present in text format, it is imperative to extract knowledge from it and build applications
• They require numbers as inputs to perform any sort of job, be it classification, regression, etc., in broad terms
• Some real-world applications of text processing are sentiment analysis of reviews by Amazon, and document or news classification or clustering by Google
• A Word Embedding format generally tries to map a word to a vector using a dictionary
What are Word Embeddings?
Take a look at this example – sentence = "Word Embeddings are Word Converted into numbers"
• A word in this sentence may be "Embeddings" or "numbers", etc.
• A dictionary may be the list of all unique words in the sentence. So, the dictionary here may look like – ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']
• A vector representation of a word may be a one-hot encoded vector, where 1 stands for the position where the word exists and 0 everywhere else.
• The vector representation of 'numbers' in this format, according to the above dictionary, is [0, 0, 0, 0, 0, 1], and that of 'Converted' is [0, 0, 0, 1, 0, 0] (see the sketch below).
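A minimal sketch of this idea in plain Python (no libraries; the one_hot helper is illustrative, not a standard function):

sentence = "Word Embeddings are Word Converted into numbers"

# The dictionary: all unique words, in order of first appearance
dictionary = []
for word in sentence.split():
    if word not in dictionary:
        dictionary.append(word)
print(dictionary)   # ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']

def one_hot(word, dictionary):
    # 1 at the word's position in the dictionary, 0 everywhere else
    return [1 if w == word else 0 for w in dictionary]

print(one_hot('numbers', dictionary))    # [0, 0, 0, 0, 0, 1]
print(one_hot('Converted', dictionary))  # [0, 0, 0, 1, 0, 0]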
One-hot encoding (CountVectorizing)

• The most basic and naive method for transforming words into vectors is to count the occurrences of each word in each document. Such an approach is called count vectorizing or one-hot encoding.
– The idea is to collect a set of documents (they can be words, sentences, paragraphs or
even articles) and count the occurrence of every word in them.
– The columns of the resulting matrix are words and the rows are documents.
Example I
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
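For reference, X.toarray() for this corpus yields the count matrix below (rows are the four documents, columns follow the feature names above). Note that in recent scikit-learn versions get_feature_names() has been replaced by get_feature_names_out().

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]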
Example II
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]

# To actually create the vectorizer, we simply need to call fit on the text
# data that we wish to fit
vectorizer.fit(sample_text)

# Now, we can inspect how our vectorizer vectorized the text
# This will print out a list of words used, and their index in the vectors
print('Vocabulary: ')
print(vectorizer.vocabulary_)

The converse mapping, from feature name to column index, is stored in the vocabulary_ attribute of the vectorizer.
Example II
# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(sample_text)

# Our final vector:


print('Full vector: ')
print(vector.toarray())

# Or if we wanted to get the vector for one word:


print('Hot vector: ')
print(vectorizer.transform(['hot']).toarray())

# Or if we wanted to get multiple vectors at once to build matrices


print('Hot, one and Today: ')
print(vectorizer.transform(['hot', 'one', 'of']).toarray())
Example II

# We could also do the whole thing at once with the fit_transform method:
print('One swoop:')
new_text = ['Today is the day that I do the thing today, today']
new_vectorizer = CountVectorizer()
print(new_vectorizer.fit_transform(new_text).toarray())
Word Frequencies with TfidfVectorizer
• Word counts are a good starting point, but they are very basic
– One issue with simple counts is that words like "the" (and many other non-stopwords) appear many times, and their large counts are not very meaningful in the encoded vectors
• TF-IDF features are numerical representations where words are represented by their term frequency multiplied by their inverse document frequency
– Term Frequency: this summarizes how often a given word appears within a document
It is the number of times a word appears in a document divided by the total number of words in the document.
Every document has its own term frequency
Word Frequencies with TfidfVectorizer
– Inverse Document Frequency: The log of the number of documents divided by the number of documents that
contain the word w
Inverse document frequency determines the weight of rare words across all documents in the corpus
This downscales words that appear a lot across documents

• Combining these two, we come up with the TF-IDF score for a word w in a document in the corpus. It is the product of tf and idf: tf-idf(w, d) = tf(w, d) × idf(w) (a small code sketch follows this list)

• TF-IDF are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document
but not across documents
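A minimal sketch of these definitions in plain Python (function names are illustrative; log base 10 is used to match the worked example later in this section):

import math

def tf(term, doc_tokens):
    # times the term appears in the document / total number of terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # log of (number of documents / number of documents that contain the term)
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)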
Word Frequencies with TfidfVectorizer
• Let’s take an example to get a clearer understanding.
Sentence 1 : The car is driven on the road.
Sentence 2: The truck is driven on the highway.
• In this example, each sentence is a separate document.
• Calculate the TF-IDF for the above two documents, which represent our corpus.
With TfidfTransformer, you compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores.

With TfidfVectorizer, on the contrary, you do all three steps at once. Under the hood, it (i) tokenizes the documents and computes the word counts, (ii) learns the vocabulary and the inverse document frequency (IDF) weightings, and (iii) computes the TF-IDF scores, all using the same dataset.
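A short sketch of the two routes on the two sentences above (with default settings, scikit-learn documents that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so both routes give the same matrix):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]

# Route 1: word counts first, then IDF weighting
counts = CountVectorizer().fit_transform(docs)
tfidf_counts = TfidfTransformer().fit_transform(counts)

# Route 2: all three steps at once
tfidf_direct = TfidfVectorizer().fit_transform(docs)

print(tfidf_counts.toarray())
print(tfidf_direct.toarray())   # same values as the first route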
Word Frequencies with TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())
How exactly does TF-IDF work?

• Sample tables give the count of terms (tokens/words) in two documents.


• TF = (Number of times term t appears in a document)/(Number of terms in the
document)
• TF(This,Document1) = 1/8
• TF(This, Document2)=1/5
• It denotes the contribution of the word to the document, i.e., words relevant to the document should be frequent.
• For example: a document about Messi should contain the word 'Messi' a large number of times.
How exactly does TF-IDF work?
• IDF = log(N/n), where, N is the number of documents and n is the number of documents a term t
has appeared in.
IDF(This) = log(2/2) = 0.
• How do we explain the reasoning behind IDF? Ideally, if a word has appeared in all the documents, then probably that word is not relevant to any particular document.
• But if it has appeared in a subset of documents then probably the word is of some relevance to
the documents it is present in.
Let us compute IDF for the word 'Messi'.
IDF(Messi) = log(2/1) = 0.301.
Now, let us compare the TF-IDF for the common word 'This' and the word 'Messi':
TF-IDF(This, Document1) = (1/8) * 0 = 0
TF-IDF(This, Document2) = (1/5) * 0 = 0
TF-IDF(Messi, Document1) = (4/8) * 0.301 = 0.15
• For Document1, the TF-IDF method heavily penalizes the word 'This' but assigns greater weight to 'Messi'.
• So, ‘Messi’ is an important word for Document1 from the context of the entire corpus.
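A quick sanity check of these numbers in Python (a sketch; the term counts, 'This' once and 'Messi' four times out of eight terms in Document1, and 'This' once out of five terms in Document2, are taken from the example above):

import math

tf_this_d1, tf_this_d2 = 1/8, 1/5   # 'This' in Document1 and Document2
tf_messi_d1 = 4/8                   # 'Messi' in Document1

idf_this  = math.log10(2/2)   # appears in both documents -> 0.0
idf_messi = math.log10(2/1)   # appears in one of the two documents -> ~0.301

print(tf_this_d1 * idf_this)    # 0.0
print(tf_this_d2 * idf_this)    # 0.0
print(tf_messi_d1 * idf_messi)  # ~0.1505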
