text-processing
March 24, 2024
[1]: import nltk
#tokenizing
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
#stopwords
from nltk.corpus import stopwords
#regexp
import re
# pandas dataframe
import pandas as pd
#import count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
[2]: nltk.download()
showing info [Link]
[2]: True
[3]: #load the data used in the book examples into the Python environment:
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, …, text9 and sent1, …, sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
This command loaded 9 of the text examples available from the corpora package (only
a small sample of what is available). It assigned the variable names text1 through
text9 to these examples and already gave them values. If you type a variable name,
you get a short description of the text.
[4]: text1
[4]: <Text: Moby Dick by Herman Melville 1851>
Note that the first sentence of Moby Dick is “Call me Ishmael.” and that this
sentence has already been separated into tokens in the variable sent1.
[5]: #The variables sent1 through sent9 have been set to the list of tokens of the first sentence of each text.
sent1
[5]: ['Call', 'me', 'Ishmael', '.']
[ ]:
0.1 Counting
[8]: #gives the total number of words in the text
len(text1)
[8]: 260819
[7]: #to find out how many unique words there are, not counting repetitions
sorted(set(text1)) #gives all the distinct tokens of text1
#Or we can just find the length of such a list, here for text3:
len(sorted(set(text3)))
[7]: 2789
[12]: #Or we can specify just to print the first 30 words in the list of sorted words:
sorted(set(text3))[:30]
[12]: ['!',
"'",
'(',
')',
',',
',)',
'.',
'.)',
':',
';',
';)',
'?',
'?)',
'A',
'Abel',
'Abelmizraim',
'Abidah',
'Abide',
'Abimael',
'Abimelech',
'Abr',
'Abrah',
'Abraham',
'Abram',
'Accad',
'Achbor',
'Adah',
'Adam',
'Adbeel',
'Admah']
[13]: #to count how many times the word 'Moby' appears in text1
text1.count("Moby")
[13]: 84
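Building on these counts, a rough measure of lexical diversity is the ratio of total
tokens to distinct tokens (a minimal sketch, assuming text1 is still loaded from
nltk.book):
#average number of times each distinct word is used in Moby Dick
len(text1) / len(set(text1))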
[ ]:
0.2 Processing Text
Let's use the Gutenberg corpus. NLTK includes a small selection of texts from the
Project Gutenberg electronic text archive, which contains some 25,000 free electronic
books.
[19]: # You can then view some books obtained from the Gutenberg on-line book project:
nltk.corpus.gutenberg.fileids()
[19]: ['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']
[22]: #view the first file
file1 = nltk.corpus.gutenberg.fileids()[0]
file1
[22]: 'austen-emma.txt'
[33]: #We can get the original text, using the raw function:
emmatext = nltk.corpus.gutenberg.raw(file1)
emmatext[:120] #Since this is quite long, we can view just part of it, e.g. the first 120 characters
#len(emmatext) #count of total characters
[33]: '[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse,
handsome, clever, and rich, with a comfortable home\nan'
0.3 1. Tokenization
NLTK has several tokenizers available to break the raw text into tokens; we will use one that
separates by white space and also by special characters (punctuation)
0.3.1 Word Tokenization
[32]: emmatokens = nltk.wordpunct_tokenize(emmatext)
len(emmatokens) #total token count
#view the tokenized text
emmatokens[:15]
[32]: ['[',
'Emma',
'by',
'Jane',
'Austen',
'1816',
']',
'VOLUME',
'I',
'CHAPTER',
'I',
'Emma',
'Woodhouse',
',',
'handsome']
[34]: #Example
sentence="I have no money at the moment."
nltk.wordpunct_tokenize(sentence)
[34]: ['I', 'have', 'no', 'money', 'at', 'the', 'moment', '.']
[36]: #using word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))
['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
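The two word tokenizers treat punctuation inside words differently; a small sketch
(the sentence is only an illustration, not from the original notebook) contrasts them
on a contraction:
s = "Don't hesitate to ask."
print(nltk.wordpunct_tokenize(s)) # splits on all punctuation: ['Don', "'", 't', ...]
print(word_tokenize(s))           # keeps the contraction as "n't": ['Do', "n't", ...]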
[39]: #using the Regexp tokenizer
text = "God is Great! I won a lottery."
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize(text)
[39]: ['God', 'is', 'Great', 'I', 'won', 'a', 'lottery']
0.3.2 Sentence Tokenization
[44]: #sentence tokenization using the nltk library
text1 = "God is Great! I won a lottery."
print(sent_tokenize(text1))
['God is Great!', 'I won a lottery.']
[45]: text2 = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
text2.split(". ")
[45]: ['Let us understand the difference between sentence & word tokenizer',
'It is going to be a simple example.']
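A plain string split like the one above breaks on abbreviations, while sent_tokenize
(the Punkt model) usually handles them; a minimal sketch with an illustrative
sentence:
sample = "I met Dr. Smith today. He was very helpful."  # illustrative example
print(sent_tokenize(sample)) # keeps "Dr. Smith" inside the first sentence
print(sample.split(". "))    # wrongly splits after "Dr"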
[ ]:
0.4 2. Stopwords
[19]: #look at the stopwords list for English
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",
'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is',
'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
"couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
"shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
"wouldn't"]
[49]: sent1 = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
indeed the vaguest idea where the wood and river in question were."""
# set of stop words
stop_words = set(stopwords.words('english'))
# tokens of words
word_tokens = word_tokenize(sent1)
word_tokens[:10]
[49]: ['He',
'determined',
'to',
'drop',
'his',
'litigation',
'with',
'the',
'monastry',
',']
[50]: #empty list to collect the text with stop words removed
filtered_sentence = []
# filter out the stop words
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print("\nOriginal Sentence \n")
print(" ".join(word_tokens))
print("\nFiltered Sentence \n")
print(" ".join(filtered_sentence))
Original Sentence
He determined to drop his litigation with the monastry , and relinguish his
claims to the wood-cuting and fishery rihgts at once . He was the more ready to
do this becuase the rights had become much less valuable , and he had indeed the
vaguest idea where the wood and river in question were .
Filtered Sentence
He determined drop litigation monastry , relinguish claims wood-cuting fishery
rihgts . He ready becuase rights become much less valuable , indeed vaguest idea
wood river question .
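Note that 'He' survives the filtering because the stop word list is all lowercase; a
small variant (same word_tokens as above) compares tokens case-insensitively:
#lowercase each token before checking it against the stop word list
filtered_ci = [w for w in word_tokens if w.lower() not in stop_words]
print(" ".join(filtered_ci))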
0.5 3. Normalizing Word Formats
0.6 3.1 Lowercase
[51]: #Example
sentence="I have NO moNey at tHE moMent."
sentence.lower()
[51]: 'i have no money at the moment.'
[53]: #for already tokenized text
emmawords = [w.lower() for w in emmatokens]
emmawords[:15]
[53]: ['[',
'emma',
'by',
'jane',
'austen',
'1816',
']',
'volume',
'i',
'chapter',
'i',
'emma',
'woodhouse',
',',
'handsome']
[55]: # We can further view the words by getting the unique words and sorting them:
emmavocab = sorted(set(emmawords))
emmavocab[:10]
[55]: ['!', '!"', '!"--', "!'", "!'--", '!)--', '!--', '!--"', '!--(', '!--`']
[25]: #uppercased
sentence.upper()
#check Table 3.2 for more operations on strings (Chapter 3, Section 3.2 of the NLTK book)
[25]: 'I HAVE NO MONEY AT THE MOMENT.'
[26]: #select a set of words from the tokenized text
shortwords=emmawords[11:111]
shortwords[:10]
[26]: ['emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',']
[27]: #get the frequency count for each word
shortdist = FreqDist(shortwords)
for word in shortdist.keys():
    print(word, shortdist[word])
emma 1
woodhouse 1
, 8
handsome 1
clever 1
and 4
rich 1
with 2
a 3
comfortable 1
home 1
happy 1
disposition 1
seemed 1
to 3
unite 1
some 1
of 6
the 4
best 1
blessings 1
existence 1
; 2
had 3
lived 1
nearly 1
twenty 1
- 1
one 1
years 1
in 2
world 1
very 2
little 1
distress 1
or 1
vex 1
her 4
. 2
she 1
was 1
youngest 1
two 1
daughters 1
most 1
affectionate 1
indulgent 1
father 1
consequence 1
sister 1
' 1
s 1
marriage 1
been 1
mistress 1
his 1
house 1
from 1
early 1
period 1
mother 1
died 1
too 1
long 1
ago 1
for 1
have 1
more 1
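FreqDist also provides most_common for the top-ranked tokens; a one-line sketch on
the same shortdist:
#the five most frequent tokens and their counts
shortdist.most_common(5)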
0.7 3.2 Stemming
NLTK has two stemmers, Porter and Lancaster, described in section 3.6 of the NLTK
book. To use these stemmers, you first create them
[58]: porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[61]: #regular-cased text - porter stemmer
emmaregstem = [porter.stem(t) for t in emmatokens]
emmaregstem[1:10]
[61]: ['emma', 'by', 'jane', 'austen', '1816', ']', 'volum', 'i', 'chapter']
[30]: #lowercased text
emmalowerstem = [porter.stem(t) for t in emmawords]
emmalowerstem[1:10]
[30]: ['emma', 'by', 'jane', 'austen', '1816', ']', 'volum', 'i', 'chapter']
[31]: #regular-cased text - lancaster stemmer
emmaregstem1 = [lancaster.stem(t) for t in emmatokens]
emmaregstem1[1:10]
[31]: ['emm', 'by', 'jan', 'aust', '1816', ']', 'volum', 'i', 'chapt']
[70]: #building our own simple stemmer by making a list of suffixes to take off
def stem(word):
    for suffix in ['ing','ly','ed','ious','ies','ive','es','s']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

#try the above stemmer with 'friends'
stem('friends')
[70]: 'friend'
[71]: stem('relatives')
[71]: 'relativ'
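To compare how aggressively the two stemmers cut words down, a short sketch (the word
list is just an illustration) prints the Porter and Lancaster stems side by side:
#compare the two stemmers on a few sample words
for w in ['running', 'generously', 'happiness', 'relatives']:
    print(w, porter.stem(w), lancaster.stem(w))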
0.8 3.3 Lemmatizing
NLTK has a lemmatizer that uses the WordNet on-line thesaurus as a dictionary to look
up the root form of each word.
[74]: wnl = nltk.WordNetLemmatizer()
emmalemma = [wnl.lemmatize(t) for t in emmawords]
emmalemma[1:10]
[74]: ['emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter']
[82]: wnl.lemmatize('friends')
wnl.lemmatize('relatives')
[82]: 'relative'
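One detail worth noting (a minimal sketch, not from the original notebook): the
WordNet lemmatizer assumes each word is a noun unless a part of speech is supplied,
so verbs only get reduced when pos='v' is passed:
print(wnl.lemmatize('running'))          # treated as a noun, left unchanged
print(wnl.lemmatize('running', pos='v')) # treated as a verb, reduced to 'run'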
0.9 4. Regex: Regular Expressions for Detecting Word Patterns
[83]: emmatext[:100]
[83]: '[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse,
handsome, clever, and rich, with a'
[85]: #use the string replace method to replace all the newline characters '\n' with a space ' '
newemmatext = emmatext.replace('\n', ' ')
shorttext = newemmatext[:150]
#redefined the variable shorttext to be the first 150 characters
#without newlines
shorttext
[85]: '[Emma by Jane Austen 1816] VOLUME I CHAPTER I Emma Woodhouse, handsome,
clever, and rich, with a comfortable home and happy disposition, seemed to'
[38]: pword = re.compile(r'\w+')
#re.findall will find the substrings that match anywhere in the string.
re.findall(pword, shorttext)
[38]: ['Emma',
'by',
'Jane',
'Austen',
'1816',
'VOLUME',
'I',
'CHAPTER',
'I',
'Emma',
'Woodhouse',
'handsome',
'clever',
'and',
'rich',
'with',
'a',
'comfortable',
'home',
'and',
'happy',
'disposition',
'seemed',
'to']
[39]: #re.findall will find the substrings that match anywhere in specialtext.
specialtext = 'U.S.A. poster-print costs $12.40, with 10% off.'
re.findall(pword, specialtext)
[39]: ['U', 'S', 'A', 'poster', 'print', 'costs', '12', '40', 'with', '10', 'off']
[40]: #to match tokens, allowing words to have an internal hyphen
ptoken = re.compile(r'(\w+(-\w+)*)')
re.findall(ptoken, specialtext)
[40]: [('U', ''),
('S', ''),
('A', ''),
('poster-print', '-print'),
('costs', ''),
('12', ''),
('40', ''),
('with', ''),
('10', ''),
('off', '')]
[41]: #to match abbreviations that might have a "." inside, like U.S.A.
#We only allow capitalized letters
pabbrev = re.compile(r'(([A-Z]\.)+)')
re.findall(pabbrev, specialtext)
[41]: [('U.S.A.', 'A.')]
[42]: #combine it with the words pattern to match either words or abbreviations
ptoken = re.compile(r'(\w+(-\w+)*|([A-Z]\.)+)')
re.findall(ptoken, specialtext)
[42]: [('U', '', ''),
('S', '', ''),
('A', '', ''),
('poster-print', '-print', ''),
('costs', '', ''),
('12', '', ''),
('40', '', ''),
('with', '', ''),
('10', '', ''),
('off', '', '')]
[43]: #the order of the alternatives really matters if
#an earlier pattern matches part of what you want to match
ptoken = re.compile(r'(([A-Z]\.)+|\w+(-\w+)*)')
re.findall(ptoken, specialtext)
[43]: [('U.S.A.', 'A.', ''),
('poster-print', '', '-print'),
('costs', '', ''),
('12', '', ''),
('40', '', ''),
('with', '', ''),
('10', '', ''),
('off', '', '')]
[44]: #add an alternative to match currency amounts
ptoken = re.compile(r'(([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?)')
re.findall(ptoken, specialtext)
[44]: [('U.S.A.', 'A.', '', ''),
('poster-print', '', '-print', ''),
('costs', '', '', ''),
('$12.40', '', '', '.40'),
('with', '', '', ''),
('10', '', '', ''),
('off', '', '', '')]
Regular Expression Tokenizer using NLTK Tokenizer
[45]: #We can make a prettier regular expression that is equivalent to this one by
#using Python’s triple quotes that allows a string to go across multiple
#lines without adding a newline character
# abbreviations, e.g. U.S.A.
# words with internal hyphens
# currency, like $12.40
ptoken = re.compile(r'''([A-Z]\.)+
| \w+(-\w+)*
| \$?\d+(\.\d+)?
''', re.X)
[46]: # abbreviations, e.g. U.S.A.
# words with optional internal hyphens
# currency and percentages, e.g. $12.40, 82%
# ellipsis ex: hmm..., well...
# these are separate tokens; includes ], [
pattern = r''' (?x) [A-Z][a-z]+\.| (?:[A-Z]\.)+|
| \w+(?:-\w+)*
| \$?\d+(?:\.\d+)?%?
| \.\.\.
| [][.,;"'?():-_']'''
[47]: nltk.regexp_tokenize(shorttext[:30], pattern)
[47]: ['',
'[',
'',
'Emma',
'',
'',
'by',
'',
'',
'Jane',
'',
'',
'Austen',
'',
'',
'1816',
'',
']',
'',
'',
'',
'VO',
'']
[48]: nltk.regexp_tokenize(specialtext, pattern)
[48]: ['U.S.A.',
'',
'',
'poster-print',
'',
'',
'costs',
'',
'',
'$12.40',
'',
',',
'',
'',
'with',
'',
'',
'10',
'',
'',
'',
'off',
'',
'.',
'']
0.10 Document Term Matrix- DTM
[87]: # Let's start with a 'toy' corpus
CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]
[90]: #assign the count vectorizer to a variable
countvectorizer = CountVectorizer()
DTM = pd.DataFrame(countvectorizer.fit_transform(CORPUS).toarray(),
                   columns=countvectorizer.get_feature_names_out(), index=None)
DTM
[90]: and beautiful blue cheese is love sky so the
0 0 0 1 0 1 0 1 0 1
1 1 1 1 0 2 0 2 0 0
2 0 1 1 0 1 0 1 1 1
3 0 0 1 1 0 1 0 0 0
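Once fitted, the same CountVectorizer can turn an unseen document into counts over
the learned vocabulary; a minimal sketch (the new document is just an example):
new_doc = ['the sky is so beautiful']  # illustrative example
countvectorizer.transform(new_doc).toarray()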
[ ]: