Natural Language Processing with Python & nltk Cheat Sheet
by RJ Murray (murenei) via cheatography.com/58736/cs/15485/
Handling Text

text='Some words'              Assign string
list(text)                     Split text into character tokens
set(text)                      Unique tokens
len(text)                      Number of characters

Part of Speech (POS) Tagging

nltk.help.upenn_tagset('MD')   Lookup definition for a POS tag
nltk.pos_tag(words)            nltk in-built POS tagger
                               <use an alternative tagger to illustrate ambiguity>
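A quick sketch of the Handling Text commands above, using the sheet's own example string (pure standard-library Python, no nltk data needed):

```python
# Character-level views of a string, as in the Handling Text commands.
text = 'Some words'

chars = list(text)    # character tokens
unique = set(text)    # unique characters
n = len(text)         # number of characters

print(chars[:4])      # ['S', 'o', 'm', 'e']
print(n)              # 10
```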
Accessing corpora and lexical resources

from nltk.corpus import brown      Import CorpusReader object
brown.words(text_id)               Returns pretokenised document as list of words
brown.fileids()                    Lists docs in Brown corpus
brown.categories()                 Lists categories in Brown corpus

Sentence Parsing

g=nltk.data.load('grammar.cfg')    Load a grammar from a file
g=nltk.CFG.fromstring("""...""")   Manually define grammar
parser=nltk.ChartParser(g)         Create a parser out of the grammar
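A minimal end-to-end sketch of the Sentence Parsing commands: define a grammar manually, build a chart parser, and parse a sentence. The toy grammar and sentence are illustrative assumptions, not from the sheet; no nltk data download is required.

```python
import nltk

# Manually define a tiny context-free grammar (nltk.CFG.fromstring).
g = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VB NP
DT -> 'the'
NN -> 'dog' | 'ball'
VB -> 'chased'
""")

# Create a parser out of the grammar and parse a tokenised sentence.
parser = nltk.ChartParser(g)
text = ['the', 'dog', 'chased', 'the', 'ball']
trees = parser.parse_all(text)
for tree in trees:
    print(tree)
```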
trees=parser.parse_all(text)            Parse all trees for the sentence
for tree in trees: ... print(tree)      Print each parse tree
from nltk.corpus import treebank
treebank.parsed_sents('wsj_0001.mrg')   Treebank parsed sentences

Tokenization

text.split(" ")             Split by space
nltk.word_tokenize(text)    nltk in-built word tokenizer
nltk.sent_tokenize(doc)     nltk in-built sentence tokenizer
Lemmatization & Stemming

input="List listed lists listing listings"   Different suffixes
words=input.lower().split(' ')               Normalize (lowercase) words
porter=nltk.PorterStemmer()                  Initialise Stemmer
[porter.stem(t) for t in words]              Create list of stems
WNL=nltk.WordNetLemmatizer()                 Initialise WordNet lemmatizer
[WNL.lemmatize(t) for t in words]            Use the lemmatizer

Text Classification

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer().fit(X_train)          Fit bag of words model
vect.get_feature_names()                     Get features
vect.transform(X_train)                      Convert to document-term matrix
By RJ Murray (murenei) Published 28th May, 2018. Sponsored by Readable.com
cheatography.com/murenei/ Last updated 29th May, 2018. Measure your website readability!
tutify.com.au Page 1 of 2. https://readable.com
Entity Recognition (Chunking/Chinking)

g="NP: {<DT>?<JJ>*<NN>}"      Regex chunk grammar
cp=nltk.RegexpParser(g)       Create chunk parser from grammar
ch=cp.parse(pos_sent)         Parse tagged sent. using grammar
print(ch)                     Show chunks
ch.draw()                     Show chunks in IOB tree
cp.evaluate(test_sents)       Evaluate against test doc
sents=nltk.corpus.treebank.tagged_sents()
print(nltk.ne_chunk(sent))    Print chunk tree
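A sketch of NP chunking with the regex grammar above. The POS-tagged sentence is supplied by hand (an illustrative assumption) so that no tagger model or corpus download is needed:

```python
import nltk

# Chunk grammar: an NP is an optional determiner, any number of
# adjectives, then a noun.
g = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(g)

# Hand-tagged sentence standing in for pos_sent.
pos_sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD')]
ch = cp.parse(pos_sent)
print(ch)   # (S (NP the/DT little/JJ dog/NN) barked/VBD)
```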
RegEx with Pandas & Named Groups

df=pd.DataFrame(time_sents, columns=['text'])
df['text'].str.split().str.len()
df['text'].str.contains('word')
df['text'].str.count(r'\d')
df['text'].str.findall(r'\d')
df['text'].str.replace(r'\w+day\b', '???')
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3])
df['text'].str.extract(r'(\d?\d):(\d\d)')
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
df['text'].str.extractall(r'(?P<digits>\d)')
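A sketch of str.extract with named groups, as in the commands above. The time_sents data is a made-up stand-in; note also that recent pandas versions require regex=True for pattern-based str.replace calls like those listed.

```python
import pandas as pd

# Two sample sentences containing times, standing in for time_sents.
time_sents = ["Monday: meeting at 9:30 am",
              "Tuesday: lunch at 12:15 pm"]
df = pd.DataFrame(time_sents, columns=['text'])

# Named groups become column names in the extracted DataFrame.
times = df['text'].str.extract(r'(?P<hour>\d?\d):(?P<minute>\d\d)')
print(times)   # columns 'hour' and 'minute', one row per sentence
```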