Tokenizer NLP

Common Text Preprocessing Steps
Tokenization: Splitting the text into individual words or tokens.
Lowercasing: Converting all text to lowercase to ensure consistent word representation.
Stop word removal: Removing common words that do not carry much meaning (e.g., "a," "the," "is") to reduce noise.
Punctuation removal: Removing punctuation marks to focus on the essential words.
Lemmatization or stemming: Reducing words to their base or root form to normalize variations (e.g., "running" to "run").
Removing numbers: Eliminating numerical values that may not be relevant for the analysis.
Removing special characters: Eliminating symbols or special characters that do not contribute to the meaning.
Handling contractions: Expanding contractions (e.g., "can't" to "cannot") for consistent word representation.
Removing HTML tags (if applicable): Removing HTML tags if dealing with web data.
Handling encoding issues: Addressing encoding problems to ensure proper text handling.
Handling missing data: Dealing with missing values in the text, if any, through imputation or removal.
Removing irrelevant information: Eliminating non-textual content, such as URLs or email addresses.
Spell checking/correction: Correcting common spelling errors to improve the quality of the text.
Removing excess white spaces: Eliminating extra spaces or tabs between words.
Normalizing whitespace: Ensuring consistent spacing between words.
Sentence segmentation: Splitting the text into individual sentences, if required.
Feature engineering: Extracting additional features from the text, such as n-grams or part-of-speech tags, for more advanced analyses.
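Several of these steps can be combined in just a few lines of Python. The sketch below is an illustration added for reference (it is not part of the original notebook); it uses the standard library plus NLTK and assumes the NLTK stopwords and wordnet corpora have already been downloaded with nltk.download('stopwords') and nltk.download('wordnet').

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The runners can't stop running in 2023!  See https://example.com for details."

text = text.lower()                                    # lowercasing
text = re.sub(r"https?://\S+", " ", text)              # remove irrelevant information (URLs)
text = text.replace("can't", "cannot")                 # expand a contraction (toy example)
text = re.sub(r"\d+", " ", text)                       # remove numbers
text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
text = re.sub(r"\s+", " ", text).strip()               # normalize whitespace

tokens = text.split()                                  # word tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]    # stop word removal

lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # lemmatization ("running" -> "run")
print(tokens)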
Tokenization
Word Tokenization
In [5]: text = """There are multiple ways we can perform tokenization on given text data.
We can choose any method based on language, library and purpose of modeling."""
tokens = text.split()
print(tokens)
['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language,', 'library', 'and', 'purpose', 'of', 'modeling.']
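As the output shows, a plain whitespace split() leaves punctuation attached to neighbouring words ('data.', 'language,'). A quick workaround, added here as a sketch rather than taken from the original notebook, is to strip punctuation from each token; the RegEx and library approaches below handle this more cleanly.

import string

text = """There are multiple ways we can perform tokenization on given text data.
We can choose any method based on language, library and purpose of modeling."""

# Strip leading/trailing punctuation from each whitespace-separated token.
tokens = [tok.strip(string.punctuation) for tok in text.split()]
print(tokens)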
Sentence Tokenization
In [12]: text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""
line = text.split(". ")
line
Out[12]:
['Characters like periods, exclamation point and newline char are used to separate the sentences',
 "But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."]
Tokenization Using RegEx
In [14]: import re
text = """There are multiple ways we can perform tokenization on given text data.
We can choose any method based on language, library and purpose of modeling."""
tokens = re.findall(r"[\w]+", text)
print(tokens)
['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', 'library', 'and', 'purpose', 'of', 'modeling']
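The pattern [\w]+ keeps only runs of word characters, so all punctuation is silently dropped. If punctuation should survive as separate tokens (closer to what NLTK produces below), a slightly extended pattern works; this variation is an addition for illustration, not part of the original notebook.

import re

text = """There are multiple ways we can perform tokenization on given text data.
We can choose any method based on language, library and purpose of modeling."""

# \w+ matches runs of word characters; [^\w\s] matches any single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)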
Sentence Tokenization
In [17]: text = """Characters like periods, exclamation point and newline char are used to separate the sentences.But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""
tokens_sent = re.compile('[.!?] ').split(text)
tokens_sent
Out[17]:
['Characters like periods, exclamation point and newline char are used to separate the sentences.But one drawback with split() method, that we can only use one separator at a time',
 "So sentence tokenization won't be foolproof with split() method."]
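Splitting on '[.!?] ' discards the matched punctuation and misses any boundary that lacks a trailing space (note the unsplit "sentences.But" above). A lookbehind-based pattern, added here as a sketch rather than taken from the original notebook, keeps the punctuation attached to each sentence.

import re

text = ("Characters like periods, exclamation point and newline char are used to "
        "separate the sentences. But one drawback with split() method, that we can "
        "only use one separator at a time! So sentence tokenization won't be "
        "foolproof with split() method.")

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)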
Tokenization Using NLTK
Word Tokenization
In [18]: from nltk.tokenize import word_tokenize
text = """There are multiple ways we can perform tokenization on given text data. W
tokens = word_tokenize(text)
print(tokens)
['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']
Sentence Tokenization
In [20]: from nltk.tokenize import sent_tokenize
text = """There are multiple ways we can perform tokenization on given text data. W
tokens = sent_tokenize(text)
print(tokens)
['There are multiple ways we can perform tokenization on given text data.', 'We can choose any method based on language, library and purpose of modeling.']
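Both word_tokenize and sent_tokenize rely on NLTK's pre-trained Punkt models, which are not bundled with the library itself. If the calls above raise a LookupError, the models can be downloaded once per environment; this setup step is not shown in the original notebook.

import nltk

# One-time download of the Punkt models used by word_tokenize and sent_tokenize.
# Recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')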
Tokenization Using spaCy
Word Tokenization
In [23]: from spacy.lang.en import English
nlp = English()
text = """There are multiple ways we can perform tokenization on given text data. W
doc = nlp(text)
token = []
for tok in doc:
token.append(tok)
print(token)
[There, are, multiple, ways, we, can, perform, tokenization, on, given, text, data, ., We, can, choose, any, method, based, on, language, ,, library, and, purpose, of, modeling, .]
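The items printed above are spaCy Token objects rather than plain strings, which is why they appear without quotes. If string tokens are needed, the .text attribute of each token can be used; this small variation is an addition for illustration, not part of the original notebook.

from spacy.lang.en import English

nlp = English()
doc = nlp("There are multiple ways we can perform tokenization on given text data.")

# Token.text returns the underlying string of each Token object.
str_tokens = [tok.text for tok in doc]
print(str_tokens)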
Sentence Tokenization
In [32]: nlp = English()
nlp.add_pipe('sentencizer')
text = """Characters like periods, exclamation point and newline char are used to s
doc = nlp(text)
sentence_list =[]
for sentence in doc.sents:
sentence_list.append(sentence.text)
print(sentence_list)
['Characters like periods, exclamation point and newline char are used to separate the sentences.',
 'But one drawback with split() method, that we can only use one separator at a time!',
 "So sentence tokenization won't be foolproof with split() method."]
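The sentencizer added here is a purely rule-based component that splits on sentence-final punctuation. Where punctuation is noisy or missing, a full pretrained pipeline can derive sentence boundaries from the dependency parse instead. The sketch below is an added illustration and assumes the small English model has been installed with "python -m spacy download en_core_web_sm".

import spacy

# Statistical pipeline: sentence boundaries come from the dependency parse
# rather than a punctuation rule (assumes en_core_web_sm is installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Characters like periods are used to separate sentences. "
          "But punctuation is not always a reliable signal.")
print([sent.text for sent in doc.sents])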
Tokenization Using Keras
Word Tokenization
In [33]: from keras.preprocessing.text import text_to_word_sequence
text = """There are multiple ways we can perform tokenization on given text data. W
tokens = text_to_word_sequence(text)
print(tokens)
['there', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'we', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', 'library', 'and', 'purpose', 'of', 'modeling']
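By default text_to_word_sequence lowercases the text and strips most punctuation, which is why the commas and periods seen in earlier outputs are gone here. Both behaviours are controlled by the lower and filters arguments; the variation below, added for illustration, keeps the original casing.

from keras.preprocessing.text import text_to_word_sequence

text = "There are multiple ways we can perform tokenization on given text data."
# lower=False preserves case; the default filters still remove punctuation.
tokens = text_to_word_sequence(text, lower=False)
print(tokens)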
Sentence Tokenization
In [34]: from keras.preprocessing.text import text_to_word_sequence
text = """Characters like periods, exclamation point and newline char are used to s
text_to_word_sequence(text, split= ".", filters="!.\n")
Out[34]:
['characters like periods, exclamation point and newline char are used to separate the sentences',
 ' but one drawback with split() method, that we can only use one separator at a time',
 " so sentence tokenization won't be foolproof with split() method"]