Text Tokenization
Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction that focuses on programming computers to process and analyze large amounts of natural language data. It is a hard problem because reading and understanding language is far more complex than it seems at first glance. Tokenization is the process of splitting a string of text into a list of tokens. One can think of tokens as parts of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph, as the sketch below illustrates. Key points of the article –
Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
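Before walking through each case, here is a minimal sketch that shows both levels at once (it assumes nltk is installed and the punkt model has been downloaded):
Python3
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # uncomment on first run to fetch the punkt model

paragraph = "Hello everyone. Welcome to GeeksforGeeks."

sentences = sent_tokenize(paragraph)  # sentences are the tokens of a paragraph
words = word_tokenize(sentences[0])   # words are the tokens of a sentence

print(sentences)  # ['Hello everyone.', 'Welcome to GeeksforGeeks.']
print(words)      # ['Hello', 'everyone', '.']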
Code #1: Sentence Tokenization – Splitting a paragraph into sentences.
Python3
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP
article"
sent_tokenize(text)
Output :
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
How sent_tokenize works: the sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which comes pre-trained and therefore knows which characters and punctuation mark the beginning and end of a sentence. Code #2: PunktSentenceTokenizer – When we have huge chunks of data, it is efficient to load the tokenizer once and reuse it.
Python3
import nltk.data
# Loading PunktSentenceTokenizer using English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)
Output :
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
Code #3: Tokenizing sentences in a different language – One can also tokenize sentences in other languages by loading a pickle file other than the English one.
Python3
import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output :
['Hola amigo.',
'Estoy bien.']
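As a convenience, sent_tokenize also accepts a language argument and loads the matching punkt pickle internally, so the sketch below (assuming the Spanish punkt model is available) is equivalent to loading the pickle by hand:
Python3
from nltk.tokenize import sent_tokenize

text = 'Hola amigo. Estoy bien.'

# Loads the Spanish punkt model under the hood
sent_tokenize(text, language='spanish')
# ['Hola amigo.', 'Estoy bien.']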
Code #4: Word Tokenization – Splitting a sentence into words.
Python3
from nltk.tokenize import word_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)
Output :
['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks',
'.']
How word_tokenize works: word_tokenize() is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer class. Code #5: Using TreebankWordTokenizer
Python3
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)
Output :
['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']
These tokenizers work by separating words using punctuation and spaces. As the outputs above show, they do not discard the punctuation, which leaves the user free to decide what to do with it at pre-processing time; the sketch below shows one common choice.
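For example, a common pre-processing choice is to simply drop the punctuation tokens after tokenizing. The filter below is just one possible rule, a minimal sketch:
Python3
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
tokens = word_tokenize(text)

# Keep only tokens containing at least one alphanumeric character,
# discarding pure punctuation tokens such as '.'
words_only = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
print(words_only)  # ['Hello', 'everyone', 'Welcome', 'to', 'GeeksforGeeks']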
Code #6: PunktWordTokenizer – It does not separate all punctuation from the words; for instance, the apostrophe stays attached to the contraction suffix ("'s"). Note that this class is no longer available in recent NLTK releases, so the example below requires an older version; a modern alternative is sketched after the output.
Python3
from nltk.tokenize import PunktWordTokenizer
tokenizer = PunktWordTokenizer()
[Link]("Let's see how it's working.")
Output :
['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']
Code #7: WordPunctTokenizer – It separates all punctuation from the words, including the apostrophe in contractions.
Python3
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
[Link]("Let's see how it's working.")
Output :
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working',
'.']
Code #8: Using Regular Expressions with RegexpTokenizer
Python3
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Let's see how it's working."
tokenizer.tokenize(text)
Output :
["Let's", 'see', 'how', "it's", 'working']
Code #9: Using Regular Expressions with the regexp_tokenize() function
Python3
from nltk.tokenize import regexp_tokenize
text = "Let's see how it's working."
regexp_tokenize(text, r"[\w']+")
Output :
["Let's", 'see', 'how', "it's", 'working']