Unit 1 - Tokenisation Text
Text Tokenisation

Natural Language Processing (NLP) is a subfield of computer
science, artificial intelligence, information engineering, and human-
computer interaction. The field focuses on how to program
computers to process and analyse large amounts of natural
language data. This is difficult because reading and understanding
a language is far more complex than it seems at first glance.
Tokenization is the process of splitting a string or text into a list of
tokens. A token can be thought of as a part of a larger unit: a word
is a token in a sentence, and a sentence is a token in a paragraph.
Key points of the article –
 Tokenizing text into sentences
 Tokenizing sentences into words
 Tokenizing sentences using regular expressions

Code #1: Sentence Tokenization – splitting a paragraph into sentences.
 Python3

from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP


article"

sent_tokenize(text)

Output :
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
How sent_tokenize works? The sent_tokenize function uses an
instance of PunktSentenceTokenizer from the nltk.tokenize.punkt
module, which has already been trained and therefore knows very
well at which characters and punctuation marks a sentence begins
and ends. Code #2: PunktSentenceTokenizer – when we have huge
chunks of data, it is efficient to load the tokenizer once and reuse it.
 Python3

import nltk.data

# Loading PunktSentenceTokenizer using English pickle file

tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

tokenizer.tokenize(text)

Output :
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
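
Since the point of Code #2 is reuse on huge chunks of data, here is a minimal sketch of loading the pickled tokenizer once and reusing the same instance across several documents (the document list below is invented purely for illustration, and the punkt data must already be downloaded):
 Python3

import nltk.data

# Load the pre-trained English Punkt tokenizer only once
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

# Hypothetical list of documents; the same tokenizer instance is reused for each
documents = ["Hello everyone. Welcome to GeeksforGeeks.",
             "You are studying NLP article. Keep practising."]

for doc in documents:
    print(tokenizer.tokenize(doc))

Output :
['Hello everyone.', 'Welcome to GeeksforGeeks.']
['You are studying NLP article.', 'Keep practising.']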

Code #3: Tokenize sentences in different languages – one can also
tokenize sentences in languages other than English by loading the
corresponding pickle file.
 Python3

import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'

spanish_tokenizer.tokenize(text)

Output :
['Hola amigo.',
'Estoy bien.']
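
Alternatively, the same result can be obtained without loading the pickle file by hand: sent_tokenize accepts a language argument and loads the matching Punkt model internally. A minimal sketch, assuming the punkt data has been downloaded:
 Python3

from nltk.tokenize import sent_tokenize

text = 'Hola amigo. Estoy bien.'

# sent_tokenize loads the Spanish Punkt model behind the scenes
sent_tokenize(text, language='spanish')

Output :
['Hola amigo.',
'Estoy bien.']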

Code #4: Word Tokenization – splitting a sentence into words.
 Python3

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."

word_tokenize(text)

Output :
['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks',
'.']

How word_tokenize works? The word_tokenize() function is a
wrapper function that calls tokenize() on an instance of the
TreebankWordTokenizer class. Code #5: Using
TreebankWordTokenizer
 Python3

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize(text)

Output :
['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

These tokenizers work by separating the words using punctuation
and spaces. As the outputs above show, they do not discard the
punctuation, which lets the user decide what to do with it at
pre-processing time, as in the sketch below.
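
As a minimal sketch of such pre-processing (the filtering step below is our own illustration, not an NLTK feature), punctuation-only tokens can simply be dropped after tokenization:
 Python3

from nltk.tokenize import word_tokenize
import string

text = "Hello everyone. Welcome to GeeksforGeeks."

# Keep only tokens that are not purely punctuation characters
tokens = [t for t in word_tokenize(text) if t not in string.punctuation]
print(tokens)

Output :
['Hello', 'everyone', 'Welcome', 'to', 'GeeksforGeeks']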

Code #6: PunktWordTokenizer – it does not separate the
punctuation from the words; note how "'s" stays attached as a
single token below. (PunktWordTokenizer is only available in older
NLTK releases and has since been removed.)
 Python3

# PunktWordTokenizer is available only in older NLTK releases
from nltk.tokenize import PunktWordTokenizer

tokenizer = PunktWordTokenizer()

[Link]("Let's see how it's working.")

Output :
['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']
Code #7: WordPunctTokenizer – it separates the punctuation
from the words.
 Python3

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

[Link]("Let's see how it's working.")

Output :
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working',
'.']
Code #8: Using a Regular Expression with RegexpTokenizer
 Python3

from nltk.tokenize import RegexpTokenizer


tokenizer = RegexpTokenizer("[\w']+")

text = "Let's see how it's working."

tokenizer.tokenize(text)

Output :
["Let's", 'see', 'how', "it's", 'working']
Code #9: Using the regexp_tokenize() helper with a Regular Expression
 Python3

from nltk.tokenize import regexp_tokenize

text = "Let's see how it's working."

regexp_tokenize(text, "[\w']+")

Output :
["Let's", 'see', 'how', "it's", 'working']
