MVJ COLLEGE OF ENGINEERING, BENGALURU-560067
(Autonomous Institution Affiliated to VTU, Belagavi)
DEPARTMENT OF DATA SCIENCE
CERTIFICATE
Certified that the minor project work titled ‘Practical Applications of NLP: Corpus
Exploration, POS Tagging, and Word Segmentation Using NLTK’ is carried out by AYAN
AMANULLAH KHAN(1MJ22CD010), JOY GRAS(1MJ22CD018), K VARUN
KUMAR(1MJ22CD019), LINGUTLA CHAITHANYA(1MJ22CD026), PARIKSHITH V
M(1MJ22CD038), who are bona fide students of MVJ College of Engineering, Bengaluru, in
partial fulfilment for the award of Degree of Bachelor of Engineering in Data Science of the
Visvesvaraya Technological University, Belagavi during the year 2022-2026. It is certified that
all corrections/suggestions indicated for the Internal Assessment have been incorporated in the
report deposited in the departmental library. The report has been approved as it satisfies the
academic requirements in respect of the assignment prescribed by the institution for the said Degree.
Signature of Guide Signature of Head of the Department Signature of Principal
Dr.________________ Dr.__________________ Dr. Ajayan K R
External Viva
Name of Examiners Signature with Date
MVJ COLLEGE OF ENGINEERING, BENGALURU-560067
(Autonomous Institution Affiliated to VTU, Belagavi)
DEPARTMENT OF DATA SCIENCE
DECLARATION
We, AYAN AMANULLAH KHAN, JOY GRAS, K VARUN KUMAR, LINGUTLA
CHAITHANYA, PARIKSHITH V M, students of the sixth semester B.E., Department of Data
Science, MVJ College of Engineering, Bengaluru, hereby declare that the assignment titled
‘Practical Applications of NLP: Corpus Exploration, POS Tagging, and Word
Segmentation Using NLTK’ has been carried out by us and submitted in partial fulfilment for
the award of Degree of Bachelor of Engineering in Data Science during the year 2024-25.
Further, we declare that the content of the dissertation has not been submitted previously by
anybody for the award of any Degree or Diploma to any other University.
We also declare that any Intellectual Property Rights generated out of this project carried out at
MVJCE will be the property of MVJ College of Engineering, Bengaluru and we will be one of
the authors of the same.
Place: Bengaluru
Date:
Name Signature
1. AYAN AMANULLAH KHAN
2. JOY GRAS
3. K VARUN KUMAR
4. LINGUTLA CHAITHANYA
5. PARIKSHITH V M
ACKNOWLEDGEMENT
We are indebted to our guide, Prof. Rekha P, Professor, Dept. of Data Science, MVJ College of
Engineering, for her wholehearted support, suggestions, and invaluable advice throughout our
assignment, and also for her help in the preparation of this report.
We also express our gratitude to our panel members Prof. Lubi E, Associate Professor, and
Prof. Victoria, Department of Data Science, for their valuable comments and suggestions.
Our sincere thanks to Prof. Rekha P, Associate Professor and Head, Department of Data
Science, MVJCE, for her support and encouragement.
We express sincere gratitude to our beloved Principal, Dr. Ajayan for all his support towards
this assignment.
Lastly, we take this opportunity to thank our family members and friends who provided all the
backup support throughout the assignment.
ABSTRACT
This assignment explores the foundational techniques of Natural Language Processing (NLP)
through practical applications of information retrieval and part-of-speech (POS) tagging using
Python's Natural Language Toolkit (NLTK). The study begins with an in-depth examination of
various standard corpora such as Brown, Inaugural, Reuters, and Universal Declaration of
Human Rights (UDHR), highlighting methods like fileids(), raw(), words(), sents(), and categories().
Additionally, it demonstrates the creation and utilization of custom corpora in plaintext and
categorized formats. The project further investigates Conditional Frequency Distributions to
analyse word distributions across categories. Emphasis is placed on understanding tagged
corpora using the tagged_words() and tagged_sents() methods, followed by identifying the most
frequent noun tags. The assignment also showcases mapping words to properties using Python
dictionaries and implements basic POS taggers, including rule-based and unigram taggers.
Finally, it addresses a word segmentation task, where words are extracted from a continuous
string of characters by referencing a predefined corpus and assigning scores based on word
likelihood. This comprehensive study bridges theory and practical implementation, reinforcing
key NLP concepts crucial for text analysis and language modelling.
ACRONYMS
Acronym Expansion
USN University Seat Number
SEM Semester
NLTK Natural Language Toolkit
NLP Natural Language Processing
TABLE OF CONTENTS
Certificate
Declaration
Acknowledgement
Abstract
Acronyms
Chapter 1
1. Introduction
1.1 Aim
1.2 Motivation
1.3 Problem Statement
1.4 Existing System
1.5 Proposed System
1.6 Organization of the Report
Chapter 2
2. Literature Survey
Chapter 3
3. System Requirement and Specification
3.1 Hardware Requirement
3.2 Software Requirement
3.3.1 Functional Requirement
3.3.2 Non-Functional Requirement
Chapter 4
4. System Design
4.1 System Architecture
4.2 Data Flow Diagram
4.3 Flowchart
4.4 Modules
4.5 Activity Diagram
4.6 Sequence Diagram
5. Conclusion
6. References
Practical Applications of NLP: Corpus Exploration, POS Tagging, and Word
Segmentation Using NLTK
CHAPTER-1
INTRODUCTION
Department of Data Science 2024-25 Page No.1
In the digital age, vast amounts of textual data are generated every second, and
extracting meaningful information from this unstructured text is a central challenge
in computer science. Natural Language Processing (NLP) addresses this
challenge by combining computational linguistics with machine learning and
information retrieval techniques. One of the key components of NLP is
Information Retrieval (IR)—the process of searching and extracting relevant data
from large collections of text. This assignment provides a hands-on exploration of
IR concepts through the lens of NLP, using the powerful NLTK (Natural
Language Toolkit) in Python.
The assignment begins with the study of various standard corpora such as the
Brown Corpus, Inaugural Addresses, Reuters News Corpus, and the Universal
Declaration of Human Rights (UDHR). These corpora represent a wide range of
text genres, from news and literature to historical speeches and multilingual
declarations. By examining their structure using methods like words(), sents(),
raw(), and categories(), we gain a better understanding of how real-world text is
organized and processed.
Next, we delve into custom corpora creation, both in simple plaintext format and
with labelled categories. This mirrors how NLP practitioners often need to prepare
domain-specific data before applying machine learning models. The use of
Conditional Frequency Distributions (CFDs) allows us to analyse how
frequently words appear under different conditions, such as in different genres or
categories, providing insights into linguistic patterns and trends.
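The idea behind a CFD can be illustrated without downloading any corpus: a conditional frequency distribution is simply one frequency count per condition. The sketch below uses a small hand-made (category, word) sample and collections.Counter in place of NLTK's ConditionalFreqDist; the categories and words are illustrative, not drawn from a real corpus.

```python
from collections import Counter, defaultdict

# Small hand-made sample standing in for the (category, word) pairs
# that would normally come from a corpus such as Brown.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

# A conditional frequency distribution is one Counter per condition.
cfd = defaultdict(Counter)
for category, word in pairs:
    cfd[category][word] += 1

print(cfd["news"]["the"])             # 2
print(cfd["romance"].most_common(1))  # [('love', 2)]
```

NLTK's ConditionalFreqDist accepts exactly such a sequence of (condition, sample) pairs, so the same loop collapses to a single constructor call when the library is available.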
An essential part of NLP is understanding grammatical structure, which we
approach through Part-of-Speech (POS) tagging. By exploring tagged corpora and
utilizing functions like tagged_words() and tagged_sents(), we can analyse how
words function in different contexts. A dedicated program is implemented to find
the most frequent noun tags, emphasizing their importance in extracting key
subjects from text.
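As a minimal sketch of this step, the count can be done with collections.Counter; the hand-tagged sample below is an illustrative stand-in for the (word, tag) pairs that nltk.corpus.brown.tagged_words() would return, using Brown-style tags.

```python
from collections import Counter

# A small hand-tagged sample in (word, tag) form; in the assignment this
# list would come from a tagged corpus such as Brown.
tagged_words = [
    ("The", "AT"), ("dog", "NN"), ("dogs", "NNS"), ("ran", "VBD"),
    ("Paris", "NP"), ("cat", "NN"), ("quickly", "RB"), ("cat", "NN"),
]

# Keep only noun tags (tags starting with 'NN' or 'NP' in the Brown tagset).
noun_tags = Counter(tag for _, tag in tagged_words
                    if tag.startswith("NN") or tag.startswith("NP"))
print(noun_tags.most_common(2))  # [('NN', 3), ('NNS', 1)]
```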
Additionally, we explore how Python dictionaries can be used to map words to
their properties—laying the groundwork for more complex NLP tasks like semantic
analysis or feature engineering. The study of Rule-Based and Unigram Taggers
further reinforces the understanding of POS tagging, contrasting deterministic
methods with probabilistic models.
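A minimal sketch of such a word-to-property mapping, using illustrative words and tags rather than data from a real corpus:

```python
# Map each word to a small property record (length, POS tag, frequency).
# The tags here are illustrative, not drawn from a tagged corpus.
words = ["run", "dog", "run", "quickly"]
pos = {"run": "verb", "dog": "noun", "quickly": "adverb"}

properties = {}
for w in words:
    # setdefault creates the record on first sight, then reuses it.
    entry = properties.setdefault(w, {"length": len(w), "pos": pos[w], "freq": 0})
    entry["freq"] += 1

print(properties["run"])  # {'length': 3, 'pos': 'verb', 'freq': 2}
```

Such a dictionary can then serve directly as a feature table for downstream tasks like the taggers discussed above.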
A particularly interesting problem is addressed in the final section: breaking a
continuous string of text (without spaces) into meaningful words using a reference
corpus and assigning scores to the possible word combinations. This simulates
word segmentation, a critical step in languages like Chinese and a common
problem in NLP preprocessing.
Overall, this assignment offers a comprehensive introduction to the practical
aspects of information retrieval in NLP, enabling learners to build foundational
skills in text processing, analysis, and interpretation using real-world datasets and
tools.
1.1 Aim
To explore and demonstrate the use of Information Retrieval techniques in Natural
Language Processing (NLP) using the NLTK toolkit by studying standard corpora
(Brown, Inaugural, Reuters, UDHR), creating custom corpora, analysing
conditional frequency distributions, examining tagged corpora, implementing
taggers (Rule-based and Unigram), mapping words to properties using dictionaries,
and segmenting text without spaces using a reference corpus for word scoring.
1.2 Motivation
Natural Language Processing (NLP) plays a crucial role in enabling machines to
understand and process human language. One of the fundamental tasks in NLP is
information retrieval, which involves extracting meaningful patterns, structures,
and insights from large volumes of text data. This assignment is designed to
provide hands-on experience with core NLP techniques and tools using real-world
text corpora and Python-based processing methods.
Through this assignment, you'll:
Explore diverse corpora (Brown, Inaugural, Reuters, UDHR) to
understand the variety and complexity of language in different contexts —
from news articles to political speeches.
Create your own corpora, simulating the tasks of preparing domain-
specific datasets for NLP applications like sentiment analysis or document
classification.
Use tools like conditional frequency distributions to study how word
usage varies across genres or categories — a key skill in information
retrieval and text analytics.
Tag parts of speech (POS) in corpora to understand grammatical structure,
which is essential for downstream NLP tasks like named entity recognition
or question answering.
Identify the most frequent noun tags, providing insights into subject
matter focus.
Practice mapping words to properties, which is useful in sentiment
scoring, feature engineering, and semantic analysis.
Explore rule-based and statistical taggers like the Unigram Tagger to
learn how machines assign POS tags and structure to unstructured text.
Tackle the challenge of word segmentation in unspaced text — a critical
task in languages without word boundaries (e.g., Chinese, Thai) or in
OCR/typo correction systems. This mirrors real-world scenarios in search
engines and autocorrect features.
By the end of this assignment, you will have a solid foundation in retrieving,
analyzing, and tagging text — equipping you with the skills to build intelligent
language-based applications like chatbots, search engines, or language translation
tools.
1.3 Problem Statement
The goal of this assignment is to understand and demonstrate the fundamentals of
Information Retrieval and Part-of-Speech (POS) Tagging in Natural Language
Processing (NLP) using the Natural Language Toolkit (NLTK) in Python.
Objectives:
1. Explore Predefined Corpora:
Study and explore various corpora such as Brown, Inaugural,
Reuters, and UDHR using methods like fileids(), raw(), words(),
sents(), and categories().
2. Create and Use Custom Corpora:
Create and utilize your own corpora using plaintext and categorical
formats.
3. Analyze Conditional Frequency Distributions:
Generate and interpret Conditional Frequency Distributions
(CFD) based on text data to understand word usage across different
categories or contexts.
4. Work with Tagged Corpora:
Study tagged corpora using methods such as tagged_sents() and
tagged_words() to explore how words are annotated with POS tags.
5. POS Tagging Analysis:
Write a program to extract and analyze the most frequent noun
tags from a tagged corpus.
6. Dictionary Mapping:
Implement Python dictionaries to map words to properties (e.g.,
word length, frequency, POS tags).
7. Implement Tagging Models:
Study and implement both Rule-based taggers and Unigram
taggers using NLTK.
8. Word Segmentation Task:
Write a program to segment a string of characters (without
spaces) into valid words using a given corpus, and compute the
score or probability of different segmentations.
1.4 Existing System
The existing system typically uses built-in corpus tools provided by libraries like
NLTK (Natural Language Toolkit). These tools allow exploration of well-known
corpora such as Brown, Reuters, and Inaugural addresses. These systems are often
used in NLP education and research for:
Simple corpus access
Tagging using pretrained taggers
Basic frequency distributions
Basic string comparison tasks
1.5 Proposed System
The proposed system builds upon the existing tools but adds customization,
interactivity, and real-world utility. It introduces user-defined corpora,
rule-based taggers, and a scoring-based word segmentation system.
1.6 Organization of the Report
Chapter 1: Introduction
Objective of the assignment
Overview of Information Retrieval in NLP
Tools used: Python, NLTK (Natural Language Toolkit)
Chapter 2: Exploring Standard Corpora
Overview of Corpora in NLTK
o Brown Corpus
o Inaugural Corpus
o Reuters Corpus
o Universal Declaration of Human Rights (UDHR)
Using Methods
o fileids()
o raw()
o words()
o sents()
o categories()
Code Examples and Outputs for each method on different corpora
Chapter 3: Creating and Using Custom Corpora
Plaintext Corpus
o How to create and load using PlaintextCorpusReader
Categorical Corpus
o Organizing files by categories
o Loading using CategorizedPlaintextCorpusReader
Code Snippets and Sample Files
Chapter 4: Conditional Frequency Distributions
Introduction to ConditionalFreqDist
Example with Brown or Reuters Corpus
o Frequency of words by category
Plotting distributions
Chapter 5: Working with Tagged Corpora
Tagged Sentences and Words
o tagged_sents()
o tagged_words()
Frequency of POS Tags
o Finding the most frequent noun tags (e.g., NN, NNP)
Code to extract and count tags
Chapter 6: Mapping Words to Properties
Using Python Dictionaries
o Mapping words to POS tags or categories
o Example: {'dog': 'noun', 'run': 'verb'}
Applications in NLP
Chapter 7: POS Tagging Techniques
Rule-Based Tagger
o Simple patterns using RegexpTagger
Unigram Tagger
o Training with tagged corpus
o Evaluation using accuracy()
Comparison and Results
Chapter 8: Word Segmentation from Plain Text
Problem Statement
o Plain text without spaces (e.g., itisanexample)
Using a Given Corpus of Words
o Comparing with a dictionary
Scoring Words
o Use of frequency or probability
Implementation and Output
CHAPTER-2
LITERATURE SURVEY
Topic: Corpus Overview
Description: The Brown, Inaugural, Reuters, and UDHR corpora are widely used for training models and conducting experiments. They contain diverse datasets in English and other languages.
Reference: Brown Corpus

Topic: Methods: fileids, raw, words, sents, categories
Description: These methods help access different components of a corpus: fileids() provides the document IDs, raw() retrieves all the text, words() gives a list of words, sents() returns a list of sentences, and categories() provides the different genres.
Reference: NLTK Documentation

Topic: Custom Corpora Creation
Description: Creating and using your own corpus allows handling domain-specific data or categorical datasets. This is done by defining a structure and incorporating textual data into it.
Reference: NLTK Custom Corpus

Topic: Conditional Frequency Distributions (CFD)
Description: Conditional frequency distributions calculate the frequency of words conditional on certain conditions, providing insights into contextual word usage.
Reference: NLTK CFD

Topic: Tagged Corpora and Methods
Description: Tagged corpora include annotations like part-of-speech (POS) tags. Methods like tagged_sents() and tagged_words() allow accessing sentences and words with their respective POS tags.
Reference: Tagged Corpora

Topic: Most Frequent Noun Tags Program
Description: A program to find the most frequent noun tags from a tagged corpus. This is useful for linguistic analysis to identify noun-heavy sentences.
Reference: NLTK POS Tagging

Topic: Mapping Words to Properties using Python Dictionaries
Description: Using dictionaries to map words to various properties like frequency or category. This is a foundational concept in NLP.
Reference: Python Dictionary in NLP

Topic: Rule-Based Tagger
Description: Rule-based taggers assign POS tags based on predefined rules. They are useful for improving accuracy on unknown words or ambiguous cases.
Reference: NLTK Rule-Based Tagging

Topic: Unigram Tagger
Description: A unigram tagger assigns each word the tag it most frequently received in the training data, providing a baseline for more complex taggers.
Reference: Unigram Tagger

Topic: Evaluation Metrics for NLP
Description: Evaluation metrics such as precision, recall, and F1-score help assess the performance of information retrieval and other NLP tasks.
Reference: NLP Evaluation Metrics
CHAPTER-3
SYSTEM REQUIREMENT AND SPECIFICATION
3.1 Hardware Requirement
Minimum System Requirements
Processor (CPU): Dual-core processor (e.g., Intel i3 or AMD Ryzen 3)
RAM: 4 GB
Storage: At least 10 GB free (to store Python environment, NLTK corpora,
and results)
Operating System: Windows 10, Ubuntu 20.04+, or macOS 10.15+
Python Version: Python 3.7 or above
Internet Connection: Required initially for downloading corpora via
NLTK
Recommended System Requirement
Processor (CPU): Quad-core processor (e.g., Intel i5/i7 or AMD Ryzen 5/7)
RAM: 8 GB or more (recommended especially for using multiple corpora
simultaneously)
Storage: SSD with at least 20 GB free for faster read/write speeds
Graphics (GPU): Not required (since this is not deep learning-based)
Operating System: Any modern OS (Linux preferred for better performance with
Python tools)
3.2 Software Requirement
1. Python (version 3.8 or above)
o Primary programming language for implementing NLP techniques.
2. NLTK (Natural Language Toolkit) Library
o Required for accessing standard corpora like Brown, Inaugural,
Reuters, and UDHR.
o Used for working with tagged corpora, frequency distributions,
taggers, etc.
3. NLTK Corpora Data
o Downloadable via [Link]():
brown
inaugural
reuters
udhr
punkt (for tokenization)
averaged_perceptron_tagger (for POS tagging)
universal_tagset (optional, for simplified tags)
4. Jupyter Notebook / Google Colab / Any Python IDE
o For writing, running, and documenting code interactively.
o IDEs like PyCharm, VSCode, or Anaconda are also suitable.
5. Operating System
o Windows, macOS, or Linux (any OS that supports Python and
NLTK).
6. Text Editor (Optional)
o For creating custom plain text or categorized corpora files (e.g.,
Notepad++, VSCode).
7. Memory Requirements
o At least 4 GB RAM (for working with medium-sized corpora
efficiently).
8. Internet Access
o Required to download NLTK datasets and resources if not already
available offline.
9. Matplotlib or Seaborn (Optional for Visualization)
o For visualizing frequency distributions or tag patterns (if needed).
10. Basic Knowledge of NLP Concepts
Understanding of tokenization, tagging, corpus usage, frequency
distributions, and tagging models.
3.3.1 Functional Requirement
1. Corpus Study
The system shall allow the user to load and explore built-in corpora:
o Brown, Inaugural, Reuters, UDHR
The system shall provide methods to:
o View available fields (e.g., fileids, categories)
o Retrieve raw text from a corpus
o Tokenize text into words, sentences
o Filter and display categories (where applicable)
2. Custom Corpus Creation and Usage
The system shall support creation of:
o A plain text corpus from a user-defined directory
o A categorized corpus where files belong to labeled categories
The system shall tokenize and display words, sents, or fileids from the
custom corpus
3. Conditional Frequency Distribution
The system shall compute and display the Conditional Frequency
Distribution:
o Frequency of words conditioned on categories or fileids
o Frequency of word-tag combinations in tagged corpora
4. Tagged Corpora Exploration
The system shall load pre-tagged corpora (e.g., Brown tagged corpus)
The system shall allow access to:
o tagged_sents – list of tagged sentences
o tagged_words – list of (word, tag) pairs
5. Most Frequent Noun Tags
The system shall extract and count parts of speech tags
The system shall identify and display the most frequent noun tags (e.g., NN,
NNS)
6. Word to Property Mapping Using Python Dictionaries
The system shall create a dictionary mapping words to specific properties
such as:
o Length
o Frequency
o POS tag
The user shall be able to query this dictionary for given words
7. POS Taggers
The system shall implement and demonstrate:
o A Rule-Based Tagger (using regular expressions or predefined rules)
o A Unigram Tagger (trained on tagged corpus data)
8. Word Segmentation (Word Break Problem)
The system shall:
o Accept a plain text string with no spaces
o Use a corpus-based dictionary to identify valid words in the string
o Segment the string into possible words
o Calculate and display scores (e.g., based on frequency or likelihood)
3.3.2 Non-Functional Requirement
Performance
The system should process and analyze corpora data (e.g., Brown,
Reuters) within a reasonable time (under 5 seconds for standard corpus
operations).
Tagging and frequency distribution operations should be optimized to
handle medium-size datasets (up to 10,000 words) efficiently.
Scalability
The program should be able to scale to handle additional corpora or user-
defined datasets without significant changes to the code.
Modular design should allow easy integration of new taggers or corpora.
Usability
The interface (CLI or GUI) should be simple and intuitive for students or
researchers with basic NLP knowledge.
Clear documentation and help messages should be provided for each
function.
Maintainability
The code should follow standard coding conventions (e.g., PEP8 for
Python) and be well-commented.
Functions should be modular to facilitate future updates or modifications.
Portability
The solution should work on major platforms like Windows, Linux, and
macOS with minimal configuration.
Dependencies should be managed using [Link] or pipenv.
Accuracy
The POS tagging and word segmentation should yield accurate results based
on standard NLP libraries like NLTK or spaCy.
Use trusted corpora like Brown and Reuters for evaluation to ensure
linguistic accuracy.
Reliability
The system should not crash or behave unexpectedly when working with
empty or malformed corpora.
Error handling should be in place for missing files or corrupted input.
Reusability
The corpora handling, tagging, and frequency distribution logic should be
implemented in reusable functions or classes.
Code modules should be general enough to be reused in other NLP tasks.
Security
If user-defined corpora are uploaded, ensure no arbitrary code execution
occurs (e.g., no eval() on user input).
Sanitize file inputs and restrict to plaintext only.
CHAPTER 4
SYSTEM DESIGN
4.1 System Architecture
1. Corpus Exploration Module
Use NLTK corpora: brown, reuters, inaugural, udhr
Functions: .words(), .sents(), .categories(), .fileids()
2. Custom Corpus Module
Use PlaintextCorpusReader for unstructured files.
Use CategorizedPlaintextCorpusReader with categories.
3. Conditional Frequency Distribution
[Link](): useful to study word usage per category or tag.
4. Tagged Corpus Analysis
Use [Link].tagged_words() and tagged_sents()
Count most frequent noun tags using tag patterns (e.g., NN, NNS, NNP, etc.)
5. POS Taggers
Implement basic [Link], [Link], RegexpTagger.
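Both tagger styles can be sketched in plain Python without the NLTK training data; the regex rules and the tiny training list below are illustrative stand-ins for RegexpTagger patterns and a tagged corpus, and the fallback mirrors NLTK's backoff= parameter.

```python
import re
from collections import Counter, defaultdict

# --- Rule-based tagging: regex patterns tried in order (cf. RegexpTagger) ---
rules = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ly$", "RB"),     # adverbs
    (r".*s$", "NNS"),     # plural nouns
    (r".*", "NN"),        # default: noun
]

def rule_tag(word):
    for pattern, tag in rules:
        if re.match(pattern, word):
            return tag

# --- Unigram tagging: most frequent tag per word in the training data ---
train = [("the", "AT"), ("dog", "NN"), ("dog", "NN"), ("runs", "VBZ"),
         ("the", "AT"), ("running", "VBG")]
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
unigram = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(word):
    # Back off to the rule-based tagger for unseen words.
    return unigram.get(word) or rule_tag(word)

print(rule_tag("running"))    # VBG
print(unigram_tag("dog"))     # NN
print(unigram_tag("slowly"))  # RB (unseen word, handled by the rules)
```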
6. Word Segmentation and Scoring
For text like "thereisacat":
o Use recursive algorithm to segment.
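One way to sketch the recursive approach, assuming a toy frequency dictionary in place of counts drawn from the reference corpus (the words and counts below are illustrative):

```python
from functools import lru_cache

# Toy word-frequency table standing in for counts from a reference corpus.
freq = {"there": 50, "the": 120, "is": 80, "a": 200, "cat": 30, "act": 10}
total = sum(freq.values())

@lru_cache(maxsize=None)
def segment(text):
    """Best-scoring segmentation of `text`, where a split is scored as the
    product of unigram word probabilities. Returns (score, words)."""
    if not text:
        return (1.0, [])
    best_score, best_words = 0.0, None
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in freq:
            rest_score, rest_words = segment(text[i:])
            if rest_words is not None:
                score = (freq[prefix] / total) * rest_score
                if score > best_score:
                    best_score, best_words = score, [prefix] + rest_words
    return (best_score, best_words)

score, pieces = segment("thereisacat")
print(pieces)  # ['there', 'is', 'a', 'cat']
```

The lru_cache memoization turns the naive exponential recursion into dynamic programming, since each suffix of the input is segmented at most once.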
4.2 Data Flow Diagram
4.3 Flowchart
4.4 Modules
1. Study the Various Corpora: Brown, Inaugural, Reuters,
UDHR
2. Create and Use Your Own Corpora (Plaintext,
Categorical)
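A categorized corpus is simply one directory per category. The sketch below builds such a layout in a temporary folder and reads it back with a naive whitespace tokenizer, mimicking (without requiring) NLTK's CategorizedPlaintextCorpusReader; the category names and sample texts are illustrative.

```python
import os
import tempfile

# Build a tiny categorized corpus on disk: one subdirectory per category.
root = tempfile.mkdtemp()
samples = {"sports": "The team won the match.",
           "tech": "The new phone has a fast chip."}
for category, text in samples.items():
    os.makedirs(os.path.join(root, category), exist_ok=True)
    with open(os.path.join(root, category, "doc1.txt"), "w") as f:
        f.write(text)

def words(category):
    """Naive whitespace tokenizer over every file in a category."""
    folder = os.path.join(root, category)
    tokens = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name)) as f:
            tokens.extend(f.read().split())
    return tokens

print(words("sports"))  # ['The', 'team', 'won', 'the', 'match.']
```

With NLTK installed, the same directory can be handed to CategorizedPlaintextCorpusReader, which additionally provides proper word and sentence tokenization.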
3. Conditional Frequency Distributions
4. Study Tagged Corpora
5. Most Frequent Noun Tags
6. Map Words to Properties Using Python Dictionaries
7. Study Rule-Based Tagger, Unigram Tagger
8. Word Segmentation Without Spaces and Scoring
4.5 Activity Diagram
4.6 Sequence Diagram
References
Books and Academic References
1. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd
edition, draft).
o Chapter 2 (Regular Expressions, Text Normalization)
o Chapter 3 (Language Modeling, Smoothing)
o Chapter 4 (Naive Bayes, Text Classification, Sentiment Analysis)
o Chapter 5 (POS Tagging, Taggers, and HMMs)
o URL: [Link]
2. Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media.
o Especially Chapters 2, 3, and 5:
Chapter 2: Accessing Text Corpora
Chapter 3: Processing Raw Text
Chapter 5: Categorizing and Tagging Words
o URL: [Link]
Toolkits and Libraries
3. NLTK (Natural Language Toolkit)
o Python library for NLP; includes all mentioned corpora and tagging
tools.
o Documentation: [Link]
o Corpora docs: [Link]
4. NLTK GitHub Repository
o For source code and examples:
o [Link]
Articles, Blogs, and Tutorials
5. NLTK Corpora Tutorial – by GeeksForGeeks
o [Link]resources-in-nlp-with-python-nltk/
6. Understanding POS Tagging in NLTK – Towards Data Science
o [Link]lemmatization-in-python-8c57a5dcb46c
7. Rule-based and Unigram Taggers in NLTK – Stack Overflow Discussions
and Examples
o [Link]
Optional for Advanced Word Segmentation
8. "Word Segmentation using Probability" – Peter Norvig's Blog
o Excellent explanation of finding word boundaries using bigrams and
scoring.
o [Link]
THANK YOU