
MVJ COLLEGE OF ENGINEERING, BENGALURU-560067

(Autonomous Institution Affiliated to VTU, Belagavi)

DEPARTMENT OF DATA SCIENCE

CERTIFICATE

Certified that the minor project work titled ‘Practical Applications of NLP: Corpus
Exploration, POS Tagging, and Word Segmentation Using NLTK’ is carried out by AYAN
AMANULLAH KHAN(1MJ22CD010), JOY GRAS(1MJ22CD018), K VARUN
KUMAR(1MJ22CD019), LINGUTLA CHAITHANYA(1MJ22CD026), PARIKSHITH V
M(1MJ22CD038), who are bonafide students of MVJ College of Engineering, Bengaluru, in
partial fulfilment for the award of Degree of Bachelor of Engineering in Data Science of the
Visvesvaraya Technological University, Belagavi during the year 2022-2026. It is certified that
all corrections/suggestions indicated for the Internal Assessment have been incorporated in the
report deposited in the departmental library. The report has been approved as it satisfies the
academic requirements in respect of the assignment prescribed by the institution for the said Degree.

Signature of Guide Signature of Head of the Department Signature of Principal


Dr.________________ Dr.__________________ Dr. Ajayan K R

External Viva

Name of Examiners Signature with Date

MVJ COLLEGE OF ENGINEERING, BENGALURU-560067
(Autonomous Institution Affiliated to VTU, Belagavi)

DEPARTMENT OF DATA SCIENCE

DECLARATION

We, AYAN AMANULLAH KHAN, JOY GRAS, K VARUN KUMAR, LINGUTLA CHAITHANYA, and PARIKSHITH V M, students of sixth semester B.E., Department of Data Science, MVJ College of Engineering, Bengaluru, hereby declare that the assignment titled ‘Practical Applications of NLP: Corpus Exploration, POS Tagging, and Word Segmentation Using NLTK’ has been carried out by us and submitted in partial fulfilment for the award of the Degree of Bachelor of Engineering in Data Science during the year 2024-25.

Further, we declare that the content of this report has not been submitted previously by anybody for the award of any Degree or Diploma of any other University.

We also declare that any Intellectual Property Rights generated out of this project carried out at MVJCE will be the property of MVJ College of Engineering, Bengaluru, and we will be among the authors of the same.

Place: Bengaluru
Date:

Name Signature
1. AYAN AMANULLAH KHAN
2. JOY GRAS
3. K VARUN KUMAR
4. LINGUTLA CHAITHANYA
5. PARIKSHITH V M

ACKNOWLEDGEMENT

We are indebted to our guide, Prof. Rekha P, Professor, Department of Data Science, MVJ College of Engineering, for her wholehearted support, suggestions, and invaluable advice throughout our assignment, and also for her help in the preparation of this report.

We also express our gratitude to our panel members, Prof. Lubi E, Associate Professor, and Prof. Victoria, Department of Data Science, for their valuable comments and suggestions.

Our sincere thanks to Prof. Rekha P, Associate Professor and Head, Department of Data
Science, MVJCE for her support and encouragement.

We express sincere gratitude to our beloved Principal, Dr. Ajayan for all his support towards
this assignment.

Lastly, we take this opportunity to thank our family members and friends who provided all the
backup support throughout the assignment.

ABSTRACT

This assignment explores the foundational techniques of Natural Language Processing (NLP) through practical applications of information retrieval and part-of-speech (POS) tagging using Python's Natural Language Toolkit (NLTK). The study begins with an in-depth examination of various standard corpora such as Brown, Inaugural, Reuters, and the Universal Declaration of Human Rights (UDHR), highlighting methods like fileids(), raw(), words(), sents(), and categories(). Additionally, it demonstrates the creation and utilization of custom corpora in plaintext and categorized formats. The project further investigates Conditional Frequency Distributions to analyse word distributions across categories. Emphasis is placed on understanding tagged corpora using the tagged_words() and tagged_sents() methods, followed by identifying the most frequent noun tags. The assignment also showcases mapping words to properties using Python dictionaries and implements basic POS taggers, including rule-based and unigram taggers. Finally, it addresses a word segmentation task, where words are extracted from a continuous string of characters by referencing a predefined corpus and assigning scores based on word likelihood. This comprehensive study bridges theory and practical implementation, reinforcing key NLP concepts crucial for text analysis and language modelling.

This assignment presents a comprehensive exploration of Information Retrieval and Part-of-Speech (POS) tagging within the domain of Natural Language Processing using Python and the NLTK library. The study involves detailed analysis of standard corpora like Brown, Reuters, Inaugural, and UDHR, leveraging NLTK methods to access and manipulate raw texts, word lists, sentences, and categories. It includes building and accessing user-defined corpora, examining tagged corpora, and generating conditional frequency distributions to understand linguistic patterns. The project implements rule-based and unigram taggers, extracts the most frequent noun tags, and uses Python dictionaries to map words to specific properties. Additionally, it addresses the challenge of segmenting a continuous text string into valid words by matching against a known corpus and scoring potential segmentations. This assignment serves as a practical demonstration of key NLP concepts and their implementation using real-world data.

ACRONYMS
Acronym Expansion
USN University Seat Number
SEM Semester
NLTK Natural Language Toolkit
NLP Natural Language Processing

TABLE OF CONTENTS
Page No

Certificate I
Declaration II
Acknowledgement III
Abstract IV
Acronyms V

Chapter 1
1. Introduction
1.1 Aim
1.2 Motivation
1.3 Problem Statement
1.4 Existing System
1.5 Proposed System

Chapter 2
2. Literature Survey

Chapter 3
3. System Requirement and Specification
3.1 Hardware Requirement
3.2 Software Requirement
3.3.1 Functional Requirement
3.3.2 Non-Functional Requirement

Chapter 4
4. System Design
4.1 System Architecture
4.2 Data Flow Diagram
4.3 Flowchart

4.4 Modules
4.5 Activity Diagram
4.6 Sequence Diagram
5. Conclusion
6. References


CHAPTER 1

Introduction
In the digital age, vast amounts of textual data are generated every second, and
extracting meaningful information from this unstructured text is a central challenge
in computer science. Natural Language Processing (NLP) addresses this
challenge by combining computational linguistics with machine learning and
information retrieval techniques. One of the key components of NLP is
Information Retrieval (IR)—the process of searching and extracting relevant data
from large collections of text. This assignment provides a hands-on exploration of
IR concepts through the lens of NLP, using the powerful NLTK (Natural
Language Toolkit) in Python.

The assignment begins with the study of various standard corpora such as the
Brown Corpus, Inaugural Addresses, Reuters News Corpus, and the Universal
Declaration of Human Rights (UDHR). These corpora represent a wide range of
text genres, from news and literature to historical speeches and multilingual
declarations. By examining their structure using methods like words(), sents(),
raw(), and categories(), we gain a better understanding of how real-world text is
organized and processed.
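
As a brief illustration, the following sketch shows how these access methods behave on the standard corpora; it assumes the corpora have already been fetched once with nltk.download().

from nltk.corpus import brown, inaugural, reuters, udhr

# Assumes prior downloads, e.g. nltk.download('brown'), nltk.download('udhr'), etc.
print(brown.categories()[:5])                # genre labels such as 'adventure'
print(brown.words(categories='news')[:10])   # word tokens from the news genre
print(brown.sents(categories='news')[0])     # first tokenized sentence of that genre
print(inaugural.fileids()[:3])               # one file per inaugural address
print(reuters.categories()[:5])              # Reuters topic labels
print(udhr.raw('English-Latin1')[:100])      # raw UDHR text for one language file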

Next, we delve into custom corpora creation, both in simple plaintext format and
with labelled categories. This mirrors how NLP practitioners often need to prepare
domain-specific data before applying machine learning models. The use of
Conditional Frequency Distributions (CFDs) allows us to analyse how
frequently words appear under different conditions, such as in different genres or
categories, providing insights into linguistic patterns and trends.
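
As a small sketch of this idea (the genres and modal verbs below are illustrative assumptions), a Conditional Frequency Distribution over the Brown corpus can be built and queried as follows:

import nltk
from nltk.corpus import brown

# (condition, event) pairs feed the CFD: here the condition is the genre
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)
)

# Compare how often a few modal verbs occur in each genre
for modal in ['can', 'could', 'will', 'must']:
    print(modal, cfd['news'][modal], cfd['romance'][modal])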

An essential part of NLP is understanding grammatical structure, which we approach through Part-of-Speech (POS) tagging. By exploring tagged corpora and utilizing functions like tagged_words() and tagged_sents(), we can analyse how words function in different contexts. A dedicated program is implemented to find the most frequent noun tags, emphasizing their importance in extracting key subjects from text.
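
A minimal sketch of such a program (assuming the Brown tagged corpus and its NN/NP-style noun tags) might look like this:

from collections import Counter
from nltk.corpus import brown

# Collect (word, tag) pairs from the news genre and keep only noun-like tags
tagged = brown.tagged_words(categories='news')
noun_counts = Counter(tag for _, tag in tagged
                      if tag.startswith('NN') or tag.startswith('NP'))

# The most frequent noun tags, e.g. NN, NNS, NP
print(noun_counts.most_common(5))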

Additionally, we explore how Python dictionaries can be used to map words to their properties, laying the groundwork for more complex NLP tasks like semantic analysis or feature engineering. The study of Rule-Based and Unigram Taggers further reinforces the understanding of POS tagging, contrasting deterministic methods with probabilistic models.
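
For instance, a rule-based tagger built from a few regular-expression patterns can be contrasted with a unigram tagger trained on the Brown corpus; the patterns below are illustrative assumptions rather than the report's exact rules:

from nltk.corpus import brown
from nltk.tag import RegexpTagger, UnigramTagger

# Deterministic, rule-based tagging: each pattern maps a word shape to a tag
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'^\d+$', 'CD'),     # cardinal numbers
    (r'.*s$', 'NNS'),     # plural nouns
    (r'.*', 'NN'),        # default: tag everything else as a noun
]
rule_tagger = RegexpTagger(patterns)

# Probabilistic baseline: the most frequent tag seen for each word in training
unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news'))

sentence = ['The', 'cats', 'were', 'running']
print(rule_tagger.tag(sentence))
print(unigram_tagger.tag(sentence))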

A particularly interesting problem is addressed in the final section: breaking a continuous string of text (without spaces) into meaningful words using a reference corpus and assigning scores to the possible word combinations. This simulates word segmentation, a critical step in languages like Chinese and a common problem in NLP preprocessing.
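
One possible sketch of this idea (not necessarily the implementation used later in the report) scores candidate splits with log word frequencies taken from the Brown corpus and chooses the best split recursively:

import math
from nltk.corpus import brown
from nltk.probability import FreqDist

# Brown word frequencies act as a crude likelihood model for scoring splits
freq = FreqDist(w.lower() for w in brown.words() if w.isalpha())
total = freq.N()

def best_segmentation(text, memo=None):
    # Return (log-probability score, word list) for the best split of text
    if memo is None:
        memo = {}
    if not text:
        return 0.0, []
    if text in memo:
        return memo[text]
    best_score, best_words = float('-inf'), [text]
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in freq:
            rest_score, rest_words = best_segmentation(text[i:], memo)
            score = math.log(freq[prefix] / total) + rest_score
            if score > best_score:
                best_score, best_words = score, [prefix] + rest_words
    memo[text] = (best_score, best_words)
    return best_score, best_words

print(best_segmentation('thereisacat')[1])   # expected to yield ['there', 'is', 'a', 'cat']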

Overall, this assignment offers a comprehensive introduction to the practical aspects of information retrieval in NLP, enabling learners to build foundational skills in text processing, analysis, and interpretation using real-world datasets and tools.

1.1 Aim
To explore and demonstrate the use of Information Retrieval techniques in Natural
Language Processing (NLP) using the NLTK toolkit by studying standard corpora
(Brown, Inaugural, Reuters, UDHR), creating custom corpora, analysing
conditional frequency distributions, examining tagged corpora, implementing
taggers (Rule-based and Unigram), mapping words to properties using dictionaries,
and segmenting text without spaces using a reference corpus for word scoring.


1.2 Motivation
Natural Language Processing (NLP) plays a crucial role in enabling machines to
understand and process human language. One of the fundamental tasks in NLP is
information retrieval, which involves extracting meaningful patterns, structures,
and insights from large volumes of text data. This assignment is designed to
provide hands-on experience with core NLP techniques and tools using real-world
text corpora and Python-based processing methods.
Through this assignment, you'll:
 Explore diverse corpora (Brown, Inaugural, Reuters, UDHR) to
understand the variety and complexity of language in different contexts —
from news articles to political speeches.
 Create your own corpora, simulating the tasks of preparing domain-
specific datasets for NLP applications like sentiment analysis or document
classification.
 Use tools like conditional frequency distributions to study how word
usage varies across genres or categories — a key skill in information
retrieval and text analytics.
 Tag parts of speech (POS) in corpora to understand grammatical structure,
which is essential for downstream NLP tasks like named entity recognition
or question answering.
 Identify the most frequent noun tags, providing insights into subject
matter focus.
 Practice mapping words to properties, which is useful in sentiment
scoring, feature engineering, and semantic analysis.
 Explore rule-based and statistical taggers like the Unigram Tagger to
learn how machines assign POS tags and structure to unstructured text.
 Tackle the challenge of word segmentation in unspaced text — a critical task in languages without word boundaries (e.g., Chinese, Thai) or in OCR/typo correction systems. This mirrors real-world scenarios in search engines and autocorrect features.
By the end of this assignment, you will have a solid foundation in retrieving,
analyzing, and tagging text — equipping you with the skills to build intelligent
language-based applications like chatbots, search engines, or language translation
tools.
1.3 Problem Statement
The goal of this assignment is to understand and demonstrate the fundamentals of
Information Retrieval and Part-of-Speech (POS) Tagging in Natural Language
Processing (NLP) using the Natural Language Toolkit (NLTK) in Python.
Objectives:
1. Explore Predefined Corpora:
 Study and explore various corpora such as Brown, Inaugural,
Reuters, and UDHR using methods like fileids(), raw(), words(),
sents(), and categories().
2. Create and Use Custom Corpora:
 Create and utilize your own corpora using plaintext and categorical
formats.
3. Analyze Conditional Frequency Distributions:
 Generate and interpret Conditional Frequency Distributions
(CFD) based on text data to understand word usage across different
categories or contexts.
4. Work with Tagged Corpora:
 Study tagged corpora using methods such as tagged_sents() and
tagged_words() to explore how words are annotated with POS tags.
5. POS Tagging Analysis:
 Write a program to extract and analyze the most frequent noun
tags from a tagged corpus.
6. Dictionary Mapping:
 Implement Python dictionaries to map words to properties (e.g., word length, frequency, POS tags).
7. Implement Tagging Models:
 Study and implement both Rule-based taggers and Unigram
taggers using NLTK.
8. Word Segmentation Task:
 Write a program to segment a string of characters (without
spaces) into valid words using a given corpus, and compute the
score or probability of different segmentations.
1.4 Existing System
The existing system typically uses built-in corpus tools provided by libraries like
NLTK (Natural Language Toolkit). These tools allow exploration of well-known
corpora such as Brown, Reuters, and Inaugural addresses. These systems are often
used in NLP education and research for:
 Simple corpus access
 Tagging using pretrained taggers
 Basic frequency distributions
 Basic string comparison tasks
1.5 Proposed System
The proposed system builds upon the existing tools but adds customization, interactivity, and real-world utility. It also introduces user-defined corpora, rule-based taggers, and a scoring-based word segmentation system.

1.6 Organization of the Report


Chapter 1: Introduction
 Objective of the assignment
 Overview of Information Retrieval in NLP
 Tools used: Python, NLTK (Natural Language Toolkit)


Chapter 2: Exploring Standard Corpora


 Overview of Corpora in NLTK
o Brown Corpus
o Inaugural Corpus
o Reuters Corpus
o Universal Declaration of Human Rights (UDHR)
 Using Methods
o fileids()
o raw()
o words()
o sents()
o categories()
 Code Examples and Outputs for each method on different corpora

Chapter 3: Creating and Using Custom Corpora


 Plaintext Corpus
o How to create and load using PlaintextCorpusReader
 Categorical Corpus
o Organizing files by categories
o Loading using CategorizedPlaintextCorpusReader
 Code Snippets and Sample Files

Chapter 4: Conditional Frequency Distributions


 Introduction to ConditionalFreqDist
 Example with Brown or Reuters Corpus
o Frequency of words by category
 Plotting distributions


Chapter 5: Working with Tagged Corpora


 Tagged Sentences and Words
o tagged_sents()
o tagged_words()
 Frequency of POS Tags
o Finding the most frequent noun tags (e.g., NN, NNP)
 Code to extract and count tags

Chapter 6: Mapping Words to Properties


 Using Python Dictionaries
o Mapping words to POS tags or categories
o Example: {'dog': 'noun', 'run': 'verb'}
 Applications in NLP

Chapter 7: POS Tagging Techniques


 Rule-Based Tagger
o Simple patterns using RegexpTagger
 Unigram Tagger
o Training with tagged corpus
o Evaluation using accuracy()
 Comparison and Results

Chapter 8: Word Segmentation from Plain Text


 Problem Statement
o Plain text without spaces (e.g., itisanexample)
 Using a Given Corpus of Words
o Comparing with a dictionary
 Scoring Words


o Use of frequency or probability


 Implementation and Output


CHAPTER 2

LITERATURE SURVEY

Topic: Corpus Overview
Description: The Brown, Inaugural, Reuters, and UDHR corpora are widely used for training models and conducting experiments. They contain diverse datasets in English and other languages.
Reference: Brown Corpus

Topic: Methods: fileids, raw, words, sents, categories
Description: These methods help access different components of a corpus: fileids provides the document IDs, raw retrieves all the text, words gives a list of words, sents returns a list of sentences, and categories provides the different genres.
Reference: NLTK Documentation

Topic: Custom Corpora Creation
Description: Creating and using your own corpus allows handling domain-specific data or categorical datasets. This is done by defining a structure and incorporating textual data into it.
Reference: NLTK Custom Corpus

Topic: Conditional Frequency Distributions (CFD)
Description: Conditional frequency distributions calculate the frequency of words conditional on certain conditions, providing insights into contextual word usage.
Reference: NLTK CFD

Topic: Tagged Corpora and Methods
Description: Tagged corpora include annotations like part-of-speech (POS) tags. Methods like tagged_sents and tagged_words allow accessing sentences and words with their respective POS tags.
Reference: Tagged Corpora

Topic: Most Frequent Noun Tags Program
Description: A program to find the most frequent noun tags from a tagged corpus. This is useful in linguistic analysis to identify noun-heavy sentences.
Reference: NLTK POS Tagging

Topic: Mapping Words to Properties using Python Dictionaries
Description: Using dictionaries to map words to various properties like frequency or category. This is a foundational concept in NLP.
Reference: Python Dictionary in NLP

Topic: Rule-Based Tagger
Description: Rule-based taggers assign POS tags based on predefined rules. They are useful for improving accuracy on unknown words or ambiguous cases.
Reference: NLTK Rule-Based Tagging

Topic: Unigram Tagger
Description: A unigram tagger assigns each word the tag that occurs most frequently for that word in the training data, providing a baseline for more complex taggers.
Reference: Unigram Tagger

Topic: Evaluation Metrics for NLP
Description: Evaluation metrics such as precision, recall, and F1-score help assess the performance of information retrieval and other NLP tasks.
Reference: NLP Evaluation Metrics


CHAPTER 3

System Requirement and Specification

3.1 Hardware Requirement

Minimum System Requirements


 Processor (CPU): Dual-core processor (e.g., Intel i3 or AMD Ryzen 3)
 RAM: 4 GB
 Storage: At least 10 GB free (to store Python environment, NLTK corpora,
and results)
 Operating System: Windows 10, Ubuntu 20.04+, or macOS 10.15+
 Python Version: Python 3.7 or above
 Internet Connection: Required initially for downloading corpora via
NLTK

Recommended System Requirement


 Processor (CPU): Quad-core processor (e.g., Intel i5/i7 or AMD Ryzen 5/7)
 RAM: 8 GB or more (recommended especially for using multiple corpora
simultaneously)
 Storage: SSD with at least 20 GB free for faster read/write speeds
 Graphics (GPU): Not required (since this is not deep learning-based)
 Operating System: Any modern OS (Linux preferred for better performance with
Python tools)


3.2 Software Requirement

1. Python (version 3.8 or above)


o Primary programming language for implementing NLP techniques.

2. NLTK (Natural Language Toolkit) Library


o Required for accessing standard corpora like Brown, Inaugural,
Reuters, and UDHR.

o Used for working with tagged corpora, frequency distributions,


taggers, etc.

3. NLTK Corpora Data


o Downloadable via nltk.download() (a short download sketch follows this list):

 brown

 inaugural

 reuters

 udhr

 punkt (for tokenization)

 averaged_perceptron_tagger (for POS tagging)

 universal_tagset (optional, for simplified tags)

4. Jupyter Notebook / Google Colab / Any Python IDE


o For writing, running, and documenting code interactively.

o IDEs like PyCharm, VSCode, or Anaconda are also suitable.

5. Operating System
o Windows, macOS, or Linux (any OS that supports Python and
NLTK).

6. Text Editor (Optional)


o For creating custom plain text or categorized corpora files (e.g.,
Notepad++, VSCode).

7. Memory Requirements
o At least 4 GB RAM (for working with medium-sized corpora
efficiently).

8. Internet Access
o Required to download NLTK datasets and resources if not already
available offline.

9. Matplotlib or Seaborn (Optional for Visualization)


o For visualizing frequency distributions or tag patterns (if needed).

10. Basic Knowledge of NLP Concepts

 Understanding of tokenization, tagging, corpus usage, frequency

distributions, and tagging models.
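
A short sketch of the one-time download step referred to in item 3 above, using the resource names listed there:

import nltk

# One-time downloads of the corpora and models used in this assignment
for resource in ['brown', 'inaugural', 'reuters', 'udhr',
                 'punkt', 'averaged_perceptron_tagger', 'universal_tagset']:
    nltk.download(resource)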


3.3.1 Functional Requirement

1. Corpus Study
 The system shall allow the user to load and explore built-in corpora:

o Brown, Inaugural, Reuters, UDHR

 The system shall provide methods to:

o View available fields (e.g., fileids, categories)

o Retrieve raw text from a corpus

o Tokenize text into words, sentences

o Filter and display categories (where applicable)

2. Custom Corpus Creation and Usage


 The system shall support creation of:

o A plain text corpus from a user-defined directory

o A categorized corpus where files belong to labeled categories

 The system shall tokenize and display words, sents, or fileids from the
custom corpus

3. Conditional Frequency Distribution

 The system shall compute and display the Conditional Frequency Distribution:

o Frequency of words conditioned on categories or fields

o Frequency of word-tag combinations in tagged corpora

4. Tagged Corpora Exploration


 The system shall load pre-tagged corpora (e.g., Brown tagged corpus)

 The system shall allow access to:

o tagged_sents – list of tagged sentences

o tagged_words – list of (word, tag) pairs

5. Most Frequent Noun Tags


 The system shall extract and count parts of speech tags

 The system shall identify and display the most frequent noun tags (e.g., NN,
NNS)

6. Word to Property Mapping Using Python Dictionaries

 The system shall create a dictionary mapping words to specific properties such as:

o Length

o Frequency

o POS tag

 The user shall be able to query this dictionary for given words (see the sketch below)
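
A minimal sketch of this requirement, assuming a small illustrative word list, Brown-corpus frequencies, and the pretrained averaged_perceptron_tagger:

import nltk
from nltk.corpus import brown

freq = nltk.FreqDist(w.lower() for w in brown.words())   # word frequencies

words = ['dog', 'run', 'quickly', 'beautiful']            # illustrative words only
properties = {
    w: {
        'length': len(w),
        'frequency': freq[w],
        'pos': nltk.pos_tag([w])[0][1],   # needs averaged_perceptron_tagger
    }
    for w in words
}

# Query the dictionary for a given word
print(properties['dog'])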

7. POS Taggers
 The system shall implement and demonstrate:

o A Rule-Based Tagger (using regular expressions or predefined rules)

o A Unigram Tagger (trained on tagged corpus data; see the sketch below)
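
A possible sketch of the Unigram Tagger requirement, trained and evaluated on a split of the Brown news sentences (the 90/10 split is an assumption chosen for illustration):

from nltk.corpus import brown
from nltk.tag import UnigramTagger

tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)                     # 90% train, 10% test
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

tagger = UnigramTagger(train_sents)
print(tagger.tag(['The', 'market', 'rallied']))          # tag an unseen sentence
print(tagger.accuracy(test_sents))                       # .evaluate() on older NLTK versions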

8. Word Segmentation (Word Break Problem)


 The system shall:

o Accept a plain text string with no spaces

o Use a corpus-based dictionary to identify valid words in the string

o Segment the string into possible words

o Calculate and display scores (e.g., based on frequency or likelihood)

3.3.2 Non-Functional Requirement

Performance


 The system should process and analyze corpora data (e.g., Brown, Reuters) within a reasonable time (under 5 seconds for standard corpus operations).

 Tagging and frequency distribution operations should be optimized to handle medium-size datasets (up to 10,000 words) efficiently.

Scalability
 The program should be able to scale to handle additional corpora or user-
defined datasets without significant changes to the code.

 Modular design should allow easy integration of new taggers or corpora.

Usability
 The interface (CLI or GUI) should be simple and intuitive for students or researchers with basic NLP knowledge.

 Clear documentation and help messages should be provided for each function.

Maintainability
 The code should follow standard coding conventions (e.g., PEP8 for Python) and be well-commented.

 Functions should be modular to facilitate future updates or modifications.

Portability

 The solution should work on major platforms like Windows, Linux, and
macOS with minimal configuration.

 Dependencies should be managed using [Link] or pipenv.

Accuracy
 The POS tagging and word segmentation should yield accurate results based
on standard NLP libraries like NLTK or spaCy.

 Use trusted corpora like Brown and Reuters for evaluation to ensure
linguistic accuracy.

Reliability
 The system should not crash or behave unexpectedly when working with
empty or malformed corpora.

 Error handling should be in place for missing files or corrupted input.

Reusability
 The corpora handling, tagging, and frequency distribution logic should be
implemented in reusable functions or classes.

 Code modules should be general enough to be reused in other NLP tasks.

Security
 If user-defined corpora are uploaded, ensure no arbitrary code execution
occurs (e.g., no eval() on user input).

 Sanitize file inputs and restrict to plaintext only.


CHAPTER 4

System Design

4.1 System Architecture

1. Corpus Exploration Module


 Use NLTK corpora: brown, reuters, inaugural, udhr

 Functions: .words(), .sents(), .categories(), .fileids()

2. Custom Corpus Module


 Use PlaintextCorpusReader for unstructured files.

 Use CategorizedPlaintextCorpusReader with categories.
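
A hedged sketch of this module, assuming a hypothetical my_corpus/ directory of .txt files and a categorized_corpus/ directory whose file names encode the category (e.g. sports_01.txt):

from nltk.corpus.reader import PlaintextCorpusReader, CategorizedPlaintextCorpusReader

# Plain text corpus: every .txt file under my_corpus/ (hypothetical path)
plain = PlaintextCorpusReader('my_corpus', r'.*\.txt')
print(plain.fileids())
print(plain.words()[:20])

# Categorized corpus: the category is read from the file-name prefix
categorized = CategorizedPlaintextCorpusReader(
    'categorized_corpus', r'.*\.txt', cat_pattern=r'(\w+)_\d+\.txt')
print(categorized.categories())
print(categorized.words(categories='sports')[:20])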

3. Conditional Frequency Distribution


 nltk.ConditionalFreqDist(): useful to study word usage per category or tag.

4. Tagged Corpus Analysis


 Use the tagged corpus methods tagged_words() and tagged_sents()

 Count most frequent noun tags using tag patterns (e.g., NN, NNS, NNP, etc.)

5. POS Taggers
 Implement basic taggers such as UnigramTagger and RegexpTagger.

6. Word Segmentation and Scoring


 For text like "thereisacat":


o Use a recursive algorithm to segment the string.


4.2 Data Flow Diagram


4.3 Flowchart


4.4 Modules

1. Study the Various Corpora: Brown, Inaugural, Reuters, UDHR

2. Create and Use Your Own Corpora (Plaintext, Categorical)

3. Conditional Frequency Distributions

4. Study Tagged Corpora

5. Most Frequent Noun Tags

6. Map Words to Properties Using Python Dictionaries

7. Study Rule-Based Tagger, Unigram Tagger


8. Word Segmentation Without Spaces and Scoring


4.5 Activity Diagram


4.6 Sequence Diagram


References

Books and Academic References


1. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd
edition, draft).

o Chapter 2 (Regular Expressions, Text Normalization)

o Chapter 3 (Language Modeling, Smoothing)

o Chapter 4 (Naive Bayes, Text Classification, Sentiment Analysis)

o Chapter 5 (POS Tagging, Taggers, and HMMs)

o URL: [Link]

2. Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media.

o Especially Chapters 2, 3, and 5:

 Chapter 2: Accessing Text Corpora

 Chapter 3: Processing Raw Text

 Chapter 5: Categorizing and Tagging Words

o URL: [Link]

Toolkits and Libraries


3. NLTK (Natural Language Toolkit)

o Python library for NLP; includes all mentioned corpora and tagging
tools.


o Documentation: [Link]

o Corpora docs: [Link]

4. NLTK GitHub Repository

o For source code and examples:

o [Link]

Articles, Blogs, and Tutorials


5. NLTK Corpora Tutorial – by GeeksForGeeks

o [Link]resources-in-nlp-with-python-nltk/

6. Understanding POS Tagging in NLTK – Towards Data Science

o [Link]lemmatization-in-python-8c57a5dcb46c

7. Rule-based and Unigram Taggers in NLTK – Stack Overflow Discussions and Examples

o [Link]

Optional for Advanced Word Segmentation


8. "Word Segmentation using Probability" – Peter Norvig's Blog

o Excellent explanation of finding word boundaries using bigrams and scoring.

o [Link]


THANK YOU
