MVJ COLLEGE OF ENGINEERING, BENGALURU-560067
(Autonomous Institution Affiliated to VTU, Belagavi)
DEPARTMENT OF DATA SCIENCE
CERTIFICATE
Certified that the minor project work titled ‘Practical Applications of NLP: Corpus
Exploration, POS Tagging, and Word Segmentation Using NLTK’ is carried out by AYAN
AMANULLAH KHAN(1MJ22CD010), JOY GRAS(1MJ22CD018), K VARUN
KUMAR(1MJ22CD019), LINGUTLA CHAITHANYA(1MJ22CD026), PARIKSHITH V
M(1MJ22CD038), who are bona fide students of MVJ College of Engineering, Bengaluru, in
partial fulfilment for the award of Degree of Bachelor of Engineering in Data Science of the
Visvesvaraya Technological University, Belagavi during the year 2022-2026. It is certified that
all corrections/suggestions indicated for the Internal Assessment have been incorporated in the
report deposited in the departmental library. The report has been approved as it satisfies the
academic requirements in respect of the assignment prescribed by the institution for the said Degree.
Signature of Guide Signature of Head of the Department Signature of Principal
Dr.________________ Dr.__________________ Dr. Ajayan K R
External Viva
Name of Examiners Signature with Date
MVJ COLLEGE OF ENGINEERING, BENGALURU-560067
(Autonomous Institution Affiliated to VTU, Belagavi)
DEPARTMENT OF DATA SCIENCE
DECLARATION
We, AYAN AMANULLAH KHAN, JOY GRAS, K VARUN KUMAR, LINGUTLA
CHAITHANYA, PARIKSHITH V M, students of the sixth semester B.E., Department of Data
Science, MVJ College of Engineering, Bengaluru, hereby declare that the assignment titled
‘Practical Applications of NLP: Corpus Exploration, POS Tagging, and Word
Segmentation Using NLTK’ has been carried out by us and submitted in partial fulfilment for
the award of Degree of Bachelor of Engineering in Data Science during the year 2024-25.
Further, we declare that the content of the dissertation has not been submitted previously by
anybody for the award of any Degree or Diploma to any other University.
We also declare that any Intellectual Property Rights generated out of this project carried out at
MVJCE will be the property of MVJ College of Engineering, Bengaluru and we will be one of
the authors of the same.
Place: Bengaluru
Date:
Name Signature
1. AYAN AMANULLAH KHAN
2. JOY GRAS
3. K VARUN KUMAR
4. LINGUTLA CHAITHANYA
5. PARIKSHITH V M
ACKNOWLEDGEMENT
We are indebted to our guide, Prof. Rekha P, Professor, Dept. of Data Science, MVJ College of
Engineering, for her wholehearted support, suggestions, and invaluable advice throughout our
assignment, and also for her help in the preparation of this report.
We also express our gratitude to our panel members Prof. Lubi E, Associate Professor, and
Prof. Victoria, Department of Data Science, for their valuable comments and suggestions.
Our sincere thanks to Prof. Rekha P, Associate Professor and Head, Department of Data
Science, MVJCE, for her support and encouragement.
We express sincere gratitude to our beloved Principal, Dr. Ajayan for all his support towards
this assignment.
Lastly, we take this opportunity to thank our family members and friends who provided all the
backup support throughout the assignment.
ABSTRACT
This assignment explores the foundational techniques of Natural Language Processing (NLP)
through practical applications of information retrieval and part-of-speech (POS) tagging using
Python's Natural Language Toolkit (NLTK). The study begins with an in-depth examination of
various standard corpora such as Brown, Inaugural, Reuters, and Universal Declaration of
Human Rights (UDHR), highlighting methods like fileids(), raw(), words(), sents(), and categories().
Additionally, it demonstrates the creation and utilization of custom corpora in plaintext and
categorized formats. The project further investigates Conditional Frequency Distributions to
analyse word distributions across categories. Emphasis is placed on understanding tagged
corpora using the tagged_words() and tagged_sents() methods, followed by identifying the most
frequent noun tags. The assignment also showcases mapping words to properties using Python
dictionaries and implements basic POS taggers, including rule-based and unigram taggers.
Finally, it addresses a word segmentation task, where words are extracted from a continuous
string of characters by referencing a predefined corpus and assigning scores based on word
likelihood. This comprehensive study bridges theory and practical implementation, reinforcing
key NLP concepts crucial for text analysis and language modelling.
ACRONYMS
Acronym Expansion
USN University Seat Number
SEM Semester
NLTK Natural Language Toolkit
NLP Natural Language Processing
TABLE OF CONTENTS
Certificate
Declaration
Acknowledgement
Abstract
Acronyms
Chapter 1
1. Introduction
1.1 Aim
1.2 Motivation
1.3 Problem Statement
1.4 Existing System
1.5 Proposed System
1.6 Organization of the Report
Chapter 2
2. Literature Survey
Chapter 3
3. System Requirement and Specification
3.1 Hardware Requirement
3.2 Software Requirement
3.3.1 Functional Requirement
3.3.2 Non-Functional Requirement
Chapter 4
4. System Design
4.1 System Architecture
4.2 Data Flow Diagram
4.3 Flowchart
4.4 Modules
4.5 Activity Diagram
4.6 Sequence Diagram
5. Conclusion
6. References
Practical Applications of NLP: Corpus Exploration, POS Tagging, and Word
Segmentation Using NLTK
CHAPTER-1
INTRODUCTION
Department of Data Science 2024-25 Page No.1
In the digital age, vast amounts of textual data are generated every second, and
extracting meaningful information from this unstructured text is a central challenge
in computer science. Natural Language Processing (NLP) addresses this
challenge by combining computational linguistics with machine learning and
information retrieval techniques. One of the key components of NLP is
Information Retrieval (IR)—the process of searching and extracting relevant data
from large collections of text. This assignment provides a hands-on exploration of
IR concepts through the lens of NLP, using the powerful NLTK (Natural
Language Toolkit) in Python.
The assignment begins with the study of various standard corpora such as the
Brown Corpus, Inaugural Addresses, Reuters News Corpus, and the Universal
Declaration of Human Rights (UDHR). These corpora represent a wide range of
text genres, from news and literature to historical speeches and multilingual
declarations. By examining their structure using methods like words(), sents(),
raw(), and categories(), we gain a better understanding of how real-world text is
organized and processed.
Next, we delve into custom corpora creation, both in simple plaintext format and
with labelled categories. This mirrors how NLP practitioners often need to prepare
domain-specific data before applying machine learning models. The use of
Conditional Frequency Distributions (CFDs) allows us to analyse how
frequently words appear under different conditions, such as in different genres or
categories, providing insights into linguistic patterns and trends.
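The idea behind a CFD can be illustrated without downloading any corpus: a conditional frequency distribution is simply one frequency count per condition. The sketch below uses a small hand-made (category, word) sample and collections.Counter in place of NLTK's ConditionalFreqDist; the categories and words are illustrative, not drawn from a real corpus.

```python
from collections import Counter, defaultdict

# Small hand-made sample standing in for the (category, word) pairs
# that would normally come from a corpus such as Brown.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

# A conditional frequency distribution is one Counter per condition.
cfd = defaultdict(Counter)
for category, word in pairs:
    cfd[category][word] += 1

print(cfd["news"]["the"])             # 2
print(cfd["romance"].most_common(1))  # [('love', 2)]
```

NLTK's ConditionalFreqDist accepts exactly such a sequence of (condition, sample) pairs, so the same loop collapses to a single constructor call when the library is available.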
An essential part of NLP is understanding grammatical structure, which we
approach through Part-of-Speech (POS) tagging. By exploring tagged corpora and
utilizing functions like tagged_words() and tagged_sents(), we can analyse how
words function in different contexts. A dedicated program is implemented to find
the most frequent noun tags, emphasizing their importance in extracting key
subjects from text.
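As a minimal sketch of this step, the count can be done with collections.Counter; the hand-tagged sample below is an illustrative stand-in for the (word, tag) pairs that nltk.corpus.brown.tagged_words() would return, using Brown-style tags.

```python
from collections import Counter

# A small hand-tagged sample in (word, tag) form; in the assignment this
# list would come from a tagged corpus such as Brown.
tagged_words = [
    ("The", "AT"), ("dog", "NN"), ("dogs", "NNS"), ("ran", "VBD"),
    ("Paris", "NP"), ("cat", "NN"), ("quickly", "RB"), ("cat", "NN"),
]

# Keep only noun tags (tags starting with 'NN' or 'NP' in the Brown tagset).
noun_tags = Counter(tag for _, tag in tagged_words
                    if tag.startswith("NN") or tag.startswith("NP"))
print(noun_tags.most_common(2))  # [('NN', 3), ('NNS', 1)]
```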
Additionally, we explore how Python dictionaries can be used to map words to
their properties—laying the groundwork for more complex NLP tasks like semantic
analysis or feature engineering. The study of Rule-Based and Unigram Taggers
further reinforces the understanding of POS tagging, contrasting deterministic
methods with probabilistic models.
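A minimal sketch of such a word-to-property mapping, using illustrative words and tags rather than data from a real corpus:

```python
# Map each word to a small property record (length, POS tag, frequency).
# The tags here are illustrative, not drawn from a tagged corpus.
words = ["run", "dog", "run", "quickly"]
pos = {"run": "verb", "dog": "noun", "quickly": "adverb"}

properties = {}
for w in words:
    # setdefault creates the record on first sight, then reuses it.
    entry = properties.setdefault(w, {"length": len(w), "pos": pos[w], "freq": 0})
    entry["freq"] += 1

print(properties["run"])  # {'length': 3, 'pos': 'verb', 'freq': 2}
```

Such a dictionary can then serve directly as a feature table for downstream tasks like the taggers discussed above.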
A particularly interesting problem is addressed in the final section: breaking a
continuous string of text (without spaces) into meaningful words using a reference
corpus and assigning scores to the possible word combinations. This simulates
word segmentation, a critical step in languages like Chinese and a common
problem in NLP preprocessing.
Overall, this assignment offers a comprehensive introduction to the practical
aspects of information retrieval in NLP, enabling learners to build foundational
skills in text processing, analysis, and interpretation using real-world datasets and
tools.
1.1 Aim
To explore and demonstrate the use of Information Retrieval techniques in Natural
Language Processing (NLP) using the NLTK toolkit by studying standard corpora
(Brown, Inaugural, Reuters, UDHR), creating custom corpora, analysing
conditional frequency distributions, examining tagged corpora, implementing
taggers (Rule-based and Unigram), mapping words to properties using dictionaries,
and segmenting text without spaces using a reference corpus for word scoring.
1.2 Motivation
Natural Language Processing (NLP) plays a crucial role in enabling machines to
understand and process human language. One of the fundamental tasks in NLP is
information retrieval, which involves extracting meaningful patterns, structures,
and insights from large volumes of text data. This assignment is designed to
provide hands-on experience with core NLP techniques and tools using real-world
text corpora and Python-based processing methods.
Through this assignment, you'll:
Explore diverse corpora (Brown, Inaugural, Reuters, UDHR) to
understand the variety and complexity of language in different contexts —
from news articles to political speeches.
Create your own corpora, simulating the tasks of preparing domain-
specific datasets for NLP applications like sentiment analysis or document
classification.
Use tools like conditional frequency distributions to study how word
usage varies across genres or categories — a key skill in information
retrieval and text analytics.
Tag parts of speech (POS) in corpora to understand grammatical structure,
which is essential for downstream NLP tasks like named entity recognition
or question answering.
Identify the most frequent noun tags, providing insights into subject
matter focus.
Practice mapping words to properties, which is useful in sentiment
scoring, feature engineering, and semantic analysis.
Explore rule-based and statistical taggers like the Unigram Tagger to
learn how machines assign POS tags and structure to unstructured text.
Tackle the challenge of word segmentation in unspaced text — a critical
task in languages without word boundaries (e.g., Chinese, Thai) or in
OCR/typo correction systems. This mirrors real-world scenarios in search
engines and autocorrect features.
By the end of this assignment, you will have a solid foundation in retrieving,
analyzing, and tagging text — equipping you with the skills to build intelligent
language-based applications like chatbots, search engines, or language translation
tools.
1.3 Problem Statement
The goal of this assignment is to understand and demonstrate the fundamentals of
Information Retrieval and Part-of-Speech (POS) Tagging in Natural Language
Processing (NLP) using the Natural Language Toolkit (NLTK) in Python.
Objectives:
1. Explore Predefined Corpora:
Study and explore various corpora such as Brown, Inaugural,
Reuters, and UDHR using methods like fileids(), raw(), words(),
sents(), and categories().
2. Create and Use Custom Corpora:
Create and utilize your own corpora using plaintext and categorical
formats.
3. Analyze Conditional Frequency Distributions:
Generate and interpret Conditional Frequency Distributions
(CFD) based on text data to understand word usage across different
categories or contexts.
4. Work with Tagged Corpora:
Study tagged corpora using methods such as tagged_sents() and
tagged_words() to explore how words are annotated with POS tags.
5. POS Tagging Analysis:
Write a program to extract and analyze the most frequent noun
tags from a tagged corpus.
6. Dictionary Mapping:
Implement Python dictionaries to map words to properties (e.g.,
word length, frequency, POS tags).
7. Implement Tagging Models:
Study and implement both Rule-based taggers and Unigram
taggers using NLTK.
8. Word Segmentation Task:
Write a program to segment a string of characters (without
spaces) into valid words using a given corpus, and compute the
score or probability of different segmentations.
1.4 Existing System
The existing system typically uses built-in corpus tools provided by libraries like
NLTK (Natural Language Toolkit). These tools allow exploration of well-known
corpora such as Brown, Reuters, and Inaugural addresses. These systems are often
used in NLP education and research for:
Simple corpus access
Tagging using pretrained taggers
Basic frequency distributions
Basic string comparison tasks
1.5 Proposed System
The proposed system builds upon the existing tools but adds customization,
interactivity, and real-world utility. It introduces user-defined corpora,
rule-based taggers, and a scoring-based word segmentation system.
1.6 Organization of the Report
Chapter 1: Introduction
Objective of the assignment
Overview of Information Retrieval in NLP
Tools used: Python, NLTK (Natural Language Toolkit)
Chapter 2: Exploring Standard Corpora
Overview of Corpora in NLTK
o Brown Corpus
o Inaugural Corpus
o Reuters Corpus
o Universal Declaration of Human Rights (UDHR)
Using Methods
o fileids()
o raw()
o words()
o sents()
o categories()
Code Examples and Outputs for each method on different corpora
Chapter 3: Creating and Using Custom Corpora
Plaintext Corpus
o How to create and load using PlaintextCorpusReader
Categorical Corpus
o Organizing files by categories
o Loading using CategorizedPlaintextCorpusReader
Code Snippets and Sample Files
Chapter 4: Conditional Frequency Distributions
Introduction to ConditionalFreqDist
Example with Brown or Reuters Corpus
o Frequency of words by category
Plotting distributions
Chapter 5: Working with Tagged Corpora
Tagged Sentences and Words
o tagged_sents()
o tagged_words()
Frequency of POS Tags
o Finding the most frequent noun tags (e.g., NN, NNP)
Code to extract and count tags
Chapter 6: Mapping Words to Properties
Using Python Dictionaries
o Mapping words to POS tags or categories
o Example: {'dog': 'noun', 'run': 'verb'}
Applications in NLP
Chapter 7: POS Tagging Techniques
Rule-Based Tagger
o Simple patterns using RegexpTagger
Unigram Tagger
o Training with tagged corpus
o Evaluation using accuracy()
Comparison and Results
Chapter 8: Word Segmentation from Plain Text
Problem Statement
o Plain text without spaces (e.g., itisanexample)
Using a Given Corpus of Words
o Comparing with a dictionary
Scoring Words
o Use of frequency or probability
Implementation and Output
CHAPTER-2
LITERATURE SURVEY
Topic: Corpus Overview
Description: The Brown, Inaugural, Reuters, and UDHR corpora are widely used for training models and conducting experiments. They contain diverse datasets in English and other languages.
Reference: Brown Corpus

Topic: Methods: fileids, raw, words, sents, categories
Description: These methods help access different components of a corpus: fileids() provides the document IDs, raw() retrieves all the text, words() gives a list of words, sents() returns a list of sentences, and categories() provides the different genres.
Reference: NLTK Documentation

Topic: Custom Corpora Creation
Description: Creating and using your own corpus allows handling domain-specific data or categorical datasets. This is done by defining a structure and incorporating textual data into it.
Reference: NLTK Custom Corpus

Topic: Conditional Frequency Distributions (CFD)
Description: Conditional frequency distributions calculate the frequency of words conditional on certain conditions, providing insights into contextual word usage.
Reference: NLTK CFD

Topic: Tagged Corpora and Methods
Description: Tagged corpora include annotations like part-of-speech (POS) tags. Methods like tagged_sents() and tagged_words() allow accessing sentences and words with their respective POS tags.
Reference: Tagged Corpora

Topic: Most Frequent Noun Tags Program
Description: A program to find the most frequent noun tags from a tagged corpus. This is useful for linguistic analysis to identify noun-heavy sentences.
Reference: NLTK POS Tagging

Topic: Mapping Words to Properties using Python Dictionaries
Description: Using dictionaries to map words to various properties like frequency or category. This is a foundational concept in NLP.
Reference: Python Dictionary in NLP

Topic: Rule-Based Tagger
Description: Rule-based taggers assign POS tags based on predefined rules. They are useful for improving accuracy on unknown words or ambiguous cases.
Reference: NLTK Rule-Based Tagging

Topic: Unigram Tagger
Description: A unigram tagger assigns each word the tag it most frequently received in the training data, providing a baseline for more complex taggers.
Reference: Unigram Tagger

Topic: Evaluation Metrics for NLP
Description: Evaluation metrics such as precision, recall, and F1-score help assess the performance of information retrieval and other NLP tasks.
Reference: NLP Evaluation Metrics
CHAPTER-3
SYSTEM REQUIREMENT AND SPECIFICATION
3.1 Hardware Requirement
Minimum System Requirements
Processor (CPU): Dual-core processor (e.g., Intel i3 or AMD Ryzen 3)
RAM: 4 GB
Storage: At least 10 GB free (to store Python environment, NLTK corpora,
and results)
Operating System: Windows 10, Ubuntu 20.04+, or macOS 10.15+
Python Version: Python 3.7 or above
Internet Connection: Required initially for downloading corpora via
NLTK
Recommended System Requirement
Processor (CPU): Quad-core processor (e.g., Intel i5/i7 or AMD Ryzen 5/7)
RAM: 8 GB or more (recommended especially for using multiple corpora
simultaneously)
Storage: SSD with at least 20 GB free for faster read/write speeds
Graphics (GPU): Not required (since this is not deep learning-based)
Operating System: Any modern OS (Linux preferred for better performance with
Python tools)
3.2 Software Requirement
1. Python (version 3.8 or above)
o Primary programming language for implementing NLP techniques.
2. NLTK (Natural Language Toolkit) Library
o Required for accessing standard corpora like Brown, Inaugural,
Reuters, and UDHR.
o Used for working with tagged corpora, frequency distributions,
taggers, etc.
3. NLTK Corpora Data
o Downloadable via [Link]():
brown
inaugural
reuters
udhr
punkt (for tokenization)
averaged_perceptron_tagger (for POS tagging)
universal_tagset (optional, for simplified tags)
4. Jupyter Notebook / Google Colab / Any Python IDE
o For writing, running, and documenting code interactively.
o IDEs like PyCharm, VSCode, or Anaconda are also suitable.
5. Operating System
o Windows, macOS, or Linux (any OS that supports Python and
NLTK).
6. Text Editor (Optional)
o For creating custom plain text or categorized corpora files (e.g.,
Notepad++, VSCode).
7. Memory Requirements
o At least 4 GB RAM (for working with medium-sized corpora
efficiently).
8. Internet Access
o Required to download NLTK datasets and resources if not already
available offline.
9. Matplotlib or Seaborn (Optional for Visualization)
o For visualizing frequency distributions or tag patterns (if needed).
10. Basic Knowledge of NLP Concepts
Understanding of tokenization, tagging, corpus usage, frequency
distributions, and tagging models.
3.3.1 Functional Requirement
1. Corpus Study
The system shall allow the user to load and explore built-in corpora:
o Brown, Inaugural, Reuters, UDHR
The system shall provide methods to:
o View available fields (e.g., fileids, categories)
o Retrieve raw text from a corpus
o Tokenize text into words, sentences
o Filter and display categories (where applicable)
2. Custom Corpus Creation and Usage
The system shall support creation of:
o A plain text corpus from a user-defined directory
o A categorized corpus where files belong to labeled categories
The system shall tokenize and display words, sents, or fileids from the
custom corpus
3. Conditional Frequency Distribution
The system shall compute and display the Conditional Frequency
Distribution:
o Frequency of words conditioned on categories or fileids
o Frequency of word-tag combinations in tagged corpora
4. Tagged Corpora Exploration
The system shall load pre-tagged corpora (e.g., Brown tagged corpus)
The system shall allow access to:
o tagged_sents – list of tagged sentences
o tagged_words – list of (word, tag) pairs
5. Most Frequent Noun Tags
The system shall extract and count parts of speech tags
The system shall identify and display the most frequent noun tags (e.g., NN,
NNS)
6. Word to Property Mapping Using Python Dictionaries
The system shall create a dictionary mapping words to specific properties
such as:
o Length
o Frequency
o POS tag
The user shall be able to query this dictionary for given words
7. POS Taggers
The system shall implement and demonstrate:
o A Rule-Based Tagger (using regular expressions or predefined rules)
o A Unigram Tagger (trained on tagged corpus data)
8. Word Segmentation (Word Break Problem)
The system shall:
o Accept a plain text string with no spaces
o Use a corpus-based dictionary to identify valid words in the string
o Segment the string into possible words
o Calculate and display scores (e.g., based on frequency or likelihood)
3.3.2 Non-Functional Requirement
Performance
The system should process and analyze corpora data (e.g., Brown,
Reuters) within a reasonable time (under 5 seconds for standard corpus
operations).
Tagging and frequency distribution operations should be optimized to
handle medium-size datasets (up to 10,000 words) efficiently.
Scalability
The program should be able to scale to handle additional corpora or user-
defined datasets without significant changes to the code.
Modular design should allow easy integration of new taggers or corpora.
Usability
The interface (CLI or GUI) should be simple and intuitive for students or
researchers with basic NLP knowledge.
Clear documentation and help messages should be provided for each
function.
Maintainability
The code should follow standard coding conventions (e.g., PEP8 for
Python) and be well-commented.
Functions should be modular to facilitate future updates or modifications.
Portability
The solution should work on major platforms like Windows, Linux, and
macOS with minimal configuration.
Dependencies should be managed using [Link] or pipenv.
Accuracy
The POS tagging and word segmentation should yield accurate results based
on standard NLP libraries like NLTK or spaCy.
Use trusted corpora like Brown and Reuters for evaluation to ensure
linguistic accuracy.
Reliability
The system should not crash or behave unexpectedly when working with
empty or malformed corpora.
Error handling should be in place for missing files or corrupted input.
Reusability
The corpora handling, tagging, and frequency distribution logic should be
implemented in reusable functions or classes.
Code modules should be general enough to be reused in other NLP tasks.
Security
If user-defined corpora are uploaded, ensure no arbitrary code execution
occurs (e.g., no eval() on user input).
Sanitize file inputs and restrict to plaintext only.
CHAPTER 4
SYSTEM DESIGN
4.1 System Architecture
1. Corpus Exploration Module
Use NLTK corpora: brown, reuters, inaugural, udhr
Functions: .words(), .sents(), .categories(), .fileids()
2. Custom Corpus Module
Use PlaintextCorpusReader for unstructured files.
Use CategorizedPlaintextCorpusReader with categories.
3. Conditional Frequency Distribution
[Link](): useful to study word usage per category or tag.
4. Tagged Corpus Analysis
Use [Link].tagged_words() and tagged_sents()
Count most frequent noun tags using tag patterns (e.g., NN, NNS, NNP, etc.)
5. POS Taggers
Implement basic [Link], [Link], RegexpTagger.
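Both tagger styles can be sketched in plain Python without the NLTK training data; the regex rules and the tiny training list below are illustrative stand-ins for RegexpTagger patterns and a tagged corpus, and the fallback mirrors NLTK's backoff= parameter.

```python
import re
from collections import Counter, defaultdict

# --- Rule-based tagging: regex patterns tried in order (cf. RegexpTagger) ---
rules = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ly$", "RB"),     # adverbs
    (r".*s$", "NNS"),     # plural nouns
    (r".*", "NN"),        # default: noun
]

def rule_tag(word):
    for pattern, tag in rules:
        if re.match(pattern, word):
            return tag

# --- Unigram tagging: most frequent tag per word in the training data ---
train = [("the", "AT"), ("dog", "NN"), ("dog", "NN"), ("runs", "VBZ"),
         ("the", "AT"), ("running", "VBG")]
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
unigram = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(word):
    # Back off to the rule-based tagger for unseen words.
    return unigram.get(word) or rule_tag(word)

print(rule_tag("running"))    # VBG
print(unigram_tag("dog"))     # NN
print(unigram_tag("slowly"))  # RB (unseen word, handled by the rules)
```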
6. Word Segmentation and Scoring
For text like "thereisacat":
o Use recursive algorithm to segment.
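One way to sketch the recursive approach, assuming a toy frequency dictionary in place of counts drawn from the reference corpus (the words and counts below are illustrative):

```python
from functools import lru_cache

# Toy word-frequency table standing in for counts from a reference corpus.
freq = {"there": 50, "the": 120, "is": 80, "a": 200, "cat": 30, "act": 10}
total = sum(freq.values())

@lru_cache(maxsize=None)
def segment(text):
    """Best-scoring segmentation of `text`, where a split is scored as the
    product of unigram word probabilities. Returns (score, words)."""
    if not text:
        return (1.0, [])
    best_score, best_words = 0.0, None
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in freq:
            rest_score, rest_words = segment(text[i:])
            if rest_words is not None:
                score = (freq[prefix] / total) * rest_score
                if score > best_score:
                    best_score, best_words = score, [prefix] + rest_words
    return (best_score, best_words)

score, pieces = segment("thereisacat")
print(pieces)  # ['there', 'is', 'a', 'cat']
```

The lru_cache memoization turns the naive exponential recursion into dynamic programming, since each suffix of the input is segmented at most once.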
4.2 Data Flow Diagram
4.3 Flowchart
4.4 Modules
1. Study the Various Corpora: Brown, Inaugural, Reuters,
UDHR
2. Create and Use Your Own Corpora (Plaintext,
Categorical)
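A categorized corpus is simply one directory per category. The sketch below builds such a layout in a temporary folder and reads it back with a naive whitespace tokenizer, mimicking (without requiring) NLTK's CategorizedPlaintextCorpusReader; the category names and sample texts are illustrative.

```python
import os
import tempfile

# Build a tiny categorized corpus on disk: one subdirectory per category.
root = tempfile.mkdtemp()
samples = {"sports": "The team won the match.",
           "tech": "The new phone has a fast chip."}
for category, text in samples.items():
    os.makedirs(os.path.join(root, category), exist_ok=True)
    with open(os.path.join(root, category, "doc1.txt"), "w") as f:
        f.write(text)

def words(category):
    """Naive whitespace tokenizer over every file in a category."""
    folder = os.path.join(root, category)
    tokens = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name)) as f:
            tokens.extend(f.read().split())
    return tokens

print(words("sports"))  # ['The', 'team', 'won', 'the', 'match.']
```

With NLTK installed, the same directory can be handed to CategorizedPlaintextCorpusReader, which additionally provides proper word and sentence tokenization.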
3. Conditional Frequency Distributions
4. Study Tagged Corpora
5. Most Frequent Noun Tags
6. Map Words to Properties Using Python Dictionaries
7. Study Rule-Based Tagger, Unigram Tagger
8. Word Segmentation Without Spaces and Scoring
4.5 Activity Diagram
4.6 Sequence Diagram
References
Books and Academic References
1. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd
edition, draft).
o Chapter 2 (Regular Expressions, Text Normalization)
o Chapter 3 (Language Modeling, Smoothing)
o Chapter 4 (Naive Bayes, Text Classification, Sentiment Analysis)
o Chapter 5 (POS Tagging, Taggers, and HMMs)
o URL: [Link]
2. Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media.
o Especially Chapters 2, 3, and 5:
Chapter 2: Accessing Text Corpora
Chapter 3: Processing Raw Text
Chapter 5: Categorizing and Tagging Words
o URL: [Link]
Toolkits and Libraries
3. NLTK (Natural Language Toolkit)
o Python library for NLP; includes all mentioned corpora and tagging
tools.
o Documentation: [Link]
o Corpora docs: [Link]
4. NLTK GitHub Repository
o For source code and examples:
o [Link]
Articles, Blogs, and Tutorials
5. NLTK Corpora Tutorial – by GeeksForGeeks
o [Link]resources-in-nlp-with-python-nltk/
6. Understanding POS Tagging in NLTK – Towards Data Science
o [Link]lemmatization-in-python-8c57a5dcb46c
7. Rule-based and Unigram Taggers in NLTK – Stack Overflow Discussions
and Examples
o [Link]
Optional for Advanced Word Segmentation
8. "Word Segmentation using Probability" – Peter Norvig's Blog
o Excellent explanation of finding word boundaries using bigrams and
scoring.
o [Link]
THANK YOU