DATA STRUCTURES
ACSD08
A REPORT ON COMPLEX ENGINEERING
PROBLEM
SOLVING (AAT - 2)
Vemula Rohan
23951A04E3
ECE-C
DSCSP87
Complex Problem Solving Self-Assessment Form
1 Name of the Student Vemula Rohan
2 Roll Number 23951A04E3
3 Branch and Section ECE-C
4 Program B.Tech
5 Course Name Data Structures
6 Course Code ACSD08
7 Please tick (✓) relevant Engineering Competency (ECs) Profiles
EC Profiles (✓)
EC1 Ensures that all aspects of an engineering activity are soundly based on ✓
fundamental principles - by diagnosing, and taking appropriate action with
data, calculations, results, proposals, processes, practices, and
documented information that may be ill-founded, illogical, erroneous,
unreliable or unrealistic requirements applicable to the engineering discipline.
EC2 Have no obvious solution and requires abstract thinking and originality in ✓
analysis to formulate suitable models.
EC3 Support sustainable development solutions by ensuring functional ✓
requirements, minimize environmental impact and optimize resource
utilization throughout the life cycle, while balancing performance and cost
effectiveness.
EC4 Competently addresses complex engineering problems which involve ✓
uncertainty, ambiguity, imprecise information and wide-ranging or
conflicting technical, engineering and other issues.
EC5 Conceptualises alternative engineering approaches and evaluates ✓
potential outcomes against appropriate criteria to justify an optimal
solution choice.
EC6 Identifies, quantifies, mitigates and manages technical, health, ✓
environmental, safety, economic and other contextual risks to seek
achievable sustainable outcomes with engineering application in the
designated engineering discipline.
EC7 Involve the coordination of diverse resources (and for this purpose, ✓
resources include people, money, equipment, materials, information and
technologies) in the timely delivery of outcomes.
EC8 Design and develop solutions to complex engineering problems, ✓
considering a wide perspective and taking account of stakeholder views with
widely varying needs.
EC9 Meet all legal, regulatory, and relevant standards and codes of practice, ✓
and protect public health and safety in the course of all engineering activities.
EC10 High-level problems including many component parts or sub-problems; ✓
partitions problems, processes or systems into manageable elements for the
purposes of analysis, modelling or design and then re-combines to form
a whole, with the integrity and performance of the overall system as the top
consideration.
EC11 Undertake CPD activities to maintain and extend competences and ✓
enhance the ability to adapt to emerging technologies and the ever-changing
nature of work.
EC12 Recognize complexity and assess alternatives in light of competing ✓
requirements and incomplete knowledge. Require judgement in decision
making in the course of all complex engineering activities.
8 Please tick (✓) relevant Course Outcomes (COs) Covered
CO Course Outcomes (✓)
CO1 Interpret the complexity of the algorithm using the asymptotic notations ✓
CO2 Select the appropriate searching and sorting technique for a given problem ✓
CO3 Construct programs on performing operations on linear and nonlinear data ✓
structures for organization of data
CO4 Make use of linear data structures and nonlinear data structures for solving ✓
real-time applications.
CO5 Describe hashing techniques and collision resolution methods for accessing data ✓
with respect to performance
CO6 Compare various types of data structures in terms of implementation, ✓
operations and performance.
9 Course ELRV Video Lectures Viewed: Number of Videos: 68; Viewing Time in Hours: 35
10 Justify your understanding of WK1: Foundations for analyzing and optimizing operations.
11 Justify your understanding of WK2–WK9: Core to advanced concepts, tools, design, and ethics.
12 How many WKs from WK2 to WK9 were implemented? All 8 WKs from WK2 to WK9
are implemented in this design and analysis.
Mention them: WK2 to WK9
Date: 22-12-2024
Signature of the Student
PROBLEM STATEMENT
Preprocessing Text Data for Natural Language Processing
(NLP) Models DS CSP87
I. Project Overview
Preprocessing text data is a fundamental step in Natural Language Processing (NLP).
Raw text data from various sources, such as social media posts, articles, or speech
transcriptions, is often noisy and unstructured. Effective preprocessing transforms
this raw data into a clean, structured format that machine learning models can
process. This ensures better performance, generalization, and interpretability of NLP
models.
II. Objectives
1. Text Normalization: Convert text into a consistent format, including cleaning,
tokenization, and case normalization.
2. Noise Removal: Eliminate unwanted elements such as punctuation, special
characters, and irrelevant data.
3. Stopword Removal: Remove common words (e.g., "the," "is") that do not
contribute significant meaning.
4. Stemming and Lemmatization: Reduce words to their root forms to minimize
redundant variations.
5. Feature Representation: Prepare text for numerical representation, such as
Bag-of-Words, TF-IDF, or embeddings.
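To make Objective 5 concrete, a Bag-of-Words representation is simply a count of how often each token appears. The sketch below (class and method names are illustrative, not part of the implementation in Section VI) builds one from an already-tokenized text:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWordsSketch {
    // Count token frequencies; LinkedHashMap preserves first-seen order
    static Map<String, Integer> bagOfWords(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"quick", "brown", "fox", "quick"};
        System.out.println(bagOfWords(tokens)); // {quick=2, brown=1, fox=1}
    }
}
```

TF-IDF and embeddings build on the same idea, re-weighting or replacing these raw counts.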
III. Key Steps in Preprocessing
1. Cleaning the Text:
o Remove punctuation, numbers, and special characters.
o Convert text to lowercase for consistency.
2. Tokenization:
o Split text into individual words or tokens.
3. Stopword Removal:
o Eliminate common, less meaningful words (e.g., "and," "to," "at").
4. Stemming and Lemmatization:
o Stemming: Strip words down to their roots (e.g., "running" → "run").
o Lemmatization: Convert words to their base dictionary form (e.g., "better"
→ "good").
5. Handling Noise:
o Remove duplicate spaces, URLs, or irrelevant content such as HTML tags.
6. Final Output:
o Produce a clean, structured list of words or sentences ready for feature
extraction.
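Step 5 (handling noise) is not covered by the main implementation in Section VI; a rough regular-expression sketch might look like the following. The patterns here are illustrative, not exhaustive:

```java
public class NoiseRemovalSketch {
    // Strip URLs, HTML tags, and duplicate spaces from raw text
    static String removeNoise(String text) {
        return text
                .replaceAll("https?://\\S+", " ") // URLs
                .replaceAll("<[^>]+>", " ")       // HTML tags
                .replaceAll("\\s{2,}", " ")       // collapse duplicate whitespace
                .trim();
    }

    public static void main(String[] args) {
        String raw = "<p>Visit https://example.com  for   details</p>";
        System.out.println(removeNoise(raw)); // Visit for details
    }
}
```

A step like this would run before cleaning and tokenization, since HTML tags and URLs would otherwise leak alphabetic fragments into the token stream.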
IV. Challenges
1. Language Dependency: Preprocessing rules vary for different languages (e.g.,
compound words in German).
2. Ambiguity: Words with multiple meanings (e.g., "bat") require context for
accurate preprocessing.
3. Data Size: Handling large text corpora efficiently.
4. Maintaining Semantic Meaning: Excessive cleaning might remove
meaningful words, affecting downstream tasks.
V. Technologies Used
1. Programming Language: Java
2. Libraries/Frameworks: None (built using native Java capabilities)
3. Concepts: Object-Oriented Programming, Regular Expressions, Stream API
VI. Java Implementation
Code: Text Preprocessing for NLP
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class TextPreprocessing {
// Stopwords list
private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
"the", "is", "in", "and", "to", "a", "of", "it", "on", "for", "this", "that", "with",
"as", "at", "by", "an"
));
// Main preprocessing pipeline
public static void main(String[] args) {
// Sample text data
List<String> textData = Arrays.asList(
"The quick brown fox jumps over the lazy dog!",
"NLP is fun, and it's exciting to preprocess text!",
"Cleaning & normalizing text improves model accuracy."
);
System.out.println("Original Text Data:");
textData.forEach(System.out::println);
// Step 1: Clean Text
List<String> cleanedTexts = textData.stream()
.map(TextPreprocessing::cleanText)
.collect(Collectors.toList());
// Step 2: Tokenize Text
List<List<String>> tokenizedTexts = cleanedTexts.stream()
.map(TextPreprocessing::tokenizeText)
.collect(Collectors.toList());
// Step 3: Remove Stopwords
List<List<String>> filteredTokens = tokenizedTexts.stream()
.map(TextPreprocessing::removeStopwords)
.collect(Collectors.toList());
// Step 4: Stemming (Simplistic Implementation)
List<List<String>> stemmedTokens = filteredTokens.stream()
.map(TextPreprocessing::stemTokens)
.collect(Collectors.toList());
// Step 5: Combine Tokens Back to Text
List<String> processedTexts = stemmedTokens.stream()
.map(tokens -> String.join(" ", tokens))
.collect(Collectors.toList());
// Display Processed Texts
System.out.println("\nProcessed Text Data:");
processedTexts.forEach(System.out::println);
}
// Step 1: Clean text (remove punctuation, convert to lowercase)
private static String cleanText(String text) {
return text.toLowerCase().replaceAll("[^a-zA-Z\\s]", "");
}
// Step 2: Tokenize text
private static List<String> tokenizeText(String text) {
return Arrays.asList(text.split("\\s+"));
}
// Step 3: Remove stopwords
private static List<String> removeStopwords(List<String> tokens) {
return tokens.stream()
.filter(token -> !STOPWORDS.contains(token))
.collect(Collectors.toList());
}
// Step 4: Stemming (simplistic implementation)
private static List<String> stemTokens(List<String> tokens) {
// Mimic stemming by truncating to the first 4 characters (for demonstration purposes)
return tokens.stream()
.map(token -> token.length() > 4 ? token.substring(0, 4) : token)
.collect(Collectors.toList());
}
}
VII. Explanation of Code
1. Clean Text: Removes non-alphabetic characters and converts the text to
lowercase.
2. Tokenization: Splits sentences into individual words using whitespace as a
delimiter.
3. Stopword Removal: Filters out predefined stopwords stored in a HashSet.
4. Stemming: Simplistic implementation truncates words to their first 4 characters.
Replace this logic with a proper stemming library if needed.
5. Pipeline Design: Each preprocessing step is modular, enabling flexibility and
easier debugging.
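As a slightly less crude alternative to the 4-character truncation, a few common English suffixes can be stripped by rule. This is still only a sketch (the suffix list and length guard are illustrative choices), not a substitute for a proper stemmer such as Porter's:

```java
public class SuffixStemmerSketch {
    // Try longer suffixes first so "ing" is checked before "s"
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Strip the first matching suffix, keeping a stem of at least 3 letters
    static String stem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stem("running")); // runn
        System.out.println(stem("jumps"));   // jump
        System.out.println(stem("dog"));     // dog
    }
}
```

Swapping this `stem` method into `stemTokens` would keep the pipeline structure unchanged while producing somewhat more word-like stems.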
VIII. Sample Execution
Input:
Original Text Data:
The quick brown fox jumps over the lazy dog!
NLP is fun, and it's exciting to preprocess text!
Cleaning & normalizing text improves model accuracy.
Output:
Processed Text Data:
quic brow fox jump over lazy dog
nlp fun its exci prep text
clea norm text impr mode accu
IX. Applications
1. Sentiment Analysis: Preprocessed text serves as input for models that predict
sentiment (positive, negative, or neutral).
2. Text Classification: Useful for categorizing news articles or emails (e.g., spam
detection).
3. Topic Modeling: Extract latent topics from large corpora using algorithms like
LDA.
4. Machine Translation: Cleaned text is essential for training translation models.
X. Conclusion
This Java implementation demonstrates a complete, modular pipeline for text
preprocessing in NLP. The approach can be expanded to include advanced
techniques like lemmatization, named entity recognition, or feature extraction. By
understanding and implementing preprocessing effectively, developers can ensure
their models work optimally with real-world data.