DATA STRUCTURES
ACSD08
A REPORT ON COMPLEX ENGINEERING
PROBLEM
SOLVING (AAT - 2)
Vemula Rohan
23951A04E3
ECE-C
DSCSP87
Complex Problem Solving Self-Assessment Form
1 Name of the Student Vemula Rohan
2 Roll Number 23951A04E3
3 Branch and Section ECE-C
4 Program B.Tech
5 Course Name Data Structures
6 Course Code ACSD08
7 Please tick (✓) relevant Engineering Competency (ECs) Profiles
EC Profiles (✓)
EC1 Ensures that all aspects of an engineering activity are soundly based on ✓
fundamental principles - by diagnosing, and taking appropriate action with
data, calculations, results, proposals, processes, practices, and
documented information that may be ill-founded, illogical, erroneous,
unreliable or unrealistic requirements applicable to the engineering discipline.
EC2 Have no obvious solution and requires abstract thinking and originality in ✓
analysis to formulate suitable models.
EC3 Support sustainable development solutions by ensuring functional ✓
requirements, minimize environmental impact and optimize resource
utilization throughout the life cycle, while balancing performance and cost
effectiveness.
EC4 Competently addresses complex engineering problems which involve ✓
uncertainty, ambiguity, imprecise information and wide-ranging or
conflicting technical, engineering and other issues.
EC5 Conceptualises alternative engineering approaches and evaluates ✓
potential outcomes against appropriate criteria to justify an optimal
solution choice.
EC6 Identifies, quantifies, mitigates and manages technical, health, ✓
environmental, safety, economic and other contextual risks to seek
achievable sustainable outcomes with engineering application in the
designated engineering discipline.
EC7 Involve the coordination of diverse resources (and for this purpose, ✓
resources include people, money, equipment, materials, information and
technologies) in the timely delivery of outcomes.
EC8 Design and develop solutions to complex engineering problems, ✓
considering a wide perspective and taking account of stakeholder views with
widely varying needs.
EC9 Meet all legal, regulatory, and relevant standards and codes of practice, ✓
and protect public health and safety in the course of all engineering activities.
EC10 High-level problems including many component parts or sub-problems; ✓
partitions problems, processes or systems into manageable elements for the
purposes of analysis, modelling or design and then re-combines to form
a whole, with the integrity and performance of the overall system as the top
consideration.
EC11 Undertake CPD activities to maintain and extend competences and ✓
enhance the ability to adapt to emerging technologies and the ever-changing
nature of work.
EC12 Recognize complexity and assess alternatives in light of competing ✓
requirements and incomplete knowledge. Require judgement in decision
making in the course of all complex engineering activities.
8 Please tick (✓) relevant Course Outcomes (COs) Covered
CO Course Outcomes (✓)
CO1 Interpret the complexity of the algorithm using the asymptotic notations ✓
CO2 Select the appropriate searching and sorting technique for a given problem ✓
CO3 Construct programs on performing operations on linear and nonlinear data ✓
structures for organization of data
CO4 Make use of linear data structures and nonlinear data structures for solving ✓
real-time applications.
CO5 Describe hashing techniques and collision resolution methods for accessing data ✓
with respect to performance
CO6 Compare various types of data structures in terms of implementation, ✓
operations and performance.
9 Course ELRV Video Lectures Viewed: Number of Videos: 68; Viewing Time in Hours: 35
10 Justify your understanding of WK1: Foundations for analyzing and optimizing operations.
11 Justify your understanding of WK2–WK9: Core to advanced concepts, tools, design, and ethics.
12 How many WKs from WK2 to WK9 were implemented? All 8 WKs from WK2 to WK9
are implemented in this design and analysis.
Mention them: WK2 to WK9
Date: 22-12-2024
Signature of the Student
PROBLEM STATEMENT
Preprocessing Text Data for Natural Language Processing
(NLP) Models DS CSP87
I. Project Overview
Preprocessing text data is a fundamental step in Natural Language Processing (NLP).
Raw text data from various sources, such as social media posts, articles, or speech
transcriptions, is often noisy and unstructured. Effective preprocessing transforms
this raw data into a clean, structured format that machine learning models can
process. This ensures better performance, generalization, and interpretability of NLP
models.
II. Objectives
1. Text Normalization: Convert text into a consistent format, including cleaning,
tokenization, and case normalization.
2. Noise Removal: Eliminate unwanted elements such as punctuation, special
characters, and irrelevant data.
3. Stopword Removal: Remove common words (e.g., "the," "is") that do not
contribute significant meaning.
4. Stemming and Lemmatization: Reduce words to their root forms to minimize
redundant variations.
5. Feature Representation: Prepare text for numerical representation, such as
Bag-of-Words, TF-IDF, or embeddings.
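To make Objective 5 concrete, a Bag-of-Words representation is simply a count of how often each token appears. The sketch below (class and method names are illustrative, not part of the implementation in Section VI) builds one from an already-tokenized text:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWordsSketch {
    // Count token frequencies; LinkedHashMap preserves first-seen order
    static Map<String, Integer> bagOfWords(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"quick", "brown", "fox", "quick"};
        System.out.println(bagOfWords(tokens)); // {quick=2, brown=1, fox=1}
    }
}
```

TF-IDF and embeddings build on the same idea, re-weighting or replacing these raw counts.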
III. Key Steps in Preprocessing
1. Cleaning the Text:
o Remove punctuation, numbers, and special characters.
o Convert text to lowercase for consistency.
2. Tokenization:
o Split text into individual words or tokens.
3. Stopword Removal:
o Eliminate common, less meaningful words (e.g., "and," "to," "at").
4. Stemming and Lemmatization:
o Stemming: Strip words down to their roots (e.g., "running" → "run").
o Lemmatization: Convert words to their base dictionary form (e.g., "better"
→ "good").
5. Handling Noise:
o Remove duplicate spaces, URLs, or irrelevant content such as HTML tags.
6. Final Output:
o Produce a clean, structured list of words or sentences ready for feature
extraction.
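Step 5 (handling noise) is not covered by the main implementation in Section VI; a rough regular-expression sketch might look like the following. The patterns here are illustrative, not exhaustive:

```java
public class NoiseRemovalSketch {
    // Strip URLs, HTML tags, and duplicate spaces from raw text
    static String removeNoise(String text) {
        return text
                .replaceAll("https?://\\S+", " ") // URLs
                .replaceAll("<[^>]+>", " ")       // HTML tags
                .replaceAll("\\s{2,}", " ")       // collapse duplicate whitespace
                .trim();
    }

    public static void main(String[] args) {
        String raw = "<p>Visit https://example.com  for   details</p>";
        System.out.println(removeNoise(raw)); // Visit for details
    }
}
```

A step like this would run before cleaning and tokenization, since HTML tags and URLs would otherwise leak alphabetic fragments into the token stream.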
IV. Challenges
1. Language Dependency: Preprocessing rules vary for different languages (e.g.,
compound words in German).
2. Ambiguity: Words with multiple meanings (e.g., "bat") require context for
accurate preprocessing.
3. Data Size: Handling large text corpora efficiently.
4. Maintaining Semantic Meaning: Excessive cleaning might remove
meaningful words, affecting downstream tasks.
V. Technologies Used
1. Programming Language: Java
2. Libraries/Frameworks: None (built using native Java capabilities)
3. Concepts: Object-Oriented Programming, Regular Expressions, Stream API
VI. Java Implementation
Code: Text Preprocessing for NLP
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class TextPreprocessing {
// Stopwords list
private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
"the", "is", "in", "and", "to", "a", "of", "it", "on", "for", "this", "that", "with",
"as", "at", "by", "an"
));
// Main preprocessing pipeline
public static void main(String[] args) {
// Sample text data
List<String> textData = Arrays.asList(
"The quick brown fox jumps over the lazy dog!",
"NLP is fun, and it's exciting to preprocess text!",
"Cleaning & normalizing text improves model accuracy."
);
System.out.println("Original Text Data:");
textData.forEach(System.out::println);
// Step 1: Clean Text
List<String> cleanedTexts = textData.stream()
.map(TextPreprocessing::cleanText)
.collect(Collectors.toList());
// Step 2: Tokenize Text
List<List<String>> tokenizedTexts = cleanedTexts.stream()
.map(TextPreprocessing::tokenizeText)
.collect(Collectors.toList());
// Step 3: Remove Stopwords
List<List<String>> filteredTokens = tokenizedTexts.stream()
.map(TextPreprocessing::removeStopwords)
.collect(Collectors.toList());
// Step 4: Stemming (Simplistic Implementation)
List<List<String>> stemmedTokens = filteredTokens.stream()
.map(TextPreprocessing::stemTokens)
.collect(Collectors.toList());
// Step 5: Combine Tokens Back to Text
List<String> processedTexts = stemmedTokens.stream()
.map(tokens -> String.join(" ", tokens))
.collect(Collectors.toList());
// Display Processed Texts
System.out.println("\nProcessed Text Data:");
processedTexts.forEach(System.out::println);
}
// Step 1: Clean text (remove punctuation, convert to lowercase)
private static String cleanText(String text) {
return text.toLowerCase().replaceAll("[^a-zA-Z\\s]", "");
}
// Step 2: Tokenize text
private static List<String> tokenizeText(String text) {
return Arrays.asList(text.split("\\s+"));
}
// Step 3: Remove stopwords
private static List<String> removeStopwords(List<String> tokens) {
return tokens.stream()
.filter(token -> !STOPWORDS.contains(token))
.collect(Collectors.toList());
}
// Step 4: Stemming (simplistic implementation)
private static List<String> stemTokens(List<String> tokens) {
// Mimic stemming by truncating to the first 4 characters (for demonstration purposes)
return tokens.stream()
.map(token -> token.length() > 4 ? token.substring(0, 4) : token)
.collect(Collectors.toList());
}
}
VII. Explanation of Code
1. Clean Text: Removes non-alphabetic characters and converts the text to
lowercase.
2. Tokenization: Splits sentences into individual words using whitespace as a
delimiter.
3. Stopword Removal: Filters out predefined stopwords stored in a HashSet.
4. Stemming: Simplistic implementation truncates words to their first 4 characters.
Replace this logic with a proper stemming library if needed.
5. Pipeline Design: Each preprocessing step is modular, enabling flexibility and
easier debugging.
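As a slightly less crude alternative to the 4-character truncation, a few common English suffixes can be stripped by rule. This is still only a sketch (the suffix list and length guard are illustrative choices), not a substitute for a proper stemmer such as Porter's:

```java
public class SuffixStemmerSketch {
    // Try longer suffixes first so "ing" is checked before "s"
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Strip the first matching suffix, keeping a stem of at least 3 letters
    static String stem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stem("running")); // runn
        System.out.println(stem("jumps"));   // jump
        System.out.println(stem("dog"));     // dog
    }
}
```

Swapping this `stem` method into `stemTokens` would keep the pipeline structure unchanged while producing somewhat more word-like stems.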
VIII. Sample Execution
Input:
Original Text Data:
The quick brown fox jumps over the lazy dog!
NLP is fun, and it's exciting to preprocess text!
Cleaning & normalizing text improves model accuracy.
Output:
Processed Text Data:
quic brow fox jump over lazy dog
nlp fun its exci prep text
clea norm text impr mode accu
IX. Applications
1. Sentiment Analysis: Preprocessed text serves as input for models that predict
sentiment (positive, negative, or neutral).
2. Text Classification: Useful for categorizing news articles or emails (e.g., spam
detection).
3. Topic Modeling: Extract latent topics from large corpora using algorithms like
LDA.
4. Machine Translation: Cleaned text is essential for training translation models.
X. Conclusion
This Java implementation demonstrates a complete, modular pipeline for text
preprocessing in NLP. The approach can be expanded to include advanced
techniques like lemmatization, named entity recognition, or feature extraction. By
understanding and implementing preprocessing effectively, developers can ensure
their models work optimally with real-world data.