Unit 5
1) Examples of Clustering Applications:
1. Marketing:
Companies can find different groups of customers based on their habits and likes.
This helps them create better marketing plans for each group.
2. Land use:
We can group areas of land that are used for the same purpose, like farming, forests,
or cities, by looking at satellite images.
3. Insurance:
Insurance companies can find groups of people who have similar types of insurance
and who make similar amounts of claims. This helps them manage risk better.
4. Urban planning:
City planners can group houses based on type (like apartments or bungalows), value,
and location. This helps in planning city services and development.
5. Seismology:
Earthquake experts can group earthquake starting points (epicenters) to find patterns
and see which fault lines are more active.
2) Text Preprocessing
Text Preprocessing in DSBDA (Data Science and Big Data Analytics)
Text preprocessing is an important step in data science, especially when working with text
data. In the DSBDA subject, it is considered part of the data preparation phase before
applying analytics, machine learning, or big data techniques.
📑 What is Text Preprocessing?
Text preprocessing is the process of cleaning, organizing, and transforming raw text data
into a structured format suitable for analysis and modeling.
✅ Common Steps in Text Preprocessing:
1. Lowercasing: Convert all text to lowercase to ensure uniformity (e.g., 'Apple' and 'apple' are treated the same).
2. Removing Punctuation: Remove commas, periods, special symbols, etc., which may not add meaning in analysis.
3. Tokenization: Break text into individual words or tokens (e.g., "I love data" → ["I", "love", "data"]).
4. Removing Stop Words: Remove commonly used words like 'the', 'is', 'in', which add little analytical value.
5. Stemming or Lemmatization: Convert words to their base or root form (e.g., 'running' → 'run').
6. Removing Numbers: Remove numerical values if not needed for the analysis.
7. Removing Extra Spaces: Clean unwanted extra spaces, tabs, or line breaks.
🎯 Why is Text Preprocessing Important?
Reduces noise and redundancy.
Makes the data uniform and analyzable.
Improves the accuracy of machine learning models.
Helps in extracting meaningful patterns from the text.
Example:
Input text:
"Data Science is exciting!! It allows data-driven decisions."
After preprocessing:
["data", "science", "excite", "allow", "data", "driven", "decision"]
3) Techniques to Handle Noise and Irrelevant Information in Text Data
When working with text data, it often contains noise, irrelevant information, and
inconsistencies, which can negatively affect the performance of data analysis, natural
language processing (NLP), or machine learning models.
To clean and prepare such data, several text preprocessing techniques are used.
1) Tokenization
Definition:
Tokenization is the process of splitting a text into individual units, known as tokens.
These tokens can be words, phrases, sentences, or even characters.
Purpose:
Tokenization helps in breaking down large unstructured text data into manageable
pieces for analysis. It is the first and fundamental step in text preprocessing.
Types of Tokenization:
o Word Tokenization: Splitting text into words.
o Sentence Tokenization: Splitting text into sentences.
o Character Tokenization: Splitting text into characters.
Example:
o Original Text: "Data Science is a vast field."
o Word Tokens: ["Data", "Science", "is", "a", "vast", "field", "."]
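As a small illustrative sketch (assuming NLTK is installed and its 'punkt' tokenizer data has been downloaded), word and sentence tokenization can be done as follows:
```python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Data Science is a vast field."
print(word_tokenize(text))  # ['Data', 'Science', 'is', 'a', 'vast', 'field', '.']
print(sent_tokenize(text))  # ['Data Science is a vast field.']
```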
2) Stemming
Definition:
Stemming is the process of reducing words to their root form by removing prefixes
or suffixes, without necessarily producing a valid dictionary word.
Purpose:
It helps in reducing word variations to a common base form, which simplifies the text
and reduces the feature space in NLP tasks.
Characteristics:
o May produce non-standard or invalid words.
o Fast and rule-based.
Example:
"running", "runner", "runs" → "run"
Common Algorithms:
o Porter Stemmer
o Snowball Stemmer
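A brief sketch with NLTK's Porter and Snowball stemmers (assumed available). Real stemmers may not reduce every variant exactly as the idealized example suggests (for instance, 'runner' is often left unchanged):
```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Compare the two rule-based stemmers on a few word variants
for word in ["running", "runner", "runs", "easily"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))
```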
3) Stop Words Removal
Definition:
Stop words are commonly used words in a language that are often filtered out as they
add little or no significant meaning to text analysis. Examples include "is", "the", "in",
"at", etc.
Purpose:
Removing stop words helps in focusing only on important and meaningful words,
reducing noise in the dataset.
Sources:
Libraries like NLTK, spaCy provide built-in lists of stop words.
Example:
o Original Text: "The data is analyzed by the scientist."
o After Stop Word Removal: "data analyzed scientist"
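A short sketch using NLTK's built-in English stop word list (assuming the 'stopwords' corpus has been downloaded via nltk.download('stopwords')):
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The data is analyzed by the scientist."
stop_words = set(stopwords.words('english'))

# Keep only alphabetic tokens that are not stop words
filtered = [w for w in word_tokenize(text.lower())
            if w.isalpha() and w not in stop_words]
print(filtered)  # ['data', 'analyzed', 'scientist']
```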
4) Lemmatization
Definition:
Lemmatization is the process of converting a word to its lemma, which is its
dictionary base or canonical form, considering the context and part of speech.
Purpose:
Unlike stemming, lemmatization provides meaningful words by using linguistic
analysis and vocabulary. It ensures grammatical correctness and meaningfulness.
Characteristics:
o Produces valid dictionary words.
o More accurate and slower than stemming.
Example:
o "better" → "good"
o "running" → "run"
Common Tools:
o WordNet Lemmatizer
o spaCy Lemmatizer
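A minimal sketch with NLTK's WordNet lemmatizer (assumes the 'wordnet' corpus has been downloaded); supplying the part of speech is what allows it to map 'better' to 'good':
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'
```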
4) Bag of Words (BoW) – In Simple Words
The Bag of Words model is a way to turn text (like sentences or documents) into numbers so
that a computer can understand and work with it.
Why is this useful?
Computers can’t understand words like humans do. So, we convert words into numbers, and
then use those numbers to train models for tasks like spam detection, sentiment analysis, etc.
How does it work?
1. Make a list of all the words that appear across your documents (called a vocabulary).
o Example: For the sentence "It is a puppy and it is extremely cute", the vocabulary contains: it, is, a, puppy, and, extremely, cute (plus any words contributed by other documents in the collection, such as "aardvark" or "cat").
2. Count how many times each vocabulary word appears in the document.
o For example:
"it" appears 2 times
"puppy" appears 1 time
"extremely" appears 1 time
"aardvark" and "cat" appear 0 times (they are in the vocabulary because of other documents, but not in this sentence)
This count is put into a table or a vector (a list of numbers). This is your bag-of-words vector.
What does it look like?
Imagine you have a table:
Rows = different documents
Columns = words from your vocabulary
The cells show how many times each word appears in that document.
Important Points
Words are treated as separate items, like objects in a "bag"—we don’t care about their
order.
The result is a fixed-length list for each document, based on the total vocabulary.
This is a basic but powerful technique for preparing text for machine learning models.
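A quick sketch with scikit-learn's CountVectorizer (assumed available). Note that its default tokenizer lowercases the text and drops one-letter tokens such as "a", so the vocabulary differs slightly from the hand-built example above:
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It is a puppy and it is extremely cute",
        "The cat is cute"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # rows = documents, columns = vocabulary words

print(vectorizer.get_feature_names_out())   # the shared vocabulary
print(X.toarray())                          # word counts per document
```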
5) TF – IDF
When you use Term Frequency (TF) alone, it only tells you how often a word appears
in a single document, without considering how common or rare that word is in the
whole collection of documents (corpus).
For example:
Words like "the", "is", or "and" may appear very frequently in almost every
document.
So, their TF will always be high, but these words don't help to identify or
differentiate one document from another because they are common everywhere.
TF alone can't tell if a word is important or unique for a specific document, since
it doesn't look at the bigger picture of the entire corpus.
That's why TF is often combined with IDF (Inverse Document Frequency) to
create TF-IDF, which adjusts the weight of terms by considering both their frequency
in a document and their rarity in the entire corpus.
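A common form of the weighting (one of several variants) is TF-IDF(t, d) = TF(t, d) × log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents containing term t. Below is a minimal sketch with scikit-learn's TfidfVectorizer (assumed available; it uses a smoothed IDF and normalizes each row, so the exact numbers differ from the simple formula):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "data science is fun",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Words occurring in many documents (e.g., 'the', 'sat', 'on') get a lower IDF
# than words unique to one document (e.g., 'data', 'science').
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```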
6) Difference between Random Subsampling and Holdout Method.
Aspect | Holdout Method | Random Subsampling
Definition | Dataset is split once into training and testing sets. | Dataset is randomly split multiple times into training and testing sets.
Splitting | Single split (e.g., 70% train, 30% test). | Repeated random splits (e.g., 50 iterations).
Usage | Quick estimate of model performance. | More reliable estimate by averaging across multiple splits.
Bias and Variance | High variance (depends heavily on the chosen split). | Lower variance (averages over multiple splits).
Coverage of Data | Some data may never be used in training or testing. | Same issue, but minimized over multiple iterations.
Complexity | Simple and fast. | Slightly more computational effort due to repetitions.
Risk | Risk of biased estimate if the split is not representative. | Reduces risk of biased performance due to more splits.
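A minimal holdout sketch with scikit-learn (assumed available), using a single random 70/30 split, so the reported score depends heavily on that one split:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Single 70/30 split (holdout method)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout test accuracy:", model.score(X_test, y_test))
```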
7) Random Subsampling
1. ✅ Random Subsampling (Repeated Holdout)
What is it?
A method where you randomly split the data into training and testing sets
multiple times and average the performance.
Steps:
1. Randomly split the dataset (e.g., 70% train, 30% test).
2. Train the model on the train set.
3. Test it on the test set.
4. Repeat steps 1-3 multiple times (e.g., 10, 50 times).
5. Average the performance metrics.
Key Point:
o Splits are random and repeated.
o Some data points may never be selected for testing across the repeated splits, while others may be selected many times.
2. ✅ Cross Validation (General Concept)
What is it?
Cross-validation is a general term for techniques that split the dataset into
multiple parts to get a more reliable estimate of model performance.
Purpose:
To reduce bias and variance by ensuring that every data point gets a chance
to be in both training and testing sets.
Types of Cross Validation:
o K-Fold Cross Validation
o Leave-One-Out Cross Validation (LOOCV)
o Stratified K-Fold (for classification)
3. ✅ K-Fold Cross Validation (Most common form of Cross Validation)
What is it?
A systematic way of cross-validation where the data is divided into K equal
parts (folds).
Steps:
1. Split the data into K equal folds.
2. For each fold:
Use that fold as the test set.
Use the remaining K-1 folds as the training set.
3. Repeat this process K times, each time using a different fold as the test
set.
4. Average the performance metrics.
Key Point:
o Every data point is used exactly once as test data.
o More systematic and fair compared to random subsampling.
o Common choices of K are 5 or 10.
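A hedged sketch comparing the two approaches with scikit-learn (assumed available): ShuffleSplit performs repeated random 70/30 splits (random subsampling), while KFold guarantees every sample is used as test data exactly once:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Random subsampling: 10 independent random 70/30 splits, scores averaged
subsampling = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
print("Random subsampling accuracy:",
      cross_val_score(model, X, y, cv=subsampling).mean())

# 5-fold cross-validation: each sample appears in the test fold exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold CV accuracy:",
      cross_val_score(model, X, y, cv=kfold).mean())
```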
✅ Summary Table:
Method | Description | Splitting Type | Bias/Variance
Random Subsampling | Random split repeated multiple times. | Random splits, repeated. | Lower bias than holdout, but some data may never be used.
Cross Validation | General term for model validation by data splitting. | Depends on specific method. | Reduces bias and variance.
K-Fold Cross Validation | Data is divided into K parts; each part used as test once. | Systematic K equal splits. | Balanced, reliable, all data used.