Unit 5
1) Examples of Clustering Applications:
1. Marketing:
Companies can find different groups of customers based on their habits and likes.
This helps them create better marketing plans for each group.
2. Land use:
We can group areas of land that are used for the same purpose, like farming, forests,
or cities, by looking at satellite images.
3. Insurance:
Insurance companies can find groups of people who have similar types of insurance
and who make similar amounts of claims. This helps them manage risk better.
4. Urban planning:
City planners can group houses based on type (like apartments or bungalows), value,
and location. This helps in planning city services and development.
5. Seismology:
Earthquake experts can group earthquake starting points (epicenters) to find patterns
and see which fault lines are more active.
2) Text Preprocessing
Text Preprocessing in DSBDA (Data Science and Big Data Analytics)
Text preprocessing is an important step in data science, especially when working with text
data. In the DSBDA subject, it is considered part of the data preparation phase before
applying analytics, machine learning, or big data techniques.
📑 What is Text Preprocessing?
Text preprocessing is the process of cleaning, organizing, and transforming raw text data
into a structured format suitable for analysis and modeling.
✅ Common Steps in Text Preprocessing:
1. Lowercasing: Convert all text to lowercase to ensure uniformity (e.g., 'Apple' and 'apple' are treated the same).
2. Removing Punctuation: Remove commas, periods, special symbols, etc., which may not add meaning in analysis.
3. Tokenization: Break text into individual words or tokens (e.g., "I love data" → ["I", "love", "data"]).
4. Removing Stop Words: Remove commonly used words like 'the', 'is', 'in', which add little analytical value.
5. Stemming or Lemmatization: Convert words to their base or root form (e.g., 'running' → 'run').
6. Removing Numbers: Remove numerical values if not needed for the analysis.
7. Removing Extra Spaces: Clean unwanted extra spaces, tabs, or line breaks.
🎯 Why is Text Preprocessing Important?
Reduces noise and redundancy.
Makes the data uniform and analyzable.
Improves the accuracy of machine learning models.
Helps in extracting meaningful patterns from the text.
Example:
Input text:
"Data Science is exciting!! It allows data-driven decisions."
After preprocessing:
["data", "science", "excite", "allow", "data", "driven", "decision"]
3) Techniques to Handle Noise and Irrelevant Information in Text Data
When working with text data, it often contains noise, irrelevant information, and
inconsistencies, which can negatively affect the performance of data analysis, natural
language processing (NLP), or machine learning models.
To clean and prepare such data, several text preprocessing techniques are used.
1) Tokenization
Definition:
Tokenization is the process of splitting a text into individual units, known as tokens.
These tokens can be words, phrases, sentences, or even characters.
Purpose:
Tokenization helps in breaking down large unstructured text data into manageable
pieces for analysis. It is the first and fundamental step in text preprocessing.
Types of Tokenization:
o Word Tokenization: Splitting text into words.
o Sentence Tokenization: Splitting text into sentences.
o Character Tokenization: Splitting text into characters.
Example:
o Original Text: "Data Science is a vast field."
o Word Tokens: ["Data", "Science", "is", "a", "vast", "field", "."]
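As a small illustrative sketch (assuming NLTK is installed and its 'punkt' tokenizer data has been downloaded), word and sentence tokenization can be done as follows:
```python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Data Science is a vast field."
print(word_tokenize(text))  # ['Data', 'Science', 'is', 'a', 'vast', 'field', '.']
print(sent_tokenize(text))  # ['Data Science is a vast field.']
```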
2) Stemming
Definition:
Stemming is the process of reducing words to their root form by removing prefixes
or suffixes, without necessarily producing a valid dictionary word.
Purpose:
It helps in reducing word variations to a common base form, which simplifies the text
and reduces the feature space in NLP tasks.
Characteristics:
o May produce non-standard or invalid words.
o Fast and rule-based.
Example:
"running", "runner", "runs" → "run"
Common Algorithms:
o Porter Stemmer
o Snowball Stemmer
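A brief sketch with NLTK's Porter and Snowball stemmers (assumed available). Real stemmers may not reduce every variant exactly as the idealized example suggests (for instance, 'runner' is often left unchanged):
```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Compare the two rule-based stemmers on a few word variants
for word in ["running", "runner", "runs", "easily"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))
```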
3) Stop Words Removal
Definition:
Stop words are commonly used words in a language that are often filtered out as they
add little or no significant meaning to text analysis. Examples include "is", "the", "in",
"at", etc.
Purpose:
Removing stop words helps in focusing only on important and meaningful words,
reducing noise in the dataset.
Sources:
Libraries like NLTK, spaCy provide built-in lists of stop words.
Example:
o Original Text: "The data is analyzed by the scientist."
o After Stop Word Removal: "data analyzed scientist"
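A short sketch using NLTK's built-in English stop word list (assuming the 'stopwords' corpus has been downloaded via nltk.download('stopwords')):
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The data is analyzed by the scientist."
stop_words = set(stopwords.words('english'))

# Keep only alphabetic tokens that are not stop words
filtered = [w for w in word_tokenize(text.lower())
            if w.isalpha() and w not in stop_words]
print(filtered)  # ['data', 'analyzed', 'scientist']
```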
4) Lemmatization
Definition:
Lemmatization is the process of converting a word to its lemma, which is its
dictionary base or canonical form, considering the context and part of speech.
Purpose:
Unlike stemming, lemmatization provides meaningful words by using linguistic
analysis and vocabulary. It ensures grammatical correctness and meaningfulness.
Characteristics:
o Produces valid dictionary words.
o More accurate and slower than stemming.
Example:
o "better" → "good"
o "running" → "run"
Common Tools:
o WordNet Lemmatizer
o spaCy Lemmatizer
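A minimal sketch with NLTK's WordNet lemmatizer (assumes the 'wordnet' corpus has been downloaded); supplying the part of speech is what allows it to map 'better' to 'good':
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'
```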
4) Bag of Words (BoW) – In Simple Words
The Bag of Words model is a way to turn text (like sentences or documents) into numbers so
that a computer can understand and work with it.
Why is this useful?
Computers can’t understand words like humans do. So, we convert words into numbers, and
then use those numbers to train models for tasks like spam detection, sentiment analysis, etc.
How does it work?
1. Make a list of all the words that appear across your documents (called a vocabulary).
o Example: For the sentence "It is a puppy and it is extremely cute", the vocabulary contains: it, is, a, puppy, and, extremely, cute (plus any words contributed by other documents in the collection, such as "aardvark" or "cat").
2. Count how many times each vocabulary word appears in the document.
o For example:
"it" appears 2 times
"puppy" appears 1 time
"extremely" appears 1 time
"aardvark" and "cat" appear 0 times (they are in the vocabulary because of other documents, but not in this sentence)
This count is put into a table or a vector (a list of numbers). This is your bag-of-words vector.
What does it look like?
Imagine you have a table:
Rows = different documents
Columns = words from your vocabulary
The cells show how many times each word appears in that document.
Important Points
Words are treated as separate items, like objects in a "bag"—we don’t care about their
order.
The result is a fixed-length list for each document, based on the total vocabulary.
This is a basic but powerful technique for preparing text for machine learning models.
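A quick sketch with scikit-learn's CountVectorizer (assumed available). Note that its default tokenizer lowercases the text and drops one-letter tokens such as "a", so the vocabulary differs slightly from the hand-built example above:
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It is a puppy and it is extremely cute",
        "The cat is cute"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # rows = documents, columns = vocabulary words

print(vectorizer.get_feature_names_out())   # the shared vocabulary
print(X.toarray())                          # word counts per document
```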
5) TF – IDF
When you use Term Frequency (TF) alone, it only tells you how often a word appears
in a single document, without considering how common or rare that word is in the
whole collection of documents (corpus).
For example:
Words like "the", "is", or "and" may appear very frequently in almost every
document.
So, their TF will always be high, but these words don't help to identify or
differentiate one document from another because they are common everywhere.
TF alone can't tell if a word is important or unique for a specific document, since
it doesn't look at the bigger picture of the entire corpus.
That's why TF is often combined with IDF (Inverse Document Frequency) to
create TF-IDF, which adjusts the weight of terms by considering both their frequency
in a document and their rarity in the entire corpus.
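A common form of the weighting (one of several variants) is TF-IDF(t, d) = TF(t, d) × log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents containing term t. Below is a minimal sketch with scikit-learn's TfidfVectorizer (assumed available; it uses a smoothed IDF and normalizes each row, so the exact numbers differ from the simple formula):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "data science is fun",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Words occurring in many documents (e.g., 'the', 'sat', 'on') get a lower IDF
# than words unique to one document (e.g., 'data', 'science').
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```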
6) Difference between Random Subsampling and Holdout Method.
Aspect | Holdout Method | Random Subsampling
Definition | Dataset is split once into training and testing sets. | Dataset is randomly split multiple times into training and testing sets.
Splitting | Single split (e.g., 70% train, 30% test). | Repeated random splits (e.g., 50 iterations).
Usage | Quick estimate of model performance. | More reliable estimate by averaging across multiple splits.
Bias and Variance | High variance (depends heavily on the chosen split). | Lower variance (averages over multiple splits).
Coverage of Data | Some data may never be used in training or testing. | Same issue, but minimized over multiple iterations.
Complexity | Simple and fast. | Slightly more computational effort due to repetitions.
Risk | Risk of biased estimate if the split is not representative. | Reduces risk of biased performance due to more splits.
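A minimal holdout sketch with scikit-learn (assumed available), using a single random 70/30 split, so the reported score depends heavily on that one split:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Single 70/30 split (holdout method)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout test accuracy:", model.score(X_test, y_test))
```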
7) Random Subsampling
1. ✅ Random Subsampling (Repeated Holdout)
What is it?
A method where you randomly split the data into training and testing sets
multiple times and average the performance.
Steps:
1. Randomly split the dataset (e.g., 70% train, 30% test).
2. Train the model on the train set.
3. Test it on the test set.
4. Repeat steps 1-3 multiple times (e.g., 10, 50 times).
5. Average the performance metrics.
Key Point:
o Splits are random and repeated.
o Some data points may never be selected for testing across the repeated splits, while others may be selected many times.
2. ✅ Cross Validation (General Concept)
What is it?
Cross-validation is a general term for techniques that split the dataset into
multiple parts to get a more reliable estimate of model performance.
Purpose:
To reduce bias and variance by ensuring that every data point gets a chance
to be in both training and testing sets.
Types of Cross Validation:
o K-Fold Cross Validation
o Leave-One-Out Cross Validation (LOOCV)
o Stratified K-Fold (for classification)
3. ✅ K-Fold Cross Validation (Most common form of Cross Validation)
What is it?
A systematic way of cross-validation where the data is divided into K equal
parts (folds).
Steps:
1. Split the data into K equal folds.
2. For each fold:
Use that fold as the test set.
Use the remaining K-1 folds as the training set.
3. Repeat this process K times, each time using a different fold as the test
set.
4. Average the performance metrics.
Key Point:
o Every data point is used exactly once as test data.
o More systematic and fair compared to random subsampling.
o Common choices of K are 5 or 10.
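A hedged sketch comparing the two approaches with scikit-learn (assumed available): ShuffleSplit performs repeated random 70/30 splits (random subsampling), while KFold guarantees every sample is used as test data exactly once:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Random subsampling: 10 independent random 70/30 splits, scores averaged
subsampling = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
print("Random subsampling accuracy:",
      cross_val_score(model, X, y, cv=subsampling).mean())

# 5-fold cross-validation: each sample appears in the test fold exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold CV accuracy:",
      cross_val_score(model, X, y, cv=kfold).mean())
```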
✅ Summary Table:
Method | Description | Splitting Type | Bias/Variance
Random Subsampling | Random split repeated multiple times. | Random splits, repeated. | Lower bias than holdout, but some data may never be used.
Cross Validation | General term for model validation by data splitting. | Depends on specific method. | Reduces bias and variance.
K-Fold Cross Validation | Data is divided into K parts; each part used as test once. | Systematic K equal splits. | Balanced, reliable, all data used.