AMLTA
CO-1
Topics:
1. Text Analysis
2. Applications and Challenges of Text Analysis
3. Introduction to Machine Learning Algorithms
4. Understanding ML Pipeline for Text
5. Introduction to preprocessing: Basics of Text Preprocessing
6. Tokenization: Tokenization Concepts and Tools
7. Word and sentence tokenization: Tokenization in NLTK and spaCy
8. Stemming - Introduction to Stemming & Lemmatization
9. Lemmatization Concepts and Comparison with Stemming
10. Cleaning and Normalizing & Text Normalization: Lowercasing, removing accents
11. Bag-of-Words model: Introduction to BoW Model & Python code
12. Feature extraction: Vectorization and Feature Representation
13. Text classification: Using BoW for Text Classification
14. Clustering with BoW: Text Clustering with BoW Features
15. TF-IDF weighting: Introduction to TF-IDF
1,2,3 – read from notes
4.
A machine learning (ML) pipeline for text processing is a structured sequence of steps that automate the
transformation of raw text data into a trained ML model ready for deployment. It typically includes stages such
as text preprocessing (tokenization, lowercasing, stop word removal, stemming/lemmatization), feature
extraction, model training, and evaluation. This step-by-step workflow ensures that text data is cleaned,
transformed into numerical features, and then used to train and validate a model, which can be deployed for
tasks like classification or sentiment analysis. The pipeline helps streamline, automate, and maintain
consistency throughout the ML lifecycle for text data.
Key steps in a typical text ML pipeline include:
Data Collection – Gather raw text (tweets, reviews, articles, etc.).
Text Preprocessing – Clean and normalize text:
Tokenization
Stop-word removal
Stemming/Lemmatization
Lowercasing, punctuation removal
Feature Extraction / Representation – Convert text to numerical form:
Bag of Words (BoW)
TF-IDF
Word Embeddings (Word2Vec, GloVe, BERT)
Model Training – Apply ML algorithms (Naive Bayes, SVM, Logistic Regression, Neural Networks).
Model Evaluation – Measure performance (Accuracy, Precision, Recall, F1-score).
Deployment – Integrate into real applications (chatbots, sentiment analysis systems).
Monitoring & Improvement – Update with new data for better accuracy.
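A minimal code sketch of such a pipeline, using scikit-learn's Pipeline class with TF-IDF features and Logistic Regression; the tiny training texts and labels below are invented purely for illustration.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Toy labeled data standing in for the "Data Collection" stage
train_texts = ["I love this movie", "Worst film ever", "Great acting", "Terrible plot"]
train_labels = ["pos", "neg", "pos", "neg"]
# Preprocessing + feature extraction + model chained into one object
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression()),
])
pipe.fit(train_texts, train_labels)        # model training
print(pipe.predict(["An amazing movie"]))  # prediction / deployment stage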
Text preprocessing is a fundamental step in natural language processing (NLP) that involves cleaning and
preparing raw text data so that it can be effectively analyzed and processed by computers. It is about
transforming messy, unstructured text into a cleaner and more uniform format to improve the accuracy and
performance of NLP models.
Basic Steps of Text Preprocessing
1. Lowercasing – Converting all text to lowercase for consistency.
o Example: "Hello World" → "hello world"
2. Tokenization – Splitting text into words or tokens.
o Example: "I love NLP" → ["I", "love", "NLP"]
3. Stop-word Removal – Removing common words that don’t add meaning.
o Example: words like "the, is, in, of"
4. Punctuation Removal – Removing symbols like , . ? !
5. Stemming – Reducing words to their root form (may not always be a real word).
o Example: "playing, played" → "play"
6. Lemmatization – Converting words to their meaningful root form (dictionary word).
o Example: "better" → "good"
7. Text Normalization – Handling numbers, abbreviations, special characters, etc.
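The following sketch chains the basic steps together with NLTK and the standard library; it assumes the punkt, stopwords and wordnet resources have already been downloaded, and the sample sentence is invented for illustration.
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
text = "The Children are Playing in the Gardens!"
tokens = word_tokenize(text.lower())                                  # lowercasing + tokenization
tokens = [t for t in tokens if t not in string.punctuation]           # punctuation removal
tokens = [t for t in tokens if t not in stopwords.words("english")]   # stop-word removal
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]                    # lemmatization
print(tokens)  # e.g. ['child', 'playing', 'garden']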
6,7
Tokenization is the process of breaking down text into smaller units called tokens, which are the smallest
meaningful elements for analysis in NLP. Tokens can be words, subwords, characters, or sentences.
Tokenization is a crucial first step in text preprocessing because it simplifies text analysis, standardizes input,
and facilitates feature extraction for various NLP tasks like classification and sentiment analysis.
Tools for Tokenization:
NLTK (Python) – simple and beginner-friendly.
spaCy – advanced, faster, with linguistic features.
Others – Hugging Face tokenizers, Scikit-learn.
Word and Sentence Tokenization in NLTK
NLTK provides different functions for word and sentence tokenization:
word_tokenize() splits text into individual words and punctuation marks.
sent_tokenize() breaks a piece of text into sentences.
Both functions handle punctuation and spaces effectively and are widely used for parsing large text
data for further analysis such as vectorization, stemming, and lemmatization. An example of sentence
tokenization in NLTK might be splitting "God is Great! I won a lottery." into the sentences ["God is
Great!", "I won a lottery."].
Word and Sentence Tokenization in spaCy
SpaCy offers an efficient, fast tokenizer that handles words, punctuation, and special characters. Its tokenizer
treats punctuation as separate tokens and adapts tokenization rules for different languages. It provides
tokenization as part of a pipeline that includes part-of-speech tagging and lemmatization automatically. For
sentences, spaCy can also be used for sentence segmentation to split texts into sentences based on language-
specific rules. Example tokenization of "I love natural language processing!" results in ["I", "love", "natural",
"language", "processing", "!"].
In sum, both NLTK and spaCy include powerful and widely used tools for word and sentence tokenization, with
spaCy offering more modern and faster processing plus integrated NLP pipeline components. NLTK is great for
traditional and flexible use cases, while spaCy is suited for efficient and large-scale processing.
Word and Sentence Tokenization in NLTK and spaCy
A) Using NLTK
Word Tokenization:
from nltk.tokenize import word_tokenize
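# Note: word_tokenize and sent_tokenize need the 'punkt' models (nltk.download("punkt"))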
text = "I love NLP."
print(word_tokenize(text))
# Output: ['I', 'love', 'NLP', '.']
Sentence Tokenization:
from nltk.tokenize import sent_tokenize
text = "I love NLP. It is fun."
print(sent_tokenize(text))
# Output: ['I love NLP.', 'It is fun.']
B) Using spaCy
Word & Sentence Tokenization:
import spacy
nlp = spacy.load("en_core_web_sm")
text = nlp("I love NLP. It is fun.")
# Word tokens
print([token.text for token in text])
# ['I', 'love', 'NLP', '.', 'It', 'is', 'fun', '.']
# Sentence tokens
print([sent.text for sent in text.sents])
# ['I love NLP.', 'It is fun.']
8.
Stemming is a heuristic process in NLP that reduces words to their root or base form by chopping off prefixes
or suffixes, often using simple rules. It does not consider the actual linguistic correctness of the root, which can
result in stems that are not valid words. For example, "running," "runner," and "runs" might all be reduced to
"run," but stemming might also produce non-words like "creat" from "created." It is a faster and simpler
process used for reducing word forms.
Lemmatization reduces words to their dictionary or base form (called a lemma) by analyzing the morphological
structure of words and considering the context, part of speech, and meaning. Unlike stemming, it always
returns valid words by linking different inflected forms like "running," "ran," and "runs" to their base lemma
"run." It involves a more sophisticated process using vocabulary, morphological analysis, and sometimes
machine learning models.
How it works: Uses vocabulary + grammar rules. Needs Part-of-Speech (POS) tags for accuracy.
Examples:
studies → study
better → good
Tools:
NLTK – WordNetLemmatizer
spaCy – built-in lemmatizer
Comparison: Stemming vs Lemmatization

Aspect     | Stemming                                  | Lemmatization
Definition | Cuts suffixes to get the root             | Uses dictionary + grammar to get the lemma
Output     | May not be a valid word (studies → studi) | Always a valid word (studies → study)
Accuracy   | Low (rule-based)                          | High (linguistic knowledge)
Speed      | Faster                                    | Slower
Use Cases  | Search engines, simple tasks              | Chatbots, ML/NLP tasks needing accuracy
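A short sketch contrasting the two in code, using NLTK's Porter stemmer and WordNetLemmatizer; the word list is just illustrative, and the lemmatizer assumes the wordnet resource has been downloaded.
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["studies", "running", "better"]
print([stemmer.stem(w) for w in words])          # ['studi', 'run', 'better']
print([lemmatizer.lemmatize(w) for w in words])  # ['study', 'running', 'better'] (noun POS by default)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' -- the adjective POS tag is needed here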
Cleaning and Normalizing
Cleaning: Removing unwanted parts from text (noise).
o Examples: punctuation, numbers, HTML tags, extra spaces, special
characters.
Normalization: Converting text into a standard, consistent format so algorithms
can process it easily.
o Example: “HELLO!!! Wørld??” → “hello world”
Text Normalization Techniques
1. Lowercasing
o Converts all text to lowercase.
o Example: “Python IS Fun” → “python is fun”
2. Removing Accents / Diacritics
o Converts accented characters to plain form.
o Example: “résumé” → “resume”
3. Removing Punctuation & Special Characters
o Example: “hello!!!” → “hello”
4. Handling Numbers
o Remove or replace numbers depending on use case.
5. Expanding Contractions
o Example: “don’t” → “do not”
6. Whitespace Normalization
o Remove extra spaces.
o Example: “I   like   NLP” → “I like NLP”
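A minimal cleaning and normalization sketch using only the Python standard library (unicodedata for accent stripping, regex for punctuation and whitespace); the sample string is invented for illustration.
import re
import unicodedata
def normalize(text):
    text = text.lower()                                    # lowercasing
    text = unicodedata.normalize("NFKD", text)             # decompose accented characters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accent marks
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # punctuation & special characters
    text = re.sub(r"\s+", " ", text).strip()               # whitespace normalization
    return text
print(normalize("Crème   BRÛLÉE!!!  Résumé??"))  # "creme brulee resume"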
11
The Bag-of-Words (BoW) model is a simple and widely used technique in natural
language processing (NLP) for converting text data into numerical features. It treats a
text document as an unordered collection (a "bag") of words and uses the frequency of
each word in the document as features for machine learning algorithms. The BoW
model ignores grammar and word order, focusing solely on the count or presence of
words.
How the BoW Model Works
1. Create a vocabulary (all unique words from the text).
2. Represent each document as a vector that counts how often each word appears.
Key Points:
Ignores grammar & word order.
Only focuses on word frequency.
Used for text classification, sentiment analysis, spam detection.
Example:
Sentences:
1. "I love NLP"
2. "I love Machine Learning"
Vocabulary = {I, love, NLP, Machine, Learning}
Sentence 1 → [1, 1, 1, 0, 0]
Sentence 2 → [1, 1, 0, 1, 1]
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
docs = [
    "I love NLP",
    "I love Machine Learning"
]

# Create BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary (note: the default tokenizer drops single-character tokens like "I")
print("Vocabulary:", vectorizer.get_feature_names_out())

# Document vectors
print("BoW Representation:\n", X.toarray())
12
Feature Extraction: Vectorization and Feature Representation in NLP
Feature extraction in natural language processing (NLP) is the process of transforming
raw text data into a numerical format understandable by machine learning models. This
step is crucial because ML algorithms require input data in fixed-length vectors, not raw
text. The extracted features represent essential characteristics or patterns from the text,
helping models learn and make predictions.
Common Vectorization Methods:
1. Bag-of-Words (BoW) – counts word frequencies.
2. TF-IDF (Term Frequency – Inverse Document Frequency) – considers word
importance (less weight for common words).
3. Word Embeddings (Word2Vec, GloVe, FastText) – represent words in dense
vectors capturing semantic meaning.
4. Sentence Embeddings (BERT, spaCy) – represent entire sentences/documents
with context.
3. Feature Representation
After vectorization, each document is represented as a feature vector.
Example:
Sentences: "I love NLP", "I love ML"
Vocabulary = {I, love, NLP, ML}
Sentence 1 → [1, 1, 1, 0]
Sentence 2 → [1, 1, 0, 1]
Here, numbers represent features (word presence/absence or frequency).
13 & 14
Text Classification Using Bag-of-Words (BoW)
The Bag-of-Words model is commonly used for text classification tasks where the goal is
to assign categories or labels to documents based on their word content. BoW transforms
each document into a vector of word frequencies (or presence/absence), which serves as
features for machine learning algorithms such as Naive Bayes, Logistic Regression, or
Support Vector Machines.
How it works in text classification:
Build a vocabulary of unique words from the training corpus.
Represent each document as a vector of word counts (or binary indicators).
Feed these vectors into a classifier.
The model learns to associate specific word patterns with categories.
Effective for spam detection, sentiment analysis, and topic categorization.
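A small sketch along these lines, pairing CountVectorizer with a Naive Bayes classifier in scikit-learn; the tiny spam/ham corpus below is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
texts = ["win a free prize now", "lowest price offer today", "meeting at 5 pm", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts)           # build vocabulary + BoW count vectors
clf = MultinomialNB()
clf.fit(X_train, labels)                            # learn word-pattern/category associations
X_new = vectorizer.transform(["free prize offer"])  # reuse the same vocabulary
print(clf.predict(X_new))                           # ['spam']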
Text Clustering with BoW Features
Clustering groups similar documents based on feature similarities without preassigned
labels. BoW vectors serve as input features for clustering algorithms like K-Means or
Hierarchical Clustering.
BoW-based clustering steps:
Vectorize documents with BoW.
Use distance/similarity metrics (e.g., cosine similarity) on BoW vectors.
Cluster documents by grouping vectors close in feature space.
Useful to discover natural groupings or topics in unlabeled text data.
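A minimal clustering sketch with scikit-learn's KMeans on BoW vectors (note that KMeans uses Euclidean distance rather than cosine similarity); the toy documents are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply",
    "stock markets rallied today",
]
X = CountVectorizer().fit_transform(docs)   # BoW feature vectors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))  # e.g. [0 0 1 1] -- cat documents vs. stock documents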
Summary
Task                | Using BoW
Text Classification | Vectorize text into BoW features, train classifiers for labeled categories
Text Clustering     | Vectorize text, compute similarity, apply clustering without labels
BoW is simple and effective for both classification and clustering, though limited by
ignoring word order and context, which impacts deeper semantic understanding.
15
TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique used in
natural language processing and information retrieval to evaluate the importance of a
word in a document relative to a collection of documents or corpus.
Components of TF-IDF:
1. Term Frequency (TF):
o Measures how often a word appears in a document.
o Formula: TF(t, d) = (count of term t in document d) / (total terms in document d)
2. Inverse Document Frequency (IDF):
o Measures how rare a word is across all documents.
o Formula: IDF(t) = log(N / (1 + df(t)))
where N = total documents, df(t) = number of documents containing term t.
3. TF-IDF Score:
o TF-IDF(t, d) = TF(t, d) × IDF(t)
3. Example
Corpus: ["I love NLP", "I love Machine Learning"]
Word "love": appears in both docs → low IDF → lower weight.
Word "NLP": appears in only one doc → higher IDF → more important.
CO-1
1.What are the different tasks involved in text analysis?
2. Explain the applications and challenges of text analysis.
3. What is machine learning? Give an overview of ML algorithms.
4. Differentiate between supervised and unsupervised learning with examples.
5. Explain the ML pipeline for text data processing.
6. What is text preprocessing? Why is it important?
7. Define tokenization. What are its types?
8. How does tokenization work in NLTK and spaCy?
9. What is stemming? Explain with an example.
10. Differentiate between stemming and lemmatization with suitable examples.
CO-2
1. What are stop words? Why do we remove them in text preprocessing?
2. Explain the process of removing punctuation in text data.
3. How are special characters handled during text cleaning?
4. What is text normalization? Why is lowercasing important?
5. Explain the steps involved in building a preprocessing pipeline.
6. Design a mini project to preprocess a real-world text dataset.
7. What is the Bag-of-Words model? Explain with an example.
8. How is BoW implemented in scikit-learn?
9. What is vectorization in NLP? Explain its importance.
10. How is BoW used for text classification tasks?
CO-1
1. Different tasks in text analysis
Text classification (e.g., spam vs. ham mails)
Sentiment analysis (positive, negative, neutral)
Information retrieval (search engines)
Topic modeling (discovering hidden topics)
Named Entity Recognition (NER – extracting names, places, etc.)
Text summarization
Machine translation
Question answering & chatbot systems
2. Applications and challenges of text analysis
Applications:
Customer feedback analysis
Fake news detection
Healthcare reports mining
Legal document analysis
Recommendation systems
Social media monitoring
Challenges:
Ambiguity of words (e.g., bank = river bank or financial bank)
Sarcasm and irony detection
Multilingual text handling
Large data volume
Noise (spelling mistakes, emojis, slang)
3. What is machine learning? Overview of ML algorithms
Machine Learning (ML): A branch of AI where systems learn patterns from data without explicit
programming.
Types of algorithms:
1. Supervised learning – labeled data → predict output
o Examples: Linear Regression, Decision Trees, SVM, Naïve Bayes
2. Unsupervised learning – unlabeled data → find hidden patterns
o Examples: K-means, PCA, Hierarchical Clustering
3. Reinforcement learning – agent learns by trial and error
o Example: Q-Learning, Deep Q-Networks
4. (Supervised vs. unsupervised learning: see the examples under answer 3 above.)
5.
ML Pipeline for Text Data Processing
Data collection (raw text)
Text preprocessing (tokenization, normalization, stop-word removal)
Feature extraction (e.g., BoW, TF-IDF)
Model selection and training
Model evaluation and validation
Prediction and deployment
6.
Text Preprocessing and Its Importance
Text preprocessing involves cleaning and formatting raw text (e.g., lowercasing, punctuation and stop word
removal, stemming/lemmatization). This step is crucial to reduce noise, enhance feature extraction, and
improve ML model accuracy.
7.Tokenization and Its Types
Tokenization is the process of splitting text into units (tokens).
Word tokenization (splitting into words)
Sentence tokenization (splitting into sentences)
Subword or character tokenization
8.Tokenization in NLTK and spaCy
NLTK: word_tokenize() and sent_tokenize() functions handle tokenization.
spaCy: Utilizes Doc objects for efficient tokenization using language-specific rules.
9.
What is Stemming? With Example
Stemming reduces words to their base or root form (e.g., "running" → "run"). For example, using Porter
Stemmer: "flies", "flying" → "fli".
10.
Stemming vs. Lemmatization (With Examples)
Stemming: Crude heuristic process. E.g., "better" → "bet"
Lemmatization: Considers vocabulary and context. E.g., "better" → "good", "running" → "run"
CO-2:
1.Stop Words and Their Removal
Stop words are common words (like 'is', 'and', 'the') that carry less meaning for analysis and are removed to
focus on informative words.
2.Removing Punctuation in Text Data
Removing punctuation involves stripping marks (.,!?—) using regex, string methods, or NLP libraries, helping
standardize the text for analysis.
3.Handling Special Characters
Special characters (e.g., @, #, $, %, emojis) are often removed or normalized unless relevant for tasks such as
sentiment analysis or social media mining.
4.Text Normalization and Lowercasing
Text normalization unifies variants in text (e.g., lowercasing, converting accented to unaccented, expanding
contractions) to reduce complexity and enhance matching.
5.Steps in Building a Preprocessing Pipeline
Normalize text (lowercasing)
Remove/preserve special characters as needed
Tokenize
Remove stop words
Stem or lemmatize
Clean extra spaces/punctuation
6.Mini-Project Idea: Real-World Text Preprocessing
Example: Download customer reviews (e.g., from Amazon or Yelp), preprocess them by cleaning,
normalizing, tokenizing, removing stopwords, and outputting the processed text for sentiment analysis.
7.Bag-of-Words (BoW) Model Explained
The BoW model converts text into numeric vectors by counting word occurrences, disregarding grammar and
order.
Example: For "the dog barked", "the cat meowed":
Word Doc1 Doc2
the 1 1
dog 1 0
barked 1 0
cat 0 1
meowed 0 1
8.BoW Implementation in scikit-learn
Using CountVectorizer to transform documents into BoW vectors, fit on training data, and convert new
documents for model input.
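A brief sketch of that fit/transform pattern, reusing the two example documents from the BoW table in answer 7; the new document is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
train_docs = ["the dog barked", "the cat meowed"]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)    # fit: learn vocabulary, then vectorize training docs
X_new = vectorizer.transform(["the dog meowed"])  # transform only: words outside the vocabulary are ignored
print(vectorizer.get_feature_names_out())  # ['barked' 'cat' 'dog' 'meowed' 'the']
print(X_new.toarray())                     # [[0 0 1 1 1]]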
9.What is Vectorization in NLP?
Vectorization converts text into numerical representations (vectors), enabling ML algorithms to process text.
10.Importance of BoW for Text Classification
BoW’s numeric features allow ML models to classify text (e.g., spam detection, sentiment analysis) using
word occurrence patterns as input features.