Text Classification
A pipeline for building a text classification system
● Text classification is a crucial task in several Natural Language Processing (NLP) applications such as news categorization, sentiment
analysis, and subject labelling.
● The goal is to tag or label textual components like sentences, questions, paragraphs, and
documents.
TRADITIONAL MACHINE LEARNING
● Raw text data serves as the starting point for all approaches
Step 2: Preprocessing
Step 3: Representation
● Manual feature extraction using methods like Bag of Words (BoW) and TF-IDF
● Creating numerical representations that traditional algorithms can understand
Step 4: Classification
Step 5: Labels
● Final classification output/predictions
https://www.datacamp.com/tutorial/text-classification-python
DEEP LEARNING WITHOUT PRE-TRAINING
● Same starting point with raw text data
Step 2: Preprocessing
● Similar text cleaning and preparation steps
Step 3: Representation
● Uses neural networks to automatically learn feature representations
● Word embeddings like Word2Vec and GloVe create dense vector representations
● No manual feature engineering required
Step 4: Classification
● Deep neural networks (CNNs, RNNs, LSTMs) perform classification
● End-to-end learning where representation and classification are learned together
Step 5: Labels

DEEP LEARNING WITH PRE-TRAINING
● Raw text input
Step 2: Representation and Classification (Combined)
● Uses pre-trained models like BERT, GPT, or other Transformer-based models
● These models have already learned rich representations from massive datasets
● Fine-tuning occurs on the specific task
● Both representation learning and classification happen simultaneously in the pre-trained architecture
● Final predictions leveraging the power of pre-trained knowledge
Rule-based techniques use a set of manually constructed linguistic rules to categorize text into categories or groups. These rules
tell the system to assign a text to a particular category based on its content, using semantically relevant textual elements. Each rule
consists of an antecedent (a pattern) and a predicted category.
For example, imagine you have tons of news articles, and your goal is to assign them to relevant categories such as Sports,
Politics, Economy, etc.
With a rule-based classification system, you would do a human review of a couple of documents to come up with linguistic rules like
this one:
● If the document contains words such as money, dollar, GDP, or inflation, it belongs to the Economy group (class).
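As a minimal sketch of this idea (the keyword lists and helper function below are illustrative, not from a real system), such rules can be written directly in Python:

# Minimal rule-based text classifier: each rule maps a set of trigger
# words (the antecedent/pattern) to a predicted category.
# Keyword lists here are illustrative examples only.
RULES = {
    "Economy": {"money", "dollar", "gdp", "inflation"},
    "Sports": {"match", "goal", "tournament", "coach"},
    "Politics": {"election", "parliament", "minister", "vote"},
}

def classify(text: str, default: str = "Other") -> str:
    words = set(text.lower().split())
    for category, keywords in RULES.items():
        if words & keywords:          # any trigger word present
            return category
    return default

print(classify("Inflation eroded the dollar this quarter"))        # -> Economy
print(classify("The coach praised the goal in the final match"))   # -> Sports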
Machine learning-based text classification is a supervised machine learning problem. It learns the mapping from input data (raw text) to the
labels (also known as target variables). This is similar to non-text classification problems, where we train a supervised classification algorithm
on a tabular dataset to predict a class, except that in text classification the input data is raw text instead of numeric features.
Like any other supervised machine learning task, text classification has two phases: training and prediction.
Training phase: A supervised machine learning algorithm is trained on the input-labeled dataset during the training phase. At the end of this
process, we get a trained model that we can use to obtain predictions (labels) on new and unseen data.
Prediction phase: Once a machine learning model is trained, it can be used to predict labels on new and unseen data.
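A minimal sketch of the two phases with scikit-learn, using a tiny made-up labeled dataset for illustration:

# Training phase: learn a mapping from raw text to labels.
# Prediction phase: apply the trained model to new, unseen text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled dataset (illustrative only)
texts = [
    "the team won the championship game",
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "inflation pushed consumer prices higher",
]
labels = ["Sports", "Sports", "Economy", "Economy"]

# Training phase: raw text -> TF-IDF features -> classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Prediction phase: label new, unseen text
print(model.predict(["the coach announced the starting lineup"]))
print(model.predict(["the dollar fell as GDP growth slowed"]))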
What is a Support Vector Machine (SVM)?
● A Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression. It finds the best
line (or hyperplane) to separate data into groups, maximizing the distance between the closest points (support vectors) of
each group.
● It can handle complex data by using kernels to transform it into higher dimensions. In short, SVM helps classify data
effectively. It is useful when you want to do binary classification, like spam vs. not spam or cat vs. dog.
● The main goal of SVM is to maximize the margin between the two classes. The larger the margin, the better the model performs on new
and unseen data.
Types of Support Vector Machine (SVM) Algorithms
● Linear SVM: When the data is perfectly linearly separable, we can use a Linear SVM. Perfectly linearly separable
means that the data points can be classified into 2 classes by using a single straight line (in 2D).
● Non-Linear SVM: When the data is not linearly separable, we can use a Non-Linear SVM. This happens when the data
points cannot be separated into two classes using a straight line (in 2D). In such cases, we use advanced techniques like
the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use
the kernel trick to solve them.
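A small sketch contrasting the two types with scikit-learn (the two-moons data is a synthetic stand-in for data that is not linearly separable):

# Linear SVM vs. non-linear (kernel) SVM on non-linearly-separable data;
# the RBF kernel trick handles the curved class boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))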
How does the Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best separates two classes by maximizing the margin
between them. This margin is the distance from the hyperplane to the nearest data points (support vectors) on each side.
[Figure: Multiple hyperplanes separating the data from two classes]
The best hyperplane, also known as the "hard margin", is the one that maximizes the distance between the hyperplane and the
nearest data points from both classes. This ensures a clear separation between the classes. So, from the figure above, we choose
L2 as the hard margin. Let's consider a scenario like the one shown below:
[Figure: Selecting a hyperplane for data with an outlier]
Here, we have one blue ball inside the boundary of the red balls.
How does SVM classify the data?
The blue ball inside the boundary of the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore this
outlier and still find the hyperplane that maximizes the margin; SVM is robust to outliers.
Objective Function = (1/margin) + λ ∑ penalty
The penalty used for violations is often the hinge loss, which has the following behavior:
● If a data point is correctly classified and outside the margin, there is no penalty (the loss is 0).
● If a data point is misclassified or falls inside the margin, the penalty grows linearly with the size of the violation.
[Figure: The hyperplane which is the most optimized one]
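A small numeric sketch of this hinge-loss behavior (labels encoded as +1/-1; the decision scores are made-up values):

import numpy as np

def hinge_loss(y_true, score):
    # No penalty when the point is correctly classified and outside the
    # margin (y * score >= 1); otherwise the penalty grows linearly with
    # the size of the violation.
    return np.maximum(0.0, 1.0 - y_true * score)

y = np.array([+1, +1, -1, -1])
scores = np.array([2.5, 0.4, -1.2, 0.3])   # made-up decision values
print(hinge_loss(y, scores))               # [0.  0.6 0.  1.3]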
Algorithm details:
1. Collect and pre-classify sentences
2. Convert sentences into numerical vectors (embeddings) using a pre-trained NLP model (like
BERT)
3. Calculate the average location (centroid) of the embeddings for each category
4. This average location represents the central point of each category based on the pre-classified
samples
5. Classify new sentences:
● New sentences are also converted into embeddings
● Measure each one's distance to the average locations (centroids) of the categories
● The new sentence is classified based on its proximity to these centroids
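A minimal sketch of this centroid-based algorithm, assuming the sentence-transformers package as the pre-trained encoder (the model name and toy sentences are illustrative choices, not prescribed by the slides):

# Nearest-centroid classification with sentence embeddings
# (requires the sentence-transformers package).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # one common pre-trained encoder

# Steps 1-2: pre-classified sentences and their embeddings (toy examples)
labeled = {
    "Sports": ["the team won the final", "he scored a hat-trick"],
    "Economy": ["inflation rose sharply", "the central bank cut rates"],
}

# Steps 3-4: centroid (average embedding) per category
centroids = {
    cat: model.encode(sents).mean(axis=0) for cat, sents in labeled.items()
}

# Step 5: classify a new sentence by its closest centroid
def classify(sentence: str) -> str:
    vec = model.encode([sentence])[0]
    return min(centroids, key=lambda c: np.linalg.norm(vec - centroids[c]))

print(classify("the striker missed a penalty"))   # expected: Sports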
Architecture Example
For a sentiment analysis system:
Step 2: Generate Embeddings using a pre-trained model like Word2Vec, GloVe, or BERT:
# Conceptual flow:
text → tokenize → Word2Vec embeddings → average/sum → classifier
2. BERT-based Classification
# Modern approach:
text → BERT tokenizer → BERT model → [CLS] token → classification head
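As a sketch of this BERT-based flow, using the Hugging Face transformers library (the checkpoint name is one common choice; the two-label classification head is randomly initialized here and only becomes meaningful after fine-tuning):

# tokenizer -> BERT -> [CLS] representation -> classification head
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("This pizza place has amazing food", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # scores from the [CLS]-based head
probs = torch.softmax(logits, dim=-1)
print(probs)   # probabilities are meaningless until the head is fine-tuned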
4. LSTM/RNN Approach
Key Advantages
● Semantic Understanding: Captures meaning beyond exact word matches
● Generalization: Works well on unseen text with similar meaning
● Transfer Learning: Pre-trained embeddings can be fine-tuned for specific tasks
● Efficiency: Dense representations are computationally efficient
Deep learning for text classification
Deep Learning is the next evolution and a subset of Machine Learning. It is a method of statistical learning that extracts features
or attributes from raw data. Deep Learning uses networks of algorithms called artificial neural networks, which imitate the
function of the neural networks in the human brain. Deep Learning passes the data through a network of layers (input, hidden,
and output) to extract features and learn from the data.
Applications of Deep Learning for Text Classification: Sentiment Analysis, News categorization, Spam detection, Document
categorization, and Customer service ticket routing.
1. Recurrent Neural Networks (RNN): Excellent for tasks requiring a strong sense of sequence and
context, such as sentiment analysis and time-series forecasting. Processes text sequentially, maintaining memory of
previous words through hidden states.
● Input Text → Embedding → LSTM/GRU Layers → Dense Layer → Output
● Real Example - News Category Classification
https://vasista.medium.com/deep-learning-vs-machine-learning-with-text-classification-162ea20a7924
https://www.mathworks.com/help/textanalytics/ug/classify-text-data-using-deep-learning.html
3. Transformer-based Models (e.g., BERT): State-of-the-art for their ability to capture global dependencies
and generate rich contextual word embeddings, leading to superior performance on many text classification tasks. Key strengths:
complex understanding, long-range dependencies, state-of-the-art performance.
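A minimal Keras sketch of the RNN/LSTM pipeline from item 1 above (Input Text → Embedding → LSTM → Dense → Output); the vocabulary size, sequence length, and dummy data are placeholders, not values from the slides:

# LSTM text classifier; assumes text is already converted to padded
# integer token IDs.
import numpy as np
import tensorflow as tf

vocab_size, max_len, num_classes = 10_000, 50, 4   # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data standing in for tokenized news articles and category labels
X = np.random.randint(1, vocab_size, size=(256, max_len))
y = np.random.randint(0, num_classes, size=(256,))
model.fit(X, y, epochs=1, batch_size=32)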
Interpretation Techniques
1. ATTENTION MECHANISMS
How it works: Shows which words/tokens the model focuses on when making predictions.
2. PERTURBATION-BASED METHODS
What it is: A technique that tests what happens when you remove or change words to see how much each word matters.
Simple Analogy: Like playing "What if?" - "What if I remove this word? What if I change that word?"
How it works: Perturbs input text locally and observes prediction changes to identify important features.
3. SHAP
What it is: A fair way to assign credit to each word for the final prediction, like splitting a bill fairly among friends.
Simple Analogy: If your team scores 100 points in a game, SHAP tells you exactly how many points each player contributed.
How it works: Uses game theory to compute the contribution of each feature to the prediction.
4. GRADIENT-BASED METHODS
What it is: Uses calculus to measure how sensitive the AI's decision is to tiny changes in each word.
Simple Analogy: Like checking how wobbly a table is - which leg, when pushed slightly, makes the table move the most?
5. FEATURE IMPORTANCE
Simple Analogy: After watching 1000 cooking shows, you notice that whenever chefs say "perfectly seasoned," the dish gets good reviews.
That's feature importance.
6. COUNTERFACTUAL EXPLANATIONS
What it is: Finding the smallest possible change to flip the AI's decision.
Simple Analogy: "What's the minimum I need to change to turn this bad review into a good review?"
Pizza Review Example
Review: "This pizza place has amazing food but terrible service" AI Decision: 60% POSITIVE (mixed sentiment)