
UNIT-IV: Text Classification

● A pipeline for building a text classification system
● One pipeline, many classifiers
● Support Vector Machines
● Neural embeddings for text classification
● Deep learning for text classification
● Interpreting text classification models

A pipeline for building a text classification system

● Text classification is a crucial and relevant task in several Natural Language Processing (NLP) applications like news
categorization, sentiment analysis, and subject labelling.
● The goal is to tag or label textual components like sentences, questions, paragraphs, and
documents.

TRADITIONAL MACHINE LEARNING

Step 1: Text Dataset

● Raw text data serves as the starting point for all approaches

Step 2: Preprocessing

● Text cleaning, tokenization, removing stop words, stemming/lemmatization


● Standardizing the text format for further processing

Step 3: Representation

● Manual feature extraction using methods like Bag of Words (BoW), TF-IDF
● Creating numerical representations that traditional algorithms can
understand

Step 4: Classification

● Using classical algorithms like Naive Bayes, SVM, Random Forest


● These algorithms work on the manually engineered features

Step 5: Labels

● Final classification output/predictions

https://www.datacamp.com/tutorial/text-classification-python
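The five steps above can be tied together in a few lines of scikit-learn. This is a minimal sketch, assuming scikit-learn is available; the tiny in-line dataset and the test sentence are purely illustrative.

# Minimal sketch of the traditional pipeline: dataset -> preprocessing ->
# TF-IDF representation -> Naive Bayes classifier -> predicted labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: raw text dataset (toy examples, not real data)
texts = [
    "Stocks rallied as inflation cooled",
    "The striker scored twice in the final",
    "Central bank raises interest rates again",
    "Local team wins the championship game",
]
labels = ["economy", "sports", "economy", "sports"]

# Steps 2-3: TfidfVectorizer handles tokenization, lowercasing and stop-word
# removal, and builds the TF-IDF feature matrix.
# Step 4: Multinomial Naive Bayes is trained on those engineered features.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

# Step 5: predicted labels for new, unseen documents
print(model.predict(["The central bank discussed inflation"]))  # expected: ['economy']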

DEEP LEARNING WITHOUT PRE-TRAINING

Step 1: Text Dataset

● Same starting point with raw text data

Step 2: Preprocessing

● Similar text cleaning and preparation steps

Step 3: Representation

● Uses neural networks to automatically learn feature representations
● Word embeddings like Word2Vec, GloVe create dense vector representations
● No manual feature engineering required

Step 4: Classification

● Deep neural networks (CNNs, RNNs, LSTMs) perform classification
● End-to-end learning where representation and classification are learned together

Step 5: Labels

● Classification results from the neural network

DEEP LEARNING WITH PRE-TRAINING

Step 1: Text Dataset

● Raw text input

Step 2: Representation and Classification (Combined)

● Uses pre-trained models like BERT, GPT, or other Transformer-based models
● These models have already learned rich representations from massive datasets
● Fine-tuning occurs on the specific task
● Both representation learning and classification happen simultaneously in the pre-trained architecture

Step 3: Labels

● Final predictions leveraging the power of pre-trained knowledge

Types of Text Classification Systems
Rule-based text classification

Rule-based techniques use a set of manually constructed language rules to categorize text into categories or groups. These rules
tell the system to classify text into a particular category based on its content, using semantically relevant textual elements.
Each rule consists of an antecedent (pattern) and a predicted category.

For example, imagine you have tons of news articles, and your goal is to assign them to relevant categories such as Sports,
Politics, Economy, etc.

With a rule-based classification system, you will do a human review of a couple of documents to come up with linguistic rules like
this one:

● If the document contains words such as money, dollar, GDP, or inflation, it belongs to the Economy group (class).

Machine learning-based text classification

Machine learning-based text classification is a supervised machine learning problem. It learns the mapping of input data (raw text) with the
labels (also known as target variables). This is similar to non-text classification problems where we train a supervised classification algorithm
on a tabular dataset to predict a class, with the exception that in text classification, the input data is raw text instead of numeric features.

Like any other supervised machine learning task, text classification has two phases: training and prediction.

Training phase: A supervised machine learning algorithm is trained on the input-labeled dataset during the training phase. At the end of this
process, we get a trained model that we can use to obtain predictions (labels) on new and unseen data.

Prediction phase: Once a machine learning model is trained, it can be used to predict labels on new and unseen data.
What is a Support Vector Machine (SVM)?

● A Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression. It finds the best
line (or hyperplane) to separate data into groups, maximizing the distance between the closest points (support vectors) of
each group.
● It can handle complex data using kernels to transform it into higher dimensions. In short, SVM helps classify data
effectively. It is useful when you want to do binary classification like spam vs. not spam or cat vs. dog.
● The main goal of SVM is to maximize the margin between the two classes. The larger the margin, the better the model performs on new
and unseen data.

Types of Support Vector Machine (SVM) Algorithms
● Linear SVM: When the data is perfectly linearly separable, we can use a Linear SVM. Perfectly linearly separable
means that the data points can be classified into two classes using a single straight line (in 2D).
● Non-Linear SVM: When the data is not linearly separable, we can use a Non-Linear SVM. This happens when the data
points cannot be separated into two classes using a straight line (in 2D). In such cases, we use advanced techniques like
the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use
the kernel trick to solve them.
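To illustrate the difference, here is a small scikit-learn sketch (an assumption of this note, not part of the slides) on a toy dataset that no straight line can separate.

# Minimal sketch: a linear SVM vs. an RBF-kernel SVM on concentric circles,
# a classic example of data that is not linearly separable (toy data only).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)  # no separating straight line exists
rbf_svm = SVC(kernel="rbf").fit(X, y)        # kernel trick lifts the data to a higher dimension

print("linear SVM accuracy:", linear_svm.score(X, y))  # well below 1.0
print("RBF SVM accuracy:   ", rbf_svm.score(X, y))     # close to 1.0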

How does the Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best separates two classes by maximizing the margin
between them. This margin is the distance from the hyperplane to the nearest data points (support vectors) on each side.

[Figure: Multiple hyperplanes separate the data from the two classes]

The best hyperplane, also known as the "hard margin", is the one that maximizes the
distance between the hyperplane and the nearest data points from both classes. This
ensures a clear separation between the classes. So from the figure above, we choose
L2 as the hard margin. Let's consider a scenario like the one shown below:

[Figure: Selecting hyperplane for data with outlier]

Here, we have one blue ball in the boundary of the red balls.
How does SVM classify the data?

The blue ball in the boundary of the red ones is an outlier of the blue balls. The SVM algorithm has the characteristic of
ignoring the outlier and finding the best hyperplane that maximizes the margin. SVM is robust to outliers.

[Figure: Hyperplane which is the most optimized one: linearly separable data that separates the group of blue balls and
red balls by a straight (linear) line]

A soft margin allows for some misclassifications or violations of the margin to improve generalization. The SVM optimizes
the following objective to balance margin maximization and penalty minimization:

Objective Function = (1/margin) + λ ∑ penalty

The penalty used for violations is often the hinge loss, which has the following behavior:

● If a data point is correctly classified and within the margin, there is no penalty (loss = 0).
● If a point is incorrectly classified or violates the margin, the hinge loss increases proportionally to the distance of the
violation.

https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/
Neural embeddings for text classification
● Neural embeddings for text classification convert words, sentences, or documents into numerical vectors that capture
their semantic meaning, which are then fed into a neural network to classify text.
● This process involves using embedding layers in networks like LSTMs or CNNs to learn these vector representations
from data, allowing words with similar meanings to have close vectors.
● Popular embedding models include Word2Vec, GloVe, and transformer-based approaches like BERT and Sentence
Transformers, which learn contextual embeddings for enhanced performance in text classification tasks like sentiment
analysis, topic categorization, and spam detection.

How Neural Embeddings Work


Neural embeddings map words, sentences, or documents to high-dimensional vectors (typically 50-768 dimensions) where
semantically similar texts are positioned closer together in the vector space. Unlike traditional bag-of-words approaches,
embeddings capture contextual relationships and semantic similarity.

Algorithm details:
1. Collect and pre-classify sentences.
2. Convert each sentence into a numerical vector (embedding) using a pre-trained NLP model (like
BERT).
3. Calculate the average location (centroid) of the embeddings for each category.
4. This average location represents the central point of each category based on the pre-classified
samples.
5. Classify new sentences:
● New sentences are also converted into embeddings.
● Measure each one's distance to the average locations (centroids) of the categories.
● The new sentence is classified based on its proximity to these centroids.
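A minimal sketch of this centroid algorithm follows, assuming the sentence-transformers package; the model name and the handful of labelled sentences are illustrative choices, not requirements.

# Minimal centroid classifier: embed labelled sentences, average them per
# category, then assign new sentences to the nearest centroid.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common pre-trained encoder

# Steps 1-2: pre-classified sentences converted to embeddings
train = {
    "positive": ["I loved this movie", "Great acting and a wonderful story"],
    "negative": ["Terrible plot and boring characters", "A complete waste of time"],
}
# Steps 3-4: the centroid (average embedding) is the central point of each category
centroids = {label: model.encode(sents).mean(axis=0) for label, sents in train.items()}

# Step 5: embed a new sentence and pick the closest centroid
def classify(sentence: str) -> str:
    vec = model.encode(sentence)
    return min(centroids, key=lambda label: np.linalg.norm(vec - centroids[label]))

print(classify("What a fantastic film"))  # expected: 'positive'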

Architecture Example
For a sentiment analysis system:

1. Input Layer: Raw text


2. Embedding Layer: Convert words to 300-dim vectors
3. Feature Extraction: CNN/LSTM/Transformer layers
4. Classification Layer: Dense layer with softmax
5. Output: Probability distribution over classes
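A minimal Keras sketch of this five-step architecture follows (an illustrative assumption; the vocabulary size, sequence length and class count are made up here).

# Minimal Keras model mirroring the layers above:
# embedding -> LSTM feature extraction -> dense softmax classification.
import tensorflow as tf

vocab_size, seq_len, num_classes = 20000, 100, 3  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),                   # token-id sequences
    tf.keras.layers.Embedding(vocab_size, 300),                # words -> 300-dim vectors
    tf.keras.layers.LSTM(128),                                 # feature extraction
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # probability distribution over classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()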
Real-Time Example: Email Spam Classification
Let's walk through a practical example of classifying emails as "spam" or "legitimate":

Step 1: Text Preprocessing

Email 1: "Congratulations! You've won $1000000! Click here now!"

Email 2: "Meeting scheduled for tomorrow at 3 PM in conference room"

Step 2: Generate Embeddings
Using a pre-trained model like Word2Vec, GloVe, or BERT:

● Email 1 → [0.2, -0.8, 0.5, ..., 0.1] (768-dimensional vector)


● Email 2 → [0.7, 0.3, -0.2, ..., 0.9] (768-dimensional vector)

Step 3: Classification
A neural network takes these embeddings and predicts:

● Email 1: 95% spam, 5% legitimate


● Email 2: 2% spam, 98% legitimate
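Tied together in code, the walkthrough above might look like this minimal sketch, assuming the sentence-transformers and scikit-learn packages; with only two training emails the probabilities are purely illustrative.

# Minimal sketch of steps 1-3: embed the emails with a pre-trained encoder and
# put a small probabilistic classifier on top of the embedding vectors.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

emails = [
    "Congratulations! You've won $1000000! Click here now!",
    "Meeting scheduled for tomorrow at 3 PM in conference room",
]
labels = ["spam", "legitimate"]

X = encoder.encode(emails)                      # Step 2: dense embedding vectors
clf = LogisticRegression().fit(X, labels)       # Step 3: classifier over embeddings

new_email = encoder.encode(["You have been selected to receive a free prize"])
probs = clf.predict_proba(new_email)[0]
print(dict(zip(clf.classes_, probs.round(2))))  # e.g. {'legitimate': 0.3, 'spam': 0.7}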

Common Algorithms and Approaches


1. Word2Vec + Neural Network

# Conceptual flow:
text → tokenize → Word2Vec embeddings → average/sum → classifier
2. BERT-based Classification (a minimal sketch follows after approach 4)
# Modern approach:
text → BERT tokenizer → BERT model → [CLS] token → classification head

3. CNN for Text Classification

● Embeddings → Convolutional layers → Max pooling → Dense layers → Output

4. LSTM/RNN Approach

● Sequential processing of word embeddings through recurrent layers
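As one concrete illustration of approach 2, here is a sketch using the Hugging Face transformers library (an assumed dependency; the checkpoint name is a common public one, and the classification head is untrained until fine-tuning).

# Minimal sketch of approach 2: tokenizer -> BERT encoder -> classification head
# over the [CLS] representation -> class probabilities.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Meeting scheduled for tomorrow at 3 PM", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # head applied to the [CLS] token's representation

probs = torch.softmax(logits, dim=-1)
print(probs)  # meaningful only after fine-tuning the head on labelled data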

Key Advantages
● Semantic Understanding: Captures meaning beyond exact word matches
● Generalization: Works well on unseen text with similar meaning
● Transfer Learning: Pre-trained embeddings can be fine-tuned for specific tasks
● Efficiency: Dense representations are computationally efficient

Deep learning for text classification
Deep Learning is a subset of Machine Learning and the next step in its evolution. It is a method of statistical learning that extracts features
or attributes from raw data. Deep Learning uses networks of algorithms called artificial neural networks, which imitate the
function of the neural networks in the human brain. Deep Learning passes the data through a network of layers (input, hidden
and output) to extract features and learn from the data.

Applications of Deep Learning for Text Classification: Sentiment Analysis, News categorization, Spam detection, Document
categorization, and Customer service ticket routing.

The fundamental neural network architectures are:


1. Convolutional Neural Networks (CNN): Good at identifying key phrases and spatial patterns. Best for short
texts and document classification, when local patterns matter.
● Input Text → Embedding Layer → Conv1D Layers → Max Pooling → Dense Layer → Output
● Real Example - Movie Review Sentiment (a minimal Conv1D sketch follows after architecture 3)

2. Recurrent Neural Networks (RNN): Excellent for tasks requiring a strong sense of sequence and
context, such as sentiment analysis and time-series forecasting. Processes text sequentially, maintaining memory of
previous words through hidden states.
● Input Text → Embedding → LSTM/GRU Layers → Dense Layer → Output
● Real Example - News Category Classification
https://vasista.medium.com/deep-learning-vs-machine-learning-with-text-classification-162ea20a7924 15
https://www.mathworks.com/help/textanalytics/ug/classify-text-data-using-deep-learning.html
3. Transformer-based Models (e.g., BERT): State-of-the-art for their ability to capture global dependencies
and generate rich contextual word embeddings, leading to superior performance on many text classification tasks. Best for
complex understanding, long-range dependencies, and state-of-the-art performance.

● Input Text → Tokenization → Transformer Encoder → Classification Head → Output
● Real Example - Question Classification
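As referenced under architecture 1, here is a minimal Keras sketch of the CNN flow (Embedding → Conv1D → Max Pooling → Dense → Output); the vocabulary size, sequence length, filter settings and class count are illustrative assumptions.

# Minimal Keras sketch of the CNN flow for text classification.
import tensorflow as tf

vocab_size, seq_len, num_classes = 20000, 200, 5  # assumed sizes

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, 128),                # embedding layer
    tf.keras.layers.Conv1D(64, 5, activation="relu"),          # detects local n-gram patterns
    tf.keras.layers.GlobalMaxPooling1D(),                      # max pooling over the sequence
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # one probability per category
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])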

Interpreting text classification models


● Interpreting text classification models involves understanding why a model makes a particular classification decision.
● This is crucial for building trust in the model, debugging errors, identifying biases, ensuring fairness, and improving
performance.

Interpretation Techniques

1. Attention Mechanisms

How it works: Shows which words/tokens the model focuses on when making predictions.

Real Example - Sentiment Analysis:

● Input: "The movie was absolutely terrible and boring"


● High attention on: "terrible" (0.8), "boring" (0.7)
● Low attention on: "The", "was", "and"

https://www.kaggle.com/code/eliotbarr/text-classification-using-neural-networks
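A minimal sketch of pulling attention weights out of a model with the Hugging Face transformers library follows (an assumed dependency); real attention values depend on the model, so the numbers above are only illustrative.

# Minimal sketch: request attention weights and look at how much the [CLS]
# token attends to each word in the last layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The movie was absolutely terrible and boring", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len)
last_layer = outputs.attentions[-1][0].mean(dim=0)  # average over attention heads
cls_to_tokens = last_layer[0]                        # attention from the [CLS] position

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in zip(tokens, cls_to_tokens):
    print(f"{token:>12}  {weight.item():.3f}")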
2. LIME (Local Interpretable Model-agnostic Explanations)

How it works: Perturbs input text locally and observes prediction changes to identify important features.

What it is: A technique that tests what happens when you remove or change words to see how much each word matters.

Simple Analogy: Like playing "What if?" - "What if I remove this word? What if I change that word?"

How it works:

1. Take original text.


2. Create many versions by removing/changing words randomly.
3. See how predictions change.
4. Find which words cause the biggest change.
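A minimal sketch of those four steps with the lime package follows (an assumed dependency); the tiny classifier trained here exists only to have something to explain.

# Minimal LIME sketch: train a small spam classifier, then ask LIME which
# words in one message pushed the prediction most.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now",
    "claim your reward money today",
    "meeting agenda attached",
    "please review the quarterly report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# LIME creates many perturbed copies (words removed at random), queries the
# model on each, and fits a simple local model to weight each word.
explainer = LimeTextExplainer(class_names=["legitimate", "spam"])
explanation = explainer.explain_instance(
    "you won a free prize, claim your money now",
    model.predict_proba,  # must map a list of texts to class probabilities
    num_features=4,
)
print(explanation.as_list())  # (word, weight) pairs; largest weights matter most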

3. SHAP (SHapley Additive exPlanations)

How it works: Uses game theory to compute the contribution of each feature to the prediction.

What it is: A fair way to assign credit to each word for the final prediction, like splitting a bill fairly among friends.

Simple Analogy: If your team scores 100 points in a game, SHAP tells you exactly how many points each player contributed.

How it works: Uses math to fairly calculate each word's contribution to the final score.
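A minimal sketch using the shap package and its built-in support for transformers text pipelines follows (both are assumed dependencies; the checkpoint is a public sentiment model used only for illustration).

# Minimal SHAP sketch: Shapley values assign each token a fair share of the
# credit (positive or negative) for the predicted sentiment score.
import shap
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    return_all_scores=True,  # SHAP needs a score for every class
)

explainer = shap.Explainer(classifier)
shap_values = explainer(["This pizza place has amazing food but terrible service"])

# Per-token contributions toward each class; positive values push that class up.
print(shap_values[0].values)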

4. Gradient-Based Methods
What it is: Uses calculus to measure how sensitive the AI's decision is to tiny changes in each word.

Simple Analogy: Like checking how wobbly a table is - which leg, when pushed slightly, makes the table move the most?
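A minimal gradient-saliency sketch with PyTorch and transformers follows (assumed dependencies): take the gradient of the top class score with respect to each token's embedding and use its size as a sensitivity score. This is just one simple variant of the gradient-based family.

# Minimal saliency sketch: how sensitive is the predicted score to tiny
# changes in each token's embedding?
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model.eval()

inputs = tokenizer("The movie was absolutely terrible and boring", return_tensors="pt")

# Look up the token embeddings and make them a leaf tensor so gradients land on them.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"]).logits
logits[0, logits[0].argmax()].backward()  # gradient of the winning class score

saliency = embeds.grad[0].norm(dim=-1)    # one sensitivity score per token
for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), saliency):
    print(f"{token:>12}  {score.item():.4f}")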

5. Feature Importance Analysis


What it is: Finding which words or patterns are generally most important across many examples.

Simple Analogy: After watching 1000 cooking shows, you notice that whenever chefs say "perfectly seasoned," the dish gets good reviews.
That's feature importance.

6. Counterfactual Explanations
What it is: Finding the smallest possible change to flip the AI's decision.

Simple Analogy: "What's the minimum I need to change to turn this bad review into a good review?"

Pizza Review Example
Review: "This pizza place has amazing food but terrible service" AI Decision: 60% POSITIVE (mixed sentiment)

Different interpretation methods tell us:

● Attention: AI focused 70% on "amazing", 60% on "terrible"


● LIME: Removing "amazing" drops to 20% positive, removing "terrible" raises to 85% positive
● SHAP: "amazing" contributes +40%, "terrible" contributes -25%
● Counterfactual: Change "terrible" to "excellent" → 90% POSITIVE
● Gradient: AI is most sensitive to changes in "amazing" and "terrible"
