Identifying and Categorizing Offensive Language in
Social Media
IDIR Imane
Faculty of Sciences
Department of Computer Science
MASTER THESIS
Supervised by: Dr. BESSOU Sadik
October 6, 2020
Table of Contents
1 Introduction
2 Offensive Language
3 Project development
Data collection
Data Preprocessing
Feature Extraction
Classification
4 Results
5 Conclusion and Future Works
Introduction
The proliferation of online platforms for user-generated content enables more
people to experience freedom of expression than ever before, which also increases
the chance of offensive language being used.
IDIR Imane Offensive Language Detection October, 6 2020 3 / 24
So, what is offensive language?
Offensive Language
Offensive language is the use of unacceptable and aggressive expression (oral or
written) against an individual or a group, whether explicit (direct attacks such as
threatening, cursing, using dirty or obscene words, etc.) or implicit. It can occur
even when humor is used.
Offensive language filtering
Social media platforms face legal action for not properly handling offensive
content online. That is why many online platforms employ moderators who identify
offensive content.
Arabic offensive language detection
The automatic detection of Arabic offensive language is a complex task with
multiple challenges, including:
1 The ambiguity and informality of the written format of the text.
2 Arabic has multiple dialects with diverse vocabularies and structures, which
makes it harder to obtain high classification accuracy.
So, how can we automate the process of detecting offensive content in
social media?
Project development
Machine learning approach
Project development Data collection
Data Annotation
1 To build and test any classifier model, we need annotated data!
2 Annotating the collected data was a difficult task. The comments are mixed
and very ambiguous; it is hard to understand what the user wants to express
and to distinguish between the types of offensive language.
Data statistics
Category        Number of comments   Number of tokens
Offensive       4,638                55,879
Non-Offensive   3,933                44,726
Total           8,571                100,605
Project development Data Preprocessing
Data Preprocessing
The aim of pre-processing is to reduce dimensionality and to clean the data of
noisy and non-meaningful words.
Tokenization.
Cleaning.
Stop-word removal.
Normalization.
Stemming.
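The pre-processing steps listed above could be chained roughly as follows. This is only an illustrative sketch, not the implementation used in the thesis: the stop-word list, the normalization rules (unifying alef variants, mapping ى to ي and ة to ه) and the prefix/suffix light stemmer are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list would be far larger.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "و"}

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # short vowels, shadda, sukun, tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A]")       # anything that is not an Arabic letter

# Hypothetical affix lists for a very light stemmer.
PREFIXES = ("وال", "بال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها")


def clean_token(token):
    """Cleaning: drop URLs, mentions and hashtags; strip diacritics and symbols."""
    if token.startswith(("http://", "https://", "@", "#")):
        return ""
    token = DIACRITICS.sub("", token)
    return NON_ARABIC.sub("", token)


def normalize(token):
    """Normalization: map letter variants to a single form."""
    token = re.sub("[إأآ]", "ا", token)
    return token.replace("ى", "ي").replace("ة", "ه")


def light_stem(token):
    """Stemming: strip at most one prefix and one suffix, keeping >= 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text):
    tokens = text.split()                                # tokenization (whitespace)
    tokens = [clean_token(t) for t in tokens]            # cleaning
    tokens = [t for t in tokens if t]                    # drop emptied tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    tokens = [normalize(t) for t in tokens]              # normalization
    return [light_stem(t) for t in tokens]               # stemming


example = preprocess("ذهبتُ إلى المدرسة في الصباح! https://example.com")
```

Stop words are removed before normalization here so that the list can be written in standard spelling; with a normalized stop-word list the two steps could be swapped.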
Project development Feature Extraction
Machines cannot understand raw text. Therefore, we need to convert our text
into numbers. Different approaches exist to convert text into a corresponding
numerical form.
How can we transform the data into numbers?
Bag of words
Each word is used as a feature for training the classifier. It is known as a “bag” of
words because the method ignores the order of the words: all that matters is which
words of the vocabulary are present in the text (in the binary variant) or how often
each one occurs (in the count-based variant).
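A minimal sketch of a bag-of-words encoding, assuming whitespace tokenization and invented English placeholder sentences. The helper names `build_vocabulary` and `bag_of_words` are hypothetical; `binary=True` gives the presence-only variant, while the default keeps word counts.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: index for index, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary, binary=False):
    """Encode one document as a vector with one entry per vocabulary word."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[tok]] = 1 if binary else n
    return vector


docs = ["good morning friend", "you are a bad bad friend"]
vocab = build_vocabulary(docs)  # a, are, bad, friend, good, morning, you
count_vectors = [bag_of_words(d, vocab) for d in docs]
binary_vectors = [bag_of_words(d, vocab, binary=True) for d in docs]
```

Word order is lost in both variants: "bad friend" and "friend bad" produce the same vector.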
TF-IDF
TF-IDF (term frequency-inverse document frequency) is a statistical measure used to
evaluate how important a word is to a document in a collection or corpus.
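As a rough illustration, the sketch below computes one common TF-IDF variant, tf(t, d) * log(N / df(t)), where tf is the relative frequency of term t in document d, N is the number of documents and df(t) is the number of documents containing t. The exact weighting used in the thesis is not stated, so this formula, the function name `tf_idf` and the placeholder sentences are assumptions; libraries often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per (non-empty) document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            tok: (n / len(toks)) * math.log(n_docs / df[tok])
            for tok, n in counts.items()
        })
    return weights


docs = ["you are bad", "you are kind", "you bad bad"]
weights = tf_idf(docs)
```

A word that occurs in every document ("you" here) gets weight zero, while a word that is frequent in one document but rare in the collection gets a high weight.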
Features
The n-gram is a contiguous sequence of n items (words, letters or symbols) from a
given sample of text.
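A short sketch of n-gram extraction at the word and character level. The helper names are hypothetical, and `ngram_range(tokens, 1, 2)` is meant to mirror the idea of combining unigrams and bigrams in one feature set; whether it matches the exact feature configuration used in the experiments is an assumption.

```python
def ngrams(items, n):
    """All contiguous runs of n items (works on token lists and on strings)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_range(tokens, min_n, max_n):
    """Combine all word n-grams for n in [min_n, max_n] into one feature list."""
    features = []
    for n in range(min_n, max_n + 1):
        features.extend(" ".join(gram) for gram in ngrams(tokens, n))
    return features


tokens = "you are not kind".split()
word_bigrams = ngrams(tokens, 2)
char_trigrams = ["".join(gram) for gram in ngrams("kind", 3)]
unigrams_and_bigrams = ngram_range(tokens, 1, 2)
```

Bigrams such as "not kind" keep some local word order that single words lose.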
How can we select the best combination of n-gram features?
Project development Classification
Classification Algorithms
Machine learning algorithms:
1 Logistic Regression.
2 Support Vector Machine.
3 Multinomial Naive Bayes.
4 Bernoulli Naive Bayes.
5 Random Forests.
6 Stochastic Gradient Descent.
Each algorithm goes through these steps to build the classifier:
1 Splitting the dataset.
2 Training the classifier.
3 Testing the classifier.
4 Making predictions.
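The split, train, test and predict steps could look roughly like the sketch below. It uses a small hand-written Multinomial Naive Bayes (one of the listed algorithms) with Laplace smoothing so that the example has no dependencies. The class, the function names and the English toy comments are invented for illustration and are not the thesis code or data.

```python
import math
import random
from collections import Counter, defaultdict


def train_test_split(samples, test_ratio=0.25, seed=0):
    """Step 1: shuffle the (text, label) pairs and split them."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, samples):
        """Step 2: count documents per class and words per class."""
        self.class_counts = Counter(label for _, label in samples)
        self.word_counts = defaultdict(Counter)
        for text, label in samples:
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total = len(samples)
        return self

    def predict(self, text):
        """Step 4: return the class with the highest log-probability."""
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / self.total)
            for word in text.split():
                if word in self.vocab:  # unseen words are skipped
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, samples):
    """Step 3: share of held-out samples classified correctly."""
    hits = sum(model.predict(text) == label for text, label in samples)
    return hits / len(samples)


# Invented toy data, for illustration only.
data = [
    ("you are stupid", "offensive"), ("shut up idiot", "offensive"),
    ("stupid idiot go away", "offensive"), ("what an idiot", "offensive"),
    ("have a nice day", "non-offensive"), ("thank you my friend", "non-offensive"),
    ("nice work friend", "non-offensive"), ("you are kind", "non-offensive"),
]
train, test = train_test_split(data)
model = MultinomialNB().fit(train)
test_accuracy = accuracy(model, test)
prediction = model.predict("you idiot")
```

With so few samples the measured accuracy is meaningless; the point is only the order of the four steps, which is the same for every classifier in the list.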
Results
Task Description
Our project is broken down into the following two subtasks:
1 Sub-task A: Identifying and categorizing offensive language in one step (the
non-offensive category is used together with the offensive categories).
2 Sub-task B: Identifying and categorizing offensive language in two steps. In
the first step, we identify offensive content, i.e. whether a comment is offensive
or non-offensive. In the second step, we categorize the type of offensive
content.
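The difference between the two sub-tasks can be sketched as two ways of calling classifiers. The keyword-based stand-ins below only exist to make the example runnable, and the category names "insult" and "threat" are hypothetical placeholders, since the actual offensive categories are not listed here.

```python
from typing import Callable

Classifier = Callable[[str], str]


def one_step(comment: str, multiclass: Classifier) -> str:
    """Sub-task A: one model chooses among non-offensive and all offensive categories."""
    return multiclass(comment)


def two_step(comment: str, detector: Classifier, categorizer: Classifier) -> str:
    """Sub-task B: detect first, then categorize only the offensive comments."""
    if detector(comment) == "non-offensive":
        return "non-offensive"
    return categorizer(comment)


# Stand-in models (keyword rules); real ones would be trained classifiers.
def toy_detector(comment: str) -> str:
    offensive_words = {"idiot", "hurt"}
    return "offensive" if offensive_words & set(comment.split()) else "non-offensive"


def toy_categorizer(comment: str) -> str:
    return "threat" if "hurt" in comment.split() else "insult"


def toy_multiclass(comment: str) -> str:
    return two_step(comment, toy_detector, toy_categorizer)


comments = ["have a nice day", "you idiot", "i will hurt you"]
labels_b = [two_step(c, toy_detector, toy_categorizer) for c in comments]
labels_a = [one_step(c, toy_multiclass) for c in comments]
```

In the two-step setup the second classifier is trained and evaluated only on offensive comments, so each step can use its own features and algorithm.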
Sub-Task A
Comparison of accuracy using different models with different features in subtask A.
The best result was achieved by SGDClassifier (84.78%) with TF-IDF and (1,2) n-grams.
Sub-Task B
Comparison of accuracy using different models with different features in subtask B
(first step).
The best result was achieved by LinearSVC (89.21%) with TF-IDF and (1,2) n-grams.
Comparison of accuracy using different models with different features in subtask B
(Second step).
The best result was achieved by LinearSVC (84.48%) with TF-IDF and (1,1) n-grams.
Conclusion and Future Works
Conclusion
1 Support vector machines and stochastic gradient descent are the best
classifiers for our dataset.
2 The use of grid search has a significant impact.
3 Unigrams and the combination of unigrams and bigrams are the best features.
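Grid search, mentioned in the conclusions above, simply tries every combination of candidate hyperparameter values and keeps the best-scoring one. The sketch below shows the idea in plain Python. The parameter names `ngram_max` and `alpha` and the scoring function are made up for illustration; in practice the score would come from training and validating a classifier with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(ngram_max, alpha):
    # Made-up stand-in for "train with these parameters and validate";
    # it is constructed to peak at ngram_max=2, alpha=0.1.
    bonus = 0.03 if ngram_max == 2 else 0.0
    return 0.80 + bonus - 0.05 * abs(alpha - 0.1)


grid = {"ngram_max": [1, 2, 3], "alpha": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths (9 evaluations here), which is why grids are usually kept small.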
Testing classifiers ....
Future Works
The following ideas could be tested:
Increasing the size of our dataset with comments from different domains and
platforms, in order to classify more precisely.
Regarding the annotation procedure for our corpus, we would like to have our
dataset verified by experienced annotators so that they can confirm its
validity.
Applying a spelling corrector to remove typos, because users' comments contain
a lot of spelling mistakes.
Applying deep learning algorithms.
Thank You