Identifying and Categorizing Offensive Language in Social Media

IDIR Imane

Faculty of Sciences
Department of Computer Science

MASTER THESIS
Supervised by: Dr. BESSOU Sadik

October 6, 2020

Table of Contents

1 Introduction

2 Offensive Language

3 Project development
Data collection
Data Preprocessing
Feature Extraction
Classification

4 Results

5 Conclusion and Future Works


Introduction

Introduction

The proliferation of online platforms for user-generated content enables more
people to experience freedom of expression than ever before, which also increases
the chance that offensive language is used.

So, what is offensive language?

IDIR Imane Offensive Language Detection October, 6 2020 3 / 24


Offensive Language

Offensive Language
Offensive language is the use of unacceptable and aggressive expression (oral or
written) against an individual or a group, whether explicit (direct attacks such as
threatening, cursing, or using dirty or obscene words) or implicit. It can occur
even when humor is used.

Offensive Language

Offensive language filtering

Social media platforms face legal action for not properly dealing with offensive
content online. That is why many online platforms employ moderators who identify
offensive content.

Offensive Language

Arabic offensive language detection

The automatic detection of Arabic offensive language is a complex task with
multiple challenges, including:

1 The ambiguity and informality of the written text.

2 Arabic has multiple dialects with diverse vocabularies and structures, which
makes high classification accuracy harder to achieve.

So, how can we automate the process of detecting offensive content in social media?

Project development

Machine learning approach

Project development Data collection

Data Annotation

1 To build and test any classifier model, we need annotated data!
2 Annotating the collected data was a difficult task. The comments are mixed
and very ambiguous: it is hard to understand what the user wants to express
and to distinguish between the types of offensive language.

Project development Data collection

Data statistics

Category        Number of comments   Number of tokens
Offensive       4,638                55,879
Non-Offensive   3,933                44,726
Total           8,571                100,605

Project development Data Preprocessing

Data Preprocessing
The aim of preprocessing is to reduce dimensionality and to clean noisy and
non-meaningful words out of the data. It proceeds in five steps:

Tokenization.
Cleaning.
Remove stop words.
Normalization.
Stemming.
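
The preprocessing steps listed above can be sketched as a small Python pipeline. This is only an illustration, not the thesis implementation: the stop-word list, the character ranges used for cleaning, the normalization rules (unifying alef variants, ta marbuta and alef maqsura) and the light stemmer's prefix and suffix lists are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list would be much larger.
STOP_WORDS = {"في", "من", "على", "و"}

# Assumed cleaning rule: keep only Arabic letters (U+0621..U+064A).
NON_ARABIC = re.compile(r"[^\u0621-\u064A]")
# Diacritics (harakat) and the tatweel elongation character.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(token):
    """Unify spelling variants of the same letter (assumed rules)."""
    token = DIACRITICS.sub("", token)
    token = re.sub("[إأآ]", "ا", token)  # alef variants -> bare alef
    return token.replace("ى", "ي").replace("ة", "ه")


def light_stem(token):
    """Strip one common prefix and one common suffix (assumed lists)."""
    if token.startswith("ال") and len(token) > 4:
        token = token[2:]
    for suffix in ("ات", "ون", "ين"):
        if token.endswith(suffix) and len(token) > 4:
            return token[:-2]
    return token


def preprocess(comment):
    tokens = comment.split()                                   # tokenization
    tokens = [NON_ARABIC.sub("", t) for t in tokens]           # cleaning
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # stop words
    tokens = [normalize(t) for t in tokens]                    # normalization
    return [light_stem(t) for t in tokens]                     # stemming
```

The steps run in the same order as the list. Stop words are removed before normalization, so the stop-word list has to be written in the original, unnormalized spelling.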

Project development Feature Extraction

Machines cannot understand raw text, so we need to convert our text into numbers.
Different approaches exist for converting text into a corresponding numerical
form.

How can we transform the data into numbers?

Project development Feature Extraction

Bag of words
Each word is used as a feature for training the classifier. It is known as a “bag” of
words because the method ignores the order of the words. In the binary form described
here it also ignores how many times a word occurs: all that matters is whether the
word is present in the list of words (the vocabulary).
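
A minimal sketch of the presence-only bag of words described above, in plain Python. The function names and the English toy comments are hypothetical and are used only to make the idea concrete.

```python
def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}


def binary_bag_of_words(document, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the document."""
    present = set(document.split())
    return [1 if token in present else 0 for token in vocabulary]


# Toy usage with made-up comments.
docs = ["you are nice", "you are are rude"]
vocab = build_vocabulary(docs)
vectors = [binary_bag_of_words(doc, vocab) for doc in docs]
```

The repeated word in the second comment still produces a single 1, and swapping the words in a comment would not change its vector.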

Project development Feature Extraction

TF-IDF

A statistical measure used to evaluate how important a word is to a document in a
collection or corpus.
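
To make the measure concrete, here is a sketch of one common textbook form of TF-IDF: term frequency (the word's count divided by the document length) multiplied by the inverse document frequency log(N / df). The slides do not say which variant was used, and libraries often apply smoothed versions, so treat the exact formula as an assumption.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights
```

A word that appears in every document gets weight 0 under this variant, while a word that is frequent in one document but rare in the rest of the corpus gets a high weight.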

Project development Feature Extraction

Features
The n-gram is a contiguous sequence of n items (words, letters or symbols) from a
given sample of text.
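
The definition above can be sketched in a few lines of Python. The helper names are hypothetical. The (low, high) range follows the common convention in which (1, 2) means unigrams plus bigrams, which appears to be the notation used later in the results.

```python
def ngrams(items, n):
    """All contiguous runs of n items, in order of appearance."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngrams_in_range(items, low, high):
    """All n-grams for every n from low to high inclusive."""
    return [gram for n in range(low, high + 1) for gram in ngrams(items, n)]


# Word-level and character-level n-grams use the same function.
word_bigrams = ngrams("not a nice comment".split(), 2)
char_trigrams = ngrams(list("nice"), 3)
```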

How can we select the best combination of n-gram features?

Project development Classification

Classification Algorithms

Machine learning algorithms:

1 Logistic Regression.
2 Support Vector Machine.
3 Multinomial Naive Bayes.
4 Bernoulli Naive Bayes.
5 Random Forests.
6 Stochastic Gradient Descent.

Each algorithm goes through these steps to build the classifier:

1 Splitting the dataset.
2 Training the classifier.
3 Testing the classifier.
4 Making predictions.
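
The four steps can be illustrated with a small, self-contained multinomial Naive Bayes classifier, one of the algorithms listed above. This is a sketch written from scratch for illustration, not the thesis code: the function names, the 75/25 split, the Laplace smoothing value and the English toy comments are all assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def split_dataset(samples, test_ratio=0.25, seed=0):
    """Step 1: shuffle, then cut into a training and a test part."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def train(samples, alpha=1.0):
    """Step 2: count classes and words (multinomial Naive Bayes)."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return {"classes": class_counts, "words": word_counts,
            "vocab": vocab, "alpha": alpha, "n": len(samples)}


def predict(model, text):
    """Step 4: pick the class with the highest log-probability."""
    best_label, best_score = None, -math.inf
    for label, count in model["classes"].items():
        total = sum(model["words"][label].values())
        denom = total + model["alpha"] * len(model["vocab"])
        score = math.log(count / model["n"])
        for word in text.split():
            if word in model["vocab"]:  # unseen words are skipped
                seen = model["words"][label][word]
                score += math.log((seen + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Step 3: share of samples whose prediction matches the label."""
    hits = sum(predict(model, text) == label for text, label in samples)
    return hits / len(samples)


# Made-up toy data, only to show how the pieces fit together.
data = [("you are stupid", "offensive"), ("stupid idiot", "offensive"),
        ("have a nice day", "non-offensive"),
        ("nice work thanks", "non-offensive")]
train_set, test_set = split_dataset(data)
model = train(train_set)
```

The other listed algorithms would follow the same split, train, test and predict flow, with only the training and scoring steps replaced.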

Results

Task Description

Our project is broken down into the following two subtasks:

1 Sub-task A: Identifying and categorizing offensive language in one step (the
non-offensive category is used together with the offensive categories).

2 Sub-task B: Identifying and categorizing offensive language in two steps. In
the first step, we identify offensive content, i.e. whether a comment is offensive
or non-offensive. In the second step, we categorize the type of offensive
content.
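
The two-step design of Sub-task B can be sketched as a simple composition of two classifiers. The slides do not name the offensive categories, so the labels "type_1" and "type_2" and the keyword-based stand-in models below are placeholders, not the author's classes or models.

```python
def classify_two_steps(comment, detector, categorizer):
    """Sub-task B: detect first, then categorize only offensive comments."""
    if detector(comment) == "non-offensive":
        return "non-offensive"
    return categorizer(comment)


# Toy stand-ins so the sketch runs; real ones would be trained classifiers.
def toy_detector(comment):
    words = comment.split()
    offensive = "idiot" in words or "hate" in words
    return "offensive" if offensive else "non-offensive"


def toy_categorizer(comment):
    return "type_1" if "idiot" in comment.split() else "type_2"
```

In Sub-task A, by contrast, a single classifier would choose directly among "non-offensive" and all offensive categories. In the two-step design the categorizer only ever sees comments that the detector flagged, so it can be trained on offensive comments alone.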

Results

Sub-Task A
Comparison of accuracy using different models with different features in subtask A.

The best result was achieved by SGDClassifier (84.78%) with TF-IDF features and an
n-gram range of (1,2), i.e. unigrams and bigrams.

Results

Sub-Task B

Comparison of accuracy using different models with different features in subtask B
(first step).

The best result was achieved by LinearSVC (89.21%) with TF-IDF features and an
n-gram range of (1,2), i.e. unigrams and bigrams.

Results

Sub-Task B
Comparison of accuracy using different models with different features in subtask B
(second step).

The best result was achieved by LinearSVC (84.48%) with TF-IDF features and an
n-gram range of (1,1), i.e. unigrams only.

Conclusion and Future Works

Conclusion

1 Support vector machines and stochastic gradient descent are the best
classifiers for our dataset.

2 The use of grid search has a significant impact.

3 Unigrams and the combination of unigrams and bigrams are the best features.
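
For readers unfamiliar with grid search, the idea is to try every combination of candidate settings and keep the one that scores best. The sketch below is a generic plain-Python version. The `evaluate` callback is hypothetical: it is assumed to train a classifier with the given settings and return its validation accuracy. The slides do not say which parameters were searched.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation accuracy
        if score > best_score:    # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score
```

A grid such as {"ngram_range": [(1, 1), (1, 2)], "alpha": [0.1, 1.0]} (parameter names assumed) leads to four evaluations, one per combination.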

Conclusion and Future Works

Conclusion
Testing classifiers ....

Conclusion and Future Works

Future Works

The following ideas could be tested:

Enlarging our dataset with comments from different domains and platforms, to
classify more precisely.

Having experienced annotators verify the annotation of our corpus so that they
can confirm its validity.

Applying a spelling corrector to remove typos, because user comments contain
a lot of spelling mistakes.

Applying deep learning algorithms.

Conclusion and Future Works

Thank You
