Identifying and Categorizing Offensive Language in Social Media

This document is a master's thesis that examines identifying and categorizing offensive language in social media. It discusses how online platforms are working to filter offensive content and the challenges of detecting offensive Arabic language automatically. It then describes the author's machine learning approach, including collecting and annotating data, preprocessing the data through steps like tokenization and stemming, and extracting features to transform the text into numerical format that can be used to classify the data.

Identifying and Categorizing Offensive Language in Social Media

IDIR Imane

Faculty of Sciences
Department of Computer Science

MASTER THESIS
Supervised by: Dr. BESSOU Sadik

October 6, 2020

Table of Contents

1 Introduction

2 Offensive Language

3 Project development
Data collection
Data Preprocessing
Feature Extraction
Classification

4 Results

5 Conclusion and Future Works

Introduction

Introduction
The proliferation of online platforms for user-generated content enables more
people to experience freedom of expression than ever before, which also increases
the chance that offensive language is used.

IDIR Imane Offensive Language Detection October, 6 2020 3 / 24

So, what is offensive language?

Offensive Language
Offensive language is the use of unacceptable and aggressive expression (oral or
written) against an individual or group, whether explicit (direct attacks such as
threatening, cursing, or using dirty or obscene words) or implicit. It can occur
even when humor is used.

Offensive language filtering

Social media platforms face legal action for not properly handling offensive
content online. For this reason, various online platforms employ moderators who
identify offensive content.

Arabic offensive language detection

The automatic detection of Arabic offensive language is a complex task with
multiple challenges, including:

1 The ambiguity and informality of the written text.

2 Arabic has multiple dialects with diverse vocabularies and structures, which
makes high classification performance harder to obtain.

So, how can we automate the process of detecting offensive content in social
media?

Project development

Machine learning approach

Project development Data collection

Data Annotation

1 For building and testing any classifier model, we need annotated data.
2 Annotating the collected data was a difficult task: the comments are mixed
and very ambiguous, so it is hard to understand what the user wants to express
and to distinguish between the types of offensive language.

Data statistics

Category         Number of comments   Number of tokens
Offensive        4,638                55,879
Non-Offensive    3,933                44,726
Total            8,571                100,605

Project development Data Preprocessing

Data Preprocessing
The aim of preprocessing is to reduce dimensionality and clean the data of
noisy and non-meaningful words. The steps are:

1 Tokenization.
2 Cleaning.
3 Removing stop words.
4 Normalization.
5 Stemming.
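
The five preprocessing steps can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the thesis implementation: the stop-word list, the noise pattern, the normalization rules and the toy stemmer (which only strips a leading definite article) are all hypothetical choices, since the slides name the steps without detailing them.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"في", "من", "على", "عن"}

# Anything that is not a basic Arabic letter (U+0621-U+063A, U+0641-U+064A).
# This drops digits, punctuation, Latin text, emoji, diacritics and tatweel.
NOISE = re.compile(r"[^\u0621-\u063A\u0641-\u064A]")

# Assumed normalization rules: unify letter variants that are often
# written interchangeably in informal text.
NORMALIZE = str.maketrans({
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})

DEFINITE_ARTICLE = "\u0627\u0644"  # the prefix "al-"


def tokenize(text):
    """Step 1: split the comment on whitespace."""
    return text.split()


def clean(tokens):
    """Step 2: strip noise characters and drop tokens that become empty."""
    stripped = (NOISE.sub("", tok) for tok in tokens)
    return [tok for tok in stripped if tok]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Step 3: drop tokens found in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


def normalize(tokens):
    """Step 4: map letter variants to a single canonical form."""
    return [tok.translate(NORMALIZE) for tok in tokens]


def light_stem(tokens):
    """Step 5 (toy version): strip a leading definite article from longer words."""
    return [
        tok[2:] if tok.startswith(DEFINITE_ARTICLE) and len(tok) > 4 else tok
        for tok in tokens
    ]


def preprocess(text, stop_words=STOP_WORDS):
    """Run the five steps in the order listed on the slide."""
    tokens = tokenize(text)
    tokens = clean(tokens)
    tokens = remove_stop_words(tokens, stop_words)
    tokens = normalize(tokens)
    return light_stem(tokens)


if __name__ == "__main__":
    print(preprocess("الكتاب على الطاولة!! 123"))
```

Because stop words are removed before normalization in this ordering, the stop-word list has to contain the unnormalized spellings. A real system would likely replace the toy stemmer with an established Arabic stemmer.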

Project development Feature Extraction

Machines cannot understand raw text, so we need to convert our text into
numbers. Different approaches exist for converting text into a corresponding
numerical form.

How can we transform the data into numbers?

Bag of words
Each word is used as a feature for training the classifier. It is known as a “bag” of
words because the method ignores the order of the words. In the binary form
described here it also ignores how many times a word occurs: all that matters is
whether the word is present in the list of words (the vocabulary).
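
As a minimal sketch of the presence-only representation described above, the snippet below builds a vocabulary from tokenized comments and encodes each comment as a 0/1 vector. The function names `build_vocabulary` and `binary_bow` and the English sample tokens are illustrative assumptions, not taken from the thesis.

```python
def build_vocabulary(documents):
    """Map each distinct token to a column index, in first-seen order."""
    vocab = {}
    for tokens in documents:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def binary_bow(tokens, vocab):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the comment."""
    vector = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # words outside the vocabulary are ignored
            vector[idx] = 1
    return vector


if __name__ == "__main__":
    docs = [["you", "are", "kind"], ["you", "are", "rude", "rude"]]
    vocab = build_vocabulary(docs)
    for doc in docs:
        print(binary_bow(doc, vocab))
```

The repeated word in the second comment still yields a single 1, and reordering the tokens leaves the vector unchanged, which is exactly the information this representation discards.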

TF-IDF

TF-IDF is a statistical measure used to evaluate how important a word is to a
document in a collection or corpus.
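
One common way to compute this measure is sketched below. The slide does not state which variant was used, so the formulas here are assumptions: term frequency is the word's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the word.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (count / total) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [["good", "day"], ["bad", "day"]]
    for row in tf_idf(docs):
        print(row)
```

Under this variant a word that appears in every document gets a weight of zero, while a word concentrated in few documents gets a higher weight. Libraries often apply smoothing to the IDF term, so their numbers will differ from this sketch.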

Features

An n-gram is a contiguous sequence of n items (words, letters, or symbols) from a
given sample of text.
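
The definition can be illustrated with a short sketch that works for both word and character n-grams. The helper name `ngrams` and the sample inputs are illustrative assumptions.

```python
def ngrams(items, n):
    """Return all contiguous length-n windows over a sequence, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


if __name__ == "__main__":
    words = ["this", "is", "a", "comment"]
    print(ngrams(words, 2))   # word bigrams
    print(ngrams("word", 3))  # character trigrams
```

Passing a list of tokens yields word n-grams, while passing a string yields character n-grams. A sequence shorter than n produces an empty list.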

How can we select the best combination of n-gram features?
