CHAPTER 1
INTRODUCTION
In recent years, the exponential growth of social media platforms such as Facebook, Twitter, and YouTube has revolutionized communication and content publishing, but these platforms are also increasingly exploited for the propagation of Hate, Offensive and Profane speech. The anonymity and mobility afforded by such media have made the breeding and spread of hate speech, eventually leading to hate crime, effortless in a virtual landscape beyond the reach of traditional law enforcement.
The term ‘hate speech’ is formally defined as ‘any communication that disparages a person or a group on the basis of some characteristic (to be referred to as a type of hate, or hate class) such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics’.
The term ‘offensive language’ refers to an offence charged when someone uses foul or abusive language. It is most commonly charged either where a person has verbally abused police, or alongside other, more serious charges. The offence of offensive language is contained in Section 4A of the Summary Offences Act 1988, which states: “A person must not use offensive language in or near, or within hearing from, a public place or a school.”
The term ‘profanity’ denotes socially offensive language, which may also be called ‘cursing’ or ‘swearing’ (British English), ‘cuss words’ (American English vernacular), ‘swear words’, ‘bad words’, or ‘expletives’. Used in this sense, profanity is language generally considered by certain parts of a culture to be strongly impolite, rude, or offensive. It can show a debasement of someone or something, or be considered an expression of strong feeling towards something.
1.1 MOTIVATION
Building effective countermeasures against online Hate, Offensive and Profane speech requires, as a first step, identifying and tracking such speech online. For years, social media companies such as Twitter, Facebook, and YouTube have been investing hundreds of millions of rupees every year in this task, but are still criticised for not doing enough. This is largely because such efforts are primarily based on manual moderation to identify and delete offensive material. The process is labour-intensive, time-consuming, and not sustainable or scalable in reality.
A large body of research has been conducted in recent years to develop automatic methods for detecting Hate, Offensive and Profane speech in the social media domain. These typically employ semantic content analysis techniques built on Natural Language Processing (NLP) and Machine Learning (ML) methods, both of which are core pillars of Semantic Web research. The task typically involves classifying textual content as non-hate or hateful, in which case it may also identify the type of Hate, Offensive or Profane speech. Although current methods have reported promising results, we notice that their evaluations are largely biased towards detecting non-hate content, as opposed to detecting and classifying real hateful content. A limited number of studies have shown, for example, that state-of-the-art methods for detecting sexist messages obtain an F1 score between 15 and 60 percentage points lower than when detecting non-hate messages. These results suggest that it is much harder to detect hateful content and its types than non-hate content. However, from a practical point of view, we argue that the ability to correctly (precision) and thoroughly (recall) detect and identify specific types of hate speech is more desirable. For example, social media companies need to flag up hateful content for moderation, while law enforcement agencies need to identify hateful messages and their nature as forensic evidence.
This work is concerned with the task of detecting, identifying, and analyzing the spread of Hate, Offensive and Profane speech sentiments in social media.
To address concerns about children’s access to offensive content over the Internet, administrators of social media often manually review online content to detect and delete offensive material. However, the manual task of identifying offensive content is labour-intensive, time-consuming, and thus not sustainable or scalable in reality. Some automatic content filtering software packages, such as Appen and Internet Security Suite, have been developed to detect and filter offensive content online. Most of them simply block webpages and paragraphs that contain dirty words. These word-based approaches not only affect the readability and usability of websites, but also fail to identify subtle offensive messages. For example, under these conventional approaches, the sentence “you are such a crying baby” will not be identified as offensive content, because none of its words is included in general offensive lexicons. In addition, the false positive rate of these word-based detection approaches is often high, due to the word ambiguity problem, i.e., the same word can have very different meanings in different contexts.
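The minimal sketch below illustrates both weaknesses of purely word-based filtering; the tiny word list is an invented stand-in for a real offensive-word lexicon.

# A minimal sketch of word-based filtering, assuming an invented toy lexicon;
# real filters use much larger offensive-word lists.
OFFENSIVE_LEXICON = {"idiot", "moron", "bloody"}

def is_offensive_by_lexicon(sentence: str) -> bool:
    """Flag a sentence if any of its words appears in the lexicon."""
    words = sentence.lower().split()
    return any(word.strip(".,!?") in OFFENSIVE_LEXICON for word in words)

# The subtle insult from the text slips through (false negative):
print(is_offensive_by_lexicon("you are such a crying baby"))             # False
# An innocent literal use of a listed word is flagged (false positive,
# caused by word ambiguity):
print(is_offensive_by_lexicon("the boxer ended up with a bloody nose"))  # True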
Pornographic language refers to the portrayal of explicit sexual subject matter for the purposes of sexual arousal and erotic satisfaction. Offensive language includes any communication outside the law that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, or religion. All of these are generally immoral and harmful to adolescents’ mental health.
1.2 STATEMENT OF THE PROBLEM
In HASOC, we break the given content down into four classes (HATE, OFFENSE, PROFANE, and NONE), taking into account the type and target of the statements.
Offensive language identification: Here we are interested in identifying offensive posts and posts containing any form of (untargeted) profanity. A given statement can be classified into one of four categories.
Hate: the statement contains hate words that disparage a person or a group on the basis of some characteristic such as race, colour, gender, nationality, religion, or other characteristics.
Offensive: the statement contains offensive language or a targeted (veiled or direct) offence. In sum, this category includes insults and threats.
Profane: the statement contains words that are strongly impolite, rude, or offensive, or that can be considered an expression of strong feeling towards something.
None: the statement contains none of the above.
1.3 FLOW OF WORK
Classification of statements into the target classes is carried out in the following steps (illustrative sketches of the steps, under assumed file and column names, follow this list):
1) Data Extraction: In this step we extract the data from the datasets into data frames.
2) Data Cleaning: In this step we bring the required data into the required format from the data frames through the following processes:
Removing special symbols
Removing stop words
Converting the acquired data into tokens
3) Sentiment Analysis: In this step we classify whether the data is a positive-context statement or a negative-context statement.
4) Label Encoding: In this step we encode the class labels of the statements in the given data as the numeric labels that machine learning algorithms require.
5) Machine Learning Classifier: In this step we use a classifier to classify the data into the required target classes:
Splitting the data
Training the model
Predicting the target classes with the model
6) Result Analysis: In this step we analyse the results obtained by predicting the target classes with the machine learning classifier:
Building a confusion matrix
Computing precision and recall
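The sketch below illustrates steps 1 and 2 under assumed names: the file name hasoc_train.tsv and the column name text are placeholders for the actual dataset layout, and NLTK is used here as one possible tokenization and stop-word toolkit.

import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")  # one-time download of the stop-word list
nltk.download("punkt")      # one-time download of the tokenizer model

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text: str) -> list[str]:
    """Remove special symbols, tokenize, and drop stop words."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())   # remove special symbols
    tokens = word_tokenize(text)                        # convert to tokens
    return [t for t in tokens if t not in STOP_WORDS]   # remove stop words

df = pd.read_csv("hasoc_train.tsv", sep="\t")  # step 1: extract into a data frame
df["tokens"] = df["text"].apply(clean_text)    # step 2: clean every statement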
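Step 3 can be sketched with NLTK’s VADER sentiment analyzer, continuing from the data frame above; VADER is one possible choice here, not necessarily the tool used in later chapters.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def sentiment_context(text: str) -> str:
    """Tag a statement as positive- or negative-context via the compound score."""
    return "positive" if sia.polarity_scores(text)["compound"] >= 0 else "negative"

df["sentiment"] = df["text"].apply(sentiment_context)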
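Steps 4 and 5 are sketched next. The label column task_1, the TF-IDF features, and the linear SVM are assumptions chosen for illustration; any scikit-learn classifier could be substituted.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

encoder = LabelEncoder()                 # step 4: map class names to integers
y = encoder.fit_transform(df["task_1"])  # e.g. HATE / OFFENSE / PROFANE / NONE

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["tokens"].apply(" ".join))

# Step 5: split the data, train the model, and predict the target classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)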
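Finally, step 6 checks the predictions. For each class, precision is TP / (TP + FP) and recall is TP / (TP + FN); scikit-learn reports both alongside the confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))  # build the confusion matrix
print(classification_report(             # per-class precision and recall
    y_test, y_pred, target_names=encoder.classes_
))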
1.4 ORGANIZATION OF THE REPORT
This report is structured as follows:
Chapter 2: reviews the literature we referred to in arriving at the flow of work.
Chapter 3: describes our approach to solving the problem.
Chapter 4: describes the given dataset, the language used, and the implementation of the methods used to address the problem.
Chapter 5: discusses whether the model can be accepted or rejected by considering different parameters.
Chapter 6: concludes this work and discusses future directions.