1. Title:
Predicting Cyberbullying on Social Media Using Machine Learning Techniques
2. Project Statement:
The project aims to address the escalating concern of cyberbullying on social media platforms (such
as Twitter (or X), Instagram, and Facebook) by utilising machine learning and deep learning
algorithms, such as Support Vector Machine (SVM), Convolutional Neural Networks (CNN), and Random
Forest, to predict and report cyberbullying incidents.
To improve the accuracy of the solution, the project will leverage the Natural Language Toolkit (NLTK) for
data preprocessing and feature extraction. Then, using the ML techniques, models will be built and
evaluated to effectively distinguish cyberbullying from other content on social media. This project will support
timely detection of bullying episodes and the provision of assistance to victims.
3. Outcomes:
• Real-Time Detection of Cyberbullying Episodes: Creation of a real-time system that
monitors social media data and alerts authorities or organisers to potential harassment
or bullying on social media.
• Understanding of Types of Cyberbullying: Researchers can also gain insights into the kinds
of cyberbullying, such as harassment based on age, religion, ethnicity, or gender. The
authorities can then take the necessary steps based on the type of cyberbullying.
• Deployment and Integration: Researchers can focus on the deployment and integration
of cyberbullying tweet prediction models into existing social media platforms. This can
provide real-time feedback to users and contribute to a safer and more inclusive online
environment.
• Contribution to Cyber Safety: The ultimate outcome of such a project would be to
contribute to cyber safety and security by providing a tool that can detect harassment on
social media. Cyberbullying is a grave issue with severe consequences, and such ML models
offer a promising way to combat it.
Modules to be Implemented:
1. Data Ingestion
2. Exploratory Data Analysis (EDA)
3. Data Preprocessing using NLP techniques
4. Machine Learning Models (Random Forest, SVM, CNN, etc.)
5. Evaluation and Comparative Analysis of Models
6. Project Presentation & Documentation
Week-wise Implementation Plan of Modules:
Milestone 1: Week 1-2
Module 1: Data Collection and Importing Relevant Libraries
• Understand the problem statement
• Gather Twitter (or X) data from relevant sources
• Import relevant libraries in Python
Module 2: Exploratory Data Analysis (EDA)
The goal is to perform EDA on the raw data and provide data visualisations in the form of
charts. Examples below:
• Plot the distribution of tweets labelled by type of cyberbullying
• Plot distribution charts based on word lengths
• Plot word clouds for different label classifications
• Bar charts based on common words
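The numbers behind these charts can be computed before any plotting library is involved. The following is a minimal sketch using only the Python standard library; the sample rows and label names are hypothetical placeholders, not taken from the project dataset.

```python
from collections import Counter

# Hypothetical sample rows: (tweet text, cyberbullying label). Not real data.
tweets = [
    ("you are too old to be posting here", "age"),
    ("nobody likes you because of your religion", "religion"),
    ("had a great day at the park", "not_cyberbullying"),
    ("girls cannot play games", "gender"),
    ("you are so old and slow", "age"),
]

# Number of tweets per label (input for a label distribution bar chart).
label_counts = Counter(label for _, label in tweets)

# Number of tweets per length in words (input for a length histogram).
length_counts = Counter(len(text.split()) for text, _ in tweets)

# Word frequencies per label (input for word clouds or common-word bar charts).
words_by_label = {}
for text, label in tweets:
    words_by_label.setdefault(label, Counter()).update(text.lower().split())

print(label_counts.most_common())
print(sorted(length_counts.items()))
print(words_by_label["age"].most_common(3))
```

Each counter can then be passed to whichever plotting library the project adopts to draw the corresponding chart.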
Milestone 2: Week 3-4
Module 3: Data Preprocessing
The social media data (tweets in this case) contains a large amount of noise. Therefore,
rigorous data preprocessing will be applied to ensure the quality and reliability of the
dataset. This will involve:
• Cleaning and filtering the social media content to remove noise, irrelevant information, and
duplicate posts.
• Applying NLP techniques for text normalisation, tokenisation, stemming, and removal of
stop words to standardise the textual data.
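The steps above could be sketched roughly as follows. This is an illustrative, standard-library-only outline rather than the project's actual pipeline: the function names `clean_tweet` and `preprocess`, the tiny stop-word list, and the regular expressions are all assumptions, and stemming is left as an optional callable so that a real stemmer can be plugged in.

```python
import re

# Tiny illustrative stop-word list (assumption); a real list would be much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "of", "and", "so"}

def clean_tweet(text, stem=None):
    """Normalise one tweet and return its list of tokens."""
    text = text.lower()                           # case normalisation
    text = re.sub(r"https?://\S+", " ", text)     # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)          # drop mentions and hashtags
    tokens = re.findall(r"[a-z]+", text)          # simple word tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem is not None:                          # optional stemming step
        tokens = [stem(t) for t in tokens]
    return tokens

def preprocess(tweets, stem=None):
    """Clean every tweet and drop empty or duplicate results, keeping order."""
    seen, cleaned = set(), []
    for raw in tweets:
        key = tuple(clean_tweet(raw, stem))
        if key and key not in seen:
            seen.add(key)
            cleaned.append(list(key))
    return cleaned

# Hypothetical sample input: the second tweet duplicates the first after cleaning.
sample = [
    "You are SO annoying!!! http://example.com @someone",
    "you are so annoying",
    "#blessed",
]
result = preprocess(sample)
print(result)
```

With NLTK installed, passing `nltk.stem.PorterStemmer().stem` as the `stem` argument would apply Porter stemming to each token, and NLTK's own tokenisers and stop-word corpus could replace the simplified versions used here.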
Milestone 3: Week 5-6
Module 4: Building Machine Learning Models
The goal is to build a suite of ML models on the transformed textual data to
identify cyberbullying. These models are iteratively evaluated and tuned to improve their
predictive performance. Some of the proposed models are:
• Convolutional Neural Networks: CNN models process data through
multiple layers of convolutional filters. Text-based CNNs operate on word embeddings arranged as
matrices.
• Random Forest: RF combines several different classifiers to find solutions to complex tasks.
A random forest is essentially an ensemble of multiple decision trees, trained by
bagging (bootstrap aggregating). A random forest text classification model predicts an
outcome by majority vote over the decision trees' outputs (or by averaging their predicted class probabilities).
• Naïve Bayes model: A probabilistic supervised learning approach that works with a
likelihood function describing the probability of observing a specific value of a feature given a class.
• Support Vector Machine: SVM is a supervised ML model that learns a decision boundary
from labelled training data sets for each category and uses it to categorise new text.
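To make the Naïve Bayes item in the list above concrete, here is a small from-scratch sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It is a teaching illustration under stated assumptions, not the project's implementation: the helper names `train_nb` and `predict_nb` and the toy documents and labels are hypothetical, and a library implementation would normally be used in practice.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class counts and per-class word counts from tokenised documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)      # log prior P(class)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue                           # ignore words never seen in training
            # Smoothed likelihood P(word | class)
            p = (word_counts[label][t] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, already tokenised; labels are illustrative only.
docs = [["stupid", "loser"], ["ugly", "loser"], ["nice", "day"], ["great", "game"]]
labels = ["bullying", "bullying", "not_bullying", "not_bullying"]
model = train_nb(docs, labels)

print(predict_nb(model, ["loser"]))
print(predict_nb(model, ["nice", "game"]))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.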
Milestone 4: Week 7
Module 5: Evaluation and Comparative Analysis of Models
The goal is to do a comparative analysis of the results obtained from the implementation of
various algorithms on selected datasets.
• Utilise metrics such as accuracy, recall, precision and F1-score to carry out this analysis.
• Provide a series of texts to the best-performing models to observe and record real-time
predictions.
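The four metrics named above can be computed directly from true and predicted labels. The sketch below treats one class as the positive class; the function name `binary_metrics`, the label strings, and the sample predictions are hypothetical and chosen only for illustration.

```python
def binary_metrics(y_true, y_pred, positive="bullying"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for eight tweets ("b" = bullying, "n" = not bullying).
names = {"b": "bullying", "n": "not_bullying"}
y_true = [names[c] for c in "bbbnnnnn"]
y_pred = [names[c] for c in "bbnbnnnn"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a multi-class setting such as cyberbullying type, the same function could be called once per class and the results averaged, which corresponds to macro-averaging.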
Milestone 5: Week 8
Module 6: Project Presentation and Documentation
• Prepare a presentation and demo with the following structure:
o Problem Statement and Objective
o Methodology (Brief overview of models used)
o Results & Insights (emphasise key takeaways)
o Visualisations
o Q&A Session
• Use clear visualisations and keep overly technical text in presentations to a minimum.
• Prepare documentation in the format below:
o Project Overview: Problem statement, goals, expected outcomes
o Data Sources: Details on where data was acquired
o Data Preprocessing and Cleaning: Steps taken, techniques used, justification
o Exploratory Data Analysis: Summary of findings, key visualisations
o Model Development: Explanation of model choices, rationale for parameter selection
o Model Evaluation: Performance metrics used, comparison of different models
o Predictive Results: Examples of predictions of cyberbullying
o Appendix: Code snippets (well-commented), additional visualisations, etc.
Evaluation Criteria:
Milestone 1 Evaluation (Week 1-2):
• Successful loading of the dataset into a suitable format (e.g., a pandas DataFrame in Python).
• Identification of missing and duplicate values and a handling strategy.
• Approval of initial summary statistics to understand the data distribution.
• Approval of thorough examination of data distributions (histograms, box plots, etc.).
Milestone 2 Evaluation (Week 3-4):
• Approval of the data preprocessing steps and their implementation.
• Approval of outcomes of the data preprocessing through visualisations of input vs output data
for each data preprocessing step.
Milestone 3 Evaluation (Week 5-6):
• Approval of the Machine Learning models and architectures to be used on the processed
dataset.
• Approval of the hyperparameter tuning process and the range of parameters explored.
• Completion and approval of performance metrics for all built models.
Milestone 4 Evaluation (Week 7-8):
• Approval of the final model based on evaluation criteria.
• Approval of the presentation and project documentation.
• Final code submission on GitHub.
Trigger Warning: The cyberbullying datasets may contain strong language on sensitive topics, such as
violence, abuse, discrimination, and/or mental health issues.