This repository hosts the code, data and results of our reproducibility experiment for the paper "Hate Speech Detection based on Sentiment Knowledge Sharing" by Zhou et al. (2021), as part of the ML Reproducibility Challenge 2021.
This project requires Python >= 3.8.10. After cloning the repository and creating a dedicated environment, all dependencies can be installed by running `pip install -r requirements.txt`. Before proceeding to the training step, the GloVe Common Crawl embeddings (840B tokens) should be downloaded, unzipped and placed in the `data` directory.
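As a sanity check after downloading, each line of the GloVe Common Crawl text file holds a token followed by its space-separated float vector (300-dimensional for the 840B release). A minimal parser, shown here on a toy 3-dimensional line:

```python
def parse_glove_line(line: str):
    """Split one line of a GloVe text file into (token, vector)."""
    parts = line.rstrip().split(" ")
    word, values = parts[0], parts[1:]
    return word, [float(v) for v in values]

# Toy 3-dimensional example; real Common Crawl vectors are 300-dimensional.
word, vec = parse_glove_line("hello 0.1 -0.2 0.3\n")
```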
- We provide the training and test set for the SemEval 2019 data-set as two separate CSV files, `df_train.csv` and `df_test.csv`. To accommodate the original implementation, the original fields `id`, `text` and `HS` have already been renamed to `task_idx`, `tweet` and `label`.
- We include both the original Davidson data-set `davidson_data_full.csv` and our 5-fold cross-validation splits, where the `class` field has already been renamed to `label` to accommodate the original implementation.
- We provide the training data-set used for the sentiment analysis task, `train_E6oV3lV.csv`. The original training and test sets are freely available on Kaggle.
- We rely on the same dictionary of derogatory words, `word_all.txt`, compiled by the original authors.
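The field renamings described above can be reproduced on a fresh copy of the data. A minimal sketch using only the standard library (the mapping matches the SemEval renaming listed above; apply it to rows read with `csv.DictReader`):

```python
# Mapping from the original SemEval 2019 field names to those expected
# by the original SKS implementation.
SEMEVAL_RENAME = {"id": "task_idx", "text": "tweet", "HS": "label"}

def rename_columns(rows, mapping):
    """Rename the keys of each row dict according to `mapping`,
    leaving unmapped keys unchanged."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]
```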
To reproduce our results for the three model variants (SKS, -s and -sc) on both the SemEval (SE) and the Davidson (DV) data-sets, execute the appropriate shell script from the `DNN` directory.
- `SE_run_SKS.sh`: trains the SKS model on the SE data-set using both sentiment features and category embeddings;
- `SE_run_s.sh`: trains the SKS model on the SE data-set ablating sentiment features;
- `SE_run_sc.sh`: trains the SKS model on the SE data-set ablating sentiment features and category embeddings;
- `DV_run_SKS.sh`: trains the SKS model on the 5-fold DV data-set using both sentiment features and category embeddings;
- `DV_run_s.sh`: trains the SKS model on the 5-fold DV data-set ablating sentiment features;
- `DV_run_sc.sh`: trains the SKS model on the 5-fold DV data-set ablating sentiment features and category embeddings.
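The DV scripts consume the precomputed 5-fold splits shipped with the repository. Conceptually, such splits amount to shuffling row indices and chunking them into disjoint folds; a minimal sketch (the seed is illustrative, and this does not claim to reproduce the exact splits in the repository):

```python
import random

def k_fold_indices(n_rows: int, k: int = 5, seed: int = 42):
    """Return k (train_idx, test_idx) pairs whose test folds partition
    the n_rows indices."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin, near-equal sizes
    return [
        ([j for f in range(k) if f != i for j in folds[f]], folds[i])
        for i in range(k)
    ]
```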
In the `results` folder we include the intermediate findings obtained with each model on both data-sets, the output of the grid search we ran to tune a subset of the hyperparameters (learning rate, batch size, dropout rate), and a summary of our results in `resutls.txt`.
The code and the data in this repository are provided under an MIT license. For information about the license under which the original code-base is distributed, please consult the original repository.