GitHub - HKUST-KnowComp/HS_Bias_Eval: Data and code of our EMNLP 2020 paper "Comparative Evaluation of Label-Agnostic Selection Bias in Multilingual Hate Speech Datasets".

This is the code we used for our EMNLP 2020 paper: "Comparative Evaluation of Label-Agnostic Selection Bias in Multilingual Hate Speech Datasets".

Description

This code uses topic models and semantic similarity to compute selection bias in hate speech datasets.
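To give a feel for the "semantic similarity" part, here is a minimal sketch that averages cosine similarity between topic words and keywords. It is only an illustration under assumptions: the function names, the toy two-dimensional vectors, and the averaging scheme are hypothetical and are not taken from this repository or from the metric definitions in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_topic_keyword_similarity(topic_words, keywords, embeddings):
    """Average cosine similarity over all (topic word, keyword) pairs that have vectors.

    Hypothetical helper: the averaging scheme is an assumption for illustration.
    """
    scores = [
        cosine_similarity(embeddings[t], embeddings[k])
        for t in topic_words
        for k in keywords
        if t in embeddings and k in embeddings
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy vectors standing in for real word embeddings (assumption).
toy_embeddings = {"alpha": [1.0, 0.0], "beta": [0.0, 1.0], "gamma": [1.0, 0.0]}
toy_score = mean_topic_keyword_similarity(["alpha", "beta"], ["gamma"], toy_embeddings)
print(toy_score)
```

In practice the vectors would come from the embeddings listed under Requirements and the topic words from a topic model; see the paper for the actual metric definitions.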

You can read our paper here: https://www.aclweb.org/anthology/2020.emnlp-main.199/

Requirements

Python 3.6

Gensim

Babylon embeddings for the tested language

FastText multilingual

How to run the code

python run_bias_metrics.py --language language_name --dataset /path/to/dataset.csv --metric b1 --embeds_path path/to/embeddings --topic_number x --word_number y
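If you want to launch the script from Python (for example, to loop over several datasets), the command above can be assembled as an argument list. This is a minimal sketch: the helper name `build_bias_command` is hypothetical, the flags are the ones shown in the command above, and the values for `topic_number` and `word_number` are placeholder choices for x and y.

```python
import shlex


def build_bias_command(language, dataset, metric, embeds_path, topic_number, word_number):
    """Return the run_bias_metrics.py invocation as a list of arguments."""
    return [
        "python", "run_bias_metrics.py",
        "--language", language,
        "--dataset", dataset,
        "--metric", metric,
        "--embeds_path", embeds_path,
        "--topic_number", str(topic_number),
        "--word_number", str(word_number),
    ]


# Placeholder values (assumptions); replace them with your own settings.
example_command = build_bias_command(
    language="language_name",
    dataset="/path/to/dataset.csv",
    metric="b1",
    embeds_path="path/to/embeddings",
    topic_number=10,
    word_number=10,
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in example_command))
```

The resulting list can be passed to `subprocess.run(example_command, check=True)` from the repository root.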

The dataset needs to be a CSV file that has 'tweet' as a column. We will update the code soon and will upload a full example.
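Until a full example is available, the following sketch shows one way to write texts into a CSV with a 'tweet' column and to check that an existing file has that column. The helper names and the sample texts are hypothetical; only the 'tweet' column name comes from the requirement above.

```python
import csv
import os
import tempfile


def write_tweet_csv(texts, csv_path):
    """Write an iterable of strings to a CSV file with a single 'tweet' column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tweet"])
        for text in texts:
            writer.writerow([text])


def has_tweet_column(csv_path):
    """Return True if the CSV header contains a column named 'tweet'."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return "tweet" in header


# Example: build a tiny dataset in a temporary directory and verify it.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "dataset.csv")
    write_tweet_csv(["first example text", "second, with a comma"], example_path)
    example_ok = has_tweet_column(example_path)
    with open(example_path, newline="", encoding="utf-8") as handle:
        example_rows = list(csv.DictReader(handle))

print(example_ok, len(example_rows))
```

Datasets that store the text under another column name would need that column renamed to 'tweet' before running the script.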

Experiments

To replicate our experiments, please refer to the following datasets:

Arabic Data

Nuha Albadi, Maram Kurdi, and Shivakant Mishra. 2018. Are they our brothers? analysis and detection of religious hate speech in the arabic twittersphere. In Proceedings of ASONAM, pages 69–76. IEEE Computer Society. [https://github.com/nuhaalbadi/Arabic_hatespeech]
Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, and Halima Alshabani. 2019. L-HSAB: A Levantine twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118. [https://github.com/Hala-Mulki/L-HSAB-First-Arabic-Levantine-HateSpeech-Dataset]
Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. 2019. Multilingual and multi-aspect hate speech analysis. In Proceedings of EMNLP-IJCNLP, pages 4675–4684. [https://github.com/HKUST-KnowComp/MLMA_hate_speech/]

English Data

Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of twitter abusive behavior. In Proceedings of ICWSM, pages 491–500. [https://github.com/ENCASEH2020/hatespeech-twitter]
Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. 2019. Multilingual and multi-aspect hate speech analysis. In Proceedings of EMNLP-IJCNLP, pages 4675–4684. [https://github.com/HKUST-KnowComp/MLMA_hate_speech/]
Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93. [https://github.com/ZeerakW/hatespeech]

French Data

Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. 2019. Multilingual and multi-aspect hate speech analysis. In Proceedings of EMNLP-IJCNLP, pages 4675–4684. [https://github.com/HKUST-KnowComp/MLMA_hate_speech/]

German Data

Bjorn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. In Proceedings of NLP4CMC III: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication, pages 6–9. [https://github.com/UCSM-DUE/IWG_hatespeech_public]

Indonesian Data

Muhammad Okky Ibrohim and Indra Budi. 2019. Multi-label hate speech and abusive language detection in Indonesian twitter. In Proceedings of the Third Workshop on Abusive Language Online, pages 46–57. [https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection]

Italian Data

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An italian twitter corpus of hate speech against immigrants. In Proceedings of LREC. [https://github.com/msang/hate-speech-corpus]

Portuguese Data

Paula Fortuna, Joao Rocha da Silva, Juan Soler Company, Leo Wanner, and Sergio Nunes. 2019. A hierarchically-labeled portuguese hate speech dataset. In Proceedings of the 3rd Workshop on Abusive Language Online (ALW3). [https://github.com/paulafortuna/Portuguese-Hate-Speech-Dataset]
