A curated list of awesome datasets with human label variation (un-aggregated labels) in Natural Language Processing and Computer Vision, including links to related initiatives and key references. The focus of the table below is on datasets that provide multiple annotations per instance, which enables learning with human label variation (disagreement); the sketch below illustrates what un-aggregated labels look like in practice. The starting point for Table 1 was the table in the appendix of our paper.
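As a minimal, hedged illustration (the labels are made up, not drawn from any dataset in the table), here is the difference between aggregating annotations by majority vote and keeping the full, un-aggregated label distribution:

```python
from collections import Counter

# Hypothetical un-aggregated instance: one text, labels from five annotators.
# (Illustrative data only; not from any specific dataset listed here.)
instance = {
    "text": "That movie was something else.",
    "annotations": ["positive", "positive", "negative", "neutral", "positive"],
}

counts = Counter(instance["annotations"])
n = len(instance["annotations"])

# Aggregation discards variation: majority vote keeps a single "gold" label ...
majority_label = counts.most_common(1)[0][0]          # -> "positive"

# ... whereas keeping the full distribution preserves the disagreement signal.
soft_label = {label: c / n for label, c in counts.items()}
# -> {"positive": 0.6, "negative": 0.2, "neutral": 0.2}

print(majority_label, soft_label)
```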
If you know of resources, papers, or links that are not yet listed, please help grow this resource. You can contribute by creating a pull request as outlined in contributing.md.
Please cite our paper (Plank, EMNLP 2022) if you find this repository useful:
```bibtex
@inproceedings{plank-2022-emnlp,
    title = "The ``Problem'' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation",
    author = "Plank, Barbara",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
}
```
Icons refer to the following:
- 🚀 SemEval 2023 Shared Task 11 on Learning with Disagreement (Le-Wi-Di): 2nd shared task on subjective NLP tasks, on-going!
- 🤷 SemEval 2021 Shared Task 12 on Learning with Disagreement: 1st shared task, which included core NLP and computer vision tasks
- 🥧 Perspectivist Data Manifesto (PDAI): Website that contains key references and a first list of non-aggregated datasets
- 🗣️ NLPerspectives 2022, the Workshop on Perspectivist Approaches to NLP, held at LREC 2022; 2nd edition in 2023, co-located with ECAI 2023
- 📖 Uma et al., 2021: Learning from Disagreement: A Survey. Broad overview of approaches across NLP and computer vision tasks; a minimal training-and-evaluation sketch in that spirit follows this list.
- Plank et al., 2014. Learning part-of-speech taggers with inter-annotator agreement loss. EACL. Proposed to leverage small samples of un-aggregated data to improve performance on morphosyntactic NLP tasks. Inspired follow-up work such as Linguistically debatable or just plain wrong? (ACL 2014), an analysis of the systematicity of annotator disagreement on objective linguistic annotation tasks (POS tagging).
- Aroyo & Welty, 2015. Truth is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Magazine. Proposes the Crowd Truth framework, which sparked a large body of work on medical relation extraction, frame disambiguation, and other semantic processing tasks.
- Pavlick & Kwiatkowski, 2019. Inherent Disagreements in Human Textual Inferences. TACL. Seminal work that illustrates plausible disagreement in entailment datasets. Inspired follow-up work such as the ChaosNLI re-annotation study (Nie et al., 2020), which embraces the collective human opinion for NLI; evaluating predictions against such collective label distributions is illustrated in the sketch after this list.
- Alm, 2011. Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications. Early paper discussing annotator agreement on subjective linguistic annotation tasks.
- Basile et al., 2021. Toward a Perspectivist Turn in Ground Truthing for Predictive Computing. Conference of the Italian Chapter of the Association for Intelligent Systems (ItAIS 2021). Puts forward data perspectivism to embrace human perspectives. Inspired substantial follow-up work on subjective tasks (see, e.g., the Le-Wi-Di 2023 shared task).
- Gordon et al., 2021. The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality. Seminal paper published at the ACM CHI conference on Human-Computer Interaction.
- Davani et al., 2022. Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations. TACL. Examines, among other things, whether prediction uncertainty correlates with whether the multi-task model correctly predicts the majority label; a minimal sketch of such a multi-annotator model follows this list.
- Jiang & de Marneffe, 2022. Investigating Reasons for Disagreement in Natural Language Inference. TACL. Provides a novel linguistic taxonomy to characterize disagreements in natural language inference datasets.
- Wan et al., 2023. Everyone's Voice Matters: Quantifying Annotation Disagreement Using Demographic Information. AAAI. Predicts annotation disagreement on five subjective tasks and examines the role of annotator demographic information.
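To make the "learning from disagreement" idea concrete (see the Uma et al. survey and the ChaosNLI work above), here is a minimal, hedged sketch of training a classifier against soft labels and evaluating it against the human label distribution. The data, dimensions, and linear model are hypothetical; soft cross-entropy plus a Jensen-Shannon comparison is one common instantiation among the surveyed approaches, not the exact method of any single paper:

```python
import torch
import torch.nn.functional as F

# Toy setup: 3 classes, a linear "model" over random features.
# All names and shapes are illustrative, not tied to any listed dataset.
torch.manual_seed(0)
num_classes, dim = 3, 16
model = torch.nn.Linear(dim, num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(8, dim)  # 8 hypothetical instances
# Soft targets: per-instance label distributions derived from
# un-aggregated annotations (here simulated at random).
soft_targets = torch.softmax(torch.randn(8, num_classes), dim=-1)

for _ in range(100):
    optimizer.zero_grad()
    logits = model(x)
    # Cross-entropy against probabilistic targets (supported since
    # PyTorch 1.10); this is the "soft loss" family of approaches.
    loss = F.cross_entropy(logits, soft_targets)
    loss.backward()
    optimizer.step()

def jensen_shannon(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two batches of distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12) / b.clamp_min(1e-12)).log()).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

with torch.no_grad():
    pred = torch.softmax(model(x), dim=-1)
    # Lower mean JSD = predicted distributions closer to the human ones.
    print("mean JSD to human distribution:", jensen_shannon(pred, soft_targets).mean().item())
```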
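Similarly, a minimal sketch of a multi-annotator architecture in the spirit of Davani et al. (2022): a shared encoder with one classification head per annotator, whose spread across heads can serve as an uncertainty signal. The names, sizes, and the variance-based uncertainty measure are illustrative assumptions, not the paper's exact setup:

```python
import torch

# Hypothetical dimensions: 4 annotators, 3 classes, 16 input features.
num_annotators, num_classes, dim = 4, 3, 16

encoder = torch.nn.Sequential(torch.nn.Linear(dim, 32), torch.nn.ReLU())
heads = torch.nn.ModuleList(
    [torch.nn.Linear(32, num_classes) for _ in range(num_annotators)]
)

x = torch.randn(5, dim)  # 5 hypothetical instances
h = encoder(x)
per_annotator = torch.stack([head(h) for head in heads])  # (A, N, C) logits
probs = torch.softmax(per_annotator, dim=-1)

# Aggregate the per-annotator predictions into a single distribution, and
# use the spread across heads as an uncertainty signal (the quantity the
# paper relates to majority-label correctness).
mean_probs = probs.mean(dim=0)              # model's "collective" prediction
uncertainty = probs.var(dim=0).sum(dim=-1)  # disagreement across heads
print(mean_probs.argmax(-1), uncertainty)
```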
The list above contains selected key references. Please see our EMNLP 2022 theme paper (Plank, 2022) for further references on annotator culture and backgrounds, the different terms proposed in the literature, and more. If you know of relevant related work (not datasets), please open an issue. For more datasets, please see contributing.md.