We present a novel dataset in Bangla, Hindi, Magahi, Malayalam, Marathi, Odia, Punjabi, Telugu, and Urdu that facilitates text sentiment transfer, a subtask of text style transfer (TST) that transforms positive-sentiment sentences into negative ones and vice versa. To establish a high-quality base for further research, we refined and corrected an existing English dataset of 1,000 sentences for sentiment transfer based on Yelp reviews, and we introduce a new human-translated dataset in the Indian languages that parallels its English counterpart. For further reading, please refer to the papers "Low-Resource Text Style Transfer for Bangla: Data & Models" and "Multilingual Text Style Transfer: Datasets & Models for Indian Languages".
📂 multilingual-tst-datasets/
📂 bengali/
├── bn_yelp_reference-0.csv
├── bn_yelp_reference-1.csv
└── ..
📂 hindi/
├── hi_yelp_reference-0.csv
├── hi_yelp_reference-1.csv
└── ..
📂 magahi/
├── mag_yelp_reference-0.csv
├── mag_yelp_reference-1.csv
└── ..
📂 malayalam/
├── ml_yelp_reference-0.csv
├── ml_yelp_reference-1.csv
└── ..
📂 marathi/
├── mr_yelp_reference-0.csv
├── mr_yelp_reference-1.csv
└── ..
📂 odia/
├── or_yelp_reference-0.csv
├── or_yelp_reference-1.csv
└── ..
📂 punjabi/
├── pa_yelp_reference-0.csv
├── pa_yelp_reference-1.csv
└── ..
📂 refined-english/
├── en_yelp_reference-0.csv
├── en_yelp_reference-1.csv
└── ..
📂 telugu/
├── te_yelp_reference-0.csv
├── te_yelp_reference-1.csv
└── ..
📂 urdu/
├── ur_yelp_reference-0.csv
├── ur_yelp_reference-1.csv
└── ..
├── LICENSE.md
├── README.md
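The layout above follows a regular naming pattern (`<language folder>/<prefix>_yelp_reference-<index>.csv`), so file paths can be built programmatically. The following is a minimal Python sketch, not an official loader: the helper names `reference_path` and `read_rows` are hypothetical, the files are assumed to be UTF-8 encoded, and because the column layout is not described here, rows are returned as raw lists rather than named fields.

```python
import csv
from pathlib import Path

# Folder name -> file prefix, copied from the directory listing above.
LANGUAGE_PREFIXES = {
    "bengali": "bn",
    "hindi": "hi",
    "magahi": "mag",
    "malayalam": "ml",
    "marathi": "mr",
    "odia": "or",
    "punjabi": "pa",
    "refined-english": "en",
    "telugu": "te",
    "urdu": "ur",
}


def reference_path(root, language, index):
    """Build the path to one reference file, e.g. hindi/hi_yelp_reference-0.csv."""
    prefix = LANGUAGE_PREFIXES[language]
    return Path(root) / language / f"{prefix}_yelp_reference-{index}.csv"


def read_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings).

    Assumption: the file is UTF-8 encoded. No header handling is done,
    since the column layout is not documented in this README.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

For example, `read_rows(reference_path("multilingual-tst-datasets", "hindi", 0))` would read the first Hindi reference file, provided the repository is checked out under that folder name.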
Please read the license terms in LICENSE.md.
This research was supported by the European Union (ERC, NG-NLG, 101039303). We acknowledge the use of resources provided by the LINDAT/CLARIAH-CZ Research Infrastructure (Czech Ministry of Education, Youth, and Sports project No. LM2018101). We would also like to thank Panlingua Language Processing LLP for this collaborative research project and for providing the dataset.
Atul Kr. Ojha and John P. McCrae would like to acknowledge the support of the Science Foundation Ireland (SFI) as part of Grant Number SFI/12/RC/2289_P2 Insight_2, Insight SFI Research Centre for Data Analytics.
If you use this data, please cite:
@inproceedings{mukherjee-etal-2023-low,
title = "Low-Resource Text Style Transfer for {B}angla: Data {\&} Models",
author = "Mukherjee, Sourabrata and
Bansal, Akanksha and
Majumdar, Pritha and
Ojha, Atul Kr. and
Du{\v{s}}ek, Ond{\v{r}}ej",
editor = "Alam, Firoj and
Kar, Sudipta and
Chowdhury, Shammur Absar and
Sadeque, Farig and
Amin, Ruhul",
booktitle = "Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.banglalp-1.5",
doi = "10.18653/v1/2023.banglalp-1.5",
pages = "34--47",
}
@misc{mukherjee2024multilingual,
title={Multilingual Text Style Transfer: Datasets \& Models for Indian Languages},
author={Sourabrata Mukherjee and Atul Kr. Ojha and Akanksha Bansal and Deepak Alok and John P. McCrae and Ondřej Dušek},
year={2024},
eprint={2405.20805},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{mukherjee2024large,
title={Are Large Language Models Actually Good at Text Style Transfer?},
author={Sourabrata Mukherjee and Atul Kr. Ojha and Ondřej Dušek},
year={2024},
eprint={2406.05885},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
=== Machine-readable metadata (DO NOT REMOVE!) =====================================================
Data available since: Multilingual Text Style Transfer (MTST) Datasets@2023
License: See the LICENSE.md
=======
Includes text: Yes
Contact: [email protected] or [email protected]/[email protected]
Contributor/© holder: Panlingua Language Processing LLP, India; Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics Charles University, Czech Republic; and Insight Centre for Data Analytics, Data Science Institute, University of Galway, Ireland
=======================================================================================================