Multilingual Text Style Transfer (MTST) Datasets

Introduction

We present a novel dataset in Bangla, Hindi, Magahi, Malayalam, Marathi, Odia, Punjabi, Telugu, and Urdu that facilitates text sentiment transfer, a subtask of text style transfer (TST) that transforms positive-sentiment sentences into negative ones and vice versa. To establish a high-quality base for further research, we refined and corrected an existing English dataset of 1,000 sentences for sentiment transfer based on Yelp reviews, and we introduce a new human-translated dataset in Indian languages that parallels its English counterpart. For further reading, please refer to the papers "Low-Resource Text Style Transfer for Bangla: Data & Models" and "Multilingual Text Style Transfer: Datasets & Models for Indian Languages".

Structure of the MTST datasets:

📂 multilingual-tst-datasets/
    📂  bengali/
        ├── bn_yelp_reference-0.csv
        ├── bn_yelp_reference-1.csv
        └── ..
    📂  hindi/
        ├── hi_yelp_reference-0.csv
        ├── hi_yelp_reference-1.csv
        └── ..
    📂  magahi/
        ├── mag_yelp_reference-0.csv
        ├── mag_yelp_reference-1.csv
        └── ..
    📂  malayalam/
        ├── ml_yelp_reference-0.csv
        ├── ml_yelp_reference-1.csv
        └── ..
    📂  marathi/
        ├── mr_yelp_reference-0.csv
        ├── mr_yelp_reference-1.csv
        └── ..
    📂  odia/
        ├── or_yelp_reference-0.csv
        ├── or_yelp_reference-1.csv
        └── ..
    📂  punjabi/
        ├── pa_yelp_reference-0.csv
        ├── pa_yelp_reference-1.csv
        └── ..
    📂  refined-english/
        ├── en_yelp_reference-0.csv
        ├── en_yelp_reference-1.csv
        └── ..
    📂  telugu/
        ├── te_yelp_reference-0.csv
        ├── te_yelp_reference-1.csv
        └── ..
    📂  urdu/
        ├── ur_yelp_reference-0.csv
        ├── ur_yelp_reference-1.csv
        └── ..
    ├── LICENSE.md
    ├── README.md
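
The layout above follows a regular naming pattern: each language folder holds files named `<prefix>_yelp_reference-<n>.csv`. As a minimal sketch of how the files might be located and read, the Python helpers below build paths from that pattern. The function names (`reference_path`, `list_reference_files`, `read_rows`) are hypothetical and not part of this repository, and the code assumes UTF-8 encoded files without assuming any particular header or column layout.

```python
import csv
from pathlib import Path

# Folder name -> file prefix, copied from the directory tree above.
LANG_PREFIX = {
    "bengali": "bn",
    "hindi": "hi",
    "magahi": "mag",
    "malayalam": "ml",
    "marathi": "mr",
    "odia": "or",
    "punjabi": "pa",
    "refined-english": "en",
    "telugu": "te",
    "urdu": "ur",
}


def reference_path(root, language, index):
    """Build the path of one reference file, e.g. hindi/hi_yelp_reference-0.csv."""
    prefix = LANG_PREFIX[language]
    return Path(root) / language / f"{prefix}_yelp_reference-{index}.csv"


def list_reference_files(root, language):
    """Return the reference CSVs found for a language, sorted lexicographically."""
    prefix = LANG_PREFIX[language]
    pattern = f"{prefix}_yelp_reference-*.csv"
    return sorted((Path(root) / language).glob(pattern))


def read_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings).

    Assumption: the file is UTF-8 encoded. No header or column
    layout is assumed; inspect the first rows before relying on one.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

For example, `reference_path("multilingual-tst-datasets", "hindi", 0)` points at `hindi/hi_yelp_reference-0.csv` inside a local copy of the repository; the root folder name is taken from the tree above and should be adjusted to wherever the data was cloned.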

License

Please read the License file.

Acknowledgments

This research was supported by the European Union (ERC, NG-NLG, 101039303). We acknowledge the use of resources provided by the LINDAT/CLARIAH-CZ Research Infrastructure (Czech Ministry of Education, Youth, and Sports project No. LM2018101). We would also like to acknowledge Panlingua Language Processing LLP for this collaborative research project and for providing the dataset.

Atul Kr. Ojha and John P. McCrae would like to acknowledge the support of the Science Foundation Ireland (SFI) as part of Grant Number SFI/12/RC/2289_P2 Insight_2, Insight SFI Research Centre for Data Analytics.

References

If you use this data, please cite:

  @inproceedings{mukherjee-etal-2023-low,
    title = "Low-Resource Text Style Transfer for {B}angla: Data {\&} Models",
    author = "Mukherjee, Sourabrata  and
      Bansal, Akanksha  and
      Majumdar, Pritha  and
      Ojha, Atul Kr.  and
      Du{\v{s}}ek, Ond{\v{r}}ej",
    editor = "Alam, Firoj  and
      Kar, Sudipta  and
      Chowdhury, Shammur Absar  and
      Sadeque, Farig  and
      Amin, Ruhul",
    booktitle = "Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.banglalp-1.5",
    doi = "10.18653/v1/2023.banglalp-1.5",
    pages = "34--47",
}

@misc{mukherjee2024multilingual,
      title={Multilingual Text Style Transfer: Datasets {\&} Models for Indian Languages},
      author={Sourabrata Mukherjee and Atul Kr. Ojha and Akanksha Bansal and Deepak Alok and John P. McCrae and Ondřej Dušek},
      year={2024},
      eprint={2405.20805},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{mukherjee2024large,
      title={Are Large Language Models Actually Good at Text Style Transfer?}, 
      author={Sourabrata Mukherjee and Atul Kr. Ojha and Ondřej Dušek},
      year={2024},
      eprint={2406.05885},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

=== Machine-readable metadata (DO NOT REMOVE!) =====================================================
Data available since: Multilingual Text Style Transfer (MTST) Datasets@2023
License: See the LICENSE.md
=======
Includes text: Yes
Contact:[email protected] or [email protected]/[email protected] 
Contributor/© holder: Panlingua Language Processing LLP, India; Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic; and Insight Centre for Data Analytics, Data Science Institute, University of Galway, Ireland
=======================================================================================================
