panlingua/multilingual-tst-datasets

Multilingual Text Style Transfer (MTST) Datasets

Introduction

We present a novel dataset in Bangla, Hindi, Magahi, Malayalam, Marathi, Odia, Punjabi, Telugu, and Urdu that facilitates text sentiment transfer, a subtask of text style transfer (TST) that transforms positive-sentiment sentences into negative ones and vice versa. To establish a high-quality base for further research, we refined and corrected an existing English dataset of 1,000 sentences for sentiment transfer based on Yelp reviews, and we introduce a new human-translated Indian-language dataset that parallels its English counterpart. For further reading, please refer to the papers Low-Resource Text Style Transfer for Bangla: Data & Models and Multilingual Text Style Transfer: Datasets & Models for Indian Languages.

Structure of the MTST datasets:

📂 multilingual-tst-datasets/
    📂  bengali/
        ├── bn_yelp_reference-0.csv
        ├── bn_yelp_reference-1.csv
        └── ..
    📂  hindi/
        ├── hi_yelp_reference-0.csv
        ├── hi_yelp_reference-1.csv
        └── ..
    📂  magahi/
        ├── mag_yelp_reference-0.csv
        ├── mag_yelp_reference-1.csv
        └── ..
    📂  malayalam/
        ├── ml_yelp_reference-0.csv
        ├── ml_yelp_reference-1.csv
        └── ..
    📂  marathi/
        ├── mr_yelp_reference-0.csv
        ├── mr_yelp_reference-1.csv
        └── ..
    📂  odia/
        ├── or_yelp_reference-0.csv
        ├── or_yelp_reference-1.csv
        └── ..
    📂  punjabi/
        ├── pa_yelp_reference-0.csv
        ├── pa_yelp_reference-1.csv
        └── ..
    📂  refined-english/
        ├── en_yelp_reference-0.csv
        ├── en_yelp_reference-1.csv
        └── ..
    📂  telugu/
        ├── te_yelp_reference-0.csv
        ├── te_yelp_reference-1.csv
        └── ..
    📂  urdu/
        ├── ur_yelp_reference-0.csv
        ├── ur_yelp_reference-1.csv
        └── ..
    ├── LICENSE.md
    ├── README.md
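
As a rough illustration of how the layout above might be consumed, the following Python sketch discovers the reference files for one language and reads them with the standard library. The folder names and file-name prefixes are taken from the tree above; the helper names (`LANG_PREFIXES`, `list_reference_files`, `read_rows`) are hypothetical, and UTF-8 encoding is an assumption. The tree does not state how many reference files exist or what columns they contain, so the sketch globs for files rather than hard-coding indices and makes no assumption about the column layout.

```python
import csv
from pathlib import Path

# Folder name -> file-name prefix, as listed in the directory tree above.
LANG_PREFIXES = {
    "bengali": "bn",
    "hindi": "hi",
    "magahi": "mag",
    "malayalam": "ml",
    "marathi": "mr",
    "odia": "or",
    "punjabi": "pa",
    "refined-english": "en",
    "telugu": "te",
    "urdu": "ur",
}


def list_reference_files(root, language):
    """Return the reference CSVs found in one language folder, sorted by file name."""
    prefix = LANG_PREFIXES[language]
    folder = Path(root) / language
    return sorted(folder.glob(f"{prefix}_yelp_reference-*.csv"))


def read_rows(csv_path):
    """Read a CSV file into a list of rows (each row is a list of strings).

    Assumption: the files are UTF-8 encoded. No column layout is assumed.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains the clone.
    for path in list_reference_files("multilingual-tst-datasets", "hindi"):
        rows = read_rows(path)
        print(path.name, len(rows), "rows")
```

Inspecting the first row of each file is a reasonable first step before relying on any particular column order.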

License

Please read the License file.

Acknowledgments

This research was supported by the European Union (ERC, NG-NLG, 101039303). We acknowledge the use of resources provided by the LINDAT/CLARIAH-CZ Research Infrastructure (Czech Ministry of Education, Youth, and Sports project No. LM2018101). We would also like to acknowledge Panlingua Language Processing LLP for this collaborative research project and for providing the dataset.

Atul Kr. Ojha and John P. McCrae would like to acknowledge the support of the Science Foundation Ireland (SFI) as part of Grant Number SFI/12/RC/2289_P2 Insight_2, Insight SFI Research Centre for Data Analytics.

References

If you use this data, please cite:

  @inproceedings{mukherjee-etal-2023-low,
    title = "Low-Resource Text Style Transfer for {B}angla: Data {\&} Models",
    author = "Mukherjee, Sourabrata  and
      Bansal, Akanksha  and
      Majumdar, Pritha  and
      Ojha, Atul Kr.  and
      Du{\v{s}}ek, Ond{\v{r}}ej",
    editor = "Alam, Firoj  and
      Kar, Sudipta  and
      Chowdhury, Shammur Absar  and
      Sadeque, Farig  and
      Amin, Ruhul",
    booktitle = "Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.banglalp-1.5",
    doi = "10.18653/v1/2023.banglalp-1.5",
    pages = "34--47",
}
@misc{mukherjee2024multilingual,
      title={Multilingual Text Style Transfer: Datasets {\&} Models for Indian Languages},
      author={Sourabrata Mukherjee and Atul Kr. Ojha and Akanksha Bansal and Deepak Alok and John P. McCrae and Ondřej Dušek},
      year={2024},
      eprint={2405.20805},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{mukherjee2024large,
      title={Are Large Language Models Actually Good at Text Style Transfer?}, 
      author={Sourabrata Mukherjee and Atul Kr. Ojha and Ondřej Dušek},
      year={2024},
      eprint={2406.05885},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
=== Machine-readable metadata (DO NOT REMOVE!) =====================================================
Data available since: Multilingual Text Style Transfer (MTST) Datasets@2023
License: See the LICENSE.md
=======
Includes text: Yes
Contact:[email protected] or [email protected]/[email protected] 
Contributor/© holder: Panlingua Language Processing LLP, India; Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic; and Insight Centre for Data Analytics, Data Science Institute, University of Galway, Ireland
=======================================================================================================
