kevserbusrayildirim/FSMTSAD

FSMTSAD (FSMTSA Dataset)

Overview

The FSMTSA (Fatih Sultan Mehmet Target Sentiment Analysis) dataset is a comprehensive resource designed for sentiment analysis studies in Turkish. It includes text samples from diverse sources such as hotel reviews, restaurant reviews, movie critiques, product evaluations from e-commerce platforms, and social media posts (tweets). Additionally, the dataset is enriched with text samples generated by Large Language Models (LLMs) to improve data diversity and coverage.

The dataset was expanded through data augmentation so that positive, neutral, and negative sentiments are represented in nearly equal proportions, which helps reduce class-imbalance bias when training and evaluating sentiment models.


Data Summary

  • Total Samples: 15,853
  • Class Distribution:
    • Positive: 5,284 (33.3%)
    • Neutral: 5,206 (32.8%)
    • Negative: 5,363 (33.8%)
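
As a quick sanity check, the class shares can be recomputed from the reported counts. This is a minimal sketch that uses only the numbers listed above; the variable names are illustrative and not part of the dataset.

```python
# Class counts as reported in the Data Summary.
counts = {"Positive": 5284, "Neutral": 5206, "Negative": 5363}

# Total number of samples across the three classes.
total = sum(counts.values())

# Share of each class as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(f"Total: {total}")
for label, pct in shares.items():
    print(f"{label}: {counts[label]} ({pct}%)")
```

Because each share is rounded independently, the three percentages add up to 99.9% rather than exactly 100%.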

Data Sources

The data originates from a variety of real-world sources, supplemented by synthetic text:

  • Hotel, restaurant, and movie reviews
  • E-commerce product reviews
  • Social media posts (tweets)
  • Texts generated by Large Language Models (LLMs)

For neutral samples, particular care was taken to select texts without explicit subjective judgments or with balanced opposing sentiments.


Data Augmentation Techniques

  1. Back-Translation:
    • Sentences were translated into English and then back into Turkish to introduce structural variations while preserving semantic meaning.
  2. Synonym Replacement:
    • Key words in the text were replaced with their synonyms using the WordNet lexical database to create contextually equivalent variations.

During the augmentation process, duplicate entries were systematically removed to ensure data quality.
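
The augmentation steps above might be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the translation functions are passed in as hypothetical callables (no specific translation service is implied), the `toy_synonyms` table is a made-up stand-in for a WordNet lookup, and duplicates are detected by exact match after whitespace normalization, which may differ from the criterion actually used.

```python
from typing import Callable, Dict, Iterable, List


def back_translate(text: str,
                   tr_to_en: Callable[[str], str],
                   en_to_tr: Callable[[str], str]) -> str:
    """Round-trip a Turkish sentence through English.

    Both translators are supplied by the caller (hypothetical callables).
    """
    return en_to_tr(tr_to_en(text))


def replace_synonyms(text: str, synonyms: Dict[str, str]) -> str:
    """Swap whole whitespace-separated words found in a synonym table."""
    return " ".join(synonyms.get(word, word) for word in text.split())


def deduplicate(texts: Iterable[str]) -> List[str]:
    """Drop repeated texts, keeping the first occurrence of each.

    Two texts count as duplicates if they match after collapsing whitespace.
    """
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Toy usage: the second sentence differs from the first only by extra spacing.
original = ["Kokusu güzel ve hafif.", "Kokusu  güzel ve hafif."]
toy_synonyms = {"güzel": "hoş"}  # hypothetical synonym pair for illustration
augmented = original + [replace_synonyms(t, toy_synonyms) for t in original]
result = deduplicate(augmented)
print(result)
```

Running deduplication after augmentation, rather than before, also catches cases where two different source sentences collapse to the same augmented text.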


Annotation Process

  • The dataset was annotated manually by three independent annotators.
  • In cases of disagreement, a majority vote was taken, with further validation by a supervisor.
  • Disputed annotations were compared against outputs from at least three different LLMs for consistency.
  • Sentiment classes are encoded as follows:
    • -1: Negative
    • 0: Neutral
    • 1: Positive
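
The majority-vote step and the label encoding can be illustrated with a short sketch. The function name and the `None` return value are assumptions made for this example: the README does not say how a three-way disagreement was resolved beyond supervisor validation and comparison with LLM outputs, so the sketch simply flags such cases for review.

```python
from collections import Counter
from typing import Optional, Sequence

# Label encoding as listed above.
LABELS = {-1: "Negative", 0: "Neutral", 1: "Positive"}


def majority_label(votes: Sequence[int]) -> Optional[int]:
    """Return the label chosen by more than half of the annotators.

    Returns None when no label has a strict majority, signalling that the
    sample needs further review (hypothetical convention for this sketch).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Two of three annotators agree, so the majority label is used.
agreed = majority_label([1, 1, 0])
print(agreed, LABELS[agreed])

# All three annotators disagree, so there is no majority.
disputed = majority_label([-1, 0, 1])
print(disputed)
```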

Example Data Samples

| Text | Source | Polarity |
| --- | --- | --- |
| Akşam 9'da kapanma olacak ya sanırım İstanbul'un trafik yoğunluğunun %50'si şu an Yeniköy'de bu ne hal? | Tweet | -1 (Negative) |
| Vatandaşlar, oy kullanma hakkına sahiptirler, ulaşılabilirlik konusuna dikkat edilmektedir. | LLM-generated | 0 (Neutral) |
| Kokusu güzel hafif, diğer yumuşatıcılar gibi ağır yoğun bir kokusu yok. Bahar gibi kokuyor, bahar aylarında tercih edilebilecek bir yumuşatıcı bence. | Product Review | 1 (Positive) |

Usage Guidelines

  • This dataset is publicly available for academic and research purposes.
  • Users must properly cite the original authors and reference the dataset in their publications.
  • The dataset should not be used for commercial purposes without explicit permission.

Citation Guide

If you use the FSMTSA dataset in your research, please cite it as follows:

Zümberoğlu, K. B., Dik, S. Z., Karadeniz, B. S., & Sahmoud, S. (2025). Towards Better Sentiment Analysis in the Turkish Language: Dataset Improvements and Model Innovations. Applied Sciences, 15(4), 2062. https://doi.org/10.3390/app15042062


Contact

For questions or feedback, please contact [email protected].


Acknowledgements

We acknowledge Dr. Shaaban Sahmoud for his invaluable guidance and extend our thanks to Sümeyye Zülal Dik and Büşra Sinem Karadeniz for their dedicated efforts in the creation, annotation, and validation of the FSMTSA dataset.

About

Fatih Sultan Mehmet Turkish Sentiment Analysis Dataset
