2021
This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity. With datasets of code-mixed Tamil and Malayalam available, three methods are proposed: a sub-word level model, a word embedding based model, and a machine learning based architecture. The sub-word and word embedding based models utilized a Long Short-Term Memory (LSTM) network along with language-specific preprocessing, while the machine learning model used term frequency–inverse document frequency (TF-IDF) vectorization along with a Logistic Regression model. The sub-word level model was submitted to the track ‘Sentiment Analysis for Dravidian Languages in Code-Mixed Text’ organized by the Forum for Information Retrieval Evaluation in 2020 (FIRE 2020). Although it received ranks of 5 and 12 for the Tamil and Malayalam tasks respectively in the FIRE 2020 track, this paper improves upon those results to attain final weighted F1-scores of 0.65 for the Ta...
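For concreteness, a minimal scikit-learn sketch of the TF-IDF plus Logistic Regression branch described above; the toy comments, labels, and hyperparameters are illustrative, not the authors' actual configuration.

```python
# Minimal sketch (not the authors' exact setup) of a TF-IDF + Logistic Regression
# classifier for code-mixed comment polarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy code-mixed comments and polarity labels, purely illustrative.
train_texts = ["padam vera level", "trailer romba mokka", "nalla irukku", "oru rakshayum illa"]
train_labels = ["positive", "negative", "positive", "negative"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),   # word uni/bigram TF-IDF
    ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(train_texts, train_labels)
print(clf.predict(["semma padam vera level"]))   # predicted polarity for a new comment
```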
2020
This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity in the evaluation of the track ‘Sentiment Analysis for Dravidian Languages in Code-Mixed Text’ proposed by the Forum for Information Retrieval Evaluation in 2020. The implemented method used a sub-word level representation to capture the sentiment of the text. Using a Long Short-Term Memory (LSTM) network along with language-specific preprocessing, the model classified the text according to its polarity. With F1-scores of 0.61 and 0.60, the model achieved overall ranks of 5 and 12 in the Tamil and Malayalam tasks, respectively.
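As an illustration of the sub-word level idea, a small Keras sketch that feeds character sequences into an LSTM classifier; the architecture, toy data, and five-class output head are assumptions for illustration, not the authors' exact model.

```python
# Illustrative character-level (sub-word) LSTM classifier for code-mixed comments.
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["padam vera level", "trailer mokka", "nalla irukku"]   # toy code-mixed comments
labels = np.array([0, 1, 0])                                    # e.g. 0 = positive, 1 = negative

tok = Tokenizer(char_level=True)          # sub-word signal via character sequences
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts), maxlen=100)

model = Sequential([
    Embedding(input_dim=len(tok.word_index) + 1, output_dim=64),
    LSTM(64),
    Dense(5, activation="softmax"),       # five polarity classes, as in the shared task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=2, batch_size=2, verbose=0)
```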
2020
This paper describes the IRlab@IITBHU system for Dravidian-CodeMix FIRE 2020: Sentiment Analysis for the Dravidian language pairs Tamil-English (TA-EN) and Malayalam-English (ML-EN) in code-mixed text. We submitted three models for sentiment analysis of the code-mixed TA-EN and ML-EN datasets. Run-1 combined BERT with a Logistic Regression classifier, Run-2 used DistilBERT with a Logistic Regression classifier, and Run-3 used the fastText model to produce the results. Run-3 outperformed Run-1 and Run-2 on both datasets. We obtained an F1-score of 0.58, rank 8/14, for the TA-EN language pair, and for ML-EN an F1-score of 0.63 with rank 11/15.
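The fastText approach of Run-3 can be approximated with the fastText Python API as below; the training file contents, labels, and hyperparameters shown are illustrative, not the team's actual settings.

```python
# Sketch of a fastText supervised classifier; fastText expects one
# "__label__<class> <text>" example per line in the training file.
import fasttext

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("__label__positive padam vera level\n")
    f.write("__label__negative trailer romba mokka\n")

model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=25, wordNgrams=2)
print(model.predict("nalla padam"))   # -> predicted label(s) with probabilities
```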
2020
Sentiment analysis is a fast-growing research area that aims to uncover the underlying meaning of a text by categorizing it into different levels. This paper is an attempt to decode the deeply entangled code-mixed Malayalam and Tamil datasets and classify their underlying meaning at five levels. Along with the corpus creation, [1] propose a five-level classification for the Malayalam and Tamil code-mixed datasets. In this paper, we follow the five-level annotated datasets and aim to solve the classification problem by combining unigram and bigram features with a Multinomial Naive Bayes model. Our model achieves an F1-score of 0.55 for Tamil and 0.48 for Malayalam.
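A minimal sketch of the unigram + bigram Multinomial Naive Bayes setup, assuming scikit-learn's CountVectorizer and MultinomialNB; the toy data and smoothing value are illustrative.

```python
# Unigram + bigram counts feeding a Multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = ["padam vera level", "trailer romba mokka", "oru rakshayum illa", "nalla irukku"]
train_labels = ["positive", "negative", "negative", "positive"]

nb = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),   # unigram and bigram counts
    ("clf", MultinomialNB(alpha=1.0)),                  # Laplace smoothing
])
nb.fit(train_texts, train_labels)
print(nb.predict(["vera level padam"]))
```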
Forum for Information Retrieval Evaluation
Sentiment analysis of Dravidian languages has received attention in recent years. However, most social media text is code-mixed, and there is no research available on the sentiment analysis of code-mixed Dravidian languages. Dravidian-CodeMix-FIRE 2020 (https://dravidian-codemix.github.io/2020/), a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem. Two language tracks, Tamil and Malayalam, were created as part of Dravidian-CodeMix-FIRE 2020. The goal of this shared task was to classify the sentiment of a given code-mixed comment (from YouTube) into five classes: positive, negative, neutral, mixed-feeling, and not in the intended language. The performance of the systems (developed by participants) was evaluated in terms of weighted F1-score.
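Since systems are ranked by weighted F1, here is a short sketch of how that metric is computed with scikit-learn; the gold and predicted labels shown are invented examples.

```python
# Weighted F1 averages per-class F1 scores weighted by each class's support.
from sklearn.metrics import f1_score

gold = ["positive", "negative", "neutral", "positive", "mixed_feelings"]
pred = ["positive", "neutral",  "neutral", "positive", "negative"]

print(f1_score(gold, pred, average="weighted"))
```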
2020
There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data from which to create models specific to it. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese are available, and there are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. The corpus obtained a Krippendorff's alpha above 0.8. We use this new corpus to provide the benchmark for sentiment analysis in...
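A sketch of how an agreement score like the reported Krippendorff's alpha can be computed, assuming the `krippendorff` Python package and integer-coded sentiment labels; the annotation matrix below is invented for illustration.

```python
# Rows are annotators, columns are comments; np.nan marks items an annotator skipped.
import numpy as np
import krippendorff

reliability_data = [
    [0, 1, 2, 0, np.nan],   # annotator 1 (0/1/2 = integer-coded sentiment labels)
    [0, 1, 2, 1, 2],        # annotator 2
    [0, 1, 2, 0, 2],        # annotator 3
]
print(krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```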
Journal of Information Technology Management (JITM), 2023
Extracting sentiments from English-Telugu code-mixed data can be challenging and is still a relatively new research area. The data obtained from the Twitter API is English-Telugu code-mixed text that is free-form and noisy, with lexicon borrowings, phonetic typing, and misspellings. The initial step is language identification, with sentiment class labels assigned to each tweet in the dataset. The second step is data normalization, and the final step is classification, which can be achieved using three different methods: lexicon-based, machine learning, and deep learning. In the lexicon-based approach, each tweet is tokenized along with its language tags; if a token is tagged as Telugu, its Roman script is transliterated into the native Telugu word, which is verified against TeluguSentiWordNet to extract the Telugu sentiment, while English SentiWordNet is used to extract sentiments from the English tokens. In this paper, an aspect-based sentiment analysis approach is suggested and applied to the normalized data. In addition, deep learning and machine learning techniques are applied to extract sentiment ratings, and the results are compared to prior work.
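A rough sketch of the lexicon-based step described above; the lexicon dictionaries and the transliteration helper are hypothetical stand-ins for the real TeluguSentiWordNet / English SentiWordNet lookups.

```python
# Hypothetical polarity lexicons; real lookups would query (Telugu)SentiWordNet.
TELUGU_LEXICON = {"బాగుంది": 1.0, "చెడ్డ": -1.0}
ENGLISH_LEXICON = {"good": 0.8, "bad": -0.7}

def transliterate_to_telugu(token: str) -> str:
    """Placeholder for Roman-to-Telugu transliteration (e.g. 'baagundi' -> 'బాగుంది')."""
    return {"baagundi": "బాగుంది", "chedda": "చెడ్డ"}.get(token, token)

def tweet_polarity(tokens_with_tags):
    """Score a language-tagged tweet by summing lexicon polarities per token."""
    score = 0.0
    for token, lang in tokens_with_tags:
        if lang == "te":
            score += TELUGU_LEXICON.get(transliterate_to_telugu(token), 0.0)
        else:
            score += ENGLISH_LEXICON.get(token.lower(), 0.0)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(tweet_polarity([("movie", "en"), ("baagundi", "te")]))   # -> positive
```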
2020
In the era of technology, people express their opinions on social media platforms very frequently, and these opinions are mostly expressed in regional languages, so much of the generated content is in regional languages. Sentiment Analysis (SA) is a natural language processing task defined as finding the opinion (positive, negative, or neutral) of the writer about specific entities; it involves analyzing a person's emotions, feelings, and attitudes towards the content. This paper gives a comparative analysis of sentiment analysis performed in various Indian languages, covering classification techniques based on lexicons, dictionaries, and machine learning. It also gives a list of lexical resources available for performing Sentiment Analysis (SA) of Indian languages and the challenges of developing lexical resources for low-resourced Indian languages.
2021
This paper discusses our participation in the "Sentiment Analysis in Dravidian-CodeMix" (Dravidian-CodeMix) and "Hate Speech and Offensive Content Identification in Indo-European Languages" (HASOC) FIRE 2020 tasks of identifying subjective opinions or reactions on a given topic. Several techniques are applied for sentiment analysis, including recent word embedding based methods. BERT, Word2Vec, and ELMo are currently among the most promising and ready-to-use word embedding methods that can convert words into meaningful vectors. We used the BERT_BASE model for sentiment classification of the Dravidian-CodeMix data, and for the HASOC task our team submitted BERT-based systems for both sub-tasks in three languages: Hindi, English, and German. We report our approach and results, which are promising.
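A minimal sketch of a BERT_BASE classification setup with the Hugging Face transformers API; the multilingual checkpoint name and five-label head are assumptions for illustration, not necessarily the authors' configuration.

```python
# BERT encoder with a sequence-classification head for code-mixed sentiment.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"   # an illustrative BERT_BASE-sized checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

batch = tokenizer(["padam vera level", "trailer romba mokka"],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))   # predicted class ids (the head is untrained here)
```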
2024 IEEE International Conference on Contemporary Computing and Communications (InC4), 2024
In the era of digital communication, understanding public sentiment is crucial; however, sentiment analysis tools for less common languages like Marathi are limited. This paper introduces a machine learning and deep learning approach to Marathi sentiment analysis using SenticNet. We collected and pre-processed data, adapted SenticNet for Marathi, and designed language-specific sentiment analysis models. Leveraging techniques such as tokenization, text cleaning, and feature extraction, we effectively classified Marathi text into positive, negative, and neutral sentiments. Our library's performance is evaluated against existing tools, showcasing its accuracy and sensitivity to Marathi sentiment nuances. This work not only enhances Marathi sentiment analysis but also offers insights into adapting resources for non-English languages. By sharing our methodology and library, we encourage further research in regional languages, promoting sentiment analysis in diverse linguistic landscapes. The significance of this research lies in the creation of a robust Marathi sentiment analysis tool, facilitating deeper understanding and analysis of sentiments in underrepresented languages and serving as a catalyst for future advancements in sentiment analysis across linguistic boundaries.
2024
Sentiment analysis of Hindi text is a growing area of research, aiming to understand and categorize the emotions expressed in written content in the Hindi language. Because there is a lot of information on the internet in Indian languages like Hindi, Malayalam, Punjabi, Gujarati, Bengali, and others, it is very important to study and find useful and important information in this data. This survey paper offers a summary of the latest progress and challenges in sentiment analysis specifically tailored for Hindi text. There are four main computational intelligence techniques for extracting sentiment from Hindi text, namely Machine Learning, Deep Learning, Lexicon-based, and Hybrid techniques; in this survey we concentrate on Machine Learning and Deep Learning techniques. The paper discusses sentiment analysis and its levels, different machine learning models with their features, and the whole process for obtaining sentiment using machine learning. Furthermore, the paper highlights the challenges associated with sentiment analysis in Hindi, such as the lack of standardized resources, code-mixing, and dialectal variations.
ArXiv, 2018
This paper reports on our work in the NLP Tool Contest @ICON-2017, shared task on Sentiment Analysis for Indian Languages (SAIL) (code-mixed). To implement our system, we used a machine learning algorithm called Multinomial Naïve Bayes trained using n-gram and SentiWordNet features. We also used a small SentiWordNet for English and a small SentiWordNet for Bengali, but we did not use any SentiWordNet for the Hindi language. We tested our system on the Hindi-English and Bengali-English code-mixed social media datasets released for the contest. The performance of our system is very close to that of the best system that participated in the contest. For both the Bengali-English and Hindi-English runs, our system was ranked 3rd out of all submitted runs and was awarded the 3rd prize in the contest.
International Journal of Advanced Computer Science and Applications, 2022
A comprehensive review of sentiment analysis for code-mixed and code-switched text corpora of Indian social media using machine learning (ML) approaches, based on recent research studies, is presented in this paper. Code-mixing and code-switching are linguistic behaviours shown by bilingual/multilingual populations, primarily in spoken but also in written communication, especially on social media. Code-mixing involves combining lower linguistic units like words and phrases of one language into the sentences of another language (the base language), while code-switching involves switching to another language for the length of one sentence or more. In code-mixing and switching, a bilingual person takes one or more words or phrases from one language and introduces them into another language while communicating in that language in spoken or written mode. People nowadays express their views and opinions on several issues on social media. In multilingual countries, people express their views using English as well as their native languages. Several reasons can be attributed to code-mixing: lack of knowledge of one language on a particular subject, empathy, interjection, and clarification are a few of them. Sentiment analysis of monolingual social media content has been carried out for the last two decades. However, during recent years, Natural Language Processing (NLP) research focus has also shifted towards the exploration of code-mixed data, making code-mixed sentiment analysis an evolving field of research. Systems have been developed using ML techniques to predict the polarity of code-mixed text corpora and to fine-tune existing models to improve their performance.
2020
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
ArXiv, 2021
We present the results of the Dravidian-CodeMix shared task held at FIRE 2021, a track on sentiment analysis for Dravidian languages in code-mixed text. We describe the task, its organization, and the submitted systems. This shared task is the continuation of last year's Dravidian-CodeMix shared task held at FIRE 2020. This year's tasks included code-mixing at the intra-token and inter-token levels. Additionally, apart from Tamil and Malayalam, Kannada was also introduced. We received 22 systems for Tamil-English, 15 systems for Malayalam-English, and 15 for Kannada-English. The top systems for Tamil-English, Malayalam-English, and Kannada-English scored weighted average F1-scores of 0.711, 0.804, and 0.630, respectively. In summary, the quality and quantity of the submissions show that there is great interest in Dravidian languages in the code-mixed setting and that the state of the art in this domain still needs improvement.
2020
Social media has penetrated multilingual societies; however, most users prefer English as the language for communication. It is therefore natural for them to mix their native language with English during conversations, resulting in an abundance of multilingual data, called code-mixed data, available in today's world. Downstream NLP tasks using such data are challenging because the semantics are spread across multiple languages. One such natural language processing task is sentiment analysis; for this, we use an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.
2018
The Sentiment Analysis for Indian Languages (SAIL)-Code Mixed tools contest aimed at identifying the sentence-level sentiment polarity of code-mixed datasets of Indian language pairs (Hi-En, Ben-Hi-En). The Hi-En dataset is henceforth referred to as HI-EN and the Ben-Hi-En dataset as BN-EN. For this, we submitted four models for sentiment analysis of the code-mixed HI-EN and BN-EN datasets. The first model was an ensemble voting classifier consisting of three classifiers - linear SVM, logistic regression, and random forests - while the second one was a linear SVM. Both models used TF-IDF feature vectors of character n-grams, where n ranged from 2 to 6. We used the scikit-learn (sklearn) machine learning library for implementing both approaches. Run1 was obtained from the voting classifier and Run2 used the linear SVM model for producing the results. Out of the four submitted outputs, Run2 outperformed Run1 on both datasets. We finished first in the contest for both HI-EN with an...
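A scikit-learn sketch of the described ensemble: character n-gram TF-IDF features (n = 2 to 6) feeding a voting classifier over linear SVM, logistic regression, and random forests; the toy data and the hard-voting choice are illustrative.

```python
# Character n-gram TF-IDF features feeding a voting ensemble of three classifiers.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

ensemble = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6))),
    ("vote", VotingClassifier(
        estimators=[
            ("svm", LinearSVC()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="hard",   # LinearSVC has no predict_proba, so hard voting is used here
    )),
])

train_texts = ["yeh movie acchi hai", "bohot bakwas film", "khub bhalo chilo", "ekdum faltu"]
train_labels = ["positive", "negative", "positive", "negative"]
ensemble.fit(train_texts, train_labels)
print(ensemble.predict(["kya mast movie hai"]))
```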
International Journal on Natural Language Computing
The evolution of information technology has led to the collection of large amounts of data, the volume of which has increased to the extent that the data produced in the last two years is greater than all the data ever recorded before in human history. This has necessitated the use of machines to understand, interpret, and apply data without manual involvement. A lot of this text is available in transliterated, code-mixed form, which due to its complexity is very difficult to analyze. The work already performed in this area is progressing at a great pace, and this work hopes to push it further. The designed system classifies transliterated (Romanized) Hindi and Marathi text documents automatically using supervised learning methods (K-Nearest Neighbours (KNN), Naïve Bayes, and Support Vector Machines (SVM)) and ontology-based classification, and the results are compared in order to decide which methodology is better suited to handling these documents. As we will see, the plain machine learning approaches perform just as well as, and in many cases much better than, the more analytical approach.
Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications, 2021
In a multilingual society, people communicate in more than one language, leading to code-mixed data. Sentiment analysis on Code-Mixed Telugu-English Text (CMTET) poses unique challenges. The unstructured nature of code-mixed data is due to informal language, informal transliterations, and spelling errors. In this paper, we introduce an annotated dataset for sentiment analysis in CMTET. We also report an accuracy of 80.22% on this dataset using a novel unsupervised data normalization technique with a Multilayer Perceptron (MLP) model. This proposed data normalization technique can be extended to any NLP task involving CMTET. Further, we report an increase of 2.53% in accuracy due to this data normalization approach in our best model.
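A simplified sketch of an MLP classifier over normalized CMTET-style text; the normalization function here (lower-casing and collapsing repeated characters) is a toy stand-in for the paper's unsupervised normalization technique, and the data is invented.

```python
# Normalize noisy code-mixed text, then classify with a small MLP over TF-IDF features.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

def normalize(text: str) -> str:
    text = text.lower()
    return re.sub(r"(.)\1{2,}", r"\1\1", text)   # "superrrrr" -> "superr"

train_texts = [normalize(t) for t in
               ["Cinema chala baagundiiii", "Worst movie andi", "Superrrr anna", "Asalu bagoledu"]]
train_labels = ["positive", "negative", "positive", "negative"]

mlp = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)),
])
mlp.fit(train_texts, train_labels)
print(mlp.predict([normalize("chala baagundi")]))
```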
2020
The increasing use of social media and online shopping is generating a lot of text data that consists of sentiments or opinions about anything and everything available on these platforms. Users usually use the Roman script to pen their sentiments in their own language, in addition to using English words, due to the technological limitations of using their native scripts. Sentiment Analysis (SA), an automatic way of analyzing these sentiments, is gaining popularity, as analyzing them manually is challenging due to the huge size of the texts and the language used in them. In this paper, we, team MUCS, have proposed a SA model and submitted it to the ’Sentiment analysis of Dravidian languages in CodeMixed Text’ shared task at FIRE 2020 to analyze Tamil-English and Malayalam-English code-mixed texts. The proposed approach uses a Hybrid Voting Classifier (HVC) by combining Machine Learning (ML) models using word embeddings and n-gram features extracted from sentences with Deep Learning (DL) mode...
Dravidian-CodeMix-FIRE2020, 2020
Theedhum Nandrum is a sentiment polarity detection system using two approaches: a Stochastic Gradient Descent (SGD) based classifier and a Long Short-Term Memory (LSTM) based classifier. Our approach utilises language features like the use of emoji, choice of script, and code-mixing, which appeared quite marked in the datasets specified for the Dravidian-CodeMix-FIRE 2020 task. The hyperparameters for the SGD classifier were tuned using GridSearchCV. Our system was ranked 4th in Tamil-English with a weighted average F1-score of 0.62 and 9th in Malayalam-English with a score of 0.65. We achieved a weighted average F1-score of 0.77 for Tamil-English using a Logistic Regression based model after the task deadline; this performance betters the top-ranked classifier on this dataset by a wide margin. Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a novel application of Soundex. Our complete code is published on GitHub at https://github.com/oligoglot/theedhum-nandrum.
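A sketch of the SGD classifier with GridSearchCV tuning mentioned above, using scikit-learn; the feature extraction, parameter grid, and toy data are illustrative and not the Theedhum Nandrum configuration.

```python
# Grid-search over SGDClassifier hyperparameters, scored by weighted F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4))),
    ("sgd", SGDClassifier(random_state=42)),
])
param_grid = {
    "sgd__loss": ["hinge", "modified_huber"],
    "sgd__alpha": [1e-4, 1e-5],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)

train_texts = ["padam vera level 😍", "trailer romba mokka 😡", "semma padam",
               "mokka piece", "nalla irukku", "waste padam"]
train_labels = ["positive", "negative", "positive", "negative", "positive", "negative"]
search.fit(train_texts, train_labels)
print(search.best_params_, search.best_score_)
```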