Papers by Pratik Ratadiya

2021 International Conference on Data Mining Workshops (ICDMW), Dec 1, 2021
Document Classification has a wide range of applications in various domains like Ontology Mapping, Sentiment Analysis, Topic Categorization and Document Clustering, to mention a few. Unlike Text Classification, Document Classification works with longer sequences that typically contain multiple paragraphs. Previous approaches for this task have achieved promising results, but have often relied on complex recurrence mechanisms that are expensive and time-consuming in nature. Recently, self-attention based models like Transformers and BERT have achieved state-of-the-art performance on several Natural Language Understanding (NLU) tasks, but owing to the quadratic computational complexity of the self-attention mechanism with respect to the input sequence length, these approaches are generally applied to shorter text sequences. In this paper, we address this issue by proposing a new Transformer-based Hierarchical Encoder approach for the Document Classification task. The hierarchical framework we adopt helps us extend the self-attention mechanism to long-form text modelling, thereby reducing the complexity considerably. We use the Bidirectional Transformer Encoder (BTE) at the sentence level to generate a fixed-size sentence embedding for each sentence in the document. A document-level Transformer Encoder is then used to model the global document context and learn the inter-sentence dependencies. We also carry out experiments with the BTE in a feature-extraction and a fine-tuning setup, allowing us to evaluate the trade-off between computation power and accuracy. Furthermore, we conduct ablation experiments and evaluate the impact of different pre-training strategies on the overall performance. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on two standard benchmark datasets.
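A minimal PyTorch sketch of the hierarchical scheme this abstract describes, assuming a BERT-style checkpoint as the sentence-level encoder; the class name, layer counts and mean-pooling choice are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalDocClassifier(nn.Module):
    def __init__(self, num_classes, sent_model="bert-base-uncased",
                 doc_layers=2, doc_heads=8):
        super().__init__()
        # Sentence-level bidirectional encoder (stand-in for the paper's BTE).
        self.sentence_encoder = AutoModel.from_pretrained(sent_model)
        d_model = self.sentence_encoder.config.hidden_size
        # Document-level Transformer over sentence embeddings.
        doc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=doc_heads,
                                               batch_first=True)
        self.document_encoder = nn.TransformerEncoder(doc_layer, num_layers=doc_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (num_sentences, max_sentence_len) for one document.
        # Each sentence is encoded independently; its [CLS]-position vector
        # serves as a fixed-size sentence embedding.
        sent_out = self.sentence_encoder(input_ids=input_ids,
                                         attention_mask=attention_mask)
        sent_emb = sent_out.last_hidden_state[:, 0, :]          # (num_sentences, d_model)
        # Self-attention here is quadratic in the number of sentences rather
        # than the number of tokens, which is what keeps long documents tractable.
        doc_ctx = self.document_encoder(sent_emb.unsqueeze(0))  # (1, num_sentences, d_model)
        return self.classifier(doc_ctx.mean(dim=1))             # (1, num_classes)
```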

2021 International Conference on Data Mining Workshops (ICDMW), Dec 1, 2021
Social media-specific Sentiment Analysis has a wide range of applications in various domains like Business Intelligence, Marketing, Politics and Psychology, to mention a few. Irony Detection and Emotion Recognition, two of Sentiment Analysis' significant pillars, have become increasingly important as a result of the continued growth of social media. Previous approaches for the two tasks have yielded promising results, but have often relied on recurrence and pre-trained word-embedding ensembles. In this paper, we propose two novel contextual embedding-based approaches for Irony Detection and Emotion Recognition. We leverage social media-specific pre-training in the form of BERTweet, a language model pre-trained on English Tweets, along with either a Convolutional Neural Network or a Transformer Encoder. We empirically show that the addition of Convolutional Neural Networks or a Transformer Encoder results in improved performance when compared to a vanilla BERTweet model. Furthermore, we compare CNNs and the Transformer Encoder as feature extractors, assessing the trade-off between the number of learnable parameters and performance. Finally, we also investigate the impact of partial and complete fine-tuning and analyze the trade-off between computational power and accuracy in the process. Experimental results demonstrate that our proposed methods achieve state-of-the-art performance on two standard benchmark datasets.
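A hedged sketch of the BERTweet-plus-CNN variant named in the abstract (the Transformer Encoder variant is analogous); the filter count and kernel sizes are assumptions for demonstration, not the paper's tuned values.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BERTweetCNN(nn.Module):
    def __init__(self, num_classes, n_filters=128, kernel_sizes=(3, 4, 5)):
        super().__init__()
        # vinai/bertweet-base is the public BERTweet checkpoint.
        self.bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
        hidden = self.bertweet.config.hidden_size
        # One Conv1d per kernel size over the token dimension, TextCNN-style.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), num_classes)

    def forward(self, input_ids, attention_mask):
        tokens = self.bertweet(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        x = tokens.transpose(1, 2)                              # (batch, hidden, seq_len)
        # ReLU then max-over-time pooling for each convolutional branch.
        pooled = [torch.relu(conv(x)).amax(dim=2) for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))
```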

medRxiv (Cold Spring Harbor Laboratory), Apr 11, 2023
Sepsis is a major cause of morbidity and mortality worldwide, and is caused by bacterial infection in a majority of cases. However, fungal sepsis often carries a higher mortality rate, due both to its prevalence in immunocompromised patients and to delayed recognition. Using chest x-rays, associated radiology reports, and structured patient data from the MIMIC-IV clinical dataset, the authors present a machine learning methodology to differentiate between bacterial, fungal, and viral sepsis. Model performance shows AUCs of 0.81, 0.83, and 0.79 for detecting bacterial, fungal, and viral sepsis, respectively, with the best performance achieved using embeddings from image reports and structured clinical data. By improving early detection of an often missed causative septic agent, predictive models could facilitate earlier treatment of non-bacterial sepsis with a resultant associated mortality reduction.
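A minimal late-fusion sketch of the best-performing setup described above: report-text embeddings concatenated with structured clinical features, feeding a simple classifier. The embedding source, feature layout and classifier are placeholders, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_features(report_embeddings, structured):
    # report_embeddings: (n_patients, d) array from any text encoder over the
    # radiology reports; structured: (n_patients, k) vitals/labs/demographics.
    return np.concatenate([report_embeddings, structured], axis=1)

# Multinomial logistic regression over the three septic-agent classes
# (bacterial / fungal / viral), as a stand-in classifier.
clf = LogisticRegression(max_iter=1000)
# clf.fit(build_features(train_emb, train_struct), train_labels)
# probs = clf.predict_proba(build_features(test_emb, test_struct))
```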

ArXiv, 2019
Forums play an important role in providing a platform for community interaction. The introduction of irrelevant content or spam by individuals for commercial and social gains tends to degrade the professional experience presented to the forum users. Automated moderation of the relevancy of posted content is desired. Machine learning is used for text classification and finds applications in spam email detection, fraudulent transaction detection, etc. The balance of classes in training data is essential in the case of classification algorithms to make the learning efficient and accurate. However, in the case of forums, the spam content is sparse compared to the relevant content, giving rise to a bias towards the latter while training. A model trained on such biased data will fail to classify a spam sample. An approach based on the Synthetic Minority Over-sampling Technique (SMOTE) is presented in this paper to tackle imbalanced training data. It involves synthetically creating new minority class samples.
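A small, self-contained illustration of SMOTE-based rebalancing using the imbalanced-learn library; the toy dataset stands in for vectorized forum posts, and the paper's actual preprocessing and classifier are not shown.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset: 95% relevant content, 5% spam.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE interpolates between a minority sample and its nearest minority-class
# neighbours to synthesize new minority-class points.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```
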
This paper describes our approach for Task 9 of SemEval 2021: Statement Verification and Evidence Finding with Tables. We participated in both subtasks, namely statement verification and evidence finding. For the subtask of statement verification, we extend the TAPAS model to adapt to the ‘unknown’ class of statements by fine-tuning it on an augmented version of the task data. For the subtask of evidence finding, we fine-tune the DistilBERT model in a Siamese setting.
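A hedged sketch of a Siamese DistilBERT setup for evidence finding: the statement and each table cell's text are embedded by one shared encoder and compared by cosine similarity. The model checkpoint matches the abstract; the pooling and scoring choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    # Shared ("Siamese") encoder: the same weights embed both sides.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch).last_hidden_state
    return out[:, 0, :]  # first-token vector as the sentence embedding

with torch.no_grad():
    statement = embed(["Revenue grew in 2020."])
    cells = embed(["Revenue: 1.2M (2020)", "Employees: 40"])
    scores = F.cosine_similarity(statement, cells)  # higher = stronger evidence
```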

Data privacy and sharing have always been critical issues when trying to build complex deep learning-based systems to model data. Facilitating a decentralized approach that can benefit from data across multiple nodes without needing to physically merge their data contents has been an area of active research. In this paper, we present a solution to benefit from a distributed data setup when training deep learning architectures by making use of a smart contract system. Specifically, we propose a mechanism that aggregates the intermediate representations obtained from local ANN models over a blockchain. Training of local models takes place on their respective data. The intermediate representations derived from them, when combined and trained together on the host node, help to obtain a more accurate system. While federated learning primarily deals with settings in which data with the same features has its samples distributed across multiple nodes, here we deal with data whose features are split across the nodes.
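A toy sketch of the aggregation idea only, with the blockchain transport omitted: each node trains a local network on its own feature slice and shares just an intermediate representation, which the host concatenates and feeds to a trainable head. All dimensions and the two-node split are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    """Per-node network; only its output representation ever leaves the node."""
    def __init__(self, in_dim, rep_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, rep_dim))

    def forward(self, x):
        return self.net(x)

node_a, node_b = LocalEncoder(in_dim=10), LocalEncoder(in_dim=6)
head = nn.Linear(16 * 2, 2)  # host-side classifier over combined representations

# The same 8 samples, with their features split between the two nodes.
x_a, x_b = torch.randn(8, 10), torch.randn(8, 6)
combined = torch.cat([node_a(x_a), node_b(x_b)], dim=1)
logits = head(combined)  # in practice the representations would arrive via the chain
```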

Question Paraphrase Identification (QPI) is a critical task for large-scale Question-Answering forums. The purpose of QPI is to determine whether a given pair of questions is semantically identical or not. Previous approaches for this task have yielded promising results, but have often relied on complex recurrence mechanisms that are expensive and time-consuming in nature. In this paper, we propose a novel architecture combining a Bidirectional Transformer Encoder with Convolutional Neural Networks for the QPI task. We produce the predictions from the proposed architecture using two different inference setups: Siamese and Matched Aggregation. Experimental results demonstrate that our model achieves state-of-the-art performance on the Quora Question Pairs dataset. We empirically prove that the addition of convolution layers to the model architecture improves the results in both inference setups. We also investigate the impact of partial and complete fine-tuning and analyze the trade-off between computational power and accuracy in the process.
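A sketch contrasting the two inference setups named above, using a generic BERT encoder; the paper's added convolutional layers are omitted for brevity, and the similarity scoring in the Siamese branch is an illustrative assumption.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

q1, q2 = "How do I learn Python?", "What is the best way to study Python?"

# Matched Aggregation: both questions in one sequence, so cross-attention
# between them happens inside the encoder; a classifier head would read [CLS].
joint = tok(q1, q2, return_tensors="pt")
cls_joint = enc(**joint).last_hidden_state[:, 0, :]

# Siamese: each question encoded independently by the shared encoder, then a
# similarity function (or a small classifier) compares the two embeddings.
e1 = enc(**tok(q1, return_tensors="pt")).last_hidden_state[:, 0, :]
e2 = enc(**tok(q2, return_tensors="pt")).last_hidden_state[:, 0, :]
similarity = F.cosine_similarity(e1, e2)
```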

In this paper, we present a study of the recent advancements which have helped bring Transfer Learning to NLP through the use of semi-supervised training. We discuss cutting-edge methods and architectures such as BERT, GPT, ELMo and ULMFiT, among others. Classically, tasks in natural language processing have been performed through rule-based and statistical methodologies. However, owing to the vast nature of natural languages, these methods do not generalise well and fail to learn the nuances of language. Thus machine learning algorithms such as Naive Bayes and decision trees, coupled with traditional models such as Bag-of-Words and N-grams, were used to overcome this problem. Eventually, with the advent of advanced recurrent neural network architectures such as the LSTM, we were able to achieve state-of-the-art performance in several natural language processing tasks such as text classification and machine translation. We talk about how Transfer Learning has brought about the well-known ImageNet moment to NLP.

TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON)
The rise in the number of active online users has subsequently increased the number of cyber abuse incidents being reported as well. Such events pose a threat to the privacy and liberty of users in the digital space. Conventionally, manual moderation and reporting mechanisms have been used to ensure that no such text is present online. However, there have been some flaws in this method, including dependency on humans, increased delays and reduced data privacy. Previous approaches to automate this process have involved using supervised machine learning and traditional recurrent sequence models, which tend to perform poorly on non-English text. Given the rising diversity of users in the cyberspace, a flexible solution able to accommodate multilingual text is the need of the hour. Furthermore, text in colloquial languages often holds pertinent context and emotion that is lost after translation. In this paper, we propose a deep learning-based approach which involves the use of the bidirectional transformer-based BERT architecture for cyber abuse detection across English, Hindi and code-mixed Hindi-English (Hinglish) text. The proposed architecture achieves state-of-the-art results on the code-mixed Hindi dataset in the TRAC-1 standard aggression identification task, while also achieving very good results on the English task leaderboard. These results are achieved without using any ensemble-based methods or multiple models, and thus prove to be a better alternative to the existing approaches. Deep learning-based models which perform well on multilingual text can handle a broader range of inputs and thus can prove crucial in cracking down on such social evils.
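A generic fine-tuning sketch for multilingual aggression classification with a BERT encoder. The multilingual checkpoint is an illustrative stand-in (the paper does not name its exact checkpoint here); the three labels correspond to TRAC-1's aggression classes, and the example tweet is hypothetical.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)  # TRAC-1: OAG / CAG / NAG

# One code-mixed Hinglish example; the same pipeline handles English and Hindi.
batch = tok(["tum bahut bure ho yaar"], truncation=True, padding=True,
            return_tensors="pt")
labels = torch.tensor([1])

out = model(**batch, labels=labels)
out.loss.backward()  # out.loss drives fine-tuning; out.logits gives predictions
```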

2019 International Conference on Data Mining Workshops (ICDMW)
The amount of user-generated content in the cyberspace keeps increasing in the 21st century. However, it has also meant an increase in the number of cyber abuse and bullying incidents being reported. Use of profane text by individuals threatens the liberty and integrity of the digital space. Manual moderation and reporting mechanisms have traditionally been used to keep a check on such profane text. Dependency on human interpretation and delay in results have been the biggest obstacles in this system. Previous deep learning-based approaches to automate the process have involved the use of traditional convolution and recurrence based sequential models. However, these models tend to be computationally expensive and have higher memory requirements. Further, they tend to produce state-of-the-art results in binary classification but perform relatively poorly on multilabel tasks, owing to less flexibility in architecture. In today's world, classifying text in a binary way is no longer sufficient, and thus a flexible solution able to generalize well on multilabel text is the need of the hour. In this paper, we propose a multi-head attention-based approach for the detection of profane text. We couple our model with power-weighted average ensembling techniques to further improve the performance. The proposed approach has no additional memory requirement and is less complex compared to previous approaches. The improved results obtained by our model on publicly available real-world data further validate the same. Flexible, lightweight models which can handle multilabel text well can prove crucial in cracking down on social evils in the digital space.
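A small sketch of power-weighted average ensembling as named above: each member's probabilities are raised to a power before weighted averaging, which sharpens confident predictions. The exponent, weights and toy probabilities are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def power_weighted_average(prob_list, weights, power=2.0):
    """prob_list: list of (n_samples, n_labels) probability arrays, one per model.
    Returns the power-weighted average of the members' label probabilities."""
    stacked = np.stack([w * (p ** power) for p, w in zip(prob_list, weights)])
    return stacked.sum(axis=0) / sum(weights)

# Two toy member models over two labels; for multilabel output, compare the
# ensembled scores against a per-label decision threshold.
p1 = np.array([[0.9, 0.2], [0.4, 0.7]])
p2 = np.array([[0.8, 0.1], [0.5, 0.6]])
ensembled = power_weighted_average([p1, p2], weights=[0.6, 0.4])
```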