Papers by Charibeth Cheng
Concept Paper: An Approach for utilizing Code Files and Diffs for Fine-grained Just-in-Time Defect Prediction
Academia Letters, 2021

A Museum Information System for Sustaining and Analyzing National Cultural Expressions
Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, 2019
Culture is an integral part of the social and economic development of a nation. In the Philippines, the National Commission for Culture and the Arts (NCCA) is the overall policy-making body for culture and arts development. NCCA has been experiencing difficulty in collecting and organizing large cultural data sets, which are vital in its decisions concerning culture and arts reforms. We propose a sustainable data collection and analysis method for NCCA, using the museum cultural domain as a case study. With the museum-centric information system, valuable information across the museum cultural domain can be obtained and later translated into information visualizations with correlation statistics and reports, showing variables that give context to the performance of a museum, of museums in an area, and of museums in the country. The museum information system may serve as the information system framework for other cultural domains.

Trac Dataset for Just-in-Time Defect Prediction
This dataset was created using the downloadable defect tickets from the Trac website as well as the source repository. Since many datasets do not include the actual commit IDs, utilizing semantic code features in JIT research is more challenging. This JIT dataset includes the actual commit IDs, allowing researchers to perform semantic feature extraction for JIT.
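As a rough illustration of how the included commit IDs enable semantic analysis, the sketch below pairs dataset rows with the diffs of their commits from a local clone. The column names (commit_id, is_defective), the CSV filename, and the repository path are hypothetical, not the dataset's actual schema.

```python
# Minimal sketch: pairing JIT defect labels with commit diffs for
# semantic feature extraction. Column names and paths are hypothetical.
import csv
import subprocess

def commit_diff(repo_path: str, commit_id: str) -> str:
    """Fetch the unified diff of a commit from a local clone."""
    return subprocess.run(
        ["git", "-C", repo_path, "show", "--format=", commit_id],
        capture_output=True, text=True, check=True,
    ).stdout

with open("trac_jit_dataset.csv", newline="") as f:
    for row in csv.DictReader(f):
        diff = commit_diff("trac-repo", row["commit_id"])
        # Feed (diff, label) pairs into any semantic feature extractor.
        print(row["commit_id"], row["is_defective"], len(diff.splitlines()))
```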
Just-in-time (JIT) defect prediction refers to the technique of predicting whether a code change is defective. Many contributions have been made in this area through the excellent dataset by Kamei. In this paper, we revisit the dataset and highlight preprocessing difficulties with it, as well as its limitations for unsupervised learning. Secondly, we propose certain features in the Kamei dataset that can be used for training models. Lastly, we discuss the limitations of the dataset's features.
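For context, here is a minimal sketch of training a JIT classifier on change-level metrics of the kind found in the Kamei dataset; the column names follow the metric abbreviations used in Kamei et al.'s work, and the CSV filename is a placeholder.

```python
# Minimal sketch: a JIT defect classifier over Kamei-style change
# metrics (lines added/deleted, developer experience, etc.).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("kamei_dataset.csv")  # hypothetical filename
features = ["ns", "nd", "nf", "entropy", "la", "ld", "lt",
            "fix", "ndev", "age", "nuc", "exp", "rexp", "sexp"]
X, y = df[features], df["buggy"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```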

Intelligent Dengue Infoveillance Using Gated Recurrent Neural Learning and Cross-Label Frequencies
2018 IEEE International Conference on Agents (ICA), 2018
With dengue becoming a major concern in tropical countries such as the Philippines, it is important that public health officials are able to accurately determine the presence and magnitude of dengue activity as quickly as possible to facilitate fast emergency response. The prevalence of massive streams of publicly available data from social media makes this possible through infoveillance. Infoveillance involves observing and analyzing online interactions to gather health-related data for informing decisions on public health. In this paper, we present a public health agent model that performs dengue infoveillance using a gated recurrent neural network classification model incorporated with pre-trained word embeddings and cross-label frequency calculation. We set up the agent to work on the Philippine Twitter stream as its primary environment. Further, we evaluate the agent's classification ability using a holdout set of human-labeled tweets. Afterwards, we run a historical simulation where the trained agent works with a stream of six months' worth of tweets from the Philippines, and we correlate its infoveillance results with actual dengue morbidity data for that time period. Experiments show that the agent is capable of accurately identifying dengue-related tweets with low loss. Moreover, we confirm that the agent model can be used for determining actual dengue activity and can serve as an early warning system with high confidence.
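A minimal sketch of the core classifier described, a GRU over frozen pretrained word embeddings, follows; the vocabulary size, dimensions, and zero-filled embedding matrix are placeholder assumptions, and the cross-label frequency component is omitted.

```python
# Minimal sketch: GRU tweet classifier with frozen pretrained word
# embeddings, in the spirit of the paper's model. Hyperparameters are
# placeholders; load real word vectors into embedding_matrix.
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 20000, 200
embedding_matrix = np.zeros((vocab_size, embed_dim))  # stand-in vectors

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # dengue-related or not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```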

Utilizing Tweet Content for the Detection of Sentiment-Based Interaction Communities on Twitter
2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 2018
Community detection is one way of extracting insights from voluminous Twitter data. Through this technique, Twitter users can be grouped into different types of communities such as those who interact a lot, or those who have similar sentiments about certain topics. However, most works do not utilize tweet content and simply use directly available information like Twitter follows. Hence, this work explores the incorporation of hashtags and sentiment analysis (also taking into account conversational context) in the input graph for community detection through various schemes. Evaluation was performed by investigating the modularity score, topic similarity/variety, and sentiment homogeneity of the resulting communities. Results suggest that when compared to a baseline graph based on mentions, a scoring approach is more likely to yield a different set of communities compared to the more popular edge-weighting approach. Insights gleaned from the study show the importance of other evaluation methods (depending on the end-goal) aside from usual quantitative metrics of community network structure, and that community detection in conjunction with topic modeling can be a tool for analyzing Twitter discourse.
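To make the graph construction concrete, here is a minimal sketch of weighting mention edges with content signals before running modularity-based community detection; the specific weighting rule and the toy interactions are illustrative assumptions, not the paper's exact schemes.

```python
# Minimal sketch: a weighted mention graph with content-aware edges,
# followed by modularity-maximizing community detection.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
# (user_a, user_b, hashtag_overlap, sentiment_agreement), hypothetical data
interactions = [("ana", "ben", 2, 0.8), ("ben", "cil", 0, 0.1),
                ("ana", "cil", 1, 0.9)]
for a, b, tags, senti in interactions:
    G.add_edge(a, b, weight=1 + tags + senti)  # content-aware edge weight

comms = community.greedy_modularity_communities(G, weight="weight")
print("communities:", [sorted(c) for c in comms])
print("modularity:", community.modularity(G, comms, weight="weight"))
```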
Use of Word and Character N-Grams for Low-Resourced Local Languages
2018 International Conference on Asian Language Processing (IALP), 2018
Language identification is a text classification task for identifying the language of a given text. Several works use this as a preprocessing technique prior to sentiment analysis, mood analysis, and named entity recognition, among others. Thus, building an accurate language identification engine is important, given that the Philippines is home to more than 170 languages yet is scarce in language documents and resources. We compare machine learning algorithms such as Naive Bayes, Linear Support Vector Machines (SVM), and Random Forest for classification of Philippine languages. Results show that the Linear SVM model had the best performance with a 0.97 F1-score.
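A minimal sketch of the best-performing setup, character n-grams feeding a linear SVM, is shown below; the tiny training sentences in Tagalog, Cebuano, and Ilocano are placeholders for real Philippine-language corpora.

```python
# Minimal sketch: language identification via character n-grams + LinearSVC.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["magandang umaga", "maayong buntag", "naimbag a bigat"]
labels = ["tgl", "ceb", "ilo"]  # Tagalog, Cebuano, Ilocano

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["maayong gabii"]))  # expected: "ceb"
```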
Paggamit ng Natural Language Processing bilang Gabay sa Pagtuklas at Pagsiyasat ng Tema sa mga Tweet tuwing Halalan / Using Natural Language Processing in the Discovery and Analysis of Themes of Tweets during Elections
Malay, 2015

Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation
ArXiv, 2020
Transformers represent the state-of-the-art in Natural Language Processing (NLP) in recent years, proving effective even in tasks done in low-resource languages. While pretrained transformers for these languages can be made, it is challenging to measure their true performance and capacity due to the lack of hard benchmark datasets, as well as the difficulty and cost of producing them. In this paper, we present three contributions: First, we propose a methodology for automatically producing Natural Language Inference (NLI) benchmark datasets for low-resource languages using published news articles. Through this, we create and release NewsPH-NLI, the first sentence entailment benchmark dataset in the low-resource Filipino language. Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino, benchmarking them on our dataset against other commonly-used transfer learning techniques. Lastly, we perform analyses on t...
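A simplified reading of the automatic NLI construction idea follows: consecutive sentences from one article become entailment pairs, while sentences drawn from different articles become negatives. The exact pairing and labeling rules of NewsPH-NLI may differ; the toy articles are illustrative.

```python
# Minimal sketch: building sentence-entailment pairs from news articles.
import random

articles = [
    ["The storm made landfall overnight.", "Classes were suspended in Manila."],
    ["The senate passed the bill.", "It now goes to the president."],
]

pairs = []
for art in articles:
    for s1, s2 in zip(art, art[1:]):
        pairs.append((s1, s2, "entailment"))          # same-article neighbors
        other = random.choice([a for a in articles if a is not art])
        pairs.append((s1, random.choice(other), "non-entailment"))

for premise, hypothesis, label in pairs:
    print(label, "|", premise, "->", hypothesis)
```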
Juris2vec: Building Word Embeddings from Philippine Jurisprudence
2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2021
In this research, we trained nine word embedding models on a large corpus containing Philippine Supreme Court decisions, resolutions, and opinions from 1901 through 2020. We evaluated their performance in terms of accuracy on a customized 4,510-question word analogy test set covering seven syntactic and semantic categories. Word2vec models fared better on semantic evaluators, while fastText models were more impressive on syntactic evaluators. We also compared our word vector models to another model trained on a large legal corpus from other countries.
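For illustration, a minimal sketch of evaluating a trained embedding on an analogy test set with gensim; the saved-model filename, the analogy file, and the example legal analogy are hypothetical.

```python
# Minimal sketch: word-analogy evaluation of a legal-domain embedding.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("juris2vec.kv")  # hypothetical saved model

# a : b :: c : ?  ->  vector(b) - vector(a) + vector(c)
print(vectors.most_similar(positive=["accused", "acquittal"],
                           negative=["plaintiff"], topn=3))

# Accuracy over an analogy file in the questions-words format,
# with ": section" headers per syntactic/semantic category.
score, sections = vectors.evaluate_word_analogies("legal_analogies.txt")
print("overall accuracy:", score)
```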

The Philippines is home to more than 150 languages, which are considered low-resource even among the major ones. This has resulted in a lack of effort to develop translation systems for the underrepresented languages. To simplify the process of developing translation systems for multiple languages, and to help improve the translation quality of zero- to low-resource languages, multilingual NMT has become an active area of research. However, existing works in multilingual NMT disregard the analysis of a multilingual model on a closely related and low-resource language group in the context of pivot-based translation and zero-shot translation. In this paper, we benchmark translation for several Philippine languages and provide an analysis of a multilingual NMT system for morphologically rich and low-resource languages in terms of its effectiveness in translating zero-resource languages with zero-shot translation. To further evaluate the capability of the multilingual NMT model i...
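For background, the standard way a single multilingual NMT model supports zero-shot directions is a target-language token prepended to the source sentence; whether the system studied uses exactly this scheme is an assumption, and the tags and sentences below are illustrative.

```python
# Minimal sketch: target-language-token tagging for multilingual NMT.
def tag_source(sentence: str, target_lang: str) -> str:
    """Prepend a target-language token so one shared model can be
    steered toward any output language, including unseen directions."""
    return f"<2{target_lang}> {sentence}"

# Directions seen in training data:
print(tag_source("Kumusta ka?", "ceb"))   # tgl -> ceb
print(tag_source("Kumusta ka?", "ilo"))   # tgl -> ilo
# Zero-shot at inference, a direction never seen in training (ceb -> ilo):
print(tag_source("Maayong buntag.", "ilo"))
```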

A tag cloud is a text-based visual representation of a set of tags, which usually depicts each tag's importance in a given text. The presentation and layout of tags can be controlled so that features such as size, font, and color give some measure of the importance of a given tag. Words that are used frequently are displayed with an increased font size, while tags may appear in uniform or varying colors for aesthetic or other purposes. The purpose of a tag cloud is to allow one to see, at a glance, the content of a document. Unfortunately, existing tag cloud generators produce clouds with tags that do not contribute to identifying the general content of a given document. These generators base the tags on their frequency in the document. Thus, there may be tags which are inflections of the same word, thereby populating the cloud with the same root word. Furthermore, there may be frequently occurring non-stopwords that are nevertheless non-discriminating (such as big...
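A minimal sketch of the alternative the abstract argues for: merging inflections via stemming and demoting frequent but non-discriminating words via tf-idf rather than raw frequency. The toy documents and English stemmer are placeholders.

```python
# Minimal sketch: tag selection by stemming + tf-idf instead of raw counts.
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

docs = ["The cats chased the cat toy", "Dogs bark and a dog barked loudly"]
stemmer = SnowballStemmer("english")

def stem_tokens(text):
    # Merge inflections ("cats", "cat") into one root-word tag.
    return [stemmer.stem(t) for t in text.lower().split()]

vec = TfidfVectorizer(tokenizer=stem_tokens)
tfidf = vec.fit_transform(docs)
# Tag weights for the first document: higher tf-idf -> larger font.
weights = dict(zip(vec.get_feature_names_out(), tfidf.toarray()[0]))
print(sorted(weights.items(), key=lambda kv: -kv[1])[:5])
```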

Question Generation (QG) is an important task in Natural Language Processing (NLP) that involves generating questions automatically when given a context paragraph. While many techniques exist for the task of QG, they employ complex model architectures, extensive features, and additional mechanisms to boost model performance. In this work, we show that transformer-based finetuning techniques can be used to create robust question generation systems using only a single pretrained language model, without the use of additional mechanisms, answer metadata, and extensive features. Our best model outperforms previous more complex RNN-based Seq2Seq models, with an 8.62 and a 14.27 increase in METEOR and ROUGE-L scores, respectively. We show that it also performs on par with Seq2Seq models that employ answer-awareness and other special mechanisms, despite being only a single-model system. We analyze how various factors affect the model's performance, such as input data formatting, the length ...
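A minimal sketch of the single-model finetuning idea: frame QG as continuing a "context, separator, question" string under a standard language-modeling loss. The separator tokens and base model here are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: QG as language-model finetuning on formatted strings.
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One training example: context and target question joined by a separator,
# so generating a question is just continuing past "<question>".
sample = ("Rizal was born in Calamba in 1861. <question> "
          "Where was Rizal born? <end>")
ids = tok(sample, return_tensors="pt").input_ids
loss = model(ids, labels=ids).loss  # standard next-token LM loss
print(float(loss))
```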
ArXiv, 2021
With software system complexity leading to a rise in software defects, research efforts have focused on techniques for predicting software defects, including just-in-time (JIT) defect prediction, which predicts whether a code change is defective. While features can be used to flag potentially defective code changes, the inspection effort remains significant. As a code change can impact several files, we investigate an open source project to identify potential gaps with features from the JIT perspective. In addition, given the lack of publicly available JIT datasets that link the features with actual commits, we also present a new dataset that can be utilized in JIT and semantic analysis.

The growth of social networking platforms such as Facebook and Twitter has bridged communication channels between people to share their thoughts and sentiments. However, along with the rapid growth and rise of the Internet, the idea of anonymity has also been introduced, wherein user identities are easily falsified and hidden. This presents difficulty for businesses seeking to deliver accurate advertisements to specific account demographics. As such, this study aims to identify the gender and age group of Filipino social media accounts by analyzing post contents. Several features will be considered and various techniques will be adopted to process posts written in English, Filipino, and Taglish (Tagalog interspersed with English). The study will implement these techniques and record their compatibility and performance in a Filipino setting. A computational model capable of gender and age classification will be built as the final product.

How can we organize voluminous amounts of news articles to facilitate better search options and analysis? We propose the use of natural language processing techniques, specifically information extraction and sentiment analysis, to allow easier data analysis on news articles and editorials. The proposed technique was tested on news documents written in Filipino. Grammar-based rules were formulated to extract pertinent information from the articles and were automated through bootstrapping. The extracted information includes the Filipino equivalent of the 5W user requirement proposed by Das et al. (2012), answering the questions who, what, when, where, and why. Subsequently, the articles related through the 5Ws were analyzed based on their sentiment. Both information extraction and sentiment analysis were done at the article level. Collective results were presented visually. In designing the user interface, we considered (1) how the user would be able to find the articles he is lookin...
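To illustrate the flavor of grammar-based 5W rules, a toy pattern-based extractor follows; the actual rules target Filipino grammar and are bootstrapped automatically, so the two regexes below are purely illustrative.

```python
# Minimal sketch: pattern-based extraction of "when" and "where" from a
# Filipino sentence, standing in for the paper's grammar-based 5W rules.
import re

RULES = {
    "when": re.compile(r"\bnoong\s+([\w\s,]+?\d{4})"),       # "noong <date>"
    "where": re.compile(r"\bsa\s+([A-Z][\w\s]+?)(?=[,.])"),  # "sa <Place>"
}

sentence = "Nagprotesta ang mga manggagawa sa Maynila, noong Mayo 1, 2015."
extracted = {w: m.group(1) for w, rx in RULES.items()
             if (m := rx.search(sentence))}
print(extracted)  # {'when': 'Mayo 1, 2015', 'where': 'Maynila'}
```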

The use of the internet as a fast medium of spreading fake news reinforces the need for computational tools that combat it. Techniques that train fake news classifiers exist, but they all assume an abundance of resources including large labeled datasets and expert-curated corpora, which low-resource languages may not have. In this work, we make two main contributions: First, we alleviate resource scarcity by constructing the first expertly-curated benchmark dataset for fake news detection in Filipino, which we call "Fake News Filipino." Second, we benchmark Transfer Learning (TL) techniques and show that they can be used to train robust fake news classifiers from little data, achieving 91% accuracy on our fake news dataset, reducing the error by 14% compared to established few-shot baselines. Furthermore, lifting ideas from multitask learning, we show that augmenting transformer-based transfer techniques with auxiliary language modeling losses improves their performance by adapting ...
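A schematic of the auxiliary language-modeling idea: one shared encoder with a classification head and an LM head whose losses are summed. A small recurrent encoder stands in here for the transformers used in the paper, and the loss weight is an assumption.

```python
# Minimal sketch: multitask training with a joint classification + LM loss.
import torch
import torch.nn as nn

class MultitaskClassifier(nn.Module):
    def __init__(self, vocab=30000, dim=256, n_classes=2, lm_weight=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.cls_head = nn.Linear(dim, n_classes)  # fake vs. real
        self.lm_head = nn.Linear(dim, vocab)       # next-token prediction
        self.lm_weight = lm_weight

    def forward(self, tokens, label):
        h, _ = self.encoder(self.embed(tokens))
        cls_loss = nn.functional.cross_entropy(self.cls_head(h[:, -1]), label)
        lm_logits = self.lm_head(h[:, :-1])
        lm_loss = nn.functional.cross_entropy(
            lm_logits.reshape(-1, lm_logits.size(-1)),
            tokens[:, 1:].reshape(-1))
        return cls_loss + self.lm_weight * lm_loss  # joint objective

model = MultitaskClassifier()
loss = model(torch.randint(0, 30000, (4, 16)), torch.randint(0, 2, (4,)))
loss.backward()
```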

While transformer-based finetuning techniques have proven effective in tasks that involve low-resource, low-data environments, a lack of properly established baselines and benchmark datasets makes it hard to compare different approaches aimed at tackling the low-resource setting. In this work, we provide three contributions. First, we introduce two previously unreleased datasets as benchmark datasets for text classification and low-resource multilabel text classification for the low-resource language Filipino. Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting. Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples is reduced. We analyze our pretrained models' degradation speeds and look towards the use of this method for comparing models aimed at operating within the low-resource setting. We release all our models and datasets for the rese...
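A minimal sketch of such a degradation test: retrain on shrinking fractions of the training set and track how quickly the score drops. The linear model and synthetic data are stand-ins for the paper's transformers and Filipino datasets.

```python
# Minimal sketch: degradation test over decreasing training-set fractions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for frac in [1.0, 0.5, 0.25, 0.1, 0.05]:
    n = max(int(len(X_tr) * frac), 10)
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"train frac {frac:>5}: n={n:<5} acc={acc:.3f}")
```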

A Model for Age and Gender Profiling of Social Media Accounts Based on Post Contents
Neural Information Processing, 2018
The growth of social networking platforms such as Facebook and Twitter has bridged communication channels between people to share their thoughts and sentiments. However, along with the rapid growth and rise of the Internet, the idea of anonymity has also been introduced, wherein user identities are easily falsified and hidden. This presents difficulty for businesses seeking to deliver accurate advertisements to specific account demographics. As such, this study searched for the best model to identify the gender and age group of Filipino social media accounts by analyzing post contents. Two model structures for the classifier, namely the stacked/combined structure and the parallel structure, were experimented on. Different types of features, including those based on socio-linguistics, grammar, characters, and words, were considered. The results show that different model structures, features, feature reduction methods, and classification algorithms apply to age classification and gender classification. The best model for classifying age was a Support Vector Classifier (SVC) with least absolute shrinkage and selection operator (Lasso), on a parallel model structure for Facebook and a combined model structure for Twitter. For gender classification, the best model for Facebook used a Ridge Classifier (RC), while the best model for Twitter used SVC, both utilizing Lasso on a parallel model structure. The features that were dominant in age classification for both Facebook and Twitter were word-based and socio-linguistic features and post time, while socio-linguistic features, specifically netspeak, were important in gender classification for both platforms. Based on the differences in the features affecting the performance of the models, Facebook and Twitter data must be analyzed separately, as the posts found on these two platforms differ significantly.
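A minimal sketch of the reported best setup for age classification, Lasso-based feature selection feeding an SVC; the synthetic features stand in for the paper's socio-linguistic, grammar, character, and word features.

```python
# Minimal sketch: Lasso feature selection + Support Vector Classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

model = make_pipeline(
    SelectFromModel(Lasso(alpha=0.01)),  # drop features Lasso zeroes out
    SVC(),
)
model.fit(X, y)
print("features kept:", model[0].get_support().sum())
```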