Spatial event forecasting from social media is an important problem but encounters critical challenges, such as dynamic patterns of features (keywords) and geographic heterogeneity (e.g., spatial correlations, imbalanced samples, and different populations in different locations). Most existing approaches (e.g., LASSO regression, dynamic query expansion, and burst detection) are designed to address some of these challenges, but not all of them. This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges. Specifically, given a collection of locations (e.g., cities), we propose to build forecasting models for all locations simultaneously by extracting and utilizing appropriate shared information that effectively increases the sample size for each location, thus improving the forecasting performance. We combine both static features derived from a predefined vocabulary by domain experts and dynamic features generated from dynamic query expansion in a multi-task feature learning framework; we investigate different strategies to balance homogeneity and diversity between static and dynamic terms. Efficient algorithms based on Iterative Group Hard Thresholding are developed to achieve efficient and effective model training and prediction. Extensive experimental evaluations on Twitter data from four different countries in Latin America demonstrated the effectiveness of our proposed approach.
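The group hard-thresholding step at the heart of Iterative Group Hard Thresholding can be illustrated with a minimal sketch. This is not the paper's implementation; the function name and the flat weight-list representation are assumptions for illustration. The idea: keep the k feature groups with the largest L2 norm and zero out the rest.

```python
import math

def group_hard_threshold(weights, groups, k):
    """Keep the k feature groups with the largest L2 norm; zero the rest.

    weights : list of floats, one per feature
    groups  : list of lists of feature indices (a partition of the features)
    k       : number of groups to retain
    """
    norms = [math.sqrt(sum(weights[i] ** 2 for i in g)) for g in groups]
    keep = set(sorted(range(len(groups)), key=lambda j: -norms[j])[:k])
    out = list(weights)
    for j, g in enumerate(groups):
        if j not in keep:
            for i in g:
                out[i] = 0.0  # group pruned: all its features are zeroed together
    return out
```

In a multi-task setting, a group would typically collect one feature's weights across all locations, so thresholding selects features shared by the tasks.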
ACM Transactions on Spatial Algorithms and Systems, 2016
Event forecasting from social media data streams has many applications. Existing approaches focus on forecasting temporal events (such as elections and sports) but as yet cannot forecast spatiotemporal events such as civil unrest and influenza outbreaks, which are much more challenging. To achieve spatiotemporal event forecasting, spatial features that evolve with time and their underlying correlations need to be considered and characterized. In this article, we propose novel batch and online approaches for spatiotemporal event forecasting in social media such as Twitter. Our models characterize the underlying development of future events by simultaneously modeling the structural contexts and their spatiotemporal burstiness based on different strategies. Both batch and online-based inference algorithms are developed to optimize the model parameters. Utilizing the trained model, the alignment likelihood of tweet sequences is calculated by dynamic programming. Extensive experimental e...
Proceedings of the 2016 SIAM International Conference on Data Mining, 2016
This paper presents a novel geospatio-temporal prediction framework called GSpartan to simultaneously build local regression models at multiple locations. The framework assumes that the local models share a common, low-rank representation, which makes them amenable to multi-task learning. GSpartan learns a set of base models to capture the spatio-temporal variabilities of the data and represents each local model as a linear combination of the base models. A graph Laplacian regularization is used to enforce constraints on the local models based on their spatial autocorrelation. We also introduce sparsity-inducing norms to perform feature selection for the base models and model selection for the local models. Experimental results using historical climate data from 37 weather stations showed that GSpartan outperforms single-task learning and other existing multi-task learning methods in more than 75% of the stations.
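The graph Laplacian regularization described above penalizes disagreement between local models at spatially autocorrelated stations. A minimal sketch of such a penalty term follows (illustrative only; representing each local model as a weight vector and the function name are assumptions, not GSpartan's code):

```python
def laplacian_penalty(weights, adjacency):
    """Graph Laplacian smoothness penalty: sum over edges (i, j) of
    w_ij * ||theta_i - theta_j||^2, encouraging spatially adjacent
    local models to have similar parameters.

    weights   : list of per-location model weight vectors
    adjacency : symmetric matrix of spatial-proximity weights w_ij
    """
    total = 0.0
    for i, row in enumerate(adjacency):
        for j, w in enumerate(row):
            if j > i and w:  # each undirected edge counted once
                total += w * sum((a - b) ** 2
                                 for a, b in zip(weights[i], weights[j]))
    return total
```

Adding this term to the training loss shrinks the difference between neighboring models without forcing them to be identical.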
ArXiv, 2019
With the rise of opioid abuse in the US, there has been a growth of overlapping hotspots for overdose-related and HIV-related deaths in Springfield, Boston, Fall River, New Bedford, and parts of Cape Cod. With a large part of the population, including rural communities, active on social media, it is crucial that we leverage the predictive power of social media as a preventive measure. We explore the predictive power of the micro-blogging social media website Twitter with respect to HIV new diagnosis rates per county. While trending work in Twitter NLP has focused primarily on text-based features, we show that multi-dimensional feature construction can significantly improve predictive power over topic features alone with respect to STIs (sexually transmitted infections). By multi-dimensional features, we mean leveraging not only the topical features (text) of a corpus, but also location-based information (counties) about the tweets in feature construction. We develop novel text-location-...
Journal of Big Data
Event detection from social media aims at extracting specific or generic unusual happenings, such as family reunions, earthquakes, and disease outbreaks. This paper introduces a new perspective for the hybrid extraction and clustering of social events from big social data streams. We rely on a hybrid learning model, where supervised deep learning is used for feature extraction and topic classification, whereas unsupervised spatial clustering is employed to determine the event whereabouts. We present ‘Deep-Eware’, a scalable and efficient event-aware big data platform that integrates data stream and geospatial processing tools for the hybrid extraction and dissemination of spatio-temporal events. We introduce a pure incremental approach for event discovery, by developing unsupervised machine learning and NLP algorithms and by computing events’ lifetime and spatial spanning. The system integrates a semantic keyword generation tool using KeyBERT for dataset preparation. ...
PloS one, 2014
Twitter has become a popular data source as a surrogate for monitoring and detecting events. Targeted domains such as crime, election, and social unrest require the creation of algorithms capable of detecting events pertinent to these domains. Due to the unstructured language, short-length messages, dynamics, and heterogeneity typical of Twitter data streams, it is technically difficult and labor-intensive to develop and maintain supervised learning systems. We present a novel unsupervised approach for detecting spatial events in targeted domains and illustrate this approach using one specific domain, viz. civil unrest modeling. Given a targeted domain, we propose a dynamic query expansion algorithm to iteratively expand domain-related terms, and generate a tweet homogeneous graph. An anomaly identification method is utilized to detect spatial events over this graph by jointly maximizing local modularity and spatial scan statistics. Extensive experiments conducted in 10 Latin Americ...
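The dynamic query expansion idea, iteratively growing a seed vocabulary with terms that co-occur in matching tweets, can be sketched as follows. This is a toy illustration, not the paper's algorithm: the tokenized-tweet input and the round/top-n parameters are assumptions.

```python
from collections import Counter

def dynamic_query_expansion(tweets, seeds, rounds=2, top_n=2):
    """Iteratively grow a seed vocabulary with the terms that co-occur
    most often in tweets matching the current query.

    tweets : list of tokenized tweets (lists of lowercase terms)
    seeds  : initial domain-related seed terms
    """
    query = set(seeds)
    for _ in range(rounds):
        # tweets sharing at least one term with the current query
        matched = [t for t in tweets if query & set(t)]
        counts = Counter(w for t in matched for w in t if w not in query)
        query |= {w for w, _ in counts.most_common(top_n)}
    return query
```

Real systems weight candidate terms (e.g. by TF-IDF against a background corpus) rather than using raw co-occurrence counts, but the expand-match-expand loop is the same.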
Proceedings of the International AAAI Conference on Web and Social Media
Twitter, used in 200 countries with over 250 million tweets a day, is a rich source of local news from around the world. Many events of local importance are first reported on Twitter, including many that never reach news channels. Further, there are often only a few tweets reporting each such event, in contrast with the larger volumes that follow events of wider significance. Even though such events may be primarily of local importance, they can also be of critical interest to some specific but possibly far-flung entities: for example, a fire in a supplier’s factory half-way around the world may be of interest even from afar. In this paper we describe how this ‘long tail’ of events can be detected in spite of their sparsity. We then extract and correlate information from multiple tweets describing the same event. Our generic architecture for converting a tweet-stream into event-objects uses locality sensitive hashing, classification, boosting, information extraction and clustering. Our results, based ...
As social media has gained more attention from users on the Internet, it has become one of the most important information sources in the world, and the volume of data posted on social media sites is growing rapidly; social media is often described as new media replacing traditional media. In this paper, we concentrate on geotagged tweets on the Twitter site. These geotagged tweets are known as georeferenced documents because they include not only a short text message but also the document’s posting time and location. Many researchers have been developing new data mining techniques for georeferenced documents to recognize and analyze emergency topics, such as natural disasters, weather, diseases, and other incidents. In particular, the utilization of geotagged tweets to recognize and analyze natural disasters has recently received much attention from administrative agencies because some case studies have achieved compelling results. In this paper, we propose a novel real-time analysis application for identifying bursty local areas related to emergency topics. The aim of our application is to provide a new platform that can identify and analyze the localities of emergency topics. The proposed application is composed of three core computational intelligence techniques: the Naive Bayes classifier technique, the spatiotemporal clustering technique, and the burst detection technique. We have also implemented two types of application: a Web application interface and an Android application. To evaluate the proposed application, we have implemented a real-time weather observation system embedding the proposed application, using actual geotagged tweets crawled from the Twitter site. The weather detection system
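A burst detection step of the kind used in such applications can be sketched with a simple moving-average test (illustrative only; the window and threshold parameters are assumptions, not the application's actual technique):

```python
def detect_bursts(counts, window=3, threshold=2.0):
    """Flag time steps whose tweet count exceeds `threshold` times the
    mean of the preceding `window` counts (a simple moving-average test).

    counts : tweet counts per time step for one local area
    """
    bursts = []
    for t in range(window, len(counts)):
        baseline = sum(counts[t - window:t]) / window
        if baseline > 0 and counts[t] > threshold * baseline:
            bursts.append(t)
    return bursts
```

Production burst detectors (e.g. Kleinberg's state-machine model) are more robust to noise, but the baseline-versus-spike comparison is the core of all of them.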
Proceedings of the ACM SIGMOD Workshop on Databases and Social Networks - DBSocial '13, 2013
Unprecedented success and active usage of social media services result in massive amounts of user-generated data. An increasing interest in the contained information from social media data leads to more and more sophisticated analysis and visualization applications. Because of the fast pace and distribution of news in social media data it is an appropriate source to identify events in the data and directly display their occurrence to analysts or other users. This paper presents a method for event identification in local areas using the Twitter data stream. We implement and use a combined log-likelihood ratio approach for the geographic and time dimension of real-life Twitter data in predefined areas of the world to detect events occurring in the message contents. We present a case study with two interesting scenarios to show the usefulness of our approach.
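A common form of the log-likelihood ratio used for this kind of event detection is Dunning's G² statistic, comparing a term's frequency inside a target spatio-temporal window against a background corpus; the paper's exact combined geographic/temporal formulation may differ. A minimal sketch:

```python
import math

def llr(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio (G^2) for a term occurring k1 times
    in n1 tokens of the target window and k2 times in n2 background tokens.
    Large values indicate the term is unusually frequent in the window."""
    def ll(k, n, p):
        # binomial log-likelihood; guard against log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p) if 0 < p < 1 else 0.0
    p = (k1 + k2) / (n1 + n2)          # pooled rate under the null hypothesis
    p1, p2 = k1 / n1, k2 / n2          # separate rates under the alternative
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))
```

A term with the same relative frequency in window and background scores near zero; a bursting term scores high, so ranking terms by G² surfaces event keywords.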
Proceedings of the 21st international conference on World Wide Web, 2012
Micro-blogging services have become indispensable communication tools for online users for disseminating breaking news, eyewitness accounts, individual expression, and protest groups. Recently, Twitter, along with other online social networking services such as Foursquare, Gowalla, Facebook and Yelp, have started supporting location services in their messages, either explicitly, by letting users choose their places, or implicitly, by enabling geo-tagging, which is to associate messages with latitudes and longitudes. This functionality allows researchers to address an exciting set of questions: 1) How is information created and shared across geographical locations, 2) How do spatial and linguistic characteristics of people vary across regions, and 3) How to model human mobility. Although many attempts have been made at tackling these problems, previous methods are either too complicated to implement or so oversimplified that they cannot yield reasonable performance. It is a challenging task to discover topics and identify users' interests from these geo-tagged messages due to the sheer amount of data and diversity of language variations used on these location sharing services. In this paper we focus on Twitter and present an algorithm by modeling diversity in tweets based on topical diversity, geographical diversity, and an interest distribution of the user. Furthermore, we take the Markovian nature of a user's location into account. Our model exploits sparse factorial coding of the attributes, thus allowing us to deal with a large and diverse set of covariates efficiently. Our approach is vital for applications such as user profiling, content recommendation and topic tracking. We show high accuracy in location estimation based on our model. Moreover, the algorithm identifies interesting topics based on location and language.
A considerable portion of social media messages is devoted to current events. Aside from references to events that recently happened, social media messages may also refer to events that have not occurred yet. Future events, such as football matches in the case study we present here, may be scheduled and known to happen; other future events, such as transfers of football players, may only be rumoured, and may in fact not happen in the end. We describe a news mining component that learns to identify tweets referring to scheduled and unscheduled future events, by being trained on messages referring to scheduled future events (as the latter are easy to harvest). Our results show that discriminating between tweets that refer to upcoming football matches and tweets that refer to past matches can be done relatively reliably with supervised machine learning methods. However, when these trained models are applied to unscheduled events, performance drops to near-baseline performance. We discuss how these results can be explained by the distinction between event type and event domain.
International Journal of Data Science and Analytics, 2022
With COVID-19 affecting every country globally and changing everyday life, the ability to forecast the spread of the disease is more important than any previous epidemic. The conventional methods of disease-spread modeling, compartmental models, are based on the assumption of spatiotemporal homogeneity of the spread of the virus, which may cause forecasting to underperform, especially at high spatial resolutions. In this paper, we approach the forecasting task with an alternative technique-spatiotemporal machine learning. We present COVID-LSTM, a data-driven model based on a long short-term memory deep learning architecture for forecasting COVID-19 incidence at the county level in the USA. We use the weekly number of new positive cases as temporal input, and hand-engineered spatial features from Facebook movement and connectedness datasets to capture the spread of the disease in time and space. COVID-LSTM outperforms the COVID-19 Forecast Hub's Ensemble model (COVIDhub-ensemble) on our 17-week evaluation period, making it the first model to be more accurate than the COVIDhub-ensemble over one or more forecast periods. Over the 4-week forecast horizon, our model is on average 50 cases per county more accurate than the COVIDhub-ensemble. We highlight that the underutilization of data-driven forecasting of disease spread prior to COVID-19 is likely due to the lack of sufficient data available for previous diseases, in addition to the recent advances in machine learning methods for spatiotemporal forecasting. We discuss the impediments to the wider uptake of data-driven forecasting, and whether it is likely that more deep learning-based models will be used in the future.
Journal of Big Data
A key challenge in mining social media data streams is to identify events which are actively discussed by a group of people in a specific local or global area. Such events are useful for early warning for accident, protest, election or breaking news. However, neither the list of events nor the resolution of both event time and space is fixed or known beforehand. In this work, we propose an online spatio-temporal event detection system using social media that is able to detect events at different time and space resolutions. First, to address the challenge related to the unknown spatial resolution of events, a quad-tree method is exploited in order to split the geographical space into multiscale regions based on the density of social media data. Then, a statistical unsupervised approach is performed that involves Poisson distribution and a smoothing method for highlighting regions with unexpected density of social posts. Further, event duration is precisely estimated by merging events...
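The quad-tree step, splitting geographic space into multiscale regions by post density, can be sketched as the following recursion (illustrative only, not the system's implementation; the max-points and max-depth parameters are assumptions):

```python
def quadtree_cells(points, bbox, max_points=4, max_depth=8):
    """Recursively split a bounding box into four quadrants until each
    leaf holds at most `max_points` posts; returns the leaf boxes.

    points : (x, y) post coordinates
    bbox   : (x0, y0, x1, y1) with half-open membership [x0, x1) x [y0, y1)
    """
    x0, y0, x1, y1 = bbox
    inside = [(x, y) for x, y in points if x0 <= x < x1 and y0 <= y < y1]
    if len(inside) <= max_points or max_depth == 0:
        return [bbox]  # sparse (or depth-limited) region stays one cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    leaves = []
    for quad in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        leaves += quadtree_cells(inside, quad, max_points, max_depth - 1)
    return leaves
```

Dense areas end up covered by many small cells and sparse areas by few large ones, which is exactly the multiscale resolution the event detector needs.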
Abstract: When events related to emergency situations occur, it is important to access as much information as possible related to the event. In this context, social networks such as Twitter are an important source of real-time information. Classical information filtering techniques usually focus on analyzing the co-occurrence of terms with the initially considered set of keywords. However, these approaches can miss information, since they are unable to retrieve relevant content expressed with words that do not co-occur with the initial keywords expressing our information need. Considering geolocation, user, or temporal information within a pseudo-relevance feedback approach allows us to find terminology related to the event but not co-occurring with the initially considered keywords. Furthermore, by taking the temporal aspect into account, a query expansion function such as the Kullback-Leibler divergence can be modified in order to improve information filtering in these emergency situations. Our proposals have been evaluated on two collections of real-world events, obtaining encouraging results.
Social microblogs such as Twitter and Weibo are experiencing explosive growth, with billions of global users sharing their daily observations and thoughts. Beyond public interests (e.g., sports, music), microblogs can provide highly detailed information for those interested in public health, homeland security, and financial analysis. However, the language used on Twitter is heavily informal, ungrammatical, and dynamic. Existing data mining algorithms require extensive manual labeling to build and maintain a supervised system. This paper presents STED, a semi-supervised system that helps users to automatically detect and interactively visualize events of a targeted type from Twitter, such as crimes, civil unrest, and disease outbreaks. Our model first applies transfer learning and label propagation to automatically generate labeled data, then learns a customized text classifier based on mini-clustering, and finally applies fast spatial scan statistics to estimate the locations of events. We demonstrate STED's usage and benefits using Twitter data collected from Latin American countries, and show how our system helps to detect and track example events such as civil unrest and crimes.
IEEE Access, 2022
According to the World Health Organization, several factors have affected the accurate reporting of SARS-CoV-2 outbreak status, such as limited data collection resources, cultural and educational diversity, and inconsistent outbreak reporting from different sectors. Driven by this challenging situation, this study investigates the potential expediency of using social network data to develop a reliable early information surveillance and warning system for pandemic outbreaks. As such, an enhanced framework of three inherently interlinked subsystems is proposed. The first subsystem includes data collection and integration mechanisms, data preprocessing, and hybrid sentiment analysis tools to identify tweet sentiment taxonomies and quantitatively estimate public awareness. The second subsystem comprises the feature extraction unit that identifies, selects, embeds, and balances feature vectors and the classifier fitting and training unit. This subsystem is designed to capture the most effective linguistic feature combinations with more spatial evidence by using a variety of approaches, including linear classifiers, MLPs, RNNs, and CNNs, as well as pre-trained word embedding algorithms. The last is the modeling and situational awareness evaluation subsystem, which measures temporal associations between pandemic-relevant social network activities and officially announced infection counts in the most hazardous geolocations. The proposed framework was developed and tested using a combination of static datasets and real-time scraped Twitter data. The results of these experiments showed the remarkable performance of the framework in assessing the temporal associations between public awareness and outbreak status. It also showed that the Decision Tree Classifier with Unigram+TF-IDF feature vectors outperformed other conventional models for sentiment classification and geolocation classification, with accuracies of 94.3% and 80.8%, respectively.
As indicated, conventional machine learning algorithms did not achieve a precision of more than 80%, while, for instance, an MLP with a self-embedding layer, Word2Vec, and GloVe pre-trained word embeddings resulted in very poor accuracies of 10%, 36%, and 32%, respectively. However, adding PoS-tag one-hot encoding embedding increased the validation accuracy from 36% to approximately 89%, while the best performance for the second subsystem was achieved by Bi-LSTM with RoBERTa word embedding, with an accuracy of 96%. The achieved results reveal that the proposed framework can proactively capture the potential hazards associated with the prevalence of infectious diseases as an effective early detection and info-surveillance awareness system.
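The Unigram+TF-IDF feature vectors mentioned above can be computed as in this minimal sketch (standard TF-IDF; the unsmoothed IDF variant shown is an assumption, not necessarily the study's exact formula):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Unigram TF-IDF vectors for a list of tokenized documents,
    returned as sparse dicts mapping term -> weight."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / c) for w, c in df.items()}
    vecs = []
    for d in docs:
        tf = Counter(d)
        # term frequency normalized by document length, scaled by IDF
        vecs.append({w: (c / len(d)) * idf[w] for w, c in tf.items()})
    return vecs
```

A term appearing in every document gets IDF zero and thus carries no weight, which is why such vectors emphasize discriminative outbreak-related vocabulary.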
2013
Microblogging services such as Twitter, Facebook, and Foursquare have become major sources for information about real-world events. Most approaches that aim at extracting event information from such sources typically use the temporal context of messages. However, exploiting the location information of georeferenced messages, too, is important to detect localized events, such as public events or emergency situations. Users posting messages that are close to the location of an event serve as human sensors to describe an event. In this demonstration, we present a novel framework to detect localized events in real-time from a Twitter stream and to track the evolution of such events over time. For this, spatio-temporal characteristics of keywords are continuously extracted to identify meaningful candidates for event descriptions. Then, localized event information is extracted by clustering keywords according to their spatial similarity. To determine the most important events in a (r...
ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Social media can be a very useful source of data for people interested in disasters, since it can provide them with on-site information. Posted georeferenced messages and images can help to understand the situation in the area affected by the event. Considering this type of resource as a real-time crowdsource of crisis information, the spatial distribution of geolocated posts related to an event can represent an early indicator of the severity of impact. The aim of this paper is to explore the spatial distribution of Twitter posts related to Hurricane Michael, which occurred in the USA in 2018, and to analyse their potential in providing fast insight into the event's impact. Kernel density estimation has been applied to explore the spatial distribution of Twitter posts, after which Hot Spot analysis has been performed in order to analyse the spatiotemporal distribution of the data. Hot Spot analysis has proven to be the more comprehensive analysis, detecting the area of high impact. The kernel density map has proven to be useful as well.
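Kernel density estimation over geotagged posts can be sketched as follows. This is a planar Gaussian-kernel toy version; real analyses typically use GIS tooling with projected coordinates, and the bandwidth here is an arbitrary assumption.

```python
import math

def kernel_density(points, query, bandwidth=0.1):
    """Gaussian kernel density estimate at a query location from
    geotagged post coordinates (flat-plane approximation).

    points    : (x, y) post coordinates
    query     : (x, y) location at which to evaluate the density
    bandwidth : kernel standard deviation in coordinate units
    """
    h2 = bandwidth ** 2
    qx, qy = query
    total = sum(math.exp(-((x - qx) ** 2 + (y - qy) ** 2) / (2 * h2))
                for x, y in points)
    # normalize so the estimate integrates to 1 over the plane
    return total / (len(points) * 2 * math.pi * h2)
```

Evaluating this on a grid and drawing contours yields the density map; areas of peak density then serve as candidates for the high-impact zone.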
Road traffic prediction is a critical component in modern smart transportation systems. It provides the basis for traffic management agencies to generate proactive traffic operation strategies for alleviating congestion. Existing work on near-term traffic prediction (forecasting horizons in the range of 5 minutes to 1 hour) relies on the past and current traffic conditions. However, once the forecasting horizon is beyond 1 hour, i.e., in longer-term traffic prediction, these techniques do not work well since additional factors other than the past and current traffic conditions start to play important roles. To address this problem, in this paper, for the first time, we examine whether it is possible to use the rich information in online social media to improve longer-term traffic prediction. To this end, we first analyze the correlation between traffic volume and tweet counts with various granularities. Then we propose an optimization framework to extract traffic indicators based on tweet semantics using a transformation matrix, and incorporate them into traffic prediction via linear regression. Experimental results using traffic and Twitter data originated from the San Francisco Bay area of California demonstrate the effectiveness of our proposed framework.
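Incorporating a tweet-derived indicator into traffic prediction via linear regression can be illustrated with a one-variable least-squares fit. This is a toy sketch: the paper's framework learns a transformation matrix over tweet semantics, which is not reproduced here.

```python
def ols_fit(x, y):
    """Ordinary least-squares fit y ~ a + b*x, e.g. regressing traffic
    volume (y) on a tweet-count indicator (x). Returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def ols_predict(coeffs, x):
    a, b = coeffs
    return a + b * x
```

With multiple tweet-derived indicators, the same idea extends to multivariate regression; the single-feature version keeps the correlation-to-prediction step visible.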
Crime Science, 2020
Background: Crime, traffic accidents, terrorist attacks, and other space-time random events are unevenly distributed in space and time. In the case of crime, hotspot and other proactive policing programs aim to focus limited resources at the highest risk crime and social harm hotspots in a city. A crucial step in the implementation of these strategies is the construction of scoring models used to rank spatial hotspots. While these methods are evaluated by area normalized Recall@k (called the predictive accuracy index), models are typically trained via maximum likelihood or rules of thumb that may not prioritize model accuracy in the top k hotspots. Furthermore, current algorithms are defined on fixed grids that fail to capture risk patterns occurring in neighborhoods and on road networks with complex geometries. Results: We introduce CrimeRank, a learning to rank boosting algorithm for determining a crime hotspot map that directly optimizes the percentage of crime captured by the top ranked hotspots. The method employs a floating grid combined with a greedy hotspot selection algorithm for accurately capturing spatial risk in complex geometries. We illustrate the performance using crime and traffic incident data provided by the Indianapolis Metropolitan Police Department, IED attacks in Iraq, and data from the 2017 NIJ Real-time crime forecasting challenge. Conclusion: Our learning to rank strategy was the top performing solution (PAI metric) in the 2017 challenge. We show that CrimeRank achieves even greater gains when the competition rules are relaxed by removing the constraint that grid cells be a regular tessellation.
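The predictive accuracy index (area-normalized Recall@k) that CrimeRank optimizes can be computed as in this sketch. It is a cell-based toy version assuming equal-area cells; the floating-grid and greedy-selection machinery is not reproduced.

```python
def predictive_accuracy_index(cell_scores, cell_crimes, cell_area, total_area, k):
    """PAI for the k highest-scoring cells: the fraction of crime captured
    by the flagged cells, divided by the fraction of area they cover.
    PAI = 1 means no better than flagging area at random; higher is better."""
    top = sorted(range(len(cell_scores)), key=lambda i: -cell_scores[i])[:k]
    captured = sum(cell_crimes[i] for i in top)
    recall = captured / sum(cell_crimes)        # Recall@k over crime counts
    area_frac = k * cell_area / total_area      # fraction of the city flagged
    return recall / area_frac
```

Because PAI rewards concentrating crime into a small flagged area, directly optimizing it (as a learning-to-rank objective) differs from maximum-likelihood training of an intensity model.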