Event forecasting in Twitter is an important and challenging problem. Most existing approaches focus on forecasting temporal events (such as elections and sports) and do not consider spatial features and their underlying correlations. In this paper, we propose a generative model for spatiotemporal event forecasting in Twitter. Our model characterizes the underlying development of future events by jointly modeling the structural contexts and spatiotemporal burstiness. An effective inference algorithm is developed to train the model parameters. Utilizing the trained model, the alignment likelihood of tweet sequences is calculated by dynamic programming. Extensive experimental evaluations on two different domains demonstrated the effectiveness of our proposed approach. * Virginia Tech. † SUNY Albany.
Twitter has become a popular data source as a surrogate for monitoring and detecting events. Targeted domains such as crime, election, and social unrest require the creation of algorithms capable of detecting events pertinent to these domains. Due to the unstructured language, short-length messages, dynamics, and heterogeneity typical of Twitter data streams, it is technically difficult and labor-intensive to develop and maintain supervised learning systems. We present a novel unsupervised approach for detecting spatial events in targeted domains and illustrate this approach using one specific domain, viz. civil unrest modeling. Given a targeted domain, we propose a dynamic query expansion algorithm to iteratively expand domain-related terms, and generate a tweet homogeneous graph. An anomaly identification method is utilized to detect spatial events over this graph by jointly maximizing local modularity and spatial scan statistics. Extensive experiments conducted in 10 Latin American countries demonstrate the effectiveness of the proposed approach.
In this paper, a genetic algorithm (GA) based principal component selection approach is proposed for production performance estimation in mineral processing. The approach combines a modified GA with principal component analysis (PCA) in order to improve the estimation accuracy of production performance. In this context, an extended chromosome encoding and a fitness function, formed by combining the prediction performance operator and the penalty function, are designed based on the standard GA. Both the mutation allele number operator and the allele mutation possibility operator are also introduced in the chromosome mutation process. The proposed approach can select the principal components that are crucial for estimation performance, and the useful information from PCA can guide the evolution of the GA and accelerate the convergence process. Case studies have been carried out on the prediction of the production rate and concentrate grade of a mineral process, and the experimental results show the effectiveness of the proposed approach.
We describe the design, implementation, and evaluation of EMBERS, an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012, which have been (and continue to be) evaluated by an independent T&E team (MITRE). Of note, EMBERS has successfully forecast the June 2013 protests in Brazil and the Feb 2014 violent protests in Venezuela. We outline the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion and suppression engine that supports trading off specific evaluation criteria. EMBERS also provides an audit trail interface that enables the investigation of why specific predictions were made along with the data utilized for forecasting. Through numerous evaluations, we demonstrate the superiority of EMBERS over base-rate methods and its capability to forecast significant societal happenings.
Developed under the IARPA Open Source Indicators program, EMBERS (Early Model Based Event Recognition using Surrogates) is a large-scale big-data analytics system for forecasting significant societal events, such as civil unrest incidents and disease outbreaks, on the basis of continuous, automated analysis of large volumes of publicly available data. It has been operational since November 2012, delivering approximately 50 predictions each day. EMBERS is built on a streaming, scalable, shared-nothing architecture and is deployed on Amazon Web Services (AWS).
Social microblogs such as Twitter and Weibo are experiencing explosive growth, with billions of global users sharing their daily observations and thoughts. Beyond public interests (e.g., sports, music), microblogs can provide highly detailed information for those interested in public health, homeland security, and financial analysis. However, the language used in Twitter is heavily informal, ungrammatical, and dynamic. Existing data mining algorithms require extensive manual labeling to build and maintain a supervised system. This paper presents STED, a semi-supervised system that helps users to automatically detect and interactively visualize events of a targeted type from Twitter, such as crimes, civil unrest, and disease outbreaks. Our model first applies transfer learning and label propagation to automatically generate labeled data, then learns a customized text classifier based on mini-clustering, and finally applies fast spatial scan statistics to estimate the locations of events. We demonstrate STED's usage and benefits using Twitter data collected from Latin American countries, and show how our system helps to detect and track example events such as civil unrest and crimes.
This paper deals with the application of the Kalman filter for optimizing and filtering the position signal of a shuttlecock obtained by the vision servo system of 'Shuttlecock Robot' [1]. Non-uniform mass distribution and air resistance can introduce substantial noise not only in vision recognition but also in the kinematic model analysis of the shuttlecock. The Kalman filter algorithm is used to filter the shuttlecock position signal by taking both the measurement error and the error of the shuttlecock motion model into account. In addition, to meet the requirement of fast motion control, we reduce the dimension of the state vector by decomposing the shuttlecock motion, which shortens the execution cycle. The simulation results show its effectiveness in improving the accuracy of track prediction. It can also accomplish track prediction quickly and accurately when applied to 'Shuttlecock Robot'.
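The filtering idea in the abstract above can be illustrated with a minimal sketch of a scalar Kalman filter applied to a single position coordinate. This is not the paper's motion model: the random-walk state model, the function name `kalman_1d`, and all parameter values are assumptions chosen only to show how measurement error and model error are weighed against each other.

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter with an assumed random-walk motion model.

    process_var     -- variance of the motion-model error (assumption)
    measurement_var -- variance of the sensor error (assumption)
    x0, p0          -- initial state estimate and its variance
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the random-walk model keeps the state, uncertainty grows.
        p = p + process_var
        # Update: the gain weighs the prediction against the measurement.
        gain = p / (p + measurement_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical noisy readings of one coordinate of the position signal.
readings = [4.8, 5.3, 4.9, 5.1, 5.2, 4.7, 5.0]
smoothed = kalman_1d(readings, process_var=1e-3, measurement_var=0.1, x0=readings[0])
print(smoothed)
```

Filtering each axis with its own low-dimensional filter is one way to keep the per-step cost small, in the spirit of the state-vector reduction the abstract mentions; the decomposition actually used in the paper is not given here.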
Spatial event forecasting from social media is an important problem but encounters critical challenges, such as dynamic patterns of features (keywords) and geographic heterogeneity (e.g., spatial correlations, imbalanced samples, and different populations in different locations). Most existing approaches (e.g., LASSO regression, dynamic query expansion, and burst detection) are designed to address some of these challenges, but not all of them. This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges. Specifically, given a collection of locations (e.g., cities), we propose to build forecasting models for all locations simultaneously by extracting and utilizing appropriate shared information that effectively increases the sample size for each location, thus improving the forecasting performance. We combine both static features derived from a predefined vocabulary by domain experts and dynamic features generated from dynamic query expansion in a multi-task feature learning framework; we investigate different strategies to balance homogeneity and diversity between static and dynamic terms. Efficient algorithms based on Iterative Group Hard Thresholding are developed to achieve efficient and effective model training and prediction. Extensive experimental evaluations on Twitter data from four different countries in Latin America demonstrated the effectiveness of our proposed approach.
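As a rough illustration of the thresholding idea named above, the sketch below shows only a generic group hard thresholding step: keep the k groups of weights with the largest Euclidean norm and zero out the rest. It is not the paper's full iterative algorithm; the function name `group_hard_threshold`, the grouping, and the numbers are assumptions for illustration. In a multi-task setting, one group could hold a single feature's weights across all locations.

```python
import math


def group_hard_threshold(weights, groups, k):
    """Keep the k groups with the largest L2 norm; zero all other groups.

    weights -- flat list of model weights
    groups  -- list of index lists, one per group (hypothetical grouping)
    k       -- number of groups to keep
    """
    norms = [
        (math.sqrt(sum(weights[i] ** 2 for i in group)), gi)
        for gi, group in enumerate(groups)
    ]
    keep = {gi for _, gi in sorted(norms, reverse=True)[:k]}
    result = [0.0] * len(weights)
    for gi in keep:
        for i in groups[gi]:
            result[i] = weights[i]
    return result


# Three hypothetical feature groups with two weights each.
w = [3.0, 4.0, 0.1, 0.2, 1.0, 1.0]
g = [[0, 1], [2, 3], [4, 5]]
sparse_w = group_hard_threshold(w, g, k=2)
print(sparse_w)
```

An iterative scheme would typically alternate a gradient step on the loss with this projection; that outer loop is omitted because its details are not stated in the abstract.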
Developed under the Intelligence Advanced Research Project Activity Open Source Indicators program, Early Model Based Event Recognition using Surrogates (EMBERS) is a large-scale big data analytics system for forecasting significant societal events, such as civil unrest events, on the basis of continuous, automated analysis of large volumes of publicly available data. It has been operational since November 2012 and delivers approximately 50 predictions each day for countries of Latin America. EMBERS is built on a streaming, scalable, loosely coupled, shared-nothing architecture using ZeroMQ as its messaging backbone and JSON as its wire data format. It is deployed on Amazon Web Services using an entirely automated deployment process. We describe the architecture of the system, some of the design tradeoffs encountered during development, and specifics of the machine learning models underlying EMBERS. We also present a detailed prospective evaluation of EMBERS in forecasting significant societal events in the past 2 years.
Infectious disease epidemics such as influenza and Ebola pose a serious threat to global public health. It is crucial to characterize the disease and the evolution of the ongoing epidemic efficiently and accurately. Computational epidemiology can model the disease progress and underlying contact network, but suffers from the lack of real-time and fine-grained surveillance data. Social media, on the other hand, provides timely and detailed disease surveillance, but is blind to the underlying contact network and disease model. This paper proposes a novel semi-supervised deep learning framework that integrates the strengths of computational epidemiology and social media mining techniques. Specifically, this framework learns the social media users' health states and intervention actions in real time, which are regularized by the underlying disease model and contact network. Conversely, the knowledge learned from social media can be fed into the computational epidemic model to improve the efficiency and accuracy of disease diffusion modeling. We propose an online optimization algorithm that carries out this interactive learning process iteratively until the two components reach a consistent state. The extensive experimental results demonstrated that our approach can effectively characterize the spatiotemporal disease diffusion, outperforming competing methods by a substantial margin on multiple metrics.
Funded by the IARPA Open Source Indicators (OSI) program - aims to develop methods for continuous, automated analysis of publicly available data in order to anticipate and/or detect population-level events such as mass violence, protests, riots, mass migrations, elections, disease outbreaks, economic instability, resource shortages, and responses to natural disasters.
Twitter is a crucial platform for accessing breaking news and timely information. However, due to questionable provenance, uncontrollable broadcasting, and unstructured language in tweets, Twitter is hardly a trustworthy source of breaking news. In this paper, we propose a novel topic-focused trust model to assess the trustworthiness of users and tweets in Twitter. Unlike traditional graph-based trust ranking approaches in the literature, our method is scalable and can consider heterogeneous contextual properties to rate topic-focused tweets and users. We demonstrate the effectiveness of our topic-focused trustworthiness estimation method with extensive experiments using real Twitter data in Latin America.
where $c^{(t)}(W_j, D_k)$ is a Boolean value such that $c^{(t)}(W_j, D_k) = 1$ means that the term $W_j$ appears in the tweet $D_k$, and $c^{(t)}(W_j, D_k) = 0$ otherwise. The notation $W_j \in D_k$ signifies that the term $W_j$ is contained in the tweet $D_k$.
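The indicator defined above can be sketched in a few lines. The whitespace tokenizer, the function names, and the sample tweets are assumptions made for illustration; the source does not say how tweets are tokenized, and the time index t is left out of the sketch.

```python
def tokenize(text):
    """Very rough tokenizer (assumption): lowercase and split on whitespace."""
    return set(text.lower().split())


def term_in_tweet(term, tweet_tokens):
    """Return 1 if the term appears in the tweet's token set, else 0."""
    return 1 if term in tweet_tokens else 0


# Hypothetical tweets D_k and terms W_j.
tweets = ["Protest planned downtown tomorrow", "Traffic is heavy downtown"]
terms = ["protest", "downtown"]

# Rows are terms W_j, columns are tweets D_k.
indicator = [[term_in_tweet(w, tokenize(d)) for d in tweets] for w in terms]
print(indicator)
```

Laying the values out as a term-by-tweet matrix makes it easy to sum over tweets or over terms, which is the kind of aggregation such an indicator usually feeds.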
Twitter has become a popular social sensor. It is socially significant to surveil the tweet content under crucial themes such as "disease" and "civil unrest". However, this creates two challenges: 1) how to characterize the theme pattern, given Twitter's heterogeneity, dynamics, and unstructured language; and 2) how to model the theme consistently across multiple Twitter functions such as hashtags, replying, and friendships. In this paper, we propose a dynamic query expansion (DQE) model for theme tracking in Twitter. Specifically, DQE characterizes the theme consistency among heterogeneous entities (e.g., terms, tweets, and users) through semantic and social relationships, including co-occurrence, replying, authorship, and friendship. The proposed new optimization algorithm estimates the weight of each relationship by minimizing the Kullback-Leibler divergence. To demonstrate the effectiveness and scalability of DQE, we conducted extensive experiments to track the theme "civil unrest" across 8 Latin American countries.
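For readers unfamiliar with the objective mentioned above, here is a minimal sketch of the Kullback-Leibler divergence between two discrete distributions. It shows only the quantity being minimized, not the DQE optimization algorithm; the function name and the example distributions are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for two discrete distributions.

    Assumes p and q have the same length, each sums to 1, and q[i] > 0
    wherever p[i] > 0. Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over two outcomes.
observed = [0.5, 0.5]
modeled = [0.9, 0.1]
print(kl_divergence(observed, modeled))
```

The divergence is zero only when the two distributions agree and it is not symmetric, so the order of the arguments matters when it is used as a fitting objective.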
Significant societal event forecasting is an important and complex process as it involves the consideration of many aspects of that society, including its economics, politics, and culture. Traditional forecasting methods based on a single data source find it hard to cover all these aspects comprehensively, thus limiting the model performance. Multi-source event forecasting requires more sophisticated models but still suffers from several challenges, including 1) geographical hierarchies in the multi-source data features, 2) missing values in the interactive features, and 3) the characterization of structured feature sparsity. This paper proposes a novel feature learning model that concurrently addresses all the above challenges. Specifically, given multi-source data from different geographical levels, we design a new forecasting model by characterizing the lower-level features' dependence on higher-level features. To handle the structured sparsity and deal with missing values among the coupled features, we propose a novel feature learning model based on Nth-order strong hierarchy and fused-overlapping group Lasso. An efficient algorithm is developed to optimize the model parameters and ensure global optima. Extensive experiments on 10 datasets in different domains demonstrate the effectiveness and efficiency of the proposed model.
There has been significant recent interest in the application of social media analytics for spatiotemporal event mining. However, no structured survey exists to capture developments in this space. This paper seeks to fill this void by reviewing recent research trends. Three branches of research are summarized here, corresponding respectively to modeling the past, present, and future: information tracking and backward analysis, spatiotemporal event detection, and spatiotemporal event forecasting. Each branch is illustrated with examples, challenges, and accomplishments.
Papers by Liang Zhao