Spatial event forecasting from social media is an important problem but encounters critical challenges, such as dynamic patterns of features (keywords) and geographic heterogeneity (e.g., spatial correlations, imbalanced samples, and different populations in different locations). Most existing approaches (e.g., LASSO regression, dynamic query expansion, and burst detection) are designed to address some of these challenges, but not all of them. This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges. Specifically, given a collection of locations (e.g., cities), we propose to build forecasting models for all locations simultaneously by extracting and utilizing appropriate shared information that effectively increases the sample size for each location, thus improving the forecasting performance. We combine both static features derived from a predefined vocabulary by domain experts and dynamic features generated from dynamic query expansion in a multi-task feature learning framework; we investigate different strategies to balance homogeneity and diversity between static and dynamic terms. Efficient algorithms based on Iterative Group Hard Thresholding are developed to achieve efficient and effective model training and prediction. Extensive experimental evaluations on Twitter data from four different countries in Latin America demonstrated the effectiveness of our proposed approach.
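The group hard-thresholding step at the heart of Iterative Group Hard Thresholding can be illustrated with a minimal sketch. This is not the paper's implementation; the function name and the flat weight-list representation are assumptions for illustration. The idea: keep the k feature groups with the largest L2 norm and zero out the rest.

```python
import math

def group_hard_threshold(weights, groups, k):
    """Keep the k feature groups with the largest L2 norm; zero the rest.

    weights : list of floats, one per feature
    groups  : list of lists of feature indices (a partition of the features)
    k       : number of groups to retain
    """
    norms = [math.sqrt(sum(weights[i] ** 2 for i in g)) for g in groups]
    keep = set(sorted(range(len(groups)), key=lambda j: -norms[j])[:k])
    out = list(weights)
    for j, g in enumerate(groups):
        if j not in keep:
            for i in g:
                out[i] = 0.0  # group pruned: all its features are zeroed together
    return out
```

In a multi-task setting, a group would typically collect one feature's weights across all locations, so thresholding selects features shared by the tasks.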
ACM Transactions on Spatial Algorithms and Systems, 2016
Event forecasting from social media data streams has many applications. Existing approaches focus on forecasting temporal events (such as elections and sports) but as yet cannot forecast spatiotemporal events such as civil unrest and influenza outbreaks, which are much more challenging. To achieve spatiotemporal event forecasting, spatial features that evolve with time and their underlying correlations need to be considered and characterized. In this article, we propose novel batch and online approaches for spatiotemporal event forecasting in social media such as Twitter. Our models characterize the underlying development of future events by simultaneously modeling the structural contexts and their spatiotemporal burstiness based on different strategies. Both batch and online-based inference algorithms are developed to optimize the model parameters. Utilizing the trained model, the alignment likelihood of tweet sequences is calculated by dynamic programming. Extensive experimental e...
Proceedings of the 2016 SIAM International Conference on Data Mining, 2016
This paper presents a novel geospatio-temporal prediction framework called GSpartan to simultaneously build local regression models at multiple locations. The framework assumes that the local models share a common, low-rank representation, which makes them amenable to multi-task learning. GSpartan learns a set of base models to capture the spatio-temporal variabilities of the data and represents each local model as a linear combination of the base models. A graph Laplacian regularization is used to enforce constraints on the local models based on their spatial autocorrelation. We also introduce sparsity-inducing norms to perform feature selection for the base models and model selection for the local models. Experimental results using historical climate data from 37 weather stations showed that GSpartan outperforms single-task learning and other existing multi-task learning methods in more than 75% of the stations.
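The graph Laplacian regularization described above penalizes disagreement between local models at spatially autocorrelated stations. A minimal sketch of such a penalty term follows (illustrative only; representing each local model as a weight vector and the function name are assumptions, not GSpartan's code):

```python
def laplacian_penalty(weights, adjacency):
    """Graph Laplacian smoothness penalty: sum over edges (i, j) of
    w_ij * ||theta_i - theta_j||^2, encouraging spatially adjacent
    local models to have similar parameters.

    weights   : list of per-location model weight vectors
    adjacency : symmetric matrix of spatial-proximity weights w_ij
    """
    total = 0.0
    for i, row in enumerate(adjacency):
        for j, w in enumerate(row):
            if j > i and w:  # each undirected edge counted once
                total += w * sum((a - b) ** 2
                                 for a, b in zip(weights[i], weights[j]))
    return total
```

Adding this term to the training loss shrinks the difference between neighboring models without forcing them to be identical.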
ArXiv, 2019
With the rise of opioid abuse in the US, there has been a growth of overlapping hotspots for overdose-related and HIV-related deaths in Springfield, Boston, Fall River, New Bedford, and parts of Cape Cod. With a large part of the population, including rural communities, active on social media, it is crucial that we leverage the predictive power of social media as a preventive measure. We explore the predictive power of the micro-blogging social media website Twitter with respect to HIV new diagnosis rates per county. While trending work in Twitter NLP has focused primarily on text-based features, we show that multi-dimensional feature construction can significantly improve predictive power over topic features alone with respect to STIs (sexually transmitted infections). By multi-dimensional features, we mean leveraging not only the topical features (text) of a corpus, but also location-based information (counties) about the tweets in feature construction. We develop novel text-location-...
Journal of Big Data
Event detection from social media aims at extracting specific or generic unusual happenings, such as family reunions, earthquakes, and disease outbreaks. This paper introduces a new perspective for the hybrid extraction and clustering of social events from big social data streams. We rely on a hybrid learning model, where supervised deep learning is used for feature extraction and topic classification, whereas unsupervised spatial clustering is employed to determine the event whereabouts. We present ‘Deep-Eware’, a scalable and efficient event-aware big data platform that integrates data stream and geospatial processing tools for the hybrid extraction and dissemination of spatio-temporal events. We introduce a pure incremental approach for event discovery, by developing unsupervised machine learning and NLP algorithms and by computing events’ lifetime and spatial spanning. The system integrates a semantic keyword generation tool using KeyBERT for dataset preparation. ...
PloS one, 2014
Twitter has become a popular data source as a surrogate for monitoring and detecting events. Targeted domains such as crime, election, and social unrest require the creation of algorithms capable of detecting events pertinent to these domains. Due to the unstructured language, short-length messages, dynamics, and heterogeneity typical of Twitter data streams, it is technically difficult and labor-intensive to develop and maintain supervised learning systems. We present a novel unsupervised approach for detecting spatial events in targeted domains and illustrate this approach using one specific domain, viz. civil unrest modeling. Given a targeted domain, we propose a dynamic query expansion algorithm to iteratively expand domain-related terms, and generate a tweet homogeneous graph. An anomaly identification method is utilized to detect spatial events over this graph by jointly maximizing local modularity and spatial scan statistics. Extensive experiments conducted in 10 Latin Americ...
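The dynamic query expansion idea, iteratively growing a seed vocabulary with terms that co-occur in matching tweets, can be sketched as follows. This is a toy illustration, not the paper's algorithm: the tokenized-tweet input and the round/top-n parameters are assumptions.

```python
from collections import Counter

def dynamic_query_expansion(tweets, seeds, rounds=2, top_n=2):
    """Iteratively grow a seed vocabulary with the terms that co-occur
    most often in tweets matching the current query.

    tweets : list of tokenized tweets (lists of lowercase terms)
    seeds  : initial domain-related seed terms
    """
    query = set(seeds)
    for _ in range(rounds):
        # tweets sharing at least one term with the current query
        matched = [t for t in tweets if query & set(t)]
        counts = Counter(w for t in matched for w in t if w not in query)
        query |= {w for w, _ in counts.most_common(top_n)}
    return query
```

Real systems weight candidate terms (e.g. by TF-IDF against a background corpus) rather than using raw co-occurrence counts, but the expand-match-expand loop is the same.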
Proceedings of the International AAAI Conference on Web and Social Media
Twitter, used in 200 countries with over 250 million tweets a day, is a rich source of local news from around the world. Many events of local importance are first reported on Twitter, including many that never reach news channels. Further, there are often only a few tweets reporting each such event, in contrast with the larger volumes that follow events of wider significance. Even though such events may be primarily of local importance, they can also be of critical interest to some specific but possibly far-flung entities: for example, a fire in a supplier’s factory half-way around the world may be of interest even from afar. In this paper we describe how this ‘long tail’ of events can be detected in spite of their sparsity. We then extract and correlate information from multiple tweets describing the same event. Our generic architecture for converting a tweet-stream into event-objects uses locality sensitive hashing, classification, boosting, information extraction and clustering. Our results, based ...
As social media has gained more attention from users on the Internet, it has become one of the most important information sources in the world, and the volume of data posted on social media sites is growing rapidly; social media is often described as new media replacing traditional media. In this paper, we concentrate on geotagged tweets on the Twitter site. These geotagged tweets are known as georeferenced documents because they include not only a short text message but also the document’s posting time and location. Many researchers have been developing new data mining techniques for georeferenced documents to recognize and analyze emergency topics, such as natural disasters, weather, diseases, and other incidents. In particular, the utilization of geotagged tweets to recognize and analyze natural disasters has recently received much attention from administrative agencies because some case studies have achieved compelling results. In this paper, we propose a novel real-time analysis application for identifying bursty local areas related to emergency topics. The aim of our application is to provide a new platform that can identify and analyze the localities of emergency topics. The proposed application is composed of three core computational intelligence techniques: the Naive Bayes classifier technique, the spatiotemporal clustering technique, and the burst detection technique. We have also implemented two types of application: a Web application interface and an Android application. To evaluate the proposed application, we have implemented a real-time weather observation system embedding the proposed application, using actual geotagged tweets crawled from the Twitter site. The weather detection system
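A burst detection step of the kind used in such applications can be sketched with a simple moving-average test (illustrative only; the window and threshold parameters are assumptions, not the application's actual technique):

```python
def detect_bursts(counts, window=3, threshold=2.0):
    """Flag time steps whose tweet count exceeds `threshold` times the
    mean of the preceding `window` counts (a simple moving-average test).

    counts : tweet counts per time step for one local area
    """
    bursts = []
    for t in range(window, len(counts)):
        baseline = sum(counts[t - window:t]) / window
        if baseline > 0 and counts[t] > threshold * baseline:
            bursts.append(t)
    return bursts
```

Production burst detectors (e.g. Kleinberg's state-machine model) are more robust to noise, but the baseline-versus-spike comparison is the core of all of them.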
Proceedings of the ACM SIGMOD Workshop on Databases and Social Networks - DBSocial '13, 2013
Unprecedented success and active usage of social media services result in massive amounts of user-generated data. An increasing interest in the contained information from social media data leads to more and more sophisticated analysis and visualization applications. Because of the fast pace and distribution of news in social media data it is an appropriate source to identify events in the data and directly display their occurrence to analysts or other users. This paper presents a method for event identification in local areas using the Twitter data stream. We implement and use a combined log-likelihood ratio approach for the geographic and time dimension of real-life Twitter data in predefined areas of the world to detect events occurring in the message contents. We present a case study with two interesting scenarios to show the usefulness of our approach.
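A common form of the log-likelihood ratio used for this kind of event detection is Dunning's G² statistic, comparing a term's frequency inside a target spatio-temporal window against a background corpus; the paper's exact combined geographic/temporal formulation may differ. A minimal sketch:

```python
import math

def llr(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio (G^2) for a term occurring k1 times
    in n1 tokens of the target window and k2 times in n2 background tokens.
    Large values indicate the term is unusually frequent in the window."""
    def ll(k, n, p):
        # binomial log-likelihood; guard against log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p) if 0 < p < 1 else 0.0
    p = (k1 + k2) / (n1 + n2)          # pooled rate under the null hypothesis
    p1, p2 = k1 / n1, k2 / n2          # separate rates under the alternative
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))
```

A term with the same relative frequency in window and background scores near zero; a bursting term scores high, so ranking terms by G² surfaces event keywords.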
Proceedings of the 21st international conference on World Wide Web, 2012
Micro-blogging services have become indispensable communication tools for online users for disseminating breaking news, eyewitness accounts, individual expression, and protest groups. Recently, Twitter, along with other online social networking services such as Foursquare, Gowalla, Facebook and Yelp, have started supporting location services in their messages, either explicitly, by letting users choose their places, or implicitly, by enabling geo-tagging, which is to associate messages with latitudes and longitudes. This functionality allows researchers to address an exciting set of questions: 1) How is information created and shared across geographical locations, 2) How do spatial and linguistic characteristics of people vary across regions, and 3) How to model human mobility. Although many attempts have been made at tackling these problems, previous methods are either too complicated to implement or so oversimplified that they cannot yield reasonable performance. It is a challenging task to discover topics and identify users' interests from these geo-tagged messages due to the sheer amount of data and diversity of language variations used on these location sharing services. In this paper we focus on Twitter and present an algorithm by modeling diversity in tweets based on topical diversity, geographical diversity, and an interest distribution of the user. Furthermore, we take the Markovian nature of a user's location into account. Our model exploits sparse factorial coding of the attributes, thus allowing us to deal with a large and diverse set of covariates efficiently. Our approach is vital for applications such as user profiling, content recommendation and topic tracking. We show high accuracy in location estimation based on our model. Moreover, the algorithm identifies interesting topics based on location and language.
A considerable portion of social media messages is devoted to current events. Aside from references to events that recently happened, social media messages may also refer to events that have not occurred yet. Future events, such as football matches in the case study we present here, may be scheduled and known to happen; other future events, such as transfers of football players, may only be rumoured, and may in fact not happen in the end. We describe a news mining component that learns to identify tweets referring to scheduled and unscheduled future events, by being trained on messages referring to scheduled future events (as the latter are easy to harvest). Our results show that discriminating between tweets that refer to upcoming football matches and tweets that refer to past matches can be done relatively reliably with supervised machine learning methods. However, when these trained models are applied to unscheduled events, performance drops to near-baseline performance. We discuss how these results can be explained by the distinction between event type and event domain.
International Journal of Data Science and Analytics, 2022
With COVID-19 affecting every country globally and changing everyday life, the ability to forecast the spread of the disease is more important than any previous epidemic. The conventional methods of disease-spread modeling, compartmental models, are based on the assumption of spatiotemporal homogeneity of the spread of the virus, which may cause forecasting to underperform, especially at high spatial resolutions. In this paper, we approach the forecasting task with an alternative technique-spatiotemporal machine learning. We present COVID-LSTM, a data-driven model based on a long short-term memory deep learning architecture for forecasting COVID-19 incidence at the county level in the USA. We use the weekly number of new positive cases as temporal input, and hand-engineered spatial features from Facebook movement and connectedness datasets to capture the spread of the disease in time and space. COVID-LSTM outperforms the COVID-19 Forecast Hub's Ensemble model (COVIDhub-ensemble) on our 17-week evaluation period, making it the first model to be more accurate than the COVIDhub-ensemble over one or more forecast periods. Over the 4-week forecast horizon, our model is on average 50 cases per county more accurate than the COVIDhub-ensemble. We highlight that the underutilization of data-driven forecasting of disease spread prior to COVID-19 is likely due to the lack of sufficient data available for previous diseases, in addition to the recent advances in machine learning methods for spatiotemporal forecasting. We discuss the impediments to the wider uptake of data-driven forecasting, and whether it is likely that more deep learning-based models will be used in the future.
Journal of Big Data
A key challenge in mining social media data streams is to identify events which are actively discussed by a group of people in a specific local or global area. Such events are useful for early warning for accident, protest, election or breaking news. However, neither the list of events nor the resolution of both event time and space is fixed or known beforehand. In this work, we propose an online spatio-temporal event detection system using social media that is able to detect events at different time and space resolutions. First, to address the challenge related to the unknown spatial resolution of events, a quad-tree method is exploited in order to split the geographical space into multiscale regions based on the density of social media data. Then, a statistical unsupervised approach is performed that involves Poisson distribution and a smoothing method for highlighting regions with unexpected density of social posts. Further, event duration is precisely estimated by merging events...
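The quad-tree step, splitting geographic space into multiscale regions by post density, can be sketched as the following recursion (illustrative only, not the system's implementation; the max-points and max-depth parameters are assumptions):

```python
def quadtree_cells(points, bbox, max_points=4, max_depth=8):
    """Recursively split a bounding box into four quadrants until each
    leaf holds at most `max_points` posts; returns the leaf boxes.

    points : (x, y) post coordinates
    bbox   : (x0, y0, x1, y1) with half-open membership [x0, x1) x [y0, y1)
    """
    x0, y0, x1, y1 = bbox
    inside = [(x, y) for x, y in points if x0 <= x < x1 and y0 <= y < y1]
    if len(inside) <= max_points or max_depth == 0:
        return [bbox]  # sparse (or depth-limited) region stays one cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    leaves = []
    for quad in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        leaves += quadtree_cells(inside, quad, max_points, max_depth - 1)
    return leaves
```

Dense areas end up covered by many small cells and sparse areas by few large ones, which is exactly the multiscale resolution the event detector needs.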
Abstract: When events related to emergency situations occur, it is important to access as much information as possible related to the event. In this context, social networks such as Twitter are an important source of real-time information. Classical information filtering techniques usually focus on analyzing the co-occurrence of terms with the initially considered set of keywords. However, these approaches can miss information, since they are unable to retrieve relevant content expressed with words that do not co-occur with the initial keywords expressing our information need. Considering geolocation, user, or temporal information within a pseudo-relevance feedback approach allows us to find terminology related to the event but not co-occurring with the initially considered keywords. Furthermore, by taking the temporal aspect into account, a query expansion function such as the Kullback-Leibler divergence can be modified in order to improve information filtering in these emergency situations. Our proposals have been evaluated on two collections of real-world events, obtaining encouraging results.
Social microblogs such as Twitter and Weibo are experiencing explosive growth, with billions of global users sharing their daily observations and thoughts. Beyond public interests (e.g., sports, music), microblogs can provide highly detailed information for those interested in public health, homeland security, and financial analysis. However, the language used on Twitter is heavily informal, ungrammatical, and dynamic. Existing data mining algorithms require extensive manual labeling to build and maintain a supervised system. This paper presents STED, a semi-supervised system that helps users to automatically detect and interactively visualize events of a targeted type from Twitter, such as crimes, civil unrest, and disease outbreaks. Our model first applies transfer learning and label propagation to automatically generate labeled data, then learns a customized text classifier based on mini-clustering, and finally applies fast spatial scan statistics to estimate the locations of events. We demonstrate STED's usage and benefits using Twitter data collected from Latin American countries, and show how our system helps to detect and track example events such as civil unrest and crimes.
IEEE Access, 2022
According to the World Health Organization, several factors have affected the accurate reporting of SARS-CoV-2 outbreak status, such as limited data collection resources, cultural and educational diversity, and inconsistent outbreak reporting from different sectors. Driven by this challenging situation, this study investigates the potential expediency of using social network data to develop a reliable early information surveillance and warning system for pandemic outbreaks. As such, an enhanced framework of three inherently interlinked subsystems is proposed. The first subsystem includes data collection and integration mechanisms, data preprocessing, and hybrid sentiment analysis tools to identify tweet sentiment taxonomies and quantitatively estimate public awareness. The second subsystem comprises the feature extraction unit that identifies, selects, embeds, and balances feature vectors and the classifier fitting and training unit. This subsystem is designed to capture the most effective linguistic feature combinations with more spatial evidence by using a variety of approaches, including linear classifiers, MLPs, RNNs, and CNNs, as well as pre-trained word embedding algorithms. The last is the modeling and situational awareness evaluation subsystem, which measures temporal associations between pandemic-relevant social network activities and officially announced infection counts in the most hazardous geolocations. The proposed framework was developed and tested using a combination of static datasets and real-time scraped Twitter data. The results of these experiments showed the remarkable performance of the framework in assessing the temporal associations between public awareness and outbreak status. It also showed that the Decision Tree Classifier with Unigram+TF-IDF feature vectors outperformed other conventional models for sentiment classification and geolocation classification, with accuracies of 94.3% and 80.8%, respectively.
As indicated, conventional machine learning algorithms did not achieve a precision of more than 80%, while, for instance, an MLP with a self-embedding layer, Word2Vec, and GloVe pre-trained word embeddings resulted in very poor accuracies of 10%, 36%, and 32%, respectively. However, adding PoS-tag one-hot encoding embedding increased the validation accuracy from 36% to approximately 89%, while the best performance for the second subsystem was achieved by Bi-LSTM with RoBERTa word embedding, with an accuracy of 96%. The achieved results reveal that the proposed framework can proactively capture the potential hazards associated with the prevalence of infectious diseases as an effective early detection and info-surveillance awareness system.
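The Unigram+TF-IDF feature vectors mentioned above can be computed as in this minimal sketch (standard TF-IDF; the unsmoothed IDF variant shown is an assumption, not necessarily the study's exact formula):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Unigram TF-IDF vectors for a list of tokenized documents,
    returned as sparse dicts mapping term -> weight."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / c) for w, c in df.items()}
    vecs = []
    for d in docs:
        tf = Counter(d)
        # term frequency normalized by document length, scaled by IDF
        vecs.append({w: (c / len(d)) * idf[w] for w, c in tf.items()})
    return vecs
```

A term appearing in every document gets IDF zero and thus carries no weight, which is why such vectors emphasize discriminative outbreak-related vocabulary.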
2013
Microblogging services such as Twitter, Facebook, and Foursquare have become major sources for information about real-world events. Most approaches that aim at extracting event information from such sources typically use the temporal context of messages. However, exploiting the location information of georeferenced messages, too, is important to detect localized events, such as public events or emergency situations. Users posting messages that are close to the location of an event serve as human sensors to describe an event. In this demonstration, we present a novel framework to detect localized events in real-time from a Twitter stream and to track the evolution of such events over time. For this, spatio-temporal characteristics of keywords are continuously extracted to identify meaningful candidates for event descriptions. Then, localized event information is extracted by clustering keywords according to their spatial similarity. To determine the most important events in a (r...
ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Social media can be a very useful source of data for people interested in disasters, since it can provide them with on-site information. Posted georeferenced messages and images can help to understand the situation in the area affected by the event. Considering this type of resource as a real-time crowdsource of crisis information, the spatial distribution of geolocated posts related to an event can represent an early indicator of the severity of impact. The aim of this paper is to explore the spatial distribution of Twitter posts related to Hurricane Michael, which occurred in the USA in 2018, and to analyse their potential in providing fast insight into the event's impact. Kernel density estimation has been applied to explore the spatial distribution of Twitter posts, after which Hot Spot analysis has been performed in order to analyse the spatiotemporal distribution of the data. Hot Spot analysis has proven to be the more comprehensive analysis, detecting the area of high impact. The kernel density map has proven to be useful as well.
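Kernel density estimation over geotagged posts can be sketched as follows. This is a planar Gaussian-kernel toy version; real analyses typically use GIS tooling with projected coordinates, and the bandwidth here is an arbitrary assumption.

```python
import math

def kernel_density(points, query, bandwidth=0.1):
    """Gaussian kernel density estimate at a query location from
    geotagged post coordinates (flat-plane approximation).

    points    : (x, y) post coordinates
    query     : (x, y) location at which to evaluate the density
    bandwidth : kernel standard deviation in coordinate units
    """
    h2 = bandwidth ** 2
    qx, qy = query
    total = sum(math.exp(-((x - qx) ** 2 + (y - qy) ** 2) / (2 * h2))
                for x, y in points)
    # normalize so the estimate integrates to 1 over the plane
    return total / (len(points) * 2 * math.pi * h2)
```

Evaluating this on a grid and drawing contours yields the density map; areas of peak density then serve as candidates for the high-impact zone.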
Road traffic prediction is a critical component in modern smart transportation systems. It provides the basis for traffic management agencies to generate proactive traffic operation strategies for alleviating congestion. Existing work on near-term traffic prediction (forecasting horizons in the range of 5 minutes to 1 hour) relies on the past and current traffic conditions. However, once the forecasting horizon is beyond 1 hour, i.e., in longer-term traffic prediction, these techniques do not work well since additional factors other than the past and current traffic conditions start to play important roles. To address this problem, in this paper, for the first time, we examine whether it is possible to use the rich information in online social media to improve longer-term traffic prediction. To this end, we first analyze the correlation between traffic volume and tweet counts with various granularities. Then we propose an optimization framework to extract traffic indicators based on tweet semantics using a transformation matrix, and incorporate them into traffic prediction via linear regression. Experimental results using traffic and Twitter data originated from the San Francisco Bay area of California demonstrate the effectiveness of our proposed framework.
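Incorporating a tweet-derived indicator into traffic prediction via linear regression can be illustrated with a one-variable least-squares fit. This is a toy sketch: the paper's framework learns a transformation matrix over tweet semantics, which is not reproduced here.

```python
def ols_fit(x, y):
    """Ordinary least-squares fit y ~ a + b*x, e.g. regressing traffic
    volume (y) on a tweet-count indicator (x). Returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def ols_predict(coeffs, x):
    a, b = coeffs
    return a + b * x
```

With multiple tweet-derived indicators, the same idea extends to multivariate regression; the single-feature version keeps the correlation-to-prediction step visible.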
Crime Science, 2020
Background: Crime, traffic accidents, terrorist attacks, and other space-time random events are unevenly distributed in space and time. In the case of crime, hotspot and other proactive policing programs aim to focus limited resources at the highest risk crime and social harm hotspots in a city. A crucial step in the implementation of these strategies is the construction of scoring models used to rank spatial hotspots. While these methods are evaluated by area normalized Recall@k (called the predictive accuracy index), models are typically trained via maximum likelihood or rules of thumb that may not prioritize model accuracy in the top k hotspots. Furthermore, current algorithms are defined on fixed grids that fail to capture risk patterns occurring in neighborhoods and on road networks with complex geometries. Results: We introduce CrimeRank, a learning to rank boosting algorithm for determining a crime hotspot map that directly optimizes the percentage of crime captured by the top ranked hotspots. The method employs a floating grid combined with a greedy hotspot selection algorithm for accurately capturing spatial risk in complex geometries. We illustrate the performance using crime and traffic incident data provided by the Indianapolis Metropolitan Police Department, IED attacks in Iraq, and data from the 2017 NIJ Real-time crime forecasting challenge. Conclusion: Our learning to rank strategy was the top performing solution (PAI metric) in the 2017 challenge. We show that CrimeRank achieves even greater gains when the competition rules are relaxed by removing the constraint that grid cells be a regular tessellation.
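The predictive accuracy index (area-normalized Recall@k) that CrimeRank optimizes can be computed as in this sketch. It is a cell-based toy version assuming equal-area cells; the floating-grid and greedy-selection machinery is not reproduced.

```python
def predictive_accuracy_index(cell_scores, cell_crimes, cell_area, total_area, k):
    """PAI for the k highest-scoring cells: the fraction of crime captured
    by the flagged cells, divided by the fraction of area they cover.
    PAI = 1 means no better than flagging area at random; higher is better."""
    top = sorted(range(len(cell_scores)), key=lambda i: -cell_scores[i])[:k]
    captured = sum(cell_crimes[i] for i in top)
    recall = captured / sum(cell_crimes)        # Recall@k over crime counts
    area_frac = k * cell_area / total_area      # fraction of the city flagged
    return recall / area_frac
```

Because PAI rewards concentrating crime into a small flagged area, directly optimizing it (as a learning-to-rank objective) differs from maximum-likelihood training of an intensity model.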