Practical Extraction of Disaster-Relevant Information from Social Media

Muhammad Imran∗ (University of Trento), Shady Elbassuoni (American University of Beirut), Carlos Castillo (Qatar Computing Research Institute), Fernando Diaz (Microsoft Research), Patrick Meier (Qatar Computing Research Institute)

∗ Work done while the author was at QCRI.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil.
ACM 978-1-4503-2038-2/13/05.
[email protected]ABSTRACT The rest of the paper is organized as follows. Section 2
During times of disasters online users generate a significant describes our information-extraction method, which is eval-
amount of data, some of which are extremely valuable for re- uated in Section 3. Section 4 shows that our method can be
lief efforts. In this paper, we study the nature of social-media applied also in non-disaster settings. In Section 5, we briefly
content generated during two different natural disasters. We outline related works, and conclude in Section 6.
also train a model based on conditional random fields to ex-
tract valuable information from such content. We evaluate 2. DESCRIPTION OF OUR APPROACH
our techniques over our two datasets through a set of care- This section describes the classification and extraction
fully designed experiments. We also test our methods over steps of our method. For clarity of the exposition and con-
a non-disaster dataset to show that our extraction model creteness, we begin by describing the datasets we use.
is useful for extracting information from socially-generated
content in general. 2.1 Datasets
We use two datasets related to recent emergencies:
Categories and Subject Descriptors Joplin 2011: 206,764 tweets collected during the tornado
that struck Joplin, Missouri (USA) on May 22, 2011. Re-
I.2.7 [Natural Language Processing]: Text analysis
searchers at the University of Colorado at Boulder collected
the dataset through Twitter’s API using the hashtag1 #joplin.
Keywords Sandy 2012: 140,000 tweets collected during the Hurri-
Social Media; Information Filtering; Information Extraction cane Sandy, that hit Northeastern US on Oct 29, 2012. The
dataset was collected using the hashtags #sandy, #nyc.
1. INTRODUCTION 2.2 Classification
Microblogging platforms have become an important way As the messages generated during a disaster are extremely
to share information on the Web, especially during time- varied, an automatic system needs to start by filtering out
critical events such as natural and man-made disasters. In messages that do not contribute to valuable information.
recent years, Twitter has been used to spread news about These include those that are entirely of personal nature and
casualties and damages, donation efforts and alerts, includ- those not relevant to the crisis at hand. Specifically, we start
ing multimedia information such as videos and photos [1, 3]. by separating messages into two main classes classes:
Given the importance of on-topic tweets for time-critical sit- • Personal: if a message is only of interest to its author
uational awareness, disaster-affected communities and pro- and her immediate circle of family/friends and does
fessional responders may benefit from using an automatic not convey any useful information to people who do
system to extract relevant information from Twitter. not know its author.
We propose a two-step method for disaster-related infor- • Informative: if the message is informative (of interest
mation extraction: (i) classification of tweets and (ii) ex- to other people beyond the author’s immediate circle).
traction from tweets. The classification step is based on our • Other: if the message is not related to the disaster.
earlier work [8]; the extraction step is the focus of this pa- Furthermore, we differentiate between two types of infor-
per. Both steps are done using off-the-shelf free software [6, mative messages: direct, i.e., written by a person who is a
7], yielding a system that is easy to implement and that direct eyewitness of what is taking place or indirect, when
according to our experiments has good performance. the message repeats information reported by other sources.
∗
Work done while the author was at QCRI. Once we detect informative tweets, we classify them into
the following classes (details on the choice of this ontology
Copyright is held by the International World Wide Web Conference can be found in [8]):
Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink 1
to the author’s site if the Material is used in electronic media. These hashtags are mostly announced by the crisis man-
WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil. agement authorities at the time of an incident.
ACM 978-1-4503-2038-2/13/05.
Table 1: Type-dependent instructions given to the assessors for the extraction phase, and an example tweet for each type (in the original table, the extracted part is shown in boldface).

Type: Caution or advice: All
Instruction: Copy-paste the word/phrase that warns about a potential hazard or advises what to do.
Example: .@NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy #NYC.

Type: Information source: Photos/videos
Instruction: Copy-paste the word/phrase that indicates what the contents of a photo/video are about.
Example: RT @NBCNewsPictures: Photos of the unbelievable scenes left in #Hurricane #Sandy's wake http://t.co/09U9L5rW #NYC #NJ

Type: People: missing or lost people
Instruction: Copy-paste the word/phrase that indicates who is missing or has been found.
Example: rt @911buff: public help needed: 2 boys 2 & 4 missing found nearly 24 hours after they got separated from their mom when car submerged in si. #sandy #911buff

Type: Casualties and damage: Infrastructure
Instruction: Copy-paste the word/phrase that names a structure, road, service, line, etc. that is not working or has been damaged.
Example: RT @TIME: NYC building had numerous construction complaints before crane collapse http://t.co/7EDmKOp3 #Sandy

Type: Casualties and damage: Injured or dead
Instruction: Copy-paste the word/phrase that indicates who (or how many people) has been injured or has died.
Example: At least 39 dead millions without power in Sandy's aftermath. http://t.co/Wdvz8KK8

Type: Donations: Requests money/goods/services
Instruction: Copy-paste the word/phrase that indicates what (money, goods, work, free services, etc.) is being requested as a donation.
Example: 400 Volunteers are needed for areas that #Sandy destroyed.

Type: Donations: Offers money/goods/services
Instruction: Copy-paste the word/phrase that indicates what (money, goods, work, free services, etc.) is being offered as a donation.
Example: I want to volunteer to help the hurricane Sandy victims. If anyone knows how I can get involved please let me know!

Type: People: Celebrities/authorities
Instruction: Copy-paste the word/phrase that names a celebrity or authority that reacts to the event or visits the area.
Example: V.P. candidate Ryan attends a food drive in Wisconsin for victims of Hurricane Sandy. PO-35WE on BitCentral.
Once we detect informative tweets, we classify them into the following classes (details on the choice of this ontology can be found in [8]):

• Caution and Advice: if a message conveys or reports information about some warning or a piece of advice about a possible hazard of an incident.
• Casualties and Damage: if a message reports information about casualties or infrastructure damage caused by an incident.
• Donations of money, goods or services: if a message speaks about goods or services offered or needed by the victims of an incident.
• People missing, found, or seen: if a message reports a missing or found person affected by an incident, or reports the reaction or visit of a celebrity.
• Information Sources: if a message points to information sources, photos, or videos, or mentions a website, TV or radio station providing extensive coverage.
• Other: other types of informative messages.

As we describe in our previous work [8], a set of multi-label classifiers were trained to automatically classify a tweet into one or more of the above classes. Naïve Bayesian classifiers are used as implemented in Weka [7]. Our classifiers use a rich set of features including word unigrams, bigrams, Part-of-Speech (POS) tags, and others. Our feature set also contains a set of binary features (for example, whether a tweet contains a URL, an emoticon, a hashtag, etc.) and scalar features (such as the tweet length). The training data for our classifiers were obtained by manually classifying a set of tweets using crowdsourcing via the provider CrowdFlower (http://www.crowdflower.com). We obtained about 2,000 labels for the Sandy dataset, and about 4,400 for the Joplin dataset.
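For illustration only, the sketch below shows a comparable classification setup in Python with scikit-learn: word unigrams and bigrams plus a few binary and scalar features, fed to a Naïve Bayes classifier. It is a minimal approximation of the Weka-based pipeline described above, not the authors' actual implementation; the toy tweets, labels, and hand-crafted features are assumptions made for the example (POS-tag features, for instance, are omitted, and the example is single-label rather than multi-label).

```python
# Minimal sketch of a tweet classifier loosely following the paper's description:
# word unigrams/bigrams plus a few binary and scalar features, fed to Naive Bayes.
# This is an illustration only, not the authors' Weka-based pipeline.
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def handcrafted_features(tweets):
    """Binary features (URL, hashtag, emoticon) and a scalar length feature."""
    rows = []
    for t in tweets:
        rows.append([
            1.0 if "http://" in t or "https://" in t else 0.0,  # contains a URL
            1.0 if "#" in t else 0.0,                             # contains a hashtag
            1.0 if re.search(r"[:;]-?[)(DP]", t) else 0.0,        # crude emoticon check
            len(t) / 140.0,                                       # normalized tweet length
        ])
    return csr_matrix(np.array(rows))


# Toy labeled examples; in the paper the labels come from crowdsourcing.
train_tweets = [
    "At least 39 dead, millions without power in Sandy's aftermath http://t.co/x",
    "400 volunteers are needed for areas that #Sandy destroyed",
    "i just want this storm to be over so i can see my friends again",
]
train_labels = ["casualties_damage", "donations", "personal"]

vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X_train = hstack([vectorizer.fit_transform(train_tweets),
                  handcrafted_features(train_tweets)])

clf = MultinomialNB()
clf.fit(X_train, train_labels)

test = ["Bridges must close by 7pm #Sandy"]
X_test = hstack([vectorizer.transform(test), handcrafted_features(test)])
print(clf.predict(X_test))
```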
2.3 Extraction
Once a tweet has been classified into one of the above classes, class-relevant information can be extracted for further analysis. For example, for a casualties and damage tweet, the number of casualties or the name of the infrastructure that was damaged can be identified.

We treated the task of detecting class-relevant information as a sequence labeling task. A tweet is considered a sequence of word tokens. In a sequence labeling task, each token is algorithmically labeled as part of a subsequence of target information or as unrelated to such information. In the example of the first tweet in Table 1, the tokens "closing", "of", "NYC", and "bridges" are labeled as positive (part of the target information), while the rest of the tokens are labeled as negative. An example is shown below; note that the period (".") is also a token:

    ...  orders  closing  of  NYC  bridges  .  Only  Staten  ...
           -        +      +   +      +     -    -      -

We use conditional random fields, a machine-learned sequence labeling algorithm, for our task [9]. A conditional random field (CRF) is a probabilistic model which, in our task, predicts the label of each token ("+" or "-") given both information endogenous to the token (e.g., 'token is a number', 'token is the word bridges') as well as information exogenous to the token (e.g., 'token is preceded by the word closing'). CRFs have been applied successfully in the past to other information extraction tasks [10].

We use ArkNLP, an implementation of CRFs and a set of features known to be effective for NLP tasks on Twitter data [6]. In practice, we simply change the training data of ArkNLP to conform to what we described above, and execute it without further modifications.

Crowdsourcing task. During the crowdsourcing task for extraction, we show to the assessors each tweet and the type (and sub-type, if available) determined during the classification phase. We use an instruction that is specific to each sub-type, as listed in the "Instruction" column of Table 1.

The workers were shown a tweet, this instruction, and an empty text input field, and were asked to copy-paste a word or short phrase from the tweet conveying the specified information. We did not accept any training example in which the segment extracted by the crowdsourcing worker was not contained in the original tweet.
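As a rough, hypothetical illustration of this sequence-labeling formulation (not the ArkNLP setup actually used in the paper), the sketch below converts a (tweet, extracted phrase) pair into "+"/"-" token labels and trains a CRF with the third-party sklearn-crfsuite package; the whitespace tokenizer, the feature template, and the toy data are assumptions made for the example.

```python
# Sketch of the sequence-labeling setup: each token gets "+" if it belongs to the
# crowdsourced extraction and "-" otherwise, then a CRF is trained on token features.
# Uses sklearn-crfsuite (a wrapper around CRFsuite), not ArkNLP.
import sklearn_crfsuite


def tokenize(text):
    # Naive whitespace tokenizer; a Twitter-aware tokenizer would be used in practice.
    return text.split()


def label_tokens(tweet_tokens, extracted_phrase):
    """Mark tokens inside the extracted span with '+', all other tokens with '-'."""
    span = tokenize(extracted_phrase)
    labels = ["-"] * len(tweet_tokens)
    for start in range(len(tweet_tokens) - len(span) + 1):
        if tweet_tokens[start:start + len(span)] == span:
            for i in range(start, start + len(span)):
                labels[i] = "+"
            break
    return labels


def token_features(tokens, i):
    """Features endogenous to the token plus its immediate neighbors."""
    return {
        "word": tokens[i].lower(),
        "is_number": tokens[i].isdigit(),
        "is_hashtag": tokens[i].startswith("#"),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


# Toy training pair based on the worked example in the text.
tweet = "@NYGovCuomo orders closing of NYC bridges . Only Staten Island bridges unaffected"
extraction = "closing of NYC bridges"

tokens = tokenize(tweet)
X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [label_tokens(tokens, extraction)]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(list(zip(tokens, crf.predict(X_train)[0])))
```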
3. EXPERIMENTAL RESULTS
Metrics. We evaluate our system by comparing its output with the responses provided by humans. We train our system on a part of the human-provided labels, and test the system on the remaining part. There are two aspects we measure that are related to the sensitivity and the specificity of our system.

Detection rate (analogous to statistical sensitivity, or recall) measures the fraction of examples in which humans found a relevant piece of information, and our system also found something, even if that something is incorrect.

Hit ratio (analogous to one minus the specificity, or precision) measures the fraction of examples for which our system found something, and that something could be considered correct by humans. We consider the output correct if it overlaps in at least one word with the given human label.

Metrics example. An example can illustrate these metrics. Suppose the input and output are as follows:

       Input                     Output
  a    There were 12 injured     <empty>
  b    A bridge has collapsed    bridge
  c    10 volunteers needed      needed

In this case, the detection rate is 66%, given that in two ({b, c}) of the 3 examples our system detected something. The hit ratio is 50%, given that only in one of the two (b) the output overlaps with the target extraction in the input.
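As a concrete, self-contained illustration (not code from the paper), the snippet below computes both metrics exactly as defined above and reproduces the numbers of the worked example. The gold extractions assumed for tweets (a), (b), and (c) are plausible guesses made for the demo, since the example only specifies the system output.

```python
# Detection rate: fraction of examples (all of which have a human-labeled target)
# for which the system produced any output at all.
# Hit ratio: among those detections, the fraction whose output shares at least
# one word with the human label.
def detection_rate(pairs):
    detected = [out for _gold, out in pairs if out]
    return len(detected) / len(pairs)


def hit_ratio(pairs):
    detected = [(gold, out) for gold, out in pairs if out]
    hits = [1 for gold, out in detected
            if set(gold.lower().split()) & set(out.lower().split())]
    return len(hits) / len(detected)


# The worked example from the text as (assumed human label, system output) pairs.
examples = [
    ("12 injured", ""),           # (a) system produced nothing
    ("A bridge", "bridge"),       # (b) one word overlaps -> counted as correct
    ("10 volunteers", "needed"),  # (c) no overlap -> counted as incorrect
]

print(detection_rate(examples))  # 0.666... (2 of 3 examples detected)
print(hit_ratio(examples))       # 0.5 (1 of 2 detections correct)
```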
General results. Table 2 shows the results of our various experiments, where we selected the largest classes we had available: caution and advice; casualties and damage: infrastructure; and donations. In general, and similarly to the precision-recall trade-offs observed in information retrieval systems, a higher detection rate is often associated with a lower hit ratio, and vice versa.

Table 2: Performance of the information extraction phase for several configurations of training and testing set. "All" means no distinction between categories. The second and fourth columns show the number of tweets in the training and test data respectively.

  Train on 66% of Joplin, Test on 33% of Joplin
  Train        Test         Detected  Det. rate  Hit ratio
  All    338   All    169   131       78%        90%
  All    338   C&A    130   109       84%        93%
  All    338   Infra.   4     3       75%        33%
  All    338   Dona.   34    25       74%        92%
  C&A    260   C&A    130   118       91%        95%
  Infra.  10   Infra.   4     1       25%         0%
  Dona.   69   Dona.   34    16       47%        81%

  Train on 66% of Sandy, Test on 33% of Sandy
  Train        Test         Detected  Det. rate  Hit ratio
  All    397   All    198    82       41%        79%
  All    397   C&A     69    27       39%        74%
  All    397   Infra.  93    71       76%        83%
  All    397   Dona.   35    23       66%        83%
  C&A    139   C&A     69    26       38%        85%
  Infra. 187   Infra.  93    50       54%        80%
  Dona.   72   Dona.   35    12       34%        83%

  Train on 100% of Joplin, Test on 100% of Sandy
  Train (Joplin)   Test (Sandy)   Detected  Det. rate  Hit ratio
  All    507       All    595     66       11%        78%
  All    507       C&A    208      4        2%       100%
  All    507       Infra. 280     24        9%        71%
  All    507       Dona.  107     38       36%        82%
  C&A    390       C&A    208      2        1%       100%
  Infra.  14       Infra. 280     44       16%        73%
  Dona.  103       Dona.  107     52       49%        90%

  Train on 100% of Joplin + 10% of Sandy, Test on 90% of Sandy
  Train (Joplin+)  Test (Sandy-)  Detected  Det. rate  Hit ratio
  All    568       All    534    112       21%        81%
  All    568       C&A    187      9        5%       100%
  All    568       Infra. 251     64       25%        80%
  All    568       Dona.   96     39       41%        79%
  C&A    411       C&A    187     18       10%        71%
  Infra.  43       Infra. 251    106       42%        83%
  Dona.  114       Dona.   96     46       48%        89%

There are four blocks that study different scenarios. Let us focus for now on the first row of each block, where Train is "All" and Test is "All".

The first two blocks measure the performance of our system on the Joplin and Sandy data. The detection rate is higher for Joplin (78%) than for Sandy (41%). The hit ratio is also higher for Joplin (90%) than for Sandy (78%). This points out that the second dataset is more challenging for our system than the first one. However, in both cases the hit ratio is rather high, indicating that when our system extracts some part of the tweet, it is often the correct part.

The third block measures the performance of a hypothetical system trained on data from Joplin and then tested on data from Sandy. This is usually referred to as an adaptation or transfer scenario. We can observe that, compared to a scenario where we would train on data from Sandy, the detection rate drops dramatically (11% vs 41%), while the hit ratio is not affected significantly (78% vs 79%).

The most affected class of tweets are the ones providing caution and advice, which seem to be quite event-specific. On the other hand, the performance for the donation-related tweets is the least affected among the three classes, indicating that the words and phrases used to describe donations do not vary as much as for the other classes from one event to another.

In the fourth block, we consider an adaptation scenario in which a limited amount of new data (from Sandy) is incorporated into the training. This simulates a case in which we wait for a few hours before generating an output, in order to obtain some labeled examples about the new event. The performance is higher than in the previous case, with a detection rate of 21% and a hit ratio of 81%. This last result shows that we can incrementally improve our model to work better whenever we need to use it on a new disaster.

Detailed results. In each block, the first row reports the detection rate and the hit ratio when we train a single model over all the tweets in our training set and test it over all the tweets in our test set, regardless of the tweets' respective classes. In the next three rows we disaggregate this setting for each class in the testing part. Finally, in the last three rows we show the performance when we train three different models, one for each class, and test each only over tweets of the same class.

The results indicate that class-specific models may lead to improvements in performance for some classes but not for others. The class-specific models are particularly helpful for the caution and advice class of tweets, and yield improvements in the detection rate for the Sandy dataset in the case of donation-related tweets. There are no consistent gains for the tweets related to infrastructure damage, except when training on Joplin and testing on Sandy.
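The fourth block corresponds to a simple data-mixing form of adaptation: retrain on all of the labeled data from the previous event plus the small labeled portion already available for the new one, and evaluate on the remainder. A hypothetical sketch of that protocol is shown below; the training and evaluation routines are passed in as placeholders and are not an API defined in the paper.

```python
# Hypothetical sketch of the fourth-block adaptation protocol:
# train on 100% of the old event plus a small labeled slice of the new event,
# then evaluate on the remaining (held-out) portion of the new event.
def adapt_and_evaluate(train_fn, eval_fn, old_event_data, new_event_data, new_fraction=0.10):
    cutoff = int(len(new_event_data) * new_fraction)
    model = train_fn(old_event_data + new_event_data[:cutoff])  # e.g., Joplin + 10% of Sandy
    return eval_fn(model, new_event_data[cutoff:])              # e.g., remaining 90% of Sandy
```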
4. GENERALIZATION TO OTHER EVENTS
A robust approach should generalize to a variety of scenarios, including non-disaster-related events. In this section we briefly discuss a set of experiments on a non-disaster dataset corresponding to a sports match. The dataset, which consists of 72,000 tweets, was collected through the Twitter Streaming API using the hashtags #cricket, #indvspak, and #indvpk during a cricket match between Pakistan and India on January 6th, 2013.

Crowdsourcing task. We label the data using the same procedure as for our other datasets. In the first task, which comprised 2,000 unique tweets, we asked workers to label each individual tweet to (i) separate informative tweets from personal ones and (ii) for an informative tweet, specify what information it conveys.

We used six classes that are domain-dependent and correspond to events during a cricket match: boundary, score, over, dismissal, ball, and other (see http://en.wikipedia.org/wiki/Cricket for the terminology). In the second task, which comprised 631 informative tweets, the workers were presented with the type and sub-type of a tweet and asked to copy-paste a word or short phrase following a type-dependent instruction.

Experimental results. Table 3 shows the results of various experiments on this dataset. The first two rows are scenarios where a single model is created; the remaining rows correspond to multiple class-specific models. When trained over the whole training set and tested on the whole test set, we observe a relatively low detection rate. This can be improved if we incorporate examples in which more than one type of information is present in a given tweet, as shown in the second row. We can also see significant improvements in hit ratio for all the class-specific models.

Table 3: Results with cricket data.
  Training cases               Testing cases  Detection rate  Hit ratio
  All                    321   161            43%             95%
  All (multiple labels)  321   161            51%             95%
  Score                  129    66            65%             98%
  Other                  100    51            76%             92%
  Dismissal               63    31            81%             88%
  Boundary                18     8            88%            100%
  Ball                     6     3           100%            100%
  Over                     5     2            50%            100%

5. RELATED WORK
During emergencies, social media platforms such as Facebook and Twitter distribute up-to-date situational awareness information (e.g., damage, casualties) in many forms (e.g., photos, videos) [2, 3]. Cameron et al. [4] describe a platform for emergency situation awareness that detects incidents using burst keyword detection and classifies interesting tweets using an SVM classifier. However, identification of on-topic informative messages and extraction of actionable information pose serious challenges due to the noisy and unstructured nature of Twitter data. Most previous works were based on standard machine learning methods that were typically trained on formal news text and performed poorly on an extremely informal source like Twitter [5]. In this paper we used the classification-extraction approach presented in our previous work [8], adapting in a simple and straightforward manner the Twitter-specific part-of-speech tagger ArkNLP to our task [6].

6. CONCLUSIONS AND FUTURE WORK
We have presented a practical system that can extract disaster-relevant information from tweets. According to extensive experiments on two different datasets, our approach can detect from 40% to 80% of the tweets containing this type of information, and generate an output that is correct 80% to 90% of the time.

This tweet-level extraction is, in our opinion, key to being able to extract reliable high-level information. Observing, for instance, that a large number of tweets in similar locations report the same infrastructure as being damaged may be a strong indicator that this is indeed the case.

Please contact the authors for inquiries about data availability.

Acknowledgments. Sincere thanks to Kate Starbird and Project EPIC at the University of Colorado at Boulder for sharing the tweet-ids of the Joplin dataset.

7. REFERENCES
[1] Cynthia D. Balana. Social media: major tool in disaster response, 2012.
[2] Fredrik Bergstrand and Jonas Landgren. Information sharing using live video in emergency response work. In Proc. of ISCRAM. Citeseer, 2009.
[3] Heather Blanchard, Andy Carvin, Melissa Elliott Whitaker, and Merni Fitzgerald. The case for integrating crisis response with social media. White Paper, American Red Cross, 2012.
[4] Mark A. Cameron, Robert Power, Bella Robinson, and Jie Yin. Emergency situation awareness from Twitter for crisis management. In Proc. of WWW, pages 695–698. ACM, 2012.
[5] Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. In Proc. of HLT, pages 80–88, 2010.
[6] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proc. of HLT, pages 42–47, Stroudsburg, PA, USA, 2011.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009.
[8] Muhammad Imran, Shady Mamoon Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Extracting information nuggets from disaster-related messages in social media. In Proc. of ISCRAM, Baden-Baden, Germany, 2013.
[9] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pages 282–289, 2001.
[10] Fuchun Peng and Andrew McCallum. Information extraction from research papers using conditional random fields. IP&M, 42(4):963–979, 2006.