A Survey On Event Extraction From Webpage

The document presents a survey on event extraction (EE) from webpages, emphasizing its significance in natural language processing (NLP) and the necessity for updated methodologies due to the evolving landscape of data and evaluation metrics. It discusses the definitions, types, and applications of events, as well as the strengths and weaknesses of various event extraction systems. Additionally, the paper reviews several recent studies and models that have contributed to advancements in EE techniques and their practical implementations.

8th International Conference on Contemporary Information Technology and Mathematics (ICCITM2022), Mosul University, Mosul, Iraq

A Survey on Event Extraction from Webpage


2022 8th International Conference on Contemporary Information Technology and Mathematics (ICCITM) | 979-8-3503-3486-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICCITM56309.2022.10031730

Rasha Ali
Computer Science Dep.
Mosul University
Mosul, Iraq
[email protected]

Dr. Ban Sharief Mustafa
Computer Science Dep.
Mosul University
Mosul, Iraq
[email protected]

Abstract— Event extraction (EE) is one of the most important research topics in the field of Natural Language Processing (NLP). Numerous significant events occur every day all over the world, and they are published in various media outlets with varying narrative styles. A primary job in EE is to determine whether events in the real world have been reported in articles and posts, driven by the 4Ws (what, who, when, and where). A complex relationship between people, places, actions, and objects can be explained by an event. An event-centered model captures the dynamic and semantic aspects of representing event facts. An updated and comprehensive survey is needed due to the proliferation of methods, datasets, and evaluation metrics in the literature. This paper surveys the extraction of events from websites and describes their types and applications in several different fields. In addition, the study presents the strengths and weaknesses of event extraction systems for different types of models.

Keywords— Event extraction, natural language processing (NLP), event corpus, web page.

I. INTRODUCTION
An event is something that happens. It can often be described as a change of state, and it is specific to its participants [1]. Event extraction aims to find event instances in published texts and, if they exist, to identify the type of each event with all of its attributes and participants. Although different types of events can be defined by their different arguments, event extraction can be summarized simply as obtaining a structured event representation from unstructured natural language, to help answer the "5W1H" questions. These questions ask "who, when, where, what, why," and "how" about real-world events reported in a variety of text sources, such as social media posts, news articles, and so on [2].
Information retrieval is considered an important task in Natural Language Processing (NLP). Event extraction is one form of it and has many different applications in various fields [3]. For example, the structured events can be used directly to expand the knowledge on which further logical reasoning and inference can be made [4].

II. EVENT EXTRACTION TASK
Event extraction (EE) requires the Named Entity Recognition (NER) and Relation Extraction (RE) tasks. EE, as a multipurpose subject, is highly related to statistics, computer science, and NLP [5].
An event in general can be defined as "a specific occurrence" of something happening at a specific place and time, with which one or more participants are related. Extracting events involves two levels of extraction: the document level and the sentence level. These two levels allow the event extractor to achieve its goal, which requires the ability to answer the questions posed to the extraction tool, including when, who, where, etc. [6].
Event terms and definitions include the following [7]:
• Event mention: a phrase or sentence in which an event, including a trigger and arguments, is described.
• Event trigger: a keyword, either a verb or a noun, that clearly describes an event that has occurred.
• Event argument: an entity mention, temporal expression, or value that is assigned a specific role in an event as a participant or attribute.
• Argument role: the relationship between the argument and the event in which it participates.
Event extraction is intended to extract a characterization of an event from the text, defined by a set of entities, each associated with a specific role in the event. Some techniques employ data-driven approaches, others employ knowledge-driven approaches, and still others employ hybrid approaches.

III. EVENT EXTRACTION CORPORA
Event extraction corpora are annotated by domain experts or professionals and are used to train or evaluate models. However, because the annotation process is cost-prohibitive, many public corpora are small and have low coverage, such as the ACE event corpus, the TAC-KBP corpus, the TDT corpus, and other domain-specific corpora [2,8]. They are based on two major types of extraction domains, as shown in Fig. 1.
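
The event terms listed in Section II (mention, trigger, argument, and argument role) map naturally onto a small structured record. The following is only an illustrative sketch: the class names, field names, role labels, event type, and example sentence are assumptions made here, not part of the ACE guidelines or of any corpus named above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    text: str  # entity mention, temporal expression, or value
    role: str  # argument role (hypothetical labels: "who", "when", "where")


@dataclass
class EventMention:
    sentence: str    # phrase or sentence in which the event is described
    trigger: str     # verb or noun that signals the event
    event_type: str  # hypothetical type label for illustration
    arguments: List[Argument] = field(default_factory=list)

    def answer(self, role: str) -> List[str]:
        """Return the texts of all arguments that fill the given role."""
        return [a.text for a in self.arguments if a.role == role]


# Invented example sentence, annotated by hand for illustration.
mention = EventMention(
    sentence="The university held a conference in Mosul on Wednesday.",
    trigger="held",
    event_type="Conference",
    arguments=[
        Argument("The university", "who"),
        Argument("Mosul", "where"),
        Argument("Wednesday", "when"),
    ],
)

print(mention.answer("where"))  # ['Mosul']
```

Keeping the role on each argument, rather than on the event, mirrors the definition of an argument role as a relationship between one argument and the event it participates in.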

Fig. 1. Classification of event extraction into different types [5].

A. Closed Domain Event Extraction
Here the extraction of the event occurs in a closed domain. Closed-domain event extraction uses a pre-defined event schema to detect and extract the desired events from a specific type of text. The methods of this type can be classified into four groups: the pattern matching group, which comprises the traditional methods; the machine learning group, in which machine learning algorithms are applied; the deep learning group, in which deep learning algorithms are applied; and finally the semi-supervised learning group, in which deep learning algorithms are combined with semi-supervised learning methods [9].

B. Open Domain Event Extraction
It differs from closed-domain event extraction in that it extracts new or unexpected events from text or scripts. In this case, there is no pre-defined event type, and inferring the event schema is an important sub-task of open-domain event extraction. The techniques used in current approaches can be divided into Bayesian, clustering-based, parsing-based, lexicon-based, semi-supervised and distantly supervised, and adversarial domain adaptation [2].

IV. EVENT EXTRACTION FROM WEBPAGE
The growing amount of online material is one of the Web's rapidly changing characteristics. Many important sites, such as those of universities, institutions, and news outlets, have a web page that deals with news, activities, and events of interest to the institution responsible for the site. Most of this content is natural language text presented as Hypertext Markup Language (HTML). Currently, the majority of web information can be considered noisy input to NLP algorithms, which are designed to work with well-formed, grammatical sentences and are rarely effective with noisy data [10].
A webpage is an HTML document that forms part of a website and is displayed to users in any web browser. In general, a website consists of one or more web pages that are linked together. The name "webpage" comes from the paper pages that are bound together into a book. A website therefore has not just one but a group of interlinked web pages. These web pages contain information, graphics, video content, etc. [11].
Web pages have specific functions according to the purpose behind their design. They may convey information to users in different fields, such as product descriptions, news, etc. Users have only a few expectations about the subject they are looking for, and these help them find certain web pages [12],[13].
HTML, the popular language for implementing web pages, is heavily supported by the World Wide Web Consortium (W3C), and HTML documents are structured according to the Document Object Model (DOM) [14].
Web data extraction (also known as web harvesting, web scraping, screen scraping, etc.) is "a method to obtain large volumes of data from websites". The data available on websites is not easy to download, as it can be accessed only via a web browser. Web scraping is the process of extracting data from websites, often using automation tools. It consists of the following steps [15]:
1) Download page content (fetching).
2) Extract specific data fields (parsing).
3) Format the data.
4) Save formatted data into a database or a spreadsheet.
Figure 2 shows the main steps of EE from web pages.

Fig. 2. The Main Steps of EE from Web Pages [24]

V. EVENT EXTRACTION WORKS REVIEW
In previous years, researchers have published papers proposing many suggestions and solutions in the field of EE from web pages, using intelligent techniques with the potential to improve performance. This section reviews research that extracted events from websites.

Zhen Tan and others (2018) [16] presented a title-based information extraction model named TWCEM. It extracts the contents of the web page more accurately and effectively via a noise filter. TWCEM was implemented on a real website. It achieved good
performance with low time and cost. It achieved an F1-score of 97.59% and an accuracy of 89.96%. The results showed higher accuracy with the NY Times and NY Post datasets, which exceeded 99.15% and 98.89%, respectively.

Yang et al. (2018) [17] presented a framework that detects event mentions and then extracts the events from financial news at the document level. The authors presented the Document-level Chinese Financial Event Extraction (DCFEE) system, which generates more labeled data for extracting Chinese financial events. The results showed that the system reached up to 94.5 accuracy for mention labeling and 94.08 for automatic label generation.

Björne and Salakoski (2018) [18] developed a Convolutional Neural Network (CNN) for event and relation extraction. The input text is converted to a linear representation, the information is encoded by vector space embeddings, and dependency path embeddings are used to encode the parse graph. The CNN is integrated into the open-source Turku Event Extraction System (TEES). The approach was checked on 12 event, relation, and NER corpora and showed good performance on the different corpora.

Li et al. (2019) [19] presented knowledge base (KB)-driven tree-structured long short-term memory networks (Tree-LSTM) and introduced two new features: (i) dependency structures and (ii) entity properties. The approach was evaluated on the BioNLP shared task using the Genia dataset. It achieved 86% accuracy for simple events.

Yang et al. (2019) [20] presented an approach that extracts the events occurring in plain text. The approach consists of two stages: trigger extraction is performed first, followed by argument extraction. The authors presented a Pre-trained Language Model based Event Extractor (PLMEE). The results showed that the approach scored 81.1 for trigger extraction and 58.9 for argument extraction.

Zhang et al. (2019) [21] presented a joint entity and event extraction framework that uses generative adversarial imitation learning. A Q-learning scanner scans the text to detect the event boundaries, triggers, and entities. The extractor then links the triggers with the entities and determines the argument roles. A GAN algorithm is used to train the framework on the extracted features of the events. The framework achieved 85.2% accuracy and an F1 score of 80.8.

Felix Hamborg et al. (2019) [22] presented a system that uses grammar rules to extract the relevant phrases from English articles. The system answers the 5W1H questions to determine the main event in the article. The results showed the system's capability to determine the main event from the answers to just the first four W questions, for which it had 82% accuracy.

Fisichella and Ceroni (2021) [23] presented a basis for a novel class of evolution-aware, entity-based enrichment algorithms to detect events in Wikipedia. They expected that it would improve the accessibility and timeliness of Wikipedia's entity retrieval. Comprehensive experiments were conducted on a dataset of 1.8 million articles. The approach relied on a supervised model that can detect an event in a non-annotated corpus. It achieved 70% accuracy and is more accurate than other related methods.

Matthew Crittenden (2021) [24] proposed a causal network to extract relevant event-causal structures from ConceptNet and Wikipedia. The proposed network uses event-causal attributes that are extracted by a bidirectional transformer encoder to effectively capture long-range dependencies. This approach increases the complexity of the task by classifying entity-type arguments as well as complex argument types. The model used Huggingface's bert-base-multilingual-cased model, which had been pre-trained on 104 different languages. It was trained for 20 epochs with a mini-batch size of 4 and a maximum sequence length of 512 on a single Tesla K40c.

Dilek Küçük (2022) [25] proposed Energy Monitoring via Information Extraction (EneMonIE), a Web-based semantic system for monitoring current energy trends using automatic, continuous, and guided EE from various forms of media available on the Web. The sources include online news videos, online news articles, social media texts, open-access scientific papers, and technical reports, as well as the many digital energy data made available to the public by energy institutions. EneMonIE is an important source of concise information for decision-makers; power generation, transmission, and distribution system operators; energy research centers; investors; and related entrepreneurs, as well as for academics and students. The system offers various data sources, automatic text processing capabilities, and display facilities that are open for public use.

Jacobs et al. (2022) [26] presented the SENTiVENT scheme for detecting events in economic news articles. It uses event triggers, participant arguments, event co-reference, and event attributes such as type, subtype, negation, and modality. The results showed that the scheme obtained a 59% F1 score on a dataset covering 18 event types and 64 subtypes. Training was performed on 6200 events in 288 documents.

Meisin et al. (2022) [27] presented a system to extract events from English crude oil news. The work starts from a seed set of 175 news articles, and a subset of 25 news articles was used as the adjudicated reference test set. The resulting corpus has 425 news articles with approximately 11k annotated events. Basic event extraction models are trained to label data. A special dataset is used that contains oil-related triggers and arguments. The overall evaluation results showed the high performance of the proposed system.

Jianwei Lv and others [25] proposed an advanced multi-task learning framework named TNC, based on their original concept "Trigger is Non-Central", in which event argument extraction is performed synchronously with the extraction of event triggers. Using label representations and an auxiliary task called Sentence Event Identification (SEI), TNC extracts multiple event triggers and arguments simultaneously. A special symbol is also designed to merge the representations of candidate arguments over the Transformer encoder. Experimental results showed that the model achieves state-of-the-art results compared to other models, with higher effectiveness and adaptability.

Ali Balali and others [26] extract multiple event triggers and arguments simultaneously by introducing the shortest dependency path in the dependency graph. Long-range dependencies are captured by eliminating irrelevant words from the sentence. An attention-based graph convolutional network is also proposed for carrying syntactically related information along the shortest paths between argument candidates, capturing and aggregating latent associations between arguments, a problem that has been overlooked by most researchers. The results show a substantial improvement over state-of-the-art methods on two datasets, namely ACE 2005 and TAC KBP 2015.

VI. COMPARISON
Table I summarizes the recent papers reviewed above according to their common criteria.
Most of the recently proposed systems implemented a deep-learning model to perform trigger and argument extraction. These models showed high accuracy compared to other models. Closed-domain event extraction models give higher performance than open-domain event extraction.

VII. CONCLUSION
Although there are many challenges, text mining, and especially open event mining, is attracting more and more attention due to its important role in information mining. This paper offers a way to quickly understand EE tasks from a medium-difficulty perspective and provides concepts and definitions for EE task models and their applications. It reviews and summarizes common issues in EE from web pages. The new deep learning models, which provide trainable models, ease the extraction of triggers and their arguments. The main difficulty to be faced is the lack of annotated corpora in specific domains.

TABLE I. REVIEWED EE STUDIES, SHOWING THE MODEL, DATASET, STRENGTHS, AND WEAKNESSES OF EACH.

Year | Researchers | Model | Reported score | Datasets | Strengths | Weaknesses
2018 | Zhen Tan et al. | TWCEM | 97.59% (F1) | NY Times, NY Post | None | None
2018 | Yang et al. | DCFEE | 94.5% (accuracy, mention labeling) | Chinese financial events | None | None
2018 | Björne and Salakoski | CNN, TEES | --- | Special 12 events | None | None
2019 | Li et al. | Tree-LSTM | 86% (accuracy, simple events) | BioNLP | None | None
2019 | Yang et al. | PLMEE | 81.1% (trigger extraction) | Special dataset | None | None
2019 | Zhang et al. | GAN | 80.8 (F1) | Special dataset | None | None
2019 | Hamborg et al. | Giveme5W1H | 82% (accuracy) | Special dataset of English news articles | More accurate results | None
2021 | Fisichella and Ceroni | Evolution-aware entity-based enrichment algorithms and temporal retrieval | 70% (accuracy) | Special dataset of Wikipedia articles | More accurate results; detects events in a non-annotated corpus | None
2021 | Crittenden | Causal network; Huggingface's bert-base-multilingual-cased model | --- | ConceptNet and Wikipedia | More accurate results | None
2022 | Dilek | EneMonIE | --- | Various forms of media available on the Web | Offers pluggable information extraction and other text processing components | For data sources, only textual data processing is intended
2022 | Jacobs et al. | SENTiVENT | 59% (F1) | English economic news articles | None | None
2022 | Meisin et al. | CNN | --- | English crude oil news | None | None
2022 | Lv, Jianwei | Transformer encoder and Sentence Event Identification (SEI) | --- | --- | Achieves state-of-the-art results compared to other models, with higher effectiveness and adaptability | None
2022 | Ali Balali | Attention-based graph convolutional network | --- | ACE 2005 and TAC KBP 2015 | Substantial improvement over state-of-the-art methods | None
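
Several of the figures quoted in Section V and collected in Table I are F1-scores rather than accuracies. As a reminder of how an F1-score is obtained for trigger extraction, here is a minimal sketch. The (sentence index, trigger word) pairs are invented for illustration and do not come from any of the reviewed datasets, and the function name is an assumption of this sketch.

```python
from typing import Set, Tuple

Trigger = Tuple[int, str]  # (sentence index, trigger word)


def precision_recall_f1(gold: Set[Trigger], predicted: Set[Trigger]):
    """Compute precision, recall, and F1 over exact-match trigger pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: four gold triggers, three predictions, two correct.
gold = {(0, "held"), (1, "announced"), (2, "attacked"), (3, "signed")}
predicted = {(0, "held"), (1, "announced"), (4, "said")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Because F1 penalizes both missed and spurious triggers, an F1 figure and an accuracy figure reported for different systems are not directly comparable.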

REFERENCES
[1] LDC, "ACE (Automatic Content Extraction) English annotation guidelines for events," Linguistic Data Consortium, 2005.
[2] W. Xiang and B. Wang, "A survey of event extraction from text," IEEE Access, vol. 7, pp. 173111-173137, 2019.
[3] M. Rospocher, M. van Erp, P. Vossen, A. Fokkens, I. Aldabe, G. Rigau, A. Soroa, T. Ploeger, and T. Bogaard, "Building event-centric knowledge graphs from news," Journal of Web Semantics, vol. 37-38, 2016.
[4] Z. Li, X. Ding, and T. Liu, "Constructing narrative event evolutionary graph for script event prediction," in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4201–4207.
[5] J. Liu, L. Min, and X. Huang, "An overview of event extraction and its applications," arXiv preprint arXiv:2111.03212, 2021.
[6] J. Foley, M. Bendersky, and V. Josifovski, "Learning to extract local events from the web," in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015.
[7] S. Mehta, "New Approaches to Event Detection and Extraction from News Articles," 2021.
[8] G. R. Doddington, A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. M. Strassel, and R. M. Weischedel, "The automatic content extraction (ACE) program: tasks, data, and evaluation," in LREC, vol. 2, pp. 837–840, Lisbon, 2004.
[9] Q. Li, et al., "A Compact Survey on Event Extraction: Approaches and Applications," arXiv preprint arXiv:2107.02126, 2021.
[10] A. Janevski, "UniversityIE: information extraction from university web pages," 2000.
[11] S. Mahato, D. K. Yadav, and D. A. Khan, "A Modified Approach to Text Steganography Using HyperText Markup Language," 2013 Third International Conference on Advanced Computing and Communication Technologies (ACCT), 2013, pp. 40-44, doi: 10.1109/ACCT.2013.19.
[12] G. Buscher, E. Cutrell, and M. Ringel Morris, "What Do You See When You're Surfing? Using Eye Tracking to Predict Salient Regions of Web Pages," CHI 2009, April 4–9, 2009, Boston, Massachusetts, USA.
[13] S. vanden Broucke and B. Baesens, Practical Web Scraping for Data Science. New York, NY: Apress, 2018.
[14] X. Deng, et al., "DOM-LM: Learning Generalizable Representations for HTML Documents," arXiv preprint arXiv:2201.10608, 2022.
[15] C. H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1411-1428, Oct. 2006, doi: 10.1109/TKDE.2006.152.
[16] Z. Tan, et al., "Title-Based Extraction of News Contents for Text Mining," IEEE Access, vol. 6, pp. 64085-64095, 2018.
[17] H. Yang, et al., "DCFEE: A document-level Chinese financial event extraction system based on automatically labeled training data," in Proceedings of ACL 2018, System Demonstrations, 2018.
[18] J. Björne and T. Salakoski, "Biomedical event extraction using convolutional neural networks and dependency parsing," in Proceedings of the BioNLP 2018 Workshop, 2018.
[19] D. Li, et al., "Biomedical event extraction based on knowledge-driven tree-LSTM," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[20] S. Yang, et al., "Exploring pre-trained language models for event extraction and generation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[21] T. Zhang, H. Ji, and A. Sil, "Joint entity and event extraction with generative adversarial imitation learning," Data Intelligence, vol. 1, no. 2, pp. 99-120, 2019.
[22] F. Hamborg, C. Breitinger, and B. Gipp, "Giveme5W1H: A universal system for extracting main events from news articles," arXiv preprint arXiv:1909.02766, 2019.
[23] M. Fisichella and A. Ceroni, "Event detection in Wikipedia edit history improved by documents web based automatic assessment," Big Data and Cognitive Computing, vol. 5, no. 3, p. 34, 2021.
[24] B. A. Hordofa, "Event extraction and representation model from news articles," vol. 16, pp. 1-8, 2020.
[25] J. Lv, et al., "Trigger is Non-central: Jointly event extraction via label-aware representations with multi-task learning," Knowledge-Based Systems, 109480, 2022.
