Extracting Information Science Concepts

This document discusses extracting information science concepts using natural language processing and JAPE regular expressions in GATE. It provides a brief overview of information extraction tools and compares them, noting that each has advantages and disadvantages. The paper then uses CREOLE plugins in GATE to extract concepts from the field of information science to help speed up the ontology building process in a semi-automatic manner.



Extracting Information Science concepts based on Jape Regular Expression

Article · January 2011



EXTRACTING INFORMATION SCIENCE CONCEPTS BASED ON JAPE REGULAR EXPRESSION

Ahlam Sawsaa
Department of Informatics, School of Computing & Engineering, Huddersfield HD1 3DH, United Kingdom
<[email protected]>

Joan Lu
Department of Informatics, School of Computing & Engineering, Huddersfield HD1 3DH, United Kingdom
<[email protected]>

Abstract

Recently, unstructured data on the WWW has generated considerable interest in extracting text, e-mail, web pages, reports and research papers in their raw form. More interestingly, extracting information from a specific domain using distributed corpora from the World Wide Web is a vital step towards creating corpus annotation. This paper describes a method of annotating Information Science concepts to build a domain ontology, using Natural Language Processing (NLP) technology to speed up the ontology development process, which is time consuming and for which domain experts face barriers such as limited time and heavy workloads. Using NLP reduces the work of the domain experts, who can then evaluate the results.

Keywords: ontology, regular expression, information extraction, Natural Language Processing.

1 Introduction

Recently, Information Extraction (IE) has attracted great interest because of the growing number of web pages on the Internet that contain unstructured data. This amount of information needs a tool to extract it and make it available for use at the right time. Many specialists in the field of information extraction have worked to find suitable tools, such as wrappers, that classify interesting data and map them to an appropriate format such as XML or a relational database. Furthermore, some HTML-aware tools rely on the inherent structural features of documents to carry out the extraction process. On the other hand, Natural Language Processing (NLP) is a technique used by many tools to extract data that exists in natural language documents. These tools, such as GATE, apply part-of-speech tagging, filtering and lexical semantic tagging to link relevant information and to identify the relationships among phrases and sentence elements within text. Each of these tools has advantages and disadvantages, so a comparative analysis of existing data extraction tools is required to recognise their capabilities. In order to adopt the most appropriate tool, we compared them to give a distinct picture of the information extraction tools [1,8], as illustrated in Table (1).

In this paper we first give a brief overview of information extraction tools to justify the choice of an NLP technique. To speed up the building process of the Ontology of Information Science (OIS) and to extract concepts in the field, CREOLE plug-ins in GATE have been deployed in this IE system.

2 Background

The annotation of IS concepts is based on GATE Developer, a tool of the General Architecture for Text Engineering (GATE). It is free, open-source software developed by a team at the University of Sheffield, which started in the early 1990s. The first version was released in 1995, the second in 2002 and the newest version in the 2010s. GATE runs on any platform and supports Java 5.0. It has also been developed and tested on Linux, Windows and Mac OS X. It has a user interface that enables editing, visualisation and quick application development. Furthermore, it supports manual, semi-automatic and semantic annotation as well as ontology management. Moreover, GATE uses CREOLE plug-ins as objects for language engineering. All of these are packaged as Java Archives with XML configuration data [4].

Table 1: Information extraction tools

| Tool | Type | Degree of automation | Based on | Ease of use | Written language | Advantages & disadvantages |
|---|---|---|---|---|---|---|
| SHOE | Knowledge annotation | Automatic | | + | Java | Allows users to mark up pages in SHOE guided by ontologies or URL |
| Annotea | Annotation schema (W3C) | Automatic | RDF mark-up, XML, XHTML, CSS & XPointer | + | C; available for Windows, Unix & Mac | Doesn't support IE; it is linked to an ontology server. Makes annotations publicly available |
| Annozilla | Email annotation | Automatic | Mozilla | ++ | | - |
| MnM | Ontology editor | Semi-automatic & automatic | HTML | + | | Close to Melita |
| OntoMat | | Automatic | OWL | ++ | | Used to create & maintain ontologies. Uses OntoBroker as server |
| COHSE | Integration of text processing components | Automatic | DAML+OIL | + | RDF | Uses an ontology server to mark up pages in DAML+OIL & reuse as RDF |
| Melita | Annotation interface | Semi-automatic | Extensible Markup Language, Java, HTML | ++ | | Retrieves structured & semi-structured annotations |
| KIM (Ontotext) | Semantic annotation platform | Automatic | RDF | ++ | | Semantic annotation, indexing, and retrieval of unstructured and semi-structured content |
| GATE | Annotation tool | Semi-automatic & automatic | XML, HTML, XHTML, e-mails | +++ | | Comprises an architecture and a framework. Based in the NLP group |

GATE is an Information Extraction (IE) system. IE is a method that takes unseen texts as input and produces output in a fixed format such as XML or HTML; these data can be displayed to users or stored in a database for analysis. Before discussing GATE in more detail, we should clarify the difference between Information Retrieval (IR) and Information Extraction (IE) [3]. IE helps to extract information from a huge amount of text for the purpose of fact analysis, whereas IR just pulls the documents that contain relevant information according to the keyword search. In contrast, IE identifies the query in a structured way and provides knowledge at a deep level, while IR uses an ordinary query engine, with which it is hard to obtain an accurate answer, and provides knowledge only at a typical level.

For instance, you may have an enquiry about when something happened, such as "Which airports are currently closed due to severe weather conditions in the UK?", or about where and who did something, such as "Where did Gordon Brown last visit before he left?" [7]

IR gives just the web page containing the relevant information, and you need to search it using terms or concepts to meet your needs and to analyse this information. IE provides specific information about your enquiry; even if the information is not accurate, you can go back to the text. IE is used for many applications, such as text mining, semantic annotation, question answering, opinion mining, decision support, and rich information retrieval and exploration.

GATE has many features for automatic and semi-automatic semantic annotation, and also for manual annotation, which helps you to create your own annotations. For this purpose GATE Developer is used as the tool to extract terms and concepts from a specific text effectively and efficiently. For this work we annotate texts belonging to members of Ontocop, a virtual community of practice for Information Science. This helps to speed up the ontology process of building a conceptual model as part of the life cycle of the IS ontology.

Additionally, GATE is modular and has a comprehensive set of plug-ins, such as: Alignment, ANNIE, Annotation_Merging, Copy_Annots_Between_Docs, Gazetteer_LKB, Gazetteer_Ontology_Based, Information_Retrieval, Keyphrase_Extraction_Algorithm, Language_Identification, Ontology_Tools, WordNet.

GATE is based on ANNIE, an IE system that provides the core processing resources. ANNIE relies on finite-state algorithms and the JAPE grammar, and the application combines a tokeniser, sentence splitter, POS tagger, gazetteer, named entity tagger (JAPE transducer), orthomatcher (co-reference), and NP and VP chunkers. Among these modules we used the tokeniser, sentence splitter, gazetteer and JAPE transducer [5].

3 Methods

The method is based on creating a document corpus and a gazetteer of Information Science, and on JAPE rules to extract IS concepts. GATE provides facilities for loading corpora for annotation from a URL or by uploading a file. The process starts by uploading the corpus to the application framework together with a JAPE grammar and a gazetteer, which enable the concepts in the corpus to be annotated. Figure (1) illustrates the process of corpus annotation.

[Figure (1): Annotation workflow. Boxes in the diagram: Documents of Information Science; Analysis process; Corpus; PDF doc to XML; Upload to GATE framework; Running ANNIE; Annotate concepts & Evaluation]
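As a rough illustration of the first stages of this workflow, the sketch below mimics in plain Python (not GATE) a tokeniser that records character offsets, followed by a naive sentence splitter. The names `Annotation`, `tokenise` and `split_sentences` are hypothetical and are not part of the GATE API; the real ANNIE components are considerably more sophisticated.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A span of text with a type and features (hypothetical, GATE-like)."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


def tokenise(text):
    """Return one Token annotation per word or punctuation mark."""
    return [
        Annotation("Token", m.start(), m.end(), {"string": m.group()})
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


def split_sentences(text):
    """Return Sentence annotations, splitting naively after '.', '!' or '?'."""
    return [
        Annotation("Sentence", m.start(), m.end())
        for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)
    ]


text = "Data analysis supports libraries. Acquisition policy matters."
tokens = tokenise(text)
sentences = split_sentences(text)

# Each annotation only stores offsets; the covered text is recovered by slicing.
for s in sentences:
    print(s.start, s.end, text[s.start:s.end])
```

Keeping only offsets and features in each annotation, rather than copies of the text, is what lets later stages (gazetteer lookup, JAPE rules) add their own annotations over the same document without modifying it.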
Corpus: The collected corpus contains 300 documents, all of which are relevant to the Information Science field.

Gazetteer: The IS list is included in the gazetteer, which contains terms. Each term has features to be identified, such as majorType and minorType, e.g.

Acquisition policy: majorType = concept
Computer aided design: minorType = term
Data analysis: majorType = concept

JAPE rule: A JAPE rule extracts concepts by identifying Tokens that contain the concepts in the correct order and by looking the concepts up in the gazetteer list. JAPE (Java Annotation Patterns Engine) rules are grouped into phases and, being based on Java, allow a specific grammar to be created. Each JAPE rule consists of a left-hand side (LHS), which contains the patterns to match, and a right-hand side (RHS), which details the annotations to be created [4].

We used the JAPE grammar because it supports regular expression matching and is GATE's way of annotating. Annotations can also be made using other CREOLE plug-ins such as the gazetteer, which requires creating one's own list of concepts to be annotated.

Figure (2): Screenshot of the IS Gazetteer

4 NLP technique to extract IS concepts

We present an automatic extraction method based on ANNIE: we create a JAPE grammar that extracts concepts from XML and HTML text, using a corpus of 300 documents in XML format.

Our JAPE rule to extract concepts is shown below. The first entity detected is Information service {Type=Token, start=867, end= 837, id= 4210, majorType=concept}, labelled as information service.concept.

Phase: one
Input: Lookup Token
Options: control = appelt
Rule: concept1
(
  ({Token.string == "information"})
  ({Token.string == "service"})
  ({Lookup.minorType == region}):regionName
):service
-->
:regionName.Location = {},
:service.concept = {}

For more precise details we apply regular expressions for matching strings of text, e.g.

Phase: Concept
Input: Lookup Token
Options: control = appelt
Rule: Glossary
(
  ({Token.string == "catalog?e"})
):concept
-->
:concept.concept = {Rule = "Glossary"}

In this rule we specify a string of the text with {Token.string == ...} string matching, and specify the attributes of the annotation using operators such as "==", which provides whole-string matching. The regular expressions in the next examples use metacharacters (dot, *, [ ], |) to annotate concepts, for instance those related to "abstract":

1. {Token.string == "abstract(ing)"}
It may be abstract, abstracting, abstractor.

Also, if we want to annotate the acquisition concept followed by another word:

2. {Token.string == "acquisition. number"}
It could annotate acquisition policy and acquisition service.

3. {Token.string == "archival * "}
It will annotate archival library, archival journal, archival processing, archival software, and archival studies. All these rules are stored in the INFCO.jape file.

Phase: Two
Input: Lookup Token
Options: control = all
Rule: concept2
Priority: 20
(
  ({Token.string == "information"})
  ({Token.string == "service"})
  ({Lookup.majorType == "concept"})
):information
-->
:information.Concept = {Rule = "concept2"}

5 Experiment & Evaluation

Extracting IS concepts with a JAPE grammar and regular expressions in GATE Developer provides significant output for automated information extraction. The main idea of using JAPE and regular expressions is to identify IS terminology as tokens, for example Computing, Libraries and Information technology, within a large text where the terms are found. The term identification relies on lookup in the IS gazetteer list, which can match, for instance, book art, book card, book guidance or book catalogue. It also looks up concepts such as computer application, computer science, computer experts, computer file or computer image.

The corpus we used to extract Information Science concepts contains 300 documents, all of which were analysed by running the ANNIE application, organised as document reset, tokeniser, sentence splitter, gazetteer, POS tagger, JAPE transducer and orthomatcher. The annotation sets appear in the display pane, and the concepts are highlighted in the default annotation set, as shown in Figure (3).
Figure (3): Annotation of concepts in GATE
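The rules in Section 4 depend on whole-string regular-expression matching of token strings. As a minimal sketch outside GATE, the Python below uses `re.fullmatch` to stand in for such a whole-string match. The JAPE documentation describes `==~` for whole-string regular-expression matching, while `==` compares literal strings, so treating the printed rules as regular expressions is an assumption here. The patterns are illustrative variants chosen to cover the examples named in the text (abstract, abstracting, abstractor; catalog, catalogue); they are not the authors' exact expressions, and `matches_concept` is a hypothetical helper.

```python
import re

# Illustrative patterns (assumptions, not the authors' exact expressions).
PATTERNS = {
    "abstract": re.compile(r"abstract(ing|or)?", re.IGNORECASE),
    "catalogue": re.compile(r"catalog(ue)?", re.IGNORECASE),
}


def matches_concept(token):
    """Return the names of all patterns that match the WHOLE token string."""
    return [name for name, rx in PATTERNS.items() if rx.fullmatch(token)]


for word in ["abstract", "abstracting", "abstractor", "abstraction", "Catalogue"]:
    print(word, matches_concept(word))
```

Because the match must cover the whole token, "abstraction" is rejected even though it begins with "abstract"; a substring search would have accepted it.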

The results show that our approach annotated the concepts (see Figure 4). The annotation of the concept "knowledge" starts at offset 896 and ends at 905, while the concept "computer science" is annotated from 2008 to 2024, with the feature {majorType=concept}. Each annotation starts at a specific offset and ends at another, depending on how many tokens it contains, and each occurrence is listed.

Figure 4: Result of the annotation IS domain
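To make the start and end offsets concrete, the following sketch (plain Python, not the GATE gazetteer) looks up multi-word terms in a text and reports each hit with its character offsets and a `majorType` feature, in the spirit of the Lookup annotations described above. The term list reuses examples from the paper; the function name `lookup` and the tuple layout are assumptions made for illustration.

```python
import re

# Tiny stand-in for the IS gazetteer list: term -> features.
GAZETTEER = {
    "information service": {"majorType": "concept"},
    "computer science": {"majorType": "concept"},
    "book catalogue": {"majorType": "concept"},
}


def lookup(text):
    """Return (start, end, term, features) for every gazetteer hit."""
    hits = []
    for term, features in GAZETTEER.items():
        # \b stops a term from matching inside a longer word;
        # \s+ lets the words be separated by any run of whitespace.
        pattern = r"\b" + r"\s+".join(map(re.escape, term.split())) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), term, features))
    # Order hits by position in the text.
    return sorted(hits, key=lambda h: (h[0], h[1]))


text = "The library offers an information service for computer science."
for start, end, term, features in lookup(text):
    print(start, end, text[start:end], features["majorType"])
```

The distance between start and end depends on the length of the matched term, which is why a two-word concept such as "computer science" spans more characters than a single-word one.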


We conducted this experiment to achieve accuracy rates equal to those of manual annotation of the concepts by IS experts. Statistics of the corpus show the pattern matching of IS concepts based on the IS lookup list (402). Correct concepts and accuracy were generally high, whereas there were no partially correct (0), missing or false positive (0) annotations.

Figure 5: Accuracy of the results

We use GATE because of its benefits: it is open source, and it contains multi-language NLP models that can be reused for developing other resources.

6 Conclusion

6.1 Achievement

This paper described a method of using an NLP technique to extract concepts in order to speed up the development process of the IS ontology. Furthermore, the development of the IE system saved the efforts of domain experts by labelling the most common concepts. In total we extracted 664 concepts, which are the classes of the Information Science ontology, and 650 subclasses, which are the main components of the ontology skeleton. The IE technique can be applied to many different formats, such as XML and HTML documents, and even to URLs or e-mails.

6.2 Future work

Ontology is at the heart of the semantic web. It defines the concepts and their relations to make global interoperability possible. In future work we plan to enhance these concepts to develop the IS ontology and to create a taxonomy of IS as a domain. The next step is coding it using Protégé as the ontology editor. Additionally, such a generic model of the IS ontology will be evaluated.

Acknowledgment

The authors wish to thank the Libyan government for its support, and everyone who provided feedback on this work.

References

[1] ALBERTO, H. F., BERTHIER, A. L.-. & RIBEIRO-NETO (2002) A brief survey of web data extraction tools. SIGMOD Record. http://annotation.semanticweb.org/tools/.

[2] CHANG, C.-H., KAYED, M., GIRGIS, M. R. & SHAALA, K. (2000) A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, 13.

[3] CRESCENZI, V. & MECCA, G. (2004) Automatic Information Extraction from Large Websites. Journal of the ACM, 51, pp. 731–779.

[4] GATE (2010) Developing Language Processing Components with GATE Version 6 (a User Guide). http://gate.ac.uk/sale/tao/splitch13.html#x18-32300013.2.

[5] HANDSCHUH, S. & STAAB, S. (2002) Authoring and Annotation of Web Pages in CREAM. Honolulu, Hawaii, USA.

[6] MOENS, M.-F. (2006) Information Extraction: algorithms and prospects in a retrieval context, Springer.

[7] SRIHARI, R. & LI, W. (2002) Information Extraction Supported Question Answering. In Proceedings of the Eighth Text Retrieval Conference (TREC-8).

[8] TURMO, J., AGENO, A. & CATALÀ, N. (2006) Adaptive Information Extraction. ACM Computing Surveys, 38.
