GATE User Guide

Developing Language Processing Components with GATE


Version 8 (a User Guide)

For GATE version 8.4.2-snapshot (development builds)


(built September 12, 2017)

Hamish Cunningham
Diana Maynard
Kalina Bontcheva
Valentin Tablan
Niraj Aswani
Ian Roberts
Genevieve Gorrell
Adam Funk
Angus Roberts
Danica Damljanovic
Thomas Heitz
Mark A. Greenwood
Horacio Saggion
Johann Petrak
Yaoyong Li
Wim Peters
Leon Derczynski
et al

© The University of Sheffield, Department of Computer Science 2001-2017

https://gate.ac.uk/

This user manual is free, but please consider making a donation.

HTML version: https://gate.ac.uk/userguide

Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale
Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT)
and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Ontotext Matrixware,
the Information Retrieval Facility and several EU-funded projects (TrendMiner, uComp,
Arcomem, SEKT, TAO, NeOn, MediaCampaign, Musing, KnowledgeWeb, PrestoSpace,
h-TechSight, and enIRaF).
Developing Language Processing Components with GATE Version 8
© 2017 The University of Sheffield, Department of Computer Science
The University of Sheffield, Department of Computer Science
Regent Court
211 Portobello
Sheffield
S1 4DP
United Kingdom

https://gate.ac.uk

This work is licensed under the Creative Commons Attribution-No Derivative Licence. You are free
to copy, distribute, display, and perform the work under the following conditions:

• Attribution: You must give the original author credit.

• No Derivative Works: You may not alter, transform, or build upon this work.

With the understanding that:

• Waiver: Any of the above conditions can be waived if you get permission from the copyright
holder.

• Other Rights: In no way are any of the following rights affected by the licence: your fair
dealing or fair use rights; the author's moral rights; rights other persons may have either in
the work itself or in how the work is used, such as publicity or privacy rights.

• Notice: For any reuse or distribution, you must make clear to others the licence terms of
this work.

For more information about the Creative Commons Attribution-No Derivative Licence, please visit
this web address: http://creativecommons.org/licenses/by-nd/2.0/uk/
Brief Contents

I GATE Basics 3
1 Introduction 5

2 Installing and Running GATE 25

3 Using GATE Developer 35

4 CREOLE: the GATE Component Model 67

5 Language Resources: Corpora, Documents and Annotations 89

6 ANNIE: a Nearly-New Information Extraction System 113

II GATE for Advanced Users 133


7 GATE Embedded 135

8 JAPE: Regular Expressions over Annotations 189

9 ANNIC: ANNotations-In-Context 231

10 Performance Evaluation of Language Analysers 241

11 Profiling Processing Resources 271

12 Developing GATE 279

III CREOLE Plugins 291


13 Gazetteers 293

14 Working with Ontologies 315

15 Non-English Language Support 355

16 Domain Specific Resources 363

17 Tools for Social Media Data 371

18 Parsers 379

19 Machine Learning 391


20 Tools for Alignment Tasks 441
21 Crowdsourcing Data with GATE 457
22 Combining GATE and UIMA 471
23 More (CREOLE) Plugins 483

IV The GATE Family: Cloud, MIMIR, Teamware 559


24 GATE Cloud 561
25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool 565
26 GATE Mímir 579

Appendices 581
A Change Log 581
B Version 5.1 Plugins Name Map 621
C Obsolete CREOLE Plugins 623
D Design Notes 631
E Ant Tasks for GATE 639
F Named-Entity State Machine Patterns 647
G Part-of-Speech Tags used in the Hepple Tagger 655
References 657
Contents
I GATE Basics 3
1 Introduction 5
1.1 How to Use this Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Developing and Deploying Language Processing Facilities . . . . . . . 9
1.3.2 Built-In Components . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Additional Facilities in GATE Developer/Embedded . . . . . . . . . . 12
1.3.4 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Some Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Recent Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.1 Version 8.4.1 (June 2017) . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.2 Version 8.4 (February 2017) . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Installing and Running GATE 27


2.1 Downloading GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Installing and Running GATE . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 The Easy Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.2 The Hard Way (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.3 The Hard Way (2): Subversion . . . . . . . . . . . . . . . . . . . . . 29
2.2.4 Running GATE Developer on Unix/Linux . . . . . . . . . . . . . . . 29
2.3 Using System Properties with GATE . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Changing GATE's launch configuration . . . . . . . . . . . . . . . . . . . . . 32
2.5 Configuring GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Building GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.1 Using GATE with Maven/Ivy . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Uninstalling GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Using GATE Developer 37


3.1 The GATE Developer Main Window . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Loading and Viewing Documents . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Creating and Viewing Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Working with Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


3.4.1 The Annotation Sets View . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 The Annotations List View . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.3 The Annotations Stack View . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4 The Co-reference Editor . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.5 Creating and Editing Annotations . . . . . . . . . . . . . . . . . . . . 48
3.4.6 Schema-Driven Editing . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.7 Printing Text with Annotations . . . . . . . . . . . . . . . . . . . . . 52
3.5 Using CREOLE Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Installing and updating CREOLE Plugins . . . . . . . . . . . . . . . . . . . 55
3.7 Loading and Using Processing Resources . . . . . . . . . . . . . . . . . . . . 56
3.8 Creating and Running an Application . . . . . . . . . . . . . . . . . . . . . . 58
3.8.1 Running an Application on a Datastore . . . . . . . . . . . . . . . . . 58
3.8.2 Running PRs Conditionally on Document Features . . . . . . . . . . 59
3.8.3 Doing Information Extraction with ANNIE . . . . . . . . . . . . . . . 60
3.8.4 Modifying ANNIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.9 Saving Applications and Language Resources . . . . . . . . . . . . . . . . . . 61
3.9.1 Saving Documents to File . . . . . . . . . . . . . . . . . . . . . . . . 61
3.9.2 Saving and Restoring LRs in Datastores . . . . . . . . . . . . . . . . 62
3.9.3 Saving Application States to a File . . . . . . . . . . . . . . . . . . . 63
3.9.4 Saving an Application with its Resources (e.g. GATE Cloud) . . . . . 64
3.10 Keyboard Shortcuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.11 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.11.1 Stopping GATE from Restoring Developer Sessions/Options . . . . . 67
3.11.2 Working with Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4 CREOLE: the GATE Component Model 69


4.1 The Web and CREOLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 The GATE Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 The Lifecycle of a CREOLE Resource . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Processing Resources and Applications . . . . . . . . . . . . . . . . . . . . . 72
4.5 Language Resources and Datastores . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 Built-in CREOLE Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 CREOLE Resource Configuration . . . . . . . . . . . . . . . . . . . . . . . . 74
4.7.1 Configuration with XML . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7.2 Configuring Resources using Annotations . . . . . . . . . . . . . . . . 80
4.7.3 Mixing the Configuration Styles . . . . . . . . . . . . . . . . . . . . . 85
4.7.4 Loading Third-Party Libraries using Apache Ivy . . . . . . . . . . . . 87
4.8 Tools: How to Add Utilities to GATE Developer . . . . . . . . . . . . . . . . 88
4.8.1 Putting Your Tools in a Sub-Menu . . . . . . . . . . . . . . . . . . . 89
4.8.2 Adding Tools To Existing Resource Types . . . . . . . . . . . . . . . 89

5 Language Resources: Corpora, Documents and Annotations 91


5.1 Features: Simple Attribute/Value Data . . . . . . . . . . . . . . . . . . . . . 91

5.2 Corpora: Sets of Documents plus Features . . . . . . . . . . . . . . . . . . . 92


5.3 Documents: Content plus Annotations plus Features . . . . . . . . . . . . . 92
5.4 Annotations: Directed Acyclic Graphs . . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Annotation Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.2 Examples of Annotated Documents . . . . . . . . . . . . . . . . . . . 94
5.4.3 Creating, Viewing and Editing Diverse Annotation Types . . . . . . . 97
5.5 Document Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.5.1 Detecting the Right Reader . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.2 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5.3 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.4 SGML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5.5 Plain text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.6 RTF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.7 Email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.5.8 PDF Files and Office Documents . . . . . . . . . . . . . . . . . . . . 112
5.5.9 UIMA CAS Documents . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5.10 CoNLL/IOB Documents . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6 XML Input/Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6 ANNIE: a Nearly-New Information Extraction System 115


6.1 Document Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Tokeniser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2.1 Tokeniser Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2.2 Token Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.3 English Tokeniser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Gazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.4 Sentence Splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5 RegEx Sentence Splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.6 Part of Speech Tagger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.7 Semantic Tagger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.8 Orthographic Coreference (OrthoMatcher) . . . . . . . . . . . . . . . . . . . 125
6.8.1 GATE Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.8.2 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.8.3 Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.9 Pronominal Coreference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.9.1 Quoted Speech Submodule . . . . . . . . . . . . . . . . . . . . . . . . 127
6.9.2 Pleonastic It Submodule . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.9.3 Pronominal Resolution Submodule . . . . . . . . . . . . . . . . . . . 127
6.9.4 Detailed Description of the Algorithm . . . . . . . . . . . . . . . . . . 128
6.10 A Walk-Through Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.10.1 Step 1 - Tokenisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.10.2 Step 2 - List Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.10.3 Step 3 - Grammar Rules . . . . . . . . . . . . . . . . . . . . . . . . . 133

II GATE for Advanced Users 135


7 GATE Embedded 137
7.1 Quick Start with GATE Embedded . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Resource Management in GATE Embedded . . . . . . . . . . . . . . . . . . 138
7.3 Using CREOLE Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.4 Language Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4.1 GATE Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4.2 Feature Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4.3 Annotation Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.4.4 Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4.5 GATE Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.5 Processing Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.6 Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.7 Modelling Relations between Annotations . . . . . . . . . . . . . . . . . . . 153
7.8 Duplicating a Resource . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.8.1 Sharable properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.9 Persistent Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.10 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.11 Creating a New Annotation Schema . . . . . . . . . . . . . . . . . . . . . . . 159
7.12 Creating a New CREOLE Resource . . . . . . . . . . . . . . . . . . . . . . . 160
7.13 Adding Support for a New Document Format . . . . . . . . . . . . . . . . . 163
7.14 Using GATE Embedded in a Multithreaded Environment . . . . . . . . . . . 165
7.15 Using GATE Embedded within a Spring Application . . . . . . . . . . . . . 166
7.15.1 Duplication in Spring . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.15.2 Spring pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.15.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.16 Using GATE Embedded within a Tomcat Web Application . . . . . . . . . . 172
7.16.1 Recommended Directory Structure . . . . . . . . . . . . . . . . . . . 172
7.16.2 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.16.3 Initialization Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.17 Groovy for GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.17.1 Groovy Scripting Console for GATE . . . . . . . . . . . . . . . . . . 175
7.17.2 Groovy scripting PR . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.17.3 The Scriptable Controller . . . . . . . . . . . . . . . . . . . . . . . . 180
7.17.4 Utility methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.18 Saving Config Data to gate.xml . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.19 Annotation merging through the API . . . . . . . . . . . . . . . . . . . . . . 187
7.20 Using Resource Helpers to Extend the API . . . . . . . . . . . . . . . . . . . 188

8 JAPE: Regular Expressions over Annotations 191


8.1 The Left-Hand Side . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.1.1 Matching Entire Annotation Types . . . . . . . . . . . . . . . . . . . 193
8.1.2 Using Features and Values . . . . . . . . . . . . . . . . . . . . . . . . 194

8.1.3 Using Meta-Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 195


8.1.4 Building complex patterns from simple patterns . . . . . . . . . . . . 195
8.1.5 Matching a Simple Text String . . . . . . . . . . . . . . . . . . . . . 197
8.1.6 Using Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.1.7 Multiple Pattern/Action Pairs . . . . . . . . . . . . . . . . . . . . . . 200
8.1.8 LHS Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.1.9 Multi-Constraint Statements . . . . . . . . . . . . . . . . . . . . . . . 202
8.1.10 Using Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.1.11 Negation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.1.12 Escaping Special Characters . . . . . . . . . . . . . . . . . . . . . . . 206
8.2 LHS Operators in Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2.1 Equality Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2.2 Comparison Operators . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2.3 Regular Expression Operators . . . . . . . . . . . . . . . . . . . . . . 208
8.2.4 Contextual Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.2.5 Custom Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.3 The Right-Hand Side . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.3.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.3.2 Copying Feature Values from the LHS to the RHS . . . . . . . . . . . 210
8.3.3 Optional or Empty Labels . . . . . . . . . . . . . . . . . . . . . . . . 212
8.3.4 RHS Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.4 Use of Priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.5 Using Phases Sequentially . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.6 Using Java Code on the RHS . . . . . . . . . . . . . . . . . . . . . . . . . . 217
8.6.1 A More Complex Example . . . . . . . . . . . . . . . . . . . . . . . . 218
8.6.2 Adding a Feature to the Document . . . . . . . . . . . . . . . . . . . 220
8.6.3 Finding the Tokens of a Matched Annotation . . . . . . . . . . . . . 220
8.6.4 Using Named Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
8.6.5 Java RHS Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.7 Optimising for Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.8 Ontology Aware Grammar Transduction . . . . . . . . . . . . . . . . . . . . 227
8.9 Serializing JAPE Transducer . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
8.9.1 How to Serialize? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
8.9.2 How to Use the Serialized Grammar File? . . . . . . . . . . . . . . . 228
8.10 Notes for Montreal Transducer Users . . . . . . . . . . . . . . . . . . . . . . 228
8.11 JAPE Plus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

9 ANNIC: ANNotations-In-Context 233


9.1 Instantiating SSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
9.2 Search GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.2.2 Syntax of Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
9.2.3 Top Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9.2.4 Central Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

9.2.5 Bottom Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


9.3 Using SSD from GATE Embedded . . . . . . . . . . . . . . . . . . . . . . . 239
9.3.1 How to instantiate a searchable datastore . . . . . . . . . . . . . . . . 239
9.3.2 How to search in this datastore . . . . . . . . . . . . . . . . . . . . . 240

10 Performance Evaluation of Language Analysers 243


10.1 Metrics for Evaluation in Information Extraction . . . . . . . . . . . . . . . 244
10.1.1 Annotation Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.1.2 Cohen's Kappa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.1.3 Precision, Recall, F-Measure . . . . . . . . . . . . . . . . . . . . . . . 248
10.1.4 Macro and Micro Averaging . . . . . . . . . . . . . . . . . . . . . . . 249
10.2 The Annotation Diff Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
10.2.1 Performing Evaluation with the Annotation Diff Tool . . . . . . . . . 250
10.2.2 Creating a Gold Standard with the Annotation Diff Tool . . . . . . . 252
10.3 Corpus Quality Assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.3.1 Description of the interface . . . . . . . . . . . . . . . . . . . . . . . . 254
10.3.2 Step by step usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.3.3 Details of the Corpus statistics table . . . . . . . . . . . . . . . . . . 255
10.3.4 Details of the Document statistics table . . . . . . . . . . . . . . . . . 256
10.3.5 GATE Embedded API for the measures . . . . . . . . . . . . . . . . 256
10.3.6 Quality Assurance PR . . . . . . . . . . . . . . . . . . . . . . . . . . 259
10.4 Corpus Benchmark Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
10.4.1 Preparing the Corpora for Use . . . . . . . . . . . . . . . . . . . . . . 260
10.4.2 Defining Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.4.3 Running the Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.4.4 The Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.5 A Plugin Computing Inter-Annotator Agreement (IAA) . . . . . . . . . . . . 264
10.5.1 IAA for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.5.2 IAA For Named Entity Annotation . . . . . . . . . . . . . . . . . . . 267
10.5.3 The BDM-Based IAA Scores . . . . . . . . . . . . . . . . . . . . . . . 268
10.6 A Plugin Computing the BDM Scores for an Ontology . . . . . . . . . . . . 269
10.7 Quality Assurance Summariser for Teamware . . . . . . . . . . . . . . . . . . 270

11 Profiling Processing Resources 273


11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.2 Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.3 Command Line Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.4 Application Programming Interface . . . . . . . . . . . . . . . . . . . . . . . 276
11.4.1 Log4j.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
11.4.2 Benchmark log format . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.4.3 Enabling profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.4.4 Reporting tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

12 Developing GATE 281


12.1 Reporting Bugs and Requesting Features . . . . . . . . . . . . . . . . . . . . 281
12.2 Contributing Patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12.3 Creating New Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
12.3.1 What to Call your Plugin . . . . . . . . . . . . . . . . . . . . . . . . 282
12.3.2 Writing a New PR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
12.3.3 Writing a New VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
12.3.4 Writing a `Ready Made' Application . . . . . . . . . . . . . . . . . . 289
12.3.5 Distributing Your New Plugins . . . . . . . . . . . . . . . . . . . . . 289
12.4 Updating this User Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.4.1 Building the User Guide . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.4.2 Making Changes to the User Guide . . . . . . . . . . . . . . . . . . . 292

III CREOLE Plugins 293


13 Gazetteers 295
13.1 Introduction to Gazetteers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
13.2 ANNIE Gazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
13.2.1 Creating and Modifying Gazetteer Lists . . . . . . . . . . . . . . . . 297
13.2.2 ANNIE Gazetteer Editor . . . . . . . . . . . . . . . . . . . . . . . . . 297
13.3 OntoGazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
13.4 Gaze Ontology Gazetteer Editor . . . . . . . . . . . . . . . . . . . . . . . . . 299
13.4.1 The Gaze Gazetteer List and Mapping Editor . . . . . . . . . . . . . 299
13.4.2 The Gaze Ontology Editor . . . . . . . . . . . . . . . . . . . . . . . . 300
13.5 Hash Gazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
13.5.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
13.5.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
13.6 Flexible Gazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
13.7 Gazetteer List Collector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.8 OntoRoot Gazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.8.1 How Does it Work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.8.2 Initialisation of OntoRoot Gazetteer . . . . . . . . . . . . . . . . . . 306
13.8.3 Simple steps to run OntoRoot Gazetteer . . . . . . . . . . . . . . . . 307
13.9 Large KB Gazetteer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
13.9.1 Quick usage overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
13.9.2 Dictionary setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.9.3 Additional dictionary configuration . . . . . . . . . . . . . . . . . . . 312
13.9.4 Dictionary for Gazetteer List Files . . . . . . . . . . . . . . . . . . . 312
13.9.5 Processing Resource Configuration . . . . . . . . . . . . . . . . . . . 313
13.9.6 Runtime configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 313
13.9.7 Semantic Enrichment PR . . . . . . . . . . . . . . . . . . . . . . . . . 314
13.10 The Shared Gazetteer for multithreaded processing . . . . . . . . . . . . . . 314

14 Working with Ontologies 317



14.1 Data Model for Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318


14.1.1 Hierarchies of Classes and Restrictions . . . . . . . . . . . . . . . . . 318
14.1.2 Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.1.3 Hierarchies of Properties . . . . . . . . . . . . . . . . . . . . . . . . . 320
14.1.4 URIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
14.2 Ontology Event Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
14.2.1 What Happens when a Resource is Deleted? . . . . . . . . . . . . . . 324
14.3 The Ontology Plugin: Current Implementation . . . . . . . . . . . . . . . . . 325
14.3.1 The OWLIMOntology Language Resource . . . . . . . . . . . . . . . 326
14.3.2 The ConnectSesameOntology Language Resource . . . . . . . . . . . 329
14.3.3 The CreateSesameOntology Language Resource . . . . . . . . . . . . 330
14.3.4 The OWLIM2 Backwards-Compatible Language Resource . . . . . . 330
14.3.5 Using Ontology Import Mappings . . . . . . . . . . . . . . . . . . . . 330
14.3.6 Using BigOWLIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.3.7 The sesameCLI command line interface . . . . . . . . . . . . . . . . . 332
14.4 The Ontology_OWLIM2 plugin: backwards-compatible implementation . . . 333
14.4.1 The OWLIMOntologyLR Language Resource . . . . . . . . . . . . . 333
14.5 GATE Ontology Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
14.6 Ontology Annotation Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
14.6.1 Viewing Annotated Text . . . . . . . . . . . . . . . . . . . . . . . . . 340
14.6.2 Editing Existing Annotations . . . . . . . . . . . . . . . . . . . . . . 340
14.6.3 Adding New Annotations . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.6.4 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.7 Relation Annotation Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
14.7.1 Description of the two views . . . . . . . . . . . . . . . . . . . . . . . 345
14.7.2 Create new annotation and instance from text selection . . . . . . . . 346
14.7.3 Create new annotation and add label to existing instance from text selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
14.7.4 Create and set properties for annotation relation . . . . . . . . . . . . 346
14.7.5 Delete instance, label or property . . . . . . . . . . . . . . . . . . . . 347
14.7.6 Differences with OAT and Ontology Editor . . . . . . . . . . . . . . . 347
14.8 Using the ontology API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
14.9 Using the ontology API (old version) . . . . . . . . . . . . . . . . . . . . . . 349
14.10 Ontology-Aware JAPE Transducer . . . . . . . . . . . . . . . . . . . . . . . 350
14.11 Annotating Text with Ontological Information . . . . . . . . . . . . . . . . . 351
14.12 Populating Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
14.13 Ontology API and Implementation Changes . . . . . . . . . . . . . . . . . . 354
14.13.1 Differences between the implementation plugins . . . . . . . . . . . . 354
14.13.2 Changes in the Ontology API . . . . . . . . . . . . . . . . . . . . . . 355

15 Non-English Language Support 357


15.1 Language Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
15.1.1 Fingerprint Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 358
15.2 French Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

15.3 German Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359


15.4 Romanian Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.5 Arabic Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.6 Chinese Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
15.6.1 Chinese Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . 361
15.7 Hindi Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
15.8 Russian Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
15.9 Bulgarian Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
15.10 Danish Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
15.11 Welsh Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

16 Domain Specific Resources 365


16.1 Biomedical Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
16.1.1 ABNER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
16.1.2 MetaMap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
16.1.3 GSpell biomedical spelling suggestion and correction . . . . . . . . . 369
16.1.4 BADREX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
16.1.5 MiniChem/Drug Tagger . . . . . . . . . . . . . . . . . . . . . . . . . 370
16.1.6 AbGene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
16.1.7 GENIA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
16.1.8 Penn BioTagger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
16.1.9 MutationFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

17 Tools for Social Media Data 373


17.1 Tools for Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
17.2 Twitter JSON format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
17.2.1 Entity annotations in JSON . . . . . . . . . . . . . . . . . . . . . . . 375
17.3 Exporting GATE documents as JSON . . . . . . . . . . . . . . . . . . . . . 376
17.4 Low-level PRs for Tweets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
17.5 Handling multi-word hashtags . . . . . . . . . . . . . . . . . . . . . . . . . . 378
17.6 The TwitIE Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

18 Parsers 381
18.1 RASP Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
18.2 SUPPLE Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
18.2.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
18.2.2 Building SUPPLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
18.2.3 Running the Parser in GATE . . . . . . . . . . . . . . . . . . . . . . 384
18.2.4 Viewing the Parse Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 385
18.2.5 System Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
18.2.6 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
18.2.7 Parser and Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
18.2.8 Mapping Named Entities . . . . . . . . . . . . . . . . . . . . . . . . . 388
18.2.9 Upgrading from BuChart to SUPPLE . . . . . . . . . . . . . . . . . . 388
18.3 Stanford Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
18.3.1 Input Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389


18.3.2 Initialization Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 389
18.3.3 Runtime Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 390

19 Machine Learning 393


19.1 ML Generalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
19.1.1 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
19.1.2 GATE-Specific Interpretation of the Above Definitions . . . . . . . . 395
19.2 Batch Learning PR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
19.2.1 Batch Learning PR Configuration File Settings . . . . . . . . . . . . 397
19.2.2 Case Studies for the Three Learning Types . . . . . . . . . . . . . . . 410
19.2.3 How to Use the Batch Learning PR in GATE Developer . . . . . . . 418
19.2.4 Output of the Batch Learning PR . . . . . . . . . . . . . . . . . . . . 419
19.2.5 Using the Batch Learning PR from the API . . . . . . . . . . . . . . 426
19.3 Machine Learning PR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
19.3.1 The DATASET Element . . . . . . . . . . . . . . . . . . . . . . . . . 427
19.3.2 The ENGINE Element . . . . . . . . . . . . . . . . . . . . . . . . . . 429
19.3.3 The WEKA Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
19.3.4 The MAXENT Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . 430
19.3.5 The SVM Light Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . 431
19.3.6 Example Configuration File . . . . . . . . . . . . . . . . . . . . . . 434

20 Tools for Alignment Tasks 443


20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
20.2 The Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
20.2.1 Compound Document . . . . . . . . . . . . . . . . . . . . . . . . . . 444
20.2.2 CompoundDocumentFromXml . . . . . . . . . . . . . . . . . . . . . . 446
20.2.3 Compound Document Editor . . . . . . . . . . . . . . . . . . . . . . 446
20.2.4 Composite Document . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
20.2.5 DeleteMembersPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
20.2.6 SwitchMembersPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
20.2.7 Saving as XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
20.2.8 Alignment Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
20.2.9 Saving Files and Alignments . . . . . . . . . . . . . . . . . . . . . . . 456
20.2.10 Section-by-Section Processing . . . . . . . . . . . . . . . . . . . . . . 457

21 Crowdsourcing Data with GATE 459


21.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
21.2 Entity classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
21.2.1 Creating a classification job . . . . . . . . . . . . . . . . . . . . . . 461
21.2.2 Loading data into a job . . . . . . . . . . . . . . . . . . . . . . . . . . 462
21.2.3 Importing the results . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
21.2.4 Automatic adjudication . . . . . . . . . . . . . . . . . . . . . . . . . 465
21.3 Entity annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
21.3.1 Creating an annotation job . . . . . . . . . . . . . . . . . . . . . . . . 467
21.3.2 Loading data into a job . . . . . . . . . . . . . . . . . . . . . . . . . . 469


21.3.3 Importing the results . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
21.3.4 Automatic adjudication . . . . . . . . . . . . . . . . . . . . . . . . . 472

22 Combining GATE and UIMA 473


22.1 Embedding a UIMA AE in GATE . . . . . . . . . . . . . . . . . . . . . . . . 474
22.1.1 Mapping File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
22.1.2 The UIMA Component Descriptor . . . . . . . . . . . . . . . . . . . 478
22.1.3 Using the AnalysisEnginePR . . . . . . . . . . . . . . . . . . . . . . 479
22.2 Embedding a GATE CorpusController in UIMA . . . . . . . . . . . . . . . 480
22.2.1 Mapping File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
22.2.2 The GATE Application Definition . . . . . . . . . . . . . . . . . . . 481
22.2.3 Configuring the GATEApplicationAnnotator . . . . . . . . . . . . . 482

23 More (CREOLE) Plugins 485


23.1 Verb Group Chunker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
23.2 Noun Phrase Chunker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
23.2.1 Differences from the Original . . . . . . . . . . . . . . . . . . . . . 486
23.2.2 Using the Chunker . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
23.3 TaggerFramework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
23.3.1 TreeTagger - Multilingual POS Tagger . . . . . . . . . . . . . . . . 490
23.3.2 GENIA and Double Quotes . . . . . . . . . . . . . . . . . . . . . . . 492
23.4 Chemistry Tagger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
23.4.1 Using the Tagger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
23.5 Lupedia Semantic Annotation Service . . . . . . . . . . . . . . . . . . . . . . 493
23.6 TextRazor Annotation Service . . . . . . . . . . . . . . . . . . . . . . . . . . 494
23.7 Annotating Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
23.7.1 Numbers in Words and Numbers . . . . . . . . . . . . . . . . . . . . 496
23.7.2 Roman Numerals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
23.8 Annotating Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
23.9 Annotating and Normalizing Dates . . . . . . . . . . . . . . . . . . . . . . . 503
23.10 Snowball Based Stemmers . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
23.10.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
23.11 GATE Morphological Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . 506
23.11.1 Rule File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
23.12 Flexible Exporter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
23.13 Configurable Exporter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
23.14 Annotation Set Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
23.15 Schema Enforcer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
23.16 Information Retrieval in GATE . . . . . . . . . . . . . . . . . . . . . . . . 514
23.16.1 Using the IR Functionality in GATE . . . . . . . . . . . . . . . . . . 516
23.16.2 Using the IR API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
23.17 Websphinx Web Crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
23.17.1 Using the Crawler PR . . . . . . . . . . . . . . . . . . . . . . . . . . 520
23.17.2 Proxy configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 522


23.18 WordNet in GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
23.18.1 The WordNet API . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
23.19 Kea - Automatic Keyphrase Detection . . . . . . . . . . . . . . . . . . . . 527
23.19.1 Using the 'KEA Keyphrase Extractor' PR . . . . . . . . . . . . . . . 528
23.19.2 Using Kea Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
23.20 Annotation Merging Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . 531
23.21 Copying Annotations between Documents . . . . . . . . . . . . . . . . . . 532
23.22 LingPipe Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
23.22.1 LingPipe Tokenizer PR . . . . . . . . . . . . . . . . . . . . . . . . . . 534
23.22.2 LingPipe Sentence Splitter PR . . . . . . . . . . . . . . . . . . . . . . 534
23.22.3 LingPipe POS Tagger PR . . . . . . . . . . . . . . . . . . . . . . . . 534
23.22.4 LingPipe NER PR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
23.22.5 LingPipe Language Identifier PR . . . . . . . . . . . . . . . . . . . . 536
23.23 OpenNLP Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
23.23.1 Init parameters and models . . . . . . . . . . . . . . . . . . . . . . . 537
23.23.2 OpenNLP PRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
23.23.3 Obtaining and generating models . . . . . . . . . . . . . . . . . . . . 539
23.24 Stanford CoreNLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
23.24.1 Stanford Tagger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
23.24.2 Stanford Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
23.24.3 Stanford Named Entity Recognition . . . . . . . . . . . . . . . . . . . 541
23.25 Content Detection Using Boilerpipe . . . . . . . . . . . . . . . . . . . . . . 542
23.26 Inter Annotator Agreement . . . . . . . . . . . . . . . . . . . . . . . . . . 544
23.27 Schema Annotation Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
23.28 Coref Tools Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
23.29 Pubmed Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
23.30 MediaWiki Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
23.31 Fast Infoset Document Format . . . . . . . . . . . . . . . . . . . . . . . . . 549
23.32 DataSift Document Format . . . . . . . . . . . . . . . . . . . . . . . . . . 549
23.33 CSV Document Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
23.34 TermRaider term extraction tools . . . . . . . . . . . . . . . . . . . . . . . 550
23.34.1 Termbank language resources . . . . . . . . . . . . . . . . . . . . . . 550
23.34.2 Termbank Score Copier . . . . . . . . . . . . . . . . . . . . . . . . . . 554
23.34.3 The PMI bank language resource . . . . . . . . . . . . . . . . . . . . 554
23.35 Document Normalizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
23.36 Developer Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
23.37 Linguistic Simplifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
23.38 GATE-Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
23.38.1 DCTParser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
23.38.2 HeidelTime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
23.38.3 TimeML Event Detection . . . . . . . . . . . . . . . . . . . . . . . . 559

IV The GATE Family: Cloud, MIMIR, Teamware 561


24 GATE Cloud 563
24.1 GATE Cloud services: an overview . . . . . . . . . . . . . . . . . . . . . . . 564
24.2 Using GATE Cloud services . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
24.3 Annotation Jobs on GATE Cloud . . . . . . . . . . . . . . . . . . . . . . . . 565
24.3.1 The Annotation Service Charges Explained . . . . . . . . . . . . . . . 565
24.3.2 Where to find more details . . . . . . . . . . . . . . . . . . . . . . . 566

25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool 567


25.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
25.2 Requirements for Multi-Role Collaborative Annotation Environments . . . . 569
25.2.1 Typical Division of Labour . . . . . . . . . . . . . . . . . . . . . . . . 569
25.2.2 Remote, Scalable Data Storage . . . . . . . . . . . . . . . . . . . . . 571
25.2.3 Automatic annotation services . . . . . . . . . . . . . . . . . . . . . . 571
25.2.4 Workflow Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
25.3 Teamware: Architecture, Implementation, and Examples . . . . . . . . . . . 572
25.3.1 Data Storage Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
25.3.2 Annotation Services . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
25.3.3 The Executive Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
25.3.4 The User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
25.4 Practical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578

26 GATE Mímir 581

Appendices 583
A Change Log 583
A.1 Version 8.4.1 (June 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
A.2 Version 8.4 (February 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
A.2.1 Java compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
A.3 Version 8.3 (January 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
A.3.1 Java compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
A.4 Version 8.2 (May 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
A.4.1 Java compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
A.5 Version 8.1 (June 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
A.5.1 New plugins and significant new features . . . . . . . . . . . . . . . 586
A.5.2 Library updates and bugfixes . . . . . . . . . . . . . . . . . . . . . . 586
A.5.3 Tools for developers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
A.6 Version 8.0 (May 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
A.6.1 Major changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
A.6.2 Other new and improved plugins . . . . . . . . . . . . . . . . . . . . 588
A.6.3 Bug fixes and other improvements . . . . . . . . . . . . . . . . . . . 589
A.6.4 For developers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
A.7 Version 7.1 (November 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . 590


A.7.1 New plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
A.7.2 Library updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
A.7.3 GATE Embedded API changes . . . . . . . . . . . . . . . . . . . . . 591
A.8 Version 7.0 (February 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
A.8.1 Major new features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
A.8.2 Removal of deprecated functionality . . . . . . . . . . . . . . . . . . . 593
A.8.3 Other enhancements and bug fixes . . . . . . . . . . . . . . . . . . . 593
A.9 Version 6.1 (April 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
A.9.1 New CREOLE Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . 595
A.9.2 Other new features and improvements . . . . . . . . . . . . . . . . . 595
A.10 Version 6.0 (November 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
A.10.1 Major new features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
A.10.2 Breaking changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
A.10.3 Other new features and bugfixes . . . . . . . . . . . . . . . . . . . . 598
A.11 Version 5.2.1 (May 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
A.12 Version 5.2 (April 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
A.12.1 JAPE and JAPE-related . . . . . . . . . . . . . . . . . . . . . . . . . 600
A.12.2 Other Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
A.13 Version 5.1 (December 2009) . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
A.13.1 New Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
A.13.2 JAPE improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
A.13.3 Other improvements and bug fixes . . . . . . . . . . . . . . . . . . . 605
A.14 Version 5.0 (May 2009) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
A.14.1 Major New Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
A.14.2 Other New Features and Improvements . . . . . . . . . . . . . . . . . 608
A.14.3 Specific Bug Fixes . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.15 Version 4.0 (July 2007) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.15.1 Major New Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
A.15.2 Other New Features and Improvements . . . . . . . . . . . . . . . . . 611
A.15.3 Bug Fixes and Optimizations . . . . . . . . . . . . . . . . . . . . . . 613
A.16 Version 3.1 (April 2006) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
A.16.1 Major New Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
A.16.2 Other New Features and Improvements . . . . . . . . . . . . . . . . . 614
A.16.3 Bug Fixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
A.17 January 2005 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
A.18 December 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
A.19 September 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
A.20 Version 3 Beta 1 (August 2004) . . . . . . . . . . . . . . . . . . . . . . . . . 618
A.21 July 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
A.22 June 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
A.23 April 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
A.24 March 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
A.25 Version 2.2 - August 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
A.26 Version 2.1 - February 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . 621


A.27 June 2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621

B Version 5.1 Plugins Name Map 623


C Obsolete CREOLE Plugins 625
C.1 Ontotext JapeC Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
C.2 Google Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
C.3 Yahoo Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
C.3.1 Using the YahooPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
C.4 Gazetteer Visual Resource - GAZE . . . . . . . . . . . . . . . . . . . . . . . 627
C.4.1 Display Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
C.4.2 Linear Definition Pane . . . . . . . . . . . . . . . . . . . . . . . . . 628
C.4.3 Linear Definition Toolbar . . . . . . . . . . . . . . . . . . . . . . . . 629
C.4.4 Operations on Linear Definition Nodes . . . . . . . . . . . . . . . . 629
C.4.5 Gazetteer List Pane . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
C.4.6 Mapping Definition Pane . . . . . . . . . . . . . . . . . . . . . . . . 630
C.5 Google Translator PR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630

D Design Notes 633


D.1 Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
D.1.1 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
D.1.2 Model, view, controller . . . . . . . . . . . . . . . . . . . . . . . . . . 636
D.1.3 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
D.2 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637

E Ant Tasks for GATE 641


E.1 Declaring the Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
E.2 The packagegapp task - bundling an application with its dependencies . . . 641
E.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
E.2.2 Basic Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
E.2.3 Handling Non-Plugin Resources . . . . . . . . . . . . . . . . . . . . . 643
E.2.4 Streamlining your Plugins . . . . . . . . . . . . . . . . . . . . . . . . 646
E.2.5 Bundling Extra Resources . . . . . . . . . . . . . . . . . . . . . . . . 646
E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml 648

F Named-Entity State Machine Patterns 649


F.1 Main.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
F.2 first.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
F.3 firstname.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
F.4 name.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
F.4.1 Person . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
F.4.2 Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
F.4.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
F.4.4 Ambiguities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
F.4.5 Contextual information . . . . . . . . . . . . . . . . . . . . . . . . . . 652


F.5 name_post.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
F.6 date_pre.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
F.7 date.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
F.8 reldate.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
F.9 number.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
F.10 address.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
F.11 url.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
F.12 identifier.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
F.13 jobtitle.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
F.14 final.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
F.15 unknown.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
F.16 name_context.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
F.17 org_context.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
F.18 loc_context.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
F.19 clean.jape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656

G Part-of-Speech Tags used in the Hepple Tagger 657


References 659
Part I

GATE Basics

Chapter 1

Introduction

Software documentation is like sex: when it is good, it is very, very good; and
when it is bad, it is better than nothing. (Anonymous.)

There are two ways of constructing a software design: one way is to make it so
simple that there are obviously no deficiencies; the other way is to make it so
complicated that there are no obvious deficiencies. (C.A.R. Hoare)

A computer language is not just a way of getting a computer to perform
operations but rather that it is a novel formal medium for expressing ideas about
methodology. Thus, programs must be written for people to read, and only
incidentally for machines to execute. (The Structure and Interpretation of Computer
Programs, H. Abelson, G. Sussman and J. Sussman, 1985.)

If you try to make something beautiful, it is often ugly. If you try to make
something useful, it is often beautiful. (Oscar Wilde) [1]

GATE [2] is an infrastructure for developing and deploying software components that process
human language. It is nearly 15 years old and is in active use for all types of computational
task involving human language. GATE excels at text analysis of all shapes and sizes. From
large corporations to small startups, from €multi-million research consortia to undergraduate
projects, our user community is the largest and most diverse of any system of this type, and
is spread across all but one of the continents [3].

GATE is open source free software; users can obtain free support from the user and developer
community via GATE.ac.uk or on a commercial basis from our industrial partners. We
are the biggest open source language processing project with a development team more
than double the size of the largest comparable projects (many of which are integrated with
GATE [4]). More than €5 million has been invested in GATE development [5]; our objective is
to make sure that this continues to be money well spent for all GATE's users.

[1] These were, at least, our ideals; of course we didn't completely live up to them...
[2] If you've read the overview at http://gate.ac.uk/overview.html, you may prefer to skip to Section 1.1.
[3] Rumours that we're planning to send several of the development team to Antarctica on one-way tickets
are false, libellous and wishful thinking.

The GATE family of tools has grown over the years to include a desktop client for developers,
a workflow-based web application, a Java library, an architecture and a process. GATE is:

• an IDE, GATE Developer: an integrated development environment [6] for language
processing components bundled with a very widely used Information Extraction system
and a comprehensive set of other plugins.

• a cloud computing solution for hosted large-scale text processing, GATE Cloud
(https://cloud.gate.ac.uk/). See also Chapter 24.

• a web app, GATE Teamware: a collaborative annotation environment for factory-style
semantic annotation projects built around a workflow engine and a heavily-optimised
backend service infrastructure. See also Chapter 25.

• a multi-paradigm search repository, GATE Mímir, which can be used to index and
search over text, annotations, semantic schemas (ontologies), and semantic meta-data
(instance data). It allows queries that arbitrarily mix full-text, structural, linguistic
and semantic queries and that can scale to terabytes of text. See also Chapter 26.

• a framework, GATE Embedded: an object library optimised for inclusion in diverse
applications giving access to all the services used by GATE Developer and more.

• an architecture: a high-level organisational picture of language processing software
composition.

• a process for the creation of robust and maintainable services.

We also develop:

• a wiki/CMS, GATE Wiki (http://gatewiki.sf.net/), mainly to host our own websites
and as a testbed for some of our experiments

For more information on the GATE family see http://gate.ac.uk/family/ and also Part IV
of this book.

[4] Our philosophy is reuse not reinvention, so we integrate and interoperate with other systems, e.g.
LingPipe, OpenNLP, UIMA, and many more specific tools.
[5] This is the figure for direct Sheffield-based investment only and therefore an underestimate.
[6] GATE Developer and GATE Embedded are bundled, and in older distributions were referred to just as
'GATE'.

One of our original motivations was to remove the necessity for solving common engineering
problems before doing useful research, or re-engineering before deploying research results
into applications. Core functions of GATE take care of the lion's share of the engineering:

• modelling and persistence of specialised data structures

• measurement, evaluation, benchmarking (never believe a computing researcher who
hasn't measured their results in a repeatable and open setting!)

• visualisation and editing of annotations, ontologies, parse trees, etc.

• a finite state transduction language for rapid prototyping and efficient implementation
of shallow analysis methods (JAPE)

• extraction of training instances for machine learning

• pluggable machine learning implementations (Weka, SVM Light, ...)

On top of the core functions GATE includes components for diverse language processing
tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction
components for various languages, and many others. GATE Developer and Embedded are
supplied with an Information Extraction system (ANNIE) which has been adapted and
evaluated very widely (numerous industrial systems, research systems evaluated in MUC,
TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL
(metadata) for unstructured content (semantic annotation).

GATE version 1 was written in the mid-1990s; at the turn of the new millennium we
completely rewrote the system in Java; version 5 was released in June 2009; and version 6 in
November 2010. We believe that GATE is the leading system of its type, but as scientists
we have to advise you not to take our word for it; that's why we've measured our software
in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE,
DUC and more; see Section 1.4 for details). We invite you to give it a try, to get involved
with the GATE community, and to contribute to human language science, engineering and
development.

This book describes how to use GATE to develop language processing components, test their
performance and deploy them as parts of other applications. In the rest of this chapter:

• Section 1.1 describes the best way to use this book;

• Section 1.2 briefly notes that the context of GATE is applied language processing, or
Language Engineering;

• Section 1.3 gives an overview of developing using GATE;

• Section 1.4 lists publications describing GATE performance in evaluations;

• Section 1.5 outlines what is new in the current version of GATE;

• Section 1.6 lists other publications about GATE.



Note: if you don't see the component you need in this document, or if we mention a
component that you can't see in the software, contact [email protected]; various
components are developed by our collaborators, with whom we will be happy to put you
in contact. (Often the process of getting a new component is as simple as typing the
URL into GATE Developer; the system will do the rest.)

1.1 How to Use this Text

The material presented in this book ranges from the conceptual (e.g. `what is software
architecture?') to practical instructions for programmers (e.g. how to deal with GATE
exceptions) and linguists (e.g. how to write a pattern grammar). Furthermore, GATE's
highly extensible nature means that new functionality is constantly being added in the form
of new plugins. Important functionality is as likely to be located in a plugin as it is to
be integrated into the GATE core. This presents something of an organisational challenge.
Our (no doubt imperfect) solution is to divide this book into three parts. Part I covers
installation, using the GATE Developer GUI and using ANNIE, as well as providing some
background and theory. We recommend that new users begin with Part I. Part II covers
the more advanced core GATE functionality, including the GATE Embedded API and the JAPE
pattern language. Part III provides a reference for the numerous plugins
that have been created for GATE. Although ANNIE provides a good starting point, the
user will soon wish to explore other resources, and so will need to consult this part of the
text. We recommend that Part III be used as a reference, to be dipped into as necessary. In
Part III, plugins are grouped into broad areas of functionality.

1.2 Context

GATE can be thought of as a Software Architecture for Language Engineering
[Cunningham 00].

`Software Architecture' is used rather loosely here to mean computer infrastructure for soft-
ware development, including development environments and frameworks, as well as the more
usual use of the term to denote a macro-level organisational structure for software systems
[Shaw & Garlan 96].

Language Engineering (LE) may be defined as:

. . . the discipline or act of engineering software systems that perform tasks
involving processing human language. Both the construction process and its outputs
are measurable and predictable. The literature of the field relates to both
application of relevant scientific results and a body of practice. [Cunningham 99a]

7 Follow the `support' link from http://gate.ac.uk/ to subscribe to the mailing list.

The relevant scientific results in this case are the outputs of Computational Linguistics,
Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines,
LE, as an engineering discipline, entails predictability, both of the process of constructing LE-
based software and of the performance of that software after its completion and deployment
in applications.

Some working definitions:

1. Computational Linguistics (CL): science of language that uses computation as an
investigative tool.

2. Natural Language Processing (NLP): science of computation whose subject matter
is data structures and algorithms for computer processing of human language.

3. Language Engineering (LE): building NLP systems whose cost and outputs are
measurable and predictable.

4. Software Architecture: macro-level organisational principles for families of systems.
In this context the term is also used to mean infrastructure.

5. Software Architecture for Language Engineering (SALE): software infrastructure,
architecture and development tools for applied CL, NLP and LE.

(Of course the practice of these fields is broader and more complex than these definitions.)

In the scientific endeavours of NLP and CL, GATE's role is to support experimentation. In
this context GATE's significant features include support for automated measurement (see
Chapter 10), providing a `level playing field' where results can easily be repeated across
different sites and environments, and reducing research overheads in various ways.

1.3 Overview

1.3.1 Developing and Deploying Language Processing Facilities


GATE as an architecture suggests that the elements of software systems that process natural
language can usefully be broken down into various types of component, known as resources.8
Components are reusable software chunks with well-defined interfaces, and are a popular
architectural form, used in Sun's Java Beans and Microsoft's .Net, for example. GATE
components are specialised types of Java Bean, and come in three flavours:

• LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;

• ProcessingResources (PRs) represent entities that are primarily algorithmic, such as
parsers, generators or n-gram modellers;

• VisualResources (VRs) represent visualisation and editing components that participate
in GUIs.

These definitions can be blurred in practice as necessary.

8 The terms `resource' and `component' are synonymous in this context. `Resource' is used instead of just
`component' because it is a common term in the literature of the field: cf. the Language Resources and
Evaluation conference series [LREC-1 98, LREC-2 00].

Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection
of REusable Objects for Language Engineering. All the resources are packaged as Java
Archive (or `JAR') files, plus some XML configuration data. The JAR and XML files are
made available to GATE by putting them on a web server, or simply placing them in the
local file space. Section 1.3.2 introduces GATE's built-in resource set.

When using GATE to develop language processing functionality for an application, the
developer uses GATE Developer and GATE Embedded to construct resources of the three
types. This may involve programming, or the development of Language Resources such as
grammars that are used by existing Processing Resources, or a mixture of both. GATE
Developer is used for visualisation of the data structures produced and consumed during
processing, and for debugging, performance measurement and so on. For example, Figure 1.1
is a screenshot of one of the visualisation tools.

GATE Developer is analogous to systems like Mathematica for mathematicians, or JBuilder
for Java programmers: it provides a convenient graphical environment for research and
development of language processing software.

When an appropriate set of resources has been developed, they can then be embedded in
the target client application using GATE Embedded. GATE Embedded is supplied as a
series of JAR files.9 To embed GATE-based language processing facilities in an application,
these JAR files are all that is needed, along with JAR files and XML configuration files for
the various resources that make up the new facilities.

9 The main JAR file (gate.jar) supplies the framework. Built-in resources and various 3rd-party libraries
are supplied as separate JARs; for example, guk.jar (the GATE Unicode Kit) contains Unicode support
(e.g. additional input methods for languages not currently supported by the JDK). They are separate because
the latter has to be a Java extension with a privileged security profile.

Figure 1.1: One of GATE's visual resources

1.3.2 Built-In Components


GATE includes resources for common LE data structures and algorithms, including doc-
uments, corpora and various annotation types, a set of language analysis components for
Information Extraction and a range of data visualisation and editing components.

GATE supports documents in a variety of formats including XML, RTF, email, HTML,
SGML and plain text. In all cases the format is analysed and converted into a
single unified model of annotation. The annotation format is a modified form of the
TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format
[Bird & Liberman 99], and uses the now standard mechanism of `stand-off markup'. GATE
documents, corpora and annotations are stored in databases of various sorts, visualised via
the development environment, and accessed at code level via the framework. See Chapter 5
for more details of corpora etc.
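
To make the idea of stand-off markup concrete, here is a minimal Java sketch. It is not GATE's
own document or annotation model: the names StandoffSketch, Doc, Span, annotate and covered,
and the sample sentence, are all assumptions made for illustration. The point it shows is that
the text is stored once, unmodified, and each annotation refers to it by character offsets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: these classes are hypothetical and are not GATE's API.
final class StandoffSketch {

    /** One annotation: a typed span over the text plus arbitrary features. */
    static final class Span {
        final int start;   // offset of the first character covered
        final int end;     // offset just past the last character covered
        final String type;
        final Map<String, String> features = new LinkedHashMap<>();

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** A document keeps its text untouched and its annotations alongside it. */
    static final class Doc {
        final String text;
        final List<Span> spans = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        /** Adds an annotation after checking that its offsets lie inside the text. */
        Span annotate(int start, int end, String type) {
            if (start < 0 || end > text.length() || start > end) {
                throw new IllegalArgumentException("span outside document text");
            }
            Span s = new Span(start, end, type);
            spans.add(s);
            return s;
        }

        /** Recovers the text an annotation covers by slicing the original string. */
        String covered(Span s) {
            return text.substring(s.start, s.end);
        }
    }

    /** Builds a tiny example document with two annotations. */
    static Doc example() {
        Doc doc = new Doc("Fatima works at Cyberdyne.");
        doc.annotate(0, 6, "Person").features.put("kind", "firstName");
        doc.annotate(16, 25, "Organization");
        return doc;
    }
}
```

Because annotations only point into the text, several overlapping layers of annotation can
coexist without altering the original document.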

A family of Processing Resources for language analysis is included in the shape of ANNIE,
A Nearly-New Information Extraction system. These components use finite state techniques
to implement various tasks from tokenisation to semantic tagging or verb phrase chunking.
All ANNIE components communicate exclusively via GATE's document and annotation
resources. See Chapter 6 for more details. Other CREOLE resources are described in
Part III.

1.3.3 Additional Facilities in GATE Developer/Embedded


Three other facilities in GATE deserve special mention:

• JAPE, a Java Annotation Patterns Engine, provides regular-expression based
pattern/action rules over annotations (see Chapter 8).

• The `annotation diff' tool in the development environment implements performance
metrics such as precision and recall for comparing annotations. Typically a language
analysis component developer will mark up some documents by hand and then use these
along with the diff tool to automatically measure the performance of the components.
See Chapter 10.

• GUK, the GATE Unicode Kit, fills in some of the gaps in the JDK's10 support for
Unicode, e.g. by adding input methods for various languages from Urdu to Chinese.
See Section 3.11.2 for more details.
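
The precision and recall metrics mentioned in the list above can be illustrated with a short
Java sketch. This is a simplified stand-in and not the code of GATE's own tool: the class
DiffSketch and its methods are hypothetical, annotations are reduced to "type:start-end"
strings, and only exact matches of type and offsets count as correct (an assumption; partial
overlaps are ignored here).

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a simplified stand-in, not GATE's implementation.
final class DiffSketch {

    /** Encodes an annotation as "type:start-end" so that sets can compare them. */
    static String span(String type, int start, int end) {
        return type + ":" + start + "-" + end;
    }

    /** Number of response (system) annotations that exactly match a key (manual) one. */
    static int correct(Set<String> key, Set<String> response) {
        Set<String> both = new HashSet<>(response);
        both.retainAll(key);
        return both.size();
    }

    /** Fraction of the system's annotations that are correct. */
    static double precision(Set<String> key, Set<String> response) {
        return response.isEmpty() ? 0.0 : (double) correct(key, response) / response.size();
    }

    /** Fraction of the manual annotations that the system found. */
    static double recall(Set<String> key, Set<String> response) {
        return key.isEmpty() ? 0.0 : (double) correct(key, response) / key.size();
    }

    /** Balanced F-measure: the harmonic mean of precision and recall. */
    static double f1(Set<String> key, Set<String> response) {
        double p = precision(key, response);
        double r = recall(key, response);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }
}
```

For instance, with four manual annotations and three system annotations of which two match
exactly, precision is 2/3, recall is 1/2 and the balanced F-measure is 4/7.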

1.3.4 An Example
This section gives a very brief example of a typical use of GATE to develop and deploy
language processing capabilities in an application, and to generate quantitative results for
scientific publication.

Let's imagine that a developer called Fatima is building an email client11 for Cyberdyne
Systems' large corporate Intranet. In this application she would like to have a language
processing system that automatically spots the names of people in the corporation and
transforms them into mailto hyperlinks.

A little investigation shows that GATE's existing components can be tailored to this purpose.
Fatima starts up GATE Developer, and creates a new document containing some example
emails. She then loads some processing resources that will do named-entity recognition (a
tokeniser, gazetteer and semantic tagger), and creates an application to run these components
on the document in sequence. Having processed the emails, she can see the results in one of
several viewers for annotations.
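
As a rough picture of what Fatima's application is ultimately meant to do, the following Java
sketch turns known names into mailto hyperlinks using a simple lookup table. It is a toy and
does not use GATE's tokeniser, gazetteer or semantic tagger: the class MailtoSketch, the method
linkNames, and any names and addresses passed to it are hypothetical, and the output is not
HTML-escaped.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy lookup, not GATE's gazetteer or semantic tagger.
final class MailtoSketch {

    /** Wraps every whole-word occurrence of a known name in a mailto link. */
    static String linkNames(String text, Map<String, String> nameToAddress) {
        if (nameToAddress.isEmpty()) {
            return text;
        }
        // Build one alternation of all names; quote each so it is matched literally.
        // Caveat: if one name is a prefix of another, the first listed alternative wins.
        StringBuilder alternation = new StringBuilder();
        for (String name : nameToAddress.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(name));
        }
        Pattern names = Pattern.compile("\\b(?:" + alternation + ")\\b");
        Matcher m = names.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group();
            String link = "<a href=\"mailto:" + nameToAddress.get(name) + "\">" + name + "</a>";
            // quoteReplacement stops '$' and '\' in the link being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(link));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A plain lookup like this cannot handle nicknames, ambiguity or names that are absent from the
table, which is part of why the example goes on to use annotation-based components instead.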

The GATE components are a decent start, but they need to be altered to deal specially
with people from Cyberdyne's personnel database. Therefore Fatima creates new `cyber-'
versions of the gazetteer and semantic tagger resources, using the `bootstrap' tool. This tool
creates a directory structure on disk that has some Java stub code, a Makefile and an XML
configuration file. After several hours struggling with badly written documentation, Fatima
manages to compile the stubs and create a JAR file containing the new resources. She tells
GATE Developer the URL of these files12, and the system then allows her to load them in
the same way that she loaded the built-in resources earlier on.

10 JDK: Java Development Kit, Sun Microsystems' Java implementation. Unicode support is being actively
improved by Sun, but at the time of writing many languages are still unsupported. In fact, Unicode itself
doesn't support all languages, e.g. Sylheti; hopefully this will change in time.
11 Perhaps because Outlook Express trashed her mail folder again, or because she got tired of
Microsoft-specific viruses and hadn't heard of Gmail or Thunderbird.

Fatima then creates a second copy of the email document, and uses the annotation editing
facilities to mark up the results that she would like to see her system producing. She saves
this and the version that she ran GATE on into her serial datastore. From now on she can
follow this routine:

1. Run her application on the email test corpus.

2. Check the performance of the system by running the `annotation diff' tool to compare
her manual results with the system's results. This gives her both percentage accuracy
figures and a graphical display of the differences between the machine and human
outputs.

3. Make edits to the code, pattern grammars or gazetteer lists in her resources, and
recompile where necessary.

4. Tell GATE Developer to re-initialise the resources.

5. Go to 1.

To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that
it regenerates itself from the local personnel data. She then alters the pattern grammar in the
semantic tagger to prioritise recognition of names from that source. This latter job involves
learning the JAPE language (see Chapter 8), but as this is based on regular expressions it
isn't too difficult.

Eventually the system is running nicely, and her accuracy is 93% (there are still some prob-
lem cases, e.g. when people use nicknames, but the performance is good enough for pro-
duction use). Now Fatima stops using GATE Developer and works instead on embedding
the new components in her email application using GATE Embedded. This application is
written in Java, so embedding is very easy13: the GATE JAR files are added to the project
CLASSPATH, the new components are placed on a web server, and with a little code to do
initialisation, loading of components and so on, the job is finished in half a day. The code
to talk to GATE takes up only around 150 lines of the eventual application, most of which
is just copied from the example in the sheffield.examples.StandAloneAnnie class.

Because Fatima is worried about Cyberdyne's unethical policy of developing Skynet to help
the large corporates of the West strengthen their strangle-hold over the World, she wants
to get a job as an academic instead (so that her conscience will only have to cope with the
torture of students, as opposed to humanity). She takes the accuracy measures that she
has attained for her system and writes a paper for the Journal of Nasturtium Logarithm
Incitement describing the approach used and the results obtained. Because she used GATE
for development, she can cite the repeatability of her experiments and offer access to example
binary versions of her software by putting them on an external web server.

12 While developing, she uses a file:/... URL; for deployment she can put them on a web server.
13 Languages other than Java require an additional interface layer, such as JNI, the Java Native Interface,
which is in C.

And everybody lived happily ever after.

1.4 Some Evaluations


This section contains an incomplete list of publications describing systems that used GATE in
competitive quantitative evaluation programmes. These programmes have had a significant
impact on the language processing field and the widespread presence of GATE is some
measure of the maturity of the system and of our understanding of its likely performance on
diverse text processing tasks.

[Li et al. 07d] describes the performance of an SVM-based learning system in the NTCIR-6
Patent Retrieval Task. The system achieved the best result on two of three measures
used in the task evaluation, namely the R-Precision and F-measure. The system ob-
tained close to the best result on the remaining measure (A-Precision).

[Saggion 07] describes a cross-source coreference resolution system based on semantic
clustering. It uses GATE for information extraction and the SUMMA system to create
summaries and semantic representations of documents. One system configuration ranked
4th in the Web People Search 2007 evaluation.

[Saggion 06] describes a cross-lingual summarization system which uses SUMMA compo-
nents and the Arabic plugin available in GATE to produce summaries in English from
a mixture of English and Arabic documents.

Open-Domain Question Answering: The University of Sheffield has a long history
of research into open-domain question answering. GATE has formed the basis
of much of this research, resulting in systems which have ranked highly during
independent evaluations since 1999. The first successful question answering
system developed at the University of Sheffield was evaluated as part of TREC
8 and used the LaSIE information extraction system (the forerunner of ANNIE)
which was distributed with GATE [Humphreys et al. 99]. Further research was
reported in [Scott & Gaizauskas. 00], [Greenwood et al. 02], [Gaizauskas et al. 03],
[Gaizauskas et al. 04] and [Gaizauskas et al. 05]. In 2004 the system was ranked 9th
out of 28 participating groups.

[Saggion 04] describes techniques for answering definition questions. The system uses
definition patterns manually implemented in GATE as well as learned JAPE patterns
induced from a corpus. In 2004, the system was ranked 4th in the TREC/QA
evaluations.

[Saggion & Gaizauskas 04b] describes a multidocument summarization system
implemented using summarization components compatible with GATE (the SUMMA
system). The system was ranked 2nd in the Document Understanding Evaluation
programmes.

[Maynard et al. 03e] and [Maynard et al. 03d] describe participation in the TIDES
surprise language program. ANNIE was adapted to Cebuano with four person days of
effort, and achieved an F-measure of 77.5%. Unfortunately, ours was the only system
participating!

[Maynard et al. 02b] and [Maynard et al. 03b] describe results obtained on systems
designed for the ACE task (Automatic Content Extraction). Although a compari-
son to other participating systems cannot be revealed due to the stipulations of ACE,
results show 82%-86% precision and recall.

[Humphreys et al. 98] describes the LaSIE-II system used in MUC-7.

[Gaizauskas et al. 95] describes the LaSIE system used in MUC-6.

1.5 Recent Changes

This section details recent changes made to GATE. Appendix A provides a complete change
log.

1.5.1 Version 8.4.1 (June 2017)


This is a minor release that fixes one rarely encountered but serious bug with the handling
of CDATA sections within the text content of GATE XML format documents. CDATA has
always been handled correctly in annotation and document feature values; this bug would
only affect a small number of documents where the text contains many less-than signs (<<<)
and few annotations. In particular, annotated documents that have been processed using
the GATE tokeniser are extremely unlikely to be affected as each less-than sign is treated as
a separate Token annotation.

This release also includes one small improvement to the Twitter hashtag tokeniser so it
recognises the names of some political parties when they occur within hashtags such as
#VoteLabour.

1.5.2 Version 8.4 (February 2017)


GATE Developer and Embedded 8.4 is mainly a bug fix release, with a small number of
critical fixes compared to version 8.3. This will be the final major release of GATE before
major re-structuring of the codebase and the plugin system for version 8.5.

• Fixed an issue which had prevented the use of Java 8 lambda expressions in the RHS
of JAPE rules, even when running on Java 8.

• Removed OpenCalais and Zemanta plugins as the web services they depend on have
changed and the plugins no longer work.

• Fixed a bug that could cause the searchable datastore GUI to freeze.

• Fixes to the TermRaider and Hindi sample applications.

Java compatibility

For GATE 8.4 we recommend the use of the latest Java 8 from Oracle. If you are still
restricted to Java 7, most components will still work with the exception of the Stanford
CoreNLP tools and the TwitIE application (which uses the Stanford POS tagger). Future
versions of GATE will require Java 8 as a minimum.

1.6 Further Reading


Lots of documentation lives on the GATE web site, including:

• GATE online tutorials;

• the main system documentation tree;

• JavaDoc API documentation;

• HTML of the source code;

• comprehensive list of GATE plugins.

For more details about Sheffield University's work in human language processing see the NLP
group pages or A Definition and Short History of Language Engineering ([Cunningham 99a]).
For more details about Information Extraction see IE, a User Guide or the GATE IE pages.

A list of publications on GATE and projects that use it (some of which are available on-line
from http://gate.ac.uk/gate/doc/papers.html):

2010

[Bontcheva et al. 10] describes the Teamware web-based collaborative annotation
environment, emphasising the different roles that users play in the corpus annotation
process.

[Damljanovic 10] presents the use of GATE in the development of controlled natural
language interfaces. There is other related work by Damljanovic, Agatonovic, and
Cunningham on using GATE to build natural language interfaces for querying ontologies.

[Aswani & Gaizauskas 10] discusses the use of GATE to process South Asian languages
(Hindi and Gujarati).

2009

[Saggion & Funk 09] focuses in detail on the use of GATE for mining opinions and facts
for business intelligence gathering from web content.

[Aswani & Gaizauskas 09] presents in more detail the text alignment component of
GATE.

[Bontcheva et al. 09] is the `Human Language Technologies' chapter of `Semantic Knowl-
edge Management' (John Davies, Marko Grobelnik and Dunja Mladenić eds.)

[Damljanovic et al. 09] discusses the use of semantic annotation for software engineering,
as part of the TAO research project.

[Laclavik & Maynard 09] reviews the current state of the art in email processing and
communication research, focusing on the roles played by email in information
management, and commercial and research efforts to integrate a semantic-based approach to
email.

[Li et al. 09] investigates two techniques for making SVMs more suitable for language learn-
ing tasks. Firstly, an SVM with uneven margins (SVMUM) is proposed to deal with
the problem of imbalanced training data. Secondly, SVM active learning is employed
in order to alleviate the difficulty in obtaining labelled training data. The algorithms
are presented and evaluated on several Information Extraction (IE) tasks.

2008

[Agatonovic et al. 08] presents our approach to automatic patent enrichment, tested in
large-scale, parallel experiments on USPTO and EPO documents.

[Damljanovic et al. 08] presents Question-based Interface to Ontologies (QuestIO), a tool
for querying ontologies using unconstrained language-based queries.

[Damljanovic & Bontcheva 08] presents a semantic-based prototype that is made for
an open-source software engineering project with the goal of exploring methods for
assisting open-source developers and software users to learn and maintain the system
without major effort.

[Della Valle et al. 08] presents ServiceFinder.


[Li & Cunningham 08] describes our SVM-based system and several techniques we
developed successfully to adapt SVM for the specific features of the F-term patent
classification task.

[Li & Bontcheva 08] reviews the recent developments in applying geometric and quantum
mechanics methods for information retrieval and natural language processing.

[Maynard 08] investigates the state of the art in automatic textual annotation tools, and
examines the extent to which they are ready for use in the real world.

[Maynard et al. 08a] discusses methods of measuring the performance of ontology-based
information extraction systems, focusing particularly on the Balanced Distance Metric
(BDM), a new metric we have proposed which aims to take into account the more
flexible nature of ontologically-based applications.

[Maynard et al. 08b] investigates NLP techniques for ontology population, using a com-
bination of rule-based approaches and machine learning.

[Tablan et al. 08] presents the QuestIO system – a natural language interface for ac-
cessing structured information, that is domain independent and easy to use without
training.

2007

[Funk et al. 07a] describes an ontologically based approach to multi-source, multilingual
information extraction.

[Funk et al. 07b] presents a controlled language for ontology editing and a software im-
plementation, based partly on standard NLP tools, for processing that language and
manipulating an ontology.

[Maynard et al. 07a] proposes a methodology to capture (1) the evolution of metadata
induced by changes to the ontologies, and (2) the evolution of the ontology induced by
changes to the underlying metadata.

[Maynard et al. 07b] describes the development of a system for content mining using do-
main ontologies, which enables the extraction of relevant information to be fed into
models for analysis of financial and operational risk and other business intelligence
applications such as company intelligence, by means of the XBRL standard.

[Saggion 07] describes experiments for the cross-document coreference task in SemEval
2007. Our cross-document coreference system uses an in-house agglomerative clustering
implementation to group documents referring to the same entity.

[Saggion et al. 07] describes the application of ontology-based extraction and merging in
the context of a practical e-business application for the EU MUSING Project where the
goal is to gather international company intelligence and country/region information.

[Li et al. 07a] introduces a hierarchical learning approach for IE, which uses the target
ontology as an essential part of the extraction process, by taking into account the
relations between concepts.

[Li et al. 07b] proposes some new evaluation measures based on relations among
classification labels, which can be seen as the label relation sensitive version of important
measures such as averaged precision and F-measure, and presents the results of
applying the new evaluation measures to all submitted runs for the NTCIR-6 F-term patent
classification task.

[Li et al. 07c] describes the algorithms and linguistic features used in our participating
system for the opinion analysis pilot task at NTCIR-6.

[Li et al. 07d] describes our SVM-based system and the techniques we used to adapt the
approach for the specifics of the F-term patent classification subtask at NTCIR-6 Patent
Retrieval Task.

[Li & Shawe-Taylor 07] studies Japanese-English cross-language patent retrieval using
Kernel Canonical Correlation Analysis (KCCA), a method of correlating linear
relationships between two variables in kernel defined feature spaces.

2006

[Aswani et al. 06] (Proceedings of the 5th International Semantic Web Conference
(ISWC2006)) In this paper the problem of disambiguating author instances in on-
tology is addressed. We describe a web-based approach that uses various features such
as publication titles, abstract, initials and co-authorship information.

[Bontcheva et al. 06a] `Semantic Annotation and Human Language Technology', contri-
bution to `Semantic Web Technology: Trends and Research' (Davies, Studer and War-
ren, eds.)

[Bontcheva et al. 06b] `Semantic Information Access', contribution to `Semantic Web
Technology: Trends and Research' (Davies, Studer and Warren, eds.)

[Bontcheva & Sabou 06] presents an ontology learning approach that 1) exploits a range
of information sources associated with software projects and 2) relies on techniques
that are portable across application domains.

[Davis et al. 06] describes work in progress concerning the application of Controlled
Language Information Extraction (CLIE) to a Personal Semantic Wiki (SemperWiki),
the goal being to permit users who have no specialist knowledge in ontology tools or
languages to semi-automatically annotate their respective personal Wiki pages.

[Li & Shawe-Taylor 06] studies a machine learning algorithm based on KCCA for cross-
language information retrieval. The algorithm is applied to Japanese-English cross-
language information retrieval.

[Maynard et al. 06] discusses existing evaluation metrics, and proposes a new method for
evaluating the ontology population task, which is general enough to be used in a variety
of situations, yet more precise than many current metrics.

[Tablan et al. 06b] describes an approach that allows users to create and edit ontologies
simply by using a restricted version of the English language. The controlled language
described is based on an open vocabulary and a restricted set of grammatical con-
structs.

[Tablan et al. 06a] describes the creation of linguistic analysis and corpus search tools for
Sumerian, as part of the development of the ETCSL.

[Wang et al. 06] proposes an SVM based approach to hierarchical relation extraction, using
features derived automatically from a number of GATE-based open-source language
processing tools.

2005

[Aswani et al. 05] (Proceedings of Fifth International Conference on Recent Advances in
Natural Language Processing (RANLP2005)) describes a full-featured annotation indexing
and search engine, developed as part of GATE. It is powered by Apache Lucene
technology and indexes a variety of documents supported by GATE.

[Bontcheva 05] presents the ONTOSUM system which uses Natural Language Generation
(NLG) techniques to produce textual summaries from Semantic Web ontologies.

[Cunningham 05] is an overview of the field of Information Extraction for the 2nd Edition
of the Encyclopaedia of Language and Linguistics.

[Cunningham & Bontcheva 05] is an overview of the field of Software Architecture for
Language Engineering for the 2nd Edition of the Encyclopaedia of Language and Lin-
guistics.

[Dowman et al. 05a] (Euro Interactive Television Conference Paper) A system which can
use material from the Internet to augment television news broadcasts.

[Dowman et al. 05b] (World Wide Web Conference Paper) The Web is used to assist the
annotation and indexing of broadcast news.

[Dowman et al. 05c] (Second European Semantic Web Conference Paper) A system that
semantically annotates television news broadcasts using news websites as a resource to
aid in the annotation process.

[Li et al. 05c] (Proceedings of Sheffield Machine Learning Workshop) describes an SVM
based IE system which uses the SVM with uneven margins as its learning component and
GATE as its NLP processing module.

[Li et al. 05a] (Proceedings of Ninth Conference on Computational Natural Language
Learning (CoNLL-2005)) uses the uneven margins versions of two popular learning
algorithms, SVM and Perceptron, for IE to deal with the imbalanced classification
problems derived from IE.

[Li et al. 05b] (Proceedings of Fourth SIGHAN Workshop on Chinese Language processing
(Sighan-05)) a system for Chinese word segmentation based on Perceptron learning, a
simple, fast and effective learning algorithm.

[Polajnar et al. 05] (University of Sheffield Research Memorandum CS-05-10)
User-Friendly Ontology Authoring Using a Controlled Language.

[Saggion & Gaizauskas 05] describes experiments on content selection for producing bi-
ographical summaries from multiple documents.

[Ursu et al. 05] (Proceedings of the 2nd European Workshop on the Integration of
Knowledge, Semantic and Digital Media Technologies (EWIMT 2005)) Digital Media
Preservation and Access through Semantically Enhanced Web-Annotation.

[Wang et al. 05] (Proceedings of the 2005 IEEE/WIC/ACM International Conference on
Web Intelligence (WI 2005)) Extracting a Domain Ontology from Linguistic Resource
Based on Relatedness Measurements.

2004

[Bontcheva 04] (LREC 2004) describes lexical and ontological resources in GATE used for
Natural Language Generation.

[Bontcheva et al. 04] (JNLE) discusses developments in GATE in the early noughties.

[Cunningham & Scott 04a] (JNLE) is the introduction to the collection [Cunningham & Scott 04b].

[Cunningham & Scott 04b] (JNLE) is a collection of papers covering many important
areas of Software Architecture for Language Engineering.

[Dimitrov et al. 04] (Anaphora Processing) gives a lightweight method for named entity
coreference resolution.

[Li et al. 04] (Machine Learning Workshop 2004) describes an SVM-based learning
algorithm for IE using GATE.

[Maynard et al. 04a] (LREC 2004) presents algorithms for the automatic induction of
gazetteer lists from multi-language data.

[Maynard et al. 04b] (ESWS 2004) discusses ontology-based IE in the h-TechSight project.

[Maynard et al. 04c] (AIMSA 2004) presents automatic creation and monitoring of
semantic metadata in a dynamic knowledge portal.

[Saggion & Gaizauskas 04a] describes an approach to mining definitions.


[Saggion & Gaizauskas 04b] describes a sentence extraction system that produces two
sorts of multi-document summaries: a general-purpose summary of a cluster of related
documents and an entity-based summary of documents related to a particular person.

[Wood et al. 04] (NLDB 2004) looks at ontology-based IE from parallel texts.

2003

[Bontcheva et al. 03] (NLPXML-2003) looks at GATE for the semantic web.

[Cunningham et al. 03] (Corpus Linguistics 2003) describes GATE as a tool for
collaborative corpus annotation.

[Kiryakov 03] (Technical Report) discusses semantic web technology in the context of mul-
timedia indexing and search.

[Manov et al. 03] (HLT-NAACL 2003) describes experiments with geographic knowledge
for IE.

[Maynard et al. 03a] (EACL 2003) looks at the distinction between information and con-
tent extraction.

[Maynard et al. 03c] (Recent Advances in Natural Language Processing 2003) looks at
semantics and named-entity extraction.

[Maynard et al. 03e] (ACL Workshop 2003) describes NE extraction without training
data on a language you don't speak (!).

[Saggion et al. 03a] (EACL 2003) discusses robust, generic and query-based summarisa-
tion.

[Saggion et al. 03b] (Data and Knowledge Engineering) discusses multimedia indexing
and search from multisource multilingual data.

[Saggion et al. 03c] (EACL 2003) discusses event co-reference in the MUMIS project.

[Tablan et al. 03] (HLT-NAACL 2003) presents the OLLIE on-line learning for IE system.

[Wood et al. 03] (Recent Advances in Natural Language Processing 2003) discusses using
parallel texts to improve IE recall.

2002

[Baker et al. 02] (LREC 2002) report results from the EMILLE Indic languages corpus
collection and processing project.

[Bontcheva et al. 02a] (ACL 2002 Workshop) describes how GATE can be used as an
environment for teaching NLP, with examples of and ideas for future student projects
developed within GATE.

[Bontcheva et al. 02b] (NLIS 2002) discusses how GATE can be used to create HLT mod-
ules for use in information systems.

[Bontcheva et al. 02c], [Dimitrov 02a] and [Dimitrov 02b] (TALN 2002, DAARC
2002, MSc thesis) describe the shallow named entity coreference modules in GATE:
the orthomatcher, which resolves orthographic coreference between named entities,
and the pronoun resolution module.

[Cunningham 02] (Computers and the Humanities) describes the philosophy and moti-
vation behind the system, describes GATE version 1 and how well it lived up to its
design brief.

[Cunningham et al. 02] (ACL 2002) describes the GATE framework and graphical devel-
opment environment as a tool for robust NLP applications.

[Dimitrov 02a, Dimitrov et al. 02] (DAARC 2002, MSc thesis) discuss lightweight coref-
erence methods.

[Lal 02] (Master's thesis) looks at text summarisation using GATE.

[Lal & Ruger 02] (ACL 2002) looks at text summarisation using GATE.

[Maynard et al. 02a] (ACL 2002 Summarisation Workshop) describes using GATE to
build a portable IE-based summarisation system in the domain of health and safety.

[Maynard et al. 02c] (AIMSA 2002) describes the adaptation of the core ANNIE modules
within GATE to the ACE (Automatic Content Extraction) tasks.

[Maynard et al. 02d] (Nordic Language Technology) describes various Named Entity
recognition projects developed at Sheffield using GATE.

[Maynard et al. 02e] (JNLE) describes robustness and predictability in LE systems, and
presents GATE as an example of a system which contributes to robustness and to low
overhead systems development.

[Pastra et al. 02] (LREC 2002) discusses the feasibility of grammar reuse in applications
using ANNIE modules.

[Saggion et al. 02b] and [Saggion et al. 02a] (LREC 2002, SPLPT 2002) describe how
ANNIE modules have been adapted to extract information for indexing multimedia
material.

[Tablan et al. 02] (LREC 2002) describes GATE's enhanced Unicode support.

Older than 2002

[Maynard et al. 01] (RANLP 2001) discusses a project using ANNIE for named-entity
recognition across wide varieties of text type and genre.

[Bontcheva et al. 00] and [Brugman et al. 99] (COLING 2000, technical report) de-
scribe a prototype of GATE version 2 that integrated with the EUDICO multimedia
markup tool from the Max Planck Institute.

[Cunningham 00] (PhD thesis) defines the field of Software Architecture for Language
Engineering, reviews previous work in the area, presents a requirements analysis for
such systems (which was used as the basis for designing GATE versions 2 and 3), and
evaluates the strengths and weaknesses of GATE version 1.

[Cunningham et al. 00a], [Cunningham et al. 98a] and [Peters et al. 98] (OntoLex
2000, LREC 1998) present GATE's model of Language Resources, their access and
distribution.

[Cunningham et al. 00b] (LREC 2000) taxonomises Language Engineering components
and discusses the requirements analysis for GATE version 2.

[Cunningham et al. 00c] and [Cunningham et al. 99] (COLING 2000, AISB 1999)
summarise experiences with GATE version 1.

[Cunningham et al. 00d] and [Cunningham 99b] (technical reports) document early
versions of JAPE (superseded by the present document).

[Gambäck & Olsson 00] (LREC 2000) discusses experiences in the Svensk project, which
used GATE version 1 to develop a reusable toolbox of Swedish language processing
components.

[Maynard et al. 00] (technical report) surveys users of GATE up to mid-2000.


[McEnery et al. 00] (Vivek) presents the EMILLE project in the context of which GATE's
Unicode support for Indic languages has been developed.

[Cunningham 99a] (JNLE) reviewed and synthesised definitions of Language Engineering.



[Stevenson et al. 98] and [Cunningham et al. 98b] (ECAI 1998, NeMLaP 1998) re-
port work on implementing a word sense tagger in GATE version 1.

[Cunningham et al. 97b] (ANLP 1997) presents motivation for GATE and GATE-like
infrastructural systems for Language Engineering.

[Cunningham et al. 96a] (manual) was the guide to developing CREOLE components for
GATE version 1.

[Cunningham et al. 96b] (TIPSTER) discusses a selection of projects in Sheffield using
GATE version 1 and the TIPSTER architecture it implemented.

[Cunningham et al. 96c, Cunningham et al. 96d, Cunningham et al. 95] (COLING
1996, AISB Workshop 1996, technical report) report early work on GATE version 1.

[Gaizauskas et al. 96a] (manual) was the user guide for GATE version 1.

[Gaizauskas et al. 96b, Cunningham et al. 97a, Cunningham et al. 96e] (ICTAI
1996, TIPSTER 1997, NeMLaP 1996) report work on GATE version 1.

[Humphreys et al. 96] (manual) describes the language processing components dis-
tributed with GATE version 1.

[Cunningham 94, Cunningham et al. 94] (NeMLaP 1994, technical report) argue that
software engineering issues such as reuse, and framework construction, are important
for language processing R&D.
Chapter 2

Installing and Running GATE

2.1 Downloading GATE


To download GATE, point your web browser at http://gate.ac.uk/download/.

2.2 Installing and Running GATE

GATE will run anywhere that supports Java 7 or later, including Solaris, Linux, Mac OS
X and Windows platforms. We don't run tests on other platforms, but have had reports of
successful installs elsewhere.

2.2.1 The Easy Way


The easy way to install is to use one of the platform-specific installers (created using the
excellent IzPack). Download a `platform-specific installer' and follow the instructions it
gives you. Once the installation is complete, you can start GATE Developer using gate.exe
(Windows) or GATE.app (Mac) in the top-level installation directory. On Linux and other
platforms, use gate.sh in the bin directory (see section 2.2.4).

2.2.2 The Hard Way (1)


Download the Java-only release package or the binary build snapshot, and follow the instruc-
tions below.

Prerequisites:

• A conforming Java 2 environment,

  - version 1.4.2 or above for GATE 3.1
  - version 5.0 for GATE 4.0 beta 1 or later.
  - version 6.0 for GATE 6.1 or later.
  - version 7.0 for GATE 8.0 or later.

  available free from Oracle or from your UNIX supplier. (We test on various Sun JDKs
  on Solaris, Linux and Windows XP.)

• Binaries from the GATE distribution you downloaded: gate.jar (which can be found
in the directory called bin). You will also need the lib directory, containing various
libraries that GATE depends on.

• A suitable Apache ANT installation (version 1.8.1 or newer). You will need to add
an environment variable named ANT_HOME pointing to your ANT installation, and add
ANT_HOME/bin to your PATH.

• An open mind and a sense of humour.

Using the binary distribution:

• Unpack the distribution, creating a directory containing jar files and scripts.

• To run GATE Developer:

  - on Windows, start a Command Prompt window, change to the directory where
    you unpacked the GATE distribution and run `bin\gate.bat';
  - on Windows using the GUI, double-click the `gate.exe' file;
  - on UNIX/Linux or Mac open a terminal window and run `bin/gate.sh'.

• To embed GATE as a library (GATE Embedded), put gate.jar and all the libraries
in the lib directory in your CLASSPATH.

The Ant scripts that start GATE Developer (ant.bat or ant) require you to set the
JAVA_HOME environment variable to point to the top level directory of your Java
installation. The value of GATE_CONFIG is passed to the system by the scripts using either a -i
command-line option, or the Java property gate.site.config.

2.2.3 The Hard Way (2): Subversion


The GATE code is maintained in a Subversion repository. You can use a Subversion client
to check out the source code; the most up-to-date version of GATE is the trunk:
svn checkout http://svn.code.sf.net/p/gate/code/gate/trunk gate

Once you have checked out the code you can build GATE using Ant (see Section 2.6).

You can browse the complete Subversion repository online at
https://sourceforge.net/p/gate/code/HEAD/tree/.

2.2.4 Running GATE Developer on Unix/Linux


The script gate.sh in the directory bin of your installation can be used to start GATE
Developer. You can run this script by entering its full path in a terminal or by adding the
bin directory to your binary path. In addition you can also add a symbolic link to this script
in any directory that already is in your binary path.

If gate.sh is invoked without parameters, GATE Developer will use the files ~/.gate.xml
and ~/.gate.session to store session and configuration data. Alternatively you can run
gate.sh with the following parameters:

-h show usage information


-ld create or use the files .gate.session and .gate.xml in the current directory as the
session and configuration files. If option -dc DIR occurs before this option, the file
.gate.session is created from DIR/default.session if it does not already exist and
the file .gate.xml is created from DIR/default.xml if it does not already exist.

-ln NAME create or use NAME.session and NAME.xml in the current directory as the
session and configuration files. If option -dc DIR occurs before this option, the file
NAME.session is created from DIR/default.session if it does not already exist
and the file NAME.xml is created from DIR/default.xml if it does not already exist.

-ll if the current directory contains a file named log4j.properties then use it instead of the
default (GATE_HOME/bin/log4j.properties) to configure logging. Alternatively, you
can specify any log4j configuration file by setting the log4j.configuration property
explicitly (see below).

-rh LOCATION set the resources home directory to the LOCATION provided. If a
resources home location is provided, the URLs in a saved application are saved relative
to this location instead of relative to the application state file (see section 3.9.3). This
is equivalent to setting the property gate.user.resourceshome to this location.

-d URL loads the CREOLE plugin at the given URL during the start-up process.

-i FILE uses the specified file as the site configuration.


-dc DIR copy default.xml and/or default.session from the directory DIR when creating
a new config or session file. This option works only together with either the -ln,
-ld or -tmp option and must occur before -ln, -ld or -tmp. An existing config or
session file is used, but if it does not exist, the file from the given directory is copied
to create the file instead of using an empty/default file.

-tmp creates temporary configuration and session files in the current directory, optionally
copying default.xml and default.session from the directory specified with a -dc
DIR option that occurs before it. After GATE exits, those session and config files are
removed.

all other parameters are passed on to the java command. This can be used, for example, to
set properties using the java option -D. To set the maximum amount of
heap memory to be used when running GATE to 6000M, you can add -Xmx6000m as
a parameter. To change the default encoding used by GATE to UTF-8, add
-Dfile.encoding=utf-8 as a parameter. To specify a log4j configuration file, add
something like
-Dlog4j.configuration=file:///home/myuser/log4jconfig.properties.

Running GATE Developer with either the -ld or the -ln option from different directories is
useful to keep several projects separate and can be used to run multiple instances of GATE
Developer (or even different versions of GATE Developer) in succession or even simultaneously
without the configuration files getting mixed up between them.

2.3 Using System Properties with GATE


During initialisation, GATE reads several Java system properties in order to decide where
to find its configuration files.

Here is a list of the properties used, their default values and their meanings:

gate.home sets the location of the GATE install directory. This should point to the top
level directory of your GATE installation. This is the only property that is required.
If this is not set, the system will display an error message and then attempt to
guess the correct value.

gate.plugins.home points to the location of the directory containing installed plugins
(a.k.a. CREOLE directories). If this is not set then the default value of
{gate.home}/plugins is used.

gate.site.config points to the location of the configuration file containing the site-wide
options. If not set this will default to {gate.home}/gate.xml. The site configuration
file must exist!

gate.user.config points to the file containing the user's options. If not specified, or if the
specified file does not exist at startup time, the default value of gate.xml (.gate.xml on
Unix platforms) in the user's home directory is used.

gate.user.session points to the file containing the user's saved session. If not specified,
the default value of gate.session (.gate.session on Unix) in the user's home directory
is used. When starting up GATE Developer, the session is reloaded from this file if it
exists, and when exiting GATE Developer the session is saved to this file (unless the
user has disabled `save session on exit' in the configuration dialog). The session is not
used when using GATE Embedded.

gate.user.filechooser.defaultdir sets the default directory to be shown in the file chooser
of GATE Developer to the specified directory instead of the user's operating-system
specific default directory.

load.plugin.path is a path-like structure, i.e. a list of URLs separated by `;'. All directories
listed here will be loaded as CREOLE plugins during initialisation. This is similar in
function to the -d command line option.

gate.builtin.creole.dir is a URL pointing to the location of GATE's built-in CREOLE
directory. This is the location of the creole.xml file that defines the fundamental
GATE resource types, such as documents, document format handlers, controllers and
the basic visual resources that make up GATE. The default points to a location inside
gate.jar and should not generally need to be overridden.
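
The fallback rules in the list above can be summarised in a few lines of Java. This is only
an illustrative sketch of the documented defaults for gate.plugins.home and gate.site.config;
the class GatePropertyDefaults and the path /opt/gate are hypothetical, and this is not the
code GATE itself runs at start-up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Illustrative only: mirrors the documented defaults, not GATE's own start-up code. */
final class GatePropertyDefaults {

    /** Returns the effective locations, applying the fallbacks described in the list. */
    static Map<String, String> resolve(Properties props) {
        String home = props.getProperty("gate.home");
        if (home == null) {
            // gate.home is the only required property. GATE reports an error and then
            // tries to guess a value; this sketch simply refuses to continue.
            throw new IllegalArgumentException("gate.home is not set");
        }
        Map<String, String> resolved = new LinkedHashMap<>();
        resolved.put("gate.home", home);
        // Documented default: {gate.home}/plugins
        resolved.put("gate.plugins.home",
                props.getProperty("gate.plugins.home", home + "/plugins"));
        // Documented default: {gate.home}/gate.xml
        resolved.put("gate.site.config",
                props.getProperty("gate.site.config", home + "/gate.xml"));
        return resolved;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("gate.home", "/opt/gate"); // hypothetical install location
        resolve(props).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Explicitly set properties always win in this sketch; the defaults are only consulted when a
property is absent.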

When using GATE Embedded, you can set the values for these properties before you call
Gate.init(). Alternatively, you can set the values programmatically using the static
methods setGateHome(), setPluginsHome(), setSiteConfigFile(), etc. before calling
Gate.init(). See the Javadoc documentation for details. If you want to set these values
from the command line you can use the following syntax for setting gate.home for example:

java -Dgate.home=/my/new/gate/home/directory -cp... gate.Main

When running GATE Developer, you can set the properties by creating a file
build.properties in the top level GATE directory. In this file, any system properties
which are prefixed with `run.' will be passed to GATE. For example, to set an alternative
user config file, put the following line in build.properties [1]:

run.gate.user.config=${user.home}/alternative-gate.xml

This facility is not limited to the GATE-specific properties listed above. For example, the
following line changes the default temporary directory for GATE (note the use of forward
slashes, even on Windows platforms):

run.java.io.tmpdir=d:/bigtmp
[1] In this specific case, the alternative config file must already exist when GATE starts up,
so you should copy your standard gate.xml file to the new location.

When running GATE Developer from the command line via ant or via the gate.sh script
you can set properties using -D. Note that the run prefix is required when using ant:

ant run -Drun.gate.user.config=/my/path/to/user/config.file

but not when using gate.sh:

./bin/gate.sh -Dgate.user.config=/my/path/to/user/config.file

The GATE Developer launcher also supports the system property gate.class.path to specify
additional classpath entries that should be added to the classloader that is used to load
GATE classes. This is expected to be in the normal classpath format, i.e. a list of directory
or JAR file paths separated by semicolons on Windows and colons on other platforms. The
standard Java 6 shorthand of /path/to/directory/* [2] to include all .jar files from a given
directory is also supported. As an alternative to this system property, the environment
variable GATE_CLASSPATH can be used, but the environment variable is only read if the system
property is not set.

./bin/gate.sh -Dgate.class.path=/shared/lib/myclasses.jar
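
To make the classpath format concrete, the following Java sketch splits a gate.class.path
style value on the platform separator and flags entries that use the /* shorthand. The class
ClassPathEntries and the example paths are hypothetical; the sketch only illustrates the
format described above and is not how the GATE launcher is implemented.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: shows the classpath format, not the launcher's real parsing code. */
final class ClassPathEntries {

    /** Splits a classpath string on the given separator, skipping empty entries. */
    static List<String> split(String classPath, String separator) {
        List<String> entries = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** True if the entry uses the "all JARs in this directory" shorthand. */
    static boolean isWildcard(String entry) {
        return entry.equals("*") || entry.endsWith("/*") || entry.endsWith("\\*");
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on other platforms.
        String value = "/shared/lib/myclasses.jar" + File.pathSeparator + "/shared/lib/extra/*";
        for (String entry : split(value, File.pathSeparator)) {
            System.out.println(entry + (isWildcard(entry) ? "  (all JARs in directory)" : ""));
        }
    }
}
```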

2.4 Changing GATE's launch configuration


With effect from build 4723 (13 November 2013), all the JVM and GATE launch options can
be set in the gate.l4j.ini file on all platforms, as well as by using options to the gate.sh
command.

The gate.l4j.ini file supplied by default with GATE simply sets two standard JVM
memory options:

-Xmx1G
-Xms200m

-Xmx specifies the maximum heap size in megabytes (m) or gigabytes (g), and -Xms specifies
the initial size. Other parameters of interest are -XX:MaxPermSize=128m for the "permanent
generation", which you may wish to specify if you are getting OutOfMemoryErrors that say
"PermGen space".

Note that the format consists of one option per line. All the properties listed in Section 2.3
can be configured here by prefixing them with -D, e.g., -Dgate.user.config=path/to/other-gate.xml.
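
As a rough illustration of the one-option-per-line format, the Java sketch below picks the
-D entries out of a list of option lines and returns them as name/value pairs. The class
LaunchOptions is hypothetical and the parsing rules are an assumption made for illustration;
the real file is read by the GATE launcher, not by code like this.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: extracts -Dname=value entries from option lines. */
final class LaunchOptions {

    static Map<String, String> systemProperties(List<String> lines) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (!line.startsWith("-D")) {
                continue; // plain JVM options such as -Xmx1G are not system properties
            }
            int eq = line.indexOf('=');
            String name = eq < 0 ? line.substring(2) : line.substring(2, eq);
            String value = eq < 0 ? "" : line.substring(eq + 1);
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "-Xmx1G",
                "-Xms200m",
                "-Dgate.user.config=path/to/other-gate.xml");
        systemProperties(lines).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```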

[2] Remember to protect the * from expansion by your shell if necessary.

Proxy configuration (see Section 23.17.2) can now be set in this file by adding these lines
and editing them as needed for your configuration.

-Drun.java.net.useSystemProxies=true
-Dhttp.proxyHost=proxy.example.com
-Dhttp.proxyPort=8080
-Dhttp.nonProxyHosts=*.example.com

2.5 Configuring GATE


When GATE Developer is started, or when Gate.init() is called from GATE Embedded,
GATE loads various sorts of configuration data stored as XML in files generally called
something like gate.xml or .gate.xml. This data holds information such as:

• whether to save settings on exit;

• whether to save session on exit;

• what fonts GATE Developer should use;

• plugins to load at start;

• colours of the annotations;

• locations of files for the file chooser;

• and a lot of other GUI related options.

This type of data is stored at two levels (in order from general to specific):

• the site-wide level, which by default is located in the gate.xml file in the top level
directory of the GATE installation (i.e. the GATE home). This location can be overridden
by the Java system property gate.site.config;

• the user level, which lives in the user's HOME directory on UNIX or their profile
directory on Windows (note that parts of this file are overwritten when saving user
settings). The default location for this file can be overridden by the Java system
property gate.user.config.

Where configuration data appears on several different levels, the more specific ones overwrite
the more general. This means that you can set defaults for all GATE users on your system,
for example, and allow individual users to override those defaults without interfering with
others.
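
The override rule can be pictured as a simple map merge in which user-level entries replace
site-level ones. The Java sketch below assumes hypothetical option names (font,
saveSessionOnExit) and a hypothetical class LayeredConfig; it illustrates the precedence
only and does not reflect how GATE actually stores or reads gate.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: the more specific (user) level overrides the general (site) level. */
final class LayeredConfig {

    static Map<String, String> merge(Map<String, String> site, Map<String, String> user) {
        Map<String, String> effective = new LinkedHashMap<>(site);
        effective.putAll(user); // user-level entries replace site-level ones with the same key
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> site = new LinkedHashMap<>();
        site.put("font", "Serif");              // hypothetical option name
        site.put("saveSessionOnExit", "true");  // hypothetical option name

        Map<String, String> user = new LinkedHashMap<>();
        user.put("font", "Monospaced");         // this user's personal override

        System.out.println(merge(site, user));
    }
}
```

Options that a user never sets keep their site-wide value, which is what lets an
administrator provide defaults without preventing individual overrides.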

Configuration data can be set from the GATE Developer GUI via the `Options' menu then
`Configuration'. The user can change the appearance of the GUI in the `Appearance' tab,
which includes the options of font and the `look and feel'. The `Advanced' tab enables the
user to include annotation features when saving the document and preserving its format, to
save the selected Options automatically on exit, and to save the session automatically on
exit. The `Input Methods' submenu from the `Options' menu enables the user to change the
default language for input. These options are all stored in the user's .gate.xml file.

When using GATE Embedded, you can also set the site config location using
Gate.setSiteConfigFile(File) prior to calling Gate.init().

2.6 Building GATE

Note that you don't need to build GATE unless you're doing development on the system
itself.

Prerequisites:

• A conforming Java environment as above.

• A copy of the GATE sources and the build scripts: either the SRC distribution package
from the nightly snapshots or a copy of the code obtained through Subversion (see
Section 2.2.3).

• A working installation of Apache ANT version 1.8.1 or newer. You will need to add
an environment variable named ANT_HOME pointing to your ANT installation, and add
ANT_HOME/bin to your PATH. It is advisable that you also set your JAVA_HOME
environment variable to point to the top-level directory of your Java installation.

• An appreciation of natural beauty.

To build gate, cd to gate and:

1. Type:
   ant

2. [optional] To test the system:
   ant test

3. [optional] To make the Javadoc documentation:
   ant doc

4. You can also run GATE Developer using Ant, by typing:
   ant run

5. To see a full list of options type:
   ant help

(The details of the build process are all specified by the build.xml file in the gate directory.)

You can also use a development environment like Eclipse (the required .project file and
other metadata are included with the sources), but note that it's still advisable to use ant to
generate documentation, the jar file and so on. Also note that the run configurations have
the location of a gate.xml site configuration file hard-coded into them, so you may need to
change these for your site.

2.6.1 Using GATE with Maven/Ivy


This section is based on contributions by Marin Nozhchev (Ontotext) and Benson Margulies
(Basis Technology Corp).

Stable releases of GATE (since 5.2.1) are available in the standard central Maven repository,
with group ID uk.ac.gate and artifact ID gate-core. To use GATE in a Maven-based
project you can simply add a dependency:

<dependency>
<groupId>uk.ac.gate</groupId>
<artifactId>gate-core</artifactId>
<version>6.0</version>
</dependency>

Similarly, with a project that uses Ivy for dependency management:

<dependency org="uk.ac.gate" name="gate-core" rev="6.0"/>

In addition you will require the matching versions of any GATE plugins you wish to use in
your application. These are not managed by Maven or Ivy, but can be obtained from the
standard GATE release download or downloaded using the GATE Developer plugin manager
as appropriate.

Nightly snapshot builds of gate-core are available from our own Maven repository at
http://repo.gate.ac.uk/content/groups/public.

2.7 Uninstalling GATE


If you have used the installer, run:

java -jar uninstaller.jar

or just delete the whole of the installation directory (the one containing bin, lib, Uninstaller,
etc.). The installer doesn't install anything outside this directory, but for completeness you
might also want to delete the settings files GATE creates in your home directory (.gate.xml
and .gate.session).

2.8 Troubleshooting
See the FAQ on the GATE Wiki for frequent questions about running and using GATE.
Chapter 3

Using GATE Developer

`The law of evolution is that the strongest survives!'


`Yes; and the strongest, in the existence of any social species, are those who are
most social. In human terms, most ethical. . . . There is no strength to be gained
from hurting one another. Only weakness.'
The Dispossessed [p.183], Ursula K. Le Guin, 1974.

This chapter introduces GATE Developer, which is the GATE graphical user interface. It is
analogous to systems like Mathematica for mathematicians, or Eclipse for Java programmers,
providing a convenient graphical environment for research and development of language
processing software. As well as being a powerful research tool in its own right, it is also very
useful in conjunction with GATE Embedded (the GATE API by which GATE functionality
can be included in your own applications); for example, GATE Developer can be used to
create applications that can then be embedded via the API. This chapter describes how
to complete common tasks using GATE Developer. It is intended to provide a good entry
point to GATE functionality, and so explanations are given assuming only basic knowledge
of GATE. However, probably the best way to learn how to use GATE Developer is to use
this chapter in conjunction with the demonstration and tutorial movies. There are specific
links to them throughout the chapter. There is also a complete new set of video tutorials
here.

The basic business of GATE is annotating documents, and all the functionality we will
introduce relates to that. Core concepts are:

• the documents to be annotated,

• corpora comprising sets of documents, grouping documents for the purpose of running
uniform processes across them,

• annotations that are created on documents,

• annotation types such as `Name' or `Date',

• annotation sets comprising groups of annotations,

• processing resources that manipulate and create annotations on documents, and

• applications, comprising sequences of processing resources, that can be applied to a
document or corpus.

What is considered to be the end result of the process varies depending on the task, but
for the purposes of this chapter, output takes the form of the annotated document/corpus.
Researchers might be more interested in figures demonstrating how successfully their
application compares to a `gold standard' annotation set; Chapter 10 in Part II will cover ways of
comparing annotation sets to each other and obtaining measures such as F1. Implementers
might be more interested in using the annotations programmatically; Chapter 7, also in Part
II, talks about working with annotations from GATE Embedded. For the purposes of this
chapter, however, we will focus only on creating the annotated documents themselves, and
creating GATE applications for future use.

GATE includes a complete information extraction system that you are free to use, called
ANNIE (a Nearly-New Information Extraction System). Many users find this is a good
starting point for their own application, and so we will cover it in this chapter. Chapter 6
talks in a lot more detail about the inner workings of ANNIE, but we aim to get you started
using ANNIE from inside of GATE Developer in this chapter.

We start the chapter with an exploration of the GATE Developer GUI, in Section 3.1. We
describe how to create documents (Section 3.2) and corpora (Section 3.3). We talk about
viewing and manually creating annotations (Section 3.4).

We then talk about loading the plugins that contain the processing resources you will use
to construct your application, in Section 3.5. We then talk about instantiating processing
resources (Section 3.7). Section 3.8 covers applications, including using ANNIE (Section
3.8.3). Saving applications and language resources (documents and corpora) is covered in
Section 3.9. We conclude with a few assorted topics that might be useful to the GATE
Developer user, in Section 3.11.

3.1 The GATE Developer Main Window


Figure 3.1 shows the main window of GATE Developer, as you will see it when you first run
it. There are five main areas:

1. at the top, the menu bar and toolbar, with the menus `File', `Options', `Tools', `Help'
and icons for the most frequently used actions;

2. on the left side, a tree starting from `GATE' and containing `Applications', `Language
Resources' etc.: this is the resources tree;

3. in the bottom left corner, a rectangle, which is the small resource viewer;

4. in the center, containing tabs with `Messages' or the name of a resource from the
resources tree, the main resource viewer;

5. at the bottom, the messages bar.

Figure 3.1: Main Window of GATE Developer

The menu and the messages bar do the usual things. Longer messages are displayed in the
messages tab in the main resource viewer area.

The resource tree and resource viewer areas work together to allow the system to display
diverse resources in various ways. The many resources integrated with GATE can have either
a small view, a large view, or both.

At any time, the main viewer can also be used to display other information, such as messages,
by clicking on the appropriate tab at the top of the main window. If an error occurs in
processing, the messages tab will flash red, and an additional popup error message may also
appear.

In the options dialogue from the Options menu you can choose if you want to link the
selection in the resources tree and the selected main view.

3.2 Loading and Viewing Documents

Figure 3.2: Making a New Document

If you right-click on `Language Resources' in the resources pane, select `New' then `GATE
Document', the window `Parameters for the new GATE Document' will appear as shown in
figure 3.2. Here, you can specify the GATE document to be created. Required parameters
are indicated with a tick. The name of the document will be created for you if you do not
specify it. Enter the URL of your document or use the file browser to indicate the file you
wish to use for your document source. For example, you might use `http://gate.ac.uk', or
browse to a text or XML file you have on disk. Click on `OK' and a GATE document will
be created from the source you specified.

See also the movie for creating documents.

The document editor is contained in the central tabbed pane in GATE Developer. Double-
click on your document in the resources pane to view the document editor.

The document editor consists of a top panel with buttons and icons that control the display
of different views and the search box. Initially, you will see just the text of your document, as
shown in figure 3.3. Click on `Annotation Sets' and `Annotations List' to view the annotation
sets to the right and the annotations list at the bottom.

You will see a view similar to figure 3.4. In place of the annotations list, you can also choose
to see the annotations stack. In place of the annotation sets, you can also choose to view
the co-reference editor. More information about this functionality is given in Section 3.4.

Several options can be set from the small triangle icon at the top right corner.

Figure 3.3: The Document Editor

With `Save Current Layout' you store the way the different views are shown and the
annotation types highlighted in the document. Then if you set `Restore Layout Automatically' you
will get the same views and annotation types each time you open a document. The layout
is saved to the user preferences file, gate.xml. This means that you can give this file to a new
user, who will then have a preconfigured document editor.

Another setting makes the document editor `Read-only'. If enabled, you won't be able to edit
the text but you will still be able to edit annotations. This is useful for avoiding accidental
changes to the original text.

The option `Right To Left Orientation' is useful for changing the orientation of the text for
languages such as Arabic and Urdu. Selecting this option changes the orientation of the text of
the currently visible document.

Finally you can choose between `Insert Append' and `Insert Prepend'. That setting is only
relevant when you're inserting text at the very border of an annotation.

If you place the cursor at the start of an annotation, in one case the newly entered text will
become part of the annotation, in the other case it will stay outside. If you place the cursor
at the end of an annotation, the opposite will happen.

Let us use this sentence: `This is an [annotation].' with the square brackets [] denoting the
boundaries of the annotation. If we insert an `x' just before the `a' or just after the `n' of
`annotation', here's what we get:

Append

• This is an x[annotation].

• This is an [annotationx].

Prepend

• This is an [xannotation].

• This is an [annotation]x.
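
The four cases above can be reproduced with a small offset calculation. The sketch below is a toy model in plain Java, not GATE's own code: the class `InsertModeDemo`, the enum `InsertMode` and the methods `adjust` and `render` are hypothetical names introduced only for illustration. It assumes an annotation is just a pair of character offsets.

```java
/** Toy model of `Insert Append' and `Insert Prepend'; hypothetical, not GATE code. */
final class InsertModeDemo {

    enum InsertMode { APPEND, PREPEND }

    /**
     * Returns the {start, end} offsets of an annotation after `length`
     * characters have been typed at offset `pos` in the document text.
     */
    static int[] adjust(int start, int end, int pos, int length, InsertMode mode) {
        if (pos < start) {                        // typed before the annotation: shift it
            return new int[] {start + length, end + length};
        }
        if (pos > end) {                          // typed after the annotation: unchanged
            return new int[] {start, end};
        }
        if (pos == start) {                       // typed at the left border
            return mode == InsertMode.PREPEND
                    ? new int[] {start, end + length}            // new text joins the annotation
                    : new int[] {start + length, end + length};  // new text stays outside
        }
        if (pos == end) {                         // typed at the right border
            return mode == InsertMode.APPEND
                    ? new int[] {start, end + length}            // new text joins the annotation
                    : new int[] {start, end};                    // new text stays outside
        }
        return new int[] {start, end + length};   // typed strictly inside: annotation grows
    }

    /** Renders the text with square brackets around the annotation span. */
    static String render(String text, int[] span) {
        return text.substring(0, span[0]) + "[" + text.substring(span[0], span[1]) + "]"
                + text.substring(span[1]);
    }

    public static void main(String[] args) {
        String text = "This is an annotation.";
        int start = 11;
        int end = 21;                             // span of the word "annotation"
        for (InsertMode mode : InsertMode.values()) {
            for (int pos : new int[] {start, end}) {
                String edited = text.substring(0, pos) + "x" + text.substring(pos);
                System.out.println(mode + ": " + render(edited, adjust(start, end, pos, 1, mode)));
            }
        }
    }
}
```

Running `main` prints the same four bracketed sentences as the list above, two for each mode.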

Figure 3.4: The Document Editor with Annotation Sets and Annotations List

Text in a loaded document can be edited in the document viewer. The usual platform-specific
cut, copy and paste keyboard shortcuts should also work, depending on your operating
system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a magnifying glass, at the
top of the document editor is for searching in the document. To prevent the new annotation
windows popping up when a piece of text is selected, hold down the CTRL key. Alternatively,
you can hide the annotation sets view by clicking on its button at the top of the document
view; this will also cause the highlighted portions of the text to become un-highlighted.

See also Section 20.2.3 for the compound document editor.

3.3 Creating and Viewing Corpora


You can create a new corpus in a similar manner to creating a new document; simply right-
click on `Language Resources' in the resources pane, select `New' then `GATE corpus'. A
brief dialogue box will appear in which you can optionally give a name for your corpus (if
you leave this blank, a corpus name will be created for you) and optionally add documents
to the corpus from those already loaded into GATE.

There are three ways of adding documents to a corpus:

1. When creating the corpus, clicking on the icon next to the documentsList input field
brings up a popup window with a list of the documents already loaded into GATE
Developer. This enables the user to add any documents to the corpus.

2. Alternatively, the corpus can be loaded first, and documents added later by double
clicking on the corpus and using the + and - icons to add documents to the corpus or
remove them from it. Note that the documents must have been loaded into GATE Developer
before they can be added to the corpus.

3. Once loaded, the corpus can be populated by right-clicking on the corpus in the resources
pane and selecting `Populate'. With this method, documents do not have to have been
previously loaded into GATE Developer, as they will be loaded during the population
process. A dialogue box will appear in which you can specify a directory in which GATE
will search for documents. You can specify the extensions allowable; for example, XML
or TXT. This will restrict the corpus population to only those documents with the
extensions you wish to load. You can choose whether to recurse through the directories
contained within the target directory or restrict the population to those documents
contained in the top level directory. Click on `OK' to populate your corpus. This option
provides a quick way to create a GATE Corpus from a directory of documents.
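
To make the effect of the extension list and the recursion option in the `Populate' dialogue concrete, here is a minimal sketch of the file selection step using only the standard JDK. It does not call GATE: the class `PopulateSelectionDemo` and its methods are hypothetical, and the case-insensitive comparison of extensions is an assumption made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of choosing files for corpus population; not GATE code. */
final class PopulateSelectionDemo {

    /** Lists regular files under dir whose extension is allowed, optionally recursing. */
    static List<Path> select(Path dir, Set<String> extensions, boolean recurse) throws IOException {
        int depth = recurse ? Integer.MAX_VALUE : 1;   // depth 1 = top level directory only
        try (Stream<Path> paths = Files.walk(dir, depth)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> extensions.contains(extensionOf(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Lower-cased text after the last dot of the file name, or "" if there is no dot. */
    static String extensionOf(Path p) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway directory tree to select from.
        Path dir = Files.createTempDirectory("corpus");
        Files.createDirectory(dir.resolve("sub"));
        Files.write(dir.resolve("a.xml"), "<doc/>".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.pdf"), "skipped".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("sub").resolve("c.txt"), "text".getBytes(StandardCharsets.UTF_8));

        Set<String> wanted = new HashSet<>(Arrays.asList("xml", "txt"));
        System.out.println("top level only: " + select(dir, wanted, false));
        System.out.println("recursive:      " + select(dir, wanted, true));
    }
}
```

With the sample tree, the non-recursive call selects only `a.xml`, while the recursive call also picks up `sub/c.txt`; `b.pdf` is skipped in both cases.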

Additionally, right-clicking on a loaded document in the tree and selecting the `New corpus
with this document' option creates a new transient corpus named `Corpus for document
name' (where `document name' is the name of the document) containing just this document.

See also the movie for creating and populating corpora.

Figure 3.5: Corpus Editor

Double click on your corpus in the resources pane to see the corpus editor, shown in figure 3.5.
You will see a list of the documents contained within the corpus.

In the top left of the corpus editor, plus and minus buttons allow you to add documents to
the corpus from those already loaded into GATE and remove documents from the corpus
(note that removing a document from a corpus does not remove it from GATE).

Up and down arrows at the top of the view allow you to reorder the documents in the corpus.
The rightmost button in the view opens the currently selected document in a document
editor.

At the bottom, you will see that tabs entitled `Initialisation Parameters' and `Corpus Quality
Assurance' are also available in addition to the corpus editor tab you are currently looking at.
Clicking on the `Initialisation Parameters' tab allows you to view the initialisation parameters
for the corpus. The `Corpus Quality Assurance' tab allows you to calculate agreement
measures between the annotations in your corpus. Agreement measures are discussed in
depth in Chapter 10. The use of corpus quality assurance is discussed in Section 10.3.

3.4 Working with Annotations


In this section, we will talk in more detail about viewing annotations, as well as creating and
editing them manually. As discussed at the start of the chapter, the main purpose of GATE
is annotating documents. Whilst applications can be used to annotate the documents entirely
automatically, annotation can also be done manually by the user, or semi-automatically,
by running an application over the corpus and then correcting/adding new annotations
manually. Section 3.4.5 focuses on manual annotation. In Section 3.7 we talk about running
processing resources on our documents. We begin by outlining the functionality around
viewing annotations, organised by the GUI area to which the functionality pertains.

3.4.1 The Annotation Sets View


To view the annotation sets, click on the `Annotation Sets' button at the top of the doc-
ument editor, or use the F3 key (see Section 3.10 for more keyboard shortcuts). This will
bring up the annotation sets viewer, which displays the annotation sets available and their
corresponding annotation types.

The annotation sets view is displayed on the right part of the document editor. It's a tree-like
view with a root for each annotation set. The first annotation set in the list is always a
nameless set. This is the default annotation set. You can see in figure 3.4 that there is a
drop-down arrow with no name beside it. Other annotation sets on the document shown in
figure 3.4 are `Key' and `Original markups'. Because the document is an XML document,
the original XML markup is retained in the form of an annotation set. This annotation set
is expanded, and you can see that there are annotations for `TEXT', `body', `font', `html',
`p', `table', `td' and `tr'.

To display all the annotations of one type, tick its checkbox or use the space key. The text
segments corresponding to these annotations will be highlighted in the main text window.
To delete an annotation type, use the delete key. To change the color, use the enter key.
There is a context menu for all these actions that you can display by right-clicking on one
annotation type, a selection or an annotation set.

If you keep the shift key pressed when you open the annotation sets view, GATE Developer will
try to select any annotations that were selected in the previous document viewed (if any);
otherwise no annotation will be selected.

Having selected an annotation type in the annotation sets view, hovering over an annotation
in the main resource viewer or right-clicking on it will bring up a popup box containing a
list of the annotations associated with it, from which one can select an annotation to view
in the annotation editor, or if there is only one, the annotation editor for that annotation.
Figure 3.6 shows the annotation editor.

Figure 3.6: The Annotation Editor

3.4.2 The Annotations List View


To view the list of annotations and their features, click on the `Annotations list' button
at the top of the main window or use the F4 key. The annotation list view will appear below
the main text. It will only contain the annotations selected from the annotation sets view.
These lists can be sorted in ascending and descending order for any column, by clicking on
the corresponding column heading. Moreover, you can hide a column by right-clicking on the
column headings and using the context menu. Selecting rows in the table will make the
respective annotations blink in the document. Right-click on a row or selection in this view to
delete or edit an annotation. The Delete key is a shortcut to delete selected annotations.

3.4.3 The Annotations Stack View


This view is similar to the ANNIC view described in section 9.2. It displays annotations
at the document caret position with some context before and after. The annotations are
stacked from top to bottom, which gives a clear view when they are overlapping.

As the view is centred on the document caret, you can use the conventional keys to move it
and update the view: notably the left and right keys to skip one letter; control + left/right
to skip one word; up and down to go one line up or down. To move further, use the document
scrollbar and then click in the document.

There are two buttons at the top of the view that centre the view on the closest previous/next
annotation boundary among all displayed. This is useful when you want to skip a region
without annotation or when you want to reach the beginning or end of a very long annotation.

The annotation types displayed correspond to those selected in the annotation sets view. You
can display feature values for an annotation rectangle by hovering the mouse on it or select
only one feature to display by double-clicking on the annotation type in the first column.

Figure 3.7: Annotations stack view centred on the document caret.

Right-click on an annotation in the annotations stack view to edit it. Control-Shift-click to
delete it. Double-click to copy it to another annotation set. Control-click on a feature value
that contains a URL to display it in your browser.

All of these mouse shortcuts make it easier to create a gold standard annotation set.

3.4.4 The Co-reference Editor


The co-reference editor allows co-reference chains (see Section 6.9) to be displayed and edited
in GATE Developer. To display the co-reference editor, first open a document in GATE
Developer, and then click on the Co-reference Editor button in the document viewer.

The combo box at the top of the co-reference editor allows you to choose which annotation
set to display co-references for. If an annotation set contains no co-reference data, then the
tree below the combo box will just show `Coreference Data' and the name of the annotation
set. However, when co-reference data does exist, a list of all the co-reference chains that are
based on annotations in the currently selected set is displayed. The name of each co-reference
chain in this list is the same as the text of whichever element in the chain is the longest. It
is possible to highlight all the member annotations of any chain by selecting it in the list.

When a co-reference chain is selected, if the mouse is placed over one of its member annota-
tions, then a pop-up box appears, giving the user the option of deleting the item from the
chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it
will be removed from the list of chains. If the name of the chain was derived from the item
that was deleted, then the chain will be given a new name based on the next longest item
in the chain.
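
The naming rule described above (a chain is named after its longest member, and is renamed when that member is deleted) can be sketched as follows. This is an illustrative model only: `ChainNameDemo` and `chainName` are hypothetical names, the member texts are made up, and the tie-breaking behaviour (keeping the earlier member) is an assumption, as the text does not say what happens when two members have the same length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of co-reference chain naming; not GATE code. */
final class ChainNameDemo {

    /**
     * Returns the text of the longest member, or null for an empty chain
     * (an empty chain ceases to exist, so it has no name).
     */
    static String chainName(List<String> memberTexts) {
        String longest = null;
        for (String text : memberTexts) {
            // Strictly greater: on a tie the earlier member is kept (an assumption).
            if (longest == null || text.length() > longest.length()) {
                longest = text;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        List<String> chain = new ArrayList<>(Arrays.asList("EPSRC", "the research council", "it"));
        System.out.println(chainName(chain));   // named after the longest member

        chain.remove("the research council");   // delete the member the name came from
        System.out.println(chainName(chain));   // renamed after the next longest member

        chain.clear();                          // deleting the last members removes the chain
        System.out.println(chainName(chain));
    }
}
```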

Figure 3.8: Co-reference editor inside a document editor. The popup window in the document
under the word `EPSRC' is used to add highlighted annotations to a co-reference chain. Here
the annotation type `Organization' of the annotation set `Default' is highlighted and also
the co-references `EC' and `GATE'.

A combo box near the top of the co-reference editor allows the user to select an annotation
type from the current set. When the Show button is selected all the annotations of the
selected type will be highlighted. Now when the mouse pointer is placed over one of those
annotations, a pop-up box will appear giving the user the option of adding the annotation
to a co-reference chain. The annotation can be added to an existing chain by typing the
name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if
the user presses the down cursor key, a list of all the existing annotations appears, together
with the option [New Chain]. Selecting the [New Chain] option will cause a new chain to
be created containing the selected annotation as its only element.

Each annotation can only be added to a single chain, but annotations of different types can
be added to the same chain, and the same text can appear in more than one chain if it is
referenced by two or more annotations.

The movie for inspecting results is also useful for learning about viewing annotations.

3.4.5 Creating and Editing Annotations


To create annotations manually, select the text you want to annotate and hover the mouse
on the selection or use the control+E keys. A popup will appear, allowing you to create an
annotation, as shown in figure 3.9.

Figure 3.9: Creating a New Annotation

The type of the annotation, by default, will be the same as the last annotation you created,
unless there is none, in which case it will be `_New_'. You can enter any annotation
type name you wish in the text box, unless you are using schema-driven annotation (see
Section 3.4.6). You can add or change features and their values in the table below.

To delete an annotation, click on the red X icon at the top of the popup window. To
grow or shrink the span of the annotation at its start, use the two arrow icons on the left, or
the right and left keys. To change the annotation end, use the next two arrow icons to the
right, or the alt+right and alt+left keys. Add the shift or control+shift keys to make the span
increment bigger.

The pin icon is to pin the window so that it remains where it is. If you drag and drop the
window, this automatically pins it too. Pinning it means that even if you select another
annotation (by hovering over it in the main resource viewer) it will still stay in the same
position.

The popup menu only contains annotation types present in the Annotation Schema and
those already listed in the relevant Annotation Set. To create a new Annotation Schema,
see Section 3.4.6. The popup menu can be edited to add a new annotation type, however.

The new annotation created will automatically be placed in the annotation set that has been
selected (highlighted) by the user. To create a new annotation set, type the name of the new
set to be created in the box below the list of annotation sets, and click on `New'.

Figure 3.10 demonstrates adding an `Organization' annotation for the string `EPSRC'
(highlighted in green) to the default annotation set (blank name in the annotation set view on
the right) and a feature named `type' with a value about to be added.

To add a second annotation to a selected piece of text, or to add an overlapping annotation to
an existing one, press the CTRL key to avoid the existing annotation popup appearing, and
then select the text and create the new annotation. Again by default the last annotation type
to have been used will be displayed; change this to the new annotation type. When a piece
of text has more than one annotation associated with it, on mouseover all the annotations
will be displayed. Selecting one of them will bring up the relevant annotation popup.

To search and annotate the document automatically, use the search and annotate function
as shown in figure 3.11:

Figure 3.10: Adding an Organization annotation to the Default Annotation Set

• Create and/or select an annotation to be used as a model to annotate.

• Open the panel at the bottom of the annotation editor window.

• Change the expression to search if necessary.

• Use the [First] button or Enter key to select the first expression to annotate.

• Use the [Annotate] button if the selection is correct, otherwise the [Next] button. After
a few cycles of [Annotate] and [Next], use the [Ann. all next] button.

Note that after using the [First] button you can move the caret in the document and use the
[Next] button to avoid continuing the search from the beginning of the document. The [?]
button at the end of the search text field will help you to build powerful regular expressions
to search.
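
The [Next] and [Ann. all next] steps amount to finding regular expression matches from the caret position onwards and turning each match into an annotation span. The sketch below illustrates that idea with `java.util.regex`; it is not the annotation editor's code, the class `SearchAnnotateDemo` and method `matchesFrom` are hypothetical, and it is an assumption that the expressions accepted by the search field follow the same syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of "search from the caret, then annotate all"; not GATE code. */
final class SearchAnnotateDemo {

    /**
     * Returns the {start, end} offsets of every match of regex that begins at or
     * after caret. The caret must lie between 0 and text.length() inclusive.
     */
    static List<int[]> matchesFrom(String text, String regex, int caret) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        boolean found = m.find(caret);          // first match at or after the caret
        while (found) {
            spans.add(new int[] {m.start(), m.end()});
            found = m.find();                   // continue after the previous match
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "EPSRC funds GATE. EPSRC grants supported GATE 2.";
        // Searching from the start of the document finds both mentions.
        for (int[] span : matchesFrom(text, "EPSRC", 0)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
        // Moving the caret past the first mention skips it.
        System.out.println(matchesFrom(text, "EPSRC", 1).size());
    }
}
```

Each returned span could then be given the type and features of the model annotation, which is what the [Annotate] and [Ann. all next] buttons do interactively.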

Figure 3.11: Search and Annotate Function of the Annotation Editor.

3.4.6 Schema-Driven Editing


Annotation schemas allow annotation types and features to be pre-specified, so that during
manual annotation, the relevant options appear on the drop-down lists in the annotation
editor. You can see some example annotation schemas in Section 5.4.1. Annotation schemas
provide a means to define types of annotations in GATE Developer. Basically this means
that GATE Developer `knows about' annotations defined in a schema. Annotation schemas
are supported by the `Annotation schema' language resource, which is one of the default LR
types (along with corpus and document) available in GATE without the need to load any
plugins.

To load an annotation schema into GATE Developer, right-click on `Language Resources'
in the resources pane. Select `New' then `Annotation schema'. A popup box will appear
in which you can browse to your annotation schema XML file. A default set of annotation
schemas for common annotation types including Person, Organization and Location is
provided in the ANNIE plugin, and can be loaded by creating an Annotation schema LR
from the file plugins/ANNIE/resources/schema/ANNIE-Schemas.xml in the GATE
distribution. You can also define your own schemas to tell GATE Developer about other kinds of
annotations you frequently use. Each schema file can define only one annotation type, but
you can have a master file which includes others, in order to load a group of schemas in one
operation. The ANNIE schemas provide an example of this technique.

By default GATE Developer will allow you to create any annotations in a document, whether
or not there is a schema to describe them. An alternative annotation editor component is
available which constrains the available annotation types and features much more tightly,
based on the annotation schemas that are currently loaded. This is particularly useful when
annotating large quantities of data or for use by less skilled users.

To use this, you must load the Schema_Annotation_Editor plugin. With this plugin loaded,
the annotation editor will only offer the annotation types permitted by the currently loaded
set of schemas, and when you select an annotation type only the features permitted by the
schema are available to edit¹. Where a feature is declared as having an enumerated type the
available enumeration values are presented as an array of buttons, making it easy to select
the required value quickly.

3.4.7 Printing Text with Annotations


We suggest that you use your browser to print a document, as GATE does not currently
provide a printing facility.

First save your document by right-clicking on the document in the left resources tree and
choosing `Save Preserving Format'. You will get an XML file with all the highlighted
annotations as XML tags plus the `Original markups' annotation set.

It's possible that the output will not have an XML header and footer because the document
was created from a plain text document. In that case you can use the XHTML example
below.

Then add a stylesheet processing instruction at the beginning of the XML file, the second
line in the following minimalist XHTML document:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/css" href="gate.css"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Content of the document</p>
...
</body>
</html>

And create a file `gate.css' in the same directory:

BODY, body { margin: 2em } /* or any other first level tag */
P, p { display: block } /* or any other paragraph tag */
/* ANNIE tags but you can use whatever tags you want */
/* be careful that XML tags are case sensitive */
Date { background-color: rgb(230, 150, 150) }
FirstPerson { background-color: rgb(150, 230, 150) }
Identifier { background-color: rgb(150, 150, 230) }
JobTitle { background-color: rgb(150, 230, 230) }
Location { background-color: rgb(230, 150, 230) }
Money { background-color: rgb(230, 230, 150) }
Organization { background-color: rgb(230, 200, 200) }
Percent { background-color: rgb(200, 230, 200) }
Person { background-color: rgb(200, 200, 230) }
Title { background-color: rgb(200, 230, 230) }
Unknown { background-color: rgb(230, 200, 230) }
Etc { background-color: rgb(230, 230, 200) }
/* The next block is an example for having a small tag
with the name of the annotation type after each annotation */
Date:after {
  content: "Date";
  font-size: 50%;
  vertical-align: sub;
  color: rgb(100, 100, 100);
}

¹ Existing features take precedence over the schema: features already present, e.g. those created
by previously-run processing resources, are not editable but are not modified or removed by the editor.

Finally open the XML file in your browser and print it.

Note that overlapping annotations cannot be expressed correctly with inline XML tags and
thus won't be displayed correctly.
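
If you want to know in advance whether a document is affected, you can check its annotation spans for this kind of overlap. The sketch below is a plain Java illustration, not part of GATE; `CrossingSpansDemo` and `crosses` are hypothetical names. It assumes that only partially overlapping (crossing) spans are a problem, since well-formed XML allows one element to be nested entirely inside another.

```java
/** Hypothetical check for annotation spans that cannot be nested as XML elements. */
final class CrossingSpansDemo {

    /**
     * Returns true if spans A and B share some text but neither contains the other,
     * so they could not be written as properly nested inline tags.
     */
    static boolean crosses(int startA, int endA, int startB, int endB) {
        boolean overlap = startA < endB && startB < endA;
        boolean aContainsB = startA <= startB && endB <= endA;
        boolean bContainsA = startB <= startA && endA <= endB;
        return overlap && !aContainsB && !bContainsA;
    }

    public static void main(String[] args) {
        System.out.println(crosses(0, 10, 5, 15)); // partial overlap
        System.out.println(crosses(0, 10, 2, 5));  // B nested inside A
        System.out.println(crosses(0, 5, 5, 10));  // adjacent, no shared text
    }
}
```

Only the first pair in `main` is a crossing overlap; the nested and adjacent pairs can be serialised as inline tags without difficulty.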

3.5 Using CREOLE Plugins


In GATE, processing resources are used to automatically create and manipulate annotations
on documents. We will talk about processing resources in the next section. However, we
must first introduce CREOLE plugins. In most cases, in order to use a particular processing
resource (and certain language resources) you must first load the CREOLE plugin that
contains it. This section talks about using CREOLE plugins. Then, in Section 3.7, we will
talk about creating and using processing resources.

The definitions of CREOLE resources (e.g. processing resources such as taggers and parsers,
see Chapter 4) are stored in CREOLE directories (directories containing an XML file
describing the resources, the Java archive with the compiled executable code and whatever
libraries are required by the resources).

Plugins can be in one or more of the following states in relation to GATE:

known plugins are those plugins that the system knows about. These include all the plugins
in the plugins directory of the GATE installation and those installed in the user's
own plugin directory (the so-called installed plugins) as well as all the plugins that were
manually loaded from the user interface.

loaded plugins are the plugins currently loaded in the system. All CREOLE resource types
from the loaded plugins are available for use. All known plugins can easily be loaded
and unloaded using the user interface.

auto-loadable plugins are the plugins that the system loads automatically during
initialisation; this list can be configured via the load.plugin.path system property.

As hinted at above, plugins can be loaded from numerous sources:

core plugins are distributed with GATE and are found in the plugins directory of the
installation, although the default location can be modified using the gate.plugins.home
system property.

user plugins are plugins that have been installed by the user into their personal plugins
folder. The location of this folder can be set either through the configuration tab of
the CREOLE manager interface or via the gate.user.plugins system property.

local plugins are those plugins located on disk but which aren't in either the core plugins
or user plugin folder.

remote plugins are plugins which are loaded via http from a remote machine.

The CREOLE plugins can be managed through the graphical user interface which can be
activated by selecting `Manage CREOLE Plugins' from the `File' menu. This will bring up
a window listing all the known plugins. For each plugin there are two check-boxes: one
labelled `Load Now', which will load the plugin, and the other labelled `Load Always', which
will add the plugin to the list of auto-loadable plugins. A `Delete' button is also provided,
which will remove the plugin from the list of known plugins. This operation does not delete
the actual plugin directory. Installed plugins are found automatically when GATE is started;
if an installed plugin is deleted from the list, it will re-appear next time GATE is launched.

If you select a plugin, you will see in the pane on the right the list of resources that plugin
contains. For example, in figure 3.12, the `ANNIE' plugin is selected, and you can see that
it contains 17 resources. If you wish to use a particular resource you will have to ascertain
which plugin contains it. This list can be useful for that. Alternatively, the GATE website
provides a directory of plugins and their processing resources.

Having loaded the plugins you need, the resources they define will be available for use.
Typically, to the GATE Developer user, this means that they will appear on the `New' menu when
you right-click on `Processing Resources' in the resources pane, although some special plugins
have different effects; for example, the Schema_Annotation_Editor (see Section 3.4.6).

Figure 3.12: Plugin Management Console

3.6 Installing and updating CREOLE Plugins

While GATE is distributed with a number of core plugins (see Part III) there are many
more plugins developed and made available by other GATE users. Some of these additional
plugins can easily be installed into your local copy of GATE through the CREOLE plugin
manager.

Plugin developers can offer their plugins by maintaining a plugin repository. The address of
a plugin repository can then be added to your GATE installation through the configuration
tab of the plugin manager. For example, in the following screenshot you can see that two
plugin repositories have been added, although only one is currently enabled.

References to a number of plugin repositories are provided within the GATE distribution,
although they are initially disabled². Once a plugin repository is enabled the plugins which
can be installed are listed on the `Available' tab.

Installing new plugins is simply a case of checking the box and clicking `Apply All'. Note
that plugins are installed into the user plugins directory, which must have been correctly
configured before you can try installing new plugins.

Once a plugin is installed it will appear in the list of `Installed Plugins' and can be loaded
in the same way as any other CREOLE plugin (see Section 3.7). If a new version of a plugin
you have installed becomes available the new version will be offered as an update. These
updates can be installed in the same way as a new plugin.

Figure 3.13: Installing New CREOLE Plugins Through The Manager

² Currently three plugin repositories are listed in the main distribution. To have your repository included
in the list send an e-mail with the address to the GATE developers mailing list.

3.7 Loading and Using Processing Resources

This section describes how to load and run CREOLE resources not present in ANNIE. To load
ANNIE, see Section 3.8.3. For technical descriptions of these resources, see the appropriate
chapter in Part III (e.g. Chapter 23). First ensure that the necessary plugins have been
loaded (see Section 3.5). If the resource you require does not appear in the list of Processing
Resources, then you probably do not have the necessary plugin loaded. Processing resources
are loaded by selecting them from the set of Processing Resources: right click on Processing
Resources or select `New Processing Resource' from the File menu.

For example, use the Plugin Management Console to load the `Tools' plugin. When you right
click on `Processing Resources' in the resources pane and select `New' you have the option
to create any of the processing resources that plugin provides. You may choose to create a
`GATE Morphological Analyser', with the default parameters. Having done this, an instance
of the GATE Morphological Analyser appears under `Processing Resources'. This processing
resource, or PR, is now available to use. Double-clicking on it in the resources pane reveals
its initialisation parameters, see figure 3.14.

Figure 3.14: GATE Morphological Analyser Initialisation Parameters

This processing resource is now available to be added to applications. It must be added to
an application before it can be applied to documents. You may create as many of a
particular processing resource as you wish, for example with different initialisation parameters.
Section 3.8 talks about creating and running applications.

See also the movie for loading processing resources.



3.8 Creating and Running an Application

Once all the resources you need have been loaded, an application can be created from them,
and run on your corpus. Right click on `Applications' and select `New' and then either
`Corpus Pipeline' or `Pipeline'. A pipeline application can only be run over a single document,
while a corpus pipeline can be run over a whole corpus.

To build the pipeline, double click on it, and select the resources needed to run the application
(you may not necessarily wish to use all those which have been loaded).

Transfer the necessary components from the set of `loaded components' displayed on the left
hand side of the main window to the set of `selected components' on the right, by selecting
each component and clicking on the left and right arrows, or by double-clicking on each
component.

Ensure that the components selected are listed in the correct order for processing (starting
from the top). If not, select a component and move it up or down the list using the up/down
arrows at the left side of the pane.

Ensure that any parameters necessary are set for each processing resource (by clicking on
the resource from the list of selected resources and checking the relevant parameters from
the pane below). For example, if you wish to use annotation sets other than the Default one,
these must be defined for each processing resource.

Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-
down menu beside the `corpus' box. If a pipeline is used, the document must be selected for
each processing resource used.

Finally, click on `Run' to run the application on the document or corpus.

See also the movie for loading and running processing resources.

For how to use the conditional versions of the pipelines see Section 3.8.2, and for saving and
restoring the configuration of an application see Section 3.9.3.

3.8.1 Running an Application on a Datastore


To avoid loading all your documents at the same time you can run an application on a
datastore corpus.

To do this you need to load your datastore, see section 3.9.2, and to load the corpus from
the datastore by double clicking on it in the datastore viewer.

Then, in the application viewer, you need to select this corpus in the drop down list of
corpora.

When you run the application on the datastore corpus, each document will be loaded,
processed, saved and then unloaded, so at any time only one document from the
datastore corpus is loaded. This prevents memory shortages but is also a little slower than
if all your documents were already loaded.

The processed documents are automatically saved back to the datastore so you may want
to use a copy of the datastore to experiment.

Be careful: any documents from the datastore corpus that are already loaded
before you run the application will be neither unloaded nor saved. To save such a
document you have to right click on it in the resources tree view and save it to the datastore.

3.8.2 Running PRs Conditionally on Document Features


The `Conditional Pipeline' and `Conditional Corpus Pipeline' application types are condi-
tional versions of the pipelines mentioned in Section 3.8 and allow processing resources to
be run or not according to the value of a feature on the document. In terms of graphical
interface, the only addition brought by the conditional versions of the applications is a box
situated underneath the lists of available and selected resources which allows the user to
choose whether the currently selected processing resource will run always, never or only on
the documents that have a particular value for a named feature.

If the Yes option is selected then the corresponding resource will be run on all the documents
processed by the application as in the case of non-conditional applications. If the No option
is selected then the corresponding resource will never be run; the application will simply
ignore its presence. This option can be used to temporarily and quickly disable an application
component, for debugging purposes for example.

The If value of feature option permits running specific application components conditionally
on document features. When selected, this option enables two text input fields that are used
to enter the name of a feature and the value of that feature for which the corresponding
processing resource will be run. When a conditional application is run over a document, for
each component that has an associated condition, the value of the named feature is checked
on the document and the component will only be used if the value entered by the user
matches the one contained in the document features.
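
The three options can be pictured as a small decision function. The following Java sketch
only illustrates the behaviour described above and is not GATE's own code: the class
RunCondition, its Mode constants and the shouldRun method are hypothetical names, and
comparing the feature value as a string is an assumption.

```java
import java.util.Map;
import java.util.Objects;

/** Models the Yes / No / If value of feature options (hypothetical names). */
final class RunCondition {
    enum Mode { YES, NO, IF_FEATURE_VALUE }

    private final Mode mode;
    private final String featureName;   // only used for IF_FEATURE_VALUE
    private final String featureValue;  // only used for IF_FEATURE_VALUE

    RunCondition(Mode mode, String featureName, String featureValue) {
        this.mode = mode;
        this.featureName = featureName;
        this.featureValue = featureValue;
    }

    /** Decides whether a component should run on a document with these features. */
    boolean shouldRun(Map<String, ?> documentFeatures) {
        switch (mode) {
            case YES:
                return true;
            case NO:
                return false;
            default:
                Object actual = documentFeatures.get(featureName);
                // Assumption: values are compared as text, since the user types the
                // expected value into a text field.
                return actual != null && Objects.equals(actual.toString(), featureValue);
        }
    }

    public static void main(String[] args) {
        RunCondition onlyEnglish = new RunCondition(Mode.IF_FEATURE_VALUE, "lang", "en");
        System.out.println(onlyEnglish.shouldRun(Map.of("lang", "en")));
        System.out.println(onlyEnglish.shouldRun(Map.of("lang", "fr")));
    }
}
```

In this model a component whose condition is NO is skipped for every document, which
corresponds to temporarily disabling it while debugging.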

At first sight the conditional behaviour available with these controllers may seem limited, but
in fact it is very powerful when used in conjunction with JAPE grammars (see chapter 8).
Complex conditions can be encoded in JAPE rules which set the appropriate feature values
on the document for use by the conditional controllers. Alternatively, the Groovy plugin
provides a scriptable controller (see section 7.17.3) in which the execution strategy is defined
by a Groovy script, allowing much richer conditional behaviour to be encoded directly in the
controller's configuration.

3.8.3 Doing Information Extraction with ANNIE


This section describes how to load and run ANNIE (see Chapter 6) from GATE Developer.
ANNIE is a good place to start because it provides a complete information extraction
application that you can run on any corpus. You can then view the effects.

From the File menu, select `Load ANNIE System'. To run it in its default state, choose
`with Defaults'. This will automatically load all the ANNIE resources, and create a corpus
pipeline called ANNIE with the correct resources selected in the right order, and the default
input and output annotation sets.

If `without Defaults' is selected, the same processing resources will be loaded, but a popup
window will appear for each resource, which enables the user to specify a name, location
and other parameters for the resource. This is exactly the same procedure as for loading a
processing resource individually, the difference being that the system automatically selects
those resources contained within ANNIE. When the resources have been loaded, a corpus
pipeline called ANNIE will be created as before.

The next step is to add a corpus (see Section 3.3), and select this corpus from the drop-down
corpus menu in the Serial Application editor. Finally, run the application, either by clicking
on `Run' in the Serial Application editor or by right clicking on the application name in the
resources pane and selecting `Run'. (Many people prefer to switch to the messages tab, then
run their application by right-clicking on it in the resources pane, because then it is possible
to monitor any messages that appear whilst the application is running.)

To view the results, double click on one of the documents contained in the processed corpus in
the left hand tree view. No annotation sets nor annotations will be shown until annotations
are selected in the annotation sets; the `Default' set is indicated only with an unlabelled
right-arrowhead which must be selected in order to make visible the available annotations.
Open the default annotation set and select some of the annotations to see what the ANNIE
application has done.

See also the movie for loading and running ANNIE.

3.8.4 Modifying ANNIE


You will find the ANNIE resources in gate/plugins/ANNIE/resources. Simply locate the
existing resources you want to modify, make a copy with a new name, edit them, and load
the new resources into GATE as new Processing Resources (see Section 3.7).

3.9 Saving Applications and Language Resources


In this section, we will describe how applications and language resources can be saved for
use outside of GATE and for use with GATE at a later time. Section 3.9.1 talks about
saving documents to file. Section 3.9.2 outlines how to use datastores. Section 3.9.3 talks
about saving application states (resource parameter states), and Section 3.9.4 talks about
exporting applications together with referenced files and resources to a ZIP file.

3.9.1 Saving Documents to File


There are three main ways to save annotated documents:

1. preserving the original markup, with optional added annotations;

2. in GATE's own XML serialisation format (including all the annotations on the docu-
ment);

3. by writing your own dump algorithm as a processing resource.

This section describes how to use the first two options.

Both types of data export are available in the popup menu triggered by right-clicking on a
document in the resources tree (see Section 3.1): type 1 is called `Save Preserving Format'
and type 2 is called `Save as XML'. In addition, all documents in a corpus can be saved as
individual XML files into a directory by right-clicking on the corpus in the resources tree
and choosing the option `Save as XML'.

Selecting the save as XML option leads to a file open dialogue; give the name of the file you
want to create, and the whole document and all its data will be exported to that file. If you
later create a document from that file, the state will be restored. (Note: because GATE's
annotation model is richer than that of XML, and because our XML dump implementation
sometimes cuts corners (see footnote 3), the state may not be identical after restoration. If
your intention is to store the state for later use, use a DataStore instead.)

The `Save Preserving Format' option also leads to a file dialogue; give a name and the data
you require will be dumped into the file. The action can be used for documents that were
created from files using the XML or HTML format. It will save all the original tags as well
as the document annotations that are currently displayed in the `Annotations List' view.
This option is useful for selectively saving only some annotation types.
3 Gory details: features of annotations and documents in GATE may be virtually any Java object;
serialising arbitrary binary data to XML is not simple; instead we serialise them as strings, and therefore
they will be re-loaded as strings.

The annotations are saved as normal document tags, using the annotation type as the tag
name. If the advanced option `Include annotation features for Save Preserving Format' (see
Section 2.5) is set to true, then the annotation features will also be saved as tag attributes.

Using this operation for GATE documents that were not created from an HTML or XML
file results in a plain text file, with in-line tags for the saved annotations.

Note that GATE's model of annotation allows graph structures, which are difficult to represent
in XML (XML is a tree-structured representation format). During the dump process,
annotations that cross each other in ways that cannot be represented in legal XML will be
discarded, and a warning message printed.
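
To make `cross each other' concrete: two annotations cross when their spans overlap but
neither contains the other, so they cannot be written as properly nested tags. The plain Java
sketch below is an illustration under stated assumptions, not GATE's actual dump algorithm;
the Span class, the method names and the policy of keeping whichever annotation comes
first in the list are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CrossingSpans {
    /** A half-open character span [start, end) tagged with an annotation type. */
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** True if the spans overlap without one being nested inside the other. */
    static boolean crosses(Span a, Span b) {
        return (a.start < b.start && b.start < a.end && a.end < b.end)
            || (b.start < a.start && a.start < b.end && b.end < a.end);
    }

    /** Keeps spans in list order, dropping any span that crosses one already kept. */
    static List<Span> keepNestable(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : spans) {
            boolean nestable = true;
            for (Span existing : kept) {
                if (crosses(candidate, existing)) {
                    nestable = false;
                    break;
                }
            }
            if (nestable) {
                kept.add(candidate);
            } else {
                System.err.println("Warning: discarding crossing annotation " + candidate.type);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> kept = keepNestable(List.of(
            new Span("Person", 0, 10),     // kept
            new Span("Location", 5, 15),   // crosses Person, so it is dropped
            new Span("Token", 2, 5)));     // nested inside Person, so it is kept
        for (Span s : kept) {
            System.out.println(s.type);
        }
    }
}
```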

3.9.2 Saving and Restoring LRs in Datastores


Where corpora are large, the memory available may not be sufficient to have all documents
open simultaneously. The datastore functionality provides the option to save documents to
disk and open them only one at a time for processing. This means that much larger corpora
can be used. A datastore can also be useful for saving documents in an efficient and lossless
way.

To save a text in a datastore, a new datastore must first be created if one does not already
exist. Create a datastore by right clicking on Datastore in the left hand pane and selecting the
option `Create Datastore'. Select the datastore type you wish to use. Create a directory to
be used as the datastore (note that the datastore is a directory and not a file).

You can either save a whole corpus to the datastore (in which case the structure of the corpus
will be preserved) or you can save individual documents. The recommended method is to
save the whole corpus. To save a corpus, right click on the corpus name and select the `Save
to...' option (giving the name of the datastore created earlier). To save individual documents
to the datastore, right click on each document name and follow the same procedure.

To load a document from a datastore, do not try to load it as a language resource. Instead,
open the datastore by right clicking on Datastore in the left hand pane, select `Open Datas-
tore' and choose the datastore to open. The datastore tree will appear in the main window.
Double click on a corpus or document in this tree to open it. To save a corpus and document
back to the same datastore, simply select the `Save' option.

See also the movie for creating a datastore and the movie for loading corpus and documents
from a datastore.

3.9.3 Saving Application States to a File


Resources, and applications that are made up of them, are created based on the settings of
their parameters (see Section 3.7). It is possible to save the data used to create an application
to a file and re-load it later. To save the application to a file, right click on it in the resources
tree and select `Save application state', which will give you a file creation dialogue. Choose
a file name that ends in gapp, as this file dialog and the one for loading application states
display all files which have a name ending in gapp. A common convention is to use .gapp
as a file extension.

To restore the application later, select `Restore application from file' from the `File' menu.

Note that the data that is saved represents how to recreate an application, not the resources
that make up the application itself. So, for example, if your application has a resource that
initialises itself from some file (e.g. a grammar, a document) then that file must still exist
when you restore the application.

If you do not want to save the corpus configuration associated with the application,
select `<none>' in the corpus list of the application before saving the application.

The file resulting from saving the application state contains the values of the initialisation
and runtime parameters for all the processing resources contained by the stored application
as well as the values of the initialisation parameters for all the language resources referenced
by those processing resources. Note that if you reference a document that has been created
with an empty URL and empty string content parameter and subsequently been manually
edited to add content, that content will not be saved. In order for document content to be
preserved, load the document from a URL, specify the content via the string content
parameter, or use a document from a datastore.

For the parameters of type URL (which are typically used to select external resources such as
grammars or rules files) a transformation is applied so that the paths are stored relative
to either the location of the saved application state file, the GATE home directory, or a
special user resources home directory, according to the following rules:

• If the resource is inside the GATE home directory, but the application state file
is saved to a location outside the GATE home directory, the path is stored relative to
the GATE home directory and the path marker $gatehome$ is used.

• If the property gate.user.resourceshome is set to the path of a directory and the
resource is located inside that directory but the state file is saved to a location outside
of this directory, the path is stored relative to this directory and the path marker
$resourceshome$ is used.

• In all other situations, the path is stored relative to the location of the application
state file and the path marker $relpath$ is used.
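
The three rules can be read as a short decision procedure. The Java sketch below is a
minimal illustration of that procedure under stated assumptions, not the code GATE uses:
the class PathMarkers and the method encode are hypothetical, all paths are assumed to be
absolute, and the exact text GATE writes into the state file may differ from the strings
built here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathMarkers {
    /**
     * Chooses a path marker for a resource following the three rules above.
     * gateHome and resourcesHome may be null when they are not known or not set.
     */
    static String encode(Path resource, Path stateFile, Path gateHome, Path resourcesHome) {
        // Rule 1: resource inside GATE home, state file saved outside it.
        if (gateHome != null && resource.startsWith(gateHome)
                && !stateFile.startsWith(gateHome)) {
            return "$gatehome$" + slashes(gateHome.relativize(resource));
        }
        // Rule 2: resource inside the user resources home, state file saved outside it.
        if (resourcesHome != null && resource.startsWith(resourcesHome)
                && !stateFile.startsWith(resourcesHome)) {
            return "$resourceshome$" + slashes(resourcesHome.relativize(resource));
        }
        // Rule 3: everything else is stored relative to the state file's directory.
        return "$relpath$" + slashes(stateFile.getParent().relativize(resource));
    }

    /** Writes the path with forward slashes (an assumption made for readability). */
    private static String slashes(Path path) {
        return path.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        Path state = Paths.get("/home/me/project/app.gapp");
        Path gateHome = Paths.get("/opt/gate");
        System.out.println(encode(
            Paths.get("/opt/gate/plugins/ANNIE/resources/NE/main.jape"), state, gateHome, null));
        System.out.println(encode(
            Paths.get("/home/me/project/grammars/my.jape"), state, gateHome, null));
    }
}
```

Because the third rule yields a path relative to the state file, moving the application
file together with its resource files keeps the stored reference valid.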

In this way, all resource files that are part of GATE are always used correctly, no matter
where GATE is installed. Resource files which are not part of GATE and are used by an
application do not need to be in the same location as when the application was initially
created, but rather in the same location relative to the location of the application file. In
addition, if your application uses a project-specific location for global resources or
project-specific plugins, the Java property gate.user.resourceshome can be set to this location
and the application will be stored so that this location will also always be used correctly,
no matter where the application state file is copied to. To set the resources home directory,
the -rh location option of the Linux script gate.sh that starts GATE can be used. The
combination of these features allows the creation and deployment of portable applications
by keeping the application file and the resource files used by the application together.

Note that GATE resources that are used by your application may change between different
releases of GATE. If your application depends on a specific version of resources that come
with the GATE distribution, consider copying them to your project directory in order to
ensure the correct version is used. The option `Export for GATE Cloud' (see Section 3.9.4)
supports this by creating a ZIP file that contains a copy of all GATE resources used by the
application, including GATE plugins.

When an application is restored from an application state file, GATE uses the keyword
$relpath$ for paths relative to the location of the gapp file, $gatehome$ for paths relative
to the GATE home installation directory and $resourceshome$ for paths relative to the
location to which the property gate.user.resourceshome is set. Other keywords exist that
can be useful in some cases; to use them you will need to edit the gapp file manually. These
keywords are $gateplugins$ and $sysprop:...$. The latter refers to any Java system property,
for example $sysprop:user.home$.
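
As a rough picture of how such keywords might be expanded when a state file is read, the
following Java sketch turns a stored string back into a path. It is an assumption-based
illustration, not GATE's loader: the class MarkerResolver and its methods are hypothetical,
the base directories are passed in explicitly, and $gateplugins$ is deliberately left out
because this section does not say what it is relative to.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class MarkerResolver {
    private static final String SYSPROP = "$sysprop:";

    /** Expands a leading path keyword; strings without a known keyword are returned as paths. */
    static Path resolve(String stored, Path stateFile, Path gateHome, Path resourcesHome) {
        if (stored.startsWith("$relpath$")) {
            return under(stateFile.getParent(), stored.substring("$relpath$".length()));
        }
        if (stored.startsWith("$gatehome$")) {
            return under(gateHome, stored.substring("$gatehome$".length()));
        }
        if (stored.startsWith("$resourceshome$")) {
            return under(resourcesHome, stored.substring("$resourceshome$".length()));
        }
        if (stored.startsWith(SYSPROP)) {
            int close = stored.indexOf('$', SYSPROP.length());
            if (close < 0) {
                throw new IllegalArgumentException("Unterminated keyword: " + stored);
            }
            String value = System.getProperty(stored.substring(SYSPROP.length(), close));
            if (value == null) {
                throw new IllegalArgumentException("System property not set: " + stored);
            }
            return under(Paths.get(value), stored.substring(close + 1));
        }
        return Paths.get(stored).normalize();
    }

    private static Path under(Path base, String rest) {
        if (base == null) {
            throw new IllegalArgumentException("No base directory available for: " + rest);
        }
        // Strip leading separators so that resolve() treats the remainder as relative.
        String relative = rest.replaceFirst("^[/\\\\]+", "");
        return base.resolve(relative).normalize();
    }

    public static void main(String[] args) {
        Path state = Paths.get("/home/me/project/app.gapp");
        System.out.println(resolve("$relpath$grammars/my.jape", state, null, null));
        System.out.println(resolve("$sysprop:user.home$/models/x.bin", state, null, null));
    }
}
```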

If you want to save your application along with all the resources it requires you can use the
`Export for GATE Cloud' option (see Section 3.9.4).

See also the movie for saving and restoring applications.

3.9.4 Saving an Application with its Resources (e.g. GATE Cloud)


When you save an application using the `Save application state' option (see Section 3.9.3),
the saved file contains references to the plugins that were loaded when the application was
saved, and to any resource files required by the application. To be able to reload the file,
these plugins and other dependencies must exist at the same locations (relative to the saved
state file). While this is fine for saving and loading applications on a single machine it means
that if you want to package your application to run it elsewhere (e.g. deploy it to GATE
Cloud) then you need to be careful to include all the resource files and plugins at the right
locations in your package. The `Export for GATE Cloud' option on the right-click menu for
an application helps to automate this process.

When you export an application in this way, GATE Developer produces a ZIP file containing
the saved application state (in the same format as `Save application state'). Any plugins and
resource files that the application refers to are also included in the ZIP file, and the relative
paths in the saved state are rewritten to point to the correct locations within the package.
The resulting package is therefore self-contained and can be copied to another machine and
unpacked there, or passed to GATE Cloud for deployment.

As well as selecting the location where you want to save the package, the `Export for GATE
Cloud' option will also prompt you to select the annotation sets that your application uses
for input and output. For example, if your application makes use of the unpacked XML
markup in source documents and creates annotations in the default set then you would
select `Original markups' as an input set and the `<Default annotation set>' as an output
set. GATE Developer will try to make an educated guess at the correct sets but you should
check and amend the lists as necessary.

There are a few important points to note about the export process:

• The complete contents of all the plugin directories that are loaded when you perform
the export will be included in the resulting package. Use the plugin manager to unload
any plugins your application is not using before you export it.

• If your application refers to a resource file in a directory that is not under one of the
loaded plugins, the entire contents of this directory will be recursively included in the
package. If you have a number of unrelated resources in a single directory (e.g. many
sets of large gazetteer lists) you may want to separate them into separate directories
so that only the relevant ones are included in the package.

• The packager only knows about resources that your application refers to directly in its
parameters. For example, if your application includes a multi-phase JAPE grammar
the packager will only consider the main grammar file, not any of its sub-phases. If
the sub-phases are not contained in the same directory as the main grammar you may
find they are not included. If indirect references of this kind are all to files under the
same directory as the `master' file it will work OK.

If you require more flexibility than this option provides you should read Section E.2, which
describes the underlying Ant task that the exporter uses.

3.10 Keyboard Shortcuts


You can use various keyboard shortcuts for common tasks in GATE Developer. These are
listed in this section.

General (Section 3.1):



• F1 Display a help page for the selected component

• Alt+F4 Exit the application without confirmation

• Tab Put the focus on the next component or frame

• Shift+Tab Put the focus on the previous component or frame

• F6 Put the focus on the next frame

• Shift+F6 Put the focus on the previous frame

• Alt+F Show the File menu

• Alt+O Show the Options menu

• Alt+T Show the Tools menu

• Alt+H Show the Help menu

• F10 Show the first menu

Resources tree (Section 3.1):

• Enter Show the selected resources

• Ctrl+H Hide the selected resource

• Ctrl+Shift+H Hide all the resources

• F2 Rename the selected resource

• Ctrl+F4 Close the selected resource

Document editor (Section 3.2):

• Ctrl+F Show the search dialog for the document

• Ctrl+E Edit the annotation at the caret position

• Ctrl+S Save the document in a file

• F3 Show/Hide the annotation sets

• Shift+F3 Show the annotation sets with preselection

• F4 Show/Hide the annotations list

• F5 Show/Hide the coreference editor



• F7 Show/Hide the text

Annotation editor (Section 3.4):

• Right/Left Grow/Shrink the annotation span at its start

• Alt+Right/Alt+Left Grow/Shrink the annotation span at its end

• +Shift/+Ctrl+Shift Use a span increment of 5/10 characters

• Alt+Delete Delete the currently edited annotation

Annic/Lucene datastore (Chapter 9):

• Alt+Enter Search the expression in the datastore

• Alt+Backspace Delete the search expression

• Alt+Right Display the next page of results

• Alt+Left Display the row manager

• Alt+E Export the results to a file

Annic/Lucene query text eld (Chapter 9):

• Ctrl+Enter Insert a new line

• Enter Search the expression

• Alt+Top Select the previous result

• Alt+Bottom Select the next result

3.11 Miscellaneous

3.11.1 Stopping GATE from Restoring Developer Sessions/Options


GATE can remember Developer options and the state of the resource tree when it exits. The
options are saved by default; the session state is not saved by default. This default behaviour
can be changed from the `Advanced' tab of the `Configuration' choice on the `Options' menu.

If a problem occurs and the saved data prevents GATE Developer from starting, you can
fix this by deleting the configuration and session data files. These are stored in your home
directory, and are called gate.xml and gate.session or .gate.xml and .gate.session
depending on platform. On Windows your home is:

95, 98, NT: Windows Directory/profiles/username

2000, XP: Windows Drive/Documents and Settings/username

3.11.2 Working with Unicode


When you create a document from a URL pointing to textual data in GATE, you have to
tell the system what character encoding the text is stored in. By default, GATE will set this
parameter to be the empty string. This tells Java to use the default encoding for whatever
platform it is running on at the time; for example, on Western versions of Windows this will
be ISO-8859-1, and on Eastern ones ISO-8859-9. On Linux systems, the default encoding is
influenced by the LANG environment variable, e.g. when this variable is set to en_US.utf-8
the default encoding used will be UTF-8. When GATE is started using the bin/ant run
command or (on Linux) through the gate.sh script or a link to it, you can change the default
encoding used by GATE to UTF-8 by adding -Drun.file.encoding=utf-8 as a parameter.

A popular way to store Unicode documents is in UTF-8, which is a superset of ASCII (but
can still store all Unicode data); if you get an error message about document I/O during
reading, try setting the encoding to UTF-8, or some other locally popular encoding.
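
The effect of choosing the wrong encoding can be seen with a few lines of standard Java,
independently of GATE. In this sketch the class EncodingDemo and the method decode are
hypothetical names; treating an empty encoding name as `use the platform default' mirrors
the behaviour described above but is implemented here only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Decodes bytes with the named encoding; an empty name means the platform default. */
    static String decode(byte[] data, String encodingName) {
        Charset charset = (encodingName == null || encodingName.isEmpty())
            ? Charset.defaultCharset()
            : Charset.forName(encodingName);
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is the word "cafe" with an acute accent on the final letter.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Correct encoding: four characters, accent intact.
        System.out.println(decode(utf8Bytes, "UTF-8").length());

        // Wrong encoding: the two UTF-8 bytes of the accented letter are read as
        // two separate ISO-8859-1 characters, so the text becomes five characters.
        System.out.println(decode(utf8Bytes, "ISO-8859-1").length());
    }
}
```

Pure ASCII text decodes identically under both encodings, which is why problems of this
kind often only show up on documents containing accented or non-Latin characters.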
Chapter 4

CREOLE: the GATE Component Model

. . . Noam Chomsky's answer in Secrets, Lies and Democracy (David Barsamian
1994; Odonian) to `What do you think about the Internet?'
`I think that there are good things about it, but there are also aspects of it that
concern and worry me. This is an intuitive response - I can't prove it - but my
feeling is that, since people aren't Martians or robots, direct face-to-face contact
is an extremely important part of human life. It helps develop self-understanding
and the growth of a healthy personality.
`You just have a different relationship to somebody when you're looking at them
than you do when you're punching away at a keyboard and some symbols come
back. I suspect that extending that form of abstract and remote relationship,
instead of direct, personal contact, is going to have unpleasant effects on what
people are like. It will diminish their humanity, I think.'
Chomsky, quoted at http://philip.greenspun.com/wtr/dead-trees/53015.

The GATE architecture is based on components: reusable chunks of software with
well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is
based on an analysis of previous work on infrastructure for LE, and of the typical types
of software entities found in the fields of NLP and CL (see in particular chapters 4-6 of
[Cunningham 00]). Our research suggested that a profitable way to support LE software
development was an architecture that breaks down such programs into components of various
types. Because LE practice varies very widely (it is, after all, predominantly a research field),
the architecture must avoid restricting the sorts of components that developers can plug into
the infrastructure. The GATE framework accomplishes this via an adapted version of the
Java Beans component framework from Sun, as described in section 4.2.

GATE components may be implemented by a variety of programming languages and
databases, but in each case they are represented to the system as a Java class. This class
may do nothing other than call the underlying program, or provide an access layer to a
database; on the other hand it may implement the whole component.

GATE components are one of three types:

• LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;

• ProcessingResources (PRs) represent entities that are primarily algorithmic, such as
parsers, generators or ngram modellers;

• VisualResources (VRs) represent visualisation and editing components that participate
in GUIs.

The distinction between language resources and processing resources is explored more fully
in section D.1.1. Collectively, the set of resources integrated with GATE is known as
CREOLE: a Collection of REusable Objects for Language Engineering.
In the rest of this chapter:

• Section 4.3 describes the lifecycle of GATE components;

• Section 4.4 describes how Processing Resources can be grouped into applications;

• Section 4.5 describes the relationship between Language Resources and their
datastores;

• Section 4.6 summarises GATE's set of built-in components;

• Section 4.7 describes how configuration data for Resource types is supplied to GATE.

4.1 The Web and CREOLE


GATE allows resource implementations and Language Resource persistent data to be
distributed over the Web, and uses Java annotations and XML for configuration of resources
(and GATE itself).

Resource implementations are grouped together as `plugins', stored at a URL (when the
resources are in the local file system this can be a file:/ URL). When a plugin is loaded,
GATE looks for a configuration file called creole.xml relative to the plugin URL and
uses the contents of this file to determine what resources this plugin declares and where to
find the classes that implement the resource types (typically these classes are stored in a JAR
file in the plugin directory). Configuration data for the resources may be stored directly in the
creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in
either case GATE retrieves this configuration information and adds the resource definitions
to the CREOLE register. When a user requests an instantiation of a resource, GATE creates
an instance of the resource class in the virtual machine.

Language resource data can be stored in binary serialised form in the local file system.

4.2 The GATE Framework


We can think of the GATE framework as a backplane into which users can plug CREOLE
components. The user gives the system a list of URLs to search when it starts up, and
components at those locations are loaded by the system.

The backplane performs these functions:

ˆ component discovery, bootstrapping, loading and reloading;

ˆ management and visualisation of native data structures for common information types;

ˆ generalised data storage and process execution.

A set of components plus the framework is a deployment unit which can be embedded in
another application.

At their most basic, all GATE resources are Java Beans, the Java platform's model of
software components. Beans are simply Java classes that obey certain interface conventions:

• beans must have no-argument constructors.

• beans have properties, defined by pairs of methods named by the convention setProp
and getProp.

GATE uses Java Beans conventions to construct and configure resources at runtime, and
defines interfaces that different component types must implement.
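
The two conventions are what make it possible to build and configure an object knowing
only its class and the names of its properties. The Java sketch below illustrates the idea
with plain reflection; it is not how GATE itself instantiates resources, and the classes
BeanDemo and Tagger, the property encoding and the method create are all hypothetical.

```java
import java.lang.reflect.Method;

final class BeanDemo {
    /** A hypothetical bean: a no-argument constructor plus a getEncoding/setEncoding pair. */
    public static class Tagger {
        private String encoding = "UTF-8";

        public Tagger() {
        }

        public String getEncoding() {
            return encoding;
        }

        public void setEncoding(String encoding) {
            this.encoding = encoding;
        }
    }

    /** Creates a bean through its no-argument constructor and sets one property by name. */
    static <T> T create(Class<T> type, String property, Object value) {
        // The setProp convention: "encoding" maps to the method name "setEncoding".
        String setter = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Method method : type.getMethods()) {
                if (method.getName().equals(setter) && method.getParameterCount() == 1) {
                    method.invoke(bean, value);
                    return bean;
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create " + type.getName(), e);
        }
        throw new IllegalArgumentException("No setter " + setter + " on " + type.getName());
    }

    public static void main(String[] args) {
        Tagger tagger = create(Tagger.class, "encoding", "ISO-8859-1");
        System.out.println(tagger.getEncoding());
    }
}
```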

4.3 The Lifecycle of a CREOLE Resource


CREOLE resources exhibit a variety of forms depending on the perspective they are viewed
from. Their implementation is as a Java class plus an XML metadata file living at the
same URL. When using GATE Developer, resources can be loaded and viewed via the
resources tree (left pane) and the `create resource' mechanism. When programming with
GATE Embedded, they are Java objects that are obtained by making calls to GATE's
Factory class. These various incarnations are the phases of a CREOLE resource's `lifecycle'.
Depending on what sort of task you are using GATE for, you may use resources in any or
all of these phases. For example, you may only be interested in getting a graphical view of
what GATE's ANNIE Information Extraction system (see Chapter 6) does; in this case you
will use GATE Developer to load the ANNIE resources, and load a document, and create
an ANNIE application and run it on the document. If, on the other hand, you want to
create your own resources, or modify the Java code of an existing resource (as opposed to
just modifying its grammar, for example), you will need to deal with all the lifecycle phases.

The various phases may be summarised as:

Creating a new resource from scratch (bootstrapping). To create the binary image
of a resource (a Java class in a JAR file), and the XML file that describes the resource
to GATE, you need to create the appropriate .java file(s), compile them and package
them as a .jar. GATE provides a bootstrap tool to start this process (see Section
7.12). Alternatively you can simply copy code from an existing resource.

Instantiating a resource in GATE Embedded. To create a resource in your own Java
code, use GATE's Factory class (this takes care of parameterising the resource, restoring
it from a database where appropriate, etc.). Section 7.2 describes how to do this.

Loading a resource into GATE Developer. To load a resource into GATE Developer,
use the various `New ... resource' options from the File menu and elsewhere. See
Section 3.1.

Resource configuration and implementation. GATE's bootstrap tool will create an
empty resource that does nothing. In order to achieve the behaviour you require,
you'll need to change the configuration of the resource (by editing the creole.xml file)
and/or change the Java code that implements the resource. See section 4.7.

4.4 Processing Resources and Applications

PRs can be combined into applications. Applications model a control strategy for the exe-
cution of PRs. In GATE, applications are called `controllers' accordingly.

Currently only sequential, or pipeline, execution is supported. There are two main types of
pipeline:

Simple pipelines simply group a set of PRs together in order and execute them in turn.
The implementing class is called SerialController.

Corpus pipelines are specific to LanguageAnalysers, that is, PRs that are applied to documents
and corpora. A corpus pipeline opens each document in the corpus in turn, sets that
document as a runtime parameter on each PR, runs all the PRs on that document, then
closes it. The implementing class is called SerialAnalyserController.

Conditional versions of these controllers are also available. These allow processing resources
to be run conditionally on document features. See Section 3.8.2 for how to use these. If more
flexibility is required, the Groovy plugin provides a scriptable controller (see Section 7.17.3)
whose execution strategy is specified using the Groovy programming language.

Controllers are themselves PRs (in particular a simple pipeline is a standard PR and a
corpus pipeline is a LanguageAnalyser), so one pipeline can be nested in another. This is
particularly useful with conditional controllers to group together a set of PRs that can all
be turned on or off as a group.

There is also a real-time version of the corpus pipeline. When creating such a controller,
a timeout parameter needs to be set which determines the maximum amount of time (in
milliseconds) allowed for the processing of a document. Documents that take longer to
process are simply ignored, and execution moves on to the next document once the timeout
interval has elapsed.
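
The timeout behaviour can be modelled with plain java.util.concurrent classes. The following is only a sketch of the idea and makes no claim about how GATE implements it: the class TimeoutPipelineSketch, the method runWithTimeout and the use of one Callable per document are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model of a per-document timeout; not part of the GATE API. */
class TimeoutPipelineSketch {

  /**
   * Runs each task (standing in for "process one document") in turn and
   * returns the results of those that finished within timeoutMillis.
   * Tasks that overrun are cancelled and skipped.
   */
  static <T> List<T> runWithTimeout(List<Callable<T>> tasks, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    List<T> completed = new ArrayList<>();
    try {
      for (Callable<T> task : tasks) {
        Future<T> future = executor.submit(task);
        try {
          completed.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
          // Too slow: interrupt the worker and move on to the next document.
          future.cancel(true);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
          throw new IllegalStateException("Processing failed", e.getCause());
        }
      }
    } finally {
      executor.shutdownNow();
    }
    return completed;
  }
}
```

Note that this sketch relies on the slow task responding to interruption; a task that ignores interrupts would keep the single worker thread busy and delay the documents queued behind it.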

All controllers have special handling for processing resources that implement the interface
gate.creole.ControllerAwarePR. This interface provides methods that are called by the
controller at the start and end of the whole application's execution; for a corpus pipeline,
this means before any document has been processed and after all documents in the corpus
have been processed, which is useful for PRs that need to share data structures across the
whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation
for ControllerAwarePR.
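
To make the start and end callbacks concrete, here is a small self-contained model of a corpus pipeline that notifies interested PRs before the first document and after the last. It deliberately does not use the GATE classes: the interfaces Analyser and LifecycleAware and their method names are hypothetical stand-ins, and the real method names and signatures should be taken from the ControllerAwarePR JavaDoc.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of corpus-level callbacks; not the GATE API. */
class CorpusPipelineSketch {

  /** Stand-in for a PR that is applied to one document at a time. */
  interface Analyser {
    void process(String document);
  }

  /** Stand-in for the start/end callbacks described above. */
  interface LifecycleAware {
    void executionStarted();
    void executionFinished();
  }

  /** Counts documents and publishes the total once the corpus is done. */
  static class DocumentCounter implements Analyser, LifecycleAware {
    private int seen;
    final List<String> log = new ArrayList<>();

    public void executionStarted() { seen = 0; log.add("started"); }
    public void process(String document) { seen++; }
    public void executionFinished() { log.add("finished:" + seen); }
  }

  /** Runs every analyser on every document, in corpus order. */
  static void run(List<? extends Analyser> analysers, List<String> corpus) {
    for (Analyser a : analysers) {
      if (a instanceof LifecycleAware) ((LifecycleAware) a).executionStarted();
    }
    for (String document : corpus) {
      for (Analyser a : analysers) {
        a.process(document);
      }
    }
    for (Analyser a : analysers) {
      if (a instanceof LifecycleAware) ((LifecycleAware) a).executionFinished();
    }
  }
}
```

The counter accumulates state across all documents and only reports it in the end callback, which is the kind of corpus-wide aggregation the interface is intended to support.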

4.5 Language Resources and Datastores

Language Resources can be stored in Datastores. Datastores are an abstract model of disk-
based persistence, which can be implemented by various types of storage mechanism. Here
are the types implemented:

Serial Datastores are based on Java's serialisation system, and store data directly into
files and directories.

Lucene Datastores provide a full-featured annotation indexing and retrieval system. It is
provided as part of an extension of the Serial Datastores. See Section 9 for more details.

4.6 Built-in CREOLE Resources



GATE comes with various built-in components:

• Language Resources modelling Documents and Corpora, and various types of Annotation
  Schema (see Chapter 5).
• Processing Resources that are part of the ANNIE system (see Chapter 6).
• Gazetteers (see Chapter 13).
• Ontologies (see Chapter 14).
• Machine Learning resources (see Chapter 19).
• Alignment tools (see Chapter 20).
• Parsers and taggers (see Chapter 18).
• Other miscellaneous resources (see Chapter 23).

4.7 CREOLE Resource Configuration

This section describes how to supply GATE with the configuration data it needs about a
resource, such as what its parameters are, how to display it if it has a visualisation, etc.
Several GATE resources can be grouped into a single plugin, which is a directory containing
an XML configuration file called creole.xml. Configuration data for the plugin's resources
can be given in the creole.xml file or directly in the Java source file using Java annotations.

A creole.xml file has a root element <CREOLE-DIRECTORY>. Traditionally this element
didn't contain any attributes, but with the introduction of installable plugins (see Sections
3.6 and 12.3.5) the following attributes can now be provided.

ID: A string that uniquely identifies this plugin. This should be formatted in a similar way
to fully specified Java class names. The class portion (i.e. everything after the last dot)
will be used as the name of the plugin in the GUI. For example, the obsolete RASP
plugin could have the ID gate.obsolete.RASP. Note that unlike Java class names the
plugin name can contain spaces for the purpose of presentation.

VERSION: The version number of the plugin. For example, 3, 3.1, 3.11, 3.12-SNAPSHOT
etc.

DESCRIPTION: A short description of the resources provided by the plugin. Note that there
is really only space for a single sentence in the GUI.

HELPURL: The URL of a web page giving more details about this plugin.

GATE-MIN: The earliest version of GATE that this plugin is compatible with. This should
be in the same format as the version shown in the GATE titlebar, i.e. 6.1 or 6.2-
SNAPSHOT. Do not include the build number information.

GATE-MAX: The last version of GATE which the plugin is compatible with. This should be
in the same format as GATE-MIN.

Currently all these attributes are optional, unless you intend to make the plugin available
through a plugin repository (see Section 12.3.5), in which case the ID and VERSION attributes
must be provided. We would, however, suggest that developers start to add these attributes
to all the plugins they develop, as the information is likely to be used in more places
throughout GATE Developer and GATE Embedded in the future.

Child elements of the <CREOLE-DIRECTORY> depend on the configuration style. The following
three sections discuss the different styles: all-XML, all-annotations and a mixture of the
two.

4.7.1 Configuration with XML

To configure your resources in the creole.xml file, the <CREOLE-DIRECTORY> element should
contain one <RESOURCE> element for each resource type in the plugin. The <RESOURCE> elements
may optionally be contained within a <CREOLE> element (to allow a single creole.xml
file to be built up by concatenating multiple separate files). For example:

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>Minipar Wrapper</NAME>
      <JAR>MiniparWrapper.jar</JAR>
      <CLASS>minipar.Minipar</CLASS>
      <COMMENT>MiniPar is a shallow parser. It determines the
        dependency relationships between the words of a sentence.</COMMENT>
      <HELPURL>http://gate.ac.uk/cgi-bin/userguide/sec:parsers:minipar</HELPURL>
      <PARAMETER NAME="document"
                 RUNTIME="true"
                 COMMENT="document to process">gate.Document</PARAMETER>
      <PARAMETER NAME="miniparDataDir"
                 RUNTIME="true"
                 COMMENT="location of the Minipar data directory">
        java.net.URL
      </PARAMETER>
      <PARAMETER NAME="miniparBinary"
                 RUNTIME="true"
                 COMMENT="Name of the Minipar command file">
        java.net.URL
      </PARAMETER>
      <PARAMETER NAME="annotationInputSetName"
                 RUNTIME="true"
                 OPTIONAL="true"
                 COMMENT="Name of the input Source">
        java.lang.String
      </PARAMETER>
      <PARAMETER NAME="annotationOutputSetName"
                 RUNTIME="true"
                 OPTIONAL="true"
                 COMMENT="Name of the output AnnotationSetName">
        java.lang.String
      </PARAMETER>
      <PARAMETER NAME="annotationTypeName"
                 RUNTIME="false"
                 DEFAULT="DepTreeNode"
                 COMMENT="Annotations to store with this type">
        java.lang.String
      </PARAMETER>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

Basic Resource-Level Data

Each resource must give a name, a Java class and the JAR file that it can be loaded from.
The above example is taken from the Parser_Minipar plugin, and defines a single resource
with a number of parameters.

The full list of valid elements under <RESOURCE> is as follows:

NAME the name of the resource, as it will appear in the `New' menu in GATE Developer.
If omitted, defaults to the bare name of the resource class (without a package name).

CLASS the fully qualified name of the Java class that implements this resource.

JAR names JAR files required by this resource (paths are relative to the location of
creole.xml). Typically this will be the JAR file containing the class named by the
<CLASS> element, but additional <JAR> elements can be used to name third-party JAR
files that the resource depends on.

COMMENT a descriptive comment about the resource, which will appear as the tooltip
when hovering over an instance of this resource in the resources tree in GATE Devel-
oper. If omitted, no comment is used.

HELPURL a URL to a help document on the web for this resource. It is used in the help
browser inside GATE Developer.

INTERFACE the interface type implemented by this resource, for example new types of
document would specify <INTERFACE>gate.Document</INTERFACE>.

ICON the icon used to represent this resource in GATE Developer. This is a path inside
the plugin's JAR file, for example <ICON>/some/package/icon.png</ICON>. If the
path specified does not start with a forward slash, it is assumed to name an icon from
the GATE default set, which is located in gate.jar at gate/resources/img. If no icon
is specified, a generic language resource or processing resource icon (as appropriate) is
used.

PRIVATE if present, this resource type is hidden in the GATE Developer GUI, i.e. it is
not shown in the `New' menus. This is useful for resource types that are intended to be
created internally by other resources, or for resources that have parameters of a type
that cannot be set in the GUI. <PRIVATE/> resources can still be created in Java code
using the Factory.

AUTOINSTANCE (and HIDDEN-AUTOINSTANCE) tells GATE to automatically
create instances of this resource when the plugin is loaded. Any number of auto
instances may be defined; GATE will create them all. Each <AUTOINSTANCE> element
may optionally contain <PARAM NAME="..." VALUE="..." /> elements giving parameter
values to use when creating the instance. Any parameters not specified explicitly
will take their default values. Use <HIDDEN-AUTOINSTANCE> if you want the auto
instances not to show up in GATE Developer; this is useful for things like document
formats where there should only ever be a single instance in GATE and that instance
should not be deleted.

TOOL if present, this resource type is considered to be a tool. Tools can contribute items
to the Tools menu in GATE Developer.

For visual resources, a <GUI> element should also be provided. This takes a TYPE attribute,
which can have the value LARGE or SMALL. LARGE means that the visual resource is a large
viewer and should appear in the main part of the GATE Developer window on the right
hand side, SMALL means the VR is a small viewer which appears in the space below the
resources tree in the bottom left. The <GUI> element supports the following sub-elements:

RESOURCE_DISPLAYED the type of GATE resource this VR can display. Any re-
source whose type is assignable to this type will be displayed with this viewer, so for
example a VR that can display all types of document would specify gate.Document,
whereas a VR that can only display the default GATE document implementation would
specify gate.corpora.DocumentImpl.

MAIN_VIEWER if present, GATE will consider this VR to be the `most important'
viewer for the given resource type, and will ensure that if several different viewers are
all applicable to this resource, this viewer will be the one that is initially visible.

For annotation viewers, you should specify an <ANNOTATION_TYPE_DISPLAYED> element giving
the annotation type that the viewer can display (e.g. Sentence).

Resource Parameters

Resources may also have parameters of various types. These resources, from the GATE
distribution, illustrate the various types of parameters:

<RESOURCE>
  <NAME>GATE document</NAME>
  <CLASS>gate.corpora.DocumentImpl</CLASS>
  <INTERFACE>gate.Document</INTERFACE>
  <COMMENT>GATE transient document</COMMENT>
  <OR>
    <PARAMETER NAME="sourceUrl"
      SUFFIXES="txt;text;xml;xhtm;xhtml;html;htm;sgml;sgm;mail;email;eml;rtf"
      COMMENT="Source URL">java.net.URL</PARAMETER>
    <PARAMETER NAME="stringContent"
      COMMENT="The content of the document">java.lang.String</PARAMETER>
  </OR>
  <PARAMETER
    COMMENT="Should the document read the original markup"
    NAME="markupAware" DEFAULT="true">java.lang.Boolean</PARAMETER>
  <PARAMETER NAME="encoding" OPTIONAL="true"
    COMMENT="Encoding" DEFAULT="">java.lang.String</PARAMETER>
  <PARAMETER NAME="sourceUrlStartOffset"
    COMMENT="Start offset for documents based on ranges"
    OPTIONAL="true">java.lang.Long</PARAMETER>
  <PARAMETER NAME="sourceUrlEndOffset"
    COMMENT="End offset for documents based on ranges"
    OPTIONAL="true">java.lang.Long</PARAMETER>
  <PARAMETER NAME="preserveOriginalContent"
    COMMENT="Should the document preserve the original content"
    DEFAULT="false">java.lang.Boolean</PARAMETER>
  <PARAMETER NAME="collectRepositioningInfo"
    COMMENT="Should the document collect repositioning information"
    DEFAULT="false">java.lang.Boolean</PARAMETER>
  <ICON>lr.gif</ICON>
</RESOURCE>

<RESOURCE>
  <NAME>Document Reset PR</NAME>
  <CLASS>gate.creole.annotdelete.AnnotationDeletePR</CLASS>
  <COMMENT>Document cleaner</COMMENT>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="annotationTypes" RUNTIME="true"
    OPTIONAL="true">java.util.ArrayList</PARAMETER>
</RESOURCE>

Parameters may be optional, and may have default values. They may also have comments
describing their purpose, which are displayed by GATE Developer during interactive
parameter setting.

Some PR parameters are set at execution time (RUNTIME) and others at initialisation time.
For example, a document is supplied to a language analyser at execution time, whereas a
grammar may be supplied to a language analyser at initialisation time.

The <PARAMETER> tag takes the following attributes:

NAME: name of the JavaBean property that the parameter refers to, i.e. for a parameter
named `someParam' the class must have setSomeParam and getSomeParam methods
(see footnote 1).

DEFAULT: default value (see below).

RUNTIME: doesn't need setting at initialisation time, but must be set before calling
execute(). Only meaningful for PRs.

OPTIONAL: not required.

COMMENT: for display purposes.

ITEM_CLASS_NAME: (only applies to parameters whose type is java.util.Collection
or a type that implements or extends this) this specifies the type of elements the
collection contains, so GATE can use the right type when parameters are set. If omitted,
GATE will pass in the elements as Strings.

SUFFIXES: (only applies to parameters of type java.net.URL) a semicolon-separated list
of file suffixes that this parameter typically accepts, used as a filter in the file chooser
provided by GATE Developer to select a local file as the parameter value.

It is possible for two or more parameters to be mutually exclusive (i.e. a user must specify
one or the other but not both). In this case the <PARAMETER> elements should be grouped
together under an <OR> element.
Footnote 1: The JavaBeans spec allows is instead of get for properties of the primitive
type boolean, but GATE does not support parameters with primitive types. Parameters of
type java.lang.Boolean (the wrapper class) are permitted, but these have get accessors
anyway.

The type of the parameter is specified as the text of the <PARAMETER> element, and the type
supplied must match the return type of the parameter's get method. Any reference type
(class, interface or enum) may be used as the parameter type, including other resource types,
in which case GATE Developer will offer a list of the loaded instances of that resource as
options for the parameter value. Primitive types (char, boolean, ...) are not supported;
instead you should use the corresponding wrapper type (java.lang.Character,
java.lang.Boolean, ...). If the getter returns a parameterized type (e.g. List<Integer>)
you should just specify the raw type (java.util.List) here (see footnote 2).

The DEFAULT string is converted to the appropriate type for the parameter:
java.lang.String parameters use the value directly, primitive wrapper types (e.g.
java.lang.Integer) use their respective valueOf methods, and other built-in Java types
can have defaults specified provided they have a constructor taking a String.
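
The three conversion rules can be illustrated with a few lines of reflection. This is a sketch of the rules as stated in the text, not GATE's own conversion code; the class DefaultValueSketch, the method convert and the explicit list of wrapper types are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of string-to-type conversion; not GATE code. */
class DefaultValueSketch {

  /** Wrapper types that offer a static valueOf(String) method. */
  private static final List<Class<?>> WRAPPERS = Arrays.<Class<?>>asList(
      Boolean.class, Byte.class, Short.class, Integer.class,
      Long.class, Float.class, Double.class);

  /** Converts a default string to the requested parameter type. */
  static Object convert(String value, Class<?> type) {
    try {
      if (type == String.class) {
        return value; // strings are used directly
      }
      if (WRAPPERS.contains(type)) {
        // e.g. Integer.valueOf("42"), invoked reflectively
        return type.getMethod("valueOf", String.class).invoke(null, value);
      }
      // any other type needs a public constructor taking a single String
      return type.getConstructor(String.class).newInstance(value);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException(
          "Cannot convert '" + value + "' to " + type.getName(), e);
    }
  }
}
```

With this sketch, convert("42", Integer.class) yields an Integer and convert("a/b", java.io.File.class) yields a File, because java.io.File has a String constructor.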

The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g.
http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file.
Thus a DEFAULT of `resources/main.jape' in the file file:/opt/MyPlugin/creole.xml
is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.
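
The same resolution rule can be reproduced with the standard library. The sketch below uses java.net.URI rather than java.net.URL purely for convenience; the class and method names are hypothetical and it is not a description of GATE's internal code.

```java
import java.net.URI;

/** Hypothetical illustration of resolving a default against creole.xml. */
class UrlDefaultSketch {

  /**
   * Resolves a default string against the location of creole.xml.
   * Absolute references are returned unchanged; relative ones are
   * resolved against the directory containing creole.xml.
   */
  static URI resolveDefault(String creoleXmlLocation, String defaultValue) {
    return URI.create(creoleXmlLocation).resolve(defaultValue);
  }
}
```

Calling resolveDefault("file:/opt/MyPlugin/creole.xml", "resources/main.jape") gives file:/opt/MyPlugin/resources/main.jape, matching the example in the text.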

For Collection-valued parameters multiple values may be specified, separated by
semicolons, e.g. `foo;bar;baz'; if the parameter's type is an interface (Collection or one of
its sub-interfaces, e.g. List) a suitable concrete class (e.g. ArrayList, HashSet) will be
chosen automatically for the default value.

For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g.
`kind=word;orth=upperInitial'. For enum-valued parameters the default string is taken
as the name of the enum constant to use. Finally, if no DEFAULT attribute is specified, the
default value is null.
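
The structured default formats are simple enough to parse by hand. The following sketch shows one way to read them; it uses a plain Map in place of gate.FeatureMap, and the class StructuredDefaultSketch, its methods and the Mode enum are hypothetical names introduced only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for collection, feature and enum defaults. */
class StructuredDefaultSketch {

  /** Hypothetical enum, used only to demonstrate enum-valued defaults. */
  enum Mode { STRICT, LENIENT }

  /** Splits "foo;bar;baz" into three values; ArrayList is the concrete class. */
  static List<String> parseCollection(String value) {
    return new ArrayList<>(Arrays.asList(value.split(";")));
  }

  /** Splits "kind=word;orth=upperInitial" into name/value pairs. */
  static Map<String, String> parseFeatures(String value) {
    Map<String, String> features = new LinkedHashMap<>();
    for (String pair : value.split(";")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Missing '=' in: " + pair);
      }
      features.put(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return features;
  }

  /** The default string names the enum constant to use. */
  static <E extends Enum<E>> E parseEnum(Class<E> type, String value) {
    return Enum.valueOf(type, value);
  }
}
```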

4.7.2 Configuring Resources using Annotations

As an alternative to the XML configuration style, GATE provides Java annotation types to
embed the configuration data directly in the Java source code. @CreoleResource is used to
mark a class as a GATE resource, and parameter information is provided through annotations
on the JavaBean set methods. At runtime these annotations are read and mapped into the
equivalent entries in creole.xml before parsing. The metadata annotation types are all
marked @Documented so the CREOLE configuration data will be visible in the generated
JavaDoc documentation.

For more detailed information, see the JavaDoc documentation for gate.creole.metadata.

Footnote 2: In this particular case, as the type is a collection, you would specify
java.lang.Integer as the ITEM_CLASS_NAME.

To use annotation-driven configuration for a plugin a creole.xml file is still required but it
need only contain the following:

<CREOLE-DIRECTORY>
  <JAR SCAN="true">myPlugin.jar</JAR>
  <JAR>lib/thirdPartyLib.jar</JAR>
</CREOLE-DIRECTORY>

This tells GATE to load myPlugin.jar and scan its contents looking for resource classes
annotated with @CreoleResource. Other JAR files required by the plugin can be specified
using other <JAR> elements without SCAN="true".

In a GATE Embedded application it is possible to register a single @CreoleResource
annotated class without using a creole.xml file by calling

Gate.getCreoleRegister().registerComponent(MyResource.class);

GATE will extract the configuration from the annotations on the class and make it available
for use as if it had been defined in a plugin.

Basic Resource-Level Data

To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in
the gate.creole.metadata package), for example:

import gate.creole.AbstractLanguageAnalyser;
import gate.creole.metadata.*;

@CreoleResource(name = "GATE Tokeniser",
    comment = "Splits text into tokens and spaces")
public class Tokeniser extends AbstractLanguageAnalyser {
  ...

The @CreoleResource annotation provides slots for all the values that can be specified under
<RESOURCE> in creole.xml, except <CLASS> (inferred from the name of the annotated class)
and <JAR> (taken to be the JAR containing the class):

name (String) the name of the resource, as it will appear in the `New' menu in GATE
Developer. If omitted, defaults to the bare name of the resource class (without a
package name). (XML equivalent <NAME>)

comment (String) a descriptive comment about the resource, which will appear as the
tooltip when hovering over an instance of this resource in the resources tree in GATE
Developer. If omitted, no comment is used. (XML equivalent <COMMENT>)

helpURL (String) a URL to a help document on the web for this resource. It is used in
the help browser inside GATE Developer. (XML equivalent <HELPURL>)

isPrivate (boolean) should this resource type be hidden from the GATE Developer GUI, so
it does not appear in the `New' menus? If omitted, defaults to false (i.e. not hidden).
(XML equivalent <PRIVATE/>)

icon (String) the icon to use to represent the resource in GATE Developer. If omitted, a
generic language resource or processing resource icon is used. (XML equivalent <ICON>,
see the description above for details)

interfaceName (String) the interface type implemented by this resource, for example
a new type of document would specify "gate.Document" here. (XML equivalent
<INTERFACE>)

autoInstances (array of @AutoInstance annotations) definitions for any instances of this
resource that should be created automatically when the plugin is loaded. If omitted, no
auto-instances are created by default. (XML equivalent, one or more <AUTOINSTANCE>
and/or <HIDDEN-AUTOINSTANCE> elements, see the description above for details)

tool (boolean) is this resource type a tool? (XML equivalent <TOOL/>)

For visual resources only, the following elements are also available:

guiType (GuiType enum) the type of GUI this resource defines. (XML equivalent
<GUI TYPE="LARGE|SMALL">)

resourceDisplayed (String) the class name of the resource type that this VR displays, e.g.
"gate.Corpus". (XML equivalent <RESOURCE_DISPLAYED>)

mainViewer (boolean) is this VR the `most important' viewer for its displayed resource
type? (XML equivalent <MAIN_VIEWER/>, see above for details)

For annotation viewers, you should specify an annotationTypeDisplayed element giving
the annotation type that the viewer can display (e.g. Sentence).

Resource Parameters

Parameters are declared by placing annotations on their JavaBean set methods. To mark a
setter method as a parameter, use the @CreoleParameter annotation, for example:

@CreoleParameter(comment = "The location of the list of abbreviations")
public void setAbbrListUrl(URL listUrl) {
  ...

GATE will infer the parameter's name from the name of the JavaBean property in the usual
way (i.e. strip off the leading set and convert the following character to lower case, so in
this example the name is abbrListUrl). The parameter name is not taken from the name
of the method parameter. The parameter's type is inferred from the type of the method
parameter (java.net.URL in this case).
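
The naming rule is easy to check with standard reflection. The sketch below applies it to a stand-in class; SetterNameSketch, Splitter and the helper methods are hypothetical, and the code only mirrors the rule as described here rather than GATE's own implementation.

```java
import java.lang.reflect.Method;
import java.net.URL;

/** Hypothetical illustration of deriving a parameter from its setter. */
class SetterNameSketch {

  /** Stand-in resource class with one setter, mirroring the example above. */
  static class Splitter {
    public void setAbbrListUrl(URL listUrl) { /* would store the value */ }
  }

  /** Strips the leading "set" and lower-cases the following character. */
  static String parameterName(Method setter) {
    String property = setter.getName().substring("set".length());
    return Character.toLowerCase(property.charAt(0)) + property.substring(1);
  }

  /** The parameter type is the type of the setter's single argument. */
  static Class<?> parameterType(Method setter) {
    return setter.getParameterTypes()[0];
  }

  /** Finds a public one-argument method with the given name. */
  static Method findSetter(Class<?> type, String methodName) {
    for (Method m : type.getMethods()) {
      if (m.getName().equals(methodName) && m.getParameterCount() == 1) {
        return m;
      }
    }
    throw new IllegalArgumentException("No such setter: " + methodName);
  }
}
```

Note that the argument name listUrl plays no part: only the method name and the argument type are consulted.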

The annotation elements of @CreoleParameter correspond to the attributes of the
<PARAMETER> tag in the XML configuration style:

comment (String) an optional descriptive comment about the parameter. (XML equivalent
COMMENT)

defaultValue (String) the optional default value for this parameter. The value is specified
as a string but is converted to the relevant type by GATE according to the conversions
described in the previous section. Note that relative path default values for URL-valued
parameters are still relative to the location of the creole.xml file, not the annotated
class (see footnote 3). (XML equivalent DEFAULT)

suffixes (String) for URL-valued parameters, a semicolon-separated list of default file
suffixes that this parameter accepts. (XML equivalent SUFFIXES)

collectionElementType (Class) for Collection-valued parameters, the type of the
elements in the collection. This can usually be inferred from the generic type information,
for example public void setIndices(List<Integer> indices), but must be
specified if the set method's parameter has a raw (non-parameterized) type. (XML
equivalent ITEM_CLASS_NAME)

Mutually-exclusive parameters (such as would be grouped in an <OR> in creole.xml) are
handled by adding a disjunction="label" and priority=n to the @CreoleParameter
annotation. All parameters that share the same label are grouped in the same disjunction,
and will be offered in order of priority. The parameter with the smallest priority value will
be the one listed first, and thus the one that is offered initially when creating a resource
of this type in GATE Developer. For example, the following is a simplified extract from
gate.corpora.DocumentImpl:

@CreoleParameter(disjunction = "src", priority = 1)
public void setSourceUrl(URL src) { /* */ }

@CreoleParameter(disjunction = "src", priority = 2)
public void setStringContent(String content) { /* */ }

This declares the parameters stringContent and sourceUrl as mutually-exclusive, and
when creating an instance of this resource in GATE Developer the parameter that will be
shown initially is sourceUrl. To set stringContent instead the user must select it from the
drop-down list. Parameters with the same declared priority value will appear next to each
other in the list, but their relative ordering is not specified. Parameters with no explicit
priority are always listed after those that do specify a priority.

Footnote 3: When registering a class using CreoleRegister.registerComponent the base URL
against which defaults for URL parameters are resolved is not specified. In such a resource
it may be better to use Class.getResource to construct the default URLs if no value is
supplied for the parameter by the user.
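
The ordering rule for mutually-exclusive parameters (ascending priority, with unprioritised parameters last) amounts to a sort with nulls placed at the end. The sketch below demonstrates it on a hypothetical Param holder; none of these names come from GATE, and the extra parameter in the usage note is invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of ordering parameters within a disjunction. */
class DisjunctionOrderSketch {

  /** Hypothetical holder for a parameter name and its optional priority. */
  static final class Param {
    final String name;
    final Integer priority; // null means "no explicit priority"

    Param(String name, Integer priority) {
      this.name = name;
      this.priority = priority;
    }
  }

  /** Returns names ordered by ascending priority, unprioritised ones last. */
  static List<String> order(List<Param> params) {
    List<Param> sorted = new ArrayList<>(params);
    sorted.sort(Comparator.comparing((Param p) -> p.priority,
        Comparator.nullsLast(Comparator.naturalOrder())));
    List<String> names = new ArrayList<>();
    for (Param p : sorted) {
      names.add(p.name);
    }
    return names;
  }
}
```

Given stringContent with priority 2, sourceUrl with priority 1 and a third parameter with no priority, the sketch lists sourceUrl first and the unprioritised parameter last.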

Optional and runtime parameters are marked using extra annotations, for example:

@Optional
@RunTime
@CreoleParameter
public void setAnnotationSetName(String asName) {
  ...

Inheritance

Unlike with pure XML configuration, when using annotations a resource will inherit any
configuration data that was not explicitly specified from annotations on its parent class
and on any interfaces it implements. Specifically, if you do not specify a comment,
interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and
resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class
tree for other @CreoleResource annotations, first on the superclass, its superclass, etc.,
then at any implemented interfaces, and use the first value it finds. This is useful if you are
defining a family of related resources that inherit from a common base class.
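
The lookup order (the class itself, then its superclasses, then implemented interfaces) can be sketched with ordinary reflection. The example uses a hypothetical @Meta annotation in place of @CreoleResource and is only an illustration of the order described here; how GATE treats super-interfaces and ties between interfaces is not covered.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical illustration of inherited annotation values. */
class InheritedCommentSketch {

  /** Hypothetical stand-in for @CreoleResource with a single element. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface Meta {
    String comment() default "";
  }

  @Meta(comment = "from interface")
  interface Analyser { }

  @Meta(comment = "from base class")
  static class Base { }

  @Meta // no comment given, so the value is looked up elsewhere
  static class Derived extends Base implements Analyser { }

  @Meta // no annotated superclass, so the interface supplies the value
  static class Standalone implements Analyser { }

  /** Looks on the class, then up the superclass chain, then on interfaces. */
  static String findComment(Class<?> type) {
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      String found = commentOn(c);
      if (found != null) return found;
    }
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      for (Class<?> iface : c.getInterfaces()) {
        String found = commentOn(iface);
        if (found != null) return found;
      }
    }
    return null;
  }

  /** Returns the comment declared directly on c, or null if none is given. */
  private static String commentOn(Class<?> c) {
    Meta meta = c.getAnnotation(Meta.class);
    return (meta == null || meta.comment().isEmpty()) ? null : meta.comment();
  }
}
```

Here Derived picks up its comment from Base because superclasses are consulted before interfaces, while Standalone falls through to the Analyser interface.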

The resource name and the isPrivate and mainViewer flags are not inherited.

Parameter definitions are inherited in a similar way. This is one of the big advantages of
annotation configuration over pure XML: if one resource class extends another then with
pure XML configuration all the parent class's parameter definitions must be duplicated in
the subclass's creole.xml definition. With annotations, parameters are inherited from the
parent class (and its parent, etc.) as well as from any interfaces implemented. For
example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated
set methods, for the corpus and document parameters. Any @CreoleResource annotated
class that implements LanguageAnalyser, directly or indirectly, will get these parameters
automatically.

Of course, there are some cases where this behaviour is not desirable, for example if a subclass
calculates a value for a superclass parameter rather than having the user set it directly. In
this case you can hide the parameter by overriding the set method in the subclass and using
a marker annotation:

@HiddenCreoleParameter
public void setSomeParam(String someParam) {
  super.setSomeParam(someParam);
}

The overriding method will typically just call the superclass one, as its only purpose is to
provide a place to put the @HiddenCreoleParameter annotation.

Alternatively, you may want to override some of the configuration for a parameter but inherit
the rest from the superclass. Again, this is handled by trivially overriding the set method
and re-annotating it:

// superclass
@CreoleParameter(comment = "Location of the grammar file",
    suffixes = "jape")
public void setGrammarUrl(URL grammarLocation) {
  ...
}

@Optional
@RunTime
@CreoleParameter(comment = "Feature to set on success")
public void setSuccessFeature(String name) {
  ...
}

// subclass

// override the default value, inherit everything else
@CreoleParameter(defaultValue = "resources/defaultGrammar.jape")
public void setGrammarUrl(URL url) {
  super.setGrammarUrl(url);
}

// we want the parameter to be required in the subclass
@Optional(false)
@CreoleParameter
public void setSuccessFeature(String name) {
  super.setSuccessFeature(name);
}

Note that for backwards compatibility, data is only inherited from superclass annotations
if the subclass is itself annotated with @CreoleResource. If the subclass is not annotated
then GATE assumes that all its configuration is contained in creole.xml in the usual way.

4.7.3 Mixing the Configuration Styles

It is possible and often useful to mix and match the XML and annotation-driven
configuration styles. The rule is always that anything specified in the XML takes priority
over the annotations. The following examples show what this allows.

Overriding Configuration for a Third-Party Resource

Suppose you have a plugin from some third party that uses annotation-driven configuration.
You don't have the source code but you would like to override the default value for one of
the parameters of one of the plugin's resources. You can do this in the creole.xml:

<CREOLE-DIRECTORY>
  <JAR SCAN="true">acmePlugin-1.0.jar</JAR>

  <!-- Add the following to override the annotations -->
  <RESOURCE>
    <CLASS>com.acme.plugin.UsefulPR</CLASS>
    <PARAMETER NAME="listUrl"
      DEFAULT="resources/myList.txt">java.net.URL</PARAMETER>
  </RESOURCE>
</CREOLE-DIRECTORY>

The default value for the listUrl parameter in the annotated class will be replaced by your
value.

External AUTOINSTANCEs

For resources like document formats, where there should always be exactly one
instance in GATE at any time, it makes sense to put the auto-instance definitions in the
@CreoleResource annotation. But if the automatically created instances are a convenience
rather than a necessity it may be better to define them in XML so other users can disable
them without re-compiling the class:

<CREOLE-DIRECTORY>
  <JAR SCAN="true">myPlugin.jar</JAR>

  <RESOURCE>
    <CLASS>com.acme.AutoPR</CLASS>
    <AUTOINSTANCE>
      <PARAM NAME="type" VALUE="Sentence" />
    </AUTOINSTANCE>
    <AUTOINSTANCE>
      <PARAM NAME="type" VALUE="Paragraph" />
    </AUTOINSTANCE>
  </RESOURCE>
</CREOLE-DIRECTORY>

Inheriting Parameters

If you would prefer to use XML configuration for your own resources, but would like to benefit
from the parameter inheritance features of the annotation-driven approach, you can write a
normal creole.xml file with all your configuration and just add a blank @CreoleResource
annotation to your class. For example:

package com.acme;
import gate.*;
import gate.creole.metadata.CreoleResource;

@CreoleResource
public class MyPR implements LanguageAnalyser {
  ...
}

<!-- creole.xml -->
<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>My Processing Resource</NAME>
      <CLASS>com.acme.MyPR</CLASS>
      <COMMENT>...</COMMENT>
      <PARAMETER NAME="annotationSetName"
        RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
      <!--
        don't need to declare document and corpus parameters, they
        are inherited from LanguageAnalyser
      -->
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

N.B. Without the @CreoleResource the parameters would not be inherited.

4.7.4 Loading Third-Party Libraries using Apache Ivy


With simple plugins most of the code is contained in a single JAR or relies on just one or
two third-party libraries, which are easy to enumerate within creole.xml so that they
are loaded into GATE when the plugin is loaded. More complex plugins can, however, rely
on a large number of third-party libraries, each of which may have its own dependencies. To
simplify the management of third-party libraries within CREOLE plugins,
Apache Ivy can be used to specify the dependencies.

No attempt is made here to explain the workings of Ivy or the format of the ivy.xml file.
For full details you should refer to the appropriate section of the Ivy manual.

Incorporating an Ivy file within a CREOLE plugin is as simple as referencing it from within
creole.xml. Assuming you have used the default filename of ivy.xml then you can
reference it via a simple <IVY> element.

<CREOLE-DIRECTORY>
  <JAR SCAN="true">myPlugin.jar</JAR>
  <IVY/>
</CREOLE-DIRECTORY>

If you have used an alternative filename then you can specify it as the text content of the
<IVY> element. For example, if the filename is plugin-ivy.xml you would reference it as
follows:

<CREOLE-DIRECTORY>
<JAR SCAN="true">myPlugin.jar</JAR>
<IVY>plugin-ivy.xml</IVY>
</CREOLE-DIRECTORY>

When the plugin is loaded into GATE, Ivy resolves the dependencies, downloads the
appropriate libraries (if necessary) and then makes them available to the plugin. Once the
plugin is loaded it behaves exactly the same as any other plugin.

Note that if you export an application (see Section 3.9.4) then, to ensure that it is
self-contained and usable within any processing environment, the Ivy-based dependencies
are expanded: the libraries are downloaded into the plugin's lib folder, appropriate entries
are added to creole.xml and the <IVY> element is removed.

4.8 Tools: How to Add Utilities to GATE Developer


Visual Resources allow a developer to provide a GUI to interact with a particular resource
type (PR or LR), but sometimes it is useful to provide general utilities for use in the GATE
Developer GUI that are not tied to any specific resource type. Examples include the
annotation diff tool and the Groovy console (provided by the Groovy plugin), both of which
are self-contained tools that display in their own top-level window. To support this, the
CREOLE model has the concept of a tool.

A resource type is marked as a tool by using the <TOOL/> element in its creole.xml
definition, or by setting tool = true if using the @CreoleResource annotation configuration
style. If a resource is declared to be a tool, and written to implement the
gate.gui.ActionsPublisher interface, then whenever an instance of the resource is created
its published actions will be added to the Tools menu in GATE Developer.

Since the published actions of every instance of the resource will be added to the Tools menu,
it is best not to use this mechanism on resource types that can be instantiated by the user.
The tool marker is best used in combination with the private flag (to hide the resource
from the list of available types in the GUI) and one or more hidden autoinstance definitions
to create a limited number of instances of the resource when its defining plugin is loaded.
See the GroovySupport resource in the Groovy plugin for an example of this.

4.8.1 Putting Your Tools in a Sub-Menu


If your plugin provides a number of tools (or a number of actions from the same tool) you
may wish to organise your actions into one or more sub-menus, rather than placing them
all on the single top-level tools menu. To do this, you need to put a special value into the
actions returned by the tool's getActions() method:
action.putValue(GateConstants.MENU_PATH_KEY,
    new String[] { "Acme toolkit", "Statistics" });

The key must be GateConstants.MENU_PATH_KEY and the value must be an array of strings. Each
string in the array represents the name of one level of sub-menus. Thus in the example above
the action would be placed under Tools → Acme toolkit → Statistics. If no MENU_PATH_KEY
value is provided the action will be placed directly on the Tools menu.

4.8.2 Adding Tools To Existing Resource Types


While Visual Resources (VR) allow you to add new features to a particular resource they
have a number of shortcomings. Firstly, not every new feature will require a full VR; often
a new entry on the resource's right-click menu will suffice. More importantly, new features
added via a VR are only available while the VR is open. A Resource Helper is a form of Tool,
as above, which can add new menu options to any existing resource type without requiring
a VR.

A Resource Helper is defined in the same way as a Tool (by setting the tool = true feature
of the @CreoleResource annotation and loaded via an autoinstance definition) but must
also extend the gate.gui.ResourceHelper class. A Resource Helper can then return a
set of actions for a given resource which will be added to its right-click menu. See the
FastInfosetExporter resource in the Format_FastInfoset plugin for an example of how
this works.

A Resource Helper may also make new API calls accessible, so that similar functionality can
be made available to GATE Embedded; see Section 7.20 for more details on how this works.
Chapter 5

Language Resources: Corpora, Documents and Annotations

Sometimes in life you've got to dance like nobody's watching.


...
I think they should introduce `sleeping' to the Olympics. It would be an excellent
field event, in which the `athletes' (for want of a better word) all lay down in
beds, just beyond where the javelins land, and the first one to fall asleep and
not wake up for three hours would win gold. I, for one, would be interested
in seeing what kind of personality would be suited to sleeping in a competitive
environment.
...
Life is a mystery to be lived, not a problem to be solved.
Round Ireland with a Fridge, Tony Hawks, 1998 (pp. 119, 147, 179).

This chapter documents GATE's model of corpora, documents and annotations on docu-
ments. Section 5.1 describes the simple attribute/value data model that corpora, documents
and annotations all share. Section 5.2, Section 5.3 and Section 5.4 describe corpora, doc-
uments and annotations on documents respectively. Section 5.5 describes GATE's support
for diverse document formats, and Section 5.5.2 describes facilities for XML input/output.

5.1 Features: Simple Attribute/Value Data


GATE has a single model for information that describes documents, collections of documents
(corpora), and annotations on documents, based on attribute/value pairs. Attribute names
are strings; values can be any Java object. The API for accessing this feature data is Java's
Map interface (part of the Collections API).
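
Because the feature API is the standard Map interface, feature data can be read and written with ordinary collection calls. The following is a minimal sketch only: it uses a plain java.util.HashMap as a stand-in for a GATE feature map so that it runs without GATE on the classpath, and the class name, method name and feature names are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FeatureSketch {
    // Builds a small feature set: attribute names are strings,
    // values may be any Java object.
    static Map<Object, Object> tokenFeatures() {
        Map<Object, Object> features = new HashMap<>();
        features.put("pos", "NN");                       // a String value
        features.put("length", 4);                       // an Integer value
        features.put("constituents", List.of(1, 2, 3));  // an arbitrary object value
        return features;
    }

    public static void main(String[] args) {
        Map<Object, Object> features = tokenFeatures();
        System.out.println(features.get("pos"));
        System.out.println(features.containsKey("length"));
    }
}
```

Any code written against java.util.Map (iteration over entries, containsKey, remove and so on) applies to feature data in the same way.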

5.2 Corpora: Sets of Documents plus Features


A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents
are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated
with them that stores attribute/value information about the resource. FeatureMaps
are also used to associate arbitrary information with ranges of documents (e.g. pieces of
text) via the annotation model (see below).

Documents have a DocumentContent which is a text at present (future versions may add
support for audiovisual content) and one or more AnnotationSets which are Java Sets.

5.3 Documents: Content plus Annotations plus Features


Documents are modelled as content plus annotations (see Section 5.4) plus features (see
Section 5.1). The content of a document can be any subclass of DocumentContent.

5.4 Annotations: Directed Acyclic Graphs


Annotations are organised in graphs, which are modelled as Java sets of Annotation.
Annotations may be considered as the arcs in the graph; they have a start Node and an end
Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g.
character offsets.
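
The shape of this model can be sketched in a few lines of plain Java. The record types below are simplified stand-ins written for illustration, not GATE's own Annotation and Node interfaces (which offer considerably more), and the sample text and offsets are chosen for this sketch.

```java
import java.util.List;
import java.util.Map;

final class AnnotationSketch {
    // A node points into the document content by character offset.
    record Node(int offset) {}

    // An annotation is an arc between two nodes, with an id, a type and features.
    record Annotation(int id, String type, Node start, Node end,
                      Map<String, Object> features) {
        // Returns the text covered by this annotation in the given content.
        String coveredText(String content) {
            return content.substring(start.offset(), end.offset());
        }
    }

    static final String CONTENT = "Cyndi savored the soup.";

    // A few annotations over CONTENT (illustrative values).
    static List<Annotation> annotations() {
        return List.of(
            new Annotation(1, "token", new Node(0), new Node(5), Map.of("pos", "NP")),
            new Annotation(2, "token", new Node(6), new Node(13), Map.of("pos", "VBD")),
            new Annotation(6, "name", new Node(0), new Node(5), Map.of("name_type", "person")),
            new Annotation(7, "sentence", new Node(0), new Node(23), Map.of()));
    }

    public static void main(String[] args) {
        for (Annotation a : annotations()) {
            System.out.println(a.id() + " " + a.type() + " \""
                + a.coveredText(CONTENT) + "\" " + a.features());
        }
    }
}
```

Because annotations only point at offsets, several annotations (here a token and a name) can cover the same stretch of text without the text itself being changed.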

5.4.1 Annotation Schemas


Annotation schemas provide a means to define types of annotations in GATE.
GATE uses the XML Schema language supported by W3C for these definitions.
When using GATE Developer to create/edit annotations, a component is available
(gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This
component will constrain the data entry process to ensure that only annotations that
correspond to a particular schema are created. (Another component allows unrestricted
annotations to be created.)

Schemas are resources just like other GATE components. Below we give some examples of
such schemas. Section 3.4.6 describes how to create new schemas. Note that each schema
file defines a single annotation type; however, it is possible to use include definitions in a
schema to refer to other schemas in order to load a whole set of schemas as a group. The
default schemas for ANNIE annotation types (defined in resources/schema in the ANNIE
plugin) give an example of this technique.

Date Schema

<?xml version="1.0"?>
<schema
xmlns="http://www.w3.org/2000/10/XMLSchema">
<!-- XSchema definition for Date-->
<element name="Date">
<complexType>
<attribute name="kind" use="optional">
<simpleType>
<restriction base="string">
<enumeration value="date"/>
<enumeration value="time"/>
<enumeration value="dateTime"/>
</restriction>
</simpleType>
</attribute>
</complexType>
</element>
</schema>

Person Schema

<?xml version="1.0"?>
<schema
xmlns="http://www.w3.org/2000/10/XMLSchema">
<!-- XSchema definition for Person-->
<element name="Person" />
</schema>

Address Schema

<?xml version="1.0"?>
<schema
xmlns="http://www.w3.org/2000/10/XMLSchema">
<!-- XSchema definition for Address-->
<element name="Address">
<complexType>
<attribute name="kind" use="optional">
<simpleType>
<restriction base="string">
<enumeration value="email"/>
<enumeration value="url"/>
<enumeration value="phone"/>
<enumeration value="ip"/>
<enumeration value="street"/>
<enumeration value="postcode"/>
<enumeration value="country"/>
<enumeration value="complete"/>
</restriction>
</simpleType>
</attribute>
</complexType>
</element>
</schema>

5.4.2 Examples of Annotated Documents


This section shows some simple examples of annotated documents.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document
upon which GATE version 1 was based. Version 2 has a similar model, although annotations
are now graphs, and instead of multiple spans per annotation each annotation now has a
single start/end node pair. The current model is largely compatible with [Bird & Liberman 99],
and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML
community.

Each example is shown in the form of a table. At the top of the table is the document being
annotated; immediately below the line with the document is a ruler showing the position
(byte offset) of each character (see TIPSTER Architecture Design Document).

Underneath this appear the annotations, one annotation per line. For each annotation is
shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features.
Integers are used as the annotation Ids. The features are shown in the form name = value.

The first example shows a single sentence and the result of three annotation procedures:
tokenization with part-of-speech assignment, name recognition, and sentence boundary
recognition. Each token has a single feature, its part of speech (pos), using the tag set from the
University of Pennsylvania Tree Bank; each name also has a single feature, indicating the
type of name: person, company, etc.

Annotations will typically be organized to describe a hierarchical decomposition of a text.
A simple illustration would be the decomposition of a sentence into tokens. A more complex
case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase
and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of
individual tokens. Such decompositions can be represented by annotations on nested sets
of spans. Both of these are illustrated in the second example, which is an elaboration of
our first example to include parse information. Each non-terminal node in the parse tree is
represented by an annotation of type parse.

Text
Cyndi savored the soup.
0...5...10..15..20
Annotations
Id  Type      Span Start  Span End  Features
1   token     0           5         pos=NP
2   token     6           13        pos=VBD
3   token     14          17        pos=DT
4   token     18          22        pos=NN
5   token     22          23
6   name      0           5         name_type=person
7   sentence  0           23

Table 5.1: Result of annotation on a single sentence

Text
Cyndi savored the soup.
0...5...10..15..20
Annotations
Id  Type      Span Start  Span End  Features
1   token     0           5         pos=NP
2   token     6           13        pos=VBD
3   token     14          17        pos=DT
4   token     18          22        pos=NN
5   token     22          23
6   name      0           5         name_type=person
7   sentence  0           23        constituents=[1],[2],[3],[4],[5]

Table 5.2: Result of annotations including parse information



Text
To: All Barnyard Animals
0...5...10..15..20.
From: Chicken Little
25..30..35..40..
Date: November 10,1194
...50..55..60..65.
Subject: Descending Firmament
.70..75..80..85..90..95
Priority: Urgent
.100.105.110.
The sky is falling. The sky is falling.
....120.125.130.135.140.145.150.
Annotations
Id  Type       Span Start  Span End  Features
1   Addressee  4           24
2   Source     31          45
3   Date       53          69        ddmmyy=101194
4   Subject    78          98
5   Priority   109         115
6   Body       116         155
7   Sentence   116         135
8   Sentence   136         155

Table 5.3: Annotation showing overall document structure

In most cases, the hierarchical structure could be recovered from the spans. However, it may
be desirable to record this structure directly through a constituents feature whose value is
a sequence of annotations representing the immediate constituents of the initial annotation.
For the annotations of type parse, the constituents are either non-terminals (other
annotations in the parse group) or tokens. For the sentence annotation, the constituents feature
points to the constituent tokens. A reference to another annotation is represented in the
table as "[Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where
the value of a feature is a sequence of items, these items are separated by commas. No
special operations are provided in the current architecture for manipulating constituents. At
a less esoteric level, annotations can be used to record the overall structure of documents,
including in particular documents which have structured headers, as is shown in the third
example (Table 5.3).
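
As a rough illustration of recovering structure from spans alone, the sketch below takes the annotations of Table 5.1 and finds the tokens lying inside the sentence span. The Span record and the method names are assumptions made for this example; the containment test finds all nested annotations of a type, which coincides with the immediate constituents only because the token level here is flat.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class ConstituentSketch {
    // Minimal stand-in for an annotation: id, type and start/end offsets.
    record Span(int id, String type, int start, int end) {}

    // The annotations of Table 5.1.
    static List<Span> table51() {
        return List.of(
            new Span(1, "token", 0, 5),
            new Span(2, "token", 6, 13),
            new Span(3, "token", 14, 17),
            new Span(4, "token", 18, 22),
            new Span(5, "token", 22, 23),
            new Span(6, "name", 0, 5),
            new Span(7, "sentence", 0, 23));
    }

    // Ids of all annotations of the given type whose span lies inside
    // the parent's span, in document order.
    static List<Integer> constituents(Span parent, List<Span> all, String type) {
        return all.stream()
            .filter(s -> s.id() != parent.id())
            .filter(s -> s.type().equals(type))
            .filter(s -> s.start() >= parent.start() && s.end() <= parent.end())
            .sorted(Comparator.comparingInt(Span::start))
            .map(Span::id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Span> all = table51();
        Span sentence = all.get(6);
        System.out.println(constituents(sentence, all, "token")); // prints [1, 2, 3, 4, 5]
    }
}
```

For a full parse tree, where constituents nest to several levels, spans alone do not distinguish a child from a grandchild, which is one reason for recording the structure explicitly in a constituents feature.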

If the Addressee, Source, ... annotations are recorded when the document is indexed for
retrieval, it will be possible to perform retrieval selectively on information in particular
fields. Our final example (Table 5.4) involves an annotation which effectively modifies the
document. The current architecture does not make any specific provision for the modification
of the original text. However, some allowance must be made for processes such as spelling
correction. This information will be recorded as a correction feature on token annotations
and possibly on name annotations.

Text
Topster tackles 2 terrorbytes.
0...5...10..15..20..25..
Annotations
Id  Type   Span Start  Span End  Features
1   token  0           7         pos=NP correction=TIPSTER
2   token  8           15        pos=VBZ
3   token  16          17        pos=CD
4   token  18          29        pos=NNS correction=terabytes
5   token  29          30

Table 5.4: Annotation modifying the document

5.4.3 Creating, Viewing and Editing Diverse Annotation Types


Note that annotation types should consist of a single word with no spaces. Otherwise they
may not be recognised by other components such as JAPE transducers, and may create
problems when annotations are saved as inline (`Save Preserving Format' in the context
menu).

To view and edit annotation types, see Section 3.4. To add annotations of a new type, see
Section 3.4.5. To add a new annotation schema, see Section 3.4.6.

5.5 Document Formats


The following document formats are supported by GATE by default:

• Plain Text
• HTML
• SGML
• XML
• RTF
• Email
• PDF (some documents)
• Microsoft Office (some formats)
• OpenOffice (some formats)
• UIMA CAS XML format
• CoNLL/IOB

Additional formats are provided by plugins; you must load the relevant plugin before
attempting to parse these document types:

• Twitter JSON (in the Twitter plugin, see section 17.2)
• DataSift JSON, a common format for social media data from http://datasift.com (in
the Format_DataSift plugin, see section 23.32)
• FastInfoset, a compressed binary encoding of GATE XML (in the Format_FastInfoset
plugin, see section 23.31)
• MediaWiki markup, as used by Wikipedia and many other public wiki sites (in the
Format_MediaWiki plugin, see section 23.30)
• The formats used by PubMed and the Cochrane collaboration for biomedical literature
(in the Format_PubMed plugin, see section 23.29)
• CSV files containing one column of text data and optionally additional columns of
metadata (in the Format_CSV plugin, see section 23.33)

By default GATE will try to identify the type of the document, then strip and convert
any markup into GATE's annotation format. To disable this process, set the markupAware
parameter on the document to false.

When reading a document of one of these types, GATE extracts the text between tags (where
such exist) and creates a GATE annotation filled as follows:

The name of the tag will constitute the annotation's type, all the tag's attributes will
materialize in the annotation's features and the annotation will span over the text covered by the
tag. A few exceptions to this rule apply for the RTF, Email and Plain Text formats, which
will be described later in the input section of these formats.

The text between tags is extracted and appended to the GATE document's content and all
annotations created from tags will be placed into a GATE annotation set named `Original
markups'.

Example:

If the markup is like this:


<aTagName attrib1="value1" attrib2="value2" attrib3="value3"> A
piece of text</aTagName>

then the annotation created by GATE will look like:

annotation.type = "aTagName";
annotation.fm = {attrib1=value1;attrib2=value2;attrib3=value3};
annotation.start = startNode;
annotation.end = endNode;

The startNode and endNode are created from offsets referring to the beginning and the end of
`A piece of text' in the document's content.
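
The unpacking described above can be approximated with the SAX parser that ships with the JDK. This is a simplified sketch under stated assumptions, not GATE's own reader: the class, record and method names are invented for the example, and annotation features are held in a plain Map.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class MarkupUnpackSketch {
    // Stand-in for an annotation created from a tag.
    record Ann(String type, Map<String, String> features, int start, int end) {}
    // An element that has been opened but not yet closed.
    record Open(String type, Map<String, String> features, int start) {}

    static final class Result {
        final StringBuilder content = new StringBuilder();
        final List<Ann> annotations = new ArrayList<>();
    }

    // Strips the markup from xml, collecting text and one annotation per element.
    static Result unpack(String xml) {
        Result result = new Result();
        Deque<Open> open = new ArrayDeque<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                // Tag attributes become features; the start offset is the
                // current length of the extracted text.
                Map<String, String> features = new LinkedHashMap<>();
                for (int i = 0; i < atts.getLength(); i++) {
                    features.put(atts.getQName(i), atts.getValue(i));
                }
                open.push(new Open(qName, features, result.content.length()));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                result.content.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The tag name becomes the annotation type.
                Open o = open.pop();
                result.annotations.add(
                    new Ann(o.type(), o.features(), o.start(), result.content.length()));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = unpack("<aTagName attrib1=\"value1\" attrib2=\"value2\" "
            + "attrib3=\"value3\">A piece of text</aTagName>");
        System.out.println(r.content);
        System.out.println(r.annotations);
    }
}
```

In GATE itself the resulting annotations are placed in the `Original markups' annotation set rather than in a list.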

The documents supported by GATE have to be in one of the encodings accepted by Java.
The most popular is the `UTF-8' encoding, which is also the most storage efficient one for
UNICODE. If, when loading a document in GATE, the encoding parameter is set to `' (the
empty string), then the default encoding of the platform will be used.

5.5.1 Detecting the Right Reader


In order to successfully apply the document creation algorithm described above, GATE
needs to detect the proper reader to use for each document format. If the user knows in
advance what kind of document they are loading then they can specify the MIME type (e.g.
text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type
is not given, GATE attempts to determine the type by other means, taking into consideration
(where possible) the information provided by three sources:

• Document's extension
• The web server's content type
• Magic numbers detection

The first represents the extension of a file (xml, htm, html, txt, sgm, rtf, etc.), the second
represents the HTTP information sent by a web server regarding the content type of the
document being sent by it (text/html, text/xml, etc.), and the third one represents certain
sequences of chars which are ultimately number sequences. GATE is capable of supporting
multimedia documents, if the right reader is added to the framework. Sometimes, multimedia
documents are identified by a signature consisting of a sequence of numbers. Inside GATE
they are called magic numbers. For textual documents, certain char sequences form such
magic numbers. Examples of magic number sequences will be provided in the Input section
of each format supported by GATE.

All those tests are applied to each document read, and after that, a voting mechanism decides
which reader is best to associate with the document. The tests are ranked by priority. The
document's extension test has the highest priority: if the system is in doubt which reader to
choose, then the one associated with the document's extension will be selected. The next
highest priority is given to the web server's content type and the third to the magic numbers
detection. However, any two tests that identify the same MIME type take precedence over
this ordering in deciding the reader that will be used. The web server test is not always
successful as there might be documents that are loaded from a local file system, and the
magic number detection test is not always applicable. In the next paragraphs we will see
how those tests are performed and what the general mechanism behind reader detection is.
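
The voting rule just described can be written down compactly. The following is one reading of the prose, offered as a sketch rather than as GATE's actual code: each argument is the MIME type suggested by one test (or null when that test gave no answer), and the class and method names are assumptions.

```java
final class ReaderVoteSketch {
    // Picks a MIME type from the answers of the three tests.
    // Any argument may be null if the corresponding test was inconclusive.
    static String vote(String byExtension, String byContentType, String byMagic) {
        // Two tests that agree outvote the priority ordering.
        if (byExtension != null
                && (byExtension.equals(byContentType) || byExtension.equals(byMagic))) {
            return byExtension;
        }
        if (byContentType != null && byContentType.equals(byMagic)) {
            return byContentType;
        }
        // No agreement: fall back on extension, then content type, then magic numbers.
        if (byExtension != null) {
            return byExtension;
        }
        if (byContentType != null) {
            return byContentType;
        }
        return byMagic; // may be null: no reader could be identified
    }

    public static void main(String[] args) {
        // The content type and magic number tests agree, so they win.
        System.out.println(vote("text/html", "text/xml", "text/xml"));
        // No agreement, so the extension decides.
        System.out.println(vote("text/html", null, "text/xml"));
    }
}
```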

The method that detects the proper reader is a static one, and it belongs to the
gate.DocumentFormat class. It uses the information stored in the maps filled by the init()
method of each reader. This method comes with three signatures:
static public DocumentFormat getDocumentFormat(gate.Document
    aGateDocument, URL url)

static public DocumentFormat getDocumentFormat(gate.Document
    aGateDocument, String fileSuffix)

static public DocumentFormat getDocumentFormat(gate.Document
    aGateDocument, MimeType mimeType)

The first two methods try to detect the right MimeType for the GATE document, and after
that, they call the third one to return the reader associated with a MimeType. Of course, if an
explicit mimeType parameter was specified, GATE calls the third form of the method directly,
passing the specified type. GATE uses the implementation from `http://jigsaw.w3.org' for
mime types.

The magic numbers test is performed using the information from the magic2mimeTypeMap
map. Each key from this map is searched for in the first bufferSize (the default value is
2048) chars of text. The method that does this is called
runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class.
More details about it can be found in the GATE API documentation.
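
The idea behind the magic numbers test can be sketched as a substring search over the head of the document. This is an illustration only: unlike runMagicNumbers it takes a String rather than an InputStreamReader, the map contents are example entries chosen for the sketch, and the names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MagicNumberSketch {
    // Number of leading chars inspected, mirroring the default of 2048.
    static final int BUFFER_SIZE = 2048;

    // Returns the MIME type of the first magic sequence found in the
    // first BUFFER_SIZE chars of text, or null if none is present.
    static String detect(String text, Map<String, String> magic2mimeType) {
        String head = text.substring(0, Math.min(text.length(), BUFFER_SIZE));
        for (Map.Entry<String, String> entry : magic2mimeType.entrySet()) {
            if (head.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Example entries only; each reader would register its own sequences.
    static Map<String, String> exampleMap() {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("<?xml version=\"1.0\"", "text/xml");
        map.put("<html", "text/html");
        return map;
    }

    public static void main(String[] args) {
        System.out.println(detect("<?xml version=\"1.0\"?><doc/>", exampleMap()));
    }
}
```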

In order to activate a reader to perform the unpacking, the creole definition of a GATE
document defines a parameter called `markupAware' initialized with a default value of true.
This parameter forces GATE to detect a proper reader for the document being read. If no
reader is found, the document's content is loaded and presented to the user, just as in any
text editor (this applies to textual documents).

You can also use Tika format auto-detection by setting the mimeType of a document to
"application/tika". Then the document will be parsed only by Tika.

The next subsections investigate the particularities of each format and describe the file
extensions registered with each document format.

5.5.2 XML
Input

GATE permits the processing of any XML document and offers support for XML namespaces.
It benefits from the power of Apache's Xerces parser and also makes use of Sun's JAXP layer.
Changing the XML parser in GATE can be achieved by simply replacing the value of a Java
system property (`javax.xml.parsers.SAXParserFactory').

GATE will accept any well-formed XML document as input. Although it is able to validate
XML documents against DTDs it does not do so, because the validating procedure
is time consuming and in many cases it issues messages that are annoying for the user.

There is an open problem with the general approach of reading XML, HTML and SGML
documents in GATE. As we previously said, the text covered by tags/elements is appended
to the GATE document content and a GATE annotation refers to this particular span of
text. When appending, in cases such as `end.</P><P>Start' it might happen that the ending
word of the previous annotation is concatenated with the beginning phrase of the annotation
currently being created, resulting in a garbage input for GATE processing resources that
operate at the text surface.

Let's take another example in order to better understand the problem:

<title>This is a title</title><p>This is a paragraph</p><a
href="#link">Here is an useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document's
content will be as follows:

This is a titleThis is a paragraphHere is an useful link

The annotations created will refer to the right parts of the text, but for GATE processing
resources (tokenizer, gazetteer, etc.) which work on this text, this will be a major disaster.
Therefore, in order to prevent this problem from happening, GATE checks whether words are
likely to be joined and, if so, inserts a space between them. So, the text will
look like this after being loaded in GATE Developer:

This is a title This is a paragraph Here is an useful link

There are cases when these words are meant to be joined, but they are rare. This is why it's
an open problem.
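
The exact check GATE performs is not spelled out here, so the following sketch uses a deliberately simple assumed heuristic: when the text of a new element is appended, a space is inserted if neither the end of the existing content nor the start of the new text is whitespace. The class and method names are illustrative.

```java
final class JoinGuardSketch {
    // Appends the text of one element to the content, inserting a space
    // if the two pieces would otherwise run together.
    static void appendChunk(StringBuilder content, String chunk) {
        if (content.length() > 0 && !chunk.isEmpty()
                && !Character.isWhitespace(content.charAt(content.length() - 1))
                && !Character.isWhitespace(chunk.charAt(0))) {
            content.append(' ');
        }
        content.append(chunk);
    }

    // Joins the text of consecutive elements using appendChunk.
    static String join(String... chunks) {
        StringBuilder content = new StringBuilder();
        for (String chunk : chunks) {
            appendChunk(content, chunk);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            join("This is a title", "This is a paragraph", "Here is an useful link"));
    }
}
```

A heuristic of this kind also separates pieces that were meant to be joined, for example when inline markup falls in the middle of a word, which is the rare case mentioned above.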

The extensions associated with the XML reader are:

• xml
• xhtm
• xhtml

The web server content type associated with XML documents is: text/xml.

The magic numbers test searches inside the document for the XML (<?xml version="1.0")
signature. It is also able to detect if the XML document uses the semantics described in the
GATE document format DTD (see 5.5.2 below) or uses other semantics.

Namespace handling
By default, GATE will retain the namespace prefix and namespace URIs of XML elements
when creating annotations and features within the Original markups annotation set. For
example, the element

<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>

will create the following annotation

dc:title(xmlns:dc=http://purl.org/dc/elements/1.1/)

However, as the colon character ':' is a reserved meta-character in JAPE, it is not possible
to write a JAPE rule that will match the dc:title element or its namespace URI.

If you need to match namespace-prefixed elements in the Original markups AS, you can alter
the default namespace deserialization behaviour to remove the namespace prefix and add it
as a feature (along with the namespace URI), by specifying the following attributes in the
<GATECONFIG> element of gate.xml or local configuration file:

• addNamespaceFeatures - set to "true" to deserialize namespace prefix and URI
information as features.
• namespaceURI - the feature name that will hold the namespace URI of the
element, e.g. "namespace"
• namespacePrefix - the feature name that will hold the namespace prefix of
the element, e.g. "prefix"

i.e.

<GATECONFIG
addNamespaceFeatures="true"
namespaceURI="namespace"
namespacePrefix="prefix" />

For example

<dc:title>Document title</dc:title>

would create in Original markups AS (assuming the xmlns:dc URI has been defined in the
document root or parent element)

title(prefix=dc, namespace=http://purl.org/dc/elements/1.1/)
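
The effect of these settings on a single element can be mimicked in plain Java. The helper names below are hypothetical, and the feature names "prefix" and "namespace" are simply the values used in the configuration example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NamespaceFeatureSketch {
    // Local part of a qualified name: "dc:title" gives "title".
    static String localName(String qName) {
        int colon = qName.indexOf(':');
        return colon < 0 ? qName : qName.substring(colon + 1);
    }

    // Features recording the prefix and URI of a qualified name, stored
    // under the configured feature names; empty if the name has no prefix.
    static Map<String, String> namespaceFeatures(String qName, String uri,
                                                 String prefixFeature, String uriFeature) {
        Map<String, String> features = new LinkedHashMap<>();
        int colon = qName.indexOf(':');
        if (colon > 0) {
            features.put(prefixFeature, qName.substring(0, colon));
            features.put(uriFeature, uri);
        }
        return features;
    }

    public static void main(String[] args) {
        String uri = "http://purl.org/dc/elements/1.1/";
        System.out.println(localName("dc:title")
            + namespaceFeatures("dc:title", uri, "prefix", "namespace"));
    }
}
```

With the colon removed from the annotation type, the annotation can be matched in JAPE by type (title) and, if needed, by its prefix or namespace feature.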

If a JAPE rule is written to create a new annotation, e.g.

description(prefix=foo, namespace=http://www.example.org/)

then these would be serialized to

<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>


<foo:description xmlns:foo="http://www.example.org/">...</foo:description>

when using the 'Save preserving document format' XML output option (see 5.5.2 below).

Output

GATE is capable of ensuring persistence for its resources. The types of persistent storage
used for Language Resources are:

• Java serialization;
• XML serialization.

We describe the latter case here.

XML persistence doesn't necessarily preserve all the objects belonging to the annotations,
documents or corpora. Their features can be of all kinds of objects, with various layers of
nesting: for example, lists containing lists containing maps, etc. Serializing these arbitrary
data types in XML is not a simple task; GATE does the best it can, and supports native Java
types such as Integers and Booleans, but where complex data types are used, information
may be lost (the types will be converted into Strings). GATE provides a full serialization of
certain types of features such as collections, strings and numbers. Only collections
containing strings or numbers can be serialized in this way. All other features are serialized
using their string representation and, when read back, they will all be strings instead of
the original objects. Consequences of this might be observed when performing evaluations
(see Chapter 10).

When GATE outputs an XML document it may do so in one of two ways:

• When the original document that was imported into GATE was an XML document,
GATE can dump that document back into XML (possibly with additional markup
added);
• For all document formats, GATE can dump its internal representation of the document
into XML.

In the former case, the XML output will be close to the original document. In the latter
case, the format is a GATE-specific one which can be read back by the system to recreate
all the information that GATE held internally for the document.

In order to understand why there are two types of XML serialization, one needs to understand
the structure of a GATE document. GATE allows a graph of annotations that refer to
parts of the text. Those annotations are grouped under annotation sets. Because of this
structure, sometimes it is impossible to save a document as XML using tags that surround
the text referred to by the annotation, because tags could cross over one another (XML
is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in
order to preserve all annotations in a GATE document, a custom type of XML document
was developed.
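
Whether two annotations can both be written as inline tags comes down to a simple test on their offsets: they cross over when they overlap but neither contains the other. A minimal sketch of that test follows; the Span record and the names are assumptions for illustration.

```java
final class CrossoverSketch {
    // Start and end offsets of an annotation.
    record Span(int start, int end) {}

    // True if the spans overlap without one being nested inside the other,
    // in which case they cannot both be written as well-formed XML tags.
    static boolean crosses(Span a, Span b) {
        boolean overlap = a.start() < b.end() && b.start() < a.end();
        boolean aInsideB = b.start() <= a.start() && a.end() <= b.end();
        boolean bInsideA = a.start() <= b.start() && b.end() <= a.end();
        return overlap && !aInsideB && !bInsideA;
    }

    public static void main(String[] args) {
        System.out.println(crosses(new Span(0, 10), new Span(5, 15)));
        System.out.println(crosses(new Span(0, 10), new Span(2, 5)));
    }
}
```

Nested or disjoint spans map onto a tree of elements; crossing spans do not, which is why the stand-off GATE format keeps annotations separate from the text.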

The problem of crossover tags appears with the preserve format option, which is
implemented at the cost of losing certain annotations. GATE tries to restore the original
markup and, where possible, to add the annotations produced by GATE in the same manner.

How to Access and Use the Two Forms of XML Serialization

Save as XML Option This option is available in GATE Developer in the pop-up menu
associated with each language resource (document or corpus). Saving a corpus as XML
is done by calling `Save as XML' on each document of the corpus. This option saves all
the annotations of a document together with their features (applying the restrictions
previously discussed), using the GateDocument.dtd:

<!ELEMENT GateDocument (GateDocumentFeatures,
TextWithNodes, (AnnotationSet+))>
<!ELEMENT GateDocumentFeatures (Feature+)>
<!ELEMENT Feature (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>
<!ELEMENT TextWithNodes (#PCDATA | Node)*>
<!ELEMENT AnnotationSet (Annotation*)>
<!ATTLIST AnnotationSet Name CDATA #IMPLIED>
<!ELEMENT Annotation (Feature*)>
<!ATTLIST Annotation Type CDATA #REQUIRED
StartNode CDATA #REQUIRED
EndNode CDATA #REQUIRED>
<!ELEMENT Node EMPTY>
<!ATTLIST Node id CDATA #REQUIRED>

The document is saved under a name chosen by the user and it may have any extension.
However, the recommended extension would be `xml'.

Using GATE Embedded, this option is available by calling gate.Document's toXml()
method. This method returns a string which is the XML representation of the document on
which the method was called.

Note: It is recommended that the string representation be saved on the file system
using the UTF-8 encoding, as the first line of the string is: <?xml version="1.0"
encoding="UTF-8"?>
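
Writing the string with an explicit charset avoids depending on the platform default. The sketch below uses only the JDK; the XML string is a placeholder standing in for the value that toXml() would return, and the class and method names are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class SaveXmlSketch {
    // Writes the XML string as UTF-8, matching the encoding named
    // in its XML declaration.
    static void saveUtf8(String xml, Path target) {
        try {
            Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Saves to a temporary file and reads the content back as UTF-8.
    static String roundTrip(String xml) {
        try {
            Path tmp = Files.createTempFile("gate-doc", ".xml");
            try {
                saveUtf8(xml, tmp);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content; in GATE Embedded this would be the toXml() result.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<GateDocument></GateDocument>";
        System.out.println(roundTrip(xml).equals(xml));
    }
}
```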

Example of such a GATE format document:

<?xml version="1.0" encoding="UTF-8" ?>


<GateDocument>

<!-- The document's features-->

<GateDocumentFeatures>
<Feature>
<Name className="java.lang.String">MimeType</Name>
<Value className="java.lang.String">text/plain</Value>
</Feature>
<Feature>
<Name className="java.lang.String">gate.SourceURL</Name>
<Value className="java.lang.String">file:/G:/tmp/example.txt</Value>
</Feature>
</GateDocumentFeatures>

<!-- The document content area with serialized nodes -->

<TextWithNodes>
<Node id="0"/>A TEENAGER <Node
id="11"/>yesterday<Node id="20"/> accused his parents of cruelty
by feeding him a daily diet of chips which sent his weight
ballooning to 22st at the age of l2<Node id="146"/>.<Node
id="147"/>
</TextWithNodes>

<!-- The default annotation set -->

<AnnotationSet>
<Annotation Type="Date" StartNode="11" EndNode="20">
<Feature>
<Name className="java.lang.String">rule2</Name>
<Value className="java.lang.String">DateOnlyFinal</Value>
</Feature>
<Feature>
<Name className="java.lang.String">rule1</Name>
<Value className="java.lang.String">GazDateWords</Value>
</Feature>
<Feature>
<Name className="java.lang.String">kind</Name>
<Value className="java.lang.String">date</Value>
</Feature>
</Annotation>
<Annotation Type="Sentence" StartNode="0" EndNode="147">
</Annotation>
<Annotation Type="Split" StartNode="146" EndNode="147">
<Feature>
<Name className="java.lang.String">kind</Name>
<Value className="java.lang.String">internal</Value>
</Feature>
</Annotation>
<Annotation Type="Lookup" StartNode="11" EndNode="20">
<Feature>
<Name className="java.lang.String">majorType</Name>
<Value className="java.lang.String">date_key</Value>
</Feature>
</Annotation>
</AnnotationSet>

<!-- Named annotation set -->

<AnnotationSet Name="Original markups" >
<Annotation Type="paragraph" StartNode="0" EndNode="147">
</Annotation>
</AnnotationSet>
</GateDocument>

Note: All features that are neither numbers or strings, nor collections containing numbers
or strings, are discarded. With this option, GATE does not preserve features that it
cannot restore.

The Preserve Format Option. This option is available in GATE Developer from the
popup menu of the annotations table. If no annotation in this table is selected, then the
option will restore the document's original markup. If certain annotations are selected, then
the option will attempt to restore the original markup and insert all the selected ones. When
an annotation violates the crossover condition, that annotation is discarded and a message
is issued.

This option makes it possible to generate an XML document with tags surrounding the
annotation's referenced text and features saved as attributes. All features which are collections,
strings or numbers are saved, and the others are discarded. However, when read back, only
the attributes under the GATE namespace (see below) are reconstructed differently to
the others. That is because GATE does not store in the XML document the information
about the feature's class and, for collections, the class of the items. So, when read back, all
features will become strings, except those under the GATE namespace.

One will notice that all generated tags have an attribute called `gateId' under the namespace
`http://www.gate.ac.uk'. The attribute is used when the document is read back in
GATE, in order to restore the annotation's old ID. This feature is needed because it works
in close cooperation with another attribute under the same namespace, called `matches'.
This attribute indicates annotations/tags that refer to the same entity¹. They are under this
namespace because GATE is sensitive to them and treats them differently to all other elements
with their attributes, which fall under the general reading algorithm described at the
beginning of this section.

The `gateId' attribute under the GATE namespace is used to create an annotation whose ID is
the value indicated by this attribute. The `matches' attribute is used to create an ArrayList in
which the items will be Integers, representing the IDs of the annotations that the current one
matches.

Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and <Person
gate:gateId="25" gate:matches="23;25;30">John Major</Person> are
the same person.

When GATE parses this text, it creates two annotations:

a1.type = "Person"
a1.ID = Integer(23)
a1.start = <the start offset of John>
a1.end = <the end offset of John>
a1.featureMap = {}

a2.type = "Person"
a2.ID = Integer(25)
a2.start = <the start offset of John Major>
a2.end = <the end offset of John Major>
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}

¹ It's not an XML entity but an information extraction named entity.
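The conversion of a `matches' attribute value into a list of Integers can be sketched as follows. This is only an illustration using the JDK, not GATE's own parsing code; the class name MatchesAttribute and the method parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mirrors the description above, not GATE's own reader.
class MatchesAttribute {

    // Turns a value such as "23;25;30" into a list of Integer annotation IDs.
    static List<Integer> parse(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Integer.valueOf(trimmed));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parse("23;25;30")); // prints [23, 25, 30]
    }
}
```

The resulting list corresponds to the value stored under the matches key of a2.featureMap in the example.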

Under GATE Embedded, this option is available by calling gate.Document's toXml(Set
aSetContainingAnnotations) method. This method returns a string which is the XML
representation of the document on which the method was called. If called with null as
a parameter, then the method will attempt to restore only the original markup. If the
parameter is a set that contains annotations, then each annotation is tested against the
crossover restriction, and for those found to violate it, a warning will be issued and they will
be discarded.

In the next subsections we will show how this option applies to the other formats supported
by GATE.

5.5.3 HTML
Input

HTML documents are parsed by GATE using the NekoHTML parser. The documents are
read and created in GATE the same way as the XML documents.

The extensions associated with the HTML reader are:

• htm

• html

The web server content type associated with HTML documents is: text/html.

The magic numbers test searches inside the document for the HTML (<html) signature. There
are certain HTML documents that do not contain the HTML tag, so the magic numbers
test might not hold.

There is a certain degree of customization for HTML documents in that GATE introduces
new lines into the document's text content in order to obtain a readable form. The
annotations will refer to the pieces of text as described in the original document but there will be a
few extra new line characters inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will
introduce a new line (NL) char into the text. After a TITLE tag it will introduce two NLs.
With P tags, GATE will introduce one NL at the beginning of the paragraph and one at
the end of the paragraph. All newly added NLs are not considered to be part of the text
contained by the tag.

Output

The `Save as XML' option works exactly the same for all GATE's documents so there is no
particular observation to be made for the HTML formats.

When attempting to preserve the original markup formatting, GATE will generate the
document in XHTML. After being processed by GATE, the HTML document will look the same
in any browser, but it will be in a different syntax.

5.5.4 SGML
Input

The SGML support in GATE is fairly light as there is no freely available Java SGML parser.
GATE uses a light converter attempting to transform the input SGML file into well-formed
XML. Because it does not make use of a DTD, the conversion might not always be good.
It is advisable to perform an SGML2XML conversion outside the system (using some other
specialized tools) before using the SGML document inside GATE.

The extensions associated with the SGML reader are:

• sgm

• sgml

The web server content type associated with SGML documents is: text/sgml.

There is no magic numbers test for SGML.

Output

When attempting to preserve the original markup formatting, GATE will generate the
document as XML because the real input of an SGML document inside GATE is an XML one.

5.5.5 Plain text


Input

When reading a plain text document, GATE attempts to detect its paragraphs and add
`paragraph' annotations to the document's `Original markups' annotation set. It does that
by detecting two consecutive NLs. The procedure works for both UNIX-like and DOS-like
text files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.

Paragraph 2. This text belongs to the second paragraph

then two `paragraph' type annotations will be created in the `Original markups' annotation
set (referring to the first and second paragraphs) with an empty feature map.
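The paragraph detection described above can be sketched with a regular expression that looks for two consecutive line breaks in either UNIX or DOS style. This is a minimal illustration using only the JDK; it is an assumption about how such detection could work, not GATE's implementation, and the class name ParagraphSpans is hypothetical. The exact offsets GATE assigns may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: finds paragraph spans separated by blank lines.
class ParagraphSpans {

    // Two or more consecutive line breaks, in UNIX (\n) or DOS (\r\n) style.
    private static final Pattern BREAK = Pattern.compile("(?:\\r?\\n){2,}");

    // Returns {start, end} character offsets of each non-empty paragraph.
    static List<int[]> find(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                spans.add(new int[] {start, m.start()});
            }
            start = m.end();
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Paragraph 1. This text belongs to the first paragraph.\n\n"
                + "Paragraph 2. This text belongs to the second paragraph";
        for (int[] span : find(text)) {
            System.out.println(span[0] + "-" + span[1] + ": "
                    + text.substring(span[0], span[1]));
        }
    }
}
```

Each returned pair plays the role of the start and end offsets of one `paragraph' annotation with an empty feature map.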

The extensions associated with the plain text reader are:

• txt

• text

The web server content type associated with plain text documents is: text/plain.

There is no magic numbers test for plain text.

Output

When attempting to preserve the original markup formatting, GATE will dump XML
markup that surrounds the text referred to.

The procedure described above applies to both plain text and RTF documents.

5.5.6 RTF
Input

RTF documents are accessed using Java's RTF editor kit. It only extracts
the document's text content from the RTF document.

The extension associated with the RTF reader is `rtf'.

The web server content type associated with RTF documents is: text/rtf.

The magic numbers test searches for {\rtf1.

Output

Same as the plain text output.

5.5.7 Email
Input

GATE is able to read email messages packed in one document (UNIX mailbox format). It
detects multiple messages inside such documents and for each message it creates annotations
for all the fields composing an e-mail, like date, from, to, subject, etc. The message's body
is analyzed and paragraph detection is performed (just like in the plain text case). All
annotations created have as type the name of the e-mail's fields and they are placed in the
`Original markups' annotation set.

Example:

From [email protected] Wed Sep 6 10:35:50 2000
Date: Wed, 6 Sep 2000 10:35:49 +0100 (BST)
From: forename1 surname2 <[email protected]>
To: forename2 surname2 <[email protected]>
Subject: A subject
Message-ID: <Pine.SOL.3.91.1000906103251.26010A-100000@servername>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

This text belongs to the e-mail body....

This is a paragraph in the body of the e-mail

This is another paragraph.

GATE attempts to detect lines such as `From [email protected] Wed Sep 6 10:35:50 2000'
in the e-mail text. Those lines separate e-mail messages contained in one file. After that,
for each field in the e-mail message annotations are created as follows:

The annotation type will be the name of the field, the feature map will be empty and the
annotation will span from the end of the field name until the end of the line containing the
e-mail field.

Example:

a1.type = "date"; a1 spans between the two ^ ^. Date:^ Wed, 6 Sep 2000 10:35:49 +0100 (BST)^

a2.type = "from"; a2 spans between the two ^ ^. From:^ forename1 surname2 <[email protected]>^
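The span rule illustrated by the two ^ markers can be sketched as below. The regular expression for field names, the lower-casing of the type and the class EmailFieldSpan are assumptions made for this illustration; they are inferred from the examples above rather than taken from GATE's email reader.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: computes the annotation type and span for one header line.
class EmailFieldSpan {

    // A header field name followed by a colon at the very start of a line.
    private static final Pattern FIELD = Pattern.compile("^([A-Za-z][A-Za-z0-9-]*):");

    final String type;
    final int start;
    final int end;

    EmailFieldSpan(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    // lineOffset is the offset of the line's first character in the document.
    // Returns null when the line is not a header field (e.g. a "From " separator).
    static EmailFieldSpan of(String line, int lineOffset) {
        Matcher m = FIELD.matcher(line);
        if (!m.find()) {
            return null;
        }
        String type = m.group(1).toLowerCase(Locale.ROOT);
        // The span runs from just after the colon to the end of the line.
        return new EmailFieldSpan(type, lineOffset + m.end(), lineOffset + line.length());
    }

    public static void main(String[] args) {
        EmailFieldSpan span = of("Date: Wed, 6 Sep 2000 10:35:49 +0100 (BST)", 0);
        System.out.println(span.type + " " + span.start + "-" + span.end);
    }
}
```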

The extensions associated with the email reader are:

• eml

• email

• mail

The web server content type associated with email documents is: text/email.

The magic numbers test searches for keywords like Subject:, etc.

Output

Same as plain text output.

5.5.8 PDF Files and Office Documents


GATE uses the Apache Tika library to provide support for PDF documents and a number of
the document formats from both Microsoft Office and OpenOffice. In essence Tika converts
the document structure into HTML which is then used to create a GATE document. This
means that whilst a PDF or Word document may have been loaded, the Original markups
set will contain HTML elements. One advantage of this approach is that processing resources
and JAPE grammars designed for use with HTML files should also work well with PDF and
Office documents.

5.5.9 UIMA CAS Documents


GATE can read UIMA CAS documents. CAS stands for Common Analysis Structure.
It provides a common representation of the artifact being analyzed, here a text.

The subject of analysis (SOFA), here a string, is used as the document content. Multiple
SOFAs are concatenated. The analysis results or metadata are added as annotations when
they have begin and end offsets, and otherwise are added as document features. The views are
added as GATE annotation sets. The type system (a hierarchical annotation schema) is not
currently supported.

The web server content type associated with UIMA documents is: text/xmi+xml.

The extensions are: xcas, xmicas, xmi.

The magic numbers are:

<CAS version="2">

and

xmlns:cas=

5.5.10 CoNLL/IOB Documents


GATE can read files of text annotated in the traditional CoNLL or BIO/BILOU format,
typically used to represent POS tags and chunks and best known for Conference on Natural
Language Learning² tasks. The following example illustrates one sentence with POS and
chunk tags (B- and I- indicate the beginning and continuation, respectively, of a chunk);
the columns represent the tokens, the POS tags, and the chunk tags, and sentences are
separated by blank lines.

My PRP$ B-NP
dog NN I-NP
has VBZ B-VP
fleas NNS B-NP
. . O

GATE interprets this format quite flexibly: the columns can be separated by any whitespace
sequence, and the number of columns can vary. The strings from the leftmost column become
strings in the document content, with spaces interposed, and Token and SpaceToken
annotations (with string and length features) are created appropriately in the Original markups
set.

² http://ifarm.nl/signll/conll/

Each blank line (empty or containing only whitespace) in the original data becomes a newline
in the document content.

The tags in subsequent columns are transformed into annotations. A chunk tag (beginning
with B- and followed by zero or more matching I- tags) produces an annotation whose type
is determined by the rest of the tag (NP or VP in the above example, but any string with
no whitespace is acceptable), with a kind = chunk feature. A chunk tag beginning with L-
(last) terminates the chunk, and a U- (unigram) tag produces a chunk annotation over one
token. Other tags produce annotations with the tag name as the type and a kind = token
feature.

Every annotation derived from a tag has a column feature whose int value indicates the
source column in the data (numbered from 0 for the string column). An O tag closes all
open chunk tags at the end of the previous token.
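The chunk rules above can be sketched as a small reader over one column of tags. This is a simplified illustration, not GATE's CoNLL reader: it reports chunks as ranges of line indices rather than character offsets, handles a single chunk column, ignores stray I- or L- tags that have no open chunk, and leaves out the kind = token annotations. The class ConllChunks and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: groups B-/I-/L-/U-/O tags of one column into chunks.
class ConllChunks {

    // A chunk covering lines first..last (both inclusive, numbered from 0).
    static final class Chunk {
        final String type;
        final int first;
        final int last;

        Chunk(String type, int first, int last) {
            this.type = type;
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return type + "[" + first + ".." + last + "]";
        }
    }

    static List<Chunk> read(List<String> lines, int column) {
        List<Chunk> chunks = new ArrayList<>();
        String openType = null;
        int openStart = -1;
        for (int i = 0; i < lines.size(); i++) {
            String[] cols = lines.get(i).trim().split("\\s+");
            // Lines without this column (e.g. blank lines) are treated like O.
            String tag = cols.length > column ? cols[column] : "O";
            boolean prefixed = tag.length() > 1 && tag.charAt(1) == '-';
            String prefix = prefixed ? tag.substring(0, 1) : "";
            String type = prefixed ? tag.substring(2) : tag;
            boolean continues = openType != null && openType.equals(type)
                    && (prefix.equals("I") || prefix.equals("L"));
            if (openType != null && !continues) {
                // Anything else closes the open chunk at the previous line.
                chunks.add(new Chunk(openType, openStart, i - 1));
                openType = null;
            }
            if (prefix.equals("B")) {
                openType = type;
                openStart = i;
            } else if (prefix.equals("U")) {
                chunks.add(new Chunk(type, i, i));
            } else if (prefix.equals("L") && continues) {
                chunks.add(new Chunk(openType, openStart, i));
                openType = null;
            }
        }
        if (openType != null) {
            chunks.add(new Chunk(openType, openStart, lines.size() - 1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList(
                "My PRP$ B-NP", "dog NN I-NP", "has VBZ B-VP", "fleas NNS B-NP", ". . O");
        System.out.println(read(sentence, 2)); // prints [NP[0..1], VP[2..2], NP[3..3]]
    }
}
```

In GATE each such chunk would become an annotation of the given type with kind = chunk and column = 2 for this example.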

This document format is associated with MIME-type text/x-conll and filename extensions
.conll and .iob.

5.6 XML Input/Output


Support for input from and output to XML is described in Section 5.5.2. In short:

• GATE will read any well-formed XML document (it does not attempt to validate XML
documents). Markup will by default be converted into native GATE format.

• GATE will write back into XML in one of two ways:

1. Preserving the original format and adding selected markup (for example to add
the results of some language analysis process to the document).
2. In GATE's own XML serialisation format, which encodes all the data in a GATE
Document (as far as this is possible within a tree-structured paradigm; for 100%
non-lossy data storage use GATE's RDBMS or binary serialisation facilities, see
Section 4.5).

When using GATE Embedded, object representations of XML documents such as DOM or
JDOM, or query and transformation languages such as XPath or XSLT, may be used in parallel
with GATE's own Document representation (gate.Document) without conflicts.
Chapter 6

ANNIE: a Nearly-New Information Extraction System

And so the time had passed predictably and soberly enough in work and routine
chores, and the events of the previous night from first to last had faded; and only
now that both their days' work was over, the child asleep and no further disturbance
anticipated, did the shadowy figures from the masked ball, the melancholy
stranger and the dominoes in red, revive; and those trivial encounters became
magically and painfully interfused with the treacherous illusion of missed opportunities.
Innocent yet ominous questions and vague ambiguous answers passed
to and fro between them; and, as neither of them doubted the other's absolute
candour, both felt the need for mild revenge. They exaggerated the extent to
which their masked partners had attracted them, made fun of the jealous stirrings
the other revealed, and lied dismissively about their own. Yet this light banter
about the trivial adventures of the previous night led to more serious discussion
of those hidden, scarcely admitted desires which are apt to raise dark and perilous
storms even in the purest, most transparent soul; and they talked about
those secret regions for which they felt hardly any longing, yet towards which the
irrational wings of fate might one day drive them, if only in their dreams. For
however much they might belong to one another heart and soul, they knew last
night was not the first time they had been stirred by a whiff of freedom, danger
and adventure.
Dream Story, Arthur Schnitzler, 1926 (pp. 4-5).

GATE was originally developed in the context of Information Extraction (IE) R&D, and IE
systems in many languages and shapes and sizes have been created using GATE with the
IE components that have been distributed with it (see [Maynard et al. 00] for descriptions
of some of these projects).¹

¹ The principal architects of the IE systems in GATE version 1 were Robert Gaizauskas and Kevin
Humphreys. This work lives on in the LaSIE system. (A derivative of LaSIE was distributed with GATE
version 1 under the name VIE, a Vanilla IE system.)

GATE is distributed with an IE system called ANNIE, A Nearly-New IE system (developed
by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin
Dimitrov and others). ANNIE relies on finite state algorithms and the JAPE language (see
Chapter 8).

ANNIE components form a pipeline which appears in Figure 6.1. ANNIE components are
included with GATE (though the linguistic resources they rely on are generally simpler
than the ones we use in-house). The rest of this chapter describes these components.

Figure 6.1: ANNIE and LaSIE

6.1 Document Reset

The document reset resource enables the document to be reset to its original state, by
removing all the annotation sets and their contents, apart from the one containing the document
format analysis (Original Markups). An optional parameter, keepOriginalMarkupsAS,
allows users to decide whether to keep the Original Markups AS or not while resetting the
document. The parameter annotationTypes can be used to specify a list of annotation
types to remove from all the sets instead of the whole sets.

Alternatively, if the parameter setsToRemove is not empty, the other parameters except
annotationTypes are ignored and only the annotation sets specified in this list will be
removed. If annotationTypes is also specified, only those annotation types in the specified
sets are removed. In order to specify that you want to reset the default annotation set, just
click the "Add" button without entering a name; this will add <null>, which denotes the
default annotation set. This resource is normally added to the beginning of an application,
so that a document is reset before an application is rerun on that document.

6.2 Tokeniser
The tokeniser splits the text into very simple tokens such as numbers, punctuation and words
of different types. For example, we distinguish between words in uppercase and lowercase,
and between certain types of punctuation. The aim is to limit the work of the tokeniser
to maximise efficiency, and enable greater flexibility by placing the burden on the grammar
rules, which are more adaptable.

6.2.1 Tokeniser Rules


A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular
expression which has to be matched on the input; the RHS describes the annotations to be
added to the AnnotationSet. The LHS is separated from the RHS by `>'. The following
operators can be used on the LHS:

| (or)
* (0 or more occurrences)
? (0 or 1 occurrences)
+ (1 or more occurrences)

The RHS uses `;' as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}

Details about the primitive constructs available are given in the tokeniser file
(DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

`UPPERCASE_LETTER' `LOWERCASE_LETTER'* > Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more
lowercase letters. This sequence will then be annotated as type `Token'. The attribute `orth'
(orthography) has the value `upperInitial'; the attribute `kind' has the value `word'.

6.2.2 Token Types


In the default set of rules, the following kinds of Token and SpaceToken are possible:

Word

A word is defined as any set of contiguous upper or lowercase letters, including a hyphen
(but no other forms of punctuation). A word also has the attribute `orth', for which four
values are defined:

• upperInitial - initial letter is uppercase, rest are lowercase

• allCaps - all uppercase letters

• lowerCase - all lowercase letters

• mixedCaps - any mixture of upper and lowercase letters not included in the above
categories
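The four `orth' values can be approximated with plain Java regular expressions, as in the sketch below. It assumes the input is already a word token (letters and hyphens only) and checks upperInitial first, so that a single capital letter is classed as upperInitial, in line with the example rule in Section 6.2.1. The class name OrthClassifier is hypothetical and the real tokeniser rules are more detailed.

```java
// Illustrative only: assigns an orth value to a string that is already a word.
class OrthClassifier {

    static String orth(String word) {
        if (word.matches("\\p{Lu}[\\p{Ll}-]*")) {
            return "upperInitial"; // one capital, then lowercase letters or hyphens
        }
        if (word.matches("[\\p{Lu}-]+")) {
            return "allCaps";
        }
        if (word.matches("[\\p{Ll}-]+")) {
            return "lowerCase";
        }
        return "mixedCaps";
    }

    public static void main(String[] args) {
        for (String w : new String[] {"John", "NATO", "chips", "iPhone"}) {
            System.out.println(w + " -> " + orth(w));
        }
    }
}
```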

Number

A number is defined as any combination of consecutive digits. There are no subdivisions of
numbers.

Symbol

Two types of symbol are defined: currency symbol (e.g. `$', `£') and symbol (e.g. `&').
These are represented by any number of consecutive currency or other symbols (respectively).

Punctuation

Three types of punctuation are defined: start_punctuation (e.g. `('), end_punctuation (e.g.
`)'), and other punctuation (e.g. `:'). Each punctuation symbol is a separate token.

SpaceToken

White spaces are divided into two types of SpaceToken - space and control - according to
whether they are pure space characters or control characters. Any contiguous (and
homogeneous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can
be created if necessary. The choice of tokeniser is then determined at the time of text
processing.

6.2.3 English Tokeniser


The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE
transducer (see Chapter 8). The transducer has the role of adapting the generic output of
the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation
is the joining together in one token of constructs like '30s, 'Cause, 'em, 'N, 'S,
's, 'T, 'd, 'll, 'm, 're, 'til, 've, etc. Another task of the JAPE transducer is
to convert negative constructs like don't from three tokens (don, ' and t) into two
tokens (do and n't).

The English Tokeniser should always be used on English texts that need to be processed
afterwards by the POS Tagger.

6.3 Gazetteer
The role of the gazetteer is to identify entity names in the text based on lists. The ANNIE
gazetteer is described here, and also covered in Chapter 13 in Section 13.2.

The gazetteer lists used are plain text files, with one entry per line. Each list represents a
set of names, such as names of cities, organisations, days of the week, etc.

Below is a small section of the list for units of currency:

Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and,
optionally, a minor type. It is also possible to include a language in the same way (fourth
column), where lists for different languages are used, though ANNIE is only concerned with
monolingual recognition. By default, the Gazetteer PR creates a Lookup annotation for
every gazetteer entry it finds in the text. One can also specify an annotation type (fifth
column) specific to an individual list. In the example below, the first column refers to the
list name, the second column to the major type, and the third to the minor type.

These lists are compiled into finite state machines. Any text tokens that are matched by these
machines will be annotated with features specifying the major and minor types. Grammar
rules then specify the types to be identified in particular circumstances. Each gazetteer list
should reside in the same directory as the index file.

currency_prefix.lst:currency_unit:pre_amount
currency_unit.lst:currency_unit:post_amount
date.lst:date:specific
day.lst:date:day

So, for example, if a specific day needs to be identified, the minor type `day' should be
specified in the grammar, in order to match only information about specific days; if any kind
of date needs to be identified, the major type `date' should be specified, to enable tokens
annotated with any information about dates to be identified. More information about this
can be found in the following section.

In addition, the gazetteer allows arbitrary feature values to be associated with particular
entries in a single list. ANNIE does not use this capability, but to enable it for your own
gazetteers, set the optional gazetteerFeatureSeparator parameter to a single character
(or an escape sequence such as \t or \uNNNN) when creating a gazetteer. In this mode, each
line in a .lst file can have feature values specified, for example, with the following entry in
the index file:

software_company.lst:company:software

the following software_company.lst:

Red Hat&stockSymbol=RHAT
Apple Computer&abbrev=Apple&stockSymbol=AAPL
Microsoft&abbrev=MS&stockSymbol=MSFT

and gazetteerFeatureSeparator set to &, the gazetteer will annotate Red Hat as a Lookup
with features majorType=company, minorType=software and stockSymbol=RHAT. Note that
you do not have to provide the same features for every line in the file; in particular, it is
possible to provide extra features for some lines in the list but not others.
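The way an index line and a list entry combine into Lookup features can be sketched as follows. This is an illustration of the file formats described above using only the JDK, not the gazetteer's own loading code; the class GazetteerLines and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: parses one lists.def line and one .lst entry.
class GazetteerLines {

    // Splits an index line such as "day.lst:date:day" into its columns:
    // list name, major type, then optional minor type, language, annotation type.
    static String[] parseIndexLine(String line) {
        return line.split(":");
    }

    // The text to be matched is everything before the first separator.
    static String entryText(String entryLine, String separator) {
        int idx = entryLine.indexOf(separator);
        return idx < 0 ? entryLine : entryLine.substring(0, idx);
    }

    // Builds the features of the Lookup annotation for one entry line.
    static Map<String, String> lookupFeatures(String entryLine, String separator,
                                              String majorType, String minorType) {
        Map<String, String> features = new LinkedHashMap<>();
        features.put("majorType", majorType);
        if (minorType != null) {
            features.put("minorType", minorType);
        }
        String[] parts = entryLine.split(Pattern.quote(separator));
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                features.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] index = parseIndexLine("software_company.lst:company:software");
        String entry = "Red Hat&stockSymbol=RHAT";
        System.out.println(entryText(entry, "&"));
        System.out.println(lookupFeatures(entry, "&", index[1], index[2]));
    }
}
```

For the Red Hat entry this yields the three features named in the text: majorType, minorType and stockSymbol.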

Here is a full list of the parameters used by the Default Gazetteer:

Init-time parameters

listsURL A URL pointing to the index file (usually lists.def) that contains the list of pattern
lists.

encoding The character encoding to be used while reading the pattern lists.

gazetteerFeatureSeparator The character used to add arbitrary features to gazetteer
entries. See above for an example.

caseSensitive Should the gazetteer be case sensitive during matching?

Run-time parameters

document The document to be processed.


annotationSetName The name for annotation set where the resulting Lookup annotations
will be created.

wholeWordsOnly Should the gazetteer only match whole words? If set to true, a string
segment in the input document will only be matched if it is bordered by characters
that are not letters, non spacing marks, or combining spacing marks (as identified by
the Unicode standard).

longestMatchOnly Should the gazetteer only match the longest possible string starting
from any position? This parameter is only relevant when the list of lookups contains
proper prefixes of other entries (e.g. when both `Dell' and `Dell Europe' are in the lists).
The default behaviour (when this parameter is set to true) is to only match the longest
entry, `Dell Europe' in this example. This is the default GATE gazetteer behaviour
since version 2.0. Setting this parameter to false will cause the gazetteer to match
all possible prefixes.
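The effect of longestMatchOnly can be illustrated with a naive prefix check. The real gazetteer works with finite state machines and also applies wholeWordsOnly and caseSensitive; this sketch ignores all of that and only shows which entries are reported at one position. LongestMatch and matchesAt are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: contrasts longest-match and all-prefixes behaviour.
class LongestMatch {

    // Returns the entries that match the text starting at position pos.
    static List<String> matchesAt(String text, int pos, List<String> entries,
                                  boolean longestMatchOnly) {
        List<String> found = new ArrayList<>();
        for (String entry : entries) {
            if (text.startsWith(entry, pos)) {
                found.add(entry);
            }
        }
        if (longestMatchOnly && found.size() > 1) {
            String longest = found.get(0);
            for (String entry : found) {
                if (entry.length() > longest.length()) {
                    longest = entry;
                }
            }
            found = new ArrayList<>();
            found.add(longest);
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("Dell", "Dell Europe");
        String text = "Dell Europe announced results";
        System.out.println(matchesAt(text, 0, entries, true));  // prints [Dell Europe]
        System.out.println(matchesAt(text, 0, entries, false)); // prints [Dell, Dell Europe]
    }
}
```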

6.4 Sentence Splitter


The sentence splitter is a cascade of finite-state transducers which segments the text into
sentences. This module is required for the tagger. The splitter uses a gazetteer list of
abbreviations to help distinguish sentence-marking full stops from other kinds.

Each sentence is annotated with the type `Sentence'. Each sentence break (such as a full
stop) is also given a `Split' annotation. It has a feature `kind' with two possible values:
`internal' for any combination of exclamation and question mark or one to four dots and
`external' for a newline.

The sentence splitter is domain and application-independent.

There is an alternative ruleset for the Sentence Splitter which considers newlines and carriage
returns differently. In general this version should be used when a new line on the page
indicates a new sentence. To use this alternative version, simply load main-single-nl.jape
from the default location instead of main.jape (the default file) when asked to select
the location of the grammar file to be used.

6.5 RegEx Sentence Splitter


The RegEx sentence splitter is an alternative to the standard ANNIE Sentence Splitter.
Its main aim is to address some performance issues identified in the JAPE-based splitter,
mainly to do with improving the execution time and robustness, especially when faced with
irregular input.

As its name suggests, the RegEx splitter is based on regular expressions, using the default
Java implementation.

The new splitter is configured by three files containing (Java style, see
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html) regular
expressions, one regex per line. The three different files encode patterns for:

internal splits sentence splits that are part of the sentence, such as sentence ending
punctuation;

external splits sentence splits that are NOT part of the sentence, such as 2 consecutive
new lines;

non splits text fragments that might be seen as splits but should be ignored (such as
full stops occurring inside abbreviations).

The new splitter comes with an initial set of patterns that try to emulate the behaviour of
the original splitter (apart from the situations where the original one was obviously wrong,
like not allowing sentences to start with a number).

Here is a full list of the parameters used by the RegEx Sentence Splitter:

Init-time parameters

encoding The character encoding to be used while reading the pattern lists.

externalSplitListURL URL for the file containing the list of external split patterns;

internalSplitListURL URL for the file containing the list of internal split patterns;

nonSplitListURL URL for the file containing the list of non split patterns;

Run-time parameters

document The document to be processed.


outputASName The name for annotation set where the resulting Split and Sentence
annotations will be created.

6.6 Part of Speech Tagger


The tagger [Hepple 00] is a modified version of the Brill tagger, which produces a
part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in
Appendix G. The tagger uses a default lexicon and ruleset (the result of training on a large
corpus taken from the Wall Street Journal). Both of these can be modified manually if
necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon_cap), and
one for texts in all lowercase (lexicon_lower). To use these, the default lexicon should be
replaced with the appropriate lexicon at load time. The default ruleset should still be used
in this case.

The ANNIE Part-of-Speech tagger requires the following parameters.

• encoding - encoding to be used for reading rules and lexicons (init-time)

• lexiconURL - The URL for the lexicon file (init-time)

• rulesURL - The URL for the ruleset file (init-time)

• document - The document to be processed (run-time)

• inputASName - The name of the annotation set used for input (run-time)

• outputASName - The name of the annotation set used for output (run-time). This is
an optional parameter. If the user does not provide any value, new annotations are created
under the default annotation set.

• baseTokenAnnotationType - The name of the annotation type that refers to Tokens in
a document (run-time, default = Token)

• baseSentenceAnnotationType - The name of the annotation type that refers to
Sentences in a document (run-time, default = Sentence).

• outputAnnotationType - POS tags are added as category features on the annotations
of type `outputAnnotationType' (run-time, default = Token)

• posTagAllTokens - If set to false, only Tokens within each baseSentenceAnnotationType
will be POS tagged (run-time, default = true).

• failOnMissingInputAnnotations - if set to false, the PR will not fail with an
ExecutionException if no input Annotations are found and instead only log a single warning
message per session and a debug message per document that has no input annotations
(run-time, default = true).

If - (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType)

then - New features are added on existing annotations of type `baseTokenAnnotationType'.

otherwise - The tagger searches for an annotation of type `outputAnnotationType' under the
`outputASName' annotation set that has the same offsets as the annotation of type
`baseTokenAnnotationType'. If it finds one, it adds the new feature to that annotation;
otherwise, it creates a new annotation of type `outputAnnotationType' under the
`outputASName' annotation set.
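
The placement rule above can be sketched in plain Java. This is only an illustration of the
decision logic, not the tagger's actual code: the class PosOutputPlacement, the Action enum
and the boolean argument are hypothetical names, and the sketch assumes that a null set name
stands for the default annotation set.

```java
import java.util.Objects;

/** Plain-Java sketch of the output-placement rule; all names here are illustrative, not GATE API. */
class PosOutputPlacement {

    enum Action {
        ADD_FEATURE_TO_BASE_TOKEN,
        ADD_FEATURE_TO_EXISTING_OUTPUT,
        CREATE_OUTPUT_ANNOTATION
    }

    static Action decide(String inputASName, String outputASName,
                         String baseTokenAnnotationType, String outputAnnotationType,
                         boolean outputAnnotationWithSameOffsetsExists) {
        // Objects.equals treats two nulls as equal; null stands for the default set (assumption).
        boolean sameSet = Objects.equals(inputASName, outputASName);
        boolean sameType = Objects.equals(outputAnnotationType, baseTokenAnnotationType);
        if (sameSet && sameType) {
            // "then" branch: the category feature goes on the existing base token
            return Action.ADD_FEATURE_TO_BASE_TOKEN;
        }
        // "otherwise" branch: reuse an output annotation with the same offsets, or create one
        return outputAnnotationWithSameOffsetsExists
                ? Action.ADD_FEATURE_TO_EXISTING_OUTPUT
                : Action.CREATE_OUTPUT_ANNOTATION;
    }
}
```

With the default parameter values (same set, Token as both types) the first branch is taken,
which is why a default ANNIE pipeline simply gains a category feature on each Token.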

6.7 Semantic Tagger


ANNIE's semantic tagger is based on the JAPE language - see Chapter 8. It contains rules
which act on annotations assigned in earlier phases, in order to produce outputs of annotated
entities.

The default annotation types, features and possible values produced by ANNIE are based
on the original MUC entity types, and are as follows:

• Person

  - gender: male, female

• Location

  - locType: region, airport, city, country, county, province, other

• Organization

  - orgType: company, department, government, newspaper, team, other

• Money

• Percent

• Date

  - kind: date, time, dateTime

• Address

  - kind: email, url, phone, postcode, complete, ip, other

• Identifier

• Unknown

Note that some of these feature values are generated automatically from the gazetteer lists,
so if you alter the gazetteer list definition file, these could change. Note also that ANNIE
creates other annotations, features and values, which may be left in place for debugging
purposes: for example, most annotations have a rule feature that gives information about
which rule(s) fired to create the annotation. The Unknown annotation type is used by the
Orthomatcher module (see 6.8) and consists of any proper noun not already identified.

6.8 Orthographic Coreference (OrthoMatcher)

(Note: this component was previously known as a `NameMatcher'.)

The Orthomatcher module adds identity relations between named entities found by the
semantic tagger, in order to perform coreference. It does not find new named entities as
such, but it may assign a type to an unclassified proper name (an Unknown annotation),
using the type of a matching name.

The matching rules are only invoked if the names being compared are both of the same type,
i.e. both already tagged as (say) organisations, or if one of them is classified as `unknown'.
This prevents a previously classified name from being recategorised.

6.8.1 GATE Interface


Input - entity annotations, with an id attribute.

Output - matches attributes added to the existing entity annotations.



6.8.2 Resources
A lookup table of aliases is used to record non-matching strings which represent the same
entity, e.g. `IBM' and `Big Blue', `Coca-Cola' and `Coke'. There is also a table of spurious
matches, i.e. matching strings which do not represent the same entity, e.g. `BT Wireless' and
`BT Cellnet' (which are two different organizations). The list of tables to be used is a load
time parameter of the orthomatcher: a default list is set but can be changed as necessary.

6.8.3 Processing
The wrapper builds an array of the strings, types and IDs of all name annotations, which is
then passed to a string comparison function for pairwise comparisons of all entries.
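
As a rough illustration of how the type restriction, the alias table and the spurious-match
table interact during the pairwise comparison, consider the plain-Java sketch below. It is not
the OrthoMatcher implementation: the class and method names are hypothetical, and the
"same first word" test is only a stand-in for the real (much richer) set of matching rules.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not the OrthoMatcher code. */
class OrthoMatchSketch {

    /** One name annotation: its ID, entity type and string. */
    static final class Name {
        final int id;
        final String type;
        final String text;
        Name(int id, String type, String text) {
            this.id = id;
            this.type = type;
            this.text = text;
        }
    }

    private final Set<String> aliases = new HashSet<>();
    private final Set<String> spurious = new HashSet<>();

    /** Order-independent, case-insensitive key for a pair of strings. */
    private static String key(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        return x.compareTo(y) <= 0 ? x + "|" + y : y + "|" + x;
    }

    void addAlias(String a, String b) { aliases.add(key(a, b)); }

    void addSpurious(String a, String b) { spurious.add(key(a, b)); }

    /** Rules are only invoked for names of the same type, or when one is Unknown. */
    private static boolean typesComparable(Name a, Name b) {
        return a.type.equals(b.type) || a.type.equals("Unknown") || b.type.equals("Unknown");
    }

    boolean matches(Name a, Name b) {
        if (!typesComparable(a, b)) return false;
        String k = key(a.text, b.text);
        if (spurious.contains(k)) return false; // matching strings, different entities
        if (aliases.contains(k)) return true;   // non-matching strings, same entity
        // Stand-in rule (assumption): the two names share their first word.
        String firstA = a.text.split("\\s+")[0];
        String firstB = b.text.split("\\s+")[0];
        return firstA.equalsIgnoreCase(firstB);
    }

    /** Pairwise comparison of all entries; returns the pairs of IDs that match. */
    List<int[]> matchAll(List<Name> names) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                if (matches(names.get(i), names.get(j))) {
                    pairs.add(new int[] { names.get(i).id, names.get(j).id });
                }
            }
        }
        return pairs;
    }
}
```

In this sketch the spurious table is consulted before any other rule, so that `BT Wireless'
and `BT Cellnet' are kept apart even though they share a first word.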

6.9 Pronominal Coreference


The pronominal coreference module performs anaphora resolution using the JAPE grammar
formalism. Note that this module is not automatically loaded with the other ANNIE
modules, but can be loaded separately as a Processing Resource. The main module consists of
three submodules:

• quoted text module

• pleonastic it module

• pronominal resolution module

The first two modules are helper submodules for the pronominal one, because they do not
perform anything related to coreference resolution except the location of quoted fragments
and pleonastic it occurrences in text. They generate temporary annotations which are used
by the pronominal submodule (such temporary annotations are removed later).

The main coreference module can operate successfully only if all ANNIE modules have
already been executed. The module depends on the following annotations created by the
respective ANNIE modules:

• Token (English Tokenizer)

• Sentence (Sentence Splitter)

• Split (Sentence Splitter)

• Location (NE Transducer, OrthoMatcher)

• Person (NE Transducer, OrthoMatcher)

• Organization (NE Transducer, OrthoMatcher)

For each pronoun (anaphor) the coreference module generates an annotation of type
`Coreference' containing two features:

• antecedent offset - this is the offset of the starting node for the annotation (entity)
which is proposed as the antecedent, or null if no antecedent can be proposed.

• matches - this is a list of annotation IDs that make up the coreference chain containing
this anaphor/antecedent pair.

6.9.1 Quoted Speech Submodule


The quoted speech submodule identifies quoted fragments in the text being analysed. The
identified fragments are used by the pronominal coreference submodule for the proper
resolution of pronouns such as I, me, my, etc. which appear in quoted speech fragments. The
module produces `Quoted Text' annotations.

The submodule itself is a JAPE transducer which loads a JAPE grammar and builds an
FSM over it. The FSM is intended to match the quoted fragments and generate appropriate
annotations that will be used later by the pronominal module.

The JAPE grammar consists of only four rules, which create temporary annotations for all
punctuation marks that may enclose quoted speech, such as ", ', `, etc. These rules then
try to identify fragments enclosed by such punctuation. Finally all temporary annotations
generated during the processing, except the ones of type `Quoted Text', are removed (because
no other module will need them later).

6.9.2 Pleonastic It Submodule


The pleonastic it submodule matches pleonastic occurrences of `it'. Similar to the quoted
speech submodule, it is a JAPE transducer operating with a grammar containing patterns
that match the most commonly observed pleonastic it constructs.

6.9.3 Pronominal Resolution Submodule


The main functionality of the coreference resolution module is in the pronominal resolution
submodule. This uses the result from the execution of the quoted speech and pleonastic it
submodules. The module works according to the following algorithm:

• Preprocess the current document. This step locates the annotations that the submodule
needs (such as Sentence, Token, Person, etc.) and prepares the appropriate data
structures for them.
• For each pronoun do the following:
  - inspect the appropriate context for all candidate antecedents for this kind
    of pronoun;
  - choose the best antecedent (if any);
• Create the coreference chains from the individual anaphor/antecedent pairs and the
coreference information supplied by the OrthoMatcher (this step is performed by the
main coreference module).

6.9.4 Detailed Description of the Algorithm


Full details of the pronominal coreference algorithm are as follows.

Preprocessing

The preprocessing task includes the following subtasks:

• Identifying the sentences in the document being processed. The sentences are identified
with the help of the Sentence annotations generated by the Sentence Splitter. For
each sentence a data structure is prepared that contains three lists. The lists contain
the annotations for the person/organization/location named entities appearing in the
sentence. The named entities in the sentence are identified with the help of the Person,
Location and Organization annotations that have already been generated by the Named
Entity Transducer and the OrthoMatcher.
• The gender of each person in the sentence is identified and stored in a global data
structure. It is possible that the gender information is missing for some entities - for
example if only the person's family name is observed then the Named Entity transducer
will be unable to deduce the gender. In such cases the list of matching entities
generated by the OrthoMatcher is inspected and if any of the orthographic matches
contains gender information it is assigned to the entity being processed.
• The identified pleonastic it occurrences are stored in a separate list. The `Pleonastic
It' annotations generated by the pleonastic submodule are used for the task.
• For each quoted text fragment identified by the quoted text submodule, a special
structure is created that contains the persons and the 3rd person singular pronouns
such as `he' and `she' that appear in the sentence containing the quoted text, but not
in the quoted text span (i.e. the ones preceding and succeeding the quote).

Pronoun Resolution

This task includes the following subtasks:

Retrieving all the pronouns in the document. Pronouns are represented as annotations of
type `Token' with feature `category' having value `PRP$' or `PRP'. The former classifies
possessive adjectives such as my, your, etc. and the latter classifies personal, reflexive etc.
pronouns. The two types of pronouns are combined in one list and sorted according to their
offset in the text.

For each pronoun in the list the following actions are performed:

• If the pronoun is `it', then the module performs a check to determine if this is a
pleonastic occurrence. If it is, then no further attempt at resolution is made.

• The proper context is determined. The context size is expressed as the number of
sentences it contains. The context always includes the current sentence (the one
containing the pronoun), the preceding sentence and zero or more further preceding
sentences.

• Depending on the type of pronoun, a set of candidate antecedents is proposed. The
candidate set includes the named entities that are compatible with this pronoun. For
example if the current pronoun is she then only the Person annotations with `gender'
feature equal to `female' or `unknown' will be considered as candidates.

• From all candidates, one is chosen according to evaluation criteria specific to the
pronoun.

Coreference Chain Generation

This step is actually performed by the main module. After executing each of the submodules
on the current document, the coreference module performs the following steps:

• Retrieves the anaphor/antecedent pairs generated by them.

• For each pair, the orthographic matches (if any) of the antecedent entity are retrieved
and then extended with the anaphor of the pair (i.e. the pronoun). The result is
the coreference chain for the entity. The coreference chain contains the IDs of the
annotations (entities) that co-refer.

• A new Coreference annotation is created for each chain. The annotation contains a
single feature `matches' whose value is the coreference chain (the list of IDs). The
annotations are exported in a pre-specified annotation set.
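
The chain-building step can be pictured with the following plain-Java sketch. It is an
illustration under stated assumptions rather than the module's code: the class
CorefChainSketch and the map of orthographic matches keyed by annotation ID are
hypothetical, and the sketch assumes that the antecedent's own ID belongs in the chain
whether or not the OrthoMatcher listed it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the coreference module's code. */
class CorefChainSketch {

    /**
     * Builds the chain for one anaphor/antecedent pair: the orthographic matches of the
     * antecedent (if any), the antecedent itself, and finally the anaphor (the pronoun).
     */
    static List<Integer> buildChain(int anaphorId, int antecedentId,
                                    Map<Integer, List<Integer>> orthoMatches) {
        List<Integer> chain = new ArrayList<>(
                orthoMatches.getOrDefault(antecedentId, Collections.<Integer>emptyList()));
        if (!chain.contains(antecedentId)) {
            chain.add(antecedentId); // make sure the antecedent is part of its own chain
        }
        chain.add(anaphorId);        // extend the chain with the pronoun
        return chain;
    }
}
```

The returned list corresponds to the value that would be stored in the `matches' feature of
the Coreference annotation.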

The resolution of she, her, her$, he, him, his, herself and himself is similar because an
analysis of a corpus showed that these pronouns are related to their antecedents in a similar
manner. The characteristics of the resolution process are:

• The context inspected is not very big - cases where the antecedent is found more than 3
sentences back from the anaphor are rare.
• The recency factor is heavily used - the candidate antecedents that appear closer to the
anaphor in the text are scored better.
• Anaphora have higher priority than cataphora. If there is an anaphoric candidate and
a cataphoric one, then the anaphoric one is preferred, even if the recency factor scores
the cataphoric candidate better.

The resolution process performs the following steps:

• Inspect the context of the anaphor for candidate antecedents. Every Person annotation
is considered to be a candidate. Cases where she/her refers to an inanimate entity (a ship,
for example) are not handled.
• For each candidate perform a gender compatibility check - only candidates having
`gender' feature equal to `unknown' or compatible with the pronoun are considered for
further evaluation.
• Compare each candidate with the best candidate so far. If the two candidates are
anaphoric for the pronoun then choose the one that appears closer. The same holds
for the case where the two candidates are cataphoric relative to the pronoun. If one is
anaphoric and the other is cataphoric then choose the former, even if the latter appears
closer to the pronoun.
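
The selection steps above can be sketched as follows. This is a simplified illustration and
not the submodule's code: the class AntecedentChooser and its Candidate type are
hypothetical, distance is measured in character offsets, and a missing (null) gender is
assumed to behave like `unknown'.

```java
import java.util.List;

/** Illustrative sketch only; not the pronominal resolution submodule's code. */
class AntecedentChooser {

    /** A candidate Person annotation: ID, start offset and gender feature. */
    static final class Candidate {
        final int id;
        final long offset;
        final String gender;
        Candidate(int id, long offset, String gender) {
            this.id = id;
            this.offset = offset;
            this.gender = gender;
        }
    }

    /** Gender check: `unknown' (or missing, by assumption) is compatible with any pronoun. */
    static boolean genderCompatible(String pronounGender, String candidateGender) {
        return candidateGender == null
                || candidateGender.equals("unknown")
                || candidateGender.equals(pronounGender);
    }

    /** True if the challenger should replace the best candidate found so far. */
    static boolean better(Candidate challenger, Candidate best, long pronounOffset) {
        boolean challengerAnaphoric = challenger.offset < pronounOffset;
        boolean bestAnaphoric = best.offset < pronounOffset;
        if (challengerAnaphoric != bestAnaphoric) {
            return challengerAnaphoric; // anaphoric beats cataphoric, regardless of distance
        }
        // both on the same side of the pronoun: the closer one wins (recency)
        return Math.abs(challenger.offset - pronounOffset) < Math.abs(best.offset - pronounOffset);
    }

    /** Returns the chosen antecedent, or null if no compatible candidate exists. */
    static Candidate choose(List<Candidate> candidates, String pronounGender, long pronounOffset) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (!genderCompatible(pronounGender, c.gender)) continue;
            if (best == null || better(c, best, pronounOffset)) best = c;
        }
        return best;
    }
}
```

For example, for a feminine pronoun at offset 100 a candidate of unknown gender at offset 40
is preferred over a female candidate at offset 105, because the latter is cataphoric.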

Resolution of `it', `its', `itself'

This set of pronouns also shares many common characteristics. The resolution process
differs in certain respects from the one for the previous set of pronouns. Successful resolution
for it, its, itself is more difficult because of the following factors:

• There is no gender compatibility restriction. In the case in which there are several
candidates in the context, the gender compatibility restriction is very useful for
rejecting some of the candidates. When no such restriction exists, and with the lack of
any syntactic or ontological information about the entities in the context, the recency
factor plays the major role in choosing the best antecedent.
• The number of nominal antecedents (i.e. entities that are not referred to by name) is
much higher compared to the number of such antecedents for she, he, etc. In this case
trying to find an antecedent only amongst named entities degrades the precision a lot.

Resolution of `I', `me', `my', `myself'

Resolution of these pronouns is dependent on the work of the quoted speech submodule. One
important difference from the resolution process of other pronouns is that the context is not
measured in sentences but depends solely on the quote span. Another difference is that the
context is not contiguous - the quoted fragment itself is excluded from the context, because
it is unlikely that an antecedent for I, me, etc. appears there. The context itself consists of:

• the part of the sentence where the quoted fragment originates, that is not contained
in the quote - i.e. the text prior to the quote;

• the part of the sentence where the quoted fragment ends, that is not contained in the
quote - i.e. the text following the quote;

• the part of the sentence preceding the sentence where the quote originates, which is
not included in another quote.

It is worth noting that, contrary to other pronouns, the antecedent for I, me, my and myself is
most often cataphoric or, if anaphoric, it is not in the same sentence as the quoted fragment.

The resolution algorithm consists of the following steps:

• Locate the quoted fragment description that contains the pronoun. If the pronoun is
not contained in any fragment then return without proposing an antecedent.

• Inspect the context for the quoted fragment (as defined above) for candidate
antecedents. Candidates are considered annotations of type Pronoun or annotations
of type Token with features category = `PRP', string = `she' or category = `PRP',
string = `he'.

• Try to locate a candidate in the text succeeding the quoted fragment (first pattern).
If more than one candidate is present, choose the closest to the end of the quote. If a
candidate is found then propose it as antecedent and exit.

• Try to locate a candidate in the text preceding the quoted fragment (third pattern).
Choose the closest one to the beginning of the quote. If found then set as antecedent
and exit.

• Try to locate antecedents in the unquoted part of the sentence preceding the sentence
where the quote starts (second pattern). Give preference to the one closest to the end
of the quote (if any) in the preceding sentence or closest to the sentence beginning.

6.10 A Walk-Through Example


Let us take an example of a 3-stage procedure using the tokeniser, gazetteer and named-
entity grammar. Suppose we wish to recognise the phrase `800,000 US dollars' as an entity
of type `Number', with the feature `money'.

First of all, we give an example of a grammar rule (and corresponding macros) for money,
which would recognise this type of pattern.

Macro: MILLION_BILLION
({Token.string == "m"}|
 {Token.string == "million"}|
 {Token.string == "b"}|
 {Token.string == "billion"}
)

Macro: AMOUNT_NUMBER
({Token.kind == number}
 (({Token.string == ","}|
   {Token.string == "."})
  {Token.kind == number})*
 (({SpaceToken.kind == space})?
  (MILLION_BILLION)?)
)

Rule: Money1
// e.g. 30 pounds
(
 (AMOUNT_NUMBER)
 ({SpaceToken.kind == space})?
 ({Lookup.majorType == currency_unit})
)
:money -->
 :money.Number = {kind = "money", rule = "Money1"}

6.10.1 Step 1 - Tokenisation


The tokeniser separates this phrase into the following tokens. In general, a word consists
of any number of letters of either case, including a hyphen, but nothing else; a number is
composed of any sequence of digits; punctuation is recognised individually (each character
is a separate token), and any number of consecutive spaces and/or control characters are
recognised as a single SpaceToken.

Token, string = `800', kind = number, length = 3
Token, string = `,', kind = punctuation, length = 1
Token, string = `000', kind = number, length = 3
SpaceToken, string = ` ', kind = space, length = 1
Token, string = `US', kind = word, length = 2, orth = allCaps
SpaceToken, string = ` ', kind = space, length = 1
Token, string = `dollars', kind = word, length = 7, orth = lowercase
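
To make the tokenisation rules concrete, here is a small plain-Java approximation built on
java.util.regex. It is not the ANNIE tokeniser: the class SimpleTokeniser is hypothetical,
it requires a word to start with a letter (an assumption), it treats every other single
character as punctuation, and it does not compute the orth feature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative approximation of the tokenisation rules; not the ANNIE tokeniser. */
class SimpleTokeniser {

    /** A token with its annotation type (Token or SpaceToken), kind and string. */
    static final class Tok {
        final String type;
        final String kind;
        final String string;
        Tok(String type, String kind, String string) {
            this.type = type;
            this.kind = kind;
            this.string = string;
        }
        @Override
        public String toString() {
            return type + ", string = `" + string + "', kind = " + kind
                    + ", length = " + string.length();
        }
    }

    // word: letters and hyphens; number: digits; space: whitespace/control runs; other: one char
    private static final Pattern TOKEN = Pattern.compile(
            "(?<word>\\p{L}[\\p{L}-]*)|(?<number>\\p{Nd}+)|(?<space>[\\s\\p{Cntrl}]+)|(?<other>.)",
            Pattern.DOTALL);

    static List<Tok> tokenise(String text) {
        List<Tok> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group("word") != null) {
                out.add(new Tok("Token", "word", m.group()));
            } else if (m.group("number") != null) {
                out.add(new Tok("Token", "number", m.group()));
            } else if (m.group("space") != null) {
                out.add(new Tok("SpaceToken", "space", m.group()));
            } else {
                out.add(new Tok("Token", "punctuation", m.group()));
            }
        }
        return out;
    }
}
```

Calling SimpleTokeniser.tokenise("800,000 US dollars") yields seven entries in the same
order, and with the same kinds, as the list above.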

6.10.2 Step 2 - List Lookup


The gazetteer lists are then searched to find all occurrences of matching words in the text.
It finds the following match for the string `US dollars':

Lookup, minorType = post_amount, majorType = currency_unit

6.10.3 Step 3 - Grammar Rules


The grammar rule for money is then invoked. The macro MILLION_BILLION recognises
any of the strings `m', `million', `b', `billion'. Since none of these exist in the text, it passes
onto the next macro. The AMOUNT_NUMBER macro recognises a number, optionally
followed by any number of sequences of the form `dot or comma plus number', followed
by an optional space and an optional MILLION_BILLION. In this case, `800,000' will be
recognised. Finally, the rule Money1 is invoked. This recognises the string identified by the
AMOUNT_NUMBER macro, followed by an optional space, followed by a unit of currency
(as determined by the gazetteer). In this case, `US dollars' has been identified as a currency
unit, so the rule Money1 recognises the entire string `800,000 US dollars'. Following the rule,
it will be annotated as a Number entity of type Money:

Number, kind = money, rule = Money1
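
For readers more familiar with regular expressions than with JAPE, the sequence matched by
Money1 can be approximated over raw text as below. This is only an analogy: JAPE matches
over annotations rather than characters, the class MoneyRuleSketch is hypothetical, and the
list of currency units passed in stands in for the gazetteer's currency_unit Lookups.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Regex analogy of the Money1 rule; illustrative only, not how JAPE works internally. */
class MoneyRuleSketch {

    /** Builds a pattern mirroring AMOUNT_NUMBER, an optional space and a currency unit. */
    static Pattern money1(List<String> currencyUnits) {
        // stand-in for {Lookup.majorType == currency_unit}
        String units = currencyUnits.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // number, then any number of "dot or comma plus number" groups
        String number = "\\d+(?:[.,]\\d+)*";
        // optional space and optional MILLION_BILLION
        String millionBillion = " ?(?:million|billion|m|b)?";
        return Pattern.compile(number + millionBillion + " ?(?:" + units + ")");
    }

    /** Returns the first span the rule would cover, or null if nothing matches. */
    static String findMoney(String text, List<String> currencyUnits) {
        Matcher m = money1(currencyUnits).matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

With the unit list containing `US dollars', findMoney applied to a sentence containing
`800,000 US dollars' returns exactly that span, mirroring the Number annotation above.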


Part II

GATE for Advanced Users

Chapter 7

GATE Embedded

7.1 Quick Start with GATE Embedded

Embedding GATE-based language processing in other applications using GATE Embedded
(the GATE API) is straightforward:

• add $GATE_HOME/bin/gate.jar and the JAR files in $GATE_HOME/lib to the Java
CLASSPATH ($GATE_HOME is the GATE root directory)

• tell Java that the GATE Unicode Kit is an extension:
    -Djava.ext.dirs=$GATE_HOME/lib/ext
N.B. This is only necessary for GUI applications that need to support Unicode text
input; other applications such as command line or web applications don't generally
need GUK.

• initialise GATE with gate.Gate.init();

• program to the framework API.

For example, this code will create the ANNIE extraction system:
import java.io.File;
import gate.Gate;
import gate.creole.ANNIEConstants;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;

// initialise the GATE library
Gate.init();

// load ANNIE as an application from a gapp file
SerialAnalyserController controller = (SerialAnalyserController)
    PersistenceManager.loadObjectFromFile(new File(new File(
        Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),
        ANNIEConstants.DEFAULT_FILE));


If you want to use resources from any plugins, you need to load the plugins before calling
createResource:
import java.io.File;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;

Gate.init();

// need Tools plugin for the Morphological analyser
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Tools").toURI().toURL()
);

...

ProcessingResource morpher = (ProcessingResource)
    Factory.createResource("gate.creole.morph.Morph");

Instead of creating your processing resources individually using the Factory, you can create
your application in GATE Developer, save it using the `save application state' option (see
Section 3.9.3), and then load the saved state from your code. This will automatically reload
any plugins that were loaded when the state was saved; you do not need to load them
manually.

import java.io.File;
import gate.CorpusController;
import gate.Gate;
import gate.util.persistence.PersistenceManager;

Gate.init();

CorpusController controller = (CorpusController)
    PersistenceManager.loadObjectFromFile(new File("savedState.xgapp"));

// loadObjectFromUrl is also available

There are many examples of using GATE Embedded available at:
http://gate.ac.uk/wiki/code-repository/.

See Section 2.3 for details of the system properties GATE uses to find its configuration files.

7.2 Resource Management in GATE Embedded


As outlined earlier, GATE defines three different types of resources:

Language Resources : (LRs) entities that hold linguistic data.

Processing Resources : (PRs) entities that process data.

Visual Resources : (VRs) components used for building graphical interfaces.

These resources are collectively named CREOLE [1] resources.

[1] CREOLE stands for Collection of REusable Objects for Language Engineering

All CREOLE resources have some associated meta-data in the form of an entry in a special
XML file named creole.xml. The most important role of that meta-data is to specify the set
of parameters that a resource understands, which of them are required and which not, whether
they have default values and what those are. The valid parameters for a resource are described
in the resource's section of its creole.xml file or in Java annotations on the resource class
- see Section 4.7.

All resource types have creation-time parameters that are used during the initialisation
phase. Processing Resources also have run-time parameters that get used during execution
(see Section 7.5 for more details).

Controllers are used to define GATE applications and have the role of controlling the
execution flow (see Section 7.6 for more details).

This section describes how to create and delete CREOLE resources as objects in a running
Java virtual machine. This process involves using GATE's Factory class [2], and, in the case
of LRs, may also involve using a DataStore.

CREOLE resources are Java Beans; creation of a resource object involves using a default
constructor, then setting parameters on the bean, then calling an init() method. The
Factory takes care of all this, makes sure that the GATE Developer GUI is told about what
is happening (when GUI components exist at runtime), and also takes care of restoring LRs
from DataStores. A programmer using GATE Embedded should never call the
constructor of a resource: always use the Factory!

Creating a resource involves providing the following information:

• fully qualified class name for the resource. This is the only required value. For
all the rest, defaults will be used if actual values are not provided.

• values for the creation time parameters.†

• initial values for resource features.† For an explanation of features see Section 7.4.2.

• a name for the new resource.

† Parameters and features need to be provided in the form of a GATE Feature Map which is
essentially a Java Map (java.util.Map) implementation; see Section 7.4.2 for more details
on Feature Maps.

Creating a resource via the Factory involves passing values for any create-time parameters
that require setting to the Factory's createResource method. If no parameters are passed,
the defaults are used. So, for example, the following code creates a default ANNIE
part-of-speech tagger:

// make the ANNIE plugin's resources known to GATE
Gate.getCreoleRegister().registerDirectories(new File(
    Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());
FeatureMap params = Factory.newFeatureMap(); // empty map: default params
ProcessingResource tagger = (ProcessingResource)
    Factory.createResource("gate.creole.POSTagger", params);

[2] Fully qualified name: gate.Factory

Note that if the resource created here had any parameters that were both mandatory and
had no default value, the createResource call would throw an exception. In this case, all
the information needed to create a tagger is available in default values given in the tagger's
XML definition (in plugins/ANNIE/creole.xml):

<RESOURCE>
<NAME>ANNIE POS Tagger</NAME>
<COMMENT>Mark Hepple's Brill-style POS tagger</COMMENT>
<CLASS>gate.creole.POSTagger</CLASS>
<PARAMETER NAME="document"
COMMENT="The document to be processed"
RUNTIME="true">gate.Document</PARAMETER>
....
<PARAMETER NAME="rulesURL" DEFAULT="resources/heptag/ruleset"
COMMENT="The URL for the ruleset file"
OPTIONAL="true">java.net.URL</PARAMETER>
</RESOURCE>

Here the two parameters shown are either `runtime' parameters, which are set before a PR is
executed, or have a default value (in this case the default rules file is distributed with GATE
itself).

When creating a Document, however, the URL of the source for the document must be
provided [3]. For example:

URL u = new URL("http://gate.ac.uk/hamish/");
FeatureMap params = Factory.newFeatureMap();
params.put("sourceUrl", u);
Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", params);

Note that the document created here is transient: when you quit the JVM the document
will no longer exist. If you want the document to be persistent, you need to store it in a
DataStore (see Section 7.4.5).

[3] Alternatively a string giving the document source may be provided.

Apart from createResource() methods with different signatures, Factory also provides
some shortcuts for common operations, listed in table 7.1.

Method                            Purpose
newFeatureMap()                   Creates a new Feature Map (as used in the example
                                  above).
newDocument(String content)       Creates a new GATE Document starting from a String
                                  value that will be used to generate the document
                                  content.
newDocument(URL sourceUrl)        Creates a new GATE Document using the text pointed
                                  to by a URL to generate the document content.
newDocument(URL sourceUrl,        Same as above but allows the specification of an
String encoding)                  encoding to be used while downloading the document
                                  content.
newCorpus(String name)            Creates a new GATE Corpus with a specified name.

Table 7.1: Factory Operations

GATE maintains various data structures that allow the retrieval of loaded resources. When
a resource is no longer required, it needs to be removed from those structures in order to
remove all references to it, thus making it a candidate for garbage collection. This is achieved
using the deleteResource(Resource res) method on Factory.

Simply removing all references to a resource from the user code will NOT be enough to
make the resource eligible for garbage collection. Not calling Factory.deleteResource() will
lead to memory leaks!

7.3 Using CREOLE Plugins

As shown in the examples above, in order to use a CREOLE resource the relevant CREOLE
plugin must be loaded. Processing Resources, Visual Resources and Language Resources
other than Document, Corpus and DataStore all require that the appropriate plugin is first
loaded. When using Document, Corpus or DataStore, you do not need to first load a plugin.
The API calls listed in table 7.2 are relevant to working with CREOLE plugins.

If you are writing a GATE Embedded application and have a single resource class
that will only be used from your embedded code (and so does not need to be
distributed as a complete plugin), and all the configuration for that resource is provided
as Java annotations on the class, then it is possible to register the class with the
CreoleRegister at runtime without needing to package it in a JAR and provide a
creole.xml file. You can pass the Class object representing your resource class to the
Gate.getCreoleRegister().registerComponent() method and then create instances of
the resource in the usual way using Factory.createResource. Note that resources cannot
be registered this way in the developer GUI, and cannot be included in saved application
states (see section 7.9 below).

Class gate.Gate
Method                              Purpose
public static void                  adds the plugin to the list of known plugins.
addKnownPlugin(URL pluginURL)
public static void                  tells the system to `forget' about one previously
removeKnownPlugin(URL pluginURL)    known directory. If the specified directory was
                                    loaded, it will be unloaded as well - i.e. all the
                                    metadata relating to resources defined by this
                                    directory will be removed from memory.
public static void                  adds a new directory to the list of plugins that
addAutoloadPlugin(URL pluginUrl)    are loaded automatically at start-up.
public static void                  tells the system to remove a plugin URL from the
removeAutoloadPlugin(URL            list of plugins that are loaded automatically at
pluginURL)                          system start-up. This will be reflected in the
                                    user's configuration data file.

Class gate.CreoleRegister
Method                              Purpose
public void                         loads a new CREOLE directory. The new plugin is
registerDirectories(URL             added to the list of known plugins if not already
directoryUrl)                       there.
public void                         registers a single @CreoleResource annotated class
registerComponent(Class<? extends   without the need for a creole.xml file.
Resource> cls)
public void removeDirectory(URL     unloads a loaded CREOLE plugin.
directory)

Table 7.2: Calls Relevant to CREOLE Plugins


7.4 Language Resources


This section describes the implementation of documents and corpora in GATE.

7.4.1 GATE Documents


Documents are modelled as content plus annotations (see Section 7.4.4) plus features (see
Section 7.4.2).

The content of a document can be any implementation of the gate.DocumentContent
interface; the features are <attribute, value> pairs stored in a Feature Map. Attributes are
String values while the values can be any Java object.

The annotations are grouped in sets (see section 7.4.3). A document has a default
(anonymous) annotation set and any number of named annotation sets.

Documents are defined by the gate.Document interface and there is also a provided
implementation:

gate.corpora.DocumentImpl : transient document. Can be stored persistently through
Java serialisation.

Main Document functions are presented in table 7.3.

7.4.2 Feature Maps


All CREOLE resources as well as the Controllers and the annotations can have attached
meta-data in the form of Feature Maps.

A Feature Map is a Java Map (i.e. it implements the java.util.Map interface) and holds
<attribute-name, attribute-value> pairs. The attribute names are Strings while the values
can be any Java Objects.

The use of non-Serialisable objects as values is strongly discouraged.

Feature Maps are created using the gate.Factory.newFeatureMap() method.

The actual implementation for FeatureMaps is provided by the
gate.util.SimpleFeatureMapImpl class.

Objects that have features in GATE implement the gate.util.FeatureBearer interface
which has only the two accessor methods for the object features: FeatureMap
getFeatures() and void setFeatures(FeatureMap features).

Content Manipulation
Method                              Purpose
DocumentContent getContent()        Gets the Document content.
void edit(Long start, Long end,     Modifies the Document content.
DocumentContent replacement)
void setContent(DocumentContent     Replaces the entire content.
newContent)

Annotations Manipulation
Method                              Purpose
public AnnotationSet                Returns the default annotation set.
getAnnotations()
public AnnotationSet                Returns a named annotation set.
getAnnotations(String name)
public Map                          Returns all the named annotation sets.
getNamedAnnotationSets()
void removeAnnotationSet(String     Removes a named annotation set.
name)

Input Output
Method                              Purpose
String toXml()                      Serialises the Document in XML format.
String toXml(Set                    Generates XML from a set of annotations only,
aSourceAnnotationSet, boolean       trying to preserve the original format of the
includeFeatures)                    file used to create the document.

Table 7.3: gate.Document methods.


Getting a particular feature from an object



// Needs: import gate.FeatureMap; import gate.util.FeatureBearer;
Object obj = ...;
String featureName = "length";
if (obj instanceof FeatureBearer) {
  // only FeatureBearers carry a feature map, and the map itself may be null
  FeatureMap features = ((FeatureBearer) obj).getFeatures();
  Object value = (features == null) ? null : features.get(featureName);
}
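
Since non-Serialisable feature values are strongly discouraged, it can be useful to check a map before attaching it to a resource. The sketch below uses a plain java.util.Map as a stand-in for a FeatureMap (which, as noted above, implements java.util.Map). The class and method names are illustrative only and are not part of GATE.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper, not part of the GATE API.
class FeatureValueCheck {

    // Returns the names of all features whose value is not Serializable.
    // Null values are treated as acceptable.
    static List<String> nonSerializableFeatures(Map<?, ?> features) {
        List<String> offending = new ArrayList<>();
        for (Map.Entry<?, ?> entry : features.entrySet()) {
            Object value = entry.getValue();
            if (value != null && !(value instanceof Serializable)) {
                offending.add(String.valueOf(entry.getKey()));
            }
        }
        return offending;
    }

    public static void main(String[] args) {
        Map<String, Object> features = new LinkedHashMap<>();
        features.put("length", 42);            // Integer is Serializable
        features.put("kind", "word");          // String is Serializable
        features.put("handle", new Object());  // a bare Object is not
        System.out.println(nonSerializableFeatures(features)); // prints [handle]
    }
}
```

The test only looks at the declared type of each value; a Serializable collection that holds non-Serialisable elements would still pass it.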

7.4.3 Annotation Sets


A GATE document can have one or more annotation layers: an anonymous one (also called the
default) and as many named ones as necessary.

An annotation layer is organised as a Directed Acyclic Graph (DAG) in which the nodes
are particular locations (anchors) in the document content and the arcs are annotations
reaching from the location indicated by the start node to the one pointed to by the
end node (see Figure 7.1 for an illustration). Because of the graph metaphor, the annotation
layers are also called annotation graphs. In terms of Java objects, the annotation layers are
represented using the Set paradigm as defined by the collections library and they are hence
named annotation sets. The terms annotation layer, graph and set are interchangeable
and refer to the same concept when used in this book.

Figure 7.1: The Annotation Graph model.

An annotation set holds a number of annotations and maintains a series of indices in order
to provide fast access to the contained annotations.

The GATE Annotation Sets are defined by the gate.AnnotationSet interface and there is
a default implementation provided:

gate.annotation.AnnotationSetImpl: annotation set implementation used by transient
documents.

The annotation sets are created by the document as required. The first time a particular
annotation set is requested from a document it will be transparently created if it doesn't
exist.
Method
    Purpose

Annotations Manipulation

Integer add(Long start, Long end, String type, FeatureMap features)
    Creates a new annotation between two offsets, adds it to this set and returns its id.

Integer add(Node start, Node end, String type, FeatureMap features)
    Creates a new annotation between two nodes, adds it to this set and returns its id.

boolean remove(Object o)
    Removes an annotation from this set.

Nodes

Node firstNode()
    Gets the node with the smallest offset.

Node lastNode()
    Gets the node with the largest offset.

Node nextNode(Node node)
    Gets the first node that is relevant for this annotation set and which has an offset
    larger than that of the node provided.

Set implementation

Iterator iterator()

int size()

Table 7.4: gate.AnnotationSet methods (general purpose).

Tables 7.4 and 7.5 list the most commonly used Annotation Set methods.

Iterating from left to right over all annotations of a given type

AnnotationSet annSet = ...;
String type = "Person";
// Get all person annotations
AnnotationSet persSet = annSet.get(type);
// Sort the annotations by offset
List<Annotation> persList = new ArrayList<Annotation>(persSet);
Collections.sort(persList, new gate.util.OffsetComparator());
// Iterate
Iterator<Annotation> persIter = persList.iterator();
while (persIter.hasNext()) {
  ...
}

Method
    Purpose

Searching

AnnotationSet get(Long offset)
    Select annotations by offset. This returns the set of annotations whose start node is the
    least such that it is greater than or equal to offset. If a positional index doesn't exist
    it is created. If there are no nodes at or beyond the offset parameter then it will return
    null.

AnnotationSet get(Long startOffset, Long endOffset)
    Select annotations by offset. This returns the set of annotations that overlap totally or
    partially with the interval defined by the two provided offsets. The result will include
    all the annotations that either:
    • start before the start offset and end strictly after it
    • start at a position between the start and the end offsets

AnnotationSet get(String type)
    Returns all annotations of the specified type.

AnnotationSet get(Set types)
    Returns all annotations of the specified types.

AnnotationSet get(String type, FeatureMap constraints)
    Selects annotations by type and features.

Set getAllTypes()
    Gets a set of java.lang.String objects representing all the annotation types present in
    this annotation set.

AnnotationSet getContained(Long startOffset, Long endOffset)
    Select annotations contained within an interval, i.e. annotations that lie entirely
    between the two offsets.

AnnotationSet getCovering(String neededType, Long startOffset, Long endOffset)
    Select annotations of the given type that completely span the range.

Table 7.5: gate.AnnotationSet methods (searching).

7.4.4 Annotations

An annotation is a form of meta-data attached to a particular section of document content.
The connection between the annotation and the content it refers to is made by means of two
pointers that represent the start and end locations of the covered content. An annotation
must also have a type (or a name) which is used to create classes of similar annotations,
usually linked together by their semantics.
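
The offset-based searches in Table 7.5 can be read as interval tests on the start and end offsets of each annotation. The sketch below spells out those tests on plain long values. It is a simplified model written for this explanation: the class and method names are hypothetical, and GATE itself answers these queries through its indices rather than by testing each annotation. For the overlap case, the second bullet point of the table is read as "starts at or after the start offset and before the end offset", which is an assumption about the boundaries.

```java
// Simplified model of the offset tests behind Table 7.5; not GATE code.
class OffsetTests {

    // get(startOffset, endOffset): total or partial overlap with the interval.
    static boolean overlaps(long annStart, long annEnd, long start, long end) {
        boolean startsBeforeAndReachesIn = annStart < start && annEnd > start;
        boolean startsInside = annStart >= start && annStart < end;
        return startsBeforeAndReachesIn || startsInside;
    }

    // getContained(startOffset, endOffset): the annotation lies within the interval.
    static boolean contained(long annStart, long annEnd, long start, long end) {
        return annStart >= start && annEnd <= end;
    }

    // getCovering(type, startOffset, endOffset): the annotation spans the whole interval.
    static boolean covering(long annStart, long annEnd, long start, long end) {
        return annStart <= start && annEnd >= end;
    }

    public static void main(String[] args) {
        // An annotation over offsets 10..20, queried with the interval 15..30.
        System.out.println(overlaps(10, 20, 15, 30));   // true
        System.out.println(contained(10, 20, 15, 30));  // false
        System.out.println(covering(10, 20, 15, 30));   // false
    }
}
```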

An Annotation is defined by:

start node: a location in the document content defined by an offset.
end node: a location in the document content defined by an offset.
type: a String value.
features: (see Section 7.4.2).
ID: an Integer value. All annotation IDs are unique inside an annotation set.

In GATE Embedded, annotations are defined by the gate.Annotation interface and implemented
by the gate.annotation.AnnotationImpl class. Annotations exist only as members
of annotation sets (see Section 7.4.3) and they should not be directly created by means of a
constructor. Their creation should always be delegated to the containing annotation set.

7.4.5 GATE Corpora


A corpus in GATE is a Java List (i.e. an implementation of java.util.List) of documents.
GATE corpora are defined by the gate.Corpus interface and the following implementations
are available:

gate.corpora.CorpusImpl: used for transient corpora.

gate.corpora.SerialCorpusImpl: used for persistent corpora that are stored in a serial
datastore (i.e. as a directory in a file system).

Apart from implementing the standard List methods, a Corpus also implements the
methods in Table 7.6.

Creating a corpus from all XML files in a directory

Corpus corpus = Factory.newCorpus("My XML Files");
File directory = ...;
ExtensionFileFilter filter = new ExtensionFileFilter("XML files", "xml");
// File.toURL() is deprecated; going through toURI() escapes special characters
URL url = directory.toURI().toURL();
// null encoding, do not recurse into sub-directories
corpus.populate(url, filter, null, false);
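
The filter argument of populate (see Table 7.6) is typed as FileFilter. Assuming this is the standard java.io.FileFilter interface, any implementation could be supplied in place of ExtensionFileFilter. The sketch below is a hypothetical filter that accepts files by extension, ignoring case. It uses only the Java standard library and does not call GATE.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

// Hypothetical helper; GATE ships its own gate.util.ExtensionFileFilter.
class SuffixFilters {

    // Builds a filter accepting any file whose name ends with "." + extension,
    // compared case-insensitively. Only the name is inspected, so the file
    // does not need to exist.
    static FileFilter byExtension(String extension) {
        final String suffix = "." + extension.toLowerCase(Locale.ROOT);
        return file -> file.getName().toLowerCase(Locale.ROOT).endsWith(suffix);
    }

    public static void main(String[] args) {
        FileFilter xmlOnly = byExtension("xml");
        System.out.println(xmlOnly.accept(new File("report.XML"))); // true
        System.out.println(xmlOnly.accept(new File("notes.txt")));  // false
    }
}
```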

Method
    Purpose

String getDocumentName(int index)
    Gets the name of a document in this corpus.

List getDocumentNames()
    Gets the names of all the documents in this corpus.

void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories)
    Fills this corpus with documents created on the fly from selected files in a directory.
    Uses a FileFilter to select which files will be used and which will be ignored. A simple
    file filter based on extensions is provided in the GATE distribution
    (gate.util.ExtensionFileFilter).

void populate(URL singleConcatenatedFile, String documentRootElement, String encoding,
    int numberOfDocumentsToExtract, String documentNamePrefix, DocType documentType)
    Fills the provided corpus with documents extracted from the provided single concatenated
    file. Uses the content between the start and end of the element as specified by
    documentRootElement for each document. The parameter documentType specifies if the
    resulting files are html, xml or of any other type. The user can also restrict the number
    of documents to extract by providing the relevant value for the
    numberOfDocumentsToExtract parameter.

Table 7.6: gate.Corpus methods.

Using a DataStore

Assuming that you have a DataStore already open called myDataStore, this code will ask
the datastore to take over persistence of your document, and to synchronise the memory
representation of the document with the disk storage:

// adopt returns a LanguageResource, hence the cast
Document persistentDoc = (Document) myDataStore.adopt(doc, mySecurity);
myDataStore.sync(persistentDoc);

When you want to restore a document (or other LR) from a datastore, you make the same
createResource call to the Factory as for the creation of a transient resource, but this time
you tell it the datastore the resource came from, and the ID of the resource in that datastore:

URL u = ....; // URL of a serial datastore directory
SerialDataStore sds = new SerialDataStore(u.toString());
sds.open();

// getLrIds returns a list of LR Ids, so we get the first one
Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0);

// we need to tell the factory about the LR's ID in the data
// store, and about which datastore it is in - we do this
// via a feature map:
FeatureMap features = Factory.newFeatureMap();
features.put(DataStore.LR_ID_FEATURE_NAME, lrId);
features.put(DataStore.DATASTORE_FEATURE_NAME, sds);

// read the document back
Document doc = (Document)
    Factory.createResource("gate.corpora.DocumentImpl", features);

7.5 Processing Resources


Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers,
generators or ngram modellers.

They are created using the GATE Factory in a manner similar to Language Resources. Besides
the creation-time parameters they also have a set of run-time parameters that are set
by the system just before executing them.

Analysers are a particular type of processing resource in the sense that they always have a
document and a corpus among their run-time parameters.

The most used methods for Processing Resources are presented in Table 7.7.

Method
    Purpose

void setParameterValue(String paramaterName, Object parameterValue)
    Sets the value for a specified parameter. Method inherited from gate.Resource.

void setParameterValues(FeatureMap parameters)
    Sets the values for more parameters in one step. Method inherited from gate.Resource.

Object getParameterValue(String paramaterName)
    Gets the value of a named parameter of this resource. Method inherited from
    gate.Resource.

Resource init()
    Initialise this resource, and return it. Method inherited from gate.Resource.

void reInit()
    Reinitialises the processing resource. After calling this method the resource should be
    in the state it is after calling init. If the resource depends on external resources
    (such as rules files) then the resource will re-read those resources. If the data used
    to create the resource has changed since the resource has been created then the resource
    will change too after calling reInit().

void execute()
    Starts the execution of this Processing Resource.

void interrupt()
    Notifies this PR that it should stop its execution as soon as possible.

boolean isInterrupted()
    Checks whether this PR has been interrupted since the last time its
    Executable.execute() method was called.

Table 7.7: gate.ProcessingResource methods.

7.6 Controllers

Controllers are used to create GATE applications. A Controller handles a set of Processing
Resources and can execute them following a particular strategy. GATE provides a series of
serial controllers (i.e. controllers that run their PRs in sequence):

gate.creole.SerialController: a serial controller that takes any kind of PRs.

gate.creole.SerialAnalyserController: a serial controller that only accepts Language
Analysers as member PRs.

gate.creole.ConditionalSerialController: a serial controller that accepts all types of
PRs and that allows the inclusion or exclusion of member PRs from the execution
chain according to certain run-time conditions (currently features on the document
being processed are used).

gate.creole.ConditionalSerialAnalyserController: a serial controller that only accepts
Language Analysers and that allows the conditional run of member PRs.

gate.creole.RealtimeCorpusController: a SerialAnalyserController that allows you
to specify graceful and timeout parameters (times in milliseconds). If processing for a
document takes longer than the amount of time specified for graceful, then the controller
will attempt to gracefully end it by sending an interrupt request to it. If the graceful
parameter is `-1' then no attempt to gracefully end it is made. If processing takes
longer than the amount of time specified for the timeout parameter, it will be forcibly
terminated and the controller will move on to the next document. The parameter
suppressExceptions controls if time-outs and other exceptions will be suppressed or
passed on to the caller: if this parameter is set to `true', then any exception or a
timeout will simply cause the controller to move on to the next document rather than
failing the entire corpus processing. If the parameter is set to `false' both time-outs
and exceptions will be passed on as exceptions to the caller.
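
The graceful and timeout parameters of RealtimeCorpusController describe a two-stage time limit: first a polite interrupt, then giving up. The sketch below models that pattern with plain Java threads. It is not the controller's source code, and all names in it are hypothetical. Two simplifications are assumptions of the sketch: the timeout is counted from the start of the task, and a task that overruns is merely abandoned on a daemon thread rather than forcibly terminated. The suppressExceptions behaviour is not modelled.

```java
// Sketch of a two-stage time limit; not the RealtimeCorpusController source.
class TwoStageLimit {

    enum Outcome { COMPLETED, STOPPED_AFTER_INTERRUPT, ABANDONED }

    // Thread.join(0) would wait forever, so non-positive waits are skipped.
    private static void waitFor(Thread worker, long millis) {
        if (millis <= 0) {
            return;
        }
        try {
            worker.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the caller's interrupt flag
        }
    }

    // Runs the task on a daemon thread. After gracefulMillis it is interrupted;
    // if it is still running timeoutMillis after the start, it is abandoned.
    // A negative gracefulMillis (or one not below the timeout) skips the interrupt.
    static Outcome run(Runnable task, long gracefulMillis, long timeoutMillis) {
        Thread worker = new Thread(task);
        worker.setDaemon(true);          // an abandoned task must not block JVM exit
        worker.start();
        if (gracefulMillis >= 0 && gracefulMillis < timeoutMillis) {
            waitFor(worker, gracefulMillis);
            if (!worker.isAlive()) {
                return Outcome.COMPLETED;
            }
            worker.interrupt();          // polite request to stop
            waitFor(worker, timeoutMillis - gracefulMillis);
            return worker.isAlive() ? Outcome.ABANDONED : Outcome.STOPPED_AFTER_INTERRUPT;
        }
        waitFor(worker, timeoutMillis);
        return worker.isAlive() ? Outcome.ABANDONED : Outcome.COMPLETED;
    }

    // A cooperative task: sleeps for the given time, returns early when interrupted.
    static Runnable sleeper(final long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                // stop as soon as an interrupt arrives
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(run(sleeper(0), 1000, 2000));   // finishes at once
        System.out.println(run(sleeper(5000), 50, 2000));  // stops when interrupted
        System.out.println(run(sleeper(1000), -1, 100));   // never interrupted, abandoned
    }
}
```

The interrupt stage only helps if the task cooperates, which is why Table 7.7 lists interrupt() and isInterrupted() among the Processing Resource methods.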

Additionally there is a scriptable controller provided by the Groovy plugin. See section 7.17.3
for details.

Creating an ANNIE application and running it over a corpus


// load the ANNIE plugin
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

// create a serial analyser controller to run ANNIE with
SerialAnalyserController annieController =
    (SerialAnalyserController) Factory.createResource(
        "gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(),
        Factory.newFeatureMap(), "ANNIE");

// load each PR as defined in ANNIEConstants
for (int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
  // use default parameters
  FeatureMap params = Factory.newFeatureMap();
  ProcessingResource pr = (ProcessingResource)
      Factory.createResource(ANNIEConstants.PR_NAMES[i], params);
  // add the PR to the pipeline controller
  annieController.add(pr);
} // for each ANNIE PR

// Tell ANNIE's controller about the corpus you want to run on
Corpus corpus = ...;
annieController.setCorpus(corpus);
// Run ANNIE
annieController.execute();

7.7 Modelling Relations between Annotations


Most text processing tasks in GATE model metadata associated with text snippets as annotations.
In some cases, however, it is useful to have another layer of metadata, associated
with the annotations themselves. One such case is the modelling of relations between annotations.
One typical example of relations between annotations is that of co-reference. Two
annotations of type Person may be referring to the same actual person; in this case the two
annotations are said to be co-referring.

Starting with version 7.1, GATE Embedded supports the representation of relations between
annotations. A relation set is associated with, and accessed via, an annotation set. All
members of a relation must be either annotations from the associated annotation set or
other relations within the same set. The classes supporting relations can be found in the
gate.relations package.

A relation, as described by the gate.relations.Relation interface, is defined by the following
values:

id: a unique ID that identifies the relation. IDs for both relations and annotations are
generated from the same source, guaranteeing that not only is the ID unique among
the relations, but also among all annotations from the same document.
type: a String value describing the type of the relation (e.g. 'coref' for co-reference relations).
members: an int[] array, containing the annotation IDs for the annotations referred to by
the relation. Note that relations are not guaranteed to be symmetric, so the ordering
in the members array is relevant.
featureMap: a FeatureMap that, as with Annotations, allows the storing of an arbitrary
set of features for the relation.
userData: an optional Serializable value, which can be used to associate any arbitrary data
with a relation.

Relation sets are modelled by the gate.relations.RelationSet class. The principal API calls
published by this class include:

• public Relation addRelation(String type, int... members)
  Creates a new relation with the specified type and member annotations. Returns the
  newly created relation object.

• public void addRelation(Relation rel)
  Adds to this relation set an externally-created relation. This method is provided to
  support the use of custom implementations of the gate.relations.Relation interface.

• public boolean deleteRelation(Relation relation)
  Deletes the specified relation from this relation set. Any relations which include this
  relation as a member will also be deleted (recursively) to ensure the set remains
  internally consistent.

• public Collection<Relation> get()
  Returns all the relations within this set.

• public Relation get(Integer id)
  Returns the relation with the given ID.

• public Collection<Relation> getRelations(String type)
  Gets all relations with the specified type contained in this relation set.

• public Collection<Relation> getRelations(int... members)
  Gets relations by members. Gets all relations which have the specified members on the
  specified positions. The required members are represented as an int[], where each
  required annotation ID is placed on its required position. For unconstrained positions,
  the constant value gate.relations.RelationSet.ANY should be used.

• public Collection<Relation> getRelations(String type, int... members)
  Gets all relations with the specified type and members.

• public Collection<Relation> getReferencing(int id)
  Gets all the relations which reference an annotation or relation with the specified ID.

• public int getMaximumArity()
  Gets the maximum arity (number of members) for all relations in this relation set.
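
The positional matching used by getRelations(int... members) can be pictured with a small stand-alone model. The sketch below is a toy version, not the gate.relations code. The ANY constant here is a local stand-in whose value is an arbitrary choice, and the treatment of relations that are shorter or longer than the pattern is an assumption made for the sake of the example.

```java
// Toy model of positional member matching; not the gate.relations code.
class MemberPattern {

    // Stand-in for gate.relations.RelationSet.ANY; the value is an arbitrary choice.
    static final int ANY = Integer.MIN_VALUE;

    // True if every constrained position of the pattern holds the same ID in members.
    // Assumptions: a relation shorter than the pattern never matches, and
    // positions beyond the end of the pattern are unconstrained.
    static boolean matches(int[] members, int... pattern) {
        if (members.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (pattern[i] != ANY && pattern[i] != members[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] contained = {7, 3};  // e.g. token 7 is contained in sentence 3
        System.out.println(matches(contained, ANY, 3)); // true: anything in sentence 3
        System.out.println(matches(contained, 3, ANY)); // false: order matters
    }
}
```

Because relations are not guaranteed to be symmetric, the pattern (ANY, 3) and the pattern (3, ANY) select different relations.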

Included next is a simple code snippet that illustrates the RelationSet API. The function of
the example code is to:

• find all the Sentence annotations inside a document;

• for each sentence, find all the contained Token annotations;

• for each sentence and contained token, add a new relation named contained between
  the token and the sentence.

// get the document
Document doc = Factory.newDocument(
    new File("documents/file.xml").toURI().toURL());
// get the annotation set
AnnotationSet annSet = doc.getAnnotations();
// get the relations set
RelationSet relSet = annSet.getRelations();
// get all sentences
AnnotationSet sentences = annSet.get(
    ANNIEConstants.SENTENCE_ANNOTATION_TYPE);
for (Annotation sentence : sentences) {
  // get all the tokens
  AnnotationSet tokens = annSet.get(
      ANNIEConstants.TOKEN_ANNOTATION_TYPE,
      sentence.getStartNode().getOffset(),
      sentence.getEndNode().getOffset());
  for (Annotation token : tokens) {
    // for each sentence and token, add the contained relation
    relSet.addRelation("contained",
        new int[] {token.getId(), sentence.getId()});
  }
}

7.8 Duplicating a Resource


Sometimes, particularly in a multi-threaded application, it is useful to be able to create an
independent copy of an existing PR, controller or LR. The obvious way to do this is to call
createResource again, passing the same class name, parameters, features and name, and
for many resources this will do the right thing. However there are some resources for which
this may be insufficient (e.g. controllers, which also need to duplicate their PRs), unsafe
(if a PR uses temporary files, for instance), or simply inefficient. For example for a large
gazetteer this would involve loading a second copy of the lists into memory and compiling
them into a second identical state machine representation, but a much more efficient way to
achieve the same behaviour would be to use a SharedDefaultGazetteer (see section 13.10),
which can re-use the existing state machine.

The GATE Factory provides a duplicate method which takes an existing resource instance
and creates and returns an independent copy of the resource. By default it uses the algorithm
described above, extracting the parameter values from the template resource and calling
createResource to create a duplicate (the actual algorithm is slightly more complicated
than this, see the following section). However, if a particular resource type knows of a better
way to duplicate itself it can implement the CustomDuplication interface, and provide
its own duplicate method which the factory will use instead of performing the default
duplication algorithm. A caller who needs to duplicate an existing resource can simply call
Factory.duplicate to obtain a copy, which will be constructed in the appropriate way
depending on the resource type.

Note that the duplicate object returned by Factory.duplicate will not necessarily be of the
same class as the original object. However the contract of Factory.duplicate specifies that
where the original object implements any of a list of core GATE interfaces, the duplicate
can be assumed to implement the same ones: if you duplicate a DefaultGazetteer the
result may not be an instance of DefaultGazetteer but it is guaranteed to implement the
Gazetteer interface.

Full details of how to implement a custom duplicate method in your own resource type
can be found in the JavaDoc documentation for the CustomDuplication interface and the
Factory.duplicate method.

7.8.1 Sharable properties


The @Sharable annotation (in the gate.creole.metadata package) provides a way for a
resource to mark JavaBean properties whose values should be shared between a resource
and its duplicates. Typical examples of objects that could be marked sharable include
large or expensive-to-create data structures that are created by a resource at init time and
subsequently used in a read-only fashion, a thread-safe cache of some sort, or state used to
create globally unique identifiers (such as an AtomicInteger that is incremented each time a
new ID is required). Clearly any objects that are shared between different resource instances
must be accessed by all instances in a way that is thread-safe or appropriately synchronized.

The sharable property must have the standard public getter and setter methods, with the
@Sharable annotation applied to the setter (see footnote 4). The same setter may be marked
both as a sharable property and as a @CreoleParameter but the two are not related: sharable
properties that are not parameters and parameters that are not sharable are both allowed
and both have uses in different circumstances. The use of sharable properties removes the
need to implement custom duplication in many simple cases.
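
A resource with a sharable property might be laid out along the following lines. This is a plain-Java stand-in written for illustration: it does not use the GATE classes, the names are hypothetical, and in a real resource the setter would carry the @Sharable annotation. The point it shows is that init only builds the expensive structure when none has been injected beforehand.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for a resource with a sharable property; not GATE code.
class LookupResource {

    static int buildCount = 0;           // counts expensive builds, for the demo only

    private Map<String, String> index;   // large structure, read-only after init

    public Map<String, String> getIndex() {
        return index;
    }

    // In a real GATE resource this setter would be annotated with @Sharable.
    public void setIndex(Map<String, String> index) {
        this.index = index;
    }

    // Builds the index only if none was injected before init was called.
    public LookupResource init() {
        if (index == null) {
            Map<String, String> built = new HashMap<>();
            built.put("sheffield", "location");
            index = Collections.unmodifiableMap(built);  // safe to share read-only
            buildCount++;
        }
        return this;
    }

    public static void main(String[] args) {
        LookupResource original = new LookupResource().init();
        LookupResource copy = new LookupResource();
        copy.setIndex(original.getIndex());  // what a duplication step would inject
        copy.init();
        System.out.println(copy.getIndex() == original.getIndex()); // true
        System.out.println(buildCount);                             // 1
    }
}
```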

The default duplication algorithm in full is thus as follows:

1. Extract the values of all init-time parameters from the original resource.

2. Recursively duplicate any of these values that are themselves GATE Resources, except
for parameters that are marked as @Sharable (i.e. parameters that are marked sharable
are copied directly to the duplicate resource without being duplicated themselves).

3. Add to this parameter map any other sharable properties of the original resource
(including those that are not parameters).

4. Extract the features of the original resource and recursively duplicate any values in
this map that are themselves resources, as above.

5. Call Factory.createResource passing the class name of the original resource, the
duplicated/shared parameters and the duplicated features.

   • this will result in a call to the new resource's init method, with all sharable
     properties (parameters and non-parameters) populated with their values from
     the old resource. The init method must recognise this and adapt its behaviour
     appropriately, i.e. not re-creating sharable data structures that have already been
     injected.

6. If the original resource is a PR, extract its runtime parameter values (except those that
are marked as sharable, which have already been dealt with above), and recursively
duplicate any resource values in the map.

7. Set the resulting runtime parameter values on the duplicate resource.

Footnote 4: In the common case where the getter/setter pair are simple accessors for a private
field whose name matches the Java Bean property name, the annotation may be applied to the
field rather than to the setter.

The duplication process keeps track of any recursively-duplicated resources, such that if the
same original resource is used in several places (e.g. when duplicating a controller with several
JAPE transducer PRs that all refer to the same ontology LR in their runtime parameters)
then the same duplicate (ontology) will be used in the same places in the duplicated resource
(i.e. all the duplicate transducers will refer to the same ontology LR, which will be a duplicate
of the original one).
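
The bookkeeping described in the last paragraph amounts to keeping a map from each original resource to its duplicate for the length of one duplication request. The sketch below shows that idea on a toy object graph. It is not GATE's implementation, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy resource graph used to illustrate duplicate tracking; not GATE code.
class ToyResource {
    final String name;
    final List<ToyResource> parameters = new ArrayList<>();

    ToyResource(String name) {
        this.name = name;
    }

    // Entry point: one tracking map per top-level duplication request.
    static ToyResource duplicate(ToyResource original) {
        return duplicate(original, new IdentityHashMap<ToyResource, ToyResource>());
    }

    private static ToyResource duplicate(ToyResource original,
                                         Map<ToyResource, ToyResource> done) {
        ToyResource existing = done.get(original);
        if (existing != null) {
            return existing;           // already copied: reuse the same duplicate
        }
        ToyResource copy = new ToyResource(original.name);
        done.put(original, copy);      // register before recursing
        for (ToyResource p : original.parameters) {
            copy.parameters.add(duplicate(p, done));
        }
        return copy;
    }

    public static void main(String[] args) {
        ToyResource ontology = new ToyResource("ontology");
        ToyResource jape1 = new ToyResource("jape1");
        ToyResource jape2 = new ToyResource("jape2");
        jape1.parameters.add(ontology);
        jape2.parameters.add(ontology);
        ToyResource controller = new ToyResource("controller");
        controller.parameters.add(jape1);
        controller.parameters.add(jape2);

        ToyResource dup = duplicate(controller);
        ToyResource o1 = dup.parameters.get(0).parameters.get(0);
        ToyResource o2 = dup.parameters.get(1).parameters.get(0);
        System.out.println(o1 == o2);        // true: both copies share one duplicate
        System.out.println(o1 == ontology);  // false: it is a copy, not the original
    }
}
```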

7.9 Persistent Applications


GATE Embedded allows the persistent storage of applications in a format based on XML
serialisation. This is particularly useful for application management and distribution. A
developer can save the state of an application when they stop working on its design and
continue developing it in a later session. When the application reaches maturity it can be
deployed to the client site using the same method.

When an application (i.e. a Controller) is saved, GATE will actually only save the values for
the parameters used to create the Processing Resources that are contained in the application.
When the application is reloaded, all the PRs will be re-created using the saved parameters.

Many PRs use external resources (files) to define their behaviour and, in most cases, these
files are identified using URLs. During the saving process, all the URLs are converted to relative
URLs based on the location of the application file. This way, if the resources are packaged
together with the application file, the entire application can be reliably moved to a different
location.
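
The effect of storing relative locations can be tried out with java.net.URI alone. The sketch below is only an illustration of the idea: GATE's persistence code is not shown, the class and the file paths are made up, and URI.relativize only handles resources at or below the application directory.

```java
import java.net.URI;

// Illustration with java.net.URI only; not GATE's persistence code.
class RelativeLocations {

    // Directory that contains the given application file.
    static URI directoryOf(URI applicationFile) {
        return applicationFile.resolve(".");
    }

    // Expresses a resource location relative to the application file's directory.
    // URI.relativize returns the resource unchanged if it is not below that directory.
    static URI toRelative(URI applicationFile, URI resource) {
        return directoryOf(applicationFile).relativize(resource);
    }

    // Turns a stored relative location back into an absolute one.
    static URI toAbsolute(URI applicationFile, URI relative) {
        return directoryOf(applicationFile).resolve(relative);
    }

    public static void main(String[] args) {
        URI saved = URI.create("file:/home/dev/app/app.xml");
        URI rules = URI.create("file:/home/dev/app/resources/main.jape");
        URI relative = toRelative(saved, rules);
        System.out.println(relative);  // resources/main.jape

        // After moving the whole directory, the same relative location still resolves.
        URI moved = URI.create("file:/srv/deploy/app/app.xml");
        System.out.println(toAbsolute(moved, relative));
        // file:/srv/deploy/app/resources/main.jape
    }
}
```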

API access to application saving and loading is provided by means of two static methods on
the gate.util.persistence.PersistenceManager class, listed in table 7.8.

Method
    Purpose

public static void saveObjectToFile(Object obj, File file)
    Saves the data needed to re-create the provided GATE object to the specified file. The
    Object provided can be any type of Language or Processing Resource or a Controller. The
    procedures may work for other types of objects as well (e.g. it supports most Collection
    types).

public static Object loadObjectFromFile(File file)
    Parses the file specified (which needs to be a file created by the above method) and
    creates the necessary object(s) as specified by the data in the file. Returns the root
    of the object tree.

Table 7.8: Application Saving and Loading

Saving and loading a GATE application

// Where to save the application?
File file = ...;
// What to save?
Controller theApplication = ...;

// save
gate.util.persistence.PersistenceManager.
    saveObjectToFile(theApplication, file);
// delete the application
Factory.deleteResource(theApplication);
theApplication = null;

// [...]
// load the application back; loadObjectFromFile returns Object, hence the cast
theApplication = (Controller) gate.util.persistence.PersistenceManager.
    loadObjectFromFile(file);

7.10 Ontologies
Starting from GATE version 3.1, support for ontologies has been added. Ontologies are
nominally Language Resources but are quite different from documents and corpora and are
detailed in chapter 14.

Classes related to ontologies are to be found in the gate.creole.ontology package and its
sub-packages. The top level package defines an abstract API for working with ontologies
while the sub-packages contain concrete implementations. A client program should only use
the classes and methods defined in the API and never any of the classes or methods from
the implementation packages.

The entry point to the ontology API is the gate.creole.ontology.Ontology interface
which is the base interface for all concrete implementations. It provides methods for accessing
the class hierarchy, listing the instances and the properties.

Ontology implementations are available through plugins. Before an ontology language
resource can be created using the gate.Factory and before any of the classes and methods in
the API can be used, one of the implementing ontology plugins must be loaded. For details
see chapter 14.

7.11 Creating a New Annotation Schema


An annotation schema (see Section 3.4.6) can be brought inside GATE through the creole.xml
file. By using the AUTOINSTANCE element, one can create instances of resources defined
in creole.xml. The gate.creole.AnnotationSchema (which is the Java representation of an
annotation schema file) initializes with some predefined annotation definitions (annotation
schemas) as specified by the GATE team.

Example from GATE's internal creole.xml (in src/gate/resources/creole):

<!-- Annotation schema -->


<RESOURCE>
<NAME>Annotation schema</NAME>
<CLASS>gate.creole.AnnotationSchema</CLASS>
<COMMENT>An annotation type and its features</COMMENT>
<PARAMETER NAME="xmlFileUrl" COMMENT="The url to the definition file"
SUFFIXES="xml;xsd">java.net.URL</PARAMETER>
<AUTOINSTANCE>
<PARAM NAME ="xmlFileUrl" VALUE="schema/AddressSchema.xml" />
</AUTOINSTANCE>
<AUTOINSTANCE>
<PARAM NAME ="xmlFileUrl" VALUE="schema/DateSchema.xml" />
</AUTOINSTANCE>
<AUTOINSTANCE>
<PARAM NAME ="xmlFileUrl" VALUE="schema/FacilitySchema.xml" />
</AUTOINSTANCE>
<!-- etc. -->
</RESOURCE>

In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one
must use the gate.Factory class:

// FeatureMap is an interface, so instances come from the Factory
FeatureMap params = Factory.newFeatureMap();
params.put("xmlFileUrl", annotSchemaFile.toURI().toURL());
AnnotationSchema annotSchema = (AnnotationSchema)
    Factory.createResource("gate.creole.AnnotationSchema", params);

Note: All the elements and their values must be written in lower case, as XML is case
sensitive and so is the parser that GATE uses for XML Schema.

In order to write XML Schema definitions, the ones defined in GATE
(resources/creole/schema) can be used as a model, or the user can have a look at
http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the
elements used.

Some examples of annotation schemas are given in Section 5.4.1.

7.12 Creating a New CREOLE Resource


To create a new resource you need to:

• write a Java class that implements GATE's beans model;

• compile the class, and any others that it uses, into a Java Archive (JAR) file;

• write some XML configuration data for the new resource;

• tell GATE the URL of the new JAR and XML files.

GATE Developer helps you with this process by creating a set of directories and files that
implement a basic resource, including a Java code file and a Makefile. This process is called
`bootstrapping'.

For example, let's create a new component called GoldFish, which will be a Processing
Resource that looks for all instances of the word `fish' in a document and adds an annotation
of type `GoldFish'.
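
Before looking at the generated code, the matching step that such a component needs can be sketched on its own. The class below is a stand-alone illustration with hypothetical names. It only collects start and end offsets, whereas the actual Processing Resource would add an annotation for each span. It also matches plain substrings, so `goldfish' and `fishing' count as well.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of the matching step; a real PR would add an annotation
// for each span instead of collecting offsets.
class FishFinder {

    // Returns {start, end} offset pairs for every occurrence of word in text.
    static List<long[]> findAll(String text, String word) {
        List<long[]> spans = new ArrayList<>();
        if (word.isEmpty()) {
            return spans;              // avoid looping forever on an empty word
        }
        int from = text.indexOf(word);
        while (from >= 0) {
            spans.add(new long[] { from, from + word.length() });
            from = text.indexOf(word, from + word.length());
        }
        return spans;
    }

    public static void main(String[] args) {
        for (long[] span : findAll("One fish, two fish", "fish")) {
            System.out.println(span[0] + "-" + span[1]);
        }
        // prints 4-8 and then 14-18
    }
}
```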

First start GATE Developer (see Section 2.2). From the `Tools' menu select `BootStrap
Wizard', which will pop up the dialogue in gure 7.2. The meaning of the data entry elds:

• The `resource name' will be displayed when GATE Developer loads the resource, and
will be the name of the directory the resource lives in. For our example: GoldFish.

• `Resource package' is the Java package that the class representing the resource will be
created in. For our example: sheffield.creole.example.

• `Resource type' must be one of Language, Processing or Visual Resource. In this
case we're going to process documents (and add annotations to them), so we select
ProcessingResource.

• `Implementing class name' is the name of the Java class that represents the resource.
For our example: GoldFish.

Figure 7.2: BootStrap Wizard Dialogue

• The `interfaces implemented' field allows you to add other interfaces (e.g.
gate.creole.ControllerAwarePR; see footnote 5) that you would like your new resource to
implement. In this case we just leave the default (which is to implement the
gate.ProcessingResource interface).

• The last field selects the directory that you want the new resource created in. For our
example: z:/tmp.

Now we need to compile the class and package it into a JAR file. The bootstrap wizard
creates an Ant build file that makes this very easy: so long as you have Ant set up properly,
you can simply run

ant jar

This will compile the Java source code and package the resulting classes into GoldFish.jar.
If you don't have your own copy of Ant, you can use the one bundled with GATE.
Supposing your GATE is installed at /opt/gate-5.0-snapshot, you can use
/opt/gate-5.0-snapshot/bin/ant jar to build.
You can now load this resource into GATE; see Section 3.7. The default Java code that was
created for our GoldFish resource looks like this:
/*
 * GoldFish.java
 *
 * You should probably put a copyright notice here. Why not use the
 * GNU licence? (See http://www.gnu.org/.)
 *
 * hamish, 26/9/2001
 *
 * $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $
 */

package sheffield.creole.example;

import java.util.*;
import gate.*;
import gate.creole.*;
import gate.creole.metadata.*; // needed for @CreoleResource
import gate.util.*;

/**
 * This class is the implementation of the resource GOLDFISH.
 */
@CreoleResource(name = "GoldFish",
    comment = "Add a descriptive comment about this resource")
public class GoldFish extends AbstractProcessingResource
    implements ProcessingResource {


} // class GoldFish

Footnote 5: see Section 4.4.

The default XML configuration for GoldFish looks like this:

<!-- creole.xml GoldFish -->


<!-- hamish, 26/9/2001 -->
<!-- $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $ -->

<CREOLE-DIRECTORY>
<JAR SCAN="true">GoldFish.jar</JAR>
</CREOLE-DIRECTORY>

The directory structure containing these files is shown in figure 7.3. GoldFish.java lives
in the src/sheffield/creole/example directory. creole.xml and build.xml are in the
top GoldFish directory. The lib directory is for libraries; the classes directory is where
Java class files are placed; the doc directory is for documentation. These last two, plus
GoldFish.jar, are created by Ant.

This process has the advantage that it creates a complete source tree and build structure
for the component, and the disadvantage that it creates a complete source tree and build
structure for the component. If you already have a source tree, you will need to chop out
the bits you need from the new tree (in this case GoldFish.java and creole.xml) and copy
them into your existing one.
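
The generated GoldFish class is empty, so the behaviour described at the start of this section still has to be written. The sketch below shows only the core search logic in plain Java, with no GATE dependency: FishFinder and findFish are hypothetical names introduced for illustration. In a real PR this logic would typically sit in the execute() method, and each span found would be added to the document's annotation set as an annotation of type `GoldFish'.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds the character spans a GoldFish PR would annotate. */
class FishFinder {

  /** Returns {start, end} offset pairs (end exclusive) for each occurrence of "fish". */
  static List<long[]> findFish(String text) {
    List<long[]> spans = new ArrayList<>();
    String word = "fish";
    int from = 0;
    while (true) {
      int start = text.indexOf(word, from);
      if (start < 0) {
        break; // no further occurrences
      }
      spans.add(new long[] { start, start + word.length() });
      from = start + word.length();
    }
    return spans;
  }

  public static void main(String[] args) {
    // Prints 4-8 and 14-18
    for (long[] span : findFish("One fish, two fish.")) {
      System.out.println(span[0] + "-" + span[1]);
    }
  }
}
```

Note that this is a plain substring search, so it also matches `fish' inside longer words such as `goldfish'; a production PR would more likely work from Token annotations.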

See the example code at http://gate.ac.uk/wiki/code-repository/.



Figure 7.3: BootStrap directory tree

7.13 Adding Support for a New Document Format

In order to add a new document format, one needs to extend the gate.DocumentFormat
class and to implement an abstract method called:

public void unpackMarkup(Document doc) throws
    DocumentFormatException

This method is supposed to implement the functionality of each format reader and to create
annotations on the document. Finally the document's old content will be replaced with a
new one containing only the text between markups.

To add a new textual reader, extend gate.corpora.TextualDocumentFormat
and override the unpackMarkup(doc) method.

This class needs to follow the Java beans specification because it will be
instantiated by GATE using the Factory.createResource() method.

The init() method that one needs to add and implement is very important because this is
where the reader defines how it can be selected by GATE. What one needs to do is
to add some specific information to certain static maps defined in the DocumentFormat class,
which will be used at reader detection time.

After that, a definition of the reader is placed in one's creole.xml file and the
reader will be available to GATE.

For the rest of the section we present a complete three-step example of adding such a reader.
The reader we describe here is an XML reader.

Step 1
Create a new class called XmlDocumentFormat that extends
gate.corpora.TextualDocumentFormat.

Step 2
Implement the unpackMarkup(Document doc) method, which performs the required functionality
for the reader. Add the XML detection information in the init() method:

public Resource init() throws ResourceInstantiationException {
  // Register XML mime type
  MimeType mime = new MimeType("text", "xml");
  // Register the class handler for this mime type
  mimeString2ClassHandlerMap.put(mime.getType() + "/" + mime.getSubtype(),
      this);
  // Register the mime type with mime string
  mimeString2mimeTypeMap.put(mime.getType() + "/" + mime.getSubtype(),
      mime);
  // Register file suffixes for this mime type
  suffixes2mimeTypeMap.put("xml", mime);
  suffixes2mimeTypeMap.put("xhtm", mime);
  suffixes2mimeTypeMap.put("xhtml", mime);
  // Register magic numbers for this mime type
  magic2mimeTypeMap.put("<?xml", mime);
  // Set the mimeType for this language resource
  setMimeType(mime);
  return this;
} // init()

More details about the information from those maps can be found in Section 5.5.1.
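
To give a feel for what the suffix and magic-number maps are for, here is a deliberately simplified model of a lookup over two such maps, written in plain Java. FormatLookup and its methods are hypothetical, and the order of the checks (suffix first, then magic number) is an assumption made for this sketch; GATE's real detection rules are the ones described in Section 5.5.1.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Simplified, hypothetical model of suffix and magic-number lookup. */
class FormatLookup {
  private final Map<String, String> suffixToMime = new HashMap<>();
  private final Map<String, String> magicToMime = new HashMap<>();

  void registerSuffix(String suffix, String mime) {
    suffixToMime.put(suffix, mime);
  }

  void registerMagic(String magic, String mime) {
    magicToMime.put(magic, mime);
  }

  /** Tries the file suffix first, then any registered magic number in the content. */
  String detect(String fileName, String content) {
    int dot = fileName.lastIndexOf('.');
    if (dot >= 0) {
      String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
      String mime = suffixToMime.get(suffix);
      if (mime != null) {
        return mime;
      }
    }
    for (Map.Entry<String, String> entry : magicToMime.entrySet()) {
      if (content.contains(entry.getKey())) {
        return entry.getValue();
      }
    }
    return null; // unknown format
  }
}
```

With "xml" registered as a suffix and "<?xml" as a magic number, both a file named report.xml and an unnamed stream beginning with an XML declaration would be mapped to the same mime type.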

Step 3
Add the following CREOLE definition to the creole.xml document.

<RESOURCE>
<NAME>My XML Document Format</NAME>
<CLASS>mypackage.XmlDocumentFormat</CLASS>
<AUTOINSTANCE/>
<PRIVATE/>
</RESOURCE>

More information on the operation of GATE's document format analysers may be found in
Section 5.5.

7.14 Using GATE Embedded in a Multithreaded Environment

GATE Embedded can be used in multithreaded applications, so long as you observe a few
restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your
application, typically in the application startup phase before any concurrent processing threads
are started.
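
If several threads might reach the initialisation code, one simple way to honour the `exactly once' rule is a small synchronized guard. The following is a minimal sketch in plain Java; OnceOnlyInit is a hypothetical class, and the Runnable stands in for whatever start-up work your application does (calling Gate.init(), loading plugins and so on).

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical guard that runs an initialisation action at most once. */
class OnceOnlyInit {
  private boolean done = false;

  /** Runs the action on the first successful call only; later calls return immediately. */
  synchronized void ensure(Runnable action) {
    if (!done) {
      action.run();
      done = true; // only reached if the action did not throw
    }
  }

  public static void main(String[] args) {
    OnceOnlyInit guard = new OnceOnlyInit();
    AtomicInteger calls = new AtomicInteger();
    // In a GATE application the action would call Gate.init() and load plugins.
    Runnable init = calls::incrementAndGet;
    guard.ensure(init);
    guard.ensure(init);
    System.out.println("init ran " + calls.get() + " time(s)"); // init ran 1 time(s)
  }
}
```

Because the flag is only set after the action returns normally, a failed initialisation can be retried by a later call.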

Secondly, you must not make calls that affect the global state of GATE (e.g. loading or
unloading plugins) in more than one thread at a time. Again, you would typically load all
the plugins your application requires at initialisation time. It is safe to create instances of
resources in multiple threads concurrently.

Thirdly, it is important to note that individual GATE processing resources, language
resources and controllers are by design not thread safe: it is not possible to use a single
instance of a controller/PR/LR in multiple threads at the same time. For a well written
resource, however, it should be possible to use several different instances of the same
resource at once, each in a different thread. When writing your own resource classes you
should bear the following in mind, to ensure that your resource will be usable in this way.

• Avoid static data. Where possible, you should avoid using static fields in your class,
and you should try and take all configuration data via the CREOLE parameters you
declare in your creole.xml file. System properties may be appropriate for truly static
configuration, such as the location of an external executable, but even then it is
generally better to stick to CREOLE parameters: a user may wish to use two different
instances of your PR, each talking to a different executable.

• Read parameters at the correct time. Init-time parameters should be read in the init()
(and reInit()) method, and for processing resources runtime parameters should be
read at each execute().

• Use temporary files correctly. If your resource makes use of external temporary files
you should create them using File.createTempFile() at init or execute time, as
appropriate. Do not use hardcoded file names for temporary files.

• If there are objects that can be shared between different instances of your resource,
make sure these objects are accessed either read-only, or in a thread-safe way. In
particular you must be very careful if your resource can take other resource instances
as init or runtime parameters (e.g. the Flexible Gazetteer, Section 13.6).

Of course, if you are writing a PR that is simply a wrapper around an external library that
imposes these kinds of limitations there is only so much you can do. If your resource cannot
be made safe you should document this fact clearly.
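
As an illustration of the advice on static data and temporary files, the sketch below keeps its scratch file in an instance field and lets File.createTempFile() choose a unique name, so two instances never collide. ScratchFileOwner is a hypothetical class, not part of GATE; in a real resource the creation step would more naturally belong in init() and the deletion in cleanup().

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical resource that keeps its scratch file per instance, not static. */
class ScratchFileOwner {
  private final File scratch; // instance field: never shared between instances

  ScratchFileOwner() {
    try {
      // The JVM picks a unique name, so there is no hardcoded file name to clash on.
      scratch = File.createTempFile("goldfish", ".tmp");
      scratch.deleteOnExit(); // safety net if cleanup() is never called
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  File getScratch() {
    return scratch;
  }

  /** Call when the resource is no longer needed. */
  void cleanup() {
    scratch.delete();
  }
}
```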

All the standard ANNIE PRs are safe when independent instances are used in different
threads concurrently, as are the standard transient document, transient corpus and controller
classes. A typical pattern of development for a multithreaded GATE-based application is:

• Develop your GATE processing pipeline in GATE Developer.

• Save your pipeline as a .gapp file.

• In your application's initialisation phase, load n copies of the pipeline using
PersistenceManager.loadObjectFromFile() (see the Javadoc documentation for
details), or load the pipeline once and then make copies of it using Factory.duplicate
as described in section 7.8, and either give one copy to each thread or store them in a
pool (e.g. a LinkedList).

• When you need to process a text, get one copy of the pipeline from the pool, and
return it to the pool when you have finished processing.
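
The borrow-and-return pattern in the last two steps can be sketched with a blocking queue. This is a minimal illustration in plain Java rather than GATE code: PipelinePool is a hypothetical class, its type parameter T stands in for a loaded pipeline, and the factory passed to the constructor stands in for a call such as PersistenceManager.loadObjectFromFile() or Factory.duplicate.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; T stands in for a loaded GATE pipeline. */
class PipelinePool<T> {
  private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();

  /** Fills the pool with n copies produced by the supplied factory. */
  PipelinePool(int n, Supplier<T> factory) {
    for (int i = 0; i < n; i++) {
      idle.add(factory.get());
    }
  }

  /** Borrows a copy, runs the task, and always returns the copy to the pool. */
  <R> R withPipeline(Function<T, R> task) {
    T pipeline;
    try {
      pipeline = idle.take(); // blocks until a copy is free
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for a pipeline", e);
    }
    try {
      return task.apply(pipeline);
    } finally {
      idle.add(pipeline); // returned even if the task throws
    }
  }

  /** Number of copies currently waiting in the pool. */
  int idleCount() {
    return idle.size();
  }
}
```

Returning the pipeline in a finally block matters: if processing fails with an exception and the copy is not returned, the pool slowly drains and callers eventually block for ever.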

Alternatively you can use the Spring Framework as described in the next section to handle
the pooling for you.

7.15 Using GATE Embedded within a Spring Application

GATE Embedded provides helper classes to allow GATE resources to be created and
managed by the Spring framework. For Spring 2.0 or later, GATE Embedded provides a custom
namespace handler that makes them extremely easy to use. To use this namespace, put the
following declarations in your bean definition file:

<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:gate="http://gate.ac.uk/ns/spring"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://gate.ac.uk/ns/spring
http://gate.ac.uk/ns/spring.xsd">

You can have Spring initialise GATE:

<gate:init gate-home="WEB-INF" user-config-file="WEB-INF/user.xml">
  <gate:preload-plugins>
    <value>WEB-INF/ANNIE</value>
    <value>http://example.org/gate-plugin</value>
  </gate:preload-plugins>
</gate:init>

The gate-home, user-config-file, etc. and the <value> elements under <gate:preload-plugins>
are interpreted as Spring resource paths. If the value is not an absolute URL then Spring
will resolve the path in an appropriate way for the type of application context: in a web
application they are taken as being relative to the web app root, and you would typically
use locations within WEB-INF as shown in the example above. To use an absolute path
for gate-home it is not sufficient to use a leading slash (e.g. /opt/gate); for
backwards-compatibility reasons Spring will still resolve this relative to your web application.
Instead you must specify it as a full URL, i.e. file:/opt/gate.

The attributes gate-home, plugins-home, site-config-file, user-config-file and
builtin-creole-dir refer directly to the similarly-named setter methods on gate.Gate.
Any of these that are not specified will take their usual GATE Embedded default values
(i.e. gate-home will be the parent of the directory containing gate.jar, plugins-home will
be the plugins subdirectory of GATE home, user-config-file will be .gate.xml in the
current user's home directory, etc.). Therefore it is highly recommended to specify at least
user-config-file in order to isolate your application from the configuration used by GATE
Developer. Alternatively, you can specify run-in-sandbox="true" (see the JavaDocs) which
will tell GATE not to attempt to read any configuration from files at startup.

<gate:preload-plugins> specifies CREOLE plugins that should be loaded after GATE
has been initialised. An alternative way to specify extra plugins is to provide separate
<gate:extra-plugin> elements, for example:

<gate:init gate-home="WEB-INF"
user-config-file="WEB-INF/user.xml" />

<gate:extra-plugin>WEB-INF/ANNIE</gate:extra-plugin>

You can freely mix the two styles: nested <gate:preload-plugins> definitions are
processed first, followed by all the <gate:extra-plugin> definitions found in the application
context. This is useful if, for example, you are providing additional configuration as a
separate bean definition file from the one containing the main <gate:init> definition and need
to load extra plugins without editing this main definition.

To create a GATE resource, use the <gate:resource> element.

<gate:resource id="sharedOntology" scope="singleton"
    resource-class="gate.creole.ontology.owlim.OWLIMOntologyLR">
  <gate:parameters>
    <entry key="rdfXmlURL">
      <gate:url>WEB-INF/ontology.rdf</gate:url>
    </entry>
  </gate:parameters>
  <gate:features>
    <entry key="ontologyVersion" value="0.1.3" />
    <entry key="mainOntology">
      <value type="java.lang.Boolean">true</value>
    </entry>
  </gate:features>
</gate:resource>

The children of <gate:parameters> are Spring <entry/> elements, just as you would write
when configuring a bean property of type Map<String,Object>. <gate:url> provides a
way to construct a java.net.URL from a resource path as discussed above. If it is possible
to resolve the resource path as a file: URL then this form will be preferred, as there are a
number of areas within GATE which work better with file: URLs than with other types
of URL (for example plugins that run external processes, or that use a URL parameter to
point to a directory in which they will create new files).

A note about types: The <gate:parameters> and <gate:features> elements define GATE
FeatureMaps. When using the simple <entry key="..." value="..." /> form, the entry
values will be treated as strings; Spring can convert strings into many other types of object
using the standard Java Beans property editor mechanism, but since a FeatureMap can hold
any kind of values you must use an explicit <value type="...">...</value> to tell Spring
what type the value should be.

There is an additional twist for <gate:parameters>: GATE has its own internal logic
to convert strings to other types required for resource parameters (see the discussion of
default parameter values in section 4.7.1). So for parameter values you have a choice: you
can either use an explicit <value type="..."> to make Spring do the conversion, or you
can pass the parameter value as a string and let GATE do the conversion. For resource
parameters whose type is java.net.URL, if you pass a string value that is not an absolute
URL (starting file:, http:, etc.) then GATE will treat the string as a path relative to
the creole.xml file of the plugin that defines the resource type whose parameter you are
setting. If this is not what you intended then you should use <gate:url> to cause Spring to
resolve the path to a URL before passing it to GATE. For example, for a JAPE transducer,
<entry key="grammarURL" value="grammars/main.jape" /> would resolve to something
like file:/path/to/webapp/WEB-INF/plugins/ANNIE/grammars/main.jape, whereas

<entry key="grammarURL">
<gate:url>grammars/main.jape</gate:url>
</entry>

would resolve to file:/path/to/webapp/grammars/main.jape.
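
The difference between the two results comes down to which base URL the relative path is resolved against. The snippet below reproduces both resolutions with the standard java.net.URI class; it is only a demonstration of relative URL resolution, not of the code GATE or Spring actually run, and RelativeUrlDemo is a hypothetical class using the example paths from the text.

```java
import java.net.URI;

/** Demonstrates relative path resolution against two different base URLs. */
class RelativeUrlDemo {

  /** Resolves a relative path against a base URL using standard URI rules. */
  static String resolve(String base, String relative) {
    return URI.create(base).resolve(relative).toString();
  }

  public static void main(String[] args) {
    // Base is the plugin's creole.xml: the plain string value case.
    // Prints file:/path/to/webapp/WEB-INF/plugins/ANNIE/grammars/main.jape
    System.out.println(resolve(
        "file:/path/to/webapp/WEB-INF/plugins/ANNIE/creole.xml",
        "grammars/main.jape"));

    // Base is the web application root: the <gate:url> case.
    // Prints file:/path/to/webapp/grammars/main.jape
    System.out.println(resolve("file:/path/to/webapp/", "grammars/main.jape"));
  }
}
```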

You can load a GATE saved application with:

<gate:saved-application location="WEB-INF/application.gapp" scope="prototype">
  <gate:customisers>
    <gate:set-parameter pr-name="custom transducer" name="ontology"
        ref="sharedOntology" />
  </gate:customisers>
</gate:saved-application>

`Customisers' are used to customise the application after it is loaded. In the example above,
we load a singleton copy of an ontology which is then shared between all the separate
instances of the (prototype) application. The <gate:set-parameter> customiser accepts
all the same ways to provide a value as the standard Spring <property> element (a "value"
or "ref" attribute, or a sub-element - <value>, <list>, <bean>, <gate:resource> . . . ).

The <gate:add-pr> customiser provides support for the case where most of the application
is in a saved state, but we want to create one or two extra PRs with Spring (maybe to inject
other Spring beans as init parameters) and add them to the pipeline.

<gate:saved-application ...>
<gate:customisers>
<gate:add-pr add-before="OrthoMatcher" ref="myPr" />
</gate:customisers>
</gate:saved-application>

By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline,
but an add-before or add-after attribute can be used to specify the name of a PR before
(or after) which this PR should be placed. Alternatively, an index attribute places the PR
at a specific (0-based) index into the pipeline. The PR to add can be specified either as a
`ref' attribute, or with a nested <bean> or <gate:resource> element.

7.15.1 Duplication in Spring


The <gate:saved-application> example shown earlier defines the application as a
prototype-scoped bean, which
means the saved application state will be loaded afresh each time the bean is fetched from the
bean factory (either explicitly using getBean or implicitly when it is injected as a dependency
of another bean). However in many cases it is better to load the application once and then
duplicate it as required (as described in section 7.8), as this allows resources to optimise
their memory usage, for example by sharing a single in-memory representation of a large
gazetteer list between several instances of the gazetteer PR. This approach is supported by
the <gate:duplicate> tag.

<gate:duplicate id="theApp">
<gate:saved-application location="/WEB-INF/application.xgapp" />
</gate:duplicate>

The <gate:duplicate> tag acts like a prototype bean definition, in that each time it is
fetched or injected it will call Factory.duplicate to create a new duplicate of its template
resource (declared as a nested element or referenced by the template-ref attribute).
However the tag also keeps track of all the duplicate instances it has returned over its lifetime,
and will ensure they are released (using Factory.deleteResource) when the Spring context
is shut down.

The <gate:duplicate> tag also supports customisers, which will be applied to the
newly-created duplicate resource before it is returned. This is subtly different from applying the
customisers to the template resource itself, which would cause them to be applied once to
the original resource before it is first duplicated.

Finally, <gate:duplicate> takes an optional boolean attribute return-template. If set to
false (or omitted, as this is the default behaviour), the tag always returns a duplicate: the
original template resource is used only as a template and is not made available for use. If set
to true, the first time the bean defined by the tag is injected or fetched, the original template
resource is returned. Subsequent uses of the tag will return duplicates. Generally speaking,
it is only safe to set return-template="true" when there are no customisers, and when
the duplicates will all be created up-front before any of them are used. If the duplicates will
be created asynchronously (e.g. with a dynamically expanding pool, see below) then it is
possible that, for example, a template application may be duplicated in one thread whilst it
is being executed by another thread, which may lead to unpredictable behaviour.

7.15.2 Spring pooling


In a multithreaded application it is vital that individual GATE resources are not used in
more than one thread at the same time. Because of this, multithreaded applications that use
GATE Embedded often need to use some form of pooling to provide thread-safe access to
GATE components. This can be managed by hand, but the Spring framework has built-in
tools to support transparent pooling of Spring-managed beans. Spring can create a pool of
identical objects, then expose a single proxy object (offering the same interface) for use by
clients. Each method call on the proxy object will be routed to an available member of the
pool in such a way as to guarantee that each member of the pool is accessed by no more
than one thread at a time.

Since the pooling is handled at the level of method calls, this approach is not used to create a
pool of GATE resources directly: making use of a GATE PR typically involves a sequence
of method calls (at least setDocument(doc), execute() and setDocument(null)), and
creating a pooling proxy for the resource may result in these calls going to different members
of the pool. Instead the typical use of this technique is to define a helper object with a
single method that internally calls the GATE API methods in the correct sequence, and then
create a pool of these helpers. The interface gate.util.DocumentProcessor and its
associated implementation gate.util.LanguageAnalyserDocumentProcessor are useful for this.
The DocumentProcessor interface defines a processDocument method that takes a GATE
document and performs some processing on it. LanguageAnalyserDocumentProcessor
implements this interface using a GATE LanguageAnalyser (such as a saved corpus pipeline
application) to do the processing. A pool of LanguageAnalyserDocumentProcessor
instances can be exposed through a proxy which can then be called from several threads.
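
The idea of wrapping the whole call sequence in one method can be shown without any GATE classes. In the sketch below Analyser is a hypothetical stand-in for the parts of a GATE analyser used here (with a String in place of a GATE document), and SingleCallProcessor is a hypothetical helper in the spirit of LanguageAnalyserDocumentProcessor; it is not a copy of that class.

```java
/** Hypothetical stand-in for the parts of a GATE analyser used here. */
interface Analyser {
  void setDocument(String doc);

  void execute();
}

/** Wraps the set/execute/clear sequence in one method so the helper can be pooled. */
class SingleCallProcessor {
  private final Analyser analyser;

  SingleCallProcessor(Analyser analyser) {
    this.analyser = analyser;
  }

  /** One method call per document: a per-call pooling proxy cannot split the sequence. */
  void processDocument(String doc) {
    analyser.setDocument(doc);
    try {
      analyser.execute();
    } finally {
      analyser.setDocument(null); // always release the document
    }
  }
}
```

Because the three steps happen inside a single call, they are guaranteed to run against the same pooled instance.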

The machinery to implement this is all built into Spring, but the configuration typically
required to enable it is quite fiddly, involving at least three co-operating bean definitions.
Since the technique is so useful with GATE Embedded, GATE provides a special syntax to
configure pooling in a simple way. Given the <gate:duplicate id="theApp"> definition
from the previous section we can create a DocumentProcessor proxy that can handle up to
five concurrent requests as follows:

<bean id="processor"
class="gate.util.LanguageAnalyserDocumentProcessor">
<property name="analyser" ref="theApp" />
<gate:pooled-proxy max-size="5" />
</bean>

The <gate:pooled-proxy> element decorates a singleton bean definition. It converts the
original definition to prototype scope and replaces it with a singleton proxy delegating to a
pool of instances of the prototype bean. The pool parameters are controlled by attributes
of the <gate:pooled-proxy> element, the most important ones being:

max-size: The maximum size of the pool. If more than this number of threads try to call
methods on the proxy at the same time, the others will (by default) block until an
object is returned to the pool.

initial-size: The default behaviour of Spring's pooling tools is to create instances in the
pool on demand (up to the max-size). This attribute instead causes initial-size
instances to be created up-front and added to the pool when it is first created.

when-exhausted-action-name: What to do when the pool is exhausted (i.e. there are
already max-size concurrent calls in progress and another one arrives). Should be set
to one of WHEN_EXHAUSTED_BLOCK (the default, meaning block the excess requests until
an object becomes free), WHEN_EXHAUSTED_GROW (create a new object anyway, even
though this pushes the pool beyond max-size) or WHEN_EXHAUSTED_FAIL (cause the
excess calls to fail with an exception).

Any of these attributes can make use of the usual ${...} property placeholder mechanism.
Many more options are available, corresponding to the properties of the underlying Spring
TargetSource in use (by default CommonsPoolTargetSource). These allow you, for example,
to configure a pool that dynamically grows and shrinks as necessary, releasing objects
that have been idle for a set amount of time. See the JavaDoc documentation of
CommonsPoolTargetSource (and the documentation for Apache commons-pool) for full details.
If you wish to use a different TargetSource implementation from the default
CommonsPoolTargetSource, you can provide a target-source-class attribute with the
fully-qualified class name of the class you wish to use (which must, of course, implement the
TargetSource interface).

Note that the <gate:pooled-proxy> technique is not tied to GATE in any way; it is simply
an easy way to configure standard Spring beans and can be used with any bean that needs
to be pooled, not just objects that make use of GATE.

7.15.3 Further reading


These custom elements all define various factory beans. For full details, see the JavaDocs for
gate.util.spring (the factory beans) and gate.util.spring.xml (the gate: namespace
handler). The main Spring framework API documentation is the best place to look for more
detail on the pooling facilities provided by Spring AOP.

Note: the former approach using factory methods of the gate.util.spring.SpringFactory
class will still work, but should be considered deprecated in favour of the new factory beans.

7.16 Using GATE Embedded within a Tomcat Web Application

Embedding GATE in a Tomcat web application involves several steps.

1. Put the necessary JAR files (gate.jar and all or most of the jars in gate/lib) in your
webapp/WEB-INF/lib.

2. Put the plugins that your application depends on in a suitable location (e.g.
webapp/WEB-INF/plugins).

3. Create suitable gate.xml configuration files for your environment.

4. Set the appropriate paths in your application before calling Gate.init().

This process is detailed in the following sections.

7.16.1 Recommended Directory Structure


You will need to create a number of other files in your web application to allow GATE to
work:

• Site and user gate.xml config files. We highly recommend defining these specifically for
the web application, rather than relying on the default files on your application server.

• The plugins your application requires.

In this guide, we assume the following layout:

webapp/
  WEB-INF/
    gate.xml
    user-gate.xml
    plugins/
      ANNIE/
      etc.

7.16.2 Configuration Files

Your gate.xml (the `site-wide configuration file') should be as simple as possible:

<?xml version="1.0" encoding="UTF-8" ?>
<GATE>
  <GATECONFIG Save_options_on_exit="false"
      Save_session_on_exit="false" />
</GATE>

Similarly, keep the user-gate.xml (the `user config file') simple:

<?xml version="1.0" encoding="UTF-8" ?>
<GATE>
  <GATECONFIG Known_plugin_path=";"
      Load_plugin_path=";" />
</GATE>

This way, you can control exactly which plugins are loaded in your webapp code.

7.16.3 Initialization Code


Given the directory structure shown above, you can initialize GATE in your web application
like this:

// imports
...
public class MyServlet extends HttpServlet {
  private static boolean gateInited = false;

  public void init() throws ServletException {
    if (!gateInited) {
      try {
        ServletContext ctx = getServletContext();

        // use /path/to/your/webapp/WEB-INF as gate.home
        File gateHome = new File(ctx.getRealPath("/WEB-INF"));

        Gate.setGateHome(gateHome);
        // thus webapp/WEB-INF/plugins is the plugins directory, and
        // webapp/WEB-INF/gate.xml is the site config file.

        // Use webapp/WEB-INF/user-gate.xml as the user config file,
        // to avoid confusion with your own user config.
        Gate.setUserConfigFile(new File(gateHome, "user-gate.xml"));

        Gate.init();
        // load plugins, for example...
        Gate.getCreoleRegister().registerDirectories(
            ctx.getResource("/WEB-INF/plugins/ANNIE"));

        gateInited = true;
      }
      catch (Exception ex) {
        throw new ServletException("Exception initialising GATE",
            ex);
      }
    }
  }
}

Once initialized, you can create GATE resources using the Factory in the usual way (for
example, see Section 7.1 for an example of how to create an ANNIE application). You should
also read Section 7.14 for important notes on using GATE Embedded in a multithreaded
application.

Instead of an initialization servlet you could also consider doing your initialization in a
ServletContextListener, or using Spring (see Section 7.15).

7.17 Groovy for GATE

Groovy is a dynamic programming language based on Java. Groovy is not used in the core
GATE distribution, so to enable the Groovy features in GATE you must first load the Groovy
plugin. Loading this plugin:

• provides access to the Groovy scripting console (configured with some extensions for
GATE) from the GATE Developer Tools menu.

• provides a PR to run a Groovy script over documents.

• provides a controller which uses a Groovy DSL to define its execution strategy.

• enhances a number of core GATE classes with additional convenience methods that can
be used from any Groovy code including the console, the script PR, and any Groovy
class that uses the GATE Embedded API.

This section describes these features in detail, but assumes that the reader already
has some knowledge of the Groovy language. If you are not already familiar with
Groovy you should read this section in conjunction with Groovy's own documentation at
http://groovy.codehaus.org/.

7.17.1 Groovy Scripting Console for GATE


Loading the Groovy plugin in GATE Developer will provide a Groovy Console item in
the Tools/Groovy Tools menu. This menu item opens the standard Groovy console window
(http://groovy.codehaus.org/Groovy+Console).

To help scripting GATE in Groovy, the console is pre-configured to import all classes from
the gate and gate.util packages of the core GATE API. This means you can refer to classes
and interfaces such as Factory, AnnotationSet, Gate, etc. without needing to prefix them
with a package name. In addition, the following (read-only) variable bindings are pre-defined
in the Groovy Console.

• corpora: a list of loaded corpora LRs (Corpus)

• docs: a list of all loaded document LRs (DocumentImpl)

• prs: a list of all loaded PRs

• apps: a list of all loaded Applications (AbstractController)

These variables are automatically updated as resources are created and deleted in GATE.

Here's an example script. It finds all documents with a feature annotator set to fred, and
puts them in a new corpus called fredsDocs.

Factory.newCorpus("fredsDocs").addAll(
  docs.findAll {
    it.features.annotator == "fred"
  }
)

You can find other examples (and add your own) in the Groovy script repository on the
GATE Wiki: http://gate.ac.uk/wiki/groovy-recipes/.

Why won't the `Groovy executing' dialog go away? Sometimes, when you execute a
Groovy script through the console, a dialog will appear, saying Groovy is executing. Please
wait. The dialog fails to go away even when the script has ended, and cannot be closed by
clicking the Interrupt button. You can, however, continue to use the Groovy Console, and
the dialog will usually go away next time you run a script. This is not a GATE problem: it
is a Groovy problem.

7.17.2 Groovy scripting PR


The Groovy scripting PR enables you to load and execute Groovy scripts as part of a GATE
application pipeline. The Groovy scripting PR is made available when you load the Groovy
plugin via the plugin manager.

Parameters

The Groovy scripting PR has a single initialisation parameter:

• scriptURL: the path to a valid Groovy script

It has three runtime parameters:

• inputASName: an optional annotation set intended to be used as input by the PR
(but note that the PR has access to all annotation sets)

• outputASName: an optional annotation set intended to be used as output by the
PR (but note that the PR has access to all annotation sets)

• scriptParams: optional parameters for the script. In a creole.xml file, these should
be specified as key=value pairs, each pair separated by a comma. For example:
'name=fred,type=person'. In the GATE GUI, these are specified via a dialog.
GATE Embedded 177

Script bindings

As with the Groovy console described above, Groovy scripts run by the scripting PR implicitly
import all classes from the gate and gate.util packages of the core GATE API. The Groovy
scripting PR also makes available the following bindings, which you can use in your scripts:

• doc: the current document (Document)

• corpus: the corpus containing the current document

• controller: the controller running the script

• content: the string content of the current document

• inputAS: the annotation set specified by inputASName in the PR's runtime parameters

• outputAS: the annotation set specified by outputASName in the PR's runtime parameters

Note that inputAS and outputAS are intended to be used as input and output AnnotationSets.
This is, however, a convention: there is nothing to stop a script writing to or reading
from any AnnotationSet. Also, although the script has access to the corpus containing the
document it is running over, it is not generally necessary for the script to iterate over the
documents in the corpus itself; the reference is provided to allow the script to access data
stored in the FeatureMap of the corpus. Any other variables assigned to within the script
code will be added to the binding, and values set while processing one document can be used
while processing a later one.

Passing parameters to the script

In addition to the above bindings, one further binding is available to the script:

• scriptParams: a FeatureMap with keys and values as specified by the scriptParams
  runtime parameter

For example, if you were to create a scriptParams runtime parameter for your PR, with
the keys and values: 'name=fred,type=person', then the values could be retrieved in your
script via scriptParams.name and scriptParams.type. If you populate the scriptParams
FeatureMap programmatically, the values will of course have the same types inside the
Groovy script, but if you create the FeatureMap with GATE Developer's parameter editor,
the keys and values will all have String type. (If you want to set n=3 in the GUI editor,
for example, you can use scriptParams.n as Integer in the Groovy script to obtain the
Integer type.)
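
The key=value convention and the String-typed values described above can be illustrated outside GATE. The following Java sketch is not GATE's own parser: the class and method names (ScriptParamsSketch, parse) are hypothetical, and it assumes that keys and values contain no commas or equals signs. It splits a string such as 'name=fred,type=person' into a map of strings and then converts one value explicitly, in the same spirit as the as Integer conversion mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScriptParamsSketch {
    // Hypothetical helper (not part of GATE): split "k1=v1,k2=v2" into an ordered map of Strings.
    static Map<String, String> parse(String spec) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            if (pair.trim().isEmpty()) {
                continue; // ignore empty segments such as a trailing comma
            }
            String[] kv = pair.split("=", 2);
            params.put(kv[0].trim(), kv.length > 1 ? kv[1].trim() : "");
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse("name=fred,type=person,n=3");
        // Every value arrives as a String, so numeric use needs an explicit conversion.
        int n = Integer.parseInt(params.get("n"));
        System.out.println(params.get("name") + " " + params.get("type") + " " + (n + 1));
    }
}
```

Keeping the conversion at the point of use means the same script behaves identically whether the parameters were typed into the GUI editor or supplied as typed values from code.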

Controller callbacks

A Groovy script may wish to do some pre- or post-processing before or after processing
the documents in a corpus, for example if it is collecting statistics about the corpus. To
support this, the script can declare methods beforeCorpus and afterCorpus, taking a
single parameter. If the beforeCorpus method is defined and the script PR is running in
a corpus pipeline application, the method will be called before the pipeline processes the
first document. Similarly, if the afterCorpus method is defined it will be called after the
pipeline has completed processing of all the documents in the corpus. In both cases the
corpus will be passed to the method as a parameter. If the pipeline aborts with an exception,
the afterCorpus method will not be called, but if the script declares a method aborted(c)
then this will be called instead.

Note that because the script is not processing a particular document when these methods
are called, the usual doc, corpus, inputAS, etc. are not available within the body of the
methods (though the corpus is passed to the method as a parameter). The scriptParams
and controller variables are available.

The following example shows how this technique could be used to build a simple tf/idf
index for a GATE corpus. The example is available in the GATE distribution as
plugins/Groovy/resources/scripts/tfidf.groovy. The script makes use of some of the
utility methods described in section 7.17.4.
// reset variables
void beforeCorpus(c) {
  // list of maps (one for each doc) from term to frequency
  frequencies = []
  // sorted map from term to docs that contain it
  docMap = new TreeMap()
  // index of the current doc in the corpus
  docNum = 0
}

// start frequency list for this document
frequencies << [:]

// iterate over the requested annotations
inputAS[scriptParams.annotationType].each {
  def str = doc.stringFor(it)
  // increment term frequency for this term
  frequencies[docNum][str] =
    (frequencies[docNum][str] ?: 0) + 1

  // keep track of which documents this term appears in
  if(!docMap[str]) {
    docMap[str] = new LinkedHashSet()
  }
  docMap[str] << docNum
}

// normalize counts by doc length
def docLength = inputAS[scriptParams.annotationType].size()
frequencies[docNum].each { freq ->
  freq.value = ((double)freq.value) / docLength
}

// increment the counter for the next document
docNum++

// compute the IDFs and store the table as a corpus feature
void afterCorpus(c) {
  def tfIdf = [:]
  docMap.each { term, docsWithTerm ->
    def idf = Math.log((double)docNum / docsWithTerm.size())
    tfIdf[term] = [:]
    docsWithTerm.each { docId ->
      tfIdf[term][docId] = frequencies[docId][term] * idf
    }
  }
  c.features.freqTable = tfIdf
}
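
For readers who want to check the arithmetic independently of GATE, the same computation can be sketched in plain Java. This is an illustration only, not code from the GATE distribution: the class name TfIdfSketch and the representation of each document as a list of term strings are assumptions made for the example. Term counts are normalised by document length, and the IDF is the natural logarithm of the number of documents divided by the number of documents containing the term, as in the script above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class TfIdfSketch {
    // docs: one list of terms per document. Returns term -> (document index -> tf * idf).
    static Map<String, Map<Integer, Double>> tfIdf(List<List<String>> docs) {
        List<Map<String, Double>> frequencies = new ArrayList<>();
        Map<String, Set<Integer>> docMap = new TreeMap<>();
        for (int docNum = 0; docNum < docs.size(); docNum++) {
            List<String> terms = docs.get(docNum);
            Map<String, Double> freq = new LinkedHashMap<>();
            for (String term : terms) {
                // count the term and remember which documents contain it
                freq.merge(term, 1.0, Double::sum);
                docMap.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(docNum);
            }
            // normalise counts by document length
            final int docLength = terms.size();
            freq.replaceAll((t, count) -> count / docLength);
            frequencies.add(freq);
        }
        Map<String, Map<Integer, Double>> table = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : docMap.entrySet()) {
            double idf = Math.log((double) docs.size() / e.getValue().size());
            Map<Integer, Double> perDoc = new LinkedHashMap<>();
            for (int docId : e.getValue()) {
                perDoc.put(docId, frequencies.get(docId).get(e.getKey()) * idf);
            }
            table.put(e.getKey(), perDoc);
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(tfIdf(List.of(
            List.of("gate", "jape", "gate"),
            List.of("jape", "groovy"))));
    }
}
```

A term that occurs in every document receives an IDF of zero, so its tf/idf score is zero regardless of how often it appears.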

Examples

The plugin directory Groovy/resources/scripts contains some example scripts. Below is the
code for a naive regular expression PR.
matcher = content =~ scriptParams.regex
while(matcher.find())
  outputAS.add(matcher.start(),
               matcher.end(),
               scriptParams.type,
               Factory.newFeatureMap())

The script needs to have the runtime parameter scriptParams set with keys and values as
follows:

• regex: the Groovy regular expression that you want to match, e.g. [^\s]*ing

• type: the type of the annotation to create for each regex match, e.g. regexMatch

When the PR is run over a document, the script will first make a matcher over the document
content for the regular expression given by the regex parameter. It will iterate over all
matches for this regular expression, adding a new annotation for each, with a type as given
by the type parameter.
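
The matching loop in the script relies on the standard java.util.regex machinery, so the offsets it would produce can be previewed without GATE. The Java sketch below is an illustration under that assumption; the class RegexSpansSketch and its findSpans method are hypothetical names, and it collects start and end offsets in a list instead of adding annotations to an annotation set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSpansSketch {
    // Returns a {start, end} pair of character offsets for every match of regex in content.
    static List<int[]> findSpans(String content, String regex) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        while (matcher.find()) {
            spans.add(new int[] { matcher.start(), matcher.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String content = "singing and dancing";
        // Same pattern as the regex parameter example above.
        for (int[] span : findSpans(content, "[^\\s]*ing")) {
            System.out.println(span[0] + "-" + span[1] + " " + content.substring(span[0], span[1]));
        }
    }
}
```

The end offset is exclusive, which matches the way annotation offsets delimit a span of document content.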

7.17.3 The Scriptable Controller


The Groovy plugin's Scriptable Controller is a more flexible alternative to the standard
pipeline (SerialController) and corpus pipeline (SerialAnalyserController) applications
and their conditional variants, and also supports the time limiting and robustness
features of the realtime controller. Like the standard controllers, a scriptable controller
contains a list of processing resources and can optionally be configured with a corpus, but unlike
the standard controllers it does not necessarily execute the PRs in a linear order. Instead
the execution strategy is controlled by a script written in a Groovy domain specific language
(DSL), which is detailed in the following sections.

Running a single PR

To run a single PR from the scriptable controller's list of PRs, simply use the PR's name as
a Groovy method call:
somePr()
"ANNIE English Tokeniser"()

If the PR's name contains spaces or any other character that is not valid in a Groovy
identifier, or if the name is a reserved word (such as import) then you must enclose the
name in single or double quotes. You may prefer to rename the PRs so their names are valid
identifiers. Also, if there are several PRs in the controller's list with the same name, they
will all be run in the order in which they appear in the list.

You can optionally provide a Map of named parameters to the call, and these will override
the corresponding runtime parameter values for the PR (the original values will be restored
after the PR has been executed):
myTransducer(outputASName: "output")

Iterating over the corpus

If a corpus has been provided to the controller then you can iterate over all the documents
in the corpus using eachDocument:
eachDocument {
  tokeniser()
  sentenceSplitter()
  myTransducer()
}

The block of code (in fact a Groovy closure) is executed once for each document in the corpus,
exactly as a standard corpus pipeline application would operate. The current document is
available to the script in the variable doc and the corpus in the variable corpus, and in
addition any calls to PRs that implement the LanguageAnalyser interface will set the PR's
document and corpus parameters appropriately.

Running all the PRs in sequence

Calling allPRs() will execute all the controller's PRs once in the order in which they appear
in the list. This is rarely useful in practice but it serves to define the default behaviour:
the initial script that is used by default in a newly instantiated scriptable controller is
eachDocument { allPRs() }, which mimics the behaviour of a standard corpus pipeline
application.

More advanced scripting

The basic DSL is extremely simple, but because the script is Groovy code you can use all
the other facilities of the Groovy language to do conditional execution, grouping of PRs,
etc. The control script has the same implicit imports as provided by the Groovy Script PR
(section 7.17.2), and additional import statements can be added as required.

For example, suppose you have a pipeline for multi-lingual document processing, containing
PRs named englishTokeniser, englishGazetteer, frenchTokeniser, frenchGazetteer,
genericTokeniser, etc., and you need to choose which ones to run based on a document
feature:

eachDocument {
  def lang = doc.features.language ?: 'generic'
  "${lang}Tokeniser"()
  "${lang}Gazetteer"()
}

As another example, suppose you have a particular JAPE grammar that you know is slow
on documents that mention a large number of locations, so you only want to run it on
documents with up to 100 Location annotations, and use a faster but less accurate one on
others:
// helper method to group several PRs together
void annotateLocations() {
  tokeniser()
  splitter()
  gazetteer()
  locationGrammar()
}

eachDocument {
  annotateLocations()
  if(doc.annotations["Location"].size() <= 100) {
    fullLocationClassifier()
  }
  else {
    fastLocationClassifier()
  }
}

You can have more than one call to eachDocument, for example a controller that pre-processes
some documents, then collects some corpus-level statistics, then further processes the
documents based on those statistics.

As a final example, consider a controller to post-process data from a manual annotation
task. Some of the documents have been annotated by one annotator, some by more than
one (the annotations are in sets named annotator1, annotator2, etc., but the number of
sets varies from document to document).

eachDocument {
  // find all the annotatorN sets on this document
  def annotators =
    doc.annotationSetNames.findAll {
      it ==~ /annotator\d+/
    }

  // run the post-processing JAPE grammar on each one
  annotators.each { asName ->
    postProcessingGrammar(
        inputASName: asName,
        outputASName: asName)
  }

  // now merge them to form a consensus set
  mergingPR(annSetsForMerging: annotators.join(';'))
}

Nesting a scriptable controller in another application

Like the standard SerialAnalyserController, the scriptable controller implements the
LanguageAnalyser interface and so can itself be nested as a PR in another pipeline. When
used in this way, eachDocument does not iterate over the corpus but simply calls its closure
once, with the current document set to the document that was passed to the controller as
a parameter. This is the same logic as is used by SerialAnalyserController, which runs its
PRs once only rather than once per document in the corpus.

Global variables

There are a number of variables that are pre-defined in the control script.

controller (read-only) a reference to the ScriptableController object itself, providing
access to its features etc.

prs (read-only) an unmodifiable list of the processing resources in the pipeline.

corpus (read-write) a reference to the corpus (if any) currently set on the controller, and
over which any eachDocument loops will iterate. This variable is a direct alias to the
controller's getCorpus/setCorpus methods, so for example a script could build a new
corpus (using a web crawler or similar), then use eachDocument to iterate over this
corpus and process the documents.

In addition, as mentioned above, within the scope of an eachDocument loop there is a doc
variable giving access to the document being processed in the current iteration. Note that
if this controller is nested inside another controller (see the previous section) then the doc
variable will be available throughout the script.

Ignoring errors

By default, if an exception or error occurs while processing (either thrown by a PR or
occurring directly within the controller's script) then the controller's execution will terminate
with an exception. If this occurs during an eachDocument then the remaining documents
will not be processed. In some circumstances it may be preferable to ignore the error and
simply continue with the next document. To support this you can use ignoringErrors:

eachDocument {
  ignoringErrors {
    tokeniser()
    sentenceSplitter()
    myTransducer()
  }
}

Any exceptions or errors thrown within the ignoringErrors block will be logged⁶ but not
rethrown. So in the example above if myTransducer fails with an exception the controller
will continue with the next document. Note that it is important to nest the blocks correctly:
if the nesting were reversed (with the eachDocument inside the ignoringErrors) then
an exception would terminate the whole eachDocument loop and the remaining documents
would not be processed.
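
The effect of the nesting order is the same as the placement of a try/catch block relative to a loop in ordinary Java. The sketch below is an analogy only and does not use the scriptable controller: the class IgnoringErrorsSketch, the process method and the document names are all hypothetical. It counts how many documents are processed when errors are caught inside the loop compared with around it.

```java
import java.util.List;

final class IgnoringErrorsSketch {
    // Stand-in for running a PR: fails on one particular document.
    static void process(String doc) {
        if (doc.equals("bad")) {
            throw new IllegalStateException("failed on " + doc);
        }
    }

    // Error handling inside the loop: a failure skips only the current document.
    static int errorsIgnoredPerDocument(List<String> docs) {
        int processed = 0;
        for (String doc : docs) {
            try {
                process(doc);
                processed++;
            } catch (RuntimeException e) {
                System.err.println("ignored: " + e.getMessage());
            }
        }
        return processed;
    }

    // Error handling around the loop: the first failure abandons the remaining documents.
    static int errorsIgnoredAroundLoop(List<String> docs) {
        int processed = 0;
        try {
            for (String doc : docs) {
                process(doc);
                processed++;
            }
        } catch (RuntimeException e) {
            System.err.println("ignored: " + e.getMessage());
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a", "bad", "b", "c");
        System.out.println(errorsIgnoredPerDocument(docs) + " vs " + errorsIgnoredAroundLoop(docs));
    }
}
```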

⁶ Logged to the gate.groovy.ScriptableController Log4J logger.

Realtime behaviour

Some GATE processing resources can be very slow when operating on large or complex
documents. In many cases it is possible to use heuristics within your controller's script to
spot likely problem documents and avoid running such PRs over them (see the fast vs.
full location classifier example above), but for situations where this is not possible you can
use the timeLimit method to put a blanket limit on the time that PRs will be allowed to
consume, in a similar way to the real-time controller.

eachDocument {
  ignoringErrors {
    annotateLocations()
    timeLimit(soft:30.seconds, hard:30.seconds) {
      classifyLocations()
    }
  }
}

A call to timeLimit will attempt to limit the running time of its associated code block. You
can specify three different kinds of limit:

soft if the block is still executing after this time, attempt to interrupt it gently. This
uses Thread.interrupt() and also calls the interrupt() method of the currently
executing PR (if any).

exception if the block is still executing after this time beyond the soft limit, attempt to
induce an exception by setting the corpus and document parameters of the currently
running PR to null. This is useful to deal with PRs that do not properly respect the
interrupt call.

hard if the block is still executing after this time beyond the previous limit, forcibly
terminate it using Thread.stop. This is inherently dangerous and prone to memory leakage
but may be the only way to stop particularly stubborn PRs. It should be used with
caution.

Limits can be specified using Groovy's TimeCategory notation as shown above (e.g.
10.seconds, 2.minutes, 1.minute+45.seconds), or as simple numbers (of milliseconds).
Each limit starts counting from the end of the last, so in the example above the hard limit
is 30 seconds after the soft limit, or 1 minute after the start of execution. If no hard limit is
specified the controller will wait indefinitely for the block to complete.
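
Because each limit is measured from the end of the previous one, the time at which a limit actually fires is a running total. The following Java sketch shows only that arithmetic; it is not how the controller is implemented, and the class TimeLimitSketch and method absoluteDeadlines are hypothetical names. Limits are given in milliseconds, in the order soft, exception, hard.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TimeLimitSketch {
    // limits: ordered map of limit name -> duration in milliseconds, each measured
    // from the end of the previous limit. Returns name -> time since the start of execution.
    static Map<String, Long> absoluteDeadlines(Map<String, Long> limits) {
        Map<String, Long> deadlines = new LinkedHashMap<>();
        long elapsed = 0;
        for (Map.Entry<String, Long> limit : limits.entrySet()) {
            elapsed += limit.getValue();
            deadlines.put(limit.getKey(), elapsed);
        }
        return deadlines;
    }

    public static void main(String[] args) {
        // The limits from the example above: soft 30 seconds, then hard 30 seconds later.
        Map<String, Long> limits = new LinkedHashMap<>();
        limits.put("soft", 30_000L);
        limits.put("hard", 30_000L);
        System.out.println(absoluteDeadlines(limits));
    }
}
```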

Note also that when a timeLimit block is terminated it will throw an exception. If you do
not wish this exception to terminate the execution of the controller as a whole you will need
to wrap the timeLimit block in an ignoringErrors block.

timeLimit blocks, particularly ones with a hard limit specified, should be regarded as a last
resort; if there are heuristic methods you can use to avoid running slow PRs in the first place
it is a good idea to use them as a first defence, possibly wrapping them in a timeLimit block
if you need hard guarantees (for example when you are paying per hour for your compute
time in a cloud computing system).

Figure 7.4: Accessing the script editor for a scriptable controller

The Scriptable Controller in GATE Developer

When you double-click on a scriptable controller in the resources tree of GATE Developer
you see the same controller editor that is used by the standard controllers. This view allows
you to add PRs to the controller and set their default runtime parameter values, and to
specify the corpus over which the controller should run. A separate view is provided to allow
you to edit the Groovy script, which is accessible via the Control Script tab (see figure 7.4).
This tab provides a text editor which does basic Groovy syntax highlighting (the same editor
used by the Groovy Console).

7.17.4 Utility methods


Loading the Groovy plugin adds some additional methods to several of the core GATE API
classes and interfaces using the Groovy mixin mechanism. Any Groovy code that runs after
the plugin has been loaded can make use of these additional methods, including snippets
run in the Groovy console, scripts run using the Script PR, and any other Groovy code that
uses the GATE Embedded API.

The methods that are injected come from two classes. The gate.Utils class (part of the core
GATE API in gate.jar) defines a number of static methods that can be used to simplify
common tasks such as getting the string covered by an annotation or annotation set, finding
the start or end offset of an annotation (or set), etc. These methods do not use any
Groovy-specific types, so they are usable from pure Java code in the usual way as well as being mixed
in for use in Groovy. Additionally, the class gate.groovy.GateGroovyMethods (part of the
Groovy plugin) provides methods that use Groovy types such as closures and ranges.

The added methods include:

• Unified access to the start and end offsets of an Annotation, AnnotationSet or
  Document: e.g. someAnnotation.start() or anAnnotationSet.end()

• Simple access to the DocumentContent or string covered by an annotation or annotation
  set: document.stringFor(anAnnotation), document.contentFor(annotationSet)

• Simple access to the length of an annotation or document, either as an int
  (annotation.length()) or a long (annotation.lengthLong()).

• A method to construct a FeatureMap from any map, to support constructions like def params
  = [sourceUrl:'http://gate.ac.uk', encoding:'UTF-8'].toFeatureMap()

• A method to convert an annotation set into a List of annotations in the order they appear
  in the document, for iteration in a predictable order: annSet.inDocumentOrder().collect
  { it.type }

• The each, eachWithIndex and collect methods for a corpus have been redefined to properly
  load and unload documents if the corpus is stored in a datastore.

• Various getAt methods to support constructions like annotationSet["Token"] (get all Token
  annotations from the set), annotationSet[15..20] (get all annotations between offsets 15
  and 20), documentContent[0..10] (get the document content between offsets 0 and 10).

• A withResource method for any resource, which calls a closure with the resource passed as
  a parameter, and ensures that the resource is properly deleted when the closure completes
  (analogous to the default Groovy method InputStream.withStream).

For full details, see the source code or javadoc documentation for these two classes.

7.18 Saving Config Data to gate.xml

Arbitrary feature/value data items can be saved to the user's gate.xml file via the following
API calls:

To get the config data: Map configData = Gate.getUserConfig();

To add config data simply put pairs into the map: configData.put("my new config key",
"value");

To write the config data back to the XML file: Gate.writeUserConfig();

Note that new config data will simply override old values, where the keys are the same. In
this way defaults can be set up by putting their values in the main gate.xml file, or the site
gate.xml file; they can then be overridden by the user's gate.xml file.

7.19 Annotation merging through the API


If we have annotations about the same subject on the same document from different annotators,
we may need to merge those annotations to form a unified annotation. Two approaches
for merging annotations are implemented in the API, via static methods in the
class gate.util.AnnotationMerging.

The two methods have very similar input and output parameters. Each of the methods
takes an array of annotation sets, which should be the same annotation type on the same
document from different annotators, as input. A single feature can also be specified as a
parameter (or given as null if no feature is to be specified).

The output is a map, the key of which is one merged annotation and the value of which
represents the annotators (in terms of the indices of the array of annotation sets) who support
the annotation. The methods also have a boolean input parameter to indicate whether
or not the annotations from different annotators are based on the same set of instances,
which can be determined by the static method public boolean
isSameInstancesForAnnotators(AnnotationSet[] annsA) in the class gate.util.IaaCalculation.
One instance corresponds to all the annotations with the same span. If the annotation sets
are based on the same set of instances, the merging methods will ensure that the merged
annotations are on the same set of instances.

The two methods correspond to those described for the Annotation Merging plugin in
Section 23.20. They are:

• The method public static void mergeAnnotation(AnnotationSet[] annsArr, String
  nameFeat, HashMap<Annotation, String> mergeAnns, int numMinK, boolean
  isTheSameInstances) merges the annotations stored in the array annsArr. The merged
  annotation is put into the map mergeAnns, with a key of the merged annotation and
  value of a string containing the indices of elements in the annotation set array annsArr
  which contain that annotation. numMinK specifies the minimal number of
  annotators supporting one merged annotation. The boolean parameter isTheSameInstances
  indicates whether or not the annotation sets for merging are based on the same instances.

• The method public static void mergeAnnotationMajority(AnnotationSet[] annsArr, String
  nameFeat, HashMap<Annotation, String> mergeAnns, boolean isTheSameInstances)
  selects the annotations which the majority of the annotators agree on. The meanings
  of the parameters are the same as those in the above method.
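
The idea of keeping an annotation only when enough annotators support it can be sketched without the GATE classes. The Java below is a simplified illustration and not the implementation in gate.util.AnnotationMerging: each annotation is reduced to a hypothetical string key such as "0-5:Person", the class MergeSketch and its methods are invented for the example, the "-" separator used to join annotator indices is an assumption, and "majority" is taken here to mean strictly more than half of the annotators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class MergeSketch {
    // annotatorSpans: one set of span keys per annotator (index = annotator number).
    // Returns span key -> joined indices of the annotators that produced it,
    // keeping only spans supported by at least minK annotators.
    static Map<String, String> merge(List<Set<String>> annotatorSpans, int minK) {
        Map<String, List<Integer>> supporters = new TreeMap<>();
        for (int i = 0; i < annotatorSpans.size(); i++) {
            for (String span : annotatorSpans.get(i)) {
                supporters.computeIfAbsent(span, s -> new ArrayList<>()).add(i);
            }
        }
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : supporters.entrySet()) {
            if (e.getValue().size() >= minK) {
                merged.put(e.getKey(), e.getValue().stream()
                        .map(String::valueOf)
                        .collect(Collectors.joining("-")));
            }
        }
        return merged;
    }

    // Majority voting expressed as a threshold of more than half the annotators (an assumption).
    static Map<String, String> mergeMajority(List<Set<String>> annotatorSpans) {
        return merge(annotatorSpans, annotatorSpans.size() / 2 + 1);
    }

    public static void main(String[] args) {
        List<Set<String>> sets = List.of(
            Set.of("0-5:Person", "10-15:Location"),
            Set.of("0-5:Person"),
            Set.of("0-5:Person", "10-15:Location", "20-25:Date"));
        System.out.println(mergeMajority(sets));
    }
}
```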

7.20 Using Resource Helpers to Extend the API


Resource Helpers (see Section 4.8.2) are an easy way of adding new features to existing
resources within GATE Developer. Currently most Resource Helpers provide additional
ways of loading or exporting documents, and it would also be useful to have the same
features available via the API. While you could compile embedded code against the plugin
classes or use reflection, this can quickly become difficult to manage, and rather negates the
whole plugin philosophy. Fortunately the Resource Helper API makes it easy to access these
new features from embedded code.

Here is a complete example showing how a GATE document can be exported using the
Resource Helper in the Fast Infoset plugin (see Section 23.31 for details on Fast Infoset
support):
// initialise GATE and load the plugin (which creates an autoinstance of the Resource Helper)
Gate.init();
Gate.getCreoleRegister().registerDirectories(
    (new File(Gate.getGateHome(), "plugins/Format_FastInfoset")).toURI()
        .toURL());

// get the autoinstance of the Resource Helper
ResourceHelper rh =
    (ResourceHelper) Gate.getCreoleRegister()
        .getAllInstances("gate.corpora.FastInfosetExporter").iterator()
        .next();

// create a simple test document
Document doc =
    Factory.newDocument("A test of the Resource Handler API access");

// use the Resource Helper to export the document
rh.call("export", doc, new File("resource-handler-test.finf"));

The comments should make the code fairly self-explanatory, but the main feature is the final
statement, which uses the ResourceHelper.call(String, Resource, Object...) method. This essentially
allows you to call a named method of the Resource Helper (in the example export), for a
given Resource instance (here we are using a Document instance), supplying any necessary
parameters. This allows you to access any public method (including static methods) of a
Resource Helper that takes a Resource as its first parameter.

The only downside to this approach is that there is no compile time checking that the method
you are trying to call actually exists or that the parameters are of the correct type, so testing
is important.
Chapter 8

JAPE: Regular Expressions over Annotations

If Osama bin Laden did not exist, it would be necessary to invent him. For the
past four years, his name has been invoked whenever a US president has sought
to increase the defence budget or wriggle out of arms control treaties. He has
been used to justify even President Bush's missile defence programme, though
neither he nor his associates are known to possess anything approaching ballistic
missile technology. Now he has become the personification of evil required to
launch a crusade for good: the face behind the faceless terror.
The closer you look, the weaker the case against Bin Laden becomes. While
the terrorists who inflicted Tuesday's dreadful wound may have been inspired by
him, there is, as yet, no evidence that they were instructed by him. Bin Laden's
presumed guilt appears to rest on the supposition that he is the sort of man who
would have done it. But his culpability is irrelevant: his usefulness to western
governments lies in his power to terrify. When billions of pounds of military
spending are at stake, rogue states and terrorist warlords become assets precisely
because they are liabilities.
The need for dissent, George Monbiot, The Guardian, Tuesday September 18,
2001.

JAPE is a Java Annotation Patterns Engine. JAPE provides finite state transduction over
annotations based on regular expressions. JAPE is a version of CPSL (Common Pattern
Specification Language)¹. This chapter introduces JAPE, and outlines the functionality
available. (You can find an excellent tutorial here; thanks to Dhaval Thakker, Taha Osmin and
Phil Lakin).

¹ A good description of the original version of this language is in Doug Appelt's TextPro manual. Doug
was a great help to us in implementing JAPE. Thanks Doug!

JAPE allows you to recognise regular expressions in annotations on documents. Hang on,
there's something wrong here: a regular language can only describe sets of strings, not graphs,
and GATE's model of annotations is based on graphs. Hmmm. Another way of saying this:
typically, regular expressions are applied to character strings, a simple linear sequence of
items, but here we are applying them to a much more complex data structure. The result is
that in certain cases the matching process is non-deterministic (i.e. the results are dependent
on random factors like the addresses at which data is stored in the virtual machine): when
there is structure in the graph being matched that requires more than the power of a regular
automaton to recognise, JAPE chooses an alternative arbitrarily. However, this is not the
bad news that it seems to be, as it turns out that in many useful cases the data stored in
annotation graphs in GATE (and other language processing systems) can be regarded as
simple sequences, and matched deterministically with regular expressions.

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action
rules. The phases run sequentially and constitute a cascade of finite state transducers
over annotations. The left-hand-side (LHS) of a rule consists of an annotation pattern
description. The right-hand-side (RHS) consists of annotation manipulation statements.
Annotations matched on the LHS of a rule may be referred to on the RHS by means of
labels that are attached to pattern elements. Consider the following example:

Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true

Rule: Jobtitle1
(
{Lookup.majorType == jobtitle}
(
{Lookup.majorType == jobtitle}
)?
)
:jobtitle
-->
:jobtitle.JobTitle = {rule = "JobTitle1"}

The LHS is the part preceding the `-->' and the RHS is the part following it. The LHS
specifies a pattern to be matched to the annotated GATE document, whereas the RHS specifies
what is to be done to the matched text. In this example, we have a rule entitled `Jobtitle1',
which will match text annotated with a `Lookup' annotation with a `majorType' feature of
`jobtitle', followed optionally by further text annotated as a `Lookup' with `majorType' of
`jobtitle'. Once this rule has matched a sequence of text, the entire sequence is allocated a
label by the rule, and in this case, the label is `jobtitle'. On the RHS, we refer to this span
of text using the label given in the LHS; `jobtitle'. We say that this text is to be given an
annotation of type `JobTitle' and a `rule' feature set to `JobTitle1'.

We began the JAPE grammar by giving it a phase name, e.g. `Phase: Jobtitle'. JAPE grammars
can be cascaded, and so each grammar is considered to be a `phase' (see Section 8.5).
Phase names (and rule names) must contain only alphanumeric characters, hyphens and
underscores, and cannot start with a number.

We also provide a list of the annotation types we will use in the grammar. In this case,
we say `Input: Lookup' because the only annotation type we use on the LHS is Lookup.
If no annotation types are defined, all annotations will be matched.

Then, several options are set:

• Control; in this case, `appelt'. This defines the method of rule matching (see Section
  8.4)

• Debug. When set to true, if the grammar is running in Appelt mode and there is more
  than one possible match, the conflicts will be displayed on the standard output.

A wide range of functionality can be used with JAPE, making it a very powerful system.
Section 8.1 gives an overview of some common LHS tasks. Section 8.2 talks about the various
operators available for use on the LHS. After that, Section 8.3 outlines RHS functionality.
Section 8.4 talks about priority and Section 8.5 talks about phases. Section 8.6 talks about
using Java code on the RHS, which is the main way of increasing the power of the RHS. We
conclude the chapter with some miscellaneous JAPE-related topics of interest.

8.1 The Left-Hand Side

The LHS of a JAPE grammar aims to match the text span to be annotated, whilst avoiding
undesirable matches. There are various tools available to enable you to do this. This section
outlines how you would approach various common tasks on the LHS of your JAPE grammar.

8.1.1 Matching Entire Annotation Types


The simplest pattern in JAPE is to match any single annotation of a particular annotation
type. You can match only annotation types you specified in the Input line at the top of
the file. For example, the following will match any Lookup annotation:

{Lookup}

If the annotation type contains anything other than ASCII letters and digits, you need to
quote it²:

{"html:table"}

8.1.2 Using Features and Values


You can specify the features (and values) of an annotation to be matched. Several operators
are supported; see Section 8.2 for full details:

• {Token.kind == "number"}, {Token.length != 4} - equality and inequality.

• {Token.string > "aardvark"}, {Token.length < 10} - comparison operators. >=
and <= are also supported.

• {Token.string =~ "[Dd]ogs"}, {Token.string !~ "(?i)hello"} - regular expression.
==~ and !=~ are also provided, for whole-string matching.

• {X contains Y}, {X notContains Y}, {X within Y} and {X notWithin Y} for
checking annotations within the context of other annotations.

In the following rule, the `category' feature of the `Token' annotation is used, along with the
`equals' operator:

Rule: Unknown
Priority: 50
(
{Token.category == NNP}
)
:unknown
-->
:unknown.Unknown = {kind = "PN", rule = Unknown}

As with the annotation type, if you want to match a feature name that contains anything
other than letters or digits, it must be quoted:

{element."xsi:type" == "xs:string"}
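
As a small sketch of the difference between the two families of regular expression
operators listed above: =~ and !~ look for the expression anywhere within the feature
value, whereas ==~ and !=~ require the whole value to match. The rule name DogWord and
the annotation type DogMention are illustrative assumptions, not part of any shipped
grammar:

// matches Tokens such as "dogs" and "Dogs", but also "hotdogs" (substring match)
{Token.string =~ "[Dd]ogs"}

// matches only Tokens whose entire string is "dogs" or "Dogs"
{Token.string ==~ "[Dd]ogs"}

Rule: DogWord
(
{Token.string ==~ "[Dd]ogs"}
):dog
-->
:dog.DogMention = {rule = "DogWord"}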

Footnote 2: In order for the {"html:table"} pattern to match you would also need to quote
the type in the Input line at the top of the grammar, e.g. Input: "html:table"

8.1.3 Using Meta-Properties


In addition to referencing annotation features, JAPE allows access to other `meta-properties'
of an annotation. This is done by using an `@' symbol rather than a `.' symbol after the
annotation type name. The three meta-properties that are built in are:

• length - returns the spanning length of the annotation.

• string - returns the string spanned by the annotation in the document.

• cleanString - like string, but with extra white space stripped out (i.e. `\s+' goes to
a single space and leading or trailing white space is removed).

{X@length > 5}:label-->:label.New = {}
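
The other two meta-properties are used in the same way. The following is a minimal
sketch, assuming that an Organization annotation type exists in the document and is listed
in the Input line; the rule name and the KnownOrganization annotation type are
hypothetical:

Rule: SheffieldUniversity
(
{Organization@cleanString == "University of Sheffield"}
):org
-->
:org.KnownOrganization = {rule = "SheffieldUniversity"}

Because cleanString collapses runs of white space, a rule written this way should still
match when the organisation name is broken across two lines in the document, which a
comparison against @string would not.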

8.1.4 Building complex patterns from simple patterns


So far we have seen how to build a simple pattern that matches a single annotation, optionally
with a constraint on one of its features or meta-properties, but to do anything useful with
JAPE you will need to combine these simple patterns into more complex ones.

Sequences, alternatives and grouping

Patterns can be matched in sequence, for example:

Rule: InLocation
(
{Token.category == "IN"}
{Location}
):inLoc

matches a Token annotation of category IN followed by a Location annotation. Note that
`followed by' in JAPE depends on the annotation types specified in the Input line: the
above pattern matches a Token annotation and a Location annotation provided there are no
intervening annotations of a type listed in the Input line. The Token and Location will not
necessarily be immediately adjacent (they would probably be separated by an intervening
space). In particular, the pattern would not match if SpaceToken were specified in the
Input line.
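
To make the role of the Input line concrete, here is a minimal sketch of a phase in which
SpaceToken is listed in the Input line, so that the intervening space has to be matched
explicitly. The phase name and the InLocation output annotation type are illustrative
assumptions:

Phase: InLocationWithSpaces
Input: Token SpaceToken Location
Options: control = appelt

Rule: InLocation
(
{Token.category == "IN"}
{SpaceToken}
{Location}
):inLoc
-->
:inLoc.InLocation = {rule = "InLocation"}

Dropping SpaceToken from both the Input line and the pattern gives the simpler form shown
above, which is usually preferable unless the white space itself matters.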

The vertical bar `|' is used to denote alternatives. For example:



Rule: InOrAdjective
(
{Token.category == "IN"} | {Token.category == "JJ"}
):inLoc

would match either a Token whose category is IN or one whose category is JJ.

Parentheses are used to group patterns:

Rule: InLocation
(
({Token.category == "IN"} | {Token.category == "JJ"})
{Location}
):inLoc

matches a Token with one or other of the two category values, followed by a Location,
whereas:

Rule: InLocation
(
{Token.category == "IN"} |
( {Token.category == "JJ"}
{Location} )
):inLoc

would match either an IN Token or a sequence of JJ Token and Location.

Repetition

JAPE also provides repetition operators to allow a pattern in parentheses to be optional (?),
or to match zero or more (*), one or more (+) or some specified number of times. In the
following example, you can see the `|' and `?' operators being used:

Rule: LocOrganization
Priority: 50

(
({Lookup.majorType == location} |
{Lookup.majorType == country_adj})
{Lookup.majorType == organization}
({Lookup.majorType == organization})?
)
:orgName -->
:orgName.TempOrganization = {kind = "orgName", rule=LocOrganization}
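
The `*' and `+' operators are used in the same position as `?'. As a hedged sketch,
assuming Token annotations that carry Penn Treebank style category values (JJ for
adjectives, NN for nouns) and a hypothetical NounGroup output annotation type:

Rule: AdjectivesThenNouns
(
({Token.category == JJ})*
({Token.category == NN})+
):group
-->
:group.NounGroup = {rule = "AdjectivesThenNouns"}

This would match zero or more adjective Tokens followed by at least one noun Token.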

Range Notation

Repetition ranges are specified using square brackets.

({Token})[1,3]

matches one to three Tokens in a row.

({Token.kind == number})[3]

matches exactly 3 number Tokens in a row.
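
Put into a complete rule, a range might be used as in the following sketch. The rule
name and the NumberSequence annotation type are illustrative assumptions, and the rule
assumes that only Token is listed in the Input line, so that white space between the
numbers does not interrupt the sequence:

Rule: ShortNumberSequence
(
({Token.kind == number})[2,4]
):nums
-->
:nums.NumberSequence = {rule = "ShortNumberSequence"}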

8.1.5 Matching a Simple Text String


JAPE operates over annotations so it cannot match strings of text in the document directly.
To match a string you need to match an annotation that covers that string, typically a
Token. The GATE Tokeniser adds a string feature to all the Token annotations containing
the string that the Token covers, so you can use this (or the @string meta-property) to match
text in your document.

{Token.string == "of"}

The following grammar shows a sequence of strings being matched.

Phase: UrlPre
Input: Token SpaceToken
Options: control = appelt

Rule: Urlpre

( (({Token.string == "http"} |
{Token.string == "ftp"})
{Token.string == ":"}
{Token.string == "/"}
{Token.string == "/"}
) |
({Token.string == "www"}
{Token.string == "."}
)
):urlpre
-->
:urlpre.UrlPre = {rule = "UrlPre"}

Since we are matching annotations and not text, you must be careful that the strings you
ask for are in fact single tokens. In the example above, {Token.string == "://"} would
never match (assuming the default ANNIE Tokeniser) as the three characters are treated as
separate tokens.

8.1.6 Using Templates


In cases where a grammar contains many similar or identical strings or other literal values,
JAPE supports the concept of templates. A template is a named value declared in the
grammar file, similar to a variable in Java or other programming languages, which can be
referenced anywhere that a normal string literal, boolean or numeric value could be used,
on the left- or right-hand side of a rule. In the simplest case templates can be constants:

Template: source = "Interesting entity finder"


Template: threshold = 0.6

The templates can be used in rules by providing their names in square brackets:

Rule: InterestingLocation
(
{Location.score >= [threshold]}
):loc
-->
:loc.Entity = { type = Location, source = [source] }

The JAPE grammar parser substitutes the template values for their references when the
grammar is parsed. Thus the example rule is equivalent to

Rule: InterestingLocation
(
{Location.score >= 0.6}
):loc
-->
:loc.Entity = { type = Location,
source = "Interesting entity finder" }

The advantage of using templates is that if there are many rules in the grammar that all
reference the threshold template then it is possible to change the threshold for all rules by
simply changing the template definition.

The name `template' stems from the fact that templates whose value is a string can contain
parameters, specified using ${name} notation:

Template: url = "http://gate.ac.uk/${path}"

When a template containing parameters is referenced, values for the parameters may be
specified:

...
-->
:anchor.Reference = {
page = [url path = "userguide"] }

This is equivalent to page = "http://gate.ac.uk/userguide". Multiple parameter value
assignments are separated by commas, for example:

Template: proton =
"http://proton.semanticweb.org/2005/04/proton${mod}#${n}"

...
{Lookup.class == [proton mod="km", n="Mention"]}
// equivalent to
// {Lookup.class ==
// "http://proton.semanticweb.org/2005/04/protonkm#Mention"}

The parser will report an error if a value is specified for a parameter that is not declared by
the referenced template, for example [proton module="km"] would not be permitted in the
above example.

Advanced template usage

If a template contains parameters for which values are not provided when the template is
referenced, the parameter placeholders are passed through unchanged. Combined with the
fact that the value for a template definition can itself be a reference to a previously-defined
template, this allows for idioms like the following:

Template: proton =
"http://proton.semanticweb.org/2005/04/proton${mod}#${n}"
Template: pkm = [proton mod="km"]
Template: ptop = [proton mod="t"]

...
({Lookup.class == [ptop n="Person"]}):look
-->
:look.Mention = { class = [pkm n="Mention"], of = "Person"}

(This example is inspired by the ontology-aware JAPE matching mode described in
Section 14.10.)

In a multi-phase JAPE grammar, templates defined in earlier phases may be referenced in
later phases. This makes it possible to declare constants (such as the PROTON URIs above)
in one place and reference them throughout a complex grammar.

8.1.7 Multiple Pattern/Action Pairs


It is also possible to have more than one pattern and corresponding action, as shown in the
rule below. On the LHS, each pattern is enclosed in a set of round brackets and has a unique
label; on the RHS, each label is associated with an action. In this example, the Lookup
annotation is labelled `jobtitle' and is given the new annotation JobTitle; the TempPerson
annotation is labelled `person' and is given the new annotation `Person'.

Rule: PersonJobTitle
Priority: 20

(
{Lookup.majorType == jobtitle}
):jobtitle
(
{TempPerson}
):person
-->
:jobtitle.JobTitle = {rule = "PersonJobTitle"},
:person.Person = {kind = "personName", rule = "PersonJobTitle"}

Similarly, labelled patterns can be nested, as in the example below, where the whole pattern
is annotated as Person, but within the pattern, the jobtitle is annotated as JobTitle.

Rule: PersonJobTitle2
Priority: 20

(
(
{Lookup.majorType == jobtitle}
):jobtitle
{TempPerson}
):person
-->
:jobtitle.JobTitle = {rule = "PersonJobTitle2"},
:person.Person = {kind = "personName", rule = "PersonJobTitle2"}

8.1.8 LHS Macros


Macros allow you to create a definition that can then be used multiple times in your JAPE
rules. The following JAPE grammar uses a cascade of macros. The macro `AMOUNT_NUMBER'
makes use of the macros `MILLION_BILLION' and `NUMBER_WORDS', and the rule
`MoneyCurrencyUnit' then makes use of `AMOUNT_NUMBER':

Phase: Number
Input: Token Lookup
Options: control = appelt

Macro: MILLION_BILLION
({Token.string == "m"}|
{Token.string == "million"}|
{Token.string == "b"}|
{Token.string == "billion"}|
{Token.string == "bn"}|
{Token.string == "k"}|
{Token.string == "K"}
)

Macro: NUMBER_WORDS
(
(({Lookup.majorType == number}
({Token.string == "-"})?
)*
{Lookup.majorType == number}
{Token.string == "and"}
)*
({Lookup.majorType == number}
({Token.string == "-"})?
)*
{Lookup.majorType == number}
)

Macro: AMOUNT_NUMBER
(({Token.kind == number}
(({Token.string == ","}|
{Token.string == "."}
)
{Token.kind == number}
)*
|
(NUMBER_WORDS)
)
(MILLION_BILLION)?
)

Rule: MoneyCurrencyUnit
(
(AMOUNT_NUMBER)
({Lookup.majorType == currency_unit})
)
:number -->
:number.Money = {kind = "number", rule = "MoneyCurrencyUnit"}

8.1.9 Multi-Constraint Statements


In the examples we have seen so far, most statements have contained only one constraint.
For example, in this statement, the `category' of `Token' must equal `NNP':

Rule: Unknown
Priority: 50
(
{Token.category == NNP}
)
:unknown
-->
:unknown.Unknown = {kind = "PN", rule = Unknown}

However, it is equally acceptable to have multiple constraints in a statement. In this example,
the `majorType' of `Lookup' must be `name' and the `minorType' must be `surname':

Rule: Surname
(
{Lookup.majorType == "name",
Lookup.minorType == "surname"}
):surname
-->
:surname.Surname = {}

Multiple constraints on the same annotation type must all be satisfied by the same annotation
in order for the pattern to match.

The constraints may refer to different annotations, and for the pattern as a whole to match
the constraints must be satisfied by annotations that start at the same location in the
document. In this example, in addition to the constraints on the `majorType' and `minorType'
of `Lookup', we also have a constraint on the `string' of `Token':

Rule: SurnameStartingWithDe
(
{Token.string == "de",
Lookup.majorType == "name",
Lookup.minorType == "surname"}
):de
-->
:de.Surname = {prefix = "de"}

This rule would match wherever a Token with string `de' and a Lookup with
majorType `name' and minorType `surname' start at the same offset in the text. Both the
Lookup and Token annotations would be included in the :de binding, so the Surname
annotation generated would span the longer of the two. As before, constraints on the same
annotation type must be satisfied by a single annotation, so in this example there must be a
single Lookup matching both the major and minor types; the rule would not match if there
were two different lookups at the same location, one of them satisfying each constraint.

8.1.10 Using Context


Context can be dealt with in the grammar rules in the following way. The pattern to be
annotated is always enclosed by a set of round brackets. If preceding context is to be included
in the rule, this is placed before this set of brackets. This context is described in exactly
the same way as the pattern to be matched. If context following the pattern needs to be
included, it is placed after the label given to the annotation. Context is used where a pattern
should only be recognised if it occurs in a certain situation, but the context itself does not
form part of the pattern to be annotated.

For example, the following rule for Time (assuming an appropriate macro for `year') would
mean that a year would only be recognised if it occurs preceded by the words `in' or `by':

Rule: YearContext1

({Token.string == "in"}|
{Token.string == "by"}
)
(YEAR)
:date -->
:date.Timex = {kind = "date", rule = "YearContext1"}

Similarly, the following rule (assuming an appropriate macro for `email') would mean that
an email address would only be recognised if it occurred inside angled brackets (which would
not themselves form part of the entity):

Rule: Emailaddress1
({Token.string == "<"})
(
(EMAIL)
)
:email
({Token.string == ">"})
-->
:email.Address = {kind = "email", rule = "Emailaddress1"}

It is important to remember that context is consumed by the rule, so it cannot be reused in
another rule within the same phase. So, for example, right context for one rule cannot be
used as left context for another rule.

8.1.11 Negation

All the examples in the preceding sections involve constraints that require the presence of
certain annotations to match. JAPE also supports `negative' constraints which specify the
absence of annotations. A negative constraint is signalled in the grammar by a `!' character.

Negative constraints are used in combination with positive ones to constrain the locations
at which the positive constraint can match. For example:

Rule: PossibleName
(
{Token.orth == "upperInitial", !Lookup}
):name
-->
:name.PossibleName = {}

This rule would match any uppercase-initial Token, but only where there is no Lookup
annotation starting at the same location. The general rule is that a negative constraint matches
at any location where the corresponding positive constraint would not match. Negative
constraints do not contribute any annotations to the bindings - in the example above, the
:name binding would contain only the Token annotation (see footnote 3).

Footnote 3: The exception to this is when a negative constraint is used alone, without any
positive constraints in the combination. In this case it binds all the annotations at the match
position that do not match the constraint. Thus, {!Lookup} would bind all the annotations
starting at this location except Lookups. In general negative constraints should only be used
in combination with positive ones.

Any constraint can be negated, for example:

Rule: SurnameNotStartingWithDe
(
{Surname, !Token.string ==~ "[Dd]e"}
):name
-->
:name.NotDe = {}

This would match any Surname annotation that does not start at the same place
as a Token with the string `de' or `De'. Note that this is subtly different from
{Surname, Token.string !=~ "[Dd]e"}, as the second form requires a Token annotation
to be present, whereas the first form (!Token...) will match if there is no Token annotation
at all at this location (see footnote 4).

As with positive constraints, multiple negative constraints on the same annotation type
must all match the same annotation in order for the overall pattern match to be blocked.
For example:

{Name, !Lookup.majorType == "person", !Lookup.minorType == "female"}

would match a Name annotation, but only if it does not start at the same location as a
Lookup with majorType person and minorType female. A Lookup with majorType
person and minorType male would not block the pattern from matching. However, negated
constraints on different annotation types are independent:

{Person, !Organization, !Location}

would match a Person annotation, but only if there is no Organization annotation and no
Location annotation starting at the same place.

Note: Prior to GATE 7.0, negated constraints on the same annotation type were considered
independent, i.e. in the Name example above any Lookup of majorType person would
block the match, irrespective of its minorType. If you have existing grammars that depend
on this behaviour you should add negationGrouping = false to the Options line at the
top of the JAPE phase in question.

Footnote 4: In the Montreal transducer, the two forms were equivalent.

Although JAPE provides an operator to look for the absence of a single annotation type,
there is no support for a general negative operator to prevent a rule from firing if a particular
sequence of annotations is found. One solution to this is to create a `negative rule' which
has higher priority than the matching `positive rule'. The style of matching must be Appelt
for this to work. To create a negative rule, simply state on the LHS of the rule the pattern
that should NOT be matched, and on the RHS do nothing. In this way, the positive rule
cannot be fired if the negative pattern matches, and vice versa, which has the same end
result as using a negative operator. A useful variation for developers is to create a dummy
annotation on the RHS of the negative rule, rather than to do nothing, and to give the
dummy annotation a rule feature. In this way, it is obvious that the negative rule has fired.
Alternatively, use Java code on the RHS to print a message when the rule fires. An example
of a matching negative and positive rule follows. Here, we want a rule which matches a
surname followed by a comma and a set of initials. But we want to specify that the initials
shouldn't have the POS category PRP (personal pronoun). So we specify a negative rule
that will fire if the PRP category exists, thereby preventing the positive rule from firing.

Rule: NotPersonReverse
Priority: 20
// we don't want to match 'Jones, I'
(
{Token.category == NNP}
{Token.string == ","}
{Token.category == PRP}
)
:foo
-->
{}

Rule: PersonReverse
Priority: 5
// we want to match `Jones, F.W.'

(
{Token.category == NNP}
{Token.string == ","}
(INITIALS)?
)
:person -->
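
The developer-oriented variation of the negative rule, in which a dummy annotation with a
rule feature is created instead of doing nothing, might look like the following sketch. The
annotation type NegativeMatch is a hypothetical name chosen for illustration; any type
that later phases ignore would serve:

Rule: NotPersonReverse
Priority: 20
// we don't want to match 'Jones, I'
(
{Token.category == NNP}
{Token.string == ","}
{Token.category == PRP}
)
:foo
-->
:foo.NegativeMatch = {rule = "NotPersonReverse"}

Inspecting the NegativeMatch annotations after a run then shows where the negative rule
blocked the positive one.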

8.1.12 Escaping Special Characters


To specify a single or double quote as a string, precede it with a backslash, e.g.

{Token.string=="\""}

will match a double quote. For other special characters, such as `$', enclose it in double
quotes, e.g.

{Token.category == "PRP$"}

8.2 LHS Operators in Detail


This section gives more detail on the behaviour of the matching operators used on the left-
hand side of JAPE rules.

Matching operators are used to specify how matching must take place between a JAPE
pattern and an annotation in the document. Equality (`==' and `!=') and comparison
(`<', `<=', `>=' and `>') operators can be used, as can regular expression matching and
contextual operators (`contains' and `within').

8.2.1 Equality Operators


The equality operators are `==' and `!='. The basic operator in JAPE is equality.
{Lookup.majorType == "person"} matches a Lookup annotation whose majorType feature
has the value `person'. Similarly {Lookup.majorType != "person"} would match any
Lookup whose majorType feature does not have the value `person'. If a feature is missing
it is treated as if it had an empty string as its value, so the != constraint would also match
a Lookup annotation that did not have a majorType feature at all.

Certain type coercions are performed:

• If the constraint's attribute is a string, it is compared with the annotation feature value
using string equality (String.equals()).

• If the constraint's attribute is an integer it is treated as a java.lang.Long. If the
annotation feature value is also a Long, or is a string that can be parsed as a Long,
then it is compared using Long.equals().

• If the constraint's attribute is a floating-point number it is treated as a
java.lang.Double. If the annotation feature value is also a Double, or is a string that
can be parsed as a Double, then it is compared using Double.equals().

• If the constraint's attribute is true or false (without quotes) it is treated as a
java.lang.Boolean. If the annotation feature value is also a Boolean, or is a string
that can be parsed as a Boolean, then it is compared using Boolean.equals().

The != operator matches exactly when == doesn't.
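
The coercion rules above can be illustrated with one constraint of each kind. This is only
a sketch: the Lookup feature `verified' is a hypothetical Boolean feature invented for the
example, while the other features are those used elsewhere in this chapter.

// string attribute: compared with String.equals()
{Token.kind == "number"}

// integer attribute: treated as a Long, so a feature value of 4 or of "4" would match
{Token.length == 4}

// floating-point attribute: treated as a Double
{Location.score == 0.6}

// unquoted true or false: treated as a Boolean (hypothetical feature)
{Lookup.verified == true}

// also matches a Lookup that has no majorType feature at all
{Lookup.majorType != "person"}

Note that quoting changes the type of the attribute: {Token.length == "4"} is a string
comparison, whereas {Token.length == 4} is a Long comparison.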

8.2.2 Comparison Operators


The comparison operators are `<', `<=', `>=' and `>'. Comparison operators have their
expected meanings, for example {Token.length > 3} matches a Token annotation whose
length attribute is an integer greater than 3. The behaviour of the operators depends on the
type of the constraint's attribute:

• If the constraint's attribute is a string it is compared with the annotation feature value
using Unicode-lexicographic order (see String.compareTo()).