Papers by Dawn Archer

Semantic lexical resources play an important part in both corpus linguistics and NLP. Over the past 14 years, a large semantic lexical resource has been built at Lancaster University. Unlike other major semantic lexicons such as WordNet, EuroWordNet and HowNet, in which lexemes are clustered and linked via relationships between word/MWE senses or definitions of meaning, the Lancaster semantic lexicon employs a semantic field taxonomy and maps words and multiword expression (MWE) templates to their potential semantic categories, which are disambiguated according to their context of use by a semantic tagger called USAS (UCREL semantic analysis system). The lexicon is classified with a set of broadly defined semantic field categories, organised in a thesaurus-like structure. The Lancaster semantic taxonomy provides a conception of the world that is as general as possible, as opposed to a semantic network for specific domains. This paper describes the Lancaster semantic lexicon in terms of its semantic field taxonomy, the lexical distribution across the semantic categories, and the lexeme/tag type ratio. As will be shown, the Lancaster semantic lexicon is a unique and valuable resource: a large-scale, general-purpose, semantically structured lexicon with various applications in corpus linguistics and NLP. (1) The semantic lexicon and the USAS tagger are accessible for academic research as part of the Wmatrix tool; for more details see
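The tagging pipeline the abstract describes (look up a word's candidate semantic categories, then disambiguate by context) can be sketched in miniature. This is a toy illustration: the lexicon entries, tag labels and the context heuristic below are invented for demonstration and are not the actual USAS lexicon or disambiguation algorithm.

```python
# Hypothetical mini-lexicon: each lexeme maps to its candidate semantic tags.
LEXICON = {
    "bank": ["I1 (money)", "W3 (geography)"],   # ambiguous lexeme
    "river": ["W3 (geography)"],
    "money": ["I1 (money)"],
}

def tag_sentence(tokens):
    """Assign each token one candidate tag, chosen by context.

    Context heuristic (an illustrative stand-in for real disambiguation):
    prefer a candidate tag that another token in the sentence also carries;
    otherwise fall back to the first-listed (most likely) candidate.
    """
    tagged = []
    for i, tok in enumerate(tokens):
        candidates = LEXICON.get(tok, ["Z99 (unmatched)"])
        context_tags = {t for j, other in enumerate(tokens) if j != i
                        for t in LEXICON.get(other, [])}
        chosen = next((c for c in candidates if c in context_tags),
                      candidates[0])
        tagged.append((tok, chosen))
    return tagged

print(tag_sentence(["river", "bank"]))
# "bank" resolves to the geography sense because "river" shares that field
```

The point of the sketch is the two-stage design: a static word-to-candidate-tags mapping (the lexicon resource) kept separate from the contextual disambiguator (the tagger).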
Published conference papers by Dawn Archer

Proceedings of Corpus Linguistics 2003, 2003
As reported by Wilson and Rayson (1993) and Rayson and Wilson (1996), the UCREL semantic analysis system (USAS) has been designed to undertake the automatic semantic analysis of present-day English (henceforth PresDE) texts. In this paper, we report on the feasibility of (re)training the USAS system to cope with English from earlier periods, specifically the Early Modern English (henceforth EModE) period. We begin by describing how effectively the existing system tagged a training corpus prior to any modifications. The training corpus consists of newsbooks dating from December 1653 to May 1654, and totals approximately 613,000 words. We then document the various adaptations that we made to the system in an attempt to improve its efficiency, and the results we achieved when we applied the modified system to two newsbook texts and an additional text from the Lampeter Corpus (i.e. a text that was not part of the original training corpus). To conclude, we propose a design for a modified semantic tagger for EModE texts that contains an 'intelligent' spelling regulariser, that is, a system designed to regularise spellings in their 'correct' context.
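The 'correct context' requirement for the proposed spelling regulariser can be made concrete with a minimal sketch. The variant table and the context rule below are invented for illustration, not taken from the actual system: the idea is only that a variant form is modernised unless the context indicates the historical spelling coincides with a valid modern word.

```python
# Hypothetical variant table: historical spelling -> modern form.
VARIANTS = {
    "doe": "do",    # "doe" is also a valid modern noun (the animal)
    "vpon": "upon",
    "haue": "have",
}

def regularise(tokens):
    """Modernise variant spellings, except where a crude context rule
    (here: the token follows an article, so read it as a noun) says the
    surface form should be kept. A stand-in for context-sensitive
    regularisation, not the paper's actual design."""
    out = []
    for i, tok in enumerate(tokens):
        modern = VARIANTS.get(tok.lower())
        prev = tokens[i - 1].lower() if i > 0 else ""
        if modern and prev not in {"a", "the"}:
            out.append(modern)
        else:
            out.append(tok)
    return out

print(regularise(["I", "doe", "vpon", "occasion", "haue", "a", "doe"]))
# the first "doe" is modernised to "do"; the second, after "a", is kept
```

A blind find-and-replace would wrongly rewrite the second "doe" as well, which is exactly the failure the 'intelligent' regulariser is meant to avoid.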

LREC 2004 Proceedings, 2004
Semantic lexical resources play an important part in both linguistic study and natural language engineering. At Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP communities. In this paper, we evaluate the lexical coverage of the semantic lexicon in terms of both genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and several corpora of newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. It also shows that, in order to make claims about the lexical coverage of annotation systems, as well as to render them 'future proof', we need to evaluate their potential both synchronically and diachronically across genres.
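Lexical coverage in evaluations of this kind is usually the share of running-text tokens that the lexicon recognises. A minimal sketch of that computation, on an invented toy corpus and lexicon (the percentages in the abstract come from the real resources):

```python
def lexical_coverage(tokens, lexicon):
    """Percentage of running-text tokens found in the lexicon."""
    matched = sum(1 for t in tokens if t.lower() in lexicon)
    return 100.0 * matched / len(tokens)

# Toy data: "newe" is an unrecognised historical spelling.
lexicon = {"the", "court", "heard", "evidence", "yesterday"}
tokens = ["The", "court", "heard", "newe", "evidence", "yesterday"]
print(f"{lexical_coverage(tokens, lexicon):.2f}%")  # 5 of 6 tokens
```

Because the measure is token-based, a handful of frequent out-of-lexicon forms (such as historical spellings) can depress coverage sharply, which is why the historical corpora score lower than the BNC sampler.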

Proceedings of the EURALEX-2004 Conference, 2004
Annotation schemes for semantic field analysis use abstract concepts to classify words and phrases in a given text. The use of such schemes within lexicography is increasing; indeed, our own UCREL semantic annotation system (USAS) is to form part of a web-based 'intelligent' dictionary (Herpio 2002). As USAS was originally designed to enable automatic content analysis (Wilson and Rayson 1993), we have been assessing its usefulness in a lexicographical setting, and also comparing its taxonomy with schemes developed by lexicographers. This paper initially reports the comparisons we have undertaken with two dictionary taxonomies: the first was designed by Tom McArthur for use in the Longman Lexicon of Contemporary English, and the second by Collins Dictionaries for use in their Collins English Dictionary. We then assess the feasibility of mapping USAS to the CED tagset, before reporting our intentions to also map to WordNet (a reasonably comprehensive machine-usable database of the meanings of English words) via WordNet Domains (which augments WordNet 1.6 with 200+ domains). We argue that this type of research can provide a practical guide for tagset mapping and, by so doing, bring lexicographers one step closer to using the semantic field as the organising principle for their general-purpose dictionaries.
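Tagset mapping of the kind discussed here amounts to a correspondence table between two taxonomies, where alignments may be one-to-one, one-to-many, or missing. A hypothetical sketch (these tag pairs are invented for illustration; the real USAS-to-CED or WordNet Domains mappings are far larger and were built by hand comparison):

```python
# Invented correspondence table from USAS-style tags to a target scheme.
USAS_TO_TARGET = {
    "I1": ["Finance"],             # one-to-one alignment
    "W3": ["Geography", "Earth"],  # one-to-many: keep both or choose one
}

def map_tag(usas_tag, table=USAS_TO_TARGET):
    """Return the target-scheme categories for a tag, flagging gaps
    explicitly, since taxonomies rarely align perfectly."""
    return table.get(usas_tag, ["UNMAPPED"])

print(map_tag("W3"))   # a one-to-many case
print(map_tag("Q9"))   # a gap in the correspondence table
```

The interesting design questions are precisely the non-trivial cells: how to resolve one-to-many alignments and what to do with categories that have no counterpart at all.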

Piao, S., Archer, D., Mudraya, O., Rayson, P., Garside, R., McEnery, T. & Wilson, A. (2005). A Large Semantic Lexicon for Corpus Annotation. Proceedings of the Corpus Linguistics 2005 Conference.

Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP and corpus linguistics communities. Although various statistically driven or knowledge-based approaches have been proposed and tested, efficient MWE extraction remains an unsolved issue. In this paper, we present research in which we tested approaching the MWE issue using a semantic field annotator. We use an English semantic tagger (USAS) developed at Lancaster University to identify multiword units which depict single semantic concepts. The METER Corpus, built in Sheffield, was used to evaluate our approach. In our evaluation, this approach extracted a total of 4,195 MWE candidates, of which, after manual checking, 3,792 were accepted as valid MWEs, producing a precision of 90.39% and an estimated recall of 39.38%. Of the accepted MWEs, 68.22% (2,587) are low-frequency terms, occurring only once or twice in the corpus. These results show that our approach provides a practical solution to MWE extraction.
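The precision and low-frequency figures follow directly from the counts given in the abstract; the estimated recall of 39.38% cannot be rederived here, since it depends on a gold-standard total of true MWEs not stated above. A quick check of the derivable numbers:

```python
def pct(n, d):
    """Percentage n/d rounded to two decimal places."""
    return round(100.0 * n / d, 2)

extracted = 4195   # MWE candidates produced by the semantic tagger
accepted  = 3792   # candidates judged valid after manual checking
low_freq  = 2587   # accepted MWEs occurring only once or twice

precision = pct(accepted, extracted)   # share of candidates that were valid
low_share = pct(low_freq, accepted)    # share of valid MWEs that are rare
print(precision, low_share)            # 90.39 68.22
```

That two thirds of the valid MWEs occur only once or twice is notable, since purely frequency-based extraction methods tend to miss exactly such low-frequency items.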