The UCREL semantic analysis system

Paul Rayson

2004, Proceedings of the Workshop on Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, in association with the 4th International Conference on Language Resources and Evaluation (LREC 2004)

Abstract

The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical resources on which the system relies, and the seven disambiguation methods including part-of-speech tagging, general likelihood ranking, multi-word-expression extraction, domain of discourse identification, and contextual rules. We report an evaluation of the accuracy of the system compared to a manually tagged test corpus on which the USAS software obtained a precision value of 91%. Finally, we make reference to the applications of the system in corpus linguistics, content analysis, software engineering, and electronic dictionaries.
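The evaluation described in the abstract compares system output with a manually tagged test corpus. The sketch below shows one way such a comparison could be scored. It assumes that precision means the proportion of tokens whose system tag matches the manual tag; the exact definition used in the paper is not given here. The function name `tagging_precision` and the toy tags are hypothetical.

```python
from typing import Sequence


def tagging_precision(system_tags: Sequence[str], gold_tags: Sequence[str]) -> float:
    """Proportion of tokens whose system tag equals the manually assigned tag.

    Assumption: both sequences are aligned token by token.
    """
    if len(system_tags) != len(gold_tags):
        raise ValueError("system and gold tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty corpus")
    correct = sum(1 for s, g in zip(system_tags, gold_tags) if s == g)
    return correct / len(gold_tags)


# Toy, hypothetical tags (not real USAS output): three of four tokens agree.
system = ["GRAMMATICAL", "GEOGRAPHY", "MONEY", "WEATHER"]
gold = ["GRAMMATICAL", "GEOGRAPHY", "GEOGRAPHY", "WEATHER"]
score = tagging_precision(system, gold)
print(f"precision = {score:.2f}")
```

On a real test corpus the two sequences would come from the tagger output and the hand-annotated file respectively.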

Key takeaways

The research areas closely related to our work include automatic word sense disambiguation (WSD) and semantic tagging.
The core part of the USAS system is a semantic annotation component, which consists of semantic lexical resources, a set of context rules, and programs that implement disambiguation algorithms and assign semantic tags to each word in a running text.
As in the case of grammatical tagging, the task of semantic tagging subdivides broadly into two phases: Phase I (tag assignment), which attaches a set of potential semantic tags to each lexical unit, and Phase II (tag disambiguation), which selects the contextually appropriate semantic tag from the set provided by Phase I. USAS makes use of seven major techniques or sources of information in Phase II.
Auxiliary verb identification appears to be particularly [...]
We define the initial ambiguity ratio as the percentage of words in a text with more than one possible semantic tag assigned from the semantic lexicon and MWE list before the application of disambiguation techniques.
Employing a hierarchical semantic taxonomy, semantic lexical resources and a number of disambiguation algorithms such as templates, context rules etc., USAS assigns semantic categories to words and MWEs in a running text.
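The two phases and the initial ambiguity ratio described above can be illustrated with a minimal sketch. It is not the USAS implementation: the lexicon, the tag names and the function names are all hypothetical, multi-word expressions are left out, and Phase II is reduced to two of the seven information sources (a domain-of-discourse preference and a general likelihood ranking, assumed here to be encoded as the order of candidates in the lexicon).

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy lexicon (not real USAS entries). Candidate tags are
# listed in assumed order of general likelihood, most likely first.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["GRAMMATICAL"],
    "bank": ["MONEY", "GEOGRAPHY"],
    "river": ["GEOGRAPHY"],
    "flooded": ["WEATHER", "QUANTITY"],
}
UNMATCHED = "UNMATCHED"  # fallback tag for words missing from the lexicon


def assign_tags(tokens: List[str]) -> List[List[str]]:
    """Phase I: attach the set of potential tags to each token."""
    return [list(TOY_LEXICON.get(tok.lower(), [UNMATCHED])) for tok in tokens]


def initial_ambiguity_ratio(candidates: List[List[str]]) -> float:
    """Percentage of tokens with more than one candidate tag after Phase I."""
    if not candidates:
        return 0.0
    ambiguous = sum(1 for tags in candidates if len(tags) > 1)
    return 100.0 * ambiguous / len(candidates)


def disambiguate(
    candidates: List[List[str]],
    domain_tags: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Phase II (simplified): prefer a candidate from the domain of
    discourse, otherwise fall back to the most likely candidate."""
    chosen = []
    for tags in candidates:
        in_domain = [t for t in tags if t in domain_tags]
        chosen.append(in_domain[0] if in_domain else tags[0])
    return chosen


tokens = "The river bank flooded".split()
candidates = assign_tags(tokens)
ratio = initial_ambiguity_ratio(candidates)
tags = disambiguate(candidates, domain_tags=frozenset({"GEOGRAPHY"}))
print(ratio)
print(list(zip(tokens, tags)))
```

In this toy run two of the four tokens have more than one candidate tag, so the initial ambiguity ratio is 50%. With no domain preference the likelihood ordering alone would pick the first candidate for "bank".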

Matrix: A statistical method and software tool for linguistic analysis through corpus comparison

This thesis reports the development of a new kind of method and tool (Matrix) for advancing the statistical analysis of electronic corpora of linguistic data. First, we describe the standard corpus linguistic methodology, which is hypothesis-driven. The standard research process model is 'question - build - annotate - retrieve - interpret', in other words, identifying the research question (and the linguistic features) early in the study. In recent years corpora have been increasingly annotated with linguistic information. From our survey, we find that no tools are available which are data-driven on annotated corpora, in other words, a tool which assists in finding candidate research questions. However, Matrix is such a tool. It allows the macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) as to which linguistic features should be investigated further. By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of UK 2001 general election manifestos of the Labour and Liberal Democratic parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. Currently, it has been tested on restricted levels of annotation and only on English language data.

First of all I'd like to thank Pete Sawyer, who, as we were sitting in a bar in Barcelona, convinced me that writing this up was possible, and he paid for the cerveza too.
None of this work would have been possible without Roger Garside who was not only my supervisor but also ignited my interest in natural language processing when I began my third year project as an undergraduate in 1989. During my work on this thesis, and before I started my PhD research, I have been a member of the UCREL research group at Lancaster University, and I would like to thank the members of the group, specifically, Geoffrey Leech, Jenny Thomas and Andrew Wilson, all of whom I worked with from 1990, along with Roger Garside. The early seed of this work was sown then. Nick Smith, David Lee, Simon Botley and Tony McEnery have given me support along the way. From the Centre for Applied Statistics at Lancaster University, Damon Berridge and Brian Francis have been of invaluable assistance when I posed many statistical questions to them over the past few years. My thanks also to Sylviane Granger for many interesting discussions during our work together in Lancaster and at the Université Catholique de Louvain.
