Proceedings of the Workshop on Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, in association with the 4th International Conference on Language Resources and Evaluation (LREC 2004), 2004
The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical resources on which the system relies, and the seven disambiguation methods including part-of-speech tagging, general likelihood ranking, multi-word-expression extraction, domain of discourse identification, and contextual rules. We report an evaluation of the accuracy of the system compared to a manually tagged test corpus on which the USAS software obtained a precision value of 91%. Finally, we make reference to the applications of the system in corpus linguistics, content analysis, software engineering, and electronic dictionaries.
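A minimal sketch of the general idea behind lexicon-based semantic tagging with likelihood ranking, as a reading aid for the pipeline described above. The mini-lexicon, tag names and the crude domain-of-discourse step are invented for illustration and are not the USAS resources or rules; the real system also draws on POS tagging, MWE templates and contextual rules.

```python
# Toy lexicon-based tagger: each word maps to candidate semantic tags ordered by an
# assumed general likelihood; a crude "domain of discourse" step promotes a candidate
# that matches the text's domain. Tags and entries are invented, not USAS data.
TOY_LEXICON = {
    "bank": ["MONEY", "GEOGRAPHY"],          # financial sense assumed more likely overall
    "spring": ["TIME", "WATER", "OBJECT"],
}

def tag_tokens(tokens, domain=None):
    tagged = []
    for tok in tokens:
        candidates = TOY_LEXICON.get(tok.lower(), ["UNMATCHED"])
        if domain in candidates:              # promote the domain-matching reading
            candidates = [domain] + [c for c in candidates if c != domain]
        tagged.append((tok, candidates[0]))   # pick the top-ranked candidate
    return tagged

print(tag_tokens("The bank by the spring".split(), domain="GEOGRAPHY"))
# [('The', 'UNMATCHED'), ('bank', 'GEOGRAPHY'), ('by', 'UNMATCHED'),
#  ('the', 'UNMATCHED'), ('spring', 'TIME')]
```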
Proceedings of the Corpus Linguistics 2005 Conference
Semantic lexical resources play an important part in both corpus linguistics and NLP. Over the past 14 years, a large semantic lexical resource has been built at Lancaster University. Unlike other major semantic lexicons in existence, such as WordNet, EuroWordNet and HowNet, in which lexemes are clustered and linked via the relationships between word/MWE senses or definitions of meaning, the Lancaster semantic lexicon employs a semantic field taxonomy and maps words and multiword expression (MWE) templates to their potential semantic categories, which are disambiguated according to their context of use by a semantic tagger called USAS (UCREL semantic analysis system). The lexicon is classified with a set of broadly defined semantic field categories, which are organised in a thesaurus-like structure. The Lancaster semantic taxonomy provides a conception of the world that is as general as possible, as opposed to a semantic network for specific domains. This paper describes the Lancaster semantic lexicon in terms of its semantic field taxonomy, lexical distribution across the semantic categories, and lexeme/tag type ratio. As will be shown, the Lancaster semantic lexicon is a unique and valuable resource: a large-scale, general-purpose, semantically structured lexicon with various applications in corpus linguistics and NLP. The semantic lexicon and the USAS tagger are accessible for academic research as part of the Wmatrix tool; for more details see
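As a small illustration of the thesaurus-like structure described above, the sketch below decomposes hierarchical semantic field tags of the published USAS form (a major-field letter, numeric subdivisions, optional +/- markers). The tag strings are only examples; the full taxonomy is not reproduced here.

```python
# Parse a hierarchical semantic field tag into its major field, subdivision path,
# and optional marker. The pattern reflects the general published tag shape.
import re

TAG = re.compile(r"(?P<major>[A-Z])(?P<subfield>[\d.]*)(?P<marker>[+-]*)")

def parse_tag(tag):
    m = TAG.fullmatch(tag)
    return {
        "major_field": m.group("major"),                         # one of the top-level fields
        "subdivisions": [s for s in m.group("subfield").split(".") if s],
        "marker": m.group("marker") or None,                     # e.g. a polarity marker
    }

print(parse_tag("A1.1.1"))   # {'major_field': 'A', 'subdivisions': ['1', '1', '1'], 'marker': None}
print(parse_tag("E4.1+"))    # {'major_field': 'E', 'subdivisions': ['4', '1'], 'marker': '+'}
```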
2014
Since there is at present no gold-standard annotated corpus for this objective, it is necessary to build one in order to allow the generation and testing of automatic systems for classifying the purpose or function of a citation referenced in an article. The development of this kind of corpus is subject to two conditions: the first is to present a clear and unambiguous classification scheme; the second is to secure an initial manual labelling process that reaches sufficient inter-coder agreement among annotators, validating the annotation scheme and making it reproducible even with coders who do not know in depth the topic of the analysed articles. This paper proposes and validates a methodology of corpus annotation for citation classification in scientific literature that facilitates annotation and produces substantial inter-annotator agreement.
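Inter-coder agreement of the kind described above is commonly quantified with Cohen's kappa. The sketch below computes it for two hypothetical annotators; the citation-function labels are invented, not the paper's scheme.

```python
# Cohen's kappa: chance-corrected agreement between two coders over the same items.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(coder_a) | set(coder_b)) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["background", "use", "comparison", "background", "use"]   # hypothetical labels
b = ["background", "use", "background", "background", "use"]
print(cohens_kappa(a, b))   # ~0.67; values above ~0.6 are often read as substantial agreement
```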
Cambridge University Press, 2022
Corpus analysis can be expanded and scaled up by incorporating computational methods from natural language processing. This Element shows how text classification and text similarity models can extend our ability to undertake corpus linguistics across very large corpora. These computational methods are becoming increasingly important as corpora grow too large for more traditional types of linguistic analysis. We draw on five case studies to show how and why to use computational methods, ranging from usage-based grammar to authorship analysis to using social media for corpus-based sociolinguistics. Each section is accompanied by an interactive code notebook that shows how to implement the analysis in Python. A standalone Python package is also available to help readers use these methods with their own data. Because large-scale analysis introduces new ethical problems, this Element pairs each new methodology with a discussion of potential ethical implications.
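A minimal sketch, in Python, of the two model families mentioned above (text classification and text similarity) using scikit-learn; the documents, labels and model choices are placeholders, not the Element's own notebooks or package.

```python
# TF-IDF features feed both a classifier (text classification) and a cosine
# comparison (text similarity). Documents and register labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the committee approved the budget", "great match and a late winner"]
labels = ["news", "sport"]   # hypothetical register labels

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

clf = LogisticRegression().fit(X, labels)                       # text classification
print(clf.predict(vec.transform(["the budget vote was delayed"])))

print(cosine_similarity(X[0], X[1])[0, 0])                      # text similarity between documents
```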
1996
Building on a successful previous project, UCREL (the University Centre for Computer Corpus Research on Language) is collaborating with Reflexions Communication Research (a market research company in London, UK) to develop software which will undertake the semantic tagging of words in a text, facilitate the assignment of 'content tags' to those words, and provide a statistical analysis of the resulting tag frequency profile. The project intends to extend previous work by developing enhanced disambiguation techniques, larger lexical resources and word sense frequency data for spoken English, automatic pronoun resolution and a broader dependency-style syntactic analysis.
Natural language exists to let people exchange ideas. These ideas converge to form the "meaning" of an utterance or text in the form of a series of sentences, and the meaning of sentences is described as semantics. The input and output of an NLP system can be written text or speech. There are two major components of natural language processing: natural language understanding, which maps the given natural language input into a useful representation, and natural language generation, which produces natural language as output on the basis of input data or text. This paper deals with natural language understanding, mainly on semantics.
Unpublished doctoral thesis, Lancaster University, …, 2003
Matrix: a statistical method and software tool for linguistic analysis through corpus comparison. This thesis reports the development of a new kind of method and tool (Matrix) for advancing the statistical analysis of electronic corpora of linguistic data. First, we describe the standard corpus linguistic methodology, which is hypothesis-driven. The standard research process model is 'question - build - annotate - retrieve - interpret', in other words, identifying the research question (and the linguistic features) early in the study. In recent years corpora have been increasingly annotated with linguistic information. From our survey, we find that no tools are available which are data-driven on annotated corpora, in other words, a tool which assists in finding candidate research questions. However, Matrix is such a tool. It allows the macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) as to which linguistic features should be investigated further. By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of UK 2001 general election manifestos of the Labour and Liberal Democrat parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. Currently, it has been tested on restricted levels of annotation and only on English language data.
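The keyness comparison that Matrix extends from key words to key POS tags and key semantic tags is commonly computed with the log-likelihood statistic. Below is a sketch over invented counts; corpus sizes, frequencies and cut-offs are illustrative only.

```python
# Log-likelihood keyness for one item (word, POS tag, or semantic tag) whose
# frequencies in two corpora are a and b.
import math

def log_likelihood(a, b, corpus1_size, corpus2_size):
    e1 = corpus1_size * (a + b) / (corpus1_size + corpus2_size)   # expected frequency in corpus 1
    e2 = corpus2_size * (a + b) / (corpus1_size + corpus2_size)   # expected frequency in corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# e.g. a semantic tag observed 120 times in a 50,000-word manifesto and 40 times in a
# 60,000-word one; a higher value indicates stronger keyness (counts are invented).
print(log_likelihood(120, 40, 50_000, 60_000))
```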
Computational Linguistics, 1993
In this paper we outline a research program for computational linguistics, making extensive use of text corpora. We demonstrate how a semantic framework for lexical knowledge can suggest richer relationships among words in text beyond that of simple co-occurrence. The work suggests how linguistic phenomena such as metonymy and polysemy might be exploitable for semantic tagging of lexical items. Unlike with purely statistical collocational analyses, the framework of a semantic theory allows the automatic construction of predictions about deeper semantic relationships among words appearing in collocational systems. We illustrate the approach for the acquisition of lexical information for several classes of nominals, and how such techniques can fine-tune the lexical structures acquired from an initial seeding of a machine-readable dictionary. In addition to conventional lexical semantic relations, we show how information concerning lexical presuppositions and preference relations can also be acquired from corpora, when analyzed with the appropriate semantic tools. Finally, we discuss the potential that corpus studies have for enriching the data set for theoretical linguistic research, as well as helping to confirm or disconfirm linguistic hypotheses.
The use of corpora in semantic research is a rapidly developing method. However, the range of quantitative techniques employed in the field can make it difficult for the non-specialist to keep abreast of the methodological development. This chapter serves as an introduction to the use of corpus methods in Cognitive Semantic research and as an overview of the relevant statistical techniques and software needed for performing them. The discussion and description are intended for researchers in semantics who are interested in adopting quantitative corpus-driven methods. The discussion argues that there are fundamentally two corpus-driven approaches to meaning, one based on observable formal patterns (collocation analysis) and another based on patterns of annotated usage-features of use (feature analysis). The discussion then introduces and explains each of the statistical techniques currently used in the field. Examples of the use of each technique are listed and a summary of the software packages available in R for performing the techniques is included.
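As a concrete example of the first, collocation-based approach, the sketch below computes pointwise mutual information, one association measure commonly used in this line of work, over a toy corpus (the chapter itself points to R packages; plain Python is used here only for illustration). Window handling and significance thresholds are deliberately simplified.

```python
# Pointwise mutual information over adjacent-word bigrams in a toy tokenised corpus.
import math
from collections import Counter

tokens = "strong tea strong coffee powerful computer strong tea".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / (N - 1)          # bigram probability
    p_x, p_y = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_xy / (p_x * p_y))

print(pmi("strong", "tea"))   # higher PMI indicates a stronger word association
```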
Proceedings of the EURALEX-2004 Conference, 2004
Annotation schemes for semantic field analysis use abstract concepts to classify words and phrases in a given text. The use of such schemes within lexicography is increasing. Indeed, our own UCREL semantic annotation system (USAS) is to form part of a web-based 'intelligent' dictionary (Herpio 2002). As USAS was originally designed to enable automatic content analysis (Wilson and Rayson 1993), we have been assessing its usefulness in a lexicographical setting, and also comparing its taxonomy with schemes developed by lexicographers. This paper initially reports the comparisons we have undertaken with two dictionary taxonomies: the first was designed by Tom McArthur for use in the Longman Lexicon of Contemporary English, and the second by Collins Dictionaries for use in their Collins English Dictionary. We then assess the feasibility of mapping USAS to the CED tagset, before reporting our intentions to also map to WordNet (a reasonably comprehensive machine-useable database of the meanings of English words) via WordNet Domains (which augments WordNet 1.6 with 200+ domains). We argue that this type of research can provide a practical guide for tagset mapping and, by so doing, bring lexicographers one step closer to using the semantic field as the organising principle for their general-purpose dictionaries.
LREC 2004 Proceedings, 2004
Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them 'future proof', we need to evaluate their potential both synchronically and diachronically across genres.
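Lexical coverage of the kind reported above is simply the proportion of corpus tokens matched by the lexicon. A toy sketch, with an invented mini-lexicon and a deliberately unmatched historical spelling:

```python
# Lexical coverage: share of corpus tokens found in the lexicon.
def lexical_coverage(tokens, lexicon):
    known = sum(1 for t in tokens if t.lower() in lexicon)
    return known / len(tokens)

lexicon = {"the", "court", "heard", "evidence", "yesterday"}   # toy lexicon entries
tokens = "The court heard forensick evidence yesterday".split()
print(f"{lexical_coverage(tokens, lexicon):.2%}")   # 'forensick' (historical spelling) is unmatched
```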
Relationship Analysis System and Method for Semantic Disambiguation of Natural Language, 2007
Current approaches to natural language understanding involve example-based statistical analyses or Latent Semantic Indexing to interpret the contextual meaning of messages. However, Any Language Communications has developed a novel system that uses the innate relationships of the words in a sensible message to determine the true contextual meaning of the message. This patented methodology is called “Relationship Analysis” and includes a class/category structure of language concepts, a weighted inheritance system, a number language word conversion, and a tailored genetic algorithm to select the best of the possible word meanings. Relationship Analysis is a powerful language-independent method that has been tested using machine translations with English, French, and Arabic as source languages and English, French, German, Hindi, and Russian as target languages. A simplified form of Relationship Analysis does sophisticated text analyses, in which concepts in the text are recognized irrespective of the text language. Such analyses have been demonstrated using English and Arabic texts, with applications that include concept searches, email routing, semantic tagging, and semantic metadata indexing. In addition, a class/category data analysis provides machine-readable codes suitable for further computer system processing.
2010
The availability of labeled language resources, such as annotated corpora and domain-dependent labeled language resources, is crucial for experiments in the field of Natural Language Processing. Most often, due to a lack of resources, manual verification and annotation of electronic text material is a prerequisite for the development of NLP tools. For under-resourced languages, the lack of corpora becomes a crucial problem because most research efforts are supported by organizations with limited funds. Using free, multilingual and highly structured corpora like Wikipedia to produce automatically labeled language resources can be an answer to those needs. This paper introduces NLGbAse, a multilingual linguistic resource built from the Wikipedia encyclopedic content. The system produces structured metadata which make possible the automatic annotation of corpora with syntactic and semantic labels. Each metadata record contains semantic and statistical information related to an encyclopedic document. To validate our approach, we built and evaluated a Named Entity Recognition tool trained with Wikipedia corpora annotated by our system.
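To make the idea concrete, the sketch below harvests labelled entity mentions from raw wikitext links; the link-pattern regex follows standard [[Target|anchor]] markup, while the page-to-type lookup is a hypothetical placeholder, not the NLGbAse classification itself.

```python
# Extract entity mentions from wikitext links and attach a (placeholder) entity type
# derived from the linked page, yielding automatically labelled NER training material.
import re

LINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def page_type(title):
    # placeholder: a real system would classify the linked page (PERSON, LOC, ORG, ...)
    return {"Paris": "LOC", "Marie Curie": "PERSON"}.get(title, "O")

wikitext = "She moved to [[Paris]] to work with [[Marie Curie|Curie]]."
for target, anchor in LINK.findall(wikitext):
    surface = anchor or target                     # the text as it appears in the article
    print(surface, page_type(target))              # labelled mentions for NER training
```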
Qubahan Academic Journal
Semantic analysis is an essential part of the NLP approach: it represents, in an appropriate format, the context of a sentence or paragraph. Semantics is the study of meaning in language; the vocabulary used conveys the significance of the subject because of the interrelationships between linguistic classes. In this article, semantic interpretation in the area of Natural Language Processing is surveyed. The findings suggest that the reviewed papers relying on the sentiment analysis approach achieved the best accuracy, with minimal prediction error.
Bioscience Biotechnology Research Communications, 2020
We live in the age of intelligent machines, where more and more tasks are handled by machinery, so understanding the concealed meaning of text and speech has become essential for machines. Natural Language Understanding (NLU) is an emerging area of Computational Linguistics, a subarea of Natural Language Processing, used to analyse and convert text or speech sentences written in a human-understandable language into a machine-understandable representation. The purpose of an NLU system is to interpret a text; NLU expects that human languages are interpreted statistically by the system. Understanding natural language is one of the biggest challenges in Artificial Intelligence as well as in discourse processing. Discourse is "language above the sentence or above the clause". This paper presents a method that designs and builds a computer system able to analyse and understand human language and to generate output in the same human-understandable language. The system designed for discourse implementation is tested on the IMDb movie review dataset (IMDb stands for Internet Movie Database). The results are quite promising. We have used a convolutional neural network for this purpose. The efficiency of this model is around 87%.
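A minimal sketch of a one-dimensional convolutional text classifier of the kind described above, written with Keras; the layer sizes and hyperparameters are illustrative assumptions, not the paper's architecture.

```python
# Conv1D binary sentiment classifier over padded integer word sequences.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # assumed vocabulary size

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),          # word embeddings
    layers.Conv1D(64, 5, activation="relu"),    # 1-D convolution over word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # positive/negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training data would be padded IMDb review sequences and 0/1 labels, e.g. from
# keras.datasets.imdb.load_data(num_words=VOCAB_SIZE), followed by model.fit(...).
```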
Journal of Computer Speech & Language, 19:4, 2005
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction still remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS (Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition Semantic labelling for NLP tasks, Lisbon, Portugal. pp. 7-12)) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale semantically classified multi-word expression template database, the system is also capable of detecting many multiword expressions, as well as assigning semantic field information to the MWEs extracted. Whilst USAS therefore offers a unique tool for MWE extraction, allowing us to both extract and semantically classify MWEs, it can sometimes suffer from low recall. Consequently, we have been comparing USAS, which employs a symbolic approach, to a statistical tool, which is based on collocational information, in order to determine the pros and cons of these different tools, and more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the different tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur in low frequencies (lower than three in this case). Due to their complementary relation, we are proposing that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach and a collocation-based statistical approach.
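The complementarity described above can be pictured with a toy sketch: a small lexicon of MWE templates on one side and a frequency-filtered set of bigram candidates on the other, whose union gives broader coverage. All data and thresholds below are invented.

```python
# Lexicon-based MWEs vs. frequency-based bigram candidates over a toy token stream.
from collections import Counter

LEXICON_MWES = {("kick", "the", "bucket"), ("in", "terms", "of")}   # symbolic resource

def bigram_candidates(tokens, min_freq=3):
    counts = Counter(zip(tokens, tokens[1:]))
    return {bg for bg, c in counts.items() if c >= min_freq}        # statistical filter

tokens = ("the crown court heard the case " * 4).split()
statistical = bigram_candidates(tokens)      # picks up frequent pairs such as 'crown court'
print(statistical)

# A combined extractor would union lexicon matches found in the text with these
# statistical candidates, so that rare lexicon MWEs and unlisted domain terms
# (e.g. law/court terminology) are both captured.
```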
2001
A comparison of semantic tagging with syntactic part-of-speech tagging leads us to propose that a domain-independent semantic tagger for English corpora should not aim to annotate each word with an atomic 'sem-tag', but instead that semantic tagging should attach to each word a set of semantic primitive attributes or features. These features should include:
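The abstract is truncated before its feature list, so the names below are purely hypothetical stand-ins; the sketch only illustrates the data-structure contrast between an atomic sem-tag and a set of semantic features per word.

```python
# Each word carries a set of semantic primitive features rather than one atomic tag,
# which supports graded comparisons between words. Feature names are hypothetical.
SEM_FEATURES = {
    "nurse":  {"HUMAN", "OCCUPATION", "HEALTH"},
    "clinic": {"PLACE", "HEALTH", "INSTITUTION"},
    "walk":   {"MOTION", "SLOW"},
}

def shared_features(w1, w2):
    return SEM_FEATURES.get(w1, set()) & SEM_FEATURES.get(w2, set())

print(shared_features("nurse", "clinic"))   # {'HEALTH'}
```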
Lingua, 2017
This article introduces a new standardized system of direct quotation meta-data tagging for instances of direct quotation in EAP corpora. The system was developed to assist corpus-based research on direct quotations and to facilitate replication within the field of corpus-assisted discourse studies. Even though the system is currently implemented as a manual tagging system, suggestions are made regarding the creation of an algorithm to automate annotation, as is currently the case with PoS tagging. The system comprises a manually annotated six-letter tag (which we call the DQMD tag) and a set of custom corpus query language queries. The DQMD tag has been designed to precede each direct quotation in a corpus. The given corpus query strings allow either the direct quotation or the DQMD tag to be shown as the key word in context in a concordance. The system can be used to collect direct quotation frequency data and lexical data (including associated PoS data) using any combination of DQMD tag attributes. Sketch Engine was employed as the concordancer for the examples in the article.
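Since the actual DQMD attribute letters are defined in the article and not reproduced here, the sketch below uses a placeholder six-letter tag simply to show how a tag written immediately before each quotation can be retrieved together with its quotation; Sketch Engine users would do the equivalent with the article's CQL queries.

```python
# Retrieve (tag, quotation) pairs where a six-letter tag directly precedes a quotation.
import re

text = 'He argued, XXXXXX "the data speak for themselves", before moving on.'
PATTERN = re.compile(r'([A-Z]{6})\s+"([^"]+)"')   # placeholder tag + the quotation it precedes

for tag, quotation in PATTERN.findall(text):
    print(tag, quotation)   # XXXXXX the data speak for themselves
```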