2019, Proceedings of Recent Advances in Natural Language Processing
https://doi.org/10.26615/978-954-452-056-4_041
7 pages
This paper describes a set of tools that offers comprehensive solutions for corpus lexicography. The tools perform a range of tasks, including construction of a corpus lexicon, integration of information from external dictionaries, internal analysis of the lexicon, and lexical analysis of the corpus. The set of tools is particularly useful for creating dictionaries for under-resourced languages. The tools are integrated into a general-purpose software package that includes additional tools for various research tasks, such as analysis of linguistic development. Equipped with a user-friendly interface, the described system can be easily incorporated into research in a variety of fields.
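As a rough illustration of the first two tasks mentioned above, the sketch below builds a word-frequency lexicon from a plain-text corpus and merges in glosses from an external dictionary. The file names, the tab-separated dictionary format, and the tokenisation are assumptions made for the example, not details of the paper's actual tools.

```python
# Minimal sketch, under assumptions: build a frequency lexicon from a
# raw corpus, then attach glosses from an external dictionary file.
from collections import Counter
import re

def build_corpus_lexicon(corpus_path):
    """Tokenize a plain-text corpus and count word-form frequencies."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"\w+", line.lower()))
    return counts

def merge_external_dictionary(lexicon, dict_path):
    """Attach glosses from a hypothetical headword<TAB>gloss file;
    words missing from the dictionary keep a None gloss."""
    with open(dict_path, encoding="utf-8") as f:
        glosses = dict(line.rstrip("\n").split("\t", 1)
                       for line in f if "\t" in line)
    return {word: {"freq": freq, "gloss": glosses.get(word)}
            for word, freq in lexicon.most_common()}

lexicon = build_corpus_lexicon("corpus.txt")        # assumed file name
entries = merge_external_dictionary(lexicon, "external_dict.tsv")
```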
To analyse corpus data, lexicographers need software that allows them to search, manipulate and save data, a 'corpus tool'. A good corpus tool is key to a comprehensive lexicographic analysis; a corpus without a good tool to access it is of little use. Both corpus compilation and corpus tools have been swept along by general technological advances over the last three decades. Compiling and storing corpora has become far faster and easier, so corpora tend to be far larger.
Lexikos, 2010
This article presents various approaches used in corpus-based computational lexicography. A claim is made that in order for computational lexicography to be efficient, precise and comprehensive, it should utilize the method where the corpus text is first analysed, and the results of this analysis are then processed further to meet the needs of a dictionary. This method has several advantages, including high precision and recall, as well as the possibility to automate the process much further than with more traditional computational methods. The frequency list obtained by using the lemma (the equivalent of the headword) as the basis helps in selecting the words to be included in the dictionary. The approach is demonstrated through various phases by applying SALAMA (the Swahili Language Manager) to the process. Manual work will be needed in the phase when examples of use are selected from the corpus, and possibly modified. However, the list of examples of use, arranged alphabetically according to the corresponding headword, can also be produced automatically. Thus the alphabetical list of headwords with examples of use is the material on which the lexicographer works manually. The article deals with problems encountered in compiling traditional printed dictionaries, and it excludes electronic dictionaries and thesauri.
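To make the workflow above concrete, here is a minimal sketch of the two automatable steps: a lemma frequency list for headword selection, and an alphabetically arranged list of headwords with example sentences for manual review. The (lemma, sentence) input format is an assumption of the example; it is not how SALAMA itself is driven.

```python
# Hedged sketch of lemma-based headword selection and automatic
# example listing; input format is assumed, not SALAMA's.
from collections import Counter, defaultdict

def headword_candidates(analysed_corpus, min_freq=5):
    """analysed_corpus: list of (lemma, sentence) pairs from a
    morphological analyser. Returns lemmas frequent enough to include."""
    freq = Counter(lemma for lemma, _ in analysed_corpus)
    return [lemma for lemma, n in freq.most_common() if n >= min_freq]

def examples_by_headword(analysed_corpus, headwords, per_word=3):
    """Collect up to `per_word` example sentences per headword,
    arranged alphabetically for manual lexicographic review."""
    examples = defaultdict(list)
    selected = set(headwords)
    for lemma, sentence in analysed_corpus:
        if lemma in selected and len(examples[lemma]) < per_word:
            examples[lemma].append(sentence)
    return dict(sorted(examples.items()))
```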
2020
In this paper, we describe the development of Skema and its features. Skema [ˈskiːmə] is a new corpus pattern editor system which supports the manual annotation of concordance lines with user-defined labels (each concordance has its own set of labels) and the editing of the corresponding patterns in terms of slots, attributes, examples and other features following the lexicographic technique of Corpus Pattern Analysis. Skema is integrated into the web-based Sketch Engine and can be used by any user for annotating both preloaded and user corpora. Each annotation label is linked to the pattern structure (stored in JSON format) which can be easily customized to individual projects, a generic pattern structure (i.e. a list of user-defined attributes) being available by default. The paper illustrates the use of Skema in three specific projects, i.e. Woordcombinaties for Dutch verbs, Typed Predicate-Argument Structures for Italian Verbs (T-PAS) and its sister project for Croatian Verbs (C...
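Purely as an illustration of the kind of user-defined, JSON-stored pattern structure the abstract mentions, the snippet below shows a hypothetical pattern for one verb in the CPA style. The attribute names and values are invented for the example and are not taken from Skema's actual schema.

```python
# Invented example of a generic CPA-style pattern structure serialised
# to JSON; field names are illustrative assumptions only.
import json

pattern = {
    "verb": "break",
    "pattern_id": 1,
    "slots": {
        "subject": {"semantic_type": "Human"},
        "object": {"semantic_type": "Physical Object"},
    },
    "attributes": {"register": "neutral", "domain": "general"},
    "examples": ["She broke the vase while dusting the shelf."],
}
print(json.dumps(pattern, indent=2))
```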
What can translations tell us about ongoing semantic changes? The case of must (KARIN AIJMER)
Taking a Language to Pieces: art, science, technology (GUY COOK)
The textual dimensions of Lexical Priming (MICHAEL HOEY, MATTHEW BROOK O'DONNELL)
No corpus linguist is an island: Collaborative and cross-disciplinary work in researching phraseology (UTE RÖMER)
Papers
A corpus-based study for assessing the collocational competence in learner production across proficiency levels (MAHA N. ALHARTHI)
'Sure he has been talking about coming for the last year or two': the Corpus of Irish English Correspondence and the use of discourse markers (CAROLINA P. AMADOR-MORENO, KEVIN MCCAFFERTY)
Developing AntConc for a new generation of corpus linguists (LAURENCE ANTHONY)
Bridging lexical and constructional synonymy, and linguistic variants: the Passive and its auxiliary verbs in British and American English (ANTTI ARPPE, DAGMARA DOWBOR)
An open-access gold-standard multi-annotated corpus with huge user-base and impact: The Quran
There are many benefits to using corpora. In order to reap those rewards, how should someone who is setting up a dictionary project proceed? We describe a practical experience of such ‘setting up’ for a new Portuguese-English, English-Portuguese dictionary being written at Oxford University Press. We focus on the Portuguese side, as OUP did not have Portuguese resources prior to the project. We collected a very large (3.5-billion-word) corpus from the web, removing all unwanted material and duplicates. We then identified the best lemmatizing and parsing tools for Portuguese, and undertook the very large task of parsing the corpus. We then used the dependency parses output by the parser to create word sketches (one-page summaries of a word’s grammatical and collocational behavior). We plan to adapt to Portuguese an existing system for automatically identifying good candidate dictionary examples, and to add salient information about regional words to the word sketches. All of the data and associated support tools for lexicography are available to the lexicographer in the Sketch Engine corpus query system.
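The core of the word-sketch step described above can be sketched as follows: group a headword's collocates by the grammatical relation assigned by the dependency parser. The (head, relation, dependent) triple format is an assumption of this example; the actual Sketch Engine pipeline is considerably more elaborate.

```python
# Rough sketch, under assumptions, of building a word sketch from
# dependency triples: most frequent collocates per grammatical relation.
from collections import Counter, defaultdict

def word_sketch(dep_triples, headword, top_n=10):
    """dep_triples: iterable of (head_lemma, relation, dependent_lemma).
    Returns the most frequent collocates in each relation, with an
    'inv_' prefix when the headword is the dependent."""
    by_relation = defaultdict(Counter)
    for head, rel, dep in dep_triples:
        if head == headword:
            by_relation[rel][dep] += 1
        elif dep == headword:
            by_relation["inv_" + rel][head] += 1
    return {rel: c.most_common(top_n) for rel, c in by_relation.items()}

triples = [("drink", "obj", "coffee"), ("drink", "obj", "tea"),
           ("drink", "obj", "coffee"), ("strong", "amod", "coffee")]
print(word_sketch(triples, "coffee"))  # collocates of 'coffee' by relation
```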
Input a Word, Analyze the World represents current perspectives on Corpus Linguistics (CL) from a variety of linguistic subdisciplines. Corpus Linguistics has proven itself an excellent methodology for the study of language variation and change, and is well-suited for interdisciplinary collaboration, as shown by the studies in this volume. Its title is inspired by the use of CL to assess language in different registers and with a variety of purposes. This collection contains thirty contributions by scholars in the field from across the globe, dealing with current topics on corpus production and corpus tools; lexical analysis, phraseology and grammar; translation and contrastive linguistics; and language learning. Language specialists will find these papers inspiring, as they present new insights on aspects related to research and teaching.
1995
The elaboration of the DECIDE lexicon follows two parallel lines: dictionary-based construction and corpus-based construction. The originality of the DECIDE project was indeed the wish to combine collocational information extracted from dictionaries and textual corpora. The relevance of corpus analysis no longer needs to be demonstrated. Research done so far has already produced very promising results and shown how the variety and intrinsically authentic quality of the information extracted from large corpora can complement the formalized and selective information contained in dictionaries. Although it is undeniable that machine-readable dictionaries provide a fertile resource for the extraction of lexical information for the base vocabulary of a language (see for instance the ACQUILEX project, which demonstrated that it is possible to automatically construct a hierarchy of word types in a number of languages), the completeness of the lexical information offered by monolingual dictionaries is hampered by the historical purpose of the dictionary itself:
- being geared to human users, some definitions require complex world knowledge in order to be exploited;
- general language dictionaries are usually meant to cover the basic language and omit technical words and expressions that are nonetheless likely to appear in any specific corpus;
- their contents usually lag behind usage changes.
One further factor that prompted us to explore corpus-based technologies in addition to our dictionary analysis is the fact that the collection of MRDs is a finite resource. The creation of each dictionary requires hundreds of person-years, an effort that limits their production. On the other hand, free text is becoming available in seemingly unlimited quantities on CD-ROMs, from newsgroups, and in publicly available archives on the Internet. In an ecological sense, unrestricted text is a 'renewable resource' which can be mined without limit, making corpus-based techniques a promising source of lexical information. In the first part of this deliverable (chapter II), we will focus on the contents of the lexicons produced as one of the outputs of the DECIDE project. After explaining how the subfield of speech act nouns was chosen and detailing the criteria used for the selection of the lexical entries, we will look in detail at the dictionary-based lexicon construction, explaining the work done to fine-tune and enhance the dictionary tools and relating an experiment to retrieve collocations from the Cobuild dictionary with the help of the tagger developed in the MECOLB project. Then, we will examine the corpus-based lexicon construction, presenting the various tools that were used and developed within the framework of the DECIDE project to retrieve collocations from various textual corpora. Finally, in the second part of the deliverable (chapter III), we will present and document the architecture of our lexicon, explaining the rationale for choosing this specific format and providing a few commented examples for illustration purposes.
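As a minimal sketch of the corpus-based side of such collocation retrieval, the example below counts co-occurrences of a node word within a fixed window of a tokenised corpus. The window size, tokenisation, node word, and file name are assumptions for illustration, not the settings used in DECIDE.

```python
# Hedged sketch of window-based collocation retrieval from raw text;
# parameters are illustrative assumptions.
from collections import Counter
import re

def window_collocates(tokens, node, window=4):
    """Count tokens appearing within `window` positions of `node`."""
    collocates = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        collocates.update(t for t in tokens[lo:hi] if t != node)
    return collocates

with open("corpus.txt", encoding="utf-8") as f:   # assumed file name
    tokens = re.findall(r"\w+", f.read().lower())
print(window_collocates(tokens, "promise").most_common(20))
```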
In R.V. Fjeld and J.M. Torjusen (eds.), Proceedings of the 15th EURALEX International Congress. Oslo: Reprosentralen, University of Oslo, pp. 404-412 (2012). ISBN: 978-82-303-2228-4.
The latest generation of lexical profiling software (which developed out of the probability measures originally proposed by Church and Hanks) has recently been used as a central source of linguistic data for a new, written-from-scratch pedagogical dictionary. The "Word Sketch" software uses parsed corpus data to identify salient collocates, in separate lists, for the whole range of grammatical relations in which a given word participates. It also links these collocate lists to corpus examples instantiating each combination so identified. Lexicographers found that the Word Sketches not only streamlined the process of searching for significant word combinations, but often provided a more revealing, and more efficient, way of uncovering the key features of a word's behaviour than the (now traditional) method of scanning concordances.
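The probability measure referred to here is (pointwise) mutual information, which Church and Hanks (1990) proposed for ranking collocates: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below estimates it from raw corpus counts; the counts in the usage example are invented for illustration.

```python
# Worked sketch of pointwise mutual information estimated from counts.
import math

def pmi(pair_count, x_count, y_count, corpus_size):
    """PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated by relative frequency in the corpus."""
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Invented counts, e.g. 'strong' + 'tea' in a modifier relation:
print(round(pmi(pair_count=30, x_count=9000, y_count=1500,
                corpus_size=10_000_000), 2))
```

A high PMI flags word pairs that co-occur far more often than chance would predict, which is exactly the signal a Word-Sketch-style collocate list ranks by.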