The many electronic text corpora available nowadays present ever fewer obstacles to a wide range of corpus linguistic study. However, corpora are expensive resources to create and to update, and problems remain for linguists seeking access to very large, very recent, or changing language. The World Wide Web, whilst intended as an information source, is an obvious resource for the retrieval of linguistic information, being the largest store of texts in existence, freely available, covering a range of domains, and constantly added to and updated. Individual linguistic researchers have been trying to retrieve instances of rare or neologistic language use from the web by manipulating existing web search engines. Whilst this strategy is possible, in particular via Google, the output is rather haphazard and not linguist-friendly. The Research and Development Unit for English Studies has been seeking to remedy the situation through the creation of 'WebCorp', a tool designed to search the Internet and provide on-line tailored access to linguists. A demonstration tool is available at
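The core of the web concordancing described above is keyword-in-context (KWIC) output. The sketch below is illustrative only: the `kwic` helper is a hypothetical name, and it operates on an in-memory string rather than on live web pages, which a tool like WebCorp would fetch and clean first.

```python
import re

def kwic(text, keyword, window=40):
    """Produce keyword-in-context (KWIC) lines for every match of `keyword`."""
    lines = []
    for m in re.finditer(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{window}} [{m.group(0)}] {right}")
    return lines

sample = ("The Web is the largest store of texts in existence. "
          "Linguists mine the Web for rare or neologistic language use.")
for line in kwic(sample, "web"):
    print(line)
```

A real web concordancer would add HTML stripping, sentence boundary detection, and result pagination on top of this basic alignment step.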
This paper revisits the question of whether it is possible to use commercial web search tools such as the Google interface for meaningful corpus research, given recent advances in search technology. It is argued that, with the proper methodology, these web tools can and should be used for corpus research, since they offer considerable advantages over both closed corpora and web-based linguistic search tools.
Language Resources and Evaluation, 2010
In 2003, the special issue of Computational Linguistics (September, 29, 3) dedicated to the Web as corpus, edited by Adam Kilgarriff and Gregory Grefenstette, was a landmark event for a promising field of study. Today, this book makes for a fine update, even if it is more limited in scope than its predecessor and less recent in its content than its date of publication would lead one to believe. The articles included are in fact partially "based on papers presented at the symposium Corpus linguistics-Perspectives for the Future held (…) in Heidelberg in October 2004" (p. 4). However, the editors state that some of the articles were commissioned later, and many of the texts have in fact been brought up to date to take recent developments into account. As for the book itself, it is divided into two parts of nearly equal length. The first includes articles with a "methodological" bent, while the second is made up of seven papers dealing with different corpus-linguistic approaches to English. In this regard, it is worth pointing out that, in spite of the wealth of corpus linguistics work on other languages, and the fact that the book's contributions come mostly from scholars based in continental Europe, no language other than English is dealt with at any length in a book named simply "Corpus Linguistics". This second half of the collection is therefore mainly of interest to students of English, though in a sense it nicely complements some of the state-of-the-art summaries in the first section, discussing practical applications of the tools described there. One good example concerns using Google as a research tool. After several years of trials and testing, it is now clear that Google search results can vary by up to an order of magnitude, even for frequent words or features (see for example the contribution by William H. Fletcher in this volume, pp. 25-45, "Concordancing the web: promise and problems, tools and techniques", and in particular p. 37, drawing
Computational linguistics, 2003
The Web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists' playground. This special issue of Computational Linguistics explores ways in which this dream is being realised.
In this paper, we provide an overview of the new GloWbE corpus – the Corpus of Global Web-Based English. GloWbE is based on 1.9 billion words in 1.8 million web pages from 20 different English-speaking countries. Approximately 60% of the corpus comes from informal blogs, and the rest from a wide range of other genres and text types. Because of its large size, as well as because of its architecture and interface, the corpus can be used to examine many types of variation among dialects, which might not be possible with other corpora – including variation in lexis, morphology, (medium- and low-frequency) syntactic constructions, variation in meaning, as well as discourse and its relationship to culture.
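The dialect comparisons a corpus like GloWbE enables amount, in the simplest case, to comparing normalised frequencies of lexical variants across national sub-corpora. The sketch below uses invented toy data standing in for GloWbE's national sections; the `per_million` helper is a hypothetical name.

```python
# Toy sub-corpora standing in for GloWbE's national sections
# (illustrative data, not real GloWbE counts).
corpora = {
    "GB": "I put the shopping in the boot of the car near the lorry".split(),
    "US": "I put the groceries in the trunk of the car near the truck".split(),
}

def per_million(tokens, word):
    """Frequency of `word`, normalised per million tokens so that
    sub-corpora of different sizes are comparable."""
    return tokens.count(word) / len(tokens) * 1_000_000

for variant in ("boot", "trunk"):
    for dialect, tokens in corpora.items():
        print(f"{variant:6s} {dialect}: {per_million(tokens, variant):10.0f} pmw")
```

With 1.9 billion words split across 20 countries, the same normalisation step is what makes cross-dialect counts directly comparable in the real corpus interface.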
2009
Abstract: The paper systematically compares the utility, for linguists and language learners, of purpose-built text corpora and the textual resources of the World Wide Web. Different modes of access are discussed, including dedicated concordancing software, web concordancers, and universal search engines.
Literary and Linguistic Computing, 2008
Language Resources and …, 2009
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC versus the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
2003
Language scientists and technologists are increasingly turning to the web as a source of language data, because other resources are not large enough, because they do not contain the types of language the researcher is interested in, or simply because it is free and instantly available. The default means of access to the web is through a search engine such as Google. While the web search engines are dazzlingly efficient pieces of technology and excellent at the task they set themselves, for the linguist they are frustrating.
Transactions of the Philological Society, 1983
… del lenguaje Natural, 2008
Georg Rehm, Oliver Schonefeld, Andreas Witt. SFB 441 Linguistic Data Structures, University of Tübingen, Nauklerstrasse 35, 72074 Tübingen, Germany
2010
This report describes the development of the information-analytical system "Manuscript", designed for preparing electronic publications of medieval documents on the Internet (project portal: http://manuscripts.ru/index_en.html), together with a technique for applying the electronic corpus to historical-linguistic research. Special system modules interacting with the full-text database support the entire cycle of work on preparing an Internet edition, including its annotation and linguistic markup. Particular attention is paid to the system's facilities for formulating search queries and visualising retrieval results. The query criteria, the various ways of ordering retrieval units based on the meta-markup of manuscripts and texts, the annotation of their fragments, and the word-by-word parallel analysis of contexts help the user to obtain material for linguistic and linguistic-text...
International Journal of Arts Humanities and Social Sciences Studies, 2021
The introduction of the Internet and the emergence of freely available web resources have facilitated endless efforts by language teachers to do research on language patterns. Web resources in general, and corpus tools in particular, have enabled large samples of language to be explored for better insights into the nature of language in use, in all its forms and uses. While corpora are normally assumed to be in the hands of lexicographers, whose job is to inform dictionaries or grammar books, arguments have arisen around why end-users such as language teachers and learners cannot make use of these innovative tools. This paper adds to this ongoing debate by discussing approaches to using corpora as a reference point for language teaching and research. It shows the potential of common data-driven web tools for research on language patterns. It also explores pathways for language teachers to examine aspects of language in use through authentic texts accessed via corpus tools. Results from corpus searches in COCA and the BNC, and generalisations drawn from the data (in this case on noun and verb patterns and grammatical metaphor), further showcase the inexhaustible implications of web resources for enhanced language teaching and research.
Literary and Linguistic Computing, 2010
Corpus linguistics has revolutionised our way of working in historical linguistics. The painstaking job of collecting data and manually analysing them has been made less arduous by the machine processing of corpora, which allows for quick and efficient searches. The aim of the present study is two-fold: to show how corpus linguistics has contributed to the ways in which researchers approach the study of the history of English, and to provide an overview of selected corpora available in the field. Setting aside the theoretical debate as to whether corpus linguistics should be considered merely a methodology, a branch of linguistics, or both (Taylor, 2008), it is widely acknowledged that corpus linguistics is of considerable help in any branch of linguistics, be it theoretical or applied. The use of corpora makes it possible to test hypotheses established within a specific linguistic area through the fast and reliable analysis of vast pools of material. As a result, the objective measurement of data is available to scholars, who can thus verify their hypotheses and intuitions, and can quickly amend or qualify their research claims if previous ones are shown to be false. There is, then, a continuous interaction between theory, as expressed in linguistic postulates, concepts and hypotheses, and the application and validation of these theoretical principles through linguistic corpus analysis. The use of corpora is perhaps a more powerful instrument in the field of historical linguistics than in other fields, since the absence of living informants makes judgements based on intuition unreliable, and claims have to be empirically attested using data. These data can be extracted from systematically compiled collections of machine-readable texts, called corpora. However, in considering these undeniably advantageous working tools, some caveats should be borne in mind, as will be discussed in what follows.
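As a small illustration of how machine processing eases the kind of historical searching described above: historical spelling varies, so a literal search misses variants, while a short regular expression over attested alternations recovers them. The text and the pattern below are invented for the sketch, not drawn from any real corpus.

```python
import re

# A tiny invented sample imitating Early Modern English spelling variation.
text = ("He doth knowe that the kyng dothe not know the kynge's mynde, "
        "yet every man doth know his owne minde.")

# A literal search for "know" would miss "knowe"; a pattern over the
# attested alternation (optional final -e) recovers both spellings.
pattern = re.compile(r'\bknow[e]?\b', re.IGNORECASE)
hits = pattern.findall(text)
print(hits)
```

Real historical corpora often sidestep this problem with normalised or lemmatised annotation layers, but regex search over raw text remains a common first step.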
A. Renouf & A. Kehoe (eds.) The Changing Face of Corpus Linguistics, Amsterdam: Rodopi., 2006
The WebCorp project has demonstrated how the Web may be used as a source of linguistic data. One feature of standard corpus analysis tools hitherto missing in WebCorp is the ability to filter and sort results by date. This paper discusses the dating mechanisms available on the Web and the date query facilities offered by standard Web search engines. The new date heuristics built into WebCorp are then discussed and illustrated with a case study.
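A minimal sketch of the kind of date heuristics discussed above, assuming only two common cues: a /YYYY/MM/DD/ path segment in the URL, and an HTML meta date tag. The `guess_date` helper is a hypothetical name, and WebCorp's actual heuristics are more elaborate than this.

```python
import re
from datetime import date

def guess_date(url, html=""):
    """Heuristically date a web page: try a /YYYY/MM/DD/ URL path first,
    then a <meta name="date"> tag in the HTML. Returns None if neither
    cue is present."""
    m = re.search(r'/(\d{4})/(\d{2})/(\d{2})/', url)
    if m:
        return date(*map(int, m.groups()))
    m = re.search(
        r'<meta[^>]*name=["\']date["\'][^>]*content=["\'](\d{4})-(\d{2})-(\d{2})',
        html, re.IGNORECASE)
    if m:
        return date(*map(int, m.groups()))
    return None

print(guess_date("https://example.com/blog/2004/10/15/corpus-news"))
```

In practice such cues conflict (server headers, last-modified dates, dates mentioned in the text itself), which is why a dating mechanism needs an ordered set of heuristics rather than a single rule.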
Using Corpora in …, 2010
2000
In this paper we present the Corpógrafo, an integrated web-based environment for corpus linguistics and knowledge engineering being developed at the Porto node of Linguateca. The Corpógrafo aims to provide an integrated corpus research environment by making freely available on the web a comprehensive set of text and language tools (http://www.linguateca.pt/corpografo/). We describe the current stage of development
2002
Seven Tones ([13]) is a search engine specialized in linguistics and languages. Its current database, stored on a single machine, contains approximately 240,000 indexed web pages about linguistics and languages. Nevertheless, the search engine is designed for a much larger capacity. It has been used in several other systems and can be transplanted into a distributed computing environment. In terms of relevance and web-page quality, Seven Tones performs better than Google in the area of linguistics and languages.