This article provides a brief overview of the Daba software package, created in the course of building corpora for Manding languages. Key software features are motivated by the tasks and problems characteristic of many African languages. The corpus-building model proposed here was initially developed for the Bambara Reference Corpus, which is available online and is freely accessible. The morphological analysis procedure and corpus annotation scheme are discussed in detail. Daba uses a morpheme-based morphological annotation scheme inspired by the interlinear glossed presentation of linguistic examples. A scheme mapping Daba's morpheme-based morphological information onto traditional word-based corpus annotation is provided. Since Bambara is characterized by a low level of written language standardization, special attention is paid to the issues of representing variability in corpus annotation. Résumé. The article deals with the "Daba" software package created in the course of ...
2013
African Language Technology is rapidly becoming one of the hottest new topics in computational linguistics. The increasing availability of digital resources, an exponentially growing number of publications and a myriad of exciting new projects are just some of the indications that African Language Technology has been firmly established as a mature field of research. The AfLaT workshops attempt to bring together researchers in the field of African Language Technology and provide a forum to present ongoing efforts and discuss common obstacles and goals. We are pleased to present to you the proceedings of the Second Workshop on African Language Technology (AfLaT 2010), which is co-located with the Seventh International Conference on Language Resources and Evaluation (LREC 2010). We were overwhelmed by the quantity and quality of the submissions we received this year, but were lucky enough to have a wonderful program committee, who sacrificed their valuable time to help us pick the cream of the crop. We pay tribute to their efforts by highlighting reviewers' quotes in the next paragraphs. Grover et al. kick off the proceedings with a comprehensive overview of the HLT situation in South Africa, followed by Bański and Wójtowicz's description of an initiative that is beneficial to the creation of resources [...] for African languages. De Pauw et al. describe techniques that could be used to develop a plethora of [...] HLT resources with minimal human effort, while Shah et al. present impressive results on tackling the problem of NER in MT systems between languages, one of which at least is poorly resourced. Groenewald and du Plooy's paper tackles the all-too-often overlooked problem of text anonymization in corpus collection, followed by Chege et al.'s effort that is significant [...] to the open source community, not just for Gĩkũyũ but for the African languages in general.
Faaß presents a useful resource for further computational processing of the language of Northern Sotho. Tachbelie and Menzel provide a clear and concise overview of the general issues affecting language models for morphologically rich languages, while Van der Merwe et al. go into an informative discussion of the properties of the Zulu verb, its extensions, and deverbatives. The paper by Oosthuizen et al. aptly discusses the issue of quantifying and correcting transcription differences between inexperienced transcribers, while Davydov's paper is an interesting case study for collecting corpora for "languages recently put into writing". Ng'ang'a presents the key resource for the identification of a machine-readable dialectal dictionary for Igbo and Purvis concludes by discussing a corpus that contributes to the development of HLT tools for Dagbani. We are proud to have Justus Roux as the invited speaker for this year's edition of AfLaT to discuss one of the most often asked and rarely answered questions in our field of research: Do we need linguistic knowledge for speech technology applications in African languages? We hope you enjoy the AfLaT 2010 workshop and look forward to meeting you again at AfLaT 2011.
The development of computational morphological analysers for South African Bantu languages is linked to a project funded by the National Research Foundation in South Africa. The main research question in the project concerns the development of finite-state morphological analysers for five Bantu languages, namely Zulu, Xhosa and Swati (belonging to the Nguni group of languages), and Northern Sotho and Tswana (belonging to the Sotho group of languages). This development is based on underlying machine-readable lexicons that conform to common lexical specifications and international standards. Due to the rich agglutinating morphological structures of these languages, the morphological processing poses particular challenges. These challenges are of an orthographical, a morphological as well as of a lexical nature. The current status of the project is reported on, firstly in terms of the development of prototypes of morphological analysers for the various languages, and secondly in terms of the development of standardised XML machine-readable lexicons for the South African Bantu languages, based on an appropriate general data model.
2007
In this paper the development of computational morphological analysers for six South African Bantu languages is discussed. Due to the rich agglutinating morphological structures of these languages, the morphological processing poses particular challenges. These challenges are of an orthographical, a morphological as well as of a lexical nature. The current status of the project is reported on, firstly in terms of the development of prototypes of morphological analysers for the various languages, and secondly in terms of the development of standardised XML machine-readable lexicons for the South African Bantu languages, based on an appropriate general data model.
1996
The paper describes problems in disambiguating the morphological analysis of Bantu languages by using Swahili as a test language. The main factors of ambiguity in this language group can be traced to the noun class structure on one hand and to the bi-directional word-formation on the other. In analyzing word-forms, the system applied utilizes SWATWOL, a morphological parsing program based on two-level formalism. Disambiguation is carried out with the latest version (April 1996) of the Constraint Grammar Parser (CGP). Statistics on ambiguity are provided. Solutions for resolving different types of ambiguity are presented and they are demonstrated by examples from corpus text. Finally, statistics on the performance of the disambiguator are presented.
Language Resources and Evaluation, 2011
In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre "Information Structure". These include deeply annotated data collections of 25 Sub-Saharan languages, which are described together with their annotation scheme, and the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access and the creation of data archives.
ArXiv, 2021
We introduce the contemporary Amharic corpus, which is automatically tagged for morpho-syntactic information. Texts are collected from 25,199 documents from different domains and about 24 million orthographic words are tokenized. Since it is partly a web corpus, we performed some automatic spelling error correction. We have also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
Frederick Mario Fales and Giulia Francesca Grassi (eds.), CAMSEMUD 2007: Proceedings of the 13th Italian Meeting of Afro-Asiatic Linguistics. Padova: S.A.R.G.O.N.: 2010: 177-180, 2010
Southern African Linguistics and Applied Language Studies, 2003
There are currently two distinct but not necessarily mutually exclusive approaches to the retrieval of information from linguistic corpora. 'Corpus-driven' approaches rely solely on the corpus itself to yield significant patterns. With the exception of orthographic spacing, no additional annotations to a 'raw' corpus are used to guide searches and the retrieval of information from the corpus. Typically, key word in context (KWIC) analyses are applied to relevant concordance lines to extract statistically significant lexical and grammatical patterns. In 'corpus-based' approaches, on the other hand, information is retrieved from an enriched corpus on the basis of annotations in the form of linguistic tags. That is, the annotations are used to direct the searches to specific grammatical and lexical phenomena in a corpus. In this article, we propose a corpus-based approach and a tag set to be used on a corpus of spoken language for the African languages of South Africa. A number of problematic linguistic phenomena, such as fixed expressions, agglutination and morphemic merging, and spoken language phenomena, such as interrupted words, often have some effect on tagging principles. These problematic phenomena are discussed and illustrated. The development of the tag set is based on the morphosyntactic properties of Xhosa for reasons that are outlined in the article. Manual tagging of a large corpus would be quite a daunting and time-consuming task, not to mention the potential for various kinds of errors. This problem is solved in a two-step process. Firstly, a computer-based drag-and-drop tagger was developed to facilitate the manual tagging of a so-called training corpus. This training corpus then forms the input to the development of an automatic tagger. The principles and procedures for the development of an automatic tagger for African languages are also discussed.
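The key word in context (KWIC) analysis mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from the article: the function name `kwic`, the `window` parameter and the sample sentence are assumptions made purely for illustration, and the text is split on orthographic spacing only, as in a 'raw' corpus.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence.

    Hypothetical helper for illustration; matching is case-insensitive
    and works on whole tokens only.
    """
    hits = []
    target = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Up to `window` tokens on either side of the hit
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Invented sample text, tokenized on whitespace only (no annotation)
text = "the corpus yields patterns and the corpus guides the search"
lines = kwic(text.split(), "corpus", window=2)

# Print the concordance lines with the keyword aligned in a column
for left, kw, right in lines:
    print(f"{left:>20} [{kw}] {right}")
```

A corpus-based search of the kind the article proposes would instead match on tags attached to each token, not on the surface form alone.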