2021
St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.
2021
This work presents the first publicly available digital corpus of written texts in St. Lawrence Island Yupik (Inuit-Yupik, ISO 639-3: ess). The public release of this corpus has been coordinated with various stakeholders in the St. Lawrence Island community. The corpus of digitized texts is available on GitHub under a Creative Commons Attribution-NonCommercial 4.0 International License.
Proceedings of the 2019 Conference of the North, 2019
In this paper, we introduce a morphologically aware electronic dictionary for St. Lawrence Island Yupik, an endangered language of the Bering Strait region. Implemented using HTML, JavaScript, and CSS, the dictionary is set in an uncluttered interface and permits users to search in Yupik or in English for Yupik root words and Yupik derivational suffixes. For each matching result, our electronic dictionary presents the user with the corresponding entry from the Badten et al. ( ) Yupik-English paper dictionary. Because Yupik is a polysynthetic language, handling of multimorphemic word forms is critical. If a user searches for an inflected Yupik word form, we perform a morphological analysis and return entries for the root word and for any derivational suffixes present in the word. This electronic dictionary should serve as a valuable resource not only for all students and speakers of Yupik, but also for field linguists working towards documentation and conservation of the language.
Études/Inuit/Studies
St. Lawrence Island Yupik, an endangered language of the Bering Strait region spoken by fewer than one thousand people in western Alaska and far eastern Russia, is currently in a state of generational transition. We survey the existing body of Yupik literature and pedagogical resources developed during the twentieth century, examine the context and use of Yupik in the current educational setting, and describe current challenges for teaching the language in the schools. We then outline our integrated approach to language documentation currently being applied to Yupik, and address how existing resources can be integrated into research and development processes in a way that both supports research efforts and results in tangible modern educational tools for the Yupik community on St. Lawrence Island, and eventually in Russia. This approach is intentionally designed to closely integrate research processes from language documentation and computational linguistics such that the results of each research endeavour positively support the other, and such that both disciplines concretely support community-based efforts to revitalize and teach the language.
Études/Inuit/Studies, 2000
2018
Mi’kmaq is a polysynthetic Indigenous language spoken primarily in Eastern Canada, on which no prior computational work has focused. In this paper we first construct and analyze a web corpus of Mi’kmaq. We then evaluate several approaches to language modelling for Mi’kmaq, including character-level models that are particularly well-suited to morphologically-rich languages. Preservation of Indigenous languages is particularly important in the current Canadian context; we argue that natural language processing could aid such efforts.
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)
As the global crisis of language endangerment deepens, Indigenous communities have continued to seek new means of preserving, promoting and passing on their languages to future generations. For many communities, modern language technology holds the promise of accelerating that process. However, the cultural and disciplinary divides between documentary linguists, computational linguists and Indigenous communities have posed an ongoing challenge for the development and deployment of NLP applications that can support the documentation and revitalization of Indigenous languages. In this paper, we discuss the main barriers to collaboration that these groups have encountered, as well as some notable initiatives in recent years to bring the groups closer together. We follow this with specific recommendations to build upon those efforts, calling for increased opportunities for awareness-building and skills-training in computational linguistics, tailored to the specific needs of both documentary linguists and Indigenous community members. We see this as an essential step as we move forward into an era of NLP-assisted language revitalization.
The paper describes work-in-progress by the Izhva Komi language documentation project, which records new spoken language data, digitizes available recordings, and annotates these multimedia data in order to provide a comprehensive language corpus as a database for future research on this endangered – and under-described – Uralic speech community. While working with a spoken variety and in the framework of documentary linguistics, we apply language technology methods and tools that have so far been applied only to normalized written languages. Specifically, we describe a script providing interactivity between ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora, and different morphosyntactic analysis modules implemented as Finite State Transducers and Constraint Grammar for rule-based morphosyntactic tagging and disambiguation. Our aim is to challenge current manual approaches in the annotation of language documentation corpora.
2020
This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family), release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences and experiments with machine translation (MT) systems trained on this corpus, free online services based on automatic speech recognition (ASR) for easing the tra...
This document describes the status of documentation for the native languages of British Columbia. The kinds of documentation described are: (a) dictionaries; (b) grammars; (c) collections of text; and (d) textbooks. Ideally, a language should have a comprehensive dictionary containing necessary grammatical information and analytic apparatus. It should also have a comprehensive reference grammar.
2018
The preservation of linguistic diversity has long been recognized as a crucial, integral part of supporting our cultural heritage. Yet many “minority” languages (those that lack official state status) are in decline, many severely endangered. We present a prototype system aimed at “heritage” speakers of endangered Finno-Ugric languages. Heritage speakers are people who have heard the language used by the older generations while they were growing up, and who possess a considerable passive competency, well beyond the “beginner” level, but are lacking in active fluency. Our system is based on natural language processing and artificial intelligence. It assists the learners by allowing them to learn from arbitrary texts of their choice, and by creating exercises that engage them in active production of language rather than in passive memorization of material. Continuous automatic assessment helps guide the learner toward improved fluency. We believe that providing such AI-based tools will h...