2013
This paper gives an overview of the morpho-syntactic features of the Amazighe language and of corpus encoding, and then presents our experience of constructing a corpus annotated with part-of-speech (POS) information. The annotated corpus consists of 20,667 Moroccan Amazighe tokens chosen from different materials; to our knowledge, it is the first such corpus for the Amazighe language. The experience is also meant to give a handle on the encoding and tagging processes of the aforementioned corpus.
The Amazigh language, as one of the Afro-Asiatic languages, poses many challenges for natural language processing. Its writing system, its morphology based on a distinctive word formation process of roots and patterns, and the lack of linguistic corpora make computational approaches to Amazigh challenging. In this paper, we give an overview of the current state of the art in Natural Language Processing for the Amazigh language in Morocco, and we suggest the development of other technologies that Amazigh needs in order to live in the "information society".
The main goal of this work is the implementation of a new tool for Amazigh part-of-speech tagging using Markov Models and decision trees. After studying different approaches to, and problems of, part-of-speech tagging, we implemented a tagging system based on TreeTagger, a generic stochastic tagging tool that is very popular for its efficiency. We gathered a working corpus large enough to ensure general linguistic coverage. This corpus was used to run the tokenization process as well as to train TreeTagger. We then performed a straightforward evaluation of the outputs on a small test corpus. Though restricted, this evaluation showed encouraging results.
Part-of-Speech (POS) tagging is an essential step in most natural language processing applications because it identifies the grammatical category of the words in a text. POS taggers are therefore an important module for widely used applications such as question-answering systems, information extraction, information retrieval and machine translation. They can also be used in many other applications, such as text-to-speech, or as a pre-processor for a parser; a parser can do the tagging better, but at a higher cost. In this paper, we focus on POS tagging for the Amazigh language. Currently, TreeTagger (henceforth TT) is one of the most popular and most widely used tools thanks to its speed, its language-independent architecture, and the quality of its results. We therefore sought to develop a TT settings file for Amazigh. Our work involves the construction of the dataset and the pre-processing of the input in order to run the two main modules: the training program and the tagger itself. This work thus adds to the still scarce set of tools and resources available for the automatic processing of Amazigh. The rest of the paper is organized as follows. Section 2 puts the current article in context by overviewing related work. Section 3 describes the linguistic background of the Amazigh language. Section 4 presents the Amazigh tagset used and our training corpus. Experimental results are discussed in Section 5. Finally, we report our conclusions and possible future work.
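As a rough illustration of the Markov-model side of stochastic tagging mentioned above, the following is a minimal sketch of a bigram hidden Markov model tagger decoded with the Viterbi algorithm. It is not TreeTagger and not the authors' system: TreeTagger estimates its transition probabilities with decision trees, whereas this sketch assumes plain add-one smoothed counts. The function names, the English toy sentences and the tag labels are all hypothetical placeholders, not the Amazigh corpus or tagset.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count tag-bigram transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted words
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, sorted(emissions), vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for `words`."""
    transitions, emissions, tags, vocab = model
    if not words:
        return []
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    n_tags = len(tags)
    # best[i][tag] = (log score of best path ending in tag at i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical toy training data (placeholder tokens and tags).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
MODEL = train_hmm(TRAIN)
print(viterbi(["the", "cat", "barks"], MODEL))
```

Because unseen words receive the same smoothed emission probability under every tag, the tag-transition counts alone decide how an unknown word is labelled in this sketch.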
2010
Amazigh language and culture may well be viewed as having known an unprecedented boom in Morocco: more than a hundred […] which are published by the Royal Institute of Amazigh Culture (IRCAM), an institution created in 2001 to preserve, promote and endorse Amazigh culture in all its dimensions. Crucially, publications in the Amazigh language would not have seen the light without the valiant attempts to upgrade the language on the linguistic and technological levels. The central thrust of this contribution is to provide an overview of the whole range of actions carried out by IRCAM. Of prime utility to this presentation is what was accomplished to supply Amazigh with the necessary tools and corpora, without which the Amazigh language would fail to have a place in the world of NITCs. After a brief description of the main specificities that characterise the standardisation of Amazigh in Morocco, a retrospective on the basic computer tools now available for the processing of Amazigh is set out. It is concluded that the homogenisation of a considerable number of corpora should, by right, be viewed as a strategic move and an incontrovertible prerequisite to the computerisation of Amazigh.
The study described in this paper belongs to the area of computational linguistics. Computational linguistics is a field of artificial intelligence dealing with the logical modeling of natural language from a computational perspective. It unites two areas that appear quite different: computer science and natural languages. Computational linguistics might be considered a synonym of automatic natural language processing, since its main task is precisely the construction of computer programs that process words and texts in natural language. Many areas may be considered as properly included within the discipline of computational linguistics. One of these is part-of-speech tagging (POS-tagging). POS-tagging is the process of automatically assigning the proper grammatical tag to each word of a written text according to its occurrence in the text. Thus, the task of POS-tagging is to attach appropriate grammatical or morpho-syntactic category labels to each word, token, symbol, abbreviation and even punctuation mark in a corpus. POS-tagging is usually the first step in linguistic analysis. It is also a very important intermediate step in building many natural language processing applications. It can be used in spell checking and correction systems, speech recognition systems, information retrieval systems and text-to-speech synthesis systems.
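To make the task definition above concrete, here is a minimal sketch of the simplest possible tagger: a most-frequent-tag baseline that labels every token, punctuation included. It is an illustrative assumption rather than the method of the paper; the function names, the English toy tokens and the tag labels are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_tokens):
    """Learn each word's most frequent tag and a global fallback tag."""
    per_word = defaultdict(Counter)  # word -> counts of tags seen with it
    overall = Counter()              # tag -> total count in training data
    for word, tag in tagged_tokens:
        per_word[word][tag] += 1
        overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lexicon, default_tag


def tag_tokens(tokens, lexicon, default_tag):
    """Attach one tag to every token; unknown tokens get the fallback tag."""
    return [(tok, lexicon.get(tok, default_tag)) for tok in tokens]


# Hypothetical toy training data; "run" is ambiguous between NOUN and VERB.
TRAINING = [
    ("the", "DET"), ("dogs", "NOUN"), ("run", "VERB"), (".", "PUNCT"),
    ("the", "DET"), ("run", "NOUN"), ("tires", "VERB"), ("dogs", "NOUN"),
    (".", "PUNCT"),
    ("dogs", "NOUN"), ("run", "VERB"), (".", "PUNCT"),
]
LEXICON, DEFAULT_TAG = train_baseline(TRAINING)
print(tag_tokens(["the", "cats", "run", "."], LEXICON, DEFAULT_TAG))
```

A baseline like this ignores context entirely, so the ambiguous word always receives the same tag; resolving such ambiguity from the surrounding words is what contextual taggers add.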
2022
The most important objective of our study was to build a complete and comprehensive morphological analysis scheme, tagging system, and parsing system that can be used for annotating Arabic corpora. In our dissertation, we carried out an analytical study, implementation, and evaluation of Arabic morphological analysis, tagging, and syntactic analysis starting from raw text. The three systems were implemented using different methods; for morphological analysis, we used a finite-state automaton, as discussed in chapter four, after the tokenization and segmentation of the raw text explained in chapter three. The tagging system was implemented with a new, very rich tag set, which we designed and developed ourselves. It consists of 30 tags in addition to some other features and linguistic information, as described in chapter five. An appropriate set of tags has a direct influence on the accuracy and the usefulness of a tagging system. So, the smaller the tag set, the highe...
MATEC Web of Conferences, 2018
Standardized resources are key components for the development of applications related to human language technology. It is therefore important to adopt them when designing lexical resources, especially for less-resourced languages such as Amazigh. This language is spoken by many North African communities, including in Morocco. Due to historical, geographical and sociolinguistic factors, the Amazigh language is characterized by the proliferation of many intervarieties, which has led to a complex morphology. The latter poses a significant challenge to NLP tasks, especially as the Amazigh language belongs to the Afro-Asiatic (Hamito-Semitic) language family, known for its non-concatenative morphology based on root and pattern. Given the scarcity of Amazigh language resources dealing with morpheme encoding, orthographic changes, and morphotactic variations, the elaboration of a standardized lexical resource will certainly ensure wide exchange and exploitation. In this context, this pa...
Since antiquity, the Amazigh heritage has been expanding from generation to generation. To safeguard it from the threat of disappearance, it seems opportune to equip this language with the means necessary to meet the challenges of access to the domain of New Information and Communication Technologies (ICT). In this context, and with a view to building tools and linguistic resources for the automatic processing of the Amazigh language, we develop a lexicon and morphological rules using finite-state technology within the NooJ linguistic development environment in order to parse Amazigh texts.
Towards automatic processing of the Amazigh language. Since antiquity, the Amazigh heritage has been expanding from generation to generation. To safeguard and exploit this heritage and to keep it from being threatened with disappearance, it seems opportune to equip this language with the means necessary to meet the challenges of access to the domain of new information and communication technologies (NICT), which proves essential for promoting and computerizing the language. In this context, and with a view to developing tools and linguistic resources for the automatic processing of this language, we have undertaken to use the NooJ linguistic engineering platform to create a module for standard Amazigh (Ameur et al., 2004a). Our first objective is the analysis of Amazigh texts. To this end, we begin by formalizing the Amazigh vocabulary (noun, verb and particles). In this article we focus on the formalization of two categories, nouns and particles, which makes it possible to generate from a lexical entry its gender (masculine, feminine), its number (singular, plural) and its state (free, annexed). Finally, we develop an electronic dictionary to be used, on the one hand, to test our inflection rules and, on the other hand, for the lexical analysis of Amazigh texts.
Amazigh is the autochthonous language of North Africa. However, it was not until 2011 that it became a constitutionally official language in Morocco, after years of persecution. Amazigh is still considered one of the under-resourced languages. This paper presents the development of a multilingual parallel corpus (Amazigh-English-French) aligned at the sentence level. The corpus is intended for use in linguistic research, teaching, and natural language processing applications, primarily machine translation. The paper discusses this aspect and presents the corpus encoding. A multilingual parallel corpus that brings together Amazigh, English and French is a new resource for the NLP community that completes the present panorama of parallel corpora. To the best of our knowledge, this corpus is the first Amazigh-English-French multilingual parallel corpus. The corpus is sentence-aligned and includes 31864 sentences. The alignment was done automatically, while the evaluation was done manually. The evaluation results are satisfactory, achieving more than 90%.
International Journal of Advanced Computer Science and Applications
This paper presents an Arabic-compliant part-of-speech (POS) tagging scheme based on atomic tag markers that are grouped together using brackets. This scheme promotes the speedy production of annotations while preserving the richness of the resulting annotations. The proposed scheme comprises two main elements: a new tokenization approach and a custom tool that enables the semi-automatic implementation of the scheme. The proposed model can serve in many scenarios where the user needs better Arabic support and more control over the part-of-speech tagging process. The scheme was used to annotate sample narratives, and it demonstrated capability and adaptability while addressing the various distinguishing features of the Arabic language, including its unique declension system. It also sets new baselines that future efforts can explore further.
IAES International Journal of Artificial Intelligence, 2024
Northern European journal of language technology, 2022
Zenodo (CERN European Organization for Nuclear Research), 2022
Submitted to Journal …, 2007
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014