Papers by Mohamed Outahajala

Like most languages that have only recently begun to be investigated for Natural Language Processing (NLP) tasks, Amazigh lacks annotated corpora and still suffers from a scarcity of linguistic tools and resources. The main aim of this paper is to present a tokenizer and a new part-of-speech (POS) tagger based on a new Amazigh tag set (AMTS) composed of 28 tags. To this end, we trained two sequence classification models, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), to build a tokenizer and a POS tagger for the Amazigh language. We used 10-fold cross-validation to evaluate and validate our approach. POS tagging results using SVMs and CRFs are very comparable: CRFs outperformed SVMs both at the fold level (91.18% vs. 90.75%) and on the 10-fold average (87.95% vs. 87.11%). Regarding the tokenization task, SVMs outperformed CRFs at the fold level (99....
This paper gives an overview of the morpho-syntactic features of the Amazighe language and of corpus encoding. We then present our experience of constructing a corpus annotated with part-of-speech (POS) information. The annotated corpus consists of 20,667 Moroccan Amazighe tokens chosen from different materials; it is, to our knowledge, the first one dealing with the Amazighe language. The paper is also meant to give insight into the encoding and tagging processes of this corpus.
Études et Documents Berbères

Like most languages that have only recently begun to be investigated for Natural Language Processing (NLP) tasks, Amazigh lacks annotated corpora and still suffers from a scarcity of linguistic tools and resources. The main aim of this paper is to present a new part-of-speech (POS) tagger based on a new Amazigh tag set (AMTS) composed of 28 tags. To this end, we trained Conditional Random Fields (CRFs) to build a POS tagger for the Amazigh language. We used 10-fold cross-validation to evaluate and validate our approach. The CRFs reach a 10-fold average of 87.95%, and the best fold reaches 91.18%. To improve this result, we gathered a set of about 8k words with their POS tags. The collected lexicon was combined with the CRF confidence measure to obtain a more accurate POS tagger, which raised performance to 93.82%.

The aim of this paper is to present the first Amazigh POS tagger. Very few linguistic resources have been developed so far for Amazigh, and we believe that the development of a POS tagger is the first step needed for automatic text processing. To this end, we trained two sequence classification models, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), after a tokenization step. We used 10-fold cross-validation to evaluate our approach. Results show that the performance of SVMs and CRFs is very comparable: SVMs outperformed CRFs at the fold level (92.58% vs. 92.14%), while CRFs outperformed SVMs on the 10-fold average (89.48% vs. 89.29%). These results are very promising considering that we used a corpus of only ~20k tokens. Keywords: automatic part-of-speech tagging, Amazigh language, NLP, supervised learning, segmentation

Like most languages for which Natural Language Processing (NLP) research has only recently begun, the Amazigh language has few NLP resources and tools. Accordingly, one of the main objectives of this work is to provide this language with its first part-of-speech tagger. Part-of-speech tagging is the first layer above the lexical level and the lowest level of syntactic analysis and of all NLP tasks that deal with higher linguistic levels. This task adds information to the input text, which is very beneficial for the other NLP tasks that use it. To achieve this objective, we trained two sequence classification models, namely SVMs and CRFs. We also built a corpus of about a quarter of a million words, for which we used the informativeness of out-of-vocabulary words and the confidence measure, which can reduce the rate of...

The Amazigh language, like most languages that have only recently begun to be investigated for Natural Language Processing (NLP) tasks, still suffers from a scarcity of linguistic tools and resources, especially annotated corpora. Creating labeled data is a hard task. Obtaining unlabeled data is less difficult, although for languages with scarce resources it usually requires preprocessing. The aim of this paper is to present a semi-supervised approach using labeled and unlabeled data. Preliminary results show an error reduction of 1.3% when training our POS tagger with Conditional Random Fields (CRFs) on selected automatically annotated texts and a small manually annotated corpus of about 20k tokens. In addition, when training with automatically annotated data, the improvement achieved between 60% and 90% of the training data is 5.9%.
This paper presents the first Amazigh POS tagger. Very few linguistic resources have been developed so far for Amazigh, and we believe that the development of a POS tagger is the first step needed for automatic text processing. To this end, we trained two sequence classification models, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), after a tokenization step. We used 10-fold cross-validation to evaluate our approach. Results show that the performance of SVMs and CRFs is very comparable: SVMs outperformed CRFs at the fold level (92.58% vs. 92.14%), while CRFs outperformed SVMs on the 10-fold average (89.48% vs. 89.29%). These results are very promising considering that we used a corpus of only ~20k tokens.

This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on Amazigh tagging, we apply a decision tree and Markov model approach using the TreeTagger system. This system achieves 92.3% accuracy on the Amazigh corpus, an error reduction of 15% (18.45% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes to assess the best tradeoff between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data.
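The relationship between the reported accuracy and the error reduction can be checked with a little arithmetic. The sketch below assumes that "error reduction" means a relative reduction in error rate (the abstract does not define it), and the function name `baseline_accuracy` is hypothetical. The baseline figure it derives is an inference, not a result reported by the authors.

```python
def baseline_accuracy(accuracy: float, relative_error_reduction: float) -> float:
    """Infer the baseline accuracy (in %) from the improved accuracy (in %)
    and the relative error reduction (as a fraction, e.g. 0.15 for 15%).

    Assumption: improved_error = baseline_error * (1 - relative_error_reduction).
    """
    improved_error = 100.0 - accuracy
    baseline_error = improved_error / (1.0 - relative_error_reduction)
    return 100.0 - baseline_error


if __name__ == "__main__":
    # 92.3% accuracy with a 15% relative error reduction
    print(round(baseline_accuracy(92.3, 0.15), 2))
```

Under this assumption, the tagger without lexical information would sit at roughly 90.94% accuracy (an error rate of about 9.06% reduced to 7.7%).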

Part-of-speech (PoS) tagging is the task of assigning the appropriate morphosyntactic category to each word according to its context. Several probabilistic methods have been adapted for PoS tagging, such as Conditional Random Fields, Support Vector Machines, and Decision Trees. Based on these methods, language-independent PoS taggers have been developed, such as CRF++, Yamcha and TreeTagger. These PoS taggers assign the correct PoS (noun, verb, adjective, adverb …) to each word of the sentence. PoS taggers are developed by modeling the morphosyntactic structure of natural language text. In this paper, we try to improve the accuracy of existing Amazigh PoS taggers using a voting algorithm. The three Amazigh PoS taggers used are: (1) a Conditional Random Fields (CRF) tagger, (2) a Support Vector Machines (SVM) tagger, and (3) TreeTagger (TT). These taggers achieve accuracies of 86.79%, 84.64% and 86.57% respectively. An annotated corpus of 60,000 words is...
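One simple way to combine three taggers by voting is a per-token majority vote. The sketch below is only an illustration: the abstract does not specify the voting scheme, so the tie-breaking rule (fall back to one designated tagger, here the first), the function name `vote`, and the example tags are all assumptions and are not taken from the AMTS tag set.

```python
from collections import Counter
from typing import List, Sequence


def vote(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token tags from several taggers by majority vote.

    predictions[i][j] is the tag that tagger i assigns to token j.
    When no single tag has the most votes, the tag of the tagger at
    index `fallback` is used (an assumed tie-breaking rule).
    """
    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags).most_common()
        top_tag, top_count = counts[0]
        tie = len(counts) > 1 and counts[1][1] == top_count
        combined.append(token_tags[fallback] if tie else top_tag)
    return combined


if __name__ == "__main__":
    crf = ["N", "V", "ADJ"]   # hypothetical output of the CRF tagger
    svm = ["N", "N", "ADV"]   # hypothetical output of the SVM tagger
    tt = ["V", "V", "PREP"]   # hypothetical output of TreeTagger
    print(vote([crf, svm, tt]))
```

Placing the most accurate individual tagger first makes it the tie-breaker when all three taggers disagree; weighting votes by accuracy would be another reasonable design.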
Proceedings of the 2nd international Conference on Big Data, Cloud and Applications
Language resources are important for those working on computational methods to analyze and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of a morphosyntactically annotated corpus for the Amazigh language, which currently lacks such resources. We illustrate our approach to creating this corpus, which is more expensive but yields high quality, using crowdsourcing and manual effort with appropriately skilled human participants. Qualitative and quantitative evaluations of the resources are also presented.

The aim of this paper is to present the first Amazighe POS tagger. Very few linguistic resources have been developed so far for Amazighe, and we believe that the development of a POS tagger is the first step needed for automatic text processing. To this end, we trained two sequence classification models, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), after a tokenization step. We used 10-fold cross-validation to evaluate our approach. Results show that the performance of SVMs and CRFs is very comparable: SVMs outperformed CRFs at the fold level (92.58% vs. 92.14%), while CRFs outperformed SVMs on the 10-fold average (89.48% vs. 89.29%). These results are very promising considering that we used a corpus of only ~20k tokens. Mohamed Outahajala, Yassine Benajiba, Paolo Rosso, Lahbib Zenkouar

Over the last few years, Moroccan society has seen much debate about the Amazigh language and culture. The creation of a new governmental institution, IRCAM, has made it possible for the Amazigh language and culture to reclaim their rightful place in many domains. Given that the Amazigh language needs more tools and scientific work before it can be processed automatically, the aim of this paper is to present the features of the Amazigh language for the purpose of morphological annotation. In other words, the paper addresses the tagging of Amazigh with the multilevel annotation tool AnCoraPipe. This tool is adapted to use a specific tagset to annotate Amazigh corpora with a newly defined writing system. This may well be viewed as the first step toward automatic processing of the Amazigh language, the initial aim being to build a part-of-speech tagger.

Amazigh is used by tens of millions of people, mainly for oral communication. However, like all languages newly investigated in natural language processing, it is resource-scarce. The main aim of this paper is to present the results of our POS taggers based on two state-of-the-art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, using a small manually annotated corpus of only 20k tokens. Since creating labeled data is very time-consuming while obtaining unlabeled data is less so, we gathered a set of unlabeled Amazigh data, which we preprocessed and tokenized. The paper also addresses the use of semi-supervised techniques to improve POS tagging accuracy. We used an adapted self-training algorithm that combines a confidence measure with a function of out-of-vocabulary words to select data for self-training. Using this language-independent method, we obtained encouraging results.
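The data-selection step of such a self-training loop can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the abstract does not give the exact function that combines confidence with out-of-vocabulary (OOV) words, so the two thresholds, the function names, and the rule "keep confidently tagged sentences that contain enough OOV words" are all assumptions.

```python
from typing import List, Set, Tuple


def oov_ratio(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens that do not occur in the training vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def select_for_self_training(
    tagged: List[Tuple[List[str], float]],
    vocabulary: Set[str],
    min_confidence: float = 0.9,  # hypothetical threshold
    min_oov: float = 0.1,         # hypothetical threshold
) -> List[List[str]]:
    """Pick automatically tagged sentences to add to the training set.

    `tagged` holds (tokens, confidence) pairs, where confidence is the
    tagger's confidence in its own output for that sentence. A sentence
    is kept if the tagger is confident AND the sentence brings new words.
    """
    return [
        tokens
        for tokens, confidence in tagged
        if confidence >= min_confidence
        and oov_ratio(tokens, vocabulary) >= min_oov
    ]


if __name__ == "__main__":
    vocab = {"a", "b"}  # placeholder tokens, not real Amazigh words
    candidates = [(["a", "b"], 0.95), (["a", "x"], 0.95), (["x", "y"], 0.50)]
    print(select_for_self_training(candidates, vocab))
```

In a full loop, the selected sentences would be added to the labeled data with their automatic tags, the tagger retrained, and the process repeated.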

2013 ACS International Conference on Computer Systems and Applications (AICCSA), 2013
Like most languages that have only recently begun to be investigated for Natural Language Processing (NLP) tasks, Amazigh lacks annotated corpora and still suffers from a scarcity of linguistic tools and resources. The main aim of this paper is to present a tokenizer and a new part-of-speech (POS) tagger based on a new Amazigh tag set (AMTS) composed of 28 tags. To this end, we trained two sequence classification models, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), to build a tokenizer and a POS tagger for the Amazigh language. We used 10-fold cross-validation to evaluate and validate our approach. POS tagging results using SVMs and CRFs are very comparable: CRFs outperformed SVMs both at the fold level (91.18% vs. 90.75%) and on the 10-fold average (87.95% vs. 87.11%). Regarding the tokenization task, SVMs outperformed CRFs at the fold level (99.97% vs. 99.85%) and on the 10-fold average (99.95% vs. 99.89%).
Proceeding of the Workshop on Language Resources and Human Language Technology for Semitic Languages, 2010
Over the last few years, Moroccan society has seen much debate about the Amazigh language and culture. The creation of a new governmental institution, IRCAM, has made it possible for the Amazigh language and culture to reclaim their rightful place in many domains. Given that the Amazigh language needs more tools and scientific work before it can be processed automatically, the aim of this paper is to present the Amazigh language features for a morphology annotation ...