
Morphologically Annotated Corpora for Seven Arabic Dialects

Abstract

We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated following a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.

Key takeaways

  • As Arabic dialects (DA) become more widely written in social media, there is increased interest in the Arabic NLP community in annotated corpora that allow us both to study the dialects linguistically and to create systems that can automatically process dialectal text.
  • Linguistic Studies: There are many theoretical and descriptive linguistic studies of the dialects we work on: Yemeni dialects (Watson, 1993, 2002), Najdi (Ingham, 1994), Gulf Arabic (Holes, 1990), Jordanian (Bani-Yasin and Owens, 1987), Moroccan (Harrell, 1962), Syrian (Cowell, 1964), and Iraqi (Erwin, 1963); not to mention comparative studies across dialects and MSA (Holes, 2004; Brustad, 2000).
  • Dialects and MSA: Arabic dialects share many commonalities with Classical Arabic and Modern Standard Arabic (MSA).
  • Dialectal Orthography: Since Arabic dialects do not have spelling standards, several previous efforts on Arabic dialect annotation (Maamouri et al., 2014; Jarrar et al., 2014) contributed to a movement that led to the creation of a common Conventional Orthography for Dialectal Arabic (CODA) (Habash et al., 2012a; Zribi et al., 2014).
  • This variation is not unique to YE.SN; other dialects, such as IR.BG and JOR, have it as well.