2015, The International Conference on Information and Communication Technology Research (ICTRC)
A treebank is a linguistic resource composed of a large collection of syntactically analyzed sentences that have been manually annotated and verified. Statistical Natural Language Processing (NLP) approaches have successfully used these annotations to develop basic NLP tasks such as tokenization, diacritization, part-of-speech tagging, and parsing, among others. In this paper, we address the problem of exploiting treebank resources for statistical parsing of Modern Standard Arabic (MSA) sentences. Statistical parsing is significant for NLP tasks that take parsed text as input, such as Information Retrieval and Machine Translation. We conducted an experiment on the Penn Arabic Treebank (PATB); the parsing performance obtained in terms of Precision, Recall, and F-measure was 82.4%, 86.6%, and 84.4%, respectively.
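As a quick consistency check on the reported scores, the sketch below recomputes the F-measure from Precision and Recall. It assumes the abstract means the balanced F-measure (the harmonic mean, beta = 1), which the abstract does not state explicitly; the function name `f_measure` is illustrative only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F-measure (F1); this choice is an
    assumption, since the abstract does not say which variant was used.
    """
    if precision <= 0 and recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and Recall as reported in the abstract, in percent.
reported_f = round(f_measure(82.4, 86.6), 1)
print(reported_f)  # 84.4
```

Under that assumption, the recomputed value agrees with the 84.4% quoted in the abstract.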
The Arabic Treebank at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and annotation procedure over the past year. The revised syntactic guidelines are now being applied in annotation production, and the combination of the revised guidelines and a period of intensive annotator training has already raised inter-annotator agreement F-measure scores. Revised morphological/part-of-speech (POS) guidelines are nearly complete as well, and will be applied in annotation production in the near future. This paper reports on an experiment in automatically enhancing the old morphological/POS tags in the right direction and the resulting parsing improvement. Finally, a new division of the POS analysis marking both morphological form and POS function is proposed.
The three well-known Arabic treebanks for MSA are the Penn Arabic Treebank (PATB), the Prague Arabic Dependency Treebank (PADT) [17, 28, 29], and the Columbia Arabic Treebank (CATiB).
Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or gemination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts is presented as a major challenge to most Arabic natural language processing tasks, including parsing. Recent studies (Messaoudi et al. 2004; Vergyri & Kirchhoff 2004; Zitouni et al. 2006; Maamouri et al. forthcoming) about the place and impact of diacritization in text-based NLP research are presented along with an analysis of the weight of the missing diacritics on Treebank morphological and syntactic analyses and the impact on parser development.
Proceedings of the 23rd International Conference on Computational Linguistics, 2010
In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design. First, we identify sources of syntactic ambiguity understudied in the existing parsing literature. Second, we show that although the Penn Arabic Treebank is similar to other treebanks in gross statistical terms, annotation consistency remains problematic. Third, we develop a human interpretable grammar that is competitive with a latent variable PCFG. Fourth, we show how to build better models for three different parsers. Finally, we show that in application settings, the absence of gold segmentation lowers parsing performance by 2-5% F1.
2018
This paper presents a methodology for a rule-based, bottom-up parsing technique for Modern Standard Arabic (MSA) in the Context Free Grammar (CFG) formalism with a Phrase Structure Grammar (PSG) representation, where the grammar is automatically extracted from a syntactically annotated corpus. The extracted grammar is used to build an automatic lexicon and grammar rules module. The extracted CFG is further transformed into a Probabilistic Context Free Grammar (PCFG), whose probabilities are also calculated automatically and which could be used in a hybrid approach. The corpus used is the Penn Arabic Treebank (PATB), and the algorithm is implemented with the Natural Language Toolkit (NLTK). The parser showed that automatic extraction of grammar improved the grammar building phase in both coverage of structures and time needed, but still needs further manual addition of constraints. Automatic extraction of grammar is able to enhance rule-based grammar parsers and it will enable a new paradigm of statistica...
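The extraction step described in this abstract can be illustrated with a minimal sketch: read bracketed trees, collect one CFG production per internal node, and turn the counts into PCFG probabilities by relative frequency. This is not the authors' implementation (which uses NLTK); it uses only the Python standard library, the function names are illustrative, and the two toy trees with transliterated words are invented for the example rather than taken from the PATB.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a Penn-style bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def productions(tree):
    """Yield one (lhs, rhs) production per internal node of the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    counts = Counter(p for tree in trees for p in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy trees (invented for illustration, not PATB data).
TOY_TREES = [
    "(S (VP (VBD kataba) (NP (NN Al-walad)) (NP (NN risAlap))))",
    "(S (NP (NN Al-walad)) (VP (VBD kataba)))",
]

pcfg = estimate_pcfg(parse_bracketed(t) for t in TOY_TREES)
for (lhs, rhs), prob in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{prob:.3f}]")
```

Relative-frequency estimation is the simplest way to weight extracted rules; a real system would also need to handle empty categories, function tags, and smoothing for unseen rules, none of which this sketch attempts.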
ACM Transactions on Asian and Low-Resource Language Information Processing, 2022
Treebanks are valuable linguistic resources that include the syntactic structure of a language sentence in addition to part-of-speech tags and morphological features. They are mainly utilized in modeling statistical parsers. Although the statistical natural language parser has recently become more accurate for languages such as English, those for the Arabic language still have low accuracy. The purpose of this article is to construct a new Arabic dependency treebank based on the traditional Arabic grammatical theory and the characteristics of the Arabic language, to investigate their effects on the accuracy of statistical parsers. The proposed Arabic dependency treebank, called I3rab, contrasts with existing Arabic dependency treebanks in two main concepts. The first concept is the approach of determining the main word of the sentence, and the second concept is the representation of the joined and covert pronouns. To evaluate I3rab, we compared its performance against a subset of Pr...
Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages - Semitic '04, 2004
In this paper we address the following questions from our experience of the last two and a half years in developing a large-scale corpus of Arabic text annotated for morphological information, part-of-speech, English gloss, and syntactic structure: (a) How did we 'leapfrog' through the stumbling blocks of both methodology and training in setting up the Penn Arabic Treebank (ATB) annotation? (b) How did we reconcile the Penn Treebank annotation principles and practices with the Modern Standard Arabic (MSA) traditional and more recent grammatical concepts? (c) What are the current issues and nagging problems? (d) What has been achieved and what are our future expectations?
Parsing the Arabic language is a difficult task given the specificities of this language and the scarcity of digital resources (grammars and annotated corpora). In this paper, we suggest a method for Arabic parsing based on supervised machine learning. We used the support vector machine (SVM) algorithm to select the syntactic labels of the sentence. Furthermore, we evaluated our parser by cross-validation on the Penn Arabic Treebank. The obtained results are very encouraging.
… on Arabic Language Resources and Tools, …, 2009
The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences. CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on faster production with some constraints on linguistic richness. Two basic ideas inspire ...
2009 International Multiconference on Computer Science and Information Technology, 2009
ACM Transactions on Asian Language Information Processing, 2009
Lamia Tounsi and Josef van Genabith. Arabic Parsing Using Grammar Transforms. In LREC 2010: 7th Conference on International Language Resources and Evaluation, 17-23 May 2010, Valletta, Malta, 2010