Telugu Computational Tools by Christopher Mala

This paper describes the development of a Morphological Generator, a generic engine that can be used for any language by plugging in a language-specific database. The generator synthesizes all and only the well-formed word forms, covering both inflectional and productive derivational forms. The engine is language-independent and is based on the word-and-paradigm method. The computational model uses a machine learning method based on a morphological database developed with the word-and-paradigm model of morphology. This method ensures not only coverage but also that the database can evolve. The engine takes as input a root together with its inflectional categories (features): gender, number, person and case for nouns, verbal categories for verbs, and other relevant inflectional endings depending on the category. In this paper we describe how the Morphological Generator handles all of the inflectional forms in addition to the productive derivational forms. The input and output are in Shakti Standard Form (SSF). When tested on Telugu, Hindi and Tamil, the accuracy was 97.2%, 98% and 94% respectively.
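The word-and-paradigm idea described above can be sketched as follows. This is a minimal illustration, not the paper's engine: the names `PARADIGMS`, `LEXICON` and `generate` are hypothetical, the toy database uses English for readability, and SSF input/output and the machine learning component are left out.

```python
from typing import Dict, Tuple

Features = Tuple[str, ...]

# Hypothetical toy database (English, for readability); not the paper's data.
# Each paradigm maps a feature bundle to
# (number of characters to strip from the root, suffix to append).
PARADIGMS: Dict[str, Dict[Features, Tuple[int, str]]] = {
    "cat": {("sg",): (0, ""), ("pl",): (0, "s")},
    "city": {("sg",): (0, ""), ("pl",): (1, "ies")},
}

# Each root is assigned to the paradigm whose pattern it follows.
LEXICON: Dict[str, str] = {
    "cat": "cat",
    "dog": "cat",
    "city": "city",
    "party": "city",
}


def generate(root: str, features: Features) -> str:
    """Build a word form from a root and its inflectional features."""
    strip, suffix = PARADIGMS[LEXICON[root]][features]
    return root[: len(root) - strip] + suffix


if __name__ == "__main__":
    print(generate("party", ("pl",)))  # parties
    print(generate("dog", ("pl",)))    # dogs
```

Because the engine only reads the two tables, supporting another language in this sketch would mean swapping the tables, not the code.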

A Spell Checker is an application that handles spelling errors and Spelling Variations (SV). All misspelt words are marked and offered for correction. The system can also be used as an editor in which the text is checked for spelling errors and suggestions for correction are provided. Telugu is an agglutinating language with a very complex morphology, coupled with prolific sandhi (morphophonemics). Sandhi in Telugu is not limited to internal sandhi; external sandhi also occurs. Both consonantal and vocalic sandhi are common and well studied in Telugu [Krishnamurti, 1957, 1985]. Identifying the specific sandhi type and splitting it appropriately is a very challenging task. External sandhi is a linguistic phenomenon that refers to a set of changes occurring at word boundaries. These changes are similar to phonological processes such as substitution (modification by various means), deletion, and insertion. External sandhi is often reflected orthographically in Telugu. In such cases it produces forms that are morphologically unanalyzable, which poses a problem for all kinds of NLP applications. In this paper, we discuss in detail the processes of external sandhi in Telugu and the computational tool, the Spell Checker.
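The mark-and-suggest behaviour described above can be sketched as follows. The abstract does not say how suggestions are ranked, so this sketch assumes a plain similarity match from Python's standard `difflib` module; the word list and the `check` function are hypothetical, and sandhi splitting is not modelled.

```python
import difflib
from typing import Dict, List

# Hypothetical toy word list; a real checker would consult a morphological
# analyzer or a large lexicon instead.
LEXICON: List[str] = ["spelling", "checker", "handles", "errors"]


def check(text: str, lexicon: List[str], n: int = 3) -> Dict[str, List[str]]:
    """Map each word missing from the lexicon to up to n suggestions."""
    flagged: Dict[str, List[str]] = {}
    for word in text.lower().split():
        if word not in lexicon:
            # Assumption: suggestions are ranked by difflib's similarity ratio.
            flagged[word] = difflib.get_close_matches(word, lexicon, n=n)
    return flagged


if __name__ == "__main__":
    for word, suggestions in check("speling checker handles erors", LEXICON).items():
        print(word, "->", suggestions)
```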
We present the development of a bidirectional Machine Translation (MT) system that translates texts from Tamil to Telugu and vice versa. It is based on the transfer approach. The system's architecture is divided into three stages: the source language (SL) analysis module, the source-language-to-target-language (SL-TL) transfer module, and the target language (TL) generation module. The major cross-linguistic differences between Tamil and Telugu encountered during the development of the system are discussed here.
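The three-stage architecture above can be pictured as a composition of three functions. The sketch below is only a schematic under stated assumptions: all tables hold placeholder strings (not real Tamil or Telugu data), the function names are hypothetical, and word order and structural transfer are ignored.

```python
from typing import Dict, List, Tuple

Analysis = Tuple[str, str]  # (root, feature label)

# Placeholder data only; these are not real Tamil or Telugu forms.
SL_ANALYSES: Dict[str, Analysis] = {
    "sl_form_a": ("sl_root_a", "PL"),
    "sl_form_b": ("sl_root_b", "PAST"),
}
TRANSFER_LEXICON: Dict[str, str] = {
    "sl_root_a": "tl_root_a",
    "sl_root_b": "tl_root_b",
}
TL_FORMS: Dict[Analysis, str] = {
    ("tl_root_a", "PL"): "tl_form_a",
    ("tl_root_b", "PAST"): "tl_form_b",
}


def analyze(sentence: str) -> List[Analysis]:
    """Stage 1 (SL analysis): reduce each source word to root + features."""
    return [SL_ANALYSES[word] for word in sentence.split()]


def transfer(analyses: List[Analysis]) -> List[Analysis]:
    """Stage 2 (SL-TL transfer): replace source roots with target roots."""
    return [(TRANSFER_LEXICON[root], feats) for root, feats in analyses]


def generate(analyses: List[Analysis]) -> str:
    """Stage 3 (TL generation): produce target word forms."""
    return " ".join(TL_FORMS[analysis] for analysis in analyses)


def translate(sentence: str) -> str:
    return generate(transfer(analyze(sentence)))


if __name__ == "__main__":
    print(translate("sl_form_a sl_form_b"))  # tl_form_a tl_form_b
```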

The development of Machine Translation (MT) is one of the most challenging tasks among Natural Language Processing applications. A number of MT approaches are practiced all over the world: chiefly direct translation, interlingual translation, transfer-based translation and combinations of these, besides statistical and corpus-based methods. Indian languages exhibit a considerable amount of diversity at every level: morphological, syntactic, semantic and lexical. In the transfer-based approach, a representation of the source language (SL) at a certain level is transferred to the corresponding target language (TL) representation. Given this diversity, building a Machine Translation system for these languages using the transfer-based method can be non-trivial and challenging. The present paper discusses the successful implementation of the transfer-based approach in the Machine Translation (MT) system for Hindi<->Telugu. The resources for this system come from eleven different institutions across India.
A Morphological Analyzer (MA) is a program that analyses the words of a natural language into their roots and constituent morpho-syntactic elements, along with their attributes. The present paper demonstrates a computational implementation of a Morphological Analyzer for Telugu. The algorithm used to build this MA is theoretically justified and is implemented in practice for the Modern Standard Written variety of Telugu. The paper demonstrates the optimal organization of the linguistic database and its performance in a computational environment, ensuring high precision and coverage in the parsing of word forms. The current MA engine's coverage may range between 95% and 97% on a variety of corpora (a 3-million-word corpus).
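As a rough illustration of what analysing a word into a root and its attributes involves, the sketch below matches a word against suffixes in a paradigm table and recovers candidate roots. It is not the paper's algorithm: the tables use English toy data, and `PARADIGMS`, `LEXICON` and `analyze` are hypothetical names.

```python
from typing import Dict, List, Tuple

Features = Tuple[str, ...]

# Hypothetical toy database (English, for readability); not the paper's data.
# Each paradigm maps a feature bundle to
# (number of characters stripped from the root, suffix appended).
PARADIGMS: Dict[str, Dict[Features, Tuple[int, str]]] = {
    "cat": {("sg",): (0, ""), ("pl",): (0, "s")},
    "city": {("sg",): (0, ""), ("pl",): (1, "ies")},
}
LEXICON: Dict[str, str] = {
    "cat": "cat",
    "dog": "cat",
    "city": "city",
    "party": "city",
}


def analyze(word: str) -> List[Tuple[str, Features]]:
    """Return every (root, features) pair that could yield the word."""
    analyses: List[Tuple[str, Features]] = []
    for paradigm, table in PARADIGMS.items():
        for features, (strip, suffix) in table.items():
            if not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            # Keep roots of this paradigm whose stripped form matches the stem.
            for root, root_paradigm in LEXICON.items():
                if root_paradigm == paradigm and root[: len(root) - strip] == stem:
                    analyses.append((root, features))
    return analyses


if __name__ == "__main__":
    print(analyze("parties"))  # [('party', ('pl',))]
    print(analyze("blorp"))    # [] (unknown word, no analysis)
```

In this sketch a word with no analysis returns an empty list, which is the kind of case that lowers coverage on a corpus.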
Papers by Christopher Mala
Trans-Himalayan Linguistics, 2013

One of the major challenges in Natural Language Processing and Computational Linguistics is identifying clauses and their boundaries. This paper attempts to develop an automatic Clause Boundary Identifier (CBI) for Telugu. Telugu belongs to the South-Central Dravidian language family and is head-final, left-branching and morphologically agglutinative (Bh. Krishnamurti, 2003). A large corpus is studied to frame rules for identifying clause boundaries; these rules are implemented in a computational algorithm, and some of the issues in identifying clause boundaries are discussed. A clause-boundary-annotated corpus can then be developed from raw text and used to train a machine learning algorithm, which in turn helps in the development of a hybrid Clause Boundary Identification tool for Telugu. Its implementation and evaluation are discussed in this paper.
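The abstract does not list the actual rules, so the following is only a guess at the simplest kind of rule a head-final language invites: close a clause after each verb. The tag names `VF` and `VNF`, the placeholder tokens and the function `split_clauses` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical tags: VF = finite verb, VNF = non-finite verb.
VERB_TAGS = {"VF", "VNF"}


def split_clauses(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Close a clause after every verb, assuming verb-final clauses."""
    clauses: List[List[str]] = []
    current: List[str] = []
    for word, tag in tagged:
        current.append(word)
        if tag in VERB_TAGS:
            clauses.append(current)
            current = []
    if current:
        # Trailing words with no verb are kept as a final fragment.
        clauses.append(current)
    return clauses


if __name__ == "__main__":
    sentence = [("w1", "N"), ("w2", "VNF"), ("w3", "N"), ("w4", "VF")]
    print(split_clauses(sentence))  # [['w1', 'w2'], ['w3', 'w4']]
```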
caltslab.uohyd.ernet.in
... For instance, (27) Ta. eṉ-akku nīccal teriyum. Eng. I know to swim. 'me-DAT swimming know-fut-3p.sg.n'. Te. ... A Grammar of Modern Tamil. Pondicherry: Pondicherry Institute of Linguistics and Culture. Subbarao, KV 2010. South Asian Languages: A Syntactic Typology. ...

A Morphological Analyzer (MA) is a program that analyses the words of a natural language into their roots and constituent morpho-syntactic elements, along with their attributes. The present paper demonstrates a computational implementation of a Morphological Analyzer for Telugu. The algorithm used to build this MA is theoretically justified and is implemented in practice for the Modern Standard Written variety of Telugu. The paper demonstrates the optimal organization of the linguistic database and its performance in a computational environment, ensuring high precision and coverage in the parsing of word forms. The current MA engine's coverage may range between 95% and 97% on a variety of corpora (a 3-million-word corpus). Introduction: It is a well-known fact that the morphology of Telugu is not only rich in terms of the density of word forms produced for a given root/stem but also diverse in the morphological strategies that are usually empl...