Papers by Prakash Pimpale
Reordering is a pre-processing stage for a Statistical Machine Translation (SMT) system in which the words of the source sentence are reordered to follow the syntax of the target language. We propose a rich set of rules for better reordering. The idea is to facilitate training through better alignments and parallel phrase extraction for a phrase-based SMT system. Reordering also helps the decoding process and hence improves machine translation quality. We observed significant improvements in translation quality with our approach over the baseline SMT system. We used BLEU, NIST, multi-reference word error rate and multi-reference position-independent error rate to judge the improvements. We used the open-source SMT toolkit MOSES to develop the system.
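As a rough illustration of what a source-side reordering rule can look like, the sketch below moves the first verb group of a POS-tagged sentence to the end of the clause, approximating a subject-verb-object to subject-object-verb change. The tag names, the single rule and the function name `reorder_svo_to_sov` are assumptions made for this example; they are not the rule set proposed in the paper.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); the tag set is an assumption

VERBAL_TAGS = {"VERB", "AUX"}


def reorder_svo_to_sov(tokens: List[Token]) -> List[Token]:
    """Toy rule: move the first contiguous verb group to the end of the clause."""
    body = list(tokens)
    tail: List[Token] = []
    # Keep sentence-final punctuation in its original position.
    while body and body[-1][1] == "PUNCT":
        tail.insert(0, body.pop())

    # Find where the first verb group starts.
    start = next(
        (i for i, (_, tag) in enumerate(body) if tag in VERBAL_TAGS), None
    )
    if start is None:
        return list(tokens)  # no verb found: leave the sentence unchanged

    # Extend over the contiguous run of verbal tags.
    end = start
    while end < len(body) and body[end][1] in VERBAL_TAGS:
        end += 1

    verb_group = body[start:end]
    return body[:start] + body[end:] + verb_group + tail


sentence = [("Ram", "NOUN"), ("eats", "VERB"), ("mangoes", "NOUN"), (".", "PUNCT")]
print(" ".join(word for word, _ in reorder_svo_to_sov(sentence)))
```

A practical system would typically apply many such rules over parse trees rather than a flat tag sequence; the flat version is used here only to keep the example short.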
This paper presents the Centre for Development of Advanced Computing Mumbai's (CDACM) submission to the NLP Tools Contest on POS Tagging For Code-mixed Indian Social Media Text (POSCMISMT) 2015 (collocated with ICON 2015). We submitted results for Hindi, Bengali and Telugu, each mixed with English. We describe the approaches we used for POS tagging in this task. Machine learning techniques were used to tag the mixed-language text: distributed representations of words in vector space (word2vec) were tried for feature extraction, and log-linear part-of-speech tagging was tried for the tagging itself.
This paper presents the Centre for Development of Advanced Computing Mumbai's (CDACM) submission to the NLP Tools Contest on Statistical Machine Translation in Indian Languages (ILSMT) 2015 (collocated with ICON 2015). The aim of the contest is to collectively explore the effectiveness of Statistical Machine Translation (SMT) when translating within Indian languages and between English and Indian languages. In this paper, we report our work on all five language pairs, namely Bengali-Hindi, Marathi-Hindi, Tamil-Hindi, Telugu-Hindi and English-Hindi, for the Health, Tourism and General domains. We used suffix separation, compound splitting and pre-reordering prior to SMT training and testing.
This paper discusses the Centre for Development of Advanced Computing Mumbai's (CDACM) submission to the NLP Tools Contest on Statistical Machine Translation in Indian Languages (ILSMT) 2014 (collocated with ICON 2014). The objective of the contest is to explore the effectiveness of Statistical Machine Translation (SMT) for translation between Indian languages and for English-Hindi machine translation.
Reordering is a pre-processing stage for a Statistical Machine Translation (SMT) system in which the words of the source sentence are reordered to follow the syntax of the target language. We propose a rich set of rules for better reordering. The idea is to facilitate training through better alignments and parallel phrase extraction for a phrase-based SMT system. Reordering also helps the decoding process and hence improves machine translation quality. We observed significant improvements in translation quality with our approach over the baseline SMT system.
The Internet is a vast repository of information, most of it in English. To make this information accessible to everyone regardless of their knowledge of English, the notion of Cross Lingual Information Access (CLIA) has been introduced. We discuss work on CLIA systems for Indian languages and describe Setu, a CLIA system for English-Hindi developed at CDAC Mumbai. We also analyze the Indian perspective in this context and discuss relevant issues faced by existing systems. We focus mainly on issues arising from ambiguities at different stages of CLIA and on the quality of machine translation, a vital component of CLIA. We also look at opportunities for such systems and the road ahead.
Marathi and Hindi are both members of the Indo-Aryan family, both use the Devanagari script, and they are similar to a great extent. Both follow SOV sentence structure and are equally liberal in word order, so translation for this language pair appears to be easy. Experiments, however, show it to be a significantly difficult task, primarily because Marathi is morphologically richer than Hindi. We propose a Marathi to Hindi Statistical Machine Translation (SMT) system that uses compound word splitting to tackle the morphological richness of Marathi.
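To make the idea of compound word splitting concrete, here is a minimal sketch of one common family of approaches: splitting an unknown word into parts that all appear in a known vocabulary, preferring the longest known prefix. The function name `split_compound`, the `min_len` parameter and the placeholder vocabulary are assumptions for illustration only; the abstract does not specify the splitting method actually used.

```python
from typing import List, Set


def split_compound(word: str, vocab: Set[str], min_len: int = 2) -> List[str]:
    """Split `word` into known vocabulary items, or return it unchanged."""
    if word in vocab:
        return [word]
    # Try the longest candidate prefix first, then shorter ones.
    for cut in range(len(word) - min_len, min_len - 1, -1):
        head, rest = word[:cut], word[cut:]
        if head in vocab:
            parts = split_compound(rest, vocab, min_len)
            # Accept the split only if every remaining part is known.
            if all(part in vocab for part in parts):
                return [head] + parts
    return [word]  # no full split found


# Placeholder vocabulary; a real system would derive it from the training corpus.
vocab = {"alpha", "beta", "gamma"}
print(split_compound("alphabetagamma", vocab))
print(split_compound("unknown", vocab))
```

Splitting of this kind reduces the number of distinct surface forms the aligner sees, which is the usual motivation for applying it before SMT training on a morphologically rich source language.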