Computational linguistics (CL) has become essential in the present day, as many different technologies are involved in making machines understand human languages. Khasi is a language spoken in Meghalaya, India. Many Indian languages have been studied in different areas of Natural Language Processing (NLP), whereas Khasi lacks substantial research from the NLP perspective. Therefore, in this paper, taking part-of-speech (POS) tagging as one of the key aspects of NLP, we present a POS tagger based on the Hidden Markov Model (HMM) for the Khasi language. At this preliminary stage of building an NLP system for Khasi, we begin with an analysis of the categories and structures of words. Accordingly, we have designed specific POS tagsets to categorise Khasi words and vocabulary. The HMM-based POS system is then trained on Khasi words that have been tagged manually using the designed tagsets. As ambiguity is one of the main challenges in POS tagging for Khasi, we anticipated difficulties in tagging. However, running the HMM tagger on the first few sets of the experimental data yielded an accuracy of 76.70%.
Part-of-speech (POS) tagging is a pre-processing, or text-processing, technique in Natural Language Processing (NLP). A POS tagger is a system that takes a sentence as input and outputs a tag for each word. POS tagging for Indian languages is a challenging task because these languages are morphologically rich. The most difficult challenge in POS tagging is ambiguity, because a single word can be used in multiple senses depending on the context in which it appears. Words or items of the language structure can therefore be disambiguated only from the speech context. This paper presents the grammatical parts of speech and the designed POS tagsets of the Khasi language. Khasi is an Austro-Asiatic language spoken in the central and eastern parts of the state of Meghalaya, India. Though Khasi is mostly isolating in morphology, some words are derived through certain morphological processes.
International Journal of Speech Technology, Jun 4, 2021
Khasi is a language that belongs to the Mon-Khmer branch of the Austroasiatic group. It is spoken by the indigenous people of the state of Meghalaya in India. This paper presents work on part-of-speech (POS) tagging for the Khasi language using the Conditional Random Field (CRF) method. The main significance of this work is that it experiments with the CRF model for POS tagging in Khasi. This method produces a reliable agreement on the features of the language. POS tagging for Khasi is essential for creating lemmatizers, which reduce a word to its root form, and the POS corpus or dataset can be used in other NLP applications. In this research work, we have designed a tagset and a POS tagging corpus. Khasi does not have any standard POS corpus. Therefore, we built a Khasi corpus that consists of around 71,000 tokens. After the Khasi corpus is fed to the CRF model for learning, the system yields a testing accuracy of 92.12% and an F1-score of 0.91. The result is compared with a few other state-of-the-art techniques, and our approach produces promising results in comparison. In future, we will increase the size of the Khasi POS corpus.
Indian journal of languages and linguistics, Jun 24, 2023
This research paper presents a detailed analysis of noun phrases (NPs) in War-Khasi and War-Jaiñtia, two varieties of the Khasi language spoken in northeastern India. The study investigates the syntactic and morphological features of NPs, with a particular emphasis on the distinctions between pre-modifiers and post-modifiers. By comparing and contrasting the NPs of War-Khasi and War-Jaiñtia, the paper highlights the unique attributes and functions of these constructions in each variety. The research also explores the various NP constructions in both varieties and evaluates their syntactic and semantic roles. The findings demonstrate that while both varieties share similar NP constructions, there are notable differences in the functions and attributes of NPs in each variety. Furthermore, the paper elaborates on the different functions of NPs and their lexical elements, including the head noun and all of its accompanying modifiers. Overall, this study contributes to our understanding of the syntax and morphology of noun phrases in Khasi and provides insights into the unique linguistic features of these two varieties. The findings have important implications for cross-linguistic comparisons of NP constructions and for further research on the linguistics of the Khasi varieties.
ACM Transactions on Asian and Low-Resource Language Information Processing, 2022
Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language, along with large amounts of data or corpora for feature engineering, which can lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, on combinations of BiLSTM with CRF, and on character-based embedding with BiLSTM are presented. The main challenges to be encountered in understanding and handling natural language for computational linguistics are anticipated. In the presently designed corpus, we have tried to solve the problems of ambiguities of words conc...
Papers by sarah lyngdoh