In Korean, spelling changes of various forms must be recovered to their base forms during morphological analysis, and part-of-speech (POS) tagging is difficult without morphological analysis because Korean is agglutinative. This is one of the notorious problems in Korean morphological analysis, and it has traditionally been solved by morpheme recovery rules, which generate morphological ambiguities that are then resolved by POS tagging. In this paper, we propose a morpheme recovery scheme based on machine learning methods such as Naïve Bayes models. The input features of the models are the syllables surrounding the syllable in which the spelling change occurs, and the categories of the models are the recovered syllables. A POS tagging system with the proposed model has achieved an F-score of 97.5% on the ETRI tree-tagged corpus, which shows that the proposed model is very useful for handling morpheme recovery in Korean.
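The classification step described above can be sketched as a tiny Naïve Bayes model that predicts a recovered syllable from the surrounding syllable context. This is a minimal illustration only: the training samples, romanized syllables, and simplified add-alpha smoothing are made-up placeholders, not the paper's ETRI-corpus features or model.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (left syllable, changed syllable, right syllable)
# -> recovered base syllable. Real samples would come from a tagged corpus.
train = [
    (("ga", "wa", "seo"), "o"),
    (("na", "wa", "do"), "o"),
    (("ba", "ra", "seo"), "reu"),
]

class NaiveBayesRecovery:
    def __init__(self, samples, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter(label for _, label in samples)
        self.feature_counts = defaultdict(Counter)
        for context, label in samples:
            for i, syllable in enumerate(context):
                # position-tagged syllable features, as in the abstract's
                # "surrounding context of the syllable"
                self.feature_counts[label][(i, syllable)] += 1
        self.total = sum(self.label_counts.values())

    def score(self, context, label):
        # log P(label) + sum_i log P(feature_i | label), add-alpha smoothed
        s = math.log(self.label_counts[label] / self.total)
        denom = self.label_counts[label] + self.alpha * len(context)
        for i, syllable in enumerate(context):
            num = self.feature_counts[label][(i, syllable)] + self.alpha
            s += math.log(num / denom)
        return s

    def predict(self, context):
        return max(self.label_counts, key=lambda lab: self.score(context, lab))

model = NaiveBayesRecovery(train)
print(model.predict(("ga", "wa", "seo")))  # -> o
```

In a real system the predicted syllables would feed the ambiguity set that the POS tagger later disambiguates.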
Noun extraction plays an important part in fields such as information retrieval and text summarization. In this paper, we present a Korean base-noun extraction system and apply it to text summarization in order to deal with a huge amount of text effectively. A base-noun is an atomic noun rather than a compound noun, and we use two techniques, filtering and segmenting. The filtering technique removes non-nominal words from the text before base-nouns are extracted, and the segmenting technique separates particles from nominals and divides compound nouns into base-nouns. We have shown that both the recall and the precision of the proposed system average about 89% under experimental conditions on the ETRI corpus. The proposed system has been applied to a Korean text summarization system and has shown satisfactory results.
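The segmenting step, dividing a compound noun into base-nouns, can be sketched as dictionary-driven segmentation with dynamic programming. The dictionary and the example compound below are illustrative placeholders (English words standing in for Korean nominals), not the paper's data or scoring model.

```python
def segment(compound, base_nouns):
    """Return one segmentation of `compound` into dictionary base-nouns,
    or None if no full segmentation exists."""
    n = len(compound)
    best = [None] * (n + 1)   # best[i] = a segmentation of compound[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = compound[j:i]
            if best[j] is not None and piece in base_nouns:
                best[i] = best[j] + [piece]
    return best[n]

# Toy dictionary of base-nouns; a real system would also score
# alternative segmentations instead of taking any full cover.
nouns = {"information", "retrieval", "text", "summarization"}
print(segment("informationretrieval", nouns))  # -> ['information', 'retrieval']
```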
In this paper, we introduce a method of representing phrase structure grammars for building a large annotated corpus of Korean syntactic trees. Korean differs from English in word order and word composition. Our study showed that these differences are significant enough to require meaningful changes in the tree annotation scheme for Korean with respect to the schemes for English. A tree annotation scheme defines the grammar formalism to be assumed, the categories to be used, and the rules that determine correct parses for unsettled issues in parse construction. Korean is partially free in word order, and essential components such as the subject and object of a sentence can be omitted with greater freedom than in English. We propose a restricted representation of phrase structure grammar to handle these characteristics of Korean more efficiently. The proposed representation is shown, through extensive experiments, to improve both parsing time and grammar size. We also describe a system named Teb, a software environment set up with the goal of building a tree-annotated corpus of Korean containing more than one million units.
Since natural language is inherently structurally ambiguous, one of the difficulties of parsing is resolving structural ambiguity. Recently, probabilistic approaches to this disambiguation problem have received considerable attention because of their attractions, such as automatic learning, wide coverage, and robustness. In this paper, we focus on a Korean probabilistic parsing model using head co-occurrence. Because head co-occurrence is lexical, it is prone to the data sparseness problem, so handling this problem is of primary importance. To mitigate it, we use a restricted and simplified phrase-structure grammar together with a back-off model for smoothing. The proposed model has shown an accuracy of about 84%.
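The back-off idea above can be sketched as follows: estimate the head co-occurrence probability from lexical counts when the word pair has been seen, and otherwise back off to a coarser POS-level estimate. The counts, tags, and the hard fallback (rather than interpolated smoothing) are toy assumptions, not the paper's actual model.

```python
def backoff_prob(head, dep, lex_counts, pos_counts, pos_of):
    """P(dep | head) from lexical counts, backing off to POS counts."""
    pair = (head, dep)
    head_total = sum(c for (h, _), c in lex_counts.items() if h == head)
    if pair in lex_counts and head_total:
        return lex_counts[pair] / head_total      # lexical estimate
    hp, dp = pos_of[head], pos_of[dep]            # back off to POS level
    pos_total = sum(c for (h, _), c in pos_counts.items() if h == hp)
    return pos_counts.get((hp, dp), 0) / pos_total if pos_total else 0.0

# Toy counts standing in for treebank statistics.
lex = {("eat", "apple"): 3, ("eat", "bread"): 1}
pos = {("V", "N"): 10, ("V", "ADV"): 2}
tags = {"eat": "V", "apple": "N", "pear": "N", "bread": "N"}

print(backoff_prob("eat", "apple", lex, pos, tags))  # lexical: 0.75
print(backoff_prob("eat", "pear", lex, pos, tags))   # unseen pair, backs off
```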
In this paper, we propose a modified unsupervised linear alignment algorithm for building an aligned corpus. The original algorithm inserts null characters into both aligned strings (the source string and the target string) because the two strings differ in length. This can cause difficulties such as search-space explosion for applications using the aligned corpus with null characters, and it prevents the application of several machine learning algorithms. To alleviate these difficulties, we modify the algorithm so that the aligned source strings contain no null characters. We have demonstrated the usability of our approach by applying it to different areas such as Korean-English back-transliteration, English grapheme-to-phoneme conversion, and Korean morphological analysis.
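The core idea, keeping the source side null-free, can be sketched as an alignment in which every source character maps to one or more target characters, so nulls can only appear implicitly on the target side. The dynamic program below uses a toy identity-match score, not the paper's learned alignment model.

```python
def align_no_source_nulls(src, tgt):
    """Return, for each source char, the target substring it covers
    (every source char consumes at least one target char)."""
    INF = float("-inf")
    n, m = len(src), len(tgt)
    score = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(1, n + 1):
        for j in range(i, m + 1):       # src[:i] needs >= i target chars
            for k in range(i - 1, j):   # previous segment boundary
                if score[i - 1][k] == INF:
                    continue
                piece = tgt[k:j]
                # toy score: reward segments echoing the source char
                s = score[i - 1][k] + (1 if piece.startswith(src[i - 1]) else 0)
                if s > score[i][j]:
                    score[i][j], back[i][j] = s, k
    spans, j = [], m                    # trace back the chosen boundaries
    for i in range(n, 0, -1):
        k = back[i][j]
        spans.append(tgt[k:j])
        j = k
    return list(reversed(spans))

print(align_no_source_nulls("abc", "aabbcc"))  # -> ['aa', 'bb', 'cc']
```

Because the output has exactly one span per source character, it can be fed directly to sequence-labeling learners that cannot handle null symbols.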
This paper presents a simple and effective method for automatic bilingual lexicon extraction for less-known language pairs. To do this, we bring in a bridge language, called the pivot language, and adopt information retrieval techniques combined with natural language processing techniques. Moreover, we use a freely available word aligner, Anymalign (Lardilleux et al., 2011), for constructing context vectors. Unlike previous work, we obtain context vectors via a pivot language, so we do not need to translate context vectors with a seed dictionary, and by using Anymalign we improve the accuracy of low-frequency word alignments, which is a weakness of statistical models. Experiments have been conducted on two bi-directional language pairs, Korean-Spanish and Korean-French. The results have demonstrated that, within the top 20 ranked candidates, our method achieves at least 76.3% and up to 87.2% for high-frequency words, and at least 43.3% and up to 48.9% for low-frequency words.
A pivot-based approach for bilingual lexicon extraction relies on the similarity of context vectors represented by words in a pivot language such as English. In this paper, in order to show the validity and usability of the pivot-based approach, we evaluate it with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word association between source words (resp. target words) and pivot words, and the other estimates them from two parallel corpora based on word alignment tools for statistical machine translation. Empirical results on two language pairs, Korean-Spanish and Korean-French, have shown that the pivot-based approach is very promising for resource-poor languages. Furthermore, our method also performs well for words with low frequency.
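The comparison step shared by both estimation methods can be sketched as follows: each source and target word is represented as a context vector over pivot-language words, and translation candidates are ranked by cosine similarity. The vectors below are made-up toy counts, not real corpus statistics.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(src_vec, tgt_vecs, top=20):
    """Rank target words by context-vector similarity to the source word."""
    scored = [(w, cosine(src_vec, v)) for w, v in tgt_vecs.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top]

# Toy pivot-language (English) context vectors for a Korean source word
# and two Spanish candidates.
src = {"water": 5, "drink": 3}
targets = {
    "agua": {"water": 4, "drink": 2},
    "pan":  {"bread": 6, "eat": 1},
}
print(rank_candidates(src, targets)[0][0])  # -> agua
```

Because both vectors live in the pivot vocabulary, no seed dictionary is needed to make them comparable, which is what makes the approach attractive for resource-poor pairs.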
Papers by Jae-Hoon Kim