Papers by GULMIRA TOLEGEN

Symmetry
This article presents an approach to spellchecking and autocorrection for morphologically complex languages (in the case of Kazakh) that uses web data; it can be considered an end-to-end approach that does not require any manually annotated word–error pairs. A sizable amount of noisy web data is crawled and used as a base to infer knowledge of misspellings and their correct forms. Using the extracted corpus, a sub-string error model and a context model for morphologically complex languages are trained separately, and the two models are then integrated with a regularization parameter. A sub-string alignment model is applied to extract symmetric and non-symmetric patterns from the two sequences of a word–error pair. The model calculates the probability of the symmetric and non-symmetric patterns of a given misspelling and its candidates to obtain a suggestion list. Based on the proposed method, a Kazakh spellchecking and autocorrection system, referred to as QazSpell, is developed.
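The abstract describes integrating a sub-string error model with a context model through a regularization parameter to score correction candidates. A minimal sketch of how such an interpolation might look is given below; the function names, the toy stand-in models, and the weight `lam` are assumptions for illustration, not the QazSpell implementation.

```python
import difflib
import math

# Hypothetical stand-ins for the two trained components: a sub-string error model
# giving P(misspelling | candidate) and a context model giving P(candidate | context).
def error_model_logprob(misspelling: str, candidate: str) -> float:
    # toy stand-in: score by character-level similarity of the two strings
    ratio = difflib.SequenceMatcher(None, misspelling, candidate).ratio()
    return math.log(max(ratio, 1e-6))

def context_model_logprob(candidate: str, context: list) -> float:
    # toy stand-in: uniform score; a real model would use language-model probabilities
    return math.log(1.0 / (1 + len(context)))

def rank_candidates(misspelling, candidates, context, lam=0.5):
    """Combine the two models with a regularization weight `lam` (assumed name)."""
    scored = [
        (c, error_model_logprob(misspelling, c) + lam * context_model_logprob(c, context))
        for c in candidates
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)

print(rank_candidates("кітапп", ["кітап", "кітапта"], ["мен", "оқыдым"]))
```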
2022 7th International Conference on Computer Science and Engineering (UBMK)
Tatarstan Academy of Sciences, Oct 21, 2017

2019 15th International Asian School-Seminar Optimization Problems of Complex Systems (OPCS), 2019
This paper presents comparison results of dependency parsing for two distinct languages, Kazakh and English, using various discrete and distributed feature-based approaches. We apply graph-based and transition-based methods to train these models and report typed and untyped accuracy. The models are compared using either discrete or dense features. Experimental results show that the discrete feature-based (graph-based) approaches perform better than the others when the data set is relatively small; for a large data set, the approaches are very competitive with each other and no significant difference in performance is observed. In terms of training speed, the discrete feature-based parsers take much less training time than the neural network-based parser, with comparable performance.
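The "typed and untyped accuracy" mentioned above correspond to labeled and unlabeled attachment scores. Below is a minimal, assumed evaluator for the two metrics; the data layout ((head, label) pairs per token) is an illustrative choice, not the paper's actual evaluation code.

```python
def attachment_scores(gold, predicted):
    """Untyped (unlabeled) and typed (labeled) attachment scores for dependency parses."""
    total = correct_head = correct_head_label = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_head += 1
                if g_label == p_label:
                    correct_head_label += 1
    uas = correct_head / total          # untyped accuracy
    las = correct_head_label / total    # typed accuracy
    return uas, las

gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "obl")]]
print(attachment_scores(gold, pred))  # (1.0, 0.666...)
```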
Advances in Computational Collective Intelligence, 2020
Various neural networks for sequence labeling tasks have been studied extensively in recent years. The main research focus for the task ranges from feed-forward neural networks to long short-term memory (LSTM) networks with a CRF layer. This paper summarizes the existing neural architectures, develops the four most representative neural networks for part-of-speech tagging, and applies them to several typologically different languages. Experimental results show that the LSTM-type networks outperform the feed-forward network in most cases, and that character-level networks can learn lexical features from the characters within words, which allows the model to achieve better results than models without character-level features.
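The paper compares feed-forward and LSTM-based taggers with and without character-level features. The sketch below, assuming PyTorch, shows one way a character-augmented BiLSTM tagger of this family can be wired; the dimensions and the omission of the CRF layer are simplifications, not the paper's exact models.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    """Minimal BiLSTM POS tagger with character-level word features (illustrative)."""
    def __init__(self, n_words, n_chars, n_tags, w_dim=32, c_dim=16, hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_lstm = nn.LSTM(c_dim, c_dim, batch_first=True)
        self.lstm = nn.LSTM(w_dim + c_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids_per_word):
        # character feature: last hidden state of a char-level LSTM for each word
        char_feats = []
        for char_ids in char_ids_per_word:
            _, (h, _) = self.char_lstm(self.char_emb(char_ids).unsqueeze(0))
            char_feats.append(h[-1].squeeze(0))
        char_feats = torch.stack(char_feats).unsqueeze(0)   # (1, seq, c_dim)
        words = self.word_emb(word_ids).unsqueeze(0)        # (1, seq, w_dim)
        hidden, _ = self.lstm(torch.cat([words, char_feats], dim=-1))
        return self.out(hidden)                             # tag scores per token

# toy usage: a 3-token sentence
model = CharWordTagger(n_words=100, n_chars=50, n_tags=10)
word_ids = torch.tensor([5, 17, 42])
chars = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6, 7, 8, 9])]
print(model(word_ids, chars).shape)  # torch.Size([1, 3, 10])
```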
This paper presents a voted perceptron approach to morphological disambiguation for the case of the Kazakh language. Guided by the intuition that the feature score of the correct path of analyses must be higher than that of an incorrect path, we propose a voted perceptron algorithm with Viterbi decoding for disambiguation. The approach can use arbitrary features to learn the feature weights for a sequence of analyses, which play a vital role in disambiguation. Experimental results show that our approach outperforms other statistical and rule-based models. Moreover, we manually annotated a new morphological disambiguation corpus for the Kazakh language.
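The core idea is a discriminatively trained perceptron that scores whole sequences of morphological analyses and prefers the correct path. A compact sketch of an averaged ("voted") perceptron of this kind is given below; the feature templates, the brute-force path search standing in for Viterbi decoding, and the toy data are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

def features(prev_tag, tag, word):
    # toy feature templates over adjacent analyses and the surface word
    return [f"trans:{prev_tag}->{tag}", f"emit:{tag}:{word}"]

def score(weights, prev_tag, tag, word):
    return sum(weights.get(f, 0.0) for f in features(prev_tag, tag, word))

def decode(weights, words, candidates):
    """Pick the highest-scoring analysis path (brute force here; Viterbi in practice)."""
    best, best_score = None, float("-inf")
    for path in product(*candidates):
        s = sum(score(weights, prev, tag, w)
                for prev, tag, w in zip(("<s>",) + path, path, words))
        if s > best_score:
            best, best_score = path, s
    return list(best)

def train(data, epochs=5):
    """Averaged perceptron: push weights toward the gold path, away from the prediction."""
    weights, totals = defaultdict(float), defaultdict(float)
    for _ in range(epochs):
        for words, candidates, gold in data:
            pred = decode(weights, words, candidates)
            if pred != gold:
                for prev, tag, w in zip(["<s>"] + gold[:-1], gold, words):
                    for f in features(prev, tag, w):
                        weights[f] += 1.0
                for prev, tag, w in zip(["<s>"] + pred[:-1], pred, words):
                    for f in features(prev, tag, w):
                        weights[f] -= 1.0
            for f, v in weights.items():
                totals[f] += v
    n = epochs * len(data)
    return {f: v / n for f, v in totals.items()}

# toy usage: one sentence with per-token candidate analyses and a gold path
data = [(["алма", "жеді"], [["NOUN", "VERB"], ["VERB"]], ["NOUN", "VERB"])]
w = train(data)
print(decode(w, ["алма", "жеді"], [["NOUN", "VERB"], ["VERB"]]))
```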
ArXiv, 2020
We present several neural networks to address the task of named entity recognition for morphologically complex languages (MCLs). Kazakh is a morphologically complex language in which each root/stem can produce hundreds or thousands of variant word forms. This nature of the language can lead to a serious data sparsity problem, which may prevent deep learning models from being well trained for under-resourced MCLs. In order to model the words of MCLs effectively, we introduce root and entity tag embeddings plus a tensor layer to the neural networks, whose effect on improving NER performance for MCLs is significant. The proposed models outperform the state of the art, including character-based approaches, and can potentially be applied to other morphologically complex languages.
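One way to read "root embedding plus tensor layer" is a bilinear interaction between the surface word and its root. The sketch below, assuming PyTorch and nn.Bilinear, illustrates that idea; the architecture, dimensions, and combination with a linear term are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RootTensorEncoder(nn.Module):
    """Sketch: combine surface-word and root embeddings with a bilinear (tensor) layer."""
    def __init__(self, n_words, n_roots, dim=50, out_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.root_emb = nn.Embedding(n_roots, dim)
        # bilinear tensor interaction between the word form and its root
        self.tensor = nn.Bilinear(dim, dim, out_dim)
        self.linear = nn.Linear(2 * dim, out_dim)

    def forward(self, word_ids, root_ids):
        w, r = self.word_emb(word_ids), self.root_emb(root_ids)
        return torch.tanh(self.tensor(w, r) + self.linear(torch.cat([w, r], dim=-1)))

enc = RootTensorEncoder(n_words=1000, n_roots=300)
feats = enc(torch.tensor([4, 12, 7]), torch.tensor([1, 1, 2]))
print(feats.shape)  # torch.Size([3, 100]) -- per-token features for a NER classifier
```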
Cogent Engineering, Feb 12, 2020
In this paper, we investigate two neural architectures for gender detection and speaker identification by utilizing Mel-frequency cepstral coefficient (MFCC) features, which do not cover voice-related characteristics. One of our goals is to compare different neural architectures, multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs), for both tasks under various settings, learning the gender/speaker-specific features automatically. The experimental results reveal that models using z-score and Gramian matrix transformations obtain better results than models that only use min-max normalization of the MFCCs. In terms of training time, the MLP requires more training epochs to converge than the CNN. Further experiments show that the MLPs outperform the CNNs on both tasks in terms of generalization error.
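The transformations mentioned (z-score normalization and a Gramian matrix over the MFCCs) can be sketched as below. This assumes librosa for MFCC extraction and treats the normalization order and matrix scaling as illustrative choices, not the paper's precise pipeline.

```python
import numpy as np
import librosa  # assumed dependency for MFCC extraction

def mfcc_features(path, n_mfcc=20):
    """Load audio and compute an MFCC matrix of shape (n_mfcc, frames)."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def zscore(m):
    # per-coefficient z-score normalization across frames
    return (m - m.mean(axis=1, keepdims=True)) / (m.std(axis=1, keepdims=True) + 1e-8)

def gramian(m):
    # Gramian matrix of the normalized MFCCs: a fixed-size (n_mfcc x n_mfcc)
    # representation regardless of utterance length, usable as MLP/CNN input
    return m @ m.T / m.shape[1]

# usage sketch (the file name is a placeholder):
# feats = gramian(zscore(mfcc_features("utterance.wav")))
# print(feats.shape)  # (20, 20)
```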

Computación y Sistemas
Keyphrase extraction is the task of automatically selecting topical phrases from a document. We present KeyVector, an unsupervised approach with weighted topics via semantic relatedness for keyphrase extraction. Our method relies on measures of semantic relatedness between documents, topics, and keyphrases in the same vector space, which allows us to compute three keyphrase ranking scores: a global semantic score, which finds the more important keyphrases for a given document by measuring the semantic relation between document and keyphrase embeddings; a topic weight, which prunes/selects candidate keyphrases at the topic level; and a topic inner score, which ranks the keyphrases inside each topic. Keyphrases are then generated by ranking the combined value of the three scores for each candidate. We conducted experiments on three evaluation data sets covering documents of different lengths and domains. Results show that KeyVector outperforms state-of-the-art methods on short, medium, and long documents.
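Since the three scores all live in one embedding space, a simple sketch can express them as cosine similarities and combine them additively. The data layout, weights, and additive combination below are assumptions for illustration, not the exact KeyVector formulation.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_keyphrases(doc_vec, topics, alpha=1.0, beta=1.0, gamma=1.0):
    """Combine a global semantic score, a topic weight, and a topic inner score.

    `topics` is a dict {topic_name: (topic_vec, {phrase: phrase_vec})}.
    """
    scored = []
    for _, (topic_vec, phrases) in topics.items():
        topic_weight = cos(doc_vec, topic_vec)      # topic-level pruning/weighting
        for phrase, vec in phrases.items():
            global_score = cos(doc_vec, vec)        # document <-> keyphrase relatedness
            inner_score = cos(topic_vec, vec)       # ranking inside the topic
            scored.append((phrase,
                           alpha * global_score + beta * topic_weight + gamma * inner_score))
    return sorted(scored, key=lambda t: t[1], reverse=True)

# toy usage with random vectors standing in for learned embeddings
rng = np.random.default_rng(0)
doc = rng.normal(size=50)
topics = {"t1": (rng.normal(size=50), {"neural parsing": rng.normal(size=50),
                                       "keyphrase extraction": rng.normal(size=50)})}
print(rank_keyphrases(doc, topics)[:2])
```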
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We develop a language-independent, deep learning-based approach to the task of morphological disambiguation. Guided by the intuition that the correct analysis should be "most similar" to the context, we propose dense representations for morphological analyses and surface context and a simple yet effective way of combining the two to perform disambiguation. Our approach improves on the language-dependent state of the art for two agglutinative languages (Turkish and Kazakh) and can potentially be applied to other morphologically complex languages.
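The intuition that the correct analysis is "most similar" to the context can be sketched as choosing the candidate whose dense representation has the highest cosine similarity with a context vector. The analysis encoder (mean of tag embeddings) and the random toy embeddings below are illustrative stand-ins, not the paper's learned model.

```python
import numpy as np

def embed_analysis(analysis, tag_vectors):
    """Dense representation of a morphological analysis: mean of its tag embeddings."""
    return np.mean([tag_vectors[t] for t in analysis.split("+")], axis=0)

def disambiguate(context_vec, candidate_analyses, tag_vectors):
    """Choose the candidate analysis most similar to the surface-context vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(candidate_analyses,
               key=lambda a: cos(context_vec, embed_analysis(a, tag_vectors)))

# toy usage with random vectors standing in for learned embeddings
rng = np.random.default_rng(1)
tags = {t: rng.normal(size=30) for t in ["Noun", "Verb", "A3sg", "Past"]}
context = tags["Verb"] + tags["Past"]  # pretend the context "looks verbal"
print(disambiguate(context, ["Noun+A3sg", "Verb+Past"], tags))  # prints "Verb+Past"
```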