Papers by Rinat Gilmullin
2023 8th International Conference on Computer Science and Engineering (UBMK)
Ученые записки Казанского университета. Серия Гуманитарные науки, 2014
Образовательные технологии и общество, 2013
Образовательные технологии и общество, 2011
2023 8th International Conference on Computer Science and Engineering (UBMK)

Transaction Kola Science Centre, 2021
A system analysis of the problem of modeling a natural language (NL) made it possible to formulate the root cause of the low efficiency of modern tools for accumulating and processing knowledge in such languages: these tools are difficult to make intelligent because they are built on primitive artificial programming languages, which in practice are a subset of flectional analytical languages or artificial constructions based on them. To reduce the severity of this problem, it is proposed to build NL modeling systems on technological tools for the verbalization and recognition of sense. These tools consist of semiotic models of the lexical and grammatical means of NL. The approach seems especially promising for agglutinative languages and is to be implemented using the Tatar language as an example.
Russ. Digit. Libr. J., 2016
This article concerns the corpus-oriented study of the most frequent types of grammatical homonymy in the Tatar language and the possibilities for automating the disambiguation process in the corpus. The authors determine the relevance of the alternative parses generated during automatic morphological analysis in terms of real linguistic ambiguity. This work presents a classification of frequent homoforms and methods for their disambiguation, and it estimates the potential impact on the corpus.

The paper is dedicated to the problem of grammatical ambiguity in the Tatar National Corpus and describes the methodology and software used to automate the disambiguation process. Grammatical ambiguity is widespread in agglutinative languages such as the Turkic and Finno-Ugric languages. Disambiguation in the corpus is based on a context-oriented classification of ambiguity types, carried out on Tatar corpus data for the first time. In this study the corpus serves both as the source for the research and as the destination for implementing the results. The grammatical ambiguity types are detected automatically using the finite-state morphological analyzer and then classified. To build the grammatically disambiguated subcorpus, a special software module was developed. It searches for ambiguous tokens in the corpus, collects statistical information and allows creating and implementing the formal context-based disambiguation rules for dif...
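The workflow named in the abstract above (find ambiguous tokens, collect statistics per ambiguity type, apply formal context-based rules) might be sketched roughly as follows. This is only an illustration under stated assumptions: the token representation, the tag names `N`, `V`, `PUNCT`, and the example rule are hypothetical and are not taken from the Tatar National Corpus or its software.

```python
from collections import Counter

# Assumed data model (hypothetical): a token is (surface_form, [candidate_tags]).
# A token is "ambiguous" when the analyzer returned more than one candidate tag.


def ambiguity_stats(tokens):
    """Count how often each ambiguity type (set of competing tags) occurs."""
    stats = Counter()
    for _, analyses in tokens:
        if len(analyses) > 1:
            stats[tuple(sorted(analyses))] += 1
    return stats


def apply_rules(tokens, rules):
    """Resolve ambiguous tokens with rules keyed by ambiguity type.

    Each rule is a function (previous_token, next_token) -> chosen tag or None.
    Tokens without a matching rule, or whose rule returns None, stay ambiguous.
    """
    resolved = []
    for i, (form, analyses) in enumerate(tokens):
        if len(analyses) > 1:
            prev_tok = tokens[i - 1] if i > 0 else None
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
            rule = rules.get(tuple(sorted(analyses)))
            choice = rule(prev_tok, next_tok) if rule else None
            if choice in analyses:
                analyses = [choice]
        resolved.append((form, analyses))
    return resolved


# Illustrative rule only: prefer the verb reading directly before punctuation.
example_rules = {
    ("N", "V"): lambda prev, nxt: "V" if nxt is not None and nxt[1] == ["PUNCT"] else None,
}
example_tokens = [("w1", ["N"]), ("w2", ["N", "V"]), (".", ["PUNCT"])]

print(ambiguity_stats(example_tokens))
print(apply_rules(example_tokens, example_rules))
```

Keying the rules by ambiguity type mirrors the idea of writing separate rules for different ambiguity types; the statistics indicate which types are frequent enough to be worth a rule.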

This paper concerns grammatical ambiguity in the Tatar National Corpus and the possibilities for automating the disambiguation process in the corpus. Grammatical ambiguity is widespread in agglutinative languages such as the Turkic and Finno-Ugric languages. To build the grammatically disambiguated subcorpus, we have developed a special software module which searches for ambiguous tokens in the corpus, collects statistical information and allows creating and implementing the formal disambiguation rules for different ambiguity types. Disambiguation in the corpus is based on a context-oriented classification of ambiguity types, carried out on statistical Tatar corpus data for the first time. The corpus thus serves both as the source for our research and as the destination for implementing the results. Estimated cumulative effect of disambiguation of the identified frequent ambiguity types in the Tatar National Corpus can...
This paper describes the morphological analysis system for the Tatar language, which is based on a two-level morphology model. The system is used for grammatical annotation of the Tatar National Corpus. The paper reports an evaluation of the completeness of the system using statistical information obtained from the corpus data and describes ways to improve the system.
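One common way to estimate the completeness of a morphological analyzer from corpus statistics is to measure what share of running tokens and of distinct word forms it recognizes. The sketch below illustrates that idea only; it is not the evaluation procedure of the paper, and the function names and the `is_recognized` callback (a stand-in for a call to a real analyzer) are assumptions.

```python
from collections import Counter


def coverage(tokens, is_recognized):
    """Return token-level and type-level coverage of an analyzer.

    `is_recognized` is a hypothetical callback: word form -> bool.
    """
    counts = Counter(tokens)
    if not counts:
        raise ValueError("token list is empty")
    known_types = [w for w in counts if is_recognized(w)]
    known_tokens = sum(counts[w] for w in known_types)
    return {
        "token_coverage": known_tokens / sum(counts.values()),
        "type_coverage": len(known_types) / len(counts),
    }


def top_unrecognized(tokens, is_recognized, n=10):
    """List the most frequent unrecognized word forms (candidates for lexicon updates)."""
    counts = Counter(w for w in tokens if not is_recognized(w))
    return counts.most_common(n)


# Toy usage with a set standing in for the analyzer's lexicon.
lexicon = {"a", "b"}
sample = ["a", "a", "a", "b", "c"]
print(coverage(sample, lexicon.__contains__))
print(top_unrecognized(sample, lexicon.__contains__))
```

Sorting unrecognized forms by frequency points at the gaps whose repair would raise token coverage the most.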
This paper presents the results of experiments on morphological disambiguation in the National corpus of the Tatar language “Tugan tel”. The experiments were conducted using an LSTM-based neural network model. The tagged socio-political sub-corpus of the National corpus of the Tatar language “Tugan tel”, with a volume of 2.4 million words, was used as training data. The experiments have shown that LSTM models are language-independent and can also be applied to the Tatar language. The results for Tatar are comparable with those for other agglutinative languages, such as Hungarian and Turkish.

This paper assesses the possibility of combining the rule-based and the neural network approaches to the construction of a machine translation system for the Tatar-Russian language pair. We propose a rule-based system that allows using parallel data of a group of 6 Turkic languages (Tatar, Kazakh, Kyrgyz, Crimean Tatar, Uzbek, Turkish) and the Russian language to overcome the problem of limited Tatar-Russian data. We incorporated modern approaches to data augmentation, neural network training and linguistically motivated rule-based methods. The main results of the work are the creation of the first neural Tatar-Russian translation system and the improvement of the translation quality in this language pair in terms of BLEU scores from 12 to 39 and from 17 to 45 for the two translation directions (compared to the existing translation system). Also the translation between any of the Tatar, Kazakh, Kyrgyz, Crimean Tatar, Uzbek, Turkish languages becomes possible, which allows to trans...

This article presents the results of experiments on the use of various methods and algorithms in creating a Russian-Tatar machine translation system. As a basic algorithm, we used a neural network approach based on the Transformer architecture, as well as various algorithms that increase the amount of parallel data using monolingual corpora (back-translation). For the first time, experiments on the use of transfer learning (based on a Kazakh-Russian parallel corpus) were conducted for the Russian-Tatar language pair. As the main training data, we created and used a parallel corpus with a total volume of about 1 million Russian-Tatar sentence pairs. Experiments show that the created system is superior in quality to the currently existing Russian-Tatar translators. The best quality for the Russian-Tatar translation direction was achieved by our basic model (BLEU 35.4), and for the Tatar-Russian direction by the model for which the back-translation algorithm was used (BLEU 39.2).

The idea of the “TurkLang-7” project is to create datasets and neural machine translation systems for a set of Russian-Turkic low-resource language pairs. It is planned to achieve this goal through a hybrid approach to the creation of a multilingual parallel corpus between Russian and Turkic languages, studying the applicability and effectiveness of neural network learning methods (transfer learning, multi-task learning, back-translation, dual learning) in the context of the selected language pairs, as well as the development of specialized methods for the unification of parallel data in different languages, based on the agglutinative nature of the selected Turkic languages (structural and functional model of the Turkic morpheme). In this paper, we describe the main stages of work on this project and the results of the first year: we developed a semiautomatic process for creating parallel corpora, collected data from several sources on 7 Turkic languages, and conducted the first exp...
Procedia - Social and Behavioral Sciences, Oct 1, 2013
This article presents the National Corpus of the Tatar Language, which is being developed at the Research Institute of Applied Semiotics of the Tatarstan Academy of Sciences on the EANC technological platform. It describes the morphological model of the Tatar language used for grammatical annotation of words.