
Automatic Lexical Text Simplification for Turkish

Ahmet Yavuz Uluslu


ETH Zürich
[email protected]

Abstract

In this paper, we present the first automatic lexical simplification system for the Turkish language. Recent text simplification efforts rely on manually crafted simplified corpora and comprehensive NLP tools that can analyse the target text at both the word and sentence levels. Turkish is a morphologically rich agglutinative language that requires unique considerations, such as the proper handling of inflectional cases. Its status as a low-resource language in terms of available corpora and industrial-strength tools makes the text simplification task harder to approach. We present a new text simplification pipeline based on the pretrained representation model BERT, together with morphological features, to generate grammatically correct and semantically appropriate word-level simplifications.

Keywords: Turkish, automatic text simplification, lexical simplification



1. Introduction

The goal of the lexical simplification task is to replace complex words with simpler alternatives. There are many groups of people that can benefit from this, including children, people with cognitive disabilities and non-native speakers (Paetzold and Specia, 2016; Rello et al., 2013a). The common assumption among linguists is that those who are familiar with the vocabulary of a text can often understand its meaning even if they have problems with the grammatical structures. Automatic lexical simplification can thus become an effective method for making text accessible to different audiences.

Turkish, the most widely spoken language in the Turkic language family, is the official language of Turkey, with 80 million speakers. It is a morphologically rich agglutinative language. Text simplification for Turkish poses a number of challenges owing to the lack of linguistic resources and its dissimilarity to the languages covered so far. Turkish is considered a low-resource language in terms of standard linguistic resources (Cieri et al., 2016). Recently, there has been an initiative by different research groups to release datasets and tools publicly. To name a few, some of the lexical resources necessary for modern text simplification, such as WordNet, have become available (Bakay et al., 2021), and multiple Turkish treebanks were successfully integrated into Universal Dependencies (Türk et al., 2021). However, the lack of parallel corpora for different tasks and domains renders data-driven approaches such as neural text simplification ineffective.

There has not been a comprehensive conceptual study of text simplification in Turkish. Different simplification methods should be proposed for the target audience, and they should be subjected to experimentation to show their effectiveness. Simplification methods can be found insignificant on their own (Rello et al., 2013b) and can be mixed with other techniques to improve text accessibility. We admit that clinical insight which is currently unavailable would be required for our work to have practical importance for people with special needs. More psycholinguistic research is needed to establish what constitutes simple language for different groups. Therefore, we focus on building a general-purpose lexical simplification pipeline for Turkish to lay the foundations for further research.

Figure 1: An example lexical simplification by the LS-BERT pipeline

LS-BERT is a lexical simplification method that generates substitute words with pretrained encoders (Qiang et al., 2021). Our paper builds a similar pipeline and adapts BERTurk (Schweter, 2020) to handle the challenges of Turkish text simplification by using additional features. We present a new, manually constructed dataset for complex word identification. We evaluated our proposed system automatically, and released our code and lexical resources open-source.
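The pipeline decomposes into the three steps detailed in Section 3. As a minimal sketch of the overall control flow only, the following Python skeleton shows how the steps compose; every function body here is an illustrative placeholder of ours, not the authors' implementation:

```python
# Sketch of the three-step lexical simplification pipeline shape.
# All step implementations below are trivial placeholders, not the
# paper's actual methods (those are described in Section 3).

def identify_complex_words(tokens):
    # Placeholder: the paper combines PoS tagging with word frequency.
    return [i for i, w in enumerate(tokens) if len(w) > 12]

def generate_substitutes(tokens, idx):
    # Placeholder: the paper masks the word and queries BERTurk (Sec. 3.2).
    return [tokens[idx]]

def select_substitute(tokens, idx, candidates):
    # Placeholder: the paper ranks by probability, frequency,
    # semantic similarity and a language model feature (Sec. 3.3).
    return candidates[0]

def simplify(sentence):
    tokens = sentence.split()
    for i in identify_complex_words(tokens):
        candidates = generate_substitutes(tokens, i)
        tokens[i] = select_substitute(tokens, i, candidates)
    return " ".join(tokens)

print(simplify("Çevrendekilerle iyi geçinmelisin."))
```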
2. Related Work

Text simplification is the process of simplifying the content of the original text while retaining its meaning and preserving grammaticality. It focuses on the simplification of vocabulary and of the syntactic structures in the text. Early text simplification systems were rule-based, relying on lexical resources such as WordNet and other linguistic databases to substitute a predefined set of complex words with simpler alternatives (Carroll et al., 1998). The major limitation of such an approach was the identification of complex words (Shardlow, 2014). Rule-based systems relied heavily on word frequencies and ignored the context. Synonym replacement also required simplification rules for every word, or general rules that failed to account for different linguistic relationships. Even with the integration of N-gram language models to capture word context, simplification algorithms had a limited understanding of the whole sentence.

With the availability of complex-simple parallel corpora (Coster and Kauchak, 2011), data-driven methods started to produce adequate results. Recent research treated the text simplification task as a monolingual machine translation problem (Tang et al., 2019). Statistical machine translation (SMT) algorithms were the first techniques to be used for text simplification (Wubben et al., 2012). This was followed by developments in neural machine translation (NMT), and researchers started to apply deep learning based machine translation models to the text simplification problem. Wang et al. (2016) built a model based on a long short-term memory (LSTM) encoder-decoder and successfully showed that it was able to learn simplification rules such as sorting, reversing, replacing, removal and substitution of words. The study demonstrated an increased capacity for simplification, and LSTM-based encoder-decoders outperformed their statistical counterparts.

There has been relatively little research on text simplification in Turkish (Torunoglu-Selamet et al., 2016; Özkan and Ercan, 2018). The first study proposed various syntax-level simplification rules but did not cover lexical simplification. Some of the proposed rules, such as paratactic sentence simplification, appear to exist only at a conceptual level, and their practical implications went uncovered. The system should be evaluated on actual complex sentences to assess the robustness of the defined rules with respect to text cohesion (Siddharthan, 2006). The target group of the study does not seem to be defined clearly, and the group names preteens (8-12) and children (0-18) are used interchangeably. The latter study approaches the text simplification problem from the modernisation perspective and trains a statistical machine translation model on a parallel corpus constructed from the original and modernised versions of Turkish classics.

3. Turkish Text Simplification

The lack of parallel data in Turkish limits the applicability of data-driven approaches. However, unsupervised language models may still be employed for low-resource languages, as they only require a large corpus of raw text. BERT-based pretrained language models have been shown to be effective for masked language modeling (Devlin et al., 2018). LS-BERT exploits this to generate suitable simplifications for complex words (Qiang et al., 2021). This method considers the whole sentence context, and it has been shown to generate coherent and cohesive sentences. BERTurk, a community-driven BERT model for Turkish, is available to implement this approach (Schweter, 2020). We create a similar pipeline which consists of the following three steps: complex word identification, substitute generation and substitute selection.

3.1. Complex Word Identification

The most common first step in lexical simplification is to identify which words are considered complex by the target audience (Shardlow, 2013). Complex words may be identified by different features such as word length, syllable count, and word frequency. General-purpose text simplification systems focus on replacing infrequent words with frequent alternatives. The number of syllables and vowels may become important in special situations such as vowel dyslexia (Güven and Friedmann, 2021).

We trained a POS (part-of-speech) tagger on the BOUN Treebank to establish which sentence parts are targeted by the simplification pipeline (Türk et al., 2021). The PoS tagger (Lample et al., 2016) achieved an F1 score of 0.89 on the test set. The complex sentence is first PoS tagged, and only words with a predefined set of tags, namely nouns (NN), adjectives (ADJ), verbs (VB) and adverbs (ADV), are checked for their frequency inside the Turkish section of the wordfreq corpus (Speer et al., 2018). The corpus includes crawled Wikipedia entries, movie subtitles, tweets and web pages.
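As an illustrative sketch of this frequency check, the snippet below uses the open-source wordfreq package; the Zipf threshold and the pre-tagged toy input are our assumptions, not values reported in the paper.

```python
from wordfreq import zipf_frequency  # pip install wordfreq

# Only these PoS tags are candidates for simplification (Sec. 3.1).
CONTENT_TAGS = {"NN", "ADJ", "VB", "ADV"}
# Words at or below this Zipf frequency count as complex; the threshold
# is an illustrative assumption, not a value from the paper.
ZIPF_THRESHOLD = 3.5

def complex_words(tagged_sentence):
    """tagged_sentence: list of (token, tag) pairs from the PoS tagger."""
    flagged = []
    for token, tag in tagged_sentence:
        if tag not in CONTENT_TAGS:
            continue  # only content words are checked
        if zipf_frequency(token.lower(), "tr") <= ZIPF_THRESHOLD:
            flagged.append(token)
    return flagged

# Toy pre-tagged input standing in for the tagger's output.
sentence = [("Kitap", "NN"), ("okumak", "VB"), ("faydalıdır", "ADJ")]
print(complex_words(sentence))
```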
We observed two different conditions under which predefined word lists and naive frequency-based algorithms failed to capture fundamental aspects of the language. We provide sentence examples to illustrate context-awareness and morphological complexity.

Turkish is a morphologically rich agglutinative language. It can produce very complex sentences with only a few words. These words may appear frequently in everyday speech and written language and can therefore go unnoticed by the frequency algorithm, yet non-native speakers tend to have a hard time grasping such unfamiliar constructions. Infrequent and long words are already known to affect readers with dyslexia (Rello et al., 2013a). Recently, morphological complexity in Turkish words has also been shown to affect sentence comprehension in students with dyslexia (Dodur and Miray, 2021).

1. Morphological Complexity:

In: Çevrendekilerle iyi geçinmelisin.
In EN: You should get along well with those around you.

Complex words identified: None

The frequency algorithm does not identify the pronoun 'çevrendekilerle' (A3pl+Pnon+Ins) as a complex word. It is possible to disambiguate the pronoun depending on the sentence context and break it down into two words to reduce morphological complexity. The lexical simplification affects the overall sentence complexity and results in a clear and concise outcome. The syntax of the sentence was also affected by this change, so it may be an overstep depending on the definition of the lexical simplification task.
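A rough way to surface such morphologically loaded forms is to estimate the suffix load of a word. The sketch below is a deliberately crude, self-contained heuristic of ours, with a tiny sample suffix list; a real system would use a proper Turkish morphological analyzer (e.g., a Zemberek-style tool) rather than this toy.

```python
# Crude suffix-load heuristic (ours, for illustration only): repeatedly
# strip suffixes from a small sample list and count how many come off.
SAMPLE_SUFFIXES = ["lerle", "larla", "ler", "lar", "ki", "de", "da", "le", "la"]

def suffix_load(word):
    word = word.lower()
    count = 0
    stripped = True
    while stripped:
        stripped = False
        for suffix in SAMPLE_SUFFIXES:
            # keep at least a three-letter stem to avoid over-stripping
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                count += 1
                stripped = True
                break
    return count

# 'çevrendekilerle' ~ çevre-n-de-ki-ler-le: several stacked suffixes.
print(suffix_load("çevrendekilerle"))
```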
Figure 2: Dependency Visualisation Before Simplification

In: Çevrendekilerle iyi geçinmelisin.
Out: Çevrendeki insanlarla iyi geçinmelisin.

Simplification: Çevrendeki (-lerle) insanlarla

Figure 3: Dependency Visualisation After Simplification

2. Contextual Information:

In: Hak söz söyleyenin dostu az olur.
In EN: S/he who speaks truth has few friends.

Complex words identified: None

The frequency algorithm does not identify the word hak (justice, truth, right) as a complex word. Since the word has several meanings and is repeatedly used in compound verbs (hak etmek, hakkı olmak, hak görmek) and nouns (hak sahibi, miras hakkı), it frequently appears in the corpus. This usage is now considered old-fashioned, and it can be simplified for a certain age group and educational background. It is impossible to identify such words without contextual information.

In: Hak söz söyleyenin dostu az olur.
Out: Doğru söz söyleyenin dostu az olur.

Simplification: Doğru (-Hak)

The complex word identification problem has recently been treated as a sequence labelling task (Gooding and Kochmar, 2019). Data-driven models take word context into account, and they avoid the need for extensive feature engineering to address linguistic complexity. We manually crafted an annotated complex word identification dataset to experiment with sequence models. We followed the annotation guideline of the CWI Shared Task 2018 (Yimam et al., 2018). The author, whose native tongue is Turkish, assumed the target group of preteens proceeding to high school level study, with limited exposure to Arabic- and Persian-rooted words in the Turkish language. 1000 complex sentences from the Bilkent Creative Writing dataset and 2000 complex sentences from Wikipedia were annotated. This is not a complete study or dataset, as the CWI Shared Task included multiple annotators with different assumed roles to construct a corpus of 90,000 sentences (Yimam et al., 2018). We regardless make our data and code open-source for further study.

Total Dataset    Training Data     Test Data
3k Sentences     2650 Sentences    350 Sentences

Table 1: Turkish CWI Dataset

A sequence labelling based word-level BiLSTM model was trained to predict the binary complexity of the words annotated in the dataset. The model's F1 score was 0.64 for the complex word class, and its overlap with the frequency-based algorithm was 67.3%. We have previously explored the differences between the two approaches; however, without comprehensive benchmark data, the statistical results are not robust enough for further analysis.
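For concreteness, a minimal PyTorch sketch of such a word-level BiLSTM tagger follows; the architecture and hyperparameters are illustrative assumptions (the paper follows Lample et al. (2016), whose model also adds a CRF layer that we omit here).

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Word-level BiLSTM for binary complex-word labelling. Hyperparameters
    are illustrative, not those used in the paper."""

    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_ids):               # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, 2*hidden)
        return self.out(h)                      # per-token logits

# One illustrative training step on a random toy batch.
model = BiLSTMTagger(vocab_size=10_000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(1, 10_000, (8, 20))  # fake token ids
labels = torch.randint(0, 2, (8, 20))       # 1 = complex word
loss = loss_fn(model(tokens).view(-1, 2), labels.view(-1))
loss.backward()
optimizer.step()
print(float(loss))
```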
3.2. Substitute Generation

The aim of substitute generation is to produce substitute candidates for a complex word. We produce substitute candidates using the pretrained language model BERT (Devlin et al., 2018). BERT is a self-supervised method based on the encoder part of the transformer architecture. The model is trained on two language tasks: masked language modeling and next sentence prediction. Masked language modeling is the objective of predicting a masked word in a sequence given its left and right context. Next sentence prediction is the task of, given a pair of sentences, predicting whether the second sentence is the subsequent sentence in the original document. BERT accomplishes the masked language modelling task by replacing random words with the special token [MASK] during training. In our simplification pipeline, we follow the LS-BERT study (Qiang et al., 2021) and replace the identified complex word with a [MASK] symbol to produce the substitute candidates with BERT. The bi-directional nature of the model allows candidate generation to condition on the whole sentence context.
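A minimal sketch of this masking step with the Hugging Face transformers library and the public BERTurk checkpoint is shown below. Note that the full LS-BERT scheme feeds the original and the masked sentence together as a sentence pair; for brevity this sketch queries only the masked sentence, and the candidate count is our choice.

```python
from transformers import pipeline  # pip install transformers

# Public BERTurk checkpoint (Schweter, 2020).
fill = pipeline("fill-mask", model="dbmdz/bert-base-turkish-cased")

sentence = "Hak söz söyleyenin dostu az olur."
# Mask the identified complex word to elicit substitute candidates.
masked = sentence.replace("Hak", fill.tokenizer.mask_token, 1)

for cand in fill(masked, top_k=10):
    print(f"{cand['token_str']:<15} {cand['score']:.4f}")
```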
3.3. Substitute Selection

Substitute selection is the decision step that filters the candidate substitutions and selects the one that is the simplest choice and best fits the context of the complex word. The candidates are ranked based on BERT prediction probability, word frequency and semantic similarity.

BERT probability distribution:
BERT returns the probability distribution over the vocabulary for the masked word, given the sentence in which the complex word was identified. The results are calculated through the attention mechanism and depend on the sentence context. Therefore, the higher the probability, the more relevant the candidate is for the original sentence. It is possible to rank the candidates accordingly.

Language model feature:
A substitution candidate should fit in the context of the words that come before and after the original term. In non-contextual lexical simplification systems, n-gram language models are implemented to verify grammaticality (Qasmi et al., 2020). The bi-directional nature of BERT already accounts for grammaticality depending on the sentence context. We simply add another ranking measure to evaluate compatibility between the whole sentence and the limited word-frame context. It is possible to mask nearby words back to front for each candidate to calculate the overall loss and rank accordingly.
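A sketch of this loss-based measure is given below, again assuming the BERTurk checkpoint; the window size, the word-level masking and the skip rule for multi-subword tokens are our simplifications, not details specified in the paper.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

NAME = "dbmdz/bert-base-turkish-cased"
tok = AutoTokenizer.from_pretrained(NAME)
mlm = AutoModelForMaskedLM.from_pretrained(NAME).eval()

def context_loss(tokens, idx, candidate, window=2):
    """Average masked-LM loss over the words near position idx after
    placing `candidate` there. Lower is better; window size is ours."""
    words = tokens[:idx] + [candidate] + tokens[idx + 1:]
    target_ids = tok(" ".join(words), return_tensors="pt")["input_ids"]
    losses = []
    for j in range(max(0, idx - window), min(len(words), idx + window + 1)):
        if j == idx:
            continue
        masked = words[:j] + [tok.mask_token] + words[j + 1:]
        enc = tok(" ".join(masked), return_tensors="pt")
        # Skip positions where masking changed the subword segmentation.
        if enc["input_ids"].shape != target_ids.shape:
            continue
        with torch.no_grad():
            logits = mlm(**enc).logits
        pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
        losses.append(torch.nn.functional.cross_entropy(
            logits[0, pos].unsqueeze(0), target_ids[0, pos].unsqueeze(0)).item())
    return sum(losses) / len(losses) if losses else float("inf")

tokens = "Hak söz söyleyenin dostu az olur .".split()
print(context_loss(tokens, 0, "Doğru"))  # candidate replacing 'Hak'
```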
Semantic similarity:
The semantic similarity is calculated as the cosine similarity between the GloVe vector of the original word and that of the candidate substitution.

Frequency comparison:
Frequency-based approaches were covered during complex word identification. We make use of a similar algorithm as a supportive measure in substitute selection. We rank the substitution candidates according to their appearance in the wordfreq corpus (Speer et al., 2018).

The LS-BERT algorithm makes use of an additional measure called the PPDB feature. It analyses a paraphrase corpus to see whether the complex word and the substitution occurred inside a paraphrase pair. The authors conclude that this feature had the least impact on overall performance. We were not able to find a comprehensive paraphrase corpus with over a hundred million examples, so we exclude this feature from our study. The feature scores of the substitution candidates are averaged to calculate the final ranking score. The foremost candidate replaces the complex word if and only if it has a higher frequency and a better loss outcome in the language modelling.
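The aggregation below is a sketch of our reading of this step: each feature scores all candidates, candidates are ranked per feature, and the average rank decides. The feature scores here are invented for illustration, not system outputs.

```python
import numpy as np

def best_candidate(candidates, feature_scores):
    """candidates: list of substitute words.
    feature_scores: dict of feature name -> scores aligned with candidates,
    where higher is always better. Averages per-feature ranks (our reading
    of the paper's averaging; not published code)."""
    total_rank = np.zeros(len(candidates))
    for scores in feature_scores.values():
        # double argsort turns scores into ranks (0 = best per feature)
        total_rank += np.argsort(np.argsort(-np.asarray(scores)))
    return candidates[int(np.argmin(total_rank / len(feature_scores)))]

# Invented scores for three candidates replacing 'hak'.
candidates = ["doğru", "gerçek", "haklı"]
features = {
    "bert_probability": [0.31, 0.22, 0.12],
    "zipf_frequency":   [5.1, 4.7, 4.2],
    "glove_cosine":     [0.71, 0.64, 0.58],
    "lm_fit":           [-1.2, -1.9, -2.4],  # negated context loss
}
print(best_candidate(candidates, features))
```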
4. Evaluation

We could not find any simplified parallel corpus of sentence pairs for Turkish. To evaluate our simplification system, we manually simplified a reserved subset of our CWI dataset that was not included in the training process. The final parallel corpus contained 500 complex sentences and their corresponding lexical simplifications. We adhered to the CWI dataset guidelines and assumed the role of a student with a pre-high school education background. The complex sentences were taken from the same resources: Wikipedia and a university-level Turkish writing corpus. The simplifications were created by the author, whose native language is Turkish and who has a background in linguistics. The simplification pipeline first identified the complex word, and BERT generated the substitute candidates. The ranking algorithm included different features to pick the best candidate. We decided to take the core parts of the algorithm, the probability distribution and the frequency analysis, as our evaluation baseline, and to show the improvement in performance after the addition of each feature.

We evaluate our system outputs using standard evaluation metrics for text simplification: BLEU and SARI (Xu et al., 2016). The BLEU score has recently been disputed for the evaluation of text simplification (Sulem et al., 2018). However, our method is out of scope for the major shortcomings mentioned, such as sentence splitting. We regardless provide the score for comparison with other studies.

Model                 BLEU     SARI
BERT (Prob + Freq)    70.30    35.52
+ Similarity          76.84    37.36
+ LM                  78.25    37.40

Table 2: Results of Automatic Evaluation
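For reference, BLEU and SARI can be computed with off-the-shelf implementations; the snippet below uses sacrebleu and the easse package, which is our choice of tooling (the paper does not name its implementation).

```python
import sacrebleu                    # pip install sacrebleu
from easse.sari import corpus_sari  # pip install easse

orig    = ["Hak söz söyleyenin dostu az olur."]
sys_out = ["Doğru söz söyleyenin dostu az olur."]
refs    = [["Doğru söz söyleyenin dostu az olur."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(sys_out, refs)
sari = corpus_sari(orig_sents=orig, sys_sents=sys_out, refs_sents=refs)
print(f"BLEU: {bleu.score:.2f}  SARI: {sari:.2f}")
```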
5. Conclusion & Future Work

This paper presents the first automatic lexical text simplification system for Turkish. We also present a complex word identification dataset for Turkish, and create a small simplified parallel corpus for benchmarking text simplification tasks. Our model achieves a BLEU score of 78.25 and a SARI score of 37.40 in the automatic evaluation. In future work, we would like to expand our datasets with multiple annotators and address the simplification shortcomings for multi-word expressions and morphologically complex words in Turkish.

6. Bibliographical References

Carroll, J., Minnen, G., Canning, Y., Devlin, S., and Tait, J. (1998). Practical simplification of English newspaper text to assist aphasic readers. In Proceedings of the AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, pages 7–10. Citeseer.

Cieri, C., Maxwell, M., Strassel, S., and Tracey, J. (2016). Selection criteria for low resource language programs. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4543–4549.

Coster, W. and Kauchak, D. (2011). Simple English Wikipedia: a new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 665–669.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dodur, S. and Miray, H. (2021). Syntax comprehension skills of Turkish-speaking students with dyslexia. International Journal of Curriculum and Instruction, 13(3):2732–2745.

Gooding, S. and Kochmar, E. (2019). Complex word identification as a sequence labelling task. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1148–1153.

Güven, S. and Friedmann, N. (2021). Vowel dyslexia in Turkish: A window to the complex structure of the sublexical route. PLOS ONE, 16(3):1–39.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Özkan, E. and Ercan, G. (2018). Modernization of old Turkish texts. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pages 1–4. IEEE.

Paetzold, G. and Specia, L. (2016). Unsupervised lexical simplification for non-native speakers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.

Qasmi, N. H., Zia, H. B., Athar, A., and Raza, A. A. (2020). SimplifyUR: Unsupervised lexical text simplification for Urdu. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3484–3489.

Qiang, J., Li, Y., Zhu, Y., Yuan, Y., Shi, Y., and Wu, X. (2021). LSBert: Lexical simplification based on BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3064–3076.

Rello, L., Baeza-Yates, R., Dempere-Marco, L., and Saggion, H. (2013a). Frequent words improve readability and short words improve understandability for people with dyslexia. In IFIP Conference on Human-Computer Interaction, pages 203–219. Springer.

Rello, L., Baeza-Yates, R., and Saggion, H. (2013b). The impact of lexical simplification by verbal paraphrases for people with and without dyslexia. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 501–512. Springer.

Shardlow, M. (2013). A comparison of techniques to automatically identify complex words. In 51st Annual Meeting of the Association for Computational Linguistics: Proceedings of the Student Research Workshop, pages 103–109.

Shardlow, M. (2014). Out in the open: Finding and categorising errors in the lexical simplification pipeline. In LREC, pages 1583–1590.

Siddharthan, A. (2006). Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109.

Sulem, E., Abend, O., and Rappoport, A. (2018). BLEU is not suitable for the evaluation of text simplification. arXiv preprint arXiv:1810.05995.

Tang, G., Sennrich, R., and Nivre, J. (2019). Understanding neural machine translation by simplification: The case of encoder-free models. arXiv preprint arXiv:1907.08158.

Torunoglu-Selamet, D., Pamay, T., and Eryigit, G. (2016). Simplification of Turkish sentences. In The First International Conference on Turkic Computational Linguistics, pages 55–59.

Wang, T., Chen, P., Rochford, J., and Qiang, J. (2016). Text simplification using neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.

Wubben, S., Krahmer, E., and van den Bosch, A. (2012). Sentence simplification by monolingual machine translation.

Xu, W., Napoles, C., Pavlick, E., Chen, Q., and Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Yimam, S. M., Biemann, C., Malmasi, S., Paetzold, G. H., Specia, L., Štajner, S., Tack, A., and Zampieri, M. (2018). A report on the complex word identification shared task 2018. arXiv preprint arXiv:1804.09132.

7. Language Resource References

Bakay, O., Ergelen, O., Sarmis, E., Yildirim, S., Kocabalcioglu, A., Arican, B. N., Ozcelik, M., Saniyar, E., Kuyrukcu, O., Avar, B., and Yıldız, O. T. (2021). Turkish WordNet KeNet.

Schweter, S. (2020). BERTurk - BERT models for Turkish. Zenodo.

Speer, R., Chin, J., Lin, A., Jewett, S., and Nathan, L. (2018). LuminosoInsight/wordfreq: v2.2.

Türk, U., Atmaca, F., Özateş, Ş. B., Berk, G., Bedir, S. T., Köksal, A., Başaran, B. Ö., Güngör, T., and Özgür, A. (2021). Resources for Turkish dependency parsing: Introducing the BOUN treebank and the BoAT annotation tool. Springer.
