A Morphology-Aware Network For Morphological Disambiguation
Eray Yildiz Caglar Tirkaz H. Bahadir Sahin Mustafa Tolga Eren Ozan Sonmez
Huawei Turkey Research and Development Center, Umraniye, Istanbul, Turkey
{[Link], [Link]}@[Link]
{caglartirkaz, hbahadirsahin, osonmez}@[Link]
arXiv:1702.03654v1 [[Link]] 13 Feb 2017
as input and propagates to the next layer. The second layer, (b), takes a window of n words as input and propagates to the softmax layer, (c). The non-linearity in both the first and the second layers is provided through the use of tanh as the transfer function. The softmax layer is responsible for deciding the likelihood of the current morphological analysis of the words, i.e., a binary decision is produced with the expected result of 1 if the analysis is correct and 0 otherwise.

We train our network with the possible sequences of morphological analyses in the training data. For each sentence, and for each word, we select the n-2 words preceding the word and their ground-truth annotations, along with the possible annotations of the last two words. We also add n-1 out-of-sentence tokens at the beginning of each sentence so that all words in the sentence are included in the training data. We label the sequences containing the correct morphological analysis as positive, whereas the remaining sequences are labeled as negative. This way, the model is trained to predict the correct annotation for the last two words in a sequence, given that the first n-2 words have correct annotations. Training is performed with stochastic gradient descent and AdaGrad (Duchi, Hazan, and Singer 2011) as the optimization algorithm. At inference time, given a sentence containing words to disambiguate, we use the network to make predictions for windows of words in the sentence and then use the Viterbi algorithm to select the best morphological analysis for each word.

Unsupervised pre-training of word embeddings has been employed in various NLP tasks, and its usage has improved recognition accuracies (Collobert et al. 2011; Turian, Ratinov, and Bengio 2010). In order to improve the performance of our disambiguation system, we also use unsupervised methods to pre-train the root embeddings of words. We created a corpus comprising 1 billion Turkish words that we collected from various sources, such as e-books and web sites. Although our corpus is rather small compared to English corpora, it is the largest text corpus in Turkish that we know of. After we trained the supervised disambiguation system as described above, we disambiguated each word in the corpus and extracted the roots of the words. Next, we built representations for the root forms of the words using the unsupervised skip-gram algorithm (Mikolov et al. 2013). After obtaining the pre-trained root vectors, we retrained our disambiguation system with the pre-trained root embeddings. This technique allowed us to further improve the disambiguation accuracies we obtained.

As discussed earlier, the first layer takes as input the root and the morphological features of a word. The morphological features we use are presented in Table 5. Specifically, the set of morphological features we consider contains the root, main POS tag, minor POS tag, person and possessive agreements, plurality, gender, case marker, polarity and tense. Note that the information contained in a surface word form may differ due to the morphological characteristics of a language. For instance, German and French have a gender feature, unlike Turkish, while Turkish words have possessive agreement and polarity. Main POS tag describes the category of a word and can take on values such as noun, verb, adjective and adverb.

Table 5: The morphological (morphosyntactic and morphosemantic) features we used to represent each word

                 Morphosyntactic and Morphosemantic Features
Language  Root  Main     Minor    Person     Plurality  Gender  Possessive  Case    Polarity  Tense
                POS Tag  POS Tag  Agreement                     Agreement   Marker
Turkish    +      +        +         +           +        -         +         +        +        +
German     +      +        +         +           +        +         -         +        -        +
French     +      +        -         +           +        +         -         -        -        +
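The training-sequence construction described above can be sketched as follows: each sentence is padded with n-1 out-of-sentence tokens, and every window of n words yields one labeled sequence per combination of candidate analyses for its last two words. This is a minimal illustrative sketch, not the authors' code; the (gold, candidates) input format and all names are our assumptions.

```python
from itertools import product

PAD = "<s>"  # out-of-sentence padding token (name is an assumption)

def training_sequences(sentence, n=5):
    """Yield (sequence, label) pairs for one sentence.

    `sentence` is a list of (gold_analysis, candidate_analyses) pairs,
    one per word. Each emitted sequence holds n analyses: the first
    n-2 are the gold annotations of the preceding words, and the last
    two range over the candidate analyses of the previous and current
    word. The label is 1 only when both candidates match the gold ones.
    """
    golds = [PAD] * (n - 1) + [gold for gold, _ in sentence]
    cands = [[PAD]] * (n - 1) + [c for _, c in sentence]
    for i in range(len(sentence)):
        context = golds[i : i + n - 2]  # n-2 correctly annotated words
        for prev, cur in product(cands[i + n - 2], cands[i + n - 1]):
            label = int(prev == golds[i + n - 2] and cur == golds[i + n - 1])
            yield context + [prev, cur], label
```

At inference time the same windows would instead be scored by the network, with the Viterbi algorithm selecting the highest-scoring sequence of analyses over the sentence.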
Minor POS tag determines the minor morphological properties of a word, such as semantic markers, causative markers and post-positions. "Since", "While", "Propernoun" and "Without" can be given as examples of this kind of morphological feature in Turkish. Person and possessive agreement are used to answer the questions "who" and "whose" respectively, i.e., they are used to indicate a person or an ownership relationship. Case markers relate nouns to the rest of the sentence, as prepositions do in English. Nominative (none), dative (to, for), locative (at, in, on), ablative (from, out of) and genitive (of) are examples of the forms that can be observed in a sentence. The polarity of a word is positive if the word is not negated and negative otherwise. Tense indicates the tense of verbs, such as the present, past and future tenses. Additionally, we consider the moods of verbs within the tense feature. Moods express the speaker's attitude, such as the indicative, imperative or subjunctive moods. In languages with grammatical gender, such as German and French, every noun is associated with a gender. The morphological analyzer we use associates each French word with one of two genders (masculine and feminine), while it associates each German word with one of four possible genders (masculine, feminine, neuter and no gender).

Some of the suffixes in Turkish change word meaning, creating derivational boundaries in the morphological analyses. The morphological features of a word given in Table 5 are extracted after the final derivational boundary. In Turkish, we add one more feature to each word, named previous tags, in order to account for the previous suffixes that the word might have. This way, our model learns the effect of suffixes that change word meaning. Some of the described morphological features exist only for certain word categories. For instance, possessive agreement and case marker features can only exist in nouns, polarity and tense exist in verbs, and person agreement exists in nouns and verbs. If a morphological feature cannot be extracted from a word, we label it as having NULL for that feature.

Experiments

For Turkish, we used a semi-automatically disambiguated corpus containing 1M tokens (Yüret and Türe 2006). Since this dataset is annotated semi-automatically, it also contains noise. In order to reduce the effect of noise on the recognition accuracies, we created a test set by randomly selecting sentences containing 20K of the tokens and manually annotating them. We make this test data publicly available[1] so that Turkish morphological disambiguation algorithms can be compared more accurately in the future.

[1] [Link]

We use the SPMRL 2014 dataset (Seddah and Tsarfaty 2014) for German and French. This dataset is created in the Penn treebank format and was used for a shared task on statistical parsing of morphologically rich languages. It contains 1M and 500K sentences with POS tag and morphological information for German and French, respectively. It provides 90% of the sentences as the training set and the remaining 10% as the test set. We align the features in the treebank to the HFST outputs in order to determine the correct morphological analyses generated by the HFST tool. We use this dataset for both training and testing. The development sets for each language are randomly separated from the training data and are used to optimize the embedding lengths of the morphological features.

We noticed that similar parameters lead to the best performance. Thus, in the experiments, we used embedding lengths of 50, 20 and 5 for roots, POS tags and the other morphological features, respectively. The numbers of filters in the first and second layers are 30 and 40, respectively. The window length, n, which determines the number of words input to the second layer, is set to 5.

Table 6: POS tagging, lemmatization and morphological disambiguation accuracies of the proposed approach for Turkish, German and French.

              Turkish(%)  German(%)  French(%)
POS Tagging     96.85       98.35      98.47
Lemma.          97.59       95.95      99.52
M. disamb.      84.12       88.35      93.78

The experimental results for POS tagging, lemmatization and morphological disambiguation in Turkish, German and French are presented in Table 6. Note that the POS tagging and lemmatization accuracies refer to the percentages of POS tags and lemmas predicted correctly, while the morphological disambiguation accuracies refer to the percentages of words disambiguated correctly among the ambiguous words. According to the results, we observe that even though our initial target was Turkish morphological disambiguation, our model consistently obtains high accuracies in French and German as well.

In Table 7, we present the results of various models for Turkish morphological disambiguation on our hand-labeled test data. The results of the multilayer perceptron developed in (Sak, Güngör, and Saraçlar 2007) and the decision list learning algorithm developed in (Yüret and Türe 2006) are presented in lines 1 and 2, respectively. We present the Turkish morphological disambiguation results obtained by our model without and with pre-training in lines 3 and 4, respectively.
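The three accuracy figures reported in Table 6 can be computed as sketched below; the key point is that disambiguation accuracy is taken over ambiguous words only, while POS and lemma accuracies are taken over all words. The record format and field names here are illustrative assumptions, not the authors' evaluation code.

```python
def accuracies(words):
    """Compute POS tagging, lemmatization and morphological
    disambiguation accuracies over a list of word records.

    Each record is a dict holding predicted and gold values; only
    words flagged as ambiguous count toward disambiguation accuracy.
    """
    n = len(words)
    pos = sum(w["pred_pos"] == w["gold_pos"] for w in words) / n
    lemma = sum(w["pred_lemma"] == w["gold_lemma"] for w in words) / n
    ambiguous = [w for w in words if w["ambiguous"]]
    disamb = sum(w["pred_analysis"] == w["gold_analysis"]
                 for w in ambiguous) / len(ambiguous)
    return pos, lemma, disamb
```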
Table 7: The comparison of the disambiguation accuracy of the proposed approach with the state-of-the-art models in Turkish.

Method                               Accuracy(%)
Multilayer Perceptron                   82.13
Decision List                           83.31
Proposed Model - w/o pre-training       84.12
Proposed Model - with pre-training      85.18

As we discussed before, unsupervised pre-training of the embeddings can boost the accuracies of neural networks. As expected, the morphological disambiguation accuracy increases by around 1% (around a 6% reduction in error) when the root embeddings are pre-trained instead of randomly initialized. We see that even without unsupervised pre-training our algorithm outperforms the current state-of-the-art models, and we are able to further improve the accuracy by pre-training the embeddings.

Although we do not evaluate the effects of unsupervised pre-training for German and French, it is expected that higher accuracies can be achieved using unsupervised pre-training of the embeddings for these languages as well. Error analysis for Turkish morphological disambiguation shows that the root is incorrectly decided in 30% of the errors. The root is correct but the POS tag is incorrectly decided in 40% of the errors, while 30% of the errors are caused by wrong decisions on other inflectional groups. When compared with the study of Sak, Güngör, and Saraçlar (2007), there is no significant difference in the distribution of mistakes. However, our method performs better in root decisions due to the unsupervised learning of root embeddings. As discussed before, the available data for the Turkish morphological disambiguation task contains some systematic errors. Yüret and Türe (2006) report that the accuracy of the training data is below 95%. According to our observations, there is a major confusion between the noun and adjective POS tags in the training data, which affects the decisions of morphological disambiguation systems. In our experiments, we observe that 18% of the errors are caused by such confusion, whereas the ratio of these errors is reported as 22% in the experiments of Sak, Güngör, and Saraçlar (2007).

Summary and Future Work

In this paper, we present a model capable of learning word representations for languages with rich morphology. We show the utility of our approach in the task of Turkish, German and French morphological disambiguation. We also show the effect of unsupervised pre-training on recognition accuracies and improve the current state of the art in Turkish morphological disambiguation. We make publicly available a manually annotated test set containing 20K tokens, which we believe will benefit Turkish NLP.

This paper presents a deep learning architecture specifically aiming to handle morphologically rich languages. Nonetheless, NLP systems that work on languages such as English can also benefit from our work. Using our model, English words can be separated into morphemes so that they can be better represented. This allows creating systems that are less affected by problems such as data sparsity (Luong, Socher, and Manning 2013).

While using pre-training, we only considered the pre-trained root embeddings. It would be preferable to pre-train all the embeddings using our text corpus, which we leave as future work. Another point of note is the selected embedding sizes that we used in our experiments. While we worked on a development set separated from the training data for parameter selection, further investigation of parameter selection might improve the obtained accuracies.

Acknowledgments

This project is partially funded by TUBITAK-TEYDEB (The Scientific and Technological Research Council of Turkey – Technology and Innovation Funding Programs Directorate), project number 3140951.

References

[Beesley and Karttunen 2003] Beesley, K. R., and Karttunen, L. 2003. Finite state morphology. Center for the Study of Language and Information.
[Botha and Blunsom 2014] Botha, J. A., and Blunsom, P. 2014. Compositional morphology for word representations and language modelling. arXiv preprint arXiv:1405.4273.
[Brill 1992] Brill, E. 1992. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, ANLC '92, 152–155. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Brill 1995] Brill, E. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Comput. Linguist. 21(4):543–565.
[Candito et al. 2010] Candito, M.; Nivre, J.; Denis, P.; and Anguiano, E. H. 2010. Benchmarking of statistical dependency parsers for French. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 108–116. Association for Computational Linguistics.
[Chahuneau, Smith, and Dyer 2013] Chahuneau, V.; Smith, N. A.; and Dyer, C. 2013. Knowledge-rich morphological priors for Bayesian language models. Association for Computational Linguistics.
[Collobert and Weston 2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, 160–167. New York, NY, USA: ACM.
[Collobert et al. 2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12:2493–2537.
[Cotterell and Schütze 2015] Cotterell, R., and Schütze, H. 2015. Morphological word-embeddings. In Annual Conference of the North American Chapter of the ACL, 1287–1292.
[Cui et al. 2015] Cui, Q.; Gao, B.; Bian, J.; Qiu, S.; Dai, H.; and Liu, T.-Y. 2015. KNet: A general framework for learning word embedding using morphological knowledge. ACM Transactions on Information Systems (TOIS) 34(1):4.
[Cutting et al. 1992] Cutting, D.; Kupiec, J.; Pedersen, J.; and Sibun, P. 1992. A practical part-of-speech tagger. In Proceedings
of the Third Conference on Applied Natural Language Processing, ANLC '92, 133–140. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Daybelge and Cicekli 2007] Daybelge, T., and Cicekli, I. 2007. A rule-based morphological disambiguator for Turkish. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2007), Borovets, 145–149.
[Duchi, Hazan, and Singer 2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12:2121–2159.
[Ehsani et al. 2012] Ehsani, R.; Alper, M. E.; Eryigit, G.; and Adali, E. 2012. Disambiguating main POS tags for Turkish. In Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012, Yuan Ze University, Chung-Li, Taiwan, September 21-22, 2012.
[Ezeiza et al. 1998] Ezeiza, N.; Alegria, I.; Arriola, J. M.; Urizar, R.; and Aduriz, I. 1998. Combining stochastic and rule-based methods for disambiguation in agglutinative languages. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, 380–384. Association for Computational Linguistics.
[Görgün and Yildiz 2011] Görgün, O., and Yildiz, O. T. 2011. A novel approach to morphological disambiguation for Turkish. In Computer and Information Sciences II - 26th International Symposium on Computer and Information Sciences, London, UK, 26-28 September 2011, 77–83.
[Hajic and Hladka 1998] Hajic, J., and Hladka, B. 1998. Czech language processing: POS tagging. In Proceedings of the First International Conference on Language Resources & Evaluation, 931–936.
[Hakkani-Tür, Oflazer, and Tür 2000] Hakkani-Tür, D. Z.; Oflazer, K.; and Tür, G. 2000. Statistical morphological disambiguation for agglutinative languages. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1, COLING '00, 285–291. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Kaplan and Kay 1981] Kaplan, R. M., and Kay, M. 1981. Phonological rules and finite-state transducers. In Linguistic Society of America Meeting Handbook, Fifty-Sixth Annual Meeting, 27–30.
[Karlsson et al. 1995] Karlsson, F.; Voutilainen, A.; Heikkila, J.; and Anttila, A., eds. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
[Koskenniemi 1984] Koskenniemi, K. 1984. A general computational model for word-form recognition and production. In Proceedings of the 10th International Conference on Computational Linguistics, 178–181. Association for Computational Linguistics.
[Kutlu and Cicekli 2013] Kutlu, M., and Cicekli, I. 2013. A hybrid morphological disambiguation system for Turkish. In Sixth International Joint Conference on Natural Language Processing, IJCNLP 2013, Nagoya, Japan, October 14-18, 2013, 1230–1236.
[Lafferty, McCallum, and Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
[Le and Mikolov 2014] Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In Jebara, T., and Xing, E. P., eds., Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1188–1196. JMLR Workshop and Conference Proceedings.
[Lindén, Silfverberg, and Pirinen 2009] Lindén, K.; Silfverberg, M.; and Pirinen, T. 2009. HFST tools for morphology: an efficient open-source package for construction of morphological analyzers. In State of the Art in Computational Morphology. Springer. 28–47.
[Luong, Socher, and Manning 2013] Luong, T.; Socher, R.; and Manning, C. D. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013, Sofia, Bulgaria, August 8-9, 2013, 104–113.
[Megyesi 1999] Megyesi, B. 1999. Improving Brill's POS tagger for an agglutinative language. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 275–284.
[Mikolov et al. 2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
[Oflazer and Kuruöz 1994] Oflazer, K., and Kuruöz, I. 1994. Tagging and morphological disambiguation of Turkish text. In Proceedings of the Fourth Conference on Applied Natural Language Processing, ANLC '94, 144–149. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Oflazer and Tur 1996] Oflazer, K., and Tur, G. 1996. Combining hand-crafted rules and unsupervised learning in constraint-based morphological disambiguation. In Conference on Empirical Methods in Natural Language Processing, 69–81.
[Oflazer 1993] Oflazer, K. 1993. Two-level description of Turkish morphology. In Proceedings of the Sixth Conference on European Chapter of the Association for Computational Linguistics, EACL '93, 472–472. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Orosz and Novák 2013] Orosz, G., and Novák, A. 2013. PurePos 2.0: A hybrid tool for morphological disambiguation. In RANLP, 539–545.
[Pennington, Socher, and Manning 2014] Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Doha, Qatar: Association for Computational Linguistics.
[Ratnaparkhi 1997] Ratnaparkhi, A. 1997. A maximum entropy model for part-of-speech tagging. In EMNLP 1997.
[Sak, Güngör, and Saraçlar 2007] Sak, H.; Güngör, T.; and Saraçlar, M. 2007. Morphological disambiguation of Turkish text with perceptron algorithm. In CICLing 2007, volume LNCS 4394, 107–118.
[Schmid 1994] Schmid, H. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Vol. 12.
[Seddah and Tsarfaty 2014] Seddah, D., and Tsarfaty, R. 2014. Introducing the SPMRL 2014 shared task on parsing morphologically-rich languages. SPMRL-SANCL 2014, 103.
[Sennrich et al. 2009] Sennrich, R.; Schneider, G.; Volk, M.; and Warin, M. 2009. A new hybrid dependency parser for German. Proceedings of the German Society for Computational Linguistics and Language Technology, 115–124.
[Socher et al. 2012] Socher, R.; Huval, B.; Manning, C. D.; and Ng, A. Y. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, 1201–1211. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Turian, Ratinov, and Bengio 2010] Turian, J.; Ratinov, L.; and Bengio, Y. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, 384–394. Stroudsburg, PA, USA: Association for Computational Linguistics.
[Yüret and Türe 2006] Yüret, D., and Türe, F. 2006. Learning morphological disambiguation rules for Turkish. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, 328–334. Stroudsburg, PA, USA: Association for Computational Linguistics.