Tagging and Morphological Disambiguation of Turkish Text
Kemal Oflazer and İlker Kuruöz
Department of Computer Engineering and Information Science
Bilkent University
Bilkent, Ankara, TURKEY
{ko, [email protected]
cmp-lg/9407026, 29 Jul 1994

Abstract
Automatic text tagging is an important component in higher level analysis of text corpora, and its output can be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a crucial process in tagging, as the structures of many lexical forms are morphologically ambiguous. This paper describes a POS tagger for Turkish text based on a full-scale two-level specification of Turkish morphology that is based on a lexicon of about 24,000 root words. This is augmented with a multi-word and idiomatic construct recognizer, and most importantly, a morphological disambiguator based on local neighborhood constraints, heuristics and a limited amount of statistical information. The tagger also has functionality for statistics compilation and fine tuning of the morphological analyzer, such as logging erroneous morphological parses, commonly used roots, etc. Preliminary results indicate that the tagger can tag about 98-99% of the texts accurately with very minimal user intervention. Furthermore, for sentences morphologically disambiguated with the tagger, an LFG parser developed for Turkish generates, on the average, 50% fewer ambiguous parses and parses almost 2.5 times faster. The tagging functionality is not specific to Turkish, and can be applied to any language with a proper morphological analysis interface.

1 Introduction
As part of a large-scale project on natural language processing for Turkish, we have undertaken the development of a number of tools for analyzing Turkish text. This paper describes one such tool, a text tagger for Turkish. The tagger is based on a full-scale two-level morphological specification of Turkish (Oflazer, 1993), implemented on the PC-KIMMO environment (Antworth, 1990). In this paper, we describe the functionality and the performance of our tagger along with various techniques that we have employed to deal with various sources of ambiguity.

2 Tagging Text
Automatic text tagging is an important step in discovering the linguistic structure of large text corpora. Basic tagging involves annotating the words in a given text with various pieces of information, such as part-of-speech and other lexical features. Part-of-speech tagging facilitates higher-level analysis, such as parsing, essentially by performing a certain amount of ambiguity resolution using relatively cheaper methods.

The most important functionality of a tagger is the resolution of the structure and parts-of-speech of the lexical items in the text. This, however, is not a trivial task since many words are in general ambiguous in their part-of-speech for various reasons. In English, for example, a word such as make can be a verb or a noun. In Turkish, even though there are ambiguities of such sort, the agglutinative nature of the language usually helps resolution of such ambiguities due to morphotactical restrictions. On the other hand, this very nature introduces another kind of ambiguity, where a lexical form can be morphologically interpreted in many ways. For example, the word evin can be broken down as:[1]

evin                     POS   English
1. N(ev)+2SG-POSS        N     (your) house
2. N(ev)+GEN             N     of the house
3. N(evin)               N     wheat germ

[1] Output of the morphological analyzer is edited for clarity.

If, however, the local context is considered, it may be possible to resolve the ambiguity as in:
.. sen-in          ev-in ..
   PN(you)+GEN     N(ev)+2SG-POSS
   your house

.. evin            kapı-sı ..
   N(ev)+GEN       N(door)+3SG-POSS
   door of the house

using genitive-possessive agreement constraints.

As a more complex case we can give the following:

alınmış
1. ADJ(al)+2SG-POSS+NtoV()+NARR+3SG [2]   (V)    (it) was your red (one)
2. ADJ(al)+GEN+NtoV()+NARR+3SG            (V)    (it) belongs to the red (one)
3. N(alın)+NtoV()+NARR+3SG                (V)    (it) was a forehead
4. V(al)+PASS+VtoAdj(mış)                 (ADJ)  (a) taken (object)
5. V(al)+PASS+NARR+3SG                    (V)    (it) was taken
6. V(alın)+VtoAdj(mış)                    (ADJ)  (an) offended (person)
7. V(alın)+NARR+3SG                       (V)    (s/he) was offended

[2] In Turkish, all adjectives can be used as nouns, hence with very minor differences adjectives have the same morphotactics as nouns.

It is in general rather hard to select one of these interpretations without doing substantial analysis of the local context, and even then one can not fully resolve such (usually semantic) ambiguities.

An additional problem that can be off-loaded to the tagger is the recognition of multi-word or idiomatic constructs. In Turkish, which abounds with such forms, such a recognizer can recognize these very productive multi-word constructs, like

koş-a              koş-a
run+OPT+3SG        run+OPT+3SG

yap-ar             yap-ma-z
do+AOR+3SG         do+NEG+AOR+3SG

where both components are verbal but the compound construct is a manner or temporal adverb. This relieves the parser from dealing with them at the syntactic level. Furthermore, it is also possible to recognize various proper nouns with this functionality. Such help from a tagging functionality would simplify the development of parsers for Turkish (Demir, 1993; Güngördü, 1993).

Researchers have used a number of different approaches for building text taggers. Karlsson (Karlsson, 1990) has used a rule-based approach where the central idea is to maximize the use of morphological information. Local constraints expressed as rules basically discard many alternative parses whenever possible. Brill (Brill, 1992) has designed a rule-based tagger for English. The tagger works by automatically recognizing rules and remedying its weaknesses, thereby incrementally improving its performance. More recently, there has been a rule-based approach implemented with finite-state machines (Koskenniemi et al., 1992; Voutilainen and Tapanainen, 1993).

A completely different approach to tagging uses statistical methods (e.g., Church, 1988; Cutting et al., 1993). These systems essentially train a statistical model using a previously hand-tagged corpus and provide the capability of resolving ambiguity on the basis of the most likely interpretation. The models that have been widely used assume that the part-of-speech of a word depends on the categories of the two preceding words. However, the applicability of such approaches to word-order free languages remains to be seen.

2.1 An example
We can describe the process of tagging by showing the analysis for the sentence:

İşten döner dönmez evimizin yakınında bulunan derin gölde yüzerek gevşemek en büyük zevkimdi.
(Relaxing by swimming in the deep lake near our house, as soon as I return from work, was my greatest pleasure.)

which we assume has been processed by the morphological analyzer with the following output:

işten                                     POS
  1. N(iş)+ABL                            N+
döner
  1. N(döner)                             N
  2. V(dön)+AOR+3SG                       V+
  3. V(dön)+VtoAdj(er)                    ADJ
dönmez
  1. V(dön)+NEG+AOR+3SG                   V+
  2. V(dön)+VtoAdj(mez)                   ADJ
evimizin
  1. N(ev)+1PL-POSS+GEN                   N+
yakınında
  1. ADJ(yakın)+3SG-POSS+LOC              N+
  2. ADJ(yakın)+2SG-POSS+LOC              N
bulunan
  1. V(bul)+PASS+VtoADJ(yan)              ADJ
  2. V(bulun)+VtoADJ(yan)                 ADJ+
derin
  1. N(deri)+2SG-POSS                     N
  2. ADJ(derin)                           ADJ+
  3. V(der)+IMP+2PL                       V
  4. V(de)+VtoADJ(er)+2SG-POSS            N
  5. V(de)+VtoADJ(er)+GEN                 N
gölde
  1. N(göl)+LOC                           N+
yüzerek
  1. V(yüz)+VtoADV(yerek)                 ADV+
gevşemek
  1. V(gevşe)+VtoINF(mak)                 V+
en
  1. N(en)                                N
  2. ADV(en)                              ADV+
büyük
  1. ADJ(büyük)                           ADJ+
zevkimdi
  1. N(zevk)+1SG-POSS+NtoV()+PAST+3SG     V+
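
To make the size of the ambiguity concrete, the analyzer output above can be held as a list of candidate parses per word. The following Python sketch is only an illustration under stated assumptions: the tuple layout and the names ANALYSES, count_tag_sequences and marked_tags are invented here and are not part of the tagger. It counts how many distinct tag assignments the thirteen words admit before any disambiguation, and reads off the sequence of POS tags carrying the "+" mark.

```python
from math import prod

# Candidate parses per word, transcribed from the analyzer output above.
# Each parse is (analysis, final POS, marked), where marked mirrors the "+" mark.
# The tuple layout is an assumption made for this sketch only.
ANALYSES = [
    ("işten", [("N(iş)+ABL", "N", True)]),
    ("döner", [("N(döner)", "N", False),
               ("V(dön)+AOR+3SG", "V", True),
               ("V(dön)+VtoAdj(er)", "ADJ", False)]),
    ("dönmez", [("V(dön)+NEG+AOR+3SG", "V", True),
                ("V(dön)+VtoAdj(mez)", "ADJ", False)]),
    ("evimizin", [("N(ev)+1PL-POSS+GEN", "N", True)]),
    ("yakınında", [("ADJ(yakın)+3SG-POSS+LOC", "N", True),
                   ("ADJ(yakın)+2SG-POSS+LOC", "N", False)]),
    ("bulunan", [("V(bul)+PASS+VtoADJ(yan)", "ADJ", False),
                 ("V(bulun)+VtoADJ(yan)", "ADJ", True)]),
    ("derin", [("N(deri)+2SG-POSS", "N", False),
               ("ADJ(derin)", "ADJ", True),
               ("V(der)+IMP+2PL", "V", False),
               ("V(de)+VtoADJ(er)+2SG-POSS", "N", False),
               ("V(de)+VtoADJ(er)+GEN", "N", False)]),
    ("gölde", [("N(göl)+LOC", "N", True)]),
    ("yüzerek", [("V(yüz)+VtoADV(yerek)", "ADV", True)]),
    ("gevşemek", [("V(gevşe)+VtoINF(mak)", "V", True)]),
    ("en", [("N(en)", "N", False), ("ADV(en)", "ADV", True)]),
    ("büyük", [("ADJ(büyük)", "ADJ", True)]),
    ("zevkimdi", [("N(zevk)+1SG-POSS+NtoV()+PAST+3SG", "V", True)]),
]


def count_tag_sequences(analyses):
    """Number of ways to pick one parse per word (product of parse counts)."""
    return prod(len(parses) for _, parses in analyses)


def marked_tags(analyses):
    """POS tags of the parses carrying the '+' mark, in sentence order."""
    return [pos for _, parses in analyses for _, pos, marked in parses if marked]


if __name__ == "__main__":
    print(count_tag_sequences(ANALYSES))
    print(" ".join(marked_tags(ANALYSES)))
```

For this sentence the per-word parse counts multiply out to 240 candidate assignments, of which the marked parses pick out exactly one.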
Although there are a number of choices for tags for the lexical items in the sentence, almost all except one set of choices give rise to ungrammatical or implausible sentence structures.[3] There are a number of points that are of interest here:

• The construct döner dönmez, formed by two tensed verbs, is actually a temporal adverb meaning ... as soon as .. return(s), hence these two lexical items can be coalesced into a single lexical item and tagged as a temporal adverb.
• The second person singular possessive interpretation of yakınında is not possible since this word forms a simple compound noun phrase with the previous lexical item and the third person singular possessive morpheme functions as the compound marker, agreeing with the agreement of the previous genitive case-marked form.
• The word derin (deep) is the modifier of a simple compound noun derin göl (deep lake), hence the second choice can safely be selected. The verbal root in the third interpretation is very unlikely to be used in text, let alone in second person imperative form. The fourth and the fifth interpretations are not very plausible either. The first interpretation (meaning your skin) may be a possible choice but can be discarded in the middle of a longer compound noun phrase.
• The word en preceding an adjective indicates a superlative construction and hence the noun reading can be discarded.

[3] The correct choices of tags are marked with +.

3 The Tagging Tool
The tagging tool that we have developed integrates the following functionality with a user interface, as shown in Figure 1, implemented under X-windows. It can be used interactively, though user interaction is very rare and (optionally) occurs only when the disambiguation can not be done by the tagger.

1. Morphological analysis with error logging,
2. Multi-word and idiomatic construct recognition,
3. Morphological disambiguation by using constraints, heuristics and certain statistics,
4. Root and lexical form statistics compilation.

Figure 1: User interface of tagging tool

The second and the third functionalities are implemented by a rule-base subsystem which allows one to write rules of the following form:

C1:A1; C2:A2; ... Cn:An.

where each Ci is a set of constraints on a lexical form, and the corresponding Ai is an action to be executed on the set of parses associated with that lexical form, only when all the conditions are satisfied.

The conditions refer to any available morphological or positional feature associated with a lexical form such as:

• absolute or relative lexical position (e.g., sentence initial or final, or 1 after the current word, etc.),
• root and final POS category,
• derivation type,
• case, agreement (number and person), and certain semantic markers, for nominal forms,
• aspect and tense, subcategorization requirements, verbal voice, modality, and sense for verbal forms,
• subcategorization requirements for postpositions.

Conditions may refer to absolute feature values or variables (as in Prolog, denoted by the prefix _ in the following examples) which are then used to link conditions. All occurrences of a variable have to unify for the match to be considered successful. This feature is powerful and lets us specify in a rather general way (possibly long distance) feature constraints in complex NPs, PPs and VPs. This is a part of our approach that distinguishes it from other constraint-based approaches.

The actions are of the following types:

• Null action: Nothing is done on the matching parse.
• Delete: Removes the matching parse if more than one parse for the lexical form is still in the set associated with the lexical form.
• Output: Removes all but the matching parse from the set, effectively tagging the lexical form with the matching parse.
• Compose: Composes a new parse from various matching parses, for multi-word constructs.

These rules are ordered and applied in the given order, and actions licensed by any matching rule are applied. One rule formalism is used to encode both multi-word constructs and constraints.

3.1 The Multi-word Construct Processor
As mentioned before, tagging text on a lexical item basis may generate spurious or incorrect results when multiple lexical items act as a single syntactic or semantic entity. For example, in the sentence Şirin mi şirin bir köpek koşa koşa geldi (A very cute dog came running) the fragment şirin mi şirin constitutes a duplicated emphatic adjective in which there is an embedded[4] question suffix mi (written separately in Turkish), and the fragment koşa koşa is a duplicated verbal construction, which has the grammatical role of manner adverb in the sentence, though both of the constituent forms are verbal constructions. The purpose of the multi-word construct processor is to detect and tag such productive constructs in addition to various other semantically coalesced forms such as proper nouns, etc.

[4] If, however, the adjective şirin was not repeated, then we would have a question formation.

The following is a set of multi-word constructs for Turkish that we handle in our tagger. This list is not meant to be comprehensive, and new construct specifications can easily be added. It is conceivable that such a functionality can be used in almost any language.

1. duplicated optative and 3SG verbal forms functioning as manner adverb, e.g., koşa koşa, aorist verbal forms with root duplications and sense negation functioning as temporal adverbs, e.g., yapar yapmaz, and duplicated verbal and derived adverbial forms with the same verbal root acting as temporal adverbs, e.g., gitti gideli,
2. duplicated compound nominal form constructions that act as adjectives, e.g., güzeller güzeli, and emphatic adjectival forms involving the question suffix, e.g., güzel mi güzel,
3. adjective or noun duplications that act as manner adverbs, e.g., hızlı hızlı, ev ev,
4. idiomatic word sequences with specific usage whose semantics is not compositional, e.g., yanı sıra, hiç olmazsa, and idiomatic forms which are never used singularly, e.g., gürül gürül,
5. proper nouns, e.g., Jimmy Carter, Topkapı Sarayı (Topkapı Palace),
6. compound verb formations which are formed by a lexically adjacent, direct or oblique object and a verb, which for the purposes of syntactic analysis, may be considered as a single lexical item.

We can give the following example for specifying a multi-word construct:[5]

Lex=_W1, Root=_R1, Cat=V, Aspect=AOR, Agr=3SG, Sense=POS: ;
Lex=_W2, Root=_R1, Cat=V, Aspect=AOR, Agr=3SG, Sense=NEG:
    Compose=((*CAT* ADV)(*R* "_W1 _W2 (_R1)")(*SUB* TEMP)).

[5] The output of the morphological analyzer is actually a feature-value list in the standard LISP format.

This rule would match any adjacent verbal lexical forms with the same root, both with the aorist aspect and 3SG agreement. The first verb has to be positive and the second one negated. When found, a composite lexical form with a temporal adverb part-of-speech is then generated. The original verbal root may be recovered from the root of the composed form for any subcategorization checks at the syntactic level.

3.2 Using constraints for morphological ambiguity resolution
Morphological analysis does not have access to syntactic context, so when the morphological structure
of a lexical form has several distinct analyses, it is not possible to disambiguate such cases except maybe by using root usage frequencies. For disambiguation one may have to use information provided by sentential position and the local morphosyntactic context. Voutilainen and Heikkilä (Voutilainen et al., 1992) have proposed a constraint grammar approach where one specifies constraints on the local context of a word to disambiguate among multiple readings of a word. Their approach has, however, been applied to English where morphological information has rather little use in such resolution.

In our tagger, constraints are applied on each word, and check if the forms within a specified neighborhood of the word satisfy certain morphosyntactic or positional restrictions, and/or agreements. Our constraint pattern specification is very similar to multi-word construct specification. The use of variables, operators and actions is the same, except that the Compose action does not make sense here. The following is an example constraint that is used to select the postpositional reading of a certain word when it is preceded by a yet unresolved nominal form with a certain case. The only requirement is that the case of the nominal form agrees with the case subcategorization requirement of the following postposition. (LP = 0 refers to current word, LP = 1 refers to next word.)

LP = 0, Case = _C : Output;
LP = 1, Cat = POSTP, Subcat = _C : Output.

When a match is found, the matching parses from both words are selected and the others are discarded. This one constraint disambiguates almost all of the postpositions and their arguments, the exceptions being nominal words which semantically convey the information provided by the case (such as words indicating direction, which may be used as if they have a dative case).

Finally, the following example constraint deletes the sentence-final adjectival readings derived from verbs, effectively preferring the verbal reading (as Turkish is an SOV language).

Cat = V, Finalcat = ADJ, SP = END : Delete.

4 Performance of the Tagger
We have performed some preliminary experiments to assess the effectiveness of our tagger. We have used about 250 constraints for Turkish. Some of these constraints are very general, as the postposition rule above, while some are geared towards recognition of NPs of various sorts and a small number apply certain syntactic heuristics. In this section, we summarize our preliminary results. Table 1 presents some preliminary results about our tagging experiments.

Although the texts that we have experimented with are rather small, the results indicate that our approach is effective in disambiguating morphological structures, and hence POS, with minimal user intervention. Currently, the speed of the tagger is limited essentially by that of the morphological analyzer, but we have ported the morphological analyzer to the XEROX TWOL system developed by Karttunen and Beesley (Karttunen and Beesley, 1992). This system can analyze Turkish word forms at about 1000 forms/sec on SparcStation 10's. We intend to integrate this into our tagger soon, improving its speed performance considerably.

We have tested the impact of morphological disambiguation on the performance of an LFG parser developed for Turkish (Güngördü, 1993; Güngördü and Oflazer, 1994). The input to the parser was disambiguated using the tool developed and the results were compared to the case when the parser had to consider all possible morphological ambiguities itself. For a set of 80 sentences considered, it can be seen (Table 2) that morphological disambiguation enables almost a factor of two reduction in the average number of parses generated and over a factor of two speed-up in time.

5 Conclusions
This paper has presented an overview of a tool for tagging text along with various issues that have come up in disambiguating morphological parses of Turkish words. We have noted that the use of constraints is very effective in morphological disambiguation. Preliminary results indicate that the tagger can tag about 98-99% of the texts accurately with very minimal user intervention, though it is conceivable that it may do worse on more substantial text; there is certainly room for improvement in the mechanisms provided. The tool also provides for recognition of multi-word constructs that behave as a single syntactic and semantic entity in higher level analysis, and the compilation of information for fine-tuning of the morphological analyzer and the tagger itself. We, however, feel that our approach does not deal satisfactorily with most aspects of word-order freeness. We are currently working on an extension whereby the rules do not apply immediately but vote on their preferences and a final global vote tally determines the assignments.

6 Acknowledgment
This research was supported in part by a NATO Science for Stability Program Grant, TU-LANGUAGE.

References
E. L. Antworth. 1990. PC-KIMMO: A Two-level Processor for Morphological Analysis. Summer Institute of Linguistics, Dallas, Texas.
Table 1: Statistics on texts tagged, and tagging and disambiguation results

              Morphological Parse Distribution
Text  Words   0      1      2      3      4      5
1     468     7.3%   28.7%  41.1%  11.1%  7.1%   4.7%
2     573     1.0%   30.2%  37.3%  13.1%  11.1%  7.3%
3     533     3.8%   24.8%  38.1%  19.1%  9.2%   5.0%
4     7004    3.9%   17.2%  41.5%  15.6%  11.7%  10.1%

Note: Words with zero parses are proper names which are not in the lexicon of the morphological analyzer.

      % Correctly                   % Correctly   Automatic Disambiguation by
      Tagged          % Tagged      Tagged        Multi-word
Text  Automatically   by User       Total         Rules        Constraints
1     98.5            1.0           99.1          10.1         67.7
2     98.5            0.3           98.8          7.5          74.4
3     97.8            1.1           98.9          3.1          74.5
4     95.4            1.7           97.1          4.2          76.4

Table 2: Impact of disambiguation on parsing performance

              No disambiguation       With disambiguation     Ratios
Avg. Length   Avg.      Avg.          Avg.      Avg.
(words)       parses    time (sec)    parses    time (sec)    parses   speed-up
5.7           5.78      29.11         3.30      11.91         1.97     2.38

Note: The ratios are the averages of the sentence by sentence ratios.
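
The note under Table 2 matters when reading the two ratio columns: an average of per-sentence ratios is in general not the same as the ratio of the column averages. The short Python sketch below illustrates this. The four column averages are taken from Table 2; the per-sentence values in toy_before and toy_after are hypothetical, since the paper does not list per-sentence data, and the function names are chosen here for illustration only.

```python
from statistics import mean

# Column averages as printed in Table 2.
AVG_PARSES_BEFORE, AVG_PARSES_AFTER = 5.78, 3.30
AVG_TIME_BEFORE, AVG_TIME_AFTER = 29.11, 11.91


def ratio_of_averages(before_avg, after_avg):
    """Divide one column average by the other."""
    return before_avg / after_avg


def average_of_ratios(before, after):
    """Average the sentence-by-sentence ratios, as the table note describes."""
    return mean(b / a for b, a in zip(before, after))


# HYPOTHETICAL per-sentence parse counts, invented to show the effect.
toy_before = [3, 12]
toy_after = [1, 8]

if __name__ == "__main__":
    # Ratios recomputed from the printed column averages.
    print(round(ratio_of_averages(AVG_PARSES_BEFORE, AVG_PARSES_AFTER), 2))
    print(round(ratio_of_averages(AVG_TIME_BEFORE, AVG_TIME_AFTER), 2))
    # The two summaries disagree on the toy data as well.
    print(average_of_ratios(toy_before, toy_after))
    print(round(ratio_of_averages(mean(toy_before), mean(toy_after)), 2))
```

Dividing the printed averages gives about 1.75 for parses and 2.44 for time, whereas the table reports 1.97 and 2.38. Both are compatible with the data, because the reported figures average the ratios sentence by sentence.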
E. Brill. 1992. A simple rule-based part-of-speech tagger. In Proceedings of the Third Conference on Applied Computational Linguistics, Trento, Italy.
K. W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing (ACL), pages 136-143.
D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. 1993. A practical part-of-speech tagger. Technical report, Xerox Palo Alto Research Center.
C. Demir. 1993. An ATN grammar for Turkish. Master's thesis, Department of Computer Engineering and Information Sciences, Bilkent University, Ankara, Turkey, July.
Z. Güngördü and K. Oflazer. 1994. Parsing Turkish using the Lexical-Functional Grammar formalism. In Proceedings of COLING-94, the 15th International Conference on Computational Linguistics, Kyoto, Japan.
Z. Güngördü. 1993. A Lexical-Functional Grammar for Turkish. Master's thesis, Department of Computer Engineering and Information Sciences, Bilkent University, Ankara, Turkey, July.
F. Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Proceedings of COLING-90, the 13th International Conference on Computational Linguistics, volume 3, pages 168-173, Helsinki, Finland.
L. Karttunen and K. R. Beesley. 1992. Two-level rule compiler. Technical Report, XEROX Palo Alto Research Center.
K. Koskenniemi, P. Tapanainen, and A. Voutilainen. 1992. Compiling and using finite-state syntactic rules. In Proceedings of COLING-92, the 14th International Conference on Computational Linguistics, volume 1, pages 156-162, Nantes, France.
K. Oflazer. 1993. Two-level description of Turkish morphology. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, April. A full version appears in Literary and Linguistic Computing, Vol.9 No.2, 1994.
A. Voutilainen and P. Tapanainen. 1993. Ambiguity resolution in a reductionistic parser. In Proceedings of EACL'93, Utrecht, Holland.
A. Voutilainen, J. Heikkilä, and A. Anttila. 1992. Constraint Grammar of English. University of Helsinki.