
TAG DISAMBIGUATION IN ITALIAN

Rodolfo Delmonte, Emanuele Pianta*


* I.R.S.T. - Povo (Trento)
Università "Ca' Foscari"
Ca' Garzoni-Moro, San Marco 3417
30124 - VENEZIA
Tel. 39-41-2578464/52/19
E-mail: [email protected]
Website: http://byron.cgm.unive.it

ABSTRACT
In this paper we argue in favour of syntactically based tagging by presenting data from a study of a 1,000,000 word
corpus of Italian. Most published approaches to tagging are statistically based. None of the statistically based
analyses, however, produces an accuracy level comparable to the one obtained by means of linguistic rules [1]. Of
course, their data refer strictly to English, with the exception of [2, 3, 4]. As to Italian, we argue that purely
statistically based approaches are inefficient, basically due to the great sparsity of the tag distribution: only 50% of
tokens carry unambiguous tags. In addition, the level of homography is very high: there are 1.7 readings per word,
compared to the 1.07 computed for English by [2] with a similar tagset. In a preliminary experiment we obtained
99.97% accuracy on the training set and 99.03% on the test set using syntactic disambiguation: accuracy derived from
statistical tagging is well below 95%, even on the training set.
1. Introduction
The availability of Brill's tagger [5], the work being carried out at Saarbruecken and the research done with the
well-established Xerox tagger have contributed tagging results for languages other than English: in this paper we shall
contribute data and experimental results on Italian, a morphologically rich language which seems to behave in a
similar fashion to Swedish and French, and differently from English.
We assume that tagging cannot and should not be regarded as a self-contained, self-sufficient processing task: we
regard it as just the first important module in a wider and deeper text-processing system. We also assume it
must be in a strict feeding relationship with a (shallow) syntactic parser or chunker, which is then used either for text
understanding and generation/summarization or for other, more complex tasks.
Since tagging cannot be regarded as an end in itself, restrictions on its output should be targeted to the goals tagging
is intended for, i.e. it should obey the following criteria:
Accuracy and Efficiency: it should be over 99% correct, i.e. the error rate should be below 1%; errors are here
intended as out-of-vocabulary tokens, i.e. words unknown to the Guesser which cannot be tagged as either
proper names or foreign words;
Robustness and Reusability: it should be generative, in order to adapt to and cope with different domains, genres and
corpora: in our case this means that the tagger is actually a morphological analyser with linguistic rules, a root
dictionary, a list of the affixes of the language, and constraints on the generation process;
Linguistic Granularity: it should produce lemmata, which is a trivial task for morphologically poor languages
like English or Chinese, but not so trivial for all remaining languages; lemmata are essential in all tasks of
semantic/conceptual information retrieval;
Linguistic Efficiency: in order not to require reprocessing, it should allow for subcategorization information to be
encoded in verbal tags, to serve further processing modules. It should incorporate a minimum of efficient and
necessary semantic information in the tags requiring it, in order to produce sensible tagging disambiguation: e.g.
temporal nouns, common nouns, human-denoting nouns, proper nouns, etc.
In addition, disambiguation should be syntactically targeted and pragmatically constrained on the basis of genre/corpus
type: in Italian, as in French, the word "la" is three-way ambiguous: it is a clitic pronoun, a definite article and a common
noun (the musical note A), but this latter tag is rare or specific to a certain domain, and occurs only with initial
uppercase "L".
Most of the published papers on the subject deal with English, due to the availability of tagged corpora for training and
testing. However, English is not a good representative of European languages in that it should be regarded as a
"morphologically poor" language, whereas the remaining languages are "rich" or even "extremely rich" in morphology.
This entails a number of differences in processing: as can easily be surmised, the first difference is in the number of
different wordforms each lemma can produce; a second important difference lies in the level of homography of each
wordform.
We argue in favour of a syntactically based disambiguation phase after the redundant morphological analysis for two
main reasons. The first is that HMM-based tagging, which is usually adopted as the best statistical framework to work in,
has two important limitations, i.e. sparsity of data and lack of wider-context information, being conventionally based on
trigrams. The second reason is based on the simple intuition that unambiguous tag sequences are strictly syntactically
governed, in the sense that they must obey the grammar of the language. This is confirmed by [1], where the authors say:
"We believe that the best way to boost the accuracy of a tagger is to employ even more linguistic knowledge. The
knowledge should in addition contain more syntactic information so that we could refer to real (syntactic) objects of the
language, not just a sequence of words or parts of speech."
Being language-dependent, the tagger needs to be based on an accurate analysis of corpora with as broad as possible a
coverage of genre, style and other social and communicative variables. To answer these needs we built our syntactic
shallow parser on the basis of 60,000 words of manually annotated text chosen from different corpora and satisfying
the above-mentioned criteria. The annotation was carried out twelve years ago to be used for a text-to-speech system
for Italian (DecTalk Italian version) with unlimited vocabulary [9, 10].
Italian has a number of peculiar linguistic features that make it more difficult to disambiguate than other languages: it
may be defined as structurally underdetermined, in the sense that it allows a lot of freedom in the position of syntactic
constituents. Sentences may be subjectless and start with a VP which may contain the subject NP or its Object
complement. The postverbal position may be occupied by adjuncts which can be freely interspersed between the main verb
and the direct object or other nuclear complements. At constituent level, adjectives may be placed before or after the Head
noun they modify, with only few exceptions. If we compare it with other languages, the level of homography is very high
and, in addition, the number of wordforms per lemma is significantly higher than in English. Written texts tend to have very
long sentences where complements may be very far apart from their governing head, with a series of nested adjuncts in
between.
The paper is organized as follows: we give general information on the 1,000,000-word corpus of Italian we used to train
our tag disambiguator in Section 1.1; we then comment on the use of probability transition tables based on
unambiguous bigrams and give further data on bigram and n-gram distribution in our corpus in Section 2; finally, we
describe our Syntactic Disambiguator (SD) in Section 3 and give some accuracy measurements from a preliminary
experiment we made on Italian in Section 4.
1.1 Morphologically Rich Languages are Different
For morphologically rich languages like Italian, processes like tagging and syntactic analysis must be soundly based
on linguistically generated morphological analysis. We experimented with this approach in the analysis of a corpus of
approximately 1 million words: on a first run of the tagger, it failed for approximately 5% of the total, i.e. at least one
word in 20 constituted what can be labelled an unknown (out-of-vocabulary) word. POS taggers which rely on the
context to induce the appropriate part of speech base their guesses on the surrounding words. However, the
problem is to find misspelled words and tell them apart from foreign words and other classes of words. We analysed the
50,000 unknown words into 5 classes: Misspelled Words = 4,500; New Vocabulary Entries = 6,000; Foreign
Words = 3,000; Proper Nouns = 15,000; Abbreviations & Acronyms = 10,000.
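
A toy sketch of such a five-way split over out-of-vocabulary tokens is given below; the patterns and thresholds are our own illustrative assumptions, not the Guesser actually used in the experiment:

import re

def classify_unknown(token, known_roots=frozenset()):
    """Heuristically assign an out-of-vocabulary token to one of five classes."""
    if re.fullmatch(r"[A-Z]{2,}\.?|[A-Za-z]+\.", token):
        return "abbreviation_or_acronym"
    if token[:1].isupper():
        return "proper_noun"
    if re.search(r"[jkwxy]", token):      # letters rare in native Italian wordforms
        return "foreign_word"
    if any(token.startswith(root) for root in known_roots):
        return "new_vocabulary_entry"     # derivable from a known root
    return "possibly_misspelled"

for t in ["UE", "Rossi", "software", "informatizzazione", "qvello"]:
    print(t, "->", classify_unknown(t, known_roots={"informat"}))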
As to types, the total number of types from the three subcorpora is 58,334, which were then merged into 36,578. From
the total rank file we extracted the first 65 types, whose total frequency of occurrence amounts to 332,238 tokens. An
extended hapax legomena count - types with frequency less than 4 - covers 22,421 types and approximately 33,000
tokens.
1.2 Lemmata and Wordforms
We shall now consider the ratio of lemmata to wordforms as an indicator of the morphological richness of the language: if
it turned out that in our corpora more than half of all wordforms or types uniquely identified their lemma and vice versa,
we might conclude that even though Italian has a potentially rich morphology, it exploits it rather poorly.
From the computation of lemmata we ended up with the following data:
The total number of lemmata is 24,666. Non-rich lemmata constitute a large percentage of this total: lemmata with only
one associated wordform amount to 17,464, i.e. 70% of the overall number of lemmata. Here below is the count for
lemmata with two, three or four associated wordforms:
Wordforms = 2: 4,354 lemmata
Wordforms = 3: 962 lemmata
Wordforms = 4: 879 lemmata
Finally, we end up with 1,007 lemmata with more than four associated types, the great majority of which are verbs [8].
However, when we look down the rank list, starting from lemmata with 8 associated types, the number of past
participles/adjectives increases until they become the majority of lemmata. The rank list has the two
auxiliary/copulative verbs have (avere) and be (essere) at the top, with 50 and 48 associated word forms respectively.
We may note that "avere" has 13 cliticized forms and "essere" 10 such forms.
l(avere, 50, [abbia, abbiamo, abbiano, abbiate, avendo, avendola, avendole, avendolo, avendone, avente, aventi, aver,
averci, avere, avergli, averla, averle, averlo, avermi, averne, aversi, averti, avesse, avessero, avessi, avessimo, aveste,
avete, aveva, avevamo, avevano, avevo, avrà, avranno, avrebbe, avrebbero, avrei, avremmo, avremo, avresti, avrete,
avuta, avute, avuti, avuto, ebbe, ebbero, ha, hai, hanno, ho]). / have
l(essere, 48, [è, era, erano, eravamo, eravate, eri, ero, essendo, essendoci, essendosi, essendovi, esser, essercene,
esserci, essere, esserlo, esserne, essersi, esservi, fosse, fossero, fossi, fossimo, fu, fui, fummo, furono, sarà, saranno,
sarebbe, sarebbero, sarei, saremmo, saremo, sarete, sarò, sei, sia, siamo, siano, siate, siete, sono, stata, state, stati,
stativi, stato]). / be
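
A minimal sketch (assuming a small toy version of the lemma/wordform lists above) of how the wordforms-per-lemma distribution and the overall richness ratio can be computed:

from collections import Counter

# Toy lemma -> wordforms mapping, heavily truncated with respect to the real data.
lemma_forms = {
    "avere":  ["abbia", "abbiamo", "ha", "hanno", "ho"],
    "essere": ["è", "era", "sono", "siamo"],
    "casa":   ["casa", "case"],
    "ieri":   ["ieri"],
}

distribution = Counter(len(forms) for forms in lemma_forms.values())
richness = sum(len(forms) for forms in lemma_forms.values()) / len(lemma_forms)

print(dict(distribution))   # how many lemmata have 1, 2, ... associated wordforms
print(round(richness, 2))   # average number of wordforms per lemma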

3. Statistical vs Syntactic Tagging


In the rest of the paper we present our syntactic disambiguator (henceforth SD), the final module of our syntactic tagger of
Italian. Input to the SD is the complete and redundant output of the morphological analyser and lemmatizer, IMMORTALE
(see Delmonte & Pianta [8]). IMMORTALE finds all possible and legal tags for the word/token under analysis on
the basis of morphological generation from a root dictionary of Italian made up of 60,000 entries and a dictionary of
invariant words - function words, polywords, names and surnames, abbreviations etc. - of over 12,000 entries.
As Brill [5] comments, the application of stochastic techniques to automatic part-of-speech tagging is particularly
appealing given the ease with which the necessary statistics can be automatically acquired and the fact that very little
handcrafted knowledge need be built into the system (ibid., p. 152). However, both probabilistic models and Brill's
algorithm need a large tagged corpus from which to derive most-likely-tag information. It is a well-known fact that,
lacking sufficient training data, sparsity in the probability matrix will cause many bigrams or trigrams to be
insufficiently characterized and prone to generate wrong hypotheses. This in turn will introduce errors in the tagging
prediction procedure. So the training corpus must be really very large in order to adequately cover all possible tag
combinations.
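
The following back-of-the-envelope sketch (our own arithmetic, not taken from the paper's experiments) makes the point concrete: with an 86-tag tagset, even a corpus of one million tokens offers, on average, fewer than two observations per possible tag trigram.

TAGSET_SIZE = 86
CORPUS_TOKENS = 1_000_000

possible_trigrams = TAGSET_SIZE ** 3       # 636,056 possible combinations
observed_instances = CORPUS_TOKENS - 2     # one trigram per token position
average_evidence = observed_instances / possible_trigrams

print(f"{possible_trigrams:,} possible tag trigrams")
print(f"{average_evidence:.2f} training instances per trigram, were they uniformly distributed")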
For Italian, no such large tagged corpus has yet been made available to the scientific community. Lacking such an
important basic resource, there are two possibilities:
building it manually by yourself;
using some incremental automatic learning procedure, applied recursively, in order to reach a 1-million-word tagged
corpus (the same size as the frequently quoted Brown Corpus for English).
We have been working on such a corpus of Italian with the aim of reaching the final goal above without having to
build it manually. The algorithm that we will present in this paper is coupled with linguistic processing by means of a
CF grammar of Italian formalized as a Recursive Transition Network (RTN), which acts as a filter. Statistics are usefully
integrated into the syntactic disambiguator in order to reduce recursivity and allow for better predictions.
Fully stochastic taggers, in case no large tagged corpora are available, make use of HMMs. However, HMMs show
some of the disadvantages present in more common Markov models: they lack perspicuity and, even though they allow
biases in the form of Finite State Automata to be implemented, they are inherently incapable of capturing the higher-level
dependencies present in natural language, and are always prone to generating wrong interpretations, i.e. accuracy
never goes higher than 96-97%. Of course this is a good statistical result, but a poor linguistic result, given the premises,
i.e. the need to use tagging information for further syntactic processing.
Together with Voutilainen & Tapanainen, we hold that POS tagging is essentially a syntactically based
phenomenon and that, by cleverly coupling stochastic and linguistic processing, one should be able to remedy some if
not all of the drawbacks usually associated with the two approaches when used in isolation.
3.1 Tagset and Ambiguity Classes
We studied our training corpus in order to ascertain what level of ambiguity was present and where, given that our
corpus is made up of sub-corpora from different domains and genres. Our tagset is made up of 86 tags, subdivided as follows:
7 for punctuation; 4 for unknown out-of-vocabulary words, abbreviations, titles, dates and numbers; 19 for verbs, including
three syntactic types of subcategorization (transitive, intransitive, copulative) and tensed cliticized verbs; 8 for
auxiliaries, both have and be; 42 for function (closed-class) words; 6 for nouns, including special labels for colour nouns,
time nouns, factive nouns, proper nouns and person names.
Twenty categories from the general tagset never occur on their own, so they had to be converted into distributionally
equivalent ones in the statistical table.
The general criterion we adopted for deciding whether or not to include a tag in our tagset obeys the following principles:
a tag must be unambiguously associated with a wordform or class of wordforms;
a tag must be motivated by unique distributional properties, i.e. it must be in complementary distribution with
other similar tags: for instance, the tag for common nouns [n] has a number of "allotags", [nt]
for temporal nouns, [nf] for factive nouns, [np] for proper nouns (geographical and others) and [nh] for person names;
this subdivision is important when disambiguation is called for, and for syntactic reasons;
do not use new tags when there is no need to: tagsets for English usually include 3rd person verbs and plural
nouns; we do not see any reason to introduce such morphologically based tags, which in the case of Italian or
other such languages would make the tagset explode.
Here below we present a series of tables that illustrate the tag distribution, particularly in relation to its inherent
ambiguity. We regard the ambiguity level as a parameter which is both language-specific and dependent on the linguistic
domain. Roughly speaking, it is one thing to deal with Ambiguity Classes below a certain threshold, say four-way
ambiguity, and quite another to cope with more than double that. The latter is what happens in Italian: the
lexicon of Italian seems to be highly ambiguous, as will be shown below. This does not have to apply to all human
languages, of course. We will also show some data for English produced by our tagger, following
similar tagging conventions, which exhibit a much lower level of ambiguity.
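
A minimal sketch (with a toy lexicon and invented tag names) of how such Ambiguity Classes are tabulated from the redundant output of a morphological analyser, both over types and over tokens:

from collections import Counter

# Toy analyser output: wordform -> set of candidate tags (tag names are invented).
lexicon = {
    "la":    {"clit", "art", "n"},
    "stato": {"n", "ppas"},
    "che":   {"congsub", "rel", "int"},
    "casa":  {"n"},
}
tokens = ["la", "casa", "che", "la", "stato"]

types_by_class  = Counter(len(tags) for tags in lexicon.values())
tokens_by_class = Counter(len(lexicon[t]) for t in tokens)

print(dict(types_by_class))    # number of types per ambiguity cardinality
print(dict(tokens_by_class))   # number of tokens per ambiguity cardinality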
The analysis below refers to approximately half a million tokens, about 450,000 when punctuation is subtracted. This is
half the corpus of Italian we have been working on previously, presented above: a lot of additional work has been done
to encode polywords, abbreviations and proper nouns in order to prevent misanalyses from taking place. As can be
gathered from Table 1b, the level of ambiguity amounts to less than half the tokens, with a 5% increase when punctuation
tokens are subtracted from the total amounts.
TABLE 1a. General Token/Types Data

              Total Tokens  Total Types  %Tot.Types  %Types/Tokens  Total Readings  Read./Token  Read./Tok-Punc
Culture             28,000        5,277       13.13          18.84          48,595         1.73            1.61
Politics            58,230        6,430       16.00          11.04         104,452         1.79            1.71
School-Adm.         45,420        2,706        6.73           5.95          82,019         1.80            1.71
Finance-Bur        328,550       17,468       43.47           5.31         562,455         1.71            1.59
Science             79,893        8,298       20.65          10.38         146,715         1.83            1.74
Totals             540,093       40,179      100.00           7.44         944,236         1.74            1.67
Merge                             21,085
We tabulated the number of Total Tokens in column 1 and the number of Total Types in column 2. We then have the
percentage of types of each subcorpus over the total number of types, where we see that, apart from the Finance corpus,
which has the highest value, the remaining figures show an overwhelming superiority of Culture, which has more than
double the types of the School corpus, the latter being clearly highly repetitive. In the next column we compare total
types with total tokens in each corpus and we get the usual Zipf-law effect, whereby as the number of tokens increases
the increment in the number of types tends to zero: in particular we note the previous fact that the School corpus has
a very low figure, comparable only to the much bigger Finance corpus. Coming now to the three remaining columns, we
tabulated the number of total readings per corpus and the corresponding ratios. If we compare our data with those
available for English and reported by Tapanainen & Voutilainen, the overall proportion between number of tokens and
number of morphological analyses is much higher in Italian: they report 1.04-1.07 readings per token, whereas our data
show an average of 1.74 readings per word, 1.67 when punctuation is subtracted from the total count.
TABLE 1b. General Tagging Data

              Total Tokens  %Total  Punctuation  %Punct./Tok  Tot. Ambiguous  %Tot.Ambig.  %Amb./Tokens  %Amb./Tok-Punc
Culture             28,000    5.18        3,288        11.74          12,402         5.02           44%           50.2%
Politics            58,230   10.78        4,792         8.23          29,802        12.08           51%           55.7%
School-Adm.         45,420    8.40        3,962         8.72          20,715         8.40           45%           50.0%
Finance-Bur        328,550   60.83       38,660        11.77         142,851        58.00           43%           49.3%
Science             79,893   14.80        7,655         9.58          40,877        16.57           51%           56.6%
Totals             540,093  100.00       58,357        10.80         246,647                     45.67%           51.0%

Ambiguous tags are fairly evenly distributed among the subcorpora, as can be seen in Tables 2a and 2b, where we tabulate
ambiguity classes starting from unambiguous tokens, then cardinality 2 (twice-ambiguous) tokens, and ending with nine-way
ambiguity, which figures only once. As can be seen from Tables 1b and 2b, two domains, Politics and Science, have a higher
level of ambiguity than the remaining domains, which behave fairly evenly.
TABLE 2a. AMBIGUITY CLASSES: TYPES

Ambiguity class     1     2     3     4     5     6     7     8     9    Tot
Types              76   144   128    76    47    16    14     4     1    506

TABLE 2b. AMBIGUITY CLASSES: TOKENS

                 2   %Tot       3   %Tot      4      5      6      7      8     9
Culture      7,450   26.6   2,975  10.60    734    777    174     49    107    12
Politics    19,652   33.7   6,602  11.34  1,323  1,456    445     92    126     0
S.Admin.    12,106   26.6   4,795  10.76  1,284  1,723    527    137    102     0
Finance     86,763   26.4  35,421  10.78  8,261  8,138  1,757    805    882     0
Science     25,770   32.2   9,656  12.08  1,919  2,532    434    277    289     0

As can be seen from Table 2b, the number of ambiguous tokens is very high and, even though it decreases dramatically
already at class 4, it might still constitute a serious obstacle to approaches like the ones adopted by advocates of
rule-based or constraint-based disambiguation. Of course, what the table tells us is that the total number of occurrences
of tags belonging to a given Ambiguity Class (AC) is very high; the class might, however, be represented by a very small
number of types. In that case it would still be very convenient to manually encode biases for all types.
Unfortunately this is not the case, as is clearly shown in the following table, Table 3. We discovered that the number
of possible bigrams is very high and varies a lot from one sub-corpus to another. We also discovered that the great
majority of bigrams are extended hapax legomena, i.e. bigrams with a frequency of occurrence equal to or below 3.
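
A minimal sketch (over an invented tag sequence) of how bigram types and extended hapax legomena of this kind can be counted:

from collections import Counter

tag_sequence = ["art", "n", "p", "art", "n", "v", "art", "n", "p", "n"]

bigram_counts = Counter(zip(tag_sequence, tag_sequence[1:]))
extended_hapax = {bg: c for bg, c in bigram_counts.items() if c <= 3}

print(len(bigram_counts), "bigram types")
print(len(extended_hapax), "extended hapax legomena (frequency <= 3)")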
TABLE 3. BIGRAMS - TYPES

             Tot.Types   %Tot   Hapax 1    %Tot   Hapax 2   Hapax 3
Culture          5,867   12.7     3,206   15.47       967       413
Politics         7,291   15.8     3,487   16.83     1,180       558
S.Admin.         4,956   10.7     2,006    9.68       770       446
Finance         17,885   38.7     7,192   34.71     2,755     1,449
Science         10,164   22.0     4,826   23.29       785       785
Total           46,163  100.0    20,717  100.00     6,457     3,651
Merge                            18,407

TABLE 4. N-GRAMS DISTRIBUTION

             Tokens     Types      3      4       5      6      7      8     9    10    11   12
Culture       9,889     7,659  1,893  2,824   1,501    753    345    180    80    37    26   11
Politics     22,061    12,657  1,677  4,179   3,240  1,811    833    446   235   121    64   18
S.Adm.       17,105     9,488  1,943  3,354   2,042  1,111    543    283   118    41    24    9
Finance     119,820    40,645  6,000  6,548  13,546  7,484  3,728  1,756   822   418   172   76
Science      29,500    18,802  2,680  6,077   4,376  2,452  1,332    693   342   155    89   52

REFERENCES
[1] Tapanainen P. and Voutilainen A. (1994), Tagging accurately - don't guess if you know, in Proc. ANLP '94, Stuttgart,
Germany, pp. 47-52.
[2] Brants T. and Samuelsson C. (1995), Tagging the Teleman Corpus, in Proc. 10th Nordic Conference of Computational
Linguistics, Helsinki, pp. 1-12.
[3] Lecomte J. (1998), Le Categoriseur Brill14-JL5 / WinBrill-0.3, INaLF/CNRS.
[4] Chanod J.P. and Tapanainen P. (1995), Tagging French - comparing a statistical and a constraint-based method, in Proc.
EACL '95, pp. 149-156.
[5] Brill E. (1992), A Simple Rule-Based Part of Speech Tagger, in Proc. 3rd Conf. ANLP, Trento, pp. 152-155.
[6] Cutting D., Kupiec J., Pedersen J. and Sibun P. (1992), A practical part-of-speech tagger, in Proc. 3rd Conf. ANLP,
Trento.
[7] Voutilainen A. and Tapanainen P. (1993), Ambiguity resolution in a reductionistic parser, in Proc. 6th Conference of
the European Chapter of the ACL, Utrecht, pp. 394-403.
[8] Delmonte R. and Pianta E. (1996), "IMMORTALE - Analizzatore Morfologico, Tagger e Lemmatizzatore per l'Italiano",
in Atti V Convegno AI*IA, Napoli, pp. 19-22.
[9] Delmonte R., Mian G.A. and Tisato G. (1986), A Grammatical Component for a Text-to-Speech System, in Proc.
ICASSP '86, IEEE, Tokyo, pp. 2407-2410.
[10] Delmonte R. and Dolci R. (1989), Parsing Italian with a Context-Free Recognizer, Annali di Ca' Foscari XXVIII, 1-2,
pp. 123-161.
