Document Centered Approach to Text Normalization

Andrei Mikheev
LTG
University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, UK
E-mail: mikheev@cogsci.ed.ac.uk

Abstract

In this paper we present an approach to tackle three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words when they are used in positions where capitalization is expected, and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources, instead dynamically inferring disambiguation clues from the entire document itself. This makes it domain independent, closely targeted to each individual document and portable to other languages. We thoroughly evaluated this approach on several corpora and it showed high accuracy.

1 Introduction

Text cleaning and normalization is a significant aspect in developing many text processing and Information Retrieval applications. One important activity at the text normalization phase involves the disambiguation of capitalized words. In mixed-case texts capitalized words usually denote proper names - names of organizations, locations, people, artifacts, etc. - but there are special positions in the text where capitalization is expected. Such mandatory (ambiguous) positions include the first word in a sentence, words in all-capitalized titles or table entries, a capitalized word after a colon or an open quote, the first capitalized word in a list-entry, etc. Capitalized words in these and some other positions present a case of ambiguity: they can stand for proper names as in "White later said ...", or they can be just capitalized common words as in "White elephants are ...".
The disambiguation of capitalized words in the ambiguous positions in general leads to the identification of proper names, and in this paper we will use these two terms and the term case normalization interchangeably. Note that this task does not involve the classification of proper names into semantic categories (person, organization, location, etc.), which is the objective of the Named Entity Recognition task.

Disambiguation of capitalized words in mixed-case texts has hardly received much attention in the natural language processing and information retrieval communities, but in fact it plays an important role in many tasks. Church [3], among other simple text normalization techniques, studied the effect of case normalization for different words and showed that "...sometimes case variants refer to the same thing (hurricane and Hurricane), sometimes they refer to different things (continental and Continental) and sometimes they don't refer to much of anything (e.g., anytime and Anytime)." Obviously these differences are due to the fact that some capitalized words stand for proper names (such as Continental, the name of an airline) and some don't. Although there is some evidence that Word Sense Disambiguation does not necessarily help document retrieval when queries are sufficiently long, there are other applications such as name finding, information extraction or even document retrieval with short queries, where the ability to recognize proper names is of crucial importance.

Another important task of text normalization is sentence boundary disambiguation (SBD) or sentence splitting. Segmenting text into sentences is an important aspect in developing many text processing applications - syntactic parsing, Information Extraction, Machine Translation, Text Alignment, Document Summarization, etc. Sentence splitting in most cases is a simple matter: a period, an exclamation mark or a question mark usually signals a sentence boundary. However, there are cases when a period denotes a decimal point or is a part of an abbreviation and thus does not signal a sentence break.

Furthermore, an abbreviation itself can be the last token in a sentence, in which case its period acts both as part of this abbreviation and as the end-of-sentence indicator (full stop).

The disambiguation of capitalized words and sentence boundaries presents a chicken-and-egg problem. If we know that a capitalized word which follows a period is a common word, we can safely assign such a period as sentence terminal. On the other hand, if we know that a period is not sentence terminal, then we can conclude that the following capitalized word is a proper name. Another frequent source of ambiguity in the end-of-sentence marking is introduced by abbreviations: if we know that the word which precedes a period is not an abbreviation, then almost certainly this period denotes a sentence break. However, if this word is an abbreviation, then it is not that easy to make a clear decision. This is also exacerbated by the fact that abbreviations do not form a closed set, i.e., one cannot list all possible abbreviations. It gets even worse: abbreviations can coincide with ordinary words, i.e., "in" can denote an abbreviation for "inches", "no" can denote an abbreviation for "number", "bus" can denote an abbreviation for "business", etc.

In this paper we present a method which tackles sentence boundaries, capitalized words, and abbreviations in a uniform way through the development of a document-centered approach (DCA).

2 Our Approach to SBD

There are two corpora normally used for the evaluation (and development) in a number of text processing tasks: the Brown Corpus and the Wall Street Journal (WSJ) corpus - both are part of the Penn Treebank [8]. Words in these two corpora are annotated with part-of-speech (POS) information and the text is split into documents, paragraphs and sentences. This gives all necessary information for all three of our tasks: the SBD task, the Capitalized Word Disambiguation task and the Abbreviation Identification task. The Brown Corpus represents general English; it is composed from subsections ranging from journalistic and science domains to fiction and speech transcriptions. The WSJ corpus represents journalistic news-wire style.

The estimates from the Brown Corpus and the WSJ show that if we had at our disposal the entirely correct information on whether a word is an abbreviation and whether or not an ambiguously capitalized word is a proper name, then only 5-7% of potential sentence boundaries would present a case of ambiguity. This is when an abbreviation is followed by a proper name. To deal with this situation in general, one should encode information about the words which are seen in such contexts. This is obviously domain dependent and labour-intensive either in terms of writing rules or annotating the corpus for training. However, if we always resolve these ambiguous cases as "not sentence boundary", we will achieve an overall error rate of 0.01% on the Brown Corpus and 0.13% on the WSJ (as presented in the first row of Table 1), which is well above the current state-of-the-art (the third row of Table 1). The main problem with the above reasoning is that when dealing with real-world texts, we have to solve the abbreviation and the proper name tasks ourselves, which is far from trivial. Thus the SBD results which are based on the 100% correct disambiguation of capitalized words and abbreviations can be considered as the upper bound for our approach.

We can also estimate the lower bound for this approach. The simplest strategy for deciding whether a capitalized word in a mandatory position is a proper name or not is to apply the lexical lookup strategy (possibly enhanced with a morphological word guesser, e.g., [9]). There all words which are not listed in the lexicon of common words are usually marked as proper names. This strategy gave us 7.4% error rate on the Brown Corpus and 15% error rate on the WSJ. For the abbreviation handling the simplest strategy is to apply a well-known heuristic that single-word abbreviations are short and normally do not include vowels (Mr., Dr., kg.). Thus a word without vowels can be guessed to be an abbreviation unless it is written in all capital letters, which can be an acronym (e.g. RPC). A span of single letters separated by periods forms an abbreviation too (e.g. Y.M.C.A.). This can be combined with a list of very few (3-5) most frequent known abbreviations. This strategy gave us about 15-16% error rate. The combination of these methods when applied to the SBD task gave us 2.0% error rate on the Brown Corpus and 4.1% on the WSJ. The second row of Table 1 displays a summary of the lower bound results.
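
A minimal sketch of these two baselines (illustrative only; the word lists are placeholders rather than the resources used in the experiments) could be:

    # Sketch of the lower-bound baselines: lexical lookup for capitalized
    # words and the vowel-free heuristic for abbreviations (illustrative only).
    import re

    COMMON_LEXICON = set()                      # placeholder: lexicon of common words
    FREQUENT_ABBRS = {"Mr.", "Dr.", "Prof."}    # a handful of known abbreviations

    def guess_proper_name(word):
        """Lexical lookup: any word not in the common lexicon is taken as a proper name."""
        return word.lower() not in COMMON_LEXICON

    def guess_abbreviation(token):
        """Short vowel-free words (kg., Dr.) and single-letter spans (Y.M.C.A.)."""
        if token in FREQUENT_ABBRS:
            return True
        if re.fullmatch(r"(?:[A-Za-z]\.)+", token):          # Y.M.C.A., U.S.
            return True
        core = token.rstrip(".")
        if token.endswith(".") and 0 < len(core) <= 4 and not core.isupper():
            return re.search(r"[aeiouAEIOU]", core) is None  # no vowels -> abbreviation
        return False
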
Since the upper bound of our SBD algorithm is high and the lower bound is far from being acceptable, our main strategy for the sentence boundary disambiguation task is to invest into capitalized word and abbreviation handling.

3 Document-Centered Approach

As we discussed above, the bad news (well, not really news) is that virtually any common word can potentially act as a proper name or part of a multi-word proper name.

Table 1: Error rate of different approaches on Sentence Boundary Disambiguation (SBD), Capitalized Words Disambiguation and Abbreviation Identification measured on the Brown Corpus and the WSJ.

                     Brown Corpus                  WSJ Corpus
                     SBD     Cap. words  Abbrs.    SBD     Cap. words  Abbrs.
Upper Bound          0.01%   0.00%       0.0%      0.13%   0.00%       0.0%
Lower Bound          2.00%   7.40%       15.0%     4.10%   15.0%       16.0%
Best quoted          0.20%   3.15%       -         0.50%   4.72%       -
DCA (no abbr. lex)   0.76%   2.89%       8.4%      1.91%   4.92%       9.5%
DCA + abbr. lex      0.28%   2.83%       0.8%      0.45%   4.88%       1.2%

For instance, the word "Black" in the sentence-initial position can stand for a person's surname but can also refer to the color. Even in multi-word capitalized phrases the first word can belong to the rest of the phrase or can be just an external modifier. In the sentence "Daily, Mason and Partners lost their court case" it is clear that "Daily, Mason and Partners" is the name of a company. In the sentence "Unfortunately, Mason and Partners lost their court case" the name of the company does not involve the word "unfortunately", but the word "Daily" is just as common a word as "unfortunately". The same applies to abbreviations: there is no closed list of abbreviations and almost any short word can be used as an abbreviation.

Fortunately, there is good news too: important words are typically used in a document more than once and in different contexts. Some of these contexts create very ambiguous situations but some don't. Furthermore, ambiguous words and phrases are usually unambiguously introduced at least once in the text unless they are part of common knowledge presupposed to be known by the readers.

This is an observation which can be applied to a broader class of tasks. For example, people are often referred to by their surnames (e.g., "Black") but are usually introduced at least once in the text either with their first name ("John Black") or with their title/profession affiliation ("Mr. Black", "President Bush"), and it is only when their names are common knowledge that they don't need an introduction (e.g., "Castro", "Gorbachev"). Thus our suggestion is to look at the unambiguous usage of the words in question in the entire document.

In the case of proper name identification we are not concerned with the semantic class of a name, e.g., whether it is a person name or a location. We simply want to distinguish whether this word in this particular occurrence acts as a proper name (or part of a multi-word proper name) or is just a common word which is capitalized because it is used in a mandatory position. If we restrict our scope only to the local context, we might find that there is just not enough information to make a reliable decision. For instance, Riders in the sentence "Riders said later ..." is equally likely to be a proper noun, a plural proper noun or a plural common noun, but if in the same text we find "John Riders" this sharply increases the proper noun interpretation, and conversely, if we find "many riders" this suggests the plural noun interpretation.

The above reasoning can be summarized as follows: if we detect that a word has been used capitalized in an unambiguous context (not in a mandatory position), this increases the chances for this word to act as a proper name in mandatory positions in the same document. And conversely, if a word is seen only lowercased, this increases the chances to downcase it in mandatory positions of the same document. This, of course, is only a general principle and will be further elaborated elsewhere in the paper.

The same logic applies to abbreviations. Although a short word which is followed by a period is a potential abbreviation, the same word when occurring in the same document in a different context can be unambiguously classified as an ordinary word if it is used without a trailing period, or it can be unambiguously classified as an abbreviation if it is used with a trailing period and is followed by a lowercased word or a comma.

We call it a document-centered approach (DCA) since the information for the disambiguation of an individual word is derived from the entire document. This is very different from the widely-spread local context approach, where the information for the disambiguation is derived from the immediate surrounding of an individual word.

4 Getting Abbreviations

The answer to the question whether or not a word token is an abbreviation largely solves the sentence boundary problem. In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and this comes only to 83%.

As we discussed earlier, there are a variety of abbreviation lists, but these lists are very likely not to cover all abbreviations in a text, especially in a specialized technical domain. This means that we cannot simply apply a strategy which says that if a period follows a word which is not listed in our abbreviation list, this period is a full stop because this word is definitely not an abbreviation.

In addition to the lists of known abbreviations and the heuristics which assign as abbreviations short words which consist of consonants only, we applied our document-centered approach, which looks over the entire document for the contexts a potential abbreviation is used in. If a potential abbreviation is used elsewhere in the document without a trailing period, we can conclude that it is in fact an ordinary word. To decide whether a potential abbreviation is in fact an abbreviation, we look for the contexts where it is followed by a period and then by a lowercased word, a number or a comma.

For instance, the word "Kong" followed by a period and then by a capitalized word cannot be safely classified as an ordinary word and therefore it is a potential abbreviation. But if in the same document we detect a context "lived in Hong Kong in 1993", this indicates that "Kong" is normally written without a trailing period and hence is not an abbreviation. Having established that, we can apply these findings to the non-evident contexts and classify "Kong" as an ordinary word throughout the document. However, if we detect a context such as "Kong., said", this indicates that "Kong" is normally written with a trailing period and hence is an abbreviation. This gives us grounds to classify "Kong" as an abbreviation in all its occurrences within the same document.

The DCA strategy relies on the assumption that there is a consistency of writing within the same document. Different authors can write "Mr" or "Dr" with or without a trailing period, but we assume that the same author (the author of a document) will write it in the same way consistently. There can, however, occur a situation when the same potential abbreviation is used as a regular word and as an abbreviation within the same document. This is usually the case when an abbreviation coincides with a regular word, e.g. "Sun." (meaning Sunday) and "Sun" (the name of a newspaper). To tackle this problem, our strategy is to collect not only unigrams of potential abbreviations in unambiguous contexts as explained earlier but also their bigrams with the preceding word. Now the positional guessing strategy can assign ambiguous instances on the basis of the bigrams it collected from the document.

For instance, if in a document the DCA found a context "vitamin C is", it stores the bigram "vitamin C" and the unigram "C" with the information that it is a regular word. If in the same document the system also detects a context "John C. later said", it stores the bigram "John C." and the unigram "C" with the information that it is an abbreviation. Here we have conflicting information for the word "C" - it was detected to act as a regular word and as an abbreviation within the same document - so there is not enough information to resolve ambiguous cases purely using the unigram. However, some cases can still be resolved on the basis of the bigrams, e.g. the system will assign "C" to be an abbreviation in an ambiguous context "John C. Research" and it will assign "C" as a regular word (non-abbreviation) in an ambiguous context "vitamin C. Research".
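
A minimal sketch of this evidence collection (a reconstruction of the idea, not the original implementation; tokenization with attached periods and the exact context tests are simplifying assumptions) might look as follows:

    # Sketch: document-centered evidence for abbreviation disambiguation.
    # A token seen without a trailing period counts as "word" evidence; a token
    # with a trailing period followed by a lowercased word, a number or a comma
    # counts as "abbr" evidence. Bigrams with the preceding word are kept too.
    from collections import defaultdict

    def collect_evidence(tokens):
        uni, bi = defaultdict(set), defaultdict(set)
        for i, tok in enumerate(tokens):
            prev = tokens[i - 1] if i > 0 else ""
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if not tok.endswith("."):                      # e.g. "lived in Hong Kong in 1993"
                uni[tok].add("word")
                bi[(prev, tok)].add("word")
            elif nxt and (nxt[0].islower() or nxt[0].isdigit() or nxt == ","):
                word = tok[:-1]                            # e.g. "John C. later said"
                uni[word].add("abbr")
                bi[(prev, word)].add("abbr")
        return uni, bi

    def classify(prev, word, uni, bi):
        """Resolve an ambiguous 'word.' occurrence; bigram evidence wins over unigram."""
        for labels in (bi.get((prev, word)), uni.get(word)):
            if labels and len(labels) == 1:
                return next(iter(labels))                  # unanimous evidence
        return None                                        # conflicting or no evidence

For the running example, collect_evidence records "word" evidence for the bigram ("vitamin", "C") and "abbr" evidence for ("John", "C"), so classify("John", "C", ...) returns "abbr" even though the unigram evidence for "C" alone is contradictory.
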
We evaluated the performance of the DCA and measured about 8% improvement in the error rate (the fourth row of Table 1) over the bottom line strategy described in section 2. When we applied the DCA in conjunction with a list of 230 most frequent abbreviations estimated from a random corpus of news feeds, the error rate of the abbreviation handling was measured at about 1%, as shown in the last row of Table 1.

5 Getting Capitalized Words

The second key task of our approach is the disambiguation of capitalized words which follow a potential sentence boundary punctuation. Apart from being an important task of text normalization, the information about whether or not a capitalized word which follows a period is a common word allows us to accurately assign as sentence terminal about 30% of the unassigned potential sentence boundaries. In the case when abbreviation handling is done without the help of an abbreviation list, the impact of the capitalized word disambiguation is even higher - about 80% of the unassigned sentence boundaries can be handled. We tackle capitalized words in a similar fashion as we tackled the abbreviations - by the analysis of the distribution of ambiguous words in the entire document. This is implemented as a cascade of simple strategies which was briefly described in [10].

5.1 The Sequence Strategy

Our first strategy for the disambiguation of capitalized words in ambiguous positions is to explore sequences of proper nouns in unambiguous positions. We call it the Sequence Strategy. The rationale behind this is that if we detect a phrase of two or more capitalized words and this phrase starts from an unambiguous position, we can be reasonably confident that even when the same phrase starts from an unreliable position all its words still have to be grouped together and hence are proper nouns.
Moreover, this applies not just to the exact replication of such a phrase but to any partial ordering of its words of size two or more preserving their sequence. For instance, if we detect a phrase Rocket Systems Development Co. in the middle of a sentence, we can mark words in the sub-phrases Rocket Systems, Rocket Systems Co., Rocket Co., Systems Development, etc. as proper nouns even if they occur at the beginning of a sentence or in other ambiguous positions. A span of capitalized words can also include lower-cased words of length three or shorter. This allows us to capture phrases like A & M, The Phantom of the Opera, etc. We generate partial orders from such phrases in a similar way but insist that every generated sub-phrase should start and end with a capitalized word.
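
A small sketch of this sub-phrase generation (illustrative only; handling of the short lowercased connectors mentioned above is omitted):

    # Sketch: order-preserving sub-phrases of length >= 2 from a capitalized
    # phrase seen in an unambiguous position.
    from itertools import combinations

    def subphrases(words):
        subs = set()
        for size in range(2, len(words) + 1):
            for idxs in combinations(range(len(words)), size):
                cand = [words[i] for i in idxs]
                if cand[0][0].isupper() and cand[-1][0].isupper():
                    subs.add(" ".join(cand))   # must start and end with a capitalized word
        return subs

    # subphrases(["Rocket", "Systems", "Development", "Co."]) yields, among others,
    # "Rocket Systems", "Rocket Co.", "Systems Development" and "Rocket Systems Co."
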
To make the Sequence Strategy robust to potential capitalization errors in the document we also use a set of negative evidence. This set is essentially a set of all lower-cased words of the document with their following words (bigrams). We don't attempt here to build longer sequences and their partial orders because we cannot in general restrict the scope of dependencies in such sequences. The negative evidence is then used together with the positive evidence of the Sequence Strategy to block the proper name assignment when a controversy is found. For instance, if in a document the system detects a capitalized phrase "The President" in an unambiguous position, then it will be assigned as a proper name even if found in ambiguous positions in the same document. To be more precise, the method will assign the word "The" as a proper noun since it should be grouped together with the word "President" into a single proper name. However, if in the same document the system detects alternative evidence, e.g. "the President" or "the president", it then blocks such an assignment as unsafe.

The Sequence Strategy is extremely useful when dealing with names of organizations since many of them are multi-word phrases composed from common words. And indeed, the accuracy of this strategy when applied to proper names was about 99% with coverage of about 8-9%. All the wrong assignments came as a result of erroneous capitalizations in the documents, but the high accuracy confirms that this strategy is not over-sensitive to such errors. For tagging common words the Sequence Strategy was 100% accurate, covering 16.7% of ambiguously capitalized common words on the WSJ Corpus and 25.5% on the Brown Corpus.

5.2 Frequent List Lookup Strategy

Our next strategy is to mark words from the frequent starter lists which we compiled completely automatically as explained in [10]. Ambiguously capitalized words found in the list of frequent starting common words are marked as common words and words found in the list of frequent proper names are marked as proper names. Note, however, that the Sequence Strategy is applied prior to the frequent list assignment and thus a word from one of these lists will not always be marked according to its list class. Among such cases resolved by the Sequence Strategy were a multi-word proper name "To B. Super", where both "To" and "Super" were correctly identified as proper nouns, and a multi-word proper name "The Update", where "The" was correctly identified as part of the magazine name. Both "To" and "The" were listed in the frequent starters list as common words and therefore were not likely to be classified as proper nouns, but nevertheless the system handled them correctly.

The Frequent List Lookup Strategy is extremely accurate (99.8-100%). The only few wrong assignments were in cases like "Mr. A", "Mrs. Someone" and words in titles like "I've got a Dog", where "A", "Someone" and "I" were assigned as common words but they were tagged as proper nouns in the Brown Corpus and the WSJ. The Frequent List Lookup Strategy is not very effective for proper names, where it covers only about 6-7%, but it is extremely effective for common words, where it covers about 60-70% of ambiguously capitalized common words.

5.3 Single Word Assignment

The Sequence Strategy is accurate, but it covers only a part of potential proper names in ambiguous positions and at the same time it does not cover cases when capitalized words do not act as proper names. For this purpose we developed another strategy which also uses information from the entire document. We call this strategy Single Word Assignment, and it can be summarized as follows: if we detect a word which in the current document is seen capitalized in an unambiguous position and at the same time is not used lower-cased, this word in this particular document, even when used capitalized in ambiguous positions, is very likely to stand for a proper name as well. And conversely, if we detect a word which in the current document is used only lower-cased in unambiguous positions, it is extremely unlikely that this word will act as a proper name in an ambiguous position and thus such a word can be marked as a common word.

The only consideration here should be made for the high frequency sentence-initial words which do not normally act as proper names, but this is taken care of by the Frequent List Strategy.

The Single Word Assignment was very useful for the proper name identification: it achieved an accuracy of over 99% and covered 50.8% of ambiguously capitalized proper names in the WSJ Corpus and 35.9% in the Brown Corpus. On the common words this method was less accurate (about 97%), covering 3-4% of the cases.
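
Single Word Assignment can be sketched as follows (illustrative only; the evidence counters and the frequent-starter check are assumptions about how the document statistics are stored):

    # Sketch: Single Word Assignment from document-wide counts.
    # cap_unambig - times the word was seen capitalized in unambiguous positions
    # lower_count - times the word was seen lowercased anywhere in the document
    def single_word_assignment(word, cap_unambig, lower_count, frequent_starters):
        if word.lower() in frequent_starters:    # left to the Frequent List strategy
            return None
        if cap_unambig > 0 and lower_count == 0:
            return "proper"                      # e.g. the document only ever has "Riders"
        if cap_unambig == 0 and lower_count > 0:
            return "common"                      # e.g. the document only ever has "riders"
        return None                              # conflicting or no evidence
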
5.4 The "After Abbr." Heuristic

When we studied the distribution of capitalized words after capitalized abbreviations, we uncovered an interesting empirical fact. A capitalized word which follows a capitalized abbreviation is almost certainly a proper name unless it is listed in the list of frequent sentence starting common words, i.e., it is not "The", "However", etc. The error rate of this heuristic is about 0.8% and, not surprisingly, in 99.5% of cases the abbreviation and the following proper name belonged to the same sentence. We will use this fact later on when we deal with sentence boundaries. Naturally, the coverage of this "After Abbr." heuristic depends on the proportion of capitalized abbreviations in the text. In our two corpora this heuristic disambiguated about 20% of ambiguously capitalized proper names.

5.5 The Overall Performance

In general our method achieved about 1% error rate, but it left unclassified about 9% of ambiguously capitalized words in the Brown Corpus and 15% of such words in the WSJ. When we concentrate on the impact of the individual strategies, we see that for the proper name category the most productive is the Single Word Assignment, then the "After Abbr." strategy, and then the Sequence Strategy. For common words the most productive are the Frequent List strategy and the Sequence Strategy.

Now we have to decide what to do with the remaining 10-15% of unassigned ambiguously capitalized words. To keep our system simple and domain independent we opted for the lexical lookup strategy which we evaluated in section 2. This strategy of course is not very accurate, but it is applied only to 10-15% of ambiguously capitalized words. The two last rows of Table 1 display the results of the DCA combined with the lexical lookup strategy: the overall error rate in the disambiguation of capitalized words is about 2.9% on the Brown Corpus and about 4.9% on the WSJ.

6 Assigning Sentence Breaks

Now, after we have identified the abbreviations and, more importantly, most of the non-abbreviations in the text, and after we have classified the words into proper nouns and common words, we can carry out the assignments of the potential sentence breaks. Only when a period is preceded by an abbreviation and is followed by a lowercased word, proper name, comma or a number do we assign it as sentence internal.
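
Stated as a rule (a paraphrase of the above, where is_abbr and is_proper stand for the classifications produced by the earlier stages):

    # Sketch: final assignment of a potential sentence break at a period.
    def is_sentence_boundary(prev_word, next_word, is_abbr, is_proper):
        """True = full stop, False = sentence-internal period."""
        if not is_abbr(prev_word):
            return True                     # a period after a regular word ends the sentence
        if next_word is None:
            return True                     # abbreviation at the very end of the text
        if next_word[0].islower() or next_word[0].isdigit() or next_word == ",":
            return False                    # "etc. and", "No. 5", "Kong., said"
        return not is_proper(next_word)     # "Mr. Black" stays inside the sentence
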
When we combined the DCA with a list of 230 abbreviations, this strategy gave us 0.28% error rate on the Brown Corpus and 0.45% error rate on the WSJ Corpus, as shown in the last row of Table 1. The fourth row of Table 1 displays the results when this strategy was applied to the texts which were handled without consulting the list of abbreviations. There we see that the increase of about 8% in the error rate on the abbreviation handling induced a 0.5% error rate increase on the Brown Corpus and about 1.5% increase on the WSJ for the SBD task.

7 Related Research

There exist two large classes of SBD systems: rule based and machine learning. The rule based systems use manually built rules which are usually encoded in terms of regular expression grammars supplemented with lists of abbreviations, common words, proper names, etc. For instance, the Alembic workbench [1] contains a sentence splitting module which employs over 100 regular-expression rules written in Flex. Machine learning systems treat the SBD task as a classification problem, using features such as word spelling, capitalization, suffix, word class, etc., found in the local context of potential sentence breaking punctuation.

State-of-the-art machine-learning and rule-based SBD systems achieve an error rate of about 0.8-1.5% measured on the Brown Corpus and the WSJ. The best performance on the WSJ was achieved by a combination of the SATZ system [11] with the Alembic system [1]: 0.5% error rate. The best performance on the Brown Corpus, 0.2% error rate, was reported by [12], who trained a decision tree classifier on a 25 million word corpus. In the disambiguation of capitalized words the most wide-spread method is part-of-speech tagging, which achieves about 3% error rate on the Brown Corpus and 5% error rate on the WSJ.

The major difference between these systems and our approach is that in our approach:

• we use global information distributed across the entire document rather than an immediate local context or extensive lists of words and abbreviations;

• our approach does not require human intervention or annotated data;

• our approach is targeted to each document under processing, while all other approaches operate with rules or models smoothed across the entire document collection.

The use of non-local context and dynamic adaptation has been studied in language modeling for speech recognition. Kuhn and de Mori [6] introduced the cache model, which works as a kind of short-term memory by which the probability of the recent n words is increased over the probability of a general purpose bigram or trigram model. Within certain limits, such a model can adapt itself to changes in word frequencies depending on the topic of the text passage. From this point of view, the widely used N-gram model is virtually stationary because its adaptation is limited by the value of N.

Clarkson and Robinson [4] developed a way of incorporating the cache model with standard N-grams using mixtures of language models and also exponentially decaying the weight for the cache prediction depending on the recency of the word's last occurrence. The cache model has recently been extended to a more general topic adaptation model ([4], [13], etc.). To adapt a language model to a topic, the documents in the training corpus are clustered into possibly overlapping topical subsets. A new document is estimated to belong to a topic and the probability for topic dependent words is increased proportionally to the probability that this document belongs to the topic.
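
Schematically, such a decaying cache mixed with a static model can be written as follows (an illustration of the general idea only, not the exact formulation of [4] or [6]; the mixing weight and decay constant are placeholders):

    # Sketch: static n-gram probability mixed with an exponentially decaying cache.
    import math

    def mixed_probability(word, position, static_prob, history, lam=0.9, decay=0.005):
        """history: (word, position) pairs already seen in the document."""
        weights = [math.exp(-decay * (position - p)) for _, p in history]
        hits = [wt for (w, _), wt in zip(history, weights) if w == word]
        cache_prob = sum(hits) / sum(weights) if weights else 0.0
        return lam * static_prob + (1.0 - lam) * cache_prob
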
Mani&MacMillan [7] pointed out that little speed of the resulting system - over 7,000 words per
attention had been paid in the named entity second on Pentium II 400 MHz.
recognition field to the discourse properties of We deliberately shaped our approach so it does
names. They proposed to view proper names as not rely on pre-compiled statistics but rather acts by
linguistic expressions whose interpretation often analogy. This is because the most interesting events
depends on the discourse context, i.e., text-driven are inherently infrequent and, hence, are difficult
processing rather than reliance on pre-existing lists. to collect reliable statistics for, and at the same
Gale, Church and Yarowsky [5] showed that words time pre-compiled statistics would be smoothed
strongly tend to exhibit only one sense in a document across multiple documents rather than targeted to a
or discourse. This is also one of the assumptions of specific document.
the document-centered approach advocated in this The described approach is very easy to implement
proposal. and it does not require training or installation of
The description of the EAGLE workbench for lin- other software. The system can be used as it is and,
guistic engineering [2] mentions a case normalization by implementing the cache memory of multi-word
module which uses a heuristic that a capitalized word proper names, it can be targeted to a specific domain.
in a mandatory position should be downcased if it The system can also be used as a pre-processor to a
is found lowercased in the same document. This part-of-speech tagger or a sentence boundary disam-
also employs a database of bigrams and unigrams

References

[1] J. Aberdeen, J. Burger, D. Day, L. Hirschman, P. Robinson and M. Vilain. MITRE: Description of the Alembic system used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, Maryland, 1995. Morgan Kaufmann.

[2] B. Baldwin, C. Doran, J. Reynar, M. Niv, B. Srinivas and M. Wasson. EAGLE: An extensible architecture for general linguistic engineering. In Proceedings of RIAO '97, Montreal, June 1997.

[3] Kenneth W. Church. One term or two? In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), 1995.

[4] P. Clarkson and A. J. Robinson. Language model adaptation using mixtures and an exponentially decaying cache. In Proceedings of the IEEE International Conference on Speech and Signal Processing, Munich, Germany, 1997.

[5] W. Gale, K. Church and D. Yarowsky. One sense per discourse. In Proceedings of the 4th DARPA Speech and Natural Language Workshop, pages 233-237, 1992.

[6] R. Kuhn and R. de Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 12, pages 570-583, 1998.

[7] I. Mani and T. R. MacMillan. Identifying unknown proper names in newswire text. In B. Boguraev and J. Pustejovsky (editors), Corpus Processing for Lexical Acquisition. MIT Press, 1995.

[8] Mitchell Marcus, Mary Ann Marcinkiewicz and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, Volume 19, Number 2, pages 313-329, 1993.

[9] A. Mikheev. Automatic rule induction for unknown word guessing. Computational Linguistics, Volume 23, Number 3, pages 405-423, 1997.

[10] A. Mikheev. A knowledge-free method for capitalized word disambiguation. In Proceedings of the 37th Conference of the Association for Computational Linguistics (ACL'99), pages 159-168, University of Maryland, 1999.

[11] D. D. Palmer and M. A. Hearst. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 1997.

[12] M. D. Riley. Some applications of tree-based modelling to speech and language indexing. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 339-352. Morgan Kaufmann, 1989.

[13] K. Seymore, S. Chen and R. Rosenfeld. Nonlinear interpolation of topic models for language model adaptation. In Proceedings of ICSLP'98, 1998.

