Superior and Efficient Fully Unsupervised Pattern-based Concept
Acquisition Using an Unsupervised Parser
Dmitry Davidov1  Roi Reichart1  Ari Rappoport2
1 ICNC, 2 Institute of Computer Science
Hebrew University of Jerusalem
{[email protected]|roiri@cs|arir@cs}.huji.ac.il
Abstract

Sets of lexical items sharing a significant aspect of their meaning (concepts) are fundamental for linguistics and NLP. Unsupervised concept acquisition algorithms have been shown to produce good results, and are preferable over manual preparation of concept resources, which is labor intensive, error prone and somewhat arbitrary. Some existing concept mining methods utilize supervised language-specific modules such as POS taggers and computationally intensive parsers.

In this paper we present an efficient fully unsupervised concept acquisition algorithm that uses syntactic information obtained from a fully unsupervised parser. Our algorithm incorporates the bracketings induced by the parser into the meta-patterns used by a symmetric patterns and graph-based concept discovery algorithm. We evaluate our algorithm on very large corpora in English and Russian, using both human judgments and WordNet-based evaluation. Using similar settings as the leading fully unsupervised previous work, we show a significant improvement in concept quality and in the extraction of multiword expressions. Our method is the first to use fully unsupervised parsing for unsupervised concept discovery, and requires no language-specific tools or pattern/word seeds.

1 Introduction

Comprehensive lexical resources for many domains and languages are essential for most NLP applications. One of the most utilized types of such resources is a repository of concepts: sets of lexical items sharing a significant aspect of their meanings (e.g., types of food, tool names, etc).

While handcrafted concept databases (e.g., WordNet) are extensively used in NLP, manual compilation of such databases is labor intensive, error prone, and somewhat arbitrary. Hence, for many languages and domains great efforts have been made for automated construction of such databases from available corpora. While language-specific and domain-specific studies show significant success in development of concept discovery frameworks, the majority of domains and languages remain untreated. Hence there is a need for a framework that performs well for many diverse settings and is as unsupervised and language-independent as possible.

Numerous methods have been proposed for seed-based concept extraction, where a set of concept patterns (or rules), or a small set of seed words for each concept, is provided as input to the concept acquisition system. However, even simple definitions for concepts are not always available.

To avoid requiring this type of input, a number of distributional and pattern-based methods have been proposed for fully unsupervised seed-less acquisition of concepts from text. Pattern-based algorithms were shown to obtain high quality results while being highly efficient in comparison to distributional methods. Such fully unsupervised methods do not incorporate any language-specific parsers or taggers, so they can be successfully applied to diverse languages.
However, unsupervised pattern-based methods suffer from several weaknesses. They are frequently restricted to single-word terms and are unable to discover multiword expressions in an efficient and precise manner. They also usually ignore potentially useful part-of-speech and other syntactic information. In order to address these weaknesses, several studies utilize language-specific parsing or tagging systems in concept acquisition. Unfortunately, while improving results, this heavily affects the language- and domain-independence of such frameworks, and severely impacts efficiency, since even shallow parsing is computationally demanding.

In this paper we present a method to utilize the information induced by unsupervised parsers in an unsupervised pattern-based concept discovery framework. With the recent development of fast fully unsupervised parsers, it is now possible to add parser-based information to lexical patterns while keeping the language-independence of the whole framework and still avoiding heavy computational costs. Specifically, we incorporate the bracketings induced by the parser into the meta-patterns used by a symmetric patterns and graph-based unsupervised concept discovery algorithm.

We performed a thorough evaluation on two English corpora (the BNC and a 68GB web corpus) and on a 33GB Russian corpus. Evaluations were done using both human judgments and WordNet, in similar settings as those of the leading unsupervised previous work. Our results show that utilization of an unsupervised parser both improves the assignment of single-word terms to concepts and allows high-precision discovery and assignment of multiword expressions to concepts.

2 Previous Work

Much work has been done on lexical acquisition of all sorts and the acquisition of concepts in particular. Concept acquisition methods differ in the type of corpus annotation and other human input used, and in their basic algorithmic approach. Some methods directly aim at concept acquisition, while the direct goal in others is the construction of hyponym ('is-a') hierarchies. A subtree in such a hierarchy can be viewed as defining a concept.

A major algorithmic approach is to represent word contexts as vectors in some space and use distributional measures and clustering in that space. Pereira (1993), Curran (2002) and Lin (1998) use syntactic features in the vector definition. (Pantel and Lin, 2002) improves on the latter by clustering by committee. Caraballo (1999) uses conjunction and appositive annotations in the vector representation. Several studies avoid requiring any syntactic annotation. Some methods are based on decomposition of a lexically-defined matrix (by SVD, PCA etc), e.g. (Schütze, 1998; Deerwester et al., 1990). While great effort has been made for improving the computational complexity of distributional methods (Gorman and Curran, 2006), they still remain highly computationally intensive in comparison to pattern approaches (see below), and most of them do not scale well for very large datasets.

The second main approach is to use lexico-syntactic patterns. Patterns have been shown to produce more accurate results than feature vectors, at a lower computational cost on large corpora (Pantel et al., 2004). Since (Hearst, 1992), who used a manually prepared set of initial lexical patterns, numerous pattern-based methods have been proposed for the discovery of concepts from seeds. Other studies develop concept acquisition for on-demand tasks where concepts are defined by user-provided seeds. Many of these studies utilize information obtained by language-specific parsing and named entity recognition tools (Dorow et al., 2005). Pantel et al. (2004) reduce the depth of linguistic data used, but their method requires POS tagging.

TextRunner (Banko et al., 2007) utilizes a set of pattern-based seed-less strategies in order to extract relational tuples from text. However, this system contains many language-specific modules, including the utilization of a parser in one of the processing stages. Thus the majority of the existing pattern-based concept acquisition systems rely on pattern/word seeds or supervised language-specific tools, some of which are very inefficient.

Davidov and Rappoport (2006) developed a framework which discovers concepts based on high frequency words and symmetry-based pattern graph properties. This framework allows a fully unsupervised seed-less discovery of concepts without relying on language-specific tools. However, it completely ignores potentially useful syntactic or morphological information.
For example, the pattern 'X and his Y' is useful for acquiring the concept of family member types, as in "his siblings and his parents". Without syntactic information, it can capture noise, as in "... in ireland) and his wife)" (parentheses denote syntactic constituent boundaries). As another example, the useful symmetric pattern "either X or Y" can appear in both good examples ("choose either Chihuahua or Collie.") and bad ones ("either Collie or Australian Bulldog"). In the latter case, the algorithm both captures noise ("Australian" is now considered as a candidate for the 'dog type' concept) and misses the discovery of a valid multiword candidate ("Australian Bulldog"). While symmetry-based filtering greatly reduces such noise, the basic problem remains. As a result, incorporating at least some parsing information in a language-independent and efficient manner could be beneficial.

Unsupervised parsing has been explored for several decades (see (Clark, 2001; Klein, 2005) for recent reviews). Recently, unsupervised parsers have for the first time outperformed the right branching heuristic baseline for English. These include CCM (Klein and Manning, 2002), the DMV and DMV+CCM models (Klein and Manning, 2004), (U)DOP based models (Bod, 2006a; Bod, 2006b; Bod, 2007), an exemplar based approach (Dennis, 2005), guiding EM using contrastive estimation (Smith and Eisner, 2006), and the incremental parser of Seginer (2007), which we use here. These works learn an unlabeled syntactic structure, dependency or constituency. In this work we use constituency trees as our syntactic representation.

Another important factor in concept acquisition is the source of textual data used. To take advantage of the rapidly expanding web, many of the proposed frameworks utilize web queries rather than local corpora (Etzioni et al., 2005; Davidov et al., 2007; Pasca and Van Durme, 2008; Davidov and Rappoport, 2009). While these methods have a definite practical advantage of dealing with the most recent and comprehensive data, web-based evaluation has some methodological drawbacks such as limited repeatability (Kilgarriff, 2007). In this study we apply our framework to offline corpora, in settings similar to those of previous work, in order to be able to make proper comparisons.

3 Efficient Unsupervised Parsing

Our method utilizes the information induced by unsupervised parsers. Specifically, we make use of the bracketings induced by Seginer's parser (Seginer, 2007), which is freely available at http://staff.science.uva.nl/~yseginer/ccl. This parser has advantages in three major aspects relevant to this paper.

First, it achieves state of the art unsupervised parsing performance: its F-score is 75.9% for sentences of up to 10 words from the Penn Treebank Wall Street Journal corpus (WSJ) (Marcus, 1993), and 59% for sentences of the same length from the German NEGRA corpus (Brants, 1997). These corpora consist of newspaper texts. (The F-score is F = 2·R·P / (R+P), where R and P are the recall and precision of the parser's bracketing compared to a manually created bracketing of the same text; this is the accepted measure for parsing performance, see (Klein, 2005).)

Second, to obtain good results, manually created POS tags are used as input in all the unsupervised parsers mentioned above except for Seginer's, which uses raw sentences as input. (Headden et al., 2008) have shown that the performance of algorithms that require POS tags substantially decreases when using POS tags induced by unsupervised POS taggers instead of manually created ones. Seginer's incremental parser is therefore the only fully unsupervised parser providing high quality parses.

Third, Seginer's parser is extremely fast. During its initial stage, the parser builds a lexicon. Our Pentium 2.8GHz machines with 4GB RAM can store in memory the lexicon created by up to 0.2M sentences. We thus divided our corpora into batches of 0.2M sentences and parsed each of them separately. Note that in this setup parsing quality might be even better than the quality reported in (Seginer, 2007), since in the setup reported in that paper the parser was applied to a few thousand sentences only. On average, the parsing time of a single batch was 5 minutes (run time did not significantly differ across batches and corpora).
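For illustration only, the following minimal Python sketch shows this batching scheme under our own simplifying assumptions; parse_batch is a hypothetical stand-in for an external call to Seginer's parser and is not part of its published interface:

BATCH_SIZE = 200_000   # 0.2M sentences per batch, so that the parser's lexicon fits in RAM

def read_sentences(path):
    """Yield one raw sentence per line (non-alphabetic characters already removed)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

def parse_corpus(path, parse_batch):
    """Split the corpus into fixed-size batches and parse each one independently."""
    bracketed, batch = [], []
    for sentence in read_sentences(path):
        batch.append(sentence)
        if len(batch) == BATCH_SIZE:
            bracketed.extend(parse_batch(batch))   # a fresh parser lexicon per batch
            batch = []
    if batch:
        bracketed.extend(parse_batch(batch))
    return bracketed

Since each batch is parsed with its own lexicon, batches can also be processed in parallel, which is how the cluster timing figures in Section 6 are obtained.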
Parser description. The parser utilizes the novel common-cover link representation for syntactic structure. This representation resembles dependency structure, but unlike the latter, it can be translated into a constituency tree, which is the syntactic representation we use in this work.

The parsing algorithm creates the common-cover link structure of a sentence in an incremental manner. This means that the parser reads the words of a sentence one after the other and, as each word is read, it is only allowed to add links that have one of their ends at that word (and to update existing ones). Words which have not yet been read are not available to the parser at this stage. This restriction is inspired by psycholinguistics research which suggests that humans process language incrementally. It results in a significant restriction of the parser's search space, which is the reason it is so fast.

During its initial stage the parser builds a lexicon containing, for each word, statistics helping the decision of whether to link that word to other words. The lexicon is updated as each new sentence is read. Lexicon updating is also done in an incremental manner, so this stage is also very fast.

4 Unsupervised Pattern Discovery

In the first stage of our algorithm, we run the unsupervised parser on the corpus in order to produce a bracketing structure for each sentence. In the second stage, described here, we use these bracketings in order to discover, in a fully unsupervised manner, patterns that could be useful for concept mining.

Our algorithm is based on the concept acquisition method of (Davidov and Rappoport, 2006). We discover patterns that connect terms belonging to the same concept in two main stages: discovery of pattern candidates, and identification of the symmetric patterns among the candidates.

Pattern candidates. A major idea of (Davidov and Rappoport, 2006) is that a few dozen high frequency words (HFWs) such as 'and' and 'is' connect other, less frequent content terms into relationships. They define meta-patterns, which are short sequences of H's and C's, where H is a slot for a HFW and C is a slot for a content word (later to become a word belonging to a discovered concept). Their method was shown to produce good results. However, the fact that it does not consider any syntactic information causes problems. Specifically, it does not consider the constituent structure of the sentence. Meta-patterns that cross constituent boundaries are likely to generate noise – two content words (C's) in a meta-pattern that belong to different constituents are likely to belong to different concepts as well. In addition, meta-patterns that do not occupy a full constituent are likely to 'cut' multiword expressions (MWEs) into two parts, one part that gets treated as a valid C word and one part that is completely ignored.

The main idea in the present paper is to use the bracketings induced by unsupervised parsers in order to avoid the problems above. We utilize bracketing boundaries in our meta-patterns in addition to HFW and C slots. In other words, their original meta-patterns are totally lexical, while ours are lexico-syntactic meta-patterns. We preserve the attractive properties of meta-patterns, because both HFWs and bracketings can be found or computed in a language independent manner and very efficiently.

Concretely, we define a HFW as a word appearing more than TH times per million words, and a C as a word or multiword expression containing up to 4 words, appearing less than TC times per million. We require that our patterns include two slots for C's, separated by at least a single HFW or bracket. We allow separation by a single bracket because the lowest level in the induced bracketing structure usually corresponds to lexical items, while higher levels correspond to actual syntactic constituents.

In order to avoid truncation of multiword expressions, we also require the meta-pattern to start and end with a HFW or bracket. Thus our meta-patterns match the following regular expression:

{H|B}* C1 {H|B}+ C2 {H|B}*

where '*' means zero or more times, '+' means one or more times, and B can be a '(' or ')' bracket produced by the parser (in these patterns we do not need to guarantee that brackets match properly). Examples of such patterns include "((C1) in C2))", "(C1)(such(as(((C2)", and "(C1) and (C2)". (This paper does not use any punctuation, since the parser is provided with sentences having all non-alphabetic characters removed; we assume word separation. C1 and C2 can be a word or a multiword expression.) We dismiss rare patterns that appear less than TP times per million words.
Symmetric patterns. Many of the pattern candidates discovered in the previous stage are not usable. In order to find a usable subset, we focus on the symmetric patterns. We define a symmetric pattern as a pattern in which the same pair of terms (C words) is likely to appear in both left-to-right and right-to-left orders. In order to identify symmetric patterns, for each pattern we define a pattern graph G(P), as proposed by (Widdows and Dorow, 2002). If term pair (C1, C2) appears in pattern P in some context, we add nodes c1, c2 to the graph and a directed edge EP(c1, c2) between them. In order to select symmetric patterns, we create such a pattern graph for every discovered pattern, and create a symmetric subgraph SymG(P) in which we take only the bidirectional edges from G(P). Then we compute three measures for each pattern candidate, as proposed by (Davidov and Rappoport, 2006):

M1(P) := |{c1 | ∃c2 EP(c1, c2) ∧ ∃c3 EP(c3, c1)}| / |Nodes(G(P))|

M2(P) := |Nodes(SymG(P))| / |Nodes(G(P))|

M3(P) := |Edges(SymG(P))| / |Edges(G(P))|

For each measure, we prepare a sorted list of all candidate patterns. We remove patterns that are not in the top ZT (we use 100, see Section 6) in any of the three lists, and patterns that are in the bottom ZB in at least one of the lists.
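The following minimal Python sketch shows how these three measures could be computed for a single pattern from its extracted (C1, C2) instances; the set-based representation is our own illustrative choice, not the original implementation.

def symmetry_measures(instances):
    """Compute M1, M2, M3 for one pattern P from its (C1, C2) term-pair instances."""
    edges = set(instances)                                       # directed edges of G(P)
    nodes = {c for edge in edges for c in edge}
    sym_edges = {(a, b) for (a, b) in edges if (b, a) in edges}  # edges of SymG(P)
    sym_nodes = {c for edge in sym_edges for c in edge}
    has_out = {a for (a, _) in edges}
    has_in = {b for (_, b) in edges}
    m1 = len(has_out & has_in) / len(nodes)    # nodes with both an out- and an in-edge
    m2 = len(sym_nodes) / len(nodes)
    m3 = len(sym_edges) / len(edges)
    return m1, m2, m3

Patterns would then be ranked once per measure, keeping only those that appear in the top ZT of at least one list and in the bottom ZB of none.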
5 Concept Discovery

At the end of the previous stage we have a set of symmetric patterns. We now use them in order to discover concepts. The concept discovery algorithm is essentially the same as that used by (Davidov and Rappoport, 2006) and has some similarity with the one used by (Widdows and Dorow, 2002). In this section we outline the algorithm.

The clique-set method. The utilized approach to concept discovery is based on connectivity structures in the all-pattern term relationship graph G, resulting from merging all of the single-pattern graphs for the symmetric patterns selected in the previous stage. The main observation regarding G is that highly interconnected words are good candidates to form a concept. We find all strong n-cliques (subgraphs containing n nodes that are all interconnected in both directions). A clique Q defines a concept that contains all of the nodes in Q plus all of the nodes that are (1) at least unidirectionally connected to all nodes in Q, and (2) bidirectionally connected to at least one node in Q. Using this definition, we create a concept for each such clique.

Note that a single term can be assigned to several concepts. Thus a clique based on a connection of the word 'Sun' to 'Microsoft' can lead to a concept of computer companies, while the connection of 'Sun' to 'Earth' can lead to a concept of celestial bodies.

Reducing noise: merging and windowing. Since any given term can participate in many cliques, the algorithm creates overlapping categories, some of which are redundant. In addition, due to the nature of language and the imperfection of the corpus, some noise is obviously to be expected. We enhance the quality of the obtained concepts by merging them and by windowing on the corpus. We merge two concepts Q, R iff there is more than a 50% overlap between them: (|Q ∩ R| > |Q|/2) ∧ (|Q ∩ R| > |R|/2). In order to increase concept quality and remove concepts that are too context-specific, we use a simple corpus windowing technique. Instead of running the algorithm of this section on the whole corpus, we divide the corpus into windows of equal size and perform the concept discovery algorithm of this section (without pattern discovery) on each window independently. We now have a set of concepts for each window. For the final set, we select only those concepts that appear in at least two of the windows. This technique reduces noise at the potential cost of lowering coverage.

A decrease in the number of windows should produce more noisy results, while discovering more concepts and terms. In the next section we show that while windowing is clearly required for a large corpus, incorporation of parser data increases the quality of the extracted concepts to the point where windowing can be significantly reduced.
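To illustrate the clique-based concept definition and the merging criterion above, here is a minimal Python sketch under our own simplifying assumptions (an explicit set of directed edges and brute-force membership checks); the function names are ours and this is not the original implementation.

def expand_clique(clique, nodes, edges):
    """Turn a strong clique Q into a concept: Q plus every node connected (in at least
    one direction) to all of Q and bidirectionally connected to at least one node of Q."""
    def connected(a, b):
        return (a, b) in edges or (b, a) in edges
    def bidirectional(a, b):
        return (a, b) in edges and (b, a) in edges
    concept = set(clique)
    for n in set(nodes) - concept:
        if all(connected(n, q) for q in clique) and any(bidirectional(n, q) for q in clique):
            concept.add(n)
    return concept

def merge_if_overlapping(q, r):
    """Merge two concepts iff their intersection exceeds half the size of each."""
    overlap = len(q & r)
    return q | r if (overlap > len(q) / 2 and overlap > len(r) / 2) else None

Windowing would then keep only those merged concepts that are produced in at least two corpus windows.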
6 Results

In order to estimate the quality of concepts and to compare it to previous work, we have performed both automatic and human evaluation. Our basic comparison was to (Davidov and Rappoport, 2006) (we have obtained their data and utilized their algorithm), where we can estimate whether incorporation of parser data can solve some fundamental weaknesses of their framework. In the following description, we call their algorithm P and our parser-based framework P+. We have also performed an indirect comparison to (Widdows and Dorow, 2002).

While there is a significant number of other related studies on concept acquisition (see Section 2), most of which are supervised and/or use language-specific tools, direct or even indirect comparison to these works is problematic due to differences in corpora, problem definitions and evaluation strategies. Below we describe the corpora and parameters used in our evaluation, and then show and discuss the WordNet-based and human evaluation settings and results.

Corpora. We performed in-depth evaluation in two languages, English and Russian, using three corpora, two for English and one for Russian. The first English corpus is the BNC, containing about 100M words. The second English corpus, DMOZ (Gabrilovich and Markovitch, 2005), is a web corpus obtained by crawling URLs in the Open Directory Project (dmoz.org), resulting in 68GB containing about 8.2G words from 50M web pages. The Russian corpus (Davidov and Rappoport, 2006) was assembled from web-based Russian repositories, to yield 33GB and 4G words. All of these corpora were also used by (Davidov and Rappoport, 2006), and the BNC was used in similar settings by (Widdows and Dorow, 2002).

Algorithm parameters. The thresholds TH, TC, TP, ZT, ZB were determined mostly by practical memory size considerations: we computed thresholds that would give us the maximal number of terms, while enabling the pattern access table to reside in main memory. The resulting numbers are 100, 50, 20, 100, 100. Corpus window size was determined by starting from a small window size, extracting a single window at random, running the algorithm, and iterating this process with window sizes increased ×2 until reaching a desired vocabulary concept participation percentage before windowing (i.e., x% of the different words in the corpus participate in terms assigned to concepts; we used 5%). We also ran the algorithm without windowing in order to check how well the provided parsing information can help reduce noise. Among the patterns discovered are the ubiquitous ones containing "and" and "or", e.g. '((X) or (a Y))', and additional ones such as 'from (X) to (Y)'.

Influence of parsing data on number of discovered concepts. Table 1 compares the concept acquisition framework with (P+) and without (P) utilization of parsing data.

          V       W             C             AS
                 P     P+      P     P+      P      P+
DMOZ      16    330   504     142   130     12.8   16.0
BNC      0.3     25    42     9.6   8.9     10.2   15.6
Russ.     10    235   406     115    96     11.6   15.1

Table 1: Results for concept discovery with (P+) and without (P) utilization of parsing data. V is the total number (millions) of different words in the corpus. W is the number (thousands) of words belonging to at least one of the terms of one of the concepts. C is the number (thousands) of concepts (after merging and windowing). AS is the average category size (in words).

We can see that the number of different words covered by discovered concepts rises nearly 1.5-fold when we utilize patterns based on parsing data, in comparison to the pure HFW patterns used in previous work. We can also see nearly the same increase in average concept size. At the same time we observe about a 15% reduction in the total number of discovered concepts.

There are two opposite factors in P+ which may influence the number of concepts, their size and coverage in comparison to P. On one hand, utilization of more restricted patterns that include parsing information leads to a reduced number of concept term instances being discovered. Thus, the P+ pattern "(X (or (a Y))" will recognize the instance "(TV (or (a movie))" and will miss "(lunch) or (a snack))", while the P pattern "X or a Y" will capture both. This leads to a decrease in the number of discovered concepts. On the other hand, P+ patterns, unlike P ones, allow the extraction of multiword expressions (while the P method can potentially be used to extract MWEs, preliminary experimentation shows that without significant modification the quality of MWEs obtained by P is very low in comparison to P+), and indeed more than a third of the terms discovered using P+ were MWEs. Utilization of MWEs not only allows covering a greater number of different words, but also increases the number of discovered concepts, since new concepts can be found using cliques of newly discovered MWEs. From the results, we can see that for a given concept size and word coverage, the ability to discover MWEs overcomes the disadvantage of ignoring potentially useful concepts.

Human judgment evaluation. Our human judgment evaluation closely followed the protocol of (Davidov and Rappoport, 2006).

We used 4 subjects for evaluation of the English concepts and 4 subjects for the Russian ones. In order to assess subjects' reliability, we also included random concepts (see below). The goal of the experiment was to examine the differences between the P+ and P concept acquisition frameworks. Subjects were given 50 triplets of words and were asked to rank them using the following scale: (1) the words definitely share a significant part of their meaning; (2) the words have a shared meaning but only in some context; (3) the words have a shared meaning only under a very unusual context/situation; (4) the words do not share any meaning; (5) I am not familiar enough with some/all of the words.

The 50 triplets were obtained as follows. We randomly selected 40 concept pairs (C+, C), with C+ in P+ and C in P, using the following five restrictions: (1) concepts should contain at least 10 words; (2) for a selected pair, C+ should share at least half of its single-word terms with C, and C should share at least half of its words with C+; (3) C+ should contain at least 3 MWEs; (4) C should contain at least 3 words not appearing in C+; (5) C+ should contain at least 3 single-word terms not appearing in C.

These restrictions allow us to select concept pairs such that C+ is similar to C while they still carry enough differences to be examined. We selected the triplets as follows: for the pairs (C+, C), ten triplets include terms appearing in both C+ and C (the Both column in Table 2), ten triplets include single-word terms appearing in C+ but not in C (the P+ single column), ten triplets include single-word terms appearing in C but not in C+ (the P column), ten triplets include MWEs appearing in C+ (the P+ mwe column), and ten triplets include random terms obtained from P+ concepts (the Rand column).

                      P+ mwe   P+ single    P     Both   Rand
% shared meaning
  DMOZ                  85        88       68      81      6
  BNC                   85        90       61      88      0
  Russ.                 89        95       70      93     11
triplet score (1-4)
  DMOZ                 1.7       1.4      2.5     1.7    3.8
  BNC                  1.6       1.3      2.1     1.5    4.0
  Russ.                1.5       1.1      2.0     1.3    3.7

Table 2: Results of evaluation by human judgment on the three data sets. P+ single/mwe: single-word/MWE terms existing only in the P+ concept; P: single-word terms existing only in the P concept; Both: terms existing in both concepts; Rand: random terms. See text for detailed explanations.

The first part of Table 2 gives the average percentage of triplets that were given scores of 1 or 2 (that is, 'significant shared meaning'). The second part gives the average score of a triplet (1 is best). In these lines scores of 5 were not counted. Inter-evaluator Kappa values are 0.68/0.75/0.76 for DMOZ, BNC and Russian, respectively. We can see that terms selected by P and skipped by P+ receive low scores, while at the same time even single-word terms selected by P+ and skipped by P show very high scores. This shows that, using parser data, the proposed framework can successfully avoid selection of erroneous terms, while discovering high-quality terms missed by P. We can also see that P+ performance on MWEs, while being slightly inferior to that on single-word terms, still achieves comparable results. Thus our algorithm can greatly improve the results not only by discovering MWEs but also by improving the set of single-word concept terms.

WordNet-based evaluation. The major guideline in this part of the evaluation was to compare our results with previous work (Davidov and Rappoport, 2006; Widdows and Dorow, 2002) without the possible bias of human evaluation. We have followed their methodology as best as we could, using the same WordNet (WN) categories and the same corpora. This also allows indirect comparison to several other studies; thus (Widdows and Dorow, 2002) reports results for an LSA-based clustering algorithm that are vastly inferior to the pattern-based ones.

The evaluation method is as follows. We took the exact 10 WN subsets referred to as 'subjects' in (Widdows and Dorow, 2002), and removed all multiword items. We then selected at random 10 pairs of words from each subject. For each pair, we found the largest of our discovered concepts containing it. The various morphological forms or clear typos of the same word were treated as one in the evaluation.

We have improved the evaluation framework for Russian by using the Russian WordNet (Gelfenbeyn et al., 2003) instead of back-translations as done in (Davidov and Rappoport, 2006). Preliminary examination shows that this has no apparent effect on the results.

For each found concept C containing N words, we computed the following: (1) Precision: the number of words present in both C and WN divided by N; (2) Precision*: the number of correct words divided by N, where correct words are either words that appear in the WN subtree, or words whose entry in the American Heritage Dictionary or the Britannica directly defines them as belonging to the given class (e.g., 'murder' is defined as 'a crime'); this was done in order to overcome the relative poorness of WN; (3) Recall: the number of words present in both C and WN divided by the number of words in WN; (4) the percentage of correctly discovered words (according to Precision*) that are not in WN.
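As an illustration only (not the original evaluation code), the per-concept measures could be computed along the following lines, treating a concept, a WN subtree and the manually verified extra words as plain sets of words:

def evaluate_concept(concept, wn_subtree, extra_correct):
    """Per-concept Precision, Precision*, Recall and the share of correct words not in WN."""
    n = len(concept)
    in_wn = concept & wn_subtree
    correct = in_wn | (concept & extra_correct)   # extra_correct: dictionary-verified words
    precision = len(in_wn) / n                    # measure (1)
    precision_star = len(correct) / n             # measure (2)
    recall = len(in_wn) / len(wn_subtree)         # measure (3)
    pct_new = 100 * len(correct - wn_subtree) / max(len(correct), 1)   # measure (4)
    return precision, precision_star, recall, pct_new

The reported numbers are macro-averages of these values over the 10 WN subjects.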
Table 3 compares the macro-averages over these 10 categories to the corresponding related work.

                  Prec.   Prec.*    Rec.   %New
DMOZ      P        79.8    86.5    22.7     2.5
          P+       79.5    91.3    28.6     3.7
BNC       P       92.76   95.72    7.22     0.4
          P+       93.0    96.1    14.6     1.7
          Widdows  82.0       -       -       -
Russian   P       82.39   89.64   20.03     2.1
          P+       83.5    92.6    29.6     4.0

Table 3: WordNet evaluation in comparison to P (Davidov and Rappoport, 2006) and to Widdows (Widdows and Dorow, 2002). Columns show average precision, precision* (as defined in the text), recall, and the % of new words added to the corresponding WN subtree.

We do not observe an apparent rise in precision when comparing P+ and P, but we can see a significant improvement in both recall and precision* for all three corpora. In combination with the human judgment results, this suggests that the P+ framework successfully discovers more correct terms not present in WN. This causes precision to remain constant while precision* improves significantly. The rise in recall also shows that the P+ framework can discover significantly more correct terms from the same data.

Windowing requirement. As discussed in Section 5, windowing is required for successful noise reduction. However, due to the increase in pattern quality with parser data, it is likely that less noise will be captured by the discovered patterns. Hence, windowing could be relaxed, allowing us to obtain more data with sufficiently high precision.

In order to test this issue we applied our algorithms on the DMOZ corpus with 3 different windowing settings: (1) choosing the window size as described above; (2) using a ×4 larger window; (3) avoiding windowing altogether. Each time we randomly sampled a set of 100 concepts and tagged (by the authors) the noisy ones. A concept is considered to be noisy if it has at least 3 words unrelated to each other. Table 4 shows the results of this test.

       Reg. window   ×4 window   No windowing
P            4            18            33
P+           4             5            21

Table 4: Percentage of noisy concepts as a function of windowing.

We can see that while windowing is still essential even with available parser data, using this data we can significantly reduce the windowing requirements, allowing us to discover more concepts from the same data.

Timing requirements are modest, considering that we parsed such large amounts of data. BNC parsing took 45 minutes, and the total single-machine processing time for the 68GB DMOZ corpus was 4 days (in fact, we used a PC cluster, and all 3 corpora were parsed in 15 hours). In comparison, a state-of-the-art supervised parser (Charniak and Johnson, 2005) would process the same amount of data in 1.3 years, considering the reported parsing rate of 10 sentences per second.

7 Discussion

We have presented a framework which utilizes an efficient fully unsupervised parser for unsupervised pattern-based discovery of concepts. We showed that utilization of an unsupervised parser in pattern acquisition not only allows successful extraction of MWEs but also improves the quality of the obtained concepts, avoiding noise and adding new terms missed by the parse-less approach. At the same time, the framework remains fully unsupervised, allowing its straightforward application to different languages, as supported by our bilingual evaluation.

This research presents one more step towards the merging of fully unsupervised techniques for lexical acquisition, allowing the extraction of semantic data without strong assumptions on domain or language. While we have aimed for concept acquisition, the proposed framework can also be useful for the extraction of different types of lexical relationships, both among concepts and between concept terms.
References

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni, 2007. Open Information Extraction from the Web. IJCAI '07.

Rens Bod, 2006a. An All-Subtrees Approach to Unsupervised Parsing. ACL '06.

Rens Bod, 2006b. Unsupervised Parsing with U-DOP. CoNLL X.

Rens Bod, 2007. Is the End of Supervised Parsing in Sight? ACL '07.

Thorsten Brants, 1997. The NEGRA Export Format. CLAUS Report, Saarland University.

Sharon Caraballo, 1999. Automatic Construction of a Hypernym-labeled Noun Hierarchy from Text. ACL '99.

Eugene Charniak and Mark Johnson, 2005. Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. ACL '05.

Alexander Clark, 2001. Unsupervised Language Acquisition: Theory and Practice. Ph.D. thesis, University of Sussex.

James R. Curran, Marc Moens, 2002. Improvements in Automatic Thesaurus Extraction. SIGLEX '02, 59–66.

Dmitry Davidov, Ari Rappoport, 2006. Efficient Unsupervised Discovery of Word Categories using Symmetric Patterns and High Frequency Words. COLING-ACL '06.

Dmitry Davidov, Ari Rappoport, Moshe Koppel, 2007. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining. ACL '07.

Dmitry Davidov, Ari Rappoport, 2009. Translation and Extension of Concepts Across Languages. EACL '09.

Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman, 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Simon Dennis, 2005. An Exemplar-based Approach to Unsupervised Parsing. Proceedings of the 27th Conference of the Cognitive Science Society.

Beate Dorow, Dominic Widdows, Katarina Ling, Jean-Pierre Eckmann, Danilo Sergi, Elisha Moses, 2005. Using Curvature and Markov Clustering in Graphs for Lexical Acquisition and Word Sense Discrimination. MEANING '05.

Oren Etzioni, Michael Cafarella, Doug Downey, S. Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel Weld, Alexander Yates, 2005. Unsupervised Named-entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1):91–134.

Evgeniy Gabrilovich, Shaul Markovitch, 2005. Feature Generation for Text Categorization Using World Knowledge. IJCAI '05.

Ilya Gelfenbeyn, Artem Goncharuk, Vladislav Lehelt, Anton Lipatov, Victor Shilo, 2003. Automatic Translation of WordNet Semantic Network to Russian Language (in Russian). International Dialog 2003 Workshop.

James Gorman, James R. Curran, 2006. Scaling Distributional Similarity to Large Corpora. COLING-ACL '06.

William P. Headden III, David McClosky and Eugene Charniak, 2008. Evaluating Unsupervised Part-of-Speech Tagging for Grammar Induction. COLING '08.

Marti Hearst, 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING '92.

Adam Kilgarriff, 2007. Googleology is Bad Science. Computational Linguistics, 33(1):147–151.

Dan Klein and Christopher Manning, 2002. A Generative Constituent-Context Model for Improved Grammar Induction. ACL '02.

Dan Klein and Christopher Manning, 2004. Corpus-based Induction of Syntactic Structure: Models of Dependency and Constituency. ACL '04.

Dan Klein, 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. thesis, Stanford University.

Hang Li, Naoki Abe, 1996. Clustering Words with the MDL Principle. COLING '96.

Dekang Lin, 1998. Automatic Retrieval and Clustering of Similar Words. COLING '98.

Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz, 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Marius Pasca, Benjamin Van Durme, 2008. Weakly-supervised Acquisition of Open-domain Classes and Class Attributes from Web Documents and Query Logs. ACL '08.

Patrick Pantel, Dekang Lin, 2002. Discovering Word Senses from Text. SIGKDD '02.

Patrick Pantel, Deepak Ravichandran, Eduard Hovy, 2004. Towards Terascale Knowledge Acquisition. COLING '04.

Fernando Pereira, Naftali Tishby, Lillian Lee, 1993. Distributional Clustering of English Words. ACL '93.

Hinrich Schütze, 1998. Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123.

Yoav Seginer, 2007. Fast Unsupervised Incremental Parsing. ACL '07.

Noah A. Smith and Jason Eisner, 2006. Annealing Structural Bias in Multilingual Weighted Grammar Induction. ACL '06.

Dominic Widdows, Beate Dorow, 2002. A Graph Model for Unsupervised Lexical Acquisition. COLING '02.