
Inferring Psycholinguistic Properties of Words

Gustavo Henrique Paetzold and Lucia Specia


Department of Computer Science
University of Sheffield, UK
{ghpaetzold1,l.specia}@sheffield.ac.uk

Abstract

We introduce a bootstrapping algorithm for regression that exploits word embedding models. We use it to infer four psycholinguistic properties of words: Familiarity, Age of Acquisition, Concreteness and Imagery, and further populate the MRC Psycholinguistic Database with these properties. The approach achieves 0.88 correlation with human-produced values, and the inferred psycholinguistic features lead to state-of-the-art results when used in a Lexical Simplification task.

1 Introduction

Throughout the last three decades, much has been learned about how the psycholinguistic properties of words influence cognitive processes in the human brain when a subject is presented with either written or spoken forms. A word's Age of Acquisition is an example. The findings in (Carroll and White, 1973) reveal that objects whose names are learned earlier in life can be named faster in later stages of life, and Zevin and Seidenberg (2002) show that words learned at an early age are orthographically or phonologically very distinct from those learned in adult life.

Other psycholinguistic properties, such as Familiarity and Concreteness, influence one's proficiency in word recognition and text comprehension. The experiments in (Connine et al., 1990; Morrel-Samuels and Krauss, 1992) show that words with high Familiarity yield lower reaction times in both visual and auditory lexical decision, and require less hand gesticulation in order to be described. Begg and Paivio (1969) found that humans are less sensitive to changes in wording made to sentences with high Concreteness words.

When quantified, these aspects can be used as features for various Natural Language Processing (NLP) tasks. The Lexical Simplification approach of (Jauhar and Specia, 2012) is an example: by combining various collocational features with psycholinguistic measures extracted from the MRC Psycholinguistic Database (Coltheart, 1981), they trained a ranker (Joachims, 2002) that reached first place in the English Lexical Simplification task at SemEval 2012. Semantic Classification tasks have also benefited from such features: by combining Concreteness with other features, (Hill and Korhonen, 2014) reached state-of-the-art performance in Semantic Composition (denotative/connotative) and Semantic Modification (intersective/subsective) prediction.

Despite the evident usefulness of psycholinguistic properties of words, resources describing such properties are rare. The most extensively developed resource for English is the MRC Psycholinguistic Database (Section 2). However, it is far from complete, most likely due to the inherent cost of manually entering such properties. In this paper we propose a method to automatically infer these missing properties. We train regressors by performing bootstrapping (Yarowsky, 1995) over the existing features in the MRC database, exploiting word embedding models and other linguistic resources (Section 3). This approach outperforms various strong baselines (Section 4), and the resulting properties lead to significant improvements when used in Lexical Simplification models (Section 5).
2 The MRC Psycholinguistic Database

Introduced by Coltheart (1981), the MRC (Machine Readable Dictionary) Psycholinguistic Database is a digital compilation of lexical, morphological and psycholinguistic properties for 150,837 words. The 27 psycholinguistic properties in the resource range from simple frequency measures (Rudell, 1993) to elaborate measures estimated by humans, such as Age of Acquisition and Imagery (Gilhooly and Logie, 1980). However, despite various efforts to populate the MRC Database, these properties are only available for small subsets of the 150,837 words.

We focus on four manually estimated psycholinguistic properties in the MRC Database:

• Familiarity: The frequency with which a word is seen, heard or used daily. Available for 9,392 words.

• Age of Acquisition: The age at which a word is believed to be learned. Available for 3,503 words.

• Concreteness: How "palpable" the object the word refers to is. Available for 8,228 words.

• Imagery: The intensity with which a word arouses images. Available for 9,240 words.

All four properties are real values determined based on different quantifiable metrics. We focus on these properties because they have been proven useful and are among the most scarce in the MRC Database. As discussed in Section 1, they have been successfully used in various approaches to Lexical Simplification and Semantic Classification, and yet are available for no more than 6% of the words in the MRC Database.

3 Bootstrapping with Word Embeddings

In order to automatically estimate the missing psycholinguistic properties in the MRC Database, we resort to bootstrapping. We base our approach on that of (Yarowsky, 1995), a bootstrapping algorithm which aims to learn a classifier from a reduced set of annotated training instances (or "seeds"). It does so by performing the following five steps:

1. Initialise the training set S with the seeds available.
2. Train a classifier over S.
3. Predict values for a set of unlabelled instances U.
4. Add to S all instances from U for which the prediction confidence c is greater than or equal to ζ.
5. If at least one instance was added to S, go to step 2; otherwise, return the resulting classifier.

One critical difference between this approach and ours is that our task requires regression algorithms instead of classifiers. In classification, the prediction confidence c is often calculated as the maximum signed distance between an instance and the estimated hyperplanes. There is, however, no analogous confidence estimation technique for regression problems. We address this problem by using word embedding models.

Embedding models have proven effective in capturing linguistic regularities of words (Mikolov et al., 2013b). In order to exploit these regularities, we assume that the quality of a regressor's prediction on an instance is directly proportional to how similar the instance is to those in the labelled set. Since the inputs to the regressors are words, we compute the similarity between a test word and the words in the labelled dataset as the maximum cosine similarity between the test word's vector and the vectors in the labelled set.

Let M be an embeddings model trained over vocabulary V, S a set of training seeds, ζ a minimum confidence threshold, sim(w, S, M) the maximum cosine similarity between word w and S with respect to model M, R a regression model, and R(w) its prediction for word w. Our bootstrapping algorithm is depicted in Algorithm 1.

Algorithm 1: Regression Bootstrapping
input:  M, V, S, ζ
output: R
repeat
    Train R over S
    for w ∈ V − S do
        if sim(w, S, M) ≥ ζ then
            Add ⟨w, R(w)⟩ to S
        end
    end
until ‖S‖ converges
We found that 64,895 of the 150,837 words in the MRC database were not present in either WordNet or our word embedding models. Since our bootstrappers use features extracted from both of these resources, we were only able to predict the Familiarity, Age of Acquisition, Concreteness and Imagery values of the remaining 85,942 words in MRC.

4 Evaluation

Since we were not able to find previous work on this task, in these experiments we compare the performance of our bootstrapping strategy to various baselines. For training, we use the Ridge regression algorithm (Tikhonov, 1963). As features, our regressor uses the word's raw embedding values, along with the following 15 lexical features:

• The word's length and number of syllables, as determined by the Morph Adorner module of LEXenstein (Paetzold and Specia, 2015).

• The word's frequency in the Brown (Francis and Kucera, 1979), SUBTLEX (Brysbaert and New, 2009), SubIMDB (Paetzold and Specia, 2016), Wikipedia and Simple Wikipedia (Kauchak, 2013) corpora.

• The number of senses, synonyms, hypernyms and hyponyms for the word in WordNet (Fellbaum, 1998) (see the sketch after this list).

• The minimum, maximum and average distance between the word's senses in WordNet and the thesaurus' root sense.

• The number of images found for the word in the Getty Images database (http://developers.gettyimages.com/).
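The WordNet-based features above can be approximated with NLTK's WordNet interface (NLTK is an assumption made for illustration; the actual system uses LEXenstein's own resources), treating a synset's depth as its distance from the root of the hierarchy:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def wordnet_features(word):
    """Counts and sense-depth features for a word, in the spirit of the
    WordNet bullets above (depth stands in for the distance between a
    sense and the thesaurus' root sense)."""
    synsets = wn.synsets(word)
    if not synsets:
        return None  # word not covered by WordNet

    synonyms = {lemma.name() for s in synsets for lemma in s.lemmas()}
    hypernyms = {h for s in synsets for h in s.hypernyms()}
    hyponyms = {h for s in synsets for h in s.hyponyms()}
    depths = [s.min_depth() for s in synsets]

    return {
        "sense_count": len(synsets),
        "synonym_count": len(synonyms),
        "hypernym_count": len(hypernyms),
        "hyponym_count": len(hyponyms),
        "min_sense_depth": min(depths),
        "max_sense_depth": max(depths),
        "avg_sense_depth": sum(depths) / len(depths),
    }

print(wordnet_features("dog"))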
We train our embedding models using word2vec (Mikolov et al., 2013a) over a corpus of 7 billion words composed of the SubIMDB corpus, UMBC webbase (http://ebiquity.umbc.edu/resource/html/id/351), News Crawl (http://www.statmt.org/wmt11/translation-task.html), SUBTLEX (Brysbaert and New, 2009), Wikipedia and Simple Wikipedia (Kauchak, 2013). We use 5-fold cross-validation to optimise three parameters: ζ, the embeddings model architecture (CBOW or Skip-Gram), and the word vector size (from 300 to 2,500 in intervals of 200).
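As an illustration, here is the training step with the gensim implementation of word2vec (gensim is an assumption; the parameter names follow gensim 4.x, with sg selecting between the two architectures):

from gensim.models import Word2Vec

# Stand-in for the 7-billion-word corpus: any iterable of tokenised sentences.
corpus = [["the", "movie", "was", "great"],
          ["she", "learned", "that", "word", "early"]]

# sg=1 selects Skip-Gram, sg=0 CBOW; vector_size is one of the values
# searched in the 5-fold cross-validation described above (300 to 2,500).
model = Word2Vec(corpus, vector_size=300, sg=1, window=5, min_count=1)
embeddings = model.wv  # KeyedVectors, usable by the bootstrapper in Section 3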
We include five strong baseline systems in the comparison:

• Max. Similarity: The test word is assigned the property value of the closest word in the training set, i.e. the word with the highest cosine similarity according to the word embeddings model.

• Avg. Similarity: The test word is assigned the average property value of the n closest words in the training set, i.e. the words with the highest cosine similarity according to the word embeddings model. The value of n is decided through 5-fold cross-validation.

• Simple SVM: The test word is assigned the property value predicted by an SVM regressor (Smola and Vapnik, 1997) with a polynomial kernel trained on the 15 aforementioned lexical features.

• Simple Ridge: The test word is assigned the property value predicted by a Ridge regressor trained on the 15 aforementioned lexical features.

• Super Ridge: Identical to Simple Ridge, except that it also includes the word embeddings in the feature set. We note that this baseline uses the exact same features and regression algorithm as our bootstrapped regressors.

The parameters of all baseline systems are optimised following the same method as for our approach. We also measure the correlation between each of the aforementioned lexical features and the psycholinguistic properties. For each psycholinguistic property, we create a training and a test set by splitting the labelled instances available in the MRC Database into two equally sized portions. All training instances are used as seeds in our approach. As evaluation metrics, we use Spearman's (ρ) and Pearson's (r) correlation. Pearson's correlation is the most important indicator of performance: an effective regressor should predict values that change linearly with a given psycholinguistic property.
The results are illustrated in Table 1. While the similarity-based approaches tend to perform well for Concreteness and Imagery, typical regressors capture Familiarity and Age of Acquisition more effectively. Our approach, on the other hand, is consistently superior for all psycholinguistic properties, with both Spearman's and Pearson's correlation scores varying between 0.82 and 0.88.

                     Familiarity     Age of Acq.     Concreteness    Imagery
System               ρ       r       ρ       r       ρ       r       ρ       r
Word Length         -0.238  -0.171   0.501   0.497  -0.170  -0.195  -0.190  -0.193
Syllables           -0.168  -0.114   0.464   0.458  -0.207  -0.238  -0.218  -0.224
Freq: SubIMDB        0.798   0.725  -0.679  -0.699   0.048   0.003   0.208   0.170
Freq: SUBTLEX        0.827   0.462  -0.646  -0.251   0.028   0.137   0.187   0.265
Freq: SimpleWiki     0.725   0.488  -0.453  -0.306   0.015   0.145   0.119   0.247
Freq: Wikipedia      0.694   0.283  -0.349  -0.112  -0.076   0.081   0.027   0.134
Freq: Brown          0.706   0.608  -0.380  -0.395  -0.155  -0.214  -0.054  -0.107
Sense Count          0.471   0.363  -0.429  -0.391   0.020  -0.017   0.119   0.059
Synonym Count        0.411   0.336  -0.381  -0.357  -0.036  -0.047   0.070   0.035
Hypernym Count       0.307   0.295  -0.411  -0.387   0.167   0.088   0.268   0.160
Hyponym Count        0.379   0.245  -0.324  -0.196   0.120   0.002   0.196   0.023
Min. Sense Depth    -0.347  -0.072   0.366   0.055   0.151  -0.185   0.127  -0.224
Max. Sense Depth    -0.021  -0.008  -0.197  -0.196   0.447   0.455   0.415   0.414
Avg. Sense Depth    -0.295  -0.256   0.215   0.183   0.400   0.428   0.345   0.347
Img. Search Count    0.544   0.145  -0.325  -0.033  -0.037  -0.073   0.117  -0.059

Max. Similarity      0.406   0.402   0.445   0.443   0.742   0.743   0.618   0.605
Avg. Similarity      0.528   0.527   0.536   0.535   0.826   0.819   0.733   0.707
Simple SVM           0.835   0.815   0.778   0.770   0.548   0.477   0.555   0.528
Simple Ridge         0.832   0.815   0.785   0.778   0.603   0.591   0.620   0.613
Super Ridge          0.847   0.833   0.827   0.820   0.859   0.852   0.813   0.800

Bootstrapping        0.863   0.846   0.871   0.862   0.876   0.869   0.835   0.823

Table 1: Regression correlation scores (Spearman's ρ and Pearson's r), grouped into individual features, baselines, and the proposed approach.

The difference in performance between the Super Ridge baseline and our approach confirms that our bootstrapping algorithm can indeed improve on the performance of a regressor.

The parameters used by our bootstrappers, reported below, highlight the importance of parameter optimisation in our bootstrapping strategy: its performance peaked with very different configurations for most psycholinguistic properties:

• Familiarity: 300 word vector dimensions with a Skip-Gram model, and ζ = 0.9.

• Age of Acquisition: 700 word vector dimensions with a CBOW model, and ζ = 0.7.

• Concreteness: 1,100 word vector dimensions with a Skip-Gram model, and ζ = 0.7.

• Imagery: 1,100 word vector dimensions with a Skip-Gram model, and ζ = 0.7.

Interestingly, frequency in the SubIMDB corpus (http://ghpaetzold.github.io/subimdb), composed of over 7 million sentences extracted from subtitles of "family" movies and series, has a good linear correlation with Familiarity and Age of Acquisition, much higher than any other feature. For Concreteness and Imagery, on the other hand, the results suggest something different: the further a word is from the root of a thesaurus, the more likely it is to refer to a physical object or entity.

5 Psycholinguistic Features for LS

Here we assess the effectiveness of our bootstrappers in the task of Lexical Simplification (LS). As shown in (Jauhar and Specia, 2012), psycholinguistic features can help supervised ranking algorithms capture word simplicity.
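Concretely, the bootstrapped properties can simply be appended to each candidate word's feature vector. The snippet below is an illustrative sketch of that "W" configuration, not the rankers' actual implementations; psycholing and base_features are hypothetical names:

PROPERTIES = ("familiarity", "age_of_acquisition", "concreteness", "imagery")

def with_psycholinguistic(word, base_features, psycholing):
    """Append the four bootstrapped predictions to a candidate's base
    ranking features. `psycholing` maps each property name to a prediction
    function, e.g. the regressors trained by the bootstrapper in Section 3
    (hypothetical structure)."""
    extra = [psycholing[prop](word) for prop in PROPERTIES]
    return list(base_features) + extra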

Using the parameters described in Section 4, we train bootstrappers for these properties using all instances in the MRC Database as seeds. We then train three rankers with (W) and without (W/O) psycholinguistic features:

• Horn (Horn et al., 2014): Uses an SVM ranker trained on various n-gram probability features.

• Glavas (Glavaš and Štajner, 2015): Ranks candidates using various collocational and semantic metrics, and then re-ranks them according to their average rankings.

• Paetzold (Paetzold and Specia, 2015): Ranks words according to their distance to a decision boundary learned from a classification setup inferred from ranking examples. Uses n-gram frequencies as features.

We use data from the English Lexical Simplification task of SemEval 2012 to assess the systems' performance. The goal of the task is to rank words in different contexts according to their simplicity. The training and test sets contain 300 and 1,710 instances, respectively. The official metric from the task, TRank (Specia et al., 2012), is used to measure performance; as discussed in (Paetzold, 2015), this metric best represents LS performance in practice. The results in Table 2 show that the addition of our features leads to performance increases with all rankers. Performing F-tests over the rankings estimated for the simplest candidate in each instance, we found these differences to be statistically significant (p < 0.05). Using our features, the Paetzold ranker reaches the best published results for the dataset, significantly superior to the best system at SemEval (Jauhar and Specia, 2012).

                TRank
Ranker          W/O     W
Best SemEval    -       0.602
Horn            0.625   0.635
Glavas          0.623   0.636
Paetzold        0.653   0.657

Table 2: Results on the SemEval 2012 LS task dataset.

6 Conclusions

Overall, the proposed bootstrapping strategy for regression has led to very positive results, despite its simplicity. It is therefore a cheap and reliable alternative to manually producing psycholinguistic properties of words. Word embedding models have proven to be very useful in bootstrapping, both as surrogates for confidence predictors and as regression features. Our findings also indicate the usefulness of individual features and resources: word frequencies in the SubIMDB corpus have a much stronger correlation with Familiarity and Age of Acquisition than previously used corpora, while the depth of a word's senses in a thesaurus hierarchy correlates well with both its Concreteness and Imagery.

In future work we plan to employ our bootstrapping solution in other regression problems, and to further explore potential uses of automatically learned psycholinguistic features.

The updated version of the MRC resource can be downloaded from http://ghpaetzold.github.io/data/BootstrappedMRC.zip.
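For reference, the archive can be fetched with the Python standard library alone; since its internal layout is not described here, the sketch only lists the files it contains:

import io
import urllib.request
import zipfile

URL = "http://ghpaetzold.github.io/data/BootstrappedMRC.zip"

with urllib.request.urlopen(URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
print(archive.namelist())  # inspect the files before parsing them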

References

Ian Begg and Allan Paivio. 1969. Concreteness and imagery in sentence meaning. Journal of Verbal Learning and Verbal Behavior, 8(6):821–827.

Marc Brysbaert and Boris New. 2009. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41:977–990.

John B. Carroll and Margaret N. White. 1973. Word frequency and age of acquisition as determiners of picture-naming latency. The Quarterly Journal of Experimental Psychology, 25(1):85–95.

Max Coltheart. 1981. The MRC Psycholinguistic Database. The Quarterly Journal of Experimental Psychology, 33(4):497–505.

Cynthia M. Connine, John Mullennix, Eve Shernoff, and Jennifer Yelen. 1990. Word familiarity and frequency in visual and auditory word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(6):1084.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

W. Nelson Francis and Henry Kucera. 1979. Brown corpus manual. Brown University.

Kenneth J. Gilhooly and Robert H. Logie. 1980. Age-of-acquisition, imagery, concreteness, familiarity, and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation, 12(4):395–427.

Goran Glavaš and Sanja Štajner. 2015. Simplifying lexical simplification: Do we need simplified corpora? In Proceedings of the 53rd ACL.

Felix Hill and Anna Korhonen. 2014. Concreteness and subjectivity as dimensions of lexical meaning. In Proceedings of ACL, pages 725–731.

Colby Horn, Cathryn Manduca, and David Kauchak. 2014. Learning a lexical simplifier using Wikipedia. In Proceedings of the 52nd ACL, pages 458–463.

Sujay Kumar Jauhar and Lucia Specia. 2012. UOW-SHEF: SimpLex – lexical simplicity ranking based on contextual and psycholinguistic features. In Proceedings of the 1st SemEval, pages 477–481.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD, pages 133–142.

David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st ACL, pages 1537–1546.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL, pages 746–751.

Palmer Morrel-Samuels and Robert M. Krauss. 1992. Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3):615.

Gustavo Henrique Paetzold and Lucia Specia. 2015. LEXenstein: A framework for lexical simplification. In Proceedings of the 53rd ACL.

Gustavo Henrique Paetzold and Lucia Specia. 2016. Unsupervised lexical simplification for non-native speakers. In Proceedings of the 30th AAAI.

Gustavo Henrique Paetzold. 2015. Reliable lexical simplification for non-native speakers. In Proceedings of the 2015 NAACL Student Research Workshop.

Allan P. Rudell. 1993. Frequency of word usage and perceived word difficulty: Ratings of Kučera and Francis words. Behavior Research Methods.

Alex Smola and Vladimir Vapnik. 1997. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155–161.

Lucia Specia, Sujay Kumar Jauhar, and Rada Mihalcea. 2012. SemEval-2012 Task 1: English lexical simplification. In Proceedings of the 1st SemEval, pages 347–355.

Andrey Tikhonov. 1963. Solution of incorrectly formulated problems and the regularization method. In Soviet Math. Dokl., volume 5, pages 1035–1038.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.

Jason D. Zevin and Mark S. Seidenberg. 2002. Age of acquisition effects in word reading and other tasks. Journal of Memory and Language, 47(1):1–29.

