We found that 64,895 out of the 150,837 words in the MRC database were not present in either WordNet or our word embedding models. Since our bootstrappers use features extracted from both of these resources, we were only able to predict the Familiarity, Age of Acquisition, Concreteness and Imagery values of the remaining 85,942 words in MRC.

4 Evaluation

Since we were not able to find previous work on this task, in these experiments we compare the performance of our bootstrapping strategy to various baselines. For training, we use the Ridge regression algorithm (Tikhonov, 1963). As features, our regressor uses the word's raw embedding values, along with the following 15 lexical features:

• The word's length and number of syllables, as determined by the Morph Adorner module of LEXenstein (Paetzold and Specia, 2015).

• The word's frequency in the Brown (Francis and Kucera, 1979), SUBTLEX (Brysbaert and New, 2009), SubIMDB (Paetzold and Specia, 2016), Wikipedia and Simple Wikipedia (Kauchak, 2013) corpora.

• The number of senses, synonyms, hypernyms and hyponyms of the word in WordNet (Fellbaum, 1998).

• The minimum, maximum and average distance between the word's senses in WordNet and the thesaurus' root sense.

• The number of images found for the word in the Getty Images database (http://developers.gettyimages.com/).
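Most of these features can be computed with off-the-shelf resources. Below is a minimal sketch of the length, syllable and WordNet features using NLTK; it is an illustration rather than the implementation described above: the vowel-cluster syllable heuristic stands in for Morph Adorner, and the corpus frequencies and Getty Images counts are assumed to come from precomputed lookups (not shown).

```python
import re
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def count_syllables(word):
    # Crude vowel-cluster heuristic; the paper uses the Morph Adorner
    # module of LEXenstein instead.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_features(word):
    senses = wn.synsets(word)
    synonyms = {lemma.name() for s in senses for lemma in s.lemmas()}
    # Depth of each sense with respect to the root of the WordNet hierarchy.
    depths = [s.min_depth() for s in senses] or [0]
    return {
        "length": len(word),
        "syllables": count_syllables(word),
        "senses": len(senses),
        "synonyms": len(synonyms),
        "hypernyms": sum(len(s.hypernyms()) for s in senses),
        "hyponyms": sum(len(s.hyponyms()) for s in senses),
        "min_sense_depth": min(depths),
        "max_sense_depth": max(depths),
        "avg_sense_depth": sum(depths) / len(depths),
    }
```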
We train our embedding models using word2vec (Mikolov et al., 2013a) over a corpus of 7 billion words composed of the SubIMDB corpus, UMBC webbase (http://ebiquity.umbc.edu/resource/html/id/351), News Crawl (http://www.statmt.org/wmt11/translation-task.html), SUBTLEX (Brysbaert and New, 2009), Wikipedia and Simple Wikipedia (Kauchak, 2013). We use 5-fold cross-validation to optimise the parameters: ζ, the embeddings model architecture (CBOW or Skip-Gram), and the word vector size (from 300 to 2,500 in intervals of 200).
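For illustration, the grid of embedding configurations explored by this search can be enumerated as follows. This sketch uses gensim's Word2Vec as a stand-in for the original word2vec tool, and the min_count and workers values are illustrative defaults; selecting the best model via downstream 5-fold cross-validation is not shown.

```python
from gensim.models import Word2Vec  # gensim 4.x; a stand-in for the word2vec tool

def candidate_models(sentences):
    # sentences: an iterable of tokenised sentences from the combined corpus.
    # Grid from the paper: CBOW vs. Skip-Gram, sizes 300 to 2,500 in steps of 200.
    for size in range(300, 2501, 200):
        for sg in (0, 1):  # 0 = CBOW, 1 = Skip-Gram
            yield Word2Vec(sentences, vector_size=size, sg=sg,
                           min_count=5, workers=8)
```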
We include five strong baseline systems in the comparison:

• Max. Similarity: the test word is assigned the property value of the closest word in the training set, i.e. the word with the highest cosine similarity according to the word embeddings model.

• Avg. Similarity: the test word is assigned the average property value of the n closest words in the training set, i.e. the words with the highest cosine similarity according to the word embeddings model. The value of n is decided through 5-fold cross-validation.

• Simple SVM: the test word is assigned the property value predicted by an SVM regressor (Smola and Vapnik, 1997) with a polynomial kernel, trained on the 15 aforementioned lexical features.

• Simple Ridge: the test word is assigned the property value predicted by a Ridge regressor trained on the 15 aforementioned lexical features.

• Super Ridge: identical to Simple Ridge, the only difference being that it also includes the word's embeddings in the feature set. We note that this baseline uses the exact same features and regression algorithm as our bootstrapped regressors.
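To make these baselines concrete, below is a minimal sketch of the similarity-based baselines and of Super Ridge. It assumes embeddings maps words to numpy vectors and lexical maps words to their 15-dimensional lexical feature vectors; the Ridge alpha shown is a placeholder for the cross-validated value.

```python
import numpy as np
from sklearn.linear_model import Ridge

def similarity_baseline(test_word, train_words, train_values, embeddings, n=1):
    # n = 1 reproduces Max. Similarity; n > 1 reproduces Avg. Similarity.
    v = embeddings[test_word]
    matrix = np.array([embeddings[w] for w in train_words])
    sims = matrix @ v / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(v))
    nearest = np.argsort(sims)[::-1][:n]
    return float(np.mean(np.array(train_values)[nearest]))

def super_ridge(train_words, train_values, test_words, embeddings, lexical):
    # Super Ridge: the word's raw embedding concatenated with its 15 lexical features.
    def feats(w):
        return np.concatenate([embeddings[w], lexical[w]])
    X = np.array([feats(w) for w in train_words])
    model = Ridge(alpha=1.0).fit(X, train_values)  # alpha: placeholder for CV'd value
    return model.predict(np.array([feats(w) for w in test_words]))
```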
The parameters of all baseline systems are optimised following the same method as for our approach. We also measure the correlation between each of the aforementioned lexical features and the psycholinguistic properties. For each psycholinguistic property, we create a training and a test set by splitting the labelled instances available in the MRC Database into two equally sized portions. All training instances are used as seeds in our approach. As evaluation metrics, we use Spearman's (ρ) and Pearson's (r) correlation. Pearson's correlation is the most important indicator of performance: an effective regressor would predict values that change linearly with a given psycholinguistic property.
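Both metrics are available off the shelf in scipy; a minimal sketch (the variable names are illustrative):

```python
from scipy.stats import pearsonr, spearmanr

def correlations(predicted, gold):
    # Spearman's rho (rank correlation) and Pearson's r (linear correlation),
    # the two scores reported in Table 1.
    rho, _ = spearmanr(predicted, gold)
    r, _ = pearsonr(predicted, gold)
    return rho, r
```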
Familiarity Age of Acquisition Concreteness Imagery
System ρ r ρ r ρ r ρ r
Word Length -0.238 -0.171 0.501 0.497 -0.170 -0.195 -0.190 -0.193
Syllables -0.168 -0.114 0.464 0.458 -0.207 -0.238 -0.218 -0.224
Freq: SubIMDB 0.798 0.725 -0.679 -0.699 0.048 0.003 0.208 0.170
Freq: SUBTLEX 0.827 0.462 -0.646 -0.251 0.028 0.137 0.187 0.265
Freq: SimpleWiki 0.725 0.488 -0.453 -0.306 0.015 0.145 0.119 0.247
Freq: Wikipedia 0.694 0.283 -0.349 -0.112 -0.076 0.081 0.027 0.134
Freq: Brown 0.706 0.608 -0.380 -0.395 -0.155 -0.214 -0.054 -0.107
Sense Count 0.471 0.363 -0.429 -0.391 0.020 -0.017 0.119 0.059
Synonym Count 0.411 0.336 -0.381 -0.357 -0.036 -0.047 0.070 0.035
Hypernym Count 0.307 0.295 -0.411 -0.387 0.167 0.088 0.268 0.160
Hyponym Count 0.379 0.245 -0.324 -0.196 0.120 0.002 0.196 0.023
Min. Sense Depth -0.347 -0.072 0.366 0.055 0.151 -0.185 0.127 -0.224
Max. Sense Depth -0.021 -0.008 -0.197 -0.196 0.447 0.455 0.415 0.414
Avg. Sense Depth -0.295 -0.256 0.215 0.183 0.400 0.428 0.345 0.347
Img. Search Count 0.544 0.145 -0.325 -0.033 -0.037 -0.073 0.117 -0.059
Max. Similarity 0.406 0.402 0.445 0.443 0.742 0.743 0.618 0.605
Avg. Similarity 0.528 0.527 0.536 0.535 0.826 0.819 0.733 0.707
Simple SVM 0.835 0.815 0.778 0.770 0.548 0.477 0.555 0.528
Simple Ridge 0.832 0.815 0.785 0.778 0.603 0.591 0.620 0.613
Super Ridge 0.847 0.833 0.827 0.820 0.859 0.852 0.813 0.800
Bootstrapping 0.863 0.846 0.871 0.862 0.876 0.869 0.835 0.823
Table 1: Regression correlation scores. In bold are the highest scores within a group (features, baselines, proposed approach), and
underlined the highest scores overall.
The results are illustrated in Table 1. While the similarity-based approaches tend to perform well for Concreteness and Imagery, typical regressors capture Familiarity and Age of Acquisition more effectively. Our approach, on the other hand, is consistently superior for all psycholinguistic properties, with both Spearman's and Pearson's correlation scores varying between 0.82 and 0.88. The difference in performance between the Super Ridge baseline and our approach confirms that our bootstrapping algorithm can in fact improve on the performance of a regressor.

Interestingly, frequency in the SubIMDB corpus (http://ghpaetzold.github.io/subimdb), composed of over 7 million sentences extracted from subtitles of "family" movies and series, has a good linear correlation with Familiarity and Age of Acquisition, much higher than that of any other feature. For Concreteness and Imagery, on the other hand, the results suggest something different: the further a word is from the root of a thesaurus, the more likely it is to refer to a physical object or entity.
The parameters used by our bootstrappers, reported below, highlight the importance of parameter optimisation in our bootstrapping strategy: its performance peaked with very different configurations across the psycholinguistic properties:

• Familiarity: 300 word vector dimensions with a Skip-Gram model, and ζ = 0.9.

• Age of Acquisition: 700 word vector dimensions with a CBOW model, and ζ = 0.7.

• Concreteness: 1,100 word vector dimensions with a Skip-Gram model, and ζ = 0.7.

• Imagery: 1,100 word vector dimensions with a Skip-Gram model, and ζ = 0.7.

5 Psycholinguistic Features for LS

Here we assess the effectiveness of our bootstrappers in the task of Lexical Simplification (LS). As shown in (Jauhar and Specia, 2012), psycholinguistic features can help supervised ranking algorithms capture word simplicity. Using the parameters described in Section 4, we train bootstrappers for these two properties using all instances in the MRC Database as seeds. We then train three rankers with (W) and without (W/O) psycholinguistic features:
• Horn (Horn et al., 2014): uses an SVM ranker trained on various n-gram probability features.

• Glavas (Glavaš and Štajner, 2015): ranks candidates using various collocational and semantic metrics, and then re-ranks them according to their average rankings.

• Paetzold (Paetzold and Specia, 2015): ranks words according to their distance to a decision boundary learned from a classification setup inferred from ranking examples; uses n-gram frequencies as features.

We use data from the English Lexical Simplification task of SemEval 2012 to assess the systems' performance. The goal of the task is to rank words in different contexts according to their simplicity.
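In the "W" configurations, each ranker's feature vector for a candidate word is simply extended with the bootstrapped property values. Below is a minimal sketch of this augmentation; the predict(word) interface of the bootstrapped regressors is hypothetical.

```python
import numpy as np

def augment(candidate, base_features, bootstrappers):
    # base_features: the ranker's own features for this candidate
    #                (e.g. n-gram probabilities for the Horn ranker).
    # bootstrappers: bootstrapped regressors, one per psycholinguistic
    #                property, each with a predict(word) method (hypothetical).
    psych = [b.predict(candidate) for b in bootstrappers]
    return np.concatenate([np.asarray(base_features, dtype=float), psych])
```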
The training and test sets contain 300 and 1,710 instances, respectively. The official metric from the task, TRank (Specia et al., 2012), is used to measure the systems' performance. As discussed in (Paetzold, 2015), this metric best represents LS performance in practice. The results in Table 2 show that the addition of our features leads to performance increases with all rankers. Performing F-tests over the rankings estimated for the simplest candidate in each instance, we found these differences to be statistically significant (p < 0.05). Using our features, the Paetzold ranker reaches the best published results for the dataset, significantly superior to the best system in SemEval (Jauhar and Specia, 2012).

                     TRank
Ranker           W/O      W
Best SemEval     -        0.602
Horn             0.625    0.635
Glavas           0.623    0.636
Paetzold         0.653    0.657

Table 2: Results on the SemEval 2012 LS task dataset.

6 Conclusions

Overall, the proposed bootstrapping strategy for regression has led to very positive results, despite its simplicity. It is therefore a cheap and reliable alternative to manually producing psycholinguistic properties of words. Word embedding models have proven to be very useful in bootstrapping, both as surrogates for confidence predictors and as regression features. Our findings also indicate the usefulness of individual features and resources: word frequencies in the SubIMDB corpus have a much stronger correlation with Familiarity and Age of Acquisition than previously used corpora, while the depth of a word's senses in a thesaurus hierarchy correlates well with both its Concreteness and Imagery.

In future work we plan to employ our bootstrapping solution in other regression problems, and to further explore potential uses of automatically learned psycholinguistic features.

The updated version of the MRC resource can be downloaded from http://ghpaetzold.github.io/data/BootstrappedMRC.zip.

References

Ian Begg and Allan Paivio. 1969. Concreteness and imagery in sentence meaning. Journal of Verbal Learning and Verbal Behavior, 8(6):821–827.

Marc Brysbaert and Boris New. 2009. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41:977–990.

John B. Carroll and Margaret N. White. 1973. Word frequency and age of acquisition as determiners of picture-naming latency. The Quarterly Journal of Experimental Psychology, 25(1):85–95.

Max Coltheart. 1981. The MRC Psycholinguistic Database. The Quarterly Journal of Experimental Psychology, 33(4):497–505.

Cynthia M. Connine, John Mullennix, Eve Shernoff, and Jennifer Yelen. 1990. Word familiarity and frequency in visual and auditory word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(6):1084.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

W. Nelson Francis and Henry Kucera. 1979. Brown corpus manual. Brown University.

Kenneth J. Gilhooly and Robert H. Logie. 1980. Age-of-acquisition, imagery, concreteness, familiarity, and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation, 12(4):395–427.

Goran Glavaš and Sanja Štajner. 2015. Simplifying lexical simplification: Do we need simplified corpora? In Proceedings of the 53rd ACL.

Felix Hill and Anna Korhonen. 2014. Concreteness and subjectivity as dimensions of lexical meaning. In Proceedings of ACL, pages 725–731.
Colby Horn, Cathryn Manduca, and David Kauchak. 2014. Learning a lexical simplifier using Wikipedia. In Proceedings of the 52nd ACL, pages 458–463.

S. Jauhar and L. Specia. 2012. UOW-SHEF: SimpLex – lexical simplicity ranking based on contextual and psycholinguistic features. In Proceedings of the 1st SemEval, pages 477–481.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD, pages 133–142.

David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st ACL, pages 1537–1546.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL, pages 746–751.

Palmer Morrel-Samuels and Robert M. Krauss. 1992. Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3):615.

Gustavo Henrique Paetzold and Lucia Specia. 2015. LEXenstein: A framework for lexical simplification. In Proceedings of the 53rd ACL.

Gustavo Henrique Paetzold and Lucia Specia. 2016. Unsupervised lexical simplification for non-native speakers. In Proceedings of the 30th AAAI.

Gustavo Henrique Paetzold. 2015. Reliable lexical simplification for non-native speakers. In Proceedings of the 2015 NAACL Student Research Workshop.

Allan P. Rudell. 1993. Frequency of word usage and perceived word difficulty: Ratings of Kučera and Francis words. Behavior Research Methods.

Alex Smola and Vladimir Vapnik. 1997. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155–161.

Lucia Specia, Sujay Kumar Jauhar, and Rada Mihalcea. 2012. SemEval-2012 Task 1: English lexical simplification. In Proceedings of the 1st SemEval, pages 347–355.

Andrey Tikhonov. 1963. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 5:1035–1038.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd ACL, pages 189–196.

Jason D. Zevin and Mark S. Seidenberg. 2002. Age of acquisition effects in word reading and other tasks. Journal of Memory and Language, 47(1):1–29.