Semantic Similarity
Abstract—Calculating the semantic similarity between sentences is a long-standing problem in the area of natural language processing. Semantic analysis plays a crucial role in text-analytics research, and semantic similarity varies as the domain of operation varies. In this paper, we present a methodology which addresses this issue by incorporating semantic similarity and corpus statistics. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. The methodology can be applied in a variety of domains, and it has been tested on both a benchmark standard and a mean human similarity dataset. On these two datasets, it gives the highest correlation values for both word and sentence similarity, outperforming other similar models. For word similarity, we obtained a Pearson correlation coefficient of 0.8753, and for sentence similarity, the correlation obtained is 0.8794.
Index Terms—Natural Language Processing, Semantic Analysis, Word Similarity, Sentence Similarity, Lexical Database, Corpus
1 INTRODUCTION
Lexical databases come into the picture at this point of processing.

3.1.4 Hierarchical distribution of words
In WordNet, the primary relationship between the synsets is the super-subordinate relation, also called the hyperonymy, hyponymy or ISA relation [21]. This relationship connects the general-concept synsets to the synsets having specific characteristics. For example, Table 4 represents 'vehicle' and its hyponyms. The hyponyms of 'vehicle' have more specific properties and represent a particular set, whereas 'vehicle' has general properties. Hence, words at the upper layers of the hierarchy carry more general properties.

Used along with the word disambiguation approach described in section 3.1.2, the final similarity of a word would be different for every corpus. A corpus belonging to a particular domain works as supervised learning data for the algorithm. We first disambiguate the whole corpus to get the sense of each word, and then calculate the frequency of each particular sense. These statistics for the corpus work as the knowledge base for the algorithm. Fig. 4 represents the steps involved in the analysis of corpus statistics.
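The corpus-statistics step above (disambiguate every word of the corpus, then count how often each sense occurs) can be sketched as follows. This is a minimal sketch: the `disambiguate` stub and its sense labels are illustrative stand-ins for a real WSD routine such as pywsd, not part of the paper's implementation.

```python
from collections import Counter

def disambiguate(sentence):
    """Placeholder WSD step: map each word to a (word, sense) pair.
    A real implementation would use a Lesk-style disambiguator (e.g. pywsd)."""
    # Fake sense inventory: in a chemistry corpus, 'bond' resolves to the
    # chemical-bond sense rather than the financial one.
    fake_senses = {"lead": "lead.n.02", "bond": "bond.n.03"}
    return [(w, fake_senses.get(w, w + ".n.01")) for w in sentence.split()]

def corpus_sense_frequencies(corpus):
    """Disambiguate the whole corpus and count each sense's frequency.
    These counts act as the knowledge base for the similarity algorithm."""
    counts = Counter()
    for sentence in corpus:
        for _, sense in disambiguate(sentence):
            counts[sense] += 1
    return counts

corpus = ["lead forms a bond", "a bond stores energy"]
freqs = corpus_sense_frequencies(corpus)
print(freqs["bond.n.03"])  # the chosen sense of 'bond' occurs twice -> 2
```

The resulting sense-frequency table plays the role of the domain knowledge base: the same word yields different statistics, and hence a different final similarity, for every corpus.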
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 5
similarity in the range of 0 to 1. Now, using Eq. 4 and Eq. 5, we establish similarity as:

Sim = S/ζ    (6)

Algorithm 1 Semantic similarity between sentences
 1: procedure SENTENCESIMILARITY
 2:   S1 (list of tagged tokens) ← disambiguate
 3:   S2 (list of tagged tokens) ← disambiguate
 4:   vector_length ← max(length(S1), length(S2))
 5:   V1, V2 ← vector_length(null)
 6:   V1, V2 ← vector_length(word_similarity(S1, S2))
 7:   ζ ← 0
 8:   while S1 has tagged tokens do
 9:     if word_similarity_value > benchmark_similarity_value then
10:       C1 ← C1 + 1
11:   while S2 has tagged tokens do
12:     if word_similarity_value > benchmark_similarity_value then
13:       C2 ← C2 + 1
14:   ζ ← sum(C1, C2)/γ
15:   S ← ||V1|| · ||V2||
16:   if sum(C1, C2) = 0 then
17:     ζ ← vector_length/2
18:   Sim ← S/ζ

3.4 Word Order Similarity
Along with the semantic nature of the sentences, we need to consider their syntactic structure too. The word order similarity, simply put, is the aggregation of comparisons of word indices in two sentences. The semantic similarity approach based on words and the lexical database does not take the grammar of the sentence into account. Li [14] assigns a number to each word in the sentence and forms a word order vector according to their occurrence and similarity. They also consider the semantic similarity value of words to decide the word order vector. If a word from sentence 1 is not present in sentence 2, the number assigned to the index of this word in the word order vector corresponds to the word with maximum similarity. This case is not always valid and introduces errors in the final semantic similarity index. For methods which calculate the similarity by chunking the sentence into words, it is not always necessary to compute the word order similarity. For such techniques, the word order similarity actually matters only when two sentences contain the same words in a different order. Otherwise, if the sentences contain different words, the word order similarity should be an optional construct.

For entirely different sentences, word order similarity does not have a large-scale impact; its effect is negligible compared to that of the semantic similarity. Hence, in our approach, we implement word order similarity as an optional feature. Consider the following classical example:

• S1: A quick brown dog jumps over the lazy fox.
• S2: A quick brown fox jumps over the lazy dog.

The edge-based approach using the lexical database will produce a result showing that S1 and S2 are the same, but since the words appear in a different order we should scale down the overall similarity, as the sentences convey different meanings. We start with the formation of vectors V1 and V2 dynamically for sentences S1 and S2 respectively. Initialization of the vectors is performed as explained in section 3.3. Instead of forming a joint word set, we treat the sentences relative to each other to keep the vector size to a minimum.

The process starts with the sentence having maximum length. Vector V1 is formed with respect to sentence 1, and cells in V1 are initialized to the index values of words in S1, beginning with 1. Hence V1 for S1 is:

V1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]

Now, we form V2 with respect to S1 and S2. To form V2, every word from S2 is compared with S1. If the word from S2 is absent in S1, then the cell in V2 is filled with the index value of the word in sentence S2. If the word from S2 matches a word from S1, then the index of the word from S1 is filled in V2.

In the above example, consider the words 'fox' and 'dog' from sentence 2. The word 'fox' from S2 is present in S1 at index 9. Hence, the entry for 'fox' in V2 is 9. Similarly, the word 'dog' from S2 is present in S1 at index 4. Hence, the entry for 'dog' in V2 is 4. Following the same procedure for all the words, we get V2 as:

V2 = [1, 2, 3, 9, 5, 6, 7, 8, 4]

Finally, word order similarity is given by:

Ws = ||V1 − V2|| / ||V1 ∗ V2||    (7)

In this case, Ws is 0.067091.

4 IMPLEMENTATION USING SEMANTIC NETS
The database used to implement the proposed methodology is WordNet, and statistical information from WordNet is used to calculate the information content of the word. To test the behavior with an external corpus, a small compiled corpus is used. The corpus contained ten sentences belonging to the 'Chemistry' domain. This section describes the prerequisites to implement the method.

4.1 The Database - WordNet
WordNet is a lexical semantic dictionary available for online and offline use, developed and hosted at Princeton. The version used in this study is WordNet 3.0, which has 117,000 synonymous sets, or synsets. The synsets for a word represent the possible meanings of the word when used in a sentence. WordNet currently has synset structures for nouns, verbs, adjectives and adverbs. These lexicons are grouped separately and do not have interconnections; for instance, nouns and verbs are not interlinked.

The main relationship connecting the synsets is the super-subordinate (ISA-HASA) relationship. The relation becomes more general as we move up the hierarchy. The root node of all the noun hierarchies is 'Entity'. Like nouns, verbs are arranged into hierarchies as well.
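The word order computation of section 3.4 (vectors V1 and V2, and the norm ratio of Eq. 7 with the product taken element-wise, which reproduces the reported value) can be sketched in Python; the function names are ours, not the paper's:

```python
import math

def word_order_vectors(s1, s2):
    """Build V1 from S1's word indices (1-based) and V2 by mapping each word
    of S2 to its index in S1 when present, else to its own index in S2."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    v1 = list(range(1, len(w1) + 1))
    v2 = [w1.index(w) + 1 if w in w1 else i + 1 for i, w in enumerate(w2)]
    return v1, v2

def word_order_similarity(s1, s2):
    """Ws = ||V1 - V2|| / ||V1 * V2||, with '*' taken element-wise."""
    v1, v2 = word_order_vectors(s1, s2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    prod = math.sqrt(sum((a * b) ** 2 for a, b in zip(v1, v2)))
    return diff / prod

s1 = "A quick brown dog jumps over the lazy fox"
s2 = "A quick brown fox jumps over the lazy dog"
print(word_order_vectors(s1, s2))
# ([1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 9, 5, 6, 7, 8, 4])
print(round(word_order_similarity(s1, s2), 6))  # 0.067091
```

Only the swapped positions of 'dog' and 'fox' contribute to the numerator, so identical sentences give Ws = 0 and the penalty grows with the displacement of the reordered words.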
The WordNet relations connect the same parts of speech. Thus, WordNet consists of four subnets of nouns, verbs, adjectives and adverbs respectively. Hence, determining the similarity between cross-domain words is not possible.

The shortest path distance is calculated by using the tree-like hierarchical structure. To figure out the shortest path, we climb up the hierarchy from both synsets and determine the meeting point, which is also a synset. This synset is called the subsumer of the respective synsets. The shortest path distance equals the number of hops from one synset to the other.

We consider the position of the subsumer of two synsets to determine the hierarchical distance. The subsumer is found by using the hyperonymy (ISA) relation for both synsets. The algorithm moves up the hierarchy until a common synset is found. This common synset is the subsumer for the synsets in comparison. A set of hypernyms is formed individually for each synset, and the intersection of the sets contains the subsumer. If the intersection of these sets contains more than one synset, then the synset with the shortest path distance is considered as the subsumer.

4.1.2 The Information content of the word
For general purposes, we use the statistical information from WordNet for the information content of the word. WordNet provides the frequency of each synset in the WordNet corpus. This frequency distribution is used in the implementation of section 3.2.

4.1.3 Illustrative example
This section explains in detail the steps involved in the calculation of semantic similarity between two sentences.

• S1: A gem is a jewel or stone that is used in jewellery.
• S2: A jewel is a precious stone used to decorate valuable things that you wear, such as rings or necklaces.

The word-pair similarities computed for these sentences are:

Words                  Similarity
gem - jewel            0.908008550956
gem - stone            0.180732071642
gem - used             0.0
gem - decorate         0.0
gem - valuable         0.0
gem - things           0.284462910289
gem - wear             0.0
gem - rings            0.485032351325
gem - necklaces        0.669319889871
jewel - jewel          0.997421032224
jewel - stone          0.217431543606
jewel - used           0.0
jewel - decorate       0.0
jewel - valuable       0.0
jewel - things         0.406309448212
jewel - wear           0.0
jewel - rings          0.456849659596
jewel - necklaces      0.41718607131
stone - jewel          0.475813717007
stone - stone          0.901187866267
stone - used           0.0
stone - decorate       0.0
stone - valuable       0.0
stone - things         0.198770510639
stone - wear           0.0
stone - rings          0.100270000776
stone - necklaces      0.0856785820827
used - jewel           0.0
used - stone           0.0
used - used            0.42189900525
used - decorate        0.0
used - valuable        0.0
used - things          0.0
used - wear            0.0
used - rings           0.0
used - necklaces       0.0
jewellery - jewel      0.509332774797
jewellery - stone      0.220266070205
jewellery - used       0.0
jewellery - decorate   0.0
jewellery - valuable   0.0
jewellery - things     0.346687374295
jewellery - wear       0.0
jewellery - rings      0.592019999822
jewellery - necklaces  0.81750915958

The following segment contains the parts of speech and corresponding synsets used to determine the similarity. For S1 the tagged words are:
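The subsumer search described in section 4.1 (climb the hypernym chains of both synsets, intersect the resulting sets, and take the common ancestor with the shortest path) can be sketched on a toy ISA hierarchy. The hierarchy and sense labels below are hand-made stand-ins, not WordNet's actual structure:

```python
def hypernym_closure(graph, synset):
    """Collect the synset itself plus all ancestors reachable via ISA links,
    recording the number of hops to each (level-synchronous walk)."""
    depths, frontier, hops = {synset: 0}, [synset], 0
    while frontier:
        hops += 1
        nxt = []
        for node in frontier:
            for parent in graph.get(node, []):
                if parent not in depths:
                    depths[parent] = hops
                    nxt.append(parent)
        frontier = nxt
    return depths

def subsumer_and_path(graph, a, b):
    """Return (subsumer, shortest path length) for synsets a and b: the
    common ancestor minimising hops(a, c) + hops(c, b)."""
    da, db = hypernym_closure(graph, a), hypernym_closure(graph, b)
    common = set(da) & set(db)
    best = min(common, key=lambda c: da[c] + db[c])
    return best, da[best] + db[best]

# Toy ISA hierarchy (child -> list of hypernyms), loosely WordNet-shaped.
isa = {
    "gem.n.02": ["jewelry.n.01"],
    "jewel.n.01": ["jewelry.n.01"],
    "jewelry.n.01": ["adornment.n.01"],
    "adornment.n.01": ["decoration.n.01"],
    "decoration.n.01": ["artifact.n.01"],
    "artifact.n.01": ["entity.n.01"],
}
print(subsumer_and_path(isa, "gem.n.02", "jewel.n.01"))  # ('jewelry.n.01', 2)
```

Here 'jewelry.n.01' wins over the deeper common ancestors ('adornment.n.01', 'entity.n.01', etc.) because it minimises the total hop count, matching the tie-breaking rule described above.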
TABLE 6: L2 compared with L1
TABLE 7: Linear regression parameter values for proposed methodology
TABLE 8: Rubenstein and Goodenough vs. Lee2014 vs. Proposed Algorithm Similarity
TABLE 9: Proposed Algorithm Similarity vs. Islam2008 vs. Li2006
Fig. 7: Comparison of linear regressions from various algorithms with R&G1965
Fig. 8: Linear regression model: Mean Human Similarity against Algorithm Sentence Similarity
TABLE 10: Sentence Similarity from proposed methodology compared with human mean similarity from Li2006
TABLE 11: Sentence Similarity from proposed methodology compared with human mean similarity from Li2006 (continued)
TABLE 12: Sentence Similarity from proposed methodology compared with human mean similarity from Li2006 (continued)
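The evaluation that follows reports Pearson correlation coefficients between algorithm scores and human ratings. For reference, the coefficient can be computed as below; this is a generic sketch with made-up example numbers, not the paper's data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [0.1, 0.4, 0.5, 0.9]   # illustrative human similarity ratings
algo = [0.2, 0.35, 0.6, 0.85]  # illustrative algorithm scores
print(round(pearson(human, algo), 4))  # 0.9708
```

A coefficient near 1 indicates that the algorithm ranks the pairs in close agreement with the human raters, which is what the comparisons in Tables 8-12 measure.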
this methodology is to achieve results as close as possible to the benchmark standard by Rubenstein and Goodenough [16]. The definitions of the words are obtained from the Collins Cobuild dictionary. Our algorithm achieved a good Pearson correlation coefficient of 0.8753695501 for word similarity, which is considerably higher than the existing algorithms. Fig. 5 represents the results for 65 pairs against the R&G benchmark standard. Fig. 6 represents the linear regression against the standard. The linear regression shows that this algorithm outperforms other similar algorithms. Table 7 shows the values of the parameters for the linear regression.

5.1 Sentence similarity
Tables 10, 11 and 12 contain the mean human sentence similarity values from the Pilot Short Text Semantic Similarity Benchmark Data Set by James O'Shea [26]. As Li [14] explains, when a survey was conducted with 32 participants to establish a measure for semantic similarity, they were asked to mark the sentences, not the words. Hence, word similarity is compared with R&G [16], whereas sentence similarity is compared with mean human similarity. Our algorithm's sentence similarity achieved a good Pearson correlation coefficient of 0.8794 with mean human similarity, outperforming previous methods. Li [14] obtained a correlation coefficient of 0.816 and Islam [29] obtained a correlation coefficient of 0.853. Out of 65 sentence pairs, 5 pairs were eliminated because of their definitions from the Collins Cobuild dictionary [27]. The reasons and results are discussed in the next section.

6 DISCUSSION
Our algorithm's similarity measure achieved a good Pearson correlation coefficient of 0.8753 with the R&G word pairs [16]. This performance outperforms all the previous methods. Table 8 presents the comparison of the similarity from the proposed method and Lee [28] with R&G. Table 9 depicts the comparison of algorithm similarity against Islam [29] and Li [14] for the 30 noun pairs, where our method performs better. For sentence similarity, the pairs 17: coast-forest, 24: lad-wizard, 30: coast-hill, 33: hill-woodland and 39: brother-lad are not considered. The reason for this is that the definitions of these word pairs have more than one common or synonymous word. Hence, the overall sentence similarity does not reflect the true sense of these word pairs, as they are rated with low similarity in the mean human ratings. For example, the definition of 'lad' is given as: 'A lad is a young man or boy.' and the definition of 'wizard' is: 'In legends and fairy stories, a wizard is a man who has magic powers.' Both sentences have similar or closely related words such as 'man-man', 'boy-man' and 'lad-man'. Hence, these pairs affect the overall similarity measure more than the actual words compared, 'lad-wizard'.

7 CONCLUSIONS
This paper presented an approach to calculate the semantic similarity between two words, sentences or paragraphs. The algorithm initially disambiguates both sentences and tags them with their parts of speech. The disambiguation approach ensures the right meaning of the word is used for comparison. The similarity between words is calculated based on a previously established edge-based approach. The information content from a corpus can be used to influence the similarity in a particular domain. Semantic vectors containing similarities between words are formed for the sentences and further used for the sentence similarity calculation. Word order vectors are also formed to calculate the impact of the syntactic structure of the sentences. Since word order affects the overall similarity less than semantic similarity does, word order similarity is weighted to a smaller extent. The methodology has been tested on previously established data sets which contain standard results as well as mean human results. Our algorithm achieved a good Pearson correlation coefficient of 0.8753 for word similarity with respect to the benchmark standard and 0.8794 for sentence similarity with respect to mean human similarity.

Future work includes extending the domain of the algorithm to analyze Learning Objectives from Course Descriptions; incorporating the algorithm with Bloom's taxonomy will also be considered. Analyzing Learning Objectives requires ontologies and relationships between words belonging to the particular field.

ACKNOWLEDGMENTS
We would like to acknowledge the financial support provided by ONCAT (Ontario Council on Articulation and Transfer) through Project Number 2017-17-LU; without their support this research would not have been possible. We are also grateful to Salimur Choudhury for his insight on different aspects of this project, and to the [Link] team for reviewing and proofreading the paper.

REFERENCES
[1] D. Lin et al., "An information-theoretic definition of similarity," in ICML, vol. 98, no. 1998, 1998, pp. 296–304.
[2] A. Freitas, J. Oliveira, S. ORiain, E. Curry, and J. Pereira da Silva, "Querying linked data using semantic relatedness: a vocabulary independent approach," Natural Language Processing and Information Systems, pp. 40–51, 2011.
[3] V. Abhishek and K. Hosanagar, "Keyword generation for search engine advertising using semantic similarity between terms," in Proceedings of the ninth international conference on Electronic commerce. ACM, 2007, pp. 89–94.
[4] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, and F. M. Couto, "Semantic similarity in biomedical ontologies," PLoS computational biology, vol. 5, no. 7, p. e1000443, 2009.
[5] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, "Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation," Bioinformatics, vol. 19, no. 10, pp. 1275–1283, 2003.
[6] T. Pedersen, S. V. Pakhomov, S. Patwardhan, and C. G. Chute, "Measures of semantic similarity and relatedness in the biomedical domain," Journal of biomedical informatics, vol. 40, no. 3, pp. 288–299, 2007.
[7] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. Petrakis, and E. E. Milios, "Semantic similarity methods in wordnet and their application to information retrieval on the web," in Proceedings of the 7th annual ACM international workshop on Web information and data management. ACM, 2005, pp. 10–16.
[8] G. Erkan and D. R. Radev, "Lexrank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.
[9] Y. Ko, J. Park, and J. Seo, "Improving text categorization using the importance of sentences," Information processing & management, vol. 40, no. 1, pp. 65–79, 2004.
[10] C. Fellbaum, WordNet. Wiley Online Library, 1998.
[11] A. D. Baddeley, "Short-term memory for word sequences as a function of acoustic, semantic and formal similarity," The Quarterly Journal of Experimental Psychology, vol. 18, no. 4, pp. 362–365, 1966.
[12] P. Resnik et al., "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," J. Artif. Intell. Res. (JAIR), vol. 11, pp. 95–130, 1999.
[13] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and cognitive processes, vol. 6, no. 1, pp. 1–28, 1991.
[14] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138–1150, 2006.
[15] J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," arXiv preprint cmp-lg/9709008, 1997.
[16] H. Rubenstein and J. B. Goodenough, "Contextual correlates of synonymy," Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.
[17] C. T. Meadow, Text information retrieval systems. Academic Press, Inc., 1992.
[18] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," International Journal on Artificial Intelligence Tools, vol. 13, no. 01, pp. 157–169, 2004.
[19] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using web search engines," WWW, vol. 7, pp. 757–766, 2007.
[20] R. L. Cilibrasi and P. M. Vitanyi, "The google similarity distance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, 2007.
[21] G. A. Miller, "Wordnet: a lexical database for english," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[22] S. Bird, "Nltk: the natural language toolkit," in Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics, 2006, pp. 69–72.
[23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of english: The penn treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[24] T. Pedersen, S. Banerjee, and S. Patwardhan, "Maximizing semantic relatedness to perform word sense disambiguation," University of Minnesota supercomputing institute research report UMSI, vol. 25, p. 2005, 2005.
[25] L. Tan, "Pywsd: Python implementations of word sense disambiguation (wsd) technologies [software]," [Link] 2014.
[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "Pilot short text semantic similarity benchmark data set: Full listing and description," Computing, 2008.
[27] J. M. Sinclair, Looking up: An account of the COBUILD project in lexical computing and the development of the Collins COBUILD English language dictionary. Collins Elt, 1987.
[28] M. C. Lee, J. W. Chang, and T. C. Hsieh, "A grammar-based semantic similarity algorithm for natural language sentences," The Scientific World Journal, vol. 2014, 2014.
[29] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 2, no. 2, p. 10, 2008.

Atish Pawar received the B.E. degree in computer science and engineering with distinction from Walchand Institute of Technology, India in 2014. He worked for Infosys Technologies from 2014 to 2016. He is currently a graduate student at Lakehead University, Canada. His research interests include machine learning, natural language processing, and artificial intelligence. He is a research assistant at the DataScience lab at Lakehead University.

Vijay Mago is an Assistant Professor in the Department of Computer Science at Lakehead University in Ontario, where he teaches and conducts research in areas including decision making in multi-agent environments, probabilistic networks, neural networks, and fuzzy logic-based expert systems. Recently, he has diversified his research to include natural language processing, big data and cloud computing. Dr. Mago received his Ph.D. in Computer Science from Panjab University, India in 2010. In 2011 he joined the Modelling of Complex Social Systems program at the IRMACS Centre of Simon Fraser University before moving on to stints at Fairleigh Dickinson University, University of Memphis and Troy University. He has served on the program committees of many international conferences and workshops. Dr. Mago has published extensively on new methodologies based on soft computing and artificial intelligence techniques to tackle complex systemic problems such as homelessness, obesity, and crime. He currently serves as an associate editor for BMC Medical Informatics and Decision Making and as co-editor for the Journal of Intelligent Systems.