
Calculating the similarity between words and sentences using a lexical database and corpus statistics

Atish Pawar, Vijay Mago

Abstract—Calculating the semantic similarity between sentences is a long-standing problem in the area of natural language processing. Semantic analysis plays a crucial role in text-analytics research, and semantic similarity varies as the domain of operation varies. In this paper, we present a methodology which deals with this issue by incorporating semantic similarity and corpus statistics. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. The methodology can be applied in a variety of domains and has been tested on both a benchmark standard and a mean human similarity dataset. On these two datasets it gives the highest correlation value for both word and sentence similarity, outperforming other similar models: for word similarity we obtained a Pearson correlation coefficient of 0.8753, and for sentence similarity the correlation obtained is 0.8794.

Index Terms—Natural Language Processing, Semantic Analysis, Word Similarity, Sentence Similarity, Lexical Database, Corpus

1 INTRODUCTION

THE problem of calculating the semantic similarity between two concepts, words or sentences is a long-standing problem in the area of natural language processing. In general, semantic similarity is a measure of the conceptual distance between two objects, based on the correspondence of their meanings [1].

Determination of semantic similarity in natural language processing has a wide range of applications. In internet-related applications, the uses of semantic similarity include estimating relatedness between search engine queries [2] and generating keywords for search engine advertising [3]. In biomedical applications, semantic similarity has become a valuable tool for analyzing the results in gene clustering, gene expression and disease gene prioritization [4] [5] [6]. In addition, semantic similarity is also beneficial in information retrieval on the web [7], text summarization [8] and text categorization [9]. Hence, such applications need a robust algorithm to estimate the semantic similarity which can be used across a variety of domains.

All the applications mentioned above are domain specific and require different algorithms to serve their purpose, though the basic idea of calculating the semantic similarity remains the same. To determine the closeness of the meanings of the objects under comparison, we need some predefined standard measure which readily describes such relatedness of meanings. The absence of a predefined measure makes the problem of comparing definitions a recursive problem. Lexical databases come into the picture at this point of processing. Lexical databases have connections between words which can be utilized to determine the semantic similarity of the words [10]. Many approaches have been developed over the past few years and have proved to be very useful in the area of semantic analysis [11] [12] [13] [14] [5] [15].

This paper aims to improve existing algorithms and make them robust by integrating them with a corpus of a specific domain. The main contribution of this research is a robust semantic similarity algorithm which outperforms the existing algorithms with respect to the Rubenstein and Goodenough benchmark standard [16]. The application domain of this research is calculating semantic similarity between two Learning Outcomes from course description documents. The approach taken to solve this problem is first treating the course objectives as natural language sentences and then introducing domain-specific statistics to calculate the similarity. A separate article will be dedicated to analyzing Learning Objectives extracted from different Course Descriptions.

The next section reviews some related work. Section 3 elaborates the whole methodology step by step. Section 4 explains the idea of traversal in a lexical database along with a detailed illustrative example. Section 5 contains the results of the algorithm for the 65 noun word pairs from R&G [16] and the results of the proposed algorithm's sentence similarity for the sentence pairs in the pilot data set [26]. Section 6 discusses the results obtained and compares them with previous methodologies; it also explains the performance of the algorithm. Finally, section 7 presents the outcomes in brief and draws the conclusion.

A. Pawar and V. Mago are with the Department of Computer Science, Lakehead University, Thunder Bay, ON, P7B 5E1. E-mail: {apawar1,vmago}@[Link]

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

2 RELATED WORK

The recent work in the area of natural language processing has contributed valuable solutions to calculate the semantic
similarity between words and sentences. This section reviews some related work to investigate the strengths and limitations of previous methods and to identify the particular difficulties in computing semantic similarity. Related works can roughly be classified into the following major categories:
• Word co-occurrence methods
• Similarity based on a lexical database
• Methods based on web search engine results

Word co-occurrence methods are commonly used in Information Retrieval (IR) systems [17]. This method keeps a word list of meaningful words, and every query is considered as a document. A vector is formed for the query and for each document, and the relevant documents are retrieved based on the similarity between the query vector and the document vectors [9]. This method has obvious drawbacks:
• It ignores the word order of the sentence.
• It does not take into account the meaning of the word in the context of the sentence.
But it has the following advantages:
• It matches documents regardless of their size.
• It successfully extracts keywords from documents [18].
Using the lexical database methodology, the similarity is computed by using a predefined word hierarchy which has words, meanings, and relationships with other words, stored in a tree-like structure [14]. While comparing two words, it takes into account the path distance between the words as well as the depth of the subsumer in the hierarchy. The subsumer refers to the relative root node concerning the two words in comparison. It also uses a word corpus to calculate the 'information content' of the word, which influences the final similarity. This methodology has the following limitations:
• The appropriate meaning of the word is not considered while calculating the similarity; rather, it takes the best matching pair even if the meaning of the word is totally different in the two distinct sentences.
• The information content of a word from a corpus differs from corpus to corpus. Hence, the final result differs for every corpus.
The third methodology computes relatedness based on web search engine results, utilizing the number of search results [19]. This technique doesn't necessarily give the similarity between words, as words with opposite meanings frequently occur together on web pages, hence influencing the final similarity index. We have implemented this methodology to calculate the Google Similarity Distance [20]. The search engines that we used for this study are Google and Bing. The results obtained from this method are not encouraging for either search engine.
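For reference, the Google similarity distance that we implemented is the one defined in [20]:

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)})

where f(x) and f(y) are the numbers of pages containing the search terms x and y, f(x, y) is the number of pages containing both terms, and N is the total number of pages indexed by the search engine.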
Overall, the above-mentioned methods compute the semantic similarity without considering the context of the word according to the sentence. The proposed algorithm addresses the aforementioned issues by disambiguating the words in sentences and forming semantic vectors dynamically for the compared sentences and words.

Fig. 1. Proposed sentence similarity methodology

3 THE PROPOSED METHODOLOGY

The proposed methodology considers the text as a sequence of words and deals with all the words in sentences separately according to their semantic and syntactic structure. The information content of a word is related to the frequency of the meaning of the word in a lexical database or a corpus. The method to calculate the semantic similarity between two sentences is divided into the following parts:
• Word similarity
• Sentence similarity
• Word order similarity

Fig. 1 depicts the procedure to calculate the similarity between two sentences. Unlike other existing methods that use a fixed structure of vocabulary, the proposed method uses a lexical database to compare the appropriate meaning of the word. A semantic vector is formed for each sentence which contains the weight assigned to each word for every other word from the second sentence in comparison. This step also takes into account the information content of the word, for instance, word frequency from a standard corpus. Semantic similarity is calculated based on the two semantic vectors. An order vector is formed for each sentence which considers the syntactic similarity between the sentences. Finally, the semantic similarity is calculated based on the semantic vectors and order vectors. The following sections further describe each of the steps in more detail.
Fig. 2. Synsets for the word: bank

3.1 Word Similarity

The proposed method uses the sizeable lexical database for the English language, WordNet [21], from Princeton University. The following steps are involved in computing word similarity.

3.1.1 Identifying words for comparison
Before calculating the semantic similarity between words, it is essential to determine the words for comparison. We use the word tokenizer and the part-of-speech tagging technique as implemented in the natural language processing toolkit, NLTK [22]. This step filters the input sentence, tags the words with their part of speech (POS), and labels them accordingly. As discussed in section 2, WordNet has path relationships between noun-noun and verb-verb pairs only; such relationships are absent in WordNet for the other parts of speech. Hence, it is not possible to get a numerical value that represents the link between parts of speech other than nouns and verbs. Therefore, to reduce the time and space complexity of the algorithm, we only consider nouns and verbs to calculate the similarity.
Example: 'A voyage is a long journey on a ship or in a spacecraft'
Table 1 represents the words and their corresponding parts of speech, given as per the Penn Treebank [23].

TABLE 1
Parts of speech

Word | Part of Speech
A | DT - Determiner
voyage | NN - Noun
is | VBZ - Verb
a | DT - Determiner
long | JJ - Adjective
journey | NN - Noun
on | IN - Preposition
a | DT - Determiner
ship | NN - Noun
or | CC - Coordinating conjunction
in | IN - Preposition
a | DT - Determiner
spacecraft | NN - Noun

3.1.2 Associating a word with a sense
The primary structure of WordNet is based on synonymy. Every word has some synsets according to the meaning of the word in the context of a statement; consider, for example, the word 'bank'. Fig. 2 represents all the synsets for the word 'bank'. The distance between the synsets in comparison varies as we change the meaning of the word.
Consider an example where we calculate the shortest path distance between the words 'river' and 'bank'. WordNet has only one synset for the word 'river'. We calculate the path distance between the synset of 'river' and three synsets of the word 'bank'. Table 2 represents the synsets and corresponding definitions for the words 'bank' and 'river'.

TABLE 2
Synsets and corresponding definitions from WordNet

Synset | Definition
Synset('river.n.01') | a large natural stream of water (larger than a creek)
Synset('bank.n.01') | sloping land (especially the slope beside a body of water)
Synset('bank.n.09') | a building in which the business of banking is transacted
Synset('bank.n.06') | the funds held by a gambling house or the dealer in some gambling games

The shortest distances for these synset pairs are represented in Table 3.

TABLE 3
Synsets and corresponding shortest path distances from WordNet

Synset Pair | Shortest Path Distance
Synset('river.n.01') - Synset('bank.n.01') | 8
Synset('river.n.01') - Synset('bank.n.09') | 10
Synset('river.n.01') - Synset('bank.n.06') | 11
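The tagging and path-distance steps of sections 3.1.1 and 3.1.2 can be reproduced with NLTK [22]. The following is a minimal sketch, not the authors' implementation, and it assumes the standard NLTK data packages (punkt, averaged_perceptron_tagger, wordnet) have been downloaded.

```python
import nltk
from nltk.corpus import wordnet as wn

# Tokenize and POS-tag the example sentence (Penn Treebank tag set).
sentence = "A voyage is a long journey on a ship or in a spacecraft"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Keep only nouns and verbs, since WordNet's path relations exist
# only within the noun and verb hierarchies (section 3.1.1).
content_words = [(w, t) for (w, t) in tagged if t.startswith(("NN", "VB"))]
print(content_words)  # [('voyage', 'NN'), ('is', 'VBZ'), ...]

# Shortest path distances between the single synset of 'river'
# and three synsets of 'bank', as listed in Table 3.
river = wn.synset("river.n.01")
for name in ("bank.n.01", "bank.n.09", "bank.n.06"):
    print(name, river.shortest_path_distance(wn.synset(name)))
```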
When comparing two sentences, we have many such word pairs which have multiple synsets. Therefore, not considering the proper synset in the context of the sentence could introduce errors at this early stage of the similarity calculation; the sense of the word significantly affects the overall similarity measure. Identifying the sense of the word is part of the 'word sense disambiguation' research area. We use the 'max similarity' algorithm, Eq. (1), to perform word sense disambiguation [24] as implemented in Pywsd, an NLTK-based Python library [25]:

argmax_{synset(a)} Σ_{i=1}^{n} max_{synset(i)} sim(i, a)    (1)

where a is the ambiguous word and i ranges over the other words in the sentence.

TABLE 4
Synset and corresponding hyponyms from WordNet

Synset | Hyponyms
Synset('vehicle.n.01') | Synset('bumper car.n.01'), Synset('craft.n.02'), Synset('military vehicle.n.01'), Synset('rocket.n.01'), Synset('skibob.n.01'), Synset('sled.n.01'), Synset('steamroller.n.02'), Synset('wheeled vehicle.n.01')

3.1.3 Shortest path distance between synsets


The following example explains in detail the methodology used to calculate the shortest path distance.

Fig. 3. Hierarchical structure from WordNet (Entity at the root, branching into Unit → Instrumentality → Container and Conveyance → Vehicle → Wheeled Vehicle; Wheeled Vehicle splits into bicycle and self-propelled vehicle, and self-propelled vehicle leads to motor vehicle and then motorcycle and car)

Referring to Fig. 3, consider the words w1 = motorcycle and w2 = car. We refer to Synset('motorcycle.n.01') for 'motorcycle' and Synset('car.n.01') for 'car'. The traversal path is: motorcycle → motor vehicle → car. Hence, the shortest path distance between motorcycle and car is 2. In WordNet, the gap between words increases as similarity decreases. We use the previously established monotonically decreasing function [14]:

f(l) = e^{−αl}    (2)

where l is the shortest path distance and α is a constant. The exponential function is selected to ensure that the value of f(l) lies between 0 and 1.

3.1.4 Hierarchical distribution of words
In WordNet, the primary relationship between the synsets is the super-subordinate relation, also called hypernymy, hyponymy or the ISA relation [21]. This relationship connects the general concept synsets to the synsets having specific characteristics. For example, Table 4 represents 'vehicle' and its hyponyms. The hyponyms of 'vehicle' have more specific properties and represent a particular set, whereas 'vehicle' has general properties. Hence, words at the upper layers of the hierarchy have more general features and less semantic information compared to words at the lower layers of the hierarchy [14].
Hierarchical distance plays an important role when the path distances between word pairs are the same. For instance, referring to Fig. 3, consider the word pairs car - motorcycle and bicycle - self-propelled vehicle. The shortest path distance between both pairs is 2, but the pair car - motorcycle carries more semantic information and more specific properties than bicycle - self-propelled vehicle. Hence, we need to scale up the similarity measure if the word pair subsumes words at the lower level of the hierarchy and scale it down if they subsume words at the upper level of the hierarchy. To include this behavior, we use the previously established function [14]:

g(h) = (e^{βh} − e^{−βh}) / (e^{βh} + e^{−βh})    (3)

where h is the depth of the subsumer in the hierarchy. For WordNet, the optimal values of α and β are 0.2 and 0.45 respectively, as reported previously [8].

3.2 Information content of the word

The meaning of a word differs as we change the domain of operation. We can use this behavior of natural language to make the similarity measure domain-specific. This is an optional part of the algorithm; it is used to influence the similarity measure when the domain of operation is predetermined. To illustrate the information content of a word in action, consider the word bank. The most frequent meaning of the word bank in the context of potamology (the study of rivers) is sloping land (especially the slope beside a body of water), whereas the most frequent meaning of the word bank in the context of economics would be a financial institution that accepts deposits and channels the money into lending activities.
Used along with the word sense disambiguation approach described in section 3.1.2, the final similarity of a word will be different for every corpus. A corpus belonging to a particular domain works as supervised learning data for the algorithm. We first disambiguate the whole corpus to get the senses of the words and then calculate the frequency of each particular sense. These statistics for the corpus work as the knowledge base for the algorithm. Fig. 4 represents the steps involved in the analysis of corpus statistics.
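WordNet's own sense frequencies, used here and in section 4.1.2, are exposed in NLTK through Lemma.count(). This snippet is illustrative; the relative frequency shown is an assumed weighting rather than the paper's exact formula.

```python
from nltk.corpus import wordnet as wn

# Frequency of each sense of 'bank' in WordNet's sense-tagged corpus.
counts = {s: sum(l.count() for l in s.lemmas()) for s in wn.synsets("bank")}
total = sum(counts.values()) or 1
for synset, c in sorted(counts.items(), key=lambda kv: -kv[1])[:3]:
    print(synset.name(), c, round(c / total, 3))  # raw and relative frequency
```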
Fig. 4. Corpus statistics calculation diagram

3.3 Sentences' semantic similarity

As Li [14] states, the meaning of a sentence is reflected by the words in the sentence. Hence, we can use the semantic information from sections 3.1 and 3.2 to calculate the final similarity measure. Previously established methods estimate the semantic similarity between sentences using static approaches such as a precompiled list of words and phrases. The problem with this technique is that the precompiled list of words and phrases doesn't necessarily reflect the correct semantic information in the context of the compared sentences. The dynamic approach includes the formation of a joint word vector which compiles words from the sentences and uses it as a baseline to form individual vectors; this method introduces inaccuracy for long sentences and for paragraphs containing multiple sentences.
Unlike these methods, our method forms the semantic value vectors for the sentences and aims to keep the size of the semantic value vector to a minimum. Formation of the semantic vector begins after the disambiguation step of section 3.1.2. This approach avoids the overhead involved in forming semantic vectors separately, unlike the previously discussed methods. Also, we eliminate prepositions, conjunctions and interjections at this stage; hence, these connectives are automatically eliminated from the semantic vector. We determine the size of the vector based on the number of tokens from section 3.1.2. Every unit of the semantic vector is initialized to null to void the foundational effect: initializing the semantic vector to a unit positive value would discard the negative/null effects, and the overall semantic similarity should be a reflection of the most similar words in the sentences. Let's see an example.

S1 = "A jewel is a precious stone used to decorate valuable things that you wear, such as rings or necklaces."
S2 = "A gem is a jewel or stone that is used in jewellery."

List of tagged words for S1:
[('jewel', Synset('jewel.n.01')), Synset('jewel.n.02')],
[('stone', Synset('stone.n.02')), Synset('stone.n.13')],
[('used', Synset('use.v.03')), Synset('use.v.06')],
[('decorate', Synset('decorate.v.01')), Synset('dress.v.09')],
[('valuable', Synset('valuable.a.01')), Synset('valuable.s.02')],
[('things', Synset('thing.n.04')), Synset('thing.n.12')],
[('wear', Synset('wear.v.01')), Synset('wear.v.09')],
[('rings', Synset('ring.n.08')), Synset('band.n.12')],
[('necklaces', Synset('necklace.n.01')), Synset('necklace.n.01')]

Length of list of tagged words for S1: 9

List of tagged words for S2:
[('gem', Synset('jewel.n.01')), Synset('jewel.n.01')],
[('jewel', Synset('jewel.n.01')), Synset('jewel.n.02')],
[('stone', Synset('gem.n.02')), Synset('stone.n.13')],
[('used', Synset('use.v.03')), Synset('use.v.06')],
[('jewellery', Synset('jewelry.n.01')), Synset('jewelry.n.01')]

Length of list of tagged words for S2: 5

We eliminate words like a, is, to, that, you, such, as, or, further reducing the computing overhead. The formed semantic vectors contain semantic information concerning all the words from both sentences. For example, the semantic vector for S1 is:

V1 = [0.99742103, 0.90118787, 0.42189901, 0.0, 0.0, 0.40630945, 0.0, 0.59202, 0.81750916]

Vector V1 has semantic information from S1 as well as from S2. Similarly, vector V2 also has semantic information from S1 and S2. To establish a similarity value using the two vectors, we use the product of the magnitudes of the normalized vectors:

S = ||V1|| · ||V2||    (4)

We make this method adaptable to longer sentences by introducing a variable ζ which is dynamically calculated at runtime. With the utilization of ζ, this method can also be used to compare paragraphs with multiple sentences.

3.3.1 Determination of ζ
The words with maximum similarity have more impact on the magnitude of the vector. Using this property, we establish ζ for the sentences in comparison. According to Rubenstein and Goodenough (1965), the benchmark synonymy value of two words is 0.8025 [16]. Using this as a determination standard, we count all the cells from V1 and V2 with a value greater than 0.8025. ζ is given by:

ζ = sum(C1, C2)/γ    (5)

where C1 is the count of valid elements in V1 and C2 is the count of valid cells in V2. γ is set to 1.8 to limit the value of similarity to the range 0 to 1.
Now, using Eq. 4 and Eq. 5, we establish the similarity as:

Sim = S/ζ    (6)

Algorithm 1 Semantic similarity between sentences
1: procedure SENTENCESIMILARITY
2:   S1 - list of tagged tokens ← disambiguate
3:   S2 - list of tagged tokens ← disambiguate
4:   vector_length ← max(length(S1), length(S2))
5:   V1, V2 ← vector_length(null)
6:   V1, V2 ← vector_length(word_similarity(S1, S2))
7:   ζ ← 0
8:   while S1 list of tagged tokens do
9:     if word similarity value > benchmark similarity value then
10:      C1 ← C1 + 1
11:  while S2 list of tagged tokens do
12:    if word similarity value > benchmark similarity value then
13:      C2 ← C2 + 1
14:  ζ ← sum(C1, C2)/γ
15:  S ← ||V1|| · ||V2||
16:  if sum(C1, C2) = 0 then
17:    ζ ← vector_length/2
18:  Sim ← S/ζ
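A compact Python rendering of Algorithm 1 follows; word_similarity is the function sketched after section 3.1.4, the 0.8025 benchmark and γ = 1.8 come from section 3.3.1, and the fallback ζ = vector_length/2 mirrors lines 16-17 of the pseudocode. This is an illustrative reconstruction, not the authors' released code.

```python
import numpy as np

BENCHMARK = 0.8025  # synonymy benchmark from Rubenstein and Goodenough [16]
GAMMA = 1.8         # limits the final similarity to the range 0..1

def sentence_similarity(s1_synsets, s2_synsets, word_similarity):
    """s1_synsets, s2_synsets: disambiguated synsets (Algorithm 1, lines 2-3)."""
    n = max(len(s1_synsets), len(s2_synsets))
    v1, v2 = np.zeros(n), np.zeros(n)
    # Each cell holds a word's best similarity against the other sentence.
    for i, a in enumerate(s1_synsets):
        v1[i] = max((word_similarity(a, b) for b in s2_synsets), default=0.0)
    for j, b in enumerate(s2_synsets):
        v2[j] = max((word_similarity(b, a) for a in s1_synsets), default=0.0)
    s = np.linalg.norm(v1) * np.linalg.norm(v2)          # Eq. (4)
    c1 = int(np.sum(v1 > BENCHMARK))                     # valid cells in V1
    c2 = int(np.sum(v2 > BENCHMARK))                     # valid cells in V2
    zeta = (c1 + c2) / GAMMA if c1 + c2 > 0 else n / 2   # Eq. (5) + fallback
    return s / zeta                                      # Eq. (6)
```

With the V1 and V2 of the illustrative example in section 4.1.3, this reproduces S = 3.3197, ζ = 3.89 and a final similarity of 0.8534.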
3.4 Word Order Similarity

Along with the semantic nature of the sentences, we need to consider their syntactic structure too. The word order similarity, simply put, is an aggregation of comparisons of word indices in the two sentences. The semantic similarity approach based on words and the lexical database doesn't take into account the grammar of the sentence. Li [14] assigns a number to each word in the sentence and forms a word order vector according to their occurrence and similarity; they also consider the semantic similarity value of words to decide the word order vector. If a word from sentence 1 is not present in sentence 2, the number assigned to the index of this word in the word order vector corresponds to the word with maximum similarity. This case is not always valid and introduces errors in the final semantic similarity index. For methods which calculate the similarity by chunking the sentence into words, it is not always necessary to compute the word order similarity: word order actually matters when two sentences contain the same words in a different order. Otherwise, if the sentences contain different words, the word order similarity should be an optional construct. For entirely different sentences, the impact of word order similarity is negligible compared to that of the semantic similarity. Hence, in our approach, we implement word order similarity as an optional feature.
Consider the following classical example:
• S1: A quick brown dog jumps over the lazy fox.
• S2: A quick brown fox jumps over the lazy dog.
The edge-based approach using the lexical database will produce a result showing that S1 and S2 are the same, but since the words appear in a different order we should scale down the overall similarity, as the sentences represent different meanings. We start with the formation of vectors V1 and V2 dynamically for sentences S1 and S2 respectively. Initialization of the vectors is performed as explained in section 3.3. Instead of forming a joint word set, we treat the sentences relative to each other to keep the size of the vectors to a minimum.
The process starts with the sentence having maximum length. Vector V1 is formed with respect to sentence 1, and the cells in V1 are initialized to the index values of the words in S1, beginning with 1. Hence V1 for S1 is:
V1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Now, we form V2 with respect to S1 and S2. To form V2, every word from S2 is compared with S1. If the word from S2 is absent in S1, then the cell in V2 is filled with the index value of the word in sentence S2. If the word from S2 matches a word from S1, then the index of the word from S1 is filled in V2.
In the above example, consider the words 'fox' and 'dog' from sentence 2. The word 'fox' from S2 is present in S1 at index 9; hence, the entry for 'fox' in V2 is 9. Similarly, the word 'dog' from S2 is present in S1 at index 4; hence, the entry for 'dog' in V2 is 4. Following the same procedure for all the words, we get V2 as:
V2 = [1, 2, 3, 9, 5, 6, 7, 8, 4]
Finally, the word order similarity is given by:

Ws = ||V1 − V2|| / ||V1 ∗ V2||    (7)

In this case, Ws is 0.067091.
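The word order computation above can be sketched as follows. The handling of words repeated within a sentence is an assumption, since the example reuses the first matching index, and the sketch assumes equal-length sentences as in the example.

```python
import numpy as np

def word_order_similarity(s1_words, s2_words):
    """Eq. (7): Ws = ||V1 - V2|| / ||V1 * V2|| over word index vectors."""
    v1 = np.arange(1, len(s1_words) + 1)  # indices of words in S1, from 1
    v2 = np.array([s1_words.index(w) + 1 if w in s1_words else i
                   for i, w in enumerate(s2_words, start=1)])
    return np.linalg.norm(v1 - v2) / np.linalg.norm(v1 * v2)

s1 = "A quick brown dog jumps over the lazy fox".split()
s2 = "A quick brown fox jumps over the lazy dog".split()
print(word_order_similarity(s1, s2))  # ~0.067091, matching the text
```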
4 IMPLEMENTATION USING SEMANTIC NETS

The database used to implement the proposed methodology is WordNet, and statistical information from WordNet is used to calculate the information content of the words. To test the behavior with an external corpus, a small compiled corpus is used; it contains ten sentences belonging to the 'Chemistry' domain. This section describes the prerequisites to implement the method.

4.1 The Database - WordNet

WordNet is a lexical semantic dictionary available for online and offline use, developed and hosted at Princeton. The version used in this study is WordNet 3.0, which has 117,000 synonymous sets, or synsets. The synsets for a word represent the possible meanings of the word when used in a sentence. WordNet currently has synset structures for nouns, verbs, adjectives and adverbs. These lexicons are grouped separately and do not have interconnections; for instance, nouns and verbs are not interlinked.
The main relationship connecting the synsets is the super-subordinate (ISA-HASA) relationship. The relation becomes more general as we move up the hierarchy. The root node of all the noun hierarchies is 'Entity'. Like nouns, verbs are arranged into hierarchies as well.

4.1.1 Shortest path distance and hierarchical distances from WordNet
The WordNet relations connect the same parts of speech. Thus, WordNet consists of four subnets, of nouns, verbs, adjectives and adverbs respectively; determining the similarity across these subnets is not possible.
The shortest path distance is calculated by using the tree-like hierarchical structure. To find the shortest path, we climb up the hierarchy from both synsets and determine the meeting point, which is also a synset. This synset is called the subsumer of the respective synsets. The shortest path distance equals the hops from one synset to the other.
We consider the position of the subsumer of two synsets to determine the hierarchical distance. The subsumer is found by using the hypernymy (ISA) relation for both synsets. The algorithm moves up the hierarchy until a common synset is found; this common synset is the subsumer for the synsets in comparison. A set of hypernyms is formed individually for each synset, and the intersection of the sets contains the subsumer. If the intersection of these sets contains more than one synset, then the synset with the shortest path distance is considered as the subsumer.

4.1.2 The information content of the word
For general purposes, we use the statistical information from WordNet for the information content of the word. WordNet provides the frequency of each synset in the WordNet corpus. This frequency distribution is used in the implementation of section 3.2.

4.1.3 Illustrative example
This section explains in detail the steps involved in the calculation of the semantic similarity between two sentences.
• S1: A gem is a jewel or stone that is used in jewellery.
• S2: A jewel is a precious stone used to decorate valuable things that you wear, such as rings or necklaces.
The following segment contains the parts of speech and corresponding synsets used to determine the similarity.
For S1 the tagged words are:

Synset('jewel.n.01'): a precious or semiprecious stone incorporated into a piece of jewelry
Synset('jewel.n.01'): a precious or semiprecious stone incorporated into a piece of jewelry
Synset('gem.n.02'): a crystalline rock that can be cut and polished for jewelry
Synset('use.v.03'): use up, consume fully
Synset('jewelry.n.01'): an adornment (as a bracelet or ring or necklace) made of precious metals and set with gems (or imitation gems)

For S2 the tagged words are:

Synset('jewel.n.01'): a precious or semiprecious stone incorporated into a piece of jewelry
Synset('stone.n.02'): building material consisting of a piece of rock hewn in a definite shape for a special purpose
Synset('use.v.03'): use up, consume fully
Synset('decorate.v.01'): make more attractive by adding ornament, colour, etc.
Synset('valuable.a.01'): having great material or monetary value especially for use or exchange
Synset('thing.n.04'): an artifact
Synset('wear.v.01'): be dressed in
Synset('ring.n.08'): jewelry consisting of a circlet of precious metal (often set with jewels) worn on the finger
Synset('necklace.n.01'): jewelry consisting of a cord or chain (often bearing gems) worn about the neck as an ornament (especially by women)

After identifying the synsets for comparison, we find the shortest path distances between all the synsets and take the best matching result to form the semantic vector. An intermediate list is formed which contains the words and the identified synsets; L1 and L2 below represent the intermediate lists.

TABLE 5
L1 compared with L2

Words | Similarity
gem - jewel | 0.908008550956
gem - stone | 0.180732071642
gem - used | 0.0
gem - decorate | 0.0
gem - valuable | 0.0
gem - things | 0.284462910289
gem - wear | 0.0
gem - rings | 0.485032351325
gem - necklaces | 0.669319889871
jewel - jewel | 0.997421032224
jewel - stone | 0.217431543606
jewel - used | 0.0
jewel - decorate | 0.0
jewel - valuable | 0.0
jewel - things | 0.406309448212
jewel - wear | 0.0
jewel - rings | 0.456849659596
jewel - necklaces | 0.41718607131
stone - jewel | 0.475813717007
stone - stone | 0.901187866267
stone - used | 0.0
stone - decorate | 0.0
stone - valuable | 0.0
stone - things | 0.198770510639
stone - wear | 0.0
stone - rings | 0.100270000776
stone - necklaces | 0.0856785820827
used - jewel | 0.0
used - stone | 0.0
used - used | 0.42189900525
used - decorate | 0.0
used - valuable | 0.0
used - things | 0.0
used - wear | 0.0
used - rings | 0.0
used - necklaces | 0.0
jewellery - jewel | 0.509332774797
jewellery - stone | 0.220266070205
jewellery - used | 0.0
jewellery - decorate | 0.0
jewellery - valuable | 0.0
jewellery - things | 0.346687374295
jewellery - wear | 0.0
jewellery - rings | 0.592019999822
jewellery - necklaces | 0.81750915958
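The subsumer search of section 4.1.1, which intersects the hypernym sets of the two synsets and keeps the nearest common ancestor, can be sketched like this; the tie-breaking by combined hop count is our reading of the text.

```python
from nltk.corpus import wordnet as wn

def find_subsumer(syn1, syn2):
    """Nearest common ancestor of two synsets via hypernym-set intersection."""
    ancestors1 = {s for path in syn1.hypernym_paths() for s in path}
    ancestors2 = {s for path in syn2.hypernym_paths() for s in path}
    common = ancestors1 & ancestors2
    if not common:
        return None
    # Among common ancestors, take the one closest to both synsets.
    return min(common, key=lambda s: syn1.shortest_path_distance(s)
                                     + syn2.shortest_path_distance(s))

sub = find_subsumer(wn.synset("motorcycle.n.01"), wn.synset("car.n.01"))
print(sub)  # expected: Synset('motor_vehicle.n.01'), cf. Fig. 3
```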
TABLE 6
L2 compared with L1

Words | Similarity
jewel - gem | 0.908008550956
jewel - jewel | 0.997421032224
jewel - stone | 0.475813717007
jewel - used | 0.0
jewel - jewellery | 0.509332774797
stone - gem | 0.180732071642
stone - jewel | 0.217431543606
stone - stone | 0.901187866267
stone - used | 0.0
stone - jewellery | 0.220266070205
used - gem | 0.0
used - jewel | 0.0
used - stone | 0.0
used - used | 0.42189900525
used - jewellery | 0.0
decorate - gem | 0.0
decorate - jewel | 0.0
decorate - stone | 0.0
decorate - used | 0.0
decorate - jewellery | 0.0
valuable - gem | 0.0
valuable - jewel | 0.0
valuable - stone | 0.0
valuable - used | 0.0
valuable - jewellery | 0.0
things - gem | 0.284462910289
things - jewel | 0.406309448212
things - stone | 0.198770510639
things - used | 0.0
things - jewellery | 0.346687374295
wear - gem | 0.0
wear - jewel | 0.0
wear - stone | 0.0
wear - used | 0.0
wear - jewellery | 0.0
rings - gem | 0.485032351325
rings - jewel | 0.456849659596
rings - stone | 0.100270000776
rings - used | 0.0
rings - jewellery | 0.592019999822
necklaces - gem | 0.669319889871
necklaces - jewel | 0.41718607131
necklaces - stone | 0.0856785820827
necklaces - used | 0.0
necklaces - jewellery | 0.81750915958

TABLE 7
Linear regression parameter values for proposed methodology

Slope | 0.84312603549362108
Intercept | 0.017742354112473213
r-value | 0.87536955005374539
p-value | 1.4816200698817255e-21
stderr | 0.058665976202757132

Fig. 5. Performance of word similarity method vs Standard by Rubenstein and Goodenough

Fig. 6. Linear regression model of word similarity method against Standard by Rubenstein and Goodenough

L1: [('gem', Synset('jewel.n.01'))], [('jewel', Synset('jewel.n.01'))], [('stone', Synset('gem.n.02'))], [('used', Synset('use.v.03'))], [('jewellery', Synset('jewelry.n.01'))]

L2: [('jewel', Synset('jewel.n.01'))], [('stone', Synset('stone.n.02'))], [('used', Synset('use.v.03'))], [('decorate', Synset('decorate.v.01'))], [('valuable', Synset('valuable.a.01'))], [('things', Synset('thing.n.04'))], [('wear', Synset('wear.v.01'))], [('rings', Synset('ring.n.08'))], [('necklaces', Synset('necklace.n.01'))]

Now we begin to form the semantic vectors for S1 and S2 by comparing every synset from L1 with every synset from L2. The intermediate step here is to determine the size of the semantic vector and initialize it to null; in this example, the size of the semantic vector is 9, by the method explained in section 3.3. Tables 5 and 6 contain the cross comparison of L1 and L2.
Cross-comparison with all the words from both S1 and S2 is essential because, if a word from statement S1 best matches a word from S2, it does not necessarily hold when the comparison is reversed. This scenario can be observed with the words 'things' and 'jewel': 'things' best matches 'jewel' with an index of 0.4063, whereas 'jewel' best matches 'jewel', not 'things'.
After getting the similarity values for all the word pairs, we need to determine an index entry for the semantic vector. The entry in the semantic vector for a word is the highest similarity value from its comparison with the words from the other sentence. For instance, for the word gem from Table 5, the corresponding semantic vector entry is 0.90800855, as it is the maximum of all the compared similarity values. Hence, we get V1 and V2 as follows:
TABLE 8
Rubenstein and Goodenough vs Lee2014 vs Proposed Algorithm Similarity

R&G No | R&G pair | R&G Similarity | Lee2014 | Proposed Algorithm Similarity
1 cord smile 0.005 0.01 0.0899021679
2 noon string 0.01 0.005 0.0440401486
3 rooster voyage 0.01 0.0125 0.010051669
4 fruit furnace 0.0125 0.0475 0.0720444643
5 autograph shore 0.015 0.005 0.0742552483
6 automobile wizard 0.0275 0.02 0.0906955651
7 mound stove 0.035 0.005 0.0656419906
8 grin implement 0.045 0.005 0.0899021679
9 asylum fruit 0.0475 0.005 0.0720444643
10 asylum monk 0.0975 0.0375 0.0757289762
11 graveyard madhouse 0.105 0.0225 0.0607950554
12 boy rooster 0.11 0.0075 0.0907164485
13 glass magician 0.11 0.1075 0.1782144411
14 cushion jewel 0.1125 0.0525 0.2443794293
15 monk slave 0.1425 0.045 0.3750880747
16 asylum cemetery 0.1975 0.0375 0.1106378337
17 coast forest 0.2125 0.0475 0.1106378337
18 grin lad 0.22 0.0125 0.0899021679
19 shore woodland 0.225 0.0825 0.3011198804
20 monk oracle 0.2275 0.1125 0.2464473057
21 boy sage 0.24 0.0425 0.2017739882
22 automobile cushion 0.2425 0.02 0.2018466921
23 mound shore 0.2425 0.035 0.2018466921
24 lad wizard 0.2475 0.0325 0.3673305438
25 forest graveyard 0.25 0.065 0.2015952767
26 food rooster 0.2725 0.055 0.2732326922
27 cemetery woodland 0.295 0.0375 0.2015952767
28 shore voyage 0.305 0.02 0.4075214431
29 bird woodland 0.31 0.0125 0.1651985693
30 coast hill 0.315 0.1 0.4103617321
31 furnace implement 0.3425 0.05 0.2464473057
32 crane rooster 0.3525 0.02 0.2465928735
33 hill woodland 0.37 0.145 0.2918421392
34 car journey 0.3875 0.0725 0.2730713984
35 cemetery mound 0.4225 0.0575 0.0656419906
36 glass jewel 0.445 0.1075 0.3176716099
37 magician oracle 0.455 0.13 0.3057403627
38 crane implement 0.5925 0.185 0.4486585394
39 brother lad 0.6025 0.1275 0.5462290271
40 sage wizard 0.615 0.1525 0.3675115617
41 oracle sage 0.6525 0.2825 0.5279307332
42 bird cock 0.6575 0.035 0.5750838807
43 bird crane 0.67 0.1625 0.4978503715
44 food fruit 0.6725 0.2425 0.6196075053
45 brother monk 0.685 0.045 0.2664571358
46 asylum madhouse 0.76 0.215 0.8185286992
47 furnace stove 0.7775 0.3475 0.1651985693
48 magician wizard 0.8025 0.355 0.9985079423
49 hill mound 0.8225 0.2925 0.8148010746
50 cord string 0.8525 0.47 0.8148010746
51 glass tumbler 0.8625 0.1375 0.8561402541
52 grin smile 0.865 0.485 0.9910074537
53 serf slave 0.865 0.4825 0.8673305438
54 journey voyage 0.895 0.36 0.8185286992
55 autograph signature 0.8975 0.405 0.8499457067
56 coast shore 0.9 0.5875 0.8179120223
57 forest woodland 0.9125 0.6275 0.9780261147
58 implement tool 0.915 0.59 0.0822919486
59 cock rooster 0.92 0.8625 0.9093502924
60 boy lad 0.955 0.58 0.9093502924
61 cushion pillow 0.96 0.5225 0.8157293861
62 cemetery graveyard 0.97 0.7725 0.9985079423
63 automobile car 0.98 0.5575 0.8185286992
64 gem jewel 0.985 0.955 0.8175091596
65 midday noon 0.985 0.6525 0.9993931059
TABLE 9
Proposed Algorithm Similarity vs Islam2008 vs Li2006

R&G No | R&G pair | Proposed Algorithm Similarity | A. Islam 2008 | Li et al. 2006
1 cord smile 0.0899021679 0.06 0.33
5 autograph shore 0.0742552483 0.11 0.29
9 asylum fruit 0.0720444643 0.07 0.21
12 boy rooster 0.0907164485 0.16 0.53
17 coast forest 0.1106378337 0.26 0.36
21 boy sage 0.2017739882 0.16 0.51
25 forest graveyard 0.2015952767 0.33 0.55
29 bird woodland 0.1651985693 0.12 0.33
33 hill woodland 0.2918421392 0.29 0.59
37 magician oracle 0.3057403627 0.2 0.44
41 oracle sage 0.5279307332 0.09 0.43
47 furnace stove 0.1651985693 0.3 0.72
48 magician wizard 0.9985079423 0.34 0.65
49 hill mound 0.8148010746 0.15 0.74
50 cord string 0.8148010746 0.49 0.68
51 glass tumbler 0.8561402541 0.28 0.65
52 grin smile 0.9910074537 0.32 0.49
53 serf slave 0.8673305438 0.44 0.39
54 journey voyage 0.8185286992 0.41 0.52
55 autograph signature 0.8499457067 0.19 0.55
56 coast shore 0.8179120223 0.47 0.76
57 forest woodland 0.9780261147 0.26 0.7
58 implement tool 0.0822919486 0.51 0.75
59 cock rooster 0.9093502924 0.94 1
60 boy lad 0.9093502924 0.6 0.66
61 cushion pillow 0.8157293861 0.29 0.66
62 cemetery graveyard 0.9985079423 0.51 0.73
63 automobile car 0.8185286992 0.52 0.64
64 gem jewel 0.8175091596 0.65 0.83
65 midday noon 0.9993931059 0.93 1

Fig. 7. Comparison of linear regressions from various algorithms with R&G1965

Fig. 8. Linear regression model: Mean Human Similarity against Algorithm Sentence Similarity

• V1 = [0.90800855, 0.99742103, 0.90118787, 0.42189901, 0.81750916, 0.0, 0.0, 0.0, 0.0]
• V2 = [0.99742103, 0.90118787, 0.42189901, 0.0, 0.0, 0.40630945, 0.0, 0.59202, 0.81750916]

The intermediate step here is to calculate the product of the magnitudes of the normalized vectors V1 and V2, as explained in section 3.3:
S = 3.31974454153
The following segment explains the determination of ζ with reference to section 3.3.1. C1 for V1 is 4 and C2 for V2 is 3, so ζ = (4 + 3)/1.8 = 3.89. Now, the final similarity is
Similarity = S/ζ = 3.31974454153/3.89 = 0.8534.

5 EXPERIMENTAL RESULTS

To evaluate the algorithm, we used a standard dataset of 65 noun pairs originally measured by Rubenstein and Goodenough [16]. The data has been used in many investigations over the years and has been established as a stable source for the semantic similarity measure. The word similarity obtained in this experiment is assisted by the standard sentences in the Pilot Short Text Semantic Similarity Benchmark Data Set by James O'Shea [26].
TABLE 10
Sentence similarity from the proposed methodology compared with mean human similarity from Li2006

R&G number | Sentence 1 | Sentence 2 | Mean Human Similarity | Proposed Algorithm Sentence Similarity
1 Cord is strong, thick string. A smile is the expression that you have on your 0.01 0.0225
face when you are pleased or amused, or when
you are being friendly.
2 A rooster is an adult male chicken. A voyage is a long journey on a ship or in a 0.005 0.2593
spacecraft.
3 Noon is 12 o’clock in the middle of the day. String is thin rope made of twisted threads, used 0.0125 0.03455
for tying things together or tying up parcels.
4 Fruit or a fruit is something which grows on a A furnace is a container or enclosed space in 0.0475 0.1388
tree or bush and which contains seeds or a stone which a very hot fire is made, for example to
covered by a substance that you can eat. melt metal, burn rubbish or produce steam.
5 An autograph is the signature of someone The shores or shore of a sea, lake, or wide river is 0.0050 0.0701
famous which is specially written for a fan to the land along the edge of it.
keep.
6 An automobile is a car. In legends and fairy stories, a wizard is a man 0.0200 0.0088
who has magic powers.
7 A mound of something is a large rounded pile of A stove is a piece of equipment which provides 0.0050 0.4968
it. heat, either for cooking or for heating a room.
8 A grin is a broad smile. An implement is a tool or other pieces of 0.0050 0.0099
equipment.
9 An asylum is a psychiatric hospital. Fruit or a fruit is something which grows on a 0.0050 0.01456
tree or bush and which contains seeds or a stone
covered by a substance that you can eat.
10 An asylum is a psychiatric hospital. A monk is a member of a male religious 0.0375 0.0175
community that is usually separated from the
outside world.
11 A graveyard is an area of land, sometimes near a If you describe a place or situation as a 0.0225 0.1339
church, where dead people are buried. madhouse,you mean that it is full of confusion
and noise.
12 Glass is a hard transparent substance that is used A magician is a person who entertains people by 0.0075 0.0911
to make things such as windows and bottles. doing magic tricks.
13 A boy is a child who will grow up to be a man. A rooster is an adult male chicken. 0.1075 0.2921
14 A cushion is a fabric case filled with soft A jewel is a precious stone used to decorate 0.0525 0.1745
material, which you put on a seat to make it valuable things that you wear, such as rings or
more comfortable. necklaces.
15 A monk is a member of a male religious A slave is someone who is the property of 0.0450 0.1394
community that is usually separated from the another person and has to work for that person.
outside world.
16 An asylum is a psychiatric hospital. A cemetery is a place where dead peoples bodies 0.375 0.03398
or their ashes are buried.
17 The coast is an area of land that is next to the sea. A forest is a large area where trees grow close 0.0475 0.3658
together.
18 A grin is a broad smile. A lad is a young man or boy. 0.0125 0.0281
19 The shores or shore of a sea, lake, or wide river is Woodland is land with a lot of trees. 0.0825 0.3192
the land along the edge of it.
20 A monk is a member of a male religious In ancient times, an oracle was a priest or 0.1125 0.1011
community that is usually separated from the priestess who made statements about future
outside world. events or about the truth.
21 A boy is a child who will grow up to be a man. A sage is a person who is regarded as being very 0.0425 0.2305
wise.
22 An automobile is a car. A cushion is a fabric case filled with soft 0.0200 0.0330
material, which you put on a seat to make it
more comfortable.
23 A mound of something is a large rounded pile of The shores or shore of a sea, lake, or wide river is 0.0350 0.0386
it. the land along the edge of it.
24 A lad is a young man or boy. In legends and fairy stories, a wizard is a man 0.0325 0.3939
who has magic powers.
25 A forest is a large area where trees grow close A graveyard is an area of land, sometimes near a 0.0650 0.2787
together. church, where dead people are buried.
26 Food is what people and animals eat. A rooster is an adult male chicken. 0.0550 0.2972
27 A cemetery is a place where dead peoples bodies Woodland is land with a lot of trees. 0.0375 0.1240
or their ashes are buried.
28 The shores or shore of a sea, lake, or wide river is A voyage is a long journey on a ship or in a 0.0200 0.0304
the land along the edge of it. spacecraft.
29 A bird is a creature with feathers and wings, Woodland is land with a lot of trees. 0.0125 0.1334
females lay eggs, and most birds can fly.
30 The coast is an area of land that is next to the sea. A hill is an area of land that is higher than the 0.1000 0.8032
land that surrounds it.
31 A furnace is a container or enclosed space in An implement is a tool or other piece of 0.0500 0.1408
which a very hot fire is made, for example to equipment.
melt metal, burn rubbish or produce steam.
32 A crane is a large machine that moves heavy A rooster is an adult male chicken. 0.0200 0.0564
things by lifting them in the air.
33 A hill is an area of land that is higher than the Woodland is land with a lot of trees. 0.1450 0.7619
land that surrounds it.
TABLE 11
Sentence similarity from the proposed methodology compared with mean human similarity from Li2006 (continued from previous page)

R&G number | Sentence 1 | Sentence 2 | Mean Human Similarity | Proposed Algorithm Sentence Similarity
34 A car is a motor vehicle with room for a small When you make a journey, you travel from one 0.0725 0.02610
number of passengers. place to another.
35 A cemetery is a place where dead peoples bodies A mound of something is a large rounded pile of 0.0575 0.0842
or their ashes are buried. it.
36 Glass is a hard transparent substance that is used A jewel is a precious stone used to decorate 0.1075 0.2692
to make things such as windows and bottles. valuable things that you wear, such as rings or
necklaces.
37 A magician is a person who entertains people by In ancient times, an oracle was a priest or 0.1300 0.1000
doing magic tricks. priestess who made statements about future
events or about the truth.
38 A crane is a large machine that moves heavy An implement is a tool or other piece of 0.1850 0.1060
things by lifting them in the air. equipment.
39 Your brother is a boy or a man who has the same A lad is a young man or boy. 0.1275 0.9615
parents as you.
40 A sage is a person who is regarded as being very In legends and fairy stories, a wizard is a man 0.1525 0.1920
wise. who has magic powers.
41 In ancient times, an oracle was a priest or A sage is a person who is regarded as being very 0.2825 0.0452
priestess who made statements about future wise.
events or about the truth.
42 A bird is a creature with feathers and wings, A crane is a large machine that moves heavy 0.0350 0.1660
females lay eggs, and most birds can fly. things by lifting them in the air.
43 A bird is a creature with feathers and wings, A cock is an adult male chicken. 0.1625 0.1704
females lay eggs, and most birds can fly.
44 Food is what people and animals eat. Fruit or a fruit is something which grows on a 0.2425 0.1379
tree or bush and which contains seeds or a stone
covered by a substance that you can eat.
45 Your brother is a boy or a man who has the same A monk is a member of a male religious 0.0450 0.2780
parents as you. community that is usually separated from the
outside world.
46 An asylum is a psychiatric hospital. If you describe a place or situation as a 0.2150 0.1860
madhouse, you mean that it is full of confusion
and noise.
47 A furnace is a container or enclosed space in A stove is a piece of equipment which provides 0.3475 0.1613
which a very hot fire is made, for example, to heat, either for cooking or for heating a room.
melt metal, burn rubbish, or produce steam.
48 A magician is a person who entertains people by In legends and fairy stories, a wizard is a man 0.3550 0.5399
doing magic tricks. who has magic powers.
49 A hill is an area of land that is higher than the A mound of something is a large rounded pile of 0.2925 0.2986
land that surrounds it. it.
50 Cord is strong, thick string. String is thin rope made of twisted threads, used 0.4700 0.2530
for tying things together or tying up parcels.
51 Glass is a hard transparent substance that is used A tumbler is a drinking glass with straight sides. 0.1375 0.3016
to make things such as windows and bottles.
52 A grin is a broad smile. A smile is the expression that you have on your 0.4850 0.8419
face when you are pleased or amused, or when
you are being friendly.
53 In former times, serfs were a class of people who A slave is someone who is the property of 0.4825 0.8896
had to work on a particular persons land and another person and has to work for that person.
could not leave without that persons permission.
54 When you make a journey, you travel from one A voyage is a long journey on a ship or in a 0.3600 0.7826
place to another. spacecraft.
55 An autograph is the signature of someone Your signature is your name, written in your 0.4050 0.3146
famous which is specially written for a fan to own characteristic way, often at the end of a
keep. document to indicate that you wrote the
document or that you agree with what it says.
56 The coast is an area of land that is next to the sea. The shores or shore of a sea, lake, or wide river is 0.5875 0.9773
the land along the edge of it.
57 A forest is a large area where trees grow close Woodland is land with a lot of trees. 0.6275 0.4770
together.
58 An implement is a tool or other pieces of A tool is any instrument or simple piece of 0.5900 0.8919
equipment. equipment that you hold in your hands and use
to do a particular kind of work.
59 A cock is an adult male chicken. A rooster is an adult male chicken. 0.8625 0.8560
60 A boy is a child who will grow up to be a man. A lad is a young man or boy. 0.5800 0.8980
61 A cushion is a fabric case filled with soft A pillow is a rectangular cushion which you rest 0.5225 0.9340
material, which you put on a seat to make it your head on when you are in bed.
more comfortable.
62 A cemetery is a place where dead peoples bodies A graveyard is an area of land, sometimes near a 0.7725 1.0
or their ashes are buried. church, where dead people are buried.
63 An automobile is a car. A car is a motor vehicle with room for a small 0.5575 0.7001
number of passengers.
64 Midday is 12 oclock in the middle of the day. Noon is 12 oclock in the middle of the day. 0.9550 0.8726
TABLE 12
Sentence similarity from the proposed methodology compared with mean human similarity from Li2006 (continued from previous page)

R&G number | Sentence 1 | Sentence 2 | Mean Human Similarity | Proposed Algorithm Sentence Similarity
65 A gem is a jewel or stone that is used in jewellery. A jewel is a precious stone used to decorate 0.6525 0.8536
valuable things that you wear, such as rings or
necklaces.

The aim of this methodology is to achieve results as close as possible to the benchmark standard by Rubenstein and Goodenough [16]. The definitions of the words are obtained from the Collins Cobuild dictionary. Our algorithm achieved a good Pearson correlation coefficient of 0.8753695501 for word similarity, which is considerably higher than the existing algorithms. Fig. 5 represents the results for the 65 pairs against the R&G benchmark standard, and Fig. 6 represents the linear regression against the standard. The linear regression shows that this algorithm outperforms other similar algorithms. Table 7 shows the values of the parameters for the linear regression.

5.1 Sentence similarity

Tables 10, 11 and 12 contain the mean human sentence similarity values from the Pilot Short Text Semantic Similarity Benchmark Data Set by James O'Shea [26]. As Li [14] explains, when a survey was conducted with 32 participants to establish a measure for semantic similarity, they were asked to mark the sentences, not the words. Hence, word similarity is compared with R&G [16] whereas sentence similarity is compared with mean human similarity. Our algorithm's sentence similarity achieved a good Pearson correlation coefficient of 0.8794 with mean human similarity, outperforming previous methods: Li [14] obtained a correlation coefficient of 0.816 and Islam [29] obtained a correlation coefficient of 0.853. Out of the 65 sentence pairs, 5 pairs were eliminated because of their definitions from the Collins Cobuild dictionary [27]. The reasons and results are discussed in the next section.
ficient of 0.8794 with mean human similarity outperforming
coefficient of 0.8753 for word similarity concerning the
previous methods. Li [14] obtained correlation coefficient of
bechmark standard and 0.8794 for sentence similarity with
0.816 and Islam [29] obtained correlation coefficient of 0.853.
respect to mean human similarity.
Out of 65 sentence pairs, 5 pairs were eliminated because of
Future work includes extending the domain of algorithm
their definitions from Collins Cobuild dictionary [27]. The
to analyze Learning Objectives from Course Descriptions,
reasons and results are discussed in next section.
incorporating the algorithm with Bloom’s taxonomy will
also be considered. Analyzing Learning Objectives requires
6 D ISCUSSION ontologies and relationship between words belonging to the
particular field.
Our algorithm’s similarity measure achieved a good Pear-
son correlation coefficient of 0.8753 with R&G word pairs
[16]. This performance outperforms all the previous meth- ACKNOWLEDGMENTS
ods. Table 8 represents the comparison of similarity from
proposed method and Lee [28] with the R&G. Table 9 We would like to acknowledge the financial support pro-
depicts the comparison of algorithm similarity against Islam vided by ONCAT(Ontario Council on Articulation and
[29] and Li [14] for the 30 noun pairs and performs better. Transfer)through Project Number- 2017-17-LU,without their
For sentence similarity, the pairs 17: coast-forest, 24: lad- support this research would have not been possible. We
wizard, 30: coast-hill, 33: hill-woodland and 39: brother-lad are are also grateful to Salimur Choudhury for his insight on
not considered. The reason for this is, the definition of these different aspects of this project; [Link] team for
word pairs have more than one common or synonymous reviewing and proofreading the paper.
words. Hence, the overall sentence similarity does not reflect
the true sense of these word pairs as they are rated with
low similarity in mean human ratings. For example, the R EFERENCES
definition of ‘lad’ is given as: ‘A lad is a young man or [1] D. Lin et al., “An information-theoretic definition of similarity.” in
boy.’ and the definition of ‘wizard’ is: ‘In legends and fairy Icml, vol. 98, no. 1998, 1998, pp. 296–304.
stories, a wizard is a man who has magic powers.’ Both [2] A. Freitas, J. Oliveira, S. ORiain, E. Curry, and J. Pereira da Silva,
“Querying linked data using semantic relatedness: a vocabulary
sentences have similar or closely related words such as: independent approach,” Natural Language Processing and Informa-
‘man-man’, ‘boy-man’ and ‘lad-man’. Hence, these pairs tion Systems, pp. 40–51, 2011.
[3] V. Abhishek and K. Hosanagar, "Keyword generation for search engine advertising using semantic similarity between terms," in Proceedings of the Ninth International Conference on Electronic Commerce. ACM, 2007, pp. 89–94.
[4] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, and F. M. Couto, "Semantic similarity in biomedical ontologies," PLoS Computational Biology, vol. 5, no. 7, p. e1000443, 2009.
[5] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, "Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation," Bioinformatics, vol. 19, no. 10, pp. 1275–1283, 2003.
[6] T. Pedersen, S. V. Pakhomov, S. Patwardhan, and C. G. Chute, "Measures of semantic similarity and relatedness in the biomedical domain," Journal of Biomedical Informatics, vol. 40, no. 3, pp. 288–299, 2007.
[7] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. Petrakis, and E. E. Milios, "Semantic similarity methods in wordnet and their application to information retrieval on the web," in Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management. ACM, 2005, pp. 10–16.
[8] G. Erkan and D. R. Radev, "Lexrank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.
[9] Y. Ko, J. Park, and J. Seo, "Improving text categorization using the importance of sentences," Information Processing & Management, vol. 40, no. 1, pp. 65–79, 2004.
[10] C. Fellbaum, WordNet. Wiley Online Library, 1998.
[11] A. D. Baddeley, "Short-term memory for word sequences as a function of acoustic, semantic and formal similarity," The Quarterly Journal of Experimental Psychology, vol. 18, no. 4, pp. 362–365, 1966.
[12] P. Resnik et al., "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," J. Artif. Intell. Res. (JAIR), vol. 11, pp. 95–130, 1999.
[13] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[14] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138–1150, 2006.
[15] J. J. Jiang and D. W. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," arXiv preprint cmp-lg/9709008, 1997.
[16] H. Rubenstein and J. B. Goodenough, "Contextual correlates of synonymy," Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.
[17] C. T. Meadow, Text Information Retrieval Systems. Academic Press, Inc., 1992.
[18] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," International Journal on Artificial Intelligence Tools, vol. 13, no. 01, pp. 157–169, 2004.
[19] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using web search engines," in WWW, vol. 7, pp. 757–766, 2007.
[20] R. L. Cilibrasi and P. M. Vitanyi, "The google similarity distance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, 2007.
[21] G. A. Miller, "Wordnet: a lexical database for english," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[22] S. Bird, "Nltk: the natural language toolkit," in Proceedings of the COLING/ACL on Interactive Presentation Sessions. Association for Computational Linguistics, 2006, pp. 69–72.
[23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of english: The penn treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[24] T. Pedersen, S. Banerjee, and S. Patwardhan, "Maximizing semantic relatedness to perform word sense disambiguation," University of Minnesota Supercomputing Institute Research Report UMSI, vol. 25, p. 2005, 2005.
[25] L. Tan, "Pywsd: Python implementations of word sense disambiguation (wsd) technologies [software]," [Link], 2014.
[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "Pilot short text semantic similarity benchmark data set: Full listing and description," Computing, 2008.
[27] J. M. Sinclair, Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins ELT, 1987.
[28] M. C. Lee, J. W. Chang, and T. C. Hsieh, "A grammar-based semantic similarity algorithm for natural language sentences," The Scientific World Journal, vol. 2014, 2014.
[29] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 2, no. 2, p. 10, 2008.

Atish Pawar received the B.E. degree in computer science and engineering with distinction from Walchand Institute of Technology, India in 2014. He worked for Infosys Technologies from 2014 to 2016. He is currently a graduate student at Lakehead University, Canada. His research interests include machine learning, natural language processing, and artificial intelligence. He is a research assistant at the DataScience lab at Lakehead University.

Vijay Mago is an Assistant Professor in the Department of Computer Science at Lakehead University in Ontario, where he teaches and conducts research in areas including decision making in multi-agent environments, probabilistic networks, neural networks, and fuzzy logic-based expert systems. Recently, he has diversified his research to include natural language processing, big data and cloud computing. Dr. Mago received his Ph.D. in Computer Science from Panjab University, India in 2010. In 2011 he joined the Modelling of Complex Social Systems program at the IRMACS Centre of Simon Fraser University before moving on to stints at Fairleigh Dickinson University, University of Memphis and Troy University. He has served on the program committees of many international conferences and workshops. Dr. Mago has published extensively on new methodologies based on soft computing and artificial intelligence techniques to tackle complex systemic problems such as homelessness, obesity, and crime. He currently serves as an associate editor for BMC Medical Informatics and Decision Making and as co-editor for the Journal of Intelligent Systems.
