MEASURING SIMILARITY BETWEEN
QUESTION PAIRS IN ONLINE FORUMS

1st Pramod Kumar Rai
Dept. of Computer Science and Engineering
National Institute of Technology Agartala
Agartala, India
[email protected]

2nd Kunal Chakma
Dept. of Computer Science and Engineering
National Institute of Technology Agartala
Agartala, India
[email protected]
Abstract—Two questions asking the same thing can have different vocabulary sets and syntactic structures, which makes detecting the semantic equivalence between the sentences challenging. In online user forums like Quora, Stack Overflow, Stack Exchange etc. it is important to maintain a high-quality knowledge base by ensuring that each unique question exists only once. Writers should not have to write the same answer to each of the similar questions, and the reader must get a single page for the question they are looking for. For example, consider questions like "What are the best ways to lose weight?", "How can a person reduce weight?", and "What are effective weight loss plans?" to be duplicate questions because they all have the same intent. The work presented in this report is an attempt to study, analyze and accordingly propose methods for finding the similarity of questions posted on online forums.

Index Terms—Text Similarity, Text Mining, NLP
I. INTRODUCTION

In the last two decades, the internet or the web has evolved into a reservoir of information and its growth has been phenomenal. Due to this, there are today large numbers of texts or documents available on the web, some of which may be redundant or carry similar information. So, getting the appropriate documents from the web as per a user's requirement is difficult. For this, there are different techniques available through which we can retrieve the most relevant documents from the web. Therefore, finding similarity between words, sentences, paragraphs and documents is an important part of various tasks such as retrieving information from the web, text clustering (document organization), automatic grade assignment to essays, short answer scoring, machine translation and text summarization. Text similarity means how semantically close two or more documents are to each other with respect to an information requirement. For example, an information requirement about Apple the fruit and Apple Inc. the company is not similar, whereas an information requirement for Apple iPhone and Google Pixel is similar. A user's information requirement is fulfilled by retrieving those documents whose contents satisfy the question or query of the user. Such documents are considered the most relevant documents with respect to the user's queries. Text similarity is also used to categorize texts as well as documents. We can also evaluate the similarity between sentences, words, paragraphs and documents in order to classify them in an appropriate way. On the basis of this classification of texts and documents, we can get the most relevant documents corresponding to what the user wants.

Several levels of similarity exist in natural languages: word level, phrase level, sentence level and document level. Words can be classified into parts of speech like noun, pronoun, adjective, adverb etc., or they may be classified into synonyms and antonyms, which is based on the similarity between words. Similarity between documents is the basis of text classification as well as text clustering. Sentence similarity lies between word similarity and document similarity. The methods for calculating similarity differ at each of these levels.
A. Word similarity

Similarity between words can be calculated from either the spelling of the words or their meaning. Edit distance can be used to measure the similarity between words based on their spelling. Intuitively, if two words are similar in spelling, they are likely to be similar in meaning, or it can be said that they are synonyms of each other.

B. Sentence similarity

The similarities between the words of two sentences have a great impact on the similarity between the two sentences. Words and their order in the sentences are important factors for calculating sentence similarity.

C. Document similarity

The similarity between words and sentences has a great impact on the similarity between documents. Commonly used approaches are often based on the similarity between keyword sets (e.g., Dice similarity) or the similarity between vectors of keywords (e.g., cosine similarity).
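As a minimal illustration of the keyword-set view mentioned above, the following Python sketch (not part of the original experiments; the example texts are made up) computes the Dice coefficient over the word sets of two short texts:

def dice_similarity(text_a, text_b):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|) over the word sets of two texts."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # both empty: treat as identical
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity("apple iphone price in india",
                      "price of apple iphone"))  # ~0.67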
II. RELATED WORK

Previous works have described different approaches and techniques to measure the similarity between short text parts or chunks. Measuring similarity between long texts has been used in information retrieval; it mainly relies on the numerical, graphical and statistical information of keywords in the long texts. Keywords are generally selected on the basis of weighting schemes. Sentence similarity has been used in machine translation, translation memory, text summarization, text categorization, question answering and even image search on the Web.

Related works can roughly be classified into the following major categories:

A. Word co-occurrence methods

This approach is mainly used in Information Retrieval (IR) systems. In this method we take a list of meaningful words and treat every query as a document. A vector is then defined for the query and for the documents (from a large corpus). The most appropriate texts or documents are retrieved based on how similar the query vector and the document vectors are.
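A minimal sketch of this idea, assuming scikit-learn is available and using made-up example documents, builds term-count vectors for a small corpus and ranks the documents by their cosine similarity to the query vector:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in practice the vectors would be built from a large document collection.
docs = ["how can a person reduce weight",
        "best places to visit in india",
        "effective weight loss plans"]
query = "effective ways to lose weight fast"

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(docs)      # one count vector per document
query_vector = vectorizer.transform([query])      # query mapped into the same space

scores = cosine_similarity(query_vector, doc_vectors)[0]
# The weight-loss document should rank highest for this query.
print(scores, "-> most relevant document:", docs[scores.argmax()])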
B. Similarity based on a lexical database

In the lexical database methodology, similarity is measured using a predefined word hierarchy. In this hierarchy, words, their meanings and their relationships with other words are stored in a tree-like structure [15]. When comparing two words, the method takes into account the path distance between the words in this hierarchy.
C. Method based on web search engine results

The third method calculates the similarity between texts as relatedness based on web search engine results, utilizing the total number of search results [15]. We have implemented this methodology to calculate the Google Similarity Distance [15]. The search engines that we used for this methodology are Google, Bing, Ask etc.
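For reference, the Google Similarity Distance (also called the Normalized Google Distance) mentioned above is commonly written as follows; this formula is quoted from the general literature rather than from the original report. Here f(x) and f(y) are the page counts returned for the terms x and y, f(x, y) is the page count for the combined query, and N is the total number of pages indexed by the search engine:

    NGD(x, y) = [max(log f(x), log f(y)) - log f(x, y)] / [log N - min(log f(x), log f(y))]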
III. LEXICAL SIMILARITY METHODS USED FOR MEASURING TEXT SIMILARITY

Lexical similarity measures how far the word sets of two given strings overlap. A lexical similarity of 1 means that the word sets coincide completely, while a lexical similarity of 0 means that the two texts have no words in common. This family of methods verifies the similarity between two texts at the character level; for example, IIT and NIT are lexically similar to each other because their sequences of characters are approximately in the same order.
A. Longest Common Subsequence similarity

Longest Common Subsequence (LCS) matching is a commonly used technique to measure the similarity between two strings. One way to measure the similarity of two or more sequences is to find their longest common subsequence, i.e. a subsequence of maximum possible length that is common to both sequences. For example, for

s1 = can i transfer my wallet balance to my bank account
s2 = can i transfer my bank balance to my wallet

the LCS of the two sequences is "can i transfer my balance to my".
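A minimal dynamic-programming sketch of word-level LCS (illustrative, not taken from the report's implementation):

def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCS length of a[:i], b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover one longest common subsequence.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(out))

s1 = "can i transfer my wallet balance to my bank account".split()
s2 = "can i transfer my bank balance to my wallet".split()
print(" ".join(lcs(s1, s2)))  # can i transfer my balance to my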
B. N-gram Similarity

Given a sequence of text, an n-gram is a contiguous sub-sequence of n items taken from that sequence [1]. In the n-gram similarity technique, the similarity of two strings is computed on the basis of how close (or how far apart) their n-grams are.

An N-gram of size 1 is called a unigram.
An N-gram of size 2 is called a bigram.
An N-gram of size 3 is called a trigram.

For example, for the sentence "We are going to school", if we take N=2 (bigrams), the n-grams would be: "We are", "are going", "going to", "to school".
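A short sketch that generates word n-grams and scores two sentences by how many bigrams they share (the Dice-style overlap used here is only one of several possible ways to compare the n-gram sets):

def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_similarity(a, b, n=2):
    """Fraction of n-grams shared by two token lists (Dice-style overlap)."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

sent = "We are going to school".split()
print(ngrams(sent, 2))
# [('We', 'are'), ('are', 'going'), ('going', 'to'), ('to', 'school')]
print(ngram_similarity(sent, "We are going to college".split()))  # 0.75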
C. Levenshtein distance Similarity

The Levenshtein distance technique uses a distance measure to quantify the similarity between two given strings. This distance counts the minimum number of operations (insertion, deletion and substitution) needed to transform one string into the other. For example, the Levenshtein distance between "bitten" and "sitting" is 3:

1. bitten -> sitten (substitution of s for b)
2. sitten -> sittin (substitution of i for e)
3. sittin -> sitting (insertion of g at the end)
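A compact sketch of the standard two-row dynamic-programming computation of this distance (illustrative only):

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("bitten", "sitting"))  # 3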
D. Jaro Distance Similarity

This algorithm determines the similarity between two strings on the basis of their common characters [1]. The higher the Jaro score for two strings, the more similar the strings are. When the calculated result is 0, there is no similarity between the two texts, and when the result is 1, the two strings can be considered identical.
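A sketch of the usual Jaro computation (matching characters within a window, then counting transpositions); this is an illustrative implementation rather than the one used in the report:

def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for completely different ones."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # characters match if they are equal and close enough
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                 # count transpositions among the matched characters
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444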
E. Cosine Similarity

Cosine similarity is an algorithm through which we can find the similarity between two texts. We represent the texts as non-zero vectors and measure the angle between these vectors. It is a similarity function that is often used in Information Retrieval; in the IR setting it measures the angle between two document vectors. If the cosine similarity between two documents is higher, the documents have more words in common, and if it is lower, they have fewer words in common.
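A minimal sketch that builds term-frequency vectors with plain Python and computes the cosine of the angle between them (the example sentences are illustrative):

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("How can a person reduce weight",
                        "What are the best ways to lose weight"))  # ~0.14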
F. Jaccard Similarity

This similarity algorithm is also known as intersection over union. It is used to find the similarity between sets of words. If two sets A and B are given and we have to calculate the Jaccard similarity, we take the intersection of the two sets and divide it by their union.
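A short sketch of Jaccard similarity over the word sets of two texts (the example sentences are illustrative):

def jaccard_similarity(text_a, text_b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the word sets of two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard_similarity("How can a person reduce weight",
                         "What are the best ways to lose weight"))  # ~0.08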
IV. SEMANTIC SIMILARITY METHODS USED FOR MEASURING TEXT SIMILARITY

Semantic similarity between concepts is a measure of the semantic similarity, or semantic distance, between two concepts according to a given ontology [6]. In other words, semantic similarity is used to detect the common characteristics between certain concepts or documents. Semantic similarity methods are used intensively in most semantic and knowledge-based information search systems (to identify an optimal match between query terms and documents) [6]. Semantic similarity and semantic relatedness are two related notions, but semantic similarity is more specific than relatedness and can be considered a type of semantic relatedness [6]. For example, student and teacher are related terms that are not similar. Semantic similarity and semantic distance are defined inversely to each other: let s1 and s2 be two concepts belonging to two different nodes n1 and n2 in a given ontology; then the distance between the nodes n1 and n2 determines the similarity between the two concepts s1 and s2 [6]. Each of n1 and n2 can be considered a concept node of the ontology that contains a set of synonymous terms [6]. Two terms are synonymous if they lie in the same node, and in that case their semantic similarity is maximal [6].
A. Latent Semantic Analysis (LSA)

The formal assumption behind LSA is that the psychological similarity between any two words is reflected in the way they occur together in small sub-samples of language [8]. In this method we build a matrix in which words are placed as rows and contexts as columns [8]. Contexts can be anything, for example journalistic articles, textbooks or student essays, and the words are simply those that appear in the training set [8]. It is important to underline that the contexts with which the model is provided will determine the types of words with which it has experience, so the training set must be relevant to the task that the model must perform [8]. The first step is to associate each word with the contexts in which it is likely to appear [8]. In addition to recording the frequency with which a given word appears in certain texts, the model weights the entries to reflect the diagnostic ability of a word for a given context [8]. For example, a word that appears in a large number of very different contexts is not as diagnostic as a word that occurs less frequently and only in a small set of similar contexts [8].
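In practice, LSA is usually implemented by applying a truncated singular value decomposition to a weighted term-context matrix. The following sketch, which assumes scikit-learn and uses a tiny made-up corpus, projects documents into a low-dimensional latent space and compares them there:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative corpus; a real LSA space would be trained on a much larger collection.
corpus = ["how can a person reduce weight",
          "what are the best ways to lose weight",
          "how do i open a bank account",
          "which bank gives the best savings account"]

X = TfidfVectorizer().fit_transform(corpus)                         # term-document matrix
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # latent document vectors

# Same-topic questions tend to end up closer together in the latent space.
print(cosine_similarity(Z[0:1], Z[1:2]))  # weight-loss pair
print(cosine_similarity(Z[0:1], Z[2:3]))  # weight-loss question vs banking question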
B. Hyperspace Analogue to Language (HAL)

HAL uses word co-occurrences to form a semantic space. In this method a matrix is created in which each row and each column represents a word, and each element of the matrix represents the strength of association between the corresponding row and column words. As the text is analyzed, a focus word is selected and compared with the nearby words, which are called co-occurring words [2]. The co-occurrence weight is inversely proportional to the distance from the focus word, and these values are stored in the matrix.
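A HAL-style sketch (simplified and illustrative; real HAL also distinguishes words appearing before and after the focus word) that accumulates distance-weighted co-occurrence counts inside a sliding window:

from collections import defaultdict

def hal_weights(tokens, window=3):
    """Co-occurrence weights: closer neighbours of the focus word get a higher weight."""
    weights = defaultdict(float)
    for i, focus in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                weights[(focus, tokens[i + d])] += window - d + 1  # weight shrinks with distance
    return weights

tokens = "the dog chased the cat and the cat ran away".split()
w = hal_weights(tokens)
print(w[("the", "cat")], w[("the", "away")])  # 6.0 1.0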
C. Semantic Similarity using Web Search Engines

This approach uses the results of a web search to find the semantic relationship between two words, relying on the page counts and the text snippets returned by the search engine. Suppose the similarity between two words A and B has to be found. The method searches for A and B separately and records the number of result pages; in the same way, the combined query of A and B is issued and its page count is determined. The snippets are used to find the term frequencies of A and B, and the observed patterns are ranked according to their ability to relate different words semantically. Finally, the similarity scores obtained from the page counts and from the snippets are integrated using support vector machines to evaluate the semantic similarity between the words.
D. Knowledge-based Similarity

Ontologies, taxonomies and semantic networks are forms of knowledge representation that are used in information retrieval, and these representations are combined with various methods to find the semantic similarity between different terms or concepts. Knowledge-based similarity is one of the semantic similarity measures; it uses information derived from semantic networks to identify the degree of similarity between words [2]. WordNet is the most widely used semantic network. It is a large lexical database of English in which verbs, nouns, adverbs and adjectives with similar senses are grouped into sets called synsets, and these synsets are connected to each other by conceptual-semantic and lexical relations. Knowledge-based similarity measures can be classified into measures of semantic similarity and measures of semantic relatedness.
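A small sketch of a knowledge-based measure using WordNet through NLTK (assumes the WordNet corpus has been downloaded with nltk.download('wordnet'); the chosen synsets are illustrative):

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')

# Path similarity is based on the shortest path between synsets in the WordNet hierarchy.
print(dog.path_similarity(cat))  # relatively high: both are animals
print(dog.path_similarity(car))  # lower: the concepts are far apart in the hierarchy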
V. DATASET

The dataset used in this work contains 400,000 question pairs, and each question carries its own Id number. The final field is a binary value (0 or 1) that indicates whether the question pair is duplicate or not.

This is the first public dataset released by Quora and it is related to the problem of identifying duplicate questions. In Quora, an important product principle is that there must be a single unambiguous question page for each logically distinct question. For example, the questions "What is the most populous state in the United States?" and "What is the state with the most people in the United States?" should not exist separately on Quora because the intent of both is identical. Having a canonical page for each distinct query makes knowledge sharing more efficient in many ways: for example, knowledge seekers can access all the answers to a question in one place, and authors can reach a greater number of readers than if the audience were divided across several pages. The dataset is based on actual data from Quora and gives anyone the opportunity to train and test models of semantic equivalence.
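A sketch of loading the data with pandas; the file name and the column names (question1, question2, is_duplicate) assume the publicly released Quora question-pairs CSV and should be adjusted to the local copy:

import pandas as pd

df = pd.read_csv("questions.csv")  # assumed file name for the Quora question-pairs release

print(len(df))                                           # number of question pairs
print(df[["question1", "question2", "is_duplicate"]].head())
print(df["is_duplicate"].value_counts())                 # how many pairs are labelled duplicate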
A. Some Quora question pairs from the dataset with their binary labels

[Figure: question.png — sample question pairs from the dataset with their duplicate labels]

VI. EXPERIMENT AND RESULT

A. Measuring Similarity between two sentences using the Cosine similarity method

Cosine similarity represents the two texts as non-zero vectors and measures the angle between them. It is a similarity function that is often used in Information Retrieval and is normally used in the context of text mining for comparing documents or emails. If the cosine similarity between two documents is higher, then both documents have a larger number of words in common [17].

1) Results: In the table below I have taken two questions from the dataset and measured the similarity between them by applying the Cosine similarity method.

[Figure: similarity.png — cosine similarity score for the selected question pair]
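Since the exact pair used in the table is only shown in the figure, the following sketch applies the same idea to an illustrative pair of questions using scikit-learn's count vectors:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative question pair (not necessarily the one shown in the figure above).
q1 = "How can I be a good geologist?"
q2 = "What should I do to be a great geologist?"

vectors = CountVectorizer().fit_transform([q1, q2])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(round(score, 3))  # ~0.34 for this pair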
B. Measuring Similarity between two sentences using the Jaccard similarity method

This similarity algorithm is also known as intersection over union. It is used to find the similarity between sets of words. The Web Jaccard coefficient can be computed as the number of elements in the intersection set divided by the number of elements in the union set [1].

1) Results: In the table below I have taken two questions from the dataset and measured the similarity between them by applying the Jaccard similarity method.

[Figure: similarity.png — Jaccard similarity score for the selected question pair]
VII. CONCLUSION

Measuring the similarity between words, sentences, documents and concepts is an important part of various tasks such as information retrieval, automatic essay scoring, short answer grading, document clustering, machine translation, web mining and text summarization, each of which uses different similarity techniques. So far I have used two techniques to check the similarity between two sentences: Cosine similarity and Jaccard similarity.
VIII. REFERENCES

[1] Pradhan, Nitesh, Manasi Gyanchandani, and Rajesh Wadhvani (2015), "A Review on Text Similarity Technique used in IR and its Application," International Journal of Computer Applications 120(9).
[2] Gomaa, Wael H., and Aly A. Fahmy (2013), "A survey of text similarity approaches," International Journal of Computer Applications 68(13): 13-18.
[3] Gupta, Aditi, et al. (2017), "A Survey on Semantic Similarity Measures," IJIRST - International Journal for Innovative Research in Science Technology 3: 12.
[4] Rensch, C. R. (1992), "Calculating lexical similarity," Windows on Bilingualism, 13-5.
[5] Mihalcea, R., Corley, C., and Strapparava, C. (2006), "Corpus-based and knowledge-based measures of text semantic similarity," In AAAI 2006 (Vol. 6, pp. 775-780).
[6] Slimani, T. (2013), "Description and evaluation of semantic similarity measures approaches," arXiv preprint arXiv:1310.8059.
[7] Zhang, J., Sun, Y., Wang, H., and He, Y. (2011), "Calculating statistical similarity between sentences," Journal of Convergence Information Technology 6(2).
[8] Simmons, S., and Estes, Z. (2006), "Using latent semantic analysis to estimate similarity," In Proceedings of the Cognitive Science Society (pp. 2169-2173).
[9] Ramaprabha, J., Das, S., and Mukerjee, P. (2018), "Survey on Sentence Similarity Evaluation using Deep Learning," In Journal of Physics: Conference Series (Vol. 1000, No. 1, p. 012070). IOP Publishing.
[10] http://www.stokastik.in/dynamic-programming-in-natural-language-processing-longest-common-subsequence/
[11] https://dataconomy.com/2015/04/implementing-the-five-most-popular-similarity-measures-in-python
[12] https://www.wikipedia.org/
[13] https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html
[14] Achananuparp, P., Hu, X., and Shen, X. (2008), "The evaluation of sentence similarity measures," In International Conference on Data Warehousing and Knowledge Discovery (pp. 305-316). Springer, Berlin, Heidelberg.
[15] Pawar, A., and Mago, V. (2018), "Calculating the similarity between words and sentences using a lexical database and corpus statistics," arXiv preprint arXiv:1802.05667.
[16] https://www.listendata.com/2015/09/text-mining-basicsl
[17] https://github.com/tim5go/quora-question-pairs