Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)
Architecture of Personalized Web Search Engine
Using Suffix Tree Clustering
Ajitha Annadurai Anitha Annadurai
MTech in Information Technology, Madras Institute of Data warehouse Engineer, Ness Technologies Ltd,
Technology, Anna University, Chennai, India Bangalore, India
[email protected] [email protected]Abstract— Web search engines are designed to serve all users, considered to be suitable for today’s internet users.
independent of the special needs of any individual user .The Unsupervised clustering techniques are not given pre-defined
objective of the project is to develop a personalized web search input categories like traditional method .So they are
engine which considers users interest and generates search considered to be suitable. Conversely, today’s search engines
results based on the user’s semantic profile. The proposed system pose more challenges demanding new innovations.
utilizes clustering and re-ranking algorithms in order to organize It is not possible to utilize traditional clustering algorithms
the web documents and provide an order to the results displayed for offline search clustering due to some realistic reasons.
to the user. Web crawlers are utilized to get the links, images and
allied information from the World Wide Web. The fetched
There are several other approaches to search results clustering
documents are further clustered using suffix tree clustering including Semantic On-line Hierarchical Clustering (SHOC)
algorithm, which enhances the performance of the web search approach [3], Tolerance Rough Set [5], and DisCover [6].
engine. The results are organized using Page Re-Rank algorithm There are academic search engine which uses several
which considers hyperlink and link structure information to clustering algorithms and can robotically organize small
bring an order to the web. The system creates a semantic profile collections of the web snippets into clusters [7].
of the user by monitoring and analyzing the users search history. Clustering of web page documents into a hierarchical
The search results generated will utilize an amalgamation of structure of cluster labels has been proposed by Chuang and
varied techniques including clustering, re-ranking and semantic Chien [2]. The binary tree structure is constructed using a
user profiles to enhance the performance of the web search
Hierarchical Agglomerative Clustering (HAC) [12] algorithm,
engine.
In, Yahoo [19] and DMOZ [4] a multiway-tree cluster
Keywords— Suffix tree clustering, base clusters, search engine. hierarchy is created from the binary-tree hierarchy. This is
achieved by partitioning the hierarchies in to create sub
I. INTRODUCTION hierarchies. Baeza-Yates et al. [1] proposed a method which
With an enormous growth of the Internet, most existing utilizes semantic information. Similar queries are grouped
search engines [10] such as Google, Yahoo and MSN present according to their semantics. In this method, a vector
users a single ordered linear list of pages along with their representation Q is created for a given search query q. The
partial content ranked by the relevance to the search query. vector Q has details from the user click actions. To find out
The query-list paradigm is adopted by a majority of search the similar queries, a similarity measure involving
engines. The internet users are forced to go through the long mathematical cosine function is applied to the query vectors.
list and examine the titles sequentially to identify their Zhang and Nasraoui [23] proposed methods to discover
required results. It is expected that the search engines should similar queries. The users’ sequential search behavior is used
not only return the most popular documents matching a query. as a factor in determining the similarity. It is assumed that
It is also expected to provide comprehensive domain successive user queries will be related to each other. Existing
information. Presentation of the web search results in a ranked content based similarity method is used in addition to above
list confines the search effectiveness to the first few described method to balance the high sparsity of log data
documents. Clustering the search results in to various collected in real time.
document groups has been identified to be a valuable solution Most of the clustering algorithms fail to suggest cluster
to overcome the above mentioned problem [18]. If the results topics for the collected documents, the categories include,
are organized in this manner, the users just need to select the agglomerative hierarchical clustering (AHC) and K-means
relevant cluster and browse for the needed document. clustering. The potential cluster labels produced by others
For decades, the most widely used post-retrieval document some clustering algorithms are not most comprehensible or
visualization technique is found to be clustering precise. Hence the search set was not reduced by the search
[13],[15],[17] .In most document clustering algorithms, engines. Suffix tree clustering (STC) algorithm is proposed to
documents which represents similar ideas are grouped in to surmount the above limitations.
similar clusters which greatly helps users to identify the topic In case of STC, the documents phrases are used to cluster
of interest. The traditional clustering methods [22] are not the pages. The phrases aids in preserving the semantic order
of the words. The phrase becomes a good candidate to be used
978-1-61284-653-8/11/$26.00 ©2011 IEEE 604
Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)
as cluster labels. The usage of phrases as cluster label “vultures hunt snakes too” is mentioned in the below
overcomes the disadvantages of the historical clustering discussion.
algorithms. The internal nodes are represented by circles and are
The foremost key component of the search engines is the labelled appropriately where as the leaf nodes are mentioned
ranking algorithm utilized to provide the search results in a by rectangles. The leaf nodes contain the suffixes of the
specific order. VMS models have been employed by the document. The identifiers in the leaf nodes indicate the
Conventional Information Retrieval (IR) systems [20] and the details about which document that suffix is present. The
systems used content similarity measures to compute rank of second identifier states the string index where the suffix starts.
web page documents. The similarity measure is calculated The circles represents an overlap phrase which at least any of
between the search query given by the user and web two suffixes shares. The number of documents shared by the
documents returned. The internet poses several issues with the internal nodes forms the basic factor to combine base clusters
mentioned methods. Like spamming the search engine with into a single cluster.
false user request often results in ranking results that are
incorrect. Algorithm which employs implicit details has been
found to overcome the issues and they are mostly based on the
links in the document.
The acknowledged algorithms include Page Rank [21] and
HITS (Hyperlink Induced Topic Search) [19]. Page Rank, is
authored by Google in its search engine ranking strategy. The
algorithm is not dependent on the search query. The ranking
can be performed offline and can be stored to handle further
expected user requests. In contrast, HITS algorithm is
performed online [20]. The HITS algorithm depends on the
user search requests which make it more accurate but in
addition it has lot of disadvantages. The computations
performed to calculate online ranking of the web pages
consumes a lot of time and the complexity involved is very
high. This results in intolerable query response time.
II. ARCHITECTURE OF THE PROPOSED SYSTEM
Fig.1 Generalized Suffix Tree Example
The proposed system is composed of the following
modules, the web crawler fetches all the links from the
Semantic information of the document is utilized to
external search interface and the fetched results are stored in a
automatically group the retrieved web pages in to clusters [11].
buffer. The web documents stored in the buffer are further
analysed and the corresponding clusters are formed using B. Biased Page Ranking Algorithm
suffix tree clustering algorithm [8].
The user profiles and the search history details of the user The resulted clusters are then processed by the ranking
will be maintained. The personalization detection module module. In addition to the textual context, the significance of
analyses the user profiles and the cluster and the search results the documents to the users search query are computed by
are generated based on the profile of the user. The results are considering the significance of the linked pages to the current
then re-ranked using page re-rank algorithm and the results page with respect to the given searched criteria.
are presented to the user. The system handles the semantic C. Semantic User Profiles
representation of the user, algorithms for clustering and re-
ranking and provides a personalization interface. The returned web search results are personalized further
based on the context of the user search [24]. The search
A. Enhanced Suffix Tree Clustering context is determined by analysing the user search history.
A suffix tree data structure is used extensively in string The operations performed by the user and the amount of time
matching and text mining applications. It is considered as a spent by the user in the document aids in analysing the context
compact trie of all the possible suffixes of a string. of the user. Each user profile created in the web server is
For any string S consisting of m-word, a suffix tree will be allotted to a user group, which also incorporates an access
a rooted directed tree with n leaves numbered 1 to n precisely. control list. This provides enhanced security to the
Each internal node has a minimum of two children and a personalized search engine.
nonempty sub-string of S is labelled at each of the edges. Two The semantic user profile created by analysing the search
edges are not supposed to have edge labels starting with the history log of the user is predicted to give accurate
same word. information about the context of the users search. Security to
A suffix tree constructed from three simple documents the user groups are provided by using access control
“Vultures hunt insects”, “snakes hunt insects too”, and mechanisms and by authorizing the users with the credentials.
978-1-61284-653-8/11/$26.00 ©2011 IEEE 605
Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)
Personalization STC
Query Processing Detection User
User Module Module group
Stemmer
Stop
Words Clustering Module
User Profiling & Text
Mining Module
External Search
User Semantic Engine Interface
Search User, User User
History query, Profiles context Crawler
operation
Returned Labels,
Result Set Presentation Biased Reranking Module
User Web Documents pages
Module
Fig. 2 Proposed Architecture
III. OVERVIEW OF THE ALGORITHM score below a threshold value. The score for each of the base
The STC clustering algorithm is organized into three main clusters is defined by the following:
steps including processing the retrieved web document,
identifying the Phrase clusters and merging the phrase clusters. s(n) = |n| · f(|np|) · ∑ zfidf(wi) (1)
A. Processing the retrieved web document Where |n| represents number of documents in the identified
This process positively influences the quality of discovered phrase cluster n, wi represents the words in np, zfidf (wi)
resultant clusters. Each web page’s title and snippet in the represents the score for each word in the cluster np, and |np|
search results are taken as a good summary of its content, and represents the number of nonstop-words .The function f is
used as a “document” to be fed to the clustering algorithm. used to penalize phrases containing single word.
Web snippet is parsed and split into sentences according to For phrases which are 2 to 6 words long, the function is
punctuations (comma, full stop, semicolon and question mark defined linearly but is constant if the identified phrases are
symbol). Each sentence boundary is marked. The non-word longer than six words. A minimal threshold value is used to
tokens (such as HTML tags, numbers, entities and non-letter determine if a cluster can be a base cluster.
characters except for sentence boundaries) are stripped. The
classic Porter’s stemmer algorithm is used to stem words for C. Merging the phrase clusters
English documents. Stop words (such as article, pronouns, If the base clusters share a common phrase, it is
auxiliary verbs and prepositions etc.) existing in a predefined considered for merging into a common cluster by considering
list are removed. the following measure.
B. Identifying the Phrase clusters sim (ni, nj) = 1 if |ni. nj| / |ni| > α
In this phase the suffix tree is constructed by using words & |ni . nj| / |nj| > α
as the fundamental items. All the sentences of each of the sim (ni, nj) = 0 otherwise (2)
collected documents are inserted into any of the nodes of the
tree. Where α is a constant between 0 and 1
The indexed internal nodes will be updated to point to the The threshold value mirror the degree of overlap between
current documents while rearranging the tree. The suffix tree base cluster ni and nj. In the phrase cluster graph, the nodes
is constructed in such a way that all the paths from the root represent the phrase clusters, and two nodes are connected if
node till the end represent a phrase. Additionally, algorithms and only if the measure defining the similarity between the
with n-gram approach used to find out all phrases will result two phrase clusters is 1. A merged cluster is represented as a
in creating lot of base clusters. It is ideal to utilize a statistic to connected component in the phrase cluster graph.
the identified base clusters and eliminate base cluster with
978-1-61284-653-8/11/$26.00 ©2011 IEEE 606
Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)
The phrase cluster label has been determined by using Fig. 3 Web Search Engine Interface
methods employing lexical analysis and using the frequency The experiments are conducted with the web documents
of occurrence of the phrases [5] [9] [16].Nevertheless, the retrieved by the web crawlers using the search engine APIs.
labels identified by these methods do not help to identify the There are provisions in the system to process the documents
applicable documents clusters. stored in the buffer.
A phrase is considered as a cluster label by using the The following Fig. 4 represents the search engine input.
following formula: The search engine input consists of url, title and the web
document snippet.
L (P) =αf (P) + (1-α) S (P)/Smax(P) (3)
The function considers the phrase length P and the number
of times it appears in the documents. For longer phrases, the
function is represented by the statistic of words that appear in
the latent semantic space S (P). The parameters, represents
the preference coefficient ranging from 0 to 1
D. Biased Page Ranking Algorithm
The Page Rank algorithm, PR (A) is defined as follows:
PR (W) = (1-d) +d∑iPR (Wi /C (Li) (4)
Where PR(W) represent the Page Rank of web page W and
PR(W ) represents the Page Rank of web pages L which are Fig. 4 An example of Web page input
i i
linked to page W, The damping constraint, d is a typically set
between 0 and 1. The number of outbound links is represented The main results page displays the number of documents
by C (L ). retrieved and the number of clusters found as shown in Fig. 5.
i Within each cluster the results are ranked and are personalized
In Biased page rank algorithm important pages are given based on the profile of the user.
larger rank values rather than giving the same importance to
each of the linked pages. Thus, Win and Wout represents the
importance of incoming and outgoing links. The rank is
determined by
PR (W) = (1-d) +d∑iPR (Wi) WinWout (5)
IV. EVALUATION OF EXPERIMENTAL RESULTS
The experimental evaluation ensures that the proposed
architecture considerably minimizes the users response time
for the given search engine query and assists the user in the
relevant information by employing clustering and ranking
techniques.
The web search engine interface is given in Fig. 3. A session
starts with the user entering her query in the query box Fig.5 Formation of clusters
provided by the search engine [14]. The user enters the query
and the relevant clusters are identified. The user profile is also The semantic profile along with the context information
formed by the various modules of the search engine. The aids in personalizing the search engine and thereby reduces
returned results are made specific to the user’s interest. the number of web pages the web user has to review to find
the information needed. This approach increases the search
engines response time
The search results displayed to user will be organized
into clusters which assist the user in finding the relevant
information easily. The list of clusters formed for the search
query is shown in figure 6.
978-1-61284-653-8/11/$26.00 ©2011 IEEE 607
Proceedings of 2011 International Conference on Signal Processing, Communication, Computing and Networking Technologies (ICSCCN 2011)
REFERENCES
[1] R.A. Baeza-Yates, C.A. Hurtado, and M. Mendoza, “Query
Recommendation Using Query Logs in Search Engines,” Proc.EDBT
Workshop, vol. 3268, pp. 588-596, 2004.
[2] S. Chuang and L. Chien, “Automatic Query Taxonomy Generation for
Information Retrieval Applications,” Online Information Rev.vol. 27, no. 4,
pp. 243-255, 2003.
[3] Dell Zhang, Yi-sheng Dong, “Semantic, Hierarchical, Online Clustering
of Web Search Results”,Proceedings of the 6th Asia Pacific Web Conference,
Hangzhou, China, vol. 3007, pp. 69-78, 14-17 Apr 2004.
[4] http://www.dmoz.org/, 2008.
[5] P.W. Foltz and S.T. Dumais, “Personalized Information Delivery: An
Analysis of Information Filtering Methods,” Comm. ACM,vol. 35, no. 12, pp.
51-60, 1992.
[6] Y. Fu, S. Yan, and T. Huang, “Correlation metric for generalized feature
extraction,” IEEE Trans. Pattern Anal. Mach. Intell., pp.2229–2235, 2008.
[7] M. A. Hearst and J. O. Pedersen, “Reexamining the cluster hypothesis:
Fig. 6 Search Engine results Scatter/Gather on retrieval results”,Proceedings of the F 19th International
It is observed that the search engines built using STC is ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR’96), 1996, pp.76-84.
found to give faster responses to users than the ones using [8] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma,
traditional clustering approaches. The given plot shows the “Learning to cluster web search results”, Proceedings of SIGIR’04, Sheffield,
time taken to process the documents using STC verses South Yorkshire, pp. 210-217, July 25-29, 2004
traditional methods. [9] Hung Chim, Xiaotie Deng, “A New Suffix Tree Similarity Measure for
Document Clustering”,Proceedings of the 16th international conference on
World Wide Web (WWW), Banff, Alberta, Canada,pp. 121-130, 8-12 May,
2007.
[10] Kummamuru, K., Lotlikar, R. Roy, S., Singal, K.,Krishnapuram, R., “A
hierarchical monothetic document clustering algorithm for summarization and
browsing search results”, Proceedings of the 13th international conference on
World Wide Web, ACM Press, pp. 658-665, 2004.
[11] Lang, N.C., “A tolerance rough set approach to clustering web search
results”, Master’s thesis,Faculty of Mathematics, Informatics and
Mechanics,Warsaw University, 2004.
[12]B. Mirkin, Mathematical Classification and Clustering. Kluwer, 1996.
[13] H. T. Nguyen and A. Smeulders, “Active learning using pre-clustering,”
in Proc. Int. Conf. Machine Learning, 2004, pp. 623–630.
[14] B. Smyth et al., “Exploiting Query Repetition and Regularity in an
Adaptive Community-Based Web Search Engine,” User Modeling
and User-Adapted Interaction, vol. 14, no. 5, pp. 383-423, 2005.
[15] M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa,“A
Framework for the Evaluation of Session Reconstruction Heuristics in Web
Usage Analysis,” INFORMS J. Computing,no. 2, p. 15, 2003.
[16] Sven Meyer zu Eissen, Benno Stein, Martin Potthast, “The Suffix Tree
Fig. 7 Response time versus No of documents processed Document Model Revisited”,Proceedings of the 5th International Conference
on Knowledge Management, Universal Computer Science, pp. 596-603, 2005.
V CONCLUSION [17] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations. Morgan Kaufmann,
The paper proposed a novel architecture for implementing 2000.
a personalized web search engine. The web documents are [18] Xuanhui Wang. ChengXiang Zhai, “Learn from Web Search Logs to
fetched by using crawlers and are further organized in to Organize Search Results”, SIGIR, July 23- 27,2007, Amsterdam, pp.87.94.
clusters which aid the user in finding the required details [19] http://www.yahoo.com/, 2008.
[20] Y. Yang and B. Padmanabhan, “Segmenting Customer Transactions
faster. Extended suffix tree clustering algorithm is proposed Using a Pattern-Based Clustering Approach,” Proc. Third
which analyzes the retrieved web documents semantically and IEEE Int’l Conf. Data Mining (ICDM), 2003.
figures descriptive meaningful phrases as cluster labels. In [21] C.T. Yu, W. Meng, W. Wu, and K.-L. Liu, “Efficient and Effective
addition, biased page ranking algorithm is used to order the Metasearch for Text Databases Incorporating Linkages among
Documents,” ACM SIGMOD, pp. 187-198, 2001.
search results based on the hyperlink information. Australia, pp. ,6-54, 1998.
The search results are further personalized by creating [22] Zamir O., Etzioni O., “Web Document Clustering: A Feasibility
semantic user profiles using the user search history and the Demonstration”, Proceedings of the 19th International ACM SIGIR
user groups. The experimental results demonstrate that the Conference on Research and Development of Information Retrieval
(SIGIR’98), 1998,pp.46-54.
architecture enhances the search engine performance by [23] Z. Zhang and O. Nasraoui, “Mining Search Engine Query Logs for
employing state-of-art techniques and drastically reduces the Query Recommendation,” Proc. 15th Int’l World Wide Web Conf.
user’s response time and the users are able to get the (WWW),2008.
interested information quickly. In the future, extensive [24] G.K. Zipf, Human Behavior and the Principle of Least Effort. Addison-
Wesley, 1949.
researches should be performed to optimize and evaluate the
efficiency of the proposed search engine architecture.
978-1-61284-653-8/11/$26.00 ©2011 IEEE 608