Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Ramamohanarao Kotagiri
P. Radha Krishna
Mukesh Mohania
Ekawit Nantajeewarawat (Eds.)
Advances in Databases:
Concepts, Systems
and Applications
Volume Editors
Ramamohanarao Kotagiri
The University of Melbourne
Department of Computer Science and Software Engineering
Victoria 3010, Australia
E-mail: [email protected]
P. Radha Krishna
Institute for Development and Research in Banking Technology
Masab Tank, Hyderabad 500 057, Andhra Pradesh, India
E-mail: [email protected]
Mukesh Mohania
IBM India Research Laboratory
Institutional Area, Vasant Kunj, New Delhi 110 070, India
E-mail: [email protected]
Ekawit Nantajeewarawat
Thammasat University - Rangsit Campus
Sirindhorn International Institute of Technology
Pathum Thani 12121, Thailand
E-mail: [email protected]
ISSN 0302-9743
ISBN-10 3-540-71702-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-71702-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12043323 06/3142 543210
Preface
The conference was sponsored by IBM, Thailand, the Database Society of Japan, the Korea
Information Science Society, the National Electronics and Computer Technology Center,
and the Software Industry Promotion Agency.
Conference Chair
Vilas Wuwongse Asian Institute of Technology, Thailand
Industrial Co-chairs
Prasan Roy IBM India Research, India
Masashi Tsuchida Software Division, Hitachi, Ltd., Japan
Panel Co-chairs
Sourav Bhowmick NTU, Singapore
Masaru Kitsuregawa University of Tokyo, Japan
Publication Chair
P. Radha Krishna Institute for Development and Research in
Banking Technology, India
Publicity Co-chairs
Chin-Wan Chung KAIST, Korea
Qing Li City University of Hong Kong, PRC
Regional Chairs
Asia Yasushi Kiyoki Keio University, Japan
Australia and New Zealand Millist Vincent University of South
Australia, Australia
Europe Michael Schrefl University of Linz, Austria
USA Sanjay Madria University of
Missouri-Rolla, USA
Program Committee
Akiyo Nadamoto NICT, Japan
Amol Deshpande University of Maryland at College Park, USA
Anirban Mondal University of Tokyo, Japan
Arkady Zaslavsky Monash University, Australia
Arnd Christian Konig Microsoft Research, USA
Atsuyuki Morishima University of Tsukuba, Japan
Bala Iyer IBM, USA
Balaraman Ravindran IIT Madras, India
Barbara Catania University of Genoa, Italy
Charnyot Pluempitiwiriyawej Mahidol University, Thailand
Chiang Lee National Cheng Kung University, Taiwan
Cholwich Nattee Thammasat University, Thailand
Chutiporn Anutariya Shinawatra University, Thailand
Dan Lin National University of Singapore, Singapore
David Embley Brigham Young University, USA
David Taniar Monash University, Australia
Dimitrios Gunopulos UCR, USA
Egemen Tanin University of Melbourne, Australia
Elena Ferrari University of Insubria, Italy
Ernesto Damiani University of Milan, Italy
Evaggelia Pitoura University of Ioannina, Greece
Gao Cong University of Edinburgh, UK
Gill Dobbie University of Auckland, New Zealand
Gunther Pernul University of Regensburg, Germany
Haibo Hu Hong Kong University of Science and
Technology, China
Haixun Wang IBM T.J. Watson Research Center, USA
Hakan Ferhatosmanoglu Ohio State University, USA
Hayato Yamana Waseda University, Japan
Heng Tao Shen University of Queensland, Australia
H.V. Jagadish University of Michigan, USA
Hyunchul Kang Chung-Ang University, South Korea
External Referees
Alex Liu University of Texas, USA
Amit Garde Persistent Systems, India
Chavdar Botev Cornell University, USA
Fan Yang Cornell University, USA
Feng Shao Cornell University, USA
L. Venkata Subramaniam IBM IRL, India
Man Lung Yiu University of Hong Kong, China
Meghana Deodhar IBM IRL, India
Mirek Riedewald Cornell University, USA
Panagiotis Karras University of Hong Kong, China
Pankaj Kankar IBM IRL, India
R. Venkateswaran Persistent Systems, India
Shipra Agrawal Bell Labs Research, India
Sourashis Roy IBM IRL, India
Suju Rajan University of Texas, USA
Sunita Sarawagi IIT Bombay, India
Umesh Bellur IIT Bombay, India
External Reviewers
A. Balachandran Persistent Systems, India
Amit Garde Persistent Systems, India
Atsushi Kubota Fujitsu Laboratories, Japan
Daniel Lieuwen Bell Labs, USA
Deepak P IBM Research, India
Iko Pramudiono NTT, Japan
Krishna Kummamuru IBM Research, India
Masanori Goto Fujitsu Laboratories, Japan
Noriaki Kawamae NTT, Japan
Nicolas Anciaux INRIA, France
Nicolas Travers University of Versailles, France
Philippe Pucheral INRIA, France
Sponsoring Institutions
IBM, Thailand
Software Industry Promotion Agency
Table of Contents
Invited Talks
‘Socio Sense’ and ‘Cyber Infrastructure for Information Explosion Era’:
Projects in Japan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Masaru Kitsuregawa
Is (Your) Database Research Having Impact? . . . . . . . . . . . . . . . . . . . . . . . . 3
Guy M. Lohman
Clustering
A Comparative Study of Ontology Based Term Similarity Measures on
PubMed Document Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, and
Xiaohua Zhou
An Adaptive and Efficient Unsupervised Shot Clustering Algorithm for
Sports Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Jia Liao, Guoren Wang, Bo Zhang, Xiaofang Zhou, and Ge Yu
A Robust Feature Normalization Scheme and an Optimized Clustering
Method for Anomaly-Based Intrusion Detection System . . . . . . . . . . . . . . . 140
Jungsuk Song, Hiroki Takakura, Yasuo Okabe, and Yongjin Kwon
Detection and Visualization of Subspace Cluster Hierarchies . . . . . . . . . . . 152
Elke Achtert, Christian Böhm, Hans-Peter Kriegel, Peer Kröger,
Ina Müller-Gorman, and Arthur Zimek
Outlier Detection
Correlation-Based Detection of Attribute Outliers . . . . . . . . . . . . . . . . . . . . 164
Judice L.Y. Koh, Mong Li Lee, Wynne Hsu, and Kai Tak Lam
An Efficient Histogram Method for Outlier Detection . . . . . . . . . . . . . . . . . 176
Matthew Gebski and Raymond K. Wong
Data Warehouse
BioDIFF: An Effective Fast Change Detection Algorithm for Biological
Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Yang Song, Sourav S. Bhowmick, and C. Forbes Dewey Jr.
An Efficient Implementation for MOLAP Basic Data Structure and Its
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
K.M. Azharul Hasan, Tatsuo Tsuji, and Ken Higuchi
Information Retrieval
Monitoring Heterogeneous Nearest Neighbors for Moving Objects
Considering Location-Independent Attributes . . . . . . . . . . . . . . . . . . . . . . . . 300
Yu-Chi Su, Yi-Hung Wu, and Arbee L.P. Chen
Similarity Joins of Text with Incomplete Information Formats . . . . . . . . . 313
Shaoxu Song and Lei Chen
Self-tuning in Graph-Based Reference Disambiguation . . . . . . . . . . . . . . . . 325
Rabia Nuray-Turan, Dmitri V. Kalashnikov, and Sharad Mehrotra
Probabilistic Nearest-Neighbor Query on Uncertain Objects . . . . . . . . . . . 337
Hans-Peter Kriegel, Peter Kunath, and Matthias Renz
Mobile Databases
Optimizing Moving Queries over Moving Object Data Streams . . . . . . . . . 563
Dan Lin, Bin Cui, and Dongqing Yang
MIME: A Dynamic Index Scheme for Multi-dimensional Query in
Mobile P2P Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
Ping Wang, Lidan Shou, Gang Chen, and Jinxiang Dong
Data Streams
The Tornado Model: Uncertainty Model for Continuously Changing
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Byunggu Yu, Seon Ho Kim, Shayma Alkobaisi, Wan D. Bae, and
Thomas Bailey
ClusterSheddy: Load Shedding Using Moving Clusters over
Spatio-temporal Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
Rimma V. Nehme and Elke A. Rundensteiner
Evaluating MAX and MIN over Sliding Windows with Various Size
Using the Exemplary Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
Jiakui Zhao, Dongqing Yang, Bin Cui, Lijun Chen, and Jun Gao
CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets
Mining over Stream Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng,
Yunfeng Liu, and Kunqing Xie
XML Databases
An Efficient Encoding and Labeling for Dynamic XML Data . . . . . . . . . . 715
Jun-Ki Min, Jihyun Lee, and Chin-Wan Chung
An Original Semantics to Keyword Queries for XML Using Structural
Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
Dimitri Theodoratos and Xiaoying Wu
Lightweight Model Bases and Table-Driven Modeling . . . . . . . . . . . . . . . . . 740
Hung-chih Yang and D. Stott Parker
XML Indexing
An Efficient Index Lattice for XML Query Evaluation . . . . . . . . . . . . . . . . 753
Wilfred Ng and James Cheng
XML Databases
AB-Index: An Efficient Adaptive Index for Branching XML Queries . . . . 988
Bo Zhang, Wei Wang, Xiaoling Wang, and Aoying Zhou
Query Processing
QuickCN: A Combined Approach for Efficient Keyword Search over
Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032
Jun Zhang, Zhaohui Peng, and Shan Wang
Adaptive Join Query Processing in Data Grids: Exploring Relation
Partial Replicas and Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036
Donghua Yang, Jianzhong Li, and Hong Gao
Efficient Semantically Equal Join on Strings . . . . . . . . . . . . . . . . . . . . . . . . . 1041
Juggapong Natwichai, Xingzhi Sun, and Maria E. Orlowska
Masaru Kitsuregawa
University of Tokyo
[email protected]
This talk introduces some of the large projects in Japan for which I serve as principal
investigator. MEXT (the Ministry of Education, Culture, Sports, Science and Technology)
approved a new project named 'Cyber Infrastructure for Information Explosion Era' in 2005.
The year 2005 was a preparation stage during which we solicited research proposals under
this program; in total, seventy-four research teams were accepted, and the project
effectively started in April 2006. It is the largest IT-related project in the category of
Grant-in-Aid for Scientific Research on Priority Areas, with a budget of around 5 million
dollars for 2006, and it is expected to continue until FY2010. The amount of information
created by people and generated by sensors and computers has been increasing explosively
in recent years, and the growth rate of web content is especially high. People no longer
ask their friends when they want to know something; they use a search engine, and they
have become heavily dependent on the web. Knowledge workers spend a great deal of time
just on 'search', and the more information is generated, the harder it becomes to locate
the appropriate information. In order to achieve higher-quality search, we are currently
developing an open next-generation search engine incorporating deep NLP capabilities. By
'deep' we mean that we devote more machine power to the analysis of web content; in other
words, we do not care about response time, since the current one-second response time is
driven by the advertisement-based monetization scheme. We believe we should provide a
service that goes beyond ordinary search. In addition to the web, there is yet another
information explosion in the area of so-called e-science. With the introduction of very
powerful supercomputers and various kinds of advanced sensor systems, science is becoming
very data intensive, and we plan to build tools for scientific discovery over this sea of
data. Another area is health care: a lot of patient health care records are now being
stored digitally, and monitoring human activities with sensors and mining the archived
health care records would be a typical data-driven application.
The explosion of information causes several problems, not just in search but also in
computer system management. More information means more applications, which place heavy
stress on the system, and the cost of maintaining systems keeps increasing.
Self-monitoring of system activities also generates a huge amount of information; BAM is
one typical higher-level example. We are now building a large-scale distributed cluster
test bed across Japan, which is a shared platform for next-generation system software
development.
Human interaction is also a very important research issue, since all information
ultimately has to be absorbed by people. A highly functional room for capturing human
interaction is being developed, with various kinds of sensors and eight video
Guy M. Lohman
Is your research having real impact? The ultimate test of the research done by this
community is how it impacts society. Perhaps the most important metric of this
impact is acceptance in the marketplace, i.e. incorporation into products that bring
value to the purchaser. Merely publishing papers and getting them referenced has no
intrinsic value unless the ideas therein are eventually used by someone. So let us ask
ourselves candidly – is (my) database research having (positive) impact? Concisely:
Are they buying my stuff? Have the “hot topics” of the past withstood the test of time
by actually being used in products that sold? If so, what characteristics were
instrumental in their success? And if not, why did something that got so many people
excited fail to gain traction with users? Perhaps more importantly, what can we learn
from our track record of the past in order to have better impact in the future? How
can we better serve our user community by solving their real problems, not the ones
we may imagine?
Let us first critique our historical track record as a community. Over the last thirty
years, a few major topics seem to have dominated the interest of the research
community in databases. Waves of “hot topics” appear to rise to predominance, in
terms of the number of papers submitted (and hence published), and after a few years
of excitement get replaced by another topic. Not that these waves exclude other
topics or are cleanly delineated – they simply seem to coincidentally interest a large
proportion of our community. I will not attempt to justify this premise with statistics
on topics; it’s just an observation that many experienced researchers recognize. The
first of these with which I’m familiar was relational databases, themselves, which
captivated the attention of database researchers in the last half of the 1970s, resulting
in major prototypes such as System R, Ingres, and others that formed the foundation
of products in the early 1980s. Distributed databases seemed to dominate the early
1980s, but this thread rapidly evolved into separate threads on parallel databases and
the integration of disjoint (and often heterogeneous) databases, usually called
federated databases. In the late 1980s and early 1990s, object-oriented databases
attempted to address the requirements of some under-served applications, and the
relational crowd fought back by creating “extensible” databases with “object-
relational” extensions to meet the OODBMS challenge. About the same time, interest
in Datalog created strong interest in deductive databases. The mid- and late-1990s
saw the birth and explosion of interest in data warehousing and data mining,
eventually spawning a whole new research community in knowledge discovery.
Around 1999, standardization of XML rocketed XML databases into the forefront.
The early- to mid-2000s have seen great interest in streams and sensor databases.
And along the way, numerous other variations on these themes have enjoyed the
spotlight for a while: database machines, temporal databases, multi-media and spatial
databases, scientific and statistical databases, active databases, semantic databases
and knowledge bases, and a recent favorite of mine that has yet to gain much interest
from academia – self-managing databases. To what extent has each of these topics
successfully impacted the marketplace, and why? We must learn from our successes
and failures by carefully examining why the market accepted or rejected our
technology.
My assessment is that our success has depended upon the consumability of our
technology: how well it meets a customer need, how simple it is to understand and
use, and how well standardization has stabilized its acceptance across vendors.
Relational technology succeeded and has grown spectacularly to become a U.S. $14
Billion industry in 2004 largely because it was simpler and easier to understand than
its predecessors, with a declarative query language (SQL) that simplified application
development, and was standardized early in its (product) evolution. However,
attempts to “augment” it with object-relational, temporal, and deductive extensions
have been either: (a) too complicated, (b) insufficiently vital to most consumers’
applications, and/or (c) not standardized, or standardized too late in their evolution.
Parallel databases exploited increasingly inexpensive hardware to facilitate growth
and performance requirements with generally acceptable increases in complexity
(mostly in administration, not querying), whereas federated databases have seen less
success because the complexity of integrating diverse data sources largely fell on the
user. Data mining, while a genuine success in the research community, evoked a
comparative yawn in the marketplace largely because users needed to understand it to
use it, and they had difficulty understanding it because of its novelty and
mathematical intricacies. The jury is still out on XML databases, but my fear is that,
despite the need for storing increasing volumes of XML data, XQuery is far more
complicated than SQL. Similarly, stream databases are too new to be judged
adequately, but I question the market size and whether the research in the database
community adequately suits the “lean and mean” real-time requirements of the
primary market – the investment and banking industries.
How then should we increase the impact of our research in the future? First, we
must candidly assess our strengths and weaknesses. Our strengths lie in modeling the
semantics underlying information, enabling better precision in our queries than the
keyword search upon which Information Retrieval and the popular search engines are
based. We have much to offer the IR and search communities here, and they have
recognized this by aggressively hiring from the database community in the last few
years. Our models also permit reasoning about the data through complex OLAP-style
queries to extract actionable information from a sea of data. We know how to
optimize a declarative language, and how to exploit massive parallelism, far better
than any other discipline. Our primary weakness is in simplicity / usability,
particularly in setting up and administering databases. This is certainly exacerbated by
database researchers not gaining firsthand experience by routinely using databases to
store their own data. Secondly, we must reach out to other disciplines with
complementary strengths, and learn from them. Despite the lack of precision of
keyword search, why is it vastly preferred over SQL? Third, we must engage with
real users (which should include ourselves) and listen carefully to what they say.
Have you ever tried to query or manage a non-trivial database of at least 500 tables
that was not constructed by you? Have you ever tried to add disks or nodes to an
existing database that exceeded its initial space allocation? Have you ever built and
administered a real application using a database? Did your research remedy any of
the pain points you encountered or heard from a user? Fourth, we must go back to
basics and design our systems based upon user requirements, not upon what
technology we understand or want to develop.
Pursuing the fourth item in greater detail, we should honestly ask ourselves why
less than 20% of the world’s data is stored in databases. Weren’t object-relational
extensions supposed to rectify this by enabling storage of unstructured and semi-
structured data, as well as structured data? Currently, users rely upon content
managers to manage this unstructured and semi-structured content. Though content
managers are built upon relational DBMSs, the content is stored in files, so isn’t
easily searched, and the search interface isn’t SQL. This certainly isn’t what users
want. Users want a single, uniform interface to all their data, particularly for
searching. Increasingly, they recognize that the majority of their costs are for people
and their skills, as hardware costs are driven downward by Moore’s Law. So lowering
the Total Cost of Ownership (TCO) requires systems that are easier to manage and
require fewer skilled people to manage. Users also want a scalable solution that
permits easily adding more capacity to either the storage or the computing power in
an incremental fashion as their needs for information management increase. The
increasing requirements for compliance with government regulations, as well as
business imperatives to extract more value out of information already collected in
diverse application “silos”, are driving their need to integrate systems never designed
to interact with other systems, and to be able to more pro-actively and quickly derive
business intelligence than with today’s data warehouses. Ultimately, users want to be
able to quickly and easily find, integrate, and aggregate the data that they need to
make business decisions. But that data is currently scattered throughout their
enterprise in a staggering array of incompatible systems, in a daunting tangle of
differing formats. The usual lament is that they know the data is out there somewhere,
but they can’t find it.
Clearly there are plenty of hard research problems – as well as business
opportunities! – in all of these requirements! We simply have to listen and be willing
to change our research agendas to the problems that matter most to our “customers”.
And focusing on customer pain points doesn’t preclude attempting risky, imaginative,
cool, technically advanced, and occasionally far-out technical approaches. In fact,
problems having origins in reality tend to be the most challenging. Only by doing so
will our research withstand the test of time in the marketplace of ideas, and truly have
the impact we all want for our work.
Improving Quality and Convergence of Genetic
Query Optimizers
1 Introduction
The difficulty of optimizing large join queries lies in the size of the search space,
which grows exponentially with the number of relations involved in the query. In this
scenario, users, or even the DBMS [20], are usually forced to split the query into smaller
subqueries in order to optimize it, obtaining QEPs that are typically far from the optimum.
Genetic approaches have proven to be a good alternative, since they are included among the
best randomized approaches in terms of quality and speed of convergence [15]. However,
there are still important aspects to be studied in order to improve the performance of
genetic approaches. On the one hand, evolutionary algorithms perform a beam search, based
on the evolution of a population instead of a single individual [1], as opposed to
random-walk algorithms like iterative improvement or simulated annealing. Although this
can be beneficial in terms of quality, it may jeopardize the ability of the optimizer to
converge quickly. On the other hand, recent studies show, by means of a statistical model,
that the random effects of the initial population cannot be neglected, since they have a
significant impact on the quality of the QEP returned after the optimization process [11].
In other words, depending on the small sample of QEPs created at random for the initial
population, the genetic optimizer may have difficulty finding a near-optimal QEP. This is
aggravated by the fact that the search space grows exponentially as the number of
relations increases, which implies that the size of the initial population should also
grow exponentially.
In order to remedy these two drawbacks, we propose two different approaches. We call our
first proposal Weighted Election (WE); it tackles the problem of the speed of convergence
mentioned above. In all traditional evolutionary algorithms, the members of the population
to be crossed with other members or mutated are chosen at random. WE proposes a new
approach in which QEPs are chosen with a probability that depends on their associated
cost, giving low-cost plans a higher probability of being chosen than high-cost plans. Our
second approach aims to reduce the variability in the quality of the results introduced by
the random effects of the initial population, by using heuristics to ensure that the first
sample of QEPs is not blindly chosen from the search space but follows a minimum quality
criterion. We call this approach Heuristic Initial Population (HIP).
Finally, we show that the combination of both approaches is beneficial. Specifically, we
compare our new approach with the Two-Phase Optimization algorithm (2PO) [4], which is
considered to be the best randomized algorithm presented in the literature. We show that
our techniques significantly improve a genetic optimizer and, in addition, are more
suitable than previous randomized techniques for very large join queries.
This paper is organized as follows. Section 2 introduces genetic optimization and the
genetic optimizer used in this work. Sections 3 and 4 describe our proposals in detail. In
Section 5, we present the results obtained by comparing the different algorithms. Finally,
in Sections 6 and 7, we present related work and conclude.
where C_p is the cost associated with QEP p, μ_1/2 is the median cost in the population,
and B_Pop is the best cost in the population. Note that W_p ranges from 1 to α, where
α > 1. Specifically, QEPs with costs lower than the median are assigned a weight between 1
and α, while QEPs with costs higher than the median are assigned a weight of 1. Depending
on the value of α we can give more or less importance to the differences between the costs
of the QEPs in the population. For example, for α = 2 and α = 100, the probability of the
QEP with the lowest cost in the population being chosen is 2 and 100 times the probability
of the highest-cost QEP, respectively.
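As an illustration, the following Python sketch shows one way to implement such a weighted
(roulette-style) election. The linear interpolation between 1 and α below the median is an
assumption consistent with the description above, not necessarily the exact formula used
by the authors.

    import random

    def weighted_election(population, costs, alpha=1000):
        # Weighted Election (sketch): plans cheaper than the median get a weight
        # growing from 1 (at the median) towards alpha (at the best cost);
        # all other plans get weight 1, so W_p always lies in [1, alpha].
        ordered = sorted(costs)
        median, best = ordered[len(ordered) // 2], ordered[0]
        weights = []
        for c in costs:
            if c < median and median > best:
                weights.append(1 + (alpha - 1) * (median - c) / (median - best))
            else:
                weights.append(1.0)
        # Select one QEP with probability proportional to its weight.
        return random.choices(population, weights=weights, k=1)[0]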
using the so-called KBZ rank, related to the KBZ algorithm. Among the five criteria, the
minimum selectivity criterion turned out to be the most efficient and, for this reason, it
is the one selected for this work. Depending on the relation chosen to start the
optimization process, different QEPs can be generated. In general, we consider that the AG
algorithm does not generate a single QEP, but as many QEPs as there are relations involved
in the query.
3: p ← generateMinimumJoinSelectivityPlan();
4: currentTime ← getCurrentTime();
5: while (p ∧ currentTime < maxTime/2 ∧ numPlan < maxPlans) do
6:   insertPlanToPopulation(p);
7:   numPlan ← numPlan + 1;
8:   p ← generateMinimumJoinSelectivityPlan();
9:   currentTime ← getCurrentTime();
10: end while
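A minimal, self-contained Python rendering of the seeding loop above is sketched below.
Here generate_plan stands in for generateMinimumJoinSelectivityPlan(), whose implementation
is not shown, and devoting at most half of the optimization time budget to the initial
population is an assumption about the time bound in the listing.

    import time

    def seed_initial_population(generate_plan, max_time, max_plans):
        # HIP seeding loop (sketch): keep inserting heuristically generated plans
        # until the generator is exhausted, half the time budget is spent,
        # or the population limit is reached.
        population = []
        deadline = time.time() + max_time / 2
        plan = generate_plan()
        while plan is not None and time.time() < deadline and len(population) < max_plans:
            population.append(plan)
            plan = generate_plan()
        return population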
5 Experimental Results
Our first concern is to ensure a fair comparison between the approaches studied in this
paper. To this end, we have used the meta-structures created for CGO, i.e., the QEP
metadata, the functions to calculate the cost of a plan, etc., to implement the new
techniques and 2PO. This guarantees that the effort put into the performance optimization
of CGO also benefits the other approaches.
Our new techniques are tested first on star schemas with star join queries, since they
represent one of the most typical scenarios in Decision Support Systems, similar to those
used for TPC-H. In order to generalize our conclusions, we also test our techniques with
random queries. We do not show results for the TPC-H benchmark, since the number of
relations in its schema does not allow the creation of large join queries.
Star Join Queries. For star join queries [3] we have randomly generated two databases
containing 20 and 50 relations. Both schemas contain a large fact table, or central
relation, and 19 and 49 smaller dimension tables, respectively. The fact table contains a
foreign-key attribute referencing a primary key in each dimension relation. We have
distributed the cardinalities so that most of the dimensions have a significantly lower
cardinality than the fact table. A few dimensions have cardinalities closer to that of the
fact table, but still at least one order of magnitude smaller, which typically corresponds
to real scenarios (similar to the TPC-H database schema). The number of attributes per
dimension, other than those in the primary key, ranges from 1 to 10. The exact number of
attributes per dimension and the attribute types are chosen at random. We define an index
for every primary key.
We randomly define two sets of 9 star join queries, Q20 and Q50, one for each database
schema. The two sets contain queries involving 20 and 50 relations, respectively. Every
query includes all the relations of its corresponding database schema, with at least one
explicit join condition associated with each relation. Therefore, since CGO avoids cross
products, we ensure that our queries are well-defined star join queries.
Let Q be a SQL statement reading from a set of relations and γ the set of constraints in
Q. Every constraint c in γ has an associated selectivity factor s(c). In a star join
query, every dimension table typically adds some information to the data flow or, if a
constraint affects one of its attributes, acts as a filter that discards results not
matching the constraint. Let us define the selectivity of the query as S = Π_{c∈γ} s(c).
Each set of queries Q20 and Q50 contains 9 queries q_i, i = 1..9 and, in both cases,
S(q1) = S(q2) = S(q3) ≈ 10^-2, S(q4) = S(q5) = S(q6) ≈ 10^-4, and
S(q7) = S(q8) = S(q9) ≈ 10^-8.
where C_orig represents the best cost obtained by the original implementation of CGO and
C_Tech represents the best cost achieved by the technique being tested. The scaled cost in
formula (2) allows us to average over executions of different queries and databases, and
it is centered at 0. If a technique has a positive scaled cost sc (sc > 0), it obtains
QEPs with costs that are, on average, more than sc times lower than those obtained by CGO.
A negative value indicates that the QEPs obtained by that technique are, on average, worse
than those obtained by CGO. From here on, we compare the techniques analyzed in this paper
to CGO using formula (2).
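Since formula (2) itself is not reproduced in this excerpt, the following sketch shows one
definition that matches the description above (centered at 0, positive when the technique
beats CGO). It is an assumption consistent with the surrounding text, not necessarily the
authors' exact formula.

    def scaled_cost(c_orig, c_tech):
        # Positive when the technique finds cheaper QEPs than CGO,
        # negative when it does worse, and 0 when both costs are equal.
        if c_tech <= c_orig:
            return c_orig / c_tech - 1
        return -(c_tech / c_orig - 1)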
Carquinyoli Genetic Optimizer (CGO). In order to parameterize CGO we use the
recommendations from the statistical analysis presented in [11]. Table 1 summarizes the
values used to configure CGO.
Two-Phase Optimization (2PO). We have parameterized 2PO using the configuration proposed
in [4]. During the first phase of 2PO, we perform 10 local optimizations using iterative
improvement. The best QEP obtained in this first phase is used as the starting point, in
the second phase, for the simulated annealing algorithm. The starting value for the
initial temperature is 10% of the cost of this QEP. The same parametrization of 2PO was
also used in [15].
As explained before, the difference between the probability of choosing the best QEP and
the probability of choosing the worst QEP in the population can be magnified depending on
the value of parameter α. In order to study the effect of this parameter, each run is
tested using five different values for α: 2, 10, 10^2, 10^3, and 10^4. We run our
experiments using the two different sets of queries mentioned above, namely the star join
query set, executing all the policies 10 times for each of the 5 populations created per
query, and 30 random queries, where each policy is also run 10 times per configuration, in
order to obtain averages.
Table 1. Parameters used depending on the number of relations in the query. The numbers of
crossover and mutation operations shown are per generation.
Figure 1 shows the results obtained from these experiments. The uppermost row shows the
behavior of WE for star join queries involving 20 relations. The leftmost plot (plot a)
corresponds to the star join queries with the highest selectivity, i.e., those queries
that return the largest number of matches (S ≈ 10^-2). The plot in the middle (plot b)
corresponds to queries with S ≈ 10^-4 and the rightmost plot (plot c) to queries with the
lowest selectivity, S ≈ 10^-8. Since the number of relations is relatively small, close to
what can still be handled by dynamic programming techniques, there is still little room
for improvement. In general, the larger the value of α, the more significant the
improvements introduced by WE. However, the plots show that the difference between
α = 1000 and α = 10000 is not significant. We can also observe that, for very low
selectivity, the gains of WE are reduced (plot c). This effect is explained by the fact
that, when the selectivity is very small, most of the potential result tuples are
discarded, resulting in a very low data-flow cardinality in the QEP. Since the join
operations can be executed in memory and do not incur extra I/O, all the QEPs have a
similar cost and most executions of CGO are likely to reach a QEP with near-optimal cost,
leaving little room for improvement.
Analogously, the central row of plots shows the same results for star join
queries involving 50 relations. Our first observation is that, in some cases the
gains obtained by WE are several orders of magnitude larger than those obtained
by CGO. Again, we can observe that the general trend is to reward large values of
[Figure 1 plots Scaled Cost against optimization time for different values of α: three
panels for star join queries with 20 relations (a)-(c) over 2 s to 60 s, three panels for
star join queries with 50 relations (a)-(c) over 30 s to 240 s, and three panels for
random queries with 20, 50, and 100 relations.]
Fig. 1. Scaled Cost evolution for different values of α and different configurations
[Figure 2 plots Scaled Cost against optimization time for 2PO, HIP, HIP+WE, and WE: a left
panel for 20 relations (5 s to 60 s) and a right panel for 50 relations (30 s to 240 s),
where HIP reaches about 7979 and HIP+WE about 8233.]
Fig. 2. Scaled Cost evolution for WE using α = 1000, HIP, the combination of both
and 2PO studying different number of relations for star join queries
α with better performance. Also, we prefer the performance achieved for α = 1000 over that
achieved for α = 10000, which is not as stable across all situations. There is a trade-off
for parameter α: it is advisable to use larger values to achieve good performance (i.e.,
larger than 100), but values that are too large increase the probability of the best plan
in the population being chosen to the point where, in practice, we are almost forcing the
exclusive use of the best QEPs in the population, destroying one of the main differences
between genetic approaches and random-walk approaches. Similarly, the improvements of WE
decrease as the selectivity decreases, for the reason explained above. However, even in
the worst cases we still obtain QEPs with costs that, in general, are several times lower
than those obtained by CGO.
Finally, for random queries, in the lowermost row of plots, we observe the same trends as
with the star join queries. Again, the best value of α tested is 1000, independently of
the number of relations involved in the query. Extreme cases like α = 2 or α = 10000 must
be avoided, since they might lead to performance worse than that of CGO.
In this section we analyze the benefits obtained by generating part of the population
using HIP. Specifically, we run the same number of executions as in the previous analysis,
using the same configurations. Figures 2 and 3 show the results of our analysis of this
technique, together with the results described in the next subsection.
We first study the behavior for star join queries. In general, the use of HIP always
improves the performance of CGO. As suggested in [11], spending extra time generating good
initial plans is clearly beneficial. Similar to what happens with WE, the improvements are
in general very limited in the case of star join queries with 20 relations (left plot in
Figure 2), since the search space has not grown enough to obtain QEPs that clearly differ,
in terms of quality, from those obtained by CGO. However, for 50 relations (right plot in
Figure 2) HIP obtains results that are three orders of magnitude better than those
obtained by CGO. As the plot shows, for small optimization times, the improvement of our
techniques is around 10 times better than that of 2PO. It takes CGO about 4 minutes to
achieve results similar to those generated by HIP, which implies that HIP converges much
faster without losing quality. For random queries (Figure 3), we observe that HIP also
obtains results similar to those obtained for star join queries, achieving, for queries
containing 100 joins, improvements of more than four orders of magnitude.
6 Related Work
The first approaches that applied genetic algorithms to query optimization con-
sidered a reduced set of QEP properties in crossover and mutation operations
[2,15]. In these first proposals, the amount of information per plan is very lim-
ited because plans are transformed to chromosomes, represented as strings of
integers. This lack of information usually leads to the generation of invalid plans
that have to be repaired. In [16], a genetic-programming-based optimizer is pro-
posed that directly uses QEPs as the members in the population, instead of
using chromosomes. A first genetic optimizer prototype was created for Post-
greSQL [12], but its search domain is reduced to left-deep trees and mutation
operations are deprecated, thus bounding the search to only those properties
appearing in the QEPs of the initial population. Besides, execution plans are
[Figure 3 plots Scaled Cost against optimization time for 2PO, HIP, HIP+WE, and WE: panels
for 20 relations (5 s to 60 s), 50 relations, and 100 relations (30 s to 240 s).]
Fig. 3. Scaled Cost evolution for WE using α = 1000, HIP, the combination of both
and 2PO studying different numbers of relations for random queries
7 Conclusions
In this paper we present two techniques, namely Weighted Election (WE) and Heuristic
Initial Population (HIP). These techniques tackle two important aspects of genetic
optimization: the time wasted optimizing high-cost QEPs in the population, and the effect
of the initial population on the quality of the best QEP generated by the optimizer. WE is
able to speed up a genetic optimizer and achieve faster convergence than the original,
showing that, without de-randomizing the genetic evolution, it is important to focus on
the QEPs with lower associated cost and to avoid spending time optimizing QEPs that are
far from the best QEP in the population. HIP is the first technique combining heuristics
with genetic query optimizers, and it shows that using simple rules to generate the
initial population allows the genetic optimizer to quickly generate well-fitted QEPs,
improving both the speed and the quality of the optimizer. The combination of both
techniques, which are orthogonal, is very simple and is shown to outperform the best
random-walk approach presented in the literature. All in all, we show that, for very large
join queries, as the number of relations increases it is advisable to use genetic methods
based on beam-search strategies, rather than random-walk techniques.
References
1. W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone. Genetic Programming
– An Introduction; On the Automatic Evolution of Computer Programs and its
Applications. Morgan Kaufmann, dpunkt.verlag, Jan. 1998.
2. K. Bennett, M. C. Ferris, and Y. E. Ioannidis. A genetic algorithm for database
query optimization. In R. Belew and L. Booker, editors, Proceedings of the Fourth
International Conference on Genetic Algorithms, pages 400–407, San Mateo, CA,
1991. Morgan Kaufman.
3. S. Chaudhuri and U. Dayal. Data warehousing and OLAP for decision support. In SIGMOD
'97: Proceedings of the ACM SIGMOD International Conference on Management of Data, pages
507–508, 1997.
4. Y. E. Ioannidis and Y. Kang. Randomized algorithms for optimizing large join
queries. In SIGMOD ’90: Proc. of the 1990 ACM SIGMOD international confer-
ence on Management of data, pages 312–321, New York, NY, USA, 1990. ACM
Press.
5. Y. E. Ioannidis and E. Wong. Query optimization by simulated annealing. In
SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD international conference on
Management of data, pages 9–22, New York, NY, USA, 1987. ACM Press.
6. A. Kemper, D. Kossmann, and B. Zeller. Performance tuning for SAP R/3. IEEE Data Eng.
Bull., 22(2):32–39, 1999.
7. R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries.
In VLDB, pages 128–137, 1986.
8. V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, and J.-L. Larriba-Pey. CGO: A sound
genetic optimizer for cyclic query graphs. In Proc. of ICCS 2006, pages 156–163, Reading,
UK, May 2006. Springer-Verlag.
9. V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, V. Markl, and J.-L. Larriba-Pey.
Genetic evolution in query optimization: a complete analysis of a genetic optimizer.
Technical Report UPC-DAC-RR-2005-21, Dept. d’Arqu. de Comp. Universitat Po-
litecnica de Catalunya (http://www.dama.upc.edu), 2005.
10. V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, V. Markl, and J.-L. Larriba-
Pey. An inside analysis of a genetic-programming optimizer. In Proc. of IDEAS
’06., December 2006.
11. V. Muntés-Mulero, M. Pérez-Cassany, J. Aguilar-Saborit, C. Zuzarte, and J.-L.
Larriba-Pey. Parameterizing a genetic optimizer. In Proc. of DEXA ’06, pages
707–717, September 2006.
12. PostgreSQL. http://www.postgresql.org/.
13. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price.
Access path selection in a relational database management system. In Proceedings
of the 1979 ACM SIGMOD international conference on Management of data, pages
23–34. ACM Press, 1979.
14. E. J. Shekita, H. C. Young, and K.-L. Tan. Multi-join optimization for symmetric
multiprocessors. In R. Agrawal, S. Baker, and D. A. Bell, editors, 19th Interna-
tional Conference on Very Large Data Bases, August 24-27, 1993, Dublin, Ireland,
Proceedings, pages 479–492. Morgan Kaufmann, 1993.
15. M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized opti-
mization for the join ordering problem. VLDB Journal: Very Large Data Bases,
6(3):191–208, 1997.
16. M. Stillger and M. Spiliopoulou. Genetic programming in database query optimiza-
tion. In J. R. Koza, D. E. Goldberg, D. B. Fogel, and R. L. Riolo, editors, Genetic
Programming 1996: Proceedings of the First Annual Conference, pages 388–393,
Stanford University, CA, USA, 28–31 July 1996. MIT Press.
17. A. Swami. Optimization of large join queries: combining heuristics and combina-
torial techniques. In SIGMOD ’89: Proceedings of the 1989 ACM SIGMOD inter-
national conference on Management of data, pages 367–376. ACM Press, 1989.
18. A. Swami and A. Gupta. Optimization of large join queries. In SIGMOD ’88:
Proceedings of the 1988 ACM SIGMOD international conference on Management
of data, pages 8–17, New York, NY, USA, 1988. ACM Press.
19. A. N. Swami and B. R. Iyer. A polynomial time algorithm for optimizing join
queries. In Proceedings of the Ninth International Conference on Data Engineering,
pages 345–354, Washington, DC, USA, 1993. IEEE Computer Society.
20. Y. Tao, Q. Zhu, C. Zuzarte, and W. Lau. Optimizing large star-schema queries
with snowflakes via heuristic-based query rewriting. In CASCON ’03: Proceedings
of the 2003 conference of the Centre for Advanced Studies on Collaborative research,
pages 279–293. IBM Press, 2003.
Cost-Based Query Optimization for Multi Reachability Joins
J. Cheng, J. Xu Yu, and B. Ding
1 Introduction
With the rapid growth of the World Wide Web, new data archiving and analysis techniques
have made publicly available a huge volume of data that is graph-structured in nature,
including hypertext data, semi-structured data, and XML [1]. A graph provides great
expressive power for people to describe and understand the complex relationships among
data objects. As a major standard for representing data on the World Wide Web, XML
provides facilities for users to view data as graphs with two different kinds of links:
parent-child links (document-internal links) and reference links (cross-document links).
In addition, XLink (XML Linking Language) [7] and XPointer (XML Pointer Language) [8]
provide further facilities for users to manage their complex data as graphs and integrate
data effectively. Besides, RDF [3] explicitly describes semantic resources as graphs.
Over such a graph, a primitive operation, the reachability join (or simply R-join), was
studied in [11,6]. In brief, a reachability join A→D (an R-join) finds all node pairs
(a, d) in the underlying large data graph such that d is reachable from a, denoted a ⇝ d,
and the labels of a and d are A and D, respectively. R-joins help users find information
effectively without requiring them to fully understand the schema of the underlying graph.
We explain the need for such R-joins using an XML example. Figure 1 shows a graph
representation (Figure 1(b)) of an XML document (Figure 1(a)). In Figure 1(b), solid links
represent document-internal links whereas dashed links represent cross-document links. We
consider Figure 1(b) as a graph in which all links are treated in the same way. With an
R-join, we can easily find all the topics that a researcher
<Research>
  ...
  <Institute>
    <researcher> <name>Jim</name> </researcher>
    <researcher> <name>Joe</name> </researcher>
  </Institute>
  <Lab>
    <researcher> <name>Linda</name> <projectref idref="proj1"/> </researcher>
    <researcher> <name>John</name> <projectref idref="proj2"/> </researcher>
  </Lab>
  <institute>
    <researcher> <name>Keith</name> <projectref idref="proj2"/> </researcher>
  </institute>
  <projects>
    <project id="proj1"> <topic>XML</topic> </project>
    <project id="proj2"> <topic>RDF</topic> </project>
  </projects>
</Research>

(a) An XML document. (b) Its graph representation: node 0 (Research) with children
Institute (1), Lab (2), institute (3), and Projects (4); researcher nodes 5 and 6 under
node 1, 9 and 10 under node 2, and 17 under node 3; project nodes 13 and 16 under node 4
with topic nodes 20 ("XML") and 21 ("RDF"); name and projref leaf nodes 7, 8, 11, 12, 14,
15, 18, 19 carrying the values "Jim", "Joe", "Linda", "John", and "Keith". Solid edges are
document-internal links; dashed edges are reference links from projref nodes to the
referenced projects.
Fig. 1. An Example
Table 1. The Graph Encoding of [2]

  l            v    po_v   I_v
  Institute    1    5      [1:5]
  Institute    3    20     [12:13][17:20]
  researcher   5    2      [1:2]
  researcher   6    4      [3:4]
  researcher   9    10     [6:10]
  researcher   10   15     [11:15]
  researcher   17   19     [12:13][17:19]
  topic        20   7      [7:7]
  topic        21   12     [12:12]

Table 2. The Graph Encoding of [5]

  (a) Tree Interval Encoding
  v    Interval        v    Interval
  1    [2:11]          13   [17:20]
  3    [34:41]         16   [27:30]
  4    [42:43]         17   [35:40]
  5    [3:6]           19   [38:39]
  6    [7:10]          20   [18:19]
  9    [13:22]         21   [28:29]
  10   [23:32]

  (b) SSPI Index
  v    preds
  13   {4}
  16   {19, 4}
  20   {13}
  21   {16}
Example 1. Fig. 2 represents a simple multi R-join query as a directed graph. This query
graph has a node labeled Institute, a node labeled researcher, and a node labeled topic,
and it has two edges. The edge from the Institute node to the researcher node requires
that the data node pairs (i, r), with i ∈ ext(Institute) and r ∈ ext(researcher), such
that i ⇝ r, be returned; at the same time, the edge from the researcher node to the topic
node requires that the data node pairs (r, t), with r ∈ ext(researcher) and
t ∈ ext(topic), such that r ⇝ t, be returned.
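To make the semantics of the query graph concrete, the following Python sketch evaluates
the two R-joins of Example 1 over a toy labeled digraph and then joins them on the shared
researcher node. The adjacency lists and node numbers are illustrative only (a small
fragment in the spirit of Figure 1), and the naive DFS-based reachability test is not the
evaluation strategy proposed in the paper.

    from itertools import product

    # Toy labeled digraph (illustrative fragment, not the full graph of Fig. 1).
    label = {1: "Institute", 5: "researcher", 6: "researcher", 20: "topic"}
    adj   = {1: [5, 6], 5: [20], 6: []}

    def reachable(a, d, seen=None):
        # Naive DFS reachability test: a ~> d.
        seen = set() if seen is None else seen
        if a == d:
            return True
        seen.add(a)
        return any(n not in seen and reachable(n, d, seen) for n in adj.get(a, []))

    def r_join(A, D):
        # R-join A -> D: all pairs (a, d) with label(a) = A, label(d) = D and a ~> d.
        ext = lambda L: [v for v, l in label.items() if l == L]
        return [(a, d) for a, d in product(ext(A), ext(D)) if reachable(a, d)]

    # Multi R-join of Example 1, joined on the shared researcher node.
    ir = r_join("Institute", "researcher")        # e.g. [(1, 5), (1, 6)]
    rt = r_join("researcher", "topic")            # e.g. [(5, 20)]
    answer = [(i, r, t) for (i, r) in ir for (r2, t) in rt if r == r2]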
3 Motivation
Recently, as an effort to extend the Twig-Join of [4] to work on graphs, Chen et al.
studied multi R-join query processing (called pattern matching) over a directed acyclic
graph (DAG) in [5]. As an approach along the lines of Twig-Join [4], Chen et al. used the
interval-based encoding scheme, widely used for processing queries over an XML tree, in
which a node v is encoded with a pair [s, e], where s and e together specify an interval.
Given two nodes u and v in an XML tree, u is an ancestor of v, u ⇝ v, if u.s < v.s and
u.e > v.e, or simply if u's interval contains v's.
The test of a reachability relationship in [5] is broken into two parts. First, like the
existing interval-based techniques for processing pattern matching over an XML tree, they
check whether the reachability relationship can be identified over a spanning tree
generated by a depth-first traversal of the DAG. Table 2(a) lists the intervals from a
spanning tree over the DAG of our running example. Second, for the reachability
relationships that may exist in the DAG but not in the spanning tree, they index all
non-tree edges (named remaining edges in [5]), and all nodes incident to any such non-tree
edge, in a data structure called SSPI [5]. Thus, all predecessor/successor relationships
that cannot be identified by the intervals alone can be found with the help of the SSPI.
For our running example, Table 2(b) shows the SSPI.
As given in [5], for example, the procedure to find the predecessor/successor relationship
17 ⇝ 21 in the DAG of Fig. 1 is as follows. First, it checks the containment of the tree
intervals of 17 and 21, but there is no such path between them in the tree. Then, because
21 has predecessor entries in the SSPI, it tries to find a reachability relationship
between 17 and each of 21's predecessors in the SSPI by recursively checking the
containment of the tree interval of 17 and that of each of 21's predecessors.
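The following Python sketch mirrors this two-step test, using the data of Table 2;
representing the tree intervals and the SSPI as dictionaries is an assumption made purely
for illustration.

    # Tree intervals and SSPI entries taken from Table 2.
    interval = {17: (35, 40), 21: (28, 29), 16: (27, 30), 19: (38, 39),
                13: (17, 20), 4: (42, 43)}
    sspi = {13: [4], 16: [19, 4], 20: [13], 21: [16]}

    def reaches(u, v):
        # Step 1: containment of the spanning-tree intervals.
        su, eu = interval[u]
        sv, ev = interval[v]
        if su <= sv and ev <= eu:
            return True
        # Step 2: recurse through v's predecessors recorded in the SSPI.
        return any(reaches(u, p) for p in sspi.get(v, []))

    # 17 reaches 21 via 21's SSPI predecessor 16 and 16's predecessor 19.
    assert reaches(17, 21)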
As shown above, in order to identify a reachability relationship between two nodes, say a
and d, TwigStackD needs to recursively search the SSPI to check whether a predecessor of d
can be reached from a. Over a DAG this overhead can be costly. Consider the DAG of 2n − 1
nodes in Fig. 2, where the solid lines are edges of the spanning tree generated by a
depth-first search and the dashed lines are the remaining edges. Note that in the SSPI,
the entry for v_n contains nodes v_{n+1}, v_{n+2}, ..., v_{2n-1}. Thus, to determine the
reachability relationship from node v_{2n-1} to node v_n, TwigStackD needs n − 1 checks to
see whether v_{2n-1} can reach some node in the entry. The cost of processing R-join
queries is considerably high.
We conducted tests to confirm our observations. We generated a DAG by collapsing all
strongly connected components in a graph obtained with the XMark data generator using
scaling factor 0.01 (16K nodes). Both XML tree edges and ID/IDREF links are treated as
edges of the graph. Fig. 4 shows the performance of TwigStackD
[Figure 4 panels: (a) Number of I/Os, (b) Elapsed Time (sec), (c) Number of Index Seeks,
each plotted against Remaining Edges Included (%).]
Fig. 4. The Test on DAGs with Increasing Densities
on 5 DAGs with 10%, 20%, 30%, 40%, and 50% of non-tree edges (called remaining edges),
expressed as a percentage of the tree edges in the spanning tree obtained from the graph.
The queries used are Q1 and Q4, which are listed in Fig. 7 (a) and (d). In Fig. 4, Q(TSD)
and Q(DP) denote the costs of processing Q with Chen et al.'s TwigStackD and with our
dynamic programming approach, respectively.
Fig. 4 (a) shows the number of I/Os as more remaining edges are added to the underlying
DAG. As an example, for query Q4(TSD), the number of I/Os increases by 4,606 between 10%
and 20% remaining edges, while it increases by 38,881 between 40% and 50%. When five times
as many remaining edges are included, the number of I/Os increases about 35 times. As for
the number of index seeks in the SSPI, namely the number of times a leaf page is fetched
from the B+-tree that implements the SSPI, shown in Fig. 4 (c), this value increases by
616,052 between 10% and 20%, and by 5,201,991 between 40% and 50%. The correlation
coefficient between these two measurements is above 0.999, which indicates that the
behavior of the number of I/Os during processing is mainly caused by the number of SSPI
index seeks. A similar trend for the processing time can be observed in Fig. 4 (b), since
the number of I/Os is the dominating factor in the total processing cost. This test
empirically shows that TwigStackD performs well on DAGs with few remaining edges, but that
its performance degrades rapidly as more edges are included in the underlying DAG.
Fig. 4 (a) and (b) also show the efficiency of our dynamic programming approach. Our
approach is not as sensitive as TwigStackD to the density of the DAG. For Q4, our approach
uses fewer than 200 I/O accesses and less than 1 second of processing time.
assigned a set of intervals and a postorder number to each node in the DAG. Let Iu = {[s1 : e1], [s2 : e2], · · · , [sn : en]} be the set of intervals assigned to a node u. There is a path from u to v, u ⇝ v, if the postorder number of v is contained in some interval [sj : ej] in Iu. The interval-based coding for the graph in Figure 1 (b) is given in Table 1. For the same example of 17 ⇝ 21 in the DAG of Fig. 1, reachability can be identified from 21's poId, 12, and one interval associated with 17, [12 : 13], since 12 is contained in [12 : 13].
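As an illustration of this encoding, the following is a small sketch (our own, not code from [2]) of the reachability test u ⇝ v under the multiple-interval scheme; the dictionaries intervals and post are hypothetical containers for the codes of Fig. 1.

```python
# Hypothetical sketch of the multiple-interval reachability test of [2].
# intervals[u] = list of (s, e) intervals assigned to u;
# post[v]      = postorder number assigned to v.

def reachable(u, v, intervals, post):
    """u ~> v iff v's postorder number lies in some interval of u."""
    return any(s <= post[v] <= e for (s, e) in intervals[u])

# Running example: node 17 carries the intervals [12:13] and [17:19], and
# node 21 has postorder number 12, so 17 ~> 21 holds.
intervals = {17: [(12, 13), (17, 19)]}
post = {21: 12}
assert reachable(17, 21, intervals, post)
```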
Based on [2], Wang et al. studied processing R-joins over a directed graph [11]. In brief, given a directed graph G, the approach first constructs a DAG G′ by condensing each strongly connected component of G into a single node of G′. Second, it generates the encoding for G′ based on [2]. All nodes in a strongly connected component of G share the code assigned to the corresponding representative node in G′. Given an R-join, A→D, two lists, Alist and Dlist, are formed. Alist encodes every node v as (v, s:e) where [s : e] ∈ Iv; a node of A has n entries in the Alist if it has n intervals. Dlist encodes each node v as (v, pov) where pov is the postorder number. Note that Alist is sorted on the intervals [s : e] in ascending order of s and then descending order of e, and Dlist is sorted on the postorder numbers in ascending order. Wang et al. proposed to merge-join the nodes in Alist and Dlist, scanning the two lists once.
It is important to note that some extension is needed to use the R-join algorithm [11] to process multiple R-joins. Consider A→D ∧ D→E. For processing A→D, the list for D needs to be sorted on postorder numbers, because D plays the role of a descendant. For processing D→E, the list for D needs to be sorted on s followed by e for all (v, s:e), because D plays the role of an ancestor. Also, recall that for A ⇝ D, Alist needs to encode every node v as (v, s:e) where [s : e] ∈ Iv. This means there is a blocking step between the two consecutive R-joins, A→D followed by D→E: we need to generate a new Alist from the output of the previous R-join, A→D, in order to carry out the next R-join, D→E. Thus, the intervals and postorder numbers of each node must be maintained during multi R-join processing so that intermediate Alists and Dlists can be regenerated on the fly. A total of three operations are needed to implement this blocking and enable multi R-join query processing.
– α(A): Given a list of node vectors of the form (v1, v2, . . . , vl), where each vi is in the extension associated with A, it attaches each interval [s : e] ∈ Ivi, obtaining a number of vectors (v1, v2, . . . , vl, [s : e]) from every vector (v1, v2, . . . , vl), and sorts the resulting list to obtain an Alist. For example, consider the execution of two consecutive R-joins, Institute→researcher and researcher→topic, to process the query of our running example. The first R-join, Institute→researcher, produces a set of temporary results A′ = {(1, 5), (1, 6), (3, 17)}. To prepare the input for the second R-join, researcher→topic, an α(A′) operation is applied and we obtain {(1, 5, [1 : 2]), (1, 6, [3 : 4]), (3, 17, [12 : 13]), (3, 17, [17 : 19])}, which becomes the input Alist for the second R-join.
– δ(D): Similar to α, but it attaches the postorder numbers, producing (v1, v2, . . . , vl, [povi]) for every vector (v1, v2, . . . , vl), where vi is in the extension associated with D, to form a sorted Dlist. For example, consider the execution of two consecutive R-joins, researcher→topic and Institute→researcher, to process the query of our running example. The first R-join, researcher→topic, produces a set of temporary results D′ = {(9, 20), (10, 21), (17, 21)}. To prepare the input for the second R-join, Institute→researcher, a δ(D′) operation is applied and we obtain {(9, 20, [10]), (10, 21, [15]), (17, 21, [19])}, which becomes the input Dlist for the second R-join.
– σ(A, D): Given a list of node vectors of the form (v1, v2, . . . , vl), where vi/vj in a vector is in the extension associated with A/D, it selects those vectors satisfying vi→vj. This is used to process an R-join A→D when both A nodes and D nodes are already present in the partial solution. For example, considering the query in Fig. 7(c) and the four consecutive R-joins I→C, I→P, C→P and L→P used to evaluate it, once I→C, I→P and C→P have been processed, only a σ(L, P) is further needed to finish the evaluation. (A small sketch of these three operations follows this list.)
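The following is a small Python sketch of how α, δ and σ can be realized over sorted lists; the function and variable names are ours, and the sort orders follow the description above, so this is only an illustration of the blocking operations, not the paper's implementation.

```python
# Hypothetical sketch of the three blocking operations for multi R-joins.
# intervals[v] = list of (s, e) intervals of node v; post[v] = postorder number.

def alpha(vectors, i, intervals):
    """alpha(A): for each vector, attach every interval of its i-th node
    (the node bound to A) and sort to obtain an Alist
    (ascending s, then descending e)."""
    out = [vec + ((s, e),) for vec in vectors for (s, e) in intervals[vec[i]]]
    return sorted(out, key=lambda t: (t[-1][0], -t[-1][1]))

def delta(vectors, i, post):
    """delta(D): attach the postorder number of the i-th node of each vector
    and sort by it to obtain a Dlist."""
    return sorted((vec + (post[vec[i]],) for vec in vectors),
                  key=lambda t: t[-1])

def sigma(vectors, i, j, reachable):
    """sigma(A, D): keep only vectors whose i-th node reaches their j-th node."""
    return [vec for vec in vectors if reachable(vec[i], vec[j])]
```

Applied to A′ = {(1, 5), (1, 6), (3, 17)} with the researcher intervals listed in the α(A) example above, alpha(vectors, 1, intervals) reproduces exactly the Alist shown there.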
We develop the cost functions involving these operations for multi R-join processing after describing R-join size estimation.
So the estimated answer size of (R1→R2 ∧ . . . ∧ Ri−1→Ri) ∧ (Rh→Ri+1) can be

EST = |R1||R2| · · · |Ri||Ri+1| · Pr(Join(r1, r2, . . . , ri, ri+1))
    = |R1| · · · |Ri+1| · ( M / (|R1||R2| · · · |Ri|) ) · ( N / (|Rh||Ri+1|) ) = (M × N) / |Rh|.
Hence, we can estimate the answer size of all such R-joins by simply keeping all pairwise R-join sizes and the cardinality of every label's extension in the database catalog.
Example 2. For our running example, the first join is Institute→research, thus M = 3. For Institute→research→topic, since N = 3 and |ext(research)| = 5, the estimated result set size is (3 × 3)/5 = 1.8. The same result is obtained if research→topic is taken as the first join.
In our dynamic programming style optimization, the two basic components of the solution space are statuses and moves.
We can estimate the cost of each move using the cost formulae in Sec. 4.4. Each status S is associated with a cost function, denoted cost(S), which is the minimal accumulated estimated cost to move from the initial status S0 to the current status S. The accumulated cost of a sequence of moves from S0 to S is the estimated cost of evaluating the subquery GS considered under the current status S. The goal of our dynamic programming is to find the sequence of moves from the initial status S0 to the final status Sf with the minimum cost, cost(Sf), among all possible sequences of moves. This method is quite straightforward and its search space is bounded by 2^m.
Our algorithm is outlined in Algorithm 1. We simply apply Dijkstra's algorithm for the shortest path problem to our search space, aiming to find a "shortest" path from S0 to any Sf, where nodes represent statuses, edges represent moves, and the length of an edge is the cost of one move. We omit further explanation of Algorithm 1.
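For concreteness, here is a compact sketch of such a Dijkstra-style plan search. Statuses are represented as frozensets of already-processed R-joins, and the functions moves and move_cost are hypothetical hooks standing in for the move enumeration and the cost formulae of Sec. 4.4; this is an illustration under those assumptions, not Algorithm 1 itself.

```python
import heapq

def optimal_plan(query_edges, moves, move_cost):
    """Dijkstra-style search from the initial status S0 (nothing joined yet)
    to the final status Sf (every R-join of the query processed)."""
    target = frozenset(query_edges)
    start = frozenset()
    best = {start: 0.0}
    heap = [(0.0, 0, start, [])]        # (accumulated cost, tie-breaker, status, plan)
    counter = 0
    while heap:
        cost, _, status, plan = heapq.heappop(heap)
        if status == target:            # final status reached with minimal cost
            return plan, cost
        if cost > best.get(status, float("inf")):
            continue                    # stale queue entry
        for edge in moves(status):
            nxt = status | {edge}
            ncost = cost + move_cost(status, edge)
            if ncost < best.get(nxt, float("inf")):
                best[nxt] = ncost
                counter += 1
                heapq.heappush(heap, (ncost, counter, nxt, plan + [edge]))
    return None, float("inf")
```

The plan returned is the sequence of moves; the accumulated cost corresponds to cost(Sf).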
Example 3. For our running example, Figure 5 shows two alternative plans for evaluating the query I→R→T, both containing two moves. The status S0 is associated with a NULL graph, S1 and S2 are each associated with a graph of two connected nodes, and S3 is associated with Gq and is thus a final status.
The detailed steps of the search for an optimal plan are shown in Figure 6, where each row of the table lists a move in the solution space. The first column is the status where the move starts and the second column is the status the move reaches. The third column is the R-join processed by the move, and the fourth column is the number of results generated by that R-join.
5 Performance Evaluation
In this section, we report two sets of tests showing the efficiency of our approach. The first set compares our dynamic programming approach (denoted DP) with the TwigStackD algorithm [5] (denoted TSD). The second set further confirms the scalability of our approach. We implemented all algorithms in C++ on top of a Minibase-based2 variant deployed on Windows XP. The buffer of the database system is configured to be 2MB. A PC with a 3.4GHz processor, 2GB memory, and a 120GB hard disk running Windows XP was used to carry out all tests.
[Fig. 7. Queries Q1-Q4: (a) Q1, (b) Q2, (c) Q3, (d) Q4, over the labels I, C, D, P, L.]

[Fig. 8. Details of the datasets:]
Dataset    |V|          |E|          |I|          |I|/|V|
20M        307,110      352,214      453,526      1.478
40M        610,140      700,250      901,365      1.477
60M        916,800      1,003,437    1,360,559    1.484
80M        1,225,216    1,337,378    1,816,493    1.483
100M       1,666,315    1,756,509    2,269,465    1.485
We generated XMark datasets [9] of 20M, 40M, 60M, 80M and 100M using the factors 0.2, 0.4, 0.6, 0.8 and 1.0 respectively, and named each dataset by its size. In these XML documents, we treat parent-child edges and ID/IDREF edges without distinction to obtain graphs, and collapse the strongly connected components of these graphs to obtain DAGs. The details of the datasets are given in Fig. 8. In Fig. 8, the first column is the dataset name. The second and third columns are the number of nodes and the number of edges of the resulting DAG, respectively. The fourth column is the size of the multiple-interval labeling, and the last column shows the average number of intervals per node in the DAG.
Throughout all experiments, we use the 4 multi R-join queries listed in Fig. 7, where the label I stands for interest, C for category, L for listitem, D for description and P for parlist.
We test all queries over the dataset described in Section 3 in order to compare the TwigStackD algorithm with our approach. We show two sets of figures reporting the elapsed time, number of I/Os and memory used to process each query.
2 Developed at Univ. of Wisconsin-Madison.
[Fig. 9. Comparison on the DAGs with 10% and 50% Remaining Edges Included: (a) elapsed time (sec), (b) number of I/Os, (c) memory (MB) for 10% remaining edges; (d) elapsed time (sec), (e) number of I/Os, (f) memory (MB) for 50% remaining edges; queries Q1-Q4, DP vs. TSD.]
The first set of figures, Fig. 9 (a)-(c), shows the performance on the DAG with 10 percent remaining edges added, and the second set, Fig. 9 (d)-(f), shows the performance on the DAG with 50 percent remaining edges added.
As shown in Fig. 9, our approach significantly outperforms TwigStackD in terms of elapsed time, number of I/O accesses, and memory consumption. The difference becomes even sharper for a denser DAG, due to the rapid performance degradation of TwigStackD as the number of edges in the DAG increases. For example, consider Q3: TwigStackD used 16.7 times the elapsed time and 8.7 times the I/O accesses of our approach when 10 percent remaining edges were added, but when 50 percent remaining edges were added, the two ratios became 2922.3 and 266.4 respectively. The memory usage of TwigStackD is unstable and can range from 60MB to 900MB for the 4 queries, because TwigStackD needs to buffer every node that can potentially participate in any final solution, and thus its memory usage largely depends on the solution size. It can also be observed that larger queries generally need more memory, owing to TwigStackD's increased need for buffer pools.
[Figure: scalability of DP over the XMark datasets (20M-100M): elapsed time (sec), number of I/Os, and memory (MB) for queries Q1-Q4.]
6 Conclusion
In this paper, we studied query processing of multiple reachability joins (R-joins) over a large DAG. The state-of-the-art approach, the TwigStackD algorithm, uses a single-interval encoding scheme. TwigStackD assigns to each node in a DAG a single interval based on a spanning tree obtained from the DAG, and builds a complementary index called SSPI. It uses a twig-join algorithm to find matches that exist in the spanning tree and buffers all nodes that belong to any solution, in order to find all matches in the DAG with the help of the SSPI. TwigStackD performs well on rather sparse DAGs, but its performance degrades noticeably when the DAG becomes dense, due to the high overhead of accessing edge transitive closures.
We presented an approach that uses an existing multiple-interval encoding scheme assigning multiple intervals to each node. With the multiple-interval encoding scheme, no additional data structure is needed. We showed that optimizing R-joins (R-join order selection) using dynamic programming with a primitive implementation of R-joins can significantly improve performance, even though such an approach may introduce overhead for feeding the intermediate result of one R-join into another. We conducted extensive performance studies and confirmed the efficiency of our DP approach. DP significantly outperforms TwigStackD and is not sensitive to the density of the underlying DAG.
Acknowledgment
This work was supported by a grant of RGC, Hong Kong SAR, China (No. 418206).
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured
data and XML. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000.
2. R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient management of transitive relationships
in large data and knowledge bases. In Proc. of SIGMOD’89, 1989.
3. D. Brickley and R. V. Guha. Resource Description Framework (RDF) Schema Specification
1.0. W3C Candidate Recommendation, 2000.
4. N. Bruno and N. K. et al. Holistic twig joins: optimal XML pattern matching. In Proc. of SIGMOD'02.
5. L. Chen and A. G. et al. Stack-based algorithms for pattern matching on DAGs. In Proc. of VLDB'05.
6. J. Cheng and J. X. Y. et al. Fast reachability query processing. In Proc. of DASFAA'06.
7. S. DeRose, E. Maler, and D. Orchard. XML linking language (XLink) version 1.0. 2001.
8. S. DeRose, E. Maler, and D. Orchard. XML pointer language (XPointer) version 1.0. 2001.
9. A. Schmidt and F. W. et al. XMark: A benchmark for XML data management. In Proc. of VLDB'02.
10. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path
selection in a relational database management system. In Proc. SIGMOD’79, pages 23–34,
1979.
11. H. Wang, W. Wang, X. Lin, and J. Li. Labeling scheme and structural joins for graph-
structured XML data. In Proc. of The 7th Asia Pacific Web Conference, 2005.
12. H. Wang, W. Wang, X. Lin, and J. Li. Subgraph join: Efficient processing subgraph queries
on graph-structured XML document. In Proc. of WIAM’02, 2005.
A Path-Based Approach for Efficient Structural
Join with Not-Predicates
1 Introduction
Research on XML query processing has focused on queries involving structural joins; e.g., the query "//dept[/name="CS"]//professor" retrieves all the professors in the CS department. However, many real-world applications also require complex XML queries containing not-predicates. For example, the query "//dept[NOT(/name="CS")]//professor" retrieves all the professors who are not from the CS department. We call this class of queries negation queries.
A naive method to evaluate negation queries is to decompose them into several normal queries involving structural join operations. Each decomposed query can be evaluated using any existing structural join method [4,6,7,8,9,12,11], followed by a post-processing step that merges the results. This simplistic approach is expensive because it requires repeated data scans and incurs overhead to merge the intermediate results. The work in [10] proposes a holistic path join algorithm which is effective for path queries with not-predicates, while [14] develops a method called TwigStackList¬ to handle a limited class of twig queries with not-predicates, i.e., queries with answer nodes above any negative edge.
In this paper, we propose a path-based approach to handle a larger class of negation queries efficiently, i.e., queries with answer nodes both above and below negative edges. We introduce a model called the XQuery tree to model queries involving negated containment relationships. We utilize the path-based labeling scheme in [11] for queries involving not-predicates. Experimental results indicate that the path-based approach is more efficient than TwigStackList¬ [14].
The rest of the paper is organized as follows. Section 2 reviews related work.
Section 3 illustrates the drawback of the TwigStackList¬ method. Section 4
describes the proposed path-based approach. Section 5 gives the experimental
results and we conclude in Section 6.
2 Related Work
The structural join has become a core operation in XML queries [4,6,7,8,9,12,11]. The earliest work [12] uses a sort-merge or a nested-loop approach to process the structural join. Index-based binary structural join solutions employ the B+-tree [7], XB-tree [6] and XR-tree [8] to process queries efficiently. Subsequent works extend the binary structural join to holistic twig joins. Bruno et al. [6] propose a holistic twig join algorithm, TwigStack, which aims at reducing the size of the intermediate result and is optimal for ancestor-descendant relationships, while [13] designs an algorithm called TwigStackList to handle parent-child relationships. The work in [11] designs a path-based labeling scheme to reduce the number of elements accessed in a structural join operation.
Al-Khalifa et al. [5] examine how the binary structural join method can be employed to evaluate negation in XML queries. Algorithm PathStack¬ [10] utilizes a boolean stack to answer negation queries. The boolean stack contains a boolean variable "satisfy" which indicates whether the associated item satisfies the sub-path rooted at this node. In this way, a negation query does not need to be decomposed, which improves the query evaluation process.
Algorithm TwigStackList¬ [14] extends the algorithm TwigStackList [13] to handle holistic twig negation queries. TwigStackList¬ also avoids decomposing holistic negation queries into several sub-queries without negations. However, TwigStackList¬ can only process a limited class of negation queries and suffers from high computational cost (see Section 3). In contrast, our approach utilizes the path-based labeling scheme in [11] to filter out unnecessary element nodes efficiently and handles a larger class of negation queries.
3 Motivating Example
TwigStackList¬ [14] defines a query node as an output node if it does not appear below any negative edge; otherwise, it is a non-output node. Consider query T1 in Fig. 1(b), where {B} is an output node and {D, E, F} are non-output nodes. Suppose we issue query T1 over the XML document Doc1 in Fig. 1(a), whose element nodes have been labeled using the region encoding scheme [4]. TwigStackList¬ associates a list LB and a stack SB with the output node B. Element B1 in the XML document is first inserted into the list LB. Since B1 satisfies the not-predicate condition in query T1, it is also pushed onto the stack SB. Next, element B2 is inserted into LB. B2 is subsequently deleted from LB since its descendant element D1 has child nodes E2 and F1, thus satisfying the sub-query rooted at D in T1. The final answer for T1 is B1.
There are two main drawbacks in Algorithm TwigStackList¬. First, the class of negation queries which can be processed is limited to output nodes occurring above any negative edge. Hence, it cannot handle meaningful complex queries such as T2 in Fig. 1(c), which retrieves all the matching occurrences of elements B and C such that B is not a child of A and B has child nodes C and D, while D has a child node E but does not have a descendant node F (we call nodes B and C the projected nodes of the query).
[Fig. 1. (a) XML document Doc1, with element nodes labeled by the region encoding scheme, (b) Query T1, (c) Query T2.]
4 Path-Based Approach
The proposed approach to evaluating XML negation queries utilizes the path-based labeling scheme proposed in [11]. We first review the scheme and introduce the XQuery tree model to represent negation queries. We then describe the algorithms P Join¬ and N Join¬, which remove the unnecessary elements and carry out the structural join operation, respectively.
[Fig. 2. (a) XML document Doc1 labeled with path ids, (b) root-to-leaf path encoding table: Root/A/B/C/E → 1, Root/A/B/C/D/E → 2, Root/A/B/C/D/F → 3, (c) an example negation query modeled as an XQuery tree.]
Let P idA and P idD be the path ids for elements with tags A and D respectively. If (P idA & P idD) = P idD, then we say P idA contains P idD. This is called Path ID Containment. Li et al. [11] prove that the containment of two nodes can be deduced from the containment of their path ids.
Property I: Let P idA and P idD be the path ids for elements with tags A and D respectively. If P idA contains P idD and P idA ≠ P idD, then each A with P idA must have at least one descendant D with P idD.
Consider the element nodes B2 and E2 in Doc1. The path id 011 for B2 contains the path id 010 for E2, since the bit-and operation between 011 and 010 equals 010 and the two path ids are not equal. Therefore, B2 must be an ancestor of E2.
If two sets of nodes have the same path ids, then we need to check their corresponding root-to-leaf paths to determine their structural relationship. For example, the nodes B1 and E1 in Doc1 have the same path id 100. We can decompose the path id 100 into one root-to-leaf path with the encoding 1, since the bit in the corresponding position is 1. By looking up the first path in the encoding table (Fig. 2(b)), we know that B1 is an ancestor of E1.
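The containment test itself is a single bitwise operation. The following is a minimal sketch of Property I on the example labels of Doc1; the function name pid_contains is ours and the snippet only illustrates the bit-and test and the fallback to the root-to-leaf path lookup, not the implementation of [11].

```python
# Hypothetical sketch of the path-id containment test (Property I).

def pid_contains(pid_a, pid_d):
    """pid_a contains pid_d iff (pid_a & pid_d) == pid_d."""
    return (pid_a & pid_d) == pid_d

# Example from Doc1: B2 has path id 011 and E2 has path id 010.
pid_b2, pid_e2 = 0b011, 0b010
assert pid_contains(pid_b2, pid_e2) and pid_b2 != pid_e2   # B2 is an ancestor of E2

# Equal path ids (e.g. B1 and E1, both 100) require a lookup of the
# corresponding root-to-leaf paths in the encoding table instead.
```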
Fig. 2(c) shows an example negation query modeled using the XQuery tree. The equivalent query specified using the XQuery language is as follows:

For $v In //B
Where exists($v/C) and exists($v/D/E) and
      count(A/$v) = 0 and count($v/D//F) = 0
Return {$v} {$v/C}
Note that negated edges cannot occur between the projected nodes of a query, since they would result in queries that are meaningless, e.g., retrieve all the elements A and B such that A does not contain B. Therefore, we can deduce that, given an XQuery tree T, there exists some subtree T′ of T such that T′ contains all the projected nodes in T and no edge in T′ is a negated edge. Given an XQuery tree T = (V, E), we call a subtree TP = (V′, E′) of T its projected tree if:
1. V′ ⊆ V, and
2. V′ contains all the projected nodes in T, and
3. for any e ∈ E′, e is not a negated edge.
The projected tree of the XQuery tree in Fig. 2(c) is shown within the dashed
circle. Given an XQuery tree T , we define the subtree above TP as tree TPa and
the subtree below TP as TPb respectively.
Definition 3 (Tree TPa ). Given an XQuery tree T , let R be the root node of
TP , and e be the incoming edge of R. We define TPa as the subtree obtained from
T - TR - e, where TR denotes the subtree rooted at R.
Definition 4 (Tree TPb). Given an XQuery tree T, we define TPb as a subtree rooted at C, where C denotes a child of a leaf node of TP.
In Fig. 2(c), the nodes A and F form the trees TPa and TPb of T respectively.
Note that an XQuery tree T has at most one TPa and possibly multiple TPb . A
tree TPa or TPb may contain negated edges. However, queries with negated edges
in TPa or TPb may have multiple interpretations. For example, the query “A does
not contain B, and B does not contain C, where C is the projected node” has
different semantics depending on the applications. Here, we focus on queries
whose subtrees TPa and TPb do not contain any negated edges.
based on the path id containment property. Any path id that does not satisfy the path id containment relationship is removed from the lists of path ids of both the parent node and the child node. However, this algorithm does not work well for queries involving not-predicates.
Consider query T3 in Fig. 3(a), where the lists of path ids have been associated with the corresponding nodes. We assume that path ids with the same subscripts satisfy the path id containment relationship, i.e., b2 contains c2, etc. Algorithm P Join will first perform a bottom-up binary path join. The path id lists for nodes C and D are joined. Since the path ids d2, d3 and d4 are contained in the path ids c2, c3 and c4 respectively, d1 is removed from the set of path ids of D. The path id list of node C is then joined with the path id list of node E. No path id is removed, since each path id of E is contained in some path id of C. We then join the path id list of node B with that of node C. The path ids c2 and c3 are contained in the path ids b2 and b3 respectively. Since there is a not-predicate condition between nodes B and C, the path ids b2 and b3 need to be removed from the set of path ids of B. Finally, a binary path join between nodes A and B is carried out and the path id a2 is removed.
Next, Algorithm P Join carries out a top-down binary path join on T3 starting from the root node A. The final result is shown in Fig. 3(b). The optimal sets of path ids for the nodes in T3 are shown in Fig. 3(c). The difference between the two sets of path ids shown in Fig. 3(b) and Fig. 3(c) arises because Algorithm P Join does not apply the constraint imposed on nodes A and B to the entire query.
The above example illustrates that the proper way to evaluate a negated containment relationship between path ids is to update only the path ids of the nodes in the projected tree. This leads to the design of Algorithm P Join¬.
The basic idea behind P Join¬ (Algorithm 1) is that, given a query T, we first apply P Join on TPa and TPb. The path ids of the leaf node of TPa and of the root node(s) of TPb are used to filter out the path ids of the corresponding nodes in TP. The input to Algorithm P Join¬ is an XQuery tree T with a set of projected nodes. We first determine the projected tree TP of T. Then the P Join algorithm is carried out on TPb and TPa (if any), respectively (lines 4-5). Next, a bottom-up binary path join and a top-down binary path join are performed on TP
Algorithm 1. P Join¬
1: Input: T - An XQuery-tree
2: Output: Path ids for the nodes in T
3: Associate every node in T with its path ids;
4: Perform a bottom-up binary path join on TPb and TPa ;
5: Perform a top-down binary path join on TPb and TPa ;
6: Perform a path anti-join between the root node(s) of TPb and their parent node(s)
if necessary;
7: Perform a bottom-up binary path join on TP ;
8: Perform a path anti-join between the leaf node of TPa with its child node if necessary;
9: Perform a top-down binary path join on TP ;
(lines 7, 9). Each binary path join operation on TP (lines 7, 9) is preceded by a path anti-join operation (lines 6, 8). A path anti-join takes as input two lists of path ids, but one list of path ids is used only for reference; only path ids in the other list are removed if necessary. In line 6 (8), Algorithm P Join¬ utilizes the root (leaf) nodes of TPb (TPa) to filter out the path ids of their parent (child) node(s).
Note that if the set of path ids for the root node (leaf node) of TPb (TPa) contains some path id whose corresponding element node is not a result of T (a super P id set), then the path anti-join operation in line 6 (8) of Algorithm 1 is skipped. This is because the super P id set of the root node (leaf node) of TPb (TPa) could erroneously remove path ids from its parent node (child node), and we might then miss some correct answers in the final query result.
Consider again query T3 in Fig. 3(a). The projected tree is the subtree rooted at node C. A P Join is first performed on tree TPa, which contains nodes A and B. The set of path ids obtained for B is {b1, b2}. Next, a bottom-up path join is carried out on TP. Since TPa is a simple path query without value predicates, the path id set associated with B is not a super P id set, according to the discussion in [11]. Then we can perform a path anti-join between nodes B and C. This step eliminates c2 from the path id set of C, since c2 is contained in b2. Finally, a top-down path join is performed on TP, which eliminates d1 and d2 from the set of path ids for D, and e2 from the set of path ids for E. The final result after P Join¬ is shown in Fig. 3(c).
Algorithm 2. N Join¬
1: Input: T - An XQuery-tree
2: Output: All occurrences of nodes in TP
3: if TPa is null then
4: Perform TwigStackList¬ on T;
5: else
6: Perform holistic structural join on TPa , TPb and TP ;
7: Merge the intermediate result;
8: end if
[Fig. 4. Example to illustrate the optimality of the path-based approach: panels (a) and (b) show the path id sets associated with the query nodes B, D, E and F.]
5 Experimental Evaluation
We use three real-world datasets for our experiments: Shakespeare's Plays (SSPlays) [1], DBLP [2] and the XMark benchmark [3]. Table 1 shows the characteristics of the datasets and Table 2 gives the query workload.
We first evaluate the effectiveness of P Join¬ in filtering out unnecessary elements for the subsequent N Join¬ operation. The following metrics are used:

Filtering Efficiency = |Nip| / |Ni|
Selectivity Rate = |Nin| / |Ni|

where |Nip| denotes the number of instances of node Ni after the P Join¬ operation, |Nin| denotes the number of instances of node Ni in the result set after the N Join¬ operation, and |Ni| denotes the total number of instances of node Ni in the projected tree of the query.
Fig. 5(a) shows the Filtering Efficiency and the Selectivity Rate for queries Q1 to Q4. The closer the two values are, the more effective P Join¬ is for the query. We observe that Algorithm P Join¬ is able to remove all the unnecessary elements for queries Q1, Q2 and Q3, so the subsequent N Join¬ will not access any element that does not contribute to the final result, leading to optimal query evaluation. Query Q4 has a Filtering Efficiency higher than its Selectivity Rate because the query node person, which is the root node of the subtree rooted at node age, is a branch node; the set of path ids for person is a super P id set. Nevertheless, Algorithm P Join¬ remains effective in eliminating unnecessary path ids even for such queries.
Fig. 5(b) and (c) show that the I/O cost and elapsed time of Algorithm P Join¬ are marginal compared with those of N Join¬ for queries Q1 to Q4.
[Fig. 5. (a) Filtering Efficiency and Selectivity Rate for queries Q1-Q4; (b) I/O cost; (c) elapsed time.]
This is because the sizes of the path lists are much smaller than those of the node lists. The time cost of P Join¬ for queries Q3 and Q4 is slightly higher than for Q1 and Q2, due to the larger number of distinct paths, as well as the longer path ids, of the XMark dataset.
[Figure: performance comparison of TwigStackList¬ and the path-based approach.]
6 Conclusion
References
1. http://www.ibiblio.org/xml/examples/shakespeare.
2. http://www.informatik.uni-trier.de/˜ley/db/.
3. http://monetdb.cwi.nl/.
4. A. Al-Khalifa, H. V. Jagadish, and J. M. Patel et al. Structural joins: A primitive
for efficient xml query pattern matching. IEEE ICDE, 2002.
5. S. Al-Khalifa and H. V. Jagadish. Multi-level operator combination in xml query
processing. ACM CIKM, 2002.
6. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal xml pattern
matching. ACM SIGMOD, 2002.
7. S-Y. Chien, Z. Vagena, and D. Zhang et al. Efficient structural joins on indexed
xml documents. VLDB, 2002.
8. H. Jiang, H. Lu, W. Wang, and B. C. Ooi. Xr-tree: Indexing xml data for efficient
structural joins. IEEE ICDE, 2003.
9. H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed xml documents.
VLDB, 2003.
10. E. Jiao, T-W. Ling, C-Y. Chan, and P. S. Yu. Pathstack¬: A holistic path join
algorithm for path query with not-predicates on xml data. DASFAA, 2005.
11. H. Li, M-L. Lee, and W. Hsu. A path-based labeling scheme for efficient structural
join. International Symposium on XML Databases, 2005.
12. Q. Li and B. Moon. Indexing and querying xml data for regular path expressions.
VLDB, 2001.
13. J. Lu, T. Chen, and T-W. Ling. Efficient processing of xml twig patterns with
parent child edges: A look-ahead approach. CIKM, 2004.
14. T. Yu, T-W. Ling, and J. Lu. Twigstacklist¬: A holistic twig join algorithm for
twig query with not-predicates on xml data. DASFAA, 2006.
RRPJ: Result-Rate Based Progressive
Relational Join
School of Computing
National University of Singapore
{tokwh,steph,leeml}@comp.nus.edu.sg
1 Introduction
The universe of network-accessible information is expanding. It is now common practice for applications to process streams of data incoming from remote sources (repositories continuously publishing data, or sensor networks producing continuous data). An essential operation is the equijoin of two streams of relational data. The design of such a join algorithm must meet a key requirement: the algorithm must be non-blocking (or progressive), i.e., it must be able to produce results as soon as possible, at the least possible expense to the overall throughput.
Several non-blocking algorithms for various operators in general, and for the relational equijoin in particular, have been proposed [1,2,3,4,5]. These algorithms can be categorized as heuristic or probabilistic methods. Heuristic methods rely on pre-defined policies for the efficient usage of the available memory, whereas probabilistic methods [6,7] attempt to model the incoming data distribution (values and arrival parameters) and use it to predict the tuples or partitions that should be kept in memory in order to produce the maximum number of result tuples. The main thrust of all these techniques lies in the simple idea of keeping useful tuples or partitions (i.e., tuples or partitions likely to produce more results) in memory. Amongst the many progressive join algorithms introduced, one of the state-of-the-art hash-based progressive join algorithms is the Rate-based Progressive Join (RPJ) [6]. The limitations of RPJ are that it does not perform well if the data within the partitions are non-uniform, and that it is not straightforward to generalize it to non-relational data. In this paper, we propose the Result-Rate based Progressive Join (RRPJ), which overcomes these limitations.
The rest of the paper is organized as follows. In Section 2, we discuss related work and focus on two recent progressive join algorithms, their strengths and their limitations. In Section 3, we present a novel method, called the Result-Rate based Progressive Join (RRPJ), which uses a model of the result distribution to determine which tuples to flush. We conduct an extensive performance study in Section 4, and conclude in Section 5.
unpredictable network. Let the two sets of relational data objects be denoted by R = {r1, r2, . . . , rn} and S = {s1, s2, . . . , sm}, where ri and sj denote the i-th and j-th data objects from their respective remote data sources. When performing a relational equijoin with join attribute A, a result is returned when ri.A is equal to sj.A. Formally, (ri, sj) is reported as a result if ri.A equals sj.A. The goal is to deliver initial results quickly and to ensure a high result throughput.
In RPJ [6], the probability that the next arriving tuple belongs to stream i and has the value v (i.e., falls into value partition j) is then computed as

P_i^arr[j] = ( n_i^total[j] / Σ_{j'=1..n_part} n_i^total[j'] ) · ( n_i^rcnt / (n_1^rcnt + n_2^rcnt) )

(refer to [6] for the details of these statistics). Using this model, the probability that a tuple t will appear at the n-th position of the stream is given by

Prob(x_n = t | x_{n−1}, ..., x_{n−h}) = b·P(t) + Σ_{j=1..h} a_j·δ(x_{n−j}, t)

(δ(x_k, c) = 1 if x_k = c, and it is 0 otherwise). Using the LA model, the marginal utility of a tuple is then derived and used as the basis for determining the tuples to be flushed to disk whenever memory is full.
3.1 RRPJ
We propose a novel join algorithm, called the Result-Rate Based Progressive Join (RRPJ) (Algorithm 1), which uses information on the result throughput of the partitions to determine the tuples or partitions that are likely to produce results. In Algorithm 1, an arriving tuple is first used to probe the hash partitions of the corresponding data stream in order to produce result tuples. Next, the algorithm checks whether memory is full (line 2). If memory is full, it first computes the Thi values (i.e., the value given by Equation 3) for all partitions. Partitions with the lowest Thi values are then flushed to disk, and the newly arrived tuple is inserted. The main difference between RRPJ flushing and RPJ is that the Thi values reflect the output (i.e., result) distribution over the data partitions, whereas the RPJ values are based on the input data distribution.
To compute the Thi values (using Equation 3), we track the total number of tuples, ni (for each partition), that contribute to a join result from the probes against the partition. Intuitively, RRPJ tracks the join throughput of each partition. Whenever memory becomes full, we flush nflush (a user-defined parameter) tuples from the partition with the smallest Thi value, since this partition has produced the fewest results so far. If the number of tuples in the partition is less than nflush, we move on to the partition with the next lowest Thi value.
Given two timestamps t1 and t2 (t2 > t1), let the numbers of join results produced at t1 and t2 be n1 and n2 respectively. A straightforward definition of the throughput of a partition i, denoted by Thi, is given in Equation 1.

Thi = (n2 − n1) / (t2 − t1)    (version 1)    (1)
From Equation 1, we can observe that since (t2 − t1) is the same for all partitions, it suffices to maintain counters for just the number of results produced (i.e., n1 and n2). A partition with a high Thi value is one with a higher potential of producing the most results. However, it is important to note that Equation 1 does not take into consideration the size of a partition and its impact on the number of results produced. Intuitively, a large partition will produce more results, but this might not always be true. For example, a partition might contain few tuples, yet produce a lot of results. This partition should be favored over a relatively larger partition which is also
producing the same number of results. Besides considering the result distribution amongst the partitions, we must also consider the following: (1) the total number of tuples that have arrived, (2) the number of tuples in each partition, (3) the number of result tuples produced by each partition, and (4) the total number of results produced by the system. Therefore, we use an improved definition of Thi, given below.
Suppose there are P partitions maintained for the relation. Let Ni denote the number of tuples in partition i (1 ≤ i ≤ P), and Ri denote the number of result tuples produced by partition i. Then the Thi value for partition i can be computed as in Equation 2, where we consider the ratio of the results produced by the partition to the total number of results produced so far (the numerator), and the ratio of the number of tuples in the partition to the total number of tuples that have arrived (the denominator).
Thi = ( Ri / Σj=1..P Rj ) / ( Ni / Σj=1..P Nj ) = ( Ri × Σj=1..P Nj ) / ( (Σj=1..P Rj) × Ni )    (version 2)    (2)
Since the total number of results produced and the total number of tuples that have arrived are the same for all partitions, Equation 2 can be simplified; this is given in Equation 3.

Thi = Ri / Ni    (version 2, after simplification)    (3)
Equation 3 computes the Thi value with respect to the size of the partition. For example, consider two cases. In case (1), suppose Ni = 1 (i.e., one tuple in the partition) and Ri = 100. In case (2), suppose Ni = 10 and Ri = 1000. Then the Thi values for cases (1) and (2) are the same. This prevents large partitions from unfairly dominating smaller partitions (due to the potentially large number of results produced by larger partitions) when choosing which partitions to flush to disk.
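As a concrete illustration, the following small Python sketch selects flush victims using Equation 3. The per-partition counters Ri and Ni follow the definitions above, while the function name choose_victims and the nflush handling are our own simplification of the flushing step, not the paper's implementation.

```python
# Hypothetical sketch of RRPJ's flush decision based on Equation 3 (Th_i = R_i / N_i).

def choose_victims(partitions, n_flush):
    """partitions: dict partition_id -> (R_i, N_i).
    Select tuples from the partitions with the smallest throughput Th_i
    until n_flush tuples have been chosen for flushing."""
    ranked = sorted(partitions.items(),
                    key=lambda kv: kv[1][0] / max(kv[1][1], 1))  # ascending Th_i
    victims, remaining = [], n_flush
    for pid, (r, n) in ranked:
        if remaining <= 0:
            break
        take = min(n, remaining)     # move on to the next-lowest partition if needed
        victims.append((pid, take))
        remaining -= take
    return victims

# Example: a small partition with many results is kept in favour of a
# larger partition that produces few results.
parts = {"p1": (100, 1), "p2": (1000, 10), "p3": (5, 50)}
print(choose_victims(parts, 20))     # p3 (Th = 0.1) is flushed first
```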
4 Performance Study
In this section, we study the performance of the proposed RRPJ against RPJ. All experiments were conducted on a PC with a Pentium 4 2.4GHz CPU and 1GB RAM. We measure the progressiveness of the various flushing policies by the number of results produced and the response time.
Figure 2 plots the execution time (x-axis) against the number of result tuples generated (y-axis). From Figure 2, we can observe that the performance of RRPJ is comparable to that of RPJ on the same datasets from [6], and hence RRPJ is at least as effective as RPJ for uniform data.
[Fig. 2. Number of result tuples produced over execution time (s), RRPJ vs. RPJ (four panels).]
and 5 are 0.68, 0.17, 0.08, 0.04 and 0.03 respectively. During each reorder, these probabilities that a newly arrived tuple belongs to a specific partition change.
In this experiment, we evaluate the performance of the Amortized RRPJ (ARRPJ) against RPJ and RRPJ when the arriving data exhibits a varying arrival distribution (i.e., the probability that a newly arrived tuple belongs to a partition changes). We vary the amortization factor, σ, for ARRPJ between 0.0 and 1.0, and call the corresponding algorithm ARRPJ-σ. When σ = 0.0, only the latest RRPJ values (i.e., the number of results produced and the size of each data partition since the last flush) are used; when σ = 1.0, ARRPJ is exactly RRPJ (it computes the average of the statistics over time).
[Figure: number of result tuples produced over execution time (s) for RRPJ and ARRPJ-σ (σ = 0.0, 0.2, 0.5, 0.8, 1.0) under varying arrival distributions; panels include (e) α = 4k and (f) α = 0k.]
5 Conclusion
References
1. Haas, P.J., Hellerstein, J.M.: Ripple join for online aggregation. In: SIGMOD.
(1999) 287–298
2. Urhan, T., Franklin, M.J., Amsaleg, L.: Cost based query scrambling for initial
delays. In: SIGMOD. (1998) 130–141
3. Urhan, T., Franklin, M.J.: XJoin: Getting fast answers from slow and bursty net-
works. Technical Report CS-TR-3994, Computer Science Department, University
of Maryland (1999)
4. Avnur, R., Hellerstein, J.M.: Eddies: Continuously adaptive query processing. In:
SIGMOD. (2000) 261–272
5. Madden, S., Shah, M.A., Hellerstein, J.M., Raman, V.: Continuously adaptive
continuous queries over streams. In: SIGMOD. (2002) 49–60
6. Tao, Y., Yiu, M.L., Papadias, D., Hadjieleftheriou, M., Mamoulis, N.: Rpj: Pro-
ducing fast join results on streams through rate-based optimization. In: SIGMOD.
(2005) 371–382
7. Li, F., Chang, C., Kollios, G., Bestavros, A.: Characterizing and exploiting refer-
ence locality in data stream applications. In: ICDE. (2006) 81
8. Dittrich, J.P., Seeger, B., Taylor, D.S., Widmayer, P.: Progressive merge join: A
generic and non-blocking sort-based join algorithm. In: VLDB. (2002) 299–310
9. Dittrich, J.P., Seeger, B., Taylor, D.S., Widmayer, P.: On producing join results
early. In: PODS. (2003) 134–142
10. Mokbel, M.F., Lu, M., Aref, W.G.: Hash-merge join: A non-blocking join algorithm
for producing fast and early join results. In: ICDE. (2004) 251–263
11. Lawrence, R.: Early hash join: A configurable algorithm for the efficient and early
production of join results. In: VLDB. (2005) 841–852
12. Wilschut, A.N., Apers, P.M.G.: Dataflow query execution in a parallel main-
memory environment. In: PDIS. (1991) 68–77
13. Tok, W.H., Bressan, S., Lee, M.L.: Progressive spatial join. In: SSDBM. (2006)
353–358
GChord: Indexing for Multi-Attribute Query in P2P
System with Low Maintenance Cost
Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China
{zhouminqi,rongzh,wnqian,ayzhou}@fudan.edu.cn
1 Introduction
Peer-to-peer (P2P) systems provide a new paradigm for information sharing in large-
scale distributed environments. Though the success of file sharing applications has
proved the potential of P2P-based systems, the limited query operators supported by
existing systems prevent their usage in more advanced applications.
Much effort has been devoted to providing fully featured database query processing in P2P systems [1,2,3,4]. There are several differences between query processing for file sharing and for database queries. Firstly, the types of data are much more complex in databases than in file names; basically, numerical and categorical data types should be supported. Secondly, files are searched via keywords, and keyword search is often implemented using exact match queries; for numerical data types, however, both exact match queries (or point queries) and range queries should be supported. Last but not least, users may issue queries with constraints on a varying number of attributes in database applications. This last requirement poses additional challenges for database-style query processing in P2P systems. Some existing methods, such as VBI-Tree [2], can only support user queries with constraints on all attributes. Some other methods, namely Mercury [3] and MAAN [4], index data separately on each attribute. Though they can support multi-attribute queries with constraints on an arbitrary number of attributes, they are not efficient for indexing data with more than three attributes, for two reasons. The first is that the maintenance cost increases with the number of attributes. The second is that the selectivity of an index on one attribute decreases drastically when the number of attributes increases.
We present GChord, a Gray-code-based Chord, as a new index scheme supporting multi-attribute queries (MAQ) in a P2P environment. It distinguishes itself from other methods in the following aspects:
2 Related Works
MAQ has been widely studied in centralized database systems. One solution for indexing data for MAQ is the hBΠ-tree [6], which is a combination of the multi-attribute index hB-tree [7] and the abstract index Π-tree [8]. The hBΠ-tree achieves low storage cost and efficient point- and range-query processing for various data types and data distributions. However, the different setting of large-scale distributed systems prevents techniques designed for centralized systems from being applied directly in P2P systems.
In large-scale P2P systems, distributed hash tables (DHTs), such as Chord [5], Pastry [9], and CAN [10], are widely used. However, a DHT only supports keyword-based lookup(key) search, and these hash-based methods usually cannot preserve the locality and continuity of data.
The methods supporting MAQ in structured P2P systems can be classified into two categories. The first introduces traditional tree-structured index schemes into P2P systems. BATON [1] is a P2P index structure based on a balanced binary tree. BATON* [11] substitutes an m-way tree for the binary tree in BATON. These two structures support single-dimensional range queries well. VBI-tree [2] provides a framework for indexing multi-dimensional data in P2P systems using hierarchical tree-structured indexes from centralized systems. However, these structures cannot efficiently support queries with constraints on an arbitrary number of attributes.
The other category of work is based on extending DHT-based overlay networks. The basic idea behind these methods is to use one overlay network for each
attribute that needs to be indexed, and to use locality-preserving hashing to index numerical attributes. Mercury [3] and MAAN [4] belong to this category. Both index each attribute separately on a Chord ring, and the index with the best selectivity is used to prune the search when supporting MAQ. Therefore, both have high storage and index maintenance costs. Furthermore, their search efficiency decreases drastically when the dimensionality of the data increases.
3 Problem Statements
A tuple of data is represented by a set of attribute-value pairs: t{attri, vi}, i = 1, · · · , N. The domain Ai of attribute attri is either numerical or categorical. A numerical domain is assumed to be continuous or sectionally continuous, and bounded. Given a data set, the set of domains A : {Ai} for i = 1, 2, · · · , N is assumed to be known in advance. We believe that, even with this assumption, many applications fit our MAQ model.
A multi-attribute query (MAQ) is a conjunction of a set of predicates of the form (attri, op, vi), i = 1, · · · , m, in which attri is the attribute name and op is one of <, ≤, >, ≥, = for numerical attributes and = for categorical ones. Note that a query may have an arbitrary number of predicates, and the predicates may be on arbitrary attributes. Figure 1 shows a simple example of data and queries.
[Fig. 1. Data Item and Query]    [Fig. 2. Two-Attribute Domain Partition]
The results of a MAQ are the set of tuples satisfying all predicates present in the query. Note that there are no constraints on the values of attributes missing from the query.
In the MAQ problem in a P2P environment, each peer in a set of peers may share (or publish) a set of tuples. Each peer may issue a MAQ to be evaluated in a P2P style, which means there is no centralized server, and the peers work in a collaborative manner. To collaborate with others, a peer devotes some storage space to indexing data shared by others, and supports index lookup and maintenance.
The difficulty of query processing for MAQ in P2P systems lies in the following three aspects: 1) one query may involve attributes with different data types, and point
constraints and range constraints may appear simultaneously in one query, 2) an arbitrary number of attributes may be present in a query, and 3) index maintenance is especially expensive in P2P systems. Since there are N! − 1 possible combinations of attributes in a query, any method using an index structure that can only answer queries with a fixed number of attributes will fail to handle MAQ in a P2P environment, because of the high cost of index maintenance in distributed environments.
In the next section, we present GChord, a Gray code based indexing scheme that can be distributed over a Chord-like overlay network. By fully utilizing the network links provided by the underlying network, it indexes each tuple only once, and can support MAQ with an arbitrary number of attributes using this single index.
Generating the Index Key for a Tuple. As the number of peers that participate in the network is much smaller than the number of peers the network can accommodate, one peer in the network has to manage many index keys. If the index key were simply constructed by concatenating all N codes, the attributes encoded at the right side would lose their distinguishability: all values of an attribute encoded on the right side may be mapped to the same peer, rendering the index on that attribute useless.
GChord provides a shuffling-based method to generate the index key of a tuple. The shuffled index key is constructed by interleaving the codes bit by bit, concatenating one bit of the code of one attribute with one bit of the code of the next. The order of the attributes is pre-determined using the descending order of the sizes of their domains.
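The following is a minimal sketch of these two steps, assuming the standard binary-reflected Gray code (the paper's own encoding algorithm is not reproduced here) and assuming the per-attribute codes are already fixed-length bit strings ordered by descending domain size; the function names are ours.

```python
# Hypothetical sketch of GChord index-key generation:
# (1) Gray code for a numerical attribute value,
# (2) bit-wise shuffling (interleaving) of the per-attribute codes.

def gray_code(value, bits):
    """Standard binary-reflected Gray code of 'value', as a bit string."""
    return format(value ^ (value >> 1), "0{}b".format(bits))

def shuffle_codes(codes):
    """Interleave the per-attribute codes bit by bit: one bit of the first
    attribute, one bit of the second, and so on (codes are padded to equal
    length for safety)."""
    width = max(len(c) for c in codes)
    padded = [c.ljust(width, "0") for c in codes]
    return "".join(padded[a][i] for i in range(width) for a in range(len(codes)))

# Example: two attributes encoded on 3 bits each.
key = shuffle_codes([gray_code(5, 3), gray_code(2, 3)])
print(key)   # a 6-bit interleaved index key
```

Interleaving rather than concatenating keeps every attribute represented among the high-order bits of the key, which is why no attribute loses its distinguishability.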
Analysis of the Index Key Generation Method. Since two adjacent Gray codes differ in only one bit, the adjacency relationships between sections of a numerical attribute are preserved by any structured overlay network protocol that maintains one-bit-difference links in its routing tables, such as Chord and Pastry.
Property 1. Two index keys have a one-bit difference if the two corresponding tuples have adjacent values on one numerical attribute and the same values on all other attributes.
Lemma 1. The equal-length prefixes of a sequence of Gray codes also constitute a Gray code sequence.
Property 3. The index keys stored on each peer constitute a similarly sized portion of each attribute's domain.
Property 4. The process of node join is the process of repartitioning the attribute domains.
As follows from Property 4, load balancing can be achieved by selecting a suitable Id for a node at the time it joins.
1. For each predicate attr op v on a numerical attribute attr, code(attr) op code(v);
2. For each predicate attr = v on a categorical attribute attr, code(attr) = code(v);
3. All other bits can be either 0 or 1.
Thus, a multicast task can be represented by a set of strings with the same length as that of the identifier (index key). Each element in the string is either 0, 1 or x, in which x means either 0 or 1.
Multicast trees (MCTs) are constructed to forward the query to the indexing peers. An MCT is a virtual tree whose edges are routing paths of the query. An MCT corresponding to the multicast of 10xx1xx is shown in Fig. 3.
[Fig. 3. (a) The multicast tree corresponding to the pattern 10xx1xx; (b) an example multicast process.]
Multicast Tree Construction. As the links in the finger table of the overlay network are directed, a single MCT without irrelevant indexing nodes may not exist for a MAQ. The MCTs are therefore constructed on the fly when a query is issued. A modified Karnaugh Map [13] construction algorithm is employed for this task.
[Figure residue: a Karnaugh-map representation of a query over two attributes a and b, used for multicast tree construction; Index Buddy illustration.]
After all the MTPs have been generated, we shuffle these MTPs on each attribute, in the same way as index keys are constructed, to construct the MCTs.
Our Karnaugh Map based MCT construction technique can be generalized to handle multiple queries. Targeted index peers from different MAQs may be grouped together into one MCT using the same technique introduced above. Thus, the queries may share the same messages to retrieve the index entries, which may further reduce the network transmission cost.
Query Multicasting. After all the MCTs corresponding to the query have been generated, multicasting of the query is conducted as follows: (1) The query message is sent to the root of each MCT, which is the peer whose identifier is obtained by substituting all xs in the MCT representation with 0s. (2) When a query is received by a peer, it is evaluated on the peer's local index and forwarded to all peers whose identifiers are obtained by substituting one of the xs in the MCT representation from 0 to 1. (3) This is conducted recursively until no x remains at 0. Fig. 3 (b) illustrates a multicast process.
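To make the forwarding rule concrete, here is a toy Python sketch (ours) of steps (1)-(3): it starts from the identifier with all xs set to 0 and repeatedly flips one remaining x from 0 to 1. It uses a visited set rather than a proper tree to suppress duplicate visits, so it only approximates the MCT traversal.

```python
# Hypothetical sketch of multicasting a query pattern over 0/1/x identifiers.

def mct_root(pattern):
    """Root of the MCT: substitute every x in the pattern with 0."""
    return pattern.replace("x", "0")

def forward(ident, pattern):
    """Peers to forward to from 'ident': flip one x-position still at 0 to 1."""
    return [ident[:i] + "1" + ident[i + 1:]
            for i, (p, b) in enumerate(zip(pattern, ident))
            if p == "x" and b == "0"]

def multicast(pattern):
    """Visit every identifier matching the pattern, mimicking steps (1)-(3)."""
    visited, stack = set(), [mct_root(pattern)]
    while stack:
        ident = stack.pop()
        if ident in visited:
            continue
        visited.add(ident)            # the query is evaluated on this peer's local index
        stack.extend(forward(ident, pattern))
    return visited

print(sorted(multicast("10xx1xx")))   # the 16 identifiers matching the pattern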
Property 6. The number of query message routing hops is bounded by O(log2 N + M), where N is the number of nodes that the overlay network can accommodate and M is the number of xs in the MTP representation.
5 Performance Enhancement
The number of attributes with constraints in a query can vary from 1 to N. More MCTs are generated when there are more range constraints on the attributes in a query, since the number of MCTs is the product of the numbers of MTPs on each attribute. The cost of separately multicasting the large number of MCTs involved in such a query is very high.
On the other hand, the query range will be very large if fewer attributes carry constraints in the query, and a large number of peers then have to be accessed to process the MAQ. If a MAQ is evaluated by accessing a large number of peers, the numbers of query message routing hops and query messages will be high. In these two scenarios, performance can be enhanced by multicast tree clustering and the index buddy, respectively.
MCT clustering only sends multiple MCTs in a single query message; it does not affect the evaluation of each individual MCT.
The clustering procedure is as follows: (1) The query submitter clusters together all MCTs whose root keys are close, namely those for which the difference between any two root keys is less than the index range maintained by a peer. (2) The query message is sent according to the smallest root key within the MCT cluster. (3) A peer receiving the MCT cluster clusters all adjacent sub-multicast trees together, as the query submitter does. (4) Step 3 is applied recursively until no sub-multicast trees remain.
As many MCTs and sub-MCTs are sent within one message, the number of messages per query is reduced dramatically.
their index keys within the region. When the detected region becomes infrequent again,
the two peers will remove these redundant index keys from the index buddy.
– Index Modification. When an index buddy exists, new index entries to be inserted and
existing entries to be modified need to be processed at both sites of the index buddy
in order to keep the index consistent.
6 Experimental Study
To evaluate the performance of GChord, we implemented a simulator in Java (JDK 1.4.2).
In our implementation, each peer is identified by its peer Id. Like a physical peer, it main-
tains two limited message queues, one for sending messages and one for receiving mes-
sages. The network layer is simulated to control the network communication, i.e., the
sending of messages from one peer to another based on peer Ids.
In our experiments, 10000 peers with randomly distributed peer Ids construct the
Chord ring. The peer Id is a 32-bit string. Each data tuple contains 5 numerical
attributes and 1 categorical attribute; 100000 data tuples with randomly distributed
values within their attribute domains need to be indexed. Range queries with a given
maximum query range are generated randomly within the attribute domains, and point
queries are generated randomly within the attribute domains.
Impact of Attribute Number in MAQ. The first set of experiments gives the perfor-
mance curves as the number of attributes that contain constraints in the query varies.
The maximum query range on each attribute is set to 10% of its domain.
As shown in Fig. 6(a), 6(b) and 6(c), the numbers of maximum routing hops, rout-
ing messages and accessed peers drop dramatically as the number of attributes
that contain constraints in the MAQ increases. The number of routing messages reduces to
about one tenth when using the multicast tree clustering strategy. Multicast tree clustering
improves performance significantly, especially when the query contains fewer attributes.
[Plots omitted: (a) hops vs. number of attributes queried; (b) messages vs. number of attributes queried; (c) accessed peers vs. number of attributes queried; curves for Predecessor, Cluster, Buddy, and Both.]
Fig. 6. Performance Comparison with Variable Attributes in Query
Impact of Query Range in MAQ. In this set of experiments, the number of attributes
that contain constraints in the query is set to 4. As shown in Fig. 7(a), 7(b), and 7(c), the
numbers of maximum routing hops, routing messages and accessed peers decrease as
the query range on each attribute decreases. More MCTs are generated
when the query range on each attribute increases, so many more routing messages are
saved by using multicast tree clustering in this scenario.
[Plots omitted: (a) hops vs. query range; (b) messages vs. query range; (c) accessed peers vs. query range; curves for Predecessor, Cluster, Buddy, and Both.]
Fig. 7. Performance Comparison with Variable Query Range
Impact of Frequently Queried Region. In this set of experiments, the maximum query
range on each attribute is set to 10% of its domain. As shown in Fig. 8(a), 8(b) and
8(c), the index buddy has an evident effect in reducing the number of peers accessed as
the percentage of frequent queries increases. The index buddy has a similar impact on the
maximum number of routing hops, especially when the query contains fewer attributes.
[Plots omitted: hops, messages, and accessed peers vs. query frequency (20%–80%), with one curve per number of attributes (3–6).]
Fig. 8. Performance Comparison with Frequent Queries
Comparison with Mercury. As there are 10000 peers in the network, the numbers of
maximum hops and accessed peers in Mercury are much larger than GChord's: approx-
imately 1700 peers construct a Chord ring to maintain the index keys on each attribute.
The selectivity power of the attribute is very strong, so Mercury needs to access a large
number of peers to process an MAQ. Index keys are stored contiguously on peers, so ac-
cessing an adjacent index key needs only one more hop.
[Plots omitted: (a) hops comparison with Mercury; (b) messages comparison with Mercury; (c) accessed peers comparison with Mercury; x-axis is the number of attributes (3–6).]
Fig. 9. Performance Comparison with Mercury
That is why the number of routing messages in Mercury is smaller than GChord's. As
shown in Fig. 9(a) and 9(c), GChord considerably outperforms Mercury in the numbers
of maximum routing hops and accessed peers.
Due to space limitations, the comparison of index cost with Mercury is not shown in
figures. Mercury keeps one duplicate of the index for each attribute, so the index
cost of Mercury is proportional to the number of attributes a data tuple contains. Hence
the index costs of GChord, including index storage and index messages, are much lower
than Mercury's.
7 Conclusion
In this paper, we present the design of GChord, a P2P-based indexing scheme for pro-
cessing multi-attribute queries. Using Gray code based indexing technique, both point-
and range-query on numerical attributes can be handled. By integrating Gray code and
hash based encoding methods, each tuple needs to be indexed only once in GChord.
Our index can support queries having constraints on an arbitrary number of attributes.
Thus, it is more efficient than previous methods in terms of storage cost and search
performance. The enhancement techniques further improve the performance of GChord.
Our future work on GChord includes the research on supporting keyword-based
queries and aggregate queries over GChord, and the study on more intelligent query
optimization techniques.
References
1. H. Jagadish, B. Ooi, Q. Vu: BATON: A balanced tree structure for peer-to-peer networks. In:
VLDB. (2005)
2. H. Jagadish, B. Ooi, Q. Vu, R. Zhang, A. Zhou: VBI-tree: A peer-to-peer framework for sup-
porting multi-dimensional indexing schemes. In: ICDE. (2006)
3. R. Bharambe, M. Agrawal, S. Seshan: Mercury: Supporting scalable multi-attribute range
queries. In: SIGCOMM. (2004)
4. M. Cai, M. Frank, J. Chen, P. Szekely: MAAN: A multi-attribute addressable network for grid
information services. In: Grid. (2003)
5. I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan: Chord: A scalable peer-to-peer
lookup service for internet applications. In: ACM SIGCOMM. (2001) 149–160
6. G. Evangelidis, D. Lomet, B. Salzberg: The hBΠ-tree: A multi-attribute index supporting con-
currency, recovery and node consolidation. VLDB Journal 6 (1997) 1–25
7. D. Lomet, B. Salzberg: The hB-tree: A multiattribute indexing method with good guaranteed
performance. ACM Trans. Database Syst. 15 (1990) 625–658
8. D. Lomet, B. Salzberg: Access method concurrency with recovery. In: SIGMOD. (1992)
9. A. Rowstron, P. Druschel: Pastry: Scalable, decentralized object location and routing for large-
scale peer-to-peer systems. In: Middleware. (2001) 329–350
10. S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker: A scalable content-addressable network. In: SIG-
COMM. (2001)
11. H. Jagadish, B. Ooi, K.-L. Tan, Q. Vu, R. Zhang: Speeding up search in peer-to-peer networks
with a multiway tree structure. In: SIGMOD. (2006)
12. F. Gray: Pulse code communication. U.S. Patent 2,632,058. (1953)
13. M. Karnaugh: The map method for synthesis of combinational logic circuits. AIEE 72 (1953)
593–599
ITREKS: Keyword Search over Relational Database by
Indexing Tuple Relationship
1 Introduction
Keyword-based search is well studied in the world of text documents and Internet
search engines, but Keyword-based search over relational databases is not well
supported. The user of a relational database needs to know the schema of the
database; Casual users must learn SQL and know the schema of the underlying data
even to pose simple searches. For example, suppose we have a DBLP database,
whose schema is shown in Figure 1. We wish to search for an author Bob’s paper
related to “relation”. To answer this query, we must know how to join the Author,
Write and Paper relations on the appropriate attributes, and we must know which
relations and attributes to contain “Bob” and “relation”. In keyword-based search, for
the above example a user should be able to enter the keywords ‘Bob relation’ and the
associated tuples which are associated with the two keywords are returned.
Enabling keyword search in databases that does not require knowledge of the
schema is a challenging task. Due to database normalization, logical units of
information may be fragmented and scattered across several physical tables. Given a
set of keywords, a matching result may need to be obtained by joining several tables
on the fly.
In this paper, we have developed a system, ITREKS (Indexing Tuple Relationship
for Efficient Keyword Search), which supports highly efficient keyword-based search
over relational databases by indexing tuple relationship. The key features and
advantages of our approach, and the contributions of this paper, are summarized as
follows:
• Most previous approaches perform a significant amount of database
computation at search time to find the connections of tuples that contain
keywords. We do all significant join computation in advance by creating the tuple
relationship index, so a great amount of computation is saved at search time.
• We present a novel approach to indexing the tuple relationship. We construct
the basic tuple relationship, FDJT, by computing the full disjunction [1] of the
interconnected relational database, and present an FDJT-Tuple-Index table to
index the tuples' relationships.
• We propose a modular architecture and have implemented ITREKS based on it.
• We present an efficient algorithm which incorporates basic tuples and the FDJT-
Tuple-Index table to generate result tuples matching the query.
• We take full advantage of existing relational database functions. ITREKS has
been implemented on top of Oracle 9i. Oracle 9i Text uses standard SQL to
create full text indexes on text attributes of relations. We completely avoid
reimplementing basic IR capabilities by using Oracle Text as the back end.
Furthermore, ITREKS keeps both FDJT and tuple-FDJT in relation tables. Our
searching algorithm is also based on a search table.
2 Related Work
Oracle [3], IBM DB2, Microsoft SQL Server, PostgreSQL, and MySQL all provide
text search engine extensions that are tightly coupled with the database engine.
However, in all cases each text index is designed over a single column. Using this
feature alone to do meaningful keyword search over an interconnected database
would require merging the results from many column text indexes.
Keyword-based search over relational databases has received much attention recently. Three
systems, DISCOVER [4][5], BANKS [6], and DBXplorer [7], share a similar
approach: at query time, given a set of keywords, first find tuples in each relation that
contain at least one of the keywords, usually using the database system's auxiliary full text
indexes; then use graph-based approaches to find tuples among those from the
previous step that can be joined together, such that the joined tuple contains all
keywords in the query. All three systems use foreign-key relationships as edges in the
graph, and point out that their approach could be extended to more general join
conditions. A main shortcoming of the three systems is that they spend a considerable
amount of time finding the candidate tuples that can be joined together.
Four systems share the concept of crawling databases to build external indexes.
Verity[8] crawls the content of relational databases and builds an external text index
for keyword searches, as well as external auxiliary indexes to enable parametric
searches. DataSpot[9] extracts database content and builds an external, graph-based
representation called a hyperbase to support keyword search. Graph nodes represent
data objects such as relations, tuples, and attribute values. Query answers are
connected subgraphs of the hyperbase whose nodes contain all of the query keywords.
DbSurfer [10] indexes the textual content of each relational tuple as a virtual web
page. Given a keyword query, the system queries and navigates the virtual web pages
to find the results. EKSO [11] indexes interconnected textual content in relational
databases and performs keyword search over this content: a relational database is crawled
in advance, text-indexing virtual documents that correspond to interconnected
database content; at query time, the text index supports keyword-based searches with
interactive response, identifying database objects corresponding to the virtual
documents matching the query.
All the index-data-offline systems face two challenges: how to control the
granularity of the indexed content, and how to efficiently find the exact results from
the indexed content.
While a direct empirical comparison between our system and some of the other
approaches mentioned in this section would be very interesting, such a comparison is not
feasible for the following reasons:
• The systems are not publicly available.
• The systems implement different search semantics and return different result sets.
• Any effort to implement them well enough for a fair comparison would be
prohibitive.
3 Background
3.1 Basic Tuple Relationship
In our method, we need first to find the closest and the most important connection
among tuples. In general, if we have any collection of facts that agree on common
attributes (are join-consistent) we would like them to be available in the “result” of
this collection of facts. The problem is related to that of computing the full outerjoin
of many relations in a way that preserves all possible connections among facts. Such a
computation has been termed a “full disjunction” by Galindo-Legaria[1] . A full
disjunction is a relation with nulls (represented by ⊥ ) such that every set of
join-consistent tuples in our database appears within a tuple of the full disjunction,
with either ⊥ or a concrete value in each attribute not found among our set of tuples.
Each tuple of the full disjunction corresponds to a set of connected tuples, each of
them from a database relation. Naturally, the full disjunction reflects the closest and most
important relationship among the tuples that generate it. Through the full disjunction,
we can build the basic relationship of the tuples that come from different database
relations.
We would like to find a simple way of computing the full disjunction of a set of
relations. The solution is to compute the full disjunction by full outerjoins. The full
outerjoin is a variant of the join in which tuples of one relation that do not match any
tuple of the other relation are added to the result, padded with nulls. This operation is
part of the SQL92 standard. The problem of computing the full disjunction by outerjoins
was studied by Galindo-Legaria in [1], which gave a simple test for when some order of
outerjoins is guaranteed to produce the full disjunction by itself:
create a graph whose nodes are the relations and whose edges connect relations that
are constrained by one or more comparisons; if the graph is acyclic then the full
disjunction can be computed by applying full outerjoins in any order. For cyclic graphs,
however, no order of outerjoins is guaranteed to produce the full disjunction. Thus we have Lemma 1.
Lemma 1. For a database which has an acyclic connected schema graph, we can
compute the full disjunction by applying full outerjoins of the connected relations in any
sequence.
Now, for a database whose schema graph is acyclic, we can use Lemma 1 to generate
a full outerjoin sequence producing the full disjunction; in this sequence, each relation
appears exactly once. The relation tuples which are outerjoined to generate a tuple of the
full disjunction are the FDJT of this tuple. Algorithm 1 generates the FDJTR while
computing the full disjunction of a database.
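As a quick illustration of Lemma 1 (not Algorithm 1 itself), the full disjunction of an acyclic schema can be obtained by chaining full outerjoins, e.g. with pandas; the tiny DBLP-like relations and column names below are illustrative assumptions:

# A minimal sketch of computing the full disjunction of an acyclic schema.
import pandas as pd

author = pd.DataFrame({'Aid': [1, 2], 'name': ['Bob', 'Alice']})
write  = pd.DataFrame({'Aid': [1], 'Pid': [10]})
paper  = pd.DataFrame({'Pid': [10, 11], 'title': ['On relations', 'Other']})

# Lemma 1: for an acyclic schema graph, any order of full outerjoins over the
# connected relations yields the full disjunction (NaN plays the role of the null ⊥).
fd = author.merge(write, how='outer', on='Aid').merge(paper, how='outer', on='Pid')
print(fd)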
Index: A database is enabled for keyword search through the following steps.
Step 1: A database D is identified, along with its schema graph G.
Step 2: If G is cyclic, turn it into an acyclic schema graph G' with Algorithm 2, which
will be discussed in Section 4.2.
Step 3: Given D and G’, Indexer generates FDJTR of D using Algorithm 1.
Step 4: FDJT-Tuple-Index table is created for supporting keyword searches, which
will be discussed in detail in Section 4.3.
Search: Given a query consisting of a set of keywords, it is answered as follows.
Step 1: For each keyword k, a Basic Tuple Set (BTS) is established by using database
full text search functions. Keyword k’s BTS is a relation recording all
database tuples which have scores to k.
Step 2: Based on BTSs, FDJT-Tuple-Index table and Search Table (see Section 4.4),
Searcher finds the results (joinable tuples) which include all keywords. We
discuss this step in Section 4.4.
Given a database schema graph, ITREKS first cuts off the cycle if the graph is
cyclic, so that we can use Algorithm 1 to compute the FDJTR of the database.
Figure 3 (a) is the schema graph of the DBLP database, where, for simplicity, A, W, P
and C denote the relations Author, Write, Paper and Cite, respectively. Figure 3 (b) is a
simplest but typical cyclic schema graph. ITREKS revises a cyclic schema graph by
two operations: cut-off and duplication.
Cut-off: By erasing a less important edge that belongs to the cycle, we can make a
cyclic schema graph acyclic. Figure 4 shows the cut-off revision of the schema graphs in Figure 3,
where the schema graphs become acyclic but we lose the relation between P and C (in
Figure 4 (a)) and the relation between B and C (in Figure 4 (b)), which we consider less
important. If there is no less important relation, we can remove any edge in the graph cycle.
We call such relations pure connective relations, because the only function of their attributes is to
connect tuples, and they do not contain indispensable keywords for our keyword search.
In ITREKS, we discard pure connective relations in the FDJTR once the FDJTR is
completely constructed. For example, after computing the FDJTR by Algorithm 1 over the
revised DBLP schema graph (Figure 5 (a)), the schema of the FDJTR is (FDJTid, Aid,
Wid, Pid, Cid, CPid). After discarding the pure connective relations we get FDJTR (FDJTid,
Aid, Pid, CPid). For simplicity, we use Aid to represent a tuple's id in relation Author;
Pid and CPid are defined similarly for the corresponding paper relations.
The FDJT-Tuple-Index table indexes each database tuple with FDJTs. ITREKS builds
the tuples' relationships by establishing the FDJT-Tuple-Index table.
Extended Schema Graph: To build the FDJT-Tuple-Index table, ITREKS extends
the FDJTR as follows:
for each relation in the FDJTR, if the relation has edges with other relations in the original
database schema graph, add these relations and edges to the FDJTR. If a newly added
relation is a pure connective relation, ITREKS continues to add the other relations that
have edges with the pure connective relation.
For the DBLP database, the extended schema graph of the FDJTR is shown in Figure 6.
The extended schema reflects the relationship between each database relation and the
FDJTR. If a relationship is not important enough to be indexed, we discard the relevant
relations in the extended schema. For example, in Figure 6, which papers are cited by
the papers in PC in the FDJTR need not be indexed, so ITREKS discards the relevant relations W
and P. Note that the extended schema graph is always a tree (i.e., acyclic).
Locator Number: ITREKS gives each relation in the extended schema a locator number
to record the distance and relationship between a tuple and an FDJT. The number is used
when ITREKS calculates the results. ITREKS assigns locator numbers to relations as
follows:
• Let the FDJTR have n relations; ITREKS labels each relation in the FDJTR with an integer
from 1 to n in sequence.
• For the other relations in the extended schema graph, the locator number consists of two
parts divided by a dot. The number to the left of the dot (the left number) is the number of
the FDJTR relation connected to it; the number to the right of the dot (the right number) is
the integer 1.
FDJT-Tuple-Index Table: In ITREKS, the FDJT-Tuple-Index table has 4 columns; the
first two columns are RN and Tid, which identify a database tuple's relation name and
rowid. Column FDJTid is the rowid of an FDJT in the FDJTR that is connected with the
tuple. Column N is the locator number representing the relationship between the tuple
and the FDJT; the locator number comes from the extended schema graph of the
FDJTR. In the FDJT-Tuple-Index table, each row records a tuple–FDJT pair and their
relationship.
After the FDJT-Tuple-Index table is created, ITREKS is ready for keyword search. Given a
query consisting of a set of keywords, ITREKS establishes a BTS (Basic Tuple Set)
for each keyword k, recording all database tuples which have non-zero scores for k. Then,
based on the BTSs, the FDJT-Tuple-Index table and the Search Table, the Searcher finds
the results (joinable tuples) which include all keywords.
Definition 5 (BTS). For a keyword k, the Basic Tuple Set is a relation BTS_k = {t |
Score(t, k) > 0}, which consists of the database tuples with a non-zero score for
keyword k.
ITREKS uses the Oracle Text full text search functions to build a BTS for each keyword.
A BTS table consists of 3 columns, RN, Tid and Score, which represent relation
name, tuple id and score, respectively.
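As a rough illustration of how the BTSs and the FDJT-Tuple-Index interact at search time (a hypothetical in-memory sketch, not the actual ST-based Algorithm 3; all table contents below are made up):

# Keyword search candidates via FDJT intersection -- an illustrative sketch only.
from collections import defaultdict

fdjt_index = {                      # (relation, tuple id) -> set of FDJT ids
    ('Author', 1): {100}, ('Paper', 10): {100}, ('Paper', 11): {101},
}
bts = {                             # keyword -> tuples with a non-zero score
    'Bob': {('Author', 1)}, 'relation': {('Paper', 10), ('Paper', 11)},
}

def candidate_fdjts(keywords):
    per_keyword = []
    for kw in keywords:
        ids = set()
        for t in bts.get(kw, ()):
            ids |= fdjt_index.get(t, set())   # FDJTs connected to this tuple
        per_keyword.append(ids)
    # FDJTs whose connected tuples cover every keyword
    return set.intersection(*per_keyword) if per_keyword else set()

print(candidate_fdjts(['Bob', 'relation']))   # -> {100}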
Definition 6 (ST). The Search Table is a table that is dynamically generated by ITREKS
to find joinable tuples at the search step. Given keywords k1,…,kn, ITREKS generates an
ST with 2 + 3n columns. In the ST, each keyword ki (i = 1,...,n) corresponds to 3 columns,
ki_RN, ki_Tid and ki_N, which represent the tuples and the relationships between the tuples.
The other two columns are FDJTid, which comes from the FDJT-Tuple-Index table, and
the Score of the result.
Definition 7 (Result Tree). Result Tree is a tree of joinable tuples based on
extended schema graph of FDJTR, where each leaf node of the tree contains at least
one keyword and the nodes of the tree contain all keywords. The sizeof(T) of a result
tree T is the number of edges in T.
Ranking Function: ITREKS uses a simple but effective ranking function to rank the
result trees for a given query. ITREKS assigns the score of a result tree T in the
following way:

score(T, Q) = (1 / sizeof(T)) · ∑_{i=1}^{sizeof(T)} ∑_{j=1}^{k} Score(t_i, kw_j)
where Score(t_i, kw_j) is the score of a tuple t_i towards keyword kw_j. ITREKS computes
sizeof(T) as follows. Let N be a tuple's locator number as recorded in the FDJT-Tuple-Index
table, let max be the largest left number of N among the result tree's leaf nodes, let min be the
smallest left number of N among the result tree's leaf nodes, and let r be the sum of the right
numbers of the result tree's nodes. Then

sizeof(T) = max − min + r
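A small Python sketch of this ranking, assuming the Score(t, kw) values come from the full-text index and the leaf locator numbers from the FDJT-Tuple-Index table (the helper names are ours, not ITREKS's):

# Illustrative sketch of sizeof(T) and score(T, Q).
def sizeof_T(leaf_locators):
    # locator numbers are either integers (relations inside the FDJTR) or
    # "left.right" strings for relations outside the FDJTR
    lefts, r = [], 0
    for n in leaf_locators:
        if isinstance(n, str) and '.' in n:
            left, right = n.split('.')
            lefts.append(int(left)); r += int(right)
        else:
            lefts.append(int(n))
    return max(lefts) - min(lefts) + r

def score(result_tuples, keywords, Score, leaf_locators):
    size = sizeof_T(leaf_locators)
    total = sum(Score(t, kw) for t in result_tuples for kw in keywords)
    return total / size if size else total

# e.g. a tree whose leaves have locators 1 and "2.1": sizeof = 2 - 1 + 1 = 2
print(sizeof_T([1, "2.1"]))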
Given a set of query keywords, ITREKS finds the results by Algorithm 3 described
below.
In Algorithm 3, once F_{i−1} is naturally joined with BTS_ki (i = 1,…,n) and the relevant
information is put into the ST, the ST records the relations of joinable tuples based on FDJTs,
and the joined tuples contain all keywords kj (j = 1,…,i).
5 System Evaluation
[Plot omitted: search time (msec, up to 8000) vs. number of keywords (2–5).]
Acknowledgements
This work is supported by the National Natural Science Foundation of
China(No.60473069 and 60496325), and China Grid(No.CNGI-04-15-7A).
References
[1] Galindo-Legaria, C. Outerjoins as disjunctions. ACM SIGMOD International Conf. on
Management of Data, 1994
[2] Rajaraman, A. and J. D. Ullman. Integrating information by outerjoins and full
disjunctions.
[3] Oracle Text. http://otn.oracle.com/products/text/index.html.
[4] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases.
In Proc. of VLDB, 2002.
[5] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search
over Relational Databases. In Proc. Of VLDB, 2003.
[6] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching
and browsing in databases using BANKS. In Proc. of ICDE, 2002.
[7] S. Agrawal, S. Chaudhuri, and G. Das. Dbxplorer: A system for keyword-based search
over relational databases. In Proc. of ICDE, 2002.
[8] P. Raghavan. Structured and unstructured search in enterprises. IEEE Data Engineering
Bulletin, 24(4), 2001.
[9] S. Dar, G. Entin, S. Geva, and E. Palmon. DTL's DataSpot: Database exploration using
plain language. In Proc. of VLDB, 1998.
[10] R. Wheeldon, M. Levene, and K. Keenoy. Search and navigation in relational databases.
http://arxiv.org/abs/cs.DB/0307073.
[11] Qi Su, Jennifer Widom. Efficient and Extensible Keyword Search over Relational
Databases. Stanford University Technical Report, 2003.
[12] DBLP bibliography. 2004. http://www.informatik.uni-trier.de/~ley/db/index.html
An MBR-Safe Transform for High-Dimensional
MBRs in Similar Sequence Matching
Yang-Sae Moon
1 Introduction
Time-series data are the sequences of real numbers representing values at spe-
cific points in time. Typical examples of time-series data include stock prices,
exchange rates, and weather data [1,3,5,8]. The time-series data stored in a
database are called data sequences, and those given by users are called query
sequences. Finding data sequences similar to the given query sequence from
the database is called similar sequence matching [3,8]. As the distance function
D(X, Y) between two sequences X = {x0, x1, ..., x_{n−1}} and Y = {y0, y1, ..., y_{n−1}}
of the same length n, many similar sequence matching models have used the Lp-
distance (= (∑_{i=0}^{n−1} |x_i − y_i|^p)^{1/p}), including the Manhattan distance (= L1), the
Euclidean distance (= L2), and the maximum distance (= L∞) [1,2,3,4,7,8,9].
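For reference, the Lp-distance above can be computed with a few lines of Python (p = 1, 2, or math.inf for L∞); this is only an illustration of the formula, not code from the paper:

import math

def lp_distance(X, Y, p):
    # L_inf is the maximum coordinate-wise difference
    if p == math.inf:
        return max(abs(x - y) for x, y in zip(X, Y))
    return sum(abs(x - y) ** p for x, y in zip(X, Y)) ** (1.0 / p)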
Most similar sequence matching solutions have used the lower-dimensional
transformation to store high-dimensional sequences into a multidimensional in-
dex [1,2,3,5,7,8,9]. The lower-dimensional transformation has first been intro-
duced in Agrawal et al.’s whole matching solution [1], and widely used in various
whole matching solutions [2,5] and subsequence matching solutions [3,7,8,9]. Re-
cently, it was also used in similar sequence matching on streaming time-series
for dimensionality reduction of query sequences or streaming time-series [4]. In
2 Related Work
Similar sequence matching can be classified into whole matching and subsequence
matching [3]. The whole matching[1,2,5] finds data sequences similar to a query
sequence, where the lengths of data sequences and the query sequence are all
identical. On the other hand, the subsequence matching[3,7,8] finds subsequences,
contained in data sequences, similar to a query sequence of arbitrary length.
3 Definition of MBR-Safe
Symbols and their definitions:
X — a high-dimensional sequence (= {x0, x1, ..., x_{n−1}}).
X^T — a (low-dimensional) sequence transformed from X by the transform T (= {x_0^T, x_1^T, ..., x_{m−1}^T}).
[L, U] — a high-dimensional MBR whose lower-left and upper-right points are L and U, respectively (= [{l0, l1, ..., l_{n−1}}, {u0, u1, ..., u_{n−1}}]).
[L, U]^T = [Λ, Υ] — a (low-dimensional) MBR transformed from [L, U] by the transform T (= [{λ0, λ1, ..., λ_{m−1}}, {υ0, υ1, ..., υ_{m−1}}]).
X ∈ [L, U] — the sequence X is contained in the MBR [L, U] (i.e., for every i, l_i ≤ x_i ≤ u_i).
Fig. 1. An MBR-safe transform (T 1) and a non-MBR-safe transform (T 2)
[Figure omitted: the traditional transform maps the individual high-dimensional sequences contained in a high-dimensional MBR into a low-dimensional MBR.]
With an MBR-safe transform, we can transform the MBR itself rather than a large number
of individual sequences. This means that, by using the MBR-safe transform, we can reduce
the number of transformations in similar sequence matching.
DFT has been most widely used as the lower-dimensional transformation in sim-
ilar sequence matching [1,3,7,8,9]. DFT transforms an n-dimensional sequence X
to a new n-dimensional sequence Y (= {y0 , y1 , ..., yn−1 }) in a complex number
space, where each complex number yi is defined as the following Eq. (2) [1,11]:
y_i = (1/√n) ∑_{t=0}^{n−1} x_t e^{−j·2πit/n},  0 ≤ i ≤ n − 1.   (2)
By Euler's formula [11] and the definition of complex numbers, we can rewrite Eq. (2)
as Eq. (3), consisting of a real part and an imaginary part:

y_i = (1/√n) ∑_{t=0}^{n−1} x_t cos(−2πit/n) + ((1/√n) ∑_{t=0}^{n−1} x_t sin(−2πit/n)) · j,  0 ≤ i ≤ n − 1.   (3)
DFT concentrates most of the energy into the first few coefficients, and thus
only a few coefficients extracted from the transformed point Y are used for
the lower-dimensional transformation [1,3]. The following Definition 2 shows the
traditional DFT-based lower-dimensional transformation.
Definition 2. The DFT-based lower-dimensional transformation transforms an
n-dimensional sequence X to a new m(≪ n)-dimensional sequence X^DFT =
{x_0^DFT, x_1^DFT, ..., x_{m−1}^DFT}, where each x_i^DFT is obtained by Eq. (4). Also, it trans-
forms an n-dimensional MBR [L, U] to a new m-dimensional MBR [L, U]^DFT
whose lower-left and upper-right points are L^DFT and U^DFT, respectively, i.e.,
[L, U]^DFT = [L^DFT, U^DFT]. In Eq. (4), θ = −2π⌊i/2⌋t/n and 0 ≤ i ≤ m − 1.
x_i^DFT = (1/√n) ∑_{t=0}^{n−1} x_t cos θ,  if i is even;
x_i^DFT = (1/√n) ∑_{t=0}^{n−1} x_t sin θ,  if i is odd.   (4)
The MBR-safe transformation maps the MBR [L, U] to [Λ, Υ], whose entries are given
by Eq. (5):

λ_i = (1/√n) ∑_{t=0}^{n−1} a_t cos θ, if i is even;   λ_i = (1/√n) ∑_{t=0}^{n−1} b_t sin θ, if i is odd;
υ_i = (1/√n) ∑_{t=0}^{n−1} c_t cos θ, if i is even;   υ_i = (1/√n) ∑_{t=0}^{n−1} d_t sin θ, if i is odd;   (5)

where a_t = l_t, c_t = u_t if cos θ ≥ 0;  a_t = u_t, c_t = l_t if cos θ < 0;
b_t = l_t, d_t = u_t if sin θ ≥ 0;  b_t = u_t, d_t = l_t if sin θ < 0.
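The following Python sketch is one way to read Eqs. (4) and (5); the function names are ours and only mimic the roles of the per-sequence transform (orgDFT) and the MBR transform (mbrDFT) compared below:

import math

def dft_lower_dim(x, m):
    # Per-sequence lower-dimensional transform, a sketch of Eq. (4)
    n = len(x)
    out = []
    for i in range(m):
        theta = lambda t: -2.0 * math.pi * (i // 2) * t / n
        if i % 2 == 0:
            out.append(sum(x[t] * math.cos(theta(t)) for t in range(n)) / math.sqrt(n))
        else:
            out.append(sum(x[t] * math.sin(theta(t)) for t in range(n)) / math.sqrt(n))
    return out

def mbr_dft(L, U, m):
    # MBR-safe transform of an MBR [L, U], a sketch of Eq. (5)
    n = len(L)
    lam, ups = [], []
    for i in range(m):
        lo = hi = 0.0
        for t in range(n):
            theta = -2.0 * math.pi * (i // 2) * t / n
            f = math.cos(theta) if i % 2 == 0 else math.sin(theta)
            # pick the bound that minimizes / maximizes each term, as in Eq. (5)
            lo += f * (L[t] if f >= 0 else U[t])
            hi += f * (U[t] if f >= 0 else L[t])
        lam.append(lo / math.sqrt(n))
        ups.append(hi / math.sqrt(n))
    return lam, ups
    # MBR-safety: for any x with L <= x <= U, dft_lower_dim(x, m) lies in [lam, ups]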
[Plots omitted: (a) complexity comparison between orgDFT and mbrDFT when varying m; (b) complexity comparison when varying n.]
6 Performance Evaluation
6.1 Experimental Data and Environment
We have performed extensive experiments using two types of synthetic data sets.
The first data set, used in the previous similar sequence matching works [3,8,9],
contains a random walk series consisting of one million entries: the first en-
try is set to 1.5, and subsequent entries are obtained by adding a random
value in the range (-0.001,0.001) to the previous one. We call this data set
Experiment 2) Figure 5 shows the results when we set the number m of se-
quences in an MBR to 256, but change the length n of sequences from 128 to
1024 by multiples of two. As in Experiment 1), we measure the total number of
transformations and the average elapsed time for transforming an MBR. From
Figure 5 (a), we note that the numbers of transformations are not changed even
as the length of sequences increases. It is because the numbers are dependent on
the number of sequences in orgDFT or the number of MBRs in mbrDFT, but
are not dependent on the length of sequences in both orgDFT and mbrDFT.
As shown in Figures 5 (b) and 5 (c), mbrDFT significantly reduces the elapsed
time over orgDFT. In particular, as we analyzed in Figure 3 (b) in Section 5,
the larger the sequence length, the larger the performance difference between
orgDFT and mbrDFT.
[Plots omitted: boundary-length of MBRs vs. number of dimensions (features) for orgDFT and mbrDFT: (a) WALK-DATA; (b) SINE-DATA.]
7 Conclusions
Acknowledgements
This work was supported by the Ministry of Science and Technology (MOST)/
Korea Science and Engineering Foundation (KOSEF) through the Advanced In-
formation Technology Research Center (AITrc).
References
1. Agrawal, R., Faloutsos, C., and Swami, A., “Efficient Similarity Search in Sequence
Databases,” In Proc. the 4th Int’l Conf. on Foundations of Data Organization and
Algorithms, pp. 69-84, Oct. 1993.
2. Chan, K.-P., Fu, A. W.-C., and Yu, C. T., “Haar Wavelets for Efficient Similar-
ity Search of Time-Series: With and Without Time Warping,” IEEE Trans. on
Knowledge and Data Engineering, Vol. 15, No. 3, pp. 686-705, Jan./Feb. 2003.
3. Faloutsos, C., Ranganathan, M., and Manolopoulos, Y., “Fast Subsequence Match-
ing in Time-Series Databases,” In Proc. Int’l Conf. on Management of Data, ACM
SIGMOD, pp. 419-429, May 1994.
4. Gao, L. and Wang, X. S., “Continually Evaluating Similarity-based Pattern Queries
on a Streaming Time Series,” In Proc. Int’l Conf. on Management of Data, ACM
SIGMOD, pp. 370-381, June 2002.
5. Keogh, E. J., Chakrabarti, K., Mehrotra, S., and Pazzani, M. J., “Locally Adaptive
Dimensionality Reduction for Indexing Large Time Series Databases,” In Proc. of
Int’l Conf. on Management of Data, ACM SIGMOD, pp. 151-162, May 2001.
6. Korn, F., Jagadish, H. V., and Faloutsos, C., “Efficiently Supporting Ad Hoc
Queries in Large Datasets of Time Sequences,” In Proc. of Int’l Conf. on Man-
agement of Data, ACM SIGMOD, pp. 289-300, June 1997.
7. Lim, S.-H., Park, H.-J., and Kim, S.-W., “Using Multiple Indexes for Efficient
Subsequence Matching in Time-Series Databases,” In Proc. of the 11th Int’l Conf.
on Database Systems for Advanced Applications (DASFAA), pp. 65-79, Apr. 2006.
8. Moon, Y.-S., Whang, K.-Y., and Han, W.-S., “General Match: A Subsequence
Matching Method in Time-Series Databases Based on Generalized Windows,” In
Proc. Int’l Conf. on Management of Data, ACM SIGMOD, pp. 382-393, June 2002.
9. Moon, Y.-S. and Kim, J., “A Single Index Approach for Time-Series Subse-
quence Matching that Supports Moving Average Transform of Arbitrary Order,”
In Proc. of the 10th Pacific-Asia Conf. on Knowledge Discovery and Data Mining
(PAKDD), pp. 739-749, Apr. 2006.
10. Natsev, A., Rastogi, R., and Shim, K., “WALRUS: A Similarity Retrieval Algo-
rithm for Image Databases,” IEEE Trans. on Knowledge and Data Engineering,
Vol. 16, No. 3, pp. 301-316 , Mar. 2004.
11. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., Numerical
Recipes in C: The Art of Scientific Computing, Cambridge University Press, 2nd
Ed., 1992.
Mining Closed Frequent Free Trees
in Graph Databases
1 Introduction
Recent research on frequent pattern discovery has progressed from mining item-
sets and sequences to mining structural patterns including (ordered, unordered,
free) trees, lattices, graphs and other complicated structures. Among all these
structural patterns, graph, a general data structure representing relations among
entities, has been widely used in a broad range of areas, such as bioinformatics,
chemistry, pattern recognition, computer networks, etc. In recent years, we have
witnessed a number of algorithms addressing the frequent graph mining prob-
lem [5,9,4,6]. However, discovering frequent graph patterns comes with expensive
cost. Two computationally expensive operations are unavoidable: (1) to check if
a graph contains another graph (in order to determine the frequency of a graph
pattern) is an instance of subgraph isomorphism problem, which is NP-complete
[3]; and (2) to check if two graphs are isomorphic (in order to avoid creating a
candidate graph for multiple times) is an instance of graph isomorphism prob-
lem, which is not known to be either P or NP-complete [3].
With the advent of XML and the need for mining semi-structured data, a
particularly useful family of general graph — free tree, has been studied and
stage, which is confirmed to output no desired patterns; (3) Based on the intrin-
sic characteristics of free tree, we propose the automorphism-based pruning and
the canonical mapping-based pruning to alleviate the expensive computation of
equivalent occurrence sets and candidate answer sets during the mining process.
We carried out different experiments on both synthetic data and real application
data. Our performance study shows that CFFTree outperforms up-to-date
frequent free tree mining algorithms by a factor of roughly 10. To the best of our
knowledge, CFFTree is the first algorithm that, instead of using post-processing
methods, directly mines closed frequent free trees from graph databases.
The rest of the paper is organized as follows. Section 2 provides necessary
background and detailed problem statement. We study the closed frequent free
tree mining problem in Section 3, and propose a basic algorithmic framework
to solve the problem. Advanced pruning algorithms are presented in Section 4.
Section 5 formulates our algorithm, CFFTree. In Section 6, we report our per-
formance study and finally, we offer conclusions in Section 7.
2 Preliminaries
A labeled graph is defined as a 4-tuple G = (V, E, Σ, λ) where V is a set of
vertices, E is a set of edges (unordered pairs of vertices), Σ is a set of labels,
and λ is a labeling function, λ : V ∪ E → Σ, that assigns labels to vertices and
edges. A free tree, denoted ftree, is a special undirected labeled graph that is
connected and acyclic. Below, we call a ftree with n vertices an n-ftree.
Let t and s be two ftrees, and g be a graph. t is a subtree of s (or s is the
supertree of t), denoted t ⊆ s, if t can be obtained from s by repeatedly removing
vertices with degree 1, a.k.a leaves of the tree. Similarly, t is a subtree of a graph
g, denoted t ⊆ g, if t can be obtained by repeatedly removing vertices and edges
from g. Ftrees t and s are isomorphic to each other if there is a one-to-one
mapping from the vertices of t to the vertices of s that preserves vertex labels,
edge labels, and adjacency. An automorphism is an isomorphism that maps from
a ftree to itself. A subtree isomorphism from t to g is an isomorphism from t to
some subtree(s) of g.
Given a graph database D = {g1 , g2 , . . . , gN } where gi is a graph (1 ≤ i ≤ N ).
The problem of frequent ftree mining is to discover the set of all frequent ftrees,
denoted F S, where t ∈ F S iff the ratio of graphs in D that has t as its subtree
is greater than or equal to a user-given threshold φ. Formally, let t be a ftree
and gi be a graph. We define
ς(t, g_i) = 1 if t ⊆ g_i, and 0 otherwise.   (1)

and

σ(t, D) = ∑_{g_i ∈ D} ς(t, g_i)   (2)
The problem of closed frequent ftree mining is to discover the set of closed frequent
ftrees, denoted CFS, where t ∈ CFS iff t is frequent and the support of t is
strictly larger than that of any supertree of t. Formally, the closed frequent ftree
mining problem is to discover the ftree set CFS of D which satisfies

CFS = {t | t ∈ FS ∧ ∀ t′ ⊃ t, σ(t, D) > σ(t′, D)}   (4)

Since CFS contains no ftree that has a supertree with the same support, we
have CFS ⊆ FS.
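Schematically, the two mining problems can be stated as follows (an illustrative Python sketch only; is_subtree and is_supertree stand in for the expensive isomorphism tests discussed in the introduction, and the candidate enumeration is assumed to be given):

# Support, frequent ftrees, and closed frequent ftrees -- a schematic sketch.
def support(t, D, is_subtree):
    return sum(1 for g in D if is_subtree(t, g))          # sigma(t, D)

def closed_frequent(candidates, D, phi, is_subtree, is_supertree):
    N = len(D)
    FS = [t for t in candidates if support(t, D, is_subtree) / N >= phi]
    # keep only trees whose support strictly exceeds that of every frequent supertree
    CFS = [t for t in FS
           if all(support(t, D, is_subtree) > support(s, D, is_subtree)
                  for s in FS if is_supertree(s, t))]
    return CFS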
t′ = t ∘_ef v,  v ∈ Σ   (5)

[Figure omitted.]

of simultaneous occurrences of t′ w.r.t. t in g_i ∈ D, i.e., ∑_{i=1}^{N} ω(t, t′, g_i), denoted
by SO(t, t′, D).
Definition 3. Given t′ = t ∘_x v and a graph database D = {g1, g2, ..., gN}, if
O(t, D) = SO(t, t′, D), we say that t and t′ have equivalent occurrences.
Lemma 1. For a frequent ftree t in the enumeration tree, if there exists a t′ ∈
FS(t) such that (1) t and t′ have equivalent occurrences; (2) the vertex (t′ − t)
is not grown on the extension frontier of any descendant of t, including t, in
the enumeration tree, then (1) t is not a closed frequent ftree and (2) for each
child t″ of t in the enumeration tree, there exists at least one supertree t‴ of t″,
such that t″ and t‴ have equivalent occurrences.
Proof. The first statement can be easily proved: since t and t′ have equivalent
occurrences in D, O(t′, D) = O(t, D). For the second statement, we notice
that (t′ − t) occurs at each occurrence of t in D, so it occurs at each occurrence of
t″ in D. In addition, the vertex (t′ − t) is never grown on the extension frontier
of any descendant of t, so it will not be a vertex of t″ (notice that t″ is a child of t in
the enumeration tree, obtained by growing a vertex on t's extension frontier). Therefore,
we can obtain t‴ by adding (t′ − t) to t″, so that t″ and t‴ have equivalent
occurrences.
By inductively applying Lemma 1 to t and all of t's descendants in the enumeration
tree, we can conclude that all branches originating from t in the enumeration tree
are guaranteed to produce no closed frequent ftrees. However, the conditions
mentioned in Lemma 1, especially condition (2), are hard to verify, since
when mining a frequent ftree t we have no information about t's descendants in the
enumeration tree. The following sections present more detailed techniques
to prune the search space.
[Fig. 3 (A Special Case in Position (2)) and Fig. 4 (The Safe Label Pruning) omitted.]
the extension frontier. So the first two possible positions of p are unsafe for
growing the vertex (t′ − t), because they may violate the conditions mentioned in Lemma 1.
The following theorem shows that only position (3) of p is safe to grow the
vertex (t′ − t) while not violating the conditions mentioned in Lemma 1.
Theorem 1. For a frequent ftree t′ ∈ FS(t) such that t and t′ have equivalent
occurrences in D, if depth(p) > 2, then neither t nor any of t's descendants in the
enumeration tree can be closed.
Proof. Every vertex u on the extension frontier of a ftree is located
at the bottom two levels, i.e., depth(u) ≤ 2. If depth(p) > 2, the vertex p can
never appear on the extension frontier of any ftree, i.e., the vertex (t′ − t) will
not be grown on the extension frontier of any descendant of t, including t, in the
enumeration tree. According to Lemma 1, the branches originating from t cannot
generate closed frequent ftrees.
The pruning algorithm mentioned in Theorem 1 is called the safe position prun-
ing, since the vertex (t′ − t) can only be grown on a safe vertex p ∈ t, where
depth(p) > 2. Given an n-ftree, the depth of every vertex of t can be computed in
O(n), so the safe position pruning is quite efficient for testing whether a certain
branch in the enumeration tree should be pruned or not.
the vertex (t′ − t) will not be reconsidered to be grown on t″ or any of t″'s descen-
dants in the enumeration tree. According to Lemma 1, neither t″ nor any of its
descendants can be closed.
The pruning algorithm mentioned in Theorem 2 is called the safe label pruning.
The vertex label of (t′ − t) is safe because all vertices with labels lexicographically
greater than that of (t′ − t) can be exempted from growing on p of t, and all descendants of
the corresponding ftrees in the enumeration tree are also pruned. An example is shown
in Figure 4: p is located on the extension frontier of t and v = (t′ − t). If v′'s label
is lexicographically greater than v's label, the frequent ftree obtained by growing v′ at p and the
frequent ftree obtained by further growing v at p have equivalent occurrences, so the former is not closed.
Similarly, all of its descendants in the enumeration tree are not closed, either.
Based on Theorem 1 and Theorem 2, the set EO(t) can be further divided
into the following mutually exclusive subsets:
EO1(t) = {t′ ∈ EO(t) | p ∈ t is safe}
EO2(t) = {t′ ∈ EO(t) | p is on the extension frontier of t}
EO3(t) = EO(t) − EO1(t) − EO2(t)
[Fig. 5 omitted: a ftree t, its vertices labeled a, b, c, d, and its vertex equivalence classes.]
The automorphism-based pruning partitions the vertices of t in Figure 5 into four equivalence
classes. When computing FS(t), only one representative from each equivalence class of t is
considered, instead of growing vertices on every position within an equivalence class.
Canonical Mapping-based Pruning: When computing FS(t), we maintain
mappings from t to all its occurrences in g_i ∈ D. However, there exist redundant
mappings because of ftree automorphism. Given an n-ftree t, assume that
the number of equivalence classes of t is c and the number of vertices in each
equivalence class C_i is n_i, for 1 ≤ i ≤ c. The number of mappings from t to an
occurrence in g_i is then ω(t, g_i) = ∏_{i=1}^{c} (n_i)!. When either the number
of equivalence classes or the number of vertices in some equivalence class is
large, ω(t, g_i) can be huge. However, among all mappings describing the same
occurrence of t in g_i, one out of the ∏_{i=1}^{c} (n_i)! mappings is selected as the canonical
mapping, and all computation of FS(t) is based on the canonical mapping of t in
D, while the other (∏_{i=1}^{c} (n_i)! − 1) mappings can be pruned, so that the computation
of FS(t) is greatly facilitated.
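For instance, the count of redundant mappings per occurrence is just the product of factorials of the equivalence-class sizes; the sizes below are illustrative values only:

from math import factorial, prod

class_sizes = [1, 2, 2, 1]            # illustrative n_1, ..., n_c for four classes
omega = prod(factorial(n) for n in class_sizes)
print(omega)                          # 4 mappings per occurrence; 3 of them can be pruned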
is closed (Line 15-16). The set F (t) is computed by extending vertices on the
extension frontier of t, which grows the enumeration tree for frequent ftree mining
(Line 8-12). This procedure proceeds recursively (Line 13-14) until we find all
closed frequent ftrees in the graph database.
6 Experiments
In this section, we report a systematic performance study that validates the
effectiveness and efficiency of our closed frequent free tree mining algorithm:
CFFTree. We use both a real dataset and a synthetic dataset in our experiments.
All experiments were done on a 3.4GHz Intel Pentium IV PC with 2GB main
memory, running MS Windows XP operating system. All algorithms are imple-
mented in C++ using the MS Visual Studio compiler. We compare CFFTree
with F3TM plus post-processing; thus, the performance curves mainly reflect the
effectiveness of the pruning techniques mentioned in Section 4.
[Plots omitted: numbers of frequent and closed frequent ftrees, and runtimes of F3TM vs. CFFTree, varying the size of free trees, the minimum support threshold (0.05–0.1), and the average size of graphs (edges).]
We vary the parameter T in the synthetic data, while the other parameters are kept fixed. The ex-
perimental results are shown in Figure 8(c). Again, CFFTree performs better
than F3TM.
7 Conclusion
In this paper, we investigate the problem of mining closed frequent ftrees from
large graph databases, a critical problem in structural pattern mining because
mining all frequent ftrees is inherently inefficient and redundant. Several new
pruning algorithms are introduced in this study including the safe position prun-
ing and the safe label pruning to efficiently prune branches of the search space.
The automorphism-based pruning and the canonical mapping-based pruning are
applied in the computation of candidate sets and equivalent occurrence sets,
which dramatically facilitate the total mining process. A CFFTree algorithm is
implemented and our performance study demonstrates its high efficiency over
the up-to-date frequent ftree mining algorithms. To the best of our knowledge, this is
the first piece of work on closed frequent ftree mining on large graph databases.
References
1. Yun Chi, Yi Xia, Yirong Yang, and Richard R. Muntz. Mining closed and maximal
frequent subtrees from databases of labeled rooted trees. IEEE Transactions on
Knowledge and Data Engineering, 17(2):190–202, 2005.
2. Yun Chi, Yirong Yang, and Richard R. Muntz. Indexing and mining free trees. In
Proceedings of ICDM03, 2003.
3. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide
to the Theory of NP-Completeness. 1979.
4. Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in
the presence of isomorphism. In Proceedings of ICDM03, 2003.
5. Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Pro-
ceedings of ICDM01, 2001.
6. Siegfried Nijssen and Joost N. Kok. A quickstart in frequent structure mining can
make a difference. In Proceedings of KDD04, 2004.
7. Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering fre-
quent closed itemsets for association rules. In Proceeding of ICDT99, 1999.
8. Ulrich Rückert and Stefan Kramer. Frequent free tree discovery in graph data. In
Proceedings of SAC04, 2004.
9. Xifeng Yan and Jiawei Han. gspan: Graph-based substructure pattern mining. In
Proceedings of ICDM02, 2002.
10. Xifeng Yan and Jiawei Han. Closegraph: mining closed frequent graph patterns.
In Proceedings of KDD03, 2003.
11. Xifeng Yan, Jiawei Han, and Ramin Afshar. Clospan: Mining closed sequential
patterns in large databases. In Proceedings of SDM03, 2003.
12. Peixiang Zhao and Jeffrey Xu Yu. Fast frequent free tree mining in graph databases.
In Proceedings of MCD06 - ICDM 2006 Workshop, Hong Kong, China, 2006.
Mining Time-Delayed Associations from
Discrete Event Datasets
1 Introduction
Developments in sensor network technology have attracted vast amounts of re-
search interest in recent years [1,2,3,6,7,8,9]. One of the research topics related to
sensor networks is to find correlations among the behaviour of different sensors.
[Fig. 1 omitted: (a) an example sensor network; (b) an event sequence with type-A events at times 0, 3, 4, 5, type-B events at times 7, 9, 10, and type-C events at times 13, 15.]
if an event of type I occurs and is followed by one or more events of type J within
a certain constrained time period, then at least one of the type-J consequences
is likely to be followed by a type-K event within a constrained time period.
In [5], Mannila et al. proposed the concept of an episode, which is an ordered
list of events. They proposed the use of minimal occurrences to find episode
rules in order to capture temporal relationships between different event types. A
minimal occurrence of an episode is a time interval [ts , te ) such that the episode
occurs in the interval but not in any proper sub-interval of [ts , te ). Let α and β
be two episodes. An episode rule has the form α[w1 ] ⇒ β[w2 ], which specifies
the following relationship: “if α has a minimal occurrence in the interval [ts , te )
such that te − ts ≤ w1 , then there is a minimal occurrence of β in [ts , te ) such
that te − ts ≤ w2 ”. The goal is to discover episodes and episode rules that occur
frequently in the data sequence.
In a sense, our problem is similar to episode discovery in that we are looking
for frequently occurring event sequences. However, we remark that the use of
minimal occurrence to define the occurrence of an episode might in some cases
fail to reflect the strength of an association. As an example, consider Figure 1(b)
again. It is possible that the three type-B events that occur at time t = 7, 9 and
10 are “triggered” respectively by the three preceding A’s that occur at t = 3, 4
and 5. Hence, the association A → B has occurred three times. However, only
one period ([5, 8)) is qualified as a minimal occurrence of the episode A → B. In
other words, out of all 4 occurrences of A in the figure, there is only 1 occurrence
of the episode A → B, even though 3 of the A’s have triggered B.
A major difference between our definition of time-delayed association and the
episode’s minimal occurrence approach is that, under our approach, every event
that matches an association counts towards the association’s support. This fairly
reflects the strength of correlations among event types. Also, our definition allows
the specification of a timing constraint [u, v] between successive event types in
an association. This helps remove associations that are not interesting.
For example, if it takes at least 2 time units for a packet to pass through a
switch, then any type-B alert that occurs 1 time unit after a type-A alert should
not count towards the association A → B (See Figure 1). We can thus use the
timing constraint to filter false matches. The minimal occurrence approach used
in episode does not offer such flexibility.
A straightforward approach to finding all frequent associations is to generate
and verify them incrementally. First, we form all possible length-2 associations
X → Y , where X and Y are any event types in the data sequence. We then
scan the data to count the associations’ supports. Those associations with high
supports are considered frequent. Next, for each frequent association X → Y , we
consider every length-3 extension, i.e., we append every event type Z to X → Y
forming (X → Y ) → Z. The support of those length-3 associations are counted
and those that are frequent will be used to generate length-4 associations, and so
on. The process stops when we can no longer obtain any new frequent sequences.
In Section 3 we will show how the above conceptual procedure is implemented
in practice. In particular, we show how the computational problem is reduced
to a large number of table joins. We call this algorithm the baseline algorithm.
The baseline algorithm is not particularly efficient. We propose two methods to
improve its efficiency. First, the baseline algorithm extends a frequent association
I → Y by considering all possible extensions (I → Y) → Z. Many such
extensions could be infrequent, and the effort spent on counting their supports
is wasted. A better strategy is to estimate upper bounds of the associations’
supports and discard those that cannot meet the support requirement. Second,
as we will explain later, the baseline algorithm generates (I → Y ) → Z by
retrieving and joining the tables associated with two sub-associations, namely,
I → Y and Y → Z. Since the number of such associations and their associated
tables is huge, the tables will have to be disk-resident. A caching strategy that
can avoid disk accesses as much as possible would thus have a big impact on the
algorithm’s performance. In this paper we study an interesting coupling effect
of a caching strategy and an association-generation order.
The rest of the paper is structured as follows. We give a formal definition of
our problem in Section 2. In Section 3, we discuss some properties of time-delayed
associations and propose a baseline algorithm for the problem. In Section 4, we
discuss the pruning strategies and the caching strategies. We present experiment
results in Section 5 and conclude the paper in Section 6.
2 Problem Definition
In this section we define the problem of finding time-delayed associations from
event datasets. We define an event e as a 2-tuple (Ee , te ) where Ee is the event
type and te is the time e occurs. Let D denote an event dataset and E denote
the set of all event types that appear in D. We define a time-delayed association
I →[u,v] J, where I is called the triggering event type and J the consequence event type of the association.
Intuitively, I →[u,v] J captures the observation that if an event i such that Ei = I
occurs at time ti, then it is "likely" that there exists an event j such that Ej = J
and ti + u ≤ tj ≤ ti + v, where v ≥ u > 0. The likelihood is given by the confidence
of the association, whereas the statistical significance of an association is given
by its support. We will define support and confidence shortly.
For an association r = I →[u,v] J, an event i is called a match of r (or i matches r)
if Ei = I and there exists another event j such that Ej = J and ti + u ≤ tj ≤
ti + v. The event j here is called a consequence of r. We use the notations Mr to
denote the set of all matches of r, qr,i to denote the set of all consequences that
correspond to a match i of r and mr,j to denote the set of all matches of r that
correspond to a consequence j. Also, we define Qr = ∪i∈Mr qr,i . That is, Qr
is the set of all events that are consequences of r. The support of an association
r is defined as the ratio of the number of matching events to the total number of
events (i.e., |Mr | / |D|). The confidence of r is defined as the fraction |Mr | / |DI |, where DI
is the set of all type-I events in D. We use the notations supp(r) and conf (r) to
represent the support and confidence of r, respectively. Finally, the length of an
association r, denoted by len(r), is the number of event types contained in r.
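To make these definitions concrete, the following Python sketch (our illustration, not code from the paper; the helper names and the toy dataset are hypothetical) computes the matches, support and confidence of a length-2 association I →[u,v] J from a list of (event type, time) pairs:

from bisect import bisect_left, bisect_right
from collections import defaultdict

def support_confidence(events, I, J, u, v):
    """Compute supp(r) and conf(r) for r = I -[u,v]-> J.
    events: the dataset D, a list of (event_type, time) tuples.
    supp(r) = |M_r| / |D|, conf(r) = |M_r| / |D_I|."""
    times = defaultdict(list)
    for etype, t in events:
        times[etype].append(t)
    for ts in times.values():
        ts.sort()
    j_times = times.get(J, [])
    # i matches r if some type-J event occurs in [t_i + u, t_i + v]
    matches = [ti for ti in times.get(I, [])
               if bisect_left(j_times, ti + u) < bisect_right(j_times, ti + v)]
    d_i = len(times.get(I, []))                      # |D_I|
    supp = len(matches) / len(events) if events else 0.0
    conf = len(matches) / d_i if d_i else 0.0
    return supp, conf

# A toy dataset (types A and B, timing constraint [3, 5]; cf. Fig. 3(a) later in the paper).
D = [("A", 3), ("A", 4), ("A", 5), ("B", 7), ("B", 9), ("B", 10)]
print(support_confidence(D, "A", "B", 3, 5))         # (0.5, 1.0)

On this toy data every type-A event has a consequence, so supp = 3/6 and conf = 3/3.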
We can extend the definition to relate more than two event types. Considering an
association r = I →[u,v] J as a complex event type I′, an association
r′ = (I →[u,v] J) →[u,v] K between I′ and another event type K can be formed,
in which I′ is the triggering event type and K is the consequence event type.
Intuitively, the association says that if an event of type I is followed by one or
more events of type J within the time constraints u and v, then at least one of the
J’s is likely to be followed by a type-K event. A match for the association r′ is a
match i for r such that, for some j where j ∈ qr,i , there exists an event k such that
Ek = K and tj + u ≤ tk ≤ tj + v. We say that event k is a consequence of event i
w.r.t. the association r′ . The support of r′ is defined as the fraction of events
in D that match r′ (i.e., |Mr′ | / |D|). The confidence of r′ is defined as the ratio of
the number of events that match r′ to the number of events that match r (i.e.,
|Mr′ | / |Mr |). Given two user-specified thresholds ρs and ρc and a timing constraint
[u, v], the problem of mining time-delayed associations is to find all associations
r such that supp(r) ≥ ρs and conf (r) ≥ ρc .
In our model, we use the same timing constraint [u, v] for all associations.
Therefore, we will use a plain arrow “→” instead of “→[u,v] ” in the rest of the
paper when the timing constraint is clear from the context or is unimportant.
algorithm BASELINE
1)  L := ∅; C := ∅; n := 2;
2)  F := {all frequent event types}
3)  foreach I ∈ F , J ∈ E do
4)    C := C ∪ {I → J}
5)  end-for
6)  while C ≠ ∅ do
7)    Cn := C; C := ∅
8)    foreach r ∈ Cn do
9)      if r = I → J is frequent do
10)       L := L ∪ {r}
11)       C := C ∪ {(I → J) → K} ∀ K ∈ E
12)     end-if
13)   end-for
14)   n := n + 1
15) end-while
16) return L

Fig. 2. Algorithm BASELINE

Fig. 3. M -Q mappings for various time-delayed associations:
(a) A →[3,5] B: (m, q) = (3, 7), (4, 7), (4, 9), (5, 9), (5, 10)
(b) B →[3,5] C: (m, q) = (9, 13), (10, 13), (10, 15)
(c) (A →[3,5] B) →[3,5] C: (m, q) = (4, 13), (5, 13), (5, 15)
Property 3: For any event j ∈ Qr1 ∩ Mr2 , every i ∈ mr1 ,j is a match of r and
every k ∈ qr2 ,j is a consequence of event i w.r.t. r for every i ∈ mr1 ,j .
Proof: By definition, every i ∈ mr1 ,j is a match of r because there exists k such
that tj + u ≤ tk ≤ tj + v. Indeed, every k ∈ qr2 ,j fulfils this requirement.
Hence, every k ∈ qr2 ,j is a consequence of i w.r.t. r for every i ∈ mr1 ,j .
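Property 3 is what makes the table-join implementation possible: the M -Q mapping of r = (I → J) → K can be obtained by equi-joining the consequence column of r1 = I → J with the match column of r2 = J → K. A minimal Python sketch (our own illustration; the tuple representation is an assumption, not the paper's data layout), using the M -Q mappings of Fig. 3:

def join_mq(mq_r1, mq_r2):
    """Join two M-Q mappings per Property 3: for every j in Q_r1 ∩ M_r2,
    pair each match i in m_{r1,j} with each consequence k in q_{r2,j}."""
    return {(i, k) for (i, j1) in mq_r1 for (j2, k) in mq_r2 if j1 == j2}

mq_ab = {(3, 7), (4, 7), (4, 9), (5, 9), (5, 10)}    # Fig. 3(a): A -[3,5]-> B
mq_bc = {(9, 13), (10, 13), (10, 15)}                # Fig. 3(b): B -[3,5]-> C
print(sorted(join_mq(mq_ab, mq_bc)))                 # [(4, 13), (5, 13), (5, 15)], as in Fig. 3(c)

In the baseline algorithm these mappings are stored as (possibly disk-resident) tables, so the set comprehension above corresponds to a relational join on the q column of I → J and the m column of J → K.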
Fig. 4. Multiplicity of consequences and SectTop: (a) M -Q mapping and multiplicity of consequences for I → J; (b) SectTop vectors for I → J; (c) M -Q mapping and number of matches per segment for J → K
chosen. Figures 5(a) and (b) show two candidate generation schemes commonly
used in level-wise algorithms. Figure 5(a) illustrates depth-first (DF) candi-
date generation, i.e., after an association r = I → J is evaluated as frequent,
the algorithm immediately generates candidates by extending r and evaluates
them. Figure 5(b) illustrates breadth-first (BF) candidate generation, in which all
candidates of the same length are evaluated before longer candidates. These
candidate generation schemes would not work well with the LRU strategy. For
example, in Figure 5(a), A → B is referenced when evaluating the candidates
((A → A) → A) → B and ((B → C) → A) → B. Between the accesses,
a number of other candidates are evaluated, which means that many different
associations are brought into memory and cache overflows become more likely.
When A → B is accessed the second time, its M -Q mapping may no longer
reside in the cache. A similar problem exists in the BF scheme (see Figure 5(b)).
It is noteworthy that, in the baseline algorithm, length-2 associations are
repeatedly referenced for candidate evaluation. In particular, when evaluating
extensions of an association I → J, each length-2 association of the form
J → K is referenced. By processing as a batch all associations in Li with the same
consequence event type (see Figure 5(c)), we ensure that length-2 associations of
the form J → K are accessed close together in time, which favours the LRU strategy.
This observation can be easily fitted into the BF candidate generation scheme.
At the end of each iteration, we sort the associations in Li by their consequence
event type. Then the sorted associations are fetched sequentially for candidate
generation. We call this the breadth-first* (BF*) candidate generation scheme.
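The following sketch (ours, simplified; the function names, the tuple representation of associations and the load/join callbacks are assumptions, not the paper's implementation) illustrates how BF* ordering interacts with an LRU cache: candidates of the same level are extended in order of their consequence event type, so each length-2 table J → K is re-used while it is still likely to be resident.

from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache for M-Q mappings, keyed by association."""
    def __init__(self, capacity):
        self.capacity, self.data, self.misses = capacity, OrderedDict(), 0

    def get(self, key, load):
        if key in self.data:
            self.data.move_to_end(key)          # most recently used
            return self.data[key]
        self.misses += 1                        # would be a disk access
        self.data[key] = load(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)       # evict least recently used
        return self.data[key]

def bf_star_level(frequent, event_types, cache, load_mq, join_mq):
    """Extend all frequent associations of the current level (BF* order).
    Associations are tuples of event types; sorting by the last element
    (the consequence type J) batches together candidates that need the
    same length-2 table J -> K."""
    candidates = {}
    for assoc in sorted(frequent, key=lambda a: a[-1]):
        mq_r1 = cache.get(assoc, load_mq)
        for K in event_types:
            mq_r2 = cache.get((assoc[-1], K), load_mq)
            candidates[assoc + (K,)] = join_mq(mq_r1, mq_r2)
    return candidates

Replacing the sort key with the original generation order gives the plain BF scheme; comparing cache.misses under the two orderings gives a simple way to observe the coupling between generation order and cache behaviour discussed above.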
5 Experiment Results
We conducted experiments using stock price data. Due to space limitations, we
refer the reader to [4] for a discussion of how the raw data are transformed into an
event dataset. The transformed dataset consists of 99 event types and around 45000 events.
(Figure 6: number of candidates evaluated against the support threshold.)
32% and 63%. A similar trend is observed when we change v to 2 (Figure 6(b)).
Although the savings are not as dramatic as in the case when v = 1, at low
support (0.7%), GlobalK and SectTop achieve savings of 26% and 41%, while at
high support (0.9%), the savings are around 39% and 44%, respectively.
As shown by the figures, SectTop always outperforms GlobalK in terms of
number of candidates being evaluated. A reason is that, for each candidate
c, SectTop obtains an upper-bound on supp(c) by estimating the number of
matches that are associated with the consequences in each segment. A reasonably
fine segmentation of the time covered by D thus ensures that the upper-bound
obtained is relatively tight. For GlobalK, however, the GlobalK threshold for a
frequent association is calculated from the highest multiplicity values without
considering where these values actually exist in the whole period of time covered
by D. So, the pruning ability of GlobalK is not as good as that of SectTop.
(Figure panels: number of I/O accesses plotted against cache size in ’000 tuples.)
accessed multiple times for these candidates. If the cache is big enough to hold
the M -Q mappings of all such length-2 associations, it is likely that the M -Q
mappings are in the cache after they are referenced for the first time. For the
dataset used in the experiment, we find that the maximum sum of the sizes of
all M -Q mappings of a particular triggering event type is about 22000 tuples. A
cache with 24000-tuple capacity is thus big enough to save most I/O accesses.
Figure 7(b) shows the case when “ST32” is applied. The curves are similar
in shape to those in the “NoOpt” case. A big drop in I/O accesses is
also observed with the curve of BF* and the big drop begins at the cache size of
10000 tuples. This is because SectTop avoids evaluating candidates that cannot
be frequent. So, for a frequent association I → J, it is not necessary to evaluate
every candidate of the form (I → J) → K. A smaller cache is thus enough to
hold the M -Q mappings of length-2 associations used for candidate evaluation.
Figure 8 shows the case of LFU. From the figure, all three candidate gener-
ation methods are very similar in terms of I/O requirement. Both depth-first
and breadth-first generation performed slightly better when LFU was adopted
instead of LRU. However, the “big drop” with BF* is not observed, and so the
performance of BF* is much worse than in the LRU case. This is because the
LFU strategy gives preference to data that are frequently accessed when decid-
ing what to keep in the cache. This does not match the idea of BF* candidate
generation, which works best when recently accessed data are kept in the cache.
In addition, associations that entered the cache early may reside in the cache for a
long time because, by the time they are first used for evaluating candidates, a certain
number of accesses have already been accumulated. Associations newly added to the
cache must be accessed even more frequently to stay in the cache.
6 Conclusion
We propose time-delayed association as a way to capture time-delayed depen-
dencies between types of events. We illustrate how time-delayed associations can
be found from event datasets in a simple baseline algorithm.
We identify in the simple algorithm two areas for improvement. First, we can
get upper-bounds on the supports of candidate associations. Those that cannot
be frequent are discarded without finding their actual supports. We propose
two methods, namely GlobalK and SectTop, for obtaining an upper-bound on a
candidate’s support. Experiment results show that these methods significantly
reduce the number of candidates being evaluated.
Second, some of the intermediate results generated are repeatedly used for
candidate evaluation. Since the volume of data being processed is likely to be
high, such intermediate results must be disk-resident and are brought into main
memory only when needed. Caching of the intermediate results is thus important
for reducing expensive I/O accesses. We find that the order in which candidate as-
sociations are formed and evaluated affects the performance of the cache.
Experiment results show that the BF* candidate generation scheme, coupled
with a reasonably sized cache and the LRU cache replacement strategy, can
substantially reduce the I/O requirement of the algorithm.
References
1. Xiaonan Ji, James Bailey, and Guozhu Dong. Mining minimal distinguishing sub-
sequence patterns with gap constraints. In ICDM, pages 194–201, 2005.
2. Daesu Lee and Wonsuk Lee. Finding maximal frequent itemsets over online data
streams adaptively. In ICDM, pages 266–273, 2005.
3. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. Seman-
tics and evaluation techniques for window aggregates in data streams. In SIGMOD
Conference, pages 311–322, 2005.
4. K. K. Loo and Ben Kao. Mining time-delayed associations from discrete event
datasets. Technical Report TR-2007-01, Department of Computer Science, The
University of Hong Kong, Hong Kong, 2007.
5. Heikki Mannila and Hannu Toivonen. Discovering generalized episodes using mini-
mal occurrences. In KDD, pages 146–151, 1996.
6. Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos. Streaming pattern dis-
covery in multiple time-series. In VLDB, pages 697–708, 2005.
7. Yasushi Sakurai, Spiros Papadimitriou, and Christos Faloutsos. Braid: Stream min-
ing through group lag correlations. In SIGMOD Conference, pages 599–610, 2005.
8. Mohammed Javeed Zaki. Spade: An efficient algorithm for mining frequent se-
quences. Machine Learning, 42(1/2):31–60, 2001.
9. Rui Zhang, Nick Koudas, Beng Chin Ooi, and Divesh Srivastava. Multiple aggre-
gations over data streams. In SIGMOD Conference, pages 299–310, 2005.
A Comparative Study of Ontology Based Term Similarity
Measures on PubMed Document Clustering
Xiaodan Zhang1, Liping Jing2, Xiaohua Hu1, Michael Ng3, and Xiaohua Zhou1
1
College of Information Science & Technology, Drexel University, 3141 Chestnut,
Philadelphia, PA 19104, USA
2
ETI & Department of Math, The University of Hong Kong, Pokfulam Road, Hong Kong
3
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
{xzhang,thu}@ischool.drexel.edu,[email protected],
[email protected], [email protected]
1 Introduction
Recent research has focused on how to integrate domain ontologies as background
knowledge into the document clustering process, and shows that ontologies can improve
document clustering performance with their concept hierarchy knowledge [2, 3, 16].
Hotho et al. [2] uses WordNet synsets to augment document vectors and achieves
better results than the “bag of words” model on public-domain data. Yoo et al. [16]
achieves promising clustering results using the MeSH domain ontology for clustering
initialization. They first cluster terms by calculating term semantic similarity using the
MeSH ontology (http://www.nlm.nih.gov/mesh/) on PubMed document sets [16].
Then the documents are mapped to the corresponding term clusters. Finally, a mutual
reinforcement strategy is applied. Varelas et al. [14] uses term re-weighting for
information retrieval applications. Jing et al. [3] adopt a similar technique for document
clustering: they re-weight terms and assign more weight to terms that are more
semantically similar to each other.
Although existing approaches rely on term semantic similarity measures, not many
studies have evaluated the effects of different similarity measures on
document clustering for a specific domain. Yoo et al. [16] uses only one similarity
measure, which calculates the number of shared ancestor concepts and the number of
co-occurring documents. Jing et al. [3] compares two ontology-based term similarity
measures. Even though these approaches rely heavily on term similarity
information and all these similarity measures are domain independent, to date
relatively little work has been done on developing and evaluating measures of
term similarity for the biomedical domain (where a growing number of
ontologies, such as MeSH, organize medical concepts into hierarchies)
in the context of document clustering.
Clustering initialization and term re-weighting are two techniques adopted for
integrating domain knowledge. In this paper, term re-weighting is chosen because: (1)
a document is often full of class-independent “general” terms, and discounting the
effect of these general terms is a central task; term re-weighting may help discount the
effects of class-independent general terms and amplify the effects of class-specific
“core” terms; (2) hierarchically clustering terms [16] for clustering initialization is
more computationally expensive and less scalable than the term re-weighting approach.
As a result, in this paper, we evaluate the effects of different term semantic
similarity measures on document clustering using term re-weighting, an important
method for integrating domain knowledge. We examine 4 path based similarity
measures, 3 information content based similarity measures, and 2 feature based
similarity measures for document clustering on PubMed document sets. The rest of
the paper is organized as follows: Section 2 describes term semantic similarity
measures; Section 3 shows the document representation and defines the term re-weighting
scheme. In Section 4, we present and discuss experiment results. Section 5 concludes
the paper.
scope of our research, our focus here is on term semantic similarity measure using
ontology information. In the subsequent subsections, we classify the ontology based
semantic measures into the following three categories and try to pick popular
measures for each category.
Path based similarity measure usually utilizes the information of the shortest path
between two concepts, of the generality or specificity of both concepts in ontology
hierarchy, and of their relationships with other concepts.
Wu and Palmer [15] present a similarity measure based on finding the most specific
common concept that subsumes both of the concepts being measured. The path length
from the most specific shared concept is scaled by the sum of the IS-A links from it to the
two compared concepts:

SW&P (C1 , C2 ) = 2H / (N1 + N2 + 2H)    (1)

In equation (1), N1 and N2 are the numbers of IS-A links from C1 and C2 , respectively, to
the most specific common concept C, and H is the number of IS-A links from C to
the root of the ontology. The measure scores between 1 (for similar concepts) and 0.
In practice, we set H to 1 when the parent of the most specific common concept C is the root node.
Li et al. [8] combines the shortest path and the depth of the ontology in a
non-linear function:

SLi (C1 , C2 ) = e^(−αL) · (e^(βH) − e^(−βH)) / (e^(βH) + e^(−βH))    (2)

where L stands for the shortest path length between the two concepts, and α and β
are parameters scaling the contribution of the shortest path length and the depth,
respectively. The value is between 1 (for similar concepts) and 0. In our experiment,
the same as in [8], we set α and β to 0.2 and 0.6, respectively.
Leacock and Chodorow [7] define a similarity measure based on the shortest path
length d(C1 , C2 ) between two concepts, scaled by twice the maximum depth
of the hierarchy, with the logarithm taken to smooth the resulting score:

SL&C (C1 , C2 ) = − log (d(C1 , C2 ) / 2D)    (3)

where D is the maximum depth of the ontology. In practice, we
add 1 to both d(C1 , C2 ) and 2D to avoid log(0) when the shortest path length is 0.
Mao et al. [10] define a similarity measure using both the shortest path information
and the number of descendants of the compared concepts:

SMao (C1 , C2 ) = δ / (d(C1 , C2 ) · log2 (1 + d(C1 ) + d(C2 )))    (4)

where d(C1 , C2 ) is the number of edges between C1 and C2 , and d(C1 ) is the number of
C1 ’s descendants, which represents the generality of the concept. Here, the constant
SLin (C1 , C2 ) = 2 log ICmis (C1 , C2 ) / (log IC(C1 ) + log IC(C2 ))    (8)
Feature based measures assume that each term is described by a set of terms
indicating its properties or features. Then, the more common characteristics two terms
have and the fewer non-common characteristics they have, the more similar the
terms are [14]. As there is no describing feature set for MeSH descriptor concepts, in
our experimental study, we take all the ancestor nodes of each compared concept as
their feature sets. The following measure is defined according to [5, 9]:
SBasicFeature (C1 , C2 ) = |Ans(C1 ) ∩ Ans(C2 )| / |Ans(C1 ) ∪ Ans(C2 )|    (9)

where Ans(C1 ) and Ans(C2 ) correspond to the description sets (the ancestor nodes) of
terms C1 and C2 respectively, Ans(C1 ) ∩ Ans(C2 ) is the intersection of the two ancestor
node sets and Ans(C1 ) ∪ Ans(C2 ) is the union of the two ancestor node sets.
Knappe [5] defines a similarity measure as below, using information about the
generalization and specialization of the two compared concepts:

SKnappe (C1 , C2 ) = p · |Ans(C1 ) ∩ Ans(C2 )| / |Ans(C1 )| + (1 − p) · |Ans(C1 ) ∩ Ans(C2 )| / |Ans(C2 )|    (10)

where p ∈ [0, 1] defines the relative importance of generalization vs.
specialization. This measure scores between 1 (for similar concepts) and 0. In our
experiment, p is set to 0.5.
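As a concrete illustration of two of the above measures, the sketch below (ours; it assumes a simple tree-shaped IS-A hierarchy given as a child-to-parent map, whereas a MeSH descriptor may sit under several ontology nodes) implements equation (1) and the feature based measure of equation (9):

def ancestors(c, parent):
    """All ancestors of concept c (excluding c itself) in a child -> parent map."""
    result = set()
    while c in parent:
        c = parent[c]
        result.add(c)
    return result

def wu_palmer(c1, c2, parent):
    """Equation (1): 2H / (N1 + N2 + 2H)."""
    depth = lambda c: len(ancestors(c, parent))
    common = (ancestors(c1, parent) | {c1}) & (ancestors(c2, parent) | {c2})
    lcs = max(common, key=depth)             # most specific common concept C
    H = max(depth(lcs), 1)                   # practical guard, roughly following the paper's adjustment
    def links(c):                            # number of IS-A links from c up to C
        n = 0
        while c != lcs:
            c, n = parent[c], n + 1
        return n
    return 2 * H / (links(c1) + links(c2) + 2 * H)

def feature_based(c1, c2, parent):
    """Equation (9): |Ans(C1) ∩ Ans(C2)| / |Ans(C1) ∪ Ans(C2)|."""
    a1, a2 = ancestors(c1, parent), ancestors(c2, parent)
    return len(a1 & a2) / len(a1 | a2) if a1 | a2 else 0.0

parent = {"disease": "root", "infection": "disease",
          "cancer": "disease", "flu": "infection"}   # toy hierarchy, not MeSH
print(wu_palmer("flu", "cancer", parent))            # 0.4
print(feature_based("flu", "cancer", parent))        # 0.666...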
Fig. 1. The concept mapping from MeSH entry terms to MeSH descriptors
that those terms do not have distinguishable power in clustering documents. Hence,
we have selected a set of only meaningful corpus-level concepts, in terms of MeSH
Descriptors, representing the documents. We call this set Document Concept Set
(DCS), where DCS = {C1, C2, …, Cn} and Ci is a corpus-level concept. Fig.1 shows
that MeSH Entry term sets are detected from “Doc1” and “Doc2” documents using the
MeSH ontology, and then the Entry terms are replaced with Descriptors based on the
MeSH ontology. For a more comprehensive comparative study, we represent
document in two ways: MeSH entry terms, MeSH descriptor terms. At the time of this
writing, there are about 23833 unique MeSH descriptor terms, 44978 MeSH ontology
nodes (one descriptor term might belong to more than one ontology nodes) and
593626 MeSH entry terms.
Re-weighting Scheme. A document is often full of class-independent “general”
words and short of class-specific “core” words, which makes document clustering
difficult. Steinbach et al. [13] observes from the data that each class has a
“core” vocabulary of words, while the remaining “general” words may have similar
distributions across different classes. To solve this problem, we should “discount” general
words and place more emphasis on core words in a vector [17]. [3, 14]
define the term re-weighting scheme as below:
x̃ji1 = xji1 + Σ (over i2 = 1..m, i2 ≠ i1, S(xji1 , xji2 ) ≥ Threshold)  S(xji1 , xji2 ) · xji2    (11)
where x stands for the term weight, m stands for the number of co-occurring terms, and
S(xji1 , xji2 ) stands for the semantic similarity between the two concepts. Through this re-
weighting scheme, the weights of semantically similar terms will be co-augmented.
Here the threshold stands for the minimum similarity score between two compared terms.
Since we are only interested in re-weighting those terms that are more semantically
similar to each other, it is necessary to set up a threshold value, i.e., the minimum
similarity score between compared terms. Besides, it should be noted that the term
weight can be the term frequency (TF), the normalized term frequency (NTF) or
TF*IDF (term frequency times inverse document frequency).
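A minimal sketch of the re-weighting scheme in equation (11) (ours; it assumes a document is given as a term-to-weight dictionary and that a pairwise similarity function is available):

def reweight(doc, sim, threshold):
    """Equation (11): augment each term's weight by the similarity-weighted
    weights of all co-occurring terms whose similarity reaches the threshold.
    doc: dict term -> weight (TF, NTF or TF*IDF); sim: (t1, t2) -> score."""
    terms = list(doc)
    return {t1: doc[t1] + sum(sim(t1, t2) * doc[t2]
                              for t2 in terms
                              if t2 != t1 and sim(t1, t2) >= threshold)
            for t1 in terms}

Because the augmented weights are computed from the original weights of all other terms, the processing order does not matter; semantically similar terms are co-augmented as described above.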
Cluster quality is evaluated by four extrinsic measures: entropy [13], F-measure [6],
purity [19], and normalized mutual information (NMI) [1]. Because of space
restrictions, we only describe in detail a recently popular measure, NMI, which is
defined as the mutual information between the cluster assignments and a pre-existing
labeling of the dataset, normalized by the arithmetic mean of the maximum possible
entropies of the empirical marginals, i.e.,

NMI (X, Y ) = I(X; Y ) / ((log k + log c) / 2)    (12)
where X is a random variable for cluster assignments, Y is a random variable for the
pre-existing labels on the same data, k is the number of clusters, and c is the number
of pre-existing classes. NMI ranges from 0 to 1; the larger the NMI, the higher
the clustering quality. NMI is better than other common extrinsic measures such as
purity and entropy in the sense that it does not necessarily increase when the number
of clusters increases. Purity and F-measure also range from 0 to 1, and larger
values indicate higher clustering quality. For entropy, smaller values indicate
higher clustering quality.
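For reference, equation (12) can be computed directly from the two label sequences; the short sketch below (ours, assuming k > 1 and c > 1 so the denominator is non-zero) uses natural logarithms throughout:

import math
from collections import Counter

def nmi(clusters, classes):
    """Equation (12): I(X;Y) normalized by (log k + log c) / 2."""
    n = len(clusters)
    px, py = Counter(clusters), Counter(classes)
    pxy = Counter(zip(clusters, classes))
    mi = sum((nxy / n) * math.log((nxy / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), nxy in pxy.items())
    return mi / ((math.log(len(px)) + math.log(len(py))) / 2)

print(nmi([0, 0, 1, 1, 2, 2], ["a", "a", "b", "b", "c", "a"]))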
Table 3. Clustering results of MeSH entry terms scheme; each measure is followed by the
threshold of similarity value (in parenthesis) that helps achieve the best results
[0, 1]. So the term similarity scores of these three measures are normalized before
being applied to term re-weighting, for a fair comparison. Interestingly, the
information content based measures, which are supported by corpus statistics, perform
very similarly to the other two types of measures. This indicates that the corpus
statistics fit the ontology structure of MeSH and do not improve on the path based
measures. The measure of Mao et al. achieves the best result in both indexing schemes,
as shown in Tables 3 and 4. The reason might be that it is the only measure that utilizes
the number of descendants of the compared terms. Judging from the overall
performance, Wu et al., Li et al., Mao et al., Resnik and the two feature based
measures have a rather more stable performance than the others. Moreover, for
almost all the cases shown in Table 3, the four evaluation metrics are consistent
with each other, except that the F-measure and Purity scores of Wu et al. and Li et al.
are slightly better than the baseline without re-weighting while their NMI scores
are slightly worse.
From table 3 & 4, it’s easily seen that the overall performance of descriptor scheme
is very consistent with and slightly better than that of entry term scheme, which shows
that making a document vector more precise by mapping synonym entry terms to one
descriptor terms has positive effects on document clustering. It’s also noted that both
indexing schemes without term re-weighting have competitive performance to those
with term re-weighting. It shows that term re-weighting as a method of integrating
domain ontology to clustering might not be an effective approach, especially when the
documents are short of terms, because when all these terms are very important core
terms for the documents, ignoring the effects of some of them by re-weighting can
cause serious information loss. This is in contrast to the experiment results in general
domain where document length is relatively longer [3].
It’s obvious that word indexing scheme achieves the best clustering result although
it’s not statistically significant (The word scheme experimental result is listed in both
table 3 & 4 for convenience of reader). However, this does not mean indexing
medical documents using MeSH entry term or MeSH descriptor is a bad scheme. In
other words, it does not mean domain knowledge is not good. First, while keeping
Table 4. Clustering results of MeSH descriptor terms scheme; each measure is followed by the
threshold of similarity value (in parenthesis) that helps achieve the best results
competitive clustering results, not only the dimension of the clustering space but also the
computational cost is dramatically reduced, especially when handling large datasets.
Second, existing ontologies are still growing and are not yet sufficient for many text
mining applications. For example, there are only 28533 unique entry terms at the
time of writing. Third, there are also limitations in term extraction. So far, existing
approaches usually use “exact match” to map abstract terms to entry terms and can-
not judge the sense of a phrase. This can cause serious information loss. For
example, when representing a document as entry terms, the average document length
is 14, while the length of the word representation is 81. Finally, by taking advantage of
both the medical concept representation and the informative word representation, the
results of text mining applications can be more convincing.
5 Conclusion
In this paper, we evaluate the effects of 9 semantic similarity measures with a term re-
weighting method on document clustering of PubMed document sets. The k-means
clustering experiment shows that term re-weighting as a method of integrating domain
knowledge has some positive effects on medical document clustering, but they might not
be significant. In detail, we obtain the following interesting findings from the experiment
by comparing semantic similarity measures of three types (path based, information
content based and feature based) with two indexing schemes, MeSH entry
terms and MeSH descriptors: (1) The descriptor scheme is relatively more effective for
clustering than the entry term scheme because the synonym problem is well handled. (2)
No particular type of measure is significantly better than the others, since most of
these measures consider only the path between the compared concepts and their depth
within the ontology. (3) Information content based measures, which use corpus
statistics as well as the ontology structure, do not necessarily improve the clustering
result when the corpus statistics are very consistent with the ontology structure. (4) As the only
similarity measure using the number of descendants of the compared
concepts, the measure of Mao et al. achieves the best clustering result compared to the other
similarity measures. (5) Similarity measures that do not score between 0 and 1 need to
be normalized, otherwise they will augment term weights much more aggressively.
(6) Overall, term re-weighting achieves clustering results similar to those without
term re-weighting; some measures outperform the baseline, some do not, and neither
difference is very significant, which may indicate that term re-weighting is not
an effective approach when documents are short of terms, because when most
of these terms are distinguishing core terms for a document, ignoring some of them by
re-weighting will cause serious information loss. (7) The performance of the MeSH term
based schemes is slightly worse than that of the word based scheme, which can
result from the limitations of the domain ontology and the limitations of term extraction and
sense disambiguation. However, while keeping competitive results, indexing using the
domain ontology dramatically reduces the dimension of the clustering space and the
computational complexity. Furthermore, this finding indicates that there should be an
approach taking advantage of both the medical concept representation and the informative
word representation.
In our future work, we may consider other biomedical ontologies, such as the Unified
Medical Language System (UMLS), and also extend this comparative study to some
public domains.
Acknowledgments. This work is supported in part by NSF Career grant (NSF IIS
0448023), NSF CCF 0514679, PA Dept of Health Tobacco Settlement Formula Grant
(No. 240205 and No. 240196), and PA Dept of Health Grant (No. 239667).
References
1. Banerjee, A. and Ghosh, J. Frequency sensitive competitive learning for clustering on
high-dimensional hperspheres. Proc. IEEE Int. Joint Conference on Neural Networks, pp.
1590-1595.
2. Hotho, A., Staab, S. and Stumme, G., “Wordnet improves text document clustering,” in
Proc. of the Semantic Web Workshop at 26th Annual International ACM SIGIR
Conference, Toronto, Canada, 2003.
3. Jing, J., Zhou, L., Ng, M. K. and Huang, Z., “Ontology-based distance measure for text
clustering,” in Proc. of SIAM SDM workshop on text mining, Bethesda, Maryland, USA,
2006.
4. Jiang, J.J. and Conrath, D.W., Semantic Similarity Based on Corpus Statistics and Lexical
Taxonomy. In Proceedings of the International Conference on Research in Computational
Linguistic, Taiwan, 1998.
5. Knappe, R., Bulskov, H. and Andreasen, T.: Perspectives on Ontology-based Querying,
International Journal of Intelligent Systems, 2004.
6. Larsen, B. and Aone, C. Fast and effective text mining using linear-time document
clustering, KDD-99, San Diego, California, 1999, 16-22.
7. Leacock, C. and Chodorow, M., Filling in a sparse training space for word sense
identification. ms., March 1994.
8. Li, Y., Zuhair, A.B. and McLean, D.: An Approach for Measuring Semantic Similarity
between Words Using Multiple Information Sources. IEEE Transactions on Knowledge
and Data Engineering, 15(4):871-882, July/August 2003.
1 Introduction
In the past few years, more and more sports videos have been produced, dis-
tributed and made available all over the world. As an important video
domain, sports video has therefore been widely studied due to its tremendous
commercial potential.
Different from other categories of video such as news, movies and sitcoms,
sports video has its own special characteristics [1]. A sports game usually takes place
on a specific field and always has its own well-defined content structure and
domain-specific rules. In addition, sports video is usually captured by a number of fixed
cameras with fixed motions around the play field, which results in some
2 Dimensionality Reduction
For high-dimensional data, not all dimensions are useful for different applica-
tions. In many applications, such as clustering, indexing, information retrieval,
only some special dimensions are needed. Figure 1 shows an example of cluster-
ing. According to the distribution of the data set in (a), if we want to partition
the points into three clusters, the clustering results can easily be found by
computing the distances among the points in the feature space of dimensions d1
and d2. But in fact, there is no need to take both dimensions into
(Fig. 1: an example of clustering; (a) the data set plotted in dimensions d1 and d2, (b) the same data set in dimension d1 only.)
account; dimension d1 alone is enough. Figure 1(b) shows that the clustering results ob-
tained by only considering dimension d1 are the same as the clustering results
in Figure 1(a). Therefore, dimension d1 contributes to clustering, and it is a valid
dimension of the data set.
Valid dimensions are the dimensions which can maximally represent the in-
trinsic characteristics of data set. For the data set in Figure 1, the standard
deviations of dimension d1 and d2 are 0.95 and 0.48 respectively. The reason
why dimension d1 is valid for clustering is that its standard deviation is larger
and it can represent the distribution of the data set. Standard deviation of a data
set is a measure of how spread out it is [11]. The larger the standard deviation,
the more spread out from the mean the data set is. A data set that is
more spread out is more sensitive in clustering; therefore, a dimension whose
standard deviation is larger is more helpful for clustering.
The dimensionality reduction approach in this paper is to extract the valid
dimensions for our clustering algorithm. In the next subsection, we will discuss the
extraction rule for valid dimensions.
Sports videos have their own intrinsic characteristics; the backgrounds of sports
videos vary little. By carefully observing the high-dimensional feature vectors of
video shots, it can easily be found that the values of a large number of
dimensions are all zero, especially in the color features. In other words,
these dimensions are useless for computation, and the dimensions whose val-
ues are non-zero are called available dimensions in our paper. Table 1 gives an
example of the ratio of available dimensions over the total dimensions for dif-
ferent categories of sports. It illustrates that the ratios of available dimensions
are about 50%; thus, extracting the available dimensions is the first step of our
dimensionality reduction approach.
Let Dm be the subspace of the data set with m available dimensions. For the values
of the jth dimension of Dm , Si [j] denotes the value of shot Si and σS[j] denotes the
standard deviation of the jth available dimension of Dm , where 1 ≤ j ≤ m. The standard
deviation of each available dimension indicates its importance for clustering. A
larger σS[j] indicates that the data in the jth available dimension are more spread
out and more advantageous for clustering.
(Fig. 2: an example ADH, showing the standard deviations of the ranked available dimensions.)
In order to extract valid dimensions which can maximally represent the dis-
tribution of the data for clustering, a heuristic method on the ADH is applied for
determining the value of ε. Let r [i] denote the rank of the available dimensions in
Dm corresponding to the ADH. Then ε = σr[k] , where σr[k] − σr[k+1] = max(σr[i] −
σr[i+1] , 1 ≤ i ≤ m − 1). That is, ε is the standard deviation of the available dimension r [k ]
whose difference to that of r [k +1] is the largest in Dm , i.e., ε marks the largest
plunge in the ADH. Referring to Figure 2, the largest drop of the ADH occurs
from r [3] to r [4], i.e., ε = σr[3] , and the available dimensions which correspond
to r [1], r [2], and r [3] are the valid dimensions. Intuitively, such an extraction rule
guarantees that the most significant available dimensions are extracted as valid di-
mensions for our clustering.
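The extraction rule can be sketched as follows (our illustration; it assumes the shot features are given as a NumPy matrix of shape number-of-shots × number-of-dimensions):

import numpy as np

def valid_dimensions(X):
    """Return the indices of the valid dimensions of X, ordered by decreasing
    standard deviation.
    Step 1: keep the available dimensions (columns that are not all zero).
    Step 2: rank them by standard deviation and cut at the largest drop
            between consecutive ranked deviations (the threshold epsilon)."""
    available = np.where(X.any(axis=0))[0]
    stds = X[:, available].std(axis=0)
    order = np.argsort(-stds)                      # rank r[1], r[2], ... by descending std
    ranked = stds[order]
    if ranked.size < 2:
        return available[order]
    k = int(np.argmax(ranked[:-1] - ranked[1:]))   # largest plunge at rank k
    return available[order[:k + 1]]

On the example of Figure 2, the cut would fall after r[3], matching the discussion above.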
A video shot Si can be represented as: Si ={xi1 , xi2 , . . . xin }, where xip is the pth
dimension of the n-dimensional feature vector Si . Let Df be the subspace of
valid dimensions, where f is the number of valid dimensions which are obtained
by our dimensionality reduction approach.
Valid dimension clustering (VDC) is an unsupervised clustering algorithm
which processes the valid dimensions in Df one by one, because different valid dimensions
have different importance for clustering. After ranking the standard
deviations of the valid dimensions in descending order, we first take the valid dimen-
sion whose standard deviation is the largest as the starting point of the algorithm;
the following valid dimensions are then taken into account in order.
For the first valid dimension, each shot is first initialized as one cluster, then
the iterations of merging similar shots into one cluster are repeated until the
stop criterion is satisfied. For any other valid dimension di , the clustering results of
valid dimension di−1 (the dimension preceding di in the rank of valid
dimensions) are set as the initial clustering status of di ; then the same
merging procedure is performed on each initial cluster of di until all initial clusters
have been processed. After finishing valid dimension di , the algorithm turns
to di+1 . The final clustering results are returned when all f valid dimen-
sions have been processed. It is obvious that for each valid dimension, only merging
procedures are performed, but for two consecutive valid dimensions di−1 and di ,
the processing of di is a splitting procedure with respect to di−1 . Thus, VDC comprises both
merging and splitting procedures.
The reason why VDC processes the valid dimensions one by one is illustrated
by Figure 3. Figure 3(a) gives the clustering results of VDC, i.e., with the valid dimensions
taken into account one by one, while Figure 3(b) shows the results of the algorithm
in which all valid dimensions are taken into account at once. Obviously, the results
in (a) are better than those in (b). All six shots are actually play field shots, but
(b) partitions them into two clusters corresponding to different positions of the play table. The
reason is that when we consider all valid dimensions together, all valid dimensions
are treated equally, and the different importance of different valid dimensions is
not distinguished.
Let SSE and SSG denote the total within-class divergence and the total between-
class divergence of the data samples. The α which maximizes the criterion F (α)
is used in the Fisher discriminant function, formula (1). F (α) is represented
as below:

F (α) = SSG / SSE = (α^T Bα) / (α^T Eα)    (2)
For our shot clustering algorithm, we are only interested in the concepts of
within-class divergence and between-class divergence. For clustering, the intra-
distance within a cluster and the inter-distance among different clusters can
be mapped into the concepts of within-class divergence and between-class di-
vergence respectively. The clustering results in which the intra-distance of each
cluster is smallest and the inter-distances among different clusters are largest
are the desired results; this indicates that the data set is separated optimally.
Let rl denote the ratio of the intra-distance of one cluster over the inter-
distances among clusters when the number of clusters is Nl , and the best clus-
tering result we want is the one with smallest value of rl . The value of rl can be
calculated by the formula below:
rl = (Σc=0..Nl dcw ) / dt = (Σc=0..Nl Σi=0..mc |Sic − Scmean |) / (Σj=0..N |Sj − Smean |)    (3)
where dt is the initial distance among clusters and dcw is the intra-cluster distance
of cluster c. N is the initial number of clusters at the beginning, while mc is the
number of shots in cluster c. | · | denotes the Manhattan distance. Sic and Scmean
represent the ith shot and the mean vector of cluster c respectively, while Sj
and Smean denote the same concepts for the initial clusters.
Apart from rl , another important factor, nl , is considered in our algorithm,
which captures statistical information about the number of clusters. Let nl = Nl /N be
(Fig. 4: the value of rl + nl against the number of merging iterations m; the marked stop point corresponds to the smallest value of rl + nl .)
Algorithm 1. VDC()
Input: ranking array of valid dimensions r[k]; cluster structures CR
Output: clustering results
1: for dn = 1 to k do
2:   ptr := GetHead(CR)
3:   while ptr ≠ NULL do
4:     S := ODC(ptr, dn )   // S denotes the splitting results
5:     InsertFront(CR, S)
6:     ptr := GetNext(ptr )
7:     dn ++
8:   end while
9: end for

Function ODC(CR, dn )
  initialize each shot Si as one cluster Ci
  let rl (1) = 0, nl (1) = 1, calculate dist(Ci , Cj )dn , 1 ≤ i, j ≤ Nl
  execute MergeCluster()
  WHILE rl (1) + nl (1) > rl (2) + nl (2) ∧ Nl > 1
    rl (1) := rl (2) , nl (1) := nl (2)
    execute MergeCluster()
  ENDWHILE
  add the clustering results to CR
end Function

Function MergeCluster()
  merge the two most similar shots into one cluster
  calculate rl (2) , nl (2) and rl (2) + nl (2)
end Function
the ratio of the cluster number Nl over the initial total number of shots N. In
order to maximally approximate the real cluster number, which is a small value,
the smaller the value of nl , the better the clustering result.
At the beginning of the clustering algorithm, each shot is initialized as one
cluster, the value of rl is 0, and the value of nl is 1. Then, as the merging proceeds,
the value of rl increases while nl decreases. When all the shots are merged
into one cluster, the value of rl reaches 1, and nl reaches its smallest value. Since
the desired clustering results should have both a small rl and a small nl , we choose
min(rl + nl ) as the stop criterion of our algorithm. When rl + nl reaches its
smallest value, the iterations of merging stop. For example, the relation between
the value of rl + nl and the number of iterations m for one valid dimension of
football is shown in Figure 4. The inflexion of the curve, which corresponds to
the smallest value of rl + nl , is the stop point of the iterations.
Having presented the stop criterion for the iterations of our clustering algorithm,
the detailed description of VDC is given in Algorithm 1.
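To make the stop criterion concrete, the following sketch (our simplified, one-dimensional rendering of function ODC, not the authors' implementation; the merge rule based on cluster means is our simplification) merges the two closest clusters repeatedly and stops as soon as rl + nl would increase:

import numpy as np

def odc_one_dimension(values):
    """Agglomerative merging on a single valid dimension with the
    min(r_l + n_l) stop criterion (cf. Algorithm 1, function ODC).
    values: 1-D array of shot feature values on this dimension."""
    n = len(values)
    clusters = [[i] for i in range(n)]
    d_t = np.abs(values - values.mean()).sum()        # initial divergence, denominator of r_l

    def criterion(cls):
        intra = sum(np.abs(values[np.array(c)] - values[np.array(c)].mean()).sum()
                    for c in cls)                      # total intra-cluster distance
        r_l = intra / d_t if d_t else 0.0
        n_l = len(cls) / n                             # n_l = N_l / N
        return r_l + n_l

    best = criterion(clusters)
    while len(clusters) > 1:
        means = [values[np.array(c)].mean() for c in clusters]
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: abs(means[ab[0]] - means[ab[1]]))
        merged = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [clusters[i] + clusters[j]]
        score = criterion(merged)
        if score > best:                               # r_l + n_l starts to rise: stop
            break
        clusters, best = merged, score
    return clusters

print(odc_one_dimension(np.array([0.1, 0.2, 0.15, 5.0, 5.1, 9.9])))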
4 Performance Study
In this section, we will report our extensive performance study on large real
video data, and the comparison results with two other clustering algorithms.
(Figure panels: (a) CPU cost (ms) against dimensionality and (b) CPU cost (ms) against the number of shots, for FDC and VDC; precision (%) and recall (%) against dimensionality for VDC, FDC and X-means.)
One of them is FDC, which applies our stop criterion for merging iterations but performs on the whole
high-dimensional feature space without dimensionality reduction. The other is
X -means [6], which is a refined variant of k -means.