
Lecture Notes in Computer Science 4443

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Ramamohanarao Kotagiri
P. Radha Krishna
Mukesh Mohania
Ekawit Nantajeewarawat (Eds.)

Advances in Databases:
Concepts, Systems
and Applications

12th International Conference on Database Systems
for Advanced Applications, DASFAA 2007
Bangkok, Thailand, April 9-12, 2007
Proceedings

Volume Editors

Ramamohanarao Kotagiri
The University of Melbourne
Department of Computer Science and Software Engineering
Victoria 3010, Australia
E-mail: [email protected]

P. Radha Krishna
Institute for Development and Research in Banking Technology
Masab Tank, Hyderabad 500 057, Andhra Pradesh, India
E-mail: [email protected]

Mukesh Mohania
IBM India Research Laboratory
Institutional Area, Vasant Kunj, New Delhi 110 070, India
E-mail: [email protected]

Ekawit Nantajeewarawat
Thammasat University - Rangsit Campus
Sirindhorn International Institute of Technology
Pathum Thani 12121, Thailand
E-mail: [email protected]

Library of Congress Control Number: 2007923774

CR Subject Classification (1998): H.2, H.3, H.4, H.5, J.1

LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI

ISSN 0302-9743
ISBN-10 3-540-71702-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-71702-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12043323 06/3142 543210
Preface

The 12th International Conference on Database Systems for Advanced Applications (DASFAA), organized jointly by the Asian Institute of Technology, the National Electronics and Computer Technology Center and the Sirindhorn International Institute of Technology, sought to provide users and practitioners of database systems with information on databases and their advanced applications.
The DASFAA conference series has already established itself and it continues to
attract, each year, participants from all over the world. In this context, it may be
recalled that the previous DASFAA conferences were successfully held in Seoul,
Korea (1989), Tokyo, Japan (1991), Daejeon, Korea (1993), Singapore (1995),
Melbourne, Australia (1997), Taiwan, ROC (1999), Hong Kong (2001), Kyoto, Japan
(2003), Jeju Island, Korea (2004), Beijing, China (2005) and Singapore (2006).
With DASFAA 2007, Thailand has the opportunity to host this prestigious and important international conference and join that list.
This conference provides an international forum for academic exchanges and
technical discussions among researchers, developers and users of databases from
academia, business and industry. DASFAA focuses on research in database theory,
development of advanced DBMS technologies and their advanced applications. It also
promotes research and development activities in the field of databases among
participants and their institutions from Pacific Asia and the rest of the world.
This proceedings volume brings together 112 accepted papers from more than 18
countries in areas such as XML Databases, Mobile Databases, Query Languages, Query
Optimization and Data Mining; of these, 68 are full papers, 24 are short papers,
17 are posters and 3 are industrial track papers. The conference received 375
submissions, and such a rigorous selection helped retain DASFAA's reputation as a
highly selective conference that publishes only quality research.
We are delighted to feature two invited talks from Guy M. Lohman, IBM Almaden
Research Center, and Masaru Kitsuregawa, University of Tokyo. DASFAA 2007
also featured an excellent tutorial program covering three tutorials related to Matching
Words and Pictures, Time Series Databases, XML Databases and Streams. In
addition, there were three demonstrations, panel sessions and two workshops.
The members of the DASFAA Organizing Committee worked extremely hard to
make this conference a success. The members of the Program Committee, consisting
of renowned data management experts, undertook the arduous task of reviewing all
the submitted papers and invested their valuable time and expertise, despite their
extremely tight schedules. We would like to thank all the reviewers who very
carefully reviewed the papers on time, the authors who submitted their papers and all
the participants.
We are grateful to Alfred Hofmann and the staff of Springer for their support in
publishing these proceedings.

The conference was sponsored by IBM, Thailand, the Database Society of Japan,
Korea Information Science Society, National Electronics and Computer Technology
Center and Software Industry Promotion Agency.

April 2007 Ramamohanarao Kotagiri


P. Radha Krishna
Mukesh Mohania
Ekawit Nantajeewarawat
DASFAA 2007 Conference Organization

Conference Chair
Vilas Wuwongse Asian Institute of Technology, Thailand

Program Committee Co-chairs


Ramamohanarao Kotagiri University of Melbourne, Australia
Mukesh Mohania IBM India Research, India
Ekawit Nantajeewarawat Sirindhorn International Institute of Technology,
Thammasat University, Thailand
Demo Co-chairs
Mizuho Iwaihara Kyoto University, Japan
Xuemin Lin University of New South Wales, Australia

Industrial Co-chairs
Prasan Roy IBM India Research, India
Masashi Tsuchida Software Division, Hitachi, Ltd., Japan

Panel Co-chairs
Sourav Bhowmick NTU, Singapore
Masaru Kitsuregawa University of Tokyo, Japan

Tutorial Committee Co-chairs


Tharam Dillon University of Technology, Sydney, Australia
Haruo Yokota Tokyo Institute of Technology, Japan

Publication Chair
P. Radha Krishna Institute for Development and Research in
Banking Technology, India

Publicity Co-chairs
Chin-Wan Chung KAIST, Korea
Qing Li City University of Hong Kong, PRC

Regional Chairs
Asia Yasushi Kiyoki Keio University, Japan
Australia and New Zealand Millist Vincent University of South Australia, Australia
Europe Michael Schrefl University of Linz, Austria
USA Sanjay Madria University of Missouri-Rolla, USA

Local Arrangements Chair


Suranart Tanvejsilp NECTEC, Thailand

Program Committee
Akiyo Nadamoto NICT, Japan
Amol Deshpande University of Maryland at College Park, USA
Anirban Mondal University of Tokyo, Japan
Arkady Zaslavsky Monash University, Australia
Arnd Christian Konig Microsoft Research, USA
Atsuyuki Morishima University of Tsukuba, Japan
Bala Iyer IBM, USA
Balaraman Ravindran IIT Madras, India
Barbara Catania University of Genoa, Italy
Charnyot Pluempitiwiriyawej Mahidol University, Thailand
Chiang Lee National Cheng Kung University, Taiwan
Cholwich Nattee Thammasat University, Thailand
Chutiporn Anutariya Shinawatra University, Thailand
Dan Lin National University of Singapore, Singapore
David Embley Brigham Young University, USA
David Taniar Monash University, Australia
Dimitrios Gunopulos UCR, USA
Egemen Tanin University of Melbourne, Australia
Elena Ferrari University of Insubria, Italy
Ernesto Damiani University of Milan, Italy
Evaggelia Pitoura University of Ioannina, Greece
Gao Cong University of Edinburgh, UK
Gill Dobbie University of Auckland, New Zealand
Gunther Pernul University of Regensburg, Germany
Haibo Hu Hong Kong University of Science and
Technology, China
Haixun Wang IBM T.J. Watson Research Center, USA
Hakan Ferhatosmanoglu Ohio State University, USA
Hayato Yamana Waseda University, Japan
Heng Tao Shen University of Queensland, Australia
H.V. Jagadish University of Michigan, USA
Hyunchul Kang Chung-Ang University, South Korea

Ibrahim Kamel University of Sharjah, UAE


Indrakshi Ray Colorado State University, USA
James Bailey University of Melbourne, Australia
Jeffrey Xu Yu Chinese University of Hong Kong, Hong Kong,
China
Jialie Shen University of Glasgow, UK
Jinyan Li Institute for Infocomm Research, Singapore
Jun Miyazaki Nara Institute of Science and Technology, Japan
Kamal Karlapalem IIIT, Hyderabad, India
Katsumi Tanaka Kyoto University, Japan
Keishi Tajima Kyoto University, Japan
Kenji Hatano Doshisha University, Japan
K. Selcuk Candan Arizona State University, USA
Kazumasa Yokota Okayama Prefectural University, Japan
Kazunari Ito Aoyama Gakuin University, Japan
Kazutoshi Sumiya University of Hyogo, Japan
Kyoji Kawagoe Ritsumeikan University, Japan
Kyu-Young Whang KAIST, Korea
Ladjel Bellatreche LISI/ENSMA, France
Linhao Xu National University of Singapore, Singapore
Li Yang Western Michigan University, USA
Luc Bouganim INRIA, France
Manolis Koubarakis National and Kapodistrian University of Athens,
Greece
Markus Schneider University of Florida, USA
Masatoshi Arikawa The University of Tokyo, Japan
Masayoshi Aritsugi Gunma University, Japan
Matthew Dailey Asian Institute of Technology, Thailand
Md Maruf Hasan Shinawatra University, Thailand
Miyuki Nakano University of Tokyo, Japan
Mizuho Iwaihara Kyoto University, Japan
Nandlal Sarda Indian Institute of Technology Bombay, India
Oded Shmueli Technion-Israel Institute of Technology, Israel
Ozgur Ulusoy Bilkent University, Turkey
Panos Kalnis National University of Singapore, Singapore
Photchanan Ratanajaipan Shinawatra University, Thailand
Pierangela Samarati Università degli Studi di Milano, Italy
Pongtawat Chippimolchai Asian Institute of Technology, Thailand
P. Radha Krishna Institute for Development and Research in
Banking Technology, India
Qiankun Zhao The Pennsylvania State University, USA
Rachada Kongkachandra Thammasat University, Thailand
Rachanee Ungrangsi Shinawatra University, Thailand
Rajugan R. University of Technology, Sydney (UTS),
Australia
Rui Zhang University of Melbourne, Australia
Sanghyun Park Yonsei University, Korea

Sanjay Madria University of Missouri-Rolla, USA


Sean Wang The University of Vermont, USA
Sengar Vibhuti Singh Microsoft Research, India
Sergio Lifschitz Pontifícia Universidade
Católica do Rio de Janeiro, Brazil
Shuigeng Zhou Fudan University, China
Shyam Kumar Gupta IIT Delhi, India
Simonas Saltenis Aalborg University, Denmark
Sonia Berman University of Cape Town, South Africa
Sourav Bhowmick Nanyang Technological University, Singapore
Sreenivasa Kumar P. IIT Madras, India
Stefan Manegold CWI, The Netherlands
Stephane Bressan National University of Singapore, Singapore
Steven Gordon Thammasat University, Thailand
Sujeet Pradhan Kurashiki University of Science and the Arts,
Japan
Sunil Prabhakar Purdue University, USA
Sushil Jajodia George Mason University, USA
Takahiro Hara Osaka University, Japan
Takashi Tomii Yokohama National University, Japan
Takeo Kunishima Okayama Prefectural University, Japan
Takuya Maekawa NTT, Japan
Thanaruk Theeramunkong Sirindhorn International Institute of Technology,
Thammasat University, Thailand
Thanwadee Sunetnanta Mahidol University, Thailand
Theo Haerder University of Kaiserslautern, Germany
Tore Risch Uppsala University, Sweden
Toshiyuki Amagasa University of Tsukuba, Japan
Vasilis Vassalos Athens University of Economics and Business,
Greece
Verayuth Lertnattee Silpakorn University, Thailand
Vicenc Torra IIIA-CSIC, Catalonia, Spain
Vijay Atluri Rutgers University, USA
Wang-Chien Lee Penn State University, USA
Weining Qian Fudan University, P.R. China
Wei Wang University of North Carolina at Chapel Hill, USA
Weiyi Meng SUNY at Binghamton University, USA
Willem Jonker Philips Research, The Netherlands
Wolfgang Nejdl L3S and University of Hannover, Germany
Xiaofang Zhou University of Queensland, Australia
Xiaofeng Meng Renmin University of China, China
Xuemin Lin University of New South Wales, Australia
Yanfeng Shu The University of Queensland, Australia
Yang-Sae Moon Kangwon National University, Korea
Yan Wang Macquarie University, Australia
Yasuhiko Morimoto Hiroshima University, Japan

Yasushi Sakurai NTT, Japan


Ying Chen IBM, China
Young-Koo Lee Kyung Hee University, Korea
Yufei Tao Chinese University of Hong Kong, China

Industrial Program Committee


Arvind R Hulgeri Persistent Systems, India
Katsumi Takahashi NTT Lab, Japan
Ming Xiong Lucent/Bell Labs, USA
Prasad M Deshpande IBM Research, India
Yasuhiko Kanemasa Fujitsu Lab, Japan

External Referees
Alex Liu University of Texas, USA
Amit Garde Persistent Systems, India
Chavdar Botev Cornell University, USA
Fan Yang Cornell University, USA
Feng Shao Cornell University, USA
L. Venkata Subramaniam IBM IRL, India
Man Lung Yiu University of Hong Kong, China
Meghana Deodhar IBM IRL, India
Mirek Riedewald Cornell University, USA
Panagiotis Karras University of Hong Kong, China
Pankaj Kankar IBM IRL, India
R. Venkateswaran Persistent Systems, India
Shipra Agrawal Bell Labs Research, India
Sourashis Roy IBM IRL, India
Suju Rajan University of Texas, USA
Sunita Sarawagi IIT Bombay, India
Umesh Bellur IIT Bombay, India

External Reviewers
A. Balachandran Persistent Systems, India
Amit Garde Persistent Systems, India
Atsushi Kubota Fujitsu Laboratories, Japan
Daniel Lieuwen Bell Labs, USA
Deepak P IBM Research, India
Iko Pramudiono NTT, Japan
Krishna Kummamuru IBM Research, India
Masanori Goto Fujitsu Laboratories, Japan
Noriaki Kawamae NTT, Japan
Nicolas Anciaux INRIA, France
Nicolas Travers University of Versailles, France
Philippe Pucheral INRIA, France

R. Venkateswaran Persistent Systems, India


Satyanarayana Valluri IIIT-Hyderabad, India
Takeshi Motohashi NTT, Japan
Toshikazu Ichikawa NTT, Japan
Vijil E. Chenthamarakshan IBM Research, India
Vinod G. Kulkarni Persistent Systems, India

Sponsoring Institutions

IBM, Thailand
Korea Information Science Society
Database Society of Japan
National Electronics and Computer Technology Center
Software Industry Promotion Agency
Table of Contents

Invited Talks
‘Socio Sense’ and ‘Cyber Infrastructure for Information Explosion Era’:
Projects in Japan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Masaru Kitsuregawa
Is (Your) Database Research Having Impact? . . . . . . . . . . . . . . . . . . . . . . . . 3
Guy M. Lohman

Part I: Full Papers


Query Language and Query Optimization - I
Improving Quality and Convergence of Genetic Query Optimizers . . . . . . 6
Victor Muntés-Mulero, Néstor Lafón-Gracia,
Josep Aguilar-Saborit, and Josep-L. Larriba-Pey
Cost-Based Query Optimization for Multi Reachability Joins . . . . . . . . . . 18
Jiefeng Cheng, Jeffrey Xu Yu, and Bolin Ding
A Path-Based Approach for Efficient Structural Join with
Not-Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Hanyu Li, Mong Li Lee, Wynne Hsu, and Ling Li

Query Language and Query Optimization - II


RRPJ: Result-Rate Based Progressive Relational Join . . . . . . . . . . . . . . . . 43
Wee Hyong Tok, Stéphane Bressan, and Mong-Li Lee
GChord: Indexing for Multi-Attribute Query in P2P System with Low
Maintenance Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Minqi Zhou, Rong Zhang, Weining Qian, and Aoying Zhou
ITREKS: Keyword Search over Relational Database by Indexing Tuple
Relationship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Jiang Zhan and Shan Wang

Data Mining and Knowledge Discovery


An MBR-Safe Transform for High-Dimensional MBRs in Similar
Sequence Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Yang-Sae Moon
Mining Closed Frequent Free Trees in Graph Databases . . . . . . . . . . . . . . . 91
Peixiang Zhao and Jeffrey Xu Yu
Mining Time-Delayed Associations from Discrete Event Datasets . . . . . . . 103
K.K. Loo and Ben Kao

Clustering
A Comparative Study of Ontology Based Term Similarity Measures on
PubMed Document Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, and
Xiaohua Zhou
An Adaptive and Efficient Unsupervised Shot Clustering Algorithm for
Sports Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Jia Liao, Guoren Wang, Bo Zhang, Xiaofang Zhou, and Ge Yu
A Robust Feature Normalization Scheme and an Optimized Clustering
Method for Anomaly-Based Intrusion Detection System . . . . . . . . . . . . . . . 140
Jungsuk Song, Hiroki Takakura, Yasuo Okabe, and Yongjin Kwon
Detection and Visualization of Subspace Cluster Hierarchies . . . . . . . . . . . 152
Elke Achtert, Christian Böhm, Hans-Peter Kriegel, Peer Kröger,
Ina Müller-Gorman, and Arthur Zimek

Outlier Detection
Correlation-Based Detection of Attribute Outliers . . . . . . . . . . . . . . . . . . . . 164
Judice L.Y. Koh, Mong Li Lee, Wynne Hsu, and Kai Tak Lam
An Efficient Histogram Method for Outlier Detection . . . . . . . . . . . . . . . . . 176
Matthew Gebski and Raymond K. Wong

Privacy Preserving Data Mining


Efficient k-Anonymization Using Clustering Techniques . . . . . . . . . . . . . . . 188
Ji-Won Byun, Ashish Kamra, Elisa Bertino, and Ninghui Li
Privacy Preserving Data Mining of Sequential Patterns for Network
Traffic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Seung-Woo Kim, Sanghyun Park, Jung-Im Won, and
Sang-Wook Kim
Privacy Preserving Clustering for Multi-party . . . . . . . . . . . . . . . . . . . . . . . 213
Weijia Yang and Shangteng Huang
Privacy-Preserving Frequent Pattern Sharing . . . . . . . . . . . . . . . . . . . . . . . . 225
Zhihui Wang, Wei Wang, Baile Shi, and S.H. Boey

Parallel and Distributed Databases


Kn Best - A Balanced Request Allocation Method for Distributed
Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Jorge-Arnulfo Quiané-Ruiz, Philippe Lamarre, and Patrick Valduriez
The Circular Two-Phase Commit Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Heine Kolltveit and Svein-Olaf Hvasshovd
Towards Timely ACID Transactions in DBMS . . . . . . . . . . . . . . . . . . . . . . . 262
Marco Vieira, António C. Costa, and Henrique Madeira

Data Warehouse
BioDIFF: An Effective Fast Change Detection Algorithm for Biological
Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Yang Song, Sourav S. Bhowmick, and C. Forbes Dewey Jr.
An Efficient Implementation for MOLAP Basic Data Structure and Its
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
K.M. Azharul Hasan, Tatsuo Tsuji, and Ken Higuchi

Information Retrieval
Monitoring Heterogeneous Nearest Neighbors for Moving Objects
Considering Location-Independent Attributes . . . . . . . . . . . . . . . . . . . . . . . . 300
Yu-Chi Su, Yi-Hung Wu, and Arbee L.P. Chen
Similarity Joins of Text with Incomplete Information Formats . . . . . . . . . 313
Shaoxu Song and Lei Chen
Self-tuning in Graph-Based Reference Disambiguation . . . . . . . . . . . . . . . . 325
Rabia Nuray-Turan, Dmitri V. Kalashnikov, and Sharad Mehrotra
Probabilistic Nearest-Neighbor Query on Uncertain Objects . . . . . . . . . . . 337
Hans-Peter Kriegel, Peter Kunath, and Matthias Renz

Indexing and Caching Databases


Making the Most of Cache Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Andreas Bühmann and Theo Härder
Construction of Tree-Based Indexes for Level-Contiguous Buffering
Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
Tomáš Skopal, David Hoksza, and Jaroslav Pokorný
A Workload-Driven Unit of Cache Replacement for Mid-Tier Database
Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
Xiaodan Wang, Tanu Malik, Randal Burns,
Stratos Papadomanolakis, and Anastassia Ailamaki
J+-Tree: A New Index Structure in Main Memory . . . . . . . . . . . . . . . . . . . . . 386
Hua Luan, Xiaoyong Du, Shan Wang, Yongzhi Ni, and Qiming Chen
CST-Trees: Cache Sensitive T-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
Ig-hoon Lee, Junho Shim, Sang-goo Lee, and Jonghoon Chun

Security and Integrity Maintenance


Specifying Access Control Policies on Data Streams . . . . . . . . . . . . . . . . . . 410
Barbara Carminati, Elena Ferrari, and Kian Lee Tan
Protecting Individual Information Against Inference Attacks in Data
Publishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
Chen Li, Houtan Shirani-Mehr, and Xiaochun Yang

Quality Aware Privacy Protection for Location-Based Services . . . . . . . . . 434


Zhen Xiao, Xiaofeng Meng, and Jianliang Xu

Implementation of Bitmap Based Incognito and Performance
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Hyun-Ho Kang, Jae-Myung Kim, Gap-Joo Na, and Sang-Won Lee
Prioritized Active Integrity Constraints for Database Maintenance . . . . . 459
Luciano Caroprese, Sergio Greco, and Cristian Molinaro

Image and Ontology-Based Databases


Using Redundant Bit Vectors for Near-Duplicate Image Detection . . . . . . 472
Jun Jie Foo and Ranjan Sinha
OLYBIA: Ontology-Based Automatic Image Annotation System Using
Semantic Inference Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
Kyung-Wook Park, Jin-Woo Jeong, and Dong-Ho Lee
OntoDB: An Ontology-Based Database for Data Intensive
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Hondjack Dehainsala, Guy Pierra, and Ladjel Bellatreche

Sensor and Scientific Database Applications


Continuously Maintaining Sliding Window Skylines in a Sensor
Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
Junchang Xin, Guoren Wang, Lei Chen, Xiaoyi Zhang, and
Zhenhua Wang
Bayesian Reasoning for Sensor Group-Queries and Diagnosis . . . . . . . . . . 522
Ankur Jain, Edward Y. Chang, and Yuan-Fang Wang

Telescope: Zooming to Interesting Skylines . . . . . . . . . . . . . . . . . . . . . . . . . . 539


Jongwuk Lee, Gae-won You, and Seung-won Hwang
Eliciting Matters – Controlling Skyline Sizes by Incremental Integration
of User Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
Wolf-Tilo Balke, Ulrich Güntzer, and Christoph Lofi

Mobile Databases
Optimizing Moving Queries over Moving Object Data Streams . . . . . . . . . 563
Dan Lin, Bin Cui, and Dongqing Yang
MIME: A Dynamic Index Scheme for Multi-dimensional Query in
Mobile P2P Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
Ping Wang, Lidan Shou, Gang Chen, and Jinxiang Dong

Temporal and Spatial Databases


Interval-Focused Similarity Search in Time Series Databases . . . . . . . . . . . 586
Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath,
Alexey Pryakhin, and Matthias Renz
Adaptive Distance Measurement for Time Series Databases . . . . . . . . . . . 598
Van M. Chhieng and Raymond K. Wong

Clustering Moving Objects in Spatial Networks . . . . . . . . . . . . . . . . . . . . . . 611


Jidong Chen, Caifeng Lai, Xiaofeng Meng, Jianliang Xu, and
Haibo Hu

Data Streams
The Tornado Model: Uncertainty Model for Continuously Changing
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Byunggu Yu, Seon Ho Kim, Shayma Alkobaisi, Wan D. Bae, and
Thomas Bailey
ClusterSheddy: Load Shedding Using Moving Clusters over
Spatio-temporal Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
Rimma V. Nehme and Elke A. Rundensteiner
Evaluating MAX and MIN over Sliding Windows with Various Size
Using the Exemplary Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
Jiakui Zhao, Dongqing Yang, Bin Cui, Lijun Chen, and Jun Gao
CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets
Mining over Stream Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng,
Yunfeng Liu, and Kunqing Xie

P2P and Grid-Based Data Management


Capture Inference Attacks for K-Anonymity with Privacy Inference
Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
Xiaojun Ye, Zude Li, and Yongnian Li
Schema Mapping in P2P Networks Based on Classification and
Probing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
Guoliang Li, Beng Chin Ooi, Bei Yu, and Lizhu Zhou
ABIDE: A Bid-Based Economic Incentive Model for Enticing
Non-cooperative Peers in Mobile-P2P Networks . . . . . . . . . . . . . . . . . . . . . . 703
Anirban Mondal, Sanjay Kumar Madria, and Masaru Kitsuregawa

XML Databases
An Efficient Encoding and Labeling for Dynamic XML Data . . . . . . . . . . 715
Jun-Ki Min, Jihyun Lee, and Chin-Wan Chung
An Original Semantics to Keyword Queries for XML Using Structural
Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
Dimitri Theodoratos and Xiaoying Wu
Lightweight Model Bases and Table-Driven Modeling . . . . . . . . . . . . . . . . . 740
Hung-chih Yang and D. Stott Parker

XML Indexing
An Efficient Index Lattice for XML Query Evaluation . . . . . . . . . . . . . . . . 753
Wilfred Ng and James Cheng

A Development of Hash-Lookup Trees to Support Querying Streaming
XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
James Cheng and Wilfred Ng
Efficient Integration of Structure Indexes of XML . . . . . . . . . . . . . . . . . . . . 781
Taro L. Saito and Shinichi Morishita
Efficient Support for Ordered XPath Processing in Tree-Unaware
Commercial Relational Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
Boon-Siew Seah, Klarinda G. Widjanarko, Sourav S. Bhowmick,
Byron Choi, and Erwin Leonardi

XML Query Processing


On Label Stream Partition for Efficient Holistic Twig Join . . . . . . . . . . . . 807
Bo Chen, Tok Wang Ling, M. Tamer Özsu, and Zhenzhou Zhu

Efficient XML Query Processing in RDBMS Using GUI-Driven
Prefetching in a Single-User Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
Sandeep Prakash, Sourav S. Bhowmick,
Klarinda G. Widjanarko, and C. Forbes Dewey Jr.
Efficient Holistic Twig Joins in Leaf-to-Root Combining with
Root-to-Leaf Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
Guoliang Li, Jianhua Feng, Yong Zhang, and Lizhu Zhou
TwigList: Make Twig Pattern Matching Fast . . . . . . . . . . . . . . . . . . . . . . . . 850
Lu Qin, Jeffrey Xu Yu, and Bolin Ding

Part II: Short Papers


Query Language and Query Optimization
CircularTrip: An Effective Algorithm for Continuous kNN Queries . . . . . 863
Muhammad Aamir Cheema, Yidong Yuan, and Xuemin Lin
Optimizing Multiple In-Network Aggregate Queries in Wireless Sensor
Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
Huei-You Yang, Wen-Chih Peng, and Chia-Hao Lo
Visible Nearest Neighbor Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
Sarana Nutanong, Egemen Tanin, and Rui Zhang
On Query Processing Considering Energy Consumption for Broadcast
Database Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
Shinya Kitajima, Jing Cai, Tsutomu Terada, Takahiro Hara, and
Shojiro Nishio

Data Mining and Knowledge Discovery


Mining Vague Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
An Lu, Yiping Ke, James Cheng, and Wilfred Ng

An Optimized Process Neural Network Model . . . . . . . . . . . . . . . . . . . . . . . 898


Guojie Song, Dongqing Yang, Yunfeng Liu, Bin Cui, Ling Wu, and
Kunqing Xie
Clustering XML Documents Based on Structural Similarity . . . . . . . . . . . 905
Guangming Xing, Zhonghang Xia, and Jinhua Guo
The Multi-view Information Bottleneck Clustering . . . . . . . . . . . . . . . . . . . 912
Yan Gao, Shiwen Gu, Jianhua Li, and Zhining Liao

Web and Information Retrieval


Web Service Composition Based on Message Schema Analysis . . . . . . . . . 918
Aiqiang Gao, Dongqing Yang, and Shiwei Tang
SQORE: A Framework for Semantic Query Based Ontology Retrieval . . . 924
Chutiporn Anutariya, Rachanee Ungrangsi, and Vilas Wuwongse
Graph Structure of the Korea Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 930
In Kyu Han, Sang Ho Lee, and Soowon Lee
EasyQuerier: A Keyword Based Interface for Web Database Integration
System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 936
Xian Li, Weiyi Meng, and Xiaofeng Meng

Database Applications and Security


Anomalies Detection in Mobile Network Management Data . . . . . . . . . . . . 943
Marco Anisetti, Claudio A. Ardagna, Valerio Bellandi,
Elisa Bernardoni, Ernesto Damiani, and Salvatore Reale
Security-Conscious XML Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
Yan Xiao, Bo Luo, and Dongwon Lee
Framework for Extending RFID Events with Business Rule . . . . . . . . . . . 955
Mikyeong Moon, Seongjin Kim, Keunhyuk Yeom, and Heeseok Choi

Ontology and Data Streams


Approximate Similarity Search over Multiple Stream Time Series . . . . . . 962
Xiang Lian, Lei Chen, and Bin Wang
WT-Heuristics: A Heuristic Method for Efficient Operator Ordering . . . . 969
Jun-Ki Min
An Efficient and Scalable Management of Ontology . . . . . . . . . . . . . . . . . . . 975
Myung-Jae Park, Jihyun Lee, Chun-Hee Lee, Jiexi Lin,
Olivier Serres, and Chin-Wan Chung
Estimating Missing Data in Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
Nan Jiang and Le Gruenwald

XML Databases
AB-Index: An Efficient Adaptive Index for Branching XML Queries . . . . 988
Bo Zhang, Wei Wang, Xiaoling Wang, and Aoying Zhou

Semantic XPath Query Transformation: Opportunities and
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 994
Dung Xuan Thi Le, Stephane Bressan, David Taniar, and
Wenny Rahayu
TGV: A Tree Graph View for Modeling Untyped XQuery . . . . . . . . . . . . . 1001
Nicolas Travers, Tuyêt Trâm Dang Ngoc, and Tianxiao Liu
Indexing Textual XML in P2P Networks Using Distributed Bloom
Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007
Clement Jamard, Georges Gardarin, and Laurent Yeh
Towards Adaptive Information Merging Using Selected XML
Fragments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
Ho-Lam Lau and Wilfred Ng

Part III: Posters


Data Warehouse and Data Mining
LAPIN: Effective Sequential Pattern Mining Algorithms by Last
Position Induction for Dense Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1020
Zhenglu Yang, Yitong Wang, and Masaru Kitsuregawa
Spatial Clustering Based on Moving Distance in the Presence of
Obstacles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024
Sang-Ho Park, Ju-Hong Lee, and Deok-Hwan Kim
Tracing Data Transformations: A Preliminary Report . . . . . . . . . . . . . . . . . 1028
Gang Qian and Yisheng Dong

Query Processing
QuickCN: A Combined Approach for Efficient Keyword Search over
Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032
Jun Zhang, Zhaohui Peng, and Shan Wang
Adaptive Join Query Processing in Data Grids: Exploring Relation
Partial Replicas and Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036
Donghua Yang, Jianzhong Li, and Hong Gao
Efficient Semantically Equal Join on Strings . . . . . . . . . . . . . . . . . . . . . . . . . 1041
Juggapong Natwichai, Xingzhi Sun, and Maria E. Orlowska

Database Modeling and Information Retrieval


Integrating Similarity Retrieval and Skyline Exploration Via Relevance
Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1045
Yiming Ma and Sharad Mehrotra
An Image-Semantic Ontological Framework for Large Image
Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1050
Xiaoyan Li, Lidan Shou, Gang Chen, and Kian-Lee Tan

Flexible Selection of Wavelet Coefficients for Continuous Data Stream
Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
Jaehoon Kim and Seog Park
Versioned Relations: Support for Conditional Schema Changes and
Schema Versioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
Peter Sune Jørgensen and Michael Böhlen

Network and XML Databases


Compatibility Analysis and Mediation-Aided Composition for BPEL
Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1062
Wei Tan, Fangyan Rao, Yushun Fan, and Jun Zhu
Expert Finding in a Social Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1066
Jing Zhang, Jie Tang, and Juanzi Li
Efficient Reasoning About XFDs with Pre-image Semantics . . . . . . . . . . . 1070
Sven Hartmann, Sebastian Link, and Thu Trinh

Part IV: Industrial Track


Context RBAC/MAC Access Control for Ubiquitous Environment . . . . . 1075
Kyu Il Kim, Hyuk Jin Ko, Hyun Sik Hwang, and Ung Mo Kim
Extending PostgreSQL to Support Distributed/Heterogeneous Query
Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1086
Rubao Lee and Minghong Zhou
Geo-WDBMS: An Improved DBMS with the Function of Watermarking
Geographical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1098
Min Huang, Xiang Zhou, Jiaheng Cao, and Zhiyong Peng

Part V: Demonstrations Track


TinTO: A Tool for the View-Based Analysis of Streams of Stock
Market Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110
Andreas Behrend, Christian Dorau, and Rainer Manthey
Danaïdes: Continuous and Progressive Complex Queries on RSS
Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115
Wee Hyong Tok, Stéphane Bressan, and Mong-Li Lee
OntoDB: It Is Time to Embed Your Domain Ontology in Your
Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1119
Stéphane Jean, Hondjack Dehainsala, Dung Nguyen Xuan,
Guy Pierra, Ladjel Bellatreche, and Yamine Aït-Ameur

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1123


‘Socio Sense’ and ‘Cyber Infrastructure for
Information Explosion Era’: Projects in Japan

Masaru Kitsuregawa

University of Tokyo
[email protected]

This talk introduces some of the large projects in Japan for which I serve as PI. MEXT (the Ministry of Education, Culture, Sports, Science and Technology) approved a new project named ‘Cyber Infrastructure for Information Explosion Era’ in 2005. The year 2005 was a preparation stage, during which we solicited research proposals under this program; in total, seventy-four research teams were accepted. The project effectively started in April 2006. It is the largest IT-related project in the category of Grant-in-Aid for Scientific Research on Priority Areas, with a budget of around 5 million dollars for 2006, and it is expected to continue until FY2010.
The amount of information created by people and generated by sensors and computers has been increasing explosively in recent years, and the growth rate of web content is especially high. People no longer ask their friends when they want to know something; they use a search engine instead, and they have become heavily dependent on the web. Knowledge workers spend a great deal of their time just on ‘search’, and the more information is generated, the harder it becomes to locate the appropriate information. In order to achieve higher-quality search, we are currently developing an open next-generation search engine that incorporates deep NLP capabilities. By ‘deep’ we mean that we devote more machine power to the analysis of web content; in other words, we do not care about response time, since the current one-second response time is driven by the advertisement-based monetization scheme. We believe we should provide a service that goes beyond ordinary search. In addition to the web, there is yet another information explosion in the area of so-called e-science. Through the introduction of very powerful supercomputers and various kinds of advanced sensor systems, science is becoming very data intensive, and we plan to build tools for scientific discovery over this sea of data. Another area is health care: large numbers of patient health care records are now being stored digitally, and monitoring human activities with sensors and mining the archived health care records (HCR) are typical data-driven applications.
The explosion of information causes problems not only in search but also in computer system management. A lot of information means a lot of applications, which place heavy stress on the system, and the cost of maintaining systems keeps increasing. Self-monitoring of system activities also generates a huge amount of information; BAM is one typical higher-level example. We are now building a large-scale distributed cluster test bed across Japan, which is a shared platform for next-generation system software development.
Human interaction is also a very important research issue, since all the information finally has to be absorbed by people. A highly functional room for capturing human interaction is being developed: various kinds of sensors are installed, and eight video
cameras capture the interaction process synchronously from different angles. Building this interaction corpus will be important for research on multimodal analysis.
Thus the information explosion project covers almost all areas of computer science. More than 200 researchers are now participating.
The Socio Sense project will also be introduced. People are spending more and more time in the cyber world, in addition to the real world. Most important events are immediately reflected in the cyber world, which means that we can capture the activities of the real world through the cyber world, and that information can be crunched by information technology. The cyber world can thus be regarded as a SENSOR for the real world: by viewing its evolution, we can interpret various interesting social activities. The Socio Sense system is not a search engine but a kind of tool for observing societal behavior. This project is also supported by MEXT.
METI is going to start the ‘Grand Information Voyage’ project in April 2007. Several national projects on the information explosion have started, or are starting, in Japan, and we are considering the possibilities of international collaboration.
Information explosion project:
http://itkaken.ex.nii.ac.jp/i-explosion/ctr.php/m/IndexEng/a/Index/
Consortium for Grand Information Voyage project:
http://www.jyouhoudaikoukai-consortium.jp/
Is (Your) Database Research Having Impact?

Guy M. Lohman

IBM Almaden Research Center


650 Harry Rd., San Jose, CA 95120
[email protected]

Is your research having real impact? The ultimate test of the research done by this
community is how it impacts society. Perhaps the most important metric of this
impact is acceptance in the marketplace, i.e. incorporation into products that bring
value to the purchaser. Merely publishing papers and getting them referenced has no
intrinsic value unless the ideas therein are eventually used by someone. So let us ask
ourselves candidly – is (my) database research having (positive) impact? Concisely:
Are they buying my stuff? Have the “hot topics” of the past withstood the test of time
by actually being used in products that sold? If so, what characteristics were
instrumental in their success? And if not, why did something that got so many people
excited fail to gain traction with users? Perhaps more importantly, what can we learn
from our track record of the past in order to have better impact in the future? How
can we better serve our user community by solving their real problems, not the ones
we may imagine?
Let us first critique our historical track record as a community. Over the last thirty
years, a few major topics seem to have dominated the interest of the research
community in databases. Waves of “hot topics” appear to rise to predominance, in
terms of the number of papers submitted (and hence published), and after a few years
of excitement get replaced by another topic. Not that these waves exclude other
topics or are cleanly delineated – they simply seem to coincidentally interest a large
proportion of our community. I will not attempt to justify this premise with statistics
on topics; it’s just an observation that many experienced researchers recognize. The
first of these with which I’m familiar was relational databases, themselves, which
captivated the attention of database researchers in the last half of the 1970s, resulting
in major prototypes such as System R, Ingres, and others that formed the foundation
of products in the early 1980s. Distributed databases seemed to dominate the early
1980s, but this thread rapidly evolved into separate threads on parallel databases and
the integration of disjoint (and often heterogeneous) databases, usually called
federated databases. In the late 1980s and early 1990s, object-oriented databases
attempted to address the requirements of some under-served applications, and the
relational crowd fought back by creating “extensible” databases with “object-
relational” extensions to meet the OODBMS challenge. About the same time, interest
in Datalog created strong interest in deductive databases. The mid- and late-1990s
saw the birth and explosion of interest in data warehousing and data mining,
eventually spawning a whole new research community in knowledge discovery.
Around 1999, standardization of XML rocketed XML databases into the forefront.
The early- to mid-2000s have seen great interest in streams and sensor databases.
And along the way, numerous other variations on these themes have enjoyed the
spotlight for a while: database machines, temporal databases, multi-media and spatial
databases, scientific and statistical databases, active databases, semantic databases
and knowledge bases, and a recent favorite of mine that has yet to gain much interest
from academia – self-managing databases. To what extent has each of these topics
successfully impacted the marketplace, and why? We must learn from our successes
and failures by carefully examining why the market accepted or rejected our
technology.
My assessment is that our success has depended upon the consumability of our
technology: how well it meets a customer need, how simple it is to understand and
use, and how well standardization has stabilized its acceptance across vendors.
Relational technology succeeded and has grown spectacularly to become a U.S. $14
Billion industry in 2004 largely because it was simpler and easier to understand than
its predecessors, with a declarative query language (SQL) that simplified application
development, and was standardized early in its (product) evolution. However,
attempts to “augment” it with object-relational, temporal, and deductive extensions
have been either: (a) too complicated, (b) insufficiently vital to most consumers’
applications, and/or (c) not standardized or standardized too late in their evolution.
Parallel databases exploited increasingly inexpensive hardware to facilitate growth
and performance requirements with generally acceptable increases in complexity
(mostly in administration, not querying), whereas federated databases have seen less
success because the complexity of integrating diverse data sources largely fell on the
user. Data mining, while a genuine success in the research community, evoked a
comparative yawn in the marketplace largely because users needed to understand it to
use it, and they had difficulty understanding it because of its novelty and
mathematical intricacies. The jury is still out on XML databases, but my fear is that,
despite the need for storing increasing volumes of XML data, XQuery is far more
complicated than SQL. Similarly, stream databases are too new to be judged
adequately, but I question the market size and whether the research in the database
community adequately suits the “lean and mean” real-time requirements of the
primary market – the investment and banking industries.
How then should we increase the impact of our research in the future? First, we
must candidly assess our strengths and weaknesses. Our strengths lie in modeling the
semantics underlying information, enabling better precision in our queries than the
keyword search upon which Information Retrieval and the popular search engines are
based. We have much to offer the IR and search communities here, and they have
recognized this by aggressively hiring from the database community in the last few
years. Our models also permit reasoning about the data through complex OLAP-style
queries to extract actionable information from a sea of data. We know how to
optimize a declarative language, and how to exploit massive parallelism, far better
than any other discipline. Our primary weakness is in simplicity / usability,
particularly in setting up and administering databases. This is certainly exacerbated by
database researchers not gaining firsthand experience by routinely using databases to
store their own data. Secondly, we must reach out to other disciplines with
complementary strengths, and learn from them. Despite the lack of precision of
keyword search, why is it vastly preferred over SQL? Third, we must engage with
real users (which should include ourselves) and listen carefully to what they say.
Have you ever tried to query or manage a non-trivial database of at least 500 tables
that was not constructed by you? Have you ever tried to add disks or nodes to an
existing database that exceeded its initial space allocation? Have you ever built and
administered a real application using a database? Did your research remedy any of
the pain points you encountered or heard from a user? Fourth, we must go back to
basics and design our systems based upon user requirements, not upon what
technology we understand or want to develop.
Pursuing the fourth item in greater detail, we should honestly ask ourselves why
less than 20% of the world’s data is stored in databases. Weren’t object-relational
extensions supposed to rectify this by enabling storage of unstructured and semi-
structured data, as well as structured data? Currently, users rely upon content
managers to manage this unstructured and semi-structured content. Though content
managers are built upon relational DBMSs, the content is stored in files, so isn’t
easily searched, and the search interface isn’t SQL. This certainly isn’t what users
want. Users want a single, uniform interface to all their data, particularly for
searching. Increasingly, they recognize that the majority of their costs are for people
and their skills, as hardware costs are driven downward by Moore’s Law. So lowering
the Total Cost of Ownership (TCO) requires systems that are easier to manage and
require fewer skilled people to manage. Users also want a scalable solution that
permits easily adding more capacity to either the storage or the computing power in
an incremental fashion as their needs for information management increase. The
increasing requirements for compliance with government regulations, as well as
business imperatives to extract more value out of information already collected in
diverse application “silos”, are driving their need to integrate systems never designed
to interact with other systems, and to be able to more pro-actively and quickly derive
business intelligence than with today’s data warehouses. Ultimately, users want to be
able to quickly and easily find, integrate, and aggregate the data that they need to
make business decisions. But that data is currently scattered throughout their
enterprise in a staggering array of incompatible systems, in a daunting tangle of
differing formats. The usual lament is that they know the data is out there somewhere,
but they can’t find it.
Clearly there are plenty of hard research problems – as well as business
opportunities! – in all of these requirements! We simply have to listen and be willing
to change our research agendas to the problems that matter most to our “customers”.
And focusing on customer pain points doesn’t preclude attempting risky, imaginative,
cool, technically advanced, and occasionally far-out technical approaches. In fact,
problems having origins in reality tend to be the most challenging. Only by doing so
will our research withstand the test of time in the marketplace of ideas, and truly have
the impact we all want for our work.
Improving Quality and Convergence of Genetic
Query Optimizers

Victor Muntés-Mulero1, Néstor Lafón-Gracia1, Josep Aguilar-Saborit2, and
Josep-L. Larriba-Pey1
1
DAMA-UPC, Computer Architecture Dept., Universitat Politècnica de Catalunya,
Campus Nord UPC, C/Jordi Girona Módul D6 Despatx 117 08034 Barcelona, Spain
{vmuntes,nlafon,larri}@ac.upc.edu,
http://www.dama.upc.edu
2
IBM Canada Ltd. IBM Toronto Lab. 8200 Warden Ave. Markham, Ontario.
Canada L6G1C7
[email protected]

Abstract. The application of genetic programming strategies to query
optimization has been proposed as a feasible way to solve the large join
query problem. However, previous literature shows that the potentiality
of evolutionary strategies has not been completely exploited in terms of
convergence and quality of the returned query execution plans (QEP).
In this paper, we propose two alternatives to improve the performance
of a genetic optimizer and the quality of the resulting QEPs. First, we
present a new method called Weighted Election that proposes a criterion
to choose the QEPs to be crossed and mutated during the optimization
time. Second, we show that the use of heuristics in order to create the
initial population benefits the speed of convergence and the quality of the
results. Moreover, we show that the combination of both proposals out-
performs previous randomized algorithms, in the best cases, by several
orders of magnitude for very large join queries.

1 Introduction

Query optimization based on evolutionary approaches is still an intriguing alter-
native to solve the very large join query problem. Advanced applications such
as SAP or those involving information integration often need to combine a large
set of tables to reconstruct complex business objects. For instance, the SAP
schema may contain more than 10,000 relations [6] and may join more than 20
of these in a single SQL query. As the number of relations involved in a SQL
statement increases, traditional optimizers, which are usually based on dynamic
programming techniques [13], fail to perform satisfactorily. The main problem

Research supported by the IBM Toronto Lab Centre for Advanced Studies and UPC
Barcelona. The authors from DAMA-UPC want to thank Generalitat de Catalunya
for its support through grant number GRE-00352 and Ministerio de Educación y
Ciencia of Spain for its support through grant TIN2006-15536-C02-02.


lies in the size of the search space, which grows exponentially with the increase
of the number of relations involved in the query. In this scenario, users, or even
the DBMS [20], are usually forced to split the query into smaller subqueries in
order to optimize it, obtaining QEPs that are typically far from the optimum.
Genetic approaches have proven to be a good alternative since they are in-
cluded among the best randomized approaches in terms of quality and speed of
convergence [15]. However, there are still important aspects to be studied in order
to improve the performance of genetic approaches. On the one hand, evolution-
ary algorithms perform a beam search, based on the evolution of a population,
instead of focusing on the evolution of a single individual [1], as opposed to
random-walk algorithms like iterative improvement or simulated annealing. Al-
though this can be beneficial in terms of quality, it may jeopardize the ability
of the optimizer to converge quickly. On the other hand, recent studies show, by
means of a statistical model, that the random effects of the initial population
cannot be neglected, since they have a significant impact on the quality of the
returned QEP after the optimization process [11]. In other words, depending
on the small sample of QEPs created at random for the initial population, the
genetic optimizer will experience difficulties to find a near optimal QEP. This is
aggravated by the fact that the search space grows exponentially as the number
of relations increases, which implies that the size of the initial population should
also grow exponentially.
In order to remedy these two drawbacks, we propose two different approaches.
We call our first proposal Weighted Election (WE) and it tackles the problem
of the speed of convergence mentioned above. In all the traditional evolution-
ary algorithms, the members of the population chosen to be crossed with other
members or mutated are chosen at random. WE proposes a new approach where
the QEPs are chosen with a probability depending on their associated cost,
giving low-cost plans a higher probability of being chosen than high-cost
plans. Our second approach is aimed at reducing the variability in
the quality of the results, introduced by the random effects of the initial popu-
lation, by using heuristics to assure that the first sample of QEPs is not blindly
chosen from the search space, but it follows a minimum quality criterion. We
call this approach Heuristic Initial Population (HIP).
Finally, we show that the combination of both approaches is beneficial. Specifi-
cally, we compare our new approach with the Two-Phase Optimization algorithm
(2PO) [4], which is considered to be the best randomized algorithm presented in
the literature. We show that our techniques significantly improve a genetic opti-
mizer and, in addition, are more suitable than previous randomized techniques
for very large join queries.
This paper is organized as follows. Section 2 introduces genetic optimization
and the genetic optimizer used in this work. Section 3 and 4 describe our pro-
posals in detail. In Section 5, we present the results obtained by the comparison
of the different algorithms. Finally, in Sections 6 and 7, we present related work
and conclude.

2 The Carquinyoli Genetic Optimizer (CGO)


The Carquinyoli Genetic Optimizer (CGO) is, to the best of our knowledge,
the most sophisticated genetic approach presented in the literature and the first
genetic optimizer tested against a well-known commercial optimizer [8]. For this
reason we use CGO as the baseline of our work.
CGO is based on genetic programming. Inspired by the principles of nat-
ural selection, the basic idea of genetic programming is, given an initial set of
programs, generally called members of an initial population, to perform a set
of operations in order to get a well-fitted program able to solve a specific task.
Each member or program in the population represents a way to achieve a specific
objective and has an associated cost.
Starting with this initial population, usually created from scratch, two oper-
ations are used to produce new members in the population: (i) crossover op-
erations, which combine properties of two members in the population chosen
at random, and (ii) mutation operations, which introduce new properties into
a randomly chosen member in the population. In order to keep the size of the
population constant, a third operation, usually referred to as selection, is used
to discard the worst fitted members, using a fitness function. This process gen-
erates a new population, also called generation, that includes both the old and
the new members that have survived to the selection operation. This is repeated
iteratively until a stop condition ends the execution. Once the stop criterion is
met, the best solution is taken from the final population. Query optimization
can be reduced to a search problem where the DBMS needs to find the optimum
query execution plan (QEP) in a vast search space. Each QEP can be consid-
ered as a possible solution (or program) for the problem of finding a good access
path to retrieve the required data. Therefore, in a genetic query optimizer, every
member of the population is a QEP. Further details of CGO can be found in [8].
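
To make the generational cycle described above concrete, the following minimal C++ sketch shows how a population of QEPs can evolve through crossover, mutation and selection until a stop condition is met. The QEP type and the three plan operators are hypothetical placeholders (a real optimizer such as CGO manipulates full plan trees); only the structure of the loop is meant to be illustrative.

#include <algorithm>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a query execution plan; only its cost matters here.
struct QEP { double cost; };

// Placeholder operators: a real optimizer would manipulate join trees instead.
QEP randomPlan()                          { return { 1.0 + std::rand() % 1000 }; }
QEP crossover(const QEP& a, const QEP& b) { return { (a.cost + b.cost) / 2.0 }; }
QEP mutate(const QEP& p)                  { return { p.cost * (0.5 + (std::rand() % 100) / 100.0) }; }

std::vector<QEP> evolve(int popSize, int nCross, int nMut, int generations) {
    std::vector<QEP> pop;
    for (int i = 0; i < popSize; ++i) pop.push_back(randomPlan());
    for (int g = 0; g < generations; ++g) {            // stop condition: fixed number of generations
        std::vector<QEP> offspring;
        for (int i = 0; i < nCross; ++i)               // crossover: combine two random members
            offspring.push_back(crossover(pop[std::rand() % pop.size()],
                                          pop[std::rand() % pop.size()]));
        for (int i = 0; i < nMut; ++i)                 // mutation: perturb a random member
            offspring.push_back(mutate(pop[std::rand() % pop.size()]));
        pop.insert(pop.end(), offspring.begin(), offspring.end());
        std::sort(pop.begin(), pop.end(),              // selection: keep the best-fitted plans
                  [](const QEP& x, const QEP& y) { return x.cost < y.cost; });
        pop.resize(popSize);
    }
    return pop;                                        // the best surviving plan is pop.front()
}

The selection step here simply keeps the popSize cheapest plans, which corresponds to discarding the worst-fitted members with a fitness function as described above.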

3 Weighted Election (WE)


Among the randomized algorithms, two different classes of algorithms have been
applied to query optimization. On the one hand, we have the random-walk based
algorithms, typically represented by iterative improvement and simulated anneal-
ing and all the improvements and combinations of these two, such as 2PO. On
the other hand, there are proposals in the literature for the use of evolution-
ary techniques as an alternative way to achieve a near-optimal QEP. There is a
fundamental difference between both alternatives: the philosophy of the search
space is different. While random-walk algorithms rely on a single individual
(QEP) and a sequence of transformations on this individual, evolutionary al-
gorithms apply the transformations on a population. As a consequence, while
genetic approaches keep more information than random-walk algorithms, which
may lead the optimizer to find better-costed QEPs, they might experience a
lower speed of convergence. This is sustained by the fact that they do not only
keep the best QEP, but they spend some time optimizing QEPs that are not
close to the optimal plan.

In order to mitigate this drawback of evolutionary strategies, we propose and


analyze a new technique called Weighted Election (WE). This technique aims
at directing the search towards the directions marked by the best QEPs in the
population, by giving more opportunities to these QEPs to be crossed and mu-
tated than QEPs with higher costs. Note that high-costed QEPs still have an
opportunity to participate in the genetic operations performed by the optimizer,
although the probability is lower.
In order to assign the weight of a QEP p in the population Pop, denoted by
Wp, we use the following formula:

    Wp = max( ((μ1/2 − Cp) / (μ1/2 − BPop)) · α , 1 )                         (1)

where Cp is the cost associated with QEP p, μ1/2 is the median cost in the
population and BPop is the best cost in the population. Note that Wp ranges
from 1 to α, where α > 1. Specifically, QEPs with costs lower than the median
are assigned a weight from 1 to α, while QEPs with costs higher than the median
are assigned a weight of 1. Depending on the value of α we can give more or less
importance to the differences between the costs of the QEPs in the population.
For example, for α = 2 and α = 100, the probability of the QEP with the lowest
cost in the population to be chosen is 2 and 100 times the probability of the
highest-costed QEP, respectively.
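
As an illustration, the following sketch (assuming a hypothetical QEP type that only carries its cost) computes the weights of formula (1) over the current population and performs a weight-proportional (roulette-wheel) election of one member; it is a minimal interpretation of WE, not CGO's actual implementation.

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct QEP { double cost; };

// Weight of a plan following formula (1): plans cheaper than the median get a
// weight between 1 and alpha, all other plans get weight 1.
double weight(double cost, double median, double best, double alpha) {
    if (median == best) return 1.0;                      // degenerate population
    return std::max((median - cost) / (median - best) * alpha, 1.0);
}

// Weighted (roulette-wheel) election of one population index.
std::size_t weightedElect(const std::vector<QEP>& pop, double alpha) {
    std::vector<double> costs;
    for (const QEP& p : pop) costs.push_back(p.cost);
    std::sort(costs.begin(), costs.end());
    double best   = costs.front();
    double median = costs[costs.size() / 2];

    std::vector<double> w(pop.size());
    double total = 0.0;
    for (std::size_t i = 0; i < pop.size(); ++i)
        total += w[i] = weight(pop[i].cost, median, best, alpha);

    double r = (std::rand() / (double)RAND_MAX) * total; // spin the wheel
    for (std::size_t i = 0; i < pop.size(); ++i) {
        if (r < w[i]) return i;
        r -= w[i];
    }
    return pop.size() - 1;                               // numerical fallback
}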

4 Heuristic Initial Population (HIP)


The quality of the initial population can be decisive in order to obtain near-
optimal QEPs. Unfortunately, since the initial population is usually created at
random [15], its effect on the quality of the results is unpredictable. Our proposal
ensures that the quality of the initial population is higher than that of a randomly
created population, using heuristics to create part of the plans in it.
Several heuristic algorithms have been proposed in the literature aiming at
solving the query optimization problem. Representatives of this class of algo-
rithms are the KBZ algorithm [7], the AB algorithm [19], the Augmentation
algorithm (AG), and other greedy algorithms [14,15,17].
Because of its working principle, the KBZ algorithm requires the assignment of
join implementations to join graph edges before the optimization is carried out.
This requirement and the restrictions concerning the cost model do not allow the
algorithm to approximate the real solution, when it deals with a sophisticated
and detailed cost model [15]. AB was developed in order to solve the restrictions
imposed by KBZ on the join implementation placement. However, even with the
AB extension it is difficult to make use of a complex cost model.
The Augmentation algorithm (AG) is an incremental heuristic method to
build QEPs. Specifically, 5 different criteria are studied, namely, choosing the
relation with minimum cardinality, choosing the relation participating in the
largest number of joins, choosing the joins with minimum selectivity, choosing
an operation using the combination of the first and the third criteria and, finally,

using the so-called KBZ rank, related to the KBZ algorithm. Among the five
criteria, the minimum selectivity criterion turned out to be the most efficient
and, for this reason, it is the one selected for this work. Depending on the relation
chosen to start the optimization process, different QEPs can be generated. In
general, we consider that the AG algorithm does not generate a single QEP, but
as many QEPs as relations involved in the query.
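
As a rough sketch of this minimum-selectivity augmentation, the following code greedily extends a partial plan with the join of smallest selectivity among the joins touching the relations already placed; different start relations yield different plans. The Join structure and the left-deep order representation are illustrative assumptions, and the generateMinimumJoinSelectivityPlan() routine used in Algorithm 1 below can be thought of along these lines, though this is not the optimizer's actual code.

#include <limits>
#include <set>
#include <vector>

// A join predicate between two relations with its selectivity factor.
struct Join { int left, right; double selectivity; };

// Augmentation with the Minimum Join Selectivity criterion (a sketch):
// starting from 'start', repeatedly append the relation reachable through the
// join with the smallest selectivity among joins touching the partial plan.
std::vector<int> mjsOrder(int numRelations, const std::vector<Join>& joins, int start) {
    std::vector<int> order{start};
    std::set<int> placed{start};
    while ((int)placed.size() < numRelations) {
        double bestSel = std::numeric_limits<double>::max();
        int next = -1;
        for (const Join& j : joins) {
            bool lIn = placed.count(j.left), rIn = placed.count(j.right);
            if (lIn == rIn) continue;                 // skip joins fully inside or outside the plan
            if (j.selectivity < bestSel) {
                bestSel = j.selectivity;
                next = lIn ? j.right : j.left;
            }
        }
        if (next < 0) break;                          // query graph not connected
        order.push_back(next);
        placed.insert(next);
    }
    return order;                                     // a left-deep join order
}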

Algorithm 1. HIP: Initial Population generation pseudocode

1: procedure IniPop(time maxTime, int maxPlans)
2:    int numPlan ← 0;
3:    p ← generateMinimumJoinSelectivityPlan();
4:    currentTime ← getCurrentTime();
5:    while (p ∧ currentTime < maxTime/2 ∧ numPlan < maxPlans) do
6:       insertPlanToPopulation(p);
7:       numPlan ← numPlan + 1;
8:       p ← generateMinimumJoinSelectivityPlan();
9:       currentTime ← getCurrentTime();
10:   end while
11:   if (numPlan < maxPlans) then
12:      genRemainingRandomMembers(maxPlans − numPlan);
13:   end if
14: end procedure

Algorithm 1 summarizes the working principles of HIP. In order to simplify


the implementation and the experiments, we assume that we fix the optimization
time a priori. This optimization time is passed to Algorithm 1 using the para-
meter maxTime. A number of QEPs are created (lines 3 and 8) and introduced
in the population (line 6) using the Minimum Join Selectivity heuristic (MJS).
Since MJS has a non-trivial computational cost, generating all the members of
the population with the heuristic could be very time-consuming, exhausting the
whole optimization time, and preventing the genetic optimizer from performing
an operation. Therefore, as shown in line 5, the heuristic is applied until the
maximum number of possible QEPs generated by the heuristic is reached. Thus,
we create as many QEPs as needed in the population or we spend about half
of the optimization time. Finally, if the population is not completed after the
loop, the remaining QEPs are created at random using the function genRemain-
ingRandomMembers(). This function has a parameter that specifies the number
of remaining QEPs to be created at random (line 12).

5 Experimental Results
Our first concern is to provide means to assure a fair comparison between the
approaches studied in this paper. With this purpose, we have used the meta-
structures created for CGO in order to implement the new techniques and 2PO,

i.e., the QEP metadata, the functions to calculate the cost of a plan, etc. With
this, we guarantee that the efforts put on the performance optimization of CGO
are also used by the other approaches.
Our new techniques are tested first with star schemas and star join queries,
since they represent one of the most typical scenarios in Decision Support Sys-
tems, similar to those used for TPC-H. In order to provide means to generalize
our conclusions, we also test our techniques with random queries. We do not
show the results using the TPC-H benchmark since the number of relations in
this schema does not allow the creation of large join queries.

Star Join Queries. For star join queries [3] we have randomly generated two
databases containing 20 and 50 relations. Both schemas contain a large fact table
or central relation and 19 and 49 smaller dimension tables, respectively. The fact
table contains a foreign key attribute to a primary key in each dimension relation.
We have distributed the cardinalities in order to have most of the dimensions
with a significantly lower cardinality compared to the fact table. A few
dimensions have cardinalities closer to the cardinality of this fact table,
but still at least one order of magnitude smaller, which typically corresponds to
real scenarios (similar to the TPC-H database schema). The number of attributes
per dimension, other than those included in the primary key, ranges from 1 to 10.
The exact number of attributes per dimension and the attribute type is chosen
at random. We define an index for every primary key.
We randomly define two sets of 9 star join queries, Q20 and Q50 , one for
each database schema. Each set contains queries involving 20 and 50 relations,
respectively. Every query includes all the relations of its corresponding database
schema with at least one explicit join condition associated with each relation.
Therefore, since CGO avoids cross products, we ensure that our queries are well
defined star join queries.
Let Q be a SQL statement reading from a set of relations and γ the set of
constraints in Q. Every constraint c in γ has an associated selectivity factor s(c).
In a star join query, every dimension table typically adds some information to
the data flow or, if a constraint is affecting one of its attributes, it acts as a
filter to discard those results not matching the constraint. Let us define S as the
selectivity of the query calculated as S = Πc∈γ s(c). Each set of queries Q20 and
Q50 contains 9 queries qi , i = 1..9 and, in both cases, S(q1 ) = S(q2 ) = S(q3 ) ≈
10^-2, S(q4) = S(q5) = S(q6) ≈ 10^-4 and S(q7) = S(q8) = S(q9) ≈ 10^-8.

Random Queries. We have generated 30 random queries to evaluate our pro-


posal. The set of random queries is divided into three groups involving 20, 50
and 100 join operations, respectively. In order to generate random queries we
use two tools that we have created and called rdbgen and rqgen, described in [9].
Execution details. Every algorithm has been tested on all the queries. For each
star join query, we have created 5 populations. Each population is used by all the
algorithms, except for HIP, which creates a different initial population. This way,
we eliminate possible noise relative to the random effects of the initial population
and perform a fairer comparison. Every test on every evolutionary algorithm and

population consists of 10 executions each. We also run 10 executions for 2PO.


In total, we have run 5280 executions. The experiments have been run on an
Intel® Xeon® processor at 2.8 GHz with 2 GB of RAM. For both the evolutionary
algorithms and 2PO, we use the scaled cost to compare results:

    ScaledCost = Corig/CTech − 1    if Corig ≥ CTech
                 1 − CTech/Corig    if Corig < CTech                          (2)

where Corig represents the best cost obtained by the original implementation
of CGO and CTech represents the best cost achieved by the technique being
tested. This way, the scaled cost in formula (2) allows us to average over the
executions of different queries and databases, and it is centered
at 0. So if a technique has a positive scaled cost sc (sc > 0), it obtains QEPs
with costs that are, on average, more than sc times lower than those obtained
by CGO. A negative value indicates that the QEP obtained by that technique
is, on average, worse than those obtained by CGO. From here on, we compare
the techniques analyzed in this paper to CGO using formula (2).
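
For reference, formula (2) translates directly into a small helper; the function and parameter names are illustrative, not taken from the papers.

// Scaled cost of formula (2): positive values mean the tested technique found
// cheaper plans than the original CGO, negative values mean worse plans.
double scaledCost(double cOrig, double cTech) {
    if (cOrig >= cTech) return cOrig / cTech - 1.0;
    return 1.0 - cTech / cOrig;
}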
Carquinyoli Genetic Optimizer (CGO). In order to parameterize CGO we
use the recommendations obtained by the statistical analysis presented in [11].
Table 1 summarizes the values used to configure CGO.
Two-Phase Optimization (2PO). We have parameterized 2PO using the
configuration proposed in [4]. During the first phase of 2PO, we perform 10
local optimizations using iterative improvement. The best QEP obtained in this
first phase is used as the starting point, in the second phase, for the simulated
annealing algorithm. The starting value for the initial temperature is 10% of
the cost of this QEP. The same parameterization for 2PO was also used in [15].

5.1 Weighted Election Analysis

As explained before, the difference between the probability to choose the best
and the probability to choose the worst QEP in the population can be magnified
depending on the value of parameter α. In order to study the effect of this
parameter, each run is tested using five different values for α: 2, 10, 10^2, 10^3 and
10^4. We run our experiments using the two different sets of queries mentioned
above, namely the star join query set, executing all the policies 10 times per
each of the 5 populations created per query, and 30 random queries, where each
policy is also run 10 times per configuration, in order to obtain averages.

Table 1. Parameter settings used depending on the number of relations in the query. The
numbers of crossover and mutation operations shown are executed per generation.

# Relations   # members   # cross   # mut
20            160         80        50
50            400         200       100
100           800         300       150

Figure 1 shows the results obtained after these experiments. The uppermost
row shows the behavior of WE for star join queries involving 20 relations. The
leftmost plot (plot a) corresponds to the star join queries with highest selectivity,
i.e., those queries that return a larger number of matches (S ≈ 10^-2). The plot in
the middle (plot b) corresponds to queries with S ≈ 10^-4 and the rightmost plot
(plot c) to queries with lowest selectivity S ≈ 10^-8. Since the number of relations
is relatively small, close to what can still be handled by dynamic programming
techniques, there is still little room for improvement. In general, the larger the
value of α, the more significant the improvements introduced by WE. However,
the plots show that the difference between α = 1000 and α = 10000 is not
significant. We can also observe that, for very low selectivity, the gains of WE
are reduced (plot c). This effect is explained by the fact that, when the selectivity
is very small, most of the potential tuple results are discarded, resulting in a very
low data flow cardinality in the QEP. Since the join operations can be executed
in memory and do not incur extra I/O, all the QEPs have a similar cost and
most of the executions of CGO are likely to reach a QEP with a near-optimal
cost, leaving little room for WE to improve on them.
Analogously, the central row of plots shows the same results for star join
queries involving 50 relations. Our first observation is that, in some cases the
gains obtained by WE are several orders of magnitude larger than those obtained
by CGO. Again, we can observe that the general trend is to reward large values of

[Figure 1 contains nine plots of Scaled Cost versus optimization time: the top row for star join queries with 20 relations (plots a-c), the middle row for star join queries with 50 relations (plots a-c), and the bottom row for random queries with 20, 50 and 100 relations, each showing the behavior of WE for the five tested values of α.]

Fig. 1. Scaled Cost evolution for different values of α and different configurations

[Figure 2 contains two plots of Scaled Cost versus optimization time, for star join queries with 20 and 50 relations, comparing 2PO, HIP, HIP+WE and WE; in the 50-relation plot, HIP reaches a scaled cost of about 7979 and HIP+WE about 8233.]

Fig. 2. Scaled Cost evolution for WE using α = 1000, HIP, the combination of both
and 2PO studying different number of relations for star join queries

α with better performance. Also, we would prefer the performance achieved for
α = 1000 instead of that achieved for α = 10000, which is not as stable in all the
situations. There is a trade-off for parameter α: it is recommendable to use larger
values to achieve good performance (i.e., larger than 100), but too large values
increase the probability of the best plan in the population to be chosen in such
a way that, in practice, we are almost forcing the exclusive use of the best QEPs
in the population, destroying one of the main differences between the genetic
approaches and the random-walk approaches. Similarly, the improvements of WE
decrease as the selectivity decreases, for the reason explained above. However,
even in the worst cases we still obtain QEPs that are, in general, several
times better than those obtained by CGO.
Finally, for random queries, in the lowermost row of plots, we observe the same
trends as with the star join queries. Again, the best value of α tested is 1000,
independently of the number of relations involved in the query. Extreme cases
like α = 2 or α = 10000 must be avoided since they might lead to performance
worse than that of CGO.

5.2 Heuristic Initial Population Analysis

In this section we analyze the benefits obtained by generating part of the pop-
ulation using HIP. Specifically, we run the same number of executions as in the
previous analysis, using the same configurations. Figures 2 and 3 show the re-
sults of our analysis of this technique, and also the results described in the next
subsection.
We first study the behavior for star join queries. In general, the use of HIP
always improves the performance of CGO. As suggested in [11], spending
extra time generating good initial plans is clearly beneficial. Similar to what
happens with WE, the improvements are in general very limited in the case of
star join queries with 20 relations (left plot in Figure 2), since the search space
has not grown enough to obtain QEPs that clearly differ, in terms of quality, from
those obtained by CGO. However, for 50 relations (right plot in Figure 2) HIP
obtains results that are three orders of magnitude better than those obtained by
CGO. As the plot shows, for small optimization times, the improvement of our

techniques is around 10 times better than 2PO. It takes CGO about 4 minutes
to achieve results similar to those generated by HIP, which implies that HIP
converges much faster without losing quality. For random queries (Figure 3), we
can observe that HIP also obtains results similar to those obtained for star join
queries, achieving, for queries containing 100 joins, an improvement of more than
four orders of magnitude.

5.3 Combining WE and HIP vs. 2PO


Finally, we combine both techniques and compare their behavior with the best
random-walk algorithm presented in the literature: 2PO. All the experiments
in this subsection have been run using α = 1000. As it can be observed in
Figures 2 and 3, the combination of HIP and WE in CGO clearly outperforms
2PO with star join and random queries, except for the case of 20 relations,
where they behave very similarly. The combination of the two techniques
presented in this paper obtains QEPs that are, on average,
20 times better than those obtained by 2PO, with 50 joins, and four orders of
magnitude better for 100 joins. These results show that 2PO can be used as an
intermediate solution for queries with about 20 joins, but it quickly fails to find
QEPs for very large join queries, since the search space expands exponentially,
and the potential of random-walk algorithms degrades.

6 Related Work
The first approaches that applied genetic algorithms to query optimization con-
sidered a reduced set of QEP properties in crossover and mutation operations
[2,15]. In these first proposals, the amount of information per plan is very lim-
ited because plans are transformed to chromosomes, represented as strings of
integers. This lack of information usually leads to the generation of invalid plans
that have to be repaired. In [16], a genetic-programming-based optimizer is pro-
posed that directly uses QEPs as the members in the population, instead of
using chromosomes. A first genetic optimizer prototype was created for Post-
greSQL [12], but its search domain is reduced to left-deep trees and mutation
operations are deprecated, thus bounding the search to only those properties
appearing in the QEPs of the initial population. Besides, execution plans are

[Figure 3 contains three plots of Scaled Cost versus optimization time, for random queries with 20, 50 and 100 relations, comparing 2PO, HIP, HIP+WE and WE.]

Fig. 3. Scaled Cost evolution for WE using α = 1000, HIP, the combination of both
and 2PO studying different numbers of relations for random queries

represented as strings of integers, thereby losing a lot of important information.


CGO is presented in [8] and later analyzed in [10,11] showing that it is possible
to find criteria to parameterize a genetic optimizer for star-join queries. Also,
several variants of random-walk algorithms have been proposed in [4,5,15,18].
Randomized search techniques try to remedy the exponential explosion of dy-
namic programming techniques by iteratively exploring the search space and
converging to a nearly optimal solution.

7 Conclusions
In this paper we present two techniques, namely Weighted Election (WE) and
Heuristic Initial Population (HIP). These techniques tackle two important as-
pects of genetic optimization: the time wasted optimizing some QEPs in the
population with a large cost and the effects of the initial population on the
quality of the best QEP generated by the optimizer. WE is able to speed up
a genetic optimizer and achieve a quick convergence compared to the original,
meaning that, without de-randomizing the genetic evolution, it is important to
focus on those QEPs with lower associated cost, and avoid spending time opti-
mizing QEPs that are far from the best QEP in the population. HIP is the first
technique combining heuristics with genetic query optimizers, and it shows that
using simple rules to generate the initial population allows the genetic optimizer
to quickly generate good-fitted QEPs, improving the speed and the quality of
the optimizer. The combination of both techniques, which are orthogonal, is very
simple and it is shown to outperform the best random-walk approach presented
in the literature. All in all, we show that, for very large join queries, as the num-
ber of relations increases it is advisable to use genetic methods based on beam
search strategies, rather than random-walk techniques.

References
1. W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone. Genetic Programming
– An Introduction; On the Automatic Evolution of Computer Programs and its
Applications. Morgan Kaufmann, dpunkt.verlag, Jan. 1998.
2. K. Bennett, M. C. Ferris, and Y. E. Ioannidis. A genetic algorithm for database
query optimization. In R. Belew and L. Booker, editors, Proceedings of the Fourth
International Conference on Genetic Algorithms, pages 400–407, San Mateo, CA,
1991. Morgan Kaufman.
3. S. Chaudhuri and U. Dayal. Data warehousing and OLAP for decision support.
SIGMOD’97: In Proceedings of the ACM SIGMOD international conference on
Management of data, pages 507–508, 1997.
4. Y. E. Ioannidis and Y. Kang. Randomized algorithms for optimizing large join
queries. In SIGMOD ’90: Proc. of the 1990 ACM SIGMOD international confer-
ence on Management of data, pages 312–321, New York, NY, USA, 1990. ACM
Press.
5. Y. E. Ioannidis and E. Wong. Query optimization by simulated annealing. In
SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD international conference on
Management of data, pages 9–22, New York, NY, USA, 1987. ACM Press.

6. A. Kemper, D. Kossmann, and B. Zeller. Performance tuning for SAP R/3. IEEE
Data Eng. Bull., 22(2):32–39, 1999.
7. R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries.
In VLDB, pages 128–137, 1986.
8. V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, and J.-L. Larriba-Pey. Cgo:
a sound genetic optimizer for cyclic query graphs. In Proc. of ICCS 2006, pages
156–163, Reading, UK, May 2006. Springer-Verlag.
9. V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, V. Markl, and J.-L. Larriba-Pey.
Genetic evolution in query optimization: a complete analysis of a genetic optimizer.
Technical Report UPC-DAC-RR-2005-21, Dept. d’Arqu. de Comp. Universitat Po-
litecnica de Catalunya (http://www.dama.upc.edu), 2005.
10. V. Muntés-Mulero, J. Aguilar-Saborit, C. Zuzarte, V. Markl, and J.-L. Larriba-
Pey. An inside analysis of a genetic-programming optimizer. In Proc. of IDEAS
’06., December 2006.
11. V. Muntés-Mulero, M. Pérez-Cassany, J. Aguilar-Saborit, C. Zuzarte, and J.-L.
Larriba-Pey. Parameterizing a genetic optimizer. In Proc. of DEXA ’06, pages
707–717, September 2006.
12. PostgreSQL. http://www.postgresql.org/.
13. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price.
Access path selection in a relational database management system. In Proceedings
of the 1979 ACM SIGMOD international conference on Management of data, pages
23–34. ACM Press, 1979.
14. E. J. Shekita, H. C. Young, and K.-L. Tan. Multi-join optimization for symmetric
multiprocessors. In R. Agrawal, S. Baker, and D. A. Bell, editors, 19th Interna-
tional Conference on Very Large Data Bases, August 24-27, 1993, Dublin, Ireland,
Proceedings, pages 479–492. Morgan Kaufmann, 1993.
15. M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized opti-
mization for the join ordering problem. VLDB Journal: Very Large Data Bases,
6(3):191–208, 1997.
16. M. Stillger and M. Spiliopoulou. Genetic programming in database query optimiza-
tion. In J. R. Koza, D. E. Goldberg, D. B. Fogel, and R. L. Riolo, editors, Genetic
Programming 1996: Proceedings of the First Annual Conference, pages 388–393,
Stanford University, CA, USA, 28–31 July 1996. MIT Press.
17. A. Swami. Optimization of large join queries: combining heuristics and combina-
torial techniques. In SIGMOD ’89: Proceedings of the 1989 ACM SIGMOD inter-
national conference on Management of data, pages 367–376. ACM Press, 1989.
18. A. Swami and A. Gupta. Optimization of large join queries. In SIGMOD ’88:
Proceedings of the 1988 ACM SIGMOD international conference on Management
of data, pages 8–17, New York, NY, USA, 1988. ACM Press.
19. A. N. Swami and B. R. Iyer. A polynomial time algorithm for optimizing join
queries. In Proceedings of the Ninth International Conference on Data Engineering,
pages 345–354, Washington, DC, USA, 1993. IEEE Computer Society.
20. Y. Tao, Q. Zhu, C. Zuzarte, and W. Lau. Optimizing large star-schema queries
with snowflakes via heuristic-based query rewriting. In CASCON ’03: Proceedings
of the 2003 conference of the Centre for Advanced Studies on Collaborative research,
pages 279–293. IBM Press, 2003.

Trademarks. IBM is a registered trademark of International Business Machines Cor-


poration in the United States, other countries, or both. Intel and Intel Xeon are regis-
tered trademarks of Intel Corporation or its subsidiaries in the United States and other
countries.
Cost-Based Query Optimization for
Multi Reachability Joins

Jiefeng Cheng, Jeffrey Xu Yu, and Bolin Ding

The Chinese University of Hong Kong, China


{jfcheng,yu,blding}@se.cuhk.edu.hk

Abstract. There is a need to efficiently identify reachabilities between different


types of objects over a large data graph. A reachability join (R-join) serves as a
primitive operator for such a purpose. Given two types, A and D, R-join finds all
pairs of A-typed and D-typed objects such that the D-typed object is reachable
from some A-typed object. In this paper, we focus on processing multi
reachability joins (R-joins). In the literature, the state-of-the-art approach
extended the well-known twig-stack join algorithm to be applicable on directed
acyclic graphs (DAGs). The efficiency of such an approach is affected by the
density of large DAGs. In this paper, we present algorithms to optimize R-joins
using dynamic programming based on the estimated costs associated with R-joins.
Our algorithm is not affected by the density of graphs. We conducted extensive
performance studies and report our findings.
in our performance studies.

1 Introduction
With the rapid growth of World-Wide-Web, new data archiving and analyzing tech-
niques bring forth a huge volume of data available in public, which is graph structured
in nature including hypertext data, semi-structured data and XML [1]. A graph provides
great expressive power for people to describe and understand the complex relationships
among data objects. As a major standard for representing data on the World-Wide-
Web, XML provides facilities for users to view data as graphs with two different links,
the parent-child links (document-internal links) and reference links (cross-document
links). In addition, XLink (XML Linking Language) [7] and XPointer (XML Pointer
Language) [8] provide more facilities for users to manage their complex data as graphs
and integrate data effectively. Besides, RDF [3] explicitly describes semantic resources
as graphs.
Upon such a graph, a primitive operation, reachability join (or simply R-join) was
studied [11,6]. In brief, a reachability join, A→D, denoted R-join, is to find all the
node-pairs, (a, d), in the underlying large data graph such that d is reachable from a,
denoted a ; d, and the labels of a and d are A and D respectively. R-joins help users to
find information effectively without requiring them to fully understand the schema of
the underlying graph. We explain the need for such an R-join using an XML example.
Figure 1 shows a graph representation (Figure 1 (b)) of an XML document (Figure 1 (a)).
In Figure 1 (b), solid links represent document-internal links whereas dashed links rep-
resent cross-document links. We consider Figure 1 (b) as a graph with all links being
treated in the same way. With R-join, we can easily find all the topics that a researcher


[Figure 1 (a) lists an XML document rooted at Research, with Institute and Lab elements containing researcher children (Jim, Joe, Linda, John, Keith), projectref elements referencing projects proj1 and proj2 through idref attributes, and a projects element whose project children carry the topics XML and RDF. Figure 1 (b) shows the corresponding XML data graph over nodes 0-21, in which solid links are document-internal links and dashed links are cross-document (ID/IDREF) links.]

(a) XML Data                              (b) XML Data Graph

Fig. 1. An Example

is interested in using researcher→topic. However, it would be difficult for a user


to find the same information using XPath queries, because XPath supports document-
internal links using a descendants-or-self-axis operator // and cross-document links
using value-matching based on a notion called ID/IDREF in XML. It cannot find such
information using an XPath query, researcher//topic, because topic is a child of
proj, and there is an ID/IDREF from researcher to proj. XPath requests users to
fully understand the schema and understand that the two different kinds of links are
processed in two different ways in XML data. In this paper, we focus ourselves on
optimizing and processing multi R-join queries.
A query with more than one R-join can naturally be represented by a query
graph. The existing approaches [5,12] extended the well-known tree-specific method,
namely, twig-stack join algorithm [4], to process such a query graph over a DAG. We
observed that this approach is very sensitive to the density of the underlying DAG. In
this paper, we propose a dynamic programming approach that optimizes multi R-joins
in a similar fashion as to optimize multi joins, based on the estimated costs associated
with an R-join. The advantage of our approach is that it is not sensitive to the density
of the underlying DAG. We conducted extensive experimental studies on multi R-join
queries using large XMark benchmark dataset [9], which confirms the efficiency of
our approach.
The rest of the paper is organized as follows. Section 2 gives our problem statement on
multi reachability join query processing. Section 3 briefly reviews the existing technique,
which extended the well-known twig-stack join algorithm; together with the motivation
of our approach, we discuss the drawbacks of such an approach for multi R-join query
processing. In Section 4, we review the multiple interval encoding for DAGs, followed
by discussions on our cost-based approach that optimizes R-joins using a dynamic
programming approach for multi R-join queries. Section 5 reports the performance
evaluation on our proposed method. Section 6 concludes this paper.

[Figure 2 shows a query graph with nodes labeled Institute, researcher and topic. Figure 3 shows an example DAG with nodes v1, ..., vn, vn+1, ..., v2n−1 used to illustrate the worst case for TwigStackD.]

Fig. 2. An R-join Query                    Fig. 3. An Example DAG for TwigStackD

2 Multi Reachability Joins

We consider a database as a directed node-labeled data graph G = (V, E, L, φ). Here, V


is a set of elements, E is a set of edges, L is a set of labels, and φ is a mapping function
which assigns a node a label. Given a label l ∈ L, the extent of l is defined as a set of
nodes in G whose label is l, denoted ext(l). Below, we use V (G) and E(G) to denote the
set of nodes and the set of edges of a graph G, respectively. Such a data graph example
is shown in Figure 1 (b).
A reachability join, A→D, called R-join, is to find all the node-pairs, (a, d), in the
data graph G such that d is reachable from a, denoted a ; d, and φ(a) = A and φ(d) =
D. We also use D←A, instead of A→D, if needed. A→D ≡ D←A. In this paper, we
concentrate on processing conjunctive multi R-join queries in the form of
A→B ∧ B→C ∧ · · · ∧ X→Y
The following holds for R-joins.
– Asymmetric: A→B ≢ B→A.
– Transitive: If A→B ∧ B→C hold, then A→C.
– Associative: (A→B)→C ≡ A→(B→C)1
A multi R-join query can be represented as a directed query graph, Gq (Vq , Eq , Lq , λ).
Here, Vq is a set of nodes. The node-label of a node v ∈ Vq is represented as λ(v).
An edge v → u represents an R-join A→D, where the labels of v and u are A and D,
respectively. A graph representation of a multi R-join query, A→C ∧ B→C ∧C→D, is
shown in Figure 2.
We evaluate a query graph Gq (Vq , Eq , Lq , λq ) over a data graph G(V, E, L, φ). The
result of the query graph, Gq , denoted R (Gq ), consists of a set of n-ary tuples. A tuple
consists of n nodes in the data graph G, if the query graph Gq has n nodes (|V (Gq )| = n),
in the form of t = [v1 , v2 , · · · , vn ], where there is a one-to-one mapping between vi in t
and ui in V (Gq) such that φ(vi) = λ(ui). In addition, all nodes in the n-ary tuple t satisfy
all the reachability join conditions specified in the query graph Gq .

Table 1. The Graph Encoding of [2]

l           v    pov   Iv
Institute   1    5     [1:5]
Institute   3    20    [12:13][17:20]
researcher  5    2     [1:2]
researcher  6    4     [3:4]
researcher  9    10    [6:10]
researcher  10   15    [11:15]
researcher  17   19    [12:13][17:19]
topic       20   7     [7:7]
topic       21   12    [12:12]

Table 2. The Graph Encoding of [5]

(a) Tree Interval Encoding             (b) SSPI Index
v    Interval     v    Interval        v    preds
1    [2:11]       13   [17:20]         13   {4}
3    [34:41]      16   [27:30]         16   {19, 4}
4    [42:43]      17   [35:40]         20   {13}
5    [3:6]        19   [38:39]         21   {16}
6    [7:10]       20   [18:19]
9    [13:22]      21   [28:29]
10   [23:32]

1 The chain query A→B ∧ B→C ∧ · · · ∧ X→Y is abbreviated to A→B→C→· · · →X→Y .



Example 1. Fig. 2 represents a simple multi R-join query as a directed graph. This
query graph has a node labeled Institute, a node labeled researcher and a node labeled
topic, and two edges. The edge from the Institute node to the researcher
node requires that the data node pairs (i, r), i ∈ ext(Institute) and r ∈ ext(researcher),
such that i ; r, should be returned; at the same time, the edge from the researcher node to the
topic node requires that the data node pair (r,t), r ∈ext(researcher) and t ∈ext(topic),
such that r ; t, should be returned.

3 Motivation
Recently, as an effort to extend Twig-Join in [4] to be workable on graphs, Chen et al.
studied multi R-join query processing (called pattern matching) over a directed acyclic
graph (DAG) in [5]. As an approach along the line of Twig-join [4], Chen et al. used
the interval-based encoding scheme, which is widely used for processing queries over
an XML tree, where a node v is encoded with a pair [s, e] and s and e together specify
an interval. Given two nodes u and v in an XML tree, u is an ancestor of v, u ; v, if
u.s < v.s and u.e > v.e or simply u’s interval contains v’s.
The test of a reachability relationship in [5] is broken into two parts. First, like the ex-
isting interval-based techniques for processing pattern matching over an XML tree, they
first check if the reachability relationships can be identified over a spanning tree gen-
erated by depth-first traversal of a DAG. Table 2(a) lists the intervals from a spanning
tree over the DAG of our running example. Second, for the reachability relationship that
may exist over DAG but not in the spanning tree, they index all non-tree edges (named
remaining edges in [5]), and all nodes being incident with any such non-tree edges in
a data structure called SSPI in [5]. Thus, all predecessor/successor relationships that
can not be identified by the intervals alone can be found with the help of SSPI. For our
running example, Table 2(b) shows SSPI.
As given in [5], for example, the procedure to find the predecessor/successor rela-
tionship of 17 ; 21 in the DAG of Fig. 1 is as follows. First, it checks the containment
of tree intervals for 17 and 21, but there is no such a path between them in the tree.
Then, because 21 has entries of predecessor in SSPI, it tries to find a reachability rela-
tionship between 17 and all 21’s predecessors in SSPI by checking the containment of
tree interval for 17 and that of each of 21’s predecessors in SSPI recursively.
As shown above, in order to identify a reachability relationship between two nodes,
say, a and d, TwigStackD needs to recursively search SSPI to check whether a predecessor
of d can be reached by a. This overhead over a DAG can be costly. Consider the DAG
of 2n − 1 nodes in Fig. 3, where the solid lines are edges in the spanning tree generated
by a depth-first search, and the dashed lines are the remaining edges. Note that in the SSPI,
the entry for vn contains nodes vn+1, vn+2, · · · , v2n−1. Thus, to determine the reachability
relationship from node v2n−1 to node vn, TwigStackD needs n − 1 checks to
see whether v2n−1 can reach any node in the entry. The cost of processing R-join queries is
considerably high.
We conducted tests to confirm our observations. We generate a DAG by collapsing all
strongly connected components in a graph that is obtained using the XMark data generator
with a scaling factor of 0.01 (16K nodes). Here both XML tree edges and ID/IDREF links
are treated as the edges in the graph. Fig. 4 shows the performance of TwigStackD

[Figure 4 contains three plots comparing TwigStackD (TSD) and our dynamic programming approach (DP) on queries Q1 and Q4 as the percentage of remaining edges grows from 10% to 50%: (a) Number of I/Os, (b) Elapsed Time (sec), (c) Number of Index Seeks on SSPI.]
Fig. 4. The Test on DAGs with Increasing Densities

on 5 DAGs, with 10%, 20%, 30%, 40% and 50% of non-tree edges (called remaining
edges) as the percentage of the total tree edges in the spanning tree obtained from the
graph. The queries used are Q1 and Q4, which are listed in Fig. 7 (a) and (d). In Fig. 4,
Q(TSD) and Q(DP) are the processing costs to process Q using Chen et al.'s TwigStackD and
our dynamic programming approach, respectively.
Fig. 4 (a) shows the I/O number when more remaining edges are added to the under-
lying DAG. As an example, for query Q4(TSD), the I/O number increased by 4,606 from
10% to 20% on the y-axis, while it increased by 38,881 from 40% to 50% on the y-axis.
When 5 times of number of remaining edges is included, the I/O number increases about
35 times. As for the number of index seeks in SSPI, namely the number of times to seek
a leaf page from the B+-Tree that implements the SSPI, which is shown in Fig. 4 (c),
this value increased by 616,052 from 10% to 20% on the y-axis, while it increased by
5,201,991 from 40% to 50% on the y-axis. The correlation coefficient between these two
types of measurements is above 0.999, which indicates that this behavior of
the number of I/Os during processing is mainly caused by the number of index seeks
on SSPI. A similar situation for processing time can also be observed in Figure 4 (b), since
the I/O number is the dominating factor for total processing cost. This test empirically
showed that TwigStackD performs better for DAGs with fewer remaining edges, but its
performance degrades rapidly when more edges are included in the underlying DAG.
Fig. 4 (a) and (b) also show the efficiency of our dynamic programming approach.
Our approach is not as sensitive as TwigStackD to the density of the DAG. For Q4, our
approach uses fewer than 200 I/O accesses and about 1 second of processing time.

4 A New Dynamic Programming Approach


Dynamic programming has been widely used and studied as an effective paradigm for
query optimization [10]. In this section, we show how to use dynamic programming to
optimize and process multi R-join queries. In brief, we use an R-join algorithm [11]
that uses a multiple interval encoding scheme [2] for processing R-joins over a DAG.
Below, first, we discuss the R-join algorithm [11], and how to extend it to process
multi R-joins. Then, we will discuss R-join size estimation, and give our optimization
approach based on dynamic programming.

4.1 An R-Join Algorithm Based on a Multiple Interval Encoding


Agrawal et al. proposed an interval-based coding for encoding DAG [2]. Unlike the
approaches that assign a single code, [s : e], for every node in a tree, Agrawal et al.

assigned a set of intervals and a postorder number for each node in DAG. Let Iu =
{[s1 : e1], [s2 : e2], · · · , [sn : en]} be the set of intervals assigned to a node u; there is a path
from u to v, u ; v, if the postorder number of v is contained in an interval, [s j : e j ] in
Iu . The interval-based coding for the graph in Figure 1 (b) is given in Table 1. For the
same example of 17 ; 21 in the DAG of Fig. 1, the relationship can be identified by node 21's postorder
number, 12, and one interval associated with node 17, [12 : 13], since 12 is contained in [12 : 13].
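
The containment test behind this encoding is straightforward; the following minimal C++ sketch (with an assumed Code structure holding a node's postorder number and interval set) checks whether u ; v:

#include <utility>
#include <vector>

// Multiple-interval encoding of a DAG node, as in [2]: a postorder number plus
// a set of intervals over postorder numbers.
struct Code {
    int po;                                      // postorder number
    std::vector<std::pair<int, int>> intervals;  // [s : e] pairs
};

// u can reach v (u ; v) iff v's postorder number falls into one of u's intervals.
bool reaches(const Code& u, const Code& v) {
    for (const auto& se : u.intervals)
        if (se.first <= v.po && v.po <= se.second) return true;
    return false;
}

With the codes of Table 1, node 17 carries the interval [12 : 13] and node 21 has postorder number 12, so reaches(code17, code21) holds, matching the 17 ; 21 example above.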
Based on [2], Wang et al. studied processing R-joins over a directed graph [11]. In
brief, given a directed graph G, it first constructs a DAG G′ by condensing each strongly
connected component of G into a node of G′. Second, it generates the encoding for G′ based
on [2]. All nodes in a strongly connected component of G share the same code assigned
to the corresponding representative node condensed in G′. Given an R-join, A→D, two
lists, Alist and Dlist, are formed respectively. Alist encodes every node v as (v, s:e) where
[s : e] ∈ Iv. A node of A has n entries in the Alist if it has n intervals. Dlist encodes
each node v as (v, pov) where pov is the postorder number. Note: Alist is sorted on the
intervals [s : e] by the ascending order of s and then the descending order of e, and Dlist
is sorted by the postorder numbers in ascending order. Wang et al. proposed to merge-join the
nodes in Alist and Dlist and to scan the two lists once.
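
The cited work only states that the two sorted lists are merge-joined in a single scan; one possible way to organize such a scan is sketched below, sweeping the postorder axis while keeping the currently open intervals ordered by their end points. The entry types are assumptions for illustration, and duplicate (a, d) pairs would have to be removed if a node carried overlapping intervals.

#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct AEntry { int node, s, e; };   // Alist entry: ancestor node with one interval [s:e]
struct DEntry { int node, po;   };   // Dlist entry: descendant node with its postorder number

// One possible single-scan merge of an Alist (sorted by ascending s) and a
// Dlist (sorted by ascending postorder): intervals are opened when their start
// passes, closed when their end falls behind, and every open interval matches.
std::vector<std::pair<int, int>> rjoin(const std::vector<AEntry>& alist,
                                       const std::vector<DEntry>& dlist) {
    std::vector<std::pair<int, int>> out;          // (ancestor, descendant) result pairs
    std::multimap<int, int> active;                // end point e -> ancestor node
    std::size_t i = 0;
    for (const DEntry& d : dlist) {
        while (i < alist.size() && alist[i].s <= d.po) {     // open intervals starting at or before po
            active.emplace(alist[i].e, alist[i].node);
            ++i;
        }
        active.erase(active.begin(), active.lower_bound(d.po)); // drop intervals ending before po
        for (const auto& ae : active)                           // every remaining interval contains po
            out.emplace_back(ae.second, d.node);
    }
    return out;
}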

4.2 Multi R-Joins Processing

Some extensions are needed to use the R-join algorithm [11] to process multi
R-joins. Consider A→D ∧ D→E. For processing A→D, the list for D (a Dlist)
needs to be sorted on the postorder numbers, because D is the descendant. For
processing D→E, the list for D needs to be sorted on s followed by e for all (v, s:e),
because D is now the predecessor. Also, recall that for A ; D, Alist needs to encode every node v as
(v, s:e) where [s : e] ∈ Iv, which means there is a blocking between the two consecutive
R-joins, A→D followed by D→E, and we need to generate a new Alist from the output
of the previous R-join, A→D, in order to carry out the next R-join, D→E. Thus, the
intervals and postorder numbers of each node must be maintained during multi R-join
processing so that intermediate Alists or Dlists can be regenerated on the fly. In total,
three operations are needed to handle this blocking and enable multi R-join query
processing; a code sketch of the first two is given after the list below.
– α(A): Given a list of node vectors of the form (v1, v2, . . . , vl), where vi is the node in the
extension associated with A, it attaches each interval [s : e] ∈ Ivi, obtaining a number
of tuples (v1, v2, . . . , vl, [s : e]) from every vector (v1, v2, . . . , vl), and sorts the resulting list
to obtain an Alist from these vectors. For example, consider the execution of two
consecutive R-joins, Institute→researcher and researcher→topic, to process the
query of our running example. The first R-join, Institute→researcher, will produce
a set of temporary results A′, {(1, 5), (1, 6), (3, 17)}. In order to form the proper
input for the second R-join, researcher→topic, an α(A′) operation is applied
and we obtain {(1, 5, [1 : 2]), (1, 6, [3 : 4]), (3, 17, [12 : 13]), (3, 17, [17 : 19])}, which
becomes the input Alist for the second R-join.
– δ(D): Similar to α, but it attaches the postorder numbers: for every vector (v1, v2, . . . , vl)
it obtains (v1, v2, . . . , vl, [povi]), where vi is the node in the extension associated with D, to form
a sorted Dlist. For example, consider the execution of two consecutive R-joins,
researcher→topic and Institute→researcher, to process the query of our running
example. The first R-join, researcher→topic, will produce a set of temporary results
D′, {(9, 20), (10, 21), (17, 21)}. In order to form the proper input for the second
R-join, Institute→researcher, a δ(D′) operation is applied and we obtain
{(9, 20, [10]), (10, 21, [15]), (17, 21, [19])}, which becomes the input Dlist for the
second R-join.
– σ(A, D): Given a list of node vectors of the form (v1, v2, . . . , vl), where vi/vj in a
vector are in the extensions associated with A/D, it selects those vectors satisfying
vi→vj. This is used to process an R-join A→D when both A nodes and D
nodes are already present in the partial solution. For example, considering the query
in Fig. 7(c) and four consecutive R-joins, I→C, I→P, C→P and L→P, to evaluate
that query, once the processing of I→C, I→P and C→P has been done, we only
need a further σ(L, P) to finish the evaluation.
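
The following C++ sketch illustrates the α and δ operations on intermediate result vectors, under an assumed coding table that maps each data node to its intervals and postorder number; it mirrors the examples above but is not the system's actual implementation.

#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

struct Interval { int s, e; };
struct Code { int po; std::vector<Interval> intervals; };

using Tuple = std::vector<int>;                 // one intermediate result (v1, ..., vl)
struct AListEntry { Tuple t; Interval iv; };    // tuple extended with one interval
struct DListEntry { Tuple t; int po; };         // tuple extended with a postorder number

// alpha: attach every interval of the node in column 'col' (the label that will be
// joined as predecessor) and sort as an Alist (ascending s, then descending e).
std::vector<AListEntry> alpha(const std::vector<Tuple>& tuples, std::size_t col,
                              const std::map<int, Code>& coding) {
    std::vector<AListEntry> alist;
    for (const Tuple& t : tuples)
        for (const Interval& iv : coding.at(t[col]).intervals)
            alist.push_back({t, iv});
    std::sort(alist.begin(), alist.end(), [](const AListEntry& a, const AListEntry& b) {
        return a.iv.s != b.iv.s ? a.iv.s < b.iv.s : a.iv.e > b.iv.e;
    });
    return alist;
}

// delta: attach the postorder number of the node in column 'col' (the label that
// will be joined as descendant) and sort by postorder to obtain a Dlist.
std::vector<DListEntry> delta(const std::vector<Tuple>& tuples, std::size_t col,
                              const std::map<int, Code>& coding) {
    std::vector<DListEntry> dlist;
    for (const Tuple& t : tuples)
        dlist.push_back({t, coding.at(t[col]).po});
    std::sort(dlist.begin(), dlist.end(),
              [](const DListEntry& a, const DListEntry& b) { return a.po < b.po; });
    return dlist;
}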
We develop the cost functions involving these operations for multi R-join processing
after describing R-join size estimation.

4.3 R-Join Size Estimation


We introduce a simple but effective way to estimate the answer size for a sequence of
R-joins. We need two presumption for our estimation: (1) For any pair-wise R-join,
say A→D, every pair of instance (a, d), where a ∈ ext(A) and d ∈ ext(D), is joinable
with the same probability. (2) Consider two R-joins, say A→B and B→C, for any three
instance (a, b, c), where a ∈ ext(A), b ∈ ext(B), and c ∈ ext(C), the two events E1 =
{a is joinable with b} and E2 = {b is joinable with c} are independent.
Suppose the answer size for R-joins (R1 →R2 ∧ . . . ∧ Ri−1 →Ri ) is M and the answer
size for the pairwise R-join Rh →Ri+1 , where 1 ≤ h ≤ i, is N, we will show the answer
size for (R1→R2 ∧ . . . ∧ Ri−1→Ri) ∧ (Rh→Ri+1) can be estimated as M×N/|Rh|, where |Rh|
is the cardinality for the extension of Rh .
Suppose rj is an instance from ext(Rj), and let Join(·) denote the event that instances
are joinable. Then, by assumption (2), we have

    Pr(Join(r1, r2, .., ri, ri+1)) = Pr(Join(r1, .., ri) ∧ Join(rh, ri+1)) = Pr(Join(r1, .., ri)) · Pr(Join(rh, ri+1)).

And by assumption (1), we have

    Pr(Join(r1, .., ri)) ≈ M / (|R1|·|R2|··|Ri|)    and    Pr(Join(rh, ri+1)) ≈ N / (|Rh|·|Ri+1|).

So the estimated answer size of (R1→R2 ∧ . . . ∧ Ri−1→Ri) ∧ (Rh→Ri+1) is

    EST = |R1||R2|··|Ri||Ri+1| · Pr(Join(r1, r2, .., ri, ri+1))
        = |R1|··|Ri+1| · (M / (|R1||R2|··|Ri|)) · (N / (|Rh||Ri+1|)) = M×N / |Rh|.
So we are able to estimate the answer size for all such R-joins by simply keeping
all pairwise R-join sizes and all labels' extension cardinalities in the database
catalog.
Example 2. For our running example, the first join is Institute→researcher, thus M = 3.
For Institute→researcher→topic, since N = 3 and |ext(researcher)| = 5, the estimated
result set size is 3×3/5 = 1.8. The same result is obtained if researcher→topic is
taken as the first join.
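
A minimal sketch of this estimation, assuming a catalog that memorizes extension sizes and pairwise R-join sizes, is the following (structure and function names are illustrative):

#include <map>
#include <string>
#include <utility>

// Catalog statistics assumed to be kept by the optimizer: extension sizes and
// pairwise R-join result sizes.
struct Catalog {
    std::map<std::string, double> extentSize;                          // |ext(R)|
    std::map<std::pair<std::string, std::string>, double> rjoinSize;   // size of R-join A->D
};

// Estimated size of (R1->R2 ^ ... ^ R_{i-1}->R_i) ^ (R_h->R_{i+1}):
// M is the size of the partial result; the new pairwise join attaches R_{i+1}
// through the already-present label R_h.
double estimateSize(double M, const std::string& Rh, const std::string& Rnext,
                    const Catalog& cat) {
    double N = cat.rjoinSize.at({Rh, Rnext});
    return M * N / cat.extentSize.at(Rh);
}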

4.4 The Enumeration Space for Multi R-Joins


We use dynamic programming style optimization to enumerate a set of equivalent plans
to evaluate a multi R-join query graph Gq against a database graph G. We briefly outline
the procedure of searching such plans and its execution to evaluate Gq .
Given a query graph Gq , only left-deep tree plans are searched as a common prac-
tice for a reasonable search space. Recall that in Gq a node represents a label and an edge
represents →. An R-join, A→D, is represented as an edge from A to D. Initially, sub-
graphs G2 with two nodes connected by an edge are considered. Here, V(G2) = {v, u}
and E(G2) = {(v, u)} or E(G2) = {(u, v)}, depending on whether it is for v→u or u→v.
In the next step, one more edge is added: a subgraph G3 is considered such that
E(G3) includes all the edges in E(G2) plus one edge that is incident to at least one
node in V(G2). This step repeats until all the nodes and edges of the original query
graph Gq are included, so we get a sequence of subgraphs (G2, G3, ..., Gm) and a
sequence of added edges (e2, e3, ..., em). Regarding a subgraph in the sequence, say Gi,
and the edge to be added to it, ei or, more specifically, (ui, vi), there are 3 cases:
– Only ui exists in V(Gi). In this case, an α operation is needed, followed by a
join for ui→vi, and the cost is calculated as
      CI = Cα · |R(Gi)| + C→ · (ε · |R(Gi)| + |Dlistvi|)
– Only vi exists in V(Gi). In this case, a δ operation may be needed, followed
by a join for ui→vi, since the Dlist for vi may be obtained from the output of the
preceding join. When a δ operation is needed, the cost is calculated as
      CII = Cδ · |R(Gi)| + C→ · (|R(Gi)| + |Alistui|)
  The first term in CII can be eliminated if no δ operation is needed.
– Both ui and vi exist in V(Gi). In this case, a σ operation is needed. The cost is
calculated as CIII = Cσ · |R(Gi)|.
In these cost formulae, the values of |Alistui| and |Dlistvi| are obtained from the
statistics in the database catalog. The intermediate result of evaluating the subquery Gi is
denoted R(Gi); we estimate |R(Gi)| according to Section 4.3. The remaining factors
are explained as follows (a code sketch of these cost functions is given after the factor list):
– Cα : factor to approximate the cost of α operation by the cardinality of the node
vectors;
– Cδ : factor to approximate the cost of δ operation by the cardinality of the node
vectors;
[Figure 5 shows the status graph for the query Gq = I→R→T: the initial status S0 (an empty subquery), two intermediate statuses S1 (I→R) and S2 (R→T), and the final status Sf1 (I→R→T).]

Start  End  R-join  Result Size  Cost
s0     s1   I→R     3            10
s0     s2   R→T     3            10
s1     sf1  R→T     1            21
s2     sf1  I→R     1            19

Fig. 5. Searching for an Optimal Plan using DP          Fig. 6. DP on Gq



– Cσ : factor to approximate the cost of σ operation by the cardinality of the node


vectors;
– C→ : factor to approximate the cost of R-join operation by the sum of two lists’
length;
– ε: factor to approximate the length of an Alist by the cardinality of the node vectors.
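
Putting the three cases together, the move costs CI, CII and CIII can be computed from these factors as in the following sketch; the factor values themselves would have to be calibrated for the actual system, and the structure and function names are illustrative.

// Cost factors of Section 4.4 and the three move costs. 'resultSize' is the
// estimated |R(Gi)| of the partial result; list sizes come from the catalog.
struct CostFactors { double cAlpha, cDelta, cSigma, cJoin, eps; };

double costCaseI(double resultSize, double dlistSize, const CostFactors& f) {
    // only ui is already in the partial plan: alpha operation followed by an R-join
    return f.cAlpha * resultSize + f.cJoin * (f.eps * resultSize + dlistSize);
}
double costCaseII(double resultSize, double alistSize, const CostFactors& f,
                  bool needsDelta) {
    // only vi is already in the partial plan: optional delta followed by an R-join
    return (needsDelta ? f.cDelta * resultSize : 0.0)
           + f.cJoin * (resultSize + alistSize);
}
double costCaseIII(double resultSize, const CostFactors& f) {
    // both ends already present: a selection over the partial result
    return f.cSigma * resultSize;
}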

4.5 Our Dynamic Programming Algorithm

In our dynamic programming style optimization, two basic components in the solution
space are statuses and moves.

– A status, S, specifies a subquery, Gs, as an intermediate stage in generating a query


plan. To be more specific, a subquery of Gq is a subgraph Gs , where V (Gs ) ⊆ V (Gq )
and E(Gs ) ⊆ E(Gq ). Note: Gs does not necessarily be a connected graph if without
the left-deep tree restriction.
– A move from one status (subquery Gsi ) to another status (subquery Gs j ) considers
an additional edge (R-join) in Gs j that does not appear in Gsi , toward finding the en-
tire query plan for Gq . The next status is determined based on a cost function which
results in the minimal cost, in comparison with all possible moves. The process of
moving from one status to another results in a left-deep tree which is the R-join
order selection result.

We can estimate the cost of each move using the cost formulae in Sec. 4.4. Each status S is associated with a cost function, denoted cost(S), which is the minimal accumulated estimated cost to move from the initial status S0 to the current status S. Such an accumulated cost of a sequence of moves from S0 to S is the estimated cost for evaluating the subquery GS being considered under the current status S. Our goal for dynamic programming is to find the sequence of moves from the initial status S0 toward the final status Sf with the minimum cost, cost(Sf), among all the possible sequences of moves. This method is quite straightforward and its search space is bounded by 2^m.
Our algorithm is outlined in Algorithm 1. We simply apply Dijkstra's algorithm for the shortest path problem to our search space, aiming to find a "shortest" path from S0 to any Sf, where nodes represent statuses, edges represent moves, and the length of an edge is the cost of one move. We omit further explanation of Algorithm 1.

Algorithm 1. DP Algorithm to Generate Plan


l is a priority queue of statuses, sorted in increasing order of cost(S).
1: Initialize queue l as ∅;
2: Add S0 into l;
3: while l is NOT empty do
4:   S = l.first;
5:   Delete l.first from l;
6:   if S is a Final Status then
7:     Output plan P backward from S; Terminate this Algorithm;
8:   for each move from S to S′ do
9:     if S′ ∉ l then
10:      Insert S′ into l;
11:    else
12:      Update cost(S′) and l;
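To make the search concrete, the sketch below (our own illustration, not the authors' code) implements the Dijkstra-style exploration of Algorithm 1 in C++: a status is encoded as a bitmask of the R-join edges of Gq already added, a move adds one edge, and the cost of a move is supplied by a caller-provided function standing in for the formulas of Section 4.4; the connectivity/left-deep restriction on legal moves is omitted for brevity.

// Sketch of the plan search in Algorithm 1 (assumed encoding: a status is a
// bitmask over the m R-join edges of Gq; moveCost plays the role of C_I/C_II/C_III).
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

using Status = std::uint32_t;

struct Entry {
    double cost;
    Status s;
    bool operator>(const Entry& o) const { return cost > o.cost; }
};

double cheapestPlanCost(int numEdges,
                        const std::function<double(Status, int)>& moveCost) {
    const Status finalStatus = (Status(1) << numEdges) - 1;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> l;  // queue l
    std::unordered_map<Status, double> cost;                                // cost(S)
    l.push({0.0, 0});
    cost[0] = 0.0;
    while (!l.empty()) {
        Entry e = l.top(); l.pop();
        if (e.cost > cost[e.s]) continue;          // stale queue entry
        if (e.s == finalStatus) return e.cost;     // cheapest sequence of moves found
        for (int j = 0; j < numEdges; ++j) {       // every possible move from S
            if (e.s & (Status(1) << j)) continue;  // edge j already joined
            Status next = e.s | (Status(1) << j);
            double c = e.cost + moveCost(e.s, j);
            auto it = cost.find(next);
            if (it == cost.end() || c < it->second) {
                cost[next] = c;
                l.push({c, next});
            }
        }
    }
    return -1.0;  // unreachable when numEdges >= 1
}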

Example 3. For our running example, Figure 5 shows two alternative plans for evaluating the query I→R→T, both containing two moves. The status S0 is associated with a NULL graph, while S1 and S2 are respectively associated with two graphs, each having two connected nodes, and Sf1 is associated with Gq itself and is thus a final status. The detailed steps of the search for an optimal plan are shown in Figure 6, where each row of the table lists a move in the solution space. The first column is the status where the move starts and the second column is the status the move reaches. The third column is the R-join processed in that move, the fourth column gives the number of results generated after the R-join, and the last column gives the estimated cost.

5 Performance Evaluation
In this section, we conduct two sets of tests to show the efficiency of our approach. The first set of tests is designed to compare our dynamic programming approach (denoted DP) with the algorithm of [5] (denoted TSD). The second set of tests further confirms the scalability of our approach. We implemented all the algorithms using C++ on top of a Minibase-based2 variant deployed in Windows XP. We configured the buffer of the database system to be 2MB. A PC with a 3.4GHz processor, 2GB memory, and a 120G hard disk running Windows XP was used to carry out all tests.

[Fig. 7. R-join Query Graphs: (a) Q1, (b) Q2, (c) Q3, (d) Q4, built from the labels I, C, D, L and P]

Fig. 8. Datasets Statistics:
Dataset    |V|        |E|        |I|        |I|/|V|
20M        307,110    352,214    453,526    1.478
40M        610,140    700,250    901,365    1.477
60M        916,800    1,003,437  1,360,559  1.484
80M        1,225,216  1,337,378  1,816,493  1.483
100M       1,666,315  1,756,509  2,269,465  1.485

We generated 20M, 40M, 60M, 80M and 100M size XMark datasets [9] using 5 different factors, 0.2, 0.4, 0.6, 0.8, and 1.0 respectively, and named each dataset by its size. In these XML documents, we treat parent-child edges and ID/IDREF edges identically to obtain graphs, and collapse the strongly connected components in the graphs to get DAGs. The details of the datasets are given in Fig. 8. In Fig. 8, the first column is the dataset name. The second and third columns are the node number and edge number of the resulting DAG respectively. The fourth column is the multiple interval labeling size, while the last column shows the average number of intervals per node in the DAG. Throughout all experiments, we use the four multi R-join queries listed in Fig. 7, where the label I stands for interest, C for category, L for listitem, D for description and P for parlist.

5.1 TwigStackD vs. DP

We test all queries over the same dataset described in Section 3 in order to compare the TwigStackD algorithm with our approach. We show two sets of figures reporting the elapsed time, the number of I/Os, and the memory used to process each query.
2 Developed at Univ. of Wisconsin-Madison.

[Fig. 9. Comparison on the DAG with 10% and 50% remaining edges included, for TSD and DP over queries Q1–Q4: (a) Elapsed Time (sec), (b) Number of I/Os, (c) Memory (MB) with 10%; (d) Elapsed Time (sec), (e) Number of I/Os, (f) Memory (MB) with 50%]

The first set of figures shows the performance on the DAG with 10 percent of the remaining edges added, listed in Fig. 9 (a)-(c), and the second set shows the performance on the DAG with 50 percent of the remaining edges added, listed in Fig. 9 (d)-(f).
As shown in Fig. 9, our approach significantly outperforms TwigStackD in terms of elapsed time, number of I/O accesses, and memory consumption. The difference becomes even greater for a denser DAG, due to the rapid performance degradation of TwigStackD when the number of edges in the DAG increases. For example, for Q3, TwigStackD uses 16.7 times the elapsed time and 8.7 times the I/O accesses of our approach when 10 percent of the remaining edges are added; when 50 percent of the remaining edges are added, the two ratios become 2922.3 and 266.4 respectively. The memory usage of TwigStackD is unstable, and ranges from 60MB to 900MB for the 4 queries, because TwigStackD needs to buffer every node that can potentially participate in any final solution and thus depends largely on the solution size. It can also be observed that larger queries generally need more memory because of TwigStackD's increased buffer pool requirements.

5.2 Scalability Test of Our Approach


Because TwigStackD does not scale well, in this section we report only the scalability of DP. With the size of the dataset increasing from 20M to 100M, we tested the scalability of our approach; Fig. 10 shows the results.
Both the number of I/Os and the memory usage increase evenly as the size of the underlying DAG increases. However, the processing time of each query does not vary as uniformly with the data size. A main reason for this is the CPU overhead caused by the sorting required in α and δ operations: a different data distribution may result in a different join processing order, and hence a different number of those operations for the same query. Nevertheless, there is no abrupt change in processing time; the overall performance remains acceptable and all queries complete within tens of seconds.

[Fig. 10. Scalability Test on DP: (a) Elapsed Time (sec), (b) Number of I/Os, (c) Memory (MB) for queries Q1–Q4 over XMark dataset sizes 20M–100M]

6 Conclusion
In this paper, we studied query processing of multi reachability joins (R-joins) over a large DAG. The most up-to-date approach, the TwigStackD algorithm, uses a single interval encoding scheme. TwigStackD assigns to each node in a DAG a single interval based on a spanning tree it obtains from the DAG, and builds a complementary index called SSPI. It uses a twig-join algorithm to find matches that exist in the spanning tree and buffers all nodes that belong to any solution, in order to find all matches in the DAG, with the help of SSPI. TwigStackD has good performance for rather sparse DAGs, but its performance degrades noticeably when the DAG becomes dense, due to the high overhead of accessing edge transitive closures.
We present an approach using an existing multiple interval encoding scheme that assigns to each node multiple intervals. With the multiple encoding scheme, no additional data structure is needed. We show that optimizing R-joins (R-join order selection), using dynamic programming with a primitive implementation of R-join, can significantly improve the performance, even though such an approach may introduce overhead for feeding the intermediate result of one R-join to another. We conducted extensive performance studies and confirmed the efficiency of our DP approach. DP significantly outperforms TwigStackD, and is not sensitive to the density of the underlying DAG.

Acknowledgment
This work was supported by a grant of RGC, Hong Kong SAR, China (No. 418206).

References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured
data and XML. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000.
2. R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient management of transitive relationships
in large data and knowledge bases. In Proc. of SIGMOD’89, 1989.
3. D. Brickley and R. V. Guha. Resource Description Framework (RDF) Schema Specification
1.0. W3C Candidate Recommendation, 2000.
4. N. Bruno and N. Koudas et al. Holistic twig joins: Optimal XML pattern matching. In Proc. of
SIGMOD’02.
5. L. Chen and A. G. et al. Stack-based algorithms for pattern matching on DAGs. In Proc. of
VLDB’05.

6. J. Cheng and J. X. Yu et al. Fast reachability query processing. In Proc. of DASFAA’06.
7. S. DeRose, E. Maler, and D. Orchard. XML linking language (XLink) version 1.0. 2001.
8. S. DeRose, E. Maler, and D. Orchard. XML pointer language (XPointer) version 1.0. 2001.
9. A. Schmidt and F. W. et al. XMark: A benchmark for XML data management. In Proc. of
VLDB’02.
10. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path
selection in a relational database management system. In Proc. SIGMOD’79, pages 23–34,
1979.
11. H. Wang, W. Wang, X. Lin, and J. Li. Labeling scheme and structural joins for graph-
structured XML data. In Proc. of The 7th Asia Pacific Web Conference, 2005.
12. H. Wang, W. Wang, X. Lin, and J. Li. Subgraph join: Efficient processing subgraph queries
on graph-structured XML document. In Proc. of WIAM’02, 2005.
A Path-Based Approach for Efficient Structural
Join with Not-Predicates

Hanyu Li, Mong Li Lee, Wynne Hsu, and Ling Li

School of Computing, National University of Singapore


{leeml,whsu,lil}@comp.nus.edu.sg

Abstract. There has been much research on XML query processing.


However, there has been little work on the evaluation of XML queries
involving not-predicates. Such queries are useful and common in many
real-life applications. In this paper, we present a model called XQuery
tree to model queries involving not-predicates and describe a path-based
method to evaluate such queries efficiently. A comprehensive set of ex-
periments is carried out to demonstrate the effectiveness and efficiency
of the proposed solution.

1 Introduction
Research on XML query processing has been focused on queries involving struc-
tural join, e.g., the query ”//dept[/name=”CS”]//professor” retrieves all the
professors in the CS department. However, many real world applications also
require complex XML queries containing not-predicates. For example, the query
”//dept[NOT(/name=”CS”)]//professor” retrieves all the professors who are
not from the CS department. We call this class of queries negation queries.
A naive method to evaluate negation queries is to decompose them into several
normal queries involving structural join operations. Each decomposed query can
be evaluated using any existing structural join method [4,6,7,8,9,12,11], followed
by a post-processing step to merge the results. This simplistic approach is expen-
sive because it requires repeated data scans and incurs overhead to merge the inter-
mediate results. The work in [10] proposes a holistic path join algorithm which is
effective for path queries with not-predicates, while [14] develops a method called
TwigStackList¬ to handle a limited class of twig queries with not-predicates,
i.e., queries with answer nodes above any negative edge.
In this paper, we propose a path-based approach to handle a larger class
of negation queries efficiently, i.e., queries with answer nodes both above and
below negative edges. We introduce a model called XQuery tree to model queries
involving negated containment relationship. We utilize the path-based labeling
scheme in [11] for queries involving not-predicates. Experiment results indicate
that the path-based approach is more efficient than TwigStackList¬[14].
The rest of the paper is organized as follows. Section 2 reviews related work.
Section 3 illustrates the drawback of the TwigStackList¬ method. Section 4
describes the proposed path-based approach. Section 5 gives the experimental
results and we conclude in Section 6.


2 Related Work

The structural join has become a core operation in XML queries [4,6,7,8,9,12,11].
The earliest work [12] uses a sort-merge or a nested-loop approach to process the
structural join. Index-based binary structural join solutions employ the B+-tree [7],
XB-tree [6], and XR-tree [8] to process queries efficiently. Subsequent works extend
the binary structural join to holistic twig joins. Bruno et al. [6] propose a holistic twig
join algorithm, TwigStack, which aims at reducing the size of the intermediate
result and is optimal for ancestor-descendant relationships, while [13] designs an
algorithm called TwigStackList to handle parent-child relationships. The work
in [11] designs a path-based labeling scheme to reduce the number of elements
accessed in a structural join operation.
Al-Khalifa et al. [5] examine how the binary structural join method can be
employed to evaluate negation in XML queries. Algorithm PathStack¬ [10] uti-
lizes a boolean stack to answer negation queries. The boolean stack contains a
boolean variable ”satisfy” which indicates whether the associated item satisfies
the sub-path rooted at this node. In this way, a negation query does not need to
be decomposed, thus improving the query evaluation process.
Algorithm TwigStackList¬ [14] extends the algorithm TwigStackList [13] to
handle holistic twig negation queries. TwigStackList¬ also avoids decomposing
holistic negation queries into several sub-queries without negations. However,
TwigStackList¬ can only process a limited class of negation queries and suffers
from high computational cost (see Section 3). In contrast, our approach utilizes
the path-based labeling scheme in [11] to filter out unnecessary element nodes
efficiently and handles a larger class of negation queries.

3 Motivating Example

TwigStackList¬ [14] defines a query node as an output node if it does not appear
below any negative edge, otherwise, it is a non-output node. Consider query
T1 in Fig. 1(b) where {B } is an output node and {D, E, F } are non-output
nodes. Suppose we issue query T1 over the XML document Doc1 in Fig. 1(a)
whose element nodes have been labeled using the region encoding scheme [4].
TwigStackList¬ associates a list LB and a stack SB for the output node B.
Element B1 in the XML document is first inserted into the list LB . Since B1
satisfies the not-predicate condition in query T1 , it is also pushed into the stack
SB . Next, element B2 is inserted into LB . B2 is subsequently deleted from LB
since its descendent element D1 has child nodes E2 and F1 , thus satisfying the
sub-query rooted at D in T1 . The final answer for T1 is B1 .
There are two main drawbacks in Algorithm TwigStackList¬. First, the class
of negation queries which can be processed is limited to output nodes occurring
above any negative edge. Hence, it cannot handle meaningful complex queries
such as T2 in Fig. 1(c) which retrieves all the matching occurrences of elements
B and C such that B is not a child of A and B has child nodes C and D while D
has a child node E but does not have a descendant node F (we call nodes B and
C projected nodes). Second, TwigStackList¬ may access elements which are not answers to a query. For example, to answer query T1, a series of operations is also carried out on element B2, which is not in the final answer. Our proposed path-based approach aims to overcome these two drawbacks.

[Fig. 1. Example XML document and queries: (a) XML document Doc1 with region-encoded elements A1 (1:15, 1), B1 (2:6, 2), B2 (7:14, 2), C1 (3:5, 3), C2 (8:13, 3), E1 (4:4, 4), D1 (9:12, 4), E2 (10:10, 5), F1 (11:11, 5); (b) query T1; (c) query T2]

4 Path-Based Approach
The proposed approach to evaluate XML negation queries utilizes the path-based labeling scheme proposed in [11]. We first review the scheme and introduce the XQuery tree model to represent negation queries. Then we describe the algorithms P Join¬ and N Join¬, which remove the unnecessary elements and carry out the structural join operation respectively.

4.1 Path-Based Labeling Scheme


The path-based labeling scheme [11] identifies each element node by a pair of
(path id, node id). Each text node is identified by a node id. The node id can be
assigned using any existing node labeling scheme, e.g., the interval-based scheme [12]. A path id is composed of a sequence of bits. We first omit the text nodes from an XML document. Then we find the distinct root-to-leaf paths in the XML document by considering only the tag names of the elements on the paths. We use an integer to encode each distinct root-to-leaf path in the XML document. The number of bits in a path id is given by the number of distinct root-to-leaf sequences of tag names that occur in the XML document: if k denotes the number of distinct root-to-leaf paths, then the path id of an element node has k bits. For a leaf element node, all the bits except the ith bit are set to 0, where i is the encoding of the root-to-leaf path on which the leaf node occurs.
The path id of a non-leaf element node is given by a bit-or operation on the path
ids of all its child element nodes.
Fig. 2(a) shows the XML document Doc1 labeled using the path-based label-
ing scheme. The corresponding encoding table is given in Fig. 2(b).

[Fig. 2. Example to illustrate Path Labeling Scheme and XQuery Tree: (a) XML document Doc1 with path-based labels A1 (111, 1), B1 (100, 2), B2 (011, 5), C1 (100, 3), C2 (011, 6), E1 (100, 4), D1 (011, 7), E2 (010, 8), F1 (001, 9); (b) Encoding Table: Root/A/B/C/E → 1, Root/A/B/C/D/E → 2, Root/A/B/C/D/F → 3; (c) Query T2]

Let P idA and P idD be the path ids for elements with tags A and D respec-
tively. If (P idA & P idD ) = P idD , then we say P idA contains P idD . This is
called Path ID Containment. Li et al. [11] prove that the containment of two
nodes can be deduced from the containment of their path ids.

Property I: Let PidA and PidD be the path ids for elements with tags A and D respectively. If PidA contains PidD and PidA ≠ PidD, then each A with PidA must have at least one descendant D with PidD.
Consider the element nodes B2 and E2 in Doc1. The path id 011 for B2 contains the path id 010 for E2, since the bit-and operation between 011 and 010 equals 010 and the two path ids are not equal. Therefore, B2 must be an ancestor of E2.
If two sets of nodes have the same path ids, then we need to check their
corresponding root-to-leaf paths to determine their structural relationship. For
example, the nodes B1 and E1 in Doc1 have the same path id 100. We can
decompose the path id 100 into one root-to-leaf path with the encoding 1 since
the bit in the corresponding position is 1. By looking up the first path in the
encoding table (Fig. 2(b)), we know that B1 is an ancestor of E1 .
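The containment test and Property I translate directly into bit operations; the following is a minimal sketch (our own, assuming path ids fit into a 64-bit word; longer ids would need a bitset):

// Path id containment from Section 4.1: PidA contains PidD iff (PidA & PidD) == PidD.
#include <cstdint>

bool contains(std::uint64_t pidA, std::uint64_t pidD) {
    return (pidA & pidD) == pidD;
}

// Property I: strict containment of path ids implies that every element with
// pidA has at least one descendant with pidD.
bool impliesAncestor(std::uint64_t pidA, std::uint64_t pidD) {
    return contains(pidA, pidD) && pidA != pidD;
}

// Example from the text: impliesAncestor(0b011, 0b010) is true,
// so B2 (path id 011) must be an ancestor of E2 (path id 010).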

4.2 XQuery Tree


In this section, we define a model called XQuery tree to model queries involving
not-predicates. This is accomplished by augmenting the standard XML query
pattern tree with two new features: node projection and not operator.

Definition 1 (XQuery Tree). An XQuery Tree is defined as a tree T = (V, E)


where V and E denote the set of nodes and edges respectively.
1. A single edge denotes a parent-child relationship while a double edge denotes
an ancestor-descendant relationship.
2. Nodes to be projected are circled.
3. A negated containment relationship between two nodes is specified by putting
the symbol “¬” next to the edge. We call such an edge a negated edge.

Fig. 2(c) shows an example negation query modeled using the XQuery tree. The
equivalent query specified using the XQuery language is as follows:

For $v In //B
Where exists($v/C) and exists($v/D/E) and
      count(A/$v) = 0 and count($v/D//F) = 0
Return {$v} {$v/C}

Note that negated edges cannot occur between the projected nodes of a query
since they would result in queries that are meaningless, e.g., retrieve all the
elements A and B such that A does not contain B. Therefore, we can deduce
that given an XQuery tree T, there exists some subtree T′ of T such that T′
contains all the projected nodes in T and all edges in T′ are not negated edges.

Definition 2 (Projected Tree TP ). Let T = (V, E) be an XQuery tree, and


S be the set of subtrees T′ = (V′, E′) of T, such that

1. V′ ⊆ V, and
2. V′ contains all the projected nodes in T, and
3. for any e ∈ E′, e is not a negated edge.

The largest T′ in S is defined as the projected tree TP of T.

The projected tree of the XQuery tree in Fig. 2(c) is shown within the dashed
circle. Given an XQuery tree T , we define the subtree above TP as tree TPa and
the subtree below TP as TPb respectively.

Definition 3 (Tree TPa ). Given an XQuery tree T , let R be the root node of
TP , and e be the incoming edge of R. We define TPa as the subtree obtained from
T - TR - e, where TR denotes the subtree rooted at R.

Definition 4 (Tree TPb ). Given an XQuery tree T , we define TPb as the subtree
rooted at C, where C denotes a child node of the leaf nodes of TP .

In Fig. 2(c), the nodes A and F form the trees TPa and TPb of T respectively.
Note that an XQuery tree T has at most one TPa and possibly multiple TPb . A
tree TPa or TPb may contain negated edges. However, queries with negated edges
in TPa or TPb may have multiple interpretations. For example, the query “A does
not contain B, and B does not contain C, where C is the projected node” has
different semantics depending on the applications. Here, we focus on queries
whose subtrees TPa and TPb do not contain any negated edges.

4.3 Algorithm P Join¬


Algorithm P Join [11] filters out unnecessary path ids for queries involving struc-
tural join. The main operation in P Join is the binary path join. A binary path
join takes as input two lists of path ids, one for the parent node and the other
for the child node. A nested loop is used to find the matching pairs of path ids

based on the path id containment property. Any path id that does not satisfy
the path id containment relationship is removed from the lists of path ids of
both the parent node and the child node. However, this algorithm does not work
well for queries involving not-predicates.
Consider query T3 in Fig. 3(a) where the lists of path ids have been associ-
ated with the corresponding nodes. We assume that the path ids with the same
subscripts satisfy the path id containment relationship, i.e., b2 contains c2 , etc.

[Fig. 3. Example to illustrate Algorithm P Join¬: (a) query T3 with its path id lists; (b) the path ids after P Join; (c) the optimal path id sets]

Algorithm P Join will first perform a bottom-up binary path join. The path id
lists for nodes C and D are joined. Since the path ids d2, d3 and d4 are contained in the path ids c2, c3 and c4 respectively, d1 is removed from the set of path ids of D. The path id list of node C is joined with the path id list of node E. No path id is removed since each path id of E is contained in some path id of C. We join the path id list of node B with that of node C. The path ids c2 and c3 are contained in the path ids b2 and b3 respectively. Since there is a not-predicate condition between nodes B and C, the path ids b2 and b3 need to be removed from the set of path ids of B. Finally, a binary path join between nodes A and
B is carried out and the path id a2 is removed.
Next, Algorithm P Join carries out a top-down binary path join on T3 starting
from the root node A. The final result is shown in Fig. 3(b). The optimal sets of
path ids for the nodes in T3 are shown in Fig. 3(c). The difference in the two sets
of path ids shown in Fig. 3(b) and Fig. 3(c) is because Algorithm P Join does
not apply the constraint that is imposed on nodes A and B to the entire query.
The above example illustrates that the proper way to evaluate a negated
containment relationship between path ids is to only update the path ids of the
nodes in the projected tree. This leads to the design of Algorithm P Join¬.
The basic idea behind P Join¬ (Algorithm 1) is that given a query T , we first
apply P Join on TPa and TPb . The path ids of the leaf node of TPa and the root
node(s) of TPb are used to filter out the path ids of the corresponding nodes in
TP . The input to Algorithm P Join¬ is an XQuery tree T with a set of pro-
jected nodes. We first determine the projected tree TP of T . Then the P Join
algorithm is carried out on TPb and TPa (if any) respectively (lines 4-5). Next, a
bottom-up binary path join and a top-down binary path join are performed on TP

Algorithm 1. P Join¬
1: Input: T - An XQuery-tree
2: Output: Path ids for the nodes in T
3: Associate every node in T with its path ids;
4: Perform a bottom-up binary path join on TPb and TPa ;
5: Perform a top-down binary path join on TPb and TPa ;
6: Perform a path anti-join between the root node(s) of TPb and their parent node(s)
if necessary;
7: Perform a bottom-up binary path join on TP ;
8: Perform a path anti-join between the leaf node of TPa with its child node if necessary;
9: Perform a top-down binary path join on TP ;

(lines 7, 9). Each binary path join operation is followed by a path antijoin op-
eration (lines 6, 8). A path antijoin takes as input two lists of path ids, but one
list of path ids is for reference; only path ids in the other list need to be removed
if necessary. In line 6(8), the Algorithm P Join¬ utilizes the root(leaf) nodes of
TPb (TPa ) to filter out the path ids of their parent(child) node(s).
Note that if the set of path ids for the root node (leaf node) of TPb (TPa )
contains some path id whose corresponding element node is not a result of T
(super P id set), then the path antijoin operation in Lines 6 (8) of Algorithm 1
is skipped. This is because the super P id set of the root node (leaf node) of TPb
(TPa ) could erroneously remove path ids from its parent node (child node), and
we may miss some correct answers in the final query result.
Consider again query T3 in Fig. 3(a). The projected tree is the subtree rooted
at node C. A P Join is first performed on tree TPa which contains nodes A and
B. The set of path ids for B obtained is {b1 , b2 }. Next, bottom-up path join is
carried out on TP . Since TPa is a simple path query without value predicates, the
path id set associated with B is not a super P id set according to the discussion
in [11]. Then we can perform a path anti-join between nodes B and C. This step
eliminates c2 from the path id set of C since c2 is contained in b2 . Finally, a
top-down path join is performed on TP , which eliminates d1 and d2 from the set
of path ids for D, and e2 from the set of path ids for E. The final result after
P Join¬ is shown in Fig. 3(c).
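As an illustration of the two primitives used above, the sketch below (our own code, reusing the contains() test of Section 4.1 and assuming word-sized path ids) shows a binary path join, which prunes both path id lists, and a path anti-join, which only prunes the target list against a read-only reference list. The version shown removes target ids contained in a reference id, which is the direction used for the negated edge between B and C in the example; the symmetric case simply swaps the roles in the containment test.

// Sketch of the binary path join and path anti-join primitives used by P Join¬.
#include <algorithm>
#include <cstdint>
#include <vector>

static bool contains(std::uint64_t parent, std::uint64_t child) {
    return (parent & child) == child;
}

// Binary path join: keep only path ids that have a containment match on the other side.
void binaryPathJoin(std::vector<std::uint64_t>& parents,
                    std::vector<std::uint64_t>& children) {
    std::vector<std::uint64_t> keptP, keptC;
    for (std::uint64_t p : parents)
        if (std::any_of(children.begin(), children.end(),
                        [&](std::uint64_t c) { return contains(p, c); }))
            keptP.push_back(p);
    for (std::uint64_t c : children)
        if (std::any_of(parents.begin(), parents.end(),
                        [&](std::uint64_t p) { return contains(p, c); }))
            keptC.push_back(c);
    parents.swap(keptP);
    children.swap(keptC);
}

// Path anti-join: `reference` is read-only; remove from `target` every path id
// contained in some reference path id (e.g. drop c2 because b2 contains it).
void pathAntiJoin(std::vector<std::uint64_t>& target,
                  const std::vector<std::uint64_t>& reference) {
    target.erase(std::remove_if(target.begin(), target.end(),
                                [&](std::uint64_t t) {
                                    return std::any_of(reference.begin(), reference.end(),
                                                       [&](std::uint64_t r) { return contains(r, t); });
                                }),
                 target.end());
}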

4.4 Algorithm N Join¬


We retrieve the elements with path ids output by Algorithm P Join¬ and apply
Algorithm N Join¬ on these elements to obtain the result of the negation queries.
Algorithm 2 shows the details of the N Join¬ method. If the TPa of the negation query is null, Algorithm N Join¬ uses the method TwigStackList¬ [14] to cal-
culate the final result. This is because TwigStackList¬ cannot handle queries
with TPa as illustrated in Section 3. Otherwise, we will use the holistic structural
join in [6] to evaluate the trees TPa , TPb and TP , and then merge the results.

Algorithm 2. N Join¬
1: Input: T - An XQuery-tree
2: Output: All occurrences of nodes in TP
3: if TPa is null then
4: Perform TwigStackList¬ on T;
5: else
6: Perform holistic structural join on TPa , TPb and TP ;
7: Merge the intermediate result;
8: end if

4.5 Optimality of Path-Based Approach


The optimality of the proposed solution is due to Algorithm P Join¬. This step
can greatly reduce the number of elements accessed by Algorithm N Join¬.
Consider the query T1 (Fig. 1(b)) issued over the XML document Doc1 in
Fig. 2(a). The path ids of each node are shown in Fig. 4(a). When Algorithm
P Join¬ is applied on T1 , the path id {100 } is removed from the path id set of
node E and {011 } is removed from the path id set of node B (see Fig. 4(b)).
There is only one path id left for node B after P Join¬, which corresponds to
element B1 . Further processing of element B2 is not needed. Experimental results
in the next section indicate that the relatively inexpensive P Join¬ can greatly
filter out the irrelevant element nodes.
The other advantage of our approach is that by decomposing negation queries
into three parts (TP , TPa and TPb ), we can handle an additional class of queries
compared to the method TwigStackList¬.

[Fig. 4. Example to illustrate optimality of path-based approach: (a) path id sets of T1 before P Join¬: B {100, 011}, D {011}, E {100, 010}, F {001}; (b) after P Join¬: B {100}, D {011}, E {010}, F {001}]

5 Experiment Evaluation

In this section, we examine the performance of the proposed path-based solution


for negation queries. We also compare our method with TwigStackList¬[14].
Both approaches are implemented in C++. All experiments are carried out on
a Pentium IV 2.4 GHz CPU with 1 GB RAM. The operating system is Linux
2.4. The page size is set to be 4KB.

We use three real world datasets for our experiments. They are Shakespeare’s
Plays (SSPlays) [1], DBLP [2] and XMark benchmark [3]. Table 1 shows the
characteristics of the datasets and Table 2 gives the query workload.

Table 1. Characteristics of Datasets

Dataset    Size      Distinct Elements   Elements
SSPlays    7.5 MB    21                  179,690
DBLP       60.7 MB   32                  1,534,453
XMark      61.4 MB   74                  959,495

Table 2. Query Workload

Query   Query Expression                               Dataset   Nodes in Result
Q1      //PLAY[ NOT(/PROLOGUE)]/EPILOGUE//TITLE        SSPlays   13
Q2      //dblp/article[ NOT(//url)]                    DBLP      14
Q3      //person[ NOT(/creditcard)]                    XMark     7618
Q4      //people/person[ NOT(/age)]/profile/education  XMark     9568

5.1 Effectiveness of P Join¬


We first evaluate the effectiveness of P Join¬ in filtering out irrelevant elements for the subsequent N Join¬ operation. The following metrics are used:

Filtering Efficiency = Σi |Ni^p| / Σi |Ni|

Selectivity Rate = Σi |Ni^n| / Σi |Ni|

where |Ni^p| denotes the number of instances of node Ni after the P Join¬ operation, |Ni^n| denotes the number of instances of node Ni in the result set after the N Join¬ operation, and |Ni| denotes the total number of instances of node Ni in the projected tree of the query.
Fig. 5(a) shows the Filtering Efficiency and the Selectivity Rate for queries Q1
to Q4. The closer the two values are, the more effective P Join¬ is for the query.
We observe that Algorithm P Join¬ is able to remove all the unnecessary ele-
ments for queries Q1, Q2 and Q3 and the subsequent N Join¬ will not access
any element that does not contribute to the final result, leading to optimal query
evaluation. Query Q4 has a higher Filtering Efficiency value than Selectivity Rate because the query node person, the parent of the subtree rooted at node age, is a branch node. The set of path ids for person is a super Pid set.
Nevertheless, Algorithm P Join¬ remains effective in eliminating unnecessary
path ids even for such queries.
Fig. 5(b) and (c) show that the I/O cost and elapsed time of Algorithm
P Join¬ are marginal compared with N Join¬ for queries Q1 to Q4. This is
because the sizes of the path lists are much smaller than those of the node lists. The time cost of P Join¬ for queries Q3 and Q4 is slightly higher than for Q1 and Q2 due to the larger number of distinct paths, as well as the longer path ids, of the XMark dataset.

[Fig. 5. Effectiveness of P Join¬: (a) Filtering Efficiency vs. Selectivity Rate; (b) P Join¬ and N Join¬ (I/O cost); (c) P Join¬ and N Join¬ (Time)]

[Fig. 6. TwigStackList¬ vs. Path-Based: (a) Elements Accessed; (b) I/O Cost; (c) Elapsed Time]



5.2 Comparative Experiments

In this set of experiments, we compare our solution with TwigStackList¬[14].


Fig. 6 shows the results. The path-based solution outperforms TwigStackList¬
because Algorithm P Join¬ is able to greatly reduce the actual number of el-
ements retrieved while TwigStackList¬ is designed to reduce the intermediate
result sizes and may access all the elements involved in the queries. For example,
TwigStackList¬ must read in the full sets of elements when evaluating Q1.

6 Conclusion

In this paper, we have described a path-based approach to evaluate negation


queries. We introduced a model called XQuery tree to model queries involving
negated containment relationship. The proposed approach utilizes a path-based
labeling scheme to filter out irrelevant elements. Experimental results indicate
that the path-based approach is more efficient than TwigStackList¬ and is ef-
fective for a larger class of negation queries.

References
1. http://www.ibiblio.org/xml/examples/shakespeare.
2. http://www.informatik.uni-trier.de/˜ley/db/.
3. http://monetdb.cwi.nl/.
4. A. Al-Khalifa, H. V. Jagadish, and J. M. Patel et al. Structural joins: A primitive
for efficient xml query pattern matching. IEEE ICDE, 2002.
5. S. Al-Khalifa and H. V. Jagadish. Multi-level operator combination in xml query
processing. ACM CIKM, 2002.
6. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal xml pattern
matching. ACM SIGMOD, 2002.
7. S-Y. Chien, Z. Vagena, and D. Zhang et al. Efficient structural joins on indexed
xml documents. VLDB, 2002.
8. H. Jiang, H. Lu, W. Wang, and B. C. Ooi. Xr-tree: Indexing xml data for efficient
structural joins. IEEE ICDE, 2003.
9. H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed xml documents.
VLDB, 2003.
10. E. Jiao, T-W. Ling, C-Y. Chan, and P. S. Yu. Pathstack¬: A holistic path join
algorithm for path query with not-predicates on xml data. DASFAA, 2005.
11. H. Li, M-L. Lee, and W. Hsu. A path-based labeling scheme for efficient structural
join. International Symposium on XML Databases, 2005.
12. Q. Li and B. Moon. Indexing and querying xml data for regular path expressions.
VLDB, 2001.
13. J. Lu, T. Chen, and T-W. Ling. Efficient processing of xml twig patterns with
parent child edges: A look-ahead approach. CIKM, 2004.
14. T. Yu, T-W. Ling, and J. Lu. Twigstacklist¬: A holistic twig join algorithm for
twig query with not-predicates on xml data. DASFAA, 2006.
RRPJ: Result-Rate Based Progressive
Relational Join

Wee Hyong Tok, Stéphane Bressan, and Mong-Li Lee

School of Computing
National University of Singapore
{tokwh,steph,leeml}@comp.nus.edu.sg

Abstract. Progressive join algorithms are join algorithms that produce


results incrementally as input data is available. Because they are non-
blocking, they are particularly suitable for online processing of data
streams. Reference algorithms of this family are the symmetric hash join,
the X-join and more recently, the rate-based progressive join (RPJ).
The symmetric hash join introduces the idea of symmetric processing
of the input streams but assumes sufficient main memory; the
X-Join suggests that the processing can scale to very large amounts of
data if main memory is regularly flushed to disk, and a reactive/cleanup
phase is triggered for disk-resident data. The X-join flushing strategy
is based on a simple largest-first strategy, where the largest partition is
flushed to disk. The recently proposed RPJ predicts the main memory
tuples or partitions that should be flushed to disk in order to maximize
throughput by computing their probabilities to contribute to a result.
In this paper, we discuss the limitations of RPJ and propose a novel
extension, called Result Rate-based Progressive Join (RRPJ), which ad-
dresses these limitations. Instead of computing the probabilities from
statistics over the input data, RRPJ directly observes the output (result)
statistics. This not only yields a better performance, but also simplifies
the generalization of the algorithm to non-relational data such as multi-
dimensional data and hierarchical data. We empirically show that RRPJ
is effective and efficient and outperforms the state-of-the-art RPJ. We also
investigate the relevance and performance of an adaptive version of these
algorithms using amortization parameters.

Keywords: Query Processing, Join Algorithms, Data Streams.

1 Introduction
The universe of network-accessible information is expanding. It is now common
practice for applications to process streams of data incoming from remote sources
(repositories continuously publishing or sensor networks producing continuous
data). An essential operation is the equijoin of two data streams of relational
data. The design of such an algorithm must meet a key requirement:
the algorithm must be non-blocking (or progressive), i.e. it must be able to
produce results as soon as possible, at the least possible expense for the overall
throughput.


Several non-blocking algorithms for various operators in general and for the
relational equijoin in particular have been proposed [1,2,3,4,5]. These algorithms
can be categorized as heuristic or probabilistic methods. Heuristic methods rely
on pre-defined policies for the efficient usage of the available memory; whereas
probabilistic methods [6,7] attempt to model the incoming data distribution (val-
ues and arrival parameters) and use it to predict the tuples or partitions that
are kept in memory in order to produce the maximum number of result tuples.
The main thrust in all these techniques lies in the simple idea of keeping useful
tuples or partitions (i.e. tuples or partitions likely to produce more results) in
memory. Amongst the many progressive join algorithms introduced, one of the
state-of-the-art hash-based progressive join algorithms is the Rate-based Progressive
Join (RPJ) [6]. One of the limitations of RPJ is that it does not perform
well if the data within the partitions is non-uniform, and that it is not straight-
forward to generalize it to non-relational data. In this paper, we propose the
Result-Rate based Progressive join (RRPJ) which overcomes these limitations.
The rest of the paper is organized as follows: In Section 2, we discuss related
work and focus on two recent progressive join algorithms, and their strengths
and limitations. In Section 3, we present a novel method, called Result Rate-
based Progressive Join (RRPJ), which uses a model of the result distribution
to determine which tuples are to be flushed. We conduct an extensive performance
study in Section 4. We conclude in Section 5.

2 Progressive Join Algorithms

In the literature, many equijoin algorithms [2,3,4,5,8,9,10,11] have been pro-


posed. Most of these algorithms considered local datasets, and do not generalize
easily to handle the unpredictable data arrival common in data stream environments.
Many of these equijoin algorithms are based on the seminal work on the symmetric
hash join (SHJ) [12]. SHJ assumes the use of in-memory hash tables; an insert-
probe paradigm is used to deliver results progressively to users. In the literature,
many subsequently proposed progressive relational join algorithms are based on
an extended SHJ model, where both in-memory and disk-resident hash parti-
tions are used to store tuples. Whenever memory becomes full, some in-memory
tuples need to be flushed to disk to make space for new-arriving tuples. XJoin
[3] uses a simple heuristic that flushes the largest partitions. Throughput can
be improved if the sacrificed tuples are those with the smallest probability of
joining with future tuples, i.e. of contributing to the production of results. This
heuristic is the basis of two recent proposals [6,7] that attempt to extrapolate
stochastic models of the data and their productivity from the observation of
incoming tuples.

2.1 Problem Definition

We consider the problem of performing a relational equijoin between two re-


lational datasets, which are transmitted from remote data sources through an

unpredictable network. Let the two sets of relational data objects be denoted by
R = {r1, r2, ..., rn} and S = {s1, s2, ..., sm}, where ri and sj denote the i-th and j-th data objects from their remote data sources respectively. When performing a relational equijoin with join attribute A, a result is returned when ri.A is equal to sj.A. Formally, (ri, sj) is reported as a result if ri.A = sj.A.
The goal is to deliver initial results quickly and ensure a high result-throughput.

2.2 Rate-Based Progressive Join (RPJ)


RPJ [6] is a hash-based join. It builds a stochastic model based on the tuples' arrival pattern. Whenever memory becomes full, the model is used to determine probabilistically which tuples are least likely to produce result tuples with the other incoming data; these tuples are then flushed from memory to disk.
In order to compute the conditional probability that an incoming tuple t belongs to the j-th partition, given that t belongs to relation Ri, RPJ keeps track of the total number of tuples from relation i that have arrived and fall into partition j, denoted by ni^total[j]. By dividing ni^total[j] by the total number of Ri tuples that have arrived so far, the conditional probability P(j|Ri) can be derived as

P(j|Ri) = ni^total[j] / Σk=1..npart ni^total[k]

To reduce the need to track conditional probabilities for each value in the domain of the join attribute, RPJ assumes that the data in each partition is uniformly distributed (i.e. the local uniformity assumption). The probabilities P(R1) and P(R2) are estimated by maintaining a counter ni^rcnt for each relation Ri (initially set to the number of Ri tuples arriving in the initial time interval [0,1]). Subsequently, RPJ counts the number of tuples, denoted by αi(t), arriving in the interval [t, t+1]. To more accurately reflect current arrivals, and to reduce the impact of historical arrivals, RPJ updates the value of ni^rcnt to λ·ni^rcnt + (1−λ)·αi(t), where λ is a user-tunable parameter in [0,1].
Thus, the arrival probability pi^arr(v) of a tuple belonging to relation Ri and having value v is computed, for the partition j into which v falls, as

Pi^arr[j] = (ni^total[j] / Σk=1..npart ni^total[k]) · (ni^rcnt / (n1^rcnt + n2^rcnt))

(refer to [6] for the complete proof).
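A compact sketch (our reading of the formulas above, not the RPJ authors' code) of how these statistics could be maintained; the member names mirror the symbols ni^total[j], ni^rcnt, αi(t) and λ:

// Sketch of the RPJ arrival-probability bookkeeping for two relations.
#include <vector>

struct RpjStats {
    std::vector<long> nTotal[2];   // nTotal[i][j] = n_i^total[j]
    double nRcnt[2] = {0.0, 0.0};  // n_i^rcnt, amortized recent-arrival counters
    double lambda = 0.5;           // user-tunable parameter in [0,1]

    explicit RpjStats(int npart) {
        nTotal[0].assign(npart, 0);
        nTotal[1].assign(npart, 0);
    }

    // Call once per time interval with alpha[i] = number of R_i tuples seen in [t, t+1].
    void updateRecentCounters(const double alpha[2]) {
        for (int i = 0; i < 2; ++i)
            nRcnt[i] = lambda * nRcnt[i] + (1.0 - lambda) * alpha[i];
    }

    // P_i^arr[j] = (n_i^total[j] / sum_k n_i^total[k]) * (n_i^rcnt / (n_1^rcnt + n_2^rcnt))
    double arrivalProbability(int i, int j) const {
        long sum = 0;
        for (long n : nTotal[i]) sum += n;
        double rcntSum = nRcnt[0] + nRcnt[1];
        if (sum == 0 || rcntSum == 0.0) return 0.0;
        return (static_cast<double>(nTotal[i][j]) / sum) * (nRcnt[i] / rcntSum);
    }
};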

2.3 Locality-Aware (LA) Model


[7] observes that a data stream exhibits reference locality when tuples with specific attribute values have a higher probability of re-appearing in a future time interval. Leveraging this observation, a Locality-Aware (LA) model was proposed, in which the reference locality caused by both long-term popularity and short-term correlations is captured. It is described by the following model:

xn = xn−i with probability ai (for 1 ≤ i ≤ h), and xn = y with probability b, where b + Σi=1..h ai = 1. Here y denotes a random variable that is independent and identically distributed (IID) with respect to the popularity distribution P.

Using this model, the probability that a tuple t appears at the n-th position of the stream is given by

Prob(xn = t | xn−1, ..., xn−h) = b·P(t) + Σj=1..h aj·δ(xn−j, t),

where δ(xk, c) = 1 if xk = c, and 0 otherwise. Using the LA model, the marginal utility of a tuple is derived and then used as the basis for determining the tuples to be flushed to disk whenever memory is full.

2.4 Limitations of RPJ and LA Model


In this section, we discuss the limitations of RPJ and the LA model. RPJ relies
on the availability of an analytical model deriving the output probabilities from
statistics on the input data. This is possible in the case of relational equijoins
but embeds some uniformity assumptions that are not necessarily true. It is not
able to efficiently handle scenarios in which the data within each partition is
non-uniform, which breaks the local uniformity assumption. Consider the two
partitions, belonging to dataset R and S respectively, presented in Figure 1. The
grayed area is used to denote ranges of data. Suppose in both Figure (a) and
(b), N tuples have arrived. In Figure 1(a), the N tuples are uniformly distributed across the entire partition of each dataset; whereas in Figure 1(b), the N tuples are distributed within a specific numeric range (i.e. the areas marked grey). Assuming the same number of tuples has arrived in both cases, P(1|R) and P(1|S)
would be the same. However, it is important to note that if partition 1 is selected
to be the partition to be kept in memory, the partitions in Figure 1(a) would
produce results as predicted by RPJ; whereas the partitions in Figure 1(b) would
fail to produce any results. Though RPJ attempts to amortize the effect of
historical arrivals of each relation, it assumes that the data distribution remains
stable throughout the lifetime of the join, which makes it less useful when the
data distribution is changing (which is common in long-running data streams).
The LA model is applied to deal with the approximate sliding window join on
relational data. Based on the LA model given in the earlier section, we can see
that it relies on determining whether a similar tuple appears at a future position
in the data stream. For relational data, a similar tuple could be one that has the
same value as a previous tuple. However, for non-relational data, such as spatial
data, the notion of similarity between two tuples is more complex, and hence it
is not straightforward to extend the LA model to deal with non-relational data
types.

[Fig. 1. Data in a Partition: (a) uniform data within the partition, (b) non-uniform data within the partition, shown for partition 1 of datasets R and S]



3 Result-Rate Based Progressive Join (RRPJ)


In this section, we present a novel method of maintaining statistics over the result
distribution, instead of the data distribution. This is motivated by the fact that
in most progressive join scenarios, we are concerned with delivering initial results
quickly and maintaining a high overall throughput. Hence, the criteria used to
determine which tuples need to be flushed to disk whenever memory becomes
full should be ‘result-motivated’. In addition, the number of results produced by
a partition is reflective of the data distribution of the partitions.

3.1 RRPJ
We propose a novel join algorithm, called Result-Rate Based Progressive Join
(RRPJ) (Algorithm 1), which uses information on the result throughput of the
partitions to determine the tuples or partitions that are likely to produce results.
In Algorithm 1, an arriving tuple is first used to probe the hash partitions of the other data stream in order to produce result tuples. Next, the algorithm checks whether memory is full (line 2). If memory is full, it first computes the Thi values (i.e., the values computed by Equation 3) for all the partitions. Partitions with the lowest Thi values are then flushed to disk, and the newly arrived tuple is inserted. The main difference between RRPJ flushing and RPJ is that the Thi values reflect the output (i.e. result) distribution over the data partitions, whereas the RPJ values are based on the input data distribution.
To compute the Thi values (Equation 3), we track, for each partition, the total number of tuples that contribute to a join result from the probes against the partition. Intuitively, RRPJ tracks the join throughput of each partition. Whenever memory becomes full, we flush nflush (a user-defined parameter) tuples from the partition with the smallest Thi value, since that partition has produced the fewest results so far. If the number of tuples in the partition is less than nflush, we move on to the partition with the next lowest Thi value.
Given two timestamps t1 and t2 (t2 > t1), let n1 and n2 be the number of join results produced by partition i up to t1 and t2 respectively. A straightforward definition of the throughput of a partition i, denoted by Thi, is given in Equation 1.

Thi = (n2 − n1) / (t2 − t1)    (version 1) (1)

From Equation 1, we can observe that since (t2 − t1) is the same for all partitions, it suffices to maintain counters on just the number of results produced (i.e. n1 and n2). A partition with a high Thi value has a higher potential of producing the most results. However, it is important to note that Equation 1 does not take into consideration the size of the partitions and its impact on the number of results produced. Intuitively, a large partition will produce more results, but this might not always be true. For example, a partition might contain few tuples yet produce a lot of results. This partition should be favored over a relatively larger partition which is also producing the same number of results.

Algorithm 1: RRPJ Join Algorithm


Data : t - Newly Arrived tuple
Result : Result Tuples
1 Use t to probe hash partitions from other data stream
2 If ( Memory is full() ) {
3 ComputeThValue() ;
4 FlushDataToDisk() }
5 Insert t into hash table HT ;

Besides considering the result distribution amongst the partitions, we must also consider the following: (1) the total number of tuples that have arrived, (2) the number of tuples in each partition, (3) the number of result tuples produced by each partition, and (4) the total number of results produced by the system. Therefore, we use an improved definition of Thi, given below.
Suppose there are P partitions maintained for the relation. Let Ni denote
the number of tuples in partition i (1 ≤ i ≤ P ), and Ri denote the number of
result tuples produced by partition i. Then, the T hi value for a partition i can
be computed. In Equation 2, we consider the ratio of the results produced to
the total number of results produced so far (i.e. numerator), and also the ratio
of the number of tuples in a partition to the total number of tuples that have
arrived (i.e. denominator).

Thi = (Ri / Σj=1..P Rj) / (Ni / Σj=1..P Nj) = (Ri × Σj=1..P Nj) / ((Σj=1..P Rj) × Ni)    (version 2) (2)

Since the total number of results produced and the total number of tuples that have arrived are the same for all partitions, Equation 2 can be simplified. This is given in Equation 3.

Thi = Ri / Ni    (version 2 - after simplification) (3)

Equation 3 computes the Thi value with respect to the size of the partition. For example, let us consider two cases. In case (1), suppose Ni = 1 (i.e. one tuple in the partition) and Ri = 100. In case (2), suppose Ni = 10 and Ri = 1000. Then, the Thi values for cases (1) and (2) are the same. This prevents large partitions from unfairly dominating the smaller partitions (due to the potentially large number of results produced by larger partitions) when a choice needs to be made on which partitions should be flushed to disk.
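The following sketch (our own, not the authors' implementation) shows how Equation 3 can drive the flushing decision of Algorithm 1; empty partitions are skipped since flushing them frees no memory.

// Sketch of the RRPJ flushing decision: Th_i = R_i / N_i, flush the lowest-Th_i partition.
#include <cstddef>
#include <vector>

struct Partition {
    long tuples = 0;    // N_i: tuples currently held in memory by partition i
    long results = 0;   // R_i: result tuples produced by probes against partition i
};

double thValue(const Partition& p) {
    return static_cast<double>(p.results) / p.tuples;   // Equation 3 (caller ensures tuples > 0)
}

// Index of the non-empty partition with the smallest Th_i value, or parts.size() if all are empty.
std::size_t pickVictim(const std::vector<Partition>& parts) {
    std::size_t victim = parts.size();
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (parts[i].tuples == 0) continue;               // nothing to flush here
        if (victim == parts.size() || thValue(parts[i]) < thValue(parts[victim]))
            victim = i;
    }
    return victim;
}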

3.2 Amortized RRPJ (ARRPJ)


In order to allow RRPJ to be less susceptible to varying data distributions,
we introduce Amortized RRPJ (ARRPJ). Suppose there are two partitions P1
and P2 , each containing 10 tuples. If P1 produces 5 and 45 result tuples at
timestamps 1 and 2 respectively, the Th1 value is 5. If partition P2 produces
45 and 5 result tuples at timestamps 1 and 2 respectively, the Th2 value for P2

45 and 5 result tuples at timestamp 1 and 2 respectively, the T h2 value for P2
will also be 5. From the above example, we can observe that the two scenarios
cannot be easily differentiated. However, we should favor partition P1 since it is
obviously producing more results than P2 currently. This is important because
we want to ensure that tuples that are kept in memory are able to produce more
results because of their current state, and not due to a past state.
To achieve this, let σ be a user-tunable factor that determines the impact of
historical result values. The amortized RRPJ value, denoted Ai^t, for a partition i at time t is presented in Equation 4. When σ = 1.0, the amortized RRPJ is exactly the same as RRPJ. When σ = 0.0, only the latest RRPJ values are considered. By varying σ between 0.0 and 1.0 (inclusive), we can control the effect of historical RRPJ values on the overall flushing behavior of the system.

Ai^t = (σ^t·ri^0 + σ^(t−1)·ri^1 + σ^(t−2)·ri^2 + ... + σ^1·ri^(t−1) + σ^0·ri^t) / Ni = (Σj=0..t σ^(t−j)·ri^j) / Ni    (4)
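Equation 4 can be maintained incrementally, since the weighted sum Σj σ^(t−j)·ri^j at time t+1 equals σ times its value at time t plus ri^(t+1); a minimal sketch (our own, based only on that observation) follows.

// Sketch of the amortized RRPJ statistic A_i^t of Equation 4.
struct AmortizedPartition {
    long   tuples = 0;       // N_i
    double weighted = 0.0;   // running value of sum_{j<=t} sigma^{t-j} * r_i^j
};

// Call once per timestamp with rT = number of results partition i produced in that interval.
void advance(AmortizedPartition& p, long rT, double sigma) {
    p.weighted = sigma * p.weighted + static_cast<double>(rT);
}

double amortizedTh(const AmortizedPartition& p) {
    return p.tuples == 0 ? 0.0 : p.weighted / p.tuples;   // A_i^t
}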

4 Performance Study

In this section, we study the performance of the proposed RRPJ against RPJ. All
the experiments were conducted on a Pentium 4 2.4GHz CPU PC (1GB RAM).
We measure the progressiveness of the various flushing policies by measuring the
number of results produced and the response time.

Table 1. Experiment Parameters

Dataset Parameter                            Default Value
Number of Tuples Per Page                    85
Available Memory                             1000 pages
Domain of Join Attribute                     [1, 10000]
Tuple Inter-arrival                          0.001s
Dataset Size (Relation R1 + Relation R2)     2 million tuples
Percentage of Tuples Flushed                 10%

The experimental parameters are given in Table 1. Unless otherwise stated,


the datasets used in the experiments use the default values given in the table.

4.1 Effect of Uniform Data Within Partitions

We generated the datasets HARMONY and REVERSE based on the dataset


generation techniques described in [6]. We used the same arrival pattern HAR-
MONY and REVERSE. In this experiment, we evaluate the performance of the
RRPJ against RPJ. We measure the response time (x-axis) and the number of

result tuples generated (y-axis). From Figure 2, we can observe that the per-
formance of RRPJ is comparable to RPJ using the same datasets from [6], and
hence is at least as effective as RPJ for uniform data.

[Fig. 2. Effect of Uniform Data Within Partitions: (a) Harmony, (b) Reverse — number of result tuples vs. execution time (s) for RRPJ and RPJ]

4.2 Effect of Non-uniform Data Within Partitions


In this experiment, we evaluate the performance of RRPJ against RPJ for non-uniform datasets. We used the same arrival patterns HARMONY and REVERSE. Similar to Figure 1, we restrict the domain of the join attribute for 50% of the tuples from one dataset (R1) to be in the range [1,5000] and the domain of the join attribute for 50% of the other dataset (R2) to be in the range [5001,10000]. We measure the response time (x-axis) and the number of result tuples generated (y-axis). From Figures 3(a) and 3(b), we can observe that RRPJ outperforms RPJ by a large margin. This is because RPJ's local uniformity assumption breaks when the data within each partition is non-uniform. Comparatively, since RRPJ tracks the number of results, it is able to identify the partitions that are not producing any results, and hence avoids keeping tuples belonging to these non-productive partitions in memory.

4.3 Varying Data Arrival Distribution


The datasets are generated as follows: we use a Zipfian distribution (with tunable
parameter θ) to determine the partition to which a newly-arrived tuple is assigned.
When θ = 0.0, the data distribution is uniform (i.e., a newly-arrived tuple has equal
probability of belonging to any of the partitions). As θ increases, the arrival
distribution becomes more skewed (i.e., a newly-arrived tuple has a higher probability
of belonging to specific partitions). In order to simulate a varying data arrival
distribution, we randomly re-order the partition probabilities every time α tuples
have arrived. For example, when θ = 2.0, Table 2 shows the arrival probabilities.
Fig. 3. Effect of Non-Uniform-Data Within Partitions: (a) Harmony, (b) Reverse (number of result tuples vs. execution time in seconds, RRPJ vs. RPJ)

During the initial stage, the probabilities that a newly arrived tuple will belong to
partitions 1, 2, 3, 4 and 5 are 0.68, 0.17, 0.08, 0.04 and 0.03 respectively. During
each reorder, the probability for a newly arrived tuple to belong to a specific
partition changes.
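
The following is a minimal sketch, not taken from the paper, of how such a varying arrival distribution could be generated; normalizing Zipfian weights of the form 1/k^θ and re-ordering the partition-to-probability mapping with random.shuffle are assumptions made for illustration (with θ = 2.0 and five partitions this yields probabilities of roughly 0.68, 0.17, 0.08, 0.04 and 0.03, matching Table 2).

import random

def zipf_probabilities(num_partitions, theta):
    """Zipfian weights: partition rank k gets weight 1 / k^theta (theta = 0 is uniform)."""
    weights = [1.0 / (k ** theta) for k in range(1, num_partitions + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def generate_arrivals(num_tuples, num_partitions, theta, alpha):
    """Yield the partition id of each arriving tuple, re-ordering the
    partition-to-probability mapping after every alpha arrivals."""
    probs = zipf_probabilities(num_partitions, theta)
    partitions = list(range(1, num_partitions + 1))
    for i in range(num_tuples):
        if alpha > 0 and i > 0 and i % alpha == 0:
            random.shuffle(partitions)          # random re-order of the mapping
        yield random.choices(partitions, weights=probs, k=1)[0]

# e.g. 5 partitions, theta = 2.0, re-order every 8000 tuples
stream = list(generate_arrivals(100000, 5, 2.0, 8000))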
In this experiment, we evaluate the performance of Amortized RRPJ (ARRPJ) against RPJ
and RRPJ when the arriving data exhibits a varying arrival distribution (i.e., the
probability that a newly arrived tuple belongs to a partition changes). We vary the
amortization factor σ for ARRPJ between 0.0 and 1.0, and call the corresponding
algorithm ARRPJ-σ. When σ = 0.0, only the latest RRPJ values (i.e., the number of
results produced and the size of the data partition since the last flush) are used;
when σ = 1.0, ARRPJ is exactly RRPJ (it computes the average of the statistics over time).

Table 2. Arrival Probabilities, θ = 2.0

Arrival Probability P   Partition Assigned
                        Initial   1st Reorder   2nd Reorder
0.68                    1         2             3
0.17                    2         3             4
0.08                    3         4             5
0.04                    4         5             1
0.03                    5         1             2

The results are shown in Figures 4(a)-(f). In addition, we summarize the throughput
(i.e., the number of result tuples produced over time) of each algorithm in Table 3.
From Table 3, we can observe that an amortization factor of 0.0 is not necessarily
the best (the best value is highlighted in bold); there is a need to balance the
impact of past and current results. From Figures 4(a)-(e), we can observe that ARRPJ
(with the various amortization factors) performs much better than RRPJ. However, when
the data distribution changes very frequently (e.g., Figure 4(f), α = 0k), the
performance of RRPJ and ARRPJ is similar.
Fig. 4. Varying Data Distribution: (a) α = 32k, (b) α = 20k, (c) α = 16k, (d) α = 8k, (e) α = 4k, (f) α = 0k (number of result tuples vs. execution time in seconds, for RRPJ and ARRPJ-σ with σ ∈ {0.0, 0.2, 0.5, 0.8, 1.0})

Table 3. Throughput of various methods (Summary of Fig. 4)

α RRPJ ARRPJ-0.0 ARRPJ-0.2 ARRPJ-0.5 ARRPJ-0.8 ARRPJ-1.0


0 4113 4128 4128 4125 4119 4113
4 6735 7719 7950 7665 7541 6735
8 9783 12266 12009 11503 10551 9783
16 11879 20133 20038 19428 17307 11879
20 10027 25140 25152 24554 20887 10027
32 12177 36388 36053 34685 27120 12177
When α = 0k, the data arrival distribution is re-ordered aggressively (it changes each
time a tuple arrives). Thus, all the methods (including RPJ and XJoin) perform
similarly, because none of them can make use of the statistics gathered to effectively
predict which tuples to keep in memory; this is combined with a generally smaller
number of possible results. However, when α increases from 4k to 32k, we can observe
that ARRPJ (with the various amortization factors) outperforms RRPJ. This is because
ARRPJ is able to better reduce the impact of past results by amortizing the RRPJ
values, whereas RRPJ does not perform as well since it does not differentiate between
past and current results. From Figure 4, we can also observe that as the data changes
less frequently (i.e., as α varies from 0k to 32k), the total number of result tuples
increases significantly. This is because when the data distribution changes less
often, the statistics computed can be used for more effective prediction of which
tuples need to be kept in memory.
In addition, we also varied ρ (the percentage of pages flushed each time memory is
full) and θ (the skewness of the data distribution). Similar trends are observed.
When θ is 0.0 (i.e., uniform data), all methods (i.e., RPJ, RRPJ, ARRPJ) perform the
same. The results are omitted due to space constraints. These experiments suggest,
however, that several factors influence the correct evaluation of the output
statistics when the data distribution changes over time. The amortization formula
must be tuned with respect to the size of the buffer, the percentage and size of the
replaced partitions, as well as the frequency of the replacement. While the purpose
of this paper is to introduce the idea of amortization and illustratively quantify
its potential, such fine tuning is left to future work.

5 Conclusion

We proposed a new adaptive and progressive equijoin algorithm for relational data
streams. The algorithm belongs to the XJoin and symmetric hash join family. Its
originality is twofold.
Firstly, the algorithm implements a replacement strategy for main-memory partitions
that estimates the probability of a partition producing results directly from the
observation of output statistics. Previous proposals, such as the RPJ and LA
algorithms, have attempted to analytically construct such a model from statistics on
the input streams. We showed that our algorithm is equivalent to RPJ in the cases for
which RPJ's performance was evaluated by its inventors (we used the same data sets),
and that it significantly outperforms RPJ when the uniformity hypothesis necessary to
RPJ's estimation does not hold. We therefore showed empirically that our algorithm is
globally better than RPJ.
Secondly, we proposed an adaptive version of our algorithm that uses amortization to
incrementally discount the influence of past statistics. The same principle can be
incorporated in previously proposed algorithms such as RPJ and LA. This allows the
algorithm to cater for changes over time in the input data distributions. We showed
that this technique leads to a significant performance increase in some cases, thus
proving the concept. However, the results we obtained call for further studies in
order to understand the impact of the different parameters. Future and ongoing work
includes the practical tuning
of such parameters: amortization formula, buffer size, frequency of replacements


and percentage/size of replaced partitions.
Finally, we underline that, as we had preliminarily shown in [13], our approach, as
opposed to RPJ and LA, generalizes rather gracefully to non-relational data: it does
not require complex analytical modeling of the probabilities of partitions producing
results from a model of the input data distribution, but rather directly observes a
statistical model of the output distribution. We are currently investigating the
performance of RRPJ against RPJ and LA in non-relational domains.

Acknowledgments. We would like to thank Dr Tao Yufei and his colleagues


for providing us with the RPJ code and data generator.

References
1. Haas, P.J., Hellerstein, J.M.: Ripple join for online aggregation. In: SIGMOD.
(1999) 287–298
2. Urhan, T., Franklin, M.J., Amsaleg, L.: Cost based query scrambling for initial
delays. In: SIGMOD. (1998) 130–141
3. Urhan, T., Franklin, M.J.: XJoin: Getting fast answers from slow and bursty net-
works. Technical Report CS-TR-3994, Computer Science Department, University
of Maryland (1999)
4. Avnur, R., Hellerstein, J.M.: Eddies: Continuously adaptive query processing. In:
SIGMOD. (2000) 261–272
5. Madden, S., Shah, M.A., Hellerstein, J.M., Raman, V.: Continuously adaptive
continuous queries over streams. In: SIGMOD. (2002) 49–60
6. Tao, Y., Yiu, M.L., Papadias, D., Hadjieleftheriou, M., Mamoulis, N.: Rpj: Pro-
ducing fast join results on streams through rate-based optimization. In: SIGMOD.
(2005) 371–382
7. Li, F., Chang, C., Kollios, G., Bestavros, A.: Characterizing and exploiting refer-
ence locality in data stream applications. In: ICDE. (2006) 81
8. Dittrich, J.P., Seeger, B., Taylor, D.S., Widmayer, P.: Progressive merge join: A
generic and non-blocking sort-based join algorithm. In: VLDB. (2002) 299–310
9. Dittrich, J.P., Seeger, B., Taylor, D.S., Widmayer, P.: On producing join results
early. In: PODS. (2003) 134–142
10. Mokbel, M.F., Lu, M., Aref, W.G.: Hash-merge join: A non-blocking join algorithm
for producing fast and early join results. In: ICDE. (2004) 251–263
11. Lawrence, R.: Early hash join: A configurable algorithm for the efficient and early
production of join results. In: VLDB. (2005) 841–852
12. Wilschut, A.N., Apers, P.M.G.: Dataflow query execution in a parallel main-
memory environment. In: PDIS. (1991) 68–77
13. Tok, W.H., Bressan, S., Lee, M.L.: Progressive spatial join. In: SSDBM. (2006)
353–358
GChord: Indexing for Multi-Attribute Query in P2P
System with Low Maintenance Cost

Minqi Zhou, Rong Zhang, Weining Qian, and Aoying Zhou

Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China
{zhouminqi,rongzh,wnqian,ayzhou}@fudan.edu.cn

Abstract. Providing complex query processing in peer-to-peer systems has attracted
much attention in both the academic and industrial communities. We present GChord, a
scalable technique for evaluating queries with multiple attributes. Both exact-match
and range queries can be handled by GChord. It has an advantage over existing methods
in that each tuple only needs to be indexed once, while query efficiency is
guaranteed; thus, index maintenance cost and search efficiency are balanced.
Additional optimization techniques further improve the performance of GChord.
Extensive experiments are conducted to validate the efficiency of the proposed method.

1 Introduction
Peer-to-peer (P2P) systems provide a new paradigm for information sharing in
large-scale distributed environments. Though the success of file sharing applications
has proved the potential of P2P-based systems, the limited query operators supported
by existing systems prevent their usage in more advanced applications.
Much effort has been devoted to providing fully featured database query processing in
P2P systems [1,2,3,4]. There are several differences between query processing for
file sharing and database queries. Firstly, the types of data are much more complex
in databases than in file names; basically, numerical and categorical data types
should be supported. Secondly, files are searched via keywords, and keyword search is
often implemented using exact-match queries. However, for numerical data types, both
exact-match queries (or point queries) and range queries should be supported. Last
but not least, users may issue queries with constraints on a varying number of
attributes in database applications. This last requirement poses additional
challenges for database-style query processing in P2P systems. Some existing methods,
such as VBI-Tree [2], can only support user queries with constraints on all
attributes. Some other methods, namely Mercury [3] and MAAN [4], index data
separately on each attribute. Though they can support multi-attribute queries with
constraints on an arbitrary number of attributes, they are not efficient for indexing
data with more than three attributes, for two reasons. The first is that the
maintenance cost increases with the number of attributes. The second is that the
selectivity of an index on one attribute decreases drastically when the number of
attributes increases.
We present GChord, a Gray code based Chord, as a new index scheme supporting
multi-attribute queries (MAQ) in P2P environments. It distinguishes itself from other
methods in the following aspects:

– GChord utilizes the one-bit-difference property of Gray codes to encode numerical
data. This encoding scheme, together with the traditional hash-based indexing
technique for categorical data, transforms the MAQ problem into a multicast problem
in a large-scale network. By fully utilizing the finger-table links provided by
Chord [5], a general-purpose P2P overlay network, GChord provides a solid base for
indexing data without modification of the underlying overlay network structure. Thus,
it provides a convenient solution that works with existing efficient P2P technologies.
– In GChord, each data tuple only needs to be indexed once. Thus, the performance of
our method does not directly depend on the number of attributes of the data. Compared
with other Chord-based methods, it is more efficient in terms of maintenance overhead
and search performance.
– In addition to the basic indexing and query processing scheme, GChord introduces
optimization techniques called multicast tree clustering and index buddy. The former
provides an efficient implementation of multicast in a P2P network for the MAQ
problem. The latter shows that by consuming a small portion of storage space for
caching index entries, GChord outperforms methods with index duplication in terms of
storage cost and query processing.
The remainder of this paper is organized as follows. Section 2 reviews work related
to GChord. After the problem statement in Section 3, we introduce the basic GChord in
detail in Section 4. In Section 5, we present the optimization techniques of GChord.
After the experimental results in Section 6, Section 7 gives concluding remarks.

2 Related Works
MAQ is widely studied in centralized database systems. One solution for indexing data
for MAQ is the hBΠ-tree [6], which is a combination of the multi-attribute index
hB-tree [7] and the abstract index Π-tree [8]. The hBΠ-tree achieves low storage cost
and efficient point- and range-query processing for various data types and data
distributions. However, the different setting of large-scale distributed systems
prevents the direct application of such centralized techniques in P2P systems.
In large-scale P2P systems, distributed hash tables (DHTs), such as Chord [5],
Pastry [9], and CAN [10], are widely used. However, they only support keyword-based
lookup(key), and these hash-based methods usually cannot preserve the locality and
continuity of data.
The methods supporting MAQ in structured P2P systems can be classified into two
categories. The first introduces traditional tree-structured index schemes into P2P
systems. BATON [1] is a P2P index structure based on a balanced binary tree;
BATON* [11] substitutes an m-way tree for the binary tree in BATON. These two can
support single-dimensional range queries well. VBI-tree [2] provides a framework for
indexing multi-dimensional data in P2P systems with hierarchical tree-structured
indexes from centralized systems. However, these structures cannot efficiently
support queries with constraints on an arbitrary number of attributes.
The other category of research work is based on extending DHT-based overlay net-
works. The basic idea behind these methods is to use one overlay network for each
attribute that needs to be indexed, and to use locality-preserving hash to index numeri-
cal attributes. Mercury [3] and MAAN [4] belong to this category. Both of them index
each attribute separately on Chord ring and the index with the best selectivity power is
used to prune the search to support MAQ. Therefore, both of them have high stor-
age cost and index maintenance cost. Furthermore, the search efficiency decreases
drastically when the dimensionality of data increases.

3 Problem Statements
A tuple of data is represented by a set of attribute-value pairs: t{attri, vi},
i = 1, ..., N. The domain Ai of attribute attri is either numerical or categorical. A
numerical domain is assumed to be continuous or sectionally continuous, and bounded.
Given a data set, the set of domains A : {Ai}, i = 1, 2, ..., N, is assumed to be
known in advance. We believe that even with this assumption, many applications fit
into our MAQ model.
A multi-attribute query (MAQ) is a conjunction of a set of predicates of the form
(attri, op, vi), i = 1, ..., m, in which attri is an attribute name and op is one of
<, ≤, >, ≥, = for numerical attributes and = for categorical ones. Note that a query
may have an arbitrary number of predicates, and the predicates may be on arbitrary
attributes. Figure 1 shows a simple example of data and queries.

Fig. 1. Data Item and Query (an example Records relation with attributes film name (c), price (n), duration (n), premiere (n) and cinema (c), together with sample multi-attribute queries such as Q1: fn="Lord of War" ∧ price>40 ∧ price<80 and Q2: premiere>06-05-4 ∧ premiere<06-07-09 ∧ cinema="Yongle")
Fig. 2. Two Attributes Domain Partition (Gray-code levels with prefixes 0 and 1)

The results of a MAQ are the set of tuples satisfying all predicates present in the
query. Note that there are no constraints on the values of attributes missing from
the query.
The MAQ problem in a P2P environment is that each of a set of peers may share (or
publish) a set of tuples. Each peer may issue a MAQ to be evaluated in a P2P style,
which means there is no centralized server, and the peers work in a collaborative
mechanism. To collaborate with others, a peer devotes some storage space to indexing
data shared by others, and supports index lookup and maintenance.
The difficulty of query processing for MAQ in P2P systems lies in the following three
aspects: 1) one query may involve attributes with different data types, and point
constraints and range constraints may appear simultaneously in one query, 2) an
arbitrary number of attributes may be present in a query, and 3) index maintenance is
especially expensive in P2P systems. Since there are 2^N − 1 possible combinations of
attributes in a query, any method using an index structure that can only answer
queries with a fixed number of attributes will fail to handle MAQ in a P2P
environment, due to the high cost of index maintenance in distributed environments.
In the next section, we present GChord, a Gray code based indexing scheme that can be
distributed over a Chord-like overlay network. By fully utilizing the network links
provided by the underlying network, it indexes each tuple only once, and can support
MAQ with an arbitrary number of attributes using this single index.

4 The Basic GChord Mechanism

4.1 Data Indexing
Each attribute is assigned a set of bits in the 128-bit bitstring to store its code.
The number of bits of a code is proportional to the size of the domain (recall that
all domains are assumed to be known in advance). Intuitively, the larger a domain is,
the stronger the selectivity power of that attribute, and the more bits should be
devoted to indexing it. Numerical and categorical attributes are encoded differently
in GChord.
Encoding Numerical Attributes. As the domain of each numerical attribute is
pre-defined, GChord partitions the domain equally and continuously. For sectionally
continuous attribute domains, it concatenates all the sections together first, and
then partitions them equally. For example, if one attribute domain is composed of
three continuous sections, (1, 4), (5, 17), (31, 36), the four equal partitions are
{(1,4),(5,7]}, {(7,12]}, {(12,17)} and {(31,36)}. Obviously, each part covers an
interval of the same length.
All partitioned parts of one attribute domain are encoded by the Standard Binary
Reflected Gray Code [12] (Gray code for short), continuously and sequentially, as
Figure 2 shows. The two Gray codes that represent two adjacent partition parts differ
in exactly one bit. To make full use of the Gray code, we restrict the number of
partitioned parts to 2^k, where k is the number of bits in the Gray code.
It is hard to compute the Gray code from an attribute value directly, but it is easy
to compute the sequence number of the partitioned part that the attribute value falls
in, using the equation
SN(v) = ⌊ (v − v_(j min)) × (2^m − 1) / Σ_{i=0}^{n} (v_(i max) − v_(i min)) ⌋,
where v is the attribute value with v ∈ [v_(j min), v_(j max)], and m is the Gray
code length. The corresponding Gray code, which is the sub-index key for the
attribute, is converted from the sequence number using Algorithm 1.
Encoding Categorical Attributes. Since only exact-match queries need to be supported
for categorical attributes in MAQ, it is much easier to encode categorical
attributes. A hash-based method is used to determine the code of a categorical value:
code(v) = hash(v) mod 2^m, in which m is the number of bits assigned to the
categorical attribute and hash is a general-purpose hash function such as SHA-1.
Algorithm 1: SN 2 GC(BitString sequencenumber[n])


1 BitString graycode[n] //output Gray code which contains n bits
2 graycode[0]← sequencenumber[0]
3 for each i from 1 to n − 1 do
4 if sequencenumber[i] = sequencenumber[i-1] then
5 graycode[i]← 0
6 else
7 graycode[i]← 1

8 return graycode
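
For reference, the following is a minimal Python sketch (not part of the paper) of the same conversion; the bitwise shortcut g = b ^ (b >> 1) is the standard identity for the binary reflected Gray code and is included here only as a cross-check, while the function names are assumptions of this sketch.

def sn_to_gray_bits(seq_bits):
    """Bit-by-bit conversion as in Algorithm 1: the first Gray bit copies the
    first binary bit, and each later bit is 0 iff adjacent binary bits are equal."""
    gray = [seq_bits[0]]
    for i in range(1, len(seq_bits)):
        gray.append(0 if seq_bits[i] == seq_bits[i - 1] else 1)
    return gray

def sn_to_gray_int(sn):
    """Equivalent bitwise form for an integer sequence number."""
    return sn ^ (sn >> 1)

# e.g. sequence number 6 = 110 in binary maps to Gray code 101
assert sn_to_gray_bits([1, 1, 0]) == [1, 0, 1]
assert sn_to_gray_int(6) == 0b101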

Generating the Index Key for a Tuple. As the number of peers participating in the
network is much smaller than the number of peers the identifier space can
accommodate, one peer in the network has to manage many index keys. If the index key
were simply constructed by concatenating all N codes, the attributes encoded on the
right side would lose their distinguishability: all values of an attribute encoded on
the right side might be mapped to the same peer, rendering the index on that
attribute useless.
GChord therefore uses a shuffling-based method to generate the index key of a tuple.
The shuffled index key is constructed by interleaving the bits of the attribute
codes, taking one bit of the code of one attribute, then one bit of the next, and so
on. The order of the attributes is pre-determined using the descending order of the
sizes of their domains. A small sketch of this interleaving is given below.
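
The following is a minimal sketch, not from the paper, of the shuffled index-key construction for codes of possibly different lengths; letting shorter codes simply contribute fewer bits in later rounds is an assumption made for illustration.

def shuffle_codes(codes):
    """Interleave the bit strings of the per-attribute codes (given in descending
    order of domain size) into a single index key, one bit from each code per round."""
    key = []
    for i in range(max(len(c) for c in codes)):
        for code in codes:
            if i < len(code):        # shorter codes contribute fewer bits
                key.append(code[i])
    return "".join(key)

# e.g. three attribute codes '101', '011' and '10'
print(shuffle_codes(["101", "011", "10"]))   # -> '10101011'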
Analysis of the Index Key Generation Method. Since two adjacent Gray codes differ in
only one bit, the adjacency relationships between two sections of a numerical
attribute are preserved by any structured overlay network protocol that maintains
one-bit-difference links in its routing tables, such as Chord and Pastry.

Property 1. Two index keys have a one-bit difference if the two corresponding tuples
have adjacent values on one numerical attribute and the same values on all other
attributes.

Thus, our indexing scheme preserves the locality of numerical attributes.

Property 2. The index keys stored on one peer are constituted by a set of continuous
partitions of each numerical attribute.

Property 1 means that adjacent values on each attribute are linked by the links in
the routing table of an overlay network like Chord or Pastry; query messages can be
routed efficiently because the index keys of adjacent data are one hop away.
Property 2 means that part of the queried region may be addressed by accessing a
single peer. As the overlay is not fully filled, routing hops may be saved if index
keys are inserted into the predecessor.
As the index key is shuffled, load balancing can be achieved at the same time. Since
the distribution of real data is often skewed while the sections of an attribute
domain are encoded by equal partitioning, the data tuples falling into one section
may be skewed. Additional strategies need to be adopted to keep the load balanced
among the peers in the network.

Lemma 1. The equal-length prefixes of a sequence of Gray codes also constitute a Gray
code sequence.
Property 3. The index keys stored on each peer constitute a similarly sized portion
of each attribute.

Property 4. The process of a node joining is the process of repartitioning the
attribute domains.

As Property 4 indicates, load balancing can be achieved by selecting a suitable Id
for a node at the time it joins.

4.2 Query Processing


To evaluate a MAQ, a set of nodes with index entries satisfying the query should be
visited. Intuitively, attributes presented in the MAQ should satisfy the predicates in the
query. Therefore, their corresponding bits in the peer identifier, i.e. the codes of those
attributes, should satisfy the constraints. There are no constraints on other bits. Thus, the
query processing procedure can be transformed into a multicast problem. The targeted
peers are peers taking care of the identifiers satisfying the following constraints:

1. For each predicate attr op v on a numerical attribute attr, code(attr) op code(v);
2. For each predicate attr = v on a categorical attribute attr, code(attr) = code(v);
3. All other bits can be either 0 or 1.

Thus, a multicast task can be represented by a set of strings with the same length as
that of the identifier (index key). Each element in the string is either 0, 1 or x, in which
x means either 0 or 1.
Multicast trees (MCTs) are constructed to forward the query to indexing peers. A
MCT is a virtual tree whose edges are routing paths of the query. A MCT corresponding
to the multicast of 10xx1xx is shown in Fig. 3.

Fig. 3. Multicast Tree of 10xx1xx: (a) the multicast tree over the matching identifiers, (b) the corresponding multicast process on the Chord ring

Multicast Tree Construction. As the links in the finger table of the overlay network
are directed, a single MCT that reaches all relevant indexing nodes of a MAQ without
visiting irrelevant ones may not exist. The MCTs are therefore constructed on-the-fly
when a query is issued. A modified Karnaugh Map [13] construction algorithm is
employed for this task.
A Karnaugh Map is a graphical way of minimizing a boolean expression based on the
rule of complementation. To avoid processing a Karnaugh Map of large size, we compute
the Multicast Tree Proportions (MTPs) of two attributes at a time, so ⌈m/2⌉
computations are needed to obtain all the MTPs; if m is odd, the last MTP is computed
on a single attribute. An attribute that is not present in the query has an MTP of
the form "xx···x", of the same length as the code representing the attribute
partition. After all MTPs are computed, they are shuffled and put together using the
same method with which we generate the index keys.
Supposing the number of MTPs on each attribute that has constraints in the MAQ is
n1, n2, ..., nm, the number of MTPs of the MAQ is n1 × n2 × ... × nm.
The procedure to compute MTPs is as follows: (1) Initialize an empty Karnaugh Map:
each side of the map has length equal to the length of the code of the corresponding
attribute. (2) The cells satisfying all constraints given by the predicates on the
attributes present on both sides are marked "1"; all other cells are marked "0". Each
cell forms a rectangle of size 1. (3) Two adjacent rectangles whose cells are all
marked "1" are merged to generate a rectangle of size 2^i. Note that the Karnaugh Map
is a torus, so a leftmost cell and a rightmost cell in the same row are considered
adjacent, and likewise for the top and bottom cells of the same column. (4) Step 3
iterates until no larger rectangles can be generated.
Fig. 4 shows a Karnaugh Map with three MTPs corresponding to the multicast tasks
<x0x, x0x>, <x1x, 11x>, and <x1x, 101>.

Fig. 4. Karnaugh Map on Two Attributes
Fig. 5. Index Buddy

Property 5. Suppose the query region on two attributes is an A × B rectangle, where A
and B are the numbers of partition parts contained in the query, and suppose the
binary forms of A and B contain m and n "1"s respectively. Then the query rectangle
can be divided into (m² + n² + 3m + n)/2 + 1 (if m ≥ n) MTPs.
Proof: As the number of cells contained in an MTP rectangle must be a power of two,
the number of cells on each side of the rectangle must also be a power of two.
Obviously the A cells on one attribute are divided into m parts, and the B cells on
the other side of the Karnaugh Map are divided into n parts. After one division, we
get m + n − 1 MTPs and one (A − 2^a1) × (B − 2^b1) rectangle that has not yet been
divided, where a1 and b1 are the positions of the first "1" in A's and B's binary
forms. We then proceed recursively. 
After all the MTPs have been generated, we shuffle the MTPs of the individual
attributes, in the same way as index keys are constructed, to construct the MCTs.
Our Karnaugh Map based MCT construction technique can be generalized to handle
multiple queries: targeted index peers from different MAQs may be grouped together
into one MCT using the same technique introduced above. Thus, the queries may share
the same message to retrieve the index entries, which further reduces the network
transmission cost.
Query Multicasting. After all the MCTs corresponding to the query have been
generated, multicasting of a query is conducted as follows: (1) The query message is
sent to the root of each MCT, which is the peer whose identifier is obtained by
substituting all xs in the MCT representation with 0s. (2) When a query is received
by a peer, it is evaluated on its local index, and forwarded to all peers whose
identifiers are obtained by flipping one of the xs in the MCT representation from 0
to 1. (3) This is conducted recursively until no x remains at 0 (a sketch of this
expansion follows). Fig. 3 (b) illustrates a multicast process.
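
Below is a minimal sketch, not from the paper, of the recursive expansion of a multicast pattern such as 10xx1xx into the identifiers it covers; flipping only x positions after the previously flipped one (so that every target is generated exactly once, hypercube style) is an assumption of this sketch rather than something the paper specifies.

def expand_pattern(pattern):
    """Return every identifier matched by a pattern over {'0', '1', 'x'}.
    The root replaces every 'x' with '0'; each forwarding step flips one
    remaining 'x' from 0 to 1, restricted to later positions to avoid duplicates."""
    x_positions = [i for i, c in enumerate(pattern) if c == "x"]
    root = list(pattern.replace("x", "0"))
    targets = []

    def forward(ident, next_x_index):
        targets.append("".join(ident))
        for k in range(next_x_index, len(x_positions)):
            child = ident[:]
            child[x_positions[k]] = "1"
            forward(child, k + 1)

    forward(root, 0)
    return targets

print(len(expand_pattern("10xx1xx")))   # 16 identifiers, matching Fig. 3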
Property 6. The number of query message routing hops is bounded by O(log_2 N + M),
where N is the number of nodes the overlay network can accommodate and M is the
number of xs in the MTP representation.

5 Performance Enhancement
The number of attributes with constraints in a query can vary from 1 to N. More MCTs
are generated when there are more range constraints on the attributes of the query;
the number of MCTs is the product of the numbers of MTPs on each attribute, and the
cost of multicasting a large number of MCTs separately is very high.
On the other hand, the query range will be very large if fewer attributes carry
constraints in the query, and a large number of peers then have to be accessed to
process the MAQ, leading to many query messages and routing hops. In these two
scenarios, performance can be enhanced by multicast tree clustering and index buddy,
respectively.

5.1 Multicast Tree Clustering

Peers are scattered sparsely over the overlay identifier space, so each peer manages
a set of continuous index keys. A portion of the continuous results to a query may be
indexed on a single peer, yet these index keys may be reached by different MCTs
because of the way MCTs are computed. In this scenario, a number of messages have to
be sent to the same peer to obtain the results for the query. If these MCTs are
clustered together and sent within a single message, a lot of network traffic can be
saved.
The MCT clustering strategy clusters MCTs that are close to each other. Before
performing MCT clustering, peers need to know the approximate peer density of the
overlay network. To avoid extra network traffic, we estimate the peer density using
the local density, which is the reciprocal of the index key range the peer maintains.
No matter how inaccurate the density estimation is, it has no impact on the query
results.
MCT clustering only packs multiple MCTs into a single query message; it does not
affect the query evaluation of each MCT.
The clustering procedure is as follows: (1) The query submitter clusters together all
MCTs whose root keys are close, namely those where the difference between any two
root keys is less than the index range maintained by a peer. (2) The query message is
sent according to the smallest root key within the MCT cluster. (3) The peer that
receives the MCT cluster clusters all adjacent sub-multicast trees together, as the
query submitter does. (4) Step 3 is repeated recursively until no sub-multicast trees
remain. A small sketch of step (1) is given below.
As many MCTs and sub-MCTs are sent within one message, the number of messages per
query is reduced dramatically.
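
A minimal sketch, not from the paper, of the root-key grouping in step (1); representing root keys as integers and using the estimated per-peer index range as the grouping threshold are assumptions made for illustration.

def cluster_mct_roots(root_keys, peer_index_range):
    """Group sorted MCT root keys so that keys within one estimated per-peer
    index range end up in the same cluster (and hence in the same query message)."""
    clusters = []
    for key in sorted(root_keys):
        if clusters and key - clusters[-1][0] < peer_index_range:
            clusters[-1].append(key)
        else:
            clusters.append([key])
    return clusters

# e.g. with an estimated index range of 8 keys per peer
print(cluster_mct_roots([5, 7, 40, 42, 44, 90], 8))
# -> [[5, 7], [40, 42, 44], [90]]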

5.2 Index Buddy

If a MAQ covers a large query region, a large number of peers are involved in
processing it, and the response time and network bandwidth consumption grow with the
number of peers involved. To avoid involving too many peers, we adopt the index buddy
strategy. An index buddy is a pair of peers which store adjacent values on the same
attribute and have the same peer Id prefix.
Users may have similar interests at the same time; for example, users may submit
similar queries for match lists during the Olympic Games. If such frequently queried
index keys are stored on a few peers, the number of routing hops and query messages
can be reduced.
As described in Property 2, the index keys stored on a peer are continuous on each
attribute. By adding frequently queried index keys to the peer that maintains
adjacent values on the attribute, a MAQ can be addressed by accessing fewer peers in
the network. Fig. 5 shows the index buddy: the frequently queried region is depicted
within the red rectangle, and peers which maintain index keys within this region also
manage the index keys of their index buddies. Obviously, half of the peers can be
released from processing the MAQ. The index buddy procedure is as follows:
– Partition Level Sampling. Before exchanging index keys between index buddies, we
need to know the range of index keys of each attribute stored on the buddies. We
compute the index range as IR = succ.id − id. The number of partitions managed by the
peer on the ith attribute is 2^ki, where ki is the number of bits used to represent
partitions of that attribute in IR's binary form.
– Frequent Query Region Detecting. We maintain one counter per attribute to indicate
its current query frequency. The counter is a fading function over the number of
messages relayed through finger-table links to reach adjacent values on that
attribute. We use the equation FQA_new^i = FQA_old^i / 2 + N_new^i / T_time to
estimate the query frequency of the ith attribute, where N_new^i is the number of
messages sent to get the adjacent index keys on the ith attribute and T_time is a
specified time interval (a sketch of this counter follows the list). When the
frequency of the ith attribute exceeds the threshold FQA_threshold^i, the region of
that attribute stored on the peer is regarded as frequently queried.
– Index Buddy Establishing and Deleting. On detecting that some attribute region
stored on the peer is frequently queried, the peer asks its index buddy to exchange
the index keys within the region. When the region becomes infrequently queried again,
the two peers remove the redundant index keys of the index buddy.
– Index Modification. While an index buddy exists, new index entries to be inserted,
or existing entries to be modified, need to be processed at both sites of the index
buddy in order to keep the index consistent.
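
The following is a minimal sketch, not from the paper, of the fading frequency counter used for frequent-query-region detection; keeping one counter object per attribute and calling update once per interval T_time are assumptions of the sketch.

class FadeCounter:
    """Per-attribute query-frequency estimator: FQA_new = FQA_old / 2 + N_new / T_time."""

    def __init__(self, t_time, threshold):
        self.t_time = t_time          # length of the sampling interval
        self.threshold = threshold    # FQA_threshold for the attribute
        self.fqa = 0.0
        self.n_new = 0                # messages relayed for adjacent keys this interval

    def record_adjacent_lookup(self):
        self.n_new += 1

    def update(self):
        """Call once per interval; returns True if the region is frequently queried."""
        self.fqa = self.fqa / 2 + self.n_new / self.t_time
        self.n_new = 0
        return self.fqa > self.threshold

counter = FadeCounter(t_time=10.0, threshold=1.5)
for _ in range(30):
    counter.record_adjacent_lookup()
print(counter.update())   # True: 30 messages in a 10-unit interval gives FQA = 3.0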

6 Experimental Study
To evaluate the performance of GChord, we implemented a simulator in Java (JDK
1.4.2). In our implementation, each peer is identified by its peer Id and, like a
physical peer, maintains two bounded message queues, one for sending and one for
receiving. The network layer is simulated to control the network communication, i.e.,
the sending of messages from one peer to another based on peer Ids.
In our experiments, 10000 peers with randomly distributed peer Ids construct the
Chord ring. The peer Id is a 32-bit string. Each data tuple contains 5 numerical
attributes and 1 categorical attribute, and 100000 data tuples with randomly
distributed values within their attribute domains are indexed. Range queries with a
preset maximum query range, as well as point queries, are generated randomly within
the attribute domains.
Impact of the Number of Attributes in a MAQ. The first set of experiments shows the
performance as the number of attributes with constraints in the query varies. The
maximum query range on each attribute is set to 10% of its domain. As shown in
Fig. 6(a), 6(b) and 6(c), the maximum number of routing hops, the number of routing
messages and the number of accessed peers drop dramatically as the number of
constrained attributes in the MAQ increases. The number of routing messages is
reduced to about one tenth when the multicast tree clustering strategy is used;
multicast tree clustering improves performance particularly well when the query
contains fewer attributes.

Fig. 6. Performance Comparison with Variable Attributes in Query: (a) hops vary with the number of attributes queried, (b) messages vary with the number of attributes queried, (c) accessed peers vary with the number of attributes queried (Predecessor, Cluster, Buddy and Both strategies)

Impact of Query Range in a MAQ. In this set of experiments, the number of constrained
attributes in the query is set to 4. As shown in Fig. 7(a), 7(b) and 7(c), the
maximum number of routing hops, the number of routing messages and the number of
accessed peers decrease, as expected, when the query range on each attribute
decreases. More MCTs are generated when the query range on each attribute increases,
and many more routing messages are eliminated by multicast tree clustering in this
scenario.
Fig. 7. Performance Comparison with Variable Query Range: (a) hops vary with query range, (b) messages vary with query range, (c) accessed peers vary with query range

Impact of the Frequently Queried Region. In this set of experiments, the maximum
query range on each attribute is set to 10% of its domain. As shown in Fig. 8(a),
8(b) and 8(c), the index buddy has an evident effect in reducing the number of peers
accessed as the percentage of frequent queries increases. Index buddy has a similar
impact on the maximum number of routing hops, especially when the query contains
fewer attributes.
Fig. 8. Performance Comparison with Frequent Queries: (a) hops vary with query frequency, (b) messages vary with query frequency, (c) accessed peers vary with query frequency (for queries on 3–6 attributes)

Comparison with Mercury. As there are 10000 peers in the network, the maximum number
of hops and accessed peers in Mercury is much larger than GChord's: approximately
1700 peers construct the Chord ring that maintains the index keys of each attribute,
and since only the index on a single attribute can be used to prune the search,
Mercury needs to access a large number of peers to process a MAQ. On the other hand,
index keys are stored contiguously on peers, so accessing an adjacent index key needs
only one more hop; that is why Mercury's number of routing messages is smaller than
GChord's.

Fig. 9. Performance Comparison with Mercury: (a) hops comparison, (b) messages comparison, (c) accessed peers comparison (GChord vs. Mercury, 3–6 attributes)
As shown in Fig. 9(a) and 9(c), GChord clearly outperforms Mercury in both the
maximum number of routing hops and the number of accessed peers.
Due to space limitations, the comparison of index cost with Mercury is not shown in
figures. Mercury keeps one index replica for each attribute, so its index cost is
proportional to the number of attributes a data tuple contains; the index costs of
GChord, including index storage and index messages, are therefore much lower than
Mercury's.

7 Conclusion
In this paper, we presented the design of GChord, a P2P-based indexing scheme for
processing multi-attribute queries. Using the Gray code based indexing technique,
both point and range queries on numerical attributes can be handled. By integrating
Gray code and hash based encoding methods, each tuple only needs to be indexed once
in GChord. Our index can support queries having constraints on an arbitrary number of
attributes; thus, it is more efficient than previous methods in terms of storage cost
and search performance. Enhancement techniques further improve the performance of
GChord.
Our future work on GChord includes supporting keyword-based queries and aggregate
queries over GChord, and the study of more intelligent query optimization techniques.

References
1. Jagadish, H., Ooi, B., Vu, Q.: BATON: A balanced tree structure for peer-to-peer networks. In: VLDB. (2005)
2. Jagadish, H., Ooi, B., Vu, Q., Zhang, R., Zhou, A.: VBI-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In: ICDE. (2006)
3. Bharambe, R., Agrawal, M., Seshan, S.: Mercury: Supporting scalable multi-attribute range queries. In: SIGCOMM. (2004)
4. Cai, M., Frank, M., Chen, J., Szekely, P.: MAAN: A multi-attribute addressable network for grid information services. In: Grid. (2003)
5. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. In: ACM SIGCOMM. (2001) 149–160
6. Evangelidis, G., Lomet, D., Salzberg, B.: The hBΠ-tree: a multi-attribute index supporting concurrency, recovery and node consolidation. VLDB Journal 6 (1997) 1–25
7. Lomet, D., Salzberg, B.: The hB-tree: a multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst. 15 (1990) 625–658
8. Lomet, D., Salzberg, B.: Access method concurrency with recovery. In: SIGMOD. (1992)
9. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In: Middleware. (2001) 329–350
10. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content-addressable network. In: SIGCOMM. (2001)
11. Jagadish, H., Ooi, B., Tan, K.L., Vu, Q., Zhang, R.: Speeding up search in peer-to-peer networks with a multiway tree structure. In: SIGMOD. (2006)
12. Gray, F.: Pulse code communications. U.S. Patent 2632058 (1953)
13. Karnaugh, M.: The map method for synthesis of combinational logic circuits. AIEE 72 (1953) 593–599
ITREKS: Keyword Search over Relational Database by
Indexing Tuple Relationship

Jiang Zhan and Shan Wang

School of Information, Renmin University of China, Beijing, 100872, P.R. China


{zhanjiang,swang}@ruc.edu.cn
Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of
China), MOE, Beijing 100872, P.R. China

Abstract. Keyword-based search is well studied in the world of text documents and
Internet search engines. While traditional database management systems offer powerful
query languages, they do not allow keyword-based search. In this paper, we discuss
ITREKS, a system that supports efficient keyword-based search over relational
databases by indexing tuple relationships: a basic database tuple relationship, the
FDJT, is established in advance, and an FDJT-Tuple-Index table is created which
records the relationships between each tuple and the FDJTs. At query time, for each
keyword, the system first finds the tuples in every relation that contain it, using
the full-text indexes offered by the database management system, and then uses the
FDJT-Tuple-Index table to find the joinable tuples that together contain all keywords
in the query.

Keywords: keyword search, relational database, full disjunction.

1 Introduction
Keyword-based search is well studied in the world of text documents and Internet
search engines, but keyword-based search over relational databases is not well
supported. The user of a relational database needs to know the schema of the
database; casual users must learn SQL and know the schema of the underlying data even
to pose simple searches. For example, suppose we have a DBLP database whose schema is
shown in Figure 1, and we wish to search for an author Bob's papers related to
"relation". To answer this query, we must know how to join the Author, Write and
Paper relations on the appropriate attributes, and we must know which relations and
attributes contain "Bob" and "relation". With keyword-based search, for the above
example a user should be able to enter the keywords 'Bob relation' and have the
tuples associated with the two keywords returned.
Enabling keyword search in databases without requiring knowledge of the schema is a
challenging task. Due to database normalization, logical units of information may be
fragmented and scattered across several physical tables. Given a set of keywords, a
matching result may need to be obtained by joining several tables on the fly.
In this paper, we have developed a system, ITREKS (Indexing Tuple Relationship
for Efficient Keyword Search), which supports highly efficient keyword-based search

over relational databases by indexing tuple relationships. The key features and
advantages of our approach, and the contributions of this paper, are summarized as
follows:
– Most previous approaches perform a significant amount of database computation at
search time to find the connections between tuples which contain keywords. We do all
significant join computation in advance by creating a tuple relationship index, so a
great amount of computation is saved at search time.
– We present a novel approach to indexing tuple relationships. We construct the basic
tuple relationship, the FDJT, by computing the full disjunction [1] of the
interconnected relational database, and we present an FDJT-Tuple-Index table to index
the tuples' relationships.
– We propose a modular architecture and have implemented ITREKS based on it.
– We present an efficient algorithm which incorporates the basic tuples and the
FDJT-Tuple-Index table to generate the result tuples matching the query.
– We take full advantage of existing relational database functionality. ITREKS has
been implemented on top of Oracle 9i; Oracle 9i Text uses standard SQL to create
full-text indexes on the text attributes of relations (a sketch of such statements is
given after this list), so we completely avoid reimplementing basic IR capabilities
by using Oracle Text as the back end. Furthermore, ITREKS keeps both the FDJT and
tuple-FDJT information in relational tables, and our search algorithm is also based
on a search table.
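
As an illustration (not from the paper), the kind of Oracle Text statements such a back end relies on might look like the following; the table and column names (Paper, title) and the index name are hypothetical, borrowed from the DBLP example, and the statements are shown here as plain SQL strings.

# Hypothetical Oracle Text statements for the DBLP example.  CREATE INDEX with
# INDEXTYPE IS CTXSYS.CONTEXT builds a full-text index on a single column, and
# CONTAINS(...) > 0 queries it for a keyword.
CREATE_TEXT_INDEX = """
CREATE INDEX paper_title_text_idx ON Paper(title)
  INDEXTYPE IS CTXSYS.CONTEXT
"""

KEYWORD_LOOKUP = """
SELECT ROWID FROM Paper WHERE CONTAINS(title, 'relation') > 0
"""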

Fig. 1. DBLP Schema

In Section 2 we provide a thorough survey of related work. The essential formal
background on full disjunction and related definitions is presented in Section 3.
Section 4 is the core of the paper: it discusses the ITREKS system, including the
architecture of the system, its functionality, algorithms, and implementation
details. Section 5 presents our evaluation of ITREKS, and we give conclusions and
future work in Section 6.

2 Related Work
Oracle [3], IBM DB2, Microsoft SQL Server, PostgreSQL, and MySQL all provide text
search engine extensions that are tightly coupled with the database engine. However,
in all cases each text index is defined over a single column. Using this feature
alone to do meaningful keyword search over an interconnected database would require
merging the results from many column text indexes.
Keyword-based search over relational databases has received much attention recently.
Three systems, DISCOVER [4][5], BANKS [6], and DBXplorer [7], share a similar
approach: at query time, given a set of keywords, first find the tuples in each
relation that contain at least one of the keywords, usually using auxiliary full-text
indexes of the database system; then use graph-based approaches to find tuples among
those from the previous step that can be joined together, such that the joined tuple
contains all keywords in the query. All three systems use foreign-key relationships
as edges in the graph, and point out that their approach could be extended to more
general join conditions. A main shortcoming of the three systems is that they spend a
great deal of time finding the candidate tuples that can be joined together.
Four systems share the concept of crawling databases to build external indexes.
Verity [8] crawls the content of relational databases and builds an external text
index for keyword searches, as well as external auxiliary indexes to enable
parametric searches. DataSpot [9] extracts database content and builds an external,
graph-based representation called a hyperbase to support keyword search; graph nodes
represent data objects such as relations, tuples, and attribute values, and query
answers are connected subgraphs of the hyperbase whose nodes contain all of the query
keywords. DbSurfer [10] indexes the textual content of each relational tuple as a
virtual web page; given a keyword query, the system queries and navigates the virtual
web pages to find the results. EKSO [11] indexes interconnected textual content in
relational databases and supports keyword search over this content: a relational
database is crawled in advance, text-indexing virtual documents that correspond to
interconnected database content, and at query time the text index supports
keyword-based searches with interactive response, identifying the database objects
corresponding to the virtual documents matching the query.
All the index-data-offline systems face two challenges: how to control the
granularity of the indexed content, and how to efficiently find the exact results
from the indexed content.
While a direct empirical comparison between our system and some of the other
approaches mentioned in this section would be very interesting, such a comparison is
not feasible for the following reasons:
– The systems are not publicly available.
– The systems implement different search semantics and return different result sets.
– Any effort to implement them well enough for a fair comparison would be prohibitive.

3 Background
3.1 Basic Tuple Relationship
In our method, we first need to find the closest and most important connections among
tuples. In general, if we have any collection of facts that agree on common attributes
(i.e., are join-consistent), we would like them to be available in the "result" of
this collection of facts. The problem is related to that of computing the full
outerjoin of many relations in a way that preserves all possible connections among
facts. Such a computation has been termed a "full disjunction" by Galindo-Legaria [1].
A full disjunction is a relation with nulls (represented by ⊥) such that every set of
join-consistent tuples in our database appears within a tuple of the full disjunction,
with either ⊥ or a concrete value in each attribute not found among our set of
tuples. Each tuple of the full disjunction corresponds to a set of connected tuples,
each from a database relation. Naturally, the full disjunction reflects the closest
and most important relationships among the tuples that generate it. Through the full
disjunction, we can build the basic relationships among tuples that come from
different database relations.

3.2 Full Disjunction

We consider a database with n relations R1, ..., Rn. The schema graph G is an
undirected graph that captures the primary key to foreign key relationships in the
database schema: it has a node Ri for each relation Ri of the database and an edge
from Ri to Rj for each primary key to foreign key relationship. We assume that the
schema graph G is connected, which is a reasonable assumption for realistic database
schema design.
Definition 1 (Tuple Subsumption). We say that tuple t subsumes tuple u if t and u
agree in every component where u is not ⊥. That is, t is obtained from u by replacing
zero or more nulls by concrete values. Note that a tuple t subsumes itself.
Definition 2 (Full Disjunction). Let Г = R1, R2, ..., Rn be relations whose tuples do
not have nulls. We say R is the full disjunction for Г if the following hold:
z No redundancy: No tuple of R subsumes any other tuple of R.
z Tuples of R come from connected pieces of Г: Let t be a tuple of R. Then
there is some connected subset of the relations of Гsuch that t, restricted to its
non null components, is the join of tuples from those relations.
z All connections are represented:
„ Let t1,…,tk be tuples chosen from distinct relations Ri1 ,..., Rik ,
respectively, such that the schema graph of {R i1 ,..., Rik } is
connected.
„ Let the ti’s be join-consistent, in the sense that for any attribute A, all
the components among the ti’s corresponding to attribute A have the
same value.
„ Let t be the tuple that agree with each of the ti’s in those attributes
appearing among any of Ri1 ,..., Rik and that has ⊥ in other attributes
found among the schemes of Г.
Than t is subsumed by some tuple of R.
Definition 3 (FDJT). Let t be a full disjunction tuple. We call such tuples t1,…,tk Full
Disjunction Join Tuples (FDJT) of t, if t can be generated by joining the set of
tuples t1,…,tk. Each full disjunction tuple is corresponding to a set of tuples like
t1,…,tk. Note that tuples in FDJT doesn’t have sequence.
Definition 4 (FDJTR). The Full Disjunction Join Tuples Relation is a relation made up
of the FDJTs of all tuples in the full disjunction. In ITREKS, the FDJTR is made up of the
relation names and tupleIDs (or rowids) of the tuples in each FDJT. Each pair of relation
name and tupleID represents a database tuple of an FDJT.


It was proved that the full disjunction is unique [2]. It is easy to prove that the FDJT is also
unique; otherwise there would be two equivalent tuples in the full disjunction.

3.3 Computing Full Disjunction and FDJTR

We would like to find a simple way of computing the full disjunction of a set of
relations. The solution is to compute the full disjunction by full outerjoins. The full
outerjoin is a variant of the join in which tuples of one relation that do not match any
tuple of the other relation are added to the result, padded with nulls. This operation is
part of the SQL92 standard. The problem of computing the full disjunction by outerjoins
was studied by Galindo-Legaria in [1], which gave a test for when some order of
outerjoins is guaranteed to produce the full disjunction by itself. The test is simple:
create a graph whose nodes are the relations and whose edges connect relations that
are constrained by one or more comparisons; if the graph is acyclic, then the full
disjunction can be computed by applying full outerjoins in any order. For cyclic graphs,
however, no sequence of full outerjoins is guaranteed to produce the full disjunction. Thus we have Lemma 1.
Lemma 1. For a database which has an acyclic connected schema graph, we can
compute the full disjunction by applying full outerjoins of the connected relations in any
sequence.
Now, for a database whose schema graph is acyclic, we can use Lemma 1 to generate
a full outerjoin sequence producing the full disjunction. In this full outerjoin
sequence, each relation appears exactly once. The relation tuples which are
outerjoined to generate a tuple of the full disjunction are the FDJT of this tuple. Algorithm 1
generates the FDJTR when computing the full disjunction of a database.

Algorithm 1. Computing FDJTR


Input: database relations R1,…,Rn, connected acyclic database schema graph G
Output: FDJTR of the database
1. Do a breadth-first traversal on G from one of G's leaf nodes, obtaining a sequence
   of the relations R1',…,Rn'.
2. Let FDk store the full disjunction of the connected relations R1',…,Rk' (k=2 to n),
   and Fk store the FDJTR of FDk.
3. FD1 = R1'
4. Add R1' to F1
5. for i=2 to n
6.     FDi ← FDi-1 full outerjoin Ri'
7.     foreach tuple t in FDi, add the FDJT of t to Fi
8. end for
9. return Fn

Algorithm 1 first generates the relation sequence by a breadth-first traversal over G,
then full-outerjoins the relations in turn, computing the full disjunction and
the corresponding FDJTR of the relations.
We will discuss how to compute the FDJTR of a database whose schema graph is cyclic in
Section 4.
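As an illustration only, the following Python sketch mimics Algorithm 1 on a toy three-relation schema; the relation names, key columns, and the pandas-based representation are assumptions made for this example, not part of ITREKS itself.

```python
import pandas as pd

# Toy relations; each carries an explicit tuple-id column ("<name>_tid").
author = pd.DataFrame({"A_tid": [1, 2], "aid": [10, 20], "name": ["Ann", "Bob"]})
write  = pd.DataFrame({"W_tid": [1, 2], "aid": [10, 20], "pid": [100, 200]})
paper  = pd.DataFrame({"P_tid": [1, 2, 3], "pid": [100, 200, 300], "title": ["X", "Y", "Z"]})

# Breadth-first relation sequence; each entry names the key joining it to the
# relations already processed (None for the first relation).
sequence = [(author, None), (write, "aid"), (paper, "pid")]

fd = None
for rel, key in sequence:
    if fd is None:
        fd = rel.copy()                        # FD1 = R1'
    else:
        # A full outerjoin keeps tuples with no join partner, padded with NaN (⊥).
        fd = pd.merge(fd, rel, how="outer", on=key)

# FDJTR: for each full-disjunction tuple, the (relation, tuple-id) pairs that were
# outerjoined to produce it; NaN means the relation contributed no tuple.
fdjtr = fd[["A_tid", "W_tid", "P_tid"]].rename_axis("FDJTid").reset_index()
print(fdjtr)
```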

4 The ITREKS System


The system we have developed, ITREKS, is an instantiation of the general
architecture we propose for keyword search over databases, which is shown in Figure 2.
Given a set of query keywords, ITREKS returns all results (sets of joinable tuples
from relations connected by foreign-key relationships) such that each result contains
all keywords. Enabling such keyword search requires (a) a preprocessing step called
Index, which enables a database for keyword search by building a table (FDJT-Tuple-
Index) that keeps tuple relationships, and (b) a Search step that retrieves matching
results from the indexed database.

Fig. 2. Architecture of ITREKS

The Index step is implemented by the Indexer module in Figure 2, while the Search step is
implemented by the Searcher module.

4.1 Overview of Index and Search Steps

Index: A database is enabled for keyword search through the following steps.
Step 1: A database D is identified, along with its schema graph G.
Step 2: If G is cyclic, turn it into an acyclic schema graph G' using the acyclization
operations which will be discussed in Section 4.2.
Step 3: Given D and G’, Indexer generates FDJTR of D using Algorithm 1.
Step 4: FDJT-Tuple-Index table is created for supporting keyword searches, which
will be discussed in detail in Section 4.3.
Search: Given a query consisting of a set of keywords, it is answered as follows.
Step 1: For each keyword k, a Basic Tuple Set (BTS) is established by using database
full-text search functions. Keyword k's BTS is a relation recording all
database tuples which have non-zero scores for k.
Step 2: Based on BTSs, FDJT-Tuple-Index table and Search Table (see Section 4.4),
Searcher finds the results (joinable tuples) which include all keywords. We
discuss this step in Section 4.4.

4.2 Acyclization of Database Schema Graph

Given a database schema graph, ITREKS first cuts off the cycles if the graph is
cyclic, so that we can use Algorithm 1 to compute the FDJTR of the database.
Figure 3 (a) is the schema graph of the DBLP database, where, for simplicity, A, W, P
and C denote the relations Author, Write, Paper and Cite, respectively. Figure 3 (b) is a
simple but typical cyclic schema graph. ITREKS revises a cyclic database graph by
two operations: cut-off and duplication.
Cut-off: By erasing a less important edge which belongs to the cycle, we can make a
cyclic schema graph acyclic. Figure 4 shows the cut-off revision of the schema graphs in Figure 3,
where each schema graph is acyclic but we lose the relationship between P and C (in
Figure 4 (a)) and the relationship between B and C (in Figure 4 (b)), which we consider less
important. If there is no clearly less important relationship, we can remove any edge in the graph cycle.

Fig. 3. Two Schema Graphs Fig. 4. Cut-off Revised Schema Graph

Fig. 5. Duplication Revised Schema Graph

Duplication: By renaming a relation that is incident to an edge deleted by the cut-off operation, we
can keep the relationship that was deleted by that operation. Figure 5 shows the duplication
revision of the schema graphs in Figure 4, where CP is a renamed duplicate of relation P
(in Figure 4 (a)) and B1 is a renamed duplicate of relation B (in Figure 4 (b)).
Pure connective relation: In the revised DBLP database graph (see Figure 5 (a)), there
are two special relations, W and C, whose attributes are all foreign keys. We call such
relations pure connective relations, because the only function of their attributes is to
connect tuples, and they do not contain indispensable keywords for our keyword search.
In ITREKS, we discard pure connective relations from the FDJTR once the FDJTR is
completely constructed. For example, after computing the FDJTR by Algorithm 1 over the
revised DBLP schema graph (Figure 5 (a)), the schema of the FDJTR is (FDJTid, Aid,
Wid, Pid, Cid, CPid). After discarding the pure connective relations we get the FDJTR (FDJTid,
Aid, Pid, CPid). For simplicity, we use Aid to represent a tuple's id in relation Author;
similarly, Pid and CPid are tuple ids in Paper and CP, respectively.

4.3 FDJT-Tuple-Index Table

The FDJT-Tuple-Index table indexes each database tuple with its FDJTs. ITREKS builds
tuples' relationships by establishing the FDJT-Tuple-Index table.
Extended Schema Graph: To build the FDJT-Tuple-Index table, ITREKS extends the
FDJTR as follows:
For each relation in the FDJTR, if the relation has edges with other relations in the original
database schema graph, add these relations and edges to the FDJTR. If a newly added
relation is a pure connective relation, ITREKS continues adding the other relations that
have edges with that pure connective relation.
For the DBLP database, the extended schema graph of the FDJTR is shown in Figure 6.
The extended schema reflects the relationship between each database relation and the
FDJTR. If a relationship is not important enough to be indexed, we discard the related
relations in the extended schema. For example, in Figure 6, which papers are cited by
the papers in CP need not be indexed, so ITREKS discards the related relations W
and P. Note that the extended schema graph is always a tree (i.e., acyclic).

Fig. 6. Extended Schema graph of DBLP

Locator Number: ITREKS gives each relation in the extended schema a locator number
to record the distance and relationship between a tuple and an FDJT. The number is used
when ITREKS calculates the results. ITREKS assigns locator numbers to relations as
follows:
• Let the FDJTR have n relations; ITREKS labels each relation in the FDJTR with an integer
from 1 to n in sequence.
• For other relations in the extended schema graph, the locator number consists of two
parts divided by a dot. The number to the left of the dot (left number) is the number of
the FDJTR relation connected to it; the number to the right of the dot (right number) is
the integer 1.
FDJT-Tuple-Index Table: In ITREKS, the FDJT-Tuple-Index table has 4 columns; the
first two columns are RN and Tid, which identify a database tuple's relation name and
rowid. Column FDJTid is the rowid of an FDJT in the FDJTR that is connected with the
tuple. Column N is the locator number representing the relationship between the tuple
and the FDJT; the locator number comes from the extended schema graph of the
FDJTR. In the FDJT-Tuple-Index table, each row records a tuple-FDJT pair and their
relationship.

Algorithm 2. Computing FDJT-Tuple-Index Table


Input: database relations R1,…,Rn, FDJTR of the database
Output: FDJT-Tuple-Index table
1. Extend the schema graph of the FDJTR of the database to the extended schema graph ESG.
2. For each record R in the FDJTR
3.     Insert the rowid, relation name and locator number of the tuples in R into the FDJT-
       Tuple-Index table.
4.     Based on ESG, insert the rowid, relation name and locator number of the tuples that
       have relationships with R into the FDJT-Tuple-Index table.
5. end for
6. return FDJT-Tuple-Index table

Given a database and its FDJTR, Algorithm 2 generates the FDJT-Tuple-Index table.
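For illustration, the minimal Python sketch below carries out step 3 of Algorithm 2 on toy in-memory data; step 4 (indexing neighbouring tuples via the extended schema graph) is omitted, and the data layout, relation names and locator numbers are assumptions.

```python
# fdjtr: FDJTid -> list of (relation_name, tuple_id); locator: relation_name -> N
fdjtr = {1: [("Author", 5), ("Paper", 7), ("CP", 2)],
         2: [("Author", 5), ("Paper", 9), ("CP", 4)]}
locator = {"Author": "1", "Paper": "2", "CP": "3"}

fdjt_tuple_index = []                        # rows of (RN, Tid, FDJTid, N)
for fdjt_id, members in fdjtr.items():
    for rn, tid in members:
        fdjt_tuple_index.append((rn, tid, fdjt_id, locator[rn]))

for row in fdjt_tuple_index:
    print(row)                               # e.g. ('Author', 5, 1, '1')
```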

4.4 Searching Step

After the FDJT-Tuple-Index table is created, ITREKS is ready for keyword search. Given a
query consisting of a set of keywords, ITREKS establishes a BTS (Basic Tuple Set)
for each keyword k, recording all database tuples which have non-zero scores for k. Then, based
on the BTSs, the FDJT-Tuple-Index table and the Search Table, the Searcher finds the results
(joinable tuples) which include all keywords.
Definition 5 (BTS). For a keyword k, the Basic Tuple Set is a relation BTSk = {t |
Score(t, k) > 0}, which consists of the database tuples with a non-zero score for
keyword k.
ITREKS uses the Oracle Text full-text search function to build a BTS for each keyword.
A BTS table consists of 3 columns, RN, Tid and Score, which represent the relation
name, tuple id and score, respectively.
Definition 6 (ST). The Search Table is a table that is dynamically generated by ITREKS
to find joinable tuples at the search step. Given keywords k1,…,kn, ITREKS generates an
ST with 2 + 3n columns. In the ST, each keyword ki (i=1,...,n) corresponds to 3 columns,
ki_RN, ki_Tid and ki_N, which represent tuples and the relationships between the tuples.
The other two columns are FDJTid, which comes from the FDJT-Tuple-Index table, and the
Score of the result.
Definition 7 (Result Tree). Result Tree is a tree of joinable tuples based on
extended schema graph of FDJTR, where each leaf node of the tree contains at least
one keyword and the nodes of the tree contain all keywords. The sizeof(T) of a result
tree T is the number of edges in T.
Ranking Function: ITREKS uses a simple but effective ranking function to rank the
result trees for a given query. ITREKS assigns the score of a result tree T in the
following way:

$$score(T, Q) = \frac{1}{sizeof(T)} \sum_{i=1}^{sizeof(T)} \sum_{j=1}^{k} Score(t_i, kw_j)$$

where Score(t_i, kw_j) is the score of a tuple t_i towards keyword kw_j. ITREKS computes
sizeof(T) as follows.
Let N be a tuple's locator number as defined in the FDJT-Tuple-Index table.
Let max be the largest left number of N among the result tree's leaf nodes; let min be the
smallest left number of N among the result tree's leaf nodes. Let r be the sum of the right
numbers over the result tree's nodes.
The size of T is then computed as:
sizeof(T) = max − min + r
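A minimal Python sketch of this ranking function follows; the input format (locator numbers plus per-tuple keyword scores) and all names are assumptions for illustration, and the summation simply runs over all tuples of the result tree.

```python
def sizeof(nodes):
    """nodes: list of (locator_number, is_leaf) pairs, e.g. ('3.1', True)."""
    lefts  = [int(loc.split(".")[0]) for loc, is_leaf in nodes if is_leaf]
    rights = [int(loc.split(".")[1]) if "." in loc else 0 for loc, _ in nodes]
    return max(lefts) - min(lefts) + sum(rights)     # sizeof(T) = max - min + r

def score(tuple_keyword_scores, nodes):
    """tuple_keyword_scores[i][j] = Score(t_i, kw_j); nodes describe the result tree."""
    total = sum(s for per_tuple in tuple_keyword_scores for s in per_tuple)
    return total / sizeof(nodes)

# Two tuples matching two keywords; one FDJTR node ('2') and one attached node ('3.1').
print(score([[0.8, 0.0], [0.0, 0.5]], [("2", True), ("3.1", True)]))   # -> 0.65
```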
Given a set of query keywords, ITREKS finds the results by Algorithm 3, described
below.

Algorithm 3. Generating Results


Input: a query Q, database D, FDJT-Tuple-Index table of the database FDJTTI
Output: Result Trees

1. For each keyword ki (i=1,…,n) in Q, create BTS_ki
2. Sort the BTS_ki in ascending order by the number of records in each BTS table.
   Without loss of generality, let BTS_k1,…,BTS_kn be the BTSs of the keywords in this
   ascending order
3. Generate the search table ST, initially empty
4. Let F0 = FDJTTI
5. F1 = F0 natural join BTS_k1
6. Add the relevant information (Score, FDJTid, K1_RN, K1_Tid, K1_N) to ST
7. For i=2 to n do {
8.     Fi = Fi-1 natural join BTS_ki
9.     Insert the relevant information (Ki_RN, Ki_Tid, Ki_N) into ST and update the
       corresponding scores }
10. Remove records that contain null fields from ST
11. Sort the records in ST in descending order by their scores
12. For records that have the same values on all fields Ki_RN and Ki_Tid (i=1,…,n)
    in ST, keep only the record with the highest score and remove the others
13. Retrieve the results from ST
14. Return result trees

In Algorithm 3, once Fi-1 is natural-joined with BTS_ki (i=1,…,n) and the relevant
information is put into ST, ST records the relationships of joinable tuples based on FDJTs,
and the joined tuples contain all keywords kj (j=1,…,i). A simplified sketch of this search step follows.
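The following toy Python sketch is a simplified, in-memory paraphrase of the search step, not the authors' SQL/ST-based implementation: candidate tuples for each keyword are grouped by the FDJT they belong to, and only FDJTs covered by every keyword survive. The data layout, names, and the flat score aggregation (no sizeof normalization) are all assumptions.

```python
from collections import defaultdict

# FDJT-Tuple-Index rows: (RN, Tid, FDJTid, N)
fdjtti = [("Author", 5, 1, "1"), ("Paper", 7, 1, "2"),
          ("Author", 5, 2, "1"), ("Paper", 9, 2, "2")]
# One BTS per keyword: (RN, Tid) -> Score
bts = {"xml":   {("Paper", 7): 0.9},
       "query": {("Paper", 9): 0.7, ("Paper", 7): 0.4}}

per_keyword = {}                              # keyword -> FDJTid -> matching tuples
for kw, scores in bts.items():
    groups = defaultdict(list)
    for rn, tid, fdjt_id, n in fdjtti:        # FDJTTI "natural join" BTS_kw
        if (rn, tid) in scores:
            groups[fdjt_id].append((rn, tid, n, scores[(rn, tid)]))
    per_keyword[kw] = groups

# Keep only FDJTs whose related tuples cover all keywords (no null fields in ST).
covered = set.intersection(*(set(g) for g in per_keyword.values()))
results = sorted(((fid, sum(s for kw in bts for *_, s in per_keyword[kw][fid]))
                  for fid in covered), key=lambda r: -r[1])
print(results)                                # [(FDJTid, aggregated score), ...]
```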

5 System Evaluation

Our system is implemented on a PC with Pentium Ⅳ 2.8GHz processor and 4GB of


RAM, running Windows XP and Oracle 9i. ITREKS has been implemented in Java
and connects to the DBMS through JDBC.
We evaluate our tuple relationship indexing and searching system on a 102MB
DBLP [12] data set, which we decomposed into 4 relations according to the schema
shown in Figure 3 (a). Table 1 summarizes the 4 DBLP relations. The BTSk for a
keyword k is produced by merging the tuples returned by the Oracle 9i full-text index on
each relation in the database.
Table 2 summarizes the FDJTR and the FDJT-Tuple-Index (FDJTTI) table, which are
produced in the preprocessing step. Because the FDJTTI table stores the relationships
between all tuples and FDJTs, it is far larger than the other tables. The indexing time includes
both the FDJT and FDJTTI production times and is spent only at the index step.

Table 1. DBLP dataset characteristics

Relation   #Tuples   Size(MB)
Author     294063    8.62
Write      1000126   36
Paper      446409    51.1
Cite       223013    6.7

Table 2. FDJT and FDJTTI

            FDJT      FDJTTI
Tuples      1248595   39789519
Size(MB)    42.99     2415.92
Time(ms)    155094    3798591

We evaluate search performance by submitting conjunctive keyword queries of
length 2, 3, 4 and 5 words, and we measure the time to return the full ranked result set.

[Figure 7 plot omitted: average query time (msec) on the y-axis vs. number of keywords (2–5) on the x-axis.]

Fig. 7. Query Performance



In each trial, we generate 50 queries by randomly choosing keywords from the
keyword set. The reported time for a trial is the average of the 50 query execution
times. Figure 7 shows query performance on the DBLP dataset. All queries return their
results in less than 7 seconds, and as the number of keywords increases, the query times
do not increase sharply.

6 Conclusion and Future Work


We presented a general architecture for supporting keyword-based search over
relational databases, and implemented an instantiation of the architecture in our fully
implemented system ITREKS. ITREKS indexes tuple relationships in a relational
database, providing efficient keyword search capabilities over the database. Our
system trades offline indexing effort for efficient online keyword-based search over
relational databases.
In the future, we will extend our method to semi-structured data like XML and
implement our system over more databases.

Acknowledgements
This work is supported by the National Natural Science Foundation of
China(No.60473069 and 60496325), and China Grid(No.CNGI-04-15-7A).

References
[1] Galindo-Legaria, C. Outerjoins as disjunctions. ACM SIGMOD International Conf. on
Management of Data, 1994
[2] Rajaraman, A. and J. D. Ullman. Integrating information by outerjoins and full
disjunctions.
[3] Oracle Text. http://otn.oracle.com/products/text/index.html.
[4] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases.
In Proc. of VLDB, 2002.
[5] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search
over Relational Databases. In Proc. Of VLDB, 2003.
[6] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching
and browsing in databases using BANKS. In Proc. of ICDE, 2002.
[7] S. Agrawal, S. Chaudhuri, and G. Das. Dbxplorer: A system for keyword-based search
over relational databases. In Proc. of ICDE, 2002.
[8] P. Raghavan. Structured and unstructured search in enterprises. IEEE Data Engineering
Bulletin, 24(4), 2001.
[9] S. Dar, G. Entin, S. Geva, and E. Palmon. DTL's DataSpot: Database exploration using
plain language. In Proc. of VLDB, 1998.
[10] R. Wheeldon, M. Levene, and K. Keenoy. Search and navigation in relational databases.
http://arxiv.org/abs/cs.DB/0307073.
[11] Qi Su, Jennifer Widom. Efficient and Extensible Keyword Search over Relational
Databases. Stanford University Technical Report, 2003.
[12] DBLP bibliography. 2004. http://www.informatik.uni-trier.de/~ley/db/index.html
An MBR-Safe Transform for High-Dimensional
MBRs in Similar Sequence Matching

Yang-Sae Moon

Department of Computer Science, Kangwon National University


192-1, Hyoja2-Dong, Chunchon, Kangwon 200-701, Korea
[email protected]

Abstract. In this paper we propose a formal approach that transforms


a high-dimensional MBR itself to a low-dimensional MBR directly, and
show that the approach significantly reduces the number of lower-dimen-
sional transformations in similar sequence matching. To achieve this goal,
we first formally define a new notion of MBR-safe. We say that a trans-
form is MBR-safe if it constructs a low-dimensional MBR by containing
all the low-dimensional sequences to which an infinite number of high-
dimensional sequences in an MBR are transformed. We then propose an
MBR-safe transform based on DFT. For this, we prove the original DFT-
based lower-dimensional transformation is not MBR-safe and define a
new transform, called mbrDFT, by extending the definition of DFT. We also
formally prove this mbrDFT is MBR-safe. Analytical and experimental
results show that our mbrDFT reduces the number of lower-dimensional
transformations drastically and improves performance significantly com-
pared with the traditional method.

1 Introduction
Time-series data are the sequences of real numbers representing values at spe-
cific points in time. Typical examples of time-series data include stock prices,
exchange rates, and weather data [1,3,5,8]. The time-series data stored in a
database are called data sequences, and those given by users are called query
sequences. Finding data sequences similar to the given query sequence from
the database is called similar sequence matching [3,8]. As the distance function
D(X, Y) between two sequences X = {x_0, x_1, ..., x_{n−1}} and Y = {y_0, y_1, ..., y_{n−1}}
of the same length n, many similar sequence matching models have used the L_p-distance
(= (Σ_{i=0}^{n−1} |x_i − y_i|^p)^{1/p}), including the Manhattan distance (= L_1), the
Euclidean distance (= L_2), and the maximum distance (= L_∞) [1,2,3,4,7,8,9].
Most similar sequence matching solutions have used the lower-dimensional
transformation to store high-dimensional sequences into a multidimensional in-
dex [1,2,3,5,7,8,9]. The lower-dimensional transformation has first been intro-
duced in Agrawal et al.’s whole matching solution [1], and widely used in various
whole matching solutions [2,5] and subsequence matching solutions [3,7,8,9]. Re-
cently, it was also used in similar sequence matching on streaming time-series
for dimensionality reduction of query sequences or streaming time-series [4]. In


this paper we pay attention to the method of constructing an MBR (Minimum


Bounding Rectangle) in similar sequence matching. Previous similar sequence
matching solutions use MBRs to reduce the number of points to be stored in
a multidimensional index. That is, they do not store individual points directly
into the index, but store only MBRs, each of which contains hundreds or thousands of
the low-dimensional points. For example, a low-dimensional MBR in subsequence
matching is constructed as follows [3,9]: data sequences are divided into windows;
the high-dimensional windows are transformed to low-dimensional points; and
an MBR is constructed by containing multiple transformed points. In summary,
to construct an MBR to be stored in the index, the existing methods transform
tens ∼ thousands of high-dimensional sequences (or windows) to low-dimensional
sequences (or points) [3,8]. These methods therefore require a huge number
of lower-dimensional transformations, and thus in this paper we tackle the prob-
lem of how to reduce the number of transformations.
To reduce the number of transformations in constructing low-dimensional
MBRs, we propose the lower-dimensional transformation method for high-
dimensional MBRs. That is, the method transforms a high-dimensional MBR
itself to a low-dimensional MBR directly, where the high-dimensional MBR con-
tains multiple high-dimensional sequences. For this, we first propose a new notion
of MBR-safe. We say that a transform T is MBR-safe if T satisfies the following
property: suppose an MBR M is transformed to M^T by T, and a sequence X is con-
tained in M; then the transformed sequence X^T should also be contained in
M^T. If using the notion of MBR-safe, we can construct a low-dimensional MBR
by transforming a high-dimensional MBR itself rather than a large number of
individual sequences in the MBR. And accordingly, we can reduce the number
of transformations required for constructing low-dimensional MBRs.
In this paper we propose an MBR-safe transform based on DFT (Discrete
Fourier Transform) [11], which is most widely used as the lower-dimensional
transformation. For this, we first prove the original DFT-based lower-dimensional
transformation is not MBR-safe. We then define a new transform, called mbrDFT,
by extending the definition of DFT. We also formally prove this mbrDFT is MBR-
safe. Through analysis and experiments, we show the superiority of the proposed
MBR-safe transform. By deriving the computational complexity of constructing
a low-dimensional MBR, we analytically show superiority of our mbrDFT. We
then empirically show that mbrDFT reduces the number of lower-dimensional
transformations drastically and improves performance significantly compared
with the traditional method.

2 Related Work

Similar sequence matching can be classified into whole matching and subsequence
matching [3]. The whole matching[1,2,5] finds data sequences similar to a query
sequence, where the lengths of data sequences and the query sequence are all
identical. On the other hand, the subsequence matching[3,7,8] finds subsequences,
contained in data sequences, similar to a query sequence of arbitrary length.

Also, several transform techniques such as moving average transform, shifting


& scaling, normalization transform, and time warping have been used in similar
sequence matching to solve the problems that the Euclidean distance function
has [9]. We note that most similar sequence matching solutions have used the
lower-dimensional transformation to use a multidimensional index.
Previous similar sequence matching solutions construct MBRs to reduce the
number of points to be stored in the index or to reduce the number of range
queries. For example, solutions in [3,9] divide data sequences into windows, trans-
form the windows to low-dimensional points, and finally store MBRs containing
multiple transformed points in the index. Similarly, solutions in [7,8] divide a
query sequence into windows, transform the windows to low-dimensional points,
and finally use MBRs containing multiple transformed points in constructing
range queries. Also, recent work for continuous queries on streaming time-series
uses the method of constructing MBRs that contain multiple sequences [4]. Like-
wise, most previous solutions construct MBRs after transforming individual
high-dimensional sequences into low-dimensional sequences (points); in contrast,
our solution transforms a high-dimensional MBR itself to a low-dimensional
MBR directly. Therefore, our solution is quite different from the previous ones
in constructing MBRs.
Various transforms including DFT and Wavelet transform are used as the
lower-dimensional transformation of high-dimensional sequences. DFT is most
widely used in many similar sequence matching solutions [1,3,7,8,9]. Wavelet
transform is also used as the lower-dimensional transformation in [2,10]. Be-
sides these transforms, PAA(Piecewise Aggregate Approximation) [5] and SVD
(Singular Value Decomposition) [6] were introduced as the lower-dimensional
transformation. All these transformations, however, focused on transforming
high-dimensional sequences to low-dimensional ones, and they cannot be directly
applied to the lower-dimensional transformation of high-dimensional MBRs.

3 Definition of MBR-Safe

We first summarize in Table 1 the notation to be used throughout the paper.


We then formally define the notion of MBR-safe as the following Definition 1.
Definition 1. For an n-dimensional sequence X and an n-dimensional MBR
[L, U ], if a transform T satisfies the following Eq. (1), then we say T is MBR-safe.

$$X \in [L, U] \Longrightarrow X^T \in [L, U]^T \qquad (1)$$

Figure 1 depicts the concept of MBR-safe. In Figure 1, transform T1 is MBR-safe,
but T2 is not. The reason why T1 is MBR-safe is that, if an arbitrary
sequence X is contained in MBR [L, U] (i.e., X ∈ [L, U]), then the transformed
sequence X^{T1} is also contained in the transformed MBR [L, U]^{T1} (i.e., X^{T1} ∈
[L, U]^{T1} = [Λ, Υ]). Analogously, the reason why T2 is not MBR-safe is that,
even though X is contained in [L, U] (i.e., X ∈ [L, U]), X^{T2} is not contained in
[L, U]^{T2} (i.e., X^{T2} ∉ [L, U]^{T2} = [A, B]).

Table 1. Summary of notation

Symbols            Definitions
X                  A high-dimensional sequence (= {x_0, x_1, ..., x_{n−1}}).
X^T                A (low-dimensional) sequence transformed from X by the transform T
                   (= {x_0^T, x_1^T, ..., x_{m−1}^T}).
[L, U]             A high-dimensional MBR whose lower-left and upper-right points are L and U,
                   respectively (= [{l_0, l_1, ..., l_{n−1}}, {u_0, u_1, ..., u_{n−1}}]).
[L, U]^T = [Λ, Υ]  A (low-dimensional) MBR transformed from [L, U] by the transform T
                   (= [{λ_0, λ_1, ..., λ_{m−1}}, {υ_0, υ_1, ..., υ_{m−1}}]).
X ∈ [L, U]         The sequence X is contained in the MBR [L, U]
                   (i.e., for every i, l_i ≤ x_i ≤ u_i).

[Figure 1 illustration omitted: the MBR [L, U] containing a sequence X is mapped by transform T1 to [L, U]^{T1} = [Λ, Υ], which contains X^{T1} (for every i, λ_i ≤ x_i^{T1} ≤ υ_i), whereas transform T2 maps it to [L, U]^{T2} = [A, B], which does not contain X^{T2} (for some i, x_i^{T2} < α_i or x_i^{T2} > β_i).]
Fig. 1. An MBR-safe transform (T 1) and a non-MBR-safe transform (T 2)

If using an MBR-safe transform, we can drastically reduce the number of


lower-dimensional transformations. In general, previous solutions construct an
MBR after tens ∼ thousands of lower-dimensional transformations for individ-
ual sequences [3,7,9]. In contrast, if using the notion of MBR-safe, we can reduce
the number of lower-dimensional transformations since we transform the high-
dimensional MBR itself to a low-dimensional MBR directly. Figure 2 shows these
two methods of constructing a low-dimensional MBR. The upper part of the fig-
ure shows an example of using the traditional transform, and the lower part that
of using an MBR-safe transform. As shown in the figure, if using the traditional
transform, we first transform tens ∼ thousands of individual sequences to low-
dimensional sequences, and then construct a low-dimensional MBR by containing
the transformed sequences. In contrast, if using the MBR-safe transform, we can
construct a low-dimensional MBR by simply transforming a high-dimensional

[Figure 2 illustration omitted: top, the traditional transform converts many high-dimensional sequences into low-dimensional sequences and then builds a low-dimensional MBR around them; bottom, the MBR-safe transform converts a high-dimensional MBR directly into a low-dimensional MBR.]

Fig. 2. Two methods of constructing a low-dimensional MBR from high-dimensional sequences

MBR itself rather than a large number of individual sequences. It means that,
by using the MBR-safe transform, we can reduce the number of transformations
in similar sequence matching.

4 A DFT-Based MBR-Safe Transform

DFT has been most widely used as the lower-dimensional transformation in sim-
ilar sequence matching [1,3,7,8,9]. DFT transforms an n-dimensional sequence X
to a new n-dimensional sequence Y (= {y0 , y1 , ..., yn−1 }) in a complex number
space, where each complex number yi is defined as the following Eq. (2) [1,11]:

$$y_i = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t e^{-j \cdot 2\pi i t / n}, \quad 0 \le i \le n-1. \qquad (2)$$

By Euler's formula [11] and the definition of complex numbers, we can rewrite Eq. (2)
as Eq. (3) in terms of the real part and imaginary part.

$$y_i = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \cos(-2\pi i t/n) + \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \sin(-2\pi i t/n) \cdot j, \quad 0 \le i \le n-1. \qquad (3)$$

DFT concentrates most of the energy into the first few coefficients, and thus
only a few coefficients extracted from the transformed point Y are used for
the lower-dimensional transformation [1,3]. The following Definition 2 shows the
traditional DFT-based lower-dimensional transformation.
Definition 2. The DFT-based lower-dimensional transformation transforms an
n-dimensional sequence X to a new m (≪ n)-dimensional sequence X^{DFT} of
{x_0^{DFT}, x_1^{DFT}, ..., x_{m−1}^{DFT}}, where each x_i^{DFT} is obtained by Eq. (4). Also, it trans-
forms an n-dimensional MBR [L, U] to a new m-dimensional MBR [L, U]^{DFT}
whose lower-left and upper-right points are L^{DFT} and U^{DFT}, respectively, i.e.,
[L, U]^{DFT} = [L^{DFT}, U^{DFT}]. In Eq. (4), θ = −2π⌊i/2⌋t/n and 0 ≤ i ≤ m − 1.

$$x_i^{DFT} = \begin{cases} \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \cos\theta, & \text{if } i \text{ is even;} \\[4pt] \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \sin\theta, & \text{if } i \text{ is odd.} \end{cases} \qquad (4)$$

In similar sequence matching, by using the DFT-based lower-dimensional trans-


formation, we transform a high-dimensional sequence with tens ∼ hundreds of
dimensions to a low-dimensional sequence with one ∼ six dimensions.
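As an illustration only, here is a small Python sketch of the DFT-based lower-dimensional transformation of Definition 2 (interpreting the angle as θ = −2π⌊i/2⌋t/n, consistent with the footnote later in this section); the function name and the chosen coefficient indices are assumptions. It reproduces X^{DFT} ≈ {6.00, −0.25} for the sequence used in Example 1.

```python
import math

def dft_lower(x, indices):
    """Return [x_i^DFT for i in indices] for an n-dimensional sequence x."""
    n = len(x)
    out = []
    for i in indices:
        k = i // 2                               # floor(i/2), cf. the footnote on x_1^DFT
        trig = math.cos if i % 2 == 0 else math.sin
        out.append(sum(xt * trig(-2.0 * math.pi * k * t / n)
                       for t, xt in enumerate(x)) / math.sqrt(n))
    return out

print(dft_lower([3.0, 2.5, 3.5, 3.0], [0, 2]))   # -> approximately [6.0, -0.25]
```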
The DFT-based lower-dimensional transformation, however, is not MBR-safe.
We give in Example 1 a counterexample to show that it is not MBR-safe.
Example 1. Let X be a 4-dimensional sequence of {3.00, 2.50, 3.50, 3.00}, and
[L, U] be a 4-dimensional MBR of L = {2.00, 1.00, 3.00, 2.00} and U = {4.00,
3.00, 5.00, 4.00}. Then, for the given X and [L, U], X ∈ [L, U] holds. By using
the DFT-based lower-dimensional transformation, we now transform the given
4-dimensional sequence and MBR to the 2-dimensional sequence and MBR, re-
spectively. Then, by Definition 2, we can transform X to a new sequence X^{DFT} of
{6.00, −0.25}¹. Similarly, we can also transform [L, U] to a new MBR [L, U]^{DFT},
where L^{DFT} = {4.00, −0.50} and U^{DFT} = {8.00, −0.50}. Here, we note that
−0.50 ≤ −0.25 but −0.25 ≰ −0.50, that is, l_2^{DFT} ≤ x_2^{DFT} but x_2^{DFT} ≰ u_2^{DFT}. Thus, for the trans-
formed X^{DFT} and [L, U]^{DFT}, X^{DFT} ∈ [L, U]^{DFT} does not hold. It means that
the DFT-based lower-dimensional transformation is not MBR-safe. □
As noted in Example 1, the DFT-based lower-dimensional transformation is not
MBR-safe, and thus we cannot use it for the lower-dimensional transformation
of MBRs. Therefore, we introduce a DFT-based MBR-safe transform, called
mbrDFT. The following Definition 3 presents a formal definition of mbrDFT.
Definition 3. For an n-dimensional MBR [L, U], mbrDFT is defined as an
operation that constructs an m (≪ n)-dimensional MBR [L, U]^{mbrDFT} whose
lower-left and upper-right points are Λ and Υ, respectively, as given in Eq. (5). And, for
an n-dimensional sequence X, the mbrDFT-transformed sequence X^{mbrDFT} is
identical to X^{DFT}. In Eq. (5), θ = −2π⌊i/2⌋t/n and 0 ≤ i ≤ m − 1.

$$\lambda_i = \begin{cases} \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} a_t \cos\theta, & \text{if } i \text{ is even;} \\[4pt] \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} b_t \sin\theta, & \text{if } i \text{ is odd;} \end{cases} \qquad \upsilon_i = \begin{cases} \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} c_t \cos\theta, & \text{if } i \text{ is even;} \\[4pt] \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} d_t \sin\theta, & \text{if } i \text{ is odd;} \end{cases} \qquad (5)$$

$$\text{where} \quad \begin{cases} a_t = l_t,\ c_t = u_t, & \text{if } \cos\theta \ge 0; \\ a_t = u_t,\ c_t = l_t, & \text{if } \cos\theta < 0; \end{cases} \qquad \begin{cases} b_t = l_t,\ d_t = u_t, & \text{if } \sin\theta \ge 0; \\ b_t = u_t,\ d_t = l_t, & \text{if } \sin\theta < 0. \end{cases}$$

To guarantee MBR-safety of mbrDFT, we intentionally make Λ and Υ in Eq. (5)


contain every possible sequence that can be transformed from the original MBR
[L, U ]. The following Theorem 1 shows that mbrDFT is an MBR-safe transform.
¹ In DFT, the imaginary part of the first complex number (i.e., x_1^{DFT}) is always 0.
Thus, we use {x_0^{DFT}, x_2^{DFT}} instead of {x_0^{DFT}, x_1^{DFT}} [3,8].

Theorem 1. For an n-dimensional sequence X and an n-dimensional MBR
[L, U], if X ∈ [L, U] holds, then X^{mbrDFT} ∈ [L, U]^{mbrDFT} also holds (X ∈
[L, U] ⇒ X^{mbrDFT} ∈ [L, U]^{mbrDFT}). That is, mbrDFT is MBR-safe.
Proof: To show X^{mbrDFT} ∈ [Λ, Υ] (= [L, U]^{mbrDFT}), we need to prove that
λ_i ≤ x_i^{mbrDFT} ≤ υ_i holds for every i. We proceed with the proof by two cases: 1)
the first case where i of x_i^{mbrDFT} is even, and 2) the second one where i is odd.
1) Assume i is even. Then, λ_i = (1/√n) Σ_{t=0}^{n−1} a_t cos θ and υ_i = (1/√n) Σ_{t=0}^{n−1} c_t cos θ.
Here, we note that l_t ≤ x_t ≤ u_t holds for every t (0 ≤ t ≤ n − 1) since X ∈
[L, U] holds by the assumption. Thus, if cos θ is positive, l_t cos θ ≤ x_t cos θ ≤
u_t cos θ holds since l_t ≤ x_t ≤ u_t holds. Similarly, if cos θ is negative, u_t cos θ ≤
x_t cos θ ≤ l_t cos θ holds. Accordingly, Σ_{t=0}^{n−1} a_t cos θ, which is obtained by
adding l_t cos θ if cos θ is positive and u_t cos θ if cos θ is negative, is less than or
equal to Σ_{t=0}^{n−1} x_t cos θ. It means that (1/√n) Σ_{t=0}^{n−1} a_t cos θ (= λ_i) is less than or equal
to (1/√n) Σ_{t=0}^{n−1} x_t cos θ (= x_i^{mbrDFT}). Analogously, Σ_{t=0}^{n−1} c_t cos θ, which is obtained
by adding u_t cos θ if cos θ is positive and l_t cos θ if cos θ is negative, is greater
than or equal to Σ_{t=0}^{n−1} x_t cos θ. Thus, (1/√n) Σ_{t=0}^{n−1} c_t cos θ (= υ_i) is greater than or
equal to (1/√n) Σ_{t=0}^{n−1} x_t cos θ (= x_i^{mbrDFT}). Therefore, λ_i ≤ x_i^{mbrDFT} ≤ υ_i holds for
every case where i of x_i^{mbrDFT} is even.
2) Assume i is odd. We can also prove that λ_i ≤ x_i^{mbrDFT} ≤ υ_i holds by the
similar steps described in case 1) above.
According to cases 1) and 2), λ_i ≤ x_i^{mbrDFT} ≤ υ_i holds for every i. Therefore,
mbrDFT is MBR-safe by Definition 1. □

The following Example 2 shows that mbrDFT is an MBR-safe transform.
Example 2. As in Example 1, let X be a sequence of {3.00, 2.50, 3.50, 3.00}, and
[L, U] be an MBR of L = {2.00, 1.00, 3.00, 2.00} and U = {4.00, 3.00, 5.00, 4.00}.
We now want to transform X and [L, U] using mbrDFT. Then, we can transform
X to a new sequence X^{mbrDFT} of {6.00, −0.25}. Similarly, we can also transform
[L, U] to a new MBR [Λ, Υ] (= [L, U]^{mbrDFT}), where Λ = {4.00, −1.50} and Υ =
{8.00, 0.50}. Here, we note that both 4.00 ≤ 6.00 ≤ 8.00 (λ_0 ≤ x_0^{mbrDFT} ≤ υ_0)
and −1.50 ≤ −0.25 ≤ 0.50 (λ_2 ≤ x_2^{mbrDFT} ≤ υ_2) hold. Thus, for the mbrDFT-
transformed X^{mbrDFT} and [L, U]^{mbrDFT}, X^{mbrDFT} ∈ [L, U]^{mbrDFT} holds. It
means that mbrDFT is an MBR-safe transform. □
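For illustration, a minimal Python sketch of mbrDFT (Definition 3) follows; the function name and coefficient indices are assumptions. It maps the MBR of Example 2 directly to approximately Λ = {4.00, −1.50} and Υ = {8.00, 0.50}.

```python
import math

def mbr_dft(L, U, indices):
    """Transform an n-dimensional MBR [L, U] directly to the low-dimensional [Λ, Υ]."""
    n = len(L)
    lam, ups = [], []
    for i in indices:
        trig = math.cos if i % 2 == 0 else math.sin
        lo = hi = 0.0
        for t in range(n):
            c = trig(-2.0 * math.pi * (i // 2) * t / n)
            # Per Eq. (5): pick the bound that minimises (for λ_i) or maximises (for υ_i) each term.
            lo += (L[t] if c >= 0 else U[t]) * c
            hi += (U[t] if c >= 0 else L[t]) * c
        lam.append(lo / math.sqrt(n))
        ups.append(hi / math.sqrt(n))
    return lam, ups

print(mbr_dft([2.0, 1.0, 3.0, 2.0], [4.0, 3.0, 5.0, 4.0], [0, 2]))
# -> approximately ([4.0, -1.5], [8.0, 0.5])
```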
The proposed mbrDFT is optimal (i.e., it constructs the smallest MBR) among
the DFT-based MBR-safe transforms that convert a high-dimensional MBR itself
into a low-dimensional MBR directly. It means that there is no DFT-based MBR-
safe transform whose low-dimensional MBR is smaller than that of mbrDFT. We
omit the proof of optimality due to space limitation.

5 Computational Complexity Analysis


In this section we analyze computational complexity required to construct a
low-dimensional MBR. We analyze two DFT-based transformations: 1) the

traditional method that constructs an MBR after performing the DFT-based


lower-dimensional transformation for individual sequences (we simply call this
method orgDFT and its complexity orgDFT-complexity) and 2) the proposed
mbrDFT (we call its complexity mbrDFT-complexity).
First, the orgDFT-complexity depends on the length n and the number m of se-
quences contained in an MBR. That is, if the computational complexity of a DFT
unit operation for a sequence of length n is O(f(n)), the orgDFT-complexity for m
sequences is O(m·f(n)). The complexity of one DFT operation for a sequence of
length n is O(n log n) [11]. Thus, the orgDFT-complexity for an MBR is
O(mn log n). Next, mbrDFT requires only two DFT operations, for the two
sequences Λ and Υ. Thus, the mbrDFT-complexity for an MBR is O(n log n).
In summary, we derive the orgDFT-complexity as O(mn log n) and the mbrDFT-
complexity as O(n log n), respectively. Figure 3 (a) shows a graph that presents
orgDFT-complexity and mbrDFT-complexity, where we set the length n of se-
quences to 256 and change the number m of sequences in an MBR from 128 to
1024 by multiples of two. Figure 3 (b) shows another graph, where we set m to
256 and change n from 128 to 1024. As shown in the graphs, mbrDFT-complexity
is much lower than orgDFT-complexity. Note that Y axes in the graphs have the
exponential scale. Also, as m or n increases, the complexity difference between
orgDFT and mbrDFT becomes larger. It means that our mbrDFT is very useful
and practical for the case where an MBR contains a large number of sequences
or the length of sequences is large, i.e., it is suitable for large databases.

[Figure 3 plots omitted: (a) complexity comparison when varying m (x-axis: # of sequences per MBR, 128–1024); (b) complexity comparison when varying n (x-axis: sequence length, 128–1024); y-axis: value (complexity), comparing orgDFT and mbrDFT.]

Fig. 3. Comparison of orgDFT-complexity and mbrDFT-complexity

6 Performance Evaluation
6.1 Experimental Data and Environment
We have performed extensive experiments using two types of synthetic data sets.
The first data set, used in the previous similar sequence matching works [3,8,9],
contains a random walk series consisting of one million entries: the first en-
try is set to 1.5, and subsequent entries are obtained by adding a random
value in the range (-0.001,0.001) to the previous one. We call this data set

WALK-DATA. The second data set contains a synthetic streaming time-series


consisting of one million entries: the series is generated using the function yi =
100 · [sin(0.1 · xi ) + 1.0 + i/1000000] (i = 0..999999) as in [4], where we set xi to
the i-th entry of WALK-DATA. We call this data set SINE-DATA.
We generate high-dimensional MBRs by dividing the whole data set into mul-
tiple smaller sequences (i.e., sliding windows in [3,8]). In the experiments, we use
128, 256, 512, and 1024 as the length n and the number m of sequences contained
in an MBR. As in [1], we transform each high-dimensional sequence, i.e., 128– ∼
1024–dimensional sequence, to a 1– ∼ 4–dimensional sequence (point). It means
that the number of features extracted by the lower-dimensional transformation
is set to one ∼ four [1]. As the experimental methods, we compare orgDFT and
mbrDFT.
The hardware platform for the experiment is a PC equipped with an Intel
Pentium IV 2.80 GHz CPU, 512 MB RAM, and a 70.0GB hard disk. The software
platform is GNU/Linux Version 2.6.6 operating system. For the experimental
results, we measure the number of transformations and the elapsed time for
each method. We also show that, by comparing the boundary-length of the
transformed MBRs, the proposed mbrDFT is practically applicable in similar
sequence matching.

6.2 Experimental Results


We have performed three experiments. Experiment 1) measures the number of
transformations and the elapsed time by varying the number m of sequences in
an MBR for the fixed length n of sequences. Experiment 2) performs the same
experiment by varying the length n for the fixed number m. Finally, Experi-
ment 3) compares orgDFT and mbrDFT in the boundary-length of MBRs.
Experiment 1) Figure 4 shows the experimental results of orgDFT and
mbrDFT. Here, we set the length n of sequences to 256, but change the number
m of sequences in an MBR from 128 to 1024 by multiples of two. We set the
number of extracted features to two as in [1]. In the experiment, we measure
the total number of transformations and the average elapsed time for trans-
forming an MBR. Figure 4 (a) shows the numbers of transformations for both
WALK-DATA and SINE-DATA; Figures 4 (b) and 4 (c) show the elapsed times
for WALK-DATA and SINE-DATA, respectively. As shown in Figure 4 (a), our
mbrDFT drastically reduces the number of transformations over orgDFT. It is
because orgDFT has to consider all the individual sequences in an MBR; in con-
trast, mbrDFT requires only two transformations for an MBR. Figures 4 (b) and
4 (c) show that mbrDFT also reduces the elapsed time significantly over orgDFT.
As we analyzed in Figure 3 (a) in Section 5, a larger number of sequences in an
MBR causes a larger performance difference between orgDFT and mbrDFT.
In summary, mbrDFT drastically reduces the number of transformations to 1/136
of that for orgDFT on average, and also significantly improves performance
by a factor of 31 over orgDFT on average.

[Figure 4 plots omitted: (a) number of transformations, (b) elapsed time (usec) for WALK-DATA, (c) elapsed time (usec) for SINE-DATA; x-axis: # of sequences per MBR (m) = 128–1024, comparing orgDFT and mbrDFT.]

Fig. 4. Experimental results when varying the number m of sequences in an MBR

Experiment 2) Figure 5 shows the results when we set the number m of se-
quences in an MBR to 256, but change the length n of sequences from 128 to
1024 by multiples of two. As in Experiment 1), we measure the total number of
transformations and the average elapsed time for transforming an MBR. From
Figure 5 (a), we note that the numbers of transformations are not changed even
as the length of sequences increases. It is because the numbers are dependent on
the number of sequences in orgDFT or the number of MBRs in mbrDFT, but
are not dependent on the length of sequences in both orgDFT and mbrDFT.
As shown in Figures 5 (b) and 5 (c), mbrDFT significantly reduces the elapsed
time over orgDFT. In particular, as we analyzed in Figure 3 (b) in Section 5,
a larger sequence length causes a larger performance difference between
orgDFT and mbrDFT.

[Figure 5 plots omitted: (a) number of transformations, (b) elapsed time (usec) for WALK-DATA, (c) elapsed time (usec) for SINE-DATA; x-axis: sequence length (n) = 128–1024, comparing orgDFT and mbrDFT.]

Fig. 5. Experimental results when varying the length n of sequences

Experiment 3) In this experiment, we compare the methods in the average


boundary-length of MBRs. Here, the boundary-length of an MBR is defined as
the sum of the length of each dimension in the MBR, i.e., the boundary-length
of [L, U] is defined as Σ_{i=0}^{n−1} (u_i − l_i). Figure 6 compares orgDFT and mbrDFT
in the average boundary-length of MBRs. Here, we set both the length n and
the number m of sequences in an MBR to 256, but increment the number of
extracted dimensions (features) from one to four. As shown in the figure, the av-
erage boundary-length in mbrDFT is longer than that in orgDFT if the number
of extracted dimensions is greater than two. It is because our mbrDFT considers

[Figure 6 plots omitted: average boundary-length of MBRs vs. # of dimensions (# of features) = 1–4, comparing orgDFT and mbrDFT; (a) WALK-DATA, (b) SINE-DATA.]

Fig. 6. Comparison of orgDFT and mbrDFT in the average boundary-length of MBRs

an infinite number of every possible sequence that can be contained in a high-
dimensional MBR, while orgDFT considers only the finite number of real sequences in the
MBR. On the other hand, if the number of extracted dimensions is one, there
is only a little difference (0.2%∼2.6%) in the boundary-length. As experimented
in [1], DFT concentrates most of the energy into the first dimension, and thus we
can say that our mbrDFT is much more useful if we extract only one or two
dimensions.

7 Conclusions

In this paper we have proposed a formal approach that transforms a high-


dimensional MBR itself to a low-dimensional MBR directly. We have noted
that most similar sequence matching solutions required a huge number of lower-
dimensional transformations to construct low-dimensional MBRs to be stored
in the index. To solve this problem, we have introduced a new notion of MBR-
safe and proposed MBR-safe transforms that can reduce the number of lower-
dimensional transformations drastically.
We can summarize our work as the following three contributions. First, we
formally defined the notion of MBR-safe. If using the notion of MBR-safe, we
can construct a low-dimensional MBR by transforming a high-dimensional MBR
itself rather than a large number of individual sequences. Second, we proposed a
DFT-based MBR-safe transform. For this, we first proved the traditional DFT-
based lower-dimensional transformation is not MBR-safe. We then introduced a
new transform, called mbrDFT, and formally proved in Theorem 1 that it is MBR-
safe. Third, through analysis and experiments, we showed the superiority of our
MBR-safe transform.
These results indicate that our MBR-safe transforms will provide a useful
framework for a variety of applications that require the lower-dimensional trans-
formation of high-dimensional MBRs. Therefore, as the further research, we will
try to apply the MBR-safe transform to real applications such as similarity
search, multimedia data retrieval, and GIS.

Acknowledgements

This work was supported by the Ministry of Science and Technology (MOST)/
Korea Science and Engineering Foundation (KOSEF) through the Advanced In-
formation Technology Research Center (AITrc).

References
1. Agrawal, R., Faloutsos, C., and Swami, A., “Efficient Similarity Search in Sequence
Databases,” In Proc. the 4th Int’l Conf. on Foundations of Data Organization and
Algorithms, pp. 69-84, Oct. 1993.
2. Chan, K.-P., Fu, A. W.-C., and Yu, C. T., “Haar Wavelets for Efficient Similar-
ity Search of Time-Series: With and Without Time Warping,” IEEE Trans. on
Knowledge and Data Engineering, Vol. 15, No. 3, pp. 686-705, Jan./Feb. 2003.
3. Faloutsos, C., Ranganathan, M., and Manolopoulos, Y., “Fast Subsequence Match-
ing in Time-Series Databases,” In Proc. Int’l Conf. on Management of Data, ACM
SIGMOD, pp. 419-429, May 1994.
4. Gao, L. and Wang, X. S., “Continually Evaluating Similarity-based Pattern Queries
on a Streaming Time Series,” In Proc. Int’l Conf. on Management of Data, ACM
SIGMOD, pp. 370-381, June 2002.
5. Keogh, E. J., Chakrabarti, K., Mehrotra, S., and Pazzani, M. J., “Locally Adaptive
Dimensionality Reduction for Indexing Large Time Series Databases,” In Proc. of
Int’l Conf. on Management of Data, ACM SIGMOD, pp. 151-162, May 2001.
6. Korn, F., Jagadish, H. V., and Faloutsos, C., “Efficiently Supporting Ad Hoc
Queries in Large Datasets of Time Sequences,” In Proc. of Int’l Conf. on Man-
agement of Data, ACM SIGMOD, pp. 289-300, June 1997.
7. Lim, S.-H., Park, H.-J., and Kim, S.-W., “Using Multiple Indexes for Efficient
Subsequence Matching in Time-Series Databases,” In Proc. of the 11th Int’l Conf.
on Database Systems for Advanced Applications (DASFAA), pp. 65-79, Apr. 2006.
8. Moon, Y.-S., Whang, K.-Y., and Han, W.-S., “General Match: A Subsequence
Matching Method in Time-Series Databases Based on Generalized Windows,” In
Proc. Int’l Conf. on Management of Data, ACM SIGMOD, pp. 382-393, June 2002.
9. Moon, Y.-S. and Kim, J., “A Single Index Approach for Time-Series Subse-
quence Matching that Supports Moving Average Transform of Arbitrary Order,”
In Proc. of the 10th Pacific-Asia Conf. on Knowledge Discovery and Data Mining
(PAKDD), pp. 739-749, Apr. 2006.
10. Natsev, A., Rastogi, R., and Shim, K., “WALRUS: A Similarity Retrieval Algo-
rithm for Image Databases,” IEEE Trans. on Knowledge and Data Engineering,
Vol. 16, No. 3, pp. 301-316 , Mar. 2004.
11. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., Numerical
Recipes in C: The Art of Scientific Computing, Cambridge University Press, 2nd
Ed., 1992.
Mining Closed Frequent Free Trees
in Graph Databases

Peixiang Zhao and Jeffrey Xu Yu

The Chinese University of Hong Kong, China


{pxzhao,yu}@se.cuhk.edu.hk

Abstract. Free tree, as a special graph which is connected, undirected


and acyclic, has been extensively used in bioinformatics, pattern
recognition, computer networks, XML databases, etc. Recent research
on structural pattern mining has focused on an important problem of
discovering frequent free trees in large graph databases. However, it can
be prohibitive due to the presence of an exponential number of frequent
free trees in the graph database. In this paper, we propose a computa-
tionally efficient algorithm that discovers only closed frequent free trees
in a database of labeled graphs. A free tree t is closed if there exists
no supertree of t that has the same frequency as t. Two pruning algo-
rithms, the safe position pruning and the safe label pruning, are proposed
to efficiently detect unsatisfactory search spaces with no closed frequent
free trees generated. Based on the special characteristics of free tree, the
automorphism-based pruning and the canonical mapping-based pruning
are introduced to facilitate the mining process. Our performance study
shows that our algorithm not only reduces the number of false positives
generated but also improves the mining efficiency, especially in the pres-
ence of large frequent free tree patterns in the graph database.

1 Introduction
Recent research on frequent pattern discovery has progressed from mining item-
sets and sequences to mining structural patterns including (ordered, unordered,
free) trees, lattices, graphs and other complicated structures. Among all these
structural patterns, graph, a general data structure representing relations among
entities, has been widely used in a broad range of areas, such as bioinformatics,
chemistry, pattern recognition, computer networks, etc. In recent years, we have
witnessed a number of algorithms addressing the frequent graph mining prob-
lem [5,9,4,6]. However, discovering frequent graph patterns comes at an expensive
cost. Two computationally expensive operations are unavoidable: (1) checking whether
a graph contains another graph (in order to determine the frequency of a graph
pattern) is an instance of the subgraph isomorphism problem, which is NP-complete
[3]; and (2) checking whether two graphs are isomorphic (in order to avoid creating a
candidate graph multiple times) is an instance of the graph isomorphism prob-
lem, which is not known to be either in P or NP-complete [3].
With the advent of XML and the need for mining semi-structured data, a
particularly useful family of general graph — free tree, has been studied and


applied extensively in various areas such as bioinformatics, chemistry, computer


vision, networks, etc. Free tree, the connected, undirected and acyclic graph, is
a generalization of linear sequential patterns, and hence preserves plenty of the struc-
tural information of databases. At the same time, it is a specialization of the general
graph, and therefore avoids the undesirable theoretical properties and algorithmic com-
plexities incurred by graphs. As the middle ground between two extremes, free
trees have provided a good compromise in data mining research [8,2].
Similar to frequent graph mining, the discovery of frequent free trees in a
graph database shares a common combinatorial explosion problem: the number
of frequent free trees grows exponentially, although most of them deliver nothing
interesting but redundant information when they share the same frequency.
This is especially the case when the graphs of a database are strongly correlated.
Our work is inspired by mining closed frequent itemsets and sequences in [7].
According to [7,11], a frequent pattern I is closed if there exists no proper super-
pattern of I with the same frequency in the dataset. In comparison to frequent
free trees, the number of closed ones is dramatically smaller. At the same time,
closed frequent free trees maintain the same information (w.r.t. frequency) as
that held by frequent free trees with less redundancy and better efficiency.
There are several previous studies on discovering closed frequent patterns
among large tree or graph databases. CMTreeMiner [1] discovers all closed
frequent ordered or unordered trees in a rooted-tree database by traversing an
enumeration tree, a special data structure to enumerate all frequent (ordered or
unordered) subtrees in the database. However, some elegant properties of or-
dered (unordered) trees do not hold in free trees, which makes it infeasible to apply
their pruning techniques directly to mine closed frequent free trees. CloseGraph
[10] discovers all closed frequent subgraphs in a graph database by traversing
a search space representing the complete set of frequent subgraphs. The novel
concepts of equivalent occurrence and early termination help CloseGraph prune
certain branches of the search space which produce no closed frequent subgraphs.
We can directly use CloseGraph to mine closed frequent free trees because a free
tree is a special case of a general graph, but CloseGraph introduces a lot of in-
efficiencies. First, all free trees are computed as general graphs while the intrinsic
characteristics of free trees are ignored; second, the early termination may fail
and CloseGraph may miss some closed frequent patterns. Although this failure
of early termination can be detected, the detection operations must be applied
case-by-case, which introduces a lot of complexity.
In this paper, we fully study the closed frequent free tree mining problem and
develop an efficient algorithm, CFFTree (short for Closed Frequent Free Tree
mining), to systematically discover the complete set of closed frequent free
trees in large graph databases. The main contributions of this paper are: (1) We
first introduce the concept of closed frequent free trees and study its properties
and its relationship to frequent free trees; (2) Our algorithm CFFTree depth-
first traverses the enumeration tree to discover closed frequent free trees. Two
original pruning algorithms, the safe position pruning and the safe label pruning,
are proposed to prune, at an early stage, search branches of the enumeration tree
that are confirmed to output no desired patterns; (3) Based on the intrin-
sic characteristics of free tree, we propose the automorphism-based pruning and
the canonical mapping-based pruning to alleviate the expensive computation of
equivalent occurrence sets and candidate answer sets during the mining process.
We carried out different experiments on both synthetic data and real applica-
tion data. Our performance study shows that CFFTree outperforms up-to-date
frequent free tree mining algorithms by a factor of roughly 10. To the best of our
knowledge, CFFTree is the first algorithm that, instead of using post-processing
methods, directly mines closed frequent free trees from graph databases.
The rest of the paper is organized as follows. Section 2 provides necessary
background and detailed problem statement. We study the closed frequent free
tree mining problem in Section 3, and propose a basic algorithmic framework
to solve the problem. Advanced pruning algorithms are presented in Section 4.
Section 5 formulates our algorithm, CFFTree. In Section 6, we report our performance study, and finally we offer conclusions in Section 7.

2 Preliminaries
A labeled graph is defined as a 4-tuple G = (V, E, Σ, λ) where V is a set of
vertices, E is a set of edges (unordered pairs of vertices), Σ is a set of labels,
and λ is a labeling function, λ : V ∪ E → Σ, that assigns labels to vertices and
edges. A free tree, denoted ftree, is a special undirected labeled graph that is
connected and acyclic. Below, we call a ftree with n vertices an n-ftree.
Let t and s be two ftrees, and g be a graph. t is a subtree of s (or s is a supertree of t), denoted t ⊆ s, if t can be obtained from s by repeatedly removing vertices of degree 1, i.e., leaves of the tree. Similarly, t is a subtree of a graph
g, denoted t ⊆ g, if t can be obtained by repeatedly removing vertices and edges
from g. Ftrees t and s are isomorphic to each other if there is a one-to-one
mapping from the vertices of t to the vertices of s that preserves vertex labels,
edge labels, and adjacency. An automorphism is an isomorphism from a ftree to itself. A subtree isomorphism from t to g is an isomorphism from t to
some subtree(s) of g.
Let D = {g1, g2, . . . , gN} be a graph database, where each gi (1 ≤ i ≤ N) is a graph. The problem of frequent ftree mining is to discover the set of all frequent ftrees, denoted FS, where t ∈ FS iff the ratio of graphs in D that have t as a subtree is greater than or equal to a user-given threshold φ. Formally, let t be a ftree and gi be a graph. We define

    ς(t, gi) = 1 if t ⊆ gi, and 0 otherwise    (1)

and

    σ(t, D) = ∑_{gi ∈ D} ς(t, gi)    (2)

where σ(t, D) denotes the frequency or support of t in D. The frequent ftree mining problem is to discover the ftree set FS of D which satisfies

    FS = {t | σ(t, D) ≥ φN}    (3)

The problem of closed frequent ftree mining is to discover the set of frequent
ftrees, denoted CF S, where t ∈ CF S iff t is frequent and the support of t is
strictly larger than that of any supertree of t. Formally, the closed frequent ftree
mining problem is to discover the ftree set CF S of D which satisfies
    CFS = {t | t ∈ FS ∧ ∀t′ ⊃ t, σ(t, D) > σ(t′, D)}    (4)

Since CF S contains no ftree that has a supertree with the same support, we
have CF S ⊆ F S.

3 Closed Frequent Ftree Mining: Proposed Solutions


Based on the definition in Eq. (4), a naive two-step algorithm for discovering CFS from D can be easily drafted. First, use an existing frequent ftree mining algorithm to discover FS from D. Second, for each t ∈ FS, examine every supertree t′ ∈ FS with t ⊂ t′ and check whether σ(t′, D) < σ(t, D) holds for all of them. This algorithm is straightforward, but far from efficient. It indirectly discovers CFS by first computing FS, whose size can be exponentially larger than that of CFS, and the post-processing step of filtering non-closed frequent ftrees from FS incurs further unnecessary computation. We want an alternative method that computes CFS directly instead of computing FS in advance: within the traditional search space for mining frequent ftrees, efficient pruning algorithms should detect, as early as possible, branches that cannot lead to closed frequent ftrees and prune them, thereby speeding up the overall mining process.
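For concreteness, the following Python sketch illustrates this naive two-step approach (it is only an illustration, not the algorithm proposed in this paper); mine_frequent_ftrees, is_proper_subtree and the (ftree, support) pair representation are hypothetical placeholders.

# A sketch of the naive two-step approach, assuming hypothetical helpers:
#   mine_frequent_ftrees(D, phi) -> list of (ftree, support) pairs for all frequent ftrees
#   is_proper_subtree(t, s)      -> True iff t is a proper subtree of s
def naive_closed_ftrees(D, phi):
    FS = mine_frequent_ftrees(D, phi)            # step 1: compute FS (expensive)
    CFS = []
    for t, supp_t in FS:
        # step 2: t is closed iff no proper supertree has the same support; a supertree
        # with the same support as a frequent t is itself frequent, so comparing
        # against FS suffices
        if all(not (is_proper_subtree(t, s) and supp_s == supp_t) for s, supp_s in FS):
            CFS.append((t, supp_t))
    return CFS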
In [12], we presented F3TM, a fast frequent ftree mining algorithm which outperforms the up-to-date FreeTreeMiner algorithms [2,8] by an order of magnitude. In F3TM, an enumeration tree representing the search space of all frequent ftrees is built by a pattern-growth approach. Given a frequent n-ftree t, a potential frequent (n + 1)-ftree t′ originating from t is generated as

    t′ = t ◦ef v, v ∈ Σ    (5)

where ef means that pattern growth is conducted only on the extension frontier of t rather than on every vertex of t, while still ensuring the completeness of the frequent ftrees discovered from the graph database. Figure 1 illustrates the extension frontier of a ftree, which is composed of vertices 3, 4, 5 and 6, and the candidate generation of t′ based on Eq. (5).
For each frequent ftree in the enumeration tree discovered by F3TM, we can
check the closeness condition in Eq. (4). Given a frequent n-ftree t, its immediate supertree set, denoted CS(t), which contains all (n + 1)-ftrees t′ ⊃ t, can be generated as

    CS(t) = {t′ | t′ = t ◦x v, v ∈ Σ}    (6)

where x means that v can be grown on any vertex of t, as shown in Figure 2. t's immediate frequent supertree set, denoted FS(t), which contains all frequent (n + 1)-ftrees t′ ⊃ t, can be generated as

    FS(t) = {t′ | t′ ∈ CS(t) ∧ σ(t′, D) ≥ φN}    (7)

Fig. 1. t′ = t ◦ef v        Fig. 2. t′ = t ◦x v

Given a frequent ftree t′ ∈ FS(t), we denote the vertex that is grown on t to obtain t′ by (t′ − t), and the vertex of t at which (t′ − t) is grown by p, i.e., p is the parent of (t′ − t) in t′.
The basic algorithmic framework for mining closed frequent ftrees can be formalized as follows: if for every t′ ∈ FS(t), σ(t′, D) is strictly smaller than σ(t, D), then t is closed; otherwise, t is non-closed. That is, we can determine the closeness of t by checking the support values of all its immediate frequent supertrees in FS(t) during the traversal of the enumeration tree for mining frequent ftrees.

4 Pruning the Search Space


In the previous section, we traverse the enumeration tree to discover all frequent
ftrees in a graph database. However, the final goal of our algorithm is to find
only closed frequent ftrees. Therefore, it is not necessary to grow the complete
enumeration tree, because under certain conditions, some branches of the enu-
meration tree are guaranteed to produce no closed frequent ftrees and therefore
can be pruned efficiently. In this section, we introduce algorithms that prune
unwanted branches of the search space.

4.1 Equivalent Occurrence


Given a ftree t and a graph g ∈ D, let f(t, g) represent a subtree isomorphism from t to g. f(t, g) is also referred to as an occurrence of t in g. Notice that t can occur more than once in g. Let ω(t, g) denote the number of occurrences of t in g. The number of occurrences of t in a graph database D is formally defined as follows.
Definition 1. Given a ftree t and a graph database D = {g1, g2, . . . , gN}, the number of occurrences of t in D, denoted O(t, D), is the sum of the numbers of subtree isomorphisms of t in the graphs gi ∈ D, i.e., O(t, D) = ∑_{i=1}^{N} ω(t, gi).
Suppose a ftree t′ = t ◦x v, f is a subtree isomorphism of t in g, and f′ is a subtree isomorphism of t′ in g. If there exists ρ such that ρ is a subtree isomorphism of t in t′ and ∀u, f(u) = f′(ρ(u)), we say that t and t′ simultaneously occur in graph g. Intuitively, since t′ is derived from t by t′ = t ◦x v, we can obtain t′ in the same pattern-growth manner from t in g. We denote by ω(t, t′, g) the number of such simultaneous occurrences of t w.r.t. t′ in g. Similarly, the number of simultaneous occurrences of t w.r.t. t′ in D is defined as follows.
Definition 2. Given a ftree t′ = t ◦x v and a graph database D = {g1, g2, . . . , gN}, the number of simultaneous occurrences of t w.r.t. t′ in D, denoted SO(t, t′, D), is the sum of the numbers of simultaneous occurrences of t w.r.t. t′ in the graphs gi ∈ D, i.e., SO(t, t′, D) = ∑_{i=1}^{N} ω(t, t′, gi).
Definition 3. Given t′ = t ◦x v and a graph database D = {g1, g2, . . . , gN}, if O(t, D) = SO(t, t′, D), we say that t and t′ have equivalent occurrences.
Lemma 1. For a frequent ftree t in the enumeration tree, if there exists a t′ ∈ FS(t) such that (1) t and t′ have equivalent occurrences, and (2) the vertex (t′ − t) is not grown on the extension frontier of any descendant of t (including t itself) in the enumeration tree, then (1) t is not a closed frequent ftree, and (2) for each child t′′ of t in the enumeration tree, there exists at least one supertree t′′′ of t′′ such that t′′ and t′′′ have equivalent occurrences.
Proof. The first statement is easy: since t and t′ have equivalent occurrences in D, O(t′, D) = O(t, D). For the second statement, notice that (t′ − t) occurs at each occurrence of t in D, and hence at each occurrence of t′′ in D. In addition, the vertex (t′ − t) is never grown on the extension frontier of any descendant of t, so it is not a vertex of t′′ (note that t′′ is a child of t in the enumeration tree, obtained by growing a vertex on t's extension frontier). Therefore, we can obtain t′′′ by adding (t′ − t) to t′′, so that t′′ and t′′′ have equivalent occurrences.
By inductively applying Lemma 1 to t and all of t's descendants in the enumeration tree, we can conclude that all branches originating from t in the enumeration tree are guaranteed to produce no closed frequent ftrees. However, the conditions of Lemma 1, especially condition (2), are hard to verify, since when mining the frequent ftree t we have no information about t's descendants in the enumeration tree. The following sections present more concrete techniques to prune the search space.

4.2 The Safe Position Pruning


Given a ftree t and a vertex v ∈ t, the depth of v is defined as follows:

    depth(v) = 1 if v is a leaf, and depth(v) = min_{u is a child of v} {depth(u) + 1} otherwise    (8)

Intuitively, the depth of a vertex v is the minimum number of vertices from v to the nearest leaf of t. For a frequent ftree t′ ∈ FS(t) where t and t′ have equivalent occurrences, the vertex (t′ − t) can be grown at different positions, i.e., there are the following possibilities for the position of p in t: (1) depth(p) ≤ 2 and p is on the extension frontier of t; (2) depth(p) ≤ 2 but p is not on the extension frontier; (3) depth(p) > 2.
If p is in position (1), the vertex (t′ − t) is grown on the extension frontier of t. If p is in position (2), it is possible that for some descendant t′′ of t in the enumeration tree, the vertex p is again on the extension frontier of t′′. An example is shown in Figure 3: in the frequent ftree t, depth(p) = 2 and p is not located on the extension frontier; after the vertex a is grown on the extension frontier (at vertex b), we obtain another frequent ftree t′′ in which p is now located on the extension frontier.

Fig. 3. A Special Case in Position (2)        Fig. 4. The Safe Label Pruning

So the first two possible positions of p are unsafe for growing the vertex (t′ − t), since they may violate the conditions of Lemma 1. The following theorem shows that only position (3) of p is safe for growing the vertex (t′ − t) without violating the conditions of Lemma 1.
Theorem 1. Let t′ ∈ FS(t) be a frequent ftree such that t and t′ have equivalent occurrences in D. If depth(p) > 2, then neither t nor any of t's descendants in the enumeration tree can be closed.
Proof. Every vertex u on the extension frontier of a ftree is located at the bottom two levels, i.e., depth(u) ≤ 2. Hence, if depth(p) > 2, the vertex p can never appear on the extension frontier of any ftree, i.e., the vertex (t′ − t) will not be grown on the extension frontier of any descendant of t, including t itself, in the enumeration tree. According to Lemma 1, the branches originating from t cannot generate closed frequent ftrees.
The pruning algorithm of Theorem 1 is called the safe position pruning, since it applies when the vertex (t′ − t) is grown on a safe vertex p ∈ t with depth(p) > 2. Given an n-ftree t, the depths of all vertices of t can be computed in O(n) time, so the safe position pruning is an efficient test of whether a certain branch of the enumeration tree should be pruned.
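As an illustrative sketch (assuming the ftree is stored as an adjacency list and treating it as unrooted), the depths of Eq. (8) can be computed by a multi-source BFS started from all leaves, which visits each vertex once and hence runs in O(n):

from collections import deque

def vertex_depths(adj):
    # depth(v) = 1 for a leaf, otherwise 1 + distance to the nearest leaf
    depth = {}
    queue = deque()
    for v, nbrs in adj.items():
        if len(nbrs) <= 1:              # a leaf (or an isolated vertex)
            depth[v] = 1
            queue.append(v)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in depth:          # first visit gives the shortest distance to a leaf
                depth[u] = depth[v] + 1
                queue.append(u)
    return depth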

4.3 The Safe Label Pruning


If p is on the extension frontier of t, then obviously depth(p) ≤ 2, and we cannot prune t from the enumeration tree. However, depending on the vertex label of (t′ − t), we can still prune some children of t in the enumeration tree.
Theorem 2. For a frequent ftree t′ ∈ FS(t) such that t and t′ have equivalent occurrences in D, if p is located on the extension frontier of t, we do not need to grow t by adding to p a new vertex whose label is lexicographically greater than that of (t′ − t).
Proof. Consider any t′′ ∈ FS(t) such that p is the parent of (t′′ − t) and the label of (t′′ − t) is lexicographically greater than that of (t′ − t). The ftree t′′′ = t′′ ◦p (t′ − t) has equivalent occurrences with t′′, and t′′′ ∈ FS(t′′); note that t′′ ◦p (t′ − t) means growing the vertex (t′ − t) on p of the ftree t′′. Moreover, during frequent ftree mining we generate candidates in lexicographical order, so since the label of (t′′ − t) is lexicographically greater than that of (t′ − t), the vertex (t′ − t) will never be grown on the extension frontier of t′′ or of any of t′′'s descendants in the enumeration tree. According to Lemma 1, neither t′′ nor any of its descendants can be closed.

The pruning algorithm of Theorem 2 is called the safe label pruning. The vertex label of (t′ − t) is safe because all vertices with labels lexicographically greater than that of (t′ − t) can be exempted from being grown on p of t, and all descendants of the corresponding ftrees in the enumeration tree are pruned as well. An example is shown in Figure 4, where p is located on the extension frontier of t and v = (t′ − t). If v′'s label is lexicographically greater than v's label, the frequent ftree t′′ = t ◦p v′ and the frequent ftree t′′′ = t′′ ◦p v have equivalent occurrences, so t′′ is not closed. Similarly, none of t′′'s descendants in the enumeration tree is closed, either.

4.4 Efficient Computation of F S(t)


Based on the above analysis, both the candidate generation and the closeness test for a frequent ftree t need the computation of FS(t). Depending on whether t can be pruned from the enumeration tree during closed frequent ftree mining, we can divide FS(t) into the following mutually exclusive subsets:
EO(t) = {t′ ∈ FS(t) | t and t′ have equivalent occurrences}
EN(t) = {t′ ∈ FS(t) | σ(t, D) = σ(t′, D)}
F(t) = {t′ ∈ FS(t) | t′ is frequent}

Based on Theorem 1 and Theorem 2, the set EO(t) can be further divided into the following mutually exclusive subsets:
EO1(t) = {t′ ∈ EO(t) | p ∈ t is safe}
EO2(t) = {t′ ∈ EO(t) | p is on the extension frontier of t}
EO3(t) = EO(t) − EO1(t) − EO2(t)

When computing the sets mentioned above, we map t to each of its occurrences in gi ∈ D and select the possible vertices (t′ − t) to grow. However, this procedure is far from efficient since many redundant t′ are generated. We now study how to speed up the computation of FS(t) based on the characteristics of ftrees; the detailed analysis can be found in [12].

Automorphism-based Pruning: In the example shown in Figure 5, the leftmost ftree t is a frequent 7-ftree, where each vertex is identified by a unique number as its vertex id. When growing a new vertex v on vertex 3 of t, we get an 8-ftree t′ ∈ CS(t), shown in the middle of Figure 5. However, when growing v on vertex 5 of t, we get another 8-ftree t′′ ∈ CS(t), shown on the right of Figure 5. Notice that t′ = t′′ in the sense of ftree isomorphism, so t′′ can be pruned when computing FS(t).
Based on this observation, we propose an automorphism-based pruning algorithm to avoid the redundant generation of ftrees in FS(t). Given a ftree, all vertices can be partitioned into equivalence classes based on ftree automorphism. Figure 6 shows how to partition the vertices of t in Figure 5 into four equivalence classes.

Fig. 5. t′, t′′ ∈ CS(t) and t′ = t′′        Fig. 6. Equivalence Class

When computing FS(t), only one representative of each equivalence class of t is considered, instead of growing vertices at every position within an equivalence class.
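The following Python sketch shows one simple (quadratic-time) way to obtain such equivalence classes; it is an illustration rather than the paper's implementation, it ignores edge labels for brevity, and the example adjacency list is a plausible reading of the 7-ftree in Figure 5.

def rooted_canon(adj, labels, root, parent=None):
    # AHU-style canonical string of the ftree rooted at `root`: a vertex's code is
    # its label followed by the sorted codes of its children
    child_codes = sorted(rooted_canon(adj, labels, u, root)
                         for u in adj[root] if u != parent)
    return "(" + str(labels[root]) + "".join(child_codes) + ")"

def automorphism_classes(adj, labels):
    # u and v are equivalent iff some automorphism maps one to the other; for trees
    # this holds iff the trees rooted at u and at v have the same canonical form
    classes = {}
    for v in adj:
        classes.setdefault(rooted_canon(adj, labels, v), []).append(v)
    return list(classes.values())

adj = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
labels = {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'c', 6: 'd'}
print(automorphism_classes(adj, labels))   # four classes: [0], [1, 2], [3, 5], [4, 6]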
Canonical Mapping-based Pruning: When computing FS(t), we maintain mappings from t to all of its occurrences in gi ∈ D. However, there exist redundant mappings because of ftree automorphism. Given an n-ftree t, assume that the number of equivalence classes of t is c and that the number of vertices in each equivalence class Ci is ni, for 1 ≤ i ≤ c. The number of mappings from t to a single occurrence in gi is ω(t, gi) = ∏_{i=1}^{c} (ni)!. When either the number of equivalence classes or the number of vertices in some equivalence class is large, ω(t, gi) can be huge. However, among all mappings describing the same occurrence of t in gi, one out of the ∏_{i=1}^{c} (ni)! mappings is selected as the canonical mapping, and all computation of FS(t) is based on the canonical mappings of t in D, while the other (∏_{i=1}^{c} (ni)! − 1) mappings are pruned, which greatly facilitates the computation of FS(t).
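For example, if an ftree has c = 2 equivalence classes with n1 = 2 and n2 = 3 vertices (hypothetical numbers chosen only for illustration), then every occurrence of the ftree in gi is described by 2! · 3! = 12 different mappings; the canonical mapping-based pruning keeps exactly one of them and skips the other 11.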

5 The CFFTree Algorithm


In this section, we summarize our CFFTree algorithm, which is short for Closed
Frequent Ftree Mining. Algorithm 1 illustrates the framework of CFFTree. The
algorithm simply calls CF-Mine which recursively mines closed frequent ftrees of
a graph database by a depth-first traversal on the enumeration tree.
Algorithm 2 outlines the pseudo-code of CF-Mine. For each frequent ftree t, CFFTree checks all candidate frequent ftrees t′ = t ◦x v to obtain SO(t, t′, D), which is used to compute EO(t) (Line 1) and EN(t) (Line 2). However, for t′ ∈ F(t), CFFTree only grows t on its extension frontier, i.e., t′ = t ◦ef v, which ensures the completeness of the frequent ftrees in D (Lines 7-12). Automorphism-based pruning and canonical mapping-based pruning can be applied to facilitate the computation of the three sets EO(t), EN(t) and F(t). For the frequent ftree t, if there exists t′ ∈ EO1(t), then neither t nor any of t's descendants in the enumeration tree can be closed, and hence they can be efficiently pruned (Lines 3-4). If EO1(t) = ∅ but there exists t′ ∈ EO2(t), although we cannot prune t from the enumeration tree, we can apply Theorem 2 to prune some children of t in the enumeration tree (Lines 11-12). If EO(t) = ∅, then no such pruning is possible and we have to compute EN(t) to determine the closeness of t, i.e., the naive test mentioned in Section 3 (Line 2). If EN(t) ≠ ∅, t is not closed; otherwise, t is closed (Lines 15-16).

Algorithm 1. CFFTree (D, φ)


Input: A graph database D, the minimum support threshold φ
Output: The closed frequent ftrees set CF
1: CF ← ∅;
2: F ← frequent 1-ftrees;
3: for all frequent 1-ftree t ∈ F do
4: CF-Mine(t, CF, D, φ);
5: return CF

Algorithm 2. CF-Mine (t, CF, D, φ)


Input: A frequent ftree t, the set of closed frequent ftrees, CF, A graph database D
and the minimum support threshold φ
Output: The closed frequent ftrees set CF
1: Compute EO(t);
2: if EO(t) = ∅ then Compute EN (t);
3: if ∃t ∈ EO1 (t) then
4: return; // The safe position pruning;
5: else
6: F (t) ← ∅
7: for each equivalence class eci on the extension frontier of t do
8: for each valid vertex v which can be grown on eci of t do
9: t′ ← t ◦ef v, where p, a representative of eci, is v's parent
10: if support(t′) ≥ φ|D| then
11: if there is no t′′ ∈ EO2(t) such that p is the parent of (t′′ − t) and the label of (t′ − t) is lexicographically greater than that of (t′′ − t) then
12: F(t) ← F(t) ∪ {t′} // the safe label pruning
13: for each frequent t′ in F(t) do
14: CF-Mine(t′, CF, D, φ)
15: if EO(t) = ∅ and EN (t) = ∅ then
16: CF ← CF ∪ { t}

The set F(t) is computed by extending vertices on the extension frontier of t, which grows the enumeration tree for frequent ftree mining (Lines 8-12). This procedure proceeds recursively (Lines 13-14) until all closed frequent ftrees in the graph database have been found.

6 Experiments
In this section, we report a systematic performance study that validates the
effectiveness and efficiency of our closed frequent free tree mining algorithm:
CFFTree. We use both a real dataset and a synthetic dataset in our experiments.
All experiments were done on a 3.4GHz Intel Pentium IV PC with 2GB of main memory, running the MS Windows XP operating system. All algorithms are implemented in C++ using the MS Visual Studio compiler. We compare CFFTree with F3TM plus post-processing; thus, the performance curves mainly reflect the effectiveness of the pruning techniques presented in Section 4.

(a) Number of patterns    (b) Number of patterns    (c) Performance
Fig. 7. Mining patterns in real datasets

(a) Number of patterns    (b) Performance    (c) Performance
Fig. 8. Mining patterns in synthetic datasets

The real dataset we tested is an AIDS antiviral screen chemical compound database from the Developmental Therapeutics Program of NCI/NIH. The database contains 43,905 chemical compounds. There are 63 kinds of atoms in this database, most of which are C, H, O, S, etc. Three kinds of bonds are common in these compounds: single bonds, double bonds and aromatic bonds. We take atom types as vertex labels and bond types as edge labels. On average, a compound in the database has 43 vertices and 45 edges. The largest graph has 221 vertices and 234 edges.
Figure 7(a) shows the number of frequent patterns w.r.t. the size of the patterns (vertex number). We select 10000 chemical compounds from the real database and set the minimum threshold φ to 10%. As shown, most frequent and closed frequent ftrees have between 8 and 17 vertices, while the number of small ftrees with fewer than 5 vertices and large ftrees with more than 20 vertices is quite limited. Figure 7(b) shows the number of patterns of interest with φ varying from 5% to 10%, and the running time on the same dataset is shown in Figure 7(c). As we can see, CFFTree outperforms F3TM by a factor of 10 on average, and the ratio of frequent ftrees to closed ones ranges from about 10 down to 1.5. This demonstrates that closed pattern mining delivers more compact mining results.
We then tested CFFTree on a series of synthetic graph databases generated by the widely-used graph generator of [5]. The synthetic datasets are characterized by different parameters, which are described in detail in [5]. Figure 8(a) shows the number of patterns of interest with φ varying from 5% to 10%, and the running time for the dataset D10000I10T30V50 is shown in Figure 8(b). Compared with the real dataset, CFFTree achieves a similar performance gain on this synthetic dataset. We then test the mining performance by changing the

parameter T of the synthetic data while keeping the other parameters fixed. The experimental results are shown in Figure 8(c). Again, CFFTree performs better than F3TM.

7 Conclusion
In this paper, we investigate the problem of mining closed frequent ftrees from large graph databases, a critical problem in structural pattern mining because mining all frequent ftrees is inherently inefficient and redundant. Several new pruning algorithms are introduced in this study, including the safe position pruning and the safe label pruning, to efficiently prune branches of the search space. The automorphism-based pruning and the canonical mapping-based pruning are applied in the computation of candidate sets and equivalent occurrence sets, which dramatically speeds up the overall mining process. The CFFTree algorithm is implemented, and our performance study demonstrates its high efficiency compared with up-to-date frequent ftree mining algorithms. To the best of our knowledge, this is the first piece of work on closed frequent ftree mining over large graph databases.

Acknowledgment. This work was supported by a grant of RGC, Hong Kong


SAR, China (No. 418206).

References
1. Yun Chi, Yi Xia, Yirong Yang, and Richard R. Muntz. Mining closed and maximal
frequent subtrees from databases of labeled rooted trees. IEEE Transactions on
Knowledge and Data Engineering, 17(2):190–202, 2005.
2. Yun Chi, Yirong Yang, and Richard R. Muntz. Indexing and mining free trees. In
Proceedings of ICDM03, 2003.
3. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide
to the Theory of NP-Completeness. 1979.
4. Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in
the presence of isomorphism. In Proceedings of ICDM03, 2003.
5. Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Pro-
ceedings of ICDM01, 2001.
6. Siegfried Nijssen and Joost N. Kok. A quickstart in frequent structure mining can
make a difference. In Proceedings of KDD04, 2004.
7. Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering fre-
quent closed itemsets for association rules. In Proceeding of ICDT99, 1999.
8. Ulrich Rückert and Stefan Kramer. Frequent free tree discovery in graph data. In
Proceedings of SAC04, 2004.
9. Xifeng Yan and Jiawei Han. gspan: Graph-based substructure pattern mining. In
Proceedings of ICDM02, 2002.
10. Xifeng Yan and Jiawei Han. Closegraph: mining closed frequent graph patterns.
In Proceedings of KDD03, 2003.
11. Xifeng Yan, Jiawei Han, and Ramin Afshar. Clospan: Mining closed sequential
patterns in large databases. In Proceedings of SDM03, 2003.
12. Peixiang Zhao and Jeffrey Xu Yu. Fast frequent free tree mining in graph databases.
In Proceedings of MCD06 - ICDM 2006 Workshop, Hong Kong, China, 2006.
Mining Time-Delayed Associations from
Discrete Event Datasets

K.K. Loo and Ben Kao

Department of Computer Science,


The University of Hong Kong, Hong Kong
{kkloo, kao}@cs.hku.hk

Abstract. We study the problem of finding time-delayed associations


among types of events from an event dataset. We present a baseline algo-
rithm for the problem. We analyse the algorithm and identify two meth-
ods for improving efficiency. First, we propose pruning strategies that can
effectively reduce the search space for frequent time-delayed associations.
Second, we propose the breadth-first* (BF*) candidate-generation order.
We show that BF*, when coupled with the least-recently-used cache re-
placement strategy, provides a significant saving in I/O cost. Experiment
results show that combining the two methods results in a very efficient
algorithm for solving the time-delayed association problem.

1 Introduction
Developments in sensor network technology have attracted vast amounts of re-
search interest in recent years [1,2,3,6,7,8,9]. One of the research topics related to
sensor networks is to find correlations among the behaviour of different sensors.

(a) Network topology    (b) Alerts issued by a network monitoring system (type-A alerts at times 0, 3, 4 and 5; type-B alerts at 7, 9 and 10; type-C alerts at 13 and 15)
Fig. 1. An example showing a network monitoring system

Consider a network monitoring system designed for collecting traffic data of


a network of switches and links as shown in Figure 1(a). In the figure, nodes
represent switches, whereas edges are links connecting switches. Under normal
conditions, the time needed to pass through a link is given by the number on the corresponding edge. When the traffic at a switch X exceeds a certain capacity, a congestion alert is raised. Figure 1(b) shows an example of alert signals.

This research is supported by Hong Kong Research Grants Council Grant HKU
7138/04E.


By analysing an alert sequence, one may discover interesting correlations


among different types of alerts. For example, one may find that a Switch-A alert
is likely followed by a Switch-B alert within a certain time period. One may also
find that if such an A-B pattern occurs, a Switch-C alert is likely to occur soon
after. Such association information would be useful, for example, in congestion
prediction, which could be applied to intelligent traffic redirection strategies.
In this paper, we model correlations of events in the form of time-delayed
associations. In our model, an event e is a pair (Ee , te ) where Ee is its type and
te is the time at which e occurs. We are interested in associations among events
whose occurrences are time-constrained. A time-delayed association thus takes the form I −[u,v]→ J, where I, J are event types and u, v are two time values such
that 0 < u ≤ v. The association captures the observation that when an event i
of type I occurs at some time ti , it is likely that an event j of type J occurs at
time tj such that ti + u ≤ tj ≤ ti + v. If such an event j exists, event i is said to
match the association and we call j a consequence of i w.r.t. the association.
Associations can be “chained” to form longer associations that involve more than two event types. Chained associations can help detect the risk of unfavourable conditions early. Here, we can treat an association I −[u,v]→ J as a complex event type I′. An association between a complex event type I′ and an event type K has the form I′ −[u,v]→ K. Intuitively, such an association refers to the observation that if an event of type I occurs and is followed by one or more events of type J within a certain constrained time period, then at least one of the type-J consequences is likely followed by a type-K event within a constrained time period.
In [5], Mannila et al. proposed the concept of an episode, which is an ordered list of events. They proposed the use of minimal occurrences to find episode rules in order to capture temporal relationships between different event types. A minimal occurrence of an episode is a time interval [ts, te) such that the episode occurs in the interval but not in any proper sub-interval of [ts, te). Let α and β be two episodes. An episode rule has the form α[w1] ⇒ β[w2], which specifies the following relationship: “if α has a minimal occurrence in the interval [ts, te) such that te − ts ≤ w1, then there is a minimal occurrence of β in an interval [ts, t′e) such that t′e − ts ≤ w2”. The goal is to discover episodes and episode rules that occur
frequently in the data sequence.
In a sense, our problem is similar to episode discovery in that we are looking
for frequently occurring event sequences. However, we remark that the use of
minimal occurrence to define the occurrence of an episode might in some cases
fail to reflect the strength of an association. As an example, consider Figure 1(b)
again. It is possible that the three type-B events that occur at time t = 7, 9 and
10 are “triggered” respectively by the three preceding A’s that occur at t = 3, 4
and 5. Hence, the association A → B has occurred three times. However, only
one period ([5, 8)) qualifies as a minimal occurrence of the episode A → B. In
other words, out of all 4 occurrences of A in the figure, there is only 1 occurrence
of the episode A → B, even though 3 of the A’s have triggered B.
A major difference between our definition of time-delayed association and the
episode’s minimal occurrence approach is that, under our approach, every event

that matches an association counts towards the association's support. This fairly reflects the strength of correlations among event types. Also, our definition allows the specification of a timing constraint [u, v] between successive event types in an association, which helps remove associations that are not interesting. For example, if it takes at least 2 time units for a packet to pass through a switch, then any type-B alert that occurs 1 time unit after a type-A alert should not count towards the association A → B (see Figure 1). We can thus use the timing constraint to filter out false matches. The minimal occurrence approach used for episodes does not offer such flexibility.
A straightforward approach to finding all frequent associations is to generate
and verify them incrementally. First, we form all possible length-2 associations
X → Y , where X and Y are any event types in the data sequence. We then
scan the data to count the associations’ supports. Those associations with high
supports are considered frequent. Next, for each frequent association X → Y , we
consider every length-3 extension, i.e., we append every event type Z to X → Y
forming (X → Y) → Z. The supports of those length-3 associations are counted, and those that are frequent are used to generate length-4 associations, and so on. The process stops when no new frequent associations can be obtained.
In Section 3 we will show how the above conceptual procedure is implemented
in practice. In particular, we show how the computational problem is reduced
to a large number of table joins. We call this algorithm the baseline algorithm.
The baseline algorithm is not particularly efficient. We propose two methods to improve its efficiency. First, the baseline algorithm extends a frequent association I → Y by considering all possible extensions (I → Y) → Z. Many such extensions could be infrequent, and the effort spent on counting their supports
is wasted. A better strategy is to estimate upper bounds of the associations’
supports and discard those that cannot meet the support requirement. Second,
as we will explain later, the baseline algorithm generates (I → Y ) → Z by
retrieving and joining the tables associated with two sub-associations, namely,
I → Y and Y → Z. Since the number of such associations and their associated
tables is huge, the tables will have to be disk-resident. A caching strategy that
can avoid disk accesses as much as possible would thus have a big impact on the
algorithm’s performance. In this paper we study an interesting coupling effect
of a caching strategy and an association-generation order.
The rest of the paper is structured as follows. We give a formal definition of
our problem in Section 2. In Section 3, we discuss some properties of time-delayed
associations and propose a baseline algorithm for the problem. In Section 4, we
discuss the pruning strategies and the caching strategies. We present experiment
results in Section 5 and conclude the paper in Section 6.

2 Problem Definition
In this section we define the problem of finding time-delayed associations from
event datasets. We define an event e as a 2-tuple (Ee , te ) where Ee is the event
type and te is the time e occurs. Let D denote an event dataset and E denote
the set of all event types that appear in D. We define a time-delayed association

as a relation between two event types I, J ∈ E of the form I −[u,v]→ J. We call I the triggering event type and J the consequence event type of the association. Intuitively, I −[u,v]→ J captures the observation that if an event i such that Ei = I
occurs at time ti , then it is “likely” that there exists an event j so that Ej = J
and ti +u ≤ tj ≤ ti +v, where v ≥ u > 0. The likelihood is given by the confidence
of the association, whereas the statistical significance of an association is given
by its support. We will define support and confidence shortly.
For an association r = I −[u,v]→ J, an event i is called a match of r (or i matches r) if Ei = I and there exists another event j such that Ej = J and ti + u ≤ tj ≤ ti + v. The event j here is called a consequence of r. We use the notation Mr to denote the set of all matches of r, qr,i to denote the set of all consequences that correspond to a match i of r, and mr,j to denote the set of all matches of r that correspond to a consequence j. Also, we define Qr = ∪_{i ∈ Mr} qr,i; that is, Qr is the set of all events that are consequences of r. The support of an association r is defined as the ratio of the number of matching events to the total number of events (i.e., |Mr|/|D|). The confidence of r is defined as the fraction |Mr|/|DI|, where DI
is the set of all type-I events in D. We use the notations supp(r) and conf (r) to
represent the support and confidence of r, respectively. Finally, the length of an
association r, denoted by len(r), is the number of event types contained in r.
We can extend the definition to relate more than two event types. Consider an association r = I −[u,v]→ J as a complex event type I′; an association between I′ and an ordinary event type K is of the form r′ = I′ −[u,v]→ K. Here, I′ is the triggering event type and K is the consequence event type. Intuitively, the association says that if an event of type I is followed by one or more events of type J within the time constraints u and v, then at least one of the J's is likely to be followed by a type-K event. A match for the association r′ is a match i for r such that, for some j ∈ qr,i, there exists an event k such that Ek = K and tj + u ≤ tk ≤ tj + v. We say that event k is a consequence of event i w.r.t. the association r′. The support of r′ is defined as the fraction of events in D that match r′ (i.e., |Mr′|/|D|). The confidence of r′ is defined as the ratio of the number of events that match r′ to the number of events that match r (i.e., |Mr′|/|Mr|). Given two user-specified thresholds ρs and ρc and a timing constraint [u, v], the problem of mining time-delayed associations is to find all associations r such that supp(r) ≥ ρs and conf(r) ≥ ρc.
In our model, we use the same timing constraint [u, v] for all associations. Therefore, we will use a plain arrow “→” instead of “−[u,v]→” in the rest of the paper when the timing constraint is clear from the context or is unimportant.

3 The Baseline Algorithm


We start this section with two properties based on which the baseline algorithm
is designed.

Property 1: If |DI |, i.e., the number of occurrences of type I events, is smaller


than ρs ×|D|, then any association of the form r = I → J must be infrequent.

algorithm BASELINE
1) L := ∅; C := ∅; n := 2;
2) F := {all frequent event types}
3) foreach I ∈ F, J ∈ E do
4)   C := C ∪ {I → J}
5) end-for
6) while C ≠ ∅ do
7)   Cn := C; C := ∅
8)   foreach r ∈ Cn do
9)     if r = I → J is frequent do
10)      L := L ∪ {r}
11)      C := C ∪ {(I → J) → K} ∀ K ∈ E
12)    end-if
13)  end-for
14)  n := n + 1
15) end-while
16) return L

Fig. 2. Algorithm BASELINE

(a) A −[3,5]→ B:
    m  q
    3  7
    4  7
    4  9
    5  9
    5  10

(b) B −[3,5]→ C:
    m   q
    9   13
    10  13
    10  15

(c) (A −[3,5]→ B) −[3,5]→ C:
    m  q
    4  13
    5  13
    5  15

Fig. 3. M-Q mappings for various time-delayed associations

Proof: By definition, the set of matches of r must be a subset of DI . Hence,


|Mr | ≤ |DI | < ρs × |D|. 

Property 2: For any associations x and y = x → K, supp(x) ≥ supp(y).


Proof: By definition, Mx ⊇ My . Hence, supp(x) ≥ supp(y). 

From Property 2, we know that if an association y is frequent, so is x. In other


words, if an association x is not frequent, we do not need to consider any associ-
ations that are right extensions of x. The baseline algorithm (Figure 2) generates
associations based on this observation.
First, the algorithm collects into the set F all frequent event types (Line 2).
The algorithm then maintains two sets: C is a set of candidate associations which
are to be verified, and L is a set that contains all frequent associations discovered.
The set C is initialized to contain all possible length-2 associations (Lines 3-5).
The support of a candidate association r is determined. (We will discuss how
to compute the support shortly.) If r is verified to be frequent, we extend r to
r → K for each event type K ∈ E and add them to C. The algorithm terminates
when all candidates are evaluated and no new candidates can be generated.
To compute an association’s support, consider an association r = (I → J) →
K. By definition, an event i is a match of r if i is a match of r1 = I → J and
for some consequence j of i, there exists an event k such that Ek = K and
tj + u ≤ tk ≤ tj + v. In other words, j is both a consequence of r1 and a match
of r2 = J → K. The set of all such events is given by Qr1 ∩ Mr2 . We call this
the connecting set between r1 and r2 . We then have the following properties.

Property 3: For any event j ∈ Qr1 ∩ Mr2 , every i ∈ mr1 ,j is a match of r and
every k ∈ qr2 ,j is a consequence of event i w.r.t. r for every i ∈ mr1 ,j .
Proof: By definition, every i ∈ mr1 ,j is a match of r because there exists k such
that tj + u ≤ tk ≤ tj + v. Indeed, every k ∈ qr2 ,j fulfils this requirement.
Hence, every k ∈ qr2 ,j is a consequence of i w.r.t. r for every i ∈ mr1 ,j . 

Property 4: For any event j ∉ Qr1 ∩ Mr2, there exist no i ∈ mr1,j and k ∈ qr2,j such that i is a match of r and k is a consequence of i w.r.t. r.
Proof: (i) Any event j ∉ Qr1 cannot be a consequence of any i ∈ Mr1 for the association r1, so mr1,j = ∅. (ii) For any j ∈ Qr1 but j ∉ Mr2, qr2,j = ∅.
Given an association r and a match i of r, we can determine all consequences
j of i w.r.t. r. If we put all these match-consequence i-j pairs in a relation, we
obtain an M-Q mapping of the association r. Let us consider the network switch example again (Figure 1). If r = A −[3,5]→ B, then the matching type-A event at t = 4 leads to two consequence type-B events, at t = 7 and t = 9. Hence the tuples ⟨4, 7⟩ and ⟨4, 9⟩ are in the M-Q mapping of the association. Figures 3(a) and 3(b) show the M-Q mappings of the associations A −[3,5]→ B and B −[3,5]→ C, respectively.
By Property 3, given the M-Q mappings of r1 and r2, denoted by T1 and T2 respectively, we can derive the M-Q mapping of r by performing an equi-join on T1 and T2 with T1.q = T2.m, projecting the join result on T1.m and T2.q, and removing duplicate tuples from the mapping. Figure 3(c) shows the resulting M-Q mapping of (A −[3,5]→ B) −[3,5]→ C. Given the M-Q mapping of an
association r, the support supp(r) can be computed by counting the number
of distinct elements in the m column. The confidence of r can then be easily
determined by the supports of its sub-associations. In this paper, we focus on
computing the supports of associations and extracting those that are frequent.
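As an illustrative sketch (not the paper's implementation), the Python code below builds the M-Q mapping of a length-2 association by a direct scan and derives the mapping of a chained association by the equi-join just described. The event list reproduces the alert times of Figure 1(b); event times serve as event identifiers, which is valid here because no two events of the same type share a timestamp.

def mq_mapping(events, I, J, u, v):
    # M-Q mapping of I -[u,v]-> J: pairs (t_i, t_j) with an I-event at t_i,
    # a J-event at t_j and t_i + u <= t_j <= t_i + v
    ts_I = [t for (E, t) in events if E == I]
    ts_J = [t for (E, t) in events if E == J]
    return {(ti, tj) for ti in ts_I for tj in ts_J if ti + u <= tj <= ti + v}

def join_mq(T1, T2):
    # equi-join on T1.q = T2.m, projected on (T1.m, T2.q), duplicates removed
    return {(m1, q2) for (m1, q1) in T1 for (m2, q2) in T2 if q1 == m2}

events = [('A', 0), ('A', 3), ('A', 4), ('A', 5),
          ('B', 7), ('B', 9), ('B', 10), ('C', 13), ('C', 15)]
T_AB = mq_mapping(events, 'A', 'B', 3, 5)   # {(3,7), (4,7), (4,9), (5,9), (5,10)}
T_BC = mq_mapping(events, 'B', 'C', 3, 5)   # {(9,13), (10,13), (10,15)}
T_ABC = join_mq(T_AB, T_BC)                 # {(4,13), (5,13), (5,15)}
support = len({m for (m, q) in T_ABC})      # 2 distinct matches (t = 4 and t = 5)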

4 Improving the Baseline Algorithm


The baseline algorithm described in the previous section offers a method to
find frequent time-delayed associations. In this section, we propose methods to
improve the efficiency of the algorithm by investigating two areas, namely the
search space for frequent associations and the handling of intermediate results.

4.1 Pruning Strategy


For our problem of mining time-delayed associations, Properties 1 and 2, described in Section 3, are the only basis for trimming the search space for frequent time-delayed associations. So, the baseline algorithm takes all possible extensions
of a frequent association as candidates. A better strategy would be to estimate
an upper-bound for the support of each candidate, without actually joining the
M -Q mappings of its sub-associations, and trim those that cannot be frequent.

Multiplicity of consequences. With respect to a time-delayed association,


an event can be a consequence of one or more matches. We define, for an association r, the multiplicity of a consequence q as the number of matches of which q is a consequence. Obtaining the multiplicity values is easy: by sorting the M-Q mapping of an association r on the q column, the rows for a particular consequence are arranged consecutively, so the multiplicity of each consequence q is given by the number of consecutive rows corresponding to q. Figure 4(a) shows
an example. Based on multiplicity, we propose two methods, namely, GlobalK
and SectTop, for efficiently identifying candidates that cannot be frequent.

(a) M-Q mapping and multiplicity of consequences for I → J    (b) SectTop vectors for I → J    (c) M-Q mapping and number of matches per segment for J → K
Fig. 4. Multiplicity of consequences and SectTop

GlobalK. The notion of multiplicity implies that, for an association r1 = I → J,


the sum of the multiplicities of n distinct consequences gives an upper bound on the number of matches associated with them. Suppose that r1 is frequent. To verify whether an association r = (I → J) → K is frequent, we consider r1 and r2 = J → K. It is clear that the connecting set between r1 and r2 contains at most x = min(|Qr1|, |Mr2|) events, which in turn are associated with at most y matches of r1, where y is the sum of the top x multiplicity values w.r.t. r1. If y is smaller
than ρs × |D|, r cannot be frequent. In general, for a time-delayed association
r1 = I → J, we call the minimum number of consequences such that the sum of
their multiplicity values is not smaller than the support threshold the GlobalK
threshold for the association. Any extension of r1 of the form (I → J) → K
cannot be frequent if |MJ→K | is smaller than the GlobalK threshold of r1 .
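A minimal sketch of the GlobalK computation, assuming the M-Q mapping of r1 is available as a set of (match, consequence) pairs and the support threshold is expressed as a minimum number of matches; all names are illustrative.

def multiplicities(T1):
    # multiplicity of each consequence q of r1: the number of matches m with (m, q) in T1
    mult = {}
    for m, q in T1:
        mult[q] = mult.get(q, 0) + 1
    return mult

def globalk_threshold(T1, min_matches):
    # smallest number of consequences whose top multiplicities sum to at least min_matches
    total, k = 0, 0
    for v in sorted(multiplicities(T1).values(), reverse=True):
        total += v
        k += 1
        if total >= min_matches:
            return k
    return float('inf')   # even all consequences together cannot reach the threshold

def globalk_prune(T1, T2, min_matches):
    # True if (r1 -> K) can be discarded without joining T1 with T2 (= M-Q mapping of J -> K)
    matches_r2 = {m for (m, q) in T2}
    return len(matches_r2) < globalk_threshold(T1, min_matches)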

SectTop. GlobalK is a simple method for pruning candidates that cannot be


frequent. However, the GlobalK threshold, which is derived from the highest
multiplicities for an association, can be too generous as a pruning condition as
those consequences with top multiplicities may not all enter the connecting set.
We address this issue in SectTop. In simple words, we conceptually divide
the whole length of time represented by D into a number of segments. For a
frequent association r1 = I → J, for each segment, we capture information on
the multiplicities of the consequences occurring within the segment in a SectTop
vector. Then, when checking whether an association r = (I → J) → K can be
frequent, for each segment we obtain an upper bound on the number of matches associated with the consequences in that segment. The sum of the per-segment upper bounds then gives an overall upper bound on the number of matches of r. If the overall upper bound is smaller than ρs × |D|, then r cannot be frequent.
For an association r1, the SectTop vector for a segment is obtained as follows. First, the multiplicities of the consequences of r1 occurring within the segment are sorted in descending order. Then, we keep the x highest multiplicity values such that x is minimal and the sum of these x multiplicities exceeds ρs × |D|; if the sum of all multiplicity values does not exceed ρs × |D|, we keep all of them. The kept multiplicity values are then transformed into a vector such that the y-th element is the sum of the top-y multiplicities in the segment. Figure 4(b) shows the SectTop vectors derived from the M-Q mapping in Figure 4(a) when the length of time represented by the dataset is divided into segments of 5 time units.

(a) Depth-first candidate generation    (b) Breadth-first candidate generation    (c) Breadth-first* candidate generation
Fig. 5. Candidate generation schemes

To check whether an association r = (I → J) → K can be frequent, we count


the number of distinct matches for the sub-association r2 = J → K appearing in
each segment. If there are y matches for J → K in the i-th segment, then, in the
segment, at most y consequences of r1 may appear in the connecting set. Hence, an upper bound on the number of matches associated with those consequences is given by the y-th element of the SectTop vector for the segment. We compute the upper bound for each segment, and their sum gives an overall upper bound on
the number of matches of r. As an example, the table on the left in Figure 4(c) is
the M -Q mapping of J → K, while that on the right lists the number of matches
of J → K appearing in each segment. When evaluating whether r = (I → J) →
K can be frequent, we check the number of matches of J → K in each segment
against the SectTop vectors of I → J. It turns out that the overall upper-bound
on the number of matches for r is (0 + 3 + 0) = 3. If the support threshold is
4 matches, then we know immediately that r cannot be frequent.
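A rough sketch of the SectTop bookkeeping and the corresponding upper-bound test, under the same assumptions as the GlobalK sketch above (fixed-width segments, threshold given as a number of matches); names are illustrative.

def secttop_vectors(T1, seg_len, min_matches):
    # for each segment, keep just enough of the highest multiplicities of r1's
    # consequences in that segment, as a prefix-sum vector
    mult, per_seg = {}, {}
    for m, q in T1:
        mult[q] = mult.get(q, 0) + 1
    for q, cnt in mult.items():
        per_seg.setdefault(q // seg_len, []).append(cnt)
    vectors = {}
    for seg, counts in per_seg.items():
        counts.sort(reverse=True)
        vec, total = [], 0
        for c in counts:
            total += c
            vec.append(total)          # y-th element = sum of the top-y multiplicities
            if total >= min_matches:
                break
        vectors[seg] = vec
    return vectors

def secttop_upper_bound(vectors, T2, seg_len):
    # overall upper bound on the matches of (r1 -> K), given the M-Q mapping T2 of J -> K
    matches_per_seg = {}
    for m, q in T2:
        matches_per_seg.setdefault(m // seg_len, set()).add(m)
    bound = 0
    for seg, ms in matches_per_seg.items():
        vec = vectors.get(seg, [])
        if vec:
            y = min(len(ms), len(vec))  # a saturated last entry keeps the test conservative
            bound += vec[y - 1]
    return bound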

4.2 Cache Management


The baseline algorithm generates a lot of associations during execution. Some of
them are repeatedly used for evaluating other candidates later on. Because the
volume of data being processed is often very large, keeping all such associations
in main memory is not feasible. Maintaining a cache is thus a compromise: some of the intermediate results are kept in memory to reduce I/O accesses, while the remaining memory is made available for other operations.
When the cache overflows, part of the cached data is replaced by data fetched
from disk. Two commonly used strategies for choosing data for replacement
are “Least recently used” (LRU), i.e., data that have not been accessed for
the longest time are replaced, and “Least frequently used” (LFU), i.e., least
frequently accessed data in the cache are replaced.
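For concreteness, a minimal LRU cache sketch for M-Q mappings (illustrative only, not the authors' implementation), with the capacity measured in tuples and eviction of whole mappings:

from collections import OrderedDict

class MQCache:
    def __init__(self, capacity_tuples):
        self.capacity = capacity_tuples
        self.size = 0
        self.data = OrderedDict()          # association -> M-Q mapping (set of (m, q) pairs)

    def get(self, assoc, load_from_disk):
        if assoc in self.data:
            self.data.move_to_end(assoc)   # cache hit: mark as most recently used
            return self.data[assoc]
        mapping = load_from_disk(assoc)    # cache miss: I/O happens only here
        self.data[assoc] = mapping
        self.size += len(mapping)
        while self.size > self.capacity and len(self.data) > 1:
            _, evicted = self.data.popitem(last=False)   # evict the least recently used
            self.size -= len(evicted)
        return mapping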
The effectiveness of a cache, i.e., the likelihood that the data being accessed is already in the cache, is usually measured by the hit rate. We argue that the hit rate depends on the order in which candidates are evaluated and on the cache replacement strategy chosen. Figures 5(a) and (b) show two candidate generation schemes commonly used in level-wise algorithms. Figure 5(a) illustrates depth-first (DF) candidate generation, i.e., after an association r = I → J is evaluated as frequent, the algorithm immediately generates candidates by extending r and evaluates them. Figure 5(b) illustrates breadth-first (BF) candidate generation, in which all candidates of the same length are evaluated before longer candidates. These
candidate generation schemes would not work well with the LRU strategy. For
example, in Figure 5(a), A → B is referenced when evaluating the candidates
((A → A) → A) → B and ((B → C) → A) → B. Between the accesses,
a number of other candidates are evaluated, which means that many different associations are brought into memory and cache overflows are more likely. When A → B is accessed the second time, its M-Q mapping may no longer reside in the cache. A similar problem exists in the BF scheme (see Figure 5(b)).
It is noteworthy that, in the baseline algorithm, length-2 associations are repeatedly referenced during candidate evaluation. In particular, when evaluating extensions of an association I → J, each length-2 association of the form J → K is referenced. By processing as a batch all associations in Li with the same consequence event type (see Figure 5(c)), we ensure that the length-2 associations of the form J → K are accessed close together in time, which favours the LRU strategy.
This observation can be easily fitted into the BF candidate generation scheme.
At the end of each iteration, we sort the associations in Li by their consequence
event type. Then the sorted associations are fetched sequentially for candidate
generation. We call this the breadth-first* (BF*) candidate generation scheme.
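A small sketch of the BF* reordering step, assuming the associations found at one level are available together with a function that extracts their consequence event type; the string encoding below is purely illustrative.

def bf_star_order(frequent_assocs, consequence_type):
    # sort one BFS level so that associations sharing a consequence event type
    # are extended consecutively (the BF* scheme)
    return sorted(frequent_assocs, key=consequence_type)

level = ['(A->A)->B', '(B->C)->A', '(A->A)->A', '(B->B)->B']
ordered = bf_star_order(level, consequence_type=lambda a: a[-1])
# ordered == ['(B->C)->A', '(A->A)->A', '(A->A)->B', '(B->B)->B']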

5 Experiment Results
We conducted experiments using stock price data. Due to space limitations, we refer the reader to [4] for the discussion of how the raw data is transformed into an event dataset. The transformed dataset consists of 99 event types and around 45000 events.

5.1 Pruning Strategy


In the first set of experiments, we study the effectiveness of the pruning strategies “GlobalK” and “SectTop”. The effectiveness is best reflected by the number of candidate associations being evaluated. Figure 6 shows the number of candidate associations evaluated when ρs is set to different values. Note that a candidate is regarded as “evaluated” only if the M-Q mapping of the candidate is enumerated. The lines labelled “NoOpt”, “GlobalK” and “ST32” represent, respectively, the case that no pruning strategy is used (i.e., the original baseline algorithm), that “GlobalK” is used, and that “SectTop” is used with
the time covered by D divided into 32 segments.
Figure 6(a) shows the results when u and v are set to δ (i.e., a value just
bigger than 0) and 1 respectively. As shown in the figure, both GlobalK and
SectTop save a major fraction of candidate evaluations performed. At high sup-
port (0.6%), savings of 55% and 82% are observed respectively with GlobalK and
SectTop over the baseline algorithm while, at low support (0.3%), the savings are

(a) [u, v] = [δ, 1]    (b) [u, v] = [δ, 2]
Fig. 6. No. of candidates evaluated at different ρs

32% and 63%. A similar trend is observed when v is changed to 2 (Figure 6(b)).
Although the savings are not as dramatic as in the case when v = 1, at low
support (0.7%), GlobalK and SectTop achieve savings of 26% and 41%, while at
high support (0.9%), the savings are around 39% and 44% respectively.
As shown by the figures, SectTop always outperforms GlobalK in terms of the number of candidates being evaluated. A reason is that, for each candidate c, SectTop obtains an upper bound on supp(c) by estimating the number of matches associated with the consequences in each segment. A reasonably fine segmentation of the time covered by D thus ensures that the upper bound obtained is relatively tight. For GlobalK, however, the GlobalK threshold of a frequent association is calculated from the highest multiplicity values without considering where these values actually occur within the whole period of time covered by D. So, the pruning ability of GlobalK is not as good as that of SectTop.

5.2 Candidate Generation, Cache Replacement Strategy and I/O


Costs
In the second set of experiments, we want to study the effect of candidate gen-
eration orders on different cache replacement strategies. We plot the number of
M -Q mapping tuples read from disk, reflecting total I/O requirement, against
the size of the cache in Figures 7 and 8.
We start the analysis with the LRU strategy and ρs set to a relatively low value of 0.3%. Figure 7(a) shows the I/O performance when no pruning strategy is applied. From the figure, we find that the I/O performance of the breadth-first and depth-first strategies is very close. For the BF* strategy, the I/O cost begins to drop at a cache size of 16000 tuples and then drops dramatically. The improvement levels off when the cache size is increased to 24000 tuples.
The sharp improvement here is no coincidence. Recall that in BF* candidate
generation, at the end of each iteration, we sort the newly found frequent asso-
ciations by their consequence event type. Candidates are formed and evaluated
by extending each of the sorted associations sequentially. In other words, after
candidates of the form (I1 → J) → K are evaluated (for some simple or complex
event type I1 ), next evaluated are those of the form (I2 → J) → K, if such I2
exists. The whole set of length-2 associations with triggering event type J are

(a) LRU, ρs = 0.3%, “NoOpt”    (b) LRU, ρs = 0.3%, “ST32”
Fig. 7. I/O requirement for LRU ([u, v] = [δ, 1], ρs = 0.3%)
(a) LFU, ρs = 0.3%, “NoOpt”    (b) LFU, ρs = 0.3%, “ST32”
Fig. 8. I/O requirement for LFU ([u, v] = [δ, 1], ρs = 0.3%)

accessed multiple times for these candidates. If the cache is big enough to hold
the M -Q mappings of all such length-2 associations, it is likely that the M -Q
mappings are in the cache after they are referenced for the first time. For the
dataset used in the experiment, we find that the maximum sum of the sizes of
all M -Q mappings of a particular triggering event type is about 22000 tuples. A
cache with 24000-tuple capacity is thus big enough to save most I/O accesses.
Figure 7(b) shows the case when "ST32" is applied. The curves are similar in
shape to those in the "NoOpt" case. A big drop in I/O access is also observed
for the BF* curve, and the drop begins at a cache size of 10000 tuples. This is
because SectTop avoids evaluating candidates that cannot be frequent. So, for a
frequent association I → J, it is not necessary to evaluate every candidate of the
form (I → J) → K. A smaller cache is thus enough to hold the M-Q mappings of
the length-2 associations used for candidate evaluation.
Figure 8 shows the case of LFU. From the figure, all three candidate generation
methods are very similar in terms of I/O requirement. Both depth-first and
breadth-first generation performed slightly better when LFU was adopted instead
of LRU. However, the "big drop" of BF* is not observed, so the performance of
BF* is much worse than in the LRU case. This is because the LFU strategy gives
preference to data that are frequently accessed when deciding what to keep in the
cache. This does not match the idea of BF* candidate generation, which works
best when recently accessed data are kept in the cache. In addition, associations
that entered the cache early may reside in the cache for a long time because, by
the time they are first used for evaluating candidates, a certain number of accesses
have already been accumulated. Associations newly added to the cache must be
accessed even more frequently to stay in the cache.
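To make the contrast between the two replacement policies concrete, the following is a minimal sketch of an LRU cache for M-Q mapping tuples in Python; the class name and get/put interface are illustrative assumptions and do not reproduce the authors' implementation.

from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache keyed by association id; capacity counted in tuples.
    Illustrative sketch only, not the implementation used in the experiments."""
    def __init__(self, capacity_tuples):
        self.capacity = capacity_tuples
        self.used = 0
        self.store = OrderedDict()            # association id -> list of M-Q tuples

    def get(self, assoc_id):
        if assoc_id not in self.store:
            return None                       # miss: caller must read the tuples from disk
        self.store.move_to_end(assoc_id)      # mark as most recently used
        return self.store[assoc_id]

    def put(self, assoc_id, tuples):
        if assoc_id in self.store:
            self.used -= len(self.store.pop(assoc_id))
        while self.store and self.used + len(tuples) > self.capacity:
            _, evicted = self.store.popitem(last=False)   # evict the least recently used entry
            self.used -= len(evicted)
        self.store[assoc_id] = tuples
        self.used += len(tuples)

An LFU variant would instead track an access counter per entry and evict the least frequently used one, which is exactly the behaviour that penalizes BF* as discussed above.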

6 Conclusion
We propose time-delayed association as a way to capture time-delayed dependencies
between types of events. We illustrate how time-delayed associations can be found
from event datasets using a simple baseline algorithm.
We identify two areas for improvement in the simple algorithm. First, we can
obtain upper bounds on the supports of candidate associations; those that cannot
be frequent are discarded without computing their actual supports. We propose
two methods, namely GlobalK and SectTop, for obtaining an upper bound on a
candidate's support. Experiment results show that these methods significantly
reduce the number of candidates evaluated.
Second, some of the intermediate results generated are repeatedly used for
candidate evaluation. Since the volume of data being processed is likely to be
high, such intermediate results must be disk-resident and are brought into main
memory only when needed. Caching of the intermediate results is thus important
for reducing expensive I/O accesses. We find that the order in which candidate
associations are formed and evaluated affects the performance of the cache.
Experiment results show that the BF* candidate generation scheme, coupled
with a reasonably sized cache and the LRU cache replacement strategy, can
substantially reduce the I/O requirement of the algorithm.

A Comparative Study of Ontology Based Term Similarity
Measures on PubMed Document Clustering

Xiaodan Zhang1, Liping Jing2, Xiaohua Hu1, Michael Ng3, and Xiaohua Zhou1
1 College of Information Science & Technology, Drexel University, 3141 Chestnut, Philadelphia, PA 19104, USA
2 ETI & Department of Math, The University of Hong Kong, Pokfulam Road, Hong Kong
3 Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
{xzhang,thu}@ischool.drexel.edu, [email protected], [email protected], [email protected]

Abstract. Recent research shows that an ontology, used as background knowledge, can
improve document clustering quality with its concept hierarchy knowledge.
Previous studies take term semantic similarity as an important measure for
incorporating domain knowledge into the clustering process, for example in clustering
initialization and term re-weighting. However, not many studies have focused on how
different types of term similarity measures affect the clustering performance for a
certain domain. In this paper, we conduct a comparative study on how different term
semantic similarity measures, including path based, information content based and
feature based similarity measures, affect document clustering. We evaluate term
re-weighting as an important method for integrating domain ontology into the
clustering process. We apply k-means clustering on one real-world text dataset, our
own corpus generated from PubMed. Experiment results on 9 different semantic
measures show that: (1) no single type of similarity measure significantly outperforms
the others; (2) several similarity measures have rather more stable performance than
the others; (3) term re-weighting has positive effects on medical document clustering,
but the effect might not be significant when documents are short of terms.

Keywords: Semantic Similarity Measure, Document Clustering, Domain Ontology.

1 Introduction
Recent research has focused on how to integrate domain ontology as background
knowledge into the document clustering process and shows that ontology can improve
document clustering performance with its concept hierarchy knowledge [2, 3, 16].
Hotho et al. [2] use WordNet synsets to augment document vectors and achieve
better results than the "bag of words" model on a public domain corpus. Yoo et al. [16]
achieve promising clustering results using the MeSH domain ontology for clustering
initialization. They first cluster terms by calculating term semantic similarity using the
MeSH ontology (http://www.nlm.nih.gov/mesh/) on PubMed document sets [16].


Then the documents are mapped to the corresponding term clusters. Finally, a mutual
reinforcement strategy is applied. Varelas et al. [14] use term re-weighting for
information retrieval applications. Jing et al. [3] adopt a similar technique for document
clustering: they re-weight terms and assign more weight to terms that are more
semantically similar to each other.
Although existing approaches rely on term semantic similarity measures, not many
studies have evaluated the effects of different similarity measures on document
clustering for a specific domain. Yoo et al. [16] use only one similarity measure, which
counts the number of shared ancestor concepts and the number of co-occurring
documents. Jing et al. [3] compare two ontology based term similarity measures.
Even though these approaches rely heavily on term similarity information and all
these similarity measures are domain independent, relatively little work to date has
developed and evaluated measures of term similarity for the biomedical domain
(where a growing number of ontologies, such as the MeSH ontology, organize medical
concepts into hierarchies) in the context of document clustering.
Clustering initialization and term re-weighting are two techniques adopted for
integrating domain knowledge. In this paper, term re-weighting is chosen because: (1)
a document is often full of class-independent "general" terms, and discounting their
effect is a central task; term re-weighting may help discount the effects of
class-independent general terms and amplify the effects of class-specific "core" terms;
(2) hierarchically clustering terms [16] for clustering initialization is more
computationally expensive and less scalable than the term re-weighting approach.
As a result, in this paper we evaluate the effects of different term semantic
similarity measures on document clustering using term re-weighting, an important
method for integrating domain knowledge. We examine 4 path based similarity
measures, 3 information content based similarity measures, and 2 feature based
similarity measures for document clustering on PubMed document sets. The rest of
the paper is organized as follows: Section 2 describes the term semantic similarity
measures; Section 3 presents the document representation and defines the term
re-weighting scheme. In Section 4, we present and discuss the experiment results.
Section 5 briefly concludes the paper.

2 Term Semantic Similarity Measure


Ontology based similarity measures have some advantages over other measures. First,
an ontology is created manually by domain experts and is thus more precise; second,
compared to other methods such as latent semantic indexing, it is much more
computationally efficient; third, it helps integrate domain knowledge into the data
mining process. Comparing two terms in a document using ontology information
usually exploits the fact that their corresponding concepts within the ontology have
properties in the form of attributes, levels of generality or specificity, and relationships
with other concepts [11]. It should be noted that there are many other term semantic
similarity measures, such as latent semantic indexing, but they are out of the scope of
our research; our focus here is on term semantic similarity measures that use ontology
information. In the subsequent subsections, we classify the ontology based semantic
measures into the following three categories and pick popular measures for each
category.

2.1 Path Based Similarity Measure

Path based similarity measure usually utilizes the information of the shortest path
between two concepts, of the generality or specificity of both concepts in ontology
hierarchy, and of their relationships with other concepts.
Wu and Palmer [15] present a similarity measure based on finding the most specific
common concept that subsumes both of the concepts being measured. The path length
from the most specific shared concept is scaled by the sum of the IS-A links from it to
the two compared concepts.

S_{W\&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H}    (1)

In equation (1), N1 and N2 are the numbers of IS-A links from C1 and C2, respectively,
to the most specific common concept C, and H is the number of IS-A links from C to
the root of the ontology. The score ranges from 1 (for identical concepts) to 0. In
practice, we set H to 1 when the parent of the most specific common concept C is the
root node.
Li et al. [8] combines the shortest path and the depth of ontology information in a
non-linear function:
S_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}    (2)

where L stands for the shortest path length between the two concepts, and α and β are
parameters scaling the contribution of the shortest path length and the depth,
respectively. The value is between 1 (for identical concepts) and 0. In our experiment,
following [8], we set α and β to 0.2 and 0.6, respectively.
Leacock and Chodorow [7] define a similarity measure based on the shortest path
d(C1, C2) between two concepts, scaling that value by twice the maximum depth of the
hierarchy and taking the logarithm to smooth the resulting score:

S_{L\&C}(C_1, C_2) = -\log\big(d(C_1, C_2) / 2D\big)    (3)

where D is the maximum depth of the ontology. In practice, we add 1 to both
d(C1, C2) and 2D to avoid log(0) when the shortest path length is 0.
Mao et al. [10] define a similarity measure using both the shortest path information
and the number of descendants of the compared concepts:

S_{Mao}(C_1, C_2) = \frac{\delta}{d(C_1, C_2)\,\log_2\big(1 + d(C_1) + d(C_2)\big)}    (4)

where d(C1, C2) is the number of edges between C1 and C2, and d(C1) is the number of
C1's descendants, which represents the generality of the concept. The constant δ refers
to the boundary case where C1 is the only direct hypernym of C2, C2 is the only direct
hyponym of C1, and C2 has no hyponym. In this case, because the concepts C1 and C2
are very close, δ should be chosen close to 1. In practice, we set it to 0.9.
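As an illustration, the sketch below computes the Wu-Palmer and Leacock-Chodorow scores over a toy IS-A hierarchy represented as a child-to-parent map. The hierarchy, node names and helper functions are hypothetical, and the sketch assumes single inheritance for simplicity (MeSH concepts may have multiple parents); it only makes equations (1) and (3) concrete.

import math

# Hypothetical toy IS-A hierarchy: child -> parent (single inheritance for simplicity).
PARENT = {"neoplasms": "root", "carcinoma": "neoplasms",
          "adenocarcinoma": "carcinoma", "melanoma": "neoplasms"}

def ancestors(c):
    """Return the list [c, parent, ..., root]."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lcs_info(c1, c2):
    """N1, N2 (IS-A links to the most specific common concept) and H (links from it to the root)."""
    a1, a2 = ancestors(c1), ancestors(c2)
    common = next(a for a in a1 if a in a2)          # most specific shared concept
    n1, n2 = a1.index(common), a2.index(common)
    h = len(ancestors(common)) - 1
    return n1, n2, max(h, 1)                          # H set to 1 at the root, as noted above

def wu_palmer(c1, c2):
    n1, n2, h = lcs_info(c1, c2)
    return 2 * h / (n1 + n2 + 2 * h)                  # equation (1)

def leacock_chodorow(c1, c2, max_depth):
    n1, n2, _ = lcs_info(c1, c2)
    d = n1 + n2                                       # shortest path length via the shared concept
    return -math.log((d + 1) / (2 * max_depth + 1))   # equation (3), +1 to avoid log(0)

print(wu_palmer("adenocarcinoma", "melanoma"))               # 0.4 on this toy hierarchy
print(leacock_chodorow("adenocarcinoma", "melanoma", 3))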

2.2 Information Content Based Measure

Information content based measures associate probabilities with concepts in the
ontology. The probability [11] is defined in equation (5), where freq(C) is the
frequency of concept C and freq(Root) is the frequency of the root concept of the
ontology. In this study, the frequency count assigned to a concept is the sum of the
frequency counts of all the terms that map to the concept. Additionally, the frequency
count of every concept includes the frequency counts of the concepts it subsumes in
the IS-A hierarchy.

IC(C) = -\log\left(\frac{freq(C)}{freq(Root)}\right)    (5)
As there may be multiple parents for each concept, two concepts can share parents via
multiple paths. We take the minimum IC(C) when there is more than one shared
parent, and we then call that concept C the most informative subsumer, denoted
IC_mis(C1, C2). In other words, IC_mis(C1, C2) has the least probability among all
shared subsumers of the two concepts.

S_{Resnik}(C_1, C_2) = -\log IC_{mis}(C_1, C_2)    (6)

S_{Jiang}(C_1, C_2) = -\log IC(C_1) - \log IC(C_2) + 2\log IC_{mis}(C_1, C_2)    (7)
Resnik [12] presents a similarity measure based on the idea that the more information
two terms share in common, the more similar they are; the information shared by two
terms is indicated by the information content of the term that subsumes them in the
ontology. The measure reveals information about the usage, within the corpus, of the
part of the ontology queried. Jiang [4] includes not only the shared information content
between two terms, but also the information content each term contains. Lin [9]
utilizes both the information needed to state the commonality of the two terms and the
information needed to fully describe them. Since IC_mis(C1, C2) >= log IC(C1),
log IC(C2), the similarity value varies between 1 (for identical concepts) and 0.

S_{Lin}(C_1, C_2) = \frac{2\log IC_{mis}(C_1, C_2)}{\log IC(C_1) + \log IC(C_2)}    (8)
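A minimal sketch of the information content based scores follows; the concept frequencies (with subsumed counts already folded in) are a hypothetical dictionary, the most informative subsumer is assumed given, and the formulas are written in the conventional form IC(C) = -log p(C), which is one reading of equations (5)-(8) rather than the authors' code.

import math

# Hypothetical cumulative frequencies (each count already includes subsumed concepts).
FREQ = {"root": 1000, "neoplasms": 400, "carcinoma": 150, "melanoma": 60}

def ic(c):
    return -math.log(FREQ[c] / FREQ["root"])      # equation (5)

def resnik(c1, c2, mis):
    # mis = most informative subsumer of c1 and c2 (assumed given in this sketch)
    return ic(mis)                                # equation (6): the shared information content

def jiang(c1, c2, mis):
    return ic(c1) + ic(c2) - 2 * ic(mis)          # equation (7), expressed with IC values

def lin(c1, c2, mis):
    return 2 * ic(mis) / (ic(c1) + ic(c2))        # equation (8)

print(resnik("carcinoma", "melanoma", "neoplasms"))
print(lin("carcinoma", "melanoma", "neoplasms"))

Note that the Resnik and Jiang scores are not confined to [0, 1], which is why they are normalized before re-weighting in Section 4.3.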

2.3 Feature Based Measure

Feature based measures assume that each term is described by a set of terms
indicating its properties or features. Then, the more common characteristics two terms
have and the fewer non-common characteristics they have, the more similar the terms
are [14]. As there is no describing feature set for MeSH descriptor concepts, in our
experimental study we take all the ancestor nodes of each compared concept as its
feature set. The following measure is defined according to [5, 9]:
S_{BasicFeature}(C_1, C_2) = \frac{|Ans(C_1) \cap Ans(C_2)|}{|Ans(C_1) \cup Ans(C_2)|}    (9)

where Ans(C1) and Ans(C2) are the description sets (the ancestor nodes) of terms C1
and C2 respectively, Ans(C1) ∩ Ans(C2) is the intersection of the two ancestor node
sets, and Ans(C1) ∪ Ans(C2) is their union.
Knappe [5] defines a similarity measure, given below, using the information of
generalization and specification of the two compared concepts:

S_{Knappe}(C_1, C_2) = p \times \frac{|Ans(C_1) \cap Ans(C_2)|}{|Ans(C_1)|} + (1-p) \times \frac{|Ans(C_1) \cap Ans(C_2)|}{|Ans(C_2)|}    (10)

where p ∈ [0, 1] defines the relative importance of generalization vs. specialization.
This measure scores between 1 (for identical concepts) and 0. In our experiment, p is
set to 0.5.
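A small sketch of the two feature based scores follows, reusing hypothetical ancestor sets in the spirit of the earlier toy hierarchy; equation (9) is a Jaccard coefficient over ancestor sets and equation (10) weights generalization against specialization with p.

def basic_feature(anc1, anc2):
    """Equation (9): Jaccard coefficient over ancestor sets."""
    a1, a2 = set(anc1), set(anc2)
    return len(a1 & a2) / len(a1 | a2)

def knappe(anc1, anc2, p=0.5):
    """Equation (10): p trades off generalization vs. specialization."""
    a1, a2 = set(anc1), set(anc2)
    shared = len(a1 & a2)
    return p * shared / len(a1) + (1 - p) * shared / len(a2)

# Hypothetical ancestor sets (each concept listed with itself and all its ancestors).
anc_adeno = {"adenocarcinoma", "carcinoma", "neoplasms", "root"}
anc_mel = {"melanoma", "neoplasms", "root"}
print(basic_feature(anc_adeno, anc_mel))   # 2/5 = 0.4
print(knappe(anc_adeno, anc_mel))          # 0.5*2/4 + 0.5*2/3 ≈ 0.583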

3 Document Representation and Re-weighting Scheme


MeSH. Medical Subject Headings (MeSH) mainly consists of the controlled
vocabulary and a MeSH Tree. The controlled vocabulary contains several different
types of terms, such as Descriptor, Qualifiers, Publication Types, Geographics, and
Entry terms. Among them, Descriptors and Entry terms are used in this study since
they are terms that can be extracted from documents. Descriptor terms are main
concepts or main headings. Entry terms are the synonyms or the related terms to
descriptors. For example, “Neoplasms” as a descriptor has the following entry terms
{“Cancer”, “Cancers”, “Neoplasm”, “Tumors”, “Tumor”, “Benign Neoplasm”,
“Neoplasm, Benign”}. MeSH descriptors are organized in a MeSH Tree, which can
be seen as the MeSH Concept Hierarchy. In the MeSH Tree there are 15 categories
(e.g. category A for anatomic terms), and each category is further divided into
subcategories. For each subcategory, corresponding descriptors are hierarchically
arranged from most general to most specific. In addition to its ontology role, MeSH
descriptors have been used to index MEDLINE articles. For this purpose, about 10 to
20 MeSH terms are manually assigned to each article (after reading the full papers).
Of the MeSH terms assigned to an article, about 3 to 5 are set as "MajorTopics" that
primarily represent the article.
With the MeSH descriptors and the MeSH Tree, the similarity score between two
medical terms can be easily calculated. Therefore, we first match the terms in each
document abstract to the Entry terms in MeSH and then map the selected Entry terms
to MeSH Descriptors. We select candidate terms (1- to 6-grams) that match MeSH
Entry terms. We then replace semantically similar Entry terms with their Descriptor
term to remove synonyms. We next filter out MeSH Descriptors that are too general
(e.g. HUMAN, WOMEN or MEN) or too common in MEDLINE articles (e.g.
ENGLISH ABSTRACT or DOUBLE-BLIND METHOD).
Fig. 1. The concept mapping from MeSH entry terms to MeSH descriptors

We assume that those terms do not have distinguishing power in clustering documents.
Hence, we have selected a set of only meaningful corpus-level concepts, in terms of
MeSH Descriptors, to represent the documents. We call this set the Document Concept
Set (DCS), where DCS = {C1, C2, ..., Cn} and Ci is a corpus-level concept. Fig. 1 shows
that MeSH Entry term sets are detected from documents "Doc1" and "Doc2" using the
MeSH ontology, and the Entry terms are then replaced with Descriptors based on the
MeSH ontology. For a more comprehensive comparative study, we represent documents
in two ways: MeSH entry terms and MeSH descriptor terms. At the time of this writing,
there are about 23833 unique MeSH descriptor terms, 44978 MeSH ontology nodes (one
descriptor term might belong to more than one ontology node) and 593626 MeSH entry
terms.
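As a toy illustration of the mapping step in Fig. 1, the sketch below replaces entry terms with descriptors using a hypothetical dictionary; the disambiguation callback stands in for the "most semantically similar to the other terms in the document" rule described in Section 4.1, and none of the names come from an actual MeSH release.

def map_to_descriptors(entry_terms, entry2desc, score_with_doc):
    """Replace each MeSH entry term with one MeSH descriptor.
    entry2desc: dict mapping an entry term to its candidate descriptors (hypothetical).
    score_with_doc: callback scoring how well a descriptor fits the rest of the document."""
    descriptors = []
    for term in entry_terms:
        candidates = entry2desc.get(term, [])
        if not candidates:
            continue                      # the term is not a MeSH entry term; drop it
        descriptors.append(max(candidates, key=score_with_doc))
    return descriptors

# Hypothetical usage
entry2desc = {"Cancer": ["Neoplasms"], "Tumors": ["Neoplasms"], "Migraine": ["Migraine Disorders"]}
print(map_to_descriptors(["Cancer", "Tumors", "Migraine"], entry2desc, score_with_doc=lambda d: 1.0))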
Re-weighting Scheme. A document is often full of class-independent "general"
words and short of class-specific "core" words, which makes document clustering
difficult. Steinbach et al. [13] observe that each class has a "core" vocabulary of
words, while the remaining "general" words may have similar distributions across
different classes. To address this problem, we should "discount" general words and
place more emphasis on core words in a vector [17]. [3, 14] define the term
re-weighting scheme as below:
\tilde{x}_{ji_1} = x_{ji_1} + \sum_{\substack{i_2 = 1,\; i_2 \neq i_1 \\ S(x_{ji_1}, x_{ji_2}) \geq Threshold}}^{m} S(x_{ji_1}, x_{ji_2}) \cdot x_{ji_2}    (11)

where x stands for a term weight, m is the number of co-occurring terms, and
S(x_{ji1}, x_{ji2}) is the semantic similarity between the two corresponding concepts.
Through this re-weighting scheme, the weights of semantically similar terms are
co-augmented. Since we are only interested in re-weighting terms that are semantically
similar to each other, a threshold value, the minimum similarity score between
compared terms, is required. It should also be noted that the term weight can be the
term frequency (TF), the normalized term frequency (NTF) or TF*IDF (term
frequency times inverse document frequency).
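The following is a minimal sketch of equation (11), assuming a document is given as a dict from term to weight and a pairwise term-similarity function; the names are illustrative and do not come from the dragon toolkit.

def reweight(doc_weights, sim, threshold):
    """Equation (11): augment each term's weight by the weights of sufficiently
    similar co-occurring terms, scaled by their similarity.
    doc_weights: dict term -> weight (TF, NTF or TF*IDF); sim(t1, t2) -> similarity score."""
    terms = list(doc_weights)
    new_weights = {}
    for t1 in terms:
        boost = sum(sim(t1, t2) * doc_weights[t2]
                    for t2 in terms
                    if t2 != t1 and sim(t1, t2) >= threshold)
        new_weights[t1] = doc_weights[t1] + boost
    return new_weights

# Hypothetical usage: only the Neoplasms/Carcinoma pair passes the threshold
doc = {"Neoplasms": 2.0, "Carcinoma": 1.0, "Migraine Disorders": 1.0}
print(reweight(doc, sim=lambda a, b: 0.8 if {a, b} == {"Neoplasms", "Carcinoma"} else 0.1,
               threshold=0.5))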
4 Experiment Setting and Result Analysis

4.1 Datasets and Indexing Schemes

We conduct experiments on public MEDLINE documents (abstracts). First, we
collect document sets related to various diseases from MEDLINE, using the
"MajorTopic" tag along with the disease-related MeSH terms as queries. Table 1
shows the 10 document sets (24566 documents) retrieved from MEDLINE. The
collected dataset is then indexed using two schemes: MeSH entry terms and MeSH
descriptor terms. The average document lengths for the MeSH entry term and MeSH
descriptor schemes are 14 and 13, respectively (as shown in Table 2). Compared to the
average document length of 81 when using the bag of words representation, the
dimension of the clustering space is dramatically reduced. A general stop word list is
applied to the bag of words scheme. Moreover, we collect PubMed documents from
1995-2005 to build a MeSH descriptor stop term list for MeSH entry term and MeSH
descriptor term indexing. Since a MeSH entry term can map to more than one MeSH
descriptor term in the MeSH ontology, we map it to the MeSH descriptor term that is
most semantically similar to the other terms in the document. For a fair comparison,
we use the following experimental settings: 1) the number of clusters is set to 10, the
same as the number of document sets; 2) documents with length less than 5 are
removed from the clustering process; 3) when conducting k-means clustering, we run
ten times with random initializations and take the average as the result; within each
comparative experiment, every run uses the same initialization.

4.2 Evaluation Methodology

Cluster quality is evaluated by four extrinsic measures: entropy [13], F-measure [6],
purity [19], and normalized mutual information (NMI) [1]. Because of space
restrictions, we only describe in detail a recently popular measure, NMI, which is
defined as the mutual information between the cluster assignments and a pre-existing
labeling of the dataset, normalized by the arithmetic mean of the maximum possible
entropies of the empirical marginals, i.e.,

NMI(X, Y) = \frac{I(X; Y)}{(\log k + \log c)/2}    (12)

where X is a random variable for cluster assignments, Y is a random variable for the
pre-existing labels on the same data, k is the number of clusters, and c is the number
of pre-existing classes. NMI ranges from 0 to 1, and a larger NMI indicates higher
clustering quality. NMI is better than other common extrinsic measures such as purity
and entropy in the sense that it does not necessarily increase when the number of
clusters increases. Purity and F-measure also range from 0 to 1, and larger values
indicate higher clustering quality. For entropy, smaller values indicate higher
clustering quality.
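A small sketch of equation (12) follows, assuming cluster assignments and true labels are given as lists; the mutual information is estimated from the empirical joint distribution.

import math
from collections import Counter

def nmi(clusters, labels):
    """Equation (12): mutual information between assignments and labels,
    normalized by (log k + log c) / 2."""
    n = len(labels)
    k = len(set(clusters))
    c = len(set(labels))
    joint = Counter(zip(clusters, labels))
    px = Counter(clusters)
    py = Counter(labels)
    mi = sum((nxy / n) * math.log((nxy / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), nxy in joint.items())
    return mi / ((math.log(k) + math.log(c)) / 2)

print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))   # a perfectly matched clustering gives 1.0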
Table 1. The Document Sets and Their Sizes

Document Sets No. of Docs


1 Gout 642
2 Chickenpox 1,083
3 Raynaud Disease 1,153
4 Jaundice 1,486
5 Hepatitis B 1,815
6 Hay Fever 2,632
7 Kidney Calculi 3,071
8 Age-related Macular Degeneration 3,277
9 Migraine 4,174
10 Otitis 5,233

Table 2. Document indexing schemes

Indexing Scheme No. of term indexed Avg. doc length


MeSH entry term 14885 14
MeSH descriptor term 8829 13
Word 41208 81

4.3 Result Analysis


To compare the effects of different similarity measures on clustering quality, we run
k-means clustering on the collected dataset. We represent each document as a TF*IDF
vector, because this scheme achieves much better performance than NTF and TF. The
cosine similarity measure is applied when calculating the distance between a
document vector and a cluster center vector. Moreover, representing a document using
MeSH entry terms is somewhat similar to augmenting a document vector with
synonym terms: as one MeSH descriptor term can relate to many different MeSH entry
terms, it is possible that two or more MeSH entry terms with the same descriptor term
appear in one document. Furthermore, representing a document using MeSH
descriptors helps map all the synonyms occurring in one document to their
corresponding descriptor terms. In this paper, we evaluate the clustering quality of
both representation schemes as well as the word representation scheme. The clustering
process is as follows: (1) index the document sets using MeSH entry terms or MeSH
descriptor terms; (2) calculate term similarities using the selected similarity measure
and build a similarity matrix for the indexed terms; (3) re-weight the terms in each
document vector using the similarity matrix and equation (11); (4) run k-means
clustering. We use the dragon toolkit [18] to implement the whole process.
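Read end to end, the four steps amount to the small pipeline sketched below; it uses NumPy and scikit-learn's KMeans purely for illustration and stands in for, rather than reproduces, the dragon toolkit implementation (the cosine measure is approximated by L2-normalizing the vectors).

import numpy as np
from sklearn.cluster import KMeans

def cluster_documents(tfidf, sim_matrix, threshold, k=10):
    """tfidf: ndarray (n_docs x n_terms); sim_matrix: ndarray (n_terms x n_terms)
    of ontology-based term similarities; threshold: minimum similarity for re-weighting."""
    # Step (3): re-weight each document vector according to equation (11)
    mask = (sim_matrix >= threshold).astype(float) * sim_matrix
    np.fill_diagonal(mask, 0.0)
    reweighted = tfidf + tfidf @ mask.T

    # Step (4): k-means on the re-weighted, L2-normalized vectors
    norms = np.linalg.norm(reweighted, axis=1, keepdims=True)
    reweighted = reweighted / np.maximum(norms, 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(reweighted)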
Experimental results show that, of the three types of term similarity measures, no
single type significantly outperforms the others. This may partially result from the fact
that most of these measures consider not only the closeness of the terms within the
ontology but also the depth of the two compared concepts within the ontology. Note
that the similarity scores of S_{L&C}, S_{Resnik} and S_{Jiang} are not within [0, 1].
Table 3. Clustering results of MeSH entry terms scheme; each measure is followed by the
threshold of similarity value (in parenthesis) that helps achieve the best results

Type of Measure Similarity Measure Entropy F-Score Purity NMI


Path based Wu & Palmer (0.8) 0.392 0.803 0.876 0.757
Li et al. (0.7) 0.353 0.830 0.871 0.771
Leacock (0.2) 0.930 0.596 0.686 0.524
Mao et al. (0.8) 0.338 0.836 0.885 0.781
Information Content Resnik (0.0) 0.353 0.821 0.877 0.774
Jiang (0.1) 0.572 0.695 0.799 0.701
Lin (0.9) 0.360 0.825 0.880 0.771
Feature based Basic Feature (0.8) 0.389 0.795 0.874 0.759
Knappe (0.8) 0.484 0.778 0.831 0.717
MeSH entry term None 0.363 0.800 0.870 0.774
Word None 0.245 0.755 0.908 0.820

Term similarity scores obtained with these three measures are therefore normalized
before being applied in term re-weighting, for a fair comparison. Interestingly, the
information content based measures, which are supported by corpus statistics, have
very similar performance to the other two types of measures. This indicates that the
corpus statistics fit the ontology structure of MeSH and do not improve on the path
based measures. The measure of Mao et al. achieves the best result in both indexing
schemes, as shown in Tables 3 and 4. The reason might be that it is the only measure
that utilizes the number of descendants of the compared terms. Judging from the
overall performance, Wu & Palmer, Li et al., Mao et al., Resnik and the two feature
based measures have rather more stable performance than the others. Moreover, for
almost all the cases shown in Table 3, the four evaluation metrics are consistent with
each other, except that the F-measure and Purity scores of Wu & Palmer and Li et al.
are slightly better than the baseline without re-weighting while their NMI scores are
slightly worse.
From Tables 3 and 4, it is easily seen that the overall performance of the descriptor
scheme is very consistent with, and slightly better than, that of the entry term scheme,
which shows that making a document vector more precise by mapping synonym entry
terms to one descriptor term has positive effects on document clustering. It is also
noted that both indexing schemes without term re-weighting have performance
competitive with those with term re-weighting. This suggests that term re-weighting as
a method of integrating domain ontology into clustering might not be an effective
approach, especially when the documents are short of terms: when all the terms are
very important core terms for the documents, ignoring the effects of some of them by
re-weighting can cause serious information loss. This is in contrast to the experiment
results in the general domain, where document length is relatively longer [3].
It is obvious that the word indexing scheme achieves the best clustering result,
although the difference is not statistically significant (the word scheme result is listed
in both Tables 3 and 4 for the reader's convenience). However, this does not mean that
indexing medical documents using MeSH entry terms or MeSH descriptors is a bad
scheme; in other words, it does not mean domain knowledge is not useful.
Table 4. Clustering results of MeSH descriptor terms scheme; each measure is followed by the
threshold of similarity value (in parenthesis) that helps achieve the best results

Type of Measure Similarity Measure Entropy F-Score Purity NMI


Path based Wu & Palmer (0.8) 0.361 0.789 0.883 0.771
Li et al. (0.7) 0.339 0.756 0.877 0.780
Leacock (0.2) 0.485 0.749 0.907 0.720
Mao et al. (0.8) 0.259 0.831 0.907 0.814
Information Content Resink (0.0) 0.346 0.815 0.890 0.777
Jiang(0.1) 0.529 0.703 0.809 0.696
Lin (0.9) 0.683 0.582 0.775 0.631
Feature based Basic Feature (0.8) 0.385 0.778 0.873 0.760
Knappe (0.8) 0.375 0.784 0.866 0.765
MeSH descriptor None 0.341 0.772 0.867 0.776
Word None 0.245 0.755 0.908 0.820

First, while keeping competitive clustering results, both the dimension of the
clustering space and the computational cost are dramatically reduced, especially when
handling large datasets. Second, existing ontologies are still growing and are not yet
sufficient for many text mining applications; for example, there are only 28533 unique
entry terms at the time of writing. Third, term extraction also has limitations: so far,
existing approaches usually use exact matching to map abstract terms to entry terms
and cannot judge the sense of a phrase, which causes serious information loss. For
example, when representing documents as entry terms, the average document length is
14, while that of the word representation is 81. Finally, taking advantage of both the
medical concept representation and the informative word representation could make
the results of text mining applications more convincing.

5 Conclusion
In this paper, we evaluate the effects of 9 semantic similarity measures, combined
with a term re-weighting method, on the clustering of PubMed document sets. The
k-means clustering experiments show that term re-weighting as a method of
integrating domain knowledge has some positive effects on medical document
clustering, but the effects might not be significant. In detail, we obtain the following
findings from the experiments, which compare 9 semantic similarity measures of three
types (path based, information content based and feature based) under two indexing
schemes, MeSH entry terms and MeSH descriptors: (1) The descriptor scheme is
relatively more effective for clustering than the entry term scheme because the
synonym problem is well handled. (2) No single type of measure is significantly better
than the others, since most of these measures consider only the path between the
compared concepts and their depth information within the ontology. (3) Information
content based measures, which use corpus statistics as well as the ontology structure,
do not necessarily improve the clustering result when the corpus statistics are very
consistent with the ontology structure. (4) As the only similarity measure using the
number of descendants of the compared concepts, the measure of Mao et al. achieves
the best clustering result among all the measures. (5) Similarity measures that do not
score between 0 and 1 need to be normalized; otherwise they augment term weights
much more aggressively. (6) Overall, term re-weighting achieves clustering results
similar to those without re-weighting: some measures outperform the baseline, some
do not, and neither difference is very significant. This may indicate that term
re-weighting is not an effective approach when documents are short of terms, because
when most of these terms are distinguishing core terms for a document, ignoring some
of them by re-weighting will cause serious information loss. (7) The performance of
the MeSH term based schemes is slightly worse than that of the word based scheme,
which can result from the limitations of the domain ontology and of term extraction
and sense disambiguation. However, while keeping competitive results, indexing
using the domain ontology dramatically reduces the dimension of the clustering space
and the computational complexity. Furthermore, this finding indicates that an
approach taking advantage of both the medical concept representation and the
informative word representation is desirable.
In our future work, we may consider other biomedical ontologies such as the
Unified Medical Language System (UMLS) and also expand this comparative study to
some public domains.

Acknowledgments. This work is supported in part by NSF Career grant (NSF IIS
0448023), NSF CCF 0514679, PA Dept of Health Tobacco Settlement Formula Grant
(No. 240205 and No. 240196), and PA Dept of Health Grant (No. 239667).

References
1. Banerjee, A. and Ghosh, J. Frequency sensitive competitive learning for clustering on
high-dimensional hyperspheres. Proc. IEEE Int. Joint Conference on Neural Networks, pp.
1590-1595.
2. Hotho, A., Staab, S. and Stumme, G., “Wordnet improves text document clustering,” in
Proc. of the Semantic Web Workshop at 26th Annual International ACM SIGIR
Conference, Toronto, Canada, 2003.
3. Jing, J., Zhou, L., Ng, M. K. and Huang, Z., “Ontology-based distance measure for text
clustering,” in Proc. of SIAM SDM workshop on text mining, Bethesda, Maryland, USA,
2006.
4. Jiang, J.J. and Conrath, D.W., Semantic Similarity Based on Corpus Statistics and Lexical
Taxonomy. In Proceedings of the International Conference on Research in Computational
Linguistic, Taiwan, 1998.
5. Knappe, R., Bulskov, H. and Andreasen, T.: Perspectives on Ontology-based Querying,
International Journal of Intelligent Systems, 2004.
6. Larsen, B. and Aone, C. Fast and effective text mining using linear-time document
clustering, KDD-99, San Diego, California, 1999, 16-22.
7. Leacock, C. and Chodorow, M., Filling in a sparse training space for word sense
identification. ms., March 1994.
8. Li, Y., Zuhair, A.B., and McLean, D.. An Approach for Measuring Semantic Similarity
between Words Using Multiple Information Sources. IEEE Transactions on Knowledge
and Data Engineering, 15(4):871-882, July/August 2003.
9. Lin, D., Principle-Based Parsing Without Overgeneration. In Proceedings of the 31st
Annual Meeting of the Association for Computational Linguistics (ACL'93), pages
112-120, Columbus, Ohio, 1993.
10. Mao, W. and Chu, W. W., “Free text medical document retrieval via phrased-based vector
space model,” in Proc. of AMIA’02, San Antonio,TX, 2002.
11. Pedersen, T., Pakhomov,S., Patwardhan,S. and Chute, C., Measures of semantic similarity
and relatedness in the biomedical domain. Journal of Biomedical Informatics, In Press,
Corrected Proof, June 2006.
12. Resnik, P., Semantic Similarity in a Taxonomy: An Information-Based Measure and its
Application to Problems of Ambiguity and Natural Language. Journal of Artificial
Intelligence Research, 11:95-130, 1999.
13. Steinbach, M., Karypis, G., and Kumar, V. A Comparison of document clustering
techniques. Technical Report #00-034, Department of Computer Science and Engineering,
University of Minnesota, 2000.
14. Varelas, G., Voutsakis, E., Raftopoulou, P., Petrakis, E. G., and Milios, E. E. 2005.
Semantic similarity methods in wordNet and their application to information retrieval on
the web. WIDM '05. ACM Press, New York, NY, 10-16.
15. Wu, Z. and Palmer, M.. Verb Semantics and Lexical Selection. In Proceedings of the 32nd
Annual Meeting of the Associations for Computational Linguistics (ACL'94), pp133-138,
Las Cruces, New Mexico, 1994.
16. Yoo I., Hu X., Song I-Y., Integration of Semantic-based Bipartite Graph Representation
and Mutual Refinement Strategy for Biomedical Literature Clustering, in the Proceedings
of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (SIGKDD 2006), pp 791-796
17. Zhang X., Zhou X., Hu X., Semantic Smoothing for Model-based Document Clustering,
accepted in the 2006 IEEE International Conference on Data Mining (ICDM'06).
18. Zhou, X., Zhang, X., and Hu, X., The Dragon Toolkit, Data Mining & Bioinformatics Lab,
iSchool at Drexel University, http://www.ischool.drexel.edu/dmbio/dragontool
19. Zhao, Y. and Karypis, G. Criterion functions for document clustering: experiments and
analysis, Technical Report, Department of Computer Science, University of Minnesota,
2001.
An Adaptive and Efficient Unsupervised Shot
Clustering Algorithm for Sports Video

Jia Liao1, Guoren Wang1, Bo Zhang1, Xiaofang Zhou2, and Ge Yu1

1 College of Information Science & Engineering, Northeastern University, Shenyang, China
2 School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia
liaojia [email protected], [email protected]

Abstract. Due to its tremendous commercial potential, sports video has become a
popular research topic nowadays. As the bridge between low-level features and
high-level semantic content, automatic shot clustering is an important issue in the
field of sports video content analysis. In previous work, many clustering approaches
need professional knowledge of videos, experimental parameters, or thresholds to
obtain good clustering results. In this article, we present a new efficient shot
clustering algorithm for sports video which is generic and does not need any prior
domain knowledge. The novel algorithm, called Valid Dimension Clustering (VDC),
performs in an unsupervised manner. For the high-dimensional feature vectors of
video shots, a new dimensionality reduction approach is proposed first, which takes
advantage of the available dimension histogram to obtain "valid dimensions" as a
good approximation of the intrinsic characteristics of the data. The clustering
algorithm then processes the valid dimensions one by one to fully utilize the intrinsic
characteristics of each valid dimension. The iterations of merging and splitting of
similar shots on each valid dimension are repeated until a novel stop criterion,
designed following the theory of Fisher Discriminant Analysis, is satisfied. Finally,
we apply our algorithm to real video data in extensive experiments; the results show
that VDC has excellent performance and outperforms other clustering algorithms.

1 Introduction
In the past few years, more and more sports videos have been produced, distributed
and made available all over the world. Thus, as an important video domain, sports
video has been widely studied due to its tremendous commercial potential.
Different from other categories of video such as news, movies and sitcoms, sports
video has its own special characteristics [1]. A sports game usually takes place on a
specific field and always has its own well-defined content structures and
domain-specific rules. In addition, sports video is usually taken by fixed cameras
with fixed motions in the play field, which results in some
recurrent distinctive scenes throughout the video. For example, in a basketball game
video there are always four dominant scenes: play field, close-up of players, distant
view of players, and audience. To understand sports video well, taking full advantage
of dominant scenes is important. A video shot, which comprises a sequence of
interrelated consecutive frames taken continuously by a single camera, represents a
continuous action in time and space and is the basic unit of a video scene. Since
video shots of a scene are usually similar, merging similar shots into clusters
becomes useful for the analysis of dominant scenes and even for the high-level
contents of videos.
For shot clustering, some conventional algorithms such as k-means clustering and
hierarchical clustering have been exploited recently [2, 3]. These methods, however,
all require some prior domain knowledge to obtain good clustering results. Apart
from this, these existing clustering algorithms all have intrinsic limitations in
processing high-dimensional data.
In this article, we put forward a novel shot clustering algorithm for sports video;
the main contributions of our work are as follows. First, a new dimensionality
reduction approach is proposed: by applying the available dimension histogram
(ADH), only valid dimensions are extracted. Second, in the subspace of valid
dimensions, according to their different importance, our clustering algorithm
processes the valid dimensions one by one to obtain better clustering results. Third, a
novel stop criterion for the iterative merging and splitting procedures on each valid
dimension is designed based on the theory of Fisher Discriminant Analysis.
The rest of this paper is structured as follows. Section 2 introduces the novel
dimensionality reduction approach. The details of our shot clustering algorithm are
discussed in Section 3. In Section 4, the performance study is described. Section 5
gives some related work, while Section 6 concludes the paper and suggests future
work.

2 Dimensionality Reduction

In this section, we discuss our dimensionality reduction approach in detail. The
valid dimension is introduced first; then we describe how to extract valid dimensions
using the available dimension histogram (ADH) to achieve dimensionality reduction.

2.1 Valid Dimension

For high-dimensional data, not all dimensions are useful for every application. In
many applications, such as clustering, indexing and information retrieval, only some
special dimensions are needed. Figure 1 shows an example of clustering. According
to the distribution of the data set in (a), if we want to partition the points into three
clusters, the clustering results can easily be found by computing the distances among
the points in the feature space of dimensions d1 and d2. But in fact, we do not need
to take both dimensions into account; dimension d1 alone is enough.
Fig. 1. An example of valid dimensions for clustering: (a) the data set plotted in dimensions d1 and d2; (b) the same clustering obtained by considering dimension d1 only.

Figure 1(b) shows that the clustering results obtained by considering only dimension
d1 are the same as those in (a). Therefore, dimension d1 contributes to clustering and
is a valid dimension of the data set.
Valid dimensions are the dimensions which can maximally represent the intrinsic
characteristics of the data set. For the data set in Figure 1, the standard deviations of
dimensions d1 and d2 are 0.95 and 0.48, respectively. The reason why dimension d1
is valid for clustering is that its standard deviation is larger, so it can better represent
the distribution of the data set. The standard deviation of a data set is a measure of
how spread out it is [11]: the larger the standard deviation, the more spread out from
the mean the data set is. A data set which is more spread out is more discriminative
in clustering; therefore, a dimension whose standard deviation is larger is more
helpful for clustering.
The dimensionality reduction approach in this paper extracts the valid dimensions
for our clustering algorithm. In the next subsection, we discuss the extraction rule for
valid dimensions.

2.2 Extraction Rule for Valid Dimensions

Sports videos have their own intrinsic characteristics, and the backgrounds of sports
videos do not vary much. By carefully observing the high-dimensional feature
vectors of video shots, it can easily be found that the values of a large number of
dimensions are all zero, especially in the color features. In other words, these
dimensions are useless for computation; the dimensions whose values are non-zero
are called available dimensions in this paper. Table 1 gives examples of the ratio of
available dimensions over the total dimensions for different categories of sports. It
shows that the ratios of available dimensions are only around 50-60%; thus,
extracting the available dimensions is the first step of our dimensionality reduction
approach.
Let Dm be the subspace of the data set with m available dimensions. For the jth
available dimension of Dm, Si[j] denotes its value for shot Si and σS[j] denotes its
standard deviation, where 1 ≤ j ≤ m. The standard deviation of each available
dimension indicates its importance for clustering: a larger σS[j] means that the data in
the jth available dimension are more spread out and more advantageous for
clustering.
Table 1. Ratios of available dimensions of different sports videos

Video data Total dimension Available dimension Ratio


Basketball 512 314 61.3%
Table tennis 512 305 59.6%
Football 512 253 49.4%

Definition 1 (valid dimensions). Given a threshold value ε, the available dimensions
whose standard deviations are equal to or greater than ε are called valid dimensions.
Definition 2 (available dimension histogram, ADH). The available dimension
histogram of Dm represents the distribution of the m available dimensions' standard
deviations σS[j], in which the x-axis represents the rank of the available dimensions
and the y-axis represents their corresponding σS[j]. The ADH displays the descending
trend of the σS[j] values. Figure 2 gives an example of an ADH.

Fig. 2. Example of an available dimension histogram (ADH): the standard deviations σS[j] of the available dimensions plotted against their rank, in descending order.

In order to extract valid dimensions which can maximally represent the distribution
of the data for clustering, a heuristic method on the ADH is applied to determine the
value of ε. Let r[i] denote the rank of the available dimensions of Dm in the ADH.
Then ε = σ_{r[k]}, where k is chosen such that σ_{r[k]} − σ_{r[k+1]} = max(σ_{r[i]} −
σ_{r[i+1]}, 1 ≤ i ≤ m − 1). That is, ε is the standard deviation of the available
dimension r[k] whose difference to that of r[k+1] is largest in Dm, which means ε is
located at the largest plunge in the ADH. Referring to Figure 2, the largest drop of the
ADH occurs from r[3] to r[4], i.e., ε = σ_{r[3]}, and the available dimensions
corresponding to r[1], r[2] and r[3] are the valid dimensions. Intuitively, this
extraction rule guarantees that the most significant available dimensions are extracted
as valid dimensions for our clustering.
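A brief sketch of this extraction rule follows, assuming the shot feature vectors are given as a NumPy array; it keeps the non-zero (available) dimensions, ranks them by standard deviation, and cuts at the largest drop in the ADH. The function name and the toy data are illustrative only.

import numpy as np

def extract_valid_dimensions(X):
    """X: array of shape (num_shots, num_dims). Returns the indices of the valid dimensions."""
    # Step 1: keep available dimensions (at least one non-zero value).
    available = np.where(np.any(X != 0, axis=0))[0]
    stds = X[:, available].std(axis=0)

    # Step 2: rank available dimensions by standard deviation (descending); this is the ADH.
    order = np.argsort(-stds)
    ranked_stds = stds[order]

    # Step 3: epsilon is the standard deviation at the largest drop between consecutive ranks.
    drops = ranked_stds[:-1] - ranked_stds[1:]
    k = int(np.argmax(drops))             # the largest plunge occurs between rank k and k+1
    return available[order[:k + 1]]       # dimensions with std >= epsilon = ranked_stds[k]

# Hypothetical usage: 5 shots, 4 dimensions (the last dimension is all zero, hence unavailable)
X = np.array([[1., 5., 0.2, 0.], [3., 5.1, 0.3, 0.], [5., 4.9, 0.1, 0.],
              [2., 5.0, 0.2, 0.], [4., 5.2, 0.3, 0.]])
print(extract_valid_dimensions(X))        # [0]: dimension 0 has the dominant spread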

3 Unsupervised Shot Clustering Algorithm


In this section, we present our efficient shot clustering algorithm, called valid
dimension clustering (VDC), in detail.
3.1 Algorithm Description of Valid Dimension Clustering

A video shot Si can be represented as Si = {xi1, xi2, ..., xin}, where xip is the pth
dimension of the n-dimensional feature vector Si. Let Df be the subspace of valid
dimensions, where f is the number of valid dimensions obtained by our
dimensionality reduction approach.
Valid dimension clustering (VDC) is an unsupervised clustering algorithm which
processes the dimensions of Df one by one, because different valid dimensions have
different importance for clustering. After ranking the standard deviations of the valid
dimensions in descending order, we first take the valid dimension whose standard
deviation is the largest as the starting point of the algorithm; the following valid
dimensions are then taken into account in order.
For the first valid dimension, each shot is first initialized as one cluster, and the
iterations of merging similar shots into one cluster are repeated until the stop
criterion is satisfied. For each subsequent valid dimension di, the clustering result of
valid dimension di−1 (the dimension preceding di in the rank of valid dimensions) is
set as the initial clustering status of di; then the same merging procedure is performed
on each initial cluster of di until all initial clusters have been processed. After
finishing valid dimension di, the algorithm turns to di+1. The final clustering results
are returned when all f valid dimensions have been processed. Note that for each
valid dimension only merging procedures are performed, but for two consecutive
valid dimensions di−1 and di, the processing of di acts as a splitting procedure for
di−1. Thus, VDC comprises both merging and splitting procedures.

Fig. 3. Different clustering results for table tennis: (a) clustering by VDC, which processes valid dimensions one by one; (b) clustering with all valid dimensions considered at once.

The reason why VDC processes valid dimensions one by one is explained by
Figure 3. Figure 3(a) gives the clustering results of VDC, i.e., valid dimensions are
taken into account one by one, while Figure 3(b) shows the results of an algorithm in
which all valid dimensions are taken into account at once. Obviously, the results in
(a) are better than those in (b). All six shots are actually play field shots, but (b)
partitions them into two clusters according to the different positions of the play table.
The reason is that when we consider all valid dimensions together, they are treated
equally and their different importance is not distinguished.
3.2 Stop Criterion of Valid Dimension Clustering


The stop criterion for the iterations is the most critical component of an unsupervised
clustering algorithm, as it directly determines the clustering results. In this paper, we
devise a novel stop criterion inspired by Fisher Discriminant Analysis.
Fisher Discriminant Analysis is a widely used multivariate statistical technique
[12]. The discriminant function can be used as a well-defined rule to optimally assign
a new observation to a labeled class. Consider k populations G1, G2, ..., Gk, each with
a p-variate distribution denoted as (x1, x2, ..., xp). Fisher suggested finding a linear
combination of the multivariate observations (x1, x2, ..., xp) to create a univariate
observation u(x) such that u(x) separates the samples of different populations as
much as possible. The Fisher discriminant function can be written as:

u(x) = \alpha^T x = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_p x_p    (1)

Let SSE and SSG denote the total within-class divergence and the total between-class
divergence of the data samples. The α which maximizes the criterion F(α) is used in
the Fisher discriminant function (1), where F(α) is defined as:

F(\alpha) = \frac{SSG}{SSE} = \frac{\alpha^T B \alpha}{\alpha^T E \alpha}    (2)
For our shot clustering algorithm, we are only interested in the concepts of
within-class divergence and between-class divergence. For clustering, the
intra-distance within a cluster and the inter-distance among different clusters can be
mapped to the concepts of within-class divergence and between-class divergence,
respectively. The clustering result in which the intra-distance of each cluster is
smallest and the inter-distances among different clusters are largest is the desirable
one, as it indicates that the data set is separated optimally.
Let rl denote the ratio of the intra-cluster distances over the inter-cluster distances
when the number of clusters is Nl; the best clustering result is the one with the
smallest value of rl. The value of rl can be calculated by the formula below:

r_l = \frac{\sum_{c=0}^{N_l} d_w^c}{d_t} = \frac{\sum_{c=0}^{N_l} \sum_{i=0}^{m^c} |S_i^c - S_{mean}^c|}{\sum_{j=0}^{N} |S_j - S_{mean}|}    (3)

where d_t is the initial distance among the clusters, d_w^c is the intra-cluster distance
of cluster c, N is the initial number of clusters, and m^c is the number of shots in
cluster c. |·| denotes the Manhattan distance. S_i^c and S_mean^c represent the ith
shot and the mean vector of cluster c, respectively, while S_j and S_mean denote the
same concepts for the initial clusters.
Apart from rl, another important factor nl, which summarizes the number of
clusters, is also considered in our algorithm.
Fig. 4. Relation curve of rl + nl versus the number of iterations m; the inflexion marked as the stop point corresponds to the smallest value of rl + nl.

Algorithm 1. VDC()
Input: ranking array of valid dimensions r[k]; cluster structures CR
Output: clustering results
1: for dn = 1 to k do
2:   ptr = GetHead(CR)
3:   while ptr ≠ NULL do
4:     S = ODC(ptr, dn)   // S denotes the splitting results
5:     InsertFront(CR, S)
6:     ptr = GetNext(ptr)
7:   end while
8: end for

Function ODC(CR, dn)
  initialize each shot Si as one cluster Ci
  let rl(1) = 0, nl(1) = 1; calculate dist(Ci, Cj)dn for 1 ≤ i, j ≤ Nl
  execute MergeCluster()
  WHILE rl(1) + nl(1) > rl(2) + nl(2) AND Nl > 1
    rl(1) = rl(2), nl(1) = nl(2)
    execute MergeCluster()
  ENDWHILE
  add the clustering results to CR
end Function

Function MergeCluster()
  merge the two most similar clusters into one cluster
  calculate rl(2), nl(2) and rl(2) + nl(2)
end Function

Let nl = Nl/N be the ratio of the cluster number Nl over the initial total number of
shots N. In order to approximate the real cluster number, which is a small value, the
smaller the value of nl, the better the clustering result.
At the beginning of the clustering algorithm, each shot is initialized as one cluster,
the value of rl is 0 and the value of nl is 1. Then, as the merging proceeds, rl
increases while nl decreases. When all the shots are merged
into one cluster, rl reaches 1 and nl reaches its smallest value. Since an encouraging
clustering result should have both a small rl and a small nl, we choose min(rl + nl) as
the stop criterion of our algorithm: when rl + nl reaches its smallest value, the
iterations of merging stop. For example, the relation curve of rl + nl versus the
number of iterations m for one valid dimension of a football video is shown in
Figure 4; the inflexion of the curve, which corresponds to the smallest value of
rl + nl, is the stop point of the iterations.
Having presented the stop criterion for the iterations, the detailed description of
VDC is given in Algorithm 1.
the detailed algorithm description of VDC is described in Algorithm 1.

4 Performance Study
In this section, we report our extensive performance study on large real video data
and the comparison with two other clustering algorithms.

4.1 Experiments Set Up


Our data set consists of about 4.5 hours of video data covering three categories of
sports video captured from TV stations. All videos are in MPEG format with a frame
rate of 25 fps, and each frame is 320*240 pixels. After shot boundary detection, each
shot is represented by feature vectors of four dimensionalities: 288-D, 320-D, 384-D
and 512-D, all composed of HSV color features and motion features extracted from
its P-frames.

Table 2. Data set statistics

Video Lengh Total shots Cluster of shots(shot number)


Basketball(B) 1:07:25 390 C1:play field(145);C2:close-up of player(67)
C3:distant view of player(150);C4:audience(28)
Table tennis(T) 1:22:32 634 C1:play field(220);C2:close-up of player(335)
C3:distant view of player(47);C4:audience(32)
Football(F) 1:35:51 630 C1:play field(182);C2:close-up of player(275)
C3:distant view of player(93);C4:shooting(56)
C5:audience(24)

In order to assess the effectiveness of our algorithm, we manually assign each shot to
a cluster beforehand according to video grammar. The total number of shots in our
data set is 1654, and the detailed information is listed in Table 2. Two commonly
used measurements, Precision (P) and Recall (R), are used to evaluate the
performance of our algorithm. All experiments were run on an Intel Pentium D820
processor (2.8 GHz, 1 GB RAM).

4.2 Effectiveness of Valid Dimension Clustering(VDC)


To show the performance of our algorithm, two other clustering algorithms are
applied in our experiments for comparison.
Fig. 5. Effect of dimensionality reduction: CPU time (ms) of VDC and FDC plotted (a) against dimensionality and (b) against the number of shots.

Fig. 6. Effect of dimensionality: (a) precision (%) and (b) recall (%) of VDC, FDC and X-means at different dimensionalities.

One, called FDC, applies our stop criterion for the merging iterations but performs
on the whole high-dimensional feature space without dimensionality reduction. The
other is X-means [6], a reformative variant of k-means.

Efficiency of Dimensionality Reduction. First, we test the efficiency of our
dimensionality reduction approach, which applies the available dimension histogram
(ADH). Figure 5 depicts the CPU time improvement achieved by VDC over FDC on
data sets with different dimensionalities and different data sizes.
This experiment confirms that dimensionality reduction is effective and necessary
for clustering. As the dimensionality and the size of the data increase, the CPU times
of both VDC and FDC increase, but the growth rate of VDC is much slower than that
of FDC, especially in (a). Obviously, dimensionality reduction plays an important
role: only valid dimensions are considered in clustering, which speeds up the
algorithm.

Performance Comparison. In this experiment, we compare VDC with the two other
shot clustering algorithms. We test the effect of feature spaces of different
dimensionalities and different categories of sports vid