
Lecture Notes in Computer Science 3907

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
New York University, NY, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Franz Rothlauf et al. (Eds.)

Applications
of Evolutionary
Computing
EvoWorkshops 2006: EvoBIO, EvoCOMNET, EvoHOT
EvoIASP, EvoINTERACTION, EvoMUSART, and EvoSTOC
Budapest, Hungary, April 10-12, 2006
Proceedings

Volume Editors

see next page

The cover illustration is the work of Pierre Grenier

Library of Congress Control Number: 2006922618

CR Subject Classification (1998): F.1, D.1, B, C.2, J.3, I.4, J.5

LNCS Sublibrary: SL 1 Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-540-33237-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-33237-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11732242 06/3142 543210
Lecture Notes in Computer Science

For information about Vols. 1-3821, please contact your bookseller or Springer

Vol. 3927: J. Hespanha, A. Tiwari (Eds.), Hybrid Systems: Computation and Control. XII, 584 pages. 2006.

Vol. 3925: A. Valmari (Ed.), Model Checking Software. X, 307 pages. 2006.

Vol. 3924: P. Sestoft (Ed.), Programming Languages and Systems. XII, 343 pages. 2006.

Vol. 3923: A. Mycroft, A. Zeller (Eds.), Compiler Construction. XV, 277 pages. 2006.

Vol. 3922: L. Baresi, R. Heckel (Eds.), Fundamental Approaches to Software Engineering. XIII, 427 pages. 2006.

Vol. 3921: L. Aceto, A. Ingólfsdóttir (Eds.), Foundations of Software Science and Computation Structures. XV, 447 pages. 2006.

Vol. 3920: H. Hermanns, J. Palsberg (Eds.), Tools and Algorithms for the Construction and Analysis of Systems. XIV, 506 pages. 2006.

Vol. 3916: J. Li, Q. Yang, A.-H. Tan (Eds.), Data Mining for Biomedical Applications. VIII, 155 pages. 2006. (Sublibrary LNBI).

Vol. 3915: R. Nayak, M.J. Zaki (Eds.), Knowledge Discovery from XML Documents. VIII, 105 pages. 2006.

Vol. 3907: F. Rothlauf, J. Branke, S. Cagnoni, E. Costa, C. Cotta, R. Drechsler, E. Lutton, P. Machado, J.H. Moore, J. Romero, G.D. Smith, G. Squillero, H. Takagi (Eds.), Applications of Evolutionary Computing. XXIV, 813 pages. 2006.

Vol. 3906: J. Gottlieb, G.R. Raidl (Eds.), Evolutionary Computation in Combinatorial Optimization. XI, 293 pages. 2006.

Vol. 3905: P. Collet, M. Tomassini, M. Ebner, S. Gustafson, A. Ekárt (Eds.), Genetic Programming. XI, 361 pages. 2006.

Vol. 3904: M. Baldoni, U. Endriss, A. Omicini, P. Torroni (Eds.), Declarative Agent Languages and Technologies III. XII, 245 pages. 2006. (Sublibrary LNAI).

Vol. 3903: K. Chen, R. Deng, X. Lai, J. Zhou (Eds.), Information Security Practice and Experience. XIV, 392 pages. 2006.

Vol. 3901: P.M. Hill (Ed.), Logic Based Program Synthesis and Transformation. X, 179 pages. 2006.

Vol. 3899: S. Frintrop, VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search. XIV, 216 pages. 2006. (Sublibrary LNAI).

Vol. 3897: B. Preneel, S. Tavares (Eds.), Selected Areas in Cryptography. XI, 371 pages. 2006.

Vol. 3896: Y. Ioannidis, M.H. Scholl, J.W. Schmidt, F. Matthes, M. Hatzopoulos, K. Boehm, A. Kemper, T. Grust, C. Boehm (Eds.), Advances in Database Technology - EDBT 2006. XIV, 1208 pages. 2006.

Vol. 3895: O. Goldreich, A.L. Rosenberg, A.L. Selman (Eds.), Theoretical Computer Science. XII, 399 pages. 2006.

Vol. 3894: W. Grass, B. Sick, K. Waldschmidt (Eds.), Architecture of Computing Systems - ARCS 2006. XII, 496 pages. 2006.

Vol. 3890: S.G. Thompson, R. Ghanea-Hercock (Eds.), Defence Applications of Multi-Agent Systems. XII, 141 pages. 2006. (Sublibrary LNAI).

Vol. 3889: J. Rosca, D. Erdogmus, J.C. Príncipe, S. Haykin (Eds.), Independent Component Analysis and Blind Signal Separation. XXI, 980 pages. 2006.

Vol. 3888: D. Draheim, G. Weber (Eds.), Trends in Enterprise Application Architecture. IX, 145 pages. 2006.

Vol. 3887: J.R. Correa, A. Hevia, M. Kiwi (Eds.), LATIN 2006: Theoretical Informatics. XVI, 814 pages. 2006.

Vol. 3886: E.G. Bremer, J. Hakenberg, E.-H.(S.) Han, D. Berrar, W. Dubitzky (Eds.), Knowledge Discovery in Life Science Literature. XIV, 147 pages. 2006. (Sublibrary LNBI).

Vol. 3885: V. Torra, Y. Narukawa, A. Valls, J. Domingo-Ferrer (Eds.), Modeling Decisions for Artificial Intelligence. XII, 374 pages. 2006. (Sublibrary LNAI).

Vol. 3884: B. Durand, W. Thomas (Eds.), STACS 2006. XIV, 714 pages. 2006.

Vol. 3881: S. Gibet, N. Courty, J.-F. Kamp (Eds.), Gesture in Human-Computer Interaction and Simulation. XIII, 344 pages. 2006. (Sublibrary LNAI).

Vol. 3880: A. Rashid, M. Aksit (Eds.), Transactions on Aspect-Oriented Software Development I. IX, 335 pages. 2006.

Vol. 3879: T. Erlebach, G. Persiano (Eds.), Approximation and Online Algorithms. X, 349 pages. 2006.

Vol. 3878: A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing. XVII, 589 pages. 2006.

Vol. 3877: M. Detyniecki, J.M. Jose, A. Nürnberger, C.J. van Rijsbergen (Eds.), Adaptive Multimedia Retrieval: User, Context, and Feedback. XI, 279 pages. 2006.

Vol. 3876: S. Halevi, T. Rabin (Eds.), Theory of Cryptography. XI, 617 pages. 2006.

Vol. 3875: S. Ur, E. Bin, Y. Wolfsthal (Eds.), Haifa Verification Conference. X, 265 pages. 2006.

Vol. 3874: R. Missaoui, J. Schmidt (Eds.), Formal Concept Analysis. X, 309 pages. 2006. (Sublibrary LNAI).

Vol. 3873: L. Maicher, J. Park (Eds.), Charting the Topic Maps Research and Applications Landscape. VIII, 281 pages. 2006. (Sublibrary LNAI).

Vol. 3872: H. Bunke, A.L. Spitz (Eds.), Document Analysis Systems VII. XIII, 630 pages. 2006.

Vol. 3870: S. Spaccapietra, P. Atzeni, W.W. Chu, T. Catarci, K.P. Sycara (Eds.), Journal on Data Semantics V. XIII, 237 pages. 2006.

Vol. 3869: S. Renals, S. Bengio (Eds.), Machine Learning for Multimodal Interaction. XIII, 490 pages. 2006.

Vol. 3868: K. Römer, H. Karl, F. Mattern (Eds.), Wireless Sensor Networks. XI, 342 pages. 2006.

Vol. 3866: T. Dimitrakos, F. Martinelli, P.Y.A. Ryan, S. Schneider (Eds.), Formal Aspects in Security and Trust. X, 259 pages. 2006.

Vol. 3865: W. Shen, K.-M. Chao, Z. Lin, J.-P.A. Barthès (Eds.), Computer Supported Cooperative Work in Design II. XII, 359 pages. 2006.

Vol. 3863: M. Kohlhase (Ed.), Mathematical Knowledge Management. XI, 405 pages. 2006. (Sublibrary LNAI).

Vol. 3862: R.H. Bordini, M. Dastani, J. Dix, A.E.F. Seghrouchni (Eds.), Programming Multi-Agent Systems. XIV, 267 pages. 2006. (Sublibrary LNAI).

Vol. 3861: J. Dix, S.J. Hegner (Eds.), Foundations of Information and Knowledge Systems. X, 331 pages. 2006.

Vol. 3860: D. Pointcheval (Ed.), Topics in Cryptology - CT-RSA 2006. XI, 365 pages. 2006.

Vol. 3858: A. Valdes, D. Zamboni (Eds.), Recent Advances in Intrusion Detection. X, 351 pages. 2006.

Vol. 3857: M.P.C. Fossorier, H. Imai, S. Lin, A. Poli (Eds.), Applied Algebra, Algebraic Algorithms and Error-Correcting Codes. XI, 350 pages. 2006.

Vol. 3855: E.A. Emerson, K.S. Namjoshi (Eds.), Verification, Model Checking, and Abstract Interpretation. XI, 443 pages. 2005.

Vol. 3854: I. Stavrakakis, M. Smirnov (Eds.), Autonomic Communication. XIII, 303 pages. 2006.

Vol. 3853: A.J. Ijspeert, T. Masuzawa, S. Kusumoto (Eds.), Biologically Inspired Approaches to Advanced Information Technology. XIV, 388 pages. 2006.

Vol. 3852: P.J. Narayanan, S.K. Nayar, H.-Y. Shum (Eds.), Computer Vision - ACCV 2006, Part II. XXXI, 977 pages. 2006.

Vol. 3851: P.J. Narayanan, S.K. Nayar, H.-Y. Shum (Eds.), Computer Vision - ACCV 2006, Part I. XXXI, 973 pages. 2006.

Vol. 3850: R. Freund, G. Paun, G. Rozenberg, A. Salomaa (Eds.), Membrane Computing. IX, 371 pages. 2006.

Vol. 3849: I. Bloch, A. Petrosino, A.G.B. Tettamanzi (Eds.), Fuzzy Logic and Applications. XIV, 438 pages. 2006. (Sublibrary LNAI).

Vol. 3848: J.-F. Boulicaut, L. De Raedt, H. Mannila (Eds.), Constraint-Based Mining and Inductive Databases. X, 401 pages. 2006. (Sublibrary LNAI).

Vol. 3847: K.P. Jantke, A. Lunzer, N. Spyratos, Y. Tanaka (Eds.), Federation over the Web. X, 215 pages. 2006. (Sublibrary LNAI).

Vol. 3846: H.J. van den Herik, Y. Björnsson, N.S. Netanyahu (Eds.), Computers and Games. XIV, 333 pages. 2006.

Vol. 3845: J. Farré, I. Litovsky, S. Schmitz (Eds.), Implementation and Application of Automata. XIII, 360 pages. 2006.

Vol. 3844: J.-M. Bruel (Ed.), Satellite Events at the MoDELS 2005 Conference. XIII, 360 pages. 2006.

Vol. 3843: P. Healy, N.S. Nikolov (Eds.), Graph Drawing. XVII, 536 pages. 2006.

Vol. 3842: H.T. Shen, J. Li, M. Li, J. Ni, W. Wang (Eds.), Advanced Web and Network Technologies, and Applications. XXVII, 1057 pages. 2006.

Vol. 3841: X. Zhou, J. Li, H.T. Shen, M. Kitsuregawa, Y. Zhang (Eds.), Frontiers of WWW Research and Development - APWeb 2006. XXIV, 1223 pages. 2006.

Vol. 3840: M. Li, B. Boehm, L.J. Osterweil (Eds.), Unifying the Software Process Spectrum. XVI, 522 pages. 2006.

Vol. 3839: J.-C. Filliâtre, C. Paulin-Mohring, B. Werner (Eds.), Types for Proofs and Programs. VIII, 275 pages. 2006.

Vol. 3838: A. Middeldorp, V. van Oostrom, F. van Raamsdonk, R. de Vrijer (Eds.), Processes, Terms and Cycles: Steps on the Road to Infinity. XVIII, 639 pages. 2005.

Vol. 3837: K. Cho, P. Jacquet (Eds.), Technologies for Advanced Heterogeneous Networks. IX, 307 pages. 2005.

Vol. 3836: J.-M. Pierson (Ed.), Data Management in Grids. X, 143 pages. 2006.

Vol. 3835: G. Sutcliffe, A. Voronkov (Eds.), Logic for Programming, Artificial Intelligence, and Reasoning. XIV, 744 pages. 2005. (Sublibrary LNAI).

Vol. 3834: D.G. Feitelson, E. Frachtenberg, L. Rudolph, U. Schwiegelshohn (Eds.), Job Scheduling Strategies for Parallel Processing. VIII, 283 pages. 2005.

Vol. 3833: K.-J. Li, C. Vangenot (Eds.), Web and Wireless Geographical Information Systems. XI, 309 pages. 2005.

Vol. 3832: D. Zhang, A.K. Jain (Eds.), Advances in Biometrics. XX, 796 pages. 2005.

Vol. 3831: J. Wiedermann, G. Tel, J. Pokorný, M. Bieliková, J. Štuller (Eds.), SOFSEM 2006: Theory and Practice of Computer Science. XV, 576 pages. 2006.

Vol. 3830: D. Weyns, H.V.D. Parunak, F. Michel (Eds.), Environments for Multi-Agent Systems II. VIII, 291 pages. 2006. (Sublibrary LNAI).

Vol. 3829: P. Pettersson, W. Yi (Eds.), Formal Modeling and Analysis of Timed Systems. IX, 305 pages. 2005.

Vol. 3828: X. Deng, Y. Ye (Eds.), Internet and Network Economics. XVII, 1106 pages. 2005.

Vol. 3827: X. Deng, D.-Z. Du (Eds.), Algorithms and Computation. XX, 1190 pages. 2005.

Vol. 3826: B. Benatallah, F. Casati, P. Traverso (Eds.), Service-Oriented Computing - ICSOC 2005. XVIII, 597 pages. 2005.

Vol. 3824: L.T. Yang, M. Amamiya, Z. Liu, M. Guo, F.J. Rammig (Eds.), Embedded and Ubiquitous Computing - EUC 2005. XXIII, 1204 pages. 2005.

Vol. 3823: T. Enokido, L. Yan, B. Xiao, D. Kim, Y. Dai, L.T. Yang (Eds.), Embedded and Ubiquitous Computing - EUC 2005 Workshops. XXXII, 1317 pages. 2005.

Vol. 3822: D. Feng, D. Lin, M. Yung (Eds.), Information Security and Cryptology. XII, 420 pages. 2005.
Volume Editors

Franz Rothlauf
Dept. of Business Administration and Information Systems
University of Mannheim
Schloss, 68131 Mannheim, Germany
[email protected]

Jürgen Branke
Institute AIFB
University of Karlsruhe
76128 Karlsruhe, Germany
[email protected]

Stefano Cagnoni
Dept. of Computer Engineering
University of Parma
Parco Area delle Scienze 181/a
43100 Parma, Italy
[email protected]

Ernesto Costa
Dept. de Engenharia Informática
University of Coimbra
Polo II - Pinhal de Marrocos
3030-290 Coimbra, Portugal
[email protected]

Carlos Cotta
Dep. Lenguajes y Ciencias de la Computación
Universidad de Málaga
29071 Málaga, Spain
[email protected]

Rolf Drechsler
Institute of Computer Science
University of Bremen
28359 Bremen, Germany
[email protected]

Evelyne Lutton
INRIA Rocquencourt
B.P. 105, 78153 Le Chesnay Cedex, France
[email protected]

Penousal Machado
Dep. de Engenharia Informática
University of Coimbra
Polo II, 3030 Coimbra, Portugal
[email protected]

Jason H. Moore
Dept. of Genetics
Dartmouth Medical School
HB7937 One Medical Center Dr.
Lebanon, NH 03756, USA
[email protected]

Juan Romero
Facultad de Informática
University of A Coruña
A Coruña, CP 15071, Spain
[email protected]

George D. Smith
School of Computing Sciences
University of East Anglia
UEA Norwich
Norwich NR4 7TJ, UK
[email protected]

Giovanni Squillero
Dip. di Automatica e Informatica
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 Torino, Italy
[email protected]

Hideyuki Takagi
Kyushu University, Faculty of Design
4-9-1, Shiobaru, Minami-ku
Fukuoka, 815-8540 Japan
[email protected]
Preface

Evolutionary computation (EC) techniques are efficient nature-inspired planning and optimization methods based on the principles of natural evolution and genetics. Due to their efficiency and the simple underlying principles, these methods can be used for a large number of problems in the context of problem solving, optimization, and machine learning. A large and continuously increasing number of researchers and practitioners make use of EC techniques in many application domains. This book presents a careful selection of relevant EC applications combined with thorough examinations of techniques for a successful application of EC. The presented papers illustrate the current state of the art in the application of EC and should help and inspire researchers and practitioners to develop efficient EC methods for design and problem solving.
All the papers in this book were presented during EvoWorkshops 2006, which
consisted of a varying collection of workshops on application-oriented aspects of
EC. Since 1998, the format of the EvoWorkshops has proved to be very successful
and to represent significant advances in the application areas of EC. As a result,
over the last few years, EvoWorkshops has become one of the major events
to focus solely on applicational aspects of EC, constituting an important link
between EC research and the application of EC in a variety of domains.
EvoWorkshops is co-located with EuroGP, the main European event dedicated to genetic programming, and EvoCOP, which has become the main European conference on evolutionary computation in combinatorial optimization. The proceedings for both of these events, EuroGP 2006 and EvoCOP 2006, are also available in the LNCS series (numbers 3905 and 3906).
EvoWorkshops 2006, of which this volume contains the proceedings, was held in Budapest, Hungary, on April 10-12, 2006, jointly with EuroGP 2006 and EvoCOP 2006. EvoWorkshops 2006 consisted of the following individual workshops:

- EvoBIO, the Fourth European Workshop on Evolutionary Bioinformatics,
- EvoCOMNET, the Third European Workshop on Evolutionary Computation in Communications, Networks, and Connected Systems,
- EvoHOT, the Third European Workshop on Evolutionary Computation in Hardware Optimization,
- EvoIASP, the Eighth European Workshop on Evolutionary Computation in Image Analysis and Signal Processing,
- EvoINTERACTION, the First European Workshop on Interactive Evolution and Humanized Computational Intelligence,
- EvoMUSART, the Fourth European Workshop on Evolutionary Music and Art, and
- EvoSTOC, the Third European Workshop on Evolutionary Algorithms in Stochastic and Dynamic Environments.

EvoBIO is concerned with the exploitation of EC and related techniques in bioinformatics and computational biology. For analyzing and understanding biological data, EC plays an increasingly important role in the pharmaceutical industry, in biotechnology, and in associated industries, as well as in scientific discovery.
EvoCOMNET addresses the application of EC techniques to problems in communications, networks, and connected systems. New communication technologies, the creation of interconnected communication and information networks such as the Internet, new types of interpersonal and interorganizational communication, and the integration and interconnection of production centers and industries are the driving forces on the road towards a connected, networked society. EC techniques are important tools for facing these challenges.
EvoHOT highlights the latest developments in the field of EC applications to hardware and design optimization. This includes various aspects like the design of electrical and digital circuits or the solving of classical hardware optimization problems.
EvoIASP, which was the first international event solely dedicated to the applications of EC to image analysis and signal processing, this year addressed topics ranging from fingerprinting to classification problems and artificial ants.
EvoInteraction deals with various aspects of interactive evolution, and more broadly of computational intelligence in interaction with human intelligence, including methodology, theoretical issues, and new applications. Interaction with humans raises several problems, mainly linked to what has been called the user bottleneck, i.e., human fatigue.
EvoMUSART focuses on the use of EC techniques for the development of creative systems. There is a growing interest in the application of these techniques in fields such as art, music, architecture, and design. The goal of EvoMUSART is to bring together researchers who use EC in this context, providing an opportunity to promote, present and discuss the latest work in the area, fostering its further development and collaboration among researchers.
EvoSTOC addresses the application of EC in stochastic environments. This includes optimization problems with noisy and approximated fitness functions that change over time, the treatment of noise, and the search for robust solutions. These topics have recently gained increasing attention in the EC community, and EvoSTOC was the first workshop that provided a platform to present and discuss the latest research in this field.
EvoWorkshops 2006 continued the tradition of providing researchers in these fields, as well as people from industry, students, and interested newcomers, with an opportunity to present new results, discuss current developments and applications, or just become acquainted with the world of EC, besides fostering closer future interaction between members of all scientific communities that may benefit from EC techniques.

This year, the EvoWorkshops had the highest number of submissions ever. The number of submissions increased from 123 in 2004 to 143 in 2005 to 149 in 2006. EvoWorkshops 2006 accepted full papers of twelve pages and short papers with a reduced length of five pages. The acceptance rate of 43.6% (65 of 149 submissions) for EvoWorkshops 2006 is an indicator of the high quality of the papers presented at the workshops and included in these proceedings. The following table gives details on the number of submissions, the number of accepted papers, and the acceptance ratios for EvoWorkshops 2005 and EvoWorkshops 2006 (accepted short papers are given in brackets). For comparison, the acceptance rate of EvoWorkshops 2004 was 44.7%.

                          2006                          2005
Workshop         submissions  accept  ratio    submissions  accept  ratio
EvoBIO                40        21    52.5%         32        13    40.6%
EvoCOMNET             16         5    31.2%         22         5    22.7%
EvoHOT                 9         5    55.6%         11         7    63.6%
EvoIASP               35     12(7)    34.3%         37        17    45.9%
EvoInteraction         8         6    75.0%          -         -        -
EvoMUSART             29     10(4)    34.5%         29     10(6)    34.5%
EvoSTOC               12      6(2)    50.0%         12      4(4)    33.3%
Total                149    65(13)    43.6%        143    56(10)    39.1%

We would like to thank all the members of the program committees for their quick and thorough work. We thank the Artpool Art Research Center of Budapest, and especially György Galántai, for offering space and expertise without which the wonderful evolutionary art and music exhibition associated with the conference would not have been possible. Furthermore, we would like to acknowledge the support from Napier University, Edinburgh.
Finally, we would like to say a special thanks to everybody who was involved in the preparation of the event. Special thanks are due to Jennifer Willies, whose work is a great and invaluable help; without her support, running this type of conference with a large number of different organizers and different opinions would be impossible. Further thanks go to the local organizer, Anikó Ekárt, and her group, who made it possible to run such a conference in such a nice place.

April 2006 Franz Rothlauf Jürgen Branke Stefano Cagnoni


Ernesto Costa Carlos Cotta Rolf Drechsler
Evelyne Lutton Penousal Machado Jason H. Moore
Juan Romero George D. Smith Giovanni Squillero
Hideyuki Takagi
Organization

EvoWorkshops 2006 was jointly organized with EuroGP 2006 and EvoCOP 2006.

Organizing Committee

EvoWorkshops chair: Franz Rothlauf, University of Mannheim, Germany
Local chair: Anikó Ekárt, Hungarian Academy of Sciences, Hungary
Publicity chair: Steven Gustafson, University of Nottingham, UK

EvoBIO co-chairs: Carlos Cotta, Universidad de Málaga, Spain
                  Jason H. Moore, Dartmouth Medical School, USA
EvoCOMNET co-chairs: Franz Rothlauf, University of Mannheim, Germany
                     George D. Smith, University of East Anglia, UK
EvoHOT co-chairs: Giovanni Squillero, Politecnico di Torino, Italy
                  Rolf Drechsler, University of Bremen, Germany
EvoIASP chair: Stefano Cagnoni, University of Parma, Italy
EvoInteraction co-chairs: Evelyne Lutton, INRIA, France
                          Hideyuki Takagi, Kyushu University, Japan
EvoMUSART co-chairs: Juan Romero, University of A Coruña, Spain
                     Penousal Machado, University of Coimbra, Portugal
EvoSTOC co-chairs: Jürgen Branke, University of Karlsruhe, Germany
                   Ernesto Costa, University of Coimbra, Portugal

Program Committees

EvoBIO Program Committee:

Jesús Aguilar, Pablo de Olavide University, Spain
Jacek Blazewicz, Poznan University of Technology, Poland
Vincenzo Cutello, University of Catania, Italy
Gary Fogel, Natural Selection Inc., USA
James Foster, University of Idaho, USA
Alex Freitas, University of Kent, UK
Raúl Giráldez, Pablo de Olavide University, Spain
Rosalba Giugno, University of Catania, Italy
Jin-Kao Hao, University of Angers, France
Natalio Krasnogor, University of Nottingham, UK
Bill Langdon, University of Essex, UK
Robert MacCallum, Imperial College London, UK
Elena Marchiori, Vrije Universiteit Amsterdam, The Netherlands
Andrew Martin, University College London, UK
Pablo Moscato, University of Newcastle, Australia
Vic J. Rayward-Smith, University of East Anglia, UK
John Rowe, University of Birmingham, UK
Jem Rowland, University of Wales, UK
El-Ghazali Talbi, INRIA Futurs, France
Antoine van Kampen, Academic Medical Center, The Netherlands
Gwen Volkert, Kent State University, USA
Ray Walshe, Dublin City University, Ireland
Eckart Zitzler, ETH Zurich, Switzerland
Igor Zwir, University of Granada, Spain

EvoCOMNET Program Committee:


Stuart Allen, Cardiff University, UK
Jin-Kao Hao, University of Angers, France
Bryant Julstrom, St. Cloud State University, USA
Paul Marrow, BT UK, UK
Geoff McKeown, UEA Norwich, UK
Günther R. Raidl, Vienna University of Technology, Austria
Vic Rayward-Smith, UEA Norwich, UK
Franz Rothlauf, University of Mannheim, Germany
Giovanni Squillero, Politecnico di Torino, Italy
George D. Smith, University of East Anglia, UK
Andrew Tuson, City University, London, UK

EvoHOT Program Committee:


Varun Aggarwal, Massachusetts Institute of Technology, USA
Bernd Becker, Albert-Ludwigs-University Freiburg, Germany
Rolf Drechsler, University of Bremen, Germany
Michelangelo Grosso, Politecnico di Torino, Italy
Andrew Kinane, Dublin City University, Ireland
Gabriella Kókai, University of Erlangen, Germany
Una-May O'Reilly, Massachusetts Institute of Technology, USA
Mihai Oltean, Babes-Bolyai University, Romania
Gregor Papa, Jozef Stefan Institute, Slovenia
Ernesto Sanchez, Politecnico di Torino, Italy
Lukáš Sekanina, Brno University of Technology, Czech Republic
Massimiliano Schillaci, Politecnico di Torino, Italy
Giovanni Squillero, Politecnico di Torino, Italy
Luca Sterpone, Politecnico di Torino, Italy
Andreas Veneris, University of Toronto, Canada

EvoIASP Program Committee:

Lucia Ballerini, Örebro University, Sweden
Bir Bhanu, University of California, USA
Leonardo Bocchi, University of Florence, Italy
Alberto Broggi, University of Parma, Italy
Stefano Cagnoni, University of Parma, Italy
Ela Claridge, University of Birmingham, UK
Laura Dipietro, Massachusetts Institute of Technology, USA
Marc Ebner, University of Würzburg, Germany
Daniel Howard, QinetiQ, UK
Mario Koeppen, FhG IPK Berlin, Germany
Evelyne Lutton, INRIA, France
Gustavo Olague, CICESE, Mexico
Riccardo Poli, University of Essex, UK
Stephen Smith, University of York, UK
Giovanni Squillero, Politecnico di Torino, Italy
Kiyoshi Tanaka, Shinshu University, Japan
Ankur M. Teredesai, Rochester Institute of Technology, USA
Andy Tyrrell, University of York, UK
Leonardo Vanneschi, University of Milan Bicocca, Italy
Róbert Ványi, Siemens PSE, Hungary
Mengjie Zhang, Victoria University of Wellington, New Zealand

EvoInteraction Program Committee:

Thomas Baeck, Leiden University / NuTech Solutions, USA
Eric Bonabeau, Icosystem, USA
Praminda Caleb-Solly, University of the West of England, UK
Pierre Collet, Université du Littoral, Calais, France
Michael Herdy, Virtuelles Prototyping, Germany
Fang-Cheng Hsu, Aletheia University, R.O. China
Christian Jacob, University of Calgary, Canada
Daisuke Katagami, Tokyo Institute of Technology, Japan
Penousal Machado, University of Coimbra, Portugal
Yoichiro Maeda, University of Fukui, Japan
Hiroaki Nishino, Oita University, Japan
Ian C. Parmee, University of the West of England, UK
Yago Sáez, Universidad Carlos III de Madrid, Spain
Marc Schoenauer, INRIA, France
Daniel Thalmann, EPFL, Switzerland
Tatsuo Unemi, Soka University, Japan
Leuo-Hong Wang, Aletheia University, R.O. China

EvoMUSART Program Committee:

Alan Dorin, Monash University, Australia
Alice C. Eldridge, University of Sussex, UK
Amílcar Cardoso, University of Coimbra, Portugal
Alejandro Pazos, University of A Coruña, Spain
Anargyros Sarafopoulos, Bournemouth University, UK
Andrew Horner, University of Science & Technology, Hong Kong
Antonino Santos, University of A Coruña, Spain
Bill Manaris, College of Charleston, USA
Carlos Grilo, School of Technology and Management of Leiria, Portugal
Colin Johnson, University of Kent, UK
Eduardo R. Miranda, University of Plymouth, UK
Evelyne Lutton, INRIA, France
Francisco Câmara Pereira, University of Coimbra, Portugal
Gary Greenfield, University of Richmond, USA
Gerhard Widmer, Johannes Kepler University Linz, Austria
James McDermott, University of Limerick, Ireland
Janis Jefferies, Goldsmiths College, University of London, UK
Jeffrey Ventrella, Independent Artist, USA
John Collomosse, University of Bath, UK
Jon McCormack, Monash University, Australia
Jorge Tavares, University of Coimbra, Portugal
Ken Musgrave, Pandromeda, Inc., USA
Lee Spector, Hampshire College, USA
Luigi Pagliarini, Academy of Fine Arts of Rome, Italy
  & University of Southern Denmark, Denmark
Martin Hemberg, Imperial College London, UK
Matthew Lewis, Ohio State University, USA
Mauro Annunziato, Plancton Art Studio, Italy
Michael Young, University of London, UK
Niall J.L. Griffith, University of Limerick, Ireland
Paul Brown, Centre for Computational Neuroscience and Robotics, University of Sussex, UK
Paulo Urbano, Universidade de Lisboa, Portugal
Peter Bentley, University College London, UK
Peter Todd, Max Planck Institute for Human Development, Germany
Rafael Ramirez, Pompeu Fabra University, Spain
Rodney Waschka II, North Carolina State University, USA
Scott Draves, San Francisco, USA
Stefano Cagnoni, University of Parma, Italy
Stephen Todd, IBM, UK
Tatsuo Unemi, Soka University, Japan
Tim Blackwell, University of London, UK
William Latham, Art Games Ltd., UK

EvoSTOC Program Committee:


Dirk Arnold, Dalhousie University, Canada
Hans-Georg Beyer, Vorarlberg University of Applied Sciences, Austria
Tim Blackwell, Goldsmiths College, UK
Yaochu Jin, Honda Research Institute, Germany
Stephan Meisel, Technical University Braunschweig, Germany
Daniel Merkle, University of Leipzig, Germany
Martin Middendorf, University of Leipzig, Germany
Ron Morrison, Mitretek Systems, USA
Ferrante Neri, University of Technology of Bari, Italy
Yew Soon Ong, Nanyang Technological University, Singapore
William Rand, Northwestern University, USA
Christian Schmidt, University of Karlsruhe, Germany
Sima Uyar, Istanbul Technical University, Turkey
Karsten Weicker, Leipzig University of Applied Sciences, Germany
Shengxiang Yang, University of Leicester, UK

Sponsoring Institutions

EvoNet, the Network of Excellence on Evolutionary Computing


Artpool Art Research Center, Budapest, Hungary
Table of Contents

EvoBIO Contributions

Functional Classification of G-Protein Coupled Receptors, Based on Their
Specific Ligand Coupling Patterns
   Burcu Bakir, Osman Ugur Sezerman .... 1

Incorporating Biological Domain Knowledge into Cluster Validity Assessment
   Nadia Bolshakova, Francisco Azuaje, Pádraig Cunningham .... 13

A Novel Mathematical Model for the Optimization of DNA-Chip Design
and Its Implementation
   Kornélia Danyi, Gabriella Kókai, József Csontos .... 23

A Hybrid GA/SVM Approach for Gene Selection and Classification of
Microarray Data
   Edmundo Bonilla Huerta, Béatrice Duval, Jin-Kao Hao .... 34

Multi-stage Evolutionary Algorithms for Efficient Identification of Gene
Regulatory Networks
   Kee-Young Kim, Dong-Yeon Cho, Byoung-Tak Zhang .... 45

Human Papillomavirus Risk Type Classification from Protein Sequences
Using Support Vector Machines
   Sun Kim, Byoung-Tak Zhang .... 57

Hierarchical Clustering, Languages and Cancer
   Pritha Mahata, Wagner Costa, Carlos Cotta, Pablo Moscato .... 67

Robust SVM-Based Biomarker Selection with Noisy Mass Spectrometric
Proteomic Data
   Elena Marchiori, Connie R. Jimenez, Mikkel West-Nielsen,
   Niels H.H. Heegaard .... 79

On the Use of Variable Complementarity for Feature Selection in
Cancer Classification
   Patrick Emmanuel Meyer, Gianluca Bontempi .... 91

Comparison of Neural Network Optimization Approaches for Studies of
Human Genetics
   Alison A. Motsinger, Scott M. Dudek, Lance W. Hahn,
   Marylyn D. Ritchie .... 103

Obtaining Biclusters in Microarrays with Population-Based Heuristics
   Pablo Palacios, David Pelta, Armando Blanco .... 115

Multiple Sequence Alignment Based on Set Covers
   Alexandre H.L. Porto, Valmir C. Barbosa .... 127

A Methodology for Determining Amino-Acid Substitution Matrices from
Set Covers
   Alexandre H.L. Porto, Valmir C. Barbosa .... 138

Multi-Objective Evolutionary Algorithm for Discovering Peptide
Binding Motifs
   Menaka Rajapakse, Bertil Schmidt, Vladimir Brusic .... 149

Mining Structural Databases: An Evolutionary Multi-Objective
Conceptual Clustering Methodology
   Rocío Romero-Zaliz, Cristina Rubio-Escudero, Óscar Cordón,
   Oscar Harari, Coral del Val, Igor Zwir .... 159

Optimal Selection of Microarray Analysis Methods Using a Conceptual
Clustering Algorithm
   Cristina Rubio-Escudero, Rocío Romero-Zaliz, Óscar Cordón,
   Oscar Harari, Coral del Val, Igor Zwir .... 172

Microarray Probe Design Using ε-Multi-Objective Evolutionary
Algorithms with Thermodynamic Criteria
   Soo-Yong Shin, In-Hee Lee, Byoung-Tak Zhang .... 184

An Algorithm for the Automated Verification of DNA Supercontig
Assemblies
   Nikola Stojanovic .... 196

From HP Lattice Models to Real Proteins: Coordination Number
Prediction Using Learning Classifier Systems
   Michael Stout, Jaume Bacardit, Jonathan D. Hirst,
   Natalio Krasnogor, Jacek Blazewicz .... 208

Conditional Random Fields for Predicting and Analyzing Histone
Occupancy, Acetylation and Methylation Areas in DNA Sequences
   Dang Hung Tran, Tho Hoan Pham, Kenji Satou, Tu Bao Ho .... 221

DNA Fragment Assembly: An Ant Colony System Approach
   Wannasak Wetcharaporn, Nachol Chaiyaratana, Sissades Tongsima .... 231

EvoCOMNET Contributions

BeeHiveGuard: A Step Towards Secure Nature Inspired Routing
Algorithms
   Horst F. Wedde, Constantin Timm, Muddassar Farooq .... 243

Optimal Broadcasting in Metropolitan MANETs Using Multiobjective
Scatter Search
   Francisco Luna, Antonio J. Nebro, Bernabé Dorronsoro,
   Enrique Alba, Pascal Bouvry, Luc Hogie .... 255

Evolutionary Design of OAB and AAB Communication Schedules for
Interconnection Networks
   Miloš Ohlídal, Jiří Jaroš, Josef Schwarz, Václav Dvořák .... 267

A Multiagent Algorithm for Graph Partitioning
   Francesc Comellas, Emili Sapena .... 279

Tracing Denial of Service Origin: Ant Colony Approach
   Chia-Mei Chen, Bing Chiang Jeng, Chia Ru Yang, Gu Hsin Lai .... 286

EvoHOT Contributions

Optimisation of Constant Matrix Multiplication Operation Hardware
Using a Genetic Algorithm
   Andrew Kinane, Valentin Muresan, Noel O'Connor .... 296

Finding Compact BDDs Using Genetic Programming
   Ulrich Kühne, Nicole Drechsler .... 308

Efficient Evolutionary Approaches for the Data Ordering Problem with
Inversion
   Doina Logofatu, Rolf Drechsler .... 320

GRACE: Generative Robust Analog Circuit Exploration
   Michael A. Terry, Jonathan Marcus, Matthew Farrell,
   Varun Aggarwal, Una-May O'Reilly .... 332

On the Practical Limits of the Evolutionary Digital Filter Design at
the Gate Level
   Lukáš Sekanina, Zdeněk Vašíček .... 344

EvoIASP Contributions

Image Space Colonization Algorithm
   Leonardo Bocchi, Lucia Ballerini .... 356

Enhancement of an Automatic Fingerprint Identification System Using
a Genetic Algorithm and Genetic Programming
   Wannasak Wetcharaporn, Nachol Chaiyaratana,
   Sanpachai Huvanandana .... 368

Evolutionary Singularity Filter Bank Optimization for Fingerprint
Image Enhancement
   Ung-Keun Cho, Jin-Hyuk Hong, Sung-Bae Cho .... 380

Evolutionary Generation of Prototypes for a Learning Vector
Quantization Classifier
   Luigi Pietro Cordella, Claudio De Stefano, Francesco Fontanella,
   Angelo Marcelli .... 391

Automatic Classification of Handsegmented Image Parts with
Differential Evolution
   Ivanoe De Falco, Antonio Della Cioppa, Ernesto Tarantino .... 403

Mixed-Integer Evolution Strategies and Their Application to
Intravascular Ultrasound Image Analysis
   Rui Li, Michael T.M. Emmerich, Ernst G.P. Bovenkamp,
   Jeroen Eggermont, Thomas Bäck, Jouke Dijkstra,
   Johan H.C. Reiber .... 415

The Honeybee Search Algorithm for Three-Dimensional Reconstruction
   Gustavo Olague, César Puente .... 427

Improving the Segmentation Stage of a Pedestrian Tracking
Video-Based System by Means of Evolution Strategies
   Óscar Pérez, Miguel Ángel Patricio, Jesús García,
   José Manuel Molina .... 438

An Adaptive Stochastic Collision Detection Between Deformable
Objects Using Particle Swarm Optimization
   Tianzhu Wang, Wenhui Li, Yi Wang, Zihou Ge, Dongfeng Han .... 450

Genetic Programming for Automatic Stress Detection in Spoken English
   Huayang Xie, Mengjie Zhang, Peter Andreae .... 460

Localisation Fitness in GP for Object Detection
   Mengjie Zhang, Malcolm Lett .... 472

Immune Multiobjective Optimization Algorithm for Unsupervised
Feature Selection
   Xiangrong Zhang, Bin Lu, Shuiping Gou, Licheng Jiao .... 484

Classifying and Counting Vehicles in Traffic Control Applications
   Francesco Archetti, Enza Messina, Daniele Toscani,
   Leonardo Vanneschi .... 495

A Neural Evolutionary Classification Method for Brain-Wave Analysis
   Antonia Azzini, Andrea G.B. Tettamanzi .... 500

Differential Evolution Applied to a Multimodal Information Theoretic
Optimization Problem
   Patricia Besson, Jean-Marc Vesin, Vlad Popovici, Murat Kunt .... 505

Artificial Life Models in Lung CTs
   Sorin Cristian Cheran, Gianfranco Gargano .... 510

Learning High-Level Visual Concepts Using Attributed Primitives and
Genetic Programming
   Krzysztof Krawiec .... 515

Evolutionary Denoising Based on an Estimation of Hölder Exponents
with Oscillations
   Pierrick Legrand, Evelyne Lutton, Gustavo Olague .... 520

Probability Evolutionary Algorithm Based Human Body Tracking
   Shuhan Shen, Weirong Chen .... 525

EvoINTERACTION Contributions

On Interactive Evolution Strategies
   Ron Breukelaar, Michael T.M. Emmerich, Thomas Bäck .... 530

An Experimental Comparative Study for Interactive Evolutionary
Computation Problems
   Yago Sáez, Pedro Isasi, Javier Segovia, Asunción Mochón .... 542

Creating Chance by New Interactive Evolutionary Computation:
Bipartite Graph Based Interactive Genetic Algorithm
   Chao-Fu Hong, Hsiao-Fang Yang, Leuo-hong Wang, Mu-Hua Lin,
   Po-Wen Yang, Geng-Sian Lin .... 554

Interactive Evolutionary Computation Framework and the On-Chance
Operator for Product Design
   Leuo-hong Wang, Meng-yuan Sung, Chao-fu Hong .... 565

Practically Applying Interactive Genetic Algorithms to Customers'
Designs on a Customizable C2C Framework: Entrusting Select
Operations to IGA Users
   Fang-Cheng Hsu, Ming-Hsiang Hung .... 575

Evaluation of Sequential, Multi-objective, and Parallel Interactive
Genetic Algorithms for Multi-objective Floor Plan Optimisation
   Alexandra Melike Brintrup, Hideyuki Takagi, Jeremy Ramsden .... 586

EvoMUSART Contributions

Supervised Genetic Search for Parameter Selection in Painterly
Rendering
   John P. Collomosse .... 599

Robot Paintings Evolved Using Simulated Robots
   Gary Greenfield .... 611

Consensual Paintings
   Paulo Urbano .... 622

Using Physiological Signals to Evolve Art
   Tristan Basa, Christian Anthony Go, Kil-Sang Yoo,
   Won-Hyung Lee .... 633

Science of Networks and Music: A New Approach on Musical Analysis
and Creation
   Gianfranco Campolongo, Stefano Vena .... 642

Continuous-Time Recurrent Neural Networks for Generative and
Interactive Musical Performance
   Oliver Bown, Sebastian Lexer .... 652

Synthesising Timbres and Timbre-Changes from Adjectives/Adverbs
   Alex Gounaropoulos, Colin G. Johnson .... 664

Modelling Expressive Performance: A Regression Tree Approach Based
on Strongly Typed Genetic Programming
   Amaury Hazan, Rafael Ramirez, Esteban Maestre, Alfonso Perez,
   Antonio Pertusa .... 676

Evolutionary Musique Concrète
   Cristyn Magnus .... 688

A Connectionist Architecture for the Evolution of Rhythms
   João Magalhães Martins, Eduardo Reck Miranda .... 696

MovieGene: Evolutionary Video Production Based on Genetic
Algorithms and Cinematic Properties
   Nuno A.C. Henriques, Nuno Correia, Jônatas Manzolli,
   Luís Correia, Teresa Chambel .... 707

Audible Convergence for Optimal Base Melody Extension with
Statistical Genre-Specific Interval Distance Evaluation
   Ronald Hochreiter .... 712

A Two-Stage Autonomous Evolutionary Music Composer
   Yaser Khalifa, Robert Foster .... 717

Layered Genetical Algorithms Evolving into Musical Accompaniment
Generation
   Ribamar Santarosa, Artemis Moroni, Jônatas Manzolli .... 722

EvoSTOC Contributions

A Preliminary Study on Handling Uncertainty in Indicator-Based
Multiobjective Optimization
   Matthieu Basseur, Eckart Zitzler .... 727

Fluctuating Crosstalk as a Source of Deterministic Noise and Its
Effects on GA Scalability
   Kumara Sastry, Paul Winward, David E. Goldberg, Claudio Lima .... 740

Integrating Techniques from Statistical Ranking into Evolutionary
Algorithms
   Christian Schmidt, Jürgen Branke, Stephen E. Chick .... 752

The Role of Representations in Dynamic Knapsack Problems
   Jürgen Branke, Merve Orbay, Sima Uyar .... 764

The Effect of Building Block Construction on the Behavior of the GA
in Dynamic Environments: A Case Study Using the Shaky Ladder
Hyperplane-Defined Functions
   William Rand, Rick Riolo .... 776

Associative Memory Scheme for Genetic Algorithms in Dynamic
Environments
   Shengxiang Yang .... 788

Bayesian Optimization Algorithms for Dynamic Problems
   Miloš Kobliha, Josef Schwarz, Jiří Očenášek .... 800

Prudent-Daring vs Tolerant Survivor Selection Schemes in Control
Design of Electric Drives
   Ferrante Neri, Giuseppe L. Cascella, Nadia Salvatore,
   Anna V. Kononova, Giuseppe Acciani .... 805

Author Index .... 811


Functional Classification of G-Protein Coupled
Receptors, Based on Their Specific Ligand
Coupling Patterns

Burcu Bakir(1) and Osman Ugur Sezerman(2)

(1) School of Biology, Georgia Institute of Technology, Atlanta, USA
(2) Sabanci University, Istanbul, Turkey

Abstract. Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them remain orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating ligand specific patterns, has been developed and utilized for a Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs to their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross validation test. The presented machine learning approach bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization.

1 Introduction

G-Protein Coupled Receptors (GPCRs) are vital protein bundles with a key role in cellular signaling and the regulation of various basic physiological processes. With their versatile functions in a wide range of physiological cellular conditions, they constitute one of the vastest families of eukaryotic transmembrane proteins [29]. In addition to the biological importance of their functional roles, their interaction with more than 50% of prescription drugs has led GPCRs to be an excellent potential therapeutic target class for drug design and current pharmaceutical research. Over the last 20 years, several hundred new drugs have been registered which are directed towards modulating more than 20 different GPCRs, and approximately 40% of the top 200 synthetic drugs act on GPCRs [6]. Therefore, many pharmaceutical companies are involved in carrying out research aimed towards understanding the structure and function of these GPCR proteins. Even though thousands of GPCR sequences are known as a result of ongoing genomics projects [10], the crystal structure has been solved to date for only one GPCR sequence, using electron diffraction at medium resolution (2.8 Å) [15], and for many of the GPCRs the activating ligand is unknown; these are called orphan GPCRs [25]. Hence, based on sequence information, a functional classification method for those orphan GPCRs and new upcoming GPCR sequences is of great practical use in facilitating the identification and characterization of novel GPCRs.

F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 1-12, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Albeit laboratory experiments are the most reliable methods, they are not cost and labour effective. To automate the process, computational methods such as decision trees, discriminant analysis, neural networks and support vector machines (SVMs) have been extensively used in the fields of classification of biological data [21]. Among these methods, SVMs give the best prediction performance when applied to many real-life classification problems, including biological issues [30]. One of the most critical issues in classification is the minimization of the probability of error on test data using the trained classifier, which is also known as structural risk minimization. It has been demonstrated that SVMs are able to minimize the structural risk through finding a unique hyper-plane with maximum margin to separate data from two classes [27]. Therefore, compared with the other classification methods, SVM classifiers supply the best generalization ability on unseen data [30].
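The paper gives no implementation details; the following minimal Python sketch (our illustration, using scikit-learn, which the authors do not mention) shows max-margin SVM classification of fixed-length binary feature vectors under the kind of ten-fold cross validation used for the reported accuracy. All data below are random placeholders, not the GPCR dataset.

    # Minimal sketch (not the authors' code): SVM on binary feature vectors
    # with ten-fold cross validation.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(168, 200))    # placeholder: 168 sequences x 200 pattern flags
    y = rng.integers(0, 4, size=168)           # placeholder labels: 4 Level 2 subfamilies

    clf = SVC(kernel="linear", C=1.0)          # maximum-margin separating hyperplane
    scores = cross_val_score(clf, X, y, cv=10) # ten-fold cross validation
    print(f"mean accuracy: {scores.mean():.4f}")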
In the current literature, there exist different attempts to classify GPCRs at different levels of families, such as using primary database search tools, e.g., BLAST [1] and FASTA [20]. However, these methods require the query protein to be significantly similar to the database sequences in order to work properly. In addition to these database search tools, the same problem is addressed by using secondary database methods (profiles and patterns for classification); e.g., Attwood et al. have worked in particular on GPCRs in the PRINTS database [2] (whose data appeared in the INTERPRO database [17]). Hidden Markov Models [24], bagging classification trees [32] and SVMs [13], [31] are other methods that have been used to classify GPCRs at different levels of families. Karchin et al. conducted the most comprehensive controlled experiments for sequence-based prediction of GPCRs in [13] and showed that SVMs gave the highest accuracy in recognizing GPCR families. However, SVMs require an initial step to transform each protein sequence into a fixed-length vector, and the predictive accuracy of SVMs significantly depends on this particular fixed-length vector. In [13], it is also pointed out that SVM performance could be further increased by using feature vectors that encode only the most relevant features, since SVMs do not identify the features most responsible for class discrimination. Therefore, for an accurate SVM classification, feature vectors should reflect the unique biological information contained in sequences, which is specific to the type of classification problem.
In this paper, we address the Level 2 subfamily classification of Amine GPCRs problem by applying the Support Vector Machine (SVM) technique, using a novel fixed-length feature vector based on the existence of activating ligand specific patterns. We obtain discriminative feature vectors by utilizing biological knowledge of the Level 2 subfamilies' transmembrane topology and identifying specific patterns for each Level 2 subfamily. Since these specific patterns carry ligand binding information, the features obtained from them are more relevant than the amino acid and dipeptide composition of GPCR sequences, which in turn improves the accuracy of GPCR Level 2 subfamily classification. Applying our method to the Amine Level 1 subfamily of GPCRs [10], we have shown that the classification accuracy is increased compared to previous studies at the same level of classification.

2 Background

The G-Protein Coupled Receptor Database (GPCRDB) information system organizes GPCRs into a hierarchy of classes, Level 1 subfamilies (sub-families), Level 2 subfamilies (sub-sub-families), and types, based on the pharmacological classification of receptors [10]. A simplified view of the GPCR family tree is presented in Figure 1. Since the binding of GPCRs to their specified ligands is important for drug design purposes, GPCRDB defines the classifications chemically (according to which ligands the receptor binds, based on experimental data), rather than by sequence homology [13]. For class discrimination, generalization of the features shared by a diverse group of examples is required, whereas for subfamily discrimination, only examples that differ slightly should be grouped together. Therefore, for the GPCR subfamily classification problem, which is also related to GPCR function prediction, the ligand type that a GPCR binds is more crucial than it is for GPCR class discrimination.

Fig. 1. Portion of the GPCR family tree showing the five main classes of GPCRs and some subfamily members, based on the GPCRDB information system [10]

3 Materials and Methods

3.1 Benchmark Data

For GPCR function prediction from sequence information, subfamily recognition is more important than class recognition [13]. As mentioned before, subfamily recognition requires knowledge of the ligand coupling information of the receptor proteins. It is claimed that, according to the binding of GPCRs with different ligand types, GPCRs are classified into at least six different families [9]. Among the sub-families in GPCRDB, the Amine Level 1 subfamily of class A GPCRs is classified into seven sub-sub-families: (i) Muscarinic Acetylcholine, (ii) Adrenoceptors, (iii) Dopamine, (iv) Histamine, (v) Serotonin, (vi) Octopamine, and (vii) Trace amine Level 2 subfamilies, according to the March 2005 release (9.0) of GPCRDB (Horn et al., 1998). Therefore, the correlation between sub-family classification and the specific binding of GPCRs to their ligands can be computationally explored for Level 2 subfamily classification of the Amine Level 1 subfamily. Moreover, since Class A dominates, accounting for more than 80% of sequences as of the March 2005 release (9.0) of GPCRDB [10], it is the best studied class among the different GPCRs; thus we are able to compare our work with previous studies. We use the same dataset as that of Elrod and Chau for Amine Level 1 subfamily GPCR sequences in GPCRDB, belonging to one of the Acetylcholine, Adrenoceptor, Dopamine, and Serotonin sub-sub-families, which have enough entries to form a statistically significant training set, as shown in Table 1. The GPCR sequences in this dataset were extracted through the website http://www.expasy.org (SWISS-PROT database, Release 46.4, 2005), and fixed-length feature vectors were created for each sequence as explained in the next section.

Table 1. Summary of 168 Class A, Amine GPCRs, classified into four Level 2 Subfamilies, as shown in [9]

Level 2 Subfamily    Number of Sequences
Acetylcholine                 31
Adrenoceptor                  44
Dopamine                      39
Serotonin                     54
TOTAL                        168

3.2 Fixed-Length Feature Vector Creation

Since protein sequences are of variable length, for classification these sequences must be converted into fixed-length feature vectors. In order to obtain fixed-length feature vectors that also carry ligand specificity information, we followed a three-step procedure, as outlined in Figure 2. The first step, i.e., the Topology Prediction step, aims to extract the extracellular loop regions of the GPCR sequences, since ligands couple to the outer loops of the GPCRs. So as to force the fixed-length feature vectors to encode only biologically relevant features, activating ligand specificity information is taken into account. For this purpose, conserved patterns that are specific to each sub-sub-family of GPCR sequences are found in the extracellular GPCR sequences in step two, i.e., the Pattern Discovery step. In the third step, i.e., the Pattern Matching step, the existence of those activating ligand specific (specific for a sub-sub-family) patterns is checked, so that the knowledge of the coupling specificity of GPCRs to their ligands is integrated into our novel fixed-length feature vectors. Details of the three steps (Topology Prediction, Pattern Discovery, and Pattern Matching) for fixed-length feature vector creation are described below.

Fig. 2. Flow chart for fixed-length feature vector creation
Topology Prediction. Since the transmembrane (TM) topology pattern has been shown to be well conserved among GPCRs that have the same function [19], the TM topology is checked for the 168 GPCR sequences in Elrod and Chou's dataset. For topology prediction, the Hidden Markov Model for Topology Prediction (HMMTOP) server, which accurately predicts the topology of helical TM proteins, is used [26]. In order to segment amino acid sequences into membrane, inside, and outside parts, the HMMTOP method utilizes HMMs in such a way that the product of the relative frequencies of the amino acids of these segments along the amino acid sequence is maximized. This shows that the maximum of the likelihood function on the space of all possible topologies of a given amino acid sequence correlates with the experimentally established topology [25].
Following topology prediction, extracellular loop sequences are extracted for each of the 168 GPCR sequences, based on the fact that ligands couple to the extracellular loops of GPCRs and that we are interested in the relation between the ligand specificity of GPCRs and GPCR sub-sub-family classification.
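As a hedged illustration of this extraction step (a sketch under stated assumptions, not the actual pipeline), the snippet below pulls the outside-loop subsequences out of a per-residue topology annotation; the 'I'/'M'/'O' encoding and the helper name are our own assumptions rather than the literal HMMTOP output format.

    from itertools import groupby

    def extracellular_loops(sequence, topology):
        """Collect the subsequences labelled 'O' (outside) in a per-residue
        topology string ('I' = inside, 'M' = membrane, 'O' = outside)."""
        assert len(sequence) == len(topology)
        loops, pos = [], 0
        for label, run in groupby(topology):
            length = sum(1 for _ in run)
            if label == "O":
                loops.append(sequence[pos:pos + length])
            pos += length
        return loops

    # Toy 7-residue example with a single outer loop:
    print(extracellular_loops("MKTAYIA", "IIMMOOO"))  # ['YIA']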
Pattern Discovery. In the second step of the fixed-length feature vector creation, for each sub-sub-family of GPCR sequences, flexible patterns that are conserved in the extracellular loops of that particular sub-sub-family's GPCR sequences are found using Pratt 2.1, a flexible pattern discovery program [4]. Due to the flexibility of the Pratt patterns, they include ambiguous components and fixed and flexible wildcards, in addition to their identity components [12]. Hence, Pratt patterns are described using PROSITE notation [4].

Pratt finds patterns matching a subset of the input sequences. This subset is defined by the Min Percentage Seqs to Match (MPSM) parameter, which defines the minimum percentage of the input sequences that should match a pattern. This threshold is set to 50% and 75% in this study in order not to allow very specific patterns that are not general to all GPCR sequences in any sub-sub-family. This can also be regarded as a precaution against the overfitting problem. For each class of GPCRs, 50 conserved patterns are identified with the two different MPSM parameters (50 and 75).
Pattern Matching. The final step in creating fixed-length feature vectors is to check for the existence of every activating ligand specific pattern in each outer GPCR sequence. In order to check the existence of the flexible Pratt patterns, all patterns in PROSITE notation are converted into regular expression form and then searched within the 168 extracellular GPCR sequences. Consequently, by taking activating ligand specific pattern existence information into account, each GPCR sequence is represented by a vector in a 200-dimensional space (50 patterns multiplied by 4 output classes).

Gk = (Gk,1 , Gk,2 , . . . , Gk,200 ) (1)

where Gk,1, Gk,2, ..., Gk,200 are the 200 components of activating ligand specific pattern inclusion for the k-th extracellular GPCR sequence Gk. Note that if the k-th extracellular GPCR sequence has pattern j, then Gk,j = 1, and if it does not have pattern j, then Gk,j = 0, where j = 1, 2, ..., 200.
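A minimal sketch of this encoding is given below; it assumes the Pratt patterns have already been rewritten as Python regular expressions, and the three toy patterns are purely illustrative (the study itself uses 50 patterns per sub-sub-family).

    import re

    def pattern_vector(loop_sequence, regex_patterns):
        """Binary inclusion vector: component j is 1 iff pattern j matches."""
        return [1 if re.search(p, loop_sequence) else 0 for p in regex_patterns]

    patterns = [r"C.{2}C", r"N[^P][ST]", r"W{2}"]  # hypothetical patterns
    print(pattern_vector("ACXXCNQS", patterns))    # [1, 1, 0]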
Writing each fixed-length feature vector Gk in a new row, we obtain a Gk,j matrix, where k = 1, 2, ..., 168 and j = 1, 2, ..., 200. After insertion of the sub-sub-family labels for each of the GPCR sequences into the zeroth dimension of each Gk vector (Gk,0), the matrix corresponds to a training set, so that j = 0, 1, 2, ..., 200, where Gk,0 is 1, 2, 3 or 4, since four sub-sub-families are defined for this classification problem. Note that this 4-class output labelling (1, 2, 3, 4) does not imply any relationship between the classes.
We have also created a second fixed-length feature vector by using the best 10 patterns among the 50 from each sub-sub-family, based on the significance scores assigned by the Pratt program. Using a similar representation, Gk is denoted in a 40-dimensional space (10 patterns multiplied by 4 output classes), where j = 1, 2, ..., 40. A Gk,j matrix is formed (similarly to the above), where k = 1, 2, ..., 168 and j = 1, 2, ..., 40, corresponding to another training set.
As a result, four training sets (two training sets with the 50 MPSM parameter, for j up to 40 or 200; another two with the 75 MPSM parameter, for j up to 40 or 200) are created to produce a classifier using Support Vector Machines, as described in detail below.

3.3 Classification Using Support Vector Machine (SVM)


The efficiency of SVMs for classification problems has made them applicable to many real-world applications, including biological problems such as protein classification and gene expression data analysis. An SVM-based method for the classification of subfamilies of GPCRs was first developed by Karchin et al. [13]. When applied to the problem of discriminating both Level 1 and Level 2 subfamilies of GPCRs, SVMs were shown to make significantly fewer errors, both false positives and false negatives, than WU-BLAST and SAM-T2K profile HMMs [13]. For these reasons, we selected SVMs for the GPCR Level 2 subfamily classification problem.

4 Experiments
Since we are interested in the classification of the Amine Level 1 sub-family into four Level 2 subfamilies, we face a multi-class classification problem. We use the LIBSVM software [7], which handles multi-class classification by implementing the one-against-one approach. As suggested in [11], to obtain satisfactory results some preprocessing is performed before building a classifier with LIBSVM. The preprocessing performed in this study can be summarized under two headings: (i) choice of kernel function, and (ii) grid search combined with cross-validation for parameter tuning.

4.1 Choice of Kernel Function


Among the linear, polynomial, radial basis function (RBF) and sigmoid kernel functions, the RBF kernel is a reasonable first choice, as stated in [11], [14], [16]. Therefore, grid search and parameter tuning are done on RBF kernels. However, the results obtained using all four kernels are compared with the parameter-tuned RBF kernel at the end.

4.2 Cross-Validation and Grid Search


In order to get better accuracy using the RBF kernel for SVM classification, the penalty parameter of the error term, C, and the parameter γ, which is specific to the RBF kernel, should be tuned. The grid search procedure identifies the best (C, γ) pair, so that using these parameters the classifier (model) can accurately predict unseen test data [11]. Since the accuracy on test data also depends on the examples in the test data, cross-validation is a better choice to tune the (C, γ) parameters and select the best model, one that neither overfits nor underrepresents the training data. Compared to other, more advanced methods for parameter tuning, grid search is straightforward and easy to implement, and its computational time is not much greater than that of the advanced methods. Additionally, since each (C, γ) pair is independent, the search can easily be parallelized. During grid search, it is recommended to try exponentially growing sequences of (C, γ) to identify good parameters [11].
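The sketch below reproduces this kind of coarse (C, γ) search; it relies on scikit-learn's SVC (which wraps LIBSVM and uses the one-against-one scheme internally for multi-class problems) rather than the LIBSVM command-line tools, and the binary feature matrix is a random placeholder, not the real training set.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # Placeholder stand-in for one training set: 168 binary vectors with
    # 200 attributes, labels in {1, 2, 3, 4}.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(168, 200))
    y = rng.integers(1, 5, size=168)

    # Exponentially growing (C, gamma) grid, as recommended in [11].
    param_grid = {
        "C":     [2.0 ** k for k in range(-5, 11)],
        "gamma": [2.0 ** k for k in range(-5, 11)],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)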
To solve our multi-class SVM classification problem, a grid search is performed for each of our four training sets over C = 2^-5, 2^-4, ..., 2^10 and γ = 2^-5, 2^-4, ..., 2^10. Figure 3 shows the grid search with 10-fold cross-validation for the training data with 200 attributes and the 50 MPSM parameter. As seen in Figure 3, the highest cross-validation accuracy is reached when γ = 2^-4 and C lies in the range (2^0, 2^10). After the two preprocessing steps mentioned above, we build our classifiers using the RBF kernel with the best (C, γ) parameter pair, which is specific to each training set. Since we do not have a separate test set, and the number of examples in the training set is not large enough to split into two, 10-fold cross-validation is done for each of the four training sets. Combining our biologically relevant fixed-length feature vector definition with a robust kernel (RBF) and parameter tuning via grid search shows promising results, which are analyzed in more detail in the next section.

Fig. 3. Coarse grid search on C and γ with 10-fold cross-validation for the training data with 200 attributes and the 50 MPSM parameter. The highest cross-validation accuracy is obtained when γ = 2^-4 and C lies in the range (2^0, 2^10).

5 Results
As mentioned before, in addition to SVM classification with the parameter-tuned RBF kernel, the other three standard kernel functions are tested as well (with their default parameters) on our four training sets using 10-fold cross-validation. The results for each experiment are summarized in Table 2.
Classification with the RBF kernel with parameter tuning clearly outperforms the other kernel functions in all cases. Since the linear kernel is a specialized form of the RBF kernel, the results obtained with these two kernels without parameter tuning are quite close. Although the classification accuracies with 200 and 40 attributes are very close, the accuracy with 200 attributes is consistently better than with 40 attributes. The probable reason behind this observation is that 40 attributes are not enough to represent the examples (more attributes are needed to discriminate between data points), or that the chosen 40 attributes do not correctly reflect the data points. In contrast to the strict domination of 200 attributes over 40 attributes, there is no such relationship between the training data with the 50 MPSM parameter and the 75 MPSM parameter: sometimes one performs better and sometimes the other (e.g. the results of the RBF kernel and the RBF* kernel in Table 2). Lower accuracy for the training data with the 75 MPSM parameter is caused by overfitting, which decreases accuracy in the end, whereas with the 50 MPSM parameter, patterns that are conserved in at least 50% of the data cannot represent the overall data.
In this paper, for the classification of the Level 2 Subfamily of Amine GPCRs, 97.02% prediction accuracy is achieved in ten-fold cross-validation. Compared to the existing GPCR functional classification methods for the Level 2 Subfamily of Amine GPCRs, our result is superior to the SVM method with a simpler fixed-length feature vector definition and no parameter tuning, where the prediction accuracy is 94.01% [31], and to the covariant discriminant algorithm, where the prediction accuracy is 83.23% [9]. In another study, Ying et al. perform classification at both the sub-family and sub-sub-family levels of GPCRs using a bagging classification tree [32]. For the sub-sub-family level, using the same dataset [9], our prediction accuracy in ten-fold cross-validation (97.02%) is higher than their prediction accuracy obtained in ten-fold cross-validation (82.4%). A more extensive comparison with previous studies is presented in the following section.

Table 2. Results for four different training sets, as explained in the text, using four different kernel functions and the RBF kernel with parameter tuning (RBF*), with 10-fold cross-validation

# of Attributes  MPSM Parameter  Linear Kernel  Polynomial Kernel  Sigmoid Kernel  RBF Kernel  RBF* Kernel
200  75  94.0476  48.8095  91.6667  91.6667  95.2381
200  50  96.4286  32.1429  86.3095  90.4762  97.0238
 40  75  89.2857  47.0238  84.5238  86.3095  92.8571
 40  50  92.8571  32.1479  84.5238  85.7143  93.4524

6 Discussion
The differences between this study and previous studies can be emphasized in two main points:
i) Fixed-length feature vector creation: We developed a novel method for obtaining the fixed-length feature vectors of the SVM. The naive idea of using direct protein sequence information as the feature vector cannot be applied in SVM classification, since the sequence length is not fixed. Many studies [9], [32], [31] approached this problem by defining a fixed-length feature vector based on the protein's amino acid composition. Following the representation in [8], each protein is represented by a vector, Xk, in a 20-dimensional space, where each dimension corresponds to how many times the particular amino acid represented by that dimension occurs in that particular protein.

Xk = (Xk,1 , Xk,2 , . . . , Xk,20 ) (2)

where Xk,1, Xk,2, ..., Xk,20 are the 20 components of the amino acid composition for the k-th protein Xk. In addition to the amino acid composition, in some studies the fixed-length vector is obtained from the dipeptide composition [3], which takes the local order of amino acids into account, in addition to the information about the fraction of amino acids. The dipeptide composition of each protein is given by the fractions of all possible dipeptides, where the fraction of dipeptide i is the number of occurrences of dipeptide i in the protein divided by the total number of all possible dipeptides, namely 400. Alternatively, each protein sequence can also be transformed into a fixed-length feature vector in the form of a Fisher score vector [13].
Since this study emphasizes the effect of activating ligand specificity on the Level 2 Subfamily classification of Amine GPCRs, a new feature vector is built based on this observation. In this regard, we have used the existence information of activating ligand specific patterns as fixed-length feature vectors, in order to obtain a biologically meaningful and distinctive measure. Therefore, the superiority of our feature vector stems from the biological importance of ligand coupling specificity for the Level 2 Subfamily classification of Amine GPCRs. By combining those feature vectors with a robust kernel function and a parameter tuning strategy, we arrive at an accurate classification method.
ii) Classification level: Apart from the definition of the feature vector for the SVM, the exact classification level that we concentrate on has been attempted in only a few previous studies. Ying and Yanda and Ying et al. attempted the classification problem on the same Level 2 Subfamily of Amine GPCRs using SVMs with amino acid composition as the feature vector [31] and a bagging classification tree [32], respectively. Our difference from their work lies in our novel feature vector definition, as mentioned above, which in turn significantly affects the prediction accuracy (from 82.4% to 97.02% and from 94.01% to 97.02%, respectively). Apart from these particular papers, most previous studies concentrate on the Superfamily level or Level 1 Subfamily. Although Karchin et al. performed experiments using hierarchical multi-class SVMs on Level 2 Subfamilies [13], they combined Class A Level 2 Subfamilies with Class C Level 2 Subfamilies.
The performance results of this study are promising and outperform other competitive techniques that classify GPCRs at the same level, with a very high cross-validation accuracy of 97.02%. This result is mainly due to the definition of our feature vectors, since the compared studies do not take into account such conserved pattern information for the proper functioning of the GPCR. As the importance of specific ligand binding to GPCRs and the hidden information behind this binding has been pointed out previously [13], we exploited ligand specific coupling patterns for the creation of fixed-length feature vectors, which answers the need for biologically relevant features. Using these vectors for SVM classification and performing a grid search for model selection, the accuracy is further improved even with very few sequences. With such accurate and automated GPCR functional classification methods, we hope to accelerate the pace of identifying proper GPCRs to facilitate drug discovery, especially for schizophrenic and psychiatric diseases. Therefore, one of our future goals is to automate the presented procedure and develop an integrated environment to perform GPCR classification conveniently, with many flexible options for biological users who are not experts on the topic.

References
1. Altschul, S. et al.: Basic local alignment search tool. J. Mol. Biol. 215 (1990) 403-410
2. Attwood, T.K. et al.: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research 31 (2003) 400-402
3. Bhasin, M. and Raghava, G.: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Research 32 (2004) 383-389
4. Brazma, A. et al.: Discovering patterns and subfamilies in biosequences. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (ISMB-96), AAAI Press (1996) 34-43. Pratt 2.1 software is available at www.ebi.ac.uk/pratt
5. Byvatov, E. and Schneider, G.: Support vector machine applications in bioinformatics. Appl. Bioinformatics 2 (2003) 67-77
6. Chalmers, D.T. and Behan, D.P.: The Use of Constitutively Active GPCRs in Drug Discovery and Functional Genomics. Nature Reviews Drug Discovery 1 (2002) 599-608
7. Chang, C.C. and Lin, C.J.: LIBSVM: a library for support vector machines. (2001) LIBSVM software is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
8. Chou, K.C.: A Novel Approach to Predicting Protein Structural Classes in a (20-1)-D Amino Acid Composition Space. PROTEINS: Structure, Function, and Genetics 21 (1995) 319-344
9. Elrod, D.W. and Chou, K.C.: A study on the correlation of G-protein-coupled receptor types with amino acid composition. Protein Eng. 15 (2002) 713-715
10. Horn, F. et al.: GPCRDB: an information system for G protein coupled receptors. Nucleic Acids Res. 26 (1998) 275-279. Available at: www.gpcr.org/7tm
11. Hsu, C.W. et al.: A Practical Guide to Support Vector Classification. Image, Speech and Intelligent Systems (ISIS) Seminars. (2004)
12. Jonassen, I. et al.: Finding flexible patterns in unaligned protein sequences. Protein Sci. 4 (1995) 1587-1595
13. Karchin, R. et al.: Classifying G-protein coupled receptors with support vector machines. Bioinformatics 18 (2002) 147-159
14. Keerthi, S.S. and Lin, C.J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (2003) 1667-1689
15. Krzysztof, P. et al.: Crystal Structure of Rhodopsin: A G-Protein-Coupled Receptor. Science 289 (2000) 739-745
16. Lin, H.T. and Lin, C.J.: A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University. (2003) Available at http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf
17. Mulder, N.J. et al.: The InterPro Database - 2003 brings increased coverage and new features. Nucleic Acids Research 31 (2003) 315-318
18. Neuwald, A. and Green, P.: Detecting Patterns in Protein Sequences. J. Mol. Biol. 239 (1994) 698-712
19. Otaki, J.M. and Firestein, S.: Length analyses of mammalian G-protein-coupled receptors. J. Theor. Biol. 211 (2001) 77-100
20. Pearson, W. and Lipman, D.: Improved tools for biological sequence analysis. Proceedings of the National Academy of Sciences 85 (1988) 2444-2448
21. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA (1988)
22. Sadka, T. and Linial, M.: Families of membranous proteins can be characterized by the amino acid composition of their transmembrane domains. Bioinformatics 21 (2005) 378-386
23. Schoneberg, T. et al.: The structural basis of G-protein-coupled receptor function and dysfunction in human diseases. Rev Physiol Biochem Pharmacol. 144 (2002) 143-227
24. Sreekumar, K.R. et al.: Predicting GPCR-G-Protein coupling using hidden Markov models. Bioinformatics 20 (2004) 3490-3499. The Swiss-Prot database (Release 46.4, 2005) is available at http://www.expasy.org
25. Tusnády, G.E. and Simon, I.: Principles Governing Amino Acid Composition of Integral Membrane Proteins: Applications to topology prediction. J. Mol. Biol. 283 (1998) 489-506
26. Tusnády, G.E. and Simon, I.: The HMMTOP transmembrane topology prediction server. Bioinformatics 17 (2001) 849-850. Available at: http://www.enzim.hu/hmmtop
27. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
28. Vert, J.P.: Introduction to Support Vector Machines and applications to computational biology. (2001)
29. Vilo, J. et al.: Prediction of the Coupling Specificity of G Protein Coupled Receptors to their G Proteins. Bioinformatics 17 (2001) 174-181
30. Yang, Z.R.: Biological applications of support vector machines. Brief. Bioinform. 5 (2004) 328-338
31. Ying, H. and Yanda, L.: Classifying G-protein Coupled Receptors with Support Vector Machine. Advances in Neural Networks (ISNN 2004), Springer LNCS 3174 (2004) 448-452
32. Ying, H. et al.: Classifying G-protein coupled receptors with bagging classification tree. Computational Biology and Chemistry 28 (2004) 275-280
Incorporating Biological Domain Knowledge
into Cluster Validity Assessment

Nadia Bolshakova 1, Francisco Azuaje 2, and Padraig Cunningham 1

1 Department of Computer Science, Trinity College Dublin, Ireland
{nadia.bolshakova, padraig.cunningham}@cs.tcd.ie
2 School of Computing and Mathematics, University of Ulster, Jordanstown, BT37 0QB, UK
[email protected]

Abstract. This paper presents an approach for assessing cluster validity based on similarity knowledge extracted from the Gene Ontology (GO) and databases annotated to the GO. A knowledge-driven cluster validity assessment system for microarray data was implemented. Different methods were applied to measure similarity between yeast gene products based on the GO. This research proposes two methods for calculating cluster validity indices using GO-driven similarity. The first approach processes overall similarity values, which are calculated by taking into account the combined annotations originating from the three GO hierarchies. The second approach is based on the calculation of GO hierarchy-independent similarity values, which originate from each of these hierarchies. A traditional node-counting method and an information content technique have been implemented to measure knowledge-based similarity between gene products (biological distances). The results contribute to the evaluation of clustering outcomes and the identification of optimal cluster partitions, which may represent an effective tool to support biomedical knowledge discovery in gene expression data analysis.

1 Introduction
Over the past few years DNA microarrays have become a key tool in functional genomics. They allow monitoring the expression of thousands of genes in parallel over many experimental conditions (e.g. tissue types, growth environments). This technology enables researchers to collect significant amounts of data, which need to be analysed to discover functional relationships between genes or samples. The results from a single experiment are generally presented in the form of a data matrix in which rows represent genes and columns represent conditions. Each entry in the data matrix is a measure of the expression level of a particular gene under a specific condition.

A central step in the analysis of DNA microarray data is the identification of groups of genes and/or conditions that exhibit similar expression patterns. Clustering is a fundamental approach to classifying expression patterns for biological and biomedical applications. The main assumption is that genes that are contained in a particular functional pathway should be co-regulated and therefore should exhibit similar patterns of expression [1]. A great variety of clustering algorithms have been developed for gene expression data. The next data analysis step is to integrate these numerical analyses of co-expressed genes with biological function information. Many approaches and tools have been proposed to address this problem at different processing levels. Some methods, for example, score whole clustering outcomes or specific clusters according to their biological relevance; other techniques aim to estimate the significance of over-represented functional annotations, such as those encoded in the Gene Ontology (GO), in clusters [2], [3], [4], [5]. Some approaches directly incorporate biological knowledge (e.g. functional, curated annotations) into the clustering process to aid in the detection of relevant clusters of co-expressed genes involved in common processes [6], [7]. Several tools have been developed for the ontological analysis of gene expression data (see the review by Khatri and Draghici [8], for instance) and more tools are likely to be proposed in the future.
The prediction of the correct number of clusters in a data set is a fundamental problem in unsupervised learning. Various cluster validity indices have been proposed to measure the quality of clustering results [9], [10]. Recent studies confirm that there is no universal pattern recognition and clustering model to predict molecular profiles across different datasets. Thus, it is useful not to rely on a single clustering or validation method, but to apply a variety of approaches. Therefore, a combination of GO-based (knowledge-driven) validation and microarray data (data-driven) validation methods may be used for the estimation of the number of clusters. This estimation approach may represent a useful tool to support biological and biomedical knowledge discovery.
We implemented a knowledge-driven cluster validity assessment system for microarray data clustering. Unlike traditional methods that only use (gene expression) data-derived indices, our method consists of validity indices that incorporate similarity knowledge originating from the GO and a GO-driven annotation database. We used annotations from the Saccharomyces Genome Database (SGD) (October 2005 release of the GO database). A traditional node-counting method proposed by Wu and Palmer [11] and an information content technique proposed by Resnik [12] were implemented to measure similarity between gene products. These similarity measurements have not previously been implemented for clustering evaluation by other researchers.

The main objective of this research is to assess the application of knowledge-driven cluster validity methods to estimate the number of clusters in a known data set derived from Saccharomyces cerevisiae.

2 The GO and Cluster Validity Assessment

The automated integration of background knowledge is fundamental to support the generation and validation of hypotheses about the function of gene products. The GO and GO-based annotation databases represent recent examples of such knowledge resources. The GO is a structured, shared vocabulary that allows the annotation of gene products across different model organisms. The GO comprises three independent hierarchies: molecular function (MF), biological process (BP) and cellular component (CC). Researchers can represent relationships between gene products and annotation terms encoded in these hierarchies. Previous research has applied GO information to detect over-represented functional annotations in clusters of genes obtained from expression analyses [13]. It has also been suggested to assess gene sequence similarity and expression correlation [14]. For a deeper review of the GO and its applications, the reader is referred to its website (http://www.geneontology.org) and Wang et al. [14].
Topological and statistical information extracted from the GO and databases annotated to the GO may be used to measure similarity between gene products. Different GO-driven similarity assessment methods may then be implemented to perform clustering or to quantify the quality of the resulting clusters. Cluster validity assessment may consist of data- and knowledge-driven methods, which aim to estimate the optimal cluster partition from a collection of candidate partitions [15]. Data-driven methods mainly include statistical tests or validity indices applied to the clustered data. A data-driven cluster validity assessment platform was previously reported by Bolshakova and Azuaje [9], [10]. We have previously proposed knowledge-driven methods to enhance the predictive reliability and potential biological relevance of the results [15].

Traditional GO-based cluster description methods have consisted of statistical analyses of the enrichment of GO terms in a cluster. Currently, there is a relatively large number of tools implementing such an approach [8]. At the same time, this approach is severely limited in certain regards (for a detailed review of ontological analysis see Khatri and Draghici [8]); one example is the overestimation of probability values describing the over-representation of terms. This may be due to the lack of more complete knowledge or the incorporation of biased datasets to make statistical adjustments and detect spurious associations. However, the application of GO-based similarity to perform clustering and validate clustering outcomes has not been widely investigated. A recent contribution by Speer et al. [16], [17] presented an algorithm that incorporates GO annotations to cluster genes. They applied the data-driven Davies-Bouldin and Silhouette indices to estimate the quality of the clusters.

This research applies two approaches to calculating cluster validity indices. The first approach processes overall similarity values, which are calculated by taking into account the combined annotations originating from the three GO hierarchies. The second approach is based on the calculation of independent similarity values, which originate from each of these hierarchies. The second approach allows one to estimate the effect of each of the GO hierarchies on the validation process.

3 GO-Based Similarity Measurement Techniques

For a given pair of gene products, g1 and g2, sets of GO terms T1 = {ti} and T2 = {tj} are used to annotate these genes. Before estimating between-gene similarity, it is first necessary to understand how to measure between-term similarity. We implemented GO-based between-term similarity using a traditional approach proposed by Wu and Palmer [11] and an information content technique proposed by Resnik [12].

3.1 Wu and Palmer-Based Method


Similarity was defined by Wu and Palmer [11] as follows:

sim(ti, tj) = 2N / (Ni + Nj + 2N)    (1)

where Ni and Nj are the numbers of links (edges) from ti and tj to their closest common parent in the GO hierarchy, Tij, and N is the number of links from Tij to the GO hierarchy root.

This similarity assessment metric may be transformed into a distance metric, d:

d(ti, tj) = 1 - sim(ti, tj)    (2)

Then the average inter-set distance across each pair of ti and tj is computed [13]:

d(gk, gm) = avg_{i,j}(d(tki, tmj))    (3)

This between-term distance aggregation may then be used as an estimate of the GO-based distance between two gene products gk and gm, which is defined as:

d(gk, gm) = avg_{i,j}(1 - 2N / (Nki + Nmj + 2N))    (4)
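The following sketch turns eqs. (1)-(4) into code over a toy is-a hierarchy given as a child-to-parent map; a tree is used for simplicity (the real GO is a DAG, which requires full ancestor bookkeeping), and all names and data are illustrative assumptions.

    TOY_PARENT = {"t1": "p", "t2": "p", "p": "root"}  # hypothetical hierarchy

    def links_to_root(term, parent):
        n = 0
        while term in parent:
            term, n = parent[term], n + 1
        return n

    def ancestors(term, parent):
        out = []
        while term in parent:
            term = parent[term]
            out.append(term)
        return out                    # ordered nearest-first

    def wu_palmer(ti, tj, parent):
        anc_j = set(ancestors(tj, parent))
        lca = next(a for a in ancestors(ti, parent) if a in anc_j)
        n = links_to_root(lca, parent)           # N: links from LCA to root
        ni = links_to_root(ti, parent) - n       # Ni: links from ti to LCA
        nj = links_to_root(tj, parent) - n       # Nj: links from tj to LCA
        return 2 * n / (ni + nj + 2 * n)         # eq. (1)

    def gene_distance(terms_k, terms_m, parent):
        """Eq. (4): average of 1 - sim over all term pairs of two genes."""
        pairs = [(ti, tj) for ti in terms_k for tj in terms_m]
        return sum(1 - wu_palmer(ti, tj, parent) for ti, tj in pairs) / len(pairs)

    print(gene_distance(["t1"], ["t2"], TOY_PARENT))  # 0.5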

3.2 Resnik-Based Similarity Measurement


This similarity was defined by Resnik [12] as follows:

sim(ti, tj) = max(-log(p(Tij)))    (5)

Tij is defined as above and has the highest information value V, defined as -log(p(Tij)), where p(Tij) is the probability of finding term Tij (or its descendants) in the dataset of genes under study, i.e. the SGD in this study.

Such a similarity assessment metric may be transformed into a distance metric:

d(ti, tj) = 1 / (1 + sim(ti, tj))    (6)

Based on the average value across each pair of ti and tj, as computed by Azuaje and Bodenreider [13], the GO-based distance between two gene products gk and gm is defined as:

d(gk, gm) = avg_{i,j}(1 / (1 + max(-log(p(Tkmij)))))    (7)
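A companion sketch for eqs. (5)-(7), with hypothetical term probabilities standing in for the annotation frequencies counted over the SGD:

    import math

    P_TERM = {"p": 0.10, "root": 1.0}  # assumed p(T) values, not SGD counts

    def resnik_distance(shared_ancestors):
        """Eqs. (5)-(6): 1 / (1 + max information content -log p(T))."""
        ic = max(-math.log(P_TERM[a]) for a in shared_ancestors)
        return 1.0 / (1.0 + ic)

    print(round(resnik_distance(["p", "root"]), 3))  # root has IC 0; 0.303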

In this research we first study an approach based on the aggregation of similarity information originating from all three GO hierarchies. We also proposed and implemented three hierarchy-specific similarity assessment techniques, each based on information individually extracted from each GO hierarchy (BP, MF and CC).

4 Clustering and Cluster Validation Methods


4.1 Clustering
The data analysed in this paper comprise yeast genes described by their expression values during the cell cycle [18]. Previous research has shown that disjoint clusters of genes are significantly associated with each of the five cell cycle stages: early G1, late G1, S, G2 and M. Several cluster partitions (with numbers of clusters from two to six), obtained with the k-means algorithm, were analysed to estimate the optimum number of clusters for this dataset. Clustering was performed with the Machaon CVE tool [10].
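For orientation only (the partitions in this study were produced with the Machaon CVE tool, not with this code), an equivalent set of candidate partitions could be generated as sketched below, where the expression matrix is a random placeholder.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(1).normal(size=(384, 17))  # stand-in data

    partitions = {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        for k in range(2, 7)                              # 2..6 clusters
    }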

4.2 Cluster Validation Methods


Cluster validation was performed using two validity indices: the C-index [19] and the Goodman-Kruskal index [20], whose data-driven versions have been shown to be effective cluster validity estimators for different types of clustering applications. Nevertheless, each of the implemented validation methods has its advantages and limitations. For example, the Goodman-Kruskal index is expected to be robust against outliers because quadruples of patterns are used in its computation. However, its drawback is its high computational complexity compared with, for example, the C-index.

C-index. The C-index [19], C, is defined as follows:

C = (S - Smin) / (Smax - Smin)    (8)

where S, Smin and Smax are calculated as follows. Let p be the number of all pairs of samples (conditions) from the same cluster. Then S is the sum of the distances between the samples in those p pairs. Let P be the number of all possible pairs of samples in the dataset. Ordering those P pairs by distance, we can select the p pairs with the smallest and the p pairs with the largest distances between samples. The sum of the p smallest distances is equal to Smin, whilst the sum of the p largest is equal to Smax. From this formula it follows that the numerator will be small if pairs of samples with small distances are in the same cluster. Thus, small values of C correspond to good clusters. We calculated the distances using the knowledge-driven methods described above. The number of clusters that minimizes the C-index is taken as the optimal number of clusters, c.
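A direct sketch of eq. (8) is given below; it expects a precomputed (e.g. knowledge-driven) symmetric distance matrix and a label vector, both assumed inputs.

    from itertools import combinations

    def c_index(dist, labels):
        """Eq. (8) from a symmetric distance matrix and cluster labels."""
        n = len(labels)
        within, all_d = [], []
        for i, j in combinations(range(n), 2):
            all_d.append(dist[i][j])
            if labels[i] == labels[j]:
                within.append(dist[i][j])
        p = len(within)
        s = sum(within)                 # S
        all_d.sort()
        s_min = sum(all_d[:p])          # Smin: p smallest distances
        s_max = sum(all_d[-p:])         # Smax: p largest distances
        return (s - s_min) / (s_max - s_min)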

Goodman-Kruskal index. For a given dataset Xj (j = 1, ..., k, where k is the total number of samples (gene products in this application) in the dataset), this method considers all possible quadruples [20]. Let d be the distance between any two samples (w and x, or y and z) in Xj. A quadruple is called concordant if one of the following two conditions is true:

d(w, x) < d(y, z), w and x are in the same cluster and y and z are in different clusters;
d(w, x) > d(y, z), w and x are in different clusters and y and z are in the same cluster.

By contrast, a quadruple is called disconcordant if one of the following two conditions is true:

d(w, x) < d(y, z), w and x are in different clusters and y and z are in the same cluster;
d(w, x) > d(y, z), w and x are in the same cluster and y and z are in different clusters.

We adapted this method by calculating distances using the knowledge-driven methods described above. A good partition is one with many concordant and few disconcordant quadruples. Let Ncon and Ndis denote the numbers of concordant and disconcordant quadruples, respectively. Then the Goodman-Kruskal index, GK, is defined as:

GK = (Ncon - Ndis) / (Ncon + Ndis)    (9)

Large values of GK are associated with good partitions. Thus, the number of clusters that maximizes the GK index is taken as the optimal number of clusters, c.
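The corresponding sketch for eq. (9) enumerates the quadruples explicitly, which also makes the high computational cost noted above tangible (the loop is O(n^4)); the inputs are assumed as before.

    from itertools import combinations

    def goodman_kruskal(dist, labels):
        """Eq. (9): (Ncon - Ndis) / (Ncon + Ndis) over all quadruples."""
        pairs = list(combinations(range(len(labels)), 2))
        con = dis = 0
        for (w, x), (y, z) in combinations(pairs, 2):
            if len({w, x, y, z}) < 4:
                continue                     # the two pairs must be disjoint
            same_wx = labels[w] == labels[x]
            same_yz = labels[y] == labels[z]
            if same_wx == same_yz or dist[w][x] == dist[y][z]:
                continue                     # neither con- nor disconcordant
            if (dist[w][x] < dist[y][z]) == same_wx:
                con += 1                     # the closer pair lies within a cluster
            else:
                dis += 1
        return (con - dis) / (con + dis)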

5 Results
The clustering algorithm was applied to produce different partitions consisting of 2 to 6 clusters each. Then, the validity indices were computed for each of these partitioning results. The two GO-based similarity assessment techniques introduced above were used in all cases to calculate biological distances between the genes.

Tables 1 to 4 show the predictions made by the validity indices at each number of clusters. Bold entries represent the optimal number of clusters, c, predicted by each method. In the tables, the first cluster validity index approach processes overall GO-based similarity values, which are calculated by taking into account the combined annotations originating from the three GO hierarchies. The other indices are based on the calculation of similarity values independently obtained from each of the GO hierarchies.

Table 1. C-index predictions based on Wu and Palmer's GO-based similarity metric for expression clusters originating from yeast data

Validity indices based on: c=2 c=3 c=4 c=5 c=6


Combined hierarchies 0.51 0.472 0.464 0.453 0.463
Biological process 0.501 0.321 0.259 0.235 0.237
Molecular function 0.501 0.32 0.274 0.243 0.272
Cellular component 0.514 0.586 0.602 0.614 0.615

Table 2. C-index predictions based on Resnik's GO-based similarity estimation technique for expression clusters originating from yeast data

Validity indices based on: c=2 c=3 c=4 c=5 c=6


Combined hierarchies 0.504 0.395 0.373 0.349 0.369
Biological process 0.503 0.321 0.261 0.234 0.243
Molecular function 0.501 0.32 0.278 0.25 0.29
Cellular component 0.517 0.645 0.69 0.723 0.759

Table 3. Goodman-Kruskal index values using Wu and Palmer's similarity metric for expression clusters originating from yeast data

Validity indices based on: c=2 c=3 c=4 c=5 c=6


Combined hierarchies -0.023 -0.01 -0.018 0.004 -0.017
Biological process -0.013 0.005 -0.005 0.034 0.018
Molecular function -0.02 0.009 0.005 0.066 -0.026
Cellular component -0.025 -0.022 -0.032 -0.046 -0.025

Table 4. Goodman-Kruskal index values using Resnik's similarity metric for expression clusters originating from yeast data

Validity indices based on: c=2 c=3 c=4 c=5 c=6


Combined hierarchies -0.026 -0.001 -0.02 0.016 -0.01
Biological process -0.018 0.014 -0.012 0.055 0.044
Molecular function -0.02 0.012 0.004 0.087 -0.016
Cellular component -0.025 -0.035 -0.024 -0.037 -0.025

The C-indices based on the Resnik similarity measurement and similarity information from the MF, BP and combined hierarchies indicated that the optimal number of clusters is c = 5, which is consistent with the expected cluster structure [18]. The C-indices based on the Wu and Palmer similarity measurement and similarity information from the MF and BP hierarchies also indicated that the optimal number of clusters is c = 5. In all cases, only the method based on the CC hierarchy suggested the partition with two clusters as the optimal partition, which confirms that cellular localization information does not adequately reflect relevant functional relationships in this dataset.

For the Goodman-Kruskal method, again only the method based on the CC hierarchy suggested a partition different from c = 5 as the optimal partition.

6 Accompanying Tool

The approaches described in this paper are available as part of the Machaon CVE (Clustering and Validation Environment) [10]. This software platform has been designed to support clustering-based analyses of expression patterns, including several data- and knowledge-driven cluster validity indices. The program and additional information may be found at http://www.cs.tcd.ie/~Nadia.Bolshakova/GOtool.html

7 Conclusion
This paper presented an approach to assessing cluster validity based on similarity knowledge extracted from the GO and GO-driven functional databases. A knowledge-driven cluster validity assessment system for microarray data clustering was implemented. Edge-counting and information content approaches were implemented to measure similarity between gene products based on the GO. The edge-counting approach calculates the distance between the nodes associated with two terms in a hierarchy: the shorter the distance, the higher the similarity. Its limitation is that it relies heavily on the idea that nodes and links in the GO are uniformly distributed.

This research applies two methods for calculating cluster validity indices. The first approach processes overall similarity values, which are calculated by taking into account the combined annotations originating from the three GO hierarchies. The second approach is based on the calculation of independent similarity values, which originate from each of these hierarchies. The advantage of our method compared to other computer-based validity assessment approaches lies in the application of prior biological knowledge to estimate functional distances between genes and the quality of the resulting clusters. This study contributes to the development of techniques for facilitating the statistical and biological validity assessment of data mining results in functional genomics.

It was shown that the applied GO-based cluster validity indices can be used to support the discovery of clusters of genes sharing similar functions. Such clusters may indicate regulatory pathways, which could be significantly relevant to specific phenotypes or physiological conditions.

Previous research has successfully applied the C-index using knowledge-driven methods (the GO-based Resnik similarity measure) [15] to estimate the quality of clusters.

Future research will include the comparison and combination of different data- and knowledge-driven cluster validity indices. Further analyses will comprise, for instance, the implementation of permutation tests as well as comprehensive cluster descriptions using significantly over-represented GO terms.

The results contribute to the evaluation of clustering outcomes and the identification of optimal cluster partitions, which may represent an effective tool to support biomedical knowledge discovery in gene expression data analysis.

Acknowledgements
This research is partly based upon work supported by the Science Foundation Ireland under Grant No. S.F.I.-02IN.1I111.

References
1. Fitch, J., Sokhansanj, B.: Genomic engineering: moving beyond DNA sequence to function. Proceedings of the IEEE 88 (2000) 1949-1971
2. Gat-Viks, I., Sharan, R., Shamir, R.: Scoring clustering solutions by their biological relevance. Bioinformatics 19 (2003) 2381-2389
3. Lee, S., Hur, J., Kim, Y.: A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics 20 (2004) 381-388
4. Goeman, J., van de Geer, S., de Kort, F., van Houwelingen, H.: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (2004) 93-99
5. Raychaudhuri, S., Altman, R.: A literature-based method for assessing the functional coherence of a gene group. Bioinformatics 19 (2003) 396-401
6. Hanisch, D., Zien, A., Zimmer, R., Lengauer, T.: Co-clustering of biological networks and gene expression data. Bioinformatics 18 (2002) S145-S154
7. Sohler, F., Hanisch, D., Zimmer, R.: New methods for joint analysis of biological networks and expression data. Bioinformatics 20 (2004) 1517-1521
8. Khatri, P., Draghici, S.: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21 (2005) 3587-3595
9. Bolshakova, N., Azuaje, F.: Cluster validation techniques for genome expression data. Signal Processing 83 (2003) 825-833
10. Bolshakova, N., Azuaje, F.: Machaon CVE: cluster validation for gene expression data. Bioinformatics 19 (2003) 2494-2495
11. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, New Mexico (1994) 133-138
12. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI). (1995) 448-453
13. Azuaje, F., Bodenreider, O.: Incorporating ontology-driven similarity knowledge into functional genomics: an exploratory study. In: Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE 2004). (2004) 317-324
14. Wang, H., Azuaje, F., Bodenreider, O., Dopazo, J.: Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, La Jolla, California, IEEE Press (2004) 25-31
15. Bolshakova, N., Azuaje, F., Cunningham, P.: A knowledge-driven approach to cluster validity assessment. Bioinformatics 21 (2005) 2546-2547
16. Speer, N., Spieth, C., Zell, A.: A memetic clustering algorithm for the functional partition of genes based on the Gene Ontology. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004), IEEE Press (2004) 252-259
17. Speer, N., Spieth, C., Zell, A.: Functional grouping of genes using spectral clustering and gene ontology. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2005), IEEE Press (2005) 298-303
18. Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Landsman, D., Lockhart, D., Davis, R.: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2 (1998) 65-73
19. Hubert, L., Schultz, J.: Quadratic assignment as a general data-analysis strategy. British Journal of Mathematical and Statistical Psychology (1976) 190-241
20. Goodman, L., Kruskal, W.: Measures of association for cross classifications. Journal of the American Statistical Association (1954) 732-764
A Novel Mathematical Model for the Optimization of DNA-Chip Design and Its Implementation

Kornélia Danyi 1, Gabriella Kókai 2, and József Csontos 3

1 Institute of Informatics, University of Szeged, Árpád tér 2, H-6720 Szeged, Hungary
2 Department of Programming Systems, Friedrich-Alexander University, Martensstraße 3, D-91058 Erlangen, Germany
3 Department of Biomedical Sciences, Creighton University, 2500 California Plaza, Omaha, NE 68178, USA

Abstract. A variety of recent achievements in the fields of biology, chemistry and information technology have made possible the development of DNA chips. They allow us to analyze the sequences and functions of different genes simultaneously and to detect small differences in them. They are the source of a tremendous amount of data in the field of bioinformatics. Moreover, the engineering process of DNA chips requires the latest results of information technology, too. In this paper, we address the mathematical problem of predicting the hybridization process on the chip surface. A novel in situ, in silico approach is presented and the obtained results are discussed.

1 Introduction
The rapid development of nanotechnology and computer science has led to the emergence of a new research field called bioinformatics. One of the most important techniques of this new discipline is the DNA-chip, or DNA-microarray. It also represents a revolutionary innovation in the area of applied medical diagnostics. With its help, the presence of pathogens and a predominant proportion of genetically based diseases can be detected in parallel, very quickly and accurately. In other words, it can greatly accelerate precise diagnostics, so that the appropriate treatment can be started earlier.

Nevertheless, the more extensive use of the method is currently limited by its high operational costs. For example, the production of 80 homogeneous chips with 20,000 DNA fragments costs approximately EUR 40,000. The largest part of these expenses results from the production of the so-called masks, which are used to determine the chemical structure of the DNA pieces. Obviously, the large manufacturing costs are a substantial disadvantage. Therefore, the design process needs special attention. The aim of our work is to facilitate the engineering process and to help avoid the extra costs caused by chemicals carrying inappropriate genetic information.

We proceed as follows: in Section 2 a short introduction to DNA-chip technology and the Simulated Annealing method is given. The estimation of hybridization is summarized in Section 3. Our mathematical model is presented together with its optimization in Section 4. The test results can be found in Section 5, and the conclusion with future work plans in Section 6.

2 DNA-Chips

A DNA-chip itself is a solid carrier (glass or a special plastic plate) with a set of single-stranded DNA fragments, so-called probes, on it. The sample is usually a solution which also contains DNA fragments. According to Watson and Crick, a DNA molecule consists of two helically twisted strands connected together by a series of hydrogen bonds (the double helix), and each strand has 4 distinct building blocks (nucleotides or bases): adenine (dA), guanine (dG), cytosine (dC) and thymine (dT). Generally, the different base sequences determine different genes, which are large DNA fragments that carry and transmit genetic information. The smaller DNA pieces are called oligonucleotides. It is energetically favorable if dA bonds to dT and dC pairs with dG, which means there exists a one-to-one correspondence on the set of DNA fragments. There is another important bijective function f: DNA -> RNA, which is based on the similar chemically preferred pairs dA-rU, dT-rA, dC-rG and dG-rC, where RNA fragments are variations with repetitive rA, rU, rC and rG elements. The Central Dogma of Molecular Biology, which is the law of genetic information storage and transfer in living organisms, is based on these special relationships (Figure 1). Indeed, if one investigates a sample using a DNA-chip, then only the unique complementary base sequences ought to form double helixes. This process is known as hybridization. Nevertheless, every molecular process observed at the macroscopic level is stochastic. Depending on the base sequences, it is possible that some non-exactly-complementary, distorted double helixes arise. The energetically less stable base pairs are the so-called mismatches (MMs). MMs and mutations are closely related: MMs can turn into mutations, and with their help even a single mutation can be detected. Since we can select the probe sequences, it is theoretically possible to indicate all well-known and unknown mutations in the sample. DNA-chip technology takes advantage of parallel experiments; even up to 100,000 different oligonucleotides can be applied on a single chip, and the investigation takes only a few minutes. Based on the application area, chips can be divided into three major categories:

- Diagnostic chips (oligo-DNA chips): The length (number of bases) of the probe oligonucleotides used is between 20 and 60. They are mainly used in the diagnostics of pathogens and the detection of genetically originated disorders.
- Transcriptional chips (cDNA chips): The length of the probe oligonucleotides used is typically larger than 100 nucleotides. They are often used in cancer research to detect changes and similarities between healthy and tumour cells.
- Chips for sequence analysis: The probe nucleotides are quite short. The goal is to determine the overlapping frames and substrings of genes. In fact, sequence analysis is an application of the well-known shortest common superstring problem.

All three kinds of DNA-chips serve to detect the presence or absence of certain nucleotide chains in the analyzed sample. Although the philosophy is basically the same, there are some substantial differences among the groups. From our point of view, the most remarkable one is the role of mutations, which is most important in the first group.

Several different DNA-chip manufacturing techniques have been developed during the last decade. In the case of diagnostic chips, the most frequently applied procedure is very similar to those used in computer chip fabrication [1].

Fig. 1. The flow of genetic information

Photolithographic processes such as photosensitive masks are combined with solid-phase chemistry to bond DNA fragments onto the surface. With a series of specific masks and chemical steps, high-density chips can be constructed. S. Fodor played a decisive role in the development, application and propagation of DNA-chips [2, 3, 4].

2.1 Sample Analysis Using DNA-Chips

The investigation is a multistep experiment involving the following processes (Figure 2): (i) amplification (multiplication) of the DNA strands in the sample to obtain an appropriate amount of them; (ii) labeling of the fragments with fluorescent dye to monitor the sample; (iii) hybridization between the sample and the probe sequences immobilized on the chip surface; (iv) measuring the fluorescence of the labeled DNA fragments to determine where hybridization occurred; (v) data interpretation. Besides the last step, which is also a time-consuming and computationally demanding process, chip design is apparently the most important step; it precedes the investigation and determines its success or failure.

Fig. 2. The simplified scheme of the DNA-chip investigation

2.2 Role of Hybridization - The Nearest Neighbor Method

The computational prediction of the thermodynamics of hybridization plays a pivotal role in chip design. Accordingly, several methods have already been proposed to estimate the melting point or temperature (Tm) of nucleic acid duplexes. At the melting temperature, 50% of the DNA fragments and their perfect complements form duplexes; the other half are in the free, single-stranded state due to molecular thermal motion. In fact, the probability of hybridization and the Tm point are descriptors of the same thermodynamic property: the stability of the duplex. The higher the Tm, the more likely hybridization occurs. To calculate Tm, the simpler methods use the chemical formula of the chains (eq. 1), with empirical parameters accounting for solvation effects (eq. 2).

Td = 2(#A + #T) + 4(#C + #G) (1)


where #X means the number of X bases in the sequence.

Tm = 81.5 + 16.6 log[Na+] + 41(XG + XC) - 500/L - 0.62F    (2)

where L and F are empirical parameters, and XG and XC can be simply calculated from #G and #C. In the most complicated and flexible case, the actual sequence of the chains is also taken into consideration (eq. 3) [5, 6].
Tm = ΔH0 / (ΔS0 + R ln[c/4]) - 273.15 + const · log[Na+]    (3)

where ΔH0 is the enthalpy, ΔS0 the entropy and c the oligo concentration. ΔH0 and ΔS0 depend on the base sequences and on 12 constants (const) determined by experiment [7, 8].
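The sketch below renders eqs. (1)-(3) as code under explicit assumptions: the salt correction uses a base-10 logarithm, [Na+] defaults to a hypothetical 50 mM, and the nearest-neighbor sums ΔH0 and ΔS0 are passed in by the caller, since the parameter tables of [7, 8] are not reproduced here.

    import math

    R = 1.987  # gas constant in cal / (mol K)

    def t_wallace(seq):
        """Eq. (1): Td = 2(#A + #T) + 4(#C + #G)."""
        return 2 * (seq.count("A") + seq.count("T")) + \
               4 * (seq.count("C") + seq.count("G"))

    def t_salt(seq, na=0.05, formamide=0.0):
        """Eq. (2) with L = len(seq) and XG + XC = GC fraction."""
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        return (81.5 + 16.6 * math.log10(na) + 41 * gc
                - 500 / len(seq) - 0.62 * formamide)

    def t_nn(dh, ds, c, na=0.05, const=16.6):
        """Eq. (3); dh (cal/mol) and ds (cal/(mol K)) are summed NN terms."""
        return dh / (ds + R * math.log(c / 4)) - 273.15 + const * math.log10(na)

    probe = "ACGTACGTACGT"
    print(t_wallace(probe), round(t_salt(probe), 1))  # 36  38.7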
Although there is a scientific dispute [9, 10, 11, 12] about the best parameter set, the nearest neighbor (NN) method (eq. 3) is generally considered to give the most accurate prediction of Tm. Nevertheless, substantial problems remain unsolved concerning NN. First of all, the NN parameters are not valid in the solid phase; they are based on fluid-phase measurements. Secondly, they were originally developed to describe the thermodynamics of perfectly matching sequences, and their extension to other cases is painful: there still exist quite a few mismatches without parameters. Lastly, every parameter defined by experiments has a limited scope. Reparametrization can help to solve these problems, but it requires a tremendous amount of experimental work in the NN approach. In addition, if one considers the last argument carefully, this is nothing else than another stone on Sisyphus's way. A more effective approach will be presented in the following sections to avoid these difficulties.

2.3 Simulated Annealing


Simulated Annealing (SA, [13]) is a randomized procedure that supplies good approximate solutions for combinatorial optimization problems in many practical cases [14, 15, 16, 17]. The technique was developed at the beginning of the 1980s. As its name implies, SA exploits an analogy between the way a metal cools and freezes into a minimum-energy crystal structure (the annealing process) and the search for a minimum in a more general system. The success of the process depends strongly on the choice of a control parameter called the temperature. In order to provide as good a cooling plan as possible for Simulated Annealing, the convergence behavior of the procedure must be investigated.
Simulated Annealing starts with a feasible solution of the combinatorial optimization problem, and in every iteration a randomly selected neighbor solution is produced. The algorithm employs a random search that accepts not only changes that decrease the objective function but also some changes that increase it. If the change yields a better function value, the search moves to it and iterates. Otherwise, the new solution is accepted only with a certain probability. This probability decreases with increasing iteration number because of the decreasing temperature.
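A minimal sketch of this acceptance scheme follows (illustrative Python, not the authors' implementation; energy and neighbor are problem-specific placeholders, and the linear schedule anticipates the one used later in Section 4.3).

    import math, random

    def simulated_annealing(x0, energy, neighbor, t0=100.0, iters=1000):
        x, e = x0, energy(x0)
        best, best_e = x, e
        for k in range(iters):
            t = t0 * (1.0 - k / iters)          # linear cooling schedule
            y = neighbor(x)
            de = energy(y) - e
            # always accept improvements; accept worse moves with prob exp(-de/t)
            if de <= 0 or (t > 0 and random.random() < math.exp(-de / t)):
                x, e = y, e + de
                if e < best_e:
                    best, best_e = x, e
        return best, best_e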

3 In situ, in silico Chip Design

As we mentioned earlier, regarding DNA the most stable state is when the four bases form Watson–Crick base pairs: dG≡dC, dA=dT (where every hyphen means one hydrogen bond). If two sequences are exactly complementary, then they will hybridize with each other under appropriate conditions. Since duplex formation is a stochastic process, hybridization can occur between non-perfectly matching sequences, and its probability is in inverse proportion to the number of MMs. Furthermore, one can conclude that different MMs might have different effects on hybridization. Accordingly, the following parameters are specified in our model:

1. Type-dependent parameters: In the case of DNA, the number of Watson–Crick pairs and MMs is 2 and 8, respectively, so the number of possible combinations is (4+2−1 choose 2) = 10 (Table 1a). There are 4 Watson–Crick pairs and 12 MMs considering DNA–RNA hybrids, where the order of the elements is significant: (4 choose 1) · (4 choose 1) = 16 (Table 1b).

Every type-dependent parameter is restricted to the [0.0, 10.0] interval:

0.0 ≤ dXrY ≤ 10.0, where X ∈ {A, G, C, T} and Y ∈ {A, G, C, U}.

Apparently, the positions of the MMs in the sequence are not equivalent (Figure 3). That is why the following parameter type was introduced:
Table 1. a) The commutative Cayley table of DNA/DNA pairs; the off-diagonal elements are MMs and the different ones are in italic. b) The Cayley table of DNA/RNA pairs; the off-diagonal elements are MMs.

a) DNA/DNA:
      dA      dC      dG      dT
dA    dAdA    dCdA    dGdA    dTdA
dC    dAdC    dCdC    dGdC    dTdC
dG    dAdG    dCdG    dGdG    dTdG
dT    dAdT    dCdT    dGdT    dTdT

b) DNA/RNA:
      dA      dC      dG      dT
rA    dArA    dCrA    dGrA    dTrA
rC    dArC    dCrC    dGrC    dTrC
rG    dArG    dCrG    dGrG    dTrG
rU    dArU    dCrU    dGrU    dTrU

2. Position-dependent parameters: they are determined by the length of the sequence and the position of the mismatch. The length of the sequence is also important; a mismatch has greater importance if the sequence is shorter. The f(x) function weighting the position of the mismatch has to be defined according to the length of the sequence. The importance of the sequence positions follows a maximum curve regarding MMs. The speed of growth or decrease depends on the length of the sequence, and MMs have less influence on the probability of hybridization at both ends of the chain than if they were in the center of the sequence (Figure 3).

Fig. 3. The influence of the MM position on hybridization

3. Solution-dependent parameters: they cover the experimental conditions, i.e., salt and sample concentrations and other environmental factors, which also affect the probability of hybridization. The experimental conditions are specified by the scientist who plans and accomplishes the work. One can assume that these values do not change within a certain laboratory. Although we do not use these parameters explicitly, they are implicitly involved in our approach.

In our basic model, we optimize only the type-dependent parameters and set the other parameters to appropriate values (Section 4.2).
4 Methods

All the above-mentioned parameters carry chemical meaning in our model. With the help of the MMs, which are weighted by the appropriate parameters, the mathematical model for the estimation of the hybridization probability can be presented:

P(hybridization) = max( 0, 1 − Σ_{i=0}^{l−1} w_pos(i) · w_mm(m(i)) ),

where l is the length of the sequence, i is the position number, m(i) is the type of mismatch at position i, w_pos(i) ∈ R is the weight of position i, and w_mm(m(i)) is the weight of the mismatch type at position i. If there is no mismatch at position i, then w_pos(i) = w_mm(m(i)) = 0. If appropriate values are assigned to the parameters, experiments can be replaced by computations.
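A hedged sketch of this estimate in Python (our own illustration; the position-weight function w_pos and the dictionary w_mm of the 12 type-dependent weights are assumed inputs, and mismatches maps each mismatch position to its type):

    def p_hybridization(l, mismatches, w_pos, w_mm):
        """Probability estimate of the model above for one probe of length l."""
        penalty = sum(w_pos(i, l) * w_mm[mm] for i, mm in mismatches.items())
        return max(0.0, 1.0 - penalty)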

4.1 Optimization of Parameters

The following target function was used in the optimization process:

z = min Σ_{i,j}^{n,m} ( a_ij − b_ij )²,

where the element a_ij is derived from the chip experiment (TIFF image processing), its value being based on the hybridization of the DNA fragment in row i and column j, and n, m are the dimensions of the chip. The elements b_ij of the matrix B are computed as follows:

b_ij = f(i, j) = Σ_{k=0}^{l} w_pos(k) · w_mm(S_ijk),

where l is the length of the sequence, S_ij is the sequence on the chip in row i and column j, and S_ijk is the MM at position k in this sequence. Thus, the elements of matrix B are calculated from character strings which consist of variations of the four symbols (A, C, G, T).
In a real dataset, the proportions of MMs are usually not balanced. In order to optimize the MM weights proportionately, we need to scale the numbers of the MMs present and extend the target function. The percentage of every mismatch in the dataset is calculated as follows:

p_m(t) = N_m(t) · 100 / ( Σ_{t=1}^{12} N_m(t) ) %,
where m(t) is the mismatch and N_m(t) is the number of m(t) mismatches in the real dataset. In order to scale the numbers of MMs considering DNA–RNA chips, the ratio of 100%/12 = 8.3333% (the evenly distributed case) to p_m(t) is taken; thus the scale weight for mismatch m(t) is:

sc_m(t) = 8.3333 / p_m(t) %
The target function can be extended as follows:

z = min Σ_{i,j}^{n,m} [ ( a_ij − b_ij )² · Σ_{k=1}^{l} sc_m(k)_ij ],

where sc_m(k)_ij is the scale weight of the mismatch at position k in the probe sequence at position (i, j) on the chip.
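The proportions and scale weights can be computed directly; the following sketch (illustrative Python, assuming the observed mismatch types are given as a flat list) mirrors the two formulas above:

    from collections import Counter

    def scale_weights(observed_mms):
        """observed_mms: list of mismatch-type labels -> dict of sc_m weights."""
        counts = Counter(observed_mms)
        total = sum(counts.values())
        p = {mm: 100.0 * n / total for mm, n in counts.items()}   # p_m(t) in %
        # the evenly distributed case would be 100%/12 = 8.3333% per MM type
        return {mm: 8.3333 / p_mm for mm, p_mm in p.items()}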

4.2 The Setting of Weight Parameters


The position-dependent parameters were not allowed to change during the optimization, and a general function was used, which means that only the length dependence of the MMs was taken into consideration. The function we used is the following:

f(i) = min( dist_begin(i), dist_end(i) ) / (l/2),

where dist_begin(i) is the distance of position i from the beginning of the sequence, dist_end(i) is the distance of position i from the end of the sequence, and l is the length of the sequence. Figure 4 shows this function.

Fig. 4. The weight function for calculating the position-dependent parameters
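A direct transcription of this weight function (sketch, assuming 0-based positions so that dist_begin(i) = i and dist_end(i) = l − i):

    def w_pos(i, l):
        """Triangular position weight of Sect. 4.2 (cf. Fig. 4)."""
        return min(i, l - i) / (l / 2.0)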

4.3 Optimization with Simulated Annealing


The search space consists of vectors of the 12 MM parameters. The parameters can take any value from the [0.0, 10.0] interval. At the beginning of the optimization process, the parameter values are set to 1.0. To generate new solutions, the current parameter values are perturbed by a random number from the interval [−0.1, 0.1]. With the changed parameters, the difference between the theoretical and the experimental matrices is computed and compared with the previous result. Based on the Simulated Annealing method, these results are accepted or rejected. In order to determine the initial temperature and the maximum iteration number, the efficiency of the optimization was tested over several intervals. In this study, only the linear cooling schedule was used.
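The neighbor-generation step described above can be sketched as follows (illustrative Python; the clipping to [0.0, 10.0] reflects the parameter bounds stated earlier):

    import random

    def neighbor(params):
        """Perturb each of the 12 MM weights by a uniform step in [-0.1, 0.1]."""
        return [min(10.0, max(0.0, p + random.uniform(-0.1, 0.1)))
                for p in params]

    initial = [1.0] * 12   # all parameters start at 1.0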
5 Results

Figure 5 shows the differences between the average optimum values as a function of the initial temperature and the number of iterations. The initial temperature was increased from 50 to 140 and the number of iterations from 10 to 55. It can be seen that the optimum values increase if the temperature is between 50 and 70 and the number of iterations is between 50 and 70. In contrast, if the temperature is between 100 and 140 and the number of iterations is 20, the difference between the model and the experiment is acceptable. Since SA is a stochastic search method, we repeated the runs 100 times for each case.

Fig. 5. The average optimum values generated by the method using different temperatures and iteration numbers

5.1 The Comparison of the Optimized Parameters

In Figures 6 and 7, the variation of the parameter values can be observed. Those parameters which are important from the biochemical point of view are represented by grey columns; the others are in black.
It can be seen that, starting from 1.0, the values are mixed at an early stage. However, as the optimization process advances, the important and the less notable parameters separate from each other, and the most important ones eventually obtain the highest weights. If we take into account the facts that hybrid RNA/DNA structures are more stable than DNA/DNA duplexes and that the cytosine–guanine base pair represents a stronger interaction than the doubly hydrogen-bonded adenine and thymine (or uracil), the following conclusion can be made: regarding DNA/RNA hybrid systems, the RNA-side mismatches should mainly determine the stability of the hybridization. This theoretical consideration is consistent with the parameters resulting from the optimization process. As can be seen in Figure 7, 5 out of the 6 possible mismatches on the RNA side occupy the first 5 positions based on weight (dTrC, dArC, dGrG, dArG, dTrG).
Fig. 6. The values of the type-dependent parameters in the case of goal function values 6.10964 and 4.82974

Fig. 7. The values of the type-dependent parameters in the case of goal function values 2.42093 and 1.86422

6 Conclusion
The prediction of the thermodynamical properties of nucleic acids using computer modeling has not been solved yet. The main problem sources are that (i) the parameters used in the computation, determined by time-consuming and expensive experiments, can be applied only to the fluid phase, (ii) MM parameters are lacking, and (iii) the parameters strongly depend on the experimental conditions (e.g., temperature, solvent, etc.).
We presented a novel theoretical model (in situ, in silico approach) to estimate the hybridization process between DNA/DNA and DNA/RNA strands and to eliminate the previously mentioned defects. With the help of this new method, the in silico optimization process takes place in situ in the DNA-chip laboratory; then, using the established parameters, one can model the hybridization, which is the cornerstone of DNA-chip design.
In the computations done so far, the implemented simulated annealing method was used with a linear cooling curve. Besides the experimental testing of the method, the exponential and Boltzmann-sigmoid cooling schemes are in progress, as well as the optimization of the position-dependent parameters.
References
1. Chiu, G.T., Shaw, J.: Optical lithography. In: IBM Journal of Research and Development, Vol. 41. (24 Jan. 1997)
2. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X., Stern, D., Winkler, J., Lockhart, D., Morris, M., Fodor, S.P.: Accessing genetic information with high-density DNA arrays. In: Science, Vol. 274. (25 Oct. 1996) 610–613
3. Fodor, S., Rava, R., Huang, X., Pease, A., Holmes, C., Adams, C.: Multiplexed biochemical assays with biological chips. In: Nature, Vol. 364. (5 Aug. 1993) 555–556
4. Fodor, S., Read, J., Pirrung, M., Stryer, L., Lu, A., Solas, D.: Light-directed, spatially addressable parallel chemical synthesis. In: Science, Vol. 251, No. 4995. (15 Feb. 1991) 767–773
5. Panjkovich, A., Melo, F.: Comparison of different melting temperature calculation methods for short DNA sequences. In: Bioinformatics. (March 15, 2005) 21(6): 711–722
6. Panjkovich, A., Norambuena, T., Melo, F.: dnaMATE: a consensus melting temperature prediction server for short DNA sequences. In: Nucleic Acids Res. (July 1, 2005) 33(suppl_2): W570–W572
7. Lee, I., Dombkowski, A., Athey, B.: Guidelines for incorporating non-perfectly matched oligonucleotides into target-specific hybridisation probes for a DNA microarray. In: Nucleic Acids Research, Vol. 32, No. 2. (2004) 681–690
8. Rouillard, J.M., Zuker, M., Gulari, E.: OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. In: Nucleic Acids Research, Vol. 31, No. 12. (2003) 3057–3062
9. Breslauer, K., Frank, R., Blocker, H., Marky, L.: Predicting DNA duplex stability from the base sequence. In: Proc. Natl. Acad. Sci. USA 83. (1986) 3746–3750
10. Freier, S., Kierzek, R., Jaeger, J., Sugimoto, N., Caruthers, M., Neilson, T., Turner, D.: Improved parameters for predictions of RNA–RNA duplex stability. In: Proc. Natl. Acad. Sci. 83. (1986) 9373–9377
11. Wallace, R., Shaffer, J., Murphy, R., Bonner, J., Hirose, T., Itakura, K.: Hybridization of synthetic oligodeoxyribonucleotides to ΦX174 DNA: the effect of single base pair mismatch. In: Nucleic Acids Res. 6. (1979) 3543–3557
12. Howley, P., Israel, M., Law, M., Martin, M.: A rapid method for detecting and mapping homology between heterologous DNAs. Evaluation of polyomavirus genomes. In: JBC, Vol. 254. (1979) 4876–4883
13. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. In: Science, Vol. 220, No. 4598. (May 1983) 671–680
14. Berendsen, H., van der Spoel, D., van Drunen, R.: GROMACS: A message-passing parallel MD implementation. In: Comp. Phys. Comm. 91. (1995) 43–56
15. Sanbonmatsu, K.Y., Joseph, S., Tung, C.S.: Simulating movement of tRNA into the ribosome during decoding. In: PNAS 102. (2005) 15854–15859
16. Accelrys Software Inc.: Insight II molecular modeling and simulation environment (2005)
17. NIH Resource for Macromolecular Modeling and Bioinformatics: Scalable molecular dynamics (2005)
A Hybrid GA/SVM Approach for Gene
Selection and Classification of Microarray Data

Edmundo Bonilla Huerta, Béatrice Duval, and Jin-Kao Hao

LERIA, Université d'Angers,
2 Boulevard Lavoisier, 49045 Angers, France
{edbonn, bd, hao}@info.univ-angers.fr

Abstract. We propose a Genetic Algorithm (GA) approach combined with Support Vector Machines (SVM) for the classification of high-dimensional Microarray data. This approach is associated with a fuzzy-logic-based pre-filtering technique. The GA is used to evolve gene subsets whose fitness is evaluated by a SVM classifier. Using archive records of good gene subsets, a frequency-based technique is introduced to identify the most informative genes. Our approach is assessed on two well-known cancer datasets and shows competitive results with six existing methods.

Keywords: Genetic algorithms, Fuzzy logic, Support vector machines, Feature selection, Classification, Microarray data.

1 Introduction

The DNA Microarray technology allows measuring simultaneously the expression level of a great number of genes in tissue samples. A number of works have studied classification methods in order to recognize cancerous and normal tissues by analyzing Microarray data [1, 8, 2]. The Microarray technology typically produces large datasets with expression values for thousands of genes (2000–20000) in a cell mixture, but only few samples are available (20–80).

From the classification point of view, it is well known that, when the number of samples is much smaller than the number of features, classification methods may lead to data overfitting, meaning that one can easily find a decision function that correctly classifies the training data but behaves very poorly on the test data. Moreover, data with a high number of features inevitably require large processing time. So, for analyzing Microarray data, it is necessary to reduce the data dimensionality by selecting a subset of genes that are relevant for classification.

In recent years, many approaches, in particular various Genetic Algorithms (GAs) and Support Vector Machines (SVMs), have been successfully applied to Microarray data analysis [6, 19, 16, 10, 15, 17, 18, 13]. In Section 3, we review some of the most popular approaches.


In this paper, we are interested in gene selection and classification of DNA Microarray data in order to distinguish tumor samples from normal ones. For this purpose, we propose a hybrid model that uses several complementary techniques: fuzzy logic, a Genetic Algorithm (GA) combined with a Support Vector Machine (SVM), and an archive-based gene selection technique. Compared with previous studies, our approach has several particular features. First, to cope with the difficulty related to high-dimensional data, we introduce a fuzzy-logic-based pre-processing tool which allows us to reduce the data dimensionality largely by grouping similar genes. Second, our GA uses archives to record high-quality solutions. These archives are then analyzed to identify the most frequently appearing genes, which would correspond to the most predictive genes. Third, the GA combined with a SVM classifier is used both for selecting predictive genes and for final gene selection and classification.

The proposed approach is experimentally assessed on two well-known cancer datasets (Leukemia [8] and Colon [1]). Comparisons with six state-of-the-art methods show competitive results according to the conventional criteria.

The remainder of this paper is organized as follows. In Section 2, we describe briefly the two Microarray datasets used in this study. In Section 3, we review some popular gene selection approaches for the classification of Microarray data. In Section 4, we introduce the general scheme of our hybrid model. In Section 5, we describe our GA/SVM approach. Experimental results are presented in Section 6. Finally, conclusions are given in Section 7.

2 Datasets

In this study, we use two well-known public datasets, the Leukemia dataset and the Colon cancer dataset. All samples were measured using high-density oligonucleotide arrays [2].

The Leukemia dataset1 consists of 72 Microarray experiments (samples) with 7129 gene expression levels. The problem is to distinguish between two types of Leukemia, Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL). The complete dataset contains 25 AML samples and 47 ALL samples. As in other experiments [8], 38 out of 72 samples are used as training data (27 ALL samples and 11 AML samples) and the remaining samples (20 ALL samples and 14 AML samples) are used as test data.

The Colon cancer dataset2 contains the expression of 6000 genes with 62 cell samples taken from colon cancer patients, but only 2000 genes were selected based on the confidence in the measured expression levels [1]. 40 of the 62 samples are tumor samples and the remaining ones (22 of 62) are normal. In this paper, the first 31 out of 62 samples were used as training data and the remaining samples as test data.

1
Available at: http://www.broad.mit.edu/cgi-bin/cancer/publications/.
2
Available at: http://microarray.princeton.edu/oncology/aydata/index.html.
3 Review of Feature Selection Approaches

Feature selection for classification is a very active research topic, since many application areas involve data with tens of thousands of variables [9]. This section concerns more specifically a literature review of previous studies on feature selection and classification of Microarray data, with a special focus on the Leukemia and the Colon datasets presented in Section 2.

Feature selection can be seen as a typical combinatorial problem. Informally, given a dataset described by a large number of features, the aim is to find, within the space of feature subsets, the smallest subset that leads to the highest rate of correct classification. Given the importance of feature selection, many solution methods have been developed. Roughly speaking, existing methods for feature selection belong to three main families [9]: the filter approach, the wrapper approach and the embedded approach.

The filter methods separate the feature selection process from the classification process. These methods select feature subsets independently of the learning algorithm that is used for classification. In most cases, the selection relies on an individual evaluation of each feature [8, 6]; therefore, the interactions between features are not taken into account.

In contrast, the wrapper approach relies on a classification algorithm that is used as a black box to evaluate each candidate subset of features; the quality of a candidate subset is given by the performance of the classifier obtained on the training data. Wrapper methods are generally computation-intensive since the classifier must be trained for each candidate subset. Several strategies can be considered to explore the space of possible subsets. In particular, in [14], evolutionary algorithms are used with a k-nearest-neighbor classifier. In [12], the author develops parallel genetic algorithms using adaptive operators. In [18], one finds a SVM wrapper with a standard GA. In [20], the selection-classification problem is treated as a multi-objective optimization problem, minimizing simultaneously the number of genes (features) and the number of misclassified examples.

Finally, in embedded methods, the process of selection is performed during the training of a specific learning machine. A representative work of this approach is the method that uses support vector machines with recursive feature elimination (SVM/RFE) [10]. The selection is based on a ranking of the genes and, at each step, the gene with the smallest ranking criterion is eliminated. The ranking criterion is obtained from the weights of a SVM trained on the current set of genes. In this sense, embedded methods are an extension of the wrapper models. There are other variants of these approaches; see [21, 7] for two examples.

4 General Model for Gene Selection and Classification

The work reported in this paper is based on a hybrid approach combining fuzzy logic, GA and SVM. Our general model may be characterized as a three-stage sequential process, using complementary techniques to gradually shrink (or reduce) the search space. The rest of this section gives a brief description of these three stages.

Stage 1 – Pre-processing by fuzzy logic. This stage aims to reduce the dimension of the initial problem by eliminating gene redundancy. It is basically composed of four steps. First, the gene expression levels are transformed into fuzzy subsets with Gaussian representations. Second, the cosine amplitude method is employed to assess fuzzy similarities between genes. We build a similarity matrix that is then transformed into a matrix of fuzzy equivalence relations by different compositions. Third, using α-cuts [23] with decreasing values of α, we obtain groups of similar genes that correspond to fuzzy equivalence classes of genes. Fourth, for each group, one gene is randomly taken as the representative of the group and the other genes of the group are ignored. Applying this dimension-reduction technique to the datasets presented in Section 2, the set of 7129 genes for Leukemia (2000 genes for Colon, respectively) is reduced to 1360 genes (943 genes, respectively). Therefore, the search space is dramatically reduced. As we show later in Section 6, with this reduced set of genes we are able to obtain high-quality classification results. A detailed description of this stage goes beyond the scope of this paper and can be found in [3].
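For illustration, the cosine-amplitude similarity step can be sketched as follows (our reading of the method, not the authors' implementation; see [3] for their exact procedure):

    import numpy as np

    def cosine_amplitude(X):
        """X: genes x samples matrix -> genes x genes fuzzy similarity in [0, 1]."""
        num = np.abs(X @ X.T)                      # |sum_k x_ik * x_jk|
        norms = np.sqrt((X ** 2).sum(axis=1))
        return num / np.outer(norms, norms)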

Fig. 1. The general process for gene subset selection and classification using GA/SVM: gene subset selection (Stage 2 – top); gene selection and classification (Stage 3 – bottom)

Stage 2 – Gene subset selection by GA/SVM. From the reduced set of genes obtained in the previous pre-processing stage, this second stage uses a wrapper approach that combines a GA and a SVM to accomplish the feature (gene) subset selection. The basic idea here consists in using a GA to discover good subsets of genes, the goodness of a subset being evaluated by a SVM classifier
on a set of training data (see Section 2). During this stage, high-quality gene subsets are recorded in an archive in order to be further analyzed.

At the end of the GA, an analysis of the archived gene subsets is performed: gene subsets are compared among themselves and the most frequently appearing genes are identified. This process typically leads to a further reduced set of genes (<100 genes for the Leukemia and Colon datasets). Fig. 1 (top) shows a general picture of this stage.

Stage 3 – Classification. Stage 2 has identified a reduced set of relevant genes which is now used in the final step of gene selection and classification. From this set of genes, a new round of search is carried out using the previous GA/SVM, this time to classify the test data (see Section 2). This stage thus selects the most predictive genes to classify the test data. Fig. 1 (bottom) shows a general picture of this stage.

5 Gene Selection and Classification by GA/SVM

We now describe the hybrid GA/SVM algorithm that carries out Stages 2 and 3 of the general model for gene selection and classification. As explained previously, the GA is designed both for discovering good gene subsets and for final gene selection and classification. The SVM-based classifier is used to ensure the fitness evaluation of each candidate gene subset. One important feature of the GA developed in this work is the use of an archive to record quality gene subsets discovered during the gene subset selection stage. This archive is then analyzed to identify a small number of highly frequently appearing genes that are used in the final classification stage. Notice that the idea of archiving good solutions is not really a new one, because it is already used in some multiobjective evolutionary algorithms [26]. However, as we will see later in Section 5.3, our way of exploiting the information of the archive to identify predictive genes is original and useful. From the retained genes obtained from the archive analysis, the same GA/SVM algorithm is applied to the test data to perform the final gene selection and classification tasks.

5.1 The Genetic Algorithm

General Schema. The basic components of our GA are presented later in this section. Here we show the general algorithm. The GA follows a generational schema with a form of elitism. To obtain a new population from the current population P, the top E% of the population P are recorded, E being fixed to 10% or 15% in our experiments (see Section 6). Then, the following two actions are taken: 1) select two parents and apply (with a given probability) the crossover to create two new solutions, which are mutated (with a given probability), and 2) replace the parents by their offspring. These two actions are repeated a pre-fixed number of times. Finally, the recorded elite chromosomes are copied back into the population P to replace the worst-rated chromosomes. At this point, one generation is accomplished.
Chromosome and initial population. The chromosomes are binary-encoded; each allele (bit) of the chromosome represents a gene. If an allele is 1, it means that this gene is kept in the gene subset, and 0 indicates that the gene is not included in the subset. Each chromosome thus represents a gene subset. For Stage 2 of the general model, the chromosome length is equal to the number of genes pre-selected by the fuzzy pre-processing (i.e., 1360 for the Leukemia dataset and 943 genes for the Colon dataset). For Stage 3, the chromosome length depends on the size of the gene subset retained after analyzing the solution archive (see Section 5.3). In both cases, the initial population of the GA is randomly generated according to a uniform distribution.
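This encoding and initialization can be sketched as follows (illustrative Python, not the paper's Matlab code):

    import numpy as np

    def init_population(pop_size, n_genes, rng=np.random.default_rng()):
        # each allele is 1 ("gene kept") or 0 ("gene excluded") with prob 0.5
        return rng.integers(0, 2, size=(pop_size, n_genes))

    pop = init_population(500, 1360)   # Stage-2 sizes for the Leukemia dataset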

Fitness function. The fitness of a chromosome, i.e., a subset of genes, is assessed by the classification rate on the initial datasets. In other words, a subset of genes leading to a high classification rate is considered to be better than a subset leading to a low classification rate. In our case, a SVM classifier (see Section 5.2) ensures this classification task.

Selection, crossover, mutation, and replacement. We use roulette-wheel selection, random one-point crossover and multi-uniform mutation operators. Offspring always replace their parents. An elitism mechanism is also applied to conserve the top 10% or 15% chromosomes of the population between two successive generations.

Archives of high-quality gene subsets. Given a chromosome (a candidate subset of genes), the SVM classifier gives its fitness in terms of the classification rate on the training data set. If the classification rate is high enough (defined by a threshold θ, see Fig. 1, top), the subset of genes is recorded in an archive. In this paper, the threshold θ is set to 0.90 and 0.91 for the Leukemia and Colon datasets, respectively.

Stopping criterion. The evolution process ends when a pre-defined number of generations is reached or a fitness value of 100% is obtained.

5.2 The SVM Classifier

Support Vector Machines [24] are basically binary classification algorithms. When the data are linearly separable, the SVM computes the hyperplane that maximizes the margin between the training examples and the class boundary. When the data are not linearly separable, the examples are mapped to a high-dimensional space where such a separating hyperplane can be found. The mechanism that defines this mapping process is called the kernel function. SVMs are powerful classifiers with good performance in the domain of Microarray data [10, 17]. They can be applied to data with a great number of genes, but it has been shown that their performance is increased by reducing the number of genes [6, 2].

In our wrapper GA/SVM algorithm, we use a SVM classifier to assess the quality of a gene subset. For a chromosome x that represents a gene subset, we apply a Leave-One-Out Cross-Validation (LOOCV) method to calculate the average accuracy (rate of correct classification) of a SVM trained with this gene subset [11]. The LOOCV procedure means that one sample from the dataset is considered as a test case while a SVM is trained on all the other samples, and this evaluation is repeated for each sample. So, for each chromosome x, Fitness(x) = accuracy_SVM(x).
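A hedged sketch of this fitness evaluation (the paper uses a Matlab SVM toolbox; scikit-learn is substituted here purely for illustration, and gamma=0.5 reflects our reading of the RBF "control parameter" reported in Section 6.1):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def fitness(x, X_train, y_train, C=100.0, gamma=0.5):
        cols = np.flatnonzero(x)              # indices of the selected genes
        if cols.size == 0:
            return 0.0
        clf = SVC(kernel='rbf', C=C, gamma=gamma)
        # LOOCV accuracy of the SVM restricted to the selected genes
        scores = cross_val_score(clf, X_train[:, cols], y_train, cv=LeaveOneOut())
        return scores.mean()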
One of the key elements of a SVM classifier concerns the choice of its kernel. In our study, we have chosen to use the RBF kernel. We also experimented with Gaussian and polynomial kernels. For polynomial kernels, the main difficulty is to determine an appropriate polynomial degree, while the results we obtained with the Gaussian kernel are not satisfactory. Notice that RBF has been used in several previous studies for Microarray data classification [4, 18, 5].

5.3 Archive Analysis

At the end of Stage 2 and prior to the final classification (Stage 3), the archive is analyzed and the most frequently appearing genes in the archive are retained for the final gene selection and classification (Stage 3). Typically, this analysis leads to a limited number of genes (between 50 and 100). From these genes, the GA/SVM algorithm then determines the final set of genes relevant to classify the data.
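The frequency analysis itself is simple; a sketch (illustrative Python, with the archive assumed to be a list of equal-length binary chromosomes):

    import numpy as np

    def most_frequent_genes(archive, k=100):
        """Return the indices of the k genes appearing most often in the archive."""
        freq = np.sum(archive, axis=0)        # per-gene occurrence counts
        return np.argsort(freq)[::-1][:k]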

6 Experimental Results and Comparisons

6.1 Parameter Settings

For our GA/SVM algorithm, the GA is implemented in Matlab (Version 5.3.1 for Windows). The SVM classifier is based on the SVM Toolbox developed by Gavin Cawley3.

Table 1. GA parameters for the stage of gene subset selection (Stage 2)

Parameters Leukemia Colon


Size of population 500 500
Length of chromosome 1360 943
Number of generations 2500 2500
Crossover rate 0.95 0.98
Mutation rate 0.02 0.01
Elitism rate E 10% 15%

The GA parameters used in our model of gene subset selection for the Leukemia and Colon datasets are shown in Tables 1 and 2. For the SVM classifier, the same parameter settings are used in the two stages of gene subset selection and classification. The normalization parameter C is fixed at 100 and the control parameter for the RBF kernel of the SVM is fixed to 0.5. Notice that
3
http://theoval.sys.eua.uk/gcc/svm/toolbox
Table 2. GA parameters for the stage of classication (Stage 3)

Parameters Leukemia Colon


Size of population 50 50
Length of chromosome 100 50
Number of generations 500 500
Crossover rate 0.985 0.985
Mutation rate 0.02 0.01
Elitism rate E 15% 15%

given that the input data used by the GA/SVM are already normalized during the fuzzy-logic pre-processing, the normalization parameter C has in fact little influence in our case.

6.2 Results and Comparisons

To carry out our experiments, our GA/SVM algorithm was run 5 times on each of the Leukemia and Colon datasets. To calculate the average classification rate of a given gene subset, the LOOCV procedure [11] is employed.

Table 3 summarizes our results (Column 2) for the Leukemia and Colon datasets together with the results of six state-of-the-art methods from the literature (Columns 3-8). The conventional criteria are used to compare the results: the classification accuracy in terms of the rate of correct classification (first number) and the number of used genes (the number in parentheses, with "-" indicating that the number of genes is not available). For GA/SVM, the classification rate that we present is the average classification rate obtained from the 5 independent runs, and the number of selected genes is the minimum number obtained from these runs. Detailed results can be found in Table 4.
As can be observed, for the Leukemia dataset we obtain a classification rate of 100% using 25 genes, which is much better than that reported in [6, 5]. The same performance is achieved by [25, 18, 20, 10] with fewer selected genes; [20] and [10] report the minimal numbers of genes. However, in [20] the evolutionary method begins with a largely reduced set of 50 genes, published in [8] as interesting genes.

The most interesting results that we obtained with our model concern the Colon dataset, since our approach offers the highest (averaged) correct classification rate (99.41%); the number of selected genes is greater than the one obtained by [20] or by [25, 10], but it is smaller than the one reported in [18]. An analysis

Table 3. Comparison of GA/SVM with six state-of-the-art methods

Methods
Dataset GA&SVM [6] [25] [18] [5] [20] [10]
Leukemia 100(25) 94.10(-) 100(8) 100(6) 95.0(-) 100(4) 100(2)
Colon 99.41(10) 90.30(-) 91.9(3) 93.55(12) 91.0(-) 97.0(7) 98.0(4)
Table 4. GA/SVM performance on 5 runs

Runs Run 1 Run 2 Run 3 Run 4 Run 5 Average class. rate


Leukemia 100(25) 100(28) 100(30) 100(46) 100(35) 100
Colon 99.64(10) 99.83(15) 97.88(10) 99.83(15) 99.83(15) 99.41

of our results shows that several biologically significant genes reported in [8] are found by our approach.

Table 4 shows the detailed results of the 5 independent runs of our GA/SVM algorithm. As can be observed, these results are quite stable. For the Leukemia dataset, each of the 5 runs obtains a classification rate of 100%, while for the Colon dataset the best run gives a classification rate of 99.64%. Even the worst run obtains a classification rate of 97.88%.

7 Conclusions
In this paper, we presented a general approach for gene selection and classification of high-dimensional DNA Microarray data. This approach begins with a fuzzy-logic-based pre-processing technique that aims to cope with the imprecise nature of the expression levels and to reduce the initial dimension of the input dataset. Following this pre-processing stage, a hybrid wrapper system combining a Genetic Algorithm with a SVM classifier is used to identify potentially predictive gene subsets that are then used to carry out the final gene selection and classification tasks. Another important feature of our approach concerns the introduction of an archive of high-quality solutions, which allows limiting the GA/SVM exploration to a set of frequently appearing genes.

This approach was experimentally evaluated on the widely studied Leukemia and Colon cancer datasets and compared with six previous methods. The results show that our approach is able to obtain very high classification accuracy. In particular, to our knowledge, this is the first time that an averaged correct classification rate of 99.41% (with 10 genes) has been reached for the Colon dataset.

This approach can be further improved in several respects. First, we notice that our method does not provide the smallest number of genes on the Leukemia data. This is due to the fact that the GA is only guided by the criterion of classification accuracy. Therefore, the criterion of the number of genes should be integrated into the fitness function. This can be achieved by an aggregated fitness function or a bi-criteria evaluation. Second, the high computation time required in Stage 2 can be reduced by the use of a faster classifier (or an approximate fitness function); for example, the m-features operator reported in [22] may be considered. Also, a fine-tuning of the SVM parameters in Stage 3 may lead to improved results. Finally, we intend to apply our approach to other DNA chip data and to study the behavior of our model.

Acknowledgments. E. Bonilla Huerta is supported by a Mexican CoSNET research scholarship. This work is partially carried out within the French Ouest Genopole Program. We thank the reviewers of the paper for their very helpful comments.

References
1. U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. In Proc. Natl. Acad. Sci. USA, volume 96, 1999.
2. A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. Journal of Computational Biology, 7(3-4):559–583, 2000.
3. E. Bonilla Huerta, B. Duval, and J.K. Hao. Feature space reduction of large scale gene expression data using fuzzy logic. Technical Report, LERIA, University of Angers, January 2006.
4. M.P.S. Brown, W.N. Grundy, D. Lin, N. Cristianini, S.W. Sugnet, T.S. Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97(1):262–267, 2000.
5. S. Chao and C. Lihui. Feature dimension reduction for microarray data analysis using locally linear embedding. In APBC, pages 211–217, 2005.
6. T. S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000.
7. L. Goh, Q. Song, and N. Kasabov. A novel feature selection method to improve classification of gene expression data. In Proceedings of the Second Asia-Pacific Conference on Bioinformatics, pages 161–166, Australian Computer Society, Darlinghurst, Australia, 2004.
8. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
9. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
10. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
11. T. Joachims. Estimating the generalization performance of a SVM efficiently. Proceedings of the International Conference on Machine Learning (ICML), Morgan Kaufmann, 2000.
12. L. Jourdan. Metaheuristics for knowledge discovery: Application to genetic data (in French). PhD thesis, University of Lille, 2003.
13. K-J. Kim and S-B. Cho. Prediction of colon cancer using an evolutionary neural network. Neurocomputing (Special Issue on Bioinformatics), 61:361–379, 2004.
14. L. Li, C. R. Weinberg, T.A. Darden, and L.G. Pedersen. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17(12):1131–1142, 2001.
15. J. Liu and H. Iba. Selecting informative genes using a multiobjective evolutionary algorithm. In Proc. of Congress on Evolutionary Computation (CEC'02), pages 297–302, 2002.
16. F. Markowetz, L. Edler, and M. Vingron. Support vector machines for protein fold class prediction. Biometrical Journal, 45(3):377–389, 2003.
17. S. Mukherjee. Classifying Microarray Data Using Support Vector Machines. Springer-Verlag, Heidelberg, 2003.
18. S. Peng, Q. Xu, X.B. Ling, X. Peng, W. Du, and L. Chen. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Letters, 555(2):358–362, 2003.
19. S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149–15154, 2001.
20. A. R. Reddy and K. Deb. Classification of two-class cancer data reliably using evolutionary algorithms. Technical Report, KanGAL, 2003.
21. Y. Saeys, S. Degroeve, D. Aeyels, P. Rouzé, and Y. Van de Peer. Feature selection for splice site prediction: A new method using EDA-based feature ranking. BMC Bioinformatics, 5:64, 2004.
22. S. Salcedo-Sanz, F. Pérez-Cruz, G. Camps-Valls, and C. Bousoño-Calzón. Enhancing genetic feature selection through restricted search and Walsh analysis. IEEE Transactions on Systems, Man and Cybernetics, Part C, 34:398–406, 2004.
23. T.J. Ross. Fuzzy Logic with Engineering Applications. McGraw-Hill, 1997.
24. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
25. Y. Wang, F. Makedon, J.C. Ford, and J.D. Pearlman. HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics, 21(8):1530–1537, 2005.
26. E. Zitzler, K. Deb, and L. Thiele. Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary Computation, 8(2):173–195, 2000.
Multi-stage Evolutionary Algorithms for Efficient
Identification of Gene Regulatory Networks

Kee-Young Kim, Dong-Yeon Cho, and Byoung-Tak Zhang

Biointelligence Lab, School of Computer Science and Engineering,


Seoul National University, Seoul 151-742, Korea
{kykim, dycho, btzhang}@bi.snu.ac.kr

Abstract. With the availability of the time series data from the high-throughput
technologies, diverse approaches have been proposed to model gene regulatory
networks. Compared with others, S-system has the advantage for these tasks in
the sense that it can provide both quantitative (structural) and qualitative (dy-
namical) modeling in one framework. However, it is not easy to identify the
structure of the true network since the number of parameters to be estimated is
much larger than that of the available data. Moreover, conventional parameter
estimation requires the time-consuming numerical integration to reproduce dy-
namic profiles for the S-system. In this paper, we propose multi-stage evolu-
tionary algorithms to identify gene regulatory networks efficiently. With the
symbolic regression by genetic programming (GP), we can evade the numerical
integration steps. This is because the estimation of slopes for each time-course
data can be obtained from the results of GP. We also develop hybrid evolution-
ary algorithms and modified fitness evaluation function to identify the structure
of gene regulatory networks and to estimate the corresponding parameters at the
same time. By applying the proposed method to the identification of an artificial
genetic network, we verify its capability of finding the true S-system.

1 Introduction
Although mathematical modeling of biochemical networks can be achieved at different levels of detail (see [1] and [2] for reviews of metabolic and genetic regulatory network modeling), we can cluster the approaches into three dominant families [3].
One extreme case is mainly intended to describe the pattern of interactions between
the components. Graph-based representation gives us the insight for large architec-
tural features within a cell and allows us to discovery principles of cellular organiza-
tion [4]. However, it is difficult to handle the dynamics of the whole system since
these models are very abstract. The other extreme primarily focuses on describing the
dynamics of the systems by some kinds of equations which can explain the biochemi-
cal interactions with stochastic kinetics [5, 6]. While these approaches lead to realis-
tic, quantitative modeling on cellular dynamics, the application is limited to the small
systems due to their computational complexities.
One of the appropriate approaches for the pathway structure and dynamics identifi-
cation is S-system [7]. It is represented as a system of ordinary differential equations
which have a particular form, where each component process is characterized by


power-law functions. S-system is not only general enough to represent any nonlinear
relationship among the components but also able to be mapped onto the network
structure directly. In spite of these advantages, there is a serious drawback which
prevents the S-system from wildly spreading over the systems biology communities.
Its large number of parameters should be estimated for the small number of observed
dynamic trends. Provided that n components such as genes and proteins are involved
in a certain living system, we must optimize at least 2n(n+1) parameters for the S-
system.
Evolutionary computation has been used from its inception for automatic identifica-
tion of a given system or process [8]. For the S-system models, some evolutionary
search techniques have been proposed [9-12]. However, they require the time-
consuming numerical integrations to reproduce dynamic profiles for the fitness
evaluations. To avoid this problem, Almeida and Voit have employed an artificial
neural network (ANN) to smooth the measured data for obtaining slopes of gene ex-
pression level curves [13]. By comparing the slope of each S-system in the population
with the estimated one from ANN, we can evaluate the fitness values of the individu-
als without the computationally expensive numerical integrations of differential equa-
tions. This method also provides the opportunity for a parallel implementation of the
identification task since a tightly coupled system of non-linear differential equations
can be separated. Hence, they are able to reduce the time complexity drastically.
While collocation method [14] can save the computational cost by approximating
dynamic profiles, their estimated systems tend to be invalid since the number of
measured data is usually insufficient. This lack of data problem can be resolved by
sampling new points from the fitted curves. For the well-estimated profiles, however,
we should determine the optimal topology of the artificial neural network such as the
number of hidden units and layers.
In this paper, we propose multi-stage evolutionary algorithms to identify gene regu-
latory networks efficiently. With the symbolic regression by genetic programming (GP), we can evade the numerical integration steps. Here, we have no need to pre-determine the topology of the model for the expression profiles, since genetic programming can optimize the topology automatically. We also develop hybrid evolutionary algorithms to identify the structure of gene regulatory networks and to estimate the corresponding parameters at the same time. Most previous evolutionary approaches for S-system identification have used a structural simplification procedure in which parameters whose values are less than a given threshold are reset to zero. Although this method is able to make the network structure sparse, true connections which represent a somewhat small effect can be deleted during the procedure; that is, it is not easy to set a suitable value for the threshold. In our scheme, binary matrices for the network structure and real vectors and matrices for the parameter values of the S-system are combined into a chromosome and co-evolved to find the best descriptive model for the given data. Hence, we can identify the S-system without specifying threshold values for the structural simplification. By applying the proposed method to the artificial gene expression profiles, we successfully identified the true structure and estimated reasonable parameter values with a smaller number of data points than the previous study.

2 Materials and Methods

2.1 S-System

The S-system [7, 15] is a set of nonlinear differential equations described as follows:

dX_i/dt = α_i ∏_{j=1}^{n+m} X_j^{g_ij} − β_i ∏_{j=1}^{n+m} X_j^{h_ij},   i = 1, 2, ..., n,   (1)

where n is the number of dependent variables whose concentrations X_i change dynamically according to the above equations, and m is the number of independent variables whose concentrations remain constant during the processes. The non-negative parameters α_i and β_i are called rate constants. The real-valued exponents g_ij and h_ij are kinetic orders representing the interactive effect of X_j on X_i. These differential equations can be divided into two components: the first term represents influences that increase X_i, and the second term represents influences that decrease X_i. Thus, X_j induces the synthesis of X_i if g_ij > 0, whereas X_j inhibits the increase of X_i if g_ij < 0. Similarly, a positive (negative) value of h_ij indicates that X_j expedites (restrains) the decline of X_i.

We should estimate at least 2n(n+1) parameters even if the values related to the independent variables are assumed to be known. That is, α_i, β_i, g_ij, and h_ij are the parameters that must be estimated by evolutionary algorithms (EAs). The generally adopted fitness function for EAs is the sum of relative squared errors:
E = Σ_{i=1}^{n} Σ_{t=1}^{T} ( (X̂_i(t) − X_i(t)) / X_i(t) )²,   (2)
where T is the number of sampling points for the fitness evaluation, X̂_i(t) denotes the measured data points from the biological experiments, and X_i(t) the values obtained by the numerical integration of each S-system in the population.
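The right-hand side of Eq. (1) is straightforward to evaluate; a sketch (illustrative Python, with alpha and beta as length-n rate-constant vectors and g, h as n x (n+m) kinetic-order matrices):

    import numpy as np

    def s_system_rhs(X, alpha, beta, g, h):
        """Slopes dX_i/dt of Eq. (1) at state vector X of length n+m."""
        prod_g = np.prod(X[None, :] ** g, axis=1)   # production terms
        prod_h = np.prod(X[None, :] ** h, axis=1)   # degradation terms
        return alpha * prod_g - beta * prod_h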

2.2 Symbolic Regression by Genetic Programming

As we mentioned in the introduction, the numerical integration steps for the fitness evaluations are very time-consuming. To circumvent these processes, we propose 2-stage evolutionary algorithms for finding the structures and parameters of S-systems. As a preprocessing step, genetic programming (GP) [16] performs symbolic regression on the given time-course data. Through this step, we can predict the dynamic profiles of the given S-system and obtain more data points, as well as the derivatives at these points, for the second stage of our algorithm. This allows us to get rid of the numerical integrations in the fitness evaluations. Compared with the study which employed artificial neural networks [13], genetic programming has the following advantages for curve-fitting tasks. First, there is no need to adjust the number of hidden layers and units. Second, the results of GP are more understandable than those of neural networks, since GP returns mathematical functions

instead of an illegible graph and a set of weights. Hence, we can easily obtain the derivatives at each time point if the resulting functions of GP are differentiable.

2.3 Hybrid Evolutionary Algorithms

At the second stage of our approach, we use a hybrid evolutionary algorithm for searching the structures and parameters of the S-system. The whole procedure of our algorithm is summarized in Fig. 1.

Fig. 1. The whole proposed multi-stage evolutionary algorithm for the identification of the S-system

In the biochemical experiment step, the dynamic profiles of the involved genes can be obtained by periodic measurement. We are also able to predict the slope at each time point from the regression line of the measured data produced by genetic programming. Then, we can evaluate and optimize the chromosomes of the evolutionary algorithm by comparing the estimated slopes, which come from substituting the data into the S-system, with the predicted slopes of the GP regression lines. Hence, the fitness value of each S-system can be evaluated without the time-consuming numerical integrations as follows:

X (t ) X i(t )
2
n T
E = i
i =1 t =1 X i (t ) , (3)
Multi-stage Evolutionary Algorithms 49

X2 X3

X1

X4

(a) graph representation

dX 1 dX 2
= 12 X 30 . 8 10 X 10 .5 = 8 X 10 . 5 3 X 20 .75
dt dt
dX 3 dX 4
= 3 X 20 .75 5 X 30 .5 X 40 .2 = 2 X 10.5 6 X 40 .8
dt dt
(b) S-system

0 0 1 0 1 0 0 0

1 0 0 0 0 1 0 0
g = h=
0 1 0 0 0 0 1 1

1 0 0 1
0 0 0 0
binary matrices for the structure
12.0 0.0 0.0 0.8 0.0 10.0 0.5 0.0 0.0 0.0

8.0 0.5 0.0 0.0 0.0 3.0 0.0 0.75 0.0 0.0
= g = = h=
3.0
0.0 0.75 0.0 0.0

5.0 0.0 0.0 0.5 0.2
0.5 0.0 0.0
2.0 0.0 6.0
0.0 0.0
0.0 0.8

real vectors and matrices for the parameters

(c) the structure of the chromosome


Fig. 2. (a) The true structure of the artificial gene regulatory network (b) S-system representa-
tion (c) The structure of the chromosome in our hybrid evolutionary algorithms

where T is the number of sampling points, X is the gradient of the regression line
obtained by the genetic programming and X is the calculated value of each equation
in the S-system.
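A sketch of this slope-matching fitness (illustrative Python; X_pts holds the sampled states and gp_slopes the GP-estimated slopes, and no numerical integration is needed):

    import numpy as np

    def slope_fitness(X_pts, gp_slopes, alpha, beta, g, h):
        """Eq. (3): X_pts is T x (n+m) states, gp_slopes is T x n GP slopes."""
        err = 0.0
        for X, s_hat in zip(X_pts, gp_slopes):
            # slope of each equation of the candidate S-system at this point
            s = alpha * np.prod(X ** g, axis=1) - beta * np.prod(X ** h, axis=1)
            err += np.sum(((s_hat - s) / s_hat) ** 2)
        return err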
We develop hybrid evolutionary algorithms to identify the structure of gene regulatory networks and to estimate the corresponding parameters at the same time. In this scheme, binary matrices for the network structure and real vectors and matrices for the parameter values of the S-system are combined into a chromosome (Fig. 2) and co-evolved to find the best descriptive model for the given data. While the crossover operator is applied to the binary matrices to search the structure space, the corresponding parameter values are also exchanged. This kind of crossover passes the good structures as well as the parameter values of the parents on to the offspring. That is, we use a row-exchange crossover which simply selects a row of the matrix g or h (or both) in the parents and swaps it between them, together with the parameter values in the real vectors and matrices. For example, Fig. 3(a) shows the case in which we select the

second row of g in the parents. The mutation operator randomly selects an equation of a parent, and then inserts or deletes a component in the selected equation, as shown in Fig. 3(b) and (c). These operators perform local searches in the structure space. In this case, the parameter values are randomly generated for an insertion and reset to zero for a deletion.

[Fig. 3 illustrates the operators on the binary matrices: (a) a row-exchange crossover on g between parents P1 and P2, producing offspring O1 and O2; (b) an insertion mutation; and (c) a deletion mutation.]

Fig. 3. Crossover and mutation operators for the binary matrices
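To illustrate these operators, here is a sketch (Python/NumPy; the chromosome representation as a dictionary with binary matrices 'g', 'h' and matching real parameter matrices 'g_params', 'h_params' is an assumption for the example, since the paper does not prescribe a data layout).

```python
import numpy as np

def row_exchange_crossover(p1, p2, i, which="g"):
    """Swap row i of the chosen binary matrix (g, h, or both) between two
    parents, carrying the corresponding real parameters along with it."""
    o1 = {k: v.copy() for k, v in p1.items()}
    o2 = {k: v.copy() for k, v in p2.items()}
    for key in (["g", "h"] if which == "both" else [which]):
        o1[key][i], o2[key][i] = p2[key][i].copy(), p1[key][i].copy()
        o1[key + "_params"][i] = p2[key + "_params"][i].copy()
        o2[key + "_params"][i] = p1[key + "_params"][i].copy()
    return o1, o2

def mutate(parent, rng, insert=True):
    """Insert a component (new non-zero entry, random exponent) or delete
    one (entry and exponent reset to zero) in a randomly chosen equation."""
    child = {k: v.copy() for k, v in parent.items()}
    key = rng.choice(["g", "h"])
    i = rng.integers(child[key].shape[0])          # pick one equation (row)
    candidates = np.flatnonzero(child[key][i] == (0 if insert else 1))
    if candidates.size:
        j = rng.choice(candidates)
        child[key][i, j] = 1 if insert else 0
        child[key + "_params"][i, j] = rng.uniform(-1.0, 1.0) if insert else 0.0
    return child
```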

We also introduce some variety into the fitness function of equation (3). In the conventional scheme, all data points have the same weight in the fitness value. However, it is difficult to fit the data points that have a large second-order derivative. Moreover, this makes the parameter values of the chromosomes differ from the true ones even if the chromosomes have good fitness values. Thus we multiply each term of the evaluation function by the second-order derivative. The modified fitness function in our algorithm is described as follows:

$$E = \sum_{i=1}^{n}\sum_{t=1}^{T} \ddot{X}_i(t)\left(\frac{X'_i(t)-\hat{X}_i(t)}{X'_i(t)}\right)^{2}, \qquad (4)$$

where $X'_i(t)$ is the gradient of the GP regression line, $\ddot{X}_i(t)$ is its second-order derivative, and $\hat{X}_i(t)$ is the value calculated from the corresponding equation of the S-system. By introducing $\ddot{X}_i(t)$ into the fitness function, the true structure obtains a better fitness value than other structures. After the fitness values of the offspring created by crossover and mutation (applied according to their probabilities) are evaluated by equation (4), the parameters in the real vectors and matrices are adjusted through the (1+1) evolution strategy [17].
We employ the restricted tournament selection (RTS) proposed originally in [19] to prevent premature convergence on a locally optimal structure and to find multiple topology candidates. In RTS, a subset of the current population is selected for each newly created offspring. The size of these subsets is fixed to a constant called the window size. The new offspring then competes with the most similar member of the subset. Since the window size is set to the population size in our implementation, each offspring is compared with every S-system in the current population. If the new one is better, it replaces the corresponding individual; otherwise, the new one is discarded. As the similarity measure, we calculate the structural Hamming distance between the new offspring and all individuals in the population using the binary matrices.
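A compact sketch of this replacement scheme follows (Python; names are illustrative), with the window equal to the population size and the structural Hamming distance computed on the binary matrices.

```python
import numpy as np

def structural_distance(a, b):
    """Hamming distance between the binary structure matrices of two S-systems."""
    return int(np.sum(a["g"] != b["g"]) + np.sum(a["h"] != b["h"]))

def rts_insert(population, fitnesses, offspring, f_off):
    """Restricted tournament selection with window = population size: the
    offspring competes with its structurally most similar individual."""
    idx = min(range(len(population)),
              key=lambda i: structural_distance(population[i], offspring))
    if f_off < fitnesses[idx]:        # minimizing the error E of equation (4)
        population[idx] = offspring
        fitnesses[idx] = f_off
```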

3 Experimental Results

3.1 Data

To evaluate the performance of the proposed algorithms, we consider the artificial gene regulatory network from [13]. This network models 4 reactants that influence one another, and all reactants are auto-regulated. The S-system model of this gene regulatory network is represented in Fig. 2(a) and (b), and the time-course graph of the network is shown in Fig. 4(a). To create artificial time-course profiles, we solve this true S-system with the initial values X1(0)=1.4, X2(0)=2.7, X3(0)=1.2, and X4(0)=0.4 using the 4th-order Runge-Kutta method. The number of sampled data points is 25, at time intervals of 0.2 seconds, as shown in Fig. 4(b). We use 4 times fewer time points than the previous study [13], since it is very difficult to obtain large amounts of time-course data from real biological measurements.
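Under the parameters of Fig. 2 and the initial values above, these profiles can be reproduced with a short routine like the following (a sketch with a hand-rolled classical 4th-order Runge-Kutta step; NumPy assumed).

```python
import numpy as np

ALPHA = np.array([12.0, 8.0, 3.0, 2.0])
BETA = np.array([10.0, 3.0, 5.0, 6.0])
G = np.zeros((4, 4)); G[0, 2] = 0.8; G[1, 0] = 0.5; G[2, 1] = 0.75; G[3, 0] = 0.5
H = np.zeros((4, 4)); H[0, 0] = 0.5; H[1, 1] = 0.75; H[2, 2] = 0.5; H[2, 3] = 0.2; H[3, 3] = 0.8

def rhs(x):
    """True S-system of Fig. 2: dx_i/dt = alpha_i prod_j x_j^g_ij - beta_i prod_j x_j^h_ij."""
    return ALPHA * np.prod(x ** G, axis=1) - BETA * np.prod(x ** H, axis=1)

def rk4(x0, dt=0.2, steps=24):
    """Classical 4th-order Runge-Kutta; 24 steps of 0.2 s give the 25 samples."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x = xs[-1]
        k1 = rhs(x)
        k2 = rhs(x + dt / 2 * k1)
        k3 = rhs(x + dt / 2 * k2)
        k4 = rhs(x + dt * k3)
        xs.append(x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

profiles = rk4([1.4, 2.7, 1.2, 0.4])  # initial values X1(0)..X4(0) from the text
```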

3.2 Results of the Genetic Programming

We use Frayn's GPLib library [18] for the symbolic regression with GP. To evolve the mathematical models for the given data, we use the function set F = {+, −, ×, /, ^, sin, exp, log, sqrt} and the terminal set T = {t, c}, where c is a real constant and t is the time point. The population size is 3,000 and the maximum number of generations is 2,000. Tournament selection is used with tournament size 4. The crossover rate is 0.35 and the mutation rate is 0.5. We set the length penalty of the genetic program to zero for accurate curve fitting. We generate 100 points and their derivatives from the obtained models as the input of the next stage, that is, the hybrid evolutionary algorithms.

[Fig. 4: (a) the original time-course graph; (b) the sampled data points and the regression curves obtained by GP.]

Fig. 4. The true (a) and simulated (b) time-series data for the artificial genetic network

Results of the genetic programming step are as follows:

f1(t)=(sqrt((((sqrt(((sqrt(2.957153))-((sin(sqrt(t)))+((sin((1.854912)-(((sqrt((3)*(t)))
*(sqrt(t)))+(4.435898))))+(-2.355442))))*(t)))+((40.675830)/(exp((t)*((sqrt(t))-
((sin(((sin((((sqrt((3.756391)*(t)))*(sqrt(t)))+(log(86)))+(7)))+(3))+(sin(sin((1.654737
)*(t))))))+(-2.355442)))))))/(sqrt(54.598150)))/(sqrt(7.931547)))),

f2(t)=(sqrt(((3.777992)-((((((4.190957)-((t)-((sin((((t)*((t)^(t)))-((((t)*((2.883554)
/((4)-(log(t)))))+(2.791190))-((exp((t)-(2.226704)))^(sqrt(9.642561)))))/((t)^(t))))
*(2.678347))))/(3.462360))+(3.792098))-(4.796861))/(4)))+(((((3.792098)-((exp(t))
/(3.462360)))-((t)*((t)^(t)))) /((t)^(t)))/((t)^(t))))),

f3(t)=((log(log(exp(log(log(exp(exp(((log(exp()))-((sin((9)+(((t)+(8.000000))
^(sqrt(1.420245)))))/(((exp(t))*(379.000000))/(84.000000))))-((sin((8)+(((t)
+(((log(109))-(1.258803))/(6.620476)))^(sqrt(log(4.059461))))))
/(((exp(((8.337047)*((log(log(sqrt(3.021610))))+(2.000000)))-(5.912041)))*(exp(t)))
/(85.000000)))))))))))^(5.933873)),
f4(t)=((((log(6.249382))^((sqrt(6))*(((sqrt(10.000000))^((1.172486)-(t)))
/(6.981835))))^((1.161464)-((1.161464)/(((((sqrt(6.249382))*((log(7.008566))
*((((((exp((6.980522)/((sqrt(6.288201))^(1.344547))))^((1.735082)-(t)))
/(0.290257))*(sqrt(6.000000)))^(((9.704760)^((-0.050358)-(t)))-(t)))/(0))))
^((1.634223)+((7.277223)^((0.290257)-(t)))))^(0.161464))/(t)))))/(6.980522)).
Fig. 4 shows the true profiles together with the estimated curves and sampled data points from the genetic programming, and we confirm that GP recovers the true dynamics of the S-system.

3.3 Results of the Hybrid Evolutionary Algorithms

Using the 100 points obtained from the GP step, we search for the S-system with the hybrid evolutionary algorithms. This step consists of a steady-state evolutionary algorithm with a local optimization procedure, the (1+1) evolution strategy. To obtain sparse network structures, we adopt a structural constraint which the evolved networks should satisfy: each gene is assumed to be related to one or two other genes in the system. Hence, the randomly generated initial individuals, as well as the offspring produced by the crossover and mutation operators, should have one or two non-zero elements per row in g and h. The crossover rate is 0.8 and the mutation rate is 0.3. As the local optimization of the parameter values, the (1+1) evolution strategy is run for 80 fitness evaluations. The search ranges of the parameters are [0.0, 15.0] for αi and βi, and [−1.0, 1.0] for gij and hij. With a population size of 104, the proposed algorithm successfully identified the true structure after 106 generations. As shown in Fig. 5, we can also recover the dynamic profiles with the estimated parameters.
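The local parameter optimization can be sketched as a (1+1) evolution strategy over the real-valued entries (illustrative; the Gaussian step size sigma is an assumption, and in practice low/high would be arrays, since the alpha/beta and g/h entries have different search ranges).

```python
import numpy as np

def one_plus_one_es(params, evaluate, rng, evals=80, sigma=0.1, low=-1.0, high=15.0):
    """(1+1)-ES: perturb the parameter vector with Gaussian noise and keep
    the child only if it does not worsen the (minimized) fitness."""
    best, f_best = params.copy(), evaluate(params)
    for _ in range(evals):                       # 80 evaluations, as in the text
        child = np.clip(best + rng.normal(0.0, sigma, size=best.shape), low, high)
        f_child = evaluate(child)
        if f_child <= f_best:
            best, f_best = child, f_child
    return best, f_best
```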

(a) The S-system obtained by our hybrid EAs:

    alpha = (14.614, 8.123, 3.088, 2.866),   g = [0.0 0.0 0.623 0.0; 0.486 0.0 0.0 0.0; 0.0 0.761 0.0 0.0; 0.254 0.0 0.0 0.0]
    beta  = (12.670, 3.076, 5.583, 5.391),   h = [0.398 0.0 0.0 0.0; 0.0 0.753 0.0 0.0; 0.0 0.0 0.486 0.231; 0.0 0.0 0.0 0.443]

[(b) shows the time-course profiles of the identified system.]

Fig. 5. The results of the proposed hybrid evolutionary algorithms

4 Conclusion
We propose multi-stage evolutionary algorithms to identify gene regulatory networks efficiently under the S-system representation. We adopt a pre-processing symbolic regression step by genetic programming to avoid time-consuming numerical integration. We also develop hybrid genetic algorithms and modify the fitness function so as to identify the structure of gene regulatory networks and estimate the corresponding parameters simultaneously, without threshold values for enforcing sparse network structures. By applying the proposed method to the identification of an artificial genetic network, we verify its capability of finding the true S-system.

One important piece of future work is to demonstrate the usefulness of the proposed algorithm on real experimental biological data, such as gene expression profiles from microarrays and NMR measurements of some metabolites. As a by-product of the population diversity maintained by our evolutionary algorithms, we will be able to obtain different plausible topologies for the network very efficiently. These can be reliable candidates for biologists who want to discover unknown interactions among components of genetic networks.

Acknowledgement
This work was supported by the Korea Ministry of Science and Technology through
National Research Lab (NRL) project and the Ministry of Education and Human
Resources Development under the BK21-IT Program. The ICT at the Seoul National
University provided research facilities for this study.

References
1. Covert, M.W., Schilling, C.H., Famili, I., Edwards, J.S., Goryanin, I.I., Selkov, E., Palsson, B.O.: Metabolic modeling of microbial strains in silico. Trends in Biochemical Sciences 26 (2001) 179-186
2. De Jong, H.: Modeling and simulation of genetic regulatory systems: a literature review. Journal of Computational Biology 9 (2002) 67-103
3. Stelling, J.: Mathematical models in microbial systems biology. Current Opinion in Microbiology 7 (2004) 513-518
4. Barabási, A.-L., Oltvai, Z.N.: Network biology: Understanding the cell's functional organization. Nature Reviews Genetics 5 (2004) 101-113
5. De Jong, H., Gouzé, J.-L., Hernandez, C., Page, M., Sari, T., Geiselmann, J.: Qualitative simulation of genetic regulatory networks using piecewise-linear models. Bulletin of Mathematical Biology 66 (2004) 301-340
6. Rao, C.V., Wolf, D.M., Arkin, A.P.: Control, exploitation and tolerance of intracellular noise. Nature 420 (2002) 231-237
7. Voit, E.O.: Computational Analysis of Biochemical Systems. Cambridge University Press (2000)
8. Fogel, D.B.: System Identification Through Simulated Evolution: A Machine Learning Approach to Modeling. Ginn Press (1991)
9. Tominaga, D., Koga, N., Okamoto, M.: Efficient numerical optimization algorithm based on genetic algorithm for inverse problem. In: Whitley, D. et al. (eds.): Proceedings of the 2000 Genetic and Evolutionary Computation Conference (2000) 251-258
10. Ando, S., Sakamoto, E., Iba, H.: Evolutionary modeling and inference of gene network. Information Science 145 (2002) 237-259
11. Kikuchi, S., Tominaga, D., Arita, M., Takahashi, K., Tomita, M.: Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics 19 (2003) 643-650
12. Spieth, C., Streichert, F., Speer, N., Zell, A.: Optimizing topology and parameters of gene regulatory network models from time-series experiments. In: Deb, K. et al. (eds.): Proceedings of the 2004 Genetic and Evolutionary Computation Conference. Lecture Notes in Computer Science, Vol. 3102. Springer-Verlag (2004) 461-470
13. Almeida, J.S., Voit, E.O.: Neural network-based parameter estimation in S-system models of biological networks. Genome Informatics 14 (2003) 114-123
14. Tsai, K.-Y., Wang, F.-S.: Evolutionary optimization with data collocation for reverse engineering of biological networks. Bioinformatics 21 (2005) 1180-1188
15. Savageau, M.A.: Biochemical Systems Analysis: A Study of Function and Design in Molecular Biology. Addison-Wesley (1976)

16. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press (1992)
17. Bäck, T.: Evolutionary Algorithms in Theory and Practice. Oxford University Press (1996)
18. http://www.cs.bham.ac.uk/~cmf/GPLib/GPLib.html
19. Harik, G.R.: Finding multimodal solutions using restricted tournament selection. In:
Eshelman, L.J. (ed.): Proceedings of the Sixth International Conference on Genetic Algo-
rithms (1995) 24-31
Human Papillomavirus Risk Type Classification
from Protein Sequences Using
Support Vector Machines

Sun Kim and Byoung-Tak Zhang

Biointelligence Laboratory,
School of Computer Science and Engineering,
Seoul National University, Seoul 151-744, South Korea
{skim, btzhang}@bi.snu.ac.kr

Abstract. Infection by the human papillomavirus (HPV) is associated with the development of cervical cancer. HPV can be classified into high- and low-risk types according to its malignant potential, and detection of the risk type is important to understand the mechanisms and to diagnose potential patients. In this paper, we classify HPV protein sequences with support vector machines. A string kernel is introduced to discriminate HPV protein sequences; the kernel emphasizes amino acid pairs at a fixed distance. In the experiments, our approach is compared with previous methods in accuracy and F1-score, showing better performance. The prediction results for HPV types of unknown risk are also presented.

1 Introduction
Cervical cancer is a leading cause of cancer death among women worldwide. Epidemiologic studies have shown that the association of genital human papillomavirus (HPV) with cervical cancer is strong and independent of other risk factors [1]. HPV infection is implicated in virtually all cases of cervical cancer: certain high-risk HPV types can induce cancer, whereas most HPV types are low-risk and rarely develop into cancer. In particular, high-risk HPV types induce more than 95% of cervical cancers in women.

HPV is a relatively small, double-stranded DNA tumor virus that belongs to the papovavirus family (papilloma, polyoma, and simian vacuolating viruses). More than 100 human types are specific for epithelial cells including skin, respiratory mucosa, and the genital tract. The genital tract HPV types are classified into two or three groups, such as low-, intermediate-, and high-risk types, by their relative malignant potential [2]. The common, unifying oncogenic feature of the vast majority of cervical cancers is the presence of HPV, especially high-risk HPV [3]. Thus the detection of HPV risk types has become one of the most essential procedures in cervical cancer treatment. Currently, HPV risk types are still classified manually by experts, and there is no deterministic method to predict the risk type of unknown or new HPVs.
Since HPV classification is important for medical judgments, there have been many epidemiological and experimental studies to identify HPV risk types

[3]. Polymerase chain reaction (PCR) is a sensitive technique for the detection of very small amounts of HPV nucleic acids in clinical specimens. It has been used in most epidemiological studies that have evaluated the role of these viruses in cervical cancer causation [4]. Bosch et al. [1] investigated the epidemiological question of whether the association between HPV infection and cervical cancer is consistent worldwide, in the context of geographic variation. Burk et al. [5] inspected the risk factors for HPV infection in 604 young college women by examining social relationships, and detected various factors of HPV infection with L1 consensus primer PCR and Southern blot hybridization. Munoz et al. [6] classified HPV risk types with epidemiological experiments based on risk factor analysis. They pooled real data from 1,918 cervical cancer patients and analyzed it with PCR-based assays.

Detection of HPV risk types can be regarded as protein function prediction, even though functions are described at many levels, ranging from biochemical function to biological processes and pathways, all the way up to the organ or organism level [7]. Many approaches to protein function prediction are based on similarity search between proteins with known function. The similarity among proteins can be defined in a multitude of ways [8]: sequence alignment, structure match by common surface clefts or binding sites, common chemical features, or comparison of certain motifs. However, none of the existing prediction systems can guarantee generally good performance. Thus it is necessary to develop classification methods for HPV risk types. Eom et al. [9] presented a sequence comparison method for HPV classification; they use DNA sequences to discriminate risk types based on genetic algorithms. Joung et al. [10] combined several methods for risk type prediction from protein sequences: protein sequences are first aligned, the subsequences that distinguish high-risk from low-risk HPVs are selected by hidden Markov models, and then a support vector machine is used to determine the risk types. The main drawback of that work is that the method is biased by one sequence pattern. Alternatively, biomedical literature can be used to predict HPV risk types [11]. However, text mining approaches have limited prediction capability because they depend only on texts to capture the classification evidence, and obvious keywords such as 'high' tend to appear in the literature explicitly.
In this paper, we propose a method to classify HPV risk types using protein sequences. Our approach is based on support vector machines (SVM) to discriminate low- and high-risk types, and a string kernel is introduced to deal with protein sequences. The string kernel first maps sequences to the space of amino acid pairs drawn from subsequences. An RBF kernel is then used for nonlinear mapping into a higher-dimensional space and for similarity calculation. In particular, the proposed kernel uses only the amino acids at both ends of k-length subsequences to improve classification performance. This is motivated by the assumption that amino acid pairs at a certain distance affect the HPV's biological function, i.e. its risk type, more than consecutive amino acids. The experimental results show that our approach provides better performance than previous approaches in accuracy and F1-score.

Our work addresses how to classify HPV risk types from protein sequences with SVM approaches, which can provide a guide for determining unknown or new HPVs. The paper is organized as follows. In Section 2, we explain the SVM method for classification. The string kernel for HPV protein sequences is presented in Section 3. In Section 4, we present the experimental results, and we draw conclusions in Section 5.

2 Support Vector Machine Classifiers

We use support vector machines to classify HPV risk types. A string kernel-based SVM is trained on HPV protein sequences and tested on unknown sequences. Support vector machines were developed by Vapnik to give robust performance in classification and regression problems on noisy, complex data [12], and they have been widely used in recent years, from text categorization to bioinformatics. When used for a classification problem, a kernel and a set of labeled vectors, each marked as belonging to the positive or negative class, are given. The kernel functions introduce nonlinear features into the hypothesis space without explicitly requiring nonlinear algorithms. SVMs learn a linear decision boundary in the feature space mapped by the kernel in order to separate the data into two classes.

For a feature mapping $\Phi$, the training data $S = \{(x_i, y_i)\}_{i=1}^{n}$ is mapped into the feature space $\Phi(S) = \{(\Phi(x_i), y_i)\}_{i=1}^{n}$. In the feature space, SVMs learn the hyperplane $f = \langle w, \Phi(x)\rangle + b$, $w \in \mathbb{R}^N$, $b \in \mathbb{R}$, and the decision is made by $\mathrm{sgn}(\langle w, \Phi(x)\rangle + b)$. The decision boundary is the hyperplane $f = 0$ and its margin is $1/\|w\|$. Among normalized hyperplanes, SVMs find the hyperplane that has the maximal margin from each class.

Finding the optimal hyperplane can be formulated as the following problem:

$$\text{minimize}\quad \frac{1}{2}\|w\|^2 \qquad (1)$$
$$\text{subject to}\quad y_i(\langle w, \Phi(x_i)\rangle + b) \ge 1,\quad i = 1,\ldots,n. \qquad (2)$$

By introducing Lagrange multipliers $\alpha_i \ge 0$, $i = 1,\ldots,n$, we get the following dual optimization problem:

$$\text{maximize}\quad \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \langle\Phi(x_i),\Phi(x_j)\rangle \qquad (3)$$
$$\text{subject to}\quad \alpha_i \ge 0,\quad i = 1,\ldots,n, \qquad (4)$$
$$\sum_{i=1}^{n}\alpha_i y_i = 0. \qquad (5)$$

By solving this dual problem, one obtains the optimal solution $\alpha_i^*$, $1 \le i \le n$, which is needed to determine the parameters $(w, b)$. For the solution $\alpha_1^*, \ldots, \alpha_n^*$, the nonlinear decision function $f(x)$ is expressed in terms of these parameters:

$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{n}\alpha_i^* y_i \langle\Phi(x),\Phi(x_i)\rangle + b\right) \qquad (6)$$
$$\phantom{f(x)} = \mathrm{sgn}\left(\sum_{i=1}^{n}\alpha_i^* y_i K(x, x_i) + b\right). \qquad (7)$$

We can work in the feature space by using kernel functions, and any kernel function K satisfying Mercer's condition can be used.
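As a small illustration, equation (7) reduces prediction to kernel evaluations against the training points with non-zero multipliers; a minimal sketch (Python; names are illustrative):

```python
import numpy as np

def svm_decide(x, support_x, support_y, alphas, b, kernel):
    """Equation (7): f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    s = sum(a * y * kernel(x, xi)
            for a, y, xi in zip(alphas, support_y, support_x))
    return np.sign(s + b)
```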

3 Kernel Function
For HPV protein classification, we introduce a string kernel based on the spectrum kernel method. The spectrum kernel was previously used for remote homology detection [13][14]. The input space X consists of all finite-length sequences of characters from an alphabet A of size |A| = l (l = 20 for amino acids). Given a number k ≥ 1, the k-spectrum of a protein sequence is the set of all possible k-length subsequences (k-mers) that it contains. The feature map is indexed by all possible subsequences a of length k from A. The k-spectrum feature map $\Phi_k(x)$ from X to $\mathbb{R}^{l^k}$ can be defined as:

$$\Phi_k(x) = (\phi_a(x))_{a \in A^k}, \qquad (8)$$

where $\phi_a(x)$ is the number of occurrences of a in x. Thus the k-spectrum kernel function $K_k^s(x_i, x_j)$ for two sequences $x_i$ and $x_j$ is obtained by taking the inner product in feature space:

$$K_k^s(x_i, x_j) = \langle\Phi_k(x_i), \Phi_k(x_j)\rangle. \qquad (9)$$


To fit HPV risk type classification, we modify the spectrum kernel. Proteins are linear chains of amino acids, made during the process of translation; this is called the primary structure. The natural shape of a protein is not a straight line but a 3-dimensional structure formed by protein folding, which is a consequence of the primary structure. The structure of a similar homologous sequence can help to identify the tertiary structure of a given sequence. Here, we assume that amino acid pairs at a certain distance affect the HPV's risk type function more than consecutive amino acids, owing to this 3-dimensional structural property, and that HPV risk types can be identified by the amino acid pairs at a fixed distance which most influence the risk type decision. This assumption is somewhat rough, but it can be useful for relatively short and helix-dominant sequences.

Under this assumption, we define a string kernel, the gap-spectrum kernel, based on the k-spectrum. For a fixed k-mer $a = a_1 a_2 \ldots a_k$, $a_i \in A$, the 2-length sequence $\beta = a_1 a_k \in A^2$ indicates the amino acid pair with a (k−2) gap. The feature map $\Psi_k(x)$ is defined as:

Table 1. The manually classified risk types of 72 HPVs

Type Class Type Class Type Class Type Class


HPV1 Low HPV20 Low HPV38 Low HPV57 ?
HPV2 Low HPV21 Low HPV39 High HPV58 High
HPV3 Low HPV22 Low HPV40 Low HPV59 High
HPV4 Low HPV23 Low HPV41 Low HPV60 Low
HPV5 Low HPV24 Low HPV42 Low HPV61 High
HPV6 Low HPV25 Low HPV43 Low HPV63 Low
HPV7 Low HPV26 ? HPV44 Low HPV65 Low
HPV8 Low HPV27 Low HPV45 High HPV66 High
HPV9 Low HPV28 Low HPV47 Low HPV67 High
HPV10 Low HPV29 Low HPV48 Low HPV68 High
HPV11 Low HPV30 Low HPV49 Low HPV70 ?
HPV12 Low HPV31 High HPV50 Low HPV72 High
HPV13 Low HPV32 Low HPV51 High HPV73 Low
HPV15 Low HPV33 High HPV52 High HPV74 Low
HPV16 High HPV34 Low HPV53 Low HPV75 Low
HPV17 Low HPV35 High HPV54 ? HPV76 Low
HPV18 High HPV36 Low HPV55 Low HPV77 Low
HPV19 Low HPV37 Low HPV56 High HPV80 Low

$$\Psi_k(x) = (\psi_\beta(x))_{\beta \in A^2}, \qquad (10)$$

where $\psi_\beta(x)$ is the number of occurrences of $\beta$ in x. Furthermore, a nonlinear kernel function, the RBF kernel, is applied to increase the discrimination ability between HPV risk types. By the closure properties of kernels [15], the gap-spectrum kernel is defined as follows:

$$K_k(x_i, x_j) = K'(\Psi_k(x_i), \Psi_k(x_j)) \qquad (11)$$
$$\phantom{K_k(x_i, x_j)} = \exp\left(-\gamma\|\Psi_k(x_i) - \Psi_k(x_j)\|^2\right), \qquad (12)$$

where $\gamma > 0$. This string kernel is used in combination with the SVM explained in Section 2.
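A sketch of this kernel follows (Python; gamma and the helper names are illustrative): ordered amino acid pairs separated by k-2 positions are counted into a 400-dimensional vector, and an RBF kernel is applied to the difference of the two vectors. The resulting Gram matrix can be passed to any SVM implementation that accepts precomputed kernels. With k = 4 (the best setting in Section 4), each counted pair spans two intervening residues.

```python
import numpy as np
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"
PAIR_INDEX = {p: i for i, p in enumerate(product(AMINO, AMINO))}  # |A|^2 = 400

def gap_spectrum_features(seq, k):
    """Counts of the (first, last) amino acid pair of every k-mer, i.e. pairs
    with a (k-2) gap; equation (10)."""
    v = np.zeros(len(PAIR_INDEX))
    for i in range(len(seq) - k + 1):
        pair = (seq[i], seq[i + k - 1])
        if pair in PAIR_INDEX:                  # skip non-standard residues
            v[PAIR_INDEX[pair]] += 1
    return v

def gap_spectrum_kernel(si, sj, k=4, gamma=0.01):
    """Equation (12): RBF kernel over the gap-spectrum feature vectors."""
    d = gap_spectrum_features(si, k) - gap_spectrum_features(sj, k)
    return float(np.exp(-gamma * np.dot(d, d)))
```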

4 Experimental Results
4.1 Data Set
In this paper, we use the HPV sequence database of the Los Alamos National Laboratory (LANL) [16]; in total, 72 HPV types are used for the experiments. The risk types of the HPVs were determined based on the HPV compendium (1997). If an HPV belongs to the skin-related or cutaneous groups, it is classified as low-risk. On the other hand, an HPV is classified as high-risk if it is known to be a high-risk type for cervical cancer. The comments in the LANL database are used to decide the risk types of some HPVs that are difficult to classify. Seventeen sequences out of the 72 HPVs were classified as high-risk types (16, 18,

[Fig. 1: classification accuracy versus window size (1-4) for the E6, E7 and L1 proteins; accuracy ranges roughly from 0.7 to 1.0.]

Fig. 1. SVM classification performance by window size

31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 61, 66, 67, 68, and 72), and the others were classified as low-risk types. Table 1 shows the HPV types and their classified risk types. The symbol '?' in the table denotes an unknown risk type that cannot be determined.
Since several proteins can be used to discriminate HPVs, we evaluated the classification accuracy of an SVM with the RBF kernel to determine which gene products to use in the experiments. The input data are normalized frequency vectors obtained by a sliding-window method. This was done to decide the most informative protein among the gene products for HPV risk type classification. Figure 1 depicts the accuracy changes by window size for the E6, E7, and L1 proteins. The accuracy is the result of leave-one-out cross-validation. It indicates that the accuracy using the E6 protein is mostly higher than that using the E7 and L1 proteins. However, the overall accuracy increases with window size for all proteins because the HPV sequences are relatively short and more unique patterns are generated when the window is long. That is, the learners overfit the protein sequences for long window sizes. The viral early proteins E6 and E7 are known to induce immortalization and transformation in rodent and human cell types. E6 proteins produced by the high-risk HPV types can bind to and inactivate the tumor suppressor protein, thus facilitating tumor progression [16][17]. This process plays an important role in the development of cervical cancer. For these reasons, we have chosen the E6 protein sequences corresponding to the 72 HPVs.

4.2 Evaluation Measure


For HPV prediction, it is important to identify as many high-risk HPVs as possible, even if a few low-risk HPVs are misclassified; hence we evaluate the system performance using F1-score rather than the Matthews correlation coefficient. F1-score is a performance measure usually used for information retrieval systems, and it is effective for evaluating how well a classifier performs on an assigned class such as the high-risk type.

Table 2. The contingency table to evaluate the classification performance

                           Risk type answer
                           High    Low
    Prediction result High   a      b
                      Low    c      d

When binary scales are used for both the answer and the prediction, a contingency table can be established showing how the data set is divided by these two measures (Table 2). From the table, the classification performance is measured as follows:

$$\text{precision} = \frac{a}{a+b} \times 100\%$$
$$\text{recall} = \frac{a}{a+c} \times 100\%$$
$$F1\text{-}score = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
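For reference, a trivial sketch computing the three measures from the counts of Table 2 (high-risk taken as the positive class):

```python
def precision_recall_f1(a, b, c):
    """Precision, recall and F1-score from the Table 2 contingency counts."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```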

4.3 HPV Classification


We have tested the gap-spectrum SVM method using E6 protein sequences.
Leave-one-out cross-validation is used to determine the classication perfor-
mance. Figure 2 shows the accuracy changes according to k given in Equation
(12). The graph shows the accuracy has the highest performance (97.22%) when
k = 4, and the performance is decreases as it gets more or less k. k = 4 means
that we only use the amino acids pairs which have two gaps between them for
HPV classication. k = 2 is exactly same as the SVM using RBF kernel with
2-spectrum method, and the accuracy is 94.44% for k = 2. Even though it gives
higher score than other methods as shown in Figure 3, the kernel methods with
k > 2 still gives better performance. As a result, the amino acids pair with a
distance can provide more evidence than consecutive amino acids to discriminate
low- and high-risk HPV proteins.
The final classification performance in accuracy and F1-score is given in Figure 3, which compares our results with previous SVM approaches based on sequence alignment and with text mining approaches. The SVM method utilizing alignment information was reported in [10]. AdaCost and naive Bayes are text mining methods using HPV literature data, reported in [11]. Our approach shows 97.22% accuracy and 95.00% F1-score, while the previous SVM method shows 93.15% accuracy and 85.71% F1-score. For text-based classification, the AdaCost method shows 93.05% accuracy and 86.49% F1-score, and the naive Bayes method shows 81.94% accuracy and 63.64% F1-score. Additionally, the accuracy obtained by the DNA sequence-based method [9] is 85.64%. It is noteworthy that our approach gains relatively more in F1-score than in accuracy. F1-score is related to the number of high-risk HPVs found by the classifier, while accuracy is related to the number of HPVs classified correctly. Therefore, F1-score is more important than accuracy in this task.

[Fig. 2: accuracy as a function of k (1-7), peaking near 0.97 at k = 4.]

Fig. 2. Accuracy changes by the gap-spectrum kernel

[Fig. 3: bar charts of accuracy and F1-score for the gap-spectrum SVM, the previous SVM, AdaCost, and naive Bayes.]

Fig. 3. The performance comparison of the proposed approach and previous classification methods

Text mining approaches depend only on clues from text sentences. If text documents are unavailable for unknown HPVs, there is no way to classify them, whereas sequence-based classification needs no additional information beyond the sequence itself.

Table 3 shows the risk type predictions for the HPVs marked as unknown in Table 1. HPV26, HPV54, HPV57, and HPV70 are predicted as high-, low-, low-, and high-risk, respectively. The predictions for HPV26 and HPV54 are identical to those of Munoz et al. [6], and we assume that their results are correct because they are based on an epidemiologic classification of over 1,900 patients. For HPV70, previous studies have reached different decisions on the risk type [6][18][19], and the risk type of HPV57 cannot be decided yet because of insufficient previous work. From these prediction results, we conclude that our approach provides a plausible indication of whether unknown HPVs are high-risk or not.

Table 3. Predicted risk type for unknown HPVs

    Type  HPV26  HPV54  HPV57  HPV70
    Risk  High   Low    Low    High

5 Conclusion

We have presented a machine learning approach to classify HPV risk types. Our method uses an SVM classifier with the gap-spectrum kernel based on k-spectrum methods. The proposed kernel is designed to emphasize amino acid pairs at a fixed distance, which is suitable for relatively short and helix-dominant sequences. For the experiments, performance was measured with leave-one-out cross-validation. According to the experimental results, amino acid pairs at a fixed distance provide good performance for discriminating HPV proteins by risk. In this task it is especially important to have as few false negatives as possible; therefore F1-score matters, because it considers both precision and recall with respect to the high-risk type. Our approach shows significant improvement in F1-score compared with previous methods, and the predictions for unknown HPV types give promising results. We conclude that the relationship between amino acids at distance k = 4 plays an important role in separating low- and high-risk function in HPV E6 proteins.

In this paper, we treated all protein subsequences equally. Even though SVMs naturally detect the important factors in a high-dimensional space, it is necessary to analyze which components are more informative for HPV risk types. Also, protein structure or biological literature information could be combined with this method for more accurate prediction. Exploring efficient analysis methods of this kind remains future work.

Acknowledgement

This work was supported by the Korea Ministry of Science and Technology
through National Research Lab (NRL) project and the Ministry of Education
and Human Resources Development under the BK21-IT Program. The ICT at
the Seoul National University provided research facilities for this study.

References

[1] Bosch, F.X., Manos, M.M., et al.: Prevalence of Human Papillomavirus in Cervical Cancer: a Worldwide Perspective. Journal of the National Cancer Institute 87 (1995) 796-802.
[2] Janicek, M.F. and Averette, H.E.: Cervical Cancer: Prevention, Diagnosis, and Therapeutics. Cancer Journals for Clinicians 51 (2001) 92-114.
[3] Furumoto, H. and Irahara, M.: Human Papillomavirus (HPV) and Cervical Cancer. Journal of Medical Investigation 49 (2002) 124-133.
[4] Centurioni, M.G., Puppo, A., et al.: Prevalence of Human Papillomavirus Cervical Infection in an Italian Asymptomatic Population. BMC Infectious Diseases 5(77) (2005).
[5] Burk, R.D., Ho, G.Y., et al.: Sexual Behavior and Partner Characteristics Are the Predominant Risk Factors for Genital Human Papillomavirus Infection in Young Women. The Journal of Infectious Diseases 174 (1996) 679-689.
[6] Munoz, N., Bosch, F.X., et al.: Epidemiologic Classification of Human Papillomavirus Types Associated with Cervical Cancer. New England Journal of Medicine 348 (2003) 518-527.
[7] Watson, J.D., Laskowski, R.A., and Thornton, J.M.: Predicting Protein Function from Sequence and Structural Data. Current Opinion in Structural Biology 15 (2005) 275-284.
[8] Borgwardt, K.M., Ong, C.S., et al.: Protein Function Prediction via Graph Kernels. In Proceedings of the Thirteenth International Conference on Intelligent Systems for Molecular Biology (2005) 47-56.
[9] Eom, J.-H., Park, S.-B., and Zhang, B.-T.: Genetic Mining of DNA Sequence Structures for Effective Classification of the Risk Types of Human Papillomavirus (HPV). In Proceedings of the 11th International Conference on Neural Information Processing (2004) 1334-1343.
[10] Joung, J.-G., O, S.-J., and Zhang, B.-T.: Prediction of the Risk Types of Human Papillomaviruses by Support Vector Machines. In Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence (2004) 723-731.
[11] Park, S.-B., Hwang, S., and Zhang, B.-T.: Mining the Risk Types of Human Papillomavirus (HPV) by AdaCost. In Proceedings of the 14th International Conference on Database and Expert Systems Applications (2003) 403-412.
[12] Vapnik, V.N.: Statistical Learning Theory. Springer (1998).
[13] Leslie, C., Eskin, E., and Noble, W.S.: The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Proceedings of the Pacific Symposium on Biocomputing (2002) 564-575.
[14] Leslie, C., Eskin, E., et al.: Mismatch String Kernels for Discriminative Protein Classification. Bioinformatics 20(4) (2004) 467-476.
[15] Shawe-Taylor, J. and Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004).
[16] The HPV sequence database at Los Alamos National Laboratory, http://hpv-web.lanl.gov/stdgen/virus/hpv/index.html.
[17] Pillai, M., Lakshmi, S., et al.: High-Risk Human Papillomavirus Infection and E6 Protein Expression in Lesions of the Uterine Cervix. Pathobiology 66 (1998) 240-246.
[18] Longuet, M., Beaudenon, S., and Orth, G.: Two Novel Genital Human Papillomavirus (HPV) Types, HPV68 and HPV70, Related to the Potentially Oncogenic HPV39. Journal of Clinical Microbiology 34 (1996) 738-744.
[19] Meyer, T., Arndt, R., et al.: Association of Rare Human Papillomavirus Types with Genital Premalignant and Malignant Lesions. The Journal of Infectious Diseases 178 (1998) 252-255.
Hierarchical Clustering, Languages and Cancer

Pritha Mahata (1,2), Wagner Costa (1), Carlos Cotta (3), and Pablo Moscato (1,2)

(1) Newcastle Bioinformatics Initiative, School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, NSW, 2308, Australia
(2) Australian Research Centre in Bioinformatics
(3) Dept. Lenguajes y Ciencias de la Computación, University of Málaga, ETSI Informática, Campus de Teatinos, Málaga 29071, Spain
[email protected]

Abstract. In this paper, we introduce a novel objective function for the hierarchical clustering of data from distance matrices, a very relevant task in Bioinformatics. To test the robustness of the method, we apply it in two areas: (a) the problem of deriving a phylogeny of languages, and (b) subtype cancer classification from microarray data. For comparison purposes, we also consider both the use of ultrametric trees (generated via a two-phase evolutionary approach that creates a large number of hypothesis trees and then takes a consensus) and the best-known results from the literature.

We used a dataset of measured separation times among 84 Indo-European languages. The hierarchy we produce agrees very well with existing data about these languages across a wide range of levels, and it helps to clarify and raise new hypotheses about the evolution of these languages.

Our method also generated a classification tree for the different cancers in the NCI60 microarray dataset (comprising gene expression data for 60 cancer cell lines). In this case, the method seems to support the current belief about the heterogeneous nature of the ovarian, breast and non-small-cell lung cancers, as opposed to the relative homogeneity of other types of cancer. However, our method reveals a close relationship between the melanoma and CNS cell lines. This is in correspondence with the fact that metastatic melanoma first appears in the central nervous system (CNS).

1 Introduction

A large number of articles in bioinformatics use single-objective unsupervised hierarchical clustering algorithms to identify subtypes. Biologically-oriented journals, in general, have low requirements in terms of algorithmic reproducibility. The validation of the algorithms used in different problem scenarios is either poor or non-existent. Working on a dataset of measured separation times between 84 Indo-European languages, we started to notice the deficiencies of a large number of hierarchical clustering schemes. The same problems occur when analyzing datasets derived from mitochondrial DNA distances among species [1].


In this paper, we pose the clustering problem as a graph optimization problem and propose a novel objective function which performs very well on diverse types of datasets. We start with a distance matrix for a set of objects and compute a weighted graph in which vertices represent objects and edges are weighted by the distance between the corresponding vertices. Our objective function then seeks a solution whose fitness is maximal, being proportional both to the sum of the weights on the edges between the two sets of vertices and to the sum of the reciprocals of the weights on the edges inside the sets. We denote this an arithmetic-harmonic cut. The recursive application of such cuts generates a tree-based classification of the data. While our primary concern is the classification of microarray data, we are also interested in testing the robustness of the approach by validating it in other domains. For this purpose, we show results for two different datasets: (a) a dataset of 84 Indo-European languages, and (b) a dataset of 60 cancerous cell lines (NCI60). The next section provides more details on the algorithmic methods we have used.

2 Hierarchical Clustering Methods Considered

In addition to the clustering solutions available in the literature for the datasets considered, we have used two unsupervised techniques for computing alternative solutions. The first is based on arithmetic-harmonic cuts, and the second relies on the utilization of ultrametric trees. They are described below.

2.1 Arithmetic-Harmonic Cuts

The method of arithmetic-harmonic cuts approaches the construction of the hierarchy in a top-down fashion. More precisely, it can be described as a recursive process in which we solve a graph optimization problem at each step. Let G(E, V, W) be an undirected, complete weighted graph with no self-loops, such that the weight of any edge is a positive integer (i.e., w(e) > 0) representing the distance or some measure of dissimilarity between a pair of objects. We first find a partition of the set V of vertices into {S, V \ S}, which induces a partition of the set E of edges into two sets E_in and E_out. The set E_out ⊆ E is the set of edges that link a vertex in S and a vertex in V \ S (correspondingly, E_in = E \ E_out is the set of edges connecting vertices in the same part). Such a partition is defined by maximizing the following objective function:

$$F = \left(\sum_{e \in E_{out}} w(e)\right)\left(\sum_{e \in E_{in}} \frac{1}{w(e)}\right) \qquad (1)$$

We have implemented an exact backtracking algorithm and also a memetic algorithm (similar to the work of Merz and Freisleben [2] for graph bipartitioning) as a meta-heuristic to calculate the best partitioning of the vertices of a given graph. The difference with respect to [2] is that we remove the constraint of equal-sized parts in our memetic algorithm. Thus, the memetic algorithm uses (a) a differential greedy algorithm (similar to that in [3]) to initialize a set of solutions for the problem, (b) a differential greedy crossover (a modification of the algorithm in [2]) to evolve the population, and (c) a variable neighborhood local search (see [4]) to improve the newly generated solutions. Whenever the population stagnates, we keep the best solution and re-initialize the rest of the solution set. We use this memetic algorithm if the graph contains more than 25 vertices, and the backtracking enumeration algorithm otherwise. Notice that even though backtracking gives us an optimal solution, a memetic algorithm may not. However, on the considered datasets, the memetic algorithm consistently generated the same solution in all runs (thus it is presumably optimal). By applying this method (backtracking or the memetic algorithm, depending on the number of vertices) recursively, we have at each step a graph as input and the two subgraphs induced by the sets of the vertex partition as output; stopping when we arrive at a graph with just one vertex, we generate a hierarchical clustering in a top-down fashion.
The rationale for our objective function becomes clear if we rearrange its terms. We can write

$$F = (|E| - |E_{out}|)\,|E_{out}|\,\frac{A_{out}}{H_{in}} \qquad (2)$$

where A_out is the arithmetic mean of the weights that connect vertices of S with V \ S (the cut), H_in is the harmonic mean of the weights of the edges not in the cut, and |E_out| is the cardinality of the cut. Informally, maximizing F amounts to finding a cut that discriminates well between the two groups, normalized by the harmonic mean of the intra-cluster dissimilarity, and multiplied by a factor that is maximal when the two groups have a similar number of elements. Normalizing by the harmonic mean makes the denominator more stable to the presence of outlier samples associated with either S or V \ S. For this reason, we denote this partition an arithmetic-harmonic cut.
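A sketch of how the objective of equation (1) can be evaluated for a candidate bipartition, together with an exact brute-force search usable in place of backtracking on small graphs (Python/NumPy; names are illustrative; W is the symmetric positive distance matrix):

```python
import numpy as np

def ah_cut_value(W, in_s):
    """Equation (1): (sum of cut edge weights) * (sum of reciprocals of the
    in-cluster edge weights). in_s is a boolean membership vector for S."""
    n = W.shape[0]
    iu = np.triu_indices(n, k=1)           # each edge counted exactly once
    cut = in_s[iu[0]] != in_s[iu[1]]       # edges crossing the partition
    w = W[iu]
    return w[cut].sum() * (1.0 / w[~cut]).sum()

def best_cut_bruteforce(W):
    """Exact search over all bipartitions (feasible only for small n)."""
    n = W.shape[0]
    best, best_f = None, -np.inf
    for mask in range(1, 2 ** (n - 1)):    # vertex n-1 fixed in V\S: no mirrors
        in_s = np.array([(mask >> i) & 1 for i in range(n)], dtype=bool)
        f = ah_cut_value(W, in_s)
        if f > best_f:
            best, best_f = in_s, f
    return best, best_f
```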
Notice that maximizing only the first part of the objective function, i.e., $\sum_{e \in E_{out}} w(e)$ (the total weight of the edges across the two sets), is the same as solving the Max-Cut problem for graph G, which is an NP-hard problem. However, it turns out that the hierarchy generated by partitions using Max-Cut does not corroborate previous knowledge about the datasets. This is probably because Max-Cut attaches no importance to the similarity of vertices within the sets. We also considered the objective function

$$F' = \sum_{e \in E_{out}} w(e) - \sum_{e \in E_{in}} w(e) \qquad (3)$$

However, the partition resulting from maximizing F' turns out to be no better than the partition obtained from Max-Cut.

2.2 Ultrametric Trees


Ultrametric trees constitute a very amenable approach for tting distance ma-
trices to trees. In essence, an ultrametric tree T is a weighted tree in which the
70 P. Mahata et al.

distance Dij between any two leaves i and j (measured as the sum of the weights
of the edges that have to be traversed to reach i from j inside T ) veries that
Dij  max{Dik , Djk }, 1  i, j, k  n, where n is the number of leaves. This
equation implies that given any internal node h in T , it holds that Dhi = Dhj
for any leaves i, j having h as ancestor.
The use of ultrametric trees has several advantages in hierarchical classica-
tion. First of all, edge weights are very easy to compute: given a distance matrix
M containing dissimilarity values for a collection of objects, and a candidate
tree T , the minimum weights such that Dij  Mij and T is ultrametric can be
computed in O(n2 ) [5]. Secondly, they adapt very well to dynamical processes
evolving at a more or less constant rate. Finally, even if the latter is not the case,
they provide a very good approximation to more relaxed criteria such as mere
additivity, that would be much more computationally expensive to calculate.
Notice also that nding the optimal topology T for a given distance matrix M
under the ultrametric assumption is NP-hard [5].
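The three-point form of the ultrametric condition (for any triple of leaves, the two largest of the three pairwise distances must be equal) is easy to check directly; a sketch (Python/NumPy, cubic time, for illustration only):

```python
import numpy as np

def is_ultrametric(D, tol=1e-9):
    """Check D_ij <= max(D_ik, D_jk) for all triples of leaves, i.e. the two
    largest of D_ij, D_ik, D_jk coincide (up to tol)."""
    n = D.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                d = sorted([D[i, j], D[i, k], D[j, k]])
                if abs(d[2] - d[1]) > tol:
                    return False
    return True
```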
Ultrametric trees have been computed using an evolutionary two-phase procedure: first, a collection of high-quality tentative trees is generated; subsequently, a consensus method is used to summarize this collection into a single tree. Beginning with the former, the generation of high-quality (i.e., minimum-weight) ultrametric trees has been approached using an evolutionary algorithm based on the scatter search template. Starting from the solution provided by the complete-link agglomerative algorithm, an initial population of trees is produced by perturbation (internal exchanges of branches). Then, an evolutionary cycle is performed using tree-based path relinking for recombination [6] and internal rotations for local search (no mutation is used). Whenever the system stagnates, the population is restarted by keeping the best solution and generating new trees by exchanging branches among existing trees.

Once a collection of high-quality trees has been found, the consensus method is used to amalgamate them. This is done using the TreeRank measure [7] as a similarity metric among trees. This measure is based on counting the number of times we have to traverse an edge upwards or downwards in order to go from a certain leaf to another one. By computing how different these figures are for two trees, we obtain a dissimilarity value. The TreeRank measure is currently used in TreeBASE¹, one of the most widely used phylogenetic databases, for the purpose of handling queries for similar trees.

The consensus algorithm we have used is an evolutionary metaheuristic that evolves tentative trees following [8]. Given the collection of trees we want to summarize, the sum of dissimilarities to the tentative tree is used as the fitness function (to be minimized). Evolution is performed using the prune-delete-graft operator [9, 10] for recombination, no mutation, binary tournament selection, and elitist replacement. In our experiments, we considered all different trees generated by the scatter search method in one hundred runs, and then ran the consensus algorithm one hundred times on this collection. The best solution out of these 100 runs is kept as the final consensus tree.
¹ http://www.treebase.org

3 Classifying Indo-European Languages


Two major hypotheses exist about the origin of the Indo-European languages: the Kurgan expansion and the Anatolian farming hypotheses. Based on archaeological evidence, the Kurgan theory [11] says that Kurgan horsemen (near present-day Russia) went to Europe and the Near East around 6000 years B.C. On the other hand, the Anatolian theory [12] claims that the Indo-European languages expanded with the spread of agriculture from Anatolia (present-day Turkey) around 8000-9500 years B.C. Scientists have used genetic [13, 14, 15] and numerical [16, 17] methods to complete the linguistic phylogeny of the Indo-European languages. However, some doubts remain about its exact topology.

The dataset we analyse is the distance matrix for 84 Indo-European languages generated by Dyen et al. [18]. They used a list of basic vocabulary and estimated the historical relationship (similarity) between two languages by computing the ratio of the number of words (cognates) shared by them to the total number of words. Furthermore, the replacement rate of each word in the vocabulary is also considered. From this ratio and the replacement rates, they generated the so-called separation times between pairs of languages, provided as a distance matrix of the 84 languages. This dataset is the same as the one used in [17], where the neighbor-net analysis method provided some hints of possible lateral information transfer while still providing an overall hierarchical relationship among the languages.
Gray and Atkinson used a Bayesian Markov chain Monte Carlo method to generate and analyze a large number (10,000) of tentative trees [16]. Although their method is in remarkable agreement with the timing of the Anatolian hypothesis, it also shows how much uncertainty still exists in validating some subtrees. The posterior probability of some subtrees can be quite low: for example, 0.36 for dividing a subtree into the Indo-Iranian and Albanian languages, and 0.4 for dividing the Greek and Armenian groups from most of the languages (with the exception of Tocharian and Hittite, extinct languages added by the authors to the original dataset of Dyen et al. [18]).

We have applied our method of arithmetic-harmonic cuts to Dyen et al.'s dataset. The language tree we obtained is shown in Fig. 1(a). As can be seen, this method separates the relatively older languages (a Greco-Armenian-Albanian-Indo-Iranian group) from the relatively newer languages (a Celtic-Slavic-Germanic-Romanic group). Unlike our tree, the tree in [16] (see Fig. 1(b)) first separates the extinct Tocharian, Hittite, and Greco-Armenian languages from the rest. Their tree places the Albanian branch together with the Indo-Iranian group. We note that the corresponding subtree of Indo-Iranian-Albanian languages is divided with a Bayesian posterior probability of only 0.36. This raises a concern, as it may be the case that the Indo-Iranian-Albanian languages are older than claimed in [16] and are better placed temporally closer to the Greco-Armenian languages. In fact, the neighbor-net analysis in [17] produced a network with Albanian very close to the Greco-Armenian languages. Thus, our topology for the main divisions seems reasonable and is in closer agreement with the spread of farming from

[Figure: three language trees rendered side by side, with leaves labelled by the 84 languages and branches grouped into the Celtic, Italic, French/Iberian, West Germanic, North Germanic, Slavic, Baltic, Greek, Armenian, Albanian, Indic and Iranian families; tree (b) additionally contains the extinct Tocharian and Anatolian (Hittite) branches.]

Fig. 1. Three proposed language trees: (a) tree using arithmetic-harmonic cuts, (b) Gray and Atkinson's tree [16], (c) consensus ultrametric tree

the Middle East between 10,000 and 6,000 years ago, which also correlates well with the first principal component of the genetic variation of 95 polymorphisms [19], which alone accounts for 28% of the total variation.

In addition, there are certain strange relations between languages at the leaves of Gray's tree. After checking the distance matrix, we find several cases in which our tree seems to produce a more reasonable branching than Gray and Atkinson's. First of all, the closest neighbors of the Czech and Slovak languages are the Lusatian languages. It is thus natural to have Czech, Czech E and Slovak placed in a subtree close to the Lusatian languages. In the trees generated by both the arithmetic-harmonic cut (Fig. 1(a)) and the ultrametric approach (Fig. 1(c)), these languages are placed next to each other. However, in Fig. 1(b), generated by [16], Czech and Slovak are placed closer to Ukrainian, Byelorussian and Russian. Secondly, Catalan is a language evolved from Latin, with strong influences from French and Spanish. As a consequence of its Latin origin, Italian is the closest language to Catalan in the dataset. The position of Catalan with Italian and Ladin in our tree seems very natural, as the hybridizations with French and Spanish occurred later (note that the Bayesian posterior probability is 0.83 for its link with the Spanish-Portuguese group; see [16] for the detailed probabilities in Figure 1(b)). Although Italian's closest language is Ladin, the latter was placed closer to RomanianList and Vlach with a posterior probability of 0.88; notice also the position of Italian, with a posterior probability of 0.59. Finally, there are small differences in the topology of minor subtrees between our hierarchy and Gray's, namely those regarding DutchList-Afrikaans-Flemish, the Greek languages, the Albanian languages, and the position of Bengali within the Aryan languages, among others. The differences seem to occur mainly where the posterior probability of one or several branchings is low.

An important difference is that in our classification the Celtic languages are considered closer to the Baltic-Slavic languages. This goes against the current belief in Celtic's closeness to the Romanic and Germanic languages. Note that in Gray and Atkinson's classification, the branchings of (Germanic, Romance) and (Celtic, (Germanic, Romance)) have low posterior probabilities (0.46 and 0.67, respectively). The minimum-weight ultrametric tree (see Fig. 1(c)) for this dataset also considers the Celtic and Slavic languages to be the closest groups. However, this tree disagrees with ours in the primary branches. For instance, it first takes out the Indo-Afghan languages as outliers, then successively removes the Albanian and Greco-Armenian languages as outliers. In the tree obtained by the arithmetic-harmonic cut, all these outliers are grouped together. Notice that even at the successive branchings, the consensus ultrametric tree often produces a large number of outliers (see, e.g., the Indic and Iranian branches of Figure 1(c)).

4 A Molecular Classification of 60 Tumors

Validation of our methodology on the languages dataset has given us confidence in applying it to our primary problem domain, the classification of cancer samples. In this section, we show how our partitioning algorithm finds subtypes of human cancers. We study a dataset of 60 tumor cell-lines used in the National Cancer Institute's (NCI) screen for anti-cancer drugs. We use the gene expression of these cell-lines, given as a cDNA microarray with 6,830 genes for each cell-line.

[Figure 2 shows the classification tree (Node 1 to Node 6) over the 60 cell-lines, labelled by tumor type (non-small-lung, melanoma, breast, renal, CNS, ovarian, leukaemia, prostate, colon), together with the three genetic-signature heatmaps; the individual cell-line labels are omitted here.]

Fig. 2. (a) Classification of the NCI60 dataset using arithmetic-harmonic cuts. Genetic signatures with (b) 1101 genes for the first partition into Node 1 and Node 2, (c) 695 genes for the second partition into Node 3 and Node 4, and (d) 696 genes for the third partition into Node 5 and Node 6.

The analysis of this dataset was done by Ross et al. in 2000 [20], where a result of a hierarchical clustering for this dataset was first discussed. Their result shows that cell-lines from the same origin were grouped together in the case of leukaemia, melanoma, colon, renal and ovarian cancers, with a few exceptions. However, cell-lines derived from non-small lung carcinoma and breast tumors were distributed in multiple places, suggesting a heterogeneous nature.
Fig. 2(a) shows the result of applying the arithmetic-harmonic cut on this dataset. In Fig. 2(b), (c) and (d), we show the genetic signatures (the most differentially expressed genes on the two sides of the partition) of the first three partitions using the cut, with 1,101, 696 and 695 genes respectively. In the genetic signatures (computed with the method described in [21] and [22]), each row corresponds to a gene and each column corresponds to a tumor sample.

We will now compare the hierarchy generated by arithmetic-harmonic cuts to Ross et al.'s classification tree and to the consensus ultrametric tree, shown in Fig. 3. All clustering methods agree on the fact that some of the tumors (e.g., leukaemia, renal, melanoma, central nervous system) are relatively homogeneous, i.e., most samples of these tumors are grouped together, with a few exceptions. However, in Ross et al.'s solution and in the ultrametric tree, all melanoma samples except LOXIMVI are grouped with colon and leukaemia tumors (see the lower branches of both trees in Figure 3), whereas the arithmetic-harmonic cut shows a marked difference in its first division. It groups all melanoma tumor samples (including LOXIMVI) with CNS (central nervous system), renal, ovarian and some of the breast and lung tumor samples (see Node 2 in Fig. 2(a)). Since CNS is the closest group to the melanoma samples in this figure, we may infer that this clustering supports the hypothesis that CNS and melanoma tumors may share important genetic pathways with similar expression. We note that the CNS is a favorite site of metastasis in patients with advanced melanoma and that it is the first site of relapse in 15–20% of melanoma cases [23, 24]. Also notice that CNS metastases from melanoma and renal tumors are often misdiagnosed as primary brain tumors [25]. However, our literature survey did not reveal a close relationship of melanoma with colon or leukaemia tumors, as compared to its relation with CNS.
The three methods, however, behave slightly differently in clustering non-small-lung, breast and ovarian tumors. First of all, all clustering methods applied to NCI60 group together the ovarian tumor samples OVCAR-3, OVCAR-4 and IGROV1. However, the positions of OVCAR-5, SK-OV-3 and OVCAR-8 differ in the outcomes of the different methods, suggesting a possible heterogeneity of ovarian tumors. Also, it was suggested by Perou et al. [26] that there are six types of breast cancer. All clustering methods more or less agree with the fact that the breast tumors (HS-578T, BT-549, MDA-MB-231, MCF7, T-47D, MDA-MB435, MDAN) are scattered in 4-5 places. In all methods, breast tumors HS-578T, BT-549 and MDA-MB231 sit together with CNS/renal tumor samples; MCF7 and T-47D tumors are clustered with colon tumors; and MDA-N and MDA-MB435 tumors are grouped with melanoma tumor samples. This is a definite indication that breast cancer is a heterogeneous disease. Similarly, the non-small-lung-cancer samples are distributed across the trees in all three methods.
In the above comparison, we need to remember that ultrametric trees are largely used in cladistics and assume that all species evolve at a constant rate (cf. Sect. 2). Such an assumption suits rather well the universe of languages (they evolve according to mutual contact between sub-populations, assimilating words or phonemes from each other). However, unlike biological beings like bacteria or more complex life forms, the tumor cell-lines do not have a common ancestor. Tumors are defects in DNA, which cause malignant cell proliferation. Thus, the ultrametric approach may be susceptible to errors in the classification of cancer samples. Therefore, the position of the melanoma samples in the tree produced by the arithmetic-harmonic cut should not be considered incorrect. Actually, the position of melanoma with CNS suggests an interesting quest, that of finding common activated genetic pathways, as it is known that the brain is a primary site of metastasis in individuals with melanoma [23, 24].

[Figure 3 shows the two classification trees over the 60 cell-lines, labelled by tumor type; the individual cell-line labels are omitted here.]

Fig. 3. Classification of the NCI60 dataset from (a) Ross et al. and (b) the ultrametric tree

5 Conclusions
We proposed two new approaches for hierarchical clustering and showed the results of applying these methods to two very diverse datasets. The hierarchies we produce for both languages and cancer samples agree very well with existing data about these datasets, and they also raise some interesting questions.
The arithmetic-harmonic cut seems to correlate well with the results of the first component of the genetic variation provided by Cavalli-Sforza and his co-authors [19]. It indicates a branching into two major groups, with an earlier group moving towards Europe (actually, the advanced-farming hypothesis at work), later followed by another group moving in the same direction (evolving into the Greek, Albanian and Armenian languages), while another group moved south-east and later differentiated into the Iranian and Indic languages. It also suggests a commonality of Celtic, Baltic and Slavic (a hypothesis also raised in the past, and also supported by the consensus of the ultrametric trees). These differences, as well as other small ones, with respect to the solution provided by Gray and Atkinson seem to lie in branchings where the Bayesian posterior probability is low, and our methods agree where the posterior probability is high. The consensus of the ultrametric trees seems to suggest a single wave towards Europe, but its first branching into an Albanian group, followed by a second branching with Greek and Armenian in one subgroup, seems less plausible to us.
Overall, our results seem to indicate that it is important to use several hierarchical clustering algorithms and to analyze common subgroupings. In the case of tumor samples, this is indeed the most relevant outcome, as we do not have any guarantee that the samples share a common ancestor.
The question "Which tree is the best one?" might actually be highly irrelevant to the real problem at hand, as the consensus of these trees seems to be the most important outcome. Results from a number of other clustering algorithms on these datasets (which we were unable to show here for reasons of space) indicate that more research on robust algorithmic methods is needed for molecular subtype classification in cancer, and that validating the methodology on different problem settings is highly beneficial to its development.

References
1. Cotta, C., Moscato, P.: A memetic-aided approach to hierarchical clustering from distance matrices: Application to phylogeny and gene expression clustering. Biosystems 71 (2003) 75–97
2. Merz, P., Freisleben, B.: Fitness landscapes, memetic algorithms, and greedy operators for graph bipartitioning. Evolutionary Computation 8 (2000) 61–91
3. Battiti, R., Bertossi, A.: Differential greedy for the 0-1 equicut problem. In: Proc. of DIMACS Workshop on Network Design: Connectivity and Facilities (1997)
4. Festa, P., Pardalos, P., Resende, M.G.C., Ribeiro, C.C.: Randomized heuristics for the MAX-CUT problem. Optimization Methods and Software 7 (2002) 1033–1058
5. Wu, B., Chao, K.M., Tang, C.: Approximation and exact algorithms for constructing minimum ultrametric trees from distance matrices. Journal of Combinatorial Optimization 3 (1999) 199–211
6. Cotta, C.: Scatter search with path relinking for phylogenetic inference. European Journal of Operational Research 169 (2006) 520–532
7. Wang, J., Shan, H., Shasha, D., Piel, W.: Treerank: A similarity measure for nearest neighbor searching in phylogenetic databases. In: Proceedings of the 15th International Conference on Scientific and Statistical Database Management, Cambridge MA, IEEE Press (2003) 171–180
8. Cotta, C.: On the application of evolutionary algorithms to the consensus tree problem. In Gottlieb, J., Raidl, G., eds.: Evolutionary Computation in Combinatorial Optimization. Volume 3448 of Lecture Notes in Computer Science, Berlin, Springer-Verlag (2005) 58–67
9. Moilanen, A.: Searching for the most parsimonious trees with simulated evolution. Cladistics 15 (1999) 39–50
10. Cotta, C., Moscato, P.: Inferring phylogenetic trees using evolutionary algorithms. In Merelo, J., et al., eds.: Parallel Problem Solving From Nature VII. Volume 2439 of Lecture Notes in Computer Science, Springer-Verlag, Berlin (2002) 720–729
11. Mallory, J.P.: In Search of the Indo-Europeans: Language, Archaeology and Myth. Thames and Hudson (1989)
12. Renfrew, C.: Time-depth in historical linguistics. The McDonald Institute for Archaeological Research (2000) 413–439
13. Richards, M.: Tracing European founder lineages in the Near Eastern mtDNA pool. Am. J. Hum. Genet. 67 (2000) 1251–1276
14. Semino, O., et al.: The genetic legacy of Paleolithic Homo sapiens in extant Europeans: a Y chromosome perspective. Science 290 (2000) 1155–1159
15. Chikhi, L., Nichols, R., Barbujani, G., Beaumont, M.: Y genetic data support the Neolithic demic diffusion model. Proc. Natl. Acad. Sci. 99 (2002) 11008–11013
16. Gray, R.D., Atkinson, Q.D.: Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426 (2003) 435–439
17. Bryant, D., Filimon, F., Gray, R.: Untangling our past: Languages, trees, splits and networks. In Mace, R., Holden, C., Shennan, S., eds.: The Evolution of Cultural Diversity: Phylogenetic Approaches. UCL Press (2005) 69–85
18. Dyen, I., Kruskal, J.B., Black, P.: An Indo-European classification: A lexicostatistical experiment. Transactions of the American Philosophical Society, New Ser. 82 (1992) 1–132
19. Cavalli-Sforza, L.: Genes, peoples, and languages. Proceedings of the National Academy of Sciences of the United States of America 94 (1997) 7719–7724
20. Ross, D.T., Scherf, U., Eisen, M., Perou, C., Rees, C., Spellman, P., Iyer, V., Jeffrey, S., Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J.C., Lashkari, D., Shalon, D., Myers, T., Weinstein, J.N., Botstein, D., Brown, P.: Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24 (2000) 227–235
21. Cotta, C., Langston, M., Moscato, P.: Combinatorial and algorithmic issues for microarray data analysis. In: Handbook of Approximation Algorithms and Metaheuristics. Chapman and Hall (2005)
22. Hourani, M., Mendes, A., Berretta, R., Moscato, P.: A genetic signature for Parkinson's disease using rodent brain gene expression. In Keith, J., ed.: Bioinformatics. Humana Press (2006)
23. Ferraresi, V., Ciccarese, M., Zeuli, M., Cognetti, F.: Central nervous system as exclusive site of disease in patients with melanoma: treatment and clinical outcome of two cases. Melanoma Res. 15 (2005) 467–469
24. Marchetti, D., Denkins, Y., Reiland, J., Greiter-Wilke, A., Galjour, J., Murry, B., Blust, J., Roy, M.: Brain-metastatic melanoma: a neurotrophic perspective. Pathology Oncology Research 9 (2003) 147–158
25. Buell, J., Gross, T., Alloway, R., Trofe, J., Woodle, E.: Central nervous system tumors in donors: Misdiagnosis carries a high morbidity and mortality. Transplantation Proceedings 37 (2005) 583–584
26. Perou, C.M., Jeffrey, S.S., Rijn, M., Rees, C.A., Eisen, M.B., Ross, D.T., Pergamenschikov, A., Williams, C.F., Zhu, S.X., Lee, J.C.F., Lashkari, D., Shalon, D., Brown, P.O., Botstein, D.: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. 96 (1999) 9212–9217
Robust SVM-Based Biomarker Selection with Noisy Mass Spectrometric Proteomic Data

Elena Marchiori (1), Connie R. Jimenez (2), Mikkel West-Nielsen (3), and Niels H.H. Heegaard (3)

(1) Department of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
[email protected]
(2) Department of Molecular and Cellular Neurobiology, Vrije Universiteit Amsterdam, The Netherlands
[email protected]
(3) Department of Autoimmunology, Statens Serum Institut, Copenhagen, Denmark
[email protected], [email protected]

Abstract. Computational analysis of mass spectrometric (MS) proteomic data from sera is of potential relevance for diagnosis, prognosis, choice of therapy, and study of disease activity. To this aim, feature selection techniques based on machine learning can be applied for detecting potential biomarkers and biomarker patterns. A key issue concerns the interpretability and robustness of the output results given by such techniques. In this paper we propose a robust method for feature selection with MS proteomic data. The method consists of the sequential application of a filter feature selection algorithm, RELIEF, followed by multiple runs of a wrapper feature selection technique based on support vector machines (SVM), where each run is obtained by changing the class label of one support vector. The frequencies with which features are selected over the runs are used to identify features which are robust with respect to perturbations of the data. This method is tested on a dataset produced by a specific MS technique, called MALDI-TOF MS. Two classes have been artificially generated by spiking. Moreover, the samples have been collected at different storage durations. Leave-one-out cross validation (LOOCV) applied to the resulting dataset indicates that the proposed feature selection method is capable of identifying highly discriminatory proteomic patterns.

1 Introduction
Feature selection (FS) for classification can be formulated as a combinatorial optimization problem: finding the feature set maximizing the predictive performance of the classifier trained from these features. FS is a major research topic in supervised learning and data mining [10, 16, 12]. For the sake of the learning performance, it is highly desirable to discard irrelevant features prior to learning, especially when the number of available features significantly outnumbers the number of samples, as in biomedical studies. Because of its computational intractability, the FS problem has been tackled by means of heuristic algorithms


based on statistics and machine learning [10, 20, 22]. Biological experiments from laboratory technologies like microarray and mass spectrometry generate data with a very high number of variables (features), in general much larger than the number of samples. Therefore FS provides a fundamental step in the analysis of such types of data [27]. Ideally, one would like to detect potential biomarkers and biomarker patterns that both highly discriminate diseased from healthy samples and are biologically interpretable. However, as substantiated in recent publications like [19, 3, 21], the reliability and reproducibility of results depend on the particular way samples are handled [26], on the instability of the laboratory technology, as well as on the specific techniques employed in the computational analysis.
In this paper we consider FS for classification with MS proteomic data from sera. Various machine learning and statistical techniques for feature selection have been applied to proteomic data [15, 17, 8, 13, 5, 6, 7] in order to detect potential tumor biomarkers for (early) cancer diagnosis (clinical proteomics). A summary of actual challenges and a critical assessment of clinical proteomics can be found, e.g., in [21].
Here we propose a new method for FS with MS proteomic data. The goal is to identify potential biomarker patterns that not only highly discriminate diseased and healthy samples, but are also robust with respect to perturbations of the data. The method consists of three main steps. First, a popular filter feature selection algorithm, RELIEF, is used as pre-processing in order to reduce the number of considered features. Next, multiple runs of linear SVM are considered, where at each run a perturbed training set is used, obtained by changing the class label of one support vector. Each run generates a large subset of selected features. The frequency (over the runs) of selection of the features is used to choose the most robust ones, namely those with the highest frequency. Finally, the resulting features are transformed into feature intervals, by considering the ordering of the features, where neighbouring features refer to peptides of similar masses.
The method generates a subset of feature intervals, where both the number of intervals and the number of features are automatically selected. These intervals describe potential biomarker patterns.
We analyze experimentally the performance of the method on a real-life dataset with controlled insertion of noisy samples (long storage time samples) and relevant features (spiked molecules) [26]. The results indicate that the method performs robust feature selection, by selecting features corresponding to m/z measurements near to (the average of the m/z values of the peaks of) the spiked molecules, and by misclassifying only 1 noisy sample (with long storage time).

2 Background
This section briefly describes the machine learning techniques used in the proposed feature selection method.

2.1 Linear Support Vector Machines

In linear SVM-based binary classification [25, 2], samples of two classes are linearly separated by means of a maximum margin hyperplane, that is, the hyperplane that maximizes the sum of the distances between the hyperplane and its closest points of each of the two classes (the margin). When the classes are not linearly separable, a variant of SVM, called soft-margin SVM, is used. This SVM variant penalizes misclassification errors and employs a parameter (the soft-margin constant C) to control the cost of misclassification.
Training a linear SVM classifier amounts to solving the following constrained optimization problem:

    \min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i    s.t.    y_i (w \cdot x_i + b) \ge 1 - \xi_i,  \xi_i \ge 0,

with one constraint for each training sample x_i. Usually the dual form of the optimization problem is solved:

    \min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{m} \alpha_i

such that 0 \le \alpha_i \le C and \sum_{i=1}^{m} \alpha_i y_i = 0. SVM requires O(m^2) storage and O(m^3) time to solve.
The resulting decision function f(x) = w \cdot x + b has weight vector w = \sum_{k=1}^{m} \alpha_k y_k x_k. Samples x_i for which \alpha_i > 0 are called support vectors, since they uniquely define the maximum margin hyperplane. Samples with \alpha_i = C are misclassified.
Maximizing the margin allows one to minimize bounds on the generalization error. Because the size of the margin does not depend on the input dimension, SVMs are robust with respect to data with a high number of features. However, SVMs are sensitive to the presence of (potential) outliers (cf. [11] for an illustrative example), due to the regularization term penalizing misclassification (which depends on the choice of C).
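
As a quick illustration (not from the paper; scikit-learn and NumPy are assumed available), the following sketch trains a linear soft-margin SVM on toy data and checks that the primal weight vector coincides with the dual expansion w = sum_k alpha_k y_k x_k over the support vectors:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (20, 5)), rng.normal(1.0, 1.0, (20, 5))])
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # dual_coef_ holds alpha_k * y_k for the support vectors only, so the
    # product below reconstructs w = sum_k alpha_k y_k x_k
    w = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w, clf.coef_))   # True: dual expansion equals primal w
    print(len(clf.support_))           # number of support vectors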

2.2 Variable Selection Techniques

One can distinguish three main approaches for feature ranking/selection: wrapper, filter and embedded.
In the wrapper approach, features are selected/ranked by taking into account their contribution to the performance of a given type of classifier (e.g., SVM).
In the filter approach, the selection/ranking of features is not (directly) biased towards the performance of a specific type of classifier, but is based on an evaluation criterion quantifying how well features (or feature subsets) discriminate the two classes.
Finally, in the embedded approach, feature selection/ranking is part of the training procedure of a classifier, as in decision trees.
The effectiveness of these approaches depends on the application domain. The wrapper approach is favoured in many works since the selection of the features is directly linked to the performance of a chosen type of classifier. On the other hand, algorithms based on the filter approach, like RELIEF, are in general more efficient and have been successfully applied to real-life domains [28]. Techniques based on the embedded approach provide a global approach, where feature selection is a by-product of the training algorithm for classification.

SVM Feature Selection (SVMFS)

The weights wi of a linear SVM classifier provide information about feature relevance, where a bigger weight value implies higher feature relevance. In this paper a feature xi is scored by means of wi^2 [11]. Feature weights, obtained by training a linear SVM on the training set, are used in a scoring function for ranking features as described above. The algorithm for feature selection based on SVM is illustrated below (in pseudo-code).

SVMFS
%input: training set X, number of features
to be selected M
%output: subset Selected of M features
train linear classifier with SVM on X;
score features using the squared value of
the weights of the classifier;
Selected = M features with highest score;
return Selected;
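
A minimal Python rendering of the SVMFS pseudo-code above, assuming scikit-learn (the function name and defaults are our own):

    import numpy as np
    from sklearn.svm import LinearSVC

    def svmfs(X, y, M, C=1.0):
        """Return indices of the M features with the highest squared SVM weight."""
        clf = LinearSVC(C=C).fit(X, y)        # train linear classifier on X
        scores = clf.coef_.ravel() ** 2       # score features by w_i^2, as in [11]
        return np.argsort(scores)[::-1][:M]   # M highest-scoring feature indices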

RELIEF
RELIEF [23, 14] is a filter-based feature ranking algorithm that assigns a score to features based on how well the features separate training samples from their nearest neighbours from the same and from the opposite class.
The algorithm iteratively constructs a weight vector, which is initially zero. At each iteration, RELIEF selects one sample, adds to the weights the difference between that sample and its nearest sample from the opposite class (called the nearest miss), and subtracts the difference between that sample and its nearest neighbour from the same class (called the nearest hit). The iterative process terminates when all training samples have been considered. The resulting weight of a feature is divided by its range of values (computed using only the training set). Subsampling can be used to improve efficiency in the case of a large training set. The pseudo-code of the RELIEF algorithm used in our experiments is given below.

RELIEF
%input: training set X
%output: Ranking of features
nr_feat = total number of features;
weights = zero vector of size nr_feat;
for all samples exa in training set do
hit(exa) = nearest neighbour of exa
from same class;
miss(exa) = nearest neighbour of exa
from opposite class;
weights = weights-abs(exa-hit(exa))+
abs(exa - miss(exa));
end;
scale each weight using range of corresponding m/z
value intensity over the training set;
Ranking = obtained by sorting weights
in decreasing order;
return Ranking;
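
For concreteness, here is a minimal NumPy sketch of the RELIEF pseudo-code above (our own rendering; it uses L1 distances for the nearest-neighbour search, which the pseudo-code leaves unspecified):

    import numpy as np

    def relief(X, y):
        """Rank features as in the pseudo-code; X: samples x features, y: labels."""
        n, d = X.shape
        weights = np.zeros(d)
        for i in range(n):
            dist = np.abs(X - X[i]).sum(axis=1)    # L1 distance to every sample
            dist[i] = np.inf                       # never pick the sample itself
            same = np.flatnonzero(y == y[i])
            opp = np.flatnonzero(y != y[i])
            hit = same[np.argmin(dist[same])]      # nearest hit
            miss = opp[np.argmin(dist[opp])]       # nearest miss
            weights += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        weights /= X.max(axis=0) - X.min(axis=0)   # scale by each feature's range
        return np.argsort(weights)[::-1]           # ranking, most relevant first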

3 Mass Spectrometric Proteomic Data

The MS proteomic dataset considered here is obtained by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS), a recent laboratory technology which offers protein profiling at high resolution and throughput.
[Figure 1 shows two intensity profiles for the same person (top panel: spiked, t=0; bottom panel: spiked, t=48), with the m/z range 0-16000 on the x-axis and intensities from 0 to 0.9 on the y-axis.]

Fig. 1. A MALDI-TOF MS spiked sample for one person at storage duration time t=0 (top) and t=48 (bottom): the x-axis contains (identifiers of) the m/z values of peptides and the y-axis their concentration

It measures the relative abundance of ionisable low molecular weight peptides in complex mixtures, like serum (cf. e.g. [4]). Because it is relatively inexpensive and noninvasive, it is considered a promising new technology for classifying disease status and for tumor biomarker detection.
MALDI-TOF MS technology produces a graph of the relative abundance of ionized peptides (y-axis) versus their mass-to-charge (m/z) ratios (x-axis) (see Figure 1). The m/z ratios are proportional to the peptide masses, but the technique is not able to identify individual peptides, because different peptides may have the same mass and because of limitations in the m/z resolution.
Given proteomic profiles from healthy and diseased individuals, the goal is to build a classifier for tumor diagnostics and to identify those proteins that are potentially involved in the disease.
The dataset considered in this study consists of 96 samples obtained from human sera of 8 adult persons. Spiking has been used to produce two classes, and 6 different storage durations (t = 0, 1, 4, 8, 24, 48 hours) have been used to produce noisy data. Each profile contains 22572 m/z measurements. Adjacent m/z measurements correspond to peptides with similar mass versus charge; thus the ordering on the x-axis has a biological meaning. This ordering will be used in the FS method described in the next section.
The complete procedure for generating such data is described in detail in [26], where this and other datasets have been analyzed for the first time. Calibration standards containing seven peptides and four proteins were used as artificial markers (Bruker Daltonik) and consisted of the following molecules, with average molecular masses given in parentheses: angiotensin II (1047.20), angiotensin I (1297.51), substance P (1348.66), bombesin (1620.88), ACTH clip 1-17 (2094.46), ACTH clip 18-39 (2466.73), somatostatin 28 (3149.61), insulin (5734.56), ubiquitin I (8565.89), cytochrome c (6181.05) and myoglobin (8476.77). However, no signal was recovered for the following four spiked molecules, possibly due to losses during the laboratory sample processing procedure: substance P (1348.6), ACTH clip 1-17 (2094.46), cytochrome c (6181.05), and myoglobin (8476.77) [26].
In [18], we used this dataset for comparing the performance of two popular feature selection techniques, RELIEF and Recursive Feature Selection with linear SVM, and the performance of two classification techniques, SVM and K nearest neighbours. The results indicated that, in general, better predictive performance does not correspond to better biological interpretability of the selected features (m/z values).

4 The Method

We propose a FS method consisting of three steps, called the Filter, Wrapper, and Interval steps (FWI). The method is illustrated below in pseudo-code.

FWI
%input: training set X
%number M of features to be selected by RELIEF
%number N (N<M) of features to be selected by SVMFS


%SVM parameter C
%output: Set Int of feature intervals

%FILTER step:
F = M features selected with RELIEF;

%WRAPPER step:
SV = set of support vectors obtained by training
SVM on X;
for x in SV
T = X with label of x changed;
F(x) = N features selected by SVMFS applied to T;
end;
count = maximum number of times that a feature
occurs in the sequence of F(x), x in SV;
W = features occurring count times in the sequence
of F(x), x in SV;

%INTERVAL step:
Cl = {C1, .., Cn} clustering of W, with
Ci = (w(1),.., w(ni))
w(1)<..< w(ni)
s.t. idx(w(j+1))-idx(w(j)) <= 2
for all j in [1..ni];
Int = {Int_1, .., Int_n} intervals from Cl, with
Int_i= {w in Features s.t. w >= min(Ci)
and w<= max(Ci)} for i in [1,n];
return Int;
Let us explain the steps performed by FWI in a bit more detail.
FWI starts by skimming the number of features in the Filter (F) step. Here RELIEF is employed in order to select M features. In the F step one typically retains about M = 5000 m/z measurements from the initial 22572.
In the Wrapper (W) step, robust wrapper-based feature selection is performed using the features that passed the Filter selection: the support vectors of an SVM trained on all the features are used for perturbing the data. More precisely, multiple runs of SVMFS are performed, where at each run the class label of one support vector is changed. Each run generates a set of N features (typical value N = 1000). The resulting sequence of feature sets is then considered. The maximum number count of times a feature occurs in the sequence is computed, and all features occurring count times in the sequence are selected.
Finally, in the Interval (I) step, the selected m/z features are segmented as follows. The sequence of features in W, ordered by m/z values, is segmented in such a way that each pair of consecutive features in one segment Ci is at most two positions apart in the sequence of all features ordered by m/z values (in the above pseudo-code, idx(w) gives the position of feature w in the ordered sequence of all features). Each sequence Ci then generates one interval Ii containing all m/z measurements between the first and last element of Ci; a runnable sketch of this step is given below.
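
The following sketch is our own reading of the Interval step (function and variable names are hypothetical): it clusters the positions of the W-step features and expands each cluster into an interval of m/z indices.

    import numpy as np

    def interval_step(selected_positions):
        """Cluster selected feature positions (in m/z order) and expand to intervals."""
        pos = np.sort(np.asarray(selected_positions))
        clusters, current = [], [pos[0]]
        for p in pos[1:]:
            if p - current[-1] <= 2:          # at most two positions apart
                current.append(p)
            else:
                clusters.append(current)
                current = [p]
        clusters.append(current)
        # each cluster Ci becomes the interval [min(Ci), max(Ci)] of m/z indices
        return [np.arange(c[0], c[-1] + 1) for c in clusters]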

The F step can be viewed as a kind of preprocessing, while steps W and I are heuristics for overcoming the variability due to perturbations of the data that can originate from noise. The method requires the user to specify three parameters: the sizes M and N of the feature subsets selected by RELIEF and SVMFS, respectively, and the SVM parameter C. Note that the values of M and N can be chosen to be fairly big; the final, smaller size of the feature subset selected by FWI is determined automatically by the algorithm.

5 Numerical Experiments

In order to assess the effectiveness of the modules of FWI, we consider the following four algorithms:

1. Wrapper feature selection (W), obtained by applying the W step of FWI.
2. Wrapper Interval (WI), obtained by applying step W followed by I.
3. Filter Wrapper (FW), obtained by applying step F followed by W.
4. The complete Filter Wrapper Interval algorithm FWI.

Because of the small size of the data, LOOCV is used for comparing the performance of the four algorithms (cf., e.g., [9]). At each leave-one-out run, all but one element of the data is used as training set, and the left-out element is used for testing the predictive performance of the resulting classifier. Observe that the 96 samples of the considered dataset are not independent of one another, as required for a correct application of LOOCV, because they are generated from 8 persons, and neither the 6 different storage times nor the spiking guarantee the production of independent samples. Nevertheless, the corresponding bias introduced in the LOOCV procedure affects the results of each algorithm, hence the results can still be used for comparing the performance of the algorithms. However, such bias possibly affects the estimation of the generalization error.
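
A minimal sketch of this LOOCV protocol (scikit-learn assumed; select_features is a hypothetical stand-in for the per-fold feature selection, e.g. the FWI pipeline):

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import LinearSVC

    def loocv_accuracy(X, y, select_features):
        """select_features(X, y) -> feature indices; re-run inside every fold."""
        hits = []
        for train, test in LeaveOneOut().split(X):
            feats = select_features(X[train], y[train])   # per-fold selection
            clf = LinearSVC().fit(X[train][:, feats], y[train])
            hits.append(clf.predict(X[test][:, feats])[0] == y[test][0])
        return float(np.mean(hits))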
Table 1 summarizes the LOOCV performance results of the experiments. We use accuracy, sensitivity and specificity as quality measures for comparing the algorithms. Other measures, like the AUC (Area Under the ROC Curve), could be used as well; as illustrated e.g. in [1], there is good agreement between accuracy and AUC as to the ranking of the performance of classification algorithms.
The results indicate an improvement in the predictive performance across the four algorithms, with the best accuracy achieved by FWI. The misclassified samples over all the LOOCV runs have storage time equal to 24 or 48 hours, indicating that longer storage time negatively affects the classification of proteomic samples.

Table 1. Results: LOOCV accuracy, sensitivity and specificity (with standard deviation in brackets)

Method  Accuracy         Sensitivity      Specificity
W       0.9479 (0.2234)  0.9375 (0.2446)  0.9583 (0.2019)
WI      0.9583 (0.2019)  0.9583 (0.2019)  0.9583 (0.2019)
FW      0.9792 (0.1436)  1.0000 (0.0000)  0.9583 (0.2019)
FWI     0.9896 (0.1021)  1.0000 (0.0000)  0.9792 (0.1443)

[Figure 2 plots, for each m/z measurement (x-axis, 1000-10000), the number of times it is selected over the LOOCV runs (y-axis, 0-100); circles on the x-axis mark the spiked molecules.]

Fig. 2. Number of m/z measurements selected over LOOCV runs

Algorithm W misclassifies a total of 5 samples: 2 spiked at t=48, 1 spiked at t=24, and 2 normal at t=24. WI improves on this by correctly classifying the one spiked sample with t=24. Furthermore, FW misclassifies a total of 2 samples, the normal sample at t=24 (like W and WI) and one normal sample at t=48, while it correctly classifies all spiked samples. Finally, FWI misclassifies only 1 sample in total, the normal one at t=24, like W, WI, and FW.
Each algorithm selects about 120 features at each run, which are distributed (in the Interval step) into about 15 clusters.
We further analyze the results of FWI. Figure 2 shows the m/z measurements versus the number of times they are selected over all LOOCV runs. On the x-axis the locations of the spiked molecules are indicated by circles.

[Figure 3 overlays a typical pattern profile generated by FWI with the mean normal and mean spiked profiles over the m/z range 1000-9000; the spiked molecules are marked.]

Fig. 3. A typical m/z selection generated by FWI, the corresponding values of the mean spiked and normal profiles at the selected m/z values, and the spiked molecules

The plot indicates that m/z measurements in proximity of the spiked molecules are more often selected over the LOOCV runs, except for m/z measurements in the neighbourhood of 4000 and 5000, which do not correspond to m/z measurements of spiked molecules. In the absence of additional information (e.g. tandem mass spectra yielding sequence tags) it is difficult to know what these peak values represent. One possibility is that the higher molecular weight spiked molecules are partially degraded in serum, and these peaks are proteolytically cleaved peptides from larger proteins (due to long storage times at room temperature) in the sample itself. However, this possibility has not yet been examined in depth. Figure 3 shows a typical set of m/z measurements generated by FWI, and the mean value of the intensities of the spiked and normal samples at the selected m/z measurements.
In conclusion, the results indicate that FWI performs robust m/z selection, where the selected features are close to the spiked molecules and the misclassification error is close to zero, with misclassification of only noisy (that is, long storage time) samples.

6 Conclusion

We have proposed a method for robust feature selection with MS proteomic data. The method can be considered a small step towards the development of a feature selection methodology addressing the specific issues of the underlying laboratory technology. In particular, in this paper we addressed the issue of performing robust feature selection in the presence of noisy samples which perturb the data and negatively affect sample classification. The W and I steps of the proposed FWI method provide heuristics for tackling this problem.
This issue is related to broader questions about the reproducibility and validity of results in discovery-based omics research [21, 24]. In a special session on genomics of a recent issue of Science, an essay entitled "Getting the noise out of gene arrays" noted that "[t]housands of papers have reported results obtained using gene array ... But are these results reproducible?" [19]. A controversy about the reproducibility and validity of results from MS proteomic data is ongoing [3, 21], and the path to achieving such ambitious goals appears still long.

Acknowledgments
We would like to thank the anonymous referees for their constructive comments,
and Jan Rutten for stimulating discussions.

References
1. A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(6):1145–1159, 1997.
2. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
3. E.P. Diamandis. Analysis of serum proteomic patterns for early cancer diagnosis: Drawing attention to potential problems. Journal of the National Cancer Institute, 96(5):353–356, 2004.
4. H.J. Issaq et al. SELDI-TOF MS for diagnostic proteomics. Anal. Chem., 75(7):148A–155A, 2003.
5. E.F. Petricoin et al. Serum proteomic patterns for detection of prostate cancer. Journal of the National Cancer Institute, 94(20):1576–1578, 2002.
6. E.F. Petricoin et al. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359(9306):572–577, 2002.
7. Y. Qu et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin. Chem., 48(10):1835–1843, 2002.
8. W. Zhu et al. Detection of cancer-specific markers amid massive mass spectral data. PNAS, 100(25):14666–14671, 2003.
9. T. Evgeniou, M. Pontil, and A. Elisseeff. Leave one out error, stability, and generalization of voting combinations of classifiers. Mach. Learn., 55(1):71–97, 2004.
10. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003. Special Issue on variable and feature selection.
11. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1-3):389–422, 2002.
12. G.H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In International Conference on Machine Learning, pages 121–129, 1994.
13. K. Jong, E. Marchiori, M. Sebag, and A. van der Vaart. Feature selection in proteomic pattern data with support vector machines. In IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2004.
14. K. Kira and L.A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Tenth National Conference on Artificial Intelligence, pages 129–134, 1992.
15. J. Li, Z. Zhang, J. Rosenzweig, Y.Y. Wang, and D.W. Chan. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clinical Chemistry, 48(8):1296–1304, 2002.
16. H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection: a Data Mining Perspective. International Series in Engineering and Computer Science. Kluwer, 1998.
17. H. Liu, J. Li, and L. Wong. A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Informatics, 13:51–60, 2002.
18. E. Marchiori, N.H.H. Heegaard, M. West-Nielsen, and C.R. Jimenez. Feature selection for classification with proteomic data of mixed quality. In Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 385–391, 2005.
19. E. Marshall. Getting the noise out of gene arrays. Science, 306:630–631, 2004. Issue 5696.
20. Il-Seok Oh, Jin-Seon Lee, and Byung-Ro Moon. Local search-embedded genetic algorithms for feature selection. In 16th International Conference on Pattern Recognition (ICPR'02). IEEE Press, 2002.
21. D.F. Ransohoff. Lessons from controversy: Ovarian cancer screening and serum proteomics. Journal of the National Cancer Institute, 97:315–319, 2005.
22. M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and A.K. Jain. Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2):164–171, 2000.
23. L.A. Rendell and K. Kira. A practical approach to feature selection. In International Conference on Machine Learning, pages 249–256, 1992.
24. S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. The Lancet, 365(9458):488–492, 2005.
25. V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
26. M. West-Nielsen, E.V. Hogdall, E. Marchiori, C.K. Hogdall, C. Schou, and N.H.H. Heegaard. Sample handling for mass spectrometric proteomic investigations of human sera. Analytical Chemistry, 77(16):5114–5123, 2005.
27. E.P. Xing. Feature selection in microarray analysis. In A Practical Approach to Microarray Data Analysis. Kluwer Academic, 2003.
28. L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In ICML, pages 856–863, 2003.
On the Use of Variable Complementarity for Feature Selection in Cancer Classification

Patrick E. Meyer and Gianluca Bontempi

Université Libre de Bruxelles, (CP 212), 1050 Bruxelles, Belgium
{pmeyer, gbonte}@ulb.ac.be
http://ulb.ac.be/di/mlg/

Abstract. The paper presents an original filter approach for effective feature selection in classification tasks with a very large number of input variables. The approach is based on the use of a new information theoretic selection criterion: the double input symmetrical relevance (DISR). The rationale of the criterion is that a set of variables can return information on the output class that is higher than the sum of the information of each variable taken individually. This property will be made explicit by defining the measure of variable complementarity. A feature selection filter based on the DISR criterion is compared in theoretical and experimental terms to recently proposed information theoretic criteria. Experimental results on a set of eleven microarray classification tasks show that the proposed technique is competitive with existing filter selection methods.

1 Introduction
Statisticians and data-miners are used to building predictive models and inferring dependencies between variables on the basis of observed data. However, in a lot of emerging domains, like bioinformatics, they are facing datasets characterized by a very large number of features (up to several thousands), a large amount of noise, non-linear dependencies and, often, only several hundreds of samples. In this context, the detection of functional relationships as well as the design of effective classifiers appears to be a major challenge. Recent technological advances, like microarray technology, have made it possible to simultaneously interrogate thousands of genes in a biological specimen. It follows that two classification problems commonly encountered in bioinformatics are how to distinguish between tumor classes and how to predict the effects of medical treatments on the basis of microarray gene expression profiles. If we formalize this prediction task as a supervised classification problem, we realize that we are facing a problem where the number of input variables, represented by the number of genes, is huge (around several thousands) and the number of samples, represented by the clinical trials, is very limited (around several tens). Because of well-known numerical and statistical accuracy issues, it is typically necessary to reduce the number of variables before starting a learning procedure. Furthermore, selecting features (i.e. genes) can increase the intelligibility of a model while at the

same time decreasing measurement and storage requirements [1]. A number of experimental studies [2, 3, 4] have shown that irrelevant and redundant features can dramatically reduce the predictive accuracy of models built from data.
Feature selection is a topic of machine learning whose objective is selecting, among a set of input variables, the ones that will lead to the best predictive model. Two well-known approaches in feature selection combine a search strategy with a stochastic evaluation function: the filter approach and the wrapper approach (see [3, 2]). In the wrapper approach, the evaluation function is the validation outcome (e.g. by leave-one-out) of the learning algorithm itself. In the filter approach, examples of evaluation functions are probabilistic distance, interclass distance, information theoretic or probabilistic dependence measures. These measures are often considered as intrinsic properties of the data, because they are calculated directly on the raw data instead of requiring a learning model that smoothes distributions or reduces the noise.
This paper focuses on the use of filter techniques for feature selection in supervised classification tasks. In particular, we present an original filter approach based on a new information theoretic selection criterion, called the double input symmetrical relevance (DISR). This criterion combines two well-known intuitions of feature selection: first, a combination of variables can return more information on the output class than the sum of the information returned by each of the variables taken individually. This property will be made explicit by defining the notion of variable complementarity. Secondly, in the absence of any further knowledge on how subsets of d variables should combine, it is intuitive to assume a combination of the best performing subsets of d - 1 variables as the most promising set. This intuition will be made formal by the computation of a lower bound on the information of a subset of variables, expressed as the average of the information of all its sub-subsets.
The DISR criterion can be used to select, among a finite number of alternative subsets, the one expected to return the maximum amount of information on the output class. As we intend to benchmark its performance with respect to state-of-the-art information theoretic criteria, we define an experimental session where several filter algorithms with different selection criteria but the same search strategy are compared. In our experiments we compare the filter based on DISR with four state-of-the-art approaches: a Ranking algorithm [5] and three filters based on the same search strategy, the forward selection. The three state-of-the-art criteria are the Relevance criterion [6], the Minimum Redundancy Maximum Relevance criterion [7] and the Conditional Mutual Information Maximization criterion [8]. The assessment of the different filters is obtained by measuring the classification accuracy of several learning algorithms which adopt as inputs the set of variables returned by each of the filter methods. For our benchmark purposes, we use eleven public-domain multi-class microarray gene expression datasets. The experimental results show that the proposed technique is competitive with existing filter selection methods.

2 Information Theoretic Notions for Feature Selection

This paper deals with supervised multi-class classification. We will assume either that all the variables are discrete or that they can be made discrete by a quantization step. Hereafter, we will denote by Y the discrete output random variable representing the class and by X the multi-dimensional discrete input random variable.
In qualitative terms, feature selection boils down to selecting, among a set of potential variables, the most relevant ones. At the same time, it would be appealing that these selected variables are not redundant. The notions of relevance and redundancy can be made more formal thanks to the use of dependency measures [2, 9].
Let us first introduce some concepts of information theory:

Definition 1. [10] The conditional entropy of Y given Z, denoted by H(Y|Z), is defined as:

    H(Y|Z) = - \sum_{y \in Y} \sum_{z \in Z} p(y, z) \log p(y|z)    (1)

A closely related quantity is the conditional mutual information I(X; Y|Z), defined next.

Definition 2. [10] The conditional mutual information of the random variables X and Y given Z is defined as:

    I(X; Y|Z) = H(X|Z) - H(X|Z, Y)    (2)
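For discrete samples, both quantities can be estimated by plug-in empirical frequencies; the following sketch (ours, not from the paper) implements Definitions 1 and 2 directly:

    from collections import Counter
    from math import log

    def cond_entropy(x, z):
        """Plug-in estimate of H(X|Z) from paired discrete samples (Eq. 1)."""
        n = len(x)
        p_xz = Counter(zip(x, z))
        p_z = Counter(z)
        # p(x|z) = count(x, z) / count(z)
        return -sum((c / n) * log(c / p_z[xz[1]]) for xz, c in p_xz.items())

    def cond_mi(x, y, z):
        """I(X;Y|Z) = H(X|Z) - H(X|Z,Y) as in Definition 2."""
        zy = list(zip(z, y))
        return cond_entropy(x, z) - cond_entropy(x, zy)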

These definitions allow us to introduce the following measure of relevance proposed by [6]:

Definition 3 (Relevance). Consider three random variables X, Y and Z and their joint probability distribution p_{X,Y,Z}(x, y, z). If H(Y|Z) = 0, then the variable relevance of X to Y given Z, denoted by r(X; Y|Z), is zero. Else, if H(Y|Z) ≠ 0, the variable relevance of X to Y given Z is defined as:

    r(X; Y|Z) = I(X; Y|Z) / H(Y|Z)    (3)

According to this definition, the relevance is a function 0 ≤ r(X; Y|Z) ≤ 1 that measures the relative reduction of the uncertainty of Y provided by X once the value of Z is given.
In the following, we rather consider as measure of relevance the classical (non-normalized) mutual information, i.e. I(Xi; Y|XS), where Xi denotes an input variable and XS a subset of variables not containing Xi.
The formalization of the notion of relevance makes explicit one of the major challenges in feature selection: the mutual information between an input variable Xi and the output class Y is conditionally dependent. This means that an input

variable Xi having a significant relevance to Y given XS can have a null relevance when conditioned on another variable. The following two examples may serve to better illustrate this idea.

Example 1. Consider the four random variables Y, Xi, XS and XM such that Y = Xi + XS and XM = Xi/2. Given XS, Xi has a large relevance (I(Xi; Y|XS) = H(Y|XS)), while its relevance goes to zero when it is conditioned on XM (i.e. I(Xi; Y|XM) = 0). In this example, Xi and XM both have a high mutual information with the output Y but a low conditional mutual information when conditioned on the other variable.

The next example shows that the relevance can also increase by conditioning.

Example 2. Consider the three random variables Y, Xi and XS such that Y and Xi are independent and Y = Xi + XS. In this case the mutual information of Xi with Y is zero (I(Xi; Y) = 0), whereas the conditional mutual information increases when conditioned on XS (i.e. I(Xi; Y|XS) = H(Y|XS)).
These examples show that it is hard to predict, in terms of relevance, the joint effect of several input variables on an output variable Y.
As shown in Example 1, the mutual information I(X_{i,j}; Y) of a set {Xi, Xj} (aka the joint mutual information [11]) can be smaller than the sum of the relevances taken separately. In generic terms, we could describe these two variables as redundant for the task of classifying Y. For a formal definition of redundancy we refer the reader to [9, 7]. Also, as shown in Example 2, it could happen that two variables jointly have a larger mutual information with Y than when they are considered separately. In this case we say that the two variables are complementary. Note that variable complementarity should warn us against eliminating variables with null mutual information with the output (i.e. I(Xi; Y) = 0), since the joint information of two random variables I(X_{i,j}; Y) can be higher than the sum of their individual informations I(Xi; Y) and I(Xj; Y).
Variable complementarity was underlined experimentally in [1] and explained in [12] as a second order term of the Möbius representation of the mutual information. It is useful to define the notion of complementarity between two variables with respect to an output Y as the difference between the joint mutual information and the sum of the individual mutual informations. We then introduce the following measure:

Definition 4. The complementarity of two random variables Xi and Xj with respect to an output Y is defined as:

    C_Y(Xi, Xj) = I(X_{i,j}; Y) - I(Xi; Y) - I(Xj; Y)    (4)

where X_{i,j} = {Xi, Xj}.


We dene two variables as complementary if their measure of complementarity
with respect to Y is positive. Note that if the complementarity is zero, the
On the Use of Variable Complementarity for Feature Selection 95

two variables are independent. We remark also that a negative value of the
complementarity can be taken as a measure of the redundancy of a pair of
variables for the task of predicting Y .
The example 2 is an illustration of complementarity between Xi and XS since
in that case:
I(Xi,S ; Y ) > I(Xi ; Y ) +I(XS ; Y ) (5)
  
0

Another illustration of complementarity is given by the well-known XOR problem [2, 1]:

Example 3 (XOR problem).
    X1:            1 1 0 0
    X2:            1 0 1 0
    Y = X1 ⊕ X2:   0 1 1 0

We see that X1 and X2 have a null mutual information with the output once they are taken individually (i.e. I(X1; Y) = 0, I(X2; Y) = 0). However, when they are taken together, the mutual information I(X_{1,2}; Y) = H(Y) > 0 of the subset is positive.
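
The following sketch (ours, not from the paper) verifies Example 3 numerically with a plug-in estimate of the mutual information: both individual informations vanish while the joint information equals H(Y) = 1 bit, so the complementarity of Definition 4 is positive.

    from collections import Counter
    from math import log2

    def mi(xs, ys):
        """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
        n = len(xs)
        pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
        return sum((c / n) * log2(c * n / (px[x] * py[y]))
                   for (x, y), c in pxy.items())

    X1 = (1, 1, 0, 0)
    X2 = (1, 0, 1, 0)
    Y = tuple(a ^ b for a, b in zip(X1, X2))                 # Y = X1 xor X2

    c_y = mi(list(zip(X1, X2)), Y) - mi(X1, Y) - mi(X2, Y)   # Definition 4
    print(mi(X1, Y), mi(X2, Y), c_y)                         # 0.0 0.0 1.0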
Complementarity explains why an apparently irrelevant combination of variables can eventually perform efficiently in a learning task. In the following section, we proceed to a critical survey of the information theoretic approaches existing in the literature, stressing when and where the notion of complementarity is taken into account.

3 State of the Art

As mutual information can measure relevance, this quantity is widely used in the literature for performing feature selection. One of the main reasons for adopting it is its low computational cost (O(d · N), where d is the number of variables and N is the number of samples) in the case of discrete variables. The following sections sketch four state-of-the-art filter approaches that use this quantity.

3.1 Variable Ranking (Rank)

The ranking method returns a ranking of the variables on the basis of their individual mutual information with the output. This means that, given n input variables, the method first computes the n quantities I(Xi; Y), i = 1, ..., n, then ranks the variables according to this quantity and eventually discards the least relevant ones [5].
The main advantage of the method is its speed of execution: only n computations of mutual information are required, for a resulting complexity of O(n · N). The main drawback derives from the fact that possible redundancies between variables are not taken into account. Indeed, two redundant variables, yet highly relevant taken individually, will both be well ranked. As a result, a model that uses these two variables is dangerously prone to an increased variance
without any gain in terms of bias reduction. Conversely, two variables can
be complementary with respect to the output (i.e. highly relevant together) while each of
them appears to be poorly relevant once taken individually (see Example 2 or
Example 3). As a consequence, these variables could be badly ranked, or worse,
eliminated by the ranking filter.
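A minimal sketch of the ranking filter described above (our naming), reusing the `mutual_info` helper from the earlier sketch; `k` is the number of variables retained:

```python
def rank_filter(X_cols, y, k):
    """Variable ranking: score each input by I(X_i; Y), keep the top k.

    Redundancy and complementarity between inputs are ignored by design.
    """
    scores = [(mutual_info([x], y), i) for i, x in enumerate(X_cols)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```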

3.2 Filters Combining Relevance and Redundancy Analysis


Although the variable ranking algorithm is reputed to be fast, it may be poorly
effective as it relies only on individual relevance. Recently, new algorithms that
combine relevance and redundancy analysis, such as the Fast Correlation Based
Filter [9], have offered a good trade-off between accuracy and computational load. Also,
some heuristic search methods such as the best-first search (also known as
forward selection) can be combined efficiently with information-theoretic criteria
in order to select the best variable given a previously selected subset.
Forward selection is a search method that starts with an empty set of variables.
At each step, it selects the variable that brings the best improvement
(according to the selection criterion). As a consequence, each selected variable
influences the evaluations of the following steps. This hill-climbing search
selects a subset of d < n variables in d steps and explores only \sum_{i=0}^{d} (n - i)
evaluations.
In the following sections, several information-theoretic criteria existing in
the literature, which can easily be combined with the forward selection, are
presented.
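The forward selection search itself can be sketched as follows (our interface, not the authors' code); `criterion` scores a candidate given the already selected subset, so each of the criteria presented in the remainder of the paper (REL, MRMR, CMIM, DISR) can be plugged in. As a first example, the REL score below computes the conditional relevance through the chain rule I(X_i; Y | X_S) = I(X_{i,S}; Y) - I(X_S; Y), reusing the `mutual_info` helper sketched earlier:

```python
def forward_selection(X_cols, y, d, criterion):
    """Greedy best-first search: start empty, add at each of d steps the
    variable maximizing `criterion(i, selected, X_cols, y)`."""
    selected, remaining = [], set(range(len(X_cols)))
    for _ in range(d):
        best = max(remaining, key=lambda i: criterion(i, selected, X_cols, y))
        selected.append(best)
        remaining.remove(best)
    return selected

def rel_criterion(i, selected, X_cols, y):
    """REL (Eq. 6): conditional relevance I(X_i; Y | X_S), via chain rule."""
    if not selected:
        return mutual_info([X_cols[i]], y)
    xs = [X_cols[j] for j in selected]
    return mutual_info(xs + [X_cols[i]], y) - mutual_info(xs, y)
```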

Relevance Criterion (REL). The relevance criterion is a well-known criterion
which is used together with the forward selection search strategy [6]. The
approach consists in updating a set of selected variables X_S with the variable
X_i featuring the maximum relevance I(X_i; Y | X_S). This strategy prevents
the selection of a variable which, though relevant to Y, is redundant with respect to a
previously selected one.
In analytical terms, the variable X_REL returned by the relevance criterion is

    X_REL = \arg\max_{X_i \in X_{-S}} { I(X_i; Y | X_S) }    (6)

where X_{-S} = X \ X_S is the set difference between the original set of inputs X
and the set of variables X_S selected so far¹.
Although this method is appealing, it presents some major drawbacks. The
estimation of the relevance requires the estimation of several multivariate densities,
a problem known to be ill-posed. For instance, at the dth step of the forward
search, the search algorithm requires n - d evaluations, where each evaluation
requires the computation of a (d + 1)-variate density. It is known that, for a large
d, these estimations are poorly accurate and computationally expensive. For these
two reasons, there has recently been a shift toward selection criteria based on
bi- and tri-variate densities only.
¹ Note that in [6] a normalized version of relevance (Eq. 3) is used.

Minimum Redundancy - Maximum Relevance Criterion (MRMR).
The minimum redundancy - maximum relevance criterion [7] consists in selecting
the variable X_i among the not yet selected features X_{-S} that maximizes u_i - z_i,
where u_i is a relevance term and z_i is a redundancy term. More precisely, u_i is
the relevance of X_i to the output Y alone, and z_i is the mean redundancy of X_i
to each of the variables X_j \in X_S already selected:

    u_i = I(X_i; Y)    (7)

    z_i = (1/d) \sum_{X_j \in X_S} I(X_i; X_j)    (8)

    X_MRMR = \arg\max_{X_i \in X_{-S}} { u_i - z_i }    (9)
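In the same pluggable form as before, a sketch of the MRMR score (our naming, reusing the `mutual_info` helper):

```python
def mrmr_criterion(i, selected, X_cols, y):
    """MRMR (Eqs. 7-9): relevance u_i minus mean pairwise redundancy z_i."""
    u = mutual_info([X_cols[i]], y)
    if not selected:
        return u
    z = sum(mutual_info([X_cols[i]], X_cols[j])
            for j in selected) / len(selected)
    return u - z
```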

At each step, this method selects the variable which has the best relevance-redundancy
trade-off. This selection criterion is fast and efficient. At step d of
the forward search, the search algorithm computes n - d evaluations, where each
evaluation requires the estimation of (d + 1) bivariate densities (one for each
already selected variable plus one with the output). It has been shown in [7] that
the MRMR criterion is an optimal first-order approximation of the conditional
relevance criterion. Furthermore, MRMR avoids the estimation of multivariate
densities by using multiple bivariate densities.
Note that, although the method aims to address the issue of redundancy
between variables through the term z_i, it is not able to take into account the
complementarity between variables. This could be ineffective in situations like
the one of Example 2 where, although the set {X_i, X_S} has a large relevance to
Y, we observe that
1. the redundancy term z_i is large, due to the redundancy of X_i and X_S;
2. the relevance term u_i is small, since X_i alone is not relevant to Y.

Conditional Mutual Information Maximization Criterion (CMIM).
This approach [8] proposes to select the feature X_i \in X_{-S} whose minimal
conditional relevance I(X_i; Y | X_j), among the selected features X_j \in X_S, is maximal.
This requires the computation of the mutual information of X_i and the output
Y, conditional on each feature X_j \in X_S previously selected. Then, the minimal
value is retained and the feature that has a maximal minimal conditional
relevance is selected.
In formal notation, the variable returned according to the CMIM² criterion
is

    X_CMIM = \arg\max_{X_i \in X_{-S}} { \min_{X_j \in X_S} I(X_i; Y | X_j) }    (10)
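A sketch of the CMIM score in the same form; each term I(X_i; Y | X_j) is obtained through the chain rule I(X_{i,j}; Y) - I(X_j; Y), so only tri-variate counts are needed:

```python
def cmim_criterion(i, selected, X_cols, y):
    """CMIM (Eq. 10): minimal conditional relevance over the selected set."""
    if not selected:
        return mutual_info([X_cols[i]], y)
    return min(mutual_info([X_cols[i], X_cols[j]], y)
               - mutual_info([X_cols[j]], y)
               for j in selected)
```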

This selection criterion is powerful. It selects relevant variables, it avoids redundancy,
it avoids estimating high-dimensional multivariate densities and,
unlike the previous method, it does not ignore variable complementarity. However,
it will not necessarily select a variable complementary with the already selected
variables. Indeed, a variable that has a high complementarity with an already
selected variable will be characterized by a high conditional mutual information
with that variable, but not necessarily by a high minimal conditional information
(see Example 3).
In terms of complexity, note that at the dth step of the forward search, the
algorithm computes n - d evaluations, where each evaluation following CMIM
requires the estimation of d tri-variate densities (one for each previously selected
variable).

² Note that in [8] this method was applied to select binary features in a pattern
recognition task.
In the following section, we propose a new criterion that deals more explicitly
with complementary variables.

4 Double Input Symmetrical Relevance (DISR) Criterion


A lower bound on mutual information. In this section, we derive a lower
bound on the mutual information between a subset X_S and a target variable
Y. It is shown in [13] that this quantity is lower bounded by the average of the
same quantity computed for all the sub-subsets X_{S-i} = X_S \ X_i of X_S.

Theorem 1. Let X_S = {X_1, ..., X_d} be a subset of d variables of X and X_{S-i} =
X_S \ X_i, i \in {1, ..., d}, a subset of X_S that does not contain the variable X_i.
Then,

    I(X_S; Y) \geq (1/d) \sum_{i \in S} I(X_{S-i}; Y)    (11)

The theorem expresses that the mutual information of a subset X_S and a target
variable Y is lower bounded by the quantity L(I(X_S; Y)), that is, the average of
the same quantity computed for all the sub-subsets X_{S-i} of X_S.
In the following, we will use this theorem as a theoretical support for the
following heuristic: without any additional knowledge on how subsets of d variables
should combine, the most promising subset is a combination of the best
performing subsets of (d - 1) variables.

Criterion. Given a fixed number d of variables, we can write the problem of
feature selection in the following form:

    S_best = \arg\max_{S : |S|=d} I(X_S; Y)    (12)

In other words, the goal of feature selection is to find the subset of d variables
which maximizes the mutual information with the output Y.
Our idea consists in replacing the maximization of the quantity I(X_S; Y) by
the maximization of its lower bound L(I(X_S; Y)):

    \arg\max_{S : |S|=d} I(X_S; Y) \approx \arg\max_{S : |S|=d} \sum_{i \in S} I(X_{S-i}; Y)    (13)

Replacing again the right-hand term by its lower bound, and proceeding recursively
until we reach subsets of two variables:

    \arg\max_{S} \sum_{i \in S} \sum_{j \in S-i} I(X_{S-(i,j)}; Y) \approx \arg\max_{S} \sum_{i \in S} \sum_{j \in S-i} I(X_{i,j}; Y)    (14)

In other words, without any information on how to combine subsets made
of more than two variables, the most promising subset (the best bound) is the
one with the highest sum of mutual information over all the combinations of
two variables. We choose to stop the recursion at two variables because this
is the minimal subset size that can capture variable complementarity (i.e.
I(X_{i,j}; Y) = I(X_i; Y) + I(X_j; Y) + C_Y(X_i, X_j)). Note that this strategy, when
the recursion is stopped at one variable, boils down to the ranking approach.
A similar approach has been developed in [12], based on the Möbius representation
of mutual information. However, in order to improve the selection
procedure, we use here a normalized measure of mutual information very close
to the symmetrical uncertainty presented in [9]: the symmetrical relevance.

Definition 5. Given two random variables X, Y and a joint probability distribution
p(x, y), the symmetrical relevance SR(X; Y) is defined as

    SR(X; Y) = I(X; Y) / H(X, Y)    (15)

This definition expresses that symmetrical relevance is a function 0 \leq SR(X; Y)
\leq 1 that indicates the concentration of mutual information contained in
p(x, y).
As a consequence, our resulting criterion is the following:

    X_DISR = \arg\max_{X_i \in X_{-S}} { \sum_{X_j \in X_S} SR(X_{i,j}; Y) }    (16)
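A sketch of the DISR score (Eqs. 15-16) in the same pluggable form, reusing the `entropy` and `mutual_info` helpers; with the `forward_selection` skeleton given earlier, `forward_selection(X_cols, y, 15, disr_criterion)` would mimic the kind of search used in the experiments of Section 5 (our naming and interface, not the authors' code):

```python
def symmetrical_relevance(x_cols, y):
    """SR(X; Y) = I(X; Y) / H(X, Y)  (Eq. 15)."""
    return mutual_info(x_cols, y) / entropy(x_cols + [y])

def disr_criterion(i, selected, X_cols, y):
    """DISR (Eq. 16): sum of SR(X_{i,j}; Y) over the selected X_j."""
    if not selected:
        return symmetrical_relevance([X_cols[i]], y)
    return sum(symmetrical_relevance([X_cols[i], X_cols[j]], y)
               for j in selected)
```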

The main advantage of using this criterion for selecting variables is that a
complementary variable of an already selected one has a much higher probability
of being selected than with the other criteria. As this criterion measures symmetrical
relevance on all the combinations of two variables (double input) of a subset, we
have called it the double input symmetrical relevance (DISR) criterion. At the
dth step of the forward search, the search algorithm computes n - d evaluations,
where each evaluation requires the estimation of d tri-variate densities (one for
each previously selected variable). In the next section, the DISR criterion is
assessed and compared with the other heuristic search filters discussed in
Section 3.
Table 1 summarizes the methods discussed so far in terms of some peculiar
aspects: the capacity of selecting relevant variables, of avoiding redundancy, of
selecting complementary features and of avoiding the computation of multivariate
densities.

Table 1. Qualitative comparison of the different information-theoretic filters, according to
different aspects: relevance selection, redundancy avoidance, complementarity selection
and multivariate density avoidance

criterion                        rank   REL   MRMR   CMIM   DISR
relevance selection               V      V     V      V      V
redundancy avoidance              -      V     V      V      V
complementarity selection         -      V     -      -      V
multivariate density avoidance    V      -     V      V      V

5 Experiments
A major topic in bioinformatics is how to build accurate classifiers for cancer
diagnosis and prognosis on the basis of microarray genomic signatures.
This task can be considered a challenging benchmark for feature selection
algorithms [7], given the high feature-to-sample ratio.
We use eleven public-domain multi-class datasets from [14] (Table 2) in order
to assess and compare our technique with the state-of-the-art approaches.
In our experimental framework, each continuous variable has been discretized
into equal-sized intervals. The number of intervals of each input is based on the Scott
criterion, see [15]. All the datasets are partitioned into two parts: a selection set
and a test set (each having size N/2). We compare the filter based
on DISR with the four state-of-the-art approaches discussed above: a ranking
algorithm and three filters based on the Relevance criterion, the Minimum
Redundancy - Maximum Relevance criterion and the Conditional Mutual Information

Table 2. The 11 microarray cancer datasets from http://www.tech.plym.ac.uk.
The column n reports the number of probes in the microarray, the column N the
number of samples and the column c the number of classes. The remaining columns
contain the accuracy of each selection method averaged over the three classifiers
(SVM, 3-NN, naive Bayes). The accuracy of the two best methods for each dataset is
typed in bold face.

Dataset (DN) n N c Rank REL CMIM MRMR DISR


11 Tumors 12534 87 11 49% 46% 48% 42% 49%
14 Tumors 15010 308 26 22% 25% 20% 26% 19%
9 Tumors 5727 60 9 19% 36% 20% 23% 28%
Leukemia1 5328 72 3 71% 74% 71% 69% 78%
Leukemia2 11226 72 3 68% 57% 62% 65% 67%
Prostate Tumor 10510 102 2 76% 66% 62% 78% 73%
Brain Tumor1 5921 90 5 73% 70% 70% 70% 71%
Brain Tumor2 10368 50 4 47% 47% 49% 59% 60%
Lung Cancer 12601 203 5 84% 74% 82% 77% 74%
SRBCT 2309 83 4 70% 53% 75% 75% 67%
DLBCL 5470 77 2 77% 88% 70% 71% 88%

Table 3. Statistically significant (at the 0.1 and 0.2 levels of a paired two-tailed t-test)
wins, ties or losses over the best-first search combined with the DISR criterion. The
first four columns report the 0.1 level, the last four the 0.2 level.

W/T/L vs DISR   Rank   REL    CMIM   MRMR   Rank   REL    CMIM   MRMR
3-NN            0/9/2  1/8/2  1/7/3  2/7/2  1/5/5  1/7/3  1/7/3  2/7/2
Naive Bayes     3/8/0  2/7/2  1/8/2  1/9/1  4/7/0  3/6/2  1/8/2  1/9/1
SVM             2/9/0  2/6/3  2/6/3  2/8/1  2/6/3  3/4/4  2/5/4  3/5/3

Maximization criterion, respectively. Each selection method stops after 15
variables have been selected. Then, the evaluation of the selection is carried out on
the test set, using ten-fold cross-validation with a 3-nearest-neighbor, a
naive Bayes and an SVM (with radial kernel) learning algorithm. Each learning
technique led to the choice of a different number of variables, in a range from 2
to 15. Then, for each of the eleven datasets and for each selection method, the
best number of variables and the classification accuracy were computed. A set of
statistical paired t-tests on the sets of classification errors is reported in Table 3.
As far as the implementation of the three learning methods is concerned, we
used the algorithms made available by the R statistical language [16].
According to Table 2, the DISR criterion slightly outperforms all the other
methods in terms of average accuracy. Furthermore, our method is one of the
two best methods for 7 out of the 11 datasets.
Table 3 reports the significant wins, ties or losses (at the 0.1 and 0.2 significance
levels of a paired two-tailed t-test, respectively) of the DISR criterion against all
the others. We remark that in the case of a 3-nearest neighbor, the DISR criterion
is equivalent to MRMR and better than all the other methods. For a naive
Bayes classifier, the performances of DISR are slightly lower. This is not
surprising, because the benefits of the DISR criterion are related to variable
complementarity, whereas the success of a naive Bayes classifier typically relies
on the opposite, that is, variable independence. As far as the SVM classifier is
concerned, at the 0.1 significance level DISR appears to be slightly better than
both REL and CMIM, and slightly worse than Rank and MRMR. However, at
the 0.2 significance level DISR outperforms all the other methods except MRMR.

6 Conclusion and Future Work

This paper formalized an original notion in feature selection: variable complementarity.
Also, a lower bound on the mutual information of a subset of variables
with the output was demonstrated. On the basis of these considerations, we proposed
a new selection criterion: the double input symmetrical relevance (DISR).
The experimental session shows that this criterion is promising in high feature-to-sample-ratio
classification tasks such as gene expression microarray datasets. Note
that in gene selection, variable complementarity can be biologically meaningful,
since it is common to observe combinations of genes acting together.
Further experiments will focus on (i) datasets with more samples and/or fewer
features, (ii) other search strategies than the forward selection, in order to validate
the criterion in a wider range of domains, and (iii) the impact of the discretization
method on the efficiency of the feature selection algorithms.

References
1. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal
of Machine Learning Research 3 (2003) 1157-1182
2. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence
97 (1997) 273-324
3. Blum, A., Langley, P.: Selection of relevant features and examples in machine
learning. Artificial Intelligence 97 (1997) 245-271
4. Provan, G., Singh, M.: Learning Bayesian networks using feature selection. In:
Fifth International Workshop on Artificial Intelligence and Statistics. (1995)
450-456
5. Duch, W., Winiarski, T., Biesiada, J., Kachel, A.: Feature selection and ranking
filters. In: International Conference on Artificial Neural Networks (ICANN) and
International Conference on Neural Information Processing (ICONIP). (2003) 251-254
6. Bell, D.A., Wang, H.: A formalism for relevance and its application in feature
subset selection. Machine Learning 41 (2000) 175-195
7. Peng, H., Long, F.: An efficient max-dependency algorithm for gene selection.
In: 36th Symposium on the Interface: Computational Biology and Bioinformatics.
(2004)
8. Fleuret, F.: Fast binary feature selection with conditional mutual information.
Journal of Machine Learning Research 5 (2004) 1531-1555
9. Yu, L., Liu, H.: Efficient feature selection via analysis of relevance and redundancy.
Journal of Machine Learning Research 5 (2004) 1205-1224
10. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley, New
York (1990)
11. Yang, H., Moody, J.: Feature selection based on joint mutual information. In:
Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods
and Applications (CIMA), Rochester, New York, ICSC (1999)
12. Kojadinovic, I.: Relevance measures for subset variable selection in regression
problems based on k-additive mutual information. Computational Statistics and
Data Analysis 49 (2005) 1205-1227
13. Meyer, P.: Information theoretic filters for feature selection. Technical report
(548), Université Libre de Bruxelles (2005)
14. Web: http://www.tech.plym.ac.uk/spmc/bioinformatics/microarray cancers.html
15. Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization.
Wiley (1992)
16. R-project: www.r-project.org
Comparison of Neural Network Optimization
Approaches for Studies of Human Genetics

Alison A. Motsinger, Scott M. Dudek, Lance W. Hahn, and Marylyn D. Ritchie

Center for Human Genetics Research, Department of Molecular Physiology & Biophysics,
Vanderbilt University, Nashville, TN, 37232, USA
{motsinger, dudek, hahn, ritchie}@chgr.mc.vanderbilt.edu
http://chgr.mc.vanderbilt.edu/ritchielab

Abstract. A major goal of human genetics is the identification of susceptibility


genes associated with common, complex diseases. The preponderance of gene-
gene and gene-environment interactions comprising the genetic architecture of
common diseases presents a difficult challenge. To address this, novel
computational approaches have been applied to studies of human disease.
These novel approaches seek to capture the complexity inherent in common
diseases. Previously, we developed a genetic programming neural network
(GPNN) to optimize network architecture for the detection of disease
susceptibility genes in association studies. While GPNN was a successful
endeavor, we wanted to address the limitations in its flexibility and ease of
development. To this end, we developed a grammatical evolution neural
network (GENN) approach that accounts for the drawbacks of GPNN. In this
study we show that this new method has high power to detect gene-gene
interactions in simulated data. We also compare the performance of GENN to
GPNN, a traditional back-propagation neural network (BPNN) and a random
search algorithm. GENN outperforms both BPNN and the random search, and
performs at least as well as GPNN. This study demonstrates the utility of using
GE to evolve NN in studies of complex human disease.

1 Introduction

The identification and characterization of susceptibility genes for common complex


human diseases, such as hypertension, is a difficult challenge[1,2]. This is largely due
to the complexity of these diseases, and the likelihood that many disease susceptibility
genes exhibit effects that are dependent partially or solely on interactions with other
genes and the environment. These interactions, known as epistasis, are difficult to
detect using traditional statistical methods[3]. Thus, a number of novel statistical and
computational methods have been developed[4-11]. Neural networks (NN) are a
supervised pattern recognition method commonly used in many fields for data
mining. The back propagation NN (BPNN) is one of the most commonly used
NN[12] and is the NN chosen for most genetic epidemiology studies [13-21].
Successful use of NN for data mining requires defining an optimal NN architecture
for the problem at hand. However, it is not always intuitive what the optimal


architecture should be for a given dataset and, as a result, a cumbersome trial and
error approach is often taken.
Previously, we implemented a neural network optimized via genetic programming
(GPNN)[22]. Optimizing neural network architecture with genetic programming was
first proposed by Koza and Rice[23]. We implemented and extended the GPNN
approach for use in association studies of human disease. The goal of GPNN was to
improve upon the trial-and-error process of choosing an optimal architecture for a
pure feed-forward back propagation neural network[22]. GPNN optimizes the inputs
from a large pool of variables, the weights, and the connectivity of the network -
including the number of hidden layers and the number of nodes in the hidden layer.
Thus, the algorithm automatically generates optimal neural network architecture for a
given dataset. This gives it an advantage over the traditional back propagation NN, in
which the inputs and architecture are pre-specified and only the weights are
optimized.
GPNN was a successful endeavor; it has shown high power to detect gene-gene
interactions in both simulated and real data[24]. Still, there are limitations to evolving
NN using this type of machine learning algorithm. First, the GP implementation that
was used for GPNN involves building binary expression trees. Therefore, each node
is connected to exactly two nodes at the level below it in the network. This did not
seem to hinder the power of GPNN in smaller datasets[22,24-26]; however, we
hypothesize that for more complex data, more complicated NN will be required, and
two connections per node may not be sufficient. Second, changes to GPNN require
altering and recompiling source code, which hinders flexibility and increases
development time. For example, GPNN is limited in the depth of the network. This
means there is a limit to the number of levels the network can contain. Again, this
was not a hindrance for GPNN in the previous power studies[22,24-26], but this may
not scale well for more complex datasets.
In response to these concerns, we developed a NN approach for detecting gene-
gene interactions that uses grammatical evolution (GE) as a strategy for the
optimization of the NN architecture. Grammatical evolution (GE) is a variation on
genetic programming that addresses some of the drawbacks of GP[27,28]. GE has
been shown to be effective in evolving Petri Nets, which are discrete dynamical
systems that look structurally similar to neural networks, used to model biochemical
systems[29]. By using a grammar, substantial changes can be made to the way that
NN are constructed through simple manipulations to the text file where the grammar
is specified. No changes in source code are required and thus, there is no
recompiling. The end result is a decrease in development time and an increase in
flexibility. These two features are important improvements over GPNN.
Preliminary studies with GPNN show that an evolutionary optimization is more
powerful than traditional approaches for detecting gene-gene interactions. We have
shown that the GPNN strategy is able to model and detect gene-gene interactions in
the absence of main effects in many epistasis models with higher power than back
propagation NN[22], stepwise logistic regression[26], and a stand alone GP[25].
GPNN has also detected interactions in a real data analysis of Parkinson's
disease[24]. Similarly to GPNN, the grammatical evolution optimized neural
network (GENN) optimizes the inputs from a pool of variables, the synaptic weights,
and the architecture of the network. The algorithm automatically selects the
appropriate network architecture for a particular dataset. Thus, if GE allows for


more flexible evolution of NN, can it perform as well or better than GPNN?
Because GENN retains the beneficial properties of GPNN and offers substantial
improvements in terms of flexibility and development, we hypothesize that NN
optimized by GE will exhibit power equal to and even exceeding GPNN. In this
study, we compare the performance of GENN to a more traditional back-propagation
NN, a random search algorithm, and GPNN. We find that both GENN and GPNN
outperform the random search and the BPNN. We also show that GENN is equal in
power to GPNN in detecting gene-gene interactions in our simulated disease models.

2 Methods

2.1 Grammatical Evolution Neural Network (GENN)

Grammatical Evolution (GE) is a form of evolutionary computation that allows the


generation of computer programs using grammars[27,28]. GE uses populations
made of linear genomes that are translated by the grammar. Each individual consists
of a binary genome divided into codons. Mutation can occur on individual bits along
the genome, but crossover only occurs between codons. These codons are translated
according to the grammar into a resulting phenotype (in this case, a functional NN).
The resulting individual/phenotype can be tested for fitness and evolutionary
operators are applied to create subsequent generations. By using the grammar to
map a NN, GE separates genotype from phenotype. This allows for greater genetic
diversity within a population than offered by other evolutionary algorithms, like GP.
Since GENN uses a grammar to define the structure of the resulting NN, we can
easily vary the behavior of the program with changes to the grammar.
GE differs from GP in several ways. First, unlike GP, GE uses a linear genome -
similar to a genetic algorithm. Second, GE performs mapping from genotype to
phenotype using the rules of a grammar, much like the rules of the biological
process of DNA transcription into mRNA. Finally, all evolutionary processes take
place at the chromosomal level (binary strings) rather than the phenotypic level
(binary expression tree). Ultimately, the goal of GP and GE is synonymous: to
evolve computer programs using evolutionary processes[27,28].
A detailed description of GE can be found in O'Neill and Ryan[28]. Briefly, a
Backus-Naur Form (BNF) grammar must be defined for the process of genotype-to-
phenotype mapping. A variable-length binary string genome is used in a genetic
algorithm, with a set of 8 bits constituting a codon. Each binary codon represents an
integer value used to select a rule in the grammar. One non-terminal element is
designated by the grammar as the start element. The mapping process proceeds as the
first codon maps the start symbol of the solution to a rule by generating an integer
value from the 8 bits. To select a rule, the operator used is {(codon integer value)
MOD (number of rules)}. The start element is replaced by the elements of the rule
selected and this process proceeds until only terminal elements remain. A wrapping
process can be used if the program has non-terminals at the point at which the end of
the chromosome has been reached so that the algorithm returns to the start of the
chromosome to obtain the next codon. This wrapping process can be allowed to
occur N times as necessary, where N is defined in the configuration file. The


resulting phenotype is a NN, which can then be evaluated for fitness.
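The codon-to-rule mechanics can be made concrete with a small sketch. The grammar below is a toy expression grammar, not the authors' NN grammar (which they state is available on request); the mapping function, however, follows the MOD rule selection and wrapping scheme just described:

```python
# Toy BNF grammar: each non-terminal maps to a list of alternative rules.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["*"]],
    "<var>":  [["x1"], ["x2"]],
}

def ge_map(codons, start="<expr>", max_wraps=2):
    """Translate integer codons (0-255) into a phenotype string.

    Rule choice: (codon integer value) MOD (number of rules). If codons run
    out while non-terminals remain, wrap to the start of the genome.
    """
    symbols, out, pos, wraps = [start], [], 0, 0
    while symbols:
        sym = symbols.pop(0)
        if sym not in GRAMMAR:       # terminal: emit it
            out.append(sym)
            continue
        if pos == len(codons):       # end of genome: wrap around
            pos, wraps = 0, wraps + 1
            if wraps > max_wraps:
                raise ValueError("mapping failed: too many wraps")
        rules = GRAMMAR[sym]
        expansion = rules[codons[pos] % len(rules)]
        pos += 1
        symbols = expansion + symbols  # leftmost derivation
    return " ".join(out)

print(ge_map([0, 1, 42, 0, 7, 1]))   # -> "x1 + x2"
```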
The steps of GENN are similar to GPNN[22]. First, GENN has a set of
parameters that must be initialized in the configuration file. Second, the data are
divided into 10 equal parts for 10-fold cross-validation. Here, we train GENN on
9/10 of the data to develop a NN model. Later, we test this model on the other 1/10
of the data to evaluate the predictive ability of the model. This approach has been
described in detail for GPNN[22]. Third, training of GENN begins by generating an
initial population of random solutions. Each solution is generated via sensible
initialization[28]. Using sensible initialization, an initial population is generated that
creates functioning NN. In the sensible initialization step an expression tree is
created using the grammar. The software assigns a minimum depth to each rule that
describes the depth required for the rule to be completed. As each tree is built, the
algorithm randomly selects only rules that can fit within the remaining depth of the
tree. Half of the individual NN are built to the maximum depth by only selecting
recursive rules until a non-recursive rule must be chosen to complete the tree and
half are generated to a random depth no greater than the maximum by selecting any
rule that can fit in the remaining depth of the tree. The final step in initialization is to
convert nodes of the tree into corresponding codons. Fourth, each NN is evaluated on
the training set and its fitness recorded. Fifth, the best solutions are selected for
crossover and reproduction using a selection technique. The selection method can be
specified in the configuration file, where the options are uniform, rank, roulette, and
tournament[30]. A proportion of the best solutions will be directly copied
(reproduced) into the new generation. Another proportion of solutions will be used
for crossover with other best solutions. The crossover is performed at the
chromosomal level, not at the level of the expression tree. The new generation,
which is equal in size to the original population, begins the cycle again. This
continues until some criterion is met, after which GENN stops. This criterion is
either a classification error of zero or a limit on the number of generations. An
optimal solution is identified after each generation. At the end of GENN evolution,
the overall best solution is selected as the optimal NN. Sixth, this best GENN model
is tested on the 1/10 of the data left out to estimate the prediction error of the model.
Steps two through six are performed ten times with the same parameter settings,
each time using a different 9/10 of the data for training and 1/10 of the data for
testing. An overview of the GENN algorithm is shown in Figure 1.
We have implemented GE to optimize inputs, architecture, and weights of a NN.
The grammar used is available from the authors upon request. The GA used to
evolve the binary string that is transcribed into a NN has the following parameters in
the current implementation: crossover rate = 0.9, mutation = 0.01, population = 200,
max generations = 50, codon size = 8, GE wrapping count = 2, min chromosome size
(in terms of codons) = 50, max chromosome size = 1000, selection = roulette (can
also be uniform, rank, or tournament), and sensible initialization depth = 10. To
prevent stalling in local minima, the island model of parallelization is used, where
the best individual is passed to each of the other processes after every 25
generations[31]. The genome is derived from GAlib (version 2.4.5) which is freely
available at http://lancet.mit.edu/ga/dist/, and a typical GA one-point crossover of
linear chromosomes is used.

Fig. 1. An overview of the GENN method. The steps correspond to the description of the
method in Section 2.1.

Classification error is calculated on the set of training data as the fitness metric.
As mentioned earlier, the dataset is divided into cross-validation subsets. GENN is
optimized using a training set of data, and a subset of the data is left out as a test set
to evaluate the final solution and prevent over-fitting. Classification error refers to
the number of samples in the training dataset that are incorrectly classified by the
network. Prediction error, which refers to the number of samples in the test dataset
that are incorrectly classified using the GENN model generated during training, is
used for final model selection. The overall goal of the learning process is to find
genetic models that accurately classify the data. Cross-validation is used in
conjunction with this learning process to produce a model that not only can
accurately classify the data at hand, but can predict on future, unseen data.
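A minimal sketch of this cross-validation protocol; `train_and_score` stands in for one complete GENN training run and is a hypothetical interface introduced purely for illustration:

```python
import numpy as np

def cross_validated_errors(X, y, train_and_score, n_folds=10, seed=0):
    """Split the data into `n_folds` parts; train on 9/10, test on 1/10.

    `train_and_score(X_tr, y_tr, X_te, y_te)` must return the pair
    (classification_error, prediction_error) for one trained model.
    X and y are numpy arrays indexable by integer arrays.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    results = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        results.append(train_and_score(X[train], y[train], X[test], y[test]))
    ce, pe = zip(*results)
    return float(np.mean(ce)), float(np.mean(pe))
```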

2.2 Genetic Programming Neural Networks (GPNN)

GPNN uses genetic programming to determine the optimal architecture for neural
networks. Both the method and the software have previously been described[22].
GPNN was applied as presented in the references. Like GENN, models are trained on
classification error, and a cross validation consistency and prediction error are
determined for the final model. Unlike GENN, previous studies have shown cross
validation consistency as the best criterion for final model selection. However for this
study, the results are identical whether you use cross-validation consistency or
prediction error for final model selection. All configuration parameters are identical to
those in GENN. This will allow for a direct comparison of GPNN and GENN.

2.3 Random Search Algorithm

As a negative control, a random-search algorithm was implemented. The random


search algorithm uses the same fitness metric as both GENN and GPNN. The random
search generates the initial chromosome population as described above for GENN,
using sensible initialization, but this new random population occurs at every
generation instead of only at the beginning of the run. Genotype-to-phenotype
mapping is performed just as it is for GENN. The algorithm stores the single best
network over all generations and returns that as the final model. All other networks
are discarded.

2.4 Back Propagation Neural Network

In this study, we used a traditional fully-connected, feed-forward network comprised


of one input layer, zero, one or two hidden layers, and one output layer, trained by
back-propagation. The software used, the NICO toolkit, was developed at the Royal
Institute of Technology, (http://www.speech.kth.se/NICO/index.html).
Defining the network architecture is an important decision that can greatly affect
the results of the analysis[32]. There are several strategies utilized for architecture
selection, including a prediction error fitness measure such that an architecture is
selected by its generalization to new observations[32], or a classification (training)
error metric[33]. Because we use cross-validation to verify the generalizability of our
models, and to make a more fair comparison to the other methods in this study, we
used classification error as a basis for evaluating and making changes to the BPNN
architecture. We began with a very small network, and several parameters were
varied to obtain an appropriate architecture for each dataset, including: the number of
hidden layers, the number of nodes in the hidden layer, and the learning momentum
(the fraction of the previous change in a weight that is added to the next change).
This trial and error approach is typically employed for optimization of BPNN
architecture[33,34]. BPNN was implemented as described in [22]. Final model
selection was performed based on lowest prediction error, as with GENN.

2.5 Data Simulation

Epistasis, or gene-gene interaction, occurs when the phenotype under study cannot be
predicted from the independent effects of any single gene, but is the result of
combined effects of two or more genes[34]. It is increasingly accepted that epistasis
plays an important role in the genetic architecture of common genetic diseases[35].
Penetrance functions are used to represent epistatic genetic models in this simulation
study. Penetrance defines the probability of disease given a particular genotype
combination by modeling the relationship between genetic variations and disease risk.
For our power studies, we simulated case-control data using two different epistasis
models exhibiting interaction effects in the absence of main effects. Models that lack
main effects are desirable because they challenge the method to find gene-gene
interactions in a complex dataset. Also, a method able to detect purely interactive
terms will be likely to identify main effects as well.

Table 1. Multilocus penetrance functions used to simulate case-control data exhibiting gene-
gene interactions in the absence of main effects. Penetrance is calculated as
p(disease|genotype). Marginal penetrance values (not shown) are all equal for each model.

a. Model 1                     b. Model 2
       BB     Bb     bb               BB     Bb     bb
AA     0      .10    0         AA     0      0      .10
Aa     .10    0      .10       Aa     0      .50    0
aa     0      .10    0         aa     .10    0      0

To evaluate the power of the above methods for detecting gene-gene interactions,
we simulated case-control data using two different two-locus epistasis models in which
the functional loci are single nucleotide polymorphisms (SNPs). The first model was
initially described by Li and Reich[36], and later by Moore [37]. This model is based
on the nonlinear XOR function[38] that generates an interaction effect in which high
risk of disease is dependent on inheriting a heterozygous genotype (Aa) from one locus
or a heterozygous genotype (Bb) from a second locus, but not both. The high risk
genotypes are AaBB, Aabb, AABb, and aaBb, all with penetrance of 0.1 (Table 1a).
The proportion of the trait variance that is due to genetics, or heritability, of this model
is low. Specifically, as calculated according to Culverhouse et al[39], the heritability is
0.053. The second model was initially described by Frankel and Schork[40], and
later by Moore[37]. In this second model, high risk of disease is dependent on
inheriting exactly two high risk alleles (A and/or B) from two different loci. In this
model, the high risk genotypes were AAbb, AaBb, and aaBB, with penetrance of 0.1,
0.5, and 0.1 respectively (Table 1b). The heritability of this model is 0.051.
These models were selected because they exhibit interaction effects in the absence
of any main effects when genotypes were generated according to Hardy-Weinberg
proportions (in both models, p=q=0.5). For both models, we simulated 100 datasets
consisting of 200 cases and 200 controls, each with 10 SNPs, 2 of which were
functional. The data were generated using the software package described by Moore
et al[37]. Dummy variable encoding was used for each dataset, where n-1 dummy
variables were used for n levels[19]. Data were formatted with rows representing
individuals and columns representing dummy-encoded genotypes with the final
column representing disease status. Though biological relevance of these models is
uncertain, they do represent a worst case scenario in the detection of epistasis. If a
method performs well under such minimal effects, it is predicted it will also perform
well in identifying gene-gene interactions in models with greater effect sizes.
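Under the stated assumptions (Hardy-Weinberg proportions with p = q = 0.5, affection status drawn from the penetrance table at the two functional loci), the simulation can be sketched as below. This is our illustration, not the software of Moore et al. used in the paper:

```python
import numpy as np

# Penetrance table for Model 1 (rows AA, Aa, aa; columns BB, Bb, bb).
MODEL1 = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.0, 0.1],
                   [0.0, 0.1, 0.0]])

def simulate_dataset(penetrance, n_cases=200, n_controls=200,
                     n_snps=10, p=0.5, seed=0):
    """Rejection-sample a balanced case-control dataset.

    Genotypes are 0/1/2 allele counts drawn under Hardy-Weinberg
    proportions; SNPs 0 and 1 are functional, the rest are noise.
    """
    rng = np.random.default_rng(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        geno = rng.binomial(2, p, size=n_snps)
        affected = rng.random() < penetrance[geno[0], geno[1]]
        if affected and len(cases) < n_cases:
            cases.append(geno)
        elif not affected and len(controls) < n_controls:
            controls.append(geno)
    X = np.vstack(cases + controls)
    y = np.array([1] * n_cases + [0] * n_controls)
    return X, y
```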

2.6 Data Analysis

We used all four methods (GENN, GPNN, BPNN and random search) to analyze both
epistasis models. The configuration parameter settings were identical for GENN,
GPNN and the random search (without evolutionary operators for random search): 10
demes, migration every 25 generations, population size of 200 per deme, 50
generations, crossover rate of 0.9, and a reproduction rate of 0.1. For GENN and the
random search, prediction error was used for final model selection as described in
Section 2.1. For GPNN, cross-validation consistency was calculated for each model
and the final model was selected based on this metric (as described in [22]). Sensible
initialization was used in all three algorithms.
For our traditional BPNN analysis, all possible inputs were used and the
significance of each input was calculated from its input relevance R_I, where R_I is
the sum of squared weights for the ith input divided by the sum of squared weights
for all inputs[38]. Next, we performed 1000 permutations of the data to determine
what input relevance was required to consider a SNP significant in the BPNN model
(data not shown). This empirical range of critical relevance values for determining
significance was 10.43% - 11.83% based on the permutation testing experiments.
Cross validation consistency was also calculated and an empirical cutoff for the cross
validation consistency was determined through permutation testing (using 1000
randomized datasets). This cutoff was used to select SNPs that were functional in the
epistasis model for each dataset. A cross validation consistency of greater than 5 was
required to be statistically significant.
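The relevance measure itself is straightforward; a sketch, assuming the trained network's first-layer weights are available as an (n_inputs × n_hidden) array (our assumed shape, for illustration):

```python
import numpy as np

def input_relevance(first_layer_weights):
    """R_i = sum of squared weights leaving input i, divided by the total
    sum of squared first-layer weights, expressed as a percentage."""
    sq = np.sum(np.asarray(first_layer_weights) ** 2, axis=1)
    return 100.0 * sq / sq.sum()
```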
Power for all analyses is reported under each epistatic model as the number of
times the algorithm correctly identified the correct functional loci (both with and
without any false positive loci) over 100 datasets. Final model selection was
performed for each method based on optimum performance in previous studies[22].
If either one or both of the dummy variables representing a single SNP was selected,
that locus was considered present in the model.

3 Results
Table 2 lists the power results from all four algorithms. Because of the small size of
the dataset, all four algorithms performed reasonably well. With a limited number of
SNPs, these learning algorithms can effectively become exhaustive searches. As
hypothesized, GENN and GPNN both out-performed the traditional BPNN and the
random search. The performance of GENN and GPNN were consistent, as expected.
This demonstrates that GENN will work at least as well as GPNN, while allowing for
faster development and more flexible use. Because the number of variables included
in the dataset was small, the random search performed reasonably well, as the trial
and error approach had a limited number of variables to search through. As Table 2
shows, there is a large gap in the performance of the random search between Model 1
and Model 2. This is probably due to the difference in the difficulty inherent in the
two models. The power of BPNN to detect Model 2 was also lower than for Model 1,
indicating a difference in the challenge of modeling the different models.
Additionally, the stochastic nature of a random algorithm can lead to erratic results, as
shown here. These erratic power results further demonstrate the utility of an
evolutionary approach to optimizing NN architecture. The random search even
outperformed BPNN for Model 1, probably because the random search was able to
search through more possible NN architectures than BPNN so was able to find the
correct model more often in these simulations.
Table 3 summarizes the average classification error (training error) and prediction
error (testing error) for the four algorithms evaluated using the 100 datasets for each
model.

Table 2. Power (%) for each method on both gene-gene interaction models (with no false
positive loci)
Epistasis Model GENN GPNN BPNN Random Search
1 100 100 53 87
2 100 100 42 10

Table 3. Results for all four algorithms, showing the average classification error (CE) and
prediction error (PE) for each epistasis model. The range of observed errors is listed
below each average.

Model   GENN            GPNN            BPNN            Random Search
        CE      PE      CE      PE      CE      PE      CE      PE
1       0.237   0.237   0.237   0.237   0.008   0.340   0.236   0.237
        (.212-  (.210-  (.200-  (.208-  (.000-  (.201-  (.088-  (.075-
        .254)   .255)   .284)   .279)   .183)   .410)   .483)   .400)
2       0.243   0.243   0.242   0.243   0.008   0.303   0.242   0.245
        (.212-  (.209-  (.212-  (.210-  (.000-  (.240-  (.080-  (.075-
        .260)   .271)   .261)   .269)   .181)   .410)   .494)   .400)

Table 4. Power (%) for each method to detect functional SNPs in both gene-gene interaction
models (with or without false positive loci)

Model GENN GPNN BPNN Random Search


SNP 1 SNP 2 SNP 1 SNP 2 SNP 1 SNP 2 SNP 1 SNP 2
1 100 100 100 100 88 90 100 100
2 100 100 100 100 80 82 100 100

Due to the probabilistic nature of the functions used in the data simulation,
there is some degree of noise present in the data. The average error inherent in the
100 Model 1 datasets is 24%, and the error in the Model 2 datasets is 18%. As the
table shows, GENN, GPNN and the random search all had error rates closely
reflecting the real amount of noise in the data. Those three algorithms had lower
prediction errors than BPNN, while BPNN had lower classification errors for both
models. The lower classification error is due to model over-fitting. The other three
algorithms, including even the random search, are better able to model gene-gene
interaction and develop NN models that can generalize to unseen data. While the
random search did not demonstrate the same degree of over-fitting experienced with
BPNN, the averages reported here disguise the fact that the range of errors across
datasets was very high. While the average errors for the random search look similar
to those for GPNN and GENN, the range of observed values was much larger,
implying that the random search also tends to over-fit. We speculate that GENN and
GPNN are not over-fitting because, while these methods are theoretically able to
build a tree with all variables included, neither method is building a fully connected
NN using all variables.
To further understand the behavior of the algorithms, and in particular the
seemingly inconsistent results of the random search's average errors and low relative
power, we calculated power for each functional locus as the proportion of times a
SNP was included in the final model (regardless of what other SNPs are present in
the model) for all datasets. Table 4 lists the results of this power calculation for all
four methods. The tendency of the random search algorithm to over-fit models
becomes clear in the comparisons of Tables 2-4. The random search finds the
functional SNPs, but includes many false positive loci, which is highly undesirable
since the end goal of association analysis is variable selection. The same false
positive trend holds true for BPNN.

4 Discussion
We have demonstrated that grammatical evolution is a valid approach for optimizing
the architecture of NN. We have shown that GENN outperforms both a random
search and traditional BPNN for analysis of simulated epistasis genetic models.
Because of the small number of SNPs in this study, both BPNN and the random
search NN had modest power, as one would expect. With a small number of
variables to test, examining low-order combinations is relatively easy. As the number
of variables increases, the resultant combinatorial explosion limits the feasibility of
trial and error approaches. Moody[33] demonstrates that enumeration of all possible
NN architectures is impossible, and there is no way to know if a globally optimal
architecture is selected. The performance gap between the evolutionarily optimized
algorithms and the trial and error approaches is expected to widen as the number of
variables increases.
Additionally, we show that GENN performs at least as well as GPNN. Because of
the limited number of noise variables, and the fact that these two methods reached
the upper limit of power, a more extensive comparison between GENN and GPNN
needs to be performed. Power will need to be studied in a range of datasets,
demonstrating a wide range of heritability values and number of noise variables.
Because of the greater flexibility of GE compared to GP, we predict that GENN will
out-perform GPNN on more complex datasets.
Because the end-goal of these methods is variable selection, performance has been
evaluated according to this metric in this study. In future studies, it would be
interesting to evaluate the architectures of the NN that are constructed by these
different methods to further evaluate the differences in their performance. Other
measures of model fitness, such as sensitivity and specificity could also be dissected
in evaluating the performance of GENN.
Also, while simulated data are necessary in method development, the eventual
purpose of this method is for the analysis of real data. GENN will need to be tested
on real case-control genetic data.
This study introduces a novel computational method and demonstrates that GENN
has the potential to mature into a useful software tool for the analysis of gene-gene
interactions associated with complex clinical endpoints. The ease of flexibility and
ease of development of utilizing a grammar will aid in additional studies with this
method.

Acknowledgements
This work was supported by National Institutes of Health grants HL65962, GM62758,
and AG20135. We would also like to thank David Reif for his helpful comments on
the manuscript.

References
1. Kardia S, Rozek L, Hahn L, Fingerlin T, Moore J: Identifying multilocus genetic risk
profiles: a comparison of the multifactor data reduction method and logistic regression.
Genetic Epidemiology 2000.
2. Moore JH, Williams SM: New strategies for identifying gene-gene interactions in
hypertension. Ann Med 2002, 34: 88-95.
3. Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to
quantitative traits. Genet Epidemiol 2004, 27: 141-152.
4. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for
detecting gene-gene and gene-environment interactions. Bioinformatics 2003, 19: 376-382.
5. Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L: Sequence analysis using logic
regression. Genet Epidemiol 2001, 21 Suppl 1: S626-S631.
6. Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common
human diseases. Hum Hered 2003, 56: 73-82.
7. Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to
identify multilocus genotypic partitions that predict quantitative trait variation. Genome
Res 2001, 11: 458-470.
8. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF et al.: Multifactor-
dimensionality reduction reveals high-order interactions among estrogen-metabolism
genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138-147.
9. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for
detecting gene-gene interactions in the presence of genotyping error, missing data,
phenocopy, and genetic heterogeneity. Genet Epidemiol 2003, 24: 150-157.
10. Tahri-Daizadeh N, Tregouet DA, Nicaud V, Manuel N, Cambien F, Tiret L: Automated
detection of informative combined effects in genetic association studies of complex traits.
Genome Res 2003, 13: 1952-1960.
11. Zhu J, Hastie T: Classification of gene microarrays by penalized logistic regression.
Biostatistics 2004, 5: 427-443.
12. Schalkoff R: Artificial Neural Networks. New York: McGraw-Hill Companies Inc.; 1997.
13. Bhat A, Lucek PR, Ott J: Analysis of complex traits using neural networks. Genet
Epidemiol 1999, 17: S503-S507.
14. Curtis D, North BV, Sham PC. Use of an artificial neural network to detect association
between a disease and multiple marker genotypes. Annals of Human Genetics 65, 95-107.
2001.
15. Li W, Haghighi F, Falk C: Design of artificial neural network and its applications to the
analysis of alcoholism data. Genet Epidemiol 1999, 17: S223-S228.
16. Lucek P, Hanke J, Reich J, Solla SA, Ott J: Multi-locus nonparametric linkage analysis of
complex trait loci with neural networks. Hum Hered 1998, 48: 275-284.
17. Lucek PR, Ott J: Neural network analysis of complex traits. Genet Epidemiol 1997, 14:
1101-1106.
18. Marinov M, Weeks D: The complexity of linkage analysis with neural networks. Human
Heredity 2001, 51: 169-176.
19. Ott J. Neural networks and disease association. American Journal of Medical Genetics
(Neuropsychiatric Genetics) 105[60], 61. 2001.
20. Saccone NL, Downey TJ, Jr., Meyer DJ, Neuman RJ, Rice JP: Mapping genotype to
phenotype for linkage analysis. Genet Epidemiol 1999, 17 Suppl 1: S703-S708.
21. Sherriff A, Ott J: Applications of neural networks for gene finding. Advances in Genetics
2001, 42: 287-297.

22. Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH: Optimization of neural network
architecture using genetic programming improves detection and modeling of gene-gene
interactions in studies of human diseases. BMC Bioinformatics 2003, 4: 28.
23. Koza J, Rice J: Genetic generation of both the weights and architecture for a neural
network. IEEE Transactions 1991, II.
24. Motsinger AA, Lee S, Mellick G, Ritchie MD: GPNN: Power studies and applications of
a neural network method for detecting gene-gene interactions in studies of human disease.
BMC Bioinformatics 2005, in press.
25. Bush WS, Motsinger AA, Dudek SM, Ritchie MD: Can neural network constraints in GP
provide power to detect genes associated with human disease? Lecture Notes in Computer
Science 2005, 3449: 44-53.
26. Ritchie MD, Coffey CS, Moore JH: Genetic programming neural networks: A bioinformatics
tool for human genetics. Lecture Notes in Computer Science 2004, 3102: 438-448.
27. O'Neill M, Ryan C. Grammatical Evolution. IEEE Transactions on Evolutionary
Computation 5, 349-357. 2001.
28. O'Neill M, Ryan C. Grammatical evolution: Evolutionary automatic programming in an
arbitrary language. 2003. Boston, Kluwer Academic Publishers.
29. Moore JH, Hahn LW: Petri net modeling of high-order genetic systems using grammatical
evolution. BioSystems 2003, 72: 177-186.
30. Mitchell M. An introduction to genetic algorithms. 1996. Cambridge, MIT Press.
31. Cantu-Paz E. Efficient and accurate parallel genetic algorithms. 2000. Boston, Kluwer
Academic Publishers.
32. Utans J, Moody J. Selecting neural network architectures via the prediction risk
application to corporate bond rating prediction. 1991. Los Alamitos, California, IEEE
Press. Conference Proceedings on the First International Conference on Artificial
Intelligence Applications on Wall Street.
33. Moody J: Prediction risk and architecture selection for neural networks. In From Statistics
to Neural Networks: Theory and Pattern Recognition Applications. Edited by Cherkassky
V, Friedman JH, Wechsler H. NATO ASI Series F, Springer-Verlag; 1994.
34. Fahlman SE, Lebiere C: The Cascade-Correlation Learning Architecture. Technical report,
School of Computer Science, Carnegie Mellon University; 1991.
35. Templeton A: Epistasis and complex traits. In: Wade M, Brodie B III, Wolf J (eds.):
Epistasis and the Evolutionary Process. Oxford: Oxford University Press; 2000: 41-57.
36. Li W, Reich J. A complete enumeration and classification of two-locus disease models.
Hum.Hered. 50, 334-349. 2000.
37. Moore J, Hahn L, Ritchie M, Thornton T, White B. Application of genetic algorithms to
the discovery of complex models for simulation studies in human genetics. Langdon, WB,
Cantu-Paz, E, Mathias, K, Roy, R, Davis, D, Poli, R, Balakrishnan, K, Honavar, V,
Rudolph, G, Wegener, J, Bull, L, Potter, MA, Schultz, AC, Miller, JF, Burke, E, and
Jonoska, N. Proceedings of the Genetic and Evolutionary Algorithm Conference. 1150-
1155. 2002. San Francisco, Morgan Kaufman Publishers.
38. Anderson J: An Introduction to Neural Networks. Cambridge, Massachusetts: MIT Press;
1995.
39. Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models
displaying no main effect. Am J Hum Genet 2002, 70: 461-471.
40. Frankel W, Schork N. Who's afraid of epistasis? Nat.Genet. 14, 371-373. 1996.
Obtaining Biclusters in Microarrays
with Population-Based Heuristics

Pablo Palacios1 , David Pelta2 , and Armando Blanco2


1
Dept. de Ingeniería Electrónica, Sistemas Informáticos y Automática,
Universidad de Huelva, Campus UIB - Huelva
[email protected]
2
Depto. de Ciencias de la Computación e I.A., Universidad de Granada,
Calle Periodista Daniel Saucedo Aranda s/n,
18071 Granada, Spain
{dpelta, armando}@decsai.ugr.es

Abstract. In this article, we shall analyze the behavior of population-based
heuristics for obtaining biclusters from DNA microarray data. More
specifically, we shall propose an evolutionary algorithm, an estimation of
distribution algorithm, and several memetic algorithms that differ in the
local search used.
In order to analyze the effectiveness of the proposed algorithms, the freely
available yeast microarray dataset has been used. The results obtained have
been compared with the algorithm proposed by Cheng and Church.
Both in terms of the computation time and the quality of the solutions, the
comparison reveals that a standard evolutionary algorithm and the estimation
of distribution algorithm offer an efficient alternative for obtaining
biclusters.

1 Introduction

One of the research fields which aroused the greatest interest towards the
end of the 20th century, and whose future is expected to be equally promising
in the 21st century, is the study of an organism's genome, or genomics.
By way of a brief history, it was Gregor Mendel who defined the gene concept
in his research as the element where information about hereditary
characteristics is to be found. At a later stage, Avery, MacLeod and McCarty
demonstrated that an organism's genetic information stems from a macromolecule
called deoxyribonucleic acid (DNA); it was later discovered that genetic
information located in specific areas of the DNA (the genes) enabled protein
synthesis; this was followed by the sequencing of the genome of certain
organisms (including humans). These findings and their consequences awakened
a great deal of interest among scientists.
Since proteins are responsible for carrying out cellular functions, cellular
functioning depends on the proteins synthesized by the genes, and is determined
by the regulation of protein synthesis (gene expression) and the control of
protein activity.


The process whereby the approximately 30,000 genes in the human genome
are expressed as proteins involves two steps: 1) the DNA sequence is
transcribed into messenger RNA (mRNA) sequences; and 2) the mRNA sequences
are in turn translated into the amino acid sequences which comprise the
proteins.
Measuring the mRNA levels provides a detailed vision of the subset of genes
which are expressed in different types of cells under different conditions.
Measuring these levels of gene expression under different conditions helps
explore, among others, the following aspects in greater depth: a) the function
of the genes; b) how several genes interact; and c) how different experimental
treatments affect cell function.
Recent advances in array-based methods enable the expression levels of
thousands of genes to be measured simultaneously. These measurements are
obtained by quantifying the hybridization of mRNA with a cDNA array, or with
oligonucleotide probes fixed on a solid substrate.
Technological advances in the development of cDNA arrays produce an amazingly
large quantity of data relating to the transcription levels of thousands of
genes under specific conditions. For knowledge extraction (function of the
genes, implication of certain genes in specific illnesses, etc.), researchers
use consolidated methodologies, while specific ones are being developed.
However, although the results obtained so far keep getting better, there is
still room for improvement.

2 Gene Expression Matrices

In a gene expression matrix, the rows represent genes and the columns
represent samples, and each cell contains a number which characterizes the
expression level of a particular gene in a particular sample.
Like most experimental techniques, microarrays measure the final objective
indirectly through another physical quantity, for example the relative
abundance of mRNA through the fluorescence intensity of the spots in an array.
Microarray-based techniques are still a long way from providing the exact
quantity of mRNA in a cell. The measurements are inherently relative:
essentially, we can compare the expression levels of one gene in different
samples, or of different genes in one sample, so a suitable normalization must
be applied to enable comparisons between data. Moreover, as the measured
expression values can vary considerably according to the reliability and
limitations of a particular microarray technique for certain types of
measurements, data normalization is a key issue to consider.
Once we have constructed the gene expression matrix, the second step is to
analyze it and attempt to obtain information from it.
In this work we shall use the biclustering concept introduced by Hartigan [6]
to capture the degree of similarity between a subset of elements within a
subset of attributes. Cheng and Church applied this technique to DNA
microarrays [3].
The advantage of biclustering as opposed to traditional clustering when
applied to the field of microarrays lies in its ability to identify groups of
genes
that show similar activity patterns under a specific subset of the
experimental conditions. Therefore, biclustering approaches are the key
technique to use when one or more of the following situations applies [7]:

1. Only a small set of the genes participates in a cellular process of
interest.
2. An interesting cellular process is active only in a subset of the
conditions.
3. A single gene may participate in multiple pathways that may or may not be
coactive under all conditions.

Besides, biclusters should not be exclusive and/or exhaustive: a gene or
condition should be able to belong to more than one bicluster, or to no
bicluster at all, and be grouped using a subset of conditions/genes.

2.1 Biclustering Techniques

In this section, we briefly present some representative strategies from the
literature for obtaining biclusters.
Cheng and Church [3] proposed a set of heuristic algorithms which, beginning
with the complete matrix, are based on the execution of iterative stages of
deletion and addition of rows in order to obtain biclusters.
The FLOC algorithm (Flexible Overlapped Clustering) [16] is a variant of the
previous work which performs an iterative process from an initial set of
biclusters in an attempt to improve their overall quality. In each iteration,
a row or column is added to or deleted from each bicluster in order to produce
the ideal set of biclusters in terms of having the greatest similarity. The
algorithm finishes when there is no improvement in the overall quality of the
previous set of biclusters.
The CLICK algorithm is based on the construction of bipartite graphs [13].
This algorithm uses the following three stages to obtain biclusters: a)
identify regions which may contain biclusters; b) identify the biclusters;
and c) refine the biclusters to their minimum size.
In the double conjugated clustering algorithm [2], the search for biclusters
is performed on two different search spaces: one comprising the genes, and the
other the experiments. In this way, a clustering algorithm is applied to each
space independently. In order to join the clusters induced in both processes,
two functions are defined which enable one node from one space to be converted
into the conjugated node of the other space, and vice versa. The final
adjustment between both search spaces is obtained by means of a new clustering
process to correct the clusters conjugated in the other space.
The pCluster algorithm [14] uses a more general similarity type: two objects
belong to the same cluster if they display a similar pattern for a subset of
dimensions. This enables biclusters to be discovered whose elements comply
with the same pattern although they are not close to each other. Discovering
such biclusters is essential when revealing gene behavior.
Finally, the pMafia algorithm [10] consists of a parallel implementation of
a grid algorithm [12, 15] adapted for biclusters. Each dimension is divided
into small intervals of a fixed size called "windows". During the clustering
process,
when two adjacent windows are similar, they are merged and a new one is
created. The algorithm uses parallelism to reduce the computation time.
Recent years have seen increasing interest in this field. We refer the
interested reader to the excellent survey by Madeira and Oliveira [7].

2.2 Measuring the Quality of a Bicluster


Below, we define the main concepts which form the basis of the residue-based
bicluster induction methods towards which we have directed our research.

Definition: Let D be a gene expression matrix of size n × m (D_{n×m}), where
the set of rows F = {G_1, G_2, ..., G_n} represents the genes and the set of
columns R = {E_1, E_2, ..., E_m} represents the conditions or experiments.
Each element d_{ij} in the matrix corresponds to the expression level
(absolute or relative) of gene G_i in experiment E_j.

Definition: Given a gene expression matrix D_{n×m}, a bicluster is a pair
(I, J), where I ⊆ {1, ..., n} is a subset of the rows of F and J ⊆ {1, ..., m}
is a subset of the columns of R, in which the genes G_i with i ∈ I behave in a
similar way.

Definition: Given a bicluster (I, J), the residue r_{ij} of an element d_{ij}
of the bicluster is calculated according to Equation 1:

    r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}    (1)

where

    d_{iJ} = \frac{\sum_{j \in J} d_{ij}}{|J|}    (2)

    d_{Ij} = \frac{\sum_{i \in I} d_{ij}}{|I|}    (3)

    d_{IJ} = \frac{\sum_{i \in I, j \in J} d_{ij}}{|I| \, |J|}    (4)

The residue is an indicator of the degree of coherence of an element in
relation to the remaining elements in the bicluster, additionally providing
the bias of the objects and the relevant attributes. Therefore, the smaller
the value of the residue, the greater the coherence.

Definition: Given a bicluster (I, J), the residue r_{IJ} of the bicluster can
be obtained from Equation 5, where r_{ij} is the residue of the element d_{ij}
and v_{IJ} = |I| \, |J| is the volume of the bicluster:

    r_{IJ} = \frac{\sum_{i \in I, j \in J} |r_{ij}|}{v_{IJ}}    (5)

In order to determine the overall quality of a bicluster, its residue is
defined as the mean of the residues of all its elements. This mean could be
arithmetic, geometric, etc. Here, we applied the arithmetic mean.
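As an illustration of Equations (1)-(5), the residue of a bicluster can be
computed as in the following minimal Python/NumPy sketch (our own illustrative
code, not the implementation used in the experiments; the names D, I and J
follow the definitions above):

import numpy as np

def bicluster_residue(D, I, J):
    """Mean absolute residue of the bicluster (I, J) of matrix D (Eq. 5)."""
    B = D[np.ix_(I, J)]                   # the |I| x |J| submatrix
    d_iJ = B.mean(axis=1, keepdims=True)  # row means d_iJ (Eq. 2)
    d_Ij = B.mean(axis=0, keepdims=True)  # column means d_Ij (Eq. 3)
    d_IJ = B.mean()                       # overall mean d_IJ (Eq. 4)
    r = B - d_iJ - d_Ij + d_IJ            # elementwise residues r_ij (Eq. 1)
    return np.abs(r).mean()               # arithmetic mean over the volume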
3 Proposed Population Based Techniques


In this section, we shall describe the particular characteristics of the
genetic algorithm (GA), the memetic algorithms (MA), and the estimation of
distribution algorithm (EDA) which have been implemented.

3.1 Genetic Algorithm


As the simplest population based technique, we shall implement a classical ge-
netic algorithm, with elitism. The main characteristics are described next.
Codification: The solutions representation is as follows: given a data matrix
D of size n m, the biclusters are encoded in a vector of size n + m, where
the rst n positions represent the rows and the last m positions represent the
columns of the bicluster. Each position of the vector can have one of two values
(1 or 0) indicating whether the corresponding row or column is to be found (1)
or not (0) in the bicluster.
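For concreteness, this codification can be decoded as in the sketch below
(illustrative only; the function name is ours, not taken from the paper). A
fitness evaluation would then compose this decoding with the residue
computation of Section 2.2:

def decode(chromosome, n, m):
    """Return the index sets (I, J) encoded by a 0/1 vector of length n+m."""
    I = [i for i in range(n) if chromosome[i] == 1]
    J = [j for j in range(m) if chromosome[n + j] == 1]
    return I, J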
Selection: Baker's stochastic universal sampling was chosen as the selection
method. This is a roulette-wheel method with slots sized according to the
fitness of each chromosome.
Crossover: the uniform operator is used: given two parents, the offspring keep
the values common to both parents, while every other value is randomly taken
from either parent.
Mutation: a bit-flip mechanism was chosen: given an individual in the
population, one of its bit values is changed to its complementary one.
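Both variation operators admit a direct implementation on the binary encoding;
the following is a hedged illustration (function names are ours) of the
uniform crossover and bit-flip mutation just described:

import random

def uniform_crossover(p1, p2):
    # the offspring keeps values common to both parents; every other value
    # is taken at random from either parent
    return [a if a == b else random.choice((a, b)) for a, b in zip(p1, p2)]

def bitflip_mutation(chrom):
    # change one randomly chosen bit to its complementary value
    c = list(chrom)
    i = random.randrange(len(c))
    c[i] = 1 - c[i]
    return c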
Fitness: measured as the residue associated with the bicluster represented by
a given individual.
Restart: For the restart strategy, we chose to move the best individual to the
new population. In addition, 20% of the new population consists of mutated
copies of the best individual of the current generation, and the rest is
generated randomly. This restart is applied whenever 10% of the total number
of generations have elapsed with no change in the best element of the
population.

3.2 Memetic Algorithms


Memetic algorithms have been studied from practical and theoretical points of
view for some 15 years [5] and, while very complex strategies are being
developed, in their simplest form they can still be considered as a genetic
algorithm hybridized with a local search operator.
Here, departing from the genetic algorithm described above, we implemented
several basic memetic algorithms using two different local search techniques,
namely:

K-opt: the chromosomes can also be seen as a permutation of the rows and
columns of the bicluster, so that we can apply k-opt movements to them, which
in particular consist in exchanging k bits of a given solution with each
other. If such an exchange leads to an improvement, a new k-opt movement is
undertaken, and this process continues until no improvement is made within a
certain number of trials (a sketch is given after this list). We constructed
four different memetic schemes, for values of k ∈ {2, 3, 4, 5}.
Taboo Search (TS): a very basic TS strategy is used. The neighborhood is
sampled using the mutation operator described for the GA.
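The k-opt local search can be sketched as follows (an illustrative reading of
the description above, assuming that "exchanging k bits" means flipping k
randomly chosen positions, that lower residue is better, and that the trial
budget is a free parameter; all names are ours):

import random

def k_opt(chrom, k, residue, max_trials=50):
    best, best_res = list(chrom), residue(chrom)
    trials = 0
    while trials < max_trials:
        cand = list(best)
        for i in random.sample(range(len(cand)), k):
            cand[i] = 1 - cand[i]      # exchange/flip k bits
        cand_res = residue(cand)
        if cand_res < best_res:        # improvement: accept and keep searching
            best, best_res = cand, cand_res
            trials = 0
        else:
            trials += 1
    return best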

3.3 Estimation of Distribution Algorithms


Estimation of Distribution Algorithms (EDA) were described for the first time
by H. Mühlenbein and G. Paaß [8]. In EDAs, the solution space is represented
by means of a probability distribution associated with the individuals
selected in each generation, rather than by a population. This probability
distribution is calculated from a set of individuals selected from the
previous generation. Once obtained, it is sampled in order to generate
descendants, so that neither mutation nor crossover is applied in EDAs.
The easiest way to calculate the probability distribution consists in
considering all the variables of interest to be independent. The probability
estimation is then converted into the product of the marginal probabilities
of the n variables:

    p(x) = \prod_{i=1}^{n} p(x_i)    (6)

Different approximations to the methodology can be found, including the
univariate marginal distribution algorithm [9], population-based incremental
learning [1], and the compact genetic algorithm [4].
In this work we focus on the Univariate Marginal Distribution Algorithm
(UMDA, in what follows). UMDA maintains a population of N individuals, to
which a selection method is applied in order to create a new population. From
this new population, the frequencies of each gene are obtained and used to
generate a new population of N individuals. This mechanism for generating the
population is a type of crossover operator which replaces the traditional GA
crossover operator.
The process in detail is as follows:

1. Choose the M best individuals in the current population, which form the
set D_{l-1}^{Se}.
2. Estimate the probability distribution of the current population using
Eq. 7.
3. Generate a new population of N individuals from the probability
distribution, which is stored in the set D_l^{Se}.

The probability distribution is expressed as the product of univariate
marginal probabilities, which are estimated from the marginal frequencies:

    p_l(x_i) = \frac{\sum_{j=1}^{M} \delta_j(X_i = x_i \mid D_{l-1}^{Se})}{M}    (7)

where \delta_j(X_i = x_i \mid D_{l-1}^{Se}) is 1 for the j-th individual in M
if the value of gene X_i is equal to x_i, and 0 in any other case.
The outline of our UMDA-based algorithm can be seen in Algorithm 1.

Algorithm 1. Proposed Estimation of Distribution Algorithm

1. Generate an initial population of size N
2. Set itersNow = 0
3. While itersNow < maxIters:
   (a) Choose N/2 individuals
   (b) Estimate the probability distribution of the M selected individuals
   (c) Sample the distribution in order to obtain new individuals
   (d) Replace the old population with the new one
4. End While
5. Return the best individual
6. End
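A compact rendering of Algorithm 1 in Python might look as follows (a sketch
under the assumptions that M = N/2, that selection is truncation by residue,
and that lower residue is better; names such as pop_size and residue are ours,
not the authors'):

import random

def umda(pop_size, length, max_iters, residue):
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(max_iters):
        pop.sort(key=residue)                # lower residue is better
        selected = pop[: pop_size // 2]      # the M = N/2 chosen individuals
        # univariate marginal probabilities p_l(x_i) of Eq. 7
        probs = [sum(ind[i] for ind in selected) / len(selected)
                 for i in range(length)]
        # sample the distribution to build the new population
        pop = [[1 if random.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
    return min(pop, key=residue)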

4 Experiments
In order to evaluate and analyze the implemented algorithms, the yeast
expression data set has been used, comprising 17 experiments (columns) on
2900 genes (rows). This gene expression data set was chosen since it is one
of the most used in the literature by the majority of experts in this field,
thereby enabling our results to be compared.
The results obtained with the proposed tools have been compared using the
algorithm proposed by Cheng and Church [3] as a reference algorithm.
Following an empirical study, the following parameters were fixed: a
population of 200 individuals and 200 generations. The crossover and mutation
probabilities were fixed at 0.8 and 0.6, respectively.
Each algorithm was executed 30 times, and the seed of the random number
generator was changed in each execution. At the end of each execution, the
best bicluster found was recorded.

4.1 Results
This section includes the results obtained by the proposed algorithms and the
reference algorithm. We also performed a random sampling of biclusters in
order to check the expected residue value for a random bicluster.
Table 1 shows the corresponding residues for the best and worst biclusters,
the mean and standard deviation over 30 executions, and also an indication of
the time taken by each execution. The average sizes of the resulting
biclusters are also displayed. Results for the reference algorithm were taken
over 100 biclusters, while 10^4 random solutions were generated.
Figure 1 shows the histograms of the residue (a), rows (b) and columns (c) of
the best biclusters found by every algorithm. Results of k-opt for k ≥ 3 were
omitted for visualization purposes (although they were extremely similar to
those of 2-opt). These figures give us a global view of the whole set of best
biclusters available, enabling us to quickly check the region of the search
space covered by every strategy.
Several aspects can be highlighted from the table and the histograms. The
first is that the GA achieves very good residue values despite the simplicity
of its component definitions. On average, the biclusters generated are quite
big (825 rows and 8 columns).
The joint use of the GA with local search, leading to different memetic
schemes, does not seem to be useful. The simpler local search schemes (k-opt)
increase the residue, while the average number of rows in the bicluster is
significantly reduced. In this case, and given the standard deviation values,
it is clear that there is a convergence problem which is independent of the
value of k. Moreover, no statistical differences were detected for different
values of k.
As the complexity of the local search is increased, from 2-opt to TS, the
residue values also increase. This becomes clear if we look at the
corresponding histogram. In turn, the sizes of the biclusters obtained are
slightly larger than those obtained by k-opt.
The EDA strategy achieves the lowest average residue value, while the
corresponding bicluster sizes are about 200 rows and 8 columns. The average
residue for the reference algorithm is almost three times higher than that of
EDA, while the biclusters are smaller on average (although the number of
columns is increased from 8 to 12). The reference algorithm presents the
highest variability in residue, number of rows and columns (this is clearly
seen in the histograms).
In order to determine which differences in terms of residue are significant,
a Kruskal-Wallis test was performed. The test reveals significant differences
among the medians of the residues of the algorithms. Then, pairwise
Mann-Whitney U nonparametric tests were performed, and they confirm that the
differences among the algorithms are significant.
Another element to analyze is the computational time used. The fastest
algorithm, and the best one in bicluster quality, is EDA, followed by the GA.
The addition of local search to the GA increases the computational time
considerably while not having the same counterpart in bicluster quality.
Cheng and Church's algorithm has quite acceptable running times.
In Fig. 2 we plot the volume of the best solutions (calculated as rows ×
columns) against their residue for the GA, EDA and the reference algorithm.
This plot reveals several things. The first is the existence of many
alternative solutions with similar residue; see, for example, the vertical
range for residues between 5 and 10. This fact is most notable for the GA and
EDA. In the second place, we can see that the reference algorithm is able to
obtain solutions very similar in size while very different in residue. Both
facts clearly encourage the use of population-based techniques, which make it
simple to manage a set of solutions of diverse characteristics.
Table 1. Statistical values of the residue and of the size of the biclusters
found by every algorithm. Time is in minutes per run.

                     Residue                      Average Size
Algorithm   Avg (sd)       Best    Worst   Rows (sd)          Cols (sd)    Time
GA           8.83 (0.81)    7.30   11.82    825.19 (488.32)   7.94 (4.11)     5
EDA          5.72 (1.45)    4.81    9.87    213.28 (109.74)   8.28 (0.70)     3
GA+2-opt    10.46 (0.04)   10.35   10.50    235.62   (9.95)   8.07 (0.26)    10
GA+3-opt    10.45 (0.05)   10.36   10.50    241.59  (10.14)   8.07 (0.26)    14
GA+4-opt    10.45 (0.05)   10.37   10.49    243.48  (11.71)   8.14 (0.35)    15
GA+5-opt    10.47 (0.04)   10.33   10.50    240.83  (15.67)   8.04 (0.21)    17
GA+TS       17.07 (1.94)   13.90   20.68    280.20  (10.99)   8.00 (0.12)    14
Church      14.19 (3.59)    7.77   29.78    166.70 (226.37)  12.09 (4.40)  5-10
Random      23.51 (2.60)    5.75   34.26   1407.30 (841.16)   9.48 (4.60)    --

Fig. 1. Histograms of the residue (a), row (b) and column (c) values of the
best biclusters obtained by every algorithm. Results of k-opt for k ≥ 3 were
omitted. TS and 2-opt stand for the memetic algorithms using the corresponding
local search scheme.
Fig. 2. Volume (rows × columns) vs. residue of the best biclusters found by
EDA, GA and the reference algorithm


Fig. 3. Average residue vs. iterations for the EDA, GA, GA+TS and GA+2-opt
algorithms

Finally, Fig. 3 shows the evolution of the average residue over time for
typical runs of EDA, GA, GA+TS and GA+2-opt. The fastest convergence is
achieved by EDA; given that no restart mechanism is included, EDA becomes
stagnant during the last half of the time available. The curves for GA and
GA+2-opt are quite similar: both algorithms show a continuous but very slow
convergence. GA+TS is the worst method: it seems that the algorithm cannot
make the population converge. We have two hypotheses for these behaviors: the
first is that there may be a problem in the local search parameters; the
second may be related to the fact that a small change in the genotype can
give rise to a big change in the phenotype and, when this occurs, the use of
local search is not advisable. Both situations are under study, but we suspect
the second reason may be the more relevant.
5 Conclusions and Future Research


The population-based techniques tested here proved to be robust and efficient
strategies for obtaining biclusters from microarray matrices.
More specifically, the simple EDA implemented was able to obtain better
solutions than the other algorithms (including the reference one) while using
less computational time.
We are aware that the analysis of biclustering results is somewhat
controversial, because it is not clear what makes a bicluster good or not. In
principle, it seems desirable for the biclusters to be as large as possible
and to maintain low residue levels. This would indicate that a high number of
genes have been identified with a similar expression profile in a large
number of conditions.
Also, Aguilar-Ruiz [11] proved that the mean squared residue is not precise
enough, from the mathematical point of view, to discover shifting and scaling
patterns simultaneously, and he pointed out that the characterization of an
objective function H that allows both patterns to be detected simultaneously
would be very beneficial. To the best of our knowledge, such a function is
still not available.
A byproduct of using these population-based techniques is that, at the end of
each run, we have available a set of high-quality solutions. This is extremely
important because the ultimate goal is to obtain a set of genes that are
biologically related, so the optimization viewpoint is somehow secondary. In
this context, the use of more complex strategies like memetic algorithms
without problem-specific operators does not seem to be advisable.

Acknowledgments
This work was supported in part by projects TIC2003-09331-C02-01 and
TIN2005-08404-C04-01 from the Spanish Ministry of Science and Technology.

References
1. S. Baluja and R. Caruana. Removing the genetics from the standard genetic
algorithm. In A. Prieditis and S. Russell, editors, Proceedings of the
International Conference on Machine Learning, pages 38-46. Morgan Kaufmann,
San Mateo, CA, 1995.
2. S. Busygin, G. Jacobsen, and E. Kramer. Double conjugated clustering
applied to leukemia microarray data. In SIAM ICDM Workshop on Clustering High
Dimensional Data, 2002.
3. Y. Cheng and G. Church. Biclustering of expression data. In Proceedings of
the 8th International Conference on Intelligent Systems for Molecular Biology,
pages 93-103, 2000.
4. G. R. Harik, F. G. Lobo, and D. E. Goldberg. The compact genetic algorithm.
IEEE Transactions on Evolutionary Computation, 3(4):287-297, November 1999.
5. W. Hart, N. Krasnogor, and J. Smith, editors. Recent Advances in Memetic
Algorithms. Studies in Fuzziness and Soft Computing. Physica-Verlag, 2004.
6. J. Hartigan. Clustering Algorithms. John Wiley, 1975.
7. S. Madeira and A. Oliveira. Biclustering algorithms for biological data
analysis: a survey. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 1(1):24-45, 2004.
8. H. Mühlenbein and G. Paaß. From recombination of genes to the estimation
of distributions. In Lecture Notes in Computer Science 1141: Parallel Problem
Solving from Nature IV, pages 178-187, 1996.
9. H. Mühlenbein. The equation for response to selection and its use for
prediction. Evolutionary Computation, 5:303-346, 1998.
10. H. Nagesh, S. Goil, and A. Choudhary. pMAFIA: A scalable parallel subspace
clustering algorithm for massive data sets. In International Conference on
Parallel Processing, page 477, 2000.
11. J. S. Aguilar-Ruiz. Shifting and scaling patterns from gene expression
data. Bioinformatics, 21(20):3840-3845, 2005.
12. E. Schikuta. Grid-clustering: An efficient hierarchical clustering method
for very large data sets. In Proceedings of the 13th International Conference
on Pattern Recognition, IEEE Computer Society, 2:101-105, 1996.
13. R. Sharan and R. Shamir. CLICK: A clustering algorithm with applications
to gene expression analysis. In Proceedings of the 8th International
Conference on Intelligent Systems for Molecular Biology, pages 307-316, 2000.
14. H. Wang, W. Wang, J. Yang, and P. Yu. Clustering by pattern similarity in
large data sets. In SIGMOD Conference, 2002.
15. W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid
approach to spatial data mining. In Proceedings of the 23rd Conference on
Very Large Databases, pages 186-195, 1997.
16. J. Yang, H. Wang, W. Wang, and P. Yu. Improving performance of bicluster
discovery in a large data set. In Proceedings of the 6th ACM International
Conference on Research in Computational Molecular Biology (RECOMB), 2002.
Multiple Sequence Alignment
Based on Set Covers

Alexandre H.L. Porto and Valmir C. Barbosa

Universidade Federal do Rio de Janeiro,
Programa de Engenharia de Sistemas e Computação, COPPE,
Caixa Postal 68511, Rio de Janeiro - RJ 21941-972, Brazil
[email protected]

Abstract. We introduce a new heuristic for the multiple alignment of a set of
sequences. The heuristic is based on a set cover of the residue alphabet of
the sequences, and also on the determination of a significant set of blocks
comprising subsequences of the sequences to be aligned. These blocks are
obtained with the aid of a new data structure, called a suffix-set tree,
which is constructed from the input sequences under the guidance of the
residue-alphabet set cover and generalizes the well-known suffix tree of the
sequence set. We provide performance results on selected BAliBASE amino-acid
sequences and compare them with those yielded by some prominent approaches.

Keywords: Multiple sequence alignment; Set covers.

1 Introduction

The multiple alignment of biological sequences is one of the most important
problems in computational molecular biology. It has applications in many
different important domains, such as the analysis of protein structure and
function, the detection of conserved patterns and of the domain organization
of a protein family, evolutionary studies based on phylogenetic analysis, and
database searching for new members of a protein family.
The problem of multiple sequence alignment can be stated in the following
manner. Let s_1, ..., s_k, with k ≥ 2, be sequences of lengths n_1, ..., n_k,
respectively, over a residue alphabet R. An alignment A of these sequences is
a k × l matrix such that A[i, j], for 1 ≤ i ≤ k and 1 ≤ j ≤ l, is either a
character in R or the special gap character. In A, fixing i and varying j
from 1 through l must reproduce s_i exactly if gap characters are skipped.
Fixing j, in turn, yields k characters that are said to be aligned in A, of
which at least one must be in R. Note, then, that
max{n_1, ..., n_k} ≤ l ≤ n_1 + ... + n_k.
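As a quick check of this definition, the following sketch (names are ours; we
assume the gap character is rendered as '-') verifies that a k × l matrix of
strings is a valid alignment of s_1, ..., s_k:

GAP = '-'

def is_alignment(A, seqs):
    l = len(A[0])
    if any(len(row) != l for row in A):
        return False                      # all rows must have length l
    for row, s in zip(A, seqs):
        if row.replace(GAP, '') != s:     # skipping gaps must reproduce s_i
            return False
    # every column must contain at least one character in R (not all gaps)
    return all(any(row[j] != GAP for row in A) for j in range(l))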
The goal of the multiple sequence alignment problem is to determine the most
biologically significant alignment of s_1, ..., s_k. Finding this alignment
requires an objective function to associate a score with each possible
alignment, and in this
case the multiple sequence alignment problem is to find an alignment, known
as the optimal alignment, that maximizes the objective function. Many
different objective functions exist, but none of them guarantees that the
corresponding optimal alignment is the most biologically significant alignment
of the sequences. It follows from the definition of an alignment that the
number of different alignments of a given set of sequences is exponentially
large; in fact, the multiple sequence alignment problem is known to be at
least NP-hard [1]. Feasible approaches to the problem are then all of a
heuristic nature.
In this paper we describe a new heuristic that is based on set covers of the
residue alphabet R. Such a set cover is a collection of subsets of R whose
union yields R precisely. The idea behind the use of a set cover is that each
subset can be made to contain all the residues from R that possess certain
structural or physicochemical properties in common. The most familiar scenario
for the use of set covers is the case of R as a set of amino acids, so
henceforth we concentrate mainly on multiple protein alignments. Set covers of
an amino-acid alphabet have been studied extensively (e.g., [2]).
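Purely as an illustration of the concept (this is not the cover used by the
authors), a set cover of the 20-letter amino-acid alphabet by shared
physicochemical properties could be written as:

COVER = {
    "hydrophobic": set("AVLIMFWC"),
    "polar":       set("STNQYC"),
    "positive":    set("KRH"),
    "negative":    set("DE"),
    "small":       set("AGSTP"),
}
# the union of the subsets yields the full alphabet R precisely
assert set().union(*COVER.values()) == set("ACDEFGHIKLMNPQRSTVWY")

Note that the subsets need not be disjoint: here cysteine (C), for instance,
belongs to both the hydrophobic and the polar groups.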
Set covers lie at the heart of the new heuristic. In essence, what they do is
to allow the introduction of a new data structure, called a suffix-set tree,
that generalizes the well-known suffix tree of a set of sequences and can be
used in the determination of the subsequence blocks that ultimately give rise
to the desired alignment. In general terms, this is the same approach as some
others in the literature, but our use of set covers as its basis provides a
fundamentally more direct link between relevant properties shared by groups
of residues and the resulting alignment.
The remainder of the paper is organized as follows. In Section 2 we introduce
our new data structure, and in Section 3 we describe our new approach to
multiple sequence alignment. We then proceed in Section 4 to the presentation
of computational results of the new method as compared to some of its most
prominent competitors, and finalize with conclusions in Section 5.

2 Suffix-Set Trees

In this section we describe our new data structure. It is a generalization of
the well-known suffix tree of a set of sequences, which is one of the most
important data structures in the field of pattern recognition. Such a suffix
tree has O(n_1 + ... + n_k) nodes and can be constructed in
O(n_1 + ... + n_k) time [3].
Suffix trees can be applied to many problems, but their principal application
in computational molecular biology is to assist the algorithms that try to
obtain blocks comprising biologically significant subsequences of a set of
sequences, known as the motifs of that set. These motifs, in the case of
proteins, encode structural or functional similarities; in the case of
nucleic acids, they encode mainly the promoter regions.
Let C = {C_1, ..., C_p} be a set cover of R, and let Σ_C = {σ_1, ..., σ_p, σ_$}
be a new alphabet having a character σ_i for each C_i ∈ C and a further
character σ_$ possessing a function similar to that of the terminator used in
suffix trees. Like the suffix tree, our new data structure is also a rooted
tree; it has edges labeled by sequences of characters from Σ_C and nodes
labeled by indices into some of s_1, ..., s_k to mark suffix beginnings. We
call it a suffix-set tree, and it has the following properties:

- The first character of the label on an edge connecting a node to one of its
children is a different character of Σ_C for each child.
- Each nonempty suffix of every one of the k sequences is associated with at
least one leaf of the tree; conversely, each leaf of the tree is associated
with at least one nonempty suffix of some sequence (if more than one, then
all suffixes associated with the leaf have the same length). Thus, each leaf
is labeled by a set like {(i_1, j_1), ..., (i_q, j_q)} for some q ≥ 1, where
(i_a, j_a), for 1 ≤ a ≤ q, 1 ≤ i_a ≤ k, and 1 ≤ j_a ≤ n_{i_a}, indicates that
the suffix s_{i_a}[j_a .. n_{i_a}] of s_{i_a} is associated with the leaf.

Let v be a node of the tree. The label of v is the set
{(i_1, j_1), ..., (i_q, j_q)} that represents the q suffixes collectively
associated with the leaves of the subtree rooted at v. If c_1 ... c_r is the
concatenation of the labels on the path from the root of the tree to v,
excluding if necessary the terminal character σ_$, then c_1 ... c_r is a
common representation of all prefixes of length r of the suffixes associated