Lecture Notes in Computer Science
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
New York University, NY, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Franz Rothlauf et al. (Eds.)
Applications of Evolutionary Computing
EvoWorkshops 2006: EvoBIO, EvoCOMNET, EvoHOT
EvoIASP, EvoINTERACTION, EvoMUSART, and EvoSTOC
Budapest, Hungary, April 10-12, 2006
Proceedings
Volume Editors
ISSN 0302-9743
ISBN-10 3-540-33237-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-33237-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11732242 06/3142 543210
This year, the EvoWorkshops received the highest number of submissions ever: the number of submissions increased from 123 in 2004 to 143 in 2005 and 149 in 2006. EvoWorkshops 2006 accepted full papers of twelve pages and short papers of a reduced length of five pages. The acceptance rate of 43.6% for EvoWorkshops 2006 indicates the high quality of the papers presented at the workshops and included in these proceedings. The following table gives details on the number of submissions, the number of accepted papers, and the acceptance ratios for EvoWorkshops 2005 and EvoWorkshops 2006 (accepted short papers are given in brackets). For comparison, the acceptance rate of EvoWorkshops 2004 was 44.7%.
year                     2006                          2005
                submissions  accept   ratio   submissions  accept   ratio
EvoBIO               40        21     52.5%       32         13     40.6%
EvoCOMNET            16         5     31.2%       22          5     22.7%
EvoHOT                9         5     55.6%       11          7     63.6%
EvoIASP              35     12(7)     34.3%       37         17     45.9%
EvoInteraction        8         6     75.0%        -          -         -
EvoMUSART            29     10(4)     34.5%       29      10(6)     34.5%
EvoSTOC              12      6(2)     50.0%       12       4(4)     33.3%
Total               149    65(13)     43.6%      143     56(10)     39.1%
We would like to thank all the members of the program committees for their quick and thorough work. We thank the Artpool Art Research Center of Budapest, and especially György Galántai, for offering space and expertise without which the wonderful evolutionary art and music exhibition associated with the conference would not have been possible. Furthermore, we would like to acknowledge the support of Napier University, Edinburgh.
Finally, we would like to say a special thanks to everybody who was involved in the preparation of the event. Special thanks are due to Jennifer Willies, whose work was a great and invaluable help; without her support, running this type of conference, with its many different organizers and differing opinions, would be impossible. Further thanks go to the local organizer, Anikó Ekárt, and her group, who made it possible to run the conference in such a nice place.
EvoWorkshops 2006 was jointly organized with EuroGP 2006 and EvoCOP 2006.
Organizing Committee
Program Committees
Sponsoring Institutions
EvoBIO Contributions
Functional Classification of G-Protein Coupled Receptors, Based on
Their Specific Ligand Coupling Patterns
Burcu Bakir, Osman Ugur Sezerman . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
EvoCOMNET Contributions
BeeHiveGuard: A Step Towards Secure Nature Inspired Routing
Algorithms
Horst F. Wedde, Constantin Timm, Muddassar Farooq . . . . . . . . . . . . 243
EvoHOT Contributions
Optimisation of Constant Matrix Multiplication Operation Hardware
Using a Genetic Algorithm
Andrew Kinane, Valentin Muresan, Noel O'Connor . . . . . . . . . . . . . . . 296
EvoIASP Contributions
Image Space Colonization Algorithm
Leonardo Bocchi, Lucia Ballerini . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
EvoINTERACTION Contributions
On Interactive Evolution Strategies
Ron Breukelaar, Michael T.M. Emmerich, Thomas Bäck . . . . . . . . . . 530
EvoMUSART Contributions
Supervised Genetic Search for Parameter Selection in Painterly
Rendering
John P. Collomosse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
Consensual Paintings
Paulo Urbano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
EvoSTOC Contributions
A Preliminary Study on Handling Uncertainty in Indicator-Based
Multiobjective Optimization
Matthieu Basseur, Eckart Zitzler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
1 Introduction
G-Protein Coupled Receptors (GPCRs) are vital membrane proteins that play a key role in cellular signaling and in the regulation of various basic physiological processes. With their versatile functions across a wide range of physiological cellular conditions, they constitute one of the largest families of eukaryotic transmembrane proteins [29]. In addition to the biological importance of their functional roles, their interaction with more than 50% of prescription drugs has made GPCRs an excellent therapeutic target class for drug design and current pharmaceutical research. Over the last 20 years, several hundred new drugs directed at modulating more than 20 different GPCRs have been registered, and approximately 40% of the top 200 synthetic drugs act on GPCRs [6]. Therefore, many pharmaceutical companies carry out research aimed at understanding the structure and function of these GPCR proteins. Even though thousands of GPCR sequences are known as a result of ongoing genomics projects [10], the crystal structure has been solved
only for one GPCR sequence to date, using electron diffraction at medium resolution (2.8 Å) [15], and for many of the GPCRs the activating ligand is unknown; these are called orphan GPCRs [25]. Hence, based on sequence information, a functional classification method for these orphan GPCRs and for newly arriving GPCR sequences is of great practical use in facilitating the identification and characterization of novel GPCRs.

F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 1-12, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Although laboratory experiments are the most reliable methods, they are neither cost- nor labour-effective. To automate the process, computational methods such as decision trees, discriminant analysis, neural networks and support vector machines (SVMs) have been used extensively for the classification of biological data [21]. Among these methods, SVMs give the best prediction performance when applied to many real-life classification problems, including biological ones [30]. One of the most critical issues in classification is minimizing the probability of error on test data using the trained classifier, which is also known as structural risk minimization. It has been demonstrated that SVMs minimize the structural risk by finding a unique hyperplane with maximum margin to separate the data of two classes [27]. Therefore, compared with other classification methods, SVM classifiers provide the best generalization ability on unseen data [30].
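As an illustrative aside (not part of the paper's method), the maximum-margin hyperplane that an SVM finds can be demonstrated in a few lines; the toy points and the scikit-learn usage here are my own, not the authors':

```python
# A linear SVM on two separable toy clusters: the learned hyperplane
# maximizes the margin 2/||w|| between the classes, and only the points
# closest to it are retained as support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],    # class 0
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)      # width of the separating margin

print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # -> [0 1]
print(round(margin, 3))
```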
In the current literature there are several attempts to classify GPCRs at different family levels, for example using primary database search tools such as BLAST [1] and FASTA [20]. However, these methods require the query protein to be significantly similar to the database sequences in order to work properly. In addition to these database search tools, the same problem has been addressed with secondary database methods (profiles and patterns for classification); e.g., Attwood et al. have worked on GPCRs in particular in the PRINTS database [2] (whose data appear in the INTERPRO database [17]). Hidden Markov models [24], bagging classification trees [32] and SVMs [13], [31] are other methods that have been used to classify GPCRs at different family levels. Karchin et al. conducted the most comprehensive controlled experiments for sequence-based prediction of GPCRs in [13] and showed that SVMs gave the highest accuracy in recognizing GPCR families. However, SVMs require an initial step that transforms each protein sequence into a fixed-length vector, and the predictive accuracy of an SVM depends significantly on this particular fixed-length vector. In [13] it is also pointed out that SVM performance could be increased further by using feature vectors that encode only the most relevant features, since SVMs do not identify the features most responsible for class discrimination. Therefore, for accurate SVM classification, feature vectors should reflect the unique biological information contained in the sequences that is specific to the classification problem at hand.
In this paper, we address the Level 2 subfamily classification problem for Amine GPCRs by applying the Support Vector Machine (SVM) technique with a novel fixed-length feature vector based on the existence of activating-ligand-specific patterns. We obtain discriminative feature vectors by exploiting biological knowledge of the Level 2 subfamilies' transmembrane topology and by identifying specific patterns for each Level 2 subfamily. Since these specific patterns carry ligand-binding information, the features obtained from them are more relevant than the amino acid and dipeptide compositions of GPCR sequences, which in turn improves the accuracy of GPCR Level 2 subfamily classification. Applying our method to the Amine Level 1 subfamily of GPCRs [10], we show that classification accuracy is increased compared to previous studies at the same level of classification.
2 Background
Fig. 1. Portion of the GPCR family tree showing the five main classes of GPCRs and some subfamily members, based on the GPCRDB information system [10]

Table 1. Summary of 168 Class A Amine GPCRs, classified into four Level 2 subfamilies, as shown in [9]
novel fixed-length feature vectors. Details of the three steps of fixed-length feature vector creation (Topology Prediction, Pattern Discovery, and Pattern Matching) are described below.

Topology Prediction. Since the transmembrane (TM) topology pattern is known to be well conserved among GPCRs that share the same function [19], the TM topology is checked for the 168 GPCR sequences in Elrod and Chou's dataset. For topology prediction, the Hidden Markov Model for Topology Prediction (HMMTOP) server, which accurately predicts the topology of helical TM proteins, is used [26]. To segment amino acid sequences into membrane, inside and outside parts, HMMTOP uses HMMs such that the product of the relative frequencies of the amino acids of these segments along the amino acid sequence is maximized. This shows that the maximum of the likelihood function over the space of all possible topologies of a given amino acid sequence correlates with the experimentally established topology [25].

Following topology prediction, the extracellular loop sequences are extracted for each of the 168 GPCR sequences, based on the fact that ligands couple to the extracellular loops of GPCRs and that we are interested in the relation between the ligand specificity of GPCRs and GPCR sub-sub-family classification.
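As a sketch of this extraction step: HMMTOP itself reports predicted segment boundaries, but assuming a simplified per-residue topology string ('i' inside, 'o' outside/extracellular, 'M' membrane helix), pulling out the extracellular loops reduces to a regular-expression scan. The sequence and topology below are invented, not real GPCR data:

```python
import re

def extracellular_loops(sequence, topology):
    """Return the extracellular (outside) segments of `sequence`.

    `topology` is a per-residue string of the same length using
    'i' = inside, 'o' = outside, 'M' = membrane helix -- a simplified
    stand-in for the segmentation an HMMTOP-style predictor produces.
    """
    assert len(sequence) == len(topology)
    # Each maximal run of 'o' marks one extracellular loop.
    return [sequence[m.start():m.end()] for m in re.finditer(r"o+", topology)]

# Toy example (33 residues, not a real receptor):
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
top = "oooooMMMMMiiiiiMMMMMoooooMMMMMiii"
print(extracellular_loops(seq, top))  # -> ['MKTAY', 'RQLEE']
```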
Pattern Discovery. In the second step of fixed-length feature vector creation, for each sub-sub-family of GPCR sequences, flexible patterns that are conserved in the extracellular loops of that particular sub-sub-family's sequences are found using Pratt 2.1, a flexible pattern discovery program [4]. Owing to their flexibility, Pratt patterns may include ambiguous components and fixed and flexible wildcards in addition to their identity components [12]. Hence, Pratt patterns are described using PROSITE notation [4].

Pratt finds patterns matching a subset of the input sequences. This subset is defined by the Min Percentage Seqs to Match (MPSM) parameter, which gives the minimum percentage of the input sequences that should match a pattern. This threshold is set to 50% and 75% in this study so as not to allow very specific patterns that do not generalize to all GPCR sequences in a sub-sub-family; this can also be seen as a precaution against overfitting. For each class of GPCRs, 50 conserved patterns are identified with each of the two MPSM values (50 and 75).
Pattern Matching. The final step in creating fixed-length feature vectors is to check for the existence of every activating-ligand-specific pattern in each extracellular GPCR sequence. To check for the existence of the flexible Pratt patterns, all patterns in PROSITE notation are converted into regular expression form and then searched within the 168 extracellular GPCR sequences. Consequently, by taking activating-ligand-specific pattern existence into account, each GPCR sequence is represented by a vector in a 200-dimensional space (50 patterns multiplied by 4 output classes),

where Gk,1, Gk,2, ..., Gk,200 are the 200 components of activating-ligand-specific pattern inclusion for the k-th extracellular GPCR sequence Gk. Note that if the k-th extracellular GPCR sequence contains pattern j then Gk,j = 1, and if it does not contain pattern j then Gk,j = 0, where j = 1, 2, ..., 200.
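The PROSITE-to-regex conversion can be sketched as below. This is a simplified converter covering the pattern elements mentioned above (identity components, ambiguous sets, excluded sets, and fixed/flexible wildcards), not the authors' actual code:

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (as produced by Pratt) into a
    Python regular expression. Handles identity components (A), ambiguous
    sets [DE], excluded sets {P}, and wildcards x(n) / x(m,n)."""
    regex = []
    for element in pattern.strip("<>.").split("-"):
        m = re.fullmatch(
            r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Z]|x)(?:\((\d+)(?:,(\d+))?\))?",
            element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {ABC} = anything but A, B, C
            core = "[^" + core[1:-1] + "]"
        if lo and hi:                        # flexible wildcard x(m,n)
            core += "{%s,%s}" % (lo, hi)
        elif lo:                             # fixed repetition (n)
            core += "{%s}" % lo
        regex.append(core)
    return "".join(regex)

rx = prosite_to_regex("C-x(2,3)-[DE]-{P}-W")
print(rx)                                   # C.{2,3}[DE][^P]W
print(bool(re.search(rx, "AACTQDAWKK")))    # True: C-TQ-D-A-W matches
```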
Writing each fixed-length feature vector Gk in a new row, we obtain a Gk,j matrix, where k = 1, 2, ..., 168 and j = 1, 2, ..., 200. After inserting the sub-sub-family label of each GPCR sequence into the zeroth dimension of each Gk vector (Gk,0), the matrix corresponds to a training set, so that j = 0, 1, 2, ..., 200, where Gk,0 is 1, 2, 3 or 4, since four sub-sub-families are defined for this classification problem. Note that the four output class labels (1, 2, 3, 4) do not imply any relationship between the classes.
We also created a second fixed-length feature vector, using the best 10 of the 50 patterns from each sub-sub-family, based on the significance scores assigned by the Pratt program. With a similar representation, Gk is then denoted in a 40-dimensional space (10 patterns multiplied by 4 output classes), where j = 1, 2, ..., 40. A Gk,j matrix is formed as above, with k = 1, 2, ..., 168 and j = 1, 2, ..., 40, corresponding to another training set.
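The construction of these training matrices (pattern-existence columns plus the class label in the zeroth column) can be sketched as follows; the sequences, labels, and regex patterns are toy placeholders:

```python
import re

def feature_matrix(sequences, labels, patterns):
    """Build the G[k][j] matrix described above: G[k][0] is the
    sub-sub-family label, and G[k][j] for j >= 1 is 1 if regex pattern j
    occurs in extracellular sequence k, else 0. `patterns` pools the
    (converted) ligand-specific patterns of all four subfamilies."""
    matrix = []
    for seq, label in zip(sequences, labels):
        row = [label] + [1 if re.search(p, seq) else 0 for p in patterns]
        matrix.append(row)
    return matrix

# Toy example with 3 sequences, 2 classes, and 3 regex patterns:
seqs = ["ACDEFW", "WWPGHK", "ACACAC"]
labs = [1, 2, 1]
pats = ["AC", "W{2}", "GH"]
print(feature_matrix(seqs, labs, pats))
# -> [[1, 1, 0, 0], [2, 0, 1, 1], [1, 1, 0, 0]]
```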
As a result, four training sets (two with the 50 MPSM parameter, with j up to 40 or 200, and two with the 75 MPSM parameter, with j up to 40 or 200) are created for producing a classifier with Support Vector Machines, as described in detail below.
false negatives than WU-BLAST and the SAM-T2K profile HMMs [13]. For these reasons, we chose to use SVMs for the GPCR Level 2 subfamily classification problem.
4 Experiments
Since we are interested in classifying the Amine Level 1 subfamily into four Level 2 subfamilies, we face a multi-class classification problem. We use the LIBSVM software [7], which handles multi-class classification with a one-against-one approach. As suggested in [11], some preprocessing is performed before building a classifier with LIBSVM in order to obtain satisfactory results. The preprocessing performed in this study can be summarized under two headings: i) choice of kernel function, and ii) grid search combined with cross-validation for parameter tuning.
As the training set is not big enough to be split in two, 10-fold cross-validation is performed for each of the four training sets. Combining our biologically relevant fixed-length feature vector definition with a robust kernel (RBF) and parameter tuning by grid search shows promising results, which are analyzed in more detail in the next section.
Fig. 3. Coarse grid search on C and γ with 10-fold cross-validation for the training data with 200 attributes and the 50 MPSM parameter. The highest cross-validation accuracy is obtained for γ = 2^4 and C in (2^0, 2^10).
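The tuning step above can be sketched as a coarse exponential grid over C and γ, scored by cross-validated accuracy. scikit-learn stands in for LIBSVM here, and the data, grid ranges, and fold count are illustrative placeholders rather than the paper's actual setup:

```python
# Grid search over (C, gamma) for an RBF-kernel SVM with cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))          # placeholder for the 168 x 200 G matrix
y = np.array([1, 2, 3, 4] * 10)       # placeholder subfamily labels 1..4

param_grid = {
    "C":     [2.0 ** e for e in range(-5, 11, 2)],    # 2^-5 .. 2^9
    "gamma": [2.0 ** e for e in range(-15, 5, 2)],    # 2^-15 .. 2^3
}
# cv=5 keeps this toy example fast; the paper uses 10-fold cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```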
5 Results
As mentioned before, in addition to SVM classification with the parameter-tuned RBF kernel, the three other standard kernel functions were tested (with their default parameters) on our four training sets using 10-fold cross-validation. The results of each experiment are summarized in Table 2.

Classification with the RBF kernel with parameter tuning clearly outperforms the other kernel functions in all cases. Since the linear kernel is a special case of the RBF kernel, the results obtained with these two kernels without parameter tuning are quite close. Although the classification accuracies with 200 and with 40 attributes are close, accuracy with 200 attributes is consistently better. The probable reason is that 40 attributes are not enough to represent the examples (more attributes are needed to discriminate between data points), or that the chosen 40 attributes do not correctly reflect the data points. In contrast to the strict dominance of 200 attributes over 40 attributes, there is no such relationship between the training data with the 50 MPSM parameter and with the 75 MPSM parameter: sometimes one performs better and sometimes the other (e.g., the results of the RBF and RBF* kernels in Table 2). Lower accuracy for the training data with the 75 MPSM parameter is caused by overfitting, which decreases accuracy in the end, whereas with the 50 MPSM parameter, patterns conserved in only 50% of the data may fail to represent the overall data.
Table 2. Results for the four different training sets, as explained in the text, using four different kernel functions and the RBF kernel with parameter tuning (RBF*), with 10-fold cross-validation
6 Discussion
This study differs from previous studies in two main points:

i) Fixed-length feature vector creation: We developed a novel method for obtaining the fixed-length feature vectors required by SVMs. The naive idea of using the protein sequence directly as the feature vector cannot work in SVM classification, since sequence lengths are not fixed. Many studies [9], [32], [31] approached this problem by defining a fixed-length feature vector based on the protein's amino acid composition. Following the representation in [8], each protein is represented by a vector Xk in a 20-dimensional space, where each dimension records how many times the amino acid corresponding to that dimension occurs in that particular protein,

where Xk,1, Xk,2, ..., Xk,20 are the 20 components of amino acid composition for the k-th protein Xk. In addition to the amino acid composition, in some studies the fixed-length vector is obtained from the dipeptide composition [3], which takes the local order of amino acids into account in addition to the amino acid fractions. The dipeptide composition of each protein is shown
References
1. Altschul, S. et al.: Basic local alignment search tool. J. Mol. Biol. 215 (1990) 403–410
2. Attwood, T.K. et al.: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research 31 (2003) 400–402
3. Bhasin, M. and Raghava, G.: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Research 32 (2004) 383–389
4. Brazma, A. et al.: Discovering patterns and subfamilies in biosequences. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (ISMB-96), AAAI Press (1996) 34–43. Pratt 2.1 software is available at www.ebi.ac.uk/pratt
5. Byvatov, E. and Schneider, G.: Support vector machine applications in bioinformatics. Appl. Bioinformatics 2 (2003) 67–77
6. Chalmers, D.T. and Behan, D.P.: The Use of Constitutively Active GPCRs in Drug Discovery and Functional Genomics. Nature Reviews, Drug Discovery 1 (2002) 599–608
7. Chang, C.C. and Lin, C.J.: LIBSVM: a library for support vector machines. (2001) LIBSVM software is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
8. Chou, K.C.: A Novel Approach to Predicting Protein Structural Classes in a (20-1)-D Amino Acid Composition Space. PROTEINS: Structure, Function, and Genetics 21 (1995) 319–344
9. Elrod, D.W. and Chou, K.C.: A study on the correlation of G-protein-coupled receptor types with amino acid composition. Protein Eng. 15 (2002) 713–715
10. Horn, F. et al.: GPCRDB: an information system for G protein coupled receptors. Nucleic Acids Res. 26 (1998) 275–279. Available at: www.gpcr.org/7tm
11. Hsu, C.W. et al.: A Practical Guide to Support Vector Classification. Image, Speech and Intelligent Systems (ISIS) Seminars. (2004)
12. Jonassen, I. et al.: Finding flexible patterns in unaligned protein sequences. Protein Sci. 4 (1995) 1587–1595
13. Karchin, R. et al.: Classifying G-protein coupled receptors with support vector machines. Bioinformatics 18 (2002) 147–159
14. Keerthi, S.S. and Lin, C.J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (2003) 1667–1689
15. Palczewski, K. et al.: Crystal Structure of Rhodopsin: A G-Protein-Coupled Receptor. Science 289 (2000) 739–745
16. Lin, H.T. and Lin, C.J.: A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University (2003). Available at http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf
17. Mulder, N.J. et al.: The InterPro Database - 2003 brings increased coverage and new features. Nucleic Acids Research 31 (2003) 315–318
18. Neuwald, A. and Green, P.: Detecting Patterns in Protein Sequences. J. Mol. Biol. 239 (1994) 698–712
19. Otaki, J.M. and Firestein, S.: Length analyses of mammalian G-protein-coupled receptors. J. Theor. Biol. 211 (2001) 77–100
20. Pearson, W. and Lipman, D.: Improved tools for biological sequence analysis. Proceedings of the National Academy of Sciences 85 (1988) 2444–2448
12 B. Bakir and O.U. Sezerman
21. Quinlan, J.R.: C4.5; Programs for Machine Learning, Morgan Kauman Publish-
ers, San Mateo, CA (1988)
22. Sadka, T. and Linial, M.: Families of membranous proteins can be characterized
by the amino acid composition of their transmembrane domains. Bioinformatics
21 (2005) 378386
23. Schoneberg, T. et al.: The structural basis of g-protein-coupled receptor function
and dysfunction in human diseases. Rev Physiol Biochem Pharmacol. 144 (2002)
143227
24. Sreekumar, K.R. et al.: Predicting GPCR-G-Protein coupling using hidden Markov
models. Bioinformatics 20 (2004) 34903499
Swiss-Prot database (Release 46.4, 2005) is available at http:// www.expasy.org
25. Tusndy, G.E. and Simon, I.: Principles Governing Amino Acid Composition of
Integral Membrane Proteins: Applications to topology prediction. J. Mol. Biol.
283 (1998) 489506
26. Tusndy, G.E. and Simon, I.: The HMMTOP transmembrane topology prediction
server. Bioinformatics 17 (2001) 849850
Available at: http://www.enzim.hu/hmmtop
27. Vapnik, V. The Nature of Statistical Learning Theory, Springer-Verlag, New York.
(1995)
28. Vert,J.P. Introduction to Support Vector Machines and applications to computa-
tional biology.(2001)
29. Vilo, J. et al.: Prediction of the Coupling Specicity of G Protein Coupled Recep-
tors to their G Proteins. Bioinformatics 17 (2001) 174181
30. Yang, Z.R.: Biological applications of support vector machines. Brief. Bioinform.
5 (2004) 328338
31. Ying, H. and Yanda, L.: Classifying G-protein Coupled Receptors with Support
Vector Machine. Advances in Neural Network (ISNN 2004), Springer LNCS. 3174
(2004) 448452
32. Ying, H. et al.: Classifying G-protein Coupled receptors with bagging classication
tree. Computational Biology and Chemistry 28 (2004) 275280
Incorporating Biological Domain Knowledge
into Cluster Validity Assessment
1 Introduction
Over the past few years DNA microarrays have become a key tool in functional
genomics. They allow monitoring the expression of thousands of genes in parallel
over many experimental conditions (e.g. tissue types, growth environments). This
technology enables researchers to collect significant amounts of data, which need
to be analysed to discover functional relationships between genes or samples. The
results from a single experiment are generally presented in the form of a data
matrix in which rows represent genes and columns represent conditions. Each
entry in the data matrix is a measure of the expression level of a particular gene
under a specific condition.
A central step in the analysis of DNA microarray data is the identification of
groups of genes and/or conditions that exhibit similar expression patterns. Clus-
tering is a fundamental approach to classifying expression patterns for biological
and biomedical applications. The main assumption is that genes that are con-
tained in a particular functional pathway should be co-regulated and therefore
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 13–22, 2006.
© Springer-Verlag Berlin Heidelberg 2006
14 N. Bolshakova, F. Azuaje, and P. Cunningham
should exhibit similar patterns of expression [1]. A great variety of clustering
algorithms have been developed for gene expression data. The next data analysis
step is to integrate these numerical analyses of co-expressed genes with biological
function information. Many approaches and tools have been proposed to
address this problem at different processing levels. Some methods, for example,
score whole clustering outcomes or specific clusters according to their biological
relevance; other techniques aim to estimate the significance of over-represented
functional annotations, such as those encoded in the Gene Ontology (GO), in
clusters [2], [3], [4], [5]. Some approaches directly incorporate biological knowledge
(e.g. functional, curated annotations) into the clustering process to aid in
the detection of relevant clusters of co-expressed genes involved in common
processes [6], [7]. Several tools have been developed for ontological analysis of gene
expression data (see the review by Khatri and Draghici [8], for instance) and more
tools are likely to be proposed in the future.
The prediction of the correct number of clusters in a data set is a fundamental
problem in unsupervised learning. Various cluster validity indices have
been proposed to measure the quality of clustering results [9], [10]. Recent studies
confirm that there is no universal pattern recognition and clustering model
to predict molecular profiles across different datasets. Thus, it is useful not to
rely on a single clustering or validation method, but to apply a variety of
approaches. Therefore, a combination of GO-based (knowledge-driven) and
microarray-data-based (data-driven) validation methods may be used for the
estimation of the number of clusters. This estimation approach may represent a
useful tool to support biological and biomedical knowledge discovery.
We implemented a knowledge-driven cluster validity assessment system for
microarray data clustering. Unlike traditional methods that only use (gene ex-
pression) data-derived indices, our method consists of validity indices that incor-
porate similarity knowledge originating from the GO and a GO-driven annota-
tion database. We used annotations from the Saccharomyces Genome Database
(SGD) (October 2005 release of the GO database). A traditional node-counting
method proposed by Wu and Palmer [11] and an information content technique
proposed by Resnik [12] were implemented to measure similarity between gene
products. These similarity measurements have not been implemented for
clustering evaluation in other research.
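The two similarity measures just mentioned can be illustrated with a small sketch. The toy hierarchy, term names and probabilities below are invented for illustration only; a real implementation would traverse the GO graph and derive term probabilities from SGD annotation counts.

```python
import math

# Toy ontology fragment: child -> parent ("a" is the root).
# These term names are placeholders, not real GO identifiers.
PARENT = {"b": "a", "c": "a", "d": "b", "e": "b"}

def ancestors(term):
    """Path from a term up to the root, inclusive."""
    path = [term]
    while term in PARENT:
        term = PARENT[term]
        path.append(term)
    return path

def depth(term):
    return len(ancestors(term))  # the root has depth 1

def lcs(t1, t2):
    """Lowest common subsumer: the deepest shared ancestor."""
    shared = set(ancestors(t1))
    return max((t for t in ancestors(t2) if t in shared), key=depth)

def wu_palmer(t1, t2):
    """Node-counting similarity in the style of Wu and Palmer [11]."""
    return 2.0 * depth(lcs(t1, t2)) / (depth(t1) + depth(t2))

def resnik(t1, t2, prob):
    """Information-content similarity in the style of Resnik [12];
    `prob` maps each term to its annotation probability in the corpus."""
    return -math.log(prob[lcs(t1, t2)])
```

For example, `wu_palmer("d", "e")` evaluates to 2·2/(3+3) ≈ 0.67, since both terms sit at depth 3 and share the ancestor "b" at depth 2.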
The main objective of this research is to assess the application of knowledge-
driven cluster validity methods to estimate the number of clusters in a known
data set derived from Saccharomyces cerevisiae.
C = (S - Smin) / (Smax - Smin)   (8)
where S, Smin and Smax are calculated as follows. Let p be the number of all pairs
of samples (conditions) from the same cluster. Then S is the sum of distances
between samples in those p pairs. Let P be the number of all possible pairs of
samples in the dataset. Ordering those P pairs by distance, we can select the p pairs
with the smallest and the p pairs with the largest distances between samples. The
sum of the p smallest distances equals Smin, whilst the sum of the p largest
equals Smax. From this formula it follows that the numerator will be small if
pairs of samples with small distances are in the same cluster. Thus, small values
of C correspond to good clusters. We calculated distances using the knowledge-
driven methods described above. The number of clusters that minimizes the C-index
is taken as the optimal number of clusters, c.
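The C-index computation just described can be sketched as follows, assuming a symmetric pairwise distance matrix precomputed with one of the knowledge-driven measures (the matrix below is a minimal illustration, not real data):

```python
from itertools import combinations

def c_index(dist, labels):
    """C-index (eq. 8): dist[i][j] is the distance between samples i and j,
    labels[i] the cluster assignment of sample i."""
    n = len(labels)
    all_pairs = list(combinations(range(n), 2))
    # S: sum of distances over the p within-cluster pairs
    within = [dist[i][j] for i, j in all_pairs if labels[i] == labels[j]]
    p, s = len(within), sum(within)
    # Smin / Smax: sums of the p smallest / p largest of all P distances
    ordered = sorted(dist[i][j] for i, j in all_pairs)
    s_min, s_max = sum(ordered[:p]), sum(ordered[-p:])
    return (s - s_min) / (s_max - s_min)

# Two tight clusters {0, 1} and {2, 3}: the within-pair distances are the
# smallest overall, so the C-index reaches its optimum, 0.
dist = [[0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 1],
        [10, 10, 1, 0]]
print(c_index(dist, [0, 0, 1, 1]))  # -> 0.0
```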
d(w, x) > d(y, z), w and x are in different clusters and y and z are in the
same cluster.
By contrast, a quadruple is called disconcordant if one of the following two
conditions is true:
d(w, x) < d(y, z), w and x are in different clusters and y and z are in the
same cluster.
d(w, x) > d(y, z), w and x are in the same cluster and y and z are in different
clusters.
We adapted this method by calculating distances using the knowledge-driven
methods described above.
A good partition is one with many concordant and few disconcordant quadru-
ples. Let Ncon and Ndis denote the number of concordant and disconcordant
quadruples, respectively. Then the Goodman-Kruskal index, GK, is defined as:
GK = (Ncon - Ndis) / (Ncon + Ndis)   (9)
Large values of GK are associated with good partitions. Thus, the number of
clusters that maximizes the GK index is taken as the optimal number of clusters, c.
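A brute-force sketch of the Goodman–Kruskal computation follows; it enumerates all quadruples of four distinct samples, so it is only practical for small sample sets, and the example data are illustrative:

```python
from itertools import combinations

def goodman_kruskal(dist, labels):
    """Goodman-Kruskal index (eq. 9) over quadruples (w, x, y, z) formed
    by two disjoint sample pairs, one within a cluster and one across."""
    pairs = list(combinations(range(len(labels)), 2))
    ncon = ndis = 0
    for (w, x), (y, z) in combinations(pairs, 2):
        if len({w, x, y, z}) < 4:
            continue  # the two pairs must not share a sample
        same_wx = labels[w] == labels[x]
        same_yz = labels[y] == labels[z]
        if same_wx == same_yz:
            continue  # not comparable: both within or both across clusters
        # Orient so that (a, b) is the within-cluster pair.
        (a, b), (c, d) = ((w, x), (y, z)) if same_wx else ((y, z), (w, x))
        if dist[a][b] < dist[c][d]:
            ncon += 1  # concordant: within-cluster distance is smaller
        elif dist[a][b] > dist[c][d]:
            ndis += 1  # disconcordant
    return (ncon - ndis) / (ncon + ndis)
```

When every within-cluster distance is below every between-cluster distance, all comparable quadruples are concordant and GK reaches its maximum of 1.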
5 Results
The clustering algorithm was applied to produce different partitions consisting
of 2 to 6 clusters each. Then, the validity indices were computed for each of
these partitioning results. The two GO-based similarity assessment techniques
introduced above were used in all cases to calculate biological distances between
the genes.
Tables 1 to 4 show the predictions made by the validity indices at each number
of clusters. Bold entries represent the optimal number of clusters, c, predicted
by each method. In the tables, the first cluster validity index approach processes
overall GO-based similarity values, which are calculated by taking into account
the combined annotations originating from the three GO hierarchies. The other
indices are based on independent similarity values, obtained separately from
each of the GO hierarchies.
The C-indices based on Resnik similarity measurement and similarity informa-
tion from the MF, BP and the combined hierarchies indicated that the optimal
Table 3. Goodman-Kruskal index values using Wu and Palmer's similarity metric for
expression clusters originating from yeast data

Table 4. Goodman-Kruskal index values using Resnik's similarity metric for expression
clusters originating from yeast data
6 Accompanying Tool
The approaches described in this paper are available as part of the Machaon
CVE (Clustering and Validation Environment) [10]. This software platform has
been designed to support clustering-based analyses of expression patterns in-
cluding several data- and knowledge-driven cluster validity indices. The pro-
7 Conclusion
This paper presented an approach to assessing cluster validity based on similarity
knowledge extracted from the GO and GO-driven functional databases. A
knowledge-driven cluster validity assessment system for microarray data clustering
was implemented. Edge-counting and information content approaches were
implemented to measure similarity between gene products based on the GO.
The edge-counting approach calculates the distance between the nodes associated
with two terms in a hierarchy: the shorter the distance, the higher the similarity.
Its limitation is that it relies heavily on the assumption that nodes and links in
the GO are uniformly distributed.
The research applies two methods for calculating cluster validity indices. The
first approach processes overall similarity values, which are calculated by taking
into account the combined annotations originating from the three GO hierarchies.
The second approach is based on the calculation of independent similarity
values, which originate from each of these hierarchies. The advantage of our
method over other computer-based validity assessment approaches lies
in the application of prior biological knowledge to estimate functional distances
between genes and the quality of the resulting clusters. This study contributes
to the development of techniques for facilitating the statistical and biological
validity assessment of data mining results in functional genomics.
It was shown that the applied GO-based cluster validity indices could be
used to support the discovery of clusters of genes sharing similar functions. Such
clusters may indicate regulatory pathways, which could be significantly relevant
to specific phenotypes or physiological conditions.
Previous research has successfully applied the C-index with knowledge-driven
methods (the GO-based Resnik similarity measure) [15] to estimate the quality of
the clusters.
Future research will include the comparison and combination of different data-
and knowledge-driven cluster validity indices. Further analyses will comprise,
for instance, the implementation of permutation tests as well as comprehensive
cluster descriptions using significantly over-represented GO terms.
The results contribute to the evaluation of clustering outcomes and the
identification of optimal cluster partitions, which may represent an effective tool to
support biomedical knowledge discovery in gene expression data analysis.
Acknowledgements
This research is partly based upon work supported by the Science Foundation
Ireland under Grant No. S.F.I.-02IN.1I111.
References
1. Fitch, J., Sokhansanj, B.: Genomic engineering: moving beyond DNA sequence to
function. Proceedings of the IEEE 88 (2000) 1949–1971
2. Gat-Viks, I., Sharan, R., Shamir, R.: Scoring clustering solutions by their biological
relevance. Bioinformatics 19 (2003) 2381–2389
3. Lee, S., Hur, J., Kim, Y.: A graph-theoretic modeling on GO space for biological
interpretation of gene clusters. Bioinformatics 20 (2004) 381–388
4. Goeman, J., van de Geer, S., de Kort, F., van Houwelingen, H.: A global test for
groups of genes: testing association with a clinical outcome. Bioinformatics 20
(2004) 93–99
5. Raychaudhuri, S., Altman, R.: A literature-based method for assessing the func-
tional coherence of a gene group. Bioinformatics 19 (2003) 396–401
6. Hanisch, D., Zien, A., Zimmer, R., Lengauer, T.: Co-clustering of biological net-
works and gene expression data. Bioinformatics 18 (2002) S145–S154
7. Sohler, F., Hanisch, D., Zimmer, R.: New methods for joint analysis of biological
networks and expression data. Bioinformatics 20 (2004) 1517–1521
8. Khatri, P., Draghici, S.: Ontological analysis of gene expression data: current tools,
limitations, and open problems. Bioinformatics 21 (2005) 3587–3595
9. Bolshakova, N., Azuaje, F.: Cluster validation techniques for genome expression
data. Signal Processing 83 (2003) 825–833
10. Bolshakova, N., Azuaje, F.: Machaon CVE: cluster validation for gene expression
data. Bioinformatics 19 (2003) 2494–2495
11. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: 32nd Annual Meeting
of the Association for Computational Linguistics, New Mexico State University, Las
Cruces, New Mexico (1994) 133–138
12. Resnik, P.: Using information content to evaluate semantic similarity in a tax-
onomy. In: Proceedings of the 14th International Joint Conference on Artificial
Intelligence (IJCAI). (1995) 448–453
13. Azuaje, F., Bodenreider, O.: Incorporating ontology-driven similarity knowledge
into functional genomics: an exploratory study. In: Proceedings of the Fourth IEEE
Symposium on Bioinformatics and Bioengineering (BIBE 2004). (2004) 317–324
14. Wang, H., Azuaje, F., Bodenreider, O., Dopazo, J.: Gene expression correlation
and gene ontology-based similarity: an assessment of quantitative relationships.
In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in
Bioinformatics and Computational Biology, La Jolla, California, IEEE Press (2004)
25–31
15. Bolshakova, N., Azuaje, F., Cunningham, P.: A knowledge-driven approach to
cluster validity assessment. Bioinformatics 21 (2005) 2546–2547
16. Speer, N., Spieth, C., Zell, A.: A memetic clustering algorithm for the functional
partition of genes based on the Gene Ontology. In: Proceedings of the 2004 IEEE
Symposium on Computational Intelligence in Bioinformatics and Computational
Biology (CIBCB 2004), IEEE Press (2004) 252–259
17. Speer, N., Spieth, C., Zell, A.: Functional grouping of genes using spectral cluster-
ing and gene ontology. In: Proceedings of the IEEE International Joint Conference
on Neural Networks (IJCNN 2005), IEEE Press (2005) 298–303
18. Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka,
L., Wolfsberg, T., Gabrielian, A., Landsman, D., Lockhart, D., Davis, R.: A
genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2
(1998) 65–73
19. Hubert, L., Schultz, J.: Quadratic assignment as a general data-analysis strategy.
British Journal of Mathematical and Statistical Psychology (1976) 190–241
20. Goodman, L., Kruskal, W.: Measures of association for cross classifications. Journal
of the American Statistical Association (1954) 732–764
A Novel Mathematical Model for the Optimization of
DNA-Chip Design and Its Implementation
1 Introduction
The rapid development of nanotechnology and computer science has led to the emergence
of a new research field called bioinformatics. One of the most important techniques of
this new discipline is the DNA-chip, or DNA-microarray. It also represents a revolutionary
innovation in applied medical diagnostics: with its help, the presence of
pathogens and a large proportion of genetically based diseases can be detected
in parallel, quickly and accurately. In other words, it can greatly accelerate
precise diagnostics, so the appropriate treatment can be started earlier.
Nevertheless, wider use of the method is currently limited by its high
operational costs. For example, the production of 80 homogeneous chips with 20,000
DNA fragments costs approximately €40,000. The largest part of these expenses
results from the production of the so-called masks, which are used to determine the
chemical structure of the DNA pieces. Obviously, the large manufacturing costs are a
substantial disadvantage, so the design process needs special attention. The aim of our
work is to facilitate the engineering process and help avoid extra charges due to
chemicals carrying inappropriate genetic information.
We proceed as follows: in Section 2 a short introduction to DNA-chip technology
and the simulated annealing method is given. The estimation of hybridization is
summarized in Section 3. Our mathematical model is presented together with its
optimization in Section 4. The test results can be found in Section 5, and the
conclusion with future work plans in Section 6.
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 23–33, 2006.
© Springer-Verlag Berlin Heidelberg 2006
24 K. Danyi, G. Kókai, and J. Csontos
2 DNA-Chips
The DNA-chip itself is a solid carrier (glass or a special plastic plate) with a set of
single-stranded DNA fragments, so-called probes, on it. The sample is usually a solution
which also contains DNA fragments. According to Watson and Crick, a DNA molecule
consists of two helically twisted strands connected together by a series of hydrogen
bonds (the double helix), and each strand has 4 distinct building blocks (nucleotides or
bases): adenine (dA), guanine (dG), cytosine (dC) and thymine (dT). Generally, different
base sequences determine different genes, which are large DNA fragments that carry and
transmit genetic information. The smaller DNA pieces are called oligonucleotides. It is
energetically favorable if dA bonds to dT and dC pairs with dG, which means there exists
a one-to-one correspondence in the set of DNA fragments. There is another important
bijective function f: DNA → RNA, based on similar chemically preferred pairs, dA-rU,
dT-rA, dC-rG and dG-rC, where RNA fragments are variations of the repetitive rA, rU,
rC and rG elements. The Central Dogma of Molecular Biology, the law of genetic
information storage and transfer in living organisms, is based on these special
relationships (Figure 1). Indeed, if one investigates a sample using a DNA-chip, then
only the unique complementary base sequences ought to form double helixes. This
process is known as hybridization. Nevertheless, every molecular process observed at
the macroscopic level is stochastic: depending on the base sequences, some not exactly
complementary, distorted double helixes may arise. The energetically less stable base
pairs are the so-called mismatches (MMs). MMs and mutations are closely related: MMs
can turn into mutations, and with their help even a single mutation can be detected.
Since we can select the probe sequences, it is theoretically possible to indicate all
well-known and unknown mutations in the sample. DNA-chip technology takes advantage
of parallel experiments: even up to 100,000 different oligonucleotides can be applied on
a single chip, and the investigation takes only a few minutes. Based on the application
area, chips can be divided into three major categories:
All three kinds of DNA-chips serve to detect the presence or absence of certain
nucleotide chains in the analyzed sample. Although the philosophy is basically the
same, there are some substantial differences among the groups. From our point of view
the most remarkable one is the role of mutations, which is most important in the first
group.
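The complementarity rules described above can be sketched in a few lines of code. This is a deliberate simplification: it compares aligned positions directly and ignores strand orientation and any thermodynamic weighting.

```python
# Watson-Crick pairing: dA=dT (two hydrogen bonds), dC≡dG (three).
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# The bijection f: DNA -> RNA (dA-rU, dT-rA, dC-rG, dG-rC).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Complementary DNA strand, read in the opposite direction."""
    return "".join(DNA_COMPLEMENT[b] for b in reversed(seq))

def count_mismatches(probe, target):
    """Number of non-complementary positions between a probe and an
    aligned target of equal length (orientation handling omitted)."""
    return sum(1 for p, t in zip(probe, target) if DNA_COMPLEMENT[p] != t)
```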
Several different DNA-chip manufacturing techniques have been developed during the
last decade. In the case of diagnostic chips, the most frequently applied procedure is
very similar to those used in computer chip fabrication [1]. Photolithographic
processes using photosensitive masks are combined with solid-phase chemistry to bond
DNA fragments onto the surface. With a series of specific masks and chemical steps,
high-density chips can be constructed. S. Fodor played a decisive role in the
development, application and propagation of DNA-chips [2, 3, 4].
The probability of hybridization and the Tm point are descriptors of the same
thermodynamic property: the stability of the duplex. The higher the Tm, the more
likely hybridization occurs. To calculate Tm, the simpler methods use the chemical
formula of the chains (eq. 1), and empirical parameters account for solvation effects
(eq. 2), where L, F are empirical parameters and XG, XC can be simply calculated from
#G and #C. In the most complicated and flexible case, the actual sequence of the
chains is also taken into consideration (eq. 3) [5, 6].
Tm = ΔH0 / (ΔS0 + R ln[c/4]) - 273.15 + const · log[Na+],   (3)
where ΔH0 is the enthalpy, ΔS0 the entropy and c the oligo concentration. ΔH0 and ΔS0
depend on the base sequences and on 12 constants (const) determined by experiment [7, 8].
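Once ΔH0 and ΔS0 have been summed from nearest-neighbour tables, eq. 3 can be evaluated directly. The sketch below assumes R in cal/(K·mol) and uses 16.6 as an assumed default for the empirical salt coefficient; the example thermodynamic values are illustrative, not taken from the published NN tables.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)

def melting_temp(dH, dS, c, na, const=16.6):
    """Nearest-neighbour melting temperature (eq. 3) in degrees Celsius.
    dH: enthalpy in cal/mol, dS: entropy in cal/(K*mol),
    c: oligo concentration in mol/l, na: Na+ concentration in mol/l."""
    return dH / (dS + R * math.log(c / 4.0)) - 273.15 + const * math.log10(na)
```

For an illustrative duplex with ΔH0 = -70 kcal/mol and ΔS0 = -190 cal/(K·mol) at 1 µM oligo and 1 M Na+, this yields a Tm of roughly 45 °C.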
Although there is a scientific dispute [9, 10, 11, 12] about the best parameter set,
the nearest neighbor (NN) method (eq. 3) is generally considered to give the most
accurate prediction of Tm. Nevertheless, substantial problems remain unsolved
concerning NN. First of all, the NN parameters are not valid in solid phase; they are
based on fluid-phase measurements. Secondly, they were originally developed to
describe the thermodynamics of perfectly matching sequences, and their extension to
other cases is painful; there still exist quite a few mismatches without parameters.
Lastly, every parameter defined by experiments has a limited scope. Reparametrization
can help to solve these problems, but it requires a tremendous amount of experimental
work within the NN approach; considering the last argument carefully, this is nothing
but another stone on Sisyphus's way. A more effective approach that avoids these
difficulties will be presented in the following sections.
As we mentioned earlier, regarding DNA the most stable state occurs when the four
bases form Watson-Crick base pairs: dG≡dC, dA=dT (where each dash denotes one
hydrogen bond). If two sequences are exactly complementary, then they will hybridize
with each other under appropriate conditions. Since duplex formation is a stochastic
process, hybridization can also occur between non-perfectly matching sequences, with
a probability in inverse proportion to the number of MMs. Furthermore, one can
conclude that different MMs might have different effects on hybridization.
Accordingly, the following parameters are specified in our model:
Table 1. a) The commutative Cayley table of DNA/DNA pairs; the off-diagonal elements
are MMs and the distinct ones are in italics. b) The Cayley table of DNA/RNA pairs;
the off-diagonal elements are MMs

a) DNA/DNA:
         dA       dC       dG       dT
  dA   dA-dA    dC-dA    dG-dA    dT-dA
  dC   dA-dC    dC-dC    dG-dC    dT-dC
  dG   dA-dG    dC-dG    dG-dG    dT-dG
  dT   dA-dT    dC-dT    dG-dT    dT-dT

b) DNA/RNA:
         dA       dC       dG       dT
  rA   dA-rA    dC-rA    dG-rA    dT-rA
  rC   dA-rC    dC-rC    dG-rC    dT-rC
  rG   dA-rG    dC-rG    dG-rG    dT-rG
  rU   dA-rU    dC-rU    dG-rU    dT-rU
3. Solution-dependent parameters:
They cover the experimental conditions, i.e. salt and sample concentrations and
other environmental factors, which also affect the probability of hybridization. The
experimental conditions are specified by the scientist who plans and accomplishes
the work. One can assume that these values do not change within a certain laboratory.
Although we do not use these parameters explicitly, they are inherently involved in
our approach.
In our basic model, we optimize only the type-dependent parameters, and set the
other parameters to appropriate values (Section 4.2).
4 Methods
All the above-mentioned parameters represent the chemical content of our model. With
the help of the MMs, which are weighted by the appropriate parameters, the mathematical
model for the estimation of the hybridization probability can be presented:
P(hybridization) = max(0, 1 - Σ_{i=0}^{l-1} w_pos(i) · w_mm(m(i))),
where l is the length of the sequence, i is the position number, m(i) is the type of
mismatch at position i, w_pos(i) ∈ R is the weight of position i and w_mm(m(i)) is the
weight of the mismatch type at position i.
If there is no mismatch at position i, then w_pos(i) = w_mm(m(i)) = 0. If appropriate
values are assigned to the parameters, experiments can be replaced by computations.
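Evaluating this estimate is a direct translation of the formula; the weight values in the example are placeholders, not fitted parameters:

```python
def p_hybridization(mismatches, w_pos, w_mm):
    """Hybridization probability estimate: `mismatches` maps a position i
    to its mismatch type m(i); positions without an entry are perfect
    matches and contribute nothing. w_pos[i] is the positional weight,
    w_mm[t] the weight of mismatch type t."""
    penalty = sum(w_pos[i] * w_mm[t] for i, t in mismatches.items())
    return max(0.0, 1.0 - penalty)

w_pos = [0.1, 0.2, 0.5, 0.1]            # placeholder position weights
w_mm = {"dG-rG": 0.8, "dT-rC": 5.0}     # placeholder mismatch weights
print(p_hybridization({}, w_pos, w_mm))             # perfect match -> 1.0
print(p_hybridization({2: "dG-rG"}, w_pos, w_mm))   # 1 - 0.5*0.8 -> 0.6
```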
where the element a_ij was derived from the chip experiment (TIFF image processing)
and its value is based on the hybridization of the DNA fragment in row i and column j;
n and m are the dimensions of the chip. The elements b_ij of the matrix B are computed
as follows:

b_ij = f(i, j) = Σ_{k=0}^{l} w_pos(k) · w_mm(S_ijk),

where l is the length of the sequence, S_ij is the sequence on the chip in row i and
column j, and S_ijk is the MM at position k in this sequence. Thus the elements of
matrix B are calculated from character strings, which consist of variations of the four
symbols (A, C, G, T).
In a real dataset, the proportions of MMs are usually not balanced. In order to
optimize the MM weights proportionately, we need to scale the number of the MMs
present and extend the target function. The percentage of every mismatch in the
dataset is calculated as follows:

p_m(t) = (N_m(t) · 100) / (Σ_{t=1}^{12} N_m(t)) %,

where m(t) is the mismatch and N_m(t) is the number of occurrences of mismatch m(t)
in the real dataset.
In order to scale the number of MMs considering DNA/RNA chips, the ratio of
100%/12 = 8.3333% (the evenly distributed MM share) to p_m(t) is taken, thus the
scale weight
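The text above is cut off before the scale weight is fully defined; under the assumption that the weight is the ratio of the even share (100%/12) to the observed share p_m(t), the scaling can be sketched as:

```python
def mismatch_percentages(counts):
    """p_m(t): the percentage of each mismatch type; `counts` maps a
    mismatch type t to its occurrence count N_m(t) in the dataset."""
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.items()}

def scale_weights(counts):
    """Assumed scale weight per mismatch type: the evenly distributed
    share (100%/12 = 8.3333%) divided by the observed share p_m(t),
    so that rare mismatch types are upweighted."""
    even = 100.0 / 12.0
    return {t: even / p for t, p in mismatch_percentages(counts).items()}
```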
5 Results
Figure 5 shows the differences between the average optimum values as a function of
the initial test temperature and the number of iterations. The initial test temperature
was increased from 50 to 140 and the number of iterations from 10 to 55. It can be
seen that the optimum values increase if the temperature is between 50 and 70 and the
number of iterations is between 50 and 70. In contrast, if the temperature is between
100 and 140 and the number of iterations is 20, the difference between the model and
the experiment is acceptable. Since SA is a stochastic search method, we repeated the
sampling 100 times for each case.
Fig. 5. The average optimum values generated by the method using different temperatures
and iteration numbers
In Figures 6 and 7 the variation of the parameter values can be observed. Those
parameters which are important from the biochemical point of view are represented by
grey columns, the others in black.
It can be seen that, starting from 1.0, at an early stage the values are mixed.
However, as the optimization process advances, the important and the less notable
parameters separate from each other, and eventually the most important ones obtain
the highest weights. If we take into account the fact that hybrid RNA/DNA structures
are more stable than DNA/DNA duplexes, and that the cytosine-guanine base pair
represents a stronger interaction than the doubly hydrogen-bonded adenine and thymine
(or uracil) pair, the following conclusion can be made:
Regarding DNA/RNA hybrid systems, the RNA-side mismatches should mainly determine
the stability of the hybridization. This theoretical consideration is consistent with
the parameters resulting from the optimization process. As can be seen in Figure 7,
5 out of the 6 possible mismatches on the RNA side occupy the first 5 positions by
weight (dT-rC, dA-rC, dG-rG, dA-rG, dT-rG).
Fig. 6. The values of the type-dependent parameters in the case of goal function values
6.10964 and 4.82974

Fig. 7. The values of the type-dependent parameters in the case of goal function values
2.42093 and 1.86422
6 Conclusion
The prediction of the thermodynamic properties of nucleic acids using computer
modeling has not been solved yet. The main problem sources are that (i) the parameters
used in the computation are determined by time-consuming and expensive experiments
and can be applied only to the fluid phase, (ii) MM parameters are lacking, and
(iii) the parameters strongly depend on the experimental conditions (e.g. temperature,
solvent, etc.).
We presented a novel theoretical model (an in situ in silico approach) to estimate the
hybridization process between DNA/DNA and DNA/RNA strands and to eliminate the
previously mentioned defects. With the help of this new method, the in silico
optimization process takes place in situ in the DNA-chip laboratory; then, using the
established parameters, one can model the hybridization, which is the cornerstone of
DNA-chip design.
In the computations done so far, the implemented simulated annealing method was used
with a linear cooling curve. Besides the experimental testing of the method, the
exponential and Boltzmann-sigmoid cooling schemes are in progress, as well as the
optimization of the position-dependent parameters.
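The simulated annealing loop with a linear cooling curve, as used in the computations, can be sketched as follows; the one-dimensional objective and neighbourhood move are illustrative stand-ins for the actual chip-parameter fitting:

```python
import math
import random

def simulated_annealing(cost, neighbour, x0, t0=100.0, iters=2000):
    """Minimise `cost` with simulated annealing; the temperature falls
    linearly from t0 to (almost) zero over `iters` iterations."""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    for k in range(iters):
        t = t0 * (1.0 - k / iters)  # linear cooling curve
        y = neighbour(x)
        fy = cost(y)
        # Accept improvements always, deteriorations with Boltzmann probability.
        if fy < fx or random.random() < math.exp((fx - fy) / max(t, 1e-12)):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest

random.seed(42)  # reproducible illustration
best, fbest = simulated_annealing(lambda v: (v - 3.0) ** 2,
                                  lambda v: v + random.uniform(-1.0, 1.0),
                                  x0=0.0)
```

Exchanging the schedule line for `t = t0 * alpha ** k` gives the exponential cooling mentioned as future work.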
References
1. Chiu, G.T., Shaw, J.: Optical lithography. IBM Journal of Research and Development,
Vol. 41 (Jan. 1997)
2. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X., Stern, D., Winkler, J., Lockhart, D.,
Morris, M., Fodor, S.P.: Accessing genetic information with high-density DNA arrays.
Science, Vol. 274 (25 Oct. 1996) 610–613
3. Fodor, S., Rava, R., Huang, X., Pease, A., Holmes, C., Adams, C.: Multiplexed biochemical
assays with biological chips. Nature, Vol. 364 (5 Aug. 1993) 555–556
4. Fodor, S., Read, J., Pirrung, M., Stryer, L., Lu, A., Solas, D.: Light-directed, spatially ad-
dressable parallel chemical synthesis. Science, Vol. 251, No. 4995 (15 Feb. 1991) 767–773
5. Panjkovich, A., Melo, F.: Comparison of different melting temperature calculation methods
for short DNA sequences. Bioinformatics 21(6) (15 March 2005) 711–722
6. Panjkovich, A., Norambuena, T., Melo, F.: dnaMATE: a consensus melting temperature pre-
diction server for short DNA sequences. Nucleic Acids Res. 33 (suppl. 2) (1 July 2005)
W570–W572
7. Lee, I., Dombkowski, A., Athey, B.: Guidelines for incorporating non-perfectly matched
oligonucleotides into target-specific hybridization probes for a DNA microarray. Nucleic
Acids Research, Vol. 32, No. 2 (2004) 681–690
8. Rouillard, J.M., Zuker, M., Gulari, E.: OligoArray 2.0: design of oligonucleotide probes for
DNA microarrays using a thermodynamic approach. Nucleic Acids Research, Vol. 31, No. 12
(2003) 3057–3062
9. Breslauer, K., Frank, R., Blocker, H., Marky, L.: Predicting DNA duplex stability from the
base sequence. Proc. Natl. Acad. Sci. USA 83 (1986) 3746–3750
10. Freier, S., Kierzek, R., Jaeger, J., Sugimoto, N., Caruthers, M., Neilson, T., Turner, D.:
Improved parameters for predictions of RNA-RNA duplex stability. Proc. Natl. Acad. Sci.
USA 83 (1986) 9373–9377
11. Wallace, R., Shaffer, J., Murphy, R., Bonner, J., Hirose, T., Itakura, K.: Hybridization of
synthetic oligodeoxyribonucleotides to ΦX174 DNA: the effect of single base pair mismatch.
Nucleic Acids Res. 6 (1979) 3543–3557
12. Howley, P., Israel, M., Law, M., Martin, M.: A rapid method for detecting and mapping
homology between heterologous DNAs. Evaluation of polyomavirus genomes. JBC, Vol. 254
(1979) 4876–4883
13. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science,
Vol. 220, No. 4598 (May 1983) 671–680
14. Berendsen, H., van der Spoel, D., van Drunen, R.: GROMACS: a message-passing parallel
MD implementation. Comp. Phys. Comm. 91 (1995) 43–56
15. Sanbonmatsu, K.Y., Joseph, S., Tung, C.S.: Simulating movement of tRNA into the ribosome
during decoding. PNAS 102 (2005) 15854–15859
16. Accelrys Software Inc.: InsightII molecular modeling and simulation environment (2005)
17. NIH Resource for Macromolecular Modeling and Bioinformatics: Scalable molecular
dynamics (2005)
A Hybrid GA/SVM Approach for Gene
Selection and Classification of Microarray Data
1 Introduction
The DNA Microarray technology allows measuring simultaneously the expression
levels of a great number of genes in tissue samples. A number of works have
studied classification methods in order to recognize cancerous and normal tissues
by analyzing Microarray data [1, 8, 2]. The Microarray technology typically
produces large datasets with expression values for thousands of genes (2000–20000)
in a cell mixture, but only few samples are available (20–80).
From the classification point of view, it is well known that, when the number
of samples is much smaller than the number of features, classification methods
may lead to data overfitting, meaning that one can easily find a decision function
that correctly classifies the training data but behaves very poorly on the test
data. Moreover, data with a high number of features inevitably require large
processing times. So, for analyzing Microarray data, it is necessary to reduce
the data dimensionality by selecting a subset of genes that are relevant for
classification.
In recent years, many approaches, in particular various Genetic Algorithms
(GAs) and Support Vector Machines (SVMs), have been successfully applied
to Microarray data analysis [6, 19, 16, 10, 15, 17, 18, 13]. In Section 3, we review
some of the most popular approaches.
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 34–44, 2006.
© Springer-Verlag Berlin Heidelberg 2006
A Hybrid GA/SVM Approach for Gene Selection 35
2 Datasets
In this study, we use two well-known public datasets, the Leukemia dataset
and the Colon cancer dataset. All samples were measured using high-density
oligonucleotide arrays [2].
The Leukemia dataset1 consists of 72 Microarray experiments (samples) with
7129 gene expression levels. The problem is to distinguish between two types of
Leukemia, Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia
(ALL). The complete dataset contains 25 AML samples and 47 ALL samples.
As in other experiments [8], 38 out of 72 samples are used as training data (27
ALL samples and 11 AML samples) and the remaining samples (20 ALL samples
and 14 AML samples) are used as test data.
The Colon cancer dataset2 contains the expression of 6000 genes in 62 cell
samples taken from colon cancer patients; only 2000 genes were selected
based on the confidence in the measured expression levels [1]. 40 of the 62 samples
are tumor samples and the remaining 22 are normal ones. In
this paper, the first 31 of the 62 samples were used as training data and the
remaining samples as test data.
1 Available at: http://www.broad.mit.edu/cgi-bin/cancer/publications/.
2 Available at: http://microarray.princeton.edu/oncology/aydata/index.html.
36 E.B. Huerta, B. Duval, and J.-K. Hao
Feature selection for classification is a very active research topic since many
application areas involve data with tens of thousands of variables [9]. This section
concerns more specifically a literature review of previous studies on feature
selection and classification of Microarray data, with a special focus on the Leukemia
and the Colon datasets presented in Section 2.
Feature selection can be seen as a typical combinatorial problem. Informally,
given a dataset described by a large number of features, the aim is to find out,
within the space of feature subsets, the smallest subset that leads to the highest
rate of correct classification. Given the importance of feature selection, many
solution methods have been developed. Roughly speaking, existing methods for
feature selection belong to three main families [9]: the filter approach, the wrapper
approach and the embedded approach.
The filter methods separate the feature selection process from the classification
process. These methods select feature subsets independently of the learning
algorithm that is used for classification. In most cases, the selection relies on
an individual evaluation of each feature [8, 6]; therefore the interactions between
features are not taken into account.
In contrast, the wrapper approach relies on a classification algorithm that is
used as a black box to evaluate each candidate subset of features; the quality
of a candidate subset is given by the performance of the classifier obtained on
the training data. Wrapper methods are generally computation intensive since
the classifier must be trained for each candidate subset. Several strategies can
be considered to explore the space of possible subsets. In particular, in [14],
evolutionary algorithms are used with a k-nearest neighbor classifier. In [12], the
author develops parallel genetic algorithms using adaptive operators. In [18], one
finds a SVM wrapper with a standard GA. In [20], the selection-classification
problem is treated as a multi-objective optimization problem, minimizing
simultaneously the number of genes (features) and the number of misclassified
examples.
Finally, in embedded methods, the process of selection is performed during the
training of a specific learning machine. A representative work of this approach is
the method that uses support vector machines with recursive feature elimination
(SVM/RFE) [10]. The selection is based on a ranking of the genes and, at each
step, the gene with the smallest ranking criterion is eliminated. The ranking
criterion is obtained from the weights of a SVM trained on the current set of
genes. In this sense, embedded methods are an extension of the wrapper models.
There are other variants of these approaches; see [21, 7] for two examples.
The work reported in this paper is based on a hybrid approach combining fuzzy
logic, a GA and a SVM. Our general model may be characterized as a three-stage
sequential process, using complementary techniques to gradually shrink the search
space. The rest of this section gives a brief description of these three stages.
Stage 1: Pre-processing by fuzzy logic. This stage aims to reduce the
dimension of the initial problem by eliminating gene redundancy. It is basically
composed of four steps. First, the gene expression levels are transformed
into fuzzy subsets with Gaussian representations. Second, the cosine amplitude
method is employed to assess fuzzy similarities between genes. We build a
similarity matrix that is then transformed into a matrix of fuzzy equivalence relations
by different compositions. Third, using α-cuts [23] with decreasing values of α,
we obtain groups of similar genes that correspond to fuzzy equivalence classes of
genes. Fourth, for each group, one gene is randomly taken as the representative
of the group and the other genes of the group are ignored. Applying this dimension
reduction technique to the datasets presented in Section 2, the set of 7129 genes
for Leukemia (respectively 2000 genes for Colon) is reduced to 1360 genes
(respectively 943 genes). Therefore, the search space is dramatically reduced. As we
show later in Section 6, with this reduced set of genes, we are able to obtain
high quality classification results. A detailed description of this stage goes
beyond the scope of this paper and can be found in [3].
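A minimal sketch of the second and third steps, assuming NumPy and the usual formulation of the cosine amplitude measure; the function names and the use of iterated max-min composition to obtain a transitive (equivalence) relation are our assumptions, not the authors' code.

```python
import numpy as np

def cosine_amplitude(X):
    """Pairwise cosine-amplitude similarity between the gene rows of X."""
    norms = np.sqrt((X ** 2).sum(axis=1))
    return np.abs(X @ X.T) / np.outer(norms, norms)

def maxmin_closure(R, max_iter=50):
    """Iterate max-min composition until R is transitive, i.e. a fuzzy
    equivalence relation (the paper's 'compositions' step)."""
    for _ in range(max_iter):
        # (R o R)[i, j] = max_k min(R[i, k], R[k, j])
        R2 = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        if np.allclose(R2, R):
            break
        R = np.maximum(R, R2)
    return R

def alpha_cut_groups(R, alpha):
    """Partition genes into equivalence classes at similarity level alpha."""
    unseen = set(range(R.shape[0]))
    groups = []
    while unseen:
        i = unseen.pop()
        group = {i} | {j for j in unseen if R[i, j] >= alpha}
        unseen -= group
        groups.append(sorted(group))
    return groups
```

One representative per group is then kept and the rest discarded, which is what shrinks 7129 genes to 1360 (respectively 2000 to 943).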
Stage 2: Gene subset selection by GA/SVM. From the reduced set of genes
obtained in the previous pre-processing stage, this second stage uses a wrapper
approach that combines a GA and a SVM to accomplish the feature (gene)
subset selection. The basic idea here consists in using a GA to discover good
subsets of genes, the goodness of a subset being evaluated by a SVM classifier
on a set of training data (see Section 2). During this stage, high quality gene
subsets are recorded to an archive in order to be further analyzed.
At the end of the GA, the analysis of the archived gene subsets is performed:
gene subsets are compared among themselves and the most frequently appearing genes
are identified. This process typically leads to a further reduced set of genes (<100
genes for the Leukemia and Colon datasets). Fig. 1 (top) shows a general picture
of this stage.

[Diagram: population → evolution → evaluation (SVM) → stop condition → best solution; followed by analysis of the frequency of the selected genes]

Fig. 1. The general process for gene subset selection and classification using GA/SVM:
gene subset selection (Stage 2, top); gene selection and classification (Stage 3,
bottom)
average accuracy (rate of correct classification) of a SVM trained with this gene
subset [11]. The LOOCV procedure means that one sample from the dataset
is considered as a test case while a SVM is trained on all the other samples,
and this evaluation is repeated for each sample. So for each chromosome x,
Fitness(x) = accuracy_SVM(x).
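The LOOCV fitness of a chromosome can be sketched as follows, with scikit-learn standing in for the authors' SVM toolbox; the boolean-mask representation of a chromosome and the function name are assumptions, and passing the paper's control parameter 0.5 as `gamma` is our mapping, since the paper does not tie it to a particular implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def fitness(chromosome, X, y):
    """LOOCV accuracy of an RBF-kernel SVM on the genes selected by
    `chromosome` (a boolean mask over the columns of X)."""
    genes = np.flatnonzero(chromosome)
    if genes.size == 0:
        return 0.0
    clf = SVC(kernel="rbf", C=100, gamma=0.5)  # C = 100 as in Section 5
    # One sample held out per fold; mean of the per-fold accuracies.
    scores = cross_val_score(clf, X[:, genes], y, cv=LeaveOneOut())
    return scores.mean()
```

The GA then simply maximizes this value over the population of masks.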
One of the key elements of a SVM classifier concerns the choice of its kernel.
In our study, we have chosen to use the RBF kernel. We also experimented
with Gaussian and polynomial kernels. For polynomial kernels, the main difficulty is
to determine an appropriate polynomial degree, while the results we obtained
with the Gaussian kernel are not satisfactory. Notice that the RBF kernel has been used
in several previous studies for Microarray data classification [4, 18, 5].
The GA parameters used in our model of gene subset selection for the
Leukemia and Colon datasets are shown in Tables 1 and 2. For the SVM
classifier, the same parameter settings are used in the two stages of gene subset
selection and classification. The normalization parameter C is fixed at 100 and
the control parameter for the RBF kernel of the SVM is fixed to 0.5. Notice that,
given that the input data used by the GA/SVM are already normalized during the
Fuzzy Logic pre-processing, the normalization parameter C has in fact little
influence in our case.

3 http://theoval.sys.eua.uk/gcc/svm/toolbox
Methods (rate of correct classification in %, number of selected genes in parentheses):

Dataset    GA&SVM      [6]        [25]      [18]        [5]       [20]      [10]
Leukemia   100(25)     94.10(-)   100(8)    100(6)      95.0(-)   100(4)    100(2)
Colon      99.41(10)   90.30(-)   91.9(3)   93.55(12)   91.0(-)   97.0(7)   98.0(4)
of our results shows that several biologically significant genes reported in [8] are
found by our approach.
Table 4 shows the detailed results of 5 independent runs of our GA/SVM
algorithm. As can be observed, these results are quite stable. For the Leukemia
dataset, each of the 5 runs obtains a classification rate of 100%, while for the
Colon dataset the best run gives a classification rate of 99.64% and even the
worst run obtains a classification rate of 97.88%.
7 Conclusions
In this paper, we presented a general approach for gene selection and classification
of high dimensional DNA Microarray data. This approach begins with a
fuzzy logic based pre-processing technique that aims to cope with the imprecise
nature of the expression levels and to reduce the initial dimension of the input
dataset. Following this pre-processing stage, a hybrid wrapper system combining
a Genetic Algorithm with a SVM classifier is used to identify potentially
predictive gene subsets that are then used to carry out the final gene selection
and classification tasks. Another important feature of our approach concerns the
introduction of an archive of high quality solutions, which allows limiting the
GA/SVM exploration to a set of frequently appearing genes.
This approach was experimentally evaluated on the widely studied Leukemia
and Colon cancer datasets and compared with six previous methods. The results
show that our approach is able to obtain very high classification accuracy.
In particular, to our knowledge, this is the first time that an averaged correct
classification rate of 99.41% (with 10 genes) has been reached for the Colon dataset.
This approach can be further improved in several respects. First, we notice
that our method does not provide the smallest number of genes on the Leukemia
data. This is due to the fact that the GA is only guided by the criterion of
classification accuracy. Therefore, the criterion of the number of genes should be
integrated into the fitness function. This can be achieved by an aggregated fitness
function or a bi-criteria evaluation. Second, the high computation time required
in stage 2 can be reduced by the use of a faster classifier (or an approximate
fitness function). For example, the m-features operator reported in [22] may
be considered. Also, a fine-tuning of the SVM parameters in stage 3 may lead to
improved results. Finally, we intend to apply our approach to other DNA chip
data and to study the behavior of our model.
Genopole Program. We thank the reviewers of the paper for their very helpful
comments.
References
1. U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J.
Levine. Broad patterns of gene expression revealed by clustering analysis of tumor
and normal colon tissues probed by oligonucleotide arrays. In Proc. Natl. Acad.
Sci. USA, volume 96, 1999.
2. A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini.
Tissue classification with gene expression profiles. Journal of Computational Biology,
7(3-4):559–583, 2000.
3. E. Bonilla Huerta, B. Duval, and J.K. Hao. Feature space reduction of large scale
gene expression data using Fuzzy Logic. Technical Report, LERIA, University of
Angers, January 2006.
4. M.P.S. Brown, W.N. Grundy, D. Lin, N. Cristianini, S.W. Sugnet, T.S. Furey, M.
Ares Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression
data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97(1):262–267,
2000.
5. S. Chao and C. Lihui. Feature dimension reduction for microarray data analysis
using locally linear embedding. In APBC, pages 211–217, 2005.
6. T. S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler.
Support vector machine classification and validation of cancer tissue samples
using microarray expression data. Bioinformatics, 16(10):906–914, 2000.
7. L. Goh, Q. Song, and N. Kasabov. A novel feature selection method to improve
classification of gene expression data. In Proceedings of the Second Asia-Pacific
Conference on Bioinformatics, pages 161–166, Australian Computer Society,
Darlinghurst, Australia, 2004.
8. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov,
H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S.
Lander. Molecular classification of cancer: Class discovery and class prediction by
gene expression monitoring. Science, 286:531–537, 1999.
9. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, 3:1157–1182, 2003.
10. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
11. T. Joachims. Estimating the generalization performance of a SVM efficiently. In
Proceedings of the International Conference on Machine Learning (ICML), Morgan
Kaufmann, 2000.
12. L. Jourdan. Metaheuristics for knowledge discovery: Application to genetic data
(in French). PhD thesis, University of Lille, 2003.
13. K-J. Kim and S-B. Cho. Prediction of colon cancer using an evolutionary neural
network. Neurocomputing (Special Issue on Bioinformatics), 61:361–379, 2004.
14. L. Li, C. R. Weinberg, T.A. Darden, and L.G. Pedersen. Gene selection for sample
classification based on gene expression data: study of sensitivity to choice of
parameters of the GA/KNN method. Bioinformatics, 17(12):1131–1142, 2001.
15. J. Liu and H. Iba. Selecting informative genes using a multiobjective evolutionary
algorithm. In Proc. of Congress on Evolutionary Computation (CEC'02), pages
297–302, 2002.
16. F. Markowetz, L. Edler, and M. Vingron. Support vector machines for protein fold
class prediction. Biometrical Journal, 45(3):377–389, 2003.
17. S. Mukherjee. Classifying Microarray Data Using Support Vector Machines.
Springer-Verlag, Heidelberg, 2003.
18. S. Peng, Q. Xu, X.B. Ling, X. Peng, W. Du, and L. Chen. Molecular classification
of cancer types from microarray data using the combination of genetic algorithms
and support vector machines. FEBS Letters, 555(2):358–362, 2003.
19. S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo,
C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S.
Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression
signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149–15154, 2001.
20. A. R. Reddy and K. Deb. Classification of two-class cancer data reliably using
evolutionary algorithms. Technical Report, KanGAL, 2003.
21. Y. Saeys, S. Degroeve, D. Aeyels, P. Rouzé, and Y. Van de Peer. Feature selection
for splice site prediction: A new method using EDA-based feature ranking. BMC
Bioinformatics, 5:64, 2004.
22. S. Salcedo-Sanz, F. Pérez-Cruz, G. Camps-Valls, and C. Bousoño-Calzón. Enhancing
genetic feature selection through restricted search and Walsh analysis. IEEE
Transactions on Systems, Man and Cybernetics, Part C, 34:398–406, 2004.
23. T.J. Ross. Fuzzy Logic with Engineering Applications. McGraw-Hill, 1997.
24. V. N. Vapnik. Statistical Learning Theory. Wiley, N.Y., 1998.
25. Y. Wang, F. Makedon, J.C. Ford, and J.D. Pearlman. HykGene: a hybrid approach
for selecting marker genes for phenotype classification using microarray gene
expression data. Bioinformatics, 21(8):1530–1537, 2005.
26. E. Zitzler, K. Deb, and L. Thiele. Comparison of multiobjective evolutionary
algorithms: Empirical results. Evolutionary Computation, 8(2):173–195, 2000.
Multi-stage Evolutionary Algorithms for Efficient
Identification of Gene Regulatory Networks
Abstract. With the availability of time series data from high-throughput
technologies, diverse approaches have been proposed to model gene regulatory
networks. Compared with others, the S-system has an advantage for these tasks in
the sense that it can provide both quantitative (structural) and qualitative
(dynamical) modeling in one framework. However, it is not easy to identify the
structure of the true network since the number of parameters to be estimated is
much larger than the amount of available data. Moreover, conventional parameter
estimation requires time-consuming numerical integration to reproduce dynamic
profiles of the S-system. In this paper, we propose multi-stage evolutionary
algorithms to identify gene regulatory networks efficiently. With
symbolic regression by genetic programming (GP), we can evade the numerical
integration steps, because estimates of the slopes of each time-course
curve can be obtained from the results of GP. We also develop hybrid evolutionary
algorithms and a modified fitness evaluation function to identify the structure
of gene regulatory networks and to estimate the corresponding parameters at the
same time. By applying the proposed method to the identification of an artificial
genetic network, we verify its capability of finding the true S-system.
1 Introduction
Although mathematical modeling of biochemical networks can be achieved at
different levels of detail (see [1] and [2] for reviews of metabolic and genetic
regulatory network modeling), we can cluster the approaches into three dominant ones [3].
One extreme case is mainly intended to describe the pattern of interactions between
the components. Graph-based representations give us insight into large architectural
features within a cell and allow us to discover principles of cellular organization
[4]. However, it is difficult to handle the dynamics of the whole system since
these models are very abstract. The other extreme primarily focuses on describing the
dynamics of the system by equations that can explain the biochemical
interactions with stochastic kinetics [5, 6]. While these approaches lead to realistic,
quantitative modeling of cellular dynamics, their application is limited to small
systems due to their computational complexity.
One of the appropriate approaches for identifying both pathway structure and
dynamics is the S-system [7]. It is represented as a system of ordinary differential
equations of a particular form, where each component process is characterized by
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 45–56, 2006.
© Springer-Verlag Berlin Heidelberg 2006
46 K.-Y. Kim, D.-Y. Cho, and B.-T. Zhang
power-law functions. The S-system is not only general enough to represent any nonlinear
relationship among the components but can also be mapped onto the network
structure directly. In spite of these advantages, there is a serious drawback which
prevents the S-system from spreading widely over the systems biology communities:
a large number of parameters must be estimated from the small number of observed
dynamic trends. Provided that n components such as genes and proteins are involved
in a certain living system, we must optimize at least 2n(n+1) parameters for the
S-system.
Evolutionary computation has been used from its inception for the automatic
identification of a given system or process [8]. For S-system models, some evolutionary
search techniques have been proposed [9-12]. However, they require time-consuming
numerical integrations to reproduce dynamic profiles for the fitness
evaluations. To avoid this problem, Almeida and Voit employed an artificial
neural network (ANN) to smooth the measured data in order to obtain the slopes of
gene expression level curves [13]. By comparing the slope of each S-system in the
population with the one estimated by the ANN, we can evaluate the fitness values of
the individuals without the computationally expensive numerical integration of
differential equations. This method also provides the opportunity for a parallel
implementation of the identification task, since the tightly coupled system of
non-linear differential equations can be decoupled. Hence, they are able to reduce
the time complexity drastically.
While the collocation method [14] can save computational cost by approximating
dynamic profiles, the estimated systems tend to be invalid since the number of
measured data points is usually insufficient. This lack-of-data problem can be
resolved by sampling new points from the fitted curves. For well-estimated profiles,
however, we should determine the optimal topology of the artificial neural network,
such as the number of hidden units and layers.
In this paper, we propose multi-stage evolutionary algorithms to identify gene
regulatory networks efficiently. With symbolic regression by genetic programming
(GP), we can evade the numerical integration steps. Here, we have no need to
predetermine the topology of the model for the expression profiles, since genetic
programming can optimize the topology automatically. We also develop hybrid
evolutionary algorithms to identify the structure of gene regulatory networks and to
estimate the corresponding parameters at the same time. Most previous evolutionary
approaches for S-system identification have used a structural simplification
procedure in which parameters whose values are less than a given threshold are
reset to zero. Although this method is able to make the network structure sparse,
true connections which represent a somewhat small effect can be deleted during the
procedure. That is, it is not easy to set a suitable value for the threshold. In our
scheme, binary matrices for the network structure and real vectors and matrices for
the parameter values of the S-system are combined into a chromosome and co-evolved to
find the best descriptive model for the given data. Hence we can identify the S-system
without specifying threshold values for structural simplification. By applying
the proposed method to artificial gene expression profiles, we successfully
identified the true structure and estimated reasonable parameter values with a smaller
number of data points than the previous study.
Multi-stage Evolutionary Algorithms 47
2.1 S-System
The S-system [7, 15] is a set of nonlinear differential equations described as follows:
\[
\frac{dX_i}{dt} \;=\; \alpha_i \prod_{j=1}^{n+m} X_j^{g_{ij}} \;-\; \beta_i \prod_{j=1}^{n+m} X_j^{h_{ij}}, \qquad i = 1, 2, \ldots, n, \tag{1}
\]
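Given the rate constants and kinetic orders, the right-hand side of equation (1) is straightforward to evaluate. A sketch in NumPy (the function name and array layout are our choices):

```python
import numpy as np

def s_system_rhs(X, alpha, beta, g, h):
    """Right-hand side of equation (1): dX_i/dt for the current state X.

    alpha, beta : length-n rate constants.
    g, h        : (n, n+m) matrices of kinetic orders.
    X           : values of the n dependent plus m independent variables.
    """
    prod_g = np.prod(X[None, :] ** g, axis=1)  # production terms
    prod_h = np.prod(X[None, :] ** h, axis=1)  # degradation terms
    return alpha * prod_g - beta * prod_h
```

For instance, with four variables and all X_i = 1, every power-law product collapses to 1, so the slope vector is simply alpha - beta.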
instead of an illegible graph and a set of weights. Hence, we can easily obtain the
derivatives at each time point if the resulting functions of the GPs are differentiable.
At the second stage of our evolutionary algorithm, we use a hybrid evolutionary
algorithm for searching the structures and parameters of the S-system. The whole
procedure of our algorithm is summarized in Fig. 1.
Fig. 1. The whole proposed multi-stage evolutionary algorithm for the identification of
S-system
In the biochemical experiment step, the dynamic profiles of the involved genes can be
obtained by periodic measurement. We are also able to predict the slope at each time
point from the regression line of the measured data by genetic programming. Then,
we can evaluate and optimize the chromosomes of the evolutionary algorithms by
comparing the estimated slopes, which come from substituting the data into the
S-system, with the predicted slopes of the GP regression lines. Hence, the fitness
value of each S-system can be evaluated without the time-consuming numerical
integrations as follows:
\[
E \;=\; \sum_{i=1}^{n} \sum_{t=1}^{T} \left( \frac{X'_i(t) - \dot{X}_i(t)}{X'_i(t)} \right)^{2}, \tag{3}
\]

where T is the number of sampling points, X'_i(t) is the gradient of the regression line
obtained by the genetic programming and \(\dot{X}_i(t)\) is the calculated value of each equation
in the S-system.

Fig. 2. An example chromosome for an artificial genetic network: (a) the network structure among X1, X2, X3, X4; (b) the S-system

\[
\frac{dX_1}{dt} = 12\,X_3^{0.8} - 10\,X_1^{0.5}, \qquad
\frac{dX_2}{dt} = 8\,X_1^{0.5} - 3\,X_2^{0.75},
\]
\[
\frac{dX_3}{dt} = 3\,X_2^{0.75} - 5\,X_3^{0.5} X_4^{0.2}, \qquad
\frac{dX_4}{dt} = 2\,X_1^{0.5} - 6\,X_4^{0.8};
\]

(c) the binary matrices for the structure

\[
g = \begin{pmatrix} 0&0&1&0\\ 1&0&0&0\\ 0&1&0&0\\ 1&0&0&0 \end{pmatrix}, \qquad
h = \begin{pmatrix} 1&0&0&0\\ 0&1&0&0\\ 0&0&1&1\\ 0&0&0&1 \end{pmatrix};
\]

(d) the rate constants and kinetic orders

\[
\alpha = \begin{pmatrix} 12\\ 8\\ 3\\ 2 \end{pmatrix}, \quad
g = \begin{pmatrix} 0&0&0.8&0\\ 0.5&0&0&0\\ 0&0.75&0&0\\ 0.5&0&0&0 \end{pmatrix}, \quad
\beta = \begin{pmatrix} 10\\ 3\\ 5\\ 6 \end{pmatrix}, \quad
h = \begin{pmatrix} 0.5&0&0&0\\ 0&0.75&0&0\\ 0&0&0.5&0.2\\ 0&0&0&0.8 \end{pmatrix}.
\]
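The slope-matching error of equation (3), together with the optional per-point weighting used later in equation (4), can be sketched as follows, assuming the GP slopes and the S-system slopes are stored as (n, T) arrays (the names are our assumptions):

```python
import numpy as np

def slope_fitness(X_prime, X_dot, weights=None):
    """Relative squared slope error of equations (3)/(4).

    X_prime : (n, T) slopes of the GP regression curves.
    X_dot   : (n, T) slopes computed from a candidate S-system.
    weights : optional (n, T) factors, e.g. second derivatives for eq. (4).
    """
    err = ((X_prime - X_dot) / X_prime) ** 2
    if weights is not None:
        err = weights * err
    return err.sum()
```

A perfect candidate gives E = 0; no numerical integration of the differential equations is needed.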
We develop hybrid evolutionary algorithms to identify the structure of gene
regulatory networks and to estimate the corresponding parameters at the same time. In this
scheme, binary matrices for the network structure and real vectors and matrices for
the parameter values of the S-system are combined into a chromosome (Fig. 2) and
co-evolved to find the best descriptive model for the given data. While the crossover
operator is applied to the binary matrices for searching the structure of the system,
the corresponding parameter values are also exchanged. This kind of crossover passes
good structures as well as good parameter values from the parents to the offspring.
That is, we use a row exchange crossover which simply selects a row of the matrix g or h
(or both) in the parents and swaps it between them together with the parameter values
in the real vectors and matrices. For example, Fig. 3(a) shows the case in which we select the
Fig. 3. Crossover and mutation operators for the binary matrices: (a) crossover exchanging a row of g between parents P1 and P2 to produce offspring O1 and O2; (b) mutation inserting a 1 into a row of g; (c) mutation deleting a 1 from a row of g
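The row exchange crossover above can be sketched as follows, assuming each chromosome is stored as a dictionary holding the binary structure matrices g, h and same-shaped real matrices of kinetic orders (called G, H here; the rate-constant vectors are omitted for brevity). The representation and names are our assumptions, not the authors' code.

```python
import random
import numpy as np

def row_exchange_crossover(p1, p2, row=None, swap_g=True, swap_h=False):
    """Swap one row (one equation) of the chosen structure matrix between
    two parents, carrying the matching parameter row along."""
    o1 = {k: v.copy() for k, v in p1.items()}
    o2 = {k: v.copy() for k, v in p2.items()}
    i = random.randrange(o1["g"].shape[0]) if row is None else row
    pairs = ([("g", "G")] if swap_g else []) + ([("h", "H")] if swap_h else [])
    for binary, real in pairs:
        for key in (binary, real):
            # Swap row i of the binary matrix and of its parameter matrix.
            o1[key][i], o2[key][i] = o2[key][i].copy(), o1[key][i].copy()
    return o1, o2
```

Because the structural bit row and its parameter row move together, an offspring inherits a coherent (structure, parameters) fragment from each parent.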
We also introduce some variety into the fitness function in equation (3). In the
conventional scheme, all data points have the same weight in the fitness value.
However, it is difficult to fit the data points which have a large second-order
derivative. Moreover, this makes the parameter values of the chromosomes differ
from the true ones even if they have good fitness values. Thus we multiply each
term of the evaluation function by the second-order derivative. The modified fitness
function in our algorithm is described as follows:

\[
E \;=\; \sum_{i=1}^{n} \sum_{t=1}^{T} X''_i(t) \left( \frac{X'_i(t) - \dot{X}_i(t)}{X'_i(t)} \right)^{2}, \tag{4}
\]
to the fitness function, we can obtain a better fitness value for the true structure
than for other structures. After the fitness values of the offspring created by
crossover and mutation according to their probabilities are evaluated by equation (4),
the parameters in the real vectors and matrices are adjusted through a (1+1)
evolution strategy [17].
We employ the restricted tournament selection (RTS) proposed originally in [18]
to prevent premature convergence on a locally optimal structure and to find multiple
topology candidates. In RTS, a subset of the current population is selected for
each newly created offspring. The size of these subsets is fixed to some constant
called the window size. Then, the new offspring competes with the most similar
member of the subset. Since the window size is set to the population size in our
implementation, each offspring is compared with all S-systems in the current population.
If the new one is better, it replaces the corresponding individual; otherwise, the new
one is discarded. As the similarity measure, we calculate the structural Hamming
distances between the new offspring and all individuals in the population by using the
binary matrices.
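Restricted tournament replacement with the window equal to the population size can be sketched as follows; the individual representation and accessor functions are assumptions, and fitness is treated as an error to minimize, matching equations (3)/(4).

```python
import numpy as np

def rts_insert(population, offspring, structure_of, fitness_of):
    """Restricted tournament selection with window = population size.

    structure_of : returns the flattened binary matrices (g, h) of an individual.
    fitness_of   : returns a value to MINIMIZE (e.g. the error E).
    """
    s_off = structure_of(offspring)
    # Structural Hamming distance to every individual in the population.
    dists = [np.sum(structure_of(ind) != s_off) for ind in population]
    j = int(np.argmin(dists))  # the structurally most similar member
    if fitness_of(offspring) < fitness_of(population[j]):
        population[j] = offspring  # replace; otherwise the offspring is discarded
    return population
```

Because an offspring only ever displaces its structurally closest competitor, several distinct network topologies can coexist in the population.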
3 Experimental Results
3.1 Data
We use Frayn's GPLib library [18] for the symbolic regression with GP. To evolve
the mathematical models for the given data, we use the function set F = {+, -, *, /, ^,
sin, exp, log, sqrt} and the terminal set T = {t, c}, where c is a real constant and t is the
time point. The population size is 3,000 and the maximum number of generations is
2,000. Tournament selection is used and its size is 4. The crossover rate is 0.35 and the
mutation rate is 0.5. We set the length penalty of the genetic program to zero for
accurate curve fitting. We generate 100 points and their derivatives from the obtained
models as the input of the next stage, that is, the hybrid evolutionary algorithms.
f1(t)=(sqrt((((sqrt(((sqrt(2.957153))-((sin(sqrt(t)))+((sin((1.854912)-(((sqrt((3)*(t)))
*(sqrt(t)))+(4.435898))))+(-2.355442))))*(t)))+((40.675830)/(exp((t)*((sqrt(t))-
((sin(((sin((((sqrt((3.756391)*(t)))*(sqrt(t)))+(log(86)))+(7)))+(3))+(sin(sin((1.654737
)*(t))))))+(-2.355442)))))))/(sqrt(54.598150)))/(sqrt(7.931547)))),
f2(t)=(sqrt(((3.777992)-((((((4.190957)-((t)-((sin((((t)*((t)^(t)))-((((t)*((2.883554)
/((4)-(log(t)))))+(2.791190))-((exp((t)-(2.226704)))^(sqrt(9.642561)))))/((t)^(t))))
*(2.678347))))/(3.462360))+(3.792098))-(4.796861))/(4)))+(((((3.792098)-((exp(t))
/(3.462360)))-((t)*((t)^(t)))) /((t)^(t)))/((t)^(t))))),
f3(t)=((log(log(exp(log(log(exp(exp(((log(exp()))-((sin((9)+(((t)+(8.000000))
^(sqrt(1.420245)))))/(((exp(t))*(379.000000))/(84.000000))))-((sin((8)+(((t)
+(((log(109))-(1.258803))/(6.620476)))^(sqrt(log(4.059461))))))
/(((exp(((8.337047)*((log(log(sqrt(3.021610))))+(2.000000)))-(5.912041)))*(exp(t)))
/(85.000000)))))))))))^(5.933873)),
f4(t)=((((log(6.249382))^((sqrt(6))*(((sqrt(10.000000))^((1.172486)-(t)))
/(6.981835))))^((1.161464)-((1.161464)/(((((sqrt(6.249382))*((log(7.008566))
*((((((exp((6.980522)/((sqrt(6.288201))^(1.344547))))^((1.735082)-(t)))
/(0.290257))*(sqrt(6.000000)))^(((9.704760)^((-0.050358)-(t)))-(t)))/(0))))
^((1.634223)+((7.277223)^((0.290257)-(t)))))^(0.161464))/(t)))))/(6.980522)).
Fig. 4 shows the true profiles, the estimated curves, and the sampled data points from the
genetic programming, and we confirm that the GP recovers the true dynamics of the
S-system.
Using the 100 points obtained from the GP step, we search for the S-system with the hybrid
evolutionary algorithms. This step is composed of a steady-state evolutionary algorithm
with a local optimization procedure, the (1+1) evolution strategy. For sparse
network structures, we adopt a structural constraint which the evolved networks
should satisfy: each gene is assumed to be related to one or two other genes in
the system. Hence, the randomly generated initial individuals and the offspring produced
by the crossover and mutation operators should have one or two non-zero elements in g and h.
The crossover rate is 0.8 and the mutation rate is 0.3. As a local optimization of the
parameter values, the (1+1) evolution strategy is performed for 80 fitness evaluations. The
search ranges of the parameters are [0.0, 15.0] for αi and βi, and [-1.0, 1.0] for gij and
hij. With a population size of 104, the proposed algorithm successfully identified the
true structure after 106 generations. As shown in Fig. 5, we can also recover the dynamic
profiles with the estimated parameters.
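The (1+1) evolution strategy used for local parameter tuning can be sketched as below. This is a minimal illustration under assumptions: a fixed mutation width `sigma` (the text does not state the step-size control used) and a fitness function to be maximized; the budget of 80 evaluations matches the text.

```python
import random

def one_plus_one_es(params, fitness, evals=80, sigma=0.1, rng=random):
    """(1+1) evolution strategy for local real-parameter optimization.

    `params` is a list of real parameters and `fitness` maps such a list
    to a value to be maximized. The 1/5-success rule for adapting sigma
    is a common refinement omitted here for brevity.
    """
    best = list(params)
    best_fit = fitness(best)
    for _ in range(evals):
        # mutate every parameter with Gaussian noise
        child = [p + rng.gauss(0.0, sigma) for p in best]
        child_fit = fitness(child)
        if child_fit >= best_fit:  # accept ties, as is conventional
            best, best_fit = child, child_fit
    return best, best_fit
```

In the hybrid algorithm, a call like this would refine the real-valued rate constants and kinetic orders of each offspring while the genetic operators handle the binary structure.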
4 Conclusion
We propose multi-stage evolutionary algorithms to identify gene regulatory networks
efficiently with the S-system representation. We adopt the pre-processing symbolic
regression step by genetic programming for avoiding the time-consuming numerical
integration. We also develop hybrid genetic algorithms and modify the fitness func-
tion to identify the structure of gene regulatory networks and to estimate the corre-
sponding parameters simultaneously without the threshold values for the sparse net-
work structures. By applying the proposed method to the identification of an artificial
genetic network, we verify its capability of finding the true S-system.
One important future work is to demonstrate the usefulness of the proposed algorithm
on real experimental biological data, such as gene expression profiles from
microarrays and NMR measurements of some metabolites. As a by-product of
the population diversity maintenance in our evolutionary algorithms, we will be able
to attain different plausible topologies for the network very efficiently. These can
Multi-stage Evolutionary Algorithms 55
Acknowledgement
This work was supported by the Korea Ministry of Science and Technology through
National Research Lab (NRL) project and the Ministry of Education and Human
Resources Development under the BK21-IT Program. The ICT at the Seoul National
University provided research facilities for this study.
References
1. Covert, M.W., Schilling, C.H., Famili, I., Edwards, J.S., Goryanin, I.I., Selkov, E., Palsson,
B.O.: Metabolic modeling of microbial strains in silico. Trends in Biochemical Sciences 26
(2001) 179-186
2. De Jong, H.: Modeling and simulation of genetic regulatory systems: a literature review.
Journal of Computational Biology 9 (2002) 67-103
3. Stelling, J.: Mathematical models in microbial systems biology. Current Opinion in Mi-
crobiology 7 (2004) 513-518
4. Barabasi, A.-L., Oltvai, Z.N.: Network biology: Understanding the cell's functional or-
ganization. Nature Reviews Genetics 5 (2004) 101-113
5. De Jong, H., Gouze, J.-L., Hernandez, C., Page, M., Sari, T., Geiselmann, J.: Qualitative
simulation of genetic regulatory networks using piecewise-linear models. Bulletin of
Mathematical Biology 66 (2004) 301-340
6. Rao, C.V., Wolf, D.M., Arkin, A.P.: Control, exploitation and tolerance of intracellular
noise. Nature 420 (2002) 231-237
7. Voit, E.O.: Computational Analysis of Biochemical Systems. Cambridge University Press
(2000)
8. Fogel, D.B.: System Identification Through Simulated Evolution: A Machine Learning Ap-
proach to Modeling. Ginn Press (1991)
9. Tominaga, D., Koga, N., Okamoto, M.: Efficient numerical optimization algorithm based
on genetic algorithm for inverse problem. In: Whitley, D. et al. (Eds.), Proceedings of the
2000 Genetic and Evolutionary Computation Conference (2000) 251-258
10. Ando, S., Sakamoto, E., Iba, H.: Evolutionary modeling and inference of gene network.
Information Science 145 (2002) 237-259
11. Kikuchi, S., Tominaga, D., Arita, M., Takahashi, K., Tomita, M.: Dynamic modeling of
genetic networks using genetic algorithm and S-system. Bioinformatics 19 (2003) 643-650
12. Spieth, C., Streichert, F., Speer, N., Zell, A.: Optimizing topology and parameters of gene
regulatory network models from time-series experiments. In: Deb, K. et al. (eds.): Proceed-
ings of the 2004 Genetic and Evolutionary Computation Conference. Lecture Notes in
Computer Science, Vol. 3102. Springer-Verlag (2004) 461-470
13. Almeida J.S., Voit E.O.: Neural network-based parameter estimation in S-system models
of biological networks, Genome Informatics 14 (2003) 114-123
14. Tsai, K.-Y., Wang, F.-S.: Evolutionary optimization with data collocation for reverse engi-
neering of biological networks. Bioinformatics 21 (2005) 1180-1188
15. Savageau, M.A.: Biochemical Systems Analysis: A Study of Function and Design in Molecu-
lar Biology. Addison-Wesley (1976)
16. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural
Selection. MIT Press (1992)
17. Bäck, T.: Evolutionary Algorithms in Theory and Practice. Oxford University Press (1996)
18. http://www.cs.bham.ac.uk/~cmf/GPLib/GPLib.html
19. Harik, G.R.: Finding multimodal solutions using restricted tournament selection. In:
Eshelman, L.J. (ed.): Proceedings of the Sixth International Conference on Genetic Algo-
rithms (1995) 24-31
Human Papillomavirus Risk Type Classification
from Protein Sequences Using
Support Vector Machines
S. Kim and B.-T. Zhang
Biointelligence Laboratory,
School of Computer Science and Engineering,
Seoul National University, Seoul 151-744, South Korea
{skim, btzhang}@bi.snu.ac.kr
1 Introduction
Cervical cancer is a leading cause of cancer death among women worldwide.
Epidemiologic studies have shown that the association of genital human papillo-
mavirus (HPV) with cervical cancer is strong and independent of other risk factors
[1]. HPV infection causes virtually all cases of cervical cancer, although only certain
high-risk HPV types lead to cancer; most HPV types are low-risk and rarely
develop into cancer. In particular, high-risk HPV types induce more than
95% of cervical cancers in women.
HPV is a relatively small, double-stranded DNA tumor virus that belongs
to the papovavirus family (papilloma, polyoma, and simian vacuolating viruses).
More than 100 human types are specific for epithelial cells, including the skin, respira-
tory mucosa, and the genital tract. The genital tract HPV types are classified
into two or three classes, such as low-, intermediate-, and high-risk types, by their
relative malignant potential [2]. The common, unifying oncogenic feature of the
vast majority of cervical cancers is the presence of HPV, especially high-risk HPV
types [3]. Thus the risk type detection of HPVs has become one of the most
essential procedures in cervical cancer remedy. Currently, HPV risk types
are still manually classified by experts, and there is no deterministic method to
predict the risk type of unknown or new HPVs.
Since HPV classification is important in medical judgments, there have
been many epidemiological and experimental studies to identify HPV risk types
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 5766, 2006.
c Springer-Verlag Berlin Heidelberg 2006
58 S. Kim and B.-T. Zhang
[3]. Polymerase chain reaction (PCR) is a sensitive technique for the detection of
very small amounts of HPV nucleic acids in clinical specimens. It has been used
in most epidemiological studies that have evaluated the role of these viruses in
cervical cancer causation [4]. Bosch et al. [1] investigated whether the association
between HPV infection and cervical cancer is consistent worldwide in the context
of geographic variation. Burk et al. [5] inspected the risk factors for HPV infection
in 604 young college women by examining social relationships, and detected various
factors of HPV infection with L1 consensus primer PCR and Southern blot
hybridization. Munoz et al. [6] classified the HPV risk types with epidemiological
experiments based on risk factor analysis. They pooled data from 1,918 cervical
cancer patients and analyzed it by PCR-based assays.
Detection of HPV risk types can be viewed as protein function prediction, even though
functions are described at many levels, ranging from biochemical function to bio-
logical processes and pathways, all the way up to the organ or organism level [7].
Many approaches to protein function prediction are based on similarity search
against proteins with known function. The similarity among proteins can be
defined in a multitude of ways [8]: sequence alignment, structure match by common
surface clefts or binding sites, common chemical features, or comparison of certain
motifs. However, none of the existing prediction systems can guarantee generally
good performance. Thus it is necessary to develop classification methods
for HPV risk types. Eom et al. [9] presented a sequence comparison method for
HPV classification. They use DNA sequences to discriminate risk types based
on genetic algorithms. Joung et al. [10] combined several methods for
risk type prediction from protein sequences. Protein sequences are first aligned,
and the subsequences distinguishing high-risk HPVs from low-risk HPVs are selected by
hidden Markov models. Then a support vector machine is used to determine the
risk types. The main drawback of this approach is that the method is biased by
one sequence pattern. Alternatively, biomedical literature can be used to predict
HPV risk types [11]. However, text mining approaches have limited prediction
capability because they only depend on texts to capture the classification
evidence, and obvious keywords such as 'high' tend to appear in the
literature explicitly.
In this paper, we propose a method to classify HPV risk types using pro-
tein sequences. Our approach is based on support vector machines (SVMs) to
discriminate low- and high-risk types, and a string kernel is introduced to deal
with protein sequences. The string kernel first maps a sequence to the space consisting
of pairs of amino acids drawn from its subsequences. An RBF kernel is then used for
nonlinear mapping into a higher-dimensional space and for similarity calculation. In
particular, the proposed kernel only uses the amino acids at both ends of k-length
subsequences to improve the classification performance. It is motivated by the
assumption that pairs of amino acids at a certain distance affect the HPV's biological
function, i.e. its risk type, more than consecutive amino acids do. The experimental
results show that our approach provides better performance than previous approaches
in accuracy and F1-score.
Our work addresses how to classify HPV risk types from protein sequences
with SVM approaches, which can provide a guide for determining unknown or new
HPVs. The paper is organized as follows. In Section 2, we explain the SVM
method for classification. The string kernel for HPV protein sequences is
presented in Section 3. In Section 4, we present the experimental results, and
we draw conclusions in Section 5.
2 Support Vector Machines
We use support vector machines to classify HPV risk types. A string kernel-based
SVM is trained on HPV protein sequences and tested on unknown sequences.
Support vector machines were developed by Vapnik to give robust performance
for classification and regression problems on noisy, complex data [12]. They have
recently been widely used, from text categorization to bioinformatics.
When they are applied to a classification problem, a kernel and a set of labeled vectors,
each marked as positive or negative class, are given. The kernel functions in-
troduce nonlinear features into the hypothesis space without explicitly requiring non-
linear algorithms. SVMs learn a linear decision boundary in the feature space
mapped by the kernel in order to separate the data into two classes.
For a feature mapping Φ, the training data S = {x_i, y_i}_{i=1}^n is mapped into
the feature space as Φ(S) = {Φ(x_i), y_i}_{i=1}^n. In the feature space, SVMs learn the
hyperplane f = <w, Φ(x)> + b, w ∈ R^N, b ∈ R, and the decision is made by
sgn(<w, Φ(x)> + b). The decision boundary is the hyperplane f = 0 and its
margin is 1/||w||. Among normalized hyperplanes, SVMs find the one that has the
maximal margin from each class.
Finding the optimal hyperplane can be formulated as the following problem:

    minimize    (1/2) ||w||^2                                             (1)
    subject to  y_i (<w, Φ(x_i)> + b) >= 1,   i = 1, ..., n.              (2)

Its dual problem is:

    maximize    Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j <Φ(x_i), Φ(x_j)>   (3)
    subject to  α_i >= 0,   i = 1, ..., n,                                (4)
                Σ_{i=1}^n α_i y_i = 0.                                    (5)
We can work in the feature space by using kernel functions, and any kernel
function K satisfying Mercer's condition can be used.
3 Kernel Function
For HPV protein classification, we introduce a string kernel based on the spec-
trum kernel method. The spectrum kernel was originally used for remote homology
detection [13][14]. The input space X consists of all finite-length sequences of
characters from an alphabet A of size |A| = l (l = 20 for amino acids). Given
a number k >= 1, the k-spectrum of a protein sequence is the set of all possible
k-length subsequences (k-mers) that it contains. The feature map is indexed by
all possible subsequences a of length k from A. The k-spectrum feature map
Φ_k(x) from X to R^(l^k) can be defined as

    Φ_k(x) = (φ_a(x))_{a ∈ A^k},

where φ_a(x) is the number of occurrences of a in x. An RBF function with width
parameter γ > 0 is then applied to these feature vectors for the final similarity
calculation. This string kernel is used in combination with the SVM explained
in Section 2.
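The kernel construction above can be sketched as follows. This is an illustrative implementation, not the paper's exact kernel: `k_spectrum` is the standard k-mer count map, `gap_spectrum` imitates the paper's idea of keeping only the two end amino acids of each k-mer, and the function names and the default γ are assumptions.

```python
import math
from collections import Counter

def k_spectrum(seq, k):
    # k-spectrum feature map: counts of each length-k subsequence (k-mer)
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def gap_spectrum(seq, k):
    # variant keeping only the two end amino acids of each k-mer, imitating
    # the assumption that residues at distance k-1 matter most
    return Counter((seq[i], seq[i + k - 1]) for i in range(len(seq) - k + 1))

def rbf_on_spectrum(x, y, k=3, gamma=0.1, feature=k_spectrum):
    """RBF kernel applied on top of a spectrum-style feature map."""
    fx, fy = feature(x, k), feature(y, k)
    keys = set(fx) | set(fy)
    sq = sum((fx[a] - fy[a]) ** 2 for a in keys)  # squared distance in feature space
    return math.exp(-gamma * sq)
```

A Gram matrix built from `rbf_on_spectrum` could then be passed to an SVM as a precomputed kernel, since the RBF composition preserves Mercer's condition.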
4 Experimental Results
4.1 Data Set
In this paper, we use the HPV sequence database of the Los Alamos National Lab-
oratory (LANL) [16], and a total of 72 types of HPV are used for the experiments. The
risk types of the HPVs were determined based on the HPV compendium (1997). If
an HPV belongs to the skin-related or cutaneous groups, it is classified as
low-risk. On the other hand, an HPV is classified as high-risk if it is known
to be a high-risk type for cervical cancer. The comments in the LANL database are
used to decide the risk types of some HPVs that are difficult to classify.
Seventeen sequences out of the 72 HPVs were classified as high-risk types (16, 18,
31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 61, 66, 67, 68, and 72), and the others were
classified as low-risk types. Table 1 shows the HPV types and their classified
risk types. The symbol '?' in the table denotes an unknown risk type that cannot be
determined.

[Fig. 1. Classification accuracy by window size for the E6, E7, and L1 proteins]
Since several proteins can be used to discriminate HPVs, we evaluated the
classification accuracy using the SVM with an RBF kernel to determine which
gene products to use in the experiments. The input data are the frequency vectors
produced by a sliding-window method and normalized. This was done to decide the
most informative protein among the gene products for HPV risk type classification.
Figure 1 depicts the accuracy changes by window size for the E6, E7, and L1 pro-
teins. The accuracy is the result of leave-one-out cross-validation. It indicates
that the accuracy using the E6 protein is mostly higher than that using the E7 and
L1 proteins. However, the overall accuracy increases with window size for all
proteins because the HPV sequences are relatively short and more unique patterns are
generated when the window is long. That is, the learners overfit the protein
sequences for long window sizes. The viral early proteins E6 and E7 are known to
induce immortalization and transformation in rodent and human cell types.
E6 proteins produced by the high-risk HPV types can bind to and inactivate
the tumor suppressor protein, thus facilitating tumor progression [16][17]. This
process plays an important role in the development of cervical cancer. For these
reasons, we have chosen the E6 protein sequences corresponding to the 72 HPVs.
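The sliding-window frequency representation used in the protein comparison above can be sketched as follows. The paper does not spell out its normalization, so dividing the counts by the total number of windows is an assumption, and `window_freq_vector` is a hypothetical helper name.

```python
from collections import Counter

def window_freq_vector(seq, w):
    """Normalized frequency vector of length-w subsequences obtained by
    sliding a window of size w over a protein sequence (a sketch; counts
    are divided by the number of windows so the values sum to 1)."""
    n = len(seq) - w + 1
    counts = Counter(seq[i:i + w] for i in range(n))
    return {sub: c / n for sub, c in counts.items()}
```

Larger w produces more distinct, rarer subsequences for short HPV sequences, which is consistent with the overfitting behavior noted in the text.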
When binary scales are used for both the answer and the prediction, a contingency
table can be established showing how the data set is divided by these two measures
(Table 2). From the table, the classification performance is measured as follows:

    precision = a / (a + b) × 100%,
    recall    = a / (a + c) × 100%,
    F1-score  = 2 × precision × recall / (precision + recall).
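The three measures above can be computed directly from the contingency counts; the sketch below assumes the conventional reading of Table 2, with a the true positives, b the false positives, and c the false negatives.

```python
def contingency_scores(a, b, c):
    """Precision, recall and F1-score (in %) from contingency counts:
    a = true positives, b = false positives, c = false negatives."""
    precision = a / (a + b) * 100.0
    recall = a / (a + c) * 100.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```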
[Figure: classification accuracy as a function of the subsequence length k]
[Figure: accuracy and F1-score of the gap-spectrum SVM compared with SVM, AdaCost, and naive Bayes]
Text mining approaches depend only on clues from text sentences. If
text documents are unavailable for unknown HPVs, there is no way to clas-
sify them, whereas sequence-based classification does not need any
additional information except the sequence itself.
Table 3 shows the risk type predictions for the HPVs marked as unknown in Ta-
ble 1. HPV26, HPV54, HPV57, and HPV70 are predicted as high-, low-, low-,
and high-risk, respectively. The prediction results for HPV26 and HPV54 are
identical to those in Munoz et al. [6], and we assume that their results are cor-
rect because they are based on epidemiologic classification from over 1,900 patients.
For HPV70, previous research has reached different decisions on the risk type
[6][18][19], and the risk type of HPV57 cannot be decided yet because
of insufficient previous work. From these prediction results, we can conclude that our
approach provides a reasonable estimate of whether unknown HPVs are high-risk
or not.
5 Conclusion
Acknowledgement
This work was supported by the Korea Ministry of Science and Technology
through National Research Lab (NRL) project and the Ministry of Education
and Human Resources Development under the BK21-IT Program. The ICT at
the Seoul National University provided research facilities for this study.
References
[1] Bosch, F. X., Manos, M. M., et al.: Prevalence of Human Papillomavirus in Cer-
vical Cancer: a Worldwide Perspective. Journal of the National Cancer Institute
87 (1995) 796-802.
[2] Janicek, M. F. and Averette, H. E.: Cervical Cancer: Prevention, Diagnosis, and
Therapeutics. Cancer Journals for Clinicians 51 (2001) 92-114.
[3] Furumoto, H. and Irahara, M.: Human Papillomavirus (HPV) and Cervical Can-
cer. Journal of Medical Investigation 49 (2002) 124-133.
[4] Centurioni, M. G., Puppo, A., et al.: Prevalence of Human Papillomavirus Cervical
Infection in an Italian Asymptomatic Population. BMC Infectious Diseases 5(77)
(2005).
[5] Burk, R. D., Ho, G. Y., et al.: Sexual Behavior and Partner Characteristics Are the
Predominant Risk Factors for Genital Human Papillomavirus Infection in Young
Women. The Journal of Infectious Diseases 174 (1996) 679-689.
[6] Munoz, N., Bosch, F.X., et al.: Epidemiologic Classification of Human Papil-
lomavirus Types Associated with Cervical Cancer. New England Journal of
Medicine 348 (2003) 518-527.
[7] Watson, J. D., Laskowski, R. A., and Thornton, J. M.: Predicting Protein Function
from Sequence and Structural Data. Current Opinion in Structural Biology 15
(2005) 275-284.
[8] Borgwardt, K. M., Ong, C. S., et al.: Protein Function Prediction via Graph Ker-
nels. In Proceedings of the Thirteenth International Conference on Intelligent Systems
for Molecular Biology (2005) 47-56.
[9] Eom, J.-H., Park, S.-B., and Zhang, B.-T.: Genetic Mining of DNA Sequence
Structures for Effective Classification of the Risk Types of Human Papillomavirus
(HPV). In Proceedings of the 11th International Conference on Neural Information
Processing (2004) 1334-1343.
[10] Joung, J.-G., O, S.-J, and Zhang, B.-T.: Prediction of the Risk Types of Human
Papillomaviruses by Support Vector Machines. In Proceedings of the 8th Pacific
Rim International Conference on Artificial Intelligence (2004) 723-731.
[11] Park, S.-B., Hwang, S., and Zhang, B.-T.: Mining the Risk Types of Human Papil-
lomavirus (HPV) by AdaCost. In Proceedings of the 14th International Conference
on Database and Expert Systems Applications (2003) 403-412.
[12] Vapnik, V. N.: Statistical Learning Theory. Springer (1998).
[13] Leslie, C., Eskin, E., and Noble, W. S.: The Spectrum Kernel: A String Ker-
nel for SVM Protein Classification. In Proceedings of the Pacific Symposium on
Biocomputing (2002) 564-575.
[14] Leslie, C., Eskin, E., et al.: Mismatch String Kernels for Discriminative Protein
Classification. Bioinformatics 20(4) (2004) 467-476.
[15] Shawe-Taylor, J. and Cristianini, N.: Kernel Methods for Pattern Analysis. Cam-
bridge University Press (2004).
[16] The HPV sequence database in Los Alamos laboratory, http://hpv-web.lanl.gov/
stdgen/virus/hpv/index.html.
[17] Pillai, M., Lakshmi, S., et al.: High-Risk Human Papillomavirus Infection and
E6 Protein Expression in Lesions of the Uterine Cervix. Pathobiology 66 (1998)
240-246.
[18] Longuet, M., Beaudenon, S., and Orth, G.: Two Novel Genital Human Papillo-
mavirus (HPV) Types, HPV68 and HPV70, Related to the Potentially Oncogenic
HPV39. Journal of Clinical Microbiology 34 (1996) 738-744.
[19] Meyer, T., Arndt, R., et al.: Association of Rare Human Papillomavirus Types
with Genital Premalignant and Malignant Lesions. The Journal of Infectious Dis-
eases 178 (1998) 252-255.
Hierarchical Clustering, Languages and Cancer
1 Introduction
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 6778, 2006.
c Springer-Verlag Berlin Heidelberg 2006
68 P. Mahata et al.
In addition to the clustering solutions available in the literature for the datasets
considered, we have used two unsupervised techniques for computing alternative
solutions. The first one is based on arithmetic-harmonic cuts, and the second
one relies on the utilization of ultrametric trees. These are described below.
The memetic algorithm uses (a) a differential greedy algorithm (similar to that in [3])
for the initialization of a set of solutions for the problem, (b) a differential greedy
crossover (a modification of the algorithm in [2]) for the evolution of the population,
and (c) a variable neighborhood local search (see [4]) to improve the newly gen-
erated solutions. Whenever the population stagnates, we keep the best solution
and re-initialize the rest of the solutions in the set. We use this memetic algorithm
if the graph contains more than 25 vertices, and a backtracking enumeration
algorithm otherwise. Notice that even though backtracking gives us an optimal
solution, a memetic algorithm may not. However, on the considered datasets, the
memetic algorithm consistently generated the same solution in all runs (thus it
is presumably optimal). By applying this method (backtracking or memetic al-
gorithm, depending on the number of vertices) recursively, we have at each step a
graph as input, and the two subgraphs induced by the sets of the vertex
partition as output; stopping when we arrive at a graph with just one vertex,
we generate a hierarchical clustering in a top-down fashion.
The rationale for our objective function becomes clear if we rearrange its terms.
We can write

    F = (|E| − |E_out|) · |E_out| · (A_out / H_in)                      (2)

where A_out is the arithmetic mean of the weights of the edges that connect vertices of S
with V \ S (the cut), H_in is the harmonic mean of the weights of the edges not
in the cut, and |E_out| is the cardinality of the cut. Informally, maximizing F is
equivalent to trying to find a cut that discriminates well between the two groups, normalized
by the harmonic mean of the intra-cluster dissimilarity, and multiplied by a
factor that is maximal when the two groups have a similar number of elements.
Normalizing by the harmonic mean makes the denominator more stable
to the presence of outlier samples associated with either S or V \ S. For this
reason, we call this partition an arithmetic-harmonic cut.
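The objective in Eq. (2) can be evaluated directly for a candidate partition, as in the sketch below. This is an illustration under assumptions: `ah_cut_objective` is a hypothetical helper name, edges carry positive weights, and the partition is given as the vertex set of one side.

```python
def ah_cut_objective(edges, weights, in_s):
    """Arithmetic-harmonic cut objective F of Eq. (2).

    edges: list of (u, v) vertex pairs; weights: parallel list of
    positive edge weights; in_s: set of vertices in S (the other
    side is V \\ S)."""
    cut = [w for (u, v), w in zip(edges, weights) if (u in in_s) != (v in in_s)]
    rest = [w for (u, v), w in zip(edges, weights) if (u in in_s) == (v in in_s)]
    if not cut or not rest:
        return 0.0  # degenerate partition: no cut or no intra-cluster edges
    a_out = sum(cut) / len(cut)                    # arithmetic mean of cut weights
    h_in = len(rest) / sum(1.0 / w for w in rest)  # harmonic mean of the rest
    e, e_out = len(edges), len(cut)
    return (e - e_out) * e_out * a_out / h_in
```

In the recursive procedure described earlier, a search (exhaustive for small graphs, memetic for larger ones) would maximize this value over partitions at each level of the hierarchy.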
Notice that maximizing the first part of the objective function, i.e.,
Σ_{e ∈ E_out} w(e) (the total weight of the edges across the two sets), is the same as
solving the Max-Cut problem for the graph G, which is an NP-hard problem. How-
ever, it turns out that the hierarchy generated by partitions using Max-Cut
does not corroborate the previous knowledge about the datasets. This is prob-
ably because no importance is given in Max-Cut to the similarity
of vertices within the sets. We also considered the objective function

    F = Σ_{e ∈ E_out} w(e) − Σ_{e ∈ E_in} w(e)                          (3)
The distance D_ij between any two leaves i and j (measured as the sum of the weights
of the edges that have to be traversed to reach i from j inside T) verifies that
D_ij <= max{D_ik, D_jk}, 1 <= i, j, k <= n, where n is the number of leaves. This
equation implies that, given any internal node h in T, it holds that D_hi = D_hj
for any leaves i, j having h as an ancestor.
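The ultrametric condition stated above is easy to check for a leaf-to-leaf distance matrix; the sketch below simply tests the three-point inequality over all triples (O(n³), adequate for small n; the helper name is ours).

```python
def is_ultrametric(D):
    """Check D[i][j] <= max(D[i][k], D[j][k]) for every triple of leaves,
    given a symmetric matrix D of leaf-to-leaf path distances."""
    n = len(D)
    return all(
        D[i][j] <= max(D[i][k], D[j][k])
        for i in range(n) for j in range(n) for k in range(n)
    )
```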
The use of ultrametric trees has several advantages in hierarchical classifica-
tion. First of all, edge weights are very easy to compute: given a distance matrix
M containing dissimilarity values for a collection of objects, and a candidate
tree T, the minimum weights such that D_ij >= M_ij and T is ultrametric can be
computed in O(n²) [5]. Secondly, they adapt very well to dynamical processes
evolving at a more or less constant rate. Finally, even if the latter is not the case,
they provide a very good approximation to more relaxed criteria, such as mere
additivity, that would be much more computationally expensive to calculate.
Notice also that finding the optimal topology T for a given distance matrix M
under the ultrametric assumption is NP-hard [5].
Ultrametric trees have been computed using an evolutionary two-phase pro-
cedure: first, a collection of high-quality tentative trees is generated; subse-
quently, a consensus method is used to summarize this collection into a single
tree. Beginning with the former, the generation of high-quality (i.e., minimum-
weight) ultrametric trees has been approached using an evolutionary algorithm
based on the scatter search template. Starting from the solution provided by the
complete-link agglomerative algorithm, an initial population of trees is produced
by perturbation (internal exchanges of branches). Then, an evolutionary cycle
is performed using tree-based path relinking for recombination [6], and internal
rotations for local search (no mutation is used). Whenever the system stagnates,
the population is restarted by keeping the best solution and generating new trees
by exchanging branches among existing trees.
Once a collection of high-quality trees has been found, the consensus method
is used to amalgamate them. This is done using the TreeRank measure [7] as
the similarity metric among trees. This measure is based on counting the number of
times we have to traverse an edge upwards or downwards in order to go from a
certain leaf to another one. By computing how different these figures are for two
trees, we obtain a dissimilarity value. The TreeRank measure is currently used
in TreeBASE¹, one of the most widely used phylogenetic databases, for
the purpose of handling queries for similar trees.
The consensus algorithm we have used is an evolutionary metaheuristic that
evolves tentative trees following [8]. Given the collection of trees we want to
summarize, the sum of dissimilarities to the tentative tree is used as the fitness
function (to be minimized). Evolution is performed using the prune-delete-graft
operator [9, 10] for recombination, no mutation, binary tournament selection,
and elitist replacement. In our experiments, we have considered all the different
trees generated by the scatter search method in one hundred runs, and then
run the consensus algorithm one hundred times on this collection. The best
solution out of these latter 100 runs is kept as the final consensus tree.
¹ http://www.treebase.org
Fig. 1. Three proposed language trees: (a) tree using arithmetic-harmonic cuts, (b)
Gray and Atkinson's tree [16], (c) consensus ultrametric tree
the Middle East between 10,000 and 6,000 years ago, which also correlates well
with the first principal component of the genetic variation of 95 polymorphisms
[19], which alone accounts for 28% of the total variation.
In addition to this fact, there are certain strange relations between languages
at the leaves of Gray's tree. After checking the distance matrix, we find several
cases in which our tree seems to produce a more reasonable branching than Gray
and Atkinson's. First of all, the closest neighbors of the Czech and Slovak languages
are the Lusatian languages. It is probably natural to have Czech, CzechE, and Slo-
vak placed in a subtree closer to the Lusatian languages. In the trees generated from
both the arithmetic-harmonic cut (Fig. 1(a)) and the ultrametric trees (Fig. 1(c)),
we see that these languages are placed next to each other. However, in Fig. 1(b),
generated by [16], Czech and Slovak are placed closer to Ukrainian, Byelorussian,
and Russian. Secondly, Catalan is a language that evolved from Latin, with strong
influences from French and Spanish. As a consequence of its Latin origin, Italian
is the closest language to Catalan in the dataset. The position of Catalan with
Italian and Ladin in our tree seems very natural, as hybridizations with French
and Spanish occurred later (note that the Bayesian posterior probability is 0.83
for its link with the Spanish-Portuguese group). See [16] for the details of the proba-
bilities in Figure 1(b). Although Italian's closest language is Ladin, the latter
was placed closer to RomanianList and Vlach with a posterior probability of
0.88. Also, notice the position of Italian, with a 0.59 posterior probability. Fi-
nally, there are also small differences in the topology of small subtrees between
our hierarchy and Gray's, namely those regarding Dutchlist-Afrikaans-Flemish,
the Greek languages, the Albanian languages, and the position of Bengali within the
Aryan languages, among others. The differences seem to occur mainly where the
posterior probability of one or several branchings is low.
An important difference is that in our classification the Celtic languages are
considered closer to the Baltic-Slavic languages. This goes against the current be-
lief of Celtic's closeness to the Romance and Germanic languages. Note that in
Gray and Atkinson's classification, the branchings of (Germanic, Romance) and
(Celtic, (Germanic, Romance)) have low posterior probabilities (0.46 and 0.67, re-
spectively). The minimum-weight ultrametric tree (see Fig. 1(c)) for this dataset
also considers the Celtic and Slavic languages to be the closest ones as groups. How-
ever, this tree disagrees with our tree in the primary branches. For instance, it
first takes out the Indo-Afghan languages as outliers, and then considers the Albanian
and Greco-Armenian languages as outliers successively. In the tree obtained by the
arithmetic-harmonic cut, all these outliers are grouped together. Notice that
even at the successive branchings, the consensus ultrametric tree often produces
a large number of outliers (see, e.g., the Indic and Iranian branches of Figure 1(c)).
[Fig. 2: hierarchical classification of the NCI60 dataset by the arithmetic-harmonic cut, panels (A)-(D), with internal nodes 1-6; cell lines are labelled by tissue of origin (melanoma, breast, CNS, renal, non-small-cell lung, ovarian, colon, leukaemia and prostate).]
The analysis of this dataset was done by Ross et al. in 2000 [20], where the result
of a hierarchical clustering for this dataset was first discussed. Their result shows
that the cell lines from the same origin were grouped together in the case of leukaemia,
melanoma, colon, renal and ovarian cancers, with a few exceptions. However, cell
lines derived from non-small-cell lung carcinoma and breast tumors were distributed
in multiple places, suggesting a heterogeneous nature.
Fig. 2(a) shows the result of applying the arithmetic-harmonic cut on this dataset.
In Fig. 2(b), (c) and (d), we show the genetic signatures (the most differentially
expressed genes on the two sides of the partition) of the first three partitions
using the cut, with 1101, 696 and 695 genes respectively. In the genetic signatures
(computed with the method described in [21] and [22]), each row corresponds to
a gene and each column corresponds to a tumor sample.
Hierarchical Clustering, Languages and Cancer 75
[Figure: two hierarchical classifications of the NCI60 cell lines, labelled by tissue of origin.]
Fig. 3. Classification of NCI60 dataset from (a) Ross et al. and (b) ultrametric tree
5 Conclusions
We proposed two new approaches for hierarchical clustering and showed the
results of applying these methods to two very diverse datasets. The hierarchies
we produce with these methods, for both languages and cancer samples, agree very well
with the existing knowledge about these datasets. They also raise some interesting questions.
The arithmetic-harmonic cut seems to correlate well with the results of the
first component of the genetic variation provided by Cavalli-Sforza and his co-
authors [19]. It indicates a branching into two major groups, with an earlier group
moving towards Europe (actually, the advanced-farming hypothesis at work),
later followed by another group moving in the same direction (evolving into the Greek,
Albanian and Armenian languages), while another group moved south-
east and later differentiated into the Iranian and Indic languages. It also suggests a
commonality of Celtic, Baltic and Slavic (a hypothesis also raised in the past,
and also supported by the consensus of the ultrametric trees). These differences
with the solution provided by Gray and Atkinson, as well as other minor ones,
seem to lie in branchings where the Bayesian posterior probability is low, and
our methods agree where the posterior probability is high. The consensus of
the ultrametric trees seems to suggest a single wave towards Europe, but a first
branching into an Albanian group, followed by a second branching with Greek
and Armenian in one subgroup, seems less plausible to us.
Overall, our results seem to indicate that it is important to use several hierarchical
clustering algorithms and to analyze common subgroupings. In the case
of tumor samples this is indeed the most relevant outcome, as
we do not have any guarantee that the samples share a common ancestor.
The question "Which tree is the best one?" might actually be highly irrelevant
to the real problem at hand, as the consensus of these trees seems to be
the most important outcome. Results of a number of other clustering algorithms
on these datasets (which we were unable to show here for reasons of
space) indicate that more research on robust algorithmic methods is needed
for molecular subtype classification in cancer, and that validating the
methodology on different problem settings is highly beneficial for its development.
References
1. Cotta, C., Moscato, P.: A memetic-aided approach to hierarchical clustering from
distance matrices: Application to phylogeny and gene expression clustering. Biosystems
71 (2003) 75-97
2. Merz, P., Freisleben, B.: Fitness landscapes, memetic algorithms, and greedy operators
for graph bipartitioning. Evolutionary Computation 8 (2000) 61-91
3. Battiti, R., Bertossi, A.: Differential greedy for the 0-1 equicut problem. In: Proc.
of DIMACS Workshop on Network Design: Connectivity and Facilities (1997)
4. Festa, P., Pardalos, P., Resende, M.G.C., Ribeiro, C.C.: Randomized heuristics for
the MAX-CUT problem. Optimization Methods and Software 7 (2002) 1033-1058
5. Wu, B., Chao, K.M., Tang, C.: Approximation and exact algorithms for constructing
minimum ultrametric trees from distance matrices. Journal of Combinatorial
Optimization 3 (1999) 199-211
6. Cotta, C.: Scatter search with path relinking for phylogenetic inference. European
Journal of Operational Research 169 (2006) 520-532
7. Wang, J., Shan, H., Shasha, D., Piel, W.: TreeRank: A similarity measure for nearest
neighbor searching in phylogenetic databases. In: Proceedings of the 15th International
Conference on Scientific and Statistical Database Management, Cambridge,
MA, IEEE Press (2003) 171-180
8. Cotta, C.: On the application of evolutionary algorithms to the consensus tree
problem. In Gottlieb, J., Raidl, G., eds.: Evolutionary Computation in Combinatorial
Optimization. Volume 3248 of Lecture Notes in Computer Science, Berlin,
Springer-Verlag (2005) 58-67
78 P. Mahata et al.
9. Moilanen, A.: Searching for the most parsimonious trees with simulated evolution.
Cladistics 15 (1999) 39-50
10. Cotta, C., Moscato, P.: Inferring phylogenetic trees using evolutionary algorithms.
In Merelo, J., et al., eds.: Parallel Problem Solving From Nature VII. Volume 2439
of Lecture Notes in Computer Science. Springer-Verlag, Berlin (2002) 720-729
11. Mallory, J.P.: In Search of the Indo-Europeans: Language, Archaeology and Myth (1989)
12. Renfrew, C.: Time-depth in historical linguistics. The McDonald Institute for
Archaeological Research (2000) 413-439
13. Richards, M.: Tracing European founder lineages in the Near Eastern mtDNA pool.
Am. J. Hum. Genet. 67 (2000) 1251-1276
14. Semino, O., et al.: The genetic legacy of Paleolithic Homo sapiens in extant Europeans: a Y
chromosome perspective. Science 290 (2000) 1155-1159
15. Chikhi, L., Nichols, R., Barbujani, G., Beaumont, M.: Y genetic data support the
Neolithic demic diffusion model. Proc. Natl. Acad. Sci. (2002) 11008-11013
16. Gray, R.D., Atkinson, Q.D.: Language-tree divergence times support the Anatolian
theory of Indo-European origin. Nature 426 (2003) 435-439
17. Bryant, D., Filimon, F., Gray, R.: Untangling our past: Languages, trees, splits and
networks. In Mace, R., Holden, C., Shennan, S., eds.: The Evolution of Cultural
Diversity: Phylogenetic Approaches. UCL Press (2005) 69-85
18. Dyen, I., Kruskal, J.B., Black, P.: An Indo-European classification: A lexicostatistical
experiment. Transactions of the American Philosophical Society, New Ser.
82 (1992) 1-132
19. Cavalli-Sforza, L.: Genes, peoples, and languages. Proceedings of the National
Academy of Sciences of the United States of America 94 (1997) 7719-7724
20. Ross, D.T., Scherf, U., Eisen, M., Perou, C., Rees, C., Spellman, P., Iyer, V., Jeffrey,
S., Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J.C., Lashkari, D., Shalon,
D., Myers, T., Weinstein, J.N., Botstein, D., Brown, P.: Systematic variation in
gene expression patterns in human cancer cell lines. Nature Genetics 24 (2000)
227-235
21. Cotta, C., Langston, M., Moscato, P.: Combinatorial and algorithmic issues for
microarray data analysis. In: Handbook of Approximation Algorithms and Metaheuristics.
Chapman and Hall (2005)
22. Hourani, M., Mendes, A., Berretta, R., Moscato, P.: A genetic signature for Parkinson's
disease using rodent brain gene expression. In Keith, J., ed.: Bioinformatics.
Humana Press (2006)
23. Ferraresi, V., Ciccarese, M., Zeuli, M., Cognetti, F.: Central nervous system as exclusive
site of disease in patients with melanoma: treatment and clinical outcome of two
cases. Melanoma Res. 15 (2005) 467-469
24. Marchetti, D., Denkins, Y., Reiland, J., Greiter-Wilke, A., Galjour, J., Murry,
B., Blust, J., Roy, M.: Brain-metastatic melanoma: a neurotrophic perspective.
Pathology Oncology Research 9 (2003) 147-158
25. Buell, J., Gross, T., Alloway, R., Trofe, J., Woodle, E.: Central nervous system
tumors in donors: Misdiagnosis carries a high morbidity and mortality. Transplantation
Proceedings 37 (2005) 583-584
26. Perou, C.M., Jeffrey, S.S., Rijn, M., Rees, C.A., Eisen, M.B., Ross, D.T., Pergamenschikov,
A., Williams, C.F., Zhu, S.X., Lee, J.C.F., Lashkari, D., Shalon, D.,
Brown, P.O., Botstein, D.: Distinctive gene expression patterns in human mammary
epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA 96 (1999) 9212-9217
Robust SVM-Based Biomarker Selection with
Noisy Mass Spectrometric Proteomic Data
1 Introduction
Feature selection (FS) for classification can be formulated as a combinatorial
optimization problem: finding the feature set that maximizes the predictive performance
of the classifier trained from these features. FS is a major research topic
in supervised learning and data mining [10, 16, 12]. For the sake of learning
performance, it is highly desirable to discard irrelevant features prior to learning,
especially when the number of available features significantly outnumbers
the number of samples, as in biomedical studies. Because of its computational
intractability, the FS problem has been tackled by means of heuristic algorithms
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 7990, 2006.
c Springer-Verlag Berlin Heidelberg 2006
80 E. Marchiori et al.
based on statistics and machine learning [10, 20, 22]. Biological experiments with
laboratory technologies like microarrays and mass spectrometry generate
data with a very high number of variables (features), in general much
larger than the number of samples. Therefore FS is a fundamental step in
the analysis of this type of data [27]. Ideally, one would like to detect potential
biomarkers and biomarker patterns that both highly discriminate diseased from
healthy samples and are biologically interpretable. However, as substantiated in
recent publications like [19, 3, 21], the reliability and reproducibility of results depend
on the particular way samples are handled [26], on the instability of the
laboratory technology, as well as on the specific techniques employed in the
computational analysis.
In this paper we consider FS for classification with MS proteomic data from
sera. Various machine learning and statistical techniques for feature selection
have been applied to proteomic data (e.g., [15, 17, 8, 13, 5, 6, 7]) in order to detect
potential tumor biomarkers for (early) cancer diagnosis (clinical proteomics). A
summary of the actual challenges and a critical assessment of clinical proteomics can
be found, e.g., in [21].
Here we propose a new method for FS with MS proteomic data. The goal is to
identify potential biomarker patterns that not only highly discriminate diseased
and healthy samples, but are also robust with respect to perturbation of the data.
The method consists of three main steps. First, a popular filter feature selection
algorithm, RELIEF, is used as pre-processing in order to reduce the number of
considered features. Next, multiple runs of a linear SVM are considered, where at
each run a perturbed training set is used, obtained by changing the class label
of one support vector. Each run generates a large subset of selected features.
The frequency (over the runs) with which features are selected is used to choose the
most robust ones, namely those with the highest frequency. Finally, the resulting
features are transformed into feature intervals by considering the ordering of
the features, where neighbouring features refer to peptides of similar masses.
The method generates a subset of feature intervals, where both the number
of intervals and the number of features are automatically selected. These intervals describe
potential biomarker patterns.
We experimentally analyze the performance of the method on a real-life
dataset with controlled insertion of noisy samples (long storage time samples)
and relevant features (spiked molecules) [26]. The results indicate that the
method performs robust feature selection, selecting features corresponding
to m/z measurements near (the average of the m/z values of the peaks of) the
spiked molecules, and misclassifying only one noisy sample (with long storage
time).
2 Background
This section describes in brief the Machine Learning techniques we use in the
proposed feature selection method.
with one constraint for each training sample x_i. Usually the dual form of the
optimization problem is solved:

    min_α (1/2) Σ_{i=1..m} Σ_{j=1..m} α_i α_j y_i y_j (x_i · x_j) − Σ_{i=1..m} α_i

such that 0 ≤ α_i ≤ C for all i, and Σ_{i=1..m} α_i y_i = 0. SVM requires O(m^2)
storage and O(m^3) time to solve. The resulting decision function f(x) = w · x + b
has weight vector w = Σ_{k=1..m} α_k y_k x_k. Samples x_i for which α_i > 0 are
called support vectors, since they uniquely define the maximum margin hyperplane.
Samples with α_i = C are misclassified.
Maximizing the margin allows one to minimize bounds on the generalization error.
Because the size of the margin does not depend on the input dimension, SVMs
are robust with respect to data with a high number of features. However, SVMs
are sensitive to the presence of (potential) outliers (cf. [11] for an illustrative
example), due to the regularization term for penalizing misclassification (which
depends on the choice of C).
SVMFS
%input: training set X, number of features
to be selected M
%output: subset Selected of M features
train linear classifier with SVM on X;
score features using the squared value of
the weights of the classifier;
Selected = M features with highest score;
return Selected;
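As an illustration, the scoring rule in the SVMFS pseudocode (square the weights of the trained linear classifier and keep the M highest-scoring features) can be sketched in Python. The weight vector is assumed to come from an already trained linear SVM; the function name is ours, not the paper's:

```python
def svmfs_select(weights, M):
    # Score each feature by the squared weight the linear classifier
    # assigns to it, then return the indices of the M highest-scoring
    # features (in ascending index order).
    scores = [w * w for w in weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:M])
```

For instance, with weights [0.1, -2.0, 0.5, 1.5] and M=2 the selected indices are [1, 3], since the squared weights 4.0 and 2.25 dominate.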
RELIEF
RELIEF [23, 14] is a filter-based feature ranking algorithm that assigns a score
to features based on how well the features separate training samples from their
nearest neighbours from the same and from the opposite class.
The algorithm iteratively constructs a weight vector, which is initially equal
to zero. At each iteration, RELIEF selects one sample, adds to the weight the
difference between that sample and its nearest sample from the opposite class
(called the nearest miss), and subtracts the difference between that sample and its
nearest neighbour from the same class (called the nearest hit). The iterative process
terminates when all training samples have been considered. The resulting weight
of a feature is divided by its range of values (computed using only the training
set). Subsampling can be used to improve efficiency in case of a large training
set. The pseudo-code of the RELIEF algorithm used in our experiments is given
below.
RELIEF
%input: training set X
%output: Ranking of features
nr_feat = total number of features;
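A minimal executable sketch of RELIEF as described above (accumulate |sample − nearest miss| − |sample − nearest hit| per feature, then divide by the feature's range) might look as follows. This is our own toy reconstruction, not the authors' implementation, and it omits the subsampling option:

```python
def relief(X, y):
    # X: list of numeric feature vectors, y: class labels.
    n, d = len(X), len(X[0])

    def dist(a, b):
        # Squared Euclidean distance used to find nearest hit/miss.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    w = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        nh = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        nm = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(d):
            w[f] += abs(X[i][f] - X[nm][f]) - abs(X[i][f] - X[nh][f])
    # Normalize each weight by the feature's range over the training set.
    for f in range(d):
        rng = max(x[f] for x in X) - min(x[f] for x in X)
        if rng > 0:
            w[f] /= rng
    return w
```

On a toy set where the first feature separates the classes and the second is constant, the first feature receives a large positive weight while the second scores zero.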
Fig. 1. A MALDI-TOF MS spiked sample for one person at storage duration time t=0
(top) and t=48 (bottom): the x-axis contains (identifiers of) the m/z values of peptides
and the y-axis their concentration
4 The Method
FWI
%input: training set X
%number M of features to be selected by RELIEF
%FILTER step:
F = M features selected with RELIEF;
%WRAPPER step:
SV = set of support vectors obtained by training
SVM on X;
for x in SV
T = X with label of x changed;
F(x) = N features selected by SVMFS applied to T;
end;
count = maximum number of times that a feature
occurs in the sequence of F(x), x in SV;
W = features occurring count times in the sequence
of F(x), x in SV;
%INTERVAL step:
Cl = {C1, .., Cn} clustering of W, with
Ci = (w(1),.., w(ni))
w(1)<..< w(ni)
s.t. idx(w(j+1))-idx(w(j)) <= 2
for all j in [1..ni];
Int = {Int_1, .., Int_n} intervals from Cl, with
Int_i= {w in Features s.t. w >= min(Ci)
and w<= max(Ci)} for i in [1,n];
return Int;
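The INTERVAL step above can be sketched directly in Python. The clustering condition idx(w(j+1)) - idx(w(j)) <= 2 groups selected feature indices that are at most two positions apart, and each cluster is then expanded into a full index interval; the function below is our own illustration of that rule:

```python
def interval_step(selected_idx, gap=2):
    # Cluster sorted feature indices so that consecutive members of a
    # cluster differ by at most `gap`, mirroring the pseudocode's
    # idx(w(j+1)) - idx(w(j)) <= 2 condition.
    idx = sorted(selected_idx)
    clusters, current = [], [idx[0]]
    for i in idx[1:]:
        if i - current[-1] <= gap:
            current.append(i)
        else:
            clusters.append(current)
            current = [i]
    clusters.append(current)
    # Expand each cluster into the full index interval [min(C), max(C)].
    return [list(range(c[0], c[-1] + 1)) for c in clusters]
```

For instance, the selected indices {3, 4, 6, 20, 21} yield the two intervals [3..6] and [20..21].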
Let us explain the steps performed by FWI in a bit more detail.
FWI starts by skimming the number of features by applying the Filter
(F) step. Here RELIEF is employed in order to select M features. In the F
step one typically retains about M=5000 m/z measurements from the initial
22572.
In the Wrapper (W) step, robust wrapper-based feature selection is performed
using the features that passed the Filter selection. Here, the support vectors
of an SVM trained on all the features are used for perturbing the data.
More precisely, multiple runs of SVMFS are performed,
where at each run the class label of one support vector is changed. Each run
generates a set of N features (typical value N=1000). The resulting sequence
of feature sets is then considered. The maximum number count of times a
feature occurs in the sequence is computed, and all features occurring count
times in the sequence are selected.
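The count-based selection at the end of the W step can be sketched as follows (our own helper; `runs` stands for the feature sets F(x) returned by the perturbed SVMFS runs):

```python
from collections import Counter

def most_frequent_features(runs):
    # Count in how many runs each feature was selected, then keep the
    # features attaining the maximum count over all runs.
    counts = Counter(f for run in runs for f in set(run))
    top = max(counts.values())
    return sorted(f for f, c in counts.items() if c == top)
```

For example, over the runs [[1, 2, 3], [2, 3], [3, 5]] only feature 3 occurs in every run, so it alone is selected.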
Finally, in the Interval (I) step, the selected m/z features are segmented as
follows. The sequence of features in W, ordered by m/z values, is segmented
5 Numerical Experiments
1. Wrapper feature selection (W), obtained by applying the W step of the FWI.
2. Wrapper Interval (WI), obtained by applying steps W followed by I.
3. Feature Wrapper (FW), obtained by applying steps F followed by W.
4. The complete Feature Wrapper Interval algorithm FWI.
Because of the small size of the data, LOOCV is used for comparing the performance
of the four algorithms (cf., e.g., [9]). At each leave-one-out run, all but
one element of the data is used as the training set, and the left-out element is used
for testing the predictive performance of the resulting classifier. Observe that the
96 samples of the considered dataset are not independent of one another, as
required for a correct application of LOOCV, because they are generated from 8
persons, and neither the 6 different storage times nor the spiking guarantees the
production of independent samples. Nevertheless, the corresponding bias introduced
in the LOOCV procedure affects the results of each algorithm, hence the
results can still be used for comparing the performance of the algorithms. However,
such bias possibly affects the estimation of the generalization error.
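The LOOCV protocol can be sketched generically; `fit_predict` is a hypothetical classifier interface (train on the given set, predict the held-out sample), not part of the paper:

```python
def loocv_accuracy(X, y, fit_predict):
    # Hold out each sample in turn, train on the remaining n-1 samples,
    # and report the fraction of held-out samples predicted correctly.
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if fit_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

Any classifier fitting the `fit_predict` signature (e.g., a nearest-neighbour rule) can be plugged in.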
Table 1 summarizes the LOOCV performance results of the experiments. We use
accuracy, sensitivity and specificity as quality measures for comparing the algorithms.
Other measures, like the AUC (Area Under the ROC Curve), could be used;
as illustrated, e.g., in [1], there is good agreement between accuracy and AUC
as to the ranking of the performance of classification algorithms.
The results indicate that the predictive performance improves across the four
algorithms, with the best accuracy achieved by FWI.
The misclassified samples over all the LOOCV runs have a storage time of
24 or 48 hours, indicating that longer storage time negatively affects the classification
of proteomic samples. Algorithm W misclassifies a total of 5 samples, of
Table 1. Results: LOOCV sensitivity, specificity and accuracy (with standard deviation
between brackets)
[Figure: number of LOOCV runs (y-axis, 0-100) in which each m/z measurement was selected; the spiked molecules are marked on the m/z axis.]
Fig. 3. A typical m/z selection generated by FWI, the corresponding values of the
mean spiked and normal profile at the selected m/z values, and the spiked molecules
measurements in the proximity of spiked molecules are more often selected over the
LOOCV runs, except for m/z measurements in the neighbourhood of 4000 and
5000, which do not correspond to m/z measurements of spiked molecules. In the
absence of additional information (e.g. tandem mass spectra yielding sequence
tags) it is difficult to know what these peak values represent. One possibility
is that the higher-molecular-weight spiked molecules are partially degraded in
serum, and these peaks are proteolytically cleaved peptides from larger proteins
(due to long storage time at room temperature) in the sample itself. However,
this possibility has not yet been examined in depth. Figure 3 shows a typical set
of m/z measurements generated by FWI, and the mean value of the intensities
of the spiked and normal samples for the selected m/z measurements.
In conclusion, the results indicate that FWI performs robust m/z selection, where
the selected features are close to the spiked molecules, and the misclassification
error is close to zero, with misclassification only of noisy (that is, long storage
time) samples.
6 Conclusion
forming robust feature selection in the presence of noisy samples which perturb
the data and negatively affect sample classification. The W and I steps of the
proposed FWI method provide heuristics for tackling this problem.
This issue is related to broader questions about the reproducibility and validity
of results in discovery-based omics research [21, 24]. In a special session on
genomics in a recent issue of Science, an essay entitled "Getting the noise out of
gene arrays" noted that "[t]housands of papers have reported results obtained
using gene array ... But are these results reproducible?" [19]. A controversy about
the reproducibility and validity of results from MS proteomic data is ongoing [3, 21],
and the path towards achieving such ambitious goals still appears long.
Acknowledgments
We would like to thank the anonymous referees for their constructive comments,
and Jan Rutten for stimulating discussions.
References
1. A. P. Bradley. The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition, 30(6):1145-1159, 1997.
2. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, 2000.
3. E.P. Diamandis. Analysis of serum proteomic patterns for early cancer diagnosis:
Drawing attention to potential problems. Journal of the National Cancer Institute,
96(5):353-356, 2004.
4. H.J. Issaq et al. SELDI-TOF MS for diagnostic proteomics. Anal. Chem.,
75(7):148A-155A, 2003.
5. E.F. Petricoin et al. Serum proteomic patterns for detection of prostate cancer.
Journal of the National Cancer Institute, 94(20):1576-1578, 2002.
6. E.F. Petricoin et al. Use of proteomic patterns in serum to identify ovarian cancer.
The Lancet, 359(9306):572-577, 2002.
7. Y. Qu et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization
mass spectral serum profiles discriminates prostate cancer from
noncancer patients. Clin. Chem., 48(10):1835-1843, 2002.
8. W. Zhu et al. Detection of cancer-specific markers amid massive mass spectral
data. PNAS, 100(25):14666-14671, 2003.
9. T. Evgeniou, M. Pontil, and A. Elisseeff. Leave one out error, stability, and generalization
of voting combinations of classifiers. Mach. Learn., 55(1):71-97, 2004.
10. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, 3:1157-1182, 2003. Special Issue on variable and feature selection.
11. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification
using support vector machines. Mach. Learn., 46(1-3):389-422, 2002.
12. George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset
selection problem. In International Conference on Machine Learning, pages 121-129, 1994.
13. K. Jong, E. Marchiori, M. Sebag, and A. van der Vaart. Feature selection in
proteomic pattern data with support vector machines. In IEEE Symposium on
Computational Intelligence in Bioinformatics and Computational Biology, 2004.
14. K. Kira and L. A. Rendell. The feature selection problem: Traditional methods
and a new algorithm. In Tenth National Conference on Artificial Intelligence, pages
129-134, 1992.
15. J. Li, Z. Zhang, J. Rosenzweig, Y.Y. Wang, and D.W. Chan. Proteomics and
bioinformatics approaches for identification of serum biomarkers to detect breast
cancer. Clinical Chemistry, 48(8):1296-1304, 2002.
16. H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection:
a Data Mining Perspective. International Series in Engineering and Computer
Science. Kluwer, 1998.
17. H. Liu, J. Li, and L. Wong. A comparative study on feature selection and classification
methods using gene expression profiles and proteomic patterns. Genome
Informatics, 13:51-60, 2002.
18. E. Marchiori, N.H.H. Heegaard, M. West-Nielsen, and C.R. Jimenez. Feature selection
for classification with proteomic data of mixed quality. In Proceedings of
the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and
Computational Biology, pages 385-391, 2005.
19. E. Marshall. Getting the noise out of gene arrays. Science, 306:630-631, 2004.
Issue 5696.
20. Il-Seok Oh, Jin-Seon Lee, and Byung-Ro Moon. Local search-embedded genetic
algorithms for feature selection. In 16th International Conference on Pattern
Recognition (ICPR'02). IEEE Press, 2002.
21. D.F. Ransohoff. Lessons from controversy: Ovarian cancer screening and serum
proteomics. Journal of the National Cancer Institute, 97:315-319, 2005.
22. M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and A.K. Jain. Dimensionality
reduction using genetic algorithms. IEEE Transactions on Evolutionary
Computation, 4(2):164-171, 2000.
23. L. A. Rendell and K. Kira. A practical approach to feature selection. In International
Conference on Machine Learning, pages 249-256, 1992.
24. S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays:
a multiple random validation strategy. The Lancet, 365(9458):488-492, 2005.
25. V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
26. M. West-Nielsen, E.V. Hogdall, E. Marchiori, C.K. Hogdall, C. Schou, and N.H.H.
Heegaard. Sample handling for mass spectrometric proteomic investigations of
human sera. Analytical Chemistry, 77(16):5114-5123, 2005.
27. E.P. Xing. Feature selection in microarray analysis. In A Practical Approach to
Microarray Data Analysis. Kluwer Academic, 2003.
28. L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based
filter solution. In ICML, pages 856-863, 2003.
On the Use of Variable Complementarity for
Feature Selection in Cancer Classification
1 Introduction
Statisticians and data miners are used to building predictive models and inferring
dependencies between variables on the basis of observed data. However, in many
emerging domains, like bioinformatics, they are facing datasets characterized
by a very large number of features (up to several thousand), a large amount of
noise, non-linear dependencies and, often, only several hundred samples. In
this context, the detection of functional relationships as well as the design of
effective classifiers appears to be a major challenge. Recent technological advances,
like microarray technology, have made it possible to simultaneously interrogate
thousands of genes in a biological specimen. It follows that two classification
problems commonly encountered in bioinformatics are how to distinguish between
tumor classes and how to predict the effects of medical treatments on
the basis of microarray gene expression profiles. If we formalize this prediction
task as a supervised classification problem, we realize that we are facing a problem
where the number of input variables, represented by the number of genes,
is huge (several thousand) and the number of samples, represented by
the clinical trials, is very limited (several tens). Because of well-known
numerical and statistical accuracy issues, it is typically necessary to reduce the
number of variables before starting a learning procedure. Furthermore, selecting
features (i.e. genes) can increase the intelligibility of a model while at the
92 P.E. Meyer and G. Bontempi
This paper deals with supervised multi-class classification. We will assume either
that all the variables are discrete or that they can be made discrete by a
quantization step. Hereafter, we will denote by Y the discrete output random
variable representing the class and by X the multi-dimensional discrete input
random variable.
In qualitative terms, feature selection boils down to selecting, among a set of
potential variables, the most relevant ones. At the same time, it is appealing
that the selected variables not be redundant. The notions of relevance and redundancy
can be made more formal thanks to the use of dependency measures
[2, 9].
Let us first introduce some concepts of information theory:
Definition 3. Relevance.
Consider three random variables X, Y and Z and their joint probability distribution
p_{X,Y,Z}(x, y, z). If H(Y|Z) = 0, then the variable relevance of X to Y
given Z, denoted by r(X; Y|Z), is zero. Otherwise, if H(Y|Z) ≠ 0, the variable
relevance of X to Y given Z is defined as:

    r(X; Y|Z) = I(X; Y|Z) / H(Y|Z)    (3)
two variables are independent. We remark also that a negative value of the
complementarity can be taken as a measure of the redundancy of a pair of
variables for the task of predicting Y .
Example 2 is an illustration of complementarity between Xi and XS, since
in that case:

    I(Xi,S ; Y) > I(Xi ; Y) + I(XS ; Y)    (5)

(the term I(Xi ; Y) being approximately 0)
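Inequality (5) can be checked numerically on the classic XOR case, where each input is individually useless but the pair determines Y; this small calculation (our own, with plug-in entropy estimates) illustrates the kind of situation Example 2 describes:

```python
from collections import Counter
from math import log2

def entropy(samples):
    # Empirical Shannon entropy of a list of hashable symbols.
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def mi(a, b):
    # Empirical mutual information I(A;B) = H(A) + H(B) - H(A,B).
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]          # XOR target

pair = list(zip(x1, x2))
gain = mi(pair, y) - mi(x1, y) - mi(x2, y)   # left minus right of (5)
```

Here mi(x1, y) and mi(x2, y) are both zero while mi(pair, y) is one bit, so the gain is strictly positive, as inequality (5) requires.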
without any gain in terms of bias reduction. On the contrary, two variables can
be complementary to the output (i.e. highly relevant together) while each of
them appears to be poorly relevant once taken individually (see Example 2 or
Example 3). As a consequence, these variables could be badly ranked or, worse,
eliminated by the ranking filter.
At each step, this method selects the variable with the best relevance-redundancy trade-off. This selection criterion is fast and efficient. At step d of the forward search, the search algorithm computes n − d evaluations, where each evaluation requires the estimation of (d + 1) bivariate densities (one for each already selected variable plus one with the output). It has been shown in [7] that the MRMR criterion is an optimal first-order approximation of the conditional relevance criterion. Furthermore, MRMR avoids the estimation of multivariate densities by using multiple bivariate densities.
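A minimal sketch of this greedy forward step, assuming the required bivariate mutual informations have been estimated beforehand (the matrix-based interface is our own, not from the paper):

```python
def mrmr_forward(mi_with_y, mi_between, n_select):
    """Greedy mRMR forward selection: at each step pick the unselected variable
    maximizing u_i - z_i, i.e. relevance minus mean redundancy with the
    selected set.  mi_with_y[i] = I(X_i; Y); mi_between[i][j] = I(X_i; X_j)."""
    n = len(mi_with_y)
    selected = [max(range(n), key=lambda i: mi_with_y[i])]  # most relevant first
    while len(selected) < n_select:
        def score(i):
            redundancy = sum(mi_between[i][j] for j in selected) / len(selected)
            return mi_with_y[i] - redundancy
        rest = [i for i in range(n) if i not in selected]
        selected.append(max(rest, key=score))
    return selected
```

For example, a variable that is almost a copy of the first selected one (high I(X_i; X_j)) is penalized and a weaker but non-redundant variable is chosen instead.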
Note that, although the method aims to address the issue of redundancy between variables through the term z_i, it is not able to take into account the complementarities between variables. This could be ineffective in situations like the one of Example 2 where, although the set {X_i, X_S} has a large relevance to Y, we observe that
1. the redundancy term z_i is large, due to the redundancy of X_i and X_S;
2. the relevance term u_i is small, since X_i is not relevant to Y.
it will not necessarily select a variable complementary with the already selected variables. Indeed, a variable that has a high complementarity with an already selected variable will be characterized by a high conditional mutual information with that variable, but not necessarily by a high minimal conditional information (see Example 3).
In terms of complexity, note that at the dth step of the forward search, the algorithm computes n − d evaluations, where each evaluation following CMIM requires the estimation of d trivariate densities (one for each previously selected variable).
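CMIM's max-min selection rule can be sketched in the same style, again over hypothetical precomputed quantities (this interface is our own illustration):

```python
def cmim_forward(mi_y, cmi, n_select):
    """Greedy CMIM forward selection: pick the variable with the largest
    minimal conditional mutual information min_{j in S} I(X_i; Y | X_j).
    mi_y[i] = I(X_i; Y); cmi[i][j] = I(X_i; Y | X_j)."""
    n = len(mi_y)
    selected = [max(range(n), key=lambda i: mi_y[i])]  # most relevant first
    while len(selected) < n_select:
        rest = [i for i in range(n) if i not in selected]
        # A variable fully redundant with any selected one has min CMI near 0.
        selected.append(max(rest, key=lambda i: min(cmi[i][j] for j in selected)))
    return selected
```

The min over the selected set is exactly why a complementary variable is not favored: one large I(X_i; Y | X_j) does not raise the minimal value.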
In the following section, we propose a new criterion that deals more explicitly with complementary variables.
The theorem expresses that the mutual information of a subset S and a target variable Y is lower bounded by the average of the same quantity computed over all the sub-subsets X_{S−i} of X_S.
In the following, we will use this theorem as theoretical support for the following heuristic: without any additional knowledge of how subsets of d variables should combine, the most promising subset is a combination of the best performing subsets of (d − 1) variables.
Replacing the right-hand term again by its lower bound, and recursing until we reach subsets of two variables:

arg max_S Σ_{i∈S} Σ_{j∈S} I(X_{S(i,j)}; Y) ≈ arg max_S Σ_{i∈S} Σ_{j∈S} I(X_{i,j}; Y)    (14)
SR(X; Y) = I(X; Y) / H(X, Y)    (15)
The main advantage of using this criterion for selecting variables is that a variable complementary to an already selected one has a much higher probability of being selected than with the other criteria. As this criterion measures symmetrical relevance on all the combinations of two variables (double input) of a subset, we have called it the double input symmetrical relevance (DISR) criterion. At the dth step of the forward search, the search algorithm computes n − d evaluations, where each evaluation requires the estimation of d trivariate densities (one for each previously selected variable). In the next section, the DISR criterion is assessed and compared with the other heuristic search filters discussed in Section 3.
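The DISR forward search can be sketched as follows, assuming the pairwise symmetrical relevances SR(X_{i,j}; Y) have been estimated beforehand (a simplification of our own, not the authors' implementation):

```python
def disr_forward(sr_pair, n_select, first):
    """Greedy DISR forward selection: pick the variable maximizing the summed
    double-input symmetrical relevance with the already selected set.
    sr_pair[i][j] = SR(X_{i,j}; Y) = I(X_{i,j}; Y) / H(X_{i,j}, Y)."""
    n = len(sr_pair)
    selected = [first]
    while len(selected) < n_select:
        rest = [i for i in range(n) if i not in selected]
        # A variable complementary to a selected one contributes a large
        # pairwise term, so it can win even if weak on its own.
        selected.append(max(rest, key=lambda i: sum(sr_pair[i][j] for j in selected)))
    return selected
```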
Table 1 summarizes the methods discussed so far in terms of several key aspects: the capacity to select relevant variables, to avoid redundancy, to select complementary features, and to avoid the computation of multivariate densities.
5 Experiments
A major topic in bioinformatics is how to build accurate classifiers for cancer diagnosis and prognosis on the basis of microarray genomic signatures. This task can be considered a challenging benchmark for feature selection algorithms [7], given the high feature-to-sample ratio.
We use eleven public domain multi-class datasets from [14] (Table 2) in order
to assess and compare our technique with the state-of-the-art approaches.
In our experimental framework, each continuous variable has been discretized into equal-width intervals. The number of intervals for each input is based on the Scott criterion, see [15]. All the datasets are partitioned into two parts: a selection set and a test set (each of size N/2). We compare the filter based on DISR with the four state-of-the-art approaches discussed above: a ranking algorithm and three filters based on the Relevance criterion, the Minimum Redundancy Maximum Relevance criterion, and the Conditional Mutual Information Maximization (CMIM) criterion.
Table 3. Statistically significant (0.1 level and 0.2 level of a paired two-tailed t-test) wins, ties or losses versus best-first search combined with the DISR criterion.

                    0.1 level                      0.2 level
W/T/L vs. DISR   Rank   REL    CMIM   MRMR     Rank   REL    CMIM   MRMR
3-NN             0/9/2  1/8/2  1/7/3  2/7/2    1/5/5  1/7/3  1/7/3  2/7/2
Naive Bayes      3/8/0  2/7/2  1/8/2  1/9/1    4/7/0  3/6/2  1/8/2  1/9/1
SVM              2/9/0  2/6/3  2/6/3  2/8/1    2/6/3  3/4/4  2/5/4  3/5/3
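The equal-width discretization driven by the Scott criterion, as described in the experimental setup above, can be sketched as follows; we assume here the standard Scott bin width h = 3.49·σ·N^(−1/3), which may differ in detail from the authors' implementation:

```python
from math import ceil

def scott_bins(values):
    """Number of equal-width intervals suggested by Scott's criterion,
    using the bin width h = 3.49 * sigma * n**(-1/3) assumed here."""
    n = len(values)
    mean = sum(values) / n
    sigma = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    h = 3.49 * sigma * n ** (-1 / 3)
    span = max(values) - min(values)
    return max(1, ceil(span / h)) if h > 0 else 1

def discretize(values, k):
    """Map each value to one of k equal-width intervals labeled 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    return [min(k - 1, int((v - lo) / width)) for v in values]
```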
the criterion in a wider range of domains, and (iii) the impact of the discretization method on the efficiency of the feature selection algorithms.
References
1. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003) 1157–1182
2. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97 (1997) 273–324
3. Blum, A., Langley, P.: Selection of relevant features and examples in machine learning. Artificial Intelligence 97 (1997) 245–271
4. Provan, G., Singh, M.: Learning Bayesian networks using feature selection. In: Fifth International Workshop on Artificial Intelligence and Statistics (1995) 450–456
5. Duch, W., Winiarski, T., Biesiada, J., Kachel, A.: Feature selection and ranking filters. In: International Conference on Artificial Neural Networks (ICANN) and International Conference on Neural Information Processing (ICONIP) (2003) 251–254
6. Bell, D.A., Wang, H.: A formalism for relevance and its application in feature subset selection. Machine Learning 41 (2000) 175–195
7. Peng, H., Long, F.: An efficient max-dependency algorithm for gene selection. In: 36th Symposium on the Interface: Computational Biology and Bioinformatics (2004)
8. Fleuret, F.: Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research 5 (2004) 1531–1555
9. Yu, L., Liu, H.: Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research 5 (2004) 1205–1224
10. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley, New York (1990)
11. Yang, H., Moody, J.: Feature selection based on joint mutual information. In: Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), Rochester, New York, ICSC (1999)
12. Kojadinovic, I.: Relevance measures for subset variable selection in regression problems based on k-additive mutual information. Computational Statistics and Data Analysis 49 (2005) 1205–1227
13. Meyer, P.: Information theoretic filters for feature selection. Technical report (548), Université Libre de Bruxelles (2005)
14. Web: http://www.tech.plym.ac.uk/spmc/bioinformatics/microarray cancers.html
15. Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley (1992)
16. R-project: www.r-project.org
Comparison of Neural Network Optimization
Approaches for Studies of Human Genetics
Center for Human Genetics Research, Department of Molecular Physiology & Biophysics,
Vanderbilt University, Nashville, TN, 37232, USA
{motsinger, dudek, hahn, ritchie}@chgr.mc.vanderbilt.edu
http://chgr.mc.vanderbilt.edu/ritchielab
1 Introduction
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 103–114, 2006.
© Springer-Verlag Berlin Heidelberg 2006
104 A.A. Motsinger et al.
architecture should be for a given dataset and, as a result, a cumbersome trial-and-error approach is often taken.
Previously, we implemented a neural network optimized via genetic programming
(GPNN)[22]. Optimizing neural network architecture with genetic programming was
first proposed by Koza and Rice[23]. We implemented and extended the GPNN
approach for use in association studies of human disease. The goal of GPNN was to
improve upon the trial-and-error process of choosing an optimal architecture for a
pure feed-forward back-propagation neural network[22]. GPNN optimizes the inputs from a large pool of variables, the weights, and the connectivity of the network, including the number of hidden layers and the number of nodes in the hidden layer. Thus, the algorithm automatically generates an optimal neural network architecture for a given dataset. This gives it an advantage over the traditional back-propagation NN, in
which the inputs and architecture are pre-specified and only the weights are
optimized.
GPNN was a successful endeavor: it has shown high power to detect gene-gene interactions in both simulated and real data[24]. Still, there are limitations to evolving NN using this type of machine learning algorithm. First, the GP implementation used for GPNN builds binary expression trees, so each node is connected to exactly two nodes at the level below it in the network. This did not seem to hinder the power of GPNN in smaller datasets[22,24-26]; however, we hypothesize that more complex data will require more complicated NN, and two connections per node may not be sufficient. Second, changes to GPNN require altering and recompiling source code, which hinders flexibility and increases development time. For example, GPNN is limited in the depth of the network, i.e. the number of levels the network can contain. Again, this was not a hindrance for GPNN in the previous power studies[22,24-26], but it may not scale well for more complex datasets.
In response to these concerns, we developed a NN approach for detecting gene-gene interactions that uses grammatical evolution (GE) as a strategy for the optimization of the NN architecture. Grammatical evolution is a variation on genetic programming that addresses some of the drawbacks of GP[27,28]. GE has been shown to be effective in evolving Petri nets, discrete dynamical systems that look structurally similar to neural networks, used to model biochemical systems[29]. By using a grammar, substantial changes can be made to the way NN are constructed through simple manipulations of the text file where the grammar is specified. No changes to source code are required, and thus no recompiling. The end result is a decrease in development time and an increase in flexibility. These two features are important improvements over GPNN.
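The codon-to-production mapping at the heart of GE can be illustrated with a toy grammar; the grammar below is purely hypothetical and is not the grammar actually used by GENN:

```python
# Toy grammar: each non-terminal maps to a list of alternative productions.
GRAMMAR = {
    "<expr>": [["<var>"], ["(", "<expr>", "+", "<expr>", ")"]],
    "<var>": [["x1"], ["x2"]],
}

def ge_map(codons, start="<expr>", max_wraps=2):
    """Map an integer chromosome to a derivation: each codon, taken modulo the
    number of productions for the leftmost non-terminal, picks the rule.
    The chromosome wraps when exhausted, as in standard GE."""
    out, stack, i, wraps = [], [start], 0, 0
    while stack:
        sym = stack.pop(0)
        if sym not in GRAMMAR:
            out.append(sym)        # terminal symbol: emit it
            continue
        if i == len(codons):       # wrap the chromosome
            i, wraps = 0, wraps + 1
            if wraps > max_wraps:
                return None        # mapping failed to terminate
        choices = GRAMMAR[sym]
        stack = list(choices[codons[i] % len(choices)]) + stack
        i += 1
    return "".join(out)
```

Changing the grammar text changes what structures can be evolved, with no change to the mapping code itself, which is the flexibility argument made above.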
Preliminary studies with GPNN show that evolutionary optimization is more powerful than traditional approaches for detecting gene-gene interactions. We have shown that the GPNN strategy is able to model and detect gene-gene interactions in the absence of main effects in many epistasis models, with higher power than back-propagation NN[22], stepwise logistic regression[26], and a stand-alone GP[25]. GPNN has also detected interactions in a real data analysis of Parkinson's disease[24]. Like GPNN, the grammatical evolution optimized neural network (GENN) optimizes the inputs from a pool of variables, the synaptic weights, and the architecture of the network. The algorithm automatically selects the
Comparison of NN Optimization Approaches for Studies of Human Genetics 105
2 Methods
Fig. 1. An overview of the GENN method. The steps correspond to the description of the
method in Section 2.1.
Classification error is calculated on the set of training data as the fitness metric.
As mentioned earlier, the dataset is divided into cross-validation subsets. GENN is
optimized using a training set of data, and a subset of the data is left out as a test set
to evaluate the final solution and prevent over-fitting. Classification error refers to
the number of samples in the training dataset that are incorrectly classified by the
network. Prediction error, which refers to the number of samples in the test dataset
that are incorrectly classified using the GENN model generated during training, is
used for final model selection. The overall goal of the learning process is to find
genetic models that accurately classify the data. Cross-validation is used in
conjunction with this learning process to produce a model that not only can
accurately classify the data at hand, but can predict on future, unseen data.
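The interplay of classification (training) error and prediction (test) error over cross-validation folds can be sketched generically; `train_fn` and `predict_fn` below are placeholders, not the actual GENN routines:

```python
def cross_validation_errors(data, labels, k, train_fn, predict_fn):
    """For each of k cross-validation folds, fit on the other folds and report
    (classification error on training data, prediction error on held-out data).
    train_fn and predict_fn stand in for GENN's training and evaluation."""
    folds = [list(range(i, len(data), k)) for i in range(k)]
    errors = []
    for held in folds:
        train_idx = [i for i in range(len(data)) if i not in held]
        model = train_fn([data[i] for i in train_idx], [labels[i] for i in train_idx])
        def error_on(idx):
            wrong = sum(predict_fn(model, data[i]) != labels[i] for i in idx)
            return wrong / len(idx)
        errors.append((error_on(train_idx), error_on(held)))
    return errors
```

A model whose classification error is far below its prediction error is over-fitting, which is exactly the pattern discussed for BPNN in the Results section.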
GPNN uses genetic programming to determine the optimal architecture for neural networks. Both the method and the software have been described previously[22]. GPNN was applied as presented in the references. Like GENN, models are trained on classification error, and a cross-validation consistency and prediction error are determined for the final model. Unlike GENN, previous studies have shown cross-validation consistency to be the best criterion for final model selection. For this study, however, the results are identical whether cross-validation consistency or prediction error is used for final model selection. All configuration parameters are identical to those in GENN, allowing a direct comparison of GPNN and GENN.
Epistasis, or gene-gene interaction, occurs when the phenotype under study cannot be
predicted from the independent effects of any single gene, but is the result of
combined effects of two or more genes[34]. It is increasingly accepted that epistasis
plays an important role in the genetic architecture of common genetic diseases[35].
Penetrance functions are used to represent epistatic genetic models in this simulation
study. Penetrance defines the probability of disease given a particular genotype
combination by modeling the relationship between genetic variations and disease risk.
For our power studies, we simulated case-control data using two different epistasis
models exhibiting interaction effects in the absence of main effects. Models that lack
main effects are desirable because they challenge the method to find gene-gene
interactions in a complex dataset. Also, a method able to detect purely interactive terms is likely to identify main effects as well.
Table 1. Multilocus penetrance functions used to simulate case-control data exhibiting gene-gene interactions in the absence of main effects. Penetrance is calculated as p(disease | genotype). Marginal penetrance values (not shown) are all equal within each model.

a. Model 1                 b. Model 2
      BB    Bb    bb             BB    Bb    bb
AA    0     .10   0        AA    0     0     .10
Aa    .10   0     .10      Aa    0     .50   0
aa    0     .10   0        aa    .10   0     0
To evaluate the power of the above methods for detecting gene-gene interactions,
we simulated case-control data using two different two-locus epistasis models in which
the functional loci are single nucleotide polymorphisms (SNPs). The first model was
initially described by Li and Reich[36], and later by Moore [37]. This model is based
on the nonlinear XOR function[38] that generates an interaction effect in which high
risk of disease is dependent on inheriting a heterozygous genotype (Aa) from one locus
or a heterozygous genotype (Bb) from a second locus, but not both. The high risk
genotypes are AaBB, Aabb, AABb, and aaBb, all with penetrance of 0.1 (Table 1a).
The proportion of the trait variance that is due to genetics, or heritability, of this model
is low. Specifically, as calculated according to Culverhouse et al[39], the heritability is
0.053. The second model was initially described by Frankel and Schork[40], and
later by Moore[37]. In this second model, high risk of disease is dependent on
inheriting exactly two high risk alleles (A and/or B) from two different loci. In this
model, the high risk genotypes were AAbb, AaBb, and aaBB, with penetrance of 0.1,
0.5, and 0.1 respectively (Table 1b). The heritability of this model is 0.051.
These models were selected because they exhibit interaction effects in the absence of any main effects when genotypes are generated according to Hardy-Weinberg proportions (in both models, p = q = 0.5). For both models, we simulated 100 datasets
consisting of 200 cases and 200 controls, each with 10 SNPs, 2 of which were
functional. The data were generated using the software package described by Moore
et al[37]. Dummy variable encoding was used for each dataset, where n-1 dummy
variables were used for n levels[19]. Data were formatted with rows representing
individuals and columns representing dummy-encoded genotypes with the final
column representing disease status. Though the biological relevance of these models is uncertain, they represent a worst-case scenario for the detection of epistasis. If a method performs well under such minimal effects, it is expected to perform well in identifying gene-gene interactions in models with greater effect sizes.
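The simulation recipe above (Hardy-Weinberg genotypes, penetrance-driven affection status, fixed case and control counts) can be sketched as follows; the sampling scheme is our reading of the setup, not the actual software of Moore et al.:

```python
import random

# Penetrance table for Model 1 (Table 1a): p(disease | two-locus genotype).
PENETRANCE = {
    ("AA", "BB"): 0.0, ("AA", "Bb"): 0.1, ("AA", "bb"): 0.0,
    ("Aa", "BB"): 0.1, ("Aa", "Bb"): 0.0, ("Aa", "bb"): 0.1,
    ("aa", "BB"): 0.0, ("aa", "Bb"): 0.1, ("aa", "bb"): 0.0,
}

def draw_genotype(rng, upper, lower, p=0.5):
    """One biallelic genotype under Hardy-Weinberg proportions (here p = q = 0.5)."""
    n_upper = sum(rng.random() < p for _ in range(2))
    return {2: upper + upper, 1: upper + lower, 0: lower + lower}[n_upper]

def simulate(rng, n_cases, n_controls):
    """Sample individuals until both groups are filled; affection status is
    drawn with probability given by the penetrance of the genotype."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g = (draw_genotype(rng, "A", "a"), draw_genotype(rng, "B", "b"))
        affected = rng.random() < PENETRANCE[g]
        if affected and len(cases) < n_cases:
            cases.append(g)
        elif not affected and len(controls) < n_controls:
            controls.append(g)
    return cases, controls
```

By construction, every sampled case carries one of the high-risk genotype combinations of Table 1a, while marginal single-locus effects average out.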
We used all four methods (GENN, GPNN, BPNN and random search) to analyze both
epistasis models. The configuration parameter settings were identical for GENN,
GPNN and the random search (without evolutionary operators for random search): 10
demes, migration every 25 generations, population size of 200 per deme, 50
generations, crossover rate of 0.9, and a reproduction rate of 0.1. For GENN and the
random search, prediction error was used for final model selection as described in
Section 2.1. For GPNN, cross-validation consistency was calculated for each model
and the final model was selected based on this metric (as described in [22]). Sensible
initialization was used in all three algorithms.
For our traditional BPNN analysis, all possible inputs were used, and the significance of each input was calculated from its input relevance R_i, where R_i is the sum of squared weights for the ith input divided by the sum of squared weights for all inputs[38]. Next, we performed 1000 permutations of the data to determine what input relevance was required to consider a SNP significant in the BPNN model (data not shown). The empirical range of critical relevance values for determining significance was 10.43% - 11.83%, based on the permutation testing experiments.
Cross validation consistency was also calculated and an empirical cutoff for the cross
validation consistency was determined through permutation testing (using 1000
randomized datasets). This cutoff was used to select SNPs that were functional in the
epistasis model for each dataset. A cross validation consistency of greater than 5 was
required to be statistically significant.
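The input relevance measure can be sketched directly from its definition; the weight layout below (first-layer weights only, one row per input) is an assumption of this sketch:

```python
def input_relevance(weights):
    """R_i = (sum of squared weights leaving input i) /
    (sum of squared weights over all inputs).
    weights[i] is the list of weights from input i to the hidden nodes."""
    totals = [sum(w * w for w in row) for row in weights]
    grand = sum(totals)
    return [t / grand for t in totals]
```

The relevances sum to 1, so the permutation-derived cutoff (here, roughly 10-12%) can be compared against each input's share directly.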
Power for all analyses is reported under each epistatic model as the number of times the algorithm identified the correct functional loci (both with and without false positive loci) over 100 datasets. Final model selection was performed for each method based on optimum performance in previous studies[22].
If either one or both of the dummy variables representing a single SNP was selected,
that locus was considered present in the model.
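The power computation described above can be sketched as a simple count over datasets (the function and argument names are our own):

```python
def power(selected_per_dataset, functional, allow_false_positives=False):
    """Percentage of datasets in which every functional locus appears in the
    final model; optionally also require no extra (false positive) loci."""
    hits = 0
    for selected in selected_per_dataset:
        found = set(functional) <= set(selected)
        clean = allow_false_positives or set(selected) <= set(functional)
        hits += found and clean
    return 100.0 * hits / len(selected_per_dataset)
```

The strict variant corresponds to Table 2 (no false positives allowed), the relaxed variant to Table 4.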
3 Results
Table 2 lists the power results from all four algorithms. Because of the small size of
the dataset, all four algorithms performed reasonably well. With a limited number of
SNPs, these learning algorithms can effectively become exhaustive searches. As
hypothesized, GENN and GPNN both out-performed the traditional BPNN and the
random search. The performance of GENN and GPNN was consistent, as expected.
This demonstrates that GENN will work at least as well as GPNN, while allowing for
faster development and more flexible use. Because the number of variables included
in the dataset was small, the random search performed reasonably well, as the trial
and error approach had a limited number of variables to search through. As Table 2
shows, there is a large gap in the performance of the random search between Model 1
and Model 2. This is probably due to the difference in the difficulty inherent in the
two models. The power of BPNN to detect Model 2 was also lower than for Model 1,
indicating a difference in the challenge of modeling the different models.
Additionally, the stochastic nature of a random algorithm can lead to erratic results, as
shown here. These erratic power results further demonstrate the utility of an
evolutionary approach to optimizing NN architecture. The random search even
outperformed BPNN for Model 1, probably because the random search was able to search through more possible NN architectures than BPNN, and so was able to find the correct model more often in these simulations.
Table 3 summarizes the average classification error (training error) and prediction
error (testing error) for the four algorithms evaluated using the 100 datasets for each
model. Due to the probabilistic nature of the functions used in the data simulation,
Table 2. Power (%) for each method on both gene-gene interaction models (with no false
positive loci)
Epistasis Model GENN GPNN BPNN Random Search
1 100 100 53 87
2 100 100 42 10
Table 3. Results for all four algorithms, showing average classification error (CE) and prediction error (PE) for each epistasis model. The range of observed errors is listed below each average.
Table 4. Power (%) for each method to detect functional SNPs in both gene-gene interaction
models (with or without false positive loci)
there is some degree of noise present in the data. The average error inherent in the
100 Model 1 datasets is 24%, and the error in the Model 2 datasets is 18%. As the
table shows, GENN, GPNN and the random search all had error rates closely
reflecting the real amount of noise in the data. Those three algorithms had lower
prediction errors than BPNN, while BPNN had lower classification errors for both
models. The lower classification error is due to model over-fitting. The other three
algorithms, including even the random search, are better able to model gene-gene
interaction and develop NN models that can generalize to unseen data. While the
random search did not demonstrate the same degree of over-fitting experienced with
BPNN, the averages reported here disguise the fact that the range of errors across
datasets was very high. While the average errors for the random search look similar
to those for GPNN and GENN, the range of observed values was much larger,
implying that the random search also tends to over-fit. We speculate that GENN and
GPNN are not over-fitting because, while these methods are theoretically able to
build a tree with all variables included, neither method is building a fully connected
NN using all variables.
To further understand the behavior of the algorithms, and in particular the seemingly inconsistent combination of the random search's average errors and its low relative power, we calculated power for each functional locus as the proportion of times a SNP was included in the final model (regardless of what other SNPs were present in the model) for all datasets. Table 4 lists the results of this power calculation for all
four methods. The tendency of the random search algorithm to over-fit models
becomes clear in the comparisons of Tables 2-4. The random search finds the
functional SNPs, but includes many false positive loci, which is highly undesirable
since the end goal of association analysis is variable selection. The same false
positive trend holds true for BPNN.
4 Discussion
We have demonstrated that grammatical evolution is a valid approach for optimizing
the architecture of NN. We have shown that GENN outperforms both a random
search and traditional BPNN for analysis of simulated epistasis genetic models.
Because of the small number of SNPs in this study, both BPNN and the random
search NN had modest power, as one would expect. With a small number of
variables to test, examining low-order combinations is relatively easy. As the number
of variables increases, the resultant combinatorial explosion limits the feasibility of
trial and error approaches. Moody[33] demonstrates that enumeration of all possible
NN architectures is impossible, and there is no way to know if a globally optimal
architecture is selected. The performance gap between the evolutionarily optimized
algorithms and the trial and error approaches is expected to widen as the number of
variables increases.
Additionally, we show that GENN performs at least as well as GPNN. Because of
the limited number of noise variables, and the fact that these two methods reached
the upper limit of power, a more extensive comparison between GENN and GPNN
needs to be performed. Power will need to be studied in a range of datasets,
demonstrating a wide range of heritability values and number of noise variables.
Because of the greater flexibility of GE compared to GP, we predict that GENN will
out-perform GPNN on more complex datasets.
Because the end-goal of these methods is variable selection, performance has been
evaluated according to this metric in this study. In future studies, it would be
interesting to evaluate the architectures of the NN that are constructed by these
different methods to further evaluate the differences in their performance. Other measures of model fitness, such as sensitivity and specificity, could also be dissected in evaluating the performance of GENN.
Also, while simulated data are necessary in method development, the eventual
purpose of this method is for the analysis of real data. GENN will need to be tested
on real case-control genetic data.
This study introduces a novel computational method and demonstrates that GENN
has the potential to mature into a useful software tool for the analysis of gene-gene
interactions associated with complex clinical endpoints. The flexibility and ease of development afforded by utilizing a grammar will aid in additional studies with this method.
Acknowledgements
This work was supported by National Institutes of Health grants HL65962, GM62758,
and AG20135. We would also like to thank David Reif for his helpful comments on
the manuscript.
References
1. Kardia S, Rozek L, Hahn L, Fingerlin T, Moore J: Identifying multilocus genetic risk
profiles: a comparison of the multifactor data reduction method and logistic regression.
Genetic Epidemiology 2000.
2. Moore JH, Williams SM: New strategies for identifying gene-gene interactions in
hypertension. Ann Med 2002, 34: 88-95.
3. Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to
quantitative traits. Genet Epidemiol 2004, 27: 141-152.
4. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for
detecting gene-gene and gene-environment interactions. Bioinformatics 2003, 19: 376-382.
5. Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L: Sequence analysis using logic
regression. Genet Epidemiol 2001, 21 Suppl 1: S626-S631.
6. Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common
human diseases. Hum Hered 2003, 56: 73-82.
7. Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to
identify multilocus genotypic partitions that predict quantitative trait variation. Genome
Res 2001, 11: 458-470.
8. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF et al.: Multifactor-
dimensionality reduction reveals high-order interactions among estrogen-metabolism
genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138-147.
9. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for
detecting gene-gene interactions in the presence of genotyping error, missing data,
phenocopy, and genetic heterogeneity. Genet Epidemiol 2003, 24: 150-157.
10. Tahri-Daizadeh N, Tregouet DA, Nicaud V, Manuel N, Cambien F, Tiret L: Automated
detection of informative combined effects in genetic association studies of complex traits.
Genome Res 2003, 13: 1952-1960.
11. Zhu J, Hastie T: Classification of gene microarrays by penalized logistic regression.
Biostatistics 2004, 5: 427-443.
12. Schalkoff R: Artificial Neural Networks. New York: McGraw-Hill Companies Inc.; 1997.
13. Bhat A, Lucek PR, Ott J: Analysis of complex traits using neural networks. Genet
Epidemiol 1999, 17: S503-S507.
14. Curtis D, North BV, Sham PC. Use of an artificial neural network to detect association
between a disease and multiple marker genotypes. Annals of Human Genetics 65, 95-107.
2001.
15. Li W, Haghighi F, Falk C: Design of artificial neural network and its applications to the
analysis of alcoholism data. Genet Epidemiol 1999, 17: S223-S228.
16. Lucek P, Hanke J, Reich J, Solla SA, Ott J: Multi-locus nonparametric linkage analysis of
complex trait loci with neural networks. Hum Hered 1998, 48: 275-284.
17. Lucek PR, Ott J: Neural network analysis of complex traits. Genet Epidemiol 1997, 14:
1101-1106.
18. Marinov M, Weeks D: The complexity of linkage analysis with neural networks. Human
Heredity 2001, 51: 169-176.
19. Ott J. Neural networks and disease association. American Journal of Medical Genetics
(Neuropsychiatric Genetics) 105[60], 61. 2001.
20. Saccone NL, Downey TJ, Jr., Meyer DJ, Neuman RJ, Rice JP: Mapping genotype to
phenotype for linkage analysis. Genet Epidemiol 1999, 17 Suppl 1: S703-S708.
21. Sherriff A, Ott J: Applications of neural networks for gene finding. Advances in Genetics
2001, 42: 287-297.
22. Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH: Optimization of neural network
architecture using genetic programming improves detection and modeling of gene-gene
interactions in studies of human diseases. BMC Bioinformatics 2003, 4: 28.
23. Koza J, Rice J: Genetic generation of both the weights and architecture for a neural
network. IEEE Transactions 1991, II.
24. Motsinger AA, Lee S, Mellick G, Ritchie MD: GPNN: Power studies and applications of
a neural network method for detecting gene-gene interactions in studies of human disease.
BMC Bioinformatics 2005, in press.
25. Bush WS, Motsinger AA, Dudek SM, Ritchie MD: Can neural network constraints in GP
provide power to detect genes associated with human disease? Lecture Notes in Computer
Science 2005, 3449: 44-53.
26. Ritchie MD, Coffey CS, Moore JH: Genetic programming neural networks: A bioinformatics tool for human genetics. Lecture Notes in Computer Science 2004, 3102: 438-448.
27. O'Neill M, Ryan C. Grammatical Evolution. IEEE Transactions on Evolutionary
Computation 5, 349-357. 2001.
28. O'Neill M, Ryan C. Grammatical evolution: Evolutionary automatic programming in an
arbitrary language. 2003. Boston, Kluwer Academic Publishers.
29. Moore JH, Hahn LW: Petri net modeling of high-order genetic systems using grammatical
evolution. BioSystems 2003, 72: 177-186.
30. Mitchell M. An introduction to genetic algorithms. 1996. Cambridge, MIT Press.
31. Cantu-Paz E. Efficient and accurate parallel genetic algorithms. 2000. Boston, Kluwer
Academic Publishers.
32. Utans J, Moody J. Selecting neural network architectures via the prediction risk
application to corporate bond rating prediction. 1991. Los Alamitos, California, IEEE
Press. Conference Proceedings on the First International Conference on Artificial
Intelligence Applications on Wall Street.
33. Moody J: Prediction risk and architecture selection for neural networks. In From Statistics to Neural Networks: Theory and Pattern Recognition Applications. Edited by Cherkassky V, Friedman JH, Wechsler H. NATO ASI Series F, Springer-Verlag; 1994.
34. Fahlman SE, Lebiere C: The Cascade-Correlation Learning Architecture. Carnegie
Mellon University; 1991. Masters from School of Computer Science.
35. Templeton A. Epistasis and complex traits. Wade, M., Broadie, B III, and Wolf, J. 41-57.
2000. Oxford, Oxford University Press. Epistasis and the Evolutionary Process.
36. Li W, Reich J. A complete enumeration and classification of two-locus disease models.
Hum.Hered. 50, 334-349. 2000.
37. Moore J, Hahn L, Ritchie M, Thornton T, White B. Application of genetic algorithms to
the discovery of complex models for simulation studies in human genetics. Langdon, WB,
Cantu-Paz, E, Mathias, K, Roy, R, Davis, D, Poli, R, Balakrishnan, K, Honavar, V,
Rudolph, G, Wegener, J, Bull, L, Potter, MA, Schultz, AC, Miller, JF, Burke, E, and
Jonoska, N. Proceedings of the Genetic and Evolutionary Algorithm Conference. 1150-
1155. 2002. San Francisco, Morgan Kaufman Publishers.
38. Anderson J: An Introduction to Neural Networks. Cambridge, Massachusetts: MIT Press;
1995.
39. Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models
displaying no main effect. Am J Hum Genet 2002, 70: 461-471.
40. Frankel W, Schork N. Who's afraid of epistasis? Nat.Genet. 14, 371-373. 1996.
Obtaining Biclusters in Microarrays
with Population-Based Heuristics
1 Introduction
One of the research fields which aroused the greatest interest towards the
end of the 20th century, and whose future is expected to be equally promising
in the 21st century, is the study of an organism's genome, or genomics.
By way of a brief history, it was Gregor Mendel who defined the gene concept
in his research as the element where information about hereditary characteristics
is to be found. At a later stage, Avery, MacLeod and McCarty demonstrated that
an organism's genetic information stems from a macromolecule called deoxyri-
bonucleic acid (DNA); it was later discovered that genetic information located
in specific areas of the DNA (the genes) enabled protein synthesis; this was fol-
lowed by the sequencing of the genomes of certain organisms (including humans).
These results and their future consequences awakened a great deal of interest among scientists.
Since proteins are responsible for carrying out cellular functions, cellular func-
tioning depends on the proteins synthesized from the genes, and is determined
by the regulation of protein synthesis (gene expression) and the control of protein
activity.
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 115-126, 2006.
© Springer-Verlag Berlin Heidelberg 2006
116 P. Palacios, D. Pelta, and A. Blanco
The process whereby the approximately 30,000 genes in the human genome
are expressed as proteins involves two steps: 1) the DNA sequence is transcribed
into messenger RNA (mRNA) sequences; and 2) the mRNA sequences are in turn
translated into the amino acid sequences which comprise the proteins.
Measuring mRNA levels provides a detailed view of the subset of genes
which are expressed in different types of cells under different conditions. Mea-
suring these levels of gene expression under different conditions helps explore the
following aspects (among others) in greater depth: a) the function of the genes,
b) how several genes interact, and c) how different experimental treatments
affect cell function.
Recent advances in array-based methods enable the expression levels of thou-
sands of genes to be measured simultaneously. These measurements are obtained
by quantifying the hybridization of mRNA with a cDNA array, or with oligonucleotide
probes fixed on a solid substrate.
Technological advances in the development of cDNA arrays produce an
amazingly large quantity of data relating to the transcription levels of
thousands of genes under specific conditions. For knowledge extraction (the function
of the genes, the implication of certain genes in specific illnesses, etc.), researchers
use consolidated methodologies, and specific ones are being developed. However,
although the results obtained so far keep getting better, there is still room for
improvement.
In a gene expression matrix, the rows represent genes and the columns represent
samples, and each cell contains a number which characterizes the expression level
of a particular gene in a particular sample.
Like most experimental techniques, microarrays measure the final objective
indirectly through another physical quantity, for example the relative abundance
of mRNA through the fluorescence intensity of the spots in an array.
Microarray-based techniques are still a long way from providing the exact
quantity of mRNA in a cell. The measurements are inherently relative: essen-
tially, we can compare the expression levels of one gene in different samples, or of
different genes in one sample, so a suitable normalization is necessary to enable
comparisons between data. Moreover, since microarray-based expression values
can vary considerably with the reliability and limitations of a particular
microarray technique for certain types of measurements, data normalization is a
key issue to consider.
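Because the measurements are relative, a per-array normalization step of this kind is typically applied before any analysis. The following is a minimal sketch of one common choice (log transform plus per-sample median centering); it is an illustration only, not the specific procedure used for the data set in this paper.

```python
import numpy as np

def normalize_expression(matrix):
    """Log-transform and median-center each sample (column) so that
    expression values become comparable across arrays.

    One common normalization choice; the exact procedure used by the
    authors is not specified here.
    """
    logged = np.log2(matrix + 1.0)        # compress the dynamic range
    medians = np.median(logged, axis=0)   # per-sample (per-column) medians
    return logged - medians               # center each column at 0

# Toy matrix: 3 genes (rows) x 2 samples (columns)
data = np.array([[100.0, 200.0],
                 [ 50.0, 400.0],
                 [ 25.0, 100.0]])
norm = normalize_expression(data)
print(np.median(norm, axis=0))  # each column's median is now 0
```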
Once we have constructed the gene expression matrix, the second step is to
analyze it and attempt to obtain information from it.
In this work we shall use the biclustering concept introduced by Hartigan [6]
to capture the degree of similarity between a subset of elements within a subset
of attributes. Cheng and Church applied this technique to DNA microarrays [3].
The advantage of biclustering as opposed to traditional clustering when ap-
plied to the field of microarrays lies in its ability to identify groups of genes
that show similar activity patterns under a specific subset of the experimental
conditions. Therefore, biclustering approaches are the key technique to use when
only a subset of the genes participates in the cellular process of interest, or when
such a process is active only under a subset of the experimental conditions [7].
Recent years have shown an increasing interest in this field. We refer the
interested reader to the excellent survey by Madeira and Oliveira [7].
Definition: Given a gene expression matrix D of size n × m, a bicluster is a pair (I, J),
where I ⊆ {1, ..., n} is a subset of the rows of D and J ⊆ {1, ..., m} is a sub-
set of the columns of D, such that the genes Gi with i ∈ I behave in a similar way.
Definition: Given a bicluster (I, J), the residue rij of an element dij of the
bicluster is calculated according to Equation 1:

    rij = dij - diJ - dIj + dIJ    (1)

where diJ is the mean of row i within the bicluster, dIj is the mean of column j
within the bicluster, and dIJ is the mean over all elements of the bicluster.
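The residue follows directly from the submatrix means. The sketch below assumes the standard Cheng-Church definition rij = dij - diJ - dIj + dIJ and a NumPy expression matrix; the function names are ours.

```python
import numpy as np

def residue_matrix(D, rows, cols):
    """Residues r_ij = d_ij - d_iJ - d_Ij + d_IJ of the bicluster
    (rows, cols) taken from expression matrix D."""
    B = D[np.ix_(rows, cols)]                  # the bicluster submatrix
    row_means = B.mean(axis=1, keepdims=True)  # d_iJ
    col_means = B.mean(axis=0, keepdims=True)  # d_Ij
    all_mean = B.mean()                        # d_IJ
    return B - row_means - col_means + all_mean

def mean_squared_residue(D, rows, cols):
    """Mean squared residue: the quality measure minimized by the
    biclustering heuristics (lower is better)."""
    R = residue_matrix(D, rows, cols)
    return float((R ** 2).mean())

# A perfectly "additive" bicluster (d_ij = a_i + b_j) has residue 0
D = np.add.outer(np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0, 30.0]))
print(mean_squared_residue(D, [0, 1], [0, 1]))  # → 0.0
```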
them, which in particular would consist in exchanging k bits of a given solution
with each other. If this exchange leads to an improvement, a new k-opt
movement is undertaken, and this process continues until no improvement is
made within a certain number of trials.
We constructed 4 different memetic schemes, for values of k ∈ {2, 3, 4, 5}.
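A minimal sketch of the k-opt movement just described, assuming a binary-encoded solution and a generic cost function to be minimized; the encoding and acceptance details of the paper's memetic schemes are not spelled out here, so this is illustrative only.

```python
import random

def k_opt(solution, k, cost, max_failures=50, rng=None):
    """k-opt local search on a bit string: flip k randomly chosen bits,
    keep the move if it lowers `cost`, and stop after `max_failures`
    consecutive non-improving trials.

    `cost` is a user-supplied objective (e.g. the mean squared residue
    of the bicluster the bit string encodes); its exact form in the
    paper's memetic schemes is an assumption here.
    """
    if rng is None:
        rng = random.Random()
    best = list(solution)
    best_cost = cost(best)
    failures = 0
    while failures < max_failures:
        candidate = list(best)
        for i in rng.sample(range(len(candidate)), k):
            candidate[i] ^= 1            # flip bit i
        c = cost(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
            failures = 0                 # improvement found: keep searching
        else:
            failures += 1
    return best, best_cost

# Toy run: minimize the number of ones in a bit string with 2-opt moves
best, best_cost = k_opt([1] * 12, 2, sum, rng=random.Random(0))
print(best_cost)
```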
Tabu Search (TS): a very basic TS strategy is used. The neighborhood
is sampled using the mutation operator described for the GA.
1. Choose the M best individuals in the current population, which form the
set D^Se_{l-1}.
2. Estimate the probability distribution of the current population using Eq. 7.
3. Generate a new population of N individuals from the probability distribution,
which is stored in the set D^Se_l.
where δ_j(X_i = x_i | D^Se_{l-1}) is 1 for the j-th individual in M if the value of gene X_i
is equal to x_i, and 0 in any other case.
The outline of our UMDA-based algorithm can be seen in Algorithm 1.
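Steps 1-3 above correspond to the univariate marginal distribution algorithm (UMDA): each gene's marginal probability is estimated independently from the M best individuals and the next population is sampled from those marginals. A compact sketch on a toy maximization problem (OneMax), with a clamp on the probabilities added as our own implementation choice, not taken from the paper:

```python
import numpy as np

def umda(fitness, n_bits, pop_size=50, n_select=20, generations=30, seed=0):
    """Minimal UMDA sketch: estimate per-bit marginal probabilities from
    the n_select best individuals, then sample the next population from
    them. The clamp on p is an illustrative diversity-preserving choice."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        best = pop[np.argsort(scores)[-n_select:]]    # the M best (maximizing)
        p = best.mean(axis=0)                         # marginal of each gene
        p = np.clip(p, 0.05, 0.95)                    # avoid premature fixation
        pop = (rng.random((pop_size, n_bits)) < p).astype(int)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)]

# OneMax toy problem: maximize the number of ones in a 20-bit string
best = umda(lambda ind: int(ind.sum()), n_bits=20)
print(best.sum())
```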
4 Experiments
In order to evaluate and analyze the implemented algorithms, the yeast expres-
sion data set has been used, comprising 17 experiments (columns) on 2,900 genes
(rows). This gene expression data set was chosen because it is one of the most
widely used in the literature, thereby enabling our results to be compared.
The results obtained with the proposed tools have been compared using the
algorithm proposed by Cheng and Church [3] as a reference algorithm.
Following an empirical study, the following parameters were fixed: a popula-
tion of 200 individuals and 200 generations. The crossover and mutation proba-
bilities were fixed at 0.8 and 0.6, respectively.
Each algorithm was executed 30 times, and the seed of the random number
generator was changed in each execution. At the end of each execution, the best
bicluster found was recorded.
4.1 Results
This section includes the results obtained by the proposed algorithms and the
reference algorithm. We also performed a random sampling of biclusters in order
to check the expected residue value for a random bicluster.
Table 1 shows the corresponding residues for the best and worst biclusters,
the mean and standard deviation over the 30 executions, and also an indication
of the time taken by each execution. The average sizes of the resulting biclusters
are also displayed. Results for the reference algorithm were taken over 100
biclusters, while 10^4 random solutions were generated.
Figure 1 shows the histograms of the residue (a), rows (b) and columns (c) of
the best biclusters found by every algorithm. Results of k-opt for k ≥ 3 were
omitted for visualization purposes (although they were extremely similar to those
of 2-opt). These figures give us a global view of the whole set of best biclusters
available, enabling us to quickly see the region of the search space covered
by each strategy.
Several aspects can be highlighted from the table and the histograms. The first
is that the GA achieves very good residue values despite the simplicity of its
component definitions. On average, the biclusters generated are quite big (825
rows and 8 columns).
The joint use of the GA with local search, leading to different memetic schemes,
does not seem to be useful. The simpler local search schemes (k-opt) increase
the residue, while the average number of rows in the bicluster is significantly
reduced. In this case, and given the standard deviation values, it is clear that
there is a convergence problem which is independent of the value of k. Moreover,
no statistical differences were detected between different values of k.
As the complexity of the local search is increased, from 2-opt to TS, the
residue values also increase. This becomes clear if we look at the corresponding
histogram. In turn, the sizes of the biclusters obtained are slightly higher than
those obtained by k-opt.
The EDA strategy achieves the lowest average residue value, while the corre-
sponding bicluster sizes are about 200 rows and 8 columns. The average residue
for the reference algorithm is almost three times higher than that of the EDA,
while the biclusters are smaller on average (although the number of columns is
increased from 8 to 12). The reference algorithm presents the highest variability
in residue, number of rows and number of columns (this is clearly seen in the
histograms).
In order to determine which differences in residue are significant, a
Kruskal-Wallis test was performed. The test reveals significant differences among
the median residues of the algorithms. Then, pairwise non-parametric
Mann-Whitney U tests were performed, and they confirm that the differences among
the algorithms are significant.
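In practice one would run these tests with a statistics library (e.g. scipy.stats), but the Mann-Whitney U statistic used for the pairwise comparisons is simple enough to sketch with the standard library alone. The samples below are hypothetical, not the paper's data.

```python
def midranks(values):
    """Ranks of `values` (1-based), with tied values given the average
    of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """U statistic of the Mann-Whitney test comparing samples a and b.
    U = 0 means complete separation; U near len(a)*len(b)/2 means
    heavy overlap. A p-value would be looked up from U and the sizes."""
    ranks = midranks(list(a) + list(b))
    r_a = sum(ranks[:len(a)])                    # rank sum of sample a
    u_a = r_a - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)

# Hypothetical residue samples for two algorithms: fully separated
print(mann_whitney_u([4.9, 5.3, 5.1], [14.8, 15.5, 15.1]))  # → 0.0
```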
Another element to analyze is the computational time used. The fastest al-
gorithm, and the best one in bicluster quality, is the EDA, followed by the GA.
The addition of local search to the GA increases the computational time consider-
ably without a corresponding gain in bicluster quality. Cheng and Church's
algorithm has quite acceptable running times.
In Fig. 2 we plot the volume of the best solutions (calculated as rows ×
columns) against their residue for the GA, the EDA and the reference algorithm.
This plot reveals several things. The first is the existence of many alternative
solutions with similar residue; see for example the vertical range for residues
between 5 and 10. This fact is most notable for the GA and the EDA. Second, we
can see that the reference algorithm obtains solutions that are very similar in
size yet very different in residue. Both facts clearly encourage the use of
population-based techniques, which can easily maintain a set of solutions with
diverse characteristics.
Table 1. Statistical values of the residue and size of the biclusters found by every algo-
rithm. Time is in minutes per run.
Fig. 1. Histograms of the residue (a), row (b) and column (c) values of the best biclusters
obtained by every algorithm. Results of k-opt for k ≥ 3 were omitted. TS and 2-opt
stand for the memetic algorithm using the corresponding local search scheme.
Fig. 2. Volume (rows × columns) vs. residue of the best biclusters found by the EDA,
the GA and the reference algorithm
[Fig. 3 appears here: average residue (y-axis, 0 to 25) vs. iterations (x-axis, 0 to 200)
for EDA, GA, GA + Tabu and GA + 2-opt.]
Finally, Fig. 3 shows the evolution of the average residue over time for
typical runs of the EDA, the GA, GA+TS and GA+2-opt. The fastest convergence is
achieved by the EDA; given that no restart mechanism is included, the EDA
stagnates during the last half of the available time. The curves for the GA and
GA+2-opt are quite similar: both algorithms show a continuous but very slow
convergence. GA+TS is the worst method: it seems that the algorithm
cannot make the population converge. We have two hypotheses for these
behaviors: first, there may be a problem in the local search parameters;
second, a small change in the genotype can give rise to a big change in the
phenotype, and when this occurs the use of local search is not advisable. Both
situations are under study, but we suspect the second reason is the more relevant.
Obtaining Biclusters in Microarrays with Population-Based Heuristics 125
Acknowledgments
This work was supported in part by projects TIC2003-09331-C02-01 and
TIN2005-08404-C04-01 from the Spanish Ministry of Science and Technology.
References
1. S. Baluja and R. Caruana. Removing the genetics from the standard genetic
algorithm. In A. Prieditis and S. Russel, editors, The Int. Conf. on Machine
Learning, pages 38-46. Morgan Kaufmann Publishers, San Mateo, CA, 1995.
2. S. Busygin, G. Jacobsen, and E. Kramer. Double conjugated clustering applied to
leukemia microarray data. In SIAM ICDM, Workshop on Clustering High Dimensional
Data, 2002.
3. Y. Cheng and G. Church. Biclustering of expression data. In 8th International Con-
ference on Intelligent Systems for Molecular Biology, pages 93-103, 2000.
4. G. R. Harik, F. G. Lobo, and D. E. Goldberg. The compact genetic algorithm.
IEEE-EC, 3(4):287, November 1999.
5. W. Hart, N. Krasnogor, and J. Smith, editors. Recent Advances in Memetic Algo-
rithms. Studies in Fuzziness and Soft Computing. Physica-Verlag, 2004.
6. J. Hartigan. Clustering Algorithms. John Wiley, 1975.
7. S. Madeira and A. Oliveira. Biclustering algorithms for biological data analysis:
A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics,
1(1):24-45, 2004.
1 Introduction
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 127-137, 2006.
© Springer-Verlag Berlin Heidelberg 2006
128 A.H.L. Porto and V.C. Barbosa
2 Suffix-Set Trees
$ possessing a function similar to the terminator used in suffix trees. Like the
suffix tree, our new data structure is also a rooted tree; it has edges labeled
by sequences of characters from C and nodes labeled by indices into some of
s1, ..., sk to mark suffix beginnings. We call it a suffix-set tree, and it has the
following properties:
The first character of the label on an edge connecting a node to one of its
children is a different character of C for each child.
Each nonempty suffix of every one of the k sequences is associated with at
least one leaf of the tree; conversely, each leaf of the tree is associated with
at least one nonempty suffix of some sequence (if more than one, then all
suffixes associated with the leaf have the same length). Thus, each leaf is
labeled by a set of the form {(i1, j1), ..., (iq, jq)} for some q ≥ 1, where (ia, ja),
for 1 ≤ a ≤ q, 1 ≤ ia ≤ k, and 1 ≤ ja ≤ n_ia, indicates that the suffix
s_ia[ja .. n_ia] of s_ia is associated with the leaf.
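The leaf-labelling bookkeeping just described can be illustrated by grouping the nonempty suffixes of a small set of sequences. Note that the actual suffix-set tree shares common prefixes along its edges and may associate suffixes with a leaf through a common representation rather than by exact text, so this is only a sketch of the label sets, not of the tree itself.

```python
from collections import defaultdict

def leaf_labels(sequences):
    """Group the nonempty suffixes of the given sequences by their text,
    returning {suffix: {(i, j), ...}} where the pair (i, j) means "the
    suffix of sequence i starting at position j" (both 1-based, matching
    the (i_a, j_a) notation above)."""
    labels = defaultdict(set)
    for i, s in enumerate(sequences, start=1):
        for j in range(1, len(s) + 1):
            labels[s[j - 1:]].add((i, j))   # suffix s[j..n_i], 1-based
    return dict(labels)

labels = leaf_labels(["aba", "ba"])
print(labels["ba"])  # the suffix "ba" occurs in both sequences
```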
Let v be a node of the tree. The label of v is the set {(i1, j1), ..., (iq, jq)} that
represents the q suffixes collectively associated with the leaves of the subtree
rooted at v. If c1 ... cr is the concatenation of the labels on the path from
the root of the tree to v, excluding if necessary the terminal character $,
then c1 ... cr is a common representation of all prefixes of length r of the
suffixes associated